# Data Exploration, Preprocessing, and Problem Framing
Machine learning is a method where a computer learns patterns from data so it can make predictions or decisions without being explicitly programmed for each task.

Generally, machine learning consists of using data to create a model and then fine-tuning it to make better predictions. The predictive model can then be used to make predictions on previously unseen data in order to perform specific tasks.

## Supervised learning
Supervised learning is a type of machine learning where the model learns using labeled data, meaning the correct answer for each example is already known.

The general flow is:
- the machine is provided input data, as well as the correct output for each input
- the algorithm makes predictions and compares them with the correct answers
- the difference (error) is used to adjust the algorithm
- this repeats many times until the machine performs well enough
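
The loop above can be sketched in a few lines. This is a minimal illustration, assuming a one-variable linear model trained by gradient descent on mean squared error; the toy data, the function name `train_linear`, and the learning rate are all made up for the example and are not the only way to do supervised learning.

```python
# Toy labelled data: each input x comes with its correct output y (here y = 2x + 1).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def train_linear(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y ~ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # 1. Make predictions with the current parameters.
        preds = [w * x + b for x in xs]
        # 2. Compare the predictions with the correct answers.
        errors = [p - y for p, y in zip(preds, ys)]
        # 3. Use the error to adjust the parameters.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    # 4. After many repetitions the parameters should fit the data well.
    return w, b

w, b = train_linear(xs, ys)
print(f"learned w={w:.3f}, b={b:.3f}")
```

With this data the learned parameters should end up close to the values used to generate the labels (`w` near 2, `b` near 1).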

## Unsupervised learning
Unsupervised learning is a type of machine learning where the algorithm learns patterns, structures, or relationships in data without any labeled output.

The machine is given unlabelled input data (with no corresponding output), and the algorithm attempts to interpret the data on its own, for example by grouping similar data points together.
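
As one illustration of grouping similar data points, here is a minimal sketch of k-means clustering on one-dimensional data. k-means is only one of many unsupervised techniques; the function name `kmeans_1d`, the sample points, and the starting centroids are assumptions chosen for the example.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Group 1-D points around centroids; no labels are used at any step."""
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

The algorithm is never told which points belong together; the two groups emerge from the distances between the points alone.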

## Reinforcement learning
Reinforcement learning is a type of machine learning where an agent (a computer program) learns by interacting with an environment and receiving feedback in the form of rewards or punishments.

The general flow is:
- the agent takes an action
- the environment gives feedback
  - positive feedback (reward) is given for good actions
  - negative feedback (punishment) is given for bad actions
- the agent learns from the feedback
  - actions that lead to rewards become more likely in the future
  - actions that lead to punishment become less likely

Reinforcement learning requires a balance between exploration (trying new/random actions) and exploitation (using what the agent already knows). If there is too much exploration, the agent's behaviour will be too random; if there is too much exploitation, the agent may miss better strategies. 
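
One common way to balance the two is an epsilon-greedy strategy: with a small probability `epsilon` the agent explores, otherwise it exploits its current best estimate. The sketch below assumes a very simple environment (a multi-armed bandit where each action yields a reward of +1 or a punishment of -1 with a fixed probability); `run_bandit` and the probabilities are hypothetical names and values for illustration.

```python
import random

def run_bandit(reward_probs, epsilon=0.1, steps=5000, seed=0):
    """Epsilon-greedy agent on a bandit; reward_probs[a] is P(reward) for action a."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)    # how often each action was taken
    values = [0.0] * len(reward_probs)  # running average feedback per action
    for _ in range(steps):
        if rng.random() < epsilon:
            # Exploration: try a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploitation: use the action that currently looks best.
            action = max(range(len(values)), key=lambda a: values[a])
        # The environment gives feedback: +1 (reward) or -1 (punishment).
        feedback = 1.0 if rng.random() < reward_probs[action] else -1.0
        # The agent learns: update the running average for that action.
        counts[action] += 1
        values[action] += (feedback - values[action]) / counts[action]
    return values, counts

values, counts = run_bandit([0.2, 0.5, 0.8])
best_action = values.index(max(values))
print(best_action, counts)
```

Setting `epsilon` close to 1 makes the agent behave almost randomly, while `epsilon=0` means it never tries anything beyond its first impressions.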

## Machine learning workflow
The general machine learning workflow involves:
1. problem framing: defining the problem in a way a machine learning model can understand and solve. This determines the kind of data used, the type of model used and how success is evaluated
2. data preparation: the performance of the model depends heavily on the quality of the data used to train it. The training data should not contain inaccurate or missing values
3. algorithm selection: deciding which machine learning algorithm (e.g. regression, tree-based) is the most appropriate for the problem
4. model training: the training data is used to incrementally improve the model's performance
5. model testing: once the training process has stopped, the model is evaluated using the test set of data; this aims to see how good the trained model is at performing its intended task on previously unseen data
6. hyperparameter tuning: the process of finding the set of hyperparameters that maximises a model's performance. Hyperparameters are settings chosen before training rather than learned from the data, e.g. the learning rate
7. inference/prediction: the model should now be able to make a prediction for unseen data
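
The steps above can be sketched end to end on a toy problem. This is only an illustration under several assumptions: the data is synthetic, the algorithm is a small hand-written k-nearest-neighbours classifier, and the hyperparameter `k` is tuned on a separate validation split so that the test split stays unseen until the final evaluation. None of the names below come from a particular library.

```python
import random

def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(train, data, k):
    """Fraction of examples in `data` that are classified correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

# Problem framing and data preparation: binary classification on labelled data.
rng = random.Random(0)
inputs = [rng.uniform(0, 10) for _ in range(300)]
data = [(x, 1 if x > 5 else 0) for x in inputs]
train, validation, test = data[:180], data[180:240], data[240:]

# Algorithm selection and training: for k-NN, "training" is just storing `train`.
# Hyperparameter tuning: pick the k that does best on the validation split.
candidates = [1, 3, 5, 7]
best_k = max(candidates, key=lambda k: accuracy(train, validation, k))

# Model testing: evaluate once on data the model has not seen.
test_accuracy = accuracy(train, test, best_k)

# Inference: predict the label of a new input.
prediction = knn_predict(train, 8.2, best_k)
print(best_k, test_accuracy, prediction)
```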

## Data construction

### Data sampling
In machine learning (with limited resources), we want the minimum amount of data that still contains enough information to learn the phenomenon properly, without wasting time. This can be done by reducing information redundancy: removing information that adds no value for the particular task at hand.

To ensure a sample is representative of the real-world population and not redundant:
- perform a high-level comparison between the sample and the population to ensure the sample reflects the same distributions, patterns and statistics as the whole population
- take a sub-sample of a large data set that still keeps those statistics intact
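
A high-level comparison of this kind can be as simple as checking a few summary statistics side by side. The sketch below assumes a synthetic numeric population and a simple random sub-sample; the helper `summarise` and the choice of statistics are illustrative, and real data would usually need per-feature and per-class checks as well.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: 10,000 values of a single numeric feature.
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Sub-sample 10% of the population uniformly at random.
sample = rng.sample(population, 1_000)

def summarise(values):
    """Summary statistics used for the comparison."""
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "median": statistics.median(values),
    }

pop_stats = summarise(population)
sample_stats = summarise(sample)

# Small gaps suggest the sample keeps the population's statistics intact.
gaps = {name: abs(pop_stats[name] - sample_stats[name]) for name in pop_stats}
print(gaps)
```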

An imbalanced dataset is one where one class appears far more frequently than another. In these cases, the minority class is often the most important to detect, but the model struggles to learn enough about it and becomes biased towards the majority class. This can produce misleadingly high accuracy: a high accuracy score may simply reflect the underlying class distribution rather than good performance
- say we want a model to identify a rare disease with a prevalence of 1 in 1000. If we give the model 1000 test entries reflecting the real-world prevalence (i.e. only 1 of the 1000 is positive), it could predict that all 1000 are negative. It would then be correct 99.9% of the time, yet fail to identify the one positive case, which was the main interest
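
The rare-disease example can be reproduced numerically. The sketch below also computes recall (the fraction of actual positive cases that the model finds), which is a metric not mentioned above but is a common way to expose this failure.

```python
# 1000 test entries with a prevalence of 1 in 1000: one positive, 999 negatives.
labels = [1] + [0] * 999

# A model that always predicts "negative".
predictions = [0] * 1000

# Accuracy: fraction of predictions that match the labels.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall: fraction of the actual positives that were predicted positive.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
actual_positives = sum(labels)
recall = true_positives / actual_positives

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

The accuracy is 0.999 while the recall is 0.0, which shows why accuracy alone is a poor measure on imbalanced data.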

To counteract this:
- start by training on the true distribution (using the whole dataset); if the trained model works well and generalises well, there is no problem
- if it doesn't, use down-sampling (training the model on a disproportionately low subset of the majority class examples) and up-weighting (adding an example weight to the down-sampled class equal to the factor by which it was down-sampled)
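
Down-sampling and up-weighting can be sketched as follows. This is a minimal illustration, assuming examples are `(features, label)` pairs and that the training code downstream accepts a per-example weight; `downsample_and_upweight` is a hypothetical helper, not a library function.

```python
import random

def downsample_and_upweight(examples, majority_label, factor, seed=0):
    """Keep 1/factor of the majority class and give each kept example weight=factor."""
    rng = random.Random(seed)
    majority = [e for e in examples if e[1] == majority_label]
    minority = [e for e in examples if e[1] != majority_label]

    # Down-sampling: keep only a random fraction of the majority class.
    kept = rng.sample(majority, len(majority) // factor)

    # Up-weighting: each kept majority example stands in for `factor` originals.
    weighted = [(x, y, float(factor)) for x, y in kept]
    # Minority examples are all kept with the default weight of 1.
    weighted += [(x, y, 1.0) for x, y in minority]

    rng.shuffle(weighted)
    return weighted

# Toy imbalanced dataset: 990 negatives (label 0) and 10 positives (label 1).
examples = [(i, 0) for i in range(990)] + [(i, 1) for i in range(990, 1000)]
train_set = downsample_and_upweight(examples, majority_label=0, factor=10)

majority_weight = sum(w for _, y, w in train_set if y == 0)
print(len(train_set), majority_weight)
```

The model now sees far fewer majority examples per minority example, while the weights keep the total influence of the majority class equal to its original size.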