# Week 9 - Main Challenges of Machine Learning

In ML our main task is to select a learning algorithm and train it on some data. The **two things that can go wrong** are **bad algorithm** and **bad data**.

## Problems with Data

### Insufficient Quantity of Training Data

We need enough data to train our model: with too few examples, the algorithm cannot reliably learn the underlying patterns.

### Nonrepresentative Training Data

In order to generalise well, our data **needs to be representative of the new cases we want to generalise to**. This is true whether we use instance-based or model-based learning. If the sample is too small, we might get sampling noise, i.e. data that is nonrepresentative purely by chance.

### Poor Quality Data

If our training data is full of errors, outliers and noise, it will make it harder for the system to detect underlying patterns. It is often well worth the effort to spend time cleaning up our training data. The truth is most data scientists spend a lot of time doing just that. For example:

1. If some instances are clear outliers, it might help to simply discard them or try to fix the errors manually
2. If some instances are missing a few features (e.g. 5% of our customers didn't tell us their age) then we must decide whether to ignore this attribute altogether, ignore these instances, fill in the missing values, or train one model with the feature and one without it. **Is it possible to train a model to predict these missing values and fill them in accordingly?**
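As a minimal sketch of the "fill in the missing values" option, we could replace each missing age with the median of the observed ages. The `impute_median` helper and the `ages` list are made up for illustration; a predictive model could in principle take the place of the median, as the question above suggests.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical customer ages; None marks a customer who did not tell us their age.
ages = [34, None, 27, 45, None, 31]
print(impute_median(ages))  # [34, 32.5, 27, 45, 32.5, 31]
```

The median is used here rather than the mean because it is less sensitive to outliers, which ties in with the first point in the list.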

### Irrelevant Features

Our system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. **A critical part of the success of a Machine Learning project is coming up with a good set of features to train on**. This is called **feature engineering**, and involves:

1. **Feature selection**: Selecting the most useful features to train on
2. **Feature extraction**: Combining existing features to produce a more useful one [dimensionality reduction algorithms can help]
3. **Creating new features by gathering new data**

## Problems with Algorithms

### Overfitting the Training Data

Overfitting means that the model performs well on the training data, but not on the test data - it does not generalise well. Complex models like DNNs can detect subtle patterns in data, but if the training set is noisy or too small then the model is likely to detect patterns in the noise itself.

Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The solutions are:

1. To simplify the model by selecting one with fewer parameters, by reducing the number of attributes in the training data, or by constraining the model
2. To gather more training data
3. To reduce the noise in the training data [fix data errors and remove outliers]

Constraining a model to make it simpler and reduce the risk of overfitting is called **regularisation**. **Say we have a linear model with two parameters, 𝜃0 and 𝜃1. This gives the learning algorithm two degrees of freedom to adapt the model to the training data: it can tweak both the height (𝜃0) and the slope (𝜃1) of the line. If we forced 𝜃1 = 0 the algorithm would only have one degree of freedom and would have a much harder time fitting the data properly - all it could do is move the line up or down to get as close as possible to the training instances, so it would end up around the mean.**

**If we allow the algorithm to modify 𝜃1 but we force it to keep it small then the learning algorithm will effectively have somewhere in between one and two degrees of freedom** - It will produce a simpler model than with two degrees of freedom but more complex than with just one. **You want to find the right balance between fitting the training data perfectly and keeping the model simple enough to ensure that it will generalise well.**

**The amount of regularisation to apply during learning can be controlled by hyperparameters.** Tuning hyperparameters is an important part of building a model.
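To make the degrees-of-freedom argument concrete, here is a small sketch that fits the line 𝜃0 + 𝜃1·x while penalising the squared slope (a ridge-style penalty, chosen here as one common form of regularisation). The function name `fit_ridge_line`, the hyperparameter name `alpha` and the toy data are assumptions for illustration only.

```python
def fit_ridge_line(xs, ys, alpha):
    """Fit y = theta0 + theta1 * x by least squares plus a penalty alpha * theta1**2.

    Only the slope is penalised; the height theta0 is left free.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    theta1 = sxy / (sxx + alpha)        # alpha = 0 gives ordinary least squares
    theta0 = y_mean - theta1 * x_mean   # the line always passes through the means
    return theta0, theta1

# Toy data lying exactly on y = 1 + 2x.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

print(fit_ridge_line(xs, ys, alpha=0))    # (1.0, 2.0): two full degrees of freedom
print(fit_ridge_line(xs, ys, alpha=10))   # (3.0, 1.0): slope kept smaller
print(fit_ridge_line(xs, ys, alpha=1e9))  # slope close to 0, height close to the mean of ys
```

With a very large `alpha` the slope is effectively forced to 0, and the line ends up around the mean of the targets, which is the one-degree-of-freedom case described above.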

### Underfitting the Training Data

As you might guess, **underfitting** is the opposite of overfitting. It occurs when your model is too simple to learn the underlying structure of the data. You can fix underfitting by:

1. Selecting a more powerful model, with more parameters
2. Feeding better features to the learning algorithm [feature engineering]
3. Reducing the constraints on the model [e.g. reducing the regularisation hyperparameter]

### Stepping Back

1. Machine Learning is about making machines get better at a task by learning from data, instead of having to explicitly code rules

2. There are many different types of ML systems: supervised or not, batch or online, instance-based or model-based etc

3. In an ML project you gather data in a training set and feed the training set to a learning algorithm. If the algorithm is model-based, it tunes some parameters to fit the model to the training set. If the algorithm is instance-based, it just learns the examples by heart and generalises to new instances using a similarity measure.

4. The system will not perform well if the training set is too small, or if the data is not representative, is noisy, or is polluted with irrelevant features. The model needs to be neither too simple [so it does not underfit] nor too complex [so it does not overfit]

## Testing and Validating

The only way to know how well a model generalises to new cases is to try it on them. The best way is to split our data into two sets, the **training set** and the **test set**. The **error rate** on new cases is called the **generalisation error [or out-of-sample error]** and by **evaluating our model on the test set, we will get an estimate of this error**. This value tells us **how well our model will perform on instances it has never seen before**.

If the **training error is low but our generalisation error is high, it means our model is overfitting the training data**. 
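A minimal sketch of the split and of the error rate, using only the standard library. The helper names `train_test_split` and `error_rate`, the 20% test ratio and the fixed seed are assumptions for illustration, not a prescribed recipe.

```python
import random

def train_test_split(instances, test_ratio=0.2, seed=42):
    """Shuffle a copy of the instances and hold out a fraction as the test set."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is repeatable
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def error_rate(predict, labelled):
    """Fraction of (x, y) pairs for which predict(x) differs from y."""
    wrong = sum(1 for x, y in labelled if predict(x) != y)
    return wrong / len(labelled)

train_set, test_set = train_test_split(range(100))
print(len(train_set), len(test_set))  # 80 20
```

Calling `error_rate` once with the training pairs and once with the test pairs gives the two numbers to compare: a low training error next to a high test error is the overfitting signature described above.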

### Hyperparameter Tuning and Model Selection

Evaluating a model is easy enough - just use a test set. Suppose we are hesitating between two models - how can we decide? One option is to train both and compare how well they generalise using the test set. Suppose that one model generalises better but we want to apply some regularisation to avoid overfitting. How would we choose the value of the regularisation hyperparameter? We could train 100 different models using 100 different values. Suppose we found the hyperparameter value that produces the model with the lowest generalisation error, say 5%.

Say we launch this model into prod, but it doesn't perform as well as expected and produces 15% errors. What happened? The problem is that we measured the generalisation error multiple times on the test set, and we adapted the model and hyperparameters to produce the best model for that particular set. This means the model is unlikely to perform as well on new data.

A common solution is called **holdout validation** - we simply **hold out part of the training set to evaluate several candidate models** and select the best one. The new holdout set is called the validation set [or sometimes the development set, or dev set]. More specifically, we train multiple models with various hyperparameters on the reduced training set [i.e. the full training set minus the validation set] and we select the model that performs best on the validation set. After the holdout validation process, we train the best model on the full training set [including the validation set] and this gives us our final model. We can then evaluate our final model on the test set to get an estimate of the generalisation error.
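The holdout procedure can be sketched as follows. The toy model (`train_shrunk_mean`, a constant prediction pulled towards 0 by a hyperparameter `h`) and the numbers are invented purely so the example is self-contained; only the shape of `holdout_select` reflects the steps described above.

```python
def mse(prediction, ys):
    """Mean squared error of a constant prediction against the targets ys."""
    return sum((prediction - y) ** 2 for y in ys) / len(ys)

def train_shrunk_mean(h, ys):
    """Toy model: a constant prediction, shrunk towards 0 as h grows."""
    return sum(ys) / (len(ys) + h)

def holdout_select(candidates, train_fn, error_fn, train_set, val_set):
    # Train one model per hyperparameter value on the reduced training set
    # and keep the value with the lowest validation error.
    best = min(candidates, key=lambda h: error_fn(train_fn(h, train_set), val_set))
    # Retrain the winner on the full training set (reduced set + validation set).
    final_model = train_fn(best, train_set + val_set)
    return best, final_model

reduced_train = [2, 4, 12]   # hypothetical targets
validation = [4, 5]
best_h, final_model = holdout_select(
    [0, 1, 10], train_shrunk_mean, mse, reduced_train, validation
)
print(best_h, final_model)  # 1 4.5
```

The test set does not appear anywhere in this selection step; it is only used once, afterwards, to estimate the generalisation error of `final_model`.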

This usually works quite well. However, if the validation set is too small, then model evaluations will be imprecise - we may end up selecting a suboptimal model by mistake. On the other hand, if the validation set is too large then the remaining training set will be much smaller than the full training set. Why is this bad? Because the final model will be trained on the full training set, it is not ideal to compare candidate models trained on a much smaller training set. One way to solve this problem is to perform **repeated cross-validation**, using many small validation sets. Each model is evaluated once per validation set, after it has been trained on the rest of the data. By averaging out the evaluations of a model, we get a more accurate measure of its performance. There is one drawback - the training time is multiplied by the number of validation sets.
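One common way to build those many small validation sets is k-fold cross-validation; the sketch below assumes that variant. The helper names and the trivial mean-predicting model are made up for illustration, and the data is assumed to be already shuffled.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; every index lands in exactly one validation fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size

def cross_val_error(data, k, train_fn, error_fn):
    """Train k times, each time validating on a different fold, and average the errors."""
    errors = []
    for train_idx, val_idx in k_fold_indices(len(data), k):
        model = train_fn([data[i] for i in train_idx])
        errors.append(error_fn(model, [data[i] for i in val_idx]))
    return sum(errors) / len(errors)

# Toy model: predict the mean of the training targets.
def fit_mean(ys):
    return sum(ys) / len(ys)

def mse(prediction, ys):
    return sum((prediction - y) ** 2 for y in ys) / len(ys)

print(cross_val_error([1, 2, 3, 4], k=2, train_fn=fit_mean, error_fn=mse))  # 4.25
```

The loop in `cross_val_error` calls `train_fn` once per fold, which is exactly where the multiplied training time comes from.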

### Data Mismatch

In some cases it's easy to get a large amount of data for training but the data is not perfectly representative of the data that will be used in production. We encountered this with our vehicle identifier. A lot of the pictures were either adverts or unsuitable for training our model - especially where the cars were newer. In this case, the most important thing to remember is that the validation set and the test set must be as representative as possible of the data we expect to use in production, so they should be composed exclusively of representative pictures - we can shuffle them and put half in the validation set and half in the test set, making sure no duplicates or near-duplicates end up in both sets.

**After training our model on web pictures, if we observe that the performance of our model on the validation set is disappointing, we will not know whether this is because our model has overfit the training set, or whether it is due to the mismatch between training data and new data [user pictures].**

One solution is to hold out part of the training pictures [from the web] in yet another set that **Andrew Ng** calls the **train-dev set**. **After the model is trained [on the training set, not on the train-dev set] we can evaluate it on the train-dev set. If it performs well then the model is not overfitting on the training set, so if it performs poorly on the validation set the problem must come from the data mismatch. We can tackle this problem by pre-processing the training data to make it look more like new data [pictures submitted in the app], and then retraining the model.**

**Conversely, if the model performs poorly on the train-dev set, then the model must have overfit the training set, so we should try to simplify or regularise the model, get more training data, clean up the training data, etc.**
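The train-dev reasoning boils down to comparing three error rates. The sketch below is one possible way to write that comparison down; the function name `diagnose` and the `gap` threshold of 0.05 are arbitrary assumptions, and in practice what counts as a "large" gap depends on the task.

```python
def diagnose(train_err, train_dev_err, val_err, gap=0.05):
    """Rough reading of the training, train-dev and validation error rates."""
    if train_dev_err - train_err > gap:
        # Poor on held-out data from the SAME distribution as the training set.
        return "overfitting"
    if val_err - train_dev_err > gap:
        # Fine on train-dev, poor on representative data.
        return "data mismatch"
    return "no large gap"

print(diagnose(0.02, 0.03, 0.15))  # data mismatch
print(diagnose(0.02, 0.14, 0.15))  # overfitting
```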

### Generalisation, Overfitting and Underfitting

**In supervised learning, we want to build a model on the training data and then be able to make predictions on new unseen data that has similar characteristics to the training data we used to train our model.**

Usually we build a model so that it can make accurate predictions on the training set. If the training and test sets have enough in common then we expect the model to also be accurate on the test set. However, there are some cases where this can go wrong. If we allow ourselves to build very complex models, we can always be as accurate as we like on the training set. For example, given a small dataset, we can make up any number of rules that predict every training instance correctly, but these rules may not extrapolate well to new data.

The only measure of whether an algorithm will perform well on new data is the evaluation on the test set. However, intuitively we expect simple models to generalise better to new data. Therefore we want to find the simplest model that still fits the data well. Building a model that is too complex for the information we have is called overfitting. Overfitting occurs when you fit a model too closely to the particularities of the training set and obtain a model that works well on the training set but is not able to generalise to new data. On the other hand, if our model is too simple, then we might not be able to capture all the aspects of variability in the data, and our model will do badly even on the training set. Choosing too simple a model is called underfitting.

The more complex we allow our model to be, the better we will be able to predict on the training data. However, if our model becomes too complex, it will focus too much on individual data points in our training set and be unable to generalise to new data. There is a sweet spot in between that will yield the best generalisation performance.
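The sweet spot can be illustrated with three models of increasing complexity on a tiny synthetic dataset. Everything here is invented for the illustration: the data follow y = x with alternating ±1 noise on the training targets, and the three models are a constant (too simple), a straight line, and a 1-nearest-neighbour lookup that memorises every training point (too complex for this data).

```python
def mse(predict, xs, ys):
    """Mean squared error of a prediction function over (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_mean(xs, ys):
    m = sum(ys) / len(ys)
    return lambda x: m                      # ignores x entirely

def fit_line(xs, ys):
    n = len(xs)
    xm, ym = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
             / sum((x - xm) ** 2 for x in xs))
    return lambda x: ym + slope * (x - xm)  # ordinary least squares line

def fit_1nn(xs, ys):
    pairs = list(zip(xs, ys))
    # Return the target of the closest training point: pure memorisation.
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 0, 3, 2, 5, 4]                # y = x plus alternating +1 / -1 noise
test_x = [0.4, 1.4, 2.4, 3.4, 4.4]
test_y = [0.4, 1.4, 2.4, 3.4, 4.4]          # noise-free targets for evaluation

errors = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("1-NN", fit_1nn)]:
    model = fit(train_x, train_y)
    errors[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(name, errors[name])               # (training error, test error)
```

On this data the 1-NN model reaches zero training error yet does worse on the test points than the line, while the constant model does badly on both sets.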

### Relation of Model Complexity to Dataset Size

It's important to note that model complexity is intimately related to the variation of inputs covered in our training dataset. The larger variety of data points our dataset contains, the more complex a model we can use without overfitting. Usually collecting more data points will yield more variety, so larger datasets allow building more complex models. However simply duplicating the same data points or collecting similar data will not help.

