Machine Learning Diagnostic
===========================
A test you can run to gain insight into what is or isn't working with a learning algorithm, and to get guidance on how best to improve its performance.

Evaluating a Model
===================================
- A model can be evaluated using a training set and a test set.
- The training set is used to train the model.
- The test set is used to evaluate the model.

Train/test procedure for linear regression
==========================================
1. Learn the parameters theta from the training data (minimize J(theta)).
2. Compute the test set error. For linear regression:
   J(test) = 1/(2 * m_test) * sum((h_theta(x_test(i)) - y_test(i))^2)
3. Compute the training set error. For linear regression:
   J(train) = 1/(2 * m_train) * sum((h_theta(x_train(i)) - y_train(i))^2)

Here h_theta(x) = theta' * x, where theta is the learned parameter vector, theta' is its transpose, x is the input, and y is the target output.
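
The two error formulas above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name ``squared_error_cost``, the parameter values, and the data are all hypothetical, and each input is assumed to carry a leading 1 for the intercept term.

```python
def squared_error_cost(theta, xs, ys):
    """J = 1/(2m) * sum((h_theta(x) - y)^2), with h_theta(x) = theta' * x."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = sum(t * xi for t, xi in zip(theta, x))  # theta' * x
        total += (prediction - y) ** 2
    return total / (2 * m)

# Hypothetical learned parameters and data; x[0] = 1 is the intercept feature.
theta = [1.0, 2.0]
x_train, y_train = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]], [1.0, 3.0, 5.0]
x_test, y_test = [[1.0, 3.0], [1.0, 4.0]], [7.5, 8.0]

j_train = squared_error_cost(theta, x_train, y_train)
j_test = squared_error_cost(theta, x_test, y_test)
print(f"J(train) = {j_train}, J(test) = {j_test}")
```

The same function is used for both sets; only the data passed in changes.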

Train/test procedure for logistic regression
============================================
1. Learn the parameters theta from the training data (minimize J(theta)).
2. Compute the test set error. For logistic regression:
   J(test) = -1/m_test * sum(y_test(i) * log(h_theta(x_test(i))) + (1 - y_test(i)) * log(1 - h_theta(x_test(i))))
3. Compute the training set error. For logistic regression:
   J(train) = -1/m_train * sum(y_train(i) * log(h_theta(x_train(i))) + (1 - y_train(i)) * log(1 - h_theta(x_train(i))))

Here h_theta(x) = g(theta' * x) with g(z) = 1 / (1 + e^(-z)), where theta is the learned parameter vector, theta' is its transpose, x is the input, and y is the target output.

For classification problems, J(test) and J(train) are often defined differently, and this alternative is probably the more common one.
Instead of the logistic loss, they measure the fraction of examples that are misclassified:
- J(test): fraction of the test set that is misclassified.
- J(train): fraction of the training set that is misclassified.
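
Both definitions of the classification error can be sketched side by side. This is a minimal sketch under stated assumptions: the helper names, the parameter values, the data, and the 0.5 decision threshold are all hypothetical. Note that the logistic cost is undefined if a predicted probability is exactly 0 or 1; the toy data here avoids that case.

```python
import math

def predict_proba(theta, x):
    """h_theta(x) = g(theta' * x) with the sigmoid g(z) = 1 / (1 + e^(-z))."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def logistic_cost(theta, xs, ys):
    """Average logistic loss over the given set."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = predict_proba(theta, x)
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / len(xs)

def misclassification_rate(theta, xs, ys, threshold=0.5):
    """Fraction of examples whose thresholded prediction differs from the label."""
    wrong = 0
    for x, y in zip(xs, ys):
        predicted = 1 if predict_proba(theta, x) >= threshold else 0
        wrong += predicted != y
    return wrong / len(xs)

# Hypothetical learned parameters and test data; x[0] = 1 is the intercept feature.
theta = [0.0, 1.0]
x_test = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y_test = [0, 1, 1, 1]

j_test_loss = logistic_cost(theta, x_test, y_test)
j_test_rate = misclassification_rate(theta, x_test, y_test)
print(j_test_loss, j_test_rate)
```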

In an over-fitted model, the training error will be low, but the test error will be high.
Comparing the two errors is therefore a way to evaluate the model and detect over-fitting.

Model Selection
================
- How to choose the model?
- For example, how to choose the degree of the polynomial?
- One way is to split the data into three parts: training set, cross-validation set, and test set.
- Use the training set to learn the parameters.
- Then choose the degree of the polynomial using the cross-validation set: try different degrees of the polynomial, and choose the one that gives the lowest cross-validation error.
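- Finally, estimate the generalization error using the test set: this is the error of the model on new data. The cross-validation error is an optimistic estimate because the degree was chosen to minimize it, whereas the test set was used for neither fitting nor selection.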

- This procedure can be used for other models as well, such as neural networks.
- The cross-validation set is also called the validation set, development set, or dev set.

Note: When selecting a model, choose one that performs well on both the training and cross-validation sets. This implies that it has learned the patterns in the training set without over-fitting.
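
The selection procedure above can be sketched end to end in plain Python. Everything here is an illustrative assumption: the data is made up and roughly quadratic, the helper names ``fit_polynomial`` and ``cost`` are hypothetical, and the fit uses the normal equations with a small Gauss-Jordan solver only to keep the example free of dependencies.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns c with y ~ c[0] + c[1]*x + ..."""
    n = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y] for the design matrix A.
    rows = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]

def cost(coeffs, xs, ys):
    """Squared-error cost 1/(2m) * sum((prediction - y)^2)."""
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = sum(c * x ** k for k, c in enumerate(coeffs))
        total += (prediction - y) ** 2
    return total / (2 * len(xs))

# Hypothetical data, already split into training, cross-validation and test sets.
x_train = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y_train = [9.1, 3.9, 1.1, 0.0, 0.9, 4.1, 8.9]
x_cv, y_cv = [-2.5, -0.5, 1.5, 2.5], [6.2, 0.3, 2.3, 6.3]
x_test, y_test = [-1.5, 0.5, 2.2], [2.2, 0.2, 4.8]

models, cv_errors = {}, {}
for degree in (1, 2, 3, 4):
    models[degree] = fit_polynomial(x_train, y_train, degree)  # train set only
    cv_errors[degree] = cost(models[degree], x_cv, y_cv)       # compare on cv set

best_degree = min(cv_errors, key=cv_errors.get)
test_error = cost(models[best_degree], x_test, y_test)         # report on test set
print(best_degree, cv_errors, test_error)
```

The test set is touched exactly once, after the degree has been fixed.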

A note on feature scaling:

As with the training set, you will also want to scale the cross-validation and test sets. When using the z-score, scale the cross-validation and test sets with the mean and standard deviation computed on the training set, not with their own statistics. This ensures that the input features are transformed the way the model expects.
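
A minimal sketch of this scaling rule for a single feature follows. The function names and data are hypothetical, and the use of the population standard deviation (``statistics.pstdev``) is an assumption; the important part is that the statistics come from the training set only.

```python
import statistics

def fit_scaler(column):
    """Compute the z-score statistics from the training data only."""
    return statistics.mean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Apply z = (x - mean) / std using previously fitted statistics."""
    return [(value - mean) / std for value in column]

# Hypothetical single-feature data.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
x_cv = [2.0, 6.0]
x_test = [0.0, 3.0]

mean, std = fit_scaler(x_train)           # fitted on the training set
x_train_scaled = transform(x_train, mean, std)
x_cv_scaled = transform(x_cv, mean, std)  # reuses the training statistics
x_test_scaled = transform(x_test, mean, std)
print(x_cv_scaled, x_test_scaled)
```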

Bias and Variance
=================
- Bias and variance are two sources of error in a model.
- High bias is under-fitting: the model is too simple to capture the underlying structure of the data.
- High variance is over-fitting: the model is too complex and captures noise in the data.
- High bias (underfit): training error is high, cross-validation error is high ----> J(train) ~ J(cross-validation)
- High variance (overfit): training error may be low, cross-validation error is much higher ----> J(train) << J(cross-validation)
- High bias and high variance: training error is high, cross-validation error is much higher still ----> J(train) << J(cross-validation)
- The optimal model will have low training error and low cross-validation error.

- The training error decreases as the model complexity increases.
- The cross-validation error decreases as the model complexity increases, but then increases as the model becomes too complex.

Regularization and Bias/Variance
=================================
How to choose the regularization parameter lambda?
The procedure is similar to choosing the degree of the polynomial.
- Split the data into three parts: training set, cross-validation set, and test set.
- Use the training set to learn the parameters.
- Choose the regularization parameter lambda using the cross-validation set.
- Estimate the generalization error using the test set.
So, try different values of lambda and choose the one that gives the lowest cross-validation error.

If we were to draw a graph of the training error and the cross-validation error as a function of the regularization parameter lambda, we would see that:
- The training error increases as lambda increases.
- The cross-validation error first decreases as lambda increases, then increases again once lambda becomes too large.
- So the cross-validation error is high when lambda is too low (over-fitting) or too high (under-fitting).
The graph would be roughly a mirror image of the graph of the training error and the cross-validation error as a function of the model complexity.
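
The behaviour described above can be reproduced with a deliberately tiny model. This sketch assumes a single-weight linear model without intercept, for which the regularized minimizer has the closed form w = sum(x*y) / (sum(x^2) + lambda); the data, the candidate values, and the helper names are hypothetical. The errors are computed without the regularization term, which is an assumption about how J(train) and J(cv) are reported.

```python
def fit_ridge_slope(xs, ys, lam):
    """Minimize 1/(2m)*sum((w*x - y)^2) + lam/(2m)*w^2 for a single weight w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cost(w, xs, ys):
    """Unregularized squared-error cost, used for both J(train) and J(cv)."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

# Hypothetical data and candidate values for lambda.
x_train, y_train = [1.0, 2.0, 3.0], [1.4, 1.8, 3.5]
x_cv, y_cv = [1.5, 2.5], [1.5, 2.4]
lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]

train_errors, cv_errors = {}, {}
for lam in lambdas:
    w = fit_ridge_slope(x_train, y_train, lam)
    train_errors[lam] = cost(w, x_train, y_train)
    cv_errors[lam] = cost(w, x_cv, y_cv)

best_lambda = min(cv_errors, key=cv_errors.get)
for lam in lambdas:
    print(lam, train_errors[lam], cv_errors[lam])
print("chosen lambda:", best_lambda)
```

On this toy data the training error grows with every increase of lambda, while the cross-validation error falls and then rises again.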

Establishing a Baseline Level of Performance
==============================================
- We need concrete numbers to determine if a learning algorithm has high bias or high variance.
- To judge if the training error is high, we establish a baseline level of performance.
- If the training error is much higher than the baseline level, it indicates a high bias problem.
- If the training error is low, but the cross-validation error is much higher than the training error, it indicates a high variance problem.
- If both the training error and the cross-validation error are high, it indicates a high bias and high variance problem.
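
The rules above can be written as a small decision helper. This is only a sketch: the function name ``diagnose``, the error values, and the ``tolerance`` that decides when a gap counts as "much higher" are all hypothetical choices that depend on the application.

```python
def diagnose(baseline, j_train, j_cv, tolerance):
    """Classify a model from two gaps: train vs. baseline and cv vs. train."""
    high_bias = (j_train - baseline) > tolerance
    high_variance = (j_cv - j_train) > tolerance
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "ok"

# Hypothetical error rates in percent, with a 2-point tolerance.
case_a = diagnose(baseline=10.6, j_train=10.8, j_cv=14.8, tolerance=2.0)
case_b = diagnose(baseline=10.6, j_train=15.0, j_cv=15.5, tolerance=2.0)
case_c = diagnose(baseline=10.6, j_train=15.0, j_cv=19.7, tolerance=2.0)
print(case_a, case_b, case_c, sep=" | ")
```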

Some common ways to set a baseline level of performance:
- Human-level performance: If humans can do the task, then the error they make is a good baseline.
- Competing models: If there are other models that can do the task, then the error they make is a good baseline.
- Guess based on experience

Learning Curves
================
- Learning curves are a good way to diagnose bias and variance problems.
- They plot the training error and the cross-validation error as a function of the training set size.
- Usually, the training error increases as the training set size increases, because it is harder to fit a larger training set.
- The cross-validation error decreases as the training set size increases.
- The cross-validation error is higher than the training error, but they tend to converge as the training set size increases.
- High bias: The J(train) increases rapidly, but tends to flatten out. The J(cv) decreases rapidly, but tends to flatten out. J(cv) is still higher. If you set a baseline level of performance, you can see a large gap between that and J(train); we know that this indicates high bias.
- The flattening out of the curves indicates that the model is just too simple, and it doesn't really matter how much data you feed it. So for a high bias problem, more training data is not the solution. You need to try a more complex model.
- High variance: J(train) is low, but J(cv) is much higher, so the two curves are far apart. If you set a baseline level of performance, J(train) may sit close to it or even below it while J(cv) stays well above it; the large gap between J(train) and J(cv) is what indicates high variance.
- In a high variance problem, the model is too complex, and it is over-fitting the training data. It turns out that if you feed it more training data, the model will generalize better. So for a high variance problem, more training data is likely the solution.
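
A learning curve can be computed by retraining on growing prefixes of the training set. The sketch below assumes a high bias setup on purpose: a straight line is fitted to made-up quadratic data, so more data cannot close the gap. The helper names, the data, and the fixed ordering of the training examples are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least-squares line; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x

def cost(model, xs, ys):
    """Squared-error cost 1/(2m) * sum((prediction - y)^2)."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

# Hypothetical quadratic data; the training examples are in an arbitrary fixed order.
x_train = [0.0, 9.0, 4.0, 2.0, 7.0, 5.0, 1.0, 8.0, 3.0, 6.0]
y_train = [x ** 2 for x in x_train]
x_cv = [-0.5, 2.5, 4.5, 6.5, 9.5]
y_cv = [x ** 2 for x in x_cv]

curve = []  # rows of (training set size, J(train), J(cv))
for m in range(2, len(x_train) + 1):
    model = fit_line(x_train[:m], y_train[:m])
    j_train = cost(model, x_train[:m], y_train[:m])  # error on the m examples used
    j_cv = cost(model, x_cv, y_cv)                   # error on the full cv set
    curve.append((m, j_train, j_cv))

for row in curve:
    print(row)
```

Note that J(train) is always measured on the same subset that was used for fitting, while J(cv) always uses the whole cross-validation set.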

Fixes to try for high bias:
- Try getting additional features
- Try adding polynomial features
- Try decreasing lambda

Fixes to try for high variance:
- Get more training examples
- Try smaller sets of features
- Try increasing lambda

Bias/Variance and Neural Networks
=================================
- We have seen the bias-variance tradeoff.
- It turns out that neural networks offer us a way out of this tradeoff.
- Large neural networks with small to moderate sized datasets are low bias machines; meaning that if your neural network is large enough, it can fit the training set well, so long as the dataset is not enormous.
- The method we can use is as follows:
1. Does the model do well on the training set? If not, it has high bias.
2. In that case, increase the size of the neural network and train again.
3. Repeat until the model does well on the training set.
4. Does the model do well on the cross-validation set? If not, it has high variance.
5. In that case, increase the size of the training set.
6. Go back to step 1.
7. Once the model does well on both the training and cross-validation sets, you have a good model!
- Remember that for steps 1 and 4, "doing well" is judged against the baseline level of performance we talked about earlier.

Does increasing the size of the neural network cause over-fitting?
It turns out that a large neural network will usually do as well as or better than a smaller one, so long as regularization is chosen appropriately.

Neural Network Regularization
==============================
J(w, b) = 1/m * sum(L(f(x(i)), y(i))) + lambda/(2m) * sum(w^2)
- L(f(x(i)), y(i)) is the loss function, and lambda/(2m) * sum(w^2) is the regularization term.
- Regularized model in TensorFlow:
- Dense(units=..., activation=..., kernel_regularizer=L2(lambda))
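
The cost formula above can be checked by hand with a few lines of plain Python. This is a sketch with hypothetical per-example losses and weights, not TensorFlow code; the parameter is named ``lam`` because ``lambda`` is a reserved word in Python, so real code has to pass a numeric value under another name.

```python
def regularized_cost(losses, weights, lam):
    """J = 1/m * sum(losses) + lam/(2m) * sum(w^2); biases are not penalized."""
    m = len(losses)
    data_term = sum(losses) / m
    regularization_term = lam / (2 * m) * sum(w ** 2 for w in weights)
    return data_term + regularization_term

# Hypothetical per-example losses L(f(x(i)), y(i)) and flattened weights.
losses = [0.2, 0.4, 0.6]
weights = [1.0, -2.0, 0.5]

j_unregularized = regularized_cost(losses, weights, lam=0.0)
j_regularized = regularized_cost(losses, weights, lam=0.3)
print(j_unregularized, j_regularized)
```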

Machine Learning Development Process
====================================
The iterative loop of ML development:
- Choose architecture: model, data, hyperparameters, etc.
- Train the model.
- Diagnostics: bias, variance and error analysis.
- Go through the loop until you have a good model.

Error Analysis
================
- Error analysis refers to the process of manually examining the errors that the model makes.
- This can give you insight into what is/isn't working with the model.
- For example, in the case of spam email classification, say the model is misclassifying 100 emails.
- You could manually examine these emails to see if there is a pattern.
- Categorize them based on common traits:
- E.g. pharma related emails are about 50% of the errors, but deliberate misspelling is only 5%.
- This can help you decide what to do next.
- You might decide to get more data on pharma related emails, or create more features.

- The categories might be overlapping, so one email can fall into multiple categories.
- If the number of misclassified emails is large, you can sample a subset of them.
- One downside of error analysis is that it's much easier to do for problems that humans are good at.
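
Once the misclassified examples have been tagged by hand, tallying the categories is straightforward. The records and tag names below are hypothetical; each email carries a set of tags because the categories may overlap.

```python
from collections import Counter

# Hypothetical misclassified emails, manually tagged during error analysis.
misclassified = [
    {"id": 1, "tags": {"pharma"}},
    {"id": 2, "tags": {"pharma", "misspelling"}},
    {"id": 3, "tags": {"phishing"}},
    {"id": 4, "tags": set()},  # no recognizable pattern
]

# Count how many errors fall into each category (one email may count in several).
counts = Counter(tag for email in misclassified for tag in email["tags"])
shares = {tag: n / len(misclassified) for tag, n in counts.items()}

for tag, n in counts.most_common():
    print(f"{tag}: {n} errors ({shares[tag]:.0%} of all errors)")
```

Because of the overlap, the shares do not have to add up to 100%.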

Adding Data
============
- Add more data of the types where error analysis has indicated it might help.
- Besides getting new data, another technique is data augmentation.
- Augmentation: modifying an existing data set to create new data.
- For example, for OCR: rotate, scale, or skew the images. Or you can place a grid on top of the images and randomly warp the images.
- For speech recognition: add background noise.

- Distortions should be reasonable, so that the data is still representative of the real data or the test set.
- It usually does not help to add purely random/meaningless noise to the data.
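
For the speech example, augmentation amounts to mixing a realistic noise clip into an existing recording while keeping the label. The sketch below uses short lists of samples as stand-ins for audio; the function name, the sample values, and the ``noise_gain`` of 0.1 are hypothetical.

```python
def add_background_noise(signal, noise, noise_gain):
    """Mix a noise clip into a speech clip, looping the noise if it is shorter."""
    return [s + noise_gain * noise[i % len(noise)] for i, s in enumerate(signal)]

# Hypothetical waveforms; in practice these would be recorded audio samples.
speech = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
crowd_noise = [0.2, -0.1, 0.3, -0.2]  # stand-in for a recorded background clip
label = "hello"

augmented = add_background_noise(speech, crowd_noise, noise_gain=0.1)
new_example = (augmented, label)  # the label is unchanged by the distortion
print(new_example)
```

The gain is kept small so that the augmented clip still resembles what the system would hear at test time.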


- Data synthesis: using artificial data inputs to create a new data set.
- For example for OCR: using fonts on your computer, create new data.
- Data synthesis is usually used for vision problems.

- Conventional model-centric approach: focus on improving the model.
- Data-centric approach: focus on improving the data.

Transfer Learning
==================
Sometimes you don't have that much data and getting data is expensive or challenging.
- Transfer learning is a way to use a model trained on one task and apply it to a different task.
- For example, you have a model trained to classify 1000 categories of objects in images.
- You can use this model to classify digits from 0 to 9.
- You take the pre-trained model and remove the last layer.
- You add a new output layer that will classify the digits.
- There are two options:
- Option 1: Only train the new output layer -> good for a very small dataset.
- Option 2: Retrain the entire model -> good for a larger dataset.

- So transfer learning has two steps:
- Supervised pre-training, and then fine-tuning.

- But why does transfer learning work?
- The intuition is that the features learned in the lower layers are more general, e.g., the edges, curves, etc.
- One restriction of pre-trained models is that the input type should be the same: image, text, audio, etc.

Full Cycle of a Machine Learning Project
========================================
1. Scope project: define the goal
2. Data collection: define and collect data
3. Train the model: training, error analysis and iterative improvement
4. Go to step 2 until you have a good model
5. Deploy the model in production: deploy, monitor, and maintain the system
6. Go to step 2 or 3 if needed

Skewed Datasets
================
- Sometimes the ratio of positive to negative examples is very skewed, far from 50/50.
- In these cases, metrics like accuracy are not very informative.
- Consider the case of a classifier for a rare disease that is present for only 0.5% of the population.
- If the classifier predicts that no one has the disease, it will have an accuracy of 99.5%.

- A good pair of metrics to use in this case is precision/recall.
- For that, we draw a table with four cells: true positive, false positive, true negative, false negative.
- Precision: of all the patients the model predicted to have the rare disease, what fraction actually have the disease?
- Precision = true positive / (true positive + false positive)
- Recall: of all the patients that actually have the disease, what fraction did the model correctly identify?
- Recall = true positive / (true positive + false negative)
- Ideally, both metrics are high.
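
The definitions above can be sketched directly from the four cells of the table. The function names and the data are hypothetical, and returning 0.0 when a ratio is undefined (no predicted positives or no actual positives) is a convention assumed here.

```python
def confusion_counts(y_true, y_pred):
    """Return (true positive, false positive, false negative, true negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention when undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A classifier that always predicts "no disease" at 0.5% prevalence.
y_true = [1] + [0] * 199
always_negative = [0] * 200
accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
trivial_precision, trivial_recall = precision_recall(y_true, always_negative)

# A hypothetical classifier with 15 TP, 5 FP, 10 FN and 70 TN.
y_true_b = [1] * 15 + [0] * 5 + [1] * 10 + [0] * 70
y_pred_b = [1] * 15 + [1] * 5 + [0] * 10 + [0] * 70
precision_b, recall_b = precision_recall(y_true_b, y_pred_b)
print(accuracy, trivial_recall, precision_b, recall_b)
```

The always-negative classifier looks excellent by accuracy but has zero recall, which is exactly what the pair of metrics is meant to expose.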

Trading off Precision and Recall
=================================
- Ideally, both precision and recall are high.
- In practice, there is a trade-off between precision and recall.
- Suppose we're using logistic regression to classify the presence of the rare disease.
- We would like to predict y=1 only if we are very confident.
- That means setting the threshold higher, like 0.7 instead of the usual 0.5.
- This will increase precision, but decrease recall.
- On the other hand, if we want to catch all the cases of the disease, we can set the threshold lower.
- This will increase recall, but decrease precision.

Plotting precision against recall for different values of the threshold gives a precision-recall curve.
- The precision-recall curve is a good way to compare different models and pick the best threshold.
- It turns out that if you want to automatically trade off precision and recall, you can use the F1 score.
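
The effect of the threshold can be seen by sweeping it over a fixed set of predicted probabilities. The probabilities, labels, and thresholds below are made up for illustration; on real data precision does not always rise smoothly with the threshold, but the general direction is the same.

```python
def precision_recall_at(probs, labels, threshold):
    """Precision and recall when predicting y=1 for probabilities >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention when undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical model outputs h_theta(x) and true labels.
probs = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

results = {t: precision_recall_at(probs, labels, t) for t in (0.3, 0.5, 0.7)}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.3f}, recall={recall:.3f}")
```

Each threshold yields one point of the precision-recall curve; on this toy data, raising the threshold trades recall for precision.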

F1 Score
=========
- The F1 score is the harmonic mean of precision and recall.
- F1 = 2 * precision * recall / (precision + recall)
- Or,
- F1 = 2 / (1 / precision + 1 / recall)
- The F1 score pays more attention to the smaller of the two values, because if one of them is very low, the F1 score will be very low and the model is probably not good.
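
A short sketch shows why the harmonic mean is preferred over a plain average. The three candidate models and their precision/recall values are hypothetical, and returning 0.0 when both values are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both values are zero
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs for three candidate models.
candidates = {"A": (0.5, 0.4), "B": (0.7, 0.1), "C": (0.02, 1.0)}

f1 = {name: f1_score(p, r) for name, (p, r) in candidates.items()}
average = {name: (p + r) / 2 for name, (p, r) in candidates.items()}

best_by_f1 = max(f1, key=f1.get)
best_by_average = max(average, key=average.get)
print(f1, average, best_by_f1, best_by_average)
```

The plain average favors model C, which labels nearly everything positive, while the F1 score favors the balanced model A.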