<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/MLPG-Book-Cover-Small.png"><br>

This notebook contains an excerpt from the **`Machine Learning Project Guidelines - For Beginners`** whitepaper written by *Balasubramanian Chandran*; the content is available [on GitHub](https://github.com/BalaChandranGH/Whitepapers/ML-Project-Guidelines).

<br>
<!--NAVIGATION-->

<[ [Other Considerations - How to choose ML Algorithms](06.01-mlpgw-Other-Considerations-How-to-choose-ML-Algorithms.ipynb) | [Contents and Acronyms](00.00-mlpgw-Contents-and-Acronyms.ipynb) | [Other Considerations - Model Performance Improvement](06.03-mlpgw-Other-Considerations-Model-Performance-Improvement.ipynb) ]>

# 6. Other Considerations

## 6.2. Metrics and Error Analysis

### 6.2.1. Learning Curves
* Generally, a learning curve is a plot that shows time or experience on the x-axis and learning or improvement on the y-axis
* Learning curves are widely used in machine learning for algorithms that learn (optimize their internal parameters) incrementally over time, such as deep learning neural networks
* The metric used to evaluate learning could be maximizing, meaning that better scores (larger numbers) indicate more learning. An example would be classification accuracy
* It is more common to use a minimizing score, such as loss or error, where better scores (smaller numbers) indicate more learning and a value of 0.0 indicates that the training dataset was learned perfectly with no mistakes
* There are three common dynamics that you are likely to observe in learning curves; they are:
  - Underfit
  - Overfit
  - Good Fit
* Most commonly, learning curves are used to diagnose overfitting, which can be addressed by tuning the hyperparameters of the model

### 6.2.2. Precision and Recall
* Both are threshold metrics for classification problems (on both balanced and imbalanced datasets)
* Precision attempts to answer “What proportion of positive identifications was actually correct?”
* Recall attempts to answer “What proportion of actual positives was identified correctly?”
* A **false positive** is an incorrect identification of the presence of a condition when it’s absent
* A **false negative** is an incorrect identification of the absence of a condition when it’s actually present
* Examples of when to give importance to false positives and false negatives:
  - Consider a case when a country wants to vaccinate all its people aged over 50
    - ``False negatives matter more`` (favor recall) if the policy is not to miss a single person over the age of 50; vaccinating up to 5% of people under 50 who are wrongly identified (false positives) is acceptable because the country holds 5% more vaccine doses as a contingency
    - ``False positives matter more`` (favor precision) if the policy is not to vaccinate a single person under the age of 50 because of a vaccine shortage
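
As a minimal sketch of the two questions above, the snippet below computes precision and recall for the vaccination scenario. The ages and the screening decisions are made-up illustration data, and "over 50" is assumed to be the positive class.

```python
def precision_recall(actual, predicted):
    """Return (precision, recall) for boolean labels, where True is the positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical ages and a hypothetical screening decision ("flag for vaccination").
ages = [34, 47, 49, 52, 55, 61, 68, 73]
flagged = [False, True, False, True, True, False, True, True]

actual_over_50 = [age > 50 for age in ages]
precision, recall = precision_recall(actual_over_50, flagged)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.80 recall=0.80
```

Here the 47-year-old is a false positive (lowers precision) and the 61-year-old is a false negative (lowers recall).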

### 6.2.3. Performing Error Analysis/Troubleshooting Prediction Errors
* **Error analysis**: The **``recommended approach``** to solving machine learning problems is,
  - Start with a simple algorithm, implement it quickly, and test it early on the cross-validation dataset
  - Plot learning curves to decide if more data, more features, etc., are likely to help
  - Manually examine the errors on examples in the cross-validation dataset and try to spot a trend where most of the errors were made
* Prediction errors found when evaluating a hypothesis can be troubleshot in several ways, including:
  - Getting more training examples
  - Trying smaller sets of features
  - Trying additional features
  - Trying polynomial features
  - Regularization methods (increasing or decreasing “lambda”)
* Don't just pick one of these avenues at random; the diagnostic techniques for choosing among them are discussed below

**1) Model selection and ``Train-Validation-Test datasets``**
  - Split the dataset into 3 datasets (the % can differ):
  ```
    - Training set         - 60%
    - Cross-validation set - 20%
    - Test set             - 20%
  ```
  - Now, calculate 3 separate error values for the 3 different datasets using the following method
  - Optimize the parameters in **``theta``** using the training set for each polynomial degree
  - Find the polynomial degree **``d``** with the least error using the cross-validation set
  - Estimate the generalization error using the test set with d (note, **``d``** is not trained using the test set)
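
The 60/20/20 split above could be sketched as follows. The function name, the fixed seed, and the stand-in data are assumptions made for illustration only.

```python
import random


def train_cv_test_split(examples, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle a copy of `examples` and cut it into train / CV / test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_cv = round(len(shuffled) * cv_frac)
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]  # whatever remains (about 20% here)
    return train, cv, test


examples = list(range(100))  # stand-in for 100 labelled examples
train, cv, test = train_cv_test_split(examples)
print(len(train), len(cv), len(test))  # 60 20 20
```

Shuffling before cutting matters when the data is stored in some order (for example, sorted by date or by class).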

**2) Diagnosing Bias vs Variance – ``Model complexity``** (lower-order/higher-order polynomial **d** contribution)
  - We need to distinguish whether ``bias`` or ``variance`` is the problem contributing to the bad predictions
  - ``High bias`` means ``underfitting`` and ``high variance`` means ``overfitting``
  - Ideally, we need to find a golden mean (sweet-spot) between bias and variance (trade-off)
  - The ``training error`` will tend to ``decrease`` as we increase the degree **d** of the polynomial. At the same time, the ``cross-validation error`` will tend to ``decrease`` as we increase **d** up to a point and then ``increase`` as **d** grows further, forming a convex curve as depicted below
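
The degree sweep described in 1) and 2) can be illustrated with a small, dependency-free sketch. Everything here is an assumption for illustration: the data is synthetic (a cubic signal plus noise), the fit uses plain normal equations, and the helper names are hypothetical. In practice a library routine would normally replace `solve` and `fit_poly`.

```python
import random


def solve(a, b):
    """Solve the linear system a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_poly(xs, ys, degree):
    """Least-squares polynomial coefficients (theta), lowest power first."""
    k = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve(a, b)


def mse(theta, xs, ys):
    """Mean squared error of the polynomial `theta` on (xs, ys)."""
    preds = [sum(t * x ** i for i, t in enumerate(theta)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)


def make_data(n, rng):
    """Hypothetical data: a cubic signal plus Gaussian noise, x in [-1, 1]."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [1.0 - 2.0 * x + 3.0 * x ** 3 + rng.gauss(0, 0.2) for x in xs]
    return xs, ys


rng = random.Random(0)
x_train, y_train = make_data(60, rng)
x_cv, y_cv = make_data(20, rng)
x_test, y_test = make_data(20, rng)

results = {}
for d in range(1, 7):
    theta = fit_poly(x_train, y_train, d)  # theta is fitted on the training set only
    results[d] = (theta, mse(theta, x_train, y_train), mse(theta, x_cv, y_cv))
    print(f"d={d} train={results[d][1]:.4f} cv={results[d][2]:.4f}")

best_d = min(results, key=lambda d: results[d][2])    # lowest CV error
test_error = mse(results[best_d][0], x_test, y_test)  # touched only once, at the end
print(f"best d={best_d} test={test_error:.4f}")
```

The training error never goes up as **d** grows, because each higher-degree model contains the lower-degree ones; the CV error is what reveals when extra complexity stops helping.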

**3) Diagnosing Bias vs Variance – ``Regularization``** (parameter **lambda**)
  - ``Large lambda`` means ``high bias (underfitting)``
  - ``Low lambda`` means ``high variance (overfitting)``
  - ``Intermediate lambda`` means ``right-fit or just-right``<br>
  The figure below illustrates the relationship between lambda and the hypothesis.

![](figures/MLPG-OC-ModelComplexity.png) ![](figures/MLPG-OC-Regularizattion.png) <br>
Image credit [ (Source) ](https://www.coursera.org/in)
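
To make the effect of lambda concrete, here is a minimal sketch that assumes the simplest possible model, y ≈ theta · x with an L2 penalty lambda · theta². For this one-parameter case the minimizer has the closed form theta = Σxy / (Σx² + lambda). The data points are invented.

```python
def fit_ridge_slope(xs, ys, lam):
    """Closed-form slope for y ≈ theta * x with an L2 penalty lam * theta**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(theta, xs, ys):
    """Mean squared error of the one-parameter model on (xs, ys)."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

for lam in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    theta = fit_ridge_slope(xs, ys, lam)
    print(f"lambda={lam:>6} theta={theta:.3f} train_mse={mse(theta, xs, ys):.3f}")
```

As lambda grows, theta is pulled toward 0 and the training error rises, which is the high-bias (underfitting) end of the trade-off. This toy model is too simple to overfit, so it only shows that one side.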

**4) Learning curves**
  - Training an algorithm on very few data points (such as 1, 2, or 3) will easily yield 0 error because we can always find a quadratic curve that passes exactly through those points, hence:
    - As the training set gets larger, the error for the quadratic function increases
    - The error value will plateau after a certain training set size m
  - **Experiencing high bias:**
    - **``Low training set size``**: causes ``low train error`` and ``high CV error``
    - **``Large training set size``**: causes both ``train error`` and ``CV error`` to be ``high``
    - If a learning algorithm is suffering from ``high bias, getting more training data will not (by itself) help much``
  - **Experiencing high variance:**
    - **Low training set size**: causes ``low train error`` and ``high CV error``
    - **Large training set size**: the train error increases with the training set size and the CV error continues to decrease without leveling off; the train error stays below the CV error, and the gap between them remains significant
    - If a learning algorithm is suffering from ``high variance, getting more training data is likely to help``
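
A learning curve of this kind can be sketched by training on the first m examples and recording both errors. The example below is an illustration under stated assumptions: it fits a straight line (not the quadratic mentioned above) to synthetic data whose true relation is quadratic, so it should behave like the high-bias case.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y ≈ a + b*x (needs at least two distinct x values)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def mse(model, xs, ys):
    """Mean squared error of the line (a, b) on (xs, ys)."""
    a, b = model
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve(x_train, y_train, x_cv, y_cv, sizes):
    """Return (m, train_error, cv_error) for a model trained on the first m examples."""
    rows = []
    for m in sizes:
        model = fit_line(x_train[:m], y_train[:m])
        rows.append((m, mse(model, x_train[:m], y_train[:m]), mse(model, x_cv, y_cv)))
    return rows


# Hypothetical data: the true relation is quadratic, so a straight line underfits.
rng = random.Random(1)
x_train = [rng.uniform(-3, 3) for _ in range(200)]
y_train = [x * x + rng.gauss(0, 0.5) for x in x_train]
x_cv = [rng.uniform(-3, 3) for _ in range(100)]
y_cv = [x * x + rng.gauss(0, 0.5) for x in x_cv]

curve = learning_curve(x_train, y_train, x_cv, y_cv, sizes=[2, 5, 20, 50, 200])
for m, train_err, cv_err in curve:
    print(f"m={m:>3} train={train_err:.3f} cv={cv_err:.3f}")
```

With m = 2 the line passes through both points, so the train error is essentially 0; with more data both errors settle at a high level, and adding further examples does not bring them down.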

![](figures/MLPG-OC-BiasVsVariance.png)
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; Image credit [ (Source) ](https://www.coursera.org/in)

**5) Diagnosing Neural Networks (NN)**
  - A NN with ``fewer parameters`` is ``prone to underfitting`` but computationally cheaper
  - A large NN with ``more parameters`` is ``prone to overfitting`` and computationally expensive
  - Using a single hidden layer is a good starting default
  - We can train NNs with different numbers of hidden layers, evaluate each on the CV set, and then select the one that performs best

**In SUMMARY - Deciding what to do next**
* Model complexity effects
  - Low model complexity results in  ``High Bias``, but ``Low Variance`` (lower-order polynomials)
  - High model complexity results in ``High Variance``, but ``Low Bias`` (higher-order polynomials)
* A **``rule of thumb``** when running diagnostics is:
  - Getting more training examples: ``Fixes High Variance``, but ``not High Bias``
  - Trying fewer features: ``Fixes High Variance``, but ``not High Bias``
  - Adding features: ``Fixes High Bias``, but ``not High Variance`` 
  - Adding polynomial features: ``Fixes High Bias``, but ``not High Variance`` 
  - Decreasing lambda: ``Fixes High Bias``
  - Increasing lambda: ``Fixes High Variance``
  - Smaller NN: prone to ``Underfitting``
  - Larger NN: prone to ``Overfitting``

**NOTE:**
* Adding features or adding polynomial features means increasing the complexity of the model
* Lower-order polynomials mean low model complexity
* Higher-order polynomials mean high model complexity
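
The rule of thumb above can be turned into a small lookup, sketched below. The two thresholds (`acceptable_error` and `max_gap`) are hypothetical knobs that would have to be chosen per problem; they are not part of the guidelines.

```python
REMEDIES = {
    "high bias": ["add features", "add polynomial features", "decrease lambda"],
    "high variance": ["get more training examples", "try fewer features", "increase lambda"],
    "good fit": [],
}


def diagnose(train_error, cv_error, acceptable_error, max_gap):
    """Label a model from its train/CV errors using two hypothetical thresholds."""
    if train_error > acceptable_error:
        return "high bias"       # the model cannot even fit the training set
    if cv_error - train_error > max_gap:
        return "high variance"   # fits the training data but does not generalize
    return "good fit"


for train_error, cv_error in [(0.30, 0.32), (0.02, 0.25), (0.04, 0.06)]:
    label = diagnose(train_error, cv_error, acceptable_error=0.10, max_gap=0.05)
    print(label, "->", REMEDIES[label])
```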

### 6.2.4. How to evaluate and select ML models in Supervised Learning?
* There are a variety of evaluation metrics to evaluate supervised ML models for,
  - Classification problems, and 
  - Regression problems
* Choose the right metric for selecting between models and/or for doing parameter tuning
* It's very important to choose evaluation methods that match the goal of the application
* Compute the selected evaluation metric for multiple different models
* Then select the model with the "best" value of the evaluation metric

**Basic Evaluation Metrics (for ``Classification``):**
* Different applications have different goals; Accuracy is widely used, but many other metrics are possible
* The metric Accuracy gives only a partial picture of a classifier's performance
* Dummy classifiers:
  - Dummy classifiers completely ignore the input data!
  - They provide a null metric (e.g., null accuracy) baseline
  - ``Dummy classifiers should not be used for real problems``
* **``Confusion Matrix``** for binary prediction task:

![](figures/MLPG-OC-ConfusionMatrix.png)
  - **``Rule of thumb``**: as part of model evaluation, always look at the confusion matrix for the classifier
  - This gives insight into the kinds of errors it makes for each class, including whether some classes are much more prone to certain kinds of errors than others

* ``Accuracy``: for what fraction of all instances is the classifier's prediction ``correct`` (for either +ve or -ve class)?<br> 
    $Accuracy = \frac{TN\ +\ TP}{TN\ +\ TP\ +\ FN\ +\ FP}$

* ``Classification error``: for what fraction of all instances is the classifier's prediction ``incorrect``?

* ``Precision``: what fraction of ``positive`` predictions are correct?<br>
$Precision = \frac{TP}{TP\ +\ FP}$

* ``Recall (aka TPR)``: what fraction of all ``positive`` instances does the classifier ``correctly`` identify as positive?<br>
$Recall\ [aka\ True\ Positive\ Rate\ (TPR)] = \frac{TP}{(TP\ +\ FN)}$  

* ``FPR``: what fraction of all ``negative`` instances does the classifier ``incorrectly`` identify as positive?<br>
$False\ Positive\  Rate\ (FPR) = \frac{FP}{FP\ +\ TN}$

* ``F1-Score``: combines Precision and Recall into a single number<br>
$F1_{Score} = 2 * \frac{Precision\ *\ Recall}{Precision\ +\ Recall}   =  \frac{2\ *\ TP}{2\ *\ TP\ +\ FN\ +\ FP}$

* There is often a trade-off between Precision and Recall
  - Recall-oriented ML tasks
    - Tumor detection
    - Search and information extraction in legal discovery
    - Often paired with a human expert to filter out false positives
  - Precision-oriented ML tasks
    - Search engine ranking, query suggestions
    - Document classification
    - Many customer-facing tasks (users remember failures!)
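
The formulas above translate directly into code. The sketch below computes them from confusion-matrix counts and compares a majority-class dummy with a trained classifier on an imbalanced dataset. All counts are invented for illustration, and returning 0.0 for an undefined ratio (0/0) is a convention chosen here.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics above from confusion-matrix counts (0.0 when undefined)."""
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "fpr": ratio(fp, fp + tn),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical imbalanced test set: 950 negatives, 50 positives.
# A majority-class dummy predicts "negative" for everything.
dummy = classification_metrics(tp=0, fp=0, tn=950, fn=50)
# A hypothetical trained classifier on the same data.
model = classification_metrics(tp=40, fp=30, tn=920, fn=10)

for name, scores in [("dummy", dummy), ("model", model)]:
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

The dummy reaches 0.95 accuracy while its recall and F1 are 0, which is why accuracy alone gives only a partial picture. The trained classifier scores 0.96 accuracy, 0.8 recall, and an F1 of about 0.667.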

**Predicted Probability of Class Membership (predict_proba):**
* Typical rule: choose the most likely class (e.g., Class 1 if the predicted probability > 0.50)
* Adjusting the threshold changes the predictions of the classifier
* A higher threshold results in a more conservative classifier (fewer instances are predicted as positive)
* Not all models provide realistic probability estimates
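
The effect of the threshold can be seen in a short sketch. The labels and probabilities below are invented; in practice the probabilities would come from a classifier's probability output (such as the `predict_proba` named in the heading above).

```python
def predict_with_threshold(probabilities, threshold=0.5):
    """Predict class 1 when the estimated P(class 1) exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


def precision_recall(actual, predicted):
    """Return (precision, recall) for 0/1 labels, with 1 as the positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.35, 0.40, 0.55, 0.60, 0.80, 0.20, 0.95]

for threshold in [0.3, 0.5, 0.7]:
    y_pred = predict_with_threshold(y_prob, threshold)
    precision, recall = precision_recall(y_true, y_pred)
    print(f"threshold={threshold} positives={sum(y_pred)} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 cuts the positive predictions from 6 to 2, lifts precision from 0.67 to 1.00, and drops recall from 1.00 to 0.50.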

![](figures/MLPG-OC-PRCurves.png) 
![](figures/MLPG-OC-ROCCurves.png)<br>
Image Credit [ (Source) ](https://www.coursera.org/in)

**Basic Evaluation Metrics (for ``Regression``):**
* Typically, _**``r2_score``**_ is enough
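  - Measures how well future instances are likely to be predicted (the proportion of the target's variance that the model explains)
  - The best possible score is **1.0**
  - A constant model that always predicts the mean of the target values scores **0.0**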
* Alternative metrics include:
  - _**``MAE``**_: ``Mean Absolute Error`` (absolute difference of target & predicted values)
  - _**``MSE``**_: ``Mean Squared Error`` (squared difference of target & predicted values)
  - ``Median Absolute Error`` (robust to outliers)
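
These regression metrics are simple enough to compute by hand. The sketch below uses only the standard library and made-up numbers; the function name is hypothetical.

```python
from statistics import mean, median


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, MSE and median absolute error for two equal-length lists."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e ** 2 for e in errors)
    y_mean = mean(y_true)
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,  # assumes y_true is not constant
        "mae": mean(abs(e) for e in errors),
        "mse": ss_res / len(errors),
        "median_ae": median(abs(e) for e in errors),
    }


# Hypothetical targets and predictions; the last prediction is a large miss.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.5, 7.0, 9.5, 21.0]
scores = regression_metrics(y_true, y_pred)
print(scores)

# Always predicting the mean of the targets gives R^2 = 0.0.
baseline_r2 = regression_metrics(y_true, [7.0] * 5)["r2"]
print(baseline_r2)
```

The single large miss pushes MSE to 20.15 and R² below zero, while the median absolute error stays at 0.5, which illustrates why it is described as robust to outliers.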

**Model Selection: Optimizing Classifiers for different evaluation metrics:**
* Training, Validation, and Test framework for model evaluation and selection
  - Using only CV or test set to do model selection may lead to more subtle overfitting
  - Instead, use 3 data splits:
    - Training set (for model building)
    - Validation set (for initial model selection)
    - Test set (for final model evaluation & selection)
  - In practice:
    - ``Create an initial train/test split``
    - ``Do CV on the training set for initial model/parameter selection``
    - ``Save the held-out test set for final model evaluation & selection``

* ``Accuracy`` is often not the right evaluation metric for many real-world ML tasks
  - False positives and False negatives may need to be treated very differently
  - Make sure we understand the needs of our application and choose an evaluation metric that matches our application, user, and business goals

* Additional evaluation methods include (_examples_):
  - ``Learning curve``
    - How much does accuracy/other metrics change as a function of the amount of training data?
  - ``Sensitivity analysis``
    - How much does accuracy/other metrics change as a function of key learning parameter values?
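
The "in practice" workflow described earlier in this list (one initial train/test split, then cross-validation on the training portion) could be organized as in the sketch below. It only produces index lists; the function names, the 80/20 split, and k = 5 are assumptions for illustration.

```python
import random


def holdout_split(n_examples, test_frac=0.2, seed=0):
    """Return (train_indices, test_indices) after one shuffled split."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_examples * test_frac)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (fit_indices, validation_indices) pairs; each index validates exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


train_idx, test_idx = holdout_split(100)
for fold_number, (fit_idx, val_idx) in enumerate(k_fold(train_idx, k=5), start=1):
    # Fit each candidate model on fit_idx and score it on val_idx here.
    print(f"fold {fold_number}: fit on {len(fit_idx)}, validate on {len(val_idx)}")
# test_idx is used only once, after model/parameter selection is finished.
```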
