<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/MLPG-Book-Cover-Small.png"><br>

This notebook contains an excerpt from the **`Machine Learning Project Guidelines - For Beginners`** whitepaper written by *Balasubramanian Chandran*; the content is available [on GitHub](https://github.com/BalaChandranGH/Whitepapers/ML-Project-Guidelines).

<br>
<!--NAVIGATION-->

<[ [Other Considerations - Metrics-and-Error-Analysis](06.02-mlpgw-Other-Considerations-Metrics-and-Error-Analysis.ipynb) | [Contents and Acronyms](00.00-mlpgw-Contents-and-Acronyms.ipynb) | [Summary](07.00-mlpgw-Summary.ipynb) ]>

# 6. Other Considerations

##  6.3. Model Performance Improvement

### 6.3.1. L1 and L2 regularization
* One of the most common problems in machine learning is **overfitting**, i.e., the model performs well on the training dataset but poorly on the test dataset
* Regularization is a technique that slightly modifies the learning algorithm so that the model generalizes better, which improves its performance on unseen data
* In classical ML, regularization penalizes the model coefficients; in deep learning, it penalizes the weight matrices of the nodes. Either way, the result is a simpler model that fits the training dataset slightly less closely

![](figures/MLPG-OC-UFJROFCurves.png) ![](figures/MLPG-OC-TrainingVsTestSetErrors.png)<br>
As the complexity of the model increases, the training error keeps decreasing, but beyond a certain point the testing error stops decreasing and starts to rise.

**Regularization techniques:**
* **5 commonly used types**:
```
  1) L1 Regularization (LASSO)        - used in both ML and DL 
  2) L2 Regularization (Ridge)        - used in both ML and DL
  3) Dropout                          - used in DL
  4) Data Augmentation/transformation - used in DL
  5) Early stopping                   - used in DL
```

* **L1 Regularization (LASSO):**
  - One of the most common types of regularization
  - Updates the general cost function by adding a ``regularization term`` proportional to the sum of the absolute values of the coefficients
  - ``L1 produces sparse weights``: the coefficients of the less useful features are driven to exactly zero, which decreases the number of features used in a high-dimensional dataset and in turn reduces the complexity of the model (i.e., reduces overfitting by making the model generalize better)
  - ``L1 tends to shrink coefficients to zero``
  - L1 is useful for feature selection, as we can drop any variables with coefficients that go to zero

* **L2 Regularization (Ridge):**
  - One of the most common types of regularization
  - Updates the general cost function by adding a ``regularization term`` proportional to the sum of the squared coefficients
  - L2 regularization is also known as _``weight decay``_ as it forces the weights to ``decay towards zero (but not exactly zero)``
  - L2 tends to shrink coefficients evenly
  - L2 is useful when we have collinear/codependent features
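
The different shrinkage behavior of L1 and L2 can be seen in a deliberately simplified, one-coefficient setting. The sketch below assumes each coefficient can be penalized independently of the others, which real LASSO and Ridge solvers do not assume; the function names, the starting coefficients, and the penalty strength `lam` are all hypothetical.

```python
# One-coefficient view of L1 vs. L2 penalties (an illustrative simplification,
# not a full LASSO/Ridge solver).

def l1_shrink(w_ols, lam):
    """Minimizer of 0.5*(w - w_ols)**2 + lam*abs(w): soft-thresholding."""
    magnitude = max(abs(w_ols) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0  # the coefficient is removed entirely
    return magnitude if w_ols > 0 else -magnitude


def l2_shrink(w_ols, lam):
    """Minimizer of 0.5*(w - w_ols)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return w_ols / (1.0 + lam)


unregularized = [3.0, -0.4, 0.2, -2.0]  # hypothetical unpenalized coefficients
lam = 0.5                               # hypothetical penalty strength

l1_coefs = [l1_shrink(w, lam) for w in unregularized]
l2_coefs = [l2_shrink(w, lam) for w in unregularized]

print("L1:", l1_coefs)  # small coefficients become exactly zero
print("L2:", l2_coefs)  # every coefficient shrinks but stays non-zero
```

Under these assumptions, the two small coefficients are zeroed by L1 (so those features could be dropped), while L2 scales every coefficient by the same factor and keeps all of them.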

* **Dropout:**
  - One of the most frequently used regularization techniques in DL (Artificial Neural Networks)
  - At every training iteration, it randomly selects some nodes and removes them along with all of their incoming and outgoing connections, so each iteration has a different set of nodes, and this results in a different set of outputs
  - It can also be thought of as an ensemble technique
  - Ensemble models usually perform better than a single model because they capture more randomness; similarly, a network trained with dropout usually generalizes better than the same network trained without it
  - Dropout is usually preferred when we have a large NN structure to introduce more randomness
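
The random removal of nodes can be sketched on a single list of layer activations. This is a minimal illustration rather than a framework implementation: the `dropout` helper and the activation values are hypothetical, and the rescaling of surviving activations ("inverted dropout", a common variant) is an assumption that is not described in the text above.

```python
import random


def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob and rescale the survivors."""
    keep_prob = 1.0 - drop_prob
    mask = [1 if rng.random() < keep_prob else 0 for _ in activations]
    # Assumed "inverted dropout" rescaling keeps the expected activation unchanged.
    dropped = [a / keep_prob if m else 0.0 for a, m in zip(activations, mask)]
    return dropped, mask


rng = random.Random(0)  # fixed seed so the run is repeatable
layer_output = [0.5, 1.2, -0.7, 2.0, 0.1, -1.5]  # hypothetical activations

# Each training iteration draws a fresh mask, i.e., a different "thinned" network.
for step in range(3):
    dropped, mask = dropout(layer_output, drop_prob=0.5, rng=rng)
    print(step, mask, dropped)
```

At prediction time, dropout is normally switched off so that all nodes contribute.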

* **Data Augmentation:**
  - The simplest way to reduce overfitting is to increase the size of the training data
  - Collecting more labeled data is often too costly, but when the dataset consists of images, additional training examples can be generated cheaply through data transformations such as rotating, flipping, scaling, etc.

* **Early stopping:**
  - A kind of cross-validation strategy where we keep one part of the training set as the validation set
  - When we see that the performance on the validation set is getting worse, we immediately stop the training on the model
![](figures/MLPG-OC-EarlyStopping.png)
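
The stopping rule can be sketched over a list of per-epoch validation losses. The function name, the loss values, and the `patience` parameter (how many non-improving epochs to tolerate) are assumptions for illustration; `patience=1` corresponds to stopping as soon as the validation loss gets worse.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation loss improved: remember this epoch and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # stop training here
    return best_epoch, len(val_losses) - 1  # never triggered: ran all epochs


# Hypothetical validation losses: improving at first, then getting worse.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=1)
print(best_epoch, stop_epoch)
```

In practice, the model weights from `best_epoch` are usually the ones kept, and a patience larger than 1 avoids stopping on a single noisy epoch.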

### 6.3.2. Overfitting, Underfitting, and how to limit Overfitting
**Generalization:**
* It refers to how well the concepts learned by an ML model are applied to new data, i.e., to specific examples not seen by the model when it was learning
* The goal of an ML model is to generalize well from the training data to any data from a problem domain
* Overfitting and underfitting are the two biggest causes of poor performance in ML algorithms

**Overfitting:**
* A model that performs well on the training data but generalizes poorly to other data
* It happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize
* Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. For example, decision trees are a nonparametric ML algorithm that is very flexible and is subject to overfitting training data ``(This problem can be addressed by pruning the tree after it has been trained, in order to remove some of the detail it has picked up.)``

**Underfitting:**
* A model that performs poorly on the training data and also generalizes poorly to other data
* It refers to a model that can neither model the training data well nor generalize to new data
* An underfit ML model is not a suitable model and will be ``obvious`` as it will have poor performance on the training data
* Underfitting is often not discussed as it is easy to detect given a good performance metric. The remedy is to move on and try alternate ML algorithms

**Good-fit (or) Right-fit:**
* Ideally, we want to select a model at the sweet spot between underfitting and overfitting
* This is the goal but is very difficult to do in practice
* To understand this goal, we can look at the performance of an ML algorithm over time as it learns from the training data, and plot both the skill on the training data and the skill on a test dataset
* Over time, as the algorithm learns, the error for the model on the training data goes down and so does the error on the test dataset
* If we train for too long, the error on the training dataset may continue to decrease because the model is overfitting and learning the irrelevant detail and noise in the training dataset; at the same time, the error on the test set starts to rise again as the model’s ability to generalize decreases
* The sweet spot is the point just before the error on the test dataset starts to increase where the model has good skill on both the training dataset and the unseen test dataset

**How to limit overfitting:**
* Both overfitting and underfitting can lead to poor model performance
* Overfitting is more likely when the dataset contains outliers, noise, random fluctuations, skews, etc.
* There are 2 main options to address the issue of overfitting:
  - Data pre-processing
    - Data cleaning steps help to remove noise, outliers, random fluctuations, etc.
    - Feature selection and feature engineering steps help to retain only the most relevant features
    - Data transformation steps help to remove skews
  - Regularization 
    - Keeps all the features, but reduces the magnitude of the parameters
    - Works well when we have a lot of slightly useful features
* There are 2 important techniques that we can use when evaluating ML algorithms to limit overfitting:
  - Use a resampling technique to estimate model accuracy (``k-fold cross-validation``)
    - The most popular resampling technique is k-fold cross-validation, which allows us to train and test our model k times on different subsets of the training data and build up an estimate of the performance of an ML model on unseen data
  - Hold back a validation dataset (``validation dataset; not to be confused with test dataset``)
    - A validation dataset is simply a subset of our training data that we hold back from our ML algorithms until the very end of the project
    - After we have selected and tuned our ML algorithms on our training dataset we can evaluate the learned models on the validation dataset to get a final objective idea of how the models might perform on unseen data
* Using cross-validation is a gold standard in applied ML for estimating model accuracy on unseen data
* If we have the data, using a validation dataset is also an excellent practice
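
The k-fold procedure can be sketched without any ML library. Everything named here (`k_fold_indices`, `cross_validate`, the toy mean-predicting model, and the data) is hypothetical and only meant to show the mechanics; real projects would normally shuffle the data first and rely on a library implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cross_validate(xs, ys, k, fit, score):
    """Train and score a model k times; return the mean score and all fold scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return sum(scores) / len(scores), scores


# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def mean_abs_error(model, val_x, val_y):
    return sum(abs(y - model) for y in val_y) / len(val_y)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
mean_score, fold_scores = cross_validate(xs, ys, k=2, fit=fit_mean, score=mean_abs_error)
print(mean_score, fold_scores)
```

Each sample is used for validation exactly once, and the model that is scored on a fold never sees that fold during fitting.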

### 6.3.3. Data leakage: How to detect and prevent it?
**Data leakage:**
* When the data we are using to train contains information about what we are trying to predict
* Introducing information about the target during training that would not be available during actual use
* ``Examples:``
  - _``Including test data along with the training data``_
  - _``Including the label to be predicted as a feature``_
* If the model performance is too good to be true, it is likely due to "giveaway" features

**Examples of Data leakages:**
* ``Leakage in training data:``
  - Performing data preprocessing using parameters or results from analyzing the entire dataset: Normalizing and rescaling, detecting and removing outliers, estimating missing values, feature selection, etc.
  - Time-series datasets: Using records from the future when computing features for the current prediction
  - Errors in data values/gathering or missing variable indicators (e.g., the special value 999) can encode information about missing data that reveals information about the future
* ``Leakage in features:``
  - Removing variables that are not legitimate without also removing variables that encode the same or related information (e.g., diagnosis info may still exist in a patient ID)
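
The time-series case is easy to reproduce on a toy example. The records, the prediction time, and the helper names below are hypothetical; the sketch only contrasts a feature computed over all records with the same feature restricted to records available at prediction time.

```python
from datetime import date

# Hypothetical (timestamp, value) records for one customer.
records = [
    (date(2023, 1, 5), 10.0),
    (date(2023, 2, 1), 20.0),
    (date(2023, 3, 10), 60.0),
    (date(2023, 4, 2), 110.0),
]


def mean_value(rows):
    return sum(value for _, value in rows) / len(rows)


def mean_up_to(rows, cutoff):
    """Feature that uses only records gathered at or before the prediction time."""
    available = [(t, v) for t, v in rows if t <= cutoff]
    return mean_value(available) if available else None


prediction_time = date(2023, 3, 1)

leaky_feature = mean_value(records)                  # also averages future records
safe_feature = mean_up_to(records, prediction_time)  # current and earlier records only

print(leaky_feature, safe_feature)
```

The leaky version is inflated by the March and April records, which would not exist yet when a prediction is made on 1 March.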

**Detecting Data leakage:**
* Before building the model:
  - Perform exploratory data analysis (EDA) to find surprises in the data
  - Are there features very highly correlated with the target value?
* After building the model:
  - Look for surprising feature behavior in the fitted/trained model
  - Are there features with very high weights, or high information gain?
  - Simple rule-based models like decision trees (DTs) can help expose leakage through features such as account numbers or patient IDs
  - Is overall model performance surprisingly good compared to known results on the same dataset, or for similar problems on similar datasets?
* After real-world deployment of the trained model:
  - Is the trained model generalizing well to new data?
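
One of the pre-modeling checks, looking for features that are very highly correlated with the target, can be sketched as follows. The feature names, the data, and the 0.95 threshold are hypothetical, and a high correlation is only a prompt to investigate a feature, not proof of leakage.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (norm_x * norm_y)


def suspicious_features(features, target, threshold=0.95):
    """Return {feature_name: correlation} for features with |r| >= threshold."""
    flagged = {}
    for name, values in features.items():
        r = pearson(values, target)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged


target = [0, 1, 0, 1, 1, 0]
features = {
    "age": [34, 29, 41, 38, 52, 45],
    "refund_issued": [0, 1, 0, 1, 1, 0],  # hypothetical giveaway: mirrors the target
}

flagged = suspicious_features(features, target)
print(flagged)
```

Only `refund_issued` is flagged here; whether such a feature is a genuine leak depends on whether its value would be known at prediction time.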

**Preventing/minimizing Data leakage:**
* Fit data preparation steps on the training dataset only (not on the entire dataset), then apply them to the test dataset:
  - Feature selection, transformation, scaling, normalization, etc. must learn their parameters from the training dataset alone
  - Care must be taken when handling missing numerical values, especially when replacing them with a summary statistic such as the mean, median, or mode, or when using forward-fill/back-fill. For example, if we use the ``mean``,
    - We should compute the ``mean`` value on the training set and use it to fill the missing values in the training set
    - Save the mean value that we have computed
    - Later, replace missing values in the test set with the saved ``mean`` value to evaluate the system
  - The parameters used to prepare the test dataset MUST be the same as the ones learned from the training dataset
* Perform data preparation within each cross-validation fold separately (not once on the whole training dataset):
  - Within each fold, fit the transformation/scaling/normalization, etc. on that fold's training portion only
  - The held-out portion of each fold MUST be prepared with the parameters learned from that fold's training portion
* With time-series data, use timestamp cutoff:
  - The cutoff value is set to the specific time point where prediction is to occur using current and previous/old records
  - Using a cutoff time will make sure you aren't accessing any data records that were gathered after the prediction time, i.e., in the future
* Before any work with a new dataset, split off a final test dataset
  - ... if you have enough data
  - Use this final test dataset as the very last step in your validation
  - Helps to check the true generalization performance of any trained models
* In a nutshell:
  - Fit any feature scaling or transformation technique (such as normalization or standardization etc.) on the training dataset only, and then apply the fitted transformation to the testing dataset, to prevent DATA LEAKAGE. In other words, _``DO NOT apply`` **``data transformation techniques``** ``before splitting the datasets into training & testing datasets``_
  - Tune hyperparameters using the training dataset only (e.g., with cross-validation or a held-back validation set), never the testing dataset, to prevent DATA LEAKAGE. In other words, _``DO NOT apply`` **``algorithm fine-tuning techniques``** ``before splitting the datasets into training & testing datasets``_
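
The mean-imputation steps described above (compute on the training set, save, reuse on the test set) can be sketched for a single numeric column. The helper names and the data are hypothetical, missing values are represented as `None`, and standardization is added only as a second example of a parameter learned from the training data.

```python
def fit_imputer_scaler(train_column):
    """Learn the mean and standard deviation from the TRAINING column only."""
    observed = [v for v in train_column if v is not None]
    mean = sum(observed) / len(observed)
    variance = sum((v - mean) ** 2 for v in observed) / len(observed)
    std = variance ** 0.5
    return {"mean": mean, "std": std if std > 0 else 1.0}


def transform(column, params):
    """Fill missing values with the saved mean, then standardize with saved parameters."""
    filled = [params["mean"] if v is None else v for v in column]
    return [(v - params["mean"]) / params["std"] for v in filled]


# Hypothetical data, already split BEFORE any preprocessing.
train = [2.0, None, 4.0, 6.0]
test = [None, 8.0]

params = fit_imputer_scaler(train)          # statistics come from train only
train_prepared = transform(train, params)
test_prepared = transform(test, params)     # test reuses the saved train statistics

print(params)
print(train_prepared)
print(test_prepared)
```

The test column never influences `params`, so its missing value is filled with the training mean rather than with a statistic that includes test data.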
