<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/MLPG-Book-Cover-Small.png"><br>

This notebook contains an excerpt from the **`Machine Learning Project Guidelines - For Beginners`** whitepaper written by *Balasubramanian Chandran*; the content is available [on GitHub](https://github.com/BalaChandranGH/Whitepapers/ML-Project-Guidelines).

<br>

# 5. ML Project - Stages in detail with Guidelines

# 5.1. Stage-1: Business Understanding
During this stage, the ML problem to be solved is clearly understood and documented. It is a good idea to state the problem in both forms:
* Problem statement in business terms (Business problem)
* Problem statement in analytical terms (Converted analytics problem)

# 5.2. Stage-2: Data Understanding
It's assumed that the following are already made available by the DS team. EDA is the main focus of the ML team.
* Data requirements definition identifying the following:
  - Data sources (and they could be)
      - In-house or external source
      - APIs, XML feeds, CSVs, Excel files
      - Data mining/scraping from online sources
  - Data pipelines (and they could be)
      - Streaming vs Batch
      - Ingestion frequency
  - Data environments
      - Small vs Medium vs Big data
* Collected raw datasets (and they should be)
  - Diverse
  - Unbiased
  - Abundant

## 5.2.1. Exploratory Data Analysis (EDA)
### 5.2.1.1. Objectives of EDA
* To get an overview of the distribution of the dataset
* Check for missing numerical values, outliers, or other anomalies in the dataset
* Discover patterns and relationships between variables in the dataset
* Check the underlying assumptions in the dataset

## 5.2.2. Summary of EDA Techniques
EDA techniques depend on the type of data and the objectives of the analysis. The following is a summary of useful EDA techniques:
```
Type of data                  EDA techniques
---------------------------   ------------------------
- Categorical                 - Descriptive statistics
- Univariate discrete         - Barplot
- Univariate continuous       - Line plot, Histogram
- Bivariate continuous        - 2-D scatter plot
- 2-D arrays                  - Heatmap
- Multivariate distribution   - 3-D scatter plot
- Multiple groups             - Boxplot
```

The following table summarizes the useful EDA techniques depending on the objective:
```
Objective                                        EDA techniques
----------------------------------------------   ------------------------------------------------
- Check the distribution of a variable           - Histogram
- Find outliers                                  - Histogram, scatterplot, box-and-whisker plot
- Quantify the relationship between variables    - 2-D scatter plot, covariance, and correlation
- Visualize the relationship between variables   - Heatmap
- Visualize high-dimensional data                - Principal component analysis, 3-D scatter plot
```

## 5.2.3. Text EDA: Understand the data with Descriptive Statistics
* Dimensions of the dataset
* An initial look at the raw data (e.g., first 10 rows & last 10 rows)
* Basic information of the dataset
* Statistical summary of the dataset
* Class distribution of the dataset
* Explore NA / NULL values in the dataset
* Explore duplicates in the dataset
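
The checks listed above can be sketched with pandas as follows. This is a minimal illustration on a small made-up dataset; the column names (`age`, `income`, `label`) are hypothetical and stand in for the project's own raw data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset standing in for the project's raw data
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 32, 51],
    "income": [40_000, 52_000, 88_000, 61_000, 52_000, 97_000],
    "label": ["no", "no", "yes", "no", "no", "yes"],
})

print(df.shape)        # dimensions of the dataset (rows, columns)
print(df.head(10))     # first rows of the raw data
print(df.tail(10))     # last rows of the raw data
df.info()              # basic information: dtypes and non-null counts
print(df.describe())   # statistical summary of the numeric columns

class_counts = df["label"].value_counts()   # class distribution
missing_per_column = df.isna().sum()        # NA / NULL values per column
n_duplicates = int(df.duplicated().sum())   # number of fully duplicated rows
print(class_counts, missing_per_column, n_duplicates, sep="\n")
```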

## 5.2.4. Visual EDA: Understand the data with Visualizations
* Draw Univariate plots to better understand each attribute
  - Box and Whisker plots
  - Histograms
  - Pie-charts or Bar-charts (horizontal or vertical) to understand the distributions of data in int or float data-type
* Draw Multivariate plots to better understand the relationships between attributes
  - Scatter Plot Matrix
  - Correlation maps

# 5.3. Stage-3: Research
Based on the type of problem (regression, classification, clustering, etc.), research the available algorithms and list here the ML algorithms that will be used to build the models. The final model will be selected based on performance.
* List down the names of algorithms, classifiers, and types of algorithms (linear, non-linear, ensemble, etc.) 
* Identify the Evaluation metrics selected for the project with the reasons
* An overall approach on how this project will be developed (similar to development methodology in traditional projects)

## 5.3.1. List of model evaluation metrics
Performance/Evaluation Metric: 
* An evaluation metric is a way to quantify the performance of a predictive model
* There is no "one fits all" evaluation metric
* Get to know your data
* Keep in mind the business objective of your ML problem

Select one or more metrics based on the problem type and business priorities. Commonly used metrics are:
![](figures/MLPG-ModelEvalMetrics.png)

**NOTE:**
* CV is a cross-validation score and, for regression, the scorer can be anything such as ``R^2, MAE, MSE, RMSE, and RMSLE``
* CV is a cross-validation score and, for classification, the scorer can be anything such as ``Accuracy, ROC-AUC, PR-AUC, Logloss``, etc.
* The following are not metrics, but they help to gain insight into the type of errors a model is making
  - ``Confusion matrix``
  - ``Classification report (produces Precision, Recall, F1 scores)``
* Algorithm ``run-time is also a metric``
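
As a rough sketch of the points in the note above, the snippet below computes cross-validation scores with two different scorers and then produces a confusion matrix and a classification report. The synthetic dataset and the choice of `LogisticRegression` are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)

# Cross-validation scores on the training data, using two different scorers
cv_accuracy = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
cv_roc_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print(cv_accuracy.mean(), cv_roc_auc.mean())

# Error analysis on the held-out test set
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```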

# 5.4. Stage-4: Data Preprocessing
* The success of the ML algorithms depends on the quality of the data and the data must be free from errors and discrepancies
* It must adhere to a specific standard so that ML algorithms can accept them, but this does not happen in reality
* In reality, the data is dirty, incomplete, noisy, and inconsistent
* Incomplete data means it has missing values and lacks certain attributes
* The data may be noisy as it contains errors and outliers and hence does not produce desired results
* The data may be inconsistent as it contains discrepancies in data or duplicate data
* ML practitioners take steps to transform the collected raw data and process them to meet the input requirements of model training and testing that are suitable for ML algorithms
* It involves several steps for cleaning, transforming, normalizing, and standardizing data to remove all the inadequacies and irregularities in the data
* These steps are collectively known as _**Data Preprocessing**_ (or) _**Data Wrangling**_ (or) _**Data Preparation**_

## 5.4.1. Data Preparation tasks
* **Data Cleaning**: Identifying and correcting mistakes or errors in the data
* **Feature Selection**: Identifying those input variables that are most relevant to the task
* **Feature Engineering**: Deriving new variables from available data
* **Dimensionality Reduction**: Creating compact projections of the data
* **Split datasets for train-test**: Separating datasets into input (X) and output (y) components, and into training and test subsets
* **Data Transforms**: Changing the scale or distribution of variables
* **Handling Imbalanced Classes**: Imbalanced classes arise when one set of classes (majority class) dominates over another (minority class)

### 5.4.1.1. Data Cleaning
* _``The process of identifying, correcting mistakes/ errors/ incomplete/ inconsistent/ noisy data, and preparing the dataset for analysis``_
* The real-world dataset never comes clean; it comes in a wide variety of shapes and formats

**Tidy data**
* Tidy data provides a standard way to organize data values within a dataset
* There are three principles of tidy data and they are:
  - Columns represent separate variables
  - Rows represent individual observations
  - Observational units form tables
* Tidy data makes it easier to fix common data problems, so, we need to transform the untidy dataset into tidy data

**Signs of Untidy dataset**
* _``Missing numerical data``_: Either delete them or replace them with a suitable summary statistic
* _``Unexpected data values``_: Resolve mismatches between a column's data type and its values
* _``Inconsistent column names``_: Fix column names that contain inconsistent capitalization or bad characters
* _``Outliers``_: Remove them or replace them with a suitable summary statistic, as they can skew the results
* _``Duplicate rows and columns``_: Drop them as they can cause bias in the analysis

#### 5.4.1.1.1. Basic data cleaning activities
* Remove unwanted/duplicate columns/features/attributes
* Remove unwanted/duplicate rows/samples/examples
* Remove embedded characters that may cause data misalignment 
  - E.g., embedded tabs in a tab-separated data file, embedded new lines that may break records, etc.
* Identify inconsistent values and bring them to a common standard of expression (e.g., N.Y. or NY into New York)

#### 5.4.1.1.2. Outliers detection
* Outliers are extreme values that will skew the results if not addressed properly
* There are many ways to identify outliers. Following are some of them:
  - Box and Whisker plots
  - Calculate Z-Score using the formula given below to identify the outliers (data points that fall outside a threshold)
    ```
    Z-Score = (x - Mean) / SD
    ```
* A couple of ways to handle outliers are:
  - Drop them
  - Replace with a suitable summary statistic such as mean/median/mode

_**IMPORTANT NOTE:**_
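* _Perform outlier handling (discarding the samples or replacing the outlier values with the 'mean', 'median', or 'mode' value) after the train-test split: derive the thresholds and replacement values from the training dataset and reuse them for the test dataset, rather than computing them on the combined data_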
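
A minimal sketch of Z-score based outlier detection is shown below. It assumes a cut-off of 3 standard deviations (a common but arbitrary choice) and estimates the mean and SD on the training data only; the data and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Training values around 50, plus one injected outlier at the end
train = np.append(rng.normal(loc=50.0, scale=5.0, size=200), [120.0])
test = np.array([48.0, 53.0, 150.0])

# Mean and SD are estimated on the training data only
mean, sd = train.mean(), train.std()
threshold = 3.0  # assumed cut-off, not prescribed by the text


def z_scores(x):
    """Z-Score = (x - Mean) / SD, using the training mean and SD."""
    return (x - mean) / sd


train_outliers = np.abs(z_scores(train)) > threshold
test_outliers = np.abs(z_scores(test)) > threshold

# One way to handle flagged values: replace them with the training median
train_median = np.median(train)
train_clean = np.where(train_outliers, train_median, train)
test_clean = np.where(test_outliers, train_median, test)
print(train_outliers.sum(), test_outliers)
```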

#### 5.4.1.1.3. Handling Missing Numerical values
* Care must be taken while dealing with missing numerical values
* Need to first identify the reason for the missing numerical values
* There are several methods to handle missing values and each method has its advantages and disadvantages
* The choice of the method is subjective and depends on the nature of the data and the missing values
* ``Mark/Encode`` missing numerical values with ``NaN``:
  - Missing values are encoded in different ways and they can appear as ``NaN, NA, ?, 0, ‘xx’, -1, or " " (blank space)``
  - pandas represents missing values as ``NaN`` (and also treats ``None`` as missing)
  - So, we must first convert placeholder entries such as ``NA, ?, xx, or " "`` to ``NaN``, and convert ``0`` or ``-1`` only in columns where they are known to be placeholders rather than real values
* ``Drop`` them using ``dropna()`` method
  - This is the easiest method to handle missing values. In this method, we drop the rows or columns that contain missing values
* ``Replace`` with a suitable summary statistic (such as mean/median/mode) or with a neighbouring value (forward-fill/back-fill)
* ``Impute``
  - In this method, we replace the missing value with the mean value of the entire feature column

_**NOTE**_: _If we choose to fill the missing values with a summary statistic such as the ``mean``, we should compute the ``mean`` on the training set only, use it to fill the missing values in the training set, and save it. Later, we fill the missing values in the test set (and in any new data) with that same saved value. This is to avoid DATA LEAKAGE._
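
The marking and imputing steps can be sketched as follows. The example assumes a hypothetical dataset in which `"?"` and `-1` are placeholder codes for missing values; `SimpleImputer` is fitted on the training set and then reused on the test set, in line with the note above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data where "?" and -1 are placeholder codes for "missing"
train = pd.DataFrame({"age": [25, -1, 40, 35], "bmi": ["22.5", "?", "30.1", "27.4"]})
test = pd.DataFrame({"age": [-1, 50], "bmi": ["24.0", "?"]})


def mark_missing(df):
    """Convert the placeholder codes to NaN so pandas recognizes them."""
    out = df.copy()
    out["bmi"] = pd.to_numeric(out["bmi"], errors="coerce")  # "?" becomes NaN
    out["age"] = out["age"].mask(out["age"] == -1)            # -1 becomes NaN
    return out


train, test = mark_missing(train), mark_missing(test)

# Learn the column means on the training set only, then reuse them
imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train)
test_filled = imputer.transform(test)
print(imputer.statistics_)  # saved training means for age and bmi
print(test_filled)
```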

#### 5.4.1.1.4. Handling Missing Categorical values
* Care must be taken while dealing with missing categorical values
* Need to first identify the reason for the missing categorical values
* In addition to the numerical values, the real-world datasets contain categorical data as well
* Most ML algorithms require their input data to be in numerical format in order to work
* The categorical data must be converted into numbers before they are fed into an algorithm
* Some of the methods to handle missing values of categorical variables are:
  - ``Drop``, i.e., ignore the variable if it is not significant
  - ``Replace`` it with the most frequent value (mode)
  - ``Treat the missing data as just another category``

### 5.4.1.2. Feature Selection
* Feature selection (FS) refers to techniques for selecting a subset of input features that are most relevant to the target variable that is being predicted
* FS is primarily focused on removing non-informative or redundant predictors from the model
* FS is about identifying those input variables that are most relevant to the task
* FS removes columns with duplicate data/empty/unwanted data
* FS techniques are generally grouped into those that use the target variable (``supervised``) and those that do not (``unsupervised``)

#### 5.4.1.2.1. The Goals of Feature Selection Techniques
* To reduce the computational cost of modeling
* To improve the performance of the model

#### 5.4.1.2.2. Overview of Feature Selection Techniques
* **Feature Selection**: Select a subset of input features from the dataset
  - **Unsupervised**: Do not use the target variable (e.g. remove redundant variables)
  - **Supervised**: Use the target variable (e.g. remove irrelevant variables)
  - Some models are naturally resistant to non-informative predictors. Tree- and rule-based models, MARS and the Lasso, for example, intrinsically conduct FS
* **Dimensionality Reduction**: 
  - Project input data into a lower-dimensional feature space
  - Dimensionality reduction (eg., PCA) is an alternative to FS rather than a type of feature selection

_**NOTE**_: _Just like there is no best set of input variables or best ML algorithm, there is no best feature selection method. One must discover what works best for a specific problem._

#### 5.4.1.2.3. Feature Importance
**Feature importance** (FI) refers to techniques that assign a score to input features based on how useful they are at predicting a target variable.

**The Uses of Feature Importance:**
* Better understanding the data – FI scores can provide insight into the dataset
* Better understanding a model – FI scores can provide insight into the model
* Reducing the number of input features – FI scores can be used to improve a predictive model
* There are many ways to calculate FI scores, and many models can be used for this purpose. Commonly used are:
  - Feature importance from model coefficients
  - Feature importance from decision trees
  - Feature importance from permutation testing

#### 5.4.1.2.4. Mutual Information (MI)
* MI is a lot like correlation in that it measures a relationship between two quantities
* MI describes relationships in terms of _``uncertainty``_
* The MI between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other
* The benefit of MI is that it can detect _``any kind of relationship``_, while correlation only detects _``linear relationships``_
* MI is a great general-purpose metric and especially useful at the start of feature development when we might not know what model we’d like to use yet
* ``Advantages:``
  - Easy to use and interpret
  - Computationally efficient
  - Theoretically well-founded
  - Resistant to overfitting
  - Able to detect any kind of relationship

_**NOTE**_: _The uncertainty is measured using a quantity from information theory known as "entropy". The entropy of a variable means roughly: "how many yes-or-no questions you would need to describe an occurrence of that variable, on average." The more questions you have to ask, the more uncertain you must be about the variable. MI is how many questions you expect the feature to answer about the target._
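
To illustrate the difference between MI and correlation, the sketch below builds a made-up target that depends on a feature only through its square. The feature names are hypothetical; `mutual_info_regression` is scikit-learn's MI estimator for continuous targets.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 1000
x_quad = rng.uniform(-1, 1, n)    # related to the target, but not linearly
x_noise = rng.uniform(-1, 1, n)   # unrelated to the target
y = x_quad ** 2 + rng.normal(scale=0.05, size=n)

X = np.column_stack([x_quad, x_noise])

# MI scores (one per feature) versus Pearson correlation with the target
mi = mutual_info_regression(X, y, random_state=0)
corr = [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]

print("MI:  ", mi)
print("corr:", corr)
```

In this setup the correlation of `x_quad` with the target is close to zero even though the target is almost fully determined by it, while its MI score is clearly higher than that of the unrelated feature.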

### 5.4.1.3. Feature Engineering
    “Better features make better data and better data make better models”.

#### 5.4.1.3.1. The Goal and a Guiding principle
* The goal of Feature Engineering is to make data better suited to the problem to be solved
* Establish a baseline (score) by training the model on the un-augmented dataset. This will help us determine whether the new features are useful/worth keeping or discard them and try something else

#### 5.4.1.3.2. Creating features
* Tips on discovering new features:
  - Understand the existing features in the dataset by referring to the “data description document”
  - Research the problem domain to acquire **domain knowledge**
  - Study previous work. Solution write-ups are a great resource
  - Use data visualization
* Combine/aggregate existing features to create a new one (e.g., Total or Count or Mean or Ratio)
* Reorder columns, if needed

#### 5.4.1.3.3. Tips on creating new features
Keep in mind the model's strengths and weaknesses when creating features. Here are some guidelines: 
* Linear models learn sums and differences naturally, but can't learn anything more complex
* Ratios seem to be difficult for most models to learn. Ratio combinations often lead to some easy performance gains
* Linear models and NNs generally do better with normalized features. NNs especially need features scaled to values not too far from 0. Tree-based models (like random forests and XGBoost) can sometimes benefit from normalization, but usually much less so
* Tree models can learn to approximate almost any combination of features, but when a combination is especially important, they can still benefit from having it explicitly created, especially when data is limited
* Counts are especially helpful for tree models since these models don't have a natural way of aggregating information across many features at once

#### 5.4.1.3.4. Clustering with K-Means
* We can use clustering algorithms (such as K-Means) in feature engineering. For example, we could attempt to discover groups of customers representing a market segment, for instance, or geographic areas that share similar weather patterns. Adding a feature of cluster labels can help machine learning models untangle complicated relationships of space or proximity
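
A minimal sketch of adding cluster labels as a feature is given below. The coordinates are synthetic, and the column names and the choice of two clusters are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical coordinates forming two well-separated geographic groups
df = pd.DataFrame({
    "latitude": np.concatenate([rng.normal(10, 0.5, 50), rng.normal(20, 0.5, 50)]),
    "longitude": np.concatenate([rng.normal(70, 0.5, 50), rng.normal(80, 0.5, 50)]),
})

# K-Means is distance-based, so the inputs are scaled first
scaled = StandardScaler().fit_transform(df[["latitude", "longitude"]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)

# Cluster labels are categories, not quantities
df["cluster"] = pd.Series(labels, index=df.index).astype("category")
print(df["cluster"].value_counts())
```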

#### 5.4.1.3.5. Principal Component Analysis (PCA)
* Just like clustering is a partitioning of the dataset based on proximity, we could think of PCA as a partitioning of the variation in the data
* PCA is a great tool to help discover important relationships in the data and can also be used to create more informative features
* **NOTE**: PCA is typically applied to standardized data. With standardized data "variation" means "correlation". With unstandardized data "variation" means "covariance". 
* There are two ways you could use PCA for feature engineering:
  - The **first way** is to use it as a **descriptive technique**. Since the components tell us about the variation, we could compute the MI scores for the components and see what kind of variation is most predictive of our target. That could give us ideas for kinds of features to create: 
     - A product of ``'Height'`` and ``'Diameter'`` if ``'Size'`` is important, say, or a ratio of ``'Height'`` and ``'Diameter'`` if ``'Shape'`` is important. You could even try clustering on one or more of the high-scoring components
  - The **second way** is to use the **components themselves as features**. Because the components expose the variational structure of the data directly, they can often be more informative than the original features
* PCA Best Practices:
  - PCA only works with numeric features, like continuous quantities or counts
  - PCA is sensitive to scale. It's good practice to standardize the data before applying PCA unless we know we have a good reason not to
  - Consider removing or constraining outliers, since they can have an undue influence on the results
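
The following sketch applies PCA to standardized data and inspects the loadings, which is one way to carry out the descriptive use described above. The `Height`, `Diameter`, and `Weight` columns are synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
height = rng.normal(10, 2, 200)
diameter = 0.5 * height + rng.normal(0, 0.3, 200)  # strongly tied to height
weight = rng.normal(100, 15, 200)                  # independent of both
X = pd.DataFrame({"Height": height, "Diameter": diameter, "Weight": weight})

# Standardize first, because PCA is sensitive to scale
pca_pipeline = make_pipeline(StandardScaler(), PCA())
components = pca_pipeline.fit_transform(X)
pca = pca_pipeline.named_steps["pca"]

# Components as new features, and loadings to interpret them
X_pca = pd.DataFrame(
    components, columns=[f"PC{i + 1}" for i in range(components.shape[1])]
)
loadings = pd.DataFrame(pca.components_.T, index=X.columns, columns=X_pca.columns)

print(pca.explained_variance_ratio_)
print(loadings)
```

The columns of `X_pca` could be appended to the feature set (the second use), while `loadings` shows which original features each component combines (the first use).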

#### 5.4.1.3.6. Target Encoding
The techniques discussed so far are mostly for numerical features. A technique meant for categorical features is “target encoding”
* A target encoding is any kind of encoding that replaces a feature's categories with some number derived from the target
* Target encoding is sometimes called _mean encoding_ or _likelihood encoding_ or _impact encoding_ or _leave-one-out encoding_ or _bin counting_ (if applied to a binary target)
* Use cases for Target Encoding:
  - **High-cardinality features**: A feature with a large number of categories can be troublesome to encode: a one-hot encoding would generate too many features and alternatives, like a label encoding, might not be appropriate for that feature. A target encoding derives numbers for the categories using the feature's most important property: its relationship with the target
  - **Domain-motivated features**: From prior experience, we might suspect that a categorical feature should be important even if it scored poorly with a feature metric. A target encoding can help reveal a feature's true informativeness
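
One common form of target encoding is a smoothed mean encoding, sketched below with pandas. The `city` and `price` columns and the smoothing strength `m` are hypothetical; smoothing pulls the estimate for rare categories towards the overall mean, and unseen categories fall back to the overall mean.

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "C"],
    "price": [100.0, 120.0, 110.0, 200.0, 220.0, 400.0],
})
test = pd.DataFrame({"city": ["A", "C", "D"]})  # "D" never appears in training

global_mean = train["price"].mean()
stats = train.groupby("city")["price"].agg(["mean", "count"])

m = 2.0  # assumed smoothing strength; larger values trust the global mean more
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

train["city_encoded"] = train["city"].map(smoothed)
test["city_encoded"] = test["city"].map(smoothed).fillna(global_mean)
print(smoothed)
print(test)
```

Because the encoding is derived from the target, encoding the same rows that were used to compute it can leak target information; in practice the encoding is usually learned on a separate split or with cross-fitting.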

### 5.4.1.4. Dimensionality Reduction
* Creating compact projections of the data
* Dimensionality reduction is often discussed alongside Feature Selection, since both reduce the number of input variables

### 5.4.1.5. Split datasets for train-test
* Separate datasets into Input (X) and output (y) components, if needed
* Split the preprocessed datasets into _``train``_ and _``test``_ datasets, for model development, training, and evaluations
  - _``train-test-split``_: 
    - A procedure to evaluate the performance of models on a large dataset (less computational cost)
    - The dataset split can be done during the Preprocessing stage and the train & test datasets should be used at later stages
    - Use the _``train_test_split()``_ function from sklearn.model_selection
    - Conceptually, this is a single split of the data into two parts (1 train and 1 test dataset)
  - _``LOOCV (Leave-One-Out-Cross-Validation)``_:
    - This is another extreme to train-test-split where k is set to the total number of observations (k=n) such that each observation is given a chance to be held out of the dataset
    - Use _``LeaveOneOut()``_ class from sklearn.model_selection
* Handle outliers after the split (remove them, or replace them with a suitable summary statistic such as mean/median/mode)
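
Both splitting procedures can be sketched as follows. The tiny arrays, the 80/20 split ratio, and the use of `stratify` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, train_test_split

X = np.arange(20).reshape(10, 2)               # 10 samples, 2 features
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # balanced binary target

# Single train/test split; stratify keeps the class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)

# LOOCV: one split per observation (k = n)
loo = LeaveOneOut()
n_loo_splits = loo.get_n_splits(X)
for train_idx, test_idx in loo.split(X):
    pass  # each iteration holds out exactly one observation
print(n_loo_splits)
```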

### 5.4.1.6. Data Transforms
* Changing the scale or distribution of variables.
* Data transformation can be considered as part of Feature engineering as well

#### **5.4.1.6.1. Numerical type**
#### **Change scale**

#### 5.4.1.6.1.1. Normalization a.k.a. Feature Scaling
* Rescales numerical values to a specific range, typically 0-1
* Feature Scaling is a process used to normalize the range of independent variables so that they can be mapped onto the same scale
* Exceptions:
  - ``Decision trees`` and ``random forests`` are scale-invariant algorithms where we don’t need to worry about feature scaling
  - Similarly, ``Naive Bayes`` and ``Linear Discriminant Analysis`` are not affected by feature scaling
  - In short, ``algorithms that rely on neither distances nor gradient-based optimization are largely unaffected by feature scaling``
* Some of the common methods are:
  - _``Log``_ transform
  - _``MinMaxScaler``_ transform
  - _``MaxAbsScaler``_ transform
  - _``Normalizer``_ transform
  - _``Binarizer``_ transform
  - _``Decimal``_ scaling

#### 5.4.1.6.1.2. Standardization (_``StandardScaler``_)
* Standardize numerical data using the scale and center options (i.e., a mean of 0 and an SD of 1)
* It can be more useful for many ML algorithms, especially for optimization algorithms such as gradient descent
* In standardization, first, we determine the distribution mean and standard deviation for each feature
* Next, we subtract the mean from each feature, then we divide the values of each feature by its standard deviation
* So, in standardization, we center the feature columns at mean 0 with a standard deviation of 1, which puts the features on a comparable scale and makes it easier to learn the weights. Note that standardization does not change the shape of a feature's distribution

_**NOTE**_: _Again, we should fit the _``StandardScaler``_ class only once on the training data set and use those parameters to transform the test set or new data set. This is to avoid DATA LEAKAGE._
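
The fit-on-train, transform-on-test pattern from the note can be sketched like this, using a tiny made-up array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[4.0, 400.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learns mean and SD from training data
X_test_std = scaler.transform(X_test)        # reuses the training mean and SD

print(scaler.mean_, scaler.scale_)
print(X_test_std)
```

The test values are not centered at 0 after the transform, which is expected: they are expressed relative to the training distribution.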

#### 5.4.1.6.1.3. Robust (_``RobustScaler``_)
* _``StandardScaler``_ can often give misleading results when the data contain outliers
* Outliers can often influence the sample mean and variance and hence give misleading results
* In such cases, it is better to use a scaler that is robust against outliers
* The _``RobustScaler``_ is similar in spirit to _``MinMaxScaler``_. The difference lies in the parameters used for scaling. While _``MinMaxScaler``_ uses the minimum and maximum values for rescaling, _``RobustScaler``_ centers on the median and scales by the interquartile range (IQR)
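
A small sketch comparing the two scalers on data with one extreme value is shown below; the numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # last value is an outlier

robust = RobustScaler().fit(X)
X_robust = robust.transform(X)
X_standard = StandardScaler().fit_transform(X)

print(robust.center_, robust.scale_)  # median and IQR of the column
print(X_robust.ravel())
print(X_standard.ravel())
```

With `RobustScaler` the four ordinary values keep a sensible spread, whereas with `StandardScaler` the outlier inflates the mean and SD so the ordinary values are squeezed close together.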

#### **Change distribution**

#### 5.4.1.6.1.4. Power (_``PowerTransformer``_)
* A power transform will make the probability distribution of a variable more Gaussian-like
* This power transform is available in the scikit-learn Python machine learning library via the _``PowerTransformer``_ class with the methods:
  - _``Box-Cox``_ transform: Automatic power transform for strictly positive data
  - _``Yeo-Johnson``_ transform: Automatic power transform that also supports zero and negative values
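
As a rough illustration, the snippet below applies both methods to a synthetic right-skewed variable and compares the skewness before and after. The log-normal data is an assumption chosen because it is strictly positive, so both methods are applicable.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # positive, right-skewed

box_cox = PowerTransformer(method="box-cox")          # needs strictly positive input
yeo_johnson = PowerTransformer(method="yeo-johnson")  # also accepts zero/negative input

X_bc = box_cox.fit_transform(X)
X_yj = yeo_johnson.fit_transform(X)

skew_raw = skew(X.ravel())
skew_bc = skew(X_bc.ravel())
skew_yj = skew(X_yj.ravel())
print(skew_raw, skew_bc, skew_yj)
```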

#### 5.4.1.6.1.5. Quantile (_``QuantileTransformer``_)
* Numerical input variables may have a highly skewed or non-standard distribution
* The quantile transform provides an automatic way to transform a numeric input variable to have a different data distribution, which in turn, can be used as input to a predictive model
* A quantile transform will map a variable’s probability distribution to another probability distribution
* The quantile function ranks or smooths out the relationship between observations and can be mapped onto other distributions, such as the uniform or normal distribution

#### 5.4.1.6.1.6. Discretize (_``KBinsDiscretizer``_)
* This is also called _``BINNING``_ i.e., a grouping of values into 'bins' (e.g. High, Medium, Low)
* The discretization transform provides an automatic way to change a numeric input variable to have a different data distribution, which in turn can be used as input to a predictive model
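
A minimal binning sketch with `KBinsDiscretizer` follows. The three equal-width bins (think Low, Medium, High) and the sample values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[1.0], [2.0], [5.0], [6.0], [9.0], [10.0]])

# Three equal-width bins over the observed range [1, 10]
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = binner.fit_transform(X)

print(binner.bin_edges_[0])  # bin boundaries learned from the data
print(X_binned.ravel())      # bin index of each value
```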

#### **Engineer**

#### 5.4.1.6.1.7. Polynomial (_``PolynomialFeatures``_)
* Polynomial features are those features created by raising existing features to an exponent
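
For example, a degree-2 expansion of two features might be sketched as follows. Besides the squared terms, `PolynomialFeatures` also generates the interaction term between the two inputs.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one sample with features x0 = 2 and x1 = 3

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Resulting columns: x0, x1, x0^2, x0*x1, x1^2
print(X_poly)
```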

#### **5.4.1.6.2. Categorical type**
#### **Nominal**

#### 5.4.1.6.2.1. One-Hot encode (_``OneHotEncoder``_)
* For categorical variables that do not have a natural rank-ordering, i.e., no relationships
  ```
  Eg., 'red' is (1,0,0), 'green' is (0,1,0), and 'blue' is (0,0,1)
  ```
* _``LabelEncoder``_ treats class labels as categorical data with no order associated with it
* The problem arises when we apply the same approach to transform the nominal variable with _``LabelEncoder``_
* For example, with _``LabelEncoder``_, the values 'high', 'low', 'medium' are encoded as 0, 1, 2 respectively (alphabetical order). Integer codes are fine when they follow a genuine order (ordinal variables), but not for nominal variables
* Although there is no order involved, a learning algorithm will assume that _``high < low < medium``_. This is a wrong assumption and it will not produce desired results
* To fix this issue, a common solution is to use a technique called _``one-hot-encoding``_
* In this technique, we create a new dummy feature for each unique value in the nominal feature column. The value of a dummy feature is equal to one when its unique value is present and zero otherwise. This is called one-hot encoding because, for each sample, only one dummy feature will be equal to one (hot), while the others will be zero (cold)

#### 5.4.1.6.2.2. Dummy encode
* _``Dummy encoding``_: Avoid redundancy in One-Hot encoding
  ```
  Eg., 'red' is (1,0), 'green' is (0,1), and 'blue' is (0,0)
  ```
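
One-hot and dummy encoding can both be sketched with scikit-learn's `OneHotEncoder`; using `drop="first"` is one way (assumed here) to obtain a dummy encoding. Note that the encoder orders categories alphabetically, so the column order differs from the examples above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# One-hot: one column per category (blue, green, red)
one_hot = OneHotEncoder(handle_unknown="ignore")
X_one_hot = one_hot.fit_transform(colors).toarray()

# Dummy: drop the first category (blue) to remove the redundant column
dummy = OneHotEncoder(drop="first")
X_dummy = dummy.fit_transform(colors).toarray()

print(one_hot.categories_)
print(X_one_hot)
print(X_dummy)
```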

#### 5.4.1.6.2.3. Label binarize (_``LabelBinarizer``_)
* We can accomplish two tasks (encoding multi-class labels to integer categories, then from integer categories to one-hot vectors or binary labels) in one shot using the Scikit-Learn’s _``LabelBinarizer``_ class, in other words, it combines _``Label encode + One-hot encode``_.

#### **Ordinal**

#### 5.4.1.6.2.4. Label encode (_``LabelEncoder``_)
* The ML algorithms require that class labels are encoded as integers and most estimators for classification convert class labels to integers internally

#### 5.4.1.6.2.5. Ordinal encode (_``OrdinalEncoder``_)
* For categorical variables that have a natural rank ordering; each unique category value is assigned an integer value that preserves the order
  ```
  Eg., 'low' is 1, 'medium' is 2, and 'high' is 3
  ```

**Differences between _``LabelEncoder``_ and _``OrdinalEncoder``_:**
```
LabelEncoder                            OrdinalEncoder
-------------------------------------   --------------------------------------------------
- Deals with 1D data: (n_samples,)      - Deals with 2D data: (n_samples, n_features)
- Used to encode ‘Target variable’      - Used to encode ‘independent features’
```
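
The difference can be sketched as follows. The label and category values are made up; passing an explicit `categories` list to `OrdinalEncoder` is one way to make the integer codes follow the intended order rather than alphabetical order (note that the codes start at 0).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder: 1D target variable, classes sorted alphabetically
y = np.array(["spam", "ham", "spam"])
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)  # ham -> 0, spam -> 1

# OrdinalEncoder: 2D input features, with an explicit category order
X = np.array([["low"], ["high"], ["medium"], ["low"]])
ordinal_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
X_encoded = ordinal_encoder.fit_transform(X)  # low -> 0, medium -> 1, high -> 2

print(y_encoded)
print(X_encoded.ravel())
```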

_**IMPORTANT NOTE:**_ 
* _Fit any feature scaling or transformation technique (such as normalization or standardization etc.) on the training dataset only, and then apply the fitted transform to the testing dataset, to prevent DATA LEAKAGE. In other words, ``DO NOT apply data transformation techniques before splitting the datasets into training & testing datasets``_.

### 5.4.1.7. Handling Imbalanced classes
* Any real-world dataset may come with several problems and the imbalanced classes are one of them
* The problem of imbalanced classes arises when one set of classes dominates over another set of classes
* The former is called the majority class while the latter is called the minority class
* This is a very common problem in machine learning where we have datasets with a disproportionate ratio of observations in each class

#### 5.4.1.7.1. Problems with imbalanced learning
* The problem of imbalanced classes is very common and it is bound to happen
* This learning from imbalanced data is referred to as imbalanced learning
* Significant problems may arise with imbalanced learning. These are as follows:
  - It causes the machine learning model to be more biased towards the majority class
  - It causes poor classification of minority classes. Hence, "accuracy" stops being a meaningful measure of performance
  - If the imbalanced classes problem is not addressed properly, we may end up with a high accuracy that is misleading: a model that always predicts the majority class scores well on accuracy while failing on the minority class. Hence, accuracy alone no longer reliably measures model performance
  - There may be inherent complex characteristics in the dataset. Imbalanced learning from such datasets requires new approaches, principles, tools, and techniques. But it cannot guarantee an efficient solution to the business problem

#### 5.4.1.7.2. Example of imbalanced classes
* The problem of imbalanced classes may appear in many areas including the following:
  - Disease detection, Fraud detection, Anomaly detection
  - Earthquake prediction, Churn prediction, Intrusion prediction
  - Spam filtering, etc.

#### 5.4.1.7.3. Approaches to handle imbalanced classes
There are several methods to deal with the imbalanced class problems and the common ones are listed below
* ``Undersampling methods:``
The undersampling methods work with the majority class. In these methods, we eliminate instances of the majority class (randomly, in the simplest case) to reduce its number of observations and make the dataset balanced. This can result in a considerable loss of information. Undersampling is applicable when the dataset is large enough that enough training samples remain after balancing.
    1. Random
    2. Informative
    3. NearMiss
    4. Tomek links
    5. Edited nearest neighbors
    6. Cluster centroids
<br><br>
* ``Oversampling methods:``
The Oversampling methods work with the minority class. In these methods, we duplicate random instances of the minority class. So, it replicates the observations from minority classes to balance the data. It is also known as upsampling. It may result in overfitting due to duplication of data points.
    1. Random 
    2. Cluster-based
    3. Synthetic data generation (SMOTE & ADASYN)

    _**IMPORTANT NOTE**_: _``SMOTE resampling should NOT be done for test datasets ("X_test" & "y_test")``_

* ``Other methods:``
    1. Cost-sensitive learning
    2. Algorithmic Ensemble methods
    3. The ``imbalanced-learn`` library
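
As a rough sketch, random oversampling and cost-sensitive learning can both be expressed with scikit-learn alone. The example below is not SMOTE; it uses plain random duplication via `sklearn.utils.resample` and the `class_weight="balanced"` option, on synthetic data with an assumed 10% minority class. Resampling is applied to the training split only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.array([1] * 50 + [0] * 450)  # 10% minority class
X[y == 1] += 1.5                    # shift the minority class so it is learnable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Random oversampling of the minority class, on the training split only
X_min, X_maj = X_train[y_train == 1], X_train[y_train == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.array([0] * len(X_maj) + [1] * len(X_min_up))
oversampled_model = LogisticRegression().fit(X_balanced, y_balanced)

# Cost-sensitive alternative: reweight classes instead of resampling
weighted_model = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

print(np.bincount(y_train), np.bincount(y_balanced))
```

The test split keeps its original class proportions so that evaluation reflects the data the model will actually see.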

# 5.5. Stage-5: Model Development
ML team owns full responsibility for this stage. During this stage, the ML models are developed using the list of algorithms selected at Stage 3 and the data pre-processed at Stage 4. A model can be developed from scratch, or a pre-trained/available one can be fine-tuned.

# 5.6. Stage-6: Model Training
ML team owns full responsibility for this stage. During this stage, the _``Evaluation metrics``_ are generated using the _``Training datasets``_. Based on their performance, only a ``few models are selected`` for improvement.

There are many ways to split the datasets depending on the size of the dataset to improve model learning. All these methods are used to train, refine and evaluate the ML models.
* _K-Fold cross-validation_: 
  - A resampling procedure for evaluating the performance of the models, especially useful on a small dataset (at a higher computational cost, because each model is fitted k times)
  - This procedure should be used directly during the Model Training and initial model selection stage
  - Use _``KFold()``_ class from _sklearn.model_selection_
  - Usually, k is set to 10 (k=10) for 10 splits (the model is fitted 10 times, each time training with 9 folds & testing with the 1 held-out fold)
* _StratifiedKFold_: 
  - Each fold in the split has the same proportion of observations with a given categorical value such as class outcome
  - Use _``StratifiedKFold()``_ class from _sklearn.model_selection_
* _RepeatedKFold_:
  - KFold cross-validation is repeated n times
  - Use _``RepeatedKFold()``_ class from _sklearn.model_selection_
  - Use this for _**Regression**_ models
* _RepeatedStratifiedKFold_:
  - It's a combination of Stratified KFold and Repeated KFold procedures
  - Use _``RepeatedStratifiedKFold()``_ class from _sklearn.model_selection_
  - Use this for _**Classification**_ models
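
The splitters listed above can be compared with a short scikit-learn sketch. The synthetic dataset from `make_classification`, the `LogisticRegression` model, and all parameter values are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    KFold,
    RepeatedStratifiedKFold,
    StratifiedKFold,
    cross_val_score,
)

# Hypothetical synthetic classification data (roughly 70% / 30% classes).
X, y = make_classification(
    n_samples=200, n_features=10, weights=[0.7, 0.3], random_state=42
)
model = LogisticRegression(max_iter=1000)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
splitters = {
    "KFold": KFold(n_splits=10, shuffle=True, random_state=42),
    "StratifiedKFold": skf,
    "RepeatedStratifiedKFold": RepeatedStratifiedKFold(
        n_splits=10, n_repeats=3, random_state=42
    ),
}

# One score per fit: 10 for the k-fold variants, 10 x 3 = 30 for the repeated one.
results = {}
for name, cv in splitters.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name:<24} fits={len(scores):>2}  mean accuracy={scores.mean():.3f}")

# Stratification keeps the class mix of every test fold nearly identical.
minority_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print("minority samples per stratified test fold:", minority_per_fold)
```

For a regression model, `RepeatedKFold` would take the place of the stratified splitters, with a regression metric passed to `scoring`.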

# 5.7. Stage-7: Model Refinement
ML team owns full responsibility for this stage. During this stage, two main activities are performed:
   1. Based on the metrics generated at Stage 6, the models are compared and initial selections are made
   2. The selected models are refined to improve their performances using _``Training datasets``_

## 5.7.1. Hyperparameters optimization
### 5.7.1.1. Differences between Model Parameters and Model Hyperparameters

![](figures/MLPG-DiffParamsHyperparams.png)

### 5.7.1.2. Hyperparameters-Tuning for Classification Algorithms
``A model is a hypothesis and its hyperparameters allow us to tailor the hypothesis (i.e., the behavior of the algorithm) to a specific dataset.``
* The more hyperparameters of an algorithm that one needs to tune, the slower the tuning process is. Therefore, it is desirable to select a minimum subset of model hyperparameters to search or tune
* Not all model hyperparameters are equally important. Some hyperparameters have an outsized effect on the behavior, and in turn, the performance of an ML algorithm
* As an ML practitioner, one must know which hyperparameters to focus on to get a good result quickly

The following table summarizes the suggestions for hyperparameters-tuning for 7 Classification algorithms. Please note, that the list of algorithms and the hyperparameters are not exhaustive; these are just examples.

![](figures/MLPG-HyperparamsTuning.png)
* Hyperparameters Optimization can be done with ``Random Search`` and ``Grid Search``

_**IMPORTANT NOTE**_: _Tune hyperparameters using the training datasets only (e.g., with cross-validation inside the training data) and keep the testing datasets untouched until the final evaluation, to prevent DATA LEAKAGE. In other words, DO NOT apply hyperparameter-tuning techniques before splitting the datasets into training & testing datasets._
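
A minimal sketch of leakage-free tuning with scikit-learn's `GridSearchCV` follows. The synthetic data, the pipeline steps, and the grid of `C` values are assumptions for illustration, not recommendations from the whitepaper. The split happens first, the search sees only the training data, and the test data is scored once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data; split BEFORE any tuning takes place.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Keeping the scaler inside the pipeline means it is re-fitted on each
# training fold, so no statistics leak from the validation folds.
pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid only

search = GridSearchCV(
    pipe,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_train, y_train)  # the test data is never seen here

test_score = search.score(X_test, y_test)  # single, final check on held-out data
print("best params      :", search.best_params_)
print("best CV accuracy :", round(search.best_score_, 3))
print("test accuracy    :", round(test_score, 3))
```

`RandomizedSearchCV` from the same module follows the same pattern, sampling a fixed number of candidates instead of trying every combination in the grid.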

# 5.8. Stage-8: Model Evaluation
ML team owns full responsibility for this stage. During this stage, the trained models are evaluated using _``Test datasets``_ by comparing the Evaluation metrics generated for each model (e.g., Accuracy score).

_**NOTE:**_
* _The ``evaluation metrics`` should be the same across all stages so that comparisons can be made and models can be selected_
* _One way to compare the models is to generate bar charts with Accuracy/ROC-AUC/CV scores_

# 5.9. Stage-9: Final Model Selection
Based on the _``evaluation metrics``_ generated at Stage 7 (on training datasets) and Stage 8 (on test datasets), the final model is selected. The final model is trained on the entire dataset, saved, and deployed into the test infrastructure for business user testing (i.e., for Model Validation).
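
The "train on the entire dataset, then save" step might look like the sketch below. The synthetic data, the choice of `LogisticRegression` as the selected model, and the file name `final_model.pkl` are hypothetical; the standard-library `pickle` module is used here, and `joblib` is a common alternative.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the full dataset (training + test data combined).
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

# Refit the selected model, with its chosen hyperparameters, on ALL the data.
final_model = LogisticRegression(max_iter=1000).fit(X, y)

# Save the fitted model to disk (hypothetical file name in a temp directory).
model_path = Path(tempfile.mkdtemp()) / "final_model.pkl"
with model_path.open("wb") as fh:
    pickle.dump(final_model, fh)

# Later, e.g. in the test infrastructure: load the model and predict.
with model_path.open("rb") as fh:
    restored = pickle.load(fh)

same_predictions = bool((restored.predict(X) == final_model.predict(X)).all())
print("saved to:", model_path.name, "| predictions match:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and the library versions used for saving and loading should match.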

# 5.10. Stage-10: Model Validation
The business users perform the validations using the _``unseen UAT datasets``_.

_**NOTE:**_ 
* _The Evaluation Metrics and Testing metrics should be the same_

# 5.11. Stage-11: Model Deployment
The model deployment into the production environment is usually done by the DS team along with the ML team. 
