<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/MLPG-Book-Cover-Small.png"><br>

This notebook contains an excerpt from the **`Machine Learning Project Guidelines - For Beginners`** book written by *Balasubramanian Chandran*; the content is available [on GitHub](https://github.com/BalaChandranGH/Books/ML-Project-Guidelines).

<br>
<!--NAVIGATION-->

<[ [Stage-3: Research](09.00-mlpg-Stage-3-Research.ipynb) | [Contents and Acronyms](00.00-mlpg-Contents-and-Acronyms.ipynb) | [Stage-5: Model Development](11.00-mlpg-Stage-5-Model-Development.ipynb) ]>

# 10. Stage-4: Data Preprocessing

* The success of ML algorithms depends on the quality of the data, which should be free from errors and discrepancies
* The data must adhere to a specific standard so that ML algorithms can accept it, but this rarely happens in practice
* In reality, the data is dirty, incomplete, noisy, and inconsistent
* Incomplete data has missing values or lacks certain attributes
* Noisy data contains errors and outliers and hence does not produce the desired results
* Inconsistent data contains discrepancies or duplicates
* ML practitioners therefore transform the collected raw data so that it meets the input requirements of the ML algorithms used for model training and testing
* This involves several steps for cleaning, transforming, normalizing, and standardizing the data to remove its inadequacies and irregularities
* These steps are collectively known as _**Data Preprocessing**_ (or) _**Data Wrangling**_ (or) _**Data Preparation**_

## 10.1. Data Preparation framework (for structured/tabular data)
![](figures/MLPG-DataPrepFramework.png)

## 10.2. Data Preparation tasks
* **Data Cleaning**: Identifying and correcting mistakes or errors in the data
* **Feature Selection**: Identifying those input variables that are most relevant to the task
* **Feature Engineering**: Deriving new variables from available data
* **Dimensionality Reduction**: Creating compact projections of the data
* **Split datasets for train-test**: Separating datasets into input (X) and output (y) components
* **Data Transforms**: Changing the scale or distribution of variables
* **Handling Imbalanced Classes**: Imbalanced classes arise when one set of classes (majority class) dominates over another (minority class)

### 10.2.1. Data Cleaning
* _``The process of identifying, correcting mistakes/ errors/ incomplete/ inconsistent/ noisy data, and preparing the dataset for analysis``_
* Real-world datasets rarely come clean; they arrive in a wide variety of shapes and formats

**Tidy data**
* Tidy data provides a standard way to organize data values within a dataset
* There are three principles of tidy data and they are:
  - Columns represent separate variables
  - Rows represent individual observations
  - Observational units form tables
* Tidy data makes it easier to fix common data problems, so, we need to transform the untidy dataset into tidy data

**Signs of Untidy dataset**
* _``Missing numerical data``_: Either they need to be deleted or replaced with a suitable summary statistic
* _``Unexpected data values``_: Resolve mismatches between the data type of a column and the values it holds
* _``Inconsistent column names``_: Address the column names that contain inconsistent caps and bad characters
* _``Outliers``_: Remove or replace them with a suitable summary statistic as they can skew the results
* _``Duplicate rows and columns``_: Drop them as they can cause bias in the analysis

#### 10.2.1.1. Basic data cleaning activities
* Remove unwanted/duplicate columns/features/attributes
* Remove unwanted/duplicate rows/samples/examples
* Remove embedded characters that may cause data misalignment 
  - Eg., embedded tabs in a tab-separated data file, embedded new lines that may break records, etc
* Identify the inconsistent values and bring them to a common standard of expression (eg., N.Y or NY into New York)
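
The activities above can be sketched with pandas as follows. This is a minimal illustration on a made-up DataFrame; the column names (`name`, `state`, `notes`) and values are assumptions, not data from the book.

```python
import pandas as pd

# Hypothetical raw data: a duplicate row, an embedded tab, inconsistent state names
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cy\tril"],
    "state": ["N.Y", "NY", "NY", "New York"],
    "notes": ["x", "y", "y", "z"],
})

# Remove an unwanted column, then drop duplicate rows
df = df.drop(columns=["notes"]).drop_duplicates().reset_index(drop=True)

# Remove embedded tabs/newlines that could misalign a delimited file
df["name"] = df["name"].str.replace(r"[\t\r\n]", "", regex=True)

# Bring inconsistent values to a common standard of expression
df["state"] = df["state"].replace({"N.Y": "New York", "NY": "New York"})

print(df)
```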

#### 10.2.1.2. Outliers detection
* Outliers are extreme values that will skew the results if not addressed properly
* There are many ways to identify outliers. Following are some of them:
  - Box and Whisker plots
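  - Calculate the Z-Score using the formula given below to identify the outliers (data points whose Z-Score falls outside a chosen threshold)
    $z = \frac{x - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation (SD) of the feature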
* A couple of ways to handle outliers are:
  - Drop them
  - Replace with a suitable summary statistic (such as mean/median/mode), examples are:
    - _height_ column has a value of 0 which is invalid (outlier). It can be replaced with a mean value
    ```
      # Compute the mean from the valid (non-zero) heights only
      mean = df.loc[df['height(cm)'] != 0.0, 'height(cm)'].mean()
      df['height(cm)'] = df['height(cm)'].replace(0.0, mean)
    ```
    - _weight_ column has a value of 190kg which may be mistakenly typed instead of 90kg
    ```
      df['weight(kg)'] = df['weight(kg)'].replace(190.0, 90.0)
    ```

_**IMPORTANT NOTE**: Handle outliers (discarding the samples or replacing the outlier values with the mean, median, or mode) on the training and test datasets separately, and derive any replacement value from the training set only to avoid data leakage_
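
The Z-Score approach described in this section can be sketched as below. The height readings and the threshold of 3 are assumptions chosen for illustration; in a real project the mean, SD, and replacement value would be computed from the training set.

```python
import pandas as pd

# Hypothetical height readings (cm); 250 is a likely data-entry error
heights = pd.Series(
    [168, 170, 172, 169, 171, 170, 173, 167, 170, 171,
     169, 172, 168, 170, 171, 170, 169, 172, 171, 170, 250],
    dtype=float,
)

# Z-Score: distance from the mean in units of standard deviation
z_scores = (heights - heights.mean()) / heights.std()

# A common rule of thumb (an assumption here) flags |z| > 3 as an outlier
threshold = 3.0
is_outlier = z_scores.abs() > threshold
outliers = heights[is_outlier]

# Replace flagged values with the median, which the outlier barely influences
cleaned = heights.mask(is_outlier, heights.median())

print(outliers.tolist())
```

Note that with very small samples a single extreme value inflates the SD so much that its own Z-Score may stay below the threshold; a box-plot (IQR) rule can be a more robust alternative in that case.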

#### 10.2.1.3. Handling Missing Numerical values
* Care must be taken while dealing with missing numerical values
* Need to first identify the reason for the missing numerical values
* There are several methods to handle missing values and each method has its advantages and disadvantages
* The choice of the method is subjective and depends on the nature of the data and the missing values
* Commands that help to detect the missing numerical values are:
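```
df.isnull()          # Boolean mask: True where a value is missing
df.isnull().sum()    # Number of missing values per column
df.isna()            # Alias of isnull()
df.notna()           # Boolean mask: True where a value is present
```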
* `Mark/Encode` missing numerical values with `NaN`:
  - Missing values are encoded in different ways and they can appear as `NaN, NA, ?, 0, 'xx', -1, or " " (blank space)`
  - Pandas represents missing numerical values as `NaN`, and methods such as `isnull()` only detect values stored that way
  - So, any other placeholder (such as `?`, `xx`, `-1`, a blank space, or a `0` that really means "missing") must first be converted to `NaN`
  - Take care with `0` and `-1`: convert them only when they are known to be placeholders rather than legitimate values
```
import numpy as np
df = df.replace('?', np.nan)   # Convert '?' to NaN
```
* `Drop` them using `dropna()` method
  - This is the easiest method to handle missing values. In this method, we drop the rows or columns that contain missing values
  - We can drop rows or columns from a dataset containing missing values as follows (each call returns a new DataFrame):
```
df.dropna(axis = 0)          # Drop rows with missing values
df.dropna(axis = 1)          # Drop columns with missing values
df.drop('col1', axis = 1)    # Drop column 'col1'
```

**A note about axis parameter:**
* The axis parameter accepts (0 or 'index') or (1 or 'columns'). Its default value is 0
* We set axis = 0 or 'index' to drop rows that contain missing values
* We set axis = 1 or 'columns' to drop columns that contain missing values
* Dropping has one disadvantage: it involves the risk of losing useful information along with the missing data
* It is advised to drop only when there are a few missing values in our dataset
* It's often better to develop an imputation strategy so that we can impute missing values with the mean or the median of the column containing the missing values
* `Replace` with a suitable summary statistic (such as mean/median/mode) or with a neighbouring value (forward-fill/back-fill), examples are:
  - Fill the missing values with a summary statistic like `mean`, `median`, or `mode` of the particular feature the missing value belongs to
  - One can also specify a `forward-fill` to propagate the previous value forward, or a `back-fill` to propagate the next value backward
  - We can fill missing values with a summary statistic like `mean` as follows:
    ```
    mean = df['col1'].mean()
    df['col1'] = df['col1'].fillna(mean)
    ```
  - We can also use replace() in place of fillna()
    ```
    df['col1'] = df['col1'].replace(np.nan, mean)
    ```

_**NOTE**: If we choose this method, then we should compute the mean value on the training set and use it to fill the missing values in the training set. Then we should save the mean value that we have computed. Later, we will replace missing values in the test set with the mean value to evaluate the system. This is to avoid DATA LEAKAGE._
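
The leakage-safe procedure in the note can be sketched as follows. The two small DataFrames and the column name `col1` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical train/test splits with missing values in 'col1'
train = pd.DataFrame({"col1": [1.0, 2.0, np.nan, 3.0]})
test = pd.DataFrame({"col1": [np.nan, 10.0]})

# Compute the statistic on the training set only (NaN is skipped by default)
train_mean = train["col1"].mean()

# Reuse the saved training mean to fill both the training and the test set
train["col1"] = train["col1"].fillna(train_mean)
test["col1"] = test["col1"].fillna(train_mean)

print(train_mean, test["col1"].tolist())
```

The test set's own mean is never computed, so nothing about the test data influences the preprocessing.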

* `Impute`
  - Scikit-Learn provides the `SimpleImputer` class to deal with the missing values (it replaces the older `Imputer` class, which has been removed from recent scikit-learn releases)
  - In this method, we replace the missing value with the mean value of the entire feature column. The sample code is given below
    ```
    import numpy as np
    from sklearn.impute import SimpleImputer
    imp = SimpleImputer(missing_values=np.nan, strategy='mean')   # Column-wise mean
    imputed_data = imp.fit_transform(df.values)
    imputed_data
    ```

**Reshaping the data into tidy data format:**
* If a dataframe is not in tidy format, we can convert it into the tidy data format using the `pd.melt()` function. For example:
  ```
  pd.melt(frame=df,
          id_vars=['fname', 'lname', 'age', 'sex', 'section', 'height(cm)', 'weight(kg)'],
          value_vars=['spend_A', 'spend_B', 'spend_C'],
          var_name='expenditure', value_name='amount')
  ```

#### 10.2.1.4. Handling Missing Categorical values
* Care must be taken while dealing with missing categorical values
* Need to first identify the reason for the missing categorical values
* In addition to the numerical values, the real-world datasets contain categorical data as well
* Most ML algorithms require input data in numerical format; only then do the algorithms work successfully on them
* The categorical data must be converted into numbers before they are fed into an algorithm. Scikit-learn provides useful classes to do the same
* Some of the methods to handle missing values of categorical variables are:
  - `Drop`, i.e., ignore the variable if it is not significant
  - `Replace` it with the most frequent value (mode)
  - `Treat` the missing data `as just another category`
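
The last two options above can be sketched with pandas as follows. The `color` column and the `"Missing"` label are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries
df = pd.DataFrame({"color": ["red", "blue", np.nan, "red", np.nan]})

# Option: replace missing entries with the most frequent value (mode)
mode_value = df["color"].mode()[0]
df["color_mode"] = df["color"].fillna(mode_value)

# Option: treat the missing data as just another category
df["color_cat"] = df["color"].fillna("Missing")

print(df)
```

As with numerical imputation, the mode would normally be computed on the training set and reused for the test set.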

**Summary of Data Cleaning tasks:**
![](figures/MLPG-DataCleaning.png)

### 10.2.2. Feature Selection
* Feature selection (FS) refers to techniques for selecting a subset of input features that are most relevant to the target variable that is being predicted
* FS is primarily focused on removing non-informative or redundant predictors from the model
* FS is about identifying those input variables that are most relevant to the task
* FS removes columns with duplicate data/ empty/ unwanted data
* FS techniques are generally grouped into those that use the target variable (`supervised`) and those that do not (`unsupervised`)
* The difference has to do with whether features are selected based on the target variable or not. _**Unsupervised**_ feature selection techniques _**ignore the target variable**_, such as methods that remove redundant variables using correlation. _**Supervised**_ feature selection techniques _**use the target variable**_, such as methods that remove irrelevant variables.

#### 10.2.2.1. The Goals of Feature Selection Techniques
* To reduce the computational cost of modeling
* To improve the performance of the model

#### 10.2.2.2. Overview of Feature Selection Techniques
* **Feature Selection**: Select a subset of input features from the dataset
  - **Unsupervised**: Do not use the target variable (e.g. remove redundant variables)
  - **Supervised**: Use the target variable (e.g. remove irrelevant variables)
    - **Wrapper**: Search for well-performing subsets of features (i.e., explicitly choose features that result in the best performing model)
    - **Filter**: Select subsets of features based on their relationship with the target (i.e., score each input feature and allow a subset to be selected)
    - **Intrinsic**: Algorithms that perform automatic feature selection during training. 
  - Some models are naturally resistant to non-informative predictors. Tree- and rule-based models, MARS and the Lasso, for example, intrinsically conduct FS

* **Dimensionality Reduction:** 
  - Project input data into a lower-dimensional feature space
  - Dimensionality reduction (eg., PCA) is an alternative to FS rather than a type of feature selection

![](figures/MLPG-FeatureSelectionTechs.png)

#### 10.2.2.3. Feature Selection Methods (Filter-based)
**Scikit-learn** implementations:
* Pearson’s Correlation Coefficient: [f_regression()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html)
* ANOVA: [f_classif()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html)
* Chi-Squared: [chi2()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html)
* Mutual Information: [mutual_info_classif()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html) & [mutual_info_regression()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html)
* Select the top k variables: [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)
* Select the top percentile variables: [SelectPercentile](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html)

**SciPy** implementations:
* Kendall’s tau: [kendalltau](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html)
* Spearman’s rank correlation: [spearmanr](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html)

_**NOTE**: Just like there is no best set of input variables or best ML algorithm, there is no best feature selection method. One must discover what works best for a specific problem._
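
As a minimal sketch of filter-based selection, the snippet below combines `SelectKBest` with the ANOVA score function `f_classif`. The synthetic dataset from `make_classification` and the choice of `k=3` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification data: 10 features, of which 3 are informative
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# Score every feature against the target and keep the 3 highest-scoring ones
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

# Column indices of the features that were kept
selected_idx = selector.get_support(indices=True)

print(selected_idx, X_selected.shape)
```

Swapping `f_classif` for `chi2` or `mutual_info_classif` changes the scoring rule without changing the rest of the code.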

![](figures/MLPG-FeatureSelectionMethods.png)

#### 10.2.2.4. Feature Importance
**Feature importance (FI)** refers to techniques that assign a score to input features based on how useful they are at predicting a target variable.

#### 10.2.2.4.1. The Uses of Feature Importance
* Better understanding the data – FI scores can provide insight into the dataset
* Better understanding a model – FI scores can provide insight into the model
* Reducing the number of input features – FI scores can be used to improve a predictive model
* Many ways to calculate FI scores and many models can be used for this purpose. Commonly used are,
  - Feature importance from model coefficients
  - Feature importance from decision trees
  - Feature importance from permutation testing

#### 10.2.2.4.2. Coefficients as Feature Importance
* In **sklearn.linear_model** (LinearRegression, LogisticRegression, Ridge, ElasticNet), the `model.coef_` attribute contains the coefficients found for each input variable
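
A minimal sketch of reading coefficients as importance scores is shown below. The synthetic data is an assumption; coefficient magnitudes are only comparable when the features share a similar scale, which holds for `make_regression` output.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic regression data: 5 features, of which 2 are informative
X, y = make_regression(
    n_samples=200, n_features=5, n_informative=2,
    noise=0.1, random_state=0,
)

model = LinearRegression().fit(X, y)

# Use the absolute coefficient as a rough importance score per feature
importance = np.abs(model.coef_)
ranking = np.argsort(importance)[::-1]  # feature indices, most important first

print(importance, ranking)
```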

#### 10.2.2.4.3. Decision Tree Feature Importance
* In **sklearn.tree** (CART FI [DecisionTreeRegressor, DecisionTreeClassifier]) and **sklearn.ensemble** (Random Forest FI [RandomForestRegressor, RandomForestClassifier]), the `model.feature_importances_` attribute contains the importance score computed for each input variable
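
For illustration, the sketch below fits a random forest on synthetic data (an assumption) and reads its impurity-based importance scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data: 6 features, of which 3 are informative
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3,
    n_redundant=0, random_state=0,
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One non-negative score per feature; the scores are normalized to sum to 1
importances = model.feature_importances_

print(importances)
```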

#### 10.2.2.4.4. XGBoost Feature Importance
* In **xgboost** (XGBRegressor, XGBClassifier), the `model.feature_importances_` attribute contains the importance score computed for each input variable

#### 10.2.2.4.5. Permutation Feature Importance
* **Permutation Feature Importance** is a technique for calculating relative importance scores that is independent of the model used
* This approach can be used for regression or classification and requires that a performance metric be chosen as the basis of the importance score, such as the mean squared error for regression and accuracy for classification
* Permutation feature importance can be computed via the [permutation_importance() function](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html) that takes a fitted model, a dataset (train or test dataset is fine), and a scoring function (available in `scikit-learn`)
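
A minimal sketch of the call is given below. The K-nearest-neighbours model, the synthetic data, and the number of repeats are assumptions; any fitted estimator could be used instead, including one that exposes neither coefficients nor tree importances.

```python
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data: 5 features, of which 2 are informative
X, y = make_regression(n_samples=300, n_features=5, n_informative=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# KNN has no coef_ or feature_importances_, so permutation importance is useful here
model = KNeighborsRegressor().fit(X_train, y_train)

# Shuffle each feature 10 times and measure the drop in the chosen score
result = permutation_importance(
    model, X_test, y_test,
    scoring="neg_mean_squared_error", n_repeats=10, random_state=0,
)
mean_importance = result.importances_mean

print(mean_importance)
```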

#### 10.2.2.5. Mutual Information (MI)
* MI is a lot like correlation in that it measures a relationship between two quantities
* MI describes relationships in terms of uncertainty
* The MI between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other
* _`The benefit of MI is that it can detect any kind of relationship, while correlation only detects linear relationships`_
* MI is a great general-purpose metric and especially useful at the start of feature development when we might not know what model we’d like to use yet
* `Advantages:`
  - Easy to use and interpret
  - Computationally efficient
  - Theoretically well-founded
  - Resistant to overfitting
  - Able to detect any kind of relationship

_**NOTE**: The uncertainty is measured using a quantity from information theory known as "entropy". The entropy of a variable means roughly: "how many yes-or-no questions you would need to describe an occurrence of that variable, on average." The more questions you have to ask, the more uncertain you must be about the variable. MI is how many questions you expect the feature to answer about the target._
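
For reference, the standard information-theoretic definitions behind the note can be written as follows, for a feature $X$ and a target $Y$ (the symbols are chosen here for illustration and are not notation from the book):

$$H(Y) = -\sum_{y} p(y)\,\log_2 p(y)$$

$$I(X;Y) = H(Y) - H(Y \mid X) = \sum_{x,y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)}$$

With a base-2 logarithm the unit is bits, which matches the "yes-or-no questions" intuition: a fair coin has $H = -2 \cdot \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ bit. Libraries may use the natural logarithm instead, in which case scores are expressed in nats.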

#### 10.2.2.6. Interpreting MI scores
* MI score range -> _**zero to infinity**_ (0.0 to ∞)
* When MI is zero, the quantities are independent: neither can tell you anything about the other
* Conversely, in theory, there's no upper bound to what MI can be
* In practice, though, values above 2.0 or so are uncommon
* MI is a logarithmic quantity, so it increases very slowly
* Things to remember when applying mutual information:
  - MI can help understand the _`relative potential`_ of a feature as a predictor of the target, considered by itself
  - It's possible for a feature to be very informative when interacting with other features, but not so informative all alone. _`MI can't detect interactions between features.`_ It is a **univariate** metric.
  - The actual usefulness of a feature _`depends on the model we use it with`_. A feature is only useful to the extent that its relationship with the target is one the model can learn. Just because a feature has a high MI score doesn't mean the model will be able to do anything with that information. We may need to transform the feature first to expose the association
* Scikit-learn has two MI metrics in its `feature_selection` module: 
  - One for real-valued targets (`mutual_info_regression`) 
  - One for categorical targets (`mutual_info_classif`)
* Example source code to rank the features with MI and investigate the results by data visualization
[1985 Automobiles (auto.csv) dataset is available in Kaggle]:
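
The automobile data file is not reproduced here, so the sketch below ranks features with MI on synthetic data instead. The column names (`informative`, `noise`) and the quadratic relationship are assumptions, chosen to show that MI picks up a non-linear dependence that a linear correlation would largely miss.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500

# Hypothetical features: one drives the target, the other is pure noise
X = pd.DataFrame({
    "informative": rng.uniform(-3, 3, n),
    "noise": rng.uniform(-3, 3, n),
})

# Non-linear (quadratic) relationship with a little measurement noise
y = X["informative"] ** 2 + rng.normal(0, 0.1, n)

# Estimate MI between each feature and the real-valued target, then rank
mi = mutual_info_regression(X, y, random_state=0)
mi_scores = pd.Series(mi, index=X.columns, name="MI Scores").sort_values(ascending=False)

print(mi_scores)
```

The ranked scores can then be visualized, for example with `mi_scores.plot.barh()` when matplotlib is installed.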

### 10.2.3. Feature Engineering
* _**“Better features make better data and better data make better models”.**_

#### 10.2.3.1. The Goal and a Guiding principle
* The goal of Feature Engineering is to make data better suited to the problem to be solved
* Establish a baseline (score) by training the model on the un-augmented dataset. This will help us determine whether the new features are useful/worth keeping or discard them and try something else

#### 10.2.3.2. Creating features
* Tips on discovering new features:
  - Understand the existing features in the dataset by referring to the “data description document”
  - Research the problem domain to acquire **domain knowledge**
  - Study previous work. Solution write-ups are a great resource
  - Use data visualization
* Complex strings that can usefully be broken into simpler pieces. Some common examples are:
  - ID numbers: `'123-45-6789'`
  - Phone numbers: `'(999) 555-0123'`
  - Street addresses: `'8241 Kaggle Ln., Goose City, NV'`
  - Internet addresses: `‘http://www.kaggle.com’`
  - Product codes: `'0 36000 29145 2'`
  - Dates and times: `'Mon Sep 30 07:06:05 2013'`
  - Split columns (Area code from Phone numbers or Year from Dates and times, etc.)
* Combine/aggregate existing features to create a new one (e.g., Total or Count or Mean or Ratio)
* Reorder columns, if needed
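
Splitting strings and combining columns can be sketched with pandas as below. The second phone number, the second timestamp, and the `spend_A`/`spend_B` values are made up for illustration.

```python
import pandas as pd

# Hypothetical records with complex strings and two numeric columns
df = pd.DataFrame({
    "phone": ["(999) 555-0123", "(212) 555-0199"],
    "timestamp": ["Mon Sep 30 07:06:05 2013", "Tue Oct 01 18:30:00 2013"],
    "spend_A": [10.0, 20.0],
    "spend_B": [30.0, 20.0],
})

# Split: extract the area code from the phone number
df["area_code"] = df["phone"].str.extract(r"\((\d{3})\)", expand=False)

# Split: parse the timestamp and keep the year
parsed = pd.to_datetime(df["timestamp"], format="%a %b %d %H:%M:%S %Y")
df["year"] = parsed.dt.year

# Combine: a total and a ratio built from existing features
df["spend_total"] = df["spend_A"] + df["spend_B"]
df["spend_ratio"] = df["spend_A"] / df["spend_B"]

print(df[["area_code", "year", "spend_total", "spend_ratio"]])
```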

#### 10.2.3.3. Tips on creating new features
Keep in mind the model's strengths and weaknesses when creating features. Here are some guidelines: 
* Linear models learn sums and differences naturally, but can't learn anything more complex
* Ratios seem to be difficult for most models to learn. Ratio combinations often lead to some easy performance gains
* Linear models and NNs generally do better with normalized features. NNs especially need features scaled to values not too far from 0. Tree-based models (like random forests and XGBoost) can sometimes benefit from normalization, but usually much less so
* Tree models can learn to approximate almost any combination of features, but when a combination is especially important, they can still benefit from having it explicitly created, especially when data is limited
* Counts are especially helpful for tree models since these models don't have a natural way of aggregating information across many features at once

#### 10.2.3.4. Clustering with K-Means
* We can use clustering algorithms (such as K-Means) in feature engineering. For example, we could attempt to discover groups of customers representing a market segment, or geographic areas that share similar weather patterns. Adding a feature of cluster labels can help machine learning models untangle complicated relationships of space or proximity
* Cluster Labels as a Feature
  - Applied to a single real-valued feature, clustering acts like a traditional "binning" or "discretization" transform. On multiple features, it's like "multi-dimensional binning" (sometimes called VECTOR QUANTIZATION)
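
A minimal sketch of adding cluster labels as a feature is shown below. The two synthetic groups of coordinates, the column names, and `n_clusters=2` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two hypothetical, well-separated groups of locations
coords = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[10, 10], scale=0.5, size=(50, 2)),
])
df = pd.DataFrame(coords, columns=["latitude", "longitude"])

# Fit K-Means and attach the cluster label as a new feature
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[["latitude", "longitude"]])

# Cluster labels are categories, not quantities
df["cluster"] = df["cluster"].astype("category")

print(df["cluster"].value_counts())
```

Because K-Means is distance-based, features on very different scales would normally be rescaled before clustering.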

#### 10.2.3.5. Principal Component Analysis (PCA)
* Just like clustering is a partitioning of the dataset based on _`proximity`_, we could think of PCA as a partitioning of the _`variation`_ in the data
* PCA is a great tool to help discover important relationships in the data and can also be used to create more informative features
* _**NOTE**: PCA is typically applied to standardized data. With standardized data "variation" means "correlation". With unstandardized data "variation" means "covariance"._
* There are two ways you could use PCA for feature engineering:
  - The **first way** is to use it as a **descriptive technique**. Since the components tell us about the variation, we could compute the MI scores for the components and see what kind of variation is most predictive of our target. That could give us ideas for kinds of features to create: 
    - A product of `'Height'` and `'Diameter'` if `'Size'` is important, say, or a ratio of `'Height'` and `'Diameter'` if `'Shape'` is important. You could even try clustering on one or more of the high-scoring components
  - The **second way** is to use the **components themselves as features**. Because the components expose the variational structure of the data directly, they can often be more informative than the original features
  - `Use cases for PCA:`
    - **Dimensionality reduction**: When the features are highly redundant (_`multicollinear`_, specifically), PCA will partition out the redundancy into one or more near-zero variance components, which we can then drop since they will contain little or no information
    - **Anomaly detection**: Unusual variation, not apparent from the original features, will often show up in the low-variance components. These components could be highly informative in an anomaly or outlier detection task
    - **Noise reduction**: A collection of sensor readings will often share some common background noise. PCA can sometimes collect the (informative) signal into a smaller number of features while leaving the noise alone, thus boosting the signal-to-noise ratio
    - **Decorrelation**: Some ML algorithms struggle with highly correlated features. PCA transforms correlated features into uncorrelated components, which could be easier for the algorithm to work with
* `PCA Best Practices:`
  - PCA only works with numeric features, like continuous quantities or counts
  - PCA is sensitive to scale. It's good practice to standardize the data before applying PCA unless we know we have a good reason not to
  - Consider removing or constraining outliers, since they can have an undue influence on the results
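
The best practices above can be sketched as follows: standardize first, then apply PCA and inspect the loadings. The synthetic `Height` and `Diameter` columns are assumptions that echo the example in the text.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical, strongly correlated measurements
height = rng.normal(10, 2, 200)
diameter = 0.5 * height + rng.normal(0, 0.2, 200)
X = pd.DataFrame({"Height": height, "Diameter": diameter})

# PCA is sensitive to scale, so standardize before fitting
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Loadings show how each original feature contributes to each component
loadings = pd.DataFrame(pca.components_.T, columns=["PC1", "PC2"], index=X.columns)

print(loadings)
print(pca.explained_variance_ratio_)
```

Here the first component captures almost all of the variation (overall size), while the second is a near-zero variance component that could be dropped or examined for anomalies.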

#### 10.2.3.6. Target Encoding
Most of the techniques discussed so far are for numerical features. A technique used to encode categorical features is called “target encoding”
* A target encoding is any kind of encoding that replaces a feature's categories with some number derived from the target
* Target encoding is sometimes called _`mean encoding`_ or _`likelihood encoding`_ or _`impact encoding`_ or _`leave-one-out encoding`_ or _`bin counting`_ (if applied to a binary target)
* Use cases for Target Encoding:
  - **High-cardinality features**: A feature with a large number of categories can be troublesome to encode: a one-hot encoding would generate too many features and alternatives, like a label encoding, might not be appropriate for that feature. A target encoding derives numbers for the categories using the feature's most important property: its relationship with the target
  - **Domain-motivated features**: From prior experience, we might suspect that a categorical feature should be important even if it scored poorly with a feature metric. A target encoding can help reveal a feature's true informativeness
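
A simple mean encoding can be sketched with pandas as below. The `make` and `price` columns and their values are made up, and falling back to the global mean for unseen categories is one possible choice, not a prescribed one.

```python
import pandas as pd

# Hypothetical training data: a categorical feature and a numeric target
train = pd.DataFrame({
    "make": ["audi", "audi", "bmw", "bmw", "bmw", "kia"],
    "price": [30.0, 34.0, 40.0, 44.0, 48.0, 15.0],
})
test = pd.DataFrame({"make": ["bmw", "kia", "saab"]})

# Learn the mean target value per category from the training rows only
category_means = train.groupby("make")["price"].mean()
global_mean = train["price"].mean()

# Encode the test rows; categories never seen in training get the global mean
test["make_encoded"] = test["make"].map(category_means).fillna(global_mean)

print(test)
```

Since the encoding is derived from the target, applying it to the same rows it was learned from risks overfitting; learning it on a separate portion of the training data is a common precaution.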

### 10.2.4. Dimensionality Reduction
* Creating compact projections of the data
* Dimensionality reduction can be considered as part of Feature Selection as well

### 10.2.5. Split datasets for train-test
* Separate datasets into Input (X) and output (y) components, if needed
* Split the preprocessed datasets into train and test datasets, for model development, training, and evaluations
  - _`train-test-split`_:
    - A procedure to evaluate the performance of models on a large dataset (less computational cost)
    - The dataset split can be done during the Preprocessing stage and the train & test datasets should be used at later stages
    - _`Use train_test_split()`_ function from _sklearn.model_selection_
    - Conceptually, this is a single split of the data into 2 parts (1 train and 1 test dataset)
  - _`LOOCV (Leave-One-Out-Cross-Validation)`_:
    - This is the opposite extreme to `train-test-split`: k-fold cross-validation where `k` is set to the total number of observations (`k=n`) such that each observation is given a chance to be held out of the dataset
    - _`Use LeaveOneOut()`_ class from _sklearn.model_selection_
* Handle outliers (remove them or replace them with a suitable summary statistic such as `mean/median/mode`)
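
Both procedures can be sketched as follows. The toy arrays, the 80/20 split, and the random seed are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, train_test_split

# Toy dataset: 10 observations with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Single split: 80% of the rows for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# LOOCV: one split per observation, each holding out a single row
loo = LeaveOneOut()
n_splits = loo.get_n_splits(X)
for train_idx, test_idx in loo.split(X):
    pass  # fit on X[train_idx], evaluate on X[test_idx]

print(X_train.shape, X_test.shape, n_splits)
```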

### 10.2.6. Data Transforms
* Changing the scale or distribution of variables.
* Data transformation can be considered as part of Feature engineering as well

#### **10.2.6.1. Numerical type**
#### **Change scale**

#### 10.2.6.1.1. Normalization a.k.a. Feature Scaling
* Rescales numerical values to a specific range, typically 0-1, so that features with different magnitudes become comparable
* Feature Scaling is a process used to normalize the range of independent variables so that they can be mapped onto the same scale
* In stochastic gradient descent, feature scaling can improve the convergence speed of the algorithm
* In SVMs, it can reduce the time to find support vectors
* Exceptions:
  - `Decision trees` and `random forests` are scale-invariant algorithms where we don't need to worry about feature scaling
  - Similarly, `Naive Bayes` and `Linear Discriminant Analysis` are not affected by feature scaling
  - In short, `algorithms that rely neither on distances between samples nor on gradient-based optimization are generally not affected by feature scaling`

* Some of the common methods are:
  - _`Log`_ transform
  - _`MinMaxScaler`_ transform
    - This technique of rescaling is also called min-max scaling or min-max normalization
    - Normalization refers to the rescaling of the features to a range of [0, 1], which is a special case of min-max scaling
    - We do this by subtracting the minimum value ($x_{\min}$) from each value and dividing the result by the maximum value ($x_{\max}$) minus the minimum value ($x_{\min}$)
    - Mathematically, the new value $x^{(i)}_{\text{norm}}$ of sample $x^{(i)}$ can be calculated as follows:

      $$x^{(i)}_{\text{norm}} = \frac{x^{(i)} - x_{\min}}{x_{\max} - x_{\min}}$$

    - Here, $x^{(i)}$ is a particular sample value. $x_{\max}$ and $x_{\min}$ are the maximum and minimum feature values in a column. Scikit-Learn provides a transformer called _`MinMaxScaler`_ for this task. It has a `feature_range` parameter to adjust the range of values. This estimator fits and transforms each feature variable individually such that it is in the given range (between zero and one by default) on the training set
    - As with all the other transformers, we fit this transformer to the training data only, not to the full data set (including the test set to avoid data leakage). Only then we can use them to transform the training set and the test set and new data. The syntax for implementing the min-max scaling procedure in Scikit-Learn is given as follows:
      ```
      from sklearn.preprocessing import MinMaxScaler
      ms = MinMaxScaler()
      X_train_ms = ms.fit_transform(X_train)
      X_test_ms  = ms.transform(X_test)
      ```
  - _`MaxAbsScaler`_ transform
    - In this feature rescaling task, we rescale each feature by its maximum absolute value
    - So, the maximum absolute value of each feature in the training set will be 1.0
    - It does not shift or center the data, and hence it does not destroy sparsity
    - Scikit-Learn provides _`MaxAbsScaler`_ transformer for this task
    - The syntax for implementing max-abs scaling procedure in Scikit-Learn is given as follows:
      ```
      from sklearn.preprocessing import MaxAbsScaler
      mabs = MaxAbsScaler()
      X_train_mabs = mabs.fit_transform(X_train)
      X_test_mabs  = mabs.transform(X_test)
      ```
  - _`Normalizer`_ transform
    - In this feature scaling task, we rescale each observation to a length of 1 (a unit norm). 
    - Scikit-Learn provides the _`Normalizer`_ class for this task
    - In this task, we scale the components of a feature vector such that the complete vector has a length of one
    - This usually means dividing each component by the Euclidean length (magnitude) of the vector
    - Mathematically, normalization can be expressed by the following equation:

      $$x^{(i)}_{\text{norm}} = \frac{x^{(i)}}{\lVert x^{(i)} \rVert}$$

      where $x^{(i)}$ is a particular sample (a row vector of feature values), $x^{(i)}_{\text{norm}}$ is its normalized value, and $\lVert x^{(i)} \rVert$ is the corresponding Euclidean length of the vector. The syntax for normalization is quite similar to standardization given as follows:
      ```
      from sklearn.preprocessing import Normalizer
      nm = Normalizer()
      X_train_nm = nm.fit_transform(X_train)
      X_test_nm  = nm.transform(X_test)
      ```
  - _`Binarizer`_ transform
    - In this feature scaling procedure, we binarize the data (set feature values equal to 0 or 1) according to a threshold
    - Using a binary threshold, we transform our data by marking the values above it to 1 and those equal to or below it to 0
    - Scikit-Learn provides the _`Binarizer`_ class for this purpose; its `threshold` parameter defaults to 0.0. The syntax for binarizing the data follows the same rules as above and is given below:
      ```
      from sklearn.preprocessing import Binarizer
      binr = Binarizer()
      X_train_binr = binr.fit_transform(X_train)
      X_test_binr  = binr.transform(X_test)
      ```
  - _`Decimal`_ scaling: Scale the data by moving the decimal point of values of the attribute
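
Decimal scaling has no dedicated scikit-learn transformer, so the sketch below implements one common definition with NumPy: divide by the smallest power of 10 that brings every absolute value below 1. The sample values are made up, and the code assumes the largest absolute value is greater than zero.

```python
import numpy as np

# Hypothetical feature values
values = np.array([-986.0, 120.0, 45.0])

# Smallest integer j such that max(|x|) / 10**j < 1 (assumes max_abs > 0)
max_abs = np.max(np.abs(values))
j = int(np.floor(np.log10(max_abs))) + 1

# Move the decimal point j places to the left
scaled = values / (10 ** j)

print(j, scaled)
```

As with the other scalers, `j` would be determined on the training set and reused for the test set.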

**Applications of Feature scaling:**
* Generally, real-world datasets contain features that differ in magnitudes, units, and ranges
* We should perform Normalization when the scale of a feature is irrelevant or misleading
* The algorithms which depend on Euclidean distance measures are sensitive to magnitudes
* In this case, feature scaling helps to weigh all the features equally
* Suppose a feature in the dataset is relatively big in magnitude as compared to other features, then in algorithms where Euclidean distance is measured, this big scaled feature becomes dominating and needs to be normalized
* Examples of algorithms and their sensitivity to feature scaling:
  - _`K-Means`_: K-Means algorithm is based on the Euclidean distance measure, so, feature scaling matters
  - _`K-Nearest-Neighbours`_: Requires feature scaling because it is also distance-based
  - _`Principal Component Analysis (PCA)`_: In the PCA algorithm, we look for the directions of maximum variance, so features with larger scales would dominate. Here feature scaling is required
  - _`Gradient Descent`_: In the gradient descent algorithm, convergence is usually faster after feature scaling because the parameter (theta) updates behave similarly across features
  - _`Naive Bayes, Linear Discriminant Analysis, and Tree-Based models (Decision Trees and Random Forests) are largely unaffected by feature scaling because they do not rely on distances between samples.`_

#### 10.2.6.1.2. Standardization (_`StandardScaler`_)
* Standardize numerical data using the scale and center options (i.e., mean of 0 and SD of 1)
* It can be more useful for many ML algorithms, especially for optimization algorithms such as gradient descent
* In standardization, first, we determine the distribution mean and standard deviation for each feature. 
* Next, we subtract the mean from each feature, then we divide the values of each feature by its standard deviation
* So, in standardization, we center the feature columns at mean 0 with a standard deviation of 1 so that the feature columns have the same parameters as a standard normal distribution, which makes it easier to learn the weights. Note that the shape of the distribution itself is not changed
* Mathematically, standardization can be expressed by the following equation:
  ```
  x(i)std =  ( x(i)- μx)/(σx )
  ```
* Here, `x(i)` is a particular sample value, `x(i)std` is its standardized value, `μx` is the sample mean of a particular feature column and `σx` is the corresponding standard deviation
* Min-max scaling scales the data to a limited range of values, whereas standardization does not bind values to a specific range
* As a result, a single extreme value does not compress the remaining values into a narrow interval, so standardization maintains useful information about outliers and is less sensitive to them than min-max scaling
* Scikit-Learn provides a transformer called _`StandardScaler`_ for standardization
* The syntax to implement standardization is quite similar to min-max scaling given as follows:
  ```
  from sklearn.preprocessing import StandardScaler
  ss = StandardScaler()
  X_train_ss = ss.fit_transform(X_train)
  X_test_ss  = ss.transform(X_test)
  ```
* Again, we should fit the _`StandardScaler`_ class only once on the training data set and use those parameters to transform the test set or new data set to avoid data leakage
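
The standardization formula above can be checked by hand against _`StandardScaler`_. This is a small sketch with made-up arrays (the values of `X_train` and `X_test` are assumptions); note that the mean and standard deviation come from the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data, not from the book
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 600.0]])
X_test = np.array([[2.0, 300.0]])

# Manual standardization with training-set statistics
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)          # population SD (ddof=0), as StandardScaler uses
X_train_manual = (X_train - mu) / sigma
X_test_manual = (X_test - mu) / sigma

# Same result from the transformer
ss = StandardScaler().fit(X_train)
X_train_ss = ss.transform(X_train)
X_test_ss = ss.transform(X_test)
print(np.allclose(X_train_ss, X_train_manual), np.allclose(X_test_ss, X_test_manual))
```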

#### 10.2.6.1.3. Robust (_`RobustScaler`_)
* _`StandardScaler`_ can often give misleading results when the data contain outliers
* Outliers can often influence the sample mean and variance and hence give misleading results
* In such cases, it is better to use a scaler that is robust against outliers
* Scikit-Learn provides a transformer called _`RobustScaler`_ for this purpose
* The _`RobustScaler`_ is very similar to _`MinMaxScaler`_. The difference lies in the parameters used for scaling. While _`MinMaxScaler`_ uses the minimum and maximum values for rescaling, _`RobustScaler`_ uses the interquartile range (IQR) for the same
* Mathematically, the scaled value `x(i)scaled` of a sample `x(i)` can be calculated as follows:
  ```
  x(i)scaled  =  (x(i) - Q1(x)) / (Q3(x) - Q1(x))
  ```
* Here, `x(i)` is a particular sample value, `x(i)scaled` is its scaled value, and `Q1(x)` and `Q3(x)` are the 1st quartile (25th percentile) and 3rd quartile (75th percentile) of the feature respectively. The difference `Q3(x) - Q1(x)` is called the `IQR (Interquartile Range)`
* Note that Scikit-Learn's _`RobustScaler`_ by default subtracts the median, rather than `Q1(x)`, before dividing by the IQR
* The syntax for implementing scaling using _`RobustScaler`_ in Scikit-Learn is given as follows:
  ```
  from sklearn.preprocessing import RobustScaler
  rb = RobustScaler()
  X_train_rb = rb.fit_transform(X_train)
  X_test_rb  = rb.transform(X_test)
  ```
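
To see why this scaler is called robust, the sketch below compares it with _`StandardScaler`_ on a tiny made-up column containing one outlier (the data is an assumption for illustration). With its default settings, _`RobustScaler`_ stores the median in `center_` and the IQR in `scale_`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100 is an outlier

rb = RobustScaler().fit(x)
x_rb = rb.transform(x)                   # (x - median) / IQR
x_ss = StandardScaler().fit_transform(x)

# Spread of the four "normal" values after each scaling
spread_rb = float(x_rb[:4].max() - x_rb[:4].min())
spread_ss = float(x_ss[:4].max() - x_ss[:4].min())
print(rb.center_, rb.scale_, spread_rb, spread_ss)
```

The outlier inflates the mean and standard deviation, so the standardized inliers end up squeezed together, while the median and IQR are barely affected by it.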

#### **Change distribution**

#### 10.2.6.1.4. Power (_`PowerTransformer`_)
* A power transform makes the probability distribution of a variable more Gaussian-like
* This power transform is available in the scikit-learn Python machine learning library via the _`PowerTransformer`_ class with the methods:
  - _`Box-Cox`_ transform: Automatic power transform that requires strictly positive input values
  - _`Yeo-Johnson`_ transform: Automatic power transform that also supports zero and negative values (the default method)
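
A minimal usage sketch, assuming a synthetic right-skewed (log-normal) feature generated only for illustration. Box-Cox is used because the values are strictly positive; skewness is measured before and after with `scipy.stats.skew`.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Synthetic, strictly positive, right-skewed feature
X_train = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

pt = PowerTransformer(method="box-cox")   # standardize=True by default
X_train_pt = pt.fit_transform(X_train)

skew_before = float(skew(X_train.ravel()))
skew_after = float(skew(X_train_pt.ravel()))
print(skew_before, skew_after, pt.lambdas_)
```

The fitted exponent is stored in `lambdas_`; for test data only `pt.transform(...)` would be called.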

#### 10.2.6.1.5. Quantile (_`QuantileTransformer`_)
* Numerical input variables may have a highly skewed or non-standard distribution
* The quantile transform provides an automatic way to transform a numeric input variable to have a different data distribution, which in turn, can be used as input to a predictive model
* A quantile transform will map a variable’s probability distribution to another probability distribution
* The quantile function ranks or smooths out the relationship between observations and can be mapped onto other distributions, such as the uniform or normal distribution
* This quantile transform is available in the scikit-learn Python machine learning library via the _`QuantileTransformer`_ class
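
A short sketch of the class in use, assuming synthetic exponential data (the arrays and the choice of `n_quantiles=100` are illustrative assumptions). The transformer is fitted on the training data and reused for the test data.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X_train = rng.exponential(scale=2.0, size=(500, 1))   # skewed synthetic feature
X_test = rng.exponential(scale=2.0, size=(100, 1))

qt = QuantileTransformer(n_quantiles=100, output_distribution="normal",
                         random_state=0)
X_train_qt = qt.fit_transform(X_train)   # learn the quantiles on training data
X_test_qt = qt.transform(X_test)         # reuse them for the test data

print(float(np.median(X_train)), float(np.median(X_train_qt)))
```

Setting `output_distribution="uniform"` (the default) would map the values to the range 0 to 1 instead.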

#### 10.2.6.1.6. Discretize (_`KBinsDiscretizer`_)
* This is also called _`BINNING`_ i.e., a grouping of values into 'bins' (e.g. High, Medium, Low)
* The discretization transform provides an automatic way to change a numeric input variable to have a different data distribution, which in turn can be used as input to a predictive model
* Different methods for grouping the values into k discrete bins can be used; common techniques include:
  - _`Uniform`_: Each bin has the same width in the span of possible values for the variable
  - _`Quantile`_: Each bin has the same number of values, split based on percentiles
  - _`Clustered`_: Clusters are identified and examples are assigned to each group
* The discretization transform is available in the scikit-learn Python machine learning library via the _`KBinsDiscretizer`_ class
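
The three grouping methods correspond to the `strategy` values `"uniform"`, `"quantile"`, and `"kmeans"` of _`KBinsDiscretizer`_. The sketch below applies each of them to a small made-up column (the data and `n_bins=3` are assumptions) so the resulting bin indices can be compared.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Illustrative feature with two large values
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [50.0], [100.0]])

results = {}
for strategy in ("uniform", "quantile", "kmeans"):
    kbd = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    # encode="ordinal" returns the bin index (0, 1, 2) for each value
    results[strategy] = kbd.fit_transform(X_train).ravel().astype(int).tolist()
    print(strategy, results[strategy])
```

With equal-width bins most values fall into the first bin here, which is why the quantile or clustered strategies are often preferred for skewed data.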

#### **Engineer**

#### 10.2.6.1.7. Polynomial (_`PolynomialFeatures`_)
* Polynomial features are those features created by raising existing features to an exponent
* The polynomial features transform is available in the scikit-learn Python machine learning library via the _`PolynomialFeatures`_ class

#### **10.2.6.2. Categorical type**

#### **Nominal**

#### 10.2.6.2.1. One-Hot encode (_`OneHotEncoder`_)
* For categorical variables that do not have a natural rank-ordering, i.e., no relationships
  ```
  Eg., 'red' is (1,0,0), 'green' is (0,1,0), and 'blue' is (0,0,1)
  ```
* _`LabelEncoder`_ treats class labels as categorical data with no order associated with it
* A problem arises when we apply the same approach to transform a nominal input variable with _`LabelEncoder`_
* Refer to the example in section 10.2.6.2.4. Label encode, where the values are encoded as 0, 1, 2 for 'high', 'low', 'medium' respectively. Integer codes are OK for class labels and for ordinal variables (provided the codes follow the true order), but not for nominal variables
* Although there is no order involved, a learning algorithm will assume that _`high < low < medium`_. This is a wrong assumption and it will not produce the desired results
* To fix this issue, a common solution is to use a technique called _`one-hot-encoding`_
* In this technique, we create a new dummy feature for each unique value in the nominal feature column. The value of the dummy feature is equal to one when the unique value is present and zero otherwise. Similarly, for another unique value, the value of the dummy feature is equal to one when the unique value is present and zero otherwise. This is called one-hot encoding because only one dummy feature will be equal to one (hot), while the others will be zero (cold)
* Scikit-Learn provides a _`OneHotEncoder`_ transformer to convert categorical values (strings or integers) into one-hot vectors. For example:
  ```
  import pandas as pd
  from sklearn.preprocessing import OneHotEncoder

  x   = ['high', 'medium', 'low', 'low', 'high']
  df  = pd.DataFrame(x, columns=['x'])
  col = df['x'].values.reshape(-1,1)        # the encoder expects 2D input
  # Categories are sorted alphabetically: high, low, medium
  ohe = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
  df  = ohe.fit_transform(col)
  <<Output>> array([[1., 0., 0.],
                    [0., 0., 1.],
                    [0., 1., 0.],
                    [0., 1., 0.],
                    [1., 0., 0.]])
  ```
* By default, the output is a SciPy sparse matrix, instead of a NumPy array. This way of output is very useful when we have categorical attributes with thousands of categories. If there are a lot of zeros, a sparse matrix only stores the location of the non-zero elements. So, sparse matrices are a more efficient way of storing large datasets. It is supported by many Scikit-Learn functions
* To convert it to a dense NumPy array, we should call the _`toarray()`_ method on the result. To omit the _`toarray()`_ step, we could alternatively initialize the encoder as:
  ```
  OneHotEncoder( … , sparse_output=False)   # Returns a regular NumPy array (the parameter is named 'sparse' on scikit-learn < 1.2)
  ```
* Another, more convenient way to create those dummy features via one-hot encoding is to use the _`pandas.get_dummies()`_ function. By default, _`get_dummies()`_ converts only string (object) and categorical columns and leaves all other columns unchanged in a dataframe
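
A minimal sketch of the pandas route, using a made-up dataframe (the column names `size` and `price` are assumptions). The numeric column passes through untouched while the string column is expanded.

```python
import pandas as pd

df = pd.DataFrame({"size": ["high", "medium", "low"],
                   "price": [10.5, 7.0, 3.2]})

# dtype=int gives 0/1 columns instead of booleans
dummies = pd.get_dummies(df, columns=["size"], dtype=int)
print(dummies)
```

One caveat with this approach: the dummy columns are derived from whatever categories are present in the dataframe, so the training and test dataframes may need to be aligned to the same columns afterwards.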

#### 10.2.6.2.2. Dummy encode
* Dummy encoding avoids the redundancy in One-Hot encoding by dropping one of the dummy features; the dropped category is represented by all zeros
  ```
  Eg., 'red' is (1,0), 'green' is (0,1), and 'blue' is (0,0)
  ```
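
One way to obtain such an encoding is the `drop_first=True` option of `pandas.get_dummies()`; the small dataframe below is an assumption for illustration. pandas drops the first category in sorted order, which here is 'blue', so the remaining columns appear in the order green, red.

```python
import pandas as pd

colors = pd.DataFrame({"color": ["red", "green", "blue"]})

# k categories -> k-1 dummy columns; 'blue' becomes the all-zero baseline
dummy = pd.get_dummies(colors, columns=["color"], drop_first=True, dtype=int)
print(dummy)
```

In Scikit-Learn, `OneHotEncoder(drop='first')` has the same effect.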

#### 10.2.6.2.3. Label binarize (_`LabelBinarizer`_)
* We can accomplish two tasks (encoding multi-class labels to integer categories, then from integer categories to one-hot vectors or binary labels) in one shot using Scikit-Learn’s _`LabelBinarizer`_ class. In other words, it combines `Label encode + One-hot encode`. For example:
  ```
  import pandas as pd
  from sklearn.preprocessing import LabelBinarizer

  x   = ['high', 'medium', 'low', 'low', 'high']
  df  = pd.DataFrame(x, columns=['x'])
  col = df['x'].values                 # 1D array of labels
  lb  = LabelBinarizer()
  df  = lb.fit_transform(col)          # columns follow lb.classes_: high, low, medium
  <<Output>> array([[1, 0, 0],
                    [0, 0, 1],
                    [0, 1, 0],
                    [0, 1, 0],
                    [1, 0, 0]])
  ```
* This returns a dense NumPy array by default. We can get a sparse matrix by passing _`sparse_output=True`_ to the _`LabelBinarizer`_ constructor

#### **Ordinal**

#### 10.2.6.2.4. Label encode (_`LabelEncoder`_)
* Many ML algorithms require that class labels are encoded as integers, and most estimators for classification in Scikit-Learn convert class labels to integers internally
* Scikit-Learn provides a transformer for this task called _`LabelEncoder`_. For example:
  ```
  import pandas as pd
  from sklearn.preprocessing import LabelEncoder

  x   = ['high', 'medium', 'low', 'low', 'high']
  df  = pd.DataFrame(x, columns=['x'])
  col = df['x'].values
  le  = LabelEncoder()
  df['x'] = le.fit_transform(col)      # classes are sorted: high=0, low=1, medium=2
  <<Output>> 0 2 1 1 0
  ```
* We can use the _`inverse_transform`_ method to transform the integer class labels back into their original string representation
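
A short round-trip sketch using the same labels as above (the variable names `codes` and `labels` are only illustrative):

```python
from sklearn.preprocessing import LabelEncoder

original = ["high", "medium", "low", "low", "high"]

le = LabelEncoder()
codes = le.fit_transform(original)        # strings -> integers
labels = le.inverse_transform(codes)      # integers -> strings

print(list(le.classes_))   # position in this list is the integer code
print(codes, labels)
```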

#### 10.2.6.2.5. Ordinal encode (_`OrdinalEncoder`_)
* For categorical variables that have a natural rank/ordering; each unique category value is assigned an integer value
  ```
  Eg., 'low' is 1, 'medium' is 2, and 'high' is 3
  ```
* For example:
  ```
  import pandas as pd
  from sklearn.preprocessing import OrdinalEncoder

  x1   = ['high1', 'medium1', 'low1', 'low1', 'high1']
  x2   = ['high2', 'medium2', 'low2', 'low2', 'high2']
  x3   = ['high3', 'medium3', 'low3', 'low3', 'high3']
  data = {'x1':x1, 'x2':x2, 'x3':x3}
  df   = pd.DataFrame(data)
  col  = df[['x1', 'x2', 'x3']]
  oe   = OrdinalEncoder()              # by default, categories are sorted alphabetically per column
  df   = oe.fit_transform(col)
  <<Output>> array([[0., 0., 0.],
                    [2., 2., 2.],
                    [1., 1., 1.],
                    [1., 1., 1.],
                    [0., 0., 0.]])
  ```
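
By default the integer codes follow the alphabetical order of the categories, which rarely matches their real rank. A minimal sketch, assuming the intended order is low < medium < high, passes that order through the `categories` parameter:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"x": ["high", "medium", "low", "low", "high"]})

# One list of ordered categories per column (assumed order: low < medium < high)
oe = OrdinalEncoder(categories=[["low", "medium", "high"]])
encoded = oe.fit_transform(df[["x"]])

print(encoded.ravel())
```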

**Differences between LabelEncoder and OrdinalEncoder**
```
LabelEncoder                                      OrdinalEncoder
----------------------------------------------    ---------------------------------------------------------
- Deals with 1D data, i.e., shape (n_samples,)    - Deals with 2D data, i.e., shape (n_samples, n_features)
- Used to encode the 'target variable'            - Used to encode the 'independent features'
```

_**IMPORTANT NOTE:**_
* _Fit any feature scaling or transformation technique (such as normalization or standardization etc.) on the training dataset only, then use the fitted transformer to transform both the training & testing datasets, to prevent DATA LEAKAGE. In other words, `DO NOT fit data transformation techniques on the full dataset before splitting it into training & testing datasets.`_

Example 1:
* When encoding Categorical variables (ordinal or one-hot or dummy) using _`LabelEncoder`_ or _`category_encoders`_, first do it on the training dataset, then propagate to the test dataset
  ```
  import category_encoders as ce
  encoder = ce.OrdinalEncoder(cols=['col1', 'col2', 'col3', 'col4', 
                                    'col5', 'col6'])
  X_train = encoder.fit_transform(X_train)
  X_test  = encoder.transform(X_test)     # Note, 'transform', not 'fit_transform'
  ```

Example 2:
* When scaling the numerical variables, do the following
  ```
  from sklearn.preprocessing import RobustScaler
  scaler  = RobustScaler()
  X_train = scaler.fit_transform(X_train)
  X_test  = scaler.transform(X_test)
  ```
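
One way to make this rule hard to break is to wrap the transformer and the model in a Scikit-Learn pipeline, so that fitting always uses the training data only. The sketch below assumes synthetic data from `make_classification` and a logistic regression model, both chosen purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Split first, transform afterwards
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)                  # scaler statistics come from X_train only
test_accuracy = model.score(X_test, y_test)  # X_test is only transformed, never fitted

scaler = model.named_steps["standardscaler"]
print(np.allclose(scaler.mean_, X_train.mean(axis=0)), test_accuracy)
```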

**Summary of Data Transforms tasks:**
![](figures/MLPG-DataTransforms.png)

Some of the Transformers provided by scikit-learn are:
* Normalize/Feature Scaling: _`MinMaxScaler(), MaxAbsScaler(), Normalizer(), Binarizer()`_, and a log transform via _`FunctionTransformer(np.log1p)`_
* Standardize: _`StandardScaler()`_
* Robust: _`RobustScaler()`_
* Power: _`PowerTransformer()`_
* Quantile: _`QuantileTransformer()`_
* Discretize: _`KBinsDiscretizer()`_
* Polynomial: _`PolynomialFeatures()`_
* One-Hot encode: _`OneHotEncoder()`_
* Label binarize: _`LabelBinarizer()`_
* Label encode: _`LabelEncoder()`_
* Ordinal encode: _`OrdinalEncoder()`_

### 10.2.7. Handling Imbalanced classes
* Any real-world dataset may come with several problems and the imbalanced classes are one of them
* The problem of imbalanced classes arises when one set of classes dominates over another set of classes
* The former is called the majority class while the latter is called the minority class
* This is a very common problem in machine learning where we have datasets with a disproportionate ratio of observations in each class

#### 10.2.7.1. Problems with imbalanced learning
* The problem of imbalanced classes is very common in real-world datasets
* Learning from imbalanced data is referred to as imbalanced learning, and dedicated approaches have been developed for it
* Significant problems may arise with imbalanced learning. These are as follows:
  - It causes the machine learning model to be more biased towards the majority class
  - It causes poor classification of minority classes. Hence, "accuracy" alone is no longer a useful measure of performance
  - If the imbalanced classes problem is not addressed properly, then we may end up with a high accuracy that is meaningless, because a model can achieve it simply by predicting the majority class. Hence, accuracy no longer reliably measures model performance on such a dataset
  - There may be inherent complex characteristics in the dataset. Imbalanced learning from such datasets requires new approaches, principles, tools, and techniques. But it cannot guarantee an efficient solution to the business problem
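
The accuracy problem can be illustrated with a tiny made-up example: 95 negative and 5 positive labels, and a "model" that always predicts the majority class (both are assumptions for illustration).

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([0] * 95 + [1] * 5)   # 5% minority class
y_pred = np.zeros_like(y_true)          # always predict the majority class

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)      # share of minority samples that were found

print(acc, rec)
```

The accuracy looks excellent although not a single minority sample is detected, which is why metrics such as recall, precision, or F1 are usually reported for imbalanced problems.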

#### 10.2.7.2. Example of imbalanced classes
* The problem of imbalanced classes may appear in many areas including the following:
  - Disease detection, Fraud detection, Anomaly detection
  - Earthquake prediction, Churn prediction, Intrusion prediction
  - Spam filtering, etc.

#### 10.2.7.3. Approaches to handle imbalanced classes
There are several methods to deal with the imbalanced class problems and the common ones are listed below
* Undersampling methods 
  1. Random
  2. Informative
  3. NearMiss
  4. Tomek links
  5. Edited nearest neighbors
  6. Cluster centroids
* Oversampling methods 
  1. Random 
  2. Cluster-based
  3. Synthetic data generation (SMOTE & ADASYN)
* Other methods
  1. Cost-sensitive learning
  2. Algorithmic Ensemble methods
  3. Imbalanced learn

#### **Undersampling methods**

The undersampling methods work with the majority class. They reduce the number of observations from the majority class, in the simplest case by eliminating instances at random, to make the dataset balanced. This can result in a significant loss of information. These methods are applicable when the dataset is huge and reducing the number of training samples is acceptable.

#### 10.2.7.3.1. Random undersampling
* In the random undersampling method, we balance the imbalanced class distribution by choosing and eliminating observations from the majority class to make the dataset balanced. This approach has some pros and cons:
* `Advantages:`
  - If the dataset is huge, we might face run time and storage problems. Undersampling reduces the number of training samples and therefore improves both
* `Disadvantages:`
  - This method can discard potentially useful information that could be important for building the classifiers
  - The sample chosen by random undersampling may be a biased one. It may not be an accurate representation of the population, which can lead to inaccurate results on the actual dataset
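
A minimal NumPy sketch of random undersampling, assuming a synthetic binary dataset with 90 majority and 10 minority samples (the data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))           # synthetic features
y = np.array([0] * 90 + [1] * 10)       # class 0 is the majority

maj_idx = np.flatnonzero(y == 0)
min_idx = np.flatnonzero(y == 1)

# Keep a random subset of the majority class, as large as the minority class
keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
idx = np.concatenate([keep_maj, min_idx])
rng.shuffle(idx)

X_under, y_under = X[idx], y[idx]
print(np.bincount(y_under))
```

This would be applied to the training set only; the test set keeps its original class distribution.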

#### 10.2.7.3.2. Informative undersampling
* In informative undersampling, we follow a pre-defined selection criterion to remove the observations from the majority class
* Within this informative undersampling technique, we have EasyEnsemble and BalanceCascade algorithms. These algorithms produce good results and are relatively easy to follow
* `Easy ensemble:` This technique extracts several subsets of independent samples with replacement from the majority class. Then it develops multiple classifiers based on the combination of each subset with the minority class. The subsets are drawn without any feedback from the classifiers, so the sampling step works like an unsupervised strategy
* `BalanceCascade:` This method takes a supervised learning approach where it develops an ensemble of classifiers and systematically selects which majority class examples to include in the ensemble

#### 10.2.7.3.3. NearMiss undersampling
* In near-miss undersampling, we only keep the data points from the majority class that are necessary to distinguish the majority class from other classes
* `NearMiss-1:`
  - In the NearMiss-1 sampling technique, we select samples from the majority class for which the average distance of the N closest samples of a minority class is the smallest
* `NearMiss-2:`
  - In the NearMiss-2 sampling technique, we select samples from the majority class for which the average distance of the N farthest samples of a minority class is the smallest

#### 10.2.7.3.4. Tomek links undersampling
* A Tomek link can be defined as a pair of observations of different classes that are nearest neighbors of each other. Removing these points (typically the majority class observation of each pair) increases the separation gap between the two classes, so the algorithms can produce more reliable output
* This technique will not produce a balanced dataset. It will simply clean the dataset by removing the Tomek links
* It may result in an easier classification problem. Thus, by removing the Tomek links, we can improve the performance of the classifier even if we don't have a balanced dataset
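
The definition of a Tomek link can be sketched with Scikit-Learn's `NearestNeighbors`. The helper `tomek_link_pairs` and the one-dimensional toy data are assumptions for illustration, and the sketch assumes there are no duplicate points (so each point's first neighbor is itself).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_link_pairs(X, y):
    """Hypothetical helper: index pairs that are mutual nearest neighbors
    and belong to different classes."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # Column 0 is the point itself, column 1 its nearest other point
    nearest = nn.kneighbors(X, return_distance=False)[:, 1]
    return [(i, int(j)) for i, j in enumerate(nearest)
            if nearest[j] == i and y[i] != y[j] and i < j]

X = np.array([[0.0], [1.1], [2.0], [2.2], [5.0], [5.7]])
y = np.array([0, 0, 0, 1, 1, 1])

pairs = tomek_link_pairs(X, y)
print(pairs)
```

Points 2 and 3 sit right at the class boundary and are each other's nearest neighbors, so they form the only Tomek link in this toy data; points 4 and 5 are mutual neighbors too, but share a class.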

#### 10.2.7.3.5. Edited nearest neighbors undersampling
* In this type of undersampling technique, we apply the nearest neighbors algorithm
* We modify the dataset by removing samples that differ from their neighborhood. We select a subset of data to be under-sampled
* For each sample in the subset, the nearest neighbors are computed and if the selection criteria are not fulfilled, the sample is removed
* This technique is very similar to the Tomek links approach. We are not trying to achieve class balance; instead, we try to remove noisy observations in the dataset to make for an easier classification problem

#### 10.2.7.3.6. Cluster centroids undersampling
* In this technique, we perform undersampling by generating centroids based on clustering methods
* The dataset will be grouped by similarity, to preserve information

#### **Oversampling methods**

The Oversampling methods work with the minority class. In these methods, we duplicate random instances of the minority class. So, it replicates the observations from minority classes to balance the data. It is also known as upsampling. It may result in overfitting due to duplication of data points. 

#### 10.2.7.3.7. Random oversampling
* In random oversampling, we balance the data by randomly oversampling the minority class
* `Advantages:`
  - It leads to no information loss
  - This method often outperforms undersampling
* `Disadvantages:`
  - This method increases the likelihood of overfitting as it replicates the minority class observations
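
A minimal sketch of random oversampling using `sklearn.utils.resample`, assuming a synthetic dataset with 90 majority and 10 minority samples (data and variable names are illustrative):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))           # synthetic features
y = np.array([0] * 90 + [1] * 10)       # class 1 is the minority

X_maj, X_min = X[y == 0], X[y == 1]

# Draw minority samples with replacement until both classes have equal size
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_over = np.vstack([X_maj, X_min_up])
y_over = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(np.bincount(y_over))
```

Every added row is an exact copy of an existing minority row, which is the source of the overfitting risk mentioned above.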

#### 10.2.7.3.8. Cluster-based oversampling
* In this method, the K-Means clustering algorithm is independently applied to the minority and majority class observations
* Thus, we will identify clusters in the dataset. Subsequently, each cluster is oversampled such that all clusters of the same class have an equal number of instances and all classes have the same size
* `Advantages:`
  - This clustering technique helps to overcome the challenge of imbalanced class distribution.
  - Also, this technique overcomes the challenges of within-class imbalance, where a class is composed of different sub-clusters and each sub-cluster does not contain the same number of examples
* `Disadvantages:`
  - The disadvantage associated with this technique is the possibility of overfitting the training data

#### 10.2.7.3.9. Synthetic data generation oversampling (SMOTE & ADASYN)
* In the synthetic data generation technique, we overcome the data imbalances by generating artificial data

`Synthetic Minority Oversampling Technique or SMOTE:`
* In the context of synthetic data generation, there is a powerful and widely used method known as `SMOTE`
* SMOTE generates new observations by interpolation between existing observations in the dataset
* In SMOTE, we use a pre-specified criterion and synthetically generate minority class observations
* These synthetic instances are then added to the original dataset. The new dataset is then used as a sample to train the classification models
* This technique is followed to avoid overfitting which occurs when exact replicas of minority instances are added to the main dataset
* Under this technique, artificial data is created based on feature space
* Artificial data is generated with bootstrapping and the k-nearest neighbors algorithm. It works as follows:
  1. First of all, we take the difference between the feature vector (sample) under consideration and one of its nearest neighbors
  2. Then we multiply this difference by a random number between 0 and 1
  3. Then we add this scaled difference to the feature vector under consideration
  4. Thus, we select a random point along the line segment between the two feature vectors
* `Advantages:`
  1. This technique reduces the problem of overfitting
  2. It does not result in the loss of useful information
* `Disadvantages:`
  1. When generating synthetic examples, SMOTE does not take into account neighboring examples from other classes. It may result in overlapping classes and can introduce additional noise
  2. SMOTE is not very effective for high-dimensional data
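
The interpolation steps listed above can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference SMOTE implementation: the function name `smote_like`, its parameters, and the synthetic minority data are assumptions, and it assumes there are no duplicate minority points.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, seed=0):
    """Hypothetical helper: create n_new synthetic minority samples by
    interpolating between a sample and one of its k nearest neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop the point itself
    base = rng.integers(0, len(X_min), size=n_new)              # sample under consideration
    chosen = neigh[base, rng.integers(0, k, size=n_new)]        # one random neighbor each
    gap = rng.random((n_new, 1))                                # random number in [0, 1)
    # Point on the line segment between the sample and its neighbor
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 2))     # synthetic minority-class samples
X_syn = smote_like(X_min, n_new=30)
print(X_syn.shape)
```

In practice the `SMOTE` class of the Imbalanced-Learn library would normally be used instead of a hand-written version.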
  
_**IMPORTANT NOTE:** SMOTE resampling should NOT be done for test datasets ("X_test" & "y_test")_

`Adaptive Synthetic Technique or ADASYN:`
* This technique works similarly to SMOTE. But the number of samples generated is proportional to the number of nearby samples which do not belong to the same class
* Thus, it focuses on outliers when generating the new training samples

#### **Other methods**

#### 10.2.7.3.10. Cost-sensitive learning
* Cost-sensitive learning is another commonly used method to handle imbalanced classification problems
* This method evaluates the cost associated with misclassifying the observations
* This method does not create balanced data distribution, rather it focuses on the imbalanced learning problem by using cost matrices which describe the cost for misclassification in a particular scenario
* Research has shown that cost-sensitive learning may outperform sampling methods
* So, it provides a viable alternative to sampling methods
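
In Scikit-Learn, a simple form of cost-sensitive learning is available through the `class_weight` parameter of many classifiers. The sketch below assumes synthetic data from `make_classification` with roughly 5% minority samples and compares minority-class recall with and without `class_weight="balanced"`; the dataset and model choice are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# "balanced" makes errors on the rare class cost more during training
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print(recall_plain, recall_weighted)
```

The data itself is left untouched; only the penalty for misclassifying each class changes. A higher recall usually comes at the price of more false positives.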

#### 10.2.7.3.11. Algorithmic ensemble methods
* So far, we have looked at techniques to provide balanced datasets
* In this approach, we modify the existing classification algorithms to make them appropriate for imbalanced datasets
* In this approach, we construct several two-stage classifiers from the original data, and then we aggregate their predictions
* The main aim of this ensemble technique is to improve the performance of single classifiers
* The ensemble techniques are of two types: bagging and boosting. These techniques are discussed below:

**`a. Bagging`**
* Bagging is an abbreviation of Bootstrap Aggregating
* In the conventional bagging algorithm, we generate n different bootstrap training samples with replacement
* Then we train the algorithm on each bootstrap training sample separately and then aggregate the predictions at the end
* Bagging is used to reduce overfitting and to create a strong learner that generates more reliable predictions
* The ML algorithms like logistic regression, decision tree, and neural networks are fitted to each bootstrapped training sample
* These classifiers are then aggregated to produce a compound classifier
* This ensemble technique produces a strong compound classifier since it combines individual classifiers to come up with a strong classifier
* `Advantages:`
  - This technique improves the stability and accuracy of ML algorithms
  - It reduces variance and overcomes overfitting
  - It improves the misclassification rate of the bagged classifier
  - In noisy data situations, bagging outperforms boosting
* `Disadvantages:`
  - Bagging works only if the base classifiers are reasonably good to begin with. Bagging with bad classifiers can further degrade the performance
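
A minimal bagging sketch with Scikit-Learn's `BaggingClassifier`, assuming decision trees as base classifiers and a synthetic imbalanced dataset (both are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# 50 trees, each fitted on its own bootstrap sample (drawn with replacement)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0)
bag.fit(X_train, y_train)

y_pred = bag.predict(X_test)     # aggregated prediction of all trees
print(len(bag.estimators_), bag.score(X_test, y_test))
```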

**`b. Boosting`**
* Boosting is an ensemble technique to combine weak learners to create a strong learner so that we can make accurate predictions
* In boosting, we start with a base or weak classifier that is prepared on the training data
* The base learners are weak, i.e., their prediction accuracy is only slightly better than random guessing
* Boosting trains such weak learners one after another, with each new learner giving more weight to the examples that the previous ones misclassified
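
As an illustration, the sketch below compares a single decision stump (a depth-1 tree, a typical weak learner) with an AdaBoost ensemble, which by default boosts such stumps. The synthetic dataset and the number of estimators are assumptions, and only training accuracy is compared.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# A single weak learner
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)

# Up to 100 stumps combined by boosting
boost = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)

stump_acc = stump.score(X, y)
boost_acc = boost.score(X, y)
print(stump_acc, boost_acc)
```

A fair comparison of generalization would use a held-out test set or cross-validation rather than training accuracy.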

#### 10.2.7.3.12. Imbalanced learn
* There is a Python library called _`Imbalanced-Learn`_ (imported as _`imblearn`_) that contains various algorithms to handle imbalanced datasets
* It can be easily installed with the pip command
* This library also contains a _`make_imbalance`_ function to exacerbate the level of class imbalance within a given dataset

## 10.3. Deliverables from Stage-4
* Preprocessed datasets
* Metadata for preprocessed datasets
* Preprocessed data summary report
* Training datasets
* Test datasets

## 10.4. Notebook development tips

<!--NAVIGATION-->
<br>

<[ [Stage-3: Research](09.00-mlpg-Stage-3-Research.ipynb) | [Contents and Acronyms](00.00-mlpg-Contents-and-Acronyms.ipynb) | [Stage-5: Model Development](11.00-mlpg-Stage-5-Model-Development.ipynb) ]>