## Table of Contents
- [Data Prep](#Data-Prep)
- [Skewness](#Skewness)
- [Data Scaling](#Data-Scaling)
- [Encoding](#Encoding)
- [Outlier Removal](#Outlier-Removal)
- [Exploratory Data Analysis](#EDA)

# Data Prep

<div>
    <img src="../src/images/dataprep.png" width="600">
</div>

## Data preparation

Data preparation, also commonly called **data pre-processing**, is a fundamental step in the machine learning workflow. It's essentially the process of getting the raw data ready for a machine learning or AI model. Imagine it like cleaning and prepping ingredients before cooking a meal. One wouldn't throw raw, unwashed vegetables straight into a pot when making a stew. Data preparation is essential because it directly impacts the quality and effectiveness of the machine learning model. "Garbage in, garbage out" applies here: if you feed your model messy or unorganized data, you'll get unreliable results.

There are several important tasks in the data preparation stage, including:

-   Data Cleaning: This involves identifying and fixing errors, inconsistencies, and missing values in your data.  For instance, one might need to remove duplicate entries, deal with data outliers, correct typos, or find a way to handle data points where information is absent.

-   Data Manipulation: This might involve scaling your data or transforming the data in some way, such as encoding categorical variables. Encoding transforms non-numeric data (like text categories) into a format a machine learning model can understand, often using numerical representations.

-   Data Reduction: In some cases, you might have a very large dataset. Data reduction techniques like dimensionality reduction can help you identify the most important features and reduce the size of your data without losing significant information.

-   Feature Engineering: While data cleaning and scaling ensure that the data is usable, feature engineering goes a step further. It's the art of creating new features or transforming existing ones to be more informative for your machine learning model. By crafting informative features, you give your model a richer understanding of the data, which often leads to more accurate results.
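
The data cleaning task listed above can be sketched with pandas. This is a minimal illustration only: the toy DataFrame `raw`, its column names, and the choice of median imputation are assumptions made for the example, not part of any particular dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: inconsistent text, one duplicate row, one missing value.
raw = pd.DataFrame({
    "city": ["Paris", "paris ", "Berlin", "Berlin", "Rome"],
    "price": [10.0, 12.0, 8.0, 8.0, np.nan],
})

clean = raw.copy()
# Fix inconsistencies: strip whitespace and use consistent capitalization.
clean["city"] = clean["city"].str.strip().str.title()
# Remove exact duplicate rows.
clean = clean.drop_duplicates()
# Handle missing values: here, fill with the column median (one option of many).
clean["price"] = clean["price"].fillna(clean["price"].median())
clean = clean.reset_index(drop=True)

print(clean)
```

Whether to impute, drop, or flag missing values depends on the data and the model, so the median fill above should be read as one possible choice.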


# --------------------------------------------------------
# Data Transformation

# Skewness

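Many machine learning and deep learning algorithms (such as Linear Regression, Logistic Regression, and Artificial Neural Networks) tend to perform better when the input variables are roughly symmetric and close to normally distributed (i.e. follow a Gaussian distribution). Strictly speaking, linear regression assumes normally distributed residuals rather than normally distributed features, but strongly skewed variables can still degrade the fit, so skewness is worth checking and correcting before modeling.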

<div>
    <img src="../src/images/rightskew.png" width="600">
</div>

Right-skewed Distribution: When the distribution has a long tail towards the right side, it is known as a right-skewed or positive-skewed distribution. Most data points are concentrated on the left, while a small number of large values stretch the tail to the right, so the mean is typically greater than the median.

<div>
    <img src="../src/images/leftskew.png" width="600">
</div>
 
Left-skewed Distribution: When the distribution has a long tail towards the left side, it is known as a left-skewed or negative-skewed distribution. Most data points are concentrated on the right, while a small number of small values stretch the tail to the left, so the mean is typically less than the median.


#### Measuring Skewness

Skewness measures the lack of symmetry in a distribution.

-   A perfectly symmetric distribution (like the normal distribution) has a skewness of 0.
-   If data is skewed right (tail to the right), skewness is positive.
-   If data is skewed left (tail to the left), skewness is negative.

Formula for Skewness Coefficient

The sample skewness is given by:

<div>
    <img src="../src/images/skeweq.png" width="300">
</div>

Where:
- n  = number of samples
- xᵢ = data value
- x̄ = sample mean
- s  = sample standard deviation

Alternatively, libraries such as Pandas and SciPy provide built-in functions for computing skewness:

`df['feature'].skew()         # from Pandas`

`scipy.stats.skew(df['feature'])  # from SciPy`
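
As a hedged sketch of what these two calls compute: with default settings, `scipy.stats.skew` returns the moment-based (biased) coefficient, while pandas `Series.skew()` returns a version with a sample-size correction, so the two can differ slightly on small samples. The toy series `x` below is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy sample with a long right tail.
x = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 10.0])
n = len(x)

# Central moments about the sample mean.
dev = x - x.mean()
m2 = (dev ** 2).mean()
m3 = (dev ** 3).mean()

# Moment-based skewness (matches the scipy.stats.skew default, bias=True).
g1 = m3 / m2 ** 1.5
# Sample-size-adjusted skewness (matches pandas Series.skew()).
G1 = g1 * np.sqrt(n * (n - 1)) / (n - 2)

print("manual g1:", g1, "| scipy :", stats.skew(x.to_numpy()))
print("manual G1:", G1, "| pandas:", x.skew())
```

Passing `bias=False` to `scipy.stats.skew` yields the adjusted value, which is useful when comparing results between the two libraries.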

<style type="text/css">
#T_d4da2 th {
  text-align: left;
}
#T_d4da2_row0_col0, #T_d4da2_row0_col1, #T_d4da2_row1_col0, #T_d4da2_row1_col1, #T_d4da2_row2_col0, #T_d4da2_row2_col1 {
  text-align: left;
}
</style>
<table id="T_d4da2">
  <thead>
    <tr>
      <th class="blank level0" >&nbsp;</th>
      <th id="T_d4da2_level0_col0" class="col_heading level0 col0" >Skewness Value</th>
      <th id="T_d4da2_level0_col1" class="col_heading level0 col1" >Interpretation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th id="T_d4da2_level0_row0" class="row_heading level0 row0" >0</th>
      <td id="T_d4da2_row0_col0" class="data row0 col0" >≈0</td>
      <td id="T_d4da2_row0_col1" class="data row0 col1" >Approximately symmetric (e.g., normal)</td>
    </tr>
    <tr>
      <th id="T_d4da2_level0_row1" class="row_heading level0 row1" >1</th>
      <td id="T_d4da2_row1_col0" class="data row1 col0" >>0</td>
      <td id="T_d4da2_row1_col1" class="data row1 col1" >Right-skewed (positive skew)</td>
    </tr>
    <tr>
      <th id="T_d4da2_level0_row2" class="row_heading level0 row2" >2</th>
      <td id="T_d4da2_row2_col0" class="data row2 col0" ><0</td>
      <td id="T_d4da2_row2_col1" class="data row2 col1" >Left-skewed (negative skew)</td>
    </tr>
  </tbody>
</table>

<style type="text/css">
#T_d5957 th {
  text-align: left;
}
#T_d5957_row0_col0, #T_d5957_row0_col1, #T_d5957_row1_col0, #T_d5957_row1_col1, #T_d5957_row2_col0, #T_d5957_row2_col1 {
  text-align: left;
}
</style>
<table id="T_d5957">
  <thead>
    <tr>
      <th class="blank level0" >&nbsp;</th>
      <th id="T_d5957_level0_col0" class="col_heading level0 col0" >Absolute Value</th>
      <th id="T_d5957_level0_col1" class="col_heading level0 col1" >Interpretation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th id="T_d5957_level0_row0" class="row_heading level0 row0" >0</th>
      <td id="T_d5957_row0_col0" class="data row0 col0" >< 0.5</td>
      <td id="T_d5957_row0_col1" class="data row0 col1" > 	Approximately symmetric</td>
    </tr>
    <tr>
      <th id="T_d5957_level0_row1" class="row_heading level0 row1" >1</th>
      <td id="T_d5957_row1_col0" class="data row1 col0" >0.5 – 1</td>
      <td id="T_d5957_row1_col1" class="data row1 col1" >Moderately skewed</td>
    </tr>
    <tr>
      <th id="T_d5957_level0_row2" class="row_heading level0 row2" >2</th>
      <td id="T_d5957_row2_col0" class="data row2 col0" >> 1</td>
      <td id="T_d5957_row2_col1" class="data row2 col1" > 	Highly skewed</td>
    </tr>
  </tbody>
</table>

#### Positive/Right Skewed Data

One can use simple transformations to "normalize" right-skewed data:

-   log transformations:
    -   np.log(), np.log10() (these require strictly positive data, so they cannot handle zeros)
    -   np.log1p(x) = log(1 + x), which handles zeros but requires x > -1 (in practice, non-negative data)
-   square-root transformations: np.sqrt() (requires non-negative data)

There are several NumPy and scikit-learn tools for dealing with right-skewed data:

<style type="text/css">
#T_6068a th {
  text-align: left;
}
#T_6068a_row0_col0, #T_6068a_row0_col1, #T_6068a_row0_col2, #T_6068a_row0_col3, #T_6068a_row1_col0, #T_6068a_row1_col1, #T_6068a_row1_col2, #T_6068a_row1_col3, #T_6068a_row2_col0, #T_6068a_row2_col1, #T_6068a_row2_col2, #T_6068a_row2_col3, #T_6068a_row3_col0, #T_6068a_row3_col1, #T_6068a_row3_col2, #T_6068a_row3_col3, #T_6068a_row4_col0, #T_6068a_row4_col1, #T_6068a_row4_col2, #T_6068a_row4_col3 {
  text-align: left;
}
</style>
<table id="T_6068a">
  <thead>
    <tr>
      <th class="blank level0" >&nbsp;</th>
      <th id="T_6068a_level0_col0" class="col_heading level0 col0" >Method</th>
      <th id="T_6068a_level0_col1" class="col_heading level0 col1" >Formula / Tool</th>
      <th id="T_6068a_level0_col2" class="col_heading level0 col2" >Handles Zeros?</th>
      <th id="T_6068a_level0_col3" class="col_heading level0 col3" >Comment</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th id="T_6068a_level0_row0" class="row_heading level0 row0" >0</th>
      <td id="T_6068a_row0_col0" class="data row0 col0" >Log1p (Natural log + 1)</td>
      <td id="T_6068a_row0_col1" class="data row0 col1" >np.log1p(x) = log(1 + x)</td>
      <td id="T_6068a_row0_col2" class="data row0 col2" >yes</td>
      <td id="T_6068a_row0_col3" class="data row0 col3" >Handles zeros; requires x > −1</td>
    </tr>
    <tr>
      <th id="T_6068a_level0_row1" class="row_heading level0 row1" >1</th>
      <td id="T_6068a_row1_col0" class="data row1 col0" >Box-Cox (Scikit-Learn)</td>
      <td id="T_6068a_row1_col1" class="data row1 col1" >PowerTransformer(method='box-cox')</td>
      <td id="T_6068a_row1_col2" class="data row1 col2" > 	No (requires x > 0)</td>
      <td id="T_6068a_row1_col3" class="data row1 col3" >Requires positive values, can be shifted</td>
    </tr>
    <tr>
      <th id="T_6068a_level0_row2" class="row_heading level0 row2" >2</th>
      <td id="T_6068a_row2_col0" class="data row2 col0" >Yeo-Johnson (Scikit-Learn)</td>
      <td id="T_6068a_row2_col1" class="data row2 col1" >PowerTransformer(method='yeo-johnson')</td>
      <td id="T_6068a_row2_col2" class="data row2 col2" >yes</td>
      <td id="T_6068a_row2_col3" class="data row2 col3" >Handles zeros and negative values</td>
    </tr>
    <tr>
      <th id="T_6068a_level0_row3" class="row_heading level0 row3" >3</th>
      <td id="T_6068a_row3_col0" class="data row3 col0" >Custom log(x + ε)</td>
      <td id="T_6068a_row3_col1" class="data row3 col1" >np.log(x + 1e-6)</td>
      <td id="T_6068a_row3_col2" class="data row3 col2" >yes</td>
      <td id="T_6068a_row3_col3" class="data row3 col3" >Customizable epsilon value</td>
    </tr>
    <tr>
      <th id="T_6068a_level0_row4" class="row_heading level0 row4" >4</th>
      <td id="T_6068a_row4_col0" class="data row4 col0" >QuantileTransformer</td>
      <td id="T_6068a_row4_col1" class="data row4 col1" >QuantileTransformer(output_distribution='normal')</td>
      <td id="T_6068a_row4_col2" class="data row4 col2" >yes</td>
      <td id="T_6068a_row4_col3" class="data row4 col3" >Transforms to normal distribution, Smooths heavy tails; robust to outliers and zero values</td>
    </tr>
  </tbody>
</table>
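
The effect of some of the transformations in the table can be checked by comparing skewness before and after. The sketch below uses a synthetic log-normal sample (an assumption chosen only because it is strongly right-skewed) and reports the pandas skewness for each variant.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed sample (log-normal), strictly positive.
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

skew_raw = x.skew()
skew_sqrt = np.sqrt(x).skew()      # square-root transformation
skew_log1p = np.log1p(x).skew()    # log(1 + x) transformation

# Yeo-Johnson estimates a power parameter from the data; scikit-learn
# expects a 2-D array of shape (n_samples, n_features).
pt = PowerTransformer(method="yeo-johnson")
x_yj = pt.fit_transform(x.to_numpy().reshape(-1, 1)).ravel()
skew_yj = pd.Series(x_yj).skew()

print(f"raw: {skew_raw:.2f}  sqrt: {skew_sqrt:.2f}  "
      f"log1p: {skew_log1p:.2f}  yeo-johnson: {skew_yj:.2f}")
```

Which transformation works best depends on the data, so it is worth comparing a few candidates rather than assuming one in advance.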

#### Negative/Left Skewed Data

Dealing with left-skewed data (where the tail is on the left and the bulk of values are on the right) is less commonly discussed than right-skewed data - but it’s just as important when preparing data for ML algorithms that assume symmetry or normality.

The simplest way of dealing with left-skewed data is to transform it into a right-skewed distribution, and then apply the approaches above.

This can be done in two steps:

-   Flip: Multiply the data by -1.
-   Shift: Subtract the lowest value of the flipped data (i.e. its most negative value), so that the smallest value becomes 0.

Together, these two steps are equivalent to computing max(x) - x. This transforms the negative-skewed distribution into a right-skewed one, starting from 0.
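
A minimal sketch of the flip-and-shift idea, assuming a synthetic left-skewed sample (10 minus a log-normal variable, chosen purely for illustration) and `np.log1p` as the follow-up transformation:

```python
import numpy as np
import pandas as pd

# Hypothetical left-skewed sample: long tail towards small values.
rng = np.random.default_rng(1)
x = pd.Series(10.0 - rng.lognormal(mean=0.0, sigma=0.75, size=5000))

flipped = -x                          # flip: the tail now points to the right
reflected = flipped - flipped.min()   # shift: smallest value becomes 0
transformed = np.log1p(reflected)     # then treat as right-skewed data

print("skew of original   :", x.skew())
print("skew of reflected  :", reflected.skew())
print("skew of transformed:", transformed.skew())
```

Note that reflecting the data reverses its ordering (large original values become small), which matters when interpreting model coefficients or feature effects afterwards.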

# --------------------------------------------------------
# Data Scaling

Data scaling is a technique for transforming the values of variables within a dataset to a similar range. In other words, it converts all variables to the same order of magnitude.

There are several reasons why scaling is important:

-   Fair treatment of features: Imagine features like "income" (with values in the tens of thousands) and "age" (with values below 100). Without scaling, many models would overemphasize income due to its larger magnitude. Scaling levels the playing field.
-   Improved model convergence: Gradient-based optimizers tend to converge faster when features share a similar scale, and algorithms that rely on distance calculations between data points (e.g., k-NN) are otherwise dominated by the features with the largest ranges.
-   Outlier detection: Standardization (a specific scaling technique) transforms data to have a mean of 0 and standard deviation of 1. Values outside a specific range (e.g., +/- 2 standard deviations) can be flagged as potential outliers.

There are several common data scaling techniques that may be used.
 

-   Standardization: This technique transforms variables to have a mean of 0 and standard deviation of 1. It's useful when the distribution of your data is roughly Gaussian.
-   Normalization: This technique scales features to a specific range, often 0 to 1 or -1 to 1. It's a good choice when the data distribution is unknown or non-Gaussian.
-   Min-Max scaling: This technique scales each feature to a specific range (e.g., 0 to 1) based on its minimum and maximum values in the dataset.
-   Robust scaling: Robust scaling is a technique designed to handle outliers. Unlike standard scaling, which relies on mean and standard deviation (easily swayed by outliers), robust scaling uses the median and interquartile range (IQR). The median represents the data's center, and IQR reflects the spread of the middle half of the data, making them less influenced by extreme values. This ensures a more robust scaling process, especially for datasets with outliers.


<div>
    <img src="../src/images/scalingprocess.png" width="600">
</div>



There are no absolute rules for choosing a scaling technique. The most appropriate technique depends on the data and the particular machine learning algorithm being used. However, here are some general guidelines:

-   Use standardization for Gaussian distributed data and algorithms sensitive to feature means and variances (e.g., Support Vector Machines).
-   Use normalization for algorithms where the data range is important (e.g., some neural networks).
-   Use min-max scaling for a simple and quick approach, but be aware it can be sensitive to outliers.
-   Use robust scaling if the data has outliers and there is a concern about their impact on scaling.
-   Prefer robust scaling over standardization when an algorithm sensitive to feature means and variances (e.g., Support Vector Machines) must be trained on data that contains outliers.


<style type="text/css">
#T_682ba th {
  text-align: left;
}
#T_682ba_row0_col0, #T_682ba_row0_col1, #T_682ba_row0_col2, #T_682ba_row0_col3, #T_682ba_row0_col4, #T_682ba_row0_col5, #T_682ba_row1_col0, #T_682ba_row1_col1, #T_682ba_row1_col2, #T_682ba_row1_col3, #T_682ba_row1_col4, #T_682ba_row1_col5, #T_682ba_row2_col0, #T_682ba_row2_col1, #T_682ba_row2_col2, #T_682ba_row2_col3, #T_682ba_row2_col4, #T_682ba_row2_col5, #T_682ba_row3_col0, #T_682ba_row3_col1, #T_682ba_row3_col2, #T_682ba_row3_col3, #T_682ba_row3_col4, #T_682ba_row3_col5, #T_682ba_row4_col0, #T_682ba_row4_col1, #T_682ba_row4_col2, #T_682ba_row4_col3, #T_682ba_row4_col4, #T_682ba_row4_col5, #T_682ba_row5_col0, #T_682ba_row5_col1, #T_682ba_row5_col2, #T_682ba_row5_col3, #T_682ba_row5_col4, #T_682ba_row5_col5 {
  text-align: left;
}
</style>
<table id="T_682ba">
  <thead>
    <tr>
      <th class="blank level0" >&nbsp;</th>
      <th id="T_682ba_level0_col0" class="col_heading level0 col0" >Scaler</th>
      <th id="T_682ba_level0_col1" class="col_heading level0 col1" >What It Does</th>
      <th id="T_682ba_level0_col2" class="col_heading level0 col2" >Handles Outliers?</th>
      <th id="T_682ba_level0_col3" class="col_heading level0 col3" >Output Range / Properties</th>
      <th id="T_682ba_level0_col4" class="col_heading level0 col4" >Best Used When...</th>
      <th id="T_682ba_level0_col5" class="col_heading level0 col5" >Common Use Cases</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th id="T_682ba_level0_row0" class="row_heading level0 row0" >0</th>
      <td id="T_682ba_row0_col0" class="data row0 col0" >StandardScaler</td>
      <td id="T_682ba_row0_col1" class="data row0 col1" >Standardizes features by removing the mean and scaling to unit variance</td>
      <td id="T_682ba_row0_col2" class="data row0 col2" >No</td>
      <td id="T_682ba_row0_col3" class="data row0 col3" >Mean = 0, Std = 1</td>
      <td id="T_682ba_row0_col4" class="data row0 col4" >Features follow a normal distribution</td>
      <td id="T_682ba_row0_col5" class="data row0 col5" >Logistic Regression, SVM, PCA, k-NN</td>
    </tr>
    <tr>
      <th id="T_682ba_level0_row1" class="row_heading level0 row1" >1</th>
      <td id="T_682ba_row1_col0" class="data row1 col0" >MinMaxScaler</td>
      <td id="T_682ba_row1_col1" class="data row1 col1" >Scales features to a specified range (default [0, 1])</td>
      <td id="T_682ba_row1_col2" class="data row1 col2" >No</td>
      <td id="T_682ba_row1_col3" class="data row1 col3" >Scales to [min, max]</td>
      <td id="T_682ba_row1_col4" class="data row1 col4" >Features vary widely in scale but no strong outliers</td>
      <td id="T_682ba_row1_col5" class="data row1 col5" >Neural Networks, Image data, Distance-based models</td>
    </tr>
    <tr>
      <th id="T_682ba_level0_row2" class="row_heading level0 row2" >2</th>
      <td id="T_682ba_row2_col0" class="data row2 col0" >RobustScaler</td>
      <td id="T_682ba_row2_col1" class="data row2 col1" >Scales using the median and IQR (Interquartile Range)</td>
      <td id="T_682ba_row2_col2" class="data row2 col2" >Yes</td>
      <td id="T_682ba_row2_col3" class="data row2 col3" >Not bounded; robust to outliers</td>
      <td id="T_682ba_row2_col4" class="data row2 col4" >Features contain extreme outliers</td>
      <td id="T_682ba_row2_col5" class="data row2 col5" >Financial data, Medical records</td>
    </tr>
    <tr>
      <th id="T_682ba_level0_row3" class="row_heading level0 row3" >3</th>
      <td id="T_682ba_row3_col0" class="data row3 col0" >MaxAbsScaler</td>
      <td id="T_682ba_row3_col1" class="data row3 col1" >Scales features to [-1, 1] by dividing by max absolute value</td>
      <td id="T_682ba_row3_col2" class="data row3 col2" >No</td>
      <td id="T_682ba_row3_col3" class="data row3 col3" >Preserves sparsity; good for sparse data</td>
      <td id="T_682ba_row3_col4" class="data row3 col4" >Working with sparse data</td>
      <td id="T_682ba_row3_col5" class="data row3 col5" >Sparse matrices (e.g., text data)</td>
    </tr>
    <tr>
      <th id="T_682ba_level0_row4" class="row_heading level0 row4" >4</th>
      <td id="T_682ba_row4_col0" class="data row4 col0" >QuantileTransformer</td>
      <td id="T_682ba_row4_col1" class="data row4 col1" >Maps data to a uniform or normal distribution via quantiles</td>
      <td id="T_682ba_row4_col2" class="data row4 col2" >Yes</td>
      <td id="T_682ba_row4_col3" class="data row4 col3" >Uniform or Gaussian distribution</td>
      <td id="T_682ba_row4_col4" class="data row4 col4" >Need uniform or normal distribution explicitly</td>
      <td id="T_682ba_row4_col5" class="data row4 col5" >Non-linear models, Outlier-rich data</td>
    </tr>
    <tr>
      <th id="T_682ba_level0_row5" class="row_heading level0 row5" >5</th>
      <td id="T_682ba_row5_col0" class="data row5 col0" >PowerTransformer (Box-Cox, Yeo-Johnson)</td>
      <td id="T_682ba_row5_col1" class="data row5 col1" >Stabilizes variance and makes data more Gaussian-like</td>
      <td id="T_682ba_row5_col2" class="data row5 col2" >Yes</td>
      <td id="T_682ba_row5_col3" class="data row5 col3" >Mean ≈ 0, Std ≈ 1</td>
      <td id="T_682ba_row5_col4" class="data row5 col4" >Want to stabilize variance and make data normal</td>
      <td id="T_682ba_row5_col5" class="data row5 col5" >Gaussian-sensitive models like LDA, PCA</td>
    </tr>
  </tbody>
</table>
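
To make the differences between some of the scalers in the table concrete, the sketch below applies four of them to a single hypothetical feature containing one extreme value. The array `X` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    RobustScaler,
    StandardScaler,
)

# Hypothetical single feature with one extreme value (100.0).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)  # (x - mean) / std
minmax = MinMaxScaler().fit_transform(X)      # (x - min) / (max - min)
robust = RobustScaler().fit_transform(X)      # (x - median) / IQR
maxabs = MaxAbsScaler().fit_transform(X)      # x / max(|x|)

for name, arr in [("standard", standard), ("minmax", minmax),
                  ("robust", robust), ("maxabs", maxabs)]:
    print(f"{name:>8}: {np.round(arr.ravel(), 3)}")
```

With min-max scaling the four typical values are squeezed into a narrow band near 0 because of the single extreme value, while robust scaling keeps them spread around 0. In a real workflow the scaler would usually be fitted on the training split only and then applied to the test split with `transform`.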

# --------------------------------------------------------
# Encoding

<style type="text/css">
#T_78532 th {
  text-align: left;
}
#T_78532_row0_col0, #T_78532_row0_col1, #T_78532_row0_col2, #T_78532_row1_col0, #T_78532_row1_col1, #T_78532_row1_col2, #T_78532_row2_col0, #T_78532_row2_col1, #T_78532_row2_col2 {
  text-align: left;
}
</style>
<table id="T_78532">
  <thead>
    <tr>
      <th class="blank level0" >&nbsp;</th>
      <th id="T_78532_level0_col0" class="col_heading level0 col0" >Feature Type</th>
      <th id="T_78532_level0_col1" class="col_heading level0 col1" >Example</th>
      <th id="T_78532_level0_col2" class="col_heading level0 col2" >Encoding Strategy</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th id="T_78532_level0_row0" class="row_heading level0 row0" >0</th>
      <td id="T_78532_row0_col0" class="data row0 col0" >Nominal</td>
      <td id="T_78532_row0_col1" class="data row0 col1" >Petrol, Diesel, CNG</td>
      <td id="T_78532_row0_col2" class="data row0 col2" >One-Hot</td>
    </tr>
    <tr>
      <th id="T_78532_level0_row1" class="row_heading level0 row1" >1</th>
      <td id="T_78532_row1_col0" class="data row1 col0" >Ordinal</td>
      <td id="T_78532_row1_col1" class="data row1 col1" >Small, Medium, Large</td>
      <td id="T_78532_row1_col2" class="data row1 col2" >Label / Ordinal Map</td>
    </tr>
    <tr>
      <th id="T_78532_level0_row2" class="row_heading level0 row2" >2</th>
      <td id="T_78532_row2_col0" class="data row2 col0" >Binary</td>
      <td id="T_78532_row2_col1" class="data row2 col1" >Yes, No</td>
      <td id="T_78532_row2_col2" class="data row2 col2" >Map to 0/1</td>
    </tr>
  </tbody>
</table>
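
The three strategies in the table can be sketched with pandas as follows. The DataFrame `df_cars`, its column names, and the category order used for the ordinal map are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical data with one nominal, one ordinal, and one binary feature.
df_cars = pd.DataFrame({
    "fuel": ["Petrol", "Diesel", "CNG", "Petrol"],
    "size": ["Small", "Large", "Medium", "Small"],
    "automatic": ["Yes", "No", "No", "Yes"],
})

# Nominal: one-hot encode, since the categories have no natural order.
encoded = pd.get_dummies(df_cars, columns=["fuel"], prefix="fuel", dtype=int)

# Ordinal: map categories to integers that preserve their order.
size_order = {"Small": 0, "Medium": 1, "Large": 2}
encoded["size"] = encoded["size"].map(size_order)

# Binary: map the two categories to 0/1.
encoded["automatic"] = encoded["automatic"].map({"No": 0, "Yes": 1})

print(encoded)
```

One-hot encoding avoids implying an order between fuel types, whereas the explicit ordinal map keeps the Small < Medium < Large relationship that an arbitrary label assignment could lose.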

# --------------------------------------------------------
# Outlier Removal

Outliers are data points that fall significantly outside the typical range of your data.  They can distort a machine learning model's results and lead to inaccurate predictions. There are two main approaches to dealing with outliers: 

-   Removal:  This involves simply removing the outlier data from the dataset.
-   Imputation: This involves replacing the outlier values with substituted values.
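
The difference between the two approaches can be illustrated on a small series. In this sketch the outlier is flagged with a hand-picked threshold, which is purely an assumption for the example, and the median of the remaining values is used as the substitute.

```python
import pandas as pd

# Hypothetical measurements with one obviously extreme value.
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])

# Hypothetical threshold chosen by eye, only to flag the outlier here.
is_outlier = s > 50

# Removal: drop the flagged values (the series gets shorter).
removed = s[~is_outlier]

# Imputation: keep every row, but replace flagged values with the
# median of the non-outlying values.
imputed = s.where(~is_outlier, removed.median())

print("removed:", removed.tolist())
print("imputed:", imputed.tolist())
```

Removal shrinks the dataset, while imputation keeps its size at the cost of introducing substituted values.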


There are several reasons why one might want to remove outliers from a dataset. These include:

-   Distorted models: Outliers can significantly influence the model's learning process, causing it to overfit to the extreme values instead of capturing the underlying pattern.
-   Reduced accuracy: Models trained on data with outliers might perform well on the training data but fail to generalize to unseen data.


## Interquartile Range (IQR) Approach

Here we describe the process of interquartile range outlier removal. Other approaches, including the z-score method and model-based detectors such as Isolation Forest, are covered later in this section.

<div>
    <img src="../src/images/IQR.png" width="600">
</div>

The Interquartile Range (IQR) is a statistical measure used to understand the spread of data, specifically focusing on the middle half of your dataset. It tells you how much variability exists within the central 50% of your data points, after ordering them from least to greatest.

Imagine dividing your ordered data into four equal parts. The points that divide these parts are called quartiles. There are three quartiles (Q1, Q2, and Q3).

-   Q1 (first quartile): Represents the value where 25% of the data falls below it and 75% falls above.
-   Q2 (second quartile): This is the median of your data, the middle value when ordered.
-   Q3 (third quartile): Represents the value where 75% of the data falls below it and 25% falls above.


Interquartile Range (IQR):  This is simply the difference between the third quartile (Q3) and the first quartile (Q1). So,

IQR=Q3−Q1.

In layman's terms, the IQR tells us how spread out the data is.

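One can remove outliers that fall beyond a chosen multiple of the IQR: values below Q1 − k×IQR or above Q3 + k×IQR are treated as outliers, with k = 1.5 being the most common choice.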
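
As a small worked example of the quartiles and the IQR described above, the following sketch computes them for a hypothetical array of ten values and flags anything beyond the common 1.5 multiple of the IQR. NumPy's default linear interpolation between data points is assumed for the percentiles.

```python
import numpy as np

# Hypothetical ordered sample with one large value.
data = np.array([2, 4, 4, 5, 7, 8, 9, 11, 12, 30])

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 times the IQR below Q1 and above Q3.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower}, {upper}], outliers: {outliers}")
```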

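```python
# Python Code
# The following function removes rows that fall outside 1.5 times the IQR.
# It assumes `df` is an existing pandas DataFrame.

def remove_outliers_iqr(df, columns):
    """Return a copy of df without rows outside the IQR fences of any column."""
    df_clean = df.copy()
    for col in columns:
        # Quartiles are computed on the original data, so the fences do not
        # shift as rows are dropped for earlier columns.
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        df_clean = df_clean[(df_clean[col] >= lower) & (df_clean[col] <= upper)]
    return df_clean

# Apply the filter to every numeric column.
numeric_columns = df.select_dtypes(include='number').columns.tolist()
df_cleaned = remove_outliers_iqr(df, numeric_columns)
```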

## The z-score Method

In statistics, the z-score (also known as the standard score) tells you how many standard deviations a data point is from the mean of the distribution. The interpretation of the z-score is summarized in the following table:

<style type="text/css">
#T_9e2df th {
  text-align: left;
}
#T_9e2df_row0_col0, #T_9e2df_row0_col1, #T_9e2df_row1_col0, #T_9e2df_row1_col1, #T_9e2df_row2_col0, #T_9e2df_row2_col1, #T_9e2df_row3_col0, #T_9e2df_row3_col1 {
  text-align: left;
}
</style>
<table id="T_9e2df">
  <thead>
    <tr>
      <th class="blank level0" >&nbsp;</th>
      <th id="T_9e2df_level0_col0" class="col_heading level0 col0" >Z-score</th>
      <th id="T_9e2df_level0_col1" class="col_heading level0 col1" >Meaning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th id="T_9e2df_level0_row0" class="row_heading level0 row0" >0</th>
      <td id="T_9e2df_row0_col0" class="data row0 col0" >0</td>
      <td id="T_9e2df_row0_col1" class="data row0 col1" >Data point is exactly at the mean</td>
    </tr>
    <tr>
      <th id="T_9e2df_level0_row1" class="row_heading level0 row1" >1</th>
      <td id="T_9e2df_row1_col0" class="data row1 col0" >+1</td>
      <td id="T_9e2df_row1_col1" class="data row1 col1" >1 standard deviation above the mean</td>
    </tr>
    <tr>
      <th id="T_9e2df_level0_row2" class="row_heading level0 row2" >2</th>
      <td id="T_9e2df_row2_col0" class="data row2 col0" >-1</td>
      <td id="T_9e2df_row2_col1" class="data row2 col1" >1 standard deviation below the mean</td>
    </tr>
    <tr>
      <th id="T_9e2df_level0_row3" class="row_heading level0 row3" >3</th>
      <td id="T_9e2df_row3_col0" class="data row3 col0" >> +3 or < -3</td>
      <td id="T_9e2df_row3_col1" class="data row3 col1" >Potential outlier (extreme deviation)</td>
    </tr>
  </tbody>
</table>

#### z-Score and the Empirical Rule

The Empirical Rule (also called the 68–95–99.7 Rule) for a normal distribution is the foundation of the z-score approach to outlier detection. It is illustrated by the following diagram:

<div>
    <img src="../src/images/Zscore.png" width="600">
</div>

The z-score measures how many standard deviations (s, or SD in the figure) a data point is from the mean (x̄, or M in the figure). In the normal distribution shown:

<style type="text/css">
#T_c544f th {
  text-align: left;
}
#T_c544f_row0_col0, #T_c544f_row0_col1, #T_c544f_row0_col2, #T_c544f_row0_col3, #T_c544f_row1_col0, #T_c544f_row1_col1, #T_c544f_row1_col2, #T_c544f_row1_col3, #T_c544f_row2_col0, #T_c544f_row2_col1, #T_c544f_row2_col2, #T_c544f_row2_col3 {
  text-align: left;
}
</style>
<table id="T_c544f">
  <thead>
    <tr>
      <th class="blank level0" >&nbsp;</th>
      <th id="T_c544f_level0_col0" class="col_heading level0 col0" >Range</th>
      <th id="T_c544f_level0_col1" class="col_heading level0 col1" >Z-Score Range</th>
      <th id="T_c544f_level0_col2" class="col_heading level0 col2" >% of Data</th>
      <th id="T_c544f_level0_col3" class="col_heading level0 col3" >Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th id="T_c544f_level0_row0" class="row_heading level0 row0" >0</th>
      <td id="T_c544f_row0_col0" class="data row0 col0" >±1s</td>
      <td id="T_c544f_row0_col1" class="data row0 col1" >z ∈ [−1, +1]</td>
      <td id="T_c544f_row0_col2" class="data row0 col2" >68%</td>
      <td id="T_c544f_row0_col3" class="data row0 col3" >Most data lies within 1 standard deviation of the mean</td>
    </tr>
    <tr>
      <th id="T_c544f_level0_row1" class="row_heading level0 row1" >1</th>
      <td id="T_c544f_row1_col0" class="data row1 col0" >±2s</td>
      <td id="T_c544f_row1_col1" class="data row1 col1" >z ∈ [−2, +2]</td>
      <td id="T_c544f_row1_col2" class="data row1 col2" >95%</td>
      <td id="T_c544f_row1_col3" class="data row1 col3" >Covers almost all typical values</td>
    </tr>
    <tr>
      <th id="T_c544f_level0_row2" class="row_heading level0 row2" >2</th>
      <td id="T_c544f_row2_col0" class="data row2 col0" >±3s</td>
      <td id="T_c544f_row2_col1" class="data row2 col1" >z ∈ [−3, +3]</td>
      <td id="T_c544f_row2_col2" class="data row2 col2" >99.7%</td>
      <td id="T_c544f_row2_col3" class="data row2 col3" >Includes nearly all data in a normal distribution</td>
    </tr>
  </tbody>
</table>


This means that data points with |z| > 3 fall in the outer 0.3% (i.e., beyond ±3s). In a normal distribution, these are considered extreme outliers.
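
The percentages of the Empirical Rule can be reproduced from the cumulative distribution function of the standard normal distribution. The sketch below uses `scipy.stats.norm` and assumes an exactly normal distribution, which real data will only approximate.

```python
from scipy.stats import norm

# Fraction of a normal distribution within k standard deviations of the mean.
coverage = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, frac in coverage.items():
    print(f"within ±{k}s: {frac:.4f}")

# Fraction expected beyond ±3s, i.e. flagged by a |z| > 3 rule.
tail_fraction = 1.0 - coverage[3]
print(f"beyond ±3s: {tail_fraction:.4f}")
```

Under this assumption, a |z| > 3 rule would flag roughly 3 in every 1,000 points even when no genuine outliers are present.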

#### When to use z-score method

Care should be taken when using the z-score method. It is appropriate when:
-   Data is approximately normally distributed
-   A simple, fast method is required

It is not ideal in cases where:
-   Data is heavily skewed or contains non-Gaussian outliers
-   You're working with multivariate data (consider using Isolation Forest instead)


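```python
# Python Code
# For example, the following code removes all rows that contain a value
# more than 3 standard deviations from the mean of its column.
# It assumes `df` and `numeric_columns` are defined as in the IQR example.

from scipy.stats import zscore

# Column-wise z-scores; a column containing NaN yields NaN z-scores by default.
z_scores = df[numeric_columns].apply(zscore)
# Keep only rows where every numeric column has |z| < 3.
df_cleaned = df[(abs(z_scores) < 3).all(axis=1)]
```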

<style type="text/css">
#T_b15e8 th {
  text-align: left;
}
#T_b15e8_row0_col0, #T_b15e8_row0_col1, #T_b15e8_row0_col2, #T_b15e8_row0_col3, #T_b15e8_row0_col4, #T_b15e8_row1_col0, #T_b15e8_row1_col1, #T_b15e8_row1_col2, #T_b15e8_row1_col3, #T_b15e8_row1_col4, #T_b15e8_row2_col0, #T_b15e8_row2_col1, #T_b15e8_row2_col2, #T_b15e8_row2_col3, #T_b15e8_row2_col4, #T_b15e8_row3_col0, #T_b15e8_row3_col1, #T_b15e8_row3_col2, #T_b15e8_row3_col3, #T_b15e8_row3_col4, #T_b15e8_row4_col0, #T_b15e8_row4_col1, #T_b15e8_row4_col2, #T_b15e8_row4_col3, #T_b15e8_row4_col4, #T_b15e8_row5_col0, #T_b15e8_row5_col1, #T_b15e8_row5_col2, #T_b15e8_row5_col3, #T_b15e8_row5_col4 {
  text-align: left;
}
</style>
<table id="T_b15e8">
  <thead>
    <tr>
      <th class="blank level0" >&nbsp;</th>
      <th id="T_b15e8_level0_col0" class="col_heading level0 col0" >Method</th>
      <th id="T_b15e8_level0_col1" class="col_heading level0 col1" >sklearn</th>
      <th id="T_b15e8_level0_col2" class="col_heading level0 col2" >Technique</th>
      <th id="T_b15e8_level0_col3" class="col_heading level0 col3" >Works On</th>
      <th id="T_b15e8_level0_col4" class="col_heading level0 col4" >Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th id="T_b15e8_level0_row0" class="row_heading level0 row0" >0</th>
      <td id="T_b15e8_row0_col0" class="data row0 col0" >IQR Method</td>
      <td id="T_b15e8_row0_col1" class="data row0 col1" >None</td>
      <td id="T_b15e8_row0_col2" class="data row0 col2" >Removes values beyond 1.5×IQR outside Q1/Q3</td>
      <td id="T_b15e8_row0_col3" class="data row0 col3" >Univariate</td>
      <td id="T_b15e8_row0_col4" class="data row0 col4" >Simple, fast, interpretable</td>
    </tr>
    <tr>
      <th id="T_b15e8_level0_row1" class="row_heading level0 row1" >1</th>
      <td id="T_b15e8_row1_col0" class="data row1 col0" >Z-score Method</td>
      <td id="T_b15e8_row1_col1" class="data row1 col1" >None</td>
      <td id="T_b15e8_row1_col2" class="data row1 col2" >Removes values with |z-score| > 3 (or another chosen threshold)</td>
      <td id="T_b15e8_row1_col3" class="data row1 col3" >Univariate</td>
      <td id="T_b15e8_row1_col4" class="data row1 col4" >Assumes normality</td>
    </tr>
    <tr>
      <th id="T_b15e8_level0_row2" class="row_heading level0 row2" >2</th>
      <td id="T_b15e8_row2_col0" class="data row2 col0" >Isolation Forest</td>
      <td id="T_b15e8_row2_col1" class="data row2 col1" >sklearn.ensemble.IsolationForest</td>
      <td id="T_b15e8_row2_col2" class="data row2 col2" >Tree-based anomaly detection</td>
      <td id="T_b15e8_row2_col3" class="data row2 col3" >Multivariate</td>
      <td id="T_b15e8_row2_col4" class="data row2 col4" >Handles high-dimensional data</td>
    </tr>
    <tr>
      <th id="T_b15e8_level0_row3" class="row_heading level0 row3" >3</th>
      <td id="T_b15e8_row3_col0" class="data row3 col0" >Local Outlier Factor (LOF)</td>
      <td id="T_b15e8_row3_col1" class="data row3 col1" >sklearn.neighbors.LocalOutlierFactor</td>
      <td id="T_b15e8_row3_col2" class="data row3 col2" >Detects density-based local outliers</td>
      <td id="T_b15e8_row3_col3" class="data row3 col3" >Multivariate</td>
      <td id="T_b15e8_row3_col4" class="data row3 col4" >Good for non-globular clusters</td>
    </tr>
    <tr>
      <th id="T_b15e8_level0_row4" class="row_heading level0 row4" >4</th>
      <td id="T_b15e8_row4_col0" class="data row4 col0" >DBSCAN</td>
      <td id="T_b15e8_row4_col1" class="data row4 col1" >sklearn.cluster.DBSCAN</td>
      <td id="T_b15e8_row4_col2" class="data row4 col2" >Clustering + noise labeling</td>
      <td id="T_b15e8_row4_col3" class="data row4 col3" >Multivariate</td>
      <td id="T_b15e8_row4_col4" class="data row4 col4" >Detects outliers as noise</td>
    </tr>
    <tr>
      <th id="T_b15e8_level0_row5" class="row_heading level0 row5" >5</th>
      <td id="T_b15e8_row5_col0" class="data row5 col0" >Elliptic Envelope</td>
      <td id="T_b15e8_row5_col1" class="data row5 col1" >sklearn.covariance.EllipticEnvelope</td>
      <td id="T_b15e8_row5_col2" class="data row5 col2" >Fits a Gaussian “envelope” around data</td>
      <td id="T_b15e8_row5_col3" class="data row5 col3" >Multivariate</td>
      <td id="T_b15e8_row5_col4" class="data row5 col4" >Assumes Gaussian distribution</td>
    </tr>
  </tbody>
</table>

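```python
# Python Code - Isolation Forest
# For example, one can use the sklearn IsolationForest class to remove outliers.
# It assumes `df` and `numeric_columns` are defined as in the IQR example.

from sklearn.ensemble import IsolationForest

# contamination is the expected proportion of outliers in the data (here 5%).
iso = IsolationForest(contamination=0.05, random_state=42)
# fit_predict returns 1 for inliers and -1 for outliers.
outliers = iso.fit_predict(df[numeric_columns])
df_cleaned = df[outliers == 1]
```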

# --------------------------------------------------------
# EDA

Data preparation (or pre-processing) and exploratory data analysis (EDA) are crucial steps in a machine learning project, but they serve distinct purposes.

#### Data Preparation


The goal of data preparation is to get the data ready for modeling by cleaning, organizing, and formatting it. The typical tasks in data preparation may include:

-   Cleaning: This involves fixing errors, handling missing values, and removing inconsistencies.
-   Formatting: Ensure data is in a format usable by machine learning algorithms. This may involve converting data types or scaling values.
-   Labeling: If applicable, assign labels to data points for supervised learning.

#### Exploratory Data Analysis (EDA)


The goal of exploratory data analysis is to understand the characteristics of your data and uncover patterns or relationships. This may involve:

-   Summarizing: Get a basic understanding of the data through measures like central tendency (mean, median) and spread (standard deviation).
-   Visualizing: Create charts and graphs to see trends, identify outliers, and explore relationships between variables.
-   Hypothesis Generation: Based on your findings, formulate initial questions or predictions about the data.
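
The summarizing step can be sketched with two pandas calls. The DataFrame `df_demo` and its columns are hypothetical; plotting is left as a comment because it depends on the environment.

```python
import pandas as pd

# Hypothetical dataset with two numeric variables.
df_demo = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [50, 58, 69, 80, 88],
})

# Central tendency and spread for each column (count, mean, std, quartiles).
summary = df_demo.describe()

# Pairwise Pearson correlations hint at relationships worth exploring.
corr = df_demo.corr()

print(summary)
print(corr)

# For a visual check, df_demo.hist() or df_demo.plot.scatter(
#     x="height_cm", y="weight_kg") could be used where matplotlib is available.
```

A strong correlation like the one in this toy data would be a natural starting point for a hypothesis to examine further.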


#### Key Differences

Imagine data preparation like cleaning and organizing your ingredients for a recipe (the machine learning model). EDA is like reviewing the recipe itself, understanding the ingredients and how they interact, before you start cooking. The key differences are:

-   Focus: Data preparation focuses on making the data usable, while EDA focuses on understanding its content and relationships.
-   Outcome: Data preparation creates a clean dataset, while EDA generates insights and potential hypotheses about the data.
-   Timing: Preparation typically happens before EDA, but they can be iterative. 