# Data Prep

<div>
    <img src="src/images/dataprep.png" width="600">
</div>

## Data preparation

Data preparation, also commonly called **data pre-processing**, is a fundamental step in the machine learning workflow. It is the process of getting raw data ready for a machine learning or AI model. Think of it as cleaning and prepping ingredients before cooking a meal: one wouldn't throw raw, unwashed vegetables straight into a pot when making a stew. Data preparation matters because it directly affects the quality and effectiveness of the machine learning model. "Garbage in, garbage out" applies here: if you feed your model messy or unorganized data, you'll get unreliable results.

There are several important tasks in the data preparation stage, including:

-   Data Cleaning: This involves identifying and fixing errors, inconsistencies, and missing values in your data.  For instance, one might need to remove duplicate entries, deal with data outliers, correct typos, or find a way to handle data points where information is absent.

-   Data Manipulation: This might involve scaling your data or transforming the data in some way, such as encoding categorical variables. Encoding transforms non-numeric data (like text categories) into a format a machine learning model can understand, often using numerical representations.

-   Data Reduction: In some cases, you might have a very large dataset. Data reduction techniques like dimensionality reduction can help you identify the most important features and reduce the size of your data without losing significant information.

-   Feature Engineering: While data cleaning and scaling ensure that the data is usable, feature engineering goes a step further. It's the art of creating new features or transforming existing ones to be more informative for your machine learning model. Well-crafted features give your model a richer view of the data, which often leads to more accurate results.
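
The tasks above can be sketched on a tiny example. The dataset, column names, and the `income_per_age` feature below are hypothetical and chosen only for illustration; median imputation and one-hot encoding are one reasonable choice among several, not the only way to do each step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with a duplicate row, a missing value, and a text category
raw = pd.DataFrame({
    "age": [25, 32, 32, np.nan, 47],
    "city": ["Paris", "Lima", "Lima", "Paris", "Oslo"],
    "income": [40_000, 52_000, 52_000, 61_000, 58_000],
})

# Data cleaning: drop exact duplicates, then fill the missing age with the median
clean = raw.drop_duplicates().reset_index(drop=True)
clean["age"] = clean["age"].fillna(clean["age"].median())

# Data manipulation: one-hot encode the categorical column ...
encoded = pd.get_dummies(clean, columns=["city"], dtype=int)

# ... and standardize the numeric columns to zero mean and unit variance
for col in ["age", "income"]:
    encoded[col] = (encoded[col] - encoded[col].mean()) / encoded[col].std()

# Feature engineering: derive a new feature from existing ones (assumed to be useful here)
encoded["income_per_age"] = clean["income"] / clean["age"]

print(encoded)
```

In a real project the imputation and scaling statistics would normally be computed on the training split only and then reused on the test split.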


## Data Transformation

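Many machine learning and deep learning algorithms (such as Linear Regression, Logistic Regression, and Artificial Neural Networks) tend to perform better when the variables provided to them are approximately normally distributed (i.e. follow a Gaussian distribution), or at least not heavily skewed. Strictly speaking, linear regression assumes normally distributed residuals rather than normally distributed features, but strongly skewed variables can still hurt model fit and training.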

<div>
    <img src="src/images/rightskew.png" width="600">
</div>

Right-skewed Distribution: When the distribution has a long tail towards the right side, it is known as a right-skewed or positive-skewed distribution. In a right-skewed distribution, most data points are concentrated on the left, and a small number of large values stretch the tail to the right, so the mean is typically greater than the median.

<div>
    <img src="src/images/leftskew.png" width="600">
</div>
 
Left-skewed Distribution: When the distribution has a long tail towards the left side, it is known as a left-skewed or negative-skewed distribution. In a left-skewed distribution, most data points are concentrated on the right, and a small number of small values stretch the tail to the left, so the mean is typically less than the median.
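
As a quick illustration of the two shapes, the sketch below draws a right-skewed sample and mirrors it to get a left-skewed one. The exponential distribution, the sample size, and the seed are assumptions made for the example; any long-tailed distribution would do.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Exponential samples are right-skewed: most values are small, a few are very large
right_skewed = rng.exponential(scale=1.0, size=10_000)

# Negating the values mirrors the distribution, which gives a left-skewed sample
left_skewed = -right_skewed

# The long tail pulls the mean away from the median, in the direction of the tail
print("right-skewed: mean =", right_skewed.mean(), "median =", np.median(right_skewed))
print("left-skewed:  mean =", left_skewed.mean(), "median =", np.median(left_skewed))
```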


#### Measuring Skewness

Skewness measures the lack of symmetry in a distribution.

-   A perfectly symmetric distribution (like the normal distribution) has a skewness of 0.
-   If data is skewed right (tail to the right), skewness is positive.
-   If data is skewed left (tail to the left), skewness is negative.

Formula for Skewness Coefficient

The sample skewness is given by:

<div>
    <img src="src/images/skeweq.png" width="300">
</div>

Where:
- n = number of samples
- xᵢ = i-th data value
- x̄ = sample mean
- s = sample standard deviation

Alternatively, libraries like Pandas and SciPy compute skewness directly:

`df['feature'].skew()         # from Pandas`

`scipy.stats.skew(df['feature'])  # from SciPy`
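
Note that the two calls do not use the same estimator by default: `scipy.stats.skew` returns the biased (population) skewness unless `bias=False` is passed, while the Pandas `skew()` method applies a sample-size correction. The sketch below computes both by hand and compares them with the library results. The `feature` column is filled with synthetic log-normal data, which is an assumption made only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(seed=42)
df = pd.DataFrame({"feature": rng.lognormal(mean=0.0, sigma=1.0, size=500)})

x = df["feature"].to_numpy()
n = len(x)

# Biased skewness: third central moment divided by the second central moment to the power 1.5
m2 = np.mean((x - x.mean()) ** 2)
m3 = np.mean((x - x.mean()) ** 3)
g1 = m3 / m2 ** 1.5

# Bias-corrected (adjusted Fisher-Pearson) skewness
G1 = g1 * np.sqrt(n * (n - 1)) / (n - 2)

scipy_default = stats.skew(x)               # bias=True by default, matches g1
scipy_unbiased = stats.skew(x, bias=False)  # matches G1
pandas_value = df["feature"].skew()         # Pandas applies the correction, matches G1

print(g1, scipy_default)
print(G1, scipy_unbiased, pandas_value)
```

For large samples the two values are nearly identical; the difference matters mainly for small `n`.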

<style type="text/css">
#T_d4da2 th {
  text-align: left;
}
#T_d4da2_row0_col0, #T_d4da2_row0_col1, #T_d4da2_row1_col0, #T_d4da2_row1_col1, #T_d4da2_row2_col0, #T_d4da2_row2_col1 {
  text-align: left;
}
</style>
<table id="T_d4da2">
  <thead>
    <tr>
      <th class="blank level0" >&nbsp;</th>
      <th id="T_d4da2_level0_col0" class="col_heading level0 col0" >Skewness Value</th>
      <th id="T_d4da2_level0_col1" class="col_heading level0 col1" >Interpretation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th id="T_d4da2_level0_row0" class="row_heading level0 row0" >0</th>
      <td id="T_d4da2_row0_col0" class="data row0 col0" >≈0</td>
      <td id="T_d4da2_row0_col1" class="data row0 col1" >Symmetric (normal)</td>
    </tr>
    <tr>
      <th id="T_d4da2_level0_row1" class="row_heading level0 row1" >1</th>
      <td id="T_d4da2_row1_col0" class="data row1 col0" >>0</td>
      <td id="T_d4da2_row1_col1" class="data row1 col1" >Right-skewed (positive skew)</td>
    </tr>
    <tr>
      <th id="T_d4da2_level0_row2" class="row_heading level0 row2" >2</th>
      <td id="T_d4da2_row2_col0" class="data row2 col0" ><0</td>
      <td id="T_d4da2_row2_col1" class="data row2 col1" >Left-skewed (negative skew)</td>
    </tr>
  </tbody>
</table>

<style type="text/css">
#T_d5957 th {
  text-align: left;
}
#T_d5957_row0_col0, #T_d5957_row0_col1, #T_d5957_row1_col0, #T_d5957_row1_col1, #T_d5957_row2_col0, #T_d5957_row2_col1 {
  text-align: left;
}
</style>
<table id="T_d5957">
  <thead>
    <tr>
      <th class="blank level0" >&nbsp;</th>
      <th id="T_d5957_level0_col0" class="col_heading level0 col0" >Absolute Value</th>
      <th id="T_d5957_level0_col1" class="col_heading level0 col1" >Interpretation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th id="T_d5957_level0_row0" class="row_heading level0 row0" >0</th>
      <td id="T_d5957_row0_col0" class="data row0 col0" >< 0.5</td>
      <td id="T_d5957_row0_col1" class="data row0 col1" >Approximately symmetric</td>
    </tr>
    <tr>
      <th id="T_d5957_level0_row1" class="row_heading level0 row1" >1</th>
      <td id="T_d5957_row1_col0" class="data row1 col0" >0.5 – 1</td>
      <td id="T_d5957_row1_col1" class="data row1 col1" >Moderately skewed</td>
    </tr>
    <tr>
      <th id="T_d5957_level0_row2" class="row_heading level0 row2" >2</th>
      <td id="T_d5957_row2_col0" class="data row2 col0" >> 1</td>
      <td id="T_d5957_row2_col1" class="data row2 col1" >Highly skewed</td>
    </tr>
  </tbody>
</table>
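
The two tables can be combined into a small helper. The function name `interpret_skewness` is hypothetical, and treating values of exactly 0.5 and 1 as "moderately skewed" is an assumption, since the tables do not say which side the boundaries belong to.

```python
def interpret_skewness(skew: float) -> str:
    """Map a skewness value to the rule-of-thumb labels from the tables above."""
    magnitude = abs(skew)
    if magnitude < 0.5:
        return "approximately symmetric"
    # The sign gives the direction of the tail, the magnitude gives its strength
    direction = "right" if skew > 0 else "left"
    strength = "moderately" if magnitude <= 1 else "highly"
    return f"{strength} {direction}-skewed"


for value in (0.2, 0.8, -1.7):
    print(value, "->", interpret_skewness(value))
```

These thresholds are conventions rather than strict rules, so they are best used as a first screening step.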

Positive/Right Skewed Data

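One can use simple transformations to "normalize" right-skewed data:

-   log transformations:
    -   `np.log()` and `np.log10()` are undefined at zero and for negative values, so they only work on strictly positive data
    -   `np.log1p(x)` computes log(1 + x), so it handles zeros, but it still requires x > -1 and therefore suits non-negative data
-   square-root transformations: `np.sqrt()`, which handles zeros but not negative values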
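
A rough way to see the effect of these transformations is to compare skewness before and after. The data below is synthetic (log-normal values plus some zeros), and the seed and sizes are arbitrary assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Hypothetical right-skewed, non-negative feature that also contains zeros
x = np.concatenate([np.zeros(50), rng.lognormal(mean=1.0, sigma=1.0, size=950)])

skew_raw = stats.skew(x)
skew_sqrt = stats.skew(np.sqrt(x))     # defined at zero
skew_log1p = stats.skew(np.log1p(x))   # log(1 + x), also defined at zero

print(f"raw:   {skew_raw:.2f}")
print(f"sqrt:  {skew_sqrt:.2f}")
print(f"log1p: {skew_log1p:.2f}")

# A plain log is not usable here: log(0) evaluates to -inf
with np.errstate(divide="ignore"):
    print("finite after np.log:", np.isfinite(np.log(x)).all())
```

How much each transformation helps depends on the data, so it is worth checking the skewness (or a histogram) after transforming rather than assuming the result is close to normal.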

Several NumPy and scikit-learn tools for dealing with right-skewed data are summarized below:

```python
import pandas as pd

# Define the data: every column needs exactly one entry per method
data = {
    "Method": [
        "Log1p (Natural log + 1)",
        "Box-Cox (Scikit-Learn)",
        "Yeo-Johnson (Scikit-Learn)",
        "Custom log(x + ε)",
        "QuantileTransformer",
    ],
    "Formula / Tool": [
        "np.log1p(x) = log(1 + x)",
        "PowerTransformer(method='box-cox')",
        "PowerTransformer(method='yeo-johnson')",
        "np.log(x + 1e-6)",
        "QuantileTransformer(output_distribution='normal')",
    ],
    # Box-Cox requires strictly positive input, so it cannot take zeros
    "Handles Zeros?": ["yes", "no", "yes", "yes", "yes"],
}

# Create a DataFrame
df = pd.DataFrame(data)

# Display the table with left-aligned cells and headers
df.style.set_properties(**{'text-align': 'left'}).set_table_styles(
    [{'selector': 'th', 'props': [('text-align', 'left')]}]
)
```
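
The scikit-learn transformers from the table can be tried out as follows. This is a minimal sketch on synthetic, strictly positive data (so that Box-Cox is applicable); the seed, sample size, and `n_quantiles` value are assumptions chosen for the example.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(seed=7)

# scikit-learn expects a 2D array of shape (n_samples, n_features)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

transformers = {
    "box-cox": PowerTransformer(method="box-cox"),          # strictly positive data only
    "yeo-johnson": PowerTransformer(method="yeo-johnson"),  # also accepts zeros and negatives
    "quantile": QuantileTransformer(
        output_distribution="normal", n_quantiles=1000, random_state=0
    ),
}

raw_skew = float(stats.skew(X.ravel()))
skew_after = {
    name: float(stats.skew(t.fit_transform(X).ravel()))
    for name, t in transformers.items()
}

print(f"raw: {raw_skew:.2f}")
for name, value in skew_after.items():
    print(f"{name}: {value:.2f}")
```

As with any learned preprocessing step, the transformer would normally be fitted on the training data and then applied to new data with `transform`.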