# Standardization in Data Transformation

## 1. Standard Scaling

`Standard Scaling` (also known as `Z-score normalization`) is a technique used to standardize the features of a dataset by transforming them into a distribution with a mean of `0` and a standard deviation of `1`. This method is particularly useful when the data follows a Gaussian distribution or when the features have different units or scales, as it brings all features onto a common scale.

Standard Scaling is achieved by subtracting the mean of the feature and then dividing by the standard deviation. This process centers the data around `0` and ensures that the spread of the data is consistent across all features.

The formula for Standard Scaling is:

$$X' = \frac{X - \mu}{\sigma}$$

where:
- $X$ is the original data,
- $\mu$ is the mean of the feature,
- $\sigma$ is the standard deviation of the feature.
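
As a minimal sketch of the formula above, the following applies it by hand with NumPy. The values in `x` are hypothetical and chosen only for illustration; the population standard deviation (`ddof=0`) is assumed here.

```python
import numpy as np

# Hypothetical feature values, chosen only for illustration.
x = np.array([20.0, 25.0, 30.0, 35.0, 40.0])

mu = x.mean()     # mean of the feature
sigma = x.std()   # population standard deviation (ddof=0), an assumption of this sketch

# Apply X' = (X - mu) / sigma element-wise.
x_scaled = (x - mu) / sigma

print(x_scaled.mean())  # approximately 0
print(x_scaled.std())   # approximately 1
```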

### Advantages of Standard Scaling
- **Handles Different Scales**: It effectively standardizes features with different units or scales, making them comparable.
- **Centers Data**: The transformed features have a mean of `0` and a standard deviation of `1`, which can improve the performance of algorithms that are sensitive to feature scale, such as gradient-based and distance-based methods.

### Disadvantages of Standard Scaling
- **Works Best on Roughly Gaussian Data**: It shifts and rescales each feature but does not change the shape of its distribution, so heavily skewed features remain skewed and the results may be less effective.
- **Sensitive to Outliers**: Outliers can significantly affect the mean and standard deviation, leading to misleading scaling.


In [1]:
```python
# Import the libraries

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
```

In [2]:
```python
# Make an example dataset
data = {
    'Age': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
    'Income': [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000, 13000, 14000, 15000, 16000, 17000, 18000, 19000, 20000, 21000],
    'Education': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210]
}

df = pd.DataFrame(data)
df.head()
```

| | Age | Income | Education |
|---|---|---|---|
| 0 | 20 | 1000 | 10 |
| 1 | 21 | 2000 | 20 |
| 2 | 22 | 3000 | 30 |
| 3 | 23 | 4000 | 40 |
| 4 | 24 | 5000 | 50 |


In [3]:
```python
# Standardize the data
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)  # returns a NumPy array
df_scaled = pd.DataFrame(df_scaled, columns=df.columns)
df_scaled.describe()
```

| | Age | Income | Education |
|---|---|---|---|
| count | 21.0 | 21.0 | 21.0 |
| mean | -1.057355e-17 | 5.286776e-17 | 4.229421e-17 |
| std | 1.024695 | 1.024695 | 1.024695 |
| min | -1.651446 | -1.651446 | -1.651446 |
| 25% | -0.8257228 | -0.8257228 | -0.8257228 |
| 50% | 0.0 | 0.0 | 0.0 |
| 75% | 0.8257228 | 0.8257228 | 0.8257228 |
| max | 1.651446 | 1.651446 | 1.651446 |
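
The `std` row above shows `1.024695` rather than exactly `1`. This is because `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). The sketch below reproduces both numbers for the `Age` column; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same Age values as the example dataset (20 through 40).
age = pd.DataFrame({"Age": np.arange(20, 41)})

scaled = StandardScaler().fit_transform(age)[:, 0]

pop_std = scaled.std(ddof=0)     # the quantity StandardScaler normalizes to 1
sample_std = scaled.std(ddof=1)  # the quantity DataFrame.describe() reports

print(pop_std)     # approximately 1.0
print(sample_std)  # approximately 1.024695, i.e. sqrt(21 / 20)
```

With `n = 21` rows, the two differ by a factor of $\sqrt{n / (n - 1)}$, so the gap shrinks as the dataset grows.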


## 2. Min-Max Scaling

`Min-Max Scaling` is a technique used to normalize the features of a dataset by transforming them into a specific range, typically `[0, 1]`. This method is particularly useful when the data does not follow a Gaussian distribution. It is commonly applied before distance-based algorithms such as k-nearest neighbors (KNN), and before training neural networks, which tend to behave better with inputs on a small, bounded scale.

The Min-Max scaling is achieved by subtracting the minimum value of the feature and then dividing by the range (the difference between the maximum and minimum values). This ensures that all values are scaled proportionally within the desired range.

The formula for Min-Max Scaling is:

$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

where:
- $X$ is the original data,
- $X_{min}$ is the minimum value of the feature,
- $X_{max}$ is the maximum value of the feature.
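
The formula above can be sketched by hand as follows. The helper `min_max_scale` and its `low`/`high` parameters are hypothetical names introduced for illustration; they extend the formula to an arbitrary target range, and mapping a constant feature to `low` is an assumption of this sketch.

```python
import numpy as np

def min_max_scale(x, low=0.0, high=1.0):
    """Scale a 1-D array into [low, high] (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # Constant feature: the range is zero, so map everything to `low`
        # (an assumption made here to avoid division by zero).
        return np.full_like(x, low)
    unit = (x - x_min) / (x_max - x_min)  # the [0, 1] formula above
    return low + unit * (high - low)      # stretch and shift to [low, high]

income = np.array([1000, 6000, 11000, 16000, 21000])
print(min_max_scale(income))         # 0, 0.25, 0.5, 0.75, 1
print(min_max_scale(income, -1, 1))  # -1, -0.5, 0, 0.5, 1
```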

### Advantages of Min-Max Scaling
- **Preserves Relationships**: It maintains the relationships between the original data points.
- **Bounded Range**: The transformed features are always within a defined range, which can improve the convergence of some algorithms.

### Disadvantages of Min-Max Scaling
- **Sensitive to Outliers**: The presence of outliers can skew the minimum and maximum values, affecting the scaling of other data points.
- **Not Suitable for All Algorithms**: Some algorithms may perform better with other scaling methods, such as standard scaling.
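
To make the outlier sensitivity concrete, the sketch below scales a small hypothetical income feature to which a single extreme value has been added. The numbers are invented for illustration.

```python
import numpy as np

# Five ordinary incomes plus one hypothetical extreme value.
incomes = np.array([1000.0, 2000.0, 3000.0, 4000.0, 5000.0])
with_outlier = np.append(incomes, 1_000_000.0)

x_min, x_max = with_outlier.min(), with_outlier.max()
scaled = (with_outlier - x_min) / (x_max - x_min)

print(scaled[:5])  # the ordinary values are squeezed into roughly [0, 0.004]
print(scaled[-1])  # the outlier alone occupies 1.0
```

Because the outlier defines the maximum, the remaining values lose almost all of their resolution within the `[0, 1]` range.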

In [4]:
```python
# Import the Min-Max scaler
from sklearn.preprocessing import MinMaxScaler

# Fit the scaler to the data
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df)
df_scaled = pd.DataFrame(df_scaled, columns=df.columns)
df_scaled.describe()
```

| | Age | Income | Education |
|---|---|---|---|
| count | 21.0 | 21.0 | 21.0 |
| mean | 0.5 | 0.5 | 0.5 |
| std | 0.310242 | 0.310242 | 0.310242 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 0.25 | 0.25 | 0.25 |
| 50% | 0.5 | 0.5 | 0.5 |
| 75% | 0.75 | 0.75 | 0.75 |
| max | 1.0 | 1.0 | 1.0 |


## 3. Max Abs Scaler

`Max Abs Scaler` is a technique used to normalize the features of a dataset by transforming them into a range between `-1` and `1`, preserving the sparsity of the data. This method is particularly useful when the data contains many zeros or when the features are on different scales but should maintain the original distribution's sparsity, such as in text data or image pixels.

The Max Abs Scaler works by dividing each feature by the maximum absolute value of that feature. This ensures that the transformed data is scaled within the range `[-1, 1]`. Because the data is only divided and never shifted, zero entries remain zero, which is what preserves sparsity.

The formula for Max Abs Scaler is:

$$X' = \frac{X}{\max(|X|)}$$

where:
- $X$ is the original data,
- $\max(|X|)$ is the maximum absolute value of the feature.
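
As a small sketch of the sparsity-preserving behaviour, the example below passes a SciPy sparse matrix to `MaxAbsScaler`. The matrix values are hypothetical, and the example assumes SciPy is installed alongside scikit-learn.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical sparse feature matrix (rows are samples, columns are features).
X = sparse.csr_matrix(np.array([
    [0.0,  4.0, 0.0],
    [2.0,  0.0, 0.0],
    [0.0, -8.0, 5.0],
]))

X_scaled = MaxAbsScaler().fit_transform(X)

print(X_scaled.toarray())     # each column divided by its largest absolute value
print(X.nnz, X_scaled.nnz)    # the number of stored non-zeros is unchanged
```

Each column is divided by `2`, `8`, and `5` respectively, so every value lands in `[-1, 1]` while the zeros stay zero.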

### Advantages of Max Abs Scaler
- **Preserves Sparsity**: It maintains the sparsity of the original data, which is important for certain types of data, such as text or image data.
- **Bounded Range**: The transformed features are always within the range `[-1, 1]`, which can be beneficial for algorithms that are sensitive to feature magnitude.

### Disadvantages of Max Abs Scaler
- **Sensitive to Outliers**: A single extreme value determines the divisor, so an outlier compresses all other values toward zero, much as it does with Min-Max Scaling.
- **Not Suitable for All Algorithms**: While useful for sparse data, some algorithms may perform better with other scaling methods, such as standard scaling.


In [5]:
```python
# Import the Max Abs Scaler
from sklearn.preprocessing import MaxAbsScaler

# Fit the scaler to the data
scaler = MaxAbsScaler()
df_scaled = scaler.fit_transform(df)
df_scaled = pd.DataFrame(df_scaled, columns=df.columns)
df_scaled.describe()
```

| | Age | Income | Education |
|---|---|---|---|
| count | 21.0 | 21.0 | 21.0 |
| mean | 0.75 | 0.52381 | 0.52381 |
| std | 0.155121 | 0.295468 | 0.295468 |
| min | 0.5 | 0.047619 | 0.047619 |
| 25% | 0.625 | 0.285714 | 0.285714 |
| 50% | 0.75 | 0.52381 | 0.52381 |
| 75% | 0.875 | 0.761905 | 0.761905 |
| max | 1.0 | 1.0 | 1.0 |


## 4. Robust Scaler

`Robust Scaler` is a technique used to normalize the features of a dataset by transforming them based on the median and interquartile range (IQR), making it robust to outliers. This method is particularly useful when the data contains outliers that could skew the results of other scaling methods, such as Min-Max Scaling or Standard Scaling.

The Robust Scaler works by subtracting the median of the feature and then dividing by the IQR (the range between the 25th and 75th percentiles). This process ensures that the scaled features are less influenced by outliers and that the central tendency of the data is preserved.

The formula for Robust Scaler is:

$$X' = \frac{X - \text{median}(X)}{IQR(X)}$$

where:
- $X$ is the original data,
- $\text{median}(X)$ is the median of the feature,
- $IQR(X)$ is the interquartile range of the feature.
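
The formula above can be sketched by hand with NumPy. The values in `x` are hypothetical, and the quartiles are computed with `np.percentile` using its default linear interpolation, which is an assumption of this sketch.

```python
import numpy as np

# Hypothetical feature with one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

median = np.median(x)                # 3.5
q1, q3 = np.percentile(x, [25, 75])  # 2.25 and 4.75 with linear interpolation
iqr = q3 - q1                        # 2.5

# Apply X' = (X - median) / IQR element-wise.
x_scaled = (x - median) / iqr

print(x_scaled)  # -1, -0.6, -0.2, 0.2, 0.6, 38.6
```

The median and IQR are determined by the ordinary values, so the extreme value `100` does not change how the other points are scaled.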

### Advantages of Robust Scaler
- **Resistant to Outliers**: It effectively scales data by reducing the impact of outliers, making it more suitable for datasets with significant outliers.
- **Centers Data**: The transformed features are centered around the median, which is less sensitive to outliers than the mean.

### Disadvantages of Robust Scaler
- **Does Not Bound the Output**: The scaling is linear, so the shape of each feature's distribution is unchanged and outliers remain in the data as extreme values; the scaled features are not confined to a fixed range.
- **Not Suitable for All Algorithms**: Some algorithms that rely on the actual range of the data might perform better with other scaling methods, such as Min-Max Scaling.
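
To illustrate the trade-offs listed above, the following sketch compares `StandardScaler` and `RobustScaler` on a hypothetical feature containing one extreme value. The data and variable names are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical feature: the values 1..20 plus one extreme outlier.
x = np.append(np.arange(1.0, 21.0), 1000.0).reshape(-1, 1)

standard = StandardScaler().fit_transform(x).ravel()
robust = RobustScaler().fit_transform(x).ravel()

# Spread (max - min) of the 20 ordinary values after each scaler.
standard_spread = standard[:20].max() - standard[:20].min()
robust_spread = robust[:20].max() - robust[:20].min()

print(standard_spread)  # roughly 0.09: the outlier inflates the standard deviation
print(robust_spread)    # 1.9: the median and IQR ignore the outlier
print(robust[-1])       # 98.9: the outlier itself is still far from the rest
```

The ordinary values keep a usable spread under `RobustScaler`, but the outlier is not removed or clipped, so the output has no fixed bounds.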


In [8]:
```python
# Import the Robust Scaler
from sklearn.preprocessing import RobustScaler

# Fit the scaler to the data
scaler = RobustScaler()
df_scaled = scaler.fit_transform(df)
df_scaled = pd.DataFrame(df_scaled, columns=df.columns)
df_scaled.describe()
```

| | Age | Income | Education |
|---|---|---|---|
| count | 21.0 | 21.0 | 21.0 |
| mean | 0.0 | 0.0 | 0.0 |
| std | 0.620484 | 0.620484 | 0.620484 |
| min | -1.0 | -1.0 | -1.0 |
| 25% | -0.5 | -0.5 | -0.5 |
| 50% | 0.0 | 0.0 | 0.0 |
| 75% | 0.5 | 0.5 | 0.5 |
| max | 1.0 | 1.0 | 1.0 |
