# 1. Standard Scaling

`Standard scaling` rescales each feature so that its values are centered around 0 with a standard deviation of 1. This is done by subtracting the feature's mean from each data point and then dividing by the feature's standard deviation. It is one of the most common scaling methods, and many machine learning algorithms benefit from it when features are measured on different scales.

The formula for the **z-score** is as follows:

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.
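
As a quick check of the formula, the sketch below computes z-scores by hand with NumPy and compares them with `StandardScaler`. The sample values are illustrative only. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# illustrative values (an assumption for this sketch)
x = np.array([25, 30, 35, 40, 45], dtype=float)

mu = x.mean()
sigma = x.std()  # population standard deviation (ddof=0)
z_manual = (x - mu) / sigma

# StandardScaler expects a 2D array: one column per feature
z_sklearn = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(z_manual)
print(np.allclose(z_manual, z_sklearn))
```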


In [None]:
# import libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

In [None]:
# make an example dataset
df = {
    'age': [25,30,35,40,45],
    'height': [165,170,175,180,185],
    'weight': [55,60,65,70,75]
}

# convert this data to a pandas DataFrame
df = pd.DataFrame(df)
df.head()

# Standard Scaler

In [None]:
# StandardScaler output is not bounded; for roughly normal data most values fall between -3 and 3

# create the scaler
scaler = StandardScaler()

# fit the scaler on the data and transform it
scaled_df = scaler.fit_transform(df)

# convert the scaled array back into a pandas DataFrame
scaled_df = pd.DataFrame(scaled_df, columns=df.columns)
scaled_df.head()
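
The following sketch, using made-up values, illustrates two properties of standard-scaled data: the result has mean 0 and standard deviation 1, and an outlier can land well outside the -3 to 3 band, since the output has no fixed range.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up data: twenty small values plus one large outlier
values = np.array(list(range(20)) + [1000], dtype=float).reshape(-1, 1)

z = StandardScaler().fit_transform(values)

print("mean:", z.mean())  # close to 0
print("std:", z.std())    # close to 1
print("max:", z.max())    # z-score of the outlier, larger than 3
```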

# Min-Max Scaler

In [None]:
# [0, 1] is the default range of the min-max scaler: each column's minimum maps to 0 and its maximum to 1

# create the scaler
scaler = MinMaxScaler()

# fit the scaler on the data and transform it
scaled_df = scaler.fit_transform(df)
# convert the scaled array back into a pandas DataFrame
scaled_df = pd.DataFrame(scaled_df, columns=df.columns)
scaled_df.head()
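
Min-max scaling can be written as `x' = (x - min) / (max - min)` per column. The sketch below applies that formula by hand to the same example values and compares the result with `MinMaxScaler`. The names `manual` and `from_sklearn` are hypothetical, chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'height': [165, 170, 175, 180, 185],
    'weight': [55, 60, 65, 70, 75],
})

# column-wise min-max formula applied by hand
manual = (df - df.min()) / (df.max() - df.min())

from_sklearn = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

print(manual)
print(np.allclose(manual.values, from_sklearn.values))
```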

# Max-Abs Scaler

In [None]:
# [-1, 1] is the range of the max-abs scaler: each column is divided by its maximum absolute value
# for non-negative data such as this example, the range is [0, 1]

# create the scaler
scaler = MaxAbsScaler()

# fit the scaler on the data and transform it
scaled_df = scaler.fit_transform(df)
# convert the scaled array back into a pandas DataFrame
scaled_df = pd.DataFrame(scaled_df, columns=df.columns)
scaled_df.head()
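
To show how the max-abs scaler treats negative numbers, here is a small sketch with made-up values. Each value is divided by the largest absolute value in its column, so signs are preserved and the output lies in [-1, 1].

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# made-up column containing negative values
x = np.array([-4.0, -2.0, 0.0, 1.0, 2.0]).reshape(-1, 1)

manual = x / np.abs(x).max()
from_sklearn = MaxAbsScaler().fit_transform(x)

print(from_sklearn.ravel())
print(np.allclose(manual, from_sklearn))
```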

# Robust Scaler

In [None]:
# Robust scaler: subtracts the median and divides by the interquartile range (IQR)
# the output has no fixed range; this evenly spaced example happens to fall in [-1, 1]

from sklearn.preprocessing import RobustScaler

# create the scaler
scaler = RobustScaler()

# fit the scaler on the data and transform it
scaled_df = scaler.fit_transform(df)
# convert the scaled array back into a pandas DataFrame
scaled_df = pd.DataFrame(scaled_df, columns=df.columns)
scaled_df.head()
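
The sketch below reproduces the robust scaler by hand as `(x - median) / IQR`, using made-up values that include one outlier. It assumes the default quantile range of the 25th to 75th percentile. The outlier lands far outside [-1, 1], while the remaining values stay on a small scale.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# made-up data with one extreme value
x = np.array([25.0, 30.0, 35.0, 40.0, 45.0, 500.0]).reshape(-1, 1)

median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)

from_sklearn = RobustScaler().fit_transform(x)

print(from_sklearn.ravel())
print(np.allclose(manual, from_sklearn))
```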

---
# Transformation
Transformation converts non-normal (non-parametric) data to a normal (Gaussian) or uniform distribution.

- NOTE: Transformation is considered a form of **Normalization**.

Distribution Conversion Image: https://scikit-learn.org/stable/modules/preprocessing.html#mapping-to-a-gaussian-distribution

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# generate non-normal data (exponential distribution)
np.random.seed(0)
df = np.random.exponential(size=1000, scale=2) # scale is the mean of the distribution, so the values average about 2
# print(df)
df = pd.DataFrame(df, columns=['values'])
df.head()

In [None]:
sns.histplot(df['values'], kde=True)

In [None]:
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import QuantileTransformer

pt_boxcox = PowerTransformer(method='box-cox', standardize=False)
pt_yeo_johnson = PowerTransformer(method='yeo-johnson', standardize=False)
qt_normal = QuantileTransformer(output_distribution='normal')

# Box-Cox requires strictly positive data, so shift the values by 1
df['Box_Cox'] = pt_boxcox.fit_transform(df[['values']] + 1)
df['Yeo_Johnson'] = pt_yeo_johnson.fit_transform(df[['values']])
df['Quantile'] = qt_normal.fit_transform(df[['values']])

In [None]:
df.head()

In [None]:
# create histograms for all columns with sns.histplot and kde=True, using a for loop
for col in df.columns:
    sns.histplot(df[col], kde=True)
    plt.show()
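
Besides inspecting histograms, skewness gives a rough numeric check of how symmetric each column is. The sketch below is self-contained: it regenerates similar exponential data with its own random generator, so the numbers will differ from the cells above. It uses `DataFrame.skew()`, where values near 0 indicate a roughly symmetric distribution.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# regenerate skewed data for this sketch (seed chosen arbitrarily)
rng = np.random.default_rng(0)
data = pd.DataFrame({'values': rng.exponential(scale=2, size=1000)})

# Box-Cox needs strictly positive input, so shift by 1
data['Box_Cox'] = PowerTransformer(method='box-cox').fit_transform(data[['values']] + 1).ravel()
data['Yeo_Johnson'] = PowerTransformer(method='yeo-johnson').fit_transform(data[['values']]).ravel()
data['Quantile'] = QuantileTransformer(
    output_distribution='normal', n_quantiles=1000, random_state=0
).fit_transform(data[['values']]).ravel()

# skewness per column: the raw column is strongly right-skewed
skewness = data.skew()
print(skewness)
```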

---
# Normalization

Data normalization is a technique often applied as part of data preparation for machine learning. Its goal is to bring the values of numeric columns onto a common scale without distorting differences in the ranges of values. Not every dataset requires normalization; it is needed only when features have different ranges.

### L2 Normalization:

Rescales each sample (row) to have unit norm. The L2 norm is calculated as the square root of the sum of the squared vector values.

This type of normalization is often used with text data, mostly in NLP.

In [None]:
from sklearn.preprocessing import Normalizer
# Row 1: L2 norm = sqrt(1^2 + 1^2 + 1^2) = sqrt(3), so each value becomes 1 / sqrt(3) ≈ 0.57735027
# Check from the output: sqrt(3 * 0.57735027^2) = 1
data = [[1, 1, 1],
        [1, 1, 0],
        [1, 0, 0]]
normalizer = Normalizer(norm='l2')
print(normalizer.fit_transform(data))
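
The same result can be reproduced by hand. The sketch below divides each row by its L2 norm with NumPy and confirms that every normalized row has length 1. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

data = np.array([[1, 1, 1],
                 [1, 1, 0],
                 [1, 0, 0]], dtype=float)

# L2 norm of each row: sqrt of the sum of squared values
row_norms = np.linalg.norm(data, axis=1, keepdims=True)
manual_l2 = data / row_norms

sklearn_l2 = Normalizer(norm='l2').fit_transform(data)

print(manual_l2)
print(np.allclose(manual_l2, sklearn_l2))
```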

### L1 Normalization:

Also rescales each sample (row) but with a different approach, ensuring the sum of the absolute values is 1 in each row.
The L1 norm is calculated as the sum of the absolute vector values.
Example:

In [None]:
from sklearn.preprocessing import Normalizer
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0]]
normalizer = Normalizer(norm='l1')
print(normalizer.fit_transform(data))
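
A hand-computed version of L1 normalization, sketched with made-up rows that include a negative value: each row is divided by the sum of its absolute values, so signs are kept and the absolute values in each row sum to 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# made-up rows; the first one contains a negative value
rows = np.array([[1.0, -1.0, 2.0],
                 [1.0, 1.0, 0.0]])

# L1 norm of each row: sum of absolute values
l1_norms = np.abs(rows).sum(axis=1, keepdims=True)
manual_l1 = rows / l1_norms

sklearn_l1 = Normalizer(norm='l1').fit_transform(rows)

print(manual_l1)
print(np.allclose(manual_l1, sklearn_l1))
```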

Feature-wise (column) normalization methods covered earlier:

1. Z-score normalization
   1. Standard Scaler
2. Min-Max normalization
   1. Min-Max Scaler

## Log Transformation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# example dataset with skewed values
df = { "Values": [1,5,10,20,50,100,200,500,1000,2000,5000,10000,20000,50000,100000,1000000]}
df = pd.DataFrame(df)
df.head()

In [None]:
df['log_values'] = np.log(df['Values'])
df

In [None]:
for col in df.columns:
    sns.histplot(df[col], kde=True)
    plt.show()
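
`np.log` returns `-inf` for 0, so data that contains zeros needs a different approach. One common option, sketched below with made-up values, is `np.log1p`, which computes `log(1 + x)`, and its inverse `np.expm1`. The column names here are hypothetical.

```python
import numpy as np
import pandas as pd

# made-up skewed data that includes a zero
demo = pd.DataFrame({'Values': [0, 1, 10, 100, 1000]})

# log(1 + x) is defined at x = 0
demo['log1p_values'] = np.log1p(demo['Values'])

# expm1 reverses log1p and recovers the original values
demo['recovered'] = np.expm1(demo['log1p_values'])

print(demo)
```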