<a href="https://colab.research.google.com/github/AppleBoiy/intro-to-machine-learning/blob/main/preprocess/standardization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is Standardization?


Standardization is a preprocessing technique used in data science and machine learning to rescale numerical features so that each one has a **mean of 0** and a **standard deviation of 1**. It only shifts and rescales the values, so the shape of the distribution is unchanged (a skewed feature stays skewed). The result is a common scale that suits algorithms sensitive to the magnitude of features.

## Standardization Formula



The formula for standardization is:

$$
X_{\text{standardized}} = \frac{X - \mu}{\sigma}
$$

Where:

*	$X$: Original feature value.
*	$\mu$: Mean of the feature.
*	$\sigma$: Standard deviation of the feature.

After standardization:
1.	The **mean** becomes 0.
2.	The **standard deviation** becomes 1.
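
As a small worked check of the formula, the sketch below standardizes a made-up feature by hand with NumPy. The values in `x` are hypothetical and chosen so the arithmetic is easy to follow; `np.std` is used with its default population form (`ddof=0`).

```python
import numpy as np

# Hypothetical feature values, chosen only for illustration
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()     # mean of the feature
sigma = x.std()   # population standard deviation (ddof=0)

# Apply the standardization formula element-wise
z = (x - mu) / sigma

print("mean:", mu, "std:", sigma)   # mean: 5.0 std: 2.0
print("standardized:", z)
print("new mean:", z.mean(), "new std:", z.std())
```

Each standardized value is the number of standard deviations the original value lies from the mean, so 9.0 maps to 2.0 and 2.0 maps to -1.5.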

## Why Use Standardization?



1.	Balances Features with Different Scales:
  *	Features with larger numeric ranges (e.g., a distance recorded in meters rather than kilometers) can dominate the learning process. Standardization puts every feature on a comparable scale so that none dominates purely because of its units.
2.	Improves Model Convergence:
  *	Models trained with gradient-based optimization (e.g., gradient descent) often converge faster when the input features are standardized.
3.	Important for Scale-Sensitive Models:
  *	Some models, like Principal Component Analysis (PCA), Support Vector Machines (SVMs), and k-Means, are sensitive to feature scale and usually perform better when the data is standardized.
4.	Enhances Distance-Based Algorithms:
  *	Algorithms that rely on distance metrics (e.g., k-NN) benefit from standardized data because it prevents large-scale features from dominating the distance.
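
The effect on distance-based methods can be sketched with a toy example. The income and age values below are hypothetical, and `euclidean` is a helper defined here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: column 0 is income (dollars), column 1 is age (years)
X = np.array([
    [30_000.0, 25.0],
    [30_500.0, 60.0],
    [45_000.0, 26.0],
    [60_000.0, 40.0],
])

def euclidean(a, b):
    """Euclidean distance between two 1-D vectors."""
    return float(np.linalg.norm(a - b))

# Raw distances from sample 0: income differences swamp age differences
raw_to_1 = euclidean(X[0], X[1])   # large age gap, small income gap
raw_to_2 = euclidean(X[0], X[2])   # small age gap, large income gap

# Same distances after standardizing both columns
X_std = StandardScaler().fit_transform(X)
std_to_1 = euclidean(X_std[0], X_std[1])
std_to_2 = euclidean(X_std[0], X_std[2])

print(f"raw:          d(0,1)={raw_to_1:.1f}  d(0,2)={raw_to_2:.1f}")
print(f"standardized: d(0,1)={std_to_1:.2f}  d(0,2)={std_to_2:.2f}")
```

On the raw data, sample 1 looks closest to sample 0 because age barely contributes to the distance. After standardization both features contribute, and sample 2 becomes the nearer neighbor.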

## When to Use Standardization?


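Use standardization when:
1.	The dataset has features with varying scales or units.
2.	The algorithm is driven by feature variances or distances (e.g., PCA, k-Means), so an unscaled feature with a large variance would dominate the result. Note that standardization does not make data normally distributed.
3.	You’re using regularization techniques (e.g., L1 or L2 regularization in regression), because the penalty acts on coefficients whose size depends on the scale of each feature.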
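
For the regularization case, one common pattern is to chain the scaler and the model in a scikit-learn pipeline. The sketch below is an illustration rather than a recommended setup: the choice of `LogisticRegression` (which applies L2 regularization by default), the 75/25 split, and `random_state=0` are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The pipeline fits the scaler on the training split only, then reuses
# those training means and standard deviations when scoring the test split.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Fitting the scaler inside the pipeline keeps test-set statistics out of training, which would otherwise leak information into the model.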

## How Does Standardization Differ from Normalization?

Aspect |	Standardization	|Normalization (min-max scaling)
--- | --- | ---
Result |	Mean = 0, standard deviation = 1; values are unbounded	| Values scaled to [0, 1] or [a, b]
Formula	| $(X - \mu) / \sigma$ |	$(X - \text{min}) / (\text{max} - \text{min})$
Typical use |	Scale-sensitive models such as PCA, SVMs, k-NN, and regularized regression | 	When a bounded range is required; sensitive to outliers because it depends on the min and max
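
To make the comparison concrete, the following sketch applies both scalers to the same made-up column. The values in `x` are hypothetical; `MinMaxScaler` is scikit-learn's implementation of the min-max formula in the table.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with one comparatively large value
x = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

x_std = StandardScaler().fit_transform(x)    # (x - mean) / std
x_minmax = MinMaxScaler().fit_transform(x)   # (x - min) / (max - min)

print("standardized:", x_std.ravel().round(3))
print("min-max:     ", x_minmax.ravel().round(3))
```

The min-max output always spans exactly [0, 1], while the standardized output is centered on 0 and has no fixed bounds.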

# Example: Standardization Using Python

We’ll use the Iris dataset to demonstrate standardization.


## Step 1: Import Libraries

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import pandas as pd
```

## Step 2: Load and Inspect Data

```python
# Load Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
```

```python
df.describe()
```

Statistic | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm)
--- | --- | --- | --- | ---
count | 150.0 | 150.0 | 150.0 | 150.0
mean | 5.843333 | 3.057333 | 3.758 | 1.199333
std | 0.828066 | 0.435866 | 1.765298 | 0.762238
min | 4.3 | 2.0 | 1.0 | 0.1
25% | 5.1 | 2.8 | 1.6 | 0.3
50% | 5.8 | 3.0 | 4.35 | 1.3
75% | 6.4 | 3.3 | 5.1 | 1.8
max | 7.9 | 4.4 | 6.9 | 2.5


## Step 3: Standardize the Data

```python
# Initialize the StandardScaler
scaler = StandardScaler()

# Standardize the dataset
standardized_data = scaler.fit_transform(df)

# Convert to a DataFrame for better readability
df_standardized = pd.DataFrame(standardized_data, columns=df.columns)
```

```python
df_standardized.head()
```

Index | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm)
--- | --- | --- | --- | ---
0 | -0.900681 | 1.019004 | -1.340227 | -1.315444
1 | -1.143017 | -0.131979 | -1.340227 | -1.315444
2 | -1.385353 | 0.328414 | -1.397064 | -1.315444
3 | -1.506521 | 0.098217 | -1.283389 | -1.315444
4 | -1.021849 | 1.249201 | -1.340227 | -1.315444
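
A quick way to confirm the transformation worked is to check the column means and standard deviations. The sketch below rebuilds the standardized frame so it can run on its own. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `std()` defaults to the sample form (`ddof=1`), so the two differ slightly.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Rebuild the standardized frame so this cell runs on its own
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Column means should be 0 up to floating-point error
print(df_standardized.mean().round(6))

# With ddof=0 the standard deviation is 1; pandas' default (ddof=1)
# gives a slightly larger value because it divides by n - 1.
print(df_standardized.std(ddof=0).round(6))
print(df_standardized.std().round(6))

# The fitted statistics are stored on the scaler
print(scaler.mean_)    # per-feature means learned from df
print(scaler.scale_)   # per-feature standard deviations (ddof=0)
```

For the same reason, `df_standardized.describe()` reports a `std` a little above 1 for every column rather than exactly 1.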


## Conclusion

Standardization is useful for:
*	Preparing data whose features have different scales or units for machine learning.
*	Helping models that are sensitive to feature scale train reliably and perform well.

