<a href="https://colab.research.google.com/github/AppleBoiy/intro-to-machine-learning/blob/main/preprocess/normalization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is Normalization?

Normalization is a data preprocessing technique used in machine learning and data science that rescales numerical features to a common range, typically [0, 1]. Putting every feature on the same scale keeps features with larger numeric ranges from dominating the others during model training.



## Key Idea

The goal of normalization is to map every feature onto the same range. This often improves model performance, especially for algorithms that are sensitive to feature scale (e.g., models trained with gradient descent, k-NN, SVMs, neural networks).

## Common Normalization Methods

###	1.	Min-Max Normalization

$$
X_{norm} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
$$

*	Rescales data to the range [0, 1].
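*	Works well when the data is spread fairly evenly between its minimum and maximum. Because the range is set by the two extreme values, a single outlier can compress the remaining values into a narrow band.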
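
As a rough illustration of the formula above, the following sketch applies it to a small made-up array with NumPy. The function name `min_max_normalize` and the sample values are assumptions for this example, and returning zeros for a constant feature is one possible convention for avoiding division by zero.

```python
import numpy as np

def min_max_normalize(x):
    """Rescale a 1-D array to [0, 1] using the min-max formula."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # Constant feature: the formula would divide by zero, so return zeros.
        return np.zeros_like(x)
    return (x - x_min) / (x_max - x_min)

# Hypothetical values chosen so the arithmetic is easy to follow
values = np.array([10.0, 20.0, 25.0, 50.0])
normalized = min_max_normalize(values)
print(normalized)  # values: 0, 0.25, 0.375, 1
```

Here the minimum (10) maps to 0, the maximum (50) maps to 1, and 20 maps to (20 - 10) / (50 - 10) = 0.25.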

###	2.	Scaling to a Specified Range
*	Similar to Min-Max Normalization but scales the data to any desired range [a, b].
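
For a target range [a, b], the min-max formula generalizes to:

$$
X_{scaled} = a + \frac{(X - X_{\text{min}})(b - a)}{X_{\text{max}} - X_{\text{min}}}
$$

A minimal sketch of this using the `feature_range` parameter of sklearn's `MinMaxScaler`, with a manual computation for comparison. The sample values and the choice of [-1, 1] are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: one feature, four samples
X = np.array([[10.0], [20.0], [25.0], [50.0]])

# feature_range sets the target interval [a, b]; here a = -1, b = 1
scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X)

# The same result computed directly from the formula
a, b = -1, 1
X_manual = a + (X - X.min()) * (b - a) / (X.max() - X.min())

print(X_scaled.ravel())                 # values: -1, -0.5, -0.25, 1
print(np.allclose(X_scaled, X_manual))  # True
```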

## Why is Normalization Important?

1.	Improves Convergence Speed:
	*	Gradient-based optimizers (e.g., gradient descent) typically converge faster when features are normalized, because the loss surface is less stretched along the large-scale features.
2.	Prevents Domination by Large-Scale Features:
	*	Algorithms like k-NN or SVMs compute distances between data points; large-scale features can distort these computations if not normalized.
3.	Helps Neural Networks Learn Efficiently:
	*	Normalized inputs help keep neurons out of the saturated regions of activations such as sigmoid and tanh, where gradients are close to zero.
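
The second point can be sketched with a small example. The three samples below (age in years, income in dollars) are made up for illustration, and `distances_from_first` is a hypothetical helper. In raw units the income difference decides the Euclidean distances almost alone; after min-max scaling, both features contribute on a comparable footing.

```python
import numpy as np

# Hypothetical samples: [age in years, income in dollars]
X = np.array([
    [25.0, 50_000.0],
    [60.0, 51_000.0],
    [26.0, 58_000.0],
])

def distances_from_first(data):
    """Euclidean distance from the first row to every other row."""
    return np.linalg.norm(data[1:] - data[0], axis=1)

raw = distances_from_first(X)

# Min-max normalize each column to [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled = distances_from_first(X_norm)

print("raw distances:   ", raw)
print("scaled distances:", scaled)
```

In raw units the second sample looks about eight times closer than the third, even though it differs by 35 years in age, because the income column dwarfs the age column. After scaling, the two distances are nearly equal.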

## When to Use Normalization?

Use normalization when:

1.	The algorithm is distance-based (e.g., k-NN, k-Means).
2.	Features have different units or scales.
3.	You use models sensitive to input scale (e.g., neural networks, SVMs).
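
As one possible way to combine normalization with a distance-based model, the sketch below chains `MinMaxScaler` and a k-NN classifier in an sklearn pipeline on the Iris dataset. The train/test split, `random_state`, and `n_neighbors=5` are assumptions made for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The pipeline fits the scaler on the training split only,
# then reuses the learned min/max when scoring the test split.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Wrapping the scaler in a pipeline means the minimum and maximum are learned from the training data alone, so the test data does not influence the scaling.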

# Step-by-Step Example: Normalization

## Step 1: Import Necessary Libraries


In [None]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

## Step 2: Load the Iris Dataset

In [None]:
# Load Iris dataset
iris = load_iris()

# Create a DataFrame
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['Species'] = iris.target_names[iris.target]  # Add target (species) column

In [None]:
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


## Step 3: Normalize the Numerical Columns

We will normalize the numerical columns: sepal length, sepal width, petal length, and petal width.

In [None]:
# Initialize the scaler
scaler = MinMaxScaler()

# Select numerical columns for normalization
numerical_columns = iris.feature_names

# Fit and transform the numerical columns
df_normalized = df.copy()
df_normalized[numerical_columns] = scaler.fit_transform(df[numerical_columns])

In [None]:
df_normalized.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Species |
|---|---|---|---|---|---|
| 0 | 0.222222 | 0.625 | 0.067797 | 0.041667 | setosa |
| 1 | 0.166667 | 0.416667 | 0.067797 | 0.041667 | setosa |
| 2 | 0.111111 | 0.5 | 0.050847 | 0.041667 | setosa |
| 3 | 0.083333 | 0.458333 | 0.084746 | 0.041667 | setosa |
| 4 | 0.194444 | 0.666667 | 0.067797 | 0.041667 | setosa |
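
To see where these numbers come from, the following sketch recomputes the normalized values directly from the min-max formula and compares them with the scaler's output. It reloads the data so that it runs on its own; the variable names are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)

# Apply the min-max formula column by column
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(np.allclose(X_norm, X_manual))  # True

# First row by hand: sepal length 5.1, with column min 4.3 and max 7.9
print((5.1 - 4.3) / (7.9 - 4.3))  # approximately 0.2222

# inverse_transform maps the normalized values back to centimeters
X_back = scaler.inverse_transform(X_norm)
print(np.allclose(X_back, X))  # True
```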


## Step 4: Verify Normalization

To confirm the normalization was successful, check the minimum and maximum values of each numerical column.

In [None]:
df_normalized[numerical_columns].min()

| | 0 |
|---|---|
| sepal length (cm) | 0.0 |
| sepal width (cm) | 0.0 |
| petal length (cm) | 0.0 |
| petal width (cm) | 0.0 |


In [None]:
df_normalized[numerical_columns].max()

| | 0 |
|---|---|
| sepal length (cm) | 1.0 |
| sepal width (cm) | 1.0 |
| petal length (cm) | 1.0 |
| petal width (cm) | 1.0 |


## Explanation

*	Before Normalization: The features had different ranges (e.g., sepal length spans 4.3 to 7.9 cm, while petal width spans 0.1 to 2.5 cm).
*	After Normalization: Every feature is scaled to the [0, 1] range, so each contributes on a comparable scale to distance-based computations.
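
The original ranges can be read back from a fitted scaler, which stores the per-feature minimum, maximum, and range it learned. A minimal sketch, reloading the data so it runs on its own (the `ranges` DataFrame is an assumption for this example):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

iris = load_iris()
scaler = MinMaxScaler().fit(iris.data)

# data_min_, data_max_, and data_range_ hold the statistics learned during fit
ranges = pd.DataFrame(
    {
        "min": scaler.data_min_,
        "max": scaler.data_max_,
        "range": scaler.data_range_,
    },
    index=iris.feature_names,
)
print(ranges)
```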

## Summary

*	Dataset Used: Iris dataset.
*	Columns Normalized: The four numerical columns (sepal length, sepal width, petal length, petal width).
*	Tool Used: MinMaxScaler from sklearn.

Normalization puts all features on the same scale, which often improves model performance, particularly for scale-sensitive algorithms such as neural networks and k-NN.