# Feature Scaling and Normalization

Feature scaling is an important preprocessing step because many machine learning models perform better when features are on a similar scale.

In this notebook, we will cover:
- Why scaling is needed
- Standardization (Z-score normalization)
- Min-Max Scaling (Normalization)
- Robust Scaling
- Comparing effects of scaling

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

data = {
    'Age': [18, 25, 30, 50, 80],
    'Salary': [20000, 35000, 50000, 100000, 300000]
}

df = pd.DataFrame(data)
df

## 1. Why Feature Scaling?

- Distance-based algorithms such as **KNN** and **SVM**, and models trained with **gradient descent**, are sensitive to feature scale.
- Features with larger ranges (like `Salary`) can dominate features with smaller ranges (like `Age`).
- Scaling puts features on comparable ranges, so no feature dominates purely because of its units.
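
To make the "dominating feature" point concrete, here is a minimal sketch that measures how much each feature contributes to the squared Euclidean distance between the first two rows, before and after standardizing. The array `X` repeats the toy data from the DataFrame above; the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same toy data as the DataFrame above; columns are Age, Salary.
X = np.array([
    [18, 20000],
    [25, 35000],
    [30, 50000],
    [50, 100000],
    [80, 300000],
], dtype=float)

# Per-feature share of the squared distance between rows 0 and 1 (raw units).
raw_diff = X[0] - X[1]
raw_share = raw_diff**2 / np.sum(raw_diff**2)

# Same calculation after standardizing each column.
X_scaled = StandardScaler().fit_transform(X)
scaled_diff = X_scaled[0] - X_scaled[1]
scaled_share = scaled_diff**2 / np.sum(scaled_diff**2)

print("raw share    (Age, Salary):", raw_share)
print("scaled share (Age, Salary):", scaled_share)
```

In raw units, almost all of the distance comes from `Salary`, so a distance-based model would effectively ignore `Age`. After standardization both features contribute on a comparable footing.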

## 2. Standardization (Z-score Normalization)
- Formula:  
  \[ z = \frac{x - \mu}{\sigma} \]
- Centers each feature at mean 0 with standard deviation 1.
- Useful when features are roughly normally distributed or when the algorithm expects centered inputs; the result is unbounded and can be negative.

In [None]:
scaler = StandardScaler()
df_standardized = scaler.fit_transform(df)
df_standardized = pd.DataFrame(df_standardized, columns=['Age', 'Salary'])
df_standardized
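
As a sanity check on the formula, the sketch below computes the z-score by hand and compares it with `StandardScaler`. It assumes the same toy data, rebuilt here as `df_demo` so the block runs on its own. Note that `StandardScaler` divides by the population standard deviation, so the manual version needs `ddof=0` (pandas defaults to `ddof=1`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data repeated so this cell is self-contained.
df_demo = pd.DataFrame({
    'Age': [18, 25, 30, 50, 80],
    'Salary': [20000, 35000, 50000, 100000, 300000],
})

# Manual z-score: subtract the column mean, divide by the population std (ddof=0).
manual_z = (df_demo - df_demo.mean()) / df_demo.std(ddof=0)

# Reference result from scikit-learn.
sk_z = StandardScaler().fit_transform(df_demo)

print("matches StandardScaler:", np.allclose(manual_z.to_numpy(), sk_z))
print("column means:", manual_z.mean().round(10).tolist())
print("column stds (ddof=0):", manual_z.std(ddof=0).round(10).tolist())
```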

## 3. Min-Max Scaling (Normalization)
- Formula:  
  \[ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \]
- Scales each feature into the range [0, 1].
- Useful when bounded values are required, but sensitive to outliers because the minimum and maximum define the range.

In [None]:
scaler = MinMaxScaler()
df_minmax = scaler.fit_transform(df)
df_minmax = pd.DataFrame(df_minmax, columns=['Age', 'Salary'])
df_minmax
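
The same formula can be applied by hand as a check. This is a minimal sketch that again rebuilds the toy data as `df_demo` (an illustrative name) and compares the manual result with `MinMaxScaler` at its default range of [0, 1].

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data repeated so this cell is self-contained.
df_demo = pd.DataFrame({
    'Age': [18, 25, 30, 50, 80],
    'Salary': [20000, 35000, 50000, 100000, 300000],
})

# Manual min-max scaling, column by column.
col_min = df_demo.min()
col_max = df_demo.max()
manual_minmax = (df_demo - col_min) / (col_max - col_min)

# Reference result from scikit-learn.
sk_minmax = MinMaxScaler().fit_transform(df_demo)

print("matches MinMaxScaler:", np.allclose(manual_minmax.to_numpy(), sk_minmax))
print(manual_minmax.round(3))
```

Because the largest salary (300000) sets the top of the range, the other four salaries all end up below 0.3. This is the outlier sensitivity that min-max scaling is known for.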

## 4. Robust Scaling
- Uses median and interquartile range (IQR) instead of mean and standard deviation.
- Formula:  
  \[ x' = \frac{x - \mathrm{median}(x)}{\mathrm{IQR}(x)}, \qquad \mathrm{IQR}(x) = Q_3 - Q_1 \]
- Less sensitive to outliers than Standardization and Min-Max Scaling.

In [None]:
scaler = RobustScaler()
df_robust = scaler.fit_transform(df)
df_robust = pd.DataFrame(df_robust, columns=['Age', 'Salary'])
df_robust
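
A hand computation of the robust formula, for comparison with `RobustScaler`. This sketch assumes the scaler's default settings (centering on the median, scaling by the 25th to 75th percentile range) and rebuilds the toy data as `df_demo` so it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy data repeated so this cell is self-contained.
df_demo = pd.DataFrame({
    'Age': [18, 25, 30, 50, 80],
    'Salary': [20000, 35000, 50000, 100000, 300000],
})

# Median and interquartile range (Q3 - Q1) for each column.
median = df_demo.median()
iqr = df_demo.quantile(0.75) - df_demo.quantile(0.25)

# Manual robust scaling.
manual_robust = (df_demo - median) / iqr

# Reference result from scikit-learn (default quantile_range is 25 to 75).
sk_robust = RobustScaler().fit_transform(df_demo)

print("matches RobustScaler:", np.allclose(manual_robust.to_numpy(), sk_robust))
print(manual_robust.round(3))
```

The median row maps to 0 in each column, and the extreme values stay extreme instead of compressing the rest of the data.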

## 5. Comparing Different Scaling Techniques

In [None]:
comparison = pd.DataFrame({
    'Original_Age': df['Age'],
    'Original_Salary': df['Salary'],
    'Standardized_Age': df_standardized['Age'],
    'MinMax_Age': df_minmax['Age'],
    'Robust_Age': df_robust['Age']
})
comparison
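
The table above compares the scaled values of `Age` only. One possible way to view both features side by side is to loop over the scalers and concatenate the results under a two-level column header. This is a sketch; `df_demo`, `scalers`, and `full_comparison` are illustrative names, not part of the cells above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Toy data repeated so this cell is self-contained.
df_demo = pd.DataFrame({
    'Age': [18, 25, 30, 50, 80],
    'Salary': [20000, 35000, 50000, 100000, 300000],
})

scalers = {
    'Standardized': StandardScaler(),
    'MinMax': MinMaxScaler(),
    'Robust': RobustScaler(),
}

# Start with the original values, then add one block of columns per scaler.
parts = {'Original': df_demo.astype(float)}
for name, scaler in scalers.items():
    parts[name] = pd.DataFrame(
        scaler.fit_transform(df_demo),
        columns=df_demo.columns,
        index=df_demo.index,
    )

# Passing a dict to concat with axis=1 yields (method, feature) column labels.
full_comparison = pd.concat(parts, axis=1)
print(full_comparison.round(3))
```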

## ✅ Summary
- **Standardization**: Mean = 0, Std = 1, well suited to roughly normally distributed features.
- **Min-Max Scaling**: Scales data into [0, 1], useful when bounded values are required but sensitive to outliers.
- **Robust Scaling**: Uses median & IQR, good for data with outliers.

👉 Choice of scaling method depends on the algorithm and dataset characteristics.