
# 🧮 Feature Scaling Techniques in Machine Learning

This notebook demonstrates **four different feature scaling techniques** using a realistic customer dataset.  
Each section includes:
- A **conceptual explanation**
- Step-by-step **Python implementation**
- Explanation of **why and when** to use each method


## 📂 Step 1: Import Libraries and Load Dataset

In [1]:

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Load dataset
df = pd.read_csv("customer_scaling_dataset.csv")
df.head()


Unnamed: 0,Customer_ID,Age,Annual_Income,Spending_Score,Debt_to_Income_Ratio,Transactions_Last_Month
0,CUST_001,56,42159,81,0.665684,18
1,CUST_002,69,135510,8,0.543383,19
2,CUST_003,46,131530,35,0.710022,31
3,CUST_004,32,105077,35,0.66257,6
4,CUST_005,60,60920,33,0.230599,40



## 🧠 Step 2: Understanding Why We Scale Features

Machine learning algorithms often perform better when features are on a **similar scale**.

If one feature has values in the thousands (e.g., income) and another in the tens (e.g., age),  
distance-based and gradient-based models may give **more weight** to the larger-valued feature, even if it's not truly more important.

Scaling helps with:
- Comparable influence of each feature on distances and gradients
- Faster convergence in optimization-based models
- Better numerical stability
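
To make the point concrete, here is a minimal sketch of how an unscaled income column can dominate a Euclidean distance. The two customers and the feature ranges (`mins`, `maxs`) are made-up illustrative values, not taken from the dataset.

```python
import numpy as np

# Two hypothetical customers: [Age, Annual_Income]
a = np.array([25.0, 50_000.0])
b = np.array([60.0, 51_000.0])

# Raw Euclidean distance: the income gap (1,000) swamps the age gap (35)
raw_dist = np.linalg.norm(a - b)

# Min-max scale both features using assumed ranges (illustrative only)
mins = np.array([18.0, 20_000.0])
maxs = np.array([70.0, 150_000.0])
a_scaled = (a - mins) / (maxs - mins)
b_scaled = (b - mins) / (maxs - mins)
scaled_dist = np.linalg.norm(a_scaled - b_scaled)

# Share of the squared distance contributed by income, before and after scaling
raw_income_share = (a[1] - b[1]) ** 2 / raw_dist ** 2
scaled_income_share = (a_scaled[1] - b_scaled[1]) ** 2 / scaled_dist ** 2

print(f"Income share of squared distance (raw):    {raw_income_share:.4f}")
print(f"Income share of squared distance (scaled): {scaled_income_share:.4f}")
```

Before scaling, almost all of the distance comes from income; after scaling, the large age difference drives it instead.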



## 1️⃣ Min-Max Scaler (Range Scaling)

**Formula:**
\( x' = \frac{x - x_{min}}{x_{max} - x_{min}} \)

**Concept:**
- Transforms all values into a fixed range, usually **0 to 1**.
- Preserves the shape of the distribution but changes the scale.

**Best used when:**
- Data does **not** have outliers.
- You want to bring features into a **bounded range** (e.g., percentages, pixel values).
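
As a quick check of the formula, the sketch below applies it by hand to a small made-up array and compares the result with `MinMaxScaler`. It also shows why outliers are a problem: one extreme value squeezes the remaining points into a narrow band near 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column (made-up values), shaped (n_samples, 1) as scikit-learn expects
x = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (x - x.min()) / (x.max() - x.min())

# Same transformation via scikit-learn
sk = MinMaxScaler().fit_transform(x)
print(np.allclose(manual, sk))

# Replace the largest value with an outlier and rescale
x_out = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])
sk_out = MinMaxScaler().fit_transform(x_out)

# The four ordinary points are now compressed close to 0
print(sk_out.ravel())
```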


In [2]:

# Apply MinMaxScaler on bounded features (Spending_Score, Debt_to_Income_Ratio)
scaler_minmax = MinMaxScaler()
df_minmax = df.copy()
df_minmax[['Spending_Score', 'Debt_to_Income_Ratio']] = scaler_minmax.fit_transform(
    df_minmax[['Spending_Score', 'Debt_to_Income_Ratio']]
)
df_minmax.head()


Unnamed: 0,Customer_ID,Age,Annual_Income,Spending_Score,Debt_to_Income_Ratio,Transactions_Last_Month
0,CUST_001,56,42159,0.816327,0.818606,18
1,CUST_002,69,135510,0.071429,0.640087,19
2,CUST_003,46,131530,0.346939,0.883325,31
3,CUST_004,32,105077,0.346939,0.814061,6
4,CUST_005,60,60920,0.326531,0.183527,40



## 2️⃣ Mean Normalization (Centering by Mean)

**Formula:**
\( x' = \frac{x - x_{mean}}{x_{max} - x_{min}} \)

**Concept:**
- Centers values around **0** (by subtracting the mean).
- Keeps all values within **-1 and 1**, with the scaled maximum and minimum exactly 1 apart.

**Best used when:**
- You want data to be centered around zero, but still within a bounded range.
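
scikit-learn has no dedicated mean-normalization scaler, so one option is a small helper like the sketch below. The function name `mean_normalize` is hypothetical, and the ages are only the five rows shown by `df.head()`, so the numbers will differ from the full-dataset output.

```python
import pandas as pd

def mean_normalize(s: pd.Series) -> pd.Series:
    """Return (s - mean) / (max - min); a constant column maps to all zeros."""
    value_range = s.max() - s.min()
    if value_range == 0:
        # Avoid division by zero when every value is identical
        return s - s.mean()
    return (s - s.mean()) / value_range

# Ages from the first five rows of the dataset preview
ages = pd.Series([56, 69, 46, 32, 60], name="Age")
normalized = mean_normalize(ages)

print(normalized)
print("mean:", normalized.mean())                        # centered at 0 (up to float error)
print("spread:", normalized.max() - normalized.min())    # max minus min of the scaled values
```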


In [3]:

# Apply Mean Normalization manually on Age feature
df_norm = df.copy()

for col in ['Age']:
    mean_val = df_norm[col].mean()
    min_val = df_norm[col].min()
    max_val = df_norm[col].max()
    df_norm[col] = (df_norm[col] - mean_val) / (max_val - min_val)

df_norm.head()


Unnamed: 0,Customer_ID,Age,Annual_Income,Spending_Score,Debt_to_Income_Ratio,Transactions_Last_Month
0,CUST_001,0.250667,42159,81,0.665684,18
1,CUST_002,0.510667,135510,8,0.543383,19
2,CUST_003,0.050667,131530,35,0.710022,31
3,CUST_004,-0.229333,105077,35,0.66257,6
4,CUST_005,0.330667,60920,33,0.230599,40



## 3️⃣ Standard Scaler (Z-score Standardization)

**Formula:**
\( x' = \frac{x - x_{mean}}{\sigma} \)

where \( \sigma \) is the standard deviation of the feature

**Concept:**
- Centers data at **mean = 0** and scales it so that **standard deviation = 1**.
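- Commonly used for algorithms that are sensitive to feature scale (e.g., regularized Logistic Regression, PCA). These do not require normally distributed inputs, and standardizing does not change the shape of the distribution.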

**Best used when:**
- Data roughly follows a **Gaussian (normal)** distribution.
- Outliers are **not extreme**.
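
The sketch below reproduces the z-score formula by hand and compares it with `StandardScaler`. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`, the NumPy default), whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`). The input uses only the five `Transactions_Last_Month` values from the dataset preview, so the results differ from the full-dataset output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Transactions_Last_Month from the first five rows of the preview
x = np.array([[18.0], [19.0], [31.0], [6.0], [40.0]])

# Manual z-score: np.std uses ddof=0 by default, matching StandardScaler
manual = (x - x.mean()) / x.std()

sk = StandardScaler().fit_transform(x)

print(np.allclose(manual, sk))
print("mean:", sk.mean(), "std:", sk.std())
```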


In [4]:

# Apply StandardScaler on normally distributed feature (Transactions_Last_Month)
scaler_std = StandardScaler()
df_std = df.copy()
df_std[['Transactions_Last_Month']] = scaler_std.fit_transform(df_std[['Transactions_Last_Month']])
df_std.head()


Unnamed: 0,Customer_ID,Age,Annual_Income,Spending_Score,Debt_to_Income_Ratio,Transactions_Last_Month
0,CUST_001,56,42159,81,0.665684,-0.539258
1,CUST_002,69,135510,8,0.543383,-0.469376
2,CUST_003,46,131530,35,0.710022,0.369211
3,CUST_004,32,105077,35,0.66257,-1.377846
4,CUST_005,60,60920,33,0.230599,0.998152



## 4️⃣ Robust Scaler (Median and IQR based)

**Formula:**
\( x' = \frac{x - x_{median}}{IQR} \)

where \( IQR = Q_3 - Q_1 \)

**Concept:**
- Uses the **median** instead of the mean and **IQR** instead of standard deviation.
- Makes it **robust to outliers**.

**Best used when:**
- Data contains **outliers** (e.g., Income, Prices).
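
A minimal sketch on made-up values, comparing the manual formula with `RobustScaler` at its default quantile range of 25 to 75. Note that the scaler does not remove or clip the outlier; it only keeps the outlier from distorting the center and scale used for the other points.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column (made-up values) with one extreme outlier
x = np.array([[30.0], [40.0], [50.0], [60.0], [70.0], [1000.0]])

# Manual robust scaling: (x - median) / (Q3 - Q1)
q1, median, q3 = np.percentile(x, [25, 50, 75])
manual = (x - median) / (q3 - q1)

sk = RobustScaler().fit_transform(x)
print(np.allclose(manual, sk))

# The five ordinary points stay well spread out; the outlier stays far away
print(sk.ravel())
```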


In [5]:

# Apply RobustScaler on features with outliers (Annual_Income)
scaler_robust = RobustScaler()
df_robust = df.copy()
df_robust[['Annual_Income']] = scaler_robust.fit_transform(df_robust[['Annual_Income']])
df_robust.head()


Unnamed: 0,Customer_ID,Age,Annual_Income,Spending_Score,Debt_to_Income_Ratio,Transactions_Last_Month
0,CUST_001,56,-0.900838,81,0.665684,18
1,CUST_002,69,0.868019,8,0.543383,19
2,CUST_003,46,0.792604,35,0.710022,31
3,CUST_004,32,0.291361,35,0.66257,6
4,CUST_005,60,-0.545346,33,0.230599,40



## 🧾 Step 3: Summary of When to Use Which Scaler

| Scaler | Formula | Best For | Sensitive to Outliers |
|---------|----------|-----------|------------------------|
| Min-Max | (x - xmin) / (xmax - xmin) | Bounded features (e.g., 0–1 range) | ✅ Yes |
| Mean Normalization | (x - mean) / (xmax - xmin) | Centering around 0 | ✅ Yes |
| Standard Scaler | (x - mean) / std | Normally distributed features | ✅ Yes |
| Robust Scaler | (x - median) / IQR | Features with outliers | ❌ No |

---

## 🏁 Step 4: Key Takeaways

- **Always scale numeric features** before applying algorithms sensitive to feature magnitude (e.g., PCA, KNN, Logistic Regression).
- Choose the scaler based on the **data distribution** and **presence of outliers**.
- Use `ColumnTransformer` if different features need different scalers.
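
A minimal sketch of the `ColumnTransformer` approach, mirroring the scaler-to-column choices made in this notebook. The small `df_small` frame is rebuilt from the five preview rows so the example is self-contained; the step names (`"minmax"`, `"standard"`, `"robust"`) are arbitrary labels.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# First five rows of the dataset preview (ID columns omitted)
df_small = pd.DataFrame({
    "Age": [56, 69, 46, 32, 60],
    "Annual_Income": [42159, 135510, 131530, 105077, 60920],
    "Spending_Score": [81, 8, 35, 35, 33],
    "Debt_to_Income_Ratio": [0.665684, 0.543383, 0.710022, 0.66257, 0.230599],
    "Transactions_Last_Month": [18, 19, 31, 6, 40],
})

preprocessor = ColumnTransformer(
    transformers=[
        ("minmax", MinMaxScaler(), ["Spending_Score", "Debt_to_Income_Ratio"]),
        ("standard", StandardScaler(), ["Transactions_Last_Month"]),
        ("robust", RobustScaler(), ["Annual_Income"]),
    ],
    remainder="passthrough",  # columns not listed above (Age) are left unscaled
)

# Output columns follow transformer order, then the passthrough columns:
# Spending_Score, Debt_to_Income_Ratio, Transactions_Last_Month, Annual_Income, Age
scaled = preprocessor.fit_transform(df_small)
print(scaled.shape)
```

In a modeling workflow, the usual practice is to call `fit` on the training split only and reuse the fitted `preprocessor` with `transform` on validation and test data, so that test statistics do not leak into the scaling.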
