<a href="https://www.kaggle.com/code/angelchaudhary/scaling-techniques-comparison?scriptVersionId=292719320" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Scaling Techniques Compared: StandardScaler vs MinMaxScaler vs RobustScaler

## Introduction

Many machine learning algorithms, particularly gradient-based and distance-based ones, are sensitive to the scale of input features. When features span different ranges or contain extreme values (outliers), such models may converge slowly, assign misleading importance to features, or produce suboptimal results. Choosing an inappropriate scaling technique can therefore hurt both model performance and interpretability.

Although feature scaling is a fundamental preprocessing step, it is often applied without understanding how different scalers behave under varying data conditions. StandardScaler, MinMaxScaler, and RobustScaler each make different assumptions about data distribution and outliers.
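
To make those differing assumptions concrete, here is a minimal sketch of the three transformations written by hand in NumPy. The toy array is hypothetical and is not taken from the dataset; the formulas follow the documented defaults of the scikit-learn scalers (population standard deviation for StandardScaler, the 25th to 75th percentile range for RobustScaler).

```python
import numpy as np

# Toy feature with one extreme value (hypothetical data, not from the dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# StandardScaler: subtract the mean, divide by the population standard deviation.
standard = (x - x.mean()) / x.std()  # np.std defaults to ddof=0

# MinMaxScaler: map the observed minimum to 0 and the observed maximum to 1.
minmax = (x - x.min()) / (x.max() - x.min())

# RobustScaler: subtract the median, divide by the interquartile range (Q3 - Q1).
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

print("standard:", np.round(standard, 3))
print("minmax:  ", np.round(minmax, 3))
print("robust:  ", np.round(robust, 3))
```

In this toy case the single value of 100 inflates the mean and standard deviation and sets the min-max range, while the median and IQR are determined by the four ordinary values.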

This case study aims to provide a **clear, hands-on comparison** of these three widely used scaling techniques, highlighting:
- How each scaler transforms the data
- Their sensitivity to outliers
- Their impact on model performance

In this notebook, we follow a structured and experimental approach:
1. Select a real-world dataset whose features lie on different scales and contain outliers.
2. Apply **StandardScaler**, **MinMaxScaler**, and **RobustScaler** independently to the same dataset.
3. Visualize and analyze how each scaler transforms feature distributions.
4. Train identical machine learning models on the scaled data.
5. Compare model performance using consistent evaluation metrics.
6. Draw practical conclusions and recommendations for real-world use cases.

# LET'S DO IT!!!!
![FUNNY GIF](https://media.giphy.com/media/v1.Y2lkPTc5MGI3NjExZ2p1ejR6OGxlNXFyOGVmaThza2VubjZhazJoYTJjNzN6Mmk1ajF1NyZlcD12MV9naWZzX3NlYXJjaCZjdD1n/LHZyixOnHwDDy/giphy.gif)

## Dataset Overview

We'll use the **California Housing Prices Dataset**.

Why this dataset?
- All numerical features
- Strong variation in feature scales
- Presence of skewness and outliers
- Real-world regression problem

Target Variable:
- `median_house_value`

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("/kaggle/input/california-housing-prices-dataset/housing.csv")
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41 | 880 | 129.0 | 322 | 126 | 8.3252 | 452600 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21 | 7099 | 1106.0 | 2401 | 1138 | 8.3014 | 358500 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52 | 1467 | 190.0 | 496 | 177 | 7.2574 | 352100 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52 | 1274 | 235.0 | 558 | 219 | 5.6431 | 341300 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52 | 1627 | 280.0 | 565 | 259 | 3.8462 | 342200 | NEAR BAY |


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  int64  
 3   total_rooms         20640 non-null  int64  
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  int64  
 6   households          20640 non-null  int64  
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  int64  
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(4), int64(5), object(1)
memory usage: 1.6+ MB


In [3]:
df.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


In [4]:
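# Drop total_bedrooms (207 missing values) and ocean_proximity (categorical),
# leaving only complete numerical columns for the scaling comparison.
df = df.drop(columns=["total_bedrooms", "ocean_proximity"])
df.isna().sum()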

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
population            0
households            0
median_income         0
median_house_value    0
dtype: int64

In [5]:
X = df.drop(columns=["median_house_value"])
y = df["median_house_value"]

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Scaling Techniques Applied

Each scaler is fitted **only on training data** to avoid data leakage.

In [7]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

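standard_scaler = StandardScaler()
minmax_scaler = MinMaxScaler()
robust_scaler = RobustScaler()

# fit_transform learns each scaler's parameters from the training split only;
# the test split is transformed later with these same fitted scalers.
X_train_standard = standard_scaler.fit_transform(X_train)
X_train_minmax = minmax_scaler.fit_transform(X_train)
X_train_robust = robust_scaler.fit_transform(X_train)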

In [8]:
X_train_standard_df = pd.DataFrame(
    X_train_standard, columns=X_train.columns, index=X_train.index
)

X_train_minmax_df = pd.DataFrame(
    X_train_minmax, columns=X_train.columns, index=X_train.index
)

X_train_robust_df = pd.DataFrame(
    X_train_robust, columns=X_train.columns, index=X_train.index
)
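
Fitting on the training split only has a visible consequence: the test split is transformed with parameters it did not help determine. The sketch below illustrates this with plain NumPy on a synthetic skewed feature (the data, the 80/20 split, and all variable names are assumptions made for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature, split 80/20 into train and test.
feature = rng.lognormal(mean=1.0, sigma=0.6, size=1000)
train, test = feature[:800], feature[800:]

# Learn the scaling parameters from the training split only.
train_min, train_max = train.min(), train.max()
train_mean, train_std = train.mean(), train.std()

# Apply the *training* parameters to both splits.
train_minmax = (train - train_min) / (train_max - train_min)
test_minmax = (test - train_min) / (train_max - train_min)
test_standard = (test - train_mean) / train_std

print("train min-max range:", train_minmax.min(), train_minmax.max())
print("test  min-max range:", test_minmax.min(), test_minmax.max())
print("test standardized mean:", test_standard.mean())
```

The training values span exactly [0, 1] by construction. The test values need not: they can fall slightly outside that range, and their standardized mean is generally close to, but not exactly, zero. That is the expected behavior of a leakage-free setup.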

## Distribution Comparison: Median Income Feature

We compare how each scaler transforms the same feature (`median_income`) to understand differences in representation and outlier handling.

In [9]:
scaling_comparison = pd.DataFrame({
    "Original": X_train["median_income"].describe(),
    "StandardScaler": X_train_standard_df["median_income"].describe(),
    "MinMaxScaler": X_train_minmax_df["median_income"].describe(),
    "RobustScaler": X_train_robust_df["median_income"].describe()
})

scaling_comparison

| | Original | StandardScaler | MinMaxScaler | RobustScaler |
|---|---|---|---|---|
| count | 16512.0 | 16512.0 | 16512.0 | 16512.0 |
| mean | 3.880754 | -6.519333e-17 | 0.233159 | 0.1518051 |
| std | 1.904294 | 1.00003 | 0.131329 | 0.863048 |
| min | 0.4999 | -1.775438 | 0.0 | -1.380437 |
| 25% | 2.5667 | -0.6900689 | 0.142536 | -0.4437394 |
| 50% | 3.5458 | -0.1758995 | 0.210059 | 1.006411e-16 |
| 75% | 4.773175 | 0.4686502 | 0.294705 | 0.5562606 |
| max | 15.0001 | 5.839268 | 1.0 | 5.191221 |


## Observations Based on Scaling Output (Median Income)

- In the **original feature**, `median_income` ranges from about 0.5 to 15. The median (≈ 3.55) lies well below the midpoint of that range, and the maximum is far above the 75th percentile (≈ 4.77), which indicates a right-skewed distribution with a few high-end values.

- After applying **StandardScaler**:
  - The feature has **mean ≈ 0** and **standard deviation ≈ 1**, confirming correct standardization. The reported 1.00003 arises because `describe()` uses the sample standard deviation (`ddof=1`), whereas StandardScaler divides by the population standard deviation (`ddof=0`).
  - The maximum sits about 5.8 standard deviations above the mean, while the minimum is only about 1.8 below it. Extreme values are therefore **preserved**, not suppressed.

- With **MinMaxScaler**:
  - All training values are mapped into the **[0, 1] range**.
  - 75% of the values fall below ≈ 0.29, because the scaling depends entirely on the global minimum and maximum. A single extreme maximum compresses the bulk of the data into a narrow band.

- After applying **RobustScaler**:
  - The median is mapped to **0** (the reported 1.0e-16 is floating-point noise), and scaling is based on the **interquartile range (IQR)**, so the 25th and 75th percentiles are exactly one unit apart (≈ -0.44 and ≈ 0.56).
  - Extreme values have little influence on the scaling parameters, but they are not removed: the maximum still sits at ≈ 5.19.

- Overall, the table shows that while all three techniques rescale the feature, they differ significantly in **outlier handling, centering strategy, and scale compression**, which can influence downstream learning behavior.
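
The difference in outlier sensitivity can be checked directly by looking at the parameters each scaler learns before and after one extreme value is added. This is a minimal sketch on synthetic data; `scaler_parameters` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import numpy as np

def scaler_parameters(x):
    """Return the statistics each scaler would learn from a 1-D feature."""
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return {
        "standard (mean, std)": (x.mean(), x.std()),
        "minmax (min, max)": (x.min(), x.max()),
        "robust (median, IQR)": (median, q3 - q1),
    }

rng = np.random.default_rng(42)
clean = rng.normal(loc=4.0, scale=1.0, size=1000)  # hypothetical feature
contaminated = np.append(clean, 500.0)             # one extreme value added

before = scaler_parameters(clean)
after = scaler_parameters(contaminated)
for name in before:
    print(f"{name:22s} before={np.round(before[name], 3)} after={np.round(after[name], 3)}")
```

With a single contaminating point, the standard deviation and the maximum change substantially, while the median and IQR barely move.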

## Downstream Consistency Check

A simple regression model is used **only as an evaluation probe**
to verify whether different scaling strategies distort the learning signal.

In [10]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

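def evaluate_model(X_train_scaled, X_test_scaled):
    """Fit LinearRegression on scaled features and score it on the test split.

    Uses the y_train and y_test variables defined in the split cell above.
    """
    model = LinearRegression()
    model.fit(X_train_scaled, y_train)
    preds = model.predict(X_test_scaled)
    return {
        "MAE": mean_absolute_error(y_test, preds),
        "RMSE": np.sqrt(mean_squared_error(y_test, preds)),
        "R2": r2_score(y_test, preds)
    }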

In [11]:
X_test_standard = standard_scaler.transform(X_test)
X_test_minmax = minmax_scaler.transform(X_test)
X_test_robust = robust_scaler.transform(X_test)

results = pd.DataFrame({
    "StandardScaler": evaluate_model(X_train_standard, X_test_standard),
    "MinMaxScaler": evaluate_model(X_train_minmax, X_test_minmax),
    "RobustScaler": evaluate_model(X_train_robust, X_test_robust)
}).T

results

| | MAE | RMSE | R2 |
|---|---|---|---|
| StandardScaler | 51657.465162 | 70517.833856 | 0.620518 |
| MinMaxScaler | 51657.465162 | 70517.833856 | 0.620518 |
| RobustScaler | 51657.465162 | 70517.833856 | 0.620518 |


## Observations on Downstream Evaluation

- All three scaling techniques (**StandardScaler**, **MinMaxScaler**, and **RobustScaler**) result in **identical MAE, RMSE, and R² values** on the test set.

- This is expected rather than specific to this dataset: each scaler applies an invertible per-feature shift and rescale, and ordinary least squares with an intercept produces the same predictions under any such transformation. The fitted coefficients change to compensate; the predictions **do not**.

- The result reinforces that scaling primarily affects **data representation, numerical stability, and outlier handling**, rather than guaranteeing improvements in predictive performance.

- The identical metrics show that each scaler preserves the same information content. They also mean that an unregularized linear regression cannot reveal performance differences between scalers; models that use regularization, distance computations, or iterative gradient-based training can respond differently.

- This outcome aligns with the objective of the case study: to compare **how scaling techniques behave**, not to force metric-level improvements.
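
A likely explanation for the exactly matching metrics is that ordinary least squares is invariant to per-feature shifting and rescaling. The sketch below checks this on synthetic data using NumPy only, and contrasts it with a closed-form ridge regression, where the penalty makes the result depend on feature scale. The data, `alpha=10.0`, and the helper functions are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two features on very different scales plus noise.
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = 3.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(0, 0.1, 200)

def with_intercept(X):
    """Prepend a column of ones so the model can learn an intercept."""
    return np.column_stack([np.ones(len(X)), X])

def ols_predict(X, y):
    """Ordinary least squares with an intercept (in-sample predictions)."""
    A = with_intercept(X)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

def ridge_predict(X, y, alpha=10.0):
    """Closed-form ridge regression; the intercept is left unpenalized."""
    A = with_intercept(X)
    penalty = alpha * np.eye(A.shape[1])
    penalty[0, 0] = 0.0
    coef = np.linalg.solve(A.T @ A + penalty, A.T @ y)
    return A @ coef

# Two of the scalers, written by hand.
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

ols_same = np.allclose(ols_predict(X_standard, y), ols_predict(X_minmax, y))
ridge_same = np.allclose(ridge_predict(X_standard, y), ridge_predict(X_minmax, y))
print("OLS predictions identical:  ", ols_same)
print("Ridge predictions identical:", ridge_same)
```

The OLS predictions agree across scalers, while the ridge predictions do not, because the same penalty strength shrinks coefficients by different amounts when the features have different spreads.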

## What This Comparison Reveals

This case study demonstrates that scaling techniques primarily influence
how data is **represented and normalized**, rather than directly improving
predictive performance in all scenarios.

Although StandardScaler, MinMaxScaler, and RobustScaler transform feature
distributions differently, each applies only a per-feature shift and rescale,
so the linear relationships between features and the target variable are
preserved.

As a result, the performance of the linear regression probe is identical
across scalers, while feature representation and sensitivity to outliers vary.

## When to Use Which Scaling Technique

| Scenario | Recommended Scaler | Reason |
|--------|-------------------|--------|
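| Features are approximately normally distributed | StandardScaler | Centers each feature at mean 0 and scales it to unit variance |
| Input values must be bounded | MinMaxScaler | Maps training values to the fixed range [0, 1] |
| Dataset contains outliers | RobustScaler | Median and IQR are barely affected by extreme values |
| Distance-based models | StandardScaler / MinMaxScaler | Large-scale features dominate distances unless all features are on comparable scales |
| Skewed real-world data | RobustScaler | Scaling parameters stay stable when the tails are heavy |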
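
The scale sensitivity of distance-based models can be seen in a small example. The four points below are hypothetical (income, population) pairs, not rows from the dataset; the sketch compares Euclidean distances before and after a hand-written min-max scaling.

```python
import numpy as np

# Hypothetical districts as (median_income, population) pairs.
points = np.array([
    [2.0, 1000.0],    # A: reference point
    [9.0, 1010.0],    # B: very different income, nearly identical population
    [2.1, 1300.0],    # C: nearly identical income, somewhat larger population
    [15.0, 30000.0],  # D: extreme point that widens both ranges
])

def distances_to_b_and_c(X):
    """Euclidean distances from A (row 0) to B and C (rows 1 and 2)."""
    return np.linalg.norm(X[1:3] - X[0], axis=1)

raw = distances_to_b_and_c(points)

# Min-max scale each column using the ranges of this toy sample.
col_min, col_max = points.min(axis=0), points.max(axis=0)
scaled = distances_to_b_and_c((points - col_min) / (col_max - col_min))

print("raw distances    A-B, A-C:", np.round(raw, 3))
print("scaled distances A-B, A-C:", np.round(scaled, 3))
```

On the raw values, B looks closer to A than C does, because the population column dominates the distance. After scaling, the ordering flips and C becomes the nearer neighbor, which is why nearest-neighbor and clustering methods usually need scaled inputs.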


## Limitations

- Only one dataset was analyzed.
- A single downstream model was used as an evaluation probe.
- Effects of scaling on regularized or distance-based models were not explored.
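
One way to address the last two limitations would be to repeat the comparison with a regularized and a distance-based model. The following is a sketch under stated assumptions: it uses synthetic data as a stand-in for the housing features, and the choice of `Ridge(alpha=1.0)`, `KNeighborsRegressor(n_neighbors=5)`, and 5-fold cross-validation is illustrative rather than something evaluated in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for skewed features on different scales (an assumption).
X = np.column_stack([rng.lognormal(1.0, 0.5, 500), rng.lognormal(7.0, 1.0, 500)])
y = 40_000 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 10_000, 500)

scalers = {"standard": StandardScaler(), "minmax": MinMaxScaler(), "robust": RobustScaler()}
models = {"ridge": Ridge(alpha=1.0), "knn": KNeighborsRegressor(n_neighbors=5)}

scores = {}
for scaler_name, scaler in scalers.items():
    for model_name, model in models.items():
        # The pipeline refits the scaler inside every CV fold, avoiding leakage.
        pipeline = make_pipeline(scaler, model)
        r2 = cross_val_score(pipeline, X, y, cv=5, scoring="r2")
        scores[(scaler_name, model_name)] = r2.mean()

for key, value in scores.items():
    print(key, round(value, 4))
```

Wrapping the scaler and model in a pipeline keeps the fit-on-training-only rule intact during cross-validation. Any differences that appear on synthetic data should be confirmed on the real dataset before drawing conclusions.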

## Final Conclusion

This case study highlights that choosing a scaling technique is a **data-driven
decision**, not a performance shortcut.

StandardScaler, MinMaxScaler, and RobustScaler each serve distinct purposes,
and understanding their behavior is essential for building reliable and
interpretable machine learning pipelines.