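# 🔹 What is Standardization?

Standardization rescales each numeric feature so that it has a mean of 0 and a standard deviation of 1. Each value x becomes z = (x − mean) / std, where the mean and standard deviation are computed separately for each column.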
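To make the idea concrete, here is a minimal sketch that standardizes one column by hand: subtract the column mean, then divide by the column standard deviation. The `ages` values are made up for illustration and are not taken from any dataset.

```python
import numpy as np

# Made-up ages, purely for illustration
ages = np.array([20.0, 30.0, 40.0, 50.0, 60.0])

mean = ages.mean()
std = ages.std()  # population std (ddof=0), the same convention StandardScaler uses

# z-score: how many standard deviations each value is from the mean
z = (ages - mean) / std

print("z-scores:", z)
print("mean of z:", z.mean())
print("std of z:", z.std())
```

After this transformation the column should have a mean of (approximately) 0 and a standard deviation of 1, whatever its original units were.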


# 🔹 Why is Standardization Important?

Think of it like this:
Suppose you’re training a model using:

- Age (20 to 60)
- Salary (₹20,000 to ₹2,00,000)
- Experience (0 to 30)

Even if age and experience are just as important, a model that relies on distances or gradients may give more weight to salary simply because its numbers are much larger.

📌 Standardization fixes this problem by putting every feature on the same scale: mean 0 and standard deviation 1.
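The effect is easy to see with Euclidean distance. The sketch below uses five hypothetical people (all numbers are invented for illustration) and compares which person is "closest" to person A before and after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical people: [age, salary, experience]
people = np.array([
    [25.0,  50_000.0,  2.0],   # A
    [55.0,  52_000.0, 30.0],   # B: very different age/experience, similar salary
    [26.0,  90_000.0,  3.0],   # C: nearly the same age/experience as A, higher salary
    [40.0, 150_000.0, 15.0],
    [33.0,  30_000.0,  8.0],
])

def dist(x, i, j):
    """Euclidean distance between rows i and j of x."""
    return np.linalg.norm(x[i] - x[j])

# On raw data, the salary column dominates the distance
raw_ab, raw_ac = dist(people, 0, 1), dist(people, 0, 2)

# After standardization, all three features contribute comparably
scaled = StandardScaler().fit_transform(people)
scaled_ab, scaled_ac = dist(scaled, 0, 1), dist(scaled, 0, 2)

print(f"raw:    A-B = {raw_ab:.1f}, A-C = {raw_ac:.1f}")
print(f"scaled: A-B = {scaled_ab:.2f}, A-C = {scaled_ac:.2f}")
```

On the raw data, A looks closer to B because only the salary gap matters. After scaling, A is closer to C, who has almost the same age and experience.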



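# 🔹 When to Use Standardization?
Use it for models that are sensitive to feature scale, for example:

K-Nearest Neighbors and K-Means

Support Vector Machines

Linear and Logistic Regression (especially with regularization)

Neural Networks

PCA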


# 🔹 When NOT to Use Standardization?
You usually don’t need it for:

Tree-based models, because they split on one feature at a time and only the ordering of the values matters, not their scale:

Decision Tree

Random Forest

XGBoost

LightGBM
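As a rough check of this claim, the sketch below trains the same decision tree on raw and on standardized versions of a synthetic dataset. The data and the labeling rule are made up for illustration, and this is one small experiment, not a proof.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: two features on very different scales
rng = np.random.default_rng(0)
n = 200
age = rng.integers(20, 61, size=n).astype(float)
salary = rng.integers(20_000, 200_001, size=n).astype(float)
X = np.column_stack([age, salary])

# Made-up labeling rule that uses both features
y = ((age > 40) & (salary > 80_000)).astype(int)

X_scaled = StandardScaler().fit_transform(X)

# Same model settings, trained once on raw and once on scaled features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

pred_raw = tree_raw.predict(X)
pred_scaled = tree_scaled.predict(X_scaled)

same = np.array_equal(pred_raw, pred_scaled)
print("Identical predictions:", same)
```

Standardization is a monotonic transformation of each column, so the tree can find the same partitions of the data either way; only the threshold values change.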

# 🔹 Benefits of Standardization
✅ Makes training faster and more stable

✅ Helps models converge better

✅ Prevents features with large values from dominating

✅ Needed in practice by many algorithms that rely on distance or gradients




# Step 1: Import Libraries

In [7]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler


# Step 2: Load Titanic Dataset
We’ll use Titanic data from seaborn (easy to load):

In [2]:
import seaborn as sns

titanic = sns.load_dataset('titanic')
titanic.head()


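| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |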


# Step 3: Select Numeric Columns
We want to standardize numeric columns only:

In [5]:
numeric_features = ['age', 'fare', 'sibsp', 'parch']
titanic_numeric = titanic[numeric_features]
titanic_numeric.head()


Explanation:

age, fare, sibsp (siblings/spouses aboard), and parch (parents/children aboard) are numeric columns.

We extract these for standardization.



| | age | fare | sibsp | parch |
|---|---|---|---|---|
| 0 | 22.0 | 7.25 | 1 | 0 |
| 1 | 38.0 | 71.2833 | 1 | 0 |
| 2 | 26.0 | 7.925 | 0 | 0 |
| 3 | 35.0 | 53.1 | 1 | 0 |
| 4 | 35.0 | 8.05 | 0 | 0 |
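Instead of typing the column names by hand, pandas can list the numeric columns for you. The sketch below uses a tiny hand-built DataFrame (three rows copied from the preview above, with only four of the columns) so it runs without downloading anything. Note that a numeric dtype does not automatically mean a column should be scaled: a label such as `survived` is numeric but is not a feature.

```python
import pandas as pd

# Small stand-in for the Titanic data (a few columns, three rows)
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "fare": [7.25, 71.2833, 7.925],
    "sex": ["male", "female", "female"],
    "survived": [0, 1, 1],
})

# All columns with a numeric dtype (int or float)
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Drop the target column: it is numeric, but it should not be standardized
features_to_scale = [c for c in numeric_cols if c != "survived"]
print(features_to_scale)
```

For the walkthrough here, the explicit list `['age', 'fare', 'sibsp', 'parch']` is kept because it makes the choice of features obvious.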


# Step 4: Handle Missing Values
age has missing values (177 of 891 rows), so we check for them and then fill them with the column mean:

In [8]:
titanic.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [19]:
titanic_numeric = titanic[numeric_features].copy()
titanic_numeric['age'] = titanic_numeric['age'].fillna(titanic_numeric['age'].mean())

In [23]:
titanic_numeric.isnull().sum()

age      0
fare     0
sibsp    0
parch    0
dtype: int64
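The same mean fill can also be done with scikit-learn's `SimpleImputer`, which is handy when the fill value needs to be learned once and reused on new data. This is a minimal sketch on made-up ages, offered as an alternative rather than as the approach this notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ages with two missing values
ages = pd.DataFrame({"age": [22.0, np.nan, 26.0, 35.0, np.nan]})

# pandas approach, as used above
filled_pandas = ages["age"].fillna(ages["age"].mean())

# scikit-learn approach: fit learns the mean, transform fills the gaps
imputer = SimpleImputer(strategy="mean")
filled_sklearn = imputer.fit_transform(ages)[:, 0]

print("Learned fill value:", imputer.statistics_)
print("Same result:", np.allclose(filled_pandas.to_numpy(), filled_sklearn))
```

Both approaches replace each missing age with the mean of the observed ages.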

# Step 5: Check Data Before Standardization


In [24]:
print(titanic_numeric.describe())


              age        fare       sibsp       parch
count  891.000000  891.000000  891.000000  891.000000
mean    29.699118   32.204208    0.523008    0.381594
std     13.002015   49.693429    1.102743    0.806057
min      0.420000    0.000000    0.000000    0.000000
25%     22.000000    7.910400    0.000000    0.000000
50%     29.699118   14.454200    0.000000    0.000000
75%     35.000000   31.000000    1.000000    0.000000
max     80.000000  512.329200    8.000000    6.000000


# Step 6: Apply StandardScaler


In [28]:
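scaler = StandardScaler()
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
titanic_scaled = scaler.fit_transform(titanic_numeric)
titanic_scaled_df = pd.DataFrame(titanic_scaled, columns=numeric_features)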
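When the data is later split into training and test sets, a common practice is to fit the scaler on the training set only and reuse its learned mean and standard deviation on the test set. The sketch below shows that pattern on random stand-in data (not the Titanic columns); the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in data: two features on different scales
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.uniform(0, 80, size=100),
    rng.uniform(0, 500, size=100),
])

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuse them; do not refit

print("Learned means:", scaler.mean_)
print("Learned stds: ", scaler.scale_)

# transform() is just (x - mean) / std using the training statistics
manual = (X_test - scaler.mean_) / scaler.scale_
print("Matches manual formula:", np.allclose(manual, X_test_scaled))
```

Fitting on the training set only keeps information about the test set from leaking into preprocessing.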

# Step 7: Check Data After Standardization

In [29]:
print(titanic_scaled_df.describe())


                age          fare         sibsp         parch
count  8.910000e+02  8.910000e+02  8.910000e+02  8.910000e+02
mean   2.232906e-16  3.987333e-18  4.386066e-17  5.382900e-17
std    1.000562e+00  1.000562e+00  1.000562e+00  1.000562e+00
min   -2.253155e+00 -6.484217e-01 -4.745452e-01 -4.736736e-01
25%   -5.924806e-01 -4.891482e-01 -4.745452e-01 -4.736736e-01
50%    0.000000e+00 -3.573909e-01 -4.745452e-01 -4.736736e-01
75%    4.079260e-01 -2.424635e-02  4.327934e-01 -4.736736e-01
max    3.870872e+00  9.667167e+00  6.784163e+00  6.974147e+00


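# What You Will See:
Before scaling: each column has a different mean and standard deviation (for example, fare has mean 32.2 and std 49.7, while parch has mean 0.38 and std 0.81).

After scaling: every mean is effectively 0 (values like 2.2e-16 are floating-point rounding) and every std is close to 1. The std shows as 1.000562 rather than exactly 1 because describe() uses the sample standard deviation (dividing by n − 1), while StandardScaler uses the population standard deviation (dividing by n).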
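The small gap between 1 and 1.000562 can be reproduced directly. The sketch below uses random stand-in data with the same number of rows (891) and compares the two standard deviation conventions; the column name `x` and the random values are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Random stand-in column with the same row count as the Titanic data
rng = np.random.default_rng(0)
n = 891
df = pd.DataFrame({"x": rng.normal(30, 13, size=n)})

scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

pop_std = scaled["x"].std(ddof=0)     # divides by n, like StandardScaler
sample_std = scaled["x"].std(ddof=1)  # divides by n - 1, like describe()
expected = np.sqrt(n / (n - 1))       # ratio between the two conventions

print("population std:", pop_std)
print("sample std:    ", sample_std)
print("sqrt(n/(n-1)): ", expected)
```

The sample std equals sqrt(n / (n − 1)) for any standardized column with n rows, which is why all four columns in the output above show the same std.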

