In [74]:
import pandas as pd
df=pd.read_csv("sample_dataset.csv")
X=df.iloc[:,0:3] # Select the first 3 columns
X # you can now see we have 569 rows

Unnamed: 0,mean radius,mean texture,mean perimeter
0,,10.38,122.80
1,20.57,17.77,132.90
2,19.69,21.25,130.00
3,11.42,20.38,77.58
4,20.29,14.34,
...,...,...,...
564,21.56,22.39,142.00
565,,28.25,131.20
566,16.60,28.08,108.30
567,20.60,29.33,140.10


In [75]:
X=df.iloc[:,0:3].dropna() # Select the first 3 columns and drop rows that contain missing values (NaN)
X # you can now see that we have 385 rows

Unnamed: 0,mean radius,mean texture,mean perimeter
1,20.57,17.77,132.90
2,19.69,21.25,130.00
3,11.42,20.38,77.58
5,12.45,15.70,82.57
7,13.71,20.83,90.20
...,...,...,...
562,15.22,30.62,103.40
563,20.92,25.09,143.00
564,21.56,22.39,142.00
566,16.60,28.08,108.30
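
To see which columns cause rows to be dropped, the missing values can be counted before calling dropna(). The sketch below is only an illustration: `demo` is a hypothetical stand-in built from the first five rows shown above, not the full sample_dataset.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the three selected columns (first five rows shown above)
demo = pd.DataFrame({
    "mean radius": [np.nan, 20.57, 19.69, 11.42, 20.29],
    "mean texture": [10.38, 17.77, 21.25, 20.38, 14.34],
    "mean perimeter": [122.80, 132.90, 130.00, 77.58, np.nan],
})

missing_per_column = demo.isna().sum()   # number of NaN values in each column
clean = demo.dropna()                    # keep only rows without any NaN
rows_dropped = len(demo) - len(clean)

print(missing_per_column)
print(f"rows before: {len(demo)}, after: {len(clean)}, dropped: {rows_dropped}")
```

Note that dropna() keeps the original index labels, which is why the cleaned frame above skips labels such as 0 and 4.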


# Normalization
Min-max scaling rescales each feature to the range 0 to 1.

In [76]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

scaler=MinMaxScaler()
X_scaled=scaler.fit_transform(X)
X_scaled # values of the 3 columns are now scaled to the range 0 to 1

array([[0.63004759, 0.27257355, 0.60432679],
       [0.58687012, 0.3902604 , 0.58368915],
       [0.18110004, 0.36083869, 0.21064617],
       ...,
       [0.67862225, 0.42881299, 0.66908625],
       [0.43525833, 0.62123774, 0.42926274],
       [0.63151955, 0.66351031, 0.65556504]], shape=(385, 3))

In [77]:
"""
For each column, find the maximum value

If X_scaled was min-max scaled (values normalized to [0,1] range):

The maximum value in each column will be exactly 1.0

Result is a 1×3 array: [[max_of_col1, max_of_col2, max_of_col3]]"""


In [78]:
np.apply_over_axes(np.max,X_scaled,0) # Max values → 1

array([[1., 1., 1.]])

In [79]:
'''
For each column, find the minimum value

With min-max scaling, the minimum is always 0.0

Result is [[min_of_col1, min_of_col2, min_of_col3]]

'''


In [80]:
np.apply_over_axes(np.min,X_scaled,0) # Min values → 0

array([[0., 0., 0.]])
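
Min-max scaling maps each column with (x - min) / (max - min), which is why the column minima become 0 and the maxima become 1. As a minimal sketch, the formula can be applied by hand and compared with MinMaxScaler; `X_demo` is a hypothetical toy array, not the dataset used above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: 4 samples, 2 features
X_demo = np.array([[1.0, 10.0],
                   [2.0, 30.0],
                   [3.0, 20.0],
                   [5.0, 50.0]])

col_min = X_demo.min(axis=0)   # per-column minimum
col_max = X_demo.max(axis=0)   # per-column maximum
X_manual = (X_demo - col_min) / (col_max - col_min)

X_sklearn = MinMaxScaler().fit_transform(X_demo)
print(np.allclose(X_manual, X_sklearn))
```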

# Standardization

Standardization (also called Z-score normalization) rescales each feature so that it has:

- Mean = 0

- Standard Deviation = 1

It only shifts and rescales the values, so the shape of the distribution is unchanged. The result follows a standard normal distribution (μ=0, σ=1) only if the feature was normally distributed to begin with.

For each feature (column), standardization is calculated as:

$$X_{\text{standardized}} = \frac{X - \mu}{\sigma}$$

where,
μ: Mean of the feature
σ: Standard deviation of the feature



In [81]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X_scaled=scaler.fit_transform(X)

#  A new array X_scaled where each column has: Mean = 0; Std Dev = 1
X_scaled

array([[ 1.87149535, -0.35527432,  1.7252216 ],
       [ 1.61663638,  0.44774298,  1.6034021 ],
       [-0.77845877,  0.24698866, -0.59859036],
       ...,
       [ 2.1582117 ,  0.71080038,  2.10748279],
       [ 0.72173384,  2.02377982,  0.6918562 ],
       [ 1.88018373,  2.31221994,  2.02767001]], shape=(385, 3))

In [82]:
# Compute the mean of each column (axis 0) of X_scaled
np.apply_over_axes(np.mean,X_scaled,0)

# After standardization each column mean is 0 up to floating-point error (the values below are around 1e-16)

array([[-7.38226219e-17, -2.86062660e-16, -1.84556555e-17]])

In [83]:
# Compute the variance of each column of X_scaled
np.apply_over_axes(np.var,X_scaled,0) # Since X_scaled is standardized, the variance of each column is 1
# Variance = σ^2 = 1

array([[1., 1., 1.]])
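
The standardization formula can also be checked by hand. The sketch below uses a hypothetical toy array `X_demo` and relies on the fact that StandardScaler divides by the population standard deviation (ddof=0), which is also NumPy's default for std and var.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 4 samples, 2 features
X_demo = np.array([[1.0, 10.0],
                   [2.0, 30.0],
                   [3.0, 20.0],
                   [5.0, 50.0]])

mu = X_demo.mean(axis=0)      # per-column mean
sigma = X_demo.std(axis=0)    # per-column population std (ddof=0)
X_manual = (X_demo - mu) / sigma

X_sklearn = StandardScaler().fit_transform(X_demo)
print(np.allclose(X_manual, X_sklearn))
```

If a sample standard deviation (ddof=1) were used instead, the result would differ slightly from StandardScaler and the column variances reported by np.var would no longer be exactly 1.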

# Robust Scaling

Robust scaling is a data preprocessing technique that scales features using statistics that are resistant to outliers (unlike StandardScaler, which is sensitive to extreme values). It centers and scales data using the median and interquartile range (IQR) instead of the mean and standard deviation.

Centering: Subtract the median (50th percentile) from each feature.

X_centered=X−median(X)

Scaling: Divide by the IQR (75th percentile - 25th percentile).

X_scaled = (X − median(X))/IQR(X)

 


In [84]:
from sklearn.preprocessing import RobustScaler
scaler=RobustScaler()
X_scaled=scaler.fit_transform(X)
X_scaled

array([[ 1.78484108, -0.18650089,  1.64689365],
       [ 1.56968215,  0.43161634,  1.54510355],
       [-0.45232274,  0.27708703, -0.29484029],
       ...,
       [ 2.02689487,  0.63410302,  1.96630397],
       [ 0.81418093,  1.64476021,  0.78343278],
       [ 1.79217604,  1.86678508,  1.8996139 ]], shape=(385, 3))

In [85]:
np.apply_over_axes(np.median,X_scaled,0) # After robust scaling, the median of each column is 0

array([[0., 0., 0.]])
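
The centering and scaling steps can be reproduced with NumPy percentiles. This is a minimal sketch on a hypothetical toy array `X_demo`, assuming RobustScaler's default settings (median centering and the 25th to 75th percentile range).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy data: 4 samples, 2 features
X_demo = np.array([[1.0, 10.0],
                   [2.0, 30.0],
                   [3.0, 20.0],
                   [5.0, 50.0]])

# 25th, 50th (median) and 75th percentile of each column
q25, q50, q75 = np.percentile(X_demo, [25, 50, 75], axis=0)
iqr = q75 - q25
X_manual = (X_demo - q50) / iqr

X_sklearn = RobustScaler().fit_transform(X_demo)
print(np.allclose(X_manual, X_sklearn))
```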

# Difference between RobustScaler and StandardScaler
RobustScaler uses the median for centering and the interquartile range (IQR) for scaling, making it resistant to outliers. It’s ideal for skewed data or datasets with extreme values.

StandardScaler uses the mean for centering and the standard deviation for scaling, which works well for roughly Gaussian features. It’s sensitive to outliers, as they distort the mean and standard deviation.

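Output: Both transform data to comparable scales using a linear shift and rescale, so neither changes the shape of a feature's distribution. With RobustScaler, outliers have little effect on how the bulk of the data is scaled, while StandardScaler works best for roughly normally distributed features without extreme values.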

Use Cases:
-----------

RobustScaler: Outlier-prone data (e.g., finance, sensor readings).

StandardScaler: Clean, roughly Gaussian data, typically ahead of methods such as PCA or linear models.
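
The difference in outlier sensitivity can be illustrated with a single feature that contains one extreme value. The data below is hypothetical and chosen only to make the effect visible; the sketch compares how far apart the nine ordinary values end up after each scaler.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical single feature: nine ordinary values plus one extreme outlier
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

x_std = StandardScaler().fit_transform(x)
x_rob = RobustScaler().fit_transform(x)

# Spread (max - min) of the nine ordinary values after each transformation
inlier_spread_std = np.ptp(x_std[:-1])
inlier_spread_rob = np.ptp(x_rob[:-1])

print(f"StandardScaler inlier spread: {inlier_spread_std:.3f}")
print(f"RobustScaler inlier spread:   {inlier_spread_rob:.3f}")
```

With StandardScaler the outlier inflates the standard deviation, so the ordinary values are squeezed into a very narrow band. With RobustScaler the median and IQR barely react to the outlier, so the ordinary values keep a usable spread.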

