# Feature Standardization


In [1]:
%store -r df boxcox_df quantitative_features disc


Feature standardization is a preprocessing technique that rescales the numerical features of a dataset to a common scale, so that no feature dominates the analysis or modeling simply because of its scale or units.  
We will use **standardization (Z-score normalization)**.  
This method transforms each feature to have a mean of 0 and a standard deviation of 1 by subtracting the feature's mean from each value and then dividing by the feature's standard deviation: $X_{\text{std}} = \frac{X - \mu}{\sigma}$  
where $X$ is the original value of the feature, $\mu$ is the mean of the feature, and $\sigma$ is its standard deviation.
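
As a minimal sketch of the formula above, the snippet below standardizes a small made-up DataFrame. The data and the names `toy`, `mu`, `sigma`, and `toy_std` are illustrative assumptions, not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0, 90.0],
})

mu = toy.mean()    # per-column mean
sigma = toy.std()  # per-column sample standard deviation (pandas default, ddof=1)

# Z-score: subtract the column mean, then divide by the column standard deviation.
toy_std = (toy - mu) / sigma

# After standardization each column should have mean ~0 and standard deviation ~1.
print(toy_std.mean())
print(toy_std.std())
```

Because pandas aligns the `mu` and `sigma` Series with the DataFrame columns by label, the same expression works for any number of numeric columns.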


## Standardizing the Transformed Data


In [2]:
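# Standardize the Box-Cox-transformed features: subtract the mean, divide by the standard deviation.
# `disc` is assumed to hold describe()-style statistics (row 1 = mean, row 2 = std).
standard_df = (
    boxcox_df[quantitative_features] - disc[quantitative_features].iloc[1, :]
) / disc[quantitative_features].iloc[2, :]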
# Take a look at the data after standardization
standard_df.head()


index,Age,Diastolic BP,Poverty index,Red blood cells,Sedimentation rate,Serum Albumin,Serum Cholesterol,Serum Iron,Serum Magnesium,Serum Protein,Systolic BP,TIBC,TS,White blood cells,BMI,Pulse pressure
545,0.020079,-0.120171,-2.304402,0.145336,1.40872,0.023742,0.284585,-0.895537,-0.961797,2.021751,-0.444789,1.903896,-1.442828,0.51689,1.486775,-0.513445
547,0.610437,0.054894,-0.800596,-0.884622,0.878106,0.023742,0.734657,-0.044703,-1.041311,1.399674,0.984337,0.111204,-0.119809,-0.623719,1.7687,1.239914
548,1.483692,-0.120171,-1.580777,-0.003111,-0.499084,0.023742,-1.879487,1.045193,1.010789,0.971768,0.644426,-0.882278,1.452263,-2.30605,0.57979,0.921631
550,-0.510027,0.738768,0.202758,-0.08868,-1.195601,0.023742,0.039622,0.177887,0.932419,-0.148378,-0.444789,0.839278,-0.20792,-0.623719,0.706977,-1.391574
552,-0.304107,-0.565642,0.202758,-1.39585,0.564484,2.068353,-0.096904,0.455878,0.0675,-0.6183,-0.731346,1.934171,-0.397883,-0.686985,0.449038,-0.513445


## Standardizing the Original Data


In [3]:
stats = df[quantitative_features].describe()
stats


statistic,Age,Diastolic BP,Poverty index,Red blood cells,Sedimentation rate,Serum Albumin,Serum Cholesterol,Serum Iron,Serum Magnesium,Serum Protein,Systolic BP,TIBC,TS,White blood cells,BMI,Pulse pressure
count,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0,5384.0
mean,48.320765,81.691679,254.689264,47.489413,14.307764,4.391902,217.752303,98.283804,1.681909,7.07734,130.502229,360.033247,27.758544,7.293258,25.119266,48.81055
std,15.899117,11.443893,149.687885,4.705737,9.841914,0.295683,44.254173,31.123117,0.126697,0.436487,20.397663,50.703058,9.178221,1.787862,4.219595,15.010188
min,25.0,50.0,2.0,32.4,1.0,3.6,94.0,18.0,1.33,6.0,80.0,221.0,4.3,2.5,14.227712,12.0
25%,34.0,74.0,140.0,44.2,7.0,4.2,186.0,76.0,1.6,6.8,116.0,324.0,21.2,6.0,22.021129,38.0
50%,46.0,80.0,233.0,47.2,12.0,4.4,214.0,95.0,1.68,7.1,128.0,355.0,27.2,7.1,24.729935,46.0
75%,65.0,90.0,347.0,50.5,20.0,4.6,246.0,118.0,1.77,7.4,142.0,392.0,33.8,8.5,27.771702,58.0
max,74.0,114.0,714.0,65.6,44.0,5.2,342.0,190.0,2.04,8.3,190.0,501.0,52.8,12.5,37.053958,92.0
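
As a worked example of the formula, take the `Age` column from the summary above: with $\mu = 48.320765$ and $\sigma = 15.899117$, the maximum observed age of 74 standardizes to

$$X_{\text{std}} = \frac{74 - 48.320765}{15.899117} \approx 1.6151$$

so the oldest participants sit roughly 1.6 standard deviations above the mean age.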


In [4]:
# Standardize the original (untransformed) features, selecting the statistics by label
standard_original = (df[quantitative_features] - stats.loc["mean"]) / stats.loc["std"]
standard_original


index,Age,Diastolic BP,Poverty index,Red blood cells,Sedimentation rate,Serum Albumin,Serum Cholesterol,Serum Iron,Serum Magnesium,Serum Protein,Systolic BP,TIBC,TS,White blood cells,BMI,Pulse pressure
545,-0.083072,-0.147824,-1.534455,0.108503,1.594429,0.027388,0.231565,-0.908772,-0.962215,2.113829,-0.514874,2.089948,-1.390089,0.451233,1.582671,-0.586971
547,0.545894,0.026942,-0.839676,-0.890278,0.781579,0.027388,0.706096,-0.105510,-1.041144,1.426524,0.955883,0.038790,-0.169809,-0.667422,1.956085,1.278428
548,1.615136,-0.147824,-1.280593,-0.040252,-0.640908,0.027388,-1.732092,1.051186,1.011004,0.968321,0.563681,-0.888176,1.508076,-1.953875,0.507326,0.878700
550,-0.586244,0.726005,0.022118,-0.125254,-1.047333,0.027388,-0.017000,0.119403,0.932075,-0.177188,-0.514874,0.807974,-0.256972,-0.667422,0.647354,-1.253186
552,-0.397554,-0.584738,0.022118,-1.357792,0.375154,2.056589,-0.152580,0.408577,0.063858,-0.635391,-0.760000,2.129393,-0.442193,-0.723355,0.366784,-0.586971
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8573,-1.215210,-0.672121,0.643410,0.724772,0.375154,0.365588,-0.376288,-2.226120,-0.567572,1.884728,-0.809025,0.176848,-2.163659,-0.331825,0.166999,-0.586971
8574,-1.403900,-0.497355,1.578690,-0.189006,-0.539302,1.041988,-1.824739,-0.266162,-2.304004,0.739219,-1.299278,-0.868453,0.080784,1.905484,-0.865121,-1.386428
8575,1.300653,0.026942,0.409591,0.108503,0.375154,1.041988,1.881126,-0.009119,2.668508,-0.864493,-0.024622,-0.513445,0.167947,-0.108094,-0.480861,-0.054000
8576,-0.900727,-0.147824,-0.452203,-0.040252,-0.437696,0.365588,-1.282417,-0.105510,-1.120073,-0.406290,-1.005126,0.413520,-0.311449,-1.058951,0.160777,-1.253186
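
Note that `describe()` reports the sample standard deviation (`ddof=1`), so the values above are scaled to a sample standard deviation of 1. Some tools, such as scikit-learn's `StandardScaler`, divide by the population standard deviation (`ddof=0`) instead. The sketch below compares the two conventions on synthetic data; the DataFrame `demo` and every other name in it are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (not the dataset used in this notebook).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "a": rng.normal(50, 15, 1000),
    "b": rng.normal(7, 0.4, 1000),
})

# Convention used in this notebook: mean and sample std (ddof=1) from describe().
stats_demo = demo.describe()
z_sample = (demo - stats_demo.loc["mean"]) / stats_demo.loc["std"]

# Alternative convention: population std (ddof=0).
z_pop = (demo - demo.mean()) / demo.std(ddof=0)

# The two results differ only by the constant factor sqrt((n - 1) / n).
n = len(demo)
ratio = (z_sample / z_pop).iloc[0]
print(ratio)
print(np.sqrt((n - 1) / n))
```

With several thousand rows, as in this dataset, the factor is very close to 1, so the choice of convention has a negligible effect on the standardized values.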


In [5]:
%store standard_df

Stored 'standard_df' (DataFrame)
