#### Standardization

- Standardization is a feature scaling technique in which each value x is replaced by z = (x - mean) / standard deviation, so the transformed data has mean = 0 and standard deviation = 1
- Also called Z-score normalization
- If the z-score of a data point is 1.15, the data point lies 1.15 standard deviations above the mean of its column
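
The z-score definition can be checked by hand with NumPy. This is a minimal sketch on a made-up column (the `values` array is hypothetical and not taken from the dataset used in this notebook); it uses the population standard deviation, which is the convention `StandardScaler` follows.

```python
import numpy as np

# Hypothetical sample column, used only for illustration
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

mean = values.mean()
std = values.std()  # population standard deviation (ddof=0)

# z-score: how many standard deviations each value lies from the mean
z_scores = (values - mean) / std

print(z_scores)
print(z_scores.mean(), z_scores.std())
```

After the transformation, the mean of `z_scores` should be 0 and its standard deviation 1, up to floating-point rounding.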

- StandardScaler() from sklearn is used to do standardization
- Standardization is usually a good fit when the data roughly follows a Gaussian distribution
- Always split your data into train and test sets before applying standardization, so that the mean and standard deviation are learned from the training set only

In [30]:
# importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv("data.csv")
df.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [16]:
X = df[["Age","EstimatedSalary"]]
y = df["Purchased"]

In [17]:
X.shape

(400, 2)

In [18]:
y.shape

(400,)

In [19]:
# Train test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

print(X_train.shape)
print(X_test.shape)



(280, 2)
(120, 2)


In [20]:
# Standard Scaler

scaler = StandardScaler()

# let the scaler learn the mean and standard deviation from the training set only

scaler.fit(X_train)

# transform X_train and X_test using the statistics learned from X_train

X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)

In [21]:
scaler.mean_

array([3.78642857e+01, 6.98071429e+04])
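
To see what the fitted scaler does with these learned statistics, the transformation can be reproduced by hand from `mean_` and `scale_`. This is a sketch on synthetic data: `X_demo` and `scaler_demo` are made-up stand-ins for the Age and EstimatedSalary columns, not the notebook's actual data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Age / EstimatedSalary columns (values are made up)
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.integers(18, 60, size=50),
    rng.integers(15000, 150000, size=50),
]).astype(float)

scaler_demo = StandardScaler().fit(X_demo)

# Manual standardization using the per-column mean and standard deviation
manual = (X_demo - scaler_demo.mean_) / scaler_demo.scale_

# Should agree with the scaler's own transform
matches = np.allclose(manual, scaler_demo.transform(X_demo))
print(matches)
```

`mean_` holds the per-column means and `scale_` the per-column standard deviations, so `transform` amounts to applying the z-score formula column by column.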

In [22]:
X_train

Unnamed: 0,Age,EstimatedSalary
92,26,15000
223,60,102000
234,38,112000
232,40,107000
377,42,53000
...,...,...
323,48,30000
192,29,43000
117,36,52000
47,27,54000


In [23]:
X_train_transformed = pd.DataFrame(X_train_transformed, columns = X_train.columns)

In [24]:
X_test_transformed = pd.DataFrame(X_test_transformed, columns = X_test.columns)

In [25]:
X_train_transformed

Unnamed: 0,Age,EstimatedSalary
0,-1.163172,-1.584970
1,2.170181,0.930987
2,0.013305,1.220177
3,0.209385,1.075582
4,0.405465,-0.486047
...,...,...
275,0.993704,-1.151185
276,-0.869053,-0.775237
277,-0.182774,-0.514966
278,-1.065133,-0.457127


In [27]:
# Why scaling is important

lr = LogisticRegression()
lr_transformed = LogisticRegression()

In [28]:
lr.fit(X_train,y_train)
lr_transformed.fit(X_train_transformed,y_train)

LogisticRegression()

In [33]:
y_pred = lr.predict(X_test)
y_pred_transformed = lr_transformed.predict(X_test_transformed)

In [34]:
print("Score of non scaled data:", accuracy_score(y_test,y_pred))
print("score of scaled data",accuracy_score(y_test,y_pred_transformed))

Score of non scaled data: 0.6583333333333333
score of scaled data 0.8666666666666667
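
One possible way to keep the scaling step tied to the training data is to wrap the scaler and the model in a scikit-learn pipeline. The sketch below assumes synthetic data and a made-up labelling rule (`X_demo`, `y_demo`), so its score is not comparable to the numbers above; it only illustrates the mechanics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (values are made up)
rng = np.random.default_rng(0)
n = 400
age = rng.integers(18, 60, size=n).astype(float)
salary = rng.integers(15000, 150000, size=n).astype(float)
X_demo = np.column_stack([age, salary])

# Hypothetical labelling rule, only so that there is something to learn
y_demo = ((age > 40) | (salary > 100000)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# fit() learns the mean and standard deviation from X_tr only;
# score() reuses those statistics when transforming X_te
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

score = pipe.score(X_te, y_te)
print(score)
```

With a pipeline, the test set cannot accidentally be used to fit the scaler, because `fit` is only ever called on the training split.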


- Standardization often improves the score for linear algorithms and for algorithms that compute distances, as in the logistic regression comparison above
- Standardization brings the columns to a comparable scale so that no column outweighs the others just because its values span a larger range
- Outliers remain outliers after standardization and should be handled explicitly
- The shape of the distribution is the same before and after standardization; only its location and scale change
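
The last two points can be checked numerically. This is a small sketch on a made-up skewed column with one injected outlier (`col` and the helper `skewness` are hypothetical, introduced only for this illustration): the skewness is unchanged by standardization, and the outlier is still the most extreme value afterwards.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def skewness(a):
    """Sample skewness: third central moment divided by std cubed."""
    a = np.asarray(a, dtype=float)
    centered = a - a.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


# Made-up right-skewed column with one artificial outlier at position 0
rng = np.random.default_rng(0)
col = rng.exponential(scale=1000.0, size=500)
col[0] = 50000.0

scaled = StandardScaler().fit_transform(col.reshape(-1, 1)).ravel()

# Same shape of distribution: skewness is identical up to rounding
print(skewness(col), skewness(scaled))

# The outlier is still the largest value, many standard deviations out
print(np.argmax(scaled), scaled[0])
```

Because standardization is a linear transformation, it shifts and rescales the column without changing the relative positions of the points.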