In [1]:
import numpy as np
import pandas as pd

Feature scaling is commonly done in one of two ways: standardization or normalization. In this notebook, standardization is implemented with StandardScaler and normalization with MinMaxScaler.

# Implementing Standardization using StandardScaler

In [2]:
data = pd.read_csv('D:/datasets/Social_Network_Ads.csv')
data.head()

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


In [3]:
data.shape

(400, 5)

To keep the focus on standardization, the User ID and Gender columns are dropped. If the Gender column were used, it would first have to be
converted into numerical data with OneHotEncoder, because it is nominal categorical data.
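
The notebook does not encode Gender, so the following is only a minimal sketch of what that step could look like. The five values are the ones visible in `data.head()`; the variable names `gender`, `encoder` and `encoded` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Gender values of the first five rows, shaped as a single-column 2D array
# because OneHotEncoder expects input of shape (n_samples, n_features).
gender = np.array([["Male"], ["Male"], ["Female"], ["Female"], ["Male"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(gender).toarray()

print(encoder.categories_)  # categories are sorted, so Female comes before Male
print(encoded)              # one column per category, a single 1 in each row
```

Each row ends up with exactly one 1, in the column of its category. For a two-valued feature one of the two columns is redundant, so in practice one of them is often dropped.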

In [4]:
dataset = data.iloc[:,2:]
dataset.head()

Unnamed: 0,Age,EstimatedSalary,Purchased
0,19,19000,0
1,35,20000,0
2,26,43000,0
3,27,57000,0
4,19,76000,0


In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset.iloc[:,:-1], dataset.iloc[:,-1],test_size=0.1, random_state=100)

In [6]:
print(X_train.head(),'\n')
print("The shape is:",X_train.shape)

     Age  EstimatedSalary
136   20            82000
54    27            58000
75    34           112000
369   54            26000
258   58            95000 

The shape is: (360, 2)


In [7]:
print(y_train.head(),'\n')
print("The shape is:",y_train.shape)

136    0
54     0
75     1
369    1
258    1
Name: Purchased, dtype: int64 

The shape is: (360,)


In [8]:
X_test.shape

(40, 2)

In [9]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss.fit(X_train)

StandardScaler()

In [10]:
ss.mean_

array([3.74500000e+01, 7.03777778e+04])

In [11]:
X_train_transformed = ss.transform(X_train)
X_test_transformed = ss.transform(X_test)

In [12]:
X_train.head(2)

Unnamed: 0,Age,EstimatedSalary
136,20,82000
54,27,58000


In [13]:
X_train_transformed

array([[-1.66313927,  0.33735873],
       [-0.99597739, -0.35929028],
       [-0.3288155 ,  1.20817   ],
       [ 1.57736131, -1.28815562],
       [ 1.95859668,  0.71471028],
       [-0.8053597 , -0.79469591],
       [ 0.24303754,  2.07898126],
       [-0.90066854, -0.4463714 ],
       [-0.04288898, -0.50442549],
       [ 1.67267015,  1.73065676],
       [-0.90066854, -0.76566887],
       [ 0.1477287 , -0.82372295],
       [ 0.24303754,  0.22125057],
       [-0.23350666, -1.46231788],
       [ 1.00550827, -1.08496633],
       [ 0.1477287 , -0.33026323],
       [ 1.57736131,  1.09206183],
       [ 0.91019943, -0.62053366],
       [ 0.33834638,  0.25027761],
       [ 2.14921436, -0.82372295],
       [-0.23350666,  0.19222352],
       [ 0.91019943, -0.67858774],
       [ 0.1477287 ,  0.01806127],
       [-0.3288155 ,  0.04708831],
       [-1.09128623, -1.17204746],
       [ 0.24303754, -0.15610098],
       [ 0.91019943,  1.23719704],
       [ 0.33834638,  0.04708831],
       [-1.28190391,

In [14]:
# Convert the NumPy arrays back into DataFrames so that pandas .describe() can be used
# to inspect the mean and standard deviation
X_train_transformed = pd.DataFrame(X_train_transformed, columns=X_train.columns)
X_test_transformed = pd.DataFrame(X_test_transformed, columns=X_test.columns)

In [16]:
X_train_transformed.head(2)

Unnamed: 0,Age,EstimatedSalary
0,-1.663139,0.337359
1,-0.995977,-0.35929


Without scaling, the mean is 37.4 (Age) and 70377.8 (EstimatedSalary), and the standard deviation is 10.5 (Age) and 34498.6 (EstimatedSalary).
With scaling, the mean of both features is 0 and the standard deviation is 1. This is what StandardScaler does: it centers each feature on
its mean and rescales it to unit variance. Below, X_train_transformed.describe() confirms that standardization has done its work.

The formula for StandardScaler is:

$$X_{\text{scaled}} = \frac{X_i - X_{\text{mean}}}{X_{\text{std}}}$$

 
Below, .describe() shows the statistics before and after applying this formula.

In [17]:
np.round(X_train.describe(),1)

Unnamed: 0,Age,EstimatedSalary
count,360.0,360.0
mean,37.4,70377.8
std,10.5,34498.6
min,18.0,15000.0
25%,29.0,43000.0
50%,37.0,70500.0
75%,46.0,88250.0
max,60.0,150000.0


In [18]:
np.round(X_train_transformed.describe(),1)

Unnamed: 0,Age,EstimatedSalary
count,360.0,360.0
mean,-0.0,-0.0
std,1.0,1.0
min,-1.9,-1.6
25%,-0.8,-0.8
50%,-0.0,0.0
75%,0.8,0.5
max,2.1,2.3
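
The standardization formula can also be checked by hand. The sketch below uses only NumPy and the five Age values printed in `X_train.head()`, so its numbers will not match the 360-row output above. It assumes the population standard deviation (`ddof=0`), which is what StandardScaler divides by, whereas pandas `.describe()` reports the sample standard deviation (`ddof=1`).

```python
import numpy as np

# Age values of the first five training rows shown earlier.
age = np.array([20.0, 27.0, 34.0, 54.0, 58.0])

mean = age.mean()
std_population = age.std(ddof=0)  # population std, as used by StandardScaler

# Standardization: subtract the mean, divide by the standard deviation.
z = (age - mean) / std_population

print("z-scores:", np.round(z, 3))
print("mean of z:", z.mean())
print("std of z with ddof=0:", z.std(ddof=0))
print("std of z with ddof=1:", z.std(ddof=1))
```

With `ddof=1` the standard deviation of the scaled values is sqrt(n / (n - 1)) rather than exactly 1. For the 360 training rows that factor is about 1.0014, which is why the rounded `.describe()` output still shows 1.0.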


In [20]:
from sklearn.linear_model import LogisticRegression

In [21]:
lr_normal = LogisticRegression()
lr_scaled = LogisticRegression()

In [25]:
lr_normal.fit(X_train,y_train)
lr_scaled.fit(X_train_transformed,y_train)

LogisticRegression()

In [30]:
normal_prediction = lr_normal.predict(X_test)
scaled_prediction = lr_scaled.predict(X_test_transformed)

Below, we can see the difference in accuracy with and without scaling the features: after applying StandardScaler, the accuracy improves.
Algorithms that compute distances, such as KNN and K-means, as well as neural networks (because of gradient descent) and linear models,
need scaled features. If we do not scale the features, a feature with larger values will dominate the training of the model,
which biases the result.

In [31]:
from sklearn.metrics import accuracy_score
print("Accuracy without scaling", accuracy_score(y_test, normal_prediction))
print("Accuracy with scaling", accuracy_score(y_test, scaled_prediction))

Accuracy without scaling 0.55
Accuracy with scaling 0.775
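
To see why a large-valued feature dominates distance-based methods, the sketch below compares how much each feature contributes to the squared Euclidean distance between two users, before and after standardization. The two users `a` and `b` are hypothetical, and the mean and standard deviation are the rounded training statistics quoted in this notebook.

```python
import numpy as np

# Two hypothetical users as (Age, EstimatedSalary):
# very different ages, almost identical salaries.
a = np.array([20.0, 82000.0])
b = np.array([58.0, 80000.0])

# Rounded training statistics quoted in this notebook.
mean = np.array([37.45, 70377.8])
std = np.array([10.5, 34498.6])

raw_gap = np.abs(a - b)
scaled_gap = np.abs((a - mean) / std - (b - mean) / std)

# Share of the squared Euclidean distance contributed by each feature.
raw_share = raw_gap**2 / np.sum(raw_gap**2)
scaled_share = scaled_gap**2 / np.sum(scaled_gap**2)

print("raw share    (Age, Salary):", np.round(raw_share, 4))
print("scaled share (Age, Salary):", np.round(scaled_share, 4))
```

On the raw data, the 2000-unit salary gap accounts for almost all of the distance even though the 38-year age gap is the more meaningful difference. After standardization the age gap dominates instead.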


Feature scaling has no effect on some algorithms, such as DecisionTree and XGBoost. Because a decision tree only compares
feature values against thresholds, the result is the same with or without scaling. Below, the accuracy of the decision tree is
unaffected by feature scaling.

In [36]:
from sklearn.tree import DecisionTreeClassifier
dt_normal = DecisionTreeClassifier()
dt_scaled = DecisionTreeClassifier()

In [38]:
dt_normal.fit(X_train,y_train)
dt_scaled.fit(X_train_transformed,y_train)

DecisionTreeClassifier()

In [40]:
dt_normal_prediction = dt_normal.predict(X_test)
dt_scaled_prediction = dt_scaled.predict(X_test_transformed)

In [42]:
print("Accuracy without scaling", accuracy_score(y_test, dt_normal_prediction))
print("Accuracy with scaling", accuracy_score(y_test, dt_scaled_prediction))

Accuracy without scaling 0.875
Accuracy with scaling 0.875
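
The reason a tree is unaffected can be illustrated without fitting one. A split of the form "feature <= threshold" depends only on the ordering of the values, and standardization preserves that ordering. The sketch below uses synthetic salaries and a hypothetical threshold; it is an illustration of the argument, not the tree-building algorithm itself.

```python
import numpy as np

# Synthetic salaries in the same range as EstimatedSalary (illustrative only).
rng = np.random.default_rng(0)
salary = rng.integers(15000, 150001, size=20).astype(float)

# Standardize the feature.
z = (salary - salary.mean()) / salary.std()

# 1) The ordering of the samples is unchanged by scaling.
same_order = np.array_equal(
    np.argsort(salary, kind="stable"), np.argsort(z, kind="stable")
)

# 2) A split in raw units gives the same partition as the
#    correspondingly mapped split in scaled units.
threshold = 70000.5  # hypothetical split point
z_threshold = (threshold - salary.mean()) / salary.std()
same_partition = np.array_equal(salary <= threshold, z <= z_threshold)

print("same ordering:", same_order)
print("same partition:", same_partition)
```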


# Implementing Normalization using MinMaxScaler

Importing the wine dataset (wine_data.csv) directly from GitHub using the raw file link.

In [49]:
data = pd.read_csv('https://raw.githubusercontent.com/campusx-official/100-days-of-machine-learning/main/day25-normalization/wine_data.csv',
                   header=None)
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


Selecting only the first three columns to implement MinMaxScaler.

In [63]:
dataset = data.iloc[:,0:3]
dataset.head()

Unnamed: 0,0,1,2
0,1,14.23,1.71
1,1,13.2,1.78
2,1,13.16,2.36
3,1,14.37,1.95
4,1,13.24,2.59


In [77]:
# The file was read with header=None, so the column labels are the integers 0, 1, 2 (not strings)
dataset = dataset.rename(columns={0: 'Class Label', 1: 'Alcohol', 2: 'Malic Acid'})
dataset.head()

Unnamed: 0,Class Label,Alcohol,Malic Acid
0,1,14.23,1.71
1,1,13.2,1.78
2,1,13.16,2.36
3,1,14.37,1.95
4,1,13.24,2.59


In [80]:
dataset.shape

(178, 3)

In [79]:
Xtrain, Xtest, ytrain, ytest = train_test_split(dataset.drop('Class Label', axis=1), dataset['Class Label'], test_size = 0.1,
                                               random_state=10)
Xtrain.head()

Unnamed: 0,Alcohol,Malic Acid
110,11.46,3.74
163,12.96,3.45
1,13.2,1.78
160,12.36,3.83
47,13.9,1.68


In [84]:
print(Xtrain.shape,'\n',Xtest.shape)

(160, 2) 
 (18, 2)


In [85]:
from sklearn.preprocessing import MinMaxScaler
ms = MinMaxScaler()
ms.fit(Xtrain)

MinMaxScaler()

In [86]:
Xtrain_scaled = ms.transform(Xtrain)
Xtest_scaled = ms.transform(Xtest)

In [92]:
Xtrain_scaled = pd.DataFrame(Xtrain_scaled, columns=Xtrain.columns)
Xtrain_scaled.head(2)

Unnamed: 0,Alcohol,Malic Acid
0,0.113158,0.592885
1,0.507895,0.535573


With MinMaxScaler as the normalization step, the minimum value of each feature becomes 0 and the maximum value becomes 1. The formula
for MinMaxScaler is
$$X_{\text{scaled}} = \frac{X_i - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}$$
Below, .describe() shows the result of applying this formula.

Xtrain.describe()

In [97]:
np.round(Xtrain_scaled.describe(),1)

Unnamed: 0,Alcohol,Malic Acid
count,160.0,160.0
mean,0.5,0.3
std,0.2,0.2
min,0.0,0.0
25%,0.4,0.2
50%,0.5,0.2
75%,0.7,0.5
max,1.0,1.0
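
The min-max formula can be reproduced by hand in the same way. The sketch below uses the five Alcohol values printed in `Xtrain.head()` as a stand-in training set, so its minimum and maximum differ from those of the full 160-row split; the `test` values are hypothetical. It also shows that values outside the training range fall outside [0, 1] when they are transformed with the training minimum and maximum.

```python
import numpy as np

# Alcohol values of the first five training rows shown earlier.
train = np.array([11.46, 12.96, 13.20, 12.36, 13.90])
# Hypothetical unseen values, two of them outside the training range.
test = np.array([11.00, 13.00, 14.50])

# The minimum and maximum are learned from the training data only.
x_min, x_max = train.min(), train.max()

train_scaled = (train - x_min) / (x_max - x_min)
test_scaled = (test - x_min) / (x_max - x_min)

print("train scaled:", np.round(train_scaled, 3))
print("test scaled: ", np.round(test_scaled, 3))
```

The training values span exactly 0 to 1, while the test value below the training minimum maps to a negative number and the one above the training maximum maps to a number greater than 1.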


MinMaxScaler is a good choice when the minimum and maximum values are known in advance. A good example is image input to a CNN, where each
pixel value is known to lie between 0 and 255. There are other normalization techniques as well, such as max absolute scaling, robust scaling, and mean normalization. Each has its own use cases.
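
As a rough illustration, the sketch below applies min-max scaling with known bounds to 8-bit pixel values and then writes out the other techniques by hand with NumPy. The formulas are the commonly used definitions (divide by the maximum absolute value; subtract the median and divide by the interquartile range; subtract the mean and divide by the range) and are stated here as assumptions, since this notebook does not define them. The sample arrays are made up.

```python
import numpy as np

# Known bounds: 8-bit pixel intensities always lie in [0, 255],
# so min-max scaling reduces to dividing by 255.
pixels = np.array([0, 64, 128, 255], dtype=np.float64)
pixels_scaled = pixels / 255.0

# A made-up feature in which 100.0 acts as an outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Max absolute scaling: divide by the largest absolute value.
max_abs = x / np.max(np.abs(x))

# Robust scaling: center on the median, divide by the interquartile range.
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

# Mean normalization: center on the mean, divide by the range.
mean_norm = (x - x.mean()) / (x.max() - x.min())

print("pixels:    ", np.round(pixels_scaled, 3))
print("max abs:   ", np.round(max_abs, 3))
print("robust:    ", np.round(robust, 3))
print("mean norm: ", np.round(mean_norm, 3))
```

Robust scaling leaves the four ordinary values within one unit of zero despite the outlier, whereas max absolute scaling and mean normalization compress them into a narrow band because the outlier sets the scale.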