Name: Arijit Roy Chowdhury
    
Email: rc.arijit@gmail.com
    
Role: Data Scientist

# Feature Scaling and Standardization

Feature scaling is one of the most important data preprocessing steps in machine learning. Algorithms that compute distances between data points are biased towards features with numerically larger values if the data is not scaled.
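As a rough illustration of this bias, the sketch below computes Euclidean distances on a tiny made-up table of ages and incomes, before and after scaling. The values and the helper name `euclidean` are assumptions chosen only for the demonstration.

```python
import numpy as np

# Hypothetical toy data: columns are [age in years, income in currency units]
X = np.array([
    [25.0, 50_000.0],
    [60.0, 51_000.0],
    [26.0, 58_000.0],
])

def euclidean(a, b):
    """Plain Euclidean distance between two 1-D points."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# On the raw data the income column dominates the distance
d_raw_01 = euclidean(X[0], X[1])  # rows differ mostly in age
d_raw_02 = euclidean(X[0], X[2])  # rows differ mostly in income

# After standardizing each column, both features contribute on a similar scale
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d_std_01 = euclidean(X_std[0], X_std[1])
d_std_02 = euclidean(X_std[0], X_std[2])

print(f"raw:    d(0,1) = {d_raw_01:.1f}, d(0,2) = {d_raw_02:.1f}")
print(f"scaled: d(0,1) = {d_std_01:.3f}, d(0,2) = {d_std_02:.3f}")
```

On the raw data, the 35-year age gap between rows 0 and 1 hardly registers next to the income differences. After scaling, the two distances are of comparable size.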

Tree-based algorithms are fairly insensitive to the scale of the features. Feature scaling also helps many machine learning and deep learning algorithms train and converge faster.

Normalization and Standardization are the two most popular feature scaling techniques and, at the same time, the most often confused. Let's resolve that confusion.



Normalization, or Min-Max Scaling, transforms features onto a similar scale. Each new value is calculated as:

X_new = (X - X_min)/(X_max - X_min)

This scales the range to [0, 1] or sometimes [-1, 1]. Geometrically speaking, the transformation squishes the n-dimensional data into an n-dimensional unit hypercube. Normalization is useful when there are no outliers, as it cannot cope with them: a single extreme value sets X_min or X_max and compresses all other values into a narrow band. For example, we would usually min-max scale age but not income, because only a few people have very high incomes whereas age is closer to uniformly distributed.
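A minimal sketch of the min-max formula, applied by hand and with scikit-learn's `MinMaxScaler`, might look as follows. The age values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages, shaped (n_samples, n_features) as scikit-learn expects
ages = np.array([[18.0], [25.0], [40.0], [60.0]])

# Manual version of X_new = (X - X_min) / (X_max - X_min), computed per column
manual = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

# MinMaxScaler applies the same per-feature formula; the default range is (0, 1)
scaled = MinMaxScaler().fit_transform(ages)

# The [-1, 1] variant is available through the feature_range parameter
scaled_sym = MinMaxScaler(feature_range=(-1, 1)).fit_transform(ages)

print(manual.ravel())
print(scaled.ravel())
print(scaled_sym.ravel())
```

The manual and scikit-learn results should agree, with the smallest age mapped to the lower bound and the largest to the upper bound.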



Standardization, or Z-Score Normalization, transforms features by subtracting the mean and dividing by the standard deviation. The result is often called the Z-score.

X_new = (X - mean)/Std

Standardization can be helpful in cases where the data follows a Gaussian distribution. However, this does not necessarily have to be true. Geometrically speaking, it translates the data so that the mean vector of the original data moves to the origin, and then squishes or expands the points along each feature so that its standard deviation becomes 1. Only the mean and standard deviation change: if the original data is normal, the result is a standard normal distribution, and in general the shape of the distribution is not affected.
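The following sketch checks these properties on a synthetic, deliberately non-Gaussian feature. The data and the helper `skewness` are assumptions made only for this example. After standardization the mean is 0 and the standard deviation is 1, while the skewness (one measure of shape) stays the same.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature drawn from an exponential distribution
x = rng.exponential(scale=2.0, size=(1000, 1))

# Manual version of X_new = (X - mean) / Std, using the population std (ddof=0)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# StandardScaler applies the same per-feature transformation
scaled = StandardScaler().fit_transform(x)

def skewness(v):
    """Third standardized moment, a simple measure of asymmetry."""
    v = v.ravel()
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 3))

print("mean after scaling:", scaled.mean())
print("std after scaling: ", scaled.std())
print("skewness before / after:", skewness(x), skewness(scaled))
```

Because the transformation is only a shift and a rescale, a skewed feature stays skewed. Standardization does not make the data Gaussian.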

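Standardization has no predefined range for the transformed features, so an outlier is not forced into a fixed interval the way it is with min-max scaling. However, the mean and standard deviation are themselves sensitive to extreme values, so standardization is not immune to outliers.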
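To see how a single extreme value interacts with both transformations, here is a small sketch on made-up incomes. The helper names `min_max`, `z_score` and `spread` are hypothetical.

```python
import numpy as np

# Hypothetical incomes (in thousands); the last value added below is an outlier
incomes = np.array([30.0, 35.0, 40.0, 45.0, 50.0])
with_outlier = np.append(incomes, 1000.0)

def min_max(v):
    return (v - v.min()) / (v.max() - v.min())

def z_score(v):
    return (v - v.mean()) / v.std()

def spread(v):
    """Distance between the largest and smallest value."""
    return float(v.max() - v.min())

# Spread of the five ordinary incomes, without and with the outlier present
mm_clean = spread(min_max(incomes))
mm_outlier = spread(min_max(with_outlier)[:-1])
z_clean = spread(z_score(incomes))
z_outlier = spread(z_score(with_outlier)[:-1])

print(f"min-max spread of inliers: {mm_clean:.3f} -> {mm_outlier:.3f}")
print(f"z-score spread of inliers: {z_clean:.3f} -> {z_outlier:.3f}")
print("largest min-max value:", min_max(with_outlier).max())
print("largest z-score:      ", z_score(with_outlier).max())
```

Min-max scaling pins the outlier at exactly 1 and squeezes the ordinary incomes close to 0. The z-score is not bounded, but the outlier still inflates the mean and standard deviation, so the ordinary incomes end up closer together than they would without it.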

# Import Libraries

In [1]:
import time
import random
import pandas as pd
import pandas_profiling as pp
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
pd.set_option('display.max_columns', 500) 
pd.set_option('display.max_rows', 200) 

In [2]:
# Before we standardize the data, we need to convert all columns to numeric. 
# So let's make use of the same code snippet that we saw in 'Encoding-Categorical-Columns-to-Numeric'

In [3]:
# Read the CSV File using Pandas and store it as a dataframe 'df':

df = pd.read_csv('Dataset/HR_Employee_Attrition_Data.csv')
df.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,2,Female,94,3,2,Sales Executive,4,Single,5993,19479,8,Y,Yes,11,3,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,3,Male,61,2,2,Research Scientist,2,Married,5130,24907,1,Y,No,23,4,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,3,4,Male,92,2,1,Laboratory Technician,3,Single,2090,2396,6,Y,Yes,15,3,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,4,4,Female,56,3,1,Research Scientist,3,Married,2909,23159,1,Y,Yes,11,3,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,5,1,Male,40,3,1,Laboratory Technician,2,Married,3468,16632,9,Y,No,12,3,4,80,1,6,3,3,2,2,2,2


In [4]:
# Remove columns with Zero Variance:

df = df.loc[:, (df != df.iloc[0]).any()] 
df.head(5)

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,2,Female,94,3,2,Sales Executive,4,Single,5993,19479,8,Yes,11,3,1,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,2,3,Male,61,2,2,Research Scientist,2,Married,5130,24907,1,No,23,4,4,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,3,4,Male,92,2,1,Laboratory Technician,3,Single,2090,2396,6,Yes,15,3,2,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,4,4,Female,56,3,1,Research Scientist,3,Married,2909,23159,1,Yes,11,3,3,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,5,1,Male,40,3,1,Laboratory Technician,2,Married,3468,16632,9,No,12,3,4,1,6,3,3,2,2,2,2
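
The zero-variance filter above compares every row with the first row and keeps a column only if at least one row differs. A sketch on a made-up miniature frame shows the mechanics. The frame `toy` and the `nunique` alternative are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature of the HR data with three constant columns
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "Over18": ["Y", "Y", "Y"],
    "StandardHours": [80, 80, 80],
})

# Boolean Series indexed by column name: True where any row differs from row 0
varies = (toy != toy.iloc[0]).any()
filtered = toy.loc[:, varies]

# An alternative with the same result here: keep columns with more than one unique value
filtered_alt = toy.loc[:, toy.nunique() > 1]

print(varies)
print(list(filtered.columns))
```

This matches the output above, where `EmployeeCount`, `Over18` and `StandardHours` disappear after filtering.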


In [5]:
df.info()

# Insights: There are 8 categorical columns: Attrition, BusinessTravel, Department, EducationField, Gender, JobRole,
#           MaritalStatus, OverTime. These columns need to be converted to numeric using a suitable encoding technique
#           such as One Hot Encoding, Label Encoding or the pandas replace method.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2940 entries, 0 to 2939
Data columns (total 32 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       2940 non-null   int64 
 1   Attrition                 2940 non-null   object
 2   BusinessTravel            2940 non-null   object
 3   DailyRate                 2940 non-null   int64 
 4   Department                2940 non-null   object
 5   DistanceFromHome          2940 non-null   int64 
 6   Education                 2940 non-null   int64 
 7   EducationField            2940 non-null   object
 8   EmployeeNumber            2940 non-null   int64 
 9   EnvironmentSatisfaction   2940 non-null   int64 
 10  Gender                    2940 non-null   object
 11  HourlyRate                2940 non-null   int64 
 12  JobInvolvement            2940 non-null   int64 
 13  JobLevel                  2940 non-null   int64 
 14  JobRole                 

# Label Encode Categorical Variables to Numeric

In [6]:
# Function to convert categorical variables to numeric using preprocessing.LabelEncoder()

def preprocessor(df):
    res_df = df.copy()
    le = preprocessing.LabelEncoder()
    
    # Pass only Categorical / String column names here
    categorical_cols = ['Attrition', 'BusinessTravel', 'Department', 'EducationField',
                        'Gender', 'JobRole', 'MaritalStatus', 'OverTime']
    
    # Re-fit the encoder on each column so that every column gets its own integer mapping
    for col in categorical_cols:
        res_df[col] = le.fit_transform(res_df[col])

    return res_df

encoded_df = preprocessor(df)
encoded_df.head(5)

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,1,2,1102,2,1,2,1,1,2,0,94,3,2,7,4,2,5993,19479,8,1,11,3,1,0,8,0,1,6,4,0,5
1,49,0,1,279,1,8,1,1,2,3,1,61,2,2,6,2,1,5130,24907,1,0,23,4,4,1,10,3,3,10,7,1,7
2,37,1,2,1373,1,2,2,4,3,4,1,92,2,1,2,3,2,2090,2396,6,1,15,3,2,0,7,3,3,0,0,0,0
3,33,0,1,1392,1,3,4,1,4,4,0,56,3,1,6,3,1,2909,23159,1,1,11,3,3,0,8,3,3,8,7,3,0
4,27,0,2,591,1,2,1,3,5,1,1,40,3,1,2,2,1,3468,16632,9,0,12,3,4,1,6,3,3,2,2,2,2


In [7]:
# All categorical columns are now converted to numeric. You can use any of the other encoding techniques in a similar manner.
# I have used LabelEncoder because it is simple and returns a single column, unlike One Hot Encoding, which returns
# n columns (or n-1 if one level is dropped), where n is the number of unique values in the column.
# Note that label encoding assigns integers in sorted order, which implies an ordering that may not exist in the data.
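The difference in output shape between the two encodings can be sketched on a single made-up column. The values below are a toy sample, and `pd.get_dummies` is used here as one possible way to one-hot encode.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column with n = 3 unique values
dept = pd.Series(
    ["Sales", "Research & Development", "Human Resources", "Sales"],
    name="Department",
)

# LabelEncoder sorts the classes and maps them to 0..n-1 in a single column
label_encoded = LabelEncoder().fit_transform(dept)

# One-hot encoding creates one indicator column per unique value (n columns)
one_hot = pd.get_dummies(dept)

# Dropping the first level leaves n-1 columns
one_hot_drop = pd.get_dummies(dept, drop_first=True)

print(label_encoded)        # [2 1 0 2]
print(one_hot.shape)        # (4, 3)
print(one_hot_drop.shape)   # (4, 2)
```

The integer codes follow alphabetical order, so "Sales" receives a larger number than "Human Resources" without any real ranking between the two. This is worth keeping in mind for distance-based or linear models.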

In [8]:
encoded_df1 = encoded_df.copy(deep=True)     # Used to demonstrate Standardization
encoded_df2 = encoded_df.copy(deep=True)     # Used to demonstrate Normalization

# Standardization

In [9]:
# Save the target variable "Attrition" in y1 before standardization, as the target variable should not be standardized
y1 = encoded_df1['Attrition'].values
x1 = encoded_df1.drop(['Attrition'], axis=1)

In [10]:
scaler = StandardScaler()
cols = x1.columns
x1 = scaler.fit_transform(x1)

encoded_df1_standardised = pd.DataFrame(x1, columns = cols)
encoded_df1_standardised.head(5)

Unnamed: 0,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,0.44635,0.590048,0.742527,1.401512,-1.010909,-0.891688,-0.937414,-1.731462,-0.660531,-1.224745,1.383138,0.379672,-0.057788,1.032716,1.153254,1.23682,-0.10835,0.72602,2.125136,1.591746,-1.150554,-0.42623,-1.584178,-0.932014,-0.421642,-2.171982,-2.49382,-0.164613,-0.063296,-0.679146,0.245834
1,1.322365,-0.913194,-1.297775,-0.493817,-0.14715,-1.868426,-0.937414,-1.730284,0.254625,0.816497,-0.240677,-1.026167,-0.057788,0.626374,-0.660853,-0.133282,-0.291719,1.488876,-0.678049,-0.628241,2.129306,2.346151,1.191438,0.241988,-0.164511,0.155707,0.338096,0.488508,0.764998,-0.368715,0.806541
2,0.008343,0.590048,1.414363,-0.493817,-0.887515,-0.891688,1.316673,-1.729105,1.169781,0.816497,1.284725,-1.026167,-0.961486,-0.998992,0.2462,1.23682,-0.937654,-1.674841,1.324226,1.591746,-0.057267,-0.42623,-0.658973,-0.932014,-0.550208,0.155707,0.338096,-1.144294,-1.167687,-0.679146,-1.155935
3,-0.429664,-0.913194,1.461466,-0.493817,-0.764121,1.061787,-0.937414,-1.727927,1.169781,-1.224745,-0.486709,0.379672,-0.961486,0.626374,0.2462,-0.133282,-0.763634,1.243211,-0.678049,1.591746,-1.150554,-0.42623,0.266233,-0.932014,-0.421642,0.155707,0.338096,0.161947,0.764998,0.252146,-1.155935
4,-1.086676,0.590048,-0.524295,-0.493817,-0.887515,-1.868426,0.565311,-1.726749,-1.575686,0.816497,-1.274014,0.379672,-0.961486,-0.998992,-0.660853,-0.133282,-0.644858,0.3259,2.525591,-0.628241,-0.877232,-0.42623,1.191438,0.241988,-0.678774,0.155707,0.338096,-0.817734,-0.615492,-0.058285,-0.595227


In [11]:
x1 = encoded_df1_standardised.values

# Train Test Split

In [12]:
# Split Training and Testing Data in 80:20 ratio
x_train1, x_test1, y_train1, y_test1 = train_test_split(x1, y1, test_size = 0.2, random_state = 42)
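The cells above fit the scaler on the full dataset and split afterwards, which lets the test rows influence the mean and standard deviation. A common alternative, sketched here on synthetic data under that assumption, is to split first and fit the scaler on the training rows only. The array names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical feature matrix and binary target standing in for the HR data
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Split first, in the same 80:20 ratio used above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test rows

print(X_train_scaled.mean(axis=0))  # close to 0 for every feature
print(X_test_scaled.mean(axis=0))   # near 0, but generally not exactly 0
```

The test set is then scaled with statistics it did not contribute to, which mirrors how unseen data would be handled at prediction time.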

# Normalization

In [13]:
# Save the target variable "Attrition" in y2 before normalization, as the target variable should not be normalized
y2 = encoded_df2['Attrition'].values
x2 = encoded_df2.drop(['Attrition'], axis=1)

In [14]:
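# Note: sklearn's Normalizer rescales each row (sample) to unit L2 norm. It does not apply the
# per-feature min-max formula described earlier; sklearn's MinMaxScaler does that.
norm = Normalizer()
cols = x2.columns
x2 = norm.fit_transform(x2)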

encoded_df2_normalized = pd.DataFrame(x2, columns = cols)
encoded_df2_normalized.head(5)

Unnamed: 0,Age,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,0.002009,9.8e-05,0.053993,9.8e-05,4.9e-05,9.8e-05,4.9e-05,4.9e-05,9.8e-05,0.0,0.004606,0.000147,9.8e-05,0.000343,0.000196,9.8e-05,0.293629,0.95438,0.000392,4.9e-05,0.000539,0.000147,4.9e-05,0.0,0.000392,0.0,4.9e-05,0.000294,0.000196,0.0,0.000245
1,0.001927,3.9e-05,0.010971,3.9e-05,0.000315,3.9e-05,3.9e-05,7.9e-05,0.000118,3.9e-05,0.002399,7.9e-05,7.9e-05,0.000236,7.9e-05,3.9e-05,0.201718,0.979376,3.9e-05,0.0,0.000904,0.000157,0.000157,3.9e-05,0.000393,0.000118,0.000118,0.000393,0.000275,3.9e-05,0.000275
2,0.010679,0.000577,0.39628,0.000289,0.000577,0.000577,0.001154,0.000866,0.001154,0.000289,0.026553,0.000577,0.000289,0.000577,0.000866,0.000577,0.603223,0.691542,0.001732,0.000289,0.004329,0.000866,0.000577,0.0,0.00202,0.000866,0.000866,0.0,0.0,0.0,0.0
3,0.001411,4.3e-05,0.059532,4.3e-05,0.000128,0.000171,4.3e-05,0.000171,0.000171,0.0,0.002395,0.000128,4.3e-05,0.000257,0.000128,4.3e-05,0.124409,0.990439,4.3e-05,4.3e-05,0.00047,0.000128,0.000128,0.0,0.000342,0.000128,0.000128,0.000342,0.000299,0.000128,0.0
4,0.001588,0.000118,0.034765,5.9e-05,0.000118,5.9e-05,0.000176,0.000294,5.9e-05,5.9e-05,0.002353,0.000176,5.9e-05,0.000118,0.000118,5.9e-05,0.203999,0.978349,0.000529,0.0,0.000706,0.000176,0.000235,5.9e-05,0.000353,0.000176,0.000176,0.000118,0.000118,0.000118,0.000118


In [15]:
x2 = encoded_df2_normalized.values
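The difference between row-wise `Normalizer` and the column-wise min-max formula can be sketched on three columns (Age, MonthlyIncome, MonthlyRate) taken from the first three rows shown earlier. Using `MinMaxScaler` here is a suggestion for comparison, not what the notebook ran.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Age, MonthlyIncome, MonthlyRate for the first three employees in the data
X = np.array([
    [41.0, 5993.0, 19479.0],
    [49.0, 5130.0, 24907.0],
    [37.0, 2090.0, 2396.0],
])

# Normalizer works per row: each sample is divided by its own L2 norm
row_normalized = Normalizer().fit_transform(X)
row_norms = np.linalg.norm(row_normalized, axis=1)

# MinMaxScaler works per column: X_new = (X - X_min) / (X_max - X_min)
col_scaled = MinMaxScaler().fit_transform(X)

print(row_normalized)
print(row_norms)
print(col_scaled)
```

With `Normalizer`, the largest raw feature in each row (here mostly MonthlyRate) takes up most of the unit length, which is why the table above shows values near 1 in that column and tiny values elsewhere. With `MinMaxScaler`, every column spans [0, 1] independently.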

# Train Test Split

In [16]:
# Split Training and Testing Data in 80:20 ratio
x_train2, x_test2, y_train2, y_test2 = train_test_split(x2, y2, test_size = 0.2, random_state = 42)

# All Done. Now we can use x_train1, x_test1, y_train1, y_test1 (standardized) or x_train2, x_test2, y_train2, y_test2 (normalized) for model training and testing