## Lab 7: Feature Scaling

Feature scaling is the process of normalising the range of features in a dataset.

Real-world datasets often contain features that are varying in degrees of magnitude, range and units. Therefore, in order for machine learning models to interpret these features on the same scale, we have to perform feature scaling.

In science, we all know the importance of comparing apples to apples and yet many people, especially beginners, have a tendency to overlook feature scaling as part of the preprocessing steps for machine learning. This has proven to cause models to make inaccurate predictions.

In this lab, we will explore why feature scaling is important, the difference between normalisation and standardisation as well as how feature scaling affects model accuracy. More specifically, we will explore the applications of 3 different types of scalers in the Scikit-learn library:

1. MixMaxScaler
2. StandardScaler
3. RobustScaler

For the purpose of this tutorial, we will use one of the toy datasets in the Scikit-learn library, the Boston house prices dataset. The details of the data are:

Input features in order:
1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million) [parts/10M]
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000 [$/10k]
11. PTRATIO: pupil-teacher ratio by town
12. B: The result of the equation B=1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT: % lower status of the population

Output variable:
1. MEDV: Median value of owner-occupied homes in $1000's [k$]

In [3]:
import pandas as pd

houses = pd.read_csv('boston.csv')
houses.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


In [5]:
houses.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

In [7]:
# Get predictor and target variables
X = houses.drop('MEDV', axis = 1)
Y = houses['MEDV']

# X, Y shape
print("X shape: ", X.shape)
print("Y shape: ", Y.shape)

X shape:  (506, 13)
Y shape:  (506,)


#### Different Types of Feature Scaling

Before we examine the effects of feature scaling, let us first go over some theories behind normalisation and standardisation.

1. Normalisation
Normalisation, also known as min-max scaling, is a scaling technique whereby the values in a column are shifted so that they are bounded between a fixed range of 0 and 1.

X_new = (X - X_min) / (X_max - X_min)

MinMaxScaler is the Scikit-learn function for normalisation.

2. Standardisation
On the other hand, standardisation or Z-score normalisation is another scaling technique whereby the values in a column are rescaled so that they demonstrate the properties of a standard Gaussian distribution, that is mean = 0 and variance = 1. It has average almost zero or close to zero. standardization is commonly used as it shows better results, but in image processing the normalizationndo better.

X_new = (X - mean) / std

StandardScaler is the Scikit-learn function for standardisation.

Unlike StandardScaler, RobustScaler scales features using statistics that are robust to outliers. More specifically, this scaler removes the median and scales the data according to the quantile range or by default, the interquartile range, thus making it less susceptible to outliers.

3. Normalisation vs standardisation
The choice between normalisation or standardisation comes down to the application.

Standardisation is generally preferred over normalisation in most machine learning context as it is especially important in order to compare the similarities between features based on certain distance measures. This is most prominent in Principal Component Analysis (PCA) where we are interested in the components that maximise the variance.

Normalisation, on the other hand, also offers many practical applications particularly in computer vision and image processing where pixel intensities have to be normalised to fit within a the RGB colour range between 0 and 255. Furthermore, neural network algorithms typically require data to be normalised to a 0-1 scale before model training.

At the end of the day, there is no definitive answer as to whether you should normalise or standardise your data. One can always apply both techniques and compare the model performance for the best results.

Now that we have a theoretical understanding of feature scaling, let's see how they work in practice.

In [16]:
# Instantiate MinMaxScaler, StandardScaler and RobustScaler
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
norm = MinMaxScaler()
standard = StandardScaler()
robust = RobustScaler()

In [26]:
# MinMaxScaler
normalised_features = norm.fit_transform(X) #this return to numpy array instead of df
normalised_df = pd.DataFrame(normalised_features, index = X.index, columns = X.columns) #this retrun back to df

# StandardScaler
standardised_features = standard.fit_transform(X)
standardised_df = pd.DataFrame(standardised_features, index = X.index, columns = X.columns)

# RobustScaler
robust_features = robust.fit_transform(X)
robust_df = pd.DataFrame(robust_features, index = X.index, columns = X.columns)

Now that we have the features scaled with the three types of feature scaling, let us now check the impact of this process with a concrete example using the Boston house prices dataset.

We will use KNN algorithm for this purpose.

In [21]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

knn = KNeighborsRegressor()

# We will test KNN with the the original data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3)
knn.fit(X_train, Y_train)
pred = knn.predict(X_test)
print(np.sqrt(mean_squared_error(Y_test, pred)))


# We will test KNN with the normalised_df
X_train, X_test, Y_train, Y_test = train_test_split(normalised_df, Y, test_size = 0.3)
knn.fit(X_train, Y_train)
pred = knn.predict(X_test)
print(np.sqrt(mean_squared_error(Y_test, pred)))


# We will test KNN with the standardised_df
X_train, X_test, Y_train, Y_test = train_test_split(standardised_df, Y, test_size = 0.3)
knn.fit(X_train, Y_train)
pred = knn.predict(X_test)
print(np.sqrt(mean_squared_error(Y_test, pred)))


# We will test KNN with the robust_df
X_train, X_test, Y_train, Y_test = train_test_split(robust_df, Y, test_size = 0.3)
knn.fit(X_train, Y_train)
pred = knn.predict(X_test)
print(np.sqrt(mean_squared_error(Y_test, pred)))

5.399660564380535
5.445455130560473
4.283795660761364
4.442559717336984


In [23]:
# Try to test different algorithms with the same dataset before and after feature scaling. What do you notice?
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor


In [31]:
from sklearn.metrics import mean_squared_error
svr = SVR()

# We will test SVR with the the original data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3)
svr.fit(X_train, Y_train)
pred = svr.predict(X_test)
print(np.sqrt(mean_squared_error(Y_test, pred)))


# We will test KNN with the normalised_df
X_train, X_test, Y_train, Y_test = train_test_split(normalised_df, Y, test_size = 0.3)
svr.fit(X_train, Y_train)
pred = svr.predict(X_test)
print(np.sqrt(mean_squared_error(Y_test, pred)))


# We will test KNN with the standardised_df
X_train, X_test, Y_train, Y_test = train_test_split(standardised_df, Y, test_size = 0.3)
svr.fit(X_train, Y_train)
pred = svr.predict(X_test)
print(np.sqrt(mean_squared_error(Y_test, pred)))


# We will test KNN with the robust_df
X_train, X_test, Y_train, Y_test = train_test_split(robust_df, Y, test_size = 0.3)
svr.fit(X_train, Y_train)
pred = svr.predict(X_test)
print(np.sqrt(mean_squared_error(Y_test, pred)))

7.85017579521584
6.415316301949848
5.528862356295355
6.100958048509972


In [33]:
dtRegressor = DecisionTreeRegressor()

# We will test DecisionTreeRegressor with the the original data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3)
dtRegressor.fit(X_train, Y_train)
pred = dtRegressor.predict(X_test)
print(np.sqrt(mean_squared_error(Y_test, pred)))


# We will test KNN with the normalised_df
X_train, X_test, Y_train, Y_test = train_test_split(normalised_df, Y, test_size = 0.3)
dtRegressor.fit(X_train, Y_train)
pred = dtRegressor.predict(X_test)
print(np.sqrt(mean_squared_error(Y_test, pred)))


# We will test KNN with the standardised_df
X_train, X_test, Y_train, Y_test = train_test_split(standardised_df, Y, test_size = 0.3)
dtRegressor.fit(X_train, Y_train)
pred = dtRegressor.predict(X_test)
print(np.sqrt(mean_squared_error(Y_test, pred)))


# We will test KNN with the robust_df
X_train, X_test, Y_train, Y_test = train_test_split(robust_df, Y, test_size = 0.3)
dtRegressor.fit(X_train, Y_train)
pred = dtRegressor.predict(X_test)
print(np.sqrt(mean_squared_error(Y_test, pred)))

#tree algorithm all features are treated equally, that is why normal

3.4719837344998576
4.355433690412351
3.7258873898851617
4.8152512641441785
