## Normalisation - Standardisation

We saw in section 3 of this course that the magnitude of the variables affects different machine learning algorithms for different reasons. In this section, I will cover a few standard ways of bringing the variables to the same range of values.


#### Normalisation

One method used to bring all the variables to a more homogeneous scale is normalisation. Here, normalisation means min-max scaling: we subtract the minimum value of the variable from each observation and then divide by the value range (maximum minus minimum). The rescaled variable lies between 0 and 1, with the original minimum mapped to 0 and the original maximum mapped to 1:

X_norm = (X - X_min) / (X_max - X_min)
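
To make the formula concrete, here is a minimal sketch that applies min-max scaling by hand and compares it with scikit-learn's `MinMaxScaler`. The toy array `X` is made up for illustration and is not part of the Titanic data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature with made-up values (assumption: not taken from the Titanic data)
X = np.array([[2.0], [4.0], [6.0], [10.0]])

# Manual min-max scaling: (X - X_min) / (X_max - X_min), computed per column
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same transformation with MinMaxScaler (default feature_range is (0, 1))
X_sklearn = MinMaxScaler().fit_transform(X)

print(X_manual.ravel())
print(np.allclose(X_manual, X_sklearn))
```

The smallest value maps to 0 and the largest to 1; everything else falls in between in proportion to its distance from the minimum.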

#### Standardisation

Standardisation is also used to bring all the variables to a similar scale. Standardisation means centering the variable at zero and scaling its variance to 1. The procedure involves subtracting the mean from each observation and then dividing by the standard deviation:

z = (x - x_mean) /  std
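
As a small illustration of the formula, the sketch below standardises a made-up toy array by hand and checks the result against `StandardScaler`. One detail worth noting: `StandardScaler` uses the population standard deviation (`ddof=0`, NumPy's default), whereas pandas' `.std()` defaults to `ddof=1`, so manual results computed with pandas can differ slightly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature with made-up values (assumption: for illustration only)
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Manual z-score: subtract the mean, divide by the standard deviation
x_mean = x.mean(axis=0)
x_std = x.std(axis=0)  # ddof=0, the convention StandardScaler follows
z_manual = (x - x_mean) / x_std

# The same transformation with scikit-learn
z_sklearn = StandardScaler().fit_transform(x)

print(z_manual.mean(axis=0), z_manual.std(axis=0))
print(np.allclose(z_manual, z_sklearn))
```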

For an overview of the different scaling methods check:
http://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py

Let's demonstrate the procedure of standardisation on the Titanic dataset.

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

In [3]:
# load the numerical variables of the Titanic Dataset

data = pd.read_csv('titanic.csv', usecols = ['pclass', 'age', 'fare', 'survived'])
data.head()

Unnamed: 0,pclass,survived,age,fare
0,1.0,1.0,29.0,211.3375
1,1.0,1.0,0.9167,151.55
2,1.0,0.0,2.0,151.55
3,1.0,0.0,30.0,151.55
4,1.0,0.0,25.0,151.55


In [4]:
# let's have a look at the values of those variables to get an idea of the magnitudes
data.describe()

Unnamed: 0,pclass,survived,age,fare
count,1309.0,1309.0,1046.0,1308.0
mean,2.294882,0.381971,29.881135,33.295479
std,0.837836,0.486055,14.4135,51.758668
min,1.0,0.0,0.1667,0.0
25%,2.0,0.0,21.0,7.8958
50%,3.0,0.0,28.0,14.4542
75%,3.0,1.0,39.0,31.275
max,3.0,1.0,80.0,512.3292


The variables have different value ranges, and therefore different magnitudes. Not only do the minimum and maximum values differ, but the values are also spread over ranges of different widths.

In [5]:
# min-max scaling of age and fare (quick look on the full dataset)
scaling = MinMaxScaler()

In [6]:
scaling.fit_transform(data[['age','fare']])

array([[0.36116884, 0.41250333],
       [0.00939458, 0.2958059 ],
       [0.0229641 , 0.2958059 ],
       ...,
       [0.33611663, 0.01410226],
       [0.36116884, 0.01537098],
       [       nan,        nan]])

In [6]:
# let's look at missing data

data.isnull().sum()

pclass        1
survived      1
age         264
fare          2
dtype: int64

In [7]:
# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(data[['pclass', 'age', 'fare']],
                                                    data.survived, test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((917, 3), (393, 3))

In [8]:
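# let's first fill the missing age values with the median age of the training set
# (assigning the result back avoids relying on inplace=True on a column of a slice)

age_median = X_train['age'].median()
X_train['age'] = X_train['age'].fillna(age_median)
X_test['age'] = X_test['age'].fillna(age_median)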

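Age contains missing values, so the cell above fills them with the median age computed on the training set. The same training median is used for the test set, so that no information from the test set leaks into the preprocessing.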
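
The missing-data count earlier also shows a couple of missing values in fare, which the cell above does not fill. One possible way to handle every numerical column at once is sketched below: compute all medians on the training set and reuse them for the test set. The toy frames and the names `X_train_toy`, `X_test_toy` and `train_medians` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Toy train/test frames with made-up values (assumption: not the Titanic data)
X_train_toy = pd.DataFrame({"age": [20.0, np.nan, 40.0],
                            "fare": [10.0, 30.0, np.nan]})
X_test_toy = pd.DataFrame({"age": [np.nan, 50.0],
                           "fare": [np.nan, 5.0]})

# Learn the medians from the training set only (NaN values are skipped)
train_medians = X_train_toy.median()

# Fill both sets with the training medians, matched by column name
X_train_filled = X_train_toy.fillna(train_medians)
X_test_filled = X_test_toy.fillna(train_medians)

print(train_medians)
print(X_test_filled)
```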

### Standardisation
Also known as z-score normalisation.

StandardScaler from scikit-learn removes the mean and scales the data to unit variance. 

In [9]:
# standardisation: we use the StandardScaler from sklearn

scaler = StandardScaler() # create an object
X_train_scaled = scaler.fit_transform(X_train) # fit the scaler to the train set, and then transform it
X_test_scaled = scaler.transform(X_test) # transform the test set
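
The scaler is fitted on the training set only, and the test set is transformed with the mean and standard deviation learned from the training data. The following sketch, on made-up toy arrays, shows where those learned parameters are stored (`mean_` and `scale_`) and that `transform` reuses them rather than recomputing them on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train/test arrays with made-up values (assumption: for illustration only)
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [10.0]])

# fit() learns the mean and standard deviation from the training data
toy_scaler = StandardScaler().fit(train)
print(toy_scaler.mean_, toy_scaler.scale_)

# transform() applies the training parameters to the test data
test_scaled = toy_scaler.transform(test)

# Manual check using the training statistics only
expected = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, expected))
```

Because the test set is scaled with training statistics, its mean and standard deviation after scaling will generally not be exactly 0 and 1.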

In [10]:
# let's have a look at the scaled training dataset: mean and standard deviation

print('means (Pclass, Age and Fare): ', X_train_scaled.mean(axis=0))
print('std (Pclass, Age and Fare): ', X_train_scaled.std(axis=0))

means (Pclass, Age and Fare):  [-2.67325239e-16  1.56908292e-16             nan]
std (Pclass, Age and Fare):  [ 1.  1. nan]
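
The `nan` reported for Fare most likely comes from the missing fare values, which were not filled before scaling (only age was imputed). scikit-learn's scalers ignore `NaN` when fitting and keep it in the transformed output, while NumPy's `mean` and `std` propagate it. A minimal sketch on made-up data, using the NaN-aware `np.nanmean` and `np.nanstd` to inspect the result:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data with made-up values; the second column has one missing entry
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [3.0, 30.0],
                  [4.0, 50.0]])

# NaN is ignored during fit and preserved by transform
X_toy_scaled = StandardScaler().fit_transform(X_toy)

# The plain mean propagates the NaN in the second column
print(X_toy_scaled.mean(axis=0))

# NaN-aware statistics skip the missing entry
print(np.nanmean(X_toy_scaled, axis=0))
print(np.nanstd(X_toy_scaled, axis=0))
```

Filling the missing fare values before scaling, for example with the training median, would avoid the `nan` altogether.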
