# Data Preprocessing Tools

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt  # pyplot is a submodule of the matplotlib library
import pandas as pd

## Importing the dataset

In [2]:
dataset = pd.read_csv('Data.csv')
# Split the features and the dependent variable into separate arrays.
# iloc is pandas' integer-location indexer: iloc[row selection, column selection]
X = dataset.iloc[:, :-1].values
# :-1 selects every column except the last one: -1 refers to the last column,
# and a slice includes its lower bound but excludes its upper bound
y = dataset.iloc[:, -1].values
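
A minimal sketch of the two `iloc` slices, using a small hypothetical frame instead of `Data.csv`. The column names `Country` and `Purchased` are assumptions made for illustration; only the slicing behaviour matters here.

```python
import pandas as pd

# Hypothetical toy frame; the column names are illustrative assumptions
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = toy.iloc[:, :-1].values  # all rows, every column except the last
target = toy.iloc[:, -1].values     # all rows, last column only

print(features.shape)  # (3, 3)
print(target.shape)    # (3,)
```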

In [3]:
print(X)  # matrix of features

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [4]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

Missing data must be handled before training an ML model. If a large dataset has missing values in only about 1% of its rows, we can usually drop those rows without losing much. That is not always the case, however, so here we replace the missing values instead.
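
For comparison, dropping incomplete rows could look like the sketch below. It uses a hypothetical four-row frame (not `Data.csv`) and checks what fraction of rows would be lost before removing them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two incomplete rows
toy = pd.DataFrame({
    "Age": [44.0, np.nan, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, np.nan, 61000.0],
})

# Fraction of rows that contain at least one missing value
missing_fraction = toy.isna().any(axis=1).mean()
print(missing_fraction)  # 0.5

# Drop every row that has a missing value
cleaned = toy.dropna()
print(len(cleaned))  # 2
```

Losing half the rows, as in this toy case, is exactly the situation where imputing the values is the better choice.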

In [5]:
# SimpleImputer is a class from scikit-learn (sklearn).
# We import the class and then create an instance of it.
from sklearn.impute import SimpleImputer
# Replace each missing value with the mean of the other values in its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit computes the column means; transform fills in the missing values.
# Indexing starts at 0 and the upper bound is excluded, so 1:3 selects only the Age and Salary columns
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [6]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


## Encoding categorical data

### Encoding the Independent Variable

We use one-hot encoding here, which replaces the categorical column with one binary column per category. This article explains why one-hot encoding can be useful in ML models: https://medium.com/analytics-vidhya/what-why-and-when-of-one-hot-encoding-52d25a5d3aba
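
As a small illustration of what the encoder produces, the sketch below applies `OneHotEncoder` to a hypothetical single-column array of country names. The categories are sorted alphabetically, so each country gets its own binary column in that order.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column input; OneHotEncoder expects a 2D array
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

demo_encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for printing
encoded = demo_encoder.fit_transform(countries).toarray()

print(demo_encoder.categories_[0])  # ['France' 'Germany' 'Spain']
print(encoded)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```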

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# [0] is the index of the column the encoder is applied to
# remainder='passthrough' keeps the columns that are not transformed (the default, 'drop', would discard them)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [8]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


### Encoding the Dependent Variable

In [9]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [10]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
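
To see which label maps to which integer, the fitted encoder can be inspected. This is a sketch on a hypothetical label array; `classes_` lists the labels in the order of their integer codes, and `inverse_transform` maps the codes back.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, not the notebook's y
labels = np.array(["No", "Yes", "No", "Yes"])

demo_le = LabelEncoder()
codes = demo_le.fit_transform(labels)

print(demo_le.classes_)  # ['No' 'Yes'], so 'No' -> 0 and 'Yes' -> 1
print(codes)             # [0 1 0 1]
print(demo_le.inverse_transform(codes))  # ['No' 'Yes' 'No' 'Yes']
```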


## Splitting the dataset into the Training set and Test set

In [11]:
from sklearn.model_selection import train_test_split
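# test_size=0.2 holds out 20% of the rows for testing; random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)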
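
The effect of `test_size` and `random_state` can be checked on a hypothetical ten-row array: the split sizes follow the 80/20 ratio, and repeating the call with the same seed yields the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

print(a_train.shape, a_test.shape)     # (8, 2) (2, 2)
print(np.array_equal(a_test, b_test))  # True: same seed, same split
```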

In [12]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [13]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [14]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [15]:
print(y_test)

[0 1]


## Feature Scaling

Feature scaling is applied after splitting the dataset into training and test sets. The scaler computes the mean and standard deviation of each feature, and if those statistics were computed on the full dataset, information from the test set would leak into training. Scaling after the split keeps the test set completely unseen and avoids this data leakage.

We use standardisation for feature scaling here. Each value is scaled using the mean and standard deviation of its own feature (column):

```
[value - mean(feature)] / [standard deviation(feature)]
```
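
The formula above can be applied by hand and compared with `StandardScaler`. The sketch below uses hypothetical training and test arrays; note that the statistics come from the training rows only, and that `StandardScaler` uses the population standard deviation, which is also NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical Age/Salary values, not the notebook's actual split
train = np.array([[27.0, 48000.0], [38.0, 61000.0], [44.0, 72000.0], [50.0, 83000.0]])
test = np.array([[30.0, 54000.0]])

# Per-column statistics computed on the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_manual = (train - mean) / std
test_manual = (test - mean) / std  # the test set reuses the training statistics

demo_scaler = StandardScaler()
train_sklearn = demo_scaler.fit_transform(train)
test_sklearn = demo_scaler.transform(test)

print(np.allclose(train_manual, train_sklearn))  # True
print(np.allclose(test_manual, test_sklearn))    # True
```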



In [16]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# The scaler is fitted only on X_train: we compute the mean and standard deviation of X_train
# and use them to transform both X_train and X_test.
# We must not compute the mean and standard deviation from the test set, since in a real-world scenario that data would be unseen.
# Only the numeric columns (index 3 onward) are scaled; the one-hot columns are left unchanged.
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

In [17]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [18]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
