# Data Preprocessing

## Importing Libraries
All the imports used in this notebook are listed below and are also provided in the requirement.txt file.

In [1]:
# Imports.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import StandardScaler

## Importing The Dataset

In [2]:
# Imports the data, and checks it has been imported correctly
df = pd.read_csv('data.csv')
df.head(10)

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [3]:
df.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

In [4]:
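# Split the data into a feature matrix and dependent variable vector (X and y).
# .copy() makes X independent of df, so later assignments to X do not raise a SettingWithCopyWarning.
X = df.iloc[:, :-1].copy()
y = df.iloc[:, -1]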

In [5]:
print(f"X.shape = {X.shape}")
print(f"y.shape = {y.shape}")

X.shape = (10, 3)
y.shape = (10,)


## Dealing with Missing Values
Missing values generally cause problems when training a model, so they must be handled during the preprocessing stage.
\
\
One way to deal with missing data is simply to delete the affected rows. This is fine for a large dataset with only a few missing values (around 1%, say). If many values are missing, however, deleting them could have a large effect on the model, so we must employ another method.
\
\
Another method is to replace a missing value with the average for that column. For example, the Salary column is missing one value, and here it is perfectly acceptable to replace it with the mean of the column.
\
\
To do this I am going to use sklearn, a large machine learning library that I will use a lot throughout this notebook. I will use an instance of SimpleImputer to find the mean of each numerical column (Age and Salary) and replace the single null value in each.

In [6]:
myImputer = SimpleImputer(missing_values=np.nan, strategy='mean')
myImputer.fit(X.iloc[:, 1:3]) # Look for missing values in the two numerical columns.
X.iloc[:, 1:3] = myImputer.transform(X.iloc[:, 1:3]) # Apply changes.

X.head(10)

Unnamed: 0,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,63777.777778
5,France,35.0,58000.0
6,Spain,38.777778,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0
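
As a rough comparison of the two approaches described above, the sketch below rebuilds the Age and Salary columns from the dataset preview, drops the incomplete rows, and then fills the gaps with the column means instead. The names `numeric`, `dropped`, `column_means`, `filled` and `imputed` are my own and are not used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Age and Salary values copied from the dataset preview (NaN marks a missing value).
numeric = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
})

# Option 1: delete every row that contains a missing value.
dropped = numeric.dropna()
print(dropped.shape)  # (8, 2): two of the ten rows are lost

# Option 2: fill each gap with the mean of its column (pandas skips NaN when averaging).
column_means = numeric.mean()
filled = numeric.fillna(column_means)
print(column_means)

# SimpleImputer with strategy="mean" should produce the same values.
imputed = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(numeric)
print(np.allclose(filled.to_numpy(), imputed))
```

On a dataset this small, deleting rows throws away 20% of the data, which is why imputation is the more attractive option here.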


## Encoding Categorical Data
The data in this notebook contains one categorical column (Country). Most machine learning models work on numbers and cannot directly relate text categories to the dependent variable, so it is necessary to convert these categories into numbers.
\
\
If we were to encode this data by making France 1, Spain 2, and Germany 3, we would be implying that the categories have an order, which may or may not be true. In this case it is not true, so we need to avoid that.
\
\
To avoid this we will use one-hot encoding. It turns the three categories into three separate columns, one for each class found in this feature, with a 1 in the column matching the row's country and a 0 elsewhere.
\
\
Once the code below is run we should see three new columns representing the one-hot encoded categorical data.

In [7]:
# Encoding for country.
myColumnTransformer = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')
X = np.array(myColumnTransformer.fit_transform(X))
pd.DataFrame(X)

Unnamed: 0,0,1,2,3,4
0,1.0,0.0,0.0,44.0,72000.0
1,0.0,0.0,1.0,27.0,48000.0
2,0.0,1.0,0.0,30.0,54000.0
3,0.0,0.0,1.0,38.0,61000.0
4,0.0,1.0,0.0,40.0,63777.777778
5,1.0,0.0,0.0,35.0,58000.0
6,0.0,0.0,1.0,38.777778,52000.0
7,1.0,0.0,0.0,48.0,79000.0
8,0.0,1.0,0.0,50.0,83000.0
9,1.0,0.0,0.0,37.0,67000.0
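
To see which of the three new columns belongs to which country, the minimal sketch below fits a standalone OneHotEncoder on a few country values taken from the dataset. It is only an illustration and sits outside the notebook's own pipeline; `countries`, `encoder` and `encoded` are names I have made up for it.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A handful of country values from the dataset, as a single-column 2D array.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default, so convert it for printing.
encoded = encoder.fit_transform(countries).toarray()

# Categories are sorted, which fixes the column order of the output.
print(encoder.categories_[0])  # ['France' 'Germany' 'Spain']
print(encoded)
```

Because the categories are sorted alphabetically, column 0 is France, column 1 is Germany and column 2 is Spain, which matches the table above.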


In [8]:
# Encoding for dependent variable.
myLabelEncoder = LabelEncoder()
y = myLabelEncoder.fit_transform(y)
print(y)

[0 1 0 0 1 1 0 1 0 1]
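
The printed array does not say which label became 0 and which became 1. A small sketch, assuming the same ten Purchased values as the dataset preview (the names `labels`, `encoder` and `encoded` are mine), shows how the mapping can be read back from the fitted encoder:

```python
from sklearn.preprocessing import LabelEncoder

# Purchased values copied from the dataset preview.
labels = ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ lists the original labels; the position of each label is its code.
print(encoder.classes_)  # ['No' 'Yes']

# inverse_transform turns the codes back into the original labels.
print(encoder.inverse_transform(encoded[:3]))
```

So 0 stands for No and 1 stands for Yes.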


## Train/Test Split
This involves splitting the data into a training set and a test set. The training set is used to train the machine learning model, and the test set is used as unseen data to determine how well the model generalizes. In other words, the test set stands in for any new data given to the model in the future.
\
\
We split the data before moving on to the next and final stage of data preprocessing, feature scaling. This order matters: if the scaler were fitted on the full dataset, information from the test set would leak into training, which could bias the evaluation and misrepresent how well the model generalizes.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [10]:
pd.DataFrame(X_train)

Unnamed: 0,0,1,2,3,4
0,0.0,0.0,1.0,38.777778,52000.0
1,0.0,1.0,0.0,40.0,63777.777778
2,1.0,0.0,0.0,44.0,72000.0
3,0.0,0.0,1.0,38.0,61000.0
4,0.0,0.0,1.0,27.0,48000.0
5,1.0,0.0,0.0,48.0,79000.0
6,0.0,1.0,0.0,50.0,83000.0
7,1.0,0.0,0.0,35.0,58000.0


In [11]:
pd.DataFrame(X_test)

Unnamed: 0,0,1,2,3,4
0,0.0,1.0,0.0,30.0,54000.0
1,1.0,0.0,0.0,37.0,67000.0


In [12]:
pd.DataFrame(y_train)

Unnamed: 0,0
0,0
1,1
2,0
3,0
4,1
5,1
6,0
7,1


In [13]:
pd.DataFrame(y_test)

Unnamed: 0,0
0,0
1,1
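
The split above puts 8 rows in the training set and 2 in the test set, and fixing `random_state` makes that split repeatable. The sketch below checks both points on placeholder data; `X_demo` and `y_demo` are hypothetical stand-ins with the same number of rows as the dataset, not the notebook's real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 rows, 2 features, and a binary target.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# Run the same split twice with the same random_state.
first = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
second = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)  # 80% of the rows for training, 20% for testing

# Both runs should return identical arrays.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```

Leaving `random_state` unset would give a different split on each run, which makes results harder to compare.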


## Feature Scaling
Feature scaling is the final part of the data preprocessing in this notebook.
\
\
Feature scaling puts all our features on the same scale. Without it, some features could dominate others purely because they are larger in magnitude, not because they matter more to the model.
\
\
The two main types of feature scaling are shown below:


|<center>Standardisation</center>                    |<center>Normalisation</center>                                        |
|----------------------------------------------------|----------------------------------------------------------------------|
|<b><center>X_standardized = (X - μ) / σ</center></b> |<b><center>X_normalized = (X - X_min) / (X_max - X_min)</center></b>  |
|Data will usually be between -3 and 3.              |Data will be between 0 and 1.                                         |
    
Which should we go for?
    
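Normalisation suits features with a known, bounded range and no extreme outliers, because a single outlier stretches the min-max range and squeezes the remaining values together. Standardisation makes no assumption about the range and works well in most situations, so for the most part go for standardisation.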
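
To make the two formulas in the table concrete, here is a minimal sketch that applies them by hand to four Age values from the dataset and compares the results with scikit-learn's StandardScaler and MinMaxScaler. The array `ages` and the other variable names are my own, and the check assumes σ is the population standard deviation, which is what both NumPy's `std` and StandardScaler use by default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Four Age values from the dataset, as a single-column 2D array.
ages = np.array([[27.0], [38.0], [44.0], [50.0]])

# Standardisation: subtract the mean, divide by the standard deviation.
standardised = (ages - ages.mean()) / ages.std()

# Normalisation: subtract the minimum, divide by the range.
normalised = (ages - ages.min()) / (ages.max() - ages.min())

print(standardised.ravel())
print(normalised.ravel())

# Both hand-computed results should match the scikit-learn scalers.
print(np.allclose(standardised, StandardScaler().fit_transform(ages)))
print(np.allclose(normalised, MinMaxScaler().fit_transform(ages)))
```

The normalised values run from exactly 0 to 1, while the standardised values are centred on 0.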

In [14]:
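myScaler = StandardScaler()
# Scale only the numerical columns (Age and Salary); the one-hot columns 0-2 are left as they are.
X_train[:, 3:] = myScaler.fit_transform(X_train[:, 3:]) # Learn the mean and standard deviation from the training set only.
X_test[:, 3:] = myScaler.transform(X_test[:, 3:]) # Reuse the training statistics on the test set.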

In [15]:
pd.DataFrame(X_train)

Unnamed: 0,0,1,2,3,4
0,0.0,0.0,1.0,-0.191592,-1.078126
1,0.0,1.0,0.0,-0.014117,-0.070132
2,1.0,0.0,0.0,0.566709,0.633562
3,0.0,0.0,1.0,-0.30453,-0.307866
4,0.0,0.0,1.0,-1.901801,-1.420464
5,1.0,0.0,0.0,1.147534,1.232653
6,0.0,1.0,0.0,1.437947,1.574991
7,1.0,0.0,0.0,-0.74015,-0.564619


In [16]:
pd.DataFrame(X_test)

Unnamed: 0,0,1,2,3,4
0,0.0,1.0,0.0,-1.466182,-0.906957
1,1.0,0.0,0.0,-0.449737,0.20564
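
As a sanity check on the scaled tables, the sketch below rebuilds the Age and Salary columns of the training and test sets from the values printed earlier and scales them the same way. The arrays `train` and `test` are typed in by hand from those outputs, so the imputed values carry the rounding shown in the tables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary columns of the training set, copied from the output of In [10].
train = np.array([
    [38.777778, 52000.0],
    [40.0, 63777.777778],
    [44.0, 72000.0],
    [38.0, 61000.0],
    [27.0, 48000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [35.0, 58000.0],
])
# Age and Salary columns of the test set, copied from the output of In [11].
test = np.array([[30.0, 54000.0], [37.0, 67000.0]])

scaler = StandardScaler()
scaled_train = scaler.fit_transform(train)  # statistics come from the training rows only
scaled_test = scaler.transform(test)        # the test rows reuse those statistics

# After scaling, each training column has mean 0 and standard deviation 1.
print(scaled_train.mean(axis=0).round(6), scaled_train.std(axis=0).round(6))

# The test columns are not centred on 0, because their own statistics were never used.
print(scaled_test.mean(axis=0).round(6))
```

The test set keeping a non-zero mean is expected: it is scaled with the training mean and standard deviation, so nothing about the test data influences the scaler.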
