<h2>Data Preprocessing</h2>

<h4>Import libraries</h4>

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

<h4>Import the dataset</h4>

In [2]:
ds = pd.read_csv('Data.csv')

<h4>Exploring the data</h4>

In [3]:
ds.head(10)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes

<h4>Split independent/dependent variables</h4>

In [4]:
X = ds.iloc[:, :-1].values
Y = ds.iloc[:, -1].values

print(X.shape)
print(Y.shape)

(10, 3)
(10,)


<h4>Missing data</h4>

The first step is to check whether any NaN values exist.

In [15]:
# Flag the missing values in column 1 (Age)
x = pd.isnull(X[:, 1])

# Row indices of the NaN values in column 1
x_na = np.where(x)
x_na

(array([6], dtype=int64),)
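The check above looks only at column 1 (Age). As a minimal sketch, the same question can be asked of every numeric column at once. The `demo` frame below is a hypothetical stand-in that copies the Age and Salary values shown earlier, so the snippet runs without `Data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Age and Salary columns of Data.csv
demo = pd.DataFrame({
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, np.nan,
               58000.0, 52000.0, 79000.0, 83000.0, 67000.0],
})

# Number of missing values in each column
na_counts = demo.isnull().sum()
print(na_counts)

# Row positions of the missing values, per column
na_rows = {
    col: np.flatnonzero(demo[col].isnull().to_numpy()).tolist()
    for col in demo.columns
}
print(na_rows)
```

With this data, each column has exactly one gap: row 6 for Age and row 4 for Salary.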

Column 1 (Age) contains a NaN value at row 6, and the Salary column has one at row 4, so we'll replace each missing value with the mean of its column.

In [16]:
from sklearn.impute import SimpleImputer

# Replace missing values with the column mean.
# SimpleImputer supersedes the Imputer class, which was removed from sklearn.preprocessing.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

X[:, 1:3]

array([[44.0, 72000.0],
       [27.0, 48000.0],
       [30.0, 54000.0],
       [38.0, 61000.0],
       [40.0, 63777.77777777778],
       [35.0, 58000.0],
       [38.77777777777778, 52000.0],
       [48.0, 79000.0],
       [50.0, 83000.0],
       [37.0, 67000.0]], dtype=object)
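To make clear where the filled-in values 38.78 and 63777.78 come from, here is a minimal NumPy sketch of mean imputation done by hand. The helper `fill_with_mean` is a hypothetical name used only for illustration; it is not part of scikit-learn.

```python
import numpy as np

age = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0, np.nan, 48.0, 50.0, 37.0])
salary = np.array([72000.0, 48000.0, 54000.0, 61000.0, np.nan,
                   58000.0, 52000.0, 79000.0, 83000.0, 67000.0])


def fill_with_mean(column):
    """Return a copy in which NaN entries are replaced by the mean of the non-NaN entries."""
    filled = column.copy()
    filled[np.isnan(filled)] = np.nanmean(column)
    return filled


age_filled = fill_with_mean(age)
salary_filled = fill_with_mean(salary)

# The replaced entries sit at row 6 (Age) and row 4 (Salary)
print(age_filled[6], salary_filled[4])
```

The mean is taken over the nine observed values in each column: 349 / 9 for Age and 574000 / 9 for Salary.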

<h4>Encoding variables</h4>

Most machine learning models expect numeric input and cannot work directly with categorical text values, so we'll encode these variables in a numeric format.

In [19]:
#Encoding categorical variables
from sklearn.preprocessing import LabelEncoder

labelencoder_X = LabelEncoder()
labelencoder_X.fit_transform(X[:,0])

array([0, 2, 1, 2, 1, 0, 2, 0, 1, 0], dtype=int64)

However, the country column does not express magnitude or order; it only names countries. The label encoding above is recommended for variables that have a natural order.
An alternative is to split the categorical variable into several columns, each filled with 0 or 1, a process called __One-Hot Encoding__.

In [37]:
# Transform the categorical variable into one-hot encoded columns
# (dtype is set explicitly because newer pandas versions default to bool)
x_dummies = np.array(pd.get_dummies(X[:, 0], dtype=np.uint8))

x_dummies

array([[1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 1, 0],
       [1, 0, 0]], dtype=uint8)
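In the output above, the three columns appear to follow the alphabetical order of the categories (France, Germany, Spain). As a rough sketch of the mechanics, the same matrix can be built with NumPy alone; the `countries` array is a hypothetical copy of the Country column.

```python
import numpy as np

# Hypothetical copy of the Country column
countries = np.array(["France", "Spain", "Germany", "Spain", "Germany",
                      "France", "Spain", "France", "Germany", "France"])

# np.unique returns the sorted categories and, for each row, the index of its category
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories), dtype=np.uint8)[codes]

print(categories)
print(codes)
print(one_hot[:3])
```

The intermediate `codes` array is the same integer labelling that the label encoder produced earlier; one-hot encoding then spreads each code across its own column so that no ordering between countries is implied.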

In [39]:
# Concatenate the one-hot encoded columns with the remaining columns (Age, Salary)
X = np.concatenate((x_dummies, X[:, 1:]), axis=1)
X

array([[1, 0, 0, 44.0, 72000.0],
       [0, 0, 1, 27.0, 48000.0],
       [0, 1, 0, 30.0, 54000.0],
       [0, 0, 1, 38.0, 61000.0],
       [0, 1, 0, 40.0, 63777.77777777778],
       [1, 0, 0, 35.0, 58000.0],
       [0, 0, 1, 38.77777777777778, 52000.0],
       [1, 0, 0, 48.0, 79000.0],
       [0, 1, 0, 50.0, 83000.0],
       [1, 0, 0, 37.0, 67000.0]], dtype=object)

The class column is categorical with only two values (binary), so the simple label encoding shown earlier is sufficient.

In [40]:
#Verifying old values
Y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

In [43]:
y_encoder = LabelEncoder()
Y = y_encoder.fit_transform(Y)
Y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=int64)

The 'No' values in the class column are changed to 0 (zero) and the 'Yes' values to 1 (one).
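The mapping between labels and integers does not have to be read off the output by eye. A small sketch, assuming scikit-learn is installed, inspects the fitted encoder's `classes_` attribute and reverses the encoding with `inverse_transform`; the `labels` array is a hypothetical copy of the class column.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical copy of the Purchased column
labels = np.array(["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# Position i in classes_ is the label that was mapped to integer i
print(encoder.classes_)

# Map the integers back to the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded[:3])
```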

<h4>Split dataset into train and test</h4>

In [54]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state=0)
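With ten rows and `test_size = 0.2`, the split should leave eight rows for training and two for testing. The sketch below checks this on placeholder arrays (`X_demo` and `y_demo` are hypothetical, not the notebook's data) and also shows that a fixed `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset
X_demo = np.arange(50).reshape(10, 5)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state selects the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te2))
```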

<h4>Feature scaling</h4>

In this dataset, the variables __Age__ and __Salary__ are on very different scales, so we need to bring them to a common scale to avoid problems with the machine learning model's predictions.

*__Standardisation__*

\begin{align}
x_{\text{stand}} = \frac{x - \operatorname{mean}(x)}{\sigma(x)}
\end{align}
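As a minimal sketch of the standardisation formula, the snippet below applies it to the Age column with plain NumPy. The `age` array is a hypothetical copy of the column after mean imputation, and `np.std` is used with its default, the population standard deviation.

```python
import numpy as np

# Age column after mean imputation (the missing entry replaced by 349 / 9)
age = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0, 349 / 9, 48.0, 50.0, 37.0])

# x_stand = (x - mean(x)) / sigma(x)
age_stand = (age - age.mean()) / age.std()

print(age_stand.round(3))
print(age_stand.mean().round(6), age_stand.std().round(6))
```

After the transformation the column has a mean of approximately 0 and a standard deviation of 1. In practice the mean and standard deviation would typically be computed on the training split and reused for the test split.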

*__Normalization__*

\begin{align}
x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}
\end{align}
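The normalization formula can be sketched the same way. The `salary` array below is a hypothetical copy of the Salary column after mean imputation.

```python
import numpy as np

# Salary column after mean imputation (the missing entry replaced by 574000 / 9)
salary = np.array([72000.0, 48000.0, 54000.0, 61000.0, 574000 / 9,
                   58000.0, 52000.0, 79000.0, 83000.0, 67000.0])

# x_norm = (x - min(x)) / (max(x) - min(x))
salary_norm = (salary - salary.min()) / (salary.max() - salary.min())

print(salary_norm.round(3))
```

The smallest salary (48000) maps to 0 and the largest (83000) maps to 1, with every other value falling in between.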