# Data Preprocessing Tutorial by [Ali Abdelaal](https://www.linkedin.com/in/aliabdelaal/) 
### This tutorial file was made for the [Pixels](https://www.facebook.com/PixelsHU/) course (Machine Learning)

In [3]:
# import our modules 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### First, let's import our dataset and take a look at it.

In [4]:
dataset = pd.read_csv('unprocessed_data.csv')
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


## Now we need to separate the dependent and independent variables

In [5]:
features_matrix = dataset.iloc[:,:-1].values
features_matrix

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [6]:
goal_vector = dataset.iloc[:,-1].values
goal_vector

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'], dtype=object)

## Handling the missing data
### We have two options:
###    1. remove the row
###    2. replace it with the mean of its column (which is what we're going to do)
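The two options can be compared side by side with plain pandas. This is only a minimal sketch: the `sample` frame is a hypothetical three-row stand-in built from values in the dataset above, and `dropna`/`fillna` are used here instead of the scikit-learn imputer that the tutorial itself uses.

```python
import numpy as np
import pandas as pd

# a tiny stand-in for the dataset above (values copied from three of its rows)
sample = pd.DataFrame({
    'Country': ['France', 'Germany', 'Spain'],
    'Age': [44.0, 40.0, np.nan],
    'Salary': [72000.0, np.nan, 52000.0],
})

# option 1: drop every row that has at least one missing value
dropped = sample.dropna()

# option 2: replace each missing value with the mean of its column
filled = sample.fillna(sample[['Age', 'Salary']].mean())

print(dropped)
print(filled)
```

Dropping rows is simple but throws data away (here two of the three rows), which is why replacing with the mean is usually the safer choice on a small dataset.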

In [7]:
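# import the modules we need
# (Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it and always works column-wise)
from sklearn.impute import SimpleImputer
# make a new imputer that fills each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')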
features_matrix[:, 1:3] = imputer.fit_transform(features_matrix[:, 1:3])
features_matrix

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
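The two filled-in values can be checked by hand. The sketch below copies the Age and Salary columns from the dataset above and averages the nine known values in each; the variable names are only for illustration.

```python
import numpy as np

# Age and Salary columns copied from the dataset above (np.nan marks the gaps)
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37])
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000])

# np.nanmean skips the NaN entries, so each mean is taken over 9 values
age_mean = np.nanmean(age)        # 349 / 9
salary_mean = np.nanmean(salary)  # 574000 / 9
print(age_mean, salary_mean)
```

These match the 38.77... and 63777.77... that appear in the imputed matrix.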

## Handling Categorical Data 
### Most models work with numbers rather than words, so we are going to turn the text categories into numbers

In [8]:
# import the needed libraries
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
features_matrix[:, 0] = encoder.fit_transform(features_matrix[:, 0])
features_matrix

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

In [9]:
goal_vector = encoder.fit_transform(goal_vector)
goal_vector

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
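To see which number stands for which label, you can inspect the encoder's `classes_` attribute. Note that the notebook re-fits the same `encoder` on `goal_vector`, so after that cell it only remembers `No`/`Yes`. The sketch below therefore uses a separate, hypothetical `country_encoder` on a few country names.

```python
from sklearn.preprocessing import LabelEncoder

countries = ['France', 'Spain', 'Germany', 'Spain']
country_encoder = LabelEncoder()
codes = country_encoder.fit_transform(countries)

# classes_ holds the labels in sorted order; a label's position is its code
mapping = {label: code for code, label in enumerate(country_encoder.classes_)}
print(mapping)   # {'France': 0, 'Germany': 1, 'Spain': 2}

# inverse_transform turns the codes back into the original labels
print(country_encoder.inverse_transform(codes))
```

Keeping one encoder per column makes it possible to decode predictions back into readable labels later.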

### Notice what happened here?
#### We replaced each label with a number that does the same job; the model doesn't care whether a category is called France or 0
#### Yet the model might think that one country has a larger value than another (Spain = 2 > France = 0), and that ordering means nothing here, so it can lead to mistaken calculations!
#### A good way to handle this case is to use the one-hot encoder

### One-hot encoding turns the Country column into dummy variables (one 0/1 column per country), so we no longer have the problem of one country (or category) taking a higher value than another
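As a rough picture of what dummy variables are, the sketch below builds them by hand with NumPy from a few label-encoded countries. This is not how scikit-learn does it internally, only an illustration; the `codes` array is made up for the example.

```python
import numpy as np

# label-encoded countries: 0 = France, 1 = Germany, 2 = Spain
codes = np.array([0, 2, 1, 2])

# row i of the 3x3 identity matrix is the one-hot vector for code i
dummies = np.eye(3)[codes]
print(dummies)
```

Every row contains a single 1, so all countries are equally "far" from each other and none is larger than another.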

In [10]:
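# import the OneHotEncoder class and the ColumnTransformer
# (the categorical_features argument was removed in scikit-learn 0.22)
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# one-hot encode column 0 (Country) and pass Age and Salary through unchanged
oneHotEncoder = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
features_matrix = oneHotEncoder.fit_transform(features_matrix).astype(float)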
features_matrix

array([[  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          4.40000000e+01,   7.20000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          2.70000000e+01,   4.80000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          3.00000000e+01,   5.40000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          3.80000000e+01,   6.10000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          4.00000000e+01,   6.37777778e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          3.50000000e+01,   5.80000000e+04],
       [  0.00000000e+00,   0.00000000e+00,   1.00000000e+00,
          3.87777778e+01,   5.20000000e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          4.80000000e+01,   7.90000000e+04],
       [  0.00000000e+00,   1.00000000e+00,   0.00000000e+00,
          5.00000000e+01,   8.30000000e+04],
       [  1.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          3.70000000e+01,   6.70000000e+04]])

## Now we need to split our data into train and test sets

In [11]:
# import the modules
# (sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features_matrix, goal_vector, train_size = 0.8, random_state = 0)
print(len(x_train))
print(len(x_test))
print(len(features_matrix))
# see the size percentage!
# note:
## random_state is fixed just to get the same split each time you run
## the train size is changeable and depends on several things [the data size, the problem itself, ...]

8
2
10
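The effect of `random_state` is easy to check: calling `train_test_split` twice with the same seed gives the same split. The sketch below uses made-up arrays `X` and `y` with 10 samples, so the 80/20 split again yields 8 and 2 rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 made-up samples with 2 features each
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# same random_state -> identical split on every call
a_train, a_test, _, _ = train_test_split(X, y, train_size=0.8, random_state=0)
b_train, b_test, _, _ = train_test_split(X, y, train_size=0.8, random_state=0)

print(np.array_equal(a_test, b_test))   # True
print(len(a_train), len(a_test))        # 8 2
```

Leaving `random_state` unset shuffles differently on each run, which is fine for real work but makes a tutorial's numbers hard to reproduce.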




## Okay, what about the wide range of values in the salary column?
### Well, this can cause a problem for a model that relies on the Euclidean distance: the variable with the large values (salary) can dominate the others and blur the predictions, so it's better to scale/standardize our features so that they share the same scale
### Scaling also helps many algorithms converge much faster
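A quick calculation shows how strongly salary dominates. The sketch below takes the (Age, Salary) values of two rows from the dataset above and measures how much of the squared Euclidean distance comes from each feature; the names `a`, `b` and `share` are only for illustration.

```python
import numpy as np

# two rows from the dataset above as (Age, Salary)
a = np.array([27.0, 48000.0])
b = np.array([48.0, 79000.0])

diff = a - b
distance = np.sqrt(np.sum(diff ** 2))

# fraction of the squared distance contributed by each feature
share = diff ** 2 / np.sum(diff ** 2)
print(distance)
print(share)
```

Even though the two people are 21 years apart, almost all of the distance comes from the salary difference, so age is practically ignored until the features are scaled.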

In [12]:
# import the class we need from our beloved sklearn!
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# take a look at how the values were before and after the scaling
print('before scaling, max is %d and min is %d'%(np.max(x_train), np.min(x_train)))
# fit the scaler on the training data only, then reuse it on the test data
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
print('after scaling, max is %d and min is %d'%(np.max(x_train), np.min(x_train)))

before scaling, max is 79000 and min is 0
after scaling, max is 2 and min is -1
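What `StandardScaler` does to each column can be written out by hand: subtract the column mean and divide by the column standard deviation. The sketch below compares that formula with the scaler on a small made-up `salaries` column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

salaries = np.array([[48000.0], [54000.0], [61000.0], [72000.0]])

# z = (x - mean) / std, using the population standard deviation (ddof=0)
manual = (salaries - salaries.mean(axis=0)) / salaries.std(axis=0)
scaled = StandardScaler().fit_transform(salaries)

print(np.allclose(manual, scaled))   # True
print(scaled.mean(), scaled.std())
```

After scaling, the column has a mean of about 0 and a standard deviation of about 1. Note that the `%d` in the notebook's print statements truncates the decimals, so the real maximum and minimum are not exactly 2 and -1.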


## That's it for now!
## Thanks for reading