# Data Preprocessing 

## Importing the libraries

A library is a collection of functions and methods that lets you perform many tasks without writing the code yourself. Very simple!

Example:
    1. NumPy is the fundamental package for scientific computing with Python.
    https://numpy.org/
    2. Matplotlib is for data visualization.
    https://matplotlib.org/
    3. pandas is for data manipulation and analysis. 
    https://pandas.pydata.org/ 
    4. Scikit-learn is for machine learning and data mining.
    https://scikit-learn.org/stable/
    5. TensorFlow is for machine learning and deep learning. (Google)
    https://www.tensorflow.org/
    6. PyTorch is for neural network modelling with GPU acceleration. (Facebook)
    https://pytorch.org/get-started/locally/
    
    

In [2]:
import numpy as np 
import matplotlib.pyplot as plt # plot data
import pandas as pd # import and manage data set

## Import the dataset using pandas

Tip: type "pd." and then press Tab. Do you see something? (In Jupyter, a completion list should pop up.)

In [4]:
dataset = pd.read_csv('Data.csv')

In [6]:
dataset # just to check that the import worked

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
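
Before going further, it can help to count the missing values per column. The sketch below assumes that Data.csv holds the ten rows shown above with the columns Country, Age, Salary and Purchased; the text is embedded with `io.StringIO` only so that the example runs without the file.

```python
import io

import pandas as pd

# Assumed contents of Data.csv, copied from the table above (index column omitted).
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

dataset = pd.read_csv(io.StringIO(csv_text))

# Count the missing entries in each column.
missing_per_column = dataset.isna().sum()
print(missing_per_column)
```

Empty CSV fields are read as NaN, which is why Age and Salary end up as float columns.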


In [7]:
X = dataset.iloc[:, :-1].values # take all the rows and all the columns except the last one

Note: .iloc[] in pandas selects rows and columns by integer position.
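
A minimal sketch of position-based selection with `.iloc[]`. The three-row frame `df` is a hypothetical stand-in with the same column layout as the dataset.

```python
import pandas as pd

# A tiny hypothetical frame with the same column layout as the dataset.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

features = df.iloc[:, :-1].values   # all rows, every column except the last
target = df.iloc[:, -1].values      # all rows, last column only (index 3 here)
first_row = df.iloc[0, :].values    # first row, all columns

print(features.shape)
print(target)
```

The first slot of `.iloc[rows, columns]` picks rows and the second picks columns; a bare `:` means "everything".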

In [9]:
X # check the value 

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [10]:
y = dataset.iloc[:, 3].values # the last column (index 3, since indexing in Python starts from zero)

In [11]:
y # check value 

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

## Handling missing data

If we look at the Age column, the age in one of the Spain rows is missing, and so is the salary in one of the Germany rows. One approach is to remove the corresponding observations. However, this is quite dangerous if they contain crucial information. Another approach is to replace each missing value with the mean of its column. You can also use the "median" or the "most_frequent" value.
Ex.

    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    
    or
    
    imputer = SimpleImputer(missing_values=np.nan, strategy='median')
    
    for the median, and
    
    imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    
    for the most frequent value.

Note: older tutorials use sklearn.preprocessing.Imputer with an axis argument. That class was removed in scikit-learn 0.22; SimpleImputer replaces it and always works column by column.

In [12]:
from sklearn.impute import SimpleImputer

In [14]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# SimpleImputer always takes the statistic of each column
# strategy = 'median' for the median, strategy = 'most_frequent' for the most frequent value
imputer = imputer.fit(X[:, 1:3]) # fit the imputer on the Age and Salary columns of X
# Only the second and third columns have missing data. The index starts from 0, so
# the slice starts at 1, and the upper bound is excluded, so we take 3 as the upper bound.
X[:, 1:3] = imputer.transform(X[:, 1:3]) # replace the missing values and write the result back into X

In [15]:
X # check the value of X again. What do you see?

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

Go to the Excel sheet and verify that the missing data is replaced correctly.
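
If Excel is not at hand, the same check can be sketched in NumPy. The arrays below are the Age and Salary columns copied from the dataset; `np.nanmean` computes the mean while ignoring the missing entries.

```python
import numpy as np

# Age and Salary columns from the dataset, with np.nan marking the missing entries.
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

# np.nanmean ignores NaN entries; this is the fill value that mean imputation uses.
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)

# Fill the gaps by hand.
age_filled = np.where(np.isnan(age), age_mean, age)
salary_filled = np.where(np.isnan(salary), salary_mean, salary)

print(age_mean, salary_mean)
```

The two means should agree with the values 38.777... and 63777.777... that appear in X above.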

## Categorical Data

In our dataset, the categorical data are in the columns "Country" and "Purchased".
"Country" contains France, Spain and Germany, whereas "Purchased" contains Yes and No. Machine learning models are based on mathematical equations, so we need to transform the categorical data into numbers.

Encoding categorical data

### Label Encoder

In [18]:
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder() # labelencoder_X is an object of the class LabelEncoder
X[:, 0] = labelencoder_X.fit_transform(X[:, 0]) # apply the method to the first column of the data
# fit_transform fits the label encoder and returns the encoded labels

In [19]:
X[:,0]

array([0, 2, 1, 2, 1, 0, 2, 0, 1, 0], dtype=object)

In [20]:
X

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

Label encoding introduces a new problem.

For example, we have encoded a set of country names as numbers. The data are still categorical,
and there is no ordering of any kind between the countries. The problem is that, since the column now holds
different numbers, a model may misread them as ordered: 0 < 1 < 2.

The model may derive a relationship like "as the country number increases, the age increases or decreases", which
is clearly meaningless. To overcome this problem, we use **One Hot Encoder**.
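
The mapping behind the codes can be inspected directly. The sketch below re-encodes the Country column on its own and reads the mapping from the encoder's `classes_` attribute; the codes simply follow the sorted order of the labels, which is why they carry no real meaning.

```python
from sklearn.preprocessing import LabelEncoder

countries = ["France", "Spain", "Germany", "Spain", "Germany",
             "France", "Spain", "France", "Germany", "France"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ lists the labels in sorted order; the position of each label is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
decoded = encoder.inverse_transform(codes)
```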

### One Hot Encoder

One Hot Encoder takes a column of categorical data that has been label encoded and splits it into multiple columns, one per category. The numbers are replaced by 1s and 0s, depending on which category each row belongs to.
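
To make the idea concrete, here is a hand-rolled sketch in NumPy. It is only an illustration of what the encoding produces, not how scikit-learn implements it; the codes are the label-encoded countries from the previous step.

```python
import numpy as np

# Label-encoded countries from the previous step: 0 = France, 1 = Germany, 2 = Spain.
codes = np.array([0, 2, 1, 2, 1, 0, 2, 0, 1, 0])

n_categories = codes.max() + 1

# Row i of the identity matrix is the indicator vector for category i, so
# indexing the identity matrix with the codes gives one indicator row per observation.
one_hot = np.eye(n_categories)[codes]

print(one_hot[:3])
```

Each row contains exactly one 1, in the column of its country, so no ordering between the countries is implied.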

In [23]:
from sklearn.preprocessing import OneHotEncoder
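# Note: the categorical_features argument was removed in scikit-learn 0.22, so this cell only runs on older versions
onehotencoder = OneHotEncoder(categorical_features = [0]) # encode the first column
X = onehotencoder.fit_transform(X).toarray()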

In [24]:
X

array([[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        4.40000000e+01, 7.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
        2.70000000e+01, 4.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
        3.00000000e+01, 5.40000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
        3.80000000e+01, 6.10000000e+04],
       [1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
        4.00000000e+01, 6.37777778e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        3.50000000e+01, 5.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
        3.87777778e+01, 5.20000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        4.80000000e+01, 7.90000000e+04],
       [1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
        5.00000000e+01, 

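Note that the array above has four indicator columns instead of the expected three (one per country). This is consistent with the cell having been run twice, so that the first indicator column was itself one-hot encoded into two columns; a single run on a fresh X gives three.

For the dependent variable y, we don't have to use the OneHotEncoder, since the machine learning algorithm will know that it is a
label.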

In [26]:
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

In [27]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1], dtype=int64)
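
On scikit-learn 0.22 and later, the one-hot step for the Country column can be sketched with `ColumnTransformer` instead of `categorical_features`. The four-row array `X_demo` is a hypothetical sample in the same layout as X after imputation, and `sparse_threshold=0` is set here only to force a dense result.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# A few rows in the same layout as X after imputation: Country, Age, Salary.
X_demo = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
    ["Spain", 38.0, 61000.0],
], dtype=object)

# One-hot encode column 0 and pass the remaining columns through unchanged.
ct = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_encoded = np.asarray(ct.fit_transform(X_demo), dtype=float)

print(X_encoded)
```

With this approach the strings are encoded directly, so the separate LabelEncoder pass over the Country column is not needed. The indicator columns come first (France, Germany, Spain), followed by Age and Salary.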

## Splitting the dataset into the training set and test set

In [30]:
from sklearn.model_selection import train_test_split
# older tutorials import this from sklearn.cross_validation, which was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0) # 20 percent of the data becomes the test set
# random_state fixes the shuffle so that we get the same split on every run

In [31]:
X_train

array([[1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
        4.00000000e+01, 6.37777778e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        3.70000000e+01, 6.70000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
        2.70000000e+01, 4.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
        3.87777778e+01, 5.20000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        4.80000000e+01, 7.90000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
        3.80000000e+01, 6.10000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        4.40000000e+01, 7.20000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        3.50000000e+01, 5.80000000e+04]])

In [32]:
X_test

array([[1.0e+00, 0.0e+00, 1.0e+00, 0.0e+00, 3.0e+01, 5.4e+04],
       [1.0e+00, 0.0e+00, 1.0e+00, 0.0e+00, 5.0e+01, 8.3e+04]])

In [35]:
y_train

array([1, 1, 1, 0, 1, 0, 0, 1], dtype=int64)

In [36]:
y_test

array([0, 0], dtype=int64)
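
The effect of `random_state` can be checked with a small sketch. `X_demo` and `y_demo` are hypothetical arrays of the same length as the dataset; two calls with the same seed should give identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # 10 hypothetical observations, 2 features
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# Two calls with the same random_state give identical splits.
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

same_split = np.array_equal(X_te_a, X_te_b) and np.array_equal(y_tr_a, y_tr_b)
print(same_split)
print(X_tr_a.shape, X_te_a.shape)
```

With 10 observations and `test_size=0.2`, 8 rows go to the training set and 2 to the test set, as in the outputs above.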

## Feature Scaling 

The features are not on the same scale, and this can cause issues in a machine learning model. Many ML models rely
on the Euclidean distance between observations, so if the features have very different ranges, one can dominate another (e.g. Age and Salary have very different ranges). There are two common ways to fix this: standardisation and normalisation.

$$x_{\mathrm{stand}} = \frac{x - \mathrm{mean}(x)}{\mathrm{standard\ deviation}(x)}, \qquad x_{\mathrm{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
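
The two formulas can be sketched directly in NumPy. The `age` array holds the nine known Age values from the dataset; the variable names are illustrative.

```python
import numpy as np

age = np.array([44.0, 27.0, 30.0, 38.0, 40.0, 35.0, 48.0, 50.0, 37.0])

# Standardisation: subtract the mean, divide by the standard deviation.
age_stand = (age - age.mean()) / age.std()

# Normalisation: rescale to the range [0, 1].
age_norm = (age - age.min()) / (age.max() - age.min())

print(age_stand.mean(), age_stand.std())
print(age_norm.min(), age_norm.max())
```

After standardisation the values have mean 0 and standard deviation 1 (up to rounding); after normalisation the smallest value maps to 0 and the largest to 1.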

In [43]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train) # fit on the training set, then transform it
X_test = sc_X.transform(X_test)   # do not fit again, just transform with the training-set parameters

Note: "StandardScaler.fit(X_train)" calculates the mean and standard deviation of each feature from the values in "X_train". Calling ".transform()" then transforms all of the features by subtracting the mean and dividing by the standard deviation. For convenience, these two function calls can be done in one step using "fit_transform()".

If you fit() on your test data, you would compute a new mean and standard deviation for each feature. In theory, these values may be
very similar if your test and train sets have the same distribution, but in practice this is often not the case.

Instead, you want to only transform the test data using the parameters computed on the training data.
(sklearn stores them inside the fitted scaler behind the scenes.)
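
A small sketch of what this means in practice. The arrays `X_train_demo` and `X_test_demo` are hypothetical Age and Salary rows; the fitted scaler keeps the training statistics in its `mean_` and `scale_` attributes and reuses them on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical Age and Salary values, split by hand into train and test rows.
X_train_demo = np.array([[40.0, 63000.0], [37.0, 67000.0],
                         [27.0, 48000.0], [48.0, 79000.0]])
X_test_demo = np.array([[30.0, 54000.0], [50.0, 83000.0]])

scaler = StandardScaler()
scaler.fit(X_train_demo)  # learns mean_ and scale_ (the standard deviation) from the training rows only

# transform() reuses the stored training statistics on the test rows.
X_test_scaled = scaler.transform(X_test_demo)

# The same result computed by hand from the training statistics.
manual = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)
print(np.allclose(X_test_scaled, manual))
```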

In [44]:
X_train

array([[ 1.        , -1.        ,  2.64575131, -0.77459667,  0.26306757,
         0.12381479],
       [-1.        ,  1.        , -0.37796447, -0.77459667, -0.25350148,
         0.46175632],
       [ 1.        , -1.        , -0.37796447,  1.29099445, -1.97539832,
        -1.53093341],
       [ 1.        , -1.        , -0.37796447,  1.29099445,  0.05261351,
        -1.11141978],
       [-1.        ,  1.        , -0.37796447, -0.77459667,  1.64058505,
         1.7202972 ],
       [ 1.        , -1.        , -0.37796447,  1.29099445, -0.0813118 ,
        -0.16751412],
       [-1.        ,  1.        , -0.37796447, -0.77459667,  0.95182631,
         0.98614835],
       [-1.        ,  1.        , -0.37796447, -0.77459667, -0.59788085,
        -0.48214934]])

In [45]:
X_test

array([[ 1.        , -1.        ,  2.64575131, -0.77459667, -1.45882927,
        -0.90166297],
       [ 1.        , -1.        ,  2.64575131, -0.77459667,  1.98496442,
         2.13981082]])

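Do we need to apply feature scaling to the dependent variable y? Not here, since this is a classification problem and y only holds the class labels 0 and 1. In a
regression problem, where y can take a wide range of values, we may need to scale it as well.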
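
If the target were continuous, as in a regression problem, scaling it could be sketched as follows. The array `y_reg` is a hypothetical target (salaries), and whether scaling y helps depends on the model.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical regression target, e.g. salaries.
y_reg = np.array([72000.0, 48000.0, 54000.0, 61000.0, 58000.0])

sc_y = StandardScaler()
# StandardScaler expects a 2D array, so the 1D target is reshaped to a single column.
y_scaled = sc_y.fit_transform(y_reg.reshape(-1, 1)).ravel()

# inverse_transform maps scaled values (such as model predictions) back to the original units.
y_restored = sc_y.inverse_transform(y_scaled.reshape(-1, 1)).ravel()
print(np.allclose(y_restored, y_reg))
```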