# Introduction 

In this Jupyter notebook, I explain the following data preprocessing techniques using a simple example dataset:

* Handling missing data
* Encoding categorical data
* Splitting dataset
* Feature scaling

**Example Dataset:**
Assume that you are a product owner and you would like to sell your product. Since you have sold this product before, you have the details of past customers, such as their ***age***, ***salary*** and ***country***, and whether or not they ***purchased*** the product. You are required to use this data to train a model that predicts whether or not a new customer will purchase the product.

Before training the model, we need to process the data in such a way that it is suitable and efficient for training.

Please note that we only deal with the data preprocessing step in this notebook and not the model training step.

### Import Libraries

In [1]:
import os                             # for performing operating system dependent functionalities
import numpy as np                    # for performing scientific computations
import pandas as pd                   # for data manipulation and analysis

### Access the Dataset

In [2]:
dataFilePath = os.path.join(os.getcwd(), 'Data.csv')   # Data.csv is expected in the current working directory
dataset = pd.read_csv(dataFilePath)

dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


### Analyse the Dataset

We observe the following from the dataset:
* The independent and dependent values are in the same dataset.
* The ***Age*** and ***Salary*** columns have missing values (represented as 'NaN').
* The dataset contains both numbers and text i.e., both continuous and categorical data.
* The numbers in the dataset are not all on the same scale.
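These observations can also be checked programmatically. The sketch below rebuilds the example table inline (so that it does not depend on `Data.csv`) and uses pandas to count the missing values and to separate numeric columns from text columns. The variable names `missing_per_column`, `numeric_columns` and `text_columns` are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the example table inline so the snippet does not depend on Data.csv.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Number of NaN values in each column.
missing_per_column = dataset.isna().sum()

# Split the column names by kind: numeric versus everything else (text).
numeric_columns = dataset.select_dtypes(include="number").columns.tolist()
text_columns = dataset.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print("numeric:", numeric_columns)
print("text:", text_columns)
```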

Let's start by separating the independent and dependent data.
In our example, the customer's behaviour (Purchased) is governed by their Country, Age and Salary. Therefore, 'Purchased' is the dependent data and the rest are independent data.

In [3]:
X = dataset.iloc[:,:-1].values   # Independent variable --> Country, Age, Salary
y = dataset.iloc[:,-1].values    # Dependent variable   --> Purchased

# Handling Missing Data

This can be done in one of the following two ways.

***Remove the rows/columns with missing data:***
This might not be a suitable approach if the removed rows/columns contain information that is crucial for training the model or for decision making.

***Replace the missing data:***
This is the common approach. Here, the missing values are replaced with the mean/ median/ most frequent value of the respective column. For this purpose we use scikit-learn, the machine learning library.


In [4]:
from sklearn.impute import SimpleImputer

# Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
# strategy can be 'mean', 'median', 'most_frequent' or 'constant'
# SimpleImputer always imputes along columns, so there is no axis argument
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')     # imputer object
imputer = imputer.fit(X[:,1:3])                                         # compute(fit) the value to replace with
X[:,1:3] = imputer.transform(X[:,1:3])                                  # replace(transform) with the fitted value

X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

We can see that the missing values are replaced with the mean of the respective columns.
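For comparison, the two approaches described in this section can be sketched with pandas alone. The snippet below uses only the Age and Salary values from the example table. It is an illustration, not a replacement for the scikit-learn cell above; the names `dropped` and `filled` are hypothetical.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# Approach 1: drop every row that contains at least one missing value.
dropped = numeric.dropna()

# Approach 2: replace each missing value with the mean of its column.
# DataFrame.mean() skips NaN by default, so each mean uses the 9 known values.
filled = numeric.fillna(numeric.mean())

print(len(numeric), "rows before,", len(dropped), "rows after dropping")
print(filled.loc[[4, 6]])   # the two rows that originally had a missing value
```

Dropping rows discards 2 of the 10 customers here, which is a sizeable loss on such a small dataset. Filling with the mean keeps all rows and reproduces the imputed values shown in the output above.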

# Encoding Categorical Data

In our dataset we have two categorical variables -- **Country** and **Purchased**. The Country variable has three categories i.e., ***France, Spain*** and ***Germany***. The Purchased variable has two categories i.e., ***Yes*** and ***No***. Machine learning models are based on mathematical equations, and mathematical operations cannot be performed directly on textual data. Hence the need for encoding the categorical data.

We start with encoding the categorical variable Country.

In [5]:
from sklearn.preprocessing import LabelEncoder

labelencoder_X = LabelEncoder()                         # LabelEncoder object
X[:,0] = labelencoder_X.fit_transform(X[:,0])           # fit and transform the categorical variable - Country

X

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

We can see that the categorical values France, Germany and Spain are encoded as 0, 1 and 2 respectively. Now, they are suitable for use in mathematical equations. But there is a problem. When these encoded values are used in the equations, the machine learning model will interpret them as ordered, i.e., Spain(2) > Germany(1) > France(0). We know that this is not the case: Spain, Germany and France are just three categories with no relational order. To take care of this problem, we create dummy encodings, where each category gets its own column holding 1 for rows of that category and 0 otherwise. The number of columns in the dummy encoding is equal to the number of categories. This style of encoding is called ***OneHot encoding***.

<img src="OneHotEncoding.png" width="400px" height="300px">



In [6]:
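from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The categorical_features argument was removed in scikit-learn 0.22;
# ColumnTransformer now selects the column (index 0) that is to be onehot encoded
# remainder = 'passthrough' keeps Age and Salary, sparse_threshold = 0 returns a dense array
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder = 'passthrough', sparse_threshold = 0)
X = ct.fit_transform(X).astype(float)                            # fit and transform with one hot encodings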

X

array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
        4.80000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
        5.40000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 5.00000000e+01,
        8.30000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04]])

We can see that the Country column is replaced by three new columns, one each for France, Germany and Spain.
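The same idea can be illustrated in isolation with `pandas.get_dummies`, which builds one indicator column per category. This is a minimal sketch on the Country values only and is not the method used in the cell above.

```python
import pandas as pd

countries = pd.Series(["France", "Spain", "Germany", "Spain", "Germany",
                       "France", "Spain", "France", "Germany", "France"],
                      name="Country")

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(countries, dtype=int)

print(dummies.head(3))
```

Each row contains exactly one 1, in the column of its own country, so no artificial ordering between the countries is introduced.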

Now, let's perform encoding for the Purchased variable. 

In [7]:
labelencoder_Y = LabelEncoder()                   # LabelEncoder object
y = labelencoder_Y.fit_transform(y)               # fit and transform the categorical variable - Purchased

y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

Purchased (y) is the dependent variable, i.e., the class label that the model predicts. The encoded values 0 (No) and 1 (Yes) are therefore treated as category labels and not as magnitudes. Hence, we do not need to use OneHot encoding here.
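To see which label received which code, and to map predictions back to the original text later, the fitted `LabelEncoder` can be inspected. The following is a small sketch on the Purchased values from the example table; the `mapping` dictionary is only built for illustration.

```python
from sklearn.preprocessing import LabelEncoder

purchased = ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(purchased)

# classes_ lists the original labels; the label at position i is encoded as i.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

# inverse_transform converts the integer codes back to the original labels.
decoded = [str(label) for label in encoder.inverse_transform(encoded)]

print(mapping)
print(encoded)
print(decoded)
```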

# Splitting the Dataset

A machine learning model learns the correlations present in the dataset. If the model fits the training data too closely, including its noise, this is called ***overfitting***. An overfitted model, in spite of its high training accuracy, will fail to perform on new data with slightly different correlations. To know whether the model has actually learnt or merely memorized the dataset, we split the dataset into a training set, which is used for training the model, and a testing set, which is used for evaluating it. If the training accuracy and the testing accuracy are close, it indicates that the model generalizes well.

In [8]:
from sklearn.model_selection import train_test_split

# test_size = 0.2, indicates that 20% of the dataset will be used as test data and 80% as training data
# 'X_train, X_test, Y_train, Y_test', the order of specifying the train and test variables is very important
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

X_train

array([[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
        4.80000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04]])
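The effect of `test_size` and `random_state` can be checked on placeholder data. The sketch below uses a made-up feature matrix of 10 rows (not the notebook's `X`) and verifies the sizes of the two parts, as well as the fact that the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, plus 10 labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# First split: 20% of the 10 rows (2 rows) go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Second split with the same random_state, to check reproducibility.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)
print("same split:", np.array_equal(X_te, X_te2))
```

Fixing `random_state` is what makes the split, and therefore every later result in the notebook, repeatable between runs.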

# Feature Scaling

Often, the values in a dataset have different magnitudes, units and ranges. Some machine learning models use the Euclidean distance between data points while training. A feature with a larger magnitude contributes more to this distance than a feature with a smaller magnitude, so it may dominate the training. To avoid this, we scale/ normalize/ standardize the features to the same range.

There are different methods used for feature scaling. The most commonly used method is ***Standardization***.

$ x_{stand} = \frac{x - \text{mean}(x)}{\text{standard deviation}(x)} $

In our dataset, the salary attribute has a much higher magnitude than age, and the two attributes span different ranges. Hence, we need to standardize our dataset.
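As a worked example of the formula, the Age values of the eight training rows can be standardized by hand with NumPy. This sketch assumes the population standard deviation (`ddof=0`, the NumPy default), which is also what `StandardScaler` uses.

```python
import numpy as np

# Ages of the eight training rows; 349 / 9 is the imputed column mean (38.777...).
age_train = np.array([40, 37, 27, 349 / 9, 48, 38, 44, 35], dtype=float)

mean = age_train.mean()
std = age_train.std()          # population standard deviation (ddof=0)

# Apply the standardization formula element-wise.
age_standardized = (age_train - mean) / std

print(np.round(age_standardized, 4))
print("mean:", round(float(age_standardized.mean()), 10))
print("std: ", round(float(age_standardized.std()), 10))
```

After standardization the column has a mean of 0 and a standard deviation of 1, regardless of the unit in which it was originally measured.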

In [9]:
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()                   # StandardScaler object

# fit --> compute the mean(m) and standard deviation(sd) from the training data.
# transform --> use the computed m and sd values to scale/ standardize the training data
X_train = sc_X.fit_transform(X_train)     

# Note: the test data is only transformed but NOT fitted
# transform --> use the same m and sd values computed from training data to scale/ standardize the testing data
X_test = sc_X.transform(X_test)           

print(X_train)
print("*************************************************************")
print(X_test)

[[-1.          2.64575131 -0.77459667  0.26306757  0.12381479]
 [ 1.         -0.37796447 -0.77459667 -0.25350148  0.46175632]
 [-1.         -0.37796447  1.29099445 -1.97539832 -1.53093341]
 [-1.         -0.37796447  1.29099445  0.05261351 -1.11141978]
 [ 1.         -0.37796447 -0.77459667  1.64058505  1.7202972 ]
 [-1.         -0.37796447  1.29099445 -0.0813118  -0.16751412]
 [ 1.         -0.37796447 -0.77459667  0.95182631  0.98614835]
 [ 1.         -0.37796447 -0.77459667 -0.59788085 -0.48214934]]
*************************************************************
[[-1.          2.64575131 -0.77459667 -1.45882927 -0.90166297]
 [-1.          2.64575131 -0.77459667  1.98496442  2.13981082]]
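To make the "fit on train, transform on test" rule concrete, the sketch below refits a `StandardScaler` on the Age and Salary values of the eight training rows and applies it to the two test rows (the two Germany rows that are absent from `X_train`). It then checks that `transform` simply reuses the stored training mean (`mean_`) and standard deviation (`scale_`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary of the eight training rows (imputed values written as fractions).
train = np.array([[40, 574000 / 9], [37, 67000], [27, 48000], [349 / 9, 52000],
                  [48, 79000], [38, 61000], [44, 72000], [35, 58000]])
# Age and Salary of the two test rows.
test = np.array([[30, 54000], [50, 83000]], dtype=float)

# Statistics are learnt from the training rows only.
scaler = StandardScaler().fit(train)

# transform() applies the training mean and standard deviation to the test rows.
scaled_test = scaler.transform(test)
manual = (test - scaler.mean_) / scaler.scale_

print(scaled_test)
print("matches manual computation:", np.allclose(manual, scaled_test))
```

Because the test rows never influence `mean_` and `scale_`, no information from the test set leaks into the preprocessing. This is also why the scaled test values need not have zero mean or unit variance.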


# Summary/ Template

All the methods discussed above are consolidated below.

In [None]:
#------------------#
# Import libraries #
#------------------#
import os                             
import numpy as np                    
import pandas as pd   

#-----------------#
# Access the data #
#-----------------#
dataFilePath = os.path.join(os.getcwd(), 'Data.csv')
dataset = pd.read_csv(dataFilePath)

X = dataset.iloc[:,:-1].values   # Independent variable 
y = dataset.iloc[:,-1].values    # Dependent variable   

#-----------------------#
# Handling Missing Data #
#-----------------------#
from sklearn.impute import SimpleImputer

# SimpleImputer replaces the Imputer class, which was removed in scikit-learn 0.22
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')     # imputer object
imputer = imputer.fit(X[:,1:3])                                         # compute(fit) the value to replace with
X[:,1:3] = imputer.transform(X[:,1:3])                                  # replace(transform) with the fitted value

#---------------------------#
# Encoding Categorical Data #
#---------------------------#
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder_X = LabelEncoder()                                  # LabelEncoder object
X[:,0] = labelencoder_X.fit_transform(X[:,0])                    # fit and transform the categorical variable

# ColumnTransformer onehot encodes column 0 (categorical_features was removed in scikit-learn 0.22)
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder = 'passthrough', sparse_threshold = 0)
X = ct.fit_transform(X).astype(float)                            # fit and transform with one hot encodings

labelencoder_Y = LabelEncoder()                                  # LabelEncoder object
y = labelencoder_Y.fit_transform(y)                              # fit and transform the categorical variable

#-----------------------#
# Splitting the Dataset #
#-----------------------#
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

#-----------------#
# Feature Scaling #
#-----------------#
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()                   # StandardScaler object
X_train = sc_X.fit_transform(X_train)     # fit and transform the training data
X_test = sc_X.transform(X_test)           # only transform the testing data
