# Data Preprocessing Tools

## Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Importing the dataset

First we import the dataset we will work on, using the pandas "read_csv" function. We then use the "iloc" indexer to separate the feature (independent) columns from the dependent column. The dependent column holds the value we want to predict from the features.

In [2]:
dataset = pd.read_csv("Data.csv")
x = dataset.iloc[:, :-1].values  # features (independent): every column except the last
y = dataset.iloc[:, -1].values  # dependent variable: the last column

In [3]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [4]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
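
To make the slicing in cell 2 concrete, here is a minimal sketch on a small hand-built DataFrame. The frame `toy` is a hypothetical stand-in for Data.csv: its three rows are copied from the output above, and its column names are assumptions, since the header of the file is not shown here.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv: three rows copied from the printed output.
# The column names are assumed, because the file header is not shown.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

# All rows, every column except the last one: the features.
toy_x = toy.iloc[:, :-1].values
# All rows, only the last column: the dependent variable.
toy_y = toy.iloc[:, -1].values

print(toy_x.shape)  # two-dimensional: rows x feature columns
print(toy_y.shape)  # one-dimensional: one label per row
```

The first index of "iloc" selects rows and the second selects columns, so `:-1` means "everything up to, but not including, the last column".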


## Taking care of missing data


Sometimes the dataset contains nan values, which mark empty cells. They may cause an error when we train our ML model. There are several ways to solve this problem. Here we will use the "mean" strategy, which replaces each nan with the mean of its column. To implement this we will use the scikit-learn library. Its SimpleImputer class, together with numpy, lets us detect the nans and replace them with the values the imputer computes.

In [5]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(x[:, 1:])  # learn the mean of each numeric column
x[:, 1:] = imputer.transform(x[:, 1:])  # replace each nan with its column mean

In [6]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
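
The two filled-in values above can be checked by hand. The following is a minimal sketch that uses only numpy, not SimpleImputer; the helper `fill_with_mean` is a hypothetical name introduced for illustration, and the two arrays are copied from the dataset printed earlier.

```python
import numpy as np

# Numeric columns copied from the printed dataset; np.nan marks the empty cells.
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

def fill_with_mean(column):
    """Return a copy of column with every nan replaced by the mean of the other entries."""
    filled = column.copy()
    filled[np.isnan(filled)] = np.nanmean(column)  # nanmean ignores the nans
    return filled

age_filled = fill_with_mean(age)
salary_filled = fill_with_mean(salary)

print(age_filled[6])     # the value that replaced the missing age
print(salary_filled[4])  # the value that replaced the missing salary
```

The printed numbers should agree with the 38.77... and 63777.77... entries in the output of cell 6.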


## Encoding categorical data

ML models generally work only with numerical values. For this reason, it is necessary to transform the categorical values of the relevant features into numerical ones. This process is called feature encoding. We will encode our categorical values with the one-hot encoding method, which uses as many bits as there are distinct categorical values. For example, suppose you have the dataset {"one", "two", "three", "one", ...}, which contains 3 distinct categorical values: "one", "two" and "three". One-hot encoding represents each value as a 3-bit binary value (because there are 3 distinct values) in which exactly one bit is set, for example "one = 001", "two = 010", "three = 100". This is the main idea behind the encoding.

### Encoding the Independent Variable

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode column 0 (the country) and pass the other columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(ct.fit_transform(x))

In [8]:
print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
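
The first three columns of this output are the one-hot code of the country. As a rough illustration of the idea, here is a sketch that builds the same kind of code by hand with numpy. It is not how OneHotEncoder is implemented; it only assumes that the categories are placed in sorted order, which is consistent with the output above.

```python
import numpy as np

# The first four country values from the dataset.
countries = np.array(["France", "Spain", "Germany", "Spain"])

# Sorted distinct categories; each one gets its own column.
categories = np.unique(countries)

# Compare every value against every category: a row holds 1.0 only
# in the column that belongs to its own category.
one_hot = (countries[:, None] == categories[None, :]).astype(float)

print(categories)
print(one_hot)
```

Each row contains exactly one 1.0, so the number of columns equals the number of distinct categories.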


### Encoding the Dependent Variable



In [9]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [10]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
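
Label encoding simply numbers the distinct labels. A minimal sketch of the same mapping with plain numpy is shown below; it assumes the labels are numbered in sorted order, which matches the output above ('No' becomes 0, 'Yes' becomes 1).

```python
import numpy as np

# The dependent column as printed in cell 4.
labels = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

# classes: the sorted distinct labels
# codes:   for each entry, the position of its label inside classes
classes, codes = np.unique(labels, return_inverse=True)

print(classes)
print(codes)

# Going back from codes to the original labels is a plain lookup.
print(classes[codes])
```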


## Splitting the dataset into the Training set and Test set

We split our dataset into a training set and a test set. Why do we do that? After training the ML model, we have to check whether it works correctly. To do that we split the dataset into 2 parts: we train the model on the training set, and after the training we check it against the test set. If the predicted values match the values in the test set, our model works correctly. I have seen an excellent explanation of why we split our data into training and test sets, so let's read it:

"To use an analogy, let’s say you teach a child to multiply by letting the kid train on the small multiplication table, i.e. everything from 1*1 to 9*9. Next, you test whether the kid is able to perform the same multiplications. The result is a success. The kid gets it right almost every time. What’s the problem here? You don’t know if the kid understands multiplication at all, or has simply memorized the table! So what you would do instead is test the kid on multiplications like 11*12, that are outside of the table. This is exactly why we need to test machine learning models on unseen data. Otherwise, we have no way of knowing whether the algorithm has learned a generalizable pattern or has simply memorized the training data."

I think that explains everything. There are several ways to do the split; we will use random sampling.

Before taking the next step, note that the order of splitting and feature scaling often causes confusion. We always split first and apply feature scaling afterwards, if it is necessary. If we do not follow this order, information from the test set can leak into the training step (data leakage).

In [11]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)


In [12]:
print(x_train)  

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [13]:
print(x_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [14]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [15]:
print(y_test)

[0 1]
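
To show what random sampling means here, the sketch below splits ten rows by shuffling their indices with numpy. The function `random_split` is a hypothetical helper written for illustration, not the implementation of train_test_split, and it will generally not pick the same rows as cell 11, because the random number generators differ.

```python
import numpy as np

def random_split(x, y, test_size=0.2, seed=0):
    """Shuffle the row indices and cut them into a test part and a training part."""
    rng = np.random.default_rng(seed)
    n = len(x)
    n_test = int(np.ceil(n * test_size))  # 2 rows out of 10 for test_size=0.2
    idx = rng.permutation(n)              # a random order of 0 .. n-1
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Dummy data with one recognisable value per row.
demo_x = np.arange(10).reshape(10, 1)
demo_y = np.arange(10)

dx_train, dx_test, dy_train, dy_test = random_split(demo_x, demo_y)

print(dx_train.shape, dx_test.shape)
print(sorted(dy_train.tolist() + dy_test.tolist()))  # every row appears exactly once
```

Because each row lands in exactly one of the two parts, the test rows stay unseen during training.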


## Feature Scaling

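Feature scaling is a technique that brings the independent features in the data onto a comparable scale. It is performed during data preprocessing to handle features whose magnitudes, values or units vary widely. If feature scaling is not done, a machine learning algorithm tends to give more weight to features with larger values and less weight to features with smaller values, regardless of their units. The StandardScaler used below subtracts each column's mean and divides by its standard deviation. It is fitted on the training set only and then applied to the test set, and only the numeric columns (index 3 onwards) are scaled, so the one-hot columns stay as they are.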

In [16]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit on the training set only, then reuse the same statistics for the test set
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])
x_test[:, 3:] = sc.transform(x_test[:, 3:])

In [17]:
print(x_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [18]:
print(x_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
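
The scaled values above can be reproduced by hand, which also shows why the test set must reuse the statistics of the training set. The sketch below uses only numpy. The numbers are copied from the output of cells 12 and 13, with the two imputed entries written as 349 / 9 and 574000 / 9 (the column means, 38.77... and 63777.77...).

```python
import numpy as np

imputed_age = 349 / 9        # 38.77..., the mean that replaced the missing age
imputed_salary = 574000 / 9  # 63777.77..., the mean that replaced the missing salary

# Last two columns of x_train and x_test before scaling.
train = np.array([
    [imputed_age, 52000.0],
    [40.0, imputed_salary],
    [44.0, 72000.0],
    [38.0, 61000.0],
    [27.0, 48000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [35.0, 58000.0],
])
test = np.array([[30.0, 54000.0], [37.0, 67000.0]])

# The statistics come from the training rows only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # the test rows reuse the training statistics

print(train_scaled[0])
print(test_scaled)
```

If the mean and standard deviation were computed on all ten rows instead, the two test rows would influence how the training data is scaled, which is the leakage the split-then-scale order avoids.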
