# Data Preprocessing

---

## Importing libraries

In [2]:
import numpy as np # Numerical computing
import matplotlib.pyplot as plt # Plotting
import pandas as pd # Data analysis


---

## Import the dataset

In [3]:
dataset = pd.read_csv('0100-Data/Data.csv')

In [4]:
dataset

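| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |
| 5 | France | 35.0 | 58000.0 | Yes |
| 6 | Spain | NaN | 52000.0 | No |
| 7 | France | 48.0 | 79000.0 | Yes |
| 8 | Germany | 50.0 | 83000.0 | No |
| 9 | France | 37.0 | 67000.0 | Yes |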


A dataset used to train a model has two parts: the features and the dependent variable vector. The features are the ***parameters*** that you use to predict the ***dependent*** variable vector.

The features should be separated from the dependent variable vector.

---

## Splitting the dataset

To take separate sections of the dataset we use `iloc`, which takes the integer indexes of the rows and columns.

It works similarly to slicing a Python list, but you pass two ranges in the format `[row1:row2, col1:col2]`. As with Python lists, the lower bound is included and the upper bound is not.

Appending `.values` returns the selected data as a NumPy array instead of a pandas object.
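The slicing rules above can be checked on a small frame. This is a minimal sketch: the toy `df` is hypothetical and is not the `Data.csv` used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame (not the Data.csv used in this notebook).
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Purchased": ["No", "Yes", "No"],
})

# Rows 0 and 1 (row 2 is excluded), columns 0 and 1 (column 2 is excluded).
subset = df.iloc[0:2, 0:2]

# All rows, every column except the last one: the feature matrix.
features = df.iloc[:, :-1].values

# All rows, only the last column: the dependent variable vector.
target = df.iloc[:, -1].values

print(subset.shape)    # (2, 2)
print(features.shape)  # (3, 2)
print(target.shape)    # (3,)
```

A negative index counts from the end, so `:-1` means "everything except the last column" regardless of how many feature columns the dataset has.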

In [5]:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [6]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [7]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)


---

## Taking care of missing data

You don't want missing data in your dataset: it can cause errors or degrade your model. There are different approaches to handling it.

1. **Ignore the missing data.** You can simply delete the incomplete rows, but only when the dataset is large and few values are missing.

2. **Replace the missing data.** You can replace each missing value with the average of the other values in the same column.
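Both approaches can be sketched directly in pandas before reaching for scikit-learn. The toy frame below is hypothetical, and `dropna` / `fillna` are used here only for illustration; the notebook itself uses `SimpleImputer`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

# Approach 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Approach 2: replace each missing value with the mean of its column
# (the mean is computed over the non-missing entries only).
filled = df.fillna(df.mean())

print(len(dropped))               # number of complete rows
print(filled.isna().sum().sum())  # number of missing values left
```

Note that dropping rows halves this tiny dataset, which is why deletion is only reasonable when missing values are rare.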

In [8]:
from sklearn.impute import SimpleImputer

We use a `SimpleImputer` from `sklearn.impute`, configured here with two parameters:

- **missing_values**: the placeholder that marks a value as missing and will be replaced (here `np.nan`).

- **strategy**: the way the values are going to be replaced, in this case using the column mean.
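The mean is not the only option for `strategy`. As a hedged illustration on a hypothetical column (not this notebook's data), the sketch below compares `'mean'`, `'median'`, and `'constant'`; the outlier value of 100 is chosen to show how much the mean can be pulled away from the typical value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single column with one missing entry and one outlier.
column = np.array([[1.0], [2.0], [np.nan], [3.0], [100.0]])

filled = {}
for strategy in ("mean", "median"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Index [2, 0] is the position of the missing entry.
    filled[strategy] = imputer.fit_transform(column)[2, 0]

# 'constant' replaces missing entries with the given fill_value.
constant_imputer = SimpleImputer(
    missing_values=np.nan, strategy="constant", fill_value=0.0
)
filled["constant"] = constant_imputer.fit_transform(column)[2, 0]

print(filled)
```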

In [9]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

To connect the imputer to the data, we use the `fit` method, which *fits* the imputer on the provided columns by computing the mean of each one. Only the numeric columns (Age and Salary) are passed.

In [10]:
imputer.fit(X[:, 1:3])

Call `transform` to do the replacement.

In [11]:
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [12]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
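To see where the two replacement values come from, the sketch below refits an imputer on the Age and Salary columns copied from the table above and inspects its `statistics_` attribute. The variable name `numeric` is introduced here only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns copied from the dataset shown above.
numeric = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(numeric)

# statistics_ holds the per-column fill value learned during fit.
print(imputer.statistics_)

# np.nanmean ignores NaN entries, so it gives the same per-column means.
print(np.nanmean(numeric, axis=0))
```

The two printed arrays should agree with the values 38.78 (Age) and 63777.78 (Salary) that appear in the imputed rows above.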


---

## Encoding categorical data

A model cannot directly work out how a word or a category affects a prediction, so categorical data has to be transformed into numbers.

We cannot simply assign arbitrary numbers, because the model might interpret them as an order or a magnitude that does not exist. Instead, we split a column with ***n*** different possible values into ***n*** columns, each holding a true/false value. This is called one-hot encoding.
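The idea can be sketched by hand with NumPy before using scikit-learn. This is only an illustration of the technique on a hypothetical array; it is not how `OneHotEncoder` is implemented.

```python
import numpy as np

# Hypothetical category column.
countries = np.array(["France", "Spain", "Germany", "Spain"])

# categories is sorted; codes[i] is the position of countries[i] in it.
categories, codes = np.unique(countries, return_inverse=True)

# Row k of the identity matrix is the one-hot vector for category k,
# so indexing the identity matrix with the codes builds the encoding.
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot)
```

Each row contains exactly one 1, in the column that corresponds to that row's category.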

### Encoding the independent variable


In [13]:
from sklearn.compose import ColumnTransformer

In [14]:
from sklearn.preprocessing import OneHotEncoder

The column transformer needs a list of **transformers**, each one a tuple (the **name**, the **transformer**, and the **columns** it applies to), and the **remainder**, in this case **'passthrough'**, which means that the columns that were not transformed are kept.

In [15]:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

In [16]:
X = ct.fit_transform(X) # np.array

In [17]:
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
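The first three columns of the output correspond to the countries, but the array itself does not say which column is which. One way to check, sketched here on a three-row copy of the data with the hypothetical names `X_demo` and `ct_demo`, is to read the fitted encoder's `categories_` attribute.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First three rows of the feature matrix, kept as an object array.
X_demo = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
encoded = ct_demo.fit_transform(X_demo)

# categories_ lists the categories in the order of the generated columns.
order = ct_demo.named_transformers_["encoder"].categories_[0]
print(order)
print(encoded)
```

The encoded columns come first, followed by the passthrough columns in their original order.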

### Encoding the dependent variable

In [18]:
from sklearn.preprocessing import LabelEncoder

In [19]:
le = LabelEncoder() # Doesn't require any parameters

In [20]:
y = le.fit_transform(y)

In [21]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
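The mapping between labels and integers is stored on the fitted encoder. A minimal sketch, using a shortened hypothetical label vector, shows how to inspect it and how to convert predictions back to the original labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical, shortened label vector.
labels = np.array(["No", "Yes", "No", "No", "Yes"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_[k] is the original label that was mapped to integer k.
print(encoder.classes_)

# inverse_transform maps the integers back to the original labels.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```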


---

## Splitting the dataset into Training set and Test set

We need a training set to train the model on existing observations and a test set to evaluate how the model performs on observations it has not seen.


In [22]:
from sklearn.model_selection import train_test_split

> **Recommended**: a train/test split of 80/20 (`test_size=0.2`).

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1) # random_state fixes the seed so the split is reproducible

In [24]:
X_train

array([[0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 35.0, 58000.0]], dtype=object)

In [25]:
X_test

array([[0.0, 1.0, 0.0, 30.0, 54000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)

In [26]:
y_train

array([0, 1, 0, 0, 1, 1, 0, 1])

In [27]:
y_test

array([0, 1])
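Two properties of the split are worth checking: the sizes follow `test_size`, and the same `random_state` always produces the same split. The sketch below uses hypothetical arrays (`X_demo`, `y_demo`) built so that each feature row can be traced back to its label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: row i of X_demo is [2*i, 2*i + 1], and its label is i.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# The same seed yields the same split on a second call.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
print(np.array_equal(X_te, X_te_again))  # True

# Features and labels are shuffled together, so rows stay aligned.
print(np.array_equal(X_te[:, 0] // 2, y_te))  # True
```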


---

## Feature scaling

Feature scaling consists of scaling the features so that all of them take values on the same scale. This prevents features with large values from dominating the others.

**Not all machine learning models need this**

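Although feature scaling could be done before splitting the dataset, it should be done after the split, because the test set is supposed to be a completely new dataset. Computing the scaling statistics on the full dataset would leak information from the test set into training.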

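**Standardization or Normalization**

Standardization subtracts the column mean and divides by the column standard deviation. Normalization (min-max scaling) rescales each feature to the range [0, 1]. Normalization is recommended when you have a normal distribution in all of your features. Standardization works well in most cases.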

You don't have to apply feature scaling to dummy variables.
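The two scaling methods can be written out by hand and compared with scikit-learn. This is a sketch on a hypothetical `ages` column; `MinMaxScaler` is used here as the normalization counterpart of `StandardScaler` and is not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature column.
ages = np.array([[27.0], [35.0], [44.0], [50.0]])

# Standardization: subtract the mean, divide by the standard deviation.
standardized_manual = (ages - ages.mean()) / ages.std()

# Normalization (min-max): rescale to the range [0, 1].
normalized_manual = (ages - ages.min()) / (ages.max() - ages.min())

standardized = StandardScaler().fit_transform(ages)
normalized = MinMaxScaler().fit_transform(ages)

print(np.allclose(standardized, standardized_manual))  # True
print(np.allclose(normalized, normalized_manual))      # True
```

`np.std` defaults to the population standard deviation, which matches what `StandardScaler` computes.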


In [28]:
from sklearn.preprocessing import StandardScaler

In [29]:
sc = StandardScaler()

In [30]:
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])

In [31]:
X_train

array([[0.0, 0.0, 1.0, -0.19159184384578545, -1.0781259408412425],
       [0.0, 1.0, 0.0, -0.014117293757057777, -0.07013167641635372],
       [1.0, 0.0, 0.0, 0.566708506533324, 0.633562432710455],
       [0.0, 0.0, 1.0, -0.30453019390224867, -0.30786617274297867],
       [0.0, 0.0, 1.0, -1.9018011447007988, -1.420463615551582],
       [1.0, 0.0, 0.0, 1.1475343068237058, 1.232653363453549],
       [0.0, 1.0, 0.0, 1.4379472069688968, 1.5749910381638885],
       [1.0, 0.0, 0.0, -0.7401495441200351, -0.5646194287757332]],
      dtype=object)

We need to scale the test set with the same scaler that was fitted on the training set, so we use `transform` instead of `fit_transform`.

In [32]:
X_test[:, 3:] = sc.transform(X_test[:, 3:])

In [33]:
X_test

array([[0.0, 1.0, 0.0, -1.4661817944830124, -0.9069571034860727],
       [1.0, 0.0, 0.0, -0.44973664397484414, 0.2056403393225306]],
      dtype=object)
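The difference between `fit_transform` and `transform` can be made explicit by checking which statistics the scaler stores. In this sketch the `train` and `test` arrays are hypothetical single-column examples.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test columns.
train = np.array([[27.0], [38.0], [44.0], [48.0]])
test = np.array([[30.0], [37.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from train only.
train_scaled = scaler.fit_transform(train)

# transform reuses those training statistics; nothing is re-learned.
test_scaled = scaler.transform(test)

print(scaler.mean_)         # mean of the training column
print(test_scaled.ravel())  # test values expressed on the training scale

# The same result computed by hand from the training statistics.
expected = (test - train.mean()) / train.std()
print(np.allclose(test_scaled, expected))  # True
```

The scaled training column has mean 0, while the scaled test column generally does not, which is expected because the test set did not contribute to the statistics.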