# Preprocessing Data

Sklearn can only handle numerical data, so categorical data must either be dropped or converted to numerical data.

One technique is to split the categorical variable into binary dummy variables, creating one column per category:

- `0` means that the observation was not in that category
- `1` means that the observation was in that category

![Data Preprocessing](../imgs/data-preprocessing-1.png)

In this particular example, if a 'car' is not from 'Asia' and not from the 'US', then implicitly it must be from 'Europe'. This means that we do not need all three features and can drop one. In this case we've dropped the 'Europe' column.

![Data Preprocessing](../imgs/data-preprocessing-2.png)

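If we keep all three columns, we duplicate information, which can be a problem for some models. For example, in an unregularized linear regression with an intercept, perfectly collinear features mean the coefficients are no longer uniquely determined.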
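The redundancy can be checked directly. The sketch below uses a small made-up `origin` series (not the car data set) to show that the `Europe` column can be rebuilt from the other two, and that a matrix holding an intercept column plus all three dummies does not have full column rank.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen only for illustration
origin = pd.Series(["US", "US", "Asia", "US", "Europe"], name="origin")
dummies = pd.get_dummies(origin, dtype=int)  # columns: Asia, Europe, US

# Every row has exactly one 1, so any column equals 1 minus the others
europe_rebuilt = 1 - dummies["Asia"] - dummies["US"]
print((europe_rebuilt == dummies["Europe"]).all())

# Intercept + three dummies: 4 columns, but one is a linear combination
with_intercept = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
rank = np.linalg.matrix_rank(with_intercept)
print(with_intercept.shape[1], rank)
```

Dropping any one of the three dummy columns restores full column rank.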

Both Pandas (`get_dummies()`) and Sklearn (`OneHotEncoder`) provide ways to convert categorical data into numerical dummy variables.
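For comparison with the Pandas approach, here is a minimal sketch of the Sklearn route. The toy DataFrame is made up for illustration; `drop="first"` asks the encoder to omit the first category (in sorted order) of each feature.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not the car data set
toy = pd.DataFrame({"origin": ["US", "US", "Asia", "US", "Europe"]})

# drop="first" leaves out the first category of each feature
encoder = OneHotEncoder(drop="first")

# fit_transform returns a sparse matrix by default; convert it for display
encoded = encoder.fit_transform(toy[["origin"]]).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # one column each for the remaining categories
```

Unlike `get_dummies()`, a fitted encoder remembers the categories it saw, so the same encoding can later be applied to new data with `encoder.transform()`.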

### Create dummy variables with pandas

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# target is 'mpg', categorical data is 'origin'
df = pd.read_csv('../data/auto.csv')
df.head()

Unnamed: 0,mpg,displ,hp,weight,accel,origin,size
0,18.0,250.0,88,3139,14.5,US,15.0
1,9.0,304.0,193,4732,18.5,US,20.0
2,36.1,91.0,60,1800,16.4,Asia,10.0
3,18.5,250.0,98,3525,19.0,US,15.0
4,34.3,97.0,78,2188,15.8,Europe,10.0


In [4]:
# encode 'origin' using dummy variables
df_encoded = pd.get_dummies(df)
df_encoded.head()

Unnamed: 0,mpg,displ,hp,weight,accel,size,origin_Asia,origin_Europe,origin_US
0,18.0,250.0,88,3139,14.5,15.0,0,0,1
1,9.0,304.0,193,4732,18.5,20.0,0,0,1
2,36.1,91.0,60,1800,16.4,10.0,1,0,0
3,18.5,250.0,98,3525,19.0,15.0,0,0,1
4,34.3,97.0,78,2188,15.8,10.0,0,1,0


`get_dummies()` creates three new features to encode the `origin` feature and drops the original feature. Each new feature follows the naming convention 'original-feature-name_category-value'.
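When a DataFrame has several text columns, it can help to name the columns to encode explicitly. The sketch below uses a made-up two-column frame and the `columns`, `prefix`, and `dtype` parameters of `get_dummies()`; the prefix `from` is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the car data set
toy = pd.DataFrame({
    "mpg": [18.0, 36.1, 34.3],
    "origin": ["US", "Asia", "Europe"],
})

# Encode only 'origin'; dtype=int requests 0/1 integer columns
encoded = pd.get_dummies(toy, columns=["origin"], dtype=int)
print(list(encoded.columns))

# The part before the underscore can be changed with `prefix`
renamed = pd.get_dummies(toy, columns=["origin"], prefix="from", dtype=int)
print(list(renamed.columns))
```

Passing `dtype=int` is worth considering because recent pandas versions return boolean dummy columns by default rather than `0`/`1` integers.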

We can drop the first dummy column, `origin_Asia`, since it can be derived from the other two.

In [5]:
df_encoded.drop('origin_Asia', axis=1, inplace=True)
df_encoded.head()

Unnamed: 0,mpg,displ,hp,weight,accel,size,origin_Europe,origin_US
0,18.0,250.0,88,3139,14.5,15.0,0,1
1,9.0,304.0,193,4732,18.5,20.0,0,1
2,36.1,91.0,60,1800,16.4,10.0,0,0
3,18.5,250.0,98,3525,19.0,15.0,0,1
4,34.3,97.0,78,2188,15.8,10.0,1,0


Alternatively, we can drop the first column while encoding the categorical data:

In [8]:
df = pd.get_dummies(df, drop_first=True)
df.head()

Unnamed: 0,mpg,displ,hp,weight,accel,size,origin_Europe,origin_US
0,18.0,250.0,88,3139,14.5,15.0,0,1
1,9.0,304.0,193,4732,18.5,20.0,0,1
2,36.1,91.0,60,1800,16.4,10.0,0,0
3,18.5,250.0,98,3525,19.0,15.0,0,1
4,34.3,97.0,78,2188,15.8,10.0,1,0


We can then fit a model as before. Here we'll fit a **linear regression model** with **Ridge** (L2) regularization.

In [9]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X = df.drop('mpg', axis=1).values
y = df.mpg.values

print(type(X), X.shape)
print(type(y), y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

<class 'numpy.ndarray'> (392, 7)
<class 'numpy.ndarray'> (392,)


In [10]:
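# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2.
# On newer versions, drop the argument and scale the features beforehand.
ridge = Ridge(alpha=0.5, normalize=True)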
ridge.fit(X_train, y_train)

ridge.score(X_test, y_test)

0.7190645190217895
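On recent scikit-learn versions, where `Ridge` no longer accepts `normalize`, one possible substitute is to scale the features in a pipeline. The sketch below runs on synthetic data (the `X_demo` and `y_demo` names are made up), so it does not reproduce the score above. `StandardScaler` is also not numerically identical to the old `normalize=True` behaviour, so the same `alpha` regularizes with a different strength.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the car data: three features on different scales
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3)) * [1.0, 10.0, 100.0]
y_demo = X_demo @ np.array([2.0, -0.1, 0.005]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The scaler is fitted on the training split only, then reused for scoring
model = make_pipeline(StandardScaler(), Ridge(alpha=0.5))
model.fit(X_tr, y_tr)

r2 = model.score(X_te, y_te)  # R^2 on the held-out split
print(r2)
```

Keeping the scaler inside the pipeline means the test split never influences the scaling statistics.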