# Preprocessing - Encoding data

Most machine learning algorithms only work with numeric data, so if your dataset contains, for example, categorical data, you'll need to encode it into a numeric format. There are multiple approaches to doing this, which I'll show in this notebook.

First we load some example data:

In [1]:
import pandas

df = pandas.read_csv('houses.csv')

df.head()

Unnamed: 0,Address,Price,LivingArea,Rooms,LotArea,Type
0,Adriaan van Bergenstraat 47,250000,71,4,92,Row
1,Jan van Zutphenlaan 56,209500,98,5,123,Row
2,Prinses irenelaan 126,349500,128,6,114,Row
3,Hubert Duyfhuysstraat 36,250000,86,4,98,Row
4,Prinses ireneplateau 125,419000,173,6,99,Row


In [2]:
df['Type'].describe()

count      42
unique      3
top       Row
freq       36
Name: Type, dtype: object

As you can see, the "Address" and "Type" columns need some attention before we can load the data into a machine learning model. We'll drop the "Address" column and deal with "Type" in the rest of this notebook.

In [3]:
df.drop('Address', axis=1, inplace=True)
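On a wider dataset it can be easier to ask pandas which columns are non-numeric than to read them off the table. A minimal sketch, using a made-up toy frame (the rows below are placeholders, not taken from houses.csv):

```python
import pandas

# Toy data, made up for illustration only.
toy = pandas.DataFrame({
    'Address': ['Example Street 1', 'Example Street 2'],
    'Price': [250000, 209500],
    'Type': ['Row', 'Corner'],
})

# Columns that are not numeric are the ones that need encoding or dropping.
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)  # ['Address', 'Type']
```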

## One-hot or dummy encoding

The first strategy for dealing with categorical data is one-hot or dummy encoding. That means creating one binary (0/1) column for each possible value of the categorical variable.

This approach works best if there is a small-ish list of values and/or there is no natural ordering of values.
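To make the idea concrete before applying it to the houses data, here is a small sketch on a made-up toy frame. The `dtype=int` argument is an addition here: recent pandas versions produce True/False columns by default, and this asks for 0/1 instead.

```python
import pandas

# Toy data, made up for illustration (not the houses.csv rows).
toy = pandas.DataFrame({
    'Rooms': [4, 5, 6],
    'Type': ['Row', 'Corner', 'Detached'],
})

# One new column per distinct value of "Type"; numeric columns pass through.
# dtype=int requests 0/1 values instead of booleans.
toy_dummies = pandas.get_dummies(toy, columns=['Type'], dtype=int)

print(toy_dummies)
```

Each row ends up with exactly one 1 across the new "Type_..." columns, which is where the name "one-hot" comes from.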

In [4]:
df_dummies = pandas.get_dummies(df)

df_dummies.head()

Unnamed: 0,Price,LivingArea,Rooms,LotArea,Type_Corner,Type_Detached,Type_Row
0,250000,71,4,92,0,0,1
1,209500,98,5,123,0,0,1
2,349500,128,6,114,0,0,1
3,250000,86,4,98,0,0,1
4,419000,173,6,99,0,0,1


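So pandas has indeed converted the "Type" column into three columns, one for each value.

Now we can train and evaluate a machine learning model for price to see if it all worked. The score is the estimator's default, which for `Ridge` is R², averaged over 5 cross-validation folds, with the standard deviation in parentheses: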

In [5]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

results = cross_val_score(Ridge(), df_dummies.drop('Price', axis=1), df_dummies['Price'], cv=5)
print("Score: %.2f (%.2f)" % (results.mean(), results.std()))

Score: 0.65 (0.21)


## Label or factor encoding

Another approach to encoding categorical data is to assign a number to each value and replace the values with their numeric counterpart.

This approach works best if there is a natural ordering and/or a linear relationship with the predicted variable, at least for linear models (tree-based models are usually much less sensitive to how the values are numbered).
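One thing to keep in mind is that `pandas.factorize` numbers values in order of first appearance, not by any meaningful ranking. The sketch below shows that on made-up toy data, next to an explicit mapping for cases where the order matters. The ranking in `order` is purely an assumption for illustration, not a claim about house types.

```python
import pandas

# Toy data, made up for illustration.
types = pandas.Series(['Row', 'Detached', 'Row', 'Corner'])

# factorize assigns codes in order of first appearance.
codes, uniques = pandas.factorize(types)
print(codes.tolist())   # [0, 1, 0, 2]
print(list(uniques))    # ['Row', 'Detached', 'Corner']

# An explicit mapping makes the ordering a deliberate choice.
# This particular ranking is hypothetical.
order = {'Row': 0, 'Corner': 1, 'Detached': 2}
mapped = types.map(order)
print(mapped.tolist())  # [0, 2, 0, 1]
```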

In [6]:
df_factor = df.copy()
df_factor['Type'], _ = pandas.factorize(df_factor['Type'])

df_factor.head()

Unnamed: 0,Price,LivingArea,Rooms,LotArea,Type
0,250000,71,4,92,0
1,209500,98,5,123,0
2,349500,128,6,114,0
3,250000,86,4,98,0
4,419000,173,6,99,0


The "Type" column is now numeric (the first five rows are all "Row", hence all 0), so we can again apply a machine learning model and see whether this approach performs similarly to the one above.

In [7]:
results = cross_val_score(Ridge(), df_factor.drop('Price', axis=1), df_factor['Price'], cv=5)
print("Score: %.2f (%.2f)" % (results.mean(), results.std()))

Score: 0.66 (0.22)


Indeed, the model works fine and performance is pretty similar in this case: 0.66 versus 0.65, a difference well within the spread across folds.

Good luck!