#### Preprocessing - Categorical Data

When your data contains categories represented as strings, you cannot use them directly to train machine learning models, which often accept only numeric data.

Instead of ignoring the categorical data and excluding that information from your model, you can transform the data into a numeric form that your models can use.


In [14]:
import pandas as pd

cars = pd.read_csv('../data/cars.csv')
display(cars.head())


Unnamed: 0,Car,Model,Volume,Weight,CO2
0,Toyota,Aygo,1000,790,99
1,Mitsubishi,Space Star,1200,1160,95
2,Skoda,Citigo,1000,929,95
3,Fiat,500,900,865,90
4,Mini,Cooper,1500,1140,105


##### One Hot Encoding

We cannot use the Car or Model columns as they are, since they are not numeric. A linear relationship between a categorical variable (Car or Model) and a numeric variable (CO2) cannot be determined.

To fix this issue, we need a numeric representation of the categorical variable. One way to do this is to create one column per group in the category, holding 1 for the rows that belong to that group and 0 for all other rows.
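To make the idea concrete, here is a minimal sketch that builds the indicator columns by hand for a small made-up `Car` column. The toy values and the name `manual` are assumptions for illustration only; in practice `pd.get_dummies` does this work for you.

```python
import pandas as pd

# Toy data (hypothetical), just to show the shape of the encoding
cars = pd.DataFrame({"Car": ["Toyota", "Skoda", "Toyota", "Fiat"]})

# One column per distinct brand: 1 where the row matches the brand, 0 otherwise
manual = pd.DataFrame({
    f"Car_{brand}": (cars["Car"] == brand).astype(int)
    for brand in sorted(cars["Car"].unique())
})

print(manual)
```

Each row ends up with exactly one 1, which is where the name "one hot" comes from.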


In [15]:
encoded_cars = pd.get_dummies(cars[['Car']])  # encode categorical column 'Car'
display(encoded_cars.head())


Unnamed: 0,Car_Audi,Car_BMW,Car_Fiat,Car_Ford,Car_Honda,Car_Hundai,Car_Hyundai,Car_Mazda,Car_Mercedes,Car_Mini,Car_Mitsubishi,Car_Opel,Car_Skoda,Car_Suzuki,Car_Toyota,Car_VW,Car_Volvo
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
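Note that the output has both a `Car_Hundai` and a `Car_Hyundai` column: every distinct string gets its own column, so a misspelled label is treated as a separate category. The sketch below shows one possible way to merge such labels before encoding. The toy data and the `fixes` mapping are assumptions; whether two labels really mean the same brand is something you need to check against your own data.

```python
import pandas as pd

# Toy data (hypothetical) containing one misspelled brand label
cars = pd.DataFrame({"Car": ["Hyundai", "Hundai", "Volvo", "Hyundai"]})

# Encoding the raw labels gives the misspelling its own column
raw = pd.get_dummies(cars[["Car"]], dtype=int)
print(list(raw.columns))      # ['Car_Hundai', 'Car_Hyundai', 'Car_Volvo']

# Assumed mapping from misspelled labels to the intended label
fixes = {"Hundai": "Hyundai"}
cleaned = cars.assign(Car=cars["Car"].replace(fixes))

# After cleaning, both spellings share a single column
encoded = pd.get_dummies(cleaned[["Car"]], dtype=int)
print(list(encoded.columns))  # ['Car_Hyundai', 'Car_Volvo']
```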


In [16]:
import pandas as pd
from sklearn import linear_model

# preprocess data: one-hot encode the 'Car' column
cars = pd.read_csv("../data/cars.csv")
encoded_cars = pd.get_dummies(cars[['Car']])

# create X and y datasets; X's columns are Volume, Weight, then the Car_* columns
X = pd.concat([cars[['Volume', 'Weight']], encoded_cars], axis=1)
y = cars['CO2']

# fit a linear regression on the raw values (no column names are stored)
regr = linear_model.LinearRegression()
regr.fit(X.values, y)

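# predict the CO2 emission for one new car; the row must follow X's column order
# (Volume, Weight, then the 17 Car_* columns in the order shown above).
# NOTE: as written, this row puts 2300 in the Volume slot and 1300 in the Weight slot,
# and its 1 sits in the second-to-last dummy column (Car_VW), so it does not describe
# the intended Volvo with a weight of 2300 kg and a volume of 1300 cm3.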
predictedCO2 = regr.predict(
    [[2300, 1300, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]]
)
print("predicted CO2 :", predictedCO2)


predicted CO2 : [122.45153299]
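A positional row like the one above is easy to get wrong, because every value has to land in the slot that matches X's column order (Volume, Weight, then the dummy columns). One possible way to avoid counting zeros is to build the row by column name. The sketch below inlines the first five rows of the cars data so it runs without the CSV; the helper `make_row` is a hypothetical name, and the prediction it prints comes from those five rows only, so it is not comparable to the result above.

```python
import pandas as pd
from sklearn import linear_model

# First five rows of the cars data shown earlier, inlined for a self-contained sketch
cars = pd.DataFrame({
    "Car": ["Toyota", "Mitsubishi", "Skoda", "Fiat", "Mini"],
    "Model": ["Aygo", "Space Star", "Citigo", "500", "Cooper"],
    "Volume": [1000, 1200, 1000, 900, 1500],
    "Weight": [790, 1160, 929, 865, 1140],
    "CO2": [99, 95, 95, 90, 105],
})

encoded_cars = pd.get_dummies(cars[["Car"]], dtype=int)
X = pd.concat([cars[["Volume", "Weight"]], encoded_cars], axis=1)
y = cars["CO2"]

regr = linear_model.LinearRegression()
regr.fit(X, y)  # fitting on the DataFrame keeps the column names


def make_row(volume, weight, car, columns):
    """Hypothetical helper: build a one-row frame with the same columns as X."""
    dummy = f"Car_{car}"
    if dummy not in columns:
        raise ValueError(f"unknown car brand: {car}")
    row = pd.DataFrame(0, index=[0], columns=columns)  # start with all zeros
    row["Volume"] = volume
    row["Weight"] = weight
    row[dummy] = 1  # switch on the one matching brand column
    return row


# Skoda is used here because Volvo does not appear in the five inlined rows
new_car = make_row(volume=1300, weight=2300, car="Skoda", columns=X.columns)
predicted = regr.predict(new_car)

print(new_car.to_dict("records")[0])
print("predicted CO2 :", predicted)
```

With the full dataset loaded from the CSV, the same helper could be called with `car="Volvo"`, since `Car_Volvo` is one of the encoded columns there.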
