## Categorical Data

When your data contains categories represented by strings, it is difficult to use them to train machine learning models, which often accept only numeric data.

Instead of ignoring the categorical data and excluding that information from your model, you can transform the data so it can be used in your models.

## One Hot Encoding

We cannot use the Car or Model columns directly because they are not numeric. A linear relationship between a categorical variable, such as Car or Model, and a numeric variable, CO2, cannot be determined from the raw strings.

To fix this issue, we must have a numeric representation of the categorical variable. One way to do this is to have a column representing each group in the category.

For each column, the values will be 1 or 0 where 1 represents the inclusion of the group and 0 represents the exclusion. This transformation is called one hot encoding.
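
To make the idea concrete, here is a minimal sketch that builds the indicator columns by hand. The `toy` DataFrame and its values are hypothetical and are not taken from the cars dataset used in this section.

```python
import pandas as pd

# Hypothetical toy data, not the cars dataset used in this section.
toy = pd.DataFrame({'Car': ['Volvo', 'Ford', 'Volvo', 'BMW']})

# One indicator column per group: 1 if the row belongs to the group, else 0.
manual_ohe = pd.DataFrame({
    f'Car_{group}': (toy['Car'] == group).astype(int)
    for group in sorted(toy['Car'].unique())
})

print(manual_ohe)
```

Each row contains exactly one 1, in the column that matches its original category.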

The pandas module has a function called get_dummies() that performs one hot encoding.

In [None]:
# One Hot Encode the Car column:

import pandas as pd

cars = pd.read_csv('carsdata.csv')
ohe_cars = pd.get_dummies(cars[['Car']])
ohe_cars = ohe_cars.astype(int)
print(ohe_cars.to_string())

In [14]:
import pandas
from sklearn import linear_model

cars = pandas.read_csv("carsdata.csv")
ohe_cars = pandas.get_dummies(cars[['Car']])
# Select the independent variables (X) and add the dummy variables columnwise.
X = pandas.concat([cars[['Volume', 'Weight']], ohe_cars], axis=1)
# Store the dependent variable in y.
y = cars['CO2']

regr = linear_model.LinearRegression()
regr.fit(X,y)

# Predict the CO2 emission of a Volvo. The first two values follow the column order of X
# (Volume, then Weight), so this row has a volume of 2300 cm3 and a weight of 1300 kg:
predictedCO2 = regr.predict([[2300, 1300,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0]])

print(predictedCO2)

[122.45153299]
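
Typing the dummy values positionally is easy to get wrong, because the row must match the column order of X exactly. One possible alternative, sketched below, is to name the values and let pandas align them to the training columns. The `train_columns` list here is a shortened, hypothetical stand-in for `X.columns`, and the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical, shortened stand-in for X.columns from the cell above.
train_columns = ['Volume', 'Weight', 'Car_BMW', 'Car_Ford', 'Car_Volvo']

# Describe the new observation by column name instead of by position.
new_car = pd.DataFrame([{'Volume': 2300, 'Weight': 1300, 'Car_Volvo': 1}])

# Align to the training columns; dummy columns that were not named become 0.
new_car = new_car.reindex(columns=train_columns, fill_value=0)

print(new_car.to_string(index=False))
```

With the real data, `train_columns` would be replaced by `X.columns`, and the resulting one-row DataFrame could be passed to `regr.predict`.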




## Dummifying

It is not necessary to create one column for each group in your category. The information can be retained using one column fewer than the number of groups you have.

In [15]:
# For example, you have a column representing colors and in that column, you have two colors, red and blue.
# You can create 1 column called red where 1 represents red and 0 represents not red, which means it is blue.

# We can use one hot encoding, get_dummies, and then drop one of the columns. 
# There is an argument, drop_first, which allows us to exclude the first column from the resulting table.

import pandas as pd

colors = pd.DataFrame({'color': ['blue', 'red']})
dummies = pd.get_dummies(colors, drop_first=True)

print(dummies)

   color_red
0      False
1       True
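
The output above holds booleans rather than the 1 and 0 described in the comments. If integer columns are preferred, one option is the `dtype` argument of `get_dummies`. A minimal sketch, reusing the same two-color example:

```python
import pandas as pd

colors = pd.DataFrame({'color': ['blue', 'red']})

# dtype=int asks get_dummies for 0/1 integers instead of booleans.
dummies = pd.get_dummies(colors, drop_first=True, dtype=int)

# color_red is 0 for the blue row and 1 for the red row.
print(dummies)
```

This is equivalent in spirit to calling `.astype(int)` on the result, as done for the Car column earlier.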


In [18]:
# What if you have more than 2 groups? How can the multiple groups be represented by one column fewer?
# Let's say we have three colors this time: red, blue and green.
# When we call get_dummies while dropping the first column, we get the following table.
# A row where every dummy column is False belongs to the dropped group, which is blue here.

import pandas as pd

colors = pd.DataFrame({'color': ['blue', 'red', 'green']})
dummies = pd.get_dummies(colors, drop_first=True)
dummies['color'] = colors['color']

print(dummies)

   color_green  color_red  color
0        False      False   blue
1        False       True    red
2         True      False  green
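
To check that no information is lost, the original labels can be reconstructed from the dummy columns alone. The sketch below is one way to do this; it assumes the dropped group is known to be blue, which holds here because `get_dummies` orders the groups alphabetically before dropping the first one.

```python
import pandas as pd

colors = pd.DataFrame({'color': ['blue', 'red', 'green']})
dummies = pd.get_dummies(colors, drop_first=True, dtype=int)

# A row with all zeros belongs to the dropped (baseline) group.
is_baseline = dummies.sum(axis=1) == 0

# Take the name of the column holding the 1 and strip the 'color_' prefix.
recovered = dummies.idxmax(axis=1).str.replace('color_', '', regex=False)

# idxmax returns the first column for all-zero rows, so overwrite those rows.
recovered[is_baseline] = 'blue'

print(recovered.tolist())
```

The recovered list matches the original `color` column, so the two dummy columns carry the same information as three would.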
