## Categorical encoding

We often deal with categorical data, which most algorithms cannot interpret directly because the values are stored as text.
In addition, long text labels that repeat many times take up unnecessary memory in our dataset, so converting them to a dedicated categorical type is usually worthwhile.

In [1]:
import pandas as pd

### Raw Material Characterization

In this dataset we have a few numerical features describing the characteristics of our material. Next to that, we also have an Outcome feature that describes the state of the material as a category.

Let's have a look at the data.

In [2]:
raw_material_df = pd.read_csv('./data/raw-material-characterization.csv')
raw_material_df.head()

Unnamed: 0,Lot number,Outcome,Size5,Size10,Size15,TGA,DSC,TMA
0,B370,Adequate,13.8,9.2,41.2,787.3,18.0,65.0
1,B880,Adequate,11.2,5.8,27.6,772.2,17.7,68.8
2,B452,Adequate,9.9,5.8,28.3,602.3,18.3,50.7
3,B287,Adequate,10.4,4.0,24.7,677.9,17.7,56.5
4,B576,Adequate,12.3,9.3,22.0,593.5,19.5,52.0


So we can see that the outcome is indeed a text field with a human-interpretable value.
The different values are:

In [3]:
raw_material_df.Outcome.unique()

array(['Adequate', 'Poor'], dtype=object)

Imagine that we would like to get all records where the Outcome is less than adequate. With plain strings this does not work as intended: strings are compared alphabetically, so the computer has no notion that Poor ranks below Adequate.

In [4]:
raw_material_df[raw_material_df.Outcome<'Adequate']

Unnamed: 0,Lot number,Outcome,Size5,Size10,Size15,TGA,DSC,TMA
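The empty result follows from how strings are compared. As a minimal sketch (the small series below is made up for illustration and is not the real dataset), the comparison is purely alphabetical:

```python
import pandas as pd

# Plain Python strings are compared character by character,
# so 'Poor' is not "less than" 'Adequate': 'P' sorts after 'A'.
poor_is_lower = 'Poor' < 'Adequate'
print(poor_is_lower)

# The same alphabetical rule applies element-wise in an object column.
outcome = pd.Series(['Adequate', 'Poor', 'Adequate'], dtype=object)
mask = outcome < 'Adequate'
print(mask.tolist())
```

No value sorts before 'Adequate' alphabetically, so the mask is False everywhere and the filtered dataframe is empty.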


To overcome this, we can change the type of the feature from 'object' (string) to 'category'. First, let us look at the data types of our data.

In [5]:
raw_material_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Lot number  24 non-null     object 
 1   Outcome     24 non-null     object 
 2   Size5       24 non-null     float64
 3   Size10      24 non-null     float64
 4   Size15      24 non-null     float64
 5   TGA         24 non-null     float64
 6   DSC         24 non-null     float64
 7   TMA         24 non-null     float64
dtypes: float64(6), object(2)
memory usage: 1.6+ KB


Now we can change the type of Outcome to category using the astype method.

In [6]:
raw_material_df.Outcome = raw_material_df.Outcome.astype('category')
raw_material_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Lot number  24 non-null     object  
 1   Outcome     24 non-null     category
 2   Size5       24 non-null     float64 
 3   Size10      24 non-null     float64 
 4   Size15      24 non-null     float64 
 5   TGA         24 non-null     float64 
 6   DSC         24 non-null     float64 
 7   TMA         24 non-null     float64 
dtypes: category(1), float64(6), object(1)
memory usage: 1.6+ KB
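The summary from `info()` reports the same memory usage before and after, partly because the `+` indicates that object columns are only estimated. To see the saving, `memory_usage(deep=True)` can be used. The sketch below uses a made-up column with many repeated labels rather than the 24-row dataset, since the effect is easier to see on longer columns:

```python
import pandas as pd

# Hypothetical column: two labels repeated many times.
outcome = pd.Series(['Adequate', 'Poor'] * 5000)

# deep=True also counts the memory held by the strings themselves.
as_object = outcome.memory_usage(deep=True)
as_category = outcome.astype('category').memory_usage(deep=True)

print(f"text column:        {as_object} bytes")
print(f"categorical column: {as_category} bytes")
```

A categorical column stores each distinct label once plus a small integer code per row, which is why it ends up smaller here.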


Our feature is now categorical, but we still have to declare that it is an ordinal category and define its order.

In [7]:
raw_material_df.Outcome = raw_material_df.Outcome.cat.as_ordered().cat.reorder_categories(['Poor', 'Adequate'])

If we retry selecting only the records where the Outcome is less than Adequate, we now get a result!
Since we only have 2 categories the example is a bit limited, but you should get the idea behind it.

In [8]:
raw_material_df[raw_material_df.Outcome<'Adequate']

Unnamed: 0,Lot number,Outcome,Size5,Size10,Size15,TGA,DSC,TMA
5,B914,Poor,13.7,7.8,27.0,597.9,18.1,49.8
6,B404,Poor,15.5,10.7,34.3,668.5,19.6,55.7
7,B694,Poor,15.4,10.7,35.9,602.8,19.2,53.6
8,B875,Poor,14.9,11.3,41.0,614.6,18.5,50.0
10,B517,Poor,16.1,11.6,39.2,682.8,17.5,56.4
13,B430,Poor,12.9,9.7,36.3,642.4,19.1,55.0
21,B745,Poor,10.2,5.8,24.7,575.9,18.5,46.2
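To make the ordering easier to appreciate, here is a small sketch with three levels. Note that 'Excellent' is a hypothetical third level that does not occur in our dataset, and the values are made up:

```python
import pandas as pd

# Ordered categorical type: Poor < Adequate < Excellent.
# 'Excellent' is a hypothetical level, added only for illustration.
quality = pd.CategoricalDtype(['Poor', 'Adequate', 'Excellent'], ordered=True)
grades = pd.Series(['Adequate', 'Poor', 'Excellent', 'Adequate'], dtype=quality)

# Comparisons now follow the declared order, not the alphabet.
below_adequate = grades[grades < 'Adequate']
at_least_adequate = grades[grades >= 'Adequate']

print(below_adequate.tolist())
print(at_least_adequate.tolist())
```

Declaring the order up front with `pd.CategoricalDtype` is an alternative to chaining `as_ordered()` and `reorder_categories()`; both give an ordered categorical.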


Let's take this a step further. Since algorithms still have no idea what the numerical relation between Adequate and Poor is, we could use a Label Encoder for that.

In [9]:
from sklearn.preprocessing import LabelEncoder

The label encoder takes the Outcome feature as input and recognises 2 classes; while fitting, it assigns a numerical value to each.

In [10]:
le = LabelEncoder()
le.fit(raw_material_df.Outcome)

LabelEncoder()

After fitting we can use this encoder to transform our dataset!

In [11]:
raw_material_df['outcome_label'] = le.transform(raw_material_df.Outcome)
raw_material_df.head()

Unnamed: 0,Lot number,Outcome,Size5,Size10,Size15,TGA,DSC,TMA,outcome_label
0,B370,Adequate,13.8,9.2,41.2,787.3,18.0,65.0,0
1,B880,Adequate,11.2,5.8,27.6,772.2,17.7,68.8,0
2,B452,Adequate,9.9,5.8,28.3,602.3,18.3,50.7,0
3,B287,Adequate,10.4,4.0,24.7,677.9,17.7,56.5,0
4,B576,Adequate,12.3,9.3,22.0,593.5,19.5,52.0,0


Something unfortunate has happened: the encoder gave Adequate a label of 0, which is lower than the label of Poor (1). LabelEncoder sorts the classes alphabetically and ignores the order we defined, which is a problem if we want to use the label as a score.
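The numbering can be inspected through the `classes_` attribute of a fitted encoder: the position of each class in that array is its label. A minimal sketch on a made-up list of values:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['Poor', 'Adequate', 'Poor'])

# classes_ holds the learned classes in sorted order;
# the index of a class in this array is the label it receives.
print(le.classes_)

codes = le.transform(['Poor', 'Adequate'])
print(codes.tolist())
```

Because the classes are sorted, 'Adequate' always ends up at position 0 and 'Poor' at position 1, regardless of the order in which they appear in the data.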

Pandas offers another way of mapping a label to each category, albeit less automated, as you have to know the categories in your feature.

In [12]:
raw_material_df.outcome_label = raw_material_df.Outcome.map({'Poor': 0, 'Adequate':1})
raw_material_df.head()

Unnamed: 0,Lot number,Outcome,Size5,Size10,Size15,TGA,DSC,TMA,outcome_label
0,B370,Adequate,13.8,9.2,41.2,787.3,18.0,65.0,1
1,B880,Adequate,11.2,5.8,27.6,772.2,17.7,68.8,1
2,B452,Adequate,9.9,5.8,28.3,602.3,18.3,50.7,1
3,B287,Adequate,10.4,4.0,24.7,677.9,17.7,56.5,1
4,B576,Adequate,12.3,9.3,22.0,593.5,19.5,52.0,1


Yes! This did the trick. Now we can use that outcome label to predict an outcome for future samples.
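If the feature is already an ordered categorical, its integer codes follow the category order, so a hand-written dictionary is not strictly needed. A small sketch with made-up values:

```python
import pandas as pd

outcome = pd.Series(['Adequate', 'Poor', 'Adequate'])

# Declare the order explicitly: Poor comes before Adequate.
ordered = outcome.astype(pd.CategoricalDtype(['Poor', 'Adequate'], ordered=True))

# .cat.codes returns the position of each value in the category list.
codes = ordered.cat.codes
print(codes.tolist())
```

This gives the same numbering as the dictionary mapping above, with Poor as 0 and Adequate as 1.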

### Restaurant tips

Now we are going to look at a dataset of tips. A restaurant tracked the table bills and tips for a few days of the week, whilst also noting the gender, smoking habit and time of day.
This led to a small yet very interesting dataset. Let's have a look!

In [13]:
tips_df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')
tips_df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


We can see here that we have a lot of categorical variables: gender, smoker, the day and the time.
In later sections we will see how we can aggregate on these categorical variables.
For now, however, we would like to process them for a machine learning exercise, where we need numbers, not text.
For the features smoker and day, you could argue there is a natural numbering: smoker is 1 if the person was smoking and 0 otherwise, and e.g. Sun maps to 7 as it is the seventh day of the week.
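Such a numbering can be written down as a dictionary and applied with `map`. The sketch below uses a few made-up rows and assumes that the week starts on Monday (so Thur is 4 and Sun is 7):

```python
import pandas as pd

# A few made-up rows with the same labels as the tips dataset.
tips = pd.DataFrame({
    'smoker': ['No', 'Yes', 'No'],
    'day': ['Sun', 'Sat', 'Thur'],
})

# Assumption: Monday is day 1, so Thur=4, Fri=5, Sat=6, Sun=7.
day_number = {'Thur': 4, 'Fri': 5, 'Sat': 6, 'Sun': 7}

tips['smoker_num'] = tips['smoker'].map({'No': 0, 'Yes': 1})
tips['day_num'] = tips['day'].map(day_number)
print(tips)
```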

But for gender this is different: we can't really say that women are 1 and men are 0 or vice versa (although in this binary case it might work).
The same reasoning applies to time. If we said that breakfast, lunch and dinner equal 0, 1 and 2, this would give our algorithm the wrong impression, as it would think dinner is twice lunch...

We use One Hot Encoding for this. The idea is that for each value of the feature we create a new column, which is 1 when the record has that value and 0 otherwise.
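To illustrate the idea before reaching for a library, the columns can be built by hand. This is only a sketch on a made-up series, not how the encoders below are implemented:

```python
import pandas as pd

day = pd.Series(['Sun', 'Sat', 'Thur', 'Sun'], name='day')

# One new column per distinct value: 1 where the row has that value, else 0.
one_hot = pd.DataFrame(
    {value: (day == value).astype(int) for value in sorted(day.unique())}
)
print(one_hot)
```

Every row contains exactly one 1, which is where the name "one hot" comes from.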

In [14]:
from sklearn.preprocessing import OneHotEncoder

First we create our encoder, then we give it the day column so it can learn which categories there are.

In [15]:
ohe = OneHotEncoder()
ohe.fit(tips_df[['day']])

OneHotEncoder()
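Once fitted, the encoder exposes the categories it learned through its `categories_` attribute, which tells us which output column belongs to which value. A small sketch on made-up rows:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up rows covering the four days of the tips dataset.
days = pd.DataFrame({'day': ['Sun', 'Sat', 'Thur', 'Fri']})

ohe = OneHotEncoder()
ohe.fit(days[['day']])

# One array of categories per input column, in output-column order.
learned_days = list(ohe.categories_[0])
print(learned_days)

# transform returns a sparse matrix; toarray() makes it a dense array.
encoded = ohe.transform(days[['day']]).toarray()
print(encoded.shape)
```

The categories are stored in sorted order, so the first output column corresponds to Fri and the last to Thur.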

Now we can perform the encoding. We pass in the day column and it returns a matrix with 4 columns corresponding to the 4 days in our feature.

In [16]:
ohe.transform(tips_df[['day']]).todense()

matrix([[0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 0., 1., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0., 1., 0., 0.],
        [0.,

As this is a rather mathematical approach for such a simple problem, I prefer the pandas approach, which is the get_dummies method.
The result is much easier to read, yet contains the same information.

In [17]:
pd.get_dummies(tips_df.day)

Unnamed: 0,Fri,Sat,Sun,Thur
0,0,0,1,0
1,0,0,1,0
2,0,0,1,0
3,0,0,1,0
4,0,0,1,0
...,...,...,...,...
239,0,1,0,0
240,0,1,0,0
241,0,1,0,0
242,0,1,0,0


As an exercise, you could create a script that, given a specific feature (e.g. day):
- extracts that feature
- creates dummies
- concatenates them to the dataframe

Good luck!
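Once you have tried it yourself, you can compare with one possible sketch of the three steps. The helper name `add_dummies` and the small dataframe are made up for illustration. Passing `dtype=int` asks for 0/1 columns, since recent pandas versions return booleans by default:

```python
import pandas as pd

def add_dummies(df, feature):
    """Return a copy of df with dummy columns for `feature` appended."""
    # Steps 1 and 2: extract the feature and create the dummies.
    dummies = pd.get_dummies(df[feature], prefix=feature, dtype=int)
    # Step 3: concatenate the dummies to the dataframe, column-wise.
    return pd.concat([df, dummies], axis=1)

# Made-up rows in the style of the tips dataset.
tips = pd.DataFrame({'tip': [1.01, 1.66, 3.50], 'day': ['Sun', 'Sat', 'Sun']})
encoded = add_dummies(tips, 'day')
print(encoded)
```

The `prefix` argument keeps the new column names (`day_Sat`, `day_Sun`) traceable to the original feature, which helps when several features are encoded.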