<left><img width=100% height=100% src="img/itu_logo.png"></left>

## Lecture 03: Encoding Categorical Features

### __Gül İnan__<br><br>Istanbul Technical University

## Re-visit Video Games Data

In [59]:
#import dataset
import pandas as pd
video_df = pd.read_table("datasets/video.csv", sep = ";", na_values="99", index_col=0)
video_df.head()

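| | time | freq | sex | age | home | math | work | own | grade |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | weekly | female | 19 | yes | no | 10.0 | yes | A |
| 1 | 0.0 | monthly | female | 18 | yes | yes | 0.0 | yes | C |
| 2 | 0.0 | monthly | male | 19 | yes | no | 0.0 | yes | B |
| 3 | 0.5 | monthly | female | 19 | yes | no | 0.0 | yes | B |
| 4 | 0.0 | semesterly | female | 19 | yes | yes | 0.0 | no | B |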


When we inspect the **data types of the variables** in the Video Games data set, we can see that the features `freq`, `sex`, `home`, `math`, `own`, and `grade` have the `object` data type (`dtypes`).

In [74]:
video_df.dtypes

time     float64
freq      object
sex       object
age        int64
home      object
math      object
work     float64
own       object
grade     object
dtype: object

These features are actually `categorical variables` with a `limited number of distinct levels`. Most machine learning algorithms cannot work directly with categorical features. Instead, the _categorical features first need to be converted to numerical values_.

In [61]:
#check the levels of sex
video_df.sex.value_counts()

male      53
female    38
Name: sex, dtype: int64

## Nominal Variable

When there is **no ordering between the levels of a categorical variable**, the variable is called a `nominal variable`. Nominal variables can be encoded via `one-hot encoding`.

## One-Hot Encoding

`One-hot encoding` consists of replacing a **nominal feature with k levels** by **k binary features which take value 0 or 1**, to indicate if a certain category is present in an observation. The `binary variables` are also known as `dummy variables`.

For example, from the feature `sex` with categories “male” and “female”, we can generate a `boolean feature` named “male”, which takes 1 if the observation is “male” and 0 otherwise. We can also generate another `boolean feature` named “female”, which takes 1 if the observation is “female” and 0 otherwise.

![](img/onehot.png)


The [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) transformer in scikit-learn performs one-hot encoding. 

`Important note`: Just like scaling and imputation, all methods of **categorical feature encoding should be fitted on the training set**, and the learned categories should then be propagated to the test set. In this way, `data leakage` can be avoided.

In [62]:
#focus on binary features only
import pandas as pd
from sklearn.model_selection import train_test_split

video_X = video_df[["sex", "home", "math", "own"]] 
video_y = video_df[["time"]]

#Split 90:10
video_X_train, video_X_test, video_y_train, video_y_test = train_test_split(video_X, video_y, test_size=0.1, random_state=1300)

In [67]:
# One-hot encoding

from sklearn import set_config
set_config(transform_output="pandas") #available since scikit-learn 1.2; otherwise transforms return NumPy arrays and we lose column names

from sklearn.preprocessing import OneHotEncoder

#instantiate
ohe = OneHotEncoder(sparse_output=False)  #pandas output does not support sparse data, so we set sparse_output=False
ohe.fit(video_X_train)

In [69]:
#see the encoded categories for each feature
ohe.categories_   #lose column labels :(

[array(['female', 'male'], dtype=object),
 array(['no', 'yes'], dtype=object),
 array(['no', 'yes'], dtype=object),
 array(['no', 'yes'], dtype=object)]

In [70]:
video_X_train_enc = ohe.transform(video_X_train)   #ohe.fit_transform(video_X_train) is the shortcut

In [71]:
video_X_train_enc.head()

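| | sex_female | sex_male | home_no | home_yes | math_no | math_yes | own_no | own_yes |
|---|---|---|---|---|---|---|---|---|
| 49 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 77 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 36 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 84 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 79 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |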


In [38]:
video_X_test_enc = ohe.transform(video_X_test) 

In [39]:
video_X_test_enc.head()

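| | sex_female | sex_male | home_no | home_yes | math_no | math_yes | own_no | own_yes |
|---|---|---|---|---|---|---|---|---|
| 20 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 16 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 59 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 72 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 9 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |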



## Binary variables

When a categorical feature has only 2 categories, like `sex` in our example, the second dummy variable created by one-hot encoding is **completely redundant**. We can automatically drop the first dummy variable for features that contain only 2 categories by setting the parameter `drop='if_binary'`. This ensures that only 1 dummy is created for every binary variable in the dataset.

In [72]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(drop='if_binary', sparse_output=False)  #drop the first category of each binary feature
video_X_train_enc = ohe.fit_transform(video_X_train)  #fit and transform in one step
video_X_train_enc.head()

| | sex_male | home_yes | math_yes | own_yes |
|---|---|---|---|---|
| 49 | 1.0 | 1.0 | 0.0 | 1.0 |
| 77 | 1.0 | 1.0 | 0.0 | 0.0 |
| 36 | 0.0 | 1.0 | 0.0 | 1.0 |
| 84 | 0.0 | 1.0 | 1.0 | 0.0 |
| 79 | 1.0 | 1.0 | 0.0 | 1.0 |


In [73]:
video_X_test_enc = ohe.transform(video_X_test) 
video_X_test_enc.head()

| | sex_male | home_yes | math_yes | own_yes |
|---|---|---|---|---|
| 20 | 1.0 | 1.0 | 0.0 | 1.0 |
| 16 | 1.0 | 1.0 | 0.0 | 1.0 |
| 59 | 0.0 | 1.0 | 0.0 | 1.0 |
| 72 | 1.0 | 0.0 | 1.0 | 1.0 |
| 9 | 1.0 | 1.0 | 1.0 | 1.0 |


## k vs k-1 dummies

From a categorical feature with k unique categories, the [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) can create **k binary features**, or **alternatively k-1** to **avoid redundant information**. The latter behaviour is obtained by setting the parameter `drop='first'`. Only k-1 binary variables are necessary to encode all of the information in the original variable, because the dropped category corresponds to all remaining dummies being 0. However, there are situations in which we may choose to encode the data into k dummies.

Encode into k-1 if training linear models: linear models evaluate all features jointly during fit; thus, with k-1 dummies they have all the information about the original categorical variable.

Encode into k if training penalized linear regression or penalized linear classification models, decision trees, or performing feature selection: tree-based models and many feature selection algorithms evaluate variables or groups of variables separately. Thus, if encoding into k-1, the dropped category will not be examined on its own. That is, we lose the information contained in that category.


## Feature space and duplication

If the categorical features have high cardinality, we may end up with very large datasets after one-hot encoding. In addition, if some of these features are fairly constant or fairly similar, we may end up with one-hot encoded features that are highly correlated, if not identical.

## Ordinal Variable

When there is a **natural ordering between the levels of a categorical feature**, the variable is called an `ordinal variable`. Ordinal features can be encoded via `ordinal encoding`.


The `freq` and `grade` features are examples of ordinal variables (note that `freq` has missing values).

In [42]:
video_df.freq.value_counts()

weekly        28
semesterly    23
monthly       18
daily          9
Name: freq, dtype: int64

In [43]:
video_df.grade.value_counts()

B    52
A    31
C     8
Name: grade, dtype: int64

## Ordinal Encoding

The [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) transformer in scikit-learn replaces the categories by integers from 0 to k-1, where k is the number of different categories.
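One detail worth checking is which integer each level receives. The sketch below uses a made-up toy column to compare the default behaviour, where levels are sorted, with an explicitly supplied order passed through the `categories` parameter.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column (values are made up for illustration)
toy = pd.DataFrame({"grade": ["A", "C", "B", "A"]})

# Default: levels are sorted, so A=0, B=1, C=2
default_codes = np.asarray(OrdinalEncoder().fit_transform(toy)).ravel()

# Explicit order from lowest to highest grade: C=0, B=1, A=2
ordered_enc = OrdinalEncoder(categories=[["C", "B", "A"]])
ordered_codes = np.asarray(ordered_enc.fit_transform(toy)).ravel()

print(default_codes)
print(ordered_codes)
```

The sorted default only matches the natural order by coincidence, so supplying the order explicitly is usually the safer choice for ordinal features.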

In [44]:
#split the data set
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

video_X = video_df[["freq","grade"]] 
video_y = video_df[["time"]]

#Split 90:10
video_X_train, video_X_test, video_y_train, video_y_test = train_test_split(video_X, video_y, test_size=0.1, random_state=1300)

In [45]:
video_X_train.freq.value_counts()

weekly        27
semesterly    21
monthly       15
daily          7
Name: freq, dtype: int64

In [46]:
video_X_train.grade.value_counts()

B    48
A    27
C     6
Name: grade, dtype: int64

In [47]:
#encode ordinal variables (note that freq has missing values)
from sklearn import set_config
set_config(transform_output="pandas")  #available since scikit-learn 1.2; otherwise transforms return NumPy arrays and we lose column names

from sklearn.preprocessing import OrdinalEncoder

#assign specific labels to each level for each feature

freq_cat = ["daily", "weekly", "monthly", "semesterly"] #0,1,2,3
grade_cat = ["C","B","A"] #0,1,2   #run if you want specific assignment for each level

#be careful with missing values: NaNs in freq are not listed in freq_cat, so they are treated as unknown and encoded as NaN
enc = OrdinalEncoder(categories=[freq_cat, grade_cat], handle_unknown='use_encoded_value', unknown_value=np.nan) 
#enc = OrdinalEncoder()
video_X_train_enc = enc.fit_transform(video_X_train)

In [48]:
# see the encoded categories
enc.categories_

[array(['daily', 'weekly', 'monthly', 'semesterly'], dtype=object),
 array(['C', 'B', 'A'], dtype=object)]

In [49]:
video_X_train_enc.freq.value_counts()

1.0    27
3.0    21
2.0    15
0.0     7
Name: freq, dtype: int64

In [50]:
video_X_train_enc.grade.value_counts()

1.0    48
2.0    27
0.0     6
Name: grade, dtype: int64

In [51]:
video_X_test_enc = enc.transform(video_X_test)
video_X_test_enc.head() #note that NaNs are passed through, not imputed

| | freq | grade |
|---|---|---|
| 20 | 1.0 | 2.0 |
| 16 | 3.0 | 2.0 |
| 59 | 0.0 | 0.0 |
| 72 | NaN | 1.0 |
| 9 | 3.0 | 2.0 |


In [52]:
#you can combine imputation with encoding through pipelines

from sklearn import set_config
set_config(transform_output="pandas")  #available since scikit-learn 1.2; otherwise transforms return NumPy arrays and we lose column names

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LinearRegression

pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),  #categorical variable
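    OrdinalEncoder(),  #default: levels are sorted alphabetically; pass categories=[freq_cat, grade_cat] to keep the intended order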
    LinearRegression()
)

In [53]:
pipe.fit(video_X_train, video_y_train)

In [54]:
pipe.predict(video_X_test)

array([[1.37695074],
       [1.23541891],
       [0.71093132],
       [1.25623876],
       [1.23541891],
       [0.83164329],
       [0.97317512],
       [0.85246314],
       [1.37695074],
       [0.97317512]])

In [55]:
#a linear model is not a good fit for this data; that is why R2 is close to zero (here slightly negative). No worries.
print('Test R2 on test data: %.2f' % pipe.score(video_X_test, video_y_test)) 

Test R2 on test data: -0.10


## Ordinal Encoding with many categories

- Do we have enough data for rare categories to learn anything meaningful?

- How about grouping them into bigger categories?

  - Example: country names into continents such as “South America” or “Asia”

- Or having “other” category for rare cases?
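Both ideas in the list above can be sketched with plain pandas. The country values, the `continent_of` lookup table, and the `min_count` threshold below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy data (made up for illustration)
countries = pd.Series(
    ["Brazil", "Brazil", "Japan", "Brazil", "India", "Japan", "Chile"],
    name="country",
)

# Option 1: map each level to a bigger category (hypothetical lookup table)
continent_of = {
    "Brazil": "South America",
    "Chile": "South America",
    "Japan": "Asia",
    "India": "Asia",
}
continents = countries.map(continent_of)

# Option 2: keep levels seen at least min_count times and lump the rest into "other"
min_count = 2  # hypothetical threshold
counts = countries.value_counts()
rare_levels = counts[counts < min_count].index
countries_lumped = countries.where(~countries.isin(rare_levels), "other")

print(continents.tolist())
print(countries_lumped.tolist())
```

As with the encoders themselves, the set of rare levels should be determined on the training set and then applied to the test set.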

## Problem


Now, we would like to apply different transformations on different columns:

- Numeric columns:

  - imputation and
  - scaling

- Nominal categorical columns:

  - imputation and
  - one-hot encoding

- Ordinal categorical columns:

  - imputation and
  - ordinal encoding
  
  
How can we apply these on the data before fitting the regressor? 

## References

- https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_estimator_representation.html#sphx-glr-auto-examples-miscellaneous-plot-estimator-representation-py
- https://datascience.stackexchange.com/questions/72343/encoding-with-ordinalencoder-how-to-give-levels-as-user-input
- https://stats.stackexchange.com/questions/231285/dropping-one-of-the-columns-when-using-one-hot-encoding?noredirect=1&lq=1
- https://www.kaggle.com/discussions/getting-started/114797
- https://datascience.stackexchange.com/questions/107714/encoding-before-vs-after-train-test-split
- https://feature-engine.trainindata.com/en/1.3.x/user_guide/encoding/OneHotEncoder.html

In [56]:
import session_info
session_info.show()