## One Hot Encoding of Frequent Categories

We learned in Section 3 that high cardinality and rare labels may result in certain categories appearing only in the train set, which can cause over-fitting, or only in the test set, in which case our models would not know how to score those observations.

We also learned in the previous lecture on one hot encoding that, if categorical variables contain many labels, re-encoding them with dummy variables expands the feature space dramatically.

**To avoid these complications, we can create dummy variables only for the most frequent categories.**

This procedure is also called one hot encoding of top categories.

In fact, in the winning solution of the KDD 2009 cup: ["Winning the KDD Cup Orange Challenge with Ensemble Selection"](http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf), the authors limit one hot encoding to the 10 most frequent labels of the variable. This means that they would make one binary variable for each of the 10 most frequent labels only.

OHE of frequent or top categories is equivalent to grouping all the remaining categories under a single new category: an observation with any of the remaining labels shows 0 in every dummy variable. We will take a closer look at grouping rare values into a new category in a later notebook in this section.
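
The idea can be sketched by hand with pandas. This is a minimal illustration, not the method used later in this notebook: the toy `colour` variable and the cut-off `top_n = 2` are invented for the example.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({"colour": ["red", "red", "blue", "green", "blue", "red", "pink"]})

top_n = 2  # hypothetical cut-off: keep the 2 most frequent labels

# value_counts() sorts labels by frequency, most frequent first
top_labels = df["colour"].value_counts().head(top_n).index.tolist()

# one binary variable per top label; every other label gets 0 everywhere
encoded = pd.DataFrame(index=df.index)
for label in top_labels:
    encoded[f"colour_{label}"] = (df["colour"] == label).astype(int)

print(top_labels)
print(encoded)
```

The rows holding `green` and `pink` end up with zeros in both dummy variables, which is how the remaining categories are implicitly grouped together.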


### Advantages of OHE of top categories

- Straightforward to implement
- Does not require hours of variable exploration
- Does not massively expand the feature space
- Suitable for linear models


### Limitations

- Does not add any information that may make the variable more predictive
- Does not keep the information of the ignored labels


Often, categorical variables show a few dominating categories while the remaining labels add little information. Therefore, OHE of top categories is a simple and useful technique.

### Note

The number of top categories is a free choice. In the KDD competition the authors selected 10, but it could just as well have been 15 or 5. It can be set arbitrarily or derived from data exploration.
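
One possible way to derive the number from data exploration is to pick the smallest number of labels that covers a given share of the observations. The sketch below uses the `Mjob` counts that are printed later in this notebook; the helper `smallest_top_n` and the 80% threshold are assumptions made for illustration, not part of the original exercise.

```python
from itertools import accumulate

# Mjob label counts, copied from the value_counts() output in this notebook
mjob_counts = {"other": 258, "services": 136, "at_home": 135, "teacher": 72, "health": 48}


def smallest_top_n(counts, min_coverage):
    """Return the smallest number of top labels whose cumulative share of
    observations reaches min_coverage (a fraction between 0 and 1)."""
    total = sum(counts.values())
    ordered = sorted(counts.values(), reverse=True)
    for n, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative / total >= min_coverage:
            return n
    return len(ordered)


# hypothetical threshold: the top labels should cover at least 80% of the rows
top_n = smallest_top_n(mjob_counts, min_coverage=0.8)
print(top_n)
```

For `Mjob`, the three most frequent labels account for 529 of the 649 observations, so this rule would suggest 3 top categories.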


## In this demo:

We will see how to perform one hot encoding of top categories with Feature-engine.

Advantages of Feature-engine:

- quick
- creates the same number of features in train and test set

The exercise is based on the training notes:
Feature Engineering for Machine Learning
by Soledad Galli


In [1]:
import numpy as np
import pandas as pd

# to split the datasets
from sklearn.model_selection import train_test_split

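# for one hot encoding with feature-engine
# (this import path is for feature-engine < 1.0; from 1.0 onwards the class
# is feature_engine.encoding.OneHotEncoder)
from feature_engine.categorical_encoders import OneHotCategoricalEncoder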

pd.set_option('display.max_columns', None)

## Dataset

This dataset describes student achievement in secondary education at two Portuguese schools.
The attributes include student grades and demographic, social and school-related features,
and they were collected using school reports and questionnaires.
The target, G3, is the final year grade (issued at the 3rd period).


In [2]:
columns = ['Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities',
           'Dalc', 'Walc', 'health', 'G3']
# raw string, so that the backslashes in the Windows path are not read as escape sequences
data_raw = pd.read_csv(r'C:\Users\gusal\machine learning\Feature engineering\student-por.csv',
                       delimiter=';', usecols=columns)

In [3]:
data_raw.head(5)

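|  | Mjob | Fjob | reason | guardian | schoolsup | famsup | paid | activities | Dalc | Walc | health | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | at_home | teacher | course | mother | yes | no | no | no | 1 | 1 | 3 | 11 |
| 1 | at_home | other | course | father | no | yes | no | no | 1 | 1 | 3 | 11 |
| 2 | at_home | other | other | mother | yes | no | no | no | 2 | 3 | 3 | 12 |
| 3 | health | services | home | mother | no | yes | no | yes | 1 | 1 | 5 | 14 |
| 4 | other | other | home | father | no | yes | no | no | 1 | 2 | 5 | 13 |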


In [4]:
data_raw.dtypes

Mjob          object
Fjob          object
reason        object
guardian      object
schoolsup     object
famsup        object
paid          object
activities    object
Dalc           int64
Walc           int64
health         int64
G3             int64
dtype: object

'Dalc', 'Walc' and 'health' are categorical variables, but pandas reads them as numerical (int64), so they have to be re-cast as object.

In [5]:
# re-cast the numerically coded categorical variables as object
data_raw.Dalc = data_raw.Dalc.astype(object)
data_raw.Walc = data_raw.Walc.astype(object)
data_raw.health = data_raw.health.astype(object)


In [6]:
data_raw.dtypes

Mjob          object
Fjob          object
reason        object
guardian      object
schoolsup     object
famsup        object
paid          object
activities    object
Dalc          object
Walc          object
health        object
G3             int64
dtype: object

In [7]:
# let's have a look at how many labels each variable has

for col in data_raw.columns:
    print(col, ': ', len(data_raw[col].unique()), ' labels')

Mjob :  5  labels
Fjob :  5  labels
reason :  4  labels
guardian :  3  labels
schoolsup :  2  labels
famsup :  2  labels
paid :  2  labels
activities :  2  labels
Dalc :  5  labels
Walc :  5  labels
health :  5  labels
G3 :  17  labels


## Counting the observations per label

In [8]:
data_raw['Mjob'].value_counts().sort_values(ascending=False).head(10)

other       258
services    136
at_home     135
teacher      72
health       48
Name: Mjob, dtype: int64

In [9]:
data_raw['Fjob'].value_counts().sort_values(ascending=False).head(10)

other       367
services    181
at_home      42
teacher      36
health       23
Name: Fjob, dtype: int64

In [10]:
data_raw['reason'].value_counts().sort_values(ascending=False).head(10)

course        285
home          149
reputation    143
other          72
Name: reason, dtype: int64

In [11]:
data_raw['Dalc'].value_counts().sort_values(ascending=False).head(10)

1    451
2    121
3     43
4     17
5     17
Name: Dalc, dtype: int64

In [12]:
data_raw['Walc'].value_counts().sort_values(ascending=False).head(10)

1    247
2    150
3    120
4     87
5     45
Name: Walc, dtype: int64

In [13]:
data_raw['health'].value_counts().sort_values(ascending=False).head(10)

5    249
3    124
4    108
1     90
2     78
Name: health, dtype: int64

In [14]:
inputs = data_raw.drop(['G3'], axis = 1)
target = data_raw.G3

In [15]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    inputs,  # predictors
    target,  # target
    test_size=0.3,  # fraction of observations in the test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((454, 11), (195, 11))

### Important

The top or most frequent categories must be selected based on the train data only. We then use those same categories to encode the variables in the test data as well.
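
To make the train/test separation concrete, here is a hand-written pandas sketch. The helper names `learn_top_categories` and `encode_top_categories` and the toy series are hypothetical; the encoder used in the next cell handles this step internally.

```python
import pandas as pd


def learn_top_categories(series, top_n):
    """Return the top_n most frequent labels, learned from the train data only."""
    return series.value_counts().head(top_n).index.tolist()


def encode_top_categories(series, top_labels):
    """Create one binary column per learned label; unseen labels get all zeros."""
    name = series.name
    return pd.DataFrame(
        {f"{name}_{label}": (series == label).astype(int) for label in top_labels},
        index=series.index,
    )


# Toy train and test data, invented for illustration only
train = pd.Series(["a", "a", "a", "b", "b", "c"], name="var")
test = pd.Series(["b", "d", "a"], name="var")

top_labels = learn_top_categories(train, top_n=2)  # learned on train only
train_enc = encode_top_categories(train, top_labels)
test_enc = encode_top_categories(test, top_labels)  # same labels reused on test

print(test_enc)
```

Because the labels are learned once and reused, both sets end up with the same columns, and the label `d`, which appears only in the test set, is simply encoded as zeros.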

In [16]:
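# variables to encode: all the columns except the target
# (built as a new list, so that re-running this cell does not fail)
variables_to_encode = [col for col in columns if col != 'G3']

ohe_enc = OneHotCategoricalEncoder(
    top_categories=3,  # you can change this value to select more or fewer categories
    # we can select which variables to encode
    variables=variables_to_encode,
    # drop_last has no effect here: with top_categories set, the binary
    # variables still get both of their dummies (see the output below)
    drop_last=True)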

ohe_enc.fit(X_train)

OneHotCategoricalEncoder(drop_last=True, top_categories=3,
                         variables=['Mjob', 'Fjob', 'reason', 'guardian',
                                    'schoolsup', 'famsup', 'paid', 'activities',
                                    'Dalc', 'Walc', 'health'])

In [17]:
# in the encoder dict we can observe each of the top categories
# selected for each of the variables

ohe_enc.encoder_dict_

{'Mjob': ['other', 'services', 'at_home'],
 'Fjob': ['other', 'services', 'at_home'],
 'reason': ['course', 'home', 'reputation'],
 'guardian': ['mother', 'father', 'other'],
 'schoolsup': ['no', 'yes'],
 'famsup': ['yes', 'no'],
 'paid': ['no', 'yes'],
 'activities': ['no', 'yes'],
 'Dalc': [1, 2, 3],
 'Walc': [1, 2, 3],
 'health': [5, 3, 4]}

In [18]:
X_train = ohe_enc.transform(X_train)
X_test = ohe_enc.transform(X_test)

# let's explore the result
X_train.head()

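|  | Mjob_other | Mjob_services | Mjob_at_home | Fjob_other | Fjob_services | Fjob_at_home | reason_course | reason_home | reason_reputation | guardian_mother | guardian_father | guardian_other | schoolsup_no | schoolsup_yes | famsup_yes | famsup_no | paid_no | paid_yes | activities_no | activities_yes | Dalc_1 | Dalc_2 | Dalc_3 | Walc_1 | Walc_2 | Walc_3 | health_5 | health_3 | health_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 561 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 452 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 89 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 299 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 231 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |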


In [19]:
X_train.shape

(454, 29)
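
The 29 columns can be checked with a little arithmetic: each variable contributes one dummy per top category, capped by the number of labels it actually has. The sketch below copies the label counts printed earlier in this notebook; the variable names `n_labels` and `n_dummies` are introduced here for illustration.

```python
# Number of distinct labels per encoded variable, copied from the
# label counts printed earlier in this notebook (target G3 excluded)
n_labels = {
    "Mjob": 5, "Fjob": 5, "reason": 4, "guardian": 3,
    "schoolsup": 2, "famsup": 2, "paid": 2, "activities": 2,
    "Dalc": 5, "Walc": 5, "health": 5,
}

top_categories = 3  # same value passed to the encoder above

# a variable cannot yield more dummies than it has labels
n_dummies = {var: min(n, top_categories) for var, n in n_labels.items()}

expected_columns = sum(n_dummies.values())
print(expected_columns)  # 7 variables x 3 dummies + 4 binary variables x 2 dummies = 29
```

This matches the second dimension of `X_train.shape` and agrees with the lists in `encoder_dict_`, where the four binary variables keep both of their labels.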