## Integer Encoding

Integer encoding consists of replacing the categories with integers from 1 to n (or 0 to n-1, depending on the implementation), where n is the number of distinct categories of the variable.

The numbers are assigned arbitrarily. This encoding method allows for quick benchmarking of machine learning models. 
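As a minimal sketch of the idea, the snippet below builds an arbitrary category-to-integer mapping with plain pandas. The toy column `jobs` and the choice of numbering categories by order of first appearance are assumptions made for illustration; they are not part of the demo that follows.

```python
import pandas as pd

# Hypothetical toy column, not taken from the dataset used in this demo
jobs = pd.Series(['teacher', 'health', 'other', 'teacher', 'other'])

# Assign integers 0..n-1 by order of first appearance (an arbitrary choice)
mapping = {category: code for code, category in enumerate(jobs.unique())}

# Replace each category with its integer
encoded = jobs.map(mapping)

print(mapping)
print(encoded.tolist())
```

Any other assignment of the integers 0 to n-1 would be an equally valid integer encoding.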


### Advantages

- Straightforward to implement
- Does not expand the feature space


### Limitations

- Does not capture any information about the category labels
- Not suitable for linear models

Integer encoding is better suited to non-linear methods, which can navigate the arbitrarily assigned integers to try to find patterns that relate them to the target.
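To illustrate why the arbitrary numbering matters for linear models, here is a small sketch on made-up data. It fits a least-squares line on two equally arbitrary encodings of the same variable and compares the errors. The toy data, the helper `linear_fit_sse` and both mappings are assumptions introduced only for this illustration.

```python
import numpy as np

# Hypothetical toy data: three categories with different target values
categories = np.array(['a', 'b', 'c'] * 4)
target = np.array([10.0, 30.0, 20.0] * 4)


def linear_fit_sse(codes, y):
    """Sum of squared errors of a least-squares line y ~ intercept + slope * code."""
    design = np.column_stack([np.ones(len(codes)), codes])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(residuals @ residuals)


# Two equally arbitrary integer assignments for the same categories
mapping_1 = {'a': 0, 'b': 1, 'c': 2}
mapping_2 = {'a': 0, 'b': 2, 'c': 1}

codes_1 = np.array([mapping_1[c] for c in categories], dtype=float)
codes_2 = np.array([mapping_2[c] for c in categories], dtype=float)

sse_1 = linear_fit_sse(codes_1, target)
sse_2 = linear_fit_sse(codes_2, target)

print('SSE with mapping_1:', sse_1)
print('SSE with mapping_2:', sse_2)
```

In this toy case the second numbering happens to line up with the target, so the line fits almost perfectly, while the first leaves a large error. A linear model's fit therefore depends on a numbering that carries no meaning.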


## In this demo:

We will see how to perform integer encoding with:

- Feature-Engine

The exercise is based on the training notes:
Feature Engineering for Machine Learning
by Soledad Galli


In [1]:
import numpy as np
import pandas as pd

# to split the datasets
from sklearn.model_selection import train_test_split


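# for integer encoding using feature-engine
# (this import path is from older feature-engine releases; newer releases
# expose the same encoder as feature_engine.encoding.OrdinalEncoder)
from feature_engine.categorical_encoders import OrdinalCategoricalEncoder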

# Dataset
This dataset covers student achievement in secondary education at two Portuguese schools.
The attributes include student grades and demographic, social and school-related features,
and were collected using school reports and questionnaires.
The target, G3, is the final-year grade (issued at the 3rd period).


In [2]:
columns = [ 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities',
             'Dalc', 'Walc' , 'health', 'G3']
# raw string so the backslashes in the Windows path are not treated as escape sequences
data_raw = pd.read_csv(r'C:\Users\gusal\machine learning\Feature engineering\student-por.csv', delimiter=';', usecols=columns)

In [3]:
data_raw.head(5)

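|   | Mjob | Fjob | reason | guardian | schoolsup | famsup | paid | activities | Dalc | Walc | health | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | at_home | teacher | course | mother | yes | no | no | no | 1 | 1 | 3 | 11 |
| 1 | at_home | other | course | father | no | yes | no | no | 1 | 1 | 3 | 11 |
| 2 | at_home | other | other | mother | yes | no | no | no | 2 | 3 | 3 | 12 |
| 3 | health | services | home | mother | no | yes | no | yes | 1 | 1 | 5 | 14 |
| 4 | other | other | home | father | no | yes | no | no | 1 | 2 | 5 | 13 |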


In [4]:
data_raw.dtypes

Mjob          object
Fjob          object
reason        object
guardian      object
schoolsup     object
famsup        object
paid          object
activities    object
Dalc           int64
Walc           int64
health         int64
G3             int64
dtype: object

'Dalc', 'Walc' and 'health' are categorical variables, but they were read as numerical, so they have to be re-cast as object.

In [5]:
data_raw.Dalc = data_raw.Dalc.astype(object)
data_raw.Walc = data_raw.Walc.astype(object)
data_raw.health = data_raw.health.astype(object)


In [6]:
data_raw.dtypes

Mjob          object
Fjob          object
reason        object
guardian      object
schoolsup     object
famsup        object
paid          object
activities    object
Dalc          object
Walc          object
health        object
G3             int64
dtype: object

In [7]:
# let's have a look at how many labels each variable has

for col in data_raw.columns:
    print(col, ': ', len(data_raw[col].unique()), ' labels')

Mjob :  5  labels
Fjob :  5  labels
reason :  4  labels
guardian :  3  labels
schoolsup :  2  labels
famsup :  2  labels
paid :  2  labels
activities :  2  labels
Dalc :  5  labels
Walc :  5  labels
health :  5  labels
G3 :  17  labels


# Counting the observations per label

In [8]:
inputs = data_raw.drop(['G3'], axis = 1)
target = data_raw.G3

In [9]:

for col in inputs.columns:
    print(col, ':\n ', inputs[col].value_counts().sort_values(ascending=False),'\n ')

Mjob :
  other       258
services    136
at_home     135
teacher      72
health       48
Name: Mjob, dtype: int64 
 
Fjob :
  other       367
services    181
at_home      42
teacher      36
health       23
Name: Fjob, dtype: int64 
 
reason :
  course        285
home          149
reputation    143
other          72
Name: reason, dtype: int64 
 
guardian :
  mother    455
father    153
other      41
Name: guardian, dtype: int64 
 
schoolsup :
  no     581
yes     68
Name: schoolsup, dtype: int64 
 
famsup :
  yes    398
no     251
Name: famsup, dtype: int64 
 
paid :
  no     610
yes     39
Name: paid, dtype: int64 
 
activities :
  no     334
yes    315
Name: activities, dtype: int64 
 
Dalc :
  1    451
2    121
3     43
4     17
5     17
Name: Dalc, dtype: int64 
 
Walc :
  1    247
2    150
3    120
4     87
5     45
Name: Walc, dtype: int64 
 
health :
  5    249
3    124
4    108
1     90
2     78
Name: health, dtype: int64 
 


In [10]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    inputs,  # predictors
    target,  # target
    test_size=0.3,  # fraction of observations in the test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((454, 11), (195, 11))

### Important

We select which integer to assign to each category using the train set, and then use those mappings in the test set.
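The principle can be sketched with plain pandas before turning to Feature-Engine. The toy frames `train` and `test` below are hypothetical and the mapping is built by hand; this is only an illustration of learning the mapping on one set and reusing it on the other, not how the encoder is implemented.

```python
import pandas as pd

# Hypothetical toy train/test split of one categorical column
train = pd.DataFrame({'guardian': ['mother', 'father', 'mother', 'other']})
test = pd.DataFrame({'guardian': ['father', 'other', 'mother']})

# Learn the category-to-integer mapping from the train set only
mapping = {cat: code for code, cat in enumerate(train['guardian'].unique())}

# Reuse the same mapping on both sets
train_encoded = train['guardian'].map(mapping)
test_encoded = test['guardian'].map(mapping)

print(mapping)
print(test_encoded.tolist())
```

Because the mapping is fixed after seeing the train set, a given category receives the same integer in both sets.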

## Integer Encoding with Feature-Engine

In [11]:
columns.remove('G3')
# In this case I did not select the variables explicitly (variables=columns).
# When variables=None, the encoder automatically identifies all
# categorical variables.

ordinal_enc = OrdinalCategoricalEncoder(
    encoding_method='arbitrary',
    variables = None)

ordinal_enc.fit(X_train)

OrdinalCategoricalEncoder(encoding_method='arbitrary',
                          variables=['Mjob', 'Fjob', 'reason', 'guardian',
                                     'schoolsup', 'famsup', 'paid',
                                     'activities', 'Dalc', 'Walc', 'health'])

In [12]:
# in the encoder dict we can observe the numbers
# assigned to each category for all the indicated variables

ordinal_enc.encoder_dict_

{'Mjob': {'at_home': 0, 'other': 1, 'teacher': 2, 'services': 3, 'health': 4},
 'Fjob': {'at_home': 0, 'other': 1, 'health': 2, 'services': 3, 'teacher': 4},
 'reason': {'course': 0, 'reputation': 1, 'home': 2, 'other': 3},
 'guardian': {'mother': 0, 'other': 1, 'father': 2},
 'schoolsup': {'no': 0, 'yes': 1},
 'famsup': {'yes': 0, 'no': 1},
 'paid': {'no': 0, 'yes': 1},
 'activities': {'yes': 0, 'no': 1},
 'Dalc': {1: 0, 3: 1, 2: 2, 5: 3, 4: 4},
 'Walc': {4: 0, 1: 1, 5: 2, 2: 3, 3: 4},
 'health': {5: 0, 4: 1, 2: 2, 3: 3, 1: 4}}

In [13]:
X_train = ordinal_enc.transform(X_train)
X_test = ordinal_enc.transform(X_test)

# let's explore the result
X_train.head()

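|   | Mjob | Fjob | reason | guardian | schoolsup | famsup | paid | activities | Dalc | Walc | health |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 561 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 452 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 89 | 2 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 0 |
| 299 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 3 | 2 |
| 231 | 3 | 2 | 2 | 2 | 0 | 0 | 0 | 1 | 2 | 0 | 1 |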


**Note**

If the argument variables is left as None, the encoder will automatically identify all categorical variables. Is that not sweet?

The encoder will not encode numerical variables. So if some of your numerical variables are in fact categories, you will need to re-cast them as object before using the encoder.

Note that if the test set contains a category for which the encoder has no number to assign (the category was not seen in the train set), the encoder will return an error.
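One way to catch this situation early is to compare the labels of both sets before calling the encoder. The helper `unseen_categories` and the toy frames below are hypothetical and use only pandas; they are not part of Feature-Engine.

```python
import pandas as pd

# Hypothetical toy data: 'health' appears only in the test set
X_train_toy = pd.DataFrame({'Mjob': ['teacher', 'other', 'services']})
X_test_toy = pd.DataFrame({'Mjob': ['other', 'health']})


def unseen_categories(train_df, test_df):
    """Return {column: set of categories present in test_df but not in train_df}."""
    unseen = {}
    for col in test_df.columns:
        new_labels = set(test_df[col].unique()) - set(train_df[col].unique())
        if new_labels:
            unseen[col] = new_labels
    return unseen


result = unseen_categories(X_train_toy, X_test_toy)
print(result)
```

If the returned dictionary is empty, every category in the test set was also present in the train set.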