# Tutorial for hccEncoding


This notebook shows how to use the hcc-encoding package for classification and regression problems, using the dataset from the Kaggle competition 'Prudential Life Insurance Assessment'.

In hcc-encoding, the basic principle is to map each value of a high-cardinality categorical independent attribute to an estimate of the probability (for classification) or the expected value (for regression) of the dependent attribute. However, naively replacing a high-cardinality categorical feature with target statistics often results in information leakage, because each row's own target value contributes to its encoding. Daniele Micci-Barreca's empirical Bayes method [ref1] and Owen Zhang's leave-one-out encoding [ref2] are two methods that reduce this leakage; both are implemented in the hccEncoding package.
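To make the empirical Bayes idea concrete, here is a minimal sketch of the shrinkage described in [ref1] for a numeric target: each category's mean is blended with the global mean, with a sigmoid weight that grows with the category's sample count. The function name `smoothed_target_mean`, the hyperparameters `k` and `f`, and the toy data are assumptions made for illustration; this is not the hccEncoding implementation, whose internals may differ.

```python
import numpy as np
import pandas as pd

def smoothed_target_mean(df, feature, target, k=20, f=10):
    """Blend each category's mean target with the global mean.

    k and f are illustrative hyperparameters: k is the count at which the
    category mean and the global mean get equal weight, f controls how
    sharply the weight changes around k.
    """
    prior = df[target].mean()
    stats = df.groupby(feature)[target].agg(['mean', 'count'])
    # Weight approaches 1 for well-populated categories and 0 for rare ones.
    weight = 1.0 / (1.0 + np.exp(-(stats['count'] - k) / f))
    return weight * stats['mean'] + (1.0 - weight) * prior

# Toy data (not the Prudential dataset): one frequent and one rare category.
toy = pd.DataFrame({
    'Product_Info_2': ['A1'] * 50 + ['D3'] * 2,
    'Response': [8] * 50 + [1] * 2,
})
encoding = smoothed_target_mean(toy, 'Product_Info_2', 'Response')
toy['bayes_Product_Info_2'] = toy['Product_Info_2'].map(encoding)
```

In this toy example the rare category 'D3' is pulled strongly towards the global mean, while the frequent category 'A1' stays close to its own mean.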


ref1: Daniele Micci-Barreca. 2001. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. SIGKDD Explor. Newsl. 3, 1 (July 2001), 27-32.

ref2: Owen Zhang. Tips for data science competitions. https://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions


Dataset download: https://www.kaggle.com/c/prudential-life-insurance-assessment

This dataset provides over a hundred variables describing attributes of life insurance applicants.
The task is to predict the "Response" variable for each Id in the test set.
"Response" is an ordinal measure of risk with 8 levels, so the problem can be treated either as a regression problem or as a classification problem (with 8 classes).

In [1]:
# load raw data
import pandas as pd
train=pd.read_csv('train.csv')
test=pd.read_csv('test.csv')

In [2]:
train.head()

Unnamed: 0,Id,Product_Info_1,Product_Info_2,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6,Product_Info_7,Ins_Age,Ht,...,Medical_Keyword_40,Medical_Keyword_41,Medical_Keyword_42,Medical_Keyword_43,Medical_Keyword_44,Medical_Keyword_45,Medical_Keyword_46,Medical_Keyword_47,Medical_Keyword_48,Response
0,2,1,D3,10,0.076923,2,1,1,0.641791,0.581818,...,0,0,0,0,0,0,0,0,0,8
1,5,1,A1,26,0.076923,2,3,1,0.059701,0.6,...,0,0,0,0,0,0,0,0,0,4
2,6,1,E1,26,0.076923,2,3,1,0.029851,0.745455,...,0,0,0,0,0,0,0,0,0,8
3,7,1,D4,10,0.487179,2,3,1,0.164179,0.672727,...,0,0,0,0,0,0,0,0,0,8
4,8,1,D2,26,0.230769,2,3,1,0.41791,0.654545,...,0,0,0,0,0,0,0,0,0,8


In [3]:
len(train['Product_Info_2'].unique())

19
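The cell above counts the distinct values of a single column. As a sketch of how the same check could be run over every non-numeric column at once, the snippet below ranks them by cardinality. The toy frame is a made-up stand-in for `train`, not real data.

```python
import pandas as pd

# Toy stand-in for the training frame (hypothetical values).
toy = pd.DataFrame({
    'Product_Info_2': ['D3', 'A1', 'E1', 'D4', 'D2', 'A1'],
    'Product_Info_1': [1, 1, 1, 1, 1, 2],
})

# Keep only the non-numeric columns, then count distinct values in each.
non_numeric = [c for c in toy.columns
               if not pd.api.types.is_numeric_dtype(toy[c])]
cardinality = toy[non_numeric].nunique().sort_values(ascending=False)
```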

The feature 'Product_Info_2' has 19 distinct values, so it can be treated as a high-cardinality feature. To keep the demonstration of hcc-encoding simple, we drop the features that are not needed for it:

In [4]:
train=train[['Id','Response','Product_Info_2']]
test=test[['Id','Product_Info_2']]

In [5]:
train.head()

Unnamed: 0,Id,Response,Product_Info_2
0,2,8,D3
1,5,4,A1
2,6,8,E1
3,7,8,D4
4,8,8,D2


# Part 1. Encoding for classification problems

In [6]:
from hccEncoding.EncoderForClassification import BayesEncoding,BayesEncodingKfold,LOOEncoding,LOOEncodingKfold

train_BayesEncoding,test_BayesEncoding=BayesEncoding(train,test,'Response','Product_Info_2')
train_LOOEncoding,test_LOOEncoding=LOOEncoding(train,test,'Response','Product_Info_2')


In [7]:
train_BayesEncoding.head()

Unnamed: 0,Id,Response,Product_Info_2,bayes_Product_Info_2_1,bayes_Product_Info_2_2,bayes_Product_Info_2_3,bayes_Product_Info_2_4,bayes_Product_Info_2_5,bayes_Product_Info_2_6,bayes_Product_Info_2_7,bayes_Product_Info_2_8
0,2,8,D3,0.10052,0.116242,0.016612,0.029456,0.087625,0.227483,0.145368,0.274486
1,5,4,A1,0.055488,0.100343,0.02179,0.028626,0.083012,0.148888,0.115147,0.445322
2,6,8,E1,0.076636,0.078272,0.012823,0.029673,0.067797,0.162853,0.150906,0.425601
3,7,8,D4,0.064025,0.065807,0.007659,0.011436,0.075465,0.166885,0.135976,0.477254
4,8,8,D2,0.11959,0.147281,0.017612,0.024653,0.06727,0.247489,0.145203,0.233645


Note: In BayesEncoding for classification problems, the new features produced by the encoding are the probabilities of each class. Each generated column is named 'bayes_(original feature name)_(class name)', for example 'bayes_Product_Info_2_8'.
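The one-column-per-class layout can be illustrated with a small sketch that computes raw class frequencies per category and names the columns with the same pattern. It uses plain frequencies with no empirical Bayes smoothing, so it only illustrates the shape of the output and is not the computation performed by the package. The toy data is hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, three classes instead of eight).
toy = pd.DataFrame({
    'Product_Info_2': ['A1', 'A1', 'A1', 'D3', 'D3', 'D3', 'D3'],
    'Response': [8, 8, 4, 1, 8, 4, 4],
})
feature, target = 'Product_Info_2', 'Response'

# Rows: category values; columns: classes; cells: P(class | category).
class_probs = pd.crosstab(toy[feature], toy[target], normalize='index')
class_probs.columns = [f'bayes_{feature}_{cls}' for cls in class_probs.columns]

# Attach one probability column per class to every row.
encoded = toy.join(class_probs, on=feature)
```

Each row's probability columns sum to 1, which matches the pattern visible in the `train_BayesEncoding.head()` output.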

In [8]:
train_LOOEncoding.head()

Unnamed: 0,Id,Response,Product_Info_2,loo_Product_Info_2
0,2,8,D3,5.568211
1,5,4,A1,6.176825
2,6,8,E1,6.171721
3,7,8,D4,6.37337
4,8,8,D2,5.260191


In [9]:
train_BayesEncodingKfold,test_BayesEncodingKfold=BayesEncodingKfold(train,test,'Response','Product_Info_2',fold=5)
train_LOOEncodingKfold,test_LOOEncodingKfold=LOOEncodingKfold(train,test,'Response','Product_Info_2',fold=5)


In [10]:
train_BayesEncodingKfold.head()

Unnamed: 0,Id,Response,Product_Info_2,bayes_Product_Info_2_1,bayes_Product_Info_2_2,bayes_Product_Info_2_3,bayes_Product_Info_2_4,bayes_Product_Info_2_5,bayes_Product_Info_2_6,bayes_Product_Info_2_7,bayes_Product_Info_2_8
0,2,8,D3,0.103135,0.11697,0.016271,0.029247,0.086499,0.229355,0.147193,0.268195
1,5,4,A1,0.05417,0.094039,0.023956,0.030422,0.08712,0.148125,0.115648,0.443238
2,6,8,E1,0.072549,0.079161,0.013412,0.02899,0.063683,0.161461,0.150065,0.433177
3,7,8,D4,0.063238,0.064417,0.007972,0.011082,0.076236,0.165874,0.13138,0.478303
4,8,8,D2,0.114566,0.146584,0.018327,0.024603,0.068299,0.245421,0.148178,0.233748


In [11]:
train_LOOEncodingKfold.head()

Unnamed: 0,Id,Response,Product_Info_2,loo_Product_Info_2
0,2,8,D3,5.51035
1,5,4,A1,6.147088
2,6,8,E1,6.165965
3,7,8,D4,6.389432
4,8,8,D2,5.280445


Note: the difference between BayesEncoding and BayesEncodingKfold (and likewise between LOOEncoding and LOOEncodingKfold) is how the train dataset is encoded. In BayesEncoding (or LOOEncoding), the train dataset is encoded using statistics of the full train dataset. In BayesEncodingKfold (or LOOEncodingKfold), it is encoded using statistics of only part of the train dataset. For example, when fold=5, BayesEncodingKfold (or LOOEncodingKfold) uses 80% of the train dataset to encode the remaining 20%. This further reduces the risk of information leakage; the downside is that each encoding uses less information from the train dataset.
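The K-fold idea can be sketched in a few lines: split the rows into folds, and encode each fold with statistics computed only on the remaining folds. The sketch below uses a plain category mean as the statistic; the function name `kfold_mean_encode`, the `seed` argument, the fallback to the global mean, and the output column name are assumptions made for illustration and are not taken from the package.

```python
import numpy as np
import pandas as pd

def kfold_mean_encode(df, feature, target, fold=5, seed=0):
    """Encode each row with category means computed on the other folds only."""
    rng = np.random.default_rng(seed)
    # Assign every row to one of `fold` groups at random.
    fold_id = rng.permutation(len(df)) % fold
    prior = df[target].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for i in range(fold):
        held_out = fold_id == i
        # Statistics come from the remaining folds (about 80% of rows when fold=5).
        means = df.loc[~held_out].groupby(feature)[target].mean()
        encoded.loc[held_out] = df.loc[held_out, feature].map(means).to_numpy()
    # Categories unseen in the fitting folds fall back to the global mean.
    return encoded.fillna(prior)

# Toy data (hypothetical values).
toy = pd.DataFrame({
    'Product_Info_2': ['A1', 'D3'] * 10,
    'Response': [8, 4] * 10,
})
toy['kfold_Product_Info_2'] = kfold_mean_encode(toy, 'Product_Info_2', 'Response', fold=5)
```

Because a row never contributes to the statistics used to encode it, a category that occurs only once receives the global mean rather than its own target value.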

# Part 2. Encoding for regression problems

In [12]:
# Note: these names shadow the classification encoders imported in Part 1
from hccEncoding.EncoderForRegression import BayesEncoding,BayesEncodingKfold,LOOEncoding,LOOEncodingKfold

train_BayesEncoding,test_BayesEncoding=BayesEncoding(train,test,'Response','Product_Info_2')
train_LOOEncoding,test_LOOEncoding=LOOEncoding(train,test,'Response','Product_Info_2')

In [13]:
train_BayesEncoding.head()

Unnamed: 0,Id,Response,Product_Info_2,bayes_Product_Info_2
0,2,8,D3,5.554615
1,5,4,A1,6.212693
2,6,8,E1,6.226905
3,7,8,D4,6.33727
4,8,8,D2,5.261874


In [14]:
train_LOOEncoding.head()

Unnamed: 0,Id,Response,Product_Info_2,loo_Product_Info_2
0,2,8,D3,5.553309
1,5,4,A1,6.042045
2,6,8,E1,6.065838
3,7,8,D4,6.423081
4,8,8,D2,5.302704
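The `loo_Product_Info_2` column can be read through the leave-one-out idea from [ref2]: each training row is encoded with the mean target of all other rows that share its category, so its own target never leaks into its feature. The following is a minimal sketch under that reading. The function name `leave_one_out_mean` and the global-mean fallback for categories with a single row are assumptions, and the package may apply additional steps that are not shown here.

```python
import pandas as pd

def leave_one_out_mean(df, feature, target):
    """Mean of the target over all other rows that share the same category."""
    grouped = df.groupby(feature)[target]
    total = grouped.transform('sum')
    count = grouped.transform('count')
    # Remove the row's own target from its category total before averaging.
    loo = (total - df[target]) / (count - 1)
    # A category seen only once has no other rows; fall back to the global mean.
    return loo.where(count > 1, df[target].mean())

# Toy data (hypothetical values).
toy = pd.DataFrame({
    'Product_Info_2': ['A1', 'A1', 'A1', 'D3', 'D3', 'E1'],
    'Response': [8, 4, 6, 2, 8, 5],
})
toy['loo_Product_Info_2'] = leave_one_out_mean(toy, 'Product_Info_2', 'Response')
```

Note that two rows with the same category can receive different encodings, because each one excludes a different target value.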


In [15]:
train_BayesEncodingKfold,test_BayesEncodingKfold=BayesEncodingKfold(train,test,'Response','Product_Info_2',fold=5)
train_LOOEncodingKfold,test_LOOEncodingKfold=LOOEncodingKfold(train,test,'Response','Product_Info_2',fold=5)

In [16]:
train_BayesEncodingKfold.head()

Unnamed: 0,Id,Response,Product_Info_2,bayes_Product_Info_2
0,2,8,D3,5.528438
1,5,4,A1,6.217039
2,6,8,E1,6.175437
3,7,8,D4,6.322527
4,8,8,D2,5.288024


In [17]:
train_LOOEncodingKfold.head()

Unnamed: 0,Id,Response,Product_Info_2,loo_Product_Info_2
0,2,8,D3,5.511082
1,5,4,A1,6.147173
2,6,8,E1,6.164431
3,7,8,D4,6.390441
4,8,8,D2,5.280106


For a more detailed explanation of the parameters, please check the online documentation: http://hccencoding-project.readthedocs.io/en/latest/