## Introduction

Inspired by the article and repository below, I created this kernel:

- [Benchmarking Categorical Encoders](https://towardsdatascience.com/benchmarking-categorical-encoders-9c322bd77ee8)

- [CategoricalEncodingBenchmark](https://github.com/DenisVorotyntsev/CategoricalEncodingBenchmark)

Let's see how these methods work in this dataset.

[Discussion](https://www.kaggle.com/c/cat-in-the-dat/discussion/112584)

- No feature preprocessing
- KFold(5) for CV (more folds gave a better score)
- Logistic regression, LR (C=0.1, solver=lbfgs)

|Encoder|LB Score|
|-|-|
|TE|0.78018|
|WOE|0.78861|
|LOOE|0.79382|
|James-Stein|0.77843|
|Catboost|0.79164|
|One-Hot (from another kernel of mine)|0.77973|



### Category-Encoders

1. Label Encoder
2. One-Hot Encoder
3. Sum Encoder
4. Helmert Encoder
5. Frequency Encoder
6. Target Encoder
7. M-Estimate Encoder
8. Weight Of Evidence Encoder
9. James-Stein Encoder
10. Leave-one-out Encoder
11. Catboost Encoder

---

- Validation (Benchmark)
    - single LR
    - LR with Cross Validation

- Submit

__Note__: When no arguments are passed to the encoders above (from the __category\_encoders__ library), only __string__ or __object__ type columns are encoded; numeric/integer columns are left as they are. To encode specific columns, including numeric/integer ones, pass the list of columns as the __cols__ argument when instantiating the encoder.

## Category-Encoders 

A set of scikit-learn-style transformers for encoding categorical variables into numeric by means of different techniques.

In [None]:
# If you want to test this on your local notebook
# http://contrib.scikit-learn.org/categorical-encoding/
# !pip install category-encoders

In [35]:
import pandas as pd

from category_encoders.ordinal import OrdinalEncoder
from category_encoders.woe import WOEEncoder
from category_encoders.target_encoder import TargetEncoder
from category_encoders.sum_coding import SumEncoder
from category_encoders.m_estimate import MEstimateEncoder
from category_encoders.leave_one_out import LeaveOneOutEncoder
from category_encoders.helmert import HelmertEncoder
from category_encoders.cat_boost import CatBoostEncoder
from category_encoders.james_stein import JamesSteinEncoder
from category_encoders.one_hot import OneHotEncoder

TEST = True

Read the CSV files and do some preprocessing.

In [29]:
%%time
train = pd.read_csv('/kaggle/input/cat-in-the-dat/train.csv')
test = pd.read_csv('/kaggle/input/cat-in-the-dat/test.csv')
target = train['target']
train_id = train['id']
test_id = test['id']
train.drop(['target', 'id'], axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)
print(train.dtypes)
train.head()

bin_0     int64
bin_1     int64
bin_2     int64
bin_3    object
bin_4    object
nom_0    object
nom_1    object
nom_2    object
nom_3    object
nom_4    object
nom_5    object
nom_6    object
nom_7    object
nom_8    object
nom_9    object
ord_0     int64
ord_1    object
ord_2    object
ord_3    object
ord_4    object
ord_5    object
day       int64
month     int64
dtype: object
CPU times: user 2.34 s, sys: 69.8 ms, total: 2.41 s
Wall time: 2.39 s


Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,T,Y,Green,Triangle,Snake,Finland,Bassoon,...,c389000ab,2f4cb3d51,2,Grandmaster,Cold,h,D,kr,2,2
1,0,1,0,T,Y,Green,Trapezoid,Hamster,Russia,Piano,...,4cd920251,f83c56c21,1,Grandmaster,Hot,a,A,bF,7,8
2,0,0,0,F,Y,Blue,Trapezoid,Lion,Russia,Theremin,...,de9c9f684,ae6800dd0,1,Expert,Lava Hot,h,R,Jc,7,2
3,0,1,0,F,Y,Red,Trapezoid,Snake,Canada,Oboe,...,4ade6ab69,8270f0d71,1,Grandmaster,Boiling Hot,i,D,kW,2,1
4,0,0,0,F,N,Red,Trapezoid,Lion,Canada,Oboe,...,cb43ab175,b164b72a7,1,Grandmaster,Freezing,a,R,qP,7,8


In [4]:
feature_list = list(train.columns) # you can customize this later.
feature_list

['bin_0',
 'bin_1',
 'bin_2',
 'bin_3',
 'bin_4',
 'nom_0',
 'nom_1',
 'nom_2',
 'nom_3',
 'nom_4',
 'nom_5',
 'nom_6',
 'nom_7',
 'nom_8',
 'nom_9',
 'ord_0',
 'ord_1',
 'ord_2',
 'ord_3',
 'ord_4',
 'ord_5',
 'day',
 'month']

### Notation

- $y$ and $y^+$: the total number of observations and the total number of positive observations (target = 1);
- $x_i$, $y_i$: the i-th value of the category and of the target;
- $n$ and $n^+$: the number of observations and the number of positive observations (target = 1) for a given value of a categorical column;
- $a$: a regularization hyperparameter (selected by the user);
- prior: the average value of the target.

## 1. Label Encoder (LE), Ordinal Encoder (OE)

This is one of the most common encoding methods: it converts categorical data into numbers.

The code is very simple. To encode a specific column, you can proceed as follows:

``` python
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()

train[column_name] = label.fit_transform(train[column_name])
```

The idea is simple: every occurrence of the same category is mapped to the same number.

For n distinct categories, the labels therefore range from 0 to n-1.

The disadvantage is that the labels are assigned in an arbitrary order (for example, the order in which they appear in the data), which imposes an unintended order between categories and can add noise. In other words, the data is treated as ordinal (ordered) data, which can lead to unintended consequences.

With `Category-Encoders`, the code looks like this:

In [5]:
%%time
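# No `cols` given, so only object columns are encoded (this matches the output below).
# Note: a positional `feature_list` would not be read as `cols`; use cols=feature_list to encode every column.
LE_encoder = OrdinalEncoder()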
train_le = LE_encoder.fit_transform(train)
test_le = LE_encoder.transform(test)
train_le.head()

CPU times: user 6.33 s, sys: 533 ms, total: 6.86 s
Wall time: 5.63 s


Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,1,1,1,1,1,1,1,...,1,1,2,1,1,1,1,1,2,2
1,0,1,0,1,1,1,2,2,2,2,...,2,2,1,1,2,2,2,2,7,8
2,0,0,0,2,1,2,2,3,2,3,...,3,3,1,2,3,1,3,3,7,2
3,0,1,0,2,1,3,2,1,3,4,...,4,4,1,1,4,3,1,4,2,1
4,0,0,0,2,2,3,2,3,3,4,...,5,5,1,1,5,2,3,5,7,8


## 2. One-Hot Encoder (OHE, dummy encoder)


So what can you do to give each category its own value without imposing an order?

You can create one column per category value. If the Label Encoder produces N distinct labels, OHE creates N columns.

In each row, only the column matching that row's category is set to 1, hence the name one-hot encoding. It is also called dummy encoding, because it creates dummy variables.


In this competition:

``` python
traintest = pd.concat([train, test])
dummies = pd.get_dummies(traintest, columns=traintest.columns, drop_first=True, sparse=True)
train_ohe = dummies.iloc[:train.shape[0], :]
test_ohe = dummies.iloc[train.shape[0]:, :]
train_ohe = train_ohe.sparse.to_coo().tocsr()
test_ohe = test_ohe.sparse.to_coo().tocsr()
```

With `Category-Encoders`, the code looks like this:

In [6]:
# %%time
# This did not run: one-hot encoding blows up the dimensionality beyond what fits in RAM.
# OHE_encoder = OneHotEncoder(cols=feature_list)
# train_ohe = OHE_encoder.fit_transform(train)
# test_ohe = OHE_encoder.transform(test)

## 3. Sum Encoder (Deviation Encoder, Effect Encoder)

**Sum Encoder** compares the mean of the dependent variable (target) for a given level of a categorical column to the overall mean of the target. 

Sum Encoding is very similar to OHE, and both are commonly used in linear-regression-type models.
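
To make the comparison with OHE concrete, here is a minimal sketch of sum (deviation) coding in plain Python. The helper `sum_code` is hypothetical and is not the `category_encoders` implementation (whose column layout may differ); it assumes the common convention in which the last level is the reference and is coded as -1 in every column.

```python
def sum_code(levels):
    """Return {level: row} for sum (deviation) coding of K levels into K-1 columns.

    Assumed convention: the last level is the reference and gets -1 everywhere.
    """
    k = len(levels)
    table = {}
    for i, level in enumerate(levels):
        if i < k - 1:
            # Same as one-hot with the last column dropped.
            table[level] = [1 if j == i else 0 for j in range(k - 1)]
        else:
            # The reference level is coded as -1 in every column.
            table[level] = [-1] * (k - 1)
    return table


codes = sum_code(["Red", "Green", "Blue"])
for level, row in codes.items():
    print(level, row)
```

Because every column sums to zero over the levels, a linear model's coefficient for a level can be read as that level's deviation from the overall mean, which is the comparison described above.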

With `Category-Encoders`, the code looks like this:

In [7]:
# %%time
# This did not run either: this encoder also blows up the dimensionality beyond what fits in RAM.
# SE_encoder = SumEncoder(cols=feature_list)
# train_se = SE_encoder.fit_transform(train[feature_list], target)
# test_se = SE_encoder.transform(test[feature_list])

## 4. Helmert Encoder

**Helmert Encoding** is a third commonly used type of categorical encoding for regression along with OHE and Sum Encoding. 

It compares each level of a categorical variable to the mean of the subsequent levels. 

This type of encoding can be useful in certain situations where levels of the categorical variable are ordered. (not this dataset)

With `Category-Encoders`, the code looks like this:

In [8]:
# %%time
# This did not run either: this encoder also blows up the dimensionality beyond what fits in RAM.
# HE_encoder = HelmertEncoder(cols=feature_list)
# train_he = HE_encoder.fit_transform(train[feature_list], target)
# test_he = HE_encoder.transform(test[feature_list])

## 5. Frequency Encoder

This method encodes each category by its frequency.

It creates a new feature holding the number of times each category occurs in the training data.

I will not run it separately on this dataset.
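
Since no code is run for this encoder, here is a minimal sketch of the idea in plain Python. The helper names and the toy rows are hypothetical (not taken from the competition data), and mapping unseen test categories to 0 is an assumption of this sketch.

```python
from collections import Counter


def fit_frequency(train_values):
    """Count how often each category appears in the training column."""
    return Counter(train_values)


def transform_frequency(values, counts):
    """Replace each category by its training count (0 for unseen categories)."""
    return [counts.get(v, 0) for v in values]


# Toy column, made up for illustration.
train_col = ["Red", "Green", "Red", "Blue", "Red", "Green"]
test_col = ["Green", "Blue", "Purple"]

counts = fit_frequency(train_col)
train_freq = transform_frequency(train_col, counts)
test_freq = transform_frequency(test_col, counts)
print(train_freq)  # [3, 2, 3, 1, 3, 2]
print(test_freq)   # [2, 1, 0]
```

With pandas, the same idea is usually written as `train[col].map(train[col].value_counts())`.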

## 6. Target Encoder

Many kernels in this competition are working with this encoder.

The encoded category values are calculated according to the following formulas:

$$s = \frac{1}{1+\exp\left(-\frac{n-\text{mdl}}{a}\right)}$$

$$\hat{x}^k = \text{prior} \cdot (1-s) + s \cdot \frac{n^{+}}{n}$$

- mdl means **'min data in leaf'**
- $a$ means **'smooth parameter, power of regularization'**
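
The two formulas can be checked by hand with a small sketch in plain Python. The function name, the default values `mdl=1` and `a=1.0`, and the example counts are assumptions made for illustration; they are not claimed to be the library's defaults.

```python
import math


def target_encode_value(n, n_pos, prior, mdl=1, a=1.0):
    """Smoothed target mean for one category value.

    n      -- number of rows with this category value
    n_pos  -- number of those rows with target == 1
    prior  -- overall mean of the target
    mdl, a -- 'min data in leaf' and smoothing strength
    """
    s = 1.0 / (1.0 + math.exp(-(n - mdl) / a))
    return prior * (1.0 - s) + s * (n_pos / n)


prior = 0.3
# A category seen once: s = 0.5, so the value sits halfway between prior and 1.0.
rare = target_encode_value(n=1, n_pos=1, prior=prior)
# A category seen 1000 times: s is practically 1, so the value is the category mean.
common = target_encode_value(n=1000, n_pos=400, prior=prior)
print(rare, common)
```

The rare category is pulled strongly toward the prior, while the frequent one keeps its own mean; larger `mdl` or `a` extends that pull to more frequent categories.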

Target Encoder is powerful, but it has a huge disadvantage:

> **target leakage**: it uses information about the target.

To reduce the effect of target leakage, you can:

- Increase regularization
- Add random noise to the representation of the category in the train dataset (a form of augmentation)
- Use double validation (a separate, additional validation split)

Let's use it while being careful about overfitting.

With `Category-Encoders`, the code looks like this:

In [32]:
%%time

# TE_encoder = TargetEncoder(cols= feature_list) #this will encode all the columns in feature_list including numeric/int64 columns.
TE_encoder = TargetEncoder() #this will encode only string or object type columns
train_te = TE_encoder.fit_transform(train[feature_list], target)
test_te = TE_encoder.transform(test[feature_list])

train_te.head()

CPU times: user 12.9 s, sys: 1.24 s, total: 14.1 s
Wall time: 10.8 s


Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,0.302537,0.290107,0.327145,0.360978,0.307162,0.242813,0.237743,...,0.372694,0.368421,2,0.403885,0.257877,0.306993,0.208354,0.401186,2,2
1,0,1,0,0.302537,0.290107,0.327145,0.290054,0.359209,0.289954,0.304164,...,0.189189,0.076924,1,0.403885,0.326315,0.206599,0.186877,0.30388,7,8
2,0,0,0,0.309384,0.290107,0.24179,0.290054,0.293085,0.289954,0.353951,...,0.223022,0.172414,1,0.317175,0.403126,0.306993,0.351864,0.206843,7,2
3,0,1,0,0.309384,0.290107,0.351052,0.290054,0.307162,0.339793,0.329472,...,0.325123,0.227273,1,0.403885,0.360961,0.330148,0.208354,0.355985,2,1
4,0,0,0,0.309384,0.333773,0.351052,0.290054,0.293085,0.339793,0.329472,...,0.376812,0.2,1,0.403885,0.225214,0.206599,0.351864,0.404345,7,8


In [20]:
print(target.shape[0], target.sum(), target.shape[0]- target.sum())
# train[feature_list].groupby("bin_0")
train_te.bin_4.value_counts()

300000 91764 208236


0.290107    191633
0.333773    108367
Name: bin_4, dtype: int64

## 7. M-Estimate Encoder

**M-Estimate Encoder** is a **simplified version of Target Encoder**. It has only one hyperparameter, $m$:

$$\hat{x}^k = \frac{n^+ + \text{prior} \cdot m}{n + m}$$

A higher value of $m$ results in stronger shrinking toward the prior. Recommended values for $m$ are in the range of 1 to 100.
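
A short sketch of the effect of $m$, assuming the usual m-estimate form $(n^+ + \text{prior} \cdot m) / (n + m)$ with the category count $n$ in the denominator. The function name and the example counts are hypothetical.

```python
def m_estimate(n, n_pos, prior, m=1.0):
    """M-estimate of the target mean for one category value."""
    return (n_pos + prior * m) / (n + m)


prior = 0.3
# A category with 20 rows, 12 of them positive (raw mean 0.6).
shrunk = [m_estimate(n=20, n_pos=12, prior=prior, m=m) for m in (1, 10, 100)]
for m, value in zip((1, 10, 100), shrunk):
    print(m, round(value, 4))
```

As $m$ grows, the encoded value moves from the raw category mean (0.6) toward the prior (0.3).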

With `Category-Encoders`, the code looks like this:

In [22]:
%%time
MEE_encoder = MEstimateEncoder()
train_mee = MEE_encoder.fit_transform(train[feature_list], target)
test_mee = MEE_encoder.transform(test[feature_list])
test_mee.head()

CPU times: user 12.2 s, sys: 1.02 s, total: 13.2 s
Wall time: 10.4 s


Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,1,0.302537,0.290107,0.241791,0.360976,0.319017,0.242815,0.304164,...,0.348078,0.187212,2,0.242055,0.288797,0.342474,0.324946,0.30059,5,11
1,0,0,0,0.302537,0.333773,0.351051,0.338931,0.293085,0.339792,0.304164,...,0.223069,0.289258,1,0.355076,0.403125,0.379275,0.186884,0.244833,7,5
2,1,0,1,0.309384,0.290107,0.241791,0.338931,0.245141,0.311723,0.304164,...,0.187194,0.108823,2,0.317175,0.225215,0.206602,0.236894,0.417684,1,12
3,0,0,1,0.302537,0.290107,0.351051,0.310626,0.335367,0.311723,0.304164,...,0.359772,0.319457,1,0.278534,0.403125,0.220467,0.336262,0.365124,2,3
4,0,1,1,0.309384,0.333773,0.351051,0.290055,0.245141,0.311723,0.304164,...,0.374739,0.294771,3,0.403884,0.403125,0.379275,0.40947,0.389782,4,11


## 8. Weight of Evidence Encoder 

**Weight Of Evidence** is a commonly used target-based encoder in credit scoring. 

It is a measure of the “strength” of a grouping for separating good and bad risk (default). 

It is calculated from the basic odds ratio:

``` python
import math

def weight_of_evidence(a, b):
    # a = distribution of good credit outcomes
    # b = distribution of bad credit outcomes
    return math.log(a / b)  # WoE = ln(a / b)
```

However, if we use the formula as is, it might lead to **target leakage** (and overfitting).

To avoid that, a regularization parameter $a$ is introduced, and WoE is calculated in the following way:

$$\text{nominator} = \frac{n^+ + a}{y^+ + 2a}$$

$$\text{denominator} = \frac{(n - n^+) + a}{(y - y^+) + 2a}$$

$$\hat{x}^k = \ln\left(\frac{\text{nominator}}{\text{denominator}}\right)$$
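
As a worked example, here is a minimal sketch of a regularized WoE in plain Python. It assumes the nominator is the category's smoothed share of all positives, $(n^+ + a)/(y^+ + 2a)$, and the denominator is its smoothed share of all negatives, $((n - n^+) + a)/((y - y^+) + 2a)$. The function name and the counts are hypothetical.

```python
import math


def woe(n, n_pos, y, y_pos, a=1.0):
    """Regularized weight of evidence for one category value.

    n, n_pos -- rows / positive rows with this category value
    y, y_pos -- rows / positive rows in the whole training set
    a        -- regularization strength
    """
    nominator = (n_pos + a) / (y_pos + 2 * a)
    denominator = ((n - n_pos) + a) / ((y - y_pos) + 2 * a)
    return math.log(nominator / denominator)


# Training set of 1000 rows with 300 positives (made-up numbers).
neutral = woe(n=100, n_pos=30, y=1000, y_pos=300, a=0.0)  # same rate as overall
high = woe(n=100, n_pos=60, y=1000, y_pos=300)            # more positives than average
low = woe(n=100, n_pos=10, y=1000, y_pos=300)             # fewer positives than average
print(neutral, high, low)
```

A category whose positive rate equals the overall rate gets a WoE of about zero; higher rates give positive values and lower rates negative ones. With `a=0` a category that has no positives (or no negatives) would make the logarithm undefined, which is one practical reason for the regularization term.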

With `Category-Encoders`, the code looks like this:

In [24]:
%%time
WOE_encoder = WOEEncoder()
train_woe = WOE_encoder.fit_transform(train[feature_list], target)
test_woe = WOE_encoder.transform(test[feature_list])
train_woe.head()

CPU times: user 12.4 s, sys: 486 ms, total: 12.8 s
Wall time: 9.95 s


Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,-0.015794,-0.075416,0.098327,0.248359,0.006058,-0.317803,-0.345614,...,0.302749,0.333932,2,0.430146,-0.237516,0.005297,-0.514545,0.420532,2,2
1,0,1,0,-0.015794,-0.075416,0.098327,-0.07566,0.240683,-0.076148,-0.008087,...,-0.600377,-1.052362,1,0.430146,0.094611,-0.526005,-0.650766,-0.008737,7,8
2,0,0,0,0.016454,-0.075416,-0.32342,-0.07566,-0.06099,-0.076148,0.217747,...,-0.417323,-0.607677,1,0.052724,0.426997,0.005297,0.20866,-0.523234,7,2
3,0,1,0,0.016454,-0.075416,0.205037,-0.07566,0.006058,0.155252,0.108884,...,0.096879,-0.338013,1,0.430146,0.248265,0.11198,-0.514545,0.227089,2,1
4,0,0,0,0.016454,0.128285,0.205037,-0.07566,-0.06099,0.155252,0.108884,...,0.321353,-0.468414,1,0.430146,-0.416062,-0.526005,0.20866,0.432324,7,8


## 9. James-Stein Encoder

**James-Stein Encoder** is a target-based encoder.

The idea behind James-Stein Encoder is simple. Estimation of the mean target for category k could be calculated according to the following formula:

$$\hat{x}^k = (1-B) * \frac{n^+}{n} + B * \frac{y^+}{y} $$

One way to select B is to tune it like a hyperparameter via cross-validation, but Charles Stein came up with another solution to the problem:

$$B = \frac{Var[y^k]}{Var[y^k] + Var[y]}$$

This seems quite fair, but the James-Stein estimator has a big disadvantage: it is defined only for normal distributions (which is not the case for any classification task).

To work around that, we can either convert binary targets with a log-odds ratio, as is done in the WoE Encoder (this is the default because it is simple), or use a beta distribution.

With `Category-Encoders`, the code looks like this:

In [25]:
%%time
JSE_encoder = JamesSteinEncoder()
train_jse = JSE_encoder.fit_transform(train[feature_list], target)
test_jse = JSE_encoder.transform(test[feature_list])
test_jse.head()

CPU times: user 12 s, sys: 417 ms, total: 12.5 s
Wall time: 9.88 s


Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,1,0.302537,0.290107,0.24179,0.343763,0.315031,0.260374,0.304449,...,0.326349,0.234325,2,0.256848,0.293836,0.32633,0.316033,0.303194,5,11
1,0,0,0,0.302537,0.333773,0.351052,0.328749,0.296876,0.329339,0.304449,...,0.260119,0.297338,1,0.342313,0.37213,0.346197,0.232549,0.272939,7,5
2,1,0,1,0.309384,0.290107,0.24179,0.328749,0.262111,0.309961,0.304449,...,0.236455,0.155348,2,0.314323,0.247048,0.243675,0.26608,0.358623,1,12
3,0,0,1,0.302537,0.290107,0.351052,0.309196,0.326306,0.309961,0.304449,...,0.331938,0.31271,1,0.285182,0.37213,0.253214,0.321938,0.334532,2,3
4,0,1,1,0.309384,0.333773,0.351052,0.29473,0.262111,0.309961,0.304449,...,0.338701,0.30011,3,0.377845,0.37213,0.346197,0.358728,0.345933,4,11


## 10. Leave-one-out Encoder (LOO or LOOE)

**Leave-one-out Encoding** is another example of target-based encoders.

This encoder calculates the mean target of category $k$ for observation $i$ as if observation $i$ were removed from the dataset:

$$\hat{x}^k_i = \frac{\sum_{j \neq i} y_j \cdot (x_j == k)}{\sum_{j \neq i} (x_j == k)}$$

While encoding the test dataset, a category is replaced with the mean target of category $k$ in the train dataset:

$$\hat{x}^k = \frac{\sum_j y_j \cdot (x_j == k)}{\sum_j (x_j == k)}$$
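
The train/test asymmetry is easy to see in a small sketch in plain Python. The helper names and the toy data are hypothetical, and falling back to the global mean for categories that occur only once (or never) is an assumption of this sketch.

```python
from collections import defaultdict


def loo_fit(categories, targets):
    """Collect per-category target sums and counts from the training data."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return sums, counts


def loo_transform_train(categories, targets, sums, counts, prior):
    """Encode training rows: category mean with the row's own target left out."""
    encoded = []
    for c, t in zip(categories, targets):
        if counts[c] > 1:
            encoded.append((sums[c] - t) / (counts[c] - 1))
        else:
            encoded.append(prior)  # assumption: singleton categories get the global mean
    return encoded


def loo_transform_test(categories, sums, counts, prior):
    """Encode test rows: plain category mean from the training data."""
    return [sums[c] / counts[c] if counts.get(c, 0) > 0 else prior for c in categories]


cats = ["a", "a", "a", "b", "b", "c"]
ys = [1, 0, 1, 0, 1, 1]
prior = sum(ys) / len(ys)

sums, counts = loo_fit(cats, ys)
train_enc = loo_transform_train(cats, ys, sums, counts, prior)
test_enc = loo_transform_test(["a", "b", "z"], sums, counts, prior)
print(train_enc)
print(test_enc)
```

Note that two training rows of the same category get different encodings when their targets differ, while every test row of that category gets the same value.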

With `Category-Encoders`, the code looks like this:

In [33]:
%%time
LOOE_encoder = LeaveOneOutEncoder()
train_looe = LOOE_encoder.fit_transform(train[feature_list], target)
test_looe = LOOE_encoder.transform(test[feature_list])
test_looe.head()

CPU times: user 13.5 s, sys: 810 ms, total: 14.3 s
Wall time: 11.8 s


Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,1,0.302537,0.290107,0.24179,0.360978,0.319017,0.242813,0.304164,...,0.348315,0.181818,2,0.242055,0.288796,0.342476,0.324947,0.300588,5,11
1,0,0,0,0.302537,0.333773,0.351052,0.338932,0.293085,0.339793,0.304164,...,0.222707,0.288889,1,0.355078,0.403126,0.379277,0.186877,0.244795,7,5
2,1,0,1,0.309384,0.290107,0.24179,0.338932,0.245139,0.311724,0.304164,...,0.186667,0.090909,2,0.317175,0.225214,0.206599,0.236891,0.417726,1,12
3,0,0,1,0.302537,0.290107,0.351052,0.310627,0.335367,0.311724,0.304164,...,0.360656,0.32,1,0.278533,0.403126,0.22046,0.336264,0.365151,2,3
4,0,1,1,0.309384,0.333773,0.351052,0.290054,0.245139,0.311724,0.304164,...,0.375,0.294118,3,0.403885,0.403126,0.379277,0.409481,0.389864,4,11


## 11. Catboost Encoder

**Catboost** is a recently created target-based categorical encoder. 

It is intended to overcome target leakage problems inherent in LOO. 

With `Category-Encoders`, the code looks like this:

In [27]:
%%time
CBE_encoder = CatBoostEncoder()
train_cbe = CBE_encoder.fit_transform(train[feature_list], target)
test_cbe = CBE_encoder.transform(test[feature_list])
test_cbe.head()

CPU times: user 19.9 s, sys: 1.12 s, total: 21 s
Wall time: 17.7 s


Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,1,0.302537,0.290107,0.241791,0.360976,0.319017,0.242815,0.304164,...,0.348078,0.187212,2,0.242055,0.288797,0.342474,0.324946,0.30059,5,11
1,0,0,0,0.302537,0.333773,0.351051,0.338931,0.293085,0.339792,0.304164,...,0.223069,0.289258,1,0.355076,0.403125,0.379275,0.186884,0.244833,7,5
2,1,0,1,0.309384,0.290107,0.241791,0.338931,0.245141,0.311723,0.304164,...,0.187194,0.108823,2,0.317175,0.225215,0.206602,0.236894,0.417684,1,12
3,0,0,1,0.302537,0.290107,0.351051,0.310626,0.335367,0.311723,0.304164,...,0.359772,0.319457,1,0.278534,0.403125,0.220467,0.336262,0.365124,2,3
4,0,1,1,0.309384,0.333773,0.351051,0.290055,0.245141,0.311723,0.304164,...,0.374739,0.294771,3,0.403884,0.403125,0.379275,0.40947,0.389782,4,11


## Validation

Validation is done in two ways: a single LR on a hold-out split, and LR with cross-validation.

- I will add OneHotEncoder and others later.
- More folds gave a better score (in my experience).
- You can try other solvers and other parameters.

### Single LR

In [34]:
%%time
import gc
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score as auc
from sklearn.linear_model import LogisticRegression

encoder_list = [ OrdinalEncoder(), WOEEncoder(), TargetEncoder(), MEstimateEncoder(), JamesSteinEncoder(), LeaveOneOutEncoder() ,CatBoostEncoder()]

X_train, X_val, y_train, y_val = train_test_split(train, target, test_size=0.2, random_state=97)

for encoder in encoder_list:
    print("Test {} : ".format(str(encoder).split('(')[0]), end=" ")
    train_enc = encoder.fit_transform(X_train[feature_list], y_train)
    #test_enc = encoder.transform(test[feature_list])
    val_enc = encoder.transform(X_val[feature_list])
    lr = LogisticRegression(C=0.1, solver="lbfgs", max_iter=1000)
    lr.fit(train_enc, y_train)
    lr_pred = lr.predict_proba(val_enc)[:, 1]
    score = auc(y_val, lr_pred)
    print("score: ", score)
    del train_enc
    del val_enc
    gc.collect()



Test OrdinalEncoder :  score:  0.6058952245541614
Test WOEEncoder :  score:  0.7814115175120097
Test TargetEncoder :  score:  0.7790232741835907
Test MEstimateEncoder :  score:  0.7792025608680453
Test JamesSteinEncoder :  score:  0.7719623937439537
Test LeaveOneOutEncoder :  score:  0.79576549204064
Test CatBoostEncoder :  score:  0.7917094405644212
CPU times: user 4min 55s, sys: 2.02 s, total: 4min 57s
Wall time: 3min 15s


### LR with CrossValidation

In [36]:
%%time
from sklearn.model_selection import KFold
import numpy as np

# CV function original : @Peter Hurford : Why Not Logistic Regression? https://www.kaggle.com/peterhurford/why-not-logistic-regression

def run_cv_model(train, test, target, model_fn, params={}, label='model'):
    kf = KFold(n_splits=5)
    fold_splits = kf.split(train, target)

    cv_scores = []
    pred_full_test = 0
    pred_train = np.zeros((train.shape[0]))
    i = 1
    for dev_index, val_index in fold_splits:
        print('Started {} fold {}/5'.format(label, i))
        dev_X, val_X = train.iloc[dev_index], train.iloc[val_index]
        dev_y, val_y = target[dev_index], target[val_index]
        pred_val_y, pred_test_y = model_fn(dev_X, dev_y, val_X, val_y, test, params)
        pred_full_test = pred_full_test + pred_test_y
        pred_train[val_index] = pred_val_y
        cv_score = auc(val_y, pred_val_y)
        cv_scores.append(cv_score)
        print(label + ' cv score {}: {}'.format(i, cv_score))
        i += 1
        
    print('{} cv scores : {}'.format(label, cv_scores))
    print('{} cv mean score : {}'.format(label, np.mean(cv_scores)))
    print('{} cv std score : {}'.format(label, np.std(cv_scores)))
    pred_full_test = pred_full_test / 5.0
    results = {'label': label, 'train': pred_train, 'test': pred_full_test, 'cv': cv_scores}
    return results


def runLR(train_X, train_y, test_X, test_y, test_X2, params):
    model = LogisticRegression(**params)
    model.fit(train_X, train_y)
    pred_test_y = model.predict_proba(test_X)[:, 1]
    pred_test_y2 = model.predict_proba(test_X2)[:, 1]
    return pred_test_y, pred_test_y2


CPU times: user 107 µs, sys: 0 ns, total: 107 µs
Wall time: 116 µs


In [37]:
if TEST:

    lr_params = {'solver': 'lbfgs', 'C': 0.1}

    results = list()

    for encoder in  [ OrdinalEncoder(), WOEEncoder(), TargetEncoder(), MEstimateEncoder(), JamesSteinEncoder(), LeaveOneOutEncoder() ,CatBoostEncoder()]:
        train_enc = encoder.fit_transform(train[feature_list], target)
        test_enc = encoder.transform(test[feature_list])
        result = run_cv_model(train_enc, test_enc, target, runLR, lr_params, str(encoder).split('(')[0])
        results.append(result)
    results = pd.DataFrame(results)
    results['cv_mean'] = results['cv'].apply(lambda l : np.mean(l))
    results['cv_std'] = results['cv'].apply(lambda l : np.std(l))
    results[['label','cv_mean','cv_std']].head(8)

Started OrdinalEncoder fold 1/5
OrdinalEncoder cv score 1: 0.5478087630083479
Started OrdinalEncoder fold 2/5
OrdinalEncoder cv score 2: 0.5785368145198041
Started OrdinalEncoder fold 3/5
OrdinalEncoder cv score 3: 0.5794711812720389
Started OrdinalEncoder fold 4/5
OrdinalEncoder cv score 4: 0.5776419118948795
Started OrdinalEncoder fold 5/5
OrdinalEncoder cv score 5: 0.5530765797041557
OrdinalEncoder cv scores : [0.5478087630083479, 0.5785368145198041, 0.5794711812720389, 0.5776419118948795, 0.5530765797041557]
OrdinalEncoder cv mean score : 0.5673070500798453
OrdinalEncoder cv std score : 0.01388216518854039
Started WOEEncoder fold 1/5
WOEEncoder cv score 1: 0.8296010840892494
Started WOEEncoder fold 2/5
WOEEncoder cv score 2: 0.8276778463251253
Started WOEEncoder fold 3/5
WOEEncoder cv score 3: 0.8343609323413513
Started WOEEncoder fold 4/5
WOEEncoder cv score 4: 0.8315878216378239
Started WOEEncoder fold 5/5
WOEEncoder cv score 5: 0.8307581723870281
WOEEncoder cv scores : [0.8296010840892494, 0.8276778463251253, 0.8343609323413513, 0.8315878216378239, 0.8307581723870281]
WOEEncoder cv mean score : 0.8307971713561155
WOEEncoder cv std score : 0.002213045618434726
Started TargetEncoder fold 1/5
TargetEncoder cv score 1: 0.8228825225372465
Started TargetEncoder fold 2/5
TargetEncoder cv score 2: 0.819640037288339
Started TargetEncoder fold 3/5
TargetEncoder cv score 3: 0.8270373654953843
Started TargetEncoder fold 4/5
TargetEncoder cv score 4: 0.8279529191167299
Started TargetEncoder fold 5/5
TargetEncoder cv score 5: 0.8270456093348808
TargetEncoder cv scores : [0.8228825225372465, 0.819640037288339, 0.8270373654953843, 0.8279529191167299, 0.8270456093348808]
TargetEncoder cv mean score : 0.8249116907545162
TargetEncoder cv std score : 0.003169511807334321
Started MEstimateEncoder fold 1/5
MEstimateEncoder cv score 1: 0.8239353389041686
Started MEstimateEncoder fold 2/5
MEstimateEncoder cv score 2: 0.8225460353941083
Started MEstimateEncoder fold 3/5
MEstimateEncoder cv score 3: 0.8250872944298472
Started MEstimateEncoder fold 4/5
MEstimateEncoder cv score 4: 0.8290057049907481
Started MEstimateEncoder fold 5/5
MEstimateEncoder cv score 5: 0.8257572319360416
MEstimateEncoder cv scores : [0.8239353389041686, 0.8225460353941083, 0.8250872944298472, 0.8290057049907481, 0.8257572319360416]
MEstimateEncoder cv mean score : 0.8252663211309826
MEstimateEncoder cv std score : 0.002164601755855072
Started JamesSteinEncoder fold 1/5
JamesSteinEncoder cv score 1: 0.825785651337012
Started JamesSteinEncoder fold 2/5
JamesSteinEncoder cv score 2: 0.8213704510704314
Started JamesSteinEncoder fold 3/5
JamesSteinEncoder cv score 3: 0.8279921606609264
Started JamesSteinEncoder fold 4/5
JamesSteinEncoder cv score 4: 0.8144299433590861
Started JamesSteinEncoder fold 5/5
JamesSteinEncoder cv score 5: 0.8265248336422055
JamesSteinEncoder cv scores : [0.825785651337012, 0.8213704510704314, 0.8279921606609264, 0.8144299433590861, 0.8265248336422055]
JamesSteinEncoder cv mean score : 0.8232206080139323
JamesSteinEncoder cv std score : 0.004918616364474788
Started LeaveOneOutEncoder fold 1/5
LeaveOneOutEncoder cv score 1: 0.7870271221339217
Started LeaveOneOutEncoder fold 2/5
LeaveOneOutEncoder cv score 2: 0.7883871851867641
Started LeaveOneOutEncoder fold 3/5
LeaveOneOutEncoder cv score 3: 0.7940299539122266
Started LeaveOneOutEncoder fold 4/5
LeaveOneOutEncoder cv score 4: 0.7901332335441938
Started LeaveOneOutEncoder fold 5/5
LeaveOneOutEncoder cv score 5: 0.7883777050213145
LeaveOneOutEncoder cv scores : [0.7870271221339217, 0.7883871851867641, 0.7940299539122266, 0.7901332335441938, 0.7883777050213145]
LeaveOneOutEncoder cv mean score : 0.7895910399596842
LeaveOneOutEncoder cv std score : 0.0024287055632732923
Started CatBoostEncoder fold 1/5
CatBoostEncoder cv score 1: 0.7297166731892779
Started CatBoostEncoder fold 2/5
CatBoostEncoder cv score 2: 0.7805421729205533
Started CatBoostEncoder fold 3/5
CatBoostEncoder cv score 3: 0.7839670066128355
Started CatBoostEncoder fold 4/5
CatBoostEncoder cv score 4: 0.7795226775861617
Started CatBoostEncoder fold 5/5
CatBoostEncoder cv score 5: 0.786081368408643
CatBoostEncoder cv scores : [0.7297166731892779, 0.7805421729205533, 0.7839670066128355, 0.7795226775861617, 0.786081368408643]
CatBoostEncoder cv mean score : 0.7719659797434942
CatBoostEncoder cv std score : 0.021255246501015294




In [None]:
results

## Submit

Even cross-validation did not solve the overfitting problem of the target-based encoders.
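
One possible contributor: in the loop above, each encoder is fit on the full training set before `run_cv_model` splits it, so validation rows have already contributed their targets to the statistics used to encode them. A minimal sketch of the alternative, computing the encoding inside each fold, is shown below in plain Python. A simple mean target encoding stands in for the library encoders, and the helper names, the interleaved fold assignment, and the toy data are all hypothetical.

```python
from collections import defaultdict


def fit_mean_encoding(categories, targets):
    """Per-category target mean, plus the global mean as a fallback."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    prior = sum(targets) / len(targets)
    return {c: sums[c] / counts[c] for c in counts}, prior


def out_of_fold_encode(categories, targets, n_splits=5):
    """Encode every row with statistics computed from the other folds only."""
    n = len(categories)
    encoded = [None] * n
    for fold in range(n_splits):
        val_idx = [i for i in range(n) if i % n_splits == fold]
        dev_idx = [i for i in range(n) if i % n_splits != fold]
        means, prior = fit_mean_encoding(
            [categories[i] for i in dev_idx], [targets[i] for i in dev_idx]
        )
        for i in val_idx:
            # The row's own target never enters its encoding.
            encoded[i] = means.get(categories[i], prior)
    return encoded


cats = ["a", "b"] * 5
ys = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
oof = out_of_fold_encode(cats, ys, n_splits=5)
print(oof)
```

Whether this closes the gap between the CV and LB scores on this dataset has not been checked here; it is only a sketch of how the encoding step could be moved inside the folds.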

In [None]:
if TEST:
    for idx, label in enumerate(results['label']):
        sub_df = pd.DataFrame({'id': test_id, 'target' : results.iloc[idx]['test']})
        sub_df.to_csv("LR_{}.csv".format(label), index=False)

