![](https://miro.medium.com/max/5120/1*hSDIm8k315XGjxZ9gqnhvA.jpeg)

# Categorical Feature Encoding Challenge
[Crislânio Macêdo](https://medium.com/sapere-aude-tech) - Last updated March 7, 2021


- [**Github**](https://github.com/crislanio)
- [**Linkedin**](https://www.linkedin.com/in/crislanio/)
- [**Medium**](https://medium.com/sapere-aude-tech)
- [**Quora**](https://www.quora.com/profile/Crislanio)
- [**Ensina.AI**](https://medium.com/ensina-ai/an%C3%A1lise-dos-dados-abertos-do-governo-federal-ba65af8c421c)
- [**Hackerrank**](https://www.hackerrank.com/crislanio_ufc?hr_r=1)
- [**Blog**](https://medium.com/@crislanio.ufc)
- [**Personal Page**](https://crislanio.wordpress.com/about)
- [**Twitter**](https://twitter.com/crs_macedo)

----------
----------



# About this Competition
![](http://img08.deviantart.net/3e2f/i/2016/121/7/8/beerus__god_of_destruction_by_liloutehcat-da0wye6.png)

> #### In this competition, you will be predicting the probability [0, 1] of a binary target column.

The data contains binary features (bin_*), nominal features (nom_*), and ordinal features (ord_*), as well as (potentially cyclical) day-of-week and month features. The string ordinal features ord_{3-5} are lexically ordered according to string.ascii_letters.

Since the purpose of this competition is to explore various encoding strategies, the data has been simplified: (1) there are no missing values, and (2) the test set does not contain any unseen feature values. (Of course, in real-world settings both of these factors are often important to consider!)
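
The note about string.ascii_letters ordering can be made concrete with a small sketch. It is only an illustration: the helper name `encode_ordinal`, the base-52 treatment of multi-character values, and the sample values are assumptions, not part of the competition material.

```python
import string

# Position of each letter in string.ascii_letters: 'a'..'z' -> 0..25, 'A'..'Z' -> 26..51
LETTER_RANK = {ch: i for i, ch in enumerate(string.ascii_letters)}

def encode_ordinal(value):
    """Map a letter string to an integer that preserves ascii_letters order.

    Each character is treated as a base-52 digit, so the ordering is only
    meaningful among values of the same length (an assumption made here).
    """
    rank = 0
    for ch in value:
        rank = rank * len(LETTER_RANK) + LETTER_RANK[ch]
    return rank

# Hypothetical sample values, not taken from the competition files
samples = ["b", "A", "a", "Z"]
print(sorted(samples, key=encode_ordinal))  # ['a', 'b', 'A', 'Z']
```

Note that this differs from plain ASCII sorting, where uppercase letters come before lowercase ones.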

#### Files
- train.csv - the training set
- test.csv - the test set; you must make predictions against this data
- sample_submission.csv - a sample submission file in the correct format

> #### Inspired by:
- https://www.kaggle.com/felipeleiteantunes/h2o-ai-from-linear-models-to-deep-learning (upvote this !) Not only useful but also valuable


#### H2O resources
- Instructions to download: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/downloading.html
- Documentation: https://h2o-release.s3.amazonaws.com/h2o/rel-turan/4/docs-website/h2o-py/docs/intro.html
- A booklet: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/PythonBooklet.pdf
- A presentation: https://pt.slideshare.net/0xdata/intro-to-h2o-in-python-data-science-la

And many more questions:
<html>
<body>

<p><font size="5" color="Blue">
If you find this kernel useful or interesting, please don't forget to upvote the kernel =)
</font></p>

</body>
</html>



In [None]:
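# Imports required by evalBinaryClassifier
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics import auc, confusion_matrix, roc_curve

def evalBinaryClassifier(model, x, y, labels=['Positives','Negatives']):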
    '''
    source: https://towardsdatascience.com/how-to-interpret-a-binary-logistic-regressor-with-scikit-learn-6d56c5783b49
    Visualize the performance of  a Logistic Regression Binary Classifier.
    
    Displays a labelled Confusion Matrix, distributions of the predicted
    probabilities for both classes, the ROC curve, and F1 score of a fitted
    Binary Logistic Classifier. Author: gregcondit.com/articles/logr-charts
    
    Parameters
    ----------
    model : fitted scikit-learn model with predict_proba & predict methods
        and classes_ attribute. Typically LogisticRegression or 
        LogisticRegressionCV
    
    x : {array-like, sparse matrix}, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples
        in the data to be tested, and n_features is the number of features
    
    y : array-like, shape (n_samples,)
        Target vector relative to x.
    
    labels: list, optional
        list of text labels for the two classes, with the positive label first
        
    Displays
    ----------
    3 Subplots
    
    Returns
    ----------
    F1: float
    '''
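    #validate that this is a 0/1 binary problem before predicting
    if len(model.classes_)!=2:
        raise ValueError('A binary class problem is required')
    if 1 not in model.classes_:
        raise ValueError('The positive class must be labelled 1')
    #model predicts probabilities; keep the column of the positive class
    p = model.predict_proba(x)
    pos_p = p[:, list(model.classes_).index(1)]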
    
    #FIGURE
    plt.figure(figsize=[15,4])
    
    #1 -- Confusion matrix
    cm = confusion_matrix(y,model.predict(x))
    plt.subplot(131)
    ax = sns.heatmap(cm, annot=True, cmap='Blues', cbar=False, 
                annot_kws={"size": 14}, fmt='g')
    cmlabels = ['True Negatives', 'False Positives',
              'False Negatives', 'True Positives']
    for i,t in enumerate(ax.texts):
        t.set_text(t.get_text() + "\n" + cmlabels[i])
    plt.title('Confusion Matrix', size=15)
    plt.xlabel('Predicted Values', size=13)
    plt.ylabel('True Values', size=13)
      
    #2 -- Distributions of Predicted Probabilities of both classes
    df = pd.DataFrame({'probPos':pos_p, 'target': y})
    plt.subplot(132)
    plt.hist(df[df.target==1].probPos, density=True, bins=25,
             alpha=.5, color='green',  label=labels[0])
    plt.hist(df[df.target==0].probPos, density=True, bins=25,
             alpha=.5, color='red', label=labels[1])
    plt.axvline(.5, color='blue', linestyle='--', label='Boundary')
    plt.xlim([0,1])
    plt.title('Distributions of Predictions', size=15)
    plt.xlabel('Positive Probability (predicted)', size=13)
    plt.ylabel('Samples (normalized scale)', size=13)
    plt.legend(loc="upper right")
    
    #3 -- ROC curve with annotated decision point
    fp_rates, tp_rates, _ = roc_curve(y, pos_p)  #use the positive-class column, whichever index it has
    roc_auc = auc(fp_rates, tp_rates)
    plt.subplot(133)
    plt.plot(fp_rates, tp_rates, color='green',
             lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], lw=1, linestyle='--', color='grey')
    #plot current decision point:
    tn, fp, fn, tp = [i for i in cm.ravel()]
    plt.plot(fp/(fp+tn), tp/(tp+fn), 'bo', markersize=8, label='Decision Point')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate', size=13)
    plt.ylabel('True Positive Rate', size=13)
    plt.title('ROC Curve', size=15)
    plt.legend(loc="lower right")
    plt.subplots_adjust(wspace=.3)
    plt.show()
    #Print and Return the F1 score (tn, fp, fn, tp were unpacked above)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    F1 = 2*(precision * recall) / (precision + recall)
    printout = (
        f'Precision: {round(precision,2)} | '
        f'Recall: {round(recall,2)} | '
        f'F1 Score: {round(F1,2)} | '
    )
    print(printout)
    return F1
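
The precision, recall, and F1 arithmetic at the end of the function divides by counts that can be zero (for example, when the model predicts no positives). A minimal standalone sketch of the same formulas with a guard is shown here; the helper name `f1_from_counts`, the 0.0 fallback, and the counts are assumptions chosen for illustration.

```python
def f1_from_counts(tn, fp, fn, tp):
    """Return (precision, recall, F1) from confusion-matrix counts.

    Undefined ratios fall back to 0.0; that fallback is a convention
    chosen for this sketch, not something the original function does.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts, in the same order that cm.ravel() yields them
precision, recall, f1 = f1_from_counts(tn=50, fp=10, fn=5, tp=35)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```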

In [None]:
conda install gxx_linux-64 gcc_linux-64 swig

# Load the H2O library and start up the H2O cluster locally on your machine

In [None]:
import h2o
h2o.init(ip="localhost", port=54323)

# Import Libraries

In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Load files

In [None]:

train_data = h2o.import_file("/kaggle/input/cat-in-the-dat/train.csv")
test_data = h2o.import_file("/kaggle/input/cat-in-the-dat/test.csv")

In [None]:
test_id = test_data['id']  # reuse the frame loaded above instead of importing test.csv again

### Import H2O GLM

In [None]:

from h2o.estimators.glm import H2OGeneralizedLinearEstimator

#### Train a default GLM
We first create an object of the class H2OGeneralizedLinearEstimator.

In [None]:
glm_fit1 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit1')

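Now that the glm_fit1 object is initialized, we prepare the data before training: convert the target to a factor so that H2O treats the problem as classification, split the frame, and build the list of predictor columns.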

In [None]:
train_data["target"] = train_data["target"].asfactor()

In [None]:
train, valid, test = train_data.split_frame(ratios=[0.7, 0.15], seed=42)  
y = 'target'
x = list(train_data.columns)
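
With ratios=[0.7, 0.15], the third frame receives the remaining 0.15 of the rows. As far as I understand H2O's documentation, split_frame assigns rows probabilistically, so the resulting sizes are approximate rather than exact. The pure-Python sketch below mimics that idea; the function `split_rows` is hypothetical and is not H2O's implementation.

```python
import random

def split_rows(n_rows, ratios, seed):
    """Assign each row index to a split by drawing one uniform number per row.

    `ratios` lists the fractions of all splits but the last; the last split
    receives the remaining probability mass (1 - sum(ratios)).
    """
    rng = random.Random(seed)
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        u = rng.random()
        cumulative = 0.0
        target = len(ratios)  # default: the leftover split
        for i, r in enumerate(ratios):
            cumulative += r
            if u < cumulative:
                target = i
                break
        splits[target].append(row)
    return splits

train_idx, valid_idx, test_idx = split_rows(10_000, [0.7, 0.15], seed=42)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The three sizes land near 7000, 1500, and 1500 but will rarely match those numbers exactly.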

In [None]:
id_var = 'id'
x.remove(id_var)  #remove the id column, which is not a predictor

In [None]:
x.remove(y)  #remove the response
print(x)

#### H2O Machine Learning
> Now that we have prepared the data, we can train some models. We will start by training a single model from each of the H2O supervised algos:

- Generalized Linear Model (GLM)
- Random Forest (RF)
- Gradient Boosting Machine (GBM)
- Deep Learning (DL)

#### Generalized Linear Model (GLM)

Let's start with a basic binomial Generalized Linear Model (GLM). By default, H2O's GLM uses a regularized, elastic net model.

In [None]:
glm_fit1.train(x=x, y=y, training_frame=train)

#### Train a GLM with lambda search
Next we will do some automatic tuning by passing in a validation frame and setting lambda_search = True. Since we are training a GLM with regularization, we should try to find the right amount of regularization (to avoid overfitting). The model parameter, lambda, controls the amount of regularization in a GLM model and we can find the optimal value for lambda automatically by setting lambda_search = True and passing in a validation frame (which is used to evaluate model performance using a particular value of lambda).
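
For reference, the role of lambda can be written out using the standard elastic-net objective. This is the textbook form; the exact scaling H2O applies internally may differ, so treat it as a sketch. Here $\ell$ is the binomial log-likelihood, $N$ the number of rows, $\lambda$ the overall penalty strength searched by lambda_search, and $\alpha$ the mix between the L1 and L2 penalties (H2O's alpha parameter).

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\,\ell(\beta_0, \beta \mid X, y)
  \;+\; \lambda \left( \alpha \,\lVert \beta \rVert_1
  \;+\; \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right)
```

A larger $\lambda$ shrinks the coefficients more strongly, and $\lambda = 0$ recovers the unregularized fit.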

In [None]:
glm_fit2 = H2OGeneralizedLinearEstimator(family='binomial', model_id='glm_fit2', lambda_search=True,balance_classes = True)
glm_fit2.train(x=x, y=y, training_frame=train, validation_frame=valid)

Evaluate model performance

Let's compare the performance of the two GLMs that were just trained.

In [None]:
glm_perf1 = glm_fit1.model_performance(test)
glm_perf2 = glm_fit2.model_performance(test)

# Retrieve test set AUC

In [None]:

print(glm_perf1.auc())
print(glm_perf2.auc())
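
H2O's binomial metrics expose both auc() and gini(); the two are related by Gini = 2 * AUC - 1, so either one ranks models identically. The sketch below computes AUC from scratch on a tiny made-up example to show the relationship. The function names and the data are illustrative assumptions.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. This O(n_pos * n_neg) form is for illustration only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini_from_auc(auc_value):
    """Convert AUC to the Gini coefficient."""
    return 2.0 * auc_value - 1.0

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = roc_auc(y_true, scores)
print(auc_value, gini_from_auc(auc_value))  # 0.75 0.5
```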

# Compare test AUC to the training AUC and validation AUC

In [None]:

print(glm_fit2.auc(train=True))
print(glm_fit2.auc(valid=True))

### Random Forest

H2O's Random Forest (RF) implements a distributed version of the standard Random Forest algorithm and variable importance measures.

# Import H2O RF

In [None]:

from h2o.estimators.random_forest import H2ORandomForestEstimator

#### Train a default RF
First we will train a basic Random Forest model with default parameters. Random Forest will infer the response distribution from the response encoding. A seed is required for reproducibility.


# Initialize the RF estimator


In [None]:

rf_fit1 = H2ORandomForestEstimator(model_id='rf_fit1',   seed=1)


Now that rf_fit1 object is initialized, we can train the model:

In [None]:
rf_fit1.train(x=x, y=y, training_frame=train,validation_frame=valid)

#### Train an RF with more trees
Next we will increase the number of trees used in the forest by setting ntrees = 100. The default number of trees in an H2O Random Forest is 50, so this RF will be twice as big as the default. Increasing the number of trees in an RF usually improves performance as well. Unlike Gradient Boosting Machines (GBMs), Random Forests are fairly resistant to (although not free from) overfitting as the number of trees grows. See the GBM example below for additional guidance on preventing overfitting using H2O's early stopping functionality.

In [None]:
rf_fit2 = H2ORandomForestEstimator(model_id='rf_fit2', ntrees=100,   seed=1)
rf_fit2.train(x=x, y=y, training_frame=train,validation_frame=valid)

#### Compare model performance
Let's compare the performance of the two RFs that were just trained.

In [None]:
rf_perf1 = rf_fit1.model_performance(test)
rf_perf2 = rf_fit2.model_performance(test)

# Retrieve test set AUC

In [None]:

print(rf_perf1.auc())
print(rf_perf2.auc())

#### Cross-validate performance
Rather than using a held-out test set to evaluate model performance, a user may wish to estimate model performance using cross-validation. Using the RF algorithm (with default model parameters) as an example, we demonstrate how to perform k-fold cross-validation using H2O. No custom code or loops are required; you simply specify the number of desired folds in the nfolds argument. Because no separate test set is needed here, the original (full) dataset, train_data, could be used; the cell below keeps the train split. Note that this will take approximately k (nfolds) times longer than training a single RF model, since it trains k models in the cross-validation process (each on n(k-1)/k rows), in addition to the final model trained on the full training_frame dataset with n rows.

In [None]:
rf_fit3 = H2ORandomForestEstimator(model_id='rf_fit3', seed=1, nfolds=5)
rf_fit3.train(x=x, y=y, training_frame=train)
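
To make the n(k-1)/k figure concrete, here is a small sketch of how k-fold splitting partitions row indices. It uses contiguous folds for readability; H2O assigns rows to folds in its own way, so this illustrates the idea rather than H2O's behaviour, and `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_rows, k):
    """Yield (train_idx, holdout_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        holdout = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, holdout
        start += size

n, k = 10, 5
for fold, (fold_train, fold_holdout) in enumerate(kfold_indices(n, k)):
    # Each model sees n*(k-1)/k = 8 rows and is scored on the remaining 2
    print(fold, len(fold_train), fold_holdout)
```

Every row appears in exactly one holdout fold, which is why the cross-validated metric covers the whole training frame.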

To evaluate the cross-validated AUC, do the following:

In [None]:
print(rf_fit3.auc(xval=True))

# Import H2O GBM

In [None]:

from h2o.estimators.gbm import H2OGradientBoostingEstimator

#### Train a default GBM
First we will train a basic GBM model with default parameters. GBM will infer the response distribution from the response encoding if not specified explicitly through the distribution argument. A seed is required for reproducibility.

# Initialize and train the GBM estimator


In [None]:

gbm_fit1 = H2OGradientBoostingEstimator(model_id='gbm_fit1',   seed=1)
gbm_fit1.train(x=x, y=y, training_frame=train, validation_frame=valid)

#### Train a GBM with more trees
Next we will increase the number of trees used in the GBM by setting ntrees=500. The default number of trees in an H2O GBM is 50, so this GBM will be trained using ten times the default. Increasing the number of trees in a GBM is one way to increase performance of the model; however, you have to be careful not to overfit your model to the training data by using too many trees. To automatically find the optimal number of trees, you must use H2O's early stopping functionality. This example does not do that, but the following one does.

In [None]:
gbm_fit2 = H2OGradientBoostingEstimator(model_id='gbm_fit2', ntrees=500,   seed=1)
gbm_fit2.train(x=x, y=y, training_frame=train,validation_frame=valid)

#### Train a GBM with early stopping
This time we set ntrees=1000 as an upper bound and use early stopping to prevent overfitting (from too many trees). All of H2O's algorithms have early stopping available; however, with the exception of Deep Learning, it is not enabled by default. Several parameters control early stopping. The three that are generic to all the algorithms are stopping_rounds, stopping_metric and stopping_tolerance. The stopping metric is the metric by which you'd like to measure performance, so we choose AUC here. The score_tree_interval parameter is specific to Random Forest and GBM; setting score_tree_interval=5 scores the model after every five trees. The parameters set below specify that training stops after three scoring intervals in which the AUC has not increased by more than 0.0005. Since we have specified a validation frame, the stopping criterion is computed on validation AUC rather than training AUC.

# Now let's use early stopping to find optimal ntrees


In [None]:
# Now let's use early stopping to find optimal ntrees

gbm_fit3 = H2OGradientBoostingEstimator(model_id='gbm_fit3', 
                                        ntrees=1000, 
                                        score_tree_interval=5,     #used for early stopping
                                        stopping_rounds=3,         #used for early stopping
                                        stopping_metric='AUC',     #used for early stopping
                                        stopping_tolerance=0.0005, #used for early stopping
                                        seed=1)
# The use of a validation_frame is recommended when using early stopping
gbm_fit3.train(x=x, y=y, training_frame=train, validation_frame=valid)
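
The stopping rule described above can be sketched in a few lines of plain Python. This is a simplified version under stated assumptions: the tolerance is treated as a relative improvement, and the check compares raw scores, whereas H2O's documentation describes a moving-average comparison. The function `should_stop` and the AUC values are hypothetical.

```python
def should_stop(history, stopping_rounds=3, stopping_tolerance=0.0005):
    """Return True when the last `stopping_rounds` scores failed to beat the
    earlier best by more than a relative `stopping_tolerance`.

    Assumes a metric where larger is better (such as AUC).
    """
    if len(history) <= stopping_rounds:
        return False  # not enough scoring events yet
    best_before = max(history[:-stopping_rounds])
    recent_best = max(history[-stopping_rounds:])
    return recent_best - best_before <= stopping_tolerance * abs(best_before)

# Hypothetical validation AUC values, one per scoring event
auc_history = [0.70, 0.75, 0.78, 0.7801, 0.7802, 0.7800]
for step in range(1, len(auc_history) + 1):
    if should_stop(auc_history[:step]):
        print(f"stop after scoring event {step}")
        break
```

With score_tree_interval=5, each scoring event corresponds to five additional trees, so stopping at event m means roughly 5*m trees were built.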

# Let's try XGBoost

In [None]:
# Let's try XGBoost
from h2o.estimators import H2OXGBoostEstimator
param = {
      "model_id": 'gbm_fit4'
    , "ntrees" : 100
    , "max_depth" : 10
    , "learn_rate" : 0.02
    , "sample_rate" : 0.7
    , "col_sample_rate_per_tree" : 0.9
    , "min_rows" : 5
    , "seed": 4241
    , "score_tree_interval": 100
}
gbm_fit4 = H2OXGBoostEstimator(**param)
gbm_fit4.train(x=x, y=y, training_frame=train, validation_frame=valid)

#### Compare model performance
Let's compare the performance of the three GBMs and the XGBoost model that were just trained.

In [None]:
gbm_perf1 = gbm_fit1.model_performance(test)
gbm_perf2 = gbm_fit2.model_performance(test)
gbm_perf3 = gbm_fit3.model_performance(test)
gbm_perf4 = gbm_fit4.model_performance(test)

# Retrieve test set AUC

In [None]:

print(gbm_perf1.auc())
print(gbm_perf2.auc())
print(gbm_perf3.auc())
print(gbm_perf4.auc())

# Deep Learning
H2O's Deep Learning algorithm is a multilayer feed-forward artificial neural network. It can also be used to train an autoencoder; in the example below, however, we will train a standard supervised prediction model.

# Import H2O DL

In [None]:
# Import H2O DL:
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

#### Train a default DL
First we will train a basic DL model with default parameters. DL will infer the response distribution from the response encoding if not specified explicitly through the distribution argument. H2O's DL will not be reproducible if run on more than a single core, so in this example, the performance metrics below may vary slightly from what you see on your machine. In H2O's DL, early stopping is enabled by default, so the model below uses the default stopping parameters; because a validation frame is passed, the stopping metric is computed on it rather than on the training set.

# Initialize and train the DL estimator


In [None]:
# Initialize and train the DL estimator:

dl_fit1 = H2ODeepLearningEstimator(model_id='dl_fit1',   seed=1,  balance_classes = True)
dl_fit1.train(x=x, y=y, training_frame=train,validation_frame=valid)

#### Train a DL with new architecture and more epochs
Next we will increase the number of epochs used in the DL model by setting epochs=50 (the default is 10) and use two hidden layers of 10 units each (hidden=[10,10]). Increasing the number of epochs in a deep neural net may increase performance of the model; however, you have to be careful not to overfit your model. To automatically find the optimal number of epochs, you must use H2O's early stopping functionality. Unlike the rest of the H2O algorithms, H2O's DL uses early stopping by default, so we first turn it off in the next example by setting stopping_rounds=0, for comparison.

In [None]:
dl_fit2 = H2ODeepLearningEstimator(model_id='dl_fit2', 
                                   epochs=50, 
                                   hidden=[10,10], 
                                   stopping_rounds=0,  #disable early stopping
                                   seed=1,
                                   balance_classes = True)
dl_fit2.train(x=x, y=y, training_frame=train,validation_frame=valid)


#### Train a DL with early stopping
This example uses the same architecture as dl_fit2, but raises the epoch limit to 500, turns on early stopping, and specifies the stopping criterion. We also pass a validation set, as is recommended for early stopping.

In [None]:
dl_fit3 = H2ODeepLearningEstimator(model_id='dl_fit3', 
                                   epochs=500, 
                                   hidden=[10,10],
                                   score_interval=1,          #used for early stopping
                                   stopping_rounds=50,         #used for early stopping
                                   stopping_metric='AUC',     #used for early stopping
                                   stopping_tolerance=0.0005, #used for early stopping
                                   seed=1,  
                                   balance_classes = True)
dl_fit3.train(x=x, y=y, training_frame=train, validation_frame=valid)

#### Compare model performance
Again, we will compare the performance of the three models using a test set and AUC.

In [None]:
dl_perf1 = dl_fit1.model_performance(test)
dl_perf2 = dl_fit2.model_performance(test)
dl_perf3 = dl_fit3.model_performance(test)

# Retrieve test set AUC

In [None]:
# Retrieve test set AUC
print(dl_perf1.auc())
print(dl_perf2.auc())
print(dl_perf3.auc())

In [None]:
test_pred = gbm_fit4.predict(test_data)  # score the full test frame; the model needs the feature columns, not only the id

test_pred

In [None]:
test_pred
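
The notebook stops at the prediction frame, so a final step would be writing a submission file. The sketch below shows one way to do that with the standard library. The column names id and target, the output file name, and the sample values are assumptions; check sample_submission.csv for the exact format. As far as I know, H2O's binomial prediction frame carries the positive-class probability in a column named p1, which is what would go into the target column.

```python
import csv

def write_submission(path, ids, probabilities):
    """Write an id/target CSV; the column names are an assumption."""
    if len(ids) != len(probabilities):
        raise ValueError("ids and probabilities must have the same length")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "target"])
        for row_id, prob in zip(ids, probabilities):
            writer.writerow([row_id, f"{prob:.6f}"])

# Hypothetical ids and positive-class probabilities
write_submission("submission_sketch.csv", [300000, 300001], [0.12, 0.87])
```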

# General Findings

- Is data synthetic? by [cpmpml](https://www.kaggle.com/cpmpml)

source: https://www.kaggle.com/c/cat-in-the-dat/discussion/105713


- Encoding cyclical features using sin and cos transformation by [gogo827jz](https://www.kaggle.com/gogo827jz)
source: https://www.kaggle.com/c/cat-in-the-dat/discussion/105610

- CATEGORICAL MATERIAL MUST READ by [brunhs](https://www.kaggle.com/brunhs)

source: https://www.kaggle.com/c/cat-in-the-dat/discussion/105512

- CATEGORICAL MATERIAL SURVEY🐱 & Deduplication & Record Linkage. by [caesarlupum](https://www.kaggle.com/caesarlupum)

source: https://www.kaggle.com/c/cat-in-the-dat/discussion/111930
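
The sin and cos transformation mentioned in the findings above can be sketched as follows. The idea is to place each cyclical value on the unit circle so that the last and first values of the cycle end up close together. The helper names and the assumption that months are numbered 1 to 12 are mine; the linked discussion may use a different formulation.

```python
import math

def encode_cyclical(value, period):
    """Map a 1-based cyclical value onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * (value - 1) / period
    return math.sin(angle), math.cos(angle)

def circle_distance(a, b, period):
    """Euclidean distance between two encoded values."""
    (sa, ca), (sb, cb) = encode_cyclical(a, period), encode_cyclical(b, period)
    return math.hypot(sa - sb, ca - cb)

# December to January is as close as January to February; July is farthest from January
print(circle_distance(12, 1, 12), circle_distance(1, 2, 12), circle_distance(1, 7, 12))
```

A plain integer encoding would instead put December and January 11 units apart, which is the distortion this encoding avoids.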




# Top kernels

-  ### 🐱 Cat with Null Importance - Target Permutation by @CaesarLupum

Source: https://www.kaggle.com/caesarlupum/cat-with-null-importance-target-permutation

- ###  An Overview of Encoding Techniques by @shahules

Source: https://www.kaggle.com/shahules/an-overview-of-encoding-techniques

-  ### EDA & Feat Engineering - Encode & Conquer by @kabure

Source: https://www.kaggle.com/kabure/eda-feat-engineering-encode-conquer

- ###  Why Not Logistic Regression? by @peterhurford

Source: https://www.kaggle.com/peterhurford/why-not-logistic-regression

-  ### OH my Cat by @superant

Source: https://www.kaggle.com/superant/oh-my-cat

- ###  Entity embeddings to handle categories by @abhishek

Source: https://www.kaggle.com/abhishek/entity-embeddings-to-handle-categories

- ###  2nd place Solution - Categorical FE Callenge by @adaubas

Source: https://www.kaggle.com/adaubas/2nd-place-solution-categorical-fe-callenge

- ###  🐱 CatComp - Simple Target Encoding by @CaesarLupum

Source: https://www.kaggle.com/caesarlupum/catcomp-simple-target-encoding

-  ### Handling Categorical Variables:Encoding & Modeling by @vikassingh1996

Source: https://www.kaggle.com/vikassingh1996/handling-categorical-variables-encoding-modeling

-  ### R GLMNET by @ccccat

Source: https://www.kaggle.com/ccccat/r-glmnet

-  ### Exploring CATegorical encodings  by @artgor

Source: https://www.kaggle.com/artgor/exploring-categorical-encodings

- ### CatBoost Baseline with Feature Importance by @gogo827jz

Source: https://www.kaggle.com/gogo827jz/catboost-baseline-with-feature-importance





## Final