# Categorical Feature Challenge

We have been provided with a dataset that contains only categorical variables, and we are asked to try out different encoding schemes and compare how they perform. The competition is a binary classification challenge with only categorical variables to train on.

### References

In [1]:
#https://www.kaggle.com/cdeotte/high-scoring-lgbm-malware-0-702-0-775
#https://www.kaggle.com/fabiendaniel/detecting-malwares-with-lgbm
#https://www.kaggle.com/humananalog/xgboost-lasso
#https://www.kaggle.com/ogrellier/good-fun-with-ligthgbm
#https://www.kaggle.com/mlisovyi/modular-good-fun-with-ligthgbm/output

### Import Necessary libraries:

In [2]:


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


import gc
from tqdm import tqdm

### Reading the data

In [3]:
kaggle=1

if kaggle==0:
    train=pd.read_csv("train.csv")
    test=pd.read_csv("test.csv")
    sample_submission=pd.read_csv("sample_submission.csv")
    
else:
    train=pd.read_csv("../input/cat-in-the-dat/train.csv")
    test=pd.read_csv("../input/cat-in-the-dat/test.csv")
    sample_submission=pd.read_csv("../input/cat-in-the-dat/sample_submission.csv")

In [4]:
train.head()

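| | id | bin_0 | bin_1 | bin_2 | bin_3 | bin_4 | nom_0 | nom_1 | nom_2 | nom_3 | ... | nom_9 | ord_0 | ord_1 | ord_2 | ord_3 | ord_4 | ord_5 | day | month | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | T | Y | Green | Triangle | Snake | Finland | ... | 2f4cb3d51 | 2 | Grandmaster | Cold | h | D | kr | 2 | 2 | 0 |
| 1 | 1 | 0 | 1 | 0 | T | Y | Green | Trapezoid | Hamster | Russia | ... | f83c56c21 | 1 | Grandmaster | Hot | a | A | bF | 7 | 8 | 0 |
| 2 | 2 | 0 | 0 | 0 | F | Y | Blue | Trapezoid | Lion | Russia | ... | ae6800dd0 | 1 | Expert | Lava Hot | h | R | Jc | 7 | 2 | 0 |
| 3 | 3 | 0 | 1 | 0 | F | Y | Red | Trapezoid | Snake | Canada | ... | 8270f0d71 | 1 | Grandmaster | Boiling Hot | i | D | kW | 2 | 1 | 1 |
| 4 | 4 | 0 | 0 | 0 | F | N | Red | Trapezoid | Lion | Canada | ... | b164b72a7 | 1 | Grandmaster | Freezing | a | R | qP | 7 | 8 | 0 |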


In [5]:
train.shape,test.shape

((300000, 25), (200000, 24))

The train dataset has 25 columns: `id`, `target`, and 23 categorical features with varying degrees of cardinality.

Let's check the distribution of the target value to understand whether the dataset is balanced or not.

In [6]:
train['target'].value_counts()

0    208236
1     91764
Name: target, dtype: int64

The target has far more 0s than 1s (about 69% versus 31%), so this is an imbalanced problem.
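
To put a number on the imbalance, the class shares can be computed directly. This is a minimal sketch that rebuilds a stand-in Series from the counts printed above instead of using the real `train` frame; the names `target`, `class_share` and `neg_pos_ratio` are only illustrative.

```python
import pandas as pd

# Stand-in for train['target'], rebuilt from the counts printed above.
target = pd.Series([0] * 208236 + [1] * 91764, name="target")

# normalize=True returns class shares instead of raw counts.
class_share = target.value_counts(normalize=True)
print(class_share.round(3))

# Ratio of negatives to positives.
neg_pos_ratio = (target == 0).sum() / (target == 1).sum()
print(round(neg_pos_ratio, 2))
```

With roughly two negatives for every positive, a stratified split and a ranking metric such as AUC are reasonable choices, which is what the modelling section below uses.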

In [7]:
train.dtypes

id         int64
bin_0      int64
bin_1      int64
bin_2      int64
bin_3     object
bin_4     object
nom_0     object
nom_1     object
nom_2     object
nom_3     object
nom_4     object
nom_5     object
nom_6     object
nom_7     object
nom_8     object
nom_9     object
ord_0      int64
ord_1     object
ord_2     object
ord_3     object
ord_4     object
ord_5     object
day        int64
month      int64
target     int64
dtype: object

In [8]:
test.dtypes

id        int64
bin_0     int64
bin_1     int64
bin_2     int64
bin_3    object
bin_4    object
nom_0    object
nom_1    object
nom_2    object
nom_3    object
nom_4    object
nom_5    object
nom_6    object
nom_7    object
nom_8    object
nom_9    object
ord_0     int64
ord_1    object
ord_2    object
ord_3    object
ord_4    object
ord_5    object
day       int64
month     int64
dtype: object

In [9]:
#convert all the columns to category datatype:
for f in train.columns:
    if f=="id" or f=="target": continue
    print(f'Converting {f} into category datatype\n')
    train[f]=train[f].astype('category')
    test[f]=test[f].astype('category')

Converting bin_0 into category datatype

Converting bin_1 into category datatype

Converting bin_2 into category datatype

Converting bin_3 into category datatype

Converting bin_4 into category datatype

Converting nom_0 into category datatype

Converting nom_1 into category datatype

Converting nom_2 into category datatype

Converting nom_3 into category datatype

Converting nom_4 into category datatype

Converting nom_5 into category datatype

Converting nom_6 into category datatype

Converting nom_7 into category datatype

Converting nom_8 into category datatype

Converting nom_9 into category datatype

Converting ord_0 into category datatype

Converting ord_1 into category datatype

Converting ord_2 into category datatype

Converting ord_3 into category datatype

Converting ord_4 into category datatype

Converting ord_5 into category datatype

Converting day into category datatype

Converting month into category datatype



### Cardinality of the columns

In [10]:
## Binary columns have a cardinality of 2. Let's separate them out.
binary_columns=[c for c in train.columns if train[c].nunique()==2]

In [11]:
binary_columns

['bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4', 'target']

In [12]:
categorical_columns=[c for c in train.columns if (c not in binary_columns)]

In [13]:
cardinality=[]
for c in categorical_columns:
    if c=='id':continue
    cardinality.append([c,train[c].nunique()])
cardinality.sort(key=lambda x:x[1],reverse=True)


In [14]:
cardinality

[['nom_9', 11981],
 ['nom_8', 2215],
 ['nom_7', 1220],
 ['nom_6', 522],
 ['nom_5', 222],
 ['ord_5', 192],
 ['ord_4', 26],
 ['ord_3', 15],
 ['month', 12],
 ['day', 7],
 ['nom_1', 6],
 ['nom_2', 6],
 ['nom_3', 6],
 ['ord_2', 6],
 ['ord_1', 5],
 ['nom_4', 4],
 ['nom_0', 3],
 ['ord_0', 3]]

Seven columns have high cardinality (`nom_5` to `nom_9`, `ord_4` and `ord_5`). One option for such columns is frequency encoding, which replaces each category with its rank when the categories are ordered by how often they occur. We first check which columns have the same levels in both train and test, and frequency-encode only those columns. The remaining columns are label encoded in the next section.

In [15]:
# Columns whose levels match between train and test; these are frequency encoded below
good_label_cols = [col for col in categorical_columns if 
                   set(train[col]) == set(test[col])]
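
When a column fails this equality check, it can help to see which levels are missing on which side. The following is a small sketch on made-up toy frames (the column names mirror the dataset, the values do not); `level_report` and `matching_cols` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Toy frames; the column names mirror the dataset but the values are made up.
train_toy = pd.DataFrame({"nom_0": ["Red", "Blue", "Green"],
                          "nom_9": ["a1", "b2", "c3"]})
test_toy = pd.DataFrame({"nom_0": ["Green", "Red", "Blue"],
                         "nom_9": ["a1", "zz", "b2"]})

level_report = {}
for col in train_toy.columns:
    train_levels, test_levels = set(train_toy[col]), set(test_toy[col])
    level_report[col] = {
        "only_in_train": train_levels - test_levels,
        "only_in_test": test_levels - train_levels,
    }

# Columns with no mismatch on either side pass the set-equality check.
matching_cols = [c for c, r in level_report.items()
                 if not r["only_in_train"] and not r["only_in_test"]]
print(matching_cols)
print(level_report["nom_9"])
```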

In [16]:
## from https://www.kaggle.com/fabiendaniel/detecting-malwares-with-lgbm
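def frequency_encoding(variable):
    # Count each category over train and test combined, most frequent first.
    # Note: the 'index' column used below is the name pandas versions before 2.0 give it.
    t = pd.concat([train[variable], test[variable]]).value_counts().reset_index()
    # A second reset_index adds 'level_0', the frequency rank (0 = most frequent).
    t = t.reset_index()
    # Categories that occur only once lose their own rank ...
    t.loc[t[variable] == 1, 'level_0'] = np.nan
    t.set_index('index', inplace=True)
    # ... and share one extra label placed after the last regular rank.
    max_label = t['level_0'].max() + 1
    t.fillna(max_label, inplace=True)
    return t.to_dict()['level_0']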
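
The same ranking idea can be sketched without depending on the column names that `reset_index` generates. The helper below is hypothetical (`frequency_rank_map` and `rare_count` are not from the original notebook) and works on a plain list of values. Like the function above, it gives the most frequent category rank 0 and lets all categories seen only once share one extra label.

```python
import pandas as pd

def frequency_rank_map(values, rare_count=1):
    """Map each category to its frequency rank (0 = most frequent).

    Categories seen `rare_count` times or fewer share a single extra label.
    """
    counts = pd.Series(values).value_counts()  # sorted by count, descending
    ranks = pd.Series(range(len(counts)), index=counts.index)
    rare = counts <= rare_count
    if rare.any():
        # The shared label comes right after the last regular rank.
        shared_label = ranks[~rare].max() + 1 if (~rare).any() else 0
        ranks[rare] = shared_label
    return ranks.to_dict()

toy = ["a", "a", "a", "b", "b", "c", "d"]
mapping = frequency_rank_map(toy)
print(mapping)

# Applying the mapping to a column; unseen values become NaN.
encoded = pd.Series(["a", "c", "zzz"]).map(mapping)
print(encoded.tolist())
```

Grouping the singletons keeps the encoded feature from carrying thousands of labels that each describe a single row.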

In [17]:
#frequency_encoded_columns=['nom_9','nom_8','nom_7','nom_6','nom_5','ord_5','ord_4']

In [18]:
for variable in tqdm(good_label_cols):
    freq_encod_dict=frequency_encoding(variable)
    train[variable+'_FE']=train[variable].map(lambda x:freq_encod_dict.get(x,np.nan))
    test[variable+'_FE']=test[variable].map(lambda x:freq_encod_dict.get(x,np.nan))
    categorical_columns.remove(variable)

100%|██████████| 15/15 [00:00<00:00, 58.53it/s]


### Label Encoding

In [19]:
#https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study

def factorize(train, test, features, na_value=-9999, full=False, sort=True):
    """Factorize categorical features.
    Parameters
    ----------
    train : pd.DataFrame
    test : pd.DataFrame
    features : list
           Column names in the DataFrame to be encoded.
    na_value : int, default -9999
    full : bool, default False
        If True, build the labels from train and test combined; if False, from train only.
    sort : bool, default True
        Sort by values.
    Returns
    -------
    train : pd.DataFrame
    test : pd.DataFrame
    """

    for column in features:
        if full:
            vs = pd.concat([train[column], test[column]])
            labels, indexer = pd.factorize(vs, sort=sort)
        else:
            labels, indexer = pd.factorize(train[column], sort=sort)

        train[column+'_LE'] = indexer.get_indexer(train[column])
        test[column+'_LE'] = indexer.get_indexer(test[column])

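        # get_indexer returns -1 for values missing from the indexer;
        # replace that sentinel in the encoded columns.
        if na_value != -1:
            train[column+'_LE'] = train[column+'_LE'].replace(-1, na_value)
            test[column+'_LE'] = test[column+'_LE'].replace(-1, na_value)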

    return train, test
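
To see why the `full` flag and `na_value` matter, here is a minimal sketch of what `pd.factorize` and `get_indexer` do when the labels are built from train only. The toy values are made up for illustration.

```python
import pandas as pd

train_col = pd.Series(["Hot", "Cold", "Hot", "Freezing"])
test_col = pd.Series(["Cold", "Lava Hot"])  # "Lava Hot" never appears in train

# uniques is an Index of the distinct values; sort=True orders them.
codes, uniques = pd.factorize(train_col, sort=True)

# get_indexer looks each value up in uniques and returns -1 when it is missing.
train_codes = uniques.get_indexer(train_col)
test_codes = uniques.get_indexer(test_col)
print(list(uniques), train_codes, test_codes)
```

The unseen test level comes back as -1, which is the sentinel that `na_value` is meant to replace. With `full=True` the labels are built from both frames, so no -1 appears.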

In [20]:
# indexer = {}
# for col in tqdm(categorical_columns):
#     if col == 'id': continue
#     _, indexer[col] = pd.factorize([train[col],test[col]])

In [21]:
#categorical_columns.remove('id')
train,test=factorize(train,test,categorical_columns,full=True)

In [22]:
#train,test=factorize(train,test,frequency_encoded_columns,full=True)

In [23]:
# for col in tqdm(categorical_columns):
#     if col=='id':continue
#     train[col+'_LE']=indexer[col].get_indexer(train[col])
#     test[col+'_LE']=indexer[col].get_indexer(test[col])
    

Now we one-hot encode all the binary categorical variables (`target` is skipped).
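
As a small illustration of what `pd.get_dummies` produces for two-level columns, here is a sketch on a made-up toy frame. It also shows `drop_first=True`, an option the notebook does not use, which keeps a single indicator per binary column.

```python
import numpy as np
import pandas as pd

# Toy frame with two binary columns; the values are made up.
toy = pd.DataFrame({"bin_3": ["T", "F", "T"], "bin_4": ["Y", "Y", "N"]})

# One indicator column per level, named <column>_<level>.
dummies = pd.get_dummies(toy, columns=["bin_3", "bin_4"]).astype(np.uint8)
print(dummies.columns.tolist())

# For a two-level column the two indicators always sum to 1, so one is redundant.
redundant = bool((dummies["bin_3_F"] + dummies["bin_3_T"] == 1).all())
print(redundant)

# drop_first=True keeps a single indicator per binary column.
compact = pd.get_dummies(toy, columns=["bin_3", "bin_4"], drop_first=True).astype(np.uint8)
print(compact.columns.tolist())
```

Tree models such as LightGBM are not hurt by the redundant column, so keeping both indicators mainly costs a little memory.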

In [24]:
binary_columns

['bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4', 'target']

In [25]:
train_cat_dum=pd.DataFrame()
test_cat_dum=pd.DataFrame()
for c_ in binary_columns:
    if c_=='target':continue
    train_cat_dum=pd.concat([train_cat_dum,pd.get_dummies(train[c_],prefix=c_).astype(np.uint8)],axis=1)
    test_cat_dum=pd.concat([test_cat_dum,pd.get_dummies(test[c_],prefix=c_).astype(np.uint8)],axis=1)

In [26]:
train_cat_dum.head()

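| | bin_0_0 | bin_0_1 | bin_1_0 | bin_1_1 | bin_2_0 | bin_2_1 | bin_3_F | bin_3_T | bin_4_N | bin_4_Y |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| 2 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 3 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 4 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |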


In [27]:
train=pd.concat([train,train_cat_dum],axis=1)
test=pd.concat([test,test_cat_dum],axis=1)

In [28]:
train.head()

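| | id | bin_0 | bin_1 | bin_2 | bin_3 | bin_4 | nom_0 | nom_1 | nom_2 | nom_3 | ... | bin_0_0 | bin_0_1 | bin_1_0 | bin_1_1 | bin_2_0 | bin_2_1 | bin_3_F | bin_3_T | bin_4_N | bin_4_Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | T | Y | Green | Triangle | Snake | Finland | ... | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | T | Y | Green | Trapezoid | Hamster | Russia | ... | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| 2 | 2 | 0 | 0 | 0 | F | Y | Blue | Trapezoid | Lion | Russia | ... | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 3 | 3 | 0 | 1 | 0 | F | Y | Red | Trapezoid | Snake | Canada | ... | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 4 | 4 | 0 | 0 | 0 | F | N | Red | Trapezoid | Lion | Canada | ... | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |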


Now we have taken care of all the categorical variables. Let's build the model with 5-fold cross-validation. Before this, let's delete the original categorical columns.
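
Because the target is imbalanced, the folds are stratified so that each validation fold keeps roughly the same share of positives. The sketch below illustrates this on synthetic data (the array `X`, the 70/30 label split and the name `fold_rates` are assumptions for illustration, not the competition data).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # toy features, not the competition data
y = np.array([0] * 700 + [1] * 300)   # roughly the class split seen in `target`

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)
fold_rates = []
for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y), start=1):
    # Share of positives in this validation fold.
    rate = y[valid_idx].mean()
    fold_rates.append(rate)
    print(f"fold {fold}: {len(valid_idx)} validation rows, positive rate {rate:.3f}")
```

Every row lands in exactly one validation fold, which is what allows the out-of-fold predictions later on to be scored as a single AUC over the whole training set.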

In [29]:
train.columns,test.columns

(Index(['id', 'bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4', 'nom_0', 'nom_1',
        'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9',
        'ord_0', 'ord_1', 'ord_2', 'ord_3', 'ord_4', 'ord_5', 'day', 'month',
        'target', 'nom_0_FE', 'nom_1_FE', 'nom_2_FE', 'nom_3_FE', 'nom_4_FE',
        'nom_5_FE', 'nom_6_FE', 'ord_0_FE', 'ord_1_FE', 'ord_2_FE', 'ord_3_FE',
        'ord_4_FE', 'ord_5_FE', 'day_FE', 'month_FE', 'id_LE', 'nom_7_LE',
        'nom_8_LE', 'nom_9_LE', 'bin_0_0', 'bin_0_1', 'bin_1_0', 'bin_1_1',
        'bin_2_0', 'bin_2_1', 'bin_3_F', 'bin_3_T', 'bin_4_N', 'bin_4_Y'],
       dtype='object'),
 Index(['id', 'bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4', 'nom_0', 'nom_1',
        'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9',
        'ord_0', 'ord_1', 'ord_2', 'ord_3', 'ord_4', 'ord_5', 'day', 'month',
        'nom_0_FE', 'nom_1_FE', 'nom_2_FE', 'nom_3_FE', 'nom_4_FE', 'nom_5_FE',
        'nom_6_FE', 'ord_0_FE', 'ord_1_FE'

In [30]:
cols_to_remove=['id', 'bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4', 'nom_0', 'nom_1',
       'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9',
       'ord_0', 'ord_1', 'ord_2', 'ord_3', 'ord_4', 'ord_5', 'day', 'month','id_LE']

In [31]:
train=train.drop(cols_to_remove,axis=1)
test=test.drop(cols_to_remove,axis=1)

In [32]:
train.shape

(300000, 29)

In [33]:
test.shape

(200000, 28)

### Building the model

In [34]:
## Importing required libraries:
from sklearn.model_selection import KFold, StratifiedKFold
import lightgbm as lgb
from sklearn.metrics import roc_auc_score, precision_recall_curve, roc_curve, average_precision_score

In [35]:
y=train['target']
del train['target']

In [36]:
n_folds=5

In [37]:
folds=StratifiedKFold(n_splits=5,shuffle=True,random_state=1234)
feats=[f for f in train.columns if f not in ['id']]

In [38]:
oof_preds = np.zeros(train.shape[0])
sub_preds = np.zeros(test.shape[0])
    
feature_importance_df = pd.DataFrame()
categorical_features=[c for c in train.columns if c not in ['id_LE']]

In [39]:
# param = {'num_leaves': 60,
#          'min_data_in_leaf': 60, 
#          'objective':'binary',
#          'max_depth': -1,
#          'learning_rate': 0.1,
#          "boosting": "gbdt",
#          "feature_fraction": 0.8,
#          "bagging_freq": 1,
#          "bagging_fraction": 0.8 ,
#          "bagging_seed": 11,
#          "metric": 'auc',
#          "lambda_l1": 0.1,
#          "random_state": 133,
#          "verbosity": -1}

In [40]:
#params after bayesian optimisation:

param = {'num_leaves': 31,
         'min_data_in_leaf': 69, 
         'objective':'binary',
         'max_depth': 4,
         'learning_rate': 0.06,
         "boosting": "gbdt",
         "feature_fraction": 0.33,
         "metric": 'auc',
         "lambda_l1": 0.01,
         "random_state": 133,
         "verbosity": -1}

In [41]:
for n_folds,(train_idx,valid_idx) in enumerate(folds.split(train.values,y.values)):
    print("fold n°{}".format(n_folds+1))
    trn_data = lgb.Dataset(train.iloc[train_idx][feats],
                           label=y.iloc[train_idx],
                           categorical_feature=categorical_features
                          )
    val_data = lgb.Dataset(train.iloc[valid_idx][feats],
                           label=y.iloc[valid_idx],categorical_feature=categorical_features
                          )

    num_round = 10000
    clf = lgb.train(param,
                    trn_data,
                    num_round,
                    valid_sets = [trn_data, val_data],
                    verbose_eval=100,
                    early_stopping_rounds = 200)
    
    #clf.fit(train_x,train_y,eval_set=[(train_x,train_y),(valid_x,valid_y)],verbose=500,eval_metric="auc",early_stopping_rounds=100)
    
    oof_preds[valid_idx]=clf.predict(train.iloc[valid_idx][feats],num_iteration=clf.best_iteration)
    sub_preds+=clf.predict(test[feats],num_iteration=clf.best_iteration)/folds.n_splits
    
    fold_importance_df=pd.DataFrame()
    fold_importance_df['features']=feats
    fold_importance_df['importance']=clf.feature_importance(importance_type='gain')
    fold_importance_df['folds']=n_folds+1
    print(f'Fold {n_folds+1}: Most important features are:\n')
    for i in np.argsort(fold_importance_df['importance'])[-5:]:
        print(f'{fold_importance_df.iloc[i,0]}-->{fold_importance_df.iloc[i,1]}')
    
    feature_importance_df=pd.concat([feature_importance_df,fold_importance_df],axis=0)
    
    print('Fold %2d AUC : %.6f' % (n_folds + 1, roc_auc_score(y.iloc[valid_idx], oof_preds[valid_idx])))
    del clf
    gc.collect()
    


print('Full auc score %.6f' % (roc_auc_score(y,oof_preds)))

test['target']=sub_preds
              

fold n°1




Training until validation scores don't improve for 200 rounds.
[100]	training's auc: 0.812265	valid_1's auc: 0.776934
[200]	training's auc: 0.842193	valid_1's auc: 0.788267
[300]	training's auc: 0.858734	valid_1's auc: 0.791869
[400]	training's auc: 0.870669	valid_1's auc: 0.793548
[500]	training's auc: 0.877213	valid_1's auc: 0.794343
[600]	training's auc: 0.882285	valid_1's auc: 0.79474
[700]	training's auc: 0.88624	valid_1's auc: 0.794841
[800]	training's auc: 0.890613	valid_1's auc: 0.79474
[900]	training's auc: 0.894143	valid_1's auc: 0.794632
Early stopping, best iteration is:
[755]	training's auc: 0.888632	valid_1's auc: 0.794961
Fold 1: Most important features are:

nom_9_LE-->69376.6969089508
nom_6_FE-->72652.86227715015
nom_8_LE-->88576.83816957474
nom_7_LE-->96573.90236628056
ord_5_FE-->99198.95233100653
Fold  1 AUC : 0.794961
fold n°2




Training until validation scores don't improve for 200 rounds.
[100]	training's auc: 0.812989	valid_1's auc: 0.775461
[200]	training's auc: 0.842445	valid_1's auc: 0.78671
[300]	training's auc: 0.858491	valid_1's auc: 0.790576
[400]	training's auc: 0.8697	valid_1's auc: 0.792198
[500]	training's auc: 0.876419	valid_1's auc: 0.793187
[600]	training's auc: 0.881325	valid_1's auc: 0.793591
[700]	training's auc: 0.885472	valid_1's auc: 0.793773
[800]	training's auc: 0.8897	valid_1's auc: 0.793812
[900]	training's auc: 0.893336	valid_1's auc: 0.793672
Early stopping, best iteration is:
[737]	training's auc: 0.886772	valid_1's auc: 0.79399
Fold 2: Most important features are:

nom_5_FE-->68756.54258432984
nom_6_FE-->73243.1858072877
nom_8_LE-->86274.63925421238
nom_7_LE-->94869.22529029846
ord_5_FE-->97742.06808823347
Fold  2 AUC : 0.793990
fold n°3




Training until validation scores don't improve for 200 rounds.
[100]	training's auc: 0.812084	valid_1's auc: 0.776903
[200]	training's auc: 0.842185	valid_1's auc: 0.788773
[300]	training's auc: 0.858128	valid_1's auc: 0.792572
[400]	training's auc: 0.869771	valid_1's auc: 0.794446
[500]	training's auc: 0.875999	valid_1's auc: 0.795041
[600]	training's auc: 0.88103	valid_1's auc: 0.795505
[700]	training's auc: 0.885178	valid_1's auc: 0.795701
[800]	training's auc: 0.889461	valid_1's auc: 0.795498
[900]	training's auc: 0.893087	valid_1's auc: 0.795375
Early stopping, best iteration is:
[738]	training's auc: 0.886505	valid_1's auc: 0.795756
Fold 3: Most important features are:

nom_9_LE-->68694.23496437073
nom_6_FE-->73069.73902124166
nom_8_LE-->87378.97347640991
nom_7_LE-->93186.46600532532
ord_5_FE-->100275.03570967913
Fold  3 AUC : 0.795756
fold n°4




Training until validation scores don't improve for 200 rounds.
[100]	training's auc: 0.811995	valid_1's auc: 0.777185
[200]	training's auc: 0.841865	valid_1's auc: 0.78821
[300]	training's auc: 0.857842	valid_1's auc: 0.791955
[400]	training's auc: 0.86929	valid_1's auc: 0.79345
[500]	training's auc: 0.875985	valid_1's auc: 0.794037
[600]	training's auc: 0.881079	valid_1's auc: 0.7943
[700]	training's auc: 0.88511	valid_1's auc: 0.794358
[800]	training's auc: 0.889312	valid_1's auc: 0.794156
Early stopping, best iteration is:
[674]	training's auc: 0.884114	valid_1's auc: 0.794404
Fold 4: Most important features are:

nom_5_FE-->65404.37067639828
nom_6_FE-->71244.60027629137
nom_8_LE-->86290.03809547424
nom_7_LE-->91309.6388168335
ord_5_FE-->96564.6946144402
Fold  4 AUC : 0.794404
fold n°5




Training until validation scores don't improve for 200 rounds.
[100]	training's auc: 0.811499	valid_1's auc: 0.77653
[200]	training's auc: 0.842227	valid_1's auc: 0.787069
[300]	training's auc: 0.858328	valid_1's auc: 0.790577
[400]	training's auc: 0.869681	valid_1's auc: 0.79176
[500]	training's auc: 0.876626	valid_1's auc: 0.792565
[600]	training's auc: 0.88164	valid_1's auc: 0.79285
[700]	training's auc: 0.885845	valid_1's auc: 0.792819
[800]	training's auc: 0.890096	valid_1's auc: 0.792611
Early stopping, best iteration is:
[606]	training's auc: 0.881838	valid_1's auc: 0.792949
Fold 5: Most important features are:

nom_9_LE-->61814.11954879761
nom_6_FE-->68174.71038126945
nom_8_LE-->85322.02912902832
nom_7_LE-->87065.94546461105
ord_5_FE-->96681.51070272923
Fold  5 AUC : 0.792949
Full auc score 0.794398


In [42]:
sample_submission['target']=sub_preds

In [43]:
sample_submission.head()

| | id | target |
|---|---|---|
| 0 | 300000 | 0.312891 |
| 1 | 300001 | 0.670414 |
| 2 | 300002 | 0.160528 |
| 3 | 300003 | 0.348535 |
| 4 | 300004 | 0.808928 |


In [44]:
sample_submission.to_csv("sample_submission.csv",index=False)

