# Intro to Data Science
## Final project
## Porto Seguro’s Safe Driver Prediction

* TODO: add a description of the project
* TODO: add some visualizations

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

In [164]:
# load the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

The training dataset contains the following columns.

In [3]:
train.columns

Index(['id', 'target', 'ps_ind_01', 'ps_ind_02_cat', 'ps_ind_03',
       'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_ind_06_bin', 'ps_ind_07_bin',
       'ps_ind_08_bin', 'ps_ind_09_bin', 'ps_ind_10_bin', 'ps_ind_11_bin',
       'ps_ind_12_bin', 'ps_ind_13_bin', 'ps_ind_14', 'ps_ind_15',
       'ps_ind_16_bin', 'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_reg_01',
       'ps_reg_02', 'ps_reg_03', 'ps_car_01_cat', 'ps_car_02_cat',
       'ps_car_03_cat', 'ps_car_04_cat', 'ps_car_05_cat', 'ps_car_06_cat',
       'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat', 'ps_car_10_cat',
       'ps_car_11_cat', 'ps_car_11', 'ps_car_12', 'ps_car_13', 'ps_car_14',
       'ps_car_15', 'ps_calc_01', 'ps_calc_02', 'ps_calc_03', 'ps_calc_04',
       'ps_calc_05', 'ps_calc_06', 'ps_calc_07', 'ps_calc_08', 'ps_calc_09',
       'ps_calc_10', 'ps_calc_11', 'ps_calc_12', 'ps_calc_13', 'ps_calc_14',
       'ps_calc_15_bin', 'ps_calc_16_bin', 'ps_calc_17_bin', 'ps_calc_18_bin',
       'ps_calc_19_bin', 'ps_calc_20_bin'],


In [167]:
# split the data
X_train = train.drop(['id', 'target'], axis=1)
y_train = train['target']

X_test = test.drop(['id'], axis=1)

We checked which columns contain missing values, which this dataset encodes as -1. Most of the missing values are in categorical features; only 4 numerical features have any. Missing values in numerical features will be handled by XGBoost itself. Missing values in categorical features will be treated as a separate category during encoding.

In [168]:
print([column_name for column_name in X_train.columns if -1 in X_train[column_name].values])

['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_reg_03', 'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_05_cat', 'ps_car_07_cat', 'ps_car_09_cat', 'ps_car_11', 'ps_car_12', 'ps_car_14']


In [169]:
print([column_name for column_name in X_test.columns if -1 in X_test[column_name].values])

['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_reg_03', 'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_05_cat', 'ps_car_07_cat', 'ps_car_09_cat', 'ps_car_11', 'ps_car_14']
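
To see how much of each column is missing, rather than only which columns are affected, one can compute the share of -1 entries per column. The sketch below uses a small made-up frame (`toy`, with invented values) in place of `X_train`, and assumes, as above, that -1 marks a missing value.

```python
import pandas as pd

# Toy stand-in for X_train; the values are invented for illustration.
toy = pd.DataFrame({
    'ps_ind_02_cat': [1, -1, 2, 1],
    'ps_reg_03': [0.7, -1.0, -1.0, 0.9],
    'ps_ind_01': [2, 5, 0, 1],
})

# Fraction of rows equal to -1 in each column.
missing_share = (toy == -1).mean()

# Keep only the columns that contain at least one missing value.
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```

Running the same two lines on `X_train` and `X_test` would show whether a column is missing in a handful of rows or in most of them, which the list of column names alone does not reveal.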


Some categorical features have high cardinality, so we encode each categorical feature with the mean of the target within each category (target mean encoding).

In [109]:
def target_mean_encoding(train_feature, test_feature, target):
    # Mean of the target for each category, computed on the training data only
    category_means = target.groupby(train_feature).mean()
    # Replace each category with its mean; categories unseen in training become NaN
    return train_feature.map(category_means), test_feature.map(category_means)

In [134]:
cat_features = [feature for feature in X_train.columns if 'cat' in feature]

In [170]:
for feature in cat_features:
    X_train[feature], X_test[feature] = target_mean_encoding(X_train[feature], X_test[feature], y_train)
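
As a small illustration of what the encoding does, the sketch below applies the same idea to an invented feature (`toy_cat`) and target. All values are made up. Note that the -1 missing marker simply becomes one more category, and that a category absent from the training data has no mean to look up.

```python
import pandas as pd

# Invented toy data: one categorical feature and a binary target.
train_cat = pd.Series([0, 0, 1, 1, 1, -1], name='toy_cat')
target = pd.Series([1, 0, 0, 0, 1, 1], name='target')
test_cat = pd.Series([1, 0, 7], name='toy_cat')  # 7 never occurs in training

# Mean target per category, estimated on the training rows only.
category_means = target.groupby(train_cat).mean()

train_encoded = train_cat.map(category_means)
test_encoded = test_cat.map(category_means)  # unseen category 7 becomes NaN

print(category_means.to_dict())
print(test_encoded.tolist())
```

Here category 0 is encoded as 0.5 (one positive out of two rows), and the unseen test category 7 ends up as NaN.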

This Kaggle competition is evaluated with the normalized Gini coefficient. This metric can also be computed from ROC AUC, as 2 * roc_auc - 1. In our CV procedure we therefore compute the score both ways: with the commonly used Gini function below and from roc_auc_score.

In [None]:
from sklearn.metrics import roc_auc_score

In [141]:
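def gini(actual, pred):
    # Columns: actual value, prediction, original row index (used as tie-breaker)
    matrix = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    # Sort by prediction in descending order, breaking ties by original row index
    matrix = matrix[np.lexsort((matrix[:,2], -1*matrix[:,1]))]
    totalLosses = matrix[:,0].sum()
    giniSum = matrix[:,0].cumsum().sum() / totalLosses

    giniSum -= (len(actual) + 1) / 2.
    return giniSum / len(actual)

def gini_normalized(a, p):
    # Divide by the Gini of a perfect ranking so that the best possible score is 1
    return gini(a, p) / gini(a, a)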
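
The relation between the two formulations can be checked on a tiny example. The sketch below is a pure-Python restatement of the functions above, plus a pairwise definition of ROC AUC (`auc_pairwise` is a name introduced here, not a library function); the labels and scores are invented. It assumes binary labels and no tied predictions. With ties the two computations can differ slightly, because the Gini function breaks ties by row order while AUC counts a tie as half a win.

```python
def gini(actual, pred):
    """Unnormalized Gini: rank by prediction (descending), ties broken by row order."""
    n = len(actual)
    order = sorted(range(n), key=lambda i: (-pred[i], i))
    running, cumulative_sum = 0.0, 0.0
    for i in order:
        running += actual[i]
        cumulative_sum += running
    return (cumulative_sum / sum(actual) - (n + 1) / 2.0) / n

def gini_normalized(actual, pred):
    return gini(actual, pred) / gini(actual, actual)

def auc_pairwise(actual, pred):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [p for a, p in zip(actual, pred) if a == 1]
    neg = [p for a, p in zip(actual, pred) if a == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Invented labels and scores with no tied predictions.
actual = [0, 0, 1, 0, 1, 1]
pred = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]

g = gini_normalized(actual, pred)
a = auc_pairwise(actual, pred)
print(g, 2 * a - 1)
```

In this example 6 of the 9 positive/negative pairs are ordered correctly, so AUC is 2/3 and both expressions evaluate to 1/3 up to floating-point rounding.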

We use XGBoost for this classification task. For cross-validation we use StratifiedKFold from sklearn, because the target classes in our dataset are imbalanced and stratification keeps the class proportions the same in every fold.

In [151]:
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import StratifiedKFold
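
The effect of stratification can be seen on synthetic labels. The sketch below uses invented data with 10% positives (not the competition's actual class ratio) and checks the positive rate in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Invented labels: 10% positives, far fewer than negatives.
y = np.array([1] * 30 + [0] * 270)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=3)

# Positive rate in the validation part of every fold.
fold_rates = [y[test_index].mean() for _, test_index in skf.split(X, y)]
print(fold_rates)
```

Each fold receives 10 of the 30 positives, so every validation fold has the same 10% positive rate as the full label vector.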

XGBoost's built-in cross-validation showed that 100 estimators with a learning rate of 0.1 work well for this task. We therefore keep these values and tune the maximum tree depth (the max_depth parameter) with a grid search.

In [171]:
max_tree_depths = [3, 4, 6, 8]

for max_tree_depth in max_tree_depths:
    skf = StratifiedKFold()
    roc_auc_scores = []
    gini_scores = []

    for train_index, test_index in skf.split(X_train, y_train):
        X_train_cv, X_test_cv = X_train.iloc[train_index], X_train.iloc[test_index]
        y_train_cv, y_test_cv = y_train.iloc[train_index], y_train.iloc[test_index]
        xgb = XGBClassifier(max_depth=max_tree_depth, missing=-1)
        xgb.fit(X_train_cv, y_train_cv)
        preds = xgb.predict_proba(X_test_cv)
        gini_scores.append(gini_normalized(y_test_cv, preds[:,1]))
        roc_auc_scores.append(2*roc_auc_score(y_test_cv, preds[:,1])-1)

    print('max_tree_depth value: ', max_tree_depth)
    print('gini (function): ', np.mean(gini_scores), '\tgini (roc auc): ', np.mean(roc_auc_scores))
    print()

max_tree_depth value:  3
gini (function):  0.275427958339 	gini (roc auc):  0.275427959786

max_tree_depth value:  4
gini (function):  0.27743021761 	gini (roc auc):  0.277430211582

max_tree_depth value:  6
gini (function):  0.274512398979 	gini (roc auc):  0.274512400426

max_tree_depth value:  8
gini (function):  0.260478704967 	gini (roc auc):  0.260478696286



Having chosen 4 as the best max_tree_depth value, we tune the row and column subsampling parameters (subsample and colsample_bytree), again with a grid search.

In [None]:
row_subsamplings = [0.5, 0.75, 1.0]
col_subsamplings = [0.4, 0.6, 0.8, 1.0]

for row_subsampling in row_subsamplings:
    for col_subsampling in col_subsamplings:
        skf = StratifiedKFold()
        roc_auc_scores = []
        gini_scores = []

        for train_index, test_index in skf.split(X_train, y_train):
            X_train_cv, X_test_cv = X_train.iloc[train_index], X_train.iloc[test_index]
            y_train_cv, y_test_cv = y_train.iloc[train_index], y_train.iloc[test_index]
            xgb = XGBClassifier(max_depth=4, subsample=row_subsampling, colsample_bytree=col_subsampling, missing=-1)
            xgb.fit(X_train_cv, y_train_cv)
            preds = xgb.predict_proba(X_test_cv)
            gini_scores.append(gini_normalized(y_test_cv, preds[:,1]))
            roc_auc_scores.append(2*roc_auc_score(y_test_cv, preds[:,1])-1)

        print('row_subsampling value: ', row_subsampling, '\tcol_subsampling value: ', col_subsampling)
        print('gini (function): ', np.mean(gini_scores), '\tgini (roc auc): ', np.mean(roc_auc_scores))
        print()
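
Both grid searches repeat the same cross-validation loop. One way to factor it out is a helper that takes an unfitted estimator and returns the mean fold Gini. The name `cv_gini` is hypothetical, and the sketch uses scikit-learn's `LogisticRegression` on synthetic data only so that it runs without the competition files; in this notebook one would pass an `XGBClassifier` instead.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cv_gini(estimator, X, y, n_splits=3):
    """Mean normalized Gini (2 * AUC - 1) over stratified folds."""
    X, y = np.asarray(X), np.asarray(y)
    scores = []
    for train_index, test_index in StratifiedKFold(n_splits=n_splits).split(X, y):
        model = clone(estimator)  # fresh, unfitted copy for every fold
        model.fit(X[train_index], y[train_index])
        preds = model.predict_proba(X[test_index])[:, 1]
        scores.append(2 * roc_auc_score(y[test_index], preds) - 1)
    return float(np.mean(scores))

# Synthetic, imbalanced stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     weights=[0.9], random_state=0)

# Hypothetical grid over the regularization strength C, standing in for max_depth.
for C in [0.01, 1.0]:
    score = cv_gini(LogisticRegression(C=C, max_iter=1000), X_demo, y_demo)
    print('C:', C, '\tgini:', score)
```

With such a helper, each grid search reduces to a loop over parameter values that builds a classifier and prints the returned score.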

We are now ready to train the final model on the full training set and make predictions to submit to Kaggle.

In [161]:
xgb = XGBClassifier(max_depth=4, subsample=0.75, colsample_bytree=0.8, missing=-1)
xgb.fit(X_train, y_train)
preds = xgb.predict_proba(X_test)

In [162]:
# Build the submission frame directly so that 'id' keeps its integer dtype
result = pd.DataFrame({'id': test['id'].values, 'target': preds[:,1]})
result.to_csv('result.csv', index=False)

The final model scores 0.276 on the public leaderboard.