Notebook05 for Safe Driver Prediction

Timeline: 2017/11/3

Goals: Use lightgbm for training

I. Import Packages, define functions and import files

In [40]:
# Data Manipulation
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Training
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
import xgboost as xgb
import lightgbm as lgb

# display
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [41]:
# Define the gini metric - from https://www.kaggle.com/c/ClaimPredictionChallenge/discussion/703#5897
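def gini(actual, pred, cmpcol=0, sortcol=1):
    assert len(actual) == len(pred)
    # Columns: actual, prediction, original row index (used to break ties)
    # np.float was removed from recent NumPy releases; the builtin float is equivalent
    data = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    # Sort by descending prediction, then by original order
    data = data[np.lexsort((data[:, 2], -1 * data[:, 1]))]
    total_losses = data[:, 0].sum()
    gini_sum = data[:, 0].cumsum().sum() / total_losses

    gini_sum -= (len(actual) + 1) / 2.
    return gini_sum / len(actual)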
 
def gini_normalized(a, p):
    return gini(a, p) / gini(a, a)

# Create an XGBoost-compatible metric from Gini
def gini_xgb(preds, dtrain):
    labels = dtrain.get_label()
    gini_score = gini_normalized(labels, preds)
    return [('gini', gini_score)]

# Create a LightGBM-compatible metric from Gini: returns (name, value, is_higher_better)
def gini_lgb(preds, dtrain):
    labels = dtrain.get_label()
    score = gini_normalized(labels, preds)
    return 'gini', score, True
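
For a binary target with no tied predictions, the normalized Gini defined above equals 2 * AUC - 1, which is why the two columns in the training log move together (for example, 2 * 0.646441 - 1 = 0.292882 against a logged Gini of 0.292881). The sketch below checks that identity on synthetic data. The helper `auc_pairwise` and the toy arrays are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def gini(actual, pred):
    """Unnormalized Gini, following the same steps as the notebook's gini()."""
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    # Sort by descending prediction; ties keep their original order
    order = np.lexsort((np.arange(len(actual)), -pred))
    sorted_actual = actual[order]
    gini_sum = sorted_actual.cumsum().sum() / sorted_actual.sum()
    gini_sum -= (len(actual) + 1) / 2.0
    return gini_sum / len(actual)

def gini_normalized(actual, pred):
    return gini(actual, pred) / gini(actual, actual)

def auc_pairwise(actual, pred):
    """Hypothetical helper: AUC as the share of (positive, negative) pairs ranked correctly."""
    actual = np.asarray(actual)
    pred = np.asarray(pred, dtype=float)
    pos = pred[actual == 1]
    neg = pred[actual == 0]
    return (pos[:, None] > neg[None, :]).mean()

# Synthetic labels and continuous scores (so ties are practically impossible)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)
y_score = rng.random(2000) + 0.3 * y_true

g = gini_normalized(y_true, y_score)
auc = auc_pairwise(y_true, y_score)
print(g, 2 * auc - 1)
```

Small gaps between the logged Gini and 2 * AUC - 1 in early iterations are probably due to tied predictions, which the two computations handle differently.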

In [42]:
train_df = pd.read_csv('/Users/maxji/Desktop/Kaggle/0SafeDriver/data/train.csv')
test_df = pd.read_csv('/Users/maxji/Desktop/Kaggle/0SafeDriver/data/test.csv')
submission_df = pd.read_csv('/Users/maxji/Desktop/Kaggle/0SafeDriver/data/sample_submission.csv')

II. Data manipulation

In [43]:
# Pick out columns with specific keyword inside
def select_cols(df,description):
    get_cols = [col for col in df.columns if description in col]
    return df[get_cols]

# Replace the -1 placeholder with NaN to recover the missing values
def recover_na(df):
    df = df.replace(-1, np.nan)
    return df
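
As a quick check of what `recover_na` does, the sketch below applies the same `replace(-1, np.nan)` call to a small toy frame and reports the share of missing values per column. The column names follow the dataset's naming pattern, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are illustrative and not taken from train.csv
toy = pd.DataFrame({
    'ps_ind_02_cat': [1, -1, 2],
    'ps_reg_03': [0.5, -1.0, -1.0],
    'ps_car_11': [3, 2, 1],
})

# Same operation as recover_na: -1 (int or float) becomes NaN
recovered = toy.replace(-1, np.nan)

# Fraction of missing values in each column
missing_share = recovered.isna().mean()
print(missing_share)
```

Integer columns that contain a -1 are upcast to float by this step, since NaN has no integer representation in a plain NumPy-backed column.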

In [44]:
# Select columns by data type (ordinal columns exclude id and target)
cat_cols = select_cols(train_df,'cat')
bin_cols = select_cols(train_df,'bin')
cont_cols = train_df.select_dtypes(include=['float64'])
temp_cols = [col for col in train_df.columns if ('cat' not in col) and ('bin' not in col) and (train_df[col].dtype != float) 
            and ('id' not in col) and ('target' not in col)]
ord_cols = train_df[temp_cols]

# Select columns with specific category
ind_cols = select_cols(train_df,'ind')
reg_cols = select_cols(train_df,'reg')
car_cols = select_cols(train_df,'car')
calc_cols = select_cols(train_df,'calc')

# Recover the NA
train_copy = recover_na(train_df)

In [45]:
#Dropping columns with 'ps_calc_'
col_to_drop = train_df.columns[train_df.columns.str.startswith('ps_calc_')]
train = train_df.drop(col_to_drop, axis=1)  
test = test_df.drop(col_to_drop, axis=1)

In [46]:
# Preparing for training
X = train.drop(['id', 'target'], axis=1)
features = X.columns
X = X.values
y = train['target'].values
sub=test['id'].to_frame()
sub['target']=0

III. Training

In [47]:
# Run CV
nrounds = 2000
kfold = 5

# lgb parameters
params = {'metric': 'auc', 'learning_rate': 0.01, 'max_depth': 10, 'max_bin': 10, 'objective': 'binary',
          'feature_fraction': 0.8, 'bagging_fraction': 0.9, 'bagging_freq': 10, 'min_data': 500}

# random_state only takes effect with shuffle=True, and recent scikit-learn versions
# raise an error when it is set without shuffling, so it is omitted here
skf = StratifiedKFold(n_splits=kfold)
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(' lgb kfold: {}  of  {} : '.format(i+1, kfold))
    X_train, X_eval = X[train_index], X[test_index]
    y_train, y_eval = y[train_index], y[test_index]
    lgb_model = lgb.train(params, lgb.Dataset(X_train, label=y_train), nrounds, 
                  lgb.Dataset(X_eval, label=y_eval), verbose_eval=10, 
                  feval=gini_lgb, early_stopping_rounds=100)
    sub['target'] += lgb_model.predict(test[features].values, 
                        num_iteration=lgb_model.best_iteration) / (kfold)

# Create a submission file
sub.to_csv('lightgbm.csv', index=False, float_format='%.5f') 
sub.head(2)

 lgb kfold: 1  of  5 : 
Training until validation scores don't improve for 100 rounds.
[10]	valid_0's auc: 0.624471	valid_0's gini: 0.248969
[20]	valid_0's auc: 0.627194	valid_0's gini: 0.254391
[30]	valid_0's auc: 0.627261	valid_0's gini: 0.254518
[40]	valid_0's auc: 0.627471	valid_0's gini: 0.254938
[50]	valid_0's auc: 0.628113	valid_0's gini: 0.256223
[60]	valid_0's auc: 0.62873	valid_0's gini: 0.257459
[70]	valid_0's auc: 0.628432	valid_0's gini: 0.256864
[80]	valid_0's auc: 0.628391	valid_0's gini: 0.256781
[90]	valid_0's auc: 0.628301	valid_0's gini: 0.256601
[100]	valid_0's auc: 0.628702	valid_0's gini: 0.257404
[110]	valid_0's auc: 0.62857	valid_0's gini: 0.257139
[120]	valid_0's auc: 0.628903	valid_0's gini: 0.257805
[130]	valid_0's auc: 0.628766	valid_0's gini: 0.257531
[140]	valid_0's auc: 0.629423	valid_0's gini: 0.258846
[150]	valid_0's auc: 0.629551	valid_0's gini: 0.259102
[160]	valid_0's auc: 0.629845	valid_0's gini: 0.25969
[170]	valid_0's auc: 0.629842
[... output truncated ...]

[230]	valid_0's auc: 0.627576	valid_0's gini: 0.255152
[240]	valid_0's auc: 0.628023	valid_0's gini: 0.256046
[250]	valid_0's auc: 0.628199	valid_0's gini: 0.256398
[260]	valid_0's auc: 0.628681	valid_0's gini: 0.257361
[270]	valid_0's auc: 0.629512	valid_0's gini: 0.259024
[280]	valid_0's auc: 0.629872	valid_0's gini: 0.259744
[290]	valid_0's auc: 0.630342	valid_0's gini: 0.260684
[300]	valid_0's auc: 0.630882	valid_0's gini: 0.261764
[310]	valid_0's auc: 0.631148	valid_0's gini: 0.262296
[320]	valid_0's auc: 0.631309	valid_0's gini: 0.262618
[330]	valid_0's auc: 0.63182	valid_0's gini: 0.26364
[340]	valid_0's auc: 0.632271	valid_0's gini: 0.264542
[350]	valid_0's auc: 0.632604	valid_0's gini: 0.265207
[360]	valid_0's auc: 0.632955	valid_0's gini: 0.265911
[370]	valid_0's auc: 0.633488	valid_0's gini: 0.266976
[380]	valid_0's auc: 0.634011	valid_0's gini: 0.268023
[390]	valid_0's auc: 0.634383	valid_0's gini: 0.268766
[400]	valid_0's auc: 0.634611	valid_0's gini: 0.269223
[... output truncated ...]

[600]	valid_0's auc: 0.639263	valid_0's gini: 0.278527
[610]	valid_0's auc: 0.639456	valid_0's gini: 0.278913
[620]	valid_0's auc: 0.639472	valid_0's gini: 0.278943
[630]	valid_0's auc: 0.639573	valid_0's gini: 0.279146
[640]	valid_0's auc: 0.639709	valid_0's gini: 0.279417
[650]	valid_0's auc: 0.64006	valid_0's gini: 0.28012
[660]	valid_0's auc: 0.640211	valid_0's gini: 0.280422
[670]	valid_0's auc: 0.640288	valid_0's gini: 0.280577
[680]	valid_0's auc: 0.640289	valid_0's gini: 0.280578
[690]	valid_0's auc: 0.64039	valid_0's gini: 0.280781
[700]	valid_0's auc: 0.640451	valid_0's gini: 0.280903
[710]	valid_0's auc: 0.640454	valid_0's gini: 0.280908
[720]	valid_0's auc: 0.640655	valid_0's gini: 0.281311
[730]	valid_0's auc: 0.640786	valid_0's gini: 0.281571
[740]	valid_0's auc: 0.640873	valid_0's gini: 0.281746
[750]	valid_0's auc: 0.640914	valid_0's gini: 0.281829
[760]	valid_0's auc: 0.641078	valid_0's gini: 0.282157
[770]	valid_0's auc: 0.641199	valid_0's gini: 0.282397
[... output truncated ...]

[1060]	valid_0's auc: 0.646396	valid_0's gini: 0.292792
[1070]	valid_0's auc: 0.646385	valid_0's gini: 0.292769
[1080]	valid_0's auc: 0.646357	valid_0's gini: 0.292714
[1090]	valid_0's auc: 0.646358	valid_0's gini: 0.292715
[1100]	valid_0's auc: 0.646297	valid_0's gini: 0.292595
[1110]	valid_0's auc: 0.646306	valid_0's gini: 0.292612
[1120]	valid_0's auc: 0.646304	valid_0's gini: 0.292608
[1130]	valid_0's auc: 0.646224	valid_0's gini: 0.292447
[1140]	valid_0's auc: 0.64624	valid_0's gini: 0.292479
Early stopping, best iteration is:
[1041]	valid_0's auc: 0.646441	valid_0's gini: 0.292881
 lgb kfold: 5  of  5 : 
Training until validation scores don't improve for 100 rounds.
[10]	valid_0's auc: 0.618821	valid_0's gini: 0.237636
[20]	valid_0's auc: 0.621649	valid_0's gini: 0.243305
[30]	valid_0's auc: 0.623032	valid_0's gini: 0.246067
[40]	valid_0's auc: 0.623625	valid_0's gini: 0.247252
[50]	valid_0's auc: 0.624162	valid_0's gini: 0.248326
[60]	valid_0's auc: 0.624288
[... output truncated ...]

NameError: name 'gc' is not defined
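
The cross-validation loop in this section relies on `StratifiedKFold` so that every validation fold keeps roughly the same share of positive targets, which matters when positives are rare. The sketch below illustrates this on synthetic labels. The 4% positive rate, the toy arrays, and the use of `shuffle=True` are assumptions for the example, not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: 40 positives out of 1000 rows (4%), chosen for illustration
y_toy = np.zeros(1000, dtype=int)
y_toy[:40] = 1
X_toy = np.arange(1000).reshape(-1, 1)  # placeholder features

# shuffle=True is what makes random_state meaningful
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

fold_positives = []
fold_sizes = []
for _, val_idx in skf.split(X_toy, y_toy):
    fold_positives.append(int(y_toy[val_idx].sum()))
    fold_sizes.append(len(val_idx))

print(fold_positives, fold_sizes)
```

Each of the five validation folds receives the same number of positives here because 40 divides evenly by 5. With real data the counts can differ by one between folds.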

In [48]:
sub.describe()

              id      target
count   892816.0    892816.0
mean    744153.5    0.036478
std     429683.0    0.019101
min          0.0    0.008457
25%     372021.8    0.023797
50%     744307.0    0.031978
75%    1116308.0    0.043556
max    1488026.0    0.456627


Insight:
LightGBM gives us another model, scoring around 0.281-0.282 (normalized Gini), and it takes less time to train than both XGBoost and CatBoost. The distribution of its predictions seems closer to XGBoost's than to CatBoost's.
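
One way to make the "closer distribution" comparison concrete is to put each model's test predictions side by side and compare rank correlations and quantiles. The sketch below does this on synthetic stand-ins. The column names `model_a`, `model_b`, `model_c` and all values are hypothetical and do not reproduce the actual LightGBM, XGBoost, or CatBoost outputs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for three models' test predictions (not real results)
base = rng.beta(2, 50, size=5000)
preds = pd.DataFrame({
    'model_a': base,
    'model_b': np.clip(base + rng.normal(0, 0.002, size=5000), 0, 1),
    'model_c': np.clip(base + rng.normal(0, 0.03, size=5000), 0, 1),
})

# Spearman correlation, computed as the Pearson correlation of the ranks
rank_corr = preds.rank().corr()

# Quartiles of each prediction column, comparable to describe() output
quantiles = preds.quantile([0.25, 0.5, 0.75])

print(rank_corr.round(3))
print(quantiles.round(4))
```

With real submissions, each column would be the `target` column of one model's submission file, aligned on `id`.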