## Model ensembling with weighted average

This is a much-simplified version of the model ensemble: it includes only three single models.
In the full solution, with 27 single models and 5 open solution models, ensembling can achieve a boost of ~0.002.

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

Read single model predictions

In [2]:
train_files = ['../output/train_pred_lgb1.csv', '../output/train_pred_lgb2.csv', '../output/train_pred_lgb3.csv']
test_files = ['../output/test_pred_lgb1.csv', '../output/test_pred_lgb2.csv', '../output/test_pred_lgb3.csv']

n_model = len(train_files)

# Collect each model's training predictions, one column per model
train_x = pd.DataFrame()
for path in train_files:
    train_pred = pd.read_csv(path)
    name = path[-8:-4]  # model tag taken from the file name, e.g. 'lgb1'
    train_x[name] = train_pred['prob']

# Target taken from the last file read; this assumes every file lists rows in the same order
train_y = train_pred['target']

# Same for the test predictions
test_x = pd.DataFrame()
for path in test_files:
    test_pred = pd.read_csv(path)
    name = path[-8:-4]
    test_x[name] = test_pred['TARGET']

# Row ids taken from the last file read
test_id = test_pred['SK_ID_CURR']

Check the correlation between the single model predictions. Ideally we want low correlation, since it indicates greater diversity among the single models.

In [3]:
print('correlation train:')
print(train_x.corr())
print('correlation test:')
print(test_x.corr())

correlation train:
          lgb1      lgb2      lgb3
lgb1  1.000000  0.990643  0.981924
lgb2  0.990643  1.000000  0.984013
lgb3  0.981924  0.984013  1.000000
correlation test:
          lgb1      lgb2      lgb3
lgb1  1.000000  0.993755  0.990406
lgb2  0.993755  1.000000  0.992537
lgb3  0.990406  0.992537  1.000000
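
To illustrate why low correlation is desirable, here is a minimal simulation sketch. It assumes a toy setting that is not part of the original solution: each model's error is Gaussian with unit variance, and every pair of models has the same error correlation `rho`. Under that assumption the variance of the averaged error is `(1 + (n - 1) * rho) / n`, so for three models at `rho = 0.99` it is about 0.993 (almost no reduction), while at `rho = 0.5` it drops to about 0.667. The function name `mean_error_variance` is hypothetical, and the correlations printed above are between predictions rather than errors, so this is only a rough analogy.

```python
import numpy as np

def mean_error_variance(rho, n_models=3, n_samples=200_000, seed=0):
    """Empirical variance of the average of n_models equally correlated errors."""
    rng = np.random.default_rng(seed)
    # Covariance matrix: unit variances, identical pairwise correlation rho
    cov = np.full((n_models, n_models), rho, dtype=float)
    np.fill_diagonal(cov, 1.0)
    errors = rng.multivariate_normal(np.zeros(n_models), cov, size=n_samples)
    # Equal-weight average across models, then variance across samples
    return errors.mean(axis=1).var()

for rho in (0.99, 0.9, 0.5):
    theory = (1 + 2 * rho) / 3  # closed form for n_models = 3
    print('rho %.2f  simulated %.4f  theory %.4f' % (rho, mean_error_variance(rho), theory))
```

The less the errors move together, the more of them cancel in the average, which is why adding diverse models tends to help more than adding near-duplicates.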


Blend the predictions and check the blended model's local CV score.

In [4]:
# Equal weights: a plain average of the three models
weights = [1.0/3, 1.0/3, 1.0/3]
train_pred = pd.Series(np.zeros(train_x.shape[0]))
test_pred = pd.Series(np.zeros(test_x.shape[0]))

for i in range(n_model):
    # Accumulate the weighted prediction of model i
    train_pred += weights[i] * train_x.iloc[:, i].values
    test_pred += weights[i] * test_x.iloc[:, i].values
    # Report the single-model AUC on the training predictions
    print('%25s, auc %.6f   weight: %.4f' % (train_x.columns.values[i], roc_auc_score(train_y, train_x.iloc[:, i]), weights[i]))

# AUC of the weighted-average blend on the training predictions
print('stacking model auc: train %.6f' % roc_auc_score(train_y, train_pred))

                     lgb1, auc 0.802002   weight: 0.3333
                     lgb2, auc 0.801779   weight: 0.3333
                     lgb3, auc 0.801157   weight: 0.3333
stacking model auc: train 0.802830
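
The cell above uses equal weights. As a hedged sketch of how unequal weights could be chosen, the block below runs a coarse grid search over weight triples that sum to 1 and keeps the one with the best AUC. Everything in it is an assumption for illustration: the data is synthetic, and the names `blend`, `preds`, and `best_w` are hypothetical rather than taken from the full solution. Weights tuned on the same predictions they are scored on can overfit, so the selected AUC is an optimistic estimate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 2000
y = rng.integers(0, 2, size=n)                 # synthetic binary target
signal = y + rng.normal(scale=1.0, size=n)     # noise shared by all models
# Three hypothetical models: shared signal plus independent noise of varying strength
preds = np.column_stack([signal + rng.normal(scale=s, size=n) for s in (0.5, 0.8, 1.2)])

def blend(preds, weights):
    """Weighted average of the columns of preds; weights are normalized to sum to 1."""
    weights = np.asarray(weights, dtype=float)
    return preds @ (weights / weights.sum())

# Grid over the simplex in steps of 0.1, using integer counts to avoid float drift
best_auc, best_w = -1.0, None
for a in range(11):
    for b in range(11 - a):
        w = (a / 10.0, b / 10.0, (10 - a - b) / 10.0)
        auc = roc_auc_score(y, blend(preds, w))
        if auc > best_auc:
            best_auc, best_w = auc, w

single_aucs = [roc_auc_score(y, preds[:, i]) for i in range(preds.shape[1])]
print('single model aucs:', ['%.6f' % a for a in single_aucs])
print('best weights:', best_w, 'auc: %.6f' % best_auc)
```

Because the grid contains the corner weights (1, 0, 0), (0, 1, 0) and (0, 0, 1), the selected blend is never worse than the best single model on the data used for the search.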


Save the test predictions to disk. This will be our final submission.

In [5]:
sub = pd.DataFrame()
sub['SK_ID_CURR'] = test_id
sub['TARGET'] = test_pred
sub.to_csv('../output/stacked_sub.csv',index=False)
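
Before uploading, it can be worth running a quick sanity check on the submission frame. The helper below is a minimal sketch, not part of the original solution: `check_submission` is a hypothetical name, and the rule that predictions should lie in [0, 1] is an assumption based on the predictions being probabilities.

```python
import pandas as pd

def check_submission(sub, id_col='SK_ID_CURR', target_col='TARGET'):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    for col in (id_col, target_col):
        if col not in sub.columns:
            problems.append('missing column: %s' % col)
    if problems:
        return problems  # the remaining checks need both columns
    if sub[id_col].duplicated().any():
        problems.append('duplicate ids')
    if sub[target_col].isna().any():
        problems.append('NaN predictions')
    # Range check on the non-missing values only (assumes probabilities)
    if not sub[target_col].dropna().between(0.0, 1.0).all():
        problems.append('predictions outside [0, 1]')
    return problems

# Usage on a small made-up frame
demo = pd.DataFrame({'SK_ID_CURR': [100001, 100005], 'TARGET': [0.04, 0.12]})
print(check_submission(demo))
```

Calling `check_submission(sub)` right before `sub.to_csv(...)` would surface these issues before the file is written.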