## A new approach for blending

###### Why not simply stack the models?

We made a serious mistake in the stage 1 model training phase: all the models share exactly the same training set. Stacking them directly therefore overfits severely and generalizes poorly.
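
For contrast, the usual way to avoid this kind of leak is to feed the stage 2 model out-of-fold predictions, so that every stage 1 prediction comes from a model that never saw that row. The sketch below is not what this notebook does; it is a minimal NumPy-only illustration in which the function name `out_of_fold_predictions`, the least-squares stand-in model, and the synthetic data are all assumptions.

```python
import numpy as np

def out_of_fold_predictions(X, y, n_folds=5, seed=0):
    """Return one prediction per row, each made by a model that never saw that row."""
    n = len(y)
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    oof = np.full(n, np.nan)
    for k in range(n_folds):
        held_out = folds[k]
        fit_rows = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Stand-in stage 1 model (hypothetical): least-squares fit with a bias column.
        A = np.hstack([X[fit_rows], np.ones((len(fit_rows), 1))])
        coef = np.linalg.lstsq(A, y[fit_rows], rcond=None)[0]
        B = np.hstack([X[held_out], np.ones((len(held_out), 1))])
        oof[held_out] = B @ coef
    return oof

# Synthetic data, for illustration only.
rng = np.random.RandomState(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)
oof = out_of_fold_predictions(X, y)
```

The `oof` vector can then serve as a stage 2 input feature over the full training set, at the cost of training each stage 1 model `n_folds` times.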

###### What does this algorithm do?

The idea is very simple: we split the original validation set (240000 samples) into two parts, a new training set (168000) and a new validation set (72000). We use the new training set to train the stacking (stage 2) model and the new validation set to validate it. This has the following pros and cons:

**Pros:**

1. Out-of-bound training data:
        Overall, our model uses more training data than it originally did.
2. Better generalizability:
        The out-of-bound training data keeps the overfitting of the stage 1 models from hurting as much, which in the end gives us a very good result.
3. Can use strong classifiers for stage two training:
        We use XGBoost as our stage 2 model, which yields an extra 1.3% improvement in the private log-loss score (0.15380 -> 0.15172).
        
**Cons:**

1. Risky:
        The new training data for stage two is relatively small. The blending works well here only because we were lucky enough to set aside a very large validation set at the start. With a smaller validation set, this approach is risky in terms of generalizability.
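
The split described above can be sketched as follows. This is a minimal illustration, assuming a 70/30 ratio and the seed used later in the notebook; the helper name `split_holdout` and the row ids are hypothetical.

```python
import numpy as np

def split_holdout(holdout_idx, train_ratio=0.7, seed=9487):
    """Shuffle the held-out row ids and cut them into stage 2 train/validation parts."""
    holdout_idx = np.asarray(holdout_idx)
    rng = np.random.RandomState(seed)
    perm = rng.permutation(len(holdout_idx))
    # round() guards against floating-point error in the size computation.
    n_train = int(round(len(holdout_idx) * train_ratio))
    return holdout_idx[perm[:n_train]], holdout_idx[perm[n_train:]]

# Hypothetical row ids standing in for the 240000 validation rows.
holdout_idx = np.arange(1000, 1000 + 240000)
blend_train_idx, blend_val_idx = split_holdout(holdout_idx)
print(len(blend_train_idx), len(blend_val_idx))  # 168000 72000
```

The two parts are disjoint, so the stage 2 model is validated on rows it never trained on.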

----

## New blending approach

###### Why don't we stack directly?

We made a serious mistake when training the stage 1 models: all of our models used the same training set, so the stacked model overfit the training set very severely and lost its generalizability.

###### What does this algorithm do?

It is very simple: we split the original validation set (240000) again into new training data (168000) and new validation data (72000), train the stage 2 model on the new training data, and validate on the new validation data. This approach has the following pros and cons:

**Pros**

1. Out-of-bound training data:
        Overall, the model uses more training data than originally planned.
2. Better generalizability:
        Using out-of-bound training data makes the stage 1 overfitting relatively less severe.
3. The stage 2 model can be a stronger classifier:
        We used Xgboost as the stage 2 model and ended up with a 1.3% improvement in private log-loss (0.15380 -> 0.15172).

**Cons**

1. Risk:
        This approach is very risky. It succeeded mainly because the validation set we split off at the start was relatively large. If the initial validation set were very small, this approach could not guarantee generalizability; that is, it could overfit the new training set severely.

In [1]:
import numpy as np
import pandas as pd
import pickle

In [2]:
val_idxes = pickle.load(open('./lystdo_kernel/validation_index.pkl','rb'))

In [3]:
df_train = pd.read_csv('../dataset/quora-question-pairs/train.csv')['is_duplicate']
# df_test = pd.read_csv('../dataset/quora-question-pairs/test.csv')

# .loc and .values replace the removed pandas APIs .ix and .as_matrix()
df_val = df_train.loc[val_idxes]
df_train = df_train.drop(val_idxes)

y_train = df_train.values
y_val = df_val.values

In [4]:
blending_train_ratio = 0.7

np.random.seed(9487)

perm = np.random.permutation(len(val_idxes))
blending_train_size = int(len(val_idxes)*blending_train_ratio)

blending_train_idx = val_idxes[perm[:blending_train_size]]
blending_val_idx = val_idxes[perm[blending_train_size:]]

y_blending_train = df_val.loc[blending_train_idx].values
y_blending_val = df_val.loc[blending_val_idx].values

In [5]:
# compute the class weights that will be applied when training the model
def get_pos_neg_weights(data, as_array=False):
    test_set_pos_label_ratio = 0.1746
    m_pos_ratio = sum(data)/len(data)
    weight = {
        0: (1-test_set_pos_label_ratio) / (1-m_pos_ratio),
        1: test_set_pos_label_ratio/m_pos_ratio
    }
    
    if not as_array:
        return weight
    else:
        return np.array([weight[v] for v in data])


val_weight = get_pos_neg_weights(y_val, as_array=True)
train_weight = get_pos_neg_weights(y_train, as_array=True)

blending_val_weight = get_pos_neg_weights(y_blending_val, as_array=True)
blending_train_weight = get_pos_neg_weights(y_blending_train, as_array=True)

y_all_train = pd.read_csv('../dataset/quora-question-pairs/train.csv')['is_duplicate'].values
all_train_weight = get_pos_neg_weights(y_all_train, as_array=True)
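
The weighting in `get_pos_neg_weights` rescales each class so that the weighted share of positive labels matches the hard-coded test-set ratio of 0.1746. The sketch below checks that property on a toy label vector; the function name `class_weights` and the toy labels are assumptions for illustration.

```python
import numpy as np

TARGET_POS_RATIO = 0.1746  # value hard-coded in the notebook

def class_weights(labels, target_pos_ratio=TARGET_POS_RATIO):
    """Per-class weights that move the positive share to target_pos_ratio."""
    labels = np.asarray(labels)
    pos_ratio = labels.mean()
    return {
        0: (1 - target_pos_ratio) / (1 - pos_ratio),
        1: target_pos_ratio / pos_ratio,
    }

# Toy labels with 40% positives, for illustration only.
labels = np.array([1] * 40 + [0] * 60)
w = class_weights(labels)
sample_w = np.where(labels == 1, w[1], w[0])

# After weighting, positives account for the target share of the total weight.
weighted_pos_ratio = (sample_w * labels).sum() / sample_w.sum()
```

Because the weights also sum to the number of samples, the weighted log-loss stays on the same scale as the unweighted one.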

## Embedding Cache

Preparing this data is time-consuming, so we cache the results.

In [6]:
# '''no correction , use GloVe'''
# data_1,data_2,labels,test_data_1,test_data_2,test_ids,embedding_matrix, nb_words = pickle.load(open('./cache.pkl','rb'))
# model_name = 'glove_without_word_correction'

# '''correction , use GloVe'''
# data_1,data_2,labels,test_data_1,test_data_2,test_ids,embedding_matrix, nb_words = pickle.load(open('./cache_text_correction.pkl','rb'))
# model_name = 'glove_with_word_correction'


'''no correction , use fasttext , 0.1592'''
data_1,data_2,labels,test_data_1,test_data_2,test_ids,embedding_matrix, nb_words = pickle.load(open('./cache_fasttext.pkl','rb'))
model_name = 'fasttext_without_word_correction'

# '''correction , use fasttext , 0.1602'''
# data_1,data_2,labels,test_data_1,test_data_2,test_ids,embedding_matrix, nb_words = pickle.load(open('./cache_fasttext_text_correction.pkl','rb'))
# model_name = 'fasttext_with_word_correction'

In [7]:
# def embed_datas(datas):
#     print('need test')
#     res = []
#     for i,data in enumerate(datas):
#         res.append([embedding_matrix[v] for v in data])
#     return np.array(res)

## Load all features

Choose which stage 1 models and features are included. This choice can be treated as a kind of hyperparameter.

In [8]:
#### Add extra features to leak

from sklearn.preprocessing import StandardScaler

is_all_features_in_single_dense = False

feature_files = [
#     'lystdo_correctwords_lstm',
#     'lystdo_1234_loss017',
    'Abhishek_features',
    'magic_feature',
    'magic_feature_v1',
    'Howard_feature',
    'HubertLin_features_raw',
    'HubertLin_features_simple_tokenizer',
    'HubertLin_features_word_corrected',
    'pagerank',
    'word_match_share',
    'fasttext_distance',
#     'modified_count',
    'magic_v25_qid',
    'k_scrore',
    'pos_dist',
    'dep_dist',
]

def inf_nan_to_zero(arr):
    nan = np.isnan(arr)
    inf = np.isinf(arr)
    arr[nan] = 0
    arr[inf] = 0
    return arr

def standardize(train,test):
    ss = StandardScaler()
    ss.fit(np.vstack((train, test)))
    train = ss.transform(train)
    test = ss.transform(test)
    return train, test

print('Loading original leaks')
leaks,test_leaks = pickle.load(open('./lystdo_kernel/leaks.pkl','rb'))
leaks,test_leaks = standardize(leaks,test_leaks)
if not is_all_features_in_single_dense:
    leaks = [leaks]
    test_leaks = [test_leaks]

for feature_file in feature_files:
    
    try:
        print('Loading '+feature_file)
        train_features,test_features = pickle.load(open('./lystdo_kernel/ss_cache/'+feature_file+'.pkl' , 'rb'))
    except:
        train_features = pd.read_csv('./features_from_model/train/'+feature_file+'.csv').values
        test_features = pd.read_csv('./features_from_model/test/'+feature_file+'.csv').values
        train_features = inf_nan_to_zero(train_features)
        test_features = inf_nan_to_zero(test_features)

        train_features,test_features = standardize(train_features,test_features)
        pickle.dump([train_features,test_features], open('./lystdo_kernel/ss_cache/'+feature_file+'.pkl' , 'wb'))

    if is_all_features_in_single_dense:
        leaks = np.hstack([leaks,train_features])
        test_leaks = np.hstack([test_leaks,test_features])
    else:
        leaks.append(train_features)
        test_leaks.append(test_features)
        
# # wrap all features up into one single fc layer
# if is_all_features_in_single_dense:
#     leaks = [leaks]
#     test_leaks = [test_leaks]

In [9]:
leaks = np.hstack(leaks)
test_leaks = np.hstack(test_leaks)

blending_val_leaks = leaks[blending_val_idx,:]
blending_train_leaks = leaks[blending_train_idx,:]

In [10]:
'''
Select which leak features are used for training; the rest are used for validation.
'''

def select_leaks(leaks, idxes, stack):
    ret = []
    for leak in leaks:
        if stack:
            ret.append(np.vstack([leak[idxes],leak[idxes]]))
        else:
            ret.append(leak[idxes])
    return ret

stack = True

if stack:
#     data_1_train = np.vstack((data_1[idx_train], data_2[idx_train]))
#     data_2_train = np.vstack((data_2[idx_train], data_1[idx_train]))
    leaks_train = select_leaks(leaks,idx_train,stack)
    labels_train = np.concatenate((labels[idx_train], labels[idx_train]))

#     data_1_val = data_1[idx_val]
#     data_2_val = data_2[idx_val]
    leaks_val = select_leaks(leaks,idx_val,stack=False)
    labels_val = labels[idx_val]
    
    if use_prev_model_as_helper:
        val_helpers = [train_helpers[0][idx_val]]
        train_helpers = [np.vstack([train_helpers[0][idx_train],train_helpers[0][idx_train]])]
else:
    data_1_train = data_1[idx_train]
    data_2_train = data_2[idx_train]
    leaks_train = select_leaks(leaks,idx_train,stack)
    labels_train = labels[idx_train]
    
    data_1_val = data_1[idx_val]
    data_2_val = data_2[idx_val]
    leaks_val = select_leaks(leaks,idx_val,stack)
    labels_val = labels[idx_val]
    
    if use_prev_model_as_helper:
        val_helpers = [train_helpers[0][idx_val]]
        train_helpers = [train_helpers[0][idx_train]]
    

weight_val = np.ones(len(labels_val))
if re_weight:
    weight_val *= 0.472001959
    weight_val[labels_val==0] = 1.309028344

## Load previous model predictions

Use the predictions of some blended models as inputs to the next blending model. This does not work very well.

In [162]:
use_prev_model_as_helper = False

helpers = [
    'lystdo_onlyAB_0.1704_prediction_max',
    'lystdo_text_corrected_0.1719_prediction_mean',
    'lystdo_full_features_and_poolings_0.1806_prediction_max',
    
    'lystdo_Fasttext_AllFeatures_WordCorrection_Loss_0.1609_0.1609_prediction_mean',
    'lystdo_original_fullfeature_prev_lstm_0.1739_prediction_mean',
    'lystdo_onlyAB_0.1704_prediction_mean',

    'lystdo_full_features_0.1744_prediction_max', # 14942
    
    'lystdo_original_fullfeature_prev_lstm_0.1739_prediction_max',
    'lystdo_text_corrected_0.1719_prediction_max',
    'lystdo_full_features_0.1744_prediction_mean',
    
#     'Howard_xgb'
    
#     ...QWQ
]

# 0.14963
#     'lystdo_full_features_0.1744_prediction_max',
#     'lystdo_onlyAB_0.1704_prediction_max',
#     'lystdo_text_corrected_0.1719_prediction_mean',
#     'lystdo_full_features_and_poolings_0.1806_prediction_max',
    
#     'lystdo_Fasttext_AllFeatures_WordCorrection_Loss_0.1609_0.1609_prediction_mean',
#     'lystdo_original_fullfeature_prev_lstm_0.1739_prediction_mean',
#     'lystdo_onlyAB_0.1704_prediction_mean',

train_tmp = []
test_tmp = []

def read_csv(file, is_train):
    if is_train:
        name = './model_predictions/train/'+file+'.csv'
    else:
        name = './model_predictions/test/'+file+'.csv'
    df = pd.read_csv(name)
    df = pd.DataFrame(df['is_duplicate'].values, columns=[file+'_is_duplicate'])
    return df

for helper in helpers:
    print('Loading :', helper)
    train_tmp.append(read_csv(helper,is_train=True))
    test_tmp.append(read_csv(helper,is_train=False))

train_helpers = pd.concat(train_tmp, axis=1).values
test_helpers = pd.concat(test_tmp, axis=1).values

val_helpers = train_helpers[val_idxes]
blending_train_helper = train_helpers[blending_train_idx]
blending_val_helper = train_helpers[blending_val_idx]

train_helpers = np.delete(train_helpers, val_idxes, axis=0)

del train_tmp
del test_tmp


Loading : lystdo_onlyAB_0.1704_prediction_max
Loading : lystdo_text_corrected_0.1719_prediction_mean
Loading : lystdo_full_features_and_poolings_0.1806_prediction_max
Loading : lystdo_Fasttext_AllFeatures_WordCorrection_Loss_0.1609_0.1609_prediction_mean
Loading : lystdo_original_fullfeature_prev_lstm_0.1739_prediction_mean
Loading : lystdo_onlyAB_0.1704_prediction_mean
Loading : lystdo_full_features_0.1744_prediction_max
Loading : lystdo_original_fullfeature_prev_lstm_0.1739_prediction_max
Loading : lystdo_text_corrected_0.1719_prediction_max
Loading : lystdo_full_features_0.1744_prediction_mean


## Clean data for xgboost

In [163]:
def inf_nan_to_zero(arr):
    nan = np.isnan(arr)
    inf = np.isinf(arr)
    arr[nan] = 0
    arr[inf] = 0
    return arr

# process blending training data

blending_train_helper = inf_nan_to_zero(blending_train_helper)
blending_val_helper = inf_nan_to_zero(blending_val_helper)

# process all training/testing data

train_helpers = inf_nan_to_zero(train_helpers)
test_helpers = inf_nan_to_zero(test_helpers)
val_helpers = inf_nan_to_zero(val_helpers)

# process all leaks

# blending_val_leaks = inf_nan_to_zero(blending_val_leaks)
# blending_train_leaks = inf_nan_to_zero(blending_train_leaks)

In [152]:
# blending_train = np.hstack([blending_train_helper,blending_train_leaks])
# blending_val = np.hstack([blending_val_helper,blending_val_leaks])

## Start blending

In [173]:
import xgboost as xgb

# Set our parameters for xgboost
params = {}
params['objective'] = 'binary:logistic'
params['eval_metric'] = 'logloss'

params['eta'] = 0.02
params['max_depth'] = 7
params['subsample'] = 0.6
params['base_score'] = 0.2
 
# params['eta'] = 0.1
# params['max_depth'] = 5
# params['subsample'] = 0.7
# params['colsample_bytree'] = 0.5
# params['colsample_bylevel'] = 0.5
# params['min_child_weight'] = 10

# params['base_score'] = 0.1


d_train = xgb.DMatrix(blending_train_helper, label=y_blending_train, weight=blending_train_weight)
d_valid = xgb.DMatrix(blending_val_helper, label=y_blending_val, weight=blending_val_weight)

watchlist = [(d_train, 'train'), (d_valid, 'valid')]

bst = xgb.train(params, d_train, 100000, watchlist, early_stopping_rounds=100, verbose_eval=20)

[0]	train-logloss:0.452911	valid-logloss:0.453067
Multiple eval metrics have been passed: 'valid-logloss' will be used for early stopping.

Will train until valid-logloss hasn't improved in 100 rounds.
[20]	train-logloss:0.308539	valid-logloss:0.311724
[40]	train-logloss:0.240279	valid-logloss:0.245806
[60]	train-logloss:0.201137	valid-logloss:0.208824
[80]	train-logloss:0.177142	valid-logloss:0.186616
[100]	train-logloss:0.161898	valid-logloss:0.172979
[120]	train-logloss:0.151945	valid-logloss:0.164322
[140]	train-logloss:0.145396	valid-logloss:0.158843
[160]	train-logloss:0.141034	valid-logloss:0.155287
[180]	train-logloss:0.138006	valid-logloss:0.153089
[200]	train-logloss:0.135804	valid-logloss:0.151752
[220]	train-logloss:0.134192	valid-logloss:0.150889
[240]	train-logloss:0.132972	valid-logloss:0.150368
[260]	train-logloss:0.13192	valid-logloss:0.149987
[280]	train-logloss:0.131025	valid-logloss:0.149825
[300]	train-logloss:0.13009	valid-logloss:0.149764
[320]	train-logloss:0.12

In [174]:
# d_test = xgb.DMatrix()
# pred = bst.predict(d_test)
# eval_logloss(y_all_train,pred.flatten(), all_train_weight)

In [175]:
'''
Output the blending result to csv file
'''

d_test = xgb.DMatrix(test_helpers)
pred = bst.predict(d_test)

sub = pd.DataFrame()
sub['test_id'] = np.arange(2345796)
sub['is_duplicate'] = pred
sub.to_csv('./final_hubert_blending_prediction.csv', index=False)

## Use LR as blending model

Logistic regression (LR) is a more conventional choice for a blending model, so we tried it as well.

It did not work well, however (log-loss 0.1645, which is even worse than a single model). My explanation is this: blending effectively transforms the task into a new problem, because the features of the new training data are completely different from those of the original training data. We need a stronger classifier to learn these new features; LR is too weak.

In [166]:
from sklearn.linear_model import LogisticRegression

class_weight = get_pos_neg_weights(y_blending_train, as_array=False)

lr = LogisticRegression(penalty='l1', 
                        C=1, 
                        intercept_scaling=1, 
                        class_weight=class_weight,
                        random_state=8787, 
                        solver='liblinear', 
                        max_iter=100, 
                        multi_class='ovr')

lr.fit(blending_train_helper, y_blending_train)

pred = lr.predict_proba(blending_val_helper)
eval_logloss(y_blending_val, pred[:,1], blending_val_weight)

0.16454852889914509
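
The notebook calls `eval_logloss` without showing its definition. A plausible reading, given the arguments, is a sample-weighted binary log-loss; the sketch below is a guess at that behavior rather than the author's actual helper, and the name `weighted_logloss` is hypothetical.

```python
import numpy as np

def weighted_logloss(y_true, p_pred, sample_weight=None, eps=1e-15):
    """Sample-weighted binary cross-entropy (assumed behavior of eval_logloss)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so log() never sees exactly 0 or 1.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    if sample_weight is None:
        sample_weight = np.ones_like(y_true)
    w = np.asarray(sample_weight, dtype=float)
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return float((w * losses).sum() / w.sum())

# Toy check: a 0.5 prediction costs ln(2) per sample regardless of the label.
toy_loss = weighted_logloss([1, 0], [0.5, 0.5])
```

With the class weights computed earlier passed as `sample_weight`, the score reflects the test-set label balance instead of the training balance.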

In [168]:
# pred = lr.predict_proba(train_helpers)
# eval_logloss(y_all_train,pred[:,1], all_train_weight)

In [169]:
pred = lr.predict_proba(test_helpers)[:,1]

sub = pd.DataFrame()
sub['test_id'] = np.arange(2345796)
sub['is_duplicate'] = pred
sub.to_csv('./final_lr_blending_prediction.csv', index=False)

## Directly use stacking (performance sucks)

As stated above, we made a serious mistake when training each model, so direct stacking generalizes very poorly (validation loss is 0.168, the worst I have seen).

In [164]:
import xgboost as xgb

# Set our parameters for xgboost
params = {}
params['objective'] = 'binary:logistic'
params['eval_metric'] = 'logloss'

params['eta'] = 0.02
params['max_depth'] = 7
params['subsample'] = 0.6
params['base_score'] = 0.2
 
# params['eta'] = 0.1
# params['max_depth'] = 5
# params['subsample'] = 0.7
# params['colsample_bytree'] = 0.5
# params['colsample_bylevel'] = 0.5
# params['min_child_weight'] = 10

# params['base_score'] = 0.1


d_train = xgb.DMatrix(train_helpers, label=y_train, weight=train_weight)
d_valid = xgb.DMatrix(val_helpers, label=y_val, weight=val_weight)

watchlist = [(d_train, 'train'), (d_valid, 'valid')]

bst = xgb.train(params, d_train, 20000, watchlist, early_stopping_rounds=100, verbose_eval=50)

[0]	train-logloss:0.451229	valid-logloss:0.452266
Multiple eval metrics have been passed: 'valid-logloss' will be used for early stopping.

Will train until valid-logloss hasn't improved in 100 rounds.
[50]	train-logloss:0.192985	valid-logloss:0.222633
[100]	train-logloss:0.132873	valid-logloss:0.178054
[150]	train-logloss:0.113614	valid-logloss:0.168937
[200]	train-logloss:0.106972	valid-logloss:0.168834
[250]	train-logloss:0.104529	valid-logloss:0.170354
Stopping. Best iteration:
[171]	train-logloss:0.109984	valid-logloss:0.168408



In [165]:
d_test = xgb.DMatrix(test_helpers)
pred = bst.predict(d_test)

sub = pd.DataFrame()
sub['test_id'] = np.arange(2345796)
sub['is_duplicate'] = pred
sub.to_csv('./final_stacking_prediction.csv', index=False)