## What is this kernel?

1. This kernel trains a plain ("naive") RNN.
2. Why is it called "dynamic"? Because the training data changes during the training phase: we can tune hyper-parameters of the training data before or even during training, and we can mix augmented data into the training set.

## Why was this kernel eventually deprecated?

Whatever model we tried, deep learning alone could not yield a good score, because this data set has an unusual distribution and mislabeling problems. So we need to add extra "leak" features.

## Some ideas

1. Since we already have data labelled as same-question pairs, could we train another model that sequentially generates an equivalent question string from the input?
2. Curriculum learning
3. Ensembling:

        a. Naive RNN
        b. xgboost
        c. randomforest
        
4. Feature engineering

        a. question length (May not be very powerful)
        b. similarity metric score (compute it directly and feed it in as a training feature, for example cosine similarity)
        c. I don't think TF-IDF would work, but still worth trying if we have time.
        
5. Create tags for each question from the sentence paired with it.

        For example:
        Q1: I am a good man.    <->    Q2: A good man is me.
        
        Q1 is tagged with [A, good, man, is, me]
        Q2 is tagged with [I, am, a, good, man]
        
        Then use this data to train a model that predicts which tags would appear for each question.
        
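6. If fit_generator is given a class_weight, try training first without class_weight and with a 1:1 duplicate ratio until early stopping, then continue training with class_weight set. This can lower the loss further. The idea is "learn the representation first, then learn the distribution".

7. Because swapping the positions of q1 and q2 gives a different prediction, we can predict both orderings on the test set and keep the prediction that is farthest from 0.5.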
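
Idea 7 can be sketched in a few lines. This is only an illustration, not the notebook's prediction code: the helper name `pick_most_confident` and the sample probabilities are made up for the example.

```python
def pick_most_confident(proba_q1_q2, proba_q2_q1):
    """For each pair, keep the duplicate probability farther from 0.5.

    Both arguments are sequences of P(duplicate): one predicted with the
    questions in (q1, q2) order, the other in (q2, q1) order.
    On a tie the (q2, q1) prediction is kept.
    """
    return [
        a if abs(a - 0.5) > abs(b - 0.5) else b
        for a, b in zip(proba_q1_q2, proba_q2_q1)
    ]


# hypothetical probabilities for three question pairs
forward = [0.43, 0.03, 0.51]
backward = [0.60, 0.02, 0.45]
chosen = pick_most_confident(forward, backward)
```

The selection favours whichever ordering the model is more certain about, regardless of which class it leans towards.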

## TODO

1. Recover the non-augmented training pairs from the validation set.
2. Use maxout for semantic compressor instead of RNN.
3. xgboost of course

# model time!

In [1]:
import numpy as np
import pandas as pd
import re
import pickle
import json
import matplotlib.pyplot as plt
import random

In [2]:
word2vec = pickle.load(open('../dataset/spacy_feature_extraction/word_embedding_vector_dict.pkl','rb'))

In [3]:
train_pos_pairs = pickle.load(open('../dataset/augmented/train_positive_qid_pairs.pkl', 'rb'))
train_neg_pairs = pickle.load(open('../dataset/augmented/train_negative_qid_pairs.pkl', 'rb'))

qid_question_dict = pickle.load(open('../dataset/spacy_feature_extraction/qid_spacy_cleaned_word_lists_dict.pkl','rb'))

In [4]:
# set to 1 if you don't want to do N_gram parsing/training (but you still need to modify the model's structure by yourself)
N_gram = 1

is_output_single_value = False

In [5]:
question_max_len = 30
wordvec_dim = 300

EMPTY_WORD = ''
EMPTY_VEC = np.zeros(wordvec_dim)
word2vec[EMPTY_WORD] = EMPTY_VEC

In [6]:
def clip_word_list_len(word_list, leng=question_max_len):
    # too long: truncate to `leng` words
    if len(word_list)>=leng:
        return word_list[:leng]
    # too short: right-pad with EMPTY_WORD up to `leng` words
    else:
        len_diff = leng - len(word_list)
        return np.hstack([word_list, np.repeat(EMPTY_WORD, len_diff)])
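
The same clip-or-pad idea can be shown without NumPy. The sketch below is a simplified stand-in, not the function above: `pad_or_clip` is a hypothetical name and it returns a plain list instead of an array.

```python
def pad_or_clip(words, max_len, pad_token=''):
    """Return exactly `max_len` tokens: truncate long lists, right-pad short ones."""
    words = list(words)[:max_len]
    return words + [pad_token] * (max_len - len(words))


short = pad_or_clip(['how', 'are', 'you'], max_len=5)
long_ = pad_or_clip(['a', 'b', 'c', 'd', 'e', 'f'], max_len=5)
```

Padding on the right keeps real words at the start of the sequence, so the empty token (mapped to a zero vector in this notebook) only ever appears at the tail.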

In [7]:
for qid in qid_question_dict:
    qid_question_dict[qid] = clip_word_list_len(qid_question_dict[qid])

#### Process the validation set

In [8]:
# ex: [a,b,c] with N=3 =>  [ED,ED,A], [ED,A,B], [A,B,C], [B,C,ED], [C,ED,ED], where A,B,C,ED are embedded vectors
def N_gram_embedding(word_lists, N):
    
    # single word list
    if len(word_lists.shape)==1:
        question_len = word_lists.shape[0]
        sub_seq_len = question_len+N-1
        res = np.zeros((1, sub_seq_len, N, 300))
        
        # insert ED
        res[0,0,0], res[0,0,1], res[0,1,0], res[0,-1,-1], res[0,-1,-2], res[0,-2,-1] = [EMPTY_VEC]*6

        for i,word in enumerate(word_lists):
            for n in range(N):
                if i+n<sub_seq_len and N-n-1>-1:
                    res[0,i+n,N-n-1] = word2vec[word]
        
    # multiple word lists
    else:
        batch_size, question_len = word_lists.shape
        sub_seq_len = question_len+N-1
        res = np.zeros((batch_size, sub_seq_len, N, 300))

        for s in range(batch_size):

            # insert ED
            res[s,0,0], res[s,0,1], res[s,1,0], res[s,-1,-1], res[s,-1,-2], res[s,-2,-1] = [EMPTY_VEC]*6

            for i,word in enumerate(word_lists[s]):
                for n in range(N):
                    if i+n<sub_seq_len and N-n-1>-1:
                        res[s,i+n,N-n-1] = word2vec[word]

    return res
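
The windowing that `N_gram_embedding` performs is easier to see on plain tokens than on 300-dimensional vectors. The sketch below is an illustration only: `ngram_windows` is a hypothetical helper, and the string `'ED'` stands in for the zero vector.

```python
def ngram_windows(tokens, n, pad='ED'):
    """Slide a window of width `n` over `tokens`, padding both ends.

    A sequence of length L yields L + n - 1 windows, which matches the
    `question_len + N - 1` time steps used by the notebook.
    """
    tokens = list(tokens)
    padded = [pad] * (n - 1) + tokens + [pad] * (n - 1)
    return [padded[i:i + n] for i in range(len(tokens) + n - 1)]


windows = ngram_windows(['A', 'B', 'C'], n=3)
```

For `['A', 'B', 'C']` and `n=3` this reproduces the five windows listed in the comment above the function.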

In [9]:
def word_list_to_vec(word_list):
    if N_gram is not None and N_gram>1:
        return N_gram_embedding(word_list, N_gram)
    else:
        # the word list is single data (isinstance also accepts numpy.str_ elements)
        if isinstance(word_list[0], str):
            ret = []
            for word in word_list:
                try:
                    ret.append(word2vec[word])
                except KeyError:
                    # unknown word: fall back to the vector of 'something'
                    print('Exception word:', word)
                    ret.append(word2vec['something'])
            return np.array(ret)
        # the word list is multiple data
        else:
            ret = np.zeros((len(word_list),question_max_len,wordvec_dim))
            for i,single_list in enumerate(word_list):
                for j,word in enumerate(single_list):
                    try:
                        ret[i,j] = word2vec[word]
                    except KeyError:
                        # unknown word: fall back to the vector of 'something'
                        print('Exception word:', word)
                        ret[i,j] = word2vec['something']
            return ret
            
    
def qid_list_to_vec(qid_list, swap=False):
    word_lists_1 = np.array([qid_question_dict[q_pair[0]] for q_pair in qid_list])
    word_lists_2 = np.array([qid_question_dict[q_pair[1]] for q_pair in qid_list])
    
    if N_gram!=None and N_gram>1:
        word_vectors_1 = N_gram_embedding(word_lists_1, N_gram)
        word_vectors_2 = N_gram_embedding(word_lists_2, N_gram)
    else:
        word_vectors_1 = np.array([[word2vec[word] for word in word_list] for word_list in word_lists_1])
        word_vectors_2 = np.array([[word2vec[word] for word in word_list] for word_list in word_lists_2])
    
    if swap:
        return [word_vectors_2, word_vectors_1]
    else:
        return [word_vectors_1, word_vectors_2]
    

In [10]:
# %%time

val_pos_pairs = pickle.load(open('../dataset/augmented/validation_positive_qid_pairs.pkl', 'rb'))
val_neg_pairs = pickle.load(open('../dataset/augmented/validation_negative_qid_pairs.pkl', 'rb'))
all_val_pairs = np.vstack([val_pos_pairs,val_neg_pairs])

X_val = qid_list_to_vec(all_val_pairs, swap=False) # must not swap

# one-hot labels: column 1 = duplicate, column 0 = not duplicate
pos_size = val_pos_pairs.shape[0]
y_val = np.zeros((all_val_pairs.shape[0],2))
y_val[:pos_size,1] = 1
y_val[pos_size:,0] = 1

del val_pos_pairs, val_neg_pairs, all_val_pairs

In [11]:
X_val[0].shape, X_val[1].shape, y_val.shape

((20000, 30, 300), (20000, 30, 300), (20000, 2))

#### Calculate validation data's class weights

In [12]:
# Calculate the class weights to simulate the class distribution of the test samples
#
# These numbers and the formula come from here:
#     https://www.kaggle.com/lystdo/quora-question-pairs/lstm-with-word2vec-embeddings#175198

test_set_pos_label_ratio = 0.1746

if is_output_single_value:
    validation_pos_ratio = sum(y_val)/len(y_val)
else:
    validation_pos_ratio = sum(y_val[:,1]==True)/y_val.shape[0]

val_weights = {
    0: (1-test_set_pos_label_ratio) / (1-validation_pos_ratio),
    1: test_set_pos_label_ratio/validation_pos_ratio
}

validation_weights = np.repeat(val_weights[1], y_val.shape[0])
if is_output_single_value:
    validation_weights[y_val==0] = val_weights[0]
else:
    validation_weights[y_val[:,0]==1] = val_weights[0]
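
A small worked example shows what this reweighting does. It is a sketch under stated assumptions: `class_weights` is a hypothetical helper, 0.1746 is the test-set ratio quoted above, and the 50/50 observed split is assumed only for the example (the notebook does not state the validation ratio).

```python
def class_weights(target_pos_ratio, observed_pos_ratio):
    """Weights that make a set with `observed_pos_ratio` positives count,
    in a weighted loss, like a set with `target_pos_ratio` positives."""
    return {
        0: (1 - target_pos_ratio) / (1 - observed_pos_ratio),
        1: target_pos_ratio / observed_pos_ratio,
    }


# assumed 50/50 split, reweighted towards the quoted test-set ratio
w = class_weights(0.1746, 0.5)

# Weighted share of positives: this should recover the target ratio.
weighted_pos = 0.5 * w[1]
weighted_neg = 0.5 * w[0]
weighted_share = weighted_pos / (weighted_pos + weighted_neg)
```

With these inputs the positive class is down-weighted and the negative class up-weighted, so the weighted positive share equals the target ratio again.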

#### Create model

In [13]:
# from tensorflow.contrib.keras.python.keras.layers import Input, GRU, Reshape, Dense, Dropout, BatchNormalization, LSTM, GaussianDropout
# from tensorflow.contrib.keras.python.keras.layers import SpatialDropout1D, Dot, Add, Flatten, Activation
# from tensorflow.contrib.keras.python.keras.layers.merge import Concatenate
# from tensorflow.contrib.keras.python.keras.models import Model
# from tensorflow.contrib.keras.python.keras.optimizers import RMSprop, Nadam
# from tensorflow.contrib.keras.python.keras.layers.wrappers import TimeDistributed, Bidirectional
# from tensorflow.contrib.keras.python.keras.layers.noise import GaussianNoise
# from tensorflow.contrib.keras.python.keras.callbacks import EarlyStopping, TensorBoard

# from tensorflow.contrib.keras.python.keras.layers.pooling import MaxPooling2D, AveragePooling2D, AveragePooling1D
# from tensorflow.contrib.keras.python.keras.layers.convolutional import Conv2D

# import tensorflow as tf


from keras.layers import Input, GRU, Reshape, Dense, Dropout, BatchNormalization, LSTM, GaussianDropout
from keras.layers import SpatialDropout1D, Dot, Add, Flatten, Activation
from keras.layers.merge import Concatenate
from keras.models import Model
from keras.optimizers import RMSprop, Nadam
from keras.layers.wrappers import TimeDistributed, Bidirectional
from keras.layers.noise import GaussianNoise
from keras.callbacks import EarlyStopping, TensorBoard

from keras.layers.pooling import MaxPooling2D, AveragePooling2D, AveragePooling1D
from keras.layers.convolutional import Conv2D



def gen_model():

    # hyper-parameters that should be passed as function arguments
    
    lstm_ret_seq_output = 64
    ttd_dense = 8
    
    lstm_naive_output = 128
    naive_dense = 16

#     q1_len = Input(shape=(1,))
#     q2_len = Input(shape=(1,))

     
    # Define distance metrics

#     Dot_distance = lambda x: K.sum( K.batch_dot(x[0],x[1],axes=[1,1]),axis=1,keepdims=True )
#     dot_dist = Merge(mode=Dot_distance, output_shape=(1,))

#     Manhattan_distance = lambda x: K.sum( K.abs(x[0]-x[1]),axis=1,keepdims=True )
#     manhattan_dist = Merge(mode=Manhattan_distance, output_shape=(1,))
    
#     Cosine_distance = lambda x: K.sum( K.cos(x[0],x[1]),axis=1,keepdims=True )
#     cos_dist = Merge(mode=Cosine_distance, output_shape=(1,))
    
    ########################################################################################################################
    # Use distance metric with naive RNN (without return sequences)
    ########################################################################################################################
    
#     # add noise to inputs
    
#     Noise = GaussianNoise(0.1)

#     embedded_input1 = Input(shape=(question_max_len,wordvec_dim))
#     noised_input1 = Noise(embedded_input1)

#     embedded_input2 = Input(shape=(question_max_len,wordvec_dim))
#     noised_input2 = Noise(embedded_input2)
    
#     RNN_naive_2d = LSTM(lstm_naive_output, dropout=0.2, recurrent_dropout=0.2)
#     rnn_2d_1 = RNN_naive_2d(noised_input1)
#     rnn_2d_2 = RNN_naive_2d(noised_input2)
    
#     RNN_naive_flat = LSTM(lstm_naive_output, dropout=0.2, recurrent_dropout=0.2)
#     rnn_flat_1 = Reshape((1,question_max_len*wordvec_dim))(noised_input1)
#     rnn_flat_2 = Reshape((1,question_max_len*wordvec_dim))(noised_input2)
#     rnn_flat_1 = RNN_naive_flat(rnn_flat_1)
#     rnn_flat_2 = RNN_naive_flat(rnn_flat_2)
    
#     rnn_concat = Concatenate(axis=-1)([rnn_2d_1, rnn_2d_2, rnn_flat_1, rnn_flat_2])
#     rnn_concat = Dropout(0.2)(rnn_concat)
#     rnn_concat = BatchNormalization()(rnn_concat)
    
#     dot_dist_flat = dot_dist([rnn_flat_1,rnn_flat_2])
#     manhattan_dist_flat = manhattan_dist([rnn_flat_1,rnn_flat_2])
# #     cos_dist_flat = cos_dist([rnn_flat_1,rnn_flat_2])
    
#     concat_dists  = Concatenate(axis=1)([rnn_concat, dot_dist_flat, manhattan_dist_flat])
    
#     concat_dists = Dense(naive_dense, activation='relu')(concat_dists)
#     concat_dists = Dropout(0.2)(concat_dists)
#     concat_dists = BatchNormalization()(concat_dists)
    
#     out = Dense(1, activation='sigmoid')(concat_dists)
    
# #     all_comb = Concatenate(axis=1)([dot_dist_flat, manhattan_dist_flat, dense_similarity])
# #     out = Dense(1, activation='sigmoid')(all_comb)
    
    
    ########################################################################################################################
    #     Use a naive RNN to fit the representation again (not yet tested, takes too much memory)
    ########################################################################################################################

#     # add noise to inputs
    
#     Noise = GaussianNoise(0.1)

#     embedded_input1 = Input(shape=(question_max_len,wordvec_dim))
#     noised_input1 = Noise(embedded_input1)

#     embedded_input2 = Input(shape=(question_max_len,wordvec_dim))
#     noised_input2 = Noise(embedded_input2)

#     rnn_reseq1 = SpatialDropout1D(0.2)(rnn_reseq1)
#     rnn_reseq2 = SpatialDropout1D(0.2)(rnn_reseq2)
#     rnn_reseq1 = BatchNormalization()(rnn_reseq1)
#     rnn_reseq2 = BatchNormalization()(rnn_reseq2)
    
#     RNN_naive = LSTM(lstm_naive_output, dropout=0.2, recurrent_dropout=0.2)
#     flatten1 = Reshape((1,lstm_ret_seq_output*question_max_len*2))(rnn_reseq1)
#     flatten2 = Reshape((1,lstm_ret_seq_output*question_max_len*2))(rnn_reseq2)
#     rnn_naive1 = RNN_naive(flatten1)
#     rnn_naive2 = RNN_naive(flatten2)
    
#     rnn_concat = Concatenate(axis=-1)([rnn_naive1, rnn_naive2])
#     rnn_concat = Dropout(0.3)(rnn_concat)
#     rnn_concat = BatchNormalization()(rnn_concat)
    
#     rnn_concat = Dense(question_max_len*ttd_dense//2, activation='sigmoid')(rnn_concat)
#     rnn_concat = Dropout(0.2)(rnn_concat)
#     rnn_concat = BatchNormalization()(rnn_concat)
#     out = Dense(1, activation='sigmoid')(rnn_concat)

    ########################################################################################################################
    #     RNN whose input at each time step is an N-gram
    ########################################################################################################################
    
    lstm_naive_output = 256
    first_dense = 64
    second_dense = 64
    feature_dense = 16
    
    dense_dropout = 0.2
    rnn_dropout = 0.02
    
    time_stamps = question_max_len+N_gram-1
    
#     def SubtractTensor(layers_list):
#         return merge(layers_list, mode=lambda x: x[0] - x[1], output_shape=lambda x: x[0])
    
    # feed two questions to same RNN model
    
    if N_gram!=None and N_gram>1:
    
        semantic_dropout = 0.01
        semantic_compress_size = 128
        semantic_compressor = LSTM(semantic_compress_size, dropout=semantic_dropout, recurrent_dropout=semantic_dropout, return_sequences=True)

        embedded_input1 = Input(shape=(time_stamps, N_gram, wordvec_dim))
        emb_res1 = Reshape((time_stamps, N_gram*wordvec_dim))(embedded_input1)
#         noised_input1 = GaussianNoise(0.001)(emb_res1)
        compressed_input1 = semantic_compressor(emb_res1)
        compressed_input1 = Reshape((time_stamps, semantic_compress_size))(compressed_input1)

        embedded_input2 = Input(shape=(time_stamps, N_gram, wordvec_dim))
        emb_res2 = Reshape((time_stamps, N_gram*wordvec_dim))(embedded_input2)
#         noised_input2 = GaussianNoise(0.001)(emb_res2)
        compressed_input2 = semantic_compressor(emb_res2)
        compressed_input2 = Reshape((time_stamps, semantic_compress_size))(compressed_input2)

        RNN = Bidirectional(LSTM(lstm_naive_output, dropout=rnn_dropout, recurrent_dropout=rnn_dropout))
        rnn_out1 = RNN(compressed_input1)
        rnn_out2 = RNN(compressed_input2)
        
    else:
        
        print('Not N-gram mode')
        
        embedded_input1 = Input(shape=(time_stamps, wordvec_dim))
#         emb_res1 = Reshape((1, time_stamps*wordvec_dim))(embedded_input1)
#         noised_input1 = GaussianNoise(0.001)(emb_res1)
        
        embedded_input2 = Input(shape=(time_stamps, wordvec_dim))
#         emb_res2 = Reshape((1, time_stamps*wordvec_dim))(embedded_input2)
#         noised_input2 = GaussianNoise(0.001)(emb_res2)
        
        RNN = Bidirectional(LSTM(lstm_naive_output, dropout=rnn_dropout, recurrent_dropout=rnn_dropout))
#         RNN = LSTM(wordvec_dim, dropout=rnn_dropout, recurrent_dropout=rnn_dropout)
        rnn_out1 = RNN(embedded_input1)
        rnn_out2 = RNN(embedded_input2)
    
    
    # Directly subtract the two tensors over all combinations, i.e. get the Manhattan distance of each word pair
    # The result would be a 2D tensor
    
#     if is_use_dist_metrics:
        
#         subtract_map_input = Input(shape=(question_max_len, question_max_len, distance_policies_count))

#         conv_feature = 256
#         conv_window = (3,3)
#         pool_stride = (3,3) # how much dist between two windows in pooling
#         pool_window = (3,3)
#         pooling_first_dense = 512
#         pooling_output_size = 128

#     #     conv = Conv2D(conv_feature, conv_window, activation='relu', padding='same')(subtract_map_input)
#     #     conv = Conv2D(conv_feature, conv_window, activation='relu', padding='same')(conv)

#         # It is said that mean pooling works better than max-pooling

#         subtract_avgpooling = AveragePooling2D(pool_size=pool_window, strides=(1,1))(subtract_map_input)
#         subtract_avgpooling = Conv2D(conv_feature, conv_window, activation='relu', padding='same')(subtract_avgpooling)
#         subtract_avgpooling = Conv2D(conv_feature, conv_window, activation='relu', padding='same')(subtract_avgpooling)
#         subtract_avgpooling = Flatten()(subtract_avgpooling)

#         print(subtract_avgpooling)
        
#         subtract_avgpooling = BatchNormalization()(subtract_avgpooling)
#         subtract_avgpooling = Dropout(0.05)(subtract_avgpooling)
#         subtract_avgpooling = Dense(pooling_first_dense, activation='relu')(subtract_avgpooling)

#         subtract_avgpooling = BatchNormalization()(subtract_avgpooling)
#         subtract_avgpooling = Dropout(0.05)(subtract_avgpooling)
#         subtract_avgpooling = Dense(pooling_output_size, activation='relu')(subtract_avgpooling)
        
#         subtract_avgpooling = BatchNormalization()(subtract_avgpooling)
#         merge_all = Dropout(0.05)(subtract_avgpooling)

    
    # merge models
    
    
    FirstDense = Dense(first_dense)
    FirstDropout = Dropout(dense_dropout)
    
    merge_1 = Concatenate(axis=-1)([rnn_out1, rnn_out2])
    merge_1 = FirstDense(merge_1)
    merge_1 = BatchNormalization()(merge_1)
    merge_1 = Activation('relu')(merge_1)
    merge_1 = FirstDropout(merge_1)
    
    merge_2 = Concatenate(axis=-1)([rnn_out2, rnn_out1])
    merge_2 = FirstDense(merge_2)
    merge_2 = BatchNormalization()(merge_2)
    merge_2 = Activation('relu')(merge_2)
    merge_2 = FirstDropout(merge_2)
    
    merge_all = Concatenate(axis=-1)([merge_1, merge_2])
    
#     merge_all = Dense(second_dense)(merge_all)
#     merge_all = BatchNormalization()(merge_all)
#     merge_all = Activation('relu')(merge_all)
#     merge_all = Dropout(dense_dropout)(merge_all)
    
#     pool_size = 4
#     merge_all = Reshape((second_dense,1))(merge_all)
#     merge_all = AveragePooling1D(pool_size=pool_size, strides=None, padding='same')(merge_all)
#     merge_all = Reshape((second_dense//pool_size,))(merge_all)
    
    merge_all = Dense(feature_dense)(merge_all)
    merge_all = BatchNormalization()(merge_all)
    merge_all = Activation('relu')(merge_all)
    merge_all = Dropout(dense_dropout)(merge_all)
    
    out = Dense(2, activation='softmax')(merge_all)
    
    # compile the model
    
    model = Model(inputs=[embedded_input1, embedded_input2], outputs=out)
    # choose objective and optimizer
#     model.compile(loss='binary_crossentropy', optimizer=RMSprop(lr=1e-3))
    model.compile(loss='binary_crossentropy', optimizer=Nadam(lr=0.001), metrics=['accuracy'])
    
    return model

Using TensorFlow backend.


In [14]:
model = gen_model()

Not N-gram mode


In [15]:
'''
Generate training/validation data

Pseudo code:

2. Generate positive samples
        if same_question_ratio is set:
            Randomly replace that share of the positive samples with identical-question pairs (i.e. [qid_A, qid_A])
        the remaining samples are picked from pos_pairs
3. Generate negative samples
        if random_negative_samples_ratio is set:
            Randomly replace that share of the negative samples with random qid pairs. (The number of negative samples is decided by the "pos_ratio" parameter)
        the remaining samples are picked from neg_pairs
4. Embed the question pairs.
5. Shuffle the order of all question pairs.
'''
def gen_batch_data(pos_pairs, neg_pairs, batch_size, pos_ratio, swap=False):
    
    global embedding_matrix
    
    all_id_list = list(qid_question_dict.keys())
    
    def gen_rnd_idx(list_data):
        return random.randint(0,len(list_data)-1)
    
    def random_pick_from(list_data):
        return list_data[gen_rnd_idx(list_data)]
            
    def gen_shuffle_idxes(num):
        a = np.arange(num)
        random.shuffle(a)
        return a
    
    def gen_positive_qid_pairs(pos_pairs, N):
        
        N_pos_pairs = [pos_pairs[random.randint(0,pos_pairs.shape[0]-1)] for i in range(N)]
        
        if same_question_ratio!=0:
            same_question_count = round(same_question_ratio*N)
            replaced_by_same_question_idxes = [gen_rnd_idx(N_pos_pairs) for i in range(same_question_count)]
            for i in replaced_by_same_question_idxes:
                rnd_qid = random_pick_from(all_id_list)
                N_pos_pairs[i] = [rnd_qid,rnd_qid]
        
        return N_pos_pairs
        
    def gen_negative_qid_pairs(neg_pairs, N):
        
        '''
        TODO:
            Maybe we should still check whether a generated question pair is actually a duplicate-question sample.
            I don't think this is really necessary, but we can give it a try if we have time.
        '''
        
        N_neg_pairs = [random_pick_from(neg_pairs) for i in range(N)]
        
        if random_negative_samples_ratio!=0:
            random_sample_count = round(random_negative_samples_ratio*N)
            replaced_by_random_idxes = [gen_rnd_idx(N_neg_pairs) for i in range(random_sample_count)]
            for i in replaced_by_random_idxes:
                N_neg_pairs[i] = [random_pick_from(all_id_list),random_pick_from(all_id_list)]
        
        return N_neg_pairs
    
    # derive the positive count from pos_ratio, then generate negatives to match the given pos:neg ratio
    
    pos_count = int(round(batch_size*pos_ratio))
    neg_count = int(round( ((1-pos_ratio)/pos_ratio) * pos_count ))
    
    pos_pairs = gen_positive_qid_pairs(pos_pairs, pos_count)
    neg_pairs = gen_negative_qid_pairs(neg_pairs, neg_count)
    
    pos_1, pos_2 = qid_list_to_vec(pos_pairs, swap=swap)
    neg_1, neg_2 = qid_list_to_vec(neg_pairs, swap=swap)
    
    q1 = np.vstack([pos_1,neg_1])
    q2 = np.vstack([pos_2,neg_2])
    
    # create y
    
    y = np.zeros((q1.shape[0], 2))
    y[:len(pos_pairs),1] = 1
    y[len(pos_pairs):,0] = 1
    
    # shuffle
    
    shuffle_idxes = gen_shuffle_idxes(y.shape[0])
    q1 = q1[shuffle_idxes]
    q2 = q2[shuffle_idxes]
    y = y[shuffle_idxes]
    
    return [q1,q2], y
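
The size arithmetic and label layout of `gen_batch_data` can be checked in isolation. The sketch below is an illustration with hypothetical helper names (`batch_counts`, `shuffled_one_hot_labels`); it uses plain lists instead of NumPy arrays and leaves out the embedding step.

```python
import random


def batch_counts(batch_size, pos_ratio):
    """Number of positive and negative pairs for one batch."""
    pos_count = int(round(batch_size * pos_ratio))
    neg_count = int(round((1 - pos_ratio) / pos_ratio * pos_count))
    return pos_count, neg_count


def shuffled_one_hot_labels(pos_count, neg_count, rng=random):
    """Rows are [not_duplicate, duplicate]; also return the shuffle order
    so the same permutation can be applied to both question arrays."""
    labels = [[0, 1]] * pos_count + [[1, 0]] * neg_count
    order = list(range(len(labels)))
    rng.shuffle(order)
    return [labels[i] for i in order], order


pos_count, neg_count = batch_counts(batch_size=256, pos_ratio=0.3)
labels, order = shuffled_one_hot_labels(pos_count, neg_count)
```

Because both counts are rounded separately, the batch holds 257 pairs here rather than exactly 256, which is why the real batch size can differ slightly from the requested one.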

In [16]:
# Calculate the class weights to simulate the class distribution of the test samples
#
# These numbers and the formula come from here:
#     https://www.kaggle.com/lystdo/quora-question-pairs/lstm-with-word2vec-embeddings#175198

test_set_pos_label_ratio = 0.1746
training_pos_ratio = 0.3 # test_set_pos_label_ratio

same_question_ratio = 0 # a training hyper-parameter that adds test cases like [A,A]
random_negative_samples_ratio = 0

weights = {
    0: (1-test_set_pos_label_ratio) / (1-training_pos_ratio),
    1: test_set_pos_label_ratio/training_pos_ratio
}

In [None]:
model_name = 'HubertLin_augmented_spaCy_features_Naive_LSTM256_Feature16'
model_path = './model/tmp/'+model_name+'.model'

In [None]:
import random
from keras.callbacks import ModelCheckpoint

batch_size = 256 # the true batch size may differ slightly

def batch_generator(pos_pairs, neg_pairs, switch_position=False):
    
    while True:
        
        # when switching is enabled, swap q1/q2 for roughly half of the batches
        swap = switch_position and random.randint(0,1)==0
        
        X, y = gen_batch_data(pos_pairs, neg_pairs, batch_size, training_pos_ratio, swap=swap)
        
        yield X, y

callbacks = [
    EarlyStopping(monitor='val_loss', patience=10, mode='min', verbose=1),
#     ModelCheckpoint(model_path, monitor='val_loss', verbose=0, save_best_only=True, mode='min', period=1)
]
    
try:
    
    model.fit_generator(batch_generator(train_pos_pairs, train_neg_pairs, switch_position=True),
                        steps_per_epoch=100,
                        class_weight=weights,
                        epochs=1000,
                        validation_data=(X_val,y_val,validation_weights), 
                        callbacks=callbacks)
    
except KeyboardInterrupt:
    print('\nEarly stopped by user')

Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000

## Customized test cases

In [17]:
import sys
sys.path.append('../dataset/spacy_feature_extraction/')

import spacy_extract_features as extractor

Sanity test:
[I, am, a, good, programmer, !]


In [18]:
def process_new_string(s):
    s = extractor.process_new_string(s)
    s = clip_word_list_len(s)
    s = word_list_to_vec(s)
    return s

In [40]:
import numpy as np
import pandas as pd
import re
import pickle

dec_map = pickle.load(open('../dataset/processed/dec_map.pkl','rb'))
enc_map = pickle.load(open('../dataset/processed/enc_map.pkl','rb'))

testcases = [
    {
        'q1': 'I am a good man',
        'q2': 'A good man is me'
    },
    {
        'q1': 'The <RARE0> is totally fucking awful',
        'q2': 'The <RARE0> is the best thing I have ever seen'
    },
    {
        'q1': 'The <RARE0> is the best thing I have ever seen',
        'q2': 'The <RARE0> is totally fucking awful',
    },
    {
        'q1': 'How <RARE0> can be used in cooking',
        'q2': 'Can we find <RARE1> on the moon'
    },
    {
        'q1': 'A bird is flying in the sky',
        'q2': 'My bird is flying over the ocean'
    },
    {
        'q1': 'What is your favorite food',
        'q2': 'Which is your favorite one'
    },
    {
        'q1': '<RARE0> <RARE1> <RARE2> <RARE3> <RARE4> <RARE5> ',
        'q2': '<RARE6> <RARE7> <RARE8> <RARE9> <RARE10> <RARE11> ',
    },
    {
        'q1': '<RARE0> <RARE1> <RARE2> <RARE3> <RARE4> <RARE5> ',
        'q2': '<RARE0> <RARE1> <RARE2> <RARE3> <RARE4> <RARE5> ',
    },
    {
        'q1': ' what a genius program I have never seen such beautiful code ever',
        'q2': ' what a genius program I have never seen such beautiful code ever',
    },
    {
        'q1': ' what a genius program I have never seen such beautiful code ever',
        'q2': ' what a genius program I have never seen such beautiful code ever',
    },
]

for testcase in testcases:
    
    q1 = process_new_string(testcase['q1'])
    q2 = process_new_string(testcase['q2'])
    
    q1 = q1.reshape(1,q1.shape[0],q1.shape[1])
    q2 = q2.reshape(1,q2.shape[0],q2.shape[1])

#     if N_gram==None:
#         emb1 = embedding_matrix[np.array([enc1])]
#         emb2 = embedding_matrix[np.array([enc2])]
#     else:
#         emb1 = N_gram_embedding(np.array([enc1]), N_gram)
#         emb2 = N_gram_embedding(np.array([enc2]), N_gram)

    pred = model.predict([q1,q2])
    print(testcase['q1'])
    print(testcase['q2'])
    print('Predicting is same question proba =', pred)
    print('')


I am a good man
A good man is me
Predicting is same question proba = [[ 0.56513655  0.43486339]]

The <RARE0> is totally fucking awful
The <RARE0> is the best thing I have ever seen
Predicting is same question proba = [[ 0.97436553  0.02563444]]

The <RARE0> is the best thing I have ever seen
The <RARE0> is totally fucking awful
Predicting is same question proba = [[ 0.98159063  0.01840929]]

How <RARE0> can be used in cooking
Can we find <RARE1> on the moon
Predicting is same question proba = [[ 0.97004843  0.0299516 ]]

A bird is flying in the sky
My bird is flying over the ocean
Predicting is same question proba = [[ 0.86727273  0.13272727]]

What is your favorite food
Which is your favorite one
Predicting is same question proba = [[ 0.4904066   0.50959343]]

<RARE0> <RARE1> <RARE2> <RARE3> <RARE4> <RARE5> 
<RARE6> <RARE7> <RARE8> <RARE9> <RARE10> <RARE11> 
Predicting is same question proba = [[ 0.99625278  0.00374718]]

<RARE0> <RARE1> <RARE2> <RARE3> <RARE4> <RARE5> 
<RARE0> <RARE

## Get features of training set for ensembling

In [19]:
del train_pos_pairs
del train_neg_pairs
del qid_question_dict
del X_val
del y_val
del val_weights

In [19]:
import sys
sys.path.append('../dataset/spacy_feature_extraction/')

import spacy_extract_features as extractor

In [32]:
# df_test = pd.read_csv('../dataset/quora-question-pairs/test.csv')
# df_test.head()

# Direct prediction

In [None]:
df_output = pd.DataFrame(columns=['test_id','is_duplicate'])

q1_cache = []
q2_cache = []
batch_size = 100
prev_idx = 0

for i,series in df_test.iterrows():
    
    q1 = series['question1']
    q2 = series['question2']
    
    q1 = extractor.process_new_string(q1)
    q1 = clip_word_list_len(q1)
    
    q2 = extractor.process_new_string(q2)
    q2 = clip_word_list_len(q2)
    
    q1_cache.append(q1)
    q2_cache.append(q2)
    
    if i%batch_size==0 and i!=0:
    
        q1_cache = np.array(q1_cache)
        q2_cache = np.array(q2_cache)
    
        q1_cache = word_list_to_vec(q1_cache)
        q2_cache = word_list_to_vec(q2_cache)

        pred_a = model.predict([q1_cache,q2_cache])
        pred_b = model.predict([q2_cache,q1_cache])
        
        # for each pair keep the prediction farthest from 0.5 (k avoids reusing the outer loop index i)
        bests = [pred_a[k][1] if abs(pred_a[k][1]-0.5)>abs(pred_b[k][1]-0.5) else pred_b[k][1] for k in range(pred_a.shape[0])]
        
        sub_df = pd.DataFrame({
            'test_id': [str(k) for k in range(prev_idx,i+1)],
            'is_duplicate': bests
        })
        
        prev_idx = i+1

        df_output = df_output.append(sub_df,ignore_index=True)

        q1_cache = []
        q2_cache = []
        
        if i>300:
            break

In [None]:
df_output

# extract features from layers

In [20]:
# model_name = 'HubertLin_augmented_spaCy_features_Naive_LSTM256_Feature16_OvertrainLoss090'
# model_name = 'HubertLin_augmented_spaCy_features_3gram128_LSTM128_Feature16_Loss030_ACC_080'
model_name = 'HubertLin_augmented_spaCy_features_3gram128_LSTM256_DeeperDense_Feature16_Overtrain090loss'

N_gram = 1

SET = 'TRAIN'
# SET = 'TEST'

if SET == 'TRAIN':
    from_file = '../dataset/spacy_feature_extraction/train_processed.txt'
    write_file = './features_from_model/train/'+model_name+'.csv'
else:
    from_file = '../dataset/spacy_feature_extraction/test_processed.txt'
    write_file = './features_from_model/test/'+model_name+'.csv'

In [53]:
from keras.models import load_model

model_path = './model/'+model_name+'.model'

import h5py
import glob

model_files = sorted(glob.glob(model_path))
for model_file in model_files:
    print("Update '{}'".format(model_file))
    with h5py.File(model_file, 'a') as f:
        if 'optimizer_weights' in f.keys():
            del f['optimizer_weights']

model = load_model(model_path)

Update './model/HubertLin_augmented_spaCy_features_3gram128_LSTM256_DeeperDense_Feature16_Overtrain090loss.model'


In [54]:
feature_count = 16
feature_names = [model_name + '_' + chr(ord('a')+i) for i in range(feature_count)]

In [55]:
import keras.backend as K

layer_id = -2
model_segment = K.function(model.input+[K.learning_phase()], [model.layers[layer_id].output])

def get_layer_output(q1,q2):
    return model_segment([q1,q2,0])[0]

In [None]:
%%time

import csv

q1_cache = []
q2_cache = []
batch_size = 1000
idx = 0

# write directly to the file; keeping all features in memory takes too much space
with open(write_file, "w") as csv_file:
    
    with open(from_file, 'r') as test_file:
        
        writer = csv.writer(csv_file, delimiter=',')
        writer.writerow(feature_names)
        
        size = 0
        last_run = False

        while True:
            
            q1 = test_file.readline()
            if q1=='': 
                last_run = True
            q1 = q1[:-1].split(' ')
            q2 = test_file.readline()[:-1].split(' ')
            q1 = clip_word_list_len(q1)
            q2 = clip_word_list_len(q2)

            if not last_run:
                q1_cache.append(q1)
                q2_cache.append(q2)
            
            size += 1

            if size == batch_size or last_run:

                q1_cache = np.array(q1_cache)
                q2_cache = np.array(q2_cache)

                q1_cache = word_list_to_vec(q1_cache)
                q2_cache = word_list_to_vec(q2_cache)

                retrieved_features = get_layer_output(q1_cache,q2_cache)

                for feature in retrieved_features:
                    writer.writerow(feature)

                idx += batch_size
                size = 0

                q1_cache = []
                q2_cache = []

                if idx%100000==0:
                    print(idx)
                    
            if last_run:
                break
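
The batching logic in the cell above can be isolated from the model. The following is a minimal sketch, assuming the same file layout (alternating question 1 / question 2 lines, tokens separated by spaces). The name `iter_pair_batches` is hypothetical and does not appear elsewhere in this notebook.

```python
import io

def iter_pair_batches(lines, batch_size):
    """Yield (q1_batch, q2_batch) token lists from alternating q1/q2 lines."""
    q1_batch, q2_batch = [], []
    it = iter(lines)
    for q1_line in it:
        q2_line = next(it, None)
        if q2_line is None:
            break  # a trailing q1 without a partner is ignored
        q1_batch.append(q1_line.rstrip('\n').split(' '))
        q2_batch.append(q2_line.rstrip('\n').split(' '))
        if len(q1_batch) == batch_size:
            yield q1_batch, q2_batch
            q1_batch, q2_batch = [], []
    if q1_batch:
        # Final partial batch, emitted only when it is non-empty.
        yield q1_batch, q2_batch

# Three pairs with batch_size=2 give one full batch and one partial batch.
demo = io.StringIO("a b\nb a\nc d\nd c\ne f\nf e\n")
batches = list(iter_pair_batches(demo, batch_size=2))
```

Because the partial batch is only yielded when it holds data, a file whose pair count is an exact multiple of `batch_size` never sends an empty batch to the model.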
            
        

## Prediction time

In [54]:
# from keras.models import load_model

# try:
#     model==None
# except:
#     model = load_model('../model/'+model_name+'.model')

# df_test = pickle.load(open('../dataset/processed/df_test_hubertLin_version.pkl', 'rb'))

In [None]:
idx = 0
# DataFrame.append returns a new frame instead of modifying in place,
# so collect the rows in a list and build the frame once at the end.
rows = []

with open('../dataset/spacy_feature_extraction/test_processed.txt') as f:
    
    while True:

        line_1 = f.readline()
        if line_1 == '':  # readline() returns '' only at end of file
            break
        line_2 = f.readline()
    
        if idx%100000==0:
            print('Process {}/2345796'.format(idx))

        q1 = np.array(line_1.rstrip('\n').split(' '))
        q2 = np.array(line_2.rstrip('\n').split(' '))

        q1 = clip_word_list_len(q1)
        q1 = word_list_to_vec(q1)
        q1 = q1.reshape(1,q1.shape[0],q1.shape[1])

        q2 = clip_word_list_len(q2)
        q2 = word_list_to_vec(q2)
        q2 = q2.reshape(1,q2.shape[0],q2.shape[1])

        # Predict both orderings and keep the output farther from 0.5.
        out_1 = model.predict([q1,q2])[0][1]
        out_2 = model.predict([q2,q1])[0][1]

        best = out_1 if abs(out_1-0.5)>abs(out_2-0.5) else out_2

        rows.append({'test_id':str(idx),'is_duplicate':best})

        idx+=1

df_out = pd.DataFrame(rows, columns=['test_id','is_duplicate'])
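
The selection rule in the loop above (predict both orderings, keep the output farther from 0.5) can be written as a small standalone function. This is a sketch only; the name `most_confident` is hypothetical.

```python
def most_confident(p_forward, p_swapped):
    """Return whichever probability lies farther from 0.5.

    Ties go to p_swapped, matching the `out_1 if ... else out_2` expression.
    """
    if abs(p_forward - 0.5) > abs(p_swapped - 0.5):
        return p_forward
    return p_swapped

picked_high = most_confident(0.9, 0.6)   # forward ordering is more confident
picked_low = most_confident(0.45, 0.1)   # swapped ordering is more confident
```

Note that this rule favours confidence in either direction, so it can pick a strongly negative prediction over a weakly positive one.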


In [None]:
%%time

partition_size = 10000

def predict(i,q):
    
    if i%10 == 0:
        print(i*partition_size, '/', len(df_test))
    
    x1 = np.array(list(q['question1']))
    x2 = np.array(list(q['question2']))
    
    # If df_test holds raw strings rather than pre-encoded questions, encode them first:
    # x1 = np.array([process_new_string(x) for x in x1])
    # x2 = np.array([process_new_string(x) for x in x2])
    
    return model.predict([x1,x2])

partition_len = len(df_test)//partition_size +1
result = [predict(i,df_test.iloc[i*partition_size:(i+1)*partition_size]) for i in range(partition_len)]
con = np.concatenate(result)
df_result = pd.DataFrame({'test_id':np.arange(len(con)),'is_duplicate':con.reshape(len(con))}, columns=['test_id','is_duplicate'])
df_result.to_csv('../result/prediction.csv', index=False)
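
One detail of the partitioning above: `len(df_test)//partition_size + 1` produces an extra, empty partition whenever the row count is an exact multiple of `partition_size`. A minimal sketch of an alternative that steps through the rows directly is shown below; `partition_bounds` is a hypothetical helper, not part of this notebook.

```python
def partition_bounds(n_rows, partition_size):
    """Yield (start, stop) index pairs covering range(n_rows) with no empty tail."""
    for start in range(0, n_rows, partition_size):
        yield start, min(start + partition_size, n_rows)

uneven = list(partition_bounds(25, 10))  # last partition is shorter
exact = list(partition_bounds(20, 10))   # no trailing empty partition
```

Each pair could then be used as `df_test.iloc[start:stop]` in place of the `i*partition_size:(i+1)*partition_size` slice.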

In [51]:
if len(df_result)!=2345796:
    print('Prediction count does not match the test set size 2345796; yours:', len(df_result))
else:
    print('Prediction success')

Prediction success


## Visualize the prediction result

In [54]:
over = df_test[df_result['is_duplicate']>0.8]

c = 0
for i,s in over.iterrows():
    print(' '.join(dec_question(s['question1'],dec_map)))
    print(' '.join(dec_question(s['question2'],dec_map)))
    print(df_result.loc[i]['is_duplicate'])
    print('')
    c+=1
    if c>5:
        break

Can a vacuum cleaner concentrate suck your eye out if it is pressed against your <RARE0> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED>
Could a vacuum cleaner suck get your eye out if directly pressed on the <RARE0> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED>
0.813005

Is web development just building <RARE0> best you get a web developer job if you know how to make a <RARE1> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED>
Is web development just building <RARE0> Can you get a backend web developer job if you know how year make a <RARE1> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED>
0.966372

How do I overcome my shyness with <RARE0> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED>
How do you overcome being <RARE1> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED> <ED>
0.865856

If I like a comment to a computer post b