## Data augmentation

Since many question pairs has same qid involve, 

In [1]:
import numpy as np
import pandas as pd
import re
import pickle
import json

In [2]:
df_train = pd.read_csv('../dataset/raw/train.csv', delimiter=',')
df_test = pd.read_csv('../dataset/raw/test.csv', delimiter=',')
df_train.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


## Get all duplicated questions

We'll construct two data structures:
1. A dictionary records { qid : question_text } pair. 

         The question_text is only splited as a list of words, not yet transformed into encoded form. This gives us chances to do training more flexible.
         
2. A list of duplicated qid pairs 

        Recording only qid saves us data loading time (since we'll try to augment and enumerate a huge amount of duplicated question pairs).

Note:

I didn't record any single non-duplicated question relation. 
I decided to random assign any question pairs to be non-duplicated question pairs. This might causes some issue that similar to too-much-upsampling case, but this extremely increases the variety of non-duplicated question samples.

In [6]:
df_train = pd.read_csv('../dataset/raw/train.csv', delimiter=',')

In [7]:
def get_max_qid(df):
    max_qid = 0
    for idx,frame in df.iterrows():
        qid1 = int(frame['qid1'])
        qid2 = int(frame['qid2'])
        if qid1>max_qid:
            max_qid = qid1
        elif qid2>max_qid:
            max_qid = qid2
    print('Max qid = ', max_qid)
    return max_qid

max_qid = get_max_qid(df_train)

Max qid =  537932


Forming groups of duplicated question pairs.

EX: 
```
if A==B and B==C:
    group A,B,C as a group, then we can enumerate all combinations, including A==C as a new sample
```

In [8]:
def group_questions(df):
    group_id = 0
    group_list = np.repeat(-1, max_qid)
    
    for idx,frame in df.iterrows():
        qid1 = int(frame['qid1'])
        qid2 = int(frame['qid2'])
        
        if int(frame['is_duplicate'])==1:
            # if both has no group, add new group
            if group_list[qid1]==-1 and group_list[qid2]==-1:
                group_list[qid1] = group_id
                group_list[qid2] = group_id
                group_id += 1

            # if both has group, join the group 
            elif group_list[qid1]!=-1 and group_list[qid2]!=-1 :
                idxes_to_be_joined = np.where(group_list==group_list[qid2])[0]
                group_list[idxes_to_be_joined] = group_list[qid1]

            # only q1 has group , than add q2 to q1's group
            elif  group_list[qid1]!=-1:
                group_list[qid2] = group_list[qid1]

            # only q2 has group , than add q1 to q2's group
            elif  group_list[qid2]!=-1:
                group_list[qid1] = group_list[qid2]
                
    return group_list
    
group_ids = group_questions(df_train)

In [9]:
sum(group_ids!=-1) # means these questions has group

149650

In [10]:
# Get all the group and store it as a dictionary
group_dict = {}
for i in range(np.max(group_ids)+1):
    group_members = np.where(group_ids==i)[0]
    if len(group_members)>0:
        group_dict[i] = group_members

In [11]:
import itertools

def enumerate_all_positive_cases(group_dict):
    
    def enumerate_inside_group(group):
        return list(itertools.combinations(group, 2))
    
    return np.vstack(enumerate_inside_group(group_dict[group_id]) for group_id in group_dict)

def duplicate_all(df):
    
    def get_qid_set():
        ids = set()
        for i,series in df.iterrows():
            if series['qid1'] not in ids:
                ids.add(series['qid1'])
            if series['qid2'] not in ids:
                ids.add(series['qid2'])
        return ids
    
    id_set = get_qid_set()
    return [[i,i] for i in id_set]

In [13]:
# Enumerate all cases of duplicated question pairs from each group
enumerate_pairs = enumerate_all_positive_cases(group_dict)

# The question pairs with itself is also a sample of duplicated question pair
# duplicate_pairs = duplicate_all(df_train) 

all_pos_pairs = enumerate_pairs
# all_pos_pairs = np.vstack([enumerate_pairs,duplicate_pairs])

In [15]:
print('The duplicate question pair count grows from {} to {}'.format(len(df_train[df_train['is_duplicate']==1]),len(all_pos_pairs))) # The total duplicated samples count 

The duplicate question pair count grows from 149263 to 228548


#### Remove validation set from training set

We should gaurantee that the questions in validation set never appears in the training set. Since we enumerate all possible combinations of positive question pairs, using questions in training set as validation set is very weird and risky.

我們要保證 validation set 裡面的 data 與 training set 裡面的 data 互不交及，否則 training set 裡面包含者 validation set 的資訊是危險而且詭異的做法。


In [23]:
def get_pos_rate_in_training_set():
    dup = np.array(df_train['is_duplicate'])
    pos_ratio = np.sum(dup) / dup.shape[0]
    return pos_ratio

In [115]:
# since we are very probable to use a single question for several times, we should remove validation samples directly at this point
# Fro example, Q_a == Q_b and Q_b == Q_c ,
# In my method, we'll generate a new data Q_a == Q_c
# If we move this Q_a == Q_c sample to validation set,
#     it is very weird that our training set already has this kind of information (can be recognized from the Q_a == Q_b == Q_c relation).

import random

validation_size = 20000 # an approximation, final result can be slightly more than this number

def split_val(pos_pairs, val_pos_ratio):
    
    # an estimation of how many data should be split from training set
    split_pos_size = int(validation_size * val_pos_ratio)
    
    qids = pos_pairs.flatten()
    
    # totally remove those selected qids from training set
    val_bools = np.repeat(False,pos_pairs.shape[0])
    while(np.sum(val_bools)<split_pos_size):
        rnd_qid = random.randint(0,len(qids)-1)
        val_single_bool = np.bitwise_or(pos_pairs[:,0]==rnd_qid,pos_pairs[:,1]==rnd_qid)
        val_bools = np.bitwise_or(val_bools, val_single_bool)
    val_idxes = np.where(val_bools)[0]
    
    val = pos_pairs[val_idxes]
    train = np.delete(pos_pairs, val_idxes, axis=0)
    
    return train, val
    

In [116]:
val_pos_ratio = get_pos_rate_in_training_set()
train_pos_pairs, val_pos_pairs = split_val(all_pos_pairs, val_pos_ratio)

In [118]:
print('Training positive pairs:', train_pos_pairs.shape[0])
print('Validation positive pairs:', val_pos_pairs.shape[0])

Training positive pairs: 221150
Validation positive pairs: 7398


In [119]:
pickle.dump(train_pos_pairs, open('../dataset/processed/train_positive_qid_pairs.pkl', 'wb'))
pickle.dump(val_pos_pairs, open('../dataset/processed/validation_positive_qid_pairs.pkl', 'wb'))

## Record non-duplicate question pairs

Not considering qid this time.

我是覺得 testing data 裡面也有包含 training data 的 qid ，而且 negative pair 不像 positive pair 那樣有連鎖性關係，所以在這邊就不處理了。

In [120]:
non_duplicate = df_train[df_train['is_duplicate']==0]
non_dup_question_pairs = np.array([[series['qid1'],series['qid2']] for i,series in non_duplicate.iterrows()])

val_pos_count = val_pos_pairs.shape[0]
val_total_count = val_pos_count / val_pos_ratio
val_neg_count = int(val_total_count - val_pos_count)

val_neg_idxes = [random.randint(0,len(non_dup_question_pairs)-1) for i in range(val_neg_count)]

val_neg_pairs = non_dup_question_pairs[val_neg_idxes]
train_neg_pairs = np.delete(non_dup_question_pairs, val_neg_idxes, axis=0)

In [126]:
print('Gauranteed non-duplicated question pair length is ', len(non_dup_question_pairs), '\n')

print('Training negative pairs:', train_neg_pairs.shape[0])
print('Validation negative pairs:', val_neg_pairs.shape[0])

Gauranteed non-duplicated question pair length is  255027 

Training negative pairs: 242730
Validation negative pairs: 12640


In [122]:
pickle.dump(train_pos_pairs, open('../dataset/processed/train_negative_qid_pairs.pkl', 'wb'))
pickle.dump(val_pos_pairs, open('../dataset/processed/validation_negative_qid_pairs.pkl', 'wb'))

In [125]:
print('Validation final size:', val_neg_pairs.shape[0] + val_pos_pairs.shape[0] )

Validation final size: 20038


## Parse original training DataFrame to words list and store it

Note: 

Not encoded yet, we need to map rare words to same `<RARE_X>` special token in each question pair. This should be done in training phase.

In [128]:
df_train = pd.read_csv('../dataset/raw/train.csv', delimiter=',')
enc_map = pickle.load(open('../dataset/processed/enc_map.pkl','rb'))

In [36]:
import re

def parse_wrod_list(question):
    
    if type(question)!=str:
        return []
    
    # identify special characters that separate words : (space) ' ! " ? @ ^ + * / . , ~ ( ) [ ] { } & | ` $ % = : ; < >  
    special_chars = '[\s\'!"\?@\^+*/\.,~\(\)\[\]\{\}\&\|`\$\%\=:;\<\>\-]'
    pre_separator = '(?='+special_chars+')'
    post_separator = '(?='+special_chars+'|$)'
    single_word = '[^\s\-]+' # non-empty is enough here

    return re.findall(special_chars+single_word+post_separator, question)
    

In [129]:
def gen_qid_question_dict(df, parse=False):
    res = {}
    for i,frame in df.iterrows():
        
        qid1 = int(frame['qid1'])
        if qid1 not in res:
            if parse:
                res[qid1] = parse_wrod_list(frame['question1'])
            else:
                res[qid1] = frame['question1']
        
        qid2 = int(frame['qid2'])
        if qid2 not in res:
            if parse:
                res[qid2] = parse_wrod_list(frame['question2'])
            else:
                res[qid2] = frame['question2']
            
    return res

In [39]:
# import random

# i = 20
# rnd = [random.randint(0,len(df_train)-1) for ii in range(i)]
# for r in rnd:
#     print(df_train.ix[r]['question1'])
#     print(parse_wrod_list(df_train.ix[r]['question1']))

In [13]:
# def gen_all_pos_df(qid_dict, all_pos_pairs):
#     all_series = []
#     column_names = ['qid1','qid2','question1','question2', 'is_duplicate']
#     for i,pair in enumerate(all_pos_pairs):
#         series = pd.Series([ pair[0], pair[1], qid_dict[pair[0]], qid_dict[pair[1]], 1 ], name=i)
#         all_series.append(series)
#     ret = pd.DataFrame(all_series)
#     ret.columns = column_names
#     return ret

# all_pos_df = gen_all_pos_df(qid_dict, all_pos_pairs)
# all_pos_df.head(10)
# pickle.dump(all_pos_df, open('../dataset/processed/enumerate_all_positive_training_data.pkl','wb'))

In [130]:
training_question_dict = gen_qid_question_dict(df_train)

In [131]:
pickle.dump(training_question_dict, open('../dataset/processed/qid_question_dict.pkl', 'wb'))