# Siamese neural network

Import the required packages and modules.

In [5]:
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from keras.callbacks import TensorBoard
from string import punctuation
import pandas as pd
import numpy as np
import time
import datetime
import pickle
import nbimporter
from utils.preprocessing_utils import clear_offers, prepare_representation
from utils.model_utils import prepare_embedding_matrix, one_or_zero, build_model_blstm, exponent_neg_manhattan_distance, calculate_preds_binary, model_statistics

Define a timestamp that lets us version the data and models.

In [6]:
start_time = time.time()
now = time.strftime("%Y%m%d-%H%M%S")

### Loading data

Description from Kaggle:

Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.

The goal of this competition is to predict which of the provided pairs of questions contain two questions with the same meaning. 

Data fields:
id - the id of a training set question pair
qid1, qid2 - unique ids of each question (only available in train.csv)
question1, question2 - the full text of each question
is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.

In [7]:
data = pd.read_csv("src/Siamese_workshops_quora.csv", index_col="id", nrows=10000)
data = data[data['question1'].apply(lambda x: isinstance(x,str))]
data = data[data['question2'].apply(lambda x: isinstance(x,str))]

In [8]:
data.head()

id,qid1,qid2,question1,question2,is_duplicate
0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [9]:
data.tail()

id,qid1,qid2,question1,question2,is_duplicate
9995,19404,19405,How would you order these four cities (Bangalo...,What is the cost of living in Europe and the U...,0
9996,19406,19407,Stphen william hawking?,"What are the differences between SM, YG and JY...",0
9997,19408,19409,Mathematical Puzzles: What is () + () + () = 3...,What are the steps to solve this equation: [ma...,0
9998,19410,19411,Is IMS noida good for BCA?,How good is IMS Noida for studying BCA?,1
9999,19412,19413,What are the most respected and informative te...,What are Caltech's required and recommended te...,0


In [10]:
data[data.is_duplicate==1].shape

(3711, 5)

In [11]:
data[data.is_duplicate==0].shape

(6289, 5)

### Cleaning data

Strip punctuation from the text and lemmatize it.
Stopwords are kept, because in this task they can significantly change the meaning of a question.

TODO: describe the function.

In [12]:
for question in ['question1', 'question2']:
    strip_chars = punctuation + '„”–'
    data[question + '_cleared'] = clear_offers(data[question], strip_chars, is_remove_stopwords=False, is_lemmatize=True)

2019-04-10 15:50:34.461892 Data cleaning - SUCCESS
2019-04-10 15:50:38.236430 Lemmatization - SUCCESS
2019-04-10 15:50:38.652433 Data cleaning - SUCCESS
2019-04-10 15:50:39.403429 Lemmatization - SUCCESS
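
The source of `clear_offers` is not shown in this notebook. The cleaned output (for example `caltech's` keeping its apostrophe while `()` and `+` disappear) suggests that punctuation is stripped from the edges of each token rather than removed everywhere. A minimal sketch under that assumption follows; `clean_question` is a hypothetical stand-in and it leaves out lemmatization entirely.

```python
from string import punctuation

# Same extra characters as in the cleaning cell above.
STRIP_CHARS = punctuation + '„”–'


def clean_question(text, strip_chars=STRIP_CHARS):
    """Lowercase, split on whitespace, and strip punctuation from token edges.

    Hypothetical stand-in for clear_offers; lemmatization is not performed.
    """
    tokens = (token.strip(strip_chars) for token in text.lower().split())
    # Tokens made only of punctuation (e.g. "()" or "+") become empty and are dropped.
    return [token for token in tokens if token]


print(clean_question("Is IMS noida good for BCA?"))
print(clean_question("What are Caltech's required textbooks?"))
```

Stopwords such as `is` and `for` survive on purpose, in line with the note above about their effect on meaning.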


In [13]:
data.tail()

id,qid1,qid2,question1,question2,is_duplicate,question1_cleared,question2_cleared
9995,19404,19405,How would you order these four cities (Bangalo...,What is the cost of living in Europe and the U...,0,"[how, would, you, order, these, four, city, ba...","[what, is, the, cost, of, living, in, europe, ..."
9996,19406,19407,Stphen william hawking?,"What are the differences between SM, YG and JY...",0,"[stphen, william, hawking]","[what, are, the, difference, between, sm, yg, ..."
9997,19408,19409,Mathematical Puzzles: What is () + () + () = 3...,What are the steps to solve this equation: [ma...,0,"[mathematical, puzzle, what, is, 30, using, 1,...","[what, are, the, step, to, solve, this, equati..."
9998,19410,19411,Is IMS noida good for BCA?,How good is IMS Noida for studying BCA?,1,"[is, ims, noida, good, for, bca]","[how, good, is, ims, noida, for, studying, bca]"
9999,19412,19413,What are the most respected and informative te...,What are Caltech's required and recommended te...,0,"[what, are, the, most, respected, and, informa...","[what, are, caltech's, required, and, recommen..."


### Creating and saving tokenizer

Create and save the tokenizer so that any new data can be prepared for this model.
TODO: describe the functions.

In [14]:
tokenizer, stacked_representation = prepare_representation(
    pd.concat([data['question1_cleared'], data['question2_cleared']], axis=0), 'unk')
data['question1_tokens'], data['question2_tokens'] = np.array_split(stacked_representation, 2)
with open(f"results/{now}_tokenizer_warsztaty.pickle", 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
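
`prepare_representation` is defined in `utils` and not shown here. Its second argument, `'unk'`, looks like an out-of-vocabulary token. The sketch below illustrates the general idea with hypothetical helpers (`build_word_index`, `to_sequences`): id 0 is reserved for padding, id 1 for unknown words, and the remaining ids are assigned by descending frequency. It is not the notebook's actual tokenizer.

```python
from collections import Counter


def build_word_index(token_lists, oov_token='unk'):
    """Map words to integer ids; id 0 is left for padding, id 1 for the OOV token."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    word_index = {oov_token: 1}
    # Most frequent words get the smallest ids; ties keep first-seen order.
    for word, _ in counts.most_common():
        if word not in word_index:
            word_index[word] = len(word_index) + 1
    return word_index


def to_sequences(token_lists, word_index, oov_token='unk'):
    """Replace each token by its id, falling back to the OOV id for unseen words."""
    oov_id = word_index[oov_token]
    return [[word_index.get(token, oov_id) for token in tokens] for tokens in token_lists]


corpus = [['is', 'ims', 'noida', 'good', 'for', 'bca'],
          ['how', 'good', 'is', 'ims', 'noida', 'for', 'studying', 'bca']]
word_index = build_word_index(corpus)
print(to_sequences([['is', 'bca', 'expensive']], word_index))
```

The word `expensive` never appears in the toy corpus, so it is mapped to the `unk` id. This is why the fitted tokenizer has to be saved: new text must be encoded with exactly the same word-to-id mapping the model was trained on.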

In [15]:
data.tail()

id,qid1,qid2,question1,question2,is_duplicate,question1_cleared,question2_cleared,question1_tokens,question2_tokens
9995,19404,19405,How would you order these four cities (Bangalo...,What is the cost of living in Europe and the U...,0,"[how, would, you, order, these, four, city, ba...","[what, is, the, cost, of, living, in, europe, ...","[7, 46, 16, 515, 352, 1556, 321, 488, 1018, 65...","[3, 4, 2, 277, 11, 426, 9, 1104, 13, 2, 85, 59..."
9996,19406,19407,Stphen william hawking?,"What are the differences between SM, YG and JY...",0,"[stphen, william, hawking]","[what, are, the, difference, between, sm, yg, ...","[11146, 11147, 11148]","[3, 12, 2, 66, 50, 2017, 13747, 13, 13748, 611..."
9997,19408,19409,Mathematical Puzzles: What is () + () + () = 3...,What are the steps to solve this equation: [ma...,0,"[mathematical, puzzle, what, is, 30, using, 1,...","[what, are, the, step, to, solve, this, equati...","[2985, 2738, 3, 4, 753, 177, 109, 144, 188, 31...","[3, 12, 2, 649, 8, 791, 69, 1308, 192, 1153, 2..."
9998,19410,19411,Is IMS noida good for BCA?,How good is IMS Noida for studying BCA?,1,"[is, ims, noida, good, for, bca]","[how, good, is, ims, noida, for, studying, bca]","[4, 7873, 5462, 40, 15, 7874]","[7, 40, 4, 7873, 5462, 15, 845, 7874]"
9999,19412,19413,What are the most respected and informative te...,What are Caltech's required and recommended te...,0,"[what, are, the, most, respected, and, informa...","[what, are, caltech's, required, and, recommen...","[3, 12, 2, 56, 3418, 13, 11149, 2184, 15, 845,...","[3, 12, 13751, 579, 13, 1858, 2184, 15, 2158, ..."


### Setting parameters and splitting data

TODO: describe the model parameters.

In [16]:
model_parameters = {
        'emb_len': 32,
        'lstm_units': 10,
        'max_seq_len': 150,
        'offer_rep_dim': 64,
        'batch_size': 64,
        'maxlen': 40,
        'distance': 'manhattan',
        'optimizer': 'adam',
        'loss': 'bin'}

In [17]:
data = data.set_index(['qid1', 'qid2'])

y = data['is_duplicate'].astype(np.int64).apply(one_or_zero, args=(1,))


Y_train, Y_validation, X_train, X_validation = train_test_split(y, data.drop(["is_duplicate"], axis=1), test_size=0.2)

X_train = X_train.drop(["question1","question2", "question1_cleared", "question2_cleared"], axis=1)

In [18]:
X_train.head()

qid1,qid2,question1_tokens,question2_tokens
19273,19274,"[7, 14, 6, 24, 5, 40, 225, 260]","[3, 4, 5, 40, 42, 8, 98, 5, 40, 225, 667]"
12214,3938,"[7, 14, 6, 180, 19, 2274, 11, 122, 241]","[3, 31, 6, 10, 8, 180, 19, 122]"
3075,3076,"[64, 10, 1059, 1311, 1829, 9, 2, 752, 17, 10, ...","[3, 211, 1059, 1311, 9, 2, 752]"
5120,5121,"[17, 10, 41, 70, 30, 65, 12, 110, 99, 1142, 13...","[17, 4, 18, 30, 41, 70, 30, 65, 12, 110, 99, 1..."
10306,10307,"[26, 4, 2, 56, 3655, 97, 15, 16]","[26, 4, 2, 56, 5454, 97, 96]"


In [19]:
X_train_dataset = [pad_sequences(X_train['question1_tokens'], maxlen=model_parameters['maxlen']),
                   pad_sequences(X_train['question2_tokens'], maxlen=model_parameters['maxlen'])]
X_val_dataset = [pad_sequences(X_validation['question1_tokens'], maxlen=model_parameters['maxlen']),
                 pad_sequences(X_validation['question2_tokens'], maxlen=model_parameters['maxlen'])]

In [20]:
X_train_dataset

[array([[   0,    0,    0, ...,   40,  225,  260],
        [   0,    0,    0, ...,   11,  122,  241],
        [   0,    0,    0, ...,   10,   65,  207],
        ...,
        [   0,    0,    0, ..., 1958,   58, 6516],
        [   0,    0,    0, ...,  176,   29,   69],
        [   0,    0,    0, ..., 5077, 5078,  789]]),
 array([[   0,    0,    0, ...,   40,  225,  667],
        [   0,    0,    0, ...,  180,   19,  122],
        [   0,    0,    0, ...,    9,    2,  752],
        ...,
        [   0,    0,    0, ...,    3,    4, 6516],
        [   0,    0,    0, ...,    6,   10,  176],
        [   0,    0,    0, ...,   13, 5078,  789]])]
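
The leading zeros in the arrays above come from the default behaviour of `pad_sequences`, which pads and truncates at the front of each sequence. A pure-Python sketch of that behaviour, using a hypothetical helper `pad_pre` and assuming `maxlen >= 1`:

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value`; keep only the last `maxlen` tokens of longer sequences."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # truncate from the front (assumes maxlen >= 1)
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


print(pad_pre([[4, 7873, 5462, 40, 15, 7874], [1, 2, 3]], maxlen=4))
```

With `maxlen=40`, as in `model_parameters`, most questions are shorter than the limit, so they are mostly padded rather than truncated.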

There are two possible embeddings for this model: fb_emb or training your own embeddings.
TODO: describe the functions and parameters.

In [21]:
model_parameters.update({'nb_tokens': len(tokenizer.index_word) + 1})

embedding_matrix, embeddings_index, is_trainable = prepare_embedding_matrix(model_parameters, tokenizer.word_index,
                                                                            'src/wiki.en.vec.csv', 'own', nrows=None)
model_parameters.update({'is_trainable': is_trainable})

### Building and training model

TODO: describe the function.

In [22]:
model = build_model_blstm(model_parameters, embedding_matrix)
model.summary()  # summary() prints the table itself and returns None

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 40)           0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 40)           0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 40, 32)       440064      input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, 40, 20)       3440        embedding_1[0][0]                
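
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding layer and a bidirectional LSTM. The vocabulary size of 13752 is an inference from the summary (440064 / 32), not a value printed anywhere in the notebook, and the helper names are hypothetical.

```python
def embedding_params(nb_tokens, emb_len):
    """One emb_len-dimensional vector per token id."""
    return nb_tokens * emb_len


def bilstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias, in two directions."""
    per_direction = 4 * (units * (input_dim + units) + units)
    return 2 * per_direction


print(embedding_params(13752, 32))  # compare with embedding_1 in the summary
print(bilstm_params(32, 10))        # compare with bidirectional_1 in the summary
```

Both inputs share `embedding_1`, which is what makes the network Siamese: the two questions pass through the same weights.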
          

In [23]:
ones_share = Y_train.sum() / Y_train.shape[0]

ones_weight = (1 - ones_share) / ones_share
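
The weight computed above is the ratio of negative to positive examples, so the minority class (duplicates) contributes roughly as much to the loss as the majority class. A small sketch with a hypothetical helper, using the counts from the full 10,000-row sample shown earlier; the exact share in `Y_train` depends on the random split.

```python
def positive_class_weight(n_positive, n_total):
    """Weight for class 1 that balances its total loss contribution against class 0."""
    ones_share = n_positive / n_total
    return (1 - ones_share) / ones_share


# 3711 duplicates out of 10000 rows (full sample, not the training split).
weight = positive_class_weight(3711, 10000)
print(round(weight, 3))
```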

In [24]:
model_trained = model.fit(X_train_dataset, Y_train.values,
                          validation_split=0.1, batch_size=model_parameters['batch_size'], epochs=1,
                          class_weight={1: ones_weight, 0: 1})

model.save("results/" + now + "_model.h5")

Train on 7200 samples, validate on 800 samples
Epoch 1/1






### Results

Loading the model

In [26]:
from keras.models import load_model
from keras import backend as K

# Load the trained model and the saved tokenizer
model = load_model('results/20190410-140449model.h5', custom_objects={'exponent_neg_manhattan_distance': exponent_neg_manhattan_distance})
with open('results/20190410-140449_tokenizer_warsztaty.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
    
X_validation = X_validation.drop(["question1_tokens","question2_tokens"], axis=1)
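
The custom object passed to `load_model` is named `exponent_neg_manhattan_distance`. Its source lives in `utils.model_utils` and is not shown here. In Siamese LSTM models of this kind the similarity of two encoded questions is commonly taken as the exponent of the negative L1 distance between their vectors. A plain-Python sketch under that assumption (the real helper would operate on Keras tensors):

```python
import math


def exp_neg_manhattan(left, right):
    """Similarity in (0, 1]: exp of the negative L1 distance between two vectors."""
    distance = sum(abs(a - b) for a, b in zip(left, right))
    return math.exp(-distance)


same = exp_neg_manhattan([0.2, -0.1, 0.5], [0.2, -0.1, 0.5])  # identical vectors
far = exp_neg_manhattan([0.2, -0.1, 0.5], [0.9, 0.4, -0.3])   # distant vectors
print(same, far)
```

Identical representations give a similarity of 1, and the value decays towards 0 as the representations move apart, so the output can be read directly as a duplicate score.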

Task 1: Tokenize validation data and prepare it for making a prediction

In [27]:
X_validation

qid1,qid2,question1,question2,question1_cleared,question2_cleared
13684,13685,What is the essential reading list for learnin...,What books do you recommend to read about Sema...,"[what, is, the, essential, reading, list, for,...","[what, book, do, you, recommend, to, read, abo..."
13977,13978,Can you treat pneumonia with Albuterol?,How can you treat pneumonia and albuterol?,"[can, you, treat, pneumonia, with, albuterol]","[how, can, you, treat, pneumonia, and, albuterol]"
16126,8887,"If you voted for Donald Trump, why did you vot...",Why did you specifically vote for Donald Trump?,"[if, you, voted, for, donald, trump, why, did,...","[why, did, you, specifically, vote, for, donal..."
16179,16180,Can you see who viewed your Instagram?,Can someone see if you have viewed public Inst...,"[can, you, see, who, viewed, your, instagram]","[can, someone, see, if, you, have, viewed, pub..."
2209,5609,Quora: How do you post a question on Quora?,How do I post something in Quora?,"[quora, how, do, you, post, a, question, on, q...","[how, do, i, post, something, in, quora]"
3443,3444,"In the UK, what are the implications of the Im...","In the UK, what are the implications of the Im...","[in, the, uk, what, are, the, implication, of,...","[in, the, uk, what, are, the, implication, of,..."
13878,13879,Is a heart rate of 110 beats per minute health...,How does it feel to have increased rate of hea...,"[is, a, heart, rate, of, 110, beat, per, minut...","[how, doe, it, feel, to, have, increased, rate..."
12463,12464,What are the best small classes for freshmen a...,How do I get admitted to Amherst College?,"[what, are, the, best, small, class, for, fres...","[how, do, i, get, admitted, to, amherst, college]"
12040,12041,What are the differences between BitBucket and...,What are the pros and cons of GitHub versus Bi...,"[what, are, the, difference, between, bitbucke...","[what, are, the, pro, and, con, of, github, ve..."
4377,4378,Is 21 too late to learn guitar?,Is it too late to learn how to sing at age 21?,"[is, 21, too, late, to, learn, guitar]","[is, it, too, late, to, learn, how, to, sing, ..."


In [28]:
content = pd.concat([X_validation['question1_cleared'], X_validation['question2_cleared']], axis=0)

def prepare_representation_tokenizer(content_frame, tokenizer):
    texts = content_frame.str.join(' ')
    return pd.DataFrame(pd.Series(tokenizer.texts_to_sequences(texts), index=content_frame.index))

stacked_representation = prepare_representation_tokenizer(content, tokenizer)

X_validation['question1_tokens_new'], X_validation['question2_tokens_new'] = np.array_split(stacked_representation, 2)

X_val_dataset = [pad_sequences(X_validation['question1_tokens_new'], maxlen=model_parameters['maxlen']),
                 pad_sequences(X_validation['question2_tokens_new'], maxlen=model_parameters['maxlen'])]

In [29]:
X_val_dataset

[array([[    0,     0,     0, ...,   268, 10284, 10285],
        [    0,     0,     0, ...,  4753,    29,  7260],
        [    0,     0,     0, ...,   566,    15,   287],
        ...,
        [    0,     0,     0, ...,    11,  1992,  1264],
        [    0,     0,     0, ...,    13,   225,  7356],
        [    0,     0,     0, ...,    63,  1269,   552]]),
 array([[    0,     0,     0, ...,    48,  7226,   268],
        [    0,     0,     0, ...,  4753,    13,  7260],
        [    0,     0,     0, ...,    15,   153,   100],
        ...,
        [    0,     0,     0, ...,    35,  1992,  1264],
        [    0,     0,     0, ...,   650,     4, 13135],
        [    0,     0,     0, ...,   208,    63,   836]])]

In [30]:
print(model.evaluate(X_val_dataset, Y_validation.values))

[0.5896039743423462, 0.686, 0.4014213891029358, -0.369499968290329]


In [31]:
preds = model.predict(X_val_dataset)
preds_binary = calculate_preds_binary(preds)

cm, metrics = model_statistics(preds_binary, Y_validation)

Number of correctly predicted offers:  1372
Number of all offers in the test set:  2000
Confusion matrix: 
Predicted  False  True
Actual                
False        724   537
True          91   648
Metrics: 
              precision    recall  f1-score   support

           0       0.89      0.57      0.70      1261
           1       0.55      0.88      0.67       739

   micro avg       0.69      0.69      0.69      2000
   macro avg       0.72      0.73      0.69      2000
weighted avg       0.76      0.69      0.69      2000
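
The per-class figures in the report follow directly from the confusion matrix. The sketch below recomputes them from the four counts printed above; `class_metrics` is a hypothetical helper, not part of `model_statistics`.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts read from the confusion matrix (rows: actual, columns: predicted).
tn, fp, fn, tp = 724, 537, 91, 648

accuracy = (tp + tn) / (tp + tn + fp + fn)
duplicates = class_metrics(tp, fp, fn)
# For class 0 the roles swap: TN counts as hits, FN as false alarms, FP as misses.
non_duplicates = class_metrics(tn, fn, fp)

print(round(accuracy, 3))
print([round(m, 2) for m in duplicates])
print([round(m, 2) for m in non_duplicates])
```

The numbers show the effect of the class weighting: recall for duplicates is high (few missed duplicates) at the cost of many false positives, which lowers precision for class 1.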



Task 2: Display the first 10 correctly and incorrectly classified pairs of questions

In [32]:
results_y = pd.concat([Y_validation, pd.Series(preds_binary, index=Y_validation.index, name="preds")], axis=1)

In [33]:
result_df = pd.merge(X_validation[["question1", "question2"]], results_y, right_index=True, left_index=True)

In [34]:
result_df[result_df.is_duplicate==result_df.preds][0:10]

qid1,qid2,question1,question2,is_duplicate,preds
13684,13685,What is the essential reading list for learnin...,What books do you recommend to read about Sema...,0,0
13977,13978,Can you treat pneumonia with Albuterol?,How can you treat pneumonia and albuterol?,1,1
16126,8887,"If you voted for Donald Trump, why did you vot...",Why did you specifically vote for Donald Trump?,1,1
16179,16180,Can you see who viewed your Instagram?,Can someone see if you have viewed public Inst...,1,1
2209,5609,Quora: How do you post a question on Quora?,How do I post something in Quora?,1,1
13878,13879,Is a heart rate of 110 beats per minute health...,How does it feel to have increased rate of hea...,0,0
11173,9988,How do you know if you're a psychopath and how...,How do you know if you are a psychopath?,1,1
18271,18272,What is the difference between marijuana/weed/...,Are there any side effects of ganja (marijuana)?,0,0
12424,12425,How can I move my whatsapp account with the sa...,How do I get to my WhatsApp account using a ne...,1,1
12125,12126,Which indian movie has the biggest collection ...,Which Indian movie has highest collections?,1,1


In [35]:
result_df[result_df.is_duplicate!=result_df.preds][0:10]

qid1,qid2,question1,question2,is_duplicate,preds
3443,3444,"In the UK, what are the implications of the Im...","In the UK, what are the implications of the Im...",0,1
12463,12464,What are the best small classes for freshmen a...,How do I get admitted to Amherst College?,0,1
12040,12041,What are the differences between BitBucket and...,What are the pros and cons of GitHub versus Bi...,0,1
4377,4378,Is 21 too late to learn guitar?,Is it too late to learn how to sing at age 21?,0,1
13068,13069,Why did I get my period 6 days late?,What does it mean when your period is three da...,1,0
14248,14249,Should I become a professor?,How do you become a professor?,0,1
17376,17377,How long does allergy season last in North Ame...,How long does allergy season last in South Ame...,0,1
14863,14864,"Does India have nutmeg, mace, and cloves?","Did China have nutmeg, mace, and cloves?",0,1
1659,1660,Can an auto immune disease cause insomnia?,What causes Auto Immune diseases?,0,1
2254,2255,Who were the Aztec?,Who were the Aztec Gods?,0,1
