# Siamese neural network

This notebook demonstrates a Siamese neural network applied to text data.
The network consists of two BLSTM models with the same architecture and shared weights.
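
As a rough illustration of how the two branches are compared: the same encoder turns each question into a vector, and the two vectors are then scored for similarity. The helper exponent_neg_manhattan_distance imported below is not shown in this notebook, so the sketch only assumes what its name suggests, namely exp(-L1 distance). The function name `exp_neg_manhattan` and the example vectors are hypothetical.

```python
import math

def exp_neg_manhattan(left, right):
    """Similarity in (0, 1]: exp(-sum(|left_i - right_i|))."""
    distance = sum(abs(a - b) for a, b in zip(left, right))
    return math.exp(-distance)

# Made-up encodings of two questions, as if produced by the shared encoder.
same = exp_neg_manhattan([0.2, -0.1, 0.7], [0.2, -0.1, 0.7])
different = exp_neg_manhattan([0.2, -0.1, 0.7], [-0.9, 0.8, 0.1])
print(same, different)
```

Identical encodings give a similarity of 1, and the score decays towards 0 as the encodings move apart, which makes it convenient to train against a 0/1 duplicate label.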

We use the Kaggle dataset of Quora question pairs and the similarity between them:
https://www.kaggle.com/c/quora-question-pairs

The code is split between this notebook and Python scripts, from which we import the functions we need.

In [1]:
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from string import punctuation
import pandas as pd
import numpy as np
import time
import datetime
import pickle
from utils.preprocessing_utils import clear_text, prepare_representation
from utils.model_utils import (
    prepare_embedding_matrix,
    one_or_zero,
    build_model_blstm,
    exponent_neg_manhattan_distance,
    model_statistics,
)

Using TensorFlow backend.


We define a timestamp that lets us version the data, the model, and the tokenizer when saving them.

In [2]:
start_time = time.time()
now = time.strftime("%Y%m%d-%H%M%S")

### Loading the data

Data description from the competition page:

Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.

The goal of this competition is to predict which of the provided pairs of questions contain two questions with the same meaning. 

Data fields:
id - the id of a training set question pair
qid1, qid2 - unique ids of each question (only available in train.csv)
question1, question2 - the full text of each question
is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.

In [3]:
data = pd.read_csv("src/Siamese_workshops_quora.csv", index_col="id", nrows=10000)
data = data[data['question1'].apply(lambda x: isinstance(x,str))]
data = data[data['question2'].apply(lambda x: isinstance(x,str))]

In [4]:
data.head()

id,qid1,qid2,question1,question2,is_duplicate
0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [5]:
data.tail()

id,qid1,qid2,question1,question2,is_duplicate
9995,19404,19405,How would you order these four cities (Bangalo...,What is the cost of living in Europe and the U...,0
9996,19406,19407,Stphen william hawking?,"What are the differences between SM, YG and JY...",0
9997,19408,19409,Mathematical Puzzles: What is () + () + () = 3...,What are the steps to solve this equation: [ma...,0
9998,19410,19411,Is IMS noida good for BCA?,How good is IMS Noida for studying BCA?,1
9999,19412,19413,What are the most respected and informative te...,What are Caltech's required and recommended te...,0


In [6]:
data[data.is_duplicate==1].shape

(3711, 5)

In [7]:
data[data.is_duplicate==0].shape

(6289, 5)

### Text preparation

We strip punctuation from the data and lemmatize the text. We do not remove stopwords, because dropping them could significantly change the meaning of a question.
We use the clear_text function, which takes the data, the characters to strip, a flag for stopword removal, and a flag for lemmatization.
It then performs the requested operations and returns the questions split into cleaned words.
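
The cleaning step can be approximated in a few lines. This is only a sketch: the real clear_text lives in utils.preprocessing_utils and is not shown here, so `simple_clear_text` is a hypothetical stand-in that lowercases, splits on whitespace, and strips the given characters from token edges. It skips lemmatization and stopword handling entirely.

```python
from string import punctuation

def simple_clear_text(text, strip_chars=punctuation + '„”–'):
    """Lowercase, split on whitespace, strip the given characters from token edges."""
    tokens = (token.strip(strip_chars) for token in text.lower().split())
    # Tokens made only of stripped characters (e.g. "()" or "+") become empty; drop them.
    return [token for token in tokens if token]

print(simple_clear_text("Is IMS noida good for BCA?"))
```

Stripping only at token edges keeps inner apostrophes, which is consistent with tokens such as caltech's in the cleaned output further down.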

In [8]:
# Characters stripped from the text: ASCII punctuation plus typographic quotes and the en dash.
strip_chars = punctuation + '„”–'
for question in ['question1', 'question2']:
    data[question + '_cleared'] = clear_text(data[question], strip_chars, is_remove_stopwords=False, is_lemmatize=True)

2019-04-11 19:17:39.453541 Oczyszczenie danych - SUKCES
2019-04-11 19:17:42.633574 Lemmatyzacja - SUKCES
2019-04-11 19:17:42.870540 Oczyszczenie danych - SUKCES
2019-04-11 19:17:43.525556 Lemmatyzacja - SUKCES


In [9]:
data.tail()

id,qid1,qid2,question1,question2,is_duplicate,question1_cleared,question2_cleared
9995,19404,19405,How would you order these four cities (Bangalo...,What is the cost of living in Europe and the U...,0,"[how, would, you, order, these, four, city, ba...","[what, is, the, cost, of, living, in, europe, ..."
9996,19406,19407,Stphen william hawking?,"What are the differences between SM, YG and JY...",0,"[stphen, william, hawking]","[what, are, the, difference, between, sm, yg, ..."
9997,19408,19409,Mathematical Puzzles: What is () + () + () = 3...,What are the steps to solve this equation: [ma...,0,"[mathematical, puzzle, what, is, 30, using, 1,...","[what, are, the, step, to, solve, this, equati..."
9998,19410,19411,Is IMS noida good for BCA?,How good is IMS Noida for studying BCA?,1,"[is, ims, noida, good, for, bca]","[how, good, is, ims, noida, for, studying, bca]"
9999,19412,19413,What are the most respected and informative te...,What are Caltech's required and recommended te...,0,"[what, are, the, most, respected, and, informa...","[what, are, caltech's, required, and, recommen..."


### Creating and saving the tokenizer

We create and save the tokenizer. We will need it whenever we want to reuse the model and prepare an arbitrary dataset for it.
We use the prepare_representation function, which takes the text of both questions (the corpus), builds a tokenizer, and returns both the tokenizer and the tokenized questions.

The oov_token argument is optional: the token (here 'unk') is added to the word index and substituted for every word that the tokenizer has not seen.
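
To make the role of the out-of-vocabulary token concrete, here is a simplified pure-Python stand-in for a word-index tokenizer. It is not the Keras implementation and not prepare_representation; `build_word_index` and `to_sequence` are hypothetical names. It follows the Keras convention that the OOV token takes index 1 and the remaining words are numbered by descending frequency, with 0 left free for padding.

```python
from collections import Counter

def build_word_index(tokenized_texts, oov_token='unk'):
    """Map words to integer ids: the OOV token gets 1, then words by descending frequency."""
    counts = Counter(word for text in tokenized_texts for word in text)
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        if word not in word_index:
            word_index[word] = len(word_index) + 1
    return word_index

def to_sequence(tokens, word_index, oov_token='unk'):
    """Replace each word by its id; unseen words map to the OOV id."""
    return [word_index.get(word, word_index[oov_token]) for word in tokens]

corpus = [['what', 'is', 'the', 'cost'], ['what', 'is', 'ims']]
word_index = build_word_index(corpus)
print(to_sequence(['what', 'is', 'quora'], word_index))
```

The word quora never occurs in the toy corpus, so it is encoded with the OOV id instead of being silently dropped.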

In [10]:
tokenizer, stacked_representation = prepare_representation(
    pd.concat([data['question1_cleared'], data['question2_cleared']], axis=0), 'unk')
data['question1_tokens'], data['question2_tokens'] = np.array_split(stacked_representation, 2)
with open(f"results/{now}_tokenizer_warsztaty.pickle", 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [11]:
data.tail()

id,qid1,qid2,question1,question2,is_duplicate,question1_cleared,question2_cleared,question1_tokens,question2_tokens
9995,19404,19405,How would you order these four cities (Bangalo...,What is the cost of living in Europe and the U...,0,"[how, would, you, order, these, four, city, ba...","[what, is, the, cost, of, living, in, europe, ...","[7, 46, 16, 515, 352, 1556, 321, 488, 1018, 65...","[3, 4, 2, 277, 11, 426, 9, 1104, 13, 2, 85, 59..."
9996,19406,19407,Stphen william hawking?,"What are the differences between SM, YG and JY...",0,"[stphen, william, hawking]","[what, are, the, difference, between, sm, yg, ...","[11146, 11147, 11148]","[3, 12, 2, 66, 50, 2017, 13747, 13, 13748, 611..."
9997,19408,19409,Mathematical Puzzles: What is () + () + () = 3...,What are the steps to solve this equation: [ma...,0,"[mathematical, puzzle, what, is, 30, using, 1,...","[what, are, the, step, to, solve, this, equati...","[2985, 2738, 3, 4, 753, 177, 109, 144, 188, 31...","[3, 12, 2, 649, 8, 791, 69, 1308, 192, 1153, 2..."
9998,19410,19411,Is IMS noida good for BCA?,How good is IMS Noida for studying BCA?,1,"[is, ims, noida, good, for, bca]","[how, good, is, ims, noida, for, studying, bca]","[4, 7873, 5462, 40, 15, 7874]","[7, 40, 4, 7873, 5462, 15, 845, 7874]"
9999,19412,19413,What are the most respected and informative te...,What are Caltech's required and recommended te...,0,"[what, are, the, most, respected, and, informa...","[what, are, caltech's, required, and, recommen...","[3, 12, 2, 56, 3418, 13, 11149, 2184, 15, 845,...","[3, 12, 13751, 579, 13, 1858, 2184, 15, 2158, ..."


### Setting the model parameters and splitting the data

In [12]:
model_parameters = {
        'emb_len': 300,
        'lstm_units': 10,
        'max_seq_len': 150,
        'text_rep_dim': 32,
        'batch_size': 64,
        'maxlen': 40,
        'distance': 'manhattan',
        'optimizer': 'adam',
        'loss': 'bin'}

In [13]:
data = data.set_index(['qid1', 'qid2'])

y = data['is_duplicate'].astype(np.int64).apply(one_or_zero, args=(1,))


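# train_test_split returns the splits in argument order: the two y parts first, then the two X parts.
Y_train, Y_validation, X_train, X_validation = train_test_split(y, data.drop(["is_duplicate"], axis=1), test_size=0.2)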

X_train = X_train.drop(["question1","question2", "question1_cleared", "question2_cleared"], axis=1)

In [14]:
X_train.head()

qid1,qid2,question1_tokens,question2_tokens
2810,2811,"[77, 285, 96, 198, 82, 8605, 971, 1611, 13, 82...","[7, 12, 3913, 971, 152]"
2240,2241,"[7, 10, 16, 28, 33, 371, 2268, 21, 4634]","[7, 14, 16, 76, 4634, 8, 5378, 2, 371]"
18861,18862,"[10, 584, 5205, 39, 150, 846, 664, 476, 1336, ...","[4, 5812, 13679, 5, 3748]"
10842,10843,"[9, 2506, 8, 618, 1706, 590, 43, 12, 9851, 39,...","[3, 655, 1103, 31, 6, 618, 29, 5, 8054, 1706]"
8643,8644,"[3, 12, 32, 11, 2, 6629, 2708, 26, 442, 3485, ...","[7, 1179, 4, 6629, 2708, 346]"


In [15]:
X_train_dataset = [pad_sequences(X_train['question1_tokens'], maxlen=model_parameters['maxlen']),
                   pad_sequences(X_train['question2_tokens'], maxlen=model_parameters['maxlen'])]
X_val_dataset = [pad_sequences(X_validation['question1_tokens'], maxlen=model_parameters['maxlen']),
                 pad_sequences(X_validation['question2_tokens'], maxlen=model_parameters['maxlen'])]

In [16]:
X_train_dataset

[array([[   0,    0,    0, ...,   82, 1964, 3913],
        [   0,    0,    0, ..., 2268,   21, 4634],
        [   0,    0,    0, ...,  104,  785, 3275],
        ...,
        [   0,    0,    0, ...,   58,  294,  166],
        [   0,    0,    0, ...,   40,   13,  223],
        [   0,    0,    0, ..., 2680,  144, 9298]]),
 array([[    0,     0,     0, ...,  3913,   971,   152],
        [    0,     0,     0, ...,  5378,     2,   371],
        [    0,     0,     0, ..., 13679,     5,  3748],
        ...,
        [    0,     0,     0, ...,   991, 11693,   819],
        [    0,     0,     0, ...,   223,    13,    40],
        [    0,     0,     0, ...,   132,     9,  2680]])]
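
The leading zeros in every row come from padding. By default pad_sequences pads at the front of a sequence and, when a sequence is longer than maxlen, also truncates from the front. The pure-Python sketch below mimics that default for a single sequence; `pad_pre` is a hypothetical helper, not part of Keras.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

short = pad_pre([7, 12, 3913, 971, 152], maxlen=8)
long = pad_pre(list(range(1, 11)), maxlen=8)
print(short)
print(long)
```

Padding on the left means the last time steps seen by the LSTM are always real tokens, which is why the rows above end with token ids rather than zeros.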

The prepare_embedding_matrix function either loads pretrained w2v embeddings or sets the parameters needed to train our own embeddings.

In [17]:
model_parameters.update({'nb_tokens': len(tokenizer.index_word) + 1})

embedding_matrix, embeddings_index, is_trainable = prepare_embedding_matrix(model_parameters, tokenizer.word_index,
                                                                            'src/wiki.en.vec.csv', 'fb_emb', nrows=1000)
model_parameters.update({'is_trainable': is_trainable})

found 1001 word vectors
number of null word embeddings: 12888
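
The "null word embeddings" count can be read as the number of vocabulary entries for which no pretrained vector was found. The sketch below shows one common way to build such a matrix; it is an assumption about what prepare_embedding_matrix does, not its actual code. `build_embedding_matrix` and the toy vectors are hypothetical.

```python
def build_embedding_matrix(word_index, pretrained, emb_len):
    """Return (matrix, missing): one row per token id, zeros where no vector exists."""
    # Row 0 is reserved for the padding id, hence len(word_index) + 1 rows.
    matrix = [[0.0] * emb_len for _ in range(len(word_index) + 1)]
    missing = 0
    for word, idx in word_index.items():
        vector = pretrained.get(word)
        if vector is None:
            missing += 1  # this row stays all zeros
        else:
            matrix[idx] = list(vector)
    return matrix, missing

word_index = {'unk': 1, 'what': 2, 'is': 3, 'noida': 4}
pretrained = {'what': [0.1, 0.2, 0.3], 'is': [0.4, 0.5, 0.6]}
matrix, missing = build_embedding_matrix(word_index, pretrained, emb_len=3)
print(len(matrix), missing)
```

With only the first 1000 rows of the embedding file loaded (nrows=1000), most of the vocabulary has no pretrained vector, so a large null count is expected in this run.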


### Training the model

In [18]:
model = build_model_blstm(model_parameters, embedding_matrix)
print(model.summary())

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 40)           0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 40)           0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 40, 300)      4125600     input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, 40, 20)       24880       embedding_1[0][0]                
          

In [19]:
ones_share = Y_train.sum() / Y_train.shape[0]

ones_weight = (1 - ones_share) / ones_share
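
The weight computed above compensates for class imbalance: duplicates are the minority class, so each positive example is weighted by the ratio of negatives to positives. As a worked example, the snippet below plugs in the class counts of the full 10,000-row sample shown earlier. The notebook itself computes the weight on Y_train only, so its value will differ slightly.

```python
n_duplicates = 3711       # pairs with is_duplicate == 1 in the loaded sample
n_non_duplicates = 6289   # pairs with is_duplicate == 0

ones_share = n_duplicates / (n_duplicates + n_non_duplicates)
# Algebraically the same as n_non_duplicates / n_duplicates.
ones_weight = (1 - ones_share) / ones_share

print(round(ones_share, 4), round(ones_weight, 3))
```

With this weighting, the total weight of the positive class equals the total weight of the negative class during training.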

In [20]:
model_trained = model.fit(X_train_dataset, Y_train.values,
                          validation_split=0.1, batch_size=model_parameters['batch_size'], epochs=2,
                          class_weight={1: ones_weight, 0: 1})

model.save("results/" + now + "_model.h5")

Train on 7200 samples, validate on 800 samples
Epoch 1/2
Epoch 2/2


### Analysis of the results

Loading a previously trained model and tokenizer.

In [21]:
from keras.models import load_model
from keras import backend as K

# Load the trained model and the matching tokenizer
model = load_model('results/20190411-164644_model.h5', custom_objects={'exponent_neg_manhattan_distance': exponent_neg_manhattan_distance})
with open('results/20190411-164644_tokenizer_warsztaty.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
    
X_validation = X_validation.drop(["question1_tokens","question2_tokens"], axis=1)

## Task 1: Tokenize the test data with the loaded tokenizer and prepare it so that model.predict can be run on it


In [22]:
X_val_dataset = #

In [23]:
print(model.evaluate(X_val_dataset, Y_validation.values))
preds = model.predict(X_val_dataset)

[0.35394559454917907, 0.852, 0.2266942652463913, -0.37299996888637543]


## Task 2: Write a function calculate_preds_binary(preds) that returns the binary result of the model

In [33]:
def calculate_preds_binary(preds):


SyntaxError: unexpected EOF while parsing (<ipython-input-33-364f40f917ca>, line 2)
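
One possible way to fill in the calculate_preds_binary stub above is simple thresholding. This is a sketch under assumptions: it supposes that the model outputs one similarity score per pair in the range 0 to 1, and that 0.5 is an acceptable cut-off. The `threshold` parameter is an addition for illustration, and model_statistics may expect a different return type than a plain list.

```python
def calculate_preds_binary(preds, threshold=0.5):
    """Turn similarity scores into 0/1 labels by thresholding."""
    binary = []
    for row in preds:
        # model.predict typically returns one row per pair; accept a bare score too.
        score = row[0] if hasattr(row, '__len__') else row
        binary.append(int(score > threshold))
    return binary

example = calculate_preds_binary([[0.91], [0.12], [0.50]])
print(example)
```

Because the classes were reweighted during training, 0.5 is a reasonable starting point, but the threshold can be tuned on validation data if precision or recall matters more.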

In [25]:
preds_binary = calculate_preds_binary(preds)
cm, metrics = model_statistics(preds_binary, Y_validation)

Liczba poprawnie przewidzianych ogłoszeń:  1704
Liczba wszystkich ogłoszeń w zbiorze testowym:  2000
Confusion matrix: 
Predicted  False  True
Actual                
False       1039   215
True          81   665
Metryki: 
              precision    recall  f1-score   support

           0       0.93      0.83      0.88      1254
           1       0.76      0.89      0.82       746

   micro avg       0.85      0.85      0.85      2000
   macro avg       0.84      0.86      0.85      2000
weighted avg       0.86      0.85      0.85      2000
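
The figures in the report follow directly from the confusion matrix. The short check below recomputes accuracy and the class 1 (duplicate) metrics from the four counts printed above. The variable names are chosen here for illustration.

```python
tn, fp = 1039, 215   # row "Actual False": predicted False, predicted True
fn, tp = 81, 665     # row "Actual True":  predicted False, predicted True

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # share of predicted duplicates that are real duplicates
recall_1 = tp / (tp + fn)      # share of real duplicates that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 3), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The model misses few duplicates (81 false negatives) at the cost of more false alarms (215 false positives), which is consistent with the extra weight given to the positive class during training.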



## Task 3: Error analysis - display the first 5 correctly classified pairs and the first 5 incorrectly classified pairs