<a href="https://colab.research.google.com/github/AbhishekMajhi/DeepLearning-Using-Tensorflow/blob/master/RNN/Sentence_Similarity_using_siamese_network_and_MLSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Let's load our pretrained word2vec model**

import os
import gensim

from gensim.models import Word2Vec
from gensim.scripts.glove2word2vec import glove2word2vec

path_word_to_vec = os.path.expanduser('./datasets/GoogleNews-vectors-negative300.bin')
word2vec = gensim.models.KeyedVectors.load_word2vec_format(path_word_to_vec, binary=True)


# Notes on the MaLSTM Siamese Network

* A Siamese network is easier to train because both sides share the same weights.
* Here is the model architecture that we are going to use.

![picture](https://drive.google.com/uc?id=1Yvhg6ENollDtPigU_PsbDaydHOoEdshb)

<br>
* Siamese networks seem to perform well on similarity tasks and have been used for sentence semantic similarity, recognizing forged signatures, and many more.
* In MaLSTM, the identical sub-network runs all the way from the embedding up to the last LSTM hidden state (see the figure).
* Inputs to the network are zero-padded sequences of word indices.
* These inputs are fixed-length vectors in which the leading zeros are padding and the nonzero entries are indices that uniquely identify words.
* Those vectors are then fed into the embedding layer.
* This layer looks up the corresponding embedding for each word and stacks them all into a matrix.
* This matrix represents the given text as a series of embeddings.
* Here I used Google’s word2vec embeddings, the same as in the original paper.<br>
**Here is the diagram.**
<br>



![picture](https://drive.google.com/uc?id=1BVkMYC9cUeOlQwPeDq2gZd_f2eQwd_FD)
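
The lookup step described above can be sketched in plain NumPy. This is only an illustration: the toy vocabulary size, the 4-dimensional embeddings and all variable names are assumptions, whereas the notebook itself uses 300-dimensional word2vec vectors inside a Keras `Embedding` layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 5 real words plus index 0 reserved for padding.
vocab_size, embedding_dim = 6, 4
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))
embedding_matrix[0] = 0.0  # row 0 is the padding row

# A sentence of three word indices, zero-padded on the left to length 5.
padded_sentence = np.array([0, 0, 3, 1, 4])

# The lookup is plain row indexing: one embedding row per position.
sentence_matrix = embedding_matrix[padded_sentence]

print(sentence_matrix.shape)  # (5, 4)
```

Each padded position maps to the all-zero row, and each word position maps to that word's embedding, so the sentence becomes a `(sequence_length, embedding_dim)` matrix.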

* In this network we have two embedding matrices that represent a candidate pair of similar commands/sentences.
* We feed them into the LSTM (in practice there is only one, shared by both sides). The final state of the LSTM for each question is a vector denoted by **h**; it is 50-dimensional in the original paper, while this notebook sets `n_hidden = 64`.
* The LSTM is trained to capture the semantic meaning of the sentences.
* At this point we have two vectors that hold the semantic meaning of each sentence.
* We put them through the similarity function defined below.<br>

![picture](https://drive.google.com/uc?id=1q9n3hn7bcAvgzHmJTn72Cxn5gi9bXrI4)

<br>
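* Since the exponent is the negative of a non-negative distance, the output (the prediction in our case) lies between 0 and 1: it equals 1 when the two vectors are identical and approaches 0 as they move apart.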
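
To see the bounds in practice, here is a minimal NumPy sketch of the similarity `exp(-sum(|h_left - h_right|))`, the same expression the notebook later writes with the Keras backend. The function name `malstm_similarity` and the example vectors are assumptions made up for this illustration.

```python
import numpy as np

def malstm_similarity(h_left, h_right):
    """exp(-L1 distance) between two batches of vectors of shape (batch, dim)."""
    l1_distance = np.sum(np.abs(h_left - h_right), axis=1, keepdims=True)
    return np.exp(-l1_distance)

# Hypothetical final hidden states for two sentence pairs.
h_left = np.array([[0.2, -0.1, 0.5], [1.0, 0.0, -1.0]])
h_right = np.array([[0.2, -0.1, 0.5], [-1.0, 2.0, 1.0]])

scores = malstm_similarity(h_left, h_right)
print(scores)
```

The first pair is identical, so its score is exactly 1. The second pair has an L1 distance of 6, so its score is `exp(-6)`, which is close to 0.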

In [None]:
# Import libraries

from time import time
import pandas as pd
import numpy as np
from gensim.models import KeyedVectors
import re
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

import itertools
import datetime
import tensorflow as tf
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Lambda, Dropout, Bidirectional, Concatenate, merge
import keras.backend as K
from keras.optimizers import Adadelta,Adam, RMSprop
from keras.callbacks import ModelCheckpoint
from keras.regularizers import l2



## Testing on my data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
title = ['idx','command1','command2', 'label']

In [None]:
data = pd.read_excel('./datasets/Book1.xlsx',names=title,header=None)

In [None]:
# Let's set our random seed
np.random.seed(2)

In [None]:
# Shuffle our data 
#data = data.sample(frac= 1)

In [None]:
data.head()

Unnamed: 0,idx,command1,command2,label
0,1,open youtube,open youtube for me,1
1,2,can you open youtube for me,hey dude open youtube for me,1
2,3,i want youtube,i want to visit youtube,1
3,4,open music,play music,1
4,5,can you play music for me,"i am bored, can you play a song for me",1


In [None]:
# Shuffle our data 
data = data.sample(frac= 1)

In [None]:
train_data,test_data = train_test_split(data,test_size = 0.2, random_state = 101)

In [None]:
test_data = test_data.drop(['label'],axis = 1)

In [None]:
train_data.head()

Unnamed: 0,idx,command1,command2,label
52,53,i have to take notes,notepad open,1
84,85,can you destroy yourself?,can you do self-distruct,1
56,57,i have to take notes,i want to write something,1
14,15,open google for me,google open,1
71,72,delete account,delete my account,1


In [None]:
train_data.shape

(79, 4)

In [None]:
test_data.head()

Unnamed: 0,idx,command1,command2
38,39,open settings app,open windows settings app
53,54,i want to write something,i have to take notes
5,6,play a song for me,play a song
30,31,what was your name again?,what is your name?
2,3,i want youtube,i want to visit youtube


In [None]:
test_data.shape

(20, 3)

In [None]:
#sample_data = test_data['']

# Exp

In [None]:
# import zipfile

# file = '/content/drive/MyDrive/Datasets/test.csv.zip'
# import zipfile
# with zipfile.ZipFile(file, 'r') as zip_ref:
#     zip_ref.extractall('/content/drive/MyDrive/Datasets/')

In [None]:
#quora_data:
train_quora = pd.read_csv('/content/drive/MyDrive/Datasets/train.csv')
test_quora =  pd.read_csv('/content/drive/MyDrive/Datasets/test.csv')



In [None]:
train_quora.head(10)

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
5,5,11,12,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1
6,6,13,14,Should I buy tiago?,What keeps childern active and far from phone ...,0
7,7,15,16,How can I be a good geologist?,What should I do to be a great geologist?,1
8,8,17,18,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0
9,9,19,20,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0


In [None]:
test_quora.head(10)

Unnamed: 0,test_id,question1,question2
0,0,How does the Surface Pro himself 4 compare wit...,Why did Microsoft choose core m3 and not core ...
1,1,Should I have a hair transplant at age 24? How...,How much cost does hair transplant require?
2,2,What but is the best way to send money from Ch...,What you send money to China?
3,3,Which food not emulsifiers?,What foods fibre?
4,4,"How ""aberystwyth"" start reading?",How their can I start reading?
5,5,How are the two wheeler insurance from Bharti ...,I admire I am considering of buying insurance ...
6,6,How can I reduce my belly fat through a diet?,How can I reduce my lower belly fat in one month?
7,7,"By scrapping the 500 and 1000 rupee notes, how...",How will the recent move to declare 500 and 10...
8,8,What are the how best books of all time?,What are some of the military history books of...
9,9,After 12th years old boy and I had sex with a ...,Can a 14 old guy date a 12 year old girl?


In [None]:
print(train_quora.shape)
print(test_quora.shape)

(404290, 6)
(3563475, 3)


In [None]:
# sample = pd.read_csv('./datasets/quora-question-pairs/sample_submission.csv')

In [None]:
# Take a slice of train_quora and test_quora (copies, so later in-place edits leave the originals untouched)
train_set = train_quora.iloc[:60000, :].copy()

In [None]:
test_set = test_quora.iloc[:2000, :].copy()

In [None]:
print(train_set.shape)
print(test_set.shape)

(60000, 6)
(2000, 3)


In [None]:
import  nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Now let's create a helper function named **text_to_word_list(text)**. It takes a string as input, does some preprocessing (lowercasing, expanding contractions, removing specific signs, etc.), and returns a list in which each entry is a single word from the text.

In [None]:
stops = set(stopwords.words('english'))

def text_to_word_list(text):
    ''' Pre process and convert texts to a list of words '''
    text = str(text)
    text = text.lower()

    # Clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)

    text = text.split()

    return text

In [None]:
vocabulary = dict()
inverse_vocabulary = ['<unk>']  # '<unk>' will never be used, it is only a placeholder for the [0, 0, ....0] embedding

question_col = ['question1','question2']
K.clear_session()

In [None]:
#!wget 'https://storage.googleapis.com/kaggle-data-sets/12162/16683/compressed/GoogleNews-vectors-negative300.bin.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20210607%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20210607T023632Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=2c0526fd225648ba8f52d57a913c66a18e1fbc85e46e866d80e7f9437944bc4fe3e5bb0f7254f2fdd9cc6ccf6966d69c6449cdd6d5918c67d4d6d05c69f7c2e0645b6bd219bda5d439de8b9fdb6b8349345105054be7e6e37ccedbdd222d8e11d24ecb60ac64c2f5faa1216d37970b4c4893ea625bdc30731f81e49a53adda127c1138aba14b22d7577217894f4cecffe62c53c13f4d2e7a7867000d884e4bf5e2ee421dde67a8c936d5f06dda0aafc09775fb69af826bb848699934618fcb7bb6de9c7db502c9bf400976fd996cca7cc409377170aba0cad14bd4030f0948b249a7b8048ec386b641dac28da4141267d6578490a65dfcfe52e366f9f2fc7aa0'

In [None]:
# import zipfile
# file_path = '/content/drive/MyDrive/Datasets/global-vectors-for-word-representation.zip'
# import zipfile
# with zipfile.ZipFile(file_path, 'r') as zip_ref:
#     zip_ref.extractall('/content/sample_data/')

In [None]:
import gzip
import shutil
file_path = '/content/drive/MyDrive/deep learning/GoogleNews-vectors-negative300.bin.gz'
with gzip.open(file_path, 'rb') as f_in:
    with open('GoogleNews-vectors-negative300.bin', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

In [None]:
# word2vec model.
word2vec = KeyedVectors.load_word2vec_format('/content/drive/MyDrive/Datasets/GoogleNews-vectors-negative300.bin', binary=True)

In [None]:
# Iterate over the questions of both the training and test datasets
for dataset in [train_set, test_set]:
    for index, row in dataset.iterrows():
        # Iterate through the text of both questions of the row
        for question in question_col:
            com = []  # the question as a list of word indices
            for word in text_to_word_list(row[question]):
                # Skip stop words that word2vec does not know
                if word in stops and word not in word2vec.vocab:
                    continue
                if word not in vocabulary:
                    vocabulary[word] = len(inverse_vocabulary)
                    com.append(len(inverse_vocabulary))
                    inverse_vocabulary.append(word)
                else:
                    com.append(vocabulary[word])
            # Replace the question text with its index representation
            dataset.at[index, question] = com

In [None]:
# Create the embedding matrix
embedding_dim = 300
embeddings = 1 * np.random.randn(len(vocabulary) + 1, embedding_dim)  # This will be the embedding matrix
embeddings[0] = 0  # So that the padding will be ignored

In [None]:
# Build the embedding matrix
for word, index in vocabulary.items():
    if word in word2vec.vocab:
        embeddings[index] = word2vec.word_vec(word)

del word2vec


The code above maps each word to its embedding given by word2vec.<br>
It relies on two data structures:<br>
* vocabulary, a dict whose keys are words (str) and whose values are the corresponding indices (a unique id as int).
* inverse_vocabulary, a list of words (str) in which the position of each word is its id (from vocabulary).

Then we create our embedding matrix.
We assign each word its word2vec embedding and leave the unrecognized ones (less than 0.5%) randomly initialized.
We keep the first row all zeros, since index 0 is reserved for padding.
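
As a toy illustration of how these pieces fit together, the sketch below builds `vocabulary`, `inverse_vocabulary` and a small embedding matrix. The `pretrained` dict is a hypothetical stand-in for the word2vec model, and the helper `encode` and the 3-dimensional vectors are assumptions made up for this example.

```python
import numpy as np

# Hypothetical stand-in for the pretrained word2vec model: word -> vector.
embedding_dim = 3
pretrained = {
    "open": np.array([0.1, 0.2, 0.3]),
    "music": np.array([0.4, 0.5, 0.6]),
}

vocabulary = {}                 # word -> index
inverse_vocabulary = ["<unk>"]  # index -> word; index 0 is the padding placeholder

def encode(words):
    """Map words to indices, growing the vocabulary on first sight."""
    indices = []
    for word in words:
        if word not in vocabulary:
            vocabulary[word] = len(inverse_vocabulary)
            inverse_vocabulary.append(word)
        indices.append(vocabulary[word])
    return indices

encoded = encode(["open", "music", "plz", "open"])

# Random init for every row, then overwrite the rows of known words.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocabulary) + 1, embedding_dim))
embeddings[0] = 0.0  # padding row
for word, index in vocabulary.items():
    if word in pretrained:
        embeddings[index] = pretrained[word]

print(encoded)  # [1, 2, 3, 1]
```

Here "plz" is not in the stand-in model, so its row keeps the random initialization, while "open" and "music" receive their pretrained vectors.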

Building the index representation takes a lot of time and computation, so let's save the result as CSV files for later use. We need to save train_quora and test_quora.

In [None]:
# train_quora.to_csv('./datasets/train_quora.csv')
# test_quora.to_csv('./datasets/test_quora.csv')

# Run this cell the second time you open this notebook, to reuse the saved files.
train_quora = pd.read_csv('train_quora.csv')
test_quora = pd.read_csv('test_quora.csv')

In [None]:
K.clear_session()

In [None]:
train_set.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,"[1, 2, 3, 4, 5, 4, 6, 7, 8, 9, 10, 8, 11]","[1, 2, 3, 4, 5, 4, 6, 7, 8, 9, 10]",0
1,1,3,4,"[1, 2, 3, 12, 13, 14, 15, 16, 15, 17, 18]","[1, 19, 20, 21, 3, 22, 23, 24, 3, 13, 14, 15, ...",0
2,2,5,6,"[26, 27, 16, 28, 3, 29, 30, 31, 32, 33, 34, 35]","[26, 27, 31, 29, 36, 37, 5, 38, 39, 40]",0
3,3,7,8,"[41, 42, 16, 43, 44, 45, 26, 27, 16, 46, 47]","[48, 3, 49, 50, 51, 52, 53, 54, 51, 2, 55, 5, ...",0
4,4,9,10,"[56, 57, 58, 8, 59, 60, 61, 62, 63, 64, 65, 66]","[56, 67, 19, 68, 8, 62, 59]",0


The output should look like this:<br>

![picture](https://drive.google.com/uc?id=13vZxqMvIV_enZSpG-QbnFYRLjgYyD8gQ)

In [None]:
test_set.head()

Unnamed: 0,test_id,question1,question2
0,0,"[26, 76, 3, 1663, 1237, 5472, 745, 149, 175, 4...","[41, 330, 3863, 331, 1580, 12991, 212, 1580, 6..."
1,1,"[84, 16, 401, 942, 11093, 225, 833, 54, 26, 21...","[26, 214, 534, 76, 942, 11093, 1864]"
2,2,"[1, 1220, 2, 3, 195, 250, 1096, 251, 91, 817, ...","[1, 99, 1096, 251, 817]"
3,3,"[56, 818, 212, 31956]","[1, 388, 11099]"
4,4,"[26, 36359, 1221, 1204]","[26, 290, 27, 16, 1221, 1204]"


The output should look like this:<br>
![picture](https://drive.google.com/uc?id=1ZOMbMCHq1pPSvMv1mL_sb3uzdzcJ532u)

In [None]:
X_test = test_set[question_col]

In [None]:
X_test

Unnamed: 0,question1,question2
0,"[26, 76, 3, 1663, 1237, 5472, 745, 149, 175, 4...","[41, 330, 3863, 331, 1580, 12991, 212, 1580, 6..."
1,"[84, 16, 401, 942, 11093, 225, 833, 54, 26, 21...","[26, 214, 534, 76, 942, 11093, 1864]"
2,"[1, 1220, 2, 3, 195, 250, 1096, 251, 91, 817, ...","[1, 99, 1096, 251, 817]"
3,"[56, 818, 212, 31956]","[1, 388, 11099]"
4,"[26, 36359, 1221, 1204]","[26, 290, 27, 16, 1221, 1204]"
...,...,...
1995,"[1, 115, 3, 278, 14283, 251, 3096, 161, 185]","[1, 159, 2408, 175, 3, 188, 16, 5496, 360, 309..."
1996,"[218, 27, 57, 401, 3, 851, 5546, 4012, 8, 11]","[2, 47, 27, 16, 401, 3, 851, 3912, 11093, 8, 11]"
1997,"[1, 16, 319, 263, 225, 4221, 5339]","[22919, 787, 27, 16, 3939, 30, 598]"
1998,"[26, 97, 16, 357, 7278, 208, 416, 76, 212, 781...","[27, 99, 781, 208, 416, 76, 212, 781, 99, 25]"


#  Data preparation

In order to prepare our data for use in Keras we have to do three things:
* Split our data into ‘left’ and ‘right’ inputs (one for each side of the MaLSTM).
* Pad all of the word-index sequences with zeros.
* Create a validation dataset using train_test_split.

max_seq_length holds the length of the longest question. Here is the code.

In [None]:
# Size of the validation set
validation_size = 5000
train_size = len(train_set) - validation_size  # size of the training data after removing the validation data

In [None]:
# Defining X and Y for train_test_split 
X = train_set[question_col]   # Here it will contain questions only.
Y = train_set['is_duplicate']  # It will contain labels only.  

In [None]:
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size,random_state = 101)

In [None]:
Y_validation.shape

(5000,)

In [None]:
# We need to pass X_train as left questions and right questions, because the MaLSTM has a left and a right input.
X_train = {'left':X_train.question1,'right':X_train.question2}
# Now the same for the validation data
X_validation = {'left':X_validation.question1,'right':X_validation.question2}

In [None]:
# The inputs are now in the form we need; next, convert the labels to their NumPy representation.
Y_train = Y_train.values
Y_validation = Y_validation.values


In [None]:
def convert_to_one_hot(Y, C):
    Y = np.eye(C)[Y.reshape(-1)]
    return Y

In [None]:
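# Caution: this gives Y_validation the shape (5000, 2), while the model defined later outputs one similarity score per pair (shape (batch, 1)) and Y_train stays a 1-D vector of 0/1 labels.
Y_validation = convert_to_one_hot(Y_validation,2)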

**Now zero padding**

In [None]:
max_seq_length = max(train_set.question1.map(lambda x: len(x)).max(),
                     train_set.question2.map(lambda x: len(x)).max(),
                     test_set.question1.map(lambda x: len(x)).max(),
                     test_set.question2.map(lambda x: len(x)).max())
print(max_seq_length)
# Zero padding
for dataset, side in itertools.product([X_train, X_validation], ['left', 'right']):
    dataset[side] = pad_sequences(dataset[side], maxlen=max_seq_length)



212


**Here max_seq_length is 212, as printed above.**
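
The split-and-pad step can also be sketched without Keras. The helper `pad_pre` below is a hypothetical NumPy stand-in that mimics the default behaviour of `pad_sequences` (zeros added at the front, overly long sequences cut from the front); the toy question pairs are assumptions made up for this example.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad (and left-truncate) integer sequences with zeros."""
    padded = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        kept = list(seq)[-maxlen:]  # keep the last maxlen items
        if kept:
            padded[row, -len(kept):] = kept
    return padded

# Hypothetical encoded question pairs.
pairs = {
    "question1": [[1, 2, 3], [4, 5]],
    "question2": [[1, 2], [6, 7, 8, 9]],
}
max_seq_length = max(len(seq) for col in pairs.values() for seq in col)

# One padded array per side of the Siamese network.
X = {
    "left": pad_pre(pairs["question1"], max_seq_length),
    "right": pad_pre(pairs["question2"], max_seq_length),
}
print(X["left"])
print(X["right"])
```

Both sides end up with the same shape, `(number_of_pairs, max_seq_length)`, which is what the assertions in the next cell check for the real data.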

In [None]:
# Here we just make sure everything is ok
assert X_train['left'].shape == X_train['right'].shape
assert len(X_train['left']) == len(Y_train)

# Model Creation Time:

1. Since we need to **merge** the outputs of our two LSTM branches using the MaLSTM similarity function, we need a Keras layer that combines two tensors.<br>
2. Keras provides built-in merge layers, and it also supports custom layers for custom merge logic.
3. So we can merge our left and right LSTM outputs with a custom layer.

In [None]:
import tensorflow as tf
# Use the public tensorflow.keras API rather than the private tensorflow.python.keras modules
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Layer

In [None]:
# Now let's define the MaLSTM similarity function: exp(-L1 distance) between the two hidden states.
# @tf.autograph.experimental.do_not_convert

def maLSTM_similarity_fun(left,right):
    return K.exp(-K.sum(K.abs(left-right), axis = 1, keepdims= True))

In [None]:
# Let's initialize the parameters that we will need throughout the model.

n_hidden = 64
gradient_clipping_norm = 1.25
batch_size = 80
num_epochs  = 50

# Now define the input (visible) layers
left_input = Input(shape=(max_seq_length,), dtype='int32')
right_input = Input(shape=(max_seq_length,), dtype='int32')

# Remember that the MaLSTM architecture has an embedding layer, shared by both inputs

embedding_layer = Embedding(len(embeddings), embedding_dim, weights=[embeddings], input_length=max_seq_length, trainable=False)

# Embedded version of the inputs

embedding_left = embedding_layer(left_input)
embedding_right = embedding_layer(right_input)



See keras Embedding class documentation [Embedding](https://keras.io/api/layers/core_layers/embedding/).

In [None]:
# lstm_layer = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2))
# # loading our matrix
# emb = tf.keras.layers.Embedding(max_words, embedding_dim, input_length=300, weights=[embedding_matrix],trainable=False)
# input1 = tf.keras.Input(shape=(300,))
# e1 = emb(input1)
# x1 = lstm_layer(e1)
# input2 = tf.keras.Input(shape=(300,))
# e2 = emb(input2)
# x2 = lstm_layer(e2)
# mhd = lambda x: tf.keras.backend.abs(x[0] - x[1])
# merged = tf.keras.layers.Lambda(function=mhd, output_shape=lambda x: x[0],
# name='L1_distance')([x1, x2])
# preds = tf.keras.layers.Dense(1, activation='sigmoid')(merged)
# model = tf.keras.Model(inputs=[input1, input2], outputs=preds)
# model.compile(loss='mse', optimizer='adam')

In [None]:
class ManDist(Layer):
    """
    Keras custom layer that computes the MaLSTM similarity,
    exp(-Manhattan distance), between its two inputs.
    """

    # Initialize the layer; no need to include an inputs parameter.
    def __init__(self, **kwargs):
        self.result = None
        super(ManDist, self).__init__(**kwargs)

    # input_shape automatically collects the input shapes to build the layer
    def build(self, input_shape):
        super(ManDist, self).build(input_shape)

    # This is where the layer's logic lives.
    def call(self, x, **kwargs):
        self.result = K.exp(-K.sum(K.abs(x[0] - x[1]), axis=1, keepdims=True))
        return self.result

    # return output shape
    def compute_output_shape(self, input_shape):
        return K.int_shape(self.result)

In [None]:
# Model
# Since this is a Siamese network, both sides share the same LSTM

shared_lstm = LSTM(n_hidden,return_sequences=False)


left_out = shared_lstm(embedding_left)
right_out = shared_lstm(embedding_right)

# Now let's compute the MaLSTM similarity (the exponent of the negative Manhattan distance) as described above.
# mhd = lambda x: maLSTM_similarity_fun(x[0],x[1])
# merged = Lambda(function=mhd, output_shape = lambda x: (x[0][0],1))([left_out, right_out])
#flat = tf.keras.layers.Flatten(merged)
#preds = tf.keras.layers.Dense(2, activation='sigmoid')(merged)
# malstm_distance = Concatenate(Lambda(function=lambda x: maLSTM_similarity_fun(x[0],x[1]), output_shape = lambda x: (x[0][0],1))([left_out,right_out])

malstm_distance = ManDist()([left_out, right_out])
# Now let's build our model

model = Model([left_input,right_input], malstm_distance)
model.summary()


Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 212)]        0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 212)]        0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 212, 300)     11128200    input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
lstm (LSTM)                     (None, 64)           93440       embedding[0][0]              

In [None]:
# ### model arc  test

# # model 
# #  Since this is a siamese network, both sides share the same LSTM

# shared_lstm = Bidirectional(LSTM(n_hidden, dropout= 0.5, recurrent_dropout= 0.17,return_sequences= True, return_state = True))
# # shared_lstm = Dropout(0.8)(shared_lstm)
# # shared_lstm = LSTM(32, activity_regularizer= l2(0.02), return_sequences= False)(shared_lstm)

# left_out = shared_lstm(embedding_left)
# right_out = shared_lstm(embedding_right)

# # Now let's calculate Manhattan distance as we described.
# malstm_distance = Lambda(function= lambda x: maLSTM_similarity_fun(x[0],x[1]), output_shape = lambda x: (x[0][0],1))([left_out,right_out])

# # Now lets build our model

# model = Model([left_input,right_input], malstm_distance)



In [None]:
# Optimizer
optimizer = Adam(learning_rate= 0.001, clipnorm=gradient_clipping_norm)

In [None]:
model.compile(loss='mean_squared_error', optimizer=optimizer, metrics=['accuracy'])

In [None]:
# train_start_time = time()
# Early stopping to limit overfitting
callbacks = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience= 3)]
history = model.fit([X_train['left'], X_train['right']], Y_train, batch_size=batch_size, epochs=num_epochs,
                            validation_data=([X_validation['left'], X_validation['right']], Y_validation),verbose = 2, callbacks= callbacks)
model.reset_states()
 
# print("Training time finished.\n{} epochs in {}".format(n_epoch, datetime.timedelta(seconds=time()-training_start_time)))

Epoch 1/50
688/688 - 35s - loss: 0.1913 - accuracy: 0.7230 - val_loss: 0.1700 - val_accuracy: 0.7514
Epoch 2/50
688/688 - 15s - loss: 0.1648 - accuracy: 0.7662 - val_loss: 0.1615 - val_accuracy: 0.7712
Epoch 3/50
688/688 - 15s - loss: 0.1550 - accuracy: 0.7868 - val_loss: 0.1576 - val_accuracy: 0.7726
Epoch 4/50
688/688 - 16s - loss: 0.1483 - accuracy: 0.7981 - val_loss: 0.1544 - val_accuracy: 0.7770
Epoch 5/50
688/688 - 16s - loss: 0.1438 - accuracy: 0.8059 - val_loss: 0.1523 - val_accuracy: 0.7830
Epoch 6/50
688/688 - 16s - loss: 0.1398 - accuracy: 0.8133 - val_loss: 0.1508 - val_accuracy: 0.7836
Epoch 7/50
688/688 - 15s - loss: 0.1366 - accuracy: 0.8178 - val_loss: 0.1503 - val_accuracy: 0.7860
Epoch 8/50
688/688 - 15s - loss: 0.1337 - accuracy: 0.8233 - val_loss: 0.1477 - val_accuracy: 0.7892
Epoch 9/50
688/688 - 16s - loss: 0.1310 - accuracy: 0.8265 - val_loss: 0.1480 - val_accuracy: 0.7928
Epoch 10/50
688/688 - 16s - loss: 0.1285 - accuracy: 0.8318 - val_loss: 0.1468 - val_accura

In [None]:
model.save('mlstm_adam_noBidire_earlystopping.h5')

**Backup Time!!!**

In [None]:
import shutil

In [None]:
model_path = '/content/mlstm_adam_noBidire_earlystopping.h5'
model_dest = '/content/drive/MyDrive/Datasets/mlstm_adam_noBidire_earlystopping.h5'
shutil.copy(model_path, model_dest)

### Testing

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Optimizer
optimizer = Adam(learning_rate= 0.001, clipnorm=1.25)

In [None]:
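# ManDist is a custom layer, so Keras needs it passed via custom_objects to rebuild the saved model
model = tf.keras.models.load_model('/content/drive/MyDrive/Datasets/mlstm_adam_noBidire_earlystopping.h5', custom_objects={'ManDist': ManDist})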
model.compile(loss='mean_squared_error', optimizer=optimizer, metrics=['accuracy'])

#### Data processing.

In [None]:
import  nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from saimese_utils import text_to_word_list,make_embeddings

In [None]:
# Util 1
def split_and_zero_padding(df, max_seq_length):
    # Split into a dict of left and right inputs
    X = {'left': df['question1'], 'right': df['question2']}

    # Zero padding
    for side in ['left', 'right']:
        X[side] = pad_sequences(X[side], padding='pre', truncating='post', maxlen=max_seq_length)

    return X

In [None]:
test_data = pd.read_csv('MALSTM_test.csv',names=['question1','question2'])
test_data

Unnamed: 0,question1,question2
0,hey dude open youtube for me,play something on youtube
1,tell me what will be the weather today,can you tell me about the weather?
2,"I want to do mount some partitions, so open disk.",I need to do some disk partition stuff
3,cafe is a better place for dating. So lets go.,Why don't you go to a cafe with your girlfriend


In [None]:
lst = {'question1':'A man with a hard hat is dancing', 'question2':'A man wearing a hard hat is dancing'}
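# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
test_data = pd.concat([test_data, pd.DataFrame([lst])], ignore_index=True)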

In [None]:
stops = set(stopwords.words('english'))
X_test,embeddings = make_embeddings(test_data, word2vec, stops)

In [None]:
test_data

Unnamed: 0,question1,question2
0,"[1, 2, 3, 4, 5, 6]","[7, 8, 9, 4]"
1,"[10, 6, 11, 12, 13, 14, 15, 16]","[17, 18, 10, 6, 19, 14, 15]"
2,"[20, 21, 22, 23, 24, 25, 26, 3, 27]","[20, 28, 22, 24, 27, 29, 30]"
3,"[31, 32, 33, 34, 5, 35, 26, 36, 37]","[38, 22, 39, 18, 37, 31, 40, 41, 42]"
4,"[43, 40, 44, 45, 32, 46]","[43, 47, 44, 45, 32, 46]"


In [None]:
X_test = split_and_zero_padding(test_data, max_seq_length = 212)

In [None]:
# testing
prediction = model.predict([X_test['left'], X_test['right']])

In [None]:
print(prediction)

[[0.05708584]
 [0.01347459]
 [0.57039064]
 [0.30254477]
 [0.42550346]]
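
These values are similarity scores between 0 and 1, not class labels. One simple way to read them, sketched here with an assumed cut-off of 0.5 that was not tuned in this notebook, is to call a pair similar when its score exceeds the threshold.

```python
import numpy as np

# Similarity scores copied from the prediction output above.
prediction = np.array([[0.05708584],
                       [0.01347459],
                       [0.57039064],
                       [0.30254477],
                       [0.42550346]])

threshold = 0.5  # assumed cut-off, not a value tuned in this notebook
labels = (prediction.ravel() > threshold).astype(int)
print(labels)  # [0 0 1 0 0]
```

With this threshold only the third pair (the disk partition commands) is flagged as similar. A different threshold, chosen on validation data, may suit a particular application better.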
