# Sameer Gupta


#### Imports

In [3]:
import numpy as np
import pandas as pd
import nltk
import string

from sklearn.model_selection import train_test_split
from gensim.models import KeyedVectors
from nltk.corpus import stopwords

from keras.models import Model
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers.merge import concatenate
from keras.layers.normalization import BatchNormalization
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

#### Constants

In [4]:
# Reproducibility
np.random.seed(1234)

MAX_SEQUENCE_LENGTH = 30
MAX_NB_WORDS = 200000
EMBEDDING_DIM = 300
VALIDATION_SPLIT = 0.15

## Section 1: Load Data

There is no code for you to fill out in this section but please make sure you understand what the code is doing so you aren't confused in later parts. If you want to run this code on Colab, you can pass in the links to the CSV rather than the file path.

You can find the training and testing data here: https://www.kaggle.com/c/quora-question-pairs/data

#### File Paths

In [12]:
TRAIN_CSV = 'train.csv'
TEST_CSV = 'test.csv'
EMBEDDING_FILE = 'GoogleNews-vectors-negative300.bin.gz'

In [6]:
train_df = pd.read_csv(TRAIN_CSV)
test_df = pd.read_csv(TEST_CSV)

  interactivity=interactivity, compiler=compiler, result=result)


In [7]:
train_df.head(10)

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
5,5,11,12,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1
6,6,13,14,Should I buy tiago?,What keeps childern active and far from phone ...,0
7,7,15,16,How can I be a good geologist?,What should I do to be a great geologist?,1
8,8,17,18,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0
9,9,19,20,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0


In [8]:
test_df.head()

Unnamed: 0,test_id,question1,question2
0,0,How does the Surface Pro himself 4 compare wit...,Why did Microsoft choose core m3 and not core ...
1,1,Should I have a hair transplant at age 24? How...,How much cost does hair transplant require?
2,2,What but is the best way to send money from Ch...,What you send money to China?
3,3,Which food not emulsifiers?,What foods fibre?
4,4,"How ""aberystwyth"" start reading?",How their can I start reading?


## Section 2: Data Processing

In this section we will proccess our data into a format suitable for inputting into a LSTM or more specifically, a Siamese LSTM. The general structure we will follow is:

1. Preprocess the text (clean it up)
2. Tokenize every word in the text (replace each word with an index number)
3. Pad every sequence of indices with zeros to make them all the same length
4. Build an embedding matrix that we can use to look up every indices word embedding

In this section, you will be responsible for both preprocessing the text in the function `clean_text` and also building the embedding matrix. Fillers for where you should write your code are marked with #YOUR CODE HERE.

Rather than training our own word embeddings on the dataset, it is generally better to use a pretrained one since it is much more accurate. In our example, we will be using a Word2Vec model trained on Google News but you are free to use any you would like. Spacy's Glove model is a good choice.

You can download the pretrained model here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

#### Load Pretrained Word2Vec Model

We will load the model into a Keyed Vectors file. Using this, we can get the embedding of any word by calling `.word_vec(word)` and we can get all the words in the model's vocabulary through `.vocab`. It is important to note that if the word does not exist in the model's vocabulary, you can NOT get it's embedding and so standard practice is to ignore it or initialize it to a random or zero vector.

In [13]:
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)

In [14]:
print(len(word2vec.word_vec("key")))

300


### Preprocessing (TODO)

Please preprocess the text. This is a standard practice to make sure that the text isn't noisy. Some examples of what you can do here are:

* Removing stop words
* Removing punctuation
* Getting rid of stuff like "what's" and making it "what is'
* Stemming words so they are all the same tense (e.g. ran -> run)

In [24]:
def clean_text(text):
    text = str(text)
    
    # YOUR CODE HERE
    stop_words = set(stopwords.words('english'))
    #print(text)
    text = ' '.join([word for word in text.split() if word not in stop_words])
    #print(text)
    text = text.translate(string.punctuation)
    #print(text)
    #text = ''.join([WordNetLemmatizer().lemmatize(word,'v') for word in text.split()])
    
    # Return type should be str
    return text

Applying the preprocessing `clean_text` function to every element in the training and testing data.

In [25]:
X_train_1 = [clean_text(x) for x in train_df['question1']]
X_train_2 = [clean_text(x) for x in train_df['question2']]
labels = train_df['is_duplicate']
print('Loaded Training Data')

X_test_1 = [clean_text(x) for x in test_df['question1']]
X_test_2 = [clean_text(x) for x in test_df['question2']]
print('Loaded Testing Data')

Loaded Training Data
Loaded Testing Data


In [26]:
print(X_train_1[16])
print(train_df['question1'][16])

What manipulation mean?
What does manipulation mean?


### Tokenizer

To avoid manually having to assign indices and filtering out unfrequent words, we can use a Tokenizer to do this for us. It essentially creates a map of every unique word and an assigned index to it. We specify a parameter called `num_words` which says to only care about the top 20000 most frequent words. 

In [27]:
MAX_NB_WORDS = 20000
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(X_train_1 + X_train_2 + X_test_1 + X_test_2)
print('Finished Building Tokenizer')

Finished Building Tokenizer


Applying the tokenizer to the training and testing data.

In [28]:
train_sequences_1 = tokenizer.texts_to_sequences(X_train_1)
train_sequences_2 = tokenizer.texts_to_sequences(X_train_2)
print('Finished Tokenizing Training')

test_sequences_1 = tokenizer.texts_to_sequences(X_test_1)
test_sequences_2 = tokenizer.texts_to_sequences(X_test_2)
print('Finished Tokenizing Testing')

Finished Tokenizing Training
Finished Tokenizing Testing


Number of unique words in tokenizer. Has to be <= 20,000.

In [29]:
word_index = tokenizer.word_index
print('Found %s unique tokens' % len(word_index))

Found 137031 unique tokens


Pad sequences all to the same length of 30 words.

In [30]:
train_data_1 = pad_sequences(train_sequences_1, maxlen=MAX_SEQUENCE_LENGTH)
train_data_2 = pad_sequences(train_sequences_2, maxlen=MAX_SEQUENCE_LENGTH)
labels = np.array(labels)
print('Shape of data tensor:', train_data_1.shape)
print('Shape of label tensor:', labels.shape)
print('Finished Padding Training')

test_data_1 = pad_sequences(test_sequences_1, maxlen=MAX_SEQUENCE_LENGTH)
test_data_2 = pad_sequences(test_sequences_2, maxlen=MAX_SEQUENCE_LENGTH)
print('Finished Padding Testing')

Shape of data tensor: (404290, 30)
Shape of label tensor: (404290,)
Finished Padding Training
Finished Padding Testing


### Embedding Matrix (TODO)

The embedding matrix is a `n x m` matrix where `n` is the number of words and `m` is the dimension of the embedding. In our case, `m=300` and `n=20000`. We take the min between the number of unique words in our tokenizer and max words in case there are less unique words than the max we specified. 

Row `i` in the matrix should contain the embedding of the word with index `i` in the tokenizer. An easy way to create this would be to iterate over `word_index.items()` which gives you the word and it's index. Keep in mind that you can't generate an embedding for a word not in your word2vec model vocabulary.

In [31]:
nb_words = min(MAX_NB_WORDS, len(word_index))

embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))

print(len(embedding_matrix[0]))

count = 0;

for key, value in word_index.items():
    while(count < nb_words):
        embedding_matrix[count] = word2vec.word_vec(key)
        count += 1
    #print(key)
    #print(value)   
# YOUR CODE HERE

300


In [32]:
print(embedding_matrix[0])

[ 0.13964844 -0.00616455  0.21484375  0.07275391 -0.16113281  0.07568359
  0.16796875 -0.20117188  0.12597656  0.00915527  0.05249023 -0.15136719
 -0.02758789  0.04199219 -0.234375    0.13867188 -0.02600098  0.07910156
  0.02746582 -0.13085938 -0.02478027  0.10009766 -0.07910156 -0.07714844
  0.03759766  0.16894531  0.05371094 -0.05200195  0.14453125 -0.04370117
 -0.12597656  0.06884766 -0.10595703 -0.14550781 -0.00331116  0.01367188
  0.13964844  0.01660156  0.03417969  0.16113281 -0.01080322  0.06689453
  0.06835938 -0.15136719 -0.16894531  0.03295898 -0.06884766  0.06787109
 -0.07373047  0.08300781  0.05761719  0.14550781 -0.11865234 -0.13671875
  0.12402344  0.04296875 -0.11962891 -0.08154297  0.06494141 -0.05639648
 -0.04394531  0.1484375  -0.07714844  0.04614258 -0.02624512 -0.06591797
  0.04980469  0.08886719 -0.01647949 -0.02294922  0.10546875  0.04199219
  0.11035156 -0.08251953 -0.13574219 -0.07324219  0.1015625   0.05371094
 -0.07275391  0.08496094 -0.04443359 -0.078125    0

### Formatting Data

Here we just format the data into each half for the input (left and right). There is no code to write here but understand what it is doing.

In [33]:
# Random shuffle
perm = np.random.permutation(len(train_data_1))
idx_train = perm[:int(len(train_data_1)*(1-VALIDATION_SPLIT))]
idx_val = perm[int(len(train_data_1)*(1-VALIDATION_SPLIT)):]

data_1_train = np.vstack((train_data_1[idx_train], train_data_2[idx_train]))
data_2_train = np.vstack((train_data_2[idx_train], train_data_1[idx_train]))
labels_train = np.concatenate((labels[idx_train], labels[idx_train]))
print('Finished Creating Training Data')

data_1_val = np.vstack((train_data_1[idx_val], train_data_2[idx_val]))
data_2_val = np.vstack((train_data_2[idx_val], train_data_1[idx_val]))
labels_val = np.concatenate((labels[idx_val], labels[idx_val]))
print('Finished Creating Validation Data')

Finished Creating Training Data
Finished Creating Validation Data


## Section 3: Building the Model

In this section you will write code to build the actual Siamese network. It should take in two arguments (question1 and question2) and then output a single number representing the probability that the two questions are duplicates.

The model should take in each input sentence, replace it with it's embeddings, then run the new embedding vector through a LSTM layer. The output of each LSTM layer should be concatenated together and then a standard Dense model can be used.

Make sure to note that you should only use one LSTM layer that is shared by both the left and the right half. 

Make sure to title your output layers as `predictions`.

### Build Model (TODO)

In [75]:
from keras.layers import Lambda, TimeDistributed, merge, Bidirectional
from keras import backend as K
from keras.layers import Concatenate, dot, Flatten, Reshape, add

In [77]:
SENT_EMBEDDING_DIM = 128
DROPOUT = 0.2
question1 = Input(shape=(MAX_SEQUENCE_LENGTH,))
question2 = Input(shape=(MAX_SEQUENCE_LENGTH,))

q1 = Embedding(nb_words, 
                 EMBEDDING_DIM, 
                 weights=[embedding_matrix], 
                 input_length=MAX_SEQUENCE_LENGTH, 
                 trainable=False)(question1)
q1 = Bidirectional(LSTM(SENT_EMBEDDING_DIM, return_sequences=True), merge_mode="sum")(q1)

q2 = Embedding(nb_words, 
                 EMBEDDING_DIM, 
                 weights=[embedding_matrix], 
                 input_length=MAX_SEQUENCE_LENGTH, 
                 trainable=False)(question2)
q2 = Bidirectional(LSTM(SENT_EMBEDDING_DIM, return_sequences=True), merge_mode="sum")(q2)

attention = dot([q1,q2], [1,1])
attention = Flatten()(attention)
attention = Dense((MAX_SEQUENCE_LENGTH*SENT_EMBEDDING_DIM))(attention)
attention = Reshape((MAX_SEQUENCE_LENGTH, SENT_EMBEDDING_DIM))(attention)

merged = add([q1,attention])
merged = Flatten()(merged)
merged = Dense(200, activation='relu')(merged)
merged = Dropout(DROPOUT)(merged)
merged = BatchNormalization()(merged)
merged = Dense(200, activation='relu')(merged)
merged = Dropout(DROPOUT)(merged)
merged = BatchNormalization()(merged)
merged = Dense(200, activation='relu')(merged)
merged = Dropout(DROPOUT)(merged)
merged = BatchNormalization()(merged)
merged = Dense(200, activation='relu')(merged)
merged = Dropout(DROPOUT)(merged)
merged = BatchNormalization()(merged)

is_duplicate = Dense(1, activation='sigmoid')(merged)

model = Model(inputs=[question1,question2], outputs=is_duplicate)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [62]:
# YOUR CODE HERE

sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,))
sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,))

sharedLSTM = LSTM(64)

embeddingLayer = Embedding(nb_words, 
                 EMBEDDING_DIM, 
                 weights=[embedding_matrix], 
                 input_length=MAX_SEQUENCE_LENGTH, 
                 trainable=False)


question1 = embeddingLayer(sequence_1_input)
question1 = sharedLSTM(question1)
#question1 = BatchNormalization()(question1)

#q1 = TimeDistributed(Dense(EMBEDDING_DIM, activation='relu'))(q1)
#q1 = Lambda(lambda x: K.max(x, axis=1), output_shape=(EMBEDDING_DIM, ))(q1)

question2 = embeddingLayer(sequence_2_input)
question2 = sharedLSTM(question2)
#question2 = BatchNormalization()(question2)

def exponent_neg_manhattan_distance(left, right):
    ''' Helper function for the similarity estimate of the LSTMs outputs'''
    return K.exp(-K.sum(K.abs(left-right), axis=1, keepdims=True))

malstm_distance = Concatenate(axis=-1)([question1, question2])

# Pack it all up into a model
malstm = Model(inputs = [question1, question2], outputs = [malstm_distance])

# Adadelta optimizer, with gradient clipping by norm
optimizer = Adadelta(clipnorm=gradient_clipping_norm)

malstm.compile(loss='mean_squared_error', optimizer=optimizer, metrics=['accuracy'])


#K.exp(-K.sum(K.abs(x[0]-x[1]), axis=1, keepdims=True))
#(x[0][0], 1)

#lambda x:  K.abs(x[0]-x[1]), output_shape=lambda x: x[0]

#malstm_distance = merge(mode=lambda x: exponent_neg_manhattan_distance(x[0], x[1]), output_shape=lambda x: (x[0][0], 1))([question1, question2])

# Pack it all up into a model
#malstm = Model([left_input, right_input], [malstm_distance])

# Adadelta optimizer, with gradient clipping by norm
#optimizer = Adadelta(clipnorm=gradient_clipping_norm)

#merged = Concatenate(axis=-1)([question1,question2])
#malstm = concatenate([question1, question2])
#predictions = Dense(1, activation='sigmoid')(merged)

#merged = Dense(32, activation='relu')(merged)
#merged = Dropout(0.1)(merged)
#merged = BatchNormalization()(merged)
#predictions.compile(loss='mean_squared_error', optimizer=optimizer, metrics=['accuracy'])


#merged = merge([question1,question2], mode = lambda x:K.exp(-K.sum(K.abs(x[0]-x[1]), axis=1, keepdims=True)), output_shape=lambda x:(x[0][0], 1))

#merged = Dense(32, activation='relu')(merged)
#merged = Dropout(0.1)(merged)
#merged = BatchNormalization()(merged)

#predictions = Dense(1, activation='sigmoid')(merged)

Note that input tensors are instantiated via `tensor = keras.layers.Input(shape)`.
The tensor that caused the issue was: lstm_17/TensorArrayReadV3:0
  str(x.name))
Note that input tensors are instantiated via `tensor = keras.layers.Input(shape)`.
The tensor that caused the issue was: lstm_17_1/TensorArrayReadV3:0
  str(x.name))


AssertionError: 

### Compiling Model

In [50]:
model = Model(inputs=[sequence_1_input, sequence_2_input], outputs=predictions)
model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy'])

In [78]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_55 (InputLayer)           (None, 30)           0                                            
__________________________________________________________________________________________________
input_56 (InputLayer)           (None, 30)           0                                            
__________________________________________________________________________________________________
embedding_30 (Embedding)        (None, 30, 300)      6000000     input_55[0][0]                   
__________________________________________________________________________________________________
embedding_31 (Embedding)        (None, 30, 300)      6000000     input_56[0][0]                   
__________________________________________________________________________________________________
bidirectio

## Section 4: Training the Model

In this section we will simply train the model. We use the Early Stopping argument to end training if the loss or accuracy don't improve within 3 epochs.

Since the training time is incredibly long (30 minutes or so on a CPU), only train it for one epoch if you don't have time. For better results, train it to around 50 epochs.

In [79]:
early_stop = EarlyStopping(monitor='val_loss', patience=3)

hist = model.fit([data_1_train, data_2_train], labels_train, \
        validation_data=([data_1_val, data_2_val], labels_val), \
        epochs=10, batch_size=64, shuffle=True, \
        callbacks=[early_stop])

Train on 687292 samples, validate on 121288 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
104000/687292 [===>..........................] - ETA: 35:28 - loss: 0.6579 - acc: 0.6322

KeyboardInterrupt: 

In [None]:
hist.history