# Siamese-LSTM baseline
- The baseline of Siamese LSTM, from https://www.kaggle.com/amoyyean/lstm-with-glove
- With Glove pretrained model as the initial embedding matrix for embedding layer.
- The paper `Siamese Recurrent Architectures for Learning Sentence Similarity` defines a Siamese LSTM model, named the Manhattan LSTM (MaLSTM) model, that can be applied to the QQP problem. (https://www.researchgate.net/profile/Aditya_Thyagarajan/publication/307558687_Siamese_Recurrent_Architectures_for_Learning_Sentence_Similarity/links/5bf2424ba6fdcc3a8de0e69e/Siamese-Recurrent-Architectures-for-Learning-Sentence-Similarity.pdf)
  - Elior Cohen has run experiments with MaLSTM on this dataset (https://medium.com/mlreview/implementing-malstm-on-kaggles-quora-question-pairs-competition-8b31b0b16a07); the reported validation accuracy is 80.xx%.
  - The `Siamese-LSTM baseline` model here adds some leaky features and changes parts of the network structure, achieving better performance.
- So I plan to use this model as the baseline for the NN solution in our project, and make changes on top of it to see whether we can construct a better model and achieve even better performance.
   
   
   
- **Output:** 
  - `lstm.csv`: 4 epochs, score: 0.193
  - `lstm_1.csv`: 20 epochs, score: 0.18662
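
For reference, the MaLSTM paper scores a sentence pair with the similarity exp(-||h_a - h_b||_1) of the two final LSTM states. The sketch below shows that function in NumPy. The name `manhattan_similarity` and the toy vectors are illustrative only and are not part of this notebook, whose model instead concatenates the two encodings and passes them to dense layers.

```python
import numpy as np

def manhattan_similarity(h_a, h_b):
    """exp(-L1 distance) between two batches of encodings; values lie in (0, 1]."""
    h_a = np.asarray(h_a, dtype=float)
    h_b = np.asarray(h_b, dtype=float)
    return np.exp(-np.sum(np.abs(h_a - h_b), axis=-1))

# Identical encodings give similarity 1; distant encodings approach 0.
identical = manhattan_similarity([[0.1, 0.2]], [[0.1, 0.2]])
different = manhattan_similarity([[0.1, 0.2]], [[0.9, -0.5]])
print(identical, different)
```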

In [1]:
import os
import re
import csv
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from string import punctuation
from collections import defaultdict
# from tqdm import tqdm
from sklearn.preprocessing import StandardScaler
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# Keras package
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, Embedding, Dropout, Activation, LSTM, Lambda
from keras.layers.merge import concatenate
from keras.models import Model
from keras.layers.normalization import BatchNormalization
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers.pooling import GlobalAveragePooling1D
import keras.backend as K


Using TensorFlow backend.


In [2]:
# Hyperparameter definition

# Run the following commands to download and unzip GloVe; keep them commented out if the file is already present.
# !wget http://nlp.stanford.edu/data/glove.840B.300d.zip
# !unzip glove.840B.300d.zip
EMBEDDING_FILE = 'glove.840B.300d.txt'

TRAIN_DATA_FILE = 'Data/train.csv'
TEST_DATA_FILE = 'Data/test.csv'

MAX_SEQUENCE_LENGTH = 60  
MAX_NUM_WORDS = 200000  # There are about 201000 unique words in training dataset, 200000 is enough for tokenization
EMBEDDING_DIM = 300  # word-embedded-vector dimension(300 is for 'glove.840B.300d')
VALIDATION_SPLIT_RATE = 0.1 
N_HIDDEN = np.random.randint(175, 275)  # LSTM hidden units, sampled from [175, 275)
N_DENSE = np.random.randint(100, 150)  # dense units, sampled from [100, 150)
DROPOUT_RATE_LSTM = 0.15 + np.random.rand() * 0.33  # dropout rate, sampled from [0.15, 0.48) to help avoid overfitting
DROUPOUT_RATE_DENSE = 0.15 + np.random.rand() * 0.33  # same range for the dense layers

VERSION = 'Temp/lstm'
print('LSTM Stucture:')
print('Num_Lstm:', N_HIDDEN)
print('Num_Dense:', N_DENSE)
print('Dropout rate in LSTM layer:', DROPOUT_RATE_LSTM) 
print('Dropout rate in Dense layer::', DROUPOUT_RATE_DENSE)

ACTIVE_FUNC = 'relu'
re_weight = True  # whether to re-weight classes to fit the 17.4% share in test set


LSTM Stucture:
Num_Lstm: 198
Num_Dense: 142
Dropout rate in LSTM layer: 0.4117299071146573
Dropout rate in Dense layer:: 0.3726788153726159
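
Because the hyperparameters above are drawn at random, every run trains a different network. One way to make a run repeatable is to draw them from a seeded generator, as in the sketch below. The function name `sample_hyperparameters` and the seed value are assumptions for illustration; the ranges are copied from the cell above.

```python
import numpy as np

def sample_hyperparameters(seed):
    """Draw one hyperparameter set from the same ranges as the cell above."""
    rng = np.random.default_rng(seed)
    return {
        'n_hidden': int(rng.integers(175, 275)),   # upper bound is exclusive
        'n_dense': int(rng.integers(100, 150)),
        'dropout_lstm': 0.15 + rng.random() * 0.33,
        'dropout_dense': 0.15 + rng.random() * 0.33,
    }

params = sample_hyperparameters(seed=42)
print(params)
```

Calling the function twice with the same seed returns the same set, so a logged seed is enough to reproduce a configuration.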


In [3]:
# Create word embedding dictionary from 'glove.840B.300d.txt', {key:value} is {word: glove vector(300,)}
print('Create word embedding dictionary')

embeddings_index = {}  # the output dictionary
with open(EMBEDDING_FILE, encoding='utf-8') as f:
    for line in f:  # tqdm
        values = line.split()
        # The last 300 fields are the vector; everything before them is the token
        word = ''.join(values[:-300])
        coefs = np.asarray(values[-300:], dtype='float32')
        embeddings_index[word] = coefs

print('Found {} word vectors of glove.'.format(len(embeddings_index)))


# Preprocess text in dataset
print('Processing text dataset')

def text_to_wordlist(text, remove_stopwords=False, stem_words=False):
    # Clean the text, with the option to remove stopwords and to stem words.
    
    # Convert words to lower case and split them
    text = text.lower().split()

    # Optionally, remove stop words
    if remove_stopwords:
        stop_words = set(stopwords.words("english"))
        text = [w for w in text if w not in stop_words]
    
    text = " ".join(text)

    # Use re to clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    # text = re.sub(r"\0s", "0", text) # It doesn't make sense to me
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)
    
    # Optionally, shorten words to their stems
    if stem_words:
        text = text.split()
        stemmer = SnowballStemmer('english')
        stemmed_words = [stemmer.stem(word) for word in text]
        text = " ".join(stemmed_words)
    # or nltk package
    # if lemma:
    #    text = text.split()
    #    wn = nltk.WordNetLemmatizer()
    #    lemm_words = [wn.lemmatize(word) for word in text]
    #    text = " ".join(lemm_words)
    
    # Return the cleaned text as a single string
    return text

# Load training data and process with text_to_wordlist (Preprocessing)
train_texts_1 = []  # the preprocessed text of q1
train_texts_2 = []  # the preprocessed text of q2
train_labels = []  # training labels

df_train = pd.read_csv(TRAIN_DATA_FILE, encoding='utf-8')  # the original training data
df_train = df_train.fillna('empty')
train_q1 = df_train.question1.values  # the original text of q1
train_q2 = df_train.question2.values  # the original text of q2
train_labels = df_train.is_duplicate.values  # the original label('is_duplicate')

for text in train_q1:
    train_texts_1.append(text_to_wordlist(text, remove_stopwords=False, stem_words=False))
    
for text in train_q2:
    train_texts_2.append(text_to_wordlist(text, remove_stopwords=False, stem_words=False))

print('{} texts are found in train.csv'.format(len(train_texts_1)))

# Load testing data and process with text_to_wordlist (Preprocessing)
test_texts_1 = []  # the preprocessed text of q1_test
test_texts_2 = []  # the preprocessed text of q2_test
test_ids = []  # id..

df_test = pd.read_csv(TEST_DATA_FILE, encoding='utf-8')  # the original testing data
df_test = df_test.fillna('empty')
test_q1 = df_test.question1.values  # the original text of q1_test
test_q2 = df_test.question2.values  # the original text of q2_test
test_ids = df_test.test_id.values  # id..

for text in test_q1:
    test_texts_1.append(text_to_wordlist(text, remove_stopwords=False, stem_words=False))
    
for text in test_q2:
    test_texts_2.append(text_to_wordlist(text, remove_stopwords=False, stem_words=False))
    
print('{} texts are found in test.csv'.format(len(test_texts_1)))


Create word embedding dictionary
Found 2195892 word vectors of glove.
Processing text dataset
404290 texts are found in train.csv
2345796 texts are found in test.csv
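
A note on the first cleaning regex in `text_to_wordlist`: inside a character class, `+-=` is read as a range from `+` to `=`, so characters such as `:`, `;` and `<` are kept as well. The sketch below compares it with a variant whose hyphen is escaped. That variant is a hypothetical alternative for comparison, not what this notebook ran.

```python
import re

ORIGINAL = r"[^A-Za-z0-9^,!.\/'+-=]"   # pattern used in text_to_wordlist
ESCAPED = r"[^A-Za-z0-9^,!.\/'+\-=]"   # hypothetical variant with a literal hyphen

sample = "a;b<c:d?e"
kept_by_range = re.sub(ORIGINAL, " ", sample)
kept_by_literal = re.sub(ESCAPED, " ", sample)
print(repr(kept_by_range))
print(repr(kept_by_literal))
```

Switching patterns would change the token stream, so the scores reported in this notebook only apply to the original pattern.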


In [4]:
# Tokenize all texts with the Keras Tokenizer:
# 1. Construct a Tokenizer().
# 2. Call fit_on_texts() to learn the vocabulary of the whole corpus (all sentences).
#    The .word_index attribute then maps each distinct word to its integer index.
# 3. Call texts_to_sequences() to convert every sentence into a sequence of word indexes.
# 4. Bring all sequences to the same length with pad_sequences().
# 5. Finally, the Keras Embedding layer turns the indexes into vectors that are fed to the LSTM.

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(train_texts_1 + train_texts_2 + test_texts_1 + test_texts_2)  # build the token dictionary

train_sequences_1 = tokenizer.texts_to_sequences(train_texts_1)  # sequence of q1
train_sequences_2 = tokenizer.texts_to_sequences(train_texts_2)  # sequence of q2
test_sequences_1 = tokenizer.texts_to_sequences(test_texts_1)  # sequence of q1_test
test_sequences_2 = tokenizer.texts_to_sequences(test_texts_2)  # sequence of q2_test

word_index = tokenizer.word_index
print('{} unique tokens are found'.format(len(word_index)))

# Pad all training sequences to MAX_SEQUENCE_LENGTH (60)
train_data_1 = pad_sequences(train_sequences_1, maxlen=MAX_SEQUENCE_LENGTH)  # padded_sequence of q1 as train_data
train_data_2 = pad_sequences(train_sequences_2, maxlen=MAX_SEQUENCE_LENGTH)  # padded_sequence of q2 as train_data
print('Shape of train data tensor:', train_data_1.shape)
print('Shape of train labels tensor:', train_labels.shape)

# Pad all test sequences to MAX_SEQUENCE_LENGTH
test_data_1 = pad_sequences(test_sequences_1, maxlen=MAX_SEQUENCE_LENGTH)  # padded_sequence of q1_test as test_data
test_data_2 = pad_sequences(test_sequences_2, maxlen=MAX_SEQUENCE_LENGTH)  # padded_sequence of q2_test as test_data
print('Shape of test data vtensor:', test_data_2.shape)
print('Shape of test ids tensor:', test_ids.shape)


120499 unique tokens are found
Shape of train data tensor: (404290, 60)
Shape of train labels tensor: (404290,)
Shape of test data vtensor: (2345796, 60)
Shape of test ids tensor: (2345796,)
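
To make the padding step concrete, here is a small NumPy sketch of what I understand to be the default behavior of `pad_sequences` (`padding='pre'`, `truncating='pre'`): short sequences are left-padded with zeros and long ones keep only their last `maxlen` ids. The helper `pad_pre` is hypothetical and only mimics that behavior for illustration.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep the last `maxlen` ids of longer sequences."""
    out = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        tail = seq[-maxlen:]                 # truncate from the front
        if tail:
            out[row, -len(tail):] = tail     # pad at the front
    return out

padded = pad_pre([[5, 2], [7, 1, 4, 9, 3], []], maxlen=4)
print(padded)
```

Index 0 is reserved for padding, which is why the embedding matrix built later has `len(word_index) + 1` rows at most.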


In [5]:
# Get leaky features (NLP features)

import pandas as pd

questions = pd.concat([df_train[['question1', 'question2']], df_test[['question1', 'question2']]], axis=0).reset_index(drop=True)
# Map every question to the set of questions it has been paired with (train and test together)
q_dict = defaultdict(set)
for i in range(questions.shape[0]):
    q_dict[questions.question1[i]].add(questions.question2[i])
    q_dict[questions.question2[i]].add(questions.question1[i])

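# With raw=True each row is a NumPy array; in train.csv, index 3 is question1 and index 4 is question2
def q1_freq_train(row):
    return len(q_dict.get(row[3]))
    # return(len(q_dict[row['question1']]))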

def q2_freq_train(row):
    return len(q_dict.get(row[4]))
    # return(len(q_dict[row['question2']]))

def q1_q2_intersect_train(row):
    return(len(set(q_dict.get(row[3])).intersection(set(q_dict.get(row[4])))))
    # return(len(set(q_dict[row['question1']]).intersection(set(q_dict[row['question2']]))))

# In test.csv, index 1 is question1 and index 2 is question2
def q1_freq_test(row):
    return len(q_dict.get(row[1]))
    # return(len(q_dict[row['question1']]))

def q2_freq_test(row):
    return len(q_dict.get(row[2]))
    # return(len(q_dict[row['question2']]))

def q1_q2_intersect_test(row):
    return(len(set(q_dict.get(row[1])).intersection(set(q_dict.get(row[2])))))
    # return(len(set(q_dict[row['question1']]).intersection(set(q_dict[row['question2']]))))

df_train['q1_q2_intersect'] = df_train.apply(q1_q2_intersect_train, axis=1, raw=True)
df_train['q1_freq'] = df_train.apply(q1_freq_train, axis=1, raw=True)
df_train['q2_freq'] = df_train.apply(q2_freq_train, axis=1, raw=True)

df_test['q1_q2_intersect'] = df_test.apply(q1_q2_intersect_test, axis=1, raw=True)
df_test['q1_freq'] = df_test.apply(q1_freq_test, axis=1, raw=True)
df_test['q2_freq'] = df_test.apply(q2_freq_test, axis=1, raw=True)

leaks = df_train[['q1_q2_intersect', 'q1_freq', 'q2_freq']]  # the leaky feature
test_leaks = df_test[['q1_q2_intersect', 'q1_freq', 'q2_freq']]  # the leaky feature_test


# Standardize the leaky features (the scaler is fit on train and test together)
ss = StandardScaler()
ss.fit(np.vstack((leaks, test_leaks)))
leaks = ss.transform(leaks)  # the leaky feature
test_leaks = ss.transform(test_leaks)  # the leaky feature_test
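
The three leaky features can be checked on a toy set of question pairs. The sketch below rebuilds the same kind of neighbor dictionary and computes `q1_q2_intersect`, `q1_freq` and `q2_freq` for each pair. The sample questions and the helper `leak_features` are made up for illustration.

```python
from collections import defaultdict

pairs = [
    ("how to learn python", "best way to learn python"),
    ("how to learn python", "python for beginners"),
    ("best way to learn python", "python for beginners"),
    ("what is glove", "how does glove work"),
]

# Each question maps to the set of questions it was paired with
neighbours = defaultdict(set)
for q1, q2 in pairs:
    neighbours[q1].add(q2)
    neighbours[q2].add(q1)

def leak_features(q1, q2):
    """Return (q1_q2_intersect, q1_freq, q2_freq) for one pair."""
    n1, n2 = neighbours[q1], neighbours[q2]
    return len(n1 & n2), len(n1), len(n2)

features = [leak_features(q1, q2) for q1, q2 in pairs]
print(features)
```

The first three questions form a triangle, so each of those pairs shares one common neighbor, while the isolated last pair shares none.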


In [6]:
# Train & Validation split
perm = np.random.permutation(len(train_data_1))
idx_train = perm[:int(len(train_data_1)*(1-VALIDATION_SPLIT_RATE))]
idx_val = perm[int(len(train_data_1)*(1-VALIDATION_SPLIT_RATE)):]

data_1_train = np.vstack((train_data_1[idx_train], train_data_2[idx_train]))
data_2_train = np.vstack((train_data_2[idx_train], train_data_1[idx_train]))
leaks_train = np.vstack((leaks[idx_train], leaks[idx_train]))
labels_train = np.concatenate((train_labels[idx_train], train_labels[idx_train]))

data_1_val = np.vstack((train_data_1[idx_val], train_data_2[idx_val]))
data_2_val = np.vstack((train_data_2[idx_val], train_data_1[idx_val]))
leaks_val = np.vstack((leaks[idx_val], leaks[idx_val]))
labels_val = np.concatenate((train_labels[idx_val], train_labels[idx_val]))

weight_val = np.ones(len(labels_val))
if re_weight:
    weight_val *= 0.471544715
    weight_val[labels_val==0] = 1.309033281
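
The two weight constants can be reproduced as ratios of class shares between the test and training sets. The sketch below uses the 17.4% test share mentioned in the `re_weight` comment and assumes a rounded training share of 36.9%. That second value is an assumption that happens to reproduce the constants; it is not stated in this notebook.

```python
P_TEST = 0.174    # positive share in the test set, from the re_weight comment
P_TRAIN = 0.369   # assumed (rounded) positive share in train.csv

# Down-weight duplicates and up-weight non-duplicates to match the test distribution
weight_pos = P_TEST / P_TRAIN
weight_neg = (1 - P_TEST) / (1 - P_TRAIN)
print(round(weight_pos, 9), round(weight_neg, 9))
```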


In [7]:
# Create the embedding matrix, which initializes the weights of the Keras Embedding layer.
print('Preparing embedding matrix')

num_words = min(MAX_NUM_WORDS, len(word_index))+1

embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))  # the weight of Embedding layer
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
print('Null word embeddings: {}'.format(np.sum(np.sum(embedding_matrix, axis=1) == 0)))


# NN Model design
# Structure: (q1-embedding-lstm + q2-embedding-lstm + leaky-dense)-dense-sigmoid-result

# The embedding layer containing the word vectors
emb_layer = Embedding(
    input_dim=num_words,
    output_dim=EMBEDDING_DIM,
    weights=[embedding_matrix],
    input_length=MAX_SEQUENCE_LENGTH,
    trainable=False
)    




# Input layer
seq1 = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
seq2 = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')

# Embedding layer
emb1 = emb_layer(seq1)
emb2 = emb_layer(seq2)

# LSTM layer
lstm_layer = LSTM(N_HIDDEN, dropout=DROPOUT_RATE_LSTM, recurrent_dropout=DROPOUT_RATE_LSTM)
lstm_a = lstm_layer(emb1)
lstm_b = lstm_layer(emb2)

# add features
leaky_input = Input(shape=(leaks.shape[1],))
leaky_dense = Dense(int(N_DENSE/2), activation=ACTIVE_FUNC)(leaky_input)

# merge 
merged = concatenate([lstm_a, lstm_b, leaky_dense])
merged = BatchNormalization()(merged)
merged = Dropout(DROUPOUT_RATE_DENSE)(merged)
merged = Dense(N_DENSE, activation=ACTIVE_FUNC)(merged)
merged = BatchNormalization()(merged)
merged = Dropout(DROUPOUT_RATE_DENSE)(merged)

preds = Dense(1, activation='sigmoid')(merged)


# Add class weights, a "magic" trick to correct for the imbalance of the training labels.
if re_weight:
    class_weight = {0: 1.309033281, 1: 0.471544715}
else:
    class_weight = None
    

# Train the model

print('Starting the model training')

model = Model(inputs=[seq1, seq2, leaky_input], outputs=preds)
model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['acc'])

# Model summary
model.summary()

# Set early stopping (a large patience should be useful)
early_stopping = EarlyStopping(monitor='val_loss', patience=10)
bst_model_path = VERSION + '.h5' 
model_checkpoint = ModelCheckpoint(bst_model_path, save_best_only=True, save_weights_only=True)


hist = model.fit([data_1_train, data_2_train, leaks_train], labels_train, \
        validation_data=([data_1_val, data_2_val, leaks_val], labels_val, weight_val), \
        epochs=4, batch_size=2048, shuffle=True, \
        class_weight=class_weight, callbacks=[early_stopping, model_checkpoint])

model.load_weights(bst_model_path)  # reload the best weights saved to the .h5 file by ModelCheckpoint
bst_val_score = min(hist.history['val_loss'])



Preparing embedding matrix
Null word embeddings: 
Starting the model training
Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 60)           0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 60)           0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 60, 300)      36150000    input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
input_3 (Input

In [12]:
model.summary()

Model: "model_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_7 (InputLayer)            (None, 60)           0                                            
__________________________________________________________________________________________________
input_8 (InputLayer)            (None, 60)           0                                            
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 60, 300)      36150000    input_7[0][0]                    
                                                                 input_8[0][0]                    
__________________________________________________________________________________________________
input_9 (InputLayer)            (None, 3)            0                                      

In [8]:
# Make the submission

print('Making the submission')

preds = model.predict([test_data_1, test_data_2, test_leaks], batch_size=8192, verbose=1)
preds += model.predict([test_data_2, test_data_1, test_leaks], batch_size=8192, verbose=1)
preds /= 2

submission = pd.DataFrame({'test_id':test_ids, 'is_duplicate':preds.ravel()})
submission.to_csv('Models/lstm_1.csv', index=False)

Making the submission
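
The submission cell averages the predictions for (q1, q2) and (q2, q1), which makes the final score independent of question order even if the network itself is not perfectly symmetric. The sketch below shows the idea with a toy scorer. `toy_predict` is a hypothetical stand-in for `model.predict`, not the trained model.

```python
import numpy as np

def symmetric_predict(predict_fn, q1, q2, leaks):
    """Average the predictions over both question orders."""
    return (predict_fn(q1, q2, leaks) + predict_fn(q2, q1, leaks)) / 2

# Hypothetical order-sensitive scorer standing in for model.predict
def toy_predict(a, b, leaks):
    logits = 2 * a.sum(axis=1) - b.sum(axis=1) + leaks.sum(axis=1)
    return 1 / (1 + np.exp(-logits))

q1 = np.array([[1.0, 0.0], [0.5, 0.5]])
q2 = np.array([[0.0, 0.2], [1.0, 1.0]])
leaks = np.zeros((2, 3))

forward = symmetric_predict(toy_predict, q1, q2, leaks)
backward = symmetric_predict(toy_predict, q2, q1, leaks)
print(forward, backward)
```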


# Comparison Experiments

We also ran many **individual** comparison experiments on model hyperparameters and network structures. Here is a brief introduction to what we did:
- Hyperparameters:
  - Here the baseline is `lstm.ipynb`
  - We varied hyperparameters such as the number of hidden neurons in the LSTM layer and the number of neurons in the Dense layer to find the 'best' parameter values: (250, 120).
    - We cannot convincingly claim these are 'the best'; we only report our experiment results here.
  - We also tried the following changes, all under the hyperparameter set (250, 120, 0.33, 0.33):
    - preprocess the raw data with stopword removal and stemming:
      - scored 0.19642 while 0.18839 is the baseline score
    - don't use the leaky features (just 3 features):
      - scored 0.31223 while 0.18839 is the baseline score
      - the result shows that the leaky features may have a big impact on the final score, so we tried adding features from previous work (feature_engineering_train.ipynb) and changed the neural network structure accordingly to build a new model. It actually works even better (details in the next chapter).
    - don't use class re-weighting:
      - scored 0.31730 while 0.18839 is the baseline score
      - the re-weighting method is a "magic" trick shared by Kagglers, and it is inspiring!
- Network structures
  - We also changed the neural network structure to see whether the final score improves
    - Here the baseline is `lstm_featured.ipynb`, which scores 0.16515
    - Add subtract and multiply operations on the **featured_lstm** features
      - Scored 0.16674, no obvious improvement
    - Change N_DENSE to N_DENSE/2
      - Scored 0.16681 while 0.16515 is the baseline
    - Add Dense layer after merging features
      - Scored 0.1840 while 0.16515 is the baseline
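
The scores above are lower-is-better. Assuming they are binary log loss (the same quantity as the `binary_crossentropy` loss the model is compiled with), the sketch below illustrates why matching the test-set class share matters so much. On toy labels with 17.4% positives, a constant prediction at an assumed training share of 36.9% is penalized more than one at 17.4%. The helper `log_loss` and the toy labels are illustrative only.

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy, with predictions clipped away from 0 and 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

# Toy labels with a 17.4% positive share
labels = np.array([1] * 174 + [0] * 826)

loss_train_prior = log_loss(labels, np.full(1000, 0.369))  # assumed training share
loss_test_prior = log_loss(labels, np.full(1000, 0.174))   # matches the label share
print(loss_train_prior, loss_test_prior)
```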


# New Analysis/Models

Apart from the baseline model and the comparison experiments, we also implemented new analyses/models based on the baseline model, and all of them improved the score on the test data. Here are the details:
- BiLSTM model, see `bilstm.ipynb`
- Siamese-LSTM with Features (tm+nlp), see `lstm_featured.ipynb`
- Combined model: combine all the improvements to a final model, see `lstm_final.ipynb`