In [1]:
import numpy as np 
import pandas as pd 
import os

import tensorflow as tf
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.callbacks import EarlyStopping,ModelCheckpoint,ReduceLROnPlateau


from keras.layers import Dense, Embedding, LSTM, Input, Lambda
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical 

import keras.backend as K
from keras.optimizers import Adadelta
from keras.optimizers import Adam  # same keras namespace as the layers and callbacks above

import re




**Loading Training Data**

The training data is loaded from the 'msr_paraphrase_train.csv' file using Pandas. The shape and the first rows are then displayed to provide a quick overview of the dataset.


In [2]:
train_data = pd.read_csv('msr_paraphrase_train.csv')
pd.set_option('display.max_colwidth',None)
print(f'shape{train_data.shape}')
train_data.head()

shape(4076, 4)


Unnamed: 0,ID,Sentence1,Sentence2,Class
0,1726,"Amrozi accused his brother, whom he called ""the witness"", of deliberately distorting his evidence.","Referring to him as only ""the witness"", Amrozi accused his brother of deliberately distorting his evidence.",1
1,1727,Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion.,Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.,0
2,1728,"They had published an advertisement on the Internet on June 10, offering the cargo for sale, he added.","On June 10, the ship's owners had published an advertisement on the Internet, offering the explosives for sale.",1
3,1729,"Around 0335 GMT, Tab shares were up 19 cents, or 4.4%, at A$4.56, having earlier set a record high of A$4.57.","Tab shares jumped 20 cents, or 4.6%, to set a record closing high at A$4.57.",0
4,1730,"The stock rose $2.11, or about 11 percent, to close Friday at $21.51 on the New York Stock Exchange.",PG&E Corp. shares jumped $1.63 or 8 percent to $21.03 on the New York Stock Exchange on Friday.,1


**Loading Test Data**

The test data is loaded from the 'msr_paraphrase_test.csv' file using Pandas. The shape and the first rows are then displayed to offer an initial glimpse of the data structure.

In [3]:
test_data = pd.read_csv('./msr_paraphrase_test.csv')
pd.set_option('display.max_colwidth',None)
print(f'shape{test_data.shape}')
test_data.head()

shape(1725, 4)


Unnamed: 0,ID,Sentence1,Sentence2,Class
0,1,"PCCW's chief operating officer, Mike Butcher, and Alex Arena, the chief financial officer, will report directly to Mr So.",Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So.,1
1,2,The world's two largest automakers said their U.S. sales declined more than predicted last month as a late summer sales frenzy caused more of an industry backlash than expected.,Domestic sales at both GM and No. 2 Ford Motor Co. declined more than predicted as a late summer sales frenzy prompted a larger-than-expected industry backlash.,1
2,3,"According to the federal Centers for Disease Control and Prevention (news - web sites), there were 19 reported cases of measles in the United States in 2002.",The Centers for Disease Control and Prevention said there were 19 reported cases of measles in the United States in 2002.,1
3,4,A tropical storm rapidly developed in the Gulf of Mexico Sunday and was expected to hit somewhere along the Texas or Louisiana coasts by Monday night.,A tropical storm rapidly developed in the Gulf of Mexico on Sunday and could have hurricane-force winds when it hits land somewhere along the Louisiana coast Monday night.,0
4,5,The company didn't detail the costs of the replacement and repairs.,But company officials expect the costs of the replacement work to run into the millions of dollars.,0


**Extracting Training Sentences (Column 1)**

The sentences from the `Sentence1` column (positional index 1, after `ID`) of the training data are extracted and stored in the 'train_1' list. The first 100 entries are then printed as a sample.


In [4]:
train_1 = train_data.iloc[:,1]
train_1 = list(train_1)
print(train_1[:100])



**Extracting Training Sentences (Column 2)**

Similarly, sentences from the `Sentence2` column (positional index 2) of the training data are extracted and stored in the 'train_2' list, and the first 100 entries are printed as a sample.


In [5]:
train_2 = train_data.iloc[:,2]
train_2 = list(train_2)
print(train_2[:100])



In [6]:
full_train = train_1 + train_2
print(full_train[:100])



**Text Tokenization Setup**

A Tokenizer is initialized with `num_words=5000`, which limits later text-to-sequence conversion to the most frequent words. It strips the punctuation characters listed in `filters`, converts text to lowercase, and splits on spaces. This tokenizer is used to convert sentences into sequences of integer word indices.


In [7]:
num_words = 5000
tokenizer = Tokenizer(num_words=num_words, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                                   lower=True,split=' ')
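
The normalization configured above can be approximated in plain Python. This is a minimal sketch of the behaviour described (lowercase, replace filtered characters with the split character, split), not the Keras implementation; the helper name `simple_word_sequence` is hypothetical.

```python
# Rough approximation of the tokenizer's text normalization (not the Keras source).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    """Lowercase, turn every filtered character into the split character, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: split for ch in filters})
    # Drop the empty strings produced by consecutive separators.
    return [tok for tok in text.translate(table).split(split) if tok]

words = simple_word_sequence("Yucaipa owned Dominick's before selling the chain.")
print(words)
```

The apostrophe is not part of `filters`, so a possessive such as "Dominick's" stays a single token.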


**Tokenization on Training Data**

The Tokenizer is fitted on the combined sentences from both columns (`full_train`), building a vocabulary ordered by word frequency. The number of unique tokens is printed along with the dictionary that maps each word to its index. Note that `word_index` contains all 13,792 tokens; the `num_words` limit is applied only when text is converted to sequences.


In [8]:
tokenizer.fit_on_texts(full_train)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
print(word_index) # print the mapping between unique word and index.

Found 13792 unique tokens.
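
To illustrate how a frequency-ranked word index interacts with a vocabulary cap, here is a small sketch on toy data. The helpers `build_word_index` and `to_sequence` are hypothetical and only mimic the idea (1-based indices by descending frequency, words at or above the cap dropped); tie-breaking and other details of the real tokenizer may differ.

```python
from collections import Counter

def build_word_index(texts):
    """Assign 1-based indices by descending frequency (0 stays free for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])  # stable: ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only known words whose index is below num_words."""
    ids = (word_index.get(word) for word in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

toy_texts = ["the cat sat", "the dog sat", "the bird flew"]
toy_index = build_word_index(toy_texts)
print(toy_index)
print(to_sequence("the bird sat", toy_index, num_words=3))
```

In the toy run, "bird" is in the index but falls outside the cap, so it is silently dropped from the sequence.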


**Text to Sequence Conversion and Padding for Sentence1**

The sentences from the 'Sentence1' column in the training data are converted into sequences of word indices using the previously fitted tokenizer. Words whose index is not below `num_words` are dropped, which is why the 14-word first sentence yields 13 indices. The sequences are then padded with leading zeros to a fixed length (`maxlen` = 60) so that all inputs have the same shape. The printed output shows the first sentence before and after padding.


In [9]:
X_1 = tokenizer.texts_to_sequences(train_data['Sentence1'].values)
print(X_1[0])
maxlen = 60
X_1 = pad_sequences(X_1, maxlen=maxlen)
print("Padded Sequences: ")
print(X_1)
print(X_1[0])

X_1.shape

[1558, 507, 28, 1693, 1397, 16, 221, 1, 946, 3, 4082, 28, 353]
Padded Sequences: 
[[   0    0    0 ... 4082   28  353]
 [   0    0    0 ...   46   82   81]
 [   0    0    0 ...  910   16  215]
 ...
 [   0    0    0 ...  560 3623    7]
 [   0    0    0 ...  170  101   17]
 [   0    0    0 ...   13 4665  216]]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0 1558  507   28 1693 1397   16  221    1  946
    3 4082   28  353]


(4076, 60)
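
The padding step can be reproduced without Keras. The sketch below assumes the default `pad_sequences` behaviour of padding and truncating at the front; the helper `pad_pre` is hypothetical.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

# Token ids of the first training sentence, copied from the output above.
first = [1558, 507, 28, 1693, 1397, 16, 221, 1, 946, 3, 4082, 28, 353]
padded = pad_pre(first, maxlen=60)
print(len(padded), padded[:3], padded[-3:])
```

Padding at the front keeps the actual words at the end of the sequence, next to the final LSTM step.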

**Text to Sequence Conversion and Padding for Sentence2**

Similar to 'Sentence1', the sentences from the 'Sentence2' column in the data are converted into sequences of numerical values using the pre-fitted tokenizer. The resulting sequences are then padded to a specified maximum length (`maxlen`) to ensure uniform dimensions. The printed output displays the original and padded sequences for the first sentence, offering insight into the preprocessing steps applied to 'Sentence2'.


In [10]:
X_2 = tokenizer.texts_to_sequences(train_data['Sentence2'].values)
print(X_2[0])
maxlen = 60
X_2 = pad_sequences(X_2, maxlen=maxlen)
print("Padded Sequences: ")
print(X_2)
print(X_2[0])

X_2.shape

[2150, 2, 146, 20, 96, 1, 946, 1558, 507, 28, 1693, 3, 4082, 28, 353]
Padded Sequences: 
[[   0    0    0 ... 4082   28  353]
 [   0    0    0 ...   81    5  777]
 [   0    0    0 ... 1960    9  910]
 ...
 [   0    0    0 ...  560    5  298]
 [   0    0    0 ...  695    8  101]
 [   0    0    0 ...   15  245  110]]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0 2150    2  146   20   96    1  946 1558  507   28 1693
    3 4082   28  353]


(4076, 60)

**Training Data Splitting**

This cell sets the fraction of the data to use for training (`training_portion`, 80%) and extracts the labels from the fourth column (`Class`) into the variable 'y'. The split itself is performed in a later cell, once all sequences have been prepared.


In [11]:
training_portion = 0.8
y = list(train_data.iloc[:,3])

**Text Tokenization and Padding (Test Data - Sentence1)**

For the test data, the sentences from 'Sentence1' are tokenized using the previously fitted tokenizer. The resulting sequences are then padded to ensure uniform length, with a maximum length specified by 'maxlen'. This processing is essential to prepare the test data for input into the trained model, maintaining consistency with the training data format.


In [12]:
X_test1 = tokenizer.texts_to_sequences(test_data['Sentence1'].values)
print(X_test1[0])
maxlen = 60
X_test1 = pad_sequences(X_test1, maxlen=maxlen)
print("Padded Sequences: ")
print(X_test1)
print(X_test1[0])

X_test1.shape

[130, 496, 361, 1927, 6, 4477, 1, 130, 376, 361, 26, 162, 3987, 2, 60, 209]
Padded Sequences: 
[[   0    0    0 ...    2   60  209]
 [   0    0    0 ...  464   55  126]
 [   0    0    0 ...  124    5  286]
 ...
 [   0    0    0 ...  282  101  334]
 [   0    0    0 ...  376 4964  614]
 [   0    0    0 ... 1350   10 2210]]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0  130  496  361 1927    6 4477    1  130  376  361   26  162
 3987    2   60  209]


(1725, 60)

**Text Tokenization and Padding (Test Data - Sentence2)**

Similarly, for the test data, the sentences from 'Sentence2' are tokenized using the previously fitted tokenizer. The resulting sequences are then padded to ensure uniform length, with a maximum length specified by 'maxlen'. This preprocessing step ensures that the test data is formatted appropriately for input into the trained model, maintaining consistency with the training data.


In [13]:
X_test2 = tokenizer.texts_to_sequences(test_data['Sentence2'].values)
print(X_test2[0])
maxlen = 60
X_test2 = pad_sequences(X_test2, maxlen=maxlen)
print("Padded Sequences: ")
print(X_test2)
print(X_test2[0])

print(X_test2.shape)

[728, 130, 496, 361, 1927, 6, 157, 130, 376, 361, 4477, 26, 162, 2, 209]
Padded Sequences: 
[[   0    0    0 ...  162    2  209]
 [   0    0    0 ...   55  126  464]
 [   0    0    0 ...  124    5  286]
 ...
 [   0    0    0 ...    1  282  406]
 [   0    0    0 ...   35 4964  614]
 [   0    0    0 ...    6 1350 1110]]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0  728  130  496  361 1927    6  157  130  376  361 4477
   26  162    2  209]
(1725, 60)


**Train-Validation Data Splitting**

The training dataset is split into training and validation sets by position: the first `training_portion` (80%) of the rows, 3,260 sentence pairs, form the training set, and the remaining 816 rows are held out for validation. No shuffling is applied, so the split follows the row order of the file. Evaluating on the held-out rows during training makes it possible to detect overfitting.


In [16]:
training_size = int(len(X_1)*training_portion)

X_train1 = X_1[:training_size,:]
X_train2 = X_2[:training_size,:]
y_train  = y[:training_size]
X_val1   = X_1[training_size:,:]
X_val2   = X_2[training_size:,:]
y_val    = y[training_size:]


In [17]:
print(X_train1.shape)
print(X_train2.shape)
len(y_train)

(3260, 60)
(3260, 60)


3260
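
The split sizes follow directly from the row count and `training_portion`. A minimal sketch of the same positional split on a plain list (the helper `positional_split` is hypothetical):

```python
def positional_split(items, portion):
    """Split a sequence at int(len * portion) without shuffling."""
    cut = int(len(items) * portion)
    return items[:cut], items[cut:]

rows = list(range(4076))  # stand-in for the 4076 training pairs
train_rows, val_rows = positional_split(rows, 0.8)
print(len(train_rows), len(val_rows))
```

Because the cut is positional, the validation set is simply the tail of the file; if the rows were ordered in some meaningful way, shuffling before the split would be worth considering.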

**Model Configuration Parameters**

The following parameters are crucial for configuring the Siamese LSTM model:

- `embedding_dim`: The dimensionality of the word embeddings. Adjusting this parameter can impact the model's ability to capture semantic relationships.

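- `lstm_out`: The number of units in the shared LSTM layer, which is also the size of the sentence vector each branch produces. Larger values increase the model's capacity and its parameter count.

- `gradient_clipping_norm`: The norm threshold for gradient clipping, a technique that stabilizes training by preventing exploding gradients. In this notebook it is referenced only by the commented-out Adadelta optimizer, so it has no effect on the Adam optimizer that is actually used.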

- `batch_size`: The number of samples used in each iteration during training. It affects the model's training speed and memory consumption.

- `n_epoch`: The number of training epochs. An epoch represents one complete pass through the entire training dataset. Adjust this parameter based on training convergence.


In [18]:
embedding_dim = 40 #Change to observe effects
lstm_out = 256
gradient_clipping_norm = 2.40
batch_size = 128
n_epoch = 50


**Callback Configuration**

The code sets up callbacks to monitor the model during training:

- `ReduceLROnPlateau`: Multiplies the learning rate by `factor` (0.15) when the validation loss has not improved for `patience` (5) epochs, without going below `min_lr` (0.001).

- `EarlyStopping`: Monitors the validation loss and stops training once it has not improved for 10 consecutive epochs, which limits overfitting.

- `ModelCheckpoint`: Saves the full model (`save_weights_only=False`) to an `.h5` file each time the validation loss reaches a new best. The epoch number and validation loss are written into the file name, so the best checkpoint can be reloaded later.

Together, these callbacks shorten training and preserve the checkpoint that performs best on the validation set.


In [22]:
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.15,
                              patience=5, min_lr=0.001)

earlystop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10)

modelcheckpoint = ModelCheckpoint("weights.{epoch:02d}-{val_loss:.3f}.h5", monitor='val_loss', verbose=0, save_best_only=True, save_weights_only=False, mode='auto',  save_freq='epoch')

callbacks = [earlystop,modelcheckpoint,reduce_lr]
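
As a sanity check on these settings, the learning-rate update can be worked out by hand. The sketch below assumes the update rule is `max(lr * factor, min_lr)` and uses the Adam learning rate of 0.002 set when the model is compiled; the helper `reduced_lr` is hypothetical.

```python
def reduced_lr(current_lr, factor, min_lr):
    """One plateau reduction: scale the rate, but never go below the floor."""
    return max(current_lr * factor, min_lr)

lr = 0.002  # initial Adam learning rate used in this notebook
lr_after = reduced_lr(lr, factor=0.15, min_lr=0.001)
print(lr_after)
```

With these values the scaled rate (0.0003) is below the floor, so the first reduction lands on `min_lr` and any later reduction cannot lower the rate further.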

**Siamese LSTM Model Overview**

This code defines a Siamese LSTM model for paraphrase detection. It comprises:

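- **Inputs**: Two padded sequences of length `maxlen`, processed by a shared embedding layer and a shared LSTM layer. The embedding layer is randomly initialized and frozen (`trainable=False`), so only the LSTM weights are trained.

- **Outputs**: A similarity score in (0, 1], computed as the exponential of the negative Manhattan distance between the two LSTM outputs.

- **Compilation**: Adam optimizer (the original Adadelta configuration is left commented out), mean squared error loss, and accuracy metric.

- **Summary**: The model architecture is printed for quick reference.

The Siamese LSTM detects paraphrases by learning to score the similarity of a sentence pair.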
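
The similarity measure is easy to check on small vectors. This is a plain-Python sketch of the same formula, exp(-sum(|left - right|)), applied to made-up vectors rather than real LSTM outputs; the helper `exp_neg_manhattan` is hypothetical.

```python
import math

def exp_neg_manhattan(left, right):
    """exp(-L1 distance): 1.0 for identical vectors, approaching 0 as they diverge."""
    return math.exp(-sum(abs(a - b) for a, b in zip(left, right)))

same = exp_neg_manhattan([0.2, -0.5, 0.1], [0.2, -0.5, 0.1])
close = exp_neg_manhattan([0.2, -0.5, 0.1], [0.1, -0.4, 0.1])
far = exp_neg_manhattan([0.9, -0.9, 0.9], [-0.9, 0.9, -0.9])
print(same, close, far)
```

Because the score lies in (0, 1], it can be compared directly with the 0/1 paraphrase labels under a mean squared error loss.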


In [23]:
def exponent_neg_manhattan_distance(left, right):
    ''' Helper function for the similarity estimate of the LSTMs outputs'''
    return K.exp(-K.sum(K.abs(left-right), axis=1, keepdims=True))



left_input = Input(shape=(maxlen,), dtype='int32')
right_input = Input(shape=(maxlen,), dtype='int32')

embedding_layer = Embedding(num_words, embedding_dim, input_length=maxlen, trainable=False)

# Embedded version of the inputs
encoded_left = embedding_layer(left_input)
encoded_right = embedding_layer(right_input)

# Since this is a siamese network, both sides share the same LSTM
shared_lstm = LSTM(lstm_out)

left_output = shared_lstm(encoded_left)
right_output = shared_lstm(encoded_right)

malstm_distance = Lambda(function=lambda x: exponent_neg_manhattan_distance(x[0], x[1]),output_shape=lambda x: (x[0][0], 1))([left_output, right_output])


malstm = Model([left_input, right_input], [malstm_distance])

# Adadelta optimizer, with gradient clipping by norm
# optimizer = Adadelta(clipnorm=gradient_clipping_norm,learning_rate=.2,rho=0.90)

# Replace Adadelta with Adam optimizer
optimizer = Adam(learning_rate=0.002)  # You can adjust the learning rate as needed

malstm.compile(loss='mean_squared_error', optimizer=optimizer, metrics=['accuracy'])

# Use tf.compat.v1.executing_eagerly_outside_functions instead of tf.executing_eagerly_outside_functions
# tf.compat.v1.executing_eagerly_outside_functions

print(malstm.summary())

 

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_3 (InputLayer)        [(None, 60)]                 0         []                            
                                                                                                  
 input_4 (InputLayer)        [(None, 60)]                 0         []                            
                                                                                                  
 embedding_1 (Embedding)     (None, 60, 40)               200000    ['input_3[0][0]',             
                                                                     'input_4[0][0]']             
                                                                                                  
 lstm_1 (LSTM)               (None, 256)                  304128    ['embedding_1[0][0]',   
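
The parameter counts in the summary can be verified by hand. The sketch below uses the standard formulas for an embedding table and for an LSTM layer with four gates, each having input weights, recurrent weights, and a bias; the helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per vocabulary index."""
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

emb_count = embedding_params(5000, 40)  # num_words x embedding_dim
lstm_count = lstm_params(40, 256)       # embedding_dim inputs, lstm_out units
print(emb_count, lstm_count)
```

Both values match the `Param #` column above. Since the embedding layer is frozen, only the LSTM parameters are updated during training.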

**Training the Siamese LSTM Model**

The code trains the Siamese LSTM model using the fit() function. It takes training inputs (X_train1, X_train2), labels (y_train), and other parameters like batch size, epochs, and validation data.

Callbacks, including early stopping, model checkpointing, and learning rate reduction, are employed during training.

The `History` object returned by `fit()` is stored in the malstm_trained variable; its `history` attribute holds the per-epoch loss and metric values.


In [24]:



malstm_trained = malstm.fit([X_train1, X_train2], np.array(y_train), batch_size=batch_size, epochs=n_epoch,
                            validation_data=([X_val1, X_val2], np.array(y_val)), callbacks=callbacks)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 13: early stopping
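
Stopping at epoch 13 is consistent with a patience of 10 and a best validation loss at epoch 3. The sketch below is a simplified early-stopping loop, not the Keras implementation, and the loss values are made up for illustration (only 0.188 appears in the notebook, in the checkpoint file name); the helper `early_stop_epoch` is hypothetical.

```python
def early_stop_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based; stop_epoch is None if never triggered."""
    best, best_epoch, wait = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation losses: best value at epoch 3, no improvement afterwards.
made_up_losses = [0.25, 0.20, 0.188] + [0.19] * 10
stop_epoch, best_epoch = early_stop_epoch(made_up_losses, patience=10)
print(stop_epoch, best_epoch)
```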


**Loading Pre-trained Siamese LSTM Model Weights**

The code loads the weights saved by the `ModelCheckpoint` callback in the file "weights.03-0.188.h5", the checkpoint from epoch 3 with a validation loss of 0.188. After successful loading, it prints "Loaded model from disk".


In [25]:
malstm.load_weights("weights.03-0.188.h5")
print("Loaded model from disk")

Loaded model from disk


**Model Evaluation on Validation Data**

The code evaluates the Siamese LSTM model on the validation data ([X_val1, X_val2], np.array(y_val)) with a batch size of 'batch_size'. `evaluate` returns the mean squared error loss followed by the accuracy, here about 0.188 and 0.710. The 'earlystop' callback is also passed, but early stopping only acts during training, so it has no effect here.


In [26]:
loss = malstm.evaluate([X_val1,X_val2], np.array(y_val), batch_size = batch_size, callbacks=[earlystop])
print(loss)

[0.18768925964832306, 0.7095588445663452]
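
The two numbers can be understood with a small worked example. The sketch below computes mean squared error and an accuracy that thresholds the similarity score at 0.5, which is assumed to be how the `accuracy` metric behaves for a single-output model; the labels and scores are invented, and the helper names are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean of squared differences between labels and similarity scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def thresholded_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of pairs whose thresholded score equals the 0/1 label."""
    hits = sum(int(p > threshold) == t for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

labels = [1, 0, 1, 0]          # made-up paraphrase labels
scores = [0.9, 0.4, 0.3, 0.8]  # made-up similarity scores
toy_mse = mse(labels, scores)
toy_acc = thresholded_accuracy(labels, scores)
print(toy_mse, toy_acc)
```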


**Evaluate Model on Test Data**

To assess the model's performance on the test set, `evaluate` is called with the test inputs and the labels from the `Class` column. The printed list contains the loss followed by the accuracy:


In [27]:
y_test = test_data.iloc[:,3]
# Use the evaluate function with the test data
test_loss = malstm.evaluate([X_test1, X_test2], np.array(y_test), batch_size=batch_size)

# Print or use the test loss for further analysis
print("Test Loss:", test_loss)


Test Loss: [0.19649088382720947, 0.6973913311958313]
