#  ✨ **Quora Question Pairs Prediction - NLP Final Project** ✨ 


---
**Marah Habashi** - 211668751

**Celine Karam** - 314658428


---

# **Importing Required Libraries:**

In [66]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import gensim

from sklearn.metrics import accuracy_score,confusion_matrix,precision_score, recall_score,classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense,Bidirectional
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings("ignore")

#  **Importing The Data:**

In [67]:
df = pd.read_csv('/kaggle/input/quora-question-pairs/train.csv.zip')

In [68]:
print(f'\033[1m_______________________________ Shape of the data: {df.shape} __________________________________\033[0m')
print("_____________________________________________data________________________________________________")
df.head()

_______________________________ Shape of the data: (404290, 6) __________________________________
_____________________________________________data________________________________________________


Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


# **Data Insights:**

In [69]:
new_df = df.sample(30000,random_state=2)
new_df['is_duplicate'].value_counts()

0    19013
1    10987
Name: is_duplicate, dtype: int64

In [70]:
new_df[['question1','question2','is_duplicate']].iloc[4]

question1                     Consequences of Bhopal gas tragedy?
question2       What was the reason behind the Bhopal gas trag...
is_duplicate                                                    0
Name: 151235, dtype: object

# **Preprocessing Step:**

**The preprocess(q) function performs several text preprocessing steps on the input text q. Let's go through each step:**
* Lowercasing and Stripping
* Special Character Replacement
* Removing '[math]' Pattern
* Number Representation
* Decontracting Words
* HTML Tag Removal
* Punctuation Removal

**Finally, the preprocessed text q is returned by the function.**

**Overall, the preprocess() function aims to clean and normalize the input text by removing special characters, standardizing numbers, expanding contractions, removing HTML tags, and eliminating punctuation. These preprocessing steps help in preparing the text data for further analysis or natural language processing tasks.**

In [71]:
def preprocess(q):
    
    q = str(q).lower().strip()
    
    # Replace certain special characters with their string equivalents
    q = q.replace('%', ' percent')
    q = q.replace('$', ' dollar ')
    q = q.replace('₹', ' rupee ')
    q = q.replace('€', ' euro ')
    q = q.replace('@', ' at ')
    
    # The pattern '[math]' appears around 900 times in the whole dataset.
    q = q.replace('[math]', '')
    
    # Replacing some numbers with string equivalents (not perfect, can be done better to account for more cases)
    q = q.replace(',000,000,000 ', 'b ')
    q = q.replace(',000,000 ', 'm ')
    q = q.replace(',000 ', 'k ')
    q = re.sub(r'([0-9]+)000000000', r'\1b', q)
    q = re.sub(r'([0-9]+)000000', r'\1m', q)
    q = re.sub(r'([0-9]+)000', r'\1k', q)
    
    # Decontracting words
    # https://en.wikipedia.org/wiki/Wikipedia%3aList_of_English_contractions
    # https://stackoverflow.com/a/19794953
    contractions = { 
    "ain't": "am not",
    "aren't": "are not",
    "can't": "can not",
    "can't've": "can not have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he'll've": "he will have",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how is",
    "i'd": "i would",
    "i'd've": "i would have",
    "i'll": "i will",
    "i'll've": "i will have",
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
    "it'd": "it would",
    "it'd've": "it would have",
    "it'll": "it will",
    "it'll've": "it will have",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she would",
    "she'd've": "she would have",
    "she'll": "she will",
    "she'll've": "she will have",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so as",
    "that'd": "that would",
    "that'd've": "that would have",
    "that's": "that is",
    "there'd": "there would",
    "there'd've": "there would have",
    "there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have",
    "they'll": "they will",
    "they'll've": "they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'd've": "we would have",
    "we'll": "we will",
    "we'll've": "we will have",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what'll've": "what will have",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who'll've": "who will have",
    "who's": "who is",
    "who've": "who have",
    "why's": "why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "y'all": "you all",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are",
    "y'all've": "you all have",
    "you'd": "you would",
    "you'd've": "you would have",
    "you'll": "you will",
    "you'll've": "you will have",
    "you're": "you are",
    "you've": "you have"
    }

    q_decontracted = []

    for word in q.split():
        if word in contractions:
            word = contractions[word]

        q_decontracted.append(word)

    q = ' '.join(q_decontracted)
    q = q.replace("'ve", " have")
    q = q.replace("n't", " not")
    q = q.replace("'re", " are")
    q = q.replace("'ll", " will")
    
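    # Removing HTML tags (an explicit parser avoids BeautifulSoup's "no parser specified" warning)
    q = BeautifulSoup(q, "html.parser")
    q = q.get_text()
    
    # Remove punctuation: replace every non-word character with a space
    pattern = re.compile(r'\W')
    q = re.sub(pattern, ' ', q).strip()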

    
    return q
    

In [72]:
preprocess("I've already! wasn't <b>done</b>?")

'i have already  was not done'
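To make the number-representation step easier to follow, here is a minimal sketch that isolates the three regular expressions used in preprocess(). The helper name abbreviate_numbers is hypothetical and is not part of the notebook; it only illustrates how trailing zeros become the suffixes b, m and k.

```python
import re

def abbreviate_numbers(text):
    """Hypothetical helper: applies only the regex-based number rules from preprocess()."""
    # Order matters: the billion rule must run before the million and thousand rules.
    text = re.sub(r'([0-9]+)000000000', r'\1b', text)
    text = re.sub(r'([0-9]+)000000', r'\1m', text)
    text = re.sub(r'([0-9]+)000', r'\1k', text)
    return text

print(abbreviate_numbers("a salary of 50000 or 2000000"))
```

As the comment in preprocess() already notes, these rules are not perfect: they only fire when the digits end in a run of zeros.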

In [73]:
new_df['question1'] = new_df['question1'].apply(preprocess)
new_df['question2'] = new_df['question2'].apply(preprocess)

In [74]:
new_df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
398782,398782,496695,532029,what is the best marketing automation tool for...,what is the best marketing automation tool for...,1
115086,115086,187729,187730,i am poor but i want to invest what should i do,i am quite poor and i want to be very rich wh...,0
327711,327711,454161,454162,i am from india and live abroad i met a guy f...,t i e t to thapar university to thapar univers...,0
367788,367788,498109,491396,why do so many people in the u s hate the sou...,my boyfriend doesnt feel guilty when he hurts ...,0
151235,151235,237843,50930,consequences of bhopal gas tragedy,what was the reason behind the bhopal gas tragedy,0


# **Calculating Some Useful Features:**

The following lines add two new columns to new_df, q1_len and q2_len, containing the character lengths of the question1 and question2 values, respectively.

In [75]:
new_df['q1_len'] = new_df['question1'].str.len() 
new_df['q2_len'] = new_df['question2'].str.len()

These lines add two more columns to new_df, q1_num_words and q2_num_words, containing the number of words in question1 and question2, respectively. The count is obtained by splitting each question on single spaces and counting the resulting pieces. Because preprocessing can leave consecutive spaces, split(" ") may also count empty strings.

In [76]:
new_df['q1_num_words'] = new_df['question1'].apply(lambda row: len(row.split(" ")))
new_df['q2_num_words'] = new_df['question2'].apply(lambda row: len(row.split(" ")))


The function common_words takes a row as input. It splits the question1 and question2 values on spaces, lowercases each word and strips surrounding whitespace, and builds the sets w1 and w2 from the results. It returns the size of the intersection (&) of w1 and w2, i.e. the number of distinct words the two questions share. Applying it row by row gives the common-word count for every question pair.

In [77]:
def common_words(row):
    w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
    w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
    return len(w1 & w2)

In [78]:
new_df['word_common'] = new_df.apply(common_words, axis=1)


The function total_words takes a row as input and builds the same two sets of unique, lowercased, stripped words for 'question1' and 'question2'. It returns the sum of the two set sizes, so a word that appears in both questions is counted twice.

In [79]:
def total_words(row):
    w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
    w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
    return (len(w1) + len(w2))

In [80]:
new_df['word_total'] = new_df.apply(total_words, axis=1)


In [81]:
new_df['word_share'] = round(new_df['word_common']/new_df['word_total'],2)
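As a small worked example of the three overlap features, the sketch below computes word_common, word_total and word_share for a single toy pair. The function name word_overlap_features is hypothetical, and it splits on any whitespace with split() rather than on single spaces, which is an assumption that differs slightly from the notebook cells.

```python
def word_overlap_features(q1, q2):
    """Return (word_common, word_total, word_share) for one question pair."""
    w1 = set(q1.lower().split())
    w2 = set(q2.lower().split())
    common = len(w1 & w2)        # distinct words present in both questions
    total = len(w1) + len(w2)    # shared words are counted once per question
    share = round(common / total, 2) if total else 0.0
    return common, total, share

print(word_overlap_features("how do i learn python", "how can i learn python fast"))
```

Because shared words are counted in both set sizes, word_share cannot exceed 0.5, which it reaches when the two questions contain exactly the same words.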


# **Preprocessing Functions for Removing Stop Words, Replacing Contractions and Tokenization:**

**The function fetch_token_features takes a row as input and extracts token-based features from its 'question1' and 'question2' columns. Here's a summary of what the code does:**

It imports the stopwords from the NLTK (Natural Language Toolkit) corpus for the English language.

It initializes a list token_features with eight elements, all initially set to 0.0.

The 'question1' and 'question2' values from the row are assigned to variables q1 and q2, respectively.

It splits q1 and q2 into individual tokens (words).

If either q1 or q2 has no tokens (empty), it returns the token_features list.

It filters out the stopwords from q1 and q2, creating sets of non-stopword words.

It also creates sets of stopwords from q1 and q2.

It calculates the count of common non-stopword words, common stopwords, and common tokens between q1 and q2.

The token-based features are computed and stored in the token_features list as follows:

**Index 0**: Ratio of common non-stopword words to the minimum length of q1_words and q2_words.

**Index 1**: Ratio of common non-stopword words to the maximum length of q1_words and q2_words.

**Index 2**: Ratio of common stopwords to the minimum length of q1_stops and q2_stops.

**Index 3**: Ratio of common stopwords to the maximum length of q1_stops and q2_stops.

**Index 4**: Ratio of common tokens to the minimum length of q1_tokens and q2_tokens.

**Index 5**: Ratio of common tokens to the maximum length of q1_tokens and q2_tokens.

**Index 6**: Indicator (1 or 0) whether the last word of q1 is the same as the last word of q2.

**Index 7**: Indicator (1 or 0) whether the first word of q1 is the same as the first word of q2.

Finally, the token_features list is returned as the output.

In [82]:
# Advanced Features
from nltk.corpus import stopwords

def fetch_token_features(row):
    
    q1 = row['question1']
    q2 = row['question2']
    
    SAFE_DIV = 0.0001 

    STOP_WORDS = set(stopwords.words("english"))  # a set gives constant-time membership tests
    
    token_features = [0.0]*8
    
    # Converting the Sentence into Tokens: 
    q1_tokens = q1.split()
    q2_tokens = q2.split()
    
    if len(q1_tokens) == 0 or len(q2_tokens) == 0:
        return token_features

    # Get the non-stopwords in Questions
    q1_words = set([word for word in q1_tokens if word not in STOP_WORDS])
    q2_words = set([word for word in q2_tokens if word not in STOP_WORDS])
    
    #Get the stopwords in Questions
    q1_stops = set([word for word in q1_tokens if word in STOP_WORDS])
    q2_stops = set([word for word in q2_tokens if word in STOP_WORDS])
    
    # Get the common non-stopwords from Question pair
    common_word_count = len(q1_words.intersection(q2_words))
    
    # Get the common stopwords from Question pair
    common_stop_count = len(q1_stops.intersection(q2_stops))
    
    # Get the common Tokens from Question pair
    common_token_count = len(set(q1_tokens).intersection(set(q2_tokens)))
    
    
    token_features[0] = common_word_count / (min(len(q1_words), len(q2_words)) + SAFE_DIV)
    token_features[1] = common_word_count / (max(len(q1_words), len(q2_words)) + SAFE_DIV)
    token_features[2] = common_stop_count / (min(len(q1_stops), len(q2_stops)) + SAFE_DIV)
    token_features[3] = common_stop_count / (max(len(q1_stops), len(q2_stops)) + SAFE_DIV)
    token_features[4] = common_token_count / (min(len(q1_tokens), len(q2_tokens)) + SAFE_DIV)
    token_features[5] = common_token_count / (max(len(q1_tokens), len(q2_tokens)) + SAFE_DIV)
    
    # Last word of both question is same or not
    token_features[6] = int(q1_tokens[-1] == q2_tokens[-1])
    
    # First word of both question is same or not
    token_features[7] = int(q1_tokens[0] == q2_tokens[0])
    
    return token_features


In [83]:
token_features = new_df.apply(fetch_token_features, axis=1)

new_df["cwc_min"]       = list(map(lambda x: x[0], token_features))
new_df["cwc_max"]       = list(map(lambda x: x[1], token_features))
new_df["csc_min"]       = list(map(lambda x: x[2], token_features))
new_df["csc_max"]       = list(map(lambda x: x[3], token_features))
new_df["ctc_min"]       = list(map(lambda x: x[4], token_features))
new_df["ctc_max"]       = list(map(lambda x: x[5], token_features))
new_df["last_word_eq"]  = list(map(lambda x: x[6], token_features))
new_df["first_word_eq"] = list(map(lambda x: x[7], token_features))
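To make the min and max ratios concrete, here is a minimal sketch for cwc_min and cwc_max on one toy pair. TOY_STOP_WORDS is a tiny hand-written stand-in for the NLTK stopword list and toy_cwc is a hypothetical name; both are assumptions made only for illustration.

```python
SAFE_DIV = 0.0001
# Assumption: a tiny stand-in for NLTK's English stopword list.
TOY_STOP_WORDS = {"what", "is", "the", "of", "a", "in"}

def toy_cwc(q1, q2):
    """Return (cwc_min, cwc_max) for a pair, using TOY_STOP_WORDS."""
    q1_words = {w for w in q1.split() if w not in TOY_STOP_WORDS}
    q2_words = {w for w in q2.split() if w not in TOY_STOP_WORDS}
    common = len(q1_words & q2_words)
    # SAFE_DIV keeps the division defined when a question has no non-stopwords.
    cwc_min = common / (min(len(q1_words), len(q2_words)) + SAFE_DIV)
    cwc_max = common / (max(len(q1_words), len(q2_words)) + SAFE_DIV)
    return cwc_min, cwc_max

cwc_min, cwc_max = toy_cwc("what is the capital of france",
                           "what is the largest city in france")
print(round(cwc_min, 3), round(cwc_max, 3))
```

Here the questions share one non-stopword ("france") out of two and three respectively, so the min-based ratio is about one half and the max-based ratio about one third.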

In [84]:
!pip install distance


In [85]:
import distance

def fetch_length_features(row):
    
    q1 = row['question1']
    q2 = row['question2']
    
    length_features = [0.0]*3
    
    # Converting the Sentence into Tokens: 
    q1_tokens = q1.split()
    q2_tokens = q2.split()
    
    if len(q1_tokens) == 0 or len(q2_tokens) == 0:
        return length_features
    
    # Absolute length features
    length_features[0] = abs(len(q1_tokens) - len(q2_tokens))
    
    #Average Token Length of both Questions
    length_features[1] = (len(q1_tokens) + len(q2_tokens))/2
    
    # Longest common substring ratio; guard against pairs that share no substring
    strs = list(distance.lcsubstrings(q1, q2))
    if strs:
        length_features[2] = len(strs[0]) / (min(len(q1), len(q2)) + 1)
    
    return length_features
    

In [86]:
length_features = new_df.apply(fetch_length_features, axis=1)

new_df['abs_len_diff'] = list(map(lambda x: x[0], length_features))
new_df['mean_len'] = list(map(lambda x: x[1], length_features))
new_df['longest_substr_ratio'] = list(map(lambda x: x[2], length_features))
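The longest_substr_ratio feature can also be sketched with the standard library alone. The version below swaps in difflib.SequenceMatcher for the distance package, so it is an illustration of the same ratio under that assumption, not the notebook's implementation.

```python
from difflib import SequenceMatcher

def longest_substr_ratio(q1, q2):
    """Length of the longest common substring divided by (shorter length + 1)."""
    if not q1 or not q2:
        return 0.0
    matcher = SequenceMatcher(None, q1, q2, autojunk=False)
    match = matcher.find_longest_match(0, len(q1), 0, len(q2))
    # match.size is 0 when the two strings share no characters at all.
    return match.size / (min(len(q1), len(q2)) + 1)

print(round(longest_substr_ratio("bhopal gas tragedy", "gas tragedy reason"), 3))
```

The "+ 1" in the denominator mirrors the notebook's formula and keeps the ratio strictly below 1 even for identical strings.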

In [87]:
# Fuzzy Features
from fuzzywuzzy import fuzz

def fetch_fuzzy_features(row):
    
    q1 = row['question1']
    q2 = row['question2']
    
    fuzzy_features = [0.0]*4
    
    # fuzz_ratio
    fuzzy_features[0] = fuzz.QRatio(q1, q2)

    # fuzz_partial_ratio
    fuzzy_features[1] = fuzz.partial_ratio(q1, q2)

    # token_sort_ratio
    fuzzy_features[2] = fuzz.token_sort_ratio(q1, q2)

    # token_set_ratio
    fuzzy_features[3] = fuzz.token_set_ratio(q1, q2)

    return fuzzy_features

In [88]:
fuzzy_features = new_df.apply(fetch_fuzzy_features, axis=1)

# Creating new feature columns for fuzzy features
new_df['fuzz_ratio'] = list(map(lambda x: x[0], fuzzy_features))
new_df['fuzz_partial_ratio'] = list(map(lambda x: x[1], fuzzy_features))
new_df['token_sort_ratio'] = list(map(lambda x: x[2], fuzzy_features))
new_df['token_set_ratio'] = list(map(lambda x: x[3], fuzzy_features))
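To give some intuition for why token-based fuzzy scores help with reordered questions, the sketch below approximates the idea behind a plain ratio and a token-sort ratio using difflib. The functions simple_ratio and simple_token_sort_ratio are hypothetical stand-ins; they are not fuzzywuzzy's implementation and their scores may differ from it.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Character-level similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def simple_token_sort_ratio(a, b):
    """Sort the tokens of each string before comparing, so word order is ignored."""
    sorted_a = " ".join(sorted(a.split()))
    sorted_b = " ".join(sorted(b.split()))
    return simple_ratio(sorted_a, sorted_b)

q1 = "is python easy to learn"
q2 = "to learn python is easy"
print(simple_ratio(q1, q2), simple_token_sort_ratio(q1, q2))
```

The two questions contain the same words in a different order, so the token-sorted comparison treats them as identical while the plain character-level comparison does not.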

In [89]:
print(new_df.shape)

(30000, 28)


# **Subsetting Required Data For Modeling:**

In [90]:
ques_df = new_df[['question1','question2']]
ques_df.head(4)

Unnamed: 0,question1,question2
398782,what is the best marketing automation tool for...,what is the best marketing automation tool for...
115086,i am poor but i want to invest what should i do,i am quite poor and i want to be very rich wh...
327711,i am from india and live abroad i met a guy f...,t i e t to thapar university to thapar univers...
367788,why do so many people in the u s hate the sou...,my boyfriend doesnt feel guilty when he hurts ...


# **Preparing Data for CountVectorizer To Get Bag of Words:**

In [91]:
questions = list(ques_df['question1']) + list(ques_df['question2'])

In [92]:
questions[:10]

['what is the best marketing automation tool for small and mid size companies',
 'i am poor but i want to invest  what should i do',
 'i am from india and live abroad  i met a guy from france in a party i want to date him  how do i do that',
 'why do so many people in the u s  hate the southern states',
 'consequences of bhopal gas tragedy',
 'i killed a snake on a friday  there is a belief that when you kill a snake on a friday it will certainly take revenge  will i be killed',
 'is the royal family a net gain or a net loss to the british taxpayer',
 'if a huge asteroid was about to hit earth in x year  would we be able to find survival solutions in due time',
 'what would happen if a woman took viagra',
 'how could i improve my love to my girlfriend']

# **Bag of Words:**

CountVectorizer from the sklearn.feature_extraction.text module converts a collection of text documents (the questions) into a matrix of word counts.

Here's a summary of what the code does:

* It imports the CountVectorizer class from the sklearn.feature_extraction.text module.
* It initializes an instance of CountVectorizer named vectorizer with two parameters. max_features=1000 limits the vocabulary to the 1000 most frequent words in the data it is fitted on. stop_words='english' excludes common English stopwords from the vocabulary.
* It calls the fit_transform method of vectorizer on questions. This learns the vocabulary and returns a sparse matrix X_train in which each row corresponds to a question and each column to a word in the vocabulary, so its dimensions are (number of questions, number of words in the vocabulary).
* It contains a commented-out call to the transform method on questions_t (presumably the test questions). That call would apply the vocabulary learned from the training data to the test data and return a sparse matrix X_test with the same number of columns as X_train.
* It calls the toarray method on X_train to convert the sparse matrix into a regular, dense NumPy array. The resulting matrix bag_of_word has the same dimensions as X_train, and each element is the count of a word in a specific question.

In [93]:
from sklearn.feature_extraction.text import CountVectorizer
# Initialize the CountVectorizer
vectorizer = CountVectorizer(max_features=1000,stop_words='english')

# Fit the vectorizer on the questions to learn the vocabulary
X_train = vectorizer.fit_transform(questions)

#X_test = vectorizer.transform(questions_t)

# Convert the bag-of-words representation to a dense matrix
bag_of_word = X_train.toarray()

In [94]:
X_train

<60000x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 174767 stored elements in Compressed Sparse Row format>

In [95]:
(X_train !=0).sum()

174767

In [96]:
print(f'Fraction of values that are non-zero: {(X_train !=0).sum()/np.prod(X_train.shape)}')

Fraction of values that are non-zero: 0.0029127833333333335
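The fit-then-transform pattern described above can be sketched without scikit-learn. The helpers fit_vocabulary and transform below are hypothetical names, and they use plain whitespace tokenization with no stopword removal, so this is a simplified illustration rather than what CountVectorizer does internally.

```python
from collections import Counter

def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent tokens, returned in alphabetical order."""
    counts = Counter(token for doc in docs for token in doc.split())
    top = [token for token, _ in counts.most_common(max_features)]
    return sorted(top)

def transform(docs, vocabulary):
    """Map each document to a list of per-token counts over the vocabulary."""
    index = {token: i for i, token in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.split():
            if token in index:          # tokens outside the vocabulary are ignored
                row[index[token]] += 1
        rows.append(row)
    return rows

train = ["best way learn python", "learn python fast"]
test = ["learn java fast"]
vocab = fit_vocabulary(train, max_features=2)
print(vocab, transform(train, vocab), transform(test, vocab))
```

The vocabulary is learned from the training documents only; words that appear only in the test document ("java") or fall outside the top features ("fast") contribute nothing to its row.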


# **TFIDF Vectors:**

TfidfVectorizer transforms the collection of questions into a matrix of TF-IDF values; the sparse result is then converted into a dense matrix.

In [97]:
# Initialize the TfidfVectorizer and fit on the questions to learn the vocabulary
vectorizer = TfidfVectorizer(max_features=1000,stop_words='english')
tfidf = vectorizer.fit_transform(questions)
# Convert the TF-IDF representation to a dense matrix
tfidf = tfidf.toarray()
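For readers who want to see where the TF-IDF numbers come from, here is a minimal pure-Python sketch. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 and L2 row normalization, which are the documented defaults of TfidfVectorizer; the function name tfidf_rows and the whitespace tokenization are simplifications made for this example.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows, one dict per document."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for tokens in tokenised for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenised:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return idf, rows

idf, rows = tfidf_rows(["learn python", "learn java", "learn python fast"])
print(idf)
print(rows[0])
```

A word that occurs in every document ("learn") gets the smallest IDF, so rarer words such as "java" carry more weight within a row.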

Here we retrieve the vocabulary learned by the TfidfVectorizer, print a subset of it, and print a subset of the TF-IDF representation of the questions.

In [98]:
# Retrieve the vocabulary
vocabulary = vectorizer.get_feature_names_out()
# Print a subset of the vocabulary (optional)
max_vocabulary_display = 100
print("Vocabulary (subset):")
print(vocabulary[:max_vocabulary_display])

# Print a subset of the TF-IDF representation (optional)
max_tfidf_display = 100
print("TF-IDF representation (subset):")
print(tfidf[:max_tfidf_display])

Vocabulary (subset):
['10' '100' '11' '12' '12th' '13' '15' '16' '18' '1k' '20' '2014' '2015'
 '2016' '2017' '24' '2k' '30' '50' '500' 'able' 'abroad' 'access'
 'account' 'accounts' 'acne' 'act' 'actor' 'actually' 'add' 'address'
 'admission' 'advanced' 'advantages' 'advice' 'affect' 'age' 'ago' 'air'
 'alcohol' 'allowed' 'amazon' 'america' 'american' 'americans' 'ancient'
 'android' 'animals' 'answer' 'answers' 'anxiety' 'app' 'apple'
 'application' 'applications' 'apply' 'approach' 'apps' 'area' 'army'
 'art' 'ask' 'asked' 'attack' 'attractive' 'australia' 'available'
 'average' 'avoid' 'away' 'bad' 'balance' 'ban' 'bangalore' 'bank'
 'banning' 'based' 'basic' 'battle' 'beautiful' 'believe' 'belly'
 'benefits' 'best' 'better' 'big' 'biggest' 'birth' 'birthday' 'black'
 'block' 'blocked' 'blog' 'blood' 'blowing' 'blue' 'board' 'body'
 'bollywood' 'book']
TF-IDF representation (subset):
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0.

# **Applying word2vec as well:**

In [99]:
import gensim

In [100]:
questions = list(ques_df['question1']) + list(ques_df['question2'])

The simple_preprocess function from gensim.utils converts each sentence in the questions list into a list of lowercase tokens; the processed sentences are stored in the ques_sent list.

In [101]:
ques_sent = []
for sentence in questions:
    ques_sent.append(gensim.utils.simple_preprocess(sentence))

In the following cells we initialize a Word2Vec model, build its vocabulary, train it, and collect the learned vectors.

The model is created with these parameters:

* window=2 sets the maximum distance between the target word and its context words to 2.
* min_count=3 sets the minimum frequency count of words to 3; words that occur less often are ignored.
* sg=1 selects the Skip-gram algorithm; the alternative sg=0 would use the Continuous Bag of Words (CBOW) algorithm.
* vector_size=100 sets the dimensionality of the word vectors to 100.

The vocabulary is built by calling the build_vocab method on the preprocessed sentences (ques_sent), and the model is trained by calling the train method with ques_sent as the training corpus:

* corpus_iterable specifies the input corpus as an iterable of sentences.
* total_examples specifies the total number of sentences in the corpus.
* epochs specifies the number of training epochs (passes over the corpus) to perform.

After training, the word vectors are retrieved from the model and stored in the w2v dictionary:

* model.wv.index_to_key retrieves the vocabulary words as a list.
* model.wv.vectors.round(3) retrieves the corresponding word vectors, rounded to 3 decimal places.
* dict(zip(...)) creates a dictionary mapping each word to its vector.
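To illustrate what window=2 means for Skip-gram, the sketch below lists the (target, context) pairs that fall within the maximum window for one tokenized sentence. The function skipgram_pairs is a hypothetical helper written for this example; gensim's actual training also involves steps such as negative sampling and may use a smaller effective window per target, none of which is shown here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:                      # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["what", "is", "the", "best", "book"], window=2)
print(pairs)
```

With a window of 2, "what" is paired with "is" and "the" but never with "best", which sits three positions away.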

In [102]:
model  = gensim.models.Word2Vec(window=2,min_count=3,sg=1,vector_size=100)

In [103]:
model.build_vocab(ques_sent)

In [104]:
model.train(corpus_iterable=ques_sent,total_examples= model.corpus_count, epochs=model.epochs)

(2174565, 3150050)

In [105]:
w2v = dict(zip(model.wv.index_to_key, (model.wv.vectors.round(3))))

In [106]:
w2v['what']

array([-6.200e-01,  1.020e-01,  2.910e-01,  2.430e-01,  3.600e-01,
       -4.650e-01,  4.100e-01,  5.140e-01, -3.900e-02, -4.040e-01,
       -2.780e-01,  1.800e-02, -1.000e-03,  2.740e-01,  3.080e-01,
       -1.960e-01,  1.600e-02, -4.400e-02, -2.300e-01, -4.920e-01,
        7.060e-01,  1.600e-01,  9.250e-01, -9.480e-01,  3.840e-01,
        1.400e-02,  3.450e-01,  6.090e-01, -3.140e-01,  3.410e-01,
        1.000e+00, -4.840e-01,  2.960e-01, -7.650e-01,  4.300e-02,
        3.240e-01,  2.770e-01,  3.210e-01,  6.200e-02,  3.690e-01,
        5.500e-02, -4.460e-01, -5.110e-01, -4.260e-01,  4.200e-02,
        2.860e-01,  5.200e-02, -9.000e-02, -1.340e-01,  1.320e-01,
        5.100e-02, -7.240e-01, -3.330e-01, -2.010e-01,  1.840e-01,
        4.280e-01,  3.290e-01,  5.680e-01, -9.000e-03,  3.460e-01,
       -3.770e-01,  2.700e-02, -2.800e-02, -1.320e-01, -1.330e-01,
        6.150e-01,  7.640e-01, -2.300e-02, -3.380e-01,  1.297e+00,
       -1.260e-01, -6.900e-02,  5.570e-01, -4.740e-01,  6.730e

In [107]:
w2v['why']

array([-0.604,  0.876, -0.516, -0.1  , -0.055, -0.458, -0.212,  0.228,
       -0.298, -0.809,  0.03 , -0.268, -0.098,  0.339,  0.547, -0.12 ,
        0.22 , -0.552, -0.615, -0.787,  0.119, -0.155,  0.519,  0.013,
        0.227,  0.172, -0.23 , -0.207, -0.089,  0.46 ,  1.215, -0.493,
        0.309, -0.057, -0.104,  0.301,  0.084,  0.125, -0.081, -0.248,
       -0.159, -0.264,  0.524, -0.031,  0.009, -0.637, -0.207, -0.099,
        0.175,  0.118,  0.468, -0.516, -0.199, -0.097,  0.083,  0.47 ,
        0.359,  0.134, -0.437,  0.193,  0.348,  0.248,  0.527,  0.377,
        0.316,  0.178, -0.024, -0.157, -0.151,  0.355,  0.227,  0.056,
        0.382,  0.188,  0.464,  0.378,  0.185,  0.322,  0.069,  0.131,
       -0.032, -0.469, -0.104, -0.202,  0.384,  0.108,  0.255,  0.361,
       -0.043,  0.066,  0.386,  0.26 ,  0.294,  0.013,  0.45 , -0.265,
        0.794,  0.122,  0.088,  0.049], dtype=float32)

# **TF-IDF Vector For word2vec:**

In [108]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [109]:
tfidf1 = TfidfVectorizer()

In [110]:
tfidf1.fit_transform(questions)

<60000x26200 sparse matrix of type '<class 'numpy.float64'>'
	with 606923 stored elements in Compressed Sparse Row format>

In [111]:
a = tfidf1.vocabulary_.items()

# **Creating a dictionary word2weight that maps each word in the vocabulary of a TF-IDF vectorizer (tfidf1) to its corresponding IDF (Inverse Document Frequency) weight:**

In [112]:
word2weight = [(w, round(tfidf1.idf_[i])) for w, i in tfidf1.vocabulary_.items()]

In [113]:
word2weight = dict(word2weight)

In [114]:
model.wv.similar_by_word('pakistan',topn=15)

[('russia', 0.9117721915245056),
 ('china', 0.8754943013191223),
 ('strike', 0.8475841283798218),
 ('attack', 0.8412674069404602),
 ('declare', 0.8370988965034485),
 ('kashmir', 0.828040599822998),
 ('vietnam', 0.8251993060112),
 ('terrorists', 0.8244001269340515),
 ('israel', 0.8226666450500488),
 ('syria', 0.8189303278923035),
 ('defeat', 0.8165336847305298),
 ('declared', 0.8101333975791931),
 ('uri', 0.8079087734222412),
 ('invade', 0.8062509298324585),
 ('north', 0.8046808242797852)]
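The scores returned by similar_by_word are cosine similarities between word vectors. A minimal sketch of that measure, using plain Python lists and a hypothetical helper name cosine_similarity, looks like this:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Vectors that point in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.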

# **Data Splitting into Train, Test and Validation Sets, and Preparation for Modeling:**

The function document_vector computes a vector for a given document (question).

Here's a summary of what the code does:

* It checks the length of the input document. If the document has no words (len(doc.split()) == 0) or only one word (len(doc.split()) == 1), it returns a zero vector of shape (100).
* Otherwise, it initializes an empty list doc_vec to store the word vectors multiplied by their weights, and a variable tfidf_weight_sum to store the sum of those weights.
* It iterates over each word in the document. For each word that exists in both the word2vec dictionary (w2v) and the IDF weights (word2weight), it computes a TF-IDF weight as the word's IDF weight multiplied by its relative frequency in the document.
* It multiplies the word vector (w2v[word]) by that weight, appends the result to doc_vec, and adds the weight to tfidf_weight_sum.
* After processing all the words, if no valid word vectors were found (len(doc_vec) == 0), it returns a zero vector of shape (100). Otherwise, it calculates the weighted average of the word vectors by summing them (np.sum(doc_vec, axis=0)) and dividing by the total weight (tfidf_weight_sum). The resulting document vector is rounded to 3 decimal places using np.round().

In [115]:
def document_vector(doc):
    if len(doc.split()) == 0:
        return np.zeros(shape=(100))
    elif len(doc.split()) == 1:
        return np.zeros(shape=(100))
    else:
#         doc = [word for word in doc.split() if word in model.wv.index_to_key]
#         return np.mean(model.wv[doc],axis=0).round(2)
        doc_vec = []
        tfidf_weight_sum = 0
        for word in doc.split():
            if word in w2v.keys() and word in word2weight.keys():
                tfidf_weight = word2weight[word]*doc.split().count(word)/len(doc.split())
                product = (w2v[word]*tfidf_weight)
                doc_vec.append(product)
                tfidf_weight_sum = tfidf_weight_sum + tfidf_weight
                #print(f"weight of {word} : {word2weight[word]}")
                #print(f"word vector of {word} : {w2v[word]}")
                #print(product)
        #print(doc_vec)
        if len(doc_vec) == 0:
            # No word matched both vocabularies: return a zero vector with the same shape as the other branches
            return np.zeros(shape=(100))
        else:
            return np.round(np.sum(doc_vec,axis=0)/tfidf_weight_sum,3)
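A toy version of the weighted average may help in checking the arithmetic by hand. The sketch below uses 2-dimensional vectors and made-up IDF weights (toy_vectors and toy_idf are hypothetical values, not taken from the trained model), and unlike document_vector it does not special-case one-word documents.

```python
def weighted_doc_vector(doc, vectors, idf, dim):
    """IDF-and-frequency weighted average of word vectors; zeros if nothing matches."""
    tokens = doc.split()
    total = [0.0] * dim
    weight_sum = 0.0
    for word in tokens:
        if word in vectors and word in idf:
            # Weight = IDF of the word times its relative frequency in the document.
            weight = idf[word] * tokens.count(word) / len(tokens)
            total = [t + weight * x for t, x in zip(total, vectors[word])]
            weight_sum += weight
    if weight_sum == 0:
        return total
    return [round(t / weight_sum, 3) for t in total]

toy_vectors = {"gas": [1.0, 0.0], "tragedy": [0.0, 1.0]}  # hypothetical 2-d embeddings
toy_idf = {"gas": 1, "tragedy": 3}                        # hypothetical integer IDF weights
print(weighted_doc_vector("bhopal gas tragedy", toy_vectors, toy_idf, dim=2))
```

The word "bhopal" has no vector and is skipped, and "tragedy" has three times the weight of "gas", so the result lies three quarters of the way towards the "tragedy" vector.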

In [116]:
from tqdm import tqdm

The question data by applying the document_vector function to each question in the dataset. It creates two separate lists (X and X2) to store the document vectors for the questions in 'question1' and 'question2' columns, respectively.

The steps are:

- Iterate over each question in the 'question1' column of the dataset, call the document_vector function on it, and append the resulting document vector to the X list.
- Repeat the same process for the 'question2' column and append the document vectors to the X2 list.
- Convert the X and X2 lists into numpy arrays (X = np.array(X) and X2 = np.array(X2)).
- Create two temporary data frames (temp_df1 and temp_df2) using the document vectors as the data and the original data frame index as the index.
- Concatenate the temporary data frames along the column axis to create a new data frame (temp_df).
- Drop the 'id', 'qid1', 'qid2', 'question1', and 'question2' columns from the new_df data frame, store the result in final_df, and print its shape.
- Concatenate final_df and temp_df along the column axis to create complete_df, and print its shape.
- Call the head() method on complete_df to display the first few rows.

In [117]:
X = []
for doc in tqdm(ques_df['question1']):
    X.append(document_vector(doc))

100%|██████████| 30000/30000 [00:02<00:00, 14686.40it/s]


In [118]:
X2 = []
for doc in tqdm(ques_df['question2']):
    X2.append(document_vector(doc))

100%|██████████| 30000/30000 [00:02<00:00, 14489.22it/s]


In [119]:
X = np.array(X)

In [120]:
X2 = np.array(X2)

In [121]:
temp_df1 = pd.DataFrame(X, index= new_df.index)
temp_df2 = pd.DataFrame(X2, index= new_df.index)
temp_df = pd.concat([temp_df1, temp_df2], axis=1)

In [122]:
temp_df.shape

(30000, 200)
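Both temporary data frames use the default integer column labels, so the concatenated temp_df carries the labels 0 to 99 twice, once per question. One way to keep the two halves distinguishable is to prefix the labels before concatenating. The sketch below shows the idea on toy data; the prefixes q1_w2v_ and q2_w2v_ and the toy arrays are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X and X2: three questions, two dimensions each.
toy_index = pd.Index([398782, 115086, 327711])
toy_q1 = np.arange(6, dtype=float).reshape(3, 2)
toy_q2 = toy_q1 + 10.0

# Prefix the integer column labels so that every column name is unique.
q1_df = pd.DataFrame(toy_q1, index=toy_index).add_prefix("q1_w2v_")
q2_df = pd.DataFrame(toy_q2, index=toy_index).add_prefix("q2_w2v_")

# Concatenate along the column axis; rows are aligned on the shared index.
toy_emb_df = pd.concat([q1_df, q2_df], axis=1)
print(toy_emb_df.columns.tolist())
```

Unique column names make it possible to select a single embedding dimension by label later on, which duplicate labels would make ambiguous.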

In [123]:
final_df = new_df.drop(columns=['id','qid1','qid2','question1','question2'])
print(final_df.shape)

(30000, 23)


In [124]:
complete_df = pd.concat([final_df, temp_df], axis=1)
print(complete_df.shape)
complete_df.head()

(30000, 223)


Unnamed: 0,is_duplicate,q1_len,q2_len,q1_num_words,q2_num_words,word_common,word_total,word_share,cwc_min,cwc_max,...,90,91,92,93,94,95,96,97,98,99
398782,1,75,76,13,13,12,26,0.46,0.874989,0.874989,...,0.227,0.083,-0.038,-0.041,0.324,0.211,-0.0,-0.137,0.067,0.012
115086,0,48,56,13,16,8,24,0.33,0.666644,0.499988,...,0.178,0.066,-0.018,0.093,0.536,-0.011,0.252,-0.174,0.019,-0.097
327711,0,104,119,28,21,4,38,0.11,0.0,0.0,...,0.23,0.119,-0.005,0.019,0.446,0.086,-0.163,-0.129,0.043,-0.057
367788,0,58,145,14,32,1,34,0.03,0.0,0.0,...,0.377,0.031,0.23,0.194,0.62,-0.241,0.244,0.391,-0.038,0.179
151235,0,34,49,5,9,3,13,0.23,0.749981,0.599988,...,0.161,0.056,0.096,-0.086,0.447,0.097,-0.096,0.034,-0.027,-0.036


In [125]:
complete_df.columns[0:24]

Index([        'is_duplicate',               'q1_len',               'q2_len',
               'q1_num_words',         'q2_num_words',          'word_common',
                 'word_total',           'word_share',              'cwc_min',
                    'cwc_max',              'csc_min',              'csc_max',
                    'ctc_min',              'ctc_max',         'last_word_eq',
              'first_word_eq',         'abs_len_diff',             'mean_len',
       'longest_substr_ratio',           'fuzz_ratio',   'fuzz_partial_ratio',
           'token_sort_ratio',      'token_set_ratio',                      0],
      dtype='object')

In [126]:
complete_df.isnull().sum()

is_duplicate    0
q1_len          0
q2_len          0
q1_num_words    0
q2_num_words    0
               ..
95              0
96              0
97              0
98              0
99              0
Length: 223, dtype: int64

In [127]:
from sklearn.model_selection import train_test_split
X_tr,X_te,y_tr,y_te = train_test_split(complete_df.iloc[:,1:],complete_df.iloc[:,0],test_size=0.2,random_state=1)

**To train an LSTM model for binary classification:**

It splits the data into training and test sets using train_test_split, where complete_df.iloc[:,1:].values represents the input features (X) and complete_df.iloc[:,0].values represents the target variable (y). It applies feature scaling using StandardScaler to standardize the input features. The scaler is fit on the training data and then applied to transform both the training and test data.

The input data for the LSTM model is reshaped to have the shape (number of samples, number of time steps, number of features) using np.reshape. In this case, the number of features is set to 1, so each of the 222 input columns is treated as one time step. The LSTM model is defined using Sequential from Keras and consists of an LSTM layer with 128 units and a dense output layer with sigmoid activation. The model is compiled with the binary cross-entropy loss function, the Adam optimizer, and accuracy as the evaluation metric.

The model is trained using the fit function, where X_train_lstm and y_train are the training data, epochs=20 specifies the number of training epochs, batch_size=64 determines the batch size, and validation_data is provided as (X_test_lstm, y_test) for validation during training. The training progress and performance metrics are stored in the lstm_hist variable.

In [128]:
X_train,X_test,y_train,y_test = train_test_split(complete_df.iloc[:,1:].values,complete_df.iloc[:,0].values,test_size=0.2,random_state=1)

In [129]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_stdscaled = scaler.transform(X_train)
X_test_stdscaled = scaler.transform(X_test)
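To make the scaling step concrete, the sketch below writes out the same kind of standardization in plain numpy on toy data: the mean and standard deviation come from the training rows only and are then reused for the test rows. The toy arrays are assumptions for illustration, and the code is a hand-written equivalent rather than a call into scikit-learn.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales.
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_test = np.array([[4.0, 400.0]])

# Statistics are computed on the training rows only.
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)  # population standard deviation (ddof=0)

# The same statistics are applied to both splits.
toy_train_scaled = (toy_train - train_mean) / train_std
toy_test_scaled = (toy_test - train_mean) / train_std

print(toy_train_scaled.mean(axis=0), toy_train_scaled.std(axis=0))
print(toy_test_scaled)
```

Reusing the training statistics keeps information about the test rows out of the preprocessing, which is why the scaler above is fit on X_train and only applied to X_test.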

In [130]:
# Every column of X_train is numeric (22 hand-crafted features plus
# 200 embedding dimensions), so the whole matrix can be standardized in
# one pass; scalar and array features do not need to be separated.
scaler = StandardScaler()

# Fit on the training data only, then transform both splits
X_train_stdscaled = scaler.fit_transform(X_train)
X_test_stdscaled = scaler.transform(X_test)

# Verify the shapes
print("X_train_stdscaled shape:", X_train_stdscaled.shape)
print("X_test_stdscaled shape:", X_test_stdscaled.shape)


In [None]:
# Reshape the input data for LSTM
X_train_lstm = np.reshape(X_train_stdscaled, (X_train_stdscaled.shape[0], X_train_stdscaled.shape[1], 1))
X_test_lstm = np.reshape(X_test_stdscaled, (X_test_stdscaled.shape[0], X_test_stdscaled.shape[1], 1))
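The reshape above is easier to follow on a small array. The sketch below, which uses a toy 3 x 4 matrix as an assumed stand-in for the scaled features, contrasts the layout used in this notebook (one time step per feature) with an alternative layout that passes the whole feature vector in a single time step.

```python
import numpy as np

# Hypothetical toy matrix: 3 samples, 4 features.
toy_features = np.arange(12, dtype=float).reshape(3, 4)

# Layout used in this notebook: each feature is one time step holding one value.
as_steps = toy_features.reshape(toy_features.shape[0], toy_features.shape[1], 1)

# Alternative layout: a single time step that holds all features at once.
as_single_step = toy_features.reshape(toy_features.shape[0], 1, toy_features.shape[1])

print(as_steps.shape)        # (samples, time steps, features per step)
print(as_single_step.shape)
```

With the first layout the recurrent layer walks through the features one by one, so the order of the columns matters; with the second layout the layer sees each sample in a single step.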

In [None]:
# Define the LSTM model
lstm_model = Sequential()
lstm_model.add(LSTM(128, input_shape=(X_train_lstm.shape[1], 1)))
lstm_model.add(Dense(1, activation='sigmoid'))

# Compile the model
lstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
# Train the LSTM model with validation data
lstm_hist = lstm_model.fit(X_train_lstm, y_train, epochs=20, batch_size=64, validation_data=(X_test_lstm, y_test))

In [None]:
#Plot the training history
plt.plot(lstm_hist.history['accuracy'])
plt.plot(lstm_hist.history['val_accuracy'])
plt.title('LSTM Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

plt.plot(lstm_hist.history['loss'])
plt.plot(lstm_hist.history['val_loss'])
plt.title('LSTM Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.show()

The results suggest that the LSTM model achieved moderate accuracy on both the training and validation sets. Because validation_data is the held-out test split, the validation curve shows performance on unseen data. The validation accuracy is slightly lower than the training accuracy, which indicates some degree of overfitting or a discrepancy between the training and validation data. Further analysis and adjustments may be necessary to improve the model's performance.
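One simple way to quantify the overfitting mentioned above is to look for the epoch with the lowest validation loss and to compare the final training and validation losses. The helper best_epoch below is a hypothetical name introduced for this sketch, and the numbers in toy_history are made up for illustration; they are not results from this notebook.

```python
def best_epoch(history, metric="val_loss"):
    """Return the 1-based epoch with the lowest value of `metric` and that value."""
    values = history[metric]
    best_index = min(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


# Hypothetical numbers, for illustration only.
toy_history = {
    "loss": [0.62, 0.55, 0.50, 0.46, 0.42],
    "val_loss": [0.60, 0.56, 0.54, 0.55, 0.57],
}

epoch, value = best_epoch(toy_history)
# Gap between validation and training loss after the last epoch.
gap = toy_history["val_loss"][-1] - toy_history["loss"][-1]
print(epoch, value, round(gap, 2))
```

In this notebook the same helper could be called as best_epoch(lstm_hist.history). A validation loss that bottoms out well before the last epoch while the training loss keeps falling is the usual sign that training ran longer than needed.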

In [None]:
# Predict on the test data
y_pred_prob = lstm_model.predict(X_test_lstm)
y_pred_lstm = (y_pred_prob > 0.5).astype(int)

In [None]:
# Convert the predicted values to 1D array
y_pred_lstm = np.squeeze(y_pred_lstm)

In [None]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred_lstm)
print("Accuracy:", accuracy)

In [None]:
# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_pred_lstm)
print("Confusion Matrix:")
print(cm)

In [None]:
# Create a heatmap of the confusion matrix
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()

In [None]:
# Calculate precision and recall
precision = precision_score(y_test, y_pred_lstm)
recall = recall_score(y_test, y_pred_lstm)

print("Precision:", precision)
print("Recall:", recall)

In [None]:
# Generate classification report
report = classification_report(y_test, y_pred_lstm)
print(report)
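The metrics reported above all derive from the four cells of the confusion matrix. As a minimal sketch on made-up labels and probabilities (not results from this notebook), the example below thresholds the probabilities, flattens them to a 1D array, and computes accuracy, precision, and recall by hand.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, for illustration only.
toy_true = np.array([1, 0, 1, 1, 0, 0])
toy_prob = np.array([[0.9], [0.4], [0.3], [0.7], [0.6], [0.1]])

# Threshold at 0.5 and flatten the (n, 1) output to shape (n,).
toy_pred = np.squeeze((toy_prob > 0.5).astype(int))

# Count the four cells of the confusion matrix.
tp = int(np.sum((toy_true == 1) & (toy_pred == 1)))
fp = int(np.sum((toy_true == 0) & (toy_pred == 1)))
fn = int(np.sum((toy_true == 1) & (toy_pred == 0)))
tn = int(np.sum((toy_true == 0) & (toy_pred == 0)))

accuracy = (tp + tn) / len(toy_true)
precision = tp / (tp + fp)  # share of predicted duplicates that are correct
recall = tp / (tp + fn)     # share of true duplicates that are found

print([[tn, fp], [fn, tp]])
print(accuracy, precision, recall)
```

The nested list is printed in the same layout that confusion_matrix uses for labels 0 and 1: rows are true labels and columns are predicted labels.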

# **BiLSTM Model:**

In [None]:
# Define the BiLSTM model
bilstm_model = Sequential()
bilstm_model.add(Bidirectional(LSTM(128), input_shape=(X_train_lstm.shape[1], 1)))
bilstm_model.add(Dense(1, activation='sigmoid'))

In [None]:
# Compile the model
bilstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the BiLSTM model
hist = bilstm_model.fit(X_train_lstm, y_train, epochs=10, batch_size=64,validation_data=(X_test_lstm, y_test))

In [None]:
# Plot the training history
plt.plot(hist.history['accuracy'])
plt.plot(hist.history['val_accuracy'])
plt.title('BiLSTM Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

plt.plot(hist.history['loss'])
plt.plot(hist.history['val_loss'])
plt.title('BiLSTM Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend(['Train', 'Validation'], loc='upper right')
plt.show()

In [None]:
# Predict on the test data
y_pred_bilstm = bilstm_model.predict(X_test_lstm)

# Convert the predicted values to 1D array
y_pred_bilstm = np.squeeze(y_pred_bilstm)

In [None]:
# Threshold the predicted probabilities at 0.5.
# y_test already holds 0/1 labels, so y_test_binary is identical to y_test.
threshold = 0.5
y_test_binary = np.array(y_test > threshold, dtype=int)
y_pred_bilstm = np.array(y_pred_bilstm > threshold, dtype=int)

In [None]:
# Calculate accuracy
accuracy = accuracy_score(y_test_binary, y_pred_bilstm)
print("Accuracy:", accuracy)

In [None]:
# Calculate the confusion matrix
cm = confusion_matrix(y_test_binary, y_pred_bilstm)
print("Confusion Matrix:")
print(cm)

# Create a heatmap of the confusion matrix
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()

In [None]:
# Calculate precision and recall
precision = precision_score(y_test, y_pred_bilstm)
recall = recall_score(y_test, y_pred_bilstm)

print("Precision:", precision)
print("Recall:", recall)

In [None]:
# Generate classification report
report = classification_report(y_test, y_pred_bilstm)
print(report)

The improved results of the BiLSTM classifier can plausibly be attributed to the bidirectional LSTM, which reads the input sequence in both directions and so captures information from both past and future contexts. Here the sequence is the 222-element feature vector fed one value per time step, so the backward pass lets the model draw on features from both ends of the vector. The two models were also trained for different numbers of epochs (20 for the LSTM, 10 for the BiLSTM), so the comparison is not fully controlled.
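The idea of reading a sequence in both directions can be sketched in a few lines of numpy. The example below deliberately swaps the LSTM cell for a plain tanh recurrent cell to stay short, and the helper run_rnn, the random weights, and the toy sequence are all assumptions for illustration. It mirrors the default behaviour of the Keras Bidirectional wrapper, which concatenates the forward and backward outputs.

```python
import numpy as np

rng = np.random.default_rng(0)


def run_rnn(sequence, w_in, w_rec):
    """Run a plain tanh recurrent cell over `sequence` and return the final state."""
    state = np.zeros(w_rec.shape[0])
    for x_t in sequence:
        state = np.tanh(w_in @ x_t + w_rec @ state)
    return state


units = 3
toy_sequence = rng.normal(size=(5, 1))  # 5 time steps, 1 feature per step

# Each direction has its own set of weights.
fwd_in, fwd_rec = rng.normal(size=(units, 1)), rng.normal(size=(units, units))
bwd_in, bwd_rec = rng.normal(size=(units, 1)), rng.normal(size=(units, units))

forward_state = run_rnn(toy_sequence, fwd_in, fwd_rec)
backward_state = run_rnn(toy_sequence[::-1], bwd_in, bwd_rec)

# Concatenating both final states doubles the size passed to the next layer.
bidirectional_state = np.concatenate([forward_state, backward_state])
print(bidirectional_state.shape)
```

With 128 units per direction, as in the model above, the dense output layer therefore receives 256 values per sample instead of 128.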

# **Machine Learning Models: RandomForestClassifier and XGBClassifier:**

# 1. Random Forest Classifier:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
rf = RandomForestClassifier(n_estimators=250,criterion='gini',n_jobs=4)
rf.fit(X_train_stdscaled,y_train)

# Calculate train accuracy
y_train_pred = rf.predict(X_train_stdscaled)
train_accuracy = accuracy_score(y_train, y_train_pred)
print("Train Accuracy:", train_accuracy)

# Calculate test accuracy
y_test_pred = rf.predict(X_test_stdscaled)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Accuracy:", test_accuracy)

In [None]:
# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_test_pred)
print("Confusion Matrix:")
print(cm)

# Create a heatmap of the confusion matrix
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()

In [None]:
# Calculate precision and recall
precision = precision_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)

print("Precision:", precision)
print("Recall:", recall)

In [None]:
# Generate classification report
report = classification_report(y_test, y_test_pred)
print(report)

Overall, the Random Forest classifier achieves a relatively high test accuracy and captures a good number of true positives and true negatives. However, the training accuracy is significantly higher than the test accuracy, so overfitting should be considered. Tuning hyperparameters such as max_depth or min_samples_leaf, or exploring other algorithms, might be necessary to further improve the performance and reduce the overfitting.

# 2. XGBoost Classifier:

In [None]:
from xgboost import XGBClassifier
# eta is an alias of learning_rate in XGBoost, so only learning_rate is set here
xgb = XGBClassifier(n_estimators=200, n_jobs=4, learning_rate=0.1)
xgb.fit(X_train_stdscaled,y_train)

# Calculate train accuracy
y_train_pred = xgb.predict(X_train_stdscaled)
train_accuracy = accuracy_score(y_train, y_train_pred)
print("Train Accuracy:", train_accuracy)


# Calculate test accuracy
y_test_pred1 = xgb.predict(X_test_stdscaled)
test_accuracy = accuracy_score(y_test, y_test_pred1)
print("Test Accuracy:", test_accuracy)

In [None]:
# Calculate the confusion matrix
cm = confusion_matrix(y_test, y_test_pred1)
print("Confusion Matrix:")
print(cm)

# Create a heatmap of the confusion matrix
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()

In [None]:
# Calculate precision and recall
precision = precision_score(y_test, y_test_pred1)
recall = recall_score(y_test, y_test_pred1)

print("Precision:", precision)
print("Recall:", recall)

In [None]:
# Generate classification report
report = classification_report(y_test, y_test_pred1)
print(report)

Overall, the XGBoost classifier achieves a relatively high test accuracy and captures a good number of true positives and true negatives. It performs slightly better than the Random Forest classifier in terms of accuracy. However, similar to the Random Forest classifier, fine-tuning the model or exploring other algorithms might be necessary to further improve the performance and address any potential limitations.