# BOW Model Only:

We will build a basic Bag-of-Words (BOW) model, which turns text into numerical features based on word frequency while ignoring word order and sentence structure. We will then train traditional classifiers to predict whether a pair of questions is a duplicate and evaluate their accuracy.
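
To make the "disregards word order" point concrete, here is a minimal sketch using two made-up sentences (they are not from the dataset). The names `toy_sentences`, `toy_cv`, and `toy_counts` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences with the same words in a different order
toy_sentences = ["the dog chased the cat", "the cat chased the dog"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(toy_sentences).toarray()

# Recover the vocabulary in column order from the fitted term -> index mapping
toy_vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(toy_vocab)
print(toy_counts)

# Both rows are identical because BOW keeps counts and discards order
same_vector = bool((toy_counts[0] == toy_counts[1]).all())
print(same_vector)
```

The two sentences mean different things but receive the same feature vector, which is the main limitation to keep in mind when reading the scores later in this notebook.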


In [2]:
# Import all standard libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
# Load the training data
df = pd.read_csv("/content/train.csv")
df.shape

(404290, 6)

In [4]:
df.head()

,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [5]:
# Work with a random sample of 30,000 rows
new_df = df.sample(30000)
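
The sample above is drawn without a seed, so each run selects different rows and the scores further down will vary slightly. A minimal sketch of a repeatable sample follows; `demo_df` is a small stand-in frame and the seed value 42 is an arbitrary assumption, not something the notebook uses.

```python
import pandas as pd

# Small stand-in frame; the notebook's real `df` comes from train.csv
demo_df = pd.DataFrame({"id": range(100), "is_duplicate": [0, 1] * 50})

# Fixing random_state (42 is an arbitrary choice) makes the sample repeatable
sample_a = demo_df.sample(10, random_state=42)
sample_b = demo_df.sample(10, random_state=42)

# The same rows are selected both times
print(sample_a.index.equals(sample_b.index))
```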


In [6]:
# Count missing values per column

new_df.isnull().sum()

id              0
qid1            0
qid2            0
question1       0
question2       0
is_duplicate    0
dtype: int64

In [7]:
# Duplicate rows
new_df.duplicated().sum()


0

In [8]:
ques_df = new_df[['question1','question2']]
ques_df.head()


,question1,question2
136430,Which STDs can or can't be tested for?,"In an arranged marriage scenario, is it okay t..."
238430,How can I tell the girl I'm dating about my la...,"If you could go back in time, what would you t..."
311009,Was the pilot of Sherlock’s plan Irene Adler?,Why is Sherlock so special ?
190265,Let's say humans did exist at the time of dino...,What was Donald Trump referring to during the ...
327775,How can we download torrents now after the ban...,Can I download movies through YTS in India?


In [9]:
from sklearn.feature_extraction.text import CountVectorizer
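# Combine both question columns so the vectorizer learns one shared vocabulary
questions = list(ques_df['question1']) + list(ques_df['question2'])

# Keep the 3,000 most frequent terms, then split the matrix back into
# its question1 half and its question2 half
cv = CountVectorizer(max_features=3000)
q1_arr, q2_arr = np.vsplit(cv.fit_transform(questions).toarray(),2)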
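
The cell above converts the count matrix to a dense array, which for 60,000 questions and 3,000 terms holds 180 million mostly zero entries. As a hedged alternative, the sketch below keeps the counts sparse and places the two halves side by side with `scipy.sparse.hstack`. The question strings and the `demo_*` names are made up for illustration; this is not the approach the notebook takes.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Made-up question pairs standing in for ques_df
demo_q1 = ["how do I learn python", "what is machine learning"]
demo_q2 = ["how can I learn python fast", "what is deep learning"]

demo_cv = CountVectorizer(max_features=3000)
# Fit on both columns together so they share one vocabulary
counts = demo_cv.fit_transform(demo_q1 + demo_q2)  # SciPy sparse matrix

# First half of the rows belongs to question1, second half to question2
n_pairs = len(demo_q1)
q1_sparse = counts[:n_pairs]
q2_sparse = counts[n_pairs:]

# One row per pair, with 2 * vocabulary-size columns
pair_features = hstack([q1_sparse, q2_sparse]).tocsr()
print(pair_features.shape)
```

Tree-based estimators in scikit-learn accept sparse input, so a matrix like `pair_features` could in principle be passed to `fit` directly, with the label kept in a separate array instead of a DataFrame column.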


In [10]:
temp_df1 = pd.DataFrame(q1_arr, index= ques_df.index)
temp_df2 = pd.DataFrame(q2_arr, index= ques_df.index)
temp_df = pd.concat([temp_df1, temp_df2], axis=1)
temp_df.shape


(30000, 6000)

In [11]:
temp_df


,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
136430,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
238430,0,0,0,0,0,0,0,0,0,0,...,0,2,0,0,1,0,0,0,0,0
311009,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
190265,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
327775,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
273203,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
167205,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
266081,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
202236,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
temp_df['is_duplicate'] = new_df['is_duplicate']


In [13]:
temp_df.head()


,0,1,2,3,4,5,6,7,8,9,...,2991,2992,2993,2994,2995,2996,2997,2998,2999,is_duplicate
136430,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
238430,0,0,0,0,0,0,0,0,0,0,...,2,0,0,1,0,0,0,0,0,0
311009,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
190265,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
327775,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [14]:
from sklearn.model_selection import train_test_split
# Features: every column except the last; label: the final is_duplicate column
X_train, X_test, y_train, y_test = train_test_split(temp_df.iloc[:, 0:-1].values, temp_df.iloc[:, -1].values, test_size=0.2, random_state=1)
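
The split above is purely random, so the share of duplicate pairs can differ a little between the training and test sets. One optional variation is to pass `stratify` so both splits keep the same class ratio. The sketch below uses synthetic data with an assumed 30% positive rate, which is not the notebook's actual class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 30% positives (not the notebook's data)
demo_X = np.arange(200).reshape(100, 2)
demo_y = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=1, stratify=demo_y
)

# With stratify, both splits keep roughly the same positive rate
print(y_tr.mean(), y_te.mean())
```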


In [15]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
accuracy_score(y_test,y_pred)


0.7496666666666667

In [16]:
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train,y_train)
y_pred = xgb.predict(X_test)
accuracy_score(y_test,y_pred)


0.7296666666666667
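
Accuracy alone is hard to interpret without a reference point. The sketch below shows one way to compare a model against a majority-class baseline and to look at the confusion matrix and F1 score. The labels and predictions are hypothetical values chosen for illustration; they are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical labels and predictions, for illustration only
demo_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
demo_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Baseline: always predict the most frequent class
majority_class = np.bincount(demo_true).argmax()
baseline_acc = accuracy_score(demo_true, np.full_like(demo_true, majority_class))

model_acc = accuracy_score(demo_true, demo_pred)
model_f1 = f1_score(demo_true, demo_pred)
cm = confusion_matrix(demo_true, demo_pred)  # rows: true class, columns: predicted class

print(baseline_acc, model_acc, model_f1)
print(cm)
```

In the notebook, `demo_true` and `demo_pred` would correspond to `y_test` and `y_pred`, so the same calls could be applied to either classifier.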

# Conclusion:

In the "BOW Model Only" phase, we used CountVectorizer to represent the question text numerically and predicted whether each question pair is a duplicate. On the 20% hold-out split of a 30,000-row sample, the RandomForestClassifier reached an accuracy of about 75.0% and the XGBClassifier about 73.0%. The BOW model gives a reasonable baseline, and in the upcoming phases we'll enhance it with more advanced techniques to improve accuracy.