The code below serves as a baseline: it measures model accuracy on raw bag-of-words features, without any text preprocessing.

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

In [2]:
df = pd.read_csv('train.csv')

In [3]:
df.shape

(404290, 6)

In [4]:
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [5]:
df.isnull().sum()

id              0
qid1            0
qid2            0
question1       1
question2       2
is_duplicate    0
dtype: int64

In [6]:
df.dropna(subset=['question1', 'question2'], inplace = True)

In [7]:
df.isnull().sum()

id              0
qid1            0
qid2            0
question1       0
question2       0
is_duplicate    0
dtype: int64

In [9]:
df.duplicated().sum()

np.int64(0)

In [16]:
# random subsample of 60,000 rows; no random_state is set, so the sample differs between runs
new_df = df.sample(60000)

In [21]:
new_df.shape

(60000, 6)

In [26]:
new_df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
354914,354914,484094,351474,What are mental health disabilities?,What is mental health?,0
324469,324469,26610,32452,How can I increase my vocabulary?,How can I improve my English vocabulary?,1
7446,7446,14547,14548,Who made you realize you were gay?,When did you first realize that you were gay/l...,1
203405,203405,306023,306024,Does chondromalacia platellae ever go away com...,Does chondromalacia platella ever go away comp...,1
366001,366001,178913,496142,What can I do to find peace in my life?,What is a peaceful life?,0


In [17]:
#creating a df with only our question columns
ques_df = new_df[['question1', 'question2']]
ques_df.sample(6)

Unnamed: 0,question1,question2
240248,Are there hidden apps on my phone?,Is possible to open a hidden app on an Android...
155277,How can we earn money online without investment?,How can i make money online easily?
349666,Is there any incident happen in your life whic...,What is the spice savory and how is it manufac...
229607,What are some characteristics less developed c...,What are the characteristics of developed coun...
7858,How are we enabled to see the black colour?,How we could see black colour which has no wav...
211616,Why do North Indians mispronounce so many Engl...,Why aren't majority of north Indians learning ...


In [11]:
from sklearn.feature_extraction.text import CountVectorizer

In [18]:
#stack both question columns into one list: all question1 rows first, then all question2 rows
questions = list(ques_df['question1']) + list(ques_df['question2'])
questions

['What are mental health disabilities?',
 'How can I increase my vocabulary?',
 'Who made you realize you were gay?',
 'Does chondromalacia platellae ever go away completely?',
 'What can I do to find peace in my life?',
 'How do you print a family tree from Ancestry.com?',
 'Does ਬੱਬਰ ਸ਼ੇਰ killing a Bengal tiger in a fair fight defy logic?',
 'What kind of hat is this?',
 'What is the difference between scripting and programming?',
 'Is there another way too help with sleep apnea without using a cpap machine?',
 'How can we change the educational system?',
 'Why is Hrithik Roshan called the "Greek God of Bollywood"?',
 'During work experience I met a guy.He sits all the time next to me,he calls me beautiful and I like him. Is he just friendly?',
 'How do I deal with stage fright?',
 'Filial piety and humaneness?',
 'Why is Saltwater taffy candy imported in Germany?',
 'Do cell phones cause cancer? If not, how did that rumor start?',
 'My face has gained a lot of fat .How do I reduce i

We convert both question columns to lists and concatenate them so that a single CountVectorizer is fitted on all the questions. This puts both columns in the same feature space.

Otherwise, each column would be vectorized separately and get its own vocabulary, so the same column index could stand for different words in question1 and question2. Fitting on the combined list gives both questions exactly the same representation.

So the same word, say "python", maps to the same column position for both question1 and question2.
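
To make the shared feature space concrete, here is a minimal sketch on toy data. The lists q1 and q2 and the names shared_cv, cv1 and cv2 are made up for illustration and are not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the question1 / question2 columns (hypothetical data).
q1 = ["Is python easy to learn?", "How do I cook rice?"]
q2 = ["Why is python so popular?", "What is the best way to cook rice?"]

# One vectorizer fitted on both columns yields a single shared vocabulary.
shared_cv = CountVectorizer()
shared = shared_cv.fit_transform(q1 + q2).toarray()
col = shared_cv.vocabulary_["python"]

# Rows 0-1 come from q1 and rows 2-3 from q2;
# "python" is counted in the same column in both halves.
print(shared[0, col], shared[2, col])

# Two separately fitted vectorizers build separate vocabularies,
# so the column index for "python" need not match.
cv1 = CountVectorizer().fit(q1)
cv2 = CountVectorizer().fit(q2)
print(cv1.vocabulary_["python"], cv2.vocabulary_["python"])
```

With separate vocabularies, column i of one matrix and column i of the other would usually describe different words, which is what fitting on the combined list avoids.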

In [19]:
cv = CountVectorizer(max_features = 3000)

In [20]:
q1_arr, q2_arr = np.vsplit(cv.fit_transform(questions).toarray(), 2)

np.vsplit splits the array vertically (row-wise) into equal parts. The second argument, 2, is the number of parts.

Because the questions list holds all question1 rows first and then all question2 rows, the first half of the matrix becomes q1_arr and the second half becomes q2_arr.
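
As a small illustration of the split, the sketch below uses a tiny made-up array (stacked) in place of the real 120000 x 3000 matrix.

```python
import numpy as np

# Six rows stand in for the stacked matrix:
# rows 0-2 would come from question1, rows 3-5 from question2.
stacked = np.arange(12).reshape(6, 2)

top, bottom = np.vsplit(stacked, 2)
print(top.shape, bottom.shape)

# An odd number of rows cannot be divided into 2 equal parts,
# so np.vsplit raises a ValueError instead of splitting unevenly.
try:
    np.vsplit(np.arange(10).reshape(5, 2), 2)
except ValueError as err:
    print("ValueError:", err)
```

The split is purely positional, so it only recovers the two question columns correctly because both lists have the same length and were concatenated in that order.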

In [23]:
q1_arr.shape, q2_arr.shape

((60000, 3000), (60000, 3000))

In [24]:
temp_df1 = pd.DataFrame(q1_arr, index = ques_df.index)
temp_df2 = pd.DataFrame(q2_arr, index = ques_df.index)
temp_df = pd.concat([temp_df1, temp_df2], axis = 1)
temp_df.shape

(60000, 6000)

In [25]:
temp_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
354914,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
324469,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7446,0,0,0,0,0,0,0,0,0,0,...,0,2,0,0,0,0,0,0,0,0
203405,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
366001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
204434,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
220556,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
117570,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
85463,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [27]:
#adding is duplicate column
temp_df['is_duplicate'] = new_df['is_duplicate']

In [28]:
temp_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2991,2992,2993,2994,2995,2996,2997,2998,2999,is_duplicate
354914,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
324469,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
7446,0,0,0,0,0,0,0,0,0,0,...,2,0,0,0,0,0,0,0,0,1
203405,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
366001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
from sklearn.model_selection import train_test_split 
X_train, X_test, Y_train, Y_test = train_test_split(temp_df.iloc[:, 0:-1], temp_df.iloc[:, -1].values, test_size = 0.2, random_state = 1)

.values returns the labels as a plain NumPy array, so the DataFrame index is not carried along. Only the values remain.
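
A quick way to see the difference is sketched below, using a hypothetical three-row frame (toy) whose index mimics the shuffled row labels of the sample.

```python
import pandas as pd

# Hypothetical frame with a shuffled index, like the sampled rows in new_df.
toy = pd.DataFrame(
    {"feature": [10, 20, 30], "is_duplicate": [0, 1, 1]},
    index=[354914, 324469, 7446],
)

labels_series = toy.iloc[:, -1]        # pandas Series: keeps the row index
labels_array = toy.iloc[:, -1].values  # NumPy array: positions only

print(type(labels_series).__name__, list(labels_series.index))
print(type(labels_array).__name__, labels_array.tolist())
```

The pandas documentation recommends .to_numpy() as the modern equivalent of .values; both return the same array here.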

In [30]:
#modelling experiments
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier()
rf.fit(X_train, Y_train)
y_pred = rf.predict(X_test)
accuracy_rf = accuracy_score(Y_test, y_pred)

In [31]:
print(accuracy_rf)

0.7570833333333333


In [45]:
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train.values,Y_train)
y_pred = xgb.predict(X_test.values)
accuracy_score(Y_test,y_pred)

0.7366666666666667

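With XGBoost (and many other ML libraries), passing .values is a safe practice: the model receives plain NumPy arrays, so the DataFrame's index and column labels play no role. That matters here because temp_df has duplicate column labels (0 to 2999 appear once per question), and XGBoost's feature-name validation can reject non-unique names.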
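
The sketch below shows where the repeated column labels come from when two vectorized halves are concatenated side by side, and one possible alternative to .values. The arrays q1_small and q2_small and the q1_ / q2_ prefixes are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for q1_arr and q2_arr (2 rows, 3 vocabulary columns each).
q1_small = np.array([[1, 0, 2], [0, 1, 0]])
q2_small = np.array([[0, 0, 1], [1, 1, 0]])

left = pd.DataFrame(q1_small)
right = pd.DataFrame(q2_small)
combined = pd.concat([left, right], axis=1)

# Labels 0, 1, 2 appear twice, once per question.
print(combined.columns.tolist())
print(combined.columns.duplicated().any())

# Hypothetical alternative: prefix each half so every column
# gets a unique string name before concatenating.
named = pd.concat([left.add_prefix("q1_"), right.add_prefix("q2_")], axis=1)
print(named.columns.tolist())
```

Unique names like these would also make feature importances easier to read, at the cost of one extra step before the concat.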