## Bag-of-words features only, with Random Forest and XGBoost

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('train.csv')

In [3]:
df.shape

(404290, 6)

In [10]:
new_df = df.sample(30000)  # randomly sample 30k rows because the PC can't handle the whole dataset

In [11]:
new_df.dropna(inplace=True)

In [12]:
new_df.isnull().sum()

id              0
qid1            0
qid2            0
question1       0
question2       0
is_duplicate    0
dtype: int64

In [13]:
# new DataFrame holding only the two question columns
ques_df = new_df[['question1', 'question2']]
ques_df.head()

index,question1,question2
292959,Can I find or track my lost mobile device usin...,How can I locate my cell phone with the phone ...
401894,How can I get a monthly Yojana magazine?,Where can I purchase the Yojana magazine in Mu...
93648,I am really unhappy with my height. I am a 27 ...,My father's height is 5.9 and my mother's heig...
283579,What do kindergartners learn?,What should be the top things for a kindergart...
253384,Are Canadians really that nice?,Are Canadians smarter than Americans?


**How CountVectorizer works:**  
Tokenization: It splits each document into words, or tokens.  
Counting: It counts the occurrences of each token in each document.  
Vectorization: It transforms the counts into a vector representation, where each column corresponds to a specific token and each row corresponds to a document.

e.g.  
documents = [
    "I love machine learning",
    "Machine learning is fun",
    "Learning is a continuous process"
]  

vocabulary = ['continuous' 'fun' 'is' 'learning' 'love' 'machine' 'process']

vectors = [[0 0 0 1 1 1 0]
 [0 1 1 1 0 1 0]
 [1 0 1 1 0 0 1]]

With the default settings, text is lowercased and single-character tokens such as "I" and "a" are dropped, so they do not appear in the vocabulary.


In [15]:
from sklearn.feature_extraction.text import CountVectorizer
# merge texts
questions = list(ques_df['question1']) + list(ques_df['question2'])

cv = CountVectorizer(max_features=3000)  # bag-of-words representation, keeping the 3,000 most frequent tokens
# fit and transform all 60k questions, then split the rows in half: 30k vectors per array
q1_arr, q2_arr = np.vsplit(cv.fit_transform(questions).toarray(), 2)
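Calling `.toarray()` turns the sparse count matrix into a dense one, which is where most of the memory goes. A possible alternative, sketched here on two toy question pairs, is to keep everything sparse and join the two halves with `scipy.sparse.hstack`. The names `q1_texts`, `q2_texts`, `sparse_cv` and `pair_features` are hypothetical, and this is not what the notebook itself does.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the question1 / question2 columns (hypothetical data)
q1_texts = ["how do I learn python", "what is machine learning"]
q2_texts = ["best way to learn python", "is machine learning hard"]

sparse_cv = CountVectorizer(max_features=3000)
stacked = sparse_cv.fit_transform(q1_texts + q2_texts)  # sparse CSR matrix

# First half of the rows belongs to question1, second half to question2
n_pairs = len(q1_texts)
q1_sparse = stacked[:n_pairs]
q2_sparse = stacked[n_pairs:]

# Place the two halves side by side without ever building a dense array
pair_features = hstack([q1_sparse, q2_sparse]).tocsr()
n_vocab = len(sparse_cv.vocabulary_)
print(pair_features.shape)  # (number of pairs, 2 * vocabulary size)
```

scikit-learn's `RandomForestClassifier` accepts sparse matrices as input, so a matrix like `pair_features` could be passed to `fit` directly; whether this is enough to train on the full dataset on a given machine is untested here.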

In [16]:
# converting to dataframe
temp_df1 = pd.DataFrame(q1_arr, index=ques_df.index)
temp_df2 = pd.DataFrame(q2_arr, index=ques_df.index)

In [17]:
temp_df = pd.concat([temp_df1, temp_df2], axis=1)
temp_df.shape

(30000, 6000)

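The 60k questions (30k from question1 followed by 30k from question2) were vectorized together, then split down the middle into q1_arr and q2_arr.  
Each half was converted to a DataFrame, and the two were concatenated column-wise.

Therefore, the result has 30k rows, one per question pair, and 6,000 columns (3,000 per question).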
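The same stack, split and concatenate pattern can be traced on a tiny example, which makes the shapes easier to follow. This is a sketch on made-up data; `toy_df` and the other `toy_*` names are hypothetical stand-ins for `ques_df` and the arrays above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for ques_df, with non-default index labels like a sampled frame
toy_df = pd.DataFrame(
    {
        "question1": ["how do I learn python", "what is machine learning"],
        "question2": ["best way to learn python", "is machine learning hard"],
    },
    index=[10, 20],
)

# Stack question1 on top of question2 so both share one vocabulary
toy_questions = list(toy_df["question1"]) + list(toy_df["question2"])
toy_cv = CountVectorizer()
toy_matrix = toy_cv.fit_transform(toy_questions).toarray()

# Split the rows in half: top half is question1, bottom half is question2
toy_q1, toy_q2 = np.vsplit(toy_matrix, 2)

# Reuse the original index so both halves line up row by row
toy_features = pd.concat(
    [pd.DataFrame(toy_q1, index=toy_df.index),
     pd.DataFrame(toy_q2, index=toy_df.index)],
    axis=1,
)

n_vocab = len(toy_cv.vocabulary_)
print(toy_matrix.shape)    # (2 * number of pairs, n_vocab)
print(toy_features.shape)  # (number of pairs, 2 * n_vocab)
```

Passing `index=toy_df.index` matters: it is what later lets a label column be assigned by index alignment rather than by position.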

In [18]:
temp_df.head()

index,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
292959,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
401894,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
93648,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
283579,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
253384,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
temp_df['is_duplicate'] = new_df['is_duplicate']

In [20]:
temp_df.head()

index,0,1,2,3,4,5,6,7,8,9,...,2991,2992,2993,2994,2995,2996,2997,2998,2999,is_duplicate
292959,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
401894,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
93648,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
283579,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
253384,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Our data is ready now. Let's train the models.

In [21]:
from sklearn.model_selection import train_test_split
# features: every column except the last; target: the last column (is_duplicate)
X_train, X_test, y_train, y_test = train_test_split(temp_df.iloc[:, 0:-1].values, temp_df.iloc[:, -1].values,
                                                    test_size=.2, random_state=42)

In [22]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier()
rf.fit(X_train, y_train)

In [23]:
y_pred = rf.predict(X_test)
accuracy_score(y_test, y_pred)

0.7408333333333333

In [24]:
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train, y_train)


In [25]:
y_pred = xgb.predict(X_test)
accuracy_score(y_test, y_pred)

0.7283333333333334
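Accuracy values like the two above are easier to judge next to a trivial baseline that always predicts the most frequent class. The sketch below shows the comparison on synthetic data, not on the Quora sample, so its numbers say nothing about the results above. `X_demo`, `y_demo` and the other names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary features; the label is 1 only when the first two features are both 1,
# which gives an imbalanced target (roughly a quarter positives)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(500, 20))
y_demo = X_demo[:, 0] & X_demo[:, 1]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline: always predict the majority class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
forest_acc = accuracy_score(y_te, forest.predict(X_te))
print(f"baseline: {baseline_acc:.3f}  random forest: {forest_acc:.3f}")
```

Applied to this notebook, the same `DummyClassifier` could be fitted on `X_train`, `y_train` and scored on `X_test` to see how much the bag-of-words models gain over always guessing the majority class.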