<a href="https://colab.research.google.com/github/TanmayMehta-ml/Quora-Question-Pairs/blob/main/onlyBOW.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Bag of Words (BoW) with Random Forest and XGBoost

## Introduction
This notebook classifies question pairs as duplicate or not, using a Bag of Words (BoW) representation with Random Forest and XGBoost classifiers. It works on a random sample of 40,000 pairs drawn from the full training set of 404,290 pairs. Each model is scored by its accuracy on a held-out test split.

## Loading Dataset
The question pairs and their `is_duplicate` labels are loaded from `train.csv`. Because of computational constraints, a random sample of 40,000 rows is drawn from the full dataset. Rows with missing values (NA) are then dropped.

## Preprocessing
Preprocessing is light. Rows containing NA values are dropped and the index is reset, so that the feature rows built later line up with their labels. The cleaned sample is also checked for exact duplicate rows, and none are found.
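
The sampling and cleaning steps can be sketched on a toy frame. This is a minimal illustration, not the notebook's own cell: the `toy` data and the `random_state=42` seed are assumptions added here so that the sample is repeatable, whereas the notebook samples without a seed.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical rows; the real file has 404,290)
toy = pd.DataFrame({
    "question1": ["How do I learn Python?", None, "What is AI?", "Why is the sky blue?"],
    "question2": ["How can I learn Python?", "What is ML?", "What is A.I.?", None],
    "is_duplicate": [1, 0, 1, 0],
})

# A fixed random_state makes the draw repeatable (the seed value is arbitrary)
sample = toy.sample(n=3, random_state=42)

# Drop rows with a missing question, then renumber the rows from 0
clean = sample.dropna(axis=0).reset_index(drop=True)

print(clean.isnull().sum().sum())  # number of remaining missing values
```

Resetting the index matters because the label column is later attached by index, so the features and labels must share the same 0..n-1 row numbering.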

## Bag of Words (BoW) Representation
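Each question is converted into a numerical vector with a Bag of Words (BoW) representation: a fixed-size vector in which each element is the count of one vocabulary word in that question. The vocabulary is capped at the 3,000 most frequent words across all questions, and the vectors for `question1` and `question2` are concatenated, giving 6,000 features per pair.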
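
As a small illustration of the counting, the sketch below applies `CountVectorizer` to two made-up sentences. The sentences and the name `cv_demo` are assumptions for this example only. With default settings, tokens are lowercased and single-character tokens are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up questions (not taken from the dataset)
docs = ["what is the best phone", "which phone is the best phone to buy"]

cv_demo = CountVectorizer()
counts = cv_demo.fit_transform(docs).toarray()

# vocabulary_ maps each word to its column index; columns are in alphabetical order
vocab = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)
print(vocab)
print(counts)
```

Each row is one question and each column one vocabulary word. The word "phone" occurs twice in the second sentence, so its entry in the second row is 2.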

## Model Training and Evaluation
Two classifiers, Random Forest and XGBoost, are trained with default hyperparameters on 80% of the BoW feature matrix and evaluated on the remaining 20%. The evaluation metric is accuracy, the fraction of test pairs whose label is predicted correctly.
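
To make the metric concrete, the sketch below computes accuracy by hand and with `accuracy_score` on made-up labels. The arrays are illustrative, and the majority-class baseline is an addition not present in the notebook, included only as a reference point for reading an accuracy figure.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions (1 = duplicate, 0 = not duplicate)
y_true = np.array([1, 0, 0, 1, 1, 0, 0, 0])
y_hat = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# Accuracy is the share of positions where prediction and label agree
manual = (y_true == y_hat).mean()
library = accuracy_score(y_true, y_hat)

# Accuracy of always predicting the more common class (added for comparison)
majority = max(y_true.mean(), 1 - y_true.mean())

print(manual, library, majority)
```

When one class is more common than the other, the majority-class figure shows how much of a model's accuracy could be obtained without looking at the features at all.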

## Results
Random Forest is compared with XGBoost on the same BoW features. In the run recorded below, Random Forest reaches a test accuracy of 0.746 and XGBoost 0.7335. Because the 40,000-row sample is drawn without a fixed seed, these figures will vary slightly from run to run.

The aim is to demonstrate BoW features with Random Forest and XGBoost for question similarity classification on a subset of the original dataset, using both models as they come by default.


### Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

### Dataset Preprocessing

In [2]:
# Assumes Google Drive is already mounted at /content/drive (Colab)
os.chdir('/content/drive/MyDrive/Quora Dataset')
df = pd.read_csv('train.csv')

In [3]:
df.shape

(404290, 6)

In [4]:
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [5]:
# Random 40,000-row sample; no random_state is set, so each run draws different rows
new_df = df.sample(40000)

In [6]:
new_df.isnull().sum()

id              0
qid1            0
qid2            0
question1       0
question2       0
is_duplicate    0
dtype: int64

In [7]:
# Drop rows with NA values in any column
new_df.dropna(axis=0, inplace=True)
new_df.reset_index(drop=True, inplace=True)

In [8]:
new_df.isnull().sum()

id              0
qid1            0
qid2            0
question1       0
question2       0
is_duplicate    0
dtype: int64

In [9]:
new_df.duplicated().sum()

0

In [10]:
ques_df = new_df[['question1','question2']]
ques_df.head()

Unnamed: 0,question1,question2
0,What's the best smartphone I can get under rs....,Which is the best smartphone to buy for 15000 ...
1,What is it like to have Indian friends?,What is like to be an Indian and not have any ...
2,What should be my approach and CAT percentile ...,I secured 90.36 percentile in CAT 2016. Will I...
3,What do Chinese people think about the Indians?,What do the Chinese people think about India?
4,How can you not care what others think?,How can I learn to not care what others think?


In [11]:
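from sklearn.feature_extraction.text import CountVectorizer
# Stack all question1 texts followed by all question2 texts so both share one vocabulary
questions = list(ques_df['question1']) + list(ques_df['question2'])

# Keep the 3,000 most frequent tokens, then split the stacked matrix back into q1 and q2 halves
cv = CountVectorizer(max_features=3000)
q1_arr, q2_arr = np.vsplit(cv.fit_transform(questions).toarray(),2)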

In [12]:
temp_df1 = pd.DataFrame(q1_arr, index= ques_df.index)
temp_df2 = pd.DataFrame(q2_arr, index= ques_df.index)
temp_df = pd.concat([temp_df1, temp_df2], axis=1)
temp_df.shape

(40000, 6000)
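
The cells above call `.toarray()`, which materializes a dense matrix of 80,000 × 3,000 = 240 million counts, almost all of them zero. A possible lower-memory variant, sketched here on toy data, keeps the matrix sparse and joins the two halves with `scipy.sparse.hstack`. The names `q1`, `q2`, `cv_sparse` and `X_sparse` are hypothetical, and this is an alternative, not the notebook's method.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy question pairs (made up for illustration)
q1 = ["how do i learn python", "what is ai"]
q2 = ["how can i learn python fast", "what is machine learning"]

cv_sparse = CountVectorizer(max_features=3000)
bow = cv_sparse.fit_transform(q1 + q2)  # sparse matrix, one row per question

# First half of the rows are question1, second half question2; join them side by side
n = len(q1)
X_sparse = hstack([bow[:n], bow[n:]]).tocsr()

# Check against the dense vsplit/concatenate route used in the notebook
dense = np.hstack(np.vsplit(bow.toarray(), 2))
same = np.array_equal(X_sparse.toarray(), dense)
print(X_sparse.shape, same)
```

scikit-learn's `RandomForestClassifier` accepts sparse input of this kind, so the sparse matrix could be passed to `train_test_split` and `fit` in place of the dense array. Whether this is faster on the full sample is not measured here.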

In [13]:
temp_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2990,2991,2992,2993,2994,2995,2996,2997,2998,2999
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
# Labels are matched by index; both frames use the reset 0..n-1 index
temp_df['is_duplicate'] = new_df['is_duplicate']

In [15]:
temp_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2991,2992,2993,2994,2995,2996,2997,2998,2999,is_duplicate
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


### Model Training

In [16]:
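from sklearn.model_selection import train_test_split
# Features: every column except the last; label: the last column (is_duplicate); 80/20 split
X, y = temp_df.iloc[:, :-1].values, temp_df.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)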

In [17]:
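from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Default hyperparameters; no random_state, so the score can differ slightly between runs
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
# Fraction of test pairs classified correctly
accuracy_score(y_test,y_pred)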

0.746125

In [18]:
from xgboost import XGBClassifier
# Default hyperparameters, trained and scored on the same split as the Random Forest
xgb = XGBClassifier()
xgb.fit(X_train,y_train)
y_pred = xgb.predict(X_test)
accuracy_score(y_test,y_pred)

0.7335