# Apply Bag of Words with ensemble techniques and analyze performance

##### Dataset: <a href='https://www.kaggle.com/competitions/quora-question-pairs/data'>Quora Question Pairs</a>

## Table of Contents
<ul>
    <li><a href="#start">Let's get started</a></li>
    <li><a href="#gather">Gather</a></li>
    <li><a href="#bow">Bag of Words</a></li>
    <li><a href="#rf">Random Forest</a></li>
    <li><a href="#xgb">XGBoost</a></li>
    <li><a href="#analyze">Analyze</a></li>
    <li><a href="#conclusion">Conclusion</a></li>
</ul>

<a id='start'></a>
### Let's get started

In [1]:
import numpy as np
import pandas as pd

import os

<a id='gather'></a>
### Gather

In [2]:
df = pd.read_csv(os.path.join('data', 'preprocessed', 'train.csv'))
df.shape

(404287, 7)

<a id='bow'></a>
### Bag of Words

In [3]:
questions = df['question1'].tolist() + df['question2'].tolist()
len(questions)

808574

In [4]:
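from sklearn.feature_extraction.text import CountVectorizer
# Keep only the 3000 most frequent tokens across all questions.
cv = CountVectorizer(max_features=3000)
# Rows are stacked as [question1..., question2...], so two equal halves recover each side.
question1, question2 = np.vsplit(cv.fit_transform(questions).toarray(), 2)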

In [5]:
bag_of_words_df = pd.concat([pd.DataFrame(question1), pd.DataFrame(question2)], axis=1)
bag_of_words_df.shape

(404287, 6000)
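
To make the stacking and splitting step easier to follow, here is a minimal sketch on two made-up question pairs. The toy questions and the small `max_features=5` are assumptions chosen for illustration only; the calls mirror the ones used above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy question pairs (invented for illustration, not taken from the dataset).
q1 = ["how do i learn python", "what is machine learning"]
q2 = ["how can i learn python fast", "what is deep learning"]

# Fit one shared vocabulary on all questions: q1 rows first, q2 rows second.
cv = CountVectorizer(max_features=5)
counts = cv.fit_transform(q1 + q2).toarray()

# Split the stacked matrix back into its q1 half and its q2 half.
q1_counts, q2_counts = np.vsplit(counts, 2)

# Put the halves side by side: one row per question pair.
pair_features = np.hstack([q1_counts, q2_counts])
print(sorted(cv.vocabulary_))
print(pair_features.shape)
```

Each row of `pair_features` holds the word counts of the first question followed by the word counts of the second, which is the same layout as `bag_of_words_df`.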

In [6]:
bag_of_words_df.head()

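|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 2990 | 2991 | 2992 | 2993 | 2994 | 2995 | 2996 | 2997 | 2998 | 2999 |
|---|---|---|---|---|---|---|---|---|---|---|-----|------|------|------|------|------|------|------|------|------|------|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |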


In [7]:
bag_of_words_df['is_duplicate'] = df['is_duplicate'].values
bag_of_words_df.head()

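|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 2991 | 2992 | 2993 | 2994 | 2995 | 2996 | 2997 | 2998 | 2999 | is_duplicate |
|---|---|---|---|---|---|---|---|---|---|---|-----|------|------|------|------|------|------|------|------|------|--------------|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |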


In [8]:
# q1 columns are numbered 0-2999; q2 columns continue at 3000-5999 (their position in the frame).
bag_of_words_df.columns = ['q1_' + str(i) for i in range(3000)] + ['q2_' + str(i) for i in range(3000, 6000)] + ['is_duplicate']

In [9]:
bag_of_words_df.head()

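|   | q1_0 | q1_1 | q1_2 | q1_3 | q1_4 | q1_5 | q1_6 | q1_7 | q1_8 | q1_9 | ... | q2_5991 | q2_5992 | q2_5993 | q2_5994 | q2_5995 | q2_5996 | q2_5997 | q2_5998 | q2_5999 | is_duplicate |
|---|------|------|------|------|------|------|------|------|------|------|-----|---------|---------|---------|---------|---------|---------|---------|---------|---------|--------------|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |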


In [10]:
from sklearn.model_selection import train_test_split
# The last column is the label; hold out 20% of the pairs as a test set.
X_train, X_test, y_train, y_test = train_test_split(bag_of_words_df.iloc[:,:-1], bag_of_words_df.iloc[:,-1], test_size=0.2, random_state=42)

<a id="rf"></a>
### Random Forest

In [11]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_jobs=-1)
rf.fit(X_train, y_train)

In [12]:
from sklearn.metrics import accuracy_score
y_pred = rf.predict(X_test)
accuracy_score(y_test, y_pred)

0.8102475945484677

<a id="xgb"></a>
### XGBoost

In [13]:
from xgboost import XGBClassifier
xgb = XGBClassifier(n_jobs=-1, verbosity=2)
xgb.fit(X_train, y_train)

[03:23:28] INFO: C:/buildkite-agent/builds/buildkite-windows-cpu-autoscaling-group-i-030221e36e1a46bfb-1/xgboost/xgboost-ci-windows/src/tree/updater_prune.cc:98: tree pruning end, 94 extra nodes, 0 pruned nodes, max_depth=6
[03:23:37] INFO: C:/buildkite-agent/builds/buildkite-windows-cpu-autoscaling-group-i-030221e36e1a46bfb-1/xgboost/xgboost-ci-windows/src/tree/updater_prune.cc:98: tree pruning end, 98 extra nodes, 0 pruned nodes, max_depth=6
[03:23:45] INFO: C:/buildkite-agent/builds/buildkite-windows-cpu-autoscaling-group-i-030221e36e1a46bfb-1/xgboost/xgboost-ci-windows/src/tree/updater_prune.cc:98: tree pruning end, 96 extra nodes, 0 pruned nodes, max_depth=6
[03:23:54] INFO: C:/buildkite-agent/builds/buildkite-windows-cpu-autoscaling-group-i-030221e36e1a46bfb-1/xgboost/xgboost-ci-windows/src/tree/updater_prune.cc:98: tree pruning end, 82 extra nodes, 0 pruned nodes, max_depth=6
[03:24:02] INFO: C:/buildkite-agent/builds/buildkite-windows-cpu-autoscaling-group-i-030221e36e1a46bfb-1

In [14]:
y_pred = xgb.predict(X_test)
accuracy_score(y_test, y_pred)

0.7465556902223651

In [15]:
X_train.head()

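|   | q1_0 | q1_1 | q1_2 | q1_3 | q1_4 | q1_5 | q1_6 | q1_7 | q1_8 | q1_9 | ... | q2_5990 | q2_5991 | q2_5992 | q2_5993 | q2_5994 | q2_5995 | q2_5996 | q2_5997 | q2_5998 | q2_5999 |
|---|------|------|------|------|------|------|------|------|------|------|-----|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| 174949 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 119442 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 252941 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13551 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 274896 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |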


<a id="analyze"></a>
### Analyze

- Random Forest outperforms XGBoost on the Bag of Words features: test accuracy is about 81.0% versus about 74.7%.
- Both models were trained with their default hyperparameters, so the gap may change with tuning.
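
The comparison above can be sketched as a small helper that fits several models on the same split and reports test accuracy for each. The helper name `compare_accuracy`, the synthetic data, and the majority-class baseline are assumptions added for illustration; they are not part of the original notebook. Any estimator with `fit` and `predict`, such as `XGBClassifier`, could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def compare_accuracy(models, X_train, X_test, y_train, y_test):
    """Fit each model on the same split and return its test accuracy."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in data; the notebook would pass its own split instead.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    # Always predicts the most frequent class: a floor for any real model.
    "majority_baseline": DummyClassifier(strategy="most_frequent"),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}
scores = compare_accuracy(models, X_tr, X_te, y_tr, y_te)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Reporting a majority-class baseline next to the models helps put an accuracy figure in context, since accuracy alone depends on how balanced the labels are.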

<a id="conclusion"></a>
### Conclusion

An accuracy of 81% is a good baseline, but to improve the model further we should add custom features alongside the Bag of Words features.
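
As a rough idea of what such custom features might look like, the sketch below computes a few simple pair-level statistics. The function name `pair_features` and the specific features (length difference, shared words, word share) are hypothetical examples, since the notebook does not say which features it has in mind.

```python
def pair_features(q1: str, q2: str) -> dict:
    """Hypothetical hand-crafted features for one question pair."""
    # Lowercase and split on whitespace; punctuation stays attached to words.
    words1 = set(q1.lower().split())
    words2 = set(q2.lower().split())
    common = len(words1 & words2)
    total = len(words1) + len(words2)
    return {
        "len_diff": abs(len(q1) - len(q2)),   # difference in character length
        "common_words": common,               # unique words present in both
        "total_words": total,                 # unique words in q1 plus q2
        "word_share": common / total if total else 0.0,
    }


example = pair_features("How do I learn Python?", "How can I learn Python quickly?")
print(example)
```

Columns like these could be concatenated with the Bag of Words columns before the train/test split, so that both models see them as additional inputs.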