# Applying ensemble techniques to bag of words with advanced features

### Table of Contents
<ul>
    <li><a href="#start">Let's get started</a></li>
    <li><a href="#gather">Gather</a></li>
    <li><a href="#rf">Random Forest</a></li>
    <li><a href="#xgb">XGBoost</a></li>
    <li><a href="#sm">Save Models</a></li>
    <li><a href="#con">Conclusion</a></li>
</ul>

<a id='start'></a>
### Let's get started

In [1]:
import pandas as pd
import numpy as np

import os

<a id='gather'></a>
### Gather

In [2]:
df = pd.read_csv(os.path.join('data', 'preprocessed', 'custom_train.csv'), index_col=0)
df.shape

(404284, 5023)

In [3]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; the fixed seed keeps the split reproducible.
X, y = df.drop('is_duplicate', axis=1), df['is_duplicate']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
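
The split above is purely random. If the `is_duplicate` labels are imbalanced, passing `stratify` keeps the class ratio the same in both splits. The following is a minimal sketch on synthetic data, not the notebook's own data: the array names and the 30% positive rate are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, roughly 30% positive labels.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.3).astype(int)

# stratify=y_demo preserves the label ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

full_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"full={full_rate:.3f} train={train_rate:.3f} test={test_rate:.3f}")
```

In the notebook itself, the equivalent change would be adding `stratify=df['is_duplicate']` to the existing call. Note that this would change the split and therefore the reported scores.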

<a id='rf'></a>
### Random Forest

In [4]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_jobs=-1)
rf.fit(X_train, y_train)

In [5]:
from sklearn.metrics import accuracy_score
y_pred = rf.predict(X_test)
accuracy_score(y_test, y_pred)

0.8256922715411158
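
Accuracy alone can hide how a classifier behaves on each class. As a hedged sketch, the snippet below computes a confusion matrix, F1 score, and log loss alongside accuracy. It trains a small forest on synthetic data from `make_classification`; the names `rf_demo`, `X_demo`, and `y_demo` and all parameter values are illustrative assumptions, not the notebook's model or data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, log_loss
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with a mild class imbalance.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.65, 0.35], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42, n_jobs=-1)
rf_demo.fit(X_tr, y_tr)

y_hat = rf_demo.predict(X_te)                 # hard class labels
y_proba = rf_demo.predict_proba(X_te)[:, 1]   # probability of the positive class

metrics = {
    "accuracy": accuracy_score(y_te, y_hat),
    "f1": f1_score(y_te, y_hat),
    "log_loss": log_loss(y_te, y_proba),
}
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat)
print(metrics)
print(cm)
```

The same calls would apply to the notebook's `rf`, `X_test`, and `y_test` unchanged.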

<a id='xgb'></a>
### XGBoost

In [6]:
import warnings
warnings.filterwarnings('ignore')

from xgboost import XGBClassifier
xgb = XGBClassifier(n_jobs=-1, verbosity=2)
xgb.fit(X_train, y_train)

[02:44:54] INFO: C:/buildkite-agent/builds/buildkite-windows-cpu-autoscaling-group-i-030221e36e1a46bfb-1/xgboost/xgboost-ci-windows/src/tree/updater_prune.cc:98: tree pruning end, 122 extra nodes, 0 pruned nodes, max_depth=6
[02:45:01] INFO: C:/buildkite-agent/builds/buildkite-windows-cpu-autoscaling-group-i-030221e36e1a46bfb-1/xgboost/xgboost-ci-windows/src/tree/updater_prune.cc:98: tree pruning end, 124 extra nodes, 0 pruned nodes, max_depth=6
[02:45:09] INFO: C:/buildkite-agent/builds/buildkite-windows-cpu-autoscaling-group-i-030221e36e1a46bfb-1/xgboost/xgboost-ci-windows/src/tree/updater_prune.cc:98: tree pruning end, 120 extra nodes, 0 pruned nodes, max_depth=6
[02:45:16] INFO: C:/buildkite-agent/builds/buildkite-windows-cpu-autoscaling-group-i-030221e36e1a46bfb-1/xgboost/xgboost-ci-windows/src/tree/updater_prune.cc:98: tree pruning end, 120 extra nodes, 0 pruned nodes, max_depth=6
[02:45:23] INFO: C:/buildkite-agent/builds/buildkite-windows-cpu-autoscaling-group-i-030221e36e1a46b

In [7]:
y_pred = xgb.predict(X_test)
accuracy_score(y_test, y_pred)

0.8063742162088625

<a id='sm'></a>
### Save Models

In [10]:
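import joblib
# Create the output directory first; joblib.dump fails if it does not exist.
os.makedirs('models', exist_ok=True)
joblib.dump(rf, os.path.join('models', 'rf.pkl'))
joblib.dump(xgb, os.path.join('models', 'xgb.pkl'))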

['models\\xgb.pkl']
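
A saved model is only useful if it loads back and predicts the same way. The sketch below round-trips a small forest through `joblib.dump` and `joblib.load` and compares predictions. It uses a temporary directory and synthetic data; the file name `rf_demo.pkl` and the variable names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in model trained on synthetic data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
rf_demo = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Write the model to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'rf_demo.pkl')
    joblib.dump(rf_demo, model_path)
    rf_loaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = np.array_equal(rf_demo.predict(X_demo), rf_loaded.predict(X_demo))
print(same_predictions)
```

Pickled models are generally tied to the library versions that wrote them, so it is worth loading them in an environment with the same scikit-learn and XGBoost versions used for training.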

<a id='con'></a>
### Conclusion

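Training ensemble models on the bag-of-words features combined with the added advanced features improved accuracy. With both models left at their default hyperparameters, Random Forest scored higher than XGBoost on the held-out 20% test split.

Accuracy
- 82.6% - Random Forest
- 80.6% - XGBoost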