# Tree-based Models

Authored by [JumpThanawut](https://github.com/orgs/Datatouille/people/JumpThanawut); Edited by [cstorm125](https://github.com/cstorm125/)

Tree-based models are strong baselines for most supervised learning tasks, so they are a sensible place to start. They come with handy characteristics: they do not require standardized features, they can handle categorical variables, and they ensemble well. This notebook will get you started on training a default-parameter decision tree, random forest and gradient boosted tree.

In [None]:
# #uncomment if you are running from Google Colab
# !wget https://github.com/Datatouille/snaplogic_snap_recommendation/archive/master.zip; unzip master
# !mv snaplogic_snap_recommendation-master/* .
# !ls

In [None]:
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook
import matplotlib.pyplot as plt
from utils import *

## Load Data

In [None]:
train_df = pd.read_csv("dataset/train_df.csv")
valid_df = pd.read_csv("dataset/valid_df.csv")
submit_df = pd.read_csv("dataset/submit_df.csv")
all_df = pd.concat([train_df, valid_df, submit_df], axis=0).reset_index(drop=True)
train_df.shape, valid_df.shape, submit_df.shape

In [None]:
train_df.head()

In [None]:
all_df.describe()

## Feature Engineering

All our features are discrete, so we one-hot encode them before feeding them to the model.

In [None]:
#one hot encode the features
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(categories=[np.arange(1,5),
                                np.arange(501),np.arange(501),np.arange(501),np.arange(501),
                                np.arange(1,581),np.arange(1,231)])

feature_cols = ['org','prev_snap_1','prev_snap_2','prev_snap_3','prev_snap_4','project','user']
train_x = train_df[feature_cols].values
enc_fit = enc.fit(train_x)
train_x = enc_fit.transform(train_x)
train_y = train_df["target_snap"].values.astype(str)
valid_x = enc_fit.transform(valid_df[feature_cols].values)
valid_y = valid_df["target_snap"].values.astype(str)
train_x.shape, train_y.shape, valid_x.shape, valid_y.shape
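
To make the effect of the explicit `categories` argument concrete, here is a minimal sketch on a toy array. The two columns and their value ranges are made up for illustration and are not the dataset's. Fixing the categories up front sets the output width in advance, so values that never appear in the fitting data still get their own column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy features: column 0 takes values 1..3, column 1 takes values 0..4
toy_x = np.array([[1, 0],
                  [3, 4],
                  [2, 4]])

# explicit categories fix the output width to 3 + 5 = 8 columns,
# even though values 1, 2 and 3 of column 1 never appear in toy_x
toy_enc = OneHotEncoder(categories=[np.arange(1, 4), np.arange(5)])
toy_onehot = toy_enc.fit_transform(toy_x)  # sparse matrix by default

print(toy_onehot.shape)      # (3, 8)
print(toy_onehot.toarray())  # each row has exactly two 1s, one per feature
```

The same idea applies to the encoder above: its output width is determined by the listed ranges, not by which values happen to occur in `train_df`.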

## Models

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf = clf.fit(train_x, train_y)
print(f'Accuracy: {clf.score(valid_x,valid_y)}') 
print(f'Top-5 Accuracy: {score_topk(clf,valid_x,valid_y,k=5)}')
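
`score_topk` comes from the repository's `utils` module and is not shown in this notebook. As a rough idea of what a top-k accuracy measures, the sketch below counts a prediction as correct when the true label is among the k highest-probability classes. The function name `topk_accuracy_sketch` and the toy inputs are hypothetical, and the actual helper may be implemented differently.

```python
import numpy as np

def topk_accuracy_sketch(probs, classes, y_true, k=5):
    """Share of rows whose true label is among the k most probable classes.

    probs   : (n_samples, n_classes) array, e.g. from clf.predict_proba
    classes : (n_classes,) array of labels, e.g. clf.classes_
    y_true  : (n_samples,) array of true labels
    """
    probs = np.asarray(probs)
    classes = np.asarray(classes)
    y_true = np.asarray(y_true)
    # column indices of the k largest probabilities in each row
    topk_idx = np.argsort(probs, axis=1)[:, -k:]
    topk_labels = classes[topk_idx]
    # a row is a hit if any of its k candidate labels equals the true label
    hits = (topk_labels == y_true[:, None]).any(axis=1)
    return hits.mean()

# toy example with 3 classes and 3 rows
toy_probs = np.array([[0.6, 0.3, 0.1],
                      [0.2, 0.5, 0.3],
                      [0.1, 0.2, 0.7]])
toy_classes = np.array(['a', 'b', 'c'])
toy_y = np.array(['a', 'c', 'a'])

print(topk_accuracy_sketch(toy_probs, toy_classes, toy_y, k=1))  # 1 of 3 rows correct
print(topk_accuracy_sketch(toy_probs, toy_classes, toy_y, k=2))  # 2 of 3 rows correct
```

Recent scikit-learn versions (0.24 and later) also provide `sklearn.metrics.top_k_accuracy_score`, which may be a convenient cross-check.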

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(train_x, train_y)
print(f'Accuracy: {clf.score(valid_x,valid_y)}') 
print(f'Top-5 Accuracy: {score_topk(clf,valid_x,valid_y,k=5)}')

### Gradient Boosted Tree

In [None]:
from lightgbm.sklearn import LGBMClassifier
clf = LGBMClassifier(boosting_type='gbdt', num_leaves=31, n_estimators=10,
                objective='ovr', num_class=486)
clf = clf.fit(train_x, train_y)
probs = clf.predict_proba(valid_x)

In [None]:
print(f'Accuracy: {clf.score(valid_x,valid_y)}')
print(f'Top-5 Accuracy: {score_topk(clf,valid_x,valid_y,k=5)}')

## Evaluation

With 486 target classes, it is almost impossible to diagnose how well your model performs by looking at a confusion matrix as you normally would. Using the decision tree and random forest classifiers, we provide some ideas for model evaluation.

In [None]:
clf_tree = DecisionTreeClassifier()
clf_tree = clf_tree.fit(train_x, train_y)
print(f'Accuracy: {clf_tree.score(valid_x,valid_y)}') 
print(f'Top-5 Accuracy: {score_topk(clf_tree,valid_x,valid_y,k=5)}')

In [None]:
clf_forest = RandomForestClassifier(n_estimators=10)
clf_forest = clf_forest.fit(train_x, train_y)
print(f'Accuracy: {clf_forest.score(valid_x,valid_y)}') 
print(f'Top-5 Accuracy: {score_topk(clf_forest,valid_x,valid_y,k=5)}')

We can see that while the decision tree has higher validation accuracy, it has lower top-5 validation accuracy. To see how the number of top-k suggestions affects model performance, we plot the accuracy of each model at each k. According to the top-k-vs-accuracy plot, the random forest outperforms the decision tree in all cases except when k=1.
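
One plausible reason for this pattern, offered as an assumption rather than something verified on this dataset: a fully grown decision tree tends to end in pure leaves, so `predict_proba` puts all of its mass on a single class and gives no useful ranking beyond the first guess, whereas a forest averages several trees and spreads probability over multiple classes. The sketch below illustrates this on synthetic data from `make_classification`; the data and all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# synthetic 5-class problem, unrelated to the snap dataset
toy_x, toy_y = make_classification(n_samples=500, n_features=10, n_informative=6,
                                   n_classes=5, random_state=0)
fit_x, fit_y, held_x = toy_x[:400], toy_y[:400], toy_x[400:]

toy_tree = DecisionTreeClassifier(random_state=0).fit(fit_x, fit_y)
toy_forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(fit_x, fit_y)

def mean_nonzero_classes(probs):
    # average number of classes per row that receive non-zero probability
    return (probs > 0).sum(axis=1).mean()

tree_spread = mean_nonzero_classes(toy_tree.predict_proba(held_x))
forest_spread = mean_nonzero_classes(toy_forest.predict_proba(held_x))
print(f'Tree: {tree_spread:.2f} classes per row, Forest: {forest_spread:.2f} classes per row')
```

When the tree assigns probability to only one class per row, every k beyond 1 is filled with arbitrary zero-probability classes, which would explain a flat top-k curve.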

In [None]:
#accuracy curve
accs_tree = []
accs_forest = []
for i in tqdm_notebook(range(1,101)):
    accs_tree.append(score_topk(clf_tree,valid_x,valid_y,k=i))
    accs_forest.append(score_topk(clf_forest,valid_x,valid_y,k=i))

In [None]:
#full curve for k=1..100
print(f'Area under top-k-vs-accuracy line; Tree: {sum(accs_tree)}, Forest: {sum(accs_forest)}')
plt.plot(range(1,101), accs_tree, label='Tree')
plt.plot(range(1,101), accs_forest, label='Forest')
plt.legend()

In [None]:
#zoom in on top 10
print(f'Area under top-k-vs-accuracy line; Tree: {sum(accs_tree[:10])}, Forest: {sum(accs_forest[:10])}')
plt.plot(range(1,11), accs_tree[:10], label='Tree')
plt.plot(range(1,11), accs_forest[:10], label='Forest')
plt.legend()
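
The sums printed above grow with the number of k values included, so the full-range and top-10 figures are not directly comparable. One option, sketched here with made-up accuracy lists, is to divide by the number of k values, which gives the mean top-k accuracy on a 0-to-1 scale. The helper name `mean_topk_accuracy` is hypothetical.

```python
def mean_topk_accuracy(accs, max_k=None):
    """Average of the first max_k top-k accuracies (all of them if max_k is None)."""
    accs = list(accs) if max_k is None else list(accs)[:max_k]
    return sum(accs) / len(accs)

# made-up curves for illustration only; index 0 corresponds to k=1
toy_accs_tree = [0.5, 0.5, 0.5, 0.5]
toy_accs_forest = [0.25, 0.5, 0.75, 1.0]

# over all four k values
print(mean_topk_accuracy(toy_accs_tree), mean_topk_accuracy(toy_accs_forest))
# restricted to k=1 and k=2
print(mean_topk_accuracy(toy_accs_tree, max_k=2), mean_topk_accuracy(toy_accs_forest, max_k=2))
```

With the real curves, this would be called as `mean_topk_accuracy(accs_tree)` and `mean_topk_accuracy(accs_tree, max_k=10)`.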