<a href="https://colab.research.google.com/github/NickBrecht/model_performance/blob/master/RF%20vs.%20XGB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Model Performance**

We'll import and organize the data, then run some simple tests on classifiers from sklearn, xgboost, and lightgbm.

In [0]:
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.metrics import accuracy_score
from lightgbm import LGBMClassifier


In [0]:
iris = datasets.load_iris()

import pandas as pd

x=pd.DataFrame({
    'sepal length':iris.data[:,0],
    'sepal width':iris.data[:,1],
    'petal length':iris.data[:,2],
    'petal width':iris.data[:,3]
})

In [0]:
x_train, x_test, y_train, y_test = train_test_split(x, iris.target, test_size=0.3)

### Models

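We'll use n_estimators=100 for every model (this is already the LightGBM default, so it isn't passed explicitly there) and keep the other parameters at their defaults. We'll set n_jobs=-1 wherever the estimator supports it.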

In [0]:
ran = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=310)
gbc = GradientBoostingClassifier(n_estimators=100, random_state=310)  # GradientBoostingClassifier has no n_jobs parameter
xgb = XGBClassifier(n_estimators=100, n_jobs=-1, random_state=310)
lgbm = LGBMClassifier(objective='multiclass', random_state=310, n_jobs=-1)

### Testing

In [5]:
%timeit -n 100 ran.fit(x_train,y_train)

100 loops, best of 3: 149 ms per loop


In [6]:
%timeit -n 100 gbc.fit(x_train,y_train)

100 loops, best of 3: 166 ms per loop


In [7]:
%timeit -n 100 xgb.fit(x_train,y_train)

100 loops, best of 3: 26.7 ms per loop


In [8]:
%timeit -n 100 lgbm.fit(x_train,y_train)

100 loops, best of 3: 29.6 ms per loop


### XGB fits in a bit under 1/5th the time of Random Forest (26.7 ms vs. 149 ms). LGBM is nearly as fast at 29.6 ms.
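
Outside a notebook, where the `%timeit` magic isn't available, a similar measurement can be sketched with the standard `timeit` module. The helper name `time_fit`, the small `number`/`repeat` values, and the fixed split seed are assumptions made for illustration. Only the sklearn model is timed here to keep the example free of extra dependencies, and the numbers will differ from the ones recorded above.

```python
import timeit

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def time_fit(model, x, y, number=3, repeat=3):
    """Best average seconds per fit over `repeat` runs of `number` fits each.

    Hypothetical helper: mirrors the "best of N" figure that %timeit reports.
    """
    totals = timeit.repeat(lambda: model.fit(x, y), number=number, repeat=repeat)
    return min(totals) / number


iris = datasets.load_iris()
# random_state=0 is an arbitrary choice so the split is reproducible
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=310)
best = time_fit(rf, x_train, y_train)
print(f"Random Forest: {best * 1000:.1f} ms per fit")
```

The same helper should work for any estimator with a `fit(x, y)` method, so the XGBoost and LightGBM models could be passed in the same way.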

### Accuracy

In [9]:
print(f"Random Forest accuracy:{accuracy_score(y_test, ran.predict(x_test))}")
print(f"Normal Gradient Boosting accuracy:{accuracy_score(y_test, gbc.predict(x_test))}")
print(f"XGB accuracy:{accuracy_score(y_test, xgb.predict(x_test))}")
print(f"LGBM accuracy:{accuracy_score(y_test, lgbm.predict(x_test))}")

Random Forest accuracy:0.9555555555555556
Normal Gradient Boosting accuracy:0.9555555555555556
XGB accuracy:0.9555555555555556
LGBM accuracy:0.9555555555555556


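I found random_state to have a large impact on the models. With 310, they all score the same: 0.9556 is 43 of the 45 test samples classified correctly. Random Forest did slightly better with lower random_state values. Keep in mind that train_test_split is called without a random_state, so the split itself changes from run to run, and with only 45 test samples a single misclassification moves accuracy by about 2.2 percentage points. Overall, the models' accuracy scores are in the same ballpark.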
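
One way to see how much of the variation comes from the seed is to repeat the split and fit over several seeds and look at the spread. The sketch below is only an illustration under a few assumptions: the helper `accuracy_over_seeds` and the choice of five seeds are hypothetical, the same seed is used for both the split and the model, and only the two sklearn models are included to avoid extra dependencies.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
seeds = range(5)  # arbitrary seeds chosen for illustration


def accuracy_over_seeds(make_model, seeds):
    """Fit a fresh model per seed and return the test accuracies as an array."""
    scores = []
    for seed in seeds:
        # The seed controls both the train/test split and the model itself
        x_tr, x_te, y_tr, y_te = train_test_split(
            iris.data, iris.target, test_size=0.3, random_state=seed
        )
        model = make_model(seed).fit(x_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(x_te)))
    return np.array(scores)


results = {
    "Random Forest": accuracy_over_seeds(
        lambda s: RandomForestClassifier(n_estimators=100, random_state=s), seeds
    ),
    "Gradient Boosting": accuracy_over_seeds(
        lambda s: GradientBoostingClassifier(n_estimators=100, random_state=s), seeds
    ),
}

for name, scores in results.items():
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the standard deviation across seeds is comparable to the gap between models, the differences seen with any single seed are probably not meaningful.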