### Random Forest Exercise

------------------

In [1]:
# import pandas and the scikit-learn tools used below
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix

In [2]:
# list for column headers
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# load data
df = pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv", names=names)

Spend some time exploring the dataset.
- head
- shape

In [3]:
df.head()

   preg  plas  pres  skin  test  mass   pedi  age  class
0     6   148    72    35     0  33.6  0.627   50      1
1     1    85    66    29     0  26.6  0.351   31      0
2     8   183    64     0     0  23.3  0.672   32      1
3     1    89    66    23    94  28.1  0.167   21      0
4     0   137    40    35   168  43.1  2.288   33      1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   preg    768 non-null    int64  
 1   plas    768 non-null    int64  
 2   pres    768 non-null    int64  
 3   skin    768 non-null    int64  
 4   test    768 non-null    int64  
 5   mass    768 non-null    float64
 6   pedi    768 non-null    float64
 7   age     768 non-null    int64  
 8   class   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
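`df.info()` reports no nulls, but the `head` output already shows zeros in `skin` and `test`. A zero is not a plausible measurement for those columns, so it probably stands in for a missing value (an assumption about how this dataset was encoded). A minimal sketch of how to count such zeros, using the five rows printed above as a hand-built stand-in for `df`; the `suspect` list is an illustrative choice:

```python
import pandas as pd

# The five rows printed by df.head(), rebuilt by hand so the sketch runs offline.
sample = pd.DataFrame({
    'preg': [6, 1, 8, 1, 0],
    'plas': [148, 85, 183, 89, 137],
    'pres': [72, 66, 64, 66, 40],
    'skin': [35, 29, 0, 23, 35],
    'test': [0, 0, 0, 94, 168],
    'mass': [33.6, 26.6, 23.3, 28.1, 43.1],
    'pedi': [0.627, 0.351, 0.672, 0.167, 2.288],
    'age': [50, 31, 32, 21, 33],
    'class': [1, 0, 1, 0, 1],
})

# Columns where a value of 0 is assumed to mean "not measured".
suspect = ['plas', 'pres', 'skin', 'test', 'mass']

# Count the zeros in each suspect column.
zero_counts = (sample[suspect] == 0).sum()
print(zero_counts)
```

Running the same two lines on the full `df` would show how widespread the zeros are before any modelling decision is made.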


* create X and y (the goal is to predict the **class** column from the other variables)

In [5]:
df.shape

(768, 9)

In [6]:
df.columns

Index(['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'], dtype='object')

* split data set into a train set and test set

In [7]:
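# features: every column except the target
X = df[['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']]
# target column
y = df['class']

# hold out 30% of the rows for testing (no random_state, so the split differs between runs)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)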
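The split above is reshuffled on every run, so the scores further down will not reproduce exactly. A hedged sketch of a reproducible, class-balanced split follows; the synthetic data from `make_classification`, the 65/35 class weights, and the seed `42` are placeholders, not part of the original exercise:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the exercise: 768 rows, 8 features,
# and an imbalanced target (the 65/35 ratio is assumed for illustration).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=0
)

# random_state fixes the shuffle; stratify keeps the class ratio equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

With 768 rows and `test_size=0.3`, the test set has 231 rows, which matches the total of the confusion matrix printed in Part 1.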

------------------------
#### Part 1: Setting up the Random Forest Classifier
* import RandomForestClassifier from sklearn. It is worth spending some time with the documentation of this classifier to get familiar with the available parameters.

* create model

In [8]:
clf = RandomForestClassifier(n_estimators=100)

* fit training set with default parameters

In [9]:
clf.fit(X_train, y_train)

RandomForestClassifier()

* predict X_test

In [10]:
pred = clf.predict(X_test)

* import roc_auc_score and confusion_matrix from sklearn

* print confusion matrix

In [11]:
print(confusion_matrix(y_test, pred))

[[123  21]
 [ 42  45]]
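To read the matrix: scikit-learn puts the true labels in rows and the predicted labels in columns, so for a binary target the layout is `[[TN, FP], [FN, TP]]`. A short sketch that derives the usual rates from the numbers printed above (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true class, columns = predicted class.
cm = np.array([[123, 21],
               [42, 45]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all test rows classified correctly
recall = tp / (tp + fn)           # share of actual positives that were found
specificity = tn / (tn + fp)      # share of actual negatives that were found
precision = tp / (tp + fp)        # share of predicted positives that are correct

print(f"accuracy={accuracy:.3f} recall={recall:.3f} "
      f"specificity={specificity:.3f} precision={precision:.3f}")
```

The model finds only about half of the positive cases (45 of 87), which the overall accuracy of roughly 0.73 hides.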


* print AUC

In [12]:
# pred holds hard 0/1 labels, so this is the AUC at a single threshold
print(roc_auc_score(y_test, pred))

0.6857040229885057
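Because `pred` holds hard 0/1 labels, `roc_auc_score(y_test, pred)` reduces the ROC curve to a single threshold, and the result equals the average of recall and specificity (balanced accuracy): (45/87 + 123/144) / 2 ≈ 0.6857, the value printed above. AUC is more commonly computed from predicted probabilities. A sketch on synthetic placeholder data (the data, seeds, and variable names are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data.
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

hard = model.predict(X_te)                # 0/1 labels
proba = model.predict_proba(X_te)[:, 1]   # estimated probability of class 1

auc_hard = roc_auc_score(y_te, hard)             # one threshold only
bal_acc = balanced_accuracy_score(y_te, hard)    # mean of recall and specificity
auc_proba = roc_auc_score(y_te, proba)           # uses the full ranking of test rows

print(auc_hard, bal_acc, auc_proba)
```

The first two printed values coincide, while the third reflects how well the forest ranks the test rows across all thresholds.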


----------------------------------
#### Part 2: Using a Grid Search
- import GridSearchCV from sklearn

* create grid (optimize for number of trees and max depth in one tree)

In [13]:
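# candidate values for the number of trees and the maximum tree depth
num_trees = [100, 150, 200, 250, 300, 400]
max_depth = [5, 10, 15, None]  # None places no limit on depth
param_dict = {'n_estimators': num_trees, 'max_depth': max_depth}

# cross-validated search over all 24 combinations, using every CPU core
clf = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_dict, n_jobs=-1)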


* fit training data with grid search

In [14]:
clf.fit(X_train, y_train)

GridSearchCV(estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'max_depth': [5, 10, 15, None],
                         'n_estimators': [100, 150, 200, 250, 300, 400]})

* print confusion matrix with the best model

In [15]:
# mean cross-validated accuracy of the best parameter combination
print('Best score:', clf.best_score_)

Best score: 0.7802526825891313
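Note that `best_score_` is not directly comparable to the AUC printed in Part 1: with no `scoring` argument, `GridSearchCV` ranks candidates by the classifier's default score (accuracy), averaged over the cross-validation folds of the training set. If the aim is to tune for AUC, a sketch along these lines could work; the reduced grid and the synthetic data are placeholders to keep it quick:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Smaller grid than in the exercise, only to keep the example fast.
small_grid = {'n_estimators': [20, 50], 'max_depth': [5, None]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    scoring='roc_auc',   # rank candidates by cross-validated AUC instead of accuracy
    cv=5,                # number of folds, stated explicitly
)
search.fit(X_demo, y_demo)

print('Best CV AUC:', search.best_score_)
print('Best parameters:', search.best_params_)
```

With `scoring='roc_auc'`, the search scores each fold from predicted probabilities rather than hard labels.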


In [19]:
# run this while clf is still the fitted GridSearchCV; a plain
# RandomForestClassifier has no best_estimator_ and raises AttributeError
print('Best n_estimators:', clf.best_estimator_.n_estimators)
print('Best max_depth:', clf.best_estimator_.max_depth)

* print AUC with the best model

In [26]:
# refit a forest with the parameters picked from the grid search
clf = RandomForestClassifier(n_estimators=250, max_depth=None)
clf.fit(X_train, y_train)


RandomForestClassifier(n_estimators=250)

In [27]:
pred = clf.predict(X_test)
print(roc_auc_score(y_test, pred))

0.7029454022988505
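Retyping the winning parameters by hand is not strictly necessary. With the default `refit=True`, `GridSearchCV` retrains the best combination on the full training set and exposes it as `best_estimator_`, and calling `predict` on the search object uses that model. A sketch under those assumptions, again on placeholder data with a reduced grid, that also prints the confusion matrix the exercise asks for:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the diabetes data.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# Keep the search under its own name so a later model cannot overwrite it.
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid={'n_estimators': [20, 50], 'max_depth': [5, None]},
)
search.fit(X_tr, y_tr)

best_model = search.best_estimator_   # already refit on all of X_tr
best_pred = best_model.predict(X_te)
cm_best = confusion_matrix(y_te, best_pred)

print(search.best_params_)
print(cm_best)
print(roc_auc_score(y_te, best_model.predict_proba(X_te)[:, 1]))
```

Giving the search its own variable name (`search` here, a hypothetical choice) also avoids the situation where `clf` is reassigned and `best_estimator_` is no longer available.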


- is the model better than default?

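Not really. The test AUC rose only from 0.686 to 0.703, and because neither `train_test_split` nor `RandomForestClassifier` was given a fixed `random_state`, a difference this small may well be run-to-run noise.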