# <font color='#31394d'> Random Forests Practice Exercise </font>

In this exercise we'll use the famous <a href="https://archive.ics.uci.edu/ml/datasets/iris" target="_blank">Iris dataset</a> to predict the species of an iris flower with a random forest classifier. Begin by importing the necessary libraries and loading the Iris dataset from scikit-learn.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={'figure.figsize':(6,6)}) 
import warnings
warnings.simplefilter("ignore")

%matplotlib inline

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
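# SCORERS was removed in scikit-learn 1.3; get_scorer_names() lists the valid `scoring` strings
from sklearn.metrics import get_scorer_names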

In [2]:
data = datasets.load_iris()

# for display purposes
iris = pd.DataFrame(data.data, columns=data.feature_names)
iris["target"] = data.target
iris.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
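
The `target` column holds integer class codes rather than species names. As a quick check before modelling, the sketch below maps the codes to the names stored in `data.target_names` and counts the flowers per species. The variable name `species` is only illustrative and is not used elsewhere in the notebook.

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()

# Map integer codes (0, 1, 2) to the species names shipped with the dataset
code_to_name = dict(enumerate(data.target_names))
species = pd.Series(data.target).map(code_to_name)

# Number of flowers per species
print(species.value_counts())
```

The classes are evenly balanced, which is one reason plain accuracy is a reasonable metric for this exercise.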


🚀 <font color='#D9C4B1'>Exercise: </font> Build a random forest classifier, then train and evaluate it using cross-validation. You can use the helper functions below.

In [4]:
X = data.data
y = data.target

# Instantiate model
rfc = RandomForestClassifier(n_estimators=50, random_state=42)

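def evaluate_model(estimator):
    """Run 10-fold cross-validation on X, y and return the mean of each metric."""
    cv_results = cross_validate(estimator, X, y, scoring='accuracy', n_jobs=-1, cv=10, return_train_score=True)
    # abs() only matters for negated scorers (e.g. 'neg_mean_squared_error'); accuracy is already positive
    return pd.DataFrame(cv_results).abs().mean().to_dict()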

# Evaluate
results = evaluate_model(rfc)

def display_results(results):
    """Show the averaged metrics as a one-column table."""
    # The values are already per-metric means, so no further aggregation is needed
    return pd.DataFrame(results, index=[0]).T

# Display results
display_results(results)

| | 0 |
|---|---|
| fit_time | 0.504849 |
| score_time | 0.031602 |
| test_score | 0.96 |
| train_score | 1.0 |
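
The table above reports only averages, which hides how much accuracy varies from fold to fold. The following is a minimal sketch, assuming the same model settings as above, that uses `cross_val_score` to print one accuracy value per fold. The name `fold_scores` is illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = datasets.load_iris()
X, y = data.data, data.target

rfc = RandomForestClassifier(n_estimators=50, random_state=42)

# One accuracy value per fold; an integer cv uses stratified folds for classifiers
fold_scores = cross_val_score(rfc, X, y, scoring="accuracy", cv=10)

print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean: %.3f, std: %.3f" % (fold_scores.mean(), fold_scores.std()))
```

With 150 flowers and 10 folds, each test fold contains only 15 samples, so a single misclassification moves that fold's accuracy by roughly 0.067. The gap between a training score of 1.0 and a lower test score is typical for fully grown trees, which can memorise the training folds.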


🚀 <font color='#D9C4B1'>Exercise: </font> Adjust the hyperparameters (e.g. the number of trees). Does model performance improve or degrade?

In [5]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {'n_estimators': [50, 100, 200, 300, 400, 500]}

rfc = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(rfc, param_grid=param_grid, scoring='accuracy', n_jobs=-1, cv=10)

# Fit the grid search object to the data
grid_search.fit(X, y)

# Display the best hyperparameters
print("Best hyperparameters: ", grid_search.best_params_)

# Display the average accuracy score of the best model
print("Best accuracy score: ", grid_search.best_score_)

Best hyperparameters:  {'n_estimators': 200}
Best accuracy score:  0.9666666666666666
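
`best_params_` and `best_score_` show only the winning setting. To answer whether performance rises or falls with the number of trees, it helps to compare every candidate. The sketch below reads the fitted search's `cv_results_` attribute into a DataFrame. It assumes a shortened grid (three values instead of six) purely to keep the run fast, and the name `summary` is illustrative.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

data = datasets.load_iris()
X, y = data.data, data.target

# Shortened grid (an assumption for speed); swap in the full grid if desired
param_grid = {"n_estimators": [50, 100, 200]}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="accuracy",
    cv=10,
)
grid_search.fit(X, y)

# One row per candidate: mean and spread of the cross-validated accuracy
summary = pd.DataFrame(grid_search.cv_results_)[
    ["param_n_estimators", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(summary.sort_values("rank_test_score"))
```

If the differences in `mean_test_score` between candidates are small compared with `std_test_score`, the choice of `n_estimators` is probably not making a meaningful difference on a dataset this small.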
