# A Random Forest model

Our previous work has all been based on logistic regression, a common baseline model against which other models are compared.

In this notebook we swap out the logistic regression model for a Random Forest model. Random Forests are often chosen for classification based on structured data (i.e. when we have specific features of data, rather than unstructured data like a picture or a sound file).

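Random Forests are built from many decision trees. Each tree is trained on a random (bootstrap) sample of the cases and considers only a random subset of the features at each split; tree depth may also be limited. The forest's prediction combines the predictions of all the trees, which tends to make Random Forests less prone to over-fitting than a single decision tree. For more on the basis of Random Forests see: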

https://en.wikipedia.org/wiki/Random_forest
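
As a rough illustration of the over-fitting point, the sketch below compares a single fully grown decision tree with a Random Forest. It uses synthetic data from `make_classification` rather than the Titanic data, and the variable names (`X_demo`, `tree`, `forest`) are chosen for this example only. The exact accuracies will depend on the data and settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic structured data (an assumption; not the Titanic data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

# A single, fully grown decision tree
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# A forest: each tree is fitted to a bootstrap sample of the rows and
# considers a random subset of the features at each split
forest = RandomForestClassifier(
    n_estimators=100, max_features='sqrt', bootstrap=True,
    random_state=42).fit(X_tr, y_tr)

# Compare training and test accuracy for the two models
for name, clf in [('Single tree', tree), ('Random Forest', forest)]:
    train_acc = clf.score(X_tr, y_tr)
    test_acc = clf.score(X_te, y_te)
    print(f'{name}: train {train_acc:0.3f}, test {test_acc:0.3f}')
```

A large gap between training and test accuracy is one sign of over-fitting.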

Note in this example how similar the code is to our previous logistic regression model. A couple of notable changes are:

* Data for Random Forest models do not need standardisation; we use the raw data.
* Rather than having coefficients, a Random Forest model has ‘importances’, which reflect how influential each feature is in deciding classification. After fitting, these are available as `model.feature_importances_`.
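
A minimal sketch of reading feature importances from a fitted model is shown below. It assumes synthetic data and hypothetical feature names (`feature_0` to `feature_5`); with the Titanic data, the column names of the feature DataFrame could be used as the index instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names (not the Titanic data)
feature_names = [f'feature_{i}' for i in range(6)]
X_imp, y_imp = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0,
    random_state=0)

# Fit a Random Forest with default settings
rf = RandomForestClassifier(random_state=0)
rf.fit(X_imp, y_imp)

# One importance value per feature; the values sum to 1
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```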

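Here we first fit a Random Forest model with default settings and measure its accuracy on a single held-out test set. We then use Optuna to choose between logistic regression and a Random Forest, and to tune their hyperparameters, scoring each trial with 3-fold cross-validation (which scikit-learn stratifies by default for classifiers).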
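
For reference, stratified k-fold validation of a default Random Forest might be sketched as follows. This is an illustration on synthetic data, not the notebook's own method; the names `X_cv`, `y_cv` and `fold_accuracies` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic, mildly imbalanced data (an assumption; not the Titanic data)
X_cv, y_cv = make_classification(
    n_samples=300, n_features=8, weights=[0.6, 0.4], random_state=1)

# Each fold keeps roughly the same class balance as the full data set
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

fold_accuracies = []
for train_idx, test_idx in skf.split(X_cv, y_cv):
    clf = RandomForestClassifier(random_state=1)
    clf.fit(X_cv[train_idx], y_cv[train_idx])
    y_pred = clf.predict(X_cv[test_idx])
    fold_accuracies.append(np.mean(y_pred == y_cv[test_idx]))

print(f'Mean accuracy: {np.mean(fold_accuracies):0.3f}')
```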

## Load modules

In [1]:
import numpy as np
import pandas as pd
import optuna
from sklearn import linear_model
from sklearn import ensemble
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

## Download data

If the Titanic survival data has not previously been downloaded, set `download_required = True` and run the following code.

In [2]:
download_required = False

if download_required:
    
    # Download processed data:
    address = 'https://raw.githubusercontent.com/MichaelAllen1966/' + \
                '1804_python_healthcare/master/titanic/data/processed_data.csv'
    
    data = pd.read_csv(address)

    # Create a data subfolder if one does not already exist
    import os
    data_directory = './data/'
    os.makedirs(data_directory, exist_ok=True)

    # Save data
    data.to_csv(data_directory + 'processed_data.csv', index=False)

## Load data

In [3]:
# Load data
data = pd.read_csv('data/processed_data.csv')

# Make all data 'float' type
data = data.astype(float)

# Drop passenger ID (an identifier, not a predictive feature)
data.drop('PassengerId', inplace=True, axis=1)

# Split data into two DataFrames
X_df = data.drop('Survived',axis=1)
y_df = data['Survived']

# Convert DataFrames to NumPy arrays
X = X_df.values
y = y_df.values

In [4]:
# Split data into training (75%) and test (25%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [5]:
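# Fit a Random Forest with default settings (n_jobs=-1 uses all CPU cores)
model = RandomForestClassifier(n_jobs=-1)
model.fit(X_train, y_train)

# Predict the test set and calculate accuracy
y_pred_test = model.predict(X_test)
accuracy = np.mean(y_pred_test == y_test)
print(f'Accuracy: {accuracy:0.3f}')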

Accuracy: 0.839
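
The accuracy above comes from one random train/test split, so it will vary from run to run. The sketch below, on synthetic data rather than the Titanic data, repeats the split with different seeds to show how much a single-split estimate can move. The names `X_rep`, `y_rep` and `accuracies` are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption; not the Titanic data)
X_rep, y_rep = make_classification(
    n_samples=400, n_features=8, random_state=3)

# Repeat the train/test split with a different seed each time
accuracies = []
for seed in range(5):
    X_a, X_b, y_a, y_b = train_test_split(
        X_rep, y_rep, test_size=0.25, random_state=seed)
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X_a, y_a)
    accuracies.append(np.mean(clf.predict(X_b) == y_b))

print(f'Min: {min(accuracies):0.3f}, max: {max(accuracies):0.3f}')
```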


In [6]:


# Step 1: Define an objective function to be maximized.
def objective(trial):

    classifier_name = trial.suggest_categorical(
        "classifier", ["LogReg", "RandomForest"])

    # Step 2: Set up values for the hyperparameters:
    if classifier_name == 'LogReg':
        logreg_c = trial.suggest_float("logreg_c", 1e-10, 1e10, log=True)
        classifier_obj = linear_model.LogisticRegression(C=logreg_c)
    else:
        rf_n_estimators = trial.suggest_int("rf_n_estimators", 10, 1000)
        rf_max_depth = trial.suggest_int("rf_max_depth", 2, 32, log=True)
        classifier_obj = ensemble.RandomForestClassifier(
            max_depth=rf_max_depth, n_estimators=rf_n_estimators
        )
    
    # Step 3: Scoring method:
    score = model_selection.cross_val_score(classifier_obj, X, y, n_jobs=-1, cv=3)
    accuracy = score.mean()
    return accuracy

# Step 4: Run the optimisation (100 trials)
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

# Clear Optuna's per-trial log output from the cell
from IPython.display import clear_output
clear_output(wait=True)
print('Finished')

Finished


In [8]:
# Getting the best score:
print(f"The best value is : \n{study.best_value}")

# Getting the best parameters:
print(f"The best parameters are : \n{study.best_params}")

The best value is : 
0.8226711560044894
The best parameters are : 
{'classifier': 'RandomForest', 'rf_n_estimators': 224, 'rf_max_depth': 7}
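
One possible next step is to rebuild a classifier from the best parameters so it can be refitted and used for prediction. The sketch below assumes a hypothetical helper, `build_classifier`, that maps the parameter names used in the objective function back to scikit-learn models. The dictionary is copied from the output above; in the notebook it could be taken from `study.best_params`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def build_classifier(params):
    """Hypothetical helper: build a classifier from Optuna parameter names."""
    if params['classifier'] == 'LogReg':
        return LogisticRegression(C=params['logreg_c'])
    return RandomForestClassifier(
        n_estimators=params['rf_n_estimators'],
        max_depth=params['rf_max_depth'])


# Copied from the output above (in the notebook: study.best_params)
best_params = {
    'classifier': 'RandomForest',
    'rf_n_estimators': 224,
    'rf_max_depth': 7,
}

best_model = build_classifier(best_params)
print(best_model)
```

The returned model is unfitted; it would still need `best_model.fit(...)` on training data before use.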

