# Tuning Random Forests

The `RandomForestClassifier` in scikit-learn has many parameters, and choosing the best ones can significantly impact your model's performance. Here's a systematic approach to tuning those parameters.

## Python Prerequisites

Let's install and import the prerequisites so they are ready to use.

In [1]:
# %pip install --quiet --upgrade pip 
# %pip install numpy --quiet
# %pip install PyArrow --quiet
# %pip install Pandas --quiet
# %pip install scikit-learn --quiet

In [7]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from helpers.onehot_encode import onehot_encode


## Starting with a Baseline

For starters, let's train a `RandomForestClassifier` on the trusty Titanic dataset to see how accurately the _default parameters_ predict which passengers survived.


In [None]:
titanic_data = pd.read_csv("Data/titanic_train.csv")

titanic_data, gender_categories = onehot_encode(titanic_data, "Sex")
titanic_data, class_categories = onehot_encode(titanic_data, "Pclass")
predictors = ["Age", "Fare"] + gender_categories + class_categories
prediction = "Survived"

train, validate = (
    train_test_split(
        titanic_data, 
        test_size=0.2, 
        stratify=titanic_data[prediction], 
        random_state=42)
    )

x = train[predictors]
y = train[[prediction]].values.ravel()

In [11]:
random_forest = RandomForestClassifier(n_estimators=100, random_state=42)
random_forest.fit(x, y)

predictions = random_forest.predict(validate[predictors])
actuals = validate[[prediction]].values

score = accuracy_score(actuals, predictions)
print(f'Baseline model accuracy: {score *100:.2f}%')

Baseline model accuracy: 84.36%




`84.36%` accuracy. Not a bad start.
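
To put a figure like this in context, it can help to compare it with the accuracy of always predicting the most common class. The sketch below does that on synthetic stand-in data from `make_classification` (an assumption for illustration, not the Titanic data), using scikit-learn's `DummyClassifier`. The `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the Titanic data used above)
X_demo, y_demo = make_classification(
    n_samples=800, n_features=6, n_informative=4, weights=[0.6, 0.4], random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Reference point: always predict the most frequent training class
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
dummy_score = accuracy_score(y_val, dummy.predict(X_val))

# Same default-style forest as the baseline cell
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
forest_score = accuracy_score(y_val, forest.predict(X_val))

print(f"Majority-class accuracy: {dummy_score * 100:.2f}%")
print(f"Random forest accuracy:  {forest_score * 100:.2f}%")
```

The gap between the two numbers, rather than the forest's accuracy alone, indicates how much the model has actually learned.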



## RandomForestClassifier Parameters




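In `GridSearchCV`, stratification is handled automatically if:

- You're doing classification (the estimator is a classifier and the target is binary or multiclass)
- You pass `y` (the target labels) when calling `.fit()`
- You use an integer for `cv` (e.g., `cv=5`) or leave `cv` at its default

In that case scikit-learn uses `StratifiedKFold` under the hood. The cell below passes an explicit `StratifiedKFold` with `shuffle=True` and a fixed `random_state` instead.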
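
The rule above can be checked directly with `sklearn.model_selection.check_cv`, the public helper that turns a `cv` argument into a splitter object. This is a minimal sketch on a tiny hand-made label array (`y_labels` is a hypothetical name); it only illustrates how an integer `cv` is resolved, not the full behaviour of `GridSearchCV`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Tiny binary target, made up for illustration
y_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Integer cv + classifier + binary target -> stratified splitter
clf_cv = check_cv(5, y_labels, classifier=True)

# Same integer cv, but not flagged as a classifier -> plain KFold
reg_cv = check_cv(5, y_labels, classifier=False)

print(type(clf_cv).__name__)  # StratifiedKFold
print(type(reg_cv).__name__)  # KFold
print(clf_cv.shuffle)         # False: the automatic splitter does not shuffle
```

Because the automatically chosen splitter does not shuffle, passing your own `StratifiedKFold(shuffle=True, random_state=...)` is a reasonable choice when the row order of the data might carry structure.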

In [39]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold

param_grid = {
    'n_estimators': list(range(40, 240)),              # 200 candidates: every value from 40 to 239
    # 'max_depth': [None, 10, 3],                      # default, moderate, shallow (bad)
    # 'min_samples_split': [2, 10, 50],                # default, medium, large (underfit)
    # 'min_samples_leaf': [1, 5, 20],                  # default, medium, large (underfit)
    # 'max_features': ['sqrt', 10, None],              # default, moderate, all (may overfit)
    # 'class_weight': [None, 'balanced']  # default, important for imbalance
    'n_jobs': [-1],  # use all available cores
}
rf2 = RandomForestClassifier(random_state=42)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

grid_search = GridSearchCV(estimator=rf2, param_grid=param_grid, cv=cv, scoring='accuracy', verbose=1)
grid_search.fit(x, y)

print("Best Parameters:", grid_search.best_params_)
print("Best accuracy:", grid_search.best_score_)


predictions = grid_search.best_estimator_.predict(validate[predictors])
actuals = validate[[prediction]].values

score = accuracy_score(actuals, predictions)
print(f'Validation accuracy: {score *100:.4f}%')

Fitting 3 folds for each of 200 candidates, totalling 600 fits
Best Parameters: {'n_estimators': 84, 'n_jobs': -1}
Best accuracy: 0.8272760581025659
Validation accuracy: 84.3575%
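
The grid above lists `n_jobs` as if it were a hyperparameter, which is why it shows up in `best_params_`. One alternative, sketched here on synthetic stand-in data (an assumption, not the Titanic data), is to set `n_jobs` on the estimator itself so the grid contains only the values being tuned. The `_demo` names and the small three-value grid are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

search = GridSearchCV(
    # n_jobs is a runtime setting, so it lives on the estimator, not in the grid
    estimator=RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid={"n_estimators": [20, 40, 60]},
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# Only the tuned hyperparameter appears here
print(search.best_params_)
```

With this layout `best_params_` and the `params` column of `cv_results_` contain only `n_estimators`, which keeps the results table easier to read.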


In [44]:
results = pd.DataFrame(grid_search.cv_results_)[['params', 'param_n_estimators', 'mean_test_score', 'std_test_score']]
results.sort_values(by='mean_test_score', ascending=False).head(3)

| | params | param_n_estimators | mean_test_score | std_test_score |
|---|---|---|---|---|
| 44 | {'n_estimators': 84, 'n_jobs': -1} | 84 | 0.827276 | 0.015425 |
| 43 | {'n_estimators': 83, 'n_jobs': -1} | 83 | 0.825870 | 0.015191 |
| 2 | {'n_estimators': 42, 'n_jobs': -1} | 42 | 0.824475 | 0.018575 |
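
The top three mean scores differ by less than 0.003, while each has a standard deviation across folds of more than 0.015, so the ranking between them is not very firm. One common heuristic (not something this notebook applies, and named here as an assumption) is to pick the smallest forest whose mean score is within one standard deviation of the best. The sketch below applies it to just the three rows shown above; `simplest_within_one_std` is a hypothetical helper.

```python
import pandas as pd

# The three rows from the results table above
top_results = pd.DataFrame(
    {
        "param_n_estimators": [84, 83, 42],
        "mean_test_score": [0.827276, 0.825870, 0.824475],
        "std_test_score": [0.015425, 0.015191, 0.018575],
    }
)


def simplest_within_one_std(results):
    """Return the smallest n_estimators scoring within one std of the best row."""
    best = results.loc[results["mean_test_score"].idxmax()]
    threshold = best["mean_test_score"] - best["std_test_score"]
    candidates = results[results["mean_test_score"] >= threshold]
    return int(candidates["param_n_estimators"].min())


print(simplest_within_one_std(top_results))  # 42
```

On these three rows the threshold is 0.811851, so all of them qualify and the smallest, 42 trees, is returned. Run against the full `results` frame, the same helper could return a different value.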


In [33]:
from sklearn.datasets import make_classification

# Create a synthetic, highly imbalanced dataset for demonstration purposes:
# 20,000 samples, 30 features, and a very small share of positive samples.
# Note: this rebinds `y`, which held the Titanic training labels earlier.
X, y = make_classification(
    n_samples=20000,
    n_features=30,
    n_informative=6,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    weights=[0.995, 0.005],  # targets ~0.5% positives; the default flip_y=0.01 label noise raises the observed rate to roughly 1%
    class_sep=1.0,
    random_state=42
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df['target'] = y
df

Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_21,feature_22,feature_23,feature_24,feature_25,feature_26,feature_27,feature_28,feature_29,target
0,1.210664,0.349601,0.029481,-0.020006,-0.601959,1.726634,2.573545,0.223895,-1.636214,2.394204,...,-1.655477,-1.486167,0.487789,-0.580844,-0.122460,-0.840531,-1.624671,-0.034947,-0.622731,0
1,0.634961,0.223648,-1.196056,0.417422,0.214757,-0.207356,0.756490,-1.306920,0.880136,0.263538,...,0.837681,-1.260614,-0.582005,0.137562,-1.121255,0.474646,-1.387886,-0.479281,0.294088,0
2,-0.572048,0.431379,1.672453,-0.164474,1.428403,-0.604543,-4.224891,-1.177700,0.564690,0.600523,...,-1.320712,-1.710154,0.712096,-0.637608,-0.362928,0.592051,-0.613427,-2.880198,-0.314463,0
3,0.534496,1.087456,0.790938,-1.757833,-0.061646,-0.390053,0.416728,0.148818,0.335285,0.131284,...,1.346023,-1.254552,-0.574714,1.477617,-1.997026,-0.170192,-0.058041,1.908746,1.838352,0
4,-0.543131,0.109050,-1.856372,-1.382639,-1.772783,-0.422195,3.203891,-0.979045,0.287339,0.968977,...,-2.215352,-0.333146,1.356225,-0.026778,0.782405,-0.352308,2.401678,4.205997,-0.641665,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,0.303397,-0.568428,-0.072603,0.655341,0.178053,0.705260,-0.951625,1.592417,-0.260190,0.145472,...,-0.911432,0.522614,-0.449445,0.259839,-2.098249,-0.992923,-1.809622,-2.714639,-1.587788,0
19996,-0.137859,-0.383533,1.923109,0.618716,-1.203447,0.286761,3.544024,-0.808329,-0.489768,-0.363143,...,0.642698,-3.574396,-0.027515,0.049381,-1.132985,2.430925,-0.681227,0.240363,-0.243601,0
19997,0.588165,-0.394008,-0.548438,0.041910,0.041069,1.213395,-0.364276,1.535406,-0.341339,-0.052770,...,0.181754,-1.706225,-1.002879,0.279353,-1.201080,-0.057721,-1.237656,-0.628472,-0.501242,0
19998,0.681884,0.004827,0.474925,-0.258151,-0.695408,0.364455,-4.467280,-1.068672,0.356080,-0.153367,...,-0.793321,0.842952,3.032088,0.035651,1.010222,-1.524072,-1.579162,-3.917595,-1.015519,0
