### In this notebook we will work with the iris dataset to practice the random forest algorithm

In [8]:
from sklearn.datasets import load_iris # used to load the iris dataset
from sklearn.ensemble import RandomForestClassifier # loads the RandomForest algorithm for our model
from sklearn.model_selection import train_test_split # used to split the train and test data
from sklearn.metrics import accuracy_score # used to compute the accuracy score of the model

In [2]:
# load the dataset
d_set=load_iris()

In [3]:
# separate the features (X) and the target (y) for training
X=d_set.data
y=d_set.target # the output column

In [4]:
# shapes of x and y data 
X.shape, y.shape

((150, 4), (150,))

In [6]:
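# now split the data into train and test sets
# we keep 20% of the data (30 of the 150 samples) for testing
# note: no random_state is set, so the split (and the scores below) change from run to run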
Xtrain, Xtest, ytrain, ytest=train_test_split(X, y, test_size=0.2)
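Because the split above is not seeded, rerunning the notebook gives different scores. The sketch below shows one way to make the split reproducible and class-balanced. The `random_state=42` value and the `stratify=y` option are assumptions added here for illustration; they are not what the cell above uses, so the scores printed in this notebook will not match a seeded rerun.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle so every run produces the same split;
# stratify=y keeps the class proportions the same in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape)
# number of test samples per class
print(np.bincount(y_te))
```

With iris (50 samples per class) and a 20% test split, stratification puts the same number of samples from each class in the test set.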

In [22]:
# make the classifier and create the model
# the random forest classifier takes many parameters; some of the key ones are described below

# n_estimators: The number of trees in the forest. The default is 100 (it was 10 before scikit-learn 0.22).

# max_depth: The maximum depth of each tree. Depth refers to the number of splits or decisions from the root node to a leaf node.
# Deeper trees (larger max_depth) can capture more complexity but may lead to overfitting, as they may learn noise from the training data.
# Shallow trees (smaller max_depth) may lead to underfitting, as they might not capture the data’s true patterns.
# If max_depth=None, the trees keep splitting until all leaves are pure (perfect classification) or until all leaves contain fewer than min_samples_split samples.


# min_samples_split: The minimum number of samples required to split a node.
# Larger values (like 10 or 20) reduce the risk of overfitting, because they force the trees to generalize by not splitting small data segments.
# Smaller values (like 2 or 5) can create deeper, more complex trees, which may lead to overfitting if the data is noisy.

# max_features: The number of features to consider when looking for the best split.
# Smaller values (e.g., a small subset of features) lead to less correlation between the trees, which helps reduce variance and overfitting. However, they may also cause underfitting if important features are missed.
# Larger values (e.g., considering all features) can result in trees that are more similar to each other, reducing the diversity in the forest and limiting the benefits of the Random Forest’s ensemble method.

# criterion: The function used to measure the quality of a split. One of "gini", "entropy", or "log_loss"; the default is "gini".


# creating a simple random forest classifier with explicitly chosen values for the parameters described above

rf=RandomForestClassifier(max_depth=10, n_estimators=100, min_samples_split=5, max_features='sqrt', random_state=42)

# fit the training data
rf.fit(Xtrain, ytrain)
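The comments above describe how `max_depth` trades underfitting against overfitting. A minimal sketch of how one might check this on iris is shown below: it fits forests with a few depth limits and compares train and test accuracy. The depth values, the seeded split, and the name `depth_scores` are illustrative assumptions, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# seeded, stratified split (an assumption made here for reproducibility)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

depth_scores = {}
for depth in [1, 2, 5, None]:
    model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)
    # score() returns mean accuracy for classifiers
    depth_scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for depth, (train_acc, test_acc) in depth_scores.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A widening gap between the train and test columns as the depth grows would be the overfitting pattern described in the comments. On a dataset as small and clean as iris the effect may be slight.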

In [23]:
# now make predictions and check the accuracy score of the model
y_pred = rf.predict(Xtest)

# now check the accuracy score
accuracy_score(ytest, y_pred)

0.9333333333333333

### The model reaches about 93% accuracy on the test data (28 of the 30 test samples are classified correctly)
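A single accuracy number does not show which classes are being confused. The sketch below prints a confusion matrix for a model with the same parameters as above. It refits on its own seeded split (an assumption, since the notebook's split is not seeded), so the counts may differ from the run shown here.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target
)

model = RandomForestClassifier(
    max_depth=10, n_estimators=100, min_samples_split=5,
    max_features="sqrt", random_state=42,
).fit(X_tr, y_tr)

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, model.predict(X_te))
print(iris.target_names)
print(cm)
```

Off-diagonal entries are the misclassified samples; the row and column of each entry tell you which class was mistaken for which.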

In [24]:
# checking the accuracy score on the training data to look for overfitting or underfitting
y_train_pred=rf.predict(Xtrain)

# now check the train accuracy score
accuracy_score(ytrain, y_train_pred)

0.9916666666666667

### The train accuracy is about 99% (119 of 120 samples) and the test accuracy is about 93% (28 of 30 samples). The model fits the training data slightly better than the test data, which suggests it may be overfitting. Keep in mind that with only 30 test samples, the test score reflects just two misclassified samples.
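A single 30-sample test set gives a noisy estimate of generalization. One common way to get a steadier number is k-fold cross-validation over the whole dataset. The sketch below uses 5 stratified folds; the fold count and the seed are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(
    max_depth=10, n_estimators=100, min_samples_split=5,
    max_features="sqrt", random_state=42,
)

# each fold serves once as the held-out part; the model is refit for every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("fold accuracies:", scores)
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The spread across folds gives a rough sense of how much a single train/test split can move the reported accuracy.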

### Now we tune the model's hyperparameters to try to get a better accuracy score than the model above

In [25]:
from sklearn.model_selection import GridSearchCV
# GridSearchCV is a technique in machine learning used for hyperparameter tuning. It systematically works through multiple combinations of hyperparameter values, cross-validating each to determine the best set of parameters for improving the performance of a model.

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2']
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42), 
                           param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(Xtrain, ytrain)

# Best parameters
print(f"Best parameters: {grid_search.best_params_}")

# Best model
best_model = grid_search.best_estimator_

Fitting 3 folds for each of 72 candidates, totalling 216 fits
Best parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_split': 2, 'n_estimators': 50}
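Besides `best_params_`, a fitted `GridSearchCV` also exposes `best_score_` and `cv_results_`, which show how close the candidates were to each other. The sketch below uses a deliberately smaller grid (the hypothetical `small_grid`) and its own seeded split so that it runs quickly; it is not the search performed above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# reduced grid, for illustration only
small_grid = {"n_estimators": [50, 100], "min_samples_split": [2, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=3)
search.fit(X_tr, y_tr)

mean_scores = search.cv_results_["mean_test_score"]
std_scores = search.cv_results_["std_test_score"]

# list the candidates from best to worst mean cross-validated accuracy
for i in np.argsort(-mean_scores):
    print(search.cv_results_["params"][i],
          f"{mean_scores[i]:.3f} +/- {std_scores[i]:.3f}")

print("best CV accuracy:", search.best_score_)
```

If many candidates sit within one standard deviation of the best score, the "best" parameters are largely a product of noise, and tuning is unlikely to change the test accuracy much.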


In [26]:
# checking accuracy after making the above changes 
y_pred2=best_model.predict(Xtest)

# score
accuracy_score(ytest, y_pred2)

0.9333333333333333

In [27]:
# a check for the train score
y_pred2_train=best_model.predict(Xtrain)
accuracy_score(ytrain, y_pred2_train)

1.0

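### Even after hyperparameter tuning, the model gives the same 93% test accuracy, while the train accuracy rises to 100%.
### So the gap between train and test accuracy remains, and we still have the overfitting issue.
### To reduce this issue we would need more data. With only 30 test samples, each misclassified sample shifts the test accuracy by about 3.3 percentage points. Here we have used sklearn's small iris dataset only as an example.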
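To see how much uncertainty a 30-sample test set carries, one can put a confidence interval around the observed 28 correct out of 30. The sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96). The choice of the Wilson interval and the helper name `wilson_interval` are assumptions made for illustration; the original notebook does not compute any interval.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (hypothetical helper)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# 28 of 30 test samples correct, as in the accuracy reported above
low, high = wilson_interval(28, 30)
print(f"accuracy = {28 / 30:.3f}, approx. 95% interval = ({low:.3f}, {high:.3f})")
```

An interval this wide means the untuned and tuned models cannot be told apart on this test set, which is consistent with both scoring 93%.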