## What is Random Forest?
A Random Forest is like a group decision-making team in machine learning. It combines the opinions of many “trees” (individual models) to make better predictions, creating a more robust and accurate overall model.

## What is Random Forest Algorithm?
The Random Forest algorithm's widespread popularity stems from its ease of use and adaptability: it handles both classification and regression problems effectively. Its strength lies in its ability to handle complex datasets and mitigate overfitting, making it a valuable tool for a wide range of predictive tasks in machine learning.

One of the most important features of the Random Forest algorithm is that it can handle datasets whose target is a **continuous variable**, as in regression, or a **categorical variable**, as in classification.

## Working of Random Forest Algorithm
Suppose you have to make a decision and are unsure what to do, so you ask your parents, friends, cousins, and teachers for help. You ask each of them different questions, and each gives you a suggestion. After consulting all of them, you follow the course suggested by most of the people.<br>
Random Forest works the same way: it combines multiple models, so a collection of models, rather than a single one, makes the prediction.<br>
There are two main types of methods for combining (ensembling) models:
### 1. Bagging
It creates different training subsets by sampling the training data with replacement, trains one model per subset, and bases the final output on majority voting. Example: Random Forest.

### 2. Boosting
It combines weak learners into a strong learner by building models sequentially, with each new model trying to correct the errors of the previous ones. Examples: AdaBoost, XGBoost.
<img src="Images/RF1.png" width="500" height="200"><br>

#### Random Forest works on the Bagging Method
Bagging, also known as **Bootstrap Aggregation**, serves as the ensemble technique in the Random Forest algorithm. Here are the steps involved in Bagging:

**1. Selection of Subset:** Bagging starts by choosing a random sample, or subset, from the entire dataset.<br>
**2. Bootstrap Sampling:** Each model is then created from these samples, called Bootstrap Samples, which are taken from the original data with replacement. This process is known as row sampling.<br>
**3. Bootstrapping:** The step of row sampling with replacement is referred to as bootstrapping.<br>
**4. Independent Model Training:** Each model is trained independently on its corresponding Bootstrap Sample. This training process generates results for each model.<br>
**5. Majority Voting:** The final output is determined by combining the results of all models through majority voting. The most commonly predicted outcome among the models is selected.<br>
**6. Aggregation:** This step, which involves combining all the results and generating the final output based on majority voting, is known as aggregation.<br>
<img src="Images/RF2.png" width="400" height="200"><br>

Here the bootstrap samples (Bootstrap sample 01, Bootstrap sample 02, and Bootstrap sample 03) are drawn from the actual data with replacement, which means each sample is likely to contain repeated data points. The models (Model 01, Model 02, and Model 03) are trained independently, one per bootstrap sample, and each generates a result as shown. The Happy emoji has a majority over the Sad emoji, so majority voting gives the Happy emoji as the final output.
<img src="Images/RF3.png" width="300" height="100"><br>
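The bagging steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used later in this document: the helper names `bootstrap_sample` and `majority_vote` and the toy data are assumptions made for the example.

```python
from collections import Counter

import numpy as np


def bootstrap_sample(X, y, rng):
    """Draw len(y) rows from (X, y) with replacement (row sampling)."""
    n = len(y)
    idx = rng.integers(0, n, size=n)  # indices may repeat
    return X[idx], y[idx]


def majority_vote(predictions):
    """Return the label predicted most often (aggregation)."""
    return Counter(predictions).most_common(1)[0][0]


rng = np.random.default_rng(0)
X = np.arange(10).reshape(5, 2)
y = np.array(["happy", "sad", "happy", "happy", "sad"])

X_boot, y_boot = bootstrap_sample(X, y, rng)
print(X_boot.shape, y_boot.shape)

# Suppose three independently trained models voted like this:
votes = ["happy", "sad", "happy"]
print(majority_vote(votes))
```

Because the rows are drawn with replacement, a bootstrap sample has the same size as the original data but usually contains duplicates and leaves some rows out.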

## Boosting
Boosting is one of the techniques that use the concept of ensemble learning. A boosting algorithm combines multiple simple models (also known as weak learners or base estimators) to generate the final output. It does this by training the weak models in series, so that each one builds on the results of those before it.

There are several boosting algorithms; **AdaBoost** was the first really successful boosting algorithm developed for binary classification. AdaBoost is short for **Adaptive Boosting** and is a prevalent boosting technique that combines multiple "weak classifiers" into a single "strong classifier". Other boosting techniques exist as well, such as XGBoost.
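To make the "in series" idea concrete, the sketch below shows one round of the standard AdaBoost re-weighting step for binary labels in {-1, +1}. The function name `adaboost_reweight` and the toy predictions are assumptions for illustration, and the sketch assumes the weighted error is strictly between 0 and 1.

```python
import numpy as np


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round: return (alpha, new_weights) for labels in {-1, +1}."""
    miss = y_true != y_pred
    err = np.sum(weights[miss])  # weighted error of the weak learner
    alpha = 0.5 * np.log((1.0 - err) / err)  # the learner's say in the final vote
    # Misclassified samples are up-weighted, correctly classified ones down-weighted
    new_weights = weights * np.exp(-alpha * y_true * y_pred)
    return alpha, new_weights / np.sum(new_weights)


y_true = np.array([1, 1, -1, -1])
y_pred = np.array([1, 1, -1, 1])  # the last sample is misclassified
weights = np.full(4, 0.25)

alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(alpha, new_weights)
```

After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.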

## Steps Involved in Random Forest Algorithm
**Step 1:** In the Random Forest model, a subset of data points and a subset of features are selected for constructing each decision tree. Simply put, n random records and m features are taken from a dataset of k records.<br>
**Step 2:** Individual decision trees are constructed for each sample.<br>
**Step 3:** Each decision tree will generate an output.<br>
**Step 4:** The final output is based on majority voting for classification and on averaging for regression.<br>
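Step 1 can be sketched as follows. The helper `draw_tree_sample`, the toy data, and the chosen sizes are assumptions for illustration. Note that many implementations, including the one later in this document, draw the feature subset at every split rather than once per tree.

```python
import numpy as np


def draw_tree_sample(X, y, n_records, m_features, rng):
    """Pick n_records rows (with replacement) and m_features distinct columns."""
    row_idx = rng.integers(0, X.shape[0], size=n_records)
    col_idx = rng.choice(X.shape[1], size=m_features, replace=False)
    return X[row_idx][:, col_idx], y[row_idx], col_idx


rng = np.random.default_rng(42)
X = rng.normal(size=(100, 9))      # k = 100 records, 9 features (toy data)
y = rng.integers(0, 2, size=100)

X_sub, y_sub, cols = draw_tree_sample(X, y, n_records=60, m_features=3, rng=rng)
print(X_sub.shape, y_sub.shape, sorted(cols))
```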

### Example
Consider the fruit basket in the figure below as the data. n samples are taken from the fruit basket, and an individual decision tree is constructed for each sample. Each decision tree generates an output, as shown in the figure, and the final output is decided by majority voting. In the figure, most of the decision trees output apple rather than banana, so the final output is apple.<br>
<img src="Images/RF4.png" width="500" height="200"><br>

## Important Features of Random Forest
* **Diversity:** Not all attributes/variables/features are considered while making an individual tree; each tree is different.
* **Resistance to the curse of dimensionality:** Since each tree does not consider all the features, the feature space each tree works in is reduced.
* **Parallelization:** Each tree is created independently out of different data and attributes. This means we can fully use the CPU to build random forests.
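* **Train-Test split:** Because each tree is trained on a bootstrap sample, on average about 37% of the rows (often rounded to one-third) are not seen by that tree. These out-of-bag rows can be used to estimate performance without a separate validation set, although a held-out test set is still good practice.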
* **Stability:** Stability arises because the result is based on majority voting/ averaging.
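The share of rows a single tree never sees can be checked with a short calculation. When n rows are drawn with replacement from n rows, the chance that a given row is never picked is (1 - 1/n)^n, which approaches 1/e as n grows. The function name below is an assumption for illustration.

```python
import math


def expected_oob_fraction(n):
    """Probability that a given row is missing from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n


for n in (10, 100, 1000):
    print(n, round(expected_oob_fraction(n), 4))

print("limit 1/e:", round(math.exp(-1), 4))
```

This holds when the bootstrap sample is as large as the dataset; smaller subsamples leave an even larger share of rows unseen.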

## Difference Between Decision Tree and Random Forest
Random forest is a collection of decision trees; still, there are a lot of differences in their behavior.

### Decision Tree:
1. A decision tree normally suffers from overfitting if it is allowed to grow without any control.
2. A single decision tree is faster in computation.
3. When a data set with features is taken as input by a decision tree, it will formulate some rules to make predictions.

### Random Forest:
1. Random forests are created from subsets of data, and the final output is based on averaging or majority voting; hence overfitting is reduced.
2. It is comparatively slower.
3. Random forest randomly selects observations and features, builds many decision trees, and combines their results by averaging or majority voting. It does not rely on a single set of rules.

## Important Hyperparameters in Random Forest
Hyperparameters are used in random forests to either enhance the performance and predictive power of models or to make the model faster.

### Hyperparameters to Increase the Predictive Power
* **n_estimators:** Number of trees the algorithm builds before averaging the predictions.
* **max_features:** Maximum number of features random forest considers when splitting a node.
* **min_samples_leaf:** Minimum number of samples required to be at a leaf node.
* **criterion:** The function used to measure the quality of a split (Gini impurity, entropy, or log loss for classification).
* **max_leaf_nodes:** Maximum number of leaf nodes in each tree.

### Hyperparameters to Increase the Speed
* **n_jobs:** Tells the engine how many processors it is allowed to use. A value of 1 means only one processor; a value of -1 means all available processors.
* **random_state:** Controls the randomness of the sampling. The model always produces the same results when it is given a fixed random state, the same hyperparameters, and the same training data.
* **oob_score:** OOB stands for out-of-bag. It is a validation method built into random forests: roughly one-third of the samples are left out of each tree's bootstrap sample and are used to evaluate performance instead of training. These samples are called out-of-bag samples.
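As a hedged illustration of how these hyperparameters are passed in practice, the snippet below sets them on scikit-learn's `RandomForestClassifier`. The dataset and the specific values are assumptions chosen for the example, not recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_features="sqrt",   # features considered at each split
    min_samples_leaf=1,    # minimum samples allowed in a leaf
    criterion="gini",      # split quality measure
    max_leaf_nodes=None,   # no cap on leaves per tree
    n_jobs=-1,             # use all available processors
    random_state=42,       # reproducible sampling
    oob_score=True,        # score each row using trees that did not see it
)
forest.fit(X, y)
print(f"Out-of-bag accuracy: {forest.oob_score_:.3f}")
```

With `oob_score=True`, the fitted model exposes `oob_score_`, an accuracy estimate computed only from trees whose bootstrap sample did not contain the row being scored.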

## Advantages
* It can be used in classification and regression problems.
* It reduces overfitting, as the output is based on majority voting or averaging.
* It can perform well even if the data contains null/missing values, provided the implementation supports them or they are imputed first.
* Each decision tree is created independently of the others, so training can be parallelized.
* It is highly stable, as the answers given by a large number of trees are averaged.
* It maintains diversity, as not all attributes are considered when building each decision tree, though this is not true in all cases.
* It is less affected by the curse of dimensionality. Since each tree does not consider all the attributes, the feature space is reduced.
* A separate validation split is not strictly required, as on average about 37% of the data is not seen by each decision tree built from a bootstrap sample.

## Disadvantages
* Random forest is highly complex compared to decision trees, where decisions can be made by following the path of the tree.
* Training takes longer than for a single decision tree because many trees must be built. Prediction is also slower, since every decision tree has to generate an output for the given input data.

# Random Forest Classifier

In [1]:
import os
import math
import numpy as np
import pandas as pd
from sklearn import model_selection

import warnings
warnings.filterwarnings("ignore")

Here I have imported the **warnings** library to suppress warning messages.

In [2]:
class Leaf:
    def __init__(self, value):
        self.value = value

    def predict(self, row):
        return self.value

This class represents a leaf in a decision tree.<br>
The **`__init__()`** method is a constructor method which initializes a Leaf object with a value. This value is typically the predicted outcome or class label associated with the leaf node.<br>
The **`predict()`** method takes a row as input and returns the value of the leaf node. When a new data point (represented by row) reaches a leaf node during prediction, this method returns the predicted value for that data point.<br>

In [3]:
class Node:
    def __init__(self, level, split_feature, split_value, left_node=None, right_node=None):
        self.level = level
        self.split_feature = split_feature
        self.split_value = split_value
        self.left_node = left_node
        self.right_node = right_node

    def predict(self, row):
        if row[self.split_feature] >= self.split_value:
            return self.right_node.predict(row)
        return self.left_node.predict(row)

The Node class represents a node in a decision tree that contains a splitting criterion and references to its left and right child nodes.<br>
The **`__init__()`** method is a constructor method that initializes a Node object with the following attributes:
* **level:** The level of the node in the decision tree.
* **split_feature:** The index of the feature used for splitting at this node.
* **split_value:** The value of the feature used for splitting at this node.
* **left_node:** Reference to the left child node.
* **right_node:** Reference to the right child node.

The **`predict()`** method takes a row as input and returns the predicted value by traversing the decision tree. If the feature value of the row is greater than or equal to the split_value, it follows the right child node; otherwise, it follows the left child node.
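To see how prediction walks down the tree, here is a small hand-built example. The two classes are repeated from the cells above so the snippet runs on its own; the split values and leaf labels are made up for illustration.

```python
class Leaf:
    def __init__(self, value):
        self.value = value

    def predict(self, row):
        return self.value


class Node:
    def __init__(self, level, split_feature, split_value, left_node=None, right_node=None):
        self.level = level
        self.split_feature = split_feature
        self.split_value = split_value
        self.left_node = left_node
        self.right_node = right_node

    def predict(self, row):
        if row[self.split_feature] >= self.split_value:
            return self.right_node.predict(row)
        return self.left_node.predict(row)


# Root splits on feature 0 at 5.0; its right child splits on feature 1 at 2.0
tree = Node(1, 0, 5.0,
            left_node=Leaf("small"),
            right_node=Node(2, 1, 2.0, Leaf("large-low"), Leaf("large-high")))

for row in ([1.0, 9.0], [7.0, 1.0], [7.0, 3.0]):
    print(row, "->", tree.predict(row))
```

Each call starts at the root and follows exactly one path, so prediction cost grows with the depth of the tree rather than with the number of nodes.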

In [4]:
class DecisionTreeClassifier:
    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.root = None
    def set_root(self, node):
        if self.root is None:
            self.root = node
    
    def class_counts(self, y):
        values, counts = np.unique(y, return_counts=True)
        return values, counts

    def calc_popular_class(self, y):
        values, counts = self.class_counts(y)
        idx = np.argmax(counts)
        popular_class = values[idx]
        return popular_class

    def calc_gini(self, y):
        values, counts = self.class_counts(y)
        class_probabilities = counts/float(len(y))
        return 1-np.sum(class_probabilities**2, axis=0)

    def features_to_check(self, num_features):
        # Choose roughly sqrt(num_features) distinct feature indices at random
        num_features_to_check = int(math.sqrt(num_features))
        idxs = np.random.choice(num_features, size=num_features_to_check, replace=False)
        return idxs

    def get_best_split(self, X, y):
        num_features = X.shape[1]
        num_rows = len(y)
        best_split_feature = 0
        best_split_value = 0
        best_gini = 1

        # Draw the candidate features from all columns of X
        for feature in self.features_to_check(num_features):
            values = np.unique(X[:, feature])

            for val in values:
                right_rows, right_labels, left_rows, left_labels = self.data_split(X, y, feature, val)
                p = float(len(right_rows))/num_rows
                # Gini of the two children, weighted by the share of rows in each
                average_gini = p*self.calc_gini(right_labels) + (1-p)*self.calc_gini(left_labels)

                if average_gini < best_gini:
                    best_gini = average_gini
                    best_split_feature, best_split_value = feature, val
        
        return best_split_feature, best_split_value, best_gini

    def data_split(self, X, y, split_feature, split_value):
        idx_right_subtree = X[:, split_feature] >= split_value
        right_subtree = X[idx_right_subtree]
        right_subtree_labels = y[idx_right_subtree]

        idx_left_subtree = X[:, split_feature] < split_value
        left_subtree = X[idx_left_subtree]
        left_subtree_labels = y[idx_left_subtree]

        return right_subtree, right_subtree_labels, left_subtree, left_subtree_labels

    def fit(self, X, y):
        self.set_root(self.split_node(X, y))
    
    def split_node(self, X, y, node_level=0):
        node_level += 1
        
        if len(y) == 1:
            return Leaf(y[0])
        split_feature, split_value, gini = self.get_best_split(X, y)

        if gini == 0.0 or self.max_depth < node_level:
            popular_class = self.calc_popular_class(y)
            return Leaf(popular_class)
        right_subtree, right_subtree_labels, left_subtree, left_subtree_labels = self.data_split(X, y, split_feature, split_value)

        if len(right_subtree_labels) == 1:
            return Leaf(right_subtree_labels[0])
        if len(left_subtree_labels) == 1:
            return Leaf(left_subtree_labels[0])
        right_node = self.split_node(right_subtree, right_subtree_labels, node_level)
        left_node = self.split_node(left_subtree, left_subtree_labels, node_level)
        return Node(node_level, split_feature, split_value, left_node, right_node)

    def predict_labels(self, X_test):
        y_probs = []
        for row in X_test:
            y_probs.append(self.root.predict(row))
        return np.asarray(y_probs)

    def get_accuracy(self, y, y_probs):
        correct = y == y_probs
        acc = (np.sum(correct)/float(len(y)))*100.0
        return acc

The **DecisionTreeClassifier()** class represents a decision tree model for classification tasks.<br>

The **`__init__()`** method initializes a DecisionTreeClassifier object with the **max_depth** parameter, which specifies the maximum depth of the decision tree. It also initializes the **root** attribute to None, which will later hold the root node of the decision tree.<br>
The **`set_root()`** method sets the root node of the decision tree if it is not already set.<br>
The **`class_counts()`** method calculates the counts of unique classes in the target variable y and returns the values and their corresponding counts.<br>
The **`calc_popular_class()`** method calculates the most popular class in the target variable y, which is the class with the highest frequency, and returns it as the predicted class for a leaf node.<br>
The **`calc_gini()`** method calculates the **Gini impurity** of a node based on the target variable y. Gini impurity is a measure of how often a randomly chosen element would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the node. Lower Gini impurity indicates better purity (i.e., more homogeneous class distribution).<br>
The **`features_to_check()`** method randomly selects a subset of features to consider for splitting a node. It is a common technique in decision tree algorithms to reduce overfitting by considering only a subset of features at each node.<br>
The **`get_best_split()`** method finds the best feature and value to split a node based on the Gini impurity. It iterates over every candidate value of each randomly selected feature to find the split that results in the lowest weighted average Gini impurity for the child nodes.<br>
The **`data_split()`** method splits the data X and labels y into two subsets based on a split feature and value.<br>
The **`fit()`** method fits the decision tree to the training data by recursively splitting nodes until a stopping criterion is met, such as reaching the maximum depth or achieving perfect purity.<br>
The **`split_node()`** method recursively splits nodes to build the decision tree. It checks for the stopping criteria and splits nodes based on the best split found by get_best_split.<br>
The **`predict_labels()`** method predicts the class labels for a given set of input data X_test by traversing the decision tree and returning the predicted class for each input.<br>
The **`get_accuracy()`** method calculates the accuracy of the model by comparing the predicted labels y_probs with the actual labels y and computing the percentage of correct predictions.
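The Gini calculation described above can be tried on a tiny example. The standalone functions `gini` and `weighted_gini` below are illustrative helpers, not part of the class, and they assume non-empty label arrays.

```python
import numpy as np


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)


def weighted_gini(left_labels, right_labels):
    """Gini of a split: child impurities weighted by child size."""
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    return (n_left / n) * gini(left_labels) + (n_right / n) * gini(right_labels)


parent = np.array([0, 0, 0, 1, 1, 1])
print(gini(parent))                           # evenly mixed node
print(weighted_gini(parent[:3], parent[3:]))  # perfect split
print(weighted_gini(parent[:2], parent[2:]))  # imperfect split
```

The perfect split drives the weighted impurity to zero, while the imperfect one only lowers it, which is why the tree prefers the former.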

In [5]:
class RandomForestClassifier():
    def __init__(self):
        self.forest = []
    
    def create_subsample(self, X, y, a=0.25):
        n = len(y)
        n_tag = int(a*n)
        idx = np.random.randint(0, n, size=n_tag)
        X_subsample = X[idx]
        y_subsample = y[idx]
        return X_subsample, y_subsample

    def fit(self, X, y, T=300, max_depth=4):
        for i in range(0,T):
            X_subsample, y_subsample = self.create_subsample(X, y)
            tree = DecisionTreeClassifier(max_depth)
            tree.fit(X_subsample, y_subsample)
            self.forest.append(tree)

    def calc_popular_class(self, y):
        values, counts = np.unique(y, return_counts=True)
        idx = np.argmax(counts)
        popular_class = values[idx]
        return popular_class

    def bagging_predict(self, X_test):
        predictions = []

        for row in X_test:
            all_trees_preds = np.asarray([tree.root.predict(row) for tree in self.forest])
            if len(all_trees_preds)>0:
                predictions.append(self.calc_popular_class(all_trees_preds))
        return np.asarray(predictions)

    def get_accuracy(self, y, y_probs):
        correct = y == y_probs
        acc = (np.sum(correct)/float(len(y)))*100.0
        return acc

The **RandomForestClassifier** class implements a random forest model for classification tasks. It builds an ensemble of decision trees and aggregates their predictions to make the final classification.<br>

The **`__init__()`** method initializes the forest attribute as an empty list, which will hold the individual decision trees in the random forest.<br>
The **`create_subsample()`** method creates a random subsample of the dataset X and target variable y for training each decision tree in the forest. It draws a fraction a of the rows (25% by default) at random with replacement and returns the subsampled data and labels.<br>
The **`fit()`** method fits the random forest model to the training data X and target variable y. It iteratively creates a subsample of the training data, trains a decision tree on the subsample, and adds the tree to the forest. The process is repeated T times (default is 300) to create a forest of decision trees.<br>
The **`calc_popular_class()`** method calculates the most popular class in a set of predictions y. It is used for aggregating the predictions of individual decision trees in the forest.<br>
The **`bagging_predict()`** method makes predictions for a given set of input data X_test by aggregating the predictions of all decision trees in the forest. For each input row, it collects the predictions of all trees and calculates the most popular class among them as the final prediction.<br>
The **`get_accuracy()`** method calculates the accuracy of the model by comparing the predicted labels y_probs with the actual labels y and computing the percentage of correct predictions.<br>

In [6]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
digit = load_digits()
X = digit.data
y = digit.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
preds = clf.bagging_predict(X_test)
acc = clf.get_accuracy(y_test, preds)
print(acc)

91.38888888888889


# Random Forest Regressor

In [8]:
class DecisionTreeRegressor:
    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.root = None
    
    def set_root(self, node):
        if self.root is None:
            self.root = node
    
    def class_counts(self, y):
        values, counts = np.unique(y, return_counts=True)
        return values, counts

    def calc_popular_class(self, y):
        return np.mean(y)

    def calc_mse(self, y):
        return np.mean((y-np.mean(y))**2)

    def features_to_check(self, num_features):
        # Choose roughly sqrt(num_features) distinct feature indices at random
        num_features_to_check = int(math.sqrt(num_features))
        idxs = np.random.choice(num_features, size=num_features_to_check, replace=False)
        return idxs

    def get_best_split(self, X, y):
        num_features = X.shape[1]
        num_rows = len(y)
        best_split_feature = 0
        best_split_value = 0
        best_mse = float('inf')

        # Draw the candidate features from all columns of X
        for feature in self.features_to_check(num_features):
            values = np.unique(X[:, feature])

            for val in values:
                right_rows, right_labels, left_rows, left_labels = self.data_split(X, y, feature, val)
                p = float(len(right_rows))/num_rows
                # MSE of the two children, weighted by the share of rows in each
                average_mse = p*self.calc_mse(right_labels) + (1-p)*self.calc_mse(left_labels)

                if average_mse < best_mse:
                    best_mse = average_mse
                    best_split_feature, best_split_value = feature, val
        
        return best_split_feature, best_split_value, best_mse

    def data_split(self, X, y, split_feature, split_value):
        idx_right_subtree = X[:, split_feature] >= split_value
        right_subtree = X[idx_right_subtree]
        right_subtree_labels = y[idx_right_subtree]

        idx_left_subtree = X[:, split_feature] < split_value
        left_subtree = X[idx_left_subtree]
        left_subtree_labels = y[idx_left_subtree]

        return right_subtree, right_subtree_labels, left_subtree, left_subtree_labels

    def fit(self, X, y):
        # Keep the training mean as a fallback for non-finite leaf predictions
        self.y_mean = np.nanmean(y)
        self.set_root(self.split_node(X, y))

    def split_node(self, X, y, node_level=0):
        node_level += 1

        if len(y) == 1:
            return Leaf(y[0])
        split_feature, split_value, mse = self.get_best_split(X, y)

        if mse == 0.0 or self.max_depth < node_level:
            popular_class = self.calc_popular_class(y)
            return Leaf(popular_class)
        right_subtree, right_subtree_labels, left_subtree, left_subtree_labels = self.data_split(X, y, split_feature, split_value)

        if len(right_subtree_labels) == 1:
            return Leaf(right_subtree_labels[0])
        if len(left_subtree_labels) == 1:
            return Leaf(left_subtree_labels[0])
        right_node = self.split_node(right_subtree, right_subtree_labels, node_level)
        left_node = self.split_node(left_subtree, left_subtree_labels, node_level)
        return Node(node_level, split_feature, split_value, left_node, right_node)

    def predict(self, X_test):
        y_preds = []
        for row in X_test:
            pred = self.root.predict(row)
            # A leaf built from an empty subset yields NaN; fall back to the training mean
            if not np.isfinite(pred):
                pred = self.y_mean
            y_preds.append(pred)
        return np.asarray(y_preds)

    def get_mse(self, y, y_preds):
        return np.mean((y-y_preds)**2)

This **DecisionTreeRegressor** class implements a decision tree model for regression tasks. It builds a binary tree where each internal node represents a decision based on a feature value, and each leaf node represents the output value (mean of the target variable) for the corresponding subset of data.

The **`__init__()`** method initializes the decision tree with a specified maximum depth max_depth and sets the root node to None.<br>
The **`set_root()`** method sets the root node of the decision tree if it is not already set.<br>
The **`class_counts()`** method calculates the counts of unique classes in the target variable y and returns the values and their corresponding counts.<br>
The **`calc_popular_class()`** method calculates the mean of a given set of values y, which is used when determining the output value for a leaf node.<br>
The **`calc_mse()`** method calculates the mean squared error for a given set of values y, which is used to evaluate the quality of a split.<br>
The **`features_to_check()`** method returns a random subset of feature indices to consider when finding the best split. This helps in creating random forests with diverse trees.<br>
The **`get_best_split()`** method finds the best feature and value to split a node based on the mean squared error. It iterates over every candidate value of each randomly selected feature to find the split that results in the lowest weighted average MSE for the child nodes.<br>
The **`data_split()`** method splits the data X and labels y into two subsets based on a split feature and value.<br>
The **`fit()`** method fits the decision tree to the training data X and target variable y by calling the split_node method to recursively split the data and create the tree.<br>
The **`split_node()`** method recursively splits the data at each node based on the best feature and value to minimize the mean squared error (mse). It uses the get_best_split method to find the best split and creates child nodes accordingly.<br>
The **`predict()`** method predicts the target variable for a given set of input features X_test by traversing the decision tree from the root node to a leaf node. If a leaf node's prediction is not finite (due to no data in that leaf), it replaces the prediction with the mean of the target variable y.
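The MSE-based split criterion can be illustrated on a handful of target values. The helpers `node_mse` and `weighted_mse` below are standalone sketches with made-up data, and they assume both sides of a split are non-empty.

```python
import numpy as np


def node_mse(y):
    """Mean squared deviation of the targets from their own mean."""
    return np.mean((y - np.mean(y)) ** 2)


def weighted_mse(y_left, y_right):
    """MSE of a split: child MSEs weighted by child size."""
    n_left, n_right = len(y_left), len(y_right)
    n = n_left + n_right
    return (n_left / n) * node_mse(y_left) + (n_right / n) * node_mse(y_right)


y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
print(node_mse(y))                 # before splitting
print(weighted_mse(y[:3], y[3:]))  # split between the two clusters
print(weighted_mse(y[:1], y[1:]))  # a worse split
```

Splitting between the two clusters leaves each child with nearly constant targets, so its weighted MSE is far lower than that of the unsplit node or of the lopsided split.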

In [9]:
class RandomForestRegressor():
    def __init__(self):
        self.forest = []
    
    def create_subsample(self, X, y, a=0.25):
        n = len(y)
        n_tag = int(a*n)
        idx = np.random.randint(0, n, size=n_tag)
        X_subsample = X[idx]
        y_subsample = y[idx]
        return X_subsample, y_subsample

    def fit(self, X, y, T=300, max_depth=4):
        for i in range(T):
            X_subsample, y_subsample = self.create_subsample(X, y)
            tree = DecisionTreeRegressor(max_depth)
            tree.fit(X_subsample, y_subsample)
            self.forest.append(tree)

    def predict(self, X):
        predictions = []
        for tree in self.forest:
            predictions.append(tree.predict(X))
        return np.mean(predictions, axis=0)

    def get_mse(self, y_true, y_pred):
        return np.mean((y_true - y_pred)**2)

This **RandomForestRegressor** class implements a random forest model for regression tasks. Random forests are an ensemble learning method that builds multiple decision trees during training and outputs the mean prediction of the individual trees for regression.

The **`__init__()`** method initializes the random forest regressor with an empty forest list to store the individual decision trees.<br>
The **`create_subsample()`** method creates a random subsample of the dataset X and target variable y with a specified sampling ratio a. It randomly selects a*n samples from the dataset, where n is the total number of samples in y.<br>
The **`fit()`** method fits the random forest to the training data X and target variable y by creating T decision trees using the DecisionTreeRegressor class. Each tree is trained on a different random subsample of the data.<br>
The **`predict()`** method predicts the target variable for a given set of input features X by averaging the predictions of all the decision trees in the forest.<br>
The **`get_mse()`** method calculates the mean squared error (MSE) between the true target variable y_true and the predicted values y_pred.
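The averaging step in `predict()` relies on `np.mean(..., axis=0)`, which averages across trees while keeping one value per input row. The numbers below are hypothetical tree outputs used only to show the shape of the computation.

```python
import numpy as np

# Each row holds one (hypothetical) tree's predictions for the same three inputs
tree_predictions = np.array([
    [150.0, 90.0, 210.0],
    [160.0, 100.0, 190.0],
    [140.0, 110.0, 200.0],
])

forest_prediction = np.mean(tree_predictions, axis=0)  # average over trees
print(forest_prediction)
```

Individual trees disagree by up to 20 units here, while the averaged prediction sits in the middle of their range, which is the source of the stability of the forest.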

In [10]:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X_diabetes, y_diabetes = diabetes.data, diabetes.target

X_diabetes_train, X_diabetes_test, y_diabetes_train, y_diabetes_test = train_test_split(X_diabetes, y_diabetes, test_size=0.2, random_state=42)

In [11]:
regressor = RandomForestRegressor()
regressor.fit(X_diabetes_train, y_diabetes_train)

y_diabetes_pred = regressor.predict(X_diabetes_test)

mse = regressor.get_mse(y_diabetes_test, y_diabetes_pred)
print(mse)

3370.6018386463347


# Using Sklearn Models

In [12]:
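# These imports shadow the custom RandomForestClassifier and RandomForestRegressor classes defined above
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor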
from sklearn.datasets import load_digits, load_diabetes
from sklearn.metrics import accuracy_score, mean_squared_error

digit = load_digits()
diabetes = load_diabetes()

X_digit_train, X_digit_test, y_digit_train, y_digit_test = train_test_split(digit.data, digit.target, test_size=0.2, random_state=42)
X_diabetes_train, X_diabetes_test, y_diabetes_train, y_diabetes_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2, random_state=42)

## Classification

In [13]:
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_digit_train, y_digit_train)
y_digit_pred = clf.predict(X_digit_test)
digit_accuracy = accuracy_score(y_digit_test, y_digit_pred)
print(f"Digit Classification Accuracy: {digit_accuracy}")

Digit Classification Accuracy: 0.9722222222222222


## Regression

In [14]:
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_diabetes_train, y_diabetes_train)
y_diabetes_pred = reg.predict(X_diabetes_test)
diabetes_mse = mean_squared_error(y_diabetes_test, y_diabetes_pred)
print(f"Diabetes Regression Mean Squared Error: {diabetes_mse}")

Diabetes Regression Mean Squared Error: 2952.0105887640448


# Hyperparameter Tuning

In [15]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']  # None = all features; recent scikit-learn versions reject 'auto'
}

# Create the GridSearchCV object
rf_clf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(rf_clf, param_grid, cv=5, n_jobs=-1)

# Perform the grid search
grid_search.fit(X_digit_train, y_digit_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters for classifier model:", best_params)
print("Best Score:", best_score)

# Use the best model for predictions
best_rf_clf = grid_search.best_estimator_
y_pred = best_rf_clf.predict(X_digit_test)

Best Parameters for classifier model: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}
Best Score: 0.9749516066589237


The **GridSearchCV** is used for performing hyperparameter tuning through an exhaustive search over specified parameter values for an estimator.<br>
The **param_grid** dictionary defines the hyperparameters and their corresponding values to be tuned. It includes parameters like **n_estimators, max_depth, min_samples_split, min_samples_leaf,** and **max_features**.<br>
The RandomForestClassifier is initialized with a random_state for reproducibility. Then, a GridSearchCV object is created with the classifier, the parameter grid, **5-fold cross-validation (cv=5)**, and **parallel processing (n_jobs=-1)**.<br>
The fit method of the GridSearchCV object is called with the training data (X_digit_train and y_digit_train). This step performs an exhaustive search over the hyperparameter values specified in param_grid and evaluates the model performance using cross-validation.<br>
After the grid search is complete, the best parameters **(best_params_)** and best score **(best_score_)** are obtained from the GridSearchCV object. These values represent the hyperparameters that yielded the highest mean cross-validated score during the grid search, and they are printed to the console.<br>
Finally, the best estimator **(best_rf_clf)** from the grid search is used to make predictions on the test data **(X_digit_test)**, and the predicted labels **(y_pred)** are obtained.
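An exhaustive search gets expensive quickly, and a rough count shows why. The sketch below uses a grid of the same shape as the one above (five hyperparameters with three candidate values each); the candidate values are illustrative, since only their counts matter here.

```python
from itertools import product

# Five hyperparameters, three candidates each
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [None, "sqrt", "log2"],
}
cv_folds = 5

candidates = list(product(*param_grid.values()))
n_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {n_fits} model fits")
```

With the default `refit=True`, GridSearchCV trains one more model on the full training set using the best combination, which is why `n_jobs=-1` is worth setting for grids of this size.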

In [16]:
rf_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(rf_reg, param_grid, cv=5, n_jobs=-1)

# Perform the grid search
grid_search.fit(X_diabetes_train, y_diabetes_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters for regressor model:", best_params)
print("Best Score:", best_score)

# Use the best model for predictions
best_rf_reg = grid_search.best_estimator_
y_pred = best_rf_reg.predict(X_diabetes_test)

Best Parameters for regressor model: {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 200}
Best Score: 0.4266408924229098
