# Hyperparameter tuning


### Introduction

You can think of hyperparameters as the settings you choose when setting up your model, before it sees any data.


```
clf = RandomForestClassifier(n_estimators = 10000, max_leaf_nodes = 100, max_depth=20, random_state=0)
```

You set these values when you initialize the model, before any training happens. Values such as the number of estimators (trees), the maximum number of leaf nodes, and the maximum depth influence how accurately the model predicts, so finding a good combination of them is essential.
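As a small illustration of the difference between settings you choose and what the model learns, the sketch below reads the chosen settings back with get_params() and then inspects the fitted trees. The specific values (10 trees, 8 leaf nodes, depth 4) are arbitrary picks for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters: chosen by you, before training (values here are arbitrary).
clf = RandomForestClassifier(n_estimators=10, max_leaf_nodes=8, max_depth=4, random_state=0)
hyperparameters = clf.get_params()
print(hyperparameters["n_estimators"], hyperparameters["max_depth"])

# Learned state: only exists after fit(), for example the fitted trees themselves.
clf.fit(X, y)
print(len(clf.estimators_))
```

The forest ends up with exactly as many fitted trees as n_estimators asked for; the structure of each tree is what training decides.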

But how do you find good parameters for your model so that it generalizes to unseen data and predicts accurately?






Try making an algorithm here that finds the best parameters. You first need to:

-  Find parameters to test, then test them.
-  Check whether they give you a good result.
-  Store the parameters if the model performs well.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import pandas as pd


clf = RandomForestClassifier(n_estimators = '', max_leaf_nodes = '', max_depth = '', random_state = '')

from sklearn.model_selection import train_test_split
Data_x, Data_y = load_iris(return_X_y = True, as_frame = True)
X_train, X_test, Y_train, Y_test = train_test_split(Data_x, Data_y, test_size=0.8, shuffle = True)




Once you finish trying that, check out the different types of hyperparameter tuning algorithms.

## Grid Search Algorithm

A grid search algorithm is a method used to optimize a model's hyperparameters by exhaustively testing every possible combination of parameter values within a defined range. It involves defining a set of values for each hyperparameter and systematically evaluating the model's performance for each unique combination.
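To make "every possible combination" concrete, here is a minimal sketch that enumerates a small grid with itertools.product. The candidate values are hypothetical and chosen only to keep the grid small.

```python
from itertools import product

# Hypothetical candidate values for three random forest hyperparameters.
grid = {
    "n_estimators": [10, 50, 100],
    "max_leaf_nodes": [8, 16],
    "max_depth": [4, 8, 12],
}

# Every unique combination: one value taken from each list.
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(combinations))  # 3 * 2 * 3 = 18
print(combinations[0])    # {'n_estimators': 10, 'max_leaf_nodes': 8, 'max_depth': 4}
```

Each dictionary in combinations is one model to train and score, so the size of this list is the cost of the search.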

Grid search explores the defined grid thoroughly, but it has two drawbacks. First, it can overfit to whatever data is used to score the candidates: if the best combination is picked by its accuracy on the test set, as in the code below, that accuracy is an optimistic estimate of performance on unseen data. Scoring candidates on a separate validation set or with cross-validation avoids this. Second, grid search is computationally expensive, because the number of combinations is the product of the number of values tried for each hyperparameter.
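One common way to keep the search from tuning itself to the data used for the final score is to cross-validate on the training split only. The sketch below uses scikit-learn's GridSearchCV for that; the grid values, the 80/20 split, and the 5 folds are assumptions made for the example rather than part of the hand-written class in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# A deliberately small, hypothetical grid: 2 * 2 * 2 = 8 combinations.
param_grid = {"n_estimators": [10, 30], "max_leaf_nodes": [4, 8], "max_depth": [3, 6]}

# Each combination is scored by 5-fold cross-validation on the training split only.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)            # best combination found
print(search.best_score_)             # its mean cross-validated accuracy
print(search.score(X_test, y_test))   # accuracy on data the search never saw
```

Because the test split plays no part in choosing the combination, the last number is a fairer estimate of how the tuned model behaves on new data.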

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


Data_x, Data_y = load_iris(return_X_y = True, as_frame = True)
X_train, X_test, Y_train, Y_test = train_test_split(Data_x, Data_y, test_size=0.8, shuffle = True)

class Grid_search_RF():

  def __init__(self, n_estimators = 30, max_leaf_nodes = 30, max_depth = 20, best_parameters = [], Best_Model_accuracy = 0 ):
    self.n_estimators = n_estimators
    self.max_leaf_nodes = max_leaf_nodes
    self.max_depth = max_depth
    self.best_parameters = best_parameters
    self.Best_Model_accuracy = Best_Model_accuracy

  def fit(self, X_train, Y_train):
    try:

      for estimators in range(1, self.n_estimators):
        for l_nodes in range(2, self.max_leaf_nodes):
          for depth in range(1, self.max_depth):

            Random_forest = RandomForestClassifier(n_estimators = estimators, max_leaf_nodes = l_nodes, max_depth = depth)

            Random_forest.fit(X_train, Y_train) # the model itself is fitting the data here

            report_considered = classification_report(Y_test, Random_forest.predict(X_test), output_dict= True, zero_division=1)

            if report_considered['accuracy'] > self.Best_Model_accuracy:

              self.Best_Model_accuracy = report_considered['accuracy']

              self.best_parameters = {'Estimators:': estimators, 'l_nodes:': l_nodes, 'depth:': depth }

          print("********************************************************************************************************************************************************")
          print(f"your best parameters are {self.best_parameters} ")
          print("********************************************************************************************************************************************************")
          print("********************************************************************************************************************************************************")


          Tested_params = {'Estimators:': estimators, 'l_nodes:': l_nodes, 'depth:': depth}
          print("********************************************************************************************************************************************************")
          print(f"The parameters being tested are {Tested_params}")
          print("********************************************************************************************************************************************************")

      return self.best_parameters


    except KeyboardInterrupt:

      print("********************************************************************************************************************************************************")
      print(f"your best parameters are {self.best_parameters} ")
      print(f"its corresponding accuracy is {self.Best_Model_accuracy:.2%}")  # accuracy is a fraction in [0, 1]; .2% formats it as a percentage
      print("********************************************************************************************************************************************************")
      return self.best_parameters

Test = Grid_search_RF()

Parameters = Test.fit(X_train, Y_train )




********************************************************************************************************************************************************
your best parameters are {'Estimators:': 1, 'l_nodes:': 2, 'depth:': 6} 
********************************************************************************************************************************************************
********************************************************************************************************************************************************
********************************************************************************************************************************************************
The parameters being tested are {'Estimators:': 1, 'l_nodes:': 2, 'depth:': 19}
********************************************************************************************************************************************************
**********************************************************************************

The code is dissected and explained below

### Data

The data used here is the Iris dataset from scikit-learn, which catalogs three species of iris.
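A quick way to check what the dataset contains is to look at its shape and class balance. This is a minimal sketch; the variable names are only for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

# 150 flowers, each described by 4 measurements.
print(iris.data.shape)  # (150, 4)

# Three species, encoded as the integer labels 0, 1 and 2.
print(", ".join(iris.target_names))  # setosa, versicolor, virginica

# The classes are balanced: 50 rows per species.
class_counts = iris.target.value_counts().sort_index().tolist()
print(class_counts)  # [50, 50, 50]
```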

In [None]:
from sklearn.datasets import load_iris # A classification dataset covering 3 species of iris
import pandas as pd
from sklearn.model_selection import train_test_split

Data_x, Data_y = load_iris(return_X_y = True, as_frame = True) # return_X_y splits the dataset into features (X) and labels (y); as_frame returns them as pandas objects
X_train, X_test, Y_train, Y_test = train_test_split(Data_x, Data_y, test_size=0.8, shuffle = True) # train_test_split divides the data into training and test sets; test_size=0.8 holds out 80% of the rows for testing


### Class Initialization

Your grid search algorithm will differ depending on the model you are tuning; in this case, I have chosen the random forest classifier.

Although the random forest classifier has many hyperparameters, I am tuning only the three below.

In [None]:
class Grid_search_RF():

# The parameters are set when the class is initialized for two reasons:
    # 1. They act as placeholders that you can change
    # 2. It makes the algorithm easy to use
  def __init__(self, n_estimators = 30, max_leaf_nodes = 30, max_depth = 20, best_parameters = [], Best_Model_accuracy = 0 ):


    self.n_estimators = n_estimators # Upper bound (exclusive) on the number of trees tried
    self.max_leaf_nodes = max_leaf_nodes # Upper bound (exclusive) on the maximum number of leaf nodes per tree
    self.max_depth = max_depth # Upper bound (exclusive) on the maximum depth of each tree

    self.best_parameters = best_parameters
    self.Best_Model_accuracy = Best_Model_accuracy


### Fitting
The fit method is a manual grid search: it optimizes a RandomForestClassifier by systematically testing parameter combinations to find the best-performing setup. It iterates through ranges of values for three key parameters: the number of estimators (trees in the forest), the maximum number of leaf nodes, and the maximum tree depth.

For each combination, a RandomForestClassifier is initialized and then trained (fit) on the training data. The trained model is scored on X_test and Y_test, which the method reads from the surrounding notebook scope, and the combination is stored if its accuracy beats the best one seen so far.
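The work done for a single combination can be sketched as one small function. This is not the class's own code: evaluate_combination is a hypothetical helper, and it assumes the scoring data is passed in explicitly as a validation split instead of being read from the notebook's global variables.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)


def evaluate_combination(params, X_fit, y_fit, X_val, y_val):
    """Hypothetical helper: train one forest and return its validation accuracy."""
    model = RandomForestClassifier(random_state=0, **params)
    model.fit(X_fit, y_fit)
    return accuracy_score(y_val, model.predict(X_val))


score = evaluate_combination(
    {"n_estimators": 10, "max_leaf_nodes": 8, "max_depth": 4}, X_fit, y_fit, X_val, y_val
)
print(score)  # a fraction between 0 and 1
```

Passing the scoring data as arguments makes it obvious which rows influence the choice of parameters, and leaves a separate test set free for the final check.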



In [None]:
def fit(self, X_train, Y_train):
    try:
        # This method tries different combinations of parameters to find the optimal ones.
        # The try block allows for interruption (e.g., with Ctrl+C) if the process takes too long.

        # Loop over a range of values for the number of estimators (number of trees in the forest)
        for estimators in range(1, self.n_estimators):
            # Loop over a range of values for the maximum number of leaf nodes
            for l_nodes in range(2, self.max_leaf_nodes):
                # Loop over a range of values for the maximum depth of the trees
                for depth in range(1, self.max_depth):

                    # Initialize a RandomForestClassifier with current values of estimators, max_leaf_nodes, and max_depth
                    Random_forest = RandomForestClassifier(
                        n_estimators=estimators, max_leaf_nodes=l_nodes, max_depth=depth
                    )

                    # Train (fit) the model on the training data
                    Random_forest.fit(X_train, Y_train)

                    # Test the model by making predictions on the test data and getting the classification report
                    report_considered = classification_report(
                        Y_test, Random_forest.predict(X_test), output_dict=True
                    )

                    # If the model accuracy with current parameters is better than the best recorded accuracy, update it
                    if report_considered['accuracy'] > self.Best_Model_accuracy:
                        self.Best_Model_accuracy = report_considered['accuracy']
                        # Store the best parameters found so far
                        self.best_parameters = {
                            'Estimators': estimators,
                            'l_nodes': l_nodes,
                            'depth': depth
                        }

            # Print the best parameters found after each iteration
            print("********************************************************************************************************************************************************")
            print(f"Your best parameters so far are {self.best_parameters}")
            print("********************************************************************************************************************************************************")

            # Print the parameters currently being tested
            Tested_params = {'Estimators': estimators, 'l_nodes': l_nodes, 'depth': depth}
            print("********************************************************************************************************************************************************")
            print(f"The parameters being tested are {Tested_params}")
            print("********************************************************************************************************************************************************")


        return self.best_parameters

    except KeyboardInterrupt:
        # Allow the user to interrupt the training process, and print the best parameters found so far
        print("********************************************************************************************************************************************************")
        print(f"Your best parameters are {self.best_parameters}")
        print(f"Corresponding accuracy: {self.Best_Model_accuracy:.2%}")  # accuracy is a fraction in [0, 1]; .2% formats it as a percentage
        print("********************************************************************************************************************************************************")
        return self.best_parameters


## Random Search algorithm

A random search algorithm is an optimization technique used to find the best hyperparameters for a model by randomly selecting values from a defined range for each parameter. Unlike grid search, which evaluates all possible parameter combinations, random search samples a fixed number of random combinations from the parameter space, making it computationally efficient and often quicker.

Random search is particularly useful when some hyperparameters have little effect on the model’s performance. A grid spends many of its trials repeating the same few values of the important hyperparameters, while random sampling draws a new value of every hyperparameter on each trial, so for the same budget it usually covers more distinct values of the ones that matter. Like grid search, it can still overfit to the data used to score the candidates, so the chosen combination should be confirmed on data that played no part in the search.
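The effect of sampling on coverage can be seen without training anything. The sketch below spends the same budget of 9 trials on a 3 by 3 grid and on random draws, then counts how many distinct depths each one tried. The ranges and grid values are assumptions picked for the illustration.

```python
import random

random.seed(0)
budget = 9  # same number of trials for both strategies

# Grid: 3 values per hyperparameter gives 9 combinations, but only 3 distinct depths.
depth_values = [2, 10, 18]
leaf_values = [4, 16, 28]
grid_trials = [(d, l) for d in depth_values for l in leaf_values]

# Random: every trial draws a fresh depth and leaf count (randint includes both ends).
random_trials = [(random.randint(1, 20), random.randint(2, 30)) for _ in range(budget)]

grid_depths = {d for d, _ in grid_trials}
random_depths = {d for d, _ in random_trials}
print(len(grid_depths))    # 3
print(len(random_depths))  # typically more than 3; repeats are possible
```

If depth happens to be the hyperparameter that matters most, the random trials have probed it at more points for the same cost.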

Here's how random search typically works:

1. Define the range or distribution for each hyperparameter (e.g., a uniform distribution for learning rate).

2. Randomly sample a specified number of combinations of these values.

3. Evaluate each sampled combination’s performance on the model using cross-validation.

4. Select the combination that yields the highest performance metric.

Random search can be an effective alternative to grid search, especially when computational resources are limited.
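The four steps above can be sketched with scikit-learn's RandomizedSearchCV. This is an alternative to the hand-written class in this notebook, not a description of it; the ranges mirror the class defaults, while n_iter=20 and cv=5 are assumptions chosen to keep the example quick.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Step 1: a distribution per hyperparameter (randint's upper bound is exclusive).
param_distributions = {
    "n_estimators": randint(1, 31),
    "max_leaf_nodes": randint(2, 31),
    "max_depth": randint(1, 21),
}

# Steps 2 and 3: sample n_iter combinations and score each with 5-fold cross-validation.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=20,
    cv=5,
    random_state=0,
)
search.fit(X, y)

# Step 4: the best-scoring combination and its mean cross-validated accuracy.
print(search.best_params_)
print(search.best_score_)
```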

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris # needed for the load_iris call further down
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split


Data_x, Data_y = load_iris(return_X_y = True, as_frame = True)
X_train, X_test, Y_train, Y_test = train_test_split(Data_x, Data_y, test_size=0.8, shuffle = True)

class Random_search_RF():

  def __init__(self,n_estimators = 31, max_leaf_nodes = 31, max_depth = 21, iterations = 1000, best_parameters = [], Best_Model_accuracy = 0 ):

    self.n_estimators = n_estimators # The number of trees in the forest in the Random Forest Classifier
    self.max_leaf_nodes = max_leaf_nodes # The maximum number of leaf nodes per tree
    self.max_depth = max_depth # The maximum depth of each tree

    self.iterations = iterations
    self.best_parameters = best_parameters
    self.Best_Model_accuracy = Best_Model_accuracy

  def fit(self, X_train, Y_train):
    try:

      for iterations in range(self.iterations):
        estimators = np.random.randint(1, self.n_estimators)
        max_depth = np.random.randint(1, self.max_depth)
        max_leaf_nodes = np.random.randint(2, self.max_leaf_nodes)

        Random_forest = RandomForestClassifier(n_estimators = estimators, max_leaf_nodes = max_leaf_nodes, max_depth = max_depth)

        Random_forest.fit(X_train, Y_train)

        # Test the model by making predictions on the test data and getting the classification report
        report_considered = classification_report(
            Y_test, Random_forest.predict(X_test), output_dict=True
        )

        # If the model accuracy with current parameters is better than the best recorded accuracy, update it
        if report_considered['accuracy'] > self.Best_Model_accuracy:
            self.Best_Model_accuracy = report_considered['accuracy']
            # Store the best parameters found so far
            self.best_parameters = {
                'Estimators': estimators,
                'l_nodes': max_leaf_nodes,
                'depth': max_depth
            }
        Tested_Parameter = { 'Estimators': estimators, 'l_nodes': max_leaf_nodes, 'depth': max_depth }

        print(f"The parameters being tested are {Tested_Parameter}")
        print("******************************************************************************************************************************************************")
        print(f"The best parameters are {self.best_parameters}")
        print("******************************************************************************************************************************************************")
      return self.best_parameters

    except KeyboardInterrupt:

      print("********************************************************************************************************************************************************")
      print(f"your best parameters are {self.best_parameters} ")
      print(f"its corresponding accuracy is {self.Best_Model_accuracy:.2%}")  # accuracy is a fraction in [0, 1]; .2% formats it as a percentage
      print("********************************************************************************************************************************************************")
      return self.best_parameters

Test = Random_search_RF()  # use the random search class defined above

Parameters = Test.fit(X_train, Y_train )



# If you want to see it working, look at the print statements here

The parameters being tested are {'Estimators': 19, 'l_nodes': 14, 'depth': 19}
******************************************************************************************************************************************************
The best parameters are {'Estimators': 19, 'l_nodes': 14, 'depth': 19}
******************************************************************************************************************************************************
The parameters being tested are {'Estimators': 14, 'l_nodes': 22, 'depth': 16}
******************************************************************************************************************************************************
The best parameters are {'Estimators': 19, 'l_nodes': 14, 'depth': 19}
******************************************************************************************************************************************************
The parameters being tested are {'Estimators': 8, 'l_nodes': 6, 'depth': 15}
*******************



The parameters being tested are {'Estimators': 4, 'l_nodes': 26, 'depth': 14}
******************************************************************************************************************************************************
The best parameters are {'Estimators': 6, 'l_nodes': 4, 'depth': 13}
******************************************************************************************************************************************************
The parameters being tested are {'Estimators': 3, 'l_nodes': 25, 'depth': 15}
******************************************************************************************************************************************************
The best parameters are {'Estimators': 6, 'l_nodes': 4, 'depth': 13}
******************************************************************************************************************************************************
The parameters being tested are {'Estimators': 21, 'l_nodes': 6, 'depth': 13}
************************

The code is dissected and explained below.

### Data
The data used here is the Iris dataset from scikit-learn, which catalogs three species of iris.

In [None]:
from sklearn.datasets import load_iris # A classification dataset covering 3 species of iris
import pandas as pd
from sklearn.model_selection import train_test_split

Data_x, Data_y = load_iris(return_X_y = True, as_frame = True) # return_X_y splits the dataset into features (X) and labels (y); as_frame returns them as pandas objects
X_train, X_test, Y_train, Y_test = train_test_split(Data_x, Data_y, test_size=0.8, shuffle = True) # train_test_split divides the data into training and test sets; test_size=0.8 holds out 80% of the rows for testing


### Class Initialization

Your random search algorithm will differ depending on the model you are tuning; in this case, I have chosen the random forest classifier.

Although the random forest classifier has many hyperparameters, I am tuning only the three below.

Unlike the grid search class, this one takes an extra argument, iterations. It sets how many random combinations are tried, and therefore how long the search runs.

In [None]:
class Random_search_RF():

  def __init__(self,n_estimators = 31, max_leaf_nodes = 31, max_depth = 21, iterations = 1000, best_parameters = [], Best_Model_accuracy = 0 ):

    self.n_estimators = n_estimators # The number of trees in the forest in the Random Forest Classifier
    self.max_leaf_nodes = max_leaf_nodes # The maximum number of leaf nodes per tree
    self.max_depth = max_depth # The maximum depth of each tree

    self.iterations = iterations
    self.best_parameters = best_parameters
    self.Best_Model_accuracy = Best_Model_accuracy


### Fitting

The fit function optimizes hyperparameters for a RandomForestClassifier through a random search approach. In each iteration, it randomly selects values for n_estimators, max_depth, and max_leaf_nodes, trains the model on the training data, and evaluates its performance on the test data. If the current model's accuracy surpasses the best recorded accuracy, it updates the best parameters accordingly. The function allows for graceful interruption, enabling users to retrieve the best parameters found so far if training is stopped prematurely.
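Because the values are drawn at random, two runs of the search will generally try different combinations. If repeatable runs are wanted, one option is to draw from a seeded NumPy generator. The sketch below shows only the sampling step; sample_parameters is a hypothetical helper and the seed 42 is arbitrary.

```python
import numpy as np


def sample_parameters(rng, n_estimators=31, max_leaf_nodes=31, max_depth=21):
    """Hypothetical helper: draw one random combination (upper bounds are exclusive)."""
    return {
        "n_estimators": int(rng.integers(1, n_estimators)),
        "max_leaf_nodes": int(rng.integers(2, max_leaf_nodes)),
        "max_depth": int(rng.integers(1, max_depth)),
    }


# Two generators with the same seed produce the same sequence of draws.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
run_a = [sample_parameters(rng_a) for _ in range(5)]
run_b = [sample_parameters(rng_b) for _ in range(5)]

print(run_a[0])
print(run_a == run_b)  # True
```

The search results are fully repeatable only if the forests are seeded as well, for example by also passing random_state to RandomForestClassifier.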








In [None]:
def fit(self, X_train, Y_train):
    # Start a try block to allow for clean handling of interruptions (e.g., KeyboardInterrupt)
    try:
        # Loop through a specified number of iterations for parameter testing
        for iterations in range(self.iterations):


            estimators = np.random.randint(1, self.n_estimators) # Randomly select a number of estimators (trees) for the random forest model
            max_depth = np.random.randint(1, self.max_depth) # Randomly select a maximum depth for the trees
            max_leaf_nodes = np.random.randint(2, self.max_leaf_nodes) # Randomly select a maximum number of leaf nodes per tree

            # Initialize a RandomForestClassifier with the randomly chosen parameters
            Random_forest = RandomForestClassifier( n_estimators=estimators, max_leaf_nodes=max_leaf_nodes, max_depth=max_depth)

            # Train the model on the training data
            Random_forest.fit(X_train, Y_train)

            # Generate predictions on test data and calculate accuracy using classification report
            report_considered = classification_report(
                Y_test, Random_forest.predict(X_test), output_dict=True
            )

            # Update the best model accuracy and parameters if current parameters yield higher accuracy
            if report_considered['accuracy'] > self.Best_Model_accuracy:
                self.Best_Model_accuracy = report_considered['accuracy']

                # Store the current parameters as the best parameters found so far
                self.best_parameters = {
                    'Estimators': estimators,
                    'l_nodes': max_leaf_nodes,
                    'depth': max_depth
                }

            # Store the current parameters for reference
            Tested_Parameter = {
                'Estimators': estimators,
                'l_nodes': max_leaf_nodes,
                'depth': max_depth
            }

            # Print the parameters currently being tested and the best parameters found so far
            print(f"The parameters being tested are {Tested_Parameter}")
            print("******************************************************************************************************************************************************")
            print(f"The best parameters are {self.best_parameters}")
            print("******************************************************************************************************************************************************")

        # Return the best parameters after completing all iterations
        return self.best_parameters

    # Handle keyboard interruptions, allowing the user to stop and see the best parameters so far
    except KeyboardInterrupt:
        print("********************************************************************************************************************************************************")
        print(f"Your best parameters are {self.best_parameters}")
        print(f"The corresponding accuracy is {self.Best_Model_accuracy:.2%}")  # accuracy is a fraction in [0, 1]; .2% formats it as a percentage
        print("********************************************************************************************************************************************************")

        # Return the best parameters found so far
        return self.best_parameters


## Gradient Descent Algorithm

[Place holder, will be updated]

In [None]:
# [Place holder, will be updated]