In this code, we first load the dataset and split it into training and testing sets. We then define an AdaBoostClassifier model and a hyperparameter grid built from NumPy arrays. The RandomizedSearchCV object is defined with the number of sampled settings (n_iter=50), the number of cross-validation folds (cv=5), and a random seed. We fit the search object to the training data, then print the training accuracy, the classification report and accuracy on the test set, and the best hyperparameters found.

The logic for choosing the hyperparameters is as follows:

    n_estimators: This hyperparameter sets the maximum number of weak learners in the AdaBoost ensemble (depth-1 decision trees by default). We search over values from 50 to 500 in steps of 50. This range was chosen based on common values used in the literature.
    learning_rate: This hyperparameter scales the contribution of each weak learner to the ensemble; smaller values usually need more estimators to reach the same fit. We search over 50 values from 0.0001 to 1, spaced evenly on a logarithmic scale. This range was chosen based on common values used in the literature.
    algorithm: This hyperparameter selects the boosting algorithm. We search over two values: SAMME and SAMME.R. SAMME.R was the default in older scikit-learn releases, but it was deprecated in scikit-learn 1.4, so this entry may have to be dropped depending on the installed version. We include both here for completeness.

By sampling combinations from these ranges and scoring each one with cross-validation, we can find a well-performing set of hyperparameters for the AdaBoost model. Because the randomized search evaluates only 50 of the 1,000 possible combinations, the result is the best of the sampled settings, not a guaranteed optimum.
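
The size of the search space described above can be checked with a few lines of NumPy. This is a minimal sketch that only counts candidates and fits; the variable names are illustrative, and the values mirror the grid, n_iter, and cv used in the notebook.

```python
import numpy as np

# Same candidate values as the grid described above.
n_estimators = np.arange(50, 501, 50)     # 50, 100, ..., 500
learning_rate = np.logspace(-4, 0, 50)    # 50 log-spaced points between 1e-4 and 1
algorithm = ["SAMME", "SAMME.R"]

# Number of distinct hyperparameter combinations in the full grid.
n_combinations = len(n_estimators) * len(learning_rate) * len(algorithm)

n_iter = 50  # settings sampled by the randomized search
cv = 5       # cross-validation folds per sampled setting

print("total combinations:", n_combinations)
print("fraction sampled:", n_iter / n_combinations)
print("model fits during the search:", n_iter * cv)
```

The fit count excludes the final refit of the best setting on the full training set, which RandomizedSearchCV performs by default.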

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
df=pd.read_excel('RE_Data.xlsx')

In [3]:
df = df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1)

In [4]:
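# No explicit shuffle happens here: df_shuff is just another name for df
# and is not used later. train_test_split shuffles the rows by default.
df_shuff = df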
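
Plain assignment only creates a second reference to the same DataFrame, so the row order does not change. If an explicit shuffle is wanted, one common option is DataFrame.sample with frac=1. The sketch below uses a made-up toy frame; the "feature" column and the seed value are assumptions for illustration only.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({"feature": [0.1, 0.2, 0.3, 0.4], "ph_labels": [0, 1, 2, 3]})

# Plain assignment: same object, same row order.
alias = toy

# Explicit shuffle: sample all rows without replacement, then renumber the index.
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

print(alias is toy)
print(len(shuffled) == len(toy))
```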

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
var_columns = [c for c in df if c not in ['ph','ph_labels']]

X = df.loc[:,var_columns].values
y = df.loc[:,'ph_labels'].values

In [7]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import classification_report


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the AdaBoost classifier model
model = AdaBoostClassifier()

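# Define the hyperparameter grid to search over
param_dist = {
    'n_estimators': np.arange(50, 501, 50),    # 50, 100, ..., 500
    'learning_rate': np.logspace(-4, 0, 50),   # 50 log-spaced values from 1e-4 to 1
    'algorithm': ['SAMME', 'SAMME.R']          # 'SAMME.R' is deprecated as of scikit-learn 1.4
}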

# Define the randomized search object
search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=50, cv=5, random_state=42)

# Fit the randomized search object to the training data
search.fit(X_train, y_train)


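# Predict on the held-out test set with the refitted best estimator
y_pred = search.predict(X_test)

# Accuracy of the best estimator on the training and test sets
accuracy_train = search.score(X_train, y_train)
accuracy_test = search.score(X_test, y_test)

# Per-class precision, recall, and F1 on the test set
report_test = classification_report(y_test, y_pred)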

print("AdaBoostClassifier_train ",accuracy_train)
print("AdaBoostClassifier report:\n{}\nAccuracy_test: {:.3f}\n".format(report_test, accuracy_test))
print("AdaBoostClassifier best params:\n{}\n".format(search.best_params_))




AdaBoostClassifier_train  0.6158857142857143
AdaBoostClassifier report:
              precision    recall  f1-score   support

           0       0.70      0.19      0.30      3818
           1       0.49      0.81      0.61      3701
           2       0.63      0.85      0.72      3719
           3       0.81      0.60      0.69      3762

    accuracy                           0.61     15000
   macro avg       0.66      0.61      0.58     15000
weighted avg       0.66      0.61      0.58     15000

Accuracy_test: 0.610

AdaBoostClassifier best params:
{'n_estimators': 500, 'learning_rate': 1.0, 'algorithm': 'SAMME'}
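
The output above reports training and test accuracy but not the mean cross-validated score of the best setting. That score is available from the fitted search object as best_score_, and the per-setting scores are in cv_results_. The sketch below shows this on synthetic data from make_classification, which is an assumption standing in for RE_Data.xlsx; the smaller grid, n_iter, and cv values are chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic 4-class data as a stand-in for the real dataset (an assumption).
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Deliberately small search space so the example runs quickly.
param_dist = {
    "n_estimators": np.arange(10, 51, 10),
    "learning_rate": np.logspace(-2, 0, 10),
}

search = RandomizedSearchCV(AdaBoostClassifier(random_state=0),
                            param_distributions=param_dist,
                            n_iter=5, cv=3, random_state=42)
search.fit(X_train, y_train)

# Mean cross-validated accuracy of the best sampled setting.
print("best mean CV accuracy:", search.best_score_)
print("test accuracy:", search.score(X_test, y_test))

# Mean and spread of the CV score for every sampled setting.
results = search.cv_results_
for params, mean, std in zip(results["params"],
                             results["mean_test_score"],
                             results["std_test_score"]):
    print(params, round(mean, 3), round(std, 3))
```

Comparing best_score_ with the test accuracy gives a quick check on whether the selected setting generalizes as the cross-validation estimate suggests.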

