In this code, we first load the dataset and split it into training and validation sets. We then define an AdaBoostRegressor model and a hyperparameter grid built from NumPy arrays. The RandomizedSearchCV object is configured with the number of sampled candidates (n_iter=50), the number of cross-validation folds (cv=5), and a random seed. We fit the search on the training data, then print the R^2 and RMSE scores on the training and validation sets together with the best hyperparameters.

The logic for choosing the hyperparameters is as follows:

    n_estimators: This hyperparameter sets the maximum number of base estimators (decision trees by default) in the AdaBoost ensemble. We search values from 50 to 500 in steps of 50. This range was chosen based on common values used in the literature.
    learning_rate: This hyperparameter scales the contribution of each tree at every boosting iteration. We search 50 values from 0.0001 to 1, evenly spaced on a logarithmic scale. This range was chosen based on common values used in the literature.
    loss: This hyperparameter selects the loss function used to update the sample weights after each boosting iteration. We search over the three values AdaBoostRegressor accepts: linear (the default), square, and exponential.

By sampling 50 of the 1,500 possible combinations in this grid and scoring each one with 5-fold cross-validation, we can find a well-performing configuration for the AdaBoost model. Because the search is randomized and covers only part of the grid, the result is not guaranteed to be the best combination overall.
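As a rough illustration of the search space described above, the sketch below rebuilds the same grid and counts its combinations. It uses scikit-learn's ParameterSampler only to show what drawing 50 candidates looks like; the variable names n_combinations and sampled are introduced here for illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

# Same grid as the one passed to RandomizedSearchCV in the notebook.
param_dist = {
    "n_estimators": np.arange(50, 501, 50),
    "learning_rate": np.logspace(-4, 0, 50),
    "loss": ["linear", "square", "exponential"],
}

n_estimators_values = param_dist["n_estimators"]
learning_rate_values = param_dist["learning_rate"]

# Total number of distinct combinations in the grid.
n_combinations = int(np.prod([len(v) for v in param_dist.values()]))

# Draw 50 candidate settings from the grid (illustration only).
sampled = list(ParameterSampler(param_dist, n_iter=50, random_state=42))

print("n_estimators:", n_estimators_values.tolist())
print("learning_rate range:", learning_rate_values.min(), "to", learning_rate_values.max())
print("grid size:", n_combinations, "| sampled:", len(sampled))
```

With 10 values of n_estimators, 50 of learning_rate, and 3 of loss, a search with n_iter=50 evaluates only a small fraction of the grid.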

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
df=pd.read_excel('RE_Data.xlsx')

In [3]:
df = df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1)

In [4]:
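# NOTE: no shuffling happens here; df_shuff is just another name for df.
# train_test_split (used below) shuffles the rows by default.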
df_shuff = df

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
var_columns = [c for c in df_shuff if c not in ['ph','ph_labels']]

X = df_shuff.loc[:,var_columns].values
y = df_shuff.loc[:,'ph'].values

In [7]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score

# Split the data into training (70%) and validation (30%) sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)


# Define the AdaBoost regression model
model = AdaBoostRegressor()

# Define the hyperparameter grid to search over
param_dist = {
    'n_estimators': np.arange(50, 501, 50),      # 50, 100, ..., 500
    'learning_rate': np.logspace(-4, 0, 50),     # 50 log-spaced values from 1e-4 to 1
    'loss': ['linear', 'square', 'exponential']  # all losses AdaBoostRegressor accepts
}

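# Define the randomized search object: 50 sampled candidates, 5-fold CV.
# No `scoring` is given, so candidates are ranked by the regressor's R^2.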
search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=50, cv=5, random_state=42)

# Fit the randomized search object to the training data
search.fit(X_train, y_train)

# Predict on training and validation sets
# (search.predict uses the best estimator, refit on the full training set)
y_pred_train = search.predict(X_train)
y_pred_valid = search.predict(X_valid)

# Calculate evaluation metrics
mse_train = mean_squared_error(y_train, y_pred_train)
mse_valid = mean_squared_error(y_valid, y_pred_valid)
rmse_train = np.sqrt(mse_train)
rmse_valid = np.sqrt(mse_valid)
r2_train = r2_score(y_train, y_pred_train)
r2_valid = r2_score(y_valid, y_pred_valid)

# Print the evaluation metrics and best hyperparameters
print("AdaBoost Regressor_train R^2 score: {:.3f}".format(r2_train))
print("AdaBoost Regressor_valid R^2 score: {:.3f}".format(r2_valid))
print("AdaBoost Regressor_train RMSE score: {:.3f}".format(rmse_train))
print("AdaBoost Regressor_valid RMSE score: {:.3f}".format(rmse_valid))
print("AdaBoost Regressor best params:\n{}\n".format(search.best_params_))


AdaBoost Regressor_train R^2 score: 0.449
AdaBoost Regressor_valid R^2 score: 0.444
AdaBoost Regressor_train RMSE score: 0.214
AdaBoost Regressor_valid RMSE score: 0.216
AdaBoost Regressor best params:
{'n_estimators': 500, 'loss': 'linear', 'learning_rate': 0.6866488450042998}
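
The cell above reports scores on the training and validation splits but not the mean cross-validated score of the winning candidate. A minimal sketch of how that score could be read from a fitted search is shown below. It runs on synthetic data from make_regression with a deliberately smaller grid so that it finishes quickly; the data, the grid sizes, and the names best_cv_r2 and valid_r2 are assumptions for illustration, not the notebook's values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the real dataset (assumption, for illustration only).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Reduced grid so the example runs fast; the notebook uses a larger one.
param_dist = {
    "n_estimators": np.arange(10, 51, 10),
    "learning_rate": np.logspace(-2, 0, 10),
    "loss": ["linear", "square", "exponential"],
}

search = RandomizedSearchCV(
    AdaBoostRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=5,
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)

# Mean cross-validated R^2 of the best candidate (default scoring for regressors).
best_cv_r2 = search.best_score_
# R^2 of the refit best estimator on the held-out validation split.
valid_r2 = search.score(X_valid, y_valid)

print("Best params:", search.best_params_)
print("Mean CV R^2 of best candidate: {:.3f}".format(best_cv_r2))
print("Validation R^2: {:.3f}".format(valid_r2))
```

Comparing best_score_ with the validation score gives a quick check on whether the cross-validated estimate carries over to held-out data.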

