In this code, we use the GradientBoostingRegressor model from the sklearn.ensemble module. We define the hyperparameter search space in the param_dist dictionary. We tune the learning rate (learning_rate), the number of trees (n_estimators), the maximum depth of the trees (max_depth), the minimum number of samples required to split an internal node (min_samples_split), the minimum number of samples required at a leaf node (min_samples_leaf), and the number of features considered at each split (max_features).
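
To get a feel for how large this search space is, the sketch below counts the candidate values per hyperparameter, mirroring the ranges in param_dist, and compares the total with the 50 candidates that the randomized search draws. The names grid_sizes, total_combinations, and coverage are illustrative and do not appear in the notebook.

```python
import math

# Number of candidate values per hyperparameter, mirroring param_dist.
grid_sizes = {
    "n_estimators": len(range(50, 501, 50)),  # np.arange(50, 501, 50)
    "learning_rate": 50,                      # np.logspace(-4, 0, 50)
    "max_depth": len(range(2, 11)),           # np.arange(2, 11)
    "min_samples_split": len(range(2, 11)),   # np.arange(2, 11)
    "min_samples_leaf": len(range(1, 6)),     # np.arange(1, 6)
    "max_features": 3,                        # three options in the list
}

# An exhaustive grid search would have to try every combination.
total_combinations = math.prod(grid_sizes.values())

# The randomized search samples only n_iter of them.
n_iter = 50
coverage = n_iter / total_combinations

print(total_combinations)       # 607500
print(f"{coverage:.4%}")        # fraction of the grid that is sampled
```

With roughly 600,000 combinations, an exhaustive grid search would be impractical here, which is the usual motivation for sampling a fixed number of candidates instead.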

We then define the randomized search object with search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=50, cv=5, random_state=42), which samples 50 parameter combinations and scores each one with 5-fold cross-validation. We fit it to the training data using search.fit(X_train, y_train) and make predictions on the training and validation sets using search.predict(X_train) and search.predict(X_valid).
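
The following is a minimal sketch of the same workflow on synthetic data, meant only to show what the fitted search object exposes. The data from make_regression and the deliberately tiny parameter lists are assumptions chosen so that the example runs in a few seconds; they are not the notebook's data or search space.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (an assumption; the notebook reads RE_Data.xlsx).
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Tiny illustrative search space, much smaller than the notebook's param_dist.
param_dist = {
    "n_estimators": [20, 40],
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [2, 3],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=4,        # sample 4 candidates
    cv=3,            # score each with 3-fold cross-validation
    random_state=42,
)
search.fit(X_train, y_train)

# With the default refit=True, the best candidate is refitted on all of
# X_train, and search.predict delegates to that refitted estimator.
y_pred_valid = search.predict(X_valid)
same = np.allclose(y_pred_valid, search.best_estimator_.predict(X_valid))

print(search.best_params_)
print(search.best_score_)  # mean cross-validated score of the best candidate
print(same)
```

Because no scoring argument is passed, the search ranks candidates by the estimator's own score method, which is R^2 for a regressor.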

Finally, we evaluate the model on both the training and validation sets. We compute the coefficient of determination (R^2) with r2_score() and the root mean squared error (RMSE) as the square root of mean_squared_error().
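
For reference, both metrics can be written out directly from their definitions. The sketch below does so with NumPy; the helper names rmse and r2 and the four toy values are made up for illustration and are not part of the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Toy values, chosen only so the arithmetic is easy to follow.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

print(f"RMSE: {rmse(y_true, y_pred):.3f}")
print(f"R^2:  {r2(y_true, y_pred):.3f}")
```

RMSE is expressed in the units of the target, while R^2 is unitless, so the two are best read together.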

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
df = pd.read_excel('RE_Data.xlsx')

In [3]:
df = df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1)

In [4]:
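# Shuffle the DataFrame rows (the seed 42 is an arbitrary choice).
# Note: later cells read from df, not df_shuff, and train_test_split
# shuffles by default, so this copy does not affect the results below.
df_shuff = df.sample(frac=1, random_state=42).reset_index(drop=True)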

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
var_columns = [c for c in df if c not in ['ph','ph_labels']]

X = df.loc[:,var_columns].values
y = df.loc[:,'ph'].values

In [7]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Split the data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the Gradient Boosting Regressor model
model = GradientBoostingRegressor()

# Define the hyperparameter grid to search over
param_dist = {
    'n_estimators': np.arange(50, 501, 50),
    'learning_rate': np.logspace(-4, 0, 50),
    'max_depth': np.arange(2, 11),
    'min_samples_split': np.arange(2, 11),
    'min_samples_leaf': np.arange(1, 6),
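    # 'auto' was removed in scikit-learn 1.3; for this regressor it meant
    # "use all features", which is what None selects.
    'max_features': [None, 'sqrt', 'log2']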
}

# Define the randomized search object
search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=50, cv=5, random_state=42)

# Fit the randomized search object to the training data
search.fit(X_train, y_train)

# Make predictions on training and validation sets
y_pred_train = search.predict(X_train)
y_pred_valid = search.predict(X_valid)

# Calculate evaluation metrics
mse_train = mean_squared_error(y_train, y_pred_train)
mse_valid = mean_squared_error(y_valid, y_pred_valid)
rmse_train = np.sqrt(mse_train)
rmse_valid = np.sqrt(mse_valid)
r2_train = r2_score(y_train, y_pred_train)
r2_valid = r2_score(y_valid, y_pred_valid)

# Print results
print("GradientBoostingRegressor train R^2: {:.3f}".format(r2_train))
print("GradientBoostingRegressor valid R^2: {:.3f}".format(r2_valid))
print("RMSE_train: {:.3f}".format(rmse_train))
print("RMSE_valid: {:.3f}".format(rmse_valid))
print("GradientBoostingRegressor best params:\n{}\n".format(search.best_params_))


GradientBoostingRegressor train R^2: 1.000
GradientBoostingRegressor valid R^2: 0.998
RMSE_train: 0.006
RMSE_valid: 0.013
GradientBoostingRegressor best params:
{'n_estimators': 250, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 10, 'learning_rate': 0.12648552168552957}

