In this code, we use the GradientBoostingClassifier model from the sklearn.ensemble module. The param_dist dictionary defines the hyperparameter search space that the randomized search samples from. We tune the learning rate (learning_rate), the number of trees (n_estimators), the maximum depth of the trees (max_depth), the minimum number of samples required to split an internal node (min_samples_split), and the minimum number of samples required at a leaf node (min_samples_leaf).
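
To make the size of this search space concrete, the sketch below counts the combinations in param_dist and draws a few candidates with sklearn's ParameterSampler, the utility that performs this kind of sampling. The choice of 5 draws is an assumption made only for illustration; the notebook itself uses n_iter=50.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

# Same search space as param_dist in the notebook
param_dist = {
    'learning_rate': np.logspace(-4, 0, 50),
    'n_estimators': np.arange(50, 501, 50),
    'max_depth': np.arange(2, 11),
    'min_samples_split': np.arange(2, 21),
    'min_samples_leaf': np.arange(1, 11),
}

# Number of distinct combinations an exhaustive grid search would have to try
n_combinations = int(np.prod([len(v) for v in param_dist.values()]))

# Draw 5 candidate settings at random (illustrative count, not the notebook's 50)
candidates = list(ParameterSampler(param_dist, n_iter=5, random_state=42))

print("combinations in the full space:", n_combinations)
for params in candidates:
    print(params)
```

The full space holds 855,000 combinations, so a randomized search with 50 iterations evaluates only a very small fraction of it. That is the trade-off behind choosing RandomizedSearchCV over an exhaustive grid search here.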

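We then define the randomized search object with search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=50, cv=5, random_state=42), which samples 50 candidate settings and scores each one with 5-fold cross-validation, and fit it to the training data using search.fit(X_train, y_train). We make predictions on the testing data using search.predict(X_test), which uses the best candidate refit on the full training set.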
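
The same workflow can be tried end to end on a small synthetic problem. This is a minimal sketch, not the notebook's experiment: the data from make_classification, the reduced search space small_param_dist, and the small n_iter and cv values are all assumptions chosen so that it runs in a few seconds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption: this is not the RE_Data.xlsx dataset)
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Reduced, hypothetical search space to keep the run short
small_param_dist = {
    'learning_rate': np.logspace(-2, 0, 10),
    'n_estimators': np.arange(20, 61, 20),
    'max_depth': np.arange(2, 5),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=small_param_dist,
    n_iter=4, cv=3, random_state=42)
search.fit(X_train, y_train)

# best_score_ is the mean cross-validated score of the best candidate
print("best params:", search.best_params_)
print("mean CV accuracy of best candidate: {:.3f}".format(search.best_score_))
print("candidates tried:", len(search.cv_results_['params']))

# predict() delegates to the best estimator, refit on all of X_train
y_pred = search.predict(X_test)
```

Comparing best_score_ (cross-validated) with the accuracy on the held-out test set is a quick way to see whether the selected settings generalize.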

Finally, we evaluate the performance of the model using the score() method, which returns the accuracy on the training and testing data. We also generate a per-class classification report for the testing data using the classification_report() function.
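
To show what these two metrics measure, here is a small hand-checkable sketch. The label lists y_true and y_hat are made up for illustration and have nothing to do with the notebook's data. For a classifier, score() with the default scoring is the same quantity as accuracy_score.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labels for four classes, two samples each
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_hat = [0, 0, 1, 2, 2, 2, 3, 0]

# Accuracy: fraction of samples where the prediction matches the label
accuracy = accuracy_score(y_true, y_hat)

# output_dict=True returns the report as nested dicts keyed by class label
report = classification_report(y_true, y_hat, output_dict=True)

print("accuracy: {:.3f}".format(accuracy))
print(classification_report(y_true, y_hat))
```

Six of the eight predictions are correct, so the accuracy is 0.75. Class 1 has a recall of 0.5 because one of its two samples was predicted as class 2, which in turn lowers the precision of class 2 to 2/3.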

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
df = pd.read_excel('RE_Data.xlsx')

In [3]:
# Drop stray columns whose names contain 'unnamed' (e.g. a saved index column)
df = df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1)

In [4]:
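# Shuffle the DataFrame rows (fixed seed for reproducibility)
# Note: later cells read from df; train_test_split also shuffles by default
df_shuff = df.sample(frac=1, random_state=42).reset_index(drop=True)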

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
# Features: every column except 'ph' and 'ph_labels'; target: 'ph_labels'
var_columns = [c for c in df.columns if c not in ['ph', 'ph_labels']]

X = df.loc[:, var_columns].values
y = df.loc[:, 'ph_labels'].values

In [7]:
# numpy (np) and train_test_split are already imported in earlier cells
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import classification_report

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the Gradient Boosting classifier model
model = GradientBoostingClassifier()

# Define the hyperparameter search space to sample from
param_dist = {
    'learning_rate': np.logspace(-4, 0, 50),
    'n_estimators': np.arange(50, 501, 50),
    'max_depth': np.arange(2, 11),
    'min_samples_split': np.arange(2, 21),
    'min_samples_leaf': np.arange(1, 11)
}

# Define the randomized search object
search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=50, cv=5, random_state=42)

# Fit the randomized search object to the training data
search.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = search.predict(X_test)

# Evaluate the model's performance
accuracy_train = search.score(X_train, y_train)
accuracy_test = search.score(X_test, y_test)
report_test = classification_report(y_test, y_pred)

print("Gradient Boosting Classifier Train Accuracy: {:.3f}".format(accuracy_train))
print("Gradient Boosting Classifier Test Accuracy: {:.3f}".format(accuracy_test))
print("Gradient Boosting Classifier Best Params:\n{}\n".format(search.best_params_))
print("Gradient Boosting Classifier Report:\n{}\n".format(report_test))


Gradient Boosting Classifier Train Accuracy: 1.000
Gradient Boosting Classifier Test Accuracy: 0.966
Gradient Boosting Classifier Best Params:
{'n_estimators': 50, 'min_samples_split': 13, 'min_samples_leaf': 4, 'max_depth': 9, 'learning_rate': 0.2682695795279725}

Gradient Boosting Classifier Report:
              precision    recall  f1-score   support

           0       0.97      0.97      0.97      3818
           1       0.95      0.96      0.95      3701
           2       0.97      0.95      0.96      3719
           3       0.97      0.98      0.98      3762

    accuracy                           0.97     15000
   macro avg       0.97      0.97      0.97     15000
weighted avg       0.97      0.97      0.97     15000


