Here we use a Decision Tree Classifier as the base estimator for RandomizedSearchCV. The search space we sample from covers the maximum depth of the tree (max_depth), the minimum number of samples required to split a node (min_samples_split), the minimum number of samples required at a leaf node (min_samples_leaf), and the criterion used for splitting ('gini' or 'entropy').

The range of values for each hyperparameter is chosen based on prior knowledge about the classifier and the problem at hand. For example, max_depth ranges from 1 to 20 because a very deep tree can memorize the training data and overfit.
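One way to check whether deep trees overfit is to compare training and cross-validated accuracy as max_depth grows. The sketch below uses scikit-learn's validation_curve on synthetic data from make_classification, which is an assumption standing in for the notebook's dataset; the names X_demo and y_demo are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class stand-in data (assumption): not the notebook's RE_Data.xlsx
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=4, random_state=0
)

depths = np.arange(1, 21)  # same max_depth range as the search space

# Rows correspond to depth values, columns to the 5 cross-validation folds
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X_demo,
    y_demo,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for depth, tr, va in zip(depths, train_mean, val_mean):
    print("max_depth={:2d}  train={:.3f}  validation={:.3f}".format(int(depth), tr, va))
```

A widening gap between the train and validation columns at larger depths is the usual sign of overfitting; where that gap opens up depends on the data.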

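We have set the number of iterations (n_iter) to 50, so the search samples 50 hyperparameter combinations from the search space. The number of cross-validation folds (cv) is set to 5, so each sampled combination is trained and scored five times, each time holding out a different fifth of the training data for validation.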
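To put these settings in perspective, the short calculation below counts the distinct combinations in the search space and the number of model fits the search performs during cross-validation. It is a sketch that mirrors the notebook's ranges with plain Python range objects instead of np.arange; the variable names are illustrative.

```python
from math import prod

# Same search space as in the notebook, written with plain ranges
param_dist = {
    "max_depth": range(1, 21),
    "min_samples_split": range(2, 11),
    "min_samples_leaf": range(1, 11),
    "criterion": ["gini", "entropy"],
}
n_iter, cv = 50, 5

# Number of distinct hyperparameter combinations in the search space
n_combinations = prod(len(values) for values in param_dist.values())

# Each sampled combination is fitted once per fold
n_cv_fits = n_iter * cv

# Fraction of the search space that the random search actually visits
coverage = n_iter / n_combinations

print("combinations in search space:", n_combinations)
print("cross-validation fits:", n_cv_fits)
print("share of search space sampled: {:.2%}".format(coverage))
```

With the default refit=True, RandomizedSearchCV performs one additional fit of the best combination on the full training set after the search.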

Finally, we fit the RandomizedSearchCV object to the training data, predict on the test set, and report the train and test accuracy, classification report, and best hyperparameters found during the search.

Note that these hyperparameters and ranges reflect empirical evidence and prior knowledge about the Decision Tree Classifier. Suitable choices may vary depending on the specific problem and dataset.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
df=pd.read_excel('RE_Data.xlsx')

In [3]:
df = df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1)

In [4]:
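# Note: this line only aliases df; no shuffling happens here.
# train_test_split (used below) shuffles the rows by default.
df_shuff = df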

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
var_columns = [c for c in df if c not in ['ph','ph_labels']]

X = df.loc[:,var_columns].values
y = df.loc[:,'ph_labels'].values

In [7]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import classification_report
import numpy as np

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the Decision Tree Classifier model
model = DecisionTreeClassifier()

# Define the hyperparameter search space to sample from
param_dist = {
    'max_depth': np.arange(1, 21),
    'min_samples_split': np.arange(2, 11),
    'min_samples_leaf': np.arange(1, 11),
    'criterion': ['gini', 'entropy']
}

# Define the randomized search object
search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=50, cv=5, random_state=42)

# Fit the randomized search object to the training data
search.fit(X_train, y_train)

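# Predict on the held-out test set with the best estimator found by the search
y_pred = search.predict(X_test)

# Accuracy of the best estimator on the training and test sets
accuracy_train = search.score(X_train, y_train)
accuracy_test = search.score(X_test, y_test)

# Per-class precision, recall, and F1 on the test set
report_test = classification_report(y_test, y_pred)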

print("Decision Tree Classifier train accuracy: {:.3f}".format(accuracy_train))
print("Decision Tree Classifier test accuracy: {:.3f}".format(accuracy_test))
print("Decision Tree Classifier test report:\n{}\n".format(report_test))
print("Decision Tree Classifier best params:\n{}\n".format(search.best_params_))


Decision Tree Classifier train accuracy: 0.988
Decision Tree Classifier test accuracy: 0.961
Decision Tree Classifier test report:
              precision    recall  f1-score   support

           0       0.96      0.98      0.97      3818
           1       0.95      0.94      0.95      3701
           2       0.96      0.95      0.95      3719
           3       0.97      0.98      0.97      3762

    accuracy                           0.96     15000
   macro avg       0.96      0.96      0.96     15000
weighted avg       0.96      0.96      0.96     15000


Decision Tree Classifier best params:
{'min_samples_split': 7, 'min_samples_leaf': 1, 'max_depth': 20, 'criterion': 'entropy'}
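The best max_depth reported above (20) sits at the upper edge of the searched range, so it can be useful to look beyond best_params_ at how the other sampled candidates scored. The sketch below shows one way to do that with the fitted search object's cv_results_ and best_score_ attributes. It runs on synthetic data from make_classification and uses n_iter=10 to stay fast; both are assumptions for illustration, and names such as demo_search are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class stand-in data (assumption): not the notebook's RE_Data.xlsx
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

param_dist = {
    "max_depth": np.arange(1, 21),
    "min_samples_split": np.arange(2, 11),
    "min_samples_leaf": np.arange(1, 11),
    "criterion": ["gini", "entropy"],
}

demo_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=10,  # smaller than the notebook's 50 to keep the demo fast
    cv=5,
    random_state=42,
)
demo_search.fit(X_tr, y_tr)

# cv_results_ holds one entry per sampled candidate
results = demo_search.cv_results_
top = np.argsort(results["rank_test_score"])[:5]
for i in top:
    print(
        "rank={}  mean CV accuracy={:.3f} (+/- {:.3f})  params={}".format(
            results["rank_test_score"][i],
            results["mean_test_score"][i],
            results["std_test_score"][i],
            results["params"][i],
        )
    )

# best_score_ is the mean cross-validated score of the best candidate
print("Best mean CV accuracy: {:.3f}".format(demo_search.best_score_))
print("Held-out test accuracy: {:.3f}".format(demo_search.score(X_te, y_te)))
```

If several top candidates cluster at the largest max_depth values, widening the range is a reasonable next experiment. The mean CV accuracy is also a fairer point of comparison with the test accuracy than the training accuracy is.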

