# Import Libraries

In [1]:
import pandas as pd
from xgboost import XGBClassifier
import sklearn
from sklearn.metrics import accuracy_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV, GridSearchCV
import warnings
import random
import numpy as np
from pickle import dump

In [2]:
# Disable specific FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Disable specific UserWarnings
warnings.simplefilter(action='ignore', category=UserWarning)

# Ignore FitFailedWarning from scikit-learn
warnings.filterwarnings("ignore", category=sklearn.exceptions.FitFailedWarning)

# Import Data

In [3]:
X_train = pd.read_csv("/workspaces/alfonsoMG_boosting/data/processed/diabetes_X_train.csv")
X_test = pd.read_csv("/workspaces/alfonsoMG_boosting/data/processed/diabetes_X_test.csv")
y_train = pd.read_csv("/workspaces/alfonsoMG_boosting/data/processed/diabetes_y_train.csv")
y_test = pd.read_csv("/workspaces/alfonsoMG_boosting/data/processed/diabetes_y_test.csv")

# Model Training

In the context of training machine learning models in a GitHub Codespace with Python, the idea is to experiment with and evaluate two model boosting approaches, specifically XGBoost and Gradient Boost, both designed for addressing classification problems. The strategy involves conducting the training of both models and then performing a detailed analysis of the results to determine which one exhibits the most effective or favorable performance.

The purpose of this comparison is to identify which of the two model boosting methods better suits the specific nature of the classification problem being addressed. By delving into this comparative evaluation, the goal is to understand the strengths and weaknesses of each approach to make informed decisions regarding the selection of the most suitable model for the given case. This evaluation process will provide a solid foundation for making implementation and parameter-tuning decisions that lead to optimal performance in the classification task.

### **XGBoost Classifier**

In [4]:
# Create an XGBoost classifier with a specified random state
model_x = XGBClassifier(random_state=24)

# Train the model on the training set
model_x.fit(X_train, y_train)

# Make predictions on the training set and calculate the accuracy
y_pred_train = model_x.predict(X_train)
train_score = accuracy_score(y_train, y_pred_train)
print(f'The accuracy score for Train is {train_score}.')

# Make predictions on the testing set and calculate the accuracy
y_pred_test = model_x.predict(X_test)
test_score = accuracy_score(y_test, y_pred_test)
print(f'The accuracy score for Test is {test_score}.')

# Calculate and print the difference in accuracy between the training and testing sets
difference = train_score - test_score
print(f'The accuracy difference between the models is {difference}.')

# Save the trained model to a file
dump(model_x, open("/workspaces/alfonsoMG_boosting/models/x_boosting_default_model.pk", "wb"))


The accuracy score for Train is 1.0.
The accuracy score for Test is 0.7142857142857143.
The accuracy difference between the models is 0.2857142857142857.


### **Gardient Boost Classifier**

In [5]:
# Create a Gradient Boosting classifier with specified parameters
model = GradientBoostingClassifier(n_estimators=10, random_state=24)

# Train the model on the training set
model.fit(X_train, y_train)

# Make predictions on the training set and calculate the accuracy
y_pred_train = model.predict(X_train)
train_score = accuracy_score(y_train, y_pred_train)
print(f'The accuracy score for Train is {train_score}.')

# Make predictions on the testing set and calculate the accuracy
y_pred_test = model.predict(X_test)
test_score = accuracy_score(y_test, y_pred_test)
print(f'The accuracy score for Test is {test_score}.')

# Calculate and print the difference in accuracy between the training and testing sets
difference = train_score - test_score
print(f'The accuracy difference between the models is {difference}.')

# Save the trained model to a file
dump(model, open("/workspaces/alfonsoMG_boosting/models/gradient_boosting_default_model.pk", "wb"))


The accuracy score for Train is 0.7947882736156352.
The accuracy score for Test is 0.7727272727272727.
The accuracy difference between the models is 0.022061000888362492.


Following the implementation of both classification models, our analysis revealed that XGBoost, while producing promising results on the training dataset, faced a substantial challenge of overfitting. In contrast, the Gradient Boosting option, although achieving a relatively lower accuracy, exhibited a minor issue with overfitting.

Considering these findings, we have made the strategic decision to continue with the Gradient Boosting model for the next phase, which involves hyperparameter optimization. Our objective during this optimization process is twofold: to improve the overall accuracy of the model and to further minimize the gap between the performance on the training and test datasets.

By choosing the Gradient Boosting model and focusing on hyperparameter tuning, we aim to strike a balance that maximizes accuracy while mitigating overfitting concerns. This approach aligns with our goal of achieving a more robust and generalizable model for the given classification task.

# Optimization

### **Gardient Boost Classifier**

In [6]:
# Define hyperparameter grid for Gradient Boosting Classifier
hyper = {
    'n_estimators': [50, 100, 150],         # Number of trees in the ensemble
    'learning_rate': [0.05, 0.1, 0.2],      # Learning rate (step of each tree)
    'max_depth': [None] + [3, 10, 15],      # Maximum depth of each tree
    'min_samples_split': [2, 4, 6],         # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 5, 15],          # Minimum number of samples required to be a leaf
    'subsample': [0.8, 1.0],                # Fraction of samples used to fit each tree
    'max_features': [None, 'sqrt', 'log2'], 
    'random_state': [24]
}

# Create the Gradient Boosting Classifier model
gb_classifier = GradientBoostingClassifier()

# Perform hyperparameter tuning with cross-validation
grid = GridSearchCV(estimator=gb_classifier, param_grid=hyper, cv=5)
grid.fit(X_train, y_train)  
best_hyper = grid.best_params_

# Create the optimized model using the best hyperparameters
model_opt = GradientBoostingClassifier(**best_hyper)
model_opt.fit(X_train, y_train)

# Make predictions on the training set and calculate the accuracy
y_pred_train = model_opt.predict(X_train)
train_score_opt = accuracy_score(y_train, y_pred_train)
print(f"The accuracy score for Train is {train_score_opt}.")

# Make predictions on the testing set and calculate the accuracy
y_pred_test = model_opt.predict(X_test)
test_score_opt = accuracy_score(y_test, y_pred_test)
print(f"The accuracy score for Test is {test_score_opt}.")

# Calculate and print the overfitting
overfitting = train_score_opt - test_score_opt
print(f"The resulting overfitting is {overfitting} points.")

# Save the optimized model to a file
dump(model_opt, open("/workspaces/alfonsoMG_boosting/models/gradient_boosting_opt_model.pk", "wb"))

The accuracy score for Train is 1.0.
The accuracy score for Test is 0.7467532467532467.
The resulting overfitting is 0.2532467532467533 points.


### **XGBoost Classifier**

Let's delve deeper into the rationale and methodology behind choosing RandomSearch over GridSearch for the optimization and hyperparameter tuning of XGBoost.

Hyperparameter tuning is a critical aspect of machine learning model development, influencing the model's performance and generalization to unseen data. GridSearch is a conventional method that exhaustively searches through a predefined set of hyperparameter values, evaluating the model's performance for each combination. While GridSearch is systematic, it can be computationally expensive, especially when dealing with a large search space.

In contrast, RandomSearch takes a more randomized approach by sampling hyperparameter values from a specified distribution. This method offers several advantages. First, it allows exploration of a broader range of hyperparameter values, which may include those outside the scope of a grid search. This is particularly beneficial when the optimal values are not easily predictable or fall within a specific grid.

Secondly, RandomSearch can be more computationally efficient. Since it randomly selects hyperparameter values, it has the potential to converge to an optimal solution more quickly, especially when dealing with a large search space. This can be advantageous in scenarios where computational resources are limited.

By opting for RandomSearch, we aim to strike a balance between thorough exploration of hyperparameter space and computational efficiency. We will define a wide range for each hyperparameter, allowing for a comprehensive search. The iteration process will involve randomly selecting combinations of hyperparameter values, training the XGBoost model, and evaluating its performance. The best-performing configuration will be retained as the final model, providing an efficient and effective means of hyperparameter tuning for XGBoost.

In [7]:
# Set a random seed for reproducibility
random.seed(24)

# Define hyperparameters for XGBoost
hyperparameters_x = {
    "n_estimators": [50, 100, 150, 200],
    "objective": ["reg:squarederror", "reg:logistic", "binary:logistic"],
    "subsample": np.arange(0, 1, 0.1),
    "max_depth": [None] + list(np.arange(1, 25)),
    "gamma": [0, 0.1, 0.2],
    "min_child_weight": [1, 3, 5],
    "colsample_by_level": [0.8, 1.0],
    "grow_policy": ["depthwise", "lossguide"],
    "n_jobs": [-1],
    "lambda": np.arange(1, 10),
    "eta": np.arange(0, 1, 0.1),
    "random_state": [24]
}

# Perform random search with different random states
boost_model = XGBClassifier(random_state=24)
num_iterations = 10
results = []

for i in range(num_iterations):
    random_state = np.random.randint(1, 100)
    random_search = RandomizedSearchCV(boost_model, hyperparameters_x, n_iter=100, cv=5, scoring="accuracy", random_state=random_state)
    random_search.fit(X_train, y_train)
    
    best_hyper = random_search.best_params_
    print(f"Iteration {i + 1} - Best hyperparameters: {best_hyper}")

    results.append((best_hyper, random_search.best_score_))

# Select the best set of hyperparameters based on the average cross-validation performance
best_hyper, _ = max(results, key=lambda x: x[1])
print(f"Best hyperparameters: {best_hyper}")

# Train the final model with the best hyperparameters
final_model = XGBClassifier(**best_hyper)
final_model.fit(X_train, y_train)

# Evaluate on the test set
y_pred_train = final_model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_pred_train)
print(f"Accuracy score for Train: {train_accuracy}")

y_pred_test = final_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred_test)
print(f"Accuracy score for Test: {test_accuracy}")

# Calculate and display the difference between the training and test sets
difference = train_accuracy - test_accuracy
print(f"Train/Test difference: {difference}")

# Save the optimized model to a file
dump(final_model, open("/workspaces/alfonsoMG_boosting/models/x_boosting_optimized_model.pk", "wb"))

Iteration 1 - Best hyperparameters: {'subsample': 0.7000000000000001, 'random_state': 24, 'objective': 'binary:logistic', 'n_jobs': -1, 'n_estimators': 100, 'min_child_weight': 1, 'max_depth': 22, 'lambda': 9, 'grow_policy': 'depthwise', 'gamma': 0.1, 'eta': 0.2, 'colsample_by_level': 1.0}
Iteration 2 - Best hyperparameters: {'subsample': 0.7000000000000001, 'random_state': 24, 'objective': 'reg:squarederror', 'n_jobs': -1, 'n_estimators': 50, 'min_child_weight': 5, 'max_depth': 4, 'lambda': 9, 'grow_policy': 'depthwise', 'gamma': 0.2, 'eta': 0.1, 'colsample_by_level': 0.8}
Iteration 3 - Best hyperparameters: {'subsample': 0.6000000000000001, 'random_state': 24, 'objective': 'binary:logistic', 'n_jobs': -1, 'n_estimators': 50, 'min_child_weight': 1, 'max_depth': 10, 'lambda': 8, 'grow_policy': 'lossguide', 'gamma': 0.1, 'eta': 0.2, 'colsample_by_level': 1.0}
Iteration 4 - Best hyperparameters: {'subsample': 0.9, 'random_state': 24, 'objective': 'reg:logistic', 'n_jobs': -1, 'n_estimato

# Conclusion


After implementing both boosting models (XGBoost Classifier and Gradient Boost Classifier), several conclusions can be drawn. Firstly, it is evident that the Gradient Boosting model from sklearn exhibits the least overfitting, despite having the lowest accuracy. The difference between the training and test accuracies is minimal in this case. Although XGBoosting achieves the highest accuracy, its significant overfitting discourages its selection in favor of the other model. Upon optimization, the accuracy of XGBoosting substantially increases, but the disparity with the test accuracy is further exacerbated, leading to greater overfitting.

For these reasons, the decision has been made to retain the Gradient Boosting model as the most effective option. While it may not present optimal conditions, it demonstrates the least amount of overfitting compared to the alternatives.