# <center>**Build XGB Model**</center>  
**Author**: Shirshak Aryal  
**Last Updated**: 18 July 2025

---
**Purpose:** This notebook is dedicated to training and evaluating an XGBoost regression model for `pGI50` prediction. It covers loading pre-split data, optimizing hyperparameters using Optuna, training the final model with optimal parameters, and comprehensively evaluating its performance on unseen test data.

---

## 1. Setup Notebook
This section initializes the notebook environment by importing all necessary libraries, configuring system settings for performance, and defining global parameters and file paths.

### 1.1. Configure Environment
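This section sets environment variables that control how many CPU threads the numerical backends (OpenMP, MKL, OpenBLAS, NumExpr) may use, here 16 each, which can significantly affect the performance of libraries like XGBoost. They are set before the numerical libraries are imported because these backends typically read them once at load time.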

In [1]:
# General CPU Usage Optimization
import os

os.environ["OMP_NUM_THREADS"] = "16"
os.environ["MKL_NUM_THREADS"] = "16"
os.environ["OPENBLAS_NUM_THREADS"] = "16"
os.environ["NUMEXPR_NUM_THREADS"] = "16"
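
The value `16` above is hard-coded for one machine. A minimal sketch of deriving it from the host instead, assuming the same four variables are the ones that matter; `pick_thread_count` and its `reserve` parameter are hypothetical names, not part of the notebook:

```python
import os


def pick_thread_count(reserve: int = 0) -> str:
    """Return a thread count (as a string) based on the cores the OS reports.

    `reserve` cores are left free for other work; at least one thread is kept.
    """
    total = os.cpu_count() or 1  # os.cpu_count() can return None
    return str(max(1, total - reserve))


n_threads = pick_thread_count()
for var in (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
):
    os.environ[var] = n_threads
```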

### 1.2. Import Libraries
All required Python libraries for data manipulation, machine learning model building (XGBoost), hyperparameter optimization (Optuna), model evaluation (scikit-learn metrics), and utility functions are imported here.

In [3]:
# Standard Library Imports
from pathlib import Path
import subprocess  # For getting Git commit ID

# Core Data Science Libraries
import numpy as np
import pandas as pd

# Machine Learning Libraries
import joblib  # For saving/loading models
import optuna  # For hyperparameter optimization
import optuna.integration  # For Optuna's integrations (e.g., LightGBM, XGBoost callbacks)
from sklearn.metrics import mean_squared_error, r2_score  # For model evaluation metrics
import xgboost as xgb  # The XGBoost model library

# Conditional import for progress bars (tqdm)
tqdm_notebook_available = False  # Initialize flag
try:
    from tqdm.notebook import tqdm

    tqdm.pandas()  # Enable tqdm for pandas apply method
    tqdm_notebook_available = True
    print("tqdm.notebook found and enabled for pandas.")
except ImportError:
    print("tqdm.notebook not found. Install with 'pip install tqdm'.")

tqdm.notebook found and enabled for pandas.


### 1.3. Set Final Model Save Location

In [4]:
xgb_models_base_dir = Path("../models/xgb")
xgb_models_base_dir.mkdir(parents=True, exist_ok=True)
print(f"The final XGBoost model will be saved in: {xgb_models_base_dir}")

The final XGBoost model will be saved in: ..\models\xgb


## 2. Load Data Splits
This section loads the pre-split datasets (training, validation, and test sets, each with a feature table and a target table) that were prepared in the previous notebook, `02_Split_Features.ipynb`.

In [5]:
splits_dir = Path("../data/splits")
print(f"\nLoading data splits from {splits_dir}...")

try:
    X_train = pd.read_parquet(splits_dir / "X_train.parquet")
    X_val = pd.read_parquet(splits_dir / "X_val.parquet")
    X_test = pd.read_parquet(splits_dir / "X_test.parquet")
    
    y_train = pd.read_parquet(splits_dir / "y_train.parquet")
    y_val = pd.read_parquet(splits_dir / "y_val.parquet")
    y_test = pd.read_parquet(splits_dir / "y_test.parquet")
    print("Data splits loaded successfully.")
except FileNotFoundError:
    print(f"Error: One or more split files not found in '{splits_dir}'.")
    print("Please ensure you have run '02_Split_Features.ipynb' to generate and save the splits.")
    raise  # Stop here: the cells below depend on the variables defined above

print(f"X_train shape: {X_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"X_test shape: {X_test.shape}")

print(f"y_train shape: {y_train.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"y_test shape: {y_test.shape}")

# Display first few rows to verify data
print("\nFirst 5 rows of X_train:")
display(X_train.head())

print("\nFirst 5 rows of y_train:")
display(y_train.head())


Loading data splits from ..\data\splits...
Data splits loaded successfully.
X_train shape: (13119, 2268)
X_val shape: (2812, 2268)
X_test shape: (2812, 2268)
y_train shape: (13119, 1)
y_val shape: (2812, 1)
y_test shape: (2812, 1)

First 5 rows of X_train:


| | molregno | canonical_smiles | num_activities | MaxAbsEStateIndex | MaxEStateIndex | MinAbsEStateIndex | MinEStateIndex | qed | SPS | MolWt | ... | morgan_fp_2038 | morgan_fp_2039 | morgan_fp_2040 | morgan_fp_2041 | morgan_fp_2042 | morgan_fp_2043 | morgan_fp_2044 | morgan_fp_2045 | morgan_fp_2046 | morgan_fp_2047 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2307646 | COc1cccc2c1OCc1c-2nc2cnc3ccccc3c2c1C | 6 | 6.033142 | 6.033142 | 0.494176 | 0.494176 | 0.476742 | 12.56 | 328.371 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2081122 | COc1cc(/C(C#N)=C/c2ccc3c(c2)OCCO3)cc(OC)c1OC | 9 | 9.645791 | 9.645791 | 0.459195 | 0.459195 | 0.604738 | 12.923077 | 353.374 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2199496 | COC(=O)[C@@H]1CCCN1Cc1ccc(-c2ncc(-c3ccc(OCC=C(... | 6 | 11.953178 | 11.953178 | 0.169552 | -0.173158 | 0.359463 | 15.909091 | 447.535 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2221960 | O=C(/C=C/c1cccn(C/C=C/c2ccccc2Br)c1=O)NO | 4 | 12.253458 | 12.253458 | 0.216419 | -0.686457 | 0.479732 | 11.217391 | 375.222 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2879093 | Cc1cc(C2c3c(-c4cccc5[nH]c(=O)oc45)n[nH]c3C(=O)... | 2 | 14.128489 | 14.128489 | 0.124437 | -3.116139 | 0.437556 | 16.121212 | 472.879 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |



First 5 rows of y_train:


| | pGI50 |
|---|---|
| 14387 | 5.734742 |
| 12543 | 7.164746 |
| 12810 | 4.928428 |
| 13172 | 6.882724 |
| 18712 | 6.094208 |
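
As a quick sanity check on the shapes printed above, the row counts imply roughly a 70/15/15 train/validation/test split. The sketch below only redoes that arithmetic; the `split_sizes` dictionary is typed in by hand from the output, not read from the files:

```python
# Row counts copied from the shapes printed above.
split_sizes = {"train": 13119, "val": 2812, "test": 2812}

total_rows = sum(split_sizes.values())
fractions = {name: n / total_rows for name, n in split_sizes.items()}

for name, frac in fractions.items():
    print(f"{name}: {split_sizes[name]} rows ({frac:.1%})")
print(f"total: {total_rows} rows")
```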


## 3. Prepare Data for XGBoost
This section performs final data preparation steps specifically required for the XGBoost model, including dropping identifier columns and converting data splits to NumPy arrays.

In [6]:
# Drop identifier columns which are not features for the model
print("\nPreparing X for XGBoost training (dropping identifiers)...")
X_train_xgb = X_train.drop(columns=['molregno', 'canonical_smiles'], errors='ignore')
X_val_xgb = X_val.drop(columns=['molregno', 'canonical_smiles'], errors='ignore')
X_test_xgb = X_test.drop(columns=['molregno', 'canonical_smiles'], errors='ignore')

print(f"X_train_xgb shape (numerical features only): {X_train_xgb.shape}")
print(f"X_val_xgb shape (numerical features only): {X_val_xgb.shape}")
print(f"X_test_xgb shape (numerical features only): {X_test_xgb.shape}")

display(X_train_xgb.head())
display(y_train.head())

print("Converting data to numpy arrays...")
# XGBoost typically works well with NumPy arrays
X_train_xgb = X_train_xgb.values.astype(np.float32)
y_train = y_train.values.astype(np.float32)

X_val_xgb = X_val_xgb.values.astype(np.float32)
y_val = y_val.values.astype(np.float32)

X_test_xgb = X_test_xgb.values.astype(np.float32)
y_test = y_test.values.astype(np.float32)

print(f"X_train_xgb type after conversion: {type(X_train_xgb)}")
print(f"X_train_xgb dtype: {X_train_xgb.dtype}")
print("\nData preparation for XGBoost complete. Ready for model definition and training.")


Preparing X for XGBoost training (dropping identifiers)...
X_train_xgb shape (numerical features only): (13119, 2266)
X_val_xgb shape (numerical features only): (2812, 2266)
X_test_xgb shape (numerical features only): (2812, 2266)


| | num_activities | MaxAbsEStateIndex | MaxEStateIndex | MinAbsEStateIndex | MinEStateIndex | qed | SPS | MolWt | HeavyAtomMolWt | ExactMolWt | ... | morgan_fp_2038 | morgan_fp_2039 | morgan_fp_2040 | morgan_fp_2041 | morgan_fp_2042 | morgan_fp_2043 | morgan_fp_2044 | morgan_fp_2045 | morgan_fp_2046 | morgan_fp_2047 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 6.033142 | 6.033142 | 0.494176 | 0.494176 | 0.476742 | 12.56 | 328.371 | 312.243 | 328.121178 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 9 | 9.645791 | 9.645791 | 0.459195 | 0.459195 | 0.604738 | 12.923077 | 353.374 | 334.222 | 353.126323 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 6 | 11.953178 | 11.953178 | 0.169552 | -0.173158 | 0.359463 | 15.909091 | 447.535 | 418.303 | 447.215806 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 12.253458 | 12.253458 | 0.216419 | -0.686457 | 0.479732 | 11.217391 | 375.222 | 360.102 | 374.026604 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2 | 14.128489 | 14.128489 | 0.124437 | -3.116139 | 0.437556 | 16.121212 | 472.879 | 453.727 | 472.111375 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


| | pGI50 |
|---|---|
| 14387 | 5.734742 |
| 12543 | 7.164746 |
| 12810 | 4.928428 |
| 13172 | 6.882724 |
| 18712 | 6.094208 |


Converting data to numpy arrays...
X_train_xgb type after conversion: <class 'numpy.ndarray'>
X_train_xgb dtype: float32

Data preparation for XGBoost complete. Ready for model definition and training.
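
One detail of the conversion above: `y_train`, `y_val`, and `y_test` keep their single-column shape `(n, 1)`, whereas `predict` returns a 1-D array of shape `(n,)`. The scikit-learn metrics used later cope with that, but hand-written residuals would silently broadcast to an `(n, n)` matrix. A small sketch with toy arrays (not the notebook's data) shows the effect and the `ravel()` fix:

```python
import numpy as np

# Shape (3, 1), like y_val after `.values`
y_true_2d = np.array([[5.0], [6.0], [7.0]], dtype=np.float32)
# Shape (3,), like the output of model.predict(...)
y_pred_1d = np.array([5.5, 6.0, 6.5], dtype=np.float32)

# Broadcasting (3, 1) against (3,) yields a (3, 3) matrix, not 3 residuals.
wrong_residuals = y_true_2d - y_pred_1d

# Flattening the target first gives the intended element-wise difference.
residuals = y_true_2d.ravel() - y_pred_1d

print(wrong_residuals.shape, residuals.shape)
```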


## 4. Optimize Hyperparameters
This section uses Optuna to search for the XGBoost hyperparameters that minimize prediction error (RMSE) on the validation set.

### 4.1. Define Optuna Objective Function
The Optuna objective function is defined here. It trains an XGBoost model with the hyperparameters suggested for a trial, using early stopping on the validation set, and returns the validation RMSE, which Optuna minimizes. The validation R2 score is stored as a trial attribute for later inspection.

In [7]:
def objective(trial):
    # Suggest these hyperparameters to Optuna.
    learning_rate = trial.suggest_float("learning_rate", 0.001, 0.3, log=True)
    max_depth = trial.suggest_int("max_depth", 3, 12) # Max depth of a tree
    subsample = trial.suggest_float("subsample", 0.6, 1.0) # Subsample ratio of the training instance
    colsample_bytree = trial.suggest_float("colsample_bytree", 0.6, 1.0) # Subsample ratio of columns when constructing each tree
    reg_alpha = trial.suggest_float("reg_alpha", 1e-4, 1.0, log=True) # L1 regularization term
    reg_lambda = trial.suggest_float("reg_lambda", 1e-4, 1.0, log=True) # L2 regularization term
    gamma = trial.suggest_float("gamma", 1e-4, 1.0, log=True) # Minimum loss reduction required to make a further partition on a leaf node
    n_estimators = trial.suggest_int("n_estimators", 500, 3000) # Number of boosting rounds (trees)
    min_child_weight = trial.suggest_int("min_child_weight", 1, 10) # Minimum sum of instance weight needed in a child

    # Initialize model with suggested hyperparams
    model = xgb.XGBRegressor(
        objective='reg:squarederror',  # Objective function (minimizes squared error)
        eval_metric='rmse',  # Evaluation metric to be monitored during training (used for early stopping)
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        max_depth=max_depth,
        subsample=subsample,
        colsample_bytree=colsample_bytree,
        reg_alpha=reg_alpha,
        reg_lambda=reg_lambda,
        gamma=gamma,
        min_child_weight=min_child_weight,
        random_state=42,  # Seed for reproducibility of model training (tree building)
        tree_method='hist', # Use histogram-based algorithm for faster training
        device='cuda:0',  # Train on the first CUDA GPU (set device='cpu' on machines without one)
        early_stopping_rounds=50, # Stop if validation metric doesn't improve for 50 rounds
    )

    # Train model on training data, evaluating on validation set for early stopping
    try:
        model.fit(
            X_train_xgb, y_train,
            eval_set=[(X_val_xgb, y_val)], # Validation set monitored for early stopping
            verbose=False,
        )
    except xgb.core.XGBoostError as e:
        # Handle cases where specific hyperparameter combinations might lead to training errors
        print(f"XGBoost training error for trial {trial.number}: {e}")
        trial.set_user_attr("exception_type", "XGBoostError")
        trial.set_user_attr("exception_message", str(e))
        return float('inf') # Return a very high value (infinity) to Optuna to discourage this problematic trial

    # Make predictions on validation set using the best booster from early stopping
    y_pred_val = model.predict(X_val_xgb)
    
    # Calculate RMSE and R2 score on validation set
    rmse = float(np.sqrt(mean_squared_error(y_val, y_pred_val)))
    r2 = float(r2_score(y_val, y_pred_val))
    
    trial.set_user_attr("r2_score", r2)  # Store R2 score as a user attribute for later inspection

    return rmse  # Optuna minimizes this RMSE value to find the best trial
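
Several of the ranges above use `log=True`, which asks Optuna to sample in log space, so each order of magnitude of `learning_rate` between 0.001 and 0.3 gets comparable attention. The sketch below imitates that idea with the standard library only; `sample_log_uniform` is a hypothetical helper for illustration and is not how Optuna's sampler actually proposes values:

```python
import math
import random


def sample_log_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniformly distributed on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
samples = [sample_log_uniform(0.001, 0.3, rng) for _ in range(10_000)]

# Under log-uniform sampling, about 40% of draws fall below 0.01,
# because log(0.01 / 0.001) / log(0.3 / 0.001) is roughly 0.40.
share_below_001 = sum(s < 0.01 for s in samples) / len(samples)
print(f"share of samples below 0.01: {share_below_001:.2f}")
```

A plain uniform draw on the same interval would put only about 3% of samples below 0.01, which is why small learning rates would rarely be explored without the log scale.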

### 4.2. Run Optuna Study
An Optuna study is created and executed to perform the hyperparameter optimization, iterating through trials to find the best combination of parameters.

In [None]:
optuna.logging.set_verbosity(optuna.logging.INFO) # Set Optuna logging level

print("Optuna logging verbosity set to INFO.")
# Define the path for the Optuna study database storage
study_dir = Path("../studies/xgboost_study")
study_dir.mkdir(parents=True, exist_ok=True)

study_db_path = f"sqlite:///{study_dir / 'xgb_optuna_study.db'}"
study_name = "xgboost_regression_pGI50"
print(f"Optuna study will be stored at: {study_db_path}")

# Check if a study with the same name already exists in the database
# If it does, load it to resume the optimization
try:
    study = optuna.load_study(study_name=study_name, storage=study_db_path)
    print(f"Loaded existing study '{study_name}' from {study_db_path}. Resuming optimization.")
except KeyError:
    # If the study does not exist, create a new one
    print(f"Creating new study '{study_name}' at {study_db_path}.")
    study = optuna.create_study(
        study_name=study_name,
        direction="minimize", # Minimize the RMSE
        storage=study_db_path,
    )

print("\nStarting Optuna optimization...")

# Run up to 300 trials or for 2 hours (7200 seconds)
study.optimize(objective, n_trials=300, timeout=7200, show_progress_bar=True)

print("\nOptuna optimization finished.")

# Print best trial results from the study
print("\n--- Best Trial Results ---")
print(f"Best trial number: {study.best_trial.number}")
print(f"Best RMSE (Validation): {study.best_value:.4f}")
print("Best hyperparameters:")
for key, value in study.best_params.items():
    print(f"  {key}: {value}")

# Access and print the R2 score stored as a user attribute for the best trial
if "r2_score" in study.best_trial.user_attrs:
    print(f"Best R2 Score (Validation): {study.best_trial.user_attrs['r2_score']:.4f}")

Optuna logging verbosity set to INFO.
Optuna study will be stored at: sqlite:///..\studies\xgboost_study\xgb_optuna_study.db
Loaded existing study 'xgboost_regression_pGI50' from sqlite:///..\studies\xgboost_study\xgb_optuna_study.db. Resuming optimization.

Starting Optuna optimization...


  0%|          | 0/300 [00:00<?, ?it/s]

Potential solutions:
- Use a data structure that matches the device ordinal in the booster.
- Set the device for booster before call to inplace_predict.


  return func(**kwargs)


[I 2025-07-15 23:20:09,394] Trial 1063 finished with value: 0.7694555719797417 and parameters: {'learning_rate': 0.0037184251773346028, 'max_depth': 6, 'subsample': 0.9982931829052706, 'colsample_bytree': 0.6673293221697834, 'reg_alpha': 0.00010284858978825288, 'reg_lambda': 0.000757183626390342, 'gamma': 0.9225937105407367, 'n_estimators': 744, 'min_child_weight': 4}. Best is trial 1056 with value: 0.7633740559033817.
[I 2025-07-15 23:20:31,069] Trial 1064 finished with value: 0.7661969131325894 and parameters: {'learning_rate': 0.003811779089832076, 'max_depth': 6, 'subsample': 0.9980310030228269, 'colsample_bytree': 0.6661423256948277, 'reg_alpha': 0.00010053570082067321, 'reg_lambda': 0.0007550071794087357, 'gamma': 0.9949509709163977, 'n_estimators': 774, 'min_child_weight': 4}. Best is trial 1056 with value: 0.7633740559033817.
[I 2025-07-15 23:20:52,646] Trial 1065 finished with value: 0.7681481183936383 and parameters: {'learning_rate': 0.0036942161729721598, 'max_depth': 6, 's

KeyboardInterrupt: 

## 5. Train Final Model
This section trains the final XGBoost model using the best hyperparameters identified by Optuna and saves it for future use.

### 5.1. Reinitialize Model with Best Hyperparameters
The XGBoost model is reinitialized using the optimal hyperparameters found during the Optuna study.

In [18]:
# Re-load the study so that the latest best parameters are used
study_dir = Path("../studies/xgboost_study")
study_db_path = f"sqlite:///{study_dir / 'xgb_optuna_study.db'}"
study_name = "xgboost_regression_pGI50"

try:
    study = optuna.load_study(study_name=study_name, storage=study_db_path)
    print("Best trial parameters (XGBoost):", study.best_trial.params)
    best_params = dict(study.best_trial.params)  # Copy so the additions below do not alter the trial record
except KeyError:
    print("Study does not exist. Please make sure that the previous Optuna study cell has been run.")
    raise  # best_params is undefined without a study, so stop here

# Add fixed parameters that were not part of Optuna's search but are required for the model
best_params['objective'] = 'reg:squarederror' # Objective function for regression
best_params['eval_metric'] = 'rmse' # Evaluation metric
best_params['random_state'] = 42 # For reproducibility of the final model
best_params['tree_method'] = 'hist' # Histogram-based method for efficiency
best_params['device'] = 'cuda:0' # Set device for final training ('cuda:0' for GPU, 'cpu' for CPU)

# Initialize the final XGBoost model with the best parameters found by Optuna
final_xgb_model = xgb.XGBRegressor(**best_params)
print("Final model has been initialized with best parameters. Ready for training.")

Best trial parameters (XGBoost): {'learning_rate': 0.00820193368271431, 'max_depth': 6, 'subsample': 0.9897893799354487, 'colsample_bytree': 0.7223075745984062, 'reg_alpha': 0.00012824377407451583, 'reg_lambda': 0.00397611471104642, 'gamma': 0.6775917319015243, 'n_estimators': 1148, 'min_child_weight': 9}
Final model has been initialized with best parameters. Ready for training.
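
The cell above adds the fixed settings to `best_params` key by key. An equivalent way to build the final parameter set, sketched below with a hypothetical `merge_params` helper and a hand-copied subset of the tuned values, keeps the tuned and fixed settings separate and fails loudly if a fixed key would overwrite a tuned one:

```python
def merge_params(tuned: dict, fixed: dict) -> dict:
    """Combine tuned and fixed settings into a new dict, refusing silent overwrites."""
    overlap = set(tuned) & set(fixed)
    if overlap:
        raise ValueError(f"Parameters defined twice: {sorted(overlap)}")
    return {**tuned, **fixed}


# Subset of the best-trial values printed above (rounded for brevity).
tuned = {"learning_rate": 0.0082, "max_depth": 6, "n_estimators": 1148}
fixed = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "random_state": 42,
    "tree_method": "hist",
    "device": "cuda:0",
}

final_params = merge_params(tuned, fixed)
print(final_params)
```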


### 5.2. Get Current Git Commit ID
The current Git commit ID (hash) is programmatically retrieved. This commit ID will be incorporated into the final model's filename to ensure direct traceability and reproducibility.

In [8]:
def get_git_commit_hash():
    try:
        # Get the short commit hash
        commit_hash = subprocess.check_output(['git', 'rev-parse', '--short', 'HEAD']).strip().decode('ascii')
        return commit_hash
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown_commit"

In [9]:
# Optionally, see the current commit ID
current_commit = get_git_commit_hash()
print(f"Current Git Commit ID: {current_commit}")

Current Git Commit ID: a9632d5
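
A commit hash identifies the saved model only if the working tree matched that commit when the model was trained, which is easy to violate while a notebook is being edited. A hedged variant of the function above, using the hypothetical name `get_git_revision`, appends a `-dirty` suffix when tracked files have uncommitted changes:

```python
import subprocess


def get_git_revision(default: str = "unknown_commit") -> str:
    """Return the short commit hash, with '-dirty' appended if tracked files have changes."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], stderr=subprocess.DEVNULL
        ).decode("ascii").strip()
        # --porcelain prints one line per changed file; untracked files are ignored here.
        status = subprocess.check_output(
            ["git", "status", "--porcelain", "--untracked-files=no"],
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not a Git repository, or Git is not installed.
        return default
    return f"{commit}-dirty" if status.strip() else commit


print(f"Current Git revision: {get_git_revision()}")
```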


### 5.3. Train and Save Model
The final XGBoost model is trained on the combined training and validation datasets and then saved locally with a filename that includes the Git commit ID.

In [21]:
print("\nTraining final XGBoost model on the full training set (combined original training data + original validation data)...")

# Concatenate the original training and validation data for final model training
X_train_xgb_final = np.concatenate([X_train_xgb, X_val_xgb]).astype(np.float32)
y_train_xgb_final = np.concatenate([y_train, y_val]).astype(np.float32)

# Train the model
# No early stopping here as training is done on combined set (hence no separate validation set).
final_xgb_model.fit(X_train_xgb_final, y_train_xgb_final, verbose=False)
print("Final model training complete. Ready for test set evaluation.")

# Construct filename including current git commit ID
current_commit_hash = get_git_commit_hash()
model_filename = Path(f'final_best_xgboost_model_{current_commit_hash}.joblib')
print(f"Associated Git Commit ID for saved model: {current_commit_hash}")

# Directory to save the model
save_dir = Path('../models/xgb')
save_dir.mkdir(parents=True, exist_ok=True)

full_path = save_dir / model_filename

print(f"\nSaving the final XGBoost model to: {full_path}...")
try:
    # Use joblib to save the trained XGBoost model
    joblib.dump(final_xgb_model, full_path)
    print("Model saved successfully!")
except Exception as e:
    print(f"Error saving model: {e}")


Training final XGBoost model on the full training set (combined original training data + original validation data)...
Final model training complete. Ready for test set evaluation.
Associated Git Commit ID for saved model: 271d2f4

Saving the final XGBoost model to: ..\models\xgb\final_best_xgboost_model_271d2f4.joblib...
Model saved successfully!
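
The filename records the commit, but not the hyperparameters or data the model was trained with. One option, sketched here as an assumption rather than something the notebook does, is to write a small JSON sidecar next to the `.joblib` file. The `save_run_metadata` helper, the `.meta.json` suffix, and the field names are all hypothetical:

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_run_metadata(model_path: Path, params: dict, commit: str, n_train_rows: int) -> Path:
    """Write a JSON file describing a saved model next to the model file."""
    meta_path = model_path.with_suffix(".meta.json")
    metadata = {
        "model_file": model_path.name,
        "git_commit": commit,
        "n_train_rows": n_train_rows,
        "params": params,
        "saved_at_utc": datetime.now(timezone.utc).isoformat(),
    }
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return meta_path


# Demonstration in a temporary directory; values mirror the notebook's output.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "final_best_xgboost_model_271d2f4.joblib"
    meta_path = save_run_metadata(
        model_path,
        params={"max_depth": 6, "n_estimators": 1148},
        commit="271d2f4",
        n_train_rows=13119 + 2812,  # training + validation rows
    )
    loaded = json.loads(meta_path.read_text(encoding="utf-8"))
    print(loaded["model_file"], loaded["n_train_rows"])
```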


## 6. Evaluate Model
This section performs a final, unbiased evaluation of the trained XGBoost model's performance on the previously unseen test dataset.

In [23]:
# Load the final xgboost model
print(f"Loading final saved model from '{model_filename}' for final test evaluation...")
path_to_saved_model = xgb_models_base_dir / model_filename
final_xgb_model = joblib.load(path_to_saved_model)

print("Making predictions and evaluating on the test set...")

# Keep the model on the GPU it was trained on. X_test_xgb is a CPU NumPy array,
# so XGBoost may warn about a device mismatch; use device='cpu' to avoid the warning.
final_xgb_model.set_params(device='cuda:0')
y_pred_test_xgb = final_xgb_model.predict(X_test_xgb)

# Calculate final RMSE and R2 score on the test set
rmse_test_xgb = np.sqrt(mean_squared_error(y_test, y_pred_test_xgb))
r2_test_xgb = r2_score(y_test, y_pred_test_xgb)

print("\n--- Final XGBoost Model Performance on Test Set ---")
print(f"Test RMSE: {rmse_test_xgb:.4f}")
print(f"Test R2 Score: {r2_test_xgb:.4f}")

# For comparison, print the best validation RMSE from the Optuna study
print(f"Compared with Best Validation RMSE from Optuna Study: {study.best_value:.4f}")

Loading final saved model from 'final_best_xgboost_model_271d2f4.joblib' for final test evaluation...
Making predictions and evaluating on the test set...


Potential solutions:
- Use a data structure that matches the device ordinal in the booster.
- Set the device for booster before call to inplace_predict.


  return func(**kwargs)



--- Final XGBoost Model Performance on Test Set ---
Test RMSE: 0.6955
Test R2 Score: 0.4953
Compared with Best Validation RMSE from Optuna Study: 0.6995
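
For reference, the two metrics reported above can be written out directly: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. A minimal NumPy sketch on toy numbers (not the notebook's data; `rmse_and_r2` is a hypothetical helper):

```python
import numpy as np


def rmse_and_r2(y_true, y_pred):
    """Compute RMSE and the coefficient of determination for 1-D inputs."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    ss_res = float(np.sum(residuals ** 2))                   # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))    # total sum of squares
    return rmse, 1.0 - ss_res / ss_tot


rmse, r2 = rmse_and_r2([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(f"RMSE: {rmse:.2f}, R2: {r2:.2f}")  # RMSE: 0.50, R2: 0.80
```

Read this way, a test R2 of 0.4953 means the model accounts for roughly half of the variance of `pGI50` in the test set.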
