<a href="https://colab.research.google.com/github/MZiaAfzal71/Edge-Aware-GNN/blob/main/Models/RF_and_XGBoost_for_ESOL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Descriptor-Based Machine Learning Models for ESOL

This notebook presents a **systematic descriptor-based modeling pipeline** for predicting
aqueous solubility on the **ESOL (Delaney) dataset**, using classical machine learning models.

The workflow is organized into three main stages:

---

## 1️⃣ Hyperparameter Optimization via Repeated Cross-Validation

- Hyperparameter tuning is performed using **Optuna**
- A **5×5 repeated cross-validation** strategy is employed to ensure robust model selection
- Each candidate configuration is evaluated using identical data splits
- The optimization objective is based on cross-validated predictive performance

This step is conducted **once per model**, and the resulting best hyperparameters
are fixed for all subsequent experiments.
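
To make the "identical data splits" point concrete, here is a minimal plain-Python sketch of repeated k-fold splitting driven by a fixed seed. The helper `repeated_kfold_indices` is hypothetical and written only for illustration; the notebook itself relies on scikit-learn's `RepeatedKFold`, whose internal shuffling differs from this sketch.

```python
import random

def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=5, seed=42):
    """Return a list of (train_idx, test_idx) pairs, n_splits pairs per repeat."""
    rng = random.Random(seed)  # one seeded generator drives every shuffle
    splits = []
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh permutation for each repeat
        # Slice the permutation into n_splits nearly equal test folds
        folds = [order[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            test_idx = sorted(folds[k])
            train_idx = sorted(
                i for j, fold in enumerate(folds) if j != k for i in fold
            )
            splits.append((train_idx, test_idx))
    return splits

# Two calls with the same seed give exactly the same 25 splits,
# so every candidate configuration is scored on the same partitions.
splits_a = repeated_kfold_indices(n_samples=20)
splits_b = repeated_kfold_indices(n_samples=20)

print(len(splits_a))         # 25
print(splits_a == splits_b)  # True
```

Because the splits depend only on the seed, differences in the cross-validated score between two trials reflect the hyperparameters rather than the partitioning.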

---

## 2️⃣ Model Evaluation with Fixed Hyperparameters

- Using the optimized hyperparameters, each model is re-evaluated using the same
  **5×5 repeated cross-validation** scheme
- This provides an estimate of model performance and stability; because the same splits were also used for tuning, the estimate can be slightly optimistic
- Mean and standard deviation of performance metrics are reported across all folds and repeats
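
As a rough illustration of the reporting step, the sketch below aggregates a flat list of fold scores into an overall mean and standard deviation, plus one mean per repeat. The `summarize` helper and the score values are placeholders invented for this example (they are not results from this notebook), and the use of the sample standard deviation is an assumption.

```python
import statistics

def summarize(scores, n_splits=5):
    """Overall mean and sample std of the scores, plus one mean per repeat."""
    per_repeat = [
        statistics.mean(scores[i:i + n_splits])
        for i in range(0, len(scores), n_splits)
    ]
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample std (ddof=1), an assumption
        "per_repeat_mean": per_repeat,
    }

# Placeholder values only, NOT results from this notebook: 2 repeats x 5 folds
rmse_scores = [0.60, 0.70, 0.65, 0.55, 0.75, 0.62, 0.68, 0.66, 0.58, 0.71]

summary = summarize(rmse_scores)
print(f"RMSE = {summary['mean']:.3f} ± {summary['std']:.3f}")
print("per-repeat means:", [round(m, 3) for m in summary["per_repeat_mean"]])
```

With the 5×5 scheme the same function would receive 25 scores ordered repeat by repeat, which is the order in which scikit-learn's `RepeatedKFold` yields its splits.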

---

## 3️⃣ Scaffold-Based Ensemble Evaluation

- The ESOL dataset includes a pre-defined **Bemis–Murcko scaffold split**
- A dedicated column (`BM-Scaffold`) specifies **Train / Validation / Test** assignments
- Using this split:
  - Five independent models are trained with different random seeds
  - An **ensemble of five models** is constructed by averaging predictions
- This setup evaluates model generalization under a chemically realistic scaffold split
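
The averaging step can be sketched in a few lines of plain Python. Everything here is illustrative: the helper names are hypothetical, and the numbers are toy values rather than ESOL predictions. Each inner list stands for the held-out predictions of one model trained with a different seed.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

def average_predictions(per_model_preds):
    """Element-wise mean over models; each inner list is one model's output."""
    n_models = len(per_model_preds)
    return [sum(col) / n_models for col in zip(*per_model_preds)]

# Toy numbers for illustration only (not ESOL values)
y_test = [-2.0, -3.5, -1.0]
per_model_preds = [
    [-2.2, -3.0, -1.1],  # e.g. model trained with seed 42
    [-1.8, -3.6, -0.7],  # seed 43
    [-2.3, -3.9, -1.2],  # seed 44
]

ensemble_pred = average_predictions(per_model_preds)
individual = [rmse(y_test, p) for p in per_model_preds]

print("per-model RMSE:", [round(v, 3) for v in individual])
print("ensemble RMSE: ", round(rmse(y_test, ensemble_pred), 3))
```

The RMSE of the averaged prediction can never exceed the mean of the per-model RMSEs (a consequence of the triangle inequality), which is one reason seed ensembles are a cheap way to stabilize results.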

---

## Molecular Representation

- Each molecule is represented exclusively by **RDKit-computed molecular descriptors**
- Out of 217 total RDKit descriptors:
  - **198 descriptors with non-zero variance** are retained
  - Constant descriptors are removed prior to modeling
- Descriptor features are normalized using statistics computed from training data only
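
The two preprocessing ideas above (dropping constant descriptors, then normalizing with training-set statistics only) can be sketched without any libraries. The matrix and helper names below are made up for illustration; in practice pandas and scikit-learn's `StandardScaler` would do the same job. Tree-based models such as the two used here are largely insensitive to feature scaling, so the scaling part matters mainly when the same features feed other model types.

```python
import statistics

def nonconstant_columns(rows):
    """Indices of columns whose variance is non-zero."""
    return [j for j, col in enumerate(zip(*rows)) if statistics.pvariance(col) > 0]

def fit_scaler(train_rows):
    """Per-column mean and std computed from the training rows only."""
    cols = list(zip(*train_rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard against zero std
    return means, stds

def transform(rows, means, stds):
    """Standardize rows with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Toy descriptor matrix: the middle column is constant and gets dropped
X_toy = [
    [1.0, 5.0, 10.0],
    [2.0, 5.0, 30.0],
    [3.0, 5.0, 20.0],
    [4.0, 5.0, 40.0],
]
keep = nonconstant_columns(X_toy)
X_toy = [[row[j] for j in keep] for row in X_toy]

X_train, X_test = X_toy[:3], X_toy[3:]
means, stds = fit_scaler(X_train)           # statistics from training rows only
X_train_s = transform(X_train, means, stds)
X_test_s = transform(X_test, means, stds)   # test rows reuse training statistics

print(keep)  # [0, 2]
```

Fitting the scaler inside each training fold (for example through a scikit-learn `Pipeline`) keeps held-out information out of the normalization statistics.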

---

## Models Covered

- **Random Forest Regressor**
- **XGBoost Regressor**

All experiments are conducted with an emphasis on:
- **Reproducibility**
- **Fair model comparison**
- **Chemically meaningful evaluation protocols**

This notebook complements graph-based modeling approaches by providing
strong descriptor-based baselines for aqueous solubility prediction.


In [None]:
# 1️⃣ Fetch data
!git clone https://github.com/MZiaAfzal71/Edge-Aware-GNN.git

In [None]:
# 2️⃣ Change current/working directory
%cd Edge-Aware-GNN/ESOL\ Dataset

In [None]:
# 3️⃣ Install RDKit, Optuna, and Plotly
!pip install rdkit
!pip install -U optuna
!pip install plotly

In [None]:
# 4️⃣ Imports
import numpy as np
import pandas as pd
from pathlib import Path
from tqdm.auto import tqdm
import random
import copy
import os
import optuna

from rdkit import Chem
from rdkit.Chem import Descriptors

from sklearn.model_selection import RepeatedKFold, cross_val_score, cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# import matplotlib.pyplot as plt
# import seaborn as sns
# import warnings
# warnings.filterwarnings("ignore")

In [None]:
# 5️⃣ Define dataset paths and initialize repeated k-fold cross-validation configuration

file_path = "delaney-processed-scaffold.csv"
smiles_col = "smiles"
target_col = "measured log solubility in mols per litre"

n_splits = 5
n_repeats = 5
random_state = 42

CV = RepeatedKFold(
    n_splits=n_splits,
    n_repeats=n_repeats,
    random_state=random_state
)


In [None]:
# 6️⃣ Optuna objective function for tuning Random Forest hyperparameters using cross-validated RMSE

def rf_objective(trial, X, y):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 300, 1000),
        "max_depth": trial.suggest_int("max_depth", 8, 40),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 10),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 5),
        "max_features": trial.suggest_float("max_features", 0.3, 0.8),
        "bootstrap": True,
        "random_state": 42,
        "n_jobs": -1,
    }

    model = RandomForestRegressor(**params)

    rmse = -cross_val_score(
        model,
        X,
        y,
        cv=CV,
        scoring="neg_root_mean_squared_error",
        n_jobs=-1
    ).mean()

    return rmse


In [None]:
# 7️⃣ Optuna objective function for tuning XGBoost model hyperparameters using cross-validated RMSE

def xgb_objective(trial, X, y):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 400, 1200),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.1, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 7),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 0.9),
        "gamma": trial.suggest_float("gamma", 0.0, 0.5),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 0.2),
        "reg_lambda": trial.suggest_float("reg_lambda", 1.0, 10.0),
        "random_state": 42,
        "n_jobs": -1,
        "tree_method": "hist",
    }

    model = XGBRegressor(**params)

    rmse = -cross_val_score(
        model,
        X,
        y,
        cv=CV,
        scoring="neg_root_mean_squared_error",
        n_jobs=-1
    ).mean()

    return rmse


In [None]:
# 8️⃣ Compute all available RDKit descriptors for a list of SMILES strings

def rdkit_descriptors_from_smiles(smiles):
    """
    Parameters
    ----------
    smiles : iterable of str
        SMILES strings, one per molecule.

    Returns
    -------
    pd.DataFrame
        One row per molecule, with descriptor names as columns.
        Rows for invalid SMILES are filled with NaN.
    """
    # Descriptor names and the functions that compute them
    desc_list = Descriptors.descList
    desc_names = [name for name, _ in desc_list]

    rdkit_descs = []
    for sm in tqdm(smiles, total=len(smiles)):
        mol = Chem.MolFromSmiles(sm)
        if mol is None:
            # Invalid SMILES: keep a NaN row so rows stay aligned with the input
            rdkit_descs.append([np.nan] * len(desc_names))
            continue

        sm_descs = []
        for _, func in desc_list:
            try:
                sm_descs.append(func(mol))
            except Exception:
                # A descriptor that fails on this molecule is recorded as missing
                sm_descs.append(np.nan)

        rdkit_descs.append(sm_descs)

    return pd.DataFrame(rdkit_descs, columns=desc_names)



In [None]:
# 9️⃣ Load dataset, compute RDKit descriptors from SMILES, remove zero-variance features, and prepare input matrix

tqdm.pandas()

df = pd.read_csv(file_path)

y = df[target_col]

# y_mean = y.mean()
# y_std  = y.std()

# y_scaled = (y - y_mean) / y_std

df_data = df[[smiles_col, target_col]]# pd.concat([df[smiles_col], y], axis=1)

rdkit_descriptors = rdkit_descriptors_from_smiles(df[smiles_col])
variance_df = rdkit_descriptors.var()
zero_var_columns = variance_df[variance_df == 0].index.tolist()
cleaned_rdkit_descs = rdkit_descriptors.drop(columns = zero_var_columns)
X = cleaned_rdkit_descs.values


In [None]:
# 🔟 Run Optuna hyperparameter optimization for Random Forest, report best results, and save trial history
# Runtime note: on Colab (2 cores) this cell took 4 h 19 min to complete 37 of the 50 trials,
# with about 1 h 30 min still needed for the remaining 13. On Kaggle (4 cores) the full
# search finished in approximately 3 h 15 min.

rf_study = optuna.create_study(
    direction="minimize",
    study_name="RF_ESOL",
    storage="sqlite:///rf_esol_optuna.db",
    load_if_exists=True
)

rf_study.optimize(
    lambda trial: rf_objective(trial, X, y),
    n_trials=50,
    show_progress_bar=True
)

print("Best RF RMSE:", rf_study.best_value)
print("Best RF Params:")
for k, v in rf_study.best_params.items():
    print(f"  {k}: {v}")

rf_df_trials = rf_study.trials_dataframe()
rf_df_trials.to_csv("rf_study_optuna.csv", index=False)

In [None]:
# 1️⃣1️⃣ Run Optuna hyperparameter optimization for XGBoost, report best results, and save trial history

xgb_study = optuna.create_study(
    direction="minimize",
    study_name="XGB_ESOL",
    storage="sqlite:///xgb_esol_optuna.db",
    load_if_exists=True
)

xgb_study.optimize(
    lambda trial: xgb_objective(trial, X, y),
    n_trials=75,
    show_progress_bar=True
)

print("Best XGB RMSE:", xgb_study.best_value)
print("Best XGB Params:")
for k, v in xgb_study.best_params.items():
    print(f"  {k}: {v}")

xgb_df_trials = xgb_study.trials_dataframe()
xgb_df_trials.to_csv("xgb_study_optuna.csv", index=False)


In [None]:
# 1️⃣2️⃣ Evaluate multiple regression models using cross-validation and return R², RMSE, and MAE metrics

def evaluate_models(X, y, models, cv=CV):
    """
    models: dict name -> sklearn-style estimator
    returns: dict name -> dict of per-fold metric arrays (train and validation)
    """
    results = {}
    scoring = ['r2','neg_root_mean_squared_error','neg_mean_absolute_error']
    for name, model in models.items():
        # cross_validate
        scores = cross_validate(model, X, y, cv=cv, scoring=scoring, return_train_score=True, n_jobs=-1)
        # convert negatives back for RMSE and MAE
        train_r2s = scores['train_r2']
        train_rmses = -scores['train_neg_root_mean_squared_error']
        train_maes = -scores['train_neg_mean_absolute_error']
        val_r2s = scores['test_r2']
        val_rmses = -scores['test_neg_root_mean_squared_error']
        val_maes = -scores['test_neg_mean_absolute_error']
        results[name] = {
            'best_train_rmse': train_rmses, 'best_train_r2': train_r2s, 'best_train_mae': train_maes,
            'best_val_rmse': val_rmses, 'best_val_r2': val_r2s, 'best_val_mae': val_maes
        }
    return results
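
For reference, the three metrics collected by `evaluate_models` can be written out by hand. The following is a plain-Python sketch with toy numbers (the helper names are illustrative, not part of scikit-learn). It also shows that RMSE and MAE are symmetric in their arguments while R² is not, so the true values must be passed first when calling `r2_score(y_true, y_pred)`.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # uses y_true only
    return 1 - ss_res / ss_tot

# Toy values for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 2.0, 3.0, 3.0]

print(round(rmse(y_true, y_pred), 4), round(rmse(y_pred, y_true), 4))  # same value twice
print(round(r2(y_true, y_pred), 3))  # 0.6
print(round(r2(y_pred, y_true), 3))  # -1.0 (arguments swapped)
```

The asymmetry comes from the denominator, which measures the spread of whichever sequence is passed as `y_true`.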


In [None]:
# 1️⃣3️⃣ Define optimized RF and XGBoost models, evaluate them via cross-validation, and save fold-wise performance metrics

rf_best_params = {
        "n_estimators": 893,
        "max_depth": 34,
        "min_samples_split": 3,
        "min_samples_leaf": 1,
        "max_features": 0.3299064510871738,
        "bootstrap": True,
        "random_state": 42,
        "n_jobs": -1,
    }

xgb_best_params = {
        "n_estimators": 1195,
        "learning_rate": 0.04937618493560799,
        "max_depth": 3,
        "min_child_weight": 1,
        "subsample": 0.6251344316154072,
        "colsample_bytree": 0.5938164957956602,
        "gamma": 0.0032080746355426033,
        "reg_alpha": 0.05691988236631272,
        "reg_lambda": 6.9129621061574875,
        "random_state": 42,
        "n_jobs": -1,
        "tree_method": "hist",
    }

models = {"RF" : RandomForestRegressor(**rf_best_params),
          "XGB" : XGBRegressor(**xgb_best_params)}

results = evaluate_models(X, y, models, cv=CV)

rf_df = pd.DataFrame(results["RF"])
fold_ind = list(range(1, 6))*5
rf_df.insert(loc=0, column="fold", value=fold_ind)
fold_ind.sort()
rf_df.insert(loc=0, column="repeat", value=fold_ind)

xgb_df = pd.DataFrame(results["XGB"])
fold_ind = list(range(1, 6))*5
xgb_df.insert(loc=0, column="fold", value=fold_ind)
fold_ind.sort()
xgb_df.insert(loc=0, column="repeat", value=fold_ind)

rf_df.to_csv("Folds results Random Forest.csv", index=False)
xgb_df.to_csv("Folds results XGBoost.csv", index=False)

In [None]:
split_col = df['BM-Scaffold']

In [None]:
X[split_col[split_col != "Train"].index].shape

In [None]:
# 1️⃣4️⃣ Train several RF or XGBoost models with different seeds on the scaffold split and report train/held-out performance metrics

def train_ensemble_scaffold(
    X,
    y,
    split_col,
    best_params,
    model_name,
    num_models=10,
    seed_start=42
):
    # Rows labelled "Train" are used for fitting; all remaining rows
    # (every non-"Train" label) form the held-out set.
    train_ind = split_col[split_col == "Train"].index
    val_ind = split_col[split_col != "Train"].index

    train_X = X[train_ind]
    train_y = y[train_ind]

    val_X = X[val_ind]
    val_y = y[val_ind]

    results = {
        'best_train_rmse': [], 'best_train_r2': [], 'best_train_mae': [],
        'best_val_rmse': [], 'best_val_r2': [], 'best_val_mae': []
    }

    for i in range(num_models):
        print(f"\n===== Ensemble model-{model_name} {i+1}/{num_models} =====")
        seed = seed_start + i

        # Work on a copy so the caller's parameter dict is not mutated
        params = {**best_params, "random_state": seed}

        if model_name == "RF":
            model = RandomForestRegressor(**params)
        else:
            model = XGBRegressor(**params)

        model.fit(train_X, train_y)

        train_pred = model.predict(train_X)
        val_pred = model.predict(val_X)

        # scikit-learn metrics expect (y_true, y_pred); the order matters for R²
        train_rmse = np.sqrt(mean_squared_error(train_y, train_pred))
        train_r2 = r2_score(train_y, train_pred)
        train_mae = mean_absolute_error(train_y, train_pred)

        val_rmse = np.sqrt(mean_squared_error(val_y, val_pred))
        val_r2 = r2_score(val_y, val_pred)
        val_mae = mean_absolute_error(val_y, val_pred)

        results['best_train_rmse'].append(train_rmse)
        results['best_train_r2'].append(train_r2)
        results['best_train_mae'].append(train_mae)

        results['best_val_rmse'].append(val_rmse)
        results['best_val_r2'].append(val_r2)
        results['best_val_mae'].append(val_mae)

        print(f"Held-out RMSE: {val_rmse} | R2 score: {val_r2} | MAE: {val_mae}")

    return results


In [None]:
# 1️⃣5️⃣ Train RF and XGBoost ensemble models using scaffold-based splitting and save their performance results


rf_results = train_ensemble_scaffold(X, y, df['BM-Scaffold'], rf_best_params, "RF")

rf_df = pd.DataFrame(rf_results)

ensemble_ind = list(range(1, 11))
rf_df.insert(loc=0, column="Ensemble", value=ensemble_ind)

rf_df.to_csv("Ensemble results Random Forest Scaffold.csv", index=False)

xgb_results = train_ensemble_scaffold(X, y, df['BM-Scaffold'], xgb_best_params, "XGB")

xgb_df = pd.DataFrame(xgb_results)

ensemble_ind = list(range(1, 11))
xgb_df.insert(loc=0, column="Ensemble", value=ensemble_ind)

xgb_df.to_csv("Ensemble results XGBoost Scaffold.csv", index=False)
