# Benchmark Before Feature Engineering

This notebook benchmarks three regressors on twelve molecular property targets (electronic and thermodynamic) **before** any feature engineering is applied. The goal is to establish baseline performance metrics against which models trained on engineered features can be compared.

## Workflow Summary
1. **Data Loading & Preparation**: Load molecular dataset and prepare features/targets
2. **Data Splitting**: Stratified train/test split based on molecular size
3. **Preprocessing Pipeline**: Standardization and variance filtering
4. **Model Selection & Tuning**: Randomized hyperparameter search (3-fold CV, up to 20 sampled configurations) for each regressor
5. **Evaluation**: Performance assessment using RMSE, MAE, and R²
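
For reference, the three metrics in step 5 follow directly from their definitions. The sketch below computes them with NumPy on made-up numbers. The helper name `regression_metrics` is illustrative and not part of the notebook, which uses the scikit-learn metric functions instead.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R2 for two equally sized 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))          # root mean squared error
    mae = np.mean(np.abs(residuals))                 # mean absolute error
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination
    return {'RMSE': rmse, 'MAE': mae, 'R2': r2}

# Toy values (not from the dataset): one prediction is off by 2
metrics = regression_metrics([1, 2, 3, 4], [1, 2, 3, 6])
print(metrics)
```

On these toy values RMSE is 1.0 while MAE is 0.5: RMSE weights the single large error more heavily, which is one reason to report both.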

## Regressor Selection Rationale

| **Model** | **Why Chosen** | **Key Advantages** | **Best For** |
|-----------|----------------|-------------------|--------------|
| **Histogram-based Gradient Boosting (HGBR)** | Excellent performance on tabular data with mixed feature types | • Handles missing values natively<br>• Fast training on large datasets<br>• Built-in categorical feature support<br>• Robust to outliers | Complex non-linear relationships in molecular properties |
| **Random Forest (RF)** | Proven ensemble method with strong baseline performance | • Handles feature interactions well<br>• Provides feature importance metrics<br>• Robust to overfitting<br>• Works well with high-dimensional data | Capturing complex feature interactions without extensive tuning |
| **Ridge Regression** | Linear baseline with regularization | • Fast training and prediction<br>• Interpretable coefficients<br>• Handles multicollinearity<br>• Good baseline for linear relationships | Establishing linear baseline and identifying targets with linear trends |


### Excluded Models
- **SVR**: Training cost scales poorly with dataset size, which makes hyperparameter tuning expensive
- **Neural Networks**: Reserved for separate analysis in dedicated notebooks
- **XGBoost**: HGBR is a comparable histogram-based gradient boosting implementation that ships with scikit-learn and is expected to train faster here

This combination ensures we capture different types of relationships in the data while maintaining computational efficiency.

## Key Preprocessing Steps
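- **Standardization**: Needed for Ridge, whose L2 penalty depends on feature scale; the tree-based models are insensitive to it
- **Variance Filtering**: Drops features with variance at or below 1e-5. Because the filter runs after standardization, only features that are constant in the training data are removed in practice
- **Stratified Splitting**: Keeps the distribution of molecular sizes similar in the train and test sets
- **Log Transformation**: `log1p` is applied to the training targets `mu` and `r2` because of their skewed distributions; predictions are mapped back with `expm1` before scoring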
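
The order of scaling and variance filtering matters. The sketch below uses a small synthetic matrix (an assumption for illustration, not the molecular dataset) with one informative, one constant, and one near-constant column, and compares which columns survive under each ordering.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.normal(size=n),                # informative feature, variance ~1
    np.full(n, 3.0),                   # exactly constant feature
    1.0 + 1e-4 * rng.normal(size=n),   # near-constant feature, variance ~1e-8
])

# Same two steps, opposite order
scale_then_filter = Pipeline([
    ('scale', StandardScaler()),
    ('var', VarianceThreshold(1e-5)),
])
filter_then_scale = Pipeline([
    ('var', VarianceThreshold(1e-5)),
    ('scale', StandardScaler()),
])

# Boolean mask of the columns each pipeline keeps
kept_scaled_first = scale_then_filter.fit(X).named_steps['var'].get_support().tolist()
kept_filter_first = filter_then_scale.fit(X).named_steps['var'].get_support().tolist()

print("scale -> filter keeps:", kept_scaled_first)
print("filter -> scale keeps:", kept_filter_first)
```

Scaling first gives every non-constant column unit variance, so the near-constant column passes the filter. Filtering on the raw values removes it. Whether that column should be kept depends on whether its small variation carries signal.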

## Expected Outcomes
This benchmark will establish baseline performance metrics that help us:
1. Identify which properties are inherently easier/harder to predict
2. Understand model-specific strengths for different property types
3. Provide a comparison baseline for feature engineering improvements
4. Guide selection of best-performing models for further optimization
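
One way to read a results table for points 1 and 4 is to pivot it and pick the lowest-RMSE model per target. The sketch below does this for three targets, using RMSE values copied from the benchmark output printed at the end of this notebook. The variable names are illustrative.

```python
import pandas as pd

# RMSE values copied from the printed benchmark table (gap, mu, alpha only)
rows = [
    ('gap', 'HGBR', 0.030234), ('gap', 'RF', 0.030340), ('gap', 'Ridge', 0.033862),
    ('mu', 'HGBR', 1.069721), ('mu', 'RF', 1.067138), ('mu', 'Ridge', 1.264406),
    ('alpha', 'HGBR', 2.730353), ('alpha', 'RF', 2.739537), ('alpha', 'Ridge', 2.726814),
]
table = pd.DataFrame(rows, columns=['Target', 'Model', 'RMSE']).set_index(['Target', 'Model'])

# One row per target, one column per model
rmse_wide = table['RMSE'].unstack('Model')

# Name of the model with the lowest RMSE for each target
best_model = rmse_wide.idxmin(axis=1)
print(best_model)
```

RMSE is only comparable within a target, because the targets have different units and scales. R² is the more suitable column for comparing difficulty across targets.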

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.svm import SVR
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from scipy.stats import uniform, randint
import warnings
warnings.filterwarnings("ignore")


In [None]:
# Load & prepare data
df = pd.read_csv('dataset_9May.csv')
smiles = df.pop('Molecule')  # Remove the SMILES column from df; keep it for reference

# Define targets
targets = ['gap','mu','alpha','homo','lumo','r2','zpve','U0','U','H','G','Cv']
features = [c for c in df.columns if c not in targets]

X = df[features].fillna(0)
Y = df[targets]

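# Train/test split stratified by molecule size.
# Atom-count columns are assumed to be named by a single uppercase letter; search only
# the feature columns so the single-letter targets ('U', 'H', 'G') are not summed in.
atoms = [c for c in features if len(c)==1 and c.isupper()]
size = df[atoms].sum(axis=1)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X, Y, test_size=0.2, random_state=42,
    stratify=pd.qcut(size, 5, duplicates='drop')  # merge bins if quantile edges coincide
)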

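# Log transform for mu and r2 - training targets only; Y_te stays in original units
Y_tr = Y_tr.copy()  # work on an explicit copy rather than a slice of df
Y_tr[['mu', 'r2']] = np.log1p(Y_tr[['mu', 'r2']])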

In [None]:
# --- Define preprocessing pipeline
pre = Pipeline([
    ('scale', StandardScaler()),
    ('var', VarianceThreshold(1e-5))
])

# --- Define regressor search spaces
param_spaces = {
    'HGBR': {
        'model__max_iter': randint(300, 800),
        'model__learning_rate': uniform(0.01, 0.15),
        'model__max_depth': randint(3, 10),
        'model__l2_regularization': uniform(0, 0.5),
        'model__max_leaf_nodes': randint(20, 100)
    },
    'RF': {
        'model__n_estimators': [100, 300, 500],
        'model__max_depth': [None, 10, 20],
        'model__min_samples_leaf': [1, 5, 10],
        'model__max_features': ['sqrt', 0.5, 0.8]
    },
    'Ridge': {
        'model__alpha': [0.01, 0.1, 1.0, 10.0, 100.0]
    }
}

# --- Define regressors
regressors = {
    'HGBR': HistGradientBoostingRegressor(random_state=42),
    'RF': RandomForestRegressor(random_state=42, n_jobs=-1),
    'Ridge': Ridge()
}

# --- Results container
results = {}

In [None]:
# --- Loop over targets and regressors
for target in targets:
    print(f"\n🔍 Target: {target}")
    results[target] = {}
    for name, model in regressors.items():
        print(f"  ➤ Tuning {name}...")
        pipe = Pipeline([
            ('pre', pre),
            ('model', model)
        ])
        # Use RandomizedSearchCV for hyperparameter tuning with less computational cost
        tuned_model = RandomizedSearchCV(
            pipe, param_spaces[name],
            scoring='neg_root_mean_squared_error',
            cv=3,
            n_iter=20,
            n_jobs=-1,
            random_state=42
        )
        # Fit the model on training data
        tuned_model.fit(X_tr, Y_tr[target])
        # Get the best pipeline
        best_pipe = tuned_model.best_estimator_
        # Predict on test data
        preds_ = best_pipe.predict(X_te)

        # Map predictions for log-transformed targets back to original units.
        # Y_te was never log-transformed, so it is already in original units.
        if target in ['mu', 'r2']:
            preds_ = np.expm1(preds_)
        true_vals = Y_te[target]

        # Calculate metrics in original units
        rmse = np.sqrt(mean_squared_error(true_vals, preds_))
        mae = mean_absolute_error(true_vals, preds_)
        r2 = r2_score(true_vals, preds_)

        results[target][name] = {
            'RMSE': rmse, 'MAE': mae, 'R2': r2
        }
        print(f"    ✓ {name}: RMSE={rmse:.4f}, MAE={mae:.4f}, R2={r2:.4f}")

# --- Convert to DataFrame
benchmark_df = pd.concat({
    t: pd.DataFrame(v).T for t, v in results.items()
}, axis=0)
benchmark_df.index.names = ['Target', 'Model']


🔍 Target: gap
  ➤ Tuning HGBR...
    ✓ HGBR: RMSE=0.0302, MAE=0.0236, R2=0.5758
  ➤ Tuning RF...
    ✓ RF: RMSE=0.0303, MAE=0.0236, R2=0.5728
  ➤ Tuning Ridge...
    ✓ Ridge: RMSE=0.0339, MAE=0.0278, R2=0.4679

🔍 Target: mu
  ➤ Tuning HGBR...
    ✓ HGBR: RMSE=1.0697, MAE=0.7335, R2=0.5186
  ➤ Tuning RF...
    ✓ RF: RMSE=1.0671, MAE=0.7332, R2=0.5209
  ➤ Tuning Ridge...
    ✓ Ridge: RMSE=1.2644, MAE=0.9210, R2=0.3274

🔍 Target: alpha
  ➤ Tuning HGBR...
    ✓ HGBR: RMSE=2.7304, MAE=2.0240, R2=0.8380
  ➤ Tuning RF...
    ✓ RF: RMSE=2.7395, MAE=2.0179, R2=0.8369
  ➤ Tuning Ridge...
    ✓ Ridge: RMSE=2.7268, MAE=2.0226, R2=0.8384

🔍 Target: homo
  ➤ Tuning HGBR...
    ✓ HGBR: RMSE=0.0147, MAE=0.0108, R2=0.6129
  ➤ Tuning RF...
    ✓ RF: RMSE=0.0146, MAE=0.0107, R2=0.6160
  ➤ Tuning Ridge...
    ✓ Ridge: RMSE=0.0160, MAE=0.0120, R2=0.5389

🔍 Target: lumo
  ➤ Tuning HGBR...
    ✓ HGBR: RMSE=0.0260, MAE=0.0200, R2=0.6714
  ➤ Tuning RF...
    ✓ RF: RMSE=0.0258, MAE=0.0197, R2=0.6764
  ➤ Tuning

In [None]:
# --- Display benchmark results
print(benchmark_df)

                    RMSE         MAE        R2
Target Model                                  
gap    HGBR     0.030234    0.023564  0.575822
       RF       0.030340    0.023555  0.572846
       Ridge    0.033862    0.027766  0.467891
mu     HGBR     1.069721    0.733469  0.518576
       RF       1.067138    0.733215  0.520898
       Ridge    1.264406    0.921028  0.327395
alpha  HGBR     2.730353    2.023988  0.837992
       RF       2.739537    2.017891  0.836901
       Ridge    2.726814    2.022601  0.838412
homo   HGBR     0.014654    0.010835  0.612929
       RF       0.014595    0.010734  0.616044
       Ridge    0.015994    0.011990  0.538877
lumo   HGBR     0.025992    0.019967  0.671368
       RF       0.025794    0.019707  0.676351
       Ridge    0.028744    0.022960  0.598068
r2     HGBR   224.146351  155.904724  0.156276
       RF     224.190535  155.928552  0.155944
       Ridge  223.726804  155.351656  0.159432
zpve   HGBR     0.017705    0.014066  0.682374
       RF    