# HOMO-LUMO Gap Predictions

### Problem Statement & Motivation

Accurately predicting quantum chemical properties like the HOMO–LUMO energy gap is essential for advancing materials science, drug discovery, and electronic design. The HOMO–LUMO gap is particularly informative for assessing molecular reactivity and stability. While Density Functional Theory (DFT) provides precise estimates, its high computational cost makes it impractical for large-scale screening of molecular libraries. This notebook explores machine learning alternatives that are fast, scalable, and interpretable, offering solutions that are accessible even on modest hardware.

### Related Work & Key Gap

Past work has shown that:

* DFT is accurate but computationally intensive
* ML models like kernel methods and GNNs show promise, but often require large models and expensive hardware

Key Gap: A need for lightweight, high-performing models that can run locally and integrate with user-friendly tools for deployment in research or education.

### Methodology & Evaluation

This notebook:

* Benchmarks a variety of 2D-based models using RDKit descriptors, Coulomb matrices, and graph neural networks (GNNs) on a 5k-molecule subset
* Progresses to a hybrid GNN architecture combining OGB-standard graphs with SMILES-derived cheminformatics features
* Achieves **MAE = 0.159 eV**
* Visualizes results using parity plots, error inspection, and predicted-vs-true comparisons
* Evaluates both random and high-error cases to better understand model behavior

| Metric   | Best Model (Hybrid GNN) |
| -------- | ----------------------- |
| **MAE**  | 0.159 eV                |
| **RMSE** | 0.234 eV                |
| **R²**   | 0.965                   |


### Deployment & Accessibility

To make the model practically useful, an **interactive web app** was developed:

**Live App**: [HOMO–LUMO Gap Predictor on Hugging Face](https://huggingface.co/spaces/MooseML/homo-lumo-gap-predictor)

Features:

* **SMILES input** for any organic molecule
* **Real-time prediction** of the HOMO–LUMO gap
* **Molecular visualization**
* Simple **CSV logging** for result tracking

GitHub Repository: [MooseML/homo-lumo-gap-models](https://github.com/MooseML/homo-lumo-gap-models)


In [1]:
# general 
import pandas as pd
import numpy as np
from tqdm import tqdm
import ace_tools_open as tools
import optuna
import optuna.visualization as vis
import pickle
import joblib
import os 

# plotting 
import matplotlib.pyplot as plt
import seaborn as sns

# TensorFlow
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Add
from tensorflow.keras.models import Model
from tensorflow.keras import regularizers

# PyTorch
import torch
import torch.nn.functional as F
from torch.nn import Linear, ReLU, Module, Sequential, Dropout
from torch.utils.data import Subset
import torch.optim as optim
# PyTorch Geometric
from torch_geometric.nn import GINEConv, global_mean_pool
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

from transformers import get_cosine_schedule_with_warmup

# OGB dataset 
from ogb.lsc import PygPCQM4Mv2Dataset, PCQM4Mv2Dataset
from ogb.utils import smiles2graph
from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder

# RDKit
# from rdkit.Chem import AllChem
from rdkit.Chem import Descriptors
from rdkit import Chem

# ChemML
from chemml.chem import Molecule, RDKitFingerprint, CoulombMatrix, tensorise_molecules
from chemml.models import MLP, NeuralGraphHidden, NeuralGraphOutput
from chemml.utils import regression_metrics

# SKlearn 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

In [2]:
print("TensorFlow version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
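# NOTE: is_built_with_gpu_support() reports build-time GPU support only;
# the device list printed below is the actual runtime check.
print("CUDA available:", tf.test.is_built_with_gpu_support())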
print(tf.config.list_physical_devices('GPU'))
# list all GPUs
gpus = tf.config.list_physical_devices('GPU')

# check compute capability if GPU available
if gpus:
    for gpu in gpus:
        details = tf.config.experimental.get_device_details(gpu)
        print(f"Device: {gpu.name}")
        print(f"Compute Capability: {details.get('compute_capability')}")
else:
    print("No GPU found.")

TensorFlow version: 2.10.0
Built with CUDA: True
CUDA available: True
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Device: /physical_device:GPU:0
Compute Capability: (8, 6)


In [3]:
# Paths - Fixed for Kaggle environment
if os.path.exists('/kaggle'):
    DATA_ROOT = '/kaggle/input/neurips-open-polymer-prediction-2025'
    CHUNK_DIR = '/kaggle/working/processed_chunks'  # Writable directory
    BACKBONE_PATH = '/kaggle/input/polymer/best_gnn_transformer_hybrid.pt'
else:
    DATA_ROOT = 'data'
    CHUNK_DIR = os.path.join(DATA_ROOT, 'processed_chunks')
    BACKBONE_PATH = 'best_gnn_transformer_hybrid.pt'

TRAIN_LMDB = os.path.join(CHUNK_DIR, 'polymer_train3d_dist.lmdb')
TEST_LMDB = os.path.join(CHUNK_DIR, 'polymer_test3d_dist.lmdb')

print(f"Data root: {DATA_ROOT}")
print(f"LMDB directory: {CHUNK_DIR}")
print(f"Train LMDB: {TRAIN_LMDB}")
print(f"Test LMDB: {TEST_LMDB}")

# Create LMDBs if they don't exist
if not os.path.exists(TRAIN_LMDB) or not os.path.exists(TEST_LMDB):
    print('Building LMDBs...')
    os.makedirs(CHUNK_DIR, exist_ok=True)
    # Run the LMDB builders
    !python build_polymer_lmdb_fixed.py train
    !python build_polymer_lmdb_fixed.py test
    print('LMDB creation complete.')
else:
    print('LMDBs already exist.')


Data root: data
LMDB directory: data\processed_chunks
Train LMDB: data\processed_chunks\polymer_train3d_dist.lmdb
Test LMDB: data\processed_chunks\polymer_test3d_dist.lmdb
LMDBs already exist.


In [4]:
# LMDB+CSV wiring 
import os, numpy as np, pandas as pd

# 1) Columns / index mapping
label_cols = ['Tg','FFV','Tc','Density','Rg']
task2idx   = {k:i for i,k in enumerate(label_cols)}

# 2) Read the training labels (CSV is only used to know which IDs have labels)
train_path = os.path.join(DATA_ROOT, 'train.csv')
train_df   = pd.read_csv(train_path)
assert {'id','SMILES'}.issubset(train_df.columns), "train.csv must have id and SMILES"
train_df['id'] = train_df['id'].astype(int)

# 3) Read the actual IDs that exist in the LMDB
def read_lmdb_ids(lmdb_path: str) -> np.ndarray:
    ids_txt = lmdb_path + ".ids.txt"
    if not os.path.exists(ids_txt):
        raise FileNotFoundError(f"Missing {ids_txt}. Rebuild LMDB or confirm paths.")
    ids = np.loadtxt(ids_txt, dtype=np.int64)
    if ids.ndim == 0:  # single id edge case
        ids = ids.reshape(1)
    return ids

lmdb_ids = read_lmdb_ids(TRAIN_LMDB)
print(f"LMDB contains {len(lmdb_ids):,} train graphs")

# 4) Helper: IDs that have a label for a given task (intersection with LMDB ids)
def ids_with_label(task: str) -> np.ndarray:
    col = task
    have_label = train_df.loc[~train_df[col].isna(), 'id'].astype(int).values
    # Only keep those that were actually written to the LMDB
    keep = np.intersect1d(have_label, lmdb_ids, assume_unique=False)
    return keep

# 5) Make a global pool split once (reused for each task)
rng = np.random.default_rng(123)
perm = rng.permutation(len(lmdb_ids))
split = int(0.9 * len(lmdb_ids))
train_pool_ids = lmdb_ids[perm[:split]]
val_pool_ids   = lmdb_ids[perm[split:]]

print(f"Global pools -> train_pool={len(train_pool_ids):,}  val_pool={len(val_pool_ids):,}")

# 6) Quick sanity: show available counts per task
for t in label_cols:
    n_task_ids = len(ids_with_label(t))
    print(f"{t:>7}: {n_task_ids:6d} rows with labels (pre-intersection with pools)")


LMDB contains 7,973 train graphs
Global pools -> train_pool=7,175  val_pool=798
     Tg:    511 rows with labels (pre-intersection with pools)
    FFV:   7030 rows with labels (pre-intersection with pools)
     Tc:    737 rows with labels (pre-intersection with pools)
Density:    613 rows with labels (pre-intersection with pools)
     Rg:    614 rows with labels (pre-intersection with pools)


FFV is the only property that looks likely to work with a simple imputation strategy. All other properties have a very high percentage of missing values. Therefore, I will impute the median for FFV, train a model for FFV, and train separate models for the other properties, filtering out the rows with missing values for each property. If this proves unsuccessful, I may explore sampling techniques or use the trained model to impute values for training a secondary model.
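
To make the missingness argument concrete, here is a minimal sketch that turns the labeled-row counts printed above into missing fractions. The counts are copied by hand from that output; the variable names are illustrative only and are not used elsewhere in the notebook.

```python
# Labeled-row counts copied from the printed output above (7,973 LMDB graphs).
n_total = 7973
n_labeled = {"Tg": 511, "FFV": 7030, "Tc": 737, "Density": 613, "Rg": 614}

# Fraction of rows with no label, per property.
missing_frac = {task: 1.0 - n / n_total for task, n in n_labeled.items()}

# Print from least to most sparse.
for task, frac in sorted(missing_frac.items(), key=lambda kv: kv[1]):
    print(f"{task:>7}: {frac:6.1%} missing")
```

FFV is missing in roughly 12% of rows, while every other property is missing in more than 90% of rows, which is why only FFV is a candidate for simple imputation.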

# Models

In [5]:
# Use the CSV only to know which rows have labels; keep 'id' here.
train_df = pd.read_csv(os.path.join(DATA_ROOT, "train.csv"))
train_df["id"] = train_df["id"].astype(int)

def build_target_df_from_ids(df: pd.DataFrame, target_col: str, keep_ids: np.ndarray):
    """
    Return DataFrame with only SMILES + target, restricted to IDs present in the LMDB
    and dropping missing targets.
    """
    out = df.loc[df["id"].isin(keep_ids), ["SMILES", target_col]].copy()
    print(f"Initial {target_col} shape:", out.shape)
    print(f"Initial {target_col} missing:\n{out.isnull().sum()}")
    out = out.dropna(subset=[target_col]).reset_index(drop=True)
    print(f"Cleaned {target_col} shape:", out.shape)
    print(f"Cleaned {target_col} missing:\n{out.isnull().sum()}\n")
    return out

# Build all five (use same LMDB id set so we only keep rows that exist in LMDB)
df_tg      = build_target_df_from_ids(train_df, "Tg",      lmdb_ids)
df_density = build_target_df_from_ids(train_df, "Density", lmdb_ids)
df_ffv     = build_target_df_from_ids(train_df, "FFV",     lmdb_ids)
df_tc      = build_target_df_from_ids(train_df, "Tc",      lmdb_ids)
df_rg      = build_target_df_from_ids(train_df, "Rg",      lmdb_ids)


Initial Tg shape: (7973, 2)
Initial Tg missing:
SMILES       0
Tg        7462
dtype: int64
Cleaned Tg shape: (511, 2)
Cleaned Tg missing:
SMILES    0
Tg        0
dtype: int64

Initial Density shape: (7973, 2)
Initial Density missing:
SMILES        0
Density    7360
dtype: int64
Cleaned Density shape: (613, 2)
Cleaned Density missing:
SMILES     0
Density    0
dtype: int64

Initial FFV shape: (7973, 2)
Initial FFV missing:
SMILES      0
FFV       943
dtype: int64
Cleaned FFV shape: (7030, 2)
Cleaned FFV missing:
SMILES    0
FFV       0
dtype: int64

Initial Tc shape: (7973, 2)
Initial Tc missing:
SMILES       0
Tc        7236
dtype: int64
Cleaned Tc shape: (737, 2)
Cleaned Tc missing:
SMILES    0
Tc        0
dtype: int64

Initial Rg shape: (7973, 2)
Initial Rg missing:
SMILES       0
Rg        7359
dtype: int64
Cleaned Rg shape: (614, 2)
Cleaned Rg missing:
SMILES    0
Rg        0
dtype: int64



In [6]:
# Morgan FP utilities (no 3D, no external descriptors) 
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
import numpy as np
from typing import Optional, Tuple
from tqdm.auto import tqdm

def smiles_to_morgan_fp(
    smi: str,
    n_bits: int = 1024,
    radius: int = 3,
    use_counts: bool = False,
) -> Optional[np.ndarray]:
    """Return a 1D numpy array Morgan fingerprint; None if SMILES invalid."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None
    if use_counts:
        fp = rdMolDescriptors.GetMorganFingerprint(mol, radius)
        # convert to dense count vector
        arr = np.zeros((n_bits,), dtype=np.int32)
        for bit_id, count in fp.GetNonzeroElements().items():
            arr[bit_id % n_bits] += count
        return arr.astype(np.float32)
    else:
        bv = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.int8)
        Chem.DataStructs.ConvertToNumpyArray(bv, arr)
        return arr.astype(np.float32)

def prepare_fp_for_target(
    df_target: pd.DataFrame,
    target_col: str,
    *,
    fp_bits: int = 1024,
    fp_radius: int = 3,
    use_counts: bool = False,
    save_csv_path: Optional[str] = None,
    show_progress: bool = True,
) -> Tuple[pd.DataFrame, np.ndarray, np.ndarray]:
    """
    Drop missing targets, compute Morgan FPs from SMILES only.
    Returns (df_clean, y, X_fp) where:
      df_clean: ['SMILES', target_col]
      y: (N,)
      X_fp: (N, fp_bits)
    """
    assert {"SMILES", target_col}.issubset(df_target.columns)

    # 1) drop missing targets (no imputation)
    work = df_target[["SMILES", target_col]].copy()
    before = len(work)
    work = work.dropna(subset=[target_col]).reset_index(drop=True)
    after = len(work)
    print(f"[{target_col}] dropped {before - after} missing; kept {after}")

    # 2) compute FPs; skip invalid SMILES
    fps, ys, keep_smiles = [], [], []
    it = work.itertuples(index=False)
    if show_progress:
        it = tqdm(it, total=len(work), desc=f"FPs for {target_col}")

    for row in it:
        smi = row.SMILES
        yv  = getattr(row, target_col)
        arr = smiles_to_morgan_fp(smi, n_bits=fp_bits, radius=fp_radius, use_counts=use_counts)
        if arr is None:
            continue
        fps.append(arr)
        ys.append(float(yv))
        keep_smiles.append(smi)

    X_fp = np.stack(fps, axis=0) if fps else np.zeros((0, fp_bits), dtype=np.float32)
    y = np.asarray(ys, dtype=float)
    df_clean = pd.DataFrame({"SMILES": keep_smiles, target_col: y})

    if save_csv_path:
        df_clean.to_csv(save_csv_path, index=False)
        print(f"[{target_col}] saved cleaned CSV -> {save_csv_path}")

    print(f"[{target_col}] X_fp: {X_fp.shape} | y: {y.shape}")
    return df_clean, y, X_fp
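
In count mode, `smiles_to_morgan_fp` folds sparse feature IDs into `n_bits` buckets with `bit_id % n_bits`, so distinct substructure IDs can land in the same bucket. The sketch below isolates that folding step with NumPy only; `fold_counts` and the toy feature IDs are hypothetical and chosen to force a collision.

```python
import numpy as np

def fold_counts(sparse_counts: dict, n_bits: int) -> np.ndarray:
    """Fold {feature_id: count} into a dense count vector of length n_bits."""
    dense = np.zeros(n_bits, dtype=np.int32)
    for feature_id, count in sparse_counts.items():
        dense[feature_id % n_bits] += count  # colliding IDs add into one bucket
    return dense

# Hypothetical feature IDs: 3 and 11 collide when n_bits = 8 (both map to 3).
toy_counts = {3: 2, 11: 1, 20: 4}
folded = fold_counts(toy_counts, n_bits=8)
print(folded)
```

Bucket 3 ends up holding the combined count of two different features, which is the information lost by folding; a larger `n_bits` makes such collisions less frequent.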


In [7]:
# Bit vectors (1024, r=3) 
df_clean_tg,      y_tg,      X_tg      = prepare_fp_for_target(df_tg,      "Tg",      fp_bits=1024, fp_radius=3, use_counts=False, save_csv_path="cleaned_tg_fp.csv")
df_clean_density, y_density, X_density = prepare_fp_for_target(df_density, "Density", fp_bits=1024, fp_radius=3, use_counts=False, save_csv_path="cleaned_density_fp.csv")
df_clean_ffv,     y_ffv,     X_ffv     = prepare_fp_for_target(df_ffv,     "FFV",     fp_bits=1024, fp_radius=3, use_counts=False, save_csv_path="cleaned_ffv_fp.csv")
df_clean_tc,      y_tc,      X_tc      = prepare_fp_for_target(df_tc,      "Tc",      fp_bits=1024, fp_radius=3, use_counts=False, save_csv_path="cleaned_tc_fp.csv")
df_clean_rg,      y_rg,      X_rg      = prepare_fp_for_target(df_rg,      "Rg",      fp_bits=1024, fp_radius=3, use_counts=False, save_csv_path="cleaned_rg_fp.csv")


[Tg] dropped 0 missing; kept 511


[Tg] saved cleaned CSV -> cleaned_tg_fp.csv
[Tg] X_fp: (511, 1024) | y: (511,)
[Density] dropped 0 missing; kept 613


[Density] saved cleaned CSV -> cleaned_density_fp.csv
[Density] X_fp: (613, 1024) | y: (613,)
[FFV] dropped 0 missing; kept 7030


[FFV] saved cleaned CSV -> cleaned_ffv_fp.csv
[FFV] X_fp: (7030, 1024) | y: (7030,)
[Tc] dropped 0 missing; kept 737


[Tc] saved cleaned CSV -> cleaned_tc_fp.csv
[Tc] X_fp: (737, 1024) | y: (737,)
[Rg] dropped 0 missing; kept 614


[Rg] saved cleaned CSV -> cleaned_rg_fp.csv
[Rg] X_fp: (614, 1024) | y: (614,)


In [8]:
from dataclasses import dataclass
from typing import Optional, Tuple
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

@dataclass
class TabularSplits:
    # unscaled (for RF)
    X_train: np.ndarray
    X_test:  np.ndarray
    y_train: np.ndarray
    y_test:  np.ndarray
    # scaled (for KRR/MLP)
    X_train_scaled: Optional[np.ndarray] = None
    X_test_scaled:  Optional[np.ndarray] = None
    y_train_scaled: Optional[np.ndarray] = None  # shape (N,1)
    y_test_scaled:  Optional[np.ndarray] = None
    x_scaler: Optional[StandardScaler] = None
    y_scaler: Optional[StandardScaler] = None

def _make_regression_stratify_bins(y: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """Return integer bins for approximate stratification in regression."""
    y = y.ravel()
    # handle degenerate case
    if np.unique(y).size < n_bins:
        n_bins = max(2, np.unique(y).size)
    quantiles = np.linspace(0, 1, n_bins + 1)
    bins = np.unique(np.quantile(y, quantiles))
    # ensure strictly increasing
    bins = np.unique(bins)
    # np.digitize expects right-open intervals by default
    strat = np.digitize(y, bins[1:-1], right=False)
    return strat

def make_tabular_splits(
    X: np.ndarray,
    y: np.ndarray,
    *,
    test_size: float = 0.2,
    random_state: int = 42,
    scale_X: bool = True,
    scale_y: bool = True,
    stratify_regression: bool = False,
    n_strat_bins: int = 10,
    # if you already decided splits (e.g., scaffold split), pass indices:
    train_idx: Optional[np.ndarray] = None,
    test_idx: Optional[np.ndarray] = None,
) -> TabularSplits:
    """
    Split and (optionally) scale tabular features/targets for a single target.
    Returns both scaled and unscaled arrays, plus fitted scalers.
    """
    y = np.asarray(y, dtype=float).ravel()
    X = np.asarray(X)

    if train_idx is not None and test_idx is not None:
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
    else:
        strat = None
        if stratify_regression:
            strat = _make_regression_stratify_bins(y, n_bins=n_strat_bins)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state, stratify=strat
        )

    # Unscaled outputs (for RF, tree models)
    splits = TabularSplits(
        X_train=X_train, X_test=X_test,
        y_train=y_train, y_test=y_test
    )

    # Scaled versions (for KRR/MLP)
    if scale_X:
        xscaler = StandardScaler()
        splits.X_train_scaled = xscaler.fit_transform(X_train)
        splits.X_test_scaled  = xscaler.transform(X_test)
        splits.x_scaler = xscaler
    if scale_y:
        yscaler = StandardScaler()
        splits.y_train_scaled = yscaler.fit_transform(y_train.reshape(-1, 1))
        splits.y_test_scaled  = yscaler.transform(y_test.reshape(-1, 1))
        splits.y_scaler = yscaler

    # Shapes summary
    print("Splits:")
    print("X_train:", splits.X_train.shape, "| X_test:", splits.X_test.shape)
    if splits.X_train_scaled is not None:
        print("X_train_scaled:", splits.X_train_scaled.shape, "| X_test_scaled:", splits.X_test_scaled.shape)
    print("y_train:", splits.y_train.shape, "| y_test:", splits.y_test.shape)
    if splits.y_train_scaled is not None:
        print("y_train_scaled:", splits.y_train_scaled.shape, "| y_test_scaled:", splits.y_test_scaled.shape)

    return splits
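
The stratified option above bins a continuous target by quantiles so that `train_test_split` can keep the target distribution similar in both splits. Here is a small standalone sketch of the same binning idea on a synthetic, skewed target. The data and names are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
y_toy = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # skewed synthetic target

n_bins = 5
# Quantile edges give bins with (nearly) equal counts even for skewed data.
edges = np.unique(np.quantile(y_toy, np.linspace(0, 1, n_bins + 1)))
# Use interior edges only, so labels run from 0 to n_bins - 1.
bin_labels = np.digitize(y_toy, edges[1:-1], right=False)

counts = np.bincount(bin_labels, minlength=n_bins)
print(counts)
```

Because the edges are quantiles, each bin holds roughly the same number of samples, so every bin has enough members to be split between train and test. Equal-width bins on a skewed target would leave the upper bins nearly empty.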

In [9]:
from typing import Dict, Any, Tuple
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib
import numpy as np
import os

def train_eval_rf(
    X: np.ndarray,
    y: np.ndarray,
    *,
    rf_params: Dict[str, Any],
    test_size: float = 0.2,
    random_state: int = 42,
    stratify_regression: bool = True,
    n_strat_bins: int = 10,
    save_dir: str = "saved_models/rf",
    tag: str = "model",
) -> Tuple[RandomForestRegressor, Dict[str, float], TabularSplits, str]:
    """
    Trains a RandomForest on unscaled features; returns (model, metrics, splits, path).
    """
    os.makedirs(save_dir, exist_ok=True)
    # Pick a safe number of bins based on dataset size
    if stratify_regression:
        adaptive_bins = min(n_strat_bins, max(3, int(np.sqrt(len(y)))))
    else:
        adaptive_bins = n_strat_bins
    splits = make_tabular_splits(
        X, y,
        test_size=test_size,
        random_state=random_state,
        scale_X=False, scale_y=False,                 # RF doesn't need scaling
        stratify_regression=stratify_regression,
        n_strat_bins=adaptive_bins
    )

    rf = RandomForestRegressor(random_state=random_state, n_jobs=-1, **rf_params)
    rf.fit(splits.X_train, splits.y_train)

    pred_tr = rf.predict(splits.X_train)
    pred_te = rf.predict(splits.X_test)

    # RMSE is computed as sqrt(MSE) so this runs across scikit-learn versions
    # (the `squared=` argument is deprecated in newer releases).
    metrics = {
        "train_MAE": mean_absolute_error(splits.y_train, pred_tr),
        "train_RMSE": float(np.sqrt(mean_squared_error(splits.y_train, pred_tr))),
        "train_R2": r2_score(splits.y_train, pred_tr),
        "val_MAE": mean_absolute_error(splits.y_test, pred_te),
        "val_RMSE": float(np.sqrt(mean_squared_error(splits.y_test, pred_te))),
        "val_R2": r2_score(splits.y_test, pred_te),
    }
    print(f"[RF/{tag}] val_MAE={metrics['val_MAE']:.6f}  val_RMSE={metrics['val_RMSE']:.6f}  val_R2={metrics['val_R2']:.4f}")

    path = os.path.join(save_dir, f"rf_{tag}.joblib")
    joblib.dump({"model": rf, "metrics": metrics, "rf_params": rf_params}, path)
    return rf, metrics, splits, path
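
For reference, the three metrics reported by `train_eval_rf` can be written out directly. This is a NumPy-only sketch of the standard definitions; `regression_report` is a hypothetical helper name, not a function from scikit-learn or ChemML.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MAE, RMSE, and R^2 computed from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    ss_res = float(np.sum(err ** 2))                        # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

report = regression_report([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(report)
```

RMSE is never smaller than MAE, and a large gap between the two (as in the FFV run) suggests that a few large errors dominate the squared loss.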

In [10]:
rf_cfg = {
    "FFV": {"n_estimators": 100, "max_depth": 60},
    "Tc":  {'n_estimators': 800, 'max_depth': 20, 'min_samples_split': 6, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'bootstrap': False},
    "Rg":  {'n_estimators': 400, 'max_depth': 260, 'min_samples_split': 6, 'min_samples_leaf': 4, 'max_features': 1.0, 'bootstrap': True},
}

rf_ffv, m_ffv, splits_ffv, p_ffv = train_eval_rf(X_ffv, y_ffv, rf_params=rf_cfg["FFV"], tag="FFV")
rf_tc,  m_tc,  splits_tc,  p_tc  = train_eval_rf(X_tc,  y_tc,  rf_params=rf_cfg["Tc"],  tag="Tc")
# rf_rg,  m_rg,  splits_rg,  p_rg  = train_eval_rf(X_rg,  y_rg,  rf_params=rf_cfg["Rg"],  tag="Rg")
# rf_tg,  m_tg,  splits_tg,  p_tg  = train_eval_rf(X_tg,  y_tg,  rf_params=rf_cfg["Rg"],  tag="Tg")
# rf_density,  m_density,  splits_density,  p_density  = train_eval_rf(X_density,  y_density,  rf_params=rf_cfg["Rg"],  tag="Density")

Splits:
X_train: (5624, 1024) | X_test: (1406, 1024)
y_train: (5624,) | y_test: (1406,)
[RF/FFV] val_MAE=0.009095  val_RMSE=0.019753  val_R2=0.5701
Splits:
X_train: (589, 1024) | X_test: (148, 1024)
y_train: (589,) | y_test: (148,)
[RF/Tc] val_MAE=0.029866  val_RMSE=0.045109  val_R2=0.7304


## ChemML GNN Model Results
| Model Type    | Featurization               | MAE   | RMSE  | R²    | Notes                               |
|---------------|-----------------------------|-------|-------|-------|-------------------------------------|
| GNN (Tuned)   | `tensorise_molecules` graph | 0.302 | 0.411 | 0.900 | Best performance across all metrics |
| GNN (Untuned) | `tensorise_molecules` graph | 0.400 | 0.519 | 0.841 | Good overall                        |


---
# Final Model Training

Having explored different molecular graph representations and model architectures, I now train what I expect to be the best-performing model on the full dataset. The earlier GNN model was built on `tensorise_molecules` (ChemML) graphs and performed well, with a **mean absolute error (MAE) around 0.30**. However, these graphs are based on RDKit's internal descriptors and do not reflect the original PCQM4Mv2 graph structure used in the Open Graph Benchmark (OGB). I will therefore shift to the `smiles2graph` representation provided by OGB, which aligns more directly with the benchmark's evaluation setup and with the top-performing models on the leaderboard.


| Source                         | Atom/Bond Features                                                   | Format                                          | Customizable?   | Alignment with PCQM4Mv2? |
| ------------------------------ | -------------------------------------------------------------------- | ----------------------------------------------- | --------------- | ------------------------ |
| `tensorise_molecules` (ChemML) | RDKit-based descriptors (e.g., atomic number, degree, hybridization) | NumPy tensors (`X_atoms`, `X_bonds`, `X_edges`) | Limited         | Not aligned              |
| `smiles2graph` (OGB / PyG)     | Predefined categorical features from PCQM4Mv2                        | PyTorch Geometric `Data` objects                | Highly flexible | Matches OGB standard     |

By using `smiles2graph`, we:

* Use OGB-standard graph construction and feature encoding for fair comparisons with leaderboard models
* Include learnable AtomEncoder and BondEncoder embeddings from `ogb.graphproppred.mol_encoder`, which improve model expressiveness
* Maintain compatibility with PyTorch Geometric, DGL, and OGB tools

I will also concatenate GNN-derived embeddings with SMILES-based RDKit descriptors and feed this hybrid representation into an MLP head. This combines structural and cheminformatics perspectives for improved prediction accuracy. With this setup, I aim to improve on the earlier MAE of ~0.30 and move closer to state-of-the-art performance.
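
As I understand it, the OGB encoders keep one embedding table per categorical feature column and sum the looked-up vectors into a single node (or edge) embedding. The NumPy sketch below mimics that lookup-and-sum pattern with random tables. The vocabulary sizes, `encode_atoms`, and the toy atoms are hypothetical, and in the real `AtomEncoder` the tables are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 4
vocab_sizes = [5, 3]  # hypothetical: two categorical atom features

# One lookup table per categorical feature column (learned in a real model).
tables = [rng.normal(size=(v, emb_dim)) for v in vocab_sizes]

def encode_atoms(x_cat: np.ndarray) -> np.ndarray:
    """Sum the per-column embeddings for an (n_atoms, n_features) int array."""
    out = np.zeros((x_cat.shape[0], emb_dim))
    for col, table in enumerate(tables):
        out += table[x_cat[:, col]]
    return out

x_cat = np.array([[0, 2], [4, 1], [0, 2]])  # three atoms, two features each
h = encode_atoms(x_cat)
print(h.shape)
```

Atoms with identical categorical features (rows 0 and 2 here) start from identical embeddings; message passing is what later differentiates them by their neighborhoods.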




In [11]:
# class EdgeEncoderMixed(nn.Module):
#     def __init__(self, emb_dim, cont_dim=32):
#         super().__init__()
#         from ogb.graphproppred.mol_encoder import bond_types, bond_dirs  # or hardcode sizes
#         self.emb0 = nn.Embedding(5, emb_dim)  # bond type
#         self.emb1 = nn.Embedding(6, emb_dim)  # stereo
#         self.emb2 = nn.Embedding(2, emb_dim)  # conjugation
#         self.mlp_cont = nn.Sequential(nn.Linear(cont_dim, emb_dim), nn.ReLU(), nn.Linear(emb_dim, emb_dim))

#     def forward(self, edge_attr):
#         cat = edge_attr[:, :3].long()
#         cont = edge_attr[:, 3:].float()
#         e_cat = self.emb0(cat[:,0]) + self.emb1(cat[:,1]) + self.emb2(cat[:,2])
#         e_cont = self.mlp_cont(cont)
#         return e_cat + e_cont


In [12]:
label_cols = ['Tg','FFV','Tc','Density','Rg']
task2idx   = {k:i for i,k in enumerate(label_cols)}

train_csv = pd.read_csv(os.path.join(DATA_ROOT, "train.csv"))  # keep 'id'!
lmdb_ids_path = TRAIN_LMDB + ".ids.txt"
if os.path.exists(lmdb_ids_path):
    with open(lmdb_ids_path) as f:
        kept_ids = set(int(x.strip()) for x in f if x.strip())
else:
    kept_ids = set(train_csv['id'].astype(int).tolist())

def ids_for_task(task):
    t = task2idx[task]
    col = label_cols[t]
    ids = train_csv.loc[~train_csv[col].isna(), 'id'].astype(int).tolist()
    # only those that actually exist in LMDB
    return np.array([i for i in ids if i in kept_ids], dtype=int)

ids_tg  = ids_for_task("Tg")
ids_den = ids_for_task("Density")
ids_tc = ids_for_task("Tc")
ids_rg = ids_for_task("Rg")
ids_ffv = ids_for_task("FFV")
print("Tg ids:", ids_tg.shape, "Density ids:", ids_den.shape)

Tg ids: (511,) Density ids: (613,)


In [13]:
from torch.utils.data import Dataset
from torch_geometric.data import Data
import torch, numpy as np
from dataset_polymer_fixed import LMDBDataset

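def _get_rdkit_feats_from_record(rec):
    """Return the record's RDKit descriptor vector as a flat float32 tensor."""
    arr = getattr(rec, "rdkit_feats", None)
    if arr is None:
        # Fallback length (6) must equal the model's rdkit_dim,
        # otherwise the view() in HybridGNN.forward will fail.
        return torch.zeros(6, dtype=torch.float32)
    return torch.as_tensor(np.asarray(arr, np.float32).reshape(-1), dtype=torch.float32)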

class LMDBtoPyGSingleTask(Dataset):
    def __init__(self, ids, lmdb_path, target_index=None):
        self.base = LMDBDataset(ids, lmdb_path)
        self.t = target_index  # int or None

    def __len__(self): return len(self.base)

    def __getitem__(self, idx):
        rec = self.base[idx]
        x  = torch.as_tensor(rec.x, dtype=torch.long)
        ei = torch.as_tensor(rec.edge_index, dtype=torch.long)
        ea = torch.as_tensor(rec.edge_attr)         # (E, 3 + 32RBF)
        ea_cat = ea[:, :3].long()                   # <-- use only 3 categorical cols
        rdkit = _get_rdkit_feats_from_record(rec)   # (15,)

        d = Data(x=x, edge_index=ei, edge_attr=ea_cat, rdkit_feats=rdkit)
        if (self.t is not None) and hasattr(rec, "y"):
            yv = torch.as_tensor(rec.y, dtype=torch.float32).view(-1)
            if self.t < yv.numel():
                d.y = yv[self.t:self.t+1]
        return d

In [14]:
from sklearn.model_selection import train_test_split
from torch_geometric.loader import DataLoader as GeoDataLoader

def make_loaders_for_task(task, ids, *, batch_size=32, seed=42):
    t = task2idx[task]
    tr_ids, va_ids = train_test_split(ids, test_size=0.2, random_state=seed)
    tr_ds = LMDBtoPyGSingleTask(tr_ids, TRAIN_LMDB, target_index=t)
    va_ds = LMDBtoPyGSingleTask(va_ids, TRAIN_LMDB, target_index=t)
    tr = GeoDataLoader(tr_ds, batch_size=batch_size, shuffle=True,  num_workers=0, pin_memory=True)
    va = GeoDataLoader(va_ds, batch_size=batch_size, shuffle=False, num_workers=0, pin_memory=True)
    return tr, va

train_loader_tg,  val_loader_tg  = make_loaders_for_task("Tg", ids_tg,  batch_size=32)
train_loader_den, val_loader_den = make_loaders_for_task("Density", ids_den, batch_size=32)
# train_loader_tc,  val_loader_tc  = make_loaders_for_task("Tc", ids_tc,  batch_size=32)
train_loader_rg, val_loader_rg = make_loaders_for_task("Rg", ids_rg, batch_size=32)
# train_loader_ffv, val_loader_ffv = make_loaders_for_task("FFV", ids_ffv, batch_size=32)

## Step 5: Define the Hybrid GNN Model

The final architecture uses both structural and cheminformatics data by combining GNN-learned graph embeddings with SMILES-derived RDKit descriptors. This Hybrid GNN model uses `smiles2graph` for graph construction and augments it with RDKit-based molecular features for improved prediction accuracy.

### Model Components:

* **AtomEncoder / BondEncoder**
  Transforms categorical atom and bond features (provided by OGB) into learnable embeddings using the encoders from `ogb.graphproppred.mol_encoder`. These provide a strong foundation for expressive graph learning.

* **GINEConv Layers (x2)**
  I use two stacked GINEConv layers (Graph Isomorphism Network with Edge features). These layers perform neighborhood aggregation based on edge attributes, allowing the model to capture localized chemical environments.

* **Global Mean Pooling**
  After message passing, node-level embeddings are aggregated into a fixed-size graph-level representation using `global_mean_pool`.

* **Concatenation with RDKit Descriptors**
  The pooled GNN embedding is concatenated with external RDKit descriptors, which capture global molecular properties not easily inferred from graph data alone.

* **MLP Prediction Head**
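  A multilayer perceptron processes the combined feature vector with ReLU activations, dropout regularization, and linear layers to predict a single scalar target: the HOMO–LUMO gap in the write-up above, or one polymer property (such as Tg, Density, or Rg) per model with the task-specific loaders built in this notebook.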
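
The components above can be traced on a toy batch. The following NumPy sketch shows one GINE-style aggregation, written as (1 + eps) * x_i plus the sum of ReLU(x_j + e_ji) over incoming edges, and omits the layer's MLP. It then applies mean pooling per graph and concatenates the result with descriptor vectors. All function names, the toy graph, and the descriptor values are hypothetical; the actual model uses `GINEConv` and `global_mean_pool` from PyTorch Geometric.

```python
import numpy as np

def gine_aggregate(x, edge_index, edge_emb, eps=0.0):
    """(1 + eps) * x_i + sum over incoming edges j->i of ReLU(x_j + e_ji)."""
    src, dst = edge_index
    messages = np.maximum(x[src] + edge_emb, 0.0)
    agg = np.zeros_like(x)
    np.add.at(agg, dst, messages)   # scatter-add messages onto target nodes
    return (1.0 + eps) * x + agg    # the layer's MLP would be applied next

def mean_pool(x, batch, num_graphs):
    """Average node rows per graph id given in `batch`."""
    sums = np.zeros((num_graphs, x.shape[1]))
    np.add.at(sums, batch, x)
    counts = np.bincount(batch, minlength=num_graphs)[:, None]
    return sums / counts

# Toy batch: graph 0 has nodes 0 and 1 (bonded), graph 1 has the lone node 2.
x = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
edge_index = np.array([[0, 1], [1, 0]])   # directed edges 0->1 and 1->0
edge_emb = np.zeros((2, 2))               # zero edge embeddings for simplicity
batch = np.array([0, 0, 1])
rdkit_feats = np.array([[0.5, 0.1, 0.2], [0.9, 0.3, 0.4]])  # hypothetical descriptors

h = gine_aggregate(x, edge_index, edge_emb)
pooled = mean_pool(h, batch, num_graphs=2)
hybrid = np.concatenate([pooled, rdkit_feats], axis=1)
print(hybrid.shape)
```

The concatenated row for each graph has `gnn_dim + rdkit_dim` columns, which is the input width of the first linear layer in the MLP head.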

In [15]:
import torch
from torch import nn
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader as GeoDataLoader
from sklearn.model_selection import train_test_split
from typing import List

def _act(name: str):
    name = (name or "ReLU").lower()
    if name in ("relu",):   return nn.ReLU()
    if name in ("gelu",):   return nn.GELU()
    if name in ("swish","silu"): return nn.SiLU()
    return nn.ReLU()

class HybridGNN(Module):
    def __init__(self, gnn_dim: int, rdkit_dim: int, hidden_dim: int, dropout_rate: float=0.2, activation: str="ReLU"):
        super().__init__()
        self.gnn_dim = gnn_dim
        self.rdkit_dim = rdkit_dim
        act = _act(activation)
        self.atom_encoder = AtomEncoder(emb_dim=gnn_dim)
        self.bond_encoder = BondEncoder(emb_dim=gnn_dim)

        self.conv1 = GINEConv(Sequential(Linear(gnn_dim, gnn_dim), act, Linear(gnn_dim, gnn_dim)))
        self.conv2 = GINEConv(Sequential(Linear(gnn_dim, gnn_dim), act, Linear(gnn_dim, gnn_dim)))
        self.pool = global_mean_pool

        self.mlp = Sequential(
            Linear(gnn_dim + rdkit_dim, hidden_dim), act, Dropout(dropout_rate),
            Linear(hidden_dim, hidden_dim // 2), act, Dropout(dropout_rate),
            Linear(hidden_dim // 2, 1)
            )

    def forward(self, data):
        # encode atoms and bonds
        x = self.atom_encoder(data.x)
        edge_attr = self.bond_encoder(data.edge_attr)

        # GNN convolutions
        x = self.conv1(x, data.edge_index, edge_attr)
        x = self.conv2(x, data.edge_index, edge_attr)
        x = self.pool(x, data.batch)

        # handle RDKit features
        rdkit_feats = getattr(data, 'rdkit_feats', None)
        if rdkit_feats is None:
            raise ValueError("RDKit features not found in the data object")

        # Validate the element count before reshaping to (batch_size, rdkit_dim),
        # so a mismatch raises a clear error instead of a generic view() failure.
        # The number of graphs in the batch is x.shape[0] after pooling.
        expected = x.shape[0] * self.rdkit_dim
        if rdkit_feats.numel() != expected:
            raise ValueError(
                f"Shape mismatch: expected {expected} RDKit values "
                f"({x.shape[0]} graphs x {self.rdkit_dim}), got {rdkit_feats.numel()}"
            )
        rdkit_feats = rdkit_feats.view(x.shape[0], self.rdkit_dim)
        x = torch.cat([x, rdkit_feats], dim=1)

        return self.mlp(x)

In [30]:
def train_hybrid_gnn(
    model: nn.Module,
    train_loader,
    val_loader,
    *,
    lr: float,
    optimizer: str = "Adam",
    weight_decay: float = 0.0,
    epochs: int = 100,
    patience: int = 10,
    save_dir: str = "saved_models/gnn",
    tag: str = "model",
    device: torch.device = torch.device("cuda" if torch.cuda.is_available() else "cpu"),
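):
    """Train with MSE loss and early stopping on validation MSE.

    Saves the best state dict to `save_dir/tag.pt`, reloads it, and returns
    (model, checkpoint path, validation metrics dict).
    """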
    os.makedirs(save_dir, exist_ok=True)
    model = model.to(device)
    opt_name = optimizer.lower()
    if opt_name == "adamw":
        opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    elif opt_name == "rmsprop":
        opt = torch.optim.RMSprop(model.parameters(), lr=lr, weight_decay=weight_decay, momentum=0.0)
    else:
        opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
        
    best, bad = float("inf"), 0
    best_path = os.path.join(save_dir, f"{tag}.pt")

    @torch.no_grad()
    def eval_once(loader):
        model.eval()
        preds, trues = [], []
        for b in loader:
            b = b.to(device)
            p = model(b)
            preds.append(p.cpu())
            trues.append(b.y.view(-1,1).cpu())
        preds = torch.cat(preds).numpy()
        trues = torch.cat(trues).numpy()
        mse = np.mean((preds - trues)**2)
        return mse, preds, trues

    for ep in range(1, epochs+1):
        model.train()
        total, count = 0.0, 0
        for b in train_loader:
            b = b.to(device)
            pred = model(b)
            loss = F.mse_loss(pred, b.y.view(-1,1))
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item() * b.num_graphs
            count += b.num_graphs
        tr_mse = total / max(1, count)
        va_mse, _, _ = eval_once(val_loader)
        print(f"Epoch {ep:02d} | Train MSE {tr_mse:.5f} | Val MSE {va_mse:.5f}")

        if va_mse < best - 1e-7:
            best, bad = va_mse, 0
            torch.save(model.state_dict(), best_path)
        else:
            bad += 1
            if bad >= patience:
                print("Early stopping.")
                break

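    # weights_only=True restricts unpickling to tensors (assumes a recent PyTorch release)
    model.load_state_dict(torch.load(best_path, map_location=device, weights_only=True))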
    val_mse, val_pred, val_true = eval_once(val_loader)
    mae = np.mean(np.abs(val_pred - val_true))
    rmse = np.sqrt(val_mse)
    r2 = 1 - np.sum((val_pred - val_true)**2) / np.sum((val_true - val_true.mean())**2)
    print(f"[{tag}] Best Val — MAE {mae:.6f} | RMSE {rmse:.6f} | R2 {r2:.4f}")
    return model, best_path, {"MAE": mae, "RMSE": rmse, "R2": r2}


In [31]:
tg_cfg = {'gnn_dim': 256, 'hidden_dim': 512, 'dropout_rate': 0.34404144200017467, 
          'lr': 0.0005555079210176292, 'activation': 'Swish', 'optimizer': 'RMSprop', 
          'weight_decay': 9.056299733554687e-06}

model_tg = HybridGNN(
    gnn_dim=tg_cfg['gnn_dim'],
    rdkit_dim=15,
    hidden_dim=tg_cfg['hidden_dim'],
    dropout_rate=tg_cfg['dropout_rate'],
    activation=tg_cfg['activation']
)

model_tg, ckpt_tg, metrics_tg = train_hybrid_gnn(
    model_tg, train_loader_tg, val_loader_tg,
    lr=tg_cfg['lr'], optimizer=tg_cfg['optimizer'],
    weight_decay=tg_cfg['weight_decay'],
    epochs=120, patience=15,  
    save_dir="saved_models/gnn_tg", tag="hybridgnn_tg"
)

Epoch 01 | Train MSE 41959.36079 | Val MSE 7592.69531
Epoch 02 | Train MSE 7588.24583 | Val MSE 6447.14258
Epoch 03 | Train MSE 7014.11868 | Val MSE 5595.39502
Epoch 04 | Train MSE 6597.50885 | Val MSE 5446.25928
Epoch 05 | Train MSE 6401.56805 | Val MSE 5403.22070
Epoch 06 | Train MSE 6175.07003 | Val MSE 5421.49268
Epoch 07 | Train MSE 6328.19759 | Val MSE 5410.92480
Epoch 08 | Train MSE 6252.14488 | Val MSE 5200.04248
Epoch 09 | Train MSE 6216.99782 | Val MSE 5180.54932
Epoch 10 | Train MSE 6376.76024 | Val MSE 5086.60107
Epoch 11 | Train MSE 6066.95448 | Val MSE 5056.37793
Epoch 12 | Train MSE 5714.21722 | Val MSE 4869.28467
Epoch 13 | Train MSE 5897.03215 | Val MSE 5118.38965
Epoch 14 | Train MSE 5653.75005 | Val MSE 4949.24951
Epoch 15 | Train MSE 5711.58905 | Val MSE 5175.50293
Epoch 16 | Train MSE 5332.53415 | Val MSE 5128.10938
Epoch 17 | Train MSE 5809.69658 | Val MSE 4756.61914
Epoch 18 | Train MSE 5566.85842 | Val MSE 5304.73340
Epoch 19 | Train MSE 5811.56748 | Val MSE 529


In [18]:
den_cfg = {'gnn_dim': 1024, 'hidden_dim': 384, 'dropout_rate': 0.3735260731607324,
           'lr': 5.956024201538505e-04, 'activation': 'Swish', 'optimizer': 'AdamW',
           'weight_decay': 8.619671341229739e-06}

model_den = HybridGNN(
    gnn_dim=den_cfg['gnn_dim'],
    rdkit_dim=15,
    hidden_dim=den_cfg['hidden_dim'],
    dropout_rate=den_cfg['dropout_rate'],
    activation=den_cfg['activation']
)
model_den, ckpt_den, metrics_den = train_hybrid_gnn(
    model_den, train_loader_den, val_loader_den,
    lr=den_cfg['lr'], optimizer=den_cfg['optimizer'],
    weight_decay=den_cfg['weight_decay'],
    epochs=120, patience=15,  
    save_dir="saved_models/gnn_density", tag="hybridgnn_density"
)

Epoch 01 | Train MSE 1.67778 | Val MSE 0.06559
Epoch 02 | Train MSE 0.34782 | Val MSE 0.03503
Epoch 03 | Train MSE 0.15642 | Val MSE 0.11731
Epoch 04 | Train MSE 0.09209 | Val MSE 0.08539
Epoch 05 | Train MSE 0.08105 | Val MSE 0.06634
Epoch 06 | Train MSE 0.06434 | Val MSE 0.05680
Epoch 07 | Train MSE 0.04648 | Val MSE 0.04775
Epoch 08 | Train MSE 0.04682 | Val MSE 0.04406
Epoch 09 | Train MSE 0.03644 | Val MSE 0.02439
Epoch 10 | Train MSE 0.03321 | Val MSE 0.03068
Epoch 11 | Train MSE 0.03518 | Val MSE 0.02869
Epoch 12 | Train MSE 0.03524 | Val MSE 0.02185
Epoch 13 | Train MSE 0.03615 | Val MSE 0.02330
Epoch 14 | Train MSE 0.03829 | Val MSE 0.02591
Epoch 15 | Train MSE 0.03146 | Val MSE 0.02086
Epoch 16 | Train MSE 0.02413 | Val MSE 0.01984
Epoch 17 | Train MSE 0.02266 | Val MSE 0.01457
Epoch 18 | Train MSE 0.02364 | Val MSE 0.01194
Epoch 19 | Train MSE 0.02247 | Val MSE 0.02376
Epoch 20 | Train MSE 0.02117 | Val MSE 0.01255
Epoch 21 | Train MSE 0.02245 | Val MSE 0.01595
Epoch 22 | Tr


In [33]:
rg_cfg = {'gnn_dim': 1024, 'hidden_dim': 384, 'dropout_rate': 0.3735260731607324,
           'lr': 5.956024201538505e-04, 'activation': 'Swish', 'optimizer': 'AdamW',
           'weight_decay': 8.619671341229739e-06}

model_rg = HybridGNN(
    gnn_dim=rg_cfg['gnn_dim'],
    rdkit_dim=15,
    hidden_dim=rg_cfg['hidden_dim'],
    dropout_rate=rg_cfg['dropout_rate'],
    activation=rg_cfg['activation']
)

model_rg, ckpt_rg, metrics_rg = train_hybrid_gnn(
    model_rg, train_loader_rg, val_loader_rg,
    lr=rg_cfg['lr'], optimizer=rg_cfg['optimizer'],
    weight_decay=rg_cfg['weight_decay'],
    epochs=120, patience=15,  
    save_dir="saved_models/gnn_rg", tag="hybridgnn_rg"
)

Epoch 01 | Train MSE 80.92240 | Val MSE 57.16220
Epoch 02 | Train MSE 40.70642 | Val MSE 27.66032
Epoch 03 | Train MSE 29.69571 | Val MSE 24.59624
Epoch 04 | Train MSE 26.10429 | Val MSE 27.73846
Epoch 05 | Train MSE 24.36249 | Val MSE 22.10388
Epoch 06 | Train MSE 23.64583 | Val MSE 19.42935
Epoch 07 | Train MSE 22.98108 | Val MSE 20.28521
Epoch 08 | Train MSE 23.47854 | Val MSE 19.16561
Epoch 09 | Train MSE 22.11226 | Val MSE 18.07262
Epoch 10 | Train MSE 24.15085 | Val MSE 22.59981
Epoch 11 | Train MSE 23.79000 | Val MSE 24.73962
Epoch 12 | Train MSE 22.79254 | Val MSE 26.35352
Epoch 13 | Train MSE 21.19017 | Val MSE 16.78869
Epoch 14 | Train MSE 22.03345 | Val MSE 17.55637
Epoch 15 | Train MSE 19.98177 | Val MSE 21.54910
Epoch 16 | Train MSE 19.83724 | Val MSE 15.54232
Epoch 17 | Train MSE 18.04467 | Val MSE 15.57086
Epoch 18 | Train MSE 17.16936 | Val MSE 16.09323
Epoch 19 | Train MSE 16.80016 | Val MSE 14.00111
Epoch 20 | Train MSE 17.62381 | Val MSE 18.94002
Epoch 21 | Train MSE


In [20]:
# ffv_cfg = {'gnn_dim': 256, 'hidden_dim': 512, 'dropout_rate': 0.34404144200017467, 
#           'lr': 0.0005555079210176292, 'activation': 'Swish', 'optimizer': 'RMSprop', 
#           'weight_decay': 9.056299733554687e-06}

# model_ffv = HybridGNN(
#     gnn_dim=ffv_cfg['gnn_dim'],
#     rdkit_dim=15,
#     hidden_dim=ffv_cfg['hidden_dim'],
#     dropout_rate=ffv_cfg['dropout_rate'],
#     activation=ffv_cfg['activation']
# )

# model_ffv, ckpt_ffv, metrics_ffv = train_hybrid_gnn(
#     model_ffv, train_loader_ffv, val_loader_ffv,
#     lr=ffv_cfg['lr'], optimizer=ffv_cfg['optimizer'],
#     weight_decay=ffv_cfg['weight_decay'],
#     epochs=120, patience=15,  
#     save_dir="saved_models/gnn_ffv", tag="hybridgnn_ffv"
# )

In [29]:
# ===== Final submission: RF(FFV,Tc) + GNN(Tg,Density,Rg) using LMDB =====
import os

import joblib
import numpy as np
import pandas as pd
import torch
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors as rdmd, DataStructs
from torch_geometric.loader import DataLoader as GeoDataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
label_cols = ['Tg','FFV','Tc','Density','Rg']

# ---- 0) Load test ids & smiles
sample   = pd.read_csv(os.path.join(DATA_ROOT, 'sample_submission.csv'))
test_df  = pd.read_csv(os.path.join(DATA_ROOT, 'test.csv'))
test_ids = test_df['id'].astype(sample['id'].dtype).values
test_smiles = test_df['SMILES'].astype(str).tolist()

#  1) Morgan FPs for RF models (FFV, Tc)
def morgan_bits_from_smiles(smiles_list, n_bits=1024, radius=3):
    X = np.zeros((len(smiles_list), n_bits), dtype=np.uint8)
    for i, s in enumerate(smiles_list):
        arr = np.zeros((n_bits,), dtype=np.uint8)
        mol = Chem.MolFromSmiles(s)  # no canonicalization; if parse fails, stays zeros
        if mol is not None:
            fp = rdmd.GetMorganFingerprintAsBitVect(mol, radius=radius, nBits=n_bits)
            DataStructs.ConvertToNumpyArray(fp, arr)
        X[i] = arr
    return X

X_test_fp = morgan_bits_from_smiles(test_smiles, n_bits=1024, radius=3)

# Load trained RFs
rf_ffv = joblib.load(p_ffv)['model']
rf_tc  = joblib.load(p_tc)['model']

pred_ffv = rf_ffv.predict(X_test_fp).astype(float)
pred_tc  = rf_tc.predict(X_test_fp).astype(float)

# ---- 2) GNN predictions (Tg, Density, Rg) via LMDB test loader
test_ds = LMDBtoPyGSingleTask(test_ids, TEST_LMDB, target_index=None)
test_loader = GeoDataLoader(test_ds, batch_size=128, shuffle=False, num_workers=0, pin_memory=True)

@torch.no_grad()
def predict_gnn(model, loader, device):
    model.eval()
    outs = []
    for b in loader:
        b = b.to(device)
        p = model(b).view(-1).cpu().numpy()
        outs.append(p)
    return np.concatenate(outs, axis=0) if outs else np.zeros((0,), dtype=float)


# Tg (rebuild the architecture from the tuned config, then load the saved weights)
m_tg = HybridGNN(gnn_dim=tg_cfg['gnn_dim'], rdkit_dim=15,
                 hidden_dim=tg_cfg['hidden_dim'], dropout_rate=tg_cfg['dropout_rate'],
                 activation=tg_cfg['activation']).to(device)
# weights_only=True restricts unpickling to tensors (assumes a recent PyTorch release)
m_tg.load_state_dict(torch.load(ckpt_tg, map_location=device, weights_only=True))
pred_tg = predict_gnn(m_tg, test_loader, device)

# Density (den_cfg is defined in the density training cell above)
m_den = HybridGNN(gnn_dim=den_cfg['gnn_dim'], rdkit_dim=15,
                  hidden_dim=den_cfg['hidden_dim'], dropout_rate=den_cfg['dropout_rate'],
                  activation=den_cfg['activation']).to(device)
m_den.load_state_dict(torch.load(ckpt_den, map_location=device, weights_only=True))
pred_density = predict_gnn(m_den, test_loader, device)

# Rg (tuned GNN)
m_rg = HybridGNN(gnn_dim=rg_cfg['gnn_dim'], rdkit_dim=15,
                 hidden_dim=rg_cfg['hidden_dim'], dropout_rate=rg_cfg['dropout_rate'],
                 activation=rg_cfg['activation']).to(device)
m_rg.load_state_dict(torch.load(ckpt_rg, map_location=device, weights_only=True))
pred_rg = predict_gnn(m_rg, test_loader, device)

# ---- 3) Safety + assemble submission
pred_tg      = np.nan_to_num(pred_tg)
pred_density = np.nan_to_num(pred_density)
pred_ffv     = np.nan_to_num(pred_ffv)
pred_tc      = np.nan_to_num(pred_tc)
pred_rg      = np.nan_to_num(pred_rg)

sub = pd.DataFrame({
    'id': test_ids,
    'Tg': pred_tg,
    'FFV': pred_ffv,
    'Tc': pred_tc,
    'Density': pred_density,
    'Rg': pred_rg,
})
sub.to_csv('submission.csv', index=False)
print('submission.csv written:', sub.shape)
sub.head()


submission.csv written: (3, 6)



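|     | id         | Tg         | FFV      | Tc       | Density  | Rg        |
| --- | ---------- | ---------- | -------- | -------- | -------- | --------- |
| 0   | 1109053969 | 99.650856  | 0.369987 | 0.225848 | 1.111971 | 25.481236 |
| 1   | 1422188626 | 150.003799 | 0.376855 | 0.232995 | 1.04154  | 21.680067 |
| 2   | 2032016830 | 78.97287   | 0.356142 | 0.264171 | 1.074244 | 19.211025 |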


# Conclusions

## Model Performance Summary

All baseline models were initially trained and evaluated on a 5,000-molecule subset of the full dataset. The tables below compare results across featurization strategies and model types; MAE and RMSE are reported in eV.

### 2D Baseline Models

| Model Type    | Featurization      | MAE   | RMSE  | R²    | Notes                                 |
| ------------- | ------------------ | ----- | ----- | ----- | ------------------------------------- |
| MLP (Tuned)   | RDKit Fingerprints | 0.426 | 0.574 | 0.798 | Strong performance across all metrics |
| KRR (Tuned)   | RDKit Fingerprints | 0.454 | 0.593 | 0.784 | Good overall, slightly lower R²       |
| RF (Tuned)    | RDKit Fingerprints | 0.423 | 0.583 | 0.791 | Best MAE, very competitive overall    |
| MLP (Tuned)   | Coulomb Matrix     | 0.636 | 0.819 | 0.588 | Significantly weaker performance      |
| MLP (Untuned) | RDKit Fingerprints | 0.467 | 0.609 | 0.772 | Solid untuned baseline                |
| KRR (Untuned) | RDKit Fingerprints | 0.519 | 0.668 | 0.726 | Notable drop from tuned version       |
| RF (Untuned)  | RDKit Fingerprints | 0.426 | 0.587 | 0.788 | Surprisingly close to tuned RF        |
| MLP (Untuned) | Coulomb Matrix     | 0.663 | 0.847 | 0.559 | Consistently underperforms            |

### Graph Neural Network Models (ChemML)

| Model Type    | Featurization               | MAE   | RMSE  | R²    | Notes                                |
| ------------- | --------------------------- | ----- | ----- | ----- | ------------------------------------ |
| GNN (Tuned)   | `tensorise_molecules` Graph | 0.302 | 0.411 | 0.900 | Best results from ChemML experiments |
| GNN (Untuned) | `tensorise_molecules` Graph | 0.400 | 0.519 | 0.841 | Strong but less optimized            |

### Final Hybrid GNN Model Trained on Full Dataset (OGB-Compatible)

| Model Type           | Featurization                          | MAE   | RMSE  | R²    | Notes                              |
| -------------------- | -------------------------------------- | ----- | ----- | ----- | ---------------------------------- |
| Hybrid GNN (Tuned)   | OGB `smiles2graph` + RDKit descriptors | 0.159 | 0.234 | 0.965 | Best result in this notebook       |
| Hybrid GNN (Untuned) | OGB `smiles2graph` + RDKit descriptors | 0.223 | 0.308 | 0.939 | Still very strong pre-tuning       |
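
For reference, the three metrics in these tables can be computed as in the sketch below. It mirrors the formulas used in the training function (mean absolute error, root mean squared error, and R² as one minus the ratio of residual to total sum of squares); the helper name `regression_metrics` and the toy values are illustrative assumptions, not results from this notebook.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for two equally sized 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    err = y_pred - y_true
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))
    ss_res = float(np.sum(err ** 2))                       # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    return {"MAE": mae, "RMSE": rmse, "R2": 1.0 - ss_res / ss_tot}

# Made-up toy values, only to exercise the function.
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 2.5, 4.0]
metrics = regression_metrics(toy_true, toy_pred)
print(metrics)
```

Because RMSE squares the errors before averaging, it penalizes large misses more than MAE does, which is why the two can rank models differently (as with the tuned MLP and RF baselines above).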

---

## Model Error Analysis

I performed a qualitative evaluation by comparing predicted and true HOMO–LUMO gaps for both randomly selected and poorly predicted molecules. The worst-performing molecules often had rare or complex structures that are likely underrepresented in the training set. This highlights the importance of structural diversity, and suggests that more expressive 3D information could improve generalization.
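
One simple way to pick out poorly predicted molecules is to rank them by absolute error, as sketched below. The helper name `worst_k` and the toy arrays are assumptions for illustration; the notebook's own inspection code may differ.

```python
import numpy as np

def worst_k(y_true, y_pred, k=3):
    """Return indices and absolute errors of the k largest prediction errors."""
    abs_err = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    order = np.argsort(-abs_err, kind="stable")[:k]  # largest error first
    return order, abs_err[order]

# Made-up toy gaps (eV), not taken from the dataset.
toy_true = np.array([5.0, 6.2, 4.1, 7.3, 3.9])
toy_pred = np.array([5.1, 5.0, 4.0, 7.9, 3.9])
idx, errs = worst_k(toy_true, toy_pred, k=2)
print(idx, errs)
```

The returned indices can then be used to look up the corresponding SMILES strings and draw the structures for manual inspection.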

## Next Steps: Integrating 3D Molecular Information

To push performance further and overcome the limitations of 2D graphs and hand-crafted descriptors, my next step will involve:

* Using **3D molecular geometries**
* Incorporating **interatomic distances**, angles, and **spatial encoding** (SchNet, DimeNet, or SE(3)-equivariant models)
* Comparing results against the current best MAE (~0.159 eV)

This direction aligns with trends in molecular property prediction, where 3D-aware models often outperform purely 2D approaches, especially for quantum properties like HOMO–LUMO gaps.
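
As a rough illustration of what distance-based inputs look like, the sketch below computes a pairwise distance matrix from 3D coordinates and expands each distance on a set of Gaussian radial basis functions, in the spirit of SchNet-style models. The coordinates, the `centers` grid, and the `gamma` width are made-up assumptions, not values from any specific model or molecule.

```python
import numpy as np

def pairwise_distances(coords: np.ndarray) -> np.ndarray:
    """Euclidean distance between every pair of atoms; coords has shape (n_atoms, 3)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def gaussian_rbf(dist: np.ndarray, centers: np.ndarray, gamma: float = 10.0) -> np.ndarray:
    """Expand each distance into len(centers) smooth features."""
    return np.exp(-gamma * (dist[..., None] - centers) ** 2)

# Three hypothetical atom positions (arbitrary units).
coords = np.array([[0.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [0.0, 2.0, 0.0]])
dist = pairwise_distances(coords)    # shape (3, 3)
centers = np.linspace(0.0, 5.0, 6)   # basis centers at 0, 1, ..., 5
feats = gaussian_rbf(dist, centers)  # shape (3, 3, 6)
print(dist.shape, feats.shape)
```

Distances are invariant to rotating or translating the molecule, which is one reason they are a common starting point for 3D-aware models.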
