# V_9 model

**Design choices**


LightGBM per label: still one binary model per endpoint, trained on the same Mordred descriptor features.

Hyperparameter tuning: Optuna with 5-fold stratified cross-validation per label, maximizing the mean F1-score across folds.

GPU acceleration: set LightGBM’s device_type='gpu' together with gpu_platform_id and gpu_device_id so that training runs on the RTX 4070 Ti.

Reusable outputs: all models, tuned params, thresholds, masks, and feature names are saved under models/v9.

Threshold optimization: once the final models are trained, apply the precision-recall threshold search from V7_tuned to pick the per-label threshold that maximizes F1 (the balance of precision and recall).

Evaluation reporting: after training, output AUC, F1, and accuracy per label.

5-fold CV: more robust than the 3-fold CV used in v8.

## Step 1 – Setup

In [1]:
import os
import joblib
import numpy as np
import pandas as pd
from pathlib import Path
from mordred import Calculator, descriptors
from rdkit import Chem
from tqdm import tqdm

tqdm.pandas()

# Paths
BASE_PATH = Path("tox21_lightgb_pipeline")
V9_MODELS = BASE_PATH / "models/v9"
V9_MODELS.mkdir(parents=True, exist_ok=True)

# Data path
DATA_CSV = BASE_PATH / "Data_v6/processed/tox21.csv"

# Load CSV with 12 labels + mol_id + smiles
df = pd.read_csv(DATA_CSV)
labels = df.columns[:12]
y_all = df[labels].values
label_mask = ~df[labels].isna().values
smiles_list = df["smiles"].tolist()

# === RECOMPUTE MORDRED DESCRIPTORS ===
calc = Calculator(descriptors, ignore_3D=True)
mols = df["smiles"].progress_apply(Chem.MolFromSmiles)
descs = mols.progress_apply(lambda m: calc(m) if m is not None else None)
desc_df = pd.DataFrame([d.asdict() if d is not None else {} for d in descs])
desc_df = desc_df.replace([np.inf, -np.inf], np.nan).fillna(-1)

# Save feature names for future utils.py use
feature_names = list(desc_df.columns)
with open(V9_MODELS / "feature_names.txt", "w") as f:
    f.write("\n".join(feature_names))

X = desc_df.values.astype(np.float32)

print(f"✅ Features shape: {X.shape}, Labels shape: {y_all.shape}")


100%|██████████| 7831/7831 [00:00<00:00, 10389.16it/s]
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
100%|██████████| 7831/7831 [06:25<00:00, 20.33it/s]


✅ Features shape: (7831, 1613), Labels shape: (7831, 12)
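
Every later step selects rows per endpoint with `label_mask`, because not every compound was measured in every assay. The following is a minimal sketch of that selection on a toy table; the `toy`, `toy_X`, and `toy_mask` names and values are made up for illustration and are not part of the pipeline.

```python
import numpy as np
import pandas as pd

# Toy label table: NaN means "compound not measured for this endpoint".
toy = pd.DataFrame({
    "NR-AR": [0.0, 1.0, np.nan, 0.0],
    "SR-MMP": [np.nan, 1.0, 0.0, np.nan],
})
toy_X = np.arange(8, dtype=np.float32).reshape(4, 2)  # 4 compounds, 2 features

toy_mask = ~toy.isna().values            # True where a label exists
col = toy.columns.get_loc("SR-MMP")
rows = toy_mask[:, col]                  # compounds measured for SR-MMP

X_label = toy_X[rows]                    # features of the measured compounds only
y_label = toy["SR-MMP"].values[rows].astype(int)

print(X_label.shape, y_label.tolist())
```

Only two of the four toy compounds carry an SR-MMP label, so the endpoint-specific training set shrinks accordingly. The same effect explains why `n_samples` differs per label in the evaluation table.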


## Step 2 – Hyperparameter Tuning with Optuna

In [4]:
import optuna
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
import lightgbm as lgb

optuna.logging.set_verbosity(optuna.logging.WARNING)

tuned_params = {}
n_trials = 100  # balanced runtime/quality

def objective(trial, X_train, y_train):
    param_grid = {
        "objective": "binary",
        "metric": "binary_logloss",  # required for early stopping
        "verbosity": -1,
        "boosting_type": "gbdt",
        "device_type": "gpu",
        "gpu_platform_id": 0,
        "gpu_device_id": 0,
        "learning_rate": trial.suggest_float("learning_rate", 0.005, 0.2, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 20, 500),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.6, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.6, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 10),
        "lambda_l1": trial.suggest_float("lambda_l1", 0.0, 5.0),
        "lambda_l2": trial.suggest_float("lambda_l2", 0.0, 5.0)
    }

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    f1_scores = []
    for train_idx, valid_idx in cv.split(X_train, y_train):
        lgb_train = lgb.Dataset(X_train[train_idx], y_train[train_idx])
        lgb_valid = lgb.Dataset(X_train[valid_idx], y_train[valid_idx])
        model = lgb.train(
            param_grid,
            lgb_train,
            valid_sets=[lgb_valid],
            num_boost_round=500,
            callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)]
        )
        preds = (model.predict(X_train[valid_idx]) >= 0.5).astype(int)
        f1_scores.append(f1_score(y_train[valid_idx], preds))
    return np.mean(f1_scores)

for i, label in enumerate(labels):
    print(f"\n🔍 Tuning label: {label}")
    mask = label_mask[:, i]
    X_label = X[mask]
    y_label = y_all[mask, i].astype(int)
    study = optuna.create_study(direction="maximize")
    study.optimize(lambda trial: objective(trial, X_label, y_label), n_trials=n_trials)
    tuned_params[label] = study.best_params
    tuned_params[label].update({
        "objective": "binary",
        "metric": "binary_logloss",  # keep metric for final training
        "boosting_type": "gbdt",
        "device_type": "gpu",
        "gpu_platform_id": 0,
        "gpu_device_id": 0,
        "verbosity": -1
    })
    print(f"✅ Best F1 for {label}: {study.best_value:.4f}")


🔍 Tuning label: NR-AR
✅ Best F1 for NR-AR: 0.6074

🔍 Tuning label: NR-AR-LBD
✅ Best F1 for NR-AR-LBD: 0.6423

🔍 Tuning label: NR-AhR
✅ Best F1 for NR-AhR: 0.5765

🔍 Tuning label: NR-Aromatase
✅ Best F1 for NR-Aromatase: 0.3127

🔍 Tuning label: NR-ER
✅ Best F1 for NR-ER: 0.3743

🔍 Tuning label: NR-ER-LBD
✅ Best F1 for NR-ER-LBD: 0.4717

🔍 Tuning label: NR-PPAR-gamma
✅ Best F1 for NR-PPAR-gamma: 0.1797

🔍 Tuning label: SR-ARE
✅ Best F1 for SR-ARE: 0.4532

🔍 Tuning label: SR-ATAD5
✅ Best F1 for SR-ATAD5: 0.1943

🔍 Tuning label: SR-HSE
✅ Best F1 for SR-HSE: 0.2202

🔍 Tuning label: SR-MMP
✅ Best F1 for SR-MMP: 0.6723

🔍 Tuning label: SR-p53
✅ Best F1 for SR-p53: 0.2671
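
The tuning loop relies on `StratifiedKFold` so that each validation fold contains positives even for rare endpoints. Here is a small sketch of that property on a synthetic label vector; the 10% positive rate is an assumption chosen for illustration, not a Tox21 figure.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels: 10 positives out of 100 (assumed ratio)
y_toy = np.array([1] * 10 + [0] * 90)
X_toy = np.zeros((100, 1))  # features are irrelevant for the split itself

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count positives landing in each validation fold
positives_per_fold = [
    int(y_toy[valid_idx].sum()) for _, valid_idx in cv.split(X_toy, y_toy)
]
print(positives_per_fold)
```

With a plain `KFold`, a fold could end up with very few or no positives, in which case F1 on that fold is undefined or zero and the mean used as the Optuna objective becomes noisy.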


## Step 3 – Train Final Models

In [5]:
from sklearn.model_selection import train_test_split

models = {}
feature_masks = {}

for i, label in enumerate(labels):
    print(f"\n🚀 Training final model for {label}")
    mask = label_mask[:, i]
    X_label = X[mask]
    y_label = y_all[mask, i].astype(int)

    # All features are used (keep mask for utils.py)
    feature_masks[label] = np.ones(X.shape[1], dtype=bool)

    # Split train/valid for early stopping
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_label, y_label, test_size=0.2, stratify=y_label, random_state=42
    )

    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_valid = lgb.Dataset(X_valid, y_valid)

    # Train with tuned params and early stopping
    model = lgb.train(
        tuned_params[label],
        lgb_train,
        valid_sets=[lgb_valid],
        num_boost_round=2000,  # allow more rounds, early stopping will cut it
        callbacks=[lgb.early_stopping(stopping_rounds=100, verbose=False)]
    )

    models[label] = model
    joblib.dump(model, V9_MODELS / f"{label}.pkl")

# Save feature masks and tuned parameters
joblib.dump(feature_masks, V9_MODELS / "feature_masks.pkl")
joblib.dump(tuned_params, V9_MODELS / "tuned_params.pkl")
np.save(V9_MODELS / "label_mask.npy", label_mask)

print("✅ All models saved with early stopping.")



🚀 Training final model for NR-AR

🚀 Training final model for NR-AR-LBD

🚀 Training final model for NR-AhR

🚀 Training final model for NR-Aromatase

🚀 Training final model for NR-ER

🚀 Training final model for NR-ER-LBD

🚀 Training final model for NR-PPAR-gamma

🚀 Training final model for SR-ARE

🚀 Training final model for SR-ATAD5

🚀 Training final model for SR-HSE

🚀 Training final model for SR-MMP

🚀 Training final model for SR-p53
✅ All models saved with early stopping.
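
The saved feature names and feature masks only help at inference time if new descriptor tables are brought into the training column order first. The sketch below shows one way to do that with pandas; the column names, `saved_names`, `saved_mask`, and `new_desc` are hypothetical stand-ins for what would be loaded from `feature_names.txt` and `feature_masks.pkl`.

```python
import numpy as np
import pandas as pd

# Hypothetical saved feature order (in the pipeline: read from feature_names.txt)
saved_names = ["nAtom", "MW", "SLogP"]
saved_mask = np.ones(len(saved_names), dtype=bool)  # all-True, like feature_masks

# Descriptors for new compounds: different order, one column missing, one extra
new_desc = pd.DataFrame({"SLogP": [1.2, np.inf], "nAtom": [10, 22], "Extra": [0, 1]})

aligned = (
    new_desc.reindex(columns=saved_names)   # training order; missing columns become NaN
    .replace([np.inf, -np.inf], np.nan)     # same cleaning as in Setup
    .fillna(-1)                             # same sentinel value as in Setup
)
X_new = aligned.values.astype(np.float32)[:, saved_mask]
print(X_new)
```

Reindexing drops the unknown `Extra` column and fills the missing `MW` column with the sentinel, so the resulting array has the shape the saved models expect.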


## Step 4 – Threshold Optimization

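Threshold tuning moves several decision thresholds well below 0.5 (e.g., 0.1583 for NR-AR-LBD).

Lower thresholds make the models more sensitive to rare positives, trading some precision for higher recall. Note that the thresholds are chosen on all labeled compounds for each endpoint, including those used to train the final models, so the F1 values printed below are optimistic.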

In [6]:
from sklearn.metrics import precision_recall_curve

new_thresholds = {}
# For each label, pick the threshold that maximizes F1 (rather than precision)

for i, label in enumerate(labels):
    mask = label_mask[:, i]
    X_label = X[mask]
    y_label = y_all[mask, i].astype(int)
    model = models[label]
    probs = model.predict(X_label)

    prec, rec, thr = precision_recall_curve(y_label, probs)
    f1_scores = (2 * prec[:-1] * rec[:-1]) / (prec[:-1] + rec[:-1] + 1e-8)
    best = np.argmax(f1_scores)
    threshold = float(thr[best])
    new_thresholds[label] = threshold
    print(f"{label}: Best F1={f1_scores[best]:.4f} at threshold={threshold:.4f}")

joblib.dump(new_thresholds, V9_MODELS / "thresholds.pkl")
print("✅ Thresholds saved.")


NR-AR: Best F1=0.6138 at threshold=0.4518
NR-AR-LBD: Best F1=0.7719 at threshold=0.1583
NR-AhR: Best F1=0.8753 at threshold=0.3931
NR-Aromatase: Best F1=0.8186 at threshold=0.2022
NR-ER: Best F1=0.6605 at threshold=0.1919
NR-ER-LBD: Best F1=0.7954 at threshold=0.2101
NR-PPAR-gamma: Best F1=0.6512 at threshold=0.1639
SR-ARE: Best F1=0.8452 at threshold=0.3585
SR-ATAD5: Best F1=0.7780 at threshold=0.2391
SR-HSE: Best F1=0.7294 at threshold=0.1889
SR-MMP: Best F1=0.9359 at threshold=0.4399
SR-p53: Best F1=0.7895 at threshold=0.2334
✅ Thresholds saved.
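
Because the search above sees the training rows, a less optimistic variant selects the threshold on one held-out split and reports F1 on another. The following is a minimal sketch of that idea using synthetic scores in place of model predictions; `best_f1_threshold` is a hypothetical helper, and the score distribution is an assumption for illustration only.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

def best_f1_threshold(y_true, scores):
    """Return the score threshold that maximizes F1 on the given data."""
    prec, rec, thr = precision_recall_curve(y_true, scores)
    f1 = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-8)
    return float(thr[np.argmax(f1)])

# Synthetic stand-in for held-out model scores (not Tox21 data)
rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.1).astype(int)
scores = np.clip(0.15 + 0.35 * y + rng.normal(0, 0.15, size=2000), 0, 1)

# Choose the threshold on one split, report F1 on the other
y_sel, y_rep, s_sel, s_rep = train_test_split(
    y, scores, test_size=0.5, stratify=y, random_state=42
)
threshold = best_f1_threshold(y_sel, s_sel)
f1_sel = f1_score(y_sel, (s_sel >= threshold).astype(int))
f1_rep = f1_score(y_rep, (s_rep >= threshold).astype(int))

print(f"threshold={threshold:.3f}")
print(f"F1 on selection split: {f1_sel:.3f}")
print(f"F1 on report split:    {f1_rep:.3f}")
```

In the pipeline, the validation part of the Step 3 split could play the role of the selection split, provided a further untouched test set is kept for reporting.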


## Step 5 – Evaluation

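1. AUC (Area Under ROC Curve)
All AUC values are above 0.90, and five endpoints exceed 0.97 (NR-AhR, NR-Aromatase, SR-ATAD5, SR-MMP, SR-p53).

SR-MMP is the highest at 0.9888.

The models therefore rank toxic above non-toxic compounds well on this data. The scores are computed on every labeled compound for each endpoint, including the 80% used to fit the final models, so they overstate performance on unseen compounds.

2. F1-Scores (balanced precision & recall)
Eight of the twelve endpoints are above 0.75, led by SR-MMP at 0.936 and NR-AhR at 0.875.

Lower end: NR-AR (0.614) and NR-PPAR-gamma (0.651). These endpoints are typically harder, plausibly because of lower prevalence in the dataset or overlapping chemical patterns.

Every endpoint scores higher here than in the Step 2 cross-validation at a fixed 0.5 threshold (e.g., NR-ER goes from 0.374 to 0.660). Part of that gain comes from the tuned thresholds and part from scoring on training data, so the two numbers are not a like-for-like comparison.

3. Accuracy
Nine of the twelve endpoints are above 97%, which looks impressive, but accuracy is inflated on imbalanced datasets: a model that predicts "non-toxic" for every compound already scores close to the share of negatives.

That is why the F1-scores are more telling.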

In [7]:
# ===========================
# (V_9) Step 5 – Evaluation
# ===========================
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

eval_results = []
for i, label in enumerate(labels):
    mask = label_mask[:, i]
    X_label = X[mask]
    y_label = y_all[mask, i].astype(int)
    model = models[label]
    probs = model.predict(X_label)
    preds = (probs >= new_thresholds[label]).astype(int)
    auc = roc_auc_score(y_label, probs)
    acc = accuracy_score(y_label, preds)
    f1 = f1_score(y_label, preds)
    eval_results.append((label, auc, acc, f1, mask.sum()))

eval_df = pd.DataFrame(eval_results, columns=["label", "auc", "accuracy", "f1_score", "n_samples"])
print(eval_df)


            label       auc  accuracy  f1_score  n_samples
0           NR-AR  0.918112  0.974535  0.613779       7265
1       NR-AR-LBD  0.960607  0.984611  0.771930       6758
2          NR-AhR  0.980171  0.971599  0.875335       6549
3    NR-Aromatase  0.972503  0.981275  0.818636       5821
4           NR-ER  0.907614  0.905377  0.660487       6193
5       NR-ER-LBD  0.959410  0.979583  0.795389       6955
6   NR-PPAR-gamma  0.949760  0.979070  0.651163       6450
7          SR-ARE  0.968682  0.951132  0.845193       5832
8        SR-ATAD5  0.971756  0.984587  0.778004       7072
9          SR-HSE  0.959168  0.968455  0.729443       6467
10         SR-MMP  0.988830  0.980379  0.935883       5810
11         SR-p53  0.970532  0.973871  0.789536       6774
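
The accuracy caveat from the evaluation notes can be made concrete with a small worked example. It assumes a toy endpoint with 4% actives (an illustrative figure, not taken from the table) and a degenerate "model" that never predicts toxicity.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Assumed toy prevalence: 40 actives among 1000 compounds (4%)
y_true = np.array([1] * 40 + [0] * 960)
y_all_negative = np.zeros_like(y_true)  # always predicts "non-toxic"

acc = accuracy_score(y_true, y_all_negative)
f1 = f1_score(y_true, y_all_negative, zero_division=0)  # no positives predicted

print(f"accuracy={acc:.3f}, F1={f1:.3f}")  # accuracy=0.960, F1=0.000
```

A model with no ability to find actives still reaches 96% accuracy here, while its F1 is zero. Accuracy values in the high 90s are therefore only meaningful next to the per-label positive rate.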
