# Protein–DNA ΔΔG: Neural Network (MLP) Walkthrough

This notebook builds a **Neural Network baseline** using scikit-learn's `MLPRegressor`.
It mirrors the Random Forest tutorial structure but adds NN-specific steps like scaling,
early stopping, and training/validation loss curves.

**Pipeline outline**
1. Load & audit data (`rawdat.csv` + `exp_data_all.csv`).
2. Merge on `SEQUENCE_ID`, align `LABEL_COL`.
3. EDA: missingness, duplicates, dtypes, quick stats.
4. Define features/target and standardize inputs.
5. Train/test split and **5-fold cross-validation** with a `Pipeline`.
6. Fit a **baseline MLPRegressor** with **early stopping**.
7. Evaluate (R², RMSE, MAE) on hold-out test set.
8. Inspect training curves, cross-validated predictions.
9. Model interpretation: **Permutation Importance** and **PDP**.
10. Quick **RandomizedSearchCV** for better hyperparameters.
11. Save artifacts (model, metrics, predictions, config).


## 0. Setup & Configuration

Adjust paths if your repo layout differs.


In [2]:
# --- Python stdlib ---
import os, math, json
from pathlib import Path

# --- Data stack ---
import numpy as np
import pandas as pd

# --- scikit-learn ---
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_val_predict, RandomizedSearchCV, learning_curve
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.inspection import permutation_importance, PartialDependenceDisplay

# --- Utils ---
import joblib
import matplotlib.pyplot as plt

# Reproducibility
RANDOM_STATE = 42

# -------- Project paths (Path objects only) --------
PROJECT_ROOT = Path.cwd()          # or Path('.'), but keep as Path
INPUTS_DIR   = PROJECT_ROOT
OUTPUT_DIR   = PROJECT_ROOT / "ML_NN_CV"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# -------- Column identifiers --------
SEQUENCE_ID = "sequence"    # Join key shared by both files (repeats across rows)
LABEL_COL   = "bind_avg"    # ΔΔG label

# -------- Input files (Path objects) --------
FEATURE_FILES  = [INPUTS_DIR / "rawdat.csv"]        # list in case you add more later
REFERENCE_FILE = INPUTS_DIR / "exp_data_all.csv"

print(f"Inputs dir: {INPUTS_DIR}")
print(f"Outputs dir: {OUTPUT_DIR}")
print(f"Feature file(s): {FEATURE_FILES}")
print(f"Reference file: {REFERENCE_FILE}")


Inputs dir: /Users/nakku/Desktop/ML-Protein-DNA-Binding-Affinity/ML-Protein-DNA-Binding-Affinity/tutorial notebooks
Outputs dir: /Users/nakku/Desktop/ML-Protein-DNA-Binding-Affinity/ML-Protein-DNA-Binding-Affinity/tutorial notebooks/ML_NN_CV
Feature file(s): [PosixPath('/Users/nakku/Desktop/ML-Protein-DNA-Binding-Affinity/ML-Protein-DNA-Binding-Affinity/tutorial notebooks/rawdat.csv')]
Reference file: /Users/nakku/Desktop/ML-Protein-DNA-Binding-Affinity/ML-Protein-DNA-Binding-Affinity/tutorial notebooks/exp_data_all.csv

## 1. Load & Merge Data

Read the feature table and the reference labels, then inner-join them on `SEQUENCE_ID`.


In [3]:
# Load feature data
features = pd.read_csv(FEATURE_FILES[0])
# Strip the "MycMax_" prefix so the sequence IDs can match those in the reference file
features[SEQUENCE_ID] = features[SEQUENCE_ID].str.replace("MycMax_", "", regex=False)

# Load reference labels
labels = pd.read_csv(REFERENCE_FILE)

# Merge
df = pd.merge(features, labels, on=SEQUENCE_ID, how="inner")
df = df.dropna(subset=[LABEL_COL])
print("Data shape:", df.shape)
df.head()


Data shape: (68040, 13)


Unnamed: 0,sequence,run,VDWAALS,EEL,EGB,ESURF,HB Energy,Hydrophobic Energy,Pi-Pi Energy,Delta_Entropy,bind_avg,binding_type,improving
0,CAGGGCTGGGTCCACCTCATGGCCTTTGTTCTGGAA,9,-236.997,-1869.66,1823.216,-35.292,-2.590101,-156.445725,-4.282747,-24.750849,0.166339,1,0
1,CAGGGCTGGGTCCACCTCATGGCCTTTGTTCTGGAA,9,-218.62,-1850.331,1807.831,-32.521,-2.977171,-142.709472,-7.240534,-25.235404,0.166339,1,0
2,CAGGGCTGGGTCCACCTCATGGCCTTTGTTCTGGAA,9,-232.611,-1878.075,1834.181,-34.17,-3.105868,-145.088977,-8.856276,-25.12494,0.166339,1,0
3,CAGGGCTGGGTCCACCTCATGGCCTTTGTTCTGGAA,9,-203.677,-1870.595,1823.641,-32.402,-3.414769,-150.961716,-5.33867,-23.079573,0.166339,1,0
4,CAGGGCTGGGTCCACCTCATGGCCTTTGTTCTGGAA,9,-212.279,-1864.73,1820.462,-31.858,-3.571942,-146.583284,-7.171679,-22.812241,0.166339,1,0
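
An inner merge silently drops rows whose key has no match on the other side. The following is a minimal sketch of one way to audit that, using pandas' `validate` and `indicator` options on a pair of toy frames. The values are made up for illustration; only the `sequence` and `bind_avg` column names come from this notebook.

```python
import pandas as pd

# Toy stand-ins for the feature and label tables (illustrative values only).
features_toy = pd.DataFrame({
    "sequence": ["AAA", "AAA", "CCC", "GGG"],
    "VDWAALS": [-200.1, -198.4, -210.0, -190.5],
})
labels_toy = pd.DataFrame({
    "sequence": ["AAA", "CCC", "TTT"],
    "bind_avg": [0.17, -0.35, 1.20],
})

# validate="many_to_one" raises MergeError if any sequence has more than one label row.
# indicator=True adds a `_merge` column recording which side each row came from.
audit = pd.merge(
    features_toy, labels_toy,
    on="sequence", how="outer",
    validate="many_to_one", indicator=True,
)

merge_counts = audit["_merge"].value_counts().to_dict()
print(merge_counts)

# Keys present on only one side are the ones how="inner" would drop.
unmatched = sorted(audit.loc[audit["_merge"] != "both", "sequence"])
print("Unmatched sequences:", unmatched)
```

Running the same check on the real tables before switching back to `how="inner"` shows whether any sequences are lost in the join.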


## 2. Quick Data Audit

We check:
- dtypes and info
- missing values per column
- duplicate sequences (leakage risk)
- quick numeric stats


In [5]:
print("\nDataFrame info:")
df.info()

print("\nMissing values per column (top 20):")
missing = df.isna().sum().sort_values(ascending=False)
display(missing.head(20))

dup_count = df.duplicated(subset=[SEQUENCE_ID]).sum()
print(f"\nDuplicate {SEQUENCE_ID} rows:", dup_count)

print("\nDescriptive stats (numeric):")
display(df.describe().T.head(15))



DataFrame info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68040 entries, 0 to 68039
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   sequence            68040 non-null  object 
 1   run                 68040 non-null  int64  
 2   VDWAALS             68040 non-null  float64
 3   EEL                 68040 non-null  float64
 4   EGB                 68040 non-null  float64
 5   ESURF               68040 non-null  float64
 6   HB Energy           68040 non-null  float64
 7   Hydrophobic Energy  68040 non-null  float64
 8   Pi-Pi Energy        68040 non-null  float64
 9   Delta_Entropy       68040 non-null  float64
 10  bind_avg            68040 non-null  float64
 11  binding_type        68040 non-null  int64  
 12  improving           68040 non-null  int64  
dtypes: float64(9), int64(3), object(1)
memory usage: 7.3+ MB

Missing values per column (top 20):


sequence              0
run                   0
VDWAALS               0
EEL                   0
EGB                   0
ESURF                 0
HB Energy             0
Hydrophobic Energy    0
Pi-Pi Energy          0
Delta_Entropy         0
bind_avg              0
binding_type          0
improving             0
dtype: int64


Duplicate sequence rows: 67998

Descriptive stats (numeric):


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
run,68040.0,10.5,5.766324,1.0,5.75,10.5,15.25,20.0
VDWAALS,68040.0,-201.378166,20.130228,-270.351,-215.101,-201.824,-188.088,-105.281
EEL,68040.0,-1898.996636,38.450855,-2076.822,-1924.672,-1897.88,-1873.026,-1758.701
EGB,68040.0,1850.679788,35.684319,1722.168,1826.616,1849.566,1874.488,2013.174
ESURF,68040.0,-31.474066,2.393914,-43.027,-33.082,-31.535,-29.954,-18.292
HB Energy,68040.0,-8.175591,6.065805,-30.470945,-13.585766,-4.378422,-3.275576,-0.324696
Hydrophobic Energy,68040.0,-135.669642,13.0211,-181.225207,-144.756364,-136.267018,-127.247098,-65.594027
Pi-Pi Energy,68040.0,-3.219995,2.584144,-17.248618,-4.999256,-2.974733,-0.964479,0.0
Delta_Entropy,68040.0,-21.768632,2.156622,-33.004332,-23.182778,-21.798697,-20.392486,-9.803474
bind_avg,68040.0,0.182705,0.834584,-0.862667,-0.511587,-0.027842,0.614097,2.03566
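
The duplicate count above implies 68040 − 67998 = 42 distinct sequences, each appearing in many rows. A short sketch of how one might look at that structure more closely is given below, on a toy frame with the same layout. The values are synthetic, and whether the real label is constant within a sequence is something to verify, not something assumed here.

```python
import pandas as pd

# Toy frame mimicking the layout above: several rows per sequence (synthetic values).
toy = pd.DataFrame({
    "sequence": ["AAA"] * 3 + ["CCC"] * 2,
    "run": [1, 1, 2, 1, 2],
    "bind_avg": [0.17, 0.17, 0.17, -0.35, -0.35],
})

# duplicated() flags every row after the first occurrence of a key,
# so unique keys = total rows - duplicate count.
dup_count = int(toy.duplicated(subset=["sequence"]).sum())
n_unique = toy["sequence"].nunique()
print("Duplicates:", dup_count, "| unique sequences:", n_unique)

# Rows per sequence, and how many distinct label values each sequence carries.
per_seq = toy.groupby("sequence").agg(
    n_rows=("run", "size"),
    n_labels=("bind_avg", "nunique"),
)
print(per_seq)

label_is_per_sequence = bool((per_seq["n_labels"] == 1).all())
print("One label per sequence:", label_is_per_sequence)
```

If the label turns out to be constant per sequence, every row of a sequence shares one target value, which matters for how the data is split later.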


## 3. Define Features & Target

We keep only **numeric** features (NN needs numeric inputs).  
We **do not** scale the label here (ΔΔG is in its physical units).  
Feature scaling will be handled in a `Pipeline` via `StandardScaler`.
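
To illustrate why scaling matters here, the sketch below standardizes two synthetic columns whose scales loosely resemble `EEL` and `Pi-Pi Energy` in the descriptive stats above. The data is randomly generated, not taken from the dataset. It also shows that the scaler's mean and standard deviation come from the training rows only and are then reused on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic columns on very different scales (illustrative only).
X_train_toy = np.column_stack([
    rng.normal(-1900.0, 38.0, size=500),
    rng.normal(-3.0, 2.5, size=500),
])
X_test_toy = np.column_stack([
    rng.normal(-1900.0, 38.0, size=100),
    rng.normal(-3.0, 2.5, size=100),
])

scaler = StandardScaler().fit(X_train_toy)   # statistics come from training rows only
Z_train = scaler.transform(X_train_toy)
Z_test = scaler.transform(X_test_toy)        # reuses the training mean and std

print("Train means after scaling:", Z_train.mean(axis=0).round(6))
print("Train stds after scaling: ", Z_train.std(axis=0).round(6))
print("Test means after scaling: ", Z_test.mean(axis=0).round(3))
```

The test means are close to zero but not exactly zero, which is expected: the test rows are transformed with training statistics.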


In [6]:
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
feature_cols = [c for c in numeric_cols if c != LABEL_COL]

X = df[feature_cols].copy()
y = df[LABEL_COL].copy()

print("Number of features:", len(feature_cols))
print("X shape:", X.shape, "| y shape:", y.shape)

# Drop constant columns (harmless but can slow training)
const_cols = [c for c in feature_cols if X[c].nunique(dropna=False) <= 1]
if const_cols:
    print("Dropping constant columns:", const_cols)
    X = X.drop(columns=const_cols)
    feature_cols = [c for c in feature_cols if c not in const_cols]

print("Final feature count:", len(feature_cols))


Number of features: 11
X shape: (68040, 11) | y shape: (68040,)
Final feature count: 11


However, this selection also took the `binding_type` and `improving` columns as features. We don't want that: they are a form of label, not features to train the ML model on.

In [8]:
# Explicit denylist of non-features / potential leakage columns
LEAKAGE_COLS = {
    SEQUENCE_ID,   # identifier
    LABEL_COL,     # the true label ΔΔG
    'binding_type',
    'improving',
    # add others here if you discover more helper/label-like columns
}

# Identify numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Remove label/ID/leakage columns from features
feature_cols = [c for c in numeric_cols if c not in LEAKAGE_COLS]

# Build X/y
X = df[feature_cols].copy()
y = df[LABEL_COL].copy()

print("Dropped (non-feature) columns that were excluded:", sorted(LEAKAGE_COLS & set(df.columns)))
print("Number of features:", len(feature_cols))
print("X shape:", X.shape, "| y shape:", y.shape)

# Drop constant columns (they carry no information and only slow training)
const_cols = [c for c in feature_cols if X[c].nunique(dropna=False) <= 1]
if const_cols:
    print("Dropping constant columns:", const_cols)
    X = X.drop(columns=const_cols)
    feature_cols = [c for c in feature_cols if c not in const_cols]

print("Final feature count:", len(feature_cols))


Dropped (non-feature) columns that were excluded: ['bind_avg', 'binding_type', 'improving', 'sequence']
Number of features: 9
X shape: (68040, 9) | y shape: (68040,)
Final feature count: 9


### Leakage guard

Sanity checks to ensure no leakage columns sneak into `X`.


In [9]:
# Reuse the denylist defined above so the guard cannot drift out of sync with it
leaky_in_X = sorted(LEAKAGE_COLS & set(X.columns))
assert len(leaky_in_X) == 0, f"Leakage detected in features: {leaky_in_X}"
print("Leakage guard passed ✅  (no forbidden columns in X)")


Leakage guard passed ✅  (no forbidden columns in X)


## 4. Train/Test Split (80/20)

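We keep a **hold-out** test set for honest final evaluation.  
Note that the split below is row-wise: with only 42 unique sequences across 68,040 rows (see the duplicate count in Section 2), rows from the same sequence can land in both train and test.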


In [7]:
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, df[SEQUENCE_ID], test_size=0.2, random_state=RANDOM_STATE
)

print("Train:", X_train.shape, "Test:", X_test.shape)

# Save which sequences were in which split (traceability)
split_path = OUTPUT_DIR / 'split_indices.csv'
pd.DataFrame({SEQUENCE_ID: pd.concat([idx_train, idx_test]),
              'split': ['train']*len(idx_train) + ['test']*len(idx_test)}).to_csv(split_path, index=False)
print("Saved split indices ->", split_path)


Train: (54432, 11) Test: (13608, 11)
Saved split indices -> /Users/nakku/Desktop/ML-Protein-DNA-Binding-Affinity/ML-Protein-DNA-Binding-Affinity/tutorial notebooks/ML_NN_CV/split_indices.csv
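
Given the duplicate-sequence leakage risk flagged in the audit, one alternative worth considering is a group-aware split that holds out whole sequences. The following is a minimal sketch using scikit-learn's `GroupShuffleSplit` on synthetic data. It is not what the cell above does, and the toy names (`X_toy`, `y_toy`, `groups`) are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(42)

# Toy data: 10 sequences x 5 rows each (synthetic values).
groups = pd.Series(np.repeat([f"seq{i}" for i in range(10)], 5), name="sequence")
X_toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
y_toy = pd.Series(rng.normal(size=50), name="bind_avg")

# Hold out about 20% of the *sequences*, not 20% of the rows.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X_toy, y_toy, groups=groups))

X_tr, X_te = X_toy.iloc[train_idx], X_toy.iloc[test_idx]
y_tr, y_te = y_toy.iloc[train_idx], y_toy.iloc[test_idx]

# No sequence should appear on both sides of the split.
shared = set(groups.iloc[train_idx]) & set(groups.iloc[test_idx])
print("Train rows:", len(X_tr), "| Test rows:", len(X_te), "| shared sequences:", len(shared))
```

With this kind of split, test metrics measure generalization to unseen sequences, which is usually a harder and more honest target than generalization to unseen rows of already-seen sequences.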


## 5. Sanity Baseline (Mean Predictor)

A naive baseline that predicts the **training mean** of ΔΔG for every test sample.  
This sets the floor that any trained model must beat.


In [None]:
y_mean = np.full_like(y_test, fill_value=y_train.mean(), dtype=float)
baseline_r2  = r2_score(y_test, y_mean)
baseline_rmse = math.sqrt(mean_squared_error(y_test, y_mean))
baseline_mae  = mean_absolute_error(y_test, y_mean)

print(f"Baseline — R²: {baseline_r2:.4f} | RMSE: {baseline_rmse:.4f} | MAE: {baseline_mae:.4f}")
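
The same baseline can also be expressed with scikit-learn's `DummyRegressor`, which is convenient if the baseline should later go through the same cross-validation utilities as the real model. The sketch below runs on synthetic labels, not the notebook's data. Because R² = 1 − SS_res / SS_tot and a constant prediction can never have a smaller squared error than the test mean itself, the baseline R² is at most zero.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
y_tr = rng.normal(loc=0.2, scale=0.8, size=200)   # synthetic "training" labels
y_te = rng.normal(loc=0.2, scale=0.8, size=50)    # synthetic "test" labels

# DummyRegressor ignores the features, so a placeholder column is enough.
dummy = DummyRegressor(strategy="mean").fit(np.zeros((len(y_tr), 1)), y_tr)
pred = dummy.predict(np.zeros((len(y_te), 1)))

r2 = r2_score(y_te, pred)
rmse = float(np.sqrt(mean_squared_error(y_te, pred)))
print(f"Dummy baseline: R2={r2:.4f} | RMSE={rmse:.4f}")
```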


## 6. Neural Network Pipeline

We use a `Pipeline` so the scaler is refit on each training fold **inside CV**, which keeps validation-fold statistics out of the scaling (no leakage).  
Baseline hyperparameters (good starting point):  
- hidden_layer_sizes: (128, 64)  
- activation: relu  
- solver: adam  
- alpha (L2): 1e-4  
- batch_size: 64  
- learning_rate_init: 1e-3  
- early_stopping: True (holds out part of the training data as an internal validation split; 10% by default via `validation_fraction`)  
- max_iter: 500, n_iter_no_change: 20  
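
As a rough illustration of what early stopping exposes after fitting, the sketch below trains a much smaller network on synthetic data and reads back the per-epoch training loss (`loss_curve_`) and the per-epoch validation scores (`validation_scores_`, which are R² values for a regressor). The layer sizes and data are placeholders chosen for speed, not the settings listed above.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(400, 4))
y_toy = X_toy[:, 0] - 0.5 * X_toy[:, 1] + rng.normal(scale=0.1, size=400)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", MLPRegressor(
        hidden_layer_sizes=(16, 8),   # small placeholder architecture
        early_stopping=True,
        validation_fraction=0.1,      # scikit-learn default, written out for clarity
        n_iter_no_change=10,
        max_iter=200,
        random_state=42,
    )),
])
pipe.fit(X_toy, y_toy)

fitted_mlp = pipe.named_steps["mlp"]
print("Epochs run:", fitted_mlp.n_iter_)
print("Final training loss:", fitted_mlp.loss_curve_[-1])
print("Best validation R2:", fitted_mlp.best_validation_score_)
```

Training stops once the validation score has failed to improve for `n_iter_no_change` consecutive epochs, or when `max_iter` is reached, whichever comes first.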


In [None]:
mlp = MLPRegressor(
    hidden_layer_sizes=(128, 64),
    activation='relu',
    solver='adam',
    alpha=1e-4,
    batch_size=64,
    learning_rate_init=1e-3,
    early_stopping=True,           # enables validation set inside fit
    n_iter_no_change=20,
    max_iter=500,
    random_state=RANDOM_STATE
)

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler(with_mean=True, with_std=True)),
    ('mlp', mlp)
])

kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

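# Cross-validated metrics. cross_validate scores all three metrics from one set of
# 5 fits; three separate cross_val_score calls would refit the pipeline 15 times.
from sklearn.model_selection import cross_validate

cv_scores = cross_validate(
    pipeline, X, y, cv=kf, n_jobs=-1,
    scoring={'r2': 'r2', 'mse': 'neg_mean_squared_error', 'mae': 'neg_mean_absolute_error'},
)
cv_r2, cv_mse, cv_mae = (cv_scores[f'test_{m}'] for m in ('r2', 'mse', 'mae'))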

print("CV R²:", cv_r2, "\nMean R²:", cv_r2.mean())
print("\nCV RMSE:", np.sqrt(-cv_mse), "\nMean RMSE:", np.sqrt(-cv_mse).mean())
print("\nCV MAE:", -cv_mae, "\nMean MAE:", (-cv_mae).mean())
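
Since `KFold` shuffles individual rows, rows from one sequence can end up in both the training and validation folds. A possible variant, sketched below on synthetic data, swaps in `GroupKFold` so that each sequence stays within a single fold. This is an assumption about what a leakage-aware setup could look like, not the notebook's own configuration; the toy `groups` array stands in for `df[SEQUENCE_ID]`.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Toy data: 10 groups ("sequences") x 20 rows each (synthetic values).
n_groups, rows_per_group = 10, 20
groups = np.repeat(np.arange(n_groups), rows_per_group)
X_toy = rng.normal(size=(n_groups * rows_per_group, 4))
y_toy = X_toy[:, 0] + rng.normal(scale=0.1, size=len(X_toy))

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("mlp", MLPRegressor(hidden_layer_sizes=(16,), max_iter=300, random_state=42)),
])

gkf = GroupKFold(n_splits=5)

# Each fold's training and validation rows come from disjoint sets of groups.
folds_disjoint = all(
    not (set(groups[tr]) & set(groups[va]))
    for tr, va in gkf.split(X_toy, y_toy, groups=groups)
)
print("Folds group-disjoint:", folds_disjoint)

group_cv_r2 = cross_val_score(pipe, X_toy, y_toy, groups=groups, cv=gkf, scoring="r2")
print("Grouped CV R2 per fold:", group_cv_r2)
```

Comparing grouped and row-wise CV scores on the real data would show how much of the apparent performance depends on having seen the same sequence during training.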
