# Notebook 03 - ML modeling for CLD (Stability Prediction)

## Goal
Train machine learning models that predict stability using early CLD measurements.

We will build two model types:
1) **Regression**: predict 'productivity_drop_pct' (continuous)
2) **Classification**: predict 'stable vs unstable' using a threshold

## Why both?
- Regression provides a continuous risk estimate (useful for ranking)
- Classification maps directly to a decision rule (drop or keep)

## Key constraints
- Use only early-passage-derived features (already done in Notebook 02)
- Avoid leakage (do not use late measurements)

## 01) Import libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score, confusion_matrix
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

## 02) Load ML dataset created in Notebook 02 - cell 08

In [2]:
DATA_PATH = "../data/synthetic/processed/cld_features_with_label.csv"
dataset = pd.read_csv(DATA_PATH)

dataset.head()

Unnamed: 0,clone_id,titer_mean,titer_std,titer_min,titer_max,vcd_mean,vcd_std,vcd_min,vcd_max,viability_mean,...,viability_max,aggregation_mean,aggregation_std,aggregation_min,aggregation_max,titer_slope,vcd_slope,viability_slope,aggregation_slope,productivity_drop_pct
0,CLONE_0001,2.889683,0.089903,2.790021,3.032986,10415190.0,1111260.0,9043899.0,11816730.0,94.473634,...,96.98057,8.221051,0.18053,8.058368,8.463373,-0.015581,573349.825744,-0.199537,-0.00919,0.316314
1,CLONE_0002,0.877139,0.129996,0.722612,1.077169,13301590.0,1108757.0,11432470.0,14343100.0,95.923996,...,97.211486,7.387775,0.382441,6.937951,7.984501,-0.04802,312021.241928,0.592809,0.026287,0.130139
2,CLONE_0003,4.255553,0.14493,4.039223,4.379778,7941597.0,708776.1,7045903.0,8916481.0,92.98932,...,96.619908,2.21449,0.099077,2.05434,2.29502,-0.022388,84860.410342,0.598021,-0.034916,0.250773
3,CLONE_0004,0.601919,0.143381,0.470253,0.762237,14086460.0,392136.7,13531720.0,14620890.0,96.052966,...,96.989373,3.675444,0.374904,3.376207,4.290907,-0.040071,234162.956545,-0.167806,0.197182,0.518249
4,CLONE_0005,2.441076,0.223477,2.220144,2.802331,9891681.0,877544.7,8810959.0,10991710.0,94.191298,...,97.060231,3.544651,0.260907,3.404482,4.010245,-0.056184,299237.398288,0.6417,0.114477,0.245204


## 03) Prepare features (X) and target (y)

We drop clone_id and the target column from X, and keep clone_id separately for reference.

In [3]:
# Keep clone_id for later inspection
clone_id = dataset["clone_id"].copy()

# Target for regression
y_reg = dataset["productivity_drop_pct"].copy()

# Feature matrix
X = dataset.drop(columns=["clone_id", "productivity_drop_pct"])

# Simple NaN handling (should be minimal)
X = X.fillna(X.median(numeric_only=True))

print("X shape:", X.shape)
print("y_reg shape:", y_reg.shape)

X shape: (500, 20)
y_reg shape: (500,)
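
One way to back the "early features only" constraint with a check is to compare the feature columns against an explicit allow-list. The sketch below builds that list from the measurement names and aggregate suffixes visible in the `dataset.head()` output (4 measurements x 5 aggregates = 20 names, matching the 20 columns of X). The helper `unexpected_feature_columns` and the column `late_titer_mean` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Measurement names and aggregate suffixes as they appear in dataset.head().
MEASUREMENTS = ("titer", "vcd", "viability", "aggregation")
SUFFIXES = ("mean", "std", "min", "max", "slope")
TARGET = "productivity_drop_pct"


def unexpected_feature_columns(features: pd.DataFrame) -> list:
    """Return columns that are not of the form '<measurement>_<suffix>'."""
    allowed = {f"{m}_{s}" for m in MEASUREMENTS for s in SUFFIXES}
    return [col for col in features.columns if col not in allowed]


# Toy frame: one expected feature, the target, and a hypothetical
# late-passage column that should never reach the model.
toy = pd.DataFrame(
    {"titer_mean": [2.9], TARGET: [0.31], "late_titer_mean": [1.8]}
)
flagged = unexpected_feature_columns(toy)
print(flagged)  # ['productivity_drop_pct', 'late_titer_mean']
```

Calling the helper on X right after it is built and asserting that the result is empty would catch both a forgotten target column and any late-passage column added upstream.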


## 04) Train/test split

We hold out 20% of the clones for evaluation.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y_reg, test_size=0.2, random_state=42
)
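
The median fill in cell 03 is computed on the full dataset, so the held-out rows contribute to the fill values. With few missing values the effect is probably negligible, but a stricter variant fits the imputer on the training split only. The sketch below uses toy data and `SimpleImputer`; the toy column names and the injected NaNs are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
toy_X = pd.DataFrame(
    rng.normal(size=(50, 3)),
    columns=["titer_mean", "vcd_mean", "viability_mean"],
)
toy_X.iloc[::7, 0] = np.nan  # inject a few missing values
toy_y = pd.Series(rng.normal(size=50))

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

# Learn the medians from the training rows only, then reuse them for the test rows.
imputer = SimpleImputer(strategy="median")
X_tr_imp = pd.DataFrame(
    imputer.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_te_imp = pd.DataFrame(
    imputer.transform(X_te), columns=X_te.columns, index=X_te.index
)

print("Train medians used for filling:", imputer.statistics_)
```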

## 05) Regression (baseline): Linear Regression
A simple baseline model

In [5]:
lr = LinearRegression()
lr.fit(X_train, y_train)

pred = lr.predict(X_test)

mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)

print("Linear Regression MAE:", mae)
print("Linear Regression R2:", r2)

Linear Regression MAE: 0.08347228573753677
Linear Regression R2: -0.05358500054742321


## 06) Regression (stronger baseline): Random Forest Regressor

Non-linear model that can capture interactions between features.

In [6]:
rf = RandomForestRegressor(
    n_estimators=300,
    random_state=42
)
rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)

mae_rf = mean_absolute_error(y_test, pred_rf)
r2_rf = r2_score(y_test, pred_rf)

print("Random Forest MAE:", mae_rf)
print("Random Forest R2:", r2_rf)

Random Forest MAE: 0.08078163660766682
Random Forest R2: -0.05321632613006533


## 07) Classification label definition

We define stable vs unstable using a threshold on productivity drop.
Users can later change this threshold based on business / process requirements.

In [7]:
THRESHOLD = 0.15  # example: 15% drop cutoff

y_cls = (dataset["productivity_drop_pct"] <= THRESHOLD).astype(int)  # 1 = stable, 0 = unstable

print("Class balance (1=stable):")
print(y_cls.value_counts(normalize=True))

Class balance (1=stable):
productivity_drop_pct
0    0.66
1    0.34
Name: proportion, dtype: float64
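
Because the threshold is meant to be adjustable, it can help to see how the class balance shifts with it before committing to a value. The sketch below sweeps a few candidate thresholds over a synthetic drop column; the uniform toy distribution and the candidate values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
drop = pd.Series(rng.uniform(0.0, 0.6, size=500), name="productivity_drop_pct")

thresholds = [0.10, 0.15, 0.20, 0.25]

# Fraction of clones labelled stable (drop <= threshold) for each candidate.
stable_fraction = pd.Series(
    {t: float((drop <= t).mean()) for t in thresholds},
    name="stable_fraction",
)
print(stable_fraction)
```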


## 08) Classification split

In [8]:
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X, y_cls, test_size=0.2, random_state=42, stratify=y_cls
)
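
The `stratify=y_cls` argument keeps the stable/unstable ratio roughly equal in both splits. A minimal check on toy labels with the same 34% / 66% balance as above might look like this; the toy arrays are assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 500 toy labels: 170 stable (1) and 330 unstable (0), i.e. 34% stable.
y_demo = np.array([1] * 170 + [0] * 330)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Stable fraction in train:", y_tr.mean())
print("Stable fraction in test:", y_te.mean())
```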

## 09) Classification baseline: Logistic Regression

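A linear classifier with interpretable coefficients. Note that the features here span very different scales (vcd_mean is on the order of 1e7, titer_mean on the order of 1), which affects the default L2-regularized logistic regression when inputs are not standardized.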

In [9]:
logreg = LogisticRegression(max_iter=2000)
logreg.fit(X_train_c, y_train_c)

proba = logreg.predict_proba(X_test_c)[:, 1]
pred_c = (proba >= 0.5).astype(int)

auc = roc_auc_score(y_test_c, proba)
acc = accuracy_score(y_test_c, pred_c)
prec = precision_score(y_test_c, pred_c, zero_division=0)  # defined as 0 when no positives are predicted
rec = recall_score(y_test_c, pred_c)

print("Logistic Regression AUC:", auc)
print("Accuracy:", acc, "Precision:", prec, "Recall:", rec)
print("Confusion matrix:\n", confusion_matrix(y_test_c, pred_c))

Logistic Regression AUC: 0.4353832442067736
Accuracy: 0.66 Precision: 0.0 Recall: 0.0
Confusion matrix:
 [[66  0]
 [34  0]]
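
The confusion matrix shows that the model predicts "unstable" for every test clone. Two common things to try in this situation are standardizing the features and reweighting the classes. The sketch below shows one way to wire that up with a scikit-learn pipeline on synthetic data with deliberately mismatched feature scales. It illustrates the mechanics only; whether it helps on this dataset is an open question, given that the regression results suggest the features carry little signal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
small = rng.normal(size=n)                # feature on a scale of ~1
large = rng.normal(size=n) * 1e6 + 1e7    # feature on a scale of ~1e7
X_demo = np.column_stack([small, large])

# Synthetic label with real signal in both features (roughly one third positives).
signal = small + (large - 1e7) / 1e6 + rng.normal(scale=0.5, size=n)
y_demo = (signal > 0.6).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fit on the training split only, inside the pipeline.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=2000, class_weight="balanced"),
)
clf.fit(X_tr, y_tr)

proba_demo = clf.predict_proba(X_te)[:, 1]
pred_demo = (proba_demo >= 0.5).astype(int)
auc_demo = roc_auc_score(y_te, proba_demo)

print("AUC on synthetic data:", auc_demo)
print("Predicted classes:", np.unique(pred_demo))
```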




## 10) Classification: Random Forest

Non-linear classifier for potentially better performance.

In [10]:
rf_c = RandomForestClassifier(
    n_estimators=300,
    random_state=42
)
rf_c.fit(X_train_c, y_train_c)

proba_rf = rf_c.predict_proba(X_test_c)[:, 1]
pred_rf_c = (proba_rf >= 0.5).astype(int)

auc_rf = roc_auc_score(y_test_c, proba_rf)
acc_rf = accuracy_score(y_test_c, pred_rf_c)
prec_rf = precision_score(y_test_c, pred_rf_c)
rec_rf = recall_score(y_test_c, pred_rf_c)

print("Random Forest AUC:", auc_rf)
print("Accuracy:", acc_rf, "Precision:", prec_rf, "Recall:", rec_rf)
print("Confusion matrix:\n", confusion_matrix(y_test_c, pred_rf_c))

Random Forest AUC: 0.5405525846702317
Accuracy: 0.64 Precision: 0.375 Recall: 0.08823529411764706
Confusion matrix:
 [[61  5]
 [31  3]]
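
As a cross-check, the reported metrics can be recomputed by hand from the confusion matrix above. scikit-learn lays the matrix out as [[TN, FP], [FN, TP]] for labels 0 and 1, so the arithmetic is:

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions.
cm = np.array([[61, 5],
               [31, 3]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # 3 / 8
recall = tp / (tp + fn)          # 3 / 34
accuracy = (tp + tn) / cm.sum()  # 64 / 100

print("Precision:", precision)
print("Recall:", recall)
print("Accuracy:", accuracy)
```

Only 3 of the 34 stable clones in the test set are recovered at the 0.5 probability cutoff, which is why recall is so low even though accuracy sits near the majority-class rate.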


## 11) Feature importance (Random Forest)

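Impurity-based importances from the Random Forest regressor (rf) give an initial sense of which early metrics the model relies on. Because that model's test R2 is negative, treat the ranking as tentative.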

In [11]:
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
importances.head(15)

titer_mean           0.172658
titer_max            0.078698
aggregation_min      0.069832
titer_min            0.066027
aggregation_std      0.055799
aggregation_slope    0.051442
viability_slope      0.048667
titer_slope          0.047825
viability_max        0.044052
aggregation_mean     0.036657
vcd_slope            0.036524
aggregation_max      0.035669
viability_std        0.035130
titer_std            0.033836
viability_mean       0.033737
dtype: float64
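
Impurity-based importances are computed from the training data and can look meaningful even when a model does not generalize. A complementary view is permutation importance on held-out rows, sketched here on synthetic data where only the first feature matters. The data and hyperparameters are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 matters

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the held-out rows and measure how much the score drops.
result = permutation_importance(
    model, X_te, y_te, n_repeats=10, random_state=42
)
ranking = np.argsort(result.importances_mean)[::-1]

print("Mean importance per feature:", np.round(result.importances_mean, 3))
print("Features ranked by importance:", ranking)
```

Features whose permutation importance is close to zero on the test set are unlikely to be driving useful predictions, regardless of their impurity-based rank.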

## Summary

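We trained:
- Regression models predicting continuous stability drop ('productivity_drop_pct'); both scored a slightly negative test R2 (about -0.05), so neither beat a mean predictor on this dataset
- Classification models predicting stable vs unstable clones using a threshold; test AUC was 0.44 (Logistic Regression) and 0.54 (Random Forest), close to chance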

Next step (Notebook 04):
- Use the model predictions to simulate **early clone drop decision-making**
- Compare baseline vs ML-guided outcomes