### Goal

In real cell line development (CLD) projects, late-stage outcomes (e.g. productivity, quality, stability)
are *not available* when early clone-selection decisions are made.

For **model training and validation**, however, we need *ground-truth* late-stage labels,
which we construct from historical (or simulated) data.

In this notebook, we:

- Load the early-stage feature table (v2)
- Compute late-stage outcomes (mean titer and mean aggregation over passages 26 to 30) directly from the raw SQLite DB
- Attach those late labels to the early feature table
- Save a **multi-target dataset** for downstream modeling

This dataset will be used by:
- Notebook 03b (multi-target ML models)
- Notebook 04b (predicted-late-based clone selection)

In [1]:
import sqlite3
import pandas as pd
from pathlib import Path

In [2]:
# Paths
DB_PATH = "../data/synthetic/raw/cld_2000clones.db"  # or cld.db
FEATURE_PATH = "../data/synthetic/processed/cld_features_with_label_v2.csv"
OUT_PATH = "../data/synthetic/processed/cld_features_with_labels_3targets_v2.csv"

print("DB:", DB_PATH)
print("Feature input:", FEATURE_PATH)
print("Output:", OUT_PATH)

DB: ../data/synthetic/raw/cld_2000clones.db
Feature input: ../data/synthetic/processed/cld_features_with_label_v2.csv
Output: ../data/synthetic/processed/cld_features_with_labels_3targets_v2.csv


In [3]:
features = pd.read_csv(FEATURE_PATH)
features.head()

,clone_id,titer_mean,titer_std,titer_min,titer_max,vcd_mean,vcd_std,vcd_min,vcd_max,viability_mean,...,aggregation_curvature,qP_mean,qP_p10,titer_cv,vcd_cv,viability_cv,aggregation_cv,culture_mode_fed-batch,culture_mode_perfusion,productivity_drop_pct
0,CLONE_0001,2.665436,0.145412,2.464814,2.852368,10632900.0,1025472.0,9197038.0,12361790.0,93.637077,...,0.087727,2.506782e-07,2.001998e-07,0.054555,0.096443,0.008769,0.071216,True,False,0.229719
1,CLONE_0002,0.834691,0.191151,0.516513,1.171273,15128100.0,597750.6,14076260.0,16051270.0,96.283457,...,0.045579,5.517484e-08,5.065544e-08,0.229008,0.039513,0.013424,0.121568,True,False,0.356246
2,CLONE_0003,3.990484,0.175857,3.722491,4.270057,8411914.0,1150419.0,6047146.0,9506059.0,93.278459,...,-0.296325,4.743848e-07,4.301554e-07,0.044069,0.136761,0.016516,0.056506,True,False,0.281589
3,CLONE_0004,0.540821,0.154336,0.333873,0.749828,15112980.0,605067.5,14481560.0,16263420.0,96.187877,...,-0.173618,3.578521e-08,2.266833e-08,0.285374,0.040036,0.021262,0.036466,True,False,0.02616
4,CLONE_0005,2.16281,0.124723,1.928686,2.355251,11810710.0,732115.7,10921310.0,13285170.0,95.670482,...,0.01125,1.831228e-07,1.899366e-07,0.057667,0.061987,0.014832,0.406579,True,False,0.382269


In [4]:
print("Rows:", features.shape[0])
print("Columns:", features.shape[1])

Rows: 2000
Columns: 46


In [5]:
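def fetch_late_labels(conn, clone_list, late_start=26, late_end=30):
    """
    Compute late-stage mean titer and mean aggregation per clone.

    Averages are taken over all assay results from passages
    late_start..late_end (inclusive). Returns one row per clone_id;
    clones with no assay results in that window are absent from the result.
    """
    if len(clone_list) == 0:
        return pd.DataFrame(
            columns=["clone_id", "late_mean_titer", "late_mean_aggregation"]
        )

    # One "?" placeholder per clone keeps the query fully parameterized.
    placeholders = ",".join(["?"] * len(clone_list))

    # AVG ignores NULLs, so each CASE expression averages only its own assay type.
    query = f"""
    SELECT
        p.clone_id,
        AVG(CASE WHEN ar.assay_type = 'titer' THEN ar.value END) AS late_mean_titer,
        AVG(CASE WHEN ar.assay_type = 'aggregation' THEN ar.value END) AS late_mean_aggregation
    FROM assay_result ar
    JOIN passage p ON p.passage_id = ar.passage_id
    WHERE p.passage_number BETWEEN ? AND ?
      AND p.clone_id IN ({placeholders})
    GROUP BY p.clone_id
    """

    # Parameter order must match placeholder order: passage window first, then clone IDs.
    params = [late_start, late_end] + list(clone_list)
    return pd.read_sql_query(query, conn, params=params)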
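
The query above binds one parameter per clone, so 2000 clones means 2002 bound parameters in a single statement. SQLite caps the number of bound parameters per statement (`SQLITE_MAX_VARIABLE_NUMBER`), and older builds default to a limit of 999, so the same call may fail on a different machine. The sketch below shows one way to stay under the limit by querying in batches. The function name `fetch_late_labels_chunked`, the `chunk_size` parameter, and the toy in-memory tables are assumptions made for illustration; only the table and column names referenced by the original query are taken from the notebook.

```python
import sqlite3
import pandas as pd

LABEL_COLUMNS = ["clone_id", "late_mean_titer", "late_mean_aggregation"]


def fetch_late_labels_chunked(conn, clone_list, late_start=26, late_end=30, chunk_size=500):
    """Hypothetical batched variant of fetch_late_labels (illustrative only)."""
    # Drop duplicate IDs (keeping order) so each clone lands in exactly one batch.
    clone_list = list(dict.fromkeys(clone_list))
    frames = []
    for start in range(0, len(clone_list), chunk_size):
        chunk = clone_list[start:start + chunk_size]
        placeholders = ",".join(["?"] * len(chunk))
        query = f"""
        SELECT
            p.clone_id,
            AVG(CASE WHEN ar.assay_type = 'titer' THEN ar.value END) AS late_mean_titer,
            AVG(CASE WHEN ar.assay_type = 'aggregation' THEN ar.value END) AS late_mean_aggregation
        FROM assay_result ar
        JOIN passage p ON p.passage_id = ar.passage_id
        WHERE p.passage_number BETWEEN ? AND ?
          AND p.clone_id IN ({placeholders})
        GROUP BY p.clone_id
        """
        params = [late_start, late_end] + chunk
        frames.append(pd.read_sql_query(query, conn, params=params))
    if not frames:
        return pd.DataFrame(columns=LABEL_COLUMNS)
    return pd.concat(frames, ignore_index=True)


# Toy in-memory database (assumed minimal schema, not the real one).
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE passage (passage_id INTEGER PRIMARY KEY, clone_id TEXT, passage_number INTEGER);
CREATE TABLE assay_result (passage_id INTEGER, assay_type TEXT, value REAL);
""")
demo_conn.executemany(
    "INSERT INTO passage VALUES (?, ?, ?)",
    [(1, "A", 25), (2, "A", 26), (3, "A", 27), (4, "B", 26), (5, "C", 30)],
)
demo_conn.executemany(
    "INSERT INTO assay_result VALUES (?, ?, ?)",
    [
        (1, "titer", 10.0),  # passage 25: outside the late window, ignored
        (2, "titer", 2.0), (3, "titer", 4.0), (2, "aggregation", 1.0),
        (4, "titer", 5.0), (4, "aggregation", 2.0),
        (5, "titer", 1.0), (5, "aggregation", 3.0),
    ],
)

# chunk_size=2 forces two batches for three clones.
demo = fetch_late_labels_chunked(demo_conn, ["A", "B", "C"], chunk_size=2).set_index("clone_id")
demo_conn.close()
print(demo)
```

Because the query groups by `clone_id` and each clone appears in only one batch, concatenating the batch results gives the same rows as a single query would.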

In [6]:
conn = sqlite3.connect(DB_PATH)
try:
    late_labels = fetch_late_labels(
        conn,
        features["clone_id"].tolist(),
        late_start=26,
        late_end=30
    )
finally:
    # Close the connection even if the query fails
    conn.close()

late_labels.head()

,clone_id,late_mean_titer,late_mean_aggregation
0,CLONE_0001,2.053135,4.310553
1,CLONE_0002,0.537335,3.259003
2,CLONE_0003,2.866808,5.945068
3,CLONE_0004,0.526673,7.351199
4,CLONE_0005,1.336034,1.133822


In [7]:
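# Left join keeps every early-stage row; clones without late data get NaN labels.
# late_labels has one row per clone_id (GROUP BY), which validate confirms.
dataset_3targets = features.merge(
    late_labels,
    on="clone_id",
    how="left",
    validate="many_to_one"
)
dataset_3targets.head()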

,clone_id,titer_mean,titer_std,titer_min,titer_max,vcd_mean,vcd_std,vcd_min,vcd_max,viability_mean,...,qP_p10,titer_cv,vcd_cv,viability_cv,aggregation_cv,culture_mode_fed-batch,culture_mode_perfusion,productivity_drop_pct,late_mean_titer,late_mean_aggregation
0,CLONE_0001,2.665436,0.145412,2.464814,2.852368,10632900.0,1025472.0,9197038.0,12361790.0,93.637077,...,2.001998e-07,0.054555,0.096443,0.008769,0.071216,True,False,0.229719,2.053135,4.310553
1,CLONE_0002,0.834691,0.191151,0.516513,1.171273,15128100.0,597750.6,14076260.0,16051270.0,96.283457,...,5.065544e-08,0.229008,0.039513,0.013424,0.121568,True,False,0.356246,0.537335,3.259003
2,CLONE_0003,3.990484,0.175857,3.722491,4.270057,8411914.0,1150419.0,6047146.0,9506059.0,93.278459,...,4.301554e-07,0.044069,0.136761,0.016516,0.056506,True,False,0.281589,2.866808,5.945068
3,CLONE_0004,0.540821,0.154336,0.333873,0.749828,15112980.0,605067.5,14481560.0,16263420.0,96.187877,...,2.266833e-08,0.285374,0.040036,0.021262,0.036466,True,False,0.02616,0.526673,7.351199
4,CLONE_0005,2.16281,0.124723,1.928686,2.355251,11810710.0,732115.7,10921310.0,13285170.0,95.670482,...,1.899366e-07,0.057667,0.061987,0.014832,0.406579,True,False,0.382269,1.336034,1.133822


In [8]:
# Fraction of missing values per target column (0.0 means every clone has a label)
dataset_3targets[
    ["productivity_drop_pct", "late_mean_titer", "late_mean_aggregation"]
].isna().mean()

productivity_drop_pct    0.0
late_mean_titer          0.0
late_mean_aggregation    0.0
dtype: float64
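
The missing-rate check above only catches clones that ended up without a label. A left merge can also silently duplicate rows when the join key repeats. As a minimal sketch on toy data (the frames `toy_features` and `toy_late` are made up for illustration), pandas can report both problems at merge time through the `validate` and `indicator` arguments of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for the early feature table and the late labels (illustrative values).
toy_features = pd.DataFrame(
    {"clone_id": ["CLONE_0001", "CLONE_0002", "CLONE_0003"], "titer_mean": [2.7, 0.8, 4.0]}
)
toy_late = pd.DataFrame(
    {"clone_id": ["CLONE_0001", "CLONE_0003"], "late_mean_titer": [2.1, 2.9]}
)

merged = toy_features.merge(
    toy_late,
    on="clone_id",
    how="left",
    validate="one_to_one",  # raises MergeError if clone_id repeats on either side
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

# Clones present in the feature table but missing from the late labels.
unmatched = merged.loc[merged["_merge"] == "left_only", "clone_id"].tolist()
print("Clones without late labels:", unmatched)

merged = merged.drop(columns="_merge")
```

Whether `one_to_one` is the right constraint depends on the feature table having exactly one row per clone, which is assumed here.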

In [9]:
out_dir = Path(OUT_PATH).parent
out_dir.mkdir(parents=True, exist_ok=True)

dataset_3targets.to_csv(OUT_PATH, index=False)
print("Saved:", OUT_PATH)

Saved: ../data/synthetic/processed/cld_features_with_labels_3targets_v2.csv


### Output

We generated a **multi-target CLD dataset** with:

- Early-stage features (v2)
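- Stability target: `productivity_drop_pct` (already present in the v2 feature table)
- Productivity target: `late_mean_titer` (mean titer over passages 26 to 30)
- Quality target: `late_mean_aggregation` (mean aggregation over passages 26 to 30)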

This dataset enables:
- Multi-target regression (Notebook 03b)
- Predicted-late decision simulation (Notebook 04b)
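
For multi-target regression, the three target columns have to be separated from the early-stage features so that no label leaks into the model inputs. How Notebook 03b actually does this is not shown here. The following is only a sketch on a toy frame, with a hypothetical helper `split_features_targets`:

```python
import pandas as pd

TARGET_COLS = ["productivity_drop_pct", "late_mean_titer", "late_mean_aggregation"]


def split_features_targets(df, target_cols=TARGET_COLS, id_col="clone_id"):
    """Hypothetical helper: return (X, Y) with the ID and target columns removed from X."""
    X = df.drop(columns=[id_col, *target_cols])
    Y = df[list(target_cols)]
    return X, Y


# Toy frame with the same kinds of columns as the saved dataset (values are illustrative).
toy = pd.DataFrame(
    {
        "clone_id": ["CLONE_0001", "CLONE_0002"],
        "titer_mean": [2.7, 0.8],
        "vcd_mean": [1.06e7, 1.51e7],
        "productivity_drop_pct": [0.23, 0.36],
        "late_mean_titer": [2.05, 0.54],
        "late_mean_aggregation": [4.31, 3.26],
    }
)

X, Y = split_features_targets(toy)
print(X.columns.tolist())
print(Y.columns.tolist())
```

With the real file, the same helper would be applied to the frame returned by `pd.read_csv(OUT_PATH)`.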