### Goal

In real cell line development (CLD) projects, late-stage outcomes (e.g. productivity,
quality, stability) are *not available* when early clone-selection decisions are made.

However, for **model training and validation**, we need to construct *ground-truth*
late-stage labels from historical (or simulated) data.

In this notebook, we:

- Load the early-stage feature table (v2)
- Compute late-stage outcomes (mean titer and mean aggregation over passages 24-30) directly from the raw SQLite DB
- Attach those late-stage labels to the early feature table by `clone_id`
- Save a **multi-target dataset** for downstream modeling

This dataset will be used by:
- Notebook 03b (multi-target ML models)
- Notebook 04b (predicted-late-based clone selection)

In [10]:
import sqlite3
import pandas as pd
from pathlib import Path

In [11]:
# Paths
DB_PATH = "../data/synthetic/raw/cld_2000clones.db"  # or cld.db
FEATURE_PATH = "../data/synthetic/processed/cld_features_with_label_v2.csv"
OUT_PATH = "../data/synthetic/processed/cld_features_with_labels_3targets_v2_24_30.csv"

print("DB:", DB_PATH)
print("Feature input:", FEATURE_PATH)
print("Output:", OUT_PATH)

DB: ../data/synthetic/raw/cld_2000clones.db
Feature input: ../data/synthetic/processed/cld_features_with_label_v2.csv
Output: ../data/synthetic/processed/cld_features_with_labels_3targets_v2_24_30.csv


In [12]:
features = pd.read_csv(FEATURE_PATH)
features.head()

Unnamed: 0,clone_id,titer_mean,titer_std,titer_min,titer_max,vcd_mean,vcd_std,vcd_min,vcd_max,viability_mean,...,qP_mean,qP_p10,titer_cv,vcd_cv,viability_cv,aggregation_cv,culture_mode_fed-batch,culture_mode_perfusion,ddpcr_cn,productivity_drop_pct
0,CLONE_0001,3.135307,0.15306,2.908198,3.361966,11158540.0,775266.9,10096740.0,12294130.0,94.340916,...,2.809783e-07,2.922963e-07,0.048818,0.069477,0.013108,0.047273,False,True,2.0,0.271237
1,CLONE_0002,1.089709,0.201227,0.881955,1.476979,14533580.0,569550.1,13359300.0,15132320.0,96.108006,...,7.497872e-08,8.253985e-08,0.184661,0.039189,0.019557,0.096506,True,False,3.0,0.52492
2,CLONE_0003,4.715356,0.202982,4.36331,4.991713,9132412.0,799242.1,7744497.0,10047470.0,93.691616,...,5.163319e-07,4.653539e-07,0.043047,0.087517,0.020809,0.061906,True,False,2.0,0.338851
3,CLONE_0004,0.729517,0.140272,0.541439,0.88702,15322590.0,1022267.0,13804880.0,16854200.0,97.318163,...,4.761053e-08,5.733401e-08,0.192281,0.066716,0.013987,0.02734,True,False,2.0,0.646568
4,CLONE_0005,2.480311,0.215895,2.122646,2.781607,11696200.0,1088390.0,9654663.0,13619380.0,95.337131,...,2.120613e-07,2.030646e-07,0.087044,0.093055,0.020933,0.146547,True,False,3.0,0.492373


In [13]:
print("Rows:", features.shape[0])
print("Columns:", features.shape[1])

Rows: 2000
Columns: 47


In [14]:
def fetch_late_labels(conn, clone_list, late_start=24, late_end=30):
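    """
    Compute late-stage mean titer and mean aggregation per clone.

    Averages are taken over passages late_start..late_end (inclusive).
    Returns a DataFrame with columns clone_id, late_mean_titer and
    late_mean_aggregation. Clones with no assay rows in the window are
    absent from the result; a clone with rows for only one assay type
    gets NULL (NaN) for the other.
    """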
    if len(clone_list) == 0:
        return pd.DataFrame(
            columns=["clone_id", "late_mean_titer", "late_mean_aggregation"]
        )

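    # One "?" per clone ID. SQLite caps the number of bound parameters per
    # statement (999 by default before 3.32.0, 32766 from 3.32.0 onward),
    # so very long clone lists may need to be queried in chunks.
    placeholders = ",".join(["?"] * len(clone_list))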

    query = f"""
    SELECT
        p.clone_id,
        AVG(CASE WHEN ar.assay_type = 'titer' THEN ar.value END) AS late_mean_titer,
        AVG(CASE WHEN ar.assay_type = 'aggregation' THEN ar.value END) AS late_mean_aggregation
    FROM assay_result ar
    JOIN passage p ON p.passage_id = ar.passage_id
    WHERE p.passage_number BETWEEN ? AND ?
      AND p.clone_id IN ({placeholders})
    GROUP BY p.clone_id
    """

    params = [late_start, late_end] + list(clone_list)
    return pd.read_sql_query(query, conn, params=params)
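
The query above binds one parameter per clone ID, so on SQLite builds with a low bound-parameter limit a long clone list could fail. The sketch below is one possible workaround, not part of the original notebook: `fetch_late_labels_chunked` and its `chunk_size` argument are hypothetical names, and the in-memory tables only mimic the `passage` and `assay_result` columns that the query references, with made-up values.

```python
import sqlite3
import pandas as pd

def fetch_late_labels_chunked(conn, clone_list, late_start=24, late_end=30, chunk_size=500):
    """Hypothetical chunked variant: same SQL, run once per batch of clone IDs."""
    columns = ["clone_id", "late_mean_titer", "late_mean_aggregation"]
    # De-duplicate while keeping order so no clone appears in two chunks
    clone_list = list(dict.fromkeys(clone_list))
    frames = []
    for start in range(0, len(clone_list), chunk_size):
        chunk = clone_list[start:start + chunk_size]
        placeholders = ",".join(["?"] * len(chunk))
        query = f"""
        SELECT
            p.clone_id,
            AVG(CASE WHEN ar.assay_type = 'titer' THEN ar.value END) AS late_mean_titer,
            AVG(CASE WHEN ar.assay_type = 'aggregation' THEN ar.value END) AS late_mean_aggregation
        FROM assay_result ar
        JOIN passage p ON p.passage_id = ar.passage_id
        WHERE p.passage_number BETWEEN ? AND ?
          AND p.clone_id IN ({placeholders})
        GROUP BY p.clone_id
        """
        part = pd.read_sql_query(query, conn, params=[late_start, late_end] + chunk)
        if not part.empty:
            frames.append(part)
    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)

# Toy in-memory database (assumed schema, made-up values)
toy = sqlite3.connect(":memory:")
toy.executescript("""
CREATE TABLE passage (passage_id INTEGER PRIMARY KEY, clone_id TEXT, passage_number INTEGER);
CREATE TABLE assay_result (passage_id INTEGER, assay_type TEXT, value REAL);
INSERT INTO passage VALUES (1, 'A', 10), (2, 'A', 24), (3, 'A', 30), (4, 'B', 25);
INSERT INTO assay_result VALUES
    (1, 'titer', 9.0),
    (2, 'titer', 2.0), (3, 'titer', 4.0),
    (2, 'aggregation', 5.0),
    (4, 'titer', 1.0);
""")
try:
    demo = fetch_late_labels_chunked(toy, ["A", "B", "C"], chunk_size=2)
finally:
    toy.close()
print(demo)
```

In this toy run, the passage-10 titer of clone A falls outside the window and is ignored, clone B has no aggregation rows in the window so that column is missing for it, and clone C has no rows at all and is absent from the result.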

In [15]:
conn = sqlite3.connect(DB_PATH)
try:
    late_labels = fetch_late_labels(
        conn,
        features["clone_id"].tolist(),
        late_start=24,
        late_end=30
    )
finally:
    # Close the connection even if the query raises
    conn.close()

late_labels.head()

Unnamed: 0,clone_id,late_mean_titer,late_mean_aggregation
0,CLONE_0001,2.289753,4.902667
1,CLONE_0002,0.581398,3.686371
2,CLONE_0003,3.146742,6.581963
3,CLONE_0004,0.314086,8.374103
4,CLONE_0005,1.350038,1.890133


In [16]:
dataset_3targets = features.merge(
    late_labels,
    on="clone_id",
    how="left"
)
dataset_3targets.head()

Unnamed: 0,clone_id,titer_mean,titer_std,titer_min,titer_max,vcd_mean,vcd_std,vcd_min,vcd_max,viability_mean,...,titer_cv,vcd_cv,viability_cv,aggregation_cv,culture_mode_fed-batch,culture_mode_perfusion,ddpcr_cn,productivity_drop_pct,late_mean_titer,late_mean_aggregation
0,CLONE_0001,3.135307,0.15306,2.908198,3.361966,11158540.0,775266.9,10096740.0,12294130.0,94.340916,...,0.048818,0.069477,0.013108,0.047273,False,True,2.0,0.271237,2.289753,4.902667
1,CLONE_0002,1.089709,0.201227,0.881955,1.476979,14533580.0,569550.1,13359300.0,15132320.0,96.108006,...,0.184661,0.039189,0.019557,0.096506,True,False,3.0,0.52492,0.581398,3.686371
2,CLONE_0003,4.715356,0.202982,4.36331,4.991713,9132412.0,799242.1,7744497.0,10047470.0,93.691616,...,0.043047,0.087517,0.020809,0.061906,True,False,2.0,0.338851,3.146742,6.581963
3,CLONE_0004,0.729517,0.140272,0.541439,0.88702,15322590.0,1022267.0,13804880.0,16854200.0,97.318163,...,0.192281,0.066716,0.013987,0.02734,True,False,2.0,0.646568,0.314086,8.374103
4,CLONE_0005,2.480311,0.215895,2.122646,2.781607,11696200.0,1088390.0,9654663.0,13619380.0,95.337131,...,0.087044,0.093055,0.020933,0.146547,True,False,3.0,0.492373,1.350038,1.890133


In [17]:
dataset_3targets[
    ["productivity_drop_pct", "late_mean_titer", "late_mean_aggregation"]
].isna().mean()

productivity_drop_pct    0.0
late_mean_titer          0.0
late_mean_aggregation    0.0
dtype: float64
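
A missing-rate of zero shows that every clone received labels, but it does not by itself rule out duplicated rows. As a complementary check, the sketch below uses the `validate` and `indicator` options of `DataFrame.merge` on toy stand-ins for `features` and `late_labels`. The toy frames and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for `features` and `late_labels` (made-up values)
toy_features = pd.DataFrame({"clone_id": ["A", "B", "C"], "titer_mean": [3.1, 1.0, 4.7]})
toy_labels = pd.DataFrame({"clone_id": ["A", "B"], "late_mean_titer": [2.3, 0.6]})

# validate="one_to_one" raises MergeError if clone_id repeats on either side;
# indicator=True adds a `_merge` column saying where each row was found.
merged = toy_features.merge(
    toy_labels,
    on="clone_id",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# With unique keys, a left merge keeps exactly one row per feature row
assert len(merged) == len(toy_features)

# Clones present in the feature table but missing from the late-stage labels
unmatched = merged.loc[merged["_merge"] == "left_only", "clone_id"].tolist()
print("Clones without late labels:", unmatched)
```

Applied to the real tables, an empty `unmatched` list would agree with the zero missing rates shown above.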

In [18]:
out_dir = Path(OUT_PATH).parent
out_dir.mkdir(parents=True, exist_ok=True)

dataset_3targets.to_csv(OUT_PATH, index=False)
print("Saved:", OUT_PATH)

Saved: ../data/synthetic/processed/cld_features_with_labels_3targets_v2_24_30.csv


### Output

We generated a **multi-target CLD dataset** with:

- Early-stage features (v2)
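- Stability target: `productivity_drop_pct` (already present in the v2 feature table)
- Productivity target: `late_mean_titer` (mean titer over passages 24-30)
- Quality target: `late_mean_aggregation` (mean aggregation over passages 24-30)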

This dataset enables:
- Multi-target regression (Notebook 03b)
- Predicted-late decision simulation (Notebook 04b)
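
As a rough illustration of how a downstream notebook might consume this file, the sketch below splits a table into a feature matrix and a three-column target matrix. It is an assumption about usage, not the code of Notebook 03b: the inline CSV is a tiny made-up stand-in for the saved dataset, and `TARGETS`, `X` and `Y` are illustrative names.

```python
import io
import pandas as pd

TARGETS = ["productivity_drop_pct", "late_mean_titer", "late_mean_aggregation"]

# Tiny stand-in for the saved CSV (made-up values, only two feature columns)
toy_csv = io.StringIO(
    "clone_id,titer_mean,vcd_mean,productivity_drop_pct,late_mean_titer,late_mean_aggregation\n"
    "CLONE_A,3.1,1.1e7,0.27,2.3,4.9\n"
    "CLONE_B,1.1,1.5e7,0.52,0.6,\n"
)
data = pd.read_csv(toy_csv)

# Keep only rows where all three targets are present
complete = data.dropna(subset=TARGETS)

# Features: everything except the identifier and the targets
X = complete.drop(columns=["clone_id"] + TARGETS)
# Targets: one column per outcome, suitable for multi-target regression
Y = complete[TARGETS]

print(X.shape, Y.shape)
```

Keeping the target names in a single list makes it harder for a label column to slip into the feature matrix by accident.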