# Notebook 02 - Feature Engineering for CLD (early passages only)

## Goal
Convert the raw CLD relational tables (passage-level assay measurements) into a **clone-level ML dataset**.

We will:
1. Load 'assay_result' joined with 'passage' from SQLite
2. Restrict to **early passages** (default: 1-5)
3. Create clone-level features (X), such as:
    - early mean titer / VCD / viability / aggregation
    - early slope (trend) of titer over passages
    - early variability (std) across passages
4. Join with the stability label (y) from 'stability_test'
5. Save an ML-ready feature table

## Why this matters
In real CLD, we must decide which clones to advance **using early data only**.
This notebook creates the dataset needed to train a model for early clone selection.

## 1) Imports and database connection

In [1]:
import sqlite3
import pandas as pd
import numpy as np

DB_PATH = "../data/synthetic/raw/cld.db"
conn = sqlite3.connect(DB_PATH)
print("Connected to:", DB_PATH)

Connected to: ../data/synthetic/raw/cld.db


## 2) Load assay results joined with passage metadata

We JOIN 'assay_result' with 'passage' so each measurement includes:
- clone_id
- passage_number
- phase (early/mid/late)

In [2]:
assay = pd.read_sql_query("""
SELECT 
  ar.assay_id,
  ar.assay_type,
  ar.value,
  ar.unit,
  ar.method,
  ar.batch_id,
  p.clone_id,
  p.passage_number,
  p.phase
FROM assay_result ar
JOIN passage p
  ON p.passage_id = ar.passage_id
""", conn)

assay.head()

,assay_id,assay_type,value,unit,method,batch_id,clone_id,passage_number,phase
0,ASSAY_CLONE_0001_P01_titer,titer,2.889395,g/L,ELISA,B_P01,CLONE_0001,1,early
1,ASSAY_CLONE_0001_P01_vcd,vcd,9043899.0,cells/mL,Vi-CELL,B_P01,CLONE_0001,1,early
2,ASSAY_CLONE_0001_P01_viability,viability,93.77135,%,Vi-CELL,B_P01,CLONE_0001,1,early
3,ASSAY_CLONE_0001_P01_aggregation,aggregation,8.058368,%,SEC-HPLC,B_P01,CLONE_0001,1,early
4,ASSAY_CLONE_0001_P02_titer,titer,2.846536,g/L,ELISA,B_P02,CLONE_0001,2,early
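
An inner JOIN silently drops any measurement whose `passage_id` has no matching row in `passage`. The sketch below shows one way to check for such orphans on a toy in-memory database. The table definitions are hypothetical and carry only the columns needed for the join; the real `cld.db` schema may differ.

```python
import sqlite3
import pandas as pd

# Toy in-memory database (hypothetical schema, not the real cld.db).
# Measurement A3 points at passage P9, which has no row in `passage`.
toy = sqlite3.connect(":memory:")
toy.executescript("""
CREATE TABLE passage (passage_id TEXT PRIMARY KEY, clone_id TEXT, passage_number INTEGER);
CREATE TABLE assay_result (assay_id TEXT PRIMARY KEY, passage_id TEXT, assay_type TEXT, value REAL);
INSERT INTO passage VALUES ('P1', 'CLONE_A', 1), ('P2', 'CLONE_A', 2);
INSERT INTO assay_result VALUES
  ('A1', 'P1', 'titer', 2.9),
  ('A2', 'P2', 'titer', 2.8),
  ('A3', 'P9', 'titer', 3.1);
""")

# Same join shape as the notebook query: only matched measurements survive
joined = pd.read_sql_query("""
SELECT ar.assay_id, p.clone_id, p.passage_number
FROM assay_result ar
JOIN passage p ON p.passage_id = ar.passage_id
""", toy)

# LEFT JOIN keeps every measurement; unmatched ones have NULL passage columns
orphans = pd.read_sql_query("""
SELECT ar.assay_id
FROM assay_result ar
LEFT JOIN passage p ON p.passage_id = ar.passage_id
WHERE p.passage_id IS NULL
""", toy)

print(len(joined), "joined rows;", len(orphans), "orphan measurement(s)")
toy.close()
```

If the orphan query returns rows on the real database, those measurements are missing from `assay` and from every feature derived from it.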


## 3) Restrict to early passages

We build features using early passages only.
Default window: passages 1-5.

This is critical to avoid data leakage (later passages would not be available at decision time) and to mimic real CLD screening.

In [3]:
EARLY_START = 1
EARLY_END = 5

assay_early = assay[(assay["passage_number"] >= EARLY_START) & (assay["passage_number"] <= EARLY_END)].copy()

print("Rows in assay (all):", len(assay))
print("Rows in assay (early):", len(assay_early))
assay_early.head()

Rows in assay (all): 60000
Rows in assay (early): 10000


,assay_id,assay_type,value,unit,method,batch_id,clone_id,passage_number,phase
0,ASSAY_CLONE_0001_P01_titer,titer,2.889395,g/L,ELISA,B_P01,CLONE_0001,1,early
1,ASSAY_CLONE_0001_P01_vcd,vcd,9043899.0,cells/mL,Vi-CELL,B_P01,CLONE_0001,1,early
2,ASSAY_CLONE_0001_P01_viability,viability,93.77135,%,Vi-CELL,B_P01,CLONE_0001,1,early
3,ASSAY_CLONE_0001_P01_aggregation,aggregation,8.058368,%,SEC-HPLC,B_P01,CLONE_0001,1,early
4,ASSAY_CLONE_0001_P02_titer,titer,2.846536,g/L,ELISA,B_P02,CLONE_0001,2,early
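
Because the early window is what protects against leakage, it can help to wrap the filter in a small function that also verifies its own output. The following is a minimal sketch on toy data; `select_early` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

def select_early(df, start=1, end=5):
    """Return a copy of the rows whose passage_number lies in [start, end]."""
    # Series.between is inclusive on both ends by default
    out = df.loc[df["passage_number"].between(start, end)].copy()
    # Leakage guard: nothing outside the window may survive the filter
    if not out.empty:
        assert out["passage_number"].min() >= start
        assert out["passage_number"].max() <= end
    return out

# Toy long-format data: passages 6 and 30 must be excluded
toy = pd.DataFrame({
    "clone_id": ["C1", "C1", "C1", "C1"],
    "passage_number": [1, 5, 6, 30],
    "value": [2.9, 2.8, 2.7, 1.5],
})
early = select_early(toy)
print(early["passage_number"].tolist())
```

Keeping the window bounds as function arguments makes it easy to rerun the whole feature pipeline with a different early window.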


## 4) Pivot early assay data into a wide table (one row per clone per passage)

Raw data is "long format" (one row per assay measurement).
For many feature calculations, it is convenient to pivot into "wide format", with one column per assay type:
- titer, vcd, viability, aggregation

In [4]:
early_wide = assay_early.pivot_table(
    index=["clone_id", "passage_number"],
    columns="assay_type",
    values="value",
    aggfunc="mean"
).reset_index()

early_wide.head()

assay_type,clone_id,passage_number,aggregation,titer,vcd,viability
0,CLONE_0001,1,8.058368,2.889395,9043899.0,93.77135
1,CLONE_0001,2,8.356911,2.846536,10343200.0,96.98057
2,CLONE_0001,3,8.463373,3.032986,9698222.0,93.35755
3,CLONE_0001,4,8.071459,2.889477,11816730.0,93.989493
4,CLONE_0001,5,8.155143,2.790021,11173880.0,94.269205
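
The long-to-wide reshaping can be illustrated on a toy table. This sketch also counts duplicate (clone, passage, assay) rows, since `aggfunc="mean"` would otherwise average them without warning. The values are made up for illustration.

```python
import pandas as pd

# Toy long-format data: one row per (clone, passage, assay type)
long_df = pd.DataFrame({
    "clone_id": ["C1", "C1", "C1", "C1"],
    "passage_number": [1, 1, 2, 2],
    "assay_type": ["titer", "vcd", "titer", "vcd"],
    "value": [2.9, 9.0e6, 2.8, 1.0e7],
})

# Replicate measurements would be silently averaged by the pivot below
key = ["clone_id", "passage_number", "assay_type"]
n_duplicates = int(long_df.duplicated(subset=key).sum())

wide = long_df.pivot_table(
    index=["clone_id", "passage_number"],
    columns="assay_type",
    values="value",
    aggfunc="mean",
).reset_index()
wide.columns.name = None  # drop the leftover "assay_type" axis label

print(n_duplicates)
print(wide)
```

Clearing `columns.name` is optional; it only removes the `assay_type` label that otherwise appears above the index in the displayed table.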


## 5) Compute clone-level early summary features

For each clone, we aggregate early passage values into summary features:
- mean
- std (variability)
- min/max

These capture early productivity, growth, health, and quality signals.

In [5]:
metrics = ["titer", "vcd", "viability", "aggregation"]

agg_dict = {}
for m in metrics:
    agg_dict[m] = ["mean", "std", "min", "max"]

summary = early_wide.groupby("clone_id")[metrics].agg(agg_dict)

# Flatten multi-index column names, e.g., titer_mean, vcd_std, ...
summary.columns = [f"{col[0]}_{col[1]}" for col in summary.columns]
summary = summary.reset_index()

summary.head()

,clone_id,titer_mean,titer_std,titer_min,titer_max,vcd_mean,vcd_std,vcd_min,vcd_max,viability_mean,viability_std,viability_min,viability_max,aggregation_mean,aggregation_std,aggregation_min,aggregation_max
0,CLONE_0001,2.889683,0.089903,2.790021,3.032986,10415190.0,1111260.0,9043899.0,11816730.0,94.473634,1.440465,93.35755,96.98057,8.221051,0.18053,8.058368,8.463373
1,CLONE_0002,0.877139,0.129996,0.722612,1.077169,13301590.0,1108757.0,11432470.0,14343100.0,95.923996,1.118012,94.737835,97.211486,7.387775,0.382441,6.937951,7.984501
2,CLONE_0003,4.255553,0.14493,4.039223,4.379778,7941597.0,708776.1,7045903.0,8916481.0,92.98932,2.199671,90.625211,96.619908,2.21449,0.099077,2.05434,2.29502
3,CLONE_0004,0.601919,0.143381,0.470253,0.762237,14086460.0,392136.7,13531720.0,14620890.0,96.052966,0.848271,95.014635,96.989373,3.675444,0.374904,3.376207,4.290907
4,CLONE_0005,2.441076,0.223477,2.220144,2.802331,9891681.0,877544.7,8810959.0,10991710.0,94.191298,2.334033,91.008648,97.060231,3.544651,0.260907,3.404482,4.010245
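
As a sanity check, the summary for CLONE_0001 can be recomputed by hand from the five early titer values shown in the wide table. Note that pandas `std` uses the sample standard deviation (`ddof=1`) by default, whereas `np.std` defaults to the population version (`ddof=0`); with only five passages the difference is noticeable. The inputs below are the rounded values as displayed, so agreement is expected only up to display precision.

```python
import numpy as np
import pandas as pd

# Early titer values for CLONE_0001, passages 1-5, copied from the wide table
titer = pd.Series([2.889395, 2.846536, 3.032986, 2.889477, 2.790021])

mean_titer = titer.mean()
std_sample = titer.std()                    # pandas default: ddof=1
std_population = np.std(titer.to_numpy())   # NumPy default: ddof=0

print(round(mean_titer, 6), round(std_sample, 6), round(std_population, 6))
```

The first two numbers should agree with `titer_mean` and `titer_std` in the summary table; the third is smaller because it divides by N rather than N - 1.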


## 6) Compute early slope features (trend over passages)

In CLD, trend can matter:
- A clone with early titer decreasing quickly may be less stable.
- A clone with improving viability may be adapting well.

We compute the slope of each metric vs passage_number using a simple linear fit.

In [6]:
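def slope(x, y):
    """Return the least-squares slope of y ~ a*x + b, or NaN if fewer than 2 valid points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Drop passages with a missing measurement; np.polyfit cannot handle NaN
    ok = np.isfinite(x) & np.isfinite(y)
    if ok.sum() < 2:
        return np.nan
    return np.polyfit(x[ok], y[ok], 1)[0]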

slope_rows = []
for clone_id, df in early_wide.groupby("clone_id"):
    x = df["passage_number"].values
    row = {"clone_id": clone_id}
    for m in metrics:
        if m in df.columns:
            y = df[m].values
            row[f"{m}_slope"] = slope(x, y)
        else:
            row[f"{m}_slope"] = np.nan
    slope_rows.append(row)

slopes = pd.DataFrame(slope_rows)
slopes.head()

,clone_id,titer_slope,vcd_slope,viability_slope,aggregation_slope
0,CLONE_0001,-0.015581,573349.825744,-0.199537,-0.00919
1,CLONE_0002,-0.04802,312021.241928,0.592809,0.026287
2,CLONE_0003,-0.022388,84860.410342,0.598021,-0.034916
3,CLONE_0004,-0.040071,234162.956545,-0.167806,0.197182
4,CLONE_0005,-0.056184,299237.398288,0.6417,0.114477
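
For a degree-1 fit, the least-squares slope has a closed form: the covariance of passage number and metric divided by the variance of passage number. The sketch below checks that this agrees with `np.polyfit` for the CLONE_0001 titer values shown in the wide table (rounded as displayed).

```python
import numpy as np

# Passage numbers and early titer values for CLONE_0001 from the wide table
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.889395, 2.846536, 3.032986, 2.889477, 2.790021])

# Closed-form slope: sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
dx = x - x.mean()
slope_closed_form = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)

# Same quantity from the linear fit used in the notebook
slope_polyfit = np.polyfit(x, y, 1)[0]

print(round(slope_closed_form, 6), round(slope_polyfit, 6))
```

Both values should match `titer_slope` for CLONE_0001 up to display rounding. The slope is in metric units per passage (g/L per passage for titer), so slopes of different metrics are not directly comparable without scaling.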
