# Notebook 02 - Feature Engineering for CLD (Early-passages only)

## Goal
Convert the raw CLD relational tables (passage-level assay measurements) into a **clone-level ML dataset**.

We will:
1. Load 'assay_result' joined with 'passage' from SQLite
2. Restrict to **early passages** (default: 1-5)
3. Create clone-level features (X), such as:
    - early mean titer / VCD / viability / aggregation
    - early slope (trend) of titer over passages
    - early variability (std) across passages
4. Join with the stability label (y) from 'stability_test'
5. Save an ML-ready feature table
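Step 3 above can be sketched as follows. This is a minimal illustration, not the notebook's final implementation: the function names `titer_slope` and `build_clone_features`, the output column names (for example `titer_early_mean`), and the tiny `demo` table are all assumptions made for the example. It assumes a long-format table with the columns `clone_id`, `assay_type`, `passage_number`, and `value`, already restricted to early passages.

```python
import numpy as np
import pandas as pd


def titer_slope(group: pd.DataFrame) -> float:
    """Least-squares slope of value vs. passage_number (NaN if fewer than 2 passages)."""
    if group["passage_number"].nunique() < 2:
        return np.nan
    slope, _intercept = np.polyfit(group["passage_number"], group["value"], deg=1)
    return float(slope)


def build_clone_features(long_df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate long-format early-passage rows into one row per clone (hypothetical helper)."""
    # Mean and sample standard deviation per clone and assay type, pivoted to wide columns.
    stats = (
        long_df.groupby(["clone_id", "assay_type"])["value"]
        .agg(["mean", "std"])
        .unstack("assay_type")
    )
    stats.columns = [f"{assay}_early_{stat}" for stat, assay in stats.columns]

    # Linear trend of titer across passages, one slope per clone.
    titer = long_df[long_df["assay_type"] == "titer"]
    slopes = (
        titer.groupby("clone_id")[["passage_number", "value"]]
        .apply(titer_slope)
        .rename("titer_early_slope")
    )
    return stats.join(slopes).reset_index()


# Made-up demo data: C1 rises by one unit per passage, C2 is flat.
demo = pd.DataFrame({
    "clone_id": ["C1"] * 3 + ["C2"] * 3,
    "assay_type": ["titer"] * 6,
    "passage_number": [1, 2, 3, 1, 2, 3],
    "value": [1.0, 2.0, 3.0, 3.0, 3.0, 3.0],
})
features = build_clone_features(demo)
print(features)
```

A least-squares slope is only one way to summarise a trend; with five passages per clone it is simple and easy to interpret, but it is sensitive to a single outlying measurement.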

## Why this matters
In real CLD, we must decide which clones to advance **using early data only**.
This notebook creates the dataset needed to train a model for early clone selection.

## 1) Imports and database connection

In [1]:
import sqlite3
import pandas as pd
import numpy as np

DB_PATH = "../data/synthetic/raw/cld.db"
conn = sqlite3.connect(DB_PATH)
print("Connected to:", DB_PATH)

Connected to: ../data/synthetic/raw/cld.db


## 2) Load assay results joined with passage metadata

We JOIN 'assay_result' with 'passage' so each measurement includes:
- clone_id
- passage_number
- phase (early/mid/late)

In [2]:
assay = pd.read_sql_query("""
SELECT 
  ar.assay_id,
  ar.assay_type,
  ar.value,
  ar.unit,
  ar.method,
  ar.batch_id,
  p.clone_id,
  p.passage_number,
  p.phase
FROM assay_result ar
JOIN passage p
  ON p.passage_id = ar.passage_id
""", conn)

assay.head()

| | assay_id | assay_type | value | unit | method | batch_id | clone_id | passage_number | phase |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ASSAY_CLONE_0001_P01_titer | titer | 2.889395 | g/L | ELISA | B_P01 | CLONE_0001 | 1 | early |
| 1 | ASSAY_CLONE_0001_P01_vcd | vcd | 9043899.0 | cells/mL | Vi-CELL | B_P01 | CLONE_0001 | 1 | early |
| 2 | ASSAY_CLONE_0001_P01_viability | viability | 93.77135 | % | Vi-CELL | B_P01 | CLONE_0001 | 1 | early |
| 3 | ASSAY_CLONE_0001_P01_aggregation | aggregation | 8.058368 | % | SEC-HPLC | B_P01 | CLONE_0001 | 1 | early |
| 4 | ASSAY_CLONE_0001_P02_titer | titer | 2.846536 | g/L | ELISA | B_P02 | CLONE_0001 | 2 | early |
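Before building features, it can help to sanity-check the joined table. The sketch below is an optional addition, not part of the original pipeline. The helper name `summarize_join` is hypothetical, and it assumes that each (`clone_id`, `passage_number`, `assay_type`) combination should appear at most once and that each assay type should use a single unit. The `demo` frame is made up to trigger both checks.

```python
import pandas as pd


def summarize_join(df: pd.DataFrame) -> dict:
    """Report basic integrity problems in a joined assay/passage table (hypothetical helper)."""
    key = ["clone_id", "passage_number", "assay_type"]
    # Number of distinct units reported for each assay type.
    units_per_type = df.groupby("assay_type")["unit"].nunique()
    return {
        "n_rows": len(df),
        # Rows where any part of the key is missing.
        "n_missing_keys": int(df[key].isna().any(axis=1).sum()),
        # Rows that repeat an earlier (clone, passage, assay type) combination.
        "n_duplicate_keys": int(df.duplicated(subset=key).sum()),
        # Assay types reported in more than one unit.
        "mixed_unit_types": sorted(units_per_type[units_per_type > 1].index),
    }


# Made-up demo data: one duplicated titer measurement, reported in a different unit.
demo = pd.DataFrame({
    "clone_id": ["C1", "C1", "C1", "C2"],
    "passage_number": [1, 1, 1, 1],
    "assay_type": ["titer", "vcd", "titer", "titer"],
    "unit": ["g/L", "cells/mL", "mg/L", "g/L"],
})
report = summarize_join(demo)
print(report)
```

In the notebook, the same helper could be called as `summarize_join(assay)`; a clean table would report zero missing and duplicate keys and an empty `mixed_unit_types` list.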


## 3) Restrict to early passages

We build features using early passages only.
Default window: passages 1-5.

This is critical to avoid data leakage and to mimic real CLD screening, where measurements from later passages are not yet available when clones are selected.

In [3]:
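# Inclusive passage window used for feature engineering.
EARLY_START = 1
EARLY_END = 5

# Keep only measurements inside the window; .copy() avoids modifying a view of `assay`.
assay_early = assay[assay["passage_number"].between(EARLY_START, EARLY_END)].copy()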

print("Rows in assay (all):", len(assay))
print("Rows in assay (early):", len(assay_early))
assay_early.head()

Rows in assay (all): 60000
Rows in assay (early): 10000


| | assay_id | assay_type | value | unit | method | batch_id | clone_id | passage_number | phase |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ASSAY_CLONE_0001_P01_titer | titer | 2.889395 | g/L | ELISA | B_P01 | CLONE_0001 | 1 | early |
| 1 | ASSAY_CLONE_0001_P01_vcd | vcd | 9043899.0 | cells/mL | Vi-CELL | B_P01 | CLONE_0001 | 1 | early |
| 2 | ASSAY_CLONE_0001_P01_viability | viability | 93.77135 | % | Vi-CELL | B_P01 | CLONE_0001 | 1 | early |
| 3 | ASSAY_CLONE_0001_P01_aggregation | aggregation | 8.058368 | % | SEC-HPLC | B_P01 | CLONE_0001 | 1 | early |
| 4 | ASSAY_CLONE_0001_P02_titer | titer | 2.846536 | g/L | ELISA | B_P02 | CLONE_0001 | 2 | early |
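Because slopes and standard deviations depend on how many passages each clone contributes, one possible follow-up is to confirm that the filtered table contains no rows outside the window and that every clone has a measurement at each early passage. The sketch below is an assumption about how such a check might look; `check_early_window` is a hypothetical helper and the `demo` data is invented.

```python
import pandas as pd


def check_early_window(early_df: pd.DataFrame, start: int, end: int) -> pd.DataFrame:
    """Guard against leakage and report passage coverage per clone and assay type."""
    # Leakage guard: every row must fall inside the inclusive [start, end] window.
    outside = ~early_df["passage_number"].between(start, end)
    if outside.any():
        raise ValueError(f"{int(outside.sum())} rows fall outside passages {start}-{end}")

    # Coverage: count distinct passages observed for each clone and assay type.
    expected = end - start + 1
    coverage = (
        early_df.groupby(["clone_id", "assay_type"])["passage_number"]
        .nunique()
        .rename("n_passages")
        .reset_index()
    )
    coverage["complete"] = coverage["n_passages"] == expected
    return coverage


# Made-up demo data: C1 has all five early passages, C2 only three.
demo = pd.DataFrame({
    "clone_id": ["C1"] * 5 + ["C2"] * 3,
    "assay_type": ["titer"] * 8,
    "passage_number": [1, 2, 3, 4, 5, 1, 2, 3],
})
coverage = check_early_window(demo, start=1, end=5)
print(coverage)
```

In the notebook this could be called as `check_early_window(assay_early, EARLY_START, EARLY_END)`. Clones flagged as incomplete could be dropped or kept with a feature recording `n_passages`; which option is appropriate depends on the downstream model.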
