
# Autonomous Activity: Classical ML Pipeline on Embryo Timelapse

**Dataset:** Six time-lapse `.tif` stacks — **2 controls** and **4 mutants**  
**Biological types:**
- **control**: `Control1`, `Control2`
- **mutantA**: `Mutant1`, `Mutant3`
- **mutantB**: `Mutant2`, `Mutant4`

📁 **Download:** [Google Drive Link](https://drive.google.com/drive/folders/1_qxqm-v5yCrme3pAW2rjyOOXIeQDuV54?usp=drive_link)

You will build a complete classical ML pipeline using **per-frame image features**.

## Learning goals
1. **Regression** — predict developmental time (frame index).  
2. **Classification** — (a) 6-class embryo ID, (b) 3-class biological type.  
3. **Clustering** — explore structures and compare with biological types.  
4. **Dimensionality Reduction** — PCA, t‑SNE, UMAP; plot per-embryo **trajectories**.  
5. **Cross‑Validation** — use GroupKFold to avoid leakage across embryos.  
6. **Bias–Variance** — analyze learning and validation curves.

> **No leakage rule:** never mix frames of the **same embryo** across train and test folds.



## 0) Environment setup

**Task:** Install the required packages and verify imports.
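
A minimal sketch of an import check follows. The package list is an assumption (the assignment does not name the packages), so adjust it to whatever your environment actually needs.

```python
import importlib

# Assumed package list (import name -> pip name); not specified by the assignment.
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "sklearn": "scikit-learn",
    "matplotlib": "matplotlib",
    "tifffile": "tifffile",
    "umap": "umap-learn",
}

def check_imports(required=REQUIRED):
    """Return the pip names of packages that fail to import."""
    missing = []
    for module_name, pip_name in required.items():
        try:
            importlib.import_module(module_name)
        except Exception:  # broken installs can raise more than ImportError
            missing.append(pip_name)
    return missing

missing = check_imports()
if missing:
    print("Missing packages; try: pip install " + " ".join(missing))
else:
    print("Environment ready.")
```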





## 1) Download the 6 TIFF stacks

**Task:** Place your files under a local `data/` folder with the exact names below.  
If you are using Colab, you may download the shared folder with `gdown` instead. Expected files:

- `data/Control1.tif`
- `data/Control2.tif`
- `data/Mutant1.tif`   *(mutantA)*
- `data/Mutant2.tif`   *(mutantB)*
- `data/Mutant3.tif`   *(mutantA)*
- `data/Mutant4.tif`   *(mutantB)*
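
The file check can be sketched as below. `missing_files` and `download_folder` are hypothetical helper names; the sketch assumes the files in the shared folder already carry the names listed above, and the `gdown` call needs network access, so it is defined but not run here.

```python
from pathlib import Path

EXPECTED = ["Control1.tif", "Control2.tif", "Mutant1.tif",
            "Mutant2.tif", "Mutant3.tif", "Mutant4.tif"]
FOLDER_URL = ("https://drive.google.com/drive/folders/"
              "1_qxqm-v5yCrme3pAW2rjyOOXIeQDuV54?usp=drive_link")

def missing_files(data_dir="data"):
    """List the expected TIFF names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED if not (data_dir / name).is_file()]

def download_folder(data_dir="data"):
    """Fetch the shared folder with gdown (needs `pip install gdown` and network)."""
    import gdown  # imported lazily so the file check works without gdown
    gdown.download_folder(url=FOLDER_URL, output=str(data_dir), quiet=False)

print("Missing:", missing_files())
```

Call `download_folder()` only when `missing_files()` is non-empty, then re-run the check.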



## 2) Load data and define labels

**Task:**
1. Confirm each `.tif` is a stack of shape `(T, H, W)`.
2. Build a tidy `DataFrame` with **one row per frame** and the columns:
   - `embryo_id` ∈ {Control1, Control2, Mutant1, Mutant2, Mutant3, Mutant4}
   - `type3` ∈ {`control`, `mutantA`, `mutantB`}
   - `frame` ∈ {0..T-1}
   - Per-frame features (computed in the next cell)
3. Print basic counts by `embryo_id` and `type3`.
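
The steps above can be sketched as follows. The four intensity statistics are an assumed, illustrative feature set, and the stacks are synthetic stand-ins so the example runs without the TIFFs; with real data one option is `tifffile.imread("data/Control1.tif")`.

```python
import numpy as np
import pandas as pd

TYPE3 = {"Control1": "control", "Control2": "control",
         "Mutant1": "mutantA", "Mutant3": "mutantA",
         "Mutant2": "mutantB", "Mutant4": "mutantB"}

def frame_features(frame):
    """Illustrative per-frame intensity statistics (an assumed feature set)."""
    f = frame.astype(np.float64)
    return {"mean": f.mean(), "std": f.std(),
            "p05": np.percentile(f, 5), "p95": np.percentile(f, 95)}

def stack_to_rows(embryo_id, stack):
    """Turn one (T, H, W) stack into a list of one-row-per-frame dicts."""
    if stack.ndim != 3:
        raise ValueError(f"{embryo_id}: expected (T, H, W), got {stack.shape}")
    return [{"embryo_id": embryo_id, "type3": TYPE3[embryo_id], "frame": t,
             **frame_features(stack[t])} for t in range(stack.shape[0])]

# Synthetic stand-in stacks: 5 frames of 16x16 pixels per embryo.
rng = np.random.default_rng(0)
stacks = {name: rng.integers(0, 255, size=(5, 16, 16), dtype=np.uint8)
          for name in TYPE3}

df = pd.DataFrame([row for name, stack in stacks.items()
                   for row in stack_to_rows(name, stack)])
print(df.groupby("embryo_id").size())
print(df.groupby("type3").size())
```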




## 2.1) Feature matrix and grouping (anti-leakage)

**Task:**
- Select `feature_cols` (use at least the ones built above).
- Build:  
  `X` (features), `y_reg = frame`, `y_id6 = embryo_id`, `y_type3 = type3`  
- Define `groups = embryo_id` to be used in GroupKFold.
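
A minimal sketch of the matrices and the grouping check, using a toy table with two placeholder feature columns (`mean` and `std` are assumptions; use the columns you actually computed). The loop confirms that no embryo appears on both sides of any GroupKFold split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

type_of = {"Control1": "control", "Control2": "control",
           "Mutant1": "mutantA", "Mutant3": "mutantA",
           "Mutant2": "mutantB", "Mutant4": "mutantB"}

# Toy table: 6 embryos x 10 frames with random placeholder features.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "embryo_id": np.repeat(list(type_of), 10),
    "frame": np.tile(np.arange(10), 6),
    "mean": rng.normal(size=60),
    "std": rng.normal(size=60),
})
df["type3"] = df["embryo_id"].map(type_of)

feature_cols = ["mean", "std"]
X = df[feature_cols].to_numpy()
y_reg = df["frame"].to_numpy()
y_id6 = df["embryo_id"].to_numpy()
y_type3 = df["type3"].to_numpy()
groups = df["embryo_id"].to_numpy()

n_splits = min(6, df["embryo_id"].nunique())
for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y_reg, groups):
    overlap = set(groups[train_idx]) & set(groups[test_idx])
    assert not overlap, f"leakage: {overlap}"
print("No embryo is shared between train and test in any fold.")
```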



## 3) Supervised Learning — Regression (frame index)

**Task:**
1. Implement 3 models using `Pipeline`:
   - `LinearRegression` (+ `StandardScaler`)
   - `Ridge(alpha)` — try `alpha ∈ {0.1, 1, 10}`
   - `KNeighborsRegressor` — try `n_neighbors ∈ {3,5,7,11}` with `MinMaxScaler`
2. Use **GroupKFold** with `n_splits = min(6, #embryos)` to evaluate **MAE** and **R²**.
3. Report mean metrics for each model and **compare**.
4. Plot **True vs Predicted** for one representative split.
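
One way to organize the comparison is sketched below on synthetic data (three made-up features that drift with the frame index). It is not a reference solution; the feature construction and model names in the dictionary are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GroupKFold, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in: 6 embryos x 30 frames.
rng = np.random.default_rng(0)
n_embryos, n_frames = 6, 30
frame = np.tile(np.arange(n_frames), n_embryos)
groups = np.repeat(np.arange(n_embryos), n_frames)
X = np.column_stack([
    0.5 * frame + rng.normal(0, 1.0, frame.size),
    np.sqrt(frame) + rng.normal(0, 0.3, frame.size),
    rng.normal(0, 1.0, frame.size),
])
y = frame.astype(float)

models = {"linear": Pipeline([("scale", StandardScaler()),
                              ("model", LinearRegression())])}
for alpha in (0.1, 1, 10):
    models[f"ridge(alpha={alpha})"] = Pipeline(
        [("scale", StandardScaler()), ("model", Ridge(alpha=alpha))])
for k in (3, 5, 7, 11):
    models[f"knn(k={k})"] = Pipeline(
        [("scale", MinMaxScaler()), ("model", KNeighborsRegressor(n_neighbors=k))])

cv = GroupKFold(n_splits=min(6, len(np.unique(groups))))
scoring = {"mae": "neg_mean_absolute_error", "r2": "r2"}
results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, groups=groups, cv=cv, scoring=scoring)
    # scikit-learn reports MAE negated so that higher is better; flip the sign.
    results[name] = {"MAE": -scores["test_mae"].mean(),
                     "R2": scores["test_r2"].mean()}

for name, m in results.items():
    print(f"{name:18s} MAE={m['MAE']:.2f}  R2={m['R2']:.2f}")
```

For the True vs Predicted plot, one option is to take a single `(train_idx, test_idx)` pair from `cv.split(X, y, groups)`, fit the chosen pipeline on the training indices, and scatter `y[test_idx]` against its predictions.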



## 4) Supervised Learning — Classification

Two tasks (do **both**):
- **ID-6:** predict `embryo_id` (6 classes)
- **Type-3:** predict `type3` (3 classes: control, mutantA, mutantB)

**Task:**
1. Implement 3 classifiers with `Pipeline`:
   - `LogisticRegression(max_iter=1000)`
   - `KNeighborsClassifier`
   - `SVC(kernel='rbf')`
2. Evaluate with **GroupKFold**. Metrics: **Accuracy** and **F1‑macro**.
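3. Compare **ID-6** vs **Type-3**. Expect Type-3 to be easier: when `groups = embryo_id`, the held-out embryo's ID never appears in the training folds, so ID-6 cannot be predicted correctly under GroupKFold.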
4. Add confusion matrices.
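
A minimal sketch of both tasks on synthetic 2D features follows. It pools out-of-fold predictions with `cross_val_predict` before scoring, which is a design choice of this sketch (averaging per-fold scores is an equally valid alternative). Because each fold holds out one whole embryo, the ID-6 scores come out as zero here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import GroupKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

type_of = {"Control1": "control", "Control2": "control",
           "Mutant1": "mutantA", "Mutant3": "mutantA",
           "Mutant2": "mutantB", "Mutant4": "mutantB"}

# Synthetic stand-in: one Gaussian blob per biological type.
rng = np.random.default_rng(0)
y_id6 = np.repeat(list(type_of), 30)
y_type3 = np.array([type_of[e] for e in y_id6])
centers = {"control": (0, 0), "mutantA": (3, 0), "mutantB": (0, 3)}
X = (np.array([centers[t] for t in y_type3], dtype=float)
     + rng.normal(0, 0.7, (y_id6.size, 2)))
groups = y_id6  # anti-leakage grouping by embryo

classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "svc_rbf": SVC(kernel="rbf"),
}
cv = GroupKFold(n_splits=6)
results = {}
for task, y in {"ID-6": y_id6, "Type-3": y_type3}.items():
    for name, clf in classifiers.items():
        pipe = Pipeline([("scale", StandardScaler()), ("clf", clf)])
        y_pred = cross_val_predict(pipe, X, y, groups=groups, cv=cv)
        results[(task, name)] = {
            "accuracy": accuracy_score(y, y_pred),
            "f1_macro": f1_score(y, y_pred, average="macro", zero_division=0),
            "confusion": confusion_matrix(y, y_pred, labels=np.unique(y)),
        }

for (task, name), m in results.items():
    print(f"{task:7s} {name:8s} acc={m['accuracy']:.2f} f1={m['f1_macro']:.2f}")
```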



## 5) Unsupervised Learning — Clustering

**Task (no solutions provided):**
1. Standardize `X` with `StandardScaler`.
2. Apply **KMeans** and **AgglomerativeClustering** with:
   - `n_clusters=3` (biological types)
   - (Optional) `n_clusters=6` (embryo IDs)
3. Compute **silhouette score** (unsupervised metric).
4. Build crosstabs: **cluster vs `type3`** and **cluster vs `embryo_id`** (for interpretation only).
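
The mechanics of these steps can be sketched on synthetic blobs as below. This only illustrates the API calls, not a worked answer; the data and the `summary` dictionary are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

type_of = {"Control1": "control", "Control2": "control",
           "Mutant1": "mutantA", "Mutant3": "mutantA",
           "Mutant2": "mutantB", "Mutant4": "mutantB"}

# Synthetic stand-in: one Gaussian blob per biological type.
rng = np.random.default_rng(0)
y_id6 = np.repeat(list(type_of), 30)
y_type3 = np.array([type_of[e] for e in y_id6])
centers = {"control": (0, 0), "mutantA": (4, 0), "mutantB": (0, 4)}
X = (np.array([centers[t] for t in y_type3], dtype=float)
     + rng.normal(0, 0.6, (y_id6.size, 2)))

X_std = StandardScaler().fit_transform(X)
models = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}
summary = {}
for name, model in models.items():
    labels = model.fit_predict(X_std)
    summary[name] = {
        "silhouette": silhouette_score(X_std, labels),
        # Labels are used only to interpret clusters, never to fit them.
        "vs_type3": pd.crosstab(labels, y_type3,
                                rownames=["cluster"], colnames=["type3"]),
        "vs_embryo": pd.crosstab(labels, y_id6,
                                 rownames=["cluster"], colnames=["embryo_id"]),
    }

for name, s in summary.items():
    print(f"{name}: silhouette={s['silhouette']:.2f}")
    print(s["vs_type3"], "\n")
```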



## 6) Dimensionality Reduction + Trajectories

**Task:**
1. Compute **PCA(2)**, **t‑SNE(2)**, and **UMAP(2)** on standardized `X`.
2. Scatter-plot colored by `type3` (optionally also by `embryo_id`).
3. **Trajectories:** for each `embryo_id`, sort by `frame` and **connect** points in 2D.
4. Comment on which embedding separates **types** more clearly and which trajectories look smoother or diverge.
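
A sketch of the embedding and trajectory steps on synthetic data follows. `embed_2d`, `trajectories`, and `plot_trajectories` are hypothetical helper names. The UMAP branch assumes the optional `umap-learn` package and is not executed here; only PCA and t-SNE are run.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

def embed_2d(X_std, method):
    """Return a 2D embedding of standardized features."""
    if method == "pca":
        return PCA(n_components=2).fit_transform(X_std)
    if method == "tsne":
        return TSNE(n_components=2, perplexity=20, init="pca",
                    random_state=0).fit_transform(X_std)
    if method == "umap":
        import umap  # optional dependency: pip install umap-learn
        return umap.UMAP(n_components=2, random_state=0).fit_transform(X_std)
    raise ValueError(f"unknown method: {method}")

def trajectories(Z, embryo_id, frame):
    """Map each embryo to its 2D points ordered by frame index."""
    out = {}
    for eid in np.unique(embryo_id):
        idx = np.flatnonzero(embryo_id == eid)
        out[eid] = Z[idx[np.argsort(frame[idx])]]
    return out

def plot_trajectories(ax, trajs):
    """Draw one connected line per embryo on a matplotlib Axes."""
    for eid, path in trajs.items():
        ax.plot(path[:, 0], path[:, 1], marker="o", markersize=3, label=eid)
    ax.legend()

# Synthetic stand-in: 6 embryos x 20 frames, 5 features drifting with time,
# rows shuffled to show that sorting by frame is needed.
rng = np.random.default_rng(0)
names = np.array(["Control1", "Control2", "Mutant1", "Mutant2", "Mutant3", "Mutant4"])
embryo_id = np.repeat(names, 20)
frame = np.tile(np.arange(20), 6)
X = 0.2 * frame[:, None] * rng.normal(1, 0.1, (1, 5)) + rng.normal(0, 0.3, (120, 5))
order = rng.permutation(120)
embryo_id, frame, X = embryo_id[order], frame[order], X[order]

X_std = StandardScaler().fit_transform(X)
embeddings = {m: embed_2d(X_std, m) for m in ("pca", "tsne")}
trajs = {m: trajectories(Z, embryo_id, frame) for m, Z in embeddings.items()}
```

With matplotlib available, `fig, ax = plt.subplots(); plot_trajectories(ax, trajs["pca"])` draws the connected paths; coloring by `type3` instead of `embryo_id` only requires a different label lookup.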



## 7) Cross‑Validation without leakage vs naïve split

**Task:**
1. Use **GroupKFold** (groups = `embryo_id`) with `LogisticRegression` for the **Type-3** task.
2. Compare against a naïve frame-wise `train_test_split`, which leaks information because frames from the same embryo end up in both the train and test sets.
3. Report **Accuracy** and **F1‑macro** for both; discuss overestimation from leakage.
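
The comparison can be sketched on synthetic data in which every embryo carries its own strong feature signature and the type signal is weak. That construction is an assumption chosen to make leakage visible: the naïve split can memorize embryo signatures, so it would typically score well above the grouped estimate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GroupKFold, cross_val_predict, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

type_of = {"Control1": "control", "Control2": "control",
           "Mutant1": "mutantA", "Mutant3": "mutantA",
           "Mutant2": "mutantB", "Mutant4": "mutantB"}
type_shift = {"control": 0.0, "mutantA": 0.5, "mutantB": -0.5}

# Synthetic stand-in: strong per-embryo signature, weak type signal on feature 0.
rng = np.random.default_rng(0)
n_frames, n_features = 30, 10
groups = np.repeat(list(type_of), n_frames)
y = np.array([type_of[e] for e in groups])
signature = {e: rng.normal(0, 1.0, n_features) for e in type_of}
X = (np.array([signature[e] for e in groups])
     + rng.normal(0, 0.3, (groups.size, n_features)))
X[:, 0] += np.array([type_shift[t] for t in y])

def make_model():
    return Pipeline([("scale", StandardScaler()),
                     ("clf", LogisticRegression(max_iter=1000))])

# Naive frame-wise split: frames of one embryo land on both sides.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
naive_pred = make_model().fit(X_tr, y_tr).predict(X_te)
naive_acc = accuracy_score(y_te, naive_pred)
naive_f1 = f1_score(y_te, naive_pred, average="macro", zero_division=0)

# Grouped evaluation: each fold holds out one whole embryo.
grouped_pred = cross_val_predict(make_model(), X, y, groups=groups,
                                 cv=GroupKFold(n_splits=6))
grouped_acc = accuracy_score(y, grouped_pred)
grouped_f1 = f1_score(y, grouped_pred, average="macro", zero_division=0)

print(f"naive   acc={naive_acc:.2f} f1={naive_f1:.2f}")
print(f"grouped acc={grouped_acc:.2f} f1={grouped_f1:.2f}")
print(f"accuracy overestimation: {naive_acc - grouped_acc:+.2f}")
```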



## 8) Bias–Variance and Overfitting

**Task (no solutions provided):**
1. Choose `DecisionTreeClassifier` or `KNeighborsClassifier` for the **Type-3** task.
2. **Learning curve:** use `learning_curve` with GroupKFold (pass `groups` so the folds respect embryos). Plot training size vs Accuracy (train vs validation).
3. **Validation curve:** use `validation_curve` to plot hyperparameter vs Accuracy (e.g., `max_depth` for a tree, or `n_neighbors` for KNN).
4. Conclude: identify **high variance** (train ≫ val) or **high bias** (both low) and propose tuning (regularization, features, data, hyperparameters).
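
The two curve computations can be sketched as below with a decision tree on synthetic data. This shows only the API mechanics, assuming made-up features; plotting and interpretation are left to you.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, learning_curve, validation_curve
from sklearn.tree import DecisionTreeClassifier

type_of = {"Control1": "control", "Control2": "control",
           "Mutant1": "mutantA", "Mutant3": "mutantA",
           "Mutant2": "mutantB", "Mutant4": "mutantB"}

# Synthetic stand-in: overlapping Gaussian blobs, one per type.
rng = np.random.default_rng(0)
groups = np.repeat(list(type_of), 30)
y = np.array([type_of[e] for e in groups])
centers = {"control": (0, 0), "mutantA": (2, 0), "mutantB": (0, 2)}
X = (np.array([centers[t] for t in y], dtype=float)
     + rng.normal(0, 1.0, (y.size, 2)))

cv = GroupKFold(n_splits=6)
tree = DecisionTreeClassifier(random_state=0)

# Learning curve: accuracy as a function of training-set size.
sizes, train_lc, val_lc = learning_curve(
    tree, X, y, groups=groups, cv=cv, scoring="accuracy",
    train_sizes=np.linspace(0.2, 1.0, 5), shuffle=True, random_state=0)

# Validation curve: accuracy as a function of tree depth.
depths = [1, 2, 3, 5, 8, 12]
train_vc, val_vc = validation_curve(
    tree, X, y, param_name="max_depth", param_range=depths,
    groups=groups, cv=cv, scoring="accuracy")

# A large train-minus-validation gap suggests high variance;
# low scores on both suggest high bias.
for n, tr, va in zip(sizes, train_lc.mean(axis=1), val_lc.mean(axis=1)):
    print(f"n_train={n:3d} train={tr:.2f} val={va:.2f} gap={tr - va:+.2f}")
for d, tr, va in zip(depths, train_vc.mean(axis=1), val_vc.mean(axis=1)):
    print(f"max_depth={d:2d} train={tr:.2f} val={va:.2f}")
```

Each returned score array has one row per training size (or hyperparameter value) and one column per fold, so `mean(axis=1)` and `std(axis=1)` give the line and error band to plot.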



## 9) Deliverables

1. **Regression (frame):** best model + settings. Report GroupKFold **MAE** and **R²**; include a True vs Predicted plot.  
2. **Classification:** compare **ID-6** vs **Type-3**. Which models worked best? Include **Accuracy** and **F1‑macro**.  
3. **Clustering:** does `k=3` reveal types? How does `k=6` behave? Report silhouette and discuss crosstabs.  
4. **Dimensionality reduction:** which embedding separates **types** best? Show trajectories by `embryo_id` and discuss.  
5. **Cross‑Validation:** quantify overestimation from naïve split vs GroupKFold.  
6. **Bias–Variance:** interpret curves and propose next steps (regularization, features, data, hyperparameters).  

