<a href="https://colab.research.google.com/github/ACobo98/Machine_Learning_Autonomus/blob/main/Autonomus_IntroMachineLearning_Embryos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Autonomous Activity: Classical ML Pipeline on Embryo Timelapse

**Dataset:** Six time-lapse `.tif` stacks — **2 controls** and **4 mutants**  
**Biological types:**
- **control**: `Control1`, `Control2`
- **mutantA**: `Mutant1`, `Mutant3`
- **mutantB**: `Mutant2`, `Mutant4`

📁 **Dataset:** [Google Drive Link](https://drive.google.com/drive/folders/1_qxqm-v5yCrme3pAW2rjyOOXIeQDuV54?usp=drive_link)

You will build a complete classical ML pipeline using **per-frame image features**.

## Learning goals
1. **Regression** — predict developmental time (frame index).  
2. **Classification** — (a) 6-class embryo ID, (b) 3-class biological type.  
3. **Clustering** — explore structures and compare with biological types.  
4. **Dimensionality Reduction** — PCA, t‑SNE, UMAP; plot per-embryo **trajectories**.  
5. **Cross‑Validation** — use GroupKFold to avoid leakage across embryos.  
6. **Bias–Variance** — analyze learning and validation curves.

> **No leakage rule:** never mix frames of the **same embryo** across train and test folds.
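
The rule can be checked mechanically. The sketch below uses a hypothetical toy layout (6 embryos with 5 frames each and a placeholder feature column, not the real stacks) to confirm that `GroupKFold` never places the same embryo on both sides of a fold:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy layout: 6 embryos x 5 frames (the real stacks are longer).
embryo_ids = np.repeat(
    ["Control1", "Control2", "Mutant1", "Mutant2", "Mutant3", "Mutant4"], 5
)
X_toy = np.arange(len(embryo_ids), dtype=float).reshape(-1, 1)  # placeholder features

gkf = GroupKFold(n_splits=6)
held_out = []
for train_idx, test_idx in gkf.split(X_toy, groups=embryo_ids):
    train_embryos = {str(e) for e in embryo_ids[train_idx]}
    test_embryos = {str(e) for e in embryo_ids[test_idx]}
    # The rule holds when no embryo contributes frames to both sides of a fold.
    assert train_embryos.isdisjoint(test_embryos)
    held_out.append(sorted(test_embryos))

print(held_out)  # one held-out embryo per fold
```

With six equally sized groups and six splits, every fold holds out exactly one embryo, which is the setting used in the rest of this notebook.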



## 0) Environment setup

**Task:** Install the required packages and verify imports.


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gdown
import os
import glob


from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV, learning_curve, validation_curve
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.linear_model import LinearRegression, Ridge, LogisticRegression
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_regression, make_classification, make_blobs, load_iris
from sklearn.model_selection import GroupKFold
from sklearn.neighbors import KNeighborsRegressor

from skimage.io import imread
from skimage.filters import sobel


## 1) Download the 6 TIFF stacks

**Task:** Place the six stacks in a local `data/` folder, using exactly the file names listed below.  
On Colab, the `gdown` cell after this list downloads the shared folder for you.

- `data/Control1.tif`
- `data/Control2.tif`
- `data/Mutant1.tif`   *(mutantA)*
- `data/Mutant2.tif`   *(mutantB)*
- `data/Mutant3.tif`   *(mutantA)*
- `data/Mutant4.tif`   *(mutantB)*


In [3]:
FOLDER_ID = "1_qxqm-v5yCrme3pAW2rjyOOXIeQDuV54"

os.makedirs("data", exist_ok=True)  # do not fail when the cell is re-run

!gdown --folder https://drive.google.com/drive/folders/$FOLDER_ID -O data


Retrieving folder contents
Processing file 1a85fmd7QWAqAXdl2qSBAElgb_kd_eQNc Control1.tif
Processing file 1oa4UjMY8gJ1QRMQ0cMyR9GyNbGU9yAAz Control2.tif
Processing file 1pqISdlzcrGhxhlRf8nbjzwe-zQe_VgGk Mutant1.tif
Processing file 1f7mik8NqKxvIFSdhbqYvcJ3JiSnjsjtE Mutant2.tif
Processing file 12HgJria1Pntxw1AFEn7QpUyd8BxzS1kQ Mutant3.tif
Processing file 1NJMLyDD4N9dSblwcPo5IUhFCo-ZCGABI Mutant4.tif
Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1a85fmd7QWAqAXdl2qSBAElgb_kd_eQNc
To: /content/data/Control1.tif
100% 18.1M/18.1M [00:00<00:00, 32.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=1oa4UjMY8gJ1QRMQ0cMyR9GyNbGU9yAAz
To: /content/data/Control2.tif
100% 18.1M/18.1M [00:00<00:00, 24.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1pqISdlzcrGhxhlRf8nbjzwe-zQe_VgGk
To: /content/data/Mutant1.tif
100% 18.1M/18.1M [00:00<00:00, 28.7MB/s]
Downloading...
From: 


## 2) Load data and define labels

**Task:**
1. Confirm each `.tif` is a stack of shape `(T, H, W)`.
2. Build a tidy `DataFrame` with **one row per frame** and the columns:
   - `embryo_id` ∈ {Control1, Control2, Mutant1, Mutant2, Mutant3, Mutant4}
   - `type3` ∈ {`control`, `mutantA`, `mutantB`}
   - `frame` ∈ {0..T-1}
   - Per-frame features (computed in the next cell)
3. Print basic counts by `embryo_id` and `type3`.


In [8]:
from skimage.filters import sobel

tif_files = sorted(glob.glob("data/*.tif"))
print("Names of TIF files: ",tif_files,"\n")
# Confirm that each file is a stack of shape (T, H, W)
for file_path in tif_files:
  # Load the stack as a NumPy array
  image_stack = imread(file_path)
  print(image_stack.shape)

#DataFrame Building:

# Extract summary features from a single frame
def extract_features(image_frame):
    """Return intensity and Sobel-edge statistics (mean, std) for one 2D frame."""
    intensity_mean = np.mean(image_frame)
    intensity_std = np.std(image_frame)

    edge_map = sobel(image_frame)
    edge_mean = np.mean(edge_map)
    edge_std = np.std(edge_map)

    return {
        'intensity_mean': intensity_mean,
        'intensity_std': intensity_std,
        'edge_mean': edge_mean,
        'edge_std': edge_std
    }

# Map each embryo ID to its biological type
type_map = {
    'Control1': 'control',
    'Control2': 'control',
    'Mutant1':  'mutantA',
    'Mutant3':  'mutantA',
    'Mutant2':  'mutantB',
    'Mutant4':  'mutantB'
}

all_frames_data=[]

for file_path in tif_files:
  # Embryo ID = file name without the .tif extension
  embryo_id = os.path.splitext(os.path.basename(file_path))[0]
  # Biological type of this embryo
  type3 = type_map[embryo_id]
  # Load the stack as a NumPy array of shape (T, H, W)
  image_stack = imread(file_path)
  # Number of frames in this stack
  num_frames = image_stack.shape[0]

  #Loop to go over each frame from each file
  for frame_idx in range(num_frames):

    #Create a register for each frame
    frame_info = {
        'embryo_id': embryo_id,
        'type3': type3,
        'frame': frame_idx
    }
    #current frame image
    current_frame_image = image_stack[frame_idx, :, :]
    features = extract_features(current_frame_image)
    full_record = {**frame_info, **features}
    #Save it in the list
    all_frames_data.append(full_record)

df = pd.DataFrame(all_frames_data)
print(df.shape, "\n")
print(df.head(), "\n")
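print(df.tail())
print()
# Basic counts by embryo and by biological type (Task 3)
print(df['embryo_id'].value_counts(), "\n")
print(df['type3'].value_counts())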


Names of TIF files:  ['data/Control1.tif', 'data/Control2.tif', 'data/Mutant1.tif', 'data/Mutant2.tif', 'data/Mutant3.tif', 'data/Mutant4.tif'] 

(450, 200, 200)
(450, 200, 200)
(450, 200, 200)
(450, 200, 200)
(450, 200, 200)
(450, 200, 200)
(2700, 7) 

  embryo_id    type3  frame  intensity_mean  intensity_std  edge_mean  \
0  Control1  control      0       30.013575      59.946415   0.013368   
1  Control1  control      1       29.247125      60.056361   0.011921   
2  Control1  control      2       29.896775      60.444294   0.012669   
3  Control1  control      3       29.118375      59.554882   0.012019   
4  Control1  control      4       29.568400      60.659023   0.012055   

   edge_std  
0  0.040751  
1  0.036303  
2  0.037463  
3  0.036022  
4  0.037060   

     embryo_id    type3  frame  intensity_mean  intensity_std  edge_mean  \
2695   Mutant4  mutantB    445       31.677625      53.771740   0.015572   
2696   Mutant4  mutantB    446       32.087900      54.037762   0.016


## 2.1) Feature matrix and grouping (anti-leakage)

**Task:**
- Select `feature_cols` (use at least the ones built above).
- Build:  
  `X` (features), `y_reg = frame`, `y_id6 = embryo_id`, `y_type3 = type3`  
- Define `groups = embryo_id` to be used in GroupKFold.


In [9]:
feature_cols = [
    'intensity_mean',
    'intensity_std',
    'edge_mean',
    'edge_std'
]

X = df[feature_cols]
y_reg = df['frame']
y_id6 = df['embryo_id']
y_type3 = df['type3']
groups = df['embryo_id']


## 3) Supervised Learning — Regression (frame index)

**Task:**
1. Implement 3 models using `Pipeline`:
   - `LinearRegression` (+ `StandardScaler`)
   - `Ridge(alpha)` — try `alpha ∈ {0.1, 1, 10}`
   - `KNeighborsRegressor` — try `n_neighbors ∈ {3,5,7,11}` with `MinMaxScaler`
2. Use **GroupKFold** with `n_splits = min(6, #embryos)` to evaluate **MAE** and **R²**.
3. Report mean metrics for each model and **compare**.
4. Plot **True vs Predicted** for one representative split.


In [10]:
n_embryos = len(np.unique(groups))
gkf = GroupKFold(n_splits=min(6, n_embryos))  # one fold per embryo: each split holds out all frames of one embryo

models_to_test = [
    ('LinearRegression', Pipeline([
        ('scaler', StandardScaler()),
        ('reg', LinearRegression())
    ])),
    ('Ridge(alpha=0.1)', Pipeline([
        ('scaler', StandardScaler()),
        ('reg', Ridge(alpha=0.1, random_state=42))
    ])),
    ('Ridge(alpha=1)', Pipeline([
        ('scaler', StandardScaler()),
        ('reg', Ridge(alpha=1, random_state=42))
    ])),
    ('Ridge(alpha=10)', Pipeline([
        ('scaler', StandardScaler()),
        ('reg', Ridge(alpha=10, random_state=42))
    ])),
    ('KNN(k=3)', Pipeline([
        ('scaler', MinMaxScaler()),
        ('reg', KNeighborsRegressor(n_neighbors=3))
    ])),
    ('KNN(k=5)', Pipeline([
        ('scaler', MinMaxScaler()),
        ('reg', KNeighborsRegressor(n_neighbors=5))
    ])),
    ('KNN(k=7)', Pipeline([
        ('scaler', MinMaxScaler()),
        ('reg', KNeighborsRegressor(n_neighbors=7))
    ])),
    ('KNN(k=11)', Pipeline([
        ('scaler', MinMaxScaler()),
        ('reg', KNeighborsRegressor(n_neighbors=11))
    ]))
]

results = []
for name, model_pipe in models_to_test:
    # R² per fold
    scores_r2 = cross_val_score(model_pipe, X, y_reg, cv=gkf, groups=groups, scoring='r2')

    # MAE per fold: the scorer returns the negated MAE, so the sign is flipped when averaging below
    scores_mae = cross_val_score(model_pipe, X, y_reg, cv=gkf, groups=groups, scoring='neg_mean_absolute_error')

    results.append({
        'Model': name,
        'R2 Mean': np.mean(scores_r2),
        'R2 Std': np.std(scores_r2),
        'MAE Mean': -np.mean(scores_mae)
    })

results_df = pd.DataFrame(results).sort_values(by='R2 Mean', ascending=False)
print(results_df)

              Model   R2 Mean    R2 Std    MAE Mean
3   Ridge(alpha=10)  0.122660  0.541946   96.791997
2    Ridge(alpha=1)  0.091916  0.577015   98.660671
1  Ridge(alpha=0.1)  0.088483  0.580875   98.865612
0  LinearRegression  0.088097  0.581308   98.888610
7         KNN(k=11) -0.093767  0.719463   96.613569
6          KNN(k=7) -0.143653  0.788802   99.171164
5          KNN(k=5) -0.173829  0.829295  100.563333
4          KNN(k=3) -0.225475  0.891688  103.243827
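
Task 4 (True vs Predicted for one representative split) is not covered by the cell above. The following is a minimal sketch of one way to do it, run on synthetic stand-in data so that it is self-contained: the arrays `X_demo`, `frames`, and `groups_demo` are assumptions, not the real feature table, and `Ridge(alpha=10)` is used only because it ranked first in the table above.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 6 embryos x 50 frames, 4 features that drift
# with time, plus an embryo-specific offset and per-frame noise.
rng = np.random.default_rng(0)
n_embryos, n_frames = 6, 50
frames = np.tile(np.arange(n_frames), n_embryos)
groups_demo = np.repeat(np.arange(n_embryos), n_frames)
X_demo = (
    frames[:, None] * np.array([0.05, -0.02, 0.01, 0.03])
    + rng.normal(0, 0.3, size=(n_embryos, 4))[groups_demo]
    + rng.normal(0, 0.2, size=(n_embryos * n_frames, 4))
)

model = Pipeline([("scaler", StandardScaler()), ("reg", Ridge(alpha=10))])

# Use the first GroupKFold split as the representative one: one embryo is held out.
splitter = GroupKFold(n_splits=n_embryos)
train_idx, test_idx = next(splitter.split(X_demo, frames, groups_demo))
model.fit(X_demo[train_idx], frames[train_idx])
y_pred = model.predict(X_demo[test_idx])

mae = mean_absolute_error(frames[test_idx], y_pred)
r2 = r2_score(frames[test_idx], y_pred)

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(frames[test_idx], y_pred, s=10, alpha=0.7)
lims = [0, n_frames - 1]
ax.plot(lims, lims, "k--", label="ideal (y = x)")  # perfect predictions lie on this line
ax.set_xlabel("True frame")
ax.set_ylabel("Predicted frame")
ax.set_title(f"Held-out embryo: MAE={mae:.1f}, R2={r2:.2f}")
ax.legend()
```

To apply this to the real data, replace the synthetic arrays with `X.to_numpy()`, `y_reg.to_numpy()`, and `groups.to_numpy()`. In a notebook the figure renders at the end of the cell; in a script, call `plt.show()`.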



## 4) Supervised Learning — Classification

Two tasks (do **both**):
- **ID-6:** predict `embryo_id` (6 classes)
- **Type-3:** predict `type3` (3 classes: control, mutantA, mutantB)

**Task:**
1. Implement 3 classifiers with `Pipeline`:
   - `LogisticRegression(max_iter=1000)`
   - `KNeighborsClassifier`
   - `SVC(kernel='rbf')`
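2. Evaluate with **GroupKFold**. Metrics: **Accuracy** and **F1‑macro**.
3. Compare **ID-6** vs **Type-3** (expect Type-3 to be easier). Note that when each fold holds out one embryo (`groups = embryo_id`), that embryo's label never appears in the training folds, so grouped **ID-6** accuracy is 0 by construction; **Type-3** remains learnable because the other embryo of the same type stays in training.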
4. Add confusion matrices.



## 5) Unsupervised Learning — Clustering

**Task (no solutions provided):**
1. Standardize `X` with `StandardScaler`.
2. Apply **KMeans** and **AgglomerativeClustering** with:
   - `n_clusters=3` (biological types)
   - (Optional) `n_clusters=6` (embryo IDs)
3. Compute **silhouette score** (unsupervised metric).
4. Build crosstabs: **cluster vs `type3`** and **cluster vs `embryo_id`** (for interpretation only).
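
As an illustration of the mechanics only, the sketch below runs both algorithms with `n_clusters=3` on synthetic stand-in data (the frame `df_demo` and the type offsets are assumptions) and builds the cluster vs `type3` crosstab:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 6 embryos x 30 frames, 4 features per frame.
rng = np.random.default_rng(0)
embryo_names = ["Control1", "Control2", "Mutant1", "Mutant2", "Mutant3", "Mutant4"]
type_of = {"Control1": "control", "Control2": "control", "Mutant1": "mutantA",
           "Mutant3": "mutantA", "Mutant2": "mutantB", "Mutant4": "mutantB"}
df_demo = pd.DataFrame({"embryo_id": np.repeat(embryo_names, 30)})
df_demo["type3"] = df_demo["embryo_id"].map(type_of)
offset = df_demo["type3"].map({"control": 0.0, "mutantA": 3.0, "mutantB": 6.0}).to_numpy()
X_demo = offset[:, None] + rng.normal(0, 1.0, size=(len(df_demo), 4))

# Standardize once; both algorithms see the same scaled features.
X_std = StandardScaler().fit_transform(X_demo)

algorithms = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
}

results = {}
for name, algo in algorithms.items():
    cluster_labels = algo.fit_predict(X_std)
    sil = silhouette_score(X_std, cluster_labels)  # uses no ground-truth labels
    results[name] = (cluster_labels, sil)
    print(f"{name}: silhouette={sil:.3f}")
    # Crosstab is for interpretation only: cluster numbering is arbitrary.
    print(pd.crosstab(cluster_labels, df_demo["type3"], rownames=["cluster"]), "\n")
```

The same loop works for `n_clusters=6` with `df_demo["embryo_id"]` as the crosstab column.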



## 6) Dimensionality Reduction + Trajectories

**Task:**
1. Compute **PCA(2)**, **t‑SNE(2)**, and **UMAP(2)** on standardized `X`.
2. Scatter-plot colored by `type3` (optionally also by `embryo_id`).
3. **Trajectories:** for each `embryo_id`, sort by `frame` and **connect** points in 2D.
4. Comment on which embedding separates **types** more clearly and which trajectories look smoother or diverge.
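
The trajectory step (sort by `frame`, then connect the points) is sketched below for PCA on synthetic stand-in data. The arrays and the per-embryo drift directions are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): each embryo drifts along its own direction
# in a 4-feature space as the frame index increases.
rng = np.random.default_rng(0)
embryo_names = ["Control1", "Control2", "Mutant1", "Mutant2", "Mutant3", "Mutant4"]
n_frames = 40
frame = np.tile(np.arange(n_frames), len(embryo_names))
embryo = np.repeat(embryo_names, n_frames)
emb_idx = np.repeat(np.arange(len(embryo_names)), n_frames)
directions = rng.normal(size=(len(embryo_names), 4))
X_demo = frame[:, None] * 0.1 * directions[emb_idx] + rng.normal(0, 0.2, size=(len(frame), 4))

# Standardize, then project onto the first two principal components.
Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X_demo))

fig, ax = plt.subplots(figsize=(5, 4))
for name in embryo_names:
    mask = embryo == name
    order = np.argsort(frame[mask])  # sort this embryo's frames by time
    traj = Z[mask][order]
    ax.plot(traj[:, 0], traj[:, 1], "-o", markersize=2, linewidth=0.8, label=name)
    ax.scatter(traj[0, 0], traj[0, 1], marker="s", color="k", s=20, zorder=3)  # first frame
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend(fontsize=7)
```

The plotting loop is independent of the embedding: replacing `Z` with the output of `sklearn.manifold.TSNE(n_components=2).fit_transform(...)` or `umap.UMAP(n_components=2).fit_transform(...)` (the latter from the separate `umap-learn` package) produces the corresponding trajectories.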



## 7) Cross‑Validation without leakage vs naïve split

**Task:**
1. Use **GroupKFold** (groups = `embryo_id`) with `LogisticRegression` for the **Type-3** task.
2. Compare against a naïve frame-wise `train_test_split` (this leaks information).
3. Report **Accuracy** and **F1‑macro** for both; discuss overestimation from leakage.
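
The comparison can be sketched as follows on deliberately extreme synthetic data, where each embryo has its own feature signature and no signal is shared within a type. This construction is an assumption chosen to make the mechanism visible; the real features will behave less starkly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GroupKFold, cross_val_predict, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 6 embryos x 60 frames. Feature j is large only
# for embryo j, so features identify the embryo but carry no type-level signal.
rng = np.random.default_rng(0)
n_embryos, n_frames = 6, 60
groups_demo = np.repeat(np.arange(n_embryos), n_frames)
type_per_embryo = np.array(["control", "control", "mutantA", "mutantB", "mutantA", "mutantB"])
y_type = type_per_embryo[groups_demo]
X_demo = 5.0 * np.eye(n_embryos)[groups_demo] + rng.normal(0, 0.3, size=(len(groups_demo), n_embryos))

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])

# Naive frame-wise split: frames of every embryo land in both train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_type, test_size=0.3, stratify=y_type, random_state=0
)
naive_pred = pipe.fit(X_tr, y_tr).predict(X_te)
naive_acc = accuracy_score(y_te, naive_pred)
naive_f1 = f1_score(y_te, naive_pred, average="macro")

# Grouped evaluation: every frame is predicted by a model that never saw its embryo.
grouped_pred = cross_val_predict(
    pipe, X_demo, y_type, groups=groups_demo, cv=GroupKFold(n_splits=n_embryos)
)
grouped_acc = accuracy_score(y_type, grouped_pred)
grouped_f1 = f1_score(y_type, grouped_pred, average="macro")

print(f"naive   acc={naive_acc:.3f}  f1_macro={naive_f1:.3f}")
print(f"grouped acc={grouped_acc:.3f}  f1_macro={grouped_f1:.3f}")
```

In the naive split the model can score well simply by recognizing which embryo a frame came from. The grouped estimate removes that shortcut, and the gap between the two lines is the overestimation to report.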



## 8) Bias–Variance and Overfitting

**Task (no solutions provided):**
1. Choose `DecisionTreeClassifier` or `KNN` for the **Type-3** task.
2. **Learning curve:** use `learning_curve` with GroupKFold. Plot training size vs Accuracy (train vs validation).
3. **Validation curve:** plot hyperparameter vs Accuracy (e.g., `max_depth` for a tree, or `n_neighbors` for KNN).
4. Conclude: identify **high variance** (train ≫ val) or **high bias** (both low) and propose tuning (regularization, features, data, hyperparameters).
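
Both curve helpers accept `groups`, so the grouped splits can be reused directly. The sketch below shows the calls for a `DecisionTreeClassifier` on synthetic stand-in data (the arrays and type offsets are assumptions) and prints the fold-averaged scores rather than plotting them.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, learning_curve, validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): 6 embryos x 50 frames, 4 features whose mean
# depends on the biological type.
rng = np.random.default_rng(0)
n_embryos, n_frames = 6, 50
groups_demo = np.repeat(np.arange(n_embryos), n_frames)
type_per_embryo = np.array(["control", "control", "mutantA", "mutantB", "mutantA", "mutantB"])
y_type = type_per_embryo[groups_demo]
type_center = {"control": 0.0, "mutantA": 2.0, "mutantB": 4.0}
centers = np.array([type_center[t] for t in y_type])
X_demo = centers[:, None] + rng.normal(0, 1.0, size=(len(y_type), 4))

cv = GroupKFold(n_splits=n_embryos)

# Learning curve: shuffle=True so that small training prefixes mix several embryos.
train_sizes, lc_train, lc_val = learning_curve(
    DecisionTreeClassifier(max_depth=3, random_state=0), X_demo, y_type,
    groups=groups_demo, cv=cv, train_sizes=np.linspace(0.2, 1.0, 5),
    scoring="accuracy", shuffle=True, random_state=0,
)
for n, tr, va in zip(train_sizes, lc_train.mean(axis=1), lc_val.mean(axis=1)):
    print(f"n_train={n:4d}  train_acc={tr:.3f}  val_acc={va:.3f}")

# Validation curve: sweep max_depth with the same grouped folds.
depths = [1, 2, 3, 5, 8]
vc_train, vc_val = validation_curve(
    DecisionTreeClassifier(random_state=0), X_demo, y_type,
    param_name="max_depth", param_range=depths,
    groups=groups_demo, cv=cv, scoring="accuracy",
)
for d, tr, va in zip(depths, vc_train.mean(axis=1), vc_val.mean(axis=1)):
    print(f"max_depth={d}  train_acc={tr:.3f}  val_acc={va:.3f}")
```

Each score array has one row per training size (or per hyperparameter value) and one column per fold, so `mean(axis=1)` and `std(axis=1)` give the line and the error band to pass to `plt.plot` and `plt.fill_between`.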



## 9) Deliverables

1. **Regression (frame):** best model + settings. Report GroupKFold **MAE** and **R²**; include a True vs Predicted plot.  
2. **Classification:** compare **ID-6** vs **Type-3**. Which models worked best? Include **Accuracy** and **F1‑macro**.  
3. **Clustering:** does `k=3` reveal types? How does `k=6` behave? Report silhouette and discuss crosstabs.  
4. **Dimensionality reduction:** which embedding separates **types** best? Show trajectories by `embryo_id` and discuss.  
5. **Cross‑Validation:** quantify overestimation from naïve split vs GroupKFold.  
6. **Bias–Variance:** interpret curves and propose next steps (regularization, features, data, hyperparameters).  

