# Model Training Pipeline

## Overview

This notebook trains multiclass classification models to generate i-MAP scores for distinguishing between multiple genetic/phenotypic groups using high-content imaging data. The pipeline implements in-silico multiplexing by integrating cellular profiles from multiple marker combinations imaged across separate plates.

**Workflow:**
1. **Configuration**: Set analysis parameters including screen name, marker combinations, and genetic classification scheme
2. **Data Loading**: Load single-cell imaging features from multiple marker sets using the `ImageScreenMultiAntibody` class
3. **Preprocessing**: Filter and normalize features (remove missing/constant features, filter outlier cells, standardize measurements)
4. **Model Training**: Train deep learning classifiers using the MAP (Molecular ALS Phenotype) analysis framework with cross-validation strategies (e.g., leave-one-out, sample splits)
5. **Model Saving**: Serialize trained models and analysis results for downstream evaluation and scoring

**Key Components:**
- **Model Architecture**: Multi-marker deep learning classifier with hierarchical encoders (marker-specific → cell-level → sample-level)
- **Training Strategy**: Two-phase training with per-marker early stopping followed by integrated model optimization
- **Sample Splits**: Configurable cross-validation (leave-one-out or fixed train/test splits) ensuring unbiased predictions on held-out samples
- **Output**: Trained models with predicted class probabilities and feature importances saved to disk

The trained models can be used to score new samples or evaluate classification performance on held-out data.
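
The leave-one-out strategy listed under Sample Splits can be illustrated with a minimal, dependency-free sketch. This is not the `maps` implementation: the function name `leave_one_sample_out` and the sample IDs are hypothetical. The idea is that every cell from one sample is held out together, so that sample is scored by a model that never saw any of its cells.

```python
from collections import defaultdict


def leave_one_sample_out(sample_ids):
    """Yield (held_out, train_idx, test_idx), holding out one sample per fold.

    `sample_ids` has one entry per cell; cells sharing an ID stay together.
    """
    groups = defaultdict(list)
    for idx, sid in enumerate(sample_ids):
        groups[sid].append(idx)

    for held_out in sorted(groups):
        test_idx = groups[held_out]
        train_idx = sorted(
            i for sid, idxs in groups.items() if sid != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Hypothetical per-cell sample labels (one label per cell)
cells = ["line_A", "line_A", "line_B", "line_C", "line_B"]
folds = list(leave_one_sample_out(cells))
for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Splitting by sample rather than by cell matters here because cells from the same sample are not independent; a per-cell split would leak sample-specific signal into the test fold.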

## 1. Configuration
Set analysis parameters including the screen name, analysis type (e.g., multiclass), marker subset filter, and specific marker combinations to analyze.

In [1]:
# ---- Parameters ----
SCREEN = "20250216_AWALS37_Full_screen_n96"
ANALYSIS = "multiclass"
MARKER = "all"
ANTIBODY = ["FUS/EEA1"]

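Set random seeds for Python, NumPy, and PyTorch so that model training and evaluation are as reproducible as possible. Setting `PYTHONHASHSEED` inside a running process does not change that process's hash randomization; it only affects Python processes started afterwards, such as worker subprocesses.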

In [2]:
import os
import random
import numpy as np
import torch

# ---- Set seeds for reproducibility ----
SEED = 42
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
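
If the same seeding is needed in several notebooks, the calls above could be wrapped in one helper. The sketch below is an assumption about how that might look, not part of the `maps` package; the name `seed_everything` is hypothetical. NumPy and PyTorch are imported lazily so the helper also runs where they are not installed.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch

        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


# Re-seeding should reproduce the same random draws
seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)
```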

Load the analysis parameters from a JSON configuration file and override them with the notebook-specific settings (screen name, marker combinations, feature filters).

In [3]:
from maps.screens import ImageScreenMultiAntibody
import json
from pathlib import Path


# --- Initialize parameters ---
pdir = Path("/home/kkumbier/maps/template_analyses/pipelines/params")
with open(pdir / f"{ANALYSIS}.json", "r") as f:
    params = json.load(f)

params["screen"] = SCREEN
params["antibodies"] = ANTIBODY

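# If a specific marker is given, extend the drop-feature regex with an
# alternative that matches any feature name containing MARKER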
if MARKER != "all":
    fstr = params["preprocess"]["drop_feature_types"]["feature_str"]
    fstr += f"|^.*{MARKER}.*$"
    params["preprocess"]["drop_feature_types"]["feature_str"] = fstr
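
To make the effect of the appended pattern concrete, the sketch below applies a combined regex to a few feature names. The starting pattern and the feature names are hypothetical, and it assumes the pattern is applied with a regex search against each feature name, which the notebook does not show. `re.escape` is an optional safeguard for marker names containing regex metacharacters; it is not in the original cell.

```python
import re

# Hypothetical starting pattern and feature names, for illustration only
base_pattern = "^.*_Center_.*$"
marker = "EEA1"
pattern = base_pattern + f"|^.*{re.escape(marker)}.*$"

features = [
    "Intensity_Mean_FUS",
    "Intensity_Mean_EEA1",
    "Nucleus_Center_X",
    "Texture_Contrast_FUS",
]

# A feature is dropped if either alternative of the pattern matches its name
dropped = [name for name in features if re.search(pattern, name)]
kept = [name for name in features if not re.search(pattern, name)]
print("dropped:", dropped)
print("kept:", kept)
```

Because `|` has the lowest precedence in a regular expression, each alternative keeps its own `^` and `$` anchors, so appending a new alternative does not alter what the existing pattern matches.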

## 2. Data Loading & 3. Preprocessing
Load single-cell imaging features from multiple marker combinations and apply preprocessing steps: feature filtering, outlier removal, missing value imputation, and normalization.

In [4]:
# Load and process screens for train / test
screen = ImageScreenMultiAntibody(params)
screen.load(antibody=params["antibodies"])

print("Processing data...")
screen.preprocess()
assert screen.data is not None, "No data available after loading and preprocessing"

for ab in params["antibodies"]:
    print(f"Marker set: {ab}")
    print(f"Data: {screen.data[ab].shape}")

Processing data...
Preprocessing complete
Marker set: FUS/EEA1
Data: (87657, 315)
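
The internals of `screen.preprocess()` are not shown in this notebook. As a rough illustration of two of the steps named above (dropping unusable features and standardizing the rest), here is a plain-Python sketch on toy data. The function name, the data layout, and the choice to drop rather than impute features with missing values are all assumptions; outlier-cell filtering is omitted.

```python
import math
import statistics


def preprocess(columns):
    """Drop unusable features, then z-score the remaining ones.

    `columns` maps feature name -> list of per-cell values
    (None or NaN marks a missing value).
    """
    cleaned = {}
    for name, values in columns.items():
        # Drop features with any missing value
        if any(v is None or math.isnan(v) for v in values):
            continue
        # Drop constant features: zero variance, nothing to standardize
        sd = statistics.pstdev(values)
        if sd == 0:
            continue
        mean = statistics.fmean(values)
        cleaned[name] = [(v - mean) / sd for v in values]
    return cleaned


# Hypothetical toy data: 4 cells, 3 features
raw = {
    "intensity": [1.0, 2.0, 3.0, 4.0],
    "constant": [5.0, 5.0, 5.0, 5.0],
    "with_gap": [0.1, None, 0.3, 0.2],
}
processed = preprocess(raw)
print(list(processed))
```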


## 4. Model Training
Train multiclass classification models using the MAP analysis framework. The `fit()` method executes the complete training pipeline including cross-validation, model fitting, and prediction generation on held-out samples.

In [5]:
from maps.analyses import MAP
map_analysis = MAP(screen)
map_analysis.fit()

--- Replicate 1/1 ---
Starting cell-level training...
Cell Epoch 1/100, Overall Loss: 1.1684, Active: 1/1
  FUS/EEA1 - Loss: 1.1684, Acc: 0.3450 
Cell Epoch 2/100, Overall Loss: 1.1071, Active: 1/1
  FUS/EEA1 - Loss: 1.1071, Acc: 0.3962 
Cell Epoch 3/100, Overall Loss: 1.1124, Active: 1/1
  FUS/EEA1 - Loss: 1.1124, Acc: 0.3833 
Cell Epoch 4/100, Overall Loss: 1.0863, Active: 1/1
  FUS/EEA1 - Loss: 1.0863, Acc: 0.4187 
Cell Epoch 5/100, Overall Loss: 1.0755, Active: 1/1
  FUS/EEA1 - Loss: 1.0755, Acc: 0.4354 
Cell Epoch 6/100, Overall Loss: 1.0314, Active: 1/1
  FUS/EEA1 - Loss: 1.0314, Acc: 0.4733 
Cell Epoch 7/100, Overall Loss: 1.0511, Active: 1/1
  FUS/EEA1 - Loss: 1.0511, Acc: 0.4408 
Cell Epoch 8/100, Overall Loss: 1.0116, Active: 1/1
  FUS/EEA1 - Loss: 1.0116, Acc: 0.4758 
Cell Epoch 9/100, Overall Loss: 1.0281, Active: 1/1
  FUS/EEA1 - Loss: 1.0281, Acc: 0.4704 
Cell Epoch 10/100, Overall Loss: 1.0044, Active: 1/1
  FUS/EEA1 - Loss: 1.0044, Acc: 0.4775 
... [output truncated]
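
The `Active: 1/1` field in the log suggests that each marker is tracked separately and drops out of cell-level training once it stops improving. The sketch below shows one common way to implement per-marker early stopping. It is an assumption, not the `maps` code: the class name, the `patience` and `min_delta` parameters, and the loss curves are hypothetical, and the log does not say which loss the real criterion monitors.

```python
class EarlyStopper:
    """Track one marker's loss and deactivate it when improvement stalls."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0
        self.active = True

    def update(self, loss):
        if loss < self.best - self.min_delta:
            # New best loss: reset the counter
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.active = False
        return self.active


# Hypothetical loss curves for two markers
losses = {
    "marker_A": [1.10, 1.05, 1.06, 1.07, 1.08, 1.09],
    "marker_B": [1.20, 1.10, 1.00, 0.95, 0.90, 0.85],
}
stoppers = {name: EarlyStopper(patience=3) for name in losses}

for epoch in range(6):
    for name, stopper in stoppers.items():
        if stopper.active:  # stopped markers are no longer updated
            stopper.update(losses[name][epoch])
    n_active = sum(s.active for s in stoppers.values())
    print(f"Epoch {epoch + 1}: active {n_active}/{len(stoppers)}")
```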

## 5. Model Saving
Serialize the trained MAP analysis object (including fitted models, predictions, and feature importances) along with configuration parameters to disk for downstream evaluation.

In [6]:
import pickle

ab_string = "_".join(ANTIBODY).replace("/", "-")
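# Index directly so a missing key fails with a clear KeyError instead of Path(None)
output_dir = Path(params["result_dir"]) / params["screen"]
output_dir.mkdir(parents=True, exist_ok=True)  # create the output directory if needed
with open(output_dir / f"{ANALYSIS}-{ab_string}-{MARKER}.pkl", "wb") as f:
    pickle.dump({"analysis": map_analysis, "params": params}, f)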
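
For downstream evaluation, the saved file can be read back with `pickle.load`. The round trip below uses a stand-in payload and a temporary directory so it runs without the `maps` package. In the real notebook the `"analysis"` entry is the fitted `MAP` object, so `maps` must be importable in the environment that loads the file.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in payload with the same top-level keys as the notebook's pickle
payload = {"analysis": {"fitted": True}, "params": {"screen": "demo_screen"}}

with tempfile.TemporaryDirectory() as tmp:
    # File name follows the notebook's "{ANALYSIS}-{ab_string}-{MARKER}.pkl" scheme
    path = Path(tmp) / "multiclass-FUS-EEA1-all.pkl"
    with open(path, "wb") as f:
        pickle.dump(payload, f)

    # Only unpickle files you trust: loading can execute arbitrary code
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored["params"]["screen"])
```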