# Adipocyte Perturbation: Quickstart
End-to-end setup from repo clone to training and submission generation. Run cells top-to-bottom on a GPU-enabled machine with sufficient disk (≥100 GB recommended).

## 0. Prerequisites
- Python 3.10+, CUDA-capable GPU, and disk headroom (≥100 GB).
- Raw challenge files placed under `data/raw/Challenge/`:
  - `obesity_challenge_1.h5ad`
  - `signature_genes.csv`
  - `program_proportion.csv`
  - `program_proportion_local_gtruth.csv`
  - `predict_perturbations.txt`
  - `gene_to_predict.txt`
- Git access to the repository.
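
The file list above can be checked programmatically before any long-running step. The following is a minimal sketch, not part of the repository: the helper name `missing_challenge_files` is hypothetical, and the file names are copied from the list above.

```python
from pathlib import Path

# File names copied from the prerequisites list; nothing else is assumed.
REQUIRED_FILES = [
    "obesity_challenge_1.h5ad",
    "signature_genes.csv",
    "program_proportion.csv",
    "program_proportion_local_gtruth.csv",
    "predict_perturbations.txt",
    "gene_to_predict.txt",
]


def missing_challenge_files(root="data/raw/Challenge"):
    """Return the required file names that are absent under `root`.

    Hypothetical helper for illustration; the repository's own setup
    script performs its own checks.
    """
    root = Path(root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_challenge_files()
    print("All files present" if not missing else f"Missing: {missing}")
```

An empty list means every expected file was found; symlinked files also pass, since `is_file()` follows symlinks.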

In [None]:
# 1) Clone repo (skip if already inside)
git clone https://github.com/Koussaisalem/adipocyte-perturbation-prediction.git
cd adipocyte-perturbation-prediction || exit 1

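# 2) Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip

# 3) Install the package with dev and notebook extras
pip install -e ".[dev,notebooks]"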

In [None]:
# 4) Verify GPU availability
nvidia-smi || echo 'nvidia-smi not available'
python - <<'PY'
import torch
print('CUDA available:', torch.cuda.is_available())
print('CUDA devices:', torch.cuda.device_count())
if torch.cuda.is_available():
    print('Device 0:', torch.cuda.get_device_name(0))
PY

In [None]:
# 5) Verify raw data presence
ls -lh data/raw/Challenge

# If the data lives elsewhere, symlink it instead of copying (uncomment and edit the path):
# ln -s /path/to/challenge/data data/raw/Challenge

In [None]:
# 6) Run setup helper (checks files, creates all_genes.txt, directories)
bash setup_codespace.sh

In [None]:
# 7) Build Knowledge Graph (CollecTRI/DoRothEA + STRING)
python scripts/build_kg.py \
  --gene-list data/processed/all_genes.txt \
  --output data/kg/knowledge_graph.gpickle \
  --dorothea-levels A B \
  --string-threshold 700

### Embedding Extraction Notes
- Extraction needs a GPU and free disk space; cells are processed in chunks to limit memory use.
- Tune `--max-cells`, `--chunk-cells`, and `--batch-size` to your hardware.
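
To see how the cell cap and chunk size interact, here is a small sketch of the index arithmetic. It assumes the script caps the dataset at `--max-cells` and then walks it in slices of `--chunk-cells`; the actual sampling strategy in `scripts/extract_embeddings.py` may differ, and both `chunk_ranges` and the 50,000-cell dataset size are illustrative.

```python
def chunk_ranges(n_cells, max_cells, chunk_cells):
    """Yield (start, stop) index pairs covering at most `max_cells` cells.

    Hypothetical helper: it only illustrates the arithmetic, not the
    extraction script's real implementation.
    """
    limit = min(n_cells, max_cells)
    for start in range(0, limit, chunk_cells):
        yield start, min(start + chunk_cells, limit)


# Illustrative dataset size; the flag values match the command in step 8.
ranges = list(chunk_ranges(n_cells=50_000, max_cells=20_000, chunk_cells=1_000))
print(len(ranges), ranges[0], ranges[-1])  # 20 (0, 1000) (19000, 20000)
```

Under these assumptions, halving `--chunk-cells` doubles the number of chunks but halves the cells held in memory at once, while `--batch-size` controls how many cells go through the model per forward pass within a chunk.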

In [None]:
# 8) Extract Geneformer embeddings (adjust caps for your GPU)
python scripts/extract_embeddings.py \
  --h5ad-file data/raw/Challenge/obesity_challenge_1.h5ad \
  --max-cells 20000 \
  --chunk-cells 1000 \
  --batch-size 8 \
  --output data/processed/gene_embeddings.pt

### Training Notes
- Defaults: AdamW lr=1e-4, batch_size=64, epochs=100, precision=16-mixed, early stopping on val/mmd.
- If you hit a GPU out-of-memory error, lower the batch size and raise `accumulate_grad_batches` to keep the effective batch size unchanged.
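
With gradient accumulation, the effective batch size is the per-step batch size multiplied by `accumulate_grad_batches`. The sketch below works out the accumulation steps needed to keep the default effective batch of 64; the helper name `accumulation_steps` is hypothetical and not part of the repository.

```python
def accumulation_steps(target_batch, per_step_batch):
    """Return the accumulation steps needed to reach `target_batch`.

    effective batch = per_step_batch * accumulate_grad_batches
    """
    if per_step_batch <= 0 or target_batch % per_step_batch:
        raise ValueError("per_step_batch must be positive and divide target_batch")
    return target_batch // per_step_batch


# Keep the documented default effective batch of 64 while shrinking
# the per-step batch to fit in GPU memory.
for per_step in (64, 32, 16):
    print(per_step, accumulation_steps(64, per_step))
# 64 1
# 32 2
# 16 4
```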

In [None]:
# 9) Baseline training run
mkdir -p experiments/logs  # tee fails if the log directory is missing
python scripts/train.py \
  --config configs/default.yaml \
  --seed 42 \
  2>&1 | tee experiments/logs/baseline_run.log

### Experiment Variants (optional)
- Increase MMD weight: create `configs/high_mmd.yaml` with `losses.mmd_weight: 0.2`, `losses.pearson_weight: 0.1`.
- Deeper GAT: `gat_layers: 4`, `gat_heads: 16`, `gat_hidden_dim: 256` (lower the batch size if needed).
- More PCA components: `flow_matching.pca_components: 750` if memory allows.
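
A variant config is the default config with a few nested keys overridden. The sketch below shows that merge on plain dictionaries. It assumes the dotted names above map to nested YAML keys, uses `None` placeholders rather than guessing the real defaults, and leaves YAML reading and writing out; `deep_merge` is a hypothetical helper, and whether `scripts/train.py` accepts partial configs is not confirmed here.

```python
import copy


def deep_merge(base, override):
    """Return a copy of `base` with the nested keys from `override` applied."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Placeholder base values; the real defaults live in configs/default.yaml.
base = {
    "losses": {"mmd_weight": None, "pearson_weight": None},
    "flow_matching": {"pca_components": None},
}

# Overrides taken from the "Increase MMD weight" variant above.
high_mmd = deep_merge(base, {"losses": {"mmd_weight": 0.2, "pearson_weight": 0.1}})
print(high_mmd["losses"])
```

Keys that the override does not mention (here `flow_matching`) are carried over unchanged, which is why each variant only needs to list what differs from the baseline.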

In [None]:
# 10) Generate submission from best checkpoint
python scripts/generate_submission.py \
  --checkpoint checkpoints/best.ckpt \
  --output-dir submissions \
  --n-cells 100 \
  --batch-size 10 \
  2>&1 | tee experiments/logs/inference.log

### Quick Validation
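- `wc -l` on `submissions/expression_matrix.csv` should report 286,301 lines (including the header).
- The expression matrix should contain no NaN values, so the check below should print `NaNs: 0`.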

In [None]:
# 11) Validate submission files
wc -l submissions/expression_matrix.csv
head submissions/program_proportions.csv
python - <<'PY'
import pandas as pd
df = pd.read_csv('submissions/expression_matrix.csv')
print('NaNs:', df.isna().sum().sum())
PY
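
The line-count check can also be turned into a pass/fail test instead of a number to eyeball. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the expected count is the one stated under Quick Validation.

```python
EXPECTED_LINES = 286_301  # header included, per the Quick Validation notes


def count_lines(path):
    """Count lines by streaming the file, without loading it into memory."""
    with open(path, "rb") as fh:
        return sum(1 for _ in fh)


def check_line_count(path, expected=EXPECTED_LINES):
    """Return (ok, n_lines) for the file at `path`."""
    n_lines = count_lines(path)
    return n_lines == expected, n_lines
```

Calling `check_line_count("submissions/expression_matrix.csv")` after step 10 returns `(True, 286301)` when the file has the expected size, and the actual count alongside `False` otherwise.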