# Brain-to-Text '25 Competition — Colab-First Training Plan (with Kaggle fallback)

Goal
- Produce a valid competition submission (CSV with id,text) using this repo’s baseline pipeline: GRU phoneme decoder + n‑gram LM (+ optional OPT rescoring).

Key facts from the repo
- Training code lives in `model_training/` and is driven by `rnn_args.yaml` and `train_model.py`.
- Evaluation uses `model_training/evaluate_model.py` and requires a Redis‑connected LM (`language_model/language-model-standalone.py`).
- Data path expected: `data/hdf5_data_final/<session>/data_{train|val|test}.hdf5`.

Deliverables
- Validation WER (for your report) and a test CSV for Kaggle submission.

Sequence (high‑level)
1) Colab setup and dependency install
2) Download data to Drive and verify layout
3) Train baseline GRU in Colab
4) Start Redis + 1‑gram LM; evaluate on val (compute WER) and test (create CSV)
5) If Colab limits block training, train on Kaggle and decode in Colab

---

## 1) Colab setup (copy/paste cells)

1A. Mount Drive and set working directory
```python
# Verify GPU
!nvidia-smi

# Mount Drive
from google.colab import drive
drive.mount('/content/drive')

# Use a stable workspace folder in Drive
%cd /content/drive/MyDrive
!mkdir -p b2txt25
%cd b2txt25
```

1B. Place/point the repo
- If you already have the repo in Drive, set its path and skip cloning.
```python
import os
REPO = "/content/drive/MyDrive/nejm-brain-to-text"

# Option A: reuse an existing repo (recommended if already in Drive).
# A shell-style `|| echo` cannot follow the %cd magic, so test the path in Python.
if os.path.isdir(REPO):
    os.chdir(REPO)
else:
    print("Repo not at this path, see Option B")

# Option B: keep workspace clean, symlink your repo under b2txt25 (-sfn allows reruns)
%cd /content/drive/MyDrive/b2txt25
!ln -sfn "/content/drive/MyDrive/nejm-brain-to-text" ./nejm-brain-to-text
%cd nejm-brain-to-text
```

1C. Install system + Python deps (conflict‑free set for Colab)
```bash
# System deps
!sudo apt-get update -y
!sudo apt-get install -y redis-server cmake build-essential

# Core GPU stack (CUDA 12.1 wheels on Colab)
!pip -q install --upgrade --no-cache-dir \
  torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Remove conflicting HuggingFace libs to avoid resolver issues
!pip -q uninstall -y transformers tokenizers huggingface-hub || true

# Align with Colab constraints
!pip -q install --upgrade --no-cache-dir \
  pandas==2.2.2 \
  numpy==2.0.2

# Known-compatible HF trio
!pip -q install --no-cache-dir \
  huggingface-hub==0.34.1 \
  transformers==4.53.0 \
  tokenizers==0.21.2

# Remaining deps (satisfy umap-learn/tsfresh requirements too)
!pip -q install --upgrade --no-cache-dir \
  redis==5.2.1 \
  matplotlib==3.10.1 \
  scipy==1.14.1 \
  scikit-learn==1.6.1 \
  tqdm==4.67.1 \
  g2p_en==2.1.0 \
  h5py==3.11.0 \
  omegaconf==2.3.0 \
  editdistance==0.8.1 \
  accelerate==1.0.1 \
  bitsandbytes==0.43.1

# Install local utils package from the repo root
%cd /content/drive/MyDrive/nejm-brain-to-text
!pip -q install -e .
```

Troubleshooting (installs)
- If you see: “ERROR: file:///content does not appear to be a Python project”, you ran `pip install -e .` outside the repo. `cd` into `…/nejm-brain-to-text` and rerun.
- If the HF trio conflicts, fall back to `transformers==4.51.0`, `tokenizers==0.20.1`, `huggingface-hub==0.34.1`, then run `pip check` to confirm the installed set is mutually consistent.
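
After the install cell finishes, it can help to confirm which versions actually ended up in the runtime, since a later `pip install` may silently replace an earlier pin. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Return {distribution name: installed version, or None if absent}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # not installed in this runtime
    return found


if __name__ == "__main__":
    pins = ["torch", "numpy", "pandas", "transformers", "tokenizers", "huggingface-hub"]
    for name, ver in installed_versions(pins).items():
        print(f"{name:16s} {ver or 'MISSING'}")
```

Compare the printed versions against the pins in step 1C before starting a long training run.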

---

## 2) Data download and verification
```bash
%cd /content/drive/MyDrive/nejm-brain-to-text
!python download_data.py

# Quick check (first 80 lines)
!ls -R data | head -n 80
```

Expected structure
```
data/
├── t15_copyTask.pkl
├── t15_personalUse.pkl
├── hdf5_data_final/
│   ├── t15.2023.08.11/
│   │   └── data_train.hdf5
│   ├── t15.2023.08.13/
│   │   ├── data_train.hdf5
│   │   ├── data_val.hdf5
│   │   └── data_test.hdf5
│   └── ... (many sessions)
└── t15_pretrained_rnn_baseline/
    └── checkpoint/best_checkpoint, args.yaml
```
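
Because `ls -R | head` shows only the first few sessions, a short script can confirm the layout for every session. This is a sketch that assumes only the `data_{train|val|test}.hdf5` naming shown above; `summarize_sessions` is a hypothetical helper, not part of the repo.

```python
from pathlib import Path

SPLITS = ("train", "val", "test")


def summarize_sessions(root):
    """Map each session directory name to the list of splits it contains."""
    summary = {}
    for session in sorted(Path(root).iterdir()):
        if session.is_dir():
            summary[session.name] = [
                s for s in SPLITS if (session / f"data_{s}.hdf5").is_file()
            ]
    return summary


if __name__ == "__main__":
    root = Path("data/hdf5_data_final")
    if root.is_dir():
        for name, splits in summarize_sessions(root).items():
            print(f"{name}: {', '.join(splits) or 'no data files'}")
    else:
        print(f"{root} not found; run download_data.py from the repo root")
```

Sessions that list only `train` are expected, as in the tree above; a session with no files at all suggests an interrupted download.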

---

## 3) Train the baseline GRU (Colab)
```python
%cd /content/drive/MyDrive/nejm-brain-to-text/model_training

# Optional: quick pipeline check (reduce batches), then restore to 120000 for full run
from omegaconf import OmegaConf
args = OmegaConf.load('rnn_args.yaml')
args.num_training_batches = 120000   # e.g., set 10000 to sanity‑check end‑to‑end
args.gpu_number = '0'
args.output_dir = 'trained_models/baseline_rnn'
args.checkpoint_dir = 'trained_models/baseline_rnn/checkpoint'
OmegaConf.save(config=args, f='rnn_args.yaml')

# Start training
!python train_model.py
```

Outputs (watch for)
- `trained_models/baseline_rnn/training_log`
- `trained_models/baseline_rnn/checkpoint/best_checkpoint`
- `trained_models/baseline_rnn/checkpoint/val_metrics.pkl`
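
To check for these files from a cell rather than by eye, something like the following sketch could be used. It assumes the three relative paths listed above and nothing else about the training output; `missing_outputs` is a hypothetical helper.

```python
from pathlib import Path

EXPECTED = (
    "training_log",
    "checkpoint/best_checkpoint",
    "checkpoint/val_metrics.pkl",
)


def missing_outputs(model_dir, expected=EXPECTED):
    """Return the expected relative paths that do not exist under model_dir."""
    base = Path(model_dir)
    return [rel for rel in expected if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_outputs("trained_models/baseline_rnn")
    print("all outputs present" if not missing else f"missing: {missing}")
```

Run it from `model_training/` so the relative path resolves.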

Time
- A full 120k-batch run takes roughly 3.5h on a fast GPU and longer on a T4. If session time is tight, train with a smaller `num_training_batches`, then resume from your last checkpoint by setting `init_from_checkpoint: true` and pointing `init_checkpoint_path` at it.

---

## 4) Language model + evaluation (Colab)

4A. Start Redis
```bash
!redis-server --daemonize yes
!redis-cli ping  # expect PONG
```
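
If `redis-cli` is unavailable, or you want the check inside a Python cell, a plain TCP probe is a rough substitute. The sketch below assumes Redis is listening on its default port 6379; it confirms only that something accepts connections there, not that it is Redis.

```python
import socket


def port_open(host="localhost", port=6379, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("Redis port reachable" if port_open() else "Nothing listening on 6379")
```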

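4B. Start the 1‑gram LM (in the background, because Colab runs one cell at a time)
```bash
%cd /content/drive/MyDrive/nejm-brain-to-text
# Run in the background so later cells can execute; output goes to lm.log
!nohup python language_model/language-model-standalone.py \
  --lm_path language_model/pretrained_language_models/openwebtext_1gram_lm_sil \
  --do_opt --nbest 100 --acoustic_scale 0.325 --blank_penalty 90 --alpha 0.55 \
  --redis_ip localhost --gpu_number 0 > lm.log 2>&1 &

# Re-run this line until the log shows the LM has connected to Redis
!tail -n 20 lm.log
```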

Notes
- The first run downloads OPT‑6.7b (~13GB). If GPU memory is tight on a T4, remove `--do_opt`: this disables OPT rescoring, so accuracy drops but GPU memory use falls.
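
The ~13GB figure is consistent with a back-of-the-envelope estimate. The sketch below assumes the weights are stored at 2 bytes per parameter (16-bit); the precision the LM script actually loads is not stated here, so treat the result as a rough guide.

```python
def weights_gb(n_params, bytes_per_param):
    """Approximate size of model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    n = 6.7e9  # parameter count implied by the model name OPT-6.7b
    print(f"16-bit weights: {weights_gb(n, 2):.1f} GB")  # 13.4 GB
    print(f"32-bit weights: {weights_gb(n, 4):.1f} GB")  # 26.8 GB
```

Activations and the n‑gram decoder need memory on top of the weights, which is why a 16GB T4 leaves little headroom.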

4C. Evaluate on validation (compute WER)
```bash
%cd /content/drive/MyDrive/nejm-brain-to-text/model_training
!python evaluate_model.py \
  --model_path trained_models/baseline_rnn \
  --data_dir ../data/hdf5_data_final \
  --eval_type val \
  --gpu_number 0
```
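
For reference when reading the printed number: word error rate is the word-level edit distance between prediction and reference, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, aggregated over all sentences; it is not the repo's implementation, and whether `evaluate_model.py` aggregates in exactly this way is an assumption.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word lists (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference word
                curr[j - 1] + 1,         # insert a hypothesis word
                prev[j - 1] + (r != h),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word errors divided by total reference words."""
    errors = words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_w, hyp_w = ref.split(), hyp.split()
        errors += word_edit_distance(ref_w, hyp_w)
        words += len(ref_w)
    return errors / words if words else 0.0


if __name__ == "__main__":
    refs = ["i want some water", "hello there"]
    hyps = ["i want water", "hello their"]
    print(f"WER = {corpus_wer(refs, hyps):.3f}")  # 2 errors / 6 words = 0.333
```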

4D. Evaluate on test (produce submission CSV)
```bash
%cd /content/drive/MyDrive/nejm-brain-to-text/model_training
!python evaluate_model.py \
  --model_path trained_models/baseline_rnn \
  --data_dir ../data/hdf5_data_final \
  --eval_type test \
  --gpu_number 0

# Output file lives under the model_path directory:
# trained_models/baseline_rnn/baseline_rnn_test_predicted_sentences_YYYYMMDD_HHMMSS.csv
```
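
Before uploading, a quick structural check of the CSV can catch a truncated or malformed file. This sketch assumes the header is exactly `id,text`, as stated in the Goal section; confirm against the competition's sample submission. `check_submission` is a hypothetical helper.

```python
import csv
import glob


def check_submission(path):
    """Return the number of data rows; raise ValueError on a malformed file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header != ["id", "text"]:
            raise ValueError(f"unexpected header: {header}")
        rows = 0
        for row_no, row in enumerate(reader, start=1):
            if len(row) != 2:
                raise ValueError(f"row {row_no}: expected 2 fields, got {len(row)}")
            rows += 1
    return rows


if __name__ == "__main__":
    pattern = "trained_models/baseline_rnn/*_test_predicted_sentences_*.csv"
    files = sorted(glob.glob(pattern))
    if files:
        # Timestamped names sort chronologically, so the last one is the newest
        print(files[-1], check_submission(files[-1]), "rows")
```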

4E. Shutdown Redis (after you finish)
```bash
!redis-cli shutdown
```

---

## 5) Kaggle fallback (if Colab training is too slow/unstable)

Strategy
- Train RNN on Kaggle (GPU notebook). Kaggle may not allow Redis; skip LM there.
- Download the trained checkpoint to Drive.
- Run LM + `evaluate_model.py` in Colab to create submission CSV.

Kaggle steps (high‑level)
1) Create a GPU notebook; attach the repo as a Kaggle Dataset (or upload zip).
2) `pip install` the same Python deps (skip Redis/LM).
3) Run `model_training/train_model.py`; save `trained_models/baseline_rnn/checkpoint/best_checkpoint` to the notebook output.
4) Download the model directory; put it in Drive under `…/model_training/trained_models/baseline_rnn`.
5) Back in Colab, run Section 4 (LM + evaluation) to generate CSV.
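
For steps 3 and 4, bundling the model directory into a single archive makes the download simpler. The sketch below uses the standard library; the output location `/kaggle/working` is assumed to be the notebook's output directory, and `archive_model` is a hypothetical helper.

```python
import os
import shutil


def archive_model(model_dir, archive_base):
    """Zip the contents of model_dir into archive_base + '.zip'; return its path."""
    return shutil.make_archive(archive_base, "zip", root_dir=model_dir)


if __name__ == "__main__":
    model_dir = "trained_models/baseline_rnn"
    if os.path.isdir(model_dir):
        # Assumed Kaggle output directory; adjust if your notebook differs
        print(archive_model(model_dir, "/kaggle/working/baseline_rnn"))
```

Keep the archive outside `model_dir` so it does not end up inside itself, then unzip it in Drive under `…/model_training/trained_models/baseline_rnn`.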

---

## Troubleshooting quick reference

- Dependency conflicts (transformers/tokenizers/hf‑hub)
  - Uninstall first, then install compatible trio: `transformers==4.53.0`, `tokenizers==0.21.2`, `huggingface-hub==0.34.1`.
- Colab package constraints
  - Use `pandas==2.2.2`, `numpy==2.0.2`, `scipy==1.14.1`, `scikit-learn==1.6.1`.
- `pip install -e .` error
  - Ensure you `cd` to `/content/drive/MyDrive/nejm-brain-to-text` before installing.
- LM OOM on T4
  - Remove `--do_opt` to skip OPT; or use Colab Pro (A100) for more VRAM.
- Session limits
  - Lower `num_training_batches` so a checkpoint is written before the session ends; resume later with `init_from_checkpoint: true` and `init_checkpoint_path`.

---

## Timeline (1 month)

- Week 1: Environment + data + short training (10k batches) to validate end‑to‑end; produce a test CSV.
- Week 2: Full training (120k) on Colab or Kaggle; checkpoint and verify val WER.
- Week 3: Iterate hyperparameters if time allows; re‑evaluate; generate improved CSV.
- Week 4: Final run, prepare report (include val WER, method, settings) and submit to Kaggle.

---

Owner checklist
- Data present under `data/hdf5_data_final/*`
- `trained_models/baseline_rnn/checkpoint/best_checkpoint` exists after training
- LM process connected to Redis (shows “Successfully connected…”)
- Validation WER printed
- Submission CSV generated under model directory

Document version: 2.0
