# SUTD TrafficQA — CNN + LSTM baseline (MCQ-4)

This notebook runs the **CNN+LSTM** multiple-choice baseline (4 options) that mirrors the classic LSTM-style setup described in the SUTD-TrafficQA paper:
- **BiLSTM** encodes the question and each candidate answer (QA Bank)
- **CNN** encodes sampled frames
- **LSTM** encodes the frame sequence
- An **MLP** scores each of the 4 options
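
The "sampled frames" step can be pictured with a small sketch. It assumes uniform sampling, taking the midpoint of each of `num_frames` equal segments; the strategy actually used by `train_sutd_cnn_lstm.py` is not shown in this notebook and may differ. The function name `uniform_frame_indices` is hypothetical.

```python
from __future__ import annotations


def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Return num_frames indices spread evenly over a video of total_frames frames.

    Assumption: uniform midpoint sampling, not necessarily what the training script does.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    segment = total_frames / num_frames
    # Midpoint of each segment; clamp so videos shorter than num_frames repeat frames.
    return [min(total_frames - 1, int(segment * (i + 0.5))) for i in range(num_frames)]
```

With this scheme the LSTM always sees a fixed-length sequence, regardless of clip duration.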

## Expected dataset layout
```
CS412-CV-FinalProject-main/
  SUTD/
    videos/
      *.mp4
    questions/
      R3_train.jsonl
      R3_val.jsonl
      R3_test.jsonl
```
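
A quick way to confirm this layout before launching anything is a small check like the one below. This is a sketch, not part of the repository: `missing_dataset_items` and `REQUIRED_QUESTION_FILES` are hypothetical names, and the only things checked are the paths listed above.

```python
from pathlib import Path

# File names taken from the layout above.
REQUIRED_QUESTION_FILES = ("R3_train.jsonl", "R3_val.jsonl", "R3_test.jsonl")


def missing_dataset_items(sutd_root):
    """Return a list of human-readable problems; an empty list means the layout looks right."""
    root = Path(sutd_root)
    problems = []
    videos_dir = root / "videos"
    if not videos_dir.is_dir():
        problems.append(f"missing directory: {videos_dir}")
    elif not any(videos_dir.glob("*.mp4")):
        problems.append(f"no .mp4 files in: {videos_dir}")
    for name in REQUIRED_QUESTION_FILES:
        path = root / "questions" / name
        if not path.is_file():
            problems.append(f"missing file: {path}")
    return problems
```

Calling `missing_dataset_items('SUTD')` from the project root should return `[]` once all files are in place.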


## 0) (Optional) Install dependencies
If you're running in a fresh environment, uncomment the cell below.


In [None]:
# %pip install -r requirements.txt
# If you hit a torchvision import error in your environment, this baseline still runs using a small CNN fallback.


## 1) Point to your project + dataset
Set `PROJECT_ROOT` to the folder that contains `train_sutd_cnn_lstm.py`.


In [2]:
import os
import sys
from pathlib import Path

# If this notebook lives in the repo root, keep '.'
PROJECT_ROOT = Path('.').resolve()
print('PROJECT_ROOT:', PROJECT_ROOT)

# Make the project root importable
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

# Update this if your SUTD folder is elsewhere
SUTD_ROOT = PROJECT_ROOT / 'SUTD'
print('SUTD_ROOT:', SUTD_ROOT)


PROJECT_ROOT: /datastore/clc_hcmus/ZaAIC/CS412-CV-FinalProject
SUTD_ROOT: /datastore/clc_hcmus/ZaAIC/CS412-CV-FinalProject/SUTD


In [8]:
# Keep the Hugging Face cache next to the project folder instead of the home directory
os.environ['HF_HOME'] = str(PROJECT_ROOT.parent / 'hf_cache')

## 2) Sanity check the dataset structure


In [4]:
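# Create R3_val.jsonl by copying every 5th line of R3_test.jsonl.
# Note: this validation split is a subset of the test split, and any existing R3_val.jsonl is overwritten.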
import json
test_file = SUTD_ROOT / 'questions' / 'R3_test.jsonl'
val_file = SUTD_ROOT / 'questions' / 'R3_val.jsonl'
with open(test_file, 'r') as f_in, open(val_file, 'w') as f_out:
    for i, line in enumerate(f_in):
        if i % 5 == 0:
            f_out.write(line)

print('Created', val_file)
print('val file:', val_file, 'exists:', val_file.exists())

Created /datastore/clc_hcmus/ZaAIC/CS412-CV-FinalProject/SUTD/questions/R3_val.jsonl
val file: /datastore/clc_hcmus/ZaAIC/CS412-CV-FinalProject/SUTD/questions/R3_val.jsonl exists: True
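
Because the cell above draws validation lines from `R3_test.jsonl`, validation accuracy is not independent of test accuracy. One possible alternative, sketched here under the assumption that holding out part of the training file is acceptable for your experiments, is to apply the same every-k-th-line rule to `R3_train.jsonl`. The helper name `split_every_kth` is hypothetical.

```python
def split_every_kth(lines, k=5):
    """Split lines into (held_out, remaining); every k-th line, starting at index 0, is held out."""
    if k < 2:
        raise ValueError("k must be at least 2")
    held_out = [line for i, line in enumerate(lines) if i % k == 0]
    remaining = [line for i, line in enumerate(lines) if i % k != 0]
    return held_out, remaining


# Example use (paths are up to you; nothing is written here):
# with open(SUTD_ROOT / 'questions' / 'R3_train.jsonl') as f:
#     val_lines, train_lines = split_every_kth(f.readlines(), k=5)
```

The two returned lists are disjoint, so a model trained on `remaining` never sees the held-out questions.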


In [5]:
videos_dir = SUTD_ROOT / 'videos'
questions_dir = SUTD_ROOT / 'questions'

print('videos_dir exists:', videos_dir.exists())
print('questions_dir exists:', questions_dir.exists())
if questions_dir.exists():
    print('question files:', sorted([p.name for p in questions_dir.glob('*.jsonl')])[:10])
if videos_dir.exists():
    vids = list(videos_dir.glob('*'))
    print('num videos:', len(vids))
    print('sample videos:', [v.name for v in vids[:5]])

train_file = questions_dir / 'R3_train.jsonl'
val_file   = questions_dir / 'R3_val.jsonl'
test_file  = questions_dir / 'R3_test.jsonl'
print('train file:', train_file, 'exists:', train_file.exists())
print('val file:', val_file, 'exists:', val_file.exists())
print('test file:', test_file, 'exists:', test_file.exists())


videos_dir exists: True
questions_dir exists: True
question files: ['R3_all.jsonl', 'R3_test.jsonl', 'R3_train.jsonl', 'R3_val.jsonl']
num videos: 10081
sample videos: ['c_movi7919_25.mp4', 'b_19t411q7rr_clip_016.mp4', 'b_1j7411h7kb_clip_044.mp4', 'b_1j741157BM_clip_118.mp4', 'b_1n741157ss_clip_033.mp4']
train file: /datastore/clc_hcmus/ZaAIC/CS412-CV-FinalProject/SUTD/questions/R3_train.jsonl exists: True
val file: /datastore/clc_hcmus/ZaAIC/CS412-CV-FinalProject/SUTD/questions/R3_val.jsonl exists: True
test file: /datastore/clc_hcmus/ZaAIC/CS412-CV-FinalProject/SUTD/questions/R3_test.jsonl exists: True


## 3) Quick smoke test (small subset)
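This runs one short epoch on at most 200 training and 200 validation samples to verify that the pipeline works end to end. Adjust `--device` to a GPU that exists on your machine.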


In [5]:
NUM_FRAMES = 16
BATCH_SIZE = 8
EPOCHS = 1

!python train_sutd_cnn_lstm.py \
  --sutd_root "{SUTD_ROOT}" \
  --num_frames {NUM_FRAMES} \
  --batch_size {BATCH_SIZE} \
  --epochs {EPOCHS} \
  --max_train_samples 200 \
  --max_val_samples 200 \
  --use_train_aug \
  --device 'cuda:1'


epoch 1/1: 100%|█████████████████████| 25/25 [00:46<00:00,  1.84s/it, loss=1.31]
Epoch 1: val_acc=0.3100 (valid_samples=200)
  New best: 0.3100 -> saved best.pt
Done. Best val acc: 0.3100
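
With four options, random guessing gives 25% accuracy, so small-sample numbers like the one above are worth comparing against chance. The sketch below uses a normal approximation to the binomial; `z_score_vs_chance` is a hypothetical helper, and the approximation is only a rough guide.

```python
import math


def z_score_vs_chance(accuracy, n_samples, n_options=4):
    """Normal-approximation z-score of an observed accuracy against uniform random guessing."""
    chance = 1.0 / n_options
    # Standard error of the accuracy if every answer were a uniform random guess.
    std_err = math.sqrt(chance * (1.0 - chance) / n_samples)
    return (accuracy - chance) / std_err
```

For the smoke test above (0.31 on 200 samples) this gives a z-score a little under 2, so the result is only weakly distinguishable from guessing, which is expected after a single short epoch.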


## 4) Full training
Remove the `--max_*_samples` flags to train on the full split.


In [None]:
# Full training run on the whole train split
NUM_FRAMES = 16
BATCH_SIZE = 8
EPOCHS = 5

!python train_sutd_cnn_lstm.py \
  --sutd_root "{SUTD_ROOT}" \
  --num_frames {NUM_FRAMES} \
  --batch_size {BATCH_SIZE} \
  --epochs {EPOCHS} \
  --use_train_aug


Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /home/clc_hcmus2/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████████████████████████████████| 44.7M/44.7M [00:01<00:00, 25.8MB/s]
epoch 1/5: 100%|██████████████| 7058/7058 [3:22:35<00:00,  1.72s/it, loss=0.706]
Epoch 1: val_acc=0.2864 (valid_samples=1215)
  New best: 0.2864 -> saved best.pt
epoch 2/5: 100%|██████████████| 7058/7058 [2:40:35<00:00,  1.37s/it, loss=0.648]
Epoch 2: val_acc=0.2856 (valid_samples=1215)
epoch 3/5: 100%|███████████████| 7058/7058 [2:53:22<00:00,  1.47s/it, loss=0.63]
Epoch 3: val_acc=0.2856 (valid_samples=1215)
epoch 4/5: 100%|██████████████| 7058/7058 [3:20:28<00:00,  1.70s/it, loss=0.618]
Epoch 4: val_acc=0.2840 (valid_samples=1215)
epoch 5/5:  63%|████████▊     | 4468/7058 [2:08:37<41:28,  1.04it/s, loss=0.611]
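
The log above follows a fixed `Epoch N: val_acc=... (valid_samples=...)` pattern, so the per-epoch validation accuracy can be pulled out of captured output with a short parser. This is a sketch that assumes the log format stays exactly as printed above; the function names are hypothetical.

```python
import re

# Matches lines such as: Epoch 1: val_acc=0.2864 (valid_samples=1215)
VAL_LINE = re.compile(r"Epoch (\d+): val_acc=([0-9.]+) \(valid_samples=(\d+)\)")


def parse_val_accuracy(log_text):
    """Return a list of (epoch, val_acc, valid_samples) tuples found in log_text."""
    return [(int(e), float(acc), int(n)) for e, acc, n in VAL_LINE.findall(log_text)]


def best_epoch(log_text):
    """Return the tuple with the highest val_acc, or None if no epoch line is present."""
    rows = parse_val_accuracy(log_text)
    return max(rows, key=lambda row: row[1]) if rows else None
```

Progress-bar lines are ignored because they start with a lowercase `epoch` and carry no `val_acc` field.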

In [None]:
!python train_sutd_cnn_lstm.py \
  --sutd_root "SUTD/" \
  --num_frames 8 \
  --batch_size 32 \
  --epochs 5 \
  --use_train_aug \
  --out_dir "outputs_tmux/" \
  --device "cuda:7"

## 5) Evaluate a checkpoint
By default, training writes:
- `outputs/sutd_cnn_lstm/best.pt`
- `outputs/sutd_cnn_lstm/last.pt`


In [4]:
NUM_FRAMES = 8  # matches the --num_frames 8 run that wrote outputs_tmux/

# ckpt = PROJECT_ROOT / 'outputs' / 'sutd_cnn_lstm' / 'best.pt'
ckpt = PROJECT_ROOT / 'outputs_tmux' / 'best.pt'
print('Using ckpt:', ckpt, 'exists:', ckpt.exists())

!python eval_sutd_cnn_lstm.py \
  --sutd_root "{SUTD_ROOT}" \
  --ckpt "{ckpt}" \
  --num_frames {NUM_FRAMES}


Using ckpt: /datastore/clc_hcmus/ZaAIC/CS412-CV-FinalProject/outputs_tmux/best.pt exists: True
evaluating: 100%|█████████████████████████████| 760/760 [13:10<00:00,  1.04s/it]
Test accuracy (only samples with video found): 0.3001 (valid_samples=6075)
Wrote predictions to sutd_cnn_lstm_predictions.csv


## 6) Suggested ablations (good for your report)
- **Frame count**: `--num_frames 8/16/32/64`
- **Backbone**: `--cnn_backbone resnet18` vs `resnet50`
- **Augmentation**: toggle `--use_train_aug`
- **Freeze CNN**: the CNN is frozen by default; remove `--freeze_cnn` if you want to fine-tune the visual encoder
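
To run the frame-count and backbone ablations systematically, the command lines can be generated rather than typed by hand. The sketch below only builds strings from flags that appear in this notebook (`--sutd_root`, `--num_frames`, `--cnn_backbone`, `--use_train_aug`, `--out_dir`); the function name `ablation_commands` and the `outputs_ablation` directory are hypothetical choices.

```python
import itertools
import shlex


def ablation_commands(sutd_root="SUTD", frame_counts=(8, 16, 32, 64),
                      backbones=("resnet18", "resnet50"),
                      out_root="outputs_ablation"):
    """Build one training command string per (num_frames, backbone) pair."""
    commands = []
    for num_frames, backbone in itertools.product(frame_counts, backbones):
        # Separate output directory per run so checkpoints do not overwrite each other.
        out_dir = f"{out_root}/{backbone}_f{num_frames}"
        args = ["python", "train_sutd_cnn_lstm.py",
                "--sutd_root", sutd_root,
                "--num_frames", str(num_frames),
                "--cnn_backbone", backbone,
                "--use_train_aug",
                "--out_dir", out_dir]
        commands.append(shlex.join(args))
    return commands
```

Printing the result of `ablation_commands()` gives eight commands that can be pasted into a shell or a job script, one per configuration.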
