# STEM-VLM Baseline Evaluation on Colab

This notebook runs a baseline evaluation of Qwen2-VL-2B on MathVista and ScienceQA.

## 1. Check GPU

In [None]:
!nvidia-smi
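
If `nvidia-smi` fails, the runtime most likely has no GPU attached (Runtime → Change runtime type). The same check can be scripted so that later cells fail early. This is a minimal sketch, assuming that `nvidia-smi` being on the PATH and exiting cleanly means a GPU is usable; `gpu_summary` is a hypothetical helper, not part of the repository.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the GPU names reported by nvidia-smi, or an empty list if none.

    Hypothetical helper for this notebook; not part of the stem-vlm repo.
    """
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools are not installed, so no usable GPU
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []  # nvidia-smi exists but cannot talk to a GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = gpu_summary()
print("GPUs:", gpus if gpus else "none detected (switch the Colab runtime to GPU)")
```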

## 2. Mount Google Drive (for saving results)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 3. Clone Repository

In [None]:
!git clone https://github.com/Saad1926Q/stem-vlm.git
%cd stem-vlm

## 4. Install Dependencies

We install with pip on Colab; uv is not needed here.

In [None]:
!pip install -q torch transformers accelerate qwen-vl-utils datasets pillow tqdm pyyaml

## 5. (Optional) Download Datasets First

This step pre-downloads and caches the datasets. If you skip it, the main script downloads them on its first run.

In [None]:
# Uncomment to pre-download:
# !python data/download.py

## 6. Run Baseline Evaluation

### Quick Test (10 samples)

In [None]:
# Test with 10 samples first to make sure everything works
!python scripts/run_baseline.py --dataset mathvista --num_samples 10

### Full Evaluation (MathVista)

Expect this to take roughly 2-3 hours.

In [None]:
# Run full MathVista evaluation
!python scripts/run_baseline.py --config configs/baseline.yaml

### Full Evaluation (ScienceQA)

In [None]:
# Run ScienceQA evaluation
!python scripts/run_baseline.py --dataset scienceqa

## 7. Copy Results to Google Drive

Save the results to Drive so they persist after the Colab session ends.

In [None]:
# Create directory in Drive
!mkdir -p /content/drive/MyDrive/stem-vlm-results

# Copy all results
!cp -r experiments/* /content/drive/MyDrive/stem-vlm-results/

print("✓ Results saved to Google Drive: stem-vlm-results/")
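
The shell copy above fails with an unhelpful error if Drive is not mounted or no results exist yet. A hedged Python alternative is sketched here; `copy_results` is a hypothetical helper, not part of the repository, and it takes both paths as parameters so it can be tried on any two directories.

```python
import shutil
from pathlib import Path


def copy_results(src, dst):
    """Copy the results tree from src into dst; return the file count now in dst.

    Hypothetical helper; not part of the stem-vlm repo.
    """
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise FileNotFoundError(f"No results directory at {src}; run an evaluation first.")
    # dirs_exist_ok=True lets repeated runs merge into the same folder (Python 3.8+)
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return sum(1 for p in dst.rglob("*") if p.is_file())
```

In this notebook the call would be `copy_results("experiments", "/content/drive/MyDrive/stem-vlm-results")`, ideally after confirming that `Path("/content/drive/MyDrive").is_dir()` is true, which indicates that Drive is mounted.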

## 8. Quick Analysis

Load the latest predictions file and print its metadata and a few sample predictions.

In [None]:
import json
from pathlib import Path

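# Find the latest predictions file by modification time
results_dir = Path("experiments/baseline")
pred_files = list(results_dir.glob("*_predictions_*.json"))
if not pred_files:
    raise FileNotFoundError(f"No prediction files in {results_dir}; run an evaluation first.")
latest_file = max(pred_files, key=lambda p: p.stat().st_mtime)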

print(f"Loading: {latest_file}")

with open(latest_file) as f:
    data = json.load(f)

metadata = data['metadata']
predictions = data['predictions']

print("\n" + "="*50)
print("METADATA")
print("="*50)
for key, value in metadata.items():
    print(f"{key}: {value}")

print("\n" + "="*50)
print("SAMPLE PREDICTIONS")
print("="*50)
for i, pred in enumerate(predictions[:3]):
    print(f"\nSample {i}:")
    print(f"Question: {pred['question'][:100]}...")
    print(f"Prediction: {pred['prediction']}")
    print(f"Ground Truth: {pred['ground_truth']}")
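
The cell above prints raw predictions but no score. A rough accuracy can be computed from the same `prediction` and `ground_truth` fields. This sketch assumes case-insensitive exact match after trimming whitespace, which may be stricter or looser than the scoring the repository itself applies (free-form MathVista answers in particular may need answer extraction first). `exact_match_accuracy` is a hypothetical helper, and the records below are toy data for illustration only.

```python
def exact_match_accuracy(predictions):
    """Fraction of records whose prediction equals the ground truth.

    Assumption: each record is a dict with 'prediction' and 'ground_truth'
    keys, compared as case-insensitive, whitespace-trimmed strings.
    """
    if not predictions:
        return 0.0
    correct = sum(
        str(p["prediction"]).strip().lower() == str(p["ground_truth"]).strip().lower()
        for p in predictions
    )
    return correct / len(predictions)


# Toy records, not real results; in the notebook, pass the loaded `predictions` list.
demo = [
    {"prediction": " B ", "ground_truth": "b"},     # match after normalization
    {"prediction": "12", "ground_truth": 12},       # match after str() conversion
    {"prediction": "A", "ground_truth": "C"},       # mismatch
    {"prediction": "3.5", "ground_truth": "3.50"},  # mismatch: strings differ
]
print(f"Exact-match accuracy: {exact_match_accuracy(demo):.2%}")  # prints 50.00%
```

The last toy record shows the main limitation: numerically equal answers written differently count as wrong, so treat the number as a lower-bound sanity check rather than a benchmark score.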