# 🏥 Medical Model Optimization + Mixture of Experts  
## 📓 Notebook 1: Model Selection & Benchmarking

**Authors:**  
- Dan Harvey  
- Xinzhuo Jiang  

**Affiliation:**  
*High-Performance Machine Learning (HPML)*  
*Columbia University*


---

### 🔍 Project Overview

This project investigates how to optimize large language models (LLMs) for medical applications by leveraging modern efficiency techniques and modular model design.

### 🎯 Objectives

1. **Benchmark** multiple medical and general-purpose LLMs to establish quantitative performance baselines.  
2. **Optimize** selected models through quantization, pruning, and architectural tuning.  
3. **Design and evaluate** a Mixture-of-Experts (MoE) architecture with specialized, task-specific experts.

---

### 📁 This notebook focuses on Step 1: loading and benchmarking key candidate models.


In [1]:
## 📦 Environment Setup: Dependencies and Imports
import os
import sys
import importlib
import subprocess
import torch
import platform
import time

In [2]:
# Add project root to path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

# Required packages
required_packages = [
    'torch', 'transformers', 'datasets', 'accelerate', 'flash_attn',
    'evaluate', 'lm_eval', 'sklearn', 'matplotlib', 'wandb',
    'tqdm', 'sentencepiece', 'scipy', 'einops'
]

# Import names that differ from their pip distribution names
pip_names = {'sklearn': 'scikit-learn', 'flash_attn': 'flash-attn'}

# Check and install missing packages
for package in required_packages:
    try:
        module = importlib.import_module(package)
        print(f"✅ {package} installed successfully")
        if package == 'torch':
            print(f"   Version: {torch.__version__}")
            print(f"   CUDA available: {torch.cuda.is_available()}")
            if torch.cuda.is_available():
                print(f"   CUDA version: {torch.version.cuda}")
                print(f"   GPU: {torch.cuda.get_device_name(0)}")
        elif hasattr(module, '__version__'):
            print(f"   Version: {module.__version__}")
    except ImportError:
        print(f"❌ {package} not found. Installing...")
        # Install under the pip name, which may differ from the import name
        pip_name = pip_names.get(package, package)
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])
        importlib.invalidate_caches()  # make the new install visible to importlib
        module = importlib.import_module(package)
        print(f"✅ {package} installed successfully (post-install)")
        if hasattr(module, '__version__'):
            print(f"   Version: {module.__version__}")

# You may need to restart the Kernel to use these

✅ torch installed successfully
   Version: 2.6.0+cu124
   CUDA available: True
   CUDA version: 12.4
   GPU: NVIDIA A100-SXM4-40GB
✅ transformers installed successfully
   Version: 4.51.3
✅ datasets installed successfully
   Version: 3.6.0
✅ accelerate installed successfully
   Version: 1.6.0
✅ flash_attn installed successfully
   Version: 2.7.4.post1
✅ evaluate installed successfully
   Version: 0.4.3
✅ lm_eval installed successfully
✅ sklearn installed successfully
   Version: 1.6.1
✅ matplotlib installed successfully
   Version: 3.10.0
✅ wandb installed successfully
   Version: 0.19.10
✅ tqdm installed successfully
   Version: 4.67.1
✅ sentencepiece installed successfully
   Version: 0.2.0
✅ scipy installed successfully
   Version: 1.15.2
✅ einops installed successfully
   Version: 0.8.1
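
The log above prints no version for `lm_eval`, presumably because that module does not expose a `__version__` attribute. As a minimal sketch, installed versions can instead be read from package metadata without importing the package. The helper names and the import-name to distribution-name mapping below are illustrative assumptions, not part of the original notebook.

```python
import importlib.metadata
import importlib.util

# Assumed mapping from import name to pip distribution name (illustrative only).
DIST_NAMES = {"sklearn": "scikit-learn", "flash_attn": "flash-attn", "lm_eval": "lm_eval"}


def is_importable(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def installed_version(dist_name):
    """Return the installed version string from package metadata, or None."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


for module_name, dist_name in DIST_NAMES.items():
    status = "found" if is_importable(module_name) else "missing"
    print(f"{module_name}: {status}, version={installed_version(dist_name)}")
```

Reading metadata avoids the cost of importing heavy packages just to print a version string.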


In [3]:
# Load section dependencies
from transformers import AutoTokenizer, AutoModelForCausalLM
import gc
import lm_eval

# 🧠 Model Selection: Baseline Models

We will work with the following Hugging Face models:

| Model Name                            | Size | Notes                                                          |
| ------------------------------------- | ---- | -------------------------------------------------------------- |
| `TsinghuaC3I/Llama-3-8B-UltraMedical` | 8B   | Medical domain-specific, fine-tuned, ideal teacher & benchmark |
| `meta-llama/Llama-3.2-3B`             | 3B   | Same architecture, smaller, ideal as an expert or student      |
| `Qwen/Qwen3-4B`                       | 4B   | Non-LLaMA expert for diversity in MoE                          |

These models will serve as the baseline in our pipeline and will be evaluated for:

- Performance on medical QA and reasoning tasks
- Suitability for distillation and expert specialization
- Impact of downstream optimizations (quantization, pruning, MoE routing)

📌 **Note**: All models are initially loaded in **full FP32 (float32) precision** to serve as accurate performance baselines before applying any quantization or memory optimization techniques.


In [6]:
# 🔐 Hugging Face Access - Llama is Gated
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
The token `helm` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `helm`


# 📥 Load Baseline Models

## 🦙 Llama-3-8B-UltraMedical

**Links**  
- 🤗 [Hugging Face Model Card](https://huggingface.co/TsinghuaC3I/Llama-3-8B-UltraMedical)
- 📄 [Paper / Source](https://huggingface.co/TsinghuaC3I/Llama-3-8B-UltraMedical)

**Approximate GPU Memory Requirements:**
- **FP32**: ~32.4 GB  
- **FP16**: ~16 GB  
- **INT8**: ~8 GB  
- **INT4**: ~4 GB  

> These values are estimates and may vary based on sequence length, attention optimizations, and tokenizer overhead.
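
The estimates above follow from simple arithmetic: weight memory is roughly the parameter count times the bytes stored per parameter. The sketch below assumes a nominal 8.0 billion parameters and counts weights only, so it will come in below measured usage, which also includes the CUDA context, buffers, and activations.

```python
# Bytes stored per parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(n_params, precision):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[precision] / 1024**3


N_PARAMS = 8.0e9  # assumption: nominal "8B" parameter count

for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{weight_memory_gib(N_PARAMS, precision):.1f} GiB")
```

Each halving of the bytes per parameter halves the weight footprint, which is why FP16 needs about half of the FP32 figure and INT4 about one eighth.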


In [None]:
#Llama-3-8B-UltraMedical

tokenizer_llama8b_med = AutoTokenizer.from_pretrained(
    "TsinghuaC3I/Llama-3-8B-UltraMedical",
    trust_remote_code=True,
    use_auth_token=True
)

model_llama8b_med = AutoModelForCausalLM.from_pretrained(
    "TsinghuaC3I/Llama-3-8B-UltraMedical",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float32,
    use_auth_token=True
)

print("✅ Loaded Llama-3-8B-UltraMedical (FP32, device-mapped)")



Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]



✅ Loaded Llama-3-8B-UltraMedical (FP32, device-mapped)


In [None]:
# Inspect your GPU's memory usage
!nvidia-smi

Wed May  7 02:54:17 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   35C    P0             50W /  400W |   37641MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

**This model took 33139 MiB of 40960 MiB, or about 32.4 GiB of GPU memory.**

In [None]:
# Offload Model for GPU Space
del model_llama8b_med
del tokenizer_llama8b_med

gc.collect()
torch.cuda.empty_cache()
time.sleep(5)
gc.collect()
torch.cuda.empty_cache()

## 🦙 Llama-3.2-3B

**Links**  
- 🤗 [Hugging Face Model Card](https://huggingface.co/meta-llama/Llama-3.2-3B)  
- 📄 [Paper / Source](https://huggingface.co/meta-llama/Llama-3.2-3B)

**Approximate GPU Memory Requirements:**
- **FP32**: ~14.9 GB  

> These are rough estimates. Actual usage depends on sequence length, architecture-specific memory optimizations, and tokenizer overhead.


In [None]:
#Llama-3.2-3B

tokenizer_llama3b = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    trust_remote_code=True,
    use_auth_token=True
)

model_llama3b = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float32,
    use_auth_token=True
)

print("✅ Loaded Llama-3.2-3B (FP32, device-mapped)")





✅ Loaded Llama-3.2-3B (FP32, device-mapped)


In [None]:
# Inspect your GPU's memory usage
print("\n--- NVIDIA-SMI Snapshot ---")
print(subprocess.getoutput("nvidia-smi"))


--- NVIDIA-SMI Snapshot ---
Wed May  7 02:56:29 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   35C    P0             50W /  400W |   14261MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                   

**This model took 14261 MiB of 40960 MiB, or about 13.9 GiB (14.95 GB in decimal units) of GPU memory.**
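
The memory figure can be pulled out of the `nvidia-smi` text and converted programmatically rather than by hand. This is a minimal sketch: the function names are illustrative, and the sample row is copied from the snapshot above. It reports both binary (GiB) and decimal (GB) units, since the two differ by about 7%.

```python
import re


def parse_memory_usage(smi_text):
    """Return (used_mib, total_mib) from the first 'NNNNMiB / NNNNMiB' field."""
    match = re.search(r"(\d+)\s*MiB\s*/\s*(\d+)\s*MiB", smi_text)
    if match is None:
        raise ValueError("no memory-usage field found")
    return int(match.group(1)), int(match.group(2))


def mib_to_gib(mib):
    """Binary units: 1 GiB = 1024 MiB."""
    return mib / 1024


def mib_to_gb(mib):
    """Decimal units: 1 MiB = 1,048,576 bytes and 1 GB = 1e9 bytes."""
    return mib * 1024**2 / 1e9


# Row copied from the nvidia-smi snapshot above.
row = "| N/A   35C    P0             50W /  400W |   14261MiB /  40960MiB |      0%      Default |"
used, total = parse_memory_usage(row)
print(f"{used} MiB = {mib_to_gib(used):.2f} GiB = {mib_to_gb(used):.2f} GB "
      f"({used / total:.0%} of {total} MiB)")
```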

In [None]:
# Offload Model for GPU Space
del model_llama3b
del tokenizer_llama3b
gc.collect()
torch.cuda.empty_cache()
time.sleep(5)
gc.collect()
torch.cuda.empty_cache()

## 🐉 Qwen3-4B

**Links**  
- 🤗 [Hugging Face Model Card](https://huggingface.co/Qwen/Qwen3-4B)  
- 📄 [Paper / Source](https://arxiv.org/abs/2407.10671) *(Qwen2 technical report for reference; a Qwen3 paper may be pending)*

**Approximate GPU Memory Requirements:**
- **FP32**: ~16.9 GB  


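> The original Qwen releases required `trust_remote_code=True` because they shipped custom model code. Qwen3 is supported natively in recent `transformers` releases (including the 4.51.3 used here), so the flag is kept below only for consistency with the other loading cells.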

In [None]:
# Qwen3-4B

tokenizer_qwen4b = AutoTokenizer.from_pretrained(
    "Qwen/Qwen3-4B",
    trust_remote_code=True,
    use_auth_token=True
)

model_qwen4b = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float32,
    use_auth_token=True
)

print("✅ Loaded Qwen3-4B (FP32, device-mapped)")



✅ Loaded Qwen3-4B (FP32, device-mapped)


In [None]:
# Inspect your GPU's memory usage
print("\n--- NVIDIA-SMI Snapshot ---")
print(subprocess.getoutput("nvidia-smi"))


--- NVIDIA-SMI Snapshot ---
Wed May  7 02:57:30 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   35C    P0             50W /  400W |   17331MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                   

**This model took 17331 MiB of 40960 MiB, or about 16.9 GiB of GPU memory.**

In [None]:
# Offload Model for GPU Space
del model_qwen4b
del tokenizer_qwen4b
gc.collect()
torch.cuda.empty_cache()
time.sleep(5)
gc.collect()
torch.cuda.empty_cache()

# 📊 Benchmarking

To establish performance baselines, we will:

* Load each model in full float32 (already implemented above).
* Run each model through standard medical QA tasks (e.g., PubMedQA).
* Repeat each benchmark 5 times and average the results.
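
The repeat-and-average procedure can be sketched as a small timing harness. The helper names here are illustrative assumptions; the sample durations are the five Llama-3-8B-UltraMedical load times reported later in this notebook. Both the population standard deviation (the `np.std` default) and the sample standard deviation are shown, since they differ noticeably for only five trials.

```python
import statistics
import time


def time_trials(fn, trials=5):
    """Call fn() repeatedly and return the wall-clock duration of each call."""
    durations = []
    for _ in range(trials):
        start = time.monotonic()
        fn()
        durations.append(time.monotonic() - start)
    return durations


def summarize(durations):
    """Mean plus population (divide by n) and sample (divide by n-1) std dev."""
    return {
        "mean": statistics.mean(durations),
        "pstdev": statistics.pstdev(durations),
        "stdev": statistics.stdev(durations),
    }


# Load times (seconds) reported later in this notebook for the 8B model.
reported = [5.21, 5.11, 5.11, 5.10, 5.10]
summary = summarize(reported)
print(f"mean={summary['mean']:.2f}s  pstdev={summary['pstdev']:.2f}s  "
      f"stdev={summary['stdev']:.2f}s")
```

Using `time.monotonic()` matches the timing cells in this notebook and is unaffected by system clock adjustments.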


In [None]:
# Import section dependencies
import platform
import psutil
import distro
import numpy as np

In [None]:
!wandb login

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mdyh2111[0m ([33mmed-moe[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [None]:
# ==========================
# 🖥️ System & OS Information
# ==========================
system_info = platform.uname()

print("🖥️ System Information")
print("-" * 40)
print(f"Node Name      : {system_info.node}")
print(f"System         : {platform.system()}")
print(f"OS Flavor      : {distro.name()}")
print(f"OS Version     : {distro.version()}")
print(f"Release        : {system_info.release}")
print(f"Architecture   : {platform.machine()}")
print(f"Python Version : {platform.python_version()}")

# =====================
# 🧠 CPU Information
# =====================
cpu_count = psutil.cpu_count(logical=False)
logical_cpu_count = psutil.cpu_count(logical=True)

print("\n🧠 CPU Information")
print("-" * 40)
print(f"Processor      : {system_info.processor or platform.processor()}")
print(f"Physical Cores : {cpu_count}")
print(f"Logical Cores  : {logical_cpu_count}")

# ======================
# 🧠 Memory Information
# ======================
memory_info = psutil.virtual_memory()

print("\n🧠 Memory Information")
print("-" * 40)
print(f"Total RAM      : {memory_info.total / 1024 ** 3:.2f} GB")
print(f"Available RAM  : {memory_info.available / 1024 ** 3:.2f} GB")
print(f"Used RAM       : {memory_info.used / 1024 ** 3:.2f} GB")

# =======================
# 💾 Disk Information
# =======================
disk_info = psutil.disk_usage('/')

print("\n💾 Disk Information")
print("-" * 40)
print(f"Total Space    : {disk_info.total / 1024 ** 3:.2f} GB")
print(f"Used Space     : {disk_info.used / 1024 ** 3:.2f} GB")
print(f"Free Space     : {disk_info.free / 1024 ** 3:.2f} GB")

# =======================
# 🧠 GPU Information
# =======================

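print("\n🧠 GPU Info")
cuda_available = torch.cuda.is_available()  # query the runtime instead of hardcoding True
print("GPU:", torch.cuda.get_device_name(0) if cuda_available else "None detected")
print("CUDA Available:", cuda_available)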

🖥️ System Information
----------------------------------------
Node Name      : 5d6c33d010a6
System         : Linux
OS Flavor      : Ubuntu
OS Version     : 22.04
Release        : 6.1.123+
Architecture   : x86_64
Python Version : 3.11.12

🧠 CPU Information
----------------------------------------
Processor      : x86_64
Physical Cores : 6
Logical Cores  : 12

🧠 Memory Information
----------------------------------------
Total RAM      : 83.48 GB
Available RAM  : 79.70 GB
Used RAM       : 2.89 GB

💾 Disk Information
----------------------------------------
Total Space    : 235.68 GB
Used Space     : 70.45 GB
Free Space     : 165.21 GB

🧠 GPU Info
GPU: NVIDIA A100-SXM4-40GB
CUDA Available: True


## 🦙 Llama-3-8B-UltraMedical

## Measure baseline performance

In [None]:
#Load Llama-3-8B-UltraMedical
model_name = "TsinghuaC3I/Llama-3-8B-UltraMedical"

# Load tokenizer once (doesn’t affect model loading time)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_auth_token=True
)

load_times = []

trials = 5

print(f"⏳ Starting timed model loads ({trials} repetitions)...\n")

for i in range(trials):
    start_time = time.monotonic()

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.float32,
        use_auth_token=True
    )

    elapsed = time.monotonic() - start_time
    load_times.append(elapsed)
    print(f"✅ Run {i + 1}: Loaded in {elapsed:.2f} seconds")

    # Clean up between runs (free GPU memory)
    del model
    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(5)
    gc.collect()
    torch.cuda.empty_cache()

# Summary stats
mean_time = np.mean(load_times)
std_dev_time = np.std(load_times)

print(f"\n📊 {model_name} Load Time Summary (FP32)")
print(f"- Average Load Time: {mean_time:.2f} seconds")
print(f"- Std Dev:           {std_dev_time:.2f} seconds")



⏳ Starting timed model loads (5 repetitions)...





Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

✅ Run 1: Loaded in 5.21 seconds


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

✅ Run 2: Loaded in 5.11 seconds


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

✅ Run 3: Loaded in 5.11 seconds


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

✅ Run 4: Loaded in 5.10 seconds


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

✅ Run 5: Loaded in 5.10 seconds

📊 TsinghuaC3I/Llama-3-8B-UltraMedical Load Time Summary (FP32)
- Average Load Time: 5.13 seconds
- Std Dev:           0.04 seconds


## 🦙 Llama-3.2-3B

In [None]:
#Load Llama-3.2 3B

model_name = "meta-llama/Llama-3.2-3B"

# Load tokenizer once (doesn’t affect model loading time)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_auth_token=True
)

load_times = []

trials = 5

print(f"⏳ Starting timed model loads ({trials} repetitions)...\n")

for i in range(trials):
    start_time = time.monotonic()

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.float32,
        use_auth_token=True
    )

    elapsed = time.monotonic() - start_time
    load_times.append(elapsed)
    print(f"✅ Run {i + 1}: Loaded in {elapsed:.2f} seconds")

    # Clean up between runs (free GPU memory)
    del model
    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(5)
    gc.collect()
    torch.cuda.empty_cache()

# Summary stats
mean_time = np.mean(load_times)
std_dev_time = np.std(load_times)

print(f"\n📊 {model_name} Load Time Summary (FP32)")
print(f"- Average Load Time: {mean_time:.2f} seconds")
print(f"- Std Dev:           {std_dev_time:.2f} seconds")

⏳ Starting timed model loads (5 repetitions)...



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✅ Run 1: Loaded in 2.68 seconds


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✅ Run 2: Loaded in 2.67 seconds


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✅ Run 3: Loaded in 2.45 seconds


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✅ Run 4: Loaded in 2.45 seconds


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✅ Run 5: Loaded in 2.46 seconds

📊 meta-llama/Llama-3.2-3B Load Time Summary (FP32)
- Average Load Time: 2.54 seconds
- Std Dev:           0.11 seconds


In [None]:
import random
import json
import wandb
import subprocess
import time
import os
from datetime import datetime

# -----------------------------
# 🧠 Model and Task Config
# -----------------------------
model_name = "meta-llama/Llama-3.2-3B"
task_name = "pubmedqa"
output_base = "./results"

# -----------------------------
# 🚀 Start W&B run
# -----------------------------
run_name = f"{model_name.replace('/', '_')}_{task_name}_5x"
wandb_run = wandb.init(
    project="med-moe-baseline-evals",
    name=run_name,
    config={
        "model": model_name,
        "task": task_name,
        "batch_size": 8,
        "precision": "fp32",
        "eval_method": "lm_eval",
        "repeats": 5
    }
)

# -----------------------------
# 🔁 Run 5x Evaluation Loop
# -----------------------------
for i in range(5):
    print(f"\n🔁 Run {i + 1}/5")

    # Create timestamped output folder
    timestamp = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
    run_output_dir = os.path.join(output_base, f"run_{i+1}_{timestamp}")
    os.makedirs(run_output_dir, exist_ok=True)

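    # Define lm_eval command.
    # Note: lm_eval's hf backend loads weights with dtype="auto" unless a dtype
    # is given; add ",dtype=float32" to --model_args to match the fp32 label above.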
    command = [
        "lm_eval",
        "--model", "hf",
        "--tasks", task_name,
        "--model_args", f"pretrained={model_name},parallelize=True",
        "--device", "cuda:0",
        "--batch_size", "8",
        "--write_out",
        "--output_path", run_output_dir,
        "--trust_remote_code",
        "--confirm_run_unsafe_code"
    ]

    # Start timing
    start_time = time.monotonic()
    result = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.monotonic() - start_time

    print(f"✅ Run {i + 1} completed in {elapsed:.2f} seconds")
    print("STDOUT:\n", result.stdout)

    # -----------------------------
    # 📊 Find and parse result file
    # -----------------------------
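    result_file = None
    # lm_eval may nest its output under a per-model subfolder (e.g. results_*.json),
    # so search the run directory recursively rather than only its top level.
    for root, _dirs, files in os.walk(run_output_dir):
        for fname in sorted(files):
            if "results" in fname and fname.endswith(".json"):
                result_file = os.path.join(root, fname)
                break
        if result_file is not None:
            break

    if result_file is None:
        print(f"❌ No results JSON found under {run_output_dir}")
        continue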

    try:
        with open(result_file) as f:
            data = json.load(f)
        task_data = data["results"][task_name]

        acc = task_data.get("acc,none")
        stderr = task_data.get("acc_stderr,none")

        if acc is not None and stderr is not None:
            wandb_run.log({
                f"{task_name}/accuracy": acc,
                f"{task_name}/stderr": stderr,
                f"{task_name}/eval_time_sec": elapsed,
                "run_index": i + 1
            })
            print(f"📈 Logged to W&B: acc={acc:.3f}, stderr={stderr:.4f}")
        else:
            print(f"⚠️ Missing keys in result: {task_data.keys()}")

    except Exception as e:
        print(f"⚠️ Failed to parse results from {result_file}: {e}")

# -----------------------------
# ✅ Finish W&B run
# -----------------------------
wandb_run.finish()



🔁 Run 1/5
✅ Run 1 completed in 30.88 seconds
STDOUT:
 hf (pretrained=meta-llama/Llama-3.2-3B,parallelize=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks  |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|--------|------:|------|-----:|------|---|----:|---|-----:|
|pubmedqa|      1|none  |     0|acc   |↑  |0.732|±  |0.0198|


❌ No eval_results_*.json found in ./results/run_1_2025-05-07T03-52-27

🔁 Run 2/5
✅ Run 2 completed in 30.71 seconds
STDOUT:
 hf (pretrained=meta-llama/Llama-3.2-3B,parallelize=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks  |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|--------|------:|------|-----:|------|---|----:|---|-----:|
|pubmedqa|      1|none  |     0|acc   |↑  |0.732|±  |0.0198|


❌ No eval_results_*.json found in ./results/run_2_2025-05-07T03-52-58

🔁 Run 3/5
✅ Run 3 completed in 30.65 seconds
STDOUT:
 hf (pretrained=meta-llama/Llama
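
The log above shows each run finishing with an accuracy table but no result file being found. One likely cause, stated here as an assumption about recent `lm_eval` versions, is that results are written as `results_<timestamp>.json` inside a subfolder named after the model, not as `eval_results*.json` at the top level. The sketch below searches recursively and extracts the metrics. The helper names and the demo folder name are hypothetical, and the demo values are taken from the table printed above.

```python
import json
import os
import tempfile


def find_results_file(run_dir):
    """Return the newest '*results*.json' anywhere under run_dir, or None."""
    candidates = []
    for root, _dirs, files in os.walk(run_dir):
        for fname in files:
            if "results" in fname and fname.endswith(".json"):
                candidates.append(os.path.join(root, fname))
    return max(candidates, key=os.path.getmtime) if candidates else None


def load_task_metrics(path, task):
    """Return (accuracy, standard error) for one task from a results file."""
    with open(path) as f:
        data = json.load(f)
    task_data = data["results"][task]
    return task_data.get("acc,none"), task_data.get("acc_stderr,none")


# Demo on a synthetic directory; the subfolder name is hypothetical.
with tempfile.TemporaryDirectory() as run_dir:
    nested = os.path.join(run_dir, "meta-llama__Llama-3.2-3B")
    os.makedirs(nested)
    payload = {"results": {"pubmedqa": {"acc,none": 0.732, "acc_stderr,none": 0.0198}}}
    with open(os.path.join(nested, "results_demo.json"), "w") as f:
        json.dump(payload, f)
    demo_metrics = load_task_metrics(find_results_file(run_dir), "pubmedqa")

print(demo_metrics)
```

Factoring the search into one helper also avoids repeating the same parsing block in every evaluation cell.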

## 🐉 Qwen3-4B

In [None]:
# Load Qwen3 4B

model_name = "Qwen/Qwen3-4B"

# Load tokenizer once (doesn’t affect model loading time)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_auth_token=True
)

load_times = []

trials = 5

print(f"⏳ Starting timed model loads ({trials} repetitions)...\n")

for i in range(trials):
    start_time = time.monotonic()

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.float32,
        use_auth_token=True
    )

    elapsed = time.monotonic() - start_time
    load_times.append(elapsed)
    print(f"✅ Run {i + 1}: Loaded in {elapsed:.2f} seconds")

    # Clean up between runs (free GPU memory)
    del model
    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(5)
    gc.collect()
    torch.cuda.empty_cache()

# Summary stats
mean_time = np.mean(load_times)
std_dev_time = np.std(load_times)

print(f"\n📊 {model_name} Load Time Summary (FP32)")
print(f"- Average Load Time: {mean_time:.2f} seconds")
print(f"- Std Dev:           {std_dev_time:.2f} seconds")

⏳ Starting timed model loads (5 repetitions)...



Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

✅ Run 1: Loaded in 3.27 seconds


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

✅ Run 2: Loaded in 3.02 seconds


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

✅ Run 3: Loaded in 3.01 seconds


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

✅ Run 4: Loaded in 3.02 seconds


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

✅ Run 5: Loaded in 3.12 seconds

📊 Qwen/Qwen3-4B Load Time Summary (FP32)
- Average Load Time: 3.09 seconds
- Std Dev:           0.10 seconds


In [8]:
import random
import json
import wandb
import subprocess
import time
import os
from datetime import datetime

# -----------------------------
# 🧠 Model and Task Config
# -----------------------------
model_name = "Qwen/Qwen3-4B"
task_name = "pubmedqa"
output_base = "./results"

# -----------------------------
# 🚀 Start W&B run
# -----------------------------
run_name = f"{model_name.replace('/', '_')}_{task_name}_5x"
wandb_run = wandb.init(
    project="med-moe-baseline-evals",
    name=run_name,
    config={
        "model": model_name,
        "task": task_name,
        "batch_size": 8,
        "precision": "fp32",
        "eval_method": "lm_eval",
        "repeats": 5
    }
)

# -----------------------------
# 🔁 Run 5x Evaluation Loop
# -----------------------------
for i in range(5):
    print(f"\n🔁 Run {i + 1}/5")

    # Create timestamped output folder
    timestamp = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
    run_output_dir = os.path.join(output_base, f"run_{i+1}_{timestamp}")
    os.makedirs(run_output_dir, exist_ok=True)

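    # Define lm_eval command.
    # Note: lm_eval's hf backend loads weights with dtype="auto" unless a dtype
    # is given; add ",dtype=float32" to --model_args to match the fp32 label above.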
    command = [
        "lm_eval",
        "--model", "hf",
        "--tasks", task_name,
        "--model_args", f"pretrained={model_name},parallelize=True",
        "--device", "cuda:0",
        "--batch_size", "8",
        "--write_out",
        "--output_path", run_output_dir,
        "--trust_remote_code", "--confirm_run_unsafe_code"
    ]

    # Start timing
    start_time = time.monotonic()
    result = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.monotonic() - start_time

    print(f"✅ Run {i + 1} completed in {elapsed:.2f} seconds")
    print("STDOUT:\n", result.stdout)
    print("STDERR:\n", result.stderr)

    # -----------------------------
    # 📊 Find and parse result file
    # -----------------------------
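    result_file = None
    # lm_eval may nest its output under a per-model subfolder (e.g. results_*.json),
    # so search the run directory recursively rather than only its top level.
    for root, _dirs, files in os.walk(run_output_dir):
        for fname in sorted(files):
            if "results" in fname and fname.endswith(".json"):
                result_file = os.path.join(root, fname)
                break
        if result_file is not None:
            break

    if result_file is None:
        print(f"❌ No results JSON found under {run_output_dir}")
        continue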

    try:
        with open(result_file) as f:
            data = json.load(f)
        task_data = data["results"][task_name]

        acc = task_data.get("acc,none")
        stderr = task_data.get("acc_stderr,none")

        if acc is not None and stderr is not None:
            wandb_run.log({
                f"{task_name}/accuracy": acc,
                f"{task_name}/stderr": stderr,
                f"{task_name}/eval_time_sec": elapsed,
                "run_index": i + 1
            })
            print(f"📈 Logged to W&B: acc={acc:.3f}, stderr={stderr:.4f}")
        else:
            print(f"⚠️ Missing keys in result: {task_data.keys()}")

    except Exception as e:
        print(f"⚠️ Failed to parse results from {result_file}: {e}")

# -----------------------------
# ✅ Finish W&B run
# -----------------------------
wandb_run.finish()



🔁 Run 1/5
✅ Run 1 completed in 43.17 seconds
STDOUT:
 hf (pretrained=Qwen/Qwen3-4B,parallelize=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks  |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|--------|------:|------|-----:|------|---|----:|---|-----:|
|pubmedqa|      1|none  |     0|acc   |↑  |0.768|±  |0.0189|


STDERR:
 2025-05-09 01:54:35.138132: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1746755675.159982   13896 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746755675.166597   13896 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-05-09:01:54:51,923 INFO     [lm_eval.__main__:
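
The `±` column in the `lm_eval` table is a standard error over evaluation examples, not a spread across repeated runs, which is why identical runs report identical values. As a worked check, the sketch below assumes 500 evaluation examples and the usual sample-standard-deviation-over-root-n formula for a 0/1 accuracy metric. Under those assumptions it reproduces the values printed in this notebook, but it is not taken from the `lm_eval` source.

```python
import math


def accuracy_stderr(acc, n_examples):
    """Standard error of a mean of 0/1 outcomes, using the n-1 sample variance."""
    return math.sqrt(acc * (1.0 - acc) / (n_examples - 1))


N_EXAMPLES = 500  # assumption: size of the PubMedQA evaluation split used here

# Accuracies reported in this notebook's lm_eval output tables.
reported_acc = {"meta-llama/Llama-3.2-3B": 0.732, "Qwen/Qwen3-4B": 0.768}

for model, acc in reported_acc.items():
    print(f"{model}: acc={acc:.3f} ± {accuracy_stderr(acc, N_EXAMPLES):.4f}")
```

Because the gap between the two models (0.036) is under two standard errors of either estimate, this benchmark alone does not clearly separate them.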

In [9]:
import random
import json
import wandb
import subprocess
import time
import os
from datetime import datetime

# -----------------------------
# 🧠 Model and Task Config
# -----------------------------
model_name = "meta-llama/Llama-3.2-3B"
task_name = "pubmedqa"
output_base = "./results"

# -----------------------------
# 🚀 Start W&B run
# -----------------------------
run_name = f"{model_name.replace('/', '_')}_{task_name}_5x"
wandb_run = wandb.init(
    project="med-moe-baseline-evals",
    name=run_name,
    config={
        "model": model_name,
        "task": task_name,
        "batch_size": 8,
        "precision": "fp32",
        "eval_method": "lm_eval",
        "repeats": 5
    }
)

# -----------------------------
# 🔁 Run 5x Evaluation Loop
# -----------------------------
for i in range(5):
    print(f"\n🔁 Run {i + 1}/5")

    # Create timestamped output folder
    timestamp = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
    run_output_dir = os.path.join(output_base, f"run_{i+1}_{timestamp}")
    os.makedirs(run_output_dir, exist_ok=True)

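    # Define lm_eval command.
    # Note: lm_eval's hf backend loads weights with dtype="auto" unless a dtype
    # is given; add ",dtype=float32" to --model_args to match the fp32 label above.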
    command = [
        "lm_eval",
        "--model", "hf",
        "--tasks", task_name,
        "--model_args", f"pretrained={model_name},parallelize=True",
        "--device", "cuda:0",
        "--batch_size", "8",
        "--write_out",
        "--output_path", run_output_dir,
        "--trust_remote_code", "--confirm_run_unsafe_code"
    ]

    # Start timing
    start_time = time.monotonic()
    result = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.monotonic() - start_time

    print(f"✅ Run {i + 1} completed in {elapsed:.2f} seconds")
    print("STDOUT:\n", result.stdout)
    print("STDERR:\n", result.stderr)

    # -----------------------------
    # 📊 Find and parse result file
    # -----------------------------
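    result_file = None
    # lm_eval may nest its output under a per-model subfolder (e.g. results_*.json),
    # so search the run directory recursively rather than only its top level.
    for root, _dirs, files in os.walk(run_output_dir):
        for fname in sorted(files):
            if "results" in fname and fname.endswith(".json"):
                result_file = os.path.join(root, fname)
                break
        if result_file is not None:
            break

    if result_file is None:
        print(f"❌ No results JSON found under {run_output_dir}")
        continue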

    try:
        with open(result_file) as f:
            data = json.load(f)
        task_data = data["results"][task_name]

        acc = task_data.get("acc,none")
        stderr = task_data.get("acc_stderr,none")

        if acc is not None and stderr is not None:
            wandb_run.log({
                f"{task_name}/accuracy": acc,
                f"{task_name}/stderr": stderr,
                f"{task_name}/eval_time_sec": elapsed,
                "run_index": i + 1
            })
            print(f"📈 Logged to W&B: acc={acc:.3f}, stderr={stderr:.4f}")
        else:
            print(f"⚠️ Missing keys in result: {task_data.keys()}")

    except Exception as e:
        print(f"⚠️ Failed to parse results from {result_file}: {e}")

# -----------------------------
# ✅ Finish W&B run
# -----------------------------
wandb_run.finish()



🔁 Run 1/5
✅ Run 1 completed in 60.62 seconds
STDOUT:
 hf (pretrained=meta-llama/Llama-3.2-3B,parallelize=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks  |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|--------|------:|------|-----:|------|---|----:|---|-----:|
|pubmedqa|      1|none  |     0|acc   |↑  |0.732|±  |0.0198|


STDERR:
 2025-05-09 02:00:51.229049: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1746756051.250535   15878 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746756051.257001   15878 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-05-09:02:01:08,057 INFO     [lm_eval