# 🏥 Medical Model Optimization + Mixture of Experts  
## ⚙️ Notebook 3: Experts Benchmarks

**Authors:**  
- Dan Harvey  
- Xinzhuo Jiang  

**Affiliation:**  
*High-Performance Machine Learning (HPML)*  
*Columbia University*


---

### 🔍 Project Overview

In this notebook, we benchmark three domain experts (cardiology, orthopedic, and mental health) built on `Llama-3-8B-UltraMedical` against the base model on PubMedQA, and measure how 4-bit loading affects accuracy and evaluation time.


### 🎯 Objectives

- Evaluate each expert adapter on `pubmedqa` with `lm_eval`, loading the base model in 4-bit
- Repeat each evaluation five times and time every run
- Log accuracy, standard error, and wall-clock time to Weights & Biases
- Compare the experts against the unquantized base model

We use Hugging Face’s quantization options and BitsAndBytes for 4-bit support.
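
As a rough illustration of why 4-bit loading matters for memory, the sketch below estimates the weight-only footprint of an 8-billion-parameter model at 16-, 8-, and 4-bit precision. The parameter count (`8e9`) is a round-number assumption and the helper `weight_footprint_gib` is hypothetical; the estimate ignores activations, the KV cache, and quantization metadata, so measured GPU usage will be higher.

```python
def weight_footprint_gib(num_params: float, bits_per_param: int) -> float:
    """Weight-only memory in GiB (ignores activations, KV cache, quantization metadata)."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 8e9  # assumption: round figure for an "8B" model

footprints = {
    name: weight_footprint_gib(NUM_PARAMS, bits)
    for name, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]
}

for name, gib in footprints.items():
    print(f"{name:>5}: {gib:5.2f} GiB")
```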

In [1]:
## 📦 Environment Setup: Dependencies and Imports

import torch
import time
import os
import subprocess
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig
import gc
import sys
import importlib
!pip install -U bitsandbytes



In [2]:
# Add project root to path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

# Required packages
required_packages = [
    'torch', 'transformers', 'datasets', 'accelerate', 'flash_attn',
    'evaluate', 'lm_eval', 'sklearn', 'matplotlib', 'wandb',
    'tqdm', 'sentencepiece', 'scipy', 'einops'
]

# Check and install missing packages
for package in required_packages:
    try:
        module = importlib.import_module(package)
        print(f"✅ {package} installed successfully")
        if package == 'torch':
            print(f"   Version: {torch.__version__}")
            print(f"   CUDA available: {torch.cuda.is_available()}")
            if torch.cuda.is_available():
                print(f"   CUDA version: {torch.version.cuda}")
                print(f"   GPU: {torch.cuda.get_device_name(0)}")
        elif hasattr(module, '__version__'):
            print(f"   Version: {module.__version__}")
    except ImportError:
        print(f"❌ {package} not found. Installing...")
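        # The pip name differs from the import name for scikit-learn
        pip_name = {'sklearn': 'scikit-learn'}.get(package, package)
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])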
        module = importlib.import_module(package)
        print(f"✅ {package} installed successfully (post-install)")
        if hasattr(module, '__version__'):
            print(f"   Version: {module.__version__}")

# You may need to restart the Kernel to use these

✅ torch installed successfully
   Version: 2.6.0+cu124
   CUDA available: True
   CUDA version: 12.4
   GPU: NVIDIA A100-SXM4-40GB
✅ transformers installed successfully
   Version: 4.51.3
✅ datasets installed successfully
   Version: 3.6.0
✅ accelerate installed successfully
   Version: 1.6.0
✅ flash_attn installed successfully
   Version: 2.7.4.post1
✅ evaluate installed successfully
   Version: 0.4.3
✅ lm_eval installed successfully
✅ sklearn installed successfully
   Version: 1.6.1
✅ matplotlib installed successfully
   Version: 3.10.0
✅ wandb installed successfully
   Version: 0.19.10
✅ tqdm installed successfully
   Version: 4.67.1
✅ sentencepiece installed successfully
   Version: 0.2.0
✅ scipy installed successfully
   Version: 1.15.2
✅ einops installed successfully
   Version: 0.8.1
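
The check above prints no version for `lm_eval`, presumably because the module does not expose a `__version__` attribute. A minimal sketch of an alternative lookup through the standard library's `importlib.metadata`, which reads the installed distribution's metadata instead; the helper `dist_version` is hypothetical, and the names passed to it are pip distribution names, which can differ from import names.

```python
from importlib import metadata
from typing import Optional


def dist_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a pip distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names here are illustrative; adjust them to your environment
for name in ["lm_eval", "scikit-learn", "torch"]:
    print(f"{name}: {dist_version(name) or 'not installed'}")
```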


In [3]:
# Mount drive to get the experts models
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# Load section dependencies
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import gc

In [5]:
# 🔐 Hugging Face Access - Llama is Gated
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineG

In [6]:
# Import section dependencies
import platform
import psutil
import distro
import numpy as np

# ==========================
# 🖥️ System & OS Information
# ==========================
system_info = platform.uname()

print("🖥️ System Information")
print("-" * 40)
print(f"Node Name      : {system_info.node}")
print(f"System         : {platform.system()}")
print(f"OS Flavor      : {distro.name()}")
print(f"OS Version     : {distro.version()}")
print(f"Release        : {system_info.release}")
print(f"Architecture   : {platform.machine()}")
print(f"Python Version : {platform.python_version()}")

# =====================
# 🧠 CPU Information
# =====================
cpu_count = psutil.cpu_count(logical=False)
logical_cpu_count = psutil.cpu_count(logical=True)

print("\n🧠 CPU Information")
print("-" * 40)
print(f"Processor      : {system_info.processor or platform.processor()}")
print(f"Physical Cores : {cpu_count}")
print(f"Logical Cores  : {logical_cpu_count}")

# ======================
# 🧠 Memory Information
# ======================
memory_info = psutil.virtual_memory()

print("\n🧠 Memory Information")
print("-" * 40)
print(f"Total RAM      : {memory_info.total / 1024 ** 3:.2f} GB")
print(f"Available RAM  : {memory_info.available / 1024 ** 3:.2f} GB")
print(f"Used RAM       : {memory_info.used / 1024 ** 3:.2f} GB")

# =======================
# 💾 Disk Information
# =======================
disk_info = psutil.disk_usage('/')

print("\n💾 Disk Information")
print("-" * 40)
print(f"Total Space    : {disk_info.total / 1024 ** 3:.2f} GB")
print(f"Used Space     : {disk_info.used / 1024 ** 3:.2f} GB")
print(f"Free Space     : {disk_info.free / 1024 ** 3:.2f} GB")

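# =======================
# 🧠 GPU Information
# =======================

print("\n🧠 GPU Info")
# Query CUDA availability instead of hard-coding it
cuda_available = torch.cuda.is_available()
print("GPU:", torch.cuda.get_device_name(0) if cuda_available else "none detected")
print("CUDA Available:", cuda_available)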

🖥️ System Information
----------------------------------------
Node Name      : d5718e995c35
System         : Linux
OS Flavor      : Ubuntu
OS Version     : 22.04
Release        : 6.1.123+
Architecture   : x86_64
Python Version : 3.11.12

🧠 CPU Information
----------------------------------------
Processor      : x86_64
Physical Cores : 6
Logical Cores  : 12

🧠 Memory Information
----------------------------------------
Total RAM      : 83.48 GB
Available RAM  : 80.37 GB
Used RAM       : 2.24 GB

💾 Disk Information
----------------------------------------
Total Space    : 235.68 GB
Used Space     : 87.98 GB
Free Space     : 147.68 GB

🧠 GPU Info
GPU: NVIDIA A100-SXM4-40GB
CUDA Available: True


## 🦙 Llama-3-8B-UltraMedical MoE

**Experts**
- Cardiology Expert: 🤗 [Hugging Face Model Card](https://huggingface.co/xj2193/medmoe-cardiology-expert)
- Orthopedic Expert: 🤗 [Hugging Face Model Card](https://huggingface.co/xj2193/medmoe-orthopedic-expert)
- Mental Health Expert: 🤗 [Hugging Face Model Card](https://huggingface.co/xj2193/medmoe-mentalhealth-expert)

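Each expert is a PEFT adapter for `TsinghuaC3I/Llama-3-8B-UltraMedical`. In the benchmarks below, the base model is loaded in 4-bit precision with the adapter applied on top.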
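
Each benchmark cell passes the adapter to `lm_eval` through a comma-separated `--model_args` string. A minimal sketch of assembling that string from keyword arguments; the helper `build_model_args` is hypothetical, and the keys shown mirror the ones used in this notebook.

```python
def build_model_args(pretrained, peft=None, **options):
    """Join model options into the comma-separated key=value form used by --model_args."""
    parts = [f"pretrained={pretrained}"]
    if peft is not None:
        parts.append(f"peft={peft}")
    # Keyword arguments keep their call order (Python 3.7+)
    parts.extend(f"{key}={value}" for key, value in options.items())
    return ",".join(parts)


expert_args = build_model_args(
    "TsinghuaC3I/Llama-3-8B-UltraMedical",
    peft="xj2193/medmoe-cardiology-expert",
    load_in_4bit=True,
    parallelize=True,
    device_map="auto",
)
print(expert_args)
```

Building the string from one place makes it harder for the adapter named in the command to drift from the `peft_model_name` variable used to locate the results.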


### 🦙 Cardiology Expert Benchmarks

In [10]:
# Benchmark Cardiology Expert
import random
import json
import wandb
import subprocess
import time
import os
from datetime import datetime
import lm_eval

# -----------------------------
# 🧠 Model and Task Config
# -----------------------------

model_name = "TsinghuaC3I/Llama-3-8B-UltraMedical"
peft_model_name = "xj2193/medmoe-cardiology-expert"
task_name = "pubmedqa"
output_base = "./results"

# -----------------------------
# 🚀 Start W&B run
# -----------------------------
run_name = f"{model_name.replace('/', '_')}_{task_name}_cardiology_expert_5x"
wandb_run = wandb.init(
    project="med-moe-baseline-evals",
    name=run_name,
    config={
        "model": model_name,
        "task": task_name,
        "batch_size": 8,
        "precision": "4bit",  # the base model is loaded with load_in_4bit=True
        "eval_method": "lm_eval",
        "repeats": 5
    }
)

# -----------------------------
# 🔁 Run 5x Evaluation Loop
# -----------------------------
for i in range(5):
    print(f"\n🔁 Run {i + 1}/5")

    # Create timestamped output folder
    timestamp = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
    day = datetime.now().strftime("%Y-%m-%d")
    run_output_dir = os.path.join(output_base, f"run_{i+1}_{timestamp}")
    os.makedirs(run_output_dir, exist_ok=True)

    # Define lm_eval command
    command = [
        "lm_eval",
        "--model", "hf",
        "--tasks", task_name,
        "--model_args", f"pretrained={model_name},peft={peft_model_name},load_in_4bit=True,parallelize=True,device_map=auto,llm_int8_enable_fp32_cpu_offload=True",
        "--device", "cuda:0",
        "--batch_size", "8",
        "--write_out",
        "--output_path", run_output_dir,
        "--trust_remote_code",
        "--confirm_run_unsafe_code"
    ]

    # Start timing
    start_time = time.monotonic()
    proc = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.monotonic() - start_time

    print(f"✅ Run {i + 1} completed in {elapsed:.2f} seconds")
    print("STDOUT:\n", proc.stdout)
    if proc.returncode != 0:
        # Surface the error output when lm_eval exits abnormally
        print("STDERR:\n", proc.stderr)

    # -----------------------------
    # 📊 Find and parse result file
    # -----------------------------
    # Results are expected in a subdirectory named after the adapter ('/' becomes '__')
    result_dir = os.path.join(run_output_dir, peft_model_name.replace('/', '__'))
    result_file = None
    for fname in (os.listdir(result_dir) if os.path.isdir(result_dir) else []):
        print(fname)
        if fname.startswith(f"results_{day}") and fname.endswith(".json"):
            result_file = os.path.join(result_dir, fname)
            with open(result_file, 'r') as f:
                eval_results = json.load(f)
            acc = eval_results['results'][task_name]['acc,none']
            acc_stderr = eval_results['results'][task_name]['acc_stderr,none']

            wandb_run.log({f"{task_name}/eval_time_sec": elapsed,
                           f"{task_name}/accuracy": acc,
                           f"{task_name}/stderr": acc_stderr,
                           "run_index": i + 1
                           })
            print(f"📈 Logged to W&B: acc={acc:.3f}, stderr={acc_stderr:.4f}")

    if result_file is None:
        print(f"❌ No results_*.json found in {result_dir}")

# -----------------------------
# ✅ Finish W&B run
# -----------------------------
wandb_run.finish()



🔁 Run 1/5
✅ Run 1 completed in 70.41 seconds
STDOUT:
 hf (pretrained=TsinghuaC3I/Llama-3-8B-UltraMedical,peft=xj2193/medmoe-cardiology-expert,load_in_4bit=True,parallelize=True,device_map=auto,llm_int8_enable_fp32_cpu_offload=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks  |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|--------|------:|------|-----:|------|---|----:|---|-----:|
|pubmedqa|      1|none  |     0|acc   |↑  |0.672|±  | 0.021|


results_2025-05-08T21-57-33.765988.json
📈 Logged to W&B: acc=0.672, stderr=0.0210

🔁 Run 2/5
✅ Run 2 completed in 70.28 seconds
STDOUT:
 hf (pretrained=TsinghuaC3I/Llama-3-8B-UltraMedical,peft=xj2193/medmoe-cardiology-expert,load_in_4bit=True,parallelize=True,device_map=auto,llm_int8_enable_fp32_cpu_offload=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks  |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|--------|------:|------|-

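Run history:

| Metric                 | Trend |
|------------------------|-------|
| pubmedqa/accuracy      | ▁▁▁▁▁ |
| pubmedqa/eval_time_sec | ▇▆▆▁█ |
| pubmedqa/stderr        | ▁▁▁▁▁ |
| run_index              | ▁▃▅▆█ |

Run summary:

| Metric                 |    Value |
|------------------------|---------:|
| pubmedqa/accuracy      |    0.672 |
| pubmedqa/eval_time_sec | 70.49784 |
| pubmedqa/stderr        |  0.02102 |
| run_index              |      5.0 |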
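
Accuracy and standard error are flat across the five runs in the run history, which is what a deterministic evaluation would produce; only the wall-clock time varies. The logged standard error is also consistent with the standard error of the mean of per-question 0/1 scores. A minimal sketch, assuming a 500-question test set (the notebook does not state the size; 500 is inferred because it reproduces the logged value) and a hypothetical helper `accuracy_stderr`:

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n binary (0/1) scores, using the sample (n - 1) variance."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


N_QUESTIONS = 500  # assumption: not stated in the notebook

se = accuracy_stderr(0.672, N_QUESTIONS)
print(f"stderr = {se:.5f}")

# Approximate 95% interval under a normal approximation
low, high = 0.672 - 1.96 * se, 0.672 + 1.96 * se
print(f"95% CI = [{low:.3f}, {high:.3f}]")
```

Because repeating the run does not change the score, the uncertainty in accuracy comes from the finite test set, not from run-to-run noise.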


### 🦙 Orthopedic Expert Benchmarks

In [11]:
# Benchmark Orthopedic Expert
import random
import json
import wandb
import subprocess
import time
import os
from datetime import datetime
import lm_eval

# -----------------------------
# 🧠 Model and Task Config
# -----------------------------

model_name = "TsinghuaC3I/Llama-3-8B-UltraMedical"
peft_model_name = "xj2193/medmoe-orthopedic-expert"
task_name = "pubmedqa"
output_base = "./results"

# -----------------------------
# 🚀 Start W&B run
# -----------------------------
run_name = f"{model_name.replace('/', '_')}_{task_name}_orthopedic_expert_5x"
wandb_run = wandb.init(
    project="med-moe-baseline-evals",
    name=run_name,
    config={
        "model": model_name,
        "task": task_name,
        "batch_size": 8,
        "precision": "4bit",  # the base model is loaded with load_in_4bit=True
        "eval_method": "lm_eval",
        "repeats": 5
    }
)

# -----------------------------
# 🔁 Run 5x Evaluation Loop
# -----------------------------
for i in range(5):
    print(f"\n🔁 Run {i + 1}/5")

    # Create timestamped output folder
    timestamp = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
    day = datetime.now().strftime("%Y-%m-%d")
    run_output_dir = os.path.join(output_base, f"run_{i+1}_{timestamp}")
    os.makedirs(run_output_dir, exist_ok=True)

    # Define lm_eval command
    command = [
        "lm_eval",
        "--model", "hf",
        "--tasks", task_name,
        "--model_args", f"pretrained={model_name},peft={peft_model_name},load_in_4bit=True,parallelize=True,device_map=auto,llm_int8_enable_fp32_cpu_offload=True",
        "--device", "cuda:0",
        "--batch_size", "8",
        "--write_out",
        "--output_path", run_output_dir,
        "--trust_remote_code",
        "--confirm_run_unsafe_code"
    ]

    # Start timing
    start_time = time.monotonic()
    proc = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.monotonic() - start_time

    print(f"✅ Run {i + 1} completed in {elapsed:.2f} seconds")
    print("STDOUT:\n", proc.stdout)
    if proc.returncode != 0:
        # Surface the error output when lm_eval exits abnormally
        print("STDERR:\n", proc.stderr)

    # -----------------------------
    # 📊 Find and parse result file
    # -----------------------------
    # Results are expected in a subdirectory named after the adapter ('/' becomes '__')
    result_dir = os.path.join(run_output_dir, peft_model_name.replace('/', '__'))
    result_file = None
    for fname in (os.listdir(result_dir) if os.path.isdir(result_dir) else []):
        print(fname)
        if fname.startswith(f"results_{day}") and fname.endswith(".json"):
            result_file = os.path.join(result_dir, fname)
            with open(result_file, 'r') as f:
                eval_results = json.load(f)
            acc = eval_results['results'][task_name]['acc,none']
            acc_stderr = eval_results['results'][task_name]['acc_stderr,none']

            wandb_run.log({f"{task_name}/eval_time_sec": elapsed,
                           f"{task_name}/accuracy": acc,
                           f"{task_name}/stderr": acc_stderr,
                           "run_index": i + 1
                           })
            print(f"📈 Logged to W&B: acc={acc:.3f}, stderr={acc_stderr:.4f}")

    if result_file is None:
        print(f"❌ No results_*.json found in {result_dir}")

# -----------------------------
# ✅ Finish W&B run
# -----------------------------
wandb_run.finish()



🔁 Run 1/5
✅ Run 1 completed in 71.10 seconds
STDOUT:
 hf (pretrained=TsinghuaC3I/Llama-3-8B-UltraMedical,peft=xj2193/medmoe-orthopedic-expert,load_in_4bit=True,parallelize=True,device_map=auto,llm_int8_enable_fp32_cpu_offload=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks  |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|--------|------:|------|-----:|------|---|----:|---|-----:|
|pubmedqa|      1|none  |     0|acc   |↑  |0.776|±  |0.0187|


results_2025-05-08T22-04-01.675250.json
📈 Logged to W&B: acc=0.776, stderr=0.0187

🔁 Run 2/5
✅ Run 2 completed in 70.78 seconds
STDOUT:
 hf (pretrained=TsinghuaC3I/Llama-3-8B-UltraMedical,peft=xj2193/medmoe-orthopedic-expert,load_in_4bit=True,parallelize=True,device_map=auto,llm_int8_enable_fp32_cpu_offload=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks  |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|--------|------:|------|-

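Run history:

| Metric                 | Trend |
|------------------------|-------|
| pubmedqa/accuracy      | ▁▁▁▁▁ |
| pubmedqa/eval_time_sec | █▆▆▁▅ |
| pubmedqa/stderr        | ▁▁▁▁▁ |
| run_index              | ▁▃▅▆█ |

Run summary:

| Metric                 |    Value |
|------------------------|---------:|
| pubmedqa/accuracy      |    0.776 |
| pubmedqa/eval_time_sec | 70.70093 |
| pubmedqa/stderr        |  0.01866 |
| run_index              |      5.0 |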
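
The result lookup in these cells matches files by the current date, so it could miss a file if a run crosses midnight. One alternative, sketched here with a hypothetical helper `find_latest_results`, searches the run directory recursively and picks the most recently modified `results_*.json`. The demo builds a temporary directory that mimics the layout the cells expect; it does not call `lm_eval`.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def find_latest_results(run_dir) -> Optional[Path]:
    """Return the newest results_*.json anywhere under run_dir, or None if there is none."""
    candidates = sorted(
        Path(run_dir).rglob("results_*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


with tempfile.TemporaryDirectory() as tmp:
    # Mimic <run_output_dir>/<adapter name with '/' replaced by '__'>/results_<timestamp>.json
    model_dir = Path(tmp) / "run_1" / "xj2193__medmoe-orthopedic-expert"
    model_dir.mkdir(parents=True)
    demo_file = model_dir / "results_2025-05-08T22-04-01.675250.json"
    demo_file.write_text(json.dumps({"results": {"pubmedqa": {"acc,none": 0.776}}}))

    empty_dir = Path(tmp) / "run_2"
    empty_dir.mkdir()

    found = find_latest_results(Path(tmp) / "run_1")
    found_name = found.name
    demo_acc = json.loads(found.read_text())["results"]["pubmedqa"]["acc,none"]
    missing = find_latest_results(empty_dir)

print(found_name, demo_acc, missing)
```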


### 🦙 Mental Health Expert Benchmarks

In [12]:
# Benchmark Mental Health Expert
import random
import json
import wandb
import subprocess
import time
import os
from datetime import datetime
import lm_eval

# -----------------------------
# 🧠 Model and Task Config
# -----------------------------

model_name = "TsinghuaC3I/Llama-3-8B-UltraMedical"
peft_model_name = "xj2193/medmoe-mentalhealth-expert"
task_name = "pubmedqa"
output_base = "./results"

# -----------------------------
# 🚀 Start W&B run
# -----------------------------
run_name = f"{model_name.replace('/', '_')}_{task_name}_mentalhealth_expert_5x"
wandb_run = wandb.init(
    project="med-moe-baseline-evals",
    name=run_name,
    config={
        "model": model_name,
        "task": task_name,
        "batch_size": 8,
        "precision": "4bit",  # the base model is loaded with load_in_4bit=True
        "eval_method": "lm_eval",
        "repeats": 5
    }
)

# -----------------------------
# 🔁 Run 5x Evaluation Loop
# -----------------------------
for i in range(5):
    print(f"\n🔁 Run {i + 1}/5")

    # Create timestamped output folder
    timestamp = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
    day = datetime.now().strftime("%Y-%m-%d")
    run_output_dir = os.path.join(output_base, f"run_{i+1}_{timestamp}")
    os.makedirs(run_output_dir, exist_ok=True)

    # Define lm_eval command
    command = [
        "lm_eval",
        "--model", "hf",
        "--tasks", task_name,
        "--model_args", f"pretrained={model_name},peft={peft_model_name},load_in_4bit=True,parallelize=True,device_map=auto,llm_int8_enable_fp32_cpu_offload=True",
        "--device", "cuda:0",
        "--batch_size", "8",
        "--write_out",
        "--output_path", run_output_dir,
        "--trust_remote_code",
        "--confirm_run_unsafe_code"
    ]

    # Start timing
    start_time = time.monotonic()
    proc = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.monotonic() - start_time

    print(f"✅ Run {i + 1} completed in {elapsed:.2f} seconds")
    print("STDOUT:\n", proc.stdout)
    if proc.returncode != 0:
        # Surface the error output when lm_eval exits abnormally
        print("STDERR:\n", proc.stderr)

    # -----------------------------
    # 📊 Find and parse result file
    # -----------------------------
    # Results are expected in a subdirectory named after the adapter ('/' becomes '__')
    result_dir = os.path.join(run_output_dir, peft_model_name.replace('/', '__'))
    result_file = None
    for fname in (os.listdir(result_dir) if os.path.isdir(result_dir) else []):
        print(fname)
        if fname.startswith(f"results_{day}") and fname.endswith(".json"):
            result_file = os.path.join(result_dir, fname)
            with open(result_file, 'r') as f:
                eval_results = json.load(f)
            acc = eval_results['results'][task_name]['acc,none']
            acc_stderr = eval_results['results'][task_name]['acc_stderr,none']

            wandb_run.log({f"{task_name}/eval_time_sec": elapsed,
                           f"{task_name}/accuracy": acc,
                           f"{task_name}/stderr": acc_stderr,
                           "run_index": i + 1
                           })
            print(f"📈 Logged to W&B: acc={acc:.3f}, stderr={acc_stderr:.4f}")

    if result_file is None:
        print(f"❌ No results_*.json found in {result_dir}")

# -----------------------------
# ✅ Finish W&B run
# -----------------------------
wandb_run.finish()



🔁 Run 1/5
✅ Run 1 completed in 71.37 seconds
STDOUT:
 hf (pretrained=TsinghuaC3I/Llama-3-8B-UltraMedical,peft=xj2193/medmoe-mentalhealth-expert,load_in_4bit=True,parallelize=True,device_map=auto,llm_int8_enable_fp32_cpu_offload=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks  |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|--------|------:|------|-----:|------|---|----:|---|-----:|
|pubmedqa|      1|none  |     0|acc   |↑  | 0.77|±  |0.0188|


results_2025-05-08T22-10-02.585943.json
📈 Logged to W&B: acc=0.770, stderr=0.0188

🔁 Run 2/5
✅ Run 2 completed in 69.94 seconds
STDOUT:
 hf (pretrained=TsinghuaC3I/Llama-3-8B-UltraMedical,peft=xj2193/medmoe-mentalhealth-expert,load_in_4bit=True,parallelize=True,device_map=auto,llm_int8_enable_fp32_cpu_offload=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks  |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|--------|------:|----

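Run history:

| Metric                 | Trend |
|------------------------|-------|
| pubmedqa/accuracy      | ▁▁▁▁▁ |
| pubmedqa/eval_time_sec | █▂▄▃▁ |
| pubmedqa/stderr        | ▁▁▁▁▁ |
| run_index              | ▁▃▅▆█ |

Run summary:

| Metric                 |    Value |
|------------------------|---------:|
| pubmedqa/accuracy      |     0.77 |
| pubmedqa/eval_time_sec | 69.71517 |
| pubmedqa/stderr        |  0.01884 |
| run_index              |      5.0 |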
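
Because accuracy is constant across repeats, the main quantity the five runs add is a spread for wall-clock time. A minimal sketch of summarizing it, using only the three mental health timings visible in the truncated output above (runs 1 and 2 from the log, run 5 from the W&B summary); the helper `summarize_times` is hypothetical, and the full five values would come from the W&B run.

```python
import statistics


def summarize_times(times):
    """Basic descriptive statistics for a list of run times in seconds."""
    return {
        "n": len(times),
        "mean": statistics.mean(times),
        "stdev": statistics.stdev(times) if len(times) > 1 else 0.0,
        "min": min(times),
        "max": max(times),
    }


# Runs 1, 2, and 5; runs 3 and 4 are not visible in the captured output
visible_times = [71.37, 69.94, 69.71517]

summary = summarize_times(visible_times)
print(f"n={summary['n']}  mean={summary['mean']:.2f}s  stdev={summary['stdev']:.2f}s  "
      f"range=[{summary['min']:.2f}, {summary['max']:.2f}]s")
```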


### 🦙 Base Model Benchmarks

In [16]:
# Benchmark Original Llama-3-8B-UltraMedical
import random
import json
import wandb
import subprocess
import time
import os
from datetime import datetime
import lm_eval

# -----------------------------
# 🧠 Model and Task Config
# -----------------------------

model_name = "TsinghuaC3I/Llama-3-8B-UltraMedical"
task_name = "pubmedqa"
output_base = "./results"

# -----------------------------
# 🚀 Start W&B run
# -----------------------------
run_name = f"{model_name.replace('/', '_')}_{task_name}_5x"
wandb_run = wandb.init(
    project="med-moe-baseline-evals",
    name=run_name,
    config={
        "model": model_name,
        "task": task_name,
        "batch_size": 8,
        "precision": "fp16",
        "eval_method": "lm_eval",
        "repeats": 5
    }
)

# -----------------------------
# 🔁 Run 5x Evaluation Loop
# -----------------------------
for i in range(5):
    print(f"\n🔁 Run {i + 1}/5")

    # Create timestamped output folder
    timestamp = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
    day = datetime.now().strftime("%Y-%m-%d")
    run_output_dir = os.path.join(output_base, f"run_{i+1}_{timestamp}")
    os.makedirs(run_output_dir, exist_ok=True)

    # Define lm_eval command
    command = [
        "lm_eval",
        "--model", "hf",
        "--tasks", task_name,
        "--model_args", f"pretrained={model_name},parallelize=True,device_map=auto",
        "--device", "cuda:0",
        "--batch_size", "8",
        "--write_out",
        "--output_path", run_output_dir,
        "--trust_remote_code",
        "--confirm_run_unsafe_code"
    ]

    # Start timing
    start_time = time.monotonic()
    proc = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.monotonic() - start_time

    print(f"✅ Run {i + 1} completed in {elapsed:.2f} seconds")
    print("STDOUT:\n", proc.stdout)
    if proc.returncode != 0:
        # Surface the error output when lm_eval exits abnormally
        print("STDERR:\n", proc.stderr)

    # -----------------------------
    # 📊 Find and parse result file
    # -----------------------------
    # Results are expected in a subdirectory named after the model ('/' becomes '__')
    result_dir = os.path.join(run_output_dir, model_name.replace('/', '__'))
    result_file = None
    for fname in (os.listdir(result_dir) if os.path.isdir(result_dir) else []):
        print(fname)
        if fname.startswith(f"results_{day}") and fname.endswith(".json"):
            result_file = os.path.join(result_dir, fname)
            with open(result_file, 'r') as f:
                eval_results = json.load(f)
            acc = eval_results['results'][task_name]['acc,none']
            acc_stderr = eval_results['results'][task_name]['acc_stderr,none']

            wandb_run.log({f"{task_name}/eval_time_sec": elapsed,
                           f"{task_name}/accuracy": acc,
                           f"{task_name}/stderr": acc_stderr,
                           "run_index": i + 1
                           })
            print(f"📈 Logged to W&B: acc={acc:.3f}, stderr={acc_stderr:.4f}")

    if result_file is None:
        print(f"❌ No results_*.json found in {result_dir}")

# -----------------------------
# ✅ Finish W&B run
# -----------------------------
wandb_run.finish()



🔁 Run 1/5
✅ Run 1 completed in 47.71 seconds
STDOUT:
 hf (pretrained=TsinghuaC3I/Llama-3-8B-UltraMedical,parallelize=True,device_map=auto,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks  |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|--------|------:|------|-----:|------|---|----:|---|-----:|
|pubmedqa|      1|none  |     0|acc   |↑  | 0.76|±  |0.0191|


results_2025-05-08T22-50-48.255789.json
📈 Logged to W&B: acc=0.760, stderr=0.0191

🔁 Run 2/5
✅ Run 2 completed in 47.81 seconds
STDOUT:
 hf (pretrained=TsinghuaC3I/Llama-3-8B-UltraMedical,parallelize=True,device_map=auto,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks  |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|--------|------:|------|-----:|------|---|----:|---|-----:|
|pubmedqa|      1|none  |     0|acc   |↑  | 0.76|±  |0.0191|


results_2025-05-08T22-51-36.046583.json
📈 Logged to W&B: acc=0.760, stderr=0.0191

🔁 Run

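Run history:

| Metric                 | Trend |
|------------------------|-------|
| pubmedqa/accuracy      | ▁▁▁▁▁ |
| pubmedqa/eval_time_sec | ▆█▃▁▃ |
| pubmedqa/stderr        | ▁▁▁▁▁ |
| run_index              | ▁▃▅▆█ |

Run summary:

| Metric                 |    Value |
|------------------------|---------:|
| pubmedqa/accuracy      |     0.76 |
| pubmedqa/eval_time_sec | 47.48327 |
| pubmedqa/stderr        |  0.01912 |
| run_index              |      5.0 |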
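
To put the four results side by side, the sketch below compares each expert with the base model using the accuracies, standard errors, and final logged run times reported in this notebook. The z-score treats the two accuracies as independent estimates, which is a simplifying assumption: all models are scored on the same questions, so a paired test would be more appropriate if per-question outputs were available. The helper `compare_to_base` is hypothetical.

```python
import math

# (accuracy, stderr, last logged eval time in seconds), copied from the W&B run summaries
RESULTS = {
    "base": (0.760, 0.01912, 47.48327),
    "cardiology": (0.672, 0.02102, 70.49784),
    "orthopedic": (0.776, 0.01866, 70.70093),
    "mental health": (0.770, 0.01884, 69.71517),
}


def compare_to_base(results, base_key="base"):
    """Accuracy difference, approximate z-score, and time ratio of each model versus the base."""
    base_acc, base_se, base_time = results[base_key]
    rows = {}
    for name, (acc, se, seconds) in results.items():
        if name == base_key:
            continue
        diff = acc - base_acc
        # Assumption: independent estimates, so the variances add
        z = diff / math.sqrt(se**2 + base_se**2)
        rows[name] = {"acc_diff": diff, "z": z, "time_ratio": seconds / base_time}
    return rows


comparison = compare_to_base(RESULTS)
for name, row in comparison.items():
    print(f"{name:<14} acc_diff={row['acc_diff']:+.3f}  z={row['z']:+.2f}  "
          f"time_ratio={row['time_ratio']:.2f}")
```

Under that assumption, only the cardiology expert's drop exceeds the usual |z| > 1.96 threshold; the orthopedic and mental health gains over the base model are within noise. The 4-bit expert runs take roughly 1.5 times as long as the base run, although the measured time covers the whole `lm_eval` subprocess, including model loading.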
