# 🏥 Medical Model Optimization + Mixture of Experts  
## 📓 Notebook 1: Model Selection & Benchmarking

**Authors:**  
- Dan Harvey  
- Xinzhuo Jiang  

**Affiliation:**  
*High-Performance Machine Learning (HPML)*  
*Columbia University*


---

### 🔍 Project Overview

This project investigates how to optimize large language models (LLMs) for medical applications by leveraging modern efficiency techniques and modular model design.

### 🎯 Objectives

1. **Benchmark** multiple medical and general-purpose LLMs to establish quantitative performance baselines.  
2. **Optimize** selected models through quantization, pruning, and architectural tuning.  
3. **Design and evaluate** a Mixture-of-Experts (MoE) architecture with specialized, task-specific experts.

---

### 📁 This notebook focuses on Step 1: loading and benchmarking key candidate models.


In [8]:
## 📦 Environment Setup: Dependencies and Imports
import os
import sys
import importlib
import subprocess
import torch

In [4]:
# Add project root to path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

# Required packages as (import name, pip name) pairs; the two differ for some packages
required_packages = [
    ('torch', 'torch'), ('transformers', 'transformers'), ('datasets', 'datasets'),
    ('accelerate', 'accelerate'), ('flash_attn', 'flash-attn'), ('evaluate', 'evaluate'),
    ('lm_eval', 'lm-eval'), ('sklearn', 'scikit-learn'), ('matplotlib', 'matplotlib'),
    ('wandb', 'wandb'), ('tqdm', 'tqdm'), ('sentencepiece', 'sentencepiece'),
    ('scipy', 'scipy'), ('einops', 'einops'),
    ('lib_platform', 'lib-platform'),  # import name assumed from the PyPI package name
]

# Check and install missing packages
for package, pip_name in required_packages:
    try:
        module = importlib.import_module(package)
        print(f"✅ {package} installed successfully")
        if package == 'torch':
            print(f"   Version: {torch.__version__}")
            print(f"   CUDA available: {torch.cuda.is_available()}")
            if torch.cuda.is_available():
                print(f"   CUDA version: {torch.version.cuda}")
                print(f"   GPU: {torch.cuda.get_device_name(0)}")
        elif hasattr(module, '__version__'):
            print(f"   Version: {module.__version__}")
    except ImportError:
        print(f"❌ {package} not found. Installing...")
        # Install under the pip name, then import under the module name
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])
        module = importlib.import_module(package)
        print(f"✅ {package} installed successfully (post-install)")
        if hasattr(module, '__version__'):
            print(f"   Version: {module.__version__}")

# You may need to restart the Kernel to use these



✅ torch installed successfully
   Version: 2.6.0+cu124
   CUDA available: True
   CUDA version: 12.4
   GPU: NVIDIA A100-SXM4-40GB
✅ transformers installed successfully
   Version: 4.51.3
❌ datasets not found. Installing...
✅ datasets installed successfully (post-install)
   Version: 3.5.1
✅ accelerate installed successfully
   Version: 1.6.0
❌ flash_attn not found. Installing...
✅ flash_attn installed successfully (post-install)
   Version: 2.7.4.post1
❌ evaluate not found. Installing...
✅ evaluate installed successfully (post-install)
   Version: 0.4.3
❌ lm_eval not found. Installing...
✅ lm_eval installed successfully (post-install)
✅ sklearn installed successfully
   Version: 1.6.1
✅ matplotlib installed successfully
   Version: 3.10.0
✅ wandb installed successfully
   Version: 0.19.10
✅ tqdm installed successfully
   Version: 4.67.1
✅ sentencepiece installed successfully
   Version: 0.2.0
✅ scipy installed successfully
   Version: 1.15.2
✅ einops installed successfully
   Version:

In [20]:
# Load section dependencies
from transformers import AutoTokenizer, AutoModelForCausalLM
import gc
!git config --global credential.helper store

# 🧠 Model Selection: Baseline Models

We will work with the following Hugging Face models:

| Model Name                            | Size | Notes                                                          |
| ------------------------------------- | ---- | -------------------------------------------------------------- |
| `TsinghuaC3I/Llama-3-8B-UltraMedical` | 8B   | Medical domain-specific, fine-tuned, ideal teacher & benchmark |
| `meta-llama/Llama-3.2-3B`             | 3B   | Same architecture, smaller, ideal as an expert or student      |
| `Qwen/Qwen3-4B`                       | 4B   | Non-LLaMA expert for diversity in MoE                          |

These models will serve as the baseline in our pipeline and will be evaluated for:

- Performance on medical QA and reasoning tasks
- Suitability for distillation and expert specialization
- Impact of downstream optimizations (quantization, pruning, MoE routing)

📌 **Note**: All models are initially loaded in **full FP32 (float32) precision** to serve as accurate performance baselines before applying any quantization or memory optimization techniques.


In [8]:
# 🔐 Hugging Face Access - Llama is Gated
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: read)

# 📥 Load Baseline Models

## 🦙 Llama-3-8B-UltraMedical

**Links**  
- 🤗 [Hugging Face Model Card](https://huggingface.co/TsinghuaC3I/Llama-3-8B-UltraMedical)
- 📄 [Paper / Source](https://huggingface.co/TsinghuaC3I/Llama-3-8B-UltraMedical)

**Approximate GPU Memory Requirements:**
- **FP32**: ~32.4 GB  
- **FP16**: ~16 GB  
- **INT8**: ~8 GB  
- **INT4**: ~4 GB  

> These values are estimates and may vary based on sequence length, attention optimizations, and tokenizer overhead.
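
The precision-dependent estimates can be sanity-checked from parameter count times bytes per parameter. The sketch below is a weights-only approximation; the helper name `estimate_weight_gib` is illustrative, and the parameter count of roughly 8.03 billion for Llama-3-8B is an assumption that is not stated in this notebook.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def estimate_weight_gib(n_params: float, precision: str) -> float:
    """Weights-only estimate in GiB; ignores activations, KV cache, and CUDA context."""
    return n_params * BYTES_PER_PARAM[precision] / 1024**3


N_PARAMS_8B = 8.03e9  # assumption: approximate parameter count of Llama-3-8B

for precision in BYTES_PER_PARAM:
    gib = estimate_weight_gib(N_PARAMS_8B, precision)
    print(f"{precision.upper():>5}: ~{gib:.1f} GiB")
```

A measured footprint will usually sit somewhat above the weights-only figure, since the CUDA context and allocator overhead also occupy GPU memory.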


In [9]:
# Llama-3-8B-UltraMedical

tokenizer_llama8b_med = AutoTokenizer.from_pretrained(
    "TsinghuaC3I/Llama-3-8B-UltraMedical",
    trust_remote_code=True,
    token=True  # replaces the deprecated use_auth_token argument
)

model_llama8b_med = AutoModelForCausalLM.from_pretrained(
    "TsinghuaC3I/Llama-3-8B-UltraMedical",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float32,
    token=True
)

print("✅ Loaded Llama-3-8B-UltraMedical (FP32, device-mapped)")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/51.0k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/335 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/728 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/121 [00:00<?, ?B/s]

✅ Loaded Llama-3-8B-UltraMedical (FP32, device-mapped)


In [10]:
# Inspect your GPU's memory usage
!nvidia-smi

Tue May  6 21:45:46 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   31C    P0             47W /  400W |   33139MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

**This model took 33139 MiB of 40960 MiB, or about 32.4 GB of GPU memory.**
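
The usage figure can also be pulled out of the `nvidia-smi` text programmatically instead of being read by eye. The following is a minimal sketch, assuming the `used MiB / total MiB` layout shown in the snapshot above; `parse_memory_usage` is a hypothetical helper, not part of any library.

```python
import re


def parse_memory_usage(smi_text: str) -> tuple[int, int]:
    """Return (used_mib, total_mib) from the first 'NNNMiB / NNNMiB' field found."""
    match = re.search(r"(\d+)\s*MiB\s*/\s*(\d+)\s*MiB", smi_text)
    if match is None:
        raise ValueError("no 'used MiB / total MiB' field found")
    return int(match.group(1)), int(match.group(2))


# Memory line copied from the nvidia-smi snapshot above.
sample = "| N/A   31C    P0             47W /  400W |   33139MiB /  40960MiB |      0%      Default |"

used_mib, total_mib = parse_memory_usage(sample)
# Dividing MiB by 1024 gives GiB, which is the conversion used for the figure above.
print(f"{used_mib} MiB used = {used_mib / 1024:.1f} GiB "
      f"({100 * used_mib / total_mib:.0f}% of {total_mib} MiB)")
# prints: 33139 MiB used = 32.4 GiB (81% of 40960 MiB)
```

In the notebook, the same helper could be applied to `subprocess.getoutput("nvidia-smi")`, though on a multi-GPU machine it would only report the first device.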

In [11]:
# Offload Model for GPU Space
del model_llama8b_med
del tokenizer_llama8b_med
gc.collect()
torch.cuda.empty_cache()

## 🦙 Llama-3.2-3B

**Links**  
- 🤗 [Hugging Face Model Card](https://huggingface.co/meta-llama/Llama-3.2-3B)  
- 📄 [Paper / Source](https://huggingface.co/meta-llama/Llama-3.2-3B)

**Approximate GPU Memory Requirements:**
- **FP32**: ~13.9 GB  

> These are rough estimates. Actual usage depends on sequence length, architecture-specific memory optimizations, and tokenizer overhead.


In [31]:
# Llama-3.2-3B

tokenizer_llama3b = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    trust_remote_code=True,
    token=True  # replaces the deprecated use_auth_token argument
)

model_llama3b = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float32,
    token=True
)

print("✅ Loaded Llama-3.2-3B (FP32, device-mapped)")




Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✅ Loaded Llama-3.2-3B (FP32, device-mapped)


In [32]:
# Inspect your GPU's memory usage
!nvidia-smi

Tue May  6 21:52:25 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   31C    P0             47W /  400W |   14261MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

**This model took 14261 MiB of 40960 MiB, or about 13.9 GB (14261 / 1024) of GPU memory.**

In [33]:
# Offload Model for GPU Space
del model_llama3b
del tokenizer_llama3b
gc.collect()
torch.cuda.empty_cache()

## 🐉 Qwen3-4B

**Links**  
- 🤗 [Hugging Face Model Card](https://huggingface.co/Qwen/Qwen3-4B)  
- 📄 [Paper / Source](https://arxiv.org/abs/2407.10671) *(Qwen2 technical report, for reference; the Qwen3 paper may be pending)*

**Approximate GPU Memory Requirements:**
- **FP32**: ~16.9 GB  


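> Earlier Qwen releases required `trust_remote_code=True` for their custom model code. Qwen3 is supported natively from `transformers` 4.51.0 onward (this notebook runs 4.51.3), so the flag is kept below only for consistency with the other loaders.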

In [44]:
# Qwen3-4B

tokenizer_qwen4b = AutoTokenizer.from_pretrained(
    "Qwen/Qwen3-4B",
    trust_remote_code=True,
    token=True  # replaces the deprecated use_auth_token argument
)

model_qwen4b = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float32,
    token=True
)

print("✅ Loaded Qwen3-4B (FP32, device-mapped)")


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

✅ Loaded Qwen3-4B (FP32, device-mapped)


In [45]:
# Inspect your GPU's memory usage
print("\n--- NVIDIA-SMI Snapshot ---")
print(subprocess.getoutput("nvidia-smi"))

Tue May  6 21:56:41 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   31C    P0             47W /  400W |   17331MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

**This model took 17331 MiB of 40960 MiB, or about 16.9 GB of GPU memory.**

In [41]:
# Offload Model for GPU Space
del model_qwen4b
del tokenizer_qwen4b
gc.collect()
torch.cuda.empty_cache()

# 📊 Benchmarking

To establish performance baselines, we will:

* Load each model in full float32 (already implemented above).
* Run each model through standard medical QA tasks (e.g., PubMedQA).
* Repeat each benchmark 3 times and average the results.
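
The repeat-and-average step can be sketched independently of any particular model or dataset. In the example below, `repeat_benchmark` is a hypothetical helper, and `fake_pubmedqa_accuracy` is a stand-in that replays made-up placeholder scores so the sketch runs on its own; a real run would replace it with an actual PubMedQA evaluation.

```python
import statistics
from typing import Callable


def repeat_benchmark(run_once: Callable[[], float], repeats: int = 3) -> dict:
    """Call run_once `repeats` times and summarize the returned scores."""
    scores = [run_once() for _ in range(repeats)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        # Sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }


# Placeholder values only: these are NOT measured results.
_fake_scores = iter([0.70, 0.72, 0.74])


def fake_pubmedqa_accuracy() -> float:
    """Hypothetical stand-in for one full evaluation pass."""
    return next(_fake_scores)


summary = repeat_benchmark(fake_pubmedqa_accuracy, repeats=3)
print(f"mean={summary['mean']:.3f}  stdev={summary['stdev']:.3f}")
```

Reporting the spread alongside the mean makes it easier to tell whether a later optimization changed accuracy or whether the difference is within run-to-run noise.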


In [1]:
# Import section dependencies
import platform
import psutil
import distro

In [10]:
# ==========================
# 🖥️ System & OS Information
# ==========================
system_info = platform.uname()

print("🖥️ System Information")
print("-" * 40)
print(f"Node Name      : {system_info.node}")
print(f"System         : {platform.system()}")
print(f"OS Flavor      : {distro.name()}")
print(f"OS Version     : {distro.version()}")
print(f"Release        : {system_info.release}")
print(f"Architecture   : {platform.machine()}")
print(f"Python Version : {platform.python_version()}")

# =====================
# 🧠 CPU Information
# =====================
cpu_count = psutil.cpu_count(logical=False)
logical_cpu_count = psutil.cpu_count(logical=True)

print("\n🧠 CPU Information")
print("-" * 40)
print(f"Processor      : {system_info.processor or platform.processor()}")
print(f"Physical Cores : {cpu_count}")
print(f"Logical Cores  : {logical_cpu_count}")

# ======================
# 🧠 Memory Information
# ======================
memory_info = psutil.virtual_memory()

print("\n🧠 Memory Information")
print("-" * 40)
print(f"Total RAM      : {memory_info.total / 1024 ** 3:.2f} GB")
print(f"Available RAM  : {memory_info.available / 1024 ** 3:.2f} GB")
print(f"Used RAM       : {memory_info.used / 1024 ** 3:.2f} GB")

# =======================
# 💾 Disk Information
# =======================
disk_info = psutil.disk_usage('/')

print("\n💾 Disk Information")
print("-" * 40)
print(f"Total Space    : {disk_info.total / 1024 ** 3:.2f} GB")
print(f"Used Space     : {disk_info.used / 1024 ** 3:.2f} GB")
print(f"Free Space     : {disk_info.free / 1024 ** 3:.2f} GB")

# =======================
# 🧠 GPU Information
# =======================

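print("\n🧠 GPU Info")
cuda_available = torch.cuda.is_available()  # query the runtime instead of hardcoding True
print("GPU:", torch.cuda.get_device_name(0) if cuda_available else "None detected")
print("CUDA Available:", cuda_available)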

🖥️ System Information
----------------------------------------
Node Name      : acc49957f0e2
System         : Linux
OS Flavor      : Ubuntu
OS Version     : 22.04
Release        : 6.1.123+
Architecture   : x86_64
Python Version : 3.11.12

🧠 CPU Information
----------------------------------------
Processor      : x86_64
Physical Cores : 6
Logical Cores  : 12

🧠 Memory Information
----------------------------------------
Total RAM      : 83.48 GB
Available RAM  : 81.23 GB
Used RAM       : 1.47 GB

💾 Disk Information
----------------------------------------
Total Space    : 235.68 GB
Used Space     : 37.09 GB
Free Space     : 198.57 GB

🧠 GPU Info
GPU: NVIDIA A100-SXM4-40GB
CUDA Available: True


## 🦙 Llama-3-8B-UltraMedical

## Measure baseline performance

In [None]:
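# Load Llama-3-8B-UltraMedical
import time          # wall-clock timing of each load
import numpy as np   # mean / standard deviation of load times

model_name = "TsinghuaC3I/Llama-3-8B-UltraMedical"

# Load tokenizer once (doesn't affect model loading time)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    token=True  # replaces the deprecated use_auth_token argument
)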

load_times = []

trials = 5

print(f"⏳ Starting timed model loads ({trials} repetitions)...\n")

for i in range(trials):
    start_time = time.time()

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        device_map="auto",
        torch_dtype=torch.float32,
        token=True
    )

    elapsed = time.time() - start_time
    load_times.append(elapsed)
    print(f"✅ Run {i + 1}: Loaded in {elapsed:.2f} seconds")

    # Clean up between runs (free GPU memory)
    del model
    gc.collect()
    torch.cuda.empty_cache()
    time.sleep(5)
    gc.collect()
    torch.cuda.empty_cache()

# Summary stats
mean_time = np.mean(load_times)
std_dev_time = np.std(load_times)

print(f"\n📊 {model_name} Load Time Summary (FP32)")
print(f"- Average Load Time: {mean_time:.2f} seconds")
print(f"- Std Dev:           {std_dev_time:.2f} seconds")

# Final GPU memory snapshot
print("\n🖥️ Final NVIDIA-SMI Snapshot:")
print(subprocess.getoutput("nvidia-smi"))

# LOAD