# 🏥 Medical Model Optimization + Mixture of Experts
## Notebook 1

**Authors**:


*   Dan Harvey
*   Xinzhuo Jiang



**Columbia University**

---

In this project, we explore the optimization of Transformer-based large language models (LLMs) for medical applications.  
We aim to:
1. Benchmark baseline models for quantitative comparison.
2. Apply optimizations.
3. Construct a Mixture-of-Experts (MoE) using multiple specialized models.

---

In [2]:
## 📦 Environment Setup: Dependencies and Imports
import os
import sys
import importlib
import subprocess
import torch

In [3]:
# Add project root to path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

# Required packages
required_packages = [
    'torch', 'transformers', 'datasets', 'accelerate', 'flash_attn',
    'evaluate', 'lm_eval', 'sklearn', 'matplotlib', 'wandb',
    'tqdm', 'sentencepiece', 'scipy', 'einops'
]

# Check and install missing packages
for package in required_packages:
    try:
        module = importlib.import_module(package)
        print(f"✅ {package} installed successfully")
        if package == 'torch':
            print(f"   Version: {torch.__version__}")
            print(f"   CUDA available: {torch.cuda.is_available()}")
            if torch.cuda.is_available():
                print(f"   CUDA version: {torch.version.cuda}")
                print(f"   GPU: {torch.cuda.get_device_name(0)}")
        elif hasattr(module, '__version__'):
            print(f"   Version: {module.__version__}")
    except ImportError:
        print(f"❌ {package} not found. Installing...")
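        # The pip name can differ from the import name (sklearn -> scikit-learn)
        pip_name = {'sklearn': 'scikit-learn'}.get(package, package)
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])
        importlib.invalidate_caches()  # let the import system see the new install
        module = importlib.import_module(package)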
        print(f"✅ {package} installed successfully (post-install)")
        if hasattr(module, '__version__'):
            print(f"   Version: {module.__version__}")

# You may need to restart the kernel before newly installed packages can be imported

✅ torch installed successfully
   Version: 2.6.0+cu124
   CUDA available: False
✅ transformers installed successfully
   Version: 4.51.3
❌ datasets not found. Installing...
✅ datasets installed successfully (post-install)
   Version: 3.5.1
✅ accelerate installed successfully
   Version: 1.6.0
❌ flash_attn not found. Installing...
✅ flash_attn installed successfully (post-install)
   Version: 2.7.4.post1
❌ evaluate not found. Installing...
✅ evaluate installed successfully (post-install)
   Version: 0.4.3
❌ lm_eval not found. Installing...
✅ lm_eval installed successfully (post-install)
✅ sklearn installed successfully
   Version: 1.6.1
✅ matplotlib installed successfully
   Version: 3.10.0
✅ wandb installed successfully
   Version: 0.19.10
✅ tqdm installed successfully
   Version: 4.67.1
✅ sentencepiece installed successfully
   Version: 0.2.0
✅ scipy installed successfully
   Version: 1.15.2
✅ einops installed successfully
   Version: 0.8.1


# 🧠 Model Selection: Baseline Models

We will work with the following Hugging Face models:

| Model Name                            | Size | Notes                                                          |
| ------------------------------------- | ---- | -------------------------------------------------------------- |
| `TsinghuaC3I/Llama-3-8B-UltraMedical` | 8B   | Medical domain-specific, fine-tuned, ideal teacher & benchmark |
| `meta-llama/Llama-3.2-3B`             | 3B   | Same architecture, smaller, ideal as an expert or student      |
| `Qwen/Qwen3-4B`                       | 4B   | Non-LLaMA expert for diversity in MoE                          |

These models will serve as the baseline in our pipeline and will be evaluated for:

- Performance on medical QA and reasoning tasks
- Suitability for distillation and expert specialization
- Impact of downstream optimizations (quantization, pruning, MoE routing)

📌 **Note**: All models are initially loaded in **full FP32 (float32) precision** to serve as accurate performance baselines before applying any quantization or memory optimization techniques.


In [None]:
# 🔐 Hugging Face Access - Llama is Gated
!huggingface-cli login
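
On a headless machine or in a scheduled job, the interactive prompt above may not be practical. The following is a minimal sketch of a non-interactive alternative. It assumes the access token has been exported as the `HF_TOKEN` environment variable (the notebook itself only uses the CLI login) and that `huggingface_hub` is installed, which it normally is as a dependency of `transformers`. `read_hf_token` is a hypothetical helper introduced only for this illustration.

```python
import os


def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


hf_token = read_hf_token()
if hf_token is not None:
    # Imported lazily so the cell still runs when no token is configured.
    from huggingface_hub import login

    login(token=hf_token)
else:
    print("HF_TOKEN is not set; run `huggingface-cli login` instead.")
```

Keeping the token in an environment variable avoids pasting it into a notebook cell, where it could end up in saved output or version control.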

# 📥 Load Baseline Models

In [None]:
# Load section dependencies
from transformers import AutoTokenizer, AutoModelForCausalLM

## 🦙 Llama-3-8B-UltraMedical

**Links**  
- 🤗 [Hugging Face Model Card](https://huggingface.co/TsinghuaC3I/Llama-3-8B-UltraMedical)
- 📄 [Paper / Source](https://huggingface.co/TsinghuaC3I/Llama-3-8B-UltraMedical)

**Approximate GPU Memory Requirements:**
- **FP32**: ~95 GB  
- **FP16**: ~48 GB  
- **INT8**: ~24 GB  
- **INT4**: ~12 GB  

> These values are estimates and may vary based on sequence length, attention optimizations, and tokenizer overhead.
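
As a sanity check on figures like these, the memory needed for the weights alone is roughly the parameter count times the bytes per parameter. The sketch below uses a hypothetical helper, `weight_memory_gb`, and the nominal 8B parameter count from the model name (an assumption; the exact count differs slightly). It gives a weights-only lower bound, so it comes out below the estimates listed above, which presumably leave headroom for activations, the KV cache, and framework overhead.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(n_params, precision):
    """Weights-only memory in GB (1 GB = 1e9 bytes); excludes activations and KV cache."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


n_params = 8e9  # nominal size taken from the model name, not an exact count
for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{weight_memory_gb(n_params, precision):.0f} GB")
```

The same helper applies to the other models in this notebook by swapping in their nominal parameter counts.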


In [None]:
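# Llama-3-8B-UltraMedical

# `token=True` reuses the token saved by `huggingface-cli login`
# (it replaces the deprecated `use_auth_token` argument).
tokenizer_llama8b_med = AutoTokenizer.from_pretrained(
    "TsinghuaC3I/Llama-3-8B-UltraMedical",
    trust_remote_code=True,
    token=True
)

model_llama8b_med = AutoModelForCausalLM.from_pretrained(
    "TsinghuaC3I/Llama-3-8B-UltraMedical",
    trust_remote_code=True,
    device_map="auto",          # place layers on the available devices automatically
    torch_dtype=torch.float32,  # full-precision baseline
    token=True
)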

print("✅ Loaded Llama-3-8B-UltraMedical (FP32, device-mapped)")

## 🦙 Llama-3.2-3B

**Links**  
- 🤗 [Hugging Face Model Card](https://huggingface.co/meta-llama/Llama-3.2-3B)  
- 📄 [Paper / Source](https://huggingface.co/meta-llama/Llama-3.2-3B)

**Approximate GPU Memory Requirements:**
- **FP32**: ~36 GB  
- **FP16**: ~18 GB  
- **INT8**: ~9 GB  
- **INT4**: ~4.5 GB  

> These are rough estimates. Actual usage depends on sequence length, architecture-specific memory optimizations, and tokenizer overhead.


In [None]:
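# Llama-3.2-3B

# `token=True` reuses the token saved by `huggingface-cli login`
# (it replaces the deprecated `use_auth_token` argument).
tokenizer_llama3b = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    trust_remote_code=True,
    token=True
)

model_llama3b = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    trust_remote_code=True,
    device_map="auto",          # place layers on the available devices automatically
    torch_dtype=torch.float32,  # full-precision baseline
    token=True
)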

print("✅ Loaded Llama-3.2-3B (FP32, device-mapped)")


## 🐉 Qwen3-4B

**Links**  
- 🤗 [Hugging Face Model Card](https://huggingface.co/Qwen/Qwen3-4B)  
- 📄 [Paper / Source](https://arxiv.org/abs/2407.10671) *(Qwen2 technical report for reference; a Qwen3 paper may be pending)*

**Approximate GPU Memory Requirements:**
- **FP32**: ~48 GB  
- **FP16**: ~24 GB  
- **INT8**: ~12 GB  
- **INT4**: ~6 GB  

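> Earlier Qwen releases required `trust_remote_code=True` for their custom model code. Qwen3 is supported natively in recent `transformers` releases (4.51 and later, including the 4.51.3 installed above), so the flag is kept below only for consistency with the other loading cells.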

In [None]:
# Qwen3-4B

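# `token=True` reuses the token saved by `huggingface-cli login`
# (it replaces the deprecated `use_auth_token` argument).
tokenizer_qwen4b = AutoTokenizer.from_pretrained(
    "Qwen/Qwen3-4B",
    trust_remote_code=True,
    token=True
)

model_qwen4b = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B",
    trust_remote_code=True,
    device_map="auto",          # place layers on the available devices automatically
    torch_dtype=torch.float32,  # full-precision baseline
    token=True
)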

print("✅ Loaded Qwen3-4B (FP32, device-mapped)")


# 📊 Benchmarking

To establish performance baselines, we will:

* Load each model in full float32 precision (already implemented above).
* Run each model through standard medical QA tasks (e.g., PubMedQA).
* Repeat each benchmark 3 times and average the results.
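
The repeat-and-average step can be sketched as follows. This is a minimal illustration, not the project's benchmarking code: `accuracy`, `run_repeated`, and the `evaluate_once` callable are hypothetical names, and the predictions are assumed to already be PubMedQA-style `yes` / `no` / `maybe` labels extracted from each model's output.

```python
from statistics import mean, stdev


def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels (case-insensitive)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty benchmark")
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)


def run_repeated(evaluate_once, n_runs=3):
    """Call `evaluate_once()` n_runs times; return (mean, sample std dev) of the scores."""
    scores = [evaluate_once() for _ in range(n_runs)]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


# Toy stand-in for a real model evaluation: fixed labels, so every run agrees.
gold = ["yes", "no", "maybe", "yes"]
preds = ["yes", "no", "yes", "Yes"]
avg, spread = run_repeated(lambda: accuracy(preds, gold), n_runs=3)
print(f"accuracy: {avg:.2f} +/- {spread:.2f} over 3 runs")
```

Reporting the spread alongside the mean shows whether three runs are enough. With deterministic decoding the scores should be identical across runs, so the repeats matter mainly when sampling is enabled or when latency is being measured.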


In [None]:
# Benchmarking code to be added here.