# 🏥 Clinical LLM Training with DPO + Weaver Evaluation

**Goal**: Fine-tune Llama 3 8B Instruct with Direct Preference Optimization (DPO) on clinical preference pairs, then evaluate the result with Weaver ensemble scoring.

**Dataset**: 526 training pairs + 59 holdout pairs (filtered by Weaver from 2,742 Gemini-generated pairs)

**Runtime**: ~20-30 minutes total (15-20 min training + 5-10 min evaluation)

---

## 📋 Checklist

Before running:
- [ ] Set Runtime to GPU (Runtime → Change runtime type → T4 GPU)
- [ ] Have your data files ready: `dpo_train_dataset.jsonl`, `dpo_holdout_dataset.jsonl`
- [ ] Have Weaver code ready: `weaver_ensembles.py`

---

## 1️⃣ Setup Environment

In [None]:
# Check GPU availability
!nvidia-smi

In [None]:
# Install required packages
print("📦 Installing Unsloth and dependencies...")
!pip install -q unsloth "xformers<0.0.26" trl datasets accelerate transformers torch

# Additional packages for evaluation
!pip install -q sentence-transformers scikit-learn

print("✅ Installation complete!")

## 2️⃣ Upload Data Files

Upload your files using the file browser on the left, or run the cell below to upload via dialog.

In [None]:
from google.colab import files
import os

# Create directories
os.makedirs('Data', exist_ok=True)
os.makedirs('Weaver', exist_ok=True)

print("📂 Please upload the following files:")
print("   1. dpo_train_dataset.jsonl")
print("   2. dpo_holdout_dataset.jsonl")
print("   3. weaver_ensembles.py")
print("   4. weaver_weights.json (optional, if you have trained weights)")
print("\n⬆️  Click 'Choose Files' below...\n")

uploaded = files.upload()

# Move files to appropriate directories
for filename in uploaded.keys():
    if 'dpo' in filename and filename.endswith('.jsonl'):
        !mv "{filename}" Data/
        print(f"✅ Moved {filename} to Data/")
    elif 'weaver' in filename:
        !mv "{filename}" Weaver/
        print(f"✅ Moved {filename} to Weaver/")

print("\n📋 File check:")
!ls -lh Data/
!ls -lh Weaver/

## 3️⃣ Data Validation

Let's verify the data format before training.

In [None]:
import json
from datasets import load_dataset

# Load and inspect training data
print("🔍 Validating training data...")
train_dataset = load_dataset("json", data_files="Data/dpo_train_dataset.jsonl", split="train")
holdout_dataset = load_dataset("json", data_files="Data/dpo_holdout_dataset.jsonl", split="train")

print(f"\n✅ Training samples: {len(train_dataset)}")
print(f"✅ Holdout samples: {len(holdout_dataset)}")

# Check for required fields first, so a missing column is reported here
# instead of raising a KeyError when the sample is printed
required_fields = ['prompt', 'chosen', 'rejected']
missing_fields = {}
for name, ds in [("train", train_dataset), ("holdout", holdout_dataset)]:
    missing = [field for field in required_fields if field not in ds.column_names]
    if missing:
        missing_fields[name] = missing

if missing_fields:
    print(f"\n❌ ERROR: Missing fields: {missing_fields}")
else:
    # Show sample
    sample = train_dataset[0]
    print("\n📋 Sample Entry:")
    print(f"   Prompt: {sample['prompt'][:100]}...")
    print(f"   Chosen: {sample['chosen'][:100]}...")
    print(f"   Rejected: {sample['rejected'][:100]}...")
    print("\n✅ All required fields present. Ready for training!")
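
The check above looks only at which fields are present. A minimal sketch of a row-level check follows, assuming each row is a dict with string `prompt`, `chosen`, and `rejected` values. `find_bad_pairs` is a hypothetical helper written for illustration; it is not part of `datasets` or Weaver, and the demo rows are made up.

```python
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")

def find_bad_pairs(rows):
    """Return (index, reason) tuples for rows that are unusable as DPO pairs."""
    problems = []
    for i, row in enumerate(rows):
        # A field counts as missing if it is absent, None, or whitespace only
        missing = [f for f in REQUIRED_FIELDS if not str(row.get(f) or "").strip()]
        if missing:
            problems.append((i, f"missing or empty: {missing}"))
        elif row["chosen"].strip() == row["rejected"].strip():
            problems.append((i, "chosen and rejected are identical"))
    return problems

# Illustrative rows only, not taken from the real dataset
demo_rows = [
    {"prompt": "I can't sleep.", "chosen": "That sounds exhausting.", "rejected": "Just sleep more."},
    {"prompt": "I feel low.", "chosen": "Same text", "rejected": "Same text"},
    {"prompt": "", "chosen": "Hello", "rejected": "Hi"},
]
print(find_bad_pairs(demo_rows))
```

Iterating a `datasets.Dataset` yields one dict per row, so `find_bad_pairs(train_dataset)` should work on the loaded data as well.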

## 4️⃣ Train DPO Model

This will take ~15-20 minutes on T4 GPU.

**What's happening:**
- Loading Llama 3 8B Instruct in 4-bit
- Adding LoRA adapters (trainable parameters)
- Training with DPO for 3 epochs
- Saving the LoRA adapters and tokenizer

In [None]:
import torch
from unsloth import FastLanguageModel, PatchDPOTrainer, is_bfloat16_supported
from trl import DPOConfig, DPOTrainer
from datasets import load_dataset

print("="*60)
print("🚀 STARTING CLINICAL DPO TRAINING")
print("="*60)

# Configuration
MAX_SEQ_LENGTH = 2048
NUM_EPOCHS = 3
LEARNING_RATE = 5e-6
BATCH_SIZE = 2
GRAD_ACCUMULATION = 4
OUTPUT_DIR = "clinical_dpo_model_v1"

print(f"\n⚙️  Configuration:")
print(f"   Epochs: {NUM_EPOCHS}")
print(f"   Learning Rate: {LEARNING_RATE}")
print(f"   Effective Batch Size: {BATCH_SIZE * GRAD_ACCUMULATION}")
print(f"   Max Sequence Length: {MAX_SEQ_LENGTH}")
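
As a rough guide to what these settings imply, the sketch below estimates the number of optimizer updates for the 526-pair training set on a single GPU. `estimate_optimizer_steps` is a hypothetical helper; the count the trainer actually reports may differ by a few steps, depending on how the installed version handles the last partial accumulation group in each epoch.

```python
import math

def estimate_optimizer_steps(num_samples, batch_size, grad_accumulation, epochs):
    """Rough count of optimizer updates for a single-GPU run."""
    effective_batch = batch_size * grad_accumulation
    # One update per effective batch, rounding the last partial batch up
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

effective_batch, total_steps = estimate_optimizer_steps(
    num_samples=526, batch_size=2, grad_accumulation=4, epochs=3
)
print(f"Effective batch size: {effective_batch}, estimated updates: {total_steps}")
```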

In [None]:
# Load base model
print("\n🤖 Loading Llama 3 8B Instruct...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-instruct-bnb-4bit",
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,
    load_in_4bit=True,
)
print("   ✅ Base model loaded")

In [None]:
# Add LoRA adapters
print("\n🔧 Adding LoRA adapters...")
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
print("   ✅ LoRA configured")

In [None]:
# Load and format dataset
print("\n📂 Loading training data...")

def format_dpo(example):
    """Wrap the prompt in <|user|> / <|assistant|> markers unless already wrapped.

    These markers are plain text to the Llama 3 tokenizer (they are not its
    native chat tokens), so the same wrapping must be used at inference time.
    """
    prompt = example['prompt']
    if not prompt.startswith("<|user|>"):
        prompt = f"<|user|>\n{prompt}\n<|assistant|>\n"
    return {
        "prompt": prompt,
        "chosen": example['chosen'],
        "rejected": example['rejected']
    }

dataset = load_dataset("json", data_files="Data/dpo_train_dataset.jsonl", split="train")
dataset = dataset.map(format_dpo)
print(f"   ✅ Loaded {len(dataset)} training pairs")
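
`format_dpo` wraps prompts in plain-text `<|user|>` / `<|assistant|>` markers. Llama 3 Instruct itself was trained on a different chat layout built from header and end-of-turn special tokens. For comparison only, here is a sketch of that layout for a single user turn. `to_llama3_prompt` is a hypothetical helper, and adopting it would mean changing the training, inference, and evaluation cells together.

```python
def to_llama3_prompt(user_text: str) -> str:
    """Format one user turn in the Llama 3 Instruct chat layout."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_text.strip()}<|eot_id|>"
        # Open the assistant turn so generation continues from here
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

print(to_llama3_prompt("I've been feeling anxious lately."))
```

In practice the tokenizer can build this string from its own template via `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, which avoids hand-writing the special tokens.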

In [None]:
# Initialize trainer
print("\n⚙️  Initializing DPO Trainer...")

# Apply Unsloth's DPO patches before the trainer is constructed
PatchDPOTrainer()

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a LoRA model, TRL uses the adapter-disabled base as the reference
    tokenizer=tokenizer,  # newer TRL releases name this argument `processing_class`
    train_dataset=dataset,
    args=DPOConfig(
        beta=0.1,  # set on DPOConfig; recent TRL no longer accepts it as a DPOTrainer argument
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRAD_ACCUMULATION,
        warmup_ratio=0.1,
        num_train_epochs=NUM_EPOCHS,
        learning_rate=LEARNING_RATE,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        save_steps=100,
        output_dir=OUTPUT_DIR,
        optim="adamw_8bit",
        seed=42,
        remove_unused_columns=False,
    ),
)

print("   ✅ Trainer ready")
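
The `beta=0.1` setting above scales how strongly the policy is rewarded for moving away from the reference model in favor of the chosen response. A minimal sketch of the per-pair DPO objective in plain Python follows, assuming the four summed sequence log-probabilities are already available. It illustrates the published loss rather than TRL's internal implementation, and the log-probability values are made up.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the margin is 0 and the loss is log(2)
print(dpo_loss(-40.0, -42.0, -40.0, -42.0))
# Policy favors the chosen response more than the reference does: lower loss
print(dpo_loss(-38.0, -45.0, -40.0, -42.0))
```

A training loss that starts near log(2) and falls is therefore the expected pattern when watching the logs.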

In [None]:
# Train!
print("\n" + "="*60)
print("🎯 TRAINING STARTED")
print("="*60)
print(f"Training {len(dataset)} samples for {NUM_EPOCHS} epochs...")
print(f"Expected steps: ~{len(dataset) * NUM_EPOCHS // (BATCH_SIZE * GRAD_ACCUMULATION)}")
print("\n⏱️  This will take ~15-20 minutes. Watch the loss decrease!\n")
print("="*60 + "\n")

trainer_output = dpo_trainer.train()

print("\n" + "="*60)
print("✅ TRAINING COMPLETE!")
print("="*60)
print(f"Final loss: {trainer_output.training_loss:.4f}")
print("="*60)

In [None]:
# Save model (for a LoRA model this writes the adapter weights, not a merged 8B checkpoint)
print(f"\n💾 Saving model to {OUTPUT_DIR}...")
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print("   ✅ Model saved!")

## 5️⃣ Quick Inference Test

Let's test if the model works before full evaluation.

In [None]:
print("🧪 Running quick inference test...\n")

FastLanguageModel.for_inference(model)

test_prompt = "I've been feeling really anxious about my upcoming presentation at work."
formatted_prompt = f"<|user|>\n{test_prompt}\n<|assistant|>\n"

inputs = tokenizer(formatted_prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens; the leading positions of the
# output are the prompt itself
prompt_len = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()

print("="*60)
print("👤 PATIENT:")
print(test_prompt)
print("\n🤖 FINE-TUNED MODEL:")
print(response)
print("="*60)

## 6️⃣ Weaver Evaluation

Now the important part: evaluate using Weaver's 5-verifier ensemble!

**What we'll do:**
1. Load base Llama 3 + your fine-tuned model
2. Generate responses from both for 59 holdout prompts
3. Score each with Weaver (Clinical Correctness, Therapeutic Tone, Safety, Protocol, Logic)
4. Calculate win rate and improvement

**Success criteria:**
- Win rate > 70%
- Average improvement > 0.10

In [None]:
# Import Weaver
import sys
sys.path.append('Weaver')

print("📊 Importing Weaver ensemble...")

try:
    from weaver_ensembles import (
        WeaverEnsemble,
        ClinicalCorrectnessVerifier,
        TherapeuticToneVerifier,
        SafetyVerifier,
        ClinicalProtocolVerifier,
        DialogueLogicVerifier
    )
    print("   ✅ Weaver imported successfully")
except ImportError as e:
    print(f"   ❌ Error importing Weaver: {e}")
    print("   Make sure you uploaded weaver_ensembles.py to the Weaver/ folder")
    raise

In [None]:
# Initialize Weaver verifiers
import json
import os

print("🏗️  Initializing Weaver jury (5 verifiers)...\n")

verifiers = [
    ClinicalCorrectnessVerifier(device=0),
    TherapeuticToneVerifier(device=0),
    SafetyVerifier(device=0),
    ClinicalProtocolVerifier(),
    DialogueLogicVerifier(device=0)
]

# Load trained weights if available
if os.path.exists('Weaver/weaver_weights.json'):
    print("   Loading trained weights...")
    with open('Weaver/weaver_weights.json', 'r') as f:
        weights = json.load(f)
        for v in verifiers:
            if v.name in weights:
                v.weight = weights[v.name]
                print(f"   • {v.name}: {v.weight:.2f}")
else:
    print("   Using default weights:")
    for v in verifiers:
        print(f"   • {v.name}: {v.weight:.2f}")

ensemble = WeaverEnsemble(verifiers)
print("\n   ✅ Weaver ensemble ready")
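
The cell above gives each verifier a `weight`. How `WeaverEnsemble` combines the verifier outputs is defined in `weaver_ensembles.py` and is not shown in this notebook. One common scheme is a weight-normalized average of per-verifier scores, and the sketch below assumes that scheme with scores in [0, 1]. `weighted_ensemble_score`, the verifier names, and the numbers are all hypothetical.

```python
def weighted_ensemble_score(scores, weights):
    """Weight-normalized average of per-verifier scores (assumed scheme)."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("verifier weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Illustrative values only; real scores come from the Weaver verifiers
example_scores = {"safety": 0.9, "tone": 0.6, "logic": 0.75}
example_weights = {"safety": 2.0, "tone": 1.0, "logic": 1.0}
print(weighted_ensemble_score(example_scores, example_weights))
```

Under this assumption, doubling the safety weight pulls the combined score toward the safety verifier's output, which is a useful mental model when reviewing `weaver_weights.json`.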

In [None]:
# Load both models for comparison
import gc

print("\n🤖 Loading models for comparison...\n")

# Free the training-time model first: a 16 GB T4 may not hold three
# 4-bit 8B models plus the Weaver verifiers at the same time
for _name in ("dpo_trainer", "model"):
    globals().pop(_name, None)
gc.collect()
torch.cuda.empty_cache()

# Base model
print("   Loading base Llama 3 8B...")
base_model, base_tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(base_model)
print("   ✅ Base model loaded")

# Fine-tuned model
print("   Loading fine-tuned model...")
finetuned_model, finetuned_tokenizer = FastLanguageModel.from_pretrained(
    model_name=OUTPUT_DIR,
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(finetuned_model)
print("   ✅ Fine-tuned model loaded")

In [None]:
# Load holdout dataset
print("\n📋 Loading holdout dataset...")
holdout = load_dataset("json", data_files="Data/dpo_holdout_dataset.jsonl", split="train")
print(f"   ✅ Loaded {len(holdout)} holdout samples")

In [None]:
# Generate responses and score with Weaver
import numpy as np
from tqdm import tqdm

print("\n🔬 Generating responses and scoring with Weaver...")
print(f"   This will take ~{len(holdout) * 10} seconds (2 models × {len(holdout)} samples)\n")

results = []
MAX_NEW_TOKENS = 256
TEMPERATURE = 0.7

def generate_response(gen_model, gen_tokenizer, prompt):
    """Sample one completion and return only the newly generated text."""
    inputs = gen_tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = gen_model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            temperature=TEMPERATURE,
            do_sample=True,
            pad_token_id=gen_tokenizer.eos_token_id
        )
    # Skip the prompt tokens so the response never contains the prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return gen_tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

for idx, example in enumerate(tqdm(holdout, desc="Evaluating")):
    prompt = example['prompt']
    chosen_gold = example['chosen']

    # Extract clean user text
    user_text = prompt.replace("<|user|>", "").replace("<|assistant|>", "").strip()
    if "\n" in user_text:
        user_text = user_text.split("\n")[0].strip()

    # Apply the same prompt wrapping that format_dpo used for training
    if not prompt.startswith("<|user|>"):
        prompt = f"<|user|>\n{prompt}\n<|assistant|>\n"

    # Generate from the base and fine-tuned models
    base_response = generate_response(base_model, base_tokenizer, prompt)
    ft_response = generate_response(finetuned_model, finetuned_tokenizer, prompt)

    # Score with Weaver. The generated response is passed in the second
    # ("rejected") position, so its score is read from 'rejected_score'
    base_eval = ensemble.evaluate_pair(chosen_gold, base_response, user_text)
    ft_eval = ensemble.evaluate_pair(chosen_gold, ft_response, user_text)

    improvement = ft_eval['rejected_score'] - base_eval['rejected_score']

    results.append({
        "sample_id": idx,
        "prompt": user_text[:200],
        "base_response": base_response,
        "finetuned_response": ft_response,
        "base_score": float(base_eval['rejected_score']),
        "finetuned_score": float(ft_eval['rejected_score']),
        "improvement": float(improvement),
        "win": improvement > 0
    })

print("\n   ✅ Evaluation complete!")

## 7️⃣ Results Analysis

Let's see how the fine-tuned model performed!

In [None]:
# Calculate metrics
wins = sum(1 for r in results if r['win'])
losses = len(results) - wins
win_rate = wins / len(results) * 100

avg_base = np.mean([r['base_score'] for r in results])
avg_ft = np.mean([r['finetuned_score'] for r in results])
avg_improvement = avg_ft - avg_base

std_base = np.std([r['base_score'] for r in results])
std_ft = np.std([r['finetuned_score'] for r in results])

# Print results
print("="*60)
print("🎯 EVALUATION RESULTS")
print("="*60)
print(f"\n📈 Overall Performance:")
print(f"   Win Rate (Fine-tuned > Base):  {win_rate:.1f}% ({wins}/{len(results)})")
print(f"   Loss Rate (Fine-tuned ≤ Base): {100-win_rate:.1f}% ({losses}/{len(results)})")
print(f"\n📊 Weaver Scores:")
print(f"   Base Model Average:            {avg_base:.3f} (±{std_base:.3f})")
print(f"   Fine-tuned Model Average:      {avg_ft:.3f} (±{std_ft:.3f})")
print(f"   Average Improvement:           {'+' if avg_improvement > 0 else ''}{avg_improvement:.3f}")

# Success criteria
print("\n🎯 Success Criteria:")
success = []

if win_rate > 70:
    print("   ✅ Win Rate > 70%")
    success.append(True)
else:
    print(f"   ❌ Win Rate ≤ 70% (got {win_rate:.1f}%)")
    success.append(False)

if avg_improvement > 0.10:
    print("   ✅ Avg Improvement > 0.10")
    success.append(True)
else:
    print(f"   ❌ Avg Improvement ≤ 0.10 (got {avg_improvement:.3f})")
    success.append(False)

print("="*60)

if all(success):
    print("🎉 SUCCESS: Model shows significant clinical improvement!")
elif win_rate > 60:
    print("⚠️  PARTIAL SUCCESS: Model improved but below target")
else:
    print("❌ NEEDS IMPROVEMENT: Model did not show consistent improvement")

print("="*60)
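
With 59 holdout samples and sampled generations, a win rate close to the 70% threshold carries noticeable uncertainty. The sketch below runs an exact one-sided sign test under the null hypothesis that each comparison is a coin flip, using only the standard library. `sign_test_p_value` is a hypothetical helper, and the win counts are illustrative rather than results from this notebook.

```python
from math import comb

def sign_test_p_value(wins, total):
    """One-sided exact sign test: P(X >= wins) for X ~ Binomial(total, 0.5)."""
    if not 0 <= wins <= total:
        raise ValueError("wins must be between 0 and total")
    # Count the equally likely outcomes with at least `wins` successes
    favorable = sum(comb(total, k) for k in range(wins, total + 1))
    return favorable / 2 ** total

for example_wins in (30, 36, 42):
    p = sign_test_p_value(example_wins, 59)
    print(f"{example_wins}/59 wins -> p = {p:.4f}")
```

Because the notebook counts ties as losses (`win` requires `improvement > 0`), feeding `wins` and `len(results)` into this test is a conservative choice.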

In [None]:
# Visualize score distributions
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Score comparison
ax1.hist([r['base_score'] for r in results], bins=20, alpha=0.5, label='Base Model', color='red')
ax1.hist([r['finetuned_score'] for r in results], bins=20, alpha=0.5, label='Fine-tuned', color='green')
ax1.axvline(avg_base, color='red', linestyle='--', linewidth=2, label=f'Base Avg: {avg_base:.3f}')
ax1.axvline(avg_ft, color='green', linestyle='--', linewidth=2, label=f'Fine-tuned Avg: {avg_ft:.3f}')
ax1.set_xlabel('Weaver Score')
ax1.set_ylabel('Frequency')
ax1.set_title('Score Distribution Comparison')
ax1.legend()
ax1.grid(alpha=0.3)

# Improvement distribution
improvements = [r['improvement'] for r in results]
ax2.hist(improvements, bins=20, color='blue', alpha=0.7, edgecolor='black')
ax2.axvline(0, color='red', linestyle='--', linewidth=2, label='No Change')
ax2.axvline(avg_improvement, color='green', linestyle='--', linewidth=2, label=f'Avg: {avg_improvement:.3f}')
ax2.set_xlabel('Improvement (Fine-tuned - Base)')
ax2.set_ylabel('Frequency')
ax2.set_title('Improvement Distribution')
ax2.legend()
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.savefig('evaluation_charts.png', dpi=150, bbox_inches='tight')
plt.show()

print("📊 Charts saved to evaluation_charts.png")

In [None]:
# Show top improvements
sorted_results = sorted(results, key=lambda x: x['improvement'], reverse=True)
top_5 = sorted_results[:5]

print("\n📝 Top 5 Improvements:\n")
for i, ex in enumerate(top_5, 1):
    print("─"*60)
    # The + format flag prints an explicit sign, so a negative value is not shown as "+-"
    print(f"EXAMPLE {i} | Improvement: {ex['improvement']:+.3f}")
    print("─"*60)
    print(f"👤 PATIENT:\n{ex['prompt']}\n")
    print(f"🤖 BASE (score={ex['base_score']:.3f}):\n{ex['base_response'][:200]}...\n")
    print(f"✨ FINE-TUNED (score={ex['finetuned_score']:.3f}):\n{ex['finetuned_response'][:200]}...\n")
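
The cell above surfaces only the best cases. For failure analysis it helps to look at the other end of the list as well. A small sketch, assuming `results` entries carry the `improvement` key built in the evaluation loop; `worst_regressions` is a hypothetical helper and the demo values are made up.

```python
def worst_regressions(results, n=5):
    """Return up to n samples where the fine-tuned score dropped the most."""
    regressions = [r for r in results if r["improvement"] < 0]
    # Ascending sort puts the most negative improvement first
    return sorted(regressions, key=lambda r: r["improvement"])[:n]

# Illustrative entries only
demo_results = [
    {"sample_id": 0, "improvement": 0.12},
    {"sample_id": 1, "improvement": -0.30},
    {"sample_id": 2, "improvement": -0.05},
    {"sample_id": 3, "improvement": 0.00},
]
for r in worst_regressions(demo_results):
    print(f"sample {r['sample_id']}: {r['improvement']:+.3f}")
```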

In [None]:
# Save detailed results
output_data = {
    "summary": {
        "total_samples": len(results),
        "win_rate": float(win_rate),
        "wins": wins,
        "losses": losses,
        "avg_base_score": float(avg_base),
        "avg_finetuned_score": float(avg_ft),
        "avg_improvement": float(avg_improvement),
        "success_criteria_met": all(success)
    },
    "detailed_results": results
}

with open("evaluation_results.json", "w") as f:
    json.dump(output_data, f, indent=2)

print("💾 Results saved to evaluation_results.json")

## 8️⃣ Download Results

Download your trained model and evaluation results.

In [None]:
from google.colab import files
import shutil

print("📦 Preparing files for download...\n")

# Zip the model directory (OUTPUT_DIR is set in the training configuration cell)
print("   Zipping model (this may take a minute)...")
archive_path = shutil.make_archive(OUTPUT_DIR, "zip", root_dir=".", base_dir=OUTPUT_DIR)
print("   ✅ Model zipped")

# Download files
print("\n📥 Downloading files...\n")
print("   1. Trained model (large file, ~500MB-1GB)")
files.download(archive_path)

print("   2. Evaluation results (JSON)")
files.download('evaluation_results.json')

print("   3. Evaluation charts (PNG)")
files.download('evaluation_charts.png')

print("\n✅ All files downloaded!")

## 9️⃣ Next Steps

Based on your results:

### If Success (Win Rate > 70%):
1. **Scale up data generation** to 5,000-10,000 pairs
2. **Run trajectory analysis** (multi-turn conversation test)
3. **Consider GRPO** for further improvement
4. **Integrate Hourglass Emotions** into Weaver

### If Partial Success (60-70% Win Rate):
1. **Analyze failure cases** (samples with a negative `improvement` in `evaluation_results.json`) to identify patterns
2. **Improve data quality** by adjusting Weaver filters
3. **Try different hyperparameters**:
   - Lower learning rate (1e-6)
   - More epochs (5-7)
   - Higher LoRA rank (32)
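
The hyperparameter suggestions above can be laid out as a small grid before committing GPU time. A sketch using the values named in this notebook; `expand_grid` and the dictionary keys are hypothetical, and each resulting dict would still need to be wired into the training cells by hand.

```python
from itertools import product

# Current settings plus the alternatives suggested above
search_space = {
    "learning_rate": [5e-6, 1e-6],
    "num_train_epochs": [3, 5, 7],
    "lora_r": [16, 32],
}

def expand_grid(space):
    """Return one dict per combination of the listed values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]

runs = expand_grid(search_space)
print(len(runs), "candidate runs; first:", runs[0])
```

At roughly 15-20 minutes per run for 3 epochs, a full grid is expensive on a T4, so trying one change at a time is a reasonable starting point.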

### If Needs Improvement (<60% Win Rate):
1. **Check training loss** - did it converge?
2. **Validate data quality** - are chosen/rejected truly different?
3. **Review Weaver weights** - are they appropriate?
4. **Consider different base model** (Qwen 2.5, Mistral, etc.)

---

**Good luck! 🚀**