# Lab 4.6.8.3: Merge and Export

**Capstone Option E:** Browser-Deployed Fine-Tuned LLM (Matcha Expert)  
**Phase:** 3 of 6  
**Time:** 3-4 hours  
**Difficulty:** ⭐⭐⭐

---

## Phase Objectives

By completing this phase, you will:
- [ ] Understand why merging requires full precision
- [ ] Load base model in BF16 (not 4-bit!)
- [ ] Merge LoRA adapters into base model
- [ ] Verify merged model quality
- [ ] Export to GGUF for Ollama testing
- [ ] Save merged model for ONNX conversion

---

## Phase Checklist

- [ ] Base model loaded in BF16
- [ ] LoRA adapters loaded
- [ ] Merge completed successfully
- [ ] Quality verified (output comparable to the LoRA model)
- [ ] Merged model saved
- [ ] GGUF exported for Ollama (optional)

---

## Why This Matters

**This is the most critical step for browser deployment quality!**

| Merge Method | Quality | Browser Works? |
|--------------|---------|---------------|
| Merge in 4-bit | ❌ Degraded | Maybe |
| Merge in BF16 | ✅ Full quality | ✅ Yes |

**The Rule:** Always merge LoRA adapters into a full-precision model, then quantize afterward.
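
The ordering rule can be illustrated with a toy calculation. The sketch below is only an illustration under loud assumptions: `quantize` is a hypothetical uniform quantizer that snaps values to multiples of 0.25 (real 4-bit formats such as NF4 use per-block scales and behave differently), and the weights and LoRA update are made-up numbers. It compares merging first and quantizing afterward against quantizing first and merging into the coarse grid. In this toy case the quantize-first route rounds the small LoRA update away entirely.

```python
def quantize(values, step=0.25):
    """Toy uniform quantizer (assumption): snap each value to a multiple of `step`."""
    return [round(v / step) * step for v in values]


def merge(weights, delta):
    """Add the LoRA update to the base weights element by element."""
    return [w + d for w, d in zip(weights, delta)]


def mean_abs_error(a, b):
    """Average absolute difference between two weight lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Made-up base weights and a small LoRA update (smaller than half a step)
base = [0.10, -0.35, 0.62, 0.88, -0.12, 0.40]
delta = [0.05, -0.05, 0.08, -0.06, 0.04, -0.07]

# Reference: merge in full precision, no quantization at all
reference = merge(base, delta)

# Route 1: merge in full precision, quantize afterward
merge_then_quant = quantize(reference)

# Route 2: quantize the base, merge, and store the result on the same coarse grid
quant_then_merge = quantize(merge(quantize(base), delta))

print("merge then quantize error:", mean_abs_error(merge_then_quant, reference))
print("quantize then merge error:", mean_abs_error(quant_then_merge, reference))
print("update lost in route 2:", quant_then_merge == quantize(base))
```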

---

## ELI5: Why Merge in Full Precision?

> **Imagine you're mixing paint colors.**
>
> - **Your base color** (the model) is stored in a can
> - **Your new tint** (LoRA adapters) is a small bottle of concentrated color
>
> **Merging in 4-bit** is like mixing with the can mostly dried up - the colors don't blend properly, and you get streaks and inconsistencies.
>
> **Merging in BF16 (full precision)** is like having a full, fresh can - the new tint mixes smoothly and evenly throughout.
>
> After mixing properly, you CAN compress the result (quantize to INT4) and it will still look great. But you can't compress first and then mix - the damage is already done.

---

## Part 1: Environment Setup

In [None]:
# Environment Setup
import os
import sys
from pathlib import Path
from datetime import datetime
import json
import torch

print("🍵 PHASE 3: MERGE AND EXPORT")
print("="*70)
print(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}")

# Guard the GPU queries so this cell also runs on CPU-only machines
if torch.cuda.is_available():
    print(f"\nGPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("\nGPU: Not available")

In [None]:
# Project Configuration
PROJECT_DIR = Path("./matcha-expert")
MODEL_DIR = PROJECT_DIR / "models"

# Paths
ADAPTER_PATH = MODEL_DIR / "matcha-lora" / "final"
MERGED_PATH = MODEL_DIR / "matcha-merged"
GGUF_PATH = MODEL_DIR / "matcha-gguf"

# Create directories
MERGED_PATH.mkdir(parents=True, exist_ok=True)
GGUF_PATH.mkdir(parents=True, exist_ok=True)

# Base model (must match training)
BASE_MODEL = "google/gemma-3-270m-it"

print("📁 Paths:")
print(f"   Adapters: {ADAPTER_PATH}")
print(f"   Merged output: {MERGED_PATH}")
print(f"   GGUF output: {GGUF_PATH}")
print(f"   Base model: {BASE_MODEL}")

In [None]:
# Memory helper
def log_memory(stage: str = ""):
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"💾 Memory [{stage}]: {allocated:.2f}GB allocated, {reserved:.2f}GB reserved")

log_memory("Initial")

---

## Part 2: Load Base Model in Full Precision

**CRITICAL: Load in BF16, NOT 4-bit!**

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

print("🔄 Loading base model in BF16 (full precision)...")
print("   This is CRITICAL for quality - do NOT load in 4-bit!")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# Load model in BF16 - NOT quantized!
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,  # Full precision!
    device_map="auto",
    # NO quantization_config here!
)

print("\n✅ Base model loaded in BF16")
print(f"   Model dtype: {base_model.dtype}")
log_memory("After base model")

---

## Part 3: Load and Merge LoRA Adapters

In [None]:
from peft import PeftModel

print(f"🔧 Loading LoRA adapters from {ADAPTER_PATH}...")

# Stop here if the adapters are missing; the following cells depend on `model`
if not ADAPTER_PATH.exists():
    raise FileNotFoundError(
        f"❌ Adapters not found at {ADAPTER_PATH}. Please complete Phase 2 first!"
    )

# Load LoRA adapters onto the BF16 base model
model = PeftModel.from_pretrained(
    base_model,
    str(ADAPTER_PATH),
    torch_dtype=torch.bfloat16,
)

print("✅ LoRA adapters loaded")
log_memory("After LoRA load")

In [None]:
# Merge adapters into base model

print("🔀 Merging LoRA adapters into base model...")
print("   This creates a single merged model with the fine-tuned weights")

merged_model = model.merge_and_unload()

print("\n✅ Merge complete!")
print(f"   Model type: {type(merged_model).__name__}")
print(f"   Parameters: {sum(p.numel() for p in merged_model.parameters()):,}")
log_memory("After merge")
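
What the merge computes can be sketched in plain Python. Assuming standard LoRA scaling, each adapted weight matrix becomes `W + (alpha / r) * B @ A`; some LoRA variants scale differently. The helper names (`matmul`, `lora_merge`) and the tiny matrices below are hypothetical and exist only for illustration. The sketch also checks that the merged matrix gives the same output as running the base weights and the adapter side by side.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w, lora_a, lora_b, alpha, r):
    """Fold a LoRA update into w, assuming the standard alpha / r scaling."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # low-rank update with the same shape as w
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical 2x3 weight with a rank-1 adapter
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[1.0, 2.0, 0.0]]    # shape (r, d_in)
B = [[0.5], [-1.0]]      # shape (d_out, r)
alpha, r = 2, 1

merged = lora_merge(W, A, B, alpha, r)

# Forward pass on one input column: base + scaled adapter vs. merged weights
x = [[1.0], [1.0], [1.0]]
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged_out = [[b[0] + (alpha / r) * a[0]] for b, a in zip(base_out, adapter_out)]
merged_out = matmul(merged, x)

print("merged weights:", merged)
print("outputs match:", merged_out == unmerged_out)
```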

---

## Part 4: Verify Merged Model Quality

In [None]:
# Test function for merged model

def generate_response(model, tokenizer, question: str, max_tokens: int = 256) -> str:
    """
    Generate a response from the model.
    
    Args:
        model: The model to use for generation
        tokenizer: The tokenizer
        question: User question
        max_tokens: Maximum tokens to generate
        
    Returns:
        Generated response text
    """
    messages = [
        {"role": "system", "content": "You are a matcha tea expert with deep knowledge of Japanese tea culture, preparation methods, health benefits, and culinary applications."},
        {"role": "user", "content": question},
    ]
    
    # return_dict=True returns input_ids and attention_mask together
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
        )
    
    # Decode only the newly generated tokens, skipping the prompt
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response.strip()

print("✅ Generation function defined")

In [None]:
# Test merged model

TEST_QUESTIONS = [
    "What's the difference between ceremonial and culinary grade matcha?",
    "What's the best water temperature for making matcha?",
    "How should I store matcha?",
]

print("🧪 TESTING MERGED MODEL")
print("="*70)

for i, question in enumerate(TEST_QUESTIONS, 1):
    print(f"\n❓ Question {i}: {question}")
    print("\n💬 Response:")
    response = generate_response(merged_model, tokenizer, question)
    print(response[:500] + "..." if len(response) > 500 else response)
    print("-"*70)
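
Because `generate_response` samples (`do_sample=True`), two runs rarely produce identical text, so eyeballing the answers only gives a rough quality check. A stricter comparison can be sketched as below. `collect_outputs` and `find_mismatches` are hypothetical helpers, and the two stub generators stand in for greedy (`do_sample=False`) generation with the LoRA model and the merged model. Since merging typically folds the adapters into the base weights in place, the LoRA outputs would need to be collected before calling `merge_and_unload()`. Small BF16 rounding differences can still change a token, so an occasional mismatch is not necessarily a failed merge.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def collect_outputs(generate: Callable[[str], str], prompts: Iterable[str]) -> Dict[str, str]:
    """Run a generation function over the prompts and keep the outputs by prompt."""
    return {prompt: generate(prompt) for prompt in prompts}


def find_mismatches(
    reference: Dict[str, str], candidate: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (prompt, reference_output, candidate_output) for every differing prompt."""
    return [
        (prompt, ref, candidate.get(prompt, ""))
        for prompt, ref in reference.items()
        if candidate.get(prompt) != ref
    ]


# Stand-in generators (assumption): replace with deterministic calls to each model
def lora_generate(prompt: str) -> str:
    return f"answer to: {prompt.lower()}"


def merged_generate(prompt: str) -> str:
    return f"answer to: {prompt.lower()}"


prompts = ["How should I store matcha?", "What is usucha?"]
lora_outputs = collect_outputs(lora_generate, prompts)      # collect before merging
merged_outputs = collect_outputs(merged_generate, prompts)  # collect after merging
mismatches = find_mismatches(lora_outputs, merged_outputs)
print(f"{len(mismatches)} of {len(prompts)} prompts differ")
```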

---

## Part 5: Save Merged Model

In [None]:
# Save merged model

print(f"💾 Saving merged model to {MERGED_PATH}...")

merged_model.save_pretrained(
    str(MERGED_PATH),
    safe_serialization=True,  # Use safetensors format
)
tokenizer.save_pretrained(str(MERGED_PATH))

# Calculate size
model_size = sum(f.stat().st_size for f in MERGED_PATH.glob("*.safetensors")) / 1e9

print("\n✅ Merged model saved!")
print(f"   Path: {MERGED_PATH}")
print(f"   Size: {model_size:.2f} GB")

# List files
print("\n📁 Saved files:")
for f in sorted(MERGED_PATH.iterdir()):
    size = f.stat().st_size / 1e6
    print(f"   {f.name}: {size:.1f} MB")
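
Before handing the directory to the ONNX conversion phase, a quick completeness check can catch a partial save. The sketch below is a minimal example, assuming `save_pretrained` wrote `config.json`, `tokenizer_config.json`, and at least one `.safetensors` file; the helper name `missing_export_files` is hypothetical, and the demo runs against a temporary directory rather than `MERGED_PATH`.

```python
import tempfile
from pathlib import Path


def missing_export_files(model_dir: Path) -> list:
    """List the expected files that are absent from a saved model directory."""
    required = ("config.json", "tokenizer_config.json")  # assumed file names
    missing = [name for name in required if not (model_dir / name).is_file()]
    # Weights may be sharded, so accept any safetensors file
    if not any(model_dir.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing


# Demo on a throwaway directory that lacks the tokenizer config
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "config.json").write_text("{}")
    (demo_dir / "model.safetensors").write_bytes(b"")
    demo_missing = missing_export_files(demo_dir)

print(demo_missing)  # ['tokenizer_config.json']
```

In the notebook, calling `missing_export_files(MERGED_PATH)` should return an empty list after a complete save.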

---

## Part 6: Export to GGUF for Ollama (Optional)

GGUF format allows you to test the model locally with Ollama before browser deployment.

In [None]:
# Export to GGUF using llama.cpp
# This requires llama.cpp to be installed

print("📦 GGUF EXPORT (Optional)")
print("="*70)
print("""To export to GGUF format for Ollama testing:

1. Install llama.cpp (recent versions build with CMake):
   git clone https://github.com/ggerganov/llama.cpp
   cd llama.cpp
   pip install -r requirements.txt
   cmake -B build
   cmake --build build --config Release

2. Convert to GGUF:
   python convert_hf_to_gguf.py /path/to/matcha-merged --outfile matcha-expert.gguf

3. Quantize (optional, for smaller size):
   ./build/bin/llama-quantize matcha-expert.gguf matcha-expert-q4.gguf Q4_K_M

   Note: older llama.cpp checkouts build with `make` and name these
   tools convert-hf-to-gguf.py and ./quantize.

4. Create Ollama Modelfile:
   FROM ./matcha-expert-q4.gguf
   SYSTEM "You are a matcha tea expert..."

5. Register with Ollama:
   ollama create matcha-expert -f Modelfile

6. Test:
   ollama run matcha-expert "What is ceremonial grade matcha?"
""")

In [None]:
# Save Ollama Modelfile template
# Gemma's chat format has only user and model turns, so the system prompt
# is folded into the first user turn instead of getting its own turn.

modelfile_content = '''FROM ./matcha-expert-q4.gguf

TEMPLATE """<start_of_turn>user
{{ if .System }}{{ .System }}

{{ end }}{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>
"""

SYSTEM """You are a matcha tea expert with deep knowledge of Japanese tea culture, preparation methods, health benefits, and culinary applications. You provide accurate, helpful information about matcha grades, brewing techniques, traditional ceremonies, and modern recipes."""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_predict 256
PARAMETER stop "<end_of_turn>"
'''

modelfile_path = GGUF_PATH / "Modelfile"
with open(modelfile_path, 'w') as f:
    f.write(modelfile_content)

print(f"✅ Ollama Modelfile saved to {modelfile_path}")
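
To check a Modelfile `TEMPLATE` against the prompt the model was trained on, it helps to spell out the expected string. The sketch below assumes the published Gemma turn format: only `user` and `model` roles, with any system prompt folded into the first user turn and separated by a blank line. `render_gemma_prompt` is a hypothetical helper, and the leading BOS token that the tokenizer normally adds is left out.

```python
def render_gemma_prompt(user: str, system: str = "") -> str:
    """Build a single-turn Gemma-style prompt (BOS token omitted)."""
    # Assumption: the system prompt is prepended to the user message
    first_turn = f"{system}\n\n{user}" if system else user
    return (
        "<start_of_turn>user\n"
        f"{first_turn}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


prompt = render_gemma_prompt(
    user="How should I store matcha?",
    system="You are a matcha tea expert.",
)
print(prompt)
```

Comparing this string with the output of `tokenizer.apply_chat_template(..., tokenize=False)` for the same messages is a quick way to spot a template mismatch.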

---

## Common Issues

### Issue 1: Merged Model Quality Degraded
**Symptom:** Output is worse than LoRA model  
**Cause:** Model was loaded in 4-bit for merging  
**Fix:** Reload base model in BF16/FP16 without quantization

### Issue 2: CUDA Out of Memory During Merge
**Symptom:** OOM when loading full model  
**Fix:** Use `device_map="auto"` to spread across GPU/CPU

### Issue 3: Tokenizer Mismatch
**Symptom:** Output is garbled or wrong  
**Fix:** Ensure tokenizer comes from base model, not adapter

---

## Metrics & Outputs

| Metric | Expected | Actual |
|--------|----------|--------|
| Merged Model Size | ~0.5 GB (270M parameters × 2 bytes in BF16) | [Fill in] |
| Quality Match | Same as LoRA | [Fill in] |
| Merge Time | ~1-2 min | [Fill in] |
| Peak Memory | ~6-8 GB | [Fill in] |
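
A back-of-the-envelope estimate helps sanity-check the size you record. The sketch below assumes the nominal 270M parameter count implied by the model name and ignores file metadata and the tokenizer; `estimate_size_gb` is a hypothetical helper, and the measured value printed by the save cell is what belongs in the table.

```python
def estimate_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB: parameters x bits / 8 bits per byte."""
    return num_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 270e6  # assumption taken from the "270m" in the model name

for label, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{estimate_size_gb(NOMINAL_PARAMS, bits):.2f} GB")
```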

---

## Phase Complete!

You've achieved:
- ✅ Loaded base model in full precision (BF16)
- ✅ Merged LoRA adapters into base model
- ✅ Verified merged model quality
- ✅ Saved merged model for ONNX conversion
- ✅ Created Ollama Modelfile template

**Next:** [Lab 4.6.8.4: ONNX Quantization](./lab-4.6.8.4-onnx-quantization.ipynb)

---

In [None]:
# Cleanup
import gc

del merged_model
del base_model
if 'model' in dir():
    del model

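# Collect Python garbage first so freed tensors can be released from the CUDA cache
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("✅ Phase 3 Complete!")
print("\n🎯 Next Steps:")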
print("   1. Verify merged model produces correct outputs")
print("   2. (Optional) Test with Ollama using GGUF export")
print("   3. Proceed to Lab 4.6.8.4 for ONNX conversion")
print(f"\n   Merged model at: {MERGED_PATH}")

log_memory("After cleanup")