# V-ADASM Quickstart Notebook

**V-ADASM**: Vision-Adaptive Dimensionality-Aligned Subspace Merging

This notebook demonstrates how to use V-ADASM to merge a large multimodal model (donor) with a small text model (base) to create a compact Vision-Language Model.

In [None]:
# Install V-ADASM
!pip install -e ..

# Or from GitHub:
# !pip install git+https://github.com/yourorg/vadasm.git

In [None]:
import torch
from vadasm.merger import VADASMMerger, ModelConfig, MergeConfig

# 1. Configure Models
small_config = ModelConfig(
    name_or_path="microsoft/phi-2",  # Small base (2.7B params)
    is_moe=False,
    has_vision=False
)

large_config = ModelConfig(
    name_or_path="llava-hf/llava-1.5-7b-hf",  # Large donor with vision
    is_moe=False,
    has_vision=True
)

# 2. Configure Merge Process
merge_config = MergeConfig(
    fusion_beta=0.3,        # Vision delta weight
    projector_svd_rank=0.95, # SVD variance threshold
    ties_drop_rate=0.3,     # TIES sparsification
    evo_generations=15,     # Evolutionary optimization
    moe_top_k=2,           # For MoE models
    device="cuda"          # GPU acceleration
)
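
The three numeric knobs above can be read as follows. This is a minimal NumPy sketch, not V-ADASM's actual implementation, and it rests on assumptions about what the parameters mean: `projector_svd_rank=0.95` keeps the smallest rank that retains 95% of the squared singular-value energy, `ties_drop_rate=0.3` zeroes the 30% of delta entries with the smallest magnitude (only the trimming step of TIES, without sign election), and `fusion_beta=0.3` scales the remaining delta before it is added to the base weights. The helper names `rank_for_variance` and `trim_by_magnitude` are hypothetical, and the toy matrices share a shape, which a real small/large model pair does not.

```python
import numpy as np


def rank_for_variance(singular_values: np.ndarray, threshold: float) -> int:
    """Smallest rank whose squared singular values reach `threshold` of the total."""
    energy = np.square(singular_values)
    cumulative = np.cumsum(energy) / energy.sum()
    # First index where the cumulative share is >= threshold, as a 1-based rank.
    rank = int(np.searchsorted(cumulative, threshold)) + 1
    return min(rank, len(singular_values))


def trim_by_magnitude(delta: np.ndarray, drop_rate: float) -> np.ndarray:
    """Return a copy of `delta` with the smallest-magnitude entries set to zero."""
    n_drop = int(drop_rate * delta.size)
    trimmed = delta.copy().ravel()
    if n_drop > 0:
        # Indices of the n_drop entries with the smallest absolute value.
        drop_idx = np.argpartition(np.abs(trimmed), n_drop - 1)[:n_drop]
        trimmed[drop_idx] = 0.0
    return trimmed.reshape(delta.shape)


# Toy example (hypothetical data): same-shaped "base" and "donor" weights.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 8))
donor = base + rng.normal(scale=0.1, size=(8, 8))

delta = donor - base
kept_rank = rank_for_variance(np.linalg.svd(delta, compute_uv=False), 0.95)
fused = base + 0.3 * trim_by_magnitude(delta, drop_rate=0.3)

print(f"Rank kept at 95% variance: {kept_rank} of {min(delta.shape)}")
print(f"Non-zero delta entries after trimming: {np.count_nonzero(fused - base)}")
```

Lower `drop_rate` or higher `threshold` keeps more of the donor's signal at the cost of a denser, higher-rank update.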

In [None]:
# 3. Initialize Merger
merger = VADASMMerger(merge_config)

print("🚀 Starting V-ADASM merge pipeline...")
print(f"Small model: {small_config.name_or_path}")
print(f"Large model: {large_config.name_or_path}")
print("Output size: same parameter count as the small model (no size bloat!)")

In [None]:
# 4. Perform Training-Free Merge!
merged_model = merger.merge_models(small_config, large_config)

print("✅ Merge complete!")
print(f"Model has vision capability: {merged_model.config.has_vision}")
print(f"Parameter count preserved: {sum(p.numel() for p in merged_model.parameters())}")
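
The raw parameter count printed above is hard to read at a glance. The sketch below shows one way to render it compactly and to compare it against the base model's count. `format_param_count`, `same_size`, and the `tolerance` argument are hypothetical helpers, not part of V-ADASM, and the counts used are illustrative rather than measured.

```python
def format_param_count(n_params: int) -> str:
    """Render a raw parameter count as a short string such as '2.70B'."""
    for scale, suffix in ((1e12, "T"), (1e9, "B"), (1e6, "M"), (1e3, "K")):
        if n_params >= scale:
            return f"{n_params / scale:.2f}{suffix}"
    return str(n_params)


def same_size(base_params: int, merged_params: int, tolerance: float = 0.0) -> bool:
    """True if the merged count is within a relative `tolerance` of the base count."""
    return abs(merged_params - base_params) <= tolerance * base_params


# Illustrative numbers only: a ~2.7B base and a merged model of equal size.
base_params = 2_700_000_000
merged_params = 2_700_000_000

print(f"Base:   {format_param_count(base_params)}")
print(f"Merged: {format_param_count(merged_params)}")
print(f"Size preserved: {same_size(base_params, merged_params)}")
```

In the notebook, both counts would come from `sum(p.numel() for p in model.parameters())` on the base and merged models.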

In [None]:
# 5. Test Multimodal Inference
from transformers import pipeline
from PIL import Image
import requests

# Load merged model for inference
vlm = pipeline("image-to-text", model=merged_model, trust_remote_code=True)

# Example image (you can use your own)
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/coco_sample.png"
image = Image.open(requests.get(image_url, stream=True).raw)

# Multimodal generation
prompt = "Describe this scene in detail:"
# The image-to-text pipeline takes the image as its input and the text via `prompt=`
output = vlm(image, prompt=prompt, max_new_tokens=100)[0]['generated_text']

print(f"Prompt: {prompt}")
print(f"Response: {output}")

In [None]:
# 6. Optional: Evaluate Performance
# Note: this assumes the merged model has already been saved to ./merged_model
!python ../scripts/eval_vlm.py --model ./merged_model --tasks vqav2 mmlu --limit 100

## Next Steps

- **Fork and customize**: Modify hyperparams, try different model pairs
- **Scale up**: Use larger open-weight donor models (e.g. LLaVA-NeXT-34B) for better vision
- **Deploy**: Export to ONNX/TensorRT for edge inference
- **Contribute**: Add support for new architectures, benchmarks

**Key Benefits:**
- 🧠 **Same size**: Output = small model parameters
- 🚀 **Training-free**: Offline merge in ~2-4 hours
- 👁️ **Multimodal**: Text-only input → Text+Image processing
- 🔧 **Extensible**: Add new fusion methods, optimizers

Visit [GitHub](https://github.com/yourorg/vadasm) for more details!