# LatentMAS RunPod API - Complete Usage Guide

This notebook demonstrates all features of the **LatentMAS + S-LoRA** multi-agent system via the RunPod serverless API.

## Features Covered
1. **Basic Chat & Inference** - Simple text queries with domain routing
2. **Vision Language Model (VLM)** - Image + text queries
3. **RAG (Retrieval Augmented Generation)** - Document injection
4. **Conversation Continuity** - Session & conversation management
5. **LoRA Adapter Selection** - Domain-specific fine-tuned models
6. **Tool Execution** - Function calling capabilities
7. **Metadata Queries** - List adapters, conversations, sessions

## Architecture
- **Base Model**: Qwen/Qwen2.5-VL-7B-Instruct (8B params, VLM)
- **Pipeline**: Planner → Critic (latent) → Refiner (latent) → Judger
- **Adapters**: Medical, Math, Code, Reasoning LoRAs
- **RAG**: Document retrieval with embeddings
- **Persistence**: Session-based conversation memory

## 1. Setup & Configuration

First, configure your RunPod API credentials:

In [None]:
import requests
import json
import base64
import time
from typing import Dict, Any, Optional, List

# ==============================================
# RunPod Configuration
# ==============================================
API_KEY = "rpa_VHF8RTJVI3H5XX"  # Your RunPod API key
ENDPOINT_ID = "te7m0xova7z4rrXXX"  # Your endpoint ID

# Construct the API endpoint URL
RUNPOD_API_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run"

# Headers for all requests
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

print("✅ RunPod API configured")
print(f"📡 Endpoint: {RUNPOD_API_URL}")
print(f"🔑 API Key: {API_KEY[:10]}...")
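
Hard-coded credentials are easy to leak when a notebook is shared. As a minimal sketch, the same URL and headers can be built from environment variables instead. The variable names `RUNPOD_API_KEY` and `RUNPOD_ENDPOINT_ID` and the helper `load_runpod_config` are assumptions made for illustration; the LatentMAS project does not define them.

```python
import os
from typing import Dict, Mapping, Optional


def load_runpod_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Build the run URL and request headers from environment variables.

    RUNPOD_API_KEY and RUNPOD_ENDPOINT_ID are hypothetical variable names.
    """
    env = os.environ if env is None else env
    required = ("RUNPOD_API_KEY", "RUNPOD_ENDPOINT_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    endpoint_id = env["RUNPOD_ENDPOINT_ID"]
    return {
        "run_url": f"https://api.runpod.ai/v2/{endpoint_id}/run",
        "headers": {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {env['RUNPOD_API_KEY']}",
        },
    }
```

With both variables exported in the shell that launches Jupyter, `config = load_runpod_config()` gives a `run_url` and `headers` that could stand in for `RUNPOD_API_URL` and `HEADERS`.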

### Helper Functions

Utility functions for making API requests and handling responses:

In [None]:
def call_latent_mas(
    prompt: str,
    image_url: Optional[str] = None,
    image_base64: Optional[str] = None,
    max_tokens: int = 800,
    temperature: float = 0.7,
    system_prompt: Optional[str] = None,
    conversation_id: Optional[str] = None,
    session_id: Optional[str] = None,
    lora_adapter: Optional[str] = None,
    lora_hf_path: Optional[str] = None,
    rag_data: Optional[str] = None,
    rag_documents: Optional[List[Dict]] = None,
    no_default_data: bool = True,
    enable_tools: bool = False,
    model: str = "Qwen/Qwen2.5-VL-7B-Instruct",
    wait_for_completion: bool = True,
    timeout: int = 300,
) -> Dict[str, Any]:
    """
    Call the LatentMAS RunPod endpoint with comprehensive parameters.
    
    Args:
        prompt: The query text (required)
        image_url: URL of an image for VLM analysis
        image_base64: Base64-encoded image data
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature (0.0-1.0)
        system_prompt: Custom system prompt
        conversation_id: ID to continue existing conversation
        session_id: ID to group conversations
        lora_adapter: LoRA adapter name from registry
        lora_hf_path: Direct HuggingFace LoRA path
        rag_data: URL or base64 encoded data for RAG
        rag_documents: List of documents for RAG
        no_default_data: Skip built-in RAG data
        enable_tools: Enable tool execution
        model: Model name to use
        wait_for_completion: Wait for async job to complete
        timeout: Timeout in seconds for completion
    
    Returns:
        Response dictionary with results
    """
    # Build request payload
    payload = {
        "input": {
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "model": model,
            "no_default_data": no_default_data,
            "enable_tools": enable_tools,
        }
    }
    
    # Add optional parameters
    if image_url:
        payload["input"]["image_url"] = image_url
    if image_base64:
        payload["input"]["image_base64"] = image_base64
    if system_prompt:
        payload["input"]["system_prompt"] = system_prompt
    if conversation_id:
        payload["input"]["conversation_id"] = conversation_id
    if session_id:
        payload["input"]["session_id"] = session_id
    if lora_adapter:
        payload["input"]["lora_adapter"] = lora_adapter
    if lora_hf_path:
        payload["input"]["lora_hf_path"] = lora_hf_path
    if rag_data:
        payload["input"]["rag_data"] = rag_data
    if rag_documents:
        payload["input"]["rag_documents"] = rag_documents
    
    # Submit job (the request timeout guards against a hung connection)
    response = requests.post(RUNPOD_API_URL, headers=HEADERS, json=payload, timeout=30)
    response.raise_for_status()
    result = response.json()
    
    if not wait_for_completion:
        return result
    
    # Poll for completion
    job_id = result.get("id")
    if not job_id:
        raise RuntimeError(f"RunPod did not return a job ID: {result}")
    status_url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}"
    
    start_time = time.time()
    while time.time() - start_time < timeout:
        status_response = requests.get(status_url, headers=HEADERS, timeout=30)
        status_response.raise_for_status()
        status_data = status_response.json()
        
        status = status_data.get("status")
        
        if status == "COMPLETED":
            return status_data.get("output", {})
        elif status in ("FAILED", "CANCELLED", "TIMED_OUT"):
            # Terminal states: further polling cannot succeed
            raise RuntimeError(f"Job {status.lower()}: {status_data.get('error', 'Unknown error')}")
        
        time.sleep(2)  # Poll every 2 seconds
    
    raise TimeoutError(f"Job did not complete within {timeout} seconds")


def list_available_loras() -> Dict[str, Any]:
    """List all available LoRA adapters in the registry."""
    payload = {
        "input": {
            "list_loras": True
        }
    }
    response = requests.post(RUNPOD_API_URL, headers=HEADERS, json=payload)
    response.raise_for_status()
    result = response.json()
    
    # Handle async response
    if "id" in result:
        job_id = result["id"]
        status_url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}"
        
        for _ in range(30):  # 60 seconds timeout
            status_response = requests.get(status_url, headers=HEADERS)
            status_response.raise_for_status()
            status_data = status_response.json()
            
            if status_data.get("status") == "COMPLETED":
                return status_data.get("output", {})
            elif status_data.get("status") == "FAILED":
                raise Exception(f"Failed to list LoRAs: {status_data.get('error')}")
            
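            time.sleep(2)
        
        # Polling exhausted: fail loudly instead of returning the submit response
        raise TimeoutError("Listing LoRA adapters did not complete within 60 seconds")
    
    return result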


def list_conversations() -> Dict[str, Any]:
    """List all saved conversations and sessions."""
    payload = {
        "input": {
            "list_conversations": True
        }
    }
    response = requests.post(RUNPOD_API_URL, headers=HEADERS, json=payload)
    response.raise_for_status()
    result = response.json()
    
    # Handle async response
    if "id" in result:
        job_id = result["id"]
        status_url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}"
        
        for _ in range(30):
            status_response = requests.get(status_url, headers=HEADERS)
            status_response.raise_for_status()
            status_data = status_response.json()
            
            if status_data.get("status") == "COMPLETED":
                return status_data.get("output", {})
            elif status_data.get("status") == "FAILED":
                raise Exception(f"Failed to list conversations: {status_data.get('error')}")
            
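            time.sleep(2)
        
        # Polling exhausted: fail loudly instead of returning the submit response
        raise TimeoutError("Listing conversations did not complete within 60 seconds")
    
    return result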


def print_response(response: Dict[str, Any], show_full: bool = False):
    """Pretty print the API response."""
    print("\n" + "="*60)
    print("📥 RESPONSE")
    print("="*60)
    
    if "response" in response:
        print(f"\n💬 Answer:\n{response['response']}\n")
    
    if "error" in response:
        print(f"\n❌ Error: {response['error']}\n")
        if "traceback" in response:
            print(f"Traceback:\n{response['traceback']}")
        return
    
    # Metadata
    print("📊 Metadata:")
    print(f"  • Domain: {response.get('domain', 'N/A')} (confidence: {response.get('domain_confidence', 0):.2f})")
    print(f"  • Model: {response.get('model', 'N/A')}")
    print(f"  • VLM Mode: {'Yes' if response.get('vlm') else 'No'}")
    print(f"  • Image Provided: {'Yes' if response.get('image_provided') else 'No'}")
    print(f"  • RAG Enabled: {'Yes' if response.get('rag_enabled') else 'No'}")
    print(f"  • Tools Enabled: {'Yes' if response.get('tools_enabled') else 'No'}")
    
    if "conversation_id" in response:
        print(f"  • Conversation ID: {response['conversation_id'][:16]}...")
    if "session_id" in response:
        print(f"  • Session ID: {response['session_id'][:16]}...")
    
    # LoRA info
    if "lora" in response and response["lora"].get("loaded"):
        lora_info = response["lora"]
        print(f"  • LoRA Adapter: {lora_info.get('adapter', 'N/A')}")
        if "hf_path" in lora_info:
            print(f"    Path: {lora_info['hf_path']}")
    
    if show_full:
        print("\n🔍 Full Response:")
        print(json.dumps(response, indent=2))
    
    print("="*60 + "\n")

print("✅ Helper functions defined")
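
The three request helpers above each carry their own copy of the submit-and-poll loop. One possible way to share it is sketched below. `poll_job` is a hypothetical helper, not part of the LatentMAS code. The status fetcher, sleep function and clock are passed in so the loop can be exercised without network access. The set of failure states is an assumption based on RunPod's documented job statuses.

```python
import time
from typing import Any, Callable, Dict

# Assumed terminal failure states for a RunPod serverless job
TERMINAL_FAILURES = {"FAILED", "CANCELLED", "TIMED_OUT"}


def poll_job(
    fetch_status: Callable[[], Dict[str, Any]],
    timeout: float = 300.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Call fetch_status() until the job completes, fails, or times out."""
    deadline = clock() + timeout
    while clock() < deadline:
        data = fetch_status()
        status = data.get("status")
        if status == "COMPLETED":
            return data.get("output", {})
        if status in TERMINAL_FAILURES:
            raise RuntimeError(
                f"Job ended with status {status}: {data.get('error', 'Unknown error')}"
            )
        sleep(interval)
    raise TimeoutError(f"Job did not complete within {timeout} seconds")
```

In this notebook, `fetch_status` could be `lambda: requests.get(status_url, headers=HEADERS, timeout=30).json()`, which would let all three helpers delegate their polling to one place.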

---

## 2. Basic Chat & Inference

Simple text queries with multi-agent reasoning and domain routing:

In [None]:
# Example 1: Simple question
print("🔹 Example 1: Simple Question")
response = call_latent_mas(
    prompt="What is the capital of France?",
    max_tokens=200,
    temperature=0.7
)
print_response(response)

In [None]:
# Example 2: Complex reasoning (triggers multi-agent pipeline)
print("🔹 Example 2: Complex Reasoning")
response = call_latent_mas(
    prompt="Explain the difference between supervised and unsupervised learning, with examples.",
    max_tokens=500,
    temperature=0.7
)
print_response(response)

In [None]:
# Example 3: Domain-specific query (medical)
print("🔹 Example 3: Medical Domain")
response = call_latent_mas(
    prompt="What are the symptoms and treatment for type 2 diabetes?",
    max_tokens=600,
    temperature=0.6
)
print_response(response)

In [None]:
# Example 4: Math/Reasoning domain
print("🔹 Example 4: Math Problem")
response = call_latent_mas(
    prompt="Solve: If a train travels at 80 km/h for 2.5 hours, then slows to 60 km/h for 1 hour, what is the total distance traveled?",
    max_tokens=400,
    temperature=0.5
)
print_response(response)

In [None]:
# Example 5: Code generation
print("🔹 Example 5: Code Domain")
response = call_latent_mas(
    prompt="Write a Python function to calculate the Fibonacci sequence up to n terms using dynamic programming.",
    max_tokens=500,
    temperature=0.5
)
print_response(response)

---

## 3. Vision Language Model (VLM)

The base model supports image + text queries. You can provide images via URL or base64 encoding:

In [None]:
# Example 6: Image analysis via URL
print("🔹 Example 6: Image Analysis (URL)")
response = call_latent_mas(
    prompt="Describe what you see in this image in detail.",
    image_url="https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
    max_tokens=400,
    temperature=0.7
)
print_response(response)

In [None]:
# Example 7: Medical image analysis (with medical LoRA)
print("🔹 Example 7: Medical Image Analysis")
response = call_latent_mas(
    prompt="Analyze this chest X-ray image. Identify any abnormalities or key findings.",
    image_url="https://upload.wikimedia.org/wikipedia/commons/8/8c/Chest_Xray_PA_3-8-2010.png",
    lora_adapter="medical_vl",  # Use medical vision LoRA
    max_tokens=600,
    temperature=0.6
)
print_response(response)

In [None]:
# Example 8: Base64 image encoding (for local files)
def encode_image_to_base64(image_path: str) -> str:
    """Encode a local image to base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Example usage (uncomment if you have a local image)
# image_b64 = encode_image_to_base64("/path/to/your/image.jpg")
# response = call_latent_mas(
#     prompt="What objects can you identify in this image?",
#     image_base64=image_b64,
#     max_tokens=300
# )
# print_response(response)

print("✅ Image encoding function defined (uncomment to use with local images)")

---

## 4. RAG (Retrieval Augmented Generation)

Inject custom documents for context-aware responses. The system supports multiple RAG input methods:

In [None]:
# Example 9: RAG with inline documents
print("🔹 Example 9: RAG with Inline Documents")

# Define custom documents
custom_docs = [
    {
        "content": """
        Company Policy: Remote Work Guidelines
        
        All employees are eligible for remote work up to 3 days per week.
        Employees must maintain core hours of 10 AM - 3 PM in their local timezone.
        Weekly team meetings are mandatory on Tuesdays at 2 PM EST.
        Home office stipend: $500 annually for equipment.
        """,
        "metadata": {"type": "policy", "category": "remote_work"}
    },
    {
        "content": """
        IT Security Requirements for Remote Workers
        
        1. Use company-approved VPN at all times
        2. Enable 2FA on all corporate accounts
        3. Encrypt all devices with BitLocker or FileVault
        4. Never share credentials via email or chat
        5. Report suspicious emails to security@company.com
        """,
        "metadata": {"type": "policy", "category": "security"}
    }
]

response = call_latent_mas(
    prompt="What are the remote work policies and security requirements for our company?",
    rag_documents=custom_docs,
    no_default_data=True,  # Only use provided documents
    max_tokens=600,
    temperature=0.6
)
print_response(response)

In [None]:
# Example 10: RAG with medical knowledge (using built-in data)
print("🔹 Example 10: RAG with Built-in Medical Data")

response = call_latent_mas(
    prompt="What medications are commonly prescribed for hypertension and what are their mechanisms of action?",
    no_default_data=False,  # Use built-in medical knowledge from data/
    max_tokens=700,
    temperature=0.6
)
print_response(response)

In [None]:
# Example 11: RAG with cryptocurrency data
print("🔹 Example 11: RAG with Cryptocurrency Data")

response = call_latent_mas(
    prompt="Analyze the Bitcoin price trends and provide insights on market volatility.",
    no_default_data=False,  # Uses built-in CryptocurrencyData.csv
    max_tokens=600,
    temperature=0.7
)
print_response(response)

In [None]:
# Example 12: RAG with JSON data (base64 encoded)
print("🔹 Example 12: RAG with Base64 Encoded JSON")

# Create a JSON dataset
import json
import base64

product_data = {
    "products": [
        {
            "id": "P001",
            "name": "Wireless Headphones",
            "price": 79.99,
            "features": ["Bluetooth 5.0", "40h battery", "Noise cancellation"],
            "rating": 4.5
        },
        {
            "id": "P002",
            "name": "Smart Watch",
            "price": 199.99,
            "features": ["Heart rate monitor", "GPS", "Water resistant"],
            "rating": 4.7
        },
        {
            "id": "P003",
            "name": "USB-C Hub",
            "price": 49.99,
            "features": ["7 ports", "4K HDMI", "Fast charging"],
            "rating": 4.3
        }
    ]
}

# Encode to base64
json_str = json.dumps(product_data)
rag_data_b64 = base64.b64encode(json_str.encode()).decode()

response = call_latent_mas(
    prompt="Compare the smart watch and wireless headphones. Which one offers better value for money?",
    rag_data=rag_data_b64,
    no_default_data=True,
    max_tokens=500,
    temperature=0.7
)
print_response(response)
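
Long source texts are usually easier to retrieve from when they are split into focused pieces before being passed as `rag_documents`. The sketch below splits text into overlapping character windows and emits entries in the same `content`/`metadata` shape as the inline examples. `chunk_text`, the default sizes and the `chunk_index` metadata key are assumptions for illustration; the server may apply its own chunking as well.

```python
from typing import Any, Dict, List, Optional


def chunk_text(
    text: str,
    max_chars: int = 1000,
    overlap: int = 100,
    metadata: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    """Split text into overlapping chunks shaped like rag_documents entries."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("Require max_chars > 0 and 0 <= overlap < max_chars")
    step = max_chars - overlap
    docs: List[Dict[str, Any]] = []
    for start in range(0, len(text), step):
        piece = text[start:start + max_chars].strip()
        if piece:
            docs.append({
                "content": piece,
                "metadata": {**(metadata or {}), "chunk_index": len(docs)},
            })
        if start + max_chars >= len(text):
            break  # the current chunk already reached the end of the text
    return docs
```

A call such as `call_latent_mas(prompt=..., rag_documents=chunk_text(long_text, metadata={"type": "policy"}), no_default_data=True)` would then send only the chunked source.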

---

## 5. Conversation Continuity

Maintain context across multiple requests using session and conversation IDs:

In [None]:
# Example 13: Multi-turn conversation
print("🔹 Example 13: Multi-turn Conversation")

# First message - create new session
response1 = call_latent_mas(
    prompt="I'm planning a trip to Japan. Can you suggest some must-visit cities?",
    session_id="user-123-travel-planning",
    max_tokens=400,
    temperature=0.7
)
print_response(response1)

# Extract conversation ID for continuity
conversation_id = response1.get("conversation_id")
session_id = response1.get("session_id")

# str() keeps these lines from raising TypeError if an ID is missing from the response
print(f"\n📝 Conversation ID saved: {str(conversation_id)[:16]}...")
print(f"📝 Session ID saved: {str(session_id)[:16]}...")

In [None]:
# Continue the conversation from previous cell
print("🔹 Follow-up: Continue Conversation")

response2 = call_latent_mas(
    prompt="How long should I plan to stay in each city? I have 2 weeks total.",
    conversation_id=conversation_id,
    session_id=session_id,
    max_tokens=400,
    temperature=0.7
)
print_response(response2)

In [None]:
# Another follow-up
print("🔹 Follow-up: Ask about specific details")

response3 = call_latent_mas(
    prompt="What's the best time of year to visit? I want to see cherry blossoms.",
    conversation_id=conversation_id,
    session_id=session_id,
    max_tokens=400,
    temperature=0.7
)
print_response(response3)

In [None]:
# Example 14: Multiple conversations in one session
print("🔹 Example 14: Multiple Conversations in Same Session")

# Start a new conversation in the same session
response4 = call_latent_mas(
    prompt="Now let's talk about Japanese cuisine. What dishes should I try?",
    session_id=session_id,  # Same session, but no conversation_id = new conversation
    max_tokens=400,
    temperature=0.7
)
print_response(response4)

new_conversation_id = response4.get("conversation_id")
# str() keeps these lines from raising TypeError if an ID is missing from the response
print(f"\n📝 New conversation started: {str(new_conversation_id)[:16]}...")
print(f"   (still in session: {str(session_id)[:16]}...)")
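
Examples 13 and 14 follow one rule: send both IDs to continue a conversation, and send only `session_id` to start a new one. That bookkeeping can be wrapped in a small state holder, sketched below. `ConversationTracker` is a hypothetical class. It takes a `send` callable with the same keyword arguments as `call_latent_mas`, so it can be tried with a stub before touching the endpoint.

```python
from typing import Any, Callable, Dict, Optional


class ConversationTracker:
    """Remember session and conversation IDs between calls."""

    def __init__(self, send: Callable[..., Dict[str, Any]], session_id: Optional[str] = None):
        self._send = send
        self.session_id = session_id
        self.conversation_id: Optional[str] = None

    def ask(self, prompt: str, new_conversation: bool = False, **kwargs: Any) -> Dict[str, Any]:
        if new_conversation:
            # Only session_id is sent, so the server starts a new conversation
            self.conversation_id = None
        ids = {"session_id": self.session_id, "conversation_id": self.conversation_id}
        known_ids = {key: value for key, value in ids.items() if value}
        response = self._send(prompt=prompt, **known_ids, **kwargs)
        # Keep whatever IDs the server returned; fall back to the current ones
        self.session_id = response.get("session_id") or self.session_id
        self.conversation_id = response.get("conversation_id") or self.conversation_id
        return response
```

For example, `chat = ConversationTracker(call_latent_mas, session_id="user-123-travel-planning")` followed by repeated `chat.ask(...)` calls would reproduce the multi-turn flow, and `chat.ask(..., new_conversation=True)` would match Example 14.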

---

## 6. LoRA Adapter Selection

The system supports multiple domain-specific LoRA adapters. You can list available adapters and select specific ones:

In [None]:
# Example 15: List available LoRA adapters
print("🔹 Example 15: List Available LoRA Adapters")

loras = list_available_loras()

print("\n📚 Available LoRA Adapters:")
print("="*60)

if "loras" in loras:
    for name, info in loras["loras"].items():
        print(f"\n🔸 {name}")
        print(f"   Domain: {info.get('domain', 'N/A')}")
        print(f"   Description: {info.get('description', 'N/A')}")
        print(f"   HF Path: {info.get('hf_path', 'N/A')}")
        print(f"   Source: {info.get('source', 'N/A')}")
        if "tags" in info:
            print(f"   Tags: {', '.join(info['tags'])}")
else:
    print(json.dumps(loras, indent=2))

print("="*60)

In [None]:
# Example 16: Use medical LoRA adapter
print("🔹 Example 16: Medical LoRA Adapter")

response = call_latent_mas(
    prompt="Explain the pathophysiology of congestive heart failure and the role of ACE inhibitors in treatment.",
    lora_adapter="medical_vl",
    max_tokens=700,
    temperature=0.6
)
print_response(response)

In [None]:
# Example 17: Use reward model LoRA for higher quality
print("🔹 Example 17: Reward Model LoRA (Quality Enhancement)")

response = call_latent_mas(
    prompt="Explain quantum entanglement in simple terms, but make it accurate and engaging.",
    lora_adapter="reward_vl",
    max_tokens=500,
    temperature=0.7
)
print_response(response)

In [None]:
# Example 18: Use custom LoRA via HuggingFace path
print("🔹 Example 18: Custom LoRA via Direct HF Path")

# You can load any compatible LoRA from HuggingFace
response = call_latent_mas(
    prompt="Analyze this comic panel and describe the narrative elements.",
    lora_hf_path="VLR-CVC/Qwen2.5-VL-7B-Instruct-lora-ComicsPAP",
    image_url="https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Comic_image_missing.svg/400px-Comic_image_missing.svg.png",
    max_tokens=500,
    temperature=0.7
)
print_response(response)

---

## 7. Custom System Prompts

Override the default system prompt to customize agent behavior:

In [None]:
# Example 19: Custom system prompt - Pirate mode
print("🔹 Example 19: Custom System Prompt (Pirate)")

response = call_latent_mas(
    prompt="Tell me about machine learning.",
    system_prompt="You are a friendly pirate captain who explains technical concepts using nautical metaphors. Always respond in pirate speak.",
    max_tokens=400,
    temperature=0.9
)
print_response(response)

In [None]:
# Example 20: Custom system prompt - Expert consultant
print("🔹 Example 20: Custom System Prompt (Expert Consultant)")

response = call_latent_mas(
    prompt="Should we migrate our infrastructure to Kubernetes?",
    system_prompt="""You are a senior cloud architecture consultant with 15 years of experience. 
    Provide detailed, nuanced advice considering cost, scalability, team expertise, and long-term maintenance. 
    Always mention potential risks and trade-offs.""",
    max_tokens=600,
    temperature=0.6
)
print_response(response)

---

## 8. Advanced Use Cases

Combining multiple features for complex scenarios:

In [None]:
# Example 21: Medical consultation with RAG + LoRA + VLM
print("🔹 Example 21: Full Medical Analysis (RAG + LoRA + VLM)")

# Patient context document
patient_docs = [
    {
        "content": """
        Patient History:
        Name: John Doe (fictional)
        Age: 65
        History: Type 2 Diabetes (15 years), Hypertension (10 years)
        Current Medications: Metformin 1000mg BID, Lisinopril 10mg QD
        Recent Labs: HbA1c 7.8%, Blood Pressure 145/92
        Symptoms: Increased fatigue, occasional chest discomfort
        """,
        "metadata": {"type": "patient_record", "patient_id": "P12345"}
    }
]

response = call_latent_mas(
    prompt="Based on the patient history, analyze this chest X-ray and provide a comprehensive assessment with treatment recommendations.",
    image_url="https://upload.wikimedia.org/wikipedia/commons/8/8c/Chest_Xray_PA_3-8-2010.png",
    rag_documents=patient_docs,
    lora_adapter="medical_vl",
    no_default_data=False,  # Include medical knowledge base
    max_tokens=800,
    temperature=0.6,
    system_prompt="You are an experienced physician providing detailed medical analysis. Always consider patient history and current medications."
)
print_response(response)

In [None]:
# Example 22: Research assistant with conversation history
print("🔹 Example 22: Research Assistant with Context")

# Start research session
research_session = "research-session-ai-ethics"

# First query
response1 = call_latent_mas(
    prompt="What are the main ethical concerns regarding AI in healthcare?",
    session_id=research_session,
    max_tokens=600,
    temperature=0.7
)
print_response(response1)

research_conv_id = response1.get("conversation_id")

# Follow-up with context
print("\n" + "="*60)
print("💬 Follow-up Query")
print("="*60)

response2 = call_latent_mas(
    prompt="Can you elaborate on the privacy concerns you mentioned?",
    conversation_id=research_conv_id,
    session_id=research_session,
    max_tokens=500,
    temperature=0.7
)
print_response(response2)

# Add research documents
print("\n" + "="*60)
print("💬 Adding Research Context")
print("="*60)

research_docs = [
    {
        "content": """
        Recent Study: AI in Medical Diagnosis
        Published: 2026
        
        Key Findings:
        - AI models achieve 94% accuracy in radiology diagnoses
        - 23% reduction in diagnostic errors when AI assists physicians
        - Patient privacy concerns: 67% of patients worried about data sharing
        - Algorithmic bias detected in 31% of skin condition diagnoses
        - Healthcare costs reduced by 18% with AI triage systems
        """,
        "metadata": {"type": "research", "year": 2026}
    }
]

response3 = call_latent_mas(
    prompt="Based on this recent research, what solutions would you propose for the privacy and bias issues?",
    conversation_id=research_conv_id,
    session_id=research_session,
    rag_documents=research_docs,
    max_tokens=700,
    temperature=0.7
)
print_response(response3)

In [None]:
# Example 23: Code review with technical context
print("🔹 Example 23: Code Review Session")

code_context = [
    {
        "content": """
        # Current Implementation
        def process_user_data(data):
            results = []
            for item in data:
                if item['status'] == 'active':
                    results.append({
                        'id': item['id'],
                        'name': item['name'],
                        'email': item['email'],
                        'score': item['score'] * 1.5
                    })
            return results
        
        # Issues:
        # - No input validation
        # - Append loop could be a list comprehension
        # - Magic number (1.5)
        # - No error handling
        # - Direct dict access without checks
        """,
        "metadata": {"type": "code", "language": "python"}
    }
]

response = call_latent_mas(
    prompt="Review this code and suggest improvements focusing on performance, security, and maintainability. Provide a refactored version.",
    rag_documents=code_context,
    no_default_data=True,
    max_tokens=800,
    temperature=0.5,
    system_prompt="You are a senior Python developer conducting a thorough code review. Focus on best practices, security, and performance."
)
print_response(response)

---

## 9. Session & Conversation Management

List and manage your conversations:

In [None]:
# Example 24: List all conversations
print("🔹 Example 24: List All Conversations")

conversations = list_conversations()

print("\n📚 Active Sessions & Conversations:")
print("="*60)

if "sessions" in conversations:
    # Loop variable is named `sid` so it does not overwrite the notebook-level `session_id`
    for sid, session_info in conversations["sessions"].items():
        print(f"\n📁 Session: {sid[:16]}...")
        print(f"   Created: {session_info.get('created', 'N/A')}")
        print(f"   Conversations: {len(session_info.get('conversations', []))}")
        
        for conv_id in session_info.get("conversations", []):
            print(f"      💬 {conv_id[:16]}...")
else:
    print(json.dumps(conversations, indent=2))

print("="*60)

---

## 10. Temperature & Creativity Control

Experiment with different temperature settings for various use cases:

In [None]:
# Example 25: Low temperature (factual, deterministic)
print("🔹 Example 25: Low Temperature (0.3) - Factual Response")

response_low = call_latent_mas(
    prompt="What is the boiling point of water at sea level?",
    max_tokens=200,
    temperature=0.3
)
print_response(response_low)

In [None]:
# Example 26: High temperature (creative, diverse)
print("🔹 Example 26: High Temperature (1.0) - Creative Response")

response_high = call_latent_mas(
    prompt="Write a short story about a robot learning to paint.",
    max_tokens=500,
    temperature=1.0
)
print_response(response_high)
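
The effect of temperature can be seen without calling the model. In standard sampling, the logits are divided by the temperature before the softmax, so a low value concentrates probability on the top token. The sketch below uses made-up logits for three candidate tokens. It illustrates the general technique only and makes no claim about how LatentMAS applies temperature internally.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # illustrative values only
for t in (0.3, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises toward 1.0, the printed distribution flattens, which is why higher settings give more varied output.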

---

## 11. Batch Processing Example

Process multiple queries efficiently:

In [None]:
# Example 27: Batch processing
print("🔹 Example 27: Batch Processing Multiple Queries")

queries = [
    {"prompt": "What is machine learning?", "domain": "general"},
    {"prompt": "Explain gradient descent algorithm", "domain": "math"},
    {"prompt": "Write a Python decorator for timing functions", "domain": "code"},
    {"prompt": "What are the symptoms of pneumonia?", "domain": "medical"},
]

results = []

print(f"\n🔄 Processing {len(queries)} queries...\n")

for i, query in enumerate(queries, 1):
    print(f"[{i}/{len(queries)}] Processing: {query['prompt'][:50]}...")
    
    try:
        response = call_latent_mas(
            prompt=query["prompt"],
            max_tokens=300,
            temperature=0.7
        )
        results.append({
            "query": query["prompt"],
            "response": response.get("response", ""),
            "domain": response.get("domain", ""),
            "confidence": response.get("domain_confidence", 0)
        })
        print(f"    ✅ Completed (Domain: {response.get('domain', 'N/A')})")
    except Exception as e:
        print(f"    ❌ Failed: {e}")
        results.append({
            "query": query["prompt"],
            "error": str(e)
        })
    
    time.sleep(1)  # Rate limiting

print(f"\n✅ Batch processing complete: {len(results)} results\n")

# Display summary
print("="*60)
print("📊 BATCH RESULTS SUMMARY")
print("="*60)
for i, result in enumerate(results, 1):
    print(f"\n{i}. Query: {result['query'][:60]}...")
    if "error" in result:
        print(f"   ❌ Error: {result['error']}")
    else:
        print(f"   Domain: {result['domain']} (confidence: {result['confidence']:.2f})")
        print(f"   Response: {result['response'][:100]}...")
print("="*60)
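
The loop above submits one query at a time. Because each call spends most of its time waiting on the remote job, a thread pool can overlap the waits. The sketch below is one way to do that. `run_batch` is a hypothetical helper, and `max_workers=4` is an arbitrary value that should stay within whatever concurrency and rate limits apply to your endpoint.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def run_batch(
    call: Callable[..., Dict[str, Any]],
    prompts: List[str],
    max_workers: int = 4,
    **kwargs: Any,
) -> List[Dict[str, Any]]:
    """Run call(prompt=..., **kwargs) for each prompt, preserving input order."""

    def run_one(prompt: str) -> Dict[str, Any]:
        try:
            response = call(prompt=prompt, **kwargs)
            return {
                "query": prompt,
                "response": response.get("response", ""),
                "domain": response.get("domain", ""),
            }
        except Exception as exc:  # record the failure instead of aborting the batch
            return {"query": prompt, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs
        return list(pool.map(run_one, prompts))
```

Something like `run_batch(call_latent_mas, [q["prompt"] for q in queries], max_tokens=300, temperature=0.7)` would produce a list comparable to `results` above.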

---

## 12. Error Handling & Best Practices

In [None]:
# Example 28: Robust error handling
print("🔹 Example 28: Error Handling Best Practices")

def safe_query(prompt: str, **kwargs):
    """
    Safely execute a query with comprehensive error handling.
    """
    try:
        response = call_latent_mas(
            prompt=prompt,
            **kwargs
        )
        
        if "error" in response:
            print(f"❌ API Error: {response['error']}")
            return None
        
        return response
        
    except (requests.exceptions.Timeout, TimeoutError):
        # call_latent_mas raises the built-in TimeoutError when polling exceeds `timeout`
        print("❌ Request timed out. The model may be cold-starting or overloaded.")
        return None
        
    except requests.exceptions.HTTPError as e:
        print(f"❌ HTTP Error: {e}")
        if e.response.status_code == 401:
            print("   Check your API key")
        elif e.response.status_code == 404:
            print("   Check your endpoint ID")
        return None
        
    except Exception as e:
        print(f"❌ Unexpected error: {e}")
        return None

# Test with error handling
response = safe_query(
    "What are the benefits of yoga?",
    max_tokens=400,
    temperature=0.7
)

if response:
    print_response(response)
else:
    print("⚠️  Query failed, see error above")
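
Timeouts during a cold start are often worth retrying rather than reporting straight away. A minimal retry wrapper with exponential backoff might look like the sketch below. `with_retries`, the attempt count and the delays are assumptions for illustration, and the sleep function is passed in so the wrapper can be tested without waiting.

```python
import time
from typing import Any, Callable, Tuple, Type


def with_retries(
    func: Callable[[], Any],
    attempts: int = 3,
    base_delay: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call func(), retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))  # 2 s, 4 s, 8 s, ... with the defaults
```

The defaults cover only the built-in `TimeoutError` and `ConnectionError`. To retry network failures raised by `requests`, pass for example `retry_on=(requests.exceptions.ConnectionError, requests.exceptions.Timeout, TimeoutError)`, as in `with_retries(lambda: call_latent_mas(prompt="..."), retry_on=...)`.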

---

## 13. Performance Tips & Best Practices

### Optimization Strategies:

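1. **Reuse Sessions**: Pass the returned `conversation_id` (together with `session_id`) on follow-up turns to keep context; sending only `session_id` starts a new conversation in the same session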
2. **Temperature Control**:
   - Low (0.3-0.5): Factual queries, code generation, medical info
   - Medium (0.6-0.8): General chat, explanations
   - High (0.9-1.0): Creative writing, brainstorming
3. **RAG Optimization**:
   - Use `no_default_data=True` if you don't need built-in knowledge
   - Provide focused, relevant documents (not entire websites)
   - Limit to 5-10 documents per query
4. **LoRA Selection**:
   - Use domain-specific adapters when available
   - Medical: `medical_vl`
   - Quality: `reward_vl`
   - Comics/Visual: `comics_vl`
5. **Token Management**:
   - Start with `max_tokens=200-400` for quick responses
   - Increase to 600-800 for detailed explanations
   - Monitor costs (longer = more expensive)
6. **Rate Limiting**:
   - Add delays between batch requests
   - RunPod may have per-endpoint rate limits
7. **Cold Start**:
   - First request may take 30-60s (model loading)
   - Subsequent requests are fast (< 5s)

### Common Pitfalls:

❌ **Don't** send huge documents (>10MB) via RAG
✅ **Do** chunk and summarize first

❌ **Don't** use high temperature for factual queries
✅ **Do** use 0.3-0.5 for accuracy

❌ **Don't** create new sessions for every message
✅ **Do** reuse `session_id` for related conversations

❌ **Don't** ignore domain routing results
✅ **Do** use domain info to select appropriate LoRAs
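
One way to act on the routing result is to map the detected `domain` to an adapter and use it only when `domain_confidence` is high enough. In the sketch below, `pick_lora`, the mapping and the 0.6 threshold are assumptions for illustration. Only the pairing of the medical domain with `medical_vl` appears in this notebook, so check `list_available_loras()` for what your deployment actually offers.

```python
from typing import Any, Dict, Optional

# Hypothetical mapping from detected domain to adapter name
DOMAIN_TO_LORA = {"medical": "medical_vl"}


def pick_lora(
    response: Dict[str, Any],
    mapping: Dict[str, str] = DOMAIN_TO_LORA,
    min_confidence: float = 0.6,
) -> Optional[str]:
    """Return an adapter name for the detected domain, or None to use the base model."""
    confidence = response.get("domain_confidence") or 0.0
    if confidence < min_confidence:
        return None
    return mapping.get(response.get("domain", ""))
```

The result can be passed straight back in as `lora_adapter=pick_lora(previous_response)`, since `call_latent_mas` omits the field when the value is `None`.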

---

## 14. API Reference Summary

### Core Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `prompt` | string | ✅ Yes | - | The query text |
| `max_tokens` | int | No | 800 | Maximum tokens to generate |
| `temperature` | float | No | 0.7 | Sampling temperature (0.0-1.0) |
| `model` | string | No | Qwen/Qwen2.5-VL-7B-Instruct | Model name |
| `enable_tools` | bool | No | false | Enable tool execution |

### Vision (VLM) Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `image_url` | string | No | URL of an image for analysis |
| `image_base64` | string | No | Base64-encoded image data |

### RAG Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `rag_documents` | list[dict] | No | Inline documents for context |
| `rag_data` | string | No | URL or base64 JSON/CSV data |
| `no_default_data` | bool | No | Skip built-in knowledge base |

### Conversation Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `session_id` | string | No | Session ID for grouping conversations |
| `conversation_id` | string | No | Continue existing conversation |
| `system_prompt` | string | No | Custom system prompt |

### LoRA Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `lora_adapter` | string | No | Adapter name from registry |
| `lora_hf_path` | string | No | Direct HuggingFace LoRA path |

### Special Requests

| Parameter | Type | Description |
|-----------|------|-------------|
| `list_loras` | bool | Return available LoRA adapters |
| `list_conversations` | bool | Return saved sessions |

### Response Fields

| Field | Type | Description |
|-------|------|-------------|
| `response` | string | The generated answer |
| `conversation_id` | string | Conversation ID (for continuity) |
| `session_id` | string | Session ID |
| `domain` | string | Detected domain (medical, code, math, etc.) |
| `domain_confidence` | float | Domain classification confidence |
| `model` | string | Model used |
| `vlm` | bool | VLM mode active |
| `image_provided` | bool | Image was provided |
| `rag_enabled` | bool | RAG is active |
| `lora` | dict | LoRA adapter info |

---

## 15. Conclusion & Next Steps

This notebook demonstrates all major features of the **LatentMAS + S-LoRA** system via the RunPod API:

### ✅ What We Covered:
1. ✨ **Basic chat** with multi-agent reasoning
2. 🖼️ **Vision Language Model** for image analysis
3. 📚 **RAG** for document-aware responses
4. 💬 **Conversation continuity** with sessions
5. 🔧 **LoRA adapters** for domain specialization
6. 🎛️ **Custom system prompts** for behavior control
7. 🔄 **Batch processing** for efficiency
8. 🛡️ **Error handling** best practices

### 🚀 Next Steps:
- Share this notebook with your GitHub team
- Experiment with different LoRA adapters for your use case
- Build custom RAG pipelines with your domain data
- Integrate into your production applications
- Monitor costs and optimize token usage

### 📚 Resources:
- **GitHub**: [latent_mas_slora](https://github.com/Arifuzzamanjoy/latent_mas_slora)
- **Docs**: See `docs/` folder in repo
- **Docker Image**: `s1710374103/latent-mas-slora:latest`
- **RunPod**: https://runpod.io

---

**Questions or Issues?**
- Check the [README.md](../README.md)
- Review [ARCHITECTURE.md](../docs/ARCHITECTURE.md)
- Open an issue on GitHub

**Happy experimenting! 🎉**