# Lab 3.3.2: SGLang Deployment with RadixAttention

**Module:** 3.3 - Model Deployment & Inference Engines  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how RadixAttention speeds up inference through prefix caching (29-45% lower latency on workloads with shared prefixes)
- [ ] Deploy SGLang on DGX Spark with optimal configuration
- [ ] Measure and verify prefix cache hit rates
- [ ] Build applications that leverage shared system prompts efficiently

---

## 📚 Prerequisites

- Completed: Lab 3.3.1 (Engine Benchmark)
- Knowledge of: REST APIs, Python async programming
- Access to: a Hugging Face account with the Llama license accepted

---

## 🌍 Real-World Context

**The Problem:** In production chatbots, every user shares the same system prompt:

```
"You are a helpful customer service agent for Acme Corp. You help customers with orders, returns, and product questions. Always be polite and professional..."
```

**The Waste:** Traditional inference engines recompute this system prompt for EVERY request. If you have 1000 users, you compute the same 500 tokens 1000 times!

**SGLang's Solution:** RadixAttention caches the KV (key-value) computations for shared prefixes. Once the system prompt is processed, subsequent users get it "for free".
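
To put rough numbers on this, the sketch below counts prefill tokens with and without prefix reuse. It uses the 500-token system prompt and 1000 requests from the example above; the function name and the 20 tokens of unique user text per request are illustrative assumptions, not measurements.

```python
def prefill_tokens(num_requests: int, prefix_tokens: int, unique_tokens: int, cached: bool) -> int:
    """Count the tokens that must go through prefill across all requests."""
    if cached and num_requests > 0:
        # The shared prefix is computed once; each request then prefills only its unique part.
        return prefix_tokens + num_requests * unique_tokens
    # Without caching, every request recomputes prefix + unique tokens.
    return num_requests * (prefix_tokens + unique_tokens)


without_cache = prefill_tokens(1000, 500, 20, cached=False)
with_cache = prefill_tokens(1000, 500, 20, cached=True)
saved = 1 - with_cache / without_cache

print(f"Without cache: {without_cache:,} prefill tokens")
print(f"With cache:    {with_cache:,} prefill tokens")
print(f"Prefill work avoided: {saved:.1%}")
```

This counts prefill work only. Decoding the response costs the same either way, which is why end-to-end latency improves by less than the prefill savings suggest.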

**Real Impact:**
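- OpenAI's API offers similar prompt caching, which is why keeping static content at the start of a prompt is encouraged
- Deployments with long shared prefixes report latency reductions in the 29-45% range; the actual gain depends on prefix length and cache hit rate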
- Memory efficiency improves since cached prefixes aren't duplicated

---

## 🧒 ELI5: What is RadixAttention?

> **Imagine you're a teacher grading homework...**
>
> Every student's paper starts with the same header:
> - "Name: _____"
> - "Date: _____"
> - "Class: Math 101"
> - "Assignment: Chapter 5 Problems"
>
> **The OLD way:** You read this entire header for EVERY paper, even though it's the same.
>
> **The RadixAttention way:** You read the header ONCE, remember it, and when you see the same header on the next paper, you skip right to the unique part (the actual answers).
>
> The "Radix" in RadixAttention comes from **radix trees** - a special data structure that efficiently stores strings with common prefixes (like how "apple", "application", and "apply" share "appl").
>
> **In AI terms:** SGLang stores the attention keys and values (the KV cache) computed for prompts it has already seen. When a new prompt shares a prefix with a cached one, it reuses those tensors instead of recomputing them.

---

## 📊 When RadixAttention Helps Most

| Scenario | Cache Benefit | Example |
|----------|---------------|---------|
| **Same system prompt** | 🔥🔥🔥 Huge | Chatbots, assistants |
| **Few-shot examples** | 🔥🔥 Large | When you include examples in every prompt |
| **Document QA** | 🔥🔥 Large | Questions about the same document |
| **Continuation** | 🔥 Moderate | Generating more text from same context |
| **Unique prompts** | ❄️ None | Every prompt is different |

---

## Part 1: Setting Up SGLang on DGX Spark

First, let's understand how to deploy SGLang properly on DGX Spark's ARM64 architecture.

In [None]:
# Standard imports
import json
import os
import sys
import time
import subprocess
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Any, Optional
import warnings
warnings.filterwarnings('ignore')

# Third-party imports
import requests
import numpy as np

# Add scripts directory to path
scripts_path = Path("../scripts").resolve()
sys.path.insert(0, str(scripts_path))

print("✅ Imports successful!")
print(f"📁 Scripts path: {scripts_path}")

In [None]:
# Check GPU status - DGX Spark should show ~128GB unified memory
def check_gpu_status():
    """Check GPU availability and memory on DGX Spark."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total,memory.free,memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True
        )
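        if result.returncode == 0 and result.stdout.strip():
            # Parse the first GPU only; multi-GPU hosts print one line per device
            values = [v.strip() for v in result.stdout.strip().splitlines()[0].split(",")]
            name = values[0]
            try:
                total_gb, free_gb, used_gb = (int(v) / 1024 for v in values[1:4])
            except ValueError:
                # Some unified-memory systems may report non-numeric memory fields
                total_gb = free_gb = used_gb = None

            print("🖥️ GPU Status:")
            print(f"   Name: {name}")
            if total_gb is not None:
                print(f"   Memory: {used_gb:.1f}GB used / {total_gb:.1f}GB total")
                print(f"   Free: {free_gb:.1f}GB available")
            else:
                print("   Memory: not reported by nvidia-smi")

            if "GB10" in name or (total_gb is not None and total_gb > 100):
                print("   ✅ DGX Spark detected!")
            return True
    except Exception as e:
        print(f"⚠️ GPU check failed: {e}")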
    return False

check_gpu_status()

### 🔧 Starting SGLang Server

SGLang can be started in several ways on DGX Spark:

**Option 1: Direct installation (recommended for DGX Spark)**
```bash
# SGLang has native ARM64/Blackwell support
# Quote the extra so shells such as zsh don't expand the brackets
pip install "sglang[all]"

# Start the server
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --port 30000 \
    --dtype bfloat16 \
    --mem-fraction-static 0.85
```

**Option 2: Using NGC container**
```bash
docker run --gpus all -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -e HF_TOKEN=$HF_TOKEN \
    --ipc=host \
    nvcr.io/nvidia/pytorch:25.11-py3 \
    bash -c "pip install 'sglang[all]' && \
            python -m sglang.launch_server \
            --model-path meta-llama/Llama-3.1-8B-Instruct \
            --port 30000 \
            --dtype bfloat16"
```

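**Key flags for DGX Spark:**
- `--dtype bfloat16`: Native Blackwell support
- `--mem-fraction-static 0.85`: Fraction of GPU memory reserved for static allocations (model weights plus the KV cache pool)
- Prefix caching (RadixAttention) is enabled by default and needs no flag; pass `--disable-radix-cache` to turn it off. Note that `--enable-prefix-caching` is a vLLM option, not an SGLang one.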
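
If you prefer to launch the server from Python (for example, to run a baseline with prefix reuse turned off), the command line can be assembled programmatically. This is a minimal sketch: `build_sglang_command` is a hypothetical helper, not part of SGLang, and `--disable-radix-cache` is the SGLang flag for switching RadixAttention off. Flags can change between releases, so confirm them with `python -m sglang.launch_server --help` on your installed version.

```python
import shlex
from typing import List


def build_sglang_command(
    model_path: str = "meta-llama/Llama-3.1-8B-Instruct",
    port: int = 30000,
    dtype: str = "bfloat16",
    mem_fraction_static: float = 0.85,
    disable_radix_cache: bool = False,
) -> List[str]:
    """Assemble the launch_server argv (hypothetical helper for this lab)."""
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--port", str(port),
        "--dtype", dtype,
        "--mem-fraction-static", str(mem_fraction_static),
    ]
    if disable_radix_cache:
        # Baseline run: same server, no prefix reuse
        cmd.append("--disable-radix-cache")
    return cmd


# Print the command so it can be pasted into a terminal
print(shlex.join(build_sglang_command()))
```

The returned list can be passed directly to `subprocess.Popen` if you want the notebook to own the server process.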

In [None]:
# Check if SGLang server is running
SGLANG_URL = "http://localhost:30000"

def check_sglang_server(url: str = SGLANG_URL) -> bool:
    """Check if SGLang server is running and accessible."""
    try:
        response = requests.get(f"{url}/v1/models", timeout=5)
        if response.status_code == 200:
            models = response.json().get("data", [])
            print(f"✅ SGLang server is running at {url}")
            if models:
                for model in models:
                    print(f"   Model: {model.get('id', 'unknown')}")
            return True
    except requests.exceptions.ConnectionError:
        print(f"❌ SGLang server not running at {url}")
        print("\n📝 To start SGLang, run in a separate terminal:")
        print("   python -m sglang.launch_server \\")
        print("       --model-path meta-llama/Llama-3.1-8B-Instruct \\")
        print("       --port 30000 \\")
        print("       --dtype bfloat16")
    except Exception as e:
        print(f"❌ Error checking SGLang: {e}")
    return False

sglang_available = check_sglang_server()

---

## Part 2: Understanding Prefix Caching with RadixAttention

Let's visualize how RadixAttention works by comparing requests with and without shared prefixes.

In [None]:
# Define our test scenario: customer service chatbot
SYSTEM_PROMPT = """
You are a helpful customer service assistant for TechCorp Inc.
You help customers with:
- Product information and specifications
- Order status and tracking
- Returns and refunds
- Technical support

Always be polite, professional, and concise.
If you don't know something, say so and offer to connect them with a specialist.
"""

# Different user questions that all share the same system prompt
USER_QUESTIONS = [
    "Where is my order #12345?",
    "How do I return a product?",
    "What's the warranty on the X500 laptop?",
    "My device won't turn on, what should I do?",
    "Can I change my shipping address?",
    "What payment methods do you accept?",
    "Is the Y200 compatible with Mac?",
    "How long does shipping take?",
]

print(f"📝 System prompt length: {len(SYSTEM_PROMPT)} characters")
print(f"❓ Number of test questions: {len(USER_QUESTIONS)}")
print("\n🔑 Key insight: All questions share the SAME system prompt!")
print("   With RadixAttention, we only compute the system prompt ONCE.")

In [None]:
def send_chat_request(
    url: str,
    system_prompt: str,
    user_message: str,
    max_tokens: int = 100,
    temperature: float = 0.7
) -> Dict[str, Any]:
    """
    Send a chat request and measure timing.
    
    Returns dict with:
        - response: The generated text
        - ttft: Time to first token (seconds)
        - total_time: Total request time (seconds)
        - tokens: Number of tokens generated
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ]
    
    start_time = time.perf_counter()
    first_token_time = None
    chunks = []
    
    try:
        # Use streaming to measure TTFT
        response = requests.post(
            f"{url}/v1/chat/completions",
            json={
                "model": "default",  # SGLang accepts this placeholder; vLLM expects the served model name
                "messages": messages,
                "max_tokens": max_tokens,
                "temperature": temperature,
                "stream": True
            },
            stream=True,
            timeout=60
        )
        response.raise_for_status()
        
        for line in response.iter_lines():
            if line:
                line_str = line.decode("utf-8")
                if line_str.startswith("data: "):
                    data_str = line_str[6:]
                    if data_str.strip() == "[DONE]":
                        break
                    try:
                        chunk = json.loads(data_str)
                        delta = chunk.get("choices", [{}])[0].get("delta", {})
                        content = delta.get("content", "")
                        if content:
                            if first_token_time is None:
                                first_token_time = time.perf_counter()
                            chunks.append(content)
                    except json.JSONDecodeError:
                        continue
        
        end_time = time.perf_counter()
        total_time = end_time - start_time
        ttft = (first_token_time - start_time) if first_token_time else total_time
        
        return {
            "response": "".join(chunks),
            "ttft": ttft,
            "total_time": total_time,
            "tokens": len(chunks),  # Approximate
            "success": True
        }
        
    except Exception as e:
        return {
            "response": "",
            "ttft": 0,
            "total_time": time.perf_counter() - start_time,
            "tokens": 0,
            "success": False,
            "error": str(e)
        }

### 🧪 Experiment: Measuring Prefix Cache Effect

We'll send multiple requests with the same system prompt and observe how TTFT improves after the first request (when the prefix gets cached).

In [None]:
def benchmark_prefix_caching(url: str, system_prompt: str, questions: List[str]) -> Dict[str, Any]:
    """
    Benchmark prefix caching by sending multiple requests with the same system prompt.
    
    The first request should be slower (cache miss).
    Subsequent requests should be faster (cache hit).
    """
    results = []
    
    print("\n🚀 Benchmarking prefix caching...")
    print("="*60)
    
    for i, question in enumerate(questions):
        print(f"\n[{i+1}/{len(questions)}] {question[:50]}...")
        
        result = send_chat_request(
            url=url,
            system_prompt=system_prompt,
            user_message=question,
            max_tokens=100
        )
        
        results.append({
            "question": question,
            "is_first": i == 0,
            **result
        })
        
        if result["success"]:
            cache_status = "❄️ Cold (cache miss)" if i == 0 else "🔥 Warm (cache hit)"
            print(f"   {cache_status}")
            print(f"   TTFT: {result['ttft']*1000:.1f}ms | Total: {result['total_time']*1000:.1f}ms")
        else:
            print(f"   ❌ Error: {result.get('error', 'Unknown')}")
    
    return analyze_prefix_cache_results(results)


def analyze_prefix_cache_results(results: List[Dict]) -> Dict[str, Any]:
    """Analyze prefix caching benchmark results."""
    successful = [r for r in results if r["success"]]
    
    if len(successful) < 2:
        return {"error": "Not enough successful requests to analyze"}
    
    # First request (cold cache)
    cold = successful[0]
    
    # Subsequent requests (warm cache)
    warm = successful[1:]
    
    cold_ttft = cold["ttft"] * 1000  # ms
    warm_ttfts = [r["ttft"] * 1000 for r in warm]
    avg_warm_ttft = np.mean(warm_ttfts)
    
    speedup = cold_ttft / avg_warm_ttft if avg_warm_ttft > 0 else 0
    reduction = (cold_ttft - avg_warm_ttft) / cold_ttft * 100 if cold_ttft > 0 else 0
    
    print("\n" + "="*60)
    print("📊 PREFIX CACHING ANALYSIS")
    print("="*60)
    print(f"\n   Cold cache TTFT (first request): {cold_ttft:.1f}ms")
    print(f"   Warm cache TTFT (avg of {len(warm)}): {avg_warm_ttft:.1f}ms")
    print(f"\n   🚀 Speedup: {speedup:.2f}x")
    print(f"   📉 Latency reduction: {reduction:.1f}%")
    
    if reduction > 20:
        print("\n   ✅ RadixAttention is working! Prefix caching effective.")
    else:
        print("\n   ⚠️ Low reduction. Possible reasons:")
        print("      - Prefix caching not enabled")
        print("      - System prompt too short to benefit")
        print("      - Server under heavy load")
    
    return {
        "cold_ttft_ms": cold_ttft,
        "warm_ttft_ms": avg_warm_ttft,
        "speedup": speedup,
        "reduction_percent": reduction,
        "num_requests": len(successful)
    }

In [None]:
# Run the benchmark if SGLang is available
if sglang_available:
    prefix_results = benchmark_prefix_caching(
        url=SGLANG_URL,
        system_prompt=SYSTEM_PROMPT,
        questions=USER_QUESTIONS
    )
else:
    print("⚠️ SGLang not available. Start the server to run this benchmark.")
    print("\n📊 Expected results with RadixAttention:")
    print("   - First request (cold): ~200-400ms TTFT")
    print("   - Subsequent (warm): ~100-200ms TTFT")
    print("   - Speedup: 1.5-2.5x on shared prefixes")

### 🔍 What Just Happened?

When we sent multiple requests with the same system prompt:

1. **First request (cold cache):**
   - SGLang computes the full KV cache for the system prompt
   - Stores the KV cache in the radix tree, keyed by the prompt's token sequence
   - Higher TTFT because we're doing full prefill

2. **Subsequent requests (warm cache):**
   - SGLang looks up the request's tokens in the radix tree
   - Finds a matching prefix → Cache hit!
   - Reuses the stored KV cache and computes only the tokens after the match (the new user message)
   - Lower TTFT because we skip re-computing the system prompt
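
The two steps above can be sketched with a toy prefix tree. This is an illustration only: it is a plain token-level trie rather than a compressed radix tree, it stores no KV tensors, and the class names and token IDs are made up for the example.

```python
from typing import Dict, List


class PrefixNode:
    """One node per token; a real radix tree merges single-child chains into one edge."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


class ToyPrefixCache:
    """Tracks which token prefixes have been seen (stand-in for cached KV entries)."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched += 1
        return matched

    def insert(self, tokens: List[int]) -> None:
        """Record a full token sequence after it has been prefilled."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())


cache = ToyPrefixCache()
system = [1, 2, 3, 4, 5]            # stand-in token IDs for the system prompt
request_a = system + [10, 11]       # first user question
request_b = system + [20, 21, 22]   # second user question

for name, req in [("A", request_a), ("B", request_b)]:
    hit = cache.match_prefix(req)
    print(f"Request {name}: reuse {hit} tokens, prefill {len(req) - hit}")
    cache.insert(req)
```

Request A finds nothing cached and prefills all 7 tokens; request B matches the 5 system-prompt tokens and prefills only its 3 new ones.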

---

## Part 3: Advanced Prefix Caching Patterns

Let's explore more sophisticated uses of prefix caching.

In [None]:
# Pattern 1: Few-shot learning with cached examples
FEW_SHOT_PREFIX = """
You are a sentiment analyzer. Classify text as positive, negative, or neutral.

Examples:
Text: "I love this product! Best purchase ever!"
Sentiment: positive

Text: "This is the worst service I've experienced."
Sentiment: negative

Text: "The package arrived on time."
Sentiment: neutral

Text: "Absolutely fantastic! Exceeded all expectations!"
Sentiment: positive

Text: "Terrible quality, broke after one use."
Sentiment: negative

Now classify the following:
"""

SENTIMENT_QUERIES = [
    "The food was okay, nothing special.",
    "I'm so happy with this purchase!",
    "Complete waste of money.",
    "It works as described.",
    "This changed my life for the better!",
]

print(f"📝 Few-shot prefix length: {len(FEW_SHOT_PREFIX)} characters")
print("   This prefix includes 5 few-shot examples")
print(f"   All {len(SENTIMENT_QUERIES)} queries will share this prefix")

In [None]:
# Pattern 2: Document QA with cached document
DOCUMENT_CONTEXT = """
# DGX Spark Technical Specifications

## Overview
DGX Spark is NVIDIA's first personal AI computer, bringing AI supercomputing
capabilities to your desktop.

## Hardware Specifications
- **GPU**: NVIDIA Blackwell GB10 Superchip
- **CPU**: 20 ARM v9.2 cores (10 Cortex-X925 + 10 Cortex-A725)
- **Memory**: 128GB LPDDR5X Unified Memory (shared CPU+GPU)
- **Memory Bandwidth**: 273 GB/s
- **CUDA Cores**: 6,144
- **Tensor Cores**: 192 (5th generation)

## Performance
- 1 PFLOP FP4 (NVFP4 quantization)
- ~209 TFLOPS FP8
- ~100 TFLOPS BF16

## Model Capacity
- FP16 Inference: Up to 50-55B parameters
- FP8 Inference: Up to 90-100B parameters
- NVFP4 Inference: Up to ~200B parameters
- QLoRA Fine-tuning: Up to 100-120B parameters

## Key Features
- Unified memory eliminates CPU-GPU transfers
- Native ARM64 architecture
- Desktop form factor
- ConnectX-7 networking for dual-system configurations

Based on this document, answer the following question:
"""

DOCUMENT_QUESTIONS = [
    "How much memory does DGX Spark have?",
    "What is the FP4 performance?",
    "What's the maximum model size for FP16 inference?",
    "How many CUDA cores does it have?",
    "What CPU architecture is used?",
]

print(f"📄 Document context length: {len(DOCUMENT_CONTEXT)} characters")
print(f"   Questions that reuse this context: {len(DOCUMENT_QUESTIONS)}")

In [None]:
# Run benchmarks on both patterns if SGLang is available
if sglang_available:
    print("\n" + "="*60)
    print("Pattern 1: Few-Shot Learning")
    print("="*60)
    
    few_shot_results = benchmark_prefix_caching(
        url=SGLANG_URL,
        system_prompt=FEW_SHOT_PREFIX,
        questions=SENTIMENT_QUERIES
    )
    
    print("\n" + "="*60)
    print("Pattern 2: Document QA")
    print("="*60)
    
    doc_qa_results = benchmark_prefix_caching(
        url=SGLANG_URL,
        system_prompt=DOCUMENT_CONTEXT,
        questions=DOCUMENT_QUESTIONS
    )
else:
    print("⚠️ SGLang not available. See expected results in solution notebook.")

---

## Part 4: Comparing SGLang with Other Engines

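Let's compare SGLang's prefix caching with vLLM, which also supports automatic prefix caching. The comparison below covers only these two engines; add other OpenAI-compatible endpoints to `ENGINES_TO_COMPARE` to test them as well.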

In [None]:
def compare_engines_prefix_caching(
    engines: Dict[str, str],  # {"engine_name": "url"}
    system_prompt: str,
    questions: List[str]
) -> Dict[str, Dict]:
    """
    Compare prefix caching performance across engines.
    
    Returns dictionary with results for each engine.
    """
    all_results = {}
    
    for engine_name, url in engines.items():
        print(f"\n{'='*60}")
        print(f"Testing: {engine_name}")
        print(f"{'='*60}")
        
        # Check if engine is available
        try:
            response = requests.get(f"{url}/v1/models", timeout=3)
            if response.status_code != 200:
                print(f"⚠️ {engine_name} not responding")
                continue
        except requests.exceptions.RequestException:
            print(f"⚠️ {engine_name} not available at {url}")
            continue
        
        # Run benchmark
        results = []
        for i, question in enumerate(questions):
            result = send_chat_request(
                url=url,
                system_prompt=system_prompt,
                user_message=question,
                max_tokens=100
            )
            results.append(result)
            
            if result["success"]:
                status = "❄️" if i == 0 else "🔥"
                print(f"  {status} [{i+1}] TTFT: {result['ttft']*1000:.1f}ms")
            else:
                print(f"  ❌ [{i+1}] Error")
        
        # Compute stats
        successful = [r for r in results if r["success"]]
        if len(successful) >= 2:
            cold_ttft = successful[0]["ttft"] * 1000
            warm_ttfts = [r["ttft"] * 1000 for r in successful[1:]]
            avg_warm = np.mean(warm_ttfts)
            
            all_results[engine_name] = {
                "cold_ttft_ms": cold_ttft,
                "warm_ttft_ms": avg_warm,
                "speedup": cold_ttft / avg_warm if avg_warm > 0 else 1,
                "reduction_percent": (cold_ttft - avg_warm) / cold_ttft * 100 if cold_ttft > 0 else 0
            }
    
    return all_results

In [None]:
# Define engines to compare
ENGINES_TO_COMPARE = {
    "SGLang": "http://localhost:30000",
    "vLLM": "http://localhost:8000",
}

# Run comparison
comparison_results = compare_engines_prefix_caching(
    engines=ENGINES_TO_COMPARE,
    system_prompt=SYSTEM_PROMPT,
    questions=USER_QUESTIONS[:5]  # Use first 5 for quick comparison
)

# Print comparison table
if comparison_results:
    print("\n" + "="*60)
    print("📊 PREFIX CACHING COMPARISON")
    print("="*60)
    print(f"\n{'Engine':<15} {'Cold TTFT':>12} {'Warm TTFT':>12} {'Speedup':>10} {'Reduction':>12}")
    print("-" * 65)
    
    for engine, stats in comparison_results.items():
        print(f"{engine:<15} {stats['cold_ttft_ms']:>10.1f}ms {stats['warm_ttft_ms']:>10.1f}ms "
              f"{stats['speedup']:>9.2f}x {stats['reduction_percent']:>10.1f}%")
else:
    print("\n⚠️ No engines available for comparison.")
    print("   Start SGLang and/or vLLM servers to run this comparison.")

---

## ⚠️ Common Mistakes

### Mistake 1: Expecting Cache Hits with Different Prefixes

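```python
# ❌ Wrong - Variable content at the START breaks the shared prefix
prompt1 = "Session 1001. You are a helpful assistant. Answer questions."
prompt2 = "Session 1002. You are a helpful assistant. Answer questions."  # Diverges almost immediately

# ✅ Right - Identical static prefix, variable content last
SHARED_PROMPT = "You are a helpful assistant. Answer questions."
# Use SHARED_PROMPT for ALL requests; put per-request details in the user message
```

**Why:** RadixAttention matches token sequences exactly, starting from the first token. Reuse stops at the first token that differs, so a change near the beginning of the prompt forfeits almost all of the cache, while a change at the very end loses only the tail.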
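
Because matching is prefix-based, where two prompts differ matters as much as whether they differ. The sketch below measures the shared prefix of two prompts; it uses whitespace splitting as a crude stand-in for the model's tokenizer, and `shared_prefix_len` is a hypothetical helper written for this example.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Count leading positions where both sequences hold the same token."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Whitespace splitting approximates tokenization for illustration only
base = "You are a helpful assistant. Answer questions.".split()
tail_change = "You are a helpful assistant. Answer questions!".split()
head_change = "Hi! You are a helpful assistant. Answer questions.".split()

print(f"Change at the end:   {shared_prefix_len(base, tail_change)}/{len(base)} tokens reusable")
print(f"Change at the start: {shared_prefix_len(base, head_change)}/{len(base)} tokens reusable")
```

With a real tokenizer the counts differ, but the pattern holds: keep everything static at the front and push anything that varies per request to the end.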

### Mistake 2: Not Warming Up Before Benchmarking

```python
# ❌ Wrong - The first request also pays one-time server warm-up costs
results = benchmark_prefix_caching(SGLANG_URL, SYSTEM_PROMPT, USER_QUESTIONS)  # Cold TTFT is inflated

# ✅ Right - Warm up with an unrelated prompt first
send_chat_request(SGLANG_URL, "Warm-up.", "Hello", max_tokens=8)  # Different prefix, so SYSTEM_PROMPT stays uncached
results = benchmark_prefix_caching(SGLANG_URL, SYSTEM_PROMPT, USER_QUESTIONS)  # Now measuring actual inference
```

### Mistake 3: Using Too Short Prefixes

```python
# ❌ Wrong - Short prefixes have minimal cache benefit
system_prompt = "You are helpful."  # Only ~5 tokens

# ✅ Right - Longer prefixes show more benefit
system_prompt = """
You are a helpful customer service assistant for TechCorp Inc.
You help customers with product information, order status,
returns, refunds, and technical support. Always be polite,
professional, and concise...
"""  # ~100+ tokens - significant cache benefit
```

**Why:** Prefill cost scales with prompt length, so caching a handful of tokens saves very little compared with the fixed per-request overhead.

---

## ✋ Try It Yourself

### Exercise 1: Design a Production Chatbot Prefix

Create an optimized system prompt for a specific use case that maximizes prefix cache benefits.

In [None]:
# Exercise 1: Your code here
# Design a system prompt for a code review assistant

CODE_REVIEW_PROMPT = """
# TODO: Create a comprehensive system prompt for a code review assistant
# Include:
# - What languages you specialize in
# - What aspects of code you review (style, bugs, performance, security)
# - How you format your feedback
# - Example reviews (few-shot)

"""

CODE_REVIEW_QUERIES = [
    "Review this Python function: def add(a, b): return a + b",
    "Review this: for i in range(len(lst)): print(lst[i])",
    "Review this: password = input('password: ')",
]

# Benchmark your prompt
# TODO: Run benchmark_prefix_caching with your prompt

<details>
<summary>💡 Hint</summary>

A good code review system prompt should:
1. Be specific about the reviewer's expertise
2. Include 2-3 example reviews to establish the format
3. End with a clear instruction like "Review the following code:"

This creates a long, reusable prefix that benefits from caching.

</details>

### Exercise 2: Measure Cache Eviction

What happens when you have many different prefixes? Test cache eviction behavior.
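
Before measuring the real server, it can help to have a mental model of eviction. The SGLang paper describes a least-recently-used (LRU) policy for the radix cache; the toy below models that idea at the level of whole prefixes. The class name, the 100-token prefix sizes, and the capacities are assumptions for illustration, and the real cache evicts tree nodes rather than whole prompts.

```python
from collections import OrderedDict


class ToyLRUPrefixCache:
    """Whole-prefix LRU cache with a token budget (simplified model)."""

    def __init__(self, capacity_tokens: int) -> None:
        self.capacity_tokens = capacity_tokens
        self.entries: OrderedDict = OrderedDict()  # prefix -> token count

    def used(self) -> int:
        return sum(self.entries.values())

    def access(self, prefix: str, num_tokens: int) -> bool:
        """Return True on a hit; on a miss, insert and evict least recently used entries."""
        if prefix in self.entries:
            self.entries.move_to_end(prefix)  # mark as most recently used
            return True
        self.entries[prefix] = num_tokens
        while self.used() > self.capacity_tokens and len(self.entries) > 1:
            self.entries.popitem(last=False)  # drop the oldest entry
        return False


cache = ToyLRUPrefixCache(capacity_tokens=300)
for name in ["python", "javascript", "rust", "go", "python"]:
    hit = cache.access(name, num_tokens=100)
    print(f"{name:<12} {'hit' if hit else 'miss'}")
```

With room for only three 100-token prefixes, the fourth prompt pushes the first one out, so returning to it is a miss. Raising `capacity_tokens` to 400 turns that final access into a hit, which is the transition the exercise asks you to find on the real server.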

In [None]:
# Exercise 2: Your code here
# Create multiple different system prompts and see how the cache handles them

DIFFERENT_PROMPTS = [
    "You are a Python expert.",
    "You are a JavaScript expert.",
    "You are a Rust expert.",
    "You are a Go expert.",
    # Add more...
]

# TODO:
# 1. Send requests with each different prompt
# 2. Then go back to the first prompt - is it still cached?
# 3. At what point does cache eviction happen?

---

## 🎉 Checkpoint

You've learned:
- ✅ How RadixAttention works to cache and reuse prefix computations
- ✅ How to deploy SGLang on DGX Spark with optimal settings
- ✅ How to measure and verify prefix cache hit rates
- ✅ Best practices for designing cacheable prompts

---

## 🚀 Challenge (Optional)

**Build a Prefix Cache Monitor**

Create a real-time monitoring dashboard that shows:
1. Cache hit rate over time
2. Memory usage by cached prefixes
3. Most frequently cached prefixes
4. Alert when cache hit rate drops

This would be valuable for production monitoring!
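
As a possible starting point, here is a minimal rolling hit-rate tracker with an alert threshold. It assumes you can obtain, for each request, the number of prompt tokens served from cache and the total prompt tokens; where those numbers come from (server metrics, logs, or response metadata) depends on your setup and is not shown. `HitRateMonitor` and its parameters are hypothetical names.

```python
from collections import deque


class HitRateMonitor:
    """Rolling token-level cache hit rate over the last `window` requests."""

    def __init__(self, window: int = 100, alert_below: float = 0.5) -> None:
        self.events = deque(maxlen=window)  # (cached_tokens, prompt_tokens) pairs
        self.alert_below = alert_below

    def record(self, cached_tokens: int, prompt_tokens: int) -> None:
        self.events.append((cached_tokens, prompt_tokens))

    def hit_rate(self) -> float:
        total = sum(prompt for _, prompt in self.events)
        return sum(cached for cached, _ in self.events) / total if total else 0.0

    def should_alert(self) -> bool:
        # Only alert once the window is full, to avoid noise right after startup
        return len(self.events) == self.events.maxlen and self.hit_rate() < self.alert_below


monitor = HitRateMonitor(window=3, alert_below=0.5)
for cached, prompt in [(80, 100), (0, 100), (10, 100)]:
    monitor.record(cached, prompt)

print(f"Hit rate: {monitor.hit_rate():.0%}, alert: {monitor.should_alert()}")
```

Weighting by tokens rather than by request keeps one long cached document from being averaged away by many short uncached prompts.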

---

## 📖 Further Reading

- [SGLang Paper: RadixAttention](https://arxiv.org/abs/2312.07104)
- [SGLang GitHub Repository](https://github.com/sgl-project/sglang)
- [Efficient Prompt Caching (Anthropic Blog)](https://www.anthropic.com/news/prompt-caching)
- [vLLM Automatic Prefix Caching](https://docs.vllm.ai/en/latest/automatic_prefix_caching/)

---

## 🧹 Cleanup

In [None]:
# Cleanup
import gc

# Clear Python garbage
gc.collect()

# Clear GPU memory cache if torch is available
try:
    import torch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
        print("✅ GPU memory cache cleared!")
except ImportError:
    pass

print("✅ Cleanup complete!")
print("\n📝 Remember: Stop SGLang server when done:")
print("   pkill -f 'sglang.launch_server'")