# 🤖 Anthropic API Compatibility Examples

This notebook demonstrates how to use the Anthropic-compatible API endpoints provided by MLX Omni Server.

## 📋 Overview

MLX Omni Server exposes Anthropic-compatible API endpoints, so you can use the Anthropic Python SDK to interact with locally running MLX models.

### ✨ Key Features
- **Model Listing**: Get available local MLX models
- **Message Conversations**: Support for text generation and chat
- **Thinking Mode**: Support for extended reasoning (Thinking)
- **Tool Calling**: Support for streaming tool calls and function execution
- **Real-time Streaming**: Support for streaming responses and incremental output

### 🚀 Getting Started
Make sure MLX Omni Server is running:
```bash
uv run uvicorn mlx_omni_server.main:app --reload --host 0.0.0.0 --port 10240
```

In [57]:
# Initialize Anthropic client
import anthropic
import json
from pprint import pprint

# Configure client to use local MLX Omni Server
client = anthropic.Anthropic(
    base_url="http://localhost:10240/anthropic",  # Local server endpoint
    api_key="not-needed",                         # API key not required for local server
    auth_token="not-needed"
)

print("✅ Anthropic client configured for MLX Omni Server")
print("🌐 Base URL: http://localhost:10240/anthropic")
print("🔑 API Key: Not required for local usage")

✅ Anthropic client configured for MLX Omni Server
🌐 Base URL: http://localhost:10240/anthropic
🔑 API Key: Not required for local usage


## 📋 Model Management - `/anthropic/v1/models`

List all available MLX models that can be used with the Anthropic-compatible API.

**API Reference**: [Anthropic Models API](https://docs.anthropic.com/en/api/models-list)

### 🔧 Testing with cURL

You can test the models endpoint directly using curl:

```shell
curl http://localhost:10240/anthropic/v1/models \
     --header "x-api-key: not-needed" \
     --header "anthropic-version: 2023-06-01"
```
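
The same request can also be sketched with the Python standard library, which is handy when neither curl nor the SDK is available. This is a minimal sketch, assuming the server listens on `localhost:10240` as configured above and that the response is a JSON object with a `data` array of models; `build_models_request` and `list_model_ids` are hypothetical helper names, and `not-needed` is a placeholder key.

```python
import json
import urllib.request

BASE_URL = "http://localhost:10240/anthropic"  # assumed local server address


def build_models_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same GET request as the curl command, without sending it."""
    return urllib.request.Request(
        f"{base_url}/v1/models",
        headers={
            "x-api-key": "not-needed",  # placeholder; no real key is required locally
            "anthropic-version": "2023-06-01",
        },
        method="GET",
    )


def list_model_ids(base_url: str = BASE_URL, timeout: float = 5.0) -> list[str]:
    """Send the request and return the model IDs (requires a running server)."""
    with urllib.request.urlopen(build_models_request(base_url), timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumption: models are listed under a top-level "data" key
    return [model["id"] for model in payload.get("data", [])]
```

With the server running, `print(list_model_ids())` should print the locally available model IDs.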

### 🐍 Using Python SDK

The Anthropic Python SDK provides a seamless experience for accessing local models:

In [44]:
# Get list of available models
response = client.models.list(limit=20)

print("🎯 Available Models:")
print(f"📊 Total models found: {len(response.data)}")
print("\n🔍 First model details:")
pprint(response.data[0].model_dump() if response.data else "No models available")

🎯 Available Models:
📊 Total models found: 20

🔍 First model details:
{'created_at': datetime.datetime(2025, 8, 1, 1, 53, 32, tzinfo=datetime.timezone.utc),
 'display_name': 'mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit',
 'id': 'mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit',
 'type': 'model'}


In [None]:
# Display complete response structure
print("📋 Complete Models Response:")
print("=" * 50)
pprint(response.model_dump())

print("\n📝 Model Names:")
for i, model in enumerate(response.data):
    print(f"  {i+1}. {model.id}")

## 💬 Message API - `/anthropic/v1/messages`

Create conversations with MLX models using the Anthropic Messages API format.

**API Reference**: [Anthropic Messages API](https://docs.anthropic.com/en/api/messages)

### 🎭 Standard Text Generation

Basic text generation without thinking mode enabled.

In [47]:
# Create a poetry message with system prompt
message = client.messages.create(
    model="mlx-community/gemma-3-1b-it-4bit-DWQ",
    max_tokens=1000,
    temperature=1,
    system="You are a world-class poet. Respond only with short poems.",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Why is the ocean salty?"
                }
            ]
        }
    ]
)

print("🎨 Poetry Response Generated:")
print("=" * 40)
print(f"📝 Model: {message.model}")
print(f"🆔 Message ID: {message.id}")
print(f"🔢 Input tokens: {message.usage.input_tokens}")
print(f"🔢 Output tokens: {message.usage.output_tokens}")
print(f"⚙️ Stop reason: {message.stop_reason}")
print("\n" + "=" * 40)

message

🎨 Poetry Response Generated:
📝 Model: mlx-community/gemma-3-1b-it-4bit-DWQ
🆔 Message ID: msg_c3bd3ae930ab4a30bd0a17fc
🔢 Input tokens: 31
🔢 Output tokens: 52
⚙️ Stop reason: end_turn



Message(id='msg_c3bd3ae930ab4a30bd0a17fc', content=[TextBlock(citations=None, text='The salt of ancient tears,\nWhispers on currents, whispering fears.\nA slow exhale, weight of stone,\nRain spills down, a silver moan.\n\nThe waters trade, a shifting hue,\nWhere minerals dance, forever new. \n', type='text')], model='mlx-community/gemma-3-1b-it-4bit-DWQ', role='assistant', stop_reason='end_turn', stop_sequence=None, type='message', usage=Usage(cache_creation_input_tokens=None, cache_read_input_tokens=None, input_tokens=31, output_tokens=52, server_tool_use=None, service_tier=None))

In [48]:
# Extract and display the poem content
print("📖 Poem Content:")
print("=" * 30)
for block in message.content:
    if block.type == "text":
        print(f"🎭 {block.text}")
print("=" * 30)

📖 Poem Content:
🎭 The salt of ancient tears,
Whispers on currents, whispering fears.
A slow exhale, weight of stone,
Rain spills down, a silver moan.

The waters trade, a shifting hue,
Where minerals dance, forever new. 



In [49]:
# Display first content block details
print("🔍 First Content Block Analysis:")
print("=" * 35)
first_block = message.content[0]
print(f"📋 Type: {first_block.type}")
print(f"📝 Text: {first_block.text}")
print("=" * 35)

🔍 First Content Block Analysis:
📋 Type: text
📝 Text: The salt of ancient tears,
Whispers on currents, whispering fears.
A slow exhale, weight of stone,
Rain spills down, a silver moan.

The waters trade, a shifting hue,
Where minerals dance, forever new. 



In [None]:
# Streaming conversation example
print("🌊 Starting Streaming Conversation:")
print("=" * 40)
print("🤖 Assistant: ", end="", flush=True)

with client.messages.stream(
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello! Tell me a fun fact about space."}],
    model="mlx-community/gemma-3-1b-it-4bit-DWQ",
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

print(f"\n{'='*40}")
print("✅ Streaming completed!")
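
When the streamed text is needed after the loop (for logging or post-processing), the chunks can be accumulated while they are printed. The helper below is a small sketch; `collect_text` is a hypothetical name, and it only assumes that `stream.text_stream` yields plain strings, as the loop above does.

```python
from typing import Iterable, Tuple


def collect_text(chunks: Iterable[str], echo: bool = False) -> Tuple[str, int]:
    """Join streamed text chunks; return the full text and the chunk count."""
    parts = []
    for chunk in chunks:
        if echo:
            # Mirror the incremental printing used in the streaming cell
            print(chunk, end="", flush=True)
        parts.append(chunk)
    return "".join(parts), len(parts)
```

Inside the `with client.messages.stream(...) as stream:` block, this would be called as `text, n_chunks = collect_text(stream.text_stream, echo=True)`.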

### 🧠 Extended Thinking Mode

Enable the model's internal reasoning process with Anthropic's extended thinking feature.

**Documentation**: [Extended Thinking](https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking)

In [None]:
# Select a model that supports thinking mode
thinking_model = "Qwen/Qwen3-0.6B-MLX-4bit"
print(f"🧠 Using thinking model: {thinking_model}")
print("💭 Note: Thinking mode allows the model to show its reasoning process")

In [None]:
# Example: Mathematical reasoning with thinking mode
response = client.messages.create(
    model=thinking_model,
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Budget for thinking tokens (note: not fully implemented)
    },
    messages=[{
        "role": "user",
        "content": "Are there an infinite number of prime numbers such that n mod 4 == 3?"
    }]
)

print("🔢 Mathematical Reasoning Response:")
print("=" * 50)

# Process and display the response blocks
for i, block in enumerate(response.content):
    if block.type == "thinking":
        print(f"💭 Thinking Block {i+1}:")
        print(f"   {block.thinking}")
        print()
    elif block.type == "text":
        print(f"📝 Final Response:")
        print(f"   {block.text}")
        print()

print("=" * 50)
print(f"📊 Usage: {response.usage.input_tokens} input + {response.usage.output_tokens} output tokens")

In [None]:
# Streaming with thinking mode - detailed event monitoring
print("🌊 Streaming with Thinking Mode:")
print("=" * 45)

with client.messages.stream(
    model=thinking_model,
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "What is 27 * 453? Show your calculation steps."}],
) as stream:
    thinking_started = False
    response_started = False

    for event in stream:
        if event.type == "message_start":
            print(f"🚀 Message started: {event.message.id}")
        elif event.type == "content_block_start":
            if event.content_block.type == "thinking":
                print("💭 Thinking process begins...")
                thinking_started = True
            elif event.content_block.type == "text":
                print("📝 Response output begins...")
                response_started = True
        elif event.type == "content_block_delta":
            if hasattr(event.delta, 'text') and response_started:
                print(event.delta.text, end="", flush=True)
        elif event.type == "message_stop":
            print(f"\n✅ Message completed")
            break

print("=" * 45)
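
The loop above prints only the final answer. To capture the reasoning as well, thinking and text deltas can be accumulated separately. The sketch below assumes the server emits `thinking_delta` events with a `thinking` field, as the Anthropic streaming format describes; this has not been verified against MLX Omni Server. `split_stream` is a hypothetical helper, and the demo events are hand-built stand-ins rather than real server output.

```python
from types import SimpleNamespace
from typing import Iterable


def split_stream(events: Iterable) -> dict:
    """Accumulate thinking and answer text separately from stream events."""
    out = {"thinking": "", "text": ""}
    for event in events:
        if event.type != "content_block_delta":
            continue  # start/stop events carry no text
        delta = event.delta
        if delta.type == "thinking_delta":
            out["thinking"] += delta.thinking
        elif delta.type == "text_delta":
            out["text"] += delta.text
    return out


def _delta(kind, **fields):
    """Build a fake content_block_delta event (illustrative only)."""
    return SimpleNamespace(type="content_block_delta",
                           delta=SimpleNamespace(type=kind, **fields))


demo_events = [
    SimpleNamespace(type="message_start"),
    _delta("thinking_delta", thinking="27 * 453 = 27 * 450 + 27 * 3. "),
    _delta("thinking_delta", thinking="12150 + 81 = 12231."),
    _delta("text_delta", text="27 * 453 = 12231"),
    SimpleNamespace(type="message_stop"),
]
result = split_stream(demo_events)
print("Thinking:", result["thinking"])
print("Answer:", result["text"])
```

With a live stream, `split_stream(stream)` can replace the manual loop when incremental printing is not needed.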

### Streaming Tool Calls

This example demonstrates streaming tool calls with the `mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit` model, where tool parameters arrive incrementally as partial JSON fragments.

In [58]:
# Define comprehensive tools for streaming examples
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use"
                }
            },
            "required": ["location"]
        }
    },
    {
        "name": "send_email",
        "description": "Send an email to a specified recipient",
        "input_schema": {
            "type": "object",
            "properties": {
                "to": {
                    "type": "string",
                    "description": "Email address of the recipient"
                },
                "subject": {
                    "type": "string",
                    "description": "Subject line of the email"
                },
                "body": {
                    "type": "string",
                    "description": "Content of the email"
                }
            },
            "required": ["to", "subject", "body"]
        }
    }
]

# Select a larger model for better tool calling performance
tool_model = "mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit"

print("🔧 Tool Configuration:")
print("=" * 35)
print(f"📋 Available tools: {len(tools)}")
for i, tool in enumerate(tools):
    print(f"  {i+1}. {tool['name']}: {tool['description']}")
print(f"\n🤖 Selected model: {tool_model}")
print("📥 Note: Model will be downloaded automatically if not available locally")
print("=" * 35)

🔧 Tool Configuration:
📋 Available tools: 2
  1. get_weather: Get the current weather in a given location
  2. send_email: Send an email to a specified recipient

🤖 Selected model: mlx-community/Qwen3-Coder-30B-A3B-Instruct-4bit
📥 Note: Model will be downloaded automatically if not available locally


#### Basic Tool Call Example

First, let's see a non-streaming tool call to understand the structure:

In [None]:
# Non-streaming tool call example for comparison
print("🔧 Non-Streaming Tool Call Example:")
print("=" * 45)

response = client.messages.create(
    model=tool_model,
    max_tokens=10240,
    tools=tools,
    messages=[
        {
            "role": "user",
            "content": "What's the weather like in San Francisco? Also send an email to john@example.com about the meeting tomorrow."
        }
    ]
)

print(f"📊 Response Statistics:")
print(f"  🆔 Message ID: {response.id}")
print(f"  📈 Input tokens: {response.usage.input_tokens}")
print(f"  📉 Output tokens: {response.usage.output_tokens}")
print(f"  ⛔ Stop reason: {response.stop_reason}")

print(f"\n📋 Content blocks ({len(response.content)}):")
for i, block in enumerate(response.content):
    print(f"\n  Block {i+1}: {block.type}")
    if block.type == "text":
        print(f"    📝 Text: {block.text}")
    elif block.type == "tool_use":
        print(f"    🔧 Tool: {block.name}")
        print(f"    🆔 ID: {block.id}")
        print(f"    📋 Parameters:")
        for key, value in block.input.items():
            print(f"      {key}: {value}")

print("=" * 45)

🔧 Non-Streaming Tool Call Example:
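
A `tool_use` block is only a request: the client runs the tool and sends the result back in a follow-up turn. The sketch below builds that follow-up conversation in the Anthropic Messages format (`tool_result` blocks that reference the `tool_use` ID). The two handler functions return canned data and are purely hypothetical, as is the `build_followup_messages` name; whether the local server accepts `tool_result` blocks in the same way is an assumption.

```python
import json
from types import SimpleNamespace


# Hypothetical local implementations; real code would call a weather API, SMTP, etc.
def get_weather(location, unit="celsius"):
    return {"location": location, "temperature": 18, "unit": unit}  # canned data


def send_email(to, subject, body):
    return {"status": "queued", "to": to}  # nothing is actually sent


HANDLERS = {"get_weather": get_weather, "send_email": send_email}


def build_followup_messages(user_text, content_blocks, handlers=HANDLERS):
    """Run each requested tool and build the messages for the next request."""
    assistant_content, tool_results = [], []
    for block in content_blocks:
        if block.type == "text":
            assistant_content.append({"type": "text", "text": block.text})
        elif block.type == "tool_use":
            # Echo the model's tool request back as part of the assistant turn
            assistant_content.append({
                "type": "tool_use", "id": block.id,
                "name": block.name, "input": block.input,
            })
            handler = handlers.get(block.name)
            if handler is None:
                result, is_error = f"Unknown tool: {block.name}", True
            else:
                result, is_error = json.dumps(handler(**block.input)), False
            tool_results.append({
                "type": "tool_result", "tool_use_id": block.id,
                "content": result, "is_error": is_error,
            })
    return [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_content},
        {"role": "user", "content": tool_results},
    ]


# Hand-built block standing in for an entry of response.content (illustrative only)
demo_block = SimpleNamespace(type="tool_use", id="toolu_demo", name="get_weather",
                             input={"location": "San Francisco, CA"})
messages = build_followup_messages("What's the weather like in San Francisco?", [demo_block])
print(json.dumps(messages[-1], indent=2))
```

With a real response, `build_followup_messages(prompt, response.content)` produces a list that can be passed as `messages=` to another `client.messages.create(...)` call (with the same `tools`) so the model can write its final answer.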


#### Streaming Tool Calls with Fine-Grained Parameter Parsing

Now let's see the streaming version, which shows the tool parameters being built incrementally:

In [53]:
# Advanced streaming tool call example with detailed event logging
print("🌊 Advanced Streaming Tool Call Analysis:")
print("=" * 55)
print("This example shows fine-grained streaming of tool parameters")
print("=" * 55)

with client.messages.stream(
    model=tool_model,
    max_tokens=10240,
    tools=tools,
    messages=[
        {
            "role": "user", 
            "content": "Check the weather in New York City and send an email to alice@company.com with the subject 'Weather Update' and tell her about the weather."
        }
    ]
) as stream:
    
    current_content = ""
    tool_calls = {}
    event_count = 0
    
    for event in stream:
        event_count += 1
        print(f"\n📅 Event #{event_count}: {event.type}")
        
        if event.type == "message_start":
            print(f"   🚀 Message ID: {event.message.id}")
            
        elif event.type == "content_block_start":
            print(f"   🎬 Block {event.index}: {event.content_block.type}")
            if event.content_block.type == "tool_use":
                tool_id = event.content_block.id
                tool_name = event.content_block.name
                tool_calls[tool_id] = {
                    "name": tool_name,
                    "index": event.index,  # content block that owns this tool call
                    "partial_input": "",
                    "final_input": {}
                }
                print(f"   🔧 Tool: {tool_name} (ID: {tool_id[:8]}...)")

        elif event.type == "content_block_delta":
            if event.delta.type == "text_delta":
                current_content += event.delta.text
                print(f"   📝 Text: '{event.delta.text}'")

            elif event.delta.type == "input_json_delta":
                partial_json = event.delta.partial_json
                print(f"   🧩 JSON fragment: '{partial_json}'")

                # Append the fragment only to the tool call that owns this block;
                # updating every buffer would corrupt earlier calls when several tools are used
                for tool_id, tool_info in tool_calls.items():
                    if tool_info["index"] != event.index:
                        continue
                    tool_info["partial_input"] += partial_json
                    print(f"   📊 Buffer for {tool_info['name']}: '{tool_info['partial_input']}'")

                    # Attempt to parse accumulated JSON
                    try:
                        parsed = json.loads(tool_info["partial_input"])
                        tool_info["final_input"] = parsed
                        print(f"   ✅ Parsed successfully: {parsed}")
                    except json.JSONDecodeError:
                        print("   ⏳ JSON incomplete, continuing...")
                        
        elif event.type == "content_block_stop":
            print(f"   🏁 Block {event.index} completed")
            
        elif event.type == "message_stop":
            print(f"   🔚 Message finished")

print(f"\n{'='*55}")
print("📊 FINAL RESULTS:")
print(f"📝 Complete text response: '{current_content}'")
print(f"🔧 Tool calls requested: {len(tool_calls)}")
for tool_id, tool_info in tool_calls.items():
    print(f"  • {tool_info['name']} (ID: {tool_id[:8]}...)")
    print(f"    Parameters: {tool_info['final_input']}")
print("="*55)

🌊 Advanced Streaming Tool Call Analysis:
This example shows fine-grained streaming of tool parameters



#### Simplified Streaming Tool Call Handler

Here's a cleaner example that focuses on just the tool call results:

In [None]:
# Simplified streaming tool call handler for practical use
def stream_with_tools(user_message, show_details=False):
    """Stream a message with tools and display results cleanly."""
    
    print(f"👤 User: {user_message}")
    print("🔄 Streaming response...")
    print("=" * 50)
    
    with client.messages.stream(
        model=tool_model,
        max_tokens=1024,
        tools=tools,
        messages=[{"role": "user", "content": user_message}]
    ) as stream:
        
        text_content = ""
        current_tool = None
        tool_input_buffer = ""
        tool_results = []
        
        for event in stream:
            if event.type == "content_block_start":
                if event.content_block.type == "text":
                    print("🤖 Assistant: ", end="", flush=True)
                elif event.content_block.type == "tool_use":
                    current_tool = {
                        "name": event.content_block.name,
                        "id": event.content_block.id
                    }
                    tool_input_buffer = ""
                    print(f"\n🔧 Invoking tool: {current_tool['name']}")
                    
            elif event.type == "content_block_delta":
                if event.delta.type == "text_delta":
                    text_content += event.delta.text
                    print(event.delta.text, end="", flush=True)
                elif event.delta.type == "input_json_delta":
                    tool_input_buffer += event.delta.partial_json
                    if show_details:
                        print(f"   📝 Building parameters: {tool_input_buffer}", end="\r")
                    
            elif event.type == "content_block_stop":
                if current_tool:
                    try:
                        parsed_input = json.loads(tool_input_buffer)
                        tool_results.append({
                            "name": current_tool["name"],
                            "id": current_tool["id"][:8] + "...",
                            "parameters": parsed_input
                        })
                        print(f"   ✅ Parameters: {parsed_input}")
                    except json.JSONDecodeError:
                        print(f"   ❌ Invalid JSON: {tool_input_buffer}")
                    current_tool = None
                    tool_input_buffer = ""
                else:
                    print()  # End text line
                    
    print("=" * 50)
    print("📊 Summary:")
    if text_content:
        print(f"📝 Text response: {len(text_content)} characters")
    print(f"🔧 Tools called: {len(tool_results)}")
    for tool in tool_results:
        print(f"  • {tool['name']} ({tool['id']}) with {len(tool['parameters'])} parameters")
    print("✅ Stream completed!\n")

# Test with single tool call
stream_with_tools("Get the weather for Tokyo, Japan in celsius")

In [None]:
# Test multiple tool calls in one request with detailed monitoring
stream_with_tools(
    "Send an email to team@company.com about the quarterly results meeting next Friday, and also check the weather in London",
    show_details=True  # Show parameter building process
)

### 📚 Understanding Streaming Tool Call Architecture

The streaming tool call implementation follows the Anthropic specification with several key components:

#### 🔄 Event Flow
1. **`content_block_start`** - Announces a new `tool_use` block with tool name and ID
2. **`content_block_delta`** with `input_json_delta` - Streams partial JSON fragments 
3. **`content_block_stop`** - Signals tool call completion

#### ⚡ Key Benefits
- **Real-time Feedback**: See tool parameters being constructed incrementally
- **Partial JSON Handling**: Fragments are buffered and parsed only once they form valid JSON
- **Multiple Tool Support**: Each tool gets separate content blocks with independent streaming
- **Fine-grained Control**: Parameters stream incrementally rather than all-at-once

#### 🛠️ Technical Implementation
- Uses **HuggingFace tool parser** with Qwen3-inspired incremental JSON building
- Provides **smooth streaming** of tool parameters without blocking
- Supports **complex nested parameters** and multiple simultaneous tool calls
- Includes **error handling** for malformed JSON during streaming

#### 🎯 Use Cases
- **Interactive Applications**: Show users what tools are being called in real-time
- **Debugging**: Monitor tool parameter construction for troubleshooting
- **User Experience**: Provide immediate feedback during long-running tool operations
- **Development**: Understand model behavior during tool selection and parameter generation

This implementation demonstrates MLX Omni Server's capability to provide streaming tool calls compatible with the Anthropic API specification while leveraging local MLX models for inference.