# Day 3: LLM Integration

## Understanding How Qwen-Agent Talks to Language Models

### Today's Learning Objectives:
1. Understand the `BaseChatModel` interface
2. Learn about different model backends (DashScope, OpenAI-compatible, local)
3. Configure LLMs with various parameters
4. Call LLMs directly (without agents)
5. Master streaming responses
6. Understand token counting and context windows

### Prerequisites:
- Completed Day 1 & 2
- Understanding of Messages
- API key configured

### Time Required: 1.5-2 hours

---

## Part 1: The LLM Abstraction Layer

### Why Abstract LLMs?

Qwen-Agent supports many different LLM providers:
- **DashScope** (Alibaba Cloud's Qwen models)
- **OpenAI** (GPT-4, GPT-3.5)
- **vLLM** (Self-hosted high-performance)
- **Ollama** (Local CPU/GPU)
- **Together.AI, Azure, and more**

Without abstraction, you'd need different code for each provider. With `BaseChatModel`, you write once and switch providers by changing configuration!

### Architecture:

```
┌─────────────────────────────────────┐
│        Your Application             │
└──────────────┬──────────────────────┘
               │
               v
┌─────────────────────────────────────┐
│       BaseChatModel Interface       │  ← Uniform API
└──────────────┬──────────────────────┘
               │
     ┌─────────┼─────────┬─────────┐
     v         v         v         v
┌─────────┐ ┌──────┐ ┌──────┐ ┌────────┐
│DashScope│ │OpenAI│ │ vLLM │ │ Ollama │  ← Different backends
└─────────┘ └──────┘ └──────┘ └────────┘
```

### Key Insight:
The agent doesn't care which LLM it talks to - it just sends messages and receives responses through the `BaseChatModel` interface!

## Part 2: BaseChatModel Interface

### Source Code Location:
`/qwen_agent/llm/base.py` (lines 61-100)

### Key Properties:

```python
class BaseChatModel(ABC):
    @property
    def support_multimodal_input(self) -> bool:
        # Can this model accept images/audio/video?
        return False
    
    @property
    def support_multimodal_output(self) -> bool:
        # Can this model generate images/audio/video?
        return False
    
    @property
    def support_audio_input(self) -> bool:
        return False
```

### Key Methods:
- `chat(messages, stream=True/False)` - Main inference method
- `_chat_stream()` - Streaming implementation (abstract)
- `_chat_no_stream()` - Non-streaming implementation (abstract)

Each LLM backend implements these methods!
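
To make the division of labor concrete, here is a minimal sketch of the pattern. It is not Qwen-Agent's actual source: `ToyChatModel` and `EchoModel` are hypothetical names invented for this illustration, and the real `BaseChatModel` does considerably more (pre-processing, function calling, retries).

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Union

Message = Dict[str, str]


class ToyChatModel(ABC):
    """Hypothetical stand-in for BaseChatModel (illustration only)."""

    def chat(self, messages: List[Message], stream: bool = True
             ) -> Union[List[Message], Iterator[List[Message]]]:
        # The public entry point only decides which hook to call.
        if stream:
            return self._chat_stream(messages)
        return self._chat_no_stream(messages)

    @abstractmethod
    def _chat_stream(self, messages: List[Message]) -> Iterator[List[Message]]:
        ...

    @abstractmethod
    def _chat_no_stream(self, messages: List[Message]) -> List[Message]:
        ...


class EchoModel(ToyChatModel):
    """Toy backend: replies with the last message in upper case."""

    def _reply(self, messages: List[Message]) -> str:
        return messages[-1]['content'].upper()

    def _chat_no_stream(self, messages: List[Message]) -> List[Message]:
        return [{'role': 'assistant', 'content': self._reply(messages)}]

    def _chat_stream(self, messages: List[Message]) -> Iterator[List[Message]]:
        words = self._reply(messages).split(' ')
        # Yield progressively longer snapshots, one word at a time.
        for i in range(1, len(words) + 1):
            yield [{'role': 'assistant', 'content': ' '.join(words[:i])}]


msgs = [{'role': 'user', 'content': 'hello there'}]
model = EchoModel()
print(model.chat(msgs, stream=False))
print(list(model.chat(msgs, stream=True)))
```

Callers only ever touch `chat()`, so swapping `EchoModel` for another subclass requires no change on the calling side.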

## Part 3: Getting an LLM Instance

### The `get_chat_model()` Factory Function:

```python
from qwen_agent.llm import get_chat_model

llm = get_chat_model(config_dict)
```

This function:
1. Reads your configuration
2. Determines which backend to use (from `model_type` if given, otherwise inferred from the rest of the configuration)
3. Returns a ready-to-use instance of the appropriate `BaseChatModel` subclass

Let's see it in action:

In [None]:
# ================================================
# FIREWORKS API CONFIGURATION
# ================================================
import os

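# Set API credentials.
# Never hard-code a real key in a notebook: export FIREWORKS_API_KEY in your
# shell before starting Jupyter so the key stays out of version control.
if not os.environ.get('FIREWORKS_API_KEY'):
    raise RuntimeError('Set the FIREWORKS_API_KEY environment variable first')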

# Standard configuration for Fireworks Qwen3-235B-A22B-Thinking
llm_cfg_fireworks = {
    'model': 'accounts/fireworks/models/qwen3-235b-a22b-thinking-2507',
    'model_server': 'https://api.fireworks.ai/inference/v1',
    'api_key': os.environ['FIREWORKS_API_KEY'],
    'generate_cfg': {
        'max_tokens': 32768,
        'temperature': 0.6,
    }
}

# Use this as default llm_cfg
llm_cfg = llm_cfg_fireworks

print('✅ Configured for Fireworks API')
print(f'   Model: Qwen3-235B-A22B-Thinking-2507')
print(f'   Max tokens: 32,768')

✅ Configured for Fireworks API
   Model: Qwen3-235B-A22B-Thinking-2507
   Max tokens: 32,768

In [None]:
from qwen_agent.llm import get_chat_model
import os

# Use Fireworks configuration (from cell 4)
llm_config = llm_cfg

# Get the LLM instance
llm = get_chat_model(llm_config)

print(f"✅ LLM Type: {type(llm)}")
print(f"✅ Model: {llm.model}")
print(f"✅ Supports multimodal input: {llm.support_multimodal_input}")
print(f"✅ Supports multimodal output: {llm.support_multimodal_output}")

✅ LLM Type: <class 'qwen_agent.llm.oai.TextChatAtOAI'>
✅ Model: accounts/fireworks/models/qwen3-235b-a22b-thinking-2507
✅ Supports multimodal input: False
✅ Supports multimodal output: False
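
As a rough mental model of what a factory like `get_chat_model()` does, the sketch below uses a registry that maps backend names to classes. Everything in it is hypothetical (`REGISTRY`, `register`, `toy_get_chat_model`, the `toy_*` names, and the detection heuristic); it illustrates the factory pattern and is not Qwen-Agent's real implementation.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a backend name to a constructor.
REGISTRY: Dict[str, Callable[[dict], object]] = {}


def register(name: str):
    """Decorator that records a backend class under `name`."""
    def wrap(cls):
        REGISTRY[name] = cls
        return cls
    return wrap


@register('toy_openai_compatible')
class ToyOpenAICompatible:
    def __init__(self, cfg: dict):
        self.model = cfg['model']


@register('toy_dashscope')
class ToyDashScope:
    def __init__(self, cfg: dict):
        self.model = cfg['model']


def toy_get_chat_model(cfg: dict):
    """Pick a backend from the config and instantiate it."""
    # An explicit 'model_type' wins; otherwise guess from the config.
    model_type = cfg.get('model_type')
    if model_type is None:
        # Assumed heuristic: a custom HTTP endpoint means OpenAI-compatible.
        has_server = str(cfg.get('model_server', '')).startswith('http')
        model_type = 'toy_openai_compatible' if has_server else 'toy_dashscope'
    if model_type not in REGISTRY:
        raise ValueError(f'Unknown model_type: {model_type!r}')
    return REGISTRY[model_type](cfg)


a = toy_get_chat_model({'model': 'm1', 'model_server': 'http://localhost:8000/v1'})
b = toy_get_chat_model({'model': 'm2'})
print(type(a).__name__, type(b).__name__)
```

This mirrors what the output above shows: a config with a `model_server` URL produced an OpenAI-compatible class (`TextChatAtOAI`) without `model_type` being set.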

## Part 4: Direct LLM Calling (Without Agent)

### The `.chat()` Method:

You can call LLMs directly without creating an agent:

```python
response = llm.chat(
    messages=[...],
    stream=False,          # True for streaming
    functions=[...],       # Optional: for function calling
    extra_generate_cfg={}  # Optional: additional parameters
)
```

In [None]:
# Example 1: Simple non-streaming call
messages = [
    {'role': 'user', 'content': 'What is 15 + 27?'}
]

print("Calling LLM...\n")
response = llm.chat(messages=messages, stream=False)

print("Response:")
for msg in response:
    print(f"Role: {msg['role']}")
    print(f"Content: {msg['content']}")
    print()

Calling LLM...

Response:
Role: assistant
Content: Okay, let's see. I need to add 15 and 27. Hmm, how do I do this? Maybe break it down into tens and ones. 15 is 10 + 5, and 27 is 20 + 7. So if I add the tens first: 10 + 20 is 30. Then the ones: 5 + 7 is 12. Now add those two results together: 30 + 12. That should be 42. Let me check another way. Maybe start with 15 and add 20 first, which is 35, then add 7 more. 35 + 7 is 42. Yeah, that works. Or use a number line: 15 + 27. Start at 15, add 20 to get to 35, then add 7 to reach 42. All methods give the same answer, so I think it's 42.
</think>

To solve $15 + 27$, we can break it down step by step:

1. **Break into tens and ones**:
   - $15 = 10 + 5$
   - $27 = 20 + 7$

2. **Add the tens**:
   - $10 + 20 = 30$

3. **Add the ones**:
   - $5 + 7 = 12$

4. **Combine the results**:
   - $30 + 12 = 42$

**Alternative method**:
- Start with $15$, add $20$ to get $35$, then add $7$:  
  $15 + 20 = 35$, $35 + 7 = 42$.

**Final Answer**:  
$$
\

### Understanding the Response:

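The `.chat()` method returns messages in the same format in both modes:
- Non-streaming: a complete **list of messages**, returned once generation has finished
- Streaming: an iterator that yields a list of messages on every update, each one a snapshot of the response so far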

Usually contains:
- One or more assistant messages
- Possibly function call messages
- Same Message format we learned on Day 2!
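
In the sample output above, the thinking model returns its reasoning, then a `</think>` marker, then the final answer, all inside `content`. A small helper can separate the two parts. This is a sketch that assumes exactly that inline format; `split_thinking` is a hypothetical helper, and other backends may return reasoning in a separate field instead.

```python
from typing import Tuple


def split_thinking(content: str, close_tag: str = '</think>') -> Tuple[str, str]:
    """Return (reasoning, answer) from a reply that may embed reasoning.

    If the closing tag is absent, the whole text is treated as the answer.
    """
    reasoning, sep, answer = content.partition(close_tag)
    if not sep:
        return '', content.strip()
    # Drop an opening tag if the backend included one.
    reasoning = reasoning.replace('<think>', '', 1).strip()
    return reasoning, answer.strip()


sample = "Okay, 10 + 20 is 30 and 5 + 7 is 12.\n</think>\n\nThe answer is 42."
reasoning, answer = split_thinking(sample)
print(answer)
```

With a real response you would call it as `split_thinking(response[-1]['content'])`.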

In [None]:
# Example 2: Multi-turn conversation
messages = [
    {'role': 'user', 'content': 'My favorite color is blue.'},
]

# First turn
response1 = llm.chat(messages=messages, stream=False)
messages.extend(response1)

# Second turn - LLM should remember
messages.append({'role': 'user', 'content': 'What is my favorite color?'})
response2 = llm.chat(messages=messages, stream=False)

print("Full conversation:")
for msg in messages:
    print(f"{msg['role']:10} | {msg['content'][:60]}...")
for msg in response2:
    print(f"{msg['role']:10} | {msg['content'][:60]}...")

Full conversation:
user       | My favorite color is blue....
assistant  | Okay, the user just stated "My favorite color is blue." Hmm,...
user       | What is my favorite color?...
assistant  | Okay, the user just asked "What is my favorite color?" after...

## Part 5: Streaming vs. Non-Streaming Deep Dive

### Non-Streaming (`stream=False`):

```python
response = llm.chat(messages, stream=False)
# response is List[Message] - complete immediately
```

**Pros:**
- Simple to use
- Get complete response at once
- Easier error handling

**Cons:**
- User waits for entire response
- Feels slower
- No early feedback

### Streaming (`stream=True`):

```python
for chunk in llm.chat(messages, stream=True):
    # chunk is List[Message] - progressively complete
    # Process incrementally
```

**Pros:**
- Lower perceived latency
- Better UX (like ChatGPT)
- Can start processing early

**Cons:**
- More complex to handle
- Need to manage state
- Harder error handling

In [None]:
import time

messages = [{'role': 'user', 'content': 'Count from 1 to 5 with explanations for each number.'}]

print("Streaming Response:")
print("="*60)

full_content = ""
start_time = time.time()
first_token_time = None

for chunk in llm.chat(messages=messages, stream=True):
    if chunk and chunk[-1]['role'] == 'assistant':
        content = chunk[-1].get('content', '')
        
        # Track time to first token
        if first_token_time is None and content:
            first_token_time = time.time() - start_time
        
        # Print incrementally
        if content != full_content:
            new_content = content[len(full_content):]
            print(new_content, end='', flush=True)
            full_content = content

total_time = time.time() - start_time

print("\n" + "="*60)
if first_token_time is not None:  # stays None if no content ever arrived
    print(f"\nTime to first token: {first_token_time:.2f}s")
print(f"Total time: {total_time:.2f}s")

Streaming Response:
We are counting from 1 to 5, and for each number we provide a brief explanation.
 Let's do it step by step.
</think>

### Counting from 1 to 5 with Explanations:

1. **One**  
   *The starting point of counting.*  
   Represents a single, indivisible unit. In mathematics, it is the multiplicative identity (any number multiplied by 1 remains unchanged). Philosophically, it symbolizes unity, singularity, or the origin of all numbers.  

2. **Two**  
   *The first even number and the only even prime.*  
   Introduces the concept of duality (e.g., pairs, opposites like light/dark). In binary systems (the foundation of computing), it’s the base of the numeral system (0s and 1s). Also signifies balance, choice, or division.  

3. **Three**  
   *The first odd prime and a symbol of harmony.*  
   Often represents completeness (e.g., past/present/future, birth/life/death). In geometry, it defines the simplest polygon (triangle), which is structurally stable. Many cultures c

### Streaming Pattern:

```python
# Common pattern to get final result from streaming
*_, final_response = llm.chat(messages, stream=True)

# Equivalent to:
for chunk in llm.chat(messages, stream=True):
    final_response = chunk  # Keep updating
# final_response now has complete result
```
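
Because each chunk is a cumulative snapshot rather than a delta, printing incrementally means tracking what has already been shown. The sketch below isolates that bookkeeping in a generator and uses a fake stream so it runs without an API key. `fake_stream` and `iter_deltas` are hypothetical helpers, not Qwen-Agent functions.

```python
from typing import Dict, Iterable, Iterator, List

Message = Dict[str, str]


def fake_stream() -> Iterator[List[Message]]:
    """Stand-in for llm.chat(..., stream=True): cumulative snapshots."""
    for text in ['One', 'One, two', 'One, two, three']:
        yield [{'role': 'assistant', 'content': text}]


def iter_deltas(chunks: Iterable[List[Message]]) -> Iterator[str]:
    """Yield only the newly added text of the last message in each chunk."""
    seen = ''
    for chunk in chunks:
        if not chunk:
            continue
        content = chunk[-1].get('content', '') or ''
        if content.startswith(seen):
            delta = content[len(seen):]
        else:
            # Snapshot was rewritten rather than extended: emit it whole.
            delta = content
        seen = content
        if delta:
            yield delta


deltas = list(iter_deltas(fake_stream()))
print(deltas)
```

Against a real model, `for piece in iter_deltas(llm.chat(messages, stream=True)): print(piece, end='', flush=True)` would print the reply as it arrives.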

## Part 6: LLM Configuration Options

### Basic Configuration:

```python
{
    'model': str,           # Model name (required)
    'model_type': str,      # Backend type (optional, auto-detected)
    'api_key': str,         # API key (optional, uses env var)
    'model_server': str,    # API base URL (for custom endpoints)
    'generate_cfg': dict,   # Generation parameters
}
```

### Generation Parameters (`generate_cfg`):

```python
'generate_cfg': {
    'top_p': float,              # Nucleus sampling (0.0-1.0)
    'temperature': float,        # Randomness (model dependent)
    'max_tokens': int,           # Max output length
    'max_input_tokens': int,     # Max input length (for truncation)
    'fncall_prompt_type': str,   # 'qwen' or 'nous' for function calling
    'enable_thinking': bool,     # Toggle thinking mode on hybrid reasoning models (e.g. Qwen3)
    'use_raw_api': bool,         # Use native API tool calling
}
```
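
The cells that follow repeat the same endpoint settings and vary only `generate_cfg`. One way to cut that repetition is a small helper that copies a base config and overrides generation parameters. `with_generate_cfg` is a hypothetical helper, not part of Qwen-Agent, and the model name below is a placeholder; the result is an ordinary dict that could be passed to `get_chat_model()`.

```python
import copy
from typing import Any, Dict


def with_generate_cfg(base_cfg: Dict[str, Any], **overrides: Any) -> Dict[str, Any]:
    """Return a copy of base_cfg whose 'generate_cfg' has overrides applied."""
    cfg = copy.deepcopy(base_cfg)  # never mutate the caller's dict
    generate_cfg = dict(cfg.get('generate_cfg') or {})
    generate_cfg.update(overrides)
    cfg['generate_cfg'] = generate_cfg
    return cfg


base = {
    'model': 'example-model',  # placeholder name
    'model_server': 'http://localhost:8000/v1',
    'generate_cfg': {'max_tokens': 50},
}
creative = with_generate_cfg(base, top_p=0.9)
focused = with_generate_cfg(base, top_p=0.3, max_tokens=20)
print(creative['generate_cfg'])
print(focused['generate_cfg'])
```

The deep copy matters: without it, two LLM configs derived from the same base would share one `generate_cfg` dict and silently affect each other.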

In [None]:
# Example: Creative vs. Deterministic
# Using Fireworks API with different top_p values

# Configuration 1: Creative (high top_p)
creative_llm = get_chat_model({
    'model': 'accounts/fireworks/models/qwen3-235b-a22b-instruct-2507',
    'model_server': 'https://api.fireworks.ai/inference/v1',
    'api_key': os.environ['FIREWORKS_API_KEY'],
    'generate_cfg': {
        'top_p': 0.9,  # More diverse outputs
        'max_tokens': 50
    }
})

# Configuration 2: Focused (low top_p)
focused_llm = get_chat_model({
    'model': 'accounts/fireworks/models/qwen3-235b-a22b-instruct-2507',
    'model_server': 'https://api.fireworks.ai/inference/v1',
    'api_key': os.environ['FIREWORKS_API_KEY'],
    'generate_cfg': {
        'top_p': 0.3,  # More deterministic
        'max_tokens': 50
    }
})

prompt = [{'role': 'user', 'content': 'Give me a creative name for a pet robot.'}]

print("Creative LLM (top_p=0.9):")
response = creative_llm.chat(messages=prompt, stream=False)
print(response[-1]['content'])

print("\nFocused LLM (top_p=0.3):")
response = focused_llm.chat(messages=prompt, stream=False)
print(response[-1]['content'])

Creative LLM (top_p=0.9):
Sure! How about **"Nimbleton Quirk"**?

It sounds playful and intelligent—like a curious little robot with a dash of personality. "Nimbleton" suggests agility and clever engineering, while "Quirk" hints

Focused LLM (top_p=0.3):
Sure! How about **"Zippy"**?

It’s playful, energetic, and hints at quick movements and smart responsiveness—perfect for a nimble, affectionate pet robot. If you'd like something more futuristic or whimsical, here are

### Top-P (Nucleus Sampling) Explained:

```
Top-P = 0.1 (Focused)
Sample only from the smallest set of top tokens covering 10% of the probability mass
Output: Predictable, focused (often nearly deterministic)
Use for: Math, code, factual Q&A

Top-P = 0.5 (Balanced)
Moderate diversity
Output: Natural, reasonable variety
Use for: General conversation

Top-P = 0.95 (Creative)
Consider many possible tokens
Output: Diverse, creative
Use for: Storytelling, brainstorming
```
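
Nucleus sampling keeps the smallest set of most-likely tokens whose cumulative probability reaches `top_p`, renormalizes, and samples from that set. The sketch below shows the filtering step on a made-up distribution. The token names and probabilities are invented for illustration, and real inference engines apply this to logits over the full vocabulary.

```python
from typing import Dict, List, Tuple


def nucleus(probs: Dict[str, float], top_p: float) -> List[Tuple[str, float]]:
    """Return the smallest high-probability token set whose mass reaches
    top_p, with probabilities renormalized to sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return [(token, p / mass) for token, p in kept]


# Made-up next-token distribution for a pet-robot name.
probs = {'Zippy': 0.5, 'Bolt': 0.25, 'Sparky': 0.125, 'Nimbleton': 0.125}

print(nucleus(probs, 0.3))  # only the single most likely token survives
print(nucleus(probs, 0.9))  # all four tokens remain candidates
```

With `top_p=0.3` the model can only ever pick `Zippy`, while `top_p=0.9` leaves rarer names such as `Nimbleton` in play, which matches the behavior seen in the two outputs above.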

## Part 7: Different Model Backends

### Option 1: DashScope (Recommended for Qwen Models)

In [None]:
# DashScope configuration example (shown for reference only).
# This notebook keeps using the Fireworks configuration defined earlier.

dashscope_config = {
    'model': 'qwen-max-latest',
    'model_type': 'qwen_dashscope',  # Explicitly specify
    # 'api_key': 'sk-xxx',  # Or use DASHSCOPE_API_KEY env var
}

print("DashScope configuration example (requires DashScope API key):")
print(f"  Model: {dashscope_config['model']}")
print(f"  Type: {dashscope_config['model_type']}")
print("\n✅ We're using Fireworks API instead (configured in cell 4)")

DashScope configuration example (requires DashScope API key):
  Model: qwen-max-latest
  Type: qwen_dashscope

✅ We're using Fireworks API instead (configured in cell 4)

### Option 2: OpenAI-Compatible APIs

Works with:
- **OpenAI** (official API)
- **vLLM** (self-hosted)
- **Ollama** (local)
- **Together.AI, Groq, etc.**

In [None]:
# Example: vLLM self-hosted
# (Requires vLLM server running on localhost:8000)
vllm_config = {
    'model': 'Qwen2.5-7B-Instruct',
    'model_server': 'http://localhost:8000/v1',  # OpenAI-compatible endpoint
    'api_key': 'EMPTY',  # vLLM doesn't need real key
}

# Uncomment to test if you have vLLM running:
# vllm_llm = get_chat_model(vllm_config)
# print(f"Using vLLM: {vllm_llm.model}")

print("vLLM config ready (commented out - needs server running)")

vLLM config ready (commented out - needs server running)

In [None]:
# Example: Ollama (local)
# (Requires Ollama installed and running)
ollama_config = {
    'model': 'qwen2.5:7b',
    'model_server': 'http://localhost:11434/v1',
    'api_key': 'EMPTY',
}

# Uncomment to test if you have Ollama:
# ollama_llm = get_chat_model(ollama_config)
# print(f"Using Ollama: {ollama_llm.model}")

print("Ollama config ready (commented out - needs Ollama running)")

Ollama config ready (commented out - needs Ollama running)

### Comparing Backends:

| Backend | Pros | Cons | Best For |
|---------|------|------|----------|
| **DashScope** | Official Qwen, no setup, latest models | Requires internet, costs money | Production, quick start |
| **vLLM** | Fast inference, self-hosted, free | Requires GPU, setup complexity | High throughput production |
| **Ollama** | Easy local setup, CPU support | Slower, limited models | Development, offline work |
| **OpenAI** | GPT-4 access, reliable | Expensive, not Qwen models | Comparison, benchmarking |
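
Since every backend is described by a plain dict, switching between them can be reduced to a lookup. The sketch below collects the example configs from this part and selects one by name. `BACKENDS`, `select_backend`, and the `LLM_BACKEND` environment variable are hypothetical names for this illustration.

```python
import os
from typing import Dict

# Configs copied from the examples above; the keys are arbitrary labels.
BACKENDS: Dict[str, dict] = {
    'dashscope': {'model': 'qwen-max-latest', 'model_type': 'qwen_dashscope'},
    'vllm': {
        'model': 'Qwen2.5-7B-Instruct',
        'model_server': 'http://localhost:8000/v1',
        'api_key': 'EMPTY',
    },
    'ollama': {
        'model': 'qwen2.5:7b',
        'model_server': 'http://localhost:11434/v1',
        'api_key': 'EMPTY',
    },
}


def select_backend(name: str = '') -> dict:
    """Return a copy of the config for `name`, falling back to $LLM_BACKEND."""
    # LLM_BACKEND is a hypothetical environment variable for this sketch.
    name = name or os.environ.get('LLM_BACKEND', 'dashscope')
    if name not in BACKENDS:
        raise KeyError(f'Unknown backend {name!r}; choose from {sorted(BACKENDS)}')
    return dict(BACKENDS[name])


cfg = select_backend('ollama')
print(cfg['model_server'])
# llm = get_chat_model(cfg)  # the same call works for every backend
```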

## Part 8: Advanced Configuration

### Max Input Tokens (Context Window Management):

In [None]:
# Configuration with input limit (using Fireworks)
limited_llm = get_chat_model({
    'model': 'accounts/fireworks/models/qwen3-235b-a22b-instruct-2507',
    'model_server': 'https://api.fireworks.ai/inference/v1',
    'api_key': os.environ['FIREWORKS_API_KEY'],
    'generate_cfg': {
        'max_input_tokens': 2000,  # Truncate if input too long
        'max_tokens': 100
    }
})

# Long conversation that might exceed limit
long_messages = [
    {'role': 'user', 'content': 'Tell me about Python programming. ' * 50},  # Repeated text
]

print("Testing with potentially long input...")
# Input is truncated only if it exceeds max_input_tokens; this repeated
# prompt is likely well under 2000 tokens, so it should pass through unchanged
response = limited_llm.chat(messages=long_messages, stream=False)
print("✅ Response received (input was truncated if needed)")
print(f"Response: {response[-1]['content'][:200]}...")

Testing with potentially long input...
✅ Response received (input was truncated if needed)
Response: Sure! Here's a comprehensive overview of **Python programming**:

---

### 🐍 What is Python?

**Python** is a high-level, interpreted, and general-purpose programming language known for its simplicity...

### Max Output Tokens (Response Length Limit):

In [None]:
# Short responses (using Fireworks)
concise_llm = get_chat_model({
    'model': 'accounts/fireworks/models/qwen3-235b-a22b-instruct-2507',
    'model_server': 'https://api.fireworks.ai/inference/v1',
    'api_key': os.environ['FIREWORKS_API_KEY'],
    'generate_cfg': {
        'max_tokens': 50,  # Short responses only
    }
})

messages = [{'role': 'user', 'content': 'Explain quantum physics in detail.'}]
response = concise_llm.chat(messages=messages, stream=False)

print("Concise response (max 50 tokens):")
print(response[-1]['content'])
print(f"\n✅ Response limited to ~50 tokens despite asking for 'detail'")

Concise response (max 50 tokens):
Quantum physics (also known as quantum mechanics or quantum theory) is a fundamental branch of physics that describes the behavior of matter and energy at the smallest scales—atomic and subatomic levels. It fundamentally differs from classical physics (Newtonian mechanics and electrom

✅ Response limited to ~50 tokens despite asking for 'detail'

## Part 9: Error Handling

### Common Errors:

In [None]:
from qwen_agent.llm.base import ModelServiceError

def safe_chat(llm, messages):
    """Chat with error handling"""
    try:
        response = llm.chat(messages=messages, stream=False)
        return response
    except ModelServiceError as e:
        print(f"Model service error: {e.message}")
        print(f"Error code: {e.code}")
        return None
    except Exception as e:
        print(f"Unexpected error: {type(e).__name__}: {e}")
        return None

# Test with valid request
result = safe_chat(llm, [{'role': 'user', 'content': 'Hi'}])
if result:
    print("Success!")
    print(result[-1]['content'])

Success!
Okay, the user said "Hi". That's a simple greeting. I should respond in a friendly and welcoming way. Let me make sure to keep it open-ended so they feel comfortable sharing what they need. Maybe add an emoji to keep it warm. Let me check if there's anything else they might need right now. Since it's just a greeting, probably not. Just a polite reply. Alright, "Hello! 😊 How can I assist you today?" sounds good.
</think>

Hello! 😊 How can I assist you today?

### Handling API Rate Limits:

In [None]:
import time

from qwen_agent.llm.base import ModelServiceError

def chat_with_retry(llm, messages, max_retries=3):
    """Chat with exponential backoff retry"""
    for attempt in range(max_retries):
        try:
            return llm.chat(messages=messages, stream=False)
        except ModelServiceError as e:
            if 'rate limit' in str(e).lower() and attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Rate limited. Waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise
    return None

# Usage
# result = chat_with_retry(llm, messages)


## Part 10: Token Counting

### Why Count Tokens?

- **Cost management** - APIs charge per token
- **Context limits** - Models have max token limits
- **Performance** - Longer = slower

### Using the Tokenizer:

In [None]:
from qwen_agent.utils.tokenization_qwen import tokenizer

# Example text
text = "Hello, world! This is a test of token counting in Qwen-Agent."

# Encode to tokens
tokens = tokenizer.encode(text)
print(f"Text: {text}")
print(f"Token count: {len(tokens)}")
print(f"Tokens (first 10): {tokens[:10]}")

# Note: The tokenizer doesn't have a decode() method in this version
# Tokens are integer IDs that represent pieces of text
print(f"\n✅ Successfully encoded text to {len(tokens)} tokens")
print(f"✅ Each token is an integer ID in the vocabulary")

Text: Hello, world! This is a test of token counting in Qwen-Agent.
Token count: 16
Tokens (first 10): [9707, 11, 1879, 0, 1096, 374, 264, 1273, 315, 3950]

✅ Successfully encoded text to 16 tokens
✅ Each token is an integer ID in the vocabulary

In [None]:
# Count tokens in a conversation
def count_message_tokens(messages):
    """Estimate token count for messages"""
    total = 0
    for msg in messages:
        # Role
        total += len(tokenizer.encode(msg['role']))
        # Content
        content = msg.get('content', '')
        if isinstance(content, str):
            total += len(tokenizer.encode(content))
        # Overhead (separators, etc.)
        total += 4  # Approximate
    return total

# Test
messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'What is machine learning?'},
    {'role': 'assistant', 'content': 'Machine learning is a subset of AI...'},
]

token_count = count_message_tokens(messages)
print(f"Total tokens in conversation: {token_count}")
print(f"Estimated cost (if $0.001 per 1K tokens): ${token_count * 0.001 / 1000:.6f}")

Total tokens in conversation: 34
Estimated cost (if $0.001 per 1K tokens): $0.000034
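
Token counts are most useful when checked against a budget: prompt tokens plus the tokens reserved for the reply must fit in the context window, and many providers price input and output tokens differently. The sketch below shows that arithmetic. The window size and prices are made-up example values, not properties of any particular model or provider.

```python
def remaining_output_budget(prompt_tokens: int, context_window: int,
                            max_tokens: int) -> int:
    """Tokens available for the reply: capped by max_tokens and by the
    space left in the context window (0 if the prompt already fills it)."""
    space_left = max(context_window - prompt_tokens, 0)
    return min(space_left, max_tokens)


def estimate_cost(prompt_tokens: int, output_tokens: int,
                  usd_per_1k_input: float, usd_per_1k_output: float) -> float:
    """Cost in USD when input and output tokens are priced separately."""
    return (prompt_tokens * usd_per_1k_input
            + output_tokens * usd_per_1k_output) / 1000


# Hypothetical numbers: an 8,192-token window and illustrative prices.
print(remaining_output_budget(prompt_tokens=34, context_window=8192, max_tokens=100))
print(remaining_output_budget(prompt_tokens=8150, context_window=8192, max_tokens=100))
print(f"${estimate_cost(34, 100, usd_per_1k_input=0.001, usd_per_1k_output=0.002):.6f}")
```

In the second call the prompt leaves only 42 tokens of room, so the reply would be cut short even though `max_tokens` allows 100.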

## Part 11: Model Comparison

Let's compare the Instruct model at two temperatures against the Thinking model on the same task:

In [None]:
import time

# Since we're using Fireworks, let's compare different configurations
# of the same model instead of different models

# Configurations to compare
configs = [
    ('Instruct (Standard)', {
        'model': 'accounts/fireworks/models/qwen3-235b-a22b-instruct-2507',
        'generate_cfg': {'max_tokens': 100, 'temperature': 0.7}
    }),
    ('Instruct (Precise)', {
        'model': 'accounts/fireworks/models/qwen3-235b-a22b-instruct-2507',
        'generate_cfg': {'max_tokens': 100, 'temperature': 0.3}
    }),
    ('Thinking Model', {
        'model': 'accounts/fireworks/models/qwen3-235b-a22b-thinking-2507',
        'generate_cfg': {'max_tokens': 150, 'temperature': 0.6}
    }),
]

# Test prompt
prompt = [{'role': 'user', 'content': 'In exactly 20 words, explain what a transformer model is.'}]

print("Model Configuration Comparison:")
print("="*70)

for config_name, config in configs:
    print(f"\n{config_name}")
    print("-"*70)
    
    full_config = {
        **config,
        'model_server': 'https://api.fireworks.ai/inference/v1',
        'api_key': os.environ['FIREWORKS_API_KEY'],
    }
    
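    # Use a separate name so the global `llm` from earlier cells is not overwritten
    candidate_llm = get_chat_model(full_config)
    
    start = time.time()
    response = candidate_llm.chat(messages=prompt, stream=False)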
    elapsed = time.time() - start
    
    content = response[-1]['content']
    word_count = len(content.split())
    
    # Show excerpt
    if len(content) > 150:
        print(f"Response (excerpt): {content[:150]}...")
    else:
        print(f"Response: {content}")
    print(f"Words: {word_count} | Time: {elapsed:.2f}s")

print("\n" + "="*70)
print("Note: Thinking model may show reasoning process, affecting length")

Model Configuration Comparison:

Instruct (Standard)
----------------------------------------------------------------------
Response (excerpt): A transformer model is a neural network using self-attention to process sequential data, enabling efficient, parallelized training for tasks like lang...
Words: 25 | Time: 2.18s

Instruct (Precise)
----------------------------------------------------------------------
Response (excerpt): A transformer model is a neural network using self-attention to process sequential data, enabling efficient parallelization and state-of-the-art perfo...
Words: 23 | Time: 2.22s

Thinking Model
----------------------------------------------------------------------
Response (excerpt): Hmm, the user wants me to explain what a transformer model is in exactly 20 words. That's quite precise! They're probably looking for a concise, techn...
Words: 118 | Time: 2.75s

Note: Thinking model may show reasoning process, affecting length

## Part 12: Practice Exercises

### Exercise 1: Configuration Experimentation
Test different top_p values and observe the creativity.

In [None]:
# TODO: Create LLMs with top_p values of 0.1, 0.5, and 0.9
# Test each with the same creative prompt
# Compare the outputs

# Your code here:
prompt = "Write a creative story opening in one sentence."

# Test with different top_p:
# ...

### Exercise 2: Token Budget Manager
Create a function that truncates messages to fit a token budget.

In [None]:
# TODO: Implement truncate_to_budget(messages, max_tokens)
# Should:
# 1. Count tokens in messages
# 2. Remove oldest messages if over budget
# 3. Always keep system message (if present)
# 4. Return truncated message list

def truncate_to_budget(messages, max_tokens):
    # Your implementation
    pass

# Test:
# long_convo = [...]  # Create long conversation
# truncated = truncate_to_budget(long_convo, 500)
# print(f"Truncated from {len(long_convo)} to {len(truncated)} messages")

### Exercise 3: Streaming Progress Bar
Display a progress bar while streaming.

In [None]:
# TODO: Create a streaming chat with progress indicator
# Show:
# - Time elapsed
# - Tokens received
# - Tokens per second

# Your code here:
# ...

## Part 13: Key Takeaways

### What You Learned Today:

1. **LLM Abstraction**
   - `BaseChatModel` provides uniform interface
   - Switch backends by changing config
   - Same code works with any provider

2. **Getting LLM Instances**
   - Use `get_chat_model(config)`
   - Factory pattern
   - Auto-detects backend

3. **Direct LLM Calling**
   - `.chat(messages, stream=True/False)`
   - Returns list of messages
   - No agent needed

4. **Configuration Options**
   - `top_p` for creativity
   - `max_tokens` for length
   - `max_input_tokens` for context
   - Model-specific options

5. **Backends**
   - DashScope (official Qwen)
   - vLLM (self-hosted)
   - Ollama (local)
   - OpenAI-compatible

6. **Token Management**
   - Count with tokenizer
   - Monitor costs
   - Handle limits

### Common Patterns:

```python
# Pattern 1: Simple call
llm = get_chat_model({'model': 'qwen-max-latest'})
response = llm.chat(messages=[...], stream=False)

# Pattern 2: Streaming
for chunk in llm.chat(messages=[...], stream=True):
    process(chunk)

# Pattern 3: Get final from stream
*_, final = llm.chat(messages=[...], stream=True)

# Pattern 4: With generation config
llm = get_chat_model({
    'model': 'qwen-max-latest',
    'generate_cfg': {'top_p': 0.8}
})
```

## Part 14: Next Steps

### Tomorrow (Day 4): Built-in Tools
We'll explore:
- BaseTool interface
- code_interpreter (Python execution)
- doc_parser (PDF/DOCX parsing)
- web_search (Internet search)
- And more!

### Homework:
1. Test different models (turbo vs. max)
2. Experiment with top_p values
3. Set up vLLM or Ollama (optional)
4. Count tokens in a long conversation
5. Read: `/qwen_agent/llm/base.py`

### Resources:
- [DashScope Models](https://help.aliyun.com/zh/dashscope/developer-reference/model-introduction)
- [vLLM Deployment](https://docs.vllm.ai/)
- [Ollama](https://ollama.ai/)

---

## 🎉 Day 3 Complete!

You now understand:
- ✅ LLM abstraction and `BaseChatModel`
- ✅ Direct LLM calling
- ✅ Configuration options
- ✅ Different backends
- ✅ Streaming responses
- ✅ Token management

Tomorrow we'll start giving our LLMs superpowers with **Tools**! 🛠️