# Streaming Responses - Basics

## Why Streaming?

**Problem:** Traditional API calls wait for the entire response before showing anything.
- A long GPT-4 response can take tens of seconds to generate
- User sees nothing until complete
- Poor user experience

**Solution:** Streaming displays tokens as they're generated!
- The first tokens appear almost immediately
- Shows progress in real time
- Better perceived performance
- Like ChatGPT's interface

In [None]:
from openai import OpenAI
from dotenv import load_dotenv
import time

load_dotenv()
client = OpenAI()

print("✅ Setup complete")

## Example 1: Basic Streaming

Set `stream=True` to enable streaming.

In [None]:
# Non-streaming (traditional)
print("Non-streaming (wait for complete response):")
start = time.time()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Count from 1 to 10 slowly."}],
    stream=False
)

print(response.choices[0].message.content)
print(f"Time: {time.time() - start:.2f}s\n")

# Streaming
print("Streaming (see tokens as generated):")
start = time.time()

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Count from 1 to 10 slowly."}],
    stream=True  # Enable streaming!
)

for chunk in stream:
    # Extract content from chunk
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end='', flush=True)

print(f"\nTime: {time.time() - start:.2f}s")
print("\n(Notice: roughly the same total time, but you see progress immediately!)")

## Stream Chunk Structure

Each chunk contains incremental data.
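
As a rough offline sketch of what those chunks tend to look like, the snippet below builds stand-in objects with `types.SimpleNamespace` instead of calling the API. `fake_chunk` and `delta_text` are hypothetical helpers for illustration only. The field layout (`choices[0].delta.content`, `finish_reason`) mirrors what the next cell prints, while the exact sequence (a role-only first chunk, content chunks, then a final chunk carrying `finish_reason`) is an assumption that may vary by model and SDK version.

```python
from types import SimpleNamespace

def fake_chunk(content=None, role=None, finish_reason=None):
    """Build a stand-in shaped like a streamed chat chunk (illustrative only)."""
    delta = SimpleNamespace(role=role, content=content)
    choice = SimpleNamespace(delta=delta, finish_reason=finish_reason)
    return SimpleNamespace(id="chatcmpl-demo", choices=[choice])

def delta_text(chunk):
    """Return the text carried by a chunk, or '' if it carries none."""
    if not chunk.choices:  # some chunks may arrive with no choices at all
        return ""
    return chunk.choices[0].delta.content or ""

# Assumed typical sequence: role first, then text pieces, then a finish marker.
chunks = [
    fake_chunk(role="assistant", content=""),
    fake_chunk(content="Hello"),
    fake_chunk(content=" World"),
    fake_chunk(finish_reason="stop"),
]

text = "".join(delta_text(c) for c in chunks)
last_reason = chunks[-1].choices[0].finish_reason
print(repr(text), last_reason)
```

Treating a missing or `None` delta as an empty string keeps the joining logic free of special cases.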

In [None]:
print("Inspecting stream chunks:\n")

stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say 'Hello World'"}],
    stream=True
)

for i, chunk in enumerate(stream):
    print(f"Chunk {i}:")
    print(f"  ID: {chunk.id}")
    print(f"  Delta: {chunk.choices[0].delta}")
    print(f"  Finish Reason: {chunk.choices[0].finish_reason}")
    print()
    
    if i >= 5:  # Limit output for demo
        print("... (remaining chunks omitted)")
        # Consume rest of stream
        for _ in stream:
            pass
        break

## Building a Complete Response

Collect chunks to build the full message.

In [None]:
def stream_completion(messages, model="gpt-3.5-turbo"):
    """
    Stream a completion and return the full response.
    
    Returns:
        Tuple of (full_content, chunk_count)
    """
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True
    )
    
    full_content = ""
    chunk_count = 0
    
    for chunk in stream:
        chunk_count += 1
        
        # Extract delta content
        delta = chunk.choices[0].delta.content
        
        if delta is not None:
            full_content += delta
            print(delta, end='', flush=True)
    
    print()  # Newline at end
    return full_content, chunk_count

# Test it
messages = [{"role": "user", "content": "Write a haiku about coding."}]
content, chunks = stream_completion(messages)

print(f"\n📊 Received {chunks} chunks")
print(f"📝 Total content length: {len(content)} characters")

## Token Counting with Streaming

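**Challenge:** By default, streamed chunks don't include token counts!

**Solutions:**
1. Count tokens manually with tiktoken (shown below)
2. Request usage in the stream by passing `stream_options={"include_usage": True}`; the final chunk then carries a `usage` object (needs a recent API/SDK version)
3. Make a separate non-streaming call with max_tokens=1 to read the prompt-token usage (a hack: it costs an extra request and says nothing about completion tokens)
4. Estimate based on content length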

In [None]:
import tiktoken

def stream_with_token_count(messages, model="gpt-3.5-turbo"):
    """
    Stream response and count tokens manually.
    """
    encoding = tiktoken.encoding_for_model(model)
    
    # Count input tokens (approximate: ignores the few extra tokens of per-message formatting)
    input_tokens = sum(
        len(encoding.encode(msg['content'])) 
        for msg in messages
    )
    
    # Stream response
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True
    )
    
    full_content = ""
    
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            full_content += delta
            print(delta, end='', flush=True)
    
    print()
    
    # Count output tokens
    output_tokens = len(encoding.encode(full_content))
    total_tokens = input_tokens + output_tokens
    
    return {
        'content': full_content,
        'prompt_tokens': input_tokens,
        'completion_tokens': output_tokens,
        'total_tokens': total_tokens
    }

# Test it
messages = [{"role": "user", "content": "Explain recursion briefly."}]
result = stream_with_token_count(messages)

print("\n📊 Token Usage:")
print(f"  Input: {result['prompt_tokens']}")
print(f"  Output: {result['completion_tokens']}")
print(f"  Total: {result['total_tokens']}")

## Error Handling in Streams

Streams can fail mid-generation!
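
One way to keep whatever text arrived before a failure is to accumulate outside the `try` block and return the partial text together with the error. The sketch below simulates a mid-stream failure with a generator so it runs offline. `consume_stream` and `flaky_stream` are hypothetical names, and a real application would likely catch the SDK's specific exception types rather than a bare `Exception`.

```python
from types import SimpleNamespace

def consume_stream(stream):
    """Return (text_so_far, error); error is None if the stream finished cleanly."""
    parts = []
    try:
        for chunk in stream:
            if not chunk.choices:
                continue
            piece = chunk.choices[0].delta.content
            if piece:
                parts.append(piece)
    except Exception as exc:  # broad on purpose for the demo
        return "".join(parts), exc
    return "".join(parts), None

def flaky_stream():
    """Simulated stream that drops the connection after two chunks."""
    for word in ["Why did ", "the chicken "]:
        delta = SimpleNamespace(content=word)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
    raise ConnectionError("simulated network drop")

text, error = consume_stream(flaky_stream())
print(repr(text), "| error:", error)
```

Returning the partial text lets the caller decide whether to show it, retry, or discard it.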

In [None]:
def safe_stream(messages, model="gpt-3.5-turbo"):
    """
    Stream with proper error handling.
    """
    try:
        stream = client.chat.completions.create(
            model=model,
            messages=messages,
            stream=True
        )
        
        full_content = ""
        
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            
            if delta:
                full_content += delta
                print(delta, end='', flush=True)
            
            # Check for finish
            finish_reason = chunk.choices[0].finish_reason
            if finish_reason:
                print(f"\n\n[Finished: {finish_reason}]")
        
        return full_content
        
    except Exception as e:
        print(f"\n\n❌ Stream error: {e}")
        return None

# Test it
messages = [{"role": "user", "content": "Tell me a joke."}]
result = safe_stream(messages)

if result:
    print(f"\n✅ Successfully received {len(result)} characters")

## Streaming with Temperature

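Temperature changes how tokens are sampled, not how they are streamed. At a higher temperature the model picks less likely tokens more often, so you can watch the output become more varied (and sometimes less coherent) as it streams.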
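
To make the effect of temperature concrete, the sketch below applies a temperature-scaled softmax to a handful of made-up logits. This is the textbook formulation, not a description of the API's internal sampler, and `softmax_with_temperature` is an illustrative helper.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by temperature (> 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

for t in (0.2, 1.0, 1.8):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Lower temperatures concentrate probability on the top-scoring token, while higher ones spread it across the alternatives.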

In [None]:
messages = [{"role": "user", "content": "Tell a very short story."}]

print("Temperature 0 (near-deterministic):")
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
    temperature=0,
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)

print("\n\nTemperature 1.8 (creative):")
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
    temperature=1.8,
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)

print("\n\n(Notice how temperature affects what you see being generated!)")

## Best Practices

### ✅ DO:
- Use streaming for user-facing applications
- Display tokens immediately for better UX
- Handle errors gracefully (stream can fail mid-response)
- Track token usage yourself (e.g., count with tiktoken), since chunks don't include it by default
- Show finish_reason to user if needed

### ❌ DON'T:
- Use streaming if you need the full response before processing
- Expect token counts in stream chunks by default (usage arrives only if requested with `stream_options={"include_usage": True}`)
- Forget to flush output buffers (use flush=True)
- Ignore finish_reason (could be 'length' or 'content_filter')
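
As a small sketch of acting on `finish_reason`, the helper below maps the value from the last chunk to a note you might show the user. `describe_finish` is a hypothetical name and the message wording is illustrative. `stop`, `length`, and `content_filter` are values the API reports, and anything else is passed through as-is.

```python
def describe_finish(finish_reason):
    """Map a finish_reason value to a short user-facing note (None means nothing to report)."""
    if finish_reason is None:
        return "Stream ended without a finish reason (possibly interrupted)."
    notes = {
        "stop": None,  # normal completion
        "length": "Response was cut off at the token limit.",
        "content_filter": "Part of the response was withheld by a content filter.",
    }
    # Unknown values fall through to a generic message.
    return notes.get(finish_reason, f"Finished with reason: {finish_reason}")

for reason in ("stop", "length", "content_filter", None):
    print(reason, "->", describe_finish(reason))
```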

## Practice Exercises

1. Build a streaming chat that shows each word as it appears
2. Add a "Stop" button to cancel streaming mid-response
3. Display real-time token count while streaming
4. Create a "typing indicator" animation before first token
5. Build a comparison tool that shows streaming vs non-streaming side-by-side