📓 [Open in Colab][1]

[1]: https://colab.research.google.com/drive/1kreAw3ZRqfmi8P9_O6-bv5NFeAxPyT5u?usp=sharing

# **Streaming in LLMs & Agents SDK**

---

## 1. **What is Streaming?**

* Normally, when you ask an LLM a question, the **server generates the entire response first** and then sends it back in one chunk.

* With **streaming**, the response is sent **piece by piece (token by token)** as it’s generated.

* You don’t wait for the full answer → you see it **in real time**.

👉 Think of it like:

* **Non-streaming** = getting a complete email after the sender finishes writing.
* **Streaming** = watching the sender type the email letter by letter in real time.

<br>

---

<br>

## 2. **How Streaming Works**

1. **User sends a request** → “Write me a summary of AI.”
2. The LLM **starts generating tokens** (words, subwords).
3. Instead of waiting, the **server streams tokens immediately** to the client.
4. The client (your app) **renders the output progressively** (like a typing effect).
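
The four steps above can be imitated without any API. This is a minimal sketch, not how a real server is implemented: `generate_tokens` is a hypothetical stand-in for the model, and it splits on spaces instead of using real tokens.

```python
import time
from typing import Iterator


def generate_tokens(text: str) -> Iterator[str]:
    """Simulated model: yield one space-separated chunk at a time."""
    for word in text.split(" "):
        yield word + " "


def respond_non_streaming(text: str) -> str:
    """Collect every chunk first, then hand back the whole response at once."""
    return "".join(generate_tokens(text))


def respond_streaming(text: str, delay: float = 0.0) -> str:
    """Render each chunk as soon as it is produced, then return the full text."""
    parts = []
    for chunk in generate_tokens(text):
        print(chunk, end="", flush=True)  # progressive rendering (typing effect)
        parts.append(chunk)
        time.sleep(delay)  # artificial pause so the effect is visible
    print()
    return "".join(parts)


if __name__ == "__main__":
    respond_streaming("AI is the study of systems that learn from data.", delay=0.05)
```

Both functions end up with the same text; the only difference is when the client gets to see it.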

<br>

---

<br>

## 3. **Streaming in OpenAI Agents SDK**

In the **Agents SDK**, streaming can happen at **two levels**:

1. **Model response streaming**

   * You get the assistant’s text **as it’s being generated**.
   * Useful for chatbots → feels fast and natural.

2. **Event streaming (Agent workflow)**

   * You don’t just see text — you also see **events**:

     * When the agent decides to call a tool.
     * When tool results come back.
     * When the agent continues reasoning.
   * Lets developers **visualize and debug** how the agent is thinking.

👉 This is more advanced than just text streaming.

<br>

---

<br>

## 4. **Why Streaming is Important**

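* ✅ **Faster perceived response** → Users see the first tokens almost immediately instead of waiting (possibly 10s or more) for the full answer. Total generation time stays the same.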
* ✅ **Transparency** → Developers can see intermediate steps (tools used, reasoning).
* ✅ **Interactivity** → You can **interrupt**, **cancel**, or **update UI live**.
* ✅ **Broad applicability** → Useful in chatbots, copilots, dashboards.
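
The **interrupt / cancel** point can be illustrated with a plain Python generator. This is only a sketch under simplifying assumptions: `token_source` and `consume_until` are hypothetical helpers, and with a real API the server may keep generating until the connection is actually closed.

```python
from typing import Iterator, List


def token_source(words: List[str], produced: List[str]) -> Iterator[str]:
    """Simulated model: record each token at the moment it is generated."""
    for word in words:
        produced.append(word)
        yield word


def consume_until(stream: Iterator[str], max_tokens: int) -> List[str]:
    """Read from the stream and stop early once max_tokens have arrived."""
    received = []
    for token in stream:
        received.append(token)
        if len(received) >= max_tokens:
            break  # the client cancels; no further tokens are requested
    return received


produced: List[str] = []
words = ["Streaming", "lets", "you", "stop", "early"]
received = consume_until(token_source(words, produced), max_tokens=2)

print(received)  # ['Streaming', 'lets']
print(produced)  # ['Streaming', 'lets']
```

Because the generator is lazy, the remaining three words are never produced. With a non-streaming call the client would have had to wait for all five before it could decide anything.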

<br>

---

<br>

## 5. **Use Cases of Streaming**

* **Chatbots / Virtual Assistants** → Feels like human typing.
* **Search engines** → Show results progressively.
* **Data-heavy tools** → Stream analysis while still computing.
* **Agents with tools** → Show intermediate reasoning (“Now fetching weather… Done!”).

<br>

---

<br>

## 6. **Exam Cheat-Sheet**

| Aspect            | Non-Streaming 🐢              | Streaming 🚀                          |
| ----------------- | ----------------------------- | ------------------------------------- |
| Response Delivery | After full output ready       | Token by token (real-time)            |
| User Experience   | Slow, delayed                 | Fast, natural                         |
| Visibility        | Only final answer             | Intermediate reasoning + events       |
| Best Use Cases    | Small queries, static answers | Chatbots, copilots, real-time systems |

---

✅ **Summary**:
**Streaming** = delivering model output and agent events in **real time**, instead of waiting for the full response.
In the OpenAI Agents SDK, streaming lets you see not only **partial tokens** but also **agent reasoning and tool calls**, which makes it powerful for building transparent, responsive AI systems.


# **Tokens (Short Explanation)**

* **Tokens** are the **smallest units of text** an LLM (like GPT) reads or writes.
* They are usually **words, parts of words, or even punctuation**.

### Examples:

* `"Hello"` → 1 token
* `"ChatGPT is amazing!"` → might be around 4–5 tokens (e.g. `Chat`, `GPT`, `is`, `amazing`, `!`); the exact split depends on the tokenizer
* `"unbelievable"` → could be split into smaller tokens (`un`, `believable`)

---

## **Why Tokens Matter?**

* LLMs **process text as tokens**, not as full words.
* Cost and limits in the OpenAI API are measured in **tokens**.
* 1 token ≈ 4 characters in English, or about **¾ of a word**.
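
The "1 token ≈ 4 characters" rule of thumb can be turned into a rough estimator. This is a back-of-the-envelope sketch, not a tokenizer: `estimate_tokens` and `fits_in_context` are hypothetical helpers, and real counts vary by model and language.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (English rule of thumb)."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text: str, context_limit: int) -> bool:
    """Check whether the estimated token count stays within a given limit."""
    return estimate_tokens(text) <= context_limit


prompt = "Write me a summary of AI."
print(estimate_tokens(prompt))
print(fits_in_context(prompt, context_limit=8000))
```

For billing or hard limits, count with the model's actual tokenizer; the estimate is only useful for quick sanity checks.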

---

**In short**:
Tokens = **text chunks** that LLMs understand.
They decide **cost, context length, and response size**.


# **Delta** (Short Explanation)

* **Delta** = the **newly generated piece of text** the LLM sends during **streaming**.
* Instead of sending the full response each time, the model streams **deltas (tokens/chunks)** one by one.
* You combine all deltas → final response.

👉 Example:

* Deltas: `"Hel"` → `"lo "` → `"world!"`
* Final response = `"Hello world!"`
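
Combining deltas is plain string concatenation. A minimal sketch using the example above (`accumulate` is a hypothetical helper, not part of any SDK):

```python
from typing import Iterable, Iterator


def accumulate(deltas: Iterable[str]) -> Iterator[str]:
    """Yield the response-so-far after each delta arrives."""
    text = ""
    for delta in deltas:
        text += delta
        yield text


snapshots = list(accumulate(["Hel", "lo ", "world!"]))
print(snapshots)      # ['Hel', 'Hello ', 'Hello world!']
print(snapshots[-1])  # Hello world!
```

A UI would typically redraw with each snapshot, while the last one is what gets stored as the final response.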

---

✅ **In short**:
**Delta = the latest chunk of text from the stream.**


In [27]:
!pip install -Uq openai-agents

In [28]:
import nest_asyncio
nest_asyncio.apply()

In [29]:
from agents import AsyncOpenAI, OpenAIChatCompletionsModel, set_tracing_disabled
from google.colab import userdata

# Read the Gemini key from Colab's secrets panel.
GEMINI_API_KEY = userdata.get("GEMINI_API_KEY")

# Tracing uploads to OpenAI and needs an OpenAI key; only a Gemini key is set here.
set_tracing_disabled(True)

# Point the OpenAI client at Gemini's OpenAI-compatible endpoint.
external_client = AsyncOpenAI(
    api_key=GEMINI_API_KEY,
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

# Wrap the client so agents can call Gemini through the Chat Completions interface.
model = OpenAIChatCompletionsModel(
    model="gemini-2.0-flash",
    openai_client=external_client,
)


In [19]:
import asyncio

from agents import Agent, Runner

async def main():
    agent = Agent(
        name="Joker",
        instructions="You are a helpful assistant",
        model=model,
    )

    # run_streamed returns a RunResultStreaming right away, before the run has finished.
    result = Runner.run_streamed(agent, input="Tell me 5 jokes.")
    print(type(result), result, "\n\n")

    # Print every event unfiltered to see the raw shape of the stream.
    async for event in result.stream_events():
        print(event)

asyncio.run(main())

<class 'agents.result.RunResultStreaming'> RunResultStreaming:
- Current agent: Agent(name="Joker", ...)
- Current turn: 0
- Max turns: 10
- Is complete: False
- Final output (NoneType):
    None
- 0 new item(s)
- 0 raw response(s)
- 0 input guardrail result(s)
- 0 output guardrail result(s)
(See `RunResultStreaming` for more details) 


AgentUpdatedStreamEvent(new_agent=Agent(name='Joker', handoff_description=None, tools=[], mcp_servers=[], mcp_config={}, instructions='You are a helpful assistant', prompt=None, handoffs=[], model=<agents.models.openai_chatcompletions.OpenAIChatCompletionsModel object at 0x7916b5d5cfe0>, model_settings=ModelSettings(temperature=None, top_p=None, frequency_penalty=None, presence_penalty=None, tool_choice=None, parallel_tool_calls=None, truncation=None, max_tokens=None, reasoning=None, verbosity=None, metadata=None, store=None, include_usage=None, response_include=None, top_logprobs=None, extra_query=None, extra_body=None, extra_headers=None, extra_args=



RawResponsesStreamEvent(data=ResponseTextDeltaEvent(content_index=0, delta="  Parallel lines have so much in common.\n    It's a shame they'll never meet.\n\n3.  Why did the scarecrow win an award?", item_id='__fake_id__', logprobs=[], output_index=0, sequence_number=6, type='response.output_text.delta'), type='raw_response_event')
RawResponsesStreamEvent(data=ResponseTextDeltaEvent(content_index=0, delta='\n    Because he was outstanding in his field!\n\n4.  Why did the bicycle fall over?\n    Because it was two tired!\n\n5.  What', item_id='__fake_id__', logprobs=[], output_index=0, sequence_number=7, type='response.output_text.delta'), type='raw_response_event')
RawResponsesStreamEvent(data=ResponseTextDeltaEvent(content_index=0, delta=' do you call a fish with no eyes?\n    Fsh!\n', item_id='__fake_id__', logprobs=[], output_index=0, sequence_number=8, type='response.output_text.delta'), type='raw_response_event')
RawResponsesStreamEvent(data=ResponseContentPartDoneEvent(content_in

## Streaming Text code

In [20]:
import asyncio

from openai.types.responses import ResponseTextDeltaEvent

from agents import Agent, Runner

async def main():
    agent = Agent(
        name="Joker",
        instructions="You are a helpful assistant",
        model=model,
    )

    # Run the agent in streaming mode.
    result = Runner.run_streamed(agent, input="Tell me 5 jokes.")
    async for event in result.stream_events():
        # Keep only raw model events that carry a text delta.
        if event.type == "raw_response_event" and isinstance(event.data, ResponseTextDeltaEvent):
            # end="" keeps deltas on one line; flush=True displays each one immediately (no buffering delay).
            print(event.data.delta, end="", flush=True)

asyncio.run(main())



Alright, here are 5 jokes for you:

1.  Why don't scientists trust atoms?
    Because they make up everything!

2.  Parallel lines have so much in common.
    It's a shame they'll never meet.

3.  Why did the scarecrow win an award?
    Because he was outstanding in his field!

4.  What do you call a lazy kangaroo?
    Pouch potato!

5.  Why did the bicycle fall over?
    Because it was two tired!


## Stream item code

In [31]:
import asyncio
import random

from agents import Agent, ItemHelpers, Runner, function_tool

@function_tool
def how_many_jokes() -> int:
    """Return a random number of jokes to tell (1 to 10)."""
    return random.randint(1, 10)

async def main():
    agent = Agent(
        name="Joker",
        instructions="First call the `how_many_jokes` tool, then tell that many jokes",
        tools=[how_many_jokes],
        model=model,
    )

    result = Runner.run_streamed(agent, input="Hi")

    print("=== Run Starting ===")
    async for event in result.stream_events():
        # Skip raw token-level deltas; only higher-level events are shown here.
        if event.type == "raw_response_event":
            continue

        # Emitted when the current agent is set or changes.
        elif event.type == "agent_updated_stream_event":
            print(f"Agent updated: {event.new_agent.name}")
            continue

        # Emitted when a complete item (tool call, tool output, message) is ready.
        elif event.type == "run_item_stream_event":
            if event.item.type == "tool_call_item":
                print("-- Tool was called")
            elif event.item.type == "tool_call_output_item":
                print(f"-- Tool output: {event.item.output}")
            elif event.item.type == "message_output_item":
                print(f"-- Message output: \n {ItemHelpers.text_message_output(event.item)}")
            else:
                pass  # ignore other item types

asyncio.run(main())

print("=== Run complete ===")



=== Run Starting ===
Agent updated: Joker
-- Tool was called
-- Tool output: 3
-- Message output: 
 OK, I will tell you 3 jokes.

Why don't scientists trust atoms?

Because they make up everything!

 параллелепипед

Why did the scarecrow win an award?

Because he was outstanding in his field!

Why did the bicycle fall over?

Because it was two tired!

=== Run complete ===
