# LangChain Streaming

For LLMs, streaming has become an increasingly popular feature. The idea is to rapidly return tokens as an LLM is generating them, rather than waiting for a full response to be created before returning anything.

Streaming is easy to implement for simple use cases, but it gets more complicated once we include components like Agents, which run their own logic and can block our attempts at streaming. Fortunately, we can make it work; it just requires a little extra effort.

We'll start easy by implementing streaming to the terminal for LLMs, but by the end of the notebook we'll be handling the more complex task of streaming via FastAPI for Agents.

First, let's load our API keys and install the libraries we'll be using.

In [None]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
HUGGINGFACEHUB_API_TOKEN = os.getenv('HUGGINGFACEHUB_API_TOKEN')

# initialize connection to pinecone (get API key at app.pinecone.io)
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"

In [None]:
# !pip install -qU \
#     openai==0.28.0 \
#     langchain==0.0.301 \
#     fastapi==0.103.1 \
#     "uvicorn[standard]"==0.23.2

## LLM Streaming to Stdout

The simplest form of streaming is to simply "print" the tokens as they're generated. To set this up we need to initialize an LLM (one that supports streaming, not all do) with two specific parameters:

* `streaming=True`, to enable streaming
* `callbacks=[SomeCallBackHere()]`, where we pass a LangChain callback class (or list containing multiple).

The `streaming` parameter is self-explanatory. The `callbacks` parameter and callback classes are less so: essentially, they are small pieces of code that run each time our LLM generates a token. As mentioned, the simplest form of streaming is to print the tokens as they're being generated, which is what the `StreamingStdOutCallbackHandler` does.
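To make the callback idea concrete, here is a minimal, LangChain-free sketch. The names `CollectAndPrintHandler`, `fake_llm_stream`, and `run_with_callbacks` are hypothetical stand-ins invented for illustration; only the method name `on_llm_new_token` mirrors the hook that LangChain handlers implement.

```python
from typing import Iterable, Iterator, List


class CollectAndPrintHandler:
    """Hypothetical stand-in for a callback handler (not a LangChain class)."""

    def __init__(self) -> None:
        self.tokens: List[str] = []

    def on_llm_new_token(self, token: str) -> None:
        # Called once per token, as soon as that token is available.
        self.tokens.append(token)
        print(token, end="", flush=True)


def fake_llm_stream(text: str) -> Iterator[str]:
    """Pretend LLM: yields a fixed reply one space-delimited chunk at a time."""
    for word in text.split(" "):
        yield word + " "


def run_with_callbacks(
    tokens: Iterable[str], handlers: List[CollectAndPrintHandler]
) -> str:
    """Push each token to every handler, then return the full text."""
    full: List[str] = []
    for token in tokens:
        for handler in handlers:
            handler.on_llm_new_token(token)
        full.append(token)
    return "".join(full)


handler = CollectAndPrintHandler()
reply = run_with_callbacks(fake_llm_stream("Once upon a time"), [handler])
```

The caller still receives the complete reply at the end, while each handler has already seen every token along the way. That is the same split we rely on when streaming from a real model.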

In [11]:
import os
from langchain_openai import ChatOpenAI
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler


llm = ChatOpenAI(
    openai_api_key=os.getenv("OPENAI_API_KEY"),
    temperature=0.0,
    model_name="gpt-3.5-turbo",
    streaming=True,  # ! important
    callbacks=[StreamingStdOutCallbackHandler()]  # ! important
)

Now if we run the LLM we'll see the response being _streamed_.

In [14]:
from langchain.schema import HumanMessage

# create messages to be passed to chat LLM
messages = [HumanMessage(content="tell me a long story")]

llm(messages)

Once upon a time, in a small village nestled in the mountains, there lived a young girl named Elara. Elara was known throughout the village for her kindness and generosity, always willing to help those in need. She lived with her parents in a cozy cottage at the edge of the forest, where she spent her days exploring the woods and playing with the animals that called it home.

One day, while out on a walk, Elara stumbled upon a wounded deer. The poor creature had been caught in a trap set by hunters and was in desperate need of help. Without hesitation, Elara carefully freed the deer from the trap and tended to its wounds, using herbs and plants she had learned about from her mother, who was a skilled healer.

As the days passed, Elara nursed the deer back to health, and the two formed a deep bond. The deer, whom Elara named Luna, became her constant companion, following her wherever she went and protecting her from harm. Together, they roamed the forest, exploring its hidden wonders an

AIMessage(content="Once upon a time, in a small village nestled in the mountains, there lived a young girl named Elara. Elara was known throughout the village for her kindness and generosity, always willing to help those in need. She lived with her parents in a cozy cottage at the edge of the forest, where she spent her days exploring the woods and playing with the animals that called it home.\n\nOne day, while out on a walk, Elara stumbled upon a wounded deer. The poor creature had been caught in a trap set by hunters and was in desperate need of help. Without hesitation, Elara carefully freed the deer from the trap and tended to its wounds, using herbs and plants she had learned about from her mother, who was a skilled healer.\n\nAs the days passed, Elara nursed the deer back to health, and the two formed a deep bond. The deer, whom Elara named Luna, became her constant companion, following her wherever she went and protecting her from harm. Together, they roamed the forest, explorin

That was surprisingly easy, but things begin to get much more complicated as soon as we begin using agents. Let's first initialize an agent.

In [16]:
from langchain.memory import ConversationBufferWindowMemory
from langchain.agents import load_tools, AgentType, initialize_agent

# initialize conversational memory
memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    k=5,
    return_messages=True,
    output_key="output"
)

# create a single tool to see how it impacts streaming
tools = load_tools(["llm-math"], llm=llm)

# initialize the agent
agent = initialize_agent(
    agent=AgentType.CHAT_CONVERSATIONAL_REACT_DESCRIPTION,
    tools=tools,
    llm=llm,
    memory=memory,
    verbose=True,
    max_iterations=3,
    early_stopping_method="generate",
    return_intermediate_steps=False
)

Because we initialized the agent with the same `llm` we created earlier, it already carries our `StreamingStdOutCallbackHandler`. So let's see what we get when running the agent.

In [18]:
prompt = "Hello, how are you?"

agent(prompt)

> Entering new AgentExecutor chain...
```json
{
    "action": "Final Answer",
    "action_input": "I'm just a computer program, so I don't have feelings, but I'm here and ready to assist you. How can I help you today?"
}
```
```json
{
    "action": "Final Answer",
    "action_input": "I'm just a computer program, so I don't have feelings, but I'm here and ready to assist you. How can I help you today?"
}
```

> Finished chain.


{'input': 'Hello, how are you?',
 'chat_history': [],
 'output': "I'm just a computer program, so I don't have feelings, but I'm here and ready to assist you. How can I help you today?"}

Not bad, but we do now have the issue of streaming the _entire_ output from the LLM. Because we're using an agent, the LLM is instructed to output the JSON format we can see here so that the agent logic can handle tool usage, multiple "thinking" steps, and so on. For example, if we ask a math question we'll see this:

In [21]:
agent("what is the square root of 71?")



> Entering new AgentExecutor chain...
```json
{
    "action": "Calculator",
    "action_input": "square root of 71"
}
```
```json
{
    "action": "Calculator",
    "action_input": "square root of 71"
}
```
```text
71**0.5
```
...numexpr.evaluate("71**0.5")...

Observation: Answer: 8.426149773176359
Thought:
```json
{
    "action": "Final Answer",
    "action_input": "The square root of 71 is approximately 8.426149773176359."
}
```
```json
{
    "action": "Final Answer",
    "action_input": "The square root of 71 is approximately 8.426149773176359."
}
```

> Finished chain.


{'input': 'what is the square root of 71?',
 'chat_history': [HumanMessage(content='Hello, how are you?'),
  AIMessage(content="I'm just a computer program, so I don't have feelings, but I'm here and ready to assist you. How can I help you today?")],
 'output': 'The square root of 71 is approximately 8.426149773176359.'}
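The JSON blobs in the output above are what the agent logic reads to decide its next step: call a tool, or return the final answer. As a rough illustration of that parsing step, here is a small sketch. It is not LangChain's actual output parser; `parse_agent_output` and `FENCE` are hypothetical names, and the sketch assumes the blob is either wrapped in a fenced code block or is bare JSON.

```python
import json
import re
from typing import Tuple

FENCE = "`" * 3  # three backticks, built this way to avoid nesting fences


def parse_agent_output(text: str) -> Tuple[str, str]:
    """Extract (action, action_input) from the LLM's JSON blob."""
    pattern = FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE
    match = re.search(pattern, text, flags=re.DOTALL)
    # Fall back to treating the whole text as JSON if no fence is found.
    blob = match.group(1) if match else text.strip()
    data = json.loads(blob)
    return data["action"], data["action_input"]


raw = (
    FENCE + "json\n"
    '{\n    "action": "Calculator",\n    "action_input": "square root of 71"\n}\n'
    + FENCE
)
action, action_input = parse_agent_output(raw)
print(action, "->", action_input)
```

Only when `action` is `"Final Answer"` is the `action_input` meant for the user, which is why streaming every token verbatim exposes the surrounding JSON.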

It's interesting to see during development, but we'll want to clean this streaming up a little in any actual use case. We can take two approaches: either build a custom callback handler, or use a purpose-built callback handler from LangChain (as usual, LangChain has something for everything). Let's first try LangChain's purpose-built `FinalStreamingStdOutCallbackHandler`.

We will overwrite the existing `callbacks` attribute found here:

In [24]:
agent.agent.llm_chain.llm

ChatOpenAI(callbacks=[<langchain_core.callbacks.streaming_stdout.StreamingStdOutCallbackHandler object at 0x317bfedd0>], client=<openai.resources.chat.completions.Completions object at 0x31903fa10>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x31a1e8410>, temperature=0.0, openai_api_key=SecretStr('**********'), openai_proxy='', streaming=True)

With the new callback handler:

In [27]:
from langchain.callbacks.streaming_stdout_final_only import (
    FinalStreamingStdOutCallbackHandler,
)

agent.agent.llm_chain.llm.callbacks = [
    FinalStreamingStdOutCallbackHandler(
        answer_prefix_tokens=["Final", "Answer"]
    )
]
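Conceptually, a final-only handler watches the token stream for the prefix tokens and starts printing once it has seen them in order. The sketch below is a simplified, LangChain-free illustration of that idea, not the library's implementation. `stream_after_prefix` is a hypothetical helper, and the token boundaries in the example are assumed for illustration (real tokenization will differ).

```python
from typing import Iterable, List


def stream_after_prefix(tokens: Iterable[str], prefix_tokens: List[str]) -> List[str]:
    """Return the tokens that arrive after `prefix_tokens` were seen in order."""
    window: List[str] = []  # the most recent len(prefix_tokens) stripped tokens
    answer_reached = False
    emitted: List[str] = []
    for token in tokens:
        if answer_reached:
            emitted.append(token)
            continue
        window.append(token.strip())
        if len(window) > len(prefix_tokens):
            window.pop(0)
        if window == prefix_tokens:
            answer_reached = True
    return emitted


# Assumed tokenization of an agent's final-answer JSON blob.
simulated = [
    "{", '"', "action", '":', ' "', "Final", " Answer", '",', "\n",
    '"', "action", "_input", '":', ' "', "The", " answer", '"', "}",
]
emitted = stream_after_prefix(simulated, ["Final", "Answer"])
print("".join(emitted))
```

Everything after the prefix is emitted, including the closing quote of the `action` value and the `"action_input"` key, so a short prefix like `["Final", "Answer"]` can be expected to leak some JSON scaffolding into the stream.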

Let's try it:

In [30]:
agent("what is the square root of 71?")



> Entering new AgentExecutor chain...
```json
{
    "action": "Calculator",
    "action_input": "square root of 71"
}
```
Observation: Answer: 8.426149773176359
Thought:",
    "action_input": "The square root of 71 is approximately 8.426149773176359."
}
```
```json
{
    "action": "Final Answer",
    "action_input": "The square root of 71 is approximately 8.426149773176359."
}
```

> Finished chain.


{'input': 'what is the square root of 71?',
 'chat_history': [HumanMessage(content='Hello, how are you?'),
  AIMessage(content="I'm just a computer program, so I don't have feelings, but I'm here and ready to assist you. How can I help you today?"),
  HumanMessage(content='what is the square root of 71?'),
  AIMessage(content='The square root of 71 is approximately 8.426149773176359.')],
 'output': 'The square root of 71 is approximately 8.426149773176359.'}

Not quite there. We should really clean up the `answer_prefix_tokens` argument, but it is hard to get right. It's generally easier to use a custom callback handler like so:

In [33]:
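import sys
from typing import Any

class CallbackHandler(StreamingStdOutCallbackHandler):
    """Print only the `action_input` of the agent's "Final Answer" step."""

    def __init__(self) -> None:
        super().__init__()
        self.content: str = ""  # text seen so far in the current section
        self.final_answer: bool = False  # True once "Final Answer" has appeared

    def on_llm_new_token(self, token: str, **kwargs: Any) -> None:
        self.content += token
        if "Final Answer" in self.content:
            # now we're in the final answer section, but don't print yet
            self.final_answer = True
            self.content = ""
        if self.final_answer:
            # only print once the start of the action_input value has streamed
            if '"action_input": "' in self.content:
                if token not in ["}"]:
                    sys.stdout.write(token)  # equal to `print(token, end="")`
                    sys.stdout.flush()

    def on_llm_end(self, response: Any, **kwargs: Any) -> None:
        # reset state so a later agent call doesn't print intermediate tool inputs
        self.content = ""
        self.final_answer = False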

agent.agent.llm_chain.llm.callbacks = [CallbackHandler()]
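The filtering logic in `on_llm_new_token` can also be exercised offline, without calling the model, by feeding it hand-written tokens. The sketch below copies that logic into a plain class; `FinalAnswerFilter` is a hypothetical name, and the token boundaries are assumed purely for illustration.

```python
from typing import List

FENCE = "`" * 3  # three backticks, built this way to avoid nesting fences


class FinalAnswerFilter:
    """LangChain-free copy of the handler's filtering logic."""

    def __init__(self) -> None:
        self.content = ""
        self.final_answer = False
        self.printed: List[str] = []  # what the handler would write to stdout

    def feed(self, token: str) -> None:
        self.content += token
        if "Final Answer" in self.content:
            self.final_answer = True
            self.content = ""
        if self.final_answer and '"action_input": "' in self.content:
            if token not in ["}"]:
                self.printed.append(token)


# Assumed tokenization of a final-answer blob; real tokens will differ.
simulated = [
    "{\n", '    "', "action", '":', ' "', "Final", " Answer", '",\n',
    '    "', "action", "_input", '":', ' "', "The", " square", " root",
    '."\n', "}\n", FENCE,
]
flt = FinalAnswerFilter()
for tok in simulated:
    flt.feed(tok)
print("".join(flt.printed))
```

With these assumed tokens, the opening quote, the closing brace (which arrives fused with a newline, so it is not exactly `"}"`), and the trailing fence all slip through. The exact-match check on `"}"` only helps when the brace arrives as its own token.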

Let's try again:

In [36]:
agent("what is the square root of 71?")



> Entering new AgentExecutor chain...
```json
{
    "action": "Calculator",
    "action_input": "square root of 71"
}
```
Observation: Answer: 8.426149773176359
Thought: "The square root of 71 is approximately 8.426149773176359."
}
```
```json
{
    "action": "Final Answer",
    "action_input": "The square root of 71 is approximately 8.426149773176359."
}
```

> Finished chain.


{'input': 'what is the square root of 71?',
 'chat_history': [HumanMessage(content='Hello, how are you?'),
  AIMessage(content="I'm just a computer program, so I don't have feelings, but I'm here and ready to assist you. How can I help you today?"),
  HumanMessage(content='what is the square root of 71?'),
  AIMessage(content='The square root of 71 is approximately 8.426149773176359.'),
  HumanMessage(content='what is the square root of 71?'),
  AIMessage(content='The square root of 71 is approximately 8.426149773176359.')],
 'output': 'The square root of 71 is approximately 8.426149773176359.'}

In [38]:
agent.agent.llm_chain.llm

ChatOpenAI(callbacks=[<__main__.CallbackHandler object at 0x31bb90350>], client=<openai.resources.chat.completions.Completions object at 0x31903fa10>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x31a1e8410>, temperature=0.0, openai_api_key=SecretStr('**********'), openai_proxy='', streaming=True)

It isn't perfect, but this is getting better. Now, in most scenarios we're unlikely to simply be printing output to a terminal or notebook. When we want to do something more complex like stream this data through another API, we need to do things differently.

## Using FastAPI with Agents

In most cases we'll be placing our LLMs, Agents, etc. behind something like an API. Let's add that into the mix and see how we can implement streaming for agents with FastAPI.

First, we'll create a simple `main.py` script to contain our FastAPI logic.
To run the API, navigate to the directory containing `main.py` and run `uvicorn main:app --reload`. Once it has started, you can confirm it is running by looking for the 🤙 status in the next cell output:

In [46]:
import requests

res = requests.get("http://localhost:8000/health")
res.json()

In [15]:
res = requests.get("http://localhost:8000/chat",
    json={"text": "hello there!"}
)
res

<Response [200]>

In [16]:
res.json()

{'input': 'hello there!',
 'chat_history': [],
 'output': 'Hello! How can I assist you today?'}

Unlike with our StdOut streaming, we now need to send our tokens to a generator function that feeds those tokens to FastAPI via a `StreamingResponse` object. To handle this we need to use async code; otherwise our generator will not begin emitting anything until _after_ generation is already complete.

Tokens are passed between the two through a `Queue`. Our callback handler has access to the queue and, as each token is generated, puts the token into it. Our generator function asynchronously checks for new tokens being added to the queue. As soon as the generator sees that a token has been added, it gets the token and yields it to our `StreamingResponse`.
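The queue-plus-generator pattern can be sketched with the standard library alone. This is not the contents of `main.py`. `fake_llm`, `token_generator`, and the `DONE` sentinel are hypothetical names, and `fake_llm` stands in for the callback handler that would put real tokens on the queue.

```python
import asyncio
from typing import AsyncIterator, List

DONE = object()  # sentinel that marks the end of generation


async def fake_llm(queue: asyncio.Queue, tokens: List[str]) -> None:
    """Stand-in for the callback handler: enqueue each token as it is 'generated'."""
    for token in tokens:
        await asyncio.sleep(0)  # yield control, as a real network call would
        await queue.put(token)
    await queue.put(DONE)


async def token_generator(queue: asyncio.Queue) -> AsyncIterator[str]:
    """Yield tokens as soon as they appear on the queue."""
    while True:
        token = await queue.get()
        if token is DONE:
            break
        yield token


async def main() -> List[str]:
    queue: asyncio.Queue = asyncio.Queue()
    # Run generation concurrently so the consumer can start before it finishes.
    producer = asyncio.create_task(
        fake_llm(queue, ["Hello", "!", " How", " can", " I", " help", "?"])
    )
    received = [token async for token in token_generator(queue)]
    await producer
    return received


received = asyncio.run(main())
print("".join(received))
```

In a FastAPI app, an async generator shaped like `token_generator` is the kind of object that would be handed to `StreamingResponse`. The key point is that generation runs as a separate task, so the generator can yield while the model is still producing tokens.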

To see it in action, we'll define a function called `get_stream` that makes a streaming request:

In [17]:
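def get_stream(query: str):
    s = requests.Session()
    with s.get(
        "http://localhost:8000/chat",
        stream=True,
        json={"text": query}
    ) as r:
        # set the encoding so multi-byte characters split across chunks decode correctly
        r.encoding = "utf-8"
        # chunk_size=None yields data as it arrives rather than one byte at a time
        for chunk in r.iter_content(chunk_size=None, decode_unicode=True):
            print(chunk, end="", flush=True)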

In [26]:
get_stream("hi there!")

 "Hello! How can I assist you today?"
