# Streaming and Batching

In this notebook you'll learn how to stream model responses and handle multiple chat completion requests in batches.

---

## Objectives

By the time you complete this notebook, you will be able to:

- Stream model responses.
- Batch model responses.
- Compare the performance of batch processing to single-prompt chat completion.

---

## Imports

Here we import the `ChatNVIDIA` class from `langchain_nvidia_ai_endpoints`, which will enable us to interact with our local Llama 3.1 NIM.

In [1]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

---

## Create a Model Instance

In [16]:
base_url = 'http://llama:8000/v1'
model = 'meta/llama-3.1-8b-instruct'
llm = ChatNVIDIA(base_url=base_url, model=model, temperature=.9)

---

## Sanity Check

Before proceeding with new use cases, let's sanity check that we can interact with our local model via LangChain.

In [17]:
prompt = 'Where and when was NVIDIA founded?'
result = llm.invoke(prompt)

In [18]:
print(result.content)

NVIDIA was founded on April 5, 1993, in Santa Clara, California, USA.


---

## Streaming Responses

As an alternative to the `invoke` method, you can use the `stream` method to receive the model response in chunks. This way, you don't have to wait for the entire response to be generated, and you can see the output as it is being produced. Especially for long responses, or in user-facing applications, streaming output can result in a much better user experience.

Let's create a prompt that generates a longer response.

In [19]:
prompt = 'Explain who you are in roughly 500 words.'

Given this prompt, let's see how the `stream` method works.

In [20]:
for chunk in llm.stream(prompt):
    print(chunk.content, end='')

I am an artificial intelligence designed to understand and generate human-like text based on the input I receive. My primary function is to provide information, answer questions, and engage in conversation to the best of my knowledge and abilities. I am a type of conversational AI, which means I'm trained on vast amounts of text data to learn the patterns, relationships, and context of human language.

 consists of a massive corpus of text from various sources, including books, articles, research papers, and websites. This training data allows me to understand the nuances of language, including grammar, syntax, and idioms, which enables me to generate human-like responses to a wide range of questions and topics.

-based AI, which means I run on remote servers and can be accessed through various interfaces, including chat platforms, messaging apps, and websites. My users can interact with me in a conversational manner, asking me questions, providing context, and obtaining responses. I'm

The `stream` method in LangChain yields the response in chunks as it is being generated, so output can be displayed immediately instead of only after the full completion is ready. This makes interactions with LLMs feel more responsive and improves the user experience.
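
Because `stream` yields chunks one at a time, a common pattern is to display each chunk as it arrives while also collecting the pieces, so the full text is still available afterward. The following is a minimal sketch of that pattern. To keep it runnable without a model server, `simulated_chunks` is a hypothetical stand-in for `llm.stream(prompt)`; it assumes only that each chunk exposes a `content` string, as in the loop above.

```python
from types import SimpleNamespace

# Hypothetical stand-in for `llm.stream(prompt)`: any iterable of objects
# with a `content` attribute behaves the same way in the loop below.
simulated_chunks = [
    SimpleNamespace(content=text)
    for text in ['NVIDIA ', 'was ', 'founded ', 'in ', '1993.']
]

parts = []
for chunk in simulated_chunks:
    # Show each piece immediately; flush so it appears without buffering.
    print(chunk.content, end='', flush=True)
    # Keep the piece so the complete response can be reassembled later.
    parts.append(chunk.content)
print()

full_response = ''.join(parts)
```

With a live model, replacing `simulated_chunks` with `llm.stream(prompt)` should give the same behavior.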

---

In subsequent notebooks we will import this helper function to assist our work.

---

## Batching Responses

You can also use `batch` to call the model on a list of prompts. Calling `batch` returns a list of responses in the same order as the prompts were passed in.

Besides being convenient when working with collections of data that all need a response from an LLM, the `batch` method is designed to process multiple prompts concurrently, running the requests in parallel as much as possible. Because the requests overlap instead of waiting on one another, the overall time needed to generate responses for a list of prompts goes down and throughput goes up.
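
To make the idea of concurrent processing concrete, here is a minimal sketch using Python's standard `concurrent.futures` module. It illustrates the general technique only and is not LangChain's actual implementation: `fake_invoke` is a hypothetical stand-in that sleeps briefly to simulate waiting on a model server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_invoke(prompt: str) -> str:
    """Hypothetical stand-in for a model call; sleeps to simulate latency."""
    time.sleep(0.1)
    return f'answer to: {prompt}'


prompts = ['question 1', 'question 2', 'question 3', 'question 4']

# One request at a time: total time is roughly the sum of the latencies.
start = time.perf_counter()
sequential_results = [fake_invoke(p) for p in prompts]
sequential_seconds = time.perf_counter() - start

# All requests in flight at once: total time is roughly one latency.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    parallel_results = list(pool.map(fake_invoke, prompts))
parallel_seconds = time.perf_counter() - start

print(f'sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s')
```

Note that `pool.map` returns results in the order of the inputs, even if individual calls finish out of order, which mirrors the ordering guarantee described for `batch`.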

We'll demonstrate the functionality and performance benefits of batching by using this list of prompts about state capitals.

In [21]:
state_capital_questions = [
    'What is the capital of California?',
    'What is the capital of Texas?',
    'What is the capital of New York?',
    'What is the capital of Florida?',
    'What is the capital of Illinois?',
    'What is the capital of Ohio?'
]

Using `batch` we can pass in the entire list...

In [22]:
capitals = llm.batch(state_capital_questions)

... and get back a list of responses.

In [23]:
len(capitals)

6

In [24]:
for capital in capitals:
    print(capital.content)

The capital of California is Sacramento.
The capital of Texas is Austin.
The capital of New York is Albany.
The capital of Florida is Tallahassee.
The capital of Illinois is Springfield.
The capital of Ohio is Columbus.


One thing to note is that `batch` does not engage the LLM in a multi-turn conversation (a topic we will cover at length later in the workshop). Rather, each prompt is sent as its own independent request, so no response has any knowledge of the other prompts or their answers.

---

## Comparing batch and invoke Performance

Just to make a quick observation about the potential performance gains from batching, here we time a call to `batch`. Note the `Wall time`.

In [25]:
%%time
llm.batch(state_capital_questions)

CPU times: user 14.4 ms, sys: 2.99 ms, total: 17.4 ms
Wall time: 173 ms


[AIMessage(content='The capital of California is Sacramento.', response_metadata={'role': 'assistant', 'content': 'The capital of California is Sacramento.', 'token_usage': {'prompt_tokens': 19, 'total_tokens': 27, 'completion_tokens': 8}, 'finish_reason': 'stop', 'model_name': 'meta/llama-3.1-8b-instruct'}, id='run-3738ab9e-c5ee-4768-a34f-855984d0bbb8-0', role='assistant'),
 AIMessage(content='The capital of Texas is Austin.', response_metadata={'role': 'assistant', 'content': 'The capital of Texas is Austin.', 'token_usage': {'prompt_tokens': 19, 'total_tokens': 27, 'completion_tokens': 8}, 'finish_reason': 'stop', 'model_name': 'meta/llama-3.1-8b-instruct'}, id='run-a1efba92-ddf8-4a09-a4a4-c3525eaf7996-0', role='assistant'),
 AIMessage(content='The capital of New York is Albany.', response_metadata={'role': 'assistant', 'content': 'The capital of New York is Albany.', 'token_usage': {'prompt_tokens': 20, 'total_tokens': 29, 'completion_tokens': 9}, 'finish_reason': 'stop', 'model_na

And now to compare, we iterate over the `state_capital_questions` list and call `invoke` on each item. Again, note the `Wall time` and compare it to the results from batching above.

In [26]:
%%time
for cq in state_capital_questions:
    llm.invoke(cq)

CPU times: user 12.9 ms, sys: 0 ns, total: 12.9 ms
Wall time: 686 ms
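
Using the two wall times reported above, the speedup can be worked out directly. The figures below are copied from this single pair of runs, so treat the result as a rough indication rather than a benchmark; timings will vary between runs and environments.

```python
# Wall times copied from the two timed cells above (milliseconds).
batch_wall_ms = 173
sequential_wall_ms = 686
num_prompts = 6

# How many times faster the batched call finished.
speedup = sequential_wall_ms / batch_wall_ms
# Average latency of one `invoke` call in the sequential loop.
per_prompt_sequential_ms = sequential_wall_ms / num_prompts

print(f'Speedup from batching: {speedup:.2f}x')
print(f'Average time per sequential invoke: {per_prompt_sequential_ms:.0f} ms')
```

In this run, batching the six prompts finished roughly four times faster than invoking them one by one.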


---

## Exercise: Batch Process to Create an FAQ Document

For this exercise you'll use batch processing to respond to a variety of LLM-related questions in service of creating an FAQ document (in this notebook setting the document will just be something we print to screen).

Here is a list of LLM-related questions.

In [27]:
faq_questions = [
    'What is a Large Language Model (LLM)?',
    'How do LLMs work?',
    'What are some common applications of LLMs?',
    'What is fine-tuning in the context of LLMs?',
    'How do LLMs handle context?',
    'What are some limitations of LLMs?',
    'How do LLMs generate text?',
    'What is the importance of prompt engineering in LLMs?',
    'How can LLMs be used in chatbots?',
    'What are some ethical considerations when using LLMs?'
]

Your job is to populate `faq_answers` below with a list of responses to each of the questions. Use the `batch` method to make this very easy.

Upon successful completion, you should be able to print the return value of calling the following `create_faq_document` with `faq_questions` and `faq_answers` and get an FAQ document for all of the LLM-related questions above.

In [28]:
def create_faq_document(faq_questions, faq_answers):
    faq_document = ''
    for question, response in zip(faq_questions, faq_answers):
        faq_document += f'{question.upper()}\n\n'
        faq_document += f'{response.content}\n\n'
        faq_document += '-'*30 + '\n\n'

    return faq_document
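
To preview the format that `create_faq_document` produces without calling the model, the sketch below runs the same function on a hypothetical stand-in response. `SimpleNamespace` is used here only to mimic an object with a `content` attribute; the sample question and answer are placeholders made up for illustration.

```python
from types import SimpleNamespace


def create_faq_document(faq_questions, faq_answers):
    # Same function as above, repeated so this cell runs on its own.
    faq_document = ''
    for question, response in zip(faq_questions, faq_answers):
        faq_document += f'{question.upper()}\n\n'
        faq_document += f'{response.content}\n\n'
        faq_document += '-'*30 + '\n\n'

    return faq_document


# Placeholder data standing in for real questions and model responses.
sample_questions = ['Sample question?']
sample_answers = [SimpleNamespace(content='Placeholder answer.')]

preview = create_faq_document(sample_questions, sample_answers)
print(preview)
```

Each entry consists of the upper-cased question, the response text, and a line of 30 dashes as a separator.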

If you get stuck, check out the *Solution* below.

### Your Work Here

In [34]:
faq_answers = []

In [35]:
# This should work after you successfully populate `faq_answers` with LLM responses.
print(create_faq_document(faq_questions, faq_answers))




### Solution

In [36]:
faq_answers = llm.batch(faq_questions)

In [37]:
def create_faq_document(faq_questions, faq_answers):
    faq_document = ''
    for question, response in zip(faq_questions, faq_answers):
        faq_document += f'{question.upper()}\n\n'
        faq_document += f'{response.content}\n\n'
        faq_document += '-'*30 + '\n\n'

    return faq_document

In [33]:
print(create_faq_document(faq_questions, faq_answers))

WHAT IS A LARGE LANGUAGE MODEL (LLM)?

A Large Language Model (LLM) is a type of artificial intelligence (AI) model that is trained on a massive corpus of text data to generate human-like language. LLMs are designed to understand and generate human language in a way that is similar to how humans communicate.

LLMs are typically trained on large amounts of text data, often from the internet, books, and other sources, to learn the patterns and structures of language. These models are trained using supervised learning techniques, where the model is tasked with predicting the next word or character in a sequence of text, given the context of the surrounding words or characters.

There are different types of LLMs, including:

1. **Transformers**: This type of LLM is based on the Transformer architecture, which was introduced in 2017. Transformers use self-attention mechanisms to weigh the importance of different words in a sentence, allowing the model to capture long-range dependencies and 

---

## Summary

In this notebook you learned how to stream and batch model responses, and used batched LLM calls to generate a helpful FAQ document.

In the next notebook you'll begin focusing more heavily on the creation of prompts themselves with an emphasis on iterative prompt development and engineering prompts that are very specific.