![nvidia](images/nvidia.png)

# Streaming and Batching

In [1]:
from videos.walkthroughs import walkthrough_14 as walkthrough

In [2]:
walkthrough()

In this notebook you'll learn how to stream model responses and handle multiple chat completion requests in batches.

---

## Objectives

By the time you complete this notebook, you will:

- Learn to stream model responses.
- Learn to batch model responses.
- Compare the performance of batch processing to single prompt chat completion.

---

## Imports

Here we import the `ChatNVIDIA` class from `langchain_nvidia_ai_endpoints`, which will enable us to interact with our local Llama 3.1 NIM.

In [3]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA

---

## Create a Model Instance

In [4]:
base_url = 'http://llama:8000/v1'
model = 'meta/llama-3.1-8b-instruct'
llm = ChatNVIDIA(base_url=base_url, model=model, temperature=0)

---

## Sanity Check

Before proceeding with new use cases, let's sanity check that we can interact with our local model via LangChain.

In [5]:
prompt = 'Where and when was NVIDIA founded?'
result = llm.invoke(prompt)

In [6]:
print(result.content)

NVIDIA was founded on April 5, 1993, in Santa Clara, California, USA.


---

## Streaming Responses

As an alternative to the `invoke` method, you can use the `stream` method to receive the model response in chunks. This way, you don't have to wait for the entire response to be generated, and you can see the output as it is being produced. Especially for long responses, or in user-facing applications, streaming output can result in a much better user experience.

Let's create a prompt that generates a longer response.

In [7]:
prompt = 'Explain who you are in roughly 500 words.'

Given this prompt, let's see how the `stream` method works.

In [8]:
for chunk in llm.stream(prompt):
    print(chunk.content, end='')

I am an artificial intelligence model designed to assist and communicate with humans. I'm a type of computer program that uses natural language processing (NLP) and machine learning algorithms to understand and generate human-like text. My primary function is to provide information, answer questions, and engage in conversation to the best of my abilities.

I don't have a physical body or a personal identity in the classical sense. I exist solely as a digital entity, running on computer servers and responding to input from users like you. My "existence" is a product of complex software and data, designed to simulate conversation and provide helpful responses.

 a massive corpus of text, which I use to learn patterns, relationships, and context. This corpus is sourced from various places, including books, articles, research papers, and online content. I've been trained on a wide range of topics, from science and history to entertainment and culture.

 I use this training data to generate

The `stream` method in LangChain shows the response as it is being generated rather than only after it is complete. This can make interactions with LLMs feel more responsive and improve the user experience.
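Printing chunks as they arrive discards them once they are shown. If you also need the complete text afterwards, you can collect the chunks while printing. The sketch below illustrates that pattern without a running model: `FakeChunk` and `fake_stream` are hypothetical stand-ins that only mimic the `.content` attribute used in the loop above, and `print_and_collect` is an illustrative helper name, not part of LangChain.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a streamed chunk; only `.content` is modeled."""
    content: str


def fake_stream(text: str, size: int = 8) -> Iterator[FakeChunk]:
    """Yield `text` in small pieces, mimicking partial output from a stream."""
    for start in range(0, len(text), size):
        yield FakeChunk(text[start:start + size])


def print_and_collect(chunks: Iterable) -> str:
    """Print each chunk as it arrives and return the full response text."""
    pieces = []
    for chunk in chunks:
        print(chunk.content, end='', flush=True)
        pieces.append(chunk.content)
    print()  # finish the line once the stream ends
    return ''.join(pieces)


full_text = print_and_collect(fake_stream('NVIDIA was founded in 1993.'))
```

In this notebook, `print_and_collect(llm.stream(prompt))` should work the same way, assuming each streamed chunk exposes `.content` as in the cell above.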

---

## Batching Responses

You can also use `batch` to run the model on a list of inputs. Calling `batch` returns a list of responses in the same order as the inputs were passed in.

`batch` is convenient when you have a collection of inputs that each need a response from an LLM. It is also designed to process multiple prompts concurrently, running the requests in parallel as much as possible. As a result, the total time needed to generate responses for a list of prompts is usually lower than when sending the same prompts one after another.
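To make the idea of concurrent processing with ordered results concrete, here is a minimal sketch using only the Python standard library. It is an illustration of the general pattern, not a description of how `batch` is implemented: `fake_invoke` and `fake_batch` are hypothetical names, and the 50 ms sleep stands in for the latency of a real model call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_invoke(prompt: str) -> str:
    """Hypothetical stand-in for a model call: wait briefly, then answer."""
    time.sleep(0.05)  # simulated network and generation latency
    return f'answer to: {prompt}'


def fake_batch(prompts: list[str], max_workers: int = 4) -> list[str]:
    """Run all prompts concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which call finishes first.
        return list(pool.map(fake_invoke, prompts))


prompts = ['first', 'second', 'third']
answers = fake_batch(prompts)
print(answers)
```

Because the results come back in input order, you can safely pair each prompt with its answer using `zip(prompts, answers)`.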

We'll demonstrate the functionality and performance benefits of batching by using this list of prompts about state capitals.

In [9]:
state_capital_questions = [
    'What is the capital of California?',
    'What is the capital of Texas?',
    'What is the capital of New York?',
    'What is the capital of Florida?',
    'What is the capital of Illinois?',
    'What is the capital of Ohio?'
]

Using `batch` we can pass in the entire list...

In [10]:
capitals = llm.batch(state_capital_questions)

... and get back a list of responses.

In [11]:
len(capitals)

6

In [12]:
for capital in capitals:
    print(capital.content)

The capital of California is Sacramento.
The capital of Texas is Austin.
The capital of New York is Albany.
The capital of Florida is Tallahassee.
The capital of Illinois is Springfield.
The capital of Ohio is Columbus.


One thing to note is that `batch` is not engaging with the LLM in a multi-turn conversation (a topic we will cover at length later in the workshop). Rather, each prompt in the list is sent as its own independent request, so no response has any knowledge of the other prompts or their answers.

---

## Comparing batch and invoke Performance

Just to make a quick observation about the potential performance gains from batching, here we time a call to `batch`. Note the `Wall time`.

In [13]:
%%time
llm.batch(state_capital_questions)

CPU times: user 13.5 ms, sys: 1.99 ms, total: 15.5 ms
Wall time: 174 ms


[AIMessage(content='The capital of California is Sacramento.', response_metadata={'role': 'assistant', 'content': 'The capital of California is Sacramento.', 'token_usage': {'prompt_tokens': 19, 'total_tokens': 27, 'completion_tokens': 8}, 'finish_reason': 'stop', 'model_name': 'meta/llama-3.1-8b-instruct'}, id='run-1685858b-40eb-4b2e-9bc3-a6ae5ccd8548-0', role='assistant'),
 AIMessage(content='The capital of Texas is Austin.', response_metadata={'role': 'assistant', 'content': 'The capital of Texas is Austin.', 'token_usage': {'prompt_tokens': 19, 'total_tokens': 27, 'completion_tokens': 8}, 'finish_reason': 'stop', 'model_name': 'meta/llama-3.1-8b-instruct'}, id='run-a9971a51-df7e-45e3-a89e-662871e3a0f2-0', role='assistant'),
 AIMessage(content='The capital of New York is Albany.', response_metadata={'role': 'assistant', 'content': 'The capital of New York is Albany.', 'token_usage': {'prompt_tokens': 20, 'total_tokens': 29, 'completion_tokens': 9}, 'finish_reason': 'stop', 'model_na

And now to compare, we iterate over the `state_capital_questions` list and call `invoke` on each item. Again, note the `Wall time` and compare it to the results from batching above.

In [14]:
%%time
for cq in state_capital_questions:
    llm.invoke(cq)

CPU times: user 9.49 ms, sys: 1.89 ms, total: 11.4 ms
Wall time: 702 ms
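In the run above, `batch` finished in 174 ms while the `invoke` loop took 702 ms, roughly a 4x difference (702 / 174 ≈ 4.0). Exact timings will vary from run to run. If you want to reproduce this kind of comparison outside of a notebook, where `%%time` is not available, `time.perf_counter` does the job. The sketch below assumes a hypothetical `fake_invoke` with a fixed 50 ms delay in place of a real model call, so the numbers it prints describe the simulation only, not the model.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_invoke(prompt: str) -> str:
    """Hypothetical stand-in for one model call with a fixed 50 ms latency."""
    time.sleep(0.05)
    return prompt.upper()


def wall_time(func) -> float:
    """Return the wall-clock seconds taken by one call to `func`."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


questions = [f'question {i}' for i in range(6)]


def run_sequential():
    # One request at a time, like calling `invoke` in a loop.
    return [fake_invoke(q) for q in questions]


def run_concurrent():
    # All requests in flight at once, like a concurrent batch.
    with ThreadPoolExecutor(max_workers=len(questions)) as pool:
        return list(pool.map(fake_invoke, questions))


sequential_seconds = wall_time(run_sequential)
concurrent_seconds = wall_time(run_concurrent)
print(f'sequential: {sequential_seconds:.3f} s')
print(f'concurrent: {concurrent_seconds:.3f} s')
print(f'speedup: {sequential_seconds / concurrent_seconds:.1f}x')
```

To time the real model instead, you could pass `lambda: llm.batch(state_capital_questions)` and a function that loops over `llm.invoke` to `wall_time`.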


---

## Exercise: Batch Process to Create an FAQ Document

For this exercise you'll use batch processing to respond to a variety of LLM-related questions in service of creating an FAQ document (in this notebook setting the document will just be something we print to screen).

Here is a list of LLM-related questions.

In [15]:
faq_questions = [
    'What is a Large Language Model (LLM)?',
    'How do LLMs work?',
    'What are some common applications of LLMs?',
    'What is fine-tuning in the context of LLMs?',
    'How do LLMs handle context?',
    'What are some limitations of LLMs?',
    'How do LLMs generate text?',
    'What is the importance of prompt engineering in LLMs?',
    'How can LLMs be used in chatbots?',
    'What are some ethical considerations when using LLMs?'
]

Your job is to populate `faq_answers` below with a list of responses to each of the questions. Use the `batch` method to make this very easy.

Upon successful completion, you should be able to call the `create_faq_document` function below with `faq_questions` and `faq_answers`, print its return value, and get an FAQ document covering all of the LLM-related questions above.

In [16]:
def create_faq_document(faq_questions, faq_answers):
    faq_document = ''
    for question, response in zip(faq_questions, faq_answers):
        faq_document += f'{question.upper()}\n\n'
        faq_document += f'{response.content}\n\n'
        faq_document += '-'*30 + '\n\n'

    return faq_document
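To see the format that `create_faq_document` produces before any model responses exist, you can call it with placeholder data. The sketch below repeats the function so that it runs on its own, and uses `SimpleNamespace` objects as hypothetical stand-ins for model responses, since the function only reads their `.content` attribute. The sample question and answer are made up for illustration.

```python
from types import SimpleNamespace


def create_faq_document(faq_questions, faq_answers):
    """Same formatting as the function above, repeated so this cell runs alone."""
    faq_document = ''
    for question, response in zip(faq_questions, faq_answers):
        faq_document += f'{question.upper()}\n\n'
        faq_document += f'{response.content}\n\n'
        faq_document += '-' * 30 + '\n\n'
    return faq_document


# Hypothetical placeholder data; real answers would come from the model.
sample_questions = ['What is a token?']
sample_answers = [SimpleNamespace(content='A token is a unit of text.')]

sample_document = create_faq_document(sample_questions, sample_answers)
print(sample_document)
```

Each entry consists of the upper-cased question, the answer text, and a separator line of 30 dashes.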

If you get stuck, check out the *Solution* below.

### Your Work Here

In [17]:
faq_answers = llm.batch(faq_questions)

In [18]:
def create_faq_document(faq_questions, faq_answers):
    faq_document = ''
    for question, response in zip(faq_questions, faq_answers):
        faq_document += f'{question.upper()}\n\n'
        faq_document += f'{response.content}\n\n'
        faq_document += '-'*30 + '\n\n'

    return faq_document

print(create_faq_document(faq_questions, faq_answers))

WHAT IS A LARGE LANGUAGE MODEL (LLM)?

A Large Language Model (LLM) is a type of artificial intelligence (AI) model that is trained on a massive corpus of text data to generate human-like language. LLMs are a type of natural language processing (NLP) model that can understand, generate, and respond to human language in a way that is often indistinguishable from a human.

LLMs are typically trained on a large dataset of text, which can include books, articles, websites, and other sources of written language. The model learns patterns and relationships in the language, such as grammar, syntax, and semantics, and uses this knowledge to generate text that is coherent and contextually relevant.

Some key characteristics of LLMs include:

1. **Large scale**: LLMs are trained on massive amounts of text data, often in the order of billions of parameters and hundreds of gigabytes of data.
2. **Deep learning**: LLMs use deep learning techniques, such as recurrent neural networks (RNNs) or transf

### Solution

In [None]:
faq_answers = llm.batch(faq_questions)

In [None]:
def create_faq_document(faq_questions, faq_answers):
    faq_document = ''
    for question, response in zip(faq_questions, faq_answers):
        faq_document += f'{question.upper()}\n\n'
        faq_document += f'{response.content}\n\n'
        faq_document += '-'*30 + '\n\n'

    return faq_document

In [None]:
print(create_faq_document(faq_questions, faq_answers))

---

## Summary

In this notebook you learned how to stream and batch model responses, and used batched LLM calls to generate a helpful FAQ document.

In the next notebook you'll begin focusing more heavily on the creation of prompts themselves with an emphasis on iterative prompt development and engineering prompts that are very specific.