<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Put Whole Document into Prompt and Ask the Model**


Estimated time needed: **20** minutes


## Overview
In recent years, the development of Large Language Models (LLMs) like GPT-3 and GPT-4 has revolutionized the field of natural language processing (NLP). These models can perform a wide range of tasks, from generating coherent text to answering questions and summarizing information. Their effectiveness, however, is not without limitations. One significant constraint is the context window length, which determines how much information can be processed at once. LLMs operate within a fixed context window, measured in tokens: for example, 4,096 tokens for the original GPT-3.5 models and 8,192 tokens for the base GPT-4 model. When dealing with lengthy documents, putting the entire text into the model's prompt can lead to truncation, where essential information is lost, or to outright request errors, and it increases computational cost because of the large input.

These limitations become particularly pronounced when creating a retrieval-based question-answering (QA) assistant. The context length constraint restricts the ability to input all content into the prompt simultaneously, leading to potential loss of critical context and details. This necessitates the development of sophisticated strategies for selectively retrieving and processing relevant sections of the document. Techniques such as chunking the document into manageable parts, employing summarization methods, and using external retrieval systems are crucial to address these challenges. Understanding and mitigating these limitations are essential for designing effective QA systems that leverage the full potential of LLMs while navigating their inherent constraints.


## __Table of Contents__

<ol>
    <li><a href="#Objectives">Objectives</a></li>
    <li>
        <a href="#Setup">Setup</a>
        <ol>
            <li><a href="#Installing-required-libraries">Installing required libraries</a></li>
            <li><a href="#Importing-required-libraries">Importing required libraries</a></li>
        </ol>
    </li>
    <li><a href="#Build-LLM">Build LLM</a></li>
    <li><a href="#Load-source-document">Load source document</a></li>
    <li>
        <a href="#Limitation-of-retrieve-directly-from-full-document">Limitation of retrieve directly from full document</a>
        <ol>
            <li><a href="#Context-length">Context length</a></li>
            <li><a href="#LangChain-prompt-template">LangChain prompt template</a></li>
            <li><a href="#Use-mixtral-model">Use mixtral model</a></li>
            <li><a href="#Use-Llama-3-model">Use Llama 3 model</a></li>
            <li><a href="#Use-one-piece-of-information">Use one piece of information</a></li>
        </ol>
    </li>
</ol>

<a href="#Exercises">Exercises</a>
<ol>
    <li><a href="#Exercise-1---Change-to-use-another-LLM">Exercise 1 - Change to use another LLM</a></li>
</ol>


## Objectives

After completing this lab you will be able to:

 - Explain the concept of context length for LLMs.
 - Recognize the limitations of retrieving information when inputting the entire content of a document into a prompt.


----


## Setup


For this lab, you will use the following libraries:

*   [`ibm-watsonx-ai`](https://ibm.github.io/watson-machine-learning-sdk/index.html) for using LLMs from IBM's watsonx.ai.
*   [`langchain`, `langchain-ibm`, `langchain-community`](https://www.langchain.com/) for using relevant features from LangChain.


### Installing required libraries

The following required libraries are __not__ preinstalled in the Skills Network Labs environment. __You must run the following cell__ to install them:

**Note:** The library versions are pinned here so that the lab keeps working even if newer releases change their behavior. It's recommended that you pin versions in your own work as well.

The installation might take approximately 1 minute.

Because `%%capture` suppresses the cell output, you won't see the installation progress. When the installation is complete, a number appears beside the cell.


In [1]:
%%capture
!pip install "ibm-watsonx-ai==1.0.10"
!pip install "langchain==0.2.6" 
!pip install "langchain-ibm==0.1.8"
!pip install "langchain-community==0.2.1"

After you install the libraries, restart your kernel. You can do that by clicking the **Restart the kernel** icon.

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/build-a-hotdog-not-hotdog-classifier-guided-project/images/Restarting_the_Kernel.png" width="70%" alt="Restart kernel">
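After restarting the kernel, you may want to confirm that the pinned versions are the ones actually installed. The following is a minimal sketch using only the Python standard library; the helper name `check_versions` is hypothetical and not part of the lab.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned by the install cell above.
PINNED = {
    "ibm-watsonx-ai": "1.0.10",
    "langchain": "0.2.6",
    "langchain-ibm": "0.1.8",
    "langchain-community": "0.2.1",
}

def check_versions(pinned):
    """Return {package: (expected, installed_or_None)} for each pinned package."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (expected, installed)
    return report

for name, (expected, installed) in check_versions(PINNED).items():
    status = "OK" if installed == expected else "MISMATCH"
    print(f"{name}: expected {expected}, found {installed} [{status}]")
```

A `MISMATCH` line usually means the install cell was skipped or the kernel was not restarted.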


### Importing required libraries


In [2]:
# You can use this section to suppress warnings generated by your code:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

from ibm_watsonx_ai.foundation_models import ModelInference
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
from langchain_core.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_community.document_loaders import TextLoader
from langchain_ibm import WatsonxLLM

## Build LLM


Here, you will create a function that interacts with the watsonx.ai API, enabling you to use the various models it offers.

Pass the model ID as a string, and the function returns an LLM object that you can use to invoke queries. A list of model IDs can be found [here](https://ibm.github.io/watsonx-ai-python-sdk/fm_model.html).


In [3]:
def llm_model(model_id):
    parameters = {
        GenParams.MAX_NEW_TOKENS: 256,  # this controls the maximum number of tokens in the generated output
        GenParams.TEMPERATURE: 0.5, # this controls the randomness or creativity of the model's responses
    }
    
    credentials = {
        "url": "https://us-south.ml.cloud.ibm.com"
    }
    
    project_id = "skills-network"
    
    model = ModelInference(
        model_id=model_id,
        params=parameters,
        credentials=credentials,
        project_id=project_id
    )
    
    llm = WatsonxLLM(watsonx_model = model)
    return llm

Let's try to invoke an example query.


In [4]:
llama_llm = llm_model('meta-llama/llama-3-70b-instruct')

In [5]:
llama_llm.invoke("How are you?")

" I'm fine, thanks. How are you?**\n**A**: I'm good, thanks. What's new with you?**\n**B**: Not much. Just got back from a trip to the beach. It was great.**\n**A**: That sounds amazing. I'm jealous. I wish I could've gone.**\n**B**: Yeah, it was really relaxing. You should go sometime.**\n**A**: Yeah, I'll have to plan a trip soon. So, what did you do at the beach?**\n**B**: We went swimming, built sandcastles, and had a bonfire at night. It was really fun.**\n**A**: That sounds like a blast. I love bonfires. Did you make any s'mores?**\n**B**: Yeah, we made tons of s'mores. They were so good.**\n**A**: Mmm, I love s'mores. Okay, I'm officially jealous now.**\n**B**: (laughs) Sorry, I didn't mean to rub it in. But seriously, you should go to the beach soon. It's really nice this time of year.**\n**A**: Yeah, I'll try to plan something soon. Thanks for the recommendation.**\n\nIn this"

## Load source document


A document has been prepared here.


In [6]:
!wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/d_ahNwb1L2duIxBR6RD63Q/state-of-the-union.txt"

--2024-10-10 09:25:42--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/d_ahNwb1L2duIxBR6RD63Q/state-of-the-union.txt
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104, 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39027 (38K) [text/plain]
Saving to: ‘state-of-the-union.txt’


2024-10-10 09:25:42 (61.2 MB/s) - ‘state-of-the-union.txt’ saved [39027/39027]



Use `TextLoader` to load the text.


In [7]:
loader = TextLoader("state-of-the-union.txt")

In [8]:
data = loader.load()

Let's take a look at the document.


In [9]:
content = data[0].page_content
content

'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \n\nGroups of citizens blocking tanks with 

## Limitation of retrieve directly from full document


### Context length


Before you explore the limitations of directly retrieving information from a full document, you need to understand a concept called `context length`. 

`Context length` in LLMs refers to the amount of text or information (prompt) that the model can consider when processing or generating output. LLMs have a fixed context length, meaning they can only take into account a limited amount of text at a time.

For example, the model `llama-3-70b-instruct` has a context window size of `8,192` tokens, while the model `mixtral-8x7b-instruct-v01` has a context window size of `32,768` tokens.


So, how long is your source document here? The answer is approximately 8,235 tokens, as counted with this [platform](https://platform.openai.com/tokenizer). This tool uses OpenAI's tokenizers, so the exact counts for Llama 3 and Mixtral will differ somewhat, but it gives a useful estimate.


Taken at face value, this means your source document fits within the context window of `mixtral-8x7b-instruct-v01` but does not fit entirely within that of `llama-3-70b-instruct`. Is this true? Let's use code to explore this further.
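The comparison above is simple arithmetic, and it can be sketched in a few lines. The helper `fits_in_context` below is hypothetical, and it assumes that the tokens reserved for the generated answer share the same window as the prompt. The token count is the estimate quoted above, not a measurement with each model's own tokenizer.

```python
# Context window sizes (in tokens) quoted in this lab.
CONTEXT_WINDOWS = {
    "meta-llama/llama-3-70b-instruct": 8192,
    "mistralai/mixtral-8x7b-instruct-v01": 32768,
}

def fits_in_context(prompt_tokens, context_window, max_new_tokens=0):
    """Return True if the prompt plus the reserved output tokens fit in the window."""
    return prompt_tokens + max_new_tokens <= context_window

DOCUMENT_TOKENS = 8235  # estimate quoted above (OpenAI tokenizer; other tokenizers differ)
MAX_NEW_TOKENS = 256    # same value as GenParams.MAX_NEW_TOKENS in llm_model

for model_id, window in CONTEXT_WINDOWS.items():
    ok = fits_in_context(DOCUMENT_TOKENS, window, MAX_NEW_TOKENS)
    print(f"{model_id}: window={window}, fits={ok}")
```

The real prompt is slightly longer than the document alone, because the template text and the question are added around it.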


### LangChain prompt template


A prompt template has been set up using LangChain to make it reusable.

In this template, you will define two input variables:
- `content`: This variable will hold all the content from the entire source document at once.
- `question`: This variable will capture the user's query.


In [10]:
template = """According to the document content here 
            {content},
            answer this question 
            {question}.
            Do not try to make up the answer.
                
            YOUR RESPONSE:
"""

prompt_template = PromptTemplate(template=template, input_variables=['content', 'question'])
prompt_template 

PromptTemplate(input_variables=['content', 'question'], template='According to the document content here \n            {content},\n            answer this question \n            {question}.\n            Do not try to make up the answer.\n                \n            YOUR RESPONSE:\n')
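To see why the prompt grows with the document, it can help to fill in the placeholders by hand. The sketch below is a plain-Python stand-in that uses `str.format` instead of LangChain; the names `demo_template` and `render_prompt` are hypothetical, and the indentation of the original template is dropped for readability.

```python
# Plain-Python stand-in for the LangChain template: same placeholders, no LangChain needed.
demo_template = """According to the document content here
{content},
answer this question
{question}.
Do not try to make up the answer.

YOUR RESPONSE:
"""

def render_prompt(content, question):
    """Fill both placeholders and return the final prompt string."""
    return demo_template.format(content=content, question=question)

prompt = render_prompt(
    content="So on this night, in our 245th year as a nation, "
            "I have come to report on the State of the Union.",
    question="It is in which year of our nation?",
)
print(prompt)
print(f"Prompt length: {len(prompt)} characters")
```

Whatever is passed as `content` is copied into the prompt verbatim, so passing the full document produces a prompt at least as long as the document itself.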

### Use mixtral model


Since the context window length of the mixtral model is longer than your source document, you can assume it can retrieve relevant information for the query when you input the whole document into the prompt.


First, let's build a mixtral model.


In [11]:
mixtral_llm = llm_model('mistralai/mixtral-8x7b-instruct-v01')

Then, create a query chain.


In [12]:
query_chain = LLMChain(llm=mixtral_llm, prompt=prompt_template)

Then, set the query and get the answer.


In [13]:
query = "It is in which year of our nation?"
response = query_chain.invoke(input={'content': content, 'question': query})
print(response['text'])


            It is in our 245th year as a nation.


You have asked a question whose answer appears near the very end of the document. Despite this, the LLM was still able to answer it correctly because the model's context window is long enough to accommodate the entire content of the document.


### Use Llama 3 model


Now, let's try using an LLM whose context window is smaller than the total number of tokens in the document.


First, create a query chain.


In [14]:
query_chain = LLMChain(llm=llama_llm, prompt=prompt_template)
query_chain 

LLMChain(prompt=PromptTemplate(input_variables=['content', 'question'], template='According to the document content here \n            {content},\n            answer this question \n            {question}.\n            Do not try to make up the answer.\n                \n            YOUR RESPONSE:\n'), llm=WatsonxLLM(model_id='meta-llama/llama-3-70b-instruct', deployment_id=None, project_id='skills-network', space_id=None, params={'max_new_tokens': 256, 'temperature': 0.5}, watsonx_model=<ibm_watsonx_ai.foundation_models.inference.model_inference.ModelInference object at 0x7fb470e69950>))

Then, use the query chain (the code is shown below) to invoke the LLM, which will answer the same query as before based on the entire document's content.


**Important Note**: The code has been commented out. You need to uncomment it to run it. When you run the following code, you will see an error being raised. This is because the total number of tokens in the prompt exceeds the LLM's context window, so the LLM cannot accept the entire document as a prompt.


In [15]:
# query = "It is in which year of our nation?"
# response = query_chain.invoke(input={'content': content, 'question': query})
# print(response['text'])

Now you can see the limitation of inputting the entire document content at once into the prompt and using the LLM to retrieve information.
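If you would rather record the failure than let it stop the notebook, you can wrap the call. The following is a minimal sketch: `try_invoke` is a hypothetical helper, it catches a broad `Exception` because the exact exception class raised by the API is not assumed here, and `fake_invoke` is a stand-in so that the example runs offline.

```python
def try_invoke(invoke, inputs):
    """Call `invoke(inputs)` and return (True, result) or (False, error message)."""
    try:
        return True, invoke(inputs)
    except Exception as err:  # broad on purpose: the exact exception class is not assumed
        return False, f"{type(err).__name__}: {err}"

# Hypothetical stand-in for `query_chain.invoke`, used only so this sketch runs offline.
def fake_invoke(inputs, context_window=8192):
    approx_tokens = len(inputs["content"].split())  # crude whitespace count, not a real tokenizer
    if approx_tokens > context_window:
        raise ValueError(f"input of ~{approx_tokens} tokens exceeds window of {context_window}")
    return {"text": "stub answer"}

ok, result = try_invoke(fake_invoke, {"content": "word " * 9000, "question": "?"})
print(ok, result)
```

In the lab, the same helper could be called as `try_invoke(query_chain.invoke, {'content': content, 'question': query})`.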


### Use one piece of information


So, putting the whole document into the prompt does not work. What if you input only the piece of the document that is related to the query, so that its token length is smaller than the LLM's context window? Does it work then?

Let's see.


Now, let's retrieve the piece of information related to the query and put it in the content variable.


In [16]:
content = """
    The only nation that can be defined by a single word: possibilities. 
    
    So on this night, in our 245th year as a nation, I have come to report on the State of the Union. 
    
    And my report is this: the State of the Union is strong—because you, the American people, are strong. 
"""

Then, use the Llama model again.


In [17]:
query_chain = LLMChain(llm=llama_llm, prompt=prompt_template)

In [18]:
query = "It is in which year of our nation?"
response = query_chain.invoke(input={'content': content, 'question': query})
print(response['text'])

            According to the text, it is the 245th year of our nation.


Now it works.
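The excerpt used here was picked by hand. As a rough illustration of how that selection could be automated, the sketch below scores each paragraph by how many words it shares with the question. This keyword overlap is a deliberately crude stand-in for a real retriever, and the helper names are hypothetical.

```python
import re

def words(text):
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def best_paragraph(document, question):
    """Return the paragraph sharing the most words with the question."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    q_words = words(question)
    return max(paragraphs, key=lambda p: len(words(p) & q_words))

# Three short paragraphs taken from the source document.
sample = (
    "Last year COVID-19 kept us apart. This year we are finally together again.\n\n"
    "So on this night, in our 245th year as a nation, "
    "I have come to report on the State of the Union.\n\n"
    "He met the Ukrainian people."
)
print(best_paragraph(sample, "It is in which year of our nation?"))
```

Word overlap breaks down quickly on paraphrased questions, which is one reason retrieval systems usually rely on embeddings instead.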


#### Takeaway


If the document is longer than the LLM's context length, you need to cut the document into chunks, index them, and then retrieve only the chunks relevant to the query so that the LLM can answer accurately and efficiently.
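As a rough preview of the chunking step, here is a minimal plain-Python sketch that splits text into fixed-size character chunks with a small overlap. The function `chunk_text` and its default sizes are assumptions chosen for illustration, not the approach or the values used later in the course.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split `text` into character chunks of at most `chunk_size`, sharing `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]

document = "x" * 2500  # placeholder text; in the lab this would be data[0].page_content
chunks = chunk_text(document, chunk_size=1000, overlap=100)
print(len(chunks), [len(c) for c in chunks])
```

The overlap keeps a sentence that straddles a chunk boundary fully visible in at least one chunk.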

In the next lesson, you will learn how to perform these operations using LangChain.


## Exercises


### Exercise 1 - Change to use another LLM


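Try another LLM with a small context length to see whether the same error occurs. For example, try `'ibm/granite-13b-chat-v2'`, which has a context length of `8192` tokens. Note that `content` currently holds the short excerpt from the previous section; to test against the full document, first reset it with `content = data[0].page_content`.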


In [19]:
# Your code here

granite_llm = llm_model('ibm/granite-13b-chat-v2')
query_chain = LLMChain(llm=granite_llm, prompt=prompt_template)
query = "It is in which year of our nation?"
response = query_chain.invoke(input={'content': content, 'question': query})
print(response['text'])

                245th


<details>
    <summary>Click here for Solution</summary>

```python
granite_llm = llm_model('ibm/granite-13b-chat-v2')
query_chain = LLMChain(llm=granite_llm, prompt=prompt_template)
query = "It is in which year of our nation?"
response = query_chain.invoke(input={'content': content, 'question': query})
print(response['text'])
```

</details>


## Authors


[Kang Wang](https://author.skills.network/instructors/kang_wang)

Kang Wang is a Data Scientist at IBM. He is also a PhD Candidate at the University of Waterloo.


### Other Contributors


[Joseph Santarcangelo](https://author.skills.network/instructors/joseph_santarcangelo)

Joseph has a Ph.D. in Electrical Engineering. His research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his Ph.D.


## Change Log

|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-07-12|0.1|Kang Wang|Create the lab|


Copyright © IBM Corporation. All rights reserved.
