# Lec4. Adding Memory and Storage to LLMs

Last week, we learned the basic elements of the LangChain framework. In this lecture, we are going to construct a vector store QA application from scratch.

>Reference:
> 1. [Ask A Book Questions](https://github.com/gkamradt/langchain-tutorials/blob/main/data_generation/Ask%20A%20Book%20Questions.ipynb)
> 2. [Agent Vectorstore](https://python.langchain.com/docs/modules/agents/how_to/agent_vectorstore)


## 0. Setup

1. Install the requirements.  (Already installed in your image.)
    ```
    pip install -r requirements.txt
    ```
2. Get your OpenAI API key. To get your SerpApi key, sign up for a free account at the [SerpApi website](https://serpapi.com/). To get your Pinecone key, first register on the [Pinecone website](https://www.pinecone.io/), then **Create API Key** and **Create Index**. Note that in this notebook the index's dimension should be 1536.

3. Store your keys in a file named **.env** and place it in the current path or in a location that can be accessed.
    ```
    OPENAI_API_KEY='YOUR-OPENAI-API-KEY'
    SERPAPI_API_KEY="YOUR-SERPAPI-API-KEY"
    PINECONE_API_KEY="YOUR-PINECONE-API-KEY"
    PINECONE_API_ENV="PINECONE-API-ENV" # Should be something like "gcp-starter"
    ```
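
Before running the cells below, it can help to confirm that every key was actually loaded. The following is a minimal sketch and not part of the original notebook: `missing_keys` and `mask` are hypothetical helper names, and the key names are the ones from the `.env` example above.

```python
import os

REQUIRED_KEYS = [
    "OPENAI_API_KEY",
    "SERPAPI_API_KEY",
    "PINECONE_API_KEY",
    "PINECONE_API_ENV",
]

def missing_keys(env, names=REQUIRED_KEYS):
    """Return the names that are absent or empty in the given mapping."""
    return [name for name in names if not env.get(name)]

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return secret[:visible] + "*" * (len(secret) - visible)

# os.environ is what load_dotenv() fills in from the .env file.
print("Missing keys:", missing_keys(os.environ))
```

Printing `mask(key)` instead of a slice of the key keeps most of the secret out of the saved notebook output.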

In [1]:
#%pip install -r requirements.txt

In [2]:
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
import os
os.environ['HTTP_PROXY']="http://Clash:QOAF8Rmd@10.1.0.213:7890"
os.environ['HTTPS_PROXY']="http://Clash:QOAF8Rmd@10.1.0.213:7890"
os.environ['ALL_PROXY']="socks5://Clash:QOAF8Rmd@10.1.0.213:7893"

In [4]:
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
print(OPENAI_API_KEY[:len(OPENAI_API_KEY)//2])
SERPAPI_API_KEY = os.getenv('SERPAPI_API_KEY')
assert SERPAPI_API_KEY.startswith('165788')

sk-jINUrJKhyLbsOjCyFeELT3


In [5]:
# A utility function: pretty-print a value, preceded by its type

from pprint import pprint

def print_with_type(res):
    pprint(f"{type(res)}:")
    pprint(res)

    #pprint(f"%s : %s" % (type(res), res))

## 1. Adding memory to remember the context

### 1.1 Use Conversation Buffer

#### Basic Use of ConversationBufferMemory

In [6]:
from langchain.memory import ConversationBufferMemory

# Create a memory and write to it.
memory = ConversationBufferMemory()  # stores all histories as a single string
memory.save_context({"input": "hi"}, 
                    {"output": "what's up"})
print_with_type(memory.load_memory_variables({}))

"<class 'dict'>:"
{'history': "Human: hi\nAI: what's up"}


We can also get the history as a list of messages (this is useful if you are using this with a chat model).

In [7]:
# get the history as a list of messages
memory = ConversationBufferMemory(return_messages=True)  # stores messages as a list
memory.save_context({"input": "hi"}, 
                    {"output": "what's up"})
print_with_type(memory.load_memory_variables({}))

"<class 'dict'>:"
{'history': [HumanMessage(content='hi'), AIMessage(content="what's up")]}
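
Conceptually, a conversation buffer only appends each turn and renders the stored turns on request. The toy class below is a hypothetical stand-in written for illustration, not LangChain's implementation: `TinyBufferMemory` is an assumed name, and messages are represented as plain `(role, text)` tuples instead of `HumanMessage`/`AIMessage` objects.

```python
class TinyBufferMemory:
    """Toy stand-in for a conversation buffer: it just stores every turn."""

    def __init__(self, return_messages=False, memory_key="history"):
        self.return_messages = return_messages
        self.memory_key = memory_key
        self.turns = []  # list of (role, text) pairs

    def save_context(self, inputs, outputs):
        # One call stores one human turn and one AI turn.
        self.turns.append(("Human", inputs["input"]))
        self.turns.append(("AI", outputs["output"]))

    def load_memory_variables(self, _inputs):
        if self.return_messages:
            history = list(self.turns)  # list form, for chat models
        else:
            # String form: one "Role: text" line per turn.
            history = "\n".join(f"{role}: {text}" for role, text in self.turns)
        return {self.memory_key: history}

    def clear(self):
        self.turns = []


toy_memory = TinyBufferMemory()
toy_memory.save_context({"input": "hi"}, {"output": "what's up"})
print(toy_memory.load_memory_variables({}))
```

The `memory_key` argument plays the same role as in the cells that follow: it names the prompt variable that receives the history.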


#### Managing Conversation Memory automatically in a chain

In [8]:
from langchain.chains import LLMChain
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, MessagesPlaceholder
from langchain_core.messages import SystemMessage
from langchain_openai import ChatOpenAI

In [9]:
prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content = """You are a chatbot having a conversation with a human. 
            Your name is Tom Marvolo Riddle. 
            You need to tell your name to that human if he doesn't know."""
        ),  # The persistent system prompt
        MessagesPlaceholder(
            variable_name = "chat_history"
        ),  # This is where the memory will be stored.
        HumanMessagePromptTemplate.from_template(
            "{human_input}"
        ),  # This is where the human input will be injected
    ]
)

memory = ConversationBufferMemory(memory_key="chat_history", 
                                  return_messages=True)

In [10]:
# Set verbose=True to see more details
llm = ChatOpenAI()

chat_llm_chain = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=False,
    memory=memory,   # Look at this line
)

In [11]:
chat_llm_chain.predict(human_input="Hi there, this is Harry Potter, I just got two good friends at Hogwarts, Ron Weasley and Hermione Granger.")

"Hello Harry Potter. I am Tom Marvolo Riddle. It's nice to hear that you've made some friends at Hogwarts. Ron Weasley and Hermione Granger are both quite remarkable individuals. How are you finding your time at the school so far?"

In [12]:
# get a list of messages in the memory 
memory.load_memory_variables({})

{'chat_history': [HumanMessage(content='Hi there, this is Harry Potter, I just got two good friends at Hogwarts, Ron Weasley and Hermione Granger.'),
  AIMessage(content="Hello Harry Potter. I am Tom Marvolo Riddle. It's nice to hear that you've made some friends at Hogwarts. Ron Weasley and Hermione Granger are both quite remarkable individuals. How are you finding your time at the school so far?")]}

In [13]:
chat_llm_chain.predict(human_input="What are my best friends' names? ")

"Your best friends' names are Ron Weasley and Hermione Granger. They are loyal and brave companions who have stood by you through thick and thin."

In [14]:
# get a list of messages in the memory 
memory.load_memory_variables({})

{'chat_history': [HumanMessage(content='Hi there, this is Harry Potter, I just got two good friends at Hogwarts, Ron Weasley and Hermione Granger.'),
  AIMessage(content="Hello Harry Potter. I am Tom Marvolo Riddle. It's nice to hear that you've made some friends at Hogwarts. Ron Weasley and Hermione Granger are both quite remarkable individuals. How are you finding your time at the school so far?"),
  HumanMessage(content="What are my best friends' names? "),
  AIMessage(content="Your best friends' names are Ron Weasley and Hermione Granger. They are loyal and brave companions who have stood by you through thick and thin.")]}

In [15]:
memory.clear()
memory.load_memory_variables({})
chat_llm_chain.predict(human_input="What are my best friends' names? ")


'Hello there! My name is Tom Marvolo Riddle. How can I assist you today?'
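
The last cell shows why the model "forgets" after `memory.clear()`: the placeholder slot in the prompt is filled with whatever the memory holds, and an empty memory contributes nothing. Here is a rough sketch of that assembly step, assuming messages are plain `(role, text)` tuples; `build_messages` is a hypothetical helper, not a LangChain function.

```python
def build_messages(system_prompt, chat_history, human_input):
    """Assemble messages in the same order as the prompt template:
    system prompt, then the stored history, then the new human input."""
    messages = [("system", system_prompt)]
    messages.extend(chat_history)            # the placeholder slot
    messages.append(("human", human_input))  # the new question
    return messages


history = [
    ("human", "I just made two friends, Ron and Hermione."),
    ("ai", "Nice to hear that, Harry."),
]
question = "What are my best friends' names?"

with_memory = build_messages("You are a chatbot.", history, question)
after_clear = build_messages("You are a chatbot.", [], question)

print(len(with_memory), "messages are sent when the memory holds one exchange")
print(len(after_clear), "messages are sent after the memory is cleared")
```

The model is stateless between calls, so anything it should remember has to be present in this list.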

#### (Optional) Manipulate the memory by yourself in a chain

In [16]:
from operator import itemgetter

from langchain.memory import ConversationBufferMemory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI

model = ChatOpenAI()
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful chatbot"),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{input}"),
    ]
)

memory = ConversationBufferMemory(return_messages=True)


In [17]:
# add memory to an arbitrary chain
chain = (
    RunnablePassthrough.assign(
        history=RunnableLambda(memory.load_memory_variables) | itemgetter("history")
    )
    | prompt
    | model
)

In [18]:
inputs = {"input": "Hi, I am Harry!"}
response = chain.invoke(inputs)
print_with_type(response)
print_with_type(memory.load_memory_variables({}))

"<class 'langchain_core.messages.ai.AIMessage'>:"
AIMessage(content='Hello Harry! How can I assist you today?', response_metadata={'token_usage': {'completion_tokens': 10, 'prompt_tokens': 23, 'total_tokens': 33}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_3bc1b5746c', 'finish_reason': 'stop', 'logprobs': None})
"<class 'dict'>:"
{'history': []}


In [19]:
# You need to save the context yourself
memory.save_context(inputs, 
                    {"output": response.content})
print_with_type(memory.load_memory_variables({}))

"<class 'dict'>:"
{'history': [HumanMessage(content='Hi, I am Harry!'),
             AIMessage(content='Hello Harry! How can I assist you today?')]}


In [20]:
response = chain.invoke({"input": "What's my name?"})
print_with_type(response)

"<class 'langchain_core.messages.ai.AIMessage'>:"
AIMessage(content='Your name is Harry!', response_metadata={'token_usage': {'completion_tokens': 5, 'prompt_tokens': 46, 'total_tokens': 51}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': 'fp_3bc1b5746c', 'finish_reason': 'stop', 'logprobs': None})
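
Because nothing saves the context automatically in this style, it is easy to forget the `save_context` call (the last cell above, for example, does not store its own exchange). One option is to wrap both steps in a small helper. This is a sketch under assumptions: `invoke_and_remember` is a hypothetical name, and `EchoChain` and `ListMemory` are stand-ins so that the block runs without an API key.

```python
from types import SimpleNamespace

def invoke_and_remember(chain, memory, user_input):
    """Run one turn and store it, so that the next turn can see it."""
    inputs = {"input": user_input}
    response = chain.invoke(inputs)
    memory.save_context(inputs, {"output": response.content})
    return response


# Stand-ins for demonstration only (assumptions, not LangChain classes).
class EchoChain:
    def invoke(self, inputs):
        return SimpleNamespace(content=f"You said: {inputs['input']}")

class ListMemory:
    def __init__(self):
        self.saved = []

    def save_context(self, inputs, outputs):
        self.saved.append((inputs["input"], outputs["output"]))


demo_memory = ListMemory()
reply = invoke_and_remember(EchoChain(), demo_memory, "Hi, I am Harry!")
print(reply.content)
print(demo_memory.saved)
```

The helper only relies on `invoke`, a `.content` attribute on the response, and `save_context`, so it should also accept the `chain` and `memory` objects defined above.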


### 1.2 Using Entity memory

#### Basic Use of ConversationEntityMemory

Entity memory remembers given facts about specific entities in a conversation. It extracts information on entities (using an LLM) and builds up its knowledge about that entity over time (also using an LLM).

In [21]:
from langchain_openai import OpenAI
from langchain.memory import ConversationEntityMemory
llm = OpenAI(temperature=0)

In [22]:
memory = ConversationEntityMemory(llm=llm, return_messages=True)
inputs = {"input": "Harry & Ron are going to rescue a baby dragon in London."}
memory.load_memory_variables(inputs)
memory.save_context(
    inputs,
    {"output": "That sounds like a great mission! What kind of mission are they working on?"}
)

memory.load_memory_variables({"input": "Harry and Ron and Wei and London?"})

{'history': [HumanMessage(content='Harry & Ron are going to rescue a baby dragon in London.'),
  AIMessage(content='That sounds like a great mission! What kind of mission are they working on?')],
 'entities': {'Harry': 'Harry is going to rescue a baby dragon in London.',
  'Ron': 'Ron is going to rescue a baby dragon in London with Harry.',
  'Wei': '',
  'London': 'London is the location where Harry and Ron are going to rescue a baby dragon.'}}
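
The shape of the result (a dictionary from entity name to accumulated knowledge, with an empty string for entities never seen before) can be imitated with a toy store. This sketch makes two loud simplifications: entities are found by matching a fixed list of known names, and "summarizing" is plain concatenation, whereas `ConversationEntityMemory` uses an LLM for both steps. `TinyEntityStore` is a hypothetical name.

```python
class TinyEntityStore:
    """Toy entity memory: maps an entity name to the sentences seen about it."""

    def __init__(self, known_entities):
        self.known_entities = known_entities
        self.store = {}

    def extract(self, text):
        # Assumption: substring matching stands in for LLM-based extraction.
        return [name for name in self.known_entities if name in text]

    def save(self, text):
        # Assumption: concatenation stands in for LLM-based summarization.
        for name in self.extract(text):
            previous = self.store.get(name, "")
            self.store[name] = (previous + " " + text).strip()

    def load(self, text):
        # Entities without stored facts map to an empty string.
        return {name: self.store.get(name, "") for name in self.extract(text)}


toy_entities = TinyEntityStore(["Harry", "Ron", "Wei", "London"])
toy_entities.save("Harry & Ron are going to rescue a baby dragon in London.")
print(toy_entities.load("Harry and Ron and Wei and London?"))
```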

#### Using Entity in a chain

Here we use ConversationChain. It is a thin wrapper over LLMChain that includes prompts which make the LLM converse more smoothly. See its source code for details.

In [23]:
from langchain.chains import ConversationChain
from langchain.memory import ConversationEntityMemory
from langchain.memory.prompt import ENTITY_MEMORY_CONVERSATION_TEMPLATE
from pydantic import BaseModel
from typing import List, Dict, Any

In [24]:
conversation = ConversationChain(
    llm=llm,
    verbose=False,
    prompt=ENTITY_MEMORY_CONVERSATION_TEMPLATE,
    memory=ConversationEntityMemory(llm=llm)
)

In [25]:
conversation.invoke(input="Harry & Ron are going to rescue a baby dragon.")

{'input': 'Harry & Ron are going to rescue a baby dragon.',
 'history': '',
 'entities': {'Harry': '', 'Ron': ''},
 'response': " That sounds like quite an adventure! I hope they have a plan in place to safely rescue the dragon and return it to its natural habitat. Dragons can be quite dangerous, but I'm sure Harry and Ron are up for the challenge. Do you know where they are planning to rescue the dragon from?"}

In [26]:
conversation.memory.entity_store.store

{'Harry': 'Harry is going on an adventure with Ron to rescue a baby dragon.',
 'Ron': 'Ron is going on an adventure with Harry to rescue a baby dragon.'}

In [27]:
conversation.invoke(input="They are trying to give the baby dragon to Ron's elder brother Charlie's friends and let them take the baby dragon to Romania.")

{'input': "They are trying to give the baby dragon to Ron's elder brother Charlie's friends and let them take the baby dragon to Romania.",
 'history': "Human: Harry & Ron are going to rescue a baby dragon.\nAI:  That sounds like quite an adventure! I hope they have a plan in place to safely rescue the dragon and return it to its natural habitat. Dragons can be quite dangerous, but I'm sure Harry and Ron are up for the challenge. Do you know where they are planning to rescue the dragon from?",
 'entities': {'Romania': ''},
 'response': " That's a great idea! Romania is known for its dragon reserves and I'm sure the baby dragon will be well taken care of there. I hope Harry and Ron are able to successfully deliver the dragon to Charlie's friends and ensure its safety."}

In [28]:
conversation.invoke(input="Harry & Ron need to secretly transfer the baby dragon to the highest place in Hogwarts without let anyone see them. So they are going to use Harry's Invisibility cloak.")

{'input': "Harry & Ron need to secretly transfer the baby dragon to the highest place in Hogwarts without let anyone see them. So they are going to use Harry's Invisibility cloak.",
 'history': "Human: Harry & Ron are going to rescue a baby dragon.\nAI:  That sounds like quite an adventure! I hope they have a plan in place to safely rescue the dragon and return it to its natural habitat. Dragons can be quite dangerous, but I'm sure Harry and Ron are up for the challenge. Do you know where they are planning to rescue the dragon from?\nHuman: They are trying to give the baby dragon to Ron's elder brother Charlie's friends and let them take the baby dragon to Romania.\nAI:  That's a great idea! Romania is known for its dragon reserves and I'm sure the baby dragon will be well taken care of there. I hope Harry and Ron are able to successfully deliver the dragon to Charlie's friends and ensure its safety.",
 'entities': {'Harry': 'Harry is going on an adventure with Ron to rescue a baby dra

In [29]:
conversation.invoke(input="What do you know about Harry & Ron?")

{'input': 'What do you know about Harry & Ron?',
 'history': "Human: Harry & Ron are going to rescue a baby dragon.\nAI:  That sounds like quite an adventure! I hope they have a plan in place to safely rescue the dragon and return it to its natural habitat. Dragons can be quite dangerous, but I'm sure Harry and Ron are up for the challenge. Do you know where they are planning to rescue the dragon from?\nHuman: They are trying to give the baby dragon to Ron's elder brother Charlie's friends and let them take the baby dragon to Romania.\nAI:  That's a great idea! Romania is known for its dragon reserves and I'm sure the baby dragon will be well taken care of there. I hope Harry and Ron are able to successfully deliver the dragon to Charlie's friends and ensure its safety.\nHuman: Harry & Ron need to secretly transfer the baby dragon to the highest place in Hogwarts without let anyone see them. So they are going to use Harry's Invisibility cloak.\nAI:  That's a clever plan! The Invisibili

Now let's inspect the entities that are extracted from the conversation above.

In [30]:
print_with_type(conversation.memory.entity_store.store)

"<class 'dict'>:"
{'Harry': 'Harry and Ron are currently on a mission to rescue a baby dragon '
          'and transfer it to Romania. They are both brave and resourceful '
          'individuals who are always up for a challenge.',
 'Hogwarts': 'Hogwarts is a magical school where Harry and Ron are planning to '
             "secretly transfer a baby dragon using Harry's Invisibility "
             'cloak.',
 'Invisibility cloak': 'The Invisibility cloak will be used by Harry and Ron '
                       'to secretly transfer the baby dragon to the highest '
                       'place in Hogwarts without being seen.',
 'Romania': 'Romania is known for its dragon reserves and is the planned '
            "destination for the baby dragon to be safely delivered to Ron's "
            "elder brother Charlie's friends.",
 'Ron': 'Ron is currently on a mission with Harry to rescue a baby dragon and '
        "transfer it to Romania using Harry's Invisibility cloak. He is a "
        '

Let's continue the conversation and see what more we can learn about each entity.

In [31]:
conversation.predict(input="Harry is a brave and clever boy.")

" Yes, Harry is definitely a brave and clever boy. He has proven himself time and time again on various missions and challenges. I'm sure he will continue to impress and surprise us with his bravery and cleverness."

In [32]:
print_with_type(conversation.memory.entity_store.store)

"<class 'dict'>:"
{'Harry': 'Harry is a brave and clever boy who has proven himself time and '
          'time again on various missions and challenges.',
 'Hogwarts': 'Hogwarts is a magical school where Harry and Ron are planning to '
             "secretly transfer a baby dragon using Harry's Invisibility "
             'cloak.',
 'Invisibility cloak': 'The Invisibility cloak will be used by Harry and Ron '
                       'to secretly transfer the baby dragon to the highest '
                       'place in Hogwarts without being seen.',
 'Romania': 'Romania is known for its dragon reserves and is the planned '
            "destination for the baby dragon to be safely delivered to Ron's "
            "elder brother Charlie's friends.",
 'Ron': 'Ron is currently on a mission with Harry to rescue a baby dragon and '
        "transfer it to Romania using Harry's Invisibility cloak. He is a "
        'brave and resourceful individual who is always up for a challenge.'}


In [33]:
conversation.invoke(input="What do you know about Harry?")

{'input': 'What do you know about Harry?',
 'history': "Human: Harry & Ron need to secretly transfer the baby dragon to the highest place in Hogwarts without let anyone see them. So they are going to use Harry's Invisibility cloak.\nAI:  That's a clever plan! The Invisibility cloak will definitely come in handy for their mission. I hope they are able to successfully transfer the baby dragon without being seen. It would be quite a feat to pull off in a place like Hogwarts where there are always people around.\nHuman: What do you know about Harry & Ron?\nAI:  Well, based on the context provided, I know that Harry and Ron are currently on a mission to rescue a baby dragon and transfer it to Romania. They are using Harry's Invisibility cloak to do so without being seen. They are both brave and resourceful individuals who are always up for a challenge.\nHuman: Harry is a brave and clever boy.\nAI:  Yes, Harry is definitely a brave and clever boy. He has proven himself time and time again on

In [34]:
print_with_type(conversation.memory.entity_store.store)

"<class 'dict'>:"
{'Harry': 'Harry is a brave and clever boy who has proven himself time and '
          'time again on various missions and challenges. He is also a loyal '
          'friend and a skilled wizard.',
 'Hogwarts': 'Hogwarts is a magical school where Harry and Ron are planning to '
             "secretly transfer a baby dragon using Harry's Invisibility "
             'cloak.',
 'Invisibility cloak': 'The Invisibility cloak will be used by Harry and Ron '
                       'to secretly transfer the baby dragon to the highest '
                       'place in Hogwarts without being seen.',
 'Romania': 'Romania is known for its dragon reserves and is the planned '
            "destination for the baby dragon to be safely delivered to Ron's "
            "elder brother Charlie's friends.",
 'Ron': 'Ron is currently on a mission with Harry to rescue a baby dragon and '
        "transfer it to Romania using Harry's Invisibility cloak. He is a "
        'brave and resourc

### 1.3 Adding Memory to Agents

In this section, we first ask the agent a question and then ask a related follow-up question without repeating the context ourselves.

In [35]:
from langchain.agents import AgentExecutor, Tool, ZeroShotAgent
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain_community.utilities import SerpAPIWrapper
from langchain_openai import OpenAI

In [36]:
search = SerpAPIWrapper()

tools = [
    Tool(
        name="Search",
        func=search.run,
        description="useful for when you need to answer questions about current events",
    )
]

In [37]:
prompt = ZeroShotAgent.create_prompt(
    tools,
    prefix="""Have a conversation with a human, answering the following questions as best you can.  You have access to the following tools:""",
    suffix="""Begin!  
{chat_history}
Question: {input}
{agent_scratchpad}""",
    input_variables=["input", "chat_history", "agent_scratchpad"],
)
memory = ConversationBufferMemory(memory_key="chat_history")

In [38]:
llm_chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
agent = ZeroShotAgent(llm_chain=llm_chain, tools=tools, verbose=True)
agent_chain = AgentExecutor.from_agent_and_tools(
    agent=agent, tools=tools, verbose=True, memory=memory, handle_parsing_errors=True
)



In [39]:
agent_chain.invoke(input="What is the population of China in 2024?")



> Entering new AgentExecutor chain...
Thought: I should use the search tool to find the most recent data on China's population.
Action: Search
Action Input: "China population 2024"
Observation: The current population of China in 2024 is 1,425,178,782, a 0.03% decline from 2023.
Thought: I should double check this information with another source to ensure accuracy.
Action: Search
Action Input: "China population 2024 statistics"
Observation: Population of China in 2024 The population of China stood at 1.43 billion in January 2024. Data shows that China's population decreased by 356 thousand (-0.02 percent) between early 2023 and the start of 2024. 49.0 percent of China's population is female, while 51.0 percent of the population is male.
Thought: This seems like a reliable source, but I should also consider the potential for population changes in the next few years.
Action: Search
Action Input: "Ch

{'input': 'What is the population of China in 2024?',
 'chat_history': '',
 'output': 'The current population of China in 2024 is estimated to be 1,425,178,782, with a slight decline from the previous year. However, this number may change as the year progresses.'}

In [40]:
memory.load_memory_variables({})

{'chat_history': 'Human: What is the population of China in 2024?\nAI: The current population of China in 2024 is estimated to be 1,425,178,782, with a slight decline from the previous year. However, this number may change as the year progresses.'}

In [41]:
agent_chain.invoke(input="Is it more or less than India?")



> Entering new AgentExecutor chain...
Thought: I should use the search tool to find the current population of India.
Action: Search
Action Input: "Population of India in 2024"
Observation: 1,438,268,310
Thought: Now I can compare the population of China and India.
Action: Compare
Action Input: 1,425,178,782 and 1,438,268,310
Observation: Compare is not a valid tool, try one of [Search].
Thought: I should use a calculator to compare the two populations.
Action: Use calculator
Action Input: 1,425,178,782 and 1,438,268,310
Observation: Use calculator is not a valid tool, try one of [Search].
Thought: I can simply subtract the two populations to find the difference.
Action: Subtract
Action Input: 1,438,268,310 - 1,425,178,782
Observation: Subtract is not a valid tool, try one of [Search].
Thought: I can use a search engine to find the difference between the two populations.
Action: S

{'input': 'Is it more or less than India?',
 'chat_history': 'Human: What is the population of China in 2024?\nAI: The current population of China in 2024 is estimated to be 1,425,178,782, with a slight decline from the previous year. However, this number may change as the year progresses.',
 'output': "India's population in 2024 is estimated to be slightly higher than China's."}

In [42]:
print_with_type(memory.load_memory_variables({}))

"<class 'dict'>:"
{'chat_history': 'Human: What is the population of China in 2024?\n'
                 'AI: The current population of China in 2024 is estimated to '
                 'be 1,425,178,782, with a slight decline from the previous '
                 'year. However, this number may change as the year '
                 'progresses.\n'
                 'Human: Is it more or less than India?\n'
                 "AI: India's population in 2024 is estimated to be slightly "
                 "higher than China's."}


In [43]:
agent_chain.invoke(input="what is the population in China?")



> Entering new AgentExecutor chain...
Thought: I should use the search tool to find the most recent and accurate information.
Action: Search
Action Input: "Population of China"
Observation: {'type': 'population_result', 'place': 'China', 'population': '1.412 billion', 'year': '2021'}
Thought: This information is from 2021, I should try to find a more recent source.
Action: Search
Action Input: "Population of China 2024"
Observation: 1.43 billion
Thought: This is a more recent estimate, but I should keep in mind that it may change as the year progresses.
Final Answer: The estimated population of China in 2024 is 1.43 billion.

> Finished chain.


{'input': 'what is the population in China?',
 'chat_history': "Human: What is the population of China in 2024?\nAI: The current population of China in 2024 is estimated to be 1,425,178,782, with a slight decline from the previous year. However, this number may change as the year progresses.\nHuman: Is it more or less than India?\nAI: India's population in 2024 is estimated to be slightly higher than China's.",
 'output': 'The estimated population of China in 2024 is 1.43 billion.'}
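
The agent can resolve "it" in the follow-up question because the stored history string is pasted into the prompt before the new question. The sketch below imitates only that substitution step with plain `str.format`, using the same variable names as the suffix defined earlier. `render_suffix` is a hypothetical helper, and the history text is an illustrative placeholder rather than real output.

```python
SUFFIX = """Begin!
{chat_history}
Question: {input}
{agent_scratchpad}"""

def render_suffix(chat_history, question, scratchpad=""):
    """Fill the three template variables of the agent prompt suffix."""
    return SUFFIX.format(
        chat_history=chat_history,
        input=question,
        agent_scratchpad=scratchpad,
    )


# Illustrative history string in the "Human: ...\nAI: ..." buffer format.
toy_history = "Human: What is the population of China in 2024?\nAI: (previous answer)"
print(render_suffix(toy_history, "Is it more or less than India?"))
```

With an empty `chat_history`, the same question would reach the LLM with nothing for "it" to refer to.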

## 2. Long term memory with vector storage 

In this section, we embed the first chapter of the famous Harry Potter book into a vector store and try some similarity searches. Some extra examples are commented out; you can uncomment them and try them one by one. If you observe the results carefully, you may discover the characteristics of similarity search.

### 2.1 Loaders and Splitters

#### PDF Loaders

In [44]:
from langchain_community.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader

data = PyPDFLoader("/share/lab4/harry-potter-chap-1.pdf").load()


In [45]:
# Note: If you're using PyPDFLoader then it will split by page for you already

print(f'You have {len(data)} document(s) in your data')
for i, d in enumerate(data):
    print(f'There are {len(d.page_content)} characters in doc {i}')

You have 16 document(s) in your data
There are 1848 characters in doc 0
There are 2101 characters in doc 1
There are 2093 characters in doc 2
There are 1898 characters in doc 3
There are 1892 characters in doc 4
There are 1300 characters in doc 5
There are 1867 characters in doc 6
There are 1806 characters in doc 7
There are 1548 characters in doc 8
There are 1573 characters in doc 9
There are 1635 characters in doc 10
There are 1792 characters in doc 11
There are 1542 characters in doc 12
There are 1399 characters in doc 13
There are 1882 characters in doc 14
There are 1921 characters in doc 15


#### Text file loader

In [46]:
from langchain_community.document_loaders import TextLoader

union = TextLoader("/share/lab4/state_of_the_union.txt").load()

#### Text Splitters

From the LangChain documentation:

RecursiveCharacterTextSplitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.
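
To make the "try separators in order" idea concrete, here is a simplified pure-Python sketch. It is an illustration of the strategy described above, not LangChain's implementation: `recursive_split` is a hypothetical function, it ignores chunk overlap, it drops empty pieces, and it measures length in characters only.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters,
    preferring coarse separators and falling back to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging while the chunk still fits
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb ccc\n\nddd eee", chunk_size=7))
```

In the example, the paragraph break is tried first; only the first paragraph is too long, so only that paragraph is split again on spaces.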

In [47]:
# You can experiment with different chunk_size and chunk_overlap values.
# This is optional; try it out on your own data.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(data)

In [48]:
print (f'Now you have {len(texts)} documents')

for t in texts:
    print(t.page_content[:100])
    print("=========")

Now you have 70 documents
CHAPTER ONE  
 
THE BOY WHO LIVED  
 
Mr. and Mrs. Dursley, of number four, Privet Drive, were proud
have a very large mustac he. Mrs. Dursley was thin and blonde and had  
nearly twice the usual amoun
think they could bear it if anyone found out about the Potters. Mrs.  
Potter was Mrs. Dursley's sis
Potters had a small son, too, but they had never even seen him. This boy  
was another good reason f
work, and Mrs. Dursley gossiped away happily as she wrestled a 
screaming  
Dudley into his high cha
into his car and backed out of number four's drive.  
 
It was on the corner of the street that he n
stared at the cat. It stared  back. As Mr. Dursley drove around the  
corner and up the road, he wat
But on the edge of town, drills were driven out of his mind by something  
else. As he sat in the us
wheel and his eyes fell on a huddle of these weirdos standing quite  
close by. They were whispering
nerve of him! But then it struck Mr. Dursley that this was probab

There are different kinds of splitters. The tool at https://chunkviz.up.railway.app/ is a great way to see how they differ under different chunk_size and chunk_overlap settings.
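
To see what `chunk_overlap` means in isolation, consider the simplest possible splitter: fixed-size character windows in which each chunk repeats the last few characters of the previous one. This is a character-level simplification for illustration only; `sliding_chunks` is a hypothetical function, and separator-aware splitters decide their boundaries differently.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; neighbouring windows share
    chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=0))
```

A larger overlap yields more chunks for the same text, but makes it less likely that a sentence relevant to a query is cut in half at a chunk boundary.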

In [49]:
#### Your TASK ####
# Try different PDF loaders. Which one works best for the file /share/lab4/hp-book1.pdf,
# which contains the full text of Harry Potter Book 1, with all the illustrations?

# LangChain provides many other loaders; read the documentation to find out the differences.
# See https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf
loader = UnstructuredPDFLoader("/share/lab4/hp-book1.pdf")

# loader = PyPDFLoader("/share/lab4/hp-book1.pdf")
# loader = PDFMinerLoader("example_data/layout-parser-paper.pdf")
documents = loader.load()

In [50]:
# s = docs[0].page_content
# print(len(s))
FullTexts = text_splitter.split_documents(documents)
print (f'Now you have {len(FullTexts)} documents')

for t in FullTexts:
    print(t.page_content[:100])
    print("=========")

Now you have 1251 documents
Harry Potter

And the Sorcerer’s Stone

ALSO BY J. K. ROWLING

Harry Potter and the Sorcerer’s Stone
Harry Potter and the Deathly Hallows

Year Seven at Hogwarts

Harry Potter and the Sorcerer’s Stone

Text copyright © 1997 by J.K. Rowling Illustrations by Mary GrandPré copyright © 1998 Warner Bros. A
No part of this publication may be reproduced, or stored in a retrieval system, or transmitted in an
Rowling, J.K. Harry Potter and the Sorcerer's Stone / by J.K. Rowling p. cm. Summary: Rescued from t
64 65 66 67 68 69 70 71 72 05 Printed in U.S.A. 10 First American edition, October 1998

Contents

O
Halloween · 163

ELEVEN

Quidditch · 180

TWELVE

The Mirror of Erised · 194

THIRTEEN

Nicholas Fla
C H A P T E R O N E

THE BOY WHO LIVED

M

r. and Mrs. Dursley, of number four, Privet Drive, were p
Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man 
The Dursleys had everything they wanted, but they also had a se

In [51]:
loader = PyPDFLoader("/share/lab4/hp-book1.pdf")
documents = loader.load()

In [52]:

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
FullTexts = []

for d in documents:
    text = text_splitter.split_documents([d])
    FullTexts.extend(text)
print (f'Now you have {len(FullTexts)} documents')


Now you have 1155 documents


In [53]:
for t in FullTexts:
    print(t.page_content[:100])
    print("=========")


Harry Potter 
And the  Sorcerer’s Stone
ALSO BY J . K. ROWLING  
Harry Potter and the Sorcerer’s Stone 
Year One at Hogwarts 
Harry Potter a
Harry Potter  
and the Sorcerer’s Stone  
 
 BY 
J. K. Rowling 
ILLUSTRATIONS BY Mary GrandPré 
 
  
For Jessica, who loves stories 
for Anne, who loved them too; and for Di, who heard this one first. 
HARRY POTTER  and all related characters and elements are trademarks of Warner Bros. 
 
No part of t
Library of Congress Cataloging-in-Publication Data 
 
Rowling, J.K. 
Harry Potter and the Sorcerer's
PZ7.R79835Har    1998 
[Fic] — dc21    97-39059 
 
64 65 66 67 68 69 70 71 72    05 
Printed in U.S.
 5  Contents 
ONE 
The Boy Who Lived · 1 
TWO 
The Vanishing Glass · 18 
THREE 
The Letters from N
Contents 
 6  NINE 
The Midnight Duel · 143 
TEN 
Halloween · 163 
ELEVEN 
Quidditch · 180 
TWELVE
Harry Potter 
And the  Sorcerer’s Stone
C H A P T E R  O N E 
 
 1  THE BOY WHO LIVED 
 
 
 
r. and Mrs. Dursley, of numb er four, Privet 
though he d

In [54]:
from langchain.document_loaders import PDFMinerLoader
loader = PDFMinerLoader("/share/lab4/hp-book1.pdf")
documents = loader.load()

In [55]:
FullTexts = text_splitter.split_documents(documents)
print (f'Now you have {len(FullTexts)} documents')

for t in FullTexts:
    print(t.page_content[:100])
    print("=========")

Now you have 1372 documents
Harry Potter 

And the  Sorcerer’s Stone 

 
 
 
 
 
 
 
 
 
 
 
 
ALSO BY J. K. ROWLING 

Harry Po
Year Six at Hogwarts 

Harry Potter and the Deathly Hallows 

Year Seven at Hogwarts 

 
 
 
 
 
 
 
Text copyright © 1997 by J.K. Rowling 
Illustrations by Mary GrandPré copyright © 1998 Warner Bros. 
No part of this publication may be reproduced, or stored in a retrieval system, or transmitted 
in a
Rowling, J.K. 
Harry Potter and the Sorcerer's Stone / by J.K. Rowling 
p.    cm. 
Summary: Rescued 
64 65 66 67 68 69 70 71 72    05 
Printed in U.S.A.     10 
First American edition, October 1998 

 
NINE 

The Midnight Duel · 143 

TEN 

Halloween · 163 

ELEVEN 

Quidditch · 180 

TWELVE 

The Mir
THE BOY WHO LIVED 

M 

r. and Mrs. Dursley, of number four, Privet Drive, were 
proud  to  say  tha
Mr.  Dursley  was  the  director  of  a  firm  called  Grunnings,  which 
made  drills.  He  was  a 
son called Dudley and in their opinion there was no finer boy a

### 2.2 Create embeddings of your documents

An embedding model turns a piece of text into a vector, so that we can "semantically search" for the splits of a document that are related to a query.
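To make "semantic search" concrete, here is a minimal sketch of how two embedding vectors are usually compared. It assumes cosine similarity as the measure, and the three-dimensional vectors are made up for illustration; a real model such as all-MiniLM-L6-v2 produces 384-dimensional vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings", invented for this example only.
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 0.8]   # points in nearly the same direction as the query
far_vec = [0.0, 1.0, 0.0]     # orthogonal to the query

# The related vector scores higher than the unrelated one.
print(cosine_similarity(query_vec, close_vec) > cosine_similarity(query_vec, far_vec))
```

Texts whose vectors point in a similar direction are treated as semantically related, even if they share few exact words.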

In [56]:
# OpenAI embeddings are slow and expensive, so we do not use them here.

# from langchain.embeddings.openai import OpenAIEmbeddings

# openai_embedding = OpenAIEmbeddings()

In [57]:
# Let's use the local ones.
# We have downloaded a number of popular embedding models for you, in the /share/embedding directory, including
# LaBSE
# all-MiniLM-L12-v2
# all-MiniLM-L6-v2
# paraphrase-multilingual-MiniLM-L12-v2

from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
minilm_embedding = SentenceTransformerEmbeddings(model_name="/share/embedding/all-MiniLM-L6-v2/")




### 2.4  Store and retrieve the embeddings in ChromaDB

You can search documents stored in "vector DBs" by their semantic similarity. Vector DBs use a k-nearest-neighbors (KNN) search to find the documents whose embeddings are closest to the embedding of the query.

We first introduce ChromaDB because it runs locally, is easy to set up, and, best of all, is free.
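The KNN idea can be sketched in a few lines of plain Python. This is a brute-force illustration under stated assumptions, not how ChromaDB is implemented: the helper names `squared_distance` and `knn_search` and the two-dimensional vectors are hypothetical, and production vector stores typically rely on approximate indexes rather than scanning every vector.

```python
import heapq

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_search(query_vec, stored, k=4):
    """Return the k stored texts whose vectors are closest to query_vec.

    stored is a list of (text, vector) pairs.
    """
    nearest = heapq.nsmallest(
        k, stored, key=lambda item: squared_distance(query_vec, item[1])
    )
    return [text for text, _ in nearest]

# Made-up chunks and vectors, for illustration only.
store = [
    ("chunk about drills", [0.9, 0.1]),
    ("chunk about owls", [0.1, 0.9]),
    ("chunk about cloaks", [0.2, 0.8]),
]

print(knn_search([0.0, 1.0], store, k=2))
# → ['chunk about owls', 'chunk about cloaks']
```

The default of `k=4` mirrors the four results that `similarity_search` returns in this notebook.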

In [58]:
# compute embeddings and save the embeddings into ChromaDB
from langchain.vectorstores import Chroma

chroma_dir = "/scratch2/chroma_db"
docsearch_chroma = Chroma.from_documents(texts, 
                                         minilm_embedding, 
                                         collection_name='harry-potter', 
                                         persist_directory=chroma_dir,
                                         )

In [59]:
# questions from https://en.wikibooks.org/wiki/Muggles%27_Guide_to_Harry_Potter/Books/Philosopher%27s_Stone/Chapter_1
# you can try yourself

# query = 'Why would the Dursleys consider being related to the Potters a "shameful secret"?'
# query = 'Who are the robed people Mr. Dursley sees in the streets?'
# query = 'What might a "Muggle" be?'
# query = 'What exactly is the cat on Privet Drive?'
query = '''Who might "You-Know-Who" be? Why isn't this person referred to by a given name?'''

In [60]:
## A utility function ...
def print_search_results(docs, l=None):
    print(f"search returned {len(docs)} results. ")
    for doc in docs:
        if l is None:
            print(doc.page_content)
        else:
            print(doc.page_content[:l])
        print("=============")


In [61]:
# semantic similarity search

docs = docsearch_chroma.similarity_search(query)
print_search_results(docs)

search returned 4 results. 
“My dear Professor, surely a sensible person like yourself can call 
him by his name? All this ‘You-Know-Who’ nonsense — for eleven 
years  I  have  been  trying  to  persuade  people  to  call  him  by  his 
proper name: Voldemort.” Professor McGonagall flinched, but Dum-
bledore, who was unsticking two lemon drops, seemed not to no-
tice. “It all gets so confusing if we keep saying ‘You-Know-Who.’ I 
have  never  seen  any  reason  to  be  frightened  of  saying  Voldemort’s 
name.”
"No, thank you," said Professor McGonagall coldly, as though she didn't  
think this was the moment for lemon drops. "As I say, even if  
You-Know -Who has gone -" 
 
"My dear Professor, surely a sensible person like yourself can call him  
by his name? All this 'You - Know -Who' nonsense -- for eleven years I  
have been trying to persuade people to call him by his proper name:  
Voldemort." Professor McGonagall flinched, but Dumbledore, who was
"No, thank you," said Professor

#### Saving and Loading your ChromaDB

In [62]:
# save to local disk
docsearch_chroma.persist()

In [63]:
# reload from disk
docsearch_chroma_reloaded = Chroma(persist_directory = chroma_dir,
                                   collection_name = 'harry-potter', 
                                   embedding_function = minilm_embedding)

In [64]:
# you can test with the previous or another query

query = 'Who are the robed people Mr. Dursley sees in the streets?'
docs = docsearch_chroma_reloaded.similarity_search(query)
print_search_results(docs)

search returned 4 results. 
But  on  the  edge  of  town,  drills  were  driven  out  of  his  mind  by 
something  else.  As  he  sat  in  the  usual  morning  traffic  jam,  he 
couldn’t  help  noticing  that  there  seemed  to  be  a  lot  of  strangely 
dressed  people  about.  People  in  cloaks.  Mr.  Dursley  couldn’t  bear 
people  who  dressed  in  funny  clothes  —  the  getups  you  saw  on 
young  people!  He  supposed  this  was  some  stupid  new  fashion.  He
c h a p t e r o n e 1 the boy who lived r. and mrs. dursley, of numb er four, privet drive, were proud to say that they were perfectly normal, thank you very much. they were the last people you ’ d expect to be in - volved in anything strange or myst erious, because they just didn ’ t hold with such nonsense. mr. dursley was the director of a firm called grunnings, which made drills. he was a big, beef y man with hardly any neck, al - though he did have a very large mustache. mrs. dursley was thin and blonde and had

In [65]:
#### Your TASK ####
# With your chosen PDF loader, test different splitters and chunk sizes until the chunking makes sense.
# You can also try different embeddings.
# Then embed the entire book 1 into ChromaDB.
loader = PyPDFLoader("/share/lab4/hp-book1.pdf")
documents = loader.load()

In [66]:

# from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
text_splitter = SentenceTransformersTokenTextSplitter(tokens_per_chunk = 300, chunk_overlap=20)
Newtexts = text_splitter.split_documents(documents)
# print_search_results(texts)

In [67]:
print(len(Newtexts))
print_search_results(Newtexts)

595
search returned 595 results. 
harry potter and the sorcerer ’ s stone
also by j. k. rowling harry potter and the sorcerer ’ s stone year one at hogwarts harry potter and the chamber of secrets year two at hogwarts harry potter and the prisoner of azkaban year three at hogwarts harry potter and the goblet of fire year four at hogwarts harry potter and the order of the phoenix year five at hogwarts harry potter and the half - blood prince year six at hogwarts harry potter and the deathly hallows year seven at hogwarts
harry potter and the sorcerer ’ s stone by j. k. rowling illustrations by mary grandpre arthur a. l evine books an imprint of scholastic press.
for jessica, who loves stories for anne, who loved them too ; and for di, who heard this one first. text copyright © 1997 by j. k. rowling illustrations by mary grandpr e copyright © 1998 warner bros. all rights reserved. published by scholastic press, a division of scholastic inc., publishers since 1920 scholastic, scholastic p

In [68]:
# compute embeddings and save the embeddings into ChromaDB
from langchain.vectorstores import Chroma

chroma_dir = "/scratch2/chroma_db"
docsearch_chroma = Chroma.from_documents(Newtexts, 
                                         minilm_embedding, 
                                         collection_name='harry-potter-new', 
                                         persist_directory=chroma_dir,
                                         )

query = '''Who might "You-Know-Who" be? Why isn't this person referred to by a given name?'''
# semantic similarity search

docs = docsearch_chroma.similarity_search(query)
print_search_results(docs)
# save to local disk
docsearch_chroma.persist()
docsearch_chroma_reloaded_new = Chroma(persist_directory = chroma_dir,
                                   collection_name = 'harry-potter-new', 
                                   embedding_function = minilm_embedding)

search returned 4 results. 
i m b y h i s proper name : voldemort. ” professor mcgonagall flinched, but dum - bledore, who was unsticking two lemon drops, seemed not to no - tice. “ it all gets so confusing if we keep saying ‘ you - know - who. ’ i have never seen any reason to be frightened of saying voldemort ’ s name. ” “ i know you haven ’ t, ” said pro fessor mcgonagall, sounding half exasperated, half admiring. “ but you ’ re different. everyone knows you ’ re the only one you - know - oh, all right, voldemort, was fright - ened of. ” “ you
at the dursleys. “ w e l l, i t ’ s b e s t y e h k n o w a s m u c h a s i c a n t e l l y e h — m i n d, i can ’ t tell yeh everythin ’, it ’ s a great myst ’ ry, parts of it.... ” he sat down, stared into the fi re for a few seconds, and then said, “ it begins, i suppose, with — with a person called — but it ’ s incredible yeh don ’ t know hi s name, everyone in our world knows — ” “ who? ” “ well — i don ’ like sayin
wasn ’ t such an unusu

In [69]:
query = 'Who are the robed people Mr. Dursley sees in the streets?'
docs = docsearch_chroma_reloaded_new.similarity_search(query)
print_search_results(docs)

search returned 4 results. 
c h a p t e r o n e 1 the boy who lived r. and mrs. dursley, of numb er four, privet drive, were proud to say that they were perfectly normal, thank you very much. they were the last people you ’ d expect to be in - volved in anything strange or myst erious, because they just didn ’ t hold with such nonsense. mr. dursley was the director of a firm called grunnings, which made drills. he was a big, beef y man with hardly any neck, al - though he did have a very large mustache. mrs. dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. the dursleys had a small son called dudley and in their op inion there was no finer boy any - where. the du
c h a p t e r o n e 1 the boy who lived r. and mrs. dursley, of numb er four, privet drive, were proud to say that they were perfectly normal, thank you very much. they were the last people y

### 2.5 Query those docs with a QA chain

In [70]:
from langchain_openai import OpenAI
from langchain.chains.question_answering import load_qa_chain

In [71]:
llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")
chain = load_qa_chain(llm, chain_type="stuff", verbose=True)

In [72]:
query = "How did Harry's parents die?"
docs = docsearch_chroma_reloaded.similarity_search(query)

In [73]:
chain.run(input_documents=docs, question=query)





> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

And there were his mother and father smiling at him again, and 
one of his grandfathers nodding happily. Harry sank down to sit on 
the floor in front of the mirror. There was nothing to stop him from 
staying here all night with his family. Nothing at all.

CHAPTER  TWO 
 30  couldn’t remember his parents at  all. His aunt and uncle never 
spoke about them, and of course he  was forbidden to ask questions. There were no photographs of them in the house. When he had been younger, Harry  had dreamed and dreamed of 
some unknown relation coming to take him away, but it had never 
happened; the Dursleys were his only family. Yet sometimes he thought (or maybe hoped) that strangers in the stree

" Harry's parents died when he was a baby, but the exact circumstances of their death are not mentioned in this context."
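The verbose trace above shows what the "stuff" chain type does: it pastes every retrieved chunk into a single prompt and makes one LLM call. A minimal sketch of that prompt assembly follows. The instruction sentence is copied from the trace; the `Question:` / `Helpful Answer:` tail and the helper name `build_stuff_prompt` are assumptions, since the trace is truncated before the end of the prompt.

```python
# Instruction text taken from the verbose trace; the tail after {context}
# is an assumption because the trace above is cut off.
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

def build_stuff_prompt(chunks, question):
    """Join all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)

prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "How did Harry's parents die?",
)
print(prompt)
```

Because everything goes into one prompt, the answer can only be as good as the retrieved chunks, and the total length must fit in the model's context window.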

In [74]:
#### Your Task ####

# Rebuild the chain from the whole book ChromaDB.  Test with one of the following questions (of your choice).

query = 'Why does Dumbledore believe the celebrations may be premature?'
#query = 'Why is Harry left with the Dursleys rather than a Wizard family?'
#query = 'Why does McGonagall seem concerned about Harry being raised by the Dursleys?'
chain = load_qa_chain(llm, chain_type="stuff", verbose=True)
docs = docsearch_chroma_reloaded_new.similarity_search(query)
chain.run(input_documents=docs, question=query)



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

at once. he slipped into a seat be - tween ron and hermione at the gryffindor table and tried to ig - nore the fact that people were standing up to look at him. fortunately, dumbledore arrived moments later. the babble died away. “ another year gone! ” dumbledore said cheerfully. “ and i must trouble you with an old man ’ s wheez ing waffle before we sink our teeth into our delicious feast. what a year it has been! hopefully your heads are all a little fuller than they were... you have the whole summer ahead to get them nice and empty before next year starts....

gryffindor table and tried to ig - nore the fact that people were standing up to look at him. fortunately, dumbledore arrived momen

' Dumbledore believes the celebrations may be premature because there is still a lot of work to be done and challenges to face in the future, and it is important to stay focused and not get too caught up in the excitement of the moment.'

In [75]:
#### Your Task ####

# Using langchain documentation, find out about the map reduce QA chain.  
# answer the following questions using the chain
chain = load_qa_chain(llm, chain_type="map_reduce")
# answer one of the following questions of your choice. 

# query = 'What happened in the Forbidden Forest during the first year of Harry Potter at Hogwarts?'
query = 'Tell me about Harry Potter and Quidditch during the first year'
out = docsearch_chroma_reloaded_new.similarity_search(query)
chain.run(input_documents=out, question=query)

' During his first year, Harry Potter played in his first Quidditch match for Gryffindor against Slytherin, with the potential to move up in the house championship. However, hardly anyone had seen him play because his captain, Wood, wanted to keep him a secret. Additionally, there were complaints about first years not being able to join the house Quidditch teams.'
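For contrast with "stuff", the `map_reduce` chain type used above runs the LLM once per chunk (map) and then once more over the partial results (reduce). The sketch below shows only that control flow. The prompt wording, the function name `map_reduce_answer`, and the `fake_llm` stand-in are assumptions made so the example runs offline; they are not LangChain's actual prompts.

```python
def map_reduce_answer(chunks, question, llm):
    """llm is any callable that maps a prompt string to a completion string."""
    # Map step: one LLM call per chunk, keeping only what is relevant.
    partial = [
        llm(f"Context:\n{chunk}\n\nExtract the text relevant to: {question}")
        for chunk in chunks
    ]
    # Reduce step: one final call over the (much shorter) partial results.
    summaries = "\n\n".join(partial)
    return llm(f"Summaries:\n{summaries}\n\nAnswer the question: {question}")

# Stand-in for a real model: it records each prompt and returns a numbered tag.
calls = []

def fake_llm(prompt):
    calls.append(prompt)
    return f"<answer {len(calls)}>"

result = map_reduce_answer(
    ["chunk A", "chunk B", "chunk C"], "What about Quidditch?", fake_llm
)
print(len(calls), result)
# → 4 <answer 4>
```

Three chunks cost four LLM calls here, so `map_reduce` trades extra calls for the ability to handle more text than fits in a single prompt.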

### 2.6 Using Pinecone, an online vector DB

There are many reasons to host your vector DB online in a SaaS / PaaS service. For example:
- you want to scale the queries to many concurrent users
- you want more data reliability without having to worry about DB management
- you want to share the DB without owning any servers

If you want to store your embeddings online, try Pinecone with the code below. First go to [Pinecone.io](https://www.pinecone.io/) and set up an account. Then generate an API key and create an "index"; both can be done from the homepage once you have logged in to Pinecone.

In [76]:
import pinecone
from langchain.vectorstores import Pinecone

# initialize pinecone, depends on two environment variables, os.environ['PINECONE_API_KEY'] and os.environ['PINECONE_API_ENV']
pinecone.Pinecone()

# You should create an index for your vector DB.
# The "dimension" setting you choose when creating the index online should be 1536 for OpenAI embeddings, or 384 for MiniLM.
index_name = os.getenv('PINECONE_INDEX_NAME')

In [77]:
docsearch_pinecone = Pinecone.from_texts(
                                [ t.page_content for t in texts ], 
                                minilm_embedding, 
                                index_name=index_name)

In [78]:
llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")
chain = load_qa_chain(llm, chain_type="stuff", verbose=True)
query = "How did Harry's parents die?"
docs = docsearch_pinecone.similarity_search(query)
chain.run(input_documents=docs, question=query)

# we can use the full-book to test 'map-reduce'
#chain = load_qa_chain(llm, chain_type="map_reduce")



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

large pink beac h ball wearing different -colored bonnets -- but Dudley  
Dursley was no longer a baby, and now the photographs showed a large  
blond boy riding his first bicycle, on a carousel at the fair, playing a  
computer game with his father, being hugged and kissed b y his mother.  
The room held no sign at all that another boy lived in the house, too.  
 
Yet Harry Potter was still there, asleep at the moment, but not for

large pink beac h ball wearing different -colored bonnets -- but Dudley  
Dursley was no longer a baby, and now the photographs showed a large  
blond boy riding his first bicycle, on a carousel at the fair, playing a  
computer game with his father, being hugged 

" Harry's parents died in a car accident caused by Lord Voldemort."

In [79]:
# query with pinecone
query = 'What exactly is the cat on Privet Drive?'
docs = docsearch_pinecone.similarity_search(query)
print(docs[0].page_content[:600])

Privet Drive. It didn't so much as quiver when a car door slammed on the  
next street, nor when two owls swooped overhead. In fact, it was nearly  
midnight before the cat moved  at all.  
 
A man appeared on the corner the cat had been watching, appeared so  
suddenly and silently you'd have thought he'd just popped out of the  
ground. The cat's tail twitched and its eyes narrowed.  
 
Nothing like this man had ever been seen on Pri vet Drive. He was tall,


In [80]:
#### Your Task ####
# modify the QA chain in Section 2.5 (Chapter 1 only) to use pinecone instead of ChromaDB
query = 'Tell me about Harry Potter and Quidditch during the first year'
docs = docsearch_pinecone.similarity_search(query)
chain = load_qa_chain(llm, chain_type="map_reduce")
chain.run(input_documents=docs, question=query)

' During the first year, Harry Potter was predicted to become famous and a legend, with books being written about him and every child knowing his name. However, there is no mention of Quidditch in this excerpt.'

### 2.7 Use vector store in Agent

In this section, we are going to create a simple QA agent that decides by itself which of two vector stores to query, depending on the subject of the question.

#### Preparing the tools for the agent.

We will use our Chroma-based Harry Potter vector DB, and create another one containing President Biden's State of the Union speech.

In [81]:
from langchain.document_loaders import TextLoader

documents = TextLoader('/share/lab4/state_of_the_union.txt').load()
texts = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(documents)
docsearch3 = Chroma.from_documents(texts, 
                                   minilm_embedding, 
                                   collection_name="state-of-union", 
                                   persist_directory="/scratch2/chroma_db")
docsearch3.persist()

To allow the agent to query these databases, we need to define two RetrievalQA chains.

In [82]:
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI

llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")

harry_potter = RetrievalQA.from_chain_type(llm=llm, 
                                           chain_type="stuff", 
                                           retriever=docsearch_chroma_reloaded.as_retriever())
state_of_union = RetrievalQA.from_chain_type(llm=llm, 
                                             chain_type="stuff", 
                                             retriever=docsearch3.as_retriever())

In [83]:
# Now try both chains

print_with_type(harry_potter.invoke('Why does McGonagall seem concerned about Harry being raised by the Dursleys?'))
print_with_type(state_of_union.invoke("what is the GDP increase last year?"))

"<class 'dict'>:"
{'query': 'Why does McGonagall seem concerned about Harry being raised by the '
          'Dursleys?',
 'result': " McGonagall is concerned because she knows that Harry's aunt and "
           'uncle, the Dursleys, are not fond of magic and may not treat Harry '
           "well because of his magical abilities. She also knows that Harry's "
           'cousin, Dudley, is spoiled and may bully Harry.'}
"<class 'dict'>:"
{'query': 'what is the GDP increase last year?',
 'result': ' The GDP increase last year was 5.7%.'}


In [84]:
from langchain.agents import AgentType, Tool
from langchain_openai import OpenAI

# define tools
tools = [
    Tool(
        name="State of Union QA System",
        func=state_of_union.run,
        description="useful for when you need to answer questions about the most recent state of the union address. Input should be a fully formed question.",
    ),
    Tool(
        name="Harry Potter QA System",
        func=harry_potter.run,
        description="useful for when you need to answer questions about Harry Potter. Input should be a fully formed question.",
    ),
]

Now we can create the Agent giving both chains as tools. 

In [85]:
from langchain.agents import initialize_agent


# Construct the agent. We will use the default agent type here.
# See documentation for a full list of options.
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)




In [86]:
agent.run(
    "What did biden say about ketanji brown jackson?"
)



> Entering new AgentExecutor chain...
I should use the State of Union QA System to find the answer
Action: State of Union QA System
Action Input: What did biden say about ketanji brown jackson?
Observation: Biden nominated Circuit Court of Appeals Judge Ketanji Brown Jackson to serve on the United States Supreme Court.
Thought: I now know the final answer
Final Answer: Biden nominated Circuit Court of Appeals Judge Ketanji Brown Jackson to serve on the United States Supreme Court.

> Finished chain.


'Biden nominated Circuit Court of Appeals Judge Ketanji Brown Jackson to serve on the United States Supreme Court.'

In [87]:
agent.run(
    "'Why does McGonagall seem concerned about Harry being raised by the Dursleys?'"
)



> Entering new AgentExecutor chain...
You should always think about what to do
Action: Harry Potter QA System
Action Input: 'Why does McGonagall seem concerned about Harry being raised by the Dursleys?'
Observation: McGonagall may be concerned because she knows that the Dursleys are not kind or loving towards Harry, and she worries about the impact this may have on his well-being and development. She may also be concerned about the influence of the Dursleys' negative attitudes and beliefs on Harry's upbringing.
Thought: You should always think about what to do
Action: Harry Potter QA System
Action Input: 'What is the relationship between McGonagall and Harry?'
Observation: Professor McGonagall is a professor at Hogwarts School of Witchcraft and Wizardry and Harry is one of her students. She also seems to have a fondness for Harry's father, who was a former student and Quidditch player.
Thought:

'McGonagall is concerned about Harry being raised by the Dursleys because she knows they are not kind to him and she worries about the impact on his well-being and development. She also has a close relationship with Harry as his professor and seems to have a fondness for his father.'

We can see that the agent can "smartly" choose which QA system to use given a specific question. 
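The traces above follow a fixed text pattern: the LLM writes an `Action:` line naming a tool and an `Action Input:` line with the question, and the executor runs that tool and feeds the result back as an `Observation:`. The sketch below is a simplified illustration of that dispatch step, not LangChain's actual parser. The function `parse_action` and the lambda tools are hypothetical stand-ins for the two RetrievalQA chains.

```python
import re

def parse_action(llm_output):
    """Pull the tool name and tool input out of a ReAct-style completion.

    Returns None when the completion contains no Action / Action Input pair.
    """
    match = re.search(
        r"Action:\s*(.*?)\s*\nAction Input:\s*(.*)", llm_output, re.DOTALL
    )
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()

# Stand-in tools keyed by the same names the agent was given.
tools_by_name = {
    "State of Union QA System": lambda q: f"[state-of-union answer to: {q}]",
    "Harry Potter QA System": lambda q: f"[harry-potter answer to: {q}]",
}

# Completion text copied from the first agent trace above.
completion = (
    "I should use the State of Union QA System to find the answer\n"
    "Action: State of Union QA System\n"
    "Action Input: What did biden say about ketanji brown jackson?"
)

name, tool_input = parse_action(completion)
observation = tools_by_name[name](tool_input)
print(name, "->", observation)
```

The choice of tool is made entirely by the LLM from the tool descriptions in the prompt, which is why clear, distinct descriptions matter.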

## 3 Your Task: putting it all together: OpenAI and LangChain

In [88]:
#### Your Task ####

# This is a major task that requires some thinking and time. 

# Build a conversation system from a collection of research papers of your choice. 
# You should be able to ask specific questions about a method in these papers, and the agent should return a brief answer (no more than 100 words).
# Save your data and ChromaDB in the /share directory so other people can use it. 
# Provide at least three query examples so the TAs can review your work. 

# You may use any tool from the past four labs or from the langchain docs, or any open source project. 

# write a summary (a Markdown cell) at the end of the notebook summarizing what works and what does not. 

In [121]:
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader
import os
root_dir = '/share/lab4/zhh/2023040163_lab4_pdfs'
files = os.listdir(root_dir)
files = [file for file in files if file.endswith('.pdf')]
names = [file.replace('.pdf','') for file in files ]
print(len(names))
print(names[5])

15
NeurIPS-2023-l_2-uniform-stability-of-randomized-learning-algorithms-sharper-generalization-bounds-and-confidence-boosting-Paper-Conference


In [122]:
print(names[4])

NeurIPS-2023-p-value-adjustment-for-monotonous-unbiased-and-fast-clustering-comparison-Paper-Conference


In [123]:
names = [' '.join([word.capitalize() for word in name.replace('NeurIPS-2023','').replace('Paper-Conference','').split('-')]).strip() for name in names]

In [124]:
print(names[4])

P Value Adjustment For Monotonous Unbiased And Fast Clustering Comparison


In [125]:
from langchain.agents import AgentType, Tool
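def CheckLength(answer: str) -> bool:
    """Check whether the answer is at most 100 words long. If not (the return value is False), you should generate another answer."""
    # split() with no argument splits on any run of whitespace, so repeated spaces are not counted as words
    return len(answer.split()) <= 100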

In [126]:
tools = [
    Tool(
        name="Check Length Tool",
        func=CheckLength,
        description="useful when you finish answering a question and want to submit your answer. You should check whether the length of your answer is valid, i.e. no longer than 100 words.",
    ),   
]

In [127]:
from langchain.chains import RetrievalQA

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
paper_db_dir = "/share/lab4/zhh/chroma_db"

for i, file in enumerate(files):
    # Load one paper and split it into chunks of at most 500 characters
    document = PyPDFLoader(os.path.join(root_dir, file)).load()
    texts = splitter.split_documents(document)

    # Embed the chunks into a per-paper collection and persist it to disk
    paper_store = Chroma.from_documents(texts,
                                        minilm_embedding,
                                        collection_name=f"paper_{i}",
                                        persist_directory=paper_db_dir)
    paper_store.persist()

    # Expose each paper as its own retrieval QA tool for the agent
    tools.append(Tool(
        name=f"Expert on topic {names[i]}",
        func=RetrievalQA.from_chain_type(llm=llm,
                                         chain_type="stuff",
                                         retriever=paper_store.as_retriever()).run,
        description=f"useful for when you need to answer questions about the topic {names[i]}. Input should be a fully formed question.",
    ))

In [128]:
print(len(tools))
print(tools[8].name)

16
Expert on topic Textbfa2textbfcid2 Accelerating Asynchronous Communication In Decentralized Deep Learning


In [129]:
llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")

In [130]:
from langchain.agents import ZeroShotAgent
from langchain.memory import ConversationBufferMemory

prompt = ZeroShotAgent.create_prompt(
    tools,
    prefix="""Have a conversation with a human, answering the following questions as best you can, but make sure that your answer is NO LONGER than 100 words. You have access to the following tools, but make sure that you use the `Check Length Tool` AT LEAST ONCE:""",
    suffix="""Begin!  
{chat_history}
Question: {input}
{agent_scratchpad}""",
    input_variables=["input", "chat_history", "agent_scratchpad"],
)

memory = ConversationBufferMemory(return_messages=True,memory_key="chat_history")

In [131]:
from langchain.agents import AgentExecutor, Tool, ZeroShotAgent
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain_community.utilities import SerpAPIWrapper
from langchain_openai import OpenAI

In [132]:
llm_chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
agent = ZeroShotAgent(llm_chain=llm_chain, tools=tools, verbose=True)
agent_chain = AgentExecutor.from_agent_and_tools(
    agent=agent, tools=tools, verbose=True, memory=memory, handle_parsing_errors=True
)

In [133]:
agent_chain.invoke("what is Super-CLEVR-3D?")



> Entering new AgentExecutor chain...
Thought: I should ask the expert on topic 3d Aware Visual Question Answering About Parts Poses And Occlusions
Action: Expert on topic 3d Aware Visual Question Answering About Parts Poses And Occlusions
Action Input: what is Super-CLEVR-3D?
Observation: Super-CLEVR-3D is a dataset used for studying 3D-aware visual question answering (VQA). It contains images rendered from scenes in the Super-CLEVR dataset and includes questions specifically about the 3D object configurations in the image.
Thought: I should check the length of my answer
Action: Check Length Tool
Action Input: Super-CLEVR-3D is a dataset used for studying 3D-aware visual question answering (VQA). It contains images rendered from scenes in the Super-CLEVR dataset and includes questions specifically about the 3D object configurations in the image.
Observation: True
Thought: I now know the final

{'input': 'what is Super-CLEVR-3D?',
 'chat_history': [HumanMessage(content='what is Super-CLEVR-3D?'),
  AIMessage(content='Super-CLEVR-3D is a dataset used for studying 3D-aware visual question answering (VQA). It contains images rendered from scenes in the Super-CLEVR dataset and includes questions specifically about the 3D object configurations in the image.')],
 'output': 'Super-CLEVR-3D is a dataset used for studying 3D-aware visual question answering (VQA). It contains images rendered from scenes in the Super-CLEVR dataset and includes questions specifically about the 3D object configurations in the image.'}

In [134]:
agent_chain.invoke("What are the main advantages of it?")



> Entering new AgentExecutor chain...
Thought: I should ask an expert on topic 3d Aware Visual Question Answering About Parts Poses And Occlusions
Action: Expert on topic 3d Aware Visual Question Answering About Parts Poses And Occlusions
Action Input: What are the main advantages of Super-CLEVR-3D?
Observation: The main advantages of Super-CLEVR-3D are its ability to ask questions about 3D poses and pose relationships between objects, as well as its use of regression and intersection-over-union techniques to accurately detect and associate parts with objects.
Thought: I now know the final answer
Final Answer: The main advantages of Super-CLEVR-3D are its ability to ask questions about 3D poses and pose relationships between objects, as well as its use of regression and intersection-over-union techniques to accurately detect and associate parts with objects.

> Finished chain.


{'input': 'What are the main advantages of it?',
 'chat_history': [HumanMessage(content='what is Super-CLEVR-3D?'),
  AIMessage(content='Super-CLEVR-3D is a dataset used for studying 3D-aware visual question answering (VQA). It contains images rendered from scenes in the Super-CLEVR dataset and includes questions specifically about the 3D object configurations in the image.'),
  HumanMessage(content='What are the main advantages of it?'),
  AIMessage(content='The main advantages of Super-CLEVR-3D are its ability to ask questions about 3D poses and pose relationships between objects, as well as its use of regression and intersection-over-union techniques to accurately detect and associate parts with objects.')],
 'output': 'The main advantages of Super-CLEVR-3D are its ability to ask questions about 3D poses and pose relationships between objects, as well as its use of regression and intersection-over-union techniques to accurately detect and associate parts with objects.'}

In [135]:
agent_chain.invoke("How is the total communication complexity in 2Direction defined?")



> Entering new AgentExecutor chain...
Thought: I should ask the expert on topic 2direction Theoretically Faster Distributed Training With Bidirectional Communication Compression.
Action: Expert on topic 2direction Theoretically Faster Distributed Training With Bidirectional Communication Compression
Action Input: Total communication complexity
Observation: The total communication complexity is a measure of the amount of communication required for a particular method or algorithm. It is defined as a convex combination of the w2s and s2w communication complexities, with the parameter r weighting the importance of uplink and downlink connections. The full complexities are given in equations (13) and (14), and the Round Cost of a method is defined as a constant that is equal to the number of communication rounds multiplied by the Round Cost.
Thought: I now know the final answer.
Final Answer: The total communication complexity in 2D

{'input': 'How is the total communication complexity in 2Direction defined?',
 'chat_history': [HumanMessage(content='what is Super-CLEVR-3D?'),
  AIMessage(content='Super-CLEVR-3D is a dataset used for studying 3D-aware visual question answering (VQA). It contains images rendered from scenes in the Super-CLEVR dataset and includes questions specifically about the 3D object configurations in the image.'),
  HumanMessage(content='What are the main advantages of it?'),
  AIMessage(content='The main advantages of Super-CLEVR-3D are its ability to ask questions about 3D poses and pose relationships between objects, as well as its use of regression and intersection-over-union techniques to accurately detect and associate parts with objects.'),
  HumanMessage(content='How is the total communication complexity in 2Direction defined?'),
  AIMessage(content='The total communication complexity in 2Direction is defined as a convex combination of the w2s and s2w communication complexities, with th

In [136]:
agent_chain.invoke("What is the main algorithm of 2Direction: A Fast Gradient Method Supporting Bidirectional Compression")



> Entering new AgentExecutor chain...
Thought: I should ask an expert on the topic 2Direction about the main algorithm.
Action: Expert on topic 2direction Theoretically Faster Distributed Training With Bidirectional Communication Compression
Action Input: What is the main algorithm of 2Direction?
Observation: The main algorithm of 2Direction is a fast bidirectional compressed communication algorithm with a bespoke error-feedback mechanism.
Thought: I now know the final answer.
Final Answer: The main algorithm of 2Direction is a fast bidirectional compressed communication algorithm with a bespoke error-feedback mechanism.

> Finished chain.


{'input': 'What is the main algorithm of 2Direction: A Fast Gradient Method Supporting Bidirectional Compression',
 'chat_history': [HumanMessage(content='what is Super-CLEVR-3D?'),
  AIMessage(content='Super-CLEVR-3D is a dataset used for studying 3D-aware visual question answering (VQA). It contains images rendered from scenes in the Super-CLEVR dataset and includes questions specifically about the 3D object configurations in the image.'),
  HumanMessage(content='What are the main advantages of it?'),
  AIMessage(content='The main advantages of Super-CLEVR-3D are its ability to ask questions about 3D poses and pose relationships between objects, as well as its use of regression and intersection-over-union techniques to accurately detect and associate parts with objects.'),
  HumanMessage(content='How is the total communication complexity in 2Direction defined?'),
  AIMessage(content='The total communication complexity in 2Direction is defined as a convex combination of the w2s and s2

In [137]:
agent_chain.invoke("What is the definition of monotonicity in the context of clustering?")



> Entering new AgentExecutor chain...
Thought: I should ask an expert on the topic P Value Adjustment For Monotonous Unbiased And Fast Clustering Comparison
Action: Expert on topic P Value Adjustment For Monotonous Unbiased And Fast Clustering Comparison
Action Input: What is the definition of monotonicity in the context of clustering?
Observation: Monotonicity in the context of clustering refers to the property of a clustering comparison measure where, as one clustering is changed to resemble another more, the measure should increase. This means that the measure should be consistent and not decrease when clusters are merged or split.
Thought: I now know the final answer
Final Answer: Monotonicity in the context of clustering refers to the property of a clustering comparison measure where, as one clustering is changed to resemble another more, the measure should increase. This means that the measure should be consistent and 

{'input': 'What is the definition of monotonicity in the context of clustering?',
 'chat_history': [HumanMessage(content='what is Super-CLEVR-3D?'),
  AIMessage(content='Super-CLEVR-3D is a dataset used for studying 3D-aware visual question answering (VQA). It contains images rendered from scenes in the Super-CLEVR dataset and includes questions specifically about the 3D object configurations in the image.'),
  HumanMessage(content='What are the main advantages of it?'),
  AIMessage(content='The main advantages of Super-CLEVR-3D are its ability to ask questions about 3D poses and pose relationships between objects, as well as its use of regression and intersection-over-union techniques to accurately detect and associate parts with objects.'),
  HumanMessage(content='How is the total communication complexity in 2Direction defined?'),
  AIMessage(content='The total communication complexity in 2Direction is defined as a convex combination of the w2s and s2w communication complexities, wit

In [139]:
agent_chain.invoke('What is the architecture of ElasticTransformer?')



> Entering new AgentExecutor chain...
Thought: I should ask the expert on topic Mathbfmathbbefwi Multiparameter Benchmark Datasets For Elastic Full Waveform Inversion Of Geophysical Properties Paper Datasets_and_benchmarks
Action: Expert on topic Mathbfmathbbefwi Multiparameter Benchmark Datasets For Elastic Full Waveform Inversion Of Geophysical Properties Paper Datasets_and_benchmarks
Action Input: What is the architecture of ElasticTransformer?
Observation: ElasticTransformer follows a similar seismic-encoder and velocity-decoder architecture design as the SimFWI described in [33]. It consists of two two-layer transformer encoders that take ux and uz as inputs and two two-layer transformer decoders to output VP and VS separately.
Thought: I now know the final answer
Final Answer: ElasticTransformer follows a similar seismic-encoder and velocity-decoder architecture design as the SimFWI described in [33]. It consists of two tw

{'input': 'What is the architecture of ElasticTransformer?',
 'chat_history': [HumanMessage(content='what is Super-CLEVR-3D?'),
  AIMessage(content='Super-CLEVR-3D is a dataset used for studying 3D-aware visual question answering (VQA). It contains images rendered from scenes in the Super-CLEVR dataset and includes questions specifically about the 3D object configurations in the image.'),
  HumanMessage(content='What are the main advantages of it?'),
  AIMessage(content='The main advantages of Super-CLEVR-3D are its ability to ask questions about 3D poses and pose relationships between objects, as well as its use of regression and intersection-over-union techniques to accurately detect and associate parts with objects.'),
  HumanMessage(content='How is the total communication complexity in 2Direction defined?'),
  AIMessage(content='The total communication complexity in 2Direction is defined as a convex combination of the w2s and s2w communication complexities, with the parameter r wei

## Summary

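From the results, I find that the chain works well for general questions (e.g. what something is, or what its main advantages are). With short-term memory added to the prompt, the model can also work out what the user is referring to from the conversation context: for the follow-up question `What are the main advantages of it?`, the agent rewrote "it" as "Super-CLEVR-3D" before calling the tool.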
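
To make the role of the memory concrete, here is a minimal sketch of how a conversation buffer can be prepended to the prompt. It is not LangChain's implementation: `BufferMemory` and `build_prompt` are hypothetical names invented for this illustration, and the stored answer is shortened from the output above.

```python
from dataclasses import dataclass, field


@dataclass
class BufferMemory:
    """Hypothetical stand-in for a conversation buffer: keeps (role, text) turns."""
    turns: list = field(default_factory=list)

    def save(self, user_input: str, ai_output: str) -> None:
        # Store one exchange as two turns, in the order they happened.
        self.turns.append(("Human", user_input))
        self.turns.append(("AI", ai_output))

    def as_text(self) -> str:
        # Render the whole history as plain text, one turn per line.
        return "\n".join(f"{role}: {text}" for role, text in self.turns)


def build_prompt(memory: BufferMemory, question: str) -> str:
    """Prepend the stored history so the model can resolve references such as 'it'."""
    return (
        "Previous conversation:\n"
        f"{memory.as_text()}\n\n"
        f"New question: {question}"
    )


memory = BufferMemory()
memory.save(
    "what is Super-CLEVR-3D?",
    "Super-CLEVR-3D is a dataset used for studying 3D-aware visual question answering (VQA).",
)
prompt = build_prompt(memory, "What are the main advantages of it?")
print(prompt)
```

Because the first exchange is part of the prompt text, the model has what it needs to replace "it" with "Super-CLEVR-3D" in the follow-up question.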

However, when the question is about specific details, the model's answer is sometimes imprecise (for example, for the question `What is the definition of monotonicity in the context of clustering?`). I think the reason is that the text splitter causes some information loss, for example when a chunk boundary separates a definition from its surrounding context.
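
The effect I suspect can be sketched with a naive fixed-size character splitter. This is an illustration only, not the splitter used in this notebook: `split_text` is a hypothetical helper, and the sentence is a toy example paraphrasing the agent's answer. It shows how a phrase that straddles a chunk boundary is not fully contained in any chunk unless neighbouring chunks overlap.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Naive fixed-size character splitter (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


# Toy sentence (an assumption for this example, not taken from the paper).
text = "Monotonicity: the measure should increase as one clustering resembles the other more."
phrase = "should increase"

no_overlap = split_text(text, chunk_size=30, chunk_overlap=0)
with_overlap = split_text(text, chunk_size=30, chunk_overlap=15)

# Without overlap, the phrase is cut in two by a chunk boundary.
print(any(phrase in chunk for chunk in no_overlap))
# With overlap, at least one chunk contains the whole phrase.
print(any(phrase in chunk for chunk in with_overlap))
```

If this is indeed the cause, increasing the overlap between chunks (LangChain's text splitters expose a `chunk_overlap` parameter) or splitting on sentence boundaries might make the detailed answers more precise, at the cost of storing more redundant text in the index.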