# Lec4. Adding Memory and Storage to LLMs

Last week, we learned the basic elements of the LangChain framework. In this lecture, we add memory to LLM applications and construct a vector store QA application from scratch.

>Reference:
> 1. [Ask A Book Questions](https://github.com/gkamradt/langchain-tutorials/blob/main/data_generation/Ask%20A%20Book%20Questions.ipynb)
> 2. [Agent Vectorstore](https://python.langchain.com/docs/modules/agents/how_to/agent_vectorstore)


## 0. Setup

1. Install the requirements.  (Already installed in your image.)
    ```
    pip install -r requirements.txt
    ```
2. Get your API keys. Get an OpenAI API key; for the Serpapi key, sign up for a free account at the [Serpapi website](https://serpapi.com/); for the Pinecone key, first register on the [Pinecone website](https://www.pinecone.io/), then use **Create API Key** and **Create Index**. Note that in this notebook the index's dimension should be 1536.

3. Store your keys in a file named **.env** and place it in the current path or in a location that can be accessed.
    ```
    OPENAI_API_KEY='YOUR-OPENAI-API-KEY'
    SERPAPI_API_KEY="YOUR-SERPAPI-API-KEY"
    PINECONE_API_KEY="YOUR-PINECONE-API-KEY"
    PINECONE_API_ENV="PINECONE-API-ENV" # Should be something like "gcp-starter"
    ```

In [31]:
#%pip install -r requirements.txt

In [32]:
from dotenv import load_dotenv
load_dotenv()

True
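
Before going further, it can help to confirm that the keys were actually loaded. The following is a minimal sketch, assuming the four variable names from the `.env` file above; `missing_keys` is a hypothetical helper written for this check, not part of `python-dotenv`.

```python
import os

# Assumption: these are the four variable names used in the .env file above.
REQUIRED_KEYS = [
    "OPENAI_API_KEY",
    "SERPAPI_API_KEY",
    "PINECONE_API_KEY",
    "PINECONE_API_ENV",
]

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Prints an empty list when every key is present in the environment.
print("Missing keys:", missing_keys(REQUIRED_KEYS))
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.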

In [33]:
import os
os.environ['HTTP_PROXY']="http://Clash:QOAF8Rmd@10.1.0.213:7890"
os.environ['HTTPS_PROXY']="http://Clash:QOAF8Rmd@10.1.0.213:7890"
os.environ['ALL_PROXY']="socks5://Clash:QOAF8Rmd@10.1.0.213:7893"

In [34]:
# A utility function

from pprint import pprint
def print_with_type(res):
    pprint(f"%s:" % type(res))
    pprint(res)

    #pprint(f"%s : %s" % (type(res), res))

## 1. Adding memory to remember the context

### 1.1 Use Conversation Buffer

#### Basic Use of ConversationBufferMemory

In [35]:
from langchain.memory import ConversationBufferMemory

# Create a memory and write to it.
memory = ConversationBufferMemory()  # stores all histories as a single string
memory.save_context({"input": "hi"}, 
                    {"output": "what's up"})
print_with_type(memory.load_memory_variables({}))

"<class 'dict'>:"
{'history': "Human: hi\nAI: what's up"}


We can also get the history as a list of messages (this is useful if you are using this with a chat model).

In [36]:
# get the history as a list of messages
memory = ConversationBufferMemory(return_messages=True)  # stores messages as a list
memory.save_context({"input": "hi"}, 
                    {"output": "what's up"})
print_with_type(memory.load_memory_variables({}))

"<class 'dict'>:"
{'history': [HumanMessage(content='hi'), AIMessage(content="what's up")]}
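
To see what the buffer is doing, here is a toy re-implementation of the two behaviours shown above. It is only a sketch: `TinyBufferMemory` is a hypothetical class, and it returns plain `(role, text)` tuples where LangChain returns message objects.

```python
class TinyBufferMemory:
    """Hypothetical stand-in that mimics the two behaviours shown above."""

    def __init__(self, return_messages=False):
        self.return_messages = return_messages
        self.turns = []  # list of (role, text) pairs, oldest first

    def save_context(self, inputs, outputs):
        # Record one human turn followed by one AI turn.
        self.turns.append(("Human", inputs["input"]))
        self.turns.append(("AI", outputs["output"]))

    def load_memory_variables(self, _inputs):
        if self.return_messages:
            # Return a copy so callers cannot mutate the stored turns.
            return {"history": list(self.turns)}
        # Otherwise flatten the whole conversation into a single string.
        text = "\n".join(f"{role}: {content}" for role, content in self.turns)
        return {"history": text}


memory_sketch = TinyBufferMemory()
memory_sketch.save_context({"input": "hi"}, {"output": "what's up"})
print(memory_sketch.load_memory_variables({}))
```

The key point is that nothing is summarized or dropped: every turn is kept, so the prompt grows with the length of the conversation.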


#### Managing Conversation Memory automatically in a chain

In [37]:
from langchain.chains import LLMChain
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, MessagesPlaceholder
from langchain_core.messages import SystemMessage
from langchain_openai import ChatOpenAI

In [38]:
prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content = """You are a chatbot having a conversation with a human. 
            Your name is Tom Marvolo Riddle. 
            You need to tell your name to that human if he doesn't know."""
        ),  # The persistent system prompt
        MessagesPlaceholder(
            variable_name = "chat_history"
        ),  # This is where the chat history from memory is injected.
        HumanMessagePromptTemplate.from_template(
            "{human_input}"
        ),  # This is where the human input will be injected
    ]
)

memory = ConversationBufferMemory(memory_key="chat_history", 
                                  return_messages=True)

In [39]:
# You can set verbose as True to see more details
llm = ChatOpenAI()

chat_llm_chain = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=True,
    memory=memory,   # Look at this line
)

In [40]:
chat_llm_chain.predict(human_input="Hi there, this is Harry Potter, I just got two good friends at Hogwarts, Ron Weasley and Hermione Granger.")

> Entering new LLMChain chain...
Prompt after formatting:
System: You are a chatbot having a conversation with a human. 
            Your name is Tom Marvolo Riddle. 
            You need to tell your name to that human if he doesn't know.
Human: Hi there, this is Harry Potter, I just got two good friends at Hogwarts, Ron Weasley and Hermione Granger.

> Finished chain.


"Hello there, Harry Potter. I am Tom Marvolo Riddle. It's a pleasure to meet you. It's great to hear that you've made some good friends at Hogwarts like Ron Weasley and Hermione Granger. How are you finding life at Hogwarts so far?"

In [41]:
# get a list of messages in the memory 
memory.load_memory_variables({})

{'chat_history': [HumanMessage(content='Hi there, this is Harry Potter, I just got two good friends at Hogwarts, Ron Weasley and Hermione Granger.'),
  AIMessage(content="Hello there, Harry Potter. I am Tom Marvolo Riddle. It's a pleasure to meet you. It's great to hear that you've made some good friends at Hogwarts like Ron Weasley and Hermione Granger. How are you finding life at Hogwarts so far?")]}

In [42]:
chat_llm_chain.predict(human_input="What are my best friends' names? ")

> Entering new LLMChain chain...
Prompt after formatting:
System: You are a chatbot having a conversation with a human. 
            Your name is Tom Marvolo Riddle. 
            You need to tell your name to that human if he doesn't know.
Human: Hi there, this is Harry Potter, I just got two good friends at Hogwarts, Ron Weasley and Hermione Granger.
AI: Hello there, Harry Potter. I am Tom Marvolo Riddle. It's a pleasure to meet you. It's great to hear that you've made some good friends at Hogwarts like Ron Weasley and Hermione Granger. How are you finding life at Hogwarts so far?
Human: What are my best friends' names? 

> Finished chain.


"Your best friends' names are Ron Weasley and Hermione Granger. They are both loyal and brave companions who have stood by you through thick and thin at Hogwarts."

In [43]:
# get a list of messages in the memory 
memory.load_memory_variables({})

{'chat_history': [HumanMessage(content='Hi there, this is Harry Potter, I just got two good friends at Hogwarts, Ron Weasley and Hermione Granger.'),
  AIMessage(content="Hello there, Harry Potter. I am Tom Marvolo Riddle. It's a pleasure to meet you. It's great to hear that you've made some good friends at Hogwarts like Ron Weasley and Hermione Granger. How are you finding life at Hogwarts so far?"),
  HumanMessage(content="What are my best friends' names? "),
  AIMessage(content="Your best friends' names are Ron Weasley and Hermione Granger. They are both loyal and brave companions who have stood by you through thick and thin at Hogwarts.")]}

In [44]:
memory.clear()
memory.load_memory_variables({})
chat_llm_chain.predict(human_input="What are my best friends' names? ")

> Entering new LLMChain chain...
Prompt after formatting:
System: You are a chatbot having a conversation with a human. 
            Your name is Tom Marvolo Riddle. 
            You need to tell your name to that human if he doesn't know.
Human: What are my best friends' names? 

> Finished chain.


'Hello there! I am Tom Marvolo Riddle. And you are?'

#### (Optional) Manage the memory yourself in a chain

In [45]:
from operator import itemgetter

from langchain.memory import ConversationBufferMemory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI

model = ChatOpenAI()
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful chatbot"),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{input}"),
    ]
)

memory = ConversationBufferMemory(return_messages=True)


In [46]:
# add memory to an arbitrary chain
chain = (
    RunnablePassthrough.assign(
        history=RunnableLambda(memory.load_memory_variables) | itemgetter("history")
    )
    | prompt
    | model
)

In [47]:
inputs = {"input": "Hi, I am Harry!"}
response = chain.invoke(inputs)
print_with_type(response)
print_with_type(memory.load_memory_variables({}))

"<class 'langchain_core.messages.ai.AIMessage'>:"
AIMessage(content='Hello Harry! How can I help you today?')
"<class 'dict'>:"
{'history': []}


In [48]:
# You need to save the context yourself
memory.save_context(inputs, 
                    {"output": response.content})
print_with_type(memory.load_memory_variables({}))

"<class 'dict'>:"
{'history': [HumanMessage(content='Hi, I am Harry!'),
             AIMessage(content='Hello Harry! How can I help you today?')]}


In [49]:
response = chain.invoke({"input": "What's my name?"})
print_with_type(response)

"<class 'langchain_core.messages.ai.AIMessage'>:"
AIMessage(content='Your name is Harry. Is there anything else you would like assistance with?')
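
Because this chain only reads from memory, every turn needs the same three steps: load the history, call the model, and save the new exchange. Below is a minimal sketch of that cycle with the model replaced by a stub, so it runs without an API key. `chat_turn` and `fake_model` are hypothetical names, and the history is a plain list rather than a LangChain memory object.

```python
def fake_model(history, user_input):
    """Stub for the LLM call: reports how many earlier messages it was shown."""
    return f"(reply to {user_input!r} after seeing {len(history)} earlier messages)"

def chat_turn(history, user_input, model=fake_model):
    """Run one turn: read history, call the model, then save both messages."""
    reply = model(list(history), user_input)   # 1. load the history, 2. invoke
    history.append(("human", user_input))      # 3. save the human turn
    history.append(("ai", reply))              #    and the model's reply
    return reply

history = []
print(chat_turn(history, "Hi, I am Harry!"))
print(chat_turn(history, "What's my name?"))
print(len(history))
```

Forgetting the save step is the usual mistake with this pattern: the chain keeps working, but each call sees an empty history.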


### 1.2 Using Entity memory

#### Basic Use of ConversationEntityMemory

Entity memory remembers given facts about specific entities in a conversation. It extracts information on entities (using an LLM) and builds up its knowledge about that entity over time (also using an LLM).
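
To make the two steps concrete, here is a toy sketch in which both LLM calls are replaced by simple rules: entities are words from a small fixed list, and the "knowledge" about an entity is just the list of sentences that mention it. All names here (`KNOWN_ENTITIES`, `extract_entities`, `update_store`) are hypothetical; the real `ConversationEntityMemory` delegates both steps to an LLM.

```python
import re

KNOWN_ENTITIES = {"Harry", "Ron", "London"}  # assumption: a fixed toy vocabulary

def extract_entities(text):
    """Stand-in for LLM entity extraction: keep words found in KNOWN_ENTITIES."""
    words = re.findall(r"[A-Za-z]+", text)
    return sorted(set(words) & KNOWN_ENTITIES)

def update_store(store, text):
    """Stand-in for LLM summarization: append the sentence to each entity's notes."""
    for entity in extract_entities(text):
        store.setdefault(entity, []).append(text)
    return store

entity_store = {}
update_store(entity_store, "Harry & Ron are going to rescue a baby dragon in London.")
update_store(entity_store, "Harry is a brave and clever boy.")
for name, notes in entity_store.items():
    print(name, "->", len(notes), "note(s)")
```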

In [50]:
from langchain_openai import OpenAI
from langchain.memory import ConversationEntityMemory
llm = OpenAI(temperature=0)

In [51]:
memory = ConversationEntityMemory(llm=llm, return_messages=True)
inputs = {"input": "Harry & Ron are going to rescue a baby dragon in London."}
memory.load_memory_variables(inputs)
memory.save_context(
    inputs,
    {"output": "That sounds like a great mission! What kind of mission are they working on?"}
)

memory.load_memory_variables({"input": "Harry and Ron and Wei and London?"})

{'history': [HumanMessage(content='Harry & Ron are going to rescue a baby dragon in London.'),
  AIMessage(content='That sounds like a great mission! What kind of mission are they working on?')],
 'entities': {'Harry': 'Harry is going to rescue a baby dragon in London.',
  'Ron': 'Ron is going to rescue a baby dragon in London with Harry.',
  'Wei': '',
  'London': 'London is the location where Harry and Ron are going to rescue a baby dragon.'}}

#### Using Entity in a chain

Here we use ConversationChain. It is a thin wrapper over LLMChain that adds prompts designed to make the LLM converse more smoothly. See its source code for details.

In [52]:
from langchain.chains import ConversationChain
from langchain.memory import ConversationEntityMemory
from langchain.memory.prompt import ENTITY_MEMORY_CONVERSATION_TEMPLATE
from pydantic import BaseModel
from typing import List, Dict, Any

In [53]:
conversation = ConversationChain(
    llm=llm,
    verbose=False,
    prompt=ENTITY_MEMORY_CONVERSATION_TEMPLATE,
    memory=ConversationEntityMemory(llm=llm)
)

In [54]:
conversation.invoke(input="Harry & Ron are going to rescue a baby dragon.")

{'input': 'Harry & Ron are going to rescue a baby dragon.',
 'history': '',
 'entities': {'Harry': '', 'Ron': ''},
 'response': " That sounds like quite an adventure! I hope they have a plan in place to safely rescue the dragon and return it to its natural habitat. Dragons can be dangerous creatures, but they also deserve to live freely in the wild. I'm sure Harry and Ron will do everything they can to ensure the dragon's safety."}

In [55]:
conversation.memory.entity_store.store

{'Harry': 'Harry is going to help rescue a baby dragon with Ron.',
 'Ron': 'Ron is going to help Harry rescue a baby dragon.'}

In [56]:
conversation.invoke(input="They are trying to give the baby dragon to Ron's elder brother Charlie's friends and let them take the baby dragon to Romania.")

{'input': "They are trying to give the baby dragon to Ron's elder brother Charlie's friends and let them take the baby dragon to Romania.",
 'history': "Human: Harry & Ron are going to rescue a baby dragon.\nAI:  That sounds like quite an adventure! I hope they have a plan in place to safely rescue the dragon and return it to its natural habitat. Dragons can be dangerous creatures, but they also deserve to live freely in the wild. I'm sure Harry and Ron will do everything they can to ensure the dragon's safety.",
 'entities': {'Ron': 'Ron is going to help Harry rescue a baby dragon.',
  'Charlie': '',
  'Romania': ''},
 'response': " That's a great idea! Romania is known for its dragon reserves and I'm sure Charlie's friends will take good care of the baby dragon there. It's important to find a safe and suitable home for the dragon, and it seems like Ron and Harry have found the perfect solution."}

In [57]:
conversation.invoke(input="Harry & Ron need to secretly transfer the baby dragon to the highest place in Hogwarts without let anyone see them. So they are going to use Harry's Invisibility cloak.")

{'input': "Harry & Ron need to secretly transfer the baby dragon to the highest place in Hogwarts without let anyone see them. So they are going to use Harry's Invisibility cloak.",
 'history': "Human: Harry & Ron are going to rescue a baby dragon.\nAI:  That sounds like quite an adventure! I hope they have a plan in place to safely rescue the dragon and return it to its natural habitat. Dragons can be dangerous creatures, but they also deserve to live freely in the wild. I'm sure Harry and Ron will do everything they can to ensure the dragon's safety.\nHuman: They are trying to give the baby dragon to Ron's elder brother Charlie's friends and let them take the baby dragon to Romania.\nAI:  That's a great idea! Romania is known for its dragon reserves and I'm sure Charlie's friends will take good care of the baby dragon there. It's important to find a safe and suitable home for the dragon, and it seems like Ron and Harry have found the perfect solution.",
 'entities': {'Harry': 'Harry 

In [58]:
conversation.invoke(input="What do you know about Harry & Ron?")

{'input': 'What do you know about Harry & Ron?',
 'history': "Human: Harry & Ron are going to rescue a baby dragon.\nAI:  That sounds like quite an adventure! I hope they have a plan in place to safely rescue the dragon and return it to its natural habitat. Dragons can be dangerous creatures, but they also deserve to live freely in the wild. I'm sure Harry and Ron will do everything they can to ensure the dragon's safety.\nHuman: They are trying to give the baby dragon to Ron's elder brother Charlie's friends and let them take the baby dragon to Romania.\nAI:  That's a great idea! Romania is known for its dragon reserves and I'm sure Charlie's friends will take good care of the baby dragon there. It's important to find a safe and suitable home for the dragon, and it seems like Ron and Harry have found the perfect solution.\nHuman: Harry & Ron need to secretly transfer the baby dragon to the highest place in Hogwarts without let anyone see them. So they are going to use Harry's Invisibi

Now let's inspect the entities that were extracted from the conversation above.

In [59]:
print_with_type(conversation.memory.entity_store.store)

"<class 'dict'>:"
{'Charlie': "Charlie is Ron's elder brother who has friends in Romania who can "
            'take care of the baby dragon.',
 'Harry': 'Harry and Ron are two brave and resourceful wizards who are always '
          'willing to help others, even if it means going on dangerous '
          'missions like rescuing a baby dragon. They make a great team and '
          "I'm sure they will succeed in their mission with the help of "
          "Harry's Invisibility cloak.",
 "Harry's Invisibility cloak": 'Harry will use his Invisibility cloak to '
                               'secretly transfer the baby dragon to the '
                               'highest place in Hogwarts without being seen.',
 'Hogwarts': 'Hogwarts is a big place, but Harry and Ron have a clever plan to '
             'secretly transfer the baby dragon to the highest point using '
             "Harry's Invisibility cloak.",
 'Romania': "Romania is known for its dragon reserves and it's important to "


Let's continue the conversation and see what more we can learn about each entity.

In [60]:
conversation.predict(input="Harry is a brave and clever boy.")

' Yes, Harry is known for his bravery and cleverness. He has faced many challenges and always finds a way to overcome them. He is a great friend and a valuable member of the wizarding community.'

In [61]:
print_with_type(conversation.memory.entity_store.store)

"<class 'dict'>:"
{'Charlie': "Charlie is Ron's elder brother who has friends in Romania who can "
            'take care of the baby dragon.',
 'Harry': 'Harry is known for his bravery and cleverness, and is a great '
          'friend and valuable member of the wizarding community.',
 "Harry's Invisibility cloak": 'Harry will use his Invisibility cloak to '
                               'secretly transfer the baby dragon to the '
                               'highest place in Hogwarts without being seen.',
 'Hogwarts': 'Hogwarts is a big place, but Harry and Ron have a clever plan to '
             'secretly transfer the baby dragon to the highest point using '
             "Harry's Invisibility cloak.",
 'Romania': "Romania is known for its dragon reserves and it's important to "
            'find a safe and suitable home for the baby dragon.',
 'Ron': 'Ron is a brave and resourceful wizard who, along with his friend '
        'Harry, is on a mission to secretly transfer a baby d

In [62]:
conversation.invoke(input="What do you know about Harry?")

{'input': 'What do you know about Harry?',
 'history': "Human: Harry & Ron need to secretly transfer the baby dragon to the highest place in Hogwarts without let anyone see them. So they are going to use Harry's Invisibility cloak.\nAI:  That's a clever plan! Harry's Invisibility cloak will definitely come in handy for this mission. I'm sure they will be able to successfully transfer the baby dragon without anyone noticing. Hogwarts is a big place, so finding a high and secluded spot shouldn't be too difficult. Good luck to Harry and Ron on their mission!\nHuman: What do you know about Harry & Ron?\nAI:  Harry and Ron are two brave and resourceful wizards who are always willing to help others, even if it means going on dangerous missions like rescuing a baby dragon. They make a great team and I'm sure they will succeed in their mission with the help of Harry's Invisibility cloak.\nHuman: Harry is a brave and clever boy.\nAI:  Yes, Harry is known for his bravery and cleverness. He has f

In [63]:
print_with_type(conversation.memory.entity_store.store)

"<class 'dict'>:"
{'Charlie': "Charlie is Ron's elder brother who has friends in Romania who can "
            'take care of the baby dragon.',
 'Harry': 'As mentioned before, Harry is known for his bravery and cleverness, '
          'and is a great friend and valuable member of the wizarding '
          'community. He has faced many challenges and always finds a way to '
          'overcome them. He is also a skilled wizard and has a strong sense '
          'of justice. Overall, Harry is a remarkable individual who has made '
          'a significant impact in the wizarding world.',
 "Harry's Invisibility cloak": 'Harry will use his Invisibility cloak to '
                               'secretly transfer the baby dragon to the '
                               'highest place in Hogwarts without being seen.',
 'Hogwarts': 'Hogwarts is a big place, but Harry and Ron have a clever plan to '
             'secretly transfer the baby dragon to the highest point using '
             "Harry

### 1.3 Adding Memory to Agents

In this section, we first ask the agent a question and then ask a related follow-up question without repeating the context ourselves.

In [64]:
from langchain.agents import AgentExecutor, Tool, ZeroShotAgent
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferMemory
from langchain_community.utilities import SerpAPIWrapper
from langchain_openai import OpenAI

In [65]:
search = SerpAPIWrapper()

tools = [
    Tool(
        name="Search",
        func=search.run,
        description="useful for when you need to answer questions about current events",
    )
]

In [66]:
prompt = ZeroShotAgent.create_prompt(
    tools,
    prefix="""Have a conversation with a human, answering the following questions as best you can.  You have access to the following tools:""",
    suffix="""Begin!  
{chat_history}
Question: {input}
{agent_scratchpad}""",
    input_variables=["input", "chat_history", "agent_scratchpad"],
)
memory = ConversationBufferMemory(memory_key="chat_history")
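
The memory reaches the agent purely through the prompt: before each call, the stored conversation is substituted for `{chat_history}`. Here is a rough illustration using plain `str.format` on a template shaped like the suffix above. `render_suffix` is a hypothetical helper, and the real prompt also contains the prefix and the tool descriptions, which are omitted here.

```python
SUFFIX = """Begin!
{chat_history}
Question: {input}
{agent_scratchpad}"""

def render_suffix(chat_history, user_input, scratchpad=""):
    """Fill the three placeholders the way the agent prompt expects them."""
    return SUFFIX.format(
        chat_history=chat_history,
        input=user_input,
        agent_scratchpad=scratchpad,
    )

history_text = (
    "Human: What is the population of China in 2024?\n"
    "AI: The population of China in 2024 is estimated to be 1,425,289,376 people."
)
print(render_suffix(history_text, "Is it more or less than India?"))
```

Seen this way, a follow-up such as "is it more or less than India?" is answerable only because the earlier question and answer are physically present in the prompt text.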

In [67]:
llm_chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
agent = ZeroShotAgent(llm_chain=llm_chain, tools=tools, verbose=True)
agent_chain = AgentExecutor.from_agent_and_tools(
    agent=agent, tools=tools, verbose=True, memory=memory, handle_parsing_errors=True
)



In [68]:
agent_chain.invoke(input="What is the population of China in 2024?")

> Entering new AgentExecutor chain...

Thought: I should use the search tool to find the most recent data on China's population.
Action: Search
Action Input: "China population 2024"
Observation: The current population of China is 1,425,289,376 as of Tuesday, April 9, 2024, based on Worldometer elaboration of the latest United Nations data 1.
Thought: I should double check this information with a reliable source.
Action: Search
Action Input: "China population 2024 reliable source"
Observation: The current population of China is 1,425,289,376 as of Tuesday, April 9, 2024, based on Worldometer elaboration of the latest United Nations data 1. China 2023 population is estimated at 1,425,671,352 people at mid year. China population is equivalent to 17.72% of the total world population.
Thought: This seems like a reliable and up-to-date source.
Action: Search
Action Input: "China population 2024 reliable source"
Observation: The curre

{'input': 'What is the population of China in 2024?',
 'chat_history': '',
 'output': 'The population of China in 2024 is estimated to be 1,425,289,376 people.'}

In [69]:
memory.load_memory_variables({})

{'chat_history': 'Human: What is the population of China in 2024?\nAI: The population of China in 2024 is estimated to be 1,425,289,376 people.'}

In [70]:
agent_chain.invoke(input="Search the data in google. and answer is it more or less than India?")

> Entering new AgentExecutor chain...

Thought: I should use the search tool to find the population data for both China and India.
Action: Search
Action Input: "Population of China 2024" and "Population of India 2024"
Observation: {'type': 'organic_result', 'title': 'Twenty countries with the largest population in 2024 (in millions)', 'source': 'Statista'}
Thought: I should click on the source from Statista to get the accurate data.
Action: Click on the source from Statista
Action Input: "Statista"
Observation: Click on the source from Statista is not a valid tool, try one of [Search].
Thought: I should use the search tool to find the population data for both China and India.
Action: Search
Action Input: "Population of China 2024" and "Population of India 2024"
Observation: {'type': 'organic_result', 'title': 'Twenty countries with the largest population in 2024 (in millions)', 'source': 'Statista'}
Thought: I should click on t

{'input': 'Search the data in google. and answer is it more or less than India?',
 'chat_history': 'Human: What is the population of China in 2024?\nAI: The population of China in 2024 is estimated to be 1,425,289,376 people.',
 'output': 'The population of China in 2024 is estimated to be 1,425,289,376 people, which is more than the estimated population of India in 2024.'}

In [71]:
print_with_type(memory.load_memory_variables({}))

"<class 'dict'>:"
{'chat_history': 'Human: What is the population of China in 2024?\n'
                 'AI: The population of China in 2024 is estimated to be '
                 '1,425,289,376 people.\n'
                 'Human: Search the data in google. and answer is it more or '
                 'less than India?\n'
                 'AI: The population of China in 2024 is estimated to be '
                 '1,425,289,376 people, which is more than the estimated '
                 'population of India in 2024.'}


In [72]:
agent_chain.invoke(input="what is the population in Chima?")

> Entering new AgentExecutor chain...

Thought: I should use the search tool to find the answer.
Action: Search
Action Input: "population of China"
Observation: {'type': 'population_result', 'place': 'China', 'population': '1.412 billion', 'year': '2022'}
Thought: I should confirm the year of the population data.
Action: Search
Action Input: "population of China in 2024"
Observation: 1.43 billion
Thought: I now know the final answer.
Final Answer: The population of China in 2024 is estimated to be 1,425,289,376 people.

> Finished chain.


{'input': 'what is the population in Chima?',
 'chat_history': 'Human: What is the population of China in 2024?\nAI: The population of China in 2024 is estimated to be 1,425,289,376 people.\nHuman: Search the data in google. and answer is it more or less than India?\nAI: The population of China in 2024 is estimated to be 1,425,289,376 people, which is more than the estimated population of India in 2024.',
 'output': 'The population of China in 2024 is estimated to be 1,425,289,376 people.'}

## 2. Long term memory with vector storage 

In this section, we are going to embed the first chapter of the famous Harry Potter book into a vector store and try some similarity searches. Some extra examples are commented out; you can uncomment them and try them one by one. If you observe the results carefully, you may discover the characteristics of similarity search.

### 2.1 Loaders and Splitters

#### PDF Loaders

In [73]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import UnstructuredPDFLoader

data = PyPDFLoader("/share/lab4/harry-potter-chap-1.pdf").load()

In [74]:
# Note: PyPDFLoader already splits the PDF by page for you

print(f'You have {len(data)} document(s) in your data')
for i, d in enumerate(data):
    print(f'There are {len(d.page_content)} characters in doc {i}')

You have 16 document(s) in your data
There are 1848 characters in doc 0
There are 2101 characters in doc 1
There are 2093 characters in doc 2
There are 1898 characters in doc 3
There are 1892 characters in doc 4
There are 1300 characters in doc 5
There are 1867 characters in doc 6
There are 1806 characters in doc 7
There are 1548 characters in doc 8
There are 1573 characters in doc 9
There are 1635 characters in doc 10
There are 1792 characters in doc 11
There are 1542 characters in doc 12
There are 1399 characters in doc 13
There are 1882 characters in doc 14
There are 1921 characters in doc 15


#### Text file loader

In [75]:
from langchain_community.document_loaders import TextLoader

union = TextLoader("/share/lab4/state_of_the_union.txt").load()

#### Text Splitters

From the LangChain documentation:

RecursiveCharacterTextSplitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.
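
The strategy described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: `recursive_split` is a hypothetical function, it ignores `chunk_overlap`, and it drops a separator wherever a chunk boundary falls.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split `text` into chunks of at most `chunk_size` characters.

    Coarse separators are tried first; a piece that is still too long is
    split again with the next, finer separator.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate        # still fits: keep growing this chunk
            continue
        if current:
            chunks.append(current)     # flush the chunk built so far
        if len(piece) <= chunk_size:
            current = piece            # start a new chunk with this piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "aaa bbb ccc\n\nddd eee"
print(recursive_split(sample, chunk_size=7))
```

In the sample, the second paragraph fits within the limit and stays whole, while the first is too long and is therefore broken at a word boundary instead.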

In [76]:
# Try different chunk_size and chunk_overlap values.
# This is optional; test it out on your own data.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=30)
texts = text_splitter.split_documents(data)

In [77]:
print (f'Now you have {len(texts)} documents')

for t in texts:
    print(t.page_content[:100])
    print("=========")

Now you have 68 documents
CHAPTER ONE  
 
THE BOY WHO LIVED  
 
Mr. and Mrs. Dursley, of number four, Privet Drive, were proud
have a very large mustac he. Mrs. Dursley was thin and blonde and had  
nearly twice the usual amoun
think they could bear it if anyone found out about the Potters. Mrs.  
Potter was Mrs. Dursley's sis
Potters had a small son, too, but they had never even seen him. This boy  
was another good reason f
work, and Mrs. Dursley gossiped away happily as she wrestled a 
screaming  
Dudley into his high cha
into his car and backed out of number four's drive.  
 
It was on the corner of the street that he n
stared at the cat. It stared  back. As Mr. Dursley drove around the  
corner and up the road, he wat
But on the edge of town, drills were driven out of his mind by something  
else. As he sat in the us
wheel and his eyes fell on a huddle of these weirdos standing quite  
close by. They were whispering
nerve of him! But then it struck Mr. Dursley that this was probab

There are different kinds of splitters. The tool at https://chunkviz.up.railway.app/ is a great way to see how splitters differ under different chunk_size and chunk_overlap settings.

In [78]:
#### Your TASK ####
# Try different PDF loaders. Which one works best for the file /share/lab4/hp-book1.pdf,
# which contains the full text of Harry Potter Book 1, with all the illustrations?

# LangChain provides many other loader options; read the documentation to find out the differences.
# See https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf

from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PDFMinerLoader
from langchain_community.document_loaders import OnlinePDFLoader
loader_1 = UnstructuredPDFLoader("/share/lab4/hp-book1.pdf")
data_1 = loader_1.load()
print (f'You have {len(data_1)} document(s) in data_1')
i = 0
for d in data_1:
    print (f'There are {len(d.page_content)} characters in doc {i}')
    i += 1

print("=======================")

loader_2 = PyPDFLoader("/share/lab4/hp-book1.pdf")
data_2 = loader_2.load()
print (f'You have {len(data_2)} document(s) in data_2')
i = 0
m = 0
for d in data_2:
    i += 1
    m += len(d.page_content)
print(f'There are {m} characters in doc')

print("=======================")

loader_3 = PDFMinerLoader("/share/lab4/hp-book1.pdf")
data_3 = loader_3.load()
print (f'You have {len(data_3)} document(s) in data_3')
i = 0
for d in data_3:
    print (f'There are {len(d.page_content)} characters in doc {i}')
    i += 1

print("========================")

loader_4 = OnlinePDFLoader("/share/lab4/hp-book1.pdf")
data_4 = loader_4.load()
print (f'You have {len(data_4)} document(s) in data_4')
i = 0
for d in data_4:
    print (f'There are {len(d.page_content)} characters in doc {i}')
    i += 1

print("=========================")

print(f'So, PDFMinerLoader behaves best.')


You have 1 document(s) in data_1
There are 457834 characters in doc 0
You have 327 document(s) in data_2
There are 462967 characters in doc
You have 1 document(s) in data_3
There are 498350 characters in doc 0
You have 1 document(s) in data_4
There are 457834 characters in doc 0
So, PDFMinerLoader behaves best.


### 2.2 Create embeddings of your documents

An embedding model turns a piece of text, such as a sentence, into a vector, so that we can "semantically search" for related splits of a document.
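
Mechanically, a semantic search maps each text to a vector and then compares vectors, most commonly with cosine similarity. Here is a toy sketch in which a word-count vector over a tiny fixed vocabulary stands in for a real embedding model. The vocabulary and `toy_embed` are made up for illustration; real models such as all-MiniLM-L6-v2 produce dense, learned vectors.

```python
import math

VOCAB = ["cat", "owl", "wizard", "drill", "street"]  # assumption: toy vocabulary

def toy_embed(text):
    """Count how often each vocabulary word occurs in `text`."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

query = toy_embed("a cat on the street")
print(cosine_similarity(query, toy_embed("the cat sat on the street corner")))
print(cosine_similarity(query, toy_embed("he sold a drill")))
```

The first comparison scores high because the texts share vocabulary words, and the second scores zero because they share none. A learned embedding goes further and can score paraphrases as similar even when no words overlap.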

In [79]:
# OpenAI embedding: slow and expensive, we do not use them here.  

# from langchain.embeddings.openai import OpenAIEmbeddings

# openai_embedding = OpenAIEmbeddings()

In [80]:
# Let's use the local ones.
# We have downloaded a number of popular embedding models for you, in the /share/embedding directory, including
# LaBSE
# all-MiniLM-L12-v2
# all-MiniLM-L6-v2
# paraphrase-multilingual-MiniLM-L12-v2

from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
minilm_embedding = SentenceTransformerEmbeddings(model_name="/share/embedding/all-MiniLM-L6-v2/")




### 2.4 Store and retrieve the embeddings in ChromaDB

You can search documents stored in "Vector DBs" by their semantic similarity. Vector DBs use a "KNN (k-nearest neighbors)" search to find the documents whose embeddings are closest to the query's embedding.

We first introduce ChromaDB because it runs locally, is easy to set up, and, best of all, is free.
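
As an illustration of the k-nearest-neighbors idea, the sketch below scans every stored vector and keeps the `k` closest ones by Euclidean distance. This is a brute-force toy with made-up 2-dimensional vectors and hypothetical chunk ids, not ChromaDB's implementation; production vector DBs usually rely on approximate indexes rather than a full scan.

```python
import heapq
import math


def knn(query_vec, stored, k=4):
    """Return the ids of the k stored vectors closest to query_vec (Euclidean distance)."""
    closest = heapq.nsmallest(
        k, stored.items(), key=lambda item: math.dist(query_vec, item[1])
    )
    return [doc_id for doc_id, _ in closest]


# Made-up vectors keyed by hypothetical chunk ids.
stored = {
    "chunk-a": [0.9, 0.1],
    "chunk-b": [0.1, 0.9],
    "chunk-c": [0.8, 0.3],
    "chunk-d": [0.0, 1.0],
}

nearest = knn([1.0, 0.0], stored, k=2)
print(nearest)
```

The results come back ordered from closest to farthest, which is why the first hit of a similarity search is usually the most relevant chunk.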

In [81]:
# compute embeddings and save the embeddings into ChromaDB
from langchain_community.vectorstores import Chroma

chroma_dir = "/scratch2/chroma_db"
docsearch_chroma = Chroma.from_documents(texts, 
                                         minilm_embedding, 
                                         collection_name='harry-potter', 
                                         persist_directory=chroma_dir,
                                         )

In [82]:
# questions from https://en.wikibooks.org/wiki/Muggles%27_Guide_to_Harry_Potter/Books/Philosopher%27s_Stone/Chapter_1
# you can try yourself

# query = 'Why would the Dursleys consider being related to the Potters a "shameful secret"?'
# query = 'Who are the robed people Mr. Dursley sees in the streets?'
query = 'What might a "Muggle" be?'
# query = 'What exactly is the cat on Privet Drive?'
# query = '''Who might "You-Know-Who" be? Why isn't this person referred to by a given name?'''

In [83]:
# A utility function to print search results
def print_search_results(docs):
    print(f"search returned {len(docs)} results. ")
    for doc in docs:
        print(doc.page_content)
        print("=============")


In [84]:
# semantic similarity search

docs = docsearch_chroma.similarity_search(query)
print_search_results(docs)

search returned 4 results. 
“He’s not going,” he said. 
Hagrid grunted. 
“I’d like ter see a great Muggle like you stop him,” he said. 
“A what?” said Harry, interested. 

(cid:145) 52 (cid:145) 

 
 
 
THE  KEEPER  OF  THE  KEYS 

“A  Muggle,”  said  Hagrid,  “it’s  what  we  call  nonmagic  folk  like 
them. An’ it’s your bad luck you grew up in a family o’ the biggest 
Muggles I ever laid eyes on.” 

“We  swore  when  we  took  him  in  we’d  put  a  stop  to  that  rub-
bish,” said Uncle Vernon, “swore we’d stamp it out of him! Wizard 
indeed!”
“But there aren’t wild dragons in Britain?” said Harry. 
“Of  course  there  are,”  said  Ron.  “Common  Welsh  Green  and 
Hebridean  Blacks.  The  Ministry  of  Magic  has  a  job  hushing  them 
up, I can tell you. Our kind have to keep putting spells on Muggles 
who’ve spotted them, to make them forget.” 

“So what on earth’s Hagrid up to?” said Hermione. 

When  they  knocked  on  the  door  of  the  gamekeeper’s  hut  an  hour 
later,

#### Saving and Loading your ChromaDB

In [85]:
# save to local disk
docsearch_chroma.persist()

In [86]:
# reload from disk
docsearch_chroma_reloaded = Chroma(persist_directory = chroma_dir,
                                   collection_name = 'harry-potter', 
                                   embedding_function = minilm_embedding)

In [87]:
# you can test with the previous or another query

query = 'Who are the robed people Mr. Dursley sees in the streets?'
docs = docsearch_chroma_reloaded.similarity_search(query)
print_search_results(docs)

search returned 4 results. 
But  on  the  edge  of  town,  drills  were  driven  out  of  his  mind  by 
something  else.  As  he  sat  in  the  usual  morning  traffic  jam,  he 
couldn’t  help  noticing  that  there  seemed  to  be  a  lot  of  strangely 
dressed  people  about.  People  in  cloaks.  Mr.  Dursley  couldn’t  bear 
people  who  dressed  in  funny  clothes  —  the  getups  you  saw  on 
young  people!  He  supposed  this  was  some  stupid  new  fashion.  He 
drummed  his  fingers  on  the  steering  wheel  and  his  eyes  fell  on  a 
huddle  of  these  weirdos  standing  quite  close  by.  They  were  whis-
pering  excitedly  together.  Mr.  Dursley  was  enraged  to  see  that  a
But on the edge of town, drills were driven out of his mind by something  
else. As he sat in the usual morning tr affic jam, he couldn't help  
noticing that there seemed to be a lot of strangely dressed people  
about. People in cloaks. Mr. Dursley couldn't bear people who dressed in  
fun

In [88]:
#### Your TASK ####
# With the chosen PDF loader, test different splitters and chunk sizes until you feel that the chunking makes sense.
# You can also try different embeddings.
# Then embed the entire book 1 into ChromaDB.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=30)
texts = text_splitter.split_documents(data_3)

chroma_dir = "/scratch2/chroma_db"
docsearch_chroma = Chroma.from_documents(texts, 
                                         minilm_embedding, 
                                         collection_name='harry-potter-book1', 
                                         persist_directory=chroma_dir,
                                         )
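
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window splitter. It is only a sketch under the assumption of fixed-size character windows; `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, so its chunks will differ. The function name is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

A larger overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary, at the cost of storing more (partly duplicated) text.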



### 2.5 Query those docs with a QA chain

In [89]:
from langchain_openai import OpenAI
from langchain.chains.question_answering import load_qa_chain

In [90]:
llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")
chain = load_qa_chain(llm, chain_type="stuff", verbose=True)
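
The `stuff` chain type places all retrieved chunks into a single prompt, as the verbose trace printed when the chain runs shows. The sketch below imitates that prompt assembly in plain Python. The instruction text and the `Question:` / `Helpful Answer:` labels are copied from the trace in this notebook; the function name and the way the chunks are joined are assumptions, not LangChain's actual code.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate all chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know, "
        "don't try to make up an answer.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# Made-up chunks standing in for similarity-search results.
prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "How did Harry's parents die?",
)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit into the model's context window.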

In [91]:
query = "How did Harry's parents die?"
docs = docsearch_chroma_reloaded.similarity_search(query)

In [92]:
chain.run(input_documents=docs, question=query)

> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

“Some  say  he  died.  Codswallop,  in  my  opinion.  Dunno  if  he 
had enough human left in him to die. Some say he’s still out there, 
bidin’  his  time,  like,  but  I  don’  believe  it.  People  who  was  on  his 
side  came  back  ter  ours.  Some  of  ’em  came  outta  kinda  trances. 
Don’ reckon they could’ve done if he was comin’ back. 

“Most  of  us  reckon  he’s  still  out  there  somewhere  but  lost  his 
powers. Too weak to carry on. ’Cause somethin’ about you finished 
him,  Harry.  There  was  somethin’  goin’  on  that  night  he  hadn’t 
counted on — I dunno what it was, no one does — but somethin’ 
about you stumped him, all right.”

needs reminding of that on his first

" Harry's parents were killed by Lord Voldemort, a powerful dark wizard."

In [93]:
#### Your Task ####

# Rebuild the chain from the whole book ChromaDB.  Test with one of the following questions (of your choice).

#query = 'Why does Dumbledore believe the celebrations may be premature?'
query = 'Why is Harry left with the Dursleys rather than a Wizard family?'
#query = 'Why does McGonagall seem concerned about Harry being raised by the Dursleys?'

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=30)
texts = text_splitter.split_documents(data_3)

chroma_dir = "/scratch2/chroma_db"
docsearch_chroma = Chroma.from_documents(texts, 
                                         minilm_embedding, 
                                         collection_name='harry-potter-book1', 
                                         persist_directory=chroma_dir,
                                         )

docs = docsearch_chroma.similarity_search(query)
chain.run(input_documents=docs, question=query)



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

(cid:145) 1 (cid:145) 

 
 
 
 
 
CHAPTER  ONE 

They didn’t think they could bear it if anyone found out about the 
Potters.  Mrs.  Potter  was  Mrs.  Dursley’s  sister,  but  they  hadn’t  met 
for  several  years;  in  fact,  Mrs.  Dursley  pretended  she  didn’t  have  a 
sister, because her sister and her good-for-nothing husband were as 
unDursleyish  as  it  was  possible  to  be.  The  Dursleys  shuddered  to 
think  what  the  neighbors  would  say  if  the  Potters  arrived  in  the 
street. The Dursleys knew that the Potters had a small son, too, but 
they  had  never  even  seen  him.  This  boy  was  another  good  reason 
for keeping the Potters away; they didn’t want Dudley mi

' Harry is left with the Dursleys because his parents, who were wizards, were killed by the dark wizard, Voldemort. The Dursleys are his only living relatives and the only ones who can protect him from Voldemort.'

In [94]:
#### Your Task ####

# Using langchain documentation, find out about the map reduce QA chain.
# answer the following questions using the chain
from langchain_openai import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=745, chunk_overlap=34)
texts = text_splitter.split_documents(data_3)

chroma_dir = "/scratch2/chroma_db"
docsearch_chroma = Chroma.from_documents(texts, 
                                         minilm_embedding, 
                                         collection_name='harry-potter-book1', 
                                         persist_directory=chroma_dir,
                                         )

chain = load_qa_chain(llm, chain_type="map_reduce")

query = 'What happened in the Forbidden Forest during the first year of Harry Potter at Hogwarts?'
# query = 'Tell me about Harry Potter and Quidditch during the first year'

docs = docsearch_chroma.similarity_search(query)
chain.run(input_documents=docs, question=query)

' It is not mentioned what specifically happened in the Forbidden Forest during the first year of Harry Potter at Hogwarts.'
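
For intuition about the `map_reduce` chain type: each chunk is first sent to the LLM on its own (map), and the partial answers are then combined in one final LLM call (reduce). The sketch below shows that control flow with a fake LLM so it runs offline. The prompt wording, `map_reduce_answer`, and `fake_llm` are all hypothetical and do not reproduce LangChain's prompts.

```python
def map_reduce_answer(chunks, question, ask_llm):
    """Ask the question once per chunk, then ask once more over the partial answers."""
    # Map step: one LLM call per chunk.
    partial_answers = [
        ask_llm(f"Context:\n{chunk}\n\nQuestion: {question}") for chunk in chunks
    ]
    # Reduce step: one final LLM call over all partial answers.
    combined = "\n".join(partial_answers)
    return ask_llm(f"Partial answers:\n{combined}\n\nQuestion: {question}")


calls = []


def fake_llm(prompt):
    """Stand-in for a real LLM: records the prompt and returns a numbered answer."""
    calls.append(prompt)
    return f"answer-{len(calls)}"


final = map_reduce_answer(
    ["chunk one", "chunk two", "chunk three"], "What happened?", fake_llm
)
print(len(calls), final)
```

Compared with `stuff`, this costs one extra LLM call per chunk, but each call only has to fit a single chunk into the context window.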

### 2.6 Using Pinecone, an online vector DB

There are many reasons to store your DB online in a SaaS / PaaS service. For example:
- you want to scale the queries to many concurrent users
- you want more data reliability without having to worry about DB management
- you want to share the DB without owning any servers

If you want to store your embeddings online, try Pinecone with the code below. First go to [Pinecone.io](https://www.pinecone.io/) and set up an account. Then generate an API key and create an "index"; both can be done from the homepage once you have logged in to Pinecone.

In [95]:
import pinecone
from langchain.vectorstores import Pinecone

# initialize pinecone, depends on two environment variables, os.environ['PINECONE_API_KEY'] and os.environ['PINECONE_API_ENV']
pinecone.Pinecone()

# You should create an index for your vector db.  
# The "dimension" setting you choose when creating the index online should be 1536 for OpenAI embeddings, or 384 for MiniLM.
index_name = "test01"

In [96]:
docsearch_pinecone = Pinecone.from_texts(
                                [ t.page_content for t in texts ], 
                                minilm_embedding, 
                                index_name=index_name)

In [97]:
llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")
chain = load_qa_chain(llm, chain_type="stuff", verbose=True)
query = "How did Harry's parents die?"
docs = docsearch_pinecone.similarity_search(query)
chain.run(input_documents=docs, question=query)

# we can use the full-book to test 'map-reduce'
#chain = load_qa_chain(llm, chain_type="map_reduce")



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Where are your parents?” 

“They’re dead,” said Harry shortly. He didn’t feel much like go-

ing into the matter with this boy. 

“Oh, sorry,” said the other, not sounding sorry at all. “But they 

were our kind, weren’t they?”

name of heaven did Harry survive?"

And there were his mother and father smiling at him again, and 
one of his grandfathers nodding happily. Harry sank down to sit on 
the floor in front of the mirror. There was nothing to stop him from 
staying here all night with his family. Nothing at all.

It's just astounding... of all the things to stop him... but how in the 
name of heaven did Harry survive?"

Question: How did Harry's parents die?
Helpful Answer:

> Fi

" I don't know."

In [98]:
# query with pinecone
query = 'What exactly is the cat on Privet Drive?'
docs = docsearch_pinecone.similarity_search(query)
print(docs[0].page_content[:600])

didn't realize what he had seen -- then he jerked his head around to 
look again. There was a tabby cat standing on the corner of Privet


In [99]:
#### Your Task ####
# modify the QA chain in Section 2.5 (Chapter 1 only) to use pinecone instead of ChromaDB
data_5 = PDFMinerLoader("/share/lab4/harry-potter-chap-1.pdf").load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1247, chunk_overlap=102)
texts = text_splitter.split_documents(data_5)

docsearch_pinecone = Pinecone.from_texts(
                                [ t.page_content for t in texts ], 
                                minilm_embedding, 
                                index_name=index_name)

llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")
chain = load_qa_chain(llm, chain_type="stuff", verbose=True)
query = 'What happened in the Forbidden Forest during the first year of Harry Potter at Hogwarts?'
docs = docsearch_pinecone.similarity_search(query)
chain.run(input_documents=docs, question=query)



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

THE  FORBIDDEN  FOREST 

“Harry!” Neville burst out, the moment he saw the other two. “I 
was  trying  to  find  you  to  warn  you,  I  heard  Malfoy  saying  he  was 
going to catch you, he said you had a drag —”

THE  FORBIDDEN  FOREST 

can  never  die.  Mr.  Potter,  do  you  know  what  is  hidden  in  the 
school at this very moment?” 

“The Sorcerer’s Stone! Of course — the Elixir of Life! But I don’t 

understand who —” 

“Can you think of nobody who has waited many years to return

THE  FORBIDDEN  FOREST 

“You  don’t  think  they’ve  been  hurt,  do  you?”  whispered 

Hermione. 

“I don’t care if Malfoy has, but if something’s got Neville . . . it’s 

our fault he’s here in the fi

' Harry, Hermione, and Neville were in the Forbidden Forest and Neville warned them that Malfoy was planning to catch them. Later, Hagrid returned with Malfoy, Neville, and Fang, and it was revealed that Malfoy had grabbed Neville as a joke, causing Neville to panic and send up sparks.'

### 2.7 Use vector store in Agent

In this section, we are going to create a simple QA agent that decides by itself which of two vector stores to query, depending on the topic of the question.

#### Preparing the tools for the agent.

We will use our Chroma-based Harry Potter vector DB, and create another one containing President Biden's State of the Union speech.

In [100]:
from langchain_community.document_loaders import TextLoader

documents = TextLoader('/share/lab4/state_of_the_union.txt').load()
texts = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(documents)
docsearch3 = Chroma.from_documents(texts, 
                                   minilm_embedding, 
                                   collection_name="state-of-union", 
                                   persist_directory="/scratch2/chroma_db")
docsearch3.persist()

To allow the agent to query these databases, we need to define two RetrievalQA chains.

In [101]:
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI

llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")

harry_potter = RetrievalQA.from_chain_type(llm=llm, 
                                           chain_type="stuff", 
                                           retriever=docsearch_chroma_reloaded.as_retriever())
state_of_union = RetrievalQA.from_chain_type(llm=llm, 
                                             chain_type="stuff", 
                                             retriever=docsearch3.as_retriever())

In [102]:
# Now try both chains

print_with_type(harry_potter.invoke('Why does McGonagall seem concerned about Harry being raised by the Dursleys?'))
print_with_type(state_of_union.invoke("what is the GDP increase last year?"))

"<class 'dict'>:"
{'query': 'Why does McGonagall seem concerned about Harry being raised by the '
          'Dursleys?',
 'result': " McGonagall is concerned because she knows that Harry's aunt and "
           'uncle, the Dursleys, are not fond of magic and may not treat Harry '
           "well because of his magical abilities. She also knows that Harry's "
           'cousin, Dudley, is spoiled and may bully Harry.'}
"<class 'dict'>:"
{'query': 'what is the GDP increase last year?',
 'result': ' The GDP increase last year was 5.7%.'}


In [103]:
from langchain.agents import AgentType, Tool
from langchain_openai import OpenAI

# define tools
tools = [
    Tool(
        name="State of Union QA System",
        func=state_of_union.run,
        description="useful for when you need to answer questions about the most recent state of the union address. Input should be a fully formed question.",
    ),
    Tool(
        name="Harry Potter QA System",
        func=harry_potter.run,
        description="useful for when you need to answer questions about Harry Potter. Input should be a fully formed question.",
    ),
]

Now we can create the Agent giving both chains as tools. 

In [104]:
from langchain.agents import initialize_agent


# Construct the agent. We will use the default agent type here.
# See documentation for a full list of options.
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)




In [105]:
agent.run(
    "What did biden say about ketanji brown jackson?"
)



> Entering new AgentExecutor chain...


I should use the State of Union QA System to find the answer
Action: State of Union QA System
Action Input: What did biden say about ketanji brown jackson?
Observation: Biden nominated Ketanji Brown Jackson for the United States Supreme Court.
Thought: I now know the final answer
Final Answer: Biden nominated Ketanji Brown Jackson for the United States Supreme Court.

> Finished chain.


'Biden nominated Ketanji Brown Jackson for the United States Supreme Court.'

In [106]:
agent.run(
    "'Why does McGonagall seem concerned about Harry being raised by the Dursleys?'"
)



> Entering new AgentExecutor chain...


You should always think about what to do
Action: Harry Potter QA System
Action Input: 'Why does McGonagall seem concerned about Harry being raised by the Dursleys?'
Observation: It is not explicitly stated in the given context, but it can be inferred that McGonagall is concerned about Harry's well-being and treatment at the hands of the Dursleys, as they are described as neglectful and possibly abusive. Additionally, McGonagall may also be concerned about Harry's lack of knowledge about the wizarding world and his potential vulnerability to dark forces.
Thought: You should always think about what to do
Action: State of Union QA System
Action Input: 'What were the main topics discussed in the most recent state of the union address?'
Observation: The main topics discussed in the most recent state of the union address were the strength of the American people, the progress made in the past year, the future goals for the cou

'The main topics discussed in the most recent state of the union address were the strength of the American people, the progress made in the past year, the future goals for the country, and the need to come together as one nation to overcome challenges.'

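We can see that the agent picks the matching QA system as its first action for each question. In the second run, however, it then also called the State of Union tool and returned an unrelated final answer, so this routing is not fully reliable.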
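
The traces above follow a fixed text format: the LLM writes `Action:` and `Action Input:` lines, the executor runs the named tool, and the loop ends when the LLM writes `Final Answer:`. The sketch below parses one such LLM output with a regular expression. It is a simplified illustration of this format, not LangChain's actual output parser, and `parse_step` is a hypothetical name.

```python
import re

# Matches "Action: <tool name>" followed by "Action Input: <tool input>".
ACTION_PATTERN = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)


def parse_step(llm_output):
    """Return ("finish", answer) or (tool_name, tool_input) for one LLM output."""
    if "Final Answer:" in llm_output:
        return ("finish", llm_output.split("Final Answer:", 1)[1].strip())
    match = ACTION_PATTERN.search(llm_output)
    if match is None:
        raise ValueError(f"Could not parse agent output: {llm_output!r}")
    return (match.group(1).strip(), match.group(2).strip())


tool_step = parse_step(
    "I should use the State of Union QA System to find the answer\n"
    "Action: State of Union QA System\n"
    "Action Input: What did biden say about ketanji brown jackson?"
)
final_step = parse_step(
    "I now know the final answer\n"
    "Final Answer: Biden nominated Ketanji Brown Jackson for the United States Supreme Court."
)
print(tool_step)
print(final_step)
```

The parsed tool name is matched against the `name` of each `Tool`, which is why tool names and descriptions need to be distinct and informative.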

## 3. Your Task: Putting it all together with OpenAI and LangChain

In [154]:
#### Your Task ####

# This is a major task that requires some thinking and time. 

# Build a conversation system from a collection of research papers of your choice. 
# You can ask specific questions about a method in these papers, and the agent returns a brief answer (no more than 100 words).
# Save your data and ChromaDB in the /share directory so other people can use it. 
# Provide at least three query examples so the TAs can review your work. 

# You may use any tool from the past four labs or from the langchain docs, or any open source project. 

# Write a summary (a Markdown cell) at the end of the notebook describing what works and what does not.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import RetrievalQA
from langchain.agents import AgentType, Tool
from langchain.agents import initialize_agent

llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")

data = []
docsearch=[]
docsearch_reloaded = []
chain = []

for i in range(1, 9):
    loaded_data = PyPDFLoader(f"/share/lab4/ych/papers/paper{i}.pdf").load()
    data.append(loaded_data)
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=750, chunk_overlap=50)
    texts = text_splitter.split_documents(loaded_data)
    print(texts)
    
    docsearch.append(Chroma.from_documents(texts, 
                                   minilm_embedding, 
                                   collection_name=f"data{i}", 
                                   persist_directory="/share/lab4/ych/chroma"))
    docsearch[i-1].persist()
    docsearch_reloaded.append(Chroma(persist_directory = "/share/lab4/ych/chroma",
                                   collection_name = f"data{i}", 
                                   embedding_function = minilm_embedding))

for i in range(0, 8):
    chain.append(RetrievalQA.from_chain_type(llm=llm, 
                                           chain_type="stuff", 
                                           retriever=docsearch_reloaded[i].as_retriever()))


# One tool per paper; the ordinal in the description tells the agent which paper it covers.
ordinals = ["first", "second", "third", "fourth", "fifth", "sixth", "seventh", "eighth"]
tools = [
    Tool(
        name=f"System{i + 1}",
        func=chain[i].run,
        description=f"It is used to answer the question about the {ordinal} paper.",
    )
    for i, ordinal in enumerate(ordinals)
]

agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)

#query = "What is the Contributions of the fifth paper?"
query = "What is the final conclusions reached by the eighth paper?"
#query = "What is the research problem of the first paper?"

agent.run(query)



[Document(page_content='Non-Mergeable Sketching for Cardinality Estimation∗\nSeth Pettie\nUniversity of Michigan\npettie@umich.eduDingyu Wang\nUniversity of Michigan\nwangdy@umich.eduLonghui Yin\nTsinghua University\nylh17@mails.tsinghua.edu.cn\nAbstract\nCardinality estimation is perhaps the simplest non-trivial statistical problem that can be\nsolved via sketching. Industrially-deployed sketches like HyperLogLog ,MinHash, and PCSAare\nmergeable , which means that large data sets can be sketched in a distributed environment, and\nthen merged into a single sketch of the whole data set. In the last decade a variety of sketches\nhave been developed that are non-mergeable , but attractive for other reasons. They are simpler,', metadata={'source': '/share/lab4/ych/papers/paper1.pdf', 'page': 0}), Document(page_content='their cardinality estimates are strictly unbiased , and they have substantially lower variance .\nWe evaluate sketching schemes on a reasonably level playing ﬁeld, in terms 

'The final conclusions reached by the eighth paper are that relationships between papers can be represented by various types of relationships, such as direct and indirect citation, co-citation, and co-reference. These relationships can be combined using weights to create a relativity matrix, which can be used to measure the associations between papers. The direct citation matrix is implied into the distance matrix, and the relativity matrix is used instead of the direct and indirect citation relationships. The final constraint is that the sum of the weights must equal 1.'

This agent can summarize the papers stored in ChromaDB and return specific details about them, such as a paper's conclusions.
However, it fails to combine content from several papers, and it may get confused when asked for information that does not appear explicitly in the papers.