## PyPDFLoader

Provides functionality for loading PDF documents within the LangChain framework.

In [2]:
# !pip install pypdf 

from langchain_community.document_loaders import PyPDFLoader

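First, let's look at the PDF document. Note that `open` is a macOS command; on Windows it is not available, as the error below shows, and `start` plays the same role.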

In [3]:
!open documents/apple_cider.pdf

'open' is not recognized as an internal or external command,
operable program or batch file.


This line of code initializes the loader

In [4]:
loader = PyPDFLoader("documents/apple_cider.pdf")

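Load the PDF with pypdf into the "pages" variable.

`load_and_split()` reads the PDF page by page and then applies a default text splitter, so a long page can come back as more than one chunk. Each chunk keeps its source file and page number as metadata.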

In [5]:
pages = loader.load_and_split()

In [6]:
pages[:3]

[Document(page_content="REVIEW Open Access\nThe effect of apple cider vinegar on lipid\nprofiles and glycemic parameters: a\nsystematic review and meta-analysis of\nrandomized clinical trials\nAmir Hadi1, Makan Pourmasoumi2, Ameneh Najafgholizadeh3, Cain C. T. Clark4and Ahmad Esmaillzadeh5,6,7*\nAbstract\nBackground: Elevated lipid profiles and impaired glucose homeostasis are risk factors for several cardiovascular\ndiseases (CVDs), which, subsequently, represent a leading cause of early mortality, worldwide. The aim of the\ncurrent study was to conduct a systematic review and meta-analysis of the effect of apple cider vinegar (ACV) on\nlipid profiles and glycemic parameters in adults.\nMethods: A systematic search was conducted in electronic databases, including Medline, Scopus, Cochrane Library,\nand Web of Knowledge, from database inception to January 2020. All clinical trials which investigated the effect of\nACV on lipid profiles and glycemic indicators were included. Studies wer

In [7]:
for i in range(3):
    print(pages[i].metadata)

{'source': 'documents/apple_cider.pdf', 'page': 0}
{'source': 'documents/apple_cider.pdf', 'page': 1}
{'source': 'documents/apple_cider.pdf', 'page': 1}
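The metadata above lists page 1 twice, which suggests that page was split into two chunks. Here is a small sketch that counts chunks per page. It uses plain dictionaries copied from the output above so it runs on its own; in the notebook you would presumably feed it `doc.metadata` for each item in `pages` instead.

```python
from collections import Counter

# Metadata copied from the three chunks printed above.
chunk_metadata = [
    {"source": "documents/apple_cider.pdf", "page": 0},
    {"source": "documents/apple_cider.pdf", "page": 1},
    {"source": "documents/apple_cider.pdf", "page": 1},
]

# Count how many chunks each page number produced.
chunks_per_page = Counter(meta["page"] for meta in chunk_metadata)

for page, count in sorted(chunks_per_page.items()):
    print(f"page {page}: {count} chunk(s)")
```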


Since each chunk of the PDF is still quite long, we break the pages into smaller pieces.

We give the pieces a bit of overlap so that no meaningful sentence is lost at a boundary.

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

documents = text_splitter.split_documents(pages)

In [9]:
print(f"{len(pages)} vs {len(documents)}")

17 vs 57
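To make the role of the overlap concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is only an illustration: `split_with_overlap` is a hypothetical helper, and it cuts at fixed character positions, whereas RecursiveCharacterTextSplitter tries to cut at separators such as paragraph and line breaks.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into pieces of at most `chunk_size` characters,
    where each piece repeats the last `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last piece already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each piece starts with the last three characters of the previous one, so text that straddles a boundary appears whole in at least one neighbor as long as it is shorter than the overlap.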


Let's now load the API key.

In [10]:
import os 
from dotenv import load_dotenv

load_dotenv(".env")

openai_api_key = os.getenv("open_ai_key")
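`os.getenv` returns `None` when the variable is missing, and that only surfaces later as an authentication error. A small guard can fail early instead. This is just a sketch: `require_env` is a hypothetical helper, and `DEMO_API_KEY` is a made-up variable name used so the example runs without a real key.

```python
import os


def require_env(name: str) -> str:
    """Return the environment variable `name`, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file.")
    return value


# Demo with a made-up variable so the sketch runs without a real key.
os.environ["DEMO_API_KEY"] = "sk-demo"
demo_key = require_env("DEMO_API_KEY")
print("key loaded:", bool(demo_key))
```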

## Embeddings: 

We are going to use OpenAI embeddings to convert each chunk of text into a numeric vector.

Remember, the reason is that searching through a large number of text chunks directly is very time-consuming, whereas comparing numeric vectors is extremely fast.

In [11]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
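To see what "comparing numeric vectors" means, here is a minimal sketch of cosine similarity, one common way to compare embeddings. The three-dimensional vectors are made up for illustration; real embedding vectors have far more dimensions, and this is not necessarily the exact metric your vector store is configured to use.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for two chunks and one query.
toy_vectors = {
    "vinegar lowers cholesterol": [0.9, 0.1, 0.0],
    "the moon is round": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

# Pick the chunk whose vector points in the most similar direction to the query.
best_match = max(toy_vectors, key=lambda text: cosine_similarity(query_vector, toy_vectors[text]))
print(best_match)
```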

## Chroma vector database

We are going to store the chunks and their embedding vectors in a Chroma vector database, which we can then search by similarity.

In [12]:
!pip install chromadb



In [13]:
from langchain_community.vectorstores import Chroma
vector = Chroma.from_documents(documents, embeddings)

In [14]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(openai_api_key=openai_api_key, model="gpt-3.5-turbo")

## Output parser

We would like to convert the output of the chat model into plain text.

In [15]:
from langchain_core.output_parsers import StrOutputParser
output_parser = StrOutputParser()


## Retrievers

The retriever takes the question, compares it with all the numeric vectors in the database, and returns the most similar chunks of text.

In [16]:
retriever = vector.as_retriever()
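The following toy retriever sketches that idea end to end: turn text into vectors, score every chunk against the question, and keep the top `k`. Everything here is an assumption for illustration: `embed` uses simple word counts rather than a learned embedding model, and `retrieve` is a hypothetical helper, not what Chroma does internally.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    question_vector = embed(question)
    ranked = sorted(chunks, key=lambda c: similarity(question_vector, embed(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "apple cider vinegar may reduce total cholesterol",
    "the study searched medline and scopus",
    "vinegar intake and fasting glucose",
]
top_chunks = retrieve("does vinegar reduce cholesterol", toy_chunks, k=2)
print(top_chunks)
```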

## Adding memory

## Question Maker 

When a user asks a new question, they have the history of earlier questions and answers in mind.

The idea here is to reformulate the user's question so that it carries its own context.

We are going to use the LLM to perform this reformulation.

Here is the idea:

User's follow-up question => LLM => reformulated question (with history)

In [17]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

instructions_to_system="""

Given a chat history and the latest user question
which might reference context in the chat history, formulate a standalone question
which can be understood without the chat history. Do NOT answer the question, 
just reformulate it if needed; otherwise return it as is.

"""

question_maker_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", instructions_to_system),
        MessagesPlaceholder(variable_name = "chat_history"),
        ("human", "{question}"),
    ]
)


question_chain = question_maker_prompt | llm | StrOutputParser()
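It can help to picture what the model actually receives. The sketch below mimics, in plain Python, how the chat history is spliced between the system instruction and the latest question. It is a simplified stand-in: `build_messages` is a hypothetical helper and messages are plain `(role, text)` tuples rather than LangChain message objects.

```python
def build_messages(system_text: str, chat_history: list, question: str) -> list:
    """Place the chat history between the system instruction and the new question."""
    return [("system", system_text), *chat_history, ("human", question)]


history = [
    ("human", "What shape is the moon?"),
    ("ai", "The moon is spherical."),
]
messages = build_messages(
    "Reformulate the latest question as a standalone question.",
    history,
    "can you explain more?",
)
for role, content in messages:
    print(f"{role}: {content}")
```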

## Example

Let's assume that we previously asked about the shape of the moon and the AI responded that the moon is spherical.

Now the user follows up asking for further explanation but does not give the context.

The question_chain has to add the context to the follow-up question and produce a new, standalone question.

In [20]:
from langchain_core.messages import AIMessage, HumanMessage
question_chain.invoke({"question": "can you explain more?", 
                       "chat_history": [HumanMessage(content="you explained that the moon is round")]})

'Can you provide further details on the shape of the moon?'

## Prompt 

We now build the prompt for question answering.

This time, the prompt is a Python list consisting of:

 - a system instruction
 - a placeholder that takes the chat history later on
 - the user's question

In [21]:
# Use three sentences maximum and keep the answer concise

qa_system_prompt = """

You are an assistant for question-answering tasks. \
Use the following piece of retrieved context to answer the question. \
If you don't know the answer, provide a summary of the context. Do not generate your answer.\
    
    {context}
"""

qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt), 
        MessagesPlaceholder(variable_name="chat_history"),
        ("human", "{question}"),
    ]
)

## Which question to pass to LLM 

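We define a function that looks at the chat history:

- If there is a history, it returns the question chain (which reformulates the user's question). When a function used inside a chain returns a runnable, LangChain invokes that runnable with the same input.
- If the chat history is empty, it returns the user's question directly.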

In [22]:
def contextualized_question(input: dict):
    if input.get("chat_history"):
        return question_chain
    else: 
        return input["question"]
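The routing logic can be tried without an LLM. In this sketch, `fake_question_chain` is a hypothetical stand-in for `question_chain` that just appends the last history entry to the question. Unlike the function above, which returns the chain and leaves the invocation to LangChain, this version calls the stand-in directly.

```python
def fake_question_chain(inputs: dict) -> str:
    """Hypothetical stand-in for question_chain: glue the last history entry onto the question."""
    last_turn = inputs["chat_history"][-1]
    return f'{inputs["question"]} (regarding: {last_turn})'


def route_question(inputs: dict) -> str:
    """Reformulate only when there is history; otherwise pass the question through."""
    if inputs.get("chat_history"):
        return fake_question_chain(inputs)
    return inputs["question"]


first = route_question({"question": "What is ACV?", "chat_history": []})
followup = route_question(
    {"question": "can you explain more?", "chat_history": ["the moon is round"]}
)
print(first)
print(followup)
```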

## Retriever chain

We need a chain to pass the following to the LLM:

- context: use the vector retriever to get the most relevant chunks of the PDF
- question: the reformulated or the original user's question, depending on the history
- chat_history: a Python list of the chat messages

We use the assign function, which adds the context to whatever it receives as input and passes the result to the next link of the chain.

In [23]:
from langchain_core.runnables import RunnablePassthrough

retriever_chain = RunnablePassthrough.assign(
    context = contextualized_question | retriever
)
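As a rough mental model, the behaviour of assign can be mimicked in a few lines of plain Python: keep every incoming key and add the computed ones. This is only a sketch of the idea, not LangChain's implementation; `assign` and `fake_retriever` here are hypothetical helpers.

```python
def assign(**computed):
    """Return a function that copies its input dict and adds the computed keys."""
    def run(inputs: dict) -> dict:
        extra = {key: fn(inputs) for key, fn in computed.items()}
        return {**inputs, **extra}  # original keys are kept, new keys are added
    return run


def fake_retriever(inputs: dict) -> list[str]:
    """Hypothetical stand-in for the real retriever."""
    return [f"chunk related to: {inputs['question']}"]


add_context = assign(context=fake_retriever)
original = {"question": "What are the benefits?", "chat_history": []}
result = add_context(original)
print(result)
```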

## Example

Let's see the output of the retriever_chain.

Look at the extra "context" variable that is added to the "question" and "chat_history" variables.

This is what the "assign" function does.


In [25]:
retriever_chain.invoke({
    "chat_history":[HumanMessage(content="you explained that the moon is round")],
    "question": "can you explain more?"
})

{'chat_history': [HumanMessage(content='you explained that the moon is round')],
 'question': 'can you explain more?',
 'context': [Document(page_content='Mahmoodi et al. [ 27] UH U U L U\nPanetta et al. [ 33]L L LL L U', metadata={'page': 5, 'source': 'documents/apple_cider.pdf'}),
  Document(page_content='The online version contains supplementary material available at https://doi.', metadata={'page': 9, 'source': 'documents/apple_cider.pdf'}),
  Document(page_content='dataSelective\nreportingOther\nbias\nBashiri et al. [ 4]L U H L L L\nHalima et al. [ 16] LU U U L U\nKondo et al. [ 23]L U L L LU\nMahmoodi et al. [ 27] UH U U L U\nPanetta et al. [ 33]L L LL L U\nKhezri et al. (2018) [ 22]L L H L L L\nEbrahimi-Mamaghani et al. [ 12]LU H L L U\nMohammadpourhodki et al.\n(2019) [ 28]LU H L L U\nKausar et al. (2019) [ 20]L L L L L L\nHhigh risk of bias, Llow risk of bias, Uunclear or unrevealed risk of bias. Criteria defined for risk of bias assessment are according to the Cochrane guidel

## Retrieval-Augmented Generation (RAG) Chain: the main chain

This is the main chain that produces the final answer

In [27]:
rag_chain =(
    retriever_chain | qa_prompt | llm | output_parser  # parse to plain text, matching the string outputs shown below
)
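If the `|` syntax looks like magic, the toy class below shows the basic idea: each step wraps a function, and `|` builds a new step that feeds one output into the next. This is a deliberately tiny sketch under my own assumptions; `Step` is hypothetical, the "LLM" is a fake function, and LangChain's real runnables do much more (batching, streaming, coercing plain functions and dicts).

```python
class Step:
    """Toy stand-in for a runnable: wraps a function and supports the | operator."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        # Compose left to right: run self first, then pass the result to other.
        return Step(lambda value: other.fn(self.fn(value)))

    def invoke(self, value):
        return self.fn(value)


# Four fake stages mirroring retriever_chain | qa_prompt | llm | output_parser.
add_context = Step(lambda d: {**d, "context": "ACV lowered total cholesterol."})
fill_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda prompt: {"content": f"(model reply to {len(prompt)} characters)"})
to_text = Step(lambda message: message["content"])

toy_rag_chain = add_context | fill_prompt | fake_llm | to_text
answer = toy_rag_chain.invoke({"question": "What are the benefits?", "chat_history": []})
print(answer)
```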

In [31]:
question = "What are the top 3 benefits?"

In [32]:
chat_history=[]

ai_msg = rag_chain.invoke({"question": question, "chat_history": chat_history})
chat_history.extend([HumanMessage(content=question), ai_msg])
ai_msg

'The top three benefits of apple cider vinegar (ACV) consumption mentioned in the provided context are its potential to reduce fasting plasma glucose (FPG) levels, lower serum total cholesterol (TC) concentrations, and decrease triglyceride (TG) levels.'

In [36]:
print(ai_msg)

The top three benefits of apple cider vinegar (ACV) consumption mentioned in the provided context are its potential to reduce fasting plasma glucose (FPG) levels, lower serum total cholesterol (TC) concentrations, and decrease triglyceride (TG) levels.


In [37]:
question = "Can you tell me more about each one?"

ai_msg = rag_chain.invoke({"question": question, "chat_history": chat_history})
chat_history.extend([HumanMessage(content=question), ai_msg])
ai_msg

'1. **Reducing Fasting Plasma Glucose (FPG) levels:** ACV consumption was found to significantly decrease FPG levels in studies where ACV was administered for more than 8 weeks. This indicates a potential benefit for individuals in managing their blood sugar levels.\n\n2. **Lowering Serum Total Cholesterol (TC) Concentrations:** The meta-analysis revealed that ACV consumption led to a significant decrease in serum total cholesterol levels. This effect can be beneficial for individuals looking to improve their lipid profiles and reduce the risk of cardiovascular diseases.\n\n3. **Decreasing Triglyceride (TG) Levels:** ACV consumption was associated with a significant reduction in triglyceride levels. Lowering triglycerides is important for maintaining heart health and reducing the risk of cardiovascular issues.'

In [39]:
print(ai_msg)

1. **Reducing Fasting Plasma Glucose (FPG) levels:** ACV consumption was found to significantly decrease FPG levels in studies where ACV was administered for more than 8 weeks. This indicates a potential benefit for individuals in managing their blood sugar levels.

2. **Lowering Serum Total Cholesterol (TC) Concentrations:** The meta-analysis revealed that ACV consumption led to a significant decrease in serum total cholesterol levels. This effect can be beneficial for individuals looking to improve their lipid profiles and reduce the risk of cardiovascular diseases.

3. **Decreasing Triglyceride (TG) Levels:** ACV consumption was associated with a significant reduction in triglyceride levels. Lowering triglycerides is important for maintaining heart health and reducing the risk of cardiovascular issues.
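The two turns above repeat the same steps: invoke the chain, record the question and the answer, return the answer. A small helper could wrap that pattern. The sketch below uses a stub in place of the real chain so it runs offline; `make_chat` and `stub_chain` are hypothetical names, and the history is stored as plain `(role, text)` tuples rather than LangChain message objects.

```python
def make_chat(invoke_chain):
    """Return an `ask` function that keeps the chat history up to date, plus the history list."""
    chat_history = []

    def ask(question: str) -> str:
        # Pass a copy so the chain cannot modify the stored history.
        answer = invoke_chain({"question": question, "chat_history": list(chat_history)})
        chat_history.extend([("human", question), ("ai", answer)])
        return answer

    return ask, chat_history


def stub_chain(inputs: dict) -> str:
    """Stand-in for the RAG chain: reports which turn it is answering."""
    turn = len(inputs["chat_history"]) // 2 + 1
    return f"answer #{turn} to: {inputs['question']}"


ask, history = make_chat(stub_chain)
first_answer = ask("What are the top 3 benefits?")
second_answer = ask("Can you tell me more about each one?")
print(first_answer)
print(second_answer)
```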
