## PyPDFLoader

Provides functionality for loading PDF documents within the LangChain framework.

In [4]:
# !pip install pypdf 

from langchain_community.document_loaders import PyPDFLoader

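First, let's look at the PDF document. The `open` command is macOS-specific; on Windows, where this notebook was run, it fails as shown below, and `start` is the equivalent.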

In [6]:
!open documents/apple_cider.pdf

'open' is not recognized as an internal or external command,
operable program or batch file.


This line of code initializes the loader.

In [7]:
loader = PyPDFLoader("documents/apple_cider.pdf")

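Load the PDF with pypdf into the `pages` variable.

`load_and_split()` reads each page and then passes it through a default text splitter, so a long page can produce more than one chunk. The zero-based page number is stored in each chunk's metadata.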

In [8]:
pages = loader.load_and_split()

In [9]:
pages[:3]

[Document(page_content="REVIEW Open Access\nThe effect of apple cider vinegar on lipid\nprofiles and glycemic parameters: a\nsystematic review and meta-analysis of\nrandomized clinical trials\nAmir Hadi1, Makan Pourmasoumi2, Ameneh Najafgholizadeh3, Cain C. T. Clark4and Ahmad Esmaillzadeh5,6,7*\nAbstract\nBackground: Elevated lipid profiles and impaired glucose homeostasis are risk factors for several cardiovascular\ndiseases (CVDs), which, subsequently, represent a leading cause of early mortality, worldwide. The aim of the\ncurrent study was to conduct a systematic review and meta-analysis of the effect of apple cider vinegar (ACV) on\nlipid profiles and glycemic parameters in adults.\nMethods: A systematic search was conducted in electronic databases, including Medline, Scopus, Cochrane Library,\nand Web of Knowledge, from database inception to January 2020. All clinical trials which investigated the effect of\nACV on lipid profiles and glycemic indicators were included. Studies wer

In [10]:
for i in range(3):
    print(pages[i].metadata)

{'source': 'documents/apple_cider.pdf', 'page': 0}
{'source': 'documents/apple_cider.pdf', 'page': 1}
{'source': 'documents/apple_cider.pdf', 'page': 1}
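
Page 1 appears twice in the metadata above, which suggests that more than one chunk was produced for that page. As a small sketch, using only the three metadata dictionaries printed above copied in as plain Python data, the chunks per page can be counted like this:

```python
from collections import Counter

# The three metadata dictionaries printed above, copied as plain data.
metadata = [
    {"source": "documents/apple_cider.pdf", "page": 0},
    {"source": "documents/apple_cider.pdf", "page": 1},
    {"source": "documents/apple_cider.pdf", "page": 1},
]

# Count how many chunks came from each (zero-based) page number.
chunks_per_page = Counter(m["page"] for m in metadata)
print(dict(chunks_per_page))  # {0: 1, 1: 2}
```

With the real loader output, the same count could be taken over `[p.metadata for p in pages]`.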


Since each page of the PDF is still quite long, we break the pages into smaller pieces.

We give the chunks a bit of overlap so that a sentence falling on a chunk boundary is not lost.

In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

documents = text_splitter.split_documents(pages)

In [14]:
print(f"{len(pages)} vs {len(documents)}")

17 vs 57
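
To make the effect of `chunk_overlap` concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is a simplified stand-in and not the algorithm `RecursiveCharacterTextSplitter` uses, which also tries to split on separators such as paragraph and line breaks. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into chunk_size-character windows; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk starts with the last three characters of the previous one, so text that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.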


Let's now load the API key.

In [15]:
import os 
from dotenv import load_dotenv

load_dotenv(".env")

openai_api_key = os.getenv("open_ai_key")
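
`os.getenv` returns `None` when the variable is missing, so a mistyped name would only surface later as an authentication error. A minimal sketch of a fail-fast check follows. The helper name `require_env` is hypothetical, and the placeholder value is set only so the sketch runs on its own.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file")
    return value


# Placeholder value for demonstration; in the notebook, load_dotenv(".env") sets the real one.
os.environ["open_ai_key"] = "sk-placeholder"
print(require_env("open_ai_key"))
```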

## Embeddings: 

We are going to use OpenAI embeddings to convert each chunk of text into a numeric vector.

The reason is that searching through a large number of text chunks directly is very time-consuming, whereas comparing numeric vectors is extremely fast.

In [16]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
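
To illustrate what comparing vectors means, here is a sketch of cosine similarity, a common similarity measure for embeddings, applied to made-up 3-dimensional vectors. Real embeddings have far more dimensions, and the numbers below are invented purely for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 is same direction, 0.0 is orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy "embeddings"; real ones come from the embedding model.
vinegar = [0.9, 0.1, 0.0]
cider = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

# Related texts should score higher than unrelated ones.
print(cosine_similarity(vinegar, cider) > cosine_similarity(vinegar, weather))  # True
```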

## Chroma vector database

We are going to store the embedded chunks in a Chroma vector database so that they can be searched by similarity.

In [22]:
!pip install chromadb


In [23]:
from langchain_community.vectorstores import Chroma
vector = Chroma.from_documents(documents, embeddings)

In [17]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(openai_api_key=openai_api_key, model="gpt-3.5-turbo")

## Output parser

We would like to convert the output of the chat model into plain text.

In [24]:
from langchain_core.output_parsers import StrOutputParser
output_parser = StrOutputParser()
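
The chat model returns a message object rather than a bare string, and the parser's job is essentially to pull out the text. A rough stand-in in plain Python shows the idea. `FakeMessage` and `to_text` are hypothetical names and are not part of LangChain.

```python
from dataclasses import dataclass


@dataclass
class FakeMessage:
    """Hypothetical stand-in for a chat model reply, which carries its text in `content`."""
    content: str


def to_text(message: FakeMessage) -> str:
    """Mimic the parsing step: keep only the text of the reply."""
    return message.content


reply = FakeMessage(content="Hello from the model.")
print(to_text(reply))  # Hello from the model.
```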


## Retrievers

A retriever takes the question, compares it with all the numeric vectors in the database, and returns the most similar chunks of text.

In [25]:
retriever = vector.as_retriever()
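
Conceptually, retrieval means turning the question into a vector, scoring it against every stored vector, and returning the best-scoring chunks. The sketch below imitates that flow in plain Python with a toy word-count "embedding". The functions `embed`, `similarity`, and `retrieve` are hypothetical stand-ins, the sample chunks are invented, and none of this reflects how Chroma or OpenAI embeddings work internally.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are most similar to the question vector."""
    q = embed(question)
    return sorted(chunks, key=lambda c: similarity(q, embed(c)), reverse=True)[:k]


# Invented sample chunks for illustration only.
chunks = [
    "apple cider vinegar and cholesterol",
    "database search in Medline and Scopus",
    "vinegar and fasting glucose",
]
print(retrieve("vinegar cholesterol", chunks, k=1))
# ['apple cider vinegar and cholesterol']
```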

## Adding memory

## Question Maker 

When a user asks a new question, there is a history of questions and answers in their mind.

The idea here is to reformulate the user's question into a form that carries its own context.

We are going to use the LLM to perform this reformulation of the question.

Here is the idea:

User's follow-up question => LLM => reformulated question (with history)

In [26]:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

instructions_to_system="""

Given a chat history and the latest user question
which might reference context in the chat history, formulate a standalone question
which can be understood without the chat history. Do NOT answer the question, 
just reformulate it if needed; otherwise return it as is.

"""

question_maker_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", instructions_to_system),
        MessagesPlaceholder(variable_name = "chat_history"),
        ("human", "{question}"),
    ]
)


question_chain = question_maker_prompt | llm | StrOutputParser()

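
To see the shape of what the prompt template hands to the model, the sketch below assembles the same three parts (system instructions, chat history, latest question) as plain `(role, text)` tuples. The function `build_messages` and the sample conversation are hypothetical; in the notebook the real formatting is done by `ChatPromptTemplate`.

```python
def build_messages(
    system: str, chat_history: list[tuple[str, str]], question: str
) -> list[tuple[str, str]]:
    """Place the system instructions first, then the prior turns, then the new question."""
    return [("system", system), *chat_history, ("human", question)]


# Invented sample conversation for illustration only.
history = [
    ("human", "What does the review study?"),
    ("ai", "The effect of apple cider vinegar on lipid profiles and glycemic parameters."),
]
messages = build_messages(
    "Reformulate the latest question as a standalone question.",
    history,
    "Which databases did it search?",
)
for role, text in messages:
    print(f"{role}: {text}")
```

A follow-up such as "Which databases did it search?" only makes sense together with the history, which is why the question maker is asked to rewrite it into a standalone question before retrieval.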