# RAG Langchain PDF Example - Council Financial Plans
This example loads one or more PDF documents, splits their contents into chunks, stores the chunks in a vector store, then uses a retriever to 
answer natural language questions.

It uses langchain, a popular package for chaining these operations together, with an underlying generative AI model, in this case from OpenAI.

In [1]:
import os

# Load the .env file.  This allows us to use environment variables in the .env file
from dotenv import load_dotenv
load_dotenv() # load the .env file

True
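The value loaded from the .env file is a secret, so it is safer to confirm that it is present than to echo it: anything shown in a cell's output is saved with the notebook. A minimal sketch, assuming a hypothetical helper named `describe_secret` (it is not part of dotenv or langchain):

```python
import os


def describe_secret(name: str) -> str:
    """Report whether an environment variable is set without revealing its value."""
    value = os.getenv(name)
    if not value:
        return f"{name} is not set"
    # Only the length is reported; the secret itself is never printed.
    return f"{name} is set ({len(value)} characters)"


print(describe_secret("OPENAI_API_KEY"))
```

If a key does end up in saved output, the usual remedy is to revoke it and issue a new one rather than just deleting the cell.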

In [2]:
# Confirm the key was loaded without printing it: a key shown in notebook output is saved with the file and must be treated as leaked
os.getenv("OPENAI_API_KEY") is not None

True

In [3]:
import textwrap

These are the imports required for the langchain pipeline.

In [4]:
from langchain import hub
from langchain_community.vectorstores import Chroma  # current import path; langchain.vectorstores is a deprecated alias
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


Get the PDF file(s) that we want to analyse.

In [5]:
pdf_dir = os.path.abspath("./pdf/")
pdf_dir

'c:\\Users\\Mark\\Documents\\repos\\Python-GenAI-Course\\pdf'

Load all PDFs in the directory

In [7]:
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

directory_loader = DirectoryLoader(pdf_dir, glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = directory_loader.load()
documents[:3]

[Document(metadata={'source': 'c:\\Users\\Mark\\Documents\\repos\\Python-GenAI-Course\\pdf\\derbyshire-dales-medium-term-financial-plan-appendix-1.pdf', 'page': 0}, page_content=' \n \n \nMedium-Term Financial Strategy \n2024/25 to 2028/29 \nDraft for approval by Full Council on 29th February 2024 \n \n \n \n \n \n \nThis Medium-Term Financial Strategy is intended to set out the Council’s strategic \napproach to the management of its finances and provide a framework within which \ndecisions can be made regarding future service provision and council tax levels. '),
 Document(metadata={'source': 'c:\\Users\\Mark\\Documents\\repos\\Python-GenAI-Course\\pdf\\derbyshire-dales-medium-term-financial-plan-appendix-1.pdf', 'page': 1}, page_content='Table of Contents \n  Page \n1 Executive Summary 1 \n   \n2 Overview 4 \n2.1 Purpose of the Strategy 4 \n2.2 Principles of the Strategy 4 \n2.3 Background 7 \n2.4 National and International Influences 7 \n2.5 Government Funding 9 \n2.6 The Coronaviru
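The output above shows that each loaded item is one PDF page, with `source` and `page` in its metadata. A quick sanity check after loading is to count pages per file. The sketch below is self-contained, so it uses a hypothetical stand-in class `PageDoc` with the same two attributes as a langchain Document; the file names in `sample` are made up for illustration.

```python
import os
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for a loaded page (same attributes as a langchain Document)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_per_file(docs) -> dict:
    """Count how many page-level documents came from each source file."""
    counts = Counter(
        os.path.basename(d.metadata.get("source", "unknown")) for d in docs
    )
    return dict(counts)


# Made-up sample data, not the council PDFs used in this notebook.
sample = [
    PageDoc("Medium-Term Financial Strategy", {"source": "pdf/plan-a.pdf", "page": 0}),
    PageDoc("Table of Contents", {"source": "pdf/plan-a.pdf", "page": 1}),
    PageDoc("Budget summary", {"source": "pdf/plan-b.pdf", "page": 0}),
]
print(pages_per_file(sample))
```

Because the function only reads `metadata`, calling `pages_per_file(documents)` on the list loaded above should work the same way.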

Use the ChatOpenAI class to create a language model

In [8]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")
llm

ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x0000015A7089D280>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x0000015A7089F230>, root_client=<openai.OpenAI object at 0x0000015A704AEF90>, root_async_client=<openai.AsyncOpenAI object at 0x0000015A7047B650>, model_name='gpt-4o-mini', model_kwargs={}, openai_api_key=SecretStr('**********'))

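Split the text into smaller chunks. RecursiveCharacterTextSplitter measures chunk_size in characters and tries to split on paragraph and line breaks first, falling back to spaces and then individual characters. Here each chunk is at most 1000 characters, with 100 characters of overlap between neighbouring chunks.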

In [10]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)
chunks[:3]

[Document(metadata={'source': 'c:\\Users\\Mark\\Documents\\repos\\Python-GenAI-Course\\pdf\\derbyshire-dales-medium-term-financial-plan-appendix-1.pdf', 'page': 0}, page_content='Medium-Term Financial Strategy \n2024/25 to 2028/29 \nDraft for approval by Full Council on 29th February 2024 \n \n \n \n \n \n \nThis Medium-Term Financial Strategy is intended to set out the Council’s strategic \napproach to the management of its finances and provide a framework within which \ndecisions can be made regarding future service provision and council tax levels.'),
 Document(metadata={'source': 'c:\\Users\\Mark\\Documents\\repos\\Python-GenAI-Course\\pdf\\derbyshire-dales-medium-term-financial-plan-appendix-1.pdf', 'page': 1}, page_content='Table of Contents \n  Page \n1 Executive Summary 1 \n   \n2 Overview 4 \n2.1 Purpose of the Strategy 4 \n2.2 Principles of the Strategy 4 \n2.3 Background 7 \n2.4 National and International Influences 7 \n2.5 Government Funding 9 \n2.6 The Coronavirus Pandemic
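To see what chunk size and chunk overlap mean, here is a deliberately simplified sketch: a fixed-size sliding window over characters. This is not how RecursiveCharacterTextSplitter works internally (it prefers natural separators), and `sliding_window_chunks` is a hypothetical name used only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size pieces where each piece repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=3)
print(chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, which is what helps a sentence that straddles a boundary stay retrievable.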

In [12]:
# Embed each chunk with OpenAIEmbeddings and store the vectors in a Chroma collection
# (no persist_directory is given, so the collection is not saved to disk)
vectorstore = Chroma.from_documents(documents=chunks, embedding=OpenAIEmbeddings())
vectorstore

<langchain_community.vectorstores.chroma.Chroma at 0x15a70896ff0>
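Conceptually, a vector store keeps one embedding vector per chunk and answers a query by returning the chunks whose vectors are closest to the query's vector. The sketch below illustrates that idea with cosine similarity over hand-made three-dimensional vectors. It is an illustration only: real embeddings have many more dimensions, and Chroma's actual index and distance metric may differ. The names `cosine_similarity`, `top_k` and `toy_store` are hypothetical.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored items most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hand-made vectors standing in for embeddings; not produced by any model.
toy_store = [
    ("council tax levels", [1.0, 0.0, 0.0]),
    ("reserves policy", [0.0, 1.0, 0.0]),
    ("council tax and reserves", [0.7, 0.7, 0.0]),
]
print(top_k([1.0, 0.2, 0.0], toy_store, k=2))
# → ['council tax levels', 'council tax and reserves']
```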

Create the RAG chain

In [13]:
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
rag_chain



{
  context: VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x0000015A70896FF0>, search_kwargs={})
           | RunnableLambda(format_docs),
  question: RunnablePassthrough()
}
| ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"), additional_kwargs={})])
| ChatOpenAI(client=<op
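The chain above is a pipeline: retrieve chunks for the question, join them into one context string, fill the prompt template, call the model, and return the text. The plain-Python sketch below mirrors that data flow without langchain. `fake_retriever` and `fake_llm` are hypothetical stubs with made-up return values, and `format_docs` here takes plain strings rather than Document objects; the template text is copied from the prompt shown in the output above.

```python
RAG_TEMPLATE = (
    "You are an assistant for question-answering tasks. Use the following pieces of "
    "retrieved context to answer the question. If you don't know the answer, just say "
    "that you don't know. Use three sentences maximum and keep the answer concise."
    "\nQuestion: {question} \nContext: {context} \nAnswer:"
)


def format_docs(texts):
    """Join retrieved chunk texts into a single context string."""
    return "\n\n".join(texts)


def fake_retriever(question):
    """Stub retriever: always returns the same two made-up chunks."""
    return [
        "Reserves are held to manage unforeseen costs.",
        "Government funding has declined.",
    ]


def fake_llm(prompt_text):
    """Stub model: reports the prompt size instead of calling an API."""
    return f"(stub answer based on a prompt of {len(prompt_text)} characters)"


def build_prompt(question, retriever=fake_retriever):
    """Steps 1-3: retrieve, format, and fill the template."""
    context = format_docs(retriever(question))
    return RAG_TEMPLATE.format(context=context, question=question)


def run_chain(question, retriever=fake_retriever, llm=fake_llm):
    """Steps 1-5: build the prompt, call the model, return clean text."""
    return llm(build_prompt(question, retriever)).strip()


print(run_chain("Why hold reserves?"))
```

Swapping the stubs for a real retriever and model is what the langchain `|` operators do, with the added benefit of streaming, batching and tracing support.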

In [14]:
# Some example prompts
test_prompt1 = "Summarise Middlesbrough council's financial plan in two paragraphs."
test_prompt2 = "In what ways are the financial plans of Middlesbrough council and Derbyshire Dales similar?"

In [15]:
response = rag_chain.invoke(test_prompt2)
wrapped_response = textwrap.fill(response, width=120)
print(wrapped_response)

Both Middlesborough Council and Derbyshire Dales District Council are facing financial challenges due to economic
downturns, high inflation, and declining government funding. Their financial plans must account for the need to maintain
reserves to manage unforeseen financial impacts. Additionally, both councils are adapting their financial strategies in
response to changes in the economic landscape and government funding structures.
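One thing to watch when printing answers: `textwrap.fill` treats newlines as ordinary whitespace, so a response with several paragraphs (as the first example prompt requests) would be merged into a single block. A small sketch that wraps each paragraph separately, assuming paragraphs are separated by blank lines; `wrap_paragraphs` is a hypothetical helper name.

```python
import textwrap


def wrap_paragraphs(text: str, width: int = 120) -> str:
    """Wrap each blank-line-separated paragraph on its own, keeping the breaks between them."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)


# Made-up two-paragraph text, standing in for a model response.
example = (
    "The first paragraph describes the overall budget position.\n\n"
    "The second paragraph describes the reserves policy."
)
print(wrap_paragraphs(example, width=40))
```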


Some old code below here - ignore

This code is useful for loading a single PDF. However, we are loading all PDF files in a folder, so it is commented out.

In [None]:
# from langchain_community.document_loaders import PyPDFLoader
# pdf_file = os.path.join(pdf_dir, "DAX Resources.pdf")
# pdf_file
# # Load a PDF file
# loader = PyPDFLoader(pdf_file)

# # Load pages of the document into text chunks
# documents = loader.load()

# documents

In [None]:
# from langchain.chains.summarize import load_summarize_chain
# from langchain_openai import ChatOpenAI

# # Load an OpenAI chat model (gpt-4 is a chat model, so the completion-style OpenAI class is the wrong wrapper)
# llm = ChatOpenAI(model="gpt-4")  # reads OPENAI_API_KEY from the environment

# # Use a summarization chain
# summarize_chain = load_summarize_chain(llm)

# # Summarize the chunks of the document
# summary = summarize_chain.invoke({"input_documents": chunks})["output_text"]
