# Advanced RAG Pipeline Utilizing an LLM

Basic RAG capabilities: developing a pipeline to load, transform, and embed data, then store it in a vector database.
<br>Plus an LLM connection for augmented retrieval
<br>Chain and retriever
<br>Stuff document chain

Import libraries

In [1]:
import os 
import sys
import platform
import pkg_resources

import langchain_community
import langchain_core

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma, FAISS
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

from dotenv import load_dotenv

In [2]:
# Print Python, OS, and package versions
print(f"Python version: {platform.python_version()}")
print(f"OS: {platform.system()} {platform.release()}")
print("")
print(f"langchain_community: {langchain_community.__version__}")
print(f"langchain_core: {langchain_core.__version__}")

Python version: 3.10.5
OS: Darwin 23.4.0

langchain_community: 0.2.2
langchain_core: 0.2.4
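
The cell above reports only the two packages that expose `__version__`, and `pkg_resources` is imported but not used. As a minimal sketch, assuming Python 3.8+ (where `importlib.metadata` is in the standard library), versions can also be looked up by distribution name. The helper name `installed_version` and the list of distribution names are hypothetical and may not match your environment.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names below are assumptions; adjust them to your environment.
for name in ("langchain", "langchain-community", "langchain-core", "langchain-openai"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```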


Load API keys

In [3]:
load_dotenv()

os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY')
os.environ['LANGCHAIN_API_KEY'] = os.getenv('LANGCHAIN_API_KEY')
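
`os.getenv` returns `None` when a variable is not set, and assigning `None` into `os.environ` raises a `TypeError`, so a missing key fails with an unhelpful message. A small sketch of checking for the keys first; the helper name `missing_keys` is hypothetical and not part of the original notebook.

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(required: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the names in `required` that are absent or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Report which of the two keys used by this notebook are not available.
missing = missing_keys(["OPENAI_API_KEY", "LANGCHAIN_API_KEY"], os.environ)
if missing:
    print("Missing keys:", ", ".join(missing))
```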

### Load, transform and embed

In [4]:
loader = PyPDFLoader('../data/sample_congressional_hearing.pdf')
docs   = loader.load()

docs[0:3]

[Document(page_content='ONLINE PLATFORMS AND MARKET POWER, \nPART 6: EXAMINING THE DOMINANCE OF \nAMAZON, APPLE, FACEBOOK, AND GOOGLE \nHEARING \nBEFORE THE  \nSUBCOMMITTEE ON ANTITRUST, COMMERCIAL AND \nADMINISTRATIVE LAW \nOF THE  \nCOMMITTEE ON THE JUDICIARY \nHOUSE OF REPRESENTATIVES \nONE HUNDRED SIXTEENTH CONGRESS \nSECOND SESSION \nJULY 29, 2020 \nSerial No. 116–94 \nPrinted for the use of the Committee on the Judiciary \n( \nAvailable http://judiciary.house.gov or www.govinfo.gov \nVerDate Sep 11 2014 23:14 Mar 24, 2021 Jkt 041317 PO 00000 Frm 00001 Fmt 6011 Sfmt 6011 E:\\HR\\OC\\A317.XXX A317khammond on DSKJM1Z7X2PROD with HEARING', metadata={'source': '../data/sample_congressional_hearing.pdf', 'page': 0}),
 Document(page_content='ONLINE PLATFORMS AND MARKET POWER, PART 6: EXAMINING THE DOMINANCE OF AMAZON, \nAPPLE, FACEBOOK, AND GOOGLE \nVerDate Sep 11 2014 23:14 Mar 24, 2021 Jkt 041317 PO 00000 Frm 00002 Fmt 6019 Sfmt 6019 E:\\HR\\OC\\A317.XXX A317khammond on DSKJM1Z7X2PROD

In [5]:
# Chunk the document using the recursive text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 20)
split_doc     = text_splitter.split_documents(docs)

print("Number of pages in docs:", len(docs), "\n")
print("Number of chunks in split_doc:", len(split_doc), "\n")
print("Preview below:\n")
print(split_doc[:3])

Number of pages in docs: 758 

Number of chunks in split_doc: 1056 

Preview below:

[Document(page_content='ONLINE PLATFORMS AND MARKET POWER, \nPART 6: EXAMINING THE DOMINANCE OF \nAMAZON, APPLE, FACEBOOK, AND GOOGLE \nHEARING \nBEFORE THE  \nSUBCOMMITTEE ON ANTITRUST, COMMERCIAL AND \nADMINISTRATIVE LAW \nOF THE  \nCOMMITTEE ON THE JUDICIARY \nHOUSE OF REPRESENTATIVES \nONE HUNDRED SIXTEENTH CONGRESS \nSECOND SESSION \nJULY 29, 2020 \nSerial No. 116–94 \nPrinted for the use of the Committee on the Judiciary \n( \nAvailable http://judiciary.house.gov or www.govinfo.gov \nVerDate Sep 11 2014 23:14 Mar 24, 2021 Jkt 041317 PO 00000 Frm 00001 Fmt 6011 Sfmt 6011 E:\\HR\\OC\\A317.XXX A317khammond on DSKJM1Z7X2PROD with HEARING', metadata={'source': '../data/sample_congressional_hearing.pdf', 'page': 0}), Document(page_content='ONLINE PLATFORMS AND MARKET POWER, PART 6: EXAMINING THE DOMINANCE OF AMAZON, \nAPPLE, FACEBOOK, AND GOOGLE \nVerDate Sep 11 2014 23:14 Mar 24, 2021 Jkt 041317 PO 
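
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch of overlapping chunks. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks); it only slides a fixed-width window over the text. The function name `split_with_overlap` is hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into fixed-width windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

Each chunk starts with the last character of the previous one, which is the role the 20-character overlap plays in the cell above.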

In [6]:
# Convert the chunks to vector embeddings using OpenAI
# Store the embeddings in a vector database (vector store)

# db = FAISS.from_documents(split_doc[:30], OpenAIEmbeddings())  # subset of 30 chunks
db = FAISS.from_documents(split_doc, OpenAIEmbeddings())

In [7]:
# Query the vector database

query = "What is this text about?"
db.similarity_search(query)[0].page_content

'as an editor at The New York Times, she wrote a letter explaining \nwhy she resigned. And I’ll just read three sentences for—for all of you, actually. \nVerDate Sep 11 2014 23:14 Mar 24, 2021 Jkt 041317 PO 00000 Frm 00161 Fmt 6602 Sfmt 6602 E:\\HR\\OC\\A317.XXX A317khammond on DSKJM1Z7X2PROD with HEARING'
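
The query above returns the chunk whose embedding lies closest to the query's embedding, which need not be a chunk that describes the document as a whole. A toy sketch of that nearest-neighbour ranking, assuming Euclidean (L2) distance and using hand-made 2-D vectors in place of real OpenAI embeddings; the names `l2_distance`, `nearest`, and `toy_vectors` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec: Sequence[float],
            doc_vecs: Sequence[Sequence[float]],
            k: int = 4) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors closest to the query."""
    scored = [(i, l2_distance(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy stand-ins for chunk embeddings; real embeddings have many more dimensions.
toy_vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.5, 0.5]]
print(nearest([1.0, 0.0], toy_vectors, k=2))
```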

### Implement LLM capabilities in RAG

In [8]:
# Load LLM
llm = OpenAI(model_name="gpt-3.5-turbo-instruct")

llm

OpenAI(client=<openai.resources.completions.Completions object at 0x110c45f00>, async_client=<openai.resources.completions.AsyncCompletions object at 0x1276b0b50>, openai_api_key=SecretStr('**********'), openai_proxy='')

In [9]:
# Define prompt

prompt_template = PromptTemplate.from_template("""Answer the following question based only on the provided context.
                                               Think step by step before providing a detailed answer.
                                               Explain how you came to the answer. 
                                               The context is as follows: 
                                               <context>
                                               {context}
                                               </context>
                                               Question: {input}""")


In [10]:
# Preview the rendered template. Passing `db` only drops the vector store's repr into {context};
# the retrieval chain below fills {context} with the retrieved documents instead.
print("\t\t\t\t\t\t",prompt_template.format(context = db, input = query))

						 Answer the following question based only on the provided context.
                                               Think step by step before providing a detailed answer.
                                               Explain how you came to the answer. 
                                               The context is as follows: 
                                               <context>
                                               <langchain_community.vectorstores.faiss.FAISS object at 0x127414160>
                                               </context>
                                               Question: What is this text about?


In [11]:
# Create a chain for passing a list of Documents to a model.
document_chain = create_stuff_documents_chain(llm , prompt_template)
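
"Stuffing" means every retrieved document is pasted into the single `{context}` slot of the prompt. A rough sketch of the idea with plain strings; the shortened `TEMPLATE`, the helper names, and the blank-line separator are assumptions for illustration, not LangChain's implementation.

```python
from typing import List

# Shortened stand-in for the notebook's prompt; same two placeholders.
TEMPLATE = "<context>\n{context}\n</context>\nQuestion: {input}"


def stuff_documents(page_contents: List[str], separator: str = "\n\n") -> str:
    """Join the text of all documents into one context string."""
    return separator.join(page_contents)


def build_prompt(page_contents: List[str], question: str) -> str:
    """Fill the template with the stuffed context and the user's question."""
    return TEMPLATE.format(context=stuff_documents(page_contents), input=question)


print(build_prompt(["First chunk.", "Second chunk."], "What is this text about?"))
```

Because everything lands in one prompt, the combined length of the retrieved chunks has to fit within the model's context window.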

In [12]:
# Create a retriever
retriever = db.as_retriever()
retriever

VectorStoreRetriever(tags=['FAISS', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x127414160>)

In [13]:
# Create a retrieval chain (retriever + document chain = retrieval chain)
retrieval_chain = create_retrieval_chain(retriever , document_chain)
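
Conceptually, the retrieval chain runs two steps per call: fetch the relevant chunks for the question, then pass both to the document chain. The sketch below imitates that flow with plain functions; `make_retrieval_chain`, `fake_retrieve`, and `fake_answer` are hypothetical stand-ins, and no LangChain or OpenAI call is made.

```python
from typing import Callable, Dict, List


def make_retrieval_chain(retrieve: Callable[[str], List[str]],
                         answer_from: Callable[[str, List[str]], str]):
    """Compose a retriever and an answering step into one callable."""
    def invoke(inputs: Dict[str, str]) -> Dict[str, object]:
        context = retrieve(inputs["input"])              # step 1: fetch chunks
        answer = answer_from(inputs["input"], context)   # step 2: answer from them
        return {**inputs, "context": context, "answer": answer}
    return invoke


def fake_retrieve(question: str) -> List[str]:
    """Stand-in for the vector store retriever."""
    return ["chunk A", "chunk B"]


def fake_answer(question: str, context: List[str]) -> str:
    """Stand-in for the LLM-backed document chain."""
    return f"Answered from {len(context)} chunks."


chain = make_retrieval_chain(fake_retrieve, fake_answer)
result = chain({"input": "What is this text about?"})
print(result["answer"])
```

The returned dictionary mirrors the `input`, `context`, and `answer` keys seen in the notebook's output. Since each call repeats both steps, storing the result once and reading several keys from it avoids a second retrieval and LLM request.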

In [14]:
retrieval_chain.invoke({"input" : "What is this text about?"})

{'input': 'What is this text about?',
 'context': [Document(page_content='as an editor at The New York Times, she wrote a letter explaining \nwhy she resigned. And I’ll just read three sentences for—for all of you, actually. \nVerDate Sep 11 2014 23:14 Mar 24, 2021 Jkt 041317 PO 00000 Frm 00161 Fmt 6602 Sfmt 6602 E:\\HR\\OC\\A317.XXX A317khammond on DSKJM1Z7X2PROD with HEARING', metadata={'source': '../data/sample_congressional_hearing.pdf', 'page': 160}),
  Document(page_content='to address this and, frankly, to correct the record, because I believe that what he was referring to was a question that was incoming from investors about whether we would continue to acquire dif-ferent companies. I don’t think that was—that wasn’t referring to an internal strategy; it was referring to an external question that we were facing about how investors should expect us to act going forward. \nAnd I think he was discussing the fact that, as mobile phones \nwere growing in popularity, there were a lot

In [15]:
print(retrieval_chain.invoke({"input" : "What is this text about?"})['answer'])



This text is about a hearing where members of Congress are questioning the CEOs of major tech companies, including Amazon and Google, about their business practices and treatment of third-party sellers and user data. The specific excerpt is a dialogue between Representative Lucy McBath and Amazon CEO Jeff Bezos, where she expresses concerns about a pattern of behavior and asks for assurances that these issues will be addressed in the future.


In [16]:
print(retrieval_chain.invoke({"input" : "Who are the people speaking?"})['answer'])



The people speaking are members of the subcommittee, including Mr. CICILLINE, who is the chairman, and Mr. ZUCKERBERG, who is one of the witnesses invited to testify.


In [27]:
print(retrieval_chain.invoke({"input" : "Who are all the people speaking? Could you list them all in bullets?"})['answer'])



Answer: 
- Mr. CICILLINE
- Tim Cook
- Mark Zuckerberg


In [19]:
print(retrieval_chain.invoke({"input" : "What was the tone of the investigation? What was the conflict and resolution if any?"})['answer'])



Answer: The tone of the investigation was serious and bipartisan. The conflict was the potential for online market power to harm innovation, privacy, and independent businesses. The resolution was for the subcommittee to publish a report with proposed solutions to these problems.


In [20]:
print(retrieval_chain.invoke({"input" : "Create a summary of the text, Present it in key bullet points. Give me the opening testimonies, witness testimonies and closing testimonies as well as any conflicts and resolutions"})['answer'])



Opening statements:
- Representatives are asked to swear or affirm under penalty of perjury that their testimony is true and correct.
- All witnesses answer in the affirmative.
- Witnesses are asked to summarize their testimonies in 5 minutes with the help of a timing light.

Witness testimonies:
- Jeff Bezos, CEO of Amazon.com, Inc., testifies.
- Sundar Pichai, CEO of Google, Inc., testifies.
- Tim Cook, CEO of Apple Inc., testifies.
- Mark Zuckerberg, CEO of Facebook, Inc., testifies.

Closing statements:
- Majority members submit letters and statements for the hearing record.
- Witnesses are thanked for their participation.
- Witnesses are sworn in and reminded to mute themselves if needed.

Conflicts and resolutions:
- Witnesses are the only ones invited to testify from their respective companies.
- Witnesses must provide their own sworn testimony.
- Representatives are reminded to let the committee know if they need to mute themselves.
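
One likely reason whole-document questions such as the summary above come back thin is that only a handful of chunks ever reach the prompt. A back-of-the-envelope sketch, assuming the retriever returns 4 chunks per query (an assumption: no `search_kwargs` are set in this notebook) out of the 1056 chunks produced earlier; the function name `context_coverage` is hypothetical.

```python
def context_coverage(total_chunks: int, k: int) -> float:
    """Fraction of all chunks that a single retrieval of k chunks can cover."""
    if total_chunks <= 0 or k < 0:
        raise ValueError("total_chunks must be positive and k non-negative")
    return min(k, total_chunks) / total_chunks


# 1056 comes from the chunking cell; k=4 is the assumed retriever setting.
coverage = context_coverage(total_chunks=1056, k=4)
print(f"{coverage:.2%} of the chunks reach the prompt")
```

Under these assumptions the model sees well under one percent of the hearing per question. Passing `search_kwargs={"k": ...}` to `db.as_retriever` is one way to widen that window, within the model's context limit.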
