# Advanced RAG Pipeline Utilizing an LLM

Basic RAG capabilities: developing a pipeline to load, transform, and embed data for storage in a vector database.
<br>Plus an LLM connection for retrieval-augmented generation.
<br>A retriever and a retrieval chain using stuff-documents chaining will be implemented.

Import libraries

In [1]:
import os 
import sys
import platform
import pkg_resources

import langchain_community
import langchain_core

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma, FAISS
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

from dotenv import load_dotenv

In [2]:
# Print Python, OS, and package versions
print(f"Python version: {platform.python_version()}")
print(f"OS: {platform.system()} {platform.release()}")
print("")
print(f"langchain_community: {langchain_community.__version__}")
print(f"langchain_core: {langchain_core.__version__}")

Python version: 3.10.5
OS: Darwin 23.4.0

langchain_community: 0.2.4
langchain_core: 0.2.5


Load API keys

In [3]:
load_dotenv()

os.environ["OPENAI_API_KEY"]       = os.getenv("OPENAI_API_KEY")
os.environ["LANGCHAIN_API_KEY"]    = os.getenv("LANGCHAIN_API_KEY")
os.environ["LANGCHAIN_TRACING_V2"] = "true"
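Assigning `os.getenv(...)` straight into `os.environ` raises a `TypeError` when a key is missing from `.env`, because environment values must be strings. A minimal sketch of a friendlier check follows, assuming a hypothetical helper named `require_env` (it is not part of `dotenv` or LangChain):

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the requested environment variables, failing early if any is unset."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}
```

Calling `require_env("OPENAI_API_KEY", "LANGCHAIN_API_KEY")` right after `load_dotenv()` would then report every missing key by name instead of failing on the first assignment.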

### Load, transform and embed

In [4]:
loader = PyPDFLoader('../data/sample_congressional_hearing.pdf')
docs   = loader.load()

docs[0:3]

[Document(page_content='ONLINE PLATFORMS AND MARKET POWER, \nPART 6: EXAMINING THE DOMINANCE OF \nAMAZON, APPLE, FACEBOOK, AND GOOGLE \nHEARING \nBEFORE THE  \nSUBCOMMITTEE ON ANTITRUST, COMMERCIAL AND \nADMINISTRATIVE LAW \nOF THE  \nCOMMITTEE ON THE JUDICIARY \nHOUSE OF REPRESENTATIVES \nONE HUNDRED SIXTEENTH CONGRESS \nSECOND SESSION \nJULY 29, 2020 \nSerial No. 116–94 \nPrinted for the use of the Committee on the Judiciary \n( \nAvailable http://judiciary.house.gov or www.govinfo.gov \nVerDate Sep 11 2014 23:14 Mar 24, 2021 Jkt 041317 PO 00000 Frm 00001 Fmt 6011 Sfmt 6011 E:\\HR\\OC\\A317.XXX A317khammond on DSKJM1Z7X2PROD with HEARING', metadata={'source': '../data/sample_congressional_hearing.pdf', 'page': 0}),
 Document(page_content='ONLINE PLATFORMS AND MARKET POWER, PART 6: EXAMINING THE DOMINANCE OF AMAZON, \nAPPLE, FACEBOOK, AND GOOGLE \nVerDate Sep 11 2014 23:14 Mar 24, 2021 Jkt 041317 PO 00000 Frm 00002 Fmt 6019 Sfmt 6019 E:\\HR\\OC\\A317.XXX A317khammond on DSKJM1Z7X2PROD

In [5]:
# Chunk the document using recursive text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 20)
split_doc     = text_splitter.split_documents(docs)

print("Number of pages in docs:", len(docs), "\n")
print("Number of chunks in split_doc:", len(split_doc), "\n")
print("Preview below:\n")
print(split_doc[:3])

Number of pages in docs: 758 

Number of chunks in split_doc: 1056 

Preview below:

[Document(page_content='ONLINE PLATFORMS AND MARKET POWER, \nPART 6: EXAMINING THE DOMINANCE OF \nAMAZON, APPLE, FACEBOOK, AND GOOGLE \nHEARING \nBEFORE THE  \nSUBCOMMITTEE ON ANTITRUST, COMMERCIAL AND \nADMINISTRATIVE LAW \nOF THE  \nCOMMITTEE ON THE JUDICIARY \nHOUSE OF REPRESENTATIVES \nONE HUNDRED SIXTEENTH CONGRESS \nSECOND SESSION \nJULY 29, 2020 \nSerial No. 116–94 \nPrinted for the use of the Committee on the Judiciary \n( \nAvailable http://judiciary.house.gov or www.govinfo.gov \nVerDate Sep 11 2014 23:14 Mar 24, 2021 Jkt 041317 PO 00000 Frm 00001 Fmt 6011 Sfmt 6011 E:\\HR\\OC\\A317.XXX A317khammond on DSKJM1Z7X2PROD with HEARING', metadata={'source': '../data/sample_congressional_hearing.pdf', 'page': 0}), Document(page_content='ONLINE PLATFORMS AND MARKET POWER, PART 6: EXAMINING THE DOMINANCE OF AMAZON, \nAPPLE, FACEBOOK, AND GOOGLE \nVerDate Sep 11 2014 23:14 Mar 24, 2021 Jkt 041317 PO 
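To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines before falling back to characters); the function `sliding_chunks` is a hypothetical stand-in for illustration only.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=2):
    print(chunk)
```

With the notebook's settings (`chunk_size = 1000`, `chunk_overlap = 20`), any page longer than 1,000 characters is split into several chunks, which is why 758 pages turn into 1,056 chunks.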

In [6]:
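# Convert to vector embeddings using OpenAI
# Store vector embeddings in a vector database (vector store)
# Note: only the first 6 chunks are indexed here, so the answers below
# draw only on the first few pages of the hearing.

# db = FAISS.from_documents((split_doc[:30]), OpenAIEmbeddings())
db = FAISS.from_documents((split_doc[:6]), OpenAIEmbeddings())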

In [7]:
# Query the vector database

query = "What is this text about?"
db.similarity_search(query)[0].page_content

'(III) C O N T E N T S \nJULY 29, 2020 \nOPENING STATEMENTS \nPage \nThe Honorable David Cicilline, Chairman, Subcommittee on Antitrust, Com-\nmercial and Administrative Law ........................................................................ 2 \nThe Honorable James Sensenbrenner, Ranking Member, Subcommittee on \nAntitrust, Commercial and Administrative Law ............................................... 4 \nThe Honorable Jerrold Nadler, Chairman, Committee on the Judiciary ............ 5 \nThe Honorable Jim Jordan, Ranking Member, Committee on the Judiciary ..... 7 \nWITNESSES \nJeff Bezos, Chief Executive Officer, Amazon.com, Inc. \nOral Testimony ................................................................................................. 11 Prepared Testimony ......................................................................................... 13 \nSundar Pichai, Chief Executive Officer, Alphabet Inc.'
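Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below illustrates that idea with the standard library only. The bag-of-words `embed` function is a toy stand-in for `OpenAIEmbeddings`, and `TinyVectorStore` is a hypothetical class, not the FAISS API. Ranking by Euclidean distance is an assumption about the default metric of LangChain's FAISS wrapper.

```python
import math
from collections import Counter


def embed(text: str, vocabulary: list[str]) -> list[float]:
    """Toy embedding: count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


class TinyVectorStore:
    """Hypothetical in-memory store that ranks texts by Euclidean distance."""

    def __init__(self, texts: list[str]):
        self.texts = texts
        # Build the vocabulary from the stored texts only.
        self.vocabulary = sorted({w for t in texts for w in t.lower().split()})
        self.vectors = [embed(t, self.vocabulary) for t in texts]

    def similarity_search(self, query: str, k: int = 4) -> list[str]:
        query_vector = embed(query, self.vocabulary)
        # Smaller distance means more similar, so sort ascending.
        ranked = sorted(
            zip(self.vectors, self.texts),
            key=lambda pair: math.dist(query_vector, pair[0]),
        )
        return [text for _, text in ranked[:k]]


store = TinyVectorStore([
    "opening statements from the chairman",
    "oral testimony from the witnesses",
    "printed for the use of the committee",
])
print(store.similarity_search("witnesses testimony", k=1))
```

A real embedding model captures meaning rather than word counts, which is why the vague query "What is this text about?" can still land on the table of contents.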

### Implement LLM capabilities in RAG

In [8]:
# Load LLM
llm = OpenAI(model_name="gpt-3.5-turbo-instruct")

llm

OpenAI(client=<openai.resources.completions.Completions object at 0x12cadf010>, async_client=<openai.resources.completions.AsyncCompletions object at 0x12cade4d0>, openai_api_key=SecretStr('**********'), openai_proxy='')

In [9]:
# Define prompt

prompt_template = PromptTemplate.from_template("""Answer the following question based only on the provided context.
                                               Think step by step before providing a detailed answer.
                                               Explain how you came to the answer. 
                                               The context is as follows: 
                                               <context>
                                               {context}
                                               </context>
                                               Question: {input}""")


In [10]:
# Preview the rendered prompt. `db` is passed only as a placeholder here, so its
# repr appears inside <context>; the retrieval chain fills that slot with the
# retrieved documents instead.
print("\t\t\t\t\t\t",prompt_template.format(context = db, input = query))

						 Answer the following question based only on the provided context.
                                               Think step by step before providing a detailed answer.
                                               Explain how you came to the answer. 
                                               The context is as follows: 
                                               <context>
                                               <langchain_community.vectorstores.faiss.FAISS object at 0x12cadc4c0>
                                               </context>
                                               Question: What is this text about?


In [11]:
# Create a chain for passing a list of Documents to a model.
document_chain = create_stuff_documents_chain(llm , prompt_template)
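"Stuffing" means that every retrieved document is concatenated into the single `{context}` slot of the prompt, and the filled prompt is sent to the LLM in one call. A minimal sketch of that step with plain strings follows. `stuff_documents` is a hypothetical helper, the template is shortened, and the blank-line separator is an assumption about the chain's default.

```python
TEMPLATE = (
    "Answer the following question based only on the provided context.\n"
    "<context>\n{context}\n</context>\n"
    "Question: {input}"
)


def stuff_documents(page_contents: list[str], question: str, separator: str = "\n\n") -> str:
    """Join all document texts and place them in the prompt's {context} slot."""
    context = separator.join(page_contents)
    return TEMPLATE.format(context=context, input=question)


# Placeholder document texts, for illustration only.
prompt = stuff_documents(
    ["OPENING STATEMENTS ...", "WITNESSES ..."],
    "What is this text about?",
)
print(prompt)
```

Because everything goes into one prompt, the combined text has to fit within the model's context window, which is one reason to keep the number of retrieved chunks small.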

In [12]:
# Create a retriever
retriever = db.as_retriever()
retriever

VectorStoreRetriever(tags=['FAISS', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x12cadc4c0>)

In [13]:
# Create a retrieval chain (retriever + document chain = retrieval chain)
retrieval_chain = create_retrieval_chain(retriever, document_chain)
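The retrieval chain composes the two pieces: it passes the `input` question to the retriever, hands the returned documents to the document chain as `context`, and returns a dictionary with `input`, `context`, and `answer` keys, as the output of the following cell shows. A sketch of that control flow, using hypothetical stand-in callables rather than LangChain objects:

```python
from typing import Callable


def make_retrieval_chain(
    retrieve: Callable[[str], list[str]],
    answer_from_docs: Callable[[list[str], str], str],
) -> Callable[[dict], dict]:
    """Compose a retriever and a document chain into one callable."""

    def invoke(inputs: dict) -> dict:
        question = inputs["input"]
        context = retrieve(question)                  # step 1: fetch relevant chunks
        answer = answer_from_docs(context, question)  # step 2: prompt the LLM with them
        return {**inputs, "context": context, "answer": answer}

    return invoke


# Stand-ins for a real retriever and LLM call, for illustration only.
def fake_retrieve(question: str) -> list[str]:
    return ["HEARING BEFORE THE SUBCOMMITTEE ON ANTITRUST"]


def fake_answer(docs: list[str], question: str) -> str:
    return f"Answered '{question}' using {len(docs)} document(s)."


chain = make_retrieval_chain(fake_retrieve, fake_answer)
result = chain({"input": "What is this text about?"})
print(result["answer"])
```

Keeping retrieval and answering as separate steps means either one can be swapped (a different vector store, a different prompt or model) without touching the other.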

In [14]:
retrieval_chain.invoke({"input" : "What is this text about?"})

{'input': 'What is this text about?',
 'context': [Document(page_content='(III) C O N T E N T S \nJULY 29, 2020 \nOPENING STATEMENTS \nPage \nThe Honorable David Cicilline, Chairman, Subcommittee on Antitrust, Com-\nmercial and Administrative Law ........................................................................ 2 \nThe Honorable James Sensenbrenner, Ranking Member, Subcommittee on \nAntitrust, Commercial and Administrative Law ............................................... 4 \nThe Honorable Jerrold Nadler, Chairman, Committee on the Judiciary ............ 5 \nThe Honorable Jim Jordan, Ranking Member, Committee on the Judiciary ..... 7 \nWITNESSES \nJeff Bezos, Chief Executive Officer, Amazon.com, Inc. \nOral Testimony ................................................................................................. 11 Prepared Testimony ......................................................................................... 13 \nSundar Pichai, Chief Executive Officer, Alphabet 

In [15]:
print(retrieval_chain.invoke({"input" : "What is this text about?"})['answer'])



This text is about a hearing held by the Subcommittee on Antitrust, Commercial and Administrative Law of the Committee on the Judiciary, where the dominance of Amazon, Apple, Facebook, and Google in the online platforms and market power is being examined. The hearing took place on July 29, 2020 and included opening statements from various members of the subcommittee and testimony from Jeff Bezos, CEO of Amazon, and Sundar Pichai, CEO of Alphabet Inc. The text also mentions that the hearing is part of a series on online platforms and market power. 


In [16]:
print(retrieval_chain.invoke({"input" : "Who are the people speaking?"})['answer'])



The people speaking are members of the Subcommittee on Antitrust, Commercial and Administrative Law and the Committee on the Judiciary. They include Chairman David Cicilline, Ranking Member James Sensenbrenner, Chairman Jerrold Nadler, and Ranking Member Jim Jordan. They also include witnesses Jeff Bezos and Sundar Pichai, as well as various other members and staff of the subcommittee and committee. This can be determined by looking at the provided context, which lists the names and titles of the individuals speaking.


In [17]:
print(retrieval_chain.invoke({"input" : "Who are all the people speaking? Could you list them all in bullets?"})['answer'])



Answer:
- The Honorable David Cicilline, Chairman, Subcommittee on Antitrust, Commercial and Administrative Law
- The Honorable James Sensenbrenner, Ranking Member, Subcommittee on Antitrust, Commercial and Administrative Law
- The Honorable Jerrold Nadler, Chairman, Committee on the Judiciary
- The Honorable Jim Jordan, Ranking Member, Committee on the Judiciary
- Jeff Bezos, Chief Executive Officer, Amazon.com, Inc.
- Sundar Pichai, Chief Executive Officer, Alphabet Inc.
- Debbie Lesko, Arizona
- Guy Reschenthaler, Pennsylvania
- Ben Cline, Virginia
- Kelly Armstrong, North Dakota
- W. Gregory Steube, Florida
- Perry Apelbaum, Majority Staff Director & Chief Counsel
- Chris Hixon, Minority Staff Director
- David N. Cicilline, Rhode Island, Chair
- Joe Neguse, Colorado, Vice-Chair
- Henry C. 'Hank' Johnson Jr., Georgia
- Jamie Raskin, Maryland
- Pramila Jayapal, Washington
- Val Butler Demings, Florida
- Mary Gay Scanlon, Pennsylvania
- Lucy McBath, Georgia
- F. James Sensenbren


In [18]:
print(retrieval_chain.invoke({"input" : "What was the tone of the investigation? What was the conflict and resolution if any?"})['answer'])



Answer: Based on the provided context, the tone of the investigation appears to be serious and focused. The committee is holding a hearing to examine the dominance of Amazon, Apple, Facebook, and Google, which suggests that there may be concerns about their market power and potential antitrust violations. The committee is also hearing from high-level executives from these companies, indicating the gravity of the situation.

There does not appear to be a specific conflict or resolution mentioned in the context. However, it is possible that the investigation and hearing may lead to further actions or regulations to address the dominance of these companies in the market.


In [19]:
print(retrieval_chain.invoke({"input" : "Create a summary of the text, Present it in key bullet points. Give me the opening testimonies, witness testimonies and closing testimonies as well as any conflicts and resolutions"})['answer'])

.

Summary:
- The hearing is titled "Online Platforms and Market Power, Part 6: Examining the Dominance of Amazon, Apple, Facebook, and Google".
- It is taking place on July 29, 2020 and is the sixth hearing in a series.
- The hearing is being held by the Subcommittee on Antitrust, Commercial, and Administrative Law, under the Committee on the Judiciary.
- The witnesses are Jeff Bezos, CEO of Amazon.com, Inc., and Sundar Pichai, CEO of Alphabet Inc.
- There is a conflict between the majority and minority staff directors, Perry Apelbaum and Chris Hixon, respectively.
- The majority staff director for the Subcommittee is Slade Bond, and the minority chief counsel for Administrative Law is Douglas Geho.
- Opening statements are given by Chairman David Cicilline, Ranking Member James Sensenbrenner, Chairman Jerrold Nadler, and Ranking Member Jim Jordan.
- Witness testimonies are given by Jeff Bezos and Sundar Pichai, both providing oral and prepared statements.
- The hearing is focused on 