# Advanced RAG Pipeline Utilizing LLM

Basic RAG capabilities: develop a pipeline to load, transform, and embed data, then store the embeddings in a vector database.
<br>Plus an LLM connection for augmented retrieval.
<br>A chain and retriever with stuff-document chaining will be implemented.

Uses GPT-4 or higher.

Import libraries

In [1]:
import os 
import sys
import platform
import pkg_resources

import langchain_community
import langchain_core

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma, FAISS
from langchain_core.prompts import PromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
import openai
from openai import OpenAI  # OpenAI SDK client; langchain_openai.OpenAI is not imported to avoid a name clash

from dotenv import load_dotenv

In [2]:
# Print Python, OS, and package versions
print(f"Python version: {platform.python_version()}")
print(f"OS: {platform.system()} {platform.release()}")
print("")
print(f"langchain_community: {langchain_community.__version__}")
print(f"langchain_core: {langchain_core.__version__}")
print(f"openai: {openai.__version__}")

Python version: 3.10.5
OS: Darwin 23.4.0

langchain_community: 0.2.4
langchain_core: 0.2.5
openai: 1.33.0
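
Not every package exposes a `__version__` attribute, so the version printout can also be done through package metadata. A minimal sketch, assuming Python 3.8+ (where `importlib.metadata` is in the standard library); the `safe_version` helper name is hypothetical and not part of the notebook:

```python
from importlib.metadata import PackageNotFoundError, version


def safe_version(package: str) -> str:
    """Return the installed version of `package`, or a placeholder if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


# Distribution names as published on PyPI (hyphens, not underscores)
for name in ("langchain-community", "langchain-core", "openai"):
    print(f"{name}: {safe_version(name)}")
```

This avoids an `AttributeError` when a package has no `__version__`, and reports missing packages instead of failing at import time.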


Load API keys

In [3]:
load_dotenv()  # reads the .env file and adds its variables to os.environ

# load_dotenv() already populates os.environ, so only check that the keys exist
for key in ("OPENAI_API_KEY", "LANGCHAIN_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"{key} is not set; add it to the .env file")

os.environ["LANGCHAIN_TRACING_V2"] = "true"  # enable LangSmith tracing

### Load, transform and embed

In [4]:
loader = PyPDFLoader('../data/sample_congressional_hearing.pdf')
docs   = loader.load()

docs[0:3]

[Document(page_content='ONLINE PLATFORMS AND MARKET POWER, \nPART 6: EXAMINING THE DOMINANCE OF \nAMAZON, APPLE, FACEBOOK, AND GOOGLE \nHEARING \nBEFORE THE  \nSUBCOMMITTEE ON ANTITRUST, COMMERCIAL AND \nADMINISTRATIVE LAW \nOF THE  \nCOMMITTEE ON THE JUDICIARY \nHOUSE OF REPRESENTATIVES \nONE HUNDRED SIXTEENTH CONGRESS \nSECOND SESSION \nJULY 29, 2020 \nSerial No. 116–94 \nPrinted for the use of the Committee on the Judiciary \n( \nAvailable http://judiciary.house.gov or www.govinfo.gov \nVerDate Sep 11 2014 23:14 Mar 24, 2021 Jkt 041317 PO 00000 Frm 00001 Fmt 6011 Sfmt 6011 E:\\HR\\OC\\A317.XXX A317khammond on DSKJM1Z7X2PROD with HEARING', metadata={'source': '../data/sample_congressional_hearing.pdf', 'page': 0}),
 Document(page_content='ONLINE PLATFORMS AND MARKET POWER, PART 6: EXAMINING THE DOMINANCE OF AMAZON, \nAPPLE, FACEBOOK, AND GOOGLE \nVerDate Sep 11 2014 23:14 Mar 24, 2021 Jkt 041317 PO 00000 Frm 00002 Fmt 6019 Sfmt 6019 E:\\HR\\OC\\A317.XXX A317khammond on DSKJM1Z7X2PROD

In [5]:
# Chunk the document using the recursive text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
split_doc     = text_splitter.split_documents(docs)

# PyPDFLoader returns one Document per page, so len(docs) is the page count
print("Number of pages in docs:", len(docs), "\n")
print("Number of chunks in split_doc:", len(split_doc), "\n")
print("Preview below:\n")
print(split_doc[:3])

Number of pages in docs: 758 

Number of chunks in split_doc: 1056 

Preview below:

[Document(page_content='ONLINE PLATFORMS AND MARKET POWER, \nPART 6: EXAMINING THE DOMINANCE OF \nAMAZON, APPLE, FACEBOOK, AND GOOGLE \nHEARING \nBEFORE THE  \nSUBCOMMITTEE ON ANTITRUST, COMMERCIAL AND \nADMINISTRATIVE LAW \nOF THE  \nCOMMITTEE ON THE JUDICIARY \nHOUSE OF REPRESENTATIVES \nONE HUNDRED SIXTEENTH CONGRESS \nSECOND SESSION \nJULY 29, 2020 \nSerial No. 116–94 \nPrinted for the use of the Committee on the Judiciary \n( \nAvailable http://judiciary.house.gov or www.govinfo.gov \nVerDate Sep 11 2014 23:14 Mar 24, 2021 Jkt 041317 PO 00000 Frm 00001 Fmt 6011 Sfmt 6011 E:\\HR\\OC\\A317.XXX A317khammond on DSKJM1Z7X2PROD with HEARING', metadata={'source': '../data/sample_congressional_hearing.pdf', 'page': 0}), Document(page_content='ONLINE PLATFORMS AND MARKET POWER, PART 6: EXAMINING THE DOMINANCE OF AMAZON, \nAPPLE, FACEBOOK, AND GOOGLE \nVerDate Sep 11 2014 23:14 Mar 24, 2021 Jkt 041317 PO 
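
To make the two splitter parameters concrete: `chunk_size` caps the length of each chunk in characters, and `chunk_overlap` is how many characters consecutive chunks share. The sketch below is a simplified fixed-width sliding window, not the actual `RecursiveCharacterTextSplitter` logic (which first tries to split on separators such as paragraph and line breaks, so its chunk boundaries will differ). The `chunk_text` function is hypothetical and for illustration only:

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-width sliding-window chunking (a simplification of the real splitter)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(sample, chunk_size=10, chunk_overlap=2):
    print(chunk)
# abcdefghij
# ijklmnopqr
# qrstuvwxyz
```

Each chunk repeats the last two characters of the previous one. With `chunk_overlap=20` on 1000-character chunks, as used above, the shared region is small, so a sentence that straddles a boundary may still be cut.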

In [6]:
# Convert the chunks to vector embeddings using OpenAI
# and store them in a vector database (vector store)

# To test on a small subset first: FAISS.from_documents(split_doc[:30], OpenAIEmbeddings())
db = FAISS.from_documents(split_doc, OpenAIEmbeddings())

In [7]:
# Query the vector database

query = "What is this text about?"
db.similarity_search(query)[0].page_content

'as an editor at The New York Times, she wrote a letter explaining \nwhy she resigned. And I’ll just read three sentences for—for all of you, actually. \nVerDate Sep 11 2014 23:14 Mar 24, 2021 Jkt 041317 PO 00000 Frm 00161 Fmt 6602 Sfmt 6602 E:\\HR\\OC\\A317.XXX A317khammond on DSKJM1Z7X2PROD with HEARING'
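
Conceptually, `similarity_search` embeds the query and returns the chunks whose vectors lie closest to it. A minimal plain-Python sketch of that ranking step, assuming Euclidean (L2) distance; the distance metric and the number of results returned by default are assumptions about the FAISS wrapper, not something this notebook confirms. The `top_k` helper and the toy vectors are made up for illustration:

```python
import math


def top_k(query_vec, doc_vecs, k=4):
    """Return indices of the k vectors closest to query_vec by Euclidean distance."""
    distances = [(math.dist(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    return [i for _, i in sorted(distances)[:k]]


# Toy 3-dimensional "embeddings"; real embedding vectors have far more dimensions
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.2, 0.1],
]
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, doc_vecs, k=2))  # [0, 2]
```

A broad query such as "What is this text about?" has no chunk that is clearly nearest, which may explain why the top hit above is an arbitrary passage rather than the title page.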

### Implement LLM capabilities in RAG

In [8]:
# Define the question and the LLM
query = "What is this text about?"
llm   = "gpt-4-turbo"

# Retrieve relevant documents
# (retriever.invoke replaces the deprecated get_relevant_documents)
retriever      = db.as_retriever()
retrieved_docs = retriever.invoke(query)

# Build the context from the retrieved chunks
context = "\n\n".join([doc.page_content for doc in retrieved_docs])


In [9]:
# Create the messages to send to the API
prompt_template = [
    {"role": "system", "content": "You are a helpful assistant. Answer the following question based only on the provided context. Think step by step before providing a detailed answer. Explain how you came to the answer."},
    {"role": "user", "content": f"The context is as follows: \n<context>\n{context}\n</context>\nQuestion: {query}"}
]

In [10]:
client = OpenAI()

completion = client.chat.completions.create(
    model    = llm,
    messages = prompt_template
)

print(completion.choices[0].message.content.strip())

The provided text is a compilation of excerpts from various congressional hearings or similar settings where major technology executives were questioned by lawmakers. Here's the breakdown of how I determined this:

1. **Presence of Major Tech Executives**: The mentions of names such as "Mr. Zuckerberg," "Mr. Bezos," and "Mr. Pichai" indicate that these dialogs involve high-profile figures from major tech companies — specifically, Mark Zuckerberg of Facebook, Jeff Bezos of Amazon, and Sundar Pichai of Google. These individuals typically testify before Congress or participate in hearings related to their company operations and policies.

2. **Topics Discussed**:
   - The text references discussions around business practices and technologies. For instance, Zuckerberg's mention relates to acquiring companies as a competitive strategy, possibly reflecting antitrust concerns.
   - Bezos is questioned on the patterns of behavior concerning third-party sellers on Amazon, suggesting concerns ab
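
The cells that follow repeat the same retrieve, prompt, and complete steps with a different `query` each time. One way to avoid the duplication is a small helper. This is a sketch, not part of the original notebook: the names `ask` and `SYSTEM_PROMPT` are hypothetical, and it assumes the retriever exposes `invoke` (the replacement for the deprecated `get_relevant_documents`) and that `client` is an `openai.OpenAI` instance.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer the following question based only on "
    "the provided context. Think step by step before providing a detailed "
    "answer. Explain how you came to the answer."
)


def ask(query, retriever, client, model="gpt-4-turbo"):
    """Retrieve context for `query`, send it to the chat model, return the answer text."""
    # Fetch the chunks most similar to the query
    retrieved_docs = retriever.invoke(query)

    # "Stuff" all retrieved chunks into a single context string
    context = "\n\n".join(doc.page_content for doc in retrieved_docs)

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"The context is as follows: \n<context>\n{context}\n</context>\nQuestion: {query}"},
    ]
    completion = client.chat.completions.create(model=model, messages=messages)
    return completion.choices[0].message.content.strip()


# Usage with the objects defined in the cells above:
# print(ask("Who are the people speaking?", db.as_retriever(), client))
```

Passing `retriever` and `client` as arguments keeps the helper independent of notebook globals, so it can be exercised with stand-in objects without calling the API.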

In [11]:
# --- ENTER INPUTS BELOW ---
query   = "Who are the people speaking?"
llm     = "gpt-4-turbo"


# --- DO NOT EDIT ---
# Retrieve relevant documents
retriever      = db.as_retriever()
retrieved_docs = retriever.invoke(query)


# Define the context

context = "\n\n".join([doc.page_content for doc in retrieved_docs])

# Create the messages to send to the API
prompt_template = [
    {"role": "system", "content": "You are a helpful assistant. Answer the following question based only on the provided context. Think step by step before providing a detailed answer. Explain how you came to the answer."},
    {"role": "user", "content": f"The context is as follows: \n<context>\n{context}\n</context>\nQuestion: {query}"}
]

completion = client.chat.completions.create(
    model    = llm,
    messages = prompt_template
)

print(completion.choices[0].message.content.strip())

Based on the context provided in the text, the speakers identified are:

1. **Mr. Cicilline** - Presumably a chairperson or leader of the committee mentioned. He conducts and adjourns the hearing and acknowledges receipt of letters addressed to other individuals.

2. **Mr. Zuckerberg** - Represents himself and his platform in the hearing. He speaks about the goal of his platform to provide a space for all voices and ideas.

3. **Mr. Armstrong** - A member of the subcommittee who introduces two letters addressed to Mr. Cook and Mr. Pichai, showing involvement in the proceedings.

These identified speakers are part of a legislative or investigatory hearing process discussing the influence of certain dominant market players and their impact on democracy and independent business.


In [17]:
# --- ENTER INPUTS BELOW ---
query   = "Who are all the people speaking? Could you list them all in bullets?"
llm     = "gpt-4-turbo"


# --- DO NOT EDIT ---
# Retrieve relevant documents
retriever      = db.as_retriever()
retrieved_docs = retriever.invoke(query)


# Define the context

context = "\n\n".join([doc.page_content for doc in retrieved_docs])

# Create the messages to send to the API
prompt_template = [
    {"role": "system", "content": "You are a helpful assistant. Answer the following question based only on the provided context. Think step by step before providing a detailed answer. Explain how you came to the answer."},
    {"role": "user", "content": f"The context is as follows: \n<context>\n{context}\n</context>\nQuestion: {query}"}
]

completion = client.chat.completions.create(
    model    = llm,
    messages = prompt_template
)

print(completion.choices[0].message.content.strip())

Based on the provided context, the following individuals are identified as speaking during the hearing:

- **Mr. Cicilline**: Likely the chairperson or a leading member of the subcommittee, addressing the witnesses, managing the procedural aspects of the hearing, and adjourning the meeting.
- **Witnesses mentioned**: They do not speak in the provided text but are referred to and include:
  - **Tim Cook**, Chief Executive Officer, Apple Inc.
  - **Mark Zuckerberg**, Chief Executive Administrator, Facebook, Inc.
  - **Other witnesses** that could be synonymous with "the men are named Zuckerberg, Cook, Pichai, and Bezos," referencing their influence in the marketplace.

These are the people identified in the text as being directly involved in verbal exchanges or referenced during the proceedings.


In [16]:
# --- ENTER INPUTS BELOW ---
query   = "What was the tone of the investigation? What was the conflict and resolution if any?"
llm     = "gpt-4-turbo"


# --- DO NOT EDIT ---
# Retrieve relevant documents
retriever      = db.as_retriever()
retrieved_docs = retriever.invoke(query)


# Define the context

context = "\n\n".join([doc.page_content for doc in retrieved_docs])

# Create the messages to send to the API
prompt_template = [
    {"role": "system", "content": "You are a helpful assistant. Answer the following question based only on the provided context. Think step by step before providing a detailed answer. Explain how you came to the answer."},
    {"role": "user", "content": f"The context is as follows: \n<context>\n{context}\n</context>\nQuestion: {query}"}
]

completion = client.chat.completions.create(
    model    = llm,
    messages = prompt_template
)

print(completion.choices[0].message.content.strip())

The tone of the investigation as described in the context appears to be serious and methodical, emphasizing thoroughness and bipartisanship. Several elements contribute to this tone:

1. **Bipartisan Effort**: The investigation was conducted with bipartisan cooperation, as evidenced by the cooperation between different ranking members from both parties. The mention of it being "an honor to work alongside" colleagues such as Congressman Jim Sensenbrenner and former ranking member Congressman Doug Collins underscores the collaborative and respectful nature of the proceedings.

2. **Extensive Data Collection**: The subcommittee was very diligent in its data collection, obtaining millions of pages of evidence from the firms testifying, as well as from over 100 market participants. This highlights the level of depth and seriousness with which the subcommittee approached the investigation.

3. **Expert Consultations and Hearings**: The process included multiple hearings, briefings, and round

In [15]:
# --- ENTER INPUTS BELOW ---
query   = "Create a summary of the text. Present it in key bullet points. Give me the opening testimonies, witness testimonies, and closing testimonies, as well as any conflicts and resolutions."
llm     = "gpt-4-turbo"


# --- DO NOT EDIT ---
# Retrieve relevant documents
retriever      = db.as_retriever()
retrieved_docs = retriever.invoke(query)


# Define the context

context = "\n\n".join([doc.page_content for doc in retrieved_docs])

# Create the messages to send to the API
prompt_template = [
    {"role": "system", "content": "You are a helpful assistant. Answer the following question based only on the provided context. Think step by step before providing a detailed answer. Explain how you came to the answer."},
    {"role": "user", "content": f"The context is as follows: \n<context>\n{context}\n</context>\nQuestion: {query}"}
]

completion = client.chat.completions.create(
    model    = llm,
    messages = prompt_template
)

print(completion.choices[0].message.content.strip())

### Summary of the Text

- **Opening and Swearing-In**:
  - The session starts with the chairperson, Mr. Cicilline, swearing in the witnesses. Participants include Jeff Bezos of Amazon, Sundar Pichai of Google, Tim Cook of Apple, and Mark Zuckerberg of Facebook.
  - They affirm to provide true testimony under penalty of perjury.

- **Guidelines for Testimony**:
  - Witnesses are reminded that their written statements will be fully entered into the record, and they are asked to summarize their testimonies in 5 minutes.
  - A timing light system on Webex is introduced to help manage time: green light indicates speaking time, yellow signals one minute remaining, and red marks the end of the allotted time.

- **Testimony Sequence and Content**:
  - Jeff Bezos is the first to begin his testimony, followed by Sundar Pichai, Tim Cook, and Mark Zuckerberg.
  - The testimony given by each CEO covers topics directly related to their companies and industries.

- **Management of Documents and Exhi