# Setup

In [43]:
# import libraries
import os
import pandas as pd

from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
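# integrations live in langchain_community, matching the PyPDFLoader import above
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate
from langchain_community.chat_models import ChatOpenAI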

load_dotenv()
pd.set_option('display.max_colwidth', 0)

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')  # read from the environment or a .env file
# this example uses the Alphabet Inc. 10-K report for fiscal year 2022:
# https://www.sec.gov/Archives/edgar/data/1652044/000165204423000016/goog-20221231.htm
DOC_PATH = "./alphabet_10K_2022.pdf"
CHROMA_PATH = "rag_demo"


# Data Indexing

In [8]:
# load your pdf doc
loader = PyPDFLoader(DOC_PATH)
pages = loader.load()

In [9]:
# split the doc into chunks of at most 500 characters, with a 50-character overlap between neighbours
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(pages)
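
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a plain sliding-window splitter. It is a simplification written for illustration: `split_with_overlap` is a hypothetical helper, and it does not reproduce how `RecursiveCharacterTextSplitter` picks split points (that class prefers to break on separators such as paragraphs and newlines).

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# toy input: 1200 characters of repeating digits
sample = "".join(str(i % 10) for i in range(1200))
toy_chunks = split_with_overlap(sample)
print([len(c) for c in toy_chunks])
print(toy_chunks[0][-50:] == toy_chunks[1][:50])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.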

In [10]:
# get OpenAI Embedding model
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [11]:
# embed the chunks as vectors and load them into the database
db_chroma = Chroma.from_documents(chunks, embeddings, persist_directory=CHROMA_PATH)
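
Conceptually, the vector store keeps one vector per chunk and answers a query by ranking those vectors against the query's vector. The sketch below illustrates that idea with hand-made 3-dimensional vectors and cosine similarity. Everything in it is an assumption for illustration: `toy_index`, `cosine_similarity`, and `top_k` are hypothetical names, the vectors are invented, and Chroma's actual distance metric and index structure may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# invented "embeddings" for three chunk texts (real ones have far more dimensions)
toy_index = {
    "risk factors": [0.9, 0.1, 0.0],
    "revenue growth": [0.1, 0.9, 0.1],
    "data center outages": [0.7, 0.0, 0.6],
}


def top_k(query_vector, index, k=2):
    """Return the k (text, similarity) pairs most similar to query_vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


toy_results = top_k([1.0, 0.0, 0.2], toy_index)
print([text for text, _ in toy_results])
```

With a similarity measure, higher scores mean closer matches; with a distance measure, lower scores do, so it is worth checking which one a given store reports before interpreting its scores.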

# Retrieval and Generation

In [18]:
PROMPT_TEMPLATE = """
Answer the question based only on the following context:
{context}
Answer the question based on the above context: {question}.
Provide a detailed answer.
Don’t justify your answers.
Don’t give information not mentioned in the CONTEXT INFORMATION.
Do not say "according to the context" or "mentioned in the context" or similar.
"""
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)

In [42]:
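def run_query(query: str) -> dict:
    """Retrieve the 5 best-matching chunks for `query` and ask the chat model to answer from them."""
    # each result is a (Document, score) tuple
    docs = db_chroma.similarity_search_with_score(query, k=5)

    # join the chunk texts into one context block and fill in the prompt
    context_text = "\n\n".join([doc.page_content for doc, _score in docs])
    prompt = prompt_template.format(context=context_text, question=query)

    model = ChatOpenAI()  # picks up OPENAI_API_KEY from the environment
    response_text = model.predict(prompt)

    return {
        "docs": docs,
        "response": response_text
    }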

def get_dataframe_from_result(res):
    """Tabulate the page number and text of each retrieved chunk."""
    return pd.DataFrame({
        "page": [doc[0].metadata["page"] for doc in res["docs"]],
        "page_content": [doc[0].page_content for doc in res["docs"]],
    })
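
The result tables in this notebook list several chunks twice. One plausible cause, which is an assumption rather than something the notebook confirms, is that the indexing cell was run more than once against the same `persist_directory`, so the same chunks were stored repeatedly. A minimal sketch of filtering such repeats before building the context follows; `dedupe_results` and `FakeDocument` are hypothetical names, `FakeDocument` only mimics the two attributes used above, and the scores are invented.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_results(docs_with_scores):
    """Keep the first occurrence of each (page, text) pair, preserving order."""
    seen = set()
    unique = []
    for doc, score in docs_with_scores:
        key = (doc.metadata.get("page"), doc.page_content)
        if key not in seen:
            seen.add(key)
            unique.append((doc, score))
    return unique


# invented results shaped like the (Document, score) tuples used in run_query
raw = [
    (FakeDocument("value of our investments could decline", {"page": 21}), 0.31),
    (FakeDocument("value of our investments could decline", {"page": 21}), 0.31),
    (FakeDocument("power loss, telecommunications failures", {"page": 20}), 0.35),
]
deduped = dedupe_results(raw)
print([doc.metadata["page"] for doc, _ in deduped])
```

Filtering at query time only hides the symptom. If repeated indexing is indeed the cause, rebuilding the store from an empty directory would also free the `k=5` slots for distinct chunks.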
    

In [45]:
result1 = run_query("what are the top risks mentioned in the document?")

In [52]:
print(result1["response"])
get_dataframe_from_result(result1)

- Value of investments declining
- Harm to financial condition and operating results
- Failure of manufacturers and users to adopt products and services
- Unforeseen operating difficulties and expenditures
- Diversion of management time and focus
- Failure to obtain required approvals from governmental authorities
- Power loss
- Telecommunications failures
- Computer viruses
- Software bugs
- Ransomware attacks
- Computer denial of service attacks
- Phishing schemes
- Natural disasters affecting data centers
- Break-ins
- Sabotage
- Vandalism
- Potential disruptions in facility operations


,page,page_content
0,21,"As a result of these factors, the value of our investments could decline, which could harm our financial condition and\noperating results.\nRisks Related to our Industry\nPeople access the Internet through a variety of platforms and devices that continue to evolve with the\nadvancement of technology and user preferences. If manufacturers and users do not widely adopt versions of our\nproducts and services developed for these interfaces, our business could be harmed."
1,21,"As a result of these factors, the value of our investments could decline, which could harm our financial condition and\noperating results.\nRisks Related to our Industry\nPeople access the Internet through a variety of platforms and devices that continue to evolve with the\nadvancement of technology and user preferences. If manufacturers and users do not widely adopt versions of our\nproducts and services developed for these interfaces, our business could be harmed."
2,35,"unforeseen operating difficulties and expenditures. Some of the areas where we face risks include:\n• diversion of management time and focus from operating our business to challenges related to acquisitions and other\nstrategic transactions;\n• failure to obtain required approvals on a timely basis, if at all, from governmental authorities, or conditions placed upon\napproval that could, among other things, delay or prevent us from completing a transaction, or otherwise restrict our"
3,35,"unforeseen operating difficulties and expenditures. Some of the areas where we face risks include:\n• diversion of management time and focus from operating our business to challenges related to acquisitions and other\nstrategic transactions;\n• failure to obtain required approvals on a timely basis, if at all, from governmental authorities, or conditions placed upon\napproval that could, among other things, delay or prevent us from completing a transaction, or otherwise restrict our"
4,20,"power loss, telecommunications failures, computer viruses, software bugs, ransomware attacks, computer denial of service\nattacks, phishing schemes, or other attempts to harm or access our systems. Some of our data centers are located in areas\nwith a high risk of major earthquakes or other natural disasters. Our data centers are also subject to break-ins, sabotage, and\nintentional acts of vandalism, and, in some cases, to potential disruptions resulting from problems experienced by facility"


In [53]:
result2 = run_query("how did covid affect things?")

In [54]:
print(result2["response"])
get_dataframe_from_result(result2)

COVID-19 had a significant impact on the business, financial condition, and operating results. The pandemic affected revenue growth rate and expenses as a percentage of revenues. Additionally, there was outsized growth in advertising revenues during the COVID-19 pandemic. The shift from offline to online activities benefited the business but at a slower pace than historically due to COVID-19. The company also faced increased competition for user engagement and advertisers as a result of the pandemic.


,page,page_content
0,35,"performance.\nGeneral Risks\nThe continuing effects of the COVID-19 pandemic and its impact are highly unpredictable and could be\nsignificant, and could harm our business, financial condition, and operating results.\nOur business, operations and financial performance have been, and may continue to be, affected by the macroeconomic\nimpacts resulting from COVID-19, and as a result, our revenue growth rate and expenses as a percentage of our revenues in"
1,35,"performance.\nGeneral Risks\nThe continuing effects of the COVID-19 pandemic and its impact are highly unpredictable and could be\nsignificant, and could harm our business, financial condition, and operating results.\nOur business, operations and financial performance have been, and may continue to be, affected by the macroeconomic\nimpacts resulting from COVID-19, and as a result, our revenue growth rate and expenses as a percentage of our revenues in"
2,41,"The continuing shift from an offline to online world has contributed to the growth of our business and our revenues since\ninception. We expect that this shift to an online world will continue to benefit our business and our revenues, although at a\nslower pace than we have experienced historically, in particular after the outsized growth in our advertising revenues during\nthe COVID-19 pandemic. In addition, we face increasing competition for user engagement and advertisers, which may affect"
3,41,"The continuing shift from an offline to online world has contributed to the growth of our business and our revenues since\ninception. We expect that this shift to an online world will continue to benefit our business and our revenues, although at a\nslower pace than we have experienced historically, in particular after the outsized growth in our advertising revenues during\nthe COVID-19 pandemic. In addition, we face increasing competition for user engagement and advertisers, which may affect"
4,20,"interruption from modifications or upgrades, terrorist attacks, state-sponsored attacks, natural disasters or pandemics,\ngeopolitical tensions or armed conflicts, the effects of climate change (such as sea level rise, drought, flooding, heat waves,\nwildfires and resultant air quality effects and power shutoffs associated with wildfire prevention, and increased storm severity),"
