# Setup

In [1]:
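# swap the standard-library sqlite3 module for pysqlite3 before anything imports it;
# Chroma needs a newer SQLite than some systems ship by default
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')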

In [2]:
# import libraries
import os
import pandas as pd

from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI

load_dotenv()
pd.set_option('display.max_colwidth', 0)

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') # read from the environment; put your OpenAI API key in a .env file
# for this example I used the Alphabet Inc. 10-K report for 2022
# https://www.sec.gov/Archives/edgar/data/1652044/000165204423000016/goog-20221231.htm
DOC_PATH = "./alphabet_10K_2022.pdf"
CHROMA_PATH = "rag_demo"


# Data Indexing

In [3]:
# load your pdf doc
loader = PyPDFLoader(DOC_PATH)
pages = loader.load()

In [4]:
# split the doc into smaller chunks: at most 500 characters each, with 50 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(pages)
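
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width chunking with overlap in plain Python. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally (that splitter also tries to break on separators such as paragraphs and newlines). The function name `sliding_chunks` and the sample string are hypothetical.

```python
from typing import List

def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        # stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return windows

# small sizes so the overlap is easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
```

With these toy settings each window repeats the last three characters of the one before it, which is the same idea as the 50-character overlap used above: text that straddles a boundary still appears whole in at least one chunk.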

In [5]:
# get OpenAI Embedding model
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [6]:
# embed the chunks as vectors and load them into the database
db_chroma = Chroma.from_documents(chunks, embeddings, persist_directory=CHROMA_PATH)
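
Conceptually, the vector store keeps one embedding per chunk and answers a query by ranking chunks by how close their vectors are to the query's vector. The sketch below illustrates that idea with made-up 3-dimensional vectors and cosine similarity. The vectors, texts, and the `top_k` helper are hypothetical, and Chroma's actual index and distance metric may differ.

```python
import math
from typing import List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: Sequence[float],
          store: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k stored texts most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# toy "embeddings": real ones have hundreds or thousands of dimensions
toy_store = [
    ("risk factors", [0.9, 0.1, 0.0]),
    ("revenue summary", [0.1, 0.9, 0.0]),
    ("employee headcount", [0.0, 0.1, 0.9]),
]
toy_query = [0.8, 0.2, 0.0]

ranked = top_k(toy_query, toy_store, k=2)
print([text for text, _ in ranked])
```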

# Retrieval and Generation

In [7]:
PROMPT_TEMPLATE = """
Answer the question based only on the following context:
{context}
Answer the question based on the above context: {question}.
Provide a detailed answer.
Don’t justify your answers.
Don’t give information not mentioned in the CONTEXT INFORMATION.
Do not say "according to the context" or "mentioned in the context" or similar.
"""
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
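
To see roughly what the model receives once the placeholders are filled, the sketch below uses plain `str.format` as a stand-in for the template object. The shortened template and the chunk texts are hypothetical, and `ChatPromptTemplate` may wrap the result differently (for example by adding a role label).

```python
# shortened stand-in for PROMPT_TEMPLATE with the same two placeholders
TOY_TEMPLATE = """Answer the question based only on the following context:
{context}
Answer the question based on the above context: {question}."""

# pretend these are the page_content strings of two retrieved chunks
toy_chunks = ["Chunk A text.", "Chunk B text."]

# retrieved chunks are separated by a blank line, then dropped into {context}
toy_context = "\n\n".join(toy_chunks)
toy_prompt = TOY_TEMPLATE.format(context=toy_context, question="What is in chunk A?")
print(toy_prompt)
```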

In [8]:
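def run_query(query: str) -> dict:
    # retrieve the 5 chunks closest to the query, each paired with its score
    docs = db_chroma.similarity_search_with_score(query, k=5)

    # join the retrieved chunks into one context block and fill in the prompt
    context_text = "\n\n".join([doc.page_content for doc, _score in docs])
    prompt = prompt_template.format(context=context_text, question=query)

    # ask the chat model to answer using only the supplied context
    model = ChatOpenAI()
    response_text = model.predict(prompt)

    return {
        "docs": docs,
        "response": response_text
    }

def get_dataframe_from_result(res):
    # tabulate the source page and text of each retrieved (doc, score) pair
    return pd.DataFrame({
        "page": [doc[0].metadata["page"] for doc in res["docs"]],
        "page_content": [doc[0].page_content for doc in res["docs"]],
    })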
    

In [9]:
result1 = run_query("what are the top risks mentioned in the document?")


In [10]:
print(result1["response"])
get_dataframe_from_result(result1)

The top risks mentioned in the document are interruption from modifications or upgrades, terrorist attacks, state-sponsored attacks, natural disasters or pandemics, geopolitical tensions or armed conflicts, and the effects of climate change. Additionally, another significant risk highlighted is the potential harm to the business if manufacturers and users do not widely adopt versions of their products and services developed for evolving platforms and devices.


index,page,page_content
0,20,"interruption from modifications or upgrades, terrorist attacks, state-sponsored attacks, natural disasters or pandemics,\ngeopolitical tensions or armed conflicts, the effects of climate change (such as sea level rise, drought, flooding, heat waves,\nwildfires and resultant air quality effects and power shutoffs associated with wildfire prevention, and increased storm severity),"
1,20,"interruption from modifications or upgrades, terrorist attacks, state-sponsored attacks, natural disasters or pandemics,\ngeopolitical tensions or armed conflicts, the effects of climate change (such as sea level rise, drought, flooding, heat waves,\nwildfires and resultant air quality effects and power shutoffs associated with wildfire prevention, and increased storm severity),"
2,20,"interruption from modifications or upgrades, terrorist attacks, state-sponsored attacks, natural disasters or pandemics,\ngeopolitical tensions or armed conflicts, the effects of climate change (such as sea level rise, drought, flooding, heat waves,\nwildfires and resultant air quality effects and power shutoffs associated with wildfire prevention, and increased storm severity),"
3,21,"As a result of these factors, the value of our investments could decline, which could harm our financial condition and\noperating results.\nRisks Related to our Industry\nPeople access the Internet through a variety of platforms and devices that continue to evolve with the\nadvancement of technology and user preferences. If manufacturers and users do not widely adopt versions of our\nproducts and services developed for these interfaces, our business could be harmed."
4,21,"As a result of these factors, the value of our investments could decline, which could harm our financial condition and\noperating results.\nRisks Related to our Industry\nPeople access the Internet through a variety of platforms and devices that continue to evolve with the\nadvancement of technology and user preferences. If manufacturers and users do not widely adopt versions of our\nproducts and services developed for these interfaces, our business could be harmed."
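
The first three rows above are identical, as are the last two, so the five retrieved slots carry only two distinct passages. One possible cause (an assumption; the notebook does not confirm it) is that the persisted `rag_demo` directory was indexed more than once. Whatever the cause, a small post-filter could drop repeats before the context is built. The helper `dedupe_results` below is hypothetical; it works on `(doc, score)` pairs where `doc` has a `page_content` attribute, and `FakeDoc` stands in for the real document class so the snippet runs on its own.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only page_content is needed here."""
    page_content: str

def dedupe_results(results: List[Tuple[FakeDoc, float]]) -> List[Tuple[FakeDoc, float]]:
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc, score in results:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append((doc, score))
    return unique

# toy results shaped like the output of a scored similarity search
toy_results = [
    (FakeDoc("passage about outages"), 0.31),
    (FakeDoc("passage about outages"), 0.31),
    (FakeDoc("passage about devices"), 0.35),
]
unique_results = dedupe_results(toy_results)
print(len(unique_results))
```

If duplicates are filtered this way, it may be worth requesting more than five results so that several distinct passages still reach the prompt.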


In [11]:
result2 = run_query("how did covid affect things?")

In [12]:
print(result2["response"])
get_dataframe_from_result(result2)

The COVID-19 pandemic had significant and unpredictable effects on the business, financial condition, and operating results. The macroeconomic impacts resulting from COVID-19 affected revenue growth rate and expenses as a percentage of revenues. Additionally, the outsized growth in advertising revenues during the pandemic may not continue at the same pace in the future. The shift from offline to online activities due to COVID-19 has contributed to the growth of the business and revenues, but there is increasing competition for user engagement and advertisers which may further impact performance.


index,page,page_content
0,35,"performance.\nGeneral Risks\nThe continuing effects of the COVID-19 pandemic and its impact are highly unpredictable and could be\nsignificant, and could harm our business, financial condition, and operating results.\nOur business, operations and financial performance have been, and may continue to be, affected by the macroeconomic\nimpacts resulting from COVID-19, and as a result, our revenue growth rate and expenses as a percentage of our revenues in"
1,35,"performance.\nGeneral Risks\nThe continuing effects of the COVID-19 pandemic and its impact are highly unpredictable and could be\nsignificant, and could harm our business, financial condition, and operating results.\nOur business, operations and financial performance have been, and may continue to be, affected by the macroeconomic\nimpacts resulting from COVID-19, and as a result, our revenue growth rate and expenses as a percentage of our revenues in"
2,35,"performance.\nGeneral Risks\nThe continuing effects of the COVID-19 pandemic and its impact are highly unpredictable and could be\nsignificant, and could harm our business, financial condition, and operating results.\nOur business, operations and financial performance have been, and may continue to be, affected by the macroeconomic\nimpacts resulting from COVID-19, and as a result, our revenue growth rate and expenses as a percentage of our revenues in"
3,41,"The continuing shift from an offline to online world has contributed to the growth of our business and our revenues since\ninception. We expect that this shift to an online world will continue to benefit our business and our revenues, although at a\nslower pace than we have experienced historically, in particular after the outsized growth in our advertising revenues during\nthe COVID-19 pandemic. In addition, we face increasing competition for user engagement and advertisers, which may affect"
4,41,"The continuing shift from an offline to online world has contributed to the growth of our business and our revenues since\ninception. We expect that this shift to an online world will continue to benefit our business and our revenues, although at a\nslower pace than we have experienced historically, in particular after the outsized growth in our advertising revenues during\nthe COVID-19 pandemic. In addition, we face increasing competition for user engagement and advertisers, which may affect"
