In [1]:
import os
import sys
from dotenv import load_dotenv
import openai

In [2]:
# Load the environment variables from the local .env file
# NOTE: while using Jupyter Notebooks with VS Code, even when the .env
# is located in the root directory of the project, you must use ../.env
# instead of .env
load_dotenv(".env")

False
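
`load_dotenv(".env")` returned `False` above, which means no variables were loaded, most likely because no file exists at that relative path. Here is a minimal sketch of one way around guessing between `.env` and `../.env`: search upward from the working directory. `find_env_file` is a hypothetical helper written for this note, not part of python-dotenv (which ships its own `find_dotenv`).

```python
from pathlib import Path
from typing import Optional


def find_env_file(start: Path, name: str = ".env") -> Optional[Path]:
    """Return the first `name` found in `start` or any of its parents, else None."""
    start = start.resolve()
    for directory in [start, *start.parents]:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


# Hypothetical usage: pass the result to load_dotenv() when it is not None.
env_path = find_env_file(Path.cwd())
print(env_path)
```

If `env_path` is `None`, there is no `.env` anywhere above the notebook's working directory.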

In [3]:
# Check if the .env file exists in the Google Drive path
if os.path.exists("/content/drive/MyDrive/Projects/.env"):
    load_dotenv("/content/drive/MyDrive/Projects/.env")
    COLAB = True
    print("Note: using Google Colab")
else:
    COLAB = False
    print("Note: not using Google Colab")

Note: not using Google Colab


In [4]:
# Retrieving API keys from the environment
openai.api_key = os.getenv("OPENAI_API_KEY")

## Load and split the data

In [5]:
## load the PDF using pypdf
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# load the data
loader = PyPDFLoader(
    "../data/raw/Apple-Financial-Report-Q1-2022.pdf"
)

# Financial reports such as this 10-Q are long, so we need to split the document into multiple chunks.
# RecursiveCharacterTextSplitter is the recommended splitter for generic text. It is parameterized by a list of characters.
# It tries to split on them in order until the chunks are small enough.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=0
)
data = loader.load()
texts = text_splitter.split_documents(data)

# view the first chunk
texts[0]

Document(page_content='UNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\nFORM 10-Q\n(Mark One)\n☒    QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the quarterly period ended December\xa025, 2021\nor\n☐    TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the transition period from \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0  to \xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 .\nCommission File Number: 001-36743\nApple Inc.\n(Exact name of Registrant as specified in its charter)\nCalifornia 94-2404110\n(State or other jurisdiction\nof incorporation or organization)(I.R.S. Employer Identification No.)\nOne Apple Park Way\nCupertino , California 95014\n(Address of principal executive offices) (Zip Code)\n(408) 996-1010\n(Registrant’s telephone number, including area code)\nSecurities registered pursuant to Section 12(b) of the Act:\nTitle of each classTrading \nsymbol(
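
To make the "split on a list of characters in order" idea concrete, here is a simplified pure-Python sketch. It is not LangChain's implementation (it ignores `chunk_overlap`, for example), and `recursive_split` is a made-up name. It only shows the idea: try the coarsest separator first and fall back to finer ones for pieces that are still too long.

```python
from typing import List


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split `text` into chunks of at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: cut at fixed positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    head, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(head):
        candidate = piece if not current else current + head + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits, keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = "Net sales rose.\n\nOperating income was higher. Margins improved.\n\nEnd."
for chunk in recursive_split(sample, chunk_size=30, separators=["\n\n", "\n", " "]):
    print(repr(chunk))
```

Paragraph breaks are tried first, so short paragraphs stay whole; only the long middle paragraph gets split on spaces.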

## Simple Question Answering

We are going to use OpenAI as the LLM provider, so it makes sense to go with OpenAI embeddings too. But please  **note**  that the OpenAI Embedding API is a paid API, and the code below uses the  **“text-embedding-ada-002”**  model. You can view the pricing  [here](https://openai.com/pricing). The cost is small for a single document, but be careful when you intend to embed a large collection of documents (don’t break the bank, guys).

**Next step**: we will import  [Chroma](https://docs.trychroma.com/). If you are not familiar with Chroma, you can find the details on its official website. Again, I will cover Chroma and its alternatives sometime in the future. So the question is, what is Chroma and why do we need it?

In short, Chroma is an embedding database. It is not like a traditional SQL database or the not-so-new NoSQL databases you usually work with: it stores embeddings, and it makes it easy to build LLM apps.

![](https://miro.medium.com/v2/resize:fit:699/0*-4HPqxvt3UmR-iSN.png)

By Chroma Official Website

Our document is plain text, which makes it challenging to find the relevant information for a given question. Say you need to find Apple's revenue for the last quarter in 1,000 pages and compare it to previous years. How challenging and time-consuming would that be? To make the search easier, we first need to represent words or phrases in a numerical format that can be used as input to machine learning models. In other words, we need to help machines understand the text. An embedding maps each word or phrase to a vector of real numbers, typically with hundreds or even thousands of dimensions, such that similar words or phrases are mapped to similar vectors in the embedding space.

One of the main advantages of using embeddings is that they can capture the semantic and syntactic relationships between words or phrases. For example, in an embedding space, the vectors for “king” and “queen” would be closer to each other than to the vector for “apple”, because they are semantically related as royal titles.

![](https://miro.medium.com/v2/resize:fit:700/0*mijTnoEZJI7qqfBl.png)
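
Here is a small sketch of what "closer in the embedding space" means, using cosine similarity. The three-dimensional vectors are made up for illustration; real embeddings come from a model and have far more dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors, NOT real embeddings
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}

king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
king_apple = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(f"king vs queen: {king_queen:.3f}")
print(f"king vs apple: {king_apple:.3f}")
```

With these toy numbers, "king" scores much higher against "queen" than against "apple", which is the behaviour the paragraph above describes.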

So, the embedding database does exactly that. It stores all the embedding data and indexes it so that we can perform operations like data retrieval in a scalable way. To answer the earlier question about Apple's revenue last quarter, we first perform a similarity search (or semantic search) on an embedding database like Chroma to extract the relevant information, and then feed that information to the LLM to get the answer.
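
To picture what such a database does, here is a toy in-memory version. It is only a sketch under loud assumptions: `ToyVectorStore` is a made-up class, the two-dimensional vectors are invented, and the search is brute force over Euclidean distance, whereas a real store like Chroma builds an index so it does not have to scan everything.

```python
import math
from typing import List, Sequence, Tuple


class ToyVectorStore:
    """Keeps (text, vector) pairs in memory and searches them by brute force."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Tuple[float, ...]]] = []

    def add(self, text: str, vector: Sequence[float]) -> None:
        self._items.append((text, tuple(vector)))

    def similarity_search(self, query_vector: Sequence[float], k: int = 4) -> List[str]:
        # Smaller Euclidean distance means "more similar" in this sketch.
        query = tuple(query_vector)
        ranked = sorted(self._items, key=lambda item: math.dist(query, item[1]))
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add("Total operating income $ 41,488", [0.9, 0.1])
store.add("Research and development expense (6,306)", [0.6, 0.5])
store.add("One Apple Park Way, Cupertino", [0.1, 0.9])

# Pretend this is the embedding of "What is the operating income?"
query_vector = [0.85, 0.15]
results = store.similarity_search(query_vector, k=2)
print(results)
```

The texts returned by the search are what we would then hand to the LLM as context.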

Sounds too complex!! That is where Langchain comes to the rescue: all the hard work is done in the background for us. Let’s start coding, shall we?

In [6]:
# import Chroma and OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

# initialize OpenAIEmbedding
embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')

# use Chroma to create in-memory embedding database from the doc
docsearch = Chroma.from_documents(texts, embeddings,  metadatas=[{"source": str(i)} for i in range(len(texts))])

## perform search based on the question
query = "What is the operating income?"
docs = docsearch.similarity_search(query)

In [8]:
docs[0]

Document(page_content='Operating income $ 3,349 $ 3,503 \nRest of Asia Pacific:\nNet sales $ 9,810 $ 8,225 \nOperating income $ 3,995 $ 2,953 \nA reconciliation of the Company’s segment operating income to the Condensed Consolidated Statements of Operations for the \nthree months ended December 25, 2021 and December 26, 2020  is as follows (in millions):\nThree Months Ended\nDecember 25,\n2021December 26,\n2020\nSegment operating income $ 49,657 $ 40,360 \nResearch and development expense  (6,306)  (5,163) \nOther corporate expenses, net  (1,863)  (1,663) \nTotal operating income $ 41,488 $ 33,534 \nApple Inc. | Q1 2022  Form 10-Q | 13', metadata={'source': '../data/raw/Apple-Financial-Report-Q1-2022.pdf', 'page': 15})

### Explaining the code


1. Import Required Modules
   ```python
   from langchain.vectorstores import Chroma
   from langchain.embeddings.openai import OpenAIEmbeddings
   ```
   The first two lines import the necessary modules for the task. The `Chroma` class from the `langchain.vectorstores` module is used to create an in-memory database of text embeddings. The `OpenAIEmbeddings` class from the `langchain.embeddings.openai` module is used to generate these embeddings.

2. Initialize OpenAI Embedding
   ```python
   embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')
   ```
   Here, an instance of `OpenAIEmbeddings` is created using the model `'text-embedding-ada-002'`. This object, `embeddings`, will be used to convert text into numerical vectors that capture the semantic meaning of the words and phrases in the text.

3. Create Chroma Object
   ```python
   docsearch = Chroma.from_documents(texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))])
   ```
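   This line creates an instance of `Chroma` from a set of documents. The `from_documents` method takes the list of documents (`texts`) and the embedding model (`embeddings`). The call also passes a `metadatas` list that is meant to label each document with its index under the key `"source"`. Note, however, that the search result shown above carries the metadata set by the PDF loader (`source` is the file path, plus a `page` number), so the `metadatas` argument does not appear to take effect here: each document's own `metadata` is what ends up stored.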

   The `Chroma` instance `docsearch` serves as an in-memory database of the document embeddings. The embeddings are vectors in a high-dimensional space, where distances between vectors reflect the semantic similarity of the corresponding pieces of text.

4. Perform Similarity Search
   ```python
   query = "What is the operating income?"
   docs = docsearch.similarity_search(query)
   ```
   The `similarity_search` method of the `docsearch` object is used to find the documents that are most similar to a given query. In this case, the query is `"What is the operating income?"`. The method returns a list of `Document` objects (four by default, controlled by the `k` parameter), ordered by their similarity to the query.

The code uses the Langchain library, which provides tools for using language models and embeddings to build natural language processing applications. In this case, the application is a question-answering system that uses OpenAI's language model and embeddings, and the Chroma embedding database, to find relevant information in a large collection of documents.

An embedding database like Chroma is useful for this task because, given a question, it is challenging and time-consuming to find the relevant information in a large collection of text documents. The solution is to convert the text into numerical embeddings, which capture the semantic meaning of the words and phrases, and store these in a database. We can then perform a similarity search in the embedding space to quickly identify the documents that are most likely to contain the answer to the question. This information can then be passed to a language model to generate a response.

You see we are able to perform a similarity search to get relevant information from the embedding database.

Now, we will use one of the main components of Langchain, the Chain,
to incorporate the LLM provider into our code. Again, I know it is hard to
digest all of the concepts at once but hey, I will cover all of them in
another post. Remember, the purpose of this guide is to build the
question-answering bot. So just follow the steps, and if you are curious
and can’t wait to dig into the details, feel free to go to Langchain’s
official website. Valhalla awaits!!!!

There are four types of pre-built question-answering chains:

-   Question Answering:  **load_qa_chain**
-   Question Answering with Sources:  **load_qa_with_sources_chain**
-   Retrieval Question Answer:  **RetrievalQA**
-   Retrieval Question Answering with Sources:  **RetrievalQAWithSourcesChain**

They are pretty similar. Under the hood,  **RetrievalQA and RetrievalQAWithSourcesChain** use  **load_qa_chain and load_qa_with_sources_chain**  respectively. The only difference is that the first two feed every document you pass in (`input_documents`) to the LLM, while the last two first run a retriever and feed the LLM only the chunks it returns. With the first two, we can run the similarity search ourselves and pass only those results to the LLM, which is what the code below does with `docs`. The first two also give us more flexibility than the last two.
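
The difference can be sketched in plain Python. This is not LangChain's code: the function names, the prompt wording, the echoing "LLM" and the keyword "retriever" are all stand-ins invented for illustration.

```python
from typing import Callable, List

LLM = Callable[[str], str]
Retriever = Callable[[str], List[str]]


def build_stuff_prompt(docs: List[str], question: str) -> str:
    """'Stuff' every document that was passed in into one prompt."""
    context = "\n\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def answer_with_docs(llm: LLM, docs: List[str], question: str) -> str:
    """Like load_qa_chain: the caller chooses the documents."""
    return llm(build_stuff_prompt(docs, question))


def answer_with_retriever(llm: LLM, retriever: Retriever, question: str) -> str:
    """Like RetrievalQA: retrieve relevant documents first, then stuff them."""
    return answer_with_docs(llm, retriever(question), question)


# Stand-ins: the "LLM" echoes its prompt so we can see what it received,
# and the "retriever" matches shared words instead of comparing embeddings.
all_docs = ["Total operating income $ 41,488", "One Apple Park Way, Cupertino"]


def echo_llm(prompt: str) -> str:
    return prompt


def keyword_retriever(q: str) -> List[str]:
    words = set(q.lower().strip("?").split())
    return [d for d in all_docs if words & set(d.lower().split())]


question = "What is the operating income?"
print(answer_with_docs(echo_llm, all_docs, question))
print("---")
print(answer_with_retriever(echo_llm, keyword_retriever, question))
```

The first prompt contains both documents, while the second contains only the one the retriever picked.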

The following piece of code will demonstrate how we do it.

In [28]:
## importing necessary framework
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.chains import RetrievalQA
from langchain.chains import RetrievalQAWithSourcesChain

from langchain.chat_models import ChatOpenAI

Now we will try 4 different question-answering chains

1. load_qa_chain

In [15]:
## use the LLM to answer the question
chain = load_qa_chain(
    ChatOpenAI(temperature=0.2, model_name="gpt-3.5-turbo"), chain_type="stuff"
)
query = "What is the operating income?"
chain.run(input_documents=docs, question=query)

'The operating income for the three months ended December 25, 2021, was $41,488 million.'

2. load_qa_with_sources_chain

In [16]:
chain = load_qa_with_sources_chain(
    ChatOpenAI(temperature=0.2, model_name="gpt-3.5-turbo"), chain_type="stuff"
)
query = "What is the operating income?"
chain({"input_documents": docs, "question": query}, return_only_outputs=True)

{'output_text': 'The operating income for the three months ended December 25, 2021, was $41,488 million.\nSOURCES: ../data/raw/Apple-Financial-Report-Q1-2022.pdf'}

3. RetrievalQA

In [17]:
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0.2, model_name="gpt-3.5-turbo"),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
)
query = "What is the operating income?"
qa.run(query)

'The operating income for the three months ended December 25, 2021, was $41,488 million.'

4. RetrievalQAWithSourcesChain

In [18]:
chain = RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0.2, model_name="gpt-3.5-turbo"),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
)
chain({"question": "What is the operating income?"}, return_only_outputs=True)

{'answer': 'The operating income for the specified period is $41,488 million.\n',
 'sources': '../data/raw/Apple-Financial-Report-Q1-2022.pdf'}
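
Notice that `load_qa_with_sources_chain` returned a single `output_text` string with a `SOURCES:` line inside it, while `RetrievalQAWithSourcesChain` returned separate `answer` and `sources` keys. If you want the second shape from the first output, a rough sketch could look like this. `split_answer_and_sources` is a hypothetical helper, and this is not necessarily how Langchain does it internally.

```python
from typing import Dict


def split_answer_and_sources(output_text: str, marker: str = "SOURCES:") -> Dict[str, str]:
    """Split 'answer ... SOURCES: a, b' into its two parts."""
    answer, found, sources = output_text.partition(marker)
    return {"answer": answer.strip(), "sources": sources.strip() if found else ""}


# The output_text string returned by load_qa_with_sources_chain above
output_text = (
    "The operating income for the three months ended December 25, 2021, "
    "was $41,488 million.\nSOURCES: ../data/raw/Apple-Financial-Report-Q1-2022.pdf"
)
parsed = split_answer_and_sources(output_text)
print(parsed["answer"])
print(parsed["sources"])
```

If the marker is missing, the whole text is treated as the answer and `sources` is left empty.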

Pretty easy, ayy? Most of the code above is pretty basic. We just want to get this working before digging deeper into what the framework can offer. Until then, let’s move on to another framework that you can use in conjunction with Langchain, one that will give you more power to create even better LLM apps.

In [48]:
## use the LLM to answer the question
chain = load_qa_chain(
    ChatOpenAI(temperature=0.2, model_name="gpt-3.5-turbo"), chain_type="stuff"
)
query = "What is the engagement with Aluminium Stewardship Initiative?"
chain.run(input_documents=docs, question=query)

"Apple supports the Aluminium Stewardship Initiative (ASI) and is a member of the organization. ASI focuses on promoting responsible sourcing within the aluminum value chain. Apple has recently completed an audit against ASI's Performance Standard, which includes environmental, social, and governance criteria. This engagement with ASI demonstrates Apple's commitment to advancing responsible practices in the aluminum industry."

In [54]:
from pypdf import PdfReader

reader = PdfReader("../data/raw/Apple_Environmental_Progress_Report_2022.pdf")
page = reader.pages[98]
print(page.extract_text())

Our colocation facilities
The majority of our online services are provided by our own 
data centers; however, we also use third-party colocation 
facilities for additional data center capacity. While we don’t 
own these shared facilities and use only a portion of their total 
capacity, we include our portion of their energy use in our 
renewable energy goals.
Starting in January 2018, 100 percent of our power for 
colocation facilities was matched with renewable energy 
generated within the same state or NERC region for facilities  
in the United States, or within the same country or regional  
grid for those around the world. As our loads grow over time, 
we’ll continue working with our colocation suppliers to match 
100 percent of our energy use with renewables.
Furthermore, we worked with one of our main suppliers of 
colocation services to help it develop the capability to provide 
renewable energy solutions to its customers. This partnership 
advances Apple’s renewable energy prog

In [63]:
from llama_index import download_loader


SimpleWebPageReader = download_loader("SimpleWebPageReader")

loader = SimpleWebPageReader()
documents = loader.load_data(urls=['https://example.com/'])

JSONDecodeError: Extra data: line 1 column 4 (char 3)

In [64]:
from llama_index import GPTVectorStoreIndex, download_loader
from langchain.agents import initialize_agent, Tool
from langchain.llms import OpenAI
from langchain.chains.conversation.memory import ConversationBufferMemory

SimpleWebPageReader = download_loader("SimpleWebPageReader")

loader = SimpleWebPageReader()
documents = loader.load_data(urls=['https://google.com'])
index = GPTVectorStoreIndex.from_documents(documents)

tools = [
    Tool(
        name="Website Index",
        func=lambda q: index.query(q),
        description="Useful when you want to answer questions about the text on websites.",
    ),
]
llm = OpenAI(temperature=0)
memory = ConversationBufferMemory(memory_key="chat_history")
agent_chain = initialize_agent(
    tools, llm, agent="zero-shot-react-description", memory=memory
)

output = agent_chain.run(input="What language is on this website?")

JSONDecodeError: Extra data: line 1 column 4 (char 3)

In [61]:
import json

try:
    # This is where you'd do your JSON loading
    library = json.loads(library_raw_content)
except json.JSONDecodeError:
    print("Caught JSONDecodeError!")
    print(library_raw_content)  # This will print the content that's causing the issue


NameError: name 'library_raw_content' is not defined
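
The `NameError` just means `library_raw_content` was never defined in this notebook, so the `try` block never got as far as parsing anything. To see what the earlier `JSONDecodeError: Extra data: line 1 column 4 (char 3)` is telling us, here is a small sketch. The input string is an assumption chosen because it reproduces the same message; it was not captured from the failing `download_loader` call.

```python
import json

# Assumed example body (e.g. a plain-text error page), NOT the real response
raw = "404: Not Found"

try:
    json.loads(raw)
except json.JSONDecodeError as err:
    message = str(err)
    parsed_part = err.doc[: err.pos]  # what the parser accepted as valid JSON
    leftover = err.doc[err.pos:]      # the "extra data" it could not accept
    print(message)
    print(repr(parsed_part), repr(leftover))
```

"Extra data" at character 3 means the parser read a complete JSON value in the first three characters and then found more text after it. So the thing being parsed was probably not JSON at all, and printing the raw content, as the cell above tries to do, is the right instinct.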