# Quickstart: Querying a PDF with Astra DB and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Pre-requisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), you should get a DB Token with role _Database Administrator_ and copy your Database ID: you will need these connection parameters shortly.

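You also need an API key for the LLM provider. The code in this notebook uses Google PaLM, with a key obtained from https://makersuite.google.com/; see the CassIO notes on [LLM access](https://cassio.org/start_here/#llm-access) for details.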

#### What you will do:

- Setup: import dependencies, provide secrets, read the PDF, and create the LangChain vector store;
- Run a question-answering loop that retrieves the relevant text chunks and has an LLM construct the answer.

Install the required dependencies:

In [22]:
!pip install -q cassio datasets langchain openai tiktoken

In [23]:
!pip install -U langchain-community



Import the packages you'll need:

In [38]:
# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.embeddings import GooglePalmEmbeddings
from langchain.llms import GooglePalm

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

In [39]:
api_key = "<YOUR_GOOGLE_API_KEY>"  # get a free API key from https://makersuite.google.com/ and never share a real key in a notebook

llm = GooglePalm(google_api_key=api_key, temperature=0.1)
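
Hard-coding a key in a notebook cell, as above, makes it easy to leak when the notebook is shared. One possible alternative is sketched below: read the secret from an environment variable and fall back to a non-echoing prompt. The helper name `read_secret` and the variable name `GOOGLE_API_KEY` are assumptions made for this illustration, not part of the original notebook.

```python
import os
from getpass import getpass


def read_secret(name: str, prompt: bool = True) -> str:
    """Return the secret stored in the environment variable `name`.

    If the variable is unset or empty, ask for the value interactively
    (without echoing it) when `prompt` is True; otherwise raise KeyError.
    """
    value = os.environ.get(name, "").strip()
    if value:
        return value
    if prompt:
        return getpass(f"Enter {name}: ").strip()
    raise KeyError(f"Environment variable {name} is not set")


# Hypothetical usage (the variable name is an assumption):
# api_key = read_secret("GOOGLE_API_KEY")
```

The same helper could be used for the Astra DB token and Database ID, so that no credential ever appears in the saved notebook.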

In [40]:
!pip install PyPDF2



In [41]:
from PyPDF2 import PdfReader

### Setup

#### Provide your secrets:

Replace the following with your Astra DB connection details (the LLM API key was already set above):

In [42]:
ASTRA_DB_APPLICATION_TOKEN = "AstraCS:..." # enter the "AstraCS:..." string found in your Token JSON file
ASTRA_DB_ID = "<YOUR_DATABASE_ID>" # enter your Database ID


In [43]:
# Provide the path to your PDF file.
pdfreader = PdfReader('/content/NVIDIAAn (1).pdf')

In [44]:
# Read the text from every page of the PDF
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

In [45]:
raw_text

"NVIDIA Announces Financial Results for First Quarter\nFiscal 2024\nQuarterly revenue of $7.19 billion, up 19% from previous quarter\nRecord Data Center revenue of $4.28 billion\nSecond quarter fiscal 2024 revenue outlook of $11.00 billion\nNVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from\na year ago and up 19% from the previous quarter.\nGAAP earnings per diluted share for the quarter were $0.82, up 28% from a year ago and up 44% from the previous quarter.\nNon-GAAP earnings per diluted share were $1.09, down 20% from a year ago and up 24% from the previous quarter.\n“The computer industry is going through two simultaneous transitions — accelerated computing and generative AI,” said\nJensen Huang, founder and CEO of NVIDIA.\n“A trillion dollars of installed global data center infrastructure will transition from general purpose to accelerated computing as\ncompanies race to apply generative AI into every product, s

Initialize the connection to your database:

_(do not worry if you see a few warnings, it's just that the drivers are chatty about negotiating protocol versions with the DB.)_

In [46]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

ERROR:cassandra.connection:Closing connection <LibevConnection(133283138061120) b5c8ab77-153b-40cd-9f91-b71866e12cb2-us-east1.db.astra.datastax.com:29042:8a699595-71e1-4ba2-a7cd-fe78c9628a06> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


Create the LangChain embedding and LLM objects for later usage:

In [47]:

llm = GooglePalm(google_api_key=api_key, temperature=0.1)
embedding = GooglePalmEmbeddings(google_api_key=api_key)

In [48]:
!pip install chromadb langchain



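Create your LangChain vector store, backed by Astra DB. Passing `session=None` and `keyspace=None` tells the store to fall back on the connection set up by `cassio.init` above: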

In [49]:
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,
    keyspace=None,
)

In [50]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into overlapping chunks so that each chunk stays small
# enough for the embedding model and the LLM prompt
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
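
To make the roles of `chunk_size` and `chunk_overlap` more concrete, here is a simplified sketch of greedy, separator-based chunking with overlap. It is an illustration only and is not LangChain's implementation; the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Greedy separator-based chunking with overlap (simplified illustration).

    Pieces are accumulated until adding the next one would exceed
    `chunk_size`. The chunk is then emitted, and trailing pieces totalling
    at most `chunk_overlap` characters are carried into the next chunk.
    A single piece longer than `chunk_size` is emitted as-is.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the carried-over tail fits
            # within chunk_overlap and leaves room for the new piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current = current[1:]
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


demo_chunks = split_with_overlap("aaaa\nbbbb\ncccc\ndddd", chunk_size=10, chunk_overlap=5)
print(demo_chunks)
```

In this toy input each chunk repeats the last line of the previous one, which is the same effect that lets a sentence cut at a chunk boundary still appear whole in a neighbouring chunk.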

In [51]:
texts[:50]

['NVIDIA Announces Financial Results for First Quarter\nFiscal 2024\nQuarterly revenue of $7.19 billion, up 19% from previous quarter\nRecord Data Center revenue of $4.28 billion\nSecond quarter fiscal 2024 revenue outlook of $11.00 billion\nNVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from\na year ago and up 19% from the previous quarter.\nGAAP earnings per diluted share for the quarter were $0.82, up 28% from a year ago and up 44% from the previous quarter.\nNon-GAAP earnings per diluted share were $1.09, down 20% from a year ago and up 24% from the previous quarter.\n“The computer industry is going through two simultaneous transitions — accelerated computing and generative AI,” said\nJensen Huang, founder and CEO of NVIDIA.',
 '“The computer industry is going through two simultaneous transitions — accelerated computing and generative AI,” said\nJensen Huang, founder and CEO of NVIDIA.\n“A trillion dollars of inst

### Load the text chunks into the vector store



In [52]:

astra_vector_store.add_texts(texts[:50])

print("Inserted %i chunks." % len(texts[:50]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 35 chunks.


### Run the QA cycle

Run the cells and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:
- _What was NVIDIA's total revenue for the first quarter of fiscal 2024?_
- _How does the revenue of the first quarter of fiscal 2024 compare to that of the first quarter of the previous year?_


In [None]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


Enter your question (or type 'quit' to exit): What was NVIDIA's total revenue for the first quarter of fiscal 2024?

QUESTION: "What was NVIDIA's total revenue for the first quarter of fiscal 2024?"
ANSWER: "NVIDIA's total revenue for the first quarter of fiscal 2024 was $7.19 billion."

FIRST DOCUMENTS BY RELEVANCE:
    [0.9082] "said.
During the first quarter of fiscal 2024, NVIDIA returned to shareholders $99 m ..."
    [0.9072] "NVIDIA Announces Financial Results for First Quarter
Fiscal 2024
Quarterly revenue o ..."
    [0.9047] "$
2,043
 
$
1,414
 
$
1,618
 
Up 44%
Up 26%
Diluted earnings per share
$
0.82
 
$
0. ..."
    [0.9008] "$
1.09
 
$
0.88
 
$
1.36
 
Up 24%
Down 20%
Outlook
NVIDIA’s outlook for the second q ..."

What's your next question (or type 'quit' to exit): How does the revenue of the first quarter of fiscal 2024 compare to that of the first quarter of the previous year?

QUESTION: "How does the revenue of the first quarter of fiscal 2024 compare to that of the first quarter of the previous year?"
ANSWER: "The revenue of the first quarter of fiscal 2024 was down 13% from the first quarter of the previous year."

FIRST DOCUMENTS BY RELEVANCE:
    [0.8641] "$
2,043
 
$
1,414
 
$
1,618
 
Up 44%
Up 26%
Diluted earnings per share
$
0.82
 
$
0. ..."
    [0.8589] "$
1.09
 
$
0.88
 
$
1.36
 
Up 24%
Down 20%
Outlook
NVIDIA’s outlook for the second q ..."
    [0.8561] "Three Months Ended
 
 
April 30,
 
January 29,
 
May 1,
 
 
 
2023
 
 
 
2023
 
 
 
 ..."
    [0.8531] "NVIDIA Announces Financial Results for First Quarter
Fiscal 2024
Quarterly revenue o ..."

What's your next question (or type 'quit' to exit): Can you compare the GAAP and non-GAAP net income for the first quarter?

QUESTION: "Can you compare the GAAP and non-GAAP net income for the first quarter?"
ANSWER: "GAAP net income was $3.6 billion and non-GAAP net income was $3.7 billion for the first quarter."

FIRST DOCUMENTS BY RELEVANCE:
    [0.8380] "NVIDIA’s investors to be better able to compare its current results with those of pr ..."
    [0.8380] "Three Months Ended
 
 
April 30,
 
January 29,
 
May 1,
 
 
 
2023
 
 
 
2023
 
 
 
 ..."
    [0.8359] "https://investor.nvidia.com
. The webcast will be
recorded and available for replay  ..."
    [0.8340] "Payments related to tax on restricted stock units
 
(507
)
 
 
(532
)
 
Dividends pa ..."

What's your next question (or type 'quit' to exit): Describe what NVIDIA means by 'full-stack inference software' as used in the context of their new product launches

QUESTION: "Describe what NVIDIA means by 'full-stack inference software' as used in the context of their new product launches"
ANSWER: "Full-stack inference software is a suite of software tools that allows developers to deploy and manage AI models on NVIDIA hardware. It includes tools for data preparation, model training, model optimization, and model deployment."

FIRST DOCUMENTS BY RELEVANCE:
    [0.9005] "Highlights
NVIDIA achieved progress since its previous earnings announcement in thes ..."
    [0.8962] "generative AI models trained with their own proprietary data for domain-specific tas ..."
    [0.8934] "forward-looking statements to reflect future events or circumstances.
© 2023 NVIDIA  ..."
    [0.8906] "Announced 
NVIDIA Omniverse™ Cloud
, a fully managed service running in Microsoft Az ..."

What's your next question (or type 'quit' to exit): what is the second quarter fiscal 2024 revenue?

QUESTION: "what is the second quarter fiscal 2024 revenue?"
ANSWER: "$11.00 billion"

FIRST DOCUMENTS BY RELEVANCE:
    [0.8758] "$
2,043
 
$
1,414
 
$
1,618
 
Up 44%
Up 26%
Diluted earnings per share
$
0.82
 
$
0. ..."
    [0.8590] "Three Months Ended
 
 
April 30,
 
January 29,
 
May 1,
 
 
 
2023
 
 
 
2023
 
 
 
 ..."
    [0.8567] "$
1.09
 
$
0.88
 
$
1.36
 
Up 24%
Down 20%
Outlook
NVIDIA’s outlook for the second q ..."
    [0.8485] "NVIDIA Announces Financial Results for First Quarter
Fiscal 2024
Quarterly revenue o ..."

What's your next question (or type 'quit' to exit): What is the non-GAAP diluted earning per share for Q1FY-24

QUESTION: "What is the non-GAAP diluted earning per share for Q1FY-24"
ANSWER: "$1.09"

FIRST DOCUMENTS BY RELEVANCE:
    [0.8729] "said.
During the first quarter of fiscal 2024, NVIDIA returned to shareholders $99 m ..."
    [0.8725] "$
3,052
 
 
$
2,224
 
 
$
3,955
 
 
 
 
 
 
 
 
GAAP other income (expense), net
$
6 ..."
    [0.8711] "$
2,043
 
$
1,414
 
$
1,618
 
Up 44%
Up 26%
Diluted earnings per share
$
0.82
 
$
0. ..."
    [0.8649] "Non-GAAP net income
$
2,713
 
 
$
2,174
 
 
$
3,443
 
 
 
 
 
 
 
 
Diluted net inco ..."
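
The bracketed numbers printed for each document come from `similarity_search_with_score`: they express how close each chunk's embedding is to the embedding of the question. As a rough illustration of the idea, the sketch below ranks a few toy vectors by cosine similarity. It assumes a cosine-style measure; the exact metric and scaling used by the vector store may differ, and the vectors, labels, and function names here are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the k (score, doc_id) pairs with the highest similarity."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real embeddings have many more dimensions.
toy_docs = {
    "revenue": [1.0, 0.0, 0.0],
    "earnings": [0.6, 0.8, 0.0],
    "webcast": [0.0, 0.0, 1.0],
}
ranking = top_k([1.0, 0.0, 0.0], toy_docs, k=2)
for score, doc_id in ranking:
    print("[%0.4f] %s" % (score, doc_id))
```

Scores that are very close together, as in several of the result lists above, suggest the retriever does not separate those chunks strongly for that question.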
