# Quickstart: Querying a PDF with Astra DB and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Pre-requisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), you should get a DB Token with role _Database Administrator_ and copy your Database ID: you will need these connection parameters shortly.

You also need an [OpenAI API Key](https://cassio.org/start_here/#llm-access) for this demo to work.

#### What you will do:

- Setup: import dependencies, provide secrets, create the LangChain vector store;
- Run a question-answering loop that retrieves the relevant text chunks from the PDF and has an LLM construct the answer.

## Installing dependencies

In [1]:
!pip install -q cassio datasets langchain openai tiktoken

In [2]:
!pip install langchain-community langchain-openai
!pip install pyPDF2

Collecting langchain-community
  Downloading langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)
Collecting langchain-openai
  Downloading langchain_openai-0.3.31-py3-none-any.whl.metadata (2.4 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB)
Downloading langchain_community-0.3.27-py3-none-any.whl (2.5 MB)

### Importing all the packages

In [3]:
# Import the packages we need
from langchain_community.vectorstores import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain_openai import OpenAI, OpenAIEmbeddings

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# CassIO is the engine powering the Astra DB integration in LangChain;
# it is also used to initialize the connection to the database
import cassio


In [4]:
from PyPDF2 import PdfReader

# Setup

# Provide your secrets

In [6]:
ASTRA_DB_APPLICATION_TOKEN = "AstraCS:..."  # enter the "AstraCS:..." string found in your Token JSON file
ASTRA_DB_ID = "..."  # enter your Database ID

OPENAI_API_KEY = "sk-..."  # enter your OpenAI key
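Secrets typed directly into a cell are easy to leak when the notebook is shared. As a minimal sketch, assuming the same names are exported as environment variables, the values could be read at run time instead. The `read_secret` helper and its `env` and `prompt` parameters are hypothetical and not part of the original notebook.

```python
import os
from getpass import getpass


def read_secret(name, env=None, prompt=getpass):
    """Return the secret called `name` from the environment, or ask for it.

    `read_secret` is a hypothetical helper; `env` and `prompt` are injectable
    so the function can be exercised without real credentials.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        # Fall back to an interactive prompt that does not echo the input.
        value = prompt(f"Enter {name}: ").strip()
    if not value:
        raise ValueError(f"{name} is required but was not provided")
    return value
```

With this in place, a cell could assign `ASTRA_DB_APPLICATION_TOKEN = read_secret("ASTRA_DB_APPLICATION_TOKEN")`, and likewise for the other two values.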

In [7]:
# Provide the path of the PDF file
pdfreader = PdfReader('/content/The Hundred-Page Machine Learning Book by Andriy Burkov.pdf')

In [8]:
# Read the text from every page of the PDF
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content


In [9]:
raw_text

'The\nHundred-\nPage\nMachine\nLearning\nBook\nAndriy Burkov“All models are wrong, but some are useful.”\n—George Box\nThe book is distributed on the “read ﬁrst, buy later” principle.\nAndriy Burkov The Hundred-Page Machine Learning Book - DraftPreface\nLet’s start by telling the truth: machines don’t learn. What a typical “learning machine”\ndoes, is ﬁnding a mathematical formula, which, when applied to a collection of inputs (called\n“training data”), produces the desired outputs. This mathematical formula also generates the\ncorrect outputs for most other inputs (distinct from the training data) on the condition that\nthose inputs come from the same or a similar statistical distribution as the one the training\ndata was drawn from.\nWhy isn’t that learning? Because if you slightly distort the inputs, the output is very likely\nto become completely wrong. It’s not how learning in animals works. If you learned to play\na video game by looking straight at the screen, you would still be
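In the output above, text from adjacent pages runs together (for example `Andriy Burkov“All models`), because the page strings are concatenated without a separator. The following is a hedged sketch of a helper that joins pages with a newline instead. `extract_pages_text` is a hypothetical name, and the only assumption about each page object is that it exposes `extract_text()`, as PyPDF2 pages do.

```python
def extract_pages_text(pages, separator="\n"):
    """Join the text of each page, skipping pages with no extractable text.

    Hypothetical helper: `pages` is any iterable of objects exposing an
    `extract_text()` method, such as `PdfReader(...).pages`.
    """
    parts = []
    for page in pages:
        content = page.extract_text()
        if content and content.strip():
            parts.append(content.strip())
    # A separator keeps the last word of one page apart from the first
    # word of the next, which matters when splitting on "\n" later.
    return separator.join(parts)
```

It would be called as `raw_text = extract_pages_text(pdfreader.pages)`.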

# Initializing the connection with the database
Create the LangChain embedding and LLM objects for later use:

In [10]:
# initializing openai
from langchain_openai import OpenAI
llm = OpenAI(openai_api_key=OPENAI_API_KEY)

# initializing the embedding engine
from langchain_openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

In [11]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

In [12]:
# Create your LangChain vector store, backed by Astra DB
astra_vectore_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,   # None: use the session set up by cassio.init
    keyspace=None,  # None: use the keyspace set up by cassio.init
)
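Conceptually, a vector store answers a query by comparing the query's embedding with the stored embeddings and returning the closest texts. The sketch below illustrates that idea in plain Python with cosine similarity over invented two-dimensional vectors. It is not how Astra DB implements its search: the function names are hypothetical, and a linear scan stands in for whatever index the database uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the k texts whose vectors are most similar to the query.

    `stored` is a list of (text, vector) pairs; a linear scan stands in
    for the index a real vector database would use.
    """
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy vectors invented for illustration only.
toy_store = [
    ("supervised learning", [1.0, 0.0]),
    ("neural networks", [0.0, 1.0]),
    ("deep learning", [0.2, 0.9]),
]
nearest = top_k([0.1, 1.0], toy_store, k=2)
```

The `k` argument here plays the same role as the `k` passed to a retriever through `search_kwargs`: it caps how many chunks are handed to the LLM.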

In [17]:
from langchain.text_splitter import CharacterTextSplitter

# Split the raw text into chunks
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)

# Print the number of chunks and one chunk (index 10) as a preview
print(f"Number of text chunks: {len(texts)}")
print("Sample chunk (index 10):")
print(texts[10])

Number of text chunks: 468
Sample chunk (index 10):
second feature, x(2), could contain weight in kg, x(3)could contain gender, and so on. For all
examples in the dataset, the feature at position jin the feature vector always contains the
same kind of information. It means that if x(2)
icontains weight in kg in some example xi,
then x(2)
kwill also contain weight in kg in every example xk,k=1,...,N .T h e label yican
be either an element belonging to a ﬁnite set of classes {1,2,...,C }, or a real number, or a
more complex structure, like a vector, a matrix, a tree, or a graph. Unless otherwise stated,
in this book yiis either one of a ﬁnite set of classes or a real number. You can see a class as
a category to which an example belongs. For instance, if your examples are email messages


In [14]:
# Add all the text chunks to the Astra DB vector store
astra_vectore_store.add_texts(texts)
print("Inserted %i text chunks." % len(texts))

from langchain.chains import RetrievalQA

# Create a retriever from AstraDB vector store
retriever = astra_vectore_store.as_retriever(search_kwargs={"k": 2})

# Wrap with RetrievalQA chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)

# Chat loop
print("🤖 AI Chatbot Ready! (type 'quit' to exit)\n")
while True:
    query = input("You: ")
    if query.lower() in ["quit", "exit", "q"]:
        print("Chatbot: Goodbye! 👋")
        break

    result = qa.invoke({"query": query})
    print("\nChatbot:", result["result"])

    # Optional: show which chunks were used
    print("\n--- Context Sources ---")
    for i, doc in enumerate(result["source_documents"], 1):
        print(f"[{i}] {doc.page_content[:200]}...\n")


Inserted 468 text chunks.
🤖 AI Chatbot Ready! (type 'quit' to exit)

You: what is Machine learning

Chatbot:  Machine learning is a subfield of computer science that involves building algorithms that rely on a collection of examples to solve practical problems. These examples can come from nature, humans, or other algorithms, and the process involves gathering a dataset and using statistical models to solve the problem. There are different types of learning within machine learning, including supervised, semi-supervised, unsupervised, and reinforcement learning.

--- Context Sources ---
[1] Machine learning is a subﬁeld of computer science that is concerned with building al...

[2] Machine learning is a subﬁeld of computer science that is concerned with building al...

[3] Machine learning is a subﬁeld of computer science that is concerned with building al...

[4] Machine learning is a subﬁeld of computer science that is concerned with building al...

You: what is deep learning

Chatbot:  Deep learning is a type of machine learning that involves using neural networks with multiple layers between input and output. The model parameters in deep learning are learned from the outputs of the preceding layers, rather than directly from the features of the training examples. 

--- Context Sources ---
[1] networks with more than one layer between input and output. Such neural networks are...

[2] networks with more than one layer between input and output. Such neural networks are...

[3] networks with more than one layer between input and output. Such neural networks are...

[4] networks with more than one layer between input and output. Such neural networks are...

You: quit
Chatbot: Goodbye! 👋
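
The context sources in the transcript above all begin with the same text. One possible explanation, which the output alone does not confirm, is that the same chunks were inserted more than once, for instance by re-running the insertion cell. A hedged sketch of one way to make re-runs idempotent is to derive a deterministic ID from each chunk's content. `chunk_ids` and `unique_chunks` are hypothetical helpers, not part of the notebook.

```python
import hashlib


def chunk_ids(chunks, prefix="chunk"):
    """Return one deterministic ID per chunk, derived from its text."""
    ids = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        ids.append(f"{prefix}-{digest[:32]}")
    return ids


def unique_chunks(chunks):
    """Drop exact duplicate chunks while preserving their original order."""
    seen = set()
    unique = []
    for chunk in chunks:
        if chunk not in seen:
            seen.add(chunk)
            unique.append(chunk)
    return unique
```

Assuming `add_texts` accepts an `ids` argument and that rows sharing an ID are overwritten rather than duplicated, the insertion could become `astra_vectore_store.add_texts(texts, ids=chunk_ids(texts))`.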
