# Quickstart: Querying PDF With Astra and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Pre-requisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As described in more detail [in the Astra DB vector search quickstart](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), get a DB Token with the _Database Administrator_ role and copy your Database ID; you will need these connection parameters shortly.

You also need an [OpenAI API Key](https://cassio.org/start_here/#llm-access) for this demo to work.

#### What you will do:

- Setup: import dependencies, provide secrets, create the LangChain vector store;
- Run a Question-Answering loop that retrieves the relevant PDF passages and has an LLM construct the answer.

Install the required dependencies:

In [None]:
!pip install -q cassio datasets langchain openai tiktoken

Import the packages you'll need:

In [1]:
# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio



In [None]:
!pip install PyPDF2



In [2]:
from PyPDF2 import PdfReader

### Setup

In [3]:
import os
from dotenv import load_dotenv
load_dotenv()

True
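
`load_dotenv()` returns `True` when it loads at least one variable, but that alone does not confirm that every setting is present. A minimal sketch of a check, assuming the three variable names read later in this notebook (`ASTRA_DB_APPLICATION_TOKEN`, `ASTRA_DB_ID`, `OPENAI_API_KEY`); the helper name `missing_settings` is hypothetical:

```python
import os

# Variable names taken from the cells in this notebook that read them.
REQUIRED_SETTINGS = (
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_ID",
    "OPENAI_API_KEY",
)


def missing_settings(env=None, required=REQUIRED_SETTINGS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Report any settings that are missing from the current environment.
missing = missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All settings found.")
```

Running this right after `load_dotenv()` surfaces a missing token before the database or OpenAI calls fail with a less obvious error.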

#### Provide your secrets:

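The `load_dotenv()` call above reads your Astra DB connection details and your OpenAI API key from a `.env` file. The cells below expect the variables `ASTRA_DB_APPLICATION_TOKEN`, `ASTRA_DB_ID`, and `OPENAI_API_KEY`. Next, load the PDF you want to query: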

In [5]:
# Provide the path to your PDF file.
pdfreader = PdfReader('spice.pdf')

In [6]:
# Read the text from every page of the PDF, skipping pages with no extractable text
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content

In [7]:
raw_text

'Spices\nSpices at a central market in Agadir,\nMorocco\nA group of Indian herbs and spices in\nbowls\nSpices of Saúde flea market, São\nPaulo, BrazilSpice\nA spice is a seed, fruit, root, bark, or other plant substance\nprimarily used for flavoring or coloring food. Spices are\ndistinguished from herbs, which are the leaves, flowers, or stems of\nplants used for flavoring or as a garnish. Spices are sometimes used\nin medicine, religious rituals, cosmetics, or perfume produc tion.\nFor example, vanilla is commonly used as an ingredient in\nfragrance manufacturing.[1]\nA spice may be available in several forms: fresh, whole-dried, or\npre-ground dried. Generally, spices are dried. Spices may be\nground into a powder for conve nience. A whole dried spice has the\nlonge st shelf life, so it can be purchased and stored in larger\namounts, making it cheaper on a per-serving basis. A fresh spice,\nsuch as ginger, is usually more flavorful than its dried form, but\nfresh spices are more expe

Initialize the connection to your database:

_(Do not worry if you see a few warnings; the drivers are simply chatty about negotiating protocol versions with the DB.)_

In [9]:
cassio.init(token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"), database_id=os.getenv("ASTRA_DB_ID"))

Create the LangChain embedding and LLM objects for later usage:

In [10]:
llm = OpenAI(openai_api_key=os.getenv("OPENAI_API_KEY"))
embedding = OpenAIEmbeddings(openai_api_key=os.getenv("OPENAI_API_KEY"))



Create your LangChain vector store, backed by Astra DB. Passing `session=None` and `keyspace=None` lets the store use the connection set up by `cassio.init`:

In [11]:
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,
    keyspace=None,
)

In [12]:
from langchain.text_splitter import CharacterTextSplitter
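# Split the text into overlapping chunks so that each one stays small enough
# to embed and to fit in the LLM prompt. With length_function=len,
# chunk_size and chunk_overlap are measured in characters, not tokens.
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)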
texts = text_splitter.split_text(raw_text)
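
To see what `chunk_size` and `chunk_overlap` do, here is a simplified sketch in plain Python. It is not LangChain's implementation: the function `split_with_overlap` is hypothetical, and it simply merges separator-delimited pieces greedily, carrying trailing pieces into the next chunk as overlap.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Greedily merge pieces of `text` into chunks of at most `chunk_size`
    characters, repeating up to `chunk_overlap` trailing characters
    (in whole pieces) at the start of the next chunk."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    def joined_len(parts):
        return len(separator.join(parts))

    pieces = [piece for piece in text.split(separator) if piece]
    chunks = []
    current = []  # pieces that make up the chunk being built

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The chunk is full: emit it, then keep only a short tail as overlap.
            chunks.append(separator.join(current))
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        # A single piece longer than chunk_size ends up as its own oversized chunk.
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


# Tiny example: four 4-character lines, chunks of at most 9 characters.
for chunk in split_with_overlap("aaaa\nbbbb\ncccc\ndddd", chunk_size=9, chunk_overlap=4):
    print(repr(chunk))
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.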

In [13]:
texts[:50]

['Spices\nSpices at a central market in Agadir,\nMorocco\nA group of Indian herbs and spices in\nbowls\nSpices of Saúde flea market, São\nPaulo, BrazilSpice\nA spice is a seed, fruit, root, bark, or other plant substance\nprimarily used for flavoring or coloring food. Spices are\ndistinguished from herbs, which are the leaves, flowers, or stems of\nplants used for flavoring or as a garnish. Spices are sometimes used\nin medicine, religious rituals, cosmetics, or perfume produc tion.\nFor example, vanilla is commonly used as an ingredient in\nfragrance manufacturing.[1]\nA spice may be available in several forms: fresh, whole-dried, or\npre-ground dried. Generally, spices are dried. Spices may be\nground into a powder for conve nience. A whole dried spice has the',
 'A spice may be available in several forms: fresh, whole-dried, or\npre-ground dried. Generally, spices are dried. Spices may be\nground into a powder for conve nience. A whole dried spice has the\nlonge st shelf life, so it

### Load the text chunks into the vector store



In [14]:

astra_vector_store.add_texts(texts[:50])  # Also computes the embeddings

print("Inserted %i chunks." % len(texts[:50]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 22 chunks.
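
Once the chunks are embedded, retrieval ranks them by how close their vectors are to the query vector. The sketch below illustrates that idea with toy 2-dimensional vectors and a cosine metric. This is an assumption for illustration: the functions `cosine_similarity` and `top_k` are hypothetical, and the score scale reported by the Astra DB store may differ from the raw cosine value.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=4):
    """Return the k (score, text) pairs with the highest similarity to the query."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings"; real ones have hundreds or thousands of dimensions.
stored = [
    ("spice trade", [1.0, 0.0]),
    ("cooking", [0.0, 1.0]),
    ("production", [1.0, 1.0]),
]
for score, text in top_k([1.0, 0.1], stored, k=2):
    print("[%0.4f] %s" % (score, text))
```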


### Run the QA cycle

Run the cell and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:
- _Which country produces most spice?_
- _What is the difference between a spice and an herb?_


In [15]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


QUESTION: "Which country produces most spice?"
ANSWER: "India contributes to 75% of global spice production, making it the top spice producing country."

FIRST DOCUMENTS BY RELEVANCE:
    [0.9265] "10 Sri Lanka 8,293 8,438
— World 1,995,5232,063,472
Source: UN Food & Agriculture Or ..."
    [0.9218] "whole and in pow der form.
As of 2019, there is not enough clinical evidence to indi ..."
    [0.9192] "the European aristocracy's demand for spice comes from the King of Aragon, who inves ..."
    [0.9177] "flavors from a spice take time to infuse into the food so spices are added early in  ..."

QUESTION: "quit()"
ANSWER: "I don't know."

FIRST DOCUMENTS BY RELEVANCE:
    [0.8544] "cover up the taste of meat that
had already gone  off. This
compelling but false ide ..."
    [0.8496] "made them expensive. From the 8th until the 15th century, the
Republic of Venice hel ..."
    [0.8484] "Venice.[14] At around  the same time, Christophe r Columbus returned from the New Wo ..."
    [0.8469]
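
In the run above, `quit()` was sent to the LLM as a question because the loop only matches the exact word `quit`. A minimal sketch of a more forgiving check follows; the helper name `is_quit_command` and the accepted aliases are assumptions, not part of the original notebook:

```python
# Words that should end the loop (an illustrative choice).
QUIT_ALIASES = {"quit", "exit", "q"}


def is_quit_command(text):
    """Return True if `text` looks like a request to stop, e.g. 'quit' or 'Quit()'."""
    cleaned = text.strip().lower()
    if cleaned.endswith("()"):
        cleaned = cleaned[:-2]
    return cleaned in QUIT_ALIASES


# Quick check with inputs like the ones shown in the output above.
for sample in ["quit", "quit()", "Which country produces most spice?"]:
    print(repr(sample), is_quit_command(sample))
```

In the loop, `if is_quit_command(query_text): break` would take the place of the `query_text.lower() == "quit"` comparison.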