# Querying a PDF With Astra DB and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Pre-requisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), you should get a DB Token with role _Database Administrator_ and copy your Database ID: you will need these connection parameters shortly.

You also need an [OpenAI API Key](https://cassio.org/start_here/#llm-access) for this demo to work.

#### What you will do:

- Setup: import dependencies, provide secrets, create the LangChain vector store;
- Read a PDF, split its text into chunks, and load the chunks into the vector store;
- Run a question-answering loop that retrieves the relevant chunks and has an LLM construct the answer.

Install the required dependencies:

In [2]:
!pip install -q cassio datasets langchain openai tiktoken

Import the packages you'll need:

In [3]:
# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

In [4]:
# PyPDF2 is used to read PDF files
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [5]:
from PyPDF2 import PdfReader

### Setup

#### Provide your secrets:

Replace the following with your Astra DB connection details and your OpenAI API key:

In [13]:
ASTRA_DB_APPLICATION_TOKEN = "AstraCS:vwH******d210" # enter the "AstraCS:..." string found in your Token JSON file
ASTRA_DB_ID = "8***b9-c14a-4a5f-85a8-a*****cd8e" # enter your Database ID

OPENAI_API_KEY = "sk-eg*****" # enter your OpenAI key
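
Hard-coded secrets are easy to leak when a notebook is shared. As an optional alternative, the values could be read from environment variables. The sketch below is only an illustration: the helper name `read_secret` is hypothetical, and the variable names are assumed to match the ones used in this notebook.

```python
import os

def read_secret(name, env=None):
    """Return the value of `name` from the environment, or raise if it is missing."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise KeyError(f"Set the {name} environment variable before running the notebook")
    return value

# Example with a stand-in mapping; in the notebook you would omit `env`
# so that the real process environment is used.
example_env = {"ASTRA_DB_ID": "my-database-id"}
print(read_secret("ASTRA_DB_ID", env=example_env))
```

In the notebook this would look like `ASTRA_DB_ID = read_secret("ASTRA_DB_ID")`, with the variable exported before the kernel starts.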

In [8]:
# Provide the path to your PDF file
pdfreader = PdfReader('/content/mythology.pdf')

In [9]:
# Read the text from every page of the PDF
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:  # skip pages with no extractable text
        raw_text += content

In [10]:
raw_text

' \nIntroduction to Indian Mythology  \n                                                                                         V.Durgalakshmi, Ph.D {Samskruth}  \nMythology : Definition and Need to Study  \nA collection of myth, especially one belonging to a particu lar religious or cultural tradition is \nthe dictionary definition of mythology. It is also defined as a set of stories or beliefs about a \nparticular person, institution, or situation, especially when exaggerated or fictitious. We \nneed to understand that myt hology is a branch of knowledge that deals with narratives \nabout Goddesses & Gods, demi -gods, legendary personalities of different civilizations and \ntheir cultures. Traditions, folklore and legends are similar to and sometimes part of \nMythology.  \nSince myth ology typically incorporates superhuman characters, it is important for us to \nstudy them with a “time -perspective”. We also need to understand the mythology of our \nrespective cultures to bond with
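
The extracted text above contains long runs of padding spaces and trailing blanks left over from the PDF layout. If you want to tidy it before splitting, a small clean-up pass is one option. This is a minimal sketch, not part of the original notebook; `tidy_whitespace` is a hypothetical helper, and it does not repair words that the extraction split in two (such as "particu lar").

```python
import re

def tidy_whitespace(text):
    """Collapse runs of spaces/tabs and trim each line, keeping line breaks."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    return "\n".join(lines).strip()

sample = " \nIntroduction to Indian Mythology  \n            V.Durgalakshmi, Ph.D {Samskruth}  \n"
print(repr(tidy_whitespace(sample)))
```

Applying it as `raw_text = tidy_whitespace(raw_text)` before the splitting step would change where chunk boundaries fall, because less of each chunk's character budget is spent on padding.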

Initialize the connection to your database:

_(do not worry if you see a few warnings, it's just that the drivers are chatty about negotiating protocol versions with the DB.)_

In [14]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(137289285779072) 8b70b7b9-c14a-4a5f-85a8-a4b5a60acd8e-us-east1.db.astra.datastax.com:29042:ceba643e-b237-470a-b0be-b86329b59e66> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


Create the LangChain embedding and LLM objects for later usage:

In [15]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)



Create your LangChain vector store ... backed by Astra DB!

In [16]:
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
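    session=None,   # None: use the DB session set up by cassio.init() above
    keyspace=None,  # None: use the default keyspace set up by cassio.init()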
)

In [17]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into overlapping chunks so that each piece stays small
# enough to embed and to fit into the LLM prompt
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)

In [18]:
texts[:50]

['Introduction to Indian Mythology  \n                                                                                         V.Durgalakshmi, Ph.D {Samskruth}  \nMythology : Definition and Need to Study  \nA collection of myth, especially one belonging to a particu lar religious or cultural tradition is \nthe dictionary definition of mythology. It is also defined as a set of stories or beliefs about a \nparticular person, institution, or situation, especially when exaggerated or fictitious. We \nneed to understand that myt hology is a branch of knowledge that deals with narratives \nabout Goddesses & Gods, demi -gods, legendary personalities of different civilizations and \ntheir cultures. Traditions, folklore and legends are similar to and sometimes part of \nMythology.',
 'about Goddesses & Gods, demi -gods, legendary personalities of different civilizations and \ntheir cultures. Traditions, folklore and legends are similar to and sometimes part of \nMythology.  \nSince myth ology t
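
Note how the second chunk above starts with text repeated from the end of the first: that is the effect of `chunk_overlap`. To make the idea concrete, here is a simplified, standard-library-only sketch of splitting on a separator with a size limit and an overlap. It is an illustration of the concept under my own assumptions, not LangChain's actual implementation, and `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text, chunk_size, chunk_overlap, separator="\n"):
    """Greedily pack separator-delimited pieces into chunks of at most
    `chunk_size` characters, repeating up to `chunk_overlap` characters'
    worth of trailing pieces at the start of the next chunk."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the carried-over tail fits the
            # overlap budget and leaves room for the new piece.
            while current and (joined_len(current) > chunk_overlap
                               or joined_len(current + [piece]) > chunk_size):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks

lines = "\n".join(f"line {i:02d}" for i in range(10))
for chunk in split_with_overlap(lines, chunk_size=30, chunk_overlap=10):
    print(repr(chunk))
```

In this sketch a single piece longer than `chunk_size` is kept whole rather than cut, so the size limit is a target and not a hard guarantee.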

### Load the text chunks into the vector store

In [19]:

astra_vector_store.add_texts(texts[:50])

print("Inserted %i chunks." % len(texts[:50]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 19 chunks.


### Run the QA cycle

Run the cell below and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:
- _What is this document about?_
- _What are the four basic functions of Indian mythology?_
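
Conceptually, each turn of the loop retrieves the chunks most similar to the question and hands them to the LLM together with the question. The sketch below shows one way such a prompt could be assembled. The wording of the prompt and the helper `build_prompt` are assumptions made for illustration; LangChain builds its own prompt internally, and the sample chunks are placeholders.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user's question into one prompt string."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder chunks standing in for the results of a similarity search.
retrieved = ["first retrieved chunk ", " second retrieved chunk"]
print(build_prompt("What is this document about?", retrieved))
```

Because the loop in this notebook passes only the current question, each question is answered on its own. A follow-up that relies on an earlier turn (for example, one that says "that" instead of naming the document) gives the retriever little to match on.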


In [20]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


Enter your question (or type 'quit' to exit): what is it all about

QUESTION: "what is it all about"




ANSWER: "The context is discussing the importance and relevance of Indian mythology in modern society, including its role in explaining spiritual potential, exploring common archetypes, and bonding individuals to their culture. It also mentions how myths are passed down through generations and used as a medium to inculcate interest in Indian culture. The conclusion states that Indian mythology serves four basic functions - the Mystical Function, the Cosmological Function, the Sociological Function, and the Pedagogical Function - and is essential in imparting values of Indian culture worldwide. Overall, the context is about the significance of Indian mythology in connecting modern society to bygone ages and transmitting religious experiences and role models."

FIRST DOCUMENTS BY RELEVANCE:




    [0.8865] "explain patterns of worship and attempt to reconnect the modern society to the bygon ..."
    [0.8860] "about Goddesses & Gods, demi -gods, legendary personalities of different civilizatio ..."
    [0.8850] "myth ology and this magnifies the ability of the performer to present them. Indian m ..."
    [0.8828] "mythology is very much prevalent today, as never before. It is interesting to note t ..."

What's your next question (or type 'quit' to exit): who has written that?

QUESTION: "who has written that?"




ANSWER: "The author of this passage is not mentioned, so it is unknown who wrote this."

FIRST DOCUMENTS BY RELEVANCE:




    [0.8709] "values.   
 Rāmāyaṇa has been re -written in Hindi, Thamizh and other Indian langua ..."
    [0.8689] "Rāma (the seventh avatar of Lord Viṣhṇu) in the Thretha yuga.  
 The Mahābhāratha   ..."
    [0.8676] "better picture of the content in terms of the respective time -frames.  
 
Epics  
T ..."
    [0.8665] "same. These stories, which form the backbone of Indian mythology, are a great medium ..."

What's your next question (or type 'quit' to exit): is there any iinfo regarding the author?

QUESTION: "is there any iinfo regarding the author?"




ANSWER: "The author of the Rāmāyaṇa is Vālmīki, while the author of the Mahābhāratha is traditionally attributed to Vyāsa."

FIRST DOCUMENTS BY RELEVANCE:




    [0.8750] "better picture of the content in terms of the respective time -frames.  
 
Epics  
T ..."
    [0.8713] "same. These stories, which form the backbone of Indian mythology, are a great medium ..."
    [0.8678] "values.   
 Rāmāyaṇa has been re -written in Hindi, Thamizh and other Indian langua ..."
    [0.8678] "Rāma (the seventh avatar of Lord Viṣhṇu) in the Thretha yuga.  
 The Mahābhāratha   ..."

What's your next question (or type 'quit' to exit): how many pages it has?

QUESTION: "how many pages it has?"




ANSWER: "It is not possible to determine the exact number of pages as it depends on the format and size of the book. However, the Mahābhāratha is the longest Sanskrit epic and its longest version consists of over 100,000 verses or over 200,000 individual verse lines and also many long passages in prose, making it a very lengthy text."

FIRST DOCUMENTS BY RELEVANCE:




    [0.8783] "over 100,000  verses or over 200,000 individual verse lines (each Śhloka is a couple ..."
    [0.8763] "values.   
 Rāmāyaṇa has been re -written in Hindi, Thamizh and other Indian langua ..."
    [0.8746] "Rāma (the seventh avatar of Lord Viṣhṇu) in the Thretha yuga.  
 The Mahābhāratha   ..."
    [0.8692] "better picture of the content in terms of the respective time -frames.  
 
Epics  
T ..."

What's your next question (or type 'quit' to exit): quit
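
The bracketed numbers in the output above are similarity scores between the question's embedding and each stored chunk. As a rough illustration of how such a ranking works, the sketch below scores toy three-dimensional vectors with cosine similarity and keeps the top results. The vectors and labels are made up, and the exact score scale reported by the vector store may differ from plain cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, docs, k=4):
    """Return the k (score, text) pairs with the highest similarity."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

# Toy embeddings; real ones come from the embedding model and have many more dimensions.
toy_docs = [
    ("chunk about epics", [0.9, 0.1, 0.0]),
    ("chunk about festivals", [0.1, 0.9, 0.1]),
    ("chunk about authorship", [0.7, 0.2, 0.6]),
]
for score, text in top_k([1.0, 0.0, 0.0], toy_docs, k=2):
    print(f"[{score:.4f}] {text}")
```

In the session above, the four scores for each question sit within a narrow band, which suggests that the top chunks were only weakly distinguished from one another. That is one plausible reason a question such as "how many pages it has?" is answered from loosely related passages.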
