<a href="https://colab.research.google.com/github/Abhishekyes/PDFQuery_LangChain/blob/main/PDFQuery_LangChain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quickstart: Querying PDF With Astra and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Pre-requisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in more detail [here](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), get a DB Token with the _Database Administrator_ role and copy your Database ID: you will need these connection parameters shortly.

You also need an [OpenAI API Key](https://cassio.org/start_here/#llm-access) for this demo to work.

#### What you will do:

- Setup: import dependencies, provide secrets, create the LangChain vector store;
- Run a Question-Answering loop that retrieves the relevant text chunks from the PDF and has an LLM construct the answer.

Install the required dependencies:

In [1]:
!pip install -q cassio datasets langchain openai tiktoken

Import the packages you'll need:

In [2]:
# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

In [3]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [4]:
from PyPDF2 import PdfReader

### Setup

#### Provide your secrets:

Replace the placeholders below with your Astra DB connection details and your OpenAI API key:

In [17]:
ASTRA_DB_APPLICATION_TOKEN = "AstraCS:..." # enter the "AstraCS:..." string found in your Token JSON file
ASTRA_DB_ID = "..." # enter your Database ID

OPENAI_API_KEY = "sk-..." # enter your OpenAI key
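Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the three values are exported as environment variables under the same names used in this notebook. The helper `read_secret` is hypothetical and is not part of CassIO or LangChain:

```python
import os
from getpass import getpass


def read_secret(name: str) -> str:
    """Return the secret stored in the environment variable `name`.

    Falls back to an interactive, non-echoing prompt when the variable
    is unset or empty. Hypothetical helper, for illustration only.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        value = getpass(f"Enter {name}: ").strip()
    return value


# Usage in the notebook (uncomment to apply):
# ASTRA_DB_APPLICATION_TOKEN = read_secret("ASTRA_DB_APPLICATION_TOKEN")
# ASTRA_DB_ID = read_secret("ASTRA_DB_ID")
# OPENAI_API_KEY = read_secret("OPENAI_API_KEY")
```

With this approach the notebook itself contains no secret values, so it can be committed or shared as-is.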

In [6]:
# Provide the path to your PDF file.
pdfreader = PdfReader('cow.pdf')

In [7]:
# Read the text from every page of the PDF
raw_text = ''
for i, page in enumerate(pdfreader.pages):
    content = page.extract_text()
    if content:
        raw_text += content

In [8]:
raw_text

'Essay on The Cow \nby Radhakanta Swain | category Essay  \nIntroduction  \nCows are found in almost all parts of the world. Th ey are very useful \ndomestic animals. Every child is fed with the cow’s  milk. Hence, the \ncow is a well-known quadruped beast. \nDescription  \nCows are found in many colours, such as white, blac k and red. Some \nare of mixed colours. Cows are neither small nor ve ry big. The body \nof the cow is bulky. There are two horns on her hea d. The horns are \ncurved or straight and pointed. The cow has a long face. She has two \neyes. Her eyes are black and expressive. She has no  tooth on her \nupper jaw. On her lower jaw there are eight teeth. She has a long \ntail. Her tail is thin an narrow. There is a tuft o f hair at the end of \nher tail. The cow has four hoofs at the end of her four legs. Each \nhoof is split into tow parts. She ha an udder  betw een her hind legs. \nHer body is covered with furs. Her stomach is divid ed into four \nparts. So, she has to 

Initialize the connection to your database:

_(do not worry if you see a few warnings, it's just that the drivers are chatty about negotiating protocol versions with the DB.)_

In [9]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

ERROR:cassandra.connection:Closing connection <LibevConnection(133282921249952) 6f561ce2-f36c-4632-b2a1-203a09298d17-us-east1.db.astra.datastax.com:29042:bd11484a-5051-48b3-8b81-c834bf62bad5> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


Create the LangChain embedding and LLM objects for later use:

In [10]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)



Create your LangChain vector store ... backed by Astra DB!

In [11]:
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
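    session=None,   # None: fall back to the session set up by cassio.init()
    keyspace=None,  # None: fall back to the keyspace set up by cassio.init()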
)
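To make the idea of a vector store concrete, here is a toy, in-memory illustration of similarity search. It is only a sketch: the `embed` function below uses word counts as a stand-in for `OpenAIEmbeddings`, and `search` is a hypothetical helper, not the Astra DB or LangChain implementation.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lower-cased word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, documents: list, k: int = 2) -> list:
    """Return the k (score, document) pairs most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


documents = [
    "the cow has two horns on her head",
    "her milk is very nutritious",
    "cows are found in many colours",
]
top = search("how many horns does the cow have", documents, k=1)
print(top[0][1])
```

A real vector store follows the same pattern (embed the query, score it against stored vectors, return the best matches), but uses learned embeddings and an index that scales beyond a linear scan.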

In [12]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into overlapping chunks with CharacterTextSplitter
# so that each chunk stays well within the model's token limit.
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
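The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified, pure-Python splitter. This is a sketch of the general idea only, not `CharacterTextSplitter`'s actual implementation; `split_with_overlap` is a hypothetical name, and a single line longer than `chunk_size` would still be emitted as an oversized chunk.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Group separator-delimited pieces into chunks of at most chunk_size
    characters, repeating up to chunk_overlap trailing characters' worth of
    whole pieces at the start of the next chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks = []
    current = []  # pieces of the chunk currently being built
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            chunks.append(separator.join(current))
            # Carry trailing pieces forward while they fit the overlap budget.
            overlap = []
            for prev in reversed(current):
                if len(separator.join([prev] + overlap)) > chunk_overlap:
                    break
                overlap.insert(0, prev)
            # Drop overlap pieces that would push the new chunk over the limit.
            while overlap and len(separator.join(overlap + [piece])) > chunk_size:
                overlap.pop(0)
            current = overlap
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


demo_chunks = split_with_overlap("aaaa\nbbbb\ncccc\ndddd", chunk_size=9, chunk_overlap=4)
print(demo_chunks)
```

Each chunk in the demo shares one line with its predecessor, which is the same pattern visible in the notebook output, where the second chunk begins with text that also ends the first.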

In [13]:
texts[:50]

['Essay on The Cow \nby Radhakanta Swain | category Essay  \nIntroduction  \nCows are found in almost all parts of the world. Th ey are very useful \ndomestic animals. Every child is fed with the cow’s  milk. Hence, the \ncow is a well-known quadruped beast. \nDescription  \nCows are found in many colours, such as white, blac k and red. Some \nare of mixed colours. Cows are neither small nor ve ry big. The body \nof the cow is bulky. There are two horns on her hea d. The horns are \ncurved or straight and pointed. The cow has a long face. She has two \neyes. Her eyes are black and expressive. She has no  tooth on her \nupper jaw. On her lower jaw there are eight teeth. She has a long \ntail. Her tail is thin an narrow. There is a tuft o f hair at the end of',
 'upper jaw. On her lower jaw there are eight teeth. She has a long \ntail. Her tail is thin an narrow. There is a tuft o f hair at the end of \nher tail. The cow has four hoofs at the end of her four legs. Each \nhoof is split in

### Load the text chunks into the vector store



In [14]:

astra_vector_store.add_texts(texts[:50])

print("Inserted %i chunks." % len(texts[:50]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 4 chunks.


### Run the QA cycle

Run the cell and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:
- _What is the main use of cows mentioned in the introduction?_
- _What are some of the colors cows can be found in?_


In [16]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


Enter your question (or type 'quit' to exit): who is cow?

QUESTION: "who is cow?"




ANSWER: "A cow is a quadruped beast, a well-known domestic animal that is found in many colors and is known for its usefulness in providing milk, manure, and other products."

FIRST DOCUMENTS BY RELEVANCE:
    [0.9062] "Essay on The Cow 
by Radhakanta Swain | category Essay  
Introduction  
Cows are fou ..."
    [0.8961] "upper jaw. On her lower jaw there are eight teeth. She has a long 
tail. Her tail is ..."
    [0.8945] "clean. We should feed her properly. We should be gr ateful to her. We should never s ..."
    [0.8916] "mother. They worship her as a goddess. Her milk is very nutritious. 
It is a food fo ..."

What's your next question (or type 'quit' to exit): is it harmfull

QUESTION: "is it harmfull"




ANSWER: "I do not know if the cow is harmful or not, but she is considered very useful and is worshipped as a goddess in some cultures."

FIRST DOCUMENTS BY RELEVANCE:
    [0.8732] "clean. We should feed her properly. We should be gr ateful to her. We should never s ..."
    [0.8716] "mother. They worship her as a goddess. Her milk is very nutritious. 
It is a food fo ..."
    [0.8693] "upper jaw. On her lower jaw there are eight teeth. She has a long 
tail. Her tail is ..."
    [0.8692] "modest outcomes. When the poor become empowered partners 
in the development process ..."

What's your next question (or type 'quit' to exit): hi





QUESTION: "hi"




ANSWER: "I don't know how to respond to that since it is not a question related to the given context. Is there something specific you need help with?"

FIRST DOCUMENTS BY RELEVANCE:
    [0.8800] "the last ten years, have targeted each and every household and 
individual, through  ..."
    [0.8794] "clean. We should feed her properly. We should be gr ateful to her. We should never s ..."
    [0.8794] "9. As our Prime Minister firmly believes , we need to focus on 
four major castes. T ..."
    [0.8789] "and well -roun ded individuals.  
18. The Skill India Mission has trained 1.4 crore  ..."

What's your next question (or type 'quit' to exit): What is the main use of cows mentioned in the introduction?





QUESTION: "What is the main use of cows mentioned in the introduction?"




ANSWER: "Cows are useful domestic animals and their milk is used to feed children."

FIRST DOCUMENTS BY RELEVANCE:
    [0.9236] "Essay on The Cow 
by Radhakanta Swain | category Essay  
Introduction  
Cows are fou ..."
    [0.9169] "mother. They worship her as a goddess. Her milk is very nutritious. 
It is a food fo ..."
    [0.9123] "upper jaw. On her lower jaw there are eight teeth. She has a long 
tail. Her tail is ..."
    [0.9089] "clean. We should feed her properly. We should be gr ateful to her. We should never s ..."

What's your next question (or type 'quit' to exit): What are some of the colors cows can be found in?





QUESTION: "What are some of the colors cows can be found in?"




ANSWER: "Cows can be found in white, black, and red, as well as mixed colors."

FIRST DOCUMENTS BY RELEVANCE:
    [0.9127] "Essay on The Cow 
by Radhakanta Swain | category Essay  
Introduction  
Cows are fou ..."
    [0.9011] "upper jaw. On her lower jaw there are eight teeth. She has a long 
tail. Her tail is ..."
    [0.8892] "mother. They worship her as a goddess. Her milk is very nutritious. 
It is a food fo ..."
    [0.8820] "clean. We should feed her properly. We should be gr ateful to her. We should never s ..."

What's your next question (or type 'quit' to exit): How many horns does a cow have, and what are their characteristics?





QUESTION: "How many horns does a cow have, and what are their characteristics?"




ANSWER: "A cow typically has two horns on her head, which can be curved or straight and pointed."

FIRST DOCUMENTS BY RELEVANCE:
    [0.9176] "upper jaw. On her lower jaw there are eight teeth. She has a long 
tail. Her tail is ..."
    [0.9157] "Essay on The Cow 
by Radhakanta Swain | category Essay  
Introduction  
Cows are fou ..."
    [0.9010] "mother. They worship her as a goddess. Her milk is very nutritious. 
It is a food fo ..."
    [0.8891] "clean. We should feed her properly. We should be gr ateful to her. We should never s ..."

What's your next question (or type 'quit' to exit): quit
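
In the session above, the query "hi" still returns four chunks, with scores around 0.88, even though none of them is relevant. One possible refinement is to discard low-scoring matches before showing them. The sketch below assumes results arrive as `(document, score)` pairs, as in the loop above. `filter_by_score` is a hypothetical helper and the 0.90 cut-off is an arbitrary illustrative value, not a recommendation.

```python
def filter_by_score(results, min_score):
    """Keep only (document, score) pairs whose score is at least min_score."""
    return [(doc, score) for doc, score in results if score >= min_score]


# Scores copied from the "hi" and "colors" queries in the session above;
# the strings are shortened stand-ins for LangChain Document objects.
hi_results = [
    ("the last ten years ...", 0.8800),
    ("clean. We should feed her ...", 0.8794),
]
color_results = [
    ("Essay on The Cow ...", 0.9127),
    ("upper jaw. On her lower jaw ...", 0.9011),
]

MIN_SCORE = 0.90  # illustrative cut-off (an assumption); tune it for your data

print(filter_by_score(hi_results, MIN_SCORE))
print(filter_by_score(color_results, MIN_SCORE))
```

A fixed threshold is a blunt tool, since score ranges depend on the embedding model and the data, so any value should be checked against real queries before relying on it.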
