<a href="https://colab.research.google.com/github/KOMPALALOKESH/Ask-PDF/blob/main/Ask_PDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quickstart: Querying PDF With Astra and LangChain

### A question-answering demo using Astra DB and LangChain, powered by Vector Search

#### Pre-requisites:

You need a **_Serverless Cassandra with Vector Search_** database on [Astra DB](https://astra.datastax.com) to run this demo. As outlined in the [Astra DB quickstart](https://docs.datastax.com/en/astra-serverless/docs/vector-search/quickstart.html#_prepare_for_using_your_vector_database), get a DB Token with the _Database Administrator_ role and copy your Database ID: you will need both connection parameters in the Setup step.

You also need an [OpenAI API Key](https://cassio.org/start_here/#llm-access) for this demo to work.

#### What you will do:

- Setup: import dependencies, provide secrets, read the PDF, create the LangChain vector store;
- Run a Question-Answering loop that retrieves the most relevant text chunks and has an LLM construct the answer.

Install the required dependencies:

In [None]:
!pip install -q cassio datasets langchain openai tiktoken


Import the packages you'll need:

In [None]:
# LangChain components to use
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:
import cassio

In [None]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [None]:
from PyPDF2 import PdfReader

### Setup

#### Provide your secrets:

Replace the following with your Astra DB connection details and your OpenAI API key:

In [None]:
ASTRA_DB_APPLICATION_TOKEN = "AstraCS:..." # enter the "AstraCS:..." string found in your Token JSON file
ASTRA_DB_ID = "<your-database-id>" # enter your Database ID

OPENAI_API_KEY = "sk-..." # enter your OpenAI key
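Rather than pasting secrets into the notebook, where they are saved and shared along with it, the values can be read at run time. The sketch below is one possible approach, not part of the original demo: `read_secret` is a hypothetical helper that looks for an environment variable first and falls back to a hidden prompt.

```python
import os
from getpass import getpass


def read_secret(name, prompt=getpass):
    """Return the secret called `name` without hard-coding it.

    Hypothetical helper: checks the environment variable `name` first,
    then falls back to an interactive, non-echoing prompt.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        value = prompt(f"Enter {name}: ").strip()
    return value


# Possible usage in the setup cell (uncomment to be prompted):
# ASTRA_DB_APPLICATION_TOKEN = read_secret("ASTRA_DB_APPLICATION_TOKEN")
# ASTRA_DB_ID = read_secret("ASTRA_DB_ID")
# OPENAI_API_KEY = read_secret("OPENAI_API_KEY")
```

With this in place, the token and key never appear in the saved notebook or its outputs.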

Load the PDF you want to query:

In [None]:
# Provide the path to your PDF file.
pdfreader = PdfReader('budget_speech.pdf')

In [None]:
# Read the text from every page of the PDF, skipping pages with no extractable text
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content
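The extraction loop can also be wrapped in a small function, which makes it easier to reuse across several PDFs. This is a sketch under assumptions: `extract_text_from_pages` is a hypothetical name, and it only relies on each page object offering an `extract_text()` method, as the pages of `PdfReader` do. Unlike the cell above, it joins pages with a newline so the last line of one page does not fuse with the first line of the next.

```python
def extract_text_from_pages(pages, separator="\n"):
    """Concatenate the text of every page, skipping pages that yield nothing.

    Hypothetical helper: `pages` is any iterable of objects exposing
    extract_text(), for example `PdfReader(path).pages`.
    """
    parts = []
    for page in pages:
        content = page.extract_text()
        if content:  # extract_text() may return an empty string or None
            parts.append(content)
    return separator.join(parts)


# Possible usage with the reader created earlier:
# raw_text = extract_text_from_pages(pdfreader.pages)
```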

In [None]:
raw_text

'Kompala Lokesh\n2-318 Anjaneya swamy street, Bangarupeta, Venkatagiri, 524404\n♂phone+91 8328260402 /envel⌢pekompalalokesh123@gmail.com /linkedinLokesh Kompala /githubKOMPALALOKESH\nEducation\nSri Venkateswara University College of Engineering January 2021 – April 2024 Expected\nBachelor of Science in Computer Science and Engineering – CGPA: 8.21 Tirupati, Andhra Pradesh\nNarayana Junior College June 2018 – April 2020\nM. P. C. – CGPA: 9.69 Nellore, Andhra Pradesh\nNarayana High School April 2018\nTenth Class – CGPA: 10.0 Nellore, Andhra Pradesh\nTechnical Skills\nLanguages/Databases : Python, Java, SQL, MySQL, Postgres, HTML, CSS, Django\nLibraries/Frameworks : Numpy, Pandas, scikit-learn, Matplotlib, TensorFlow, LLM\nTools : VS Code, Google Colaboratory, Jupyter Notebook, GIT/GitHub, Tableau, Excel\nRelevant Coursework\n•Data Structures\n•Machine Learning•Data Science\n•Artificial Intelligence•Java Programming\n•Feature Engineering•Django\nProjects\nCodeGPT |Python, Google Colaborat

Initialize the connection to your database:

_(do not worry if you see a few warnings, it's just that the drivers are chatty about negotiating protocol versions with the DB.)_

In [None]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(139700044123488) f65233ab-9320-4614-89bf-2a5c17611055-us-east-2.db.astra.datastax.com:29042:7d1e19c9-12a6-4994-98b3-32954b433065> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


Create the LangChain embedding and LLM objects for later use:

In [None]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

Create your LangChain vector store ... backed by Astra DB!

In [None]:
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,
    keyspace=None,
)
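Conceptually, a vector store keeps one embedding per text and answers a query by ranking the stored texts by similarity to the query's embedding. The toy in-memory version below illustrates that idea only; it is not how the Astra DB integration is implemented. The class `ToyVectorStore`, the use of cosine similarity, and the letter-count "embedding" are all assumptions made for the illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def letter_counts(text):
    """Hypothetical stand-in for an embedding model: counts of the letters a-z."""
    text = text.lower()
    return [text.count(chr(ord("a") + i)) for i in range(26)]


class ToyVectorStore:
    """Illustrative in-memory store; method names mirror the ones used in this notebook."""

    def __init__(self, embed):
        self.embed = embed
        self.rows = []  # list of (text, vector) pairs

    def add_texts(self, texts):
        for text in texts:
            self.rows.append((text, self.embed(text)))

    def similarity_search_with_score(self, query, k=4):
        query_vector = self.embed(query)
        scored = [(text, cosine_similarity(query_vector, vector))
                  for text, vector in self.rows]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = ToyVectorStore(letter_counts)
store.add_texts(["python java sql", "zzz"])
best_text, best_score = store.similarity_search_with_score("java", k=1)[0]
```

A real store replaces `letter_counts` with a learned embedding model and the linear scan with an index in the database.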

In [None]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into overlapping chunks so that each piece stays small enough
# to embed and to fit in the LLM prompt
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)
texts = text_splitter.split_text(raw_text)
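To see what `chunk_size` and `chunk_overlap` mean in practice, here is a simplified pure-Python sketch of separator-based chunking with overlap. It is an approximation written for illustration, not LangChain's actual algorithm: `split_with_overlap` is a hypothetical function that greedily packs lines into a chunk and carries trailing lines into the next chunk as long as they fit within the overlap budget.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Greedy, line-based chunking with overlap (simplified illustration)."""
    pieces = [p for p in text.split(separator) if p]
    chunks = []
    current = []  # pieces of the chunk being built

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Carry trailing pieces over while they fit in the overlap budget.
            carried = []
            for prev in reversed(current):
                if joined_len([prev] + carried) > chunk_overlap:
                    break
                carried.insert(0, prev)
            # Drop carried pieces if they would push the new chunk over the limit.
            while carried and joined_len(carried + [piece]) > chunk_size:
                carried.pop(0)
            current = carried
        # A single piece longer than chunk_size is kept whole in this sketch.
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


demo_chunks = split_with_overlap("aaaa\nbbbb\ncccc\ndddd", chunk_size=9, chunk_overlap=4)
# demo_chunks == ["aaaa\nbbbb", "bbbb\ncccc", "cccc\ndddd"]
```

Each chunk repeats the last line of the previous one, which is why consecutive entries of `texts` in this notebook share some content.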

In [None]:
texts[:50]

['Kompala Lokesh\n2-318 Anjaneya swamy street, Bangarupeta, Venkatagiri, 524404\n♂phone+91 8328260402 /envel⌢pekompalalokesh123@gmail.com /linkedinLokesh Kompala /githubKOMPALALOKESH\nEducation\nSri Venkateswara University College of Engineering January 2021 – April 2024 Expected\nBachelor of Science in Computer Science and Engineering – CGPA: 8.21 Tirupati, Andhra Pradesh\nNarayana Junior College June 2018 – April 2020\nM. P. C. – CGPA: 9.69 Nellore, Andhra Pradesh\nNarayana High School April 2018\nTenth Class – CGPA: 10.0 Nellore, Andhra Pradesh\nTechnical Skills\nLanguages/Databases : Python, Java, SQL, MySQL, Postgres, HTML, CSS, Django\nLibraries/Frameworks : Numpy, Pandas, scikit-learn, Matplotlib, TensorFlow, LLM\nTools : VS Code, Google Colaboratory, Jupyter Notebook, GIT/GitHub, Tableau, Excel',
 'Libraries/Frameworks : Numpy, Pandas, scikit-learn, Matplotlib, TensorFlow, LLM\nTools : VS Code, Google Colaboratory, Jupyter Notebook, GIT/GitHub, Tableau, Excel\nRelevant Coursewo

### Load the dataset into the vector store



In [None]:
# Embed and store the first 50 chunks (fewer if the document is short)
astra_vector_store.add_texts(texts[:50])

print("Inserted %i chunks." % len(texts[:50]))

astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 4 chunks.


### Run the QA cycle

Run the cell below and ask a question, or type `quit` to stop. (You can also stop execution with the "▪" button on the top toolbar.)

Here are some suggested questions:
- _What is the current GDP?_
- _How much will the agriculture target be increased to, and what will the focus be?_


In [None]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


Enter your question (or type 'quit' to exit): what is the name of applicant in the document have?

QUESTION: "what is the name of applicant in the document have?"




ANSWER: "The name of the applicant in the document is Kompala Lokesh."

FIRST DOCUMENTS BY RELEVANCE:




    [0.8546] "Kompala Lokesh
2-318 Anjaneya swamy street, Bangarupeta, Venkatagiri, 524404
♂phone+ ..."
    [0.8517] "extensive dataset training.
CoCurricular / Achievements
•Solved 250+ DSA problems in ..."
    [0.8420] "on Google Colaboratory.
Credit Card Fraud Detect |Python, Google Colaboratory, Githu ..."
    [0.8389] "Libraries/Frameworks : Numpy, Pandas, scikit-learn, Matplotlib, TensorFlow, LLM
Tool ..."

What's your next question (or type 'quit' to exit): exit

QUESTION: "exit"




ANSWER: "I don't know."

FIRST DOCUMENTS BY RELEVANCE:




    [0.8652] "extensive dataset training.
CoCurricular / Achievements
•Solved 250+ DSA problems in ..."
    [0.8601] "Kompala Lokesh
2-318 Anjaneya swamy street, Bangarupeta, Venkatagiri, 524404
♂phone+ ..."
    [0.8559] "Libraries/Frameworks : Numpy, Pandas, scikit-learn, Matplotlib, TensorFlow, LLM
Tool ..."
    [0.8547] "on Google Colaboratory.
Credit Card Fraud Detect |Python, Google Colaboratory, Githu ..."

What's your next question (or type 'quit' to exit): quit
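In the session above, typing `exit` was sent to the LLM as a question, because the loop only recognizes `quit`. A small helper can accept both words. This is a sketch: `EXIT_COMMANDS` and `is_exit_command` are hypothetical names, not part of the original loop.

```python
EXIT_COMMANDS = {"quit", "exit"}


def is_exit_command(text):
    """Return True if the user's input is one of the stop words, ignoring case and spaces."""
    return text.strip().lower() in EXIT_COMMANDS
```

In the loop, the check `if query_text.lower() == "quit":` would then become `if is_exit_command(query_text):`.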
