## LLM2 - Querying a PDF with Astra DB


Pre-requisites:

You need a Serverless Cassandra with Vector Search database on Astra DB to run this demo. Get a database token with the Database Administrator role and copy your Database ID; both connection parameters are needed in the Setup section.

You also need an OpenAI API key.

What you will do:
  (i) Setup: import dependencies, provide secrets, create the LangChain vector store;
  (ii) Run a question-answering loop that retrieves the relevant text chunks and has an LLM construct the answer.

In [1]:
# Install the dependencies
!pip install -q cassio
!pip install datasets 
!pip install langchain 
!pip install openai 
!pip install tiktoken


In [49]:
!pip install sqlalchemy==2.0.25



In [None]:
from langchain.vectorstores.cassandra import Cassandra 
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings

# Support for dataset retrieval with Hugging Face
from datasets import load_dataset

# With CassIO, the engine powering the Astra DB integration in LangChain,
# you will also initialize the DB connection:

import cassio

In [19]:
!pip install PyPDF2



In [32]:
from PyPDF2 import PdfReader

### Setup

Provide your secrets:

Replace the following with your ASTRA DB connection details and your OpenAI API key:

In [1]:
ASTRA_DB_APPLICATION_TOKEN="YOUR_ASTRA_DB_APPLICATION_TOKEN_HERE"
ASTRA_DB_ID="YOUR_ASTRA_DB_ID_HERE"

OPENAI_API_KEY="YOUR_OPENAI_API_KEY_HERE" 
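
Hard-coded secrets are easy to leak when a notebook is shared. As a minimal sketch, assuming you would rather keep them in environment variables, the helper below reads a secret from the environment and falls back to an interactive prompt. The name `get_secret` is hypothetical and is not part of cassio or LangChain.

```python
import os
from getpass import getpass


def get_secret(name, prompt_if_missing=True):
    """Hypothetical helper: read a secret from the environment, or prompt for it."""
    value = os.environ.get(name, "").strip()
    if not value and prompt_if_missing:
        # getpass hides the typed value so it does not end up in the notebook output.
        value = getpass("Enter %s: " % name).strip()
    if not value:
        raise ValueError("Missing required secret: %s" % name)
    return value
```

With this in place, a cell such as `OPENAI_API_KEY = get_secret("OPENAI_API_KEY")` could replace the hard-coded placeholder, under the assumption that the variable is exported before the notebook starts.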

In [34]:
# Provide the path to your PDF file.
pdfreader = PdfReader('budget_speech.pdf')

In [35]:
# Read the text from every page of the PDF, skipping pages with no extractable text
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content
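
The loop above appends each page directly to the previous one, so the last word of one page can fuse with the first word of the next (the `2023CONTENTS` in the chunk output further down looks like such a boundary). A minimal sketch of a variant that inserts a separator between pages follows. The helper name `join_page_texts` and the `FakePage` stand-in are illustrative assumptions; the only thing assumed about a real page object is that it has an `extract_text()` method, as used above.

```python
def join_page_texts(pages, separator="\n"):
    """Concatenate page texts with a separator, skipping pages without text."""
    parts = []
    for page in pages:
        content = page.extract_text()
        if content:
            parts.append(content)
    return separator.join(parts)


# Stand-in page objects so the sketch runs without a PDF file (illustrative only).
class FakePage:
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo_pages = [FakePage("February 1, 2023"), FakePage(None), FakePage("CONTENTS")]
print(repr(join_page_texts(demo_pages)))
```

In the notebook this would be called as `raw_text = join_page_texts(pdfreader.pages)`. Note that changing the separator changes where the text splitter can cut, so the resulting chunks would differ from the ones shown in this run.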

#### Initialize the connection to your database:
#### (Do not worry if you see a few warnings; the drivers are just chatty about negotiating protocol versions with the DB.)

In [36]:
cassio.init(token=ASTRA_DB_APPLICATION_TOKEN, database_id=ASTRA_DB_ID)

#### Create the LangChain embedding and LLM objects for later usage:

In [37]:
llm = OpenAI(openai_api_key=OPENAI_API_KEY)
embedding = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

#### Create your LangChain vector store, backed by Astra DB:

In [38]:
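astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_mini_demo",
    session=None,   # None: fall back to the connection set up by cassio.init()
    keyspace=None,  # None: fall back to the default keyspace from cassio.init()
)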

In [39]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into overlapping chunks so that each piece stays well within the model's token limit
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size = 800,
    chunk_overlap = 200,
    length_function = len,
)
texts =  text_splitter.split_text(raw_text)
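
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sketch: a fixed-width sliding window over characters. It is not LangChain's algorithm (`CharacterTextSplitter` first splits on the separator and then merges the pieces), and the function name `sliding_window_chunks` is hypothetical. It only illustrates how consecutive chunks share text when an overlap is set.

```python
def sliding_window_chunks(text, chunk_size=800, chunk_overlap=200):
    """Simplified illustration: fixed-width character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# prints ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one. With the settings used in this notebook, neighbouring chunks can share up to 200 characters, which helps keep a sentence that straddles a boundary retrievable from either side.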

In [40]:
texts[:50]

['GOVERNMENT OF INDIA\nBUDGET 2023-2024\nSPEECH\nOF\nNIRMALA SITHARAMAN\nMINISTER OF FINANCE\nFebruary 1,  2023CONTENTS \nPART-A \n Page No.  \n\uf0b7 Introduction 1 \n\uf0b7 Achievements since 2014: Leaving no one behind 2 \n\uf0b7 Vision for Amrit Kaal  – an empowered and inclusive economy 3 \n\uf0b7 Priorities of this Budget 5 \ni. Inclusive Development  \nii. Reaching the Last Mile \niii. Infrastructure and Investment \niv. Unleashing the Potential \nv. Green Growth \nvi. Youth Power  \nvii. Financial Sector  \n \n \n \n \n \n \n \n \n\uf0b7 Fiscal Management 24 \nPART B  \n  \nIndirect Taxes  27 \n\uf0b7 Green Mobility  \n\uf0b7 Electronics   \n\uf0b7 Electrical   \n\uf0b7 Chemicals and Petrochemicals   \n\uf0b7 Marine products  \n\uf0b7 Lab Grown Diamonds  \n\uf0b7 Precious Metals  \n\uf0b7 Metals  \n\uf0b7 Compounded Rubber  \n\uf0b7 Cigarettes  \n  \nDirect Taxes  30 \n\uf0b7 MSMEs and Professionals',
 '\uf0b7 Chemicals and Petrochemicals   \n\uf0b7 Marine products  \n\uf0b7 La

### Load the dataset into the vector store 

In [41]:
# Run the import cell above first; otherwise VectorStoreIndexWrapper is undefined (NameError).
astra_vector_store.add_texts(texts[:50])
print("Inserted %i chunks." % len(texts[:50]))
astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 50 chunks.

### Run the QA cycle

#### Simply run the cells and ask a question -- or type quit to stop. (You can also stop execution with the stop button on the top toolbar.)

#### Here are some suggested questions:
####    What is the current GDP?
####    By how much will the agriculture target be increased, and what will the focus be?

In [None]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat is your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break
    if query_text == "":
        continue

    first_question = False
    print("\nQUESTION: \"%s\"" % query_text)
    # Retrieve the most relevant chunks and let the LLM compose an answer from them
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    # Show the top matching chunks together with their similarity scores
    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("  [%0.4f] \"%s...\"" % (score, doc.page_content[:84]))
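
The relevance listing comes from a vector similarity search: the question is embedded, and the stored chunks whose vectors are closest to it are returned. As a rough illustration of that idea only, and not of how Astra DB or LangChain implement it, the sketch below ranks a few toy vectors by cosine similarity. The function names, the 3-dimensional vectors, and the labels are all made up for the example; real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=4):
    """Return the k (text, score) pairs most similar to the query, best first."""
    scored = [(text, cosine_similarity(query_vector, vector)) for text, vector in documents]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data: hand-written vectors standing in for real embeddings.
toy_documents = [
    ("fiscal deficit", [1.0, 0.0, 0.0]),
    ("agriculture credit", [0.0, 1.0, 0.0]),
    ("green growth", [0.0, 0.0, 1.0]),
]
for text, score in top_k([0.9, 0.1, 0.0], toy_documents, k=2):
    print("  [%0.4f] \"%s\"" % (score, text))
```

The query vector points mostly along the first axis, so "fiscal deficit" ranks first and "agriculture credit" second. A vector database performs the same kind of ranking at scale, typically with an approximate index rather than the exhaustive scan used here.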