**Querying a PDF using Astra DB & LangChain**

In [None]:
# The langchain.vectorstores path is a deprecated shim; the class lives in langchain_community.
from langchain_community.vectorstores import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain_groq import ChatGroq
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
import cassio
from PyPDF2 import PdfReader
import os

In [None]:
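ASTRA_DB_APP_TOKEN = os.getenv("ASTRA_DB_APP_TOKEN")
ASTRA_DB_ID = os.getenv("ASTRA_DB_ID")

GROQ_API_KEY = os.getenv("GROQ_API_KEY")

# Assigning None to os.environ raises TypeError, so only export HF_TOKEN when it is set.
hf_token = os.getenv("HF_TOKEN")
if hf_token:
    os.environ["HF_TOKEN"] = hf_token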
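
The cell above reads each credential with `os.getenv`, which silently returns `None` for anything unset, so a missing token only surfaces later as a connection or authentication error. A minimal sketch of a fail-fast check, assuming a hypothetical helper named `require_env` (not part of any library used here):

```python
import os


def require_env(names):
    """Return {name: value} for the given variables, raising if any are unset or empty."""
    values = {name: os.getenv(name) for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values


# Possible use in this notebook (commented out so the sketch has no side effects):
# secrets = require_env(["ASTRA_DB_APP_TOKEN", "ASTRA_DB_ID", "GROQ_API_KEY"])
```

`HF_TOKEN` is left out of the example list on purpose, since the Hugging Face warning further down says authentication is optional for public models.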

In [9]:
pdf_reader = PdfReader("budget_speech.pdf")

In [10]:
pdf_reader

<PyPDF2._reader.PdfReader at 0x7acec8547950>

In [11]:
raw_text = ''
for page in pdf_reader.pages:
  content = page.extract_text()
  if content:  # skip pages with no extractable text
    raw_text += content
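
The loop above concatenates pages with no separator, so the last word of one page can run into the first word of the next. A small sketch of an alternative that joins pages with a newline, assuming a hypothetical helper `join_pages` that takes the per-page strings:

```python
def join_pages(page_texts, separator="\n"):
    """Concatenate per-page strings, skipping pages that yielded None or ''."""
    return separator.join(text for text in page_texts if text)


# With the reader from the cell above this might look like (not run here):
# raw_text = join_pages(page.extract_text() for page in pdf_reader.pages)
```

Because the splitter used later breaks on `"\n"`, a newline separator also gives it a natural boundary at each page break.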

In [12]:
raw_text

" 20  \n \nPART B  \nIndirect Taxes  \n115. My proposals relating to Customs aim to rationalize tariff structure and \naddress duty inversion. These will also support domestic manufacturing and \nvalue addition, promote exports, facilitate trade and provide relief to common \npeople.  \nRationalisation of Customs Tariff Structure for Industrial Goods  \n116. As a part of comprehensive review of Customs rate structure \nannounced in July 2024 Budget, I propose to:  \n(i) remove seven tariff rates. This is over and above the seven tariff \nrates removed in 2023 -24 budget. After this, there will be only eight \nremaining tariff rates including ‘zero’ rate.  \n(ii) apply appropriate cess to broadly maintain effective duty incidence \nexcept on a few items, where such incidence will reduce marginally.  \n(iii) levy not more than one cess or surcharge. Therefore, I propose to \nexempt Social Welfare Surcharge on 82 tariff lines that are subject \nto a cess.  \n117. I shall now take up secto

**Connect to your DB**

In [None]:
cassio.init(token=ASTRA_DB_APP_TOKEN, database_id=ASTRA_DB_ID)

**LLM & Embeddings**

In [16]:
llm = ChatGroq(model="gemma2-9b-it", groq_api_key=GROQ_API_KEY)
embedding = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [17]:
# session=None and keyspace=None fall back to the connection set up by cassio.init() above.
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="budget_embeddings",
    session=None,
    keyspace=None,
)

In [20]:
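from langchain_text_splitters import RecursiveCharacterTextSplitter

# separators expects a list of strings, tried in order; here we split on newlines only.
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n"],
    chunk_size=800,      # maximum characters per chunk
    chunk_overlap=200,   # characters shared between neighbouring chunks
    length_function=len,
)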
texts = text_splitter.split_text(raw_text)
texts[:50]

['20  \n \nPART B  \nIndirect Taxes  \n115. My proposals relating to Customs aim to rationalize tariff structure and \naddress duty inversion. These will also support domestic manufacturing and \nvalue addition, promote exports, facilitate trade and provide relief to common \npeople.  \nRationalisation of Customs Tariff Structure for Industrial Goods  \n116. As a part of comprehensive review of Customs rate structure \nannounced in July 2024 Budget, I propose to:  \n(i) remove seven tariff rates. This is over and above the seven tariff \nrates removed in 2023 -24 budget. After this, there will be only eight \nremaining tariff rates including ‘zero’ rate.  \n(ii) apply appropriate cess to broadly maintain effective duty incidence \nexcept on a few items, where such incidence will reduce marginally.',
 'remaining tariff rates including ‘zero’ rate.  \n(ii) apply appropriate cess to broadly maintain effective duty incidence \nexcept on a few items, where such incidence will reduce margina
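
The two chunks printed above share the text starting at "remaining tariff rates", which is the effect of `chunk_overlap=200`. As a rough illustration of how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-offset sliding window. This is an assumption-laden sketch, not the splitter's actual algorithm: `RecursiveCharacterTextSplitter` cuts on separators and merges the pieces, so its boundaries land on newlines rather than at fixed offsets. The name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=800, chunk_overlap=200):
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.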

**Load the data into the DB**

In [21]:
chunks = texts[:50]  # cap the upload at the first 50 chunks
astra_vector_store.add_texts(chunks)
print("Inserted %i chunks." % len(chunks))
astra_vector_index = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 29 chunks.
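
Re-running the insert cell adds the same text again, because no IDs are supplied and the store has to generate fresh ones. One possible way around that is to derive each ID from the chunk's content. The sketch below assumes `add_texts` accepts an `ids` argument (as LangChain vector stores generally do) and that rows with an existing ID are overwritten rather than duplicated; `chunk_ids` and the `"budget"` prefix are hypothetical names.

```python
import hashlib


def chunk_ids(chunks, prefix="budget"):
    """Build a stable ID per chunk from a SHA-256 digest of its text."""
    return [
        prefix + "-" + hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]
        for chunk in chunks
    ]


# Assumed usage with the store from this notebook (not run here):
# astra_vector_store.add_texts(texts[:50], ids=chunk_ids(texts[:50]))
```

Identical chunks map to the same ID, so a second run would rewrite existing rows under the stated assumptions instead of growing the table.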


In [22]:
first_question = True
while True:
    if first_question:
        query_text = input("\nEnter your question (or type 'quit' to exit): ").strip()
    else:
        query_text = input("\nWhat's your next question (or type 'quit' to exit): ").strip()

    if query_text.lower() == "quit":
        break

    if query_text == "":
        continue

    first_question = False

    print("\nQUESTION: \"%s\"" % query_text)
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    print("ANSWER: \"%s\"\n" % answer)

    print("FIRST DOCUMENTS BY RELEVANCE:")
    for doc, score in astra_vector_store.similarity_search_with_score(query_text, k=4):
        print("    [%0.4f] \"%s ...\"" % (score, doc.page_content[:84]))


Enter your question (or type 'quit' to exit): What is the budget change for electronic goods?

QUESTION: "What is the budget change for electronic goods?"

ANSWER: "Here are the budget changes for electronic goods according to the text:

* **Interactive Flat Panel Display (IFPD):** BCD increased from 10% to 20%.
* **Open Cell and other components:** BCD reduced from 10% to 5%.
* **Parts of Open Cells for LCD/LED TVs:** BCD exemption. 


Let me know if you have any other questions."

FIRST DOCUMENTS BY RELEVANCE:
    [0.7240] "Textiles  
121.  To promote domestic production of technical textile products such a ..."
    [0.7092] "from 10% to 20% and reduce the BCD to 5% on Open Cell and other 
components.  
123.  ..."
    [0.6882] "lithium -ion battery, both for mobile phones and electric vehicles.  
Shipping Secto ..."
    [0.6773] "limits.   26  
 
147. In my speech in July 2024, I had promised that all processes i ..."

What's your next question (or type 'quit' to exit): What was proposed about the business?

QUESTION: "What was proposed about the business?"

ANSWER: "The proposal was to extend the benefits of the existing tonnage tax scheme to inland vessels registered under the Indian Vessels Act, 2021 to promote inland water transport in the country."

FIRST DOCUMENTS BY RELEVANCE:
    [0.6673] "limits.   26  
 
147. In my speech in July 2024, I had promised that all processes i ..."
    [0.6492] "The benefits of existing tonnage tax scheme are proposed to be extended to 
inland v ..."
    [0.6350] "Our Government  is committed to keeping an ear to the ground and a finger on 
the pu ..."
    [0.6246] "149. I have a few proposals to promote investment and employment.  
Tax certainty fo ..."

What's your next question (or type 'quit' to exit): quit
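
The bracketed numbers in the output are the scores returned by `similarity_search_with_score`, listed from highest to lowest. The notebook does not show how the store computes or normalizes them, so the following is only a sketch of the general idea, assuming a cosine-style similarity where a higher value means a closer match. The helpers `cosine_similarity` and `top_k` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return (score, index) pairs for the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, reverse=True)[:k]


# Toy 2-D "embeddings"; real ones from all-MiniLM-L6-v2 have many more dimensions.
ranked = top_k([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2)
print([index for _, index in ranked])  # [1, 2]
```

Note that the top hit for the first question is a textiles passage, so a high score signals closeness in embedding space, not that the passage answers the question.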
