<a href="https://www.kaggle.com/code/durjoychandrapaul/rag-q-a-system-by-langchain-huggingface-for-pdf?scriptVersionId=204704280" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

Author: [Durjoy Chandra Paul](https://www.linkedin.com/in/durjoy-chandra-paul/)
## Introduction
This project showcases a Retrieval-Augmented Generation (RAG) approach to building a question-answering system over a PDF document. The process includes initializing a language model hosted on Hugging Face, an open-source model platform, and splitting the document's text into manageable chunks. A vector store holds the chunk embeddings, allowing for efficient retrieval of relevant passages. Finally, a query loop runs a list of predefined questions and prints answers based on the content of the PDF.

## Import Libraries and Explore Input Directory

In [1]:
import numpy as np 
import pandas as pd 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/book-with-information/budget_speech.pdf


In [2]:
! pip install -U langchain langchain_community langchain_astradb langchain-huggingface > /dev/null 2>&1

In [3]:
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain_community.document_loaders import PyPDFLoader
from langchain_astradb import AstraDBVectorStore
from langchain_community.llms import HuggingFaceHub
from langchain_huggingface import HuggingFaceEmbeddings

In [4]:
loader = PyPDFLoader('/kaggle/input/book-with-information/budget_speech.pdf')

In [5]:
pages = loader.load_and_split()
len(pages)

56

In [6]:
pages[0]

Document(metadata={'source': '/kaggle/input/book-with-information/budget_speech.pdf', 'page': 0}, page_content='GOVERNMENT OF INDIA\nBUDGET 2023-2024\nSPEECH\nOF\nNIRMALA SITHARAMAN\nMINISTER OF FINANCE\nFebruary 1,  2023')

## Retrieve Secrets for API Access

In [7]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
ASTRA_DB_API_ENDPOINT = user_secrets.get_secret("ASTRA_DB_API_ENDPOINT")
ASTRA_DB_APPLICATION_TOKEN = user_secrets.get_secret("ASTRA_DB_APPLICATION_TOKEN")
ASTRA_DB_ID = user_secrets.get_secret("ASTRA_DB_ID")
huggingfacehub_api_token = user_secrets.get_secret("huggingfacehub_api_token")
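
The cell above only works inside Kaggle, where the `kaggle_secrets` module is available. As a rough sketch, a small helper could fall back to environment variables when the notebook runs elsewhere. The name `get_secret` is hypothetical and not part of the original notebook, and reading the same secret names from the environment is an assumption.

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from Kaggle if possible, else from the environment.

    Hypothetical helper, not part of the original notebook.
    """
    try:
        # Only importable inside a Kaggle notebook session.
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(name)
    except Exception:
        # Deliberately broad: covers a missing module as well as a missing
        # or inaccessible secret, and falls back to the environment.
        value = os.environ.get(name)
        if value is None:
            raise KeyError(f"Secret {name!r} not found in Kaggle or environment")
        return value


# Usage with a placeholder variable set for demonstration only.
os.environ["DEMO_SECRET"] = "abc"
demo_value = get_secret("DEMO_SECRET")
```

With such a helper, each `user_secrets.get_secret(...)` call could be swapped for `get_secret(...)` without changing the rest of the notebook.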

## LLM

In [8]:
llm = HuggingFaceHub(repo_id="HuggingFaceH4/zephyr-7b-beta", model_kwargs={"temperature":0.5,"max_length":64}, huggingfacehub_api_token=huggingfacehub_api_token)


In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=200,
)
texts = text_splitter.split_documents(pages)
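
To make the effect of `chunk_size=500` and `chunk_overlap=200` concrete, here is a simplified sketch of overlapping chunking. It uses plain fixed-size character windows, which is an assumption made for illustration: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, so its chunks will differ. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified stand-in for a real text splitter; not LangChain's algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts 300 chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Placeholder text of 1200 characters.
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])  # prints: 4 [500, 500, 500, 300]
```

Each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. The cost is that more chunks are embedded and stored.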

## Embedding Model

In [10]:
embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

## Set Up Vector Store

In [11]:
vstore = AstraDBVectorStore(
    embedding=embeddings_model,
    collection_name="langchain_pdf_query",
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN
)

In [12]:
vstore.add_documents(texts)
astra_vector_index = VectorStoreIndexWrapper(vectorstore=vstore)
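
Conceptually, the vector store embeds every chunk and, at query time, returns the chunks whose embeddings are closest to the query embedding. The toy sketch below illustrates that idea in memory. It is not how Astra DB or `all-mpnet-base-v2` work internally: word counts stand in for neural embeddings, and `ToyVectorStore`, `embed`, and the sample chunks are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory store that ranks texts by cosine similarity to a query."""

    def __init__(self) -> None:
        self._rows: list[tuple[Counter, str]] = []

    def add_texts(self, texts: list[str]) -> None:
        for text in texts:
            self._rows.append((embed(text), text))

    def similarity_search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
# Placeholder chunks, not taken from the PDF.
store.add_texts([
    "Chunk about agriculture credit and farming",
    "Chunk about railways and capital investment",
    "Chunk about economic growth estimates",
])
top = store.similarity_search("What is the agriculture target?", k=1)
print(top[0])  # prints: Chunk about agriculture credit and farming
```

A production store replaces the linear scan with an approximate nearest-neighbour index, which is what makes retrieval efficient on large collections.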

## Q/A

In [13]:
# List of predefined questions
questions = [
    "What is the current GDP?",
    "How much will the agriculture target be increased to, and what will the focus be?",
]

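# Iterate through the predefined questions
for query_text in questions:
    print("\nQUESTION: \"%s\"" % query_text)
    # Retrieve relevant chunks from the vector store and ask the LLM
    answer = astra_vector_index.query(query_text, llm=llm).strip()
    # Drop repeated lines. Note: a set does not preserve order,
    # so the lines of the printed answer may appear shuffled.
    unique_lines = set(answer.split('\n'))
    filtered_answer = "\n".join(unique_lines)
    print("ANSWER: \"%s\"\n" % filtered_answer)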



QUESTION: "What is the current GDP?"
ANSWER: "
in private investments, and provide a cushion against global headwinds. 
45. This substantial increase in recent years is central to the 
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
citizens, especially our youth, women, farmers, OBCs, Scheduled Castes and 
appreciates India’s achievements and successes, we are sure that elders
3. Today as Indians stands with their head held high, and the world 
government’s efforts to enhance growth potential and job creation, crowd-
total expenditure are estimated at ` 27.2 lakh crore and ` 45 lakh crore 
2. In the 75th year of our Independence, the world has recognised the 
Helpful Answer: According to the given context, the current year's economic growth is estimated to be at 7%. This suggests that the current GDP is growing at 7% in the current year. However, the context doesn't prov
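
The answer above contains the echoed prompt and context chunks, and its lines are shuffled because deduplicating with `set` discards their order. One possible clean-up, sketched below, keeps only the text after the `Helpful Answer:` marker seen in the output and removes repeated lines while preserving order. The helper names are hypothetical, and relying on that marker assumes the model always echoes the prompt in this form.

```python
def dedupe_lines(text: str) -> str:
    """Remove blank and repeated lines while keeping first-seen order."""
    lines = (line.strip() for line in text.split("\n"))
    # dict.fromkeys keeps insertion order, unlike set().
    return "\n".join(dict.fromkeys(line for line in lines if line))


def extract_answer(raw: str, marker: str = "Helpful Answer:") -> str:
    """Keep only the text after the last marker; fall back to the full text."""
    # rpartition returns the whole string as the tail when marker is absent.
    tail = raw.rpartition(marker)[2]
    return dedupe_lines(tail)


# Shortened, illustrative raw output in the same shape as the answer above.
raw_output = (
    "Use the following pieces of context to answer the question at the end.\n"
    "context line\n"
    "context line\n"
    "Helpful Answer: The current year's economic growth is estimated at 7%.\n"
    "The current year's economic growth is estimated at 7%."
)
cleaned = extract_answer(raw_output)
print(cleaned)  # prints: The current year's economic growth is estimated at 7%.
```

In the query loop, this would amount to calling `extract_answer(answer)` in place of the `set`-based filtering.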

In [14]:
# Clean up the Astra DB collection after use
await vstore.aclear()

## Conclusion
Using a more powerful language model, such as those from OpenAI, could significantly improve the quality of replies in this system. Such models tend to generate concise, summarized responses, reducing redundancy while maintaining clarity.