PDF-QA Pipeline:
Load a PDF, split pages into chunks, embed, retrieve, and generate answers citing page numbers.

In [34]:
import os

from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_groq import ChatGroq
from langchain.chains import RetrievalQA

# Read GROQ_API_KEY from a local .env file or the environment
load_dotenv()
groq_api_key = os.getenv("GROQ_API_KEY")


Load the PDF with page numbers as metadata

In [35]:
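def load_pdf(pdf_path):
    loader = PyPDFLoader(pdf_path)
    # load() returns one Document per PDF page; load_and_split() would re-split
    # the pages, so a list index would no longer match the page number
    pages = loader.load()
    for i, page in enumerate(pages):
        # PyPDFLoader stores a 0-based index under "page"; store a 1-based number
        page.metadata["page_number"] = page.metadata.get("page", i) + 1
    return pages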

Split into chunks

In [36]:
def split_document(documents):
    # 800-character chunks that share 200 characters with their neighbour
    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=200)
    return splitter.split_documents(documents)
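
To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sketch using a plain sliding window over characters. It is a simplification and not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraphs and sentences); the function name sliding_chunks and the toy sizes are assumptions made for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk starts with the last four characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.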

Embed chunks

In [37]:
def create_vectorestore(chunks):
    # Embed every chunk with a small sentence-transformers model and index it in FAISS
    embedding_model = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    vectorstore = FAISS.from_documents(chunks, embedding_model)
    return vectorstore
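
What the vector store does at query time can be sketched without FAISS: embed the query, then return the chunks whose vectors lie closest to it. The sketch below uses hand-written 3-dimensional vectors and Euclidean distance. The vectors, the chunk labels, and the helper names l2_distance and top_k are all assumptions for illustration; real all-MiniLM-L6-v2 embeddings have 384 dimensions, and FAISS uses optimized index structures rather than a full sort.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=2):
    """Return the texts of the k (text, vector) entries closest to query_vec."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy "embeddings", invented for this example
toy_index = [
    ("chunk about parsing", [0.9, 0.1, 0.0]),
    ("chunk about semantics", [0.1, 0.9, 0.0]),
    ("chunk about robotics", [0.0, 0.1, 0.9]),
]
query = [0.8, 0.2, 0.0]
nearest = top_k(query, toy_index, k=2)
print(nearest)
```

The retrieved chunks, together with their page_number metadata, are what the QA chain later passes to the LLM as context.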


Groq LLM

In [38]:
def get_llm():
    return ChatGroq(api_key=groq_api_key, model_name="llama-3.3-70b-versatile")

Retrieval QA chain

In [39]:
def build_qa_chain(vectorstore):
    retriever = vectorstore.as_retriever()
    llm = get_llm()
    # return_source_documents=True keeps the retrieved chunks so pages can be cited
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=retriever,
        return_source_documents=True,
    )
    return qa_chain


Asking a question and citing the sources

In [40]:
def answer_question(qa_chain, query):
    result = qa_chain.invoke({"query": query})
    answer = result["result"]
    sources = result["source_documents"]

    # Collect each cited page once; chunks without metadata are reported as "N/A"
    cited_pages = []
    for doc in sources:
        page_number = doc.metadata.get("page_number", "N/A")
        if page_number not in cited_pages:
            cited_pages.append(page_number)

    # Sort numerically and keep "N/A" last; sorting ints and strings together
    # directly would raise a TypeError
    cited_pages.sort(key=lambda page: (isinstance(page, str), page))

    # Build the citation text
    joined_pages = ", ".join(str(page) for page in cited_pages)
    citation_text = "\n\n📄 **Cited Pages**: " + joined_pages

    return answer + citation_text
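
The citation step can be checked offline, without a Groq API key, by feeding it canned output. The sketch below is one way to do that. FakeChain and cited_pages are hypothetical names introduced here, and the SimpleNamespace objects only imitate the metadata attribute of the real source documents.

```python
from types import SimpleNamespace


class FakeChain:
    """Hypothetical stand-in for the QA chain; returns canned output."""

    def invoke(self, inputs):
        return {
            "result": "Canned answer for: " + inputs["query"],
            "source_documents": [
                SimpleNamespace(metadata={"page_number": 7}),
                SimpleNamespace(metadata={"page_number": 2}),
                SimpleNamespace(metadata={"page_number": 7}),  # duplicate page
                SimpleNamespace(metadata={}),                  # no page metadata
            ],
        }


def cited_pages(sources):
    """Unique page numbers in ascending order, with 'N/A' placed last."""
    pages = {doc.metadata.get("page_number", "N/A") for doc in sources}
    return sorted(pages, key=lambda p: (isinstance(p, str), p))


result = FakeChain().invoke({"query": "What is parsing?"})
pages = cited_pages(result["source_documents"])
print(pages)
```

The sort key groups integers before strings, so a missing page number never gets compared with an integer.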

Main Function

In [41]:
if __name__ == "__main__":
    pdf_path = "artificial_intelligence.pdf"
    print("Loading and processing the PDF")
    docs = load_pdf(pdf_path)
    chunks = split_document(docs)
    vectorstore = create_vectorestore(chunks)
    qa_chain = build_qa_chain(vectorstore)

    # Example query
    question = "What are the key points discussed in the document?"
    print("\n Question:", question)
    response = answer_question(qa_chain, question)
    print("\n Answer:\n", response)

Loading and processing the PDF

 Question: What are the key points discussed in the document?

 Answer:
 The key points discussed in the document are:

1. Introduction to Natural Language Processing (NLP) and its steps.
2. The five general steps in NLP are not explicitly listed, but two of them are mentioned: 
   - Lexical Analysis: identifying and analyzing the structure of words.
   - Syntactic Analysis (Parsing): analyzing words in a sentence for grammar and arranging words to show their relationships.
3. Various concepts related to NLP, including:
   - Morphology: the study of word construction.
   - Morpheme: the primitive unit of meaning in a language.
   - Syntax: arranging words to make a sentence.
   - Semantics: the meaning of words and how to combine them into phrases and sentences.
   - Pragmatics: using and understanding sentences in different situations.
   - Discourse: how the preceding sentence affects the interpretation of the next sentence.
   - World Knowledge: gener