# Retrieval Augmented Generation (RAG) Pipeline for PDF Question Answering 

## Overview

This project implements a Retrieval Augmented Generation (RAG) pipeline that answers questions based on the content of PDF documents. It combines information retrieval with a large language model (LLM), here GPT-2, to provide more accurate and contextually relevant answers. The project uses several Python libraries: PyMuPDF, pdfminer.six, sentence-transformers, FAISS, and transformers.

## Dataset

The project operates on a collection of PDF documents. The documents were created through prompting and are titled "Introduction to Machine Learning", "Basics of NLP", and "Introduction to Data Science".

## Methodology

1. **Data Acquisition and Preparation:**
    - PDF files are read from a specified directory.
    - Text is extracted from each PDF with `pdfminer.six` (`extract_text`). `PyMuPDF` (`fitz`) is also imported, but the extraction code shown relies on `pdfminer.six`.
    - The extracted text is split into smaller, overlapping segments (sliding window chunking) to manage context length and improve retrieval granularity.

2. **Embedding Generation:**
    - The `sentence-transformers` library is used to generate embeddings for each text chunk.  The `all-MiniLM-L6-v2` model is used for embedding generation.
    - These embeddings are vector representations of the text, capturing semantic meaning.

3. **FAISS Indexing:**
    - A FAISS index (`IndexFlatL2`) is created to store and search the generated embeddings. `IndexFlatL2` performs an exact, exhaustive L2 (Euclidean) nearest-neighbour search, which is fast enough for a small collection like this one.

4. **Retrieval:**
    - When a user provides a query, its embedding is generated using the same `sentence-transformers` model.
    - The FAISS index is queried to find the most similar (and therefore most relevant) text chunks based on the query embedding.

5. **Language Model Interaction and Response Generation:**
    - A pre-trained language model (`gpt2-large`) from the `transformers` library is loaded.
    - A prompt is constructed containing the retrieved relevant chunks and the user's query. The context is truncated (by character count in the current code) to fit within the LLM's context window.
    - The LLM generates a response based on the provided prompt.



## Results

The project provides a functional RAG pipeline. The quality of the responses depends on factors such as the quality of the PDFs, the chunking strategy, the choice of embedding model, and the LLM used. This example uses the `gpt2-large` model. For the sample query, the pipeline identified which PDF document was related to the query and provided an adequate response.

## Conclusion

This project demonstrates a basic RAG pipeline for PDF question answering. It provides a foundation for building more sophisticated systems. Future work will focus on improving context handling, optimizing GPU utilization, enhancing prompt engineering, and implementing evaluation metrics.



In [None]:
!pip install PyMuPDF pdfminer.six torch transformers sentence-transformers faiss-cpu


In [None]:
!conda install pytorch torchvision cudatoolkit=11.1 -c pytorch

/bin/bash: line 1: conda: command not found


In [None]:
#necessary libraries
import os
import io
import fitz
from pdfminer.high_level import extract_text
from sentence_transformers import SentenceTransformer
import faiss
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM



In [None]:
#checking cuda
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [None]:
#pdfs created and added to this path
pdf_folder = '/content/sample_data/pdfs'
print(f"Using PDF files from: {pdf_folder}")


Using PDF files from: /content/sample_data/pdfs


In [None]:
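def extract_text_from_pdf(pdf_path):
    # pdfminer.six's high-level helper returns the whole document as one string
    return extract_text(pdf_path)

documents = []
# sorted() makes the processing order reproducible; os.listdir() order is arbitrary
for filename in sorted(os.listdir(pdf_folder)):
    # Accept both .pdf and .PDF extensions
    if filename.lower().endswith('.pdf'):
        file_path = os.path.join(pdf_folder, filename)
        text = extract_text_from_pdf(file_path)
        documents.append((filename, text))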

print(f"Processed {len(documents)} PDF files.")

Processed 3 PDF files.


In [None]:
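# Sliding window chunking: fixed-size character windows that overlap their neighbours
def chunk_text(text, chunk_size=1000, overlap=200):
    # Guard against a non-positive stride, which would make the loop below run forever
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        # Advance by the stride so consecutive chunks share `overlap` characters
        start += (chunk_size - overlap)
    return chunks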

chunked_documents = []
for filename, text in documents:
    chunks = chunk_text(text)
    chunked_documents.extend([(filename, chunk) for chunk in chunks])

print(f"Created {len(chunked_documents)} chunks.")

Created 6 chunks.
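
To make the window arithmetic concrete, here is a minimal sketch of the same sliding-window idea on a toy string. The helper name `sliding_window` and the small `chunk_size`/`overlap` values are illustrative assumptions, not part of the notebook; the stride is `chunk_size - overlap`, as in `chunk_text`.

```python
def sliding_window(text, chunk_size, overlap):
    """Return (start, chunk) pairs using a fixed stride of chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    return [(start, text[start:start + chunk_size])
            for start in range(0, len(text), stride)]


sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters, toy input
windows = sliding_window(sample, chunk_size=10, overlap=4)
for start, chunk in windows:
    print(start, repr(chunk))
```

With these toy values the windows start at offsets 0, 6, 12, 18 and 24, and each shares 4 characters with its predecessor. The last window (`'yz'`) is a short tail that the previous window already covers, which is a known side effect of this simple stride-based scheme.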


In [None]:
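device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Query GPU properties only when a GPU is present; the call fails on CPU-only machines
if torch.cuda.is_available():
    print("GPU Memory:", torch.cuda.get_device_properties(0).total_memory / (1024.0 **3), "GB")
# Debugging aid: synchronous kernel launches make CUDA errors surface at the failing call
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"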

Using device: cuda
GPU Memory: 39.56427001953125 GB


In [None]:
# Vectorizing the chunks
model = SentenceTransformer('all-MiniLM-L6-v2')
# `device` already resolves to "cuda" or "cpu", so no extra branching is needed
model.to(device)

batch_size = 32
chunk_texts = [chunk for _, chunk in chunked_documents]
# encode() batches internally and returns one NumPy array of shape (n_chunks, dim)
embeddings_np = model.encode(chunk_texts, batch_size=batch_size, convert_to_numpy=True)

# Kept as a CPU tensor because the indexing cell calls embeddings.numpy();
# from_numpy avoids the slow list-of-arrays conversion done by torch.tensor()
embeddings = torch.from_numpy(embeddings_np)

print(f"Generated {len(embeddings)} embeddings.")


Generated 6 embeddings.



In [None]:
#Indexing using FAISS for easy storage and search of embeddings
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings.numpy())

print("FAISS index created.")


FAISS index created.
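
For intuition about what a flat L2 index computes, the following is a plain-Python sketch of an exhaustive nearest-neighbour search. It is not FAISS's implementation: the function `flat_l2_search` and the toy vectors are hypothetical, chosen only to illustrate the idea. Like `IndexFlatL2`, it scores by squared Euclidean distance and compares the query against every stored vector.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbour search by squared Euclidean distance."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first; ties broken by lower index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, ids = flat_l2_search(toy_vectors, query=[1.0, 1.0], k=2)
print(distances, ids)  # [1.0, 2.0] [1, 0]
```

Because every vector is compared against the query, the cost grows linearly with the number of chunks. That is negligible for the six chunks indexed here, though larger collections usually call for an approximate index.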


In [None]:
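def retrieve_relevant_chunks(query, top_k=5):
    # Embed the query with the same model that embedded the chunks
    query_embedding = model.encode([query], convert_to_tensor=True).cpu().numpy()
    # Never ask for more neighbours than the index holds; FAISS pads missing results with -1
    top_k = min(top_k, index.ntotal)
    _, indices = index.search(query_embedding, top_k)
    return [chunked_documents[i] for i in indices[0] if i != -1]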
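
A note on what "most similar" means under L2 distance: if the embeddings are unit-normalized (an assumption here, since the notebook does not normalize them explicitly), then the squared L2 distance equals `2 - 2 * cosine_similarity`, so ranking by L2 and ranking by cosine similarity agree. The sketch below checks this on hypothetical toy vectors using only the standard library.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for real embeddings
query = normalize([1.0, 2.0, 0.5])
candidates = [normalize(v) for v in ([1.0, 1.9, 0.4], [-1.0, 0.2, 3.0], [0.5, 0.5, 0.5])]

# Rank candidate indices by ascending L2 distance and by descending cosine similarity
by_l2 = sorted(range(len(candidates)), key=lambda i: squared_l2(query, candidates[i]))
by_cosine = sorted(range(len(candidates)), key=lambda i: -dot(query, candidates[i]))
print(by_l2, by_cosine)
```

Both rankings come out identical for unit vectors. For unnormalized embeddings the two orderings can differ, in which case the choice of metric matters.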

In [None]:
#Loading gpt-2 using transformers
model_name = "gpt2-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
lm_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

print("Language model loaded.")

Language model loaded.


In [None]:
# Generates a response to a query from the retrieved context, using the sampling settings below
def generate_response(query, max_length=200):
    relevant_chunks = retrieve_relevant_chunks(query)
    context = "\n".join([chunk for _, chunk in relevant_chunks])

    # Cut by characters, not tokens; for English text this typically leaves the
    # prompt far below GPT-2's 1024-token window
    context = context[:1024 - len(query) - 20]

    prompt = f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"

    input_ids = tokenizer.encode(prompt, return_tensors="pt", max_length=1024, truncation=True).to(device)
    # No padding is used, so every position is a real token
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=device)

    with torch.no_grad():
        output = lm_model.generate(
            input_ids,
            attention_mask=attention_mask,
            # max_length counts prompt tokens plus newly generated tokens
            max_length=min(1024, len(input_ids[0]) + max_length),
            num_return_sequences=1,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.7
        )

    # The decoded text includes the prompt, so the response echoes the context and question
    return tokenizer.decode(output[0], skip_special_tokens=True)
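
GPT-2's context window is 1024 tokens, whereas the slice in `generate_response` counts characters. One way to budget in tokens instead is sketched below. This is an illustration under stated assumptions, not the notebook's method: `fit_context_to_window` is a hypothetical helper, whitespace splitting stands in for the real tokenizer, and `template_tokens = 6` is an assumed size for the prompt scaffolding.

```python
def fit_context_to_window(context_tokens, fixed_tokens, window=1024, reserve=200):
    """Trim context tokens so the prompt plus `reserve` generated tokens fit the window."""
    budget = window - reserve - fixed_tokens
    if budget < 0:
        raise ValueError("question and template alone exceed the window")
    return context_tokens[:budget]


# Stand-in tokenizer for illustration only: whitespace splitting, not GPT-2's BPE
context_tokens = ("word " * 2000).split()
question_tokens = "What algorithms are mentioned ?".split()
template_tokens = 6  # assumed size of the "Context:", "Question:", "Answer:" scaffolding

fixed = len(question_tokens) + template_tokens
kept = fit_context_to_window(context_tokens, fixed)
prompt_len = len(kept) + fixed
print(len(kept), prompt_len)  # 813 824
```

In the actual pipeline the same arithmetic could be applied to the token IDs returned by `tokenizer.encode`, which would let the context use most of the window while still reserving room for the answer.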

In [None]:
#The query
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
query = "What algorithms are mentioned in the PDF 'Introduction to Machine Learning' for each type of machine learning?"
response = generate_response(query)
print("Query:", query)
print("Response:", response)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Query: What algorithms are mentioned in the PDF 'Introduction to Machine Learning' for each type of machine learning?
Response: Context: Introduction to Machine Learning 

Machine learning is a rapidly growing field of artificial intelligence that focuses on the development 
of algorithms and statistical models that enable computer systems to improve their performance on 
a specific task through experience. At its core, machine learning is about creating systems that can 
learn from data, identify patterns, and make decisions with minimal human intervention. This 
approach has revolutionized numerous industries, from healthcare and finance to transportation and 
entertainment.There are three main types of machine learning: supervised learning, unsupervised 
learning, and reinforcement learning. Supervised learning involves training models on labeled data to 
make predictions or classifications. Unsupervised learning, on the other hand, deals with finding 
hidden patterns or structures 