# Retrieval-Augmented Generation (RAG) Pipeline with PDF Data

This notebook demonstrates a Retrieval-Augmented Generation (RAG) pipeline for interacting with PDF documents.
The goal is to extract structured data from PDFs, chunk the data, generate embeddings, and perform retrieval-augmented query answering.

The steps include:
1. Extracting text and structured data from a PDF.
2. Chunking and embedding the data.
3. Storing the data in a vector database for efficient similarity-based retrieval.
4. Retrieving the chunks most relevant to a user query, which a language model (LLM) can then use to answer it.

The tools used include `PyMuPDF`, `pdfplumber`, `sentence-transformers`, and `FAISS`.

In [1]:
import fitz  # PyMuPDF for PDF text extraction
import pdfplumber
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

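# Step 1: Extracting text from the PDF
def extract_text_from_pdf(pdf_path):
    """Return the plain text of every page in the PDF as a single string."""
    # The context manager closes the document even if extraction fails
    with fitz.open(pdf_path) as doc:
        # Join once instead of concatenating strings page by page
        return "".join(page.get_text() for page in doc)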

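# Step 2: Extracting tables (use pdfplumber for tables)
def extract_tables_from_pdf(pdf_path):
    """Return one entry per page, each entry being a list of that page's tables."""
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Each table is a list of rows; each row is a list of cell values
            tables.append(page.extract_tables())
    return tables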
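
The extracted tables are nested lists, so they need to be turned into text before they can be chunked and embedded alongside the prose. The following is a minimal sketch of one way to do that, not part of the original notebook: `table_to_text` is a hypothetical helper, the pipe separator is an arbitrary choice, and `sample_table` is placeholder data shaped like one item returned by `page.extract_tables()`.

```python
def table_to_text(table):
    """Render one table (a list of rows, each a list of cells) as pipe-separated lines."""
    lines = []
    for row in table:
        # pdfplumber reports empty cells as None; render those as empty strings
        cells = ["" if cell is None else str(cell).strip() for cell in row]
        lines.append(" | ".join(cells))
    return "\n".join(lines)


# Placeholder table, for illustration only
sample_table = [
    ["Column A", "Column B"],
    ["row 1", None],
]
print(table_to_text(sample_table))
```

Each rendered table could then be treated as one additional chunk and passed to the embedding step.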

In [2]:
# Step 3: Chunking and Embedding Text
model = SentenceTransformer('all-MiniLM-L6-v2')

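def chunk_and_embed_text(text):
    # Simple paragraph-based chunking; drop empty or whitespace-only chunks
    chunks = [c.strip() for c in text.split("\n\n") if c.strip()]
    embeddings = model.encode(chunks)  # NumPy array with one row per chunk
    return chunks, embeddings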
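
Paragraph splitting can produce chunks of very uneven length, depending on how the PDF lays out its text. As an alternative, the sketch below splits text into fixed-size word windows that overlap, so a sentence cut at a boundary still appears whole in a neighbouring chunk. The function name `chunk_words` and the defaults `max_words=200` and `overlap=40` are assumptions chosen for illustration, not values from the original notebook.

```python
def chunk_words(text, max_words=200, overlap=40):
    """Split text into chunks of at most max_words words, sharing `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text
        if start + max_words >= len(words):
            break
    return chunks


print(chunk_words("a b c d e f g", max_words=4, overlap=2))
```

The resulting list can be passed to `model.encode` in the same way as the paragraph chunks.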

In [3]:
# Step 4: Store embeddings in FAISS for similarity search
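def create_faiss_index(embeddings):
    # FAISS expects a contiguous float32 array of shape (n_vectors, dim)
    embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
    index = faiss.IndexFlatL2(embeddings.shape[1])  # Exact (brute-force) L2 index
    index.add(embeddings)
    return index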
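
To make clear what the flat index does, here is a NumPy-only sketch of the same exhaustive search: it compares the query against every stored vector and keeps the `k` smallest squared Euclidean distances, which is the metric `IndexFlatL2` reports. The helper name `brute_force_l2_search` and the toy vectors are illustrative assumptions; FAISS itself is far more optimized.

```python
import numpy as np


def brute_force_l2_search(embeddings, query, k=5):
    """Return (squared distances, indices) of the k stored vectors nearest to query."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    # Squared Euclidean distance from the query to every stored vector
    dists = np.sum((embeddings - query) ** 2, axis=1)
    k = min(k, len(dists))
    order = np.argsort(dists)[:k]  # indices of the k smallest distances
    return dists[order], order


# Toy 2-D vectors, for illustration only
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
distances, indices = brute_force_l2_search(vectors, [0.9, 0.0], k=2)
print(indices, distances)
```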

In [4]:
# Step 5: Query Processing
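def search_query(query, index, chunks, k=5):
    if index.ntotal == 0:
        return []  # Nothing has been indexed yet
    query_embedding = model.encode([query])
    k = min(k, index.ntotal)  # Never request more neighbours than stored vectors
    D, I = index.search(query_embedding, k)  # D: distances, I: chunk indices
    # FAISS marks missing results with -1, which would otherwise select chunks[-1]
    result_chunks = [chunks[i] for i in I[0] if i != -1]
    return result_chunks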

## Example Usage

Below is an example of how to use the functions defined in this notebook.

1. Extract text and tables from a PDF file.
2. Chunk and embed the text.
3. Store the embeddings in a FAISS index.
4. Query the system and retrieve relevant chunks.

### Example:
```python
pdf_path = "path_to_pdf.pdf"
pdf_text = extract_text_from_pdf(pdf_path)
pdf_tables = extract_tables_from_pdf(pdf_path)  # One list of tables per page
pdf_chunks, pdf_embeddings = chunk_and_embed_text(pdf_text)

# Create FAISS index and add embeddings
index = create_faiss_index(np.array(pdf_embeddings))

# Query the system
user_query = "What is the unemployment rate for people with a high school degree?"
retrieved_chunks = search_query(user_query, index, pdf_chunks)

# Print the retrieved chunks for review
print(retrieved_chunks)
```
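
The retrieved chunks are the input to the final answering step. The sketch below shows one way to assemble them into a prompt for an LLM. It is not part of the original notebook: `build_prompt`, the instruction wording, and the `max_chars` budget are assumptions, and the call to an actual model is deliberately left out because it depends on the provider used.

```python
def build_prompt(question, chunks, max_chars=4000):
    """Combine retrieved chunks and the user's question into a single prompt string."""
    context_parts = []
    used = 0
    for n, chunk in enumerate(chunks, start=1):
        entry = f"[{n}] {chunk.strip()}"
        # Stop adding context once the (assumed) character budget is exhausted
        if used + len(entry) > max_chars:
            break
        context_parts.append(entry)
        used += len(entry)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder chunks standing in for the output of search_query
print(build_prompt("What does the document say?", ["first chunk", "second chunk"]))
```

Numbering the chunks makes it possible to ask the model to cite which passage supports its answer.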
