# PDF RAG with context highlighting

This notebook demonstrates retrieval-augmented generation (RAG) over a PDF, with the retrieved context highlighted in the source document.

## Environment setup

Install and import dependencies.

In [None]:
!pip install pymupdf
!pip install langchain
!pip install transformers
!pip install sentence-transformers
!pip install chromadb

In [1]:
# PDF tools (PyMuPDF is imported as fitz)
import fitz

# Langchain modules
from langchain.chains import RetrievalQA
from langchain.llms.huggingface_pipeline import HuggingFacePipeline

from langchain.embeddings.huggingface import HuggingFaceEmbeddings

from langchain.schema.document import Document

from langchain.vectorstores import Chroma

## Load the context document

Set `doc_path` to the path of the source PDF.

In [2]:
doc_path = "insurance_sample.pdf"

In [3]:
# Open the PDF with PyMuPDF
doc = fitz.open(doc_path)

# Metadata values can be None, which the vector store may reject, so fall back to ""
author = doc.metadata.get('author') or ""

# Extract the text block by block; each block is
# (x0, y0, x1, y1, text, block_no, block_type)
texts = []
for page in doc:
    for block in page.get_text("blocks"):
        text = Document(
            page_content=block[4],
            metadata={
                'source': doc_path,
                'page': page.number,
                'author': author
            }
        )
        texts.append(text)

# Embed the extracted text and store in a ChromaDB
embeddings = HuggingFaceEmbeddings(model_name="intfloat/e5-large")
retriever = Chroma.from_documents(texts, embeddings).as_retriever(search_kwargs={'k':3})
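
The E5 model card recommends prefixing inputs with `query: ` for questions and `passage: ` for indexed text, while the cell above embeds the raw block text. The following is a minimal sketch of a wrapper that adds those prefixes. It assumes only that the wrapped object exposes `embed_documents` and `embed_query`, as `HuggingFaceEmbeddings` does. `PrefixedEmbeddings` is a hypothetical name and not part of LangChain.

```python
class PrefixedEmbeddings:
    """Hypothetical wrapper that adds E5-style prefixes before delegating."""

    def __init__(self, base, query_prefix="query: ", passage_prefix="passage: "):
        # `base` is any object with embed_documents(list[str]) and embed_query(str)
        self.base = base
        self.query_prefix = query_prefix
        self.passage_prefix = passage_prefix

    def embed_documents(self, texts):
        # Indexed text is marked as a passage
        return self.base.embed_documents([self.passage_prefix + t for t in texts])

    def embed_query(self, text):
        # Questions are marked as a query
        return self.base.embed_query(self.query_prefix + text)
```

One way to try it would be `PrefixedEmbeddings(HuggingFaceEmbeddings(model_name="intfloat/e5-large"))` in place of `embeddings`. Whether `Chroma.from_documents` accepts a duck-typed object like this may depend on the installed LangChain version, so treat it as an assumption to verify.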

## Prepare the LLM pipeline

Set `model_name` to an open LLM from the Hugging Face leaderboards. I am using `facebook/opt-iml-max-1.3b`, the top ~1B model at the time of writing, because of hardware limitations.

Generally, 7B or 13B models are better suited to RAG. This is only a demonstration, so expect the response from a model this small to be weak.

In [4]:
model_name="facebook/opt-iml-max-1.3b"

In [5]:
model = HuggingFacePipeline.from_model_id(
    model_id=model_name,
    task="text-generation",
    pipeline_kwargs={
        "max_length":2048
    }
)

chain = RetrievalQA.from_chain_type(
    llm=model,
    retriever=retriever,
    return_source_documents=True
)
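
For decoder-only models, `max_length` counts the prompt tokens plus the generated tokens, so a budget of 2048 lets a small model keep generating until the context window is full. The sketch below builds an alternative `pipeline_kwargs` dictionary, assuming the values are forwarded to Hugging Face generation. `max_new_tokens`, `repetition_penalty` and `no_repeat_ngram_size` are standard generation parameters, but `build_pipeline_kwargs` is a hypothetical helper and the default values are untuned guesses.

```python
def build_pipeline_kwargs(max_new_tokens=256, repetition_penalty=1.2,
                          no_repeat_ngram_size=3):
    """Hypothetical helper returning generation settings that discourage loops."""
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    return {
        # Cap only the generated text, not prompt + generated text
        "max_new_tokens": max_new_tokens,
        # Values above 1.0 penalise tokens that were already produced
        "repetition_penalty": repetition_penalty,
        # Forbid repeating any n-gram of this size
        "no_repeat_ngram_size": no_repeat_ngram_size,
    }


pipeline_kwargs = build_pipeline_kwargs()
```

The resulting dictionary could be passed as `pipeline_kwargs=pipeline_kwargs` to `HuggingFacePipeline.from_model_id` in place of the `max_length` setting.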

## Run the pipeline

Running the chain returns a dictionary containing the LLM's answer under `result` and, because `return_source_documents=True`, the retrieved source blocks under `source_documents`, which we can use for context highlighting.
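
Before highlighting, it can help to check which blocks were retrieved. The sketch below formats each source as a rank, a page label and a short snippet. It assumes each source has `page_content` and a `metadata` dictionary with the `page` key stored when the document was loaded. `summarize_sources` is a hypothetical helper. Since `page.number` is zero-based, the label adds 1 for display.

```python
def summarize_sources(source_documents, width=80):
    """Hypothetical helper: one line per retrieved block (rank, page, snippet)."""
    lines = []
    for rank, src in enumerate(source_documents, start=1):
        page = src.metadata.get("page")
        # PyMuPDF page numbers are zero-based; show them one-based
        label = f"page {page + 1}" if isinstance(page, int) else "page ?"
        # Collapse newlines and runs of spaces into single spaces
        snippet = " ".join(src.page_content.split())
        if len(snippet) > width:
            snippet = snippet[: width - 3] + "..."
        lines.append(f"{rank}. {label}: {snippet}")
    return lines
```

Once the chain has run, `print("\n".join(summarize_sources(llm_response['source_documents'])))` would list the retrieved blocks.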

In [6]:
question = "In what cases would this policy be terminated?"

In [7]:
llm_response = chain(question)
# Display the LLM's answer
llm_response['result']

' (a) nonpayment of premium; (b) nonrenewal by us; (c) nonrenewal by you; (d) nonrenewal by us; (e) nonrenewal by you; (f) nonrenewal by us; (g) nonrenewal by you; (h) intentional loss; (i) intentional loss; (j) nonrenewal by you; (k) nonrenewal by you; (l) nonrenewal by you; (m) nonrenewal by you; (n) nonrenewal by you; (o) nonrenewal by you; (p) nonrenewal by you; (q) nonrenewal by you; (r) nonrenewal by you; (s) nonrenewal by you; (t) nonrenewal by you; (u) nonrenewal by you; (v) nonrenewal by you; (w) nonrenewal by you; (x) nonrenewal by you; (y) nonrenewal by you; (z) nonrenewal by you; (2) nonrenewal by you; (3) nonrenewal by you; (4) nonrenewal by you; (5) nonrenewal by you; (6) nonrenewal by you; (7) nonrenewal by you; (8) nonrenewal by you; (9) nonrenewal by you; (10) nonrenewal by you; (11) nonrenewal by you; (12) nonrenewal by you; (13) nonrenewal by you; (14) nonrenewal by you; (15) nonrenewal by you; (16) nonrenewal by you; (17) nonrenewal by you; (18) nonrenewal by you; (

## Build a highlighted results PDF

We use the PyMuPDF library (imported as `fitz`) to highlight the source document based on the blocks in `llm_response['source_documents']`. Set `report_path` to the destination of the PDF report.

In [8]:
report_path = 'report.pdf'

In [10]:
# Collect the text of every retrieved source block
sources = llm_response['source_documents']
source_texts = [source.page_content for source in sources]

for page in doc:
    blocks = page.get_text("blocks")
    highlight_quads = []
    for block in blocks:
        if block[4] in source_texts:
            # Block tuples start with the bounding box (x0, y0, x1, y1)
            x0, y0, x1, y1 = block[:4]
            # Quad corners: upper-left, upper-right, lower-left, lower-right
            quad = fitz.Quad(
                (x0, y0),
                (x1, y0),
                (x0, y1),
                (x1, y1)
            )
            highlight_quads.append(quad)
    
    if highlight_quads:
        page.add_highlight_annot(highlight_quads)

doc.save(report_path)
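
The loop above matches on block text alone, so a retrieved block whose text also appears on other pages (a repeated header, for instance) is highlighted everywhere it occurs. The sketch below matches on the page number as well, using the `page` metadata stored when the document was loaded. `select_blocks` is a hypothetical helper. It assumes blocks in the `get_text("blocks")` layout `(x0, y0, x1, y1, text, block_no, block_type)` and sources with `page_content` and `metadata`.

```python
def select_blocks(blocks, page_number, sources):
    """Hypothetical helper: bounding boxes of blocks retrieved from this page."""
    # Pair each retrieved text with the page it was extracted from
    wanted = {(s.metadata.get("page"), s.page_content) for s in sources}
    # Keep only the bounding box (x0, y0, x1, y1) of matching blocks
    return [tuple(b[:4]) for b in blocks if (page_number, b[4]) in wanted]
```

Inside the page loop, each tuple returned by `select_blocks(page.get_text("blocks"), page.number, sources)` can be turned into a quad with `fitz.Rect(box).quad` and collected for `add_highlight_annot`.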