<a href="https://colab.research.google.com/github/Mohammadhsiavash/DeepL-Training/blob/main/AI%20Agents%20%2B%20Automation/03_PDF_Summarizer_with_Agents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Create an agent that:
1. Loads a PDF file
2. Extracts and chunks the content
3. Summarizes each chunk intelligently
4. Produces a full summary of the document

# Install Required Libraries

In [1]:
# PyMuPDF provides the `pymupdf` module used below. The `fitz` package on PyPI
# is an unrelated project (it pulls in nipype and pyxnat), so it is not installed here.
!pip install PyMuPDF transformers

# Define PDF Reader Tool

In [3]:
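import pymupdf as fitz  # PyMuPDF

def extract_text_from_pdf(path):
    """Return the plain text of every page in the PDF at `path`."""
    # Open as a context manager so the file handle is released afterwards.
    with fitz.open(path) as doc:
        # Join once instead of growing a string page by page.
        return "".join(page.get_text() for page in doc)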
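
Text extracted from a PDF usually keeps the page's hard line breaks, so a word hyphenated at the end of a line comes out split in two. The helper below is a minimal sketch of one way to rejoin such words before chunking. The name `dehyphenate` is hypothetical and not part of the original notebook, and the rule is deliberately naive: it also merges genuine compounds that happen to break at a line end (for example `well-` / `known`).

```python
import re

def dehyphenate(text):
    """Rejoin words split across a line break, e.g. 'sum-\\nmary' -> 'summary'."""
    # A word character, a hyphen, a newline, then another word character:
    # drop the hyphen and the newline, keep both characters.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

sample = "This is a sum-\nmary of the docu-\nment."
print(dehyphenate(sample))
```

It could be applied to the result of `extract_text_from_pdf` before the text is chunked.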

# Download a sample PDF

In [4]:
!wget https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf -O dummy.pdf

--2025-08-30 17:08:27--  https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf
Resolving www.w3.org (www.w3.org)... 104.18.22.19, 104.18.23.19, 2606:4700::6812:1713, ...
Connecting to www.w3.org (www.w3.org)|104.18.22.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13264 (13K) [application/pdf]
Saving to: ‘dummy.pdf’


2025-08-30 17:08:28 (83.1 MB/s) - ‘dummy.pdf’ saved [13264/13264]



# Chunk the extracted content

In [5]:
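def chunk_text(text, chunk_size=500):
    """Split text into chunks of whole words, closing each chunk once it reaches `chunk_size` characters."""
    chunks = []
    current_chunk = []
    current_len = 0  # length of " ".join(current_chunk)
    for word in text.split():
        # Count one separating space before every word except the first.
        current_len += len(word) + (1 if current_chunk else 0)
        current_chunk.append(word)
        # Close the chunk as soon as it reaches the target size.
        if current_len >= chunk_size:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_len = 0
    # Keep any leftover words as a final, shorter chunk.
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks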
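
Because `chunk_text` splits on words, a chunk can end in the middle of a sentence, which may hurt the quality of that chunk's summary. The sketch below shows one possible alternative that only breaks between sentences. The function name `chunk_by_sentences` and the simple punctuation-based sentence split are assumptions made for illustration, not part of the original notebook. A single sentence longer than `chunk_size` is kept whole as its own oversized chunk.

```python
import re

def chunk_by_sentences(text, chunk_size=500):
    """Group whole sentences into chunks of at most about chunk_size characters."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        # Account for the joining space when the chunk already has content.
        added = len(sentence) + (1 if current else 0)
        # Start a new chunk if this sentence would push the current one over the limit.
        if current and current_len + added > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
            added = len(sentence)
        current.append(sentence)
        current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks

print(chunk_by_sentences("One two. Three four. Five six.", chunk_size=20))
```

With a limit of 20 characters, the first two sentences fit together and the third starts a new chunk.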

# Load a summarization model

In [8]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

Device set to use cpu


# Summarize Each Chunk

In [15]:
def summarize_chunks(chunks, summarizer):
    """Summarizes each chunk of text."""
    summaries = []
    for chunk in chunks:
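        # Cap the summary at 150 tokens with a floor of 30. The word count is only a
        # rough proxy for the token count that max_length measures, so a very short
        # chunk can still make the pipeline warn that max_length exceeds the input length.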
        max_len = min(150, max(30, len(chunk.split())))
        summary = summarizer(chunk, max_length=max_len, min_length=10, do_sample=False)[0]['summary_text']
        summaries.append(summary)
    return summaries

# Example usage (assumes `summarizer` is loaded and dummy.pdf has been downloaded)
# text = extract_text_from_pdf('dummy.pdf')
# chunks = chunk_text(text)
# chunk_summaries = summarize_chunks(chunks, summarizer)
# print(chunk_summaries)
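
The summarization pipeline logs a warning when `max_length` is larger than the tokenized input, which happens for very short chunks because of the fixed floor of 30. One way to avoid that is to derive both bounds from the token count of the chunk. The sketch below is a hedged illustration: `pick_length_bounds` and its `cap` and `floor` parameters are hypothetical names, and the commented-out lines show how it might be wired to the `summarizer` loaded earlier.

```python
def pick_length_bounds(n_input_tokens, cap=150, floor=10):
    """Return (max_length, min_length) for a chunk with n_input_tokens tokens."""
    # Never ask for more tokens than the input has, and never more than cap.
    max_len = max(2, min(cap, n_input_tokens))
    # min_length has to stay below max_length.
    min_len = min(floor, max_len - 1)
    return max_len, min_len

print(pick_length_bounds(6))
print(pick_length_bounds(400))

# Hypothetical use with the summarizer from this notebook (not executed here):
# n_tokens = len(summarizer.tokenizer(chunk)["input_ids"])
# max_len, min_len = pick_length_bounds(n_tokens)
# summarizer(chunk, max_length=max_len, min_length=min_len, do_sample=False)
```

Counting tokens with the pipeline's own tokenizer keeps the bound in the same unit that `max_length` uses, instead of approximating it with a word count.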

# Assemble Final Summary

In [16]:
def summarize_pdf(path):
    print("Extracting PDF text...")
    full_text = extract_text_from_pdf(path)
    print("Chunking text...")
    chunks = chunk_text(full_text)
    print("Generating summaries...")
    summaries = summarize_chunks(chunks, summarizer)  # uses the summarizer loaded above
    final_summary = "\n\n".join(summaries)
    return final_summary
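
`summarize_pdf` joins the chunk summaries as they are, so for a long document the "final" summary grows with the number of chunks. A common variation is to condense the joined summaries again until they fit a size budget. The following is a minimal sketch under stated assumptions: `reduce_summaries`, `max_chars`, and `max_rounds` are hypothetical names, and `summarize_fn` stands for any callable that maps a string to a shorter string.

```python
def reduce_summaries(summaries, summarize_fn, max_chars=1000, max_rounds=3):
    """Condense chunk summaries until they fit in max_chars or the rounds run out."""
    texts = list(summaries)
    for _ in range(max_rounds):
        combined = " ".join(texts)
        if len(combined) <= max_chars:
            return combined
        # Group neighbouring summaries into batches of roughly max_chars,
        # then summarize each batch on its own.
        batches, current = [], ""
        for text in texts:
            if current and len(current) + len(text) + 1 > max_chars:
                batches.append(current)
                current = text
            else:
                current = f"{current} {text}".strip()
        if current:
            batches.append(current)
        texts = [summarize_fn(batch) for batch in batches]
    # Give up after max_rounds and return whatever has been condensed so far.
    return " ".join(texts)

# Stand-in summarizer for demonstration only: keeps the first 5 characters.
fake_summarize = lambda text: text[:5]
print(reduce_summaries(["a" * 30, "b" * 30, "c" * 30], fake_summarize, max_chars=40))
```

With the notebook's model, `summarize_fn` could wrap the existing call, for example `lambda t: summarizer(t, max_length=150, min_length=10, do_sample=False)[0]['summary_text']`. The `max_rounds` limit guards against looping forever if the summarizer fails to shorten its input.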

# Run the Agent


In [17]:
pdf_path = "dummy.pdf"
summary = summarize_pdf(pdf_path)
print("\n📝 FINAL SUMMARY:\n")
print(summary)

Your max_length is set to 30, but your input_length is only 6. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=3)


Extracting PDF text...
Chunking text...
Generating summaries...

📝 FINAL SUMMARY:

Dummy PDF file. Dummy PDF files. Dummies.
