### 📖 Where We Are

**So far**, we've learned the fundamentals of data ingestion using simple text files. We covered:
- The basic structure of a LangChain `Document`.
- How to load single and multiple text files using `TextLoader` and `DirectoryLoader`.
- Why text splitting is crucial and how to use different splitters like `RecursiveCharacterTextSplitter`.

**In this notebook**, we'll tackle a much more common and complex data source: **PDF files**. We will explore different loaders, compare their strengths, and build a robust pipeline to handle the unique challenges that PDFs present.

## 1. Loading PDF Files

PDFs are one of the most common document formats, but they can be tricky. Unlike a plain `.txt` file, a PDF is a layout-oriented format that can contain text, images, tables, and multi-column pages. Extracting text accurately is the first major hurdle.

**Analogy**: Think of a `.txt` file as a simple typed letter, where all the text is in a clear, linear order. A PDF, on the other hand, is like a glossy magazine page. The text might be in different columns, wrapped around images, or have distinct headers and footers. A PDF loader is like a person whose job is to read that magazine page and transcribe just the text, ignoring the irrelevant parts and trying to keep the reading order logical.

LangChain offers several loaders for this task, each with its own strengths. We'll look at two commonly used ones: `PyPDFLoader` and `PyMuPDFLoader`.

In [1]:
# Import the necessary PDF loaders from the langchain_community library.
from langchain_community.document_loaders import (
    PyPDFLoader,    # Uses the `pypdf` library
    PyMuPDFLoader   # Uses the `PyMuPDF` library, which is generally faster
)

### Method 1: `PyPDFLoader`

This is the standard, easy-to-use loader for PDFs. It loads a PDF and splits it by page, creating one `Document` object for each page. It's a great starting point for simple, text-based PDFs.

In [2]:
print("1️⃣ PyPDFLoader")

# A try-except block is used to gracefully handle potential errors, like the file not being found.
try:
    # Initialize the loader with the path to your PDF file.
    pypdf_loader = PyPDFLoader("data/pdf/attention.pdf")

    # The .load() method reads and parses the PDF.
    # It returns a list where each element is a Document object representing a page.
    pypdf_docs = pypdf_loader.load()
    
    print(f"  Loaded {len(pypdf_docs)} pages")
    print(f"  Page 1 content: {pypdf_docs[0].page_content[:100]}...")
    
    # Each Document contains metadata, including the source file and the page number.
    print(f"  Metadata: {pypdf_docs[0].metadata}")

except Exception as e:
    print(f"  Error: {e}")

1️⃣ PyPDFLoader
  Loaded 15 pages
  Page 1 content: Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and...
  Metadata: {'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'data/pdf/attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}
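Note that `page` in this metadata is a 0-based index, while `page_label` holds the label printed in the PDF. As a minimal sketch, a small helper can turn the metadata into a readable source reference. `format_citation` is a hypothetical name introduced here, and the keys are assumed to match the output shown above:

```python
def format_citation(metadata: dict) -> str:
    """Build a short, human-readable source reference from loader metadata."""
    source = metadata.get("source", "unknown source")
    # `page` is a 0-based index; use it only when no `page_label` is present.
    page_index = metadata.get("page", 0)
    label = metadata.get("page_label") or str(page_index + 1)
    total = metadata.get("total_pages")
    suffix = f" of {total}" if total is not None else ""
    return f"{source}, page {label}{suffix}"


# Values copied from the PyPDFLoader output above (trimmed to the relevant keys).
sample_metadata = {
    "source": "data/pdf/attention.pdf",
    "total_pages": 15,
    "page": 0,
    "page_label": "1",
}
print(format_citation(sample_metadata))
```

With real documents, the call would be `format_citation(pypdf_docs[0].metadata)`.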


### Method 2: `PyMuPDFLoader`

This loader uses the `PyMuPDF` library, which is known for its speed. It often loads faster than `PyPDFLoader` and can sometimes extract text more cleanly. Like `PyPDFLoader`, it splits the PDF by page, creating one `Document` per page.

In [3]:
print("\n2️⃣ PyMuPDFLoader (Fast and accurate)")
try:
    # Initialize the loader, similar to PyPDFLoader.
    pymupdf_loader = PyMuPDFLoader("data/pdf/attention.pdf")
    pymupdf_docs = pymupdf_loader.load()
    
    print(f"  Loaded {len(pymupdf_docs)} pages")
    # PyMuPDFLoader adds fields such as `format` and `file_path` to the metadata.
    print(f"  Page 1 Metadata: {pymupdf_docs[0].metadata}")
except Exception as e:
    print(f"  Error: {e}")


2️⃣ PyMuPDFLoader (Fast and accurate)
  Loaded 15 pages
  Page 1 Metadata: {'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'source': 'data/pdf/attention.pdf', 'file_path': 'data/pdf/attention.pdf', 'total_pages': 15, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'trapped': '', 'modDate': 'D:20240410211143Z', 'creationDate': 'D:20240410211143Z', 'page': 0}
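To see exactly how the two loaders differ for this file, we can compare the metadata keys with set arithmetic. This is a small sketch that uses only the keys printed above; other PDFs or library versions may produce different keys:

```python
# Keys copied from the two metadata dictionaries printed above.
pypdf_keys = {
    "producer", "creator", "creationdate", "author", "keywords", "moddate",
    "ptex.fullbanner", "subject", "title", "trapped", "source",
    "total_pages", "page", "page_label",
}
pymupdf_keys = {
    "producer", "creator", "creationdate", "source", "file_path",
    "total_pages", "format", "title", "author", "subject", "keywords",
    "moddate", "trapped", "modDate", "creationDate", "page",
}

# Set difference shows which keys are unique to each loader.
print("Only in PyPDFLoader:  ", sorted(pypdf_keys - pymupdf_keys))
print("Only in PyMuPDFLoader:", sorted(pymupdf_keys - pypdf_keys))
```

With loaded documents, the same comparison works on `set(pypdf_docs[0].metadata)` and `set(pymupdf_docs[0].metadata)`.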


### 📊 PDF Loader Comparison

| Loader | Speed | Text Extraction | Dependencies | Use Case |
| :--- | :---: | :---: | :---: | :--- |
| **`PyPDFLoader`** | Standard | Good | `pypdf` | Simple, reliable option for standard text-based PDFs. |
| **`PyMuPDFLoader`** | **Fast** | **Excellent** | `PyMuPDF` | Best for performance and when dealing with more complex layouts. Recommended for most use cases. |

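The speed column is easy to check on your own files. Below is a minimal sketch of a timing helper. `time_call` and `fake_load` are hypothetical names introduced here, and the stand-in loader exists only so the snippet runs without a PDF on disk:

```python
import time
from typing import Callable, List, Tuple


def time_call(load_fn: Callable[[], List], repeats: int = 3) -> Tuple[float, int]:
    """Return (best wall-clock seconds, number of items returned) for load_fn."""
    best = float("inf")
    count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        result = load_fn()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)  # Keep the fastest run to reduce noise.
        count = len(result)
    return best, count


# Stand-in loader (an assumption for illustration, not a real PDF loader).
def fake_load() -> List[str]:
    return ["page text"] * 15


seconds, pages = time_call(fake_load)
print(f"{pages} pages in {seconds:.6f}s")

# With the loaders from this notebook, the calls would look like:
# time_call(PyPDFLoader("data/pdf/attention.pdf").load)
# time_call(PyMuPDFLoader("data/pdf/attention.pdf").load)
```

Results depend on the file, the machine, and the library versions, so treat any single measurement as a rough indication only.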
## 2. Handling PDF Challenges & Building a Processing Pipeline

Simply loading a PDF is often not enough. Raw extracted text can be messy and contain artifacts that will confuse a language model.

Common issues include:
- **Ligatures**: Characters like 'ﬁ' and 'ﬂ' that are actually single characters in the PDF but represent two letters ('fi', 'fl').
- **Excessive Whitespace**: Unnecessary newlines, spaces, and tabs that break the flow of sentences.
- **Headers/Footers**: Repetitive text on each page (e.g., "Page 5 of 12") that adds no value.
- **Hyphenation**: Words broken across lines with a hyphen (e.g., "perform-" on one line and "ance" on the next).

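To handle these issues, we add a post-processing step that cleans the text before it is split into chunks. The simple function below addresses the first two (ligatures and whitespace); headers, footers, and hyphenation need additional, document-specific rules.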

In [4]:
# Example of raw, messy text that might be extracted from a PDF
raw_pdf_text = """Company Financial Report


    The ﬁnancial performance for ﬁscal year 2024
    shows signiﬁcant growth in proﬁtability.
    
    Revenue increased by 25%.
    
The company's efﬁciency improved due to workﬂow
optimization.

Page 1 of 10
"""

# Define a simple cleaning function to address common issues
def clean_text(text):
    # 1. Replace ligatures with their standard two-character counterparts.
    text = text.replace("ﬁ", "fi").replace("ﬂ", "fl")
    
    # 2. Consolidate multiple whitespace characters into a single space.
    # .split() breaks the string by whitespace, and ' '.join() reassembles it with single spaces.
    text = " ".join(text.split())
    
    return text

cleaned = clean_text(raw_pdf_text)
print("BEFORE:")
# The repr() function shows the string with its raw formatting, including newlines (\n).
print(repr(raw_pdf_text))
print("\nAFTER:")
print(repr(cleaned))

BEFORE:
"Company Financial Report\n\n\n    The ﬁnancial performance for ﬁscal year 2024\n    shows signiﬁcant growth in proﬁtability.\n\n    Revenue increased by 25%.\n\nThe company's efﬁciency improved due to workﬂow\noptimization.\n\nPage 1 of 10\n"

AFTER:
"Company Financial Report The financial performance for fiscal year 2024 shows significant growth in profitability. Revenue increased by 25%. The company's efficiency improved due to workflow optimization. Page 1 of 10"

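The output above still contains the footer "Page 1 of 10", and the function does nothing about hyphenation. The sketch below shows one possible extension that covers those two items from the list of common issues. `clean_text_extended` is a hypothetical name, the footer format is an assumption that has to be adapted per document, and the hyphenation rule will also merge genuine compounds that happen to break across lines (for example "well-" / "known"):

```python
import re

# A small, non-exhaustive map of Unicode ligatures to plain letters.
LIGATURES = {
    "\ufb00": "ff",   # ff ligature
    "\ufb01": "fi",   # fi ligature
    "\ufb02": "fl",   # fl ligature
    "\ufb03": "ffi",  # ffi ligature
    "\ufb04": "ffl",  # ffl ligature
}

# Assumed footer format: a line that contains only "Page N of M".
FOOTER_PATTERN = re.compile(r"^[ \t]*Page \d+ of \d+[ \t]*$", flags=re.MULTILINE)


def clean_text_extended(text: str) -> str:
    # 1. Replace ligatures with their plain-letter equivalents.
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    # 2. Drop footer lines. This must run while newlines are still present.
    text = FOOTER_PATTERN.sub("", text)
    # 3. Re-join words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # 4. Collapse all remaining whitespace into single spaces.
    return " ".join(text.split())


sample = "The \ufb01nancial perform-\nance improved.\n\nPage 1 of 10\n"
print(repr(clean_text_extended(sample)))
```

The order matters: the footer and hyphenation rules rely on line breaks, so whitespace is collapsed last.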

### Building a `SmartPDFProcessor` Class

Instead of running these steps manually each time, we can encapsulate this logic into a reusable class. This makes our data ingestion pipeline cleaner, more robust, and easier to manage.

In [5]:
# Document and List are used for type hints.
from langchain_core.documents import Document
from typing import List
# We will use PyPDFLoader and a text splitter.
from langchain_community.document_loaders import PyPDFLoader
# Newer LangChain releases ship the splitters in the separate `langchain_text_splitters` package.
from langchain_text_splitters import RecursiveCharacterTextSplitter

class SmartPDFProcessor:
    """An advanced PDF processor that loads, cleans, and chunks PDFs, while enhancing metadata."""
    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 100):
        """Initializes the processor with a text splitter configuration."""
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            # After cleaning, the text contains no newlines, so a single space is the only useful separator.
            separators=[" "],
        )

    def process_pdf(self, pdf_path: str) -> List[Document]:
        """The main method to process a PDF file."""
        # 1. Load the PDF using PyPDFLoader.
        loader = PyPDFLoader(pdf_path)
        pages = loader.load()

        # This list will hold all the final, processed chunks.
        processed_chunks = []

        # 2. Iterate through each loaded page.
        for page_num, page in enumerate(pages):
            # 3. Clean the text content of the page.
            cleaned_text = self._clean_text(page.page_content)

            # 4. Skip pages that are nearly empty after cleaning to avoid creating useless chunks.
            if len(cleaned_text.strip()) < 50:
                continue

            # 5. Split the cleaned text into chunks. We use `create_documents` to keep metadata.
            chunks = self.text_splitter.create_documents(
                texts=[cleaned_text],
                # Enhance the metadata for each chunk.
                metadatas=[{
                    **page.metadata,  # Inherit original metadata from the page.
                    "page": page_num + 1,  # Overwrite the loader's 0-based index with a 1-based page number.
                    "total_pages": len(pages),
                    "chunk_method": "smart_pdf_processor",  # Add a custom tag.
                    "char_count": len(cleaned_text),  # Length of the whole cleaned page, not of this chunk.
                }]
            )
            
            # 6. Add the chunks from this page to our main list.
            processed_chunks.extend(chunks)

        return processed_chunks

    def _clean_text(self, text: str) -> str:
        """A private helper method to clean extracted text."""
        # Consolidate whitespace.
        text = " ".join(text.split())
        # Fix common PDF extraction issues (ligatures).
        text = text.replace("ﬁ", "fi").replace("ﬂ", "fl")
        return text

In [6]:
# Instantiate our custom processor.
processor = SmartPDFProcessor()

try:
    # Run the entire pipeline on our PDF file.
    smart_chunks = processor.process_pdf("data/pdf/attention.pdf")
    print(f"Processed into {len(smart_chunks)} smart chunks")

    # Inspect the metadata of the first chunk to see our enhancements.
    if smart_chunks:
        print("\nSample chunk metadata:")
        for key, value in smart_chunks[0].metadata.items():
            print(f"  {key}: {value}")

except Exception as e:
    print(f"Processing error: {e}")

Processed into 49 smart chunks

Sample chunk metadata:
  producer: pdfTeX-1.40.25
  creator: LaTeX with hyperref
  creationdate: 2024-04-10T21:11:43+00:00
  author: 
  keywords: 
  moddate: 2024-04-10T21:11:43+00:00
  ptex.fullbanner: This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5
  subject: 
  title: 
  trapped: /False
  source: data/pdf/attention.pdf
  total_pages: 15
  page: 1
  page_label: 1
  chunk_method: smart_pdf_processor
  char_count: 2857
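Because every chunk carries its page number, it is easy to check how the chunks are distributed across pages, and to spot pages that were skipped by the 50-character filter. This is a minimal sketch. `chunks_per_page` is a hypothetical helper, and the example metadata is made up so the snippet runs on its own:

```python
from collections import Counter
from typing import Dict, Iterable


def chunks_per_page(metadatas: Iterable[Dict]) -> Counter:
    """Count how many chunks came from each (1-based) page number."""
    return Counter(meta["page"] for meta in metadatas)


# Stand-in metadata (illustrative values, not taken from the real PDF).
example = [{"page": 1}, {"page": 1}, {"page": 2}, {"page": 4}]
counts = chunks_per_page(example)
print(dict(counts))

# Pages with no chunks show up as a count of 0.
print("Chunks from page 3:", counts[3])

# With the processor above, the call would be:
# chunks_per_page(chunk.metadata for chunk in smart_chunks)
```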


### 🔑 Key Takeaways

* **PDFs are Complex**: Unlike plain text, PDFs require specialized loaders. The text extraction can be imperfect, containing artifacts like ligatures and strange whitespace.
* **Choose the Right Loader**: `PyMuPDFLoader` is generally recommended for its speed and extraction quality, but `PyPDFLoader` is a solid, simple alternative.
* **Cleaning is Crucial**: Always inspect the raw text extracted from your documents. Even a simple cleaning function that handles whitespace and ligatures can noticeably improve the quality of your data.
* **Build a Pipeline**: For a robust RAG system, encapsulate your ingestion logic (load -> clean -> split) into a reusable class. This makes your process more reliable and manageable, and lets you easily add enhancements like custom metadata.