# Reading a PDF and PDF Page Chunking

In this tutorial, we'll learn how to:
1. Load and read PDF documents using LangChain
2. Extract text content from PDFs (see the sample PDF sources at the end of this notebook)
3. Understand how documents are automatically chunked
4. Inspect document metadata

This is a foundational skill for building RAG (Retrieval-Augmented Generation) systems and other document-processing applications.

## Step 1: Import Required Libraries

We'll use:
- `os`: For operating system interactions (file paths, environment variables)
- `PyMuPDFLoader`: LangChain's document loader that uses the PyMuPDF library to read PDF files

PyMuPDFLoader is particularly useful because it:
- Extracts text page by page
- Preserves page boundaries and page order
- Automatically creates Document objects with metadata

In [4]:
# import libraries
import os
from langchain_community.document_loaders import PyMuPDFLoader

## Step 2: Specify and Validate PDF File Path

Set the path to your PDF file and validate it exists. Make sure to:
- Use the correct file path for your system
- Check that the file exists at the specified location
- Use forward slashes (`/`) or escaped backslashes (`\\`) for Windows paths

**Note:** Replace this path with your actual PDF file location.

In [5]:
# Path to the uploaded PDF (replace with your actual file path)
pdf_path = "./data/41598_2020_Article_64454.pdf"

# Validate that the file exists
if os.path.exists(pdf_path):
    print(f"✅ PDF file found: {pdf_path}")
    print(f"File size: {os.path.getsize(pdf_path)} bytes")
else:
    print(f"❌ PDF file not found: {pdf_path}")
    print("Please check the file path and ensure the file exists.")
    # You might want to exit here or provide alternative file suggestions

✅ PDF file found: ./data/41598_2020_Article_64454.pdf
File size: 2240101 bytes
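
The same check can also be written with `pathlib`, which handles path separators consistently across operating systems. The sketch below is one possible variant rather than part of the original notebook: the helper name `resolve_pdf_path` is hypothetical, and raising an exception instead of printing a message is an assumption about how you want a missing file to be handled.

```python
from pathlib import Path


def resolve_pdf_path(raw_path):
    """Return an absolute Path to an existing PDF file, or raise an error.

    Hypothetical helper for illustration; it is not part of LangChain.
    """
    path = Path(raw_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"PDF file not found: {path}")
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name}")
    return path
```

If you adopt this variant, you could pass `str(resolve_pdf_path(pdf_path))` to the loader so that a wrong path fails early with a clear message.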


## Step 3: Define a Function to Load and Extract Text from PDF

We'll create a reusable function that:
1. Takes a PDF file path as input
2. Uses PyMuPDFLoader to load the PDF
3. Converts the PDF into LangChain's document format
4. Returns the document chunks for further processing

The loader automatically splits the PDF by page, creating one Document object per page.

In [6]:
def load_pdf_with_langchain(pdf_path):
    """Load a PDF with PyMuPDFLoader and return one Document per page."""
    # Use LangChain's built-in loader
    loader = PyMuPDFLoader(pdf_path)

    # Load the PDF into LangChain's document format
    documents = loader.load()

    print(f"Successfully loaded {len(documents)} document chunks from the PDF.")
    return documents

## Step 4: Extract the Document Chunks

Now we'll call our function to load the PDF and extract its content. The loader will:
1. Read the PDF file
2. Extract text from each page
3. Create Document objects with page content and metadata
4. Return a list of all document chunks (one per page)

In [8]:
# Extract the document chunks
docs = load_pdf_with_langchain(pdf_path)

Successfully loaded 13 document chunks from the PDF.
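
Because the loader produces one chunk per page, the number of chunks should equal the page count the PDF reports about itself. The metadata printed in Step 5 includes a `total_pages` entry, so a quick sanity check might look like the sketch below. The function name `pages_match_total` is hypothetical, and the stand-in objects are there only so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def pages_match_total(docs):
    """Return True if the number of chunks equals the reported page count.

    Assumes each item has a `metadata` dict with a `total_pages` key,
    as seen in the metadata printed by this notebook.
    """
    if not docs:
        return False
    return len(docs) == docs[0].metadata.get("total_pages")


# Stand-in objects so the sketch runs without a PDF; in the notebook,
# pass the `docs` list returned by load_pdf_with_langchain instead.
fake_docs = [
    SimpleNamespace(page_content="", metadata={"total_pages": 3})
    for _ in range(3)
]
print(pages_match_total(fake_docs))  # True
```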


## Step 5: Inspect the Extracted Content

Let's examine the first couple of chunks to understand what we extracted. For each chunk, we'll look at:
- The chunk number
- The first 500 characters of text content
- The metadata (includes information like page number, source file, etc.)

This helps us verify that the extraction worked correctly and understand the structure of our documents.

In [9]:
# Let's view the first couple of chunks to see what we got
print("\n Sample Extracted Content:")
for i, doc in enumerate(docs[:2]):
    print(f"\n--- Chunk {i + 1} ---")
    print(doc.page_content[:500])  # Show first 500 characters
    print("Metadata:", doc.metadata)


 Sample Extracted Content:

--- Chunk 1 ---
1
Scientific Reports |         (2020) 10:7483  | https://doi.org/10.1038/s41598-020-64454-x
www.nature.com/scientificreports
Inhibitory action of 
phenothiazinium dyes against 
Neospora caninum
Luiz Miguel Pereira1,2, Caroline Martins Mota   3, Luciana Baroni1,  
Cássia Mariana Bronzon da Costa1, Jade Cabestre Venancio Brochi1, Mark Wainwright4, 
Tiago Wilson Patriarca Mineo   3, Gilberto Úbida Leite Braga1 & Ana Patrícia Yatsuda1,2 ✉
Neospora caninum is an Apicomplexan parasite related to impor
Metadata: {'producer': 'iText® 5.3.5 ©2000-2012 1T3XT BVBA (SPRINGER SBM; licensed version)', 'creator': 'Springer', 'creationdate': '2020-04-25T03:23:43+05:30', 'source': './data/41598_2020_Article_64454.pdf', 'file_path': './data/41598_2020_Article_64454.pdf', 'total_pages': 13, 'format': 'PDF 1.4', 'title': 'Inhibitory action of phenothiazinium dyes against Neospora caninum', 'author': 'Luiz Miguel Pereira', 'subject': 'Scientific Reports, doi:10.
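
The full metadata dictionary is long, so it can help to print only a few fields per chunk. The following is a minimal sketch under stated assumptions: `summarize_chunks` is a hypothetical helper, the default keys `page` and `source` are the ones this notebook describes, and the stand-in chunks replace real `Document` objects so the snippet runs on its own.

```python
from types import SimpleNamespace


def summarize_chunks(docs, keys=("page", "source")):
    """Return one dict per chunk: character count plus selected metadata keys."""
    summary = []
    for doc in docs:
        row = {"chars": len(doc.page_content)}
        for key in keys:
            row[key] = doc.metadata.get(key)  # None if the key is absent
        summary.append(row)
    return summary


# Stand-in chunks for illustration; in the notebook, pass `docs` instead.
sample = [
    SimpleNamespace(page_content="First page text", metadata={"page": 0, "source": "a.pdf"}),
    SimpleNamespace(page_content="Second page", metadata={"page": 1, "source": "a.pdf"}),
]
for row in summarize_chunks(sample):
    print(row)
```

A very small `chars` value for a page is often a hint that the page is mostly images or that text extraction found little to read.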

## Observation and Takeaways

### What did we achieve?

✅ **Successfully loaded a PDF** using LangChain's PyMuPDFLoader, which is powered by the PyMuPDF engine under the hood.

✅ **The loader extracted the text page by page** and wrapped each page in a `Document` object with metadata (such as the page number and source file path).

### Key Points to Remember:

1. **Page-based Chunking**: PyMuPDFLoader automatically splits the PDF by page, creating one Document object per page

2. **Metadata Preservation**: Each chunk includes valuable metadata:
   - `page`: The page number (0-indexed)
   - `source`: The original file path
   - Other PDF-specific information

3. **Document Structure**: Each Document has:
   - `page_content`: The actual text content
   - `metadata`: Dictionary with additional information

4. **Next Steps**: For RAG applications, you might want to:
   - Split large pages into smaller chunks
   - Create embeddings for semantic search
   - Store in a vector database
   - Implement retrieval mechanisms
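
To make the first of these next steps concrete, here is a minimal sketch of splitting one page's text into smaller, overlapping chunks. It is a plain-Python stand-in so that it runs without extra dependencies; LangChain ships its own text splitters (for example `RecursiveCharacterTextSplitter`), which are the more usual choice in practice. The function name `split_text` and the default sizes are assumptions chosen for illustration.

```python
def split_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character windows that overlap.

    A plain-Python stand-in for illustration; it is not LangChain's splitter
    and it ignores sentence and paragraph boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(split_text("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap repeats a little text between neighbouring chunks so that a sentence cut at a boundary still appears whole in at least one of them.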

### Use Cases:
- Building RAG (Retrieval Augmented Generation) systems
- Document question-answering applications
- Semantic search over PDF documents
- Information extraction from PDF libraries

## Medical PDF Documents for Use in Testing PDF Chunking:
The following sources offer medical PDFs that are free and legal to use with this tutorial:

**Government Health Organizations:**
- **CDC (Centers for Disease Control)**: [cdc.gov](https://cdc.gov) - Tons of health reports, guidelines, and fact sheets
- **NIH (National Institutes of Health)**: [nih.gov](https://nih.gov) - Research papers and health information
- **WHO (World Health Organization)**: [who.int](https://who.int) - Global health reports and guidelines
- **FDA**: [fda.gov](https://fda.gov) - Drug information, medical device reports

**Open Access Medical Journals:**
- **PubMed Central**: [ncbi.nlm.nih.gov/pmc](https://ncbi.nlm.nih.gov/pmc) - Free full-text medical research papers
- **PLOS Medicine**: [journals.plos.org/plosmedicine](https://journals.plos.org/plosmedicine) - Open access medical journal
- **BMC Medicine**: [biomedcentral.com](https://biomedcentral.com) - Open access articles

**Medical Education Resources:**
- **MedlinePlus**: [medlineplus.gov](https://medlineplus.gov) - Patient education materials
- **OpenStax**: [openstax.org](https://openstax.org) - Free textbooks (such as Anatomy and Physiology)

**Specific Examples You Could Use:**
- CDC COVID-19 guidelines
- WHO disease outbreak reports
- Medical research papers from PubMed Central
- Clinical practice guidelines
- Public health reports

**Quick tip:** In a web search engine such as Google, add `filetype:pdf` to your query (optionally with a `site:` filter such as `site:cdc.gov`) to find PDF documents specifically.


## Optional: Check Environment Variables (Warning: May Show Sensitive Data)

⚠️ **Warning**: This step will display ALL environment variables, which may include sensitive information like API keys, passwords, or personal paths. 

**Only run this if:**
- You're in a safe environment
- You want to debug environment setup issues
- You understand the security implications

This can be useful for:
- Verifying API keys are loaded
- Checking Python paths
- Debugging configuration issues
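
A more cautious way to run this check is to mask any value whose variable name suggests a secret. The sketch below makes several assumptions: `masked_environment` and `SENSITIVE_HINTS` are hypothetical names, and the list of hints is only a starting point. Variables whose names do not match a hint are still printed in full, so the warning above continues to apply.

```python
import os

# Substrings that suggest a secret; this list is an assumption, extend as needed.
SENSITIVE_HINTS = ("KEY", "TOKEN", "SECRET", "PASSWORD")


def masked_environment(env=None):
    """Return a copy of the environment with likely-secret values masked."""
    env = os.environ if env is None else env
    masked = {}
    for name, value in env.items():
        if any(hint in name.upper() for hint in SENSITIVE_HINTS):
            # Show only whether the variable is set, never its value
            masked[name] = "***" if value else "(empty)"
        else:
            masked[name] = value
    return masked


for name, value in sorted(masked_environment().items()):
    print(f"{name}={value}")
```

This is usually enough to confirm that an API key is loaded without writing the key itself into the notebook output.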