# Schedule 2 Processor Example with PDFReader

This notebook demonstrates how to:
1. Detect Schedule 2 PDFs using the router
2. Process them with Schedule2Processor (using the free PDFReader)
3. Generate LlamaIndex documents for RAG
4. Save extracted data for reference
5. Build and save a FAISS index

## Setup: Import Dependencies

In [1]:
from pathlib import Path
from ingest.processors.pdf.schedule2 import Schedule2Processor
from ingest.processors.pdf import PDFRouter
from ingest.indexer import IndexBuilder



## Step 1: Find Schedule 2 PDF

Search for Schedule 2 PDF files in the data directory.

In [2]:
import os

# Print current working directory for debugging
print(f"Current working directory: {os.getcwd()}")

# Define the PDF directory
pdf_dir = Path(
    "data/raw/official-documents/3-ato-software-developers-portal/superstream-standard"
)

# Check if directory exists
if not pdf_dir.exists():
    print(f"Directory not found: {pdf_dir}")
    print("Checking parent directories...")
    
    # Try to find the data directory by going up
    for i in range(5):
        test_path = Path("../" * i) / pdf_dir
        if test_path.exists():
            pdf_dir = test_path
            print(f"Found at: {pdf_dir}")
            break

# Find Schedule 2 files (recursive search)
schedule2_files = list(pdf_dir.glob("**/*Schedule*2*Terms*.pdf"))

print(f"Found {len(schedule2_files)} Schedule 2 files")

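if not schedule2_files:
    print("No Schedule 2 files found")
    # List the PDFs that are present to help diagnose a pattern mismatch
    if pdf_dir.exists():
        print(f"Contents of {pdf_dir}:")
        for item in pdf_dir.rglob("*.pdf"):
            print(f"  - {item}")
    # Stop here: every later cell depends on pdf_path
    raise FileNotFoundError(f"No Schedule 2 PDF found under {pdf_dir}")

# Use the first match
pdf_path = schedule2_files[0]
print(f"Found PDF: {pdf_path.name}")
print(f"Full path: {pdf_path.absolute()}")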

Current working directory: C:\Users\hy120\Downloads\AI project\SuperStream-RAG\ingest\processors\pdf\schedule2
Directory not found: data\raw\official-documents\3-ato-software-developers-portal\superstream-standard
Checking parent directories...
Found at: ..\..\..\..\data\raw\official-documents\3-ato-software-developers-portal\superstream-standard
Found 1 Schedule 2 files
Found PDF: Schedule_2_Terms_and_Definitions_v2.1.pdf
Full path: C:\Users\hy120\Downloads\AI project\SuperStream-RAG\ingest\processors\pdf\schedule2\..\..\..\..\data\raw\official-documents\3-ato-software-developers-portal\superstream-standard\schedules\Schedule_2_Terms_and_Definitions_v2.1.pdf
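
The fallback loop in this cell tries the relative path from the working directory and then from up to four parent directories. The same idea can be packaged as a small helper. This is a minimal sketch: `find_upwards` and its parameters are hypothetical names and are not part of the `ingest` package.

```python
from pathlib import Path
from typing import Optional


def find_upwards(relative: Path, start: Path, max_levels: int = 5) -> Optional[Path]:
    """Return the first existing `ancestor / relative`.

    Checks `start` itself first, then its parents, for at most
    `max_levels` directories in total. Returns None if nothing matches.
    """
    current = start.resolve()
    for _ in range(max_levels):
        candidate = current / relative
        if candidate.exists():
            return candidate
        if current.parent == current:  # reached the filesystem root
            break
        current = current.parent
    return None
```

Called as `find_upwards(Path("data/raw/official-documents"), Path.cwd())`, it returns an absolute path, so the result no longer contains `..` segments like the "Full path" printed above.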


## Step 2: Route the PDF

Use the PDFRouter to analyze and route the PDF to the appropriate processor.

In [3]:
# Initialize router
router = PDFRouter()
plan = router.route(pdf_path)

if plan:
    print("Routing Result:")
    print(f"  Schedule Type: {plan.schedule_type}")
    print(f"  Processor Type: {plan.processor_type}")
    print(f"  Extractors: {plan.extractors}")
    print(f"  Description: {plan.description}")
else:
    print("Document type not recognized!")

Routing Result:
  Schedule Type: Schedule_2
  Processor Type: table
  Extractors: ['terminology']
  Description: Terms and Definitions - Simple glossary table
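
To make the routing result easier to interpret, here is a minimal sketch of rule-based routing on the filename. It is not the `PDFRouter` implementation, which this notebook does not show. `RoutePlan`, `RULES`, and `route` are hypothetical names. Only the four field names and the Schedule 2 values are taken from the output above.

```python
import re
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class RoutePlan:
    """Hypothetical stand-in mirroring the fields printed by the routing cell."""
    schedule_type: str
    processor_type: str
    extractors: List[str]
    description: str


# Hypothetical rule table: filename pattern -> plan.
# (?!\d) keeps "Schedule_2" from matching "Schedule_20".
RULES = [
    (
        re.compile(r"schedule[_\s-]*2(?!\d).*terms", re.IGNORECASE),
        RoutePlan(
            schedule_type="Schedule_2",
            processor_type="table",
            extractors=["terminology"],
            description="Terms and Definitions - Simple glossary table",
        ),
    ),
]


def route(pdf_path: Path) -> Optional[RoutePlan]:
    """Return the plan of the first rule whose pattern matches the filename."""
    for pattern, plan in RULES:
        if pattern.search(pdf_path.name):
            return plan
    return None  # unrecognized document type
```

Returning `None` for unknown files matches how the cell above handles the "Document type not recognized!" case.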


## Step 3: Process Schedule 2 with PDFReader

Extract the glossary terms from the PDF with the free PDFReader and create one LlamaIndex document per term.

In [4]:
# Initialize processor
processor = Schedule2Processor()

# Process the PDF
print("Processing PDF with PDFReader...")
result = processor.process(pdf_path)

# Access extracted data
raw_content = result["raw_content"]
terms = result["terms"]
documents = result["documents"]
metadata = result["metadata"]

print("Processing completed!")

Processing PDF with PDFReader...
Processing Schedule 2: Schedule_2_Terms_and_Definitions_v2.1.pdf
  🔍 Detecting version...

  📋 Version Detection:
     File: Schedule_2_Terms_and_Definitions_v2.1.pdf
     Version: v2.1
     Detected from: filename
     Confidence: 95.0%
     Processor: schedule2_v2
  Parsing with PDFReader...
  Successfully parsed 23 pages
  Extracting terms from content...
  Extracted 114 terms
  Creating LlamaIndex documents...
  Created 114 LlamaIndex documents
Processing completed!
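
The log reports that the version was detected from the filename and mapped to the processor `schedule2_v2`. One plausible way to do this is a regular expression over the filename. The sketch below is an assumption, not the processor's actual logic: `VERSION_RE`, `detect_version_from_filename`, and `processor_key` are hypothetical names, and deriving the processor key from the major version is a guess based on the single example in the log.

```python
import re
from typing import Optional

# Matches a version token such as "v2.1" or "V3" that is not glued to a
# preceding letter or digit (so "rev2" is not treated as a version).
VERSION_RE = re.compile(r"(?<![A-Za-z0-9])v(\d+(?:\.\d+)*)", re.IGNORECASE)


def detect_version_from_filename(filename: str) -> Optional[str]:
    """Return the version as 'v<major>[.<minor>...]', or None if absent."""
    match = VERSION_RE.search(filename)
    return f"v{match.group(1)}" if match else None


def processor_key(version: str) -> str:
    """Map a version string to a processor key using its major number (assumed rule)."""
    major = version.lstrip("vV").split(".")[0]
    return f"schedule2_v{major}"
```

A filename without a version token returns `None`, which is where a real implementation would presumably fall back to inspecting the PDF content.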


## Step 4: View Extraction Results

Display the metadata and extracted terms.

In [None]:
print("=" * 70)
print("EXTRACTION RESULTS")
print("=" * 70)

print("\nMetadata:")
for key, value in metadata.items():
    print(f"  {key}: {value}")

## Step 5: View Extracted Terms

Display the first 10 extracted terms.

In [None]:
print("\nExtracted Terms (first 10):")
for i, term in enumerate(terms[:10], 1):
    print(f"\n{i}. {term.term}")
    print(f"   Definition: {term.definition[:100]}...")
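
Slicing with `[:100]` and then appending `...` marks every definition as truncated, including those shorter than 100 characters, and can cut a word in half. As a hedged alternative, the standard library's `textwrap.shorten` adds the placeholder only when text is actually dropped. `preview` is a hypothetical helper name.

```python
import textwrap


def preview(text: str, width: int = 100) -> str:
    """Collapse whitespace and truncate at a word boundary.

    The "..." placeholder is appended only if the text exceeds `width`.
    """
    return textwrap.shorten(text, width=width, placeholder="...")
```

In the loop above, `preview(term.definition)` could replace `term.definition[:100]` together with the trailing `...`. Note that `shorten` also collapses runs of whitespace, including newlines, into single spaces.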

## Step 6: View LlamaIndex Documents

Display information about the generated documents for RAG.

In [None]:
print("LlamaIndex Documents:")
print(f"  Total documents: {len(documents)}")

if documents:
    print("\n  Sample document:")
    doc = documents[0]
    print(f"    ID: {doc.doc_id}")
    print(f"    Text: {doc.text[:100]}...")
    print(f"    Metadata: {doc.metadata}")

## Step 7: Save Extracted Data

In [None]:
# Save extracted data
output_dir = Path("data/processed/terminology")
processor.save_extracted_data(result, output_dir)

print(f"Extracted data saved to {output_dir}")
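
The notebook does not show what `save_extracted_data` writes. For illustration only, the sketch below serializes terms to a JSON file. `Term` is a hypothetical stand-in that carries just the two attributes the earlier cells read (`term` and `definition`), and `save_terms_json` and the `terms.json` filename are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Iterable


@dataclass
class Term:
    """Hypothetical stand-in exposing the attributes used in this notebook."""
    term: str
    definition: str


def save_terms_json(
    terms: Iterable[Term], output_dir: Path, filename: str = "terms.json"
) -> Path:
    """Write terms as a JSON array and return the path of the written file."""
    output_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    out_path = output_dir / filename
    payload = [asdict(t) for t in terms]
    out_path.write_text(
        json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return out_path
```

Writing with `ensure_ascii=False` and an explicit UTF-8 encoding keeps any non-ASCII characters in the definitions readable in the saved file.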

## Step 8: Build FAISS Index

Create a FAISS index for RAG (Retrieval-Augmented Generation).

In [None]:
# Build FAISS index
print("Building FAISS Index...")
index_builder = IndexBuilder()
index = index_builder.build_index(documents)

print("  Index built successfully!")

# Save index
index_dir = Path("data/indices/schedule2_index")
index.storage_context.persist(str(index_dir))
print(f"  Index saved to {index_dir}")

## Summary

The notebook has successfully:
- ✅ Located the Schedule 2 PDF
- ✅ Routed the document to the appropriate processor
- ✅ Extracted terms and documents using PDFReader
- ✅ Saved extracted data
- ✅ Built and saved a FAISS index for RAG