## Rule-Based Extraction Prototype Demo

This notebook demonstrates the end-to-end document processing pipeline:

Ingest → Text Extraction → Classification → Field Extraction → Structured Output

The goal is to evaluate rule-based extraction across multiple document types and to note where it works and where it breaks down on semi-structured PDFs.
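
To make the stage order concrete, here is a minimal sketch of how the five stages could be chained. It is not the project's `process_directory`. Every callable passed in is a hypothetical stand-in, and the argument order of the real `src` functions is an assumption. Only the record shape (`file_name`, `document_type`, `fields`) is taken from the results shown later in this notebook.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def run_pipeline(
    input_dir: Path,
    discover: Callable[[Path], Iterable[Path]],
    extract_text: Callable[[Path], str],
    classify: Callable[[str], str],
    extract_fields: Callable[[str, str], Dict[str, str]],
) -> List[dict]:
    """Chain the five stages; each callable is a hypothetical stand-in."""
    records = []
    for pdf_path in discover(input_dir):         # Ingest
        text = extract_text(pdf_path)            # Text extraction
        doc_type = classify(text)                # Classification
        fields = extract_fields(text, doc_type)  # Field extraction
        records.append({                         # Structured output
            "file_name": pdf_path.name,
            "document_type": doc_type,
            "fields": fields,
        })
    return records


# Toy stand-ins so the sketch runs without any PDFs on disk.
demo = run_pipeline(
    Path("data/samples2"),
    discover=lambda d: [d / "compliance_sample.pdf"],
    extract_text=lambda p: "Version: 2.1",
    classify=lambda t: "compliance",
    extract_fields=lambda t, kind: {"version": t.split(": ")[1]},
)
print(demo)
```

Passing the stages in as callables keeps the sketch independent of the real module signatures. In the notebook itself, `process_directory` plays this role.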

In [1]:
import os
from pathlib import Path

In [None]:
# Project root: the parent of the directory the notebook runs from
root_path = Path.cwd().parent

In [4]:
# Change the working directory to the project root so the `src` imports and relative data paths below resolve
os.chdir(root_path)

In [10]:
# Pipeline stages from the project's src package
from src.ingest import discover_documents
from src.extract import extract_text_from_pdf, extract_fields_by_type
from src.classifier import classify_document
from src.utils import save_json
from src.pipeline import process_directory

In [None]:
# Input directory with the sample PDFs and output file for the JSON results
input_dir = Path("data/samples2")
output_path = Path("results/results.json")

input_dir, output_path


(WindowsPath('data/samples2'), WindowsPath('results/results.json'))

### Pipeline

In [11]:
results = process_directory(input_dir)

save_json(results, output_path)

print(f"Extraction complete! Results saved at: {output_path}")

Extraction complete! Results saved at: results\results.json


### Results 

In [12]:
import json

with open(output_path, "r", encoding="utf-8") as f:
    data = json.load(f)

data

[{'file_name': 'compliance_sample.pdf',
  'document_type': 'compliance',
  'fields': {'document_name': 'DATA PRIVACY COMPLIANCE POLICY Version: 2.1 Effective Date: 01 January 2026 Issued by: Compliance Department Regulation: GDPR (General Data Protection Regulation) Scope: This policy applies to all employees and contractors handling personal data. Requirements: - Personal data must be encrypted - Access must be restricted to authorized personnel - Data breaches must be reported within 72 hours Review Cycle: Annual',
   'effective_date': '01',
   'version': '2.1',
   'regulatory_reference': 'GDPR (General Data Protection Regulation) Scope: This policy applies to all employees and contractors handling personal data. Requirements: - Personal data must be encrypted - Access must be restricted to authorized personnel - Data breaches must be reported within 72 hours Review Cycle: Annual',
   'issuing_authority': 'Compliance Department Regulation: GDPR (General Data Protection Regulation) Sc
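
In the raw output above, several fields (`document_name`, `regulatory_reference`, `issuing_authority`) appear to absorb the rest of the document text. A quick way to surface such over-captures across all records is to flag unusually long string values. This is a minimal sketch. The helper name `flag_overcaptured` and the 80-character threshold are illustrative assumptions, not part of the project.

```python
def flag_overcaptured(records, max_chars=80):
    """Return (file_name, field, length) for string values longer than max_chars."""
    flagged = []
    for record in records:
        for field, value in record.get("fields", {}).items():
            if isinstance(value, str) and len(value) > max_chars:
                flagged.append((record.get("file_name"), field, len(value)))
    return flagged


# Toy record shaped like the output above (value shortened for illustration).
sample = [{
    "file_name": "compliance_sample.pdf",
    "document_type": "compliance",
    "fields": {
        "version": "2.1",
        "regulatory_reference": (
            "GDPR (General Data Protection Regulation) Scope: "
            "This policy applies to all employees and contractors "
            "handling personal data."
        ),
    },
}]
flagged = flag_overcaptured(sample)
print(flagged)
```

In the notebook, the same check could be run as `flag_overcaptured(data)`. A length threshold will not catch values that are too short, such as `effective_date` being `'01'`, so those need a separate check.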

### Prettier Output

In [13]:
from pprint import pprint

for item in data:
    pprint(item)
    print("-" * 60)  # separator between documents

{'document_type': 'compliance',
 'fields': {'document_name': 'DATA PRIVACY COMPLIANCE POLICY Version: 2.1 '
                             'Effective Date: 01 January 2026 Issued by: '
                             'Compliance Department Regulation: GDPR (General '
                             'Data Protection Regulation) Scope: This policy '
                             'applies to all employees and contractors '
                             'handling personal data. Requirements: - Personal '
                             'data must be encrypted - Access must be '
                             'restricted to authorized personnel - Data '
                             'breaches must be reported within 72 hours Review '
                             'Cycle: Annual',
            'effective_date': '01',
            'issuing_authority': 'Compliance Department Regulation: GDPR '
                                 '(General Data Protection Regulation) Scope: '
                                 'This p
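
The output above points to the main limitation seen in this run: most values run on to the end of the text, and `effective_date` stops at `'01'`. The extraction rules themselves are not shown in this notebook, so the cause is an assumption. One plausible reading is that the PDF text is flattened into a single line, leaving a pattern no line break to stop at. One way to bound a capture in that situation is a lookahead for the next known label. The label list below is read off the sample document and is illustrative only. `extract_labeled_value` is a hypothetical helper, not a function from `src.extract`.

```python
import re

# Labels observed in the compliance sample; an assumption for this sketch.
LABELS = ["Version", "Effective Date", "Issued by", "Regulation",
          "Scope", "Requirements", "Review Cycle"]
_ANY_LABEL = "|".join(re.escape(label) for label in LABELS)


def extract_labeled_value(text, label):
    """Capture the value after `label:` up to the next known label or end of text."""
    pattern = rf"{re.escape(label)}:\s*(.+?)\s*(?=(?:{_ANY_LABEL}):|$)"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None


# Start of the flattened text, as it appears in the 'document_name' field above.
text = (
    "DATA PRIVACY COMPLIANCE POLICY Version: 2.1 Effective Date: 01 January 2026 "
    "Issued by: Compliance Department Regulation: GDPR (General Data Protection "
    "Regulation) Scope: This policy applies to all employees and contractors "
    "handling personal data."
)

print(extract_labeled_value(text, "Effective Date"))  # 01 January 2026
print(extract_labeled_value(text, "Issued by"))       # Compliance Department
print(extract_labeled_value(text, "Regulation"))      # GDPR (General Data Protection Regulation)
```

The lazy group `(.+?)` together with the lookahead stops each capture just before the next `Label:`. The trade-off is that the label list has to be maintained per document type, and a value that itself contains a known label followed by a colon would be cut short.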