# OCR Pipeline Assignment – Handwritten Document PII Extraction

## Complete End-to-End Pipeline

**Objective**: Build a comprehensive OCR + PII extraction pipeline for handwritten documents

**Pipeline Flow**:
1. Input (handwritten JPEG)
2. Pre-processing (rotation correction, noise reduction, contrast enhancement)
3. OCR (text extraction using Tesseract)
4. Text Cleaning (error correction, normalization)
5. PII Detection (names, phone numbers, emails, dates, addresses, medical IDs)
6. Redacted Image Generation (optional)

## Setup and Installation

First, ensure all dependencies are installed. Run the following commands in your terminal:

```bash
# Install Python dependencies
pip install -r requirements.txt

# Download spaCy model
python -m spacy download en_core_web_sm

# Install Tesseract OCR (system dependency)
# macOS: brew install tesseract
# Ubuntu: sudo apt-get install tesseract-ocr
# Windows: Download from https://github.com/UB-Mannheim/tesseract/wiki
```

In [None]:
# Import all required libraries
import sys
import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
import json
import pandas as pd
from IPython.display import Image, display
from tqdm import tqdm

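# Add src to path
sys.path.insert(0, os.path.abspath('../src'))

# Create the output directories used by later cells (no-op if they already exist)
for out_dir in ('../results', '../outputs/preprocessed', '../outputs/ocr_results',
                '../outputs/pii_detected', '../outputs/redacted'):
    os.makedirs(out_dir, exist_ok=True)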

# Import our custom modules
from preprocessing import preprocess_pipeline, visualize_preprocessing_steps
from ocr_engine import extract_text, extract_with_boxes, ocr_pipeline, get_ocr_confidence
from text_cleaner import clean_pipeline, get_cleaning_stats
from pii_detector import detect_all_pii, highlight_pii_in_text, get_pii_summary, setup_spacy_model
from redactor import generate_redacted_image, create_comparison_image, map_text_to_coordinates

# Configure matplotlib
plt.rcParams['figure.figsize'] = (15, 10)
plt.rcParams['font.size'] = 10

print("✅ All modules imported successfully!")

In [None]:
# Pre-load spaCy model (this might take a moment the first time)
print("Loading spaCy model...")
nlp = setup_spacy_model()
print("✅ spaCy model loaded successfully!")

## 1. Load and Display Sample Images

Let's load the three sample handwritten documents.

In [None]:
# Define sample image paths
sample_images = [
    '../Sample/page_14.jpg',
    '../Sample/page_30.jpg',
    '../Sample/page_35.jpg'
]

# Display all sample images
fig, axes = plt.subplots(1, len(sample_images), figsize=(20, 7), squeeze=False)
axes = axes[0]  # single row, so index the columns directly
fig.suptitle('Sample Handwritten Documents', fontsize=16, fontweight='bold')

for idx, img_path in enumerate(sample_images):
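    img = cv2.imread(img_path)
    if img is None:  # cv2.imread returns None instead of raising when a file cannot be read
        raise FileNotFoundError(f"Could not read image: {img_path}")
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)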
    axes[idx].imshow(img_rgb)
    axes[idx].set_title(f'Sample {idx + 1}: {os.path.basename(img_path)}')
    axes[idx].axis('off')

plt.tight_layout()
plt.savefig('../results/00_original_samples.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"✅ Loaded {len(sample_images)} sample images")

## 2. Image Pre-processing

Pre-processing steps to enhance image quality for better OCR:
- **Deskewing**: Correct tilted/rotated images
- **Denoising**: Reduce noise while preserving edges
- **Contrast Enhancement**: Apply CLAHE for better text visibility
- **Binarization**: Convert to black & white for clearer text
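
The internals of `preprocess_pipeline` are not shown in this notebook, so the following is only a minimal sketch of the binarization step, assuming it uses something like Otsu's method (an assumption, not confirmed by the source). It is written in plain Python on a tiny synthetic grid so the logic is easy to follow; the helper names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their gray levels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue  # no pixels in the dark class yet
        if w_fg == 0:
            break     # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray_rows, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [[255 if p > threshold else 0 for p in row] for row in gray_rows]


# Tiny synthetic "image": dark ink (around 30) on a bright page (around 200)
gray = [[30, 35, 200],
        [32, 210, 205]]
t = otsu_threshold([p for row in gray for p in row])
binary = binarize(gray, t)
```

On real images the same operation is available in OpenCV as `cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)`, which is far faster than a Python loop.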

In [None]:
# Demonstrate preprocessing on first sample
print("Demonstrating preprocessing pipeline on Sample 1...\n")
results, fig = visualize_preprocessing_steps(sample_images[0])
plt.savefig('../results/01_preprocessing_steps.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n✅ Preprocessing demonstration complete!")

In [None]:
# Preprocess all samples and save
print("Preprocessing all sample images...\n")

preprocessed_images = {}

for img_path in tqdm(sample_images, desc="Preprocessing"):
    base_name = os.path.splitext(os.path.basename(img_path))[0]
    output_path = f'../outputs/preprocessed/{base_name}_preprocessed.jpg'
    
    # Run preprocessing pipeline
    result = preprocess_pipeline(img_path, save_path=output_path)
    preprocessed_images[base_name] = result
    
    print(f"  ✓ Saved: {output_path}")

print("\n✅ All images preprocessed and saved!")

## 3. OCR - Text Extraction

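Extract text from the preprocessed images using Tesseract's LSTM-based OCR engine. Tesseract's stock models are trained mainly on printed text, so expect lower accuracy and lower confidence scores on handwriting than on typed documents.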
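
The confidence statistics printed in the next cell come from `ocr_pipeline`, whose implementation is not shown here. As a rough sketch of how such statistics could be derived, the example below assumes word-level output shaped like the dictionary returned by `pytesseract.image_to_data(..., output_type=Output.DICT)`, where layout rows that are not words carry a confidence of `-1`. The function name `summarize_confidence` and the sample values are made up for illustration.

```python
from statistics import mean, median


def summarize_confidence(ocr_data):
    """Aggregate per-word confidences, skipping empty tokens and -1 entries."""
    confidences = [
        float(conf)
        for text, conf in zip(ocr_data["text"], ocr_data["conf"])
        if str(text).strip() and float(conf) >= 0
    ]
    if not confidences:
        return {"mean_confidence": 0.0, "median_confidence": 0.0, "word_count": 0}
    return {
        "mean_confidence": mean(confidences),
        "median_confidence": median(confidences),
        "word_count": len(confidences),
    }


# Hypothetical word-level output; conf values may arrive as strings or numbers
sample = {
    "text": ["", "Patient", "Name", " ", "Jon"],
    "conf": ["-1", "91", "85", "-1", "40"],
}
stats = summarize_confidence(sample)
```

Reporting the median alongside the mean is useful for handwriting, where a few very low-confidence words can drag the mean down.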

In [None]:
# Extract text from all samples
print("Extracting text using OCR...\n")

ocr_results = {}

for img_path in tqdm(sample_images, desc="OCR Processing"):
    base_name = os.path.splitext(os.path.basename(img_path))[0]
    
    # Use preprocessed binary image for OCR
    binary_image = preprocessed_images[base_name]['binary']
    
    # Extract text with boxes and confidence
    result = ocr_pipeline(binary_image, extract_boxes=True, get_confidence=True)
    ocr_results[base_name] = result
    
    # Save extracted text
    output_path = f'../outputs/ocr_results/{base_name}_extracted_text.txt'
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(result['text'])
    
    print(f"\n{'='*60}")
    print(f"Sample: {base_name}")
    print(f"{'='*60}")
    print(f"Extracted Text:\n{result['text'][:300]}...")
    print(f"\nConfidence Stats:")
    print(f"  - Mean: {result['confidence']['mean_confidence']:.2f}%")
    print(f"  - Median: {result['confidence']['median_confidence']:.2f}%")
    print(f"  - Word Count: {result['confidence']['word_count']}")

print("\n✅ OCR extraction complete!")

## 4. Text Cleaning

Clean and normalize OCR output:
- Remove noise characters and OCR artifacts
- Correct common OCR errors (O↔0, l↔I, etc.)
- Standardize whitespace and formatting
- Normalize dates and phone numbers
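
The cleaning steps listed above can be sketched with a few regular expressions. This is not the code in `text_cleaner` (which is not shown); the noise characters, the look-alike table, and the helper names are assumptions chosen for illustration. Letter-to-digit swaps are applied only to tokens that already contain a digit and consist solely of digits and look-alike letters, so ordinary words are left alone.

```python
import re

# Letters that OCR commonly confuses with digits (illustrative, not exhaustive)
DIGIT_LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5"})

# A token made only of digits and look-alike letters, containing at least one digit
NUMERIC_TOKEN = re.compile(r"\b[0-9OolIS]*[0-9][0-9OolIS]*\b")


def fix_digit_lookalikes(text):
    """Replace look-alike letters with digits inside numeric-looking tokens."""
    return NUMERIC_TOKEN.sub(lambda m: m.group(0).translate(DIGIT_LOOKALIKES), text)


def clean_text(text):
    """Drop stray noise characters, repair digits, and normalize whitespace."""
    text = re.sub(r"[|~^]+", " ", text)          # assumed set of OCR noise characters
    text = fix_digit_lookalikes(text)
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)  # also removes blank lines


raw = "Ph:  555-O1l2 ||  \n\n Name:   Isla  ~"
cleaned = clean_text(raw)
```

Here `O1l2` becomes `0112` because it contains digits, while `Isla` is untouched because it contains none.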

In [None]:
# Clean extracted text
print("Cleaning extracted text...\n")

cleaned_texts = {}

for base_name, ocr_result in ocr_results.items():
    original_text = ocr_result['text']
    cleaned_text = clean_pipeline(original_text)
    cleaned_texts[base_name] = cleaned_text
    
    # Get cleaning stats
    stats = get_cleaning_stats(original_text, cleaned_text)
    
    print(f"{'='*60}")
    print(f"Sample: {base_name}")
    print(f"{'='*60}")
    print(f"Cleaning Stats:")
    print(f"  - Original length: {stats['original_length']} characters")
    print(f"  - Cleaned length: {stats['cleaned_length']} characters")
    print(f"  - Characters removed: {stats['characters_removed']}")
    print(f"  - Word count: {stats['original_words']} → {stats['cleaned_words']}")
    print(f"\nCleaned Text Preview:\n{cleaned_text[:300]}...\n")

print("✅ Text cleaning complete!")

## 5. PII Detection

Detect all Personally Identifiable Information (PII):
- **PERSON**: Names
- **PHONE**: Phone numbers (multiple formats)
- **EMAIL**: Email addresses
- **DATE**: Dates (DOB, appointment dates)
- **ADDRESS**: Physical addresses, locations
- **MEDICAL_ID**: Patient IDs, medical record numbers
- **ORG**: Organizations (hospitals, clinics)
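
Structured categories such as EMAIL and PHONE are typically matched with patterns, while PERSON and ORG usually come from a named-entity model such as the spaCy model loaded earlier. The sketch below covers only the pattern-based part and is not the implementation of `detect_all_pii`. The regular expressions, the entity dictionary layout, and the function names are assumptions; only the `pii_count` and `pii_types` keys mirror what this notebook reads from the real results.

```python
import re
from collections import Counter

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Optional country code, then 3-3-4 digits with optional separators or parentheses
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?\d{1,3}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def detect_pattern_pii(text):
    """Return pattern-matched entities with their type and character offsets."""
    entities = []
    for label, pattern in (("EMAIL", EMAIL_RE), ("PHONE", PHONE_RE)):
        for m in pattern.finditer(text):
            entities.append(
                {"type": label, "text": m.group(0), "start": m.start(), "end": m.end()}
            )
    return sorted(entities, key=lambda e: e["start"])


def summarize(entities):
    """Count entities per type, using the same keys the notebook reads later."""
    counts = Counter(e["type"] for e in entities)
    return {"pii_count": len(entities), "pii_types": dict(counts)}


found = detect_pattern_pii("Call (555) 123-4567 or mail jane.doe@example.org")
summary = summarize(found)
```

Keeping character offsets on each entity makes it possible to mask the text in place and, later, to map entities back to word boxes on the image.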

In [None]:
# Detect PII in all samples
print("Detecting PII in extracted text...\n")

pii_results = {}

for base_name, cleaned_text in cleaned_texts.items():
    # Detect all PII
    pii_data = detect_all_pii(cleaned_text)
    pii_results[base_name] = pii_data
    
    # Save PII detection results as JSON
    output_path = f'../outputs/pii_detected/{base_name}_pii.json'
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(pii_data, f, indent=2, ensure_ascii=False)
    
    # Display summary
    print(f"{'='*60}")
    print(f"Sample: {base_name}")
    print(f"{'='*60}")
    print(get_pii_summary(pii_data))
    print()

print("✅ PII detection complete!")

In [None]:
# Visualize PII detection with highlighted text
print("Creating PII-highlighted text visualizations...\n")

for base_name, pii_data in pii_results.items():
    highlighted_text = highlight_pii_in_text(cleaned_texts[base_name], pii_data, highlight_char='█')
    
    print(f"{'='*60}")
    print(f"Sample: {base_name} - Text with PII Redacted")
    print(f"{'='*60}")
    print(highlighted_text[:500])
    print("\n")

## 6. PII Detection Statistics and Analysis

In [None]:
# Create summary statistics DataFrame
summary_data = []

for base_name, pii_data in pii_results.items():
    row = {'Sample': base_name, 'Total PII': pii_data['pii_count']}
    row.update(pii_data['pii_types'])
    summary_data.append(row)

summary_df = pd.DataFrame(summary_data)
print("\nPII Detection Summary:")
print("="*80)
print(summary_df.to_string(index=False))
print("="*80)

# Save summary to CSV
summary_df.to_csv('../outputs/pii_detected/summary_statistics.csv', index=False)
print("\n✅ Summary statistics saved!")

In [None]:
# Visualize PII distribution
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Chart 1: Total PII by sample
samples = [row['Sample'] for row in summary_data]
totals = [row['Total PII'] for row in summary_data]
axes[0].bar(samples, totals, color='steelblue')
axes[0].set_title('Total PII Entities per Sample', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Sample')
axes[0].set_ylabel('Count')
axes[0].grid(axis='y', alpha=0.3)

# Chart 2: PII type distribution (stacked)
pii_types = ['PERSON', 'PHONE', 'EMAIL', 'DATE', 'ADDRESS', 'MEDICAL_ID', 'ORG']
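# Use .get so a PII type missing from a sample's counts is plotted as 0 instead of raising KeyError
type_data = {pii_type: [row.get(pii_type, 0) for row in summary_data] for pii_type in pii_types}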

x = np.arange(len(samples))
width = 0.6
bottom = np.zeros(len(samples))

colors = ['#e74c3c', '#3498db', '#f39c12', '#2ecc71', '#9b59b6', '#1abc9c', '#e67e22']

for idx, pii_type in enumerate(pii_types):
    values = type_data[pii_type]
    axes[1].bar(x, values, width, label=pii_type, bottom=bottom, color=colors[idx])
    bottom += values

axes[1].set_title('PII Type Distribution', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Sample')
axes[1].set_ylabel('Count')
axes[1].set_xticks(x)
axes[1].set_xticklabels(samples)
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('../results/02_pii_statistics.png', dpi=150, bbox_inches='tight')
plt.show()

print("✅ PII statistics visualized!")

## 7. Image Redaction (Optional)

Generate redacted versions of the original images with PII obscured.
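
Redacting an image requires mapping each detected PII string back to pixel coordinates. The `redactor` module handles this (its code is not shown); the sketch below illustrates one simple way to do it, assuming word boxes in the `left`/`top`/`width`/`height` layout produced by `pytesseract.image_to_data`. The function name `boxes_for_entity`, the padding value, and the sample coordinates are hypothetical.

```python
def boxes_for_entity(entity_text, ocr_data, padding=2):
    """Return (x1, y1, x2, y2) rectangles for OCR words that appear in entity_text."""
    targets = {w.strip(".,;:").lower() for w in entity_text.split()}
    targets.discard("")  # a token made only of punctuation would otherwise match empty words

    boxes = []
    for i, word in enumerate(ocr_data["text"]):
        if word.strip(".,;:").lower() in targets:
            x, y = ocr_data["left"][i], ocr_data["top"][i]
            w, h = ocr_data["width"][i], ocr_data["height"][i]
            # Pad the box slightly and clamp the top-left corner to the image origin
            boxes.append((max(x - padding, 0), max(y - padding, 0),
                          x + w + padding, y + h + padding))
    return boxes


# Hypothetical word boxes for the line "Patient: Jane Doe, age 42"
ocr_data = {
    "text": ["Patient:", "Jane", "Doe,", "age", "42"],
    "left": [10, 90, 140, 200, 240],
    "top": [20, 20, 20, 20, 20],
    "width": [70, 40, 45, 30, 20],
    "height": [18, 18, 18, 18, 18],
}
name_boxes = boxes_for_entity("Jane Doe", ocr_data)
```

Each rectangle can then be filled with `cv2.rectangle(image, (x1, y1), (x2, y2), (0, 0, 0), -1)`. Note that this word-set matching covers every occurrence of each word on the page, which errs on the side of over-redaction.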

In [None]:
# Generate redacted images for all samples
print("Generating redacted images...\n")

redaction_results = {}

for idx, img_path in enumerate(tqdm(sample_images, desc="Redacting")):
    base_name = os.path.splitext(os.path.basename(img_path))[0]
    
    # Get PII data
    pii_data = pii_results[base_name]
    
    # Generate redacted image (black boxes)
    output_path = f'../outputs/redacted/{base_name}_redacted.jpg'
    result = generate_redacted_image(img_path, pii_data, output_path, redaction_type='black')
    
    # Also generate labeled version for visualization
    labeled_output = f'../outputs/redacted/{base_name}_labeled.jpg'
    labeled_result = generate_redacted_image(img_path, pii_data, labeled_output, redaction_type='labeled')
    
    redaction_results[base_name] = {
        'black': result,
        'labeled': labeled_result
    }
    
    print(f"  ✓ {base_name}: {result['entities_redacted']} entities redacted")

print("\n✅ Image redaction complete!")

In [None]:
# Display redaction results (labeled version for visibility)
# squeeze=False keeps axes two-dimensional even when there is only one sample image
fig, axes = plt.subplots(len(sample_images), 2, figsize=(16, 6 * len(sample_images)), squeeze=False)
fig.suptitle('Redaction Results - Original vs Labeled PII', fontsize=16, fontweight='bold')

for idx, img_path in enumerate(sample_images):
    base_name = os.path.splitext(os.path.basename(img_path))[0]
    
    # Original
    original = redaction_results[base_name]['labeled']['original']
    axes[idx, 0].imshow(cv2.cvtColor(original, cv2.COLOR_BGR2RGB))
    axes[idx, 0].set_title(f'{base_name} - Original')
    axes[idx, 0].axis('off')
    
    # Labeled redacted
    redacted = redaction_results[base_name]['labeled']['redacted']
    axes[idx, 1].imshow(cv2.cvtColor(redacted, cv2.COLOR_BGR2RGB))
    axes[idx, 1].set_title(f'{base_name} - PII Labeled ({redaction_results[base_name]["labeled"]["entities_redacted"]} entities)')
    axes[idx, 1].axis('off')

plt.tight_layout()
plt.savefig('../results/03_redaction_results.png', dpi=150, bbox_inches='tight')
plt.show()

## 8. Complete Pipeline Demo

Demonstrate the entire pipeline on a single sample from start to finish.

In [None]:
def complete_pipeline(image_path, output_dir='../outputs'):
    """
    Run the complete OCR + PII extraction pipeline.
    
    Args:
        image_path (str): Path to input image
        output_dir (str): Directory for outputs
        
    Returns:
        dict: Complete pipeline results
    """
    print(f"\n{'='*70}")
    print(f"Running Complete Pipeline: {os.path.basename(image_path)}")
    print(f"{'='*70}\n")
    
    results = {}
    
    # Step 1: Preprocessing
    print("[1/6] Preprocessing image...")
    preprocessed = preprocess_pipeline(image_path)
    results['preprocessed'] = preprocessed
    print("  ✓ Image preprocessed")
    
    # Step 2: OCR
    print("[2/6] Extracting text with OCR...")
    ocr_result = ocr_pipeline(preprocessed['binary'], extract_boxes=True, get_confidence=True)
    results['ocr'] = ocr_result
    print(f"  ✓ Extracted {len(ocr_result['text'])} characters")
    print(f"  ✓ Mean confidence: {ocr_result['confidence']['mean_confidence']:.2f}%")
    
    # Step 3: Text Cleaning
    print("[3/6] Cleaning text...")
    cleaned_text = clean_pipeline(ocr_result['text'])
    results['cleaned_text'] = cleaned_text
    stats = get_cleaning_stats(ocr_result['text'], cleaned_text)
    print(f"  ✓ Removed {stats['characters_removed']} noise characters")
    
    # Step 4: PII Detection
    print("[4/6] Detecting PII...")
    pii_data = detect_all_pii(cleaned_text)
    results['pii'] = pii_data
    print(f"  ✓ Detected {pii_data['pii_count']} PII entities")
    for pii_type, count in pii_data['pii_types'].items():
        if count > 0:
            print(f"    - {pii_type}: {count}")
    
    # Step 5: Generate Redacted Image
    print("[5/6] Generating redacted image...")
    redacted_result = generate_redacted_image(image_path, pii_data, redaction_type='labeled')
    results['redacted'] = redacted_result
    print(f"  ✓ Redacted {redacted_result['entities_redacted']} entities")
    
    # Step 6: Summary
    print("[6/6] Pipeline complete!\n")
    
    return results

# Run complete pipeline on first sample
demo_results = complete_pipeline(sample_images[0])

print("\n" + "="*70)
print("PIPELINE SUMMARY")
print("="*70)
print(f"Total Characters Extracted: {len(demo_results['ocr']['text'])}")
print(f"OCR Confidence: {demo_results['ocr']['confidence']['mean_confidence']:.2f}%")
print(f"Total PII Detected: {demo_results['pii']['pii_count']}")
print(f"Entities Redacted: {demo_results['redacted']['entities_redacted']}")
print("="*70)

## 9. Export All Results

Export comprehensive results for all samples in JSON format.
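
One thing to watch when exporting: `json.dump` raises `TypeError` on values it does not recognize, such as NumPy arrays, some NumPy scalars, or sets. Whether the modules in `src/` return such values is not known from this notebook, so the following is only a precautionary sketch. The helper name `to_jsonable` is hypothetical; it is passed through the standard `default` parameter of `json.dump`/`json.dumps`.

```python
import json


def to_jsonable(obj):
    """Fallback for json.dump: convert array-like objects and sets to plain lists."""
    if hasattr(obj, "tolist"):  # NumPy arrays and NumPy scalars both expose tolist()
        return obj.tolist()
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# A set is not serializable by default; the fallback turns it into a sorted list
payload = {"sample_name": "page_14", "pii_labels": {"DATE"}}
encoded = json.dumps(payload, default=to_jsonable)
```

The same argument works with the export cell below, for example `json.dump(comprehensive_results, f, indent=2, ensure_ascii=False, default=to_jsonable)`.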

In [None]:
# Compile all results
comprehensive_results = []

for base_name in pii_results.keys():
    comprehensive_results.append({
        'sample_name': base_name,
        'ocr_text': ocr_results[base_name]['text'],
        'cleaned_text': cleaned_texts[base_name],
        'ocr_confidence': ocr_results[base_name]['confidence'],
        'pii_detection': pii_results[base_name],
        'redaction_stats': {
            'entities_redacted': redaction_results[base_name]['black']['entities_redacted']
        }
    })

# Save comprehensive results
output_file = '../outputs/comprehensive_results.json'
with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(comprehensive_results, f, indent=2, ensure_ascii=False)

print(f"✅ Comprehensive results saved to: {output_file}")

## 10. Final Results Summary

**Pipeline Performance Summary:**

In [None]:
print("\n" + "="*80)
print(" "*25 + "FINAL RESULTS SUMMARY")
print("="*80 + "\n")

for base_name in pii_results.keys():
    print(f"Sample: {base_name}")
    print("-" * 80)
    print(f"  OCR Confidence:        {ocr_results[base_name]['confidence']['mean_confidence']:.2f}%")
    print(f"  Words Extracted:       {ocr_results[base_name]['confidence']['word_count']}")
    print(f"  PII Entities Found:    {pii_results[base_name]['pii_count']}")
    print(f"  Entities Redacted:     {redaction_results[base_name]['black']['entities_redacted']}")
    print()

print("="*80)
print("\n✅ PIPELINE EXECUTION COMPLETE!\n")
print("Outputs saved to:")
print("  - Preprocessed images:  outputs/preprocessed/")
print("  - OCR text files:       outputs/ocr_results/")
print("  - PII detection JSON:   outputs/pii_detected/")
print("  - Redacted images:      outputs/redacted/")
print("  - Result screenshots:   results/")
print("="*80)

---

## Conclusion

This notebook demonstrates a complete end-to-end OCR + PII extraction pipeline for handwritten documents. The pipeline successfully:

1. ✅ **Pre-processes** images to handle tilted documents and improve quality
2. ✅ **Extracts text** from handwritten documents using Tesseract OCR
3. ✅ **Cleans text** to correct common OCR errors
4. ✅ **Detects PII** across 7 categories (names, phones, emails, dates, addresses, medical IDs, organizations)
5. ✅ **Generates redacted images** with PII obscured

### Key Features:
- Modular, reusable code architecture
- Handles tilted images and various handwriting styles
- Comprehensive PII detection with confidence scores
- Multiple redaction visualization options
- Complete JSON output for integration

### Benchmarking Note:
To test with new documents, simply:
1. Place new images in a folder
2. Update the `sample_images` list
3. Re-run all cells
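
Instead of editing the list by hand, the image paths can be collected from a folder. This is a small sketch using only the standard library; the function name `find_images`, the extension list, and the folder path in the usage line are assumptions.

```python
from pathlib import Path


def find_images(folder, extensions=(".jpg", ".jpeg", ".png")):
    """Return sorted paths of image files directly inside folder (non-recursive)."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
```

For example, `sample_images = find_images('../Sample')` would pick up every matching file in the sample folder, and sorting keeps the processing order stable between runs.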

The pipeline is production-ready and can be integrated into larger systems via the modular Python modules in the `src/` directory.