# Batch Processing: OCR + Layout + Graph for the Entire DocVQA Dataset

This notebook runs the full pipeline over the dataset and saves the results:
1. **OCR**: PaddleOCR extraction
2. **Layout Analysis**: Detect regions (Table, Figure, Form, TextBlock)
3. **Graph Building**: Semantic layout graph with spatial and semantic relations
4. **Export JSON**: Save all results to JSON

**Output Format:**
```json
{
  "image_id": "12345",
  "ocr": { "tokens": [...], "num_tokens": 100 },
  "layout": { "regions": [...], "num_regions": 5 },
  "graph": { "nodes": [...], "edges": [...], "adjacency": {...} }
}
```
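
A consumer of these files may want to confirm that a record has the expected top-level shape before using it. The sketch below checks only the keys shown in the format above; the helper name `missing_keys` is hypothetical and not part of the project.

```python
# Hypothetical helper: checks only the keys listed in the output format above.
EXPECTED_KEYS = {
    "ocr": ("tokens", "num_tokens"),
    "layout": ("regions", "num_regions"),
    "graph": ("nodes", "edges", "adjacency"),
}


def missing_keys(record: dict) -> list[str]:
    """Return dotted paths of expected keys that are absent from `record`."""
    missing = []
    if "image_id" not in record:
        missing.append("image_id")
    for section, fields in EXPECTED_KEYS.items():
        body = record.get(section)
        if not isinstance(body, dict):
            # The whole section is absent or has the wrong type.
            missing.append(section)
            continue
        missing.extend(f"{section}.{name}" for name in fields if name not in body)
    return missing


record = {
    "image_id": "12345",
    "ocr": {"tokens": [], "num_tokens": 0},
    "layout": {"regions": [], "num_regions": 0},
    "graph": {"nodes": [], "edges": []},  # "adjacency" left out on purpose
}
print(missing_keys(record))  # ['graph.adjacency']
```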

## 1. Import Libraries and Setup

In [2]:
import sys
sys.path.insert(0, '..')
import pathlib
PROJECT_ROOT = pathlib.Path().resolve().parents[1]
sys.path.append(str(PROJECT_ROOT))
# Force reload modules to get latest changes
import importlib

from pathlib import Path
from datetime import datetime

# Import pipeline components
from src.ocr.ocr_processor import PaddleOCRProcessor
from src.ocr.layout_analyzer import DocumentLayoutAnalyzer
from src.graph.graph_builder import GraphBuilder

# Import utilities
from src.utils import pipeline as pipeline_module
from src.utils import batch_processor as batch_module
from src.utils import statistics_collector as stats_module

# Reload modules to get latest code changes
importlib.reload(pipeline_module)
importlib.reload(batch_module)
importlib.reload(stats_module)

from src.utils.pipeline import FullPipelineProcessor
from src.utils.batch_processor import BatchProcessor
from src.utils.statistics_collector import StatisticsCollector

print("✅ All libraries imported (with reload)!")

✅ All libraries imported (with reload)!


In [3]:
# Test Token Classifier
from src.ocr.token_classifier import TokenClassifier

classifier = TokenClassifier()
print("✅ TokenClassifier loaded!")
print("\nTest with sample tokens:")

sample_tokens = [
    {'text': 'Invoice Date:', 'bbox': [10, 10, 100, 30]},
    {'text': '12/01/2024', 'bbox': [110, 10, 180, 30]},
    {'text': 'Customer Name:', 'bbox': [10, 40, 120, 60]},
    {'text': 'ABC Corp', 'bbox': [130, 40, 200, 60]},
]

classified = classifier.classify_tokens(sample_tokens)
for token in classified:
    print(f"  {token['text']:20} -> {token['token_type']}")

stats = classifier.get_statistics(classified)
print(f"\nStatistics: {stats}")

✅ TokenClassifier loaded!

Test with sample tokens:
  Invoice Date:        -> form_key
  12/01/2024           -> form_value
  Customer Name:       -> form_key
  ABC Corp             -> text

Statistics: {'form_key': 2, 'form_value': 1, 'text': 1}
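
The labels above are consistent with simple surface rules: a trailing colon marks a key, and a date-like string marks a value. The sketch below illustrates that idea only. It is an assumption about how such a classifier could work, not the actual `TokenClassifier` logic, and `guess_token_type` is a hypothetical name.

```python
import re

# Assumed pattern for dates such as 12/01/2024; the real classifier may use other rules.
DATE_PATTERN = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")


def guess_token_type(text: str) -> str:
    """Label a token using two illustrative surface rules."""
    stripped = text.strip()
    if stripped.endswith(":"):
        return "form_key"
    if DATE_PATTERN.match(stripped):
        return "form_value"
    return "text"


for sample in ["Invoice Date:", "12/01/2024", "Customer Name:", "ABC Corp"]:
    print(f"  {sample:20} -> {guess_token_type(sample)}")
```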


## 2. Configuration

In [4]:
# Dataset paths
IMAGES_FOLDER = Path('../dataset/DocVQA_Images')
OUTPUT_FOLDER = Path('../output/full_pipeline')
SUBSETS = ['train', 'validation', 'test']

# Processing configuration
MAX_IMAGES_PER_SUBSET = 10  # None = process all, or set number like 100
USE_PREPROCESSING = True
MAX_IMAGE_SIZE = 2500

# Create output folder
OUTPUT_FOLDER.mkdir(parents=True, exist_ok=True)

print("📁 Dataset Folder:", IMAGES_FOLDER)
print("📁 Output Folder:", OUTPUT_FOLDER)
print("📊 Max images per subset:", MAX_IMAGES_PER_SUBSET or "ALL")
print("🔧 Preprocessing:", "ENABLED" if USE_PREPROCESSING else "DISABLED")

📁 Dataset Folder: ../dataset/DocVQA_Images
📁 Output Folder: ../output/full_pipeline
📊 Max images per subset: 10
🔧 Preprocessing: ENABLED

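
`MAX_IMAGES_PER_SUBSET` caps how many images are taken from each subset, with `None` meaning no cap. A minimal sketch of that selection logic, assuming PNG images stored directly inside each subset folder; `select_images` is a hypothetical helper, not the `BatchProcessor` implementation.

```python
import tempfile
from pathlib import Path
from typing import Optional


def select_images(folder: Path, limit: Optional[int] = None) -> list[Path]:
    """Return PNG paths in sorted order, truncated to `limit` if one is given."""
    images = sorted(folder.glob("*.png"))
    return images if limit is None else images[:limit]


# Demonstration on a throwaway folder so the sketch runs without the dataset.
with tempfile.TemporaryDirectory() as tmp:
    subset_dir = Path(tmp) / "train"
    subset_dir.mkdir()
    for name in ["3.png", "1.png", "2.png"]:
        (subset_dir / name).touch()
    print([p.name for p in select_images(subset_dir, limit=2)])  # ['1.png', '2.png']
    print(len(select_images(subset_dir)))  # 3
```

Sorting before truncating keeps the selected subset stable across runs, which matters when results from a small trial run are compared with a later full run.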

## 3. Initialize Pipeline Components

In [5]:
# Test with a single image
test_image = IMAGES_FOLDER / 'train' / '10668.png'

if test_image.exists():
    print(f"Testing pipeline with: {test_image.name}")
    
    # Create temp pipeline (no need to initialize full components yet)
    temp_ocr = PaddleOCRProcessor()
    temp_layout = DocumentLayoutAnalyzer()
    temp_graph = GraphBuilder()
    temp_pipeline = FullPipelineProcessor(temp_ocr, temp_layout, temp_graph)
    
    # Process
    result = temp_pipeline.process_image(test_image)
    
    if result['success']:
        print("\n✅ Pipeline test successful!")
        print(f"   Tokens: {result['num_tokens']}")
        print(f"   Regions: {result['num_regions']}")
        print(f"   Edges: {result['num_edges']}")
    else:
        print(f"\n❌ Pipeline test failed: {result.get('error')}")
else:
    print(f"⚠️ Test image not found: {test_image}")

Testing pipeline with: 10668.png

Checking connectivity to the model hosters, this may take a while. To bypass this check, set `DISABLE_MODEL_SOURCE_CHECK` to `True`.

Initializing PaddleOCR engine...

Creating model: ('PP-OCRv5_server_det', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `/home/phn/.paddlex/official_models/PP-OCRv5_server_det`.
Creating model: ('PP-OCRv5_server_rec', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `/home/phn/.paddlex/official_models/PP-OCRv5_server_rec`.

-> PaddleOCR is ready!

✅ Pipeline test successful!
   Tokens: 87
   Regions: 4
   Edges: 12


## 3.5. Test Pipeline with 1 Image (Optional)

Test the pipeline on a single image before running the batch over the entire dataset.

In [6]:
# Initialize OCR processor
print("Initializing PaddleOCR...")
ocr_processor = PaddleOCRProcessor(
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_textline_orientation=False
)

# Initialize Layout Analyzer
print("Initializing Layout Analyzer...")
layout_analyzer = DocumentLayoutAnalyzer(
    y_overlap_threshold=0.5,
    line_height_tolerance=0.3,
    max_x_gap_ratio=3.0,
    block_vertical_gap=20,
    block_x_overlap_threshold=0.3
)

# Initialize Graph Builder
print("Initializing Graph Builder...")
graph_builder = GraphBuilder(
    iou_threshold=0.1,
    distance_threshold=200.0,
    projection_threshold=0.3,
    max_neighbors=5,
    min_edge_score=0.2
)

# Initialize Full Pipeline Processor
print("Initializing Pipeline Processor...")
pipeline_processor = FullPipelineProcessor(
    ocr_processor=ocr_processor,
    layout_analyzer=layout_analyzer,
    graph_builder=graph_builder
)

# Initialize Batch Processor
print("Initializing Batch Processor...")
batch_processor = BatchProcessor(pipeline_processor)

print("\n✅ All components initialized!")

Creating model: ('PP-OCRv5_server_det', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `/home/phn/.paddlex/official_models/PP-OCRv5_server_det`.

Initializing PaddleOCR...
Initializing PaddleOCR engine...

Creating model: ('PP-OCRv5_server_rec', None)
Model files already exist. Using cached files. To redownload, please delete the directory manually: `/home/phn/.paddlex/official_models/PP-OCRv5_server_rec`.

-> PaddleOCR is ready!
Initializing Layout Analyzer...
Initializing Graph Builder...
Initializing Pipeline Processor...
Initializing Batch Processor...

✅ All components initialized!
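
The `iou_threshold` and `distance_threshold` arguments suggest that candidate edges are gated by box overlap and center distance. The sketch below shows the standard intersection-over-union and center-distance computations for `[x1, y1, x2, y2]` boxes. How `GraphBuilder` actually combines them is not shown in this notebook, so the `is_candidate_edge` rule is an assumption for illustration only.

```python
import math

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center_distance(a: Box, b: Box) -> float:
    """Euclidean distance between the centers of two boxes."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return math.hypot(ax - bx, ay - by)


def is_candidate_edge(a: Box, b: Box,
                      iou_threshold: float = 0.1,
                      distance_threshold: float = 200.0) -> bool:
    """Assumed rule: keep a pair if the boxes overlap enough or their centers are close."""
    return iou(a, b) >= iou_threshold or center_distance(a, b) <= distance_threshold


key_box = (10, 10, 100, 30)       # e.g. "Invoice Date:"
value_box = (110, 10, 180, 30)    # e.g. "12/01/2024"
far_box = (900, 900, 1000, 930)

print(iou(key_box, value_box))                # 0.0 (no overlap)
print(center_distance(key_box, value_box))    # 90.0
print(is_candidate_edge(key_box, value_box))  # True (centers within 200 px)
print(is_candidate_edge(key_box, far_box))    # False
```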


## 4. Run Batch Processing

⚠️ **Note**: Processing the entire dataset takes a long time. Start with a small `MAX_IMAGES_PER_SUBSET` to test first.

In [7]:
# Run batch processing
print(f"\n{'='*70}")
print("STARTING BATCH PROCESSING")
print(f"{'='*70}\n")

start_time = datetime.now()

# Process dataset using BatchProcessor
stats = batch_processor.process_dataset(
    images_folder=IMAGES_FOLDER,
    output_folder=OUTPUT_FOLDER,
    subsets=SUBSETS,
    max_images_per_subset=MAX_IMAGES_PER_SUBSET,
    skip_existing=True
)

end_time = datetime.now()
elapsed = end_time - start_time

# Print final summary
print(f"\n{'='*70}")
print("FINAL SUMMARY")
print(f"{'='*70}")
print(f"Total Processed: {stats['total_processed']:,}")
print(f"  ✅ Success: {stats['total_success']:,}")
print(f"  ❌ Failed: {stats['total_failed']:,}")
print(f"Time Elapsed: {elapsed}")
print(f"{'='*70}\n")


STARTING BATCH PROCESSING


Processing subset: TRAIN
Found 10 images


Processing train: 100%|██████████| 10/10 [00:06<00:00,  1.57it/s]

TRAIN Summary:
  ✅ Success: 5
  ❌ Failed: 5
  📊 Total: 10

Processing subset: VALIDATION
Found 10 images

Processing validation: 100%|██████████| 10/10 [00:07<00:00,  1.40it/s]

VALIDATION Summary:
  ✅ Success: 7
  ❌ Failed: 3
  📊 Total: 10

Processing subset: TEST
Found 10 images

Processing test: 100%|██████████| 10/10 [00:05<00:00,  1.74it/s]

TEST Summary:
  ✅ Success: 8
  ❌ Failed: 2
  📊 Total: 10

FINAL SUMMARY
Total Processed: 30
  ✅ Success: 20
  ❌ Failed: 10
Time Elapsed: 0:00:19.483921
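
The final totals follow directly from the per-subset summaries. A short sketch that aggregates the counts reported above and derives the success rate; the dictionary layout is illustrative and not the actual structure returned by `process_dataset`.

```python
# Counts copied from the per-subset summaries in the output above.
per_subset = {
    "train": {"success": 5, "failed": 5},
    "validation": {"success": 7, "failed": 3},
    "test": {"success": 8, "failed": 2},
}

total_success = sum(counts["success"] for counts in per_subset.values())
total_failed = sum(counts["failed"] for counts in per_subset.values())
total = total_success + total_failed

print(f"Success: {total_success}/{total} ({total_success / total:.1%})")
# Success: 20/30 (66.7%)
```

A third of the images failed in this trial run, so the failures are worth inspecting before raising `MAX_IMAGES_PER_SUBSET`.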






## 5. Verify Results

In [8]:
# Count output files
print(f"\n{'='*70}")
print("OUTPUT VERIFICATION")
print(f"{'='*70}\n")

for subset in SUBSETS:
    subset_output = OUTPUT_FOLDER / subset
    if subset_output.exists():
        json_files = list(subset_output.glob('*.json'))
        print(f"{subset:15}: {len(json_files):,} JSON files")
    else:
        print(f"{subset:15}: 0 files (not processed)")

total_files = len(list(OUTPUT_FOLDER.glob('**/*.json')))
print(f"\n{'TOTAL':15}: {total_files:,} JSON files")
print(f"{'='*70}\n")


OUTPUT VERIFICATION

train          : 10 JSON files
validation     : 10 JSON files
test           : 10 JSON files

TOTAL          : 30 JSON files
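
Counting files confirms that output was written, but not that each file contains valid JSON. A minimal sketch of a stricter check that tries to parse every file; `split_valid_json` is a hypothetical helper, and the demonstration uses a temporary folder rather than `OUTPUT_FOLDER`.

```python
import json
import tempfile
from pathlib import Path


def split_valid_json(folder: Path) -> tuple[list[Path], list[Path]]:
    """Return (parseable, unparseable) JSON files found under `folder`."""
    valid, invalid = [], []
    for path in sorted(folder.glob("**/*.json")):
        try:
            with open(path, "r", encoding="utf-8") as f:
                json.load(f)
            valid.append(path)
        except (json.JSONDecodeError, UnicodeDecodeError):
            invalid.append(path)
    return valid, invalid


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ok.json").write_text('{"image_id": "1"}', encoding="utf-8")
    (root / "bad.json").write_text('{"image_id": ', encoding="utf-8")  # cut off mid-value
    valid, invalid = split_valid_json(root)
    print([p.name for p in valid])    # ['ok.json']
    print([p.name for p in invalid])  # ['bad.json']
```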



## 6. Inspect Sample Output

In [9]:
# Load and inspect a sample output file
import json

sample_files = list(OUTPUT_FOLDER.glob('**/*.json'))
sample_files = [f for f in sample_files if f.name != 'dataset_statistics.json']

if sample_files:
    sample_file = sample_files[0]
    print(f"📄 Sample file: {sample_file.name}")
    print(f"📁 Path: {sample_file}")
    
    with open(sample_file, 'r', encoding='utf-8') as f:
        sample_data = json.load(f)
    
    print(f"\n{'='*70}")
    print("SAMPLE OUTPUT STRUCTURE")
    print(f"{'='*70}")
    print(f"Version: {sample_data['version']}")
    print(f"Image ID: {sample_data['image_id']}")
    print(f"\nOCR:")
    print(f"  - Tokens: {sample_data['ocr']['num_tokens']}")
    print(f"  - Success: {sample_data['ocr']['success']}")
    print(f"\nLayout:")
    print(f"  - Lines: {sample_data['layout']['num_lines']}")
    print(f"  - Blocks: {sample_data['layout']['num_blocks']}")
    print(f"  - Regions: {sample_data['layout']['num_regions']}")
    print(f"\nGraph:")
    print(f"  - Nodes: {sample_data['graph']['num_nodes']}")
    print(f"  - Edges: {sample_data['graph']['num_edges']}")
    
    if sample_data['graph']['edges']:
        print(f"\nSample Edges (top 5):")
        for i, edge in enumerate(sample_data['graph']['edges'][:5], 1):
            print(f"  {i}. {edge['relation']} (score: {edge['score']:.3f}, category: {edge.get('category', 'N/A')})")
    
    print(f"{'='*70}\n")
else:
    print("⚠️ No output files found yet.")

📄 Sample file: 1016.json
📁 Path: ../output/full_pipeline/test/1016.json

SAMPLE OUTPUT STRUCTURE
Version: 1.0.0
Image ID: 1016

OCR:
  - Tokens: 28
  - Success: True

Layout:
  - Lines: 24
  - Blocks: 11
  - Regions: 3

Graph:
  - Nodes: 3
  - Edges: 6

Sample Edges (top 5):
  1. above (score: 0.810, category: spatial)
  2. above (score: 0.643, category: spatial)
  3. below (score: 0.810, category: spatial)
  4. above (score: 0.657, category: spatial)
  5. below (score: 0.657, category: spatial)



## 7. Collect and Export Statistics

In [10]:
# Collect statistics from all processed files
print("Collecting statistics from all processed files...")
overall_stats = StatisticsCollector.collect_from_folder(OUTPUT_FOLDER)

# Print statistics
StatisticsCollector.print_statistics(overall_stats)

# Save statistics to JSON
stats_file = OUTPUT_FOLDER / 'dataset_statistics.json'
StatisticsCollector.save_statistics(overall_stats, stats_file)

Collecting statistics from all processed files...


Collecting statistics:   0%|          | 0/30 [00:00<?, ?it/s]

Collecting statistics: 100%|██████████| 30/30 [00:00<00:00, 403.89it/s]

⚠️ Error processing 1027.json: Expecting value: line 8086 column 32 (char 171081)
⚠️ Error processing 1028.json: Expecting value: line 8086 column 32 (char 171081)
⚠️ Error processing 10002.json: Expecting value: line 3642 column 32 (char 71609)
⚠️ Error processing 10004.json: Expecting value: line 3642 column 32 (char 71609)
⚠️ Error processing 10013.json: Expecting value: line 8985 column 32 (char 183793)
⚠️ Error processing 10015.json: Expecting value: line 8985 column 32 (char 183793)
⚠️ Error processing 1002.json: Expecting value: line 5248 column 32 (char 109018)
⚠️ Error processing 1023.json: Expecting value: line 6923 column 32 (char 134459)
⚠️ Error processing 1024.json: Expecting value: line 6923 column 32 (char 134459)
⚠️ Error processing 1025.json: Expecting value: line 6923 column 32 (char 134459)
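
`Expecting value` means the parser reached a position where a JSON value should start and did not find one. Ten files fail here, matching the ten failures reported during batch processing. One plausible cause (an assumption, not confirmed by this notebook) is that serialization raised partway through writing, leaving a partial file behind. The sketch below avoids partial files by serializing fully in memory and then replacing the target in one step; `write_json_atomic` is a hypothetical helper, not part of the project.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, data: dict) -> None:
    """Write `data` as JSON so that `path` is either complete or left untouched."""
    # Serialize in memory first: a TypeError here happens before any file is created.
    text = json.dumps(data, ensure_ascii=False, indent=2)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        os.unlink(tmp_name)  # clean up the temporary file on any failure
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "train" / "10668.json"
    write_json_atomic(target, {"image_id": "10668"})
    try:
        write_json_atomic(target, {"image_id": object()})  # not JSON-serializable
    except TypeError:
        pass
    print(target.read_text(encoding="utf-8"))  # still the first, complete record
```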

DATASET STATISTICS
Total Images: 30
Total OCR Tokens: 781
Total Layout Regions: 77
Total Graph Edges: 246

Averages per Image:
  Tokens: 


