# OCR-mLLM Pipeline Documentation

This notebook implements a pipeline that processes page images with OCR (Optical Character Recognition) and multimodal Large Language Models (mLLMs). The pipeline supports five processing paths:

1. Raw OCR extraction (output in plain text): Pages are processed through pytesseract to generate raw OCR text, without mLLM intervention.
2. Direct image transcription (output in plain text): The original PNG is processed by the mLLM.
3. OCR post-correction (output in plain text): Raw OCR text and the original PNG are both fed into the mLLM with the OCR output clearly labelled in the prompt.
4. JSON from corrected text (output in JSON): OCR post-corrected text is used as input to generate structured JSON output.
5. Direct JSON from image (output in JSON): The original PNG is processed directly to produce structured JSON data.


## Directory Structure

The pipeline creates and uses the following directory structure:

```
project_root/
├── data/
│   └── pngs/            # Source images in PNG format
├── results/
│   ├── txt/            # Text outputs
│   │   ├── ocr-img2txt/pytesseract/     # Raw OCR output
│   │   ├── llm-img2txt/{model}/         # Direct LLM image transcription
│   │   └── ocr-llm-img2txt/{model}/     # LLM-refined OCR output
│   └── json/           # JSON outputs
│       ├── llm-img2json/{model}/        # Direct image to JSON via LLM
│       └── llm-txt2json/{model}/        # Text to JSON via LLM
└── benchmarking-results/
    ├── txt-accuracy/   # Text accuracy metrics
    └── json-accuracy/  # JSON accuracy metrics
```
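
The layout above can be summarised as a small lookup from processing path to output folder. The sketch below is illustrative only: the `STAGES` keys, the `output_path` helper, and the `<stem>.<format>` file name are assumptions made for this example, while the folder names are taken from the tree.

```python
from pathlib import Path

# Hypothetical lookup: processing path -> (format folder, stage folder).
# Folder names come from the directory tree; the keys are invented for this sketch.
STAGES = {
    "raw_ocr": ("txt", "ocr-img2txt"),
    "image_transcription": ("txt", "llm-img2txt"),
    "ocr_post_correction": ("txt", "ocr-llm-img2txt"),
    "json_from_text": ("json", "llm-txt2json"),
    "json_from_image": ("json", "llm-img2json"),
}


def output_path(root: Path, stage: str, model: str, doc_stem: str) -> Path:
    """Build results/<format>/<stage folder>/<model>/<doc_stem>.<format>."""
    fmt, folder = STAGES[stage]
    # The file name pattern is an assumption; the notebook helpers may name files differently.
    return root / "results" / fmt / folder / model / f"{doc_stem}.{fmt}"


example = output_path(Path("project_root"), "ocr_post_correction", "gpt-4o", "page-001")
print(example.as_posix())
```

For raw OCR, the `model` argument would be `pytesseract`, matching the folder shown in the tree.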

## Supported Models

The pipeline currently supports:
- OpenAI GPT-4o (`gpt-4o`)
- Google Gemini 2.5 Flash (`gemini-2.5-flash`)

## Prerequisites

Before running this notebook:

1. (Optional) Create a virtual environment:
   ```bash
   python3 -m venv .venv
   source .venv/bin/activate
   ```
   After running the notebook, type `deactivate` in the terminal to exit the virtual environment.

   With Anaconda:
   ```bash
   conda env create --file=environment.yaml
   conda activate ocr-benchmarking
   ```

2. Install required packages (if not using Anaconda):
   ```bash
   pip install -r config/requirements.txt
   ```

3. Set up API keys in a `.env` file:
   ```
   OPENAI_API_KEY=your_key_here
   GOOGLE_API_KEY=your_key_here
   ```

4. Place source images in PNG format in the `data/pngs/` directory.

5. Install Tesseract OCR. Windows users must also set the path to the Tesseract executable in the notebook (section 2).
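
Before running the OCR cells, it can help to confirm that the Tesseract executable is reachable. The following is a minimal sketch using only the standard library; `find_tesseract` is a hypothetical helper, not part of the notebook.

```python
import shutil


def find_tesseract(candidates=("tesseract",)):
    """Return the path of the first candidate executable found on PATH, or None."""
    for name in candidates:
        found = shutil.which(name)
        if found:
            return found
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found on PATH; install it or set pytesseract.pytesseract.tesseract_cmd.")
else:
    print(f"Using Tesseract at {tesseract_path}")
```

If the check fails on Windows even though Tesseract is installed, setting `pytesseract.pytesseract.tesseract_cmd` to the full path (as the notebook does in section 2) bypasses the PATH lookup.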

## Pipeline Components

The notebook is organized into several key sections:

1. Setup & Initialization
2. OCR Processing (Tesseract)
3. LLM Processing (GPT-4o, Gemini)
4. JSON Conversion
5. Benchmarking & Metrics

Sections can be re-run individually, but each one depends on the outputs of the sections before it; for example, OCR post-correction needs the raw OCR text produced by Tesseract.

## 1. Setup & Initialization

### Directory Setup

The first code cell handles:
- Setting up project paths and directory structure
- Creating necessary output directories
- Configuring model names and paths

The `make_llm_dirs()` function creates one output directory per model and processing path, for both text and JSON outputs. The key paths are:

```python
# Directory structure
source_dir = root_dir / "data" / "pngs"              # Input images
txt_output_dir = root_dir / "results" / "txt"        # Text outputs
json_output_dir = root_dir / "results" / "json"      # JSON outputs
bm_txt_output_dir = root_dir / "benchmarking-results" / "txt-accuracy"  # Benchmarking results
```

It is called once per output format, as `make_llm_dirs(llms, txt_output_dir, "txt")` and `make_llm_dirs(llms, json_output_dir, "json")`, so every model gets its own directory for each processing stage.
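
To see which folders this produces, the sketch below runs a simplified variant of the helper inside a temporary directory. The folder names match the notebook; the condensed function body and the `list_dirs` helper are written for this example only.

```python
import tempfile
from pathlib import Path


def make_llm_dirs(llms, target_dir, doc_format):
    """Simplified variant of the notebook helper, using the same folder names."""
    if doc_format == "txt":
        # Raw OCR output does not depend on the LLM, so it has a single folder
        (target_dir / "ocr-img2txt" / "pytesseract").mkdir(parents=True, exist_ok=True)
        stages = ["llm-img2txt", "ocr-llm-img2txt"]
    else:
        stages = ["llm-img2json", "llm-txt2json"]
    for model in llms.values():
        for stage in stages:
            (target_dir / stage / model).mkdir(parents=True, exist_ok=True)


def list_dirs(base):
    """Return every directory below base as a sorted list of relative POSIX paths."""
    return sorted(p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_dir())


llms = {"openai": "gpt-4o", "google": "gemini-2.5-flash"}
with tempfile.TemporaryDirectory() as tmp:
    txt_dir = Path(tmp) / "results" / "txt"
    make_llm_dirs(llms, txt_dir, "txt")
    created = list_dirs(txt_dir)

for d in created:
    print(d)
```

Because every `mkdir` call uses `exist_ok=True`, re-running the setup cell is safe and leaves existing outputs untouched.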


### API Client Setup

The second code cell initializes API clients and utilities:

1. **API Clients**:
   - OpenAI client for GPT-4o
   - Google client for Gemini 2.5 Flash

2. **Environment Variables**:
   API keys are loaded from environment variables:
   ```python
   openai_api_key = os.getenv("OPENAI_API_KEY")
   # Google API key is loaded directly in client initialization
   ```

Make sure all required API keys are set in your `.env` file before running this section.
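
A quick way to verify this is to check the environment before creating any clients. The sketch below assumes the two key names listed in the prerequisites; `missing_keys` is a hypothetical helper, not a function from the notebook.

```python
import os

# Key names taken from the .env example in the prerequisites
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All required API keys are set.")
```

Run this after `load_dotenv()` so that values from the `.env` file are already in `os.environ`.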


### Image File Loading

The third code cell prepares the input data. It:
   - Scans the `data/pngs/` directory
   - Filters for PNG files only
   - Stores full file paths for processing

This list of image paths (`img_filepaths`) will be used by subsequent processing steps in the pipeline.
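
The scan described above can be sketched as a single function. `find_pngs` is a name chosen for this example, and sorting is an addition the notebook cell may not perform; it makes the processing order reproducible, since `iterdir()` returns entries in arbitrary order.

```python
import tempfile
from pathlib import Path


def find_pngs(source_dir):
    """Return full paths of all PNG files in source_dir, sorted for a stable order."""
    return sorted(
        p for p in Path(source_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".png"
    )


# Demonstrate on a throwaway directory containing mixed file types
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.PNG", "a.png", "notes.txt"):
        (Path(tmp) / name).touch()
    names = [p.name for p in find_pngs(tmp)]

print(names)
```

The case-insensitive suffix check means files named `.PNG` are picked up as well.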


## 1. Setup

### a. Run this cell to ensure you have all the necessary directories

Before running the cell, make sure your source images are in `data/pngs/` under the project root; the pipeline reads its input from that folder.

In [None]:
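from pathlib import Path
import asyncio
import logging
import os
import sys

import pytesseract
from PIL import Image
from google.genai import types
from json_creation import *

sys.path.append('../')

# `venv.logger` is an internal detail of the venv module; use a dedicated logger instead
logger = logging.getLogger(__name__)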

# Get the root directory of the project
root_dir = Path.cwd().parent.parent

sys.path.append(str(root_dir / "tools"))  # sys.path entries must be strings, not Path objects
from tools.file_retrieval import *
                
# Get the user's path for the images folder assuming all images are stored here in .png format
source_dir = root_dir / "data" / "pngs"
txt_source_dir = root_dir / "results" / "txt" / "ocr-llm-img2txt"

# Get the path for the output folders, create one if it doesn't exist
txt_output_dir = root_dir / "results" / "txt" # txt output
txt_output_dir.mkdir(parents=True, exist_ok=True)

json_output_dir = root_dir / "results" / "json" # json output
json_output_dir.mkdir(parents=True, exist_ok=True)

bm_txt_output_dir = root_dir / "benchmarking-results" / "txt-accuracy" # benchmarking text output
bm_txt_output_dir.mkdir(parents=True, exist_ok=True)

bm_json_output_dir = root_dir / "benchmarking-results" / "json-accuracy" # benchmarking json output
bm_json_output_dir.mkdir(parents=True, exist_ok=True)

llms = {"openai": "gpt-4o", "google": "gemini-2.5-flash"}

# Create relevant directories for mLLM & OCR outputs
def make_llm_dirs(llms, target_dir, doc_format):
    if doc_format == "txt":
        # Raw OCR output is model-independent, so it gets a single directory
        (target_dir / "ocr-img2txt" / "pytesseract").mkdir(parents=True, exist_ok=True)
        stages = ["llm-img2txt", "ocr-llm-img2txt"]
    else:
        stages = ["llm-img2json", "llm-txt2json"]

    # One directory per processing stage and model
    for llm in llms.values():
        for stage in stages:
            (target_dir / stage / llm).mkdir(parents=True, exist_ok=True)

make_llm_dirs(llms, txt_output_dir, "txt")
make_llm_dirs(llms, json_output_dir, "json")

### b. Setup API keys & image encoding function

In [None]:
# Optional: load API keys from a .env file (skip if they are already set in your environment)
from dotenv import load_dotenv

load_dotenv()

In [None]:
from openai import OpenAI
from anthropic import Anthropic
from google import genai
import base64
from json_creation import *
from txt_creation import *

openai_api_key = os.getenv("OPENAI_API_KEY")
anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")

gpt_client = OpenAI(api_key=openai_api_key)
gemini_client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))
claude_client = Anthropic(api_key=anthropic_api_key)


### c. Get image file paths

In [None]:
# Collect the full path of every PNG in the source directory into `img_filepaths`
img_filepaths = []
ocr_output_filepaths = []

for path in source_dir.iterdir():
    if path.suffix.lower() == ".png" and path.is_file():
        img_filepaths.append(path)

# iterdir() yields entries in arbitrary order; sort for reproducible runs
img_filepaths.sort()

## 2. Run pytesseract

In [None]:
# Windows users should run this cell, inserting their path to Tesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'

In [None]:
# Run Tesseract on each image from data/pngs and write the text to results/txt/ocr-img2txt/pytesseract
for path in img_filepaths:
    out_path = txt_output_dir / "ocr-img2txt" / "pytesseract" / f"{path.stem}.txt"

    # Write as UTF-8 so non-ASCII characters survive on every platform
    with open(out_path, 'w', encoding="utf-8") as file:
        file.write(pytesseract.image_to_string(Image.open(path)))

## 3. OpenAI

### (i) Text

#### a. OCR-LLM (Async)

In [None]:
# Fetch ocr output files
ocr_output_dir = txt_output_dir/"ocr-img2txt"/"pytesseract"
ocr_output_filepaths = get_paths(ocr_output_dir, "txt")

# Run the async processes
await process_double_async(img_filepaths, ocr_output_filepaths, txt_output_dir/"ocr-llm-img2txt", openai_img_txt2txt_async, llms['openai'])
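
`process_double_async` receives the image paths and the OCR text paths as two separate lists, and how it matches them is not shown in this notebook. Since the pytesseract cell names each output after the image stem, one defensive option is to pair the two lists by stem before the call. The sketch below assumes that naming; `pair_by_stem` is a hypothetical helper.

```python
from pathlib import Path


def pair_by_stem(img_paths, ocr_paths):
    """Match each image to the OCR text file with the same stem.

    Returns (pairs, unmatched), where unmatched lists images without OCR output.
    """
    ocr_by_stem = {Path(p).stem: Path(p) for p in ocr_paths}
    pairs, unmatched = [], []
    for img in sorted(Path(p) for p in img_paths):
        ocr = ocr_by_stem.get(img.stem)
        if ocr is None:
            unmatched.append(img)
        else:
            pairs.append((img, ocr))
    return pairs, unmatched


pairs, unmatched = pair_by_stem(
    ["pngs/p2.png", "pngs/p1.png", "pngs/p3.png"],
    ["ocr/p1.txt", "ocr/p2.txt"],
)
print([(img.name, ocr.name) for img, ocr in pairs])
print([img.name for img in unmatched])
```

Passing the two halves of `pairs` as aligned lists rules out a page being corrected against another page's OCR text.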

#### b. LLM (Async)

In [None]:
await process_single_async(img_filepaths, txt_output_dir/"llm-img2txt", openai_img2txt_async, llms['openai'])

### (ii) JSON

#### a. Image to JSON (Async)

In [None]:
await process_json_async(img_filepaths, json_output_dir/"llm-img2json", openai_img2json_async, llms['openai'])

#### b. Text to JSON (Async)

In [None]:
ocr_llm_txt_dir = txt_source_dir / llms['openai'] # where to look for ocr-llm-img2txt output
# Get the text paths from ocr-llm-img2txt/gpt-4o directory
txt_filepaths = get_paths(ocr_llm_txt_dir, "txt")

# Call the main function that concurrently runs relevant async function
await process_json_async(txt_filepaths, json_output_dir/"llm-txt2json", openai_txt2json_async, llms['openai'])
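
The `process_*_async` helpers run one request per file concurrently; their implementation lives outside this notebook. As a rough illustration of the general pattern, the sketch below caps the number of in-flight calls with a semaphore and collects failures instead of aborting the batch. `run_bounded` and `fake_llm_call` are hypothetical names, and the real helpers may work differently.

```python
import asyncio


async def run_bounded(items, worker, max_concurrent=4):
    """Run worker(item) for every item, with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # return_exceptions=True keeps one failed request from cancelling the rest;
    # results come back in the same order as items
    return await asyncio.gather(*(guarded(i) for i in items), return_exceptions=True)


async def fake_llm_call(name):
    """Stand-in for an API request; fails for the item named 'bad'."""
    await asyncio.sleep(0.01)
    if name == "bad":
        raise ValueError("simulated API failure")
    return f"json for {name}"


if __name__ == "__main__":
    # In a notebook cell, use `await run_bounded(...)` instead of asyncio.run(...)
    for result in asyncio.run(run_bounded(["a", "bad", "b"], fake_llm_call, max_concurrent=2)):
        print(repr(result))
```

Limiting concurrency is one way to stay under provider rate limits when processing many pages at once.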

## 4. Gemini


### (i) Text

#### a. LLM call (Async)

In [None]:
await process_single_async(img_filepaths, txt_output_dir/"llm-img2txt", gemini_img2txt_async, llms['google'])

#### b. OCR-LLM (Async)

In [None]:
# Fetch ocr output files
ocr_output_dir = txt_output_dir/"ocr-img2txt"/"pytesseract"
ocr_output_filepaths = get_paths(ocr_output_dir, "txt")

# Run the async processes
await process_double_async(img_filepaths, ocr_output_filepaths, txt_output_dir/"ocr-llm-img2txt", gemini_img_txt2txt_async, llms['google'])

### (ii) JSON

#### a. Image to JSON (Async)

In [None]:
await process_json_async(img_filepaths, json_output_dir/"llm-img2json", gemini_img2json_async, llms['google'])

#### b. Text to JSON (Async)

In [None]:

ocr_llm_txt_dir = txt_source_dir / llms['google'] # where to look for ocr-llm-img2txt output

# Get the text paths from ocr-llm-img2txt/gemini-2.5-flash directory
txt_filepaths = get_paths(ocr_llm_txt_dir, "txt")

# Call the main function that concurrently runs relevant async function
await process_json_async(txt_filepaths, json_output_dir/"llm-txt2json", gemini_txt2json_async, llms['google'])


## 5. Benchmark results

### a. Text accuracy benchmarking

In [None]:
import logging
import sys
from datetime import datetime

sys.path.append(str(Path.cwd().parent))
from benchmarking.txt_accuracy import clean_text_normalized, clean_text_nonorm, compute_metrics, build_dataframe
from tools.file_retrieval import get_doc_names, get_docs, get_all_models

# `venv.logger` is an internal detail of the venv module; use a dedicated logger instead
logger = logging.getLogger(__name__)

def main():
    """
    Prerequisites:
    - Ground truth text files located at `project_root/data/ground-truth/txt/kbaa-pxyz.txt`
    - LLM/OCR transcribed files located at:
        - for LLM transcriptions: `project_root/results/txt/llm-img2txt/<MODEL-NAME>/kbaa-pxyz.txt`
        - for OCR transcriptions: `project_root/results/txt/ocr-img2txt/<MODEL-NAME>/kbaa-pxyz.txt`
        - for LLM-corrected OCR transcriptions: `project_root/results/txt/ocr-llm-img2txt/<MODEL-NAME>/kbaa-pxyz.txt`

    The main function will:
    - Gather all ground truth text files
    - For each ground truth text file and for each LLM/OCR model, gather the corresponding transcription
    - Clean all the text files (normalized and not normalized)
    - Compute metrics for each file and model
    - Save results in two CSV files per model type (one for normalized, one for non-normalized)
        - Results are saved in `project_root/benchmarking-results/txt-accuracy/<MODEL-TYPE>`
    """

    # =============
    # Preliminaries
    # =============

    # args = parse_arguments()

    script_dir = str(Path.cwd())
    project_root = str(root_dir)
    logger.info("Script directory: %s", script_dir)
    logger.info("Project root: %s", project_root)

    # Ground truth
    ground_truth_dir = root_dir / "data" / "ground-truth" / "txt"
    doc_names = get_doc_names(ground_truth_dir, "txt", keep_prefix=False)

    # results/ paths
    all_models = get_all_models(
        "txt",
        os.path.join(txt_output_dir, "llm-img2txt"),
        os.path.join(txt_output_dir, "ocr-img2txt"),
        os.path.join(txt_output_dir, "ocr-llm-img2txt"),
    )

    logger.info(f"Models found: {all_models}")

    # ===========
    # Gather files
    # ===========

    # -> Gather ground truths and put into dict:
    ground_truths, all_texts = get_docs(ground_truth_dir, doc_names, "txt", name_has_prefix=True)
    ground_truths["__ALL__"] = all_texts

    doc_lengths_normalized = {
        doc: len(clean_text_normalized(text)) for doc, text in ground_truths.items()
    }
    doc_lengths_nonorm = {
        doc: len(clean_text_nonorm(text)) for doc, text in ground_truths.items()
    }
    total_doc_len_normalized = len(clean_text_normalized(ground_truths["__ALL__"]))
    total_doc_len_nonorm = len(clean_text_nonorm(ground_truths["__ALL__"]))

    # -> Gather each transcribed document and put into dict:

    # Structure: results[model][doc]
    results = {}

    for model_type, model in all_models:
        logger.info("Collecting results for model: %s", model)
        model_path = os.path.join(txt_output_dir, model_type, model)
        results[model_type] = results.get(model_type, {})
        # get_docs returns (per-document texts, concatenated text of all documents)
        docs, all_text = get_docs(model_path, doc_names, "txt", name_has_prefix=False)
        docs["__ALL__"] = all_text
        results[model_type][model] = docs
        logger.info("Collected results for model_type: %s, model: %s", model_type, model)

    # ===============
    # Compute metrics
    # ===============

    normalized_results_data = {}
    nonorm_results_data = {}

    for model_type, model in all_models:
        normalized_results_data[model_type] = normalized_results_data.get(model_type, {})
        normalized_results_data[model_type][model] = normalized_results_data[model_type].get(model, {})
        nonorm_results_data[model_type] = nonorm_results_data.get(model_type, {})
        nonorm_results_data[model_type][model] = nonorm_results_data[model_type].get(model, {})

        logger.info("Computing metrics for model_type: %s, model: %s", model_type, model)
        for doc in doc_names:
            logger.info("Computing metrics for document: %s", doc)
            normalized_results_data[model_type][model][doc] = compute_metrics(
                ground_truths[doc], results[model_type][model][doc], "txt", normalized=True
            )
            nonorm_results_data[model_type][model][doc] = compute_metrics(
                ground_truths[doc], results[model_type][model][doc], "txt", normalized=False
            )

        # Compute metrics separately for [__ALL__]
        normalized_results_data[model_type][model]["__ALL__"] = compute_metrics(
            ground_truths["__ALL__"], results[model_type][model]["__ALL__"], "txt", normalized=True
        )
        nonorm_results_data[model_type][model]["__ALL__"] = compute_metrics(
            ground_truths["__ALL__"], results[model_type][model]["__ALL__"], "txt", normalized=False
        )

    # ====================
    # Put metrics in table
    # ====================

    time = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

    results_base_dir = root_dir / "benchmarking-results" / "txt-accuracy"

    # Create a separate results directory for each model type.
    # Iterating over the distinct model types writes each CSV once instead of once per model.
    for model_type in normalized_results_data:
        results_dir = results_base_dir / model_type
        results_dir.mkdir(parents=True, exist_ok=True)

        normalized_df = build_dataframe(
            f"normalized_{time}",
            doc_names,
            normalized_results_data[model_type],
            doc_lengths_normalized,
            total_doc_len_normalized,
        )
        nonorm_df = build_dataframe(
            f"nonorm_{time}",
            doc_names,
            nonorm_results_data[model_type],
            doc_lengths_nonorm,
            total_doc_len_nonorm,
        )

        # ============
        # Save results
        # ============

        # # Default save to project_root/benchmarking-results/txt-accuracy
        # results_path = os.path.join(project_root, "benchmarking-results", "txt-accuracy")
        # if not os.path.exists(results_path):
        #     os.makedirs(results_path)
        normalized_df.to_csv(os.path.join(str(results_dir), f"normalized_{time}.csv"))
        nonorm_df.to_csv(os.path.join(str(results_dir), f"nonorm_{time}.csv"))


if __name__ == "__main__":
    main()
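
The cell above delegates the scoring to `compute_metrics` from `benchmarking.txt_accuracy`, which is not shown here. To illustrate what a normalized versus non-normalized text comparison can look like, the sketch below computes a character error rate from the Levenshtein edit distance. The metric choice and the normalization rule (lowercasing and collapsing whitespace) are assumptions for this example, not a description of the project's actual implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def character_error_rate(truth: str, hypothesis: str, normalized: bool = True) -> float:
    """Edit distance divided by the length of the ground truth."""
    if normalized:
        truth, hypothesis = normalize(truth), normalize(hypothesis)
    if not truth:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(truth, hypothesis) / len(truth)


print(character_error_rate("Hello  World", "hello world", normalized=True))
print(character_error_rate("Hello", "hello", normalized=False))
```

With normalization, differences in case and spacing are ignored; without it, each such difference counts as an error, which is why the notebook reports both variants.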

### b. JSON accuracy benchmarking

In [None]:
import logging
import sys
from datetime import datetime

import pandas as pd

sys.path.append(str(Path.cwd().parent))
from benchmarking.json_accuracy import filter_expected_columns, build_dataframe, compare_dataframes_normalized, compare_dataframes_exact, compare_dataframes_fuzzy
from tools.file_retrieval import get_doc_names, get_docs, get_all_models

# `venv.logger` is an internal detail of the venv module; use a dedicated logger instead
logger = logging.getLogger(__name__)

def main():
    """
    Prerequisites:
    - Ground truth JSON files located at `project_root/data/ground-truth/json/gt_kbaa-pXYZ.json`
    - LLM-generated JSON files located at:
        - for image to JSON via LLM:
            - `project_root/results/json/llm-img2json/<MODEL-NAME>/<MODEL-NAME>_img_kbaa-pXYZ.json`
        - for text to JSON via LLM:
            - `project_root/results/json/llm-txt2json/<MODEL-NAME>/<MODEL-NAME>_img_kbaa-pXYZ.json`

    The main function will:
    - Gather all ground truth JSON files
    - For each ground truth JSON file and for each LLM/OCR model, open the JSON file's entries object as a Pandas dataframe
    - Clean all the JSON files (basic cleaning or normalization)
    - Compute metrics for each file and model
    - Save results in three CSV files per model type (normalized, non-normalized, and fuzzy)
        - Results are saved in `project_root/benchmarking-results/json-accuracy/<MODEL-TYPE>`
    """

    # =============
    # Preliminaries
    # =============

    #logger.info("Script directory: %s", script_dir)
    logger.info("Project root: %s", root_dir)

    # Ground truth
    ground_truth_dir = os.path.join(root_dir, "data", "ground-truth", "json")
    doc_names = get_doc_names(ground_truth_dir, "json", keep_prefix=False)

    # results/ paths
    all_models = get_all_models( "json",
        os.path.join(root_dir, "results", "json", "llm-img2json"),
        os.path.join(root_dir, "results", "json", "llm-txt2json")
    )
    logger.info(f"Models found: {all_models}")

    # ===========
    # Gather files
    # ===========

    # -> Gather ground truths and put into dict:

    ground_truths_json, _ = get_docs(
        ground_truth_dir, doc_names, "json", name_has_prefix=True
    )

    logger.info("Collected ground truth results: %s", list(ground_truths_json.keys()))

    # Convert JSON to dataframe

    ground_truths_df = {
        doc_name: filter_expected_columns(pd.DataFrame(doc_json['entries'])) for doc_name, doc_json in ground_truths_json.items()
    }

    logger.info("Converted ground truths to dataframes")

    # -> Gather each transcribed document and put into dict:

    # Structure: results[(model_type, model)][doc]
    results_json = {} # Stores collected outputs as JSON
    results_df = {} # Stores collected outputs as dataframes

    for model_type, model in all_models:
        logger.info("Collecting results for model: %s/%s", model_type, model)

        model_path = os.path.join(root_dir, "results", "json", model_type, model)
        print(model_path)
        results_json[(model_type, model)], _ = get_docs(
            model_path, doc_names, "json", name_has_prefix=True
        )

        logger.info("Collected results for model: %s", list(results_json[(model_type, model)].keys()))

        results_df[(model_type, model)] = {
            doc_name: filter_expected_columns(pd.DataFrame(doc_json['entries'])) for doc_name, doc_json in results_json[(model_type, model)].items()
        }

        logger.info("Converted results to dataframes")


    # ===============
    # Compute metrics
    # ===============

    normalized_results_data = {}
    nonorm_results_data = {}
    fuzzy_results_data = {}

    for model_type, model in all_models:
        normalized_results_data[model_type] = normalized_results_data.get(model_type, {})
        normalized_results_data[model_type][model] = normalized_results_data[model_type].get(model, {})

        nonorm_results_data[model_type] = nonorm_results_data.get(model_type, {})
        nonorm_results_data[model_type][model] = nonorm_results_data[model_type].get(model, {})

        fuzzy_results_data[model_type] = fuzzy_results_data.get(model_type, {})
        fuzzy_results_data[model_type][model] = fuzzy_results_data[model_type].get(model, {})
        
        logger.info("Computing metrics for model: %s", model)

        for doc in doc_names:
            logger.info("Computing metrics for document: %s", doc)

            normalized_results_data[model_type][model][doc] = compare_dataframes_normalized(
                ground_truths_df[doc], results_df[(model_type, model)][doc]
            )
            nonorm_results_data[model_type][model][doc] = compare_dataframes_exact(
                ground_truths_df[doc], results_df[(model_type, model)][doc]
            )
            fuzzy_results_data[model_type][model][doc] = compare_dataframes_fuzzy(
                ground_truths_df[doc], results_df[(model_type, model)][doc]
            )


    # =====================================
    # Put metrics in table and save results
    # =====================================

    time = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    # Iterate over model types:
    for model_type in normalized_results_data.keys():
        normalized_df = build_dataframe(f"{model_type}_normalized_{time}", doc_names, normalized_results_data[model_type])
        nonorm_df = build_dataframe(f"{model_type}_nonorm_{time}", doc_names, nonorm_results_data[model_type])
        fuzzy_df = build_dataframe(f"{model_type}_fuzzy_{time}", doc_names, fuzzy_results_data[model_type])

        results_path = os.path.join(root_dir, "benchmarking-results", "json-accuracy", model_type)
        if not os.path.exists(results_path):
            os.makedirs(results_path)

        normalized_df.to_csv(os.path.join(results_path, f"{model_type}_normalized_{time}.csv"))
        nonorm_df.to_csv(os.path.join(results_path, f"{model_type}_nonorm_{time}.csv"))
        fuzzy_df.to_csv(os.path.join(results_path, f"{model_type}_fuzzy_{time}.csv"))
    


if __name__ == "__main__":
    main()
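
The JSON benchmark compares each model's entries with the ground truth in three modes: exact, normalized, and fuzzy. The `compare_dataframes_*` functions are imported from `benchmarking.json_accuracy` and are not shown here, so the sketch below only illustrates how the three modes can differ for a single cell. It uses plain dictionaries instead of dataframes, and the normalization rule, the `0.8` threshold, and the helper names are assumptions.

```python
from difflib import SequenceMatcher


def normalize(value) -> str:
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(str(value).lower().split())


def cell_matches(truth, predicted, mode="exact", threshold=0.8) -> bool:
    """Compare two cell values under one of three matching modes."""
    if mode == "exact":
        return str(truth) == str(predicted)
    if mode == "normalized":
        return normalize(truth) == normalize(predicted)
    if mode == "fuzzy":
        ratio = SequenceMatcher(None, normalize(truth), normalize(predicted)).ratio()
        return ratio >= threshold
    raise ValueError(f"unknown mode: {mode}")


def match_rate(truth_rows, predicted_rows, columns, mode) -> float:
    """Fraction of matching cells; rows are paired by position (zip stops at the shorter list)."""
    total = matched = 0
    for truth, predicted in zip(truth_rows, predicted_rows):
        for col in columns:
            total += 1
            matched += cell_matches(truth.get(col, ""), predicted.get(col, ""), mode)
    return matched / total if total else 0.0


truth = [{"name": "John Smith", "city": "Boston"}]
predicted = [{"name": "john  smith", "city": "Bostan"}]
for mode in ("exact", "normalized", "fuzzy"):
    print(mode, match_rate(truth, predicted, ["name", "city"], mode))
```

In this example the name differs only in case and spacing, so it matches from the normalized mode onward, while the misspelled city is accepted only by the fuzzy mode.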