
# **OCR Benchmark Notebook**

This notebook **compares multiple OCR libraries** on your own images. It includes:
- Tesseract (via `pytesseract`), EasyOCR, PaddleOCR, and DocTR
- Optional cells for Keras-OCR and TrOCR (Hugging Face)
- A simple evaluation using **WER/CER (via `jiwer`)**
- Runtime timing per image + per model, and a summary chart

> **Folder layout:**
>
> - `data/images/` → test images (PNG/JPG/TIFF/PDF-as-images)
> - `data/ground_truth.csv` → optional ground-truth file with columns: `filename,text`  
>   (filenames should match exactly the images in `data/images/`)
>
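Reference text often spans several lines, so the `text` column needs proper CSV quoting. Below is a minimal sketch of writing and reading such a file with the standard-library `csv` module. The helper names `write_ground_truth` and `read_ground_truth` are hypothetical and not part of this notebook.

```python
import csv
from pathlib import Path

def write_ground_truth(rows, csv_path):
    """Write (filename, text) pairs to a CSV with the header `filename,text`."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings, so newlines
    # embedded in a quoted `text` field survive a round trip.
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "text"])
        writer.writerows(rows)

def read_ground_truth(csv_path):
    """Return {filename: text} from a CSV written by write_ground_truth."""
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        return {row["filename"]: row["text"] for row in csv.DictReader(f)}
```

A call such as `write_ground_truth([("sample.jpg", "line one\nline two")], "data/ground_truth.csv")` would produce one quoted multi-line record; the filename and text here are placeholders.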



In [3]:
!git clone https://github.com/THS-ST/ocr-models

fatal: destination path 'ocr-models' already exists and is not an empty directory.


In [4]:

# --- Install Python packages (run once per environment) ---

# DocTR is published on PyPI as `python-doctr`; the bare `doctr` package is an unrelated project.
%pip -q install pytesseract easyocr paddleocr "python-doctr[torch]" jiwer opencv-python pillow matplotlib rapidfuzz

# Extra (uncomment as needed):
# %pip -q install keras-ocr  # requires TensorFlow 2.x and may be heavy
# %pip -q install transformers timm accelerate  # for TrOCR (Hugging Face)
# %pip -q install mmocr==1.0.1 mmengine==0.10.4 mmdet==3.2.0  # complex; only if you want OpenMMLab



**Tesseract binary is required** for `pytesseract`.

In [5]:
!apt-get install -y tesseract-ocr

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


# **Imports and Setup**


## System Notes
- GPU is optional but speeds up deep learning models (PaddleOCR, DocTR, Keras-OCR, TrOCR).
- If you see CUDA errors, switch to CPU by installing CPU-only wheels or setting the appropriate flags.
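One way to make the CPU/GPU choice explicit is to set the `USE_CUDA` environment variable, which the EasyOCR cell later reads. The sketch below assumes PyTorch is the backend used for detection, and `cuda_available` is a hypothetical helper name. Other libraries such as PaddleOCR have their own device flags, which are not covered here.

```python
import os

def cuda_available() -> bool:
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch  # optional dependency; may be absent in CPU-only setups
    except ImportError:
        return False
    return bool(torch.cuda.is_available())

# Respect a value the user already exported; otherwise record what was detected.
os.environ.setdefault("USE_CUDA", "1" if cuda_available() else "0")
print("USE_CUDA =", os.environ["USE_CUDA"])
```

Exporting `USE_CUDA=0` before starting the notebook forces the CPU path regardless of what is detected.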


In [6]:

import os, time, json, string, glob, io, math
from pathlib import Path
from dataclasses import dataclass
from typing import Dict, Callable, List, Tuple

import numpy as np
import cv2
from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd

# Evaluation
import jiwer

DATA_DIR = Path("ocr-models/data")
IMAGES_DIR = DATA_DIR / "images"
GT_CSV = DATA_DIR / "ground_truth.csv"

IMAGES_DIR.mkdir(parents=True, exist_ok=True)
print("Expecting images in:", IMAGES_DIR.resolve())
print("Optional ground-truth CSV at:", GT_CSV.resolve())


Expecting images in: /content/ocr-models/data/images
Optional ground-truth CSV at: /content/ocr-models/data/ground_truth.csv


`load_image` reads an image with OpenCV, which returns channels in BGR order, and converts it to an RGB `np.ndarray`.

In [7]:
def load_image(path: Path):
    img_bgr = cv2.imread(str(path), cv2.IMREAD_COLOR)
    if img_bgr is None:
        raise FileNotFoundError(f"Could not read image: {path}")
    img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
    return img_rgb

The following are additional helper functions for loading images and ground truth values.

In [8]:
def list_images(d: Path) -> List[Path]:
    exts = ["*.png", "*.jpg", "*.jpeg", "*.tif", "*.tiff", "*.bmp"]
    files = []
    for e in exts:
        files.extend(sorted(d.glob(e)))
    return files

def load_ground_truth(gt_csv: Path) -> pd.DataFrame:
    if gt_csv.exists():
        df = pd.read_csv(gt_csv)
        assert {"filename", "text"}.issubset(df.columns), "ground_truth.csv must have columns: filename,text"
        return df
    else:
        return pd.DataFrame(columns=["filename", "text"])

def get_gt_for(fname: str, gt_df: pd.DataFrame) -> str:
    row = gt_df.loc[gt_df["filename"] == fname]
    return "" if row.empty else str(row.iloc[0]["text"])

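Next, we define a text normalization (lowercasing, punctuation removal, whitespace cleanup) that is applied to both the reference and the prediction, so that formatting differences are not counted as OCR errors.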

In [9]:
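# Normalize case, punctuation, and whitespace so they do not count as errors.
transform = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveWhiteSpace(replace_by_space=True),  # newlines and tabs become spaces
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

def compute_metrics(pred: str, ref: str) -> dict:
    # Normalize the strings up front instead of passing `transform` to jiwer:
    # the transform keyword names differ between jiwer releases, and jiwer
    # expects such transforms to end in a reduce-to-list step.
    ref_norm, pred_norm = transform(ref or ""), transform(pred or "")
    if not ref_norm:
        return {"wer": math.nan, "cer": math.nan}
    return {"wer": jiwer.wer(ref_norm, pred_norm), "cer": jiwer.cer(ref_norm, pred_norm)}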
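WER and CER are both edit-distance ratios: the number of substitutions, deletions, and insertions separating the prediction from the reference, divided by the reference length in words or characters. The plain-Python sketch below is for illustration only and is not jiwer's implementation. The names `edit_distance`, `simple_cer`, and `simple_wer` are hypothetical, and a non-empty reference is assumed. The example strings mimic a zero read as the letter "o".

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def simple_cer(ref: str, hyp: str) -> float:
    # Character error rate: edits divided by the number of reference characters.
    return edit_distance(ref, hyp) / len(ref)

def simple_wer(ref: str, hyp: str) -> float:
    # Word error rate: edits divided by the number of reference words.
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

ref = "fluimucil 600mg"
hyp = "fluimucil 60omg"  # one character differs
print(simple_cer(ref, hyp))  # 1 substitution over 15 characters
print(simple_wer(ref, hyp))  # 1 wrong word over 2 words
```

A single wrong character costs a whole word in WER but only one character in CER, which is why the two metrics are reported together.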


Now we can call the functions to load the images and ground truth values.

In [13]:
images = list_images(IMAGES_DIR)
gt_df = load_ground_truth(GT_CSV)  # empty DataFrame if the CSV is absent

In [14]:
images

[PosixPath('ocr-models/data/images/1.jpg'),
 PosixPath('ocr-models/data/images/10.jpg'),
 PosixPath('ocr-models/data/images/2.jpg'),
 PosixPath('ocr-models/data/images/3.jpg'),
 PosixPath('ocr-models/data/images/4.jpg'),
 PosixPath('ocr-models/data/images/5.jpg'),
 PosixPath('ocr-models/data/images/6.jpg'),
 PosixPath('ocr-models/data/images/7.jpg'),
 PosixPath('ocr-models/data/images/8.jpg'),
 PosixPath('ocr-models/data/images/9.jpg')]
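The listing above is in lexicographic order, so `10.jpg` comes before `2.jpg`. If numeric order matters for reading the results, a natural-sort key is one option. This is a sketch, and `natural_key` is a hypothetical helper that is not used elsewhere in this notebook.

```python
import re
from pathlib import Path

def natural_key(path: Path):
    """Split the filename into text and integer chunks so 2.jpg sorts before 10.jpg."""
    parts = re.split(r"(\d+)", path.name)
    # re.split with a capture group alternates text, digits, text, ...
    # so odd positions always hold the digit runs.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]

paths = [Path("1.jpg"), Path("10.jpg"), Path("2.jpg"), Path("9.jpg")]
print([p.name for p in sorted(paths, key=natural_key)])
# ['1.jpg', '2.jpg', '9.jpg', '10.jpg']
```

Applied to this notebook, `images = sorted(images, key=natural_key)` would reorder the list without changing its contents.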

We will define the following dataclass for our OCR results.

In [15]:
from dataclasses import dataclass

@dataclass
class OCRResult:
    filename: str
    model: str
    pred_text: str
    secs: float

# **1.0: Tesseract**
Tesseract is one of the most widely used OCR engines. Originally developed at Hewlett-Packard and later maintained with sponsorship from Google, it has used a long short-term memory (LSTM) based neural network for text recognition since version 4 and supports over 100 languages (Smith, 2007). It excels at recognizing printed text in high-quality images, but its performance drops significantly on degraded images or complex layouts. It also struggles with handwriting and with images that have significant distortions such as noise or skew (Gupta et al., 2021).

We will use the ``pytesseract`` module.

In [16]:
import pytesseract

Now we define the ``ocr_tesseract`` function, which takes an RGB ``np.ndarray`` and returns the extracted text as a string.

In [17]:
def ocr_tesseract(img_rgb: np.ndarray) -> str:
    pil_img = Image.fromarray(img_rgb)
    text = pytesseract.image_to_string(pil_img)
    return text.strip()
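`pytesseract.image_to_string` also accepts `lang` and `config` arguments, which pass the language and command-line flags through to Tesseract. The sketch below shows a variant that sets the page segmentation mode (`--psm 6` treats the page as a single uniform block of text) and the LSTM-only engine (`--oem 1`). Whether these settings help on this image set is untested, and the helper names `build_tesseract_config` and `ocr_tesseract_with_config` are hypothetical.

```python
def build_tesseract_config(psm: int = 6, oem: int = 1) -> str:
    """Compose the command-line flags passed to Tesseract via `config`."""
    return f"--oem {oem} --psm {psm}"

def ocr_tesseract_with_config(img_rgb, lang: str = "eng", psm: int = 6) -> str:
    # Imported lazily so the helper above can be used without the OCR stack.
    import pytesseract
    from PIL import Image

    pil_img = Image.fromarray(img_rgb)
    text = pytesseract.image_to_string(
        pil_img, lang=lang, config=build_tesseract_config(psm=psm)
    )
    return text.strip()
```

Running the same images under two or three `psm` values and comparing the resulting WER/CER is a cheap way to check whether layout assumptions affect the results.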

The results will be stored in this list.

In [18]:
tesseract_results = []

Next, we will call this function for all the images in the directory.

In [19]:
current_model = "tesseract"

for img_path in images:
    img = load_image(img_path)
    start = time.time()
    try:
        pred = ocr_tesseract(img)
    except Exception as e:
        pred = f"[ERROR] {e}"
    secs = time.time() - start
    tesseract_results.append(OCRResult(filename=img_path.name, model=current_model, pred_text=pred, secs=secs))
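Because every model is run with the same load, time, and record pattern, the loop can be factored into one reusable function. The sketch below assumes a hypothetical `run_model` helper, redefines `OCRResult` only so that it is self-contained, and uses `time.perf_counter()` as a swapped-in timer that is better suited to measuring short intervals than `time.time()`.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class OCRResult:  # same fields as the dataclass defined earlier
    filename: str
    model: str
    pred_text: str
    secs: float

def run_model(model: str, ocr_fn: Callable, paths: Iterable, loader: Callable) -> List[OCRResult]:
    """Run `ocr_fn` on every path, timing only the OCR call and not image loading."""
    results = []
    for path in paths:
        img = loader(path)
        start = time.perf_counter()
        try:
            pred = ocr_fn(img)
        except Exception as e:
            # Keep going on failure and record the error text as the prediction.
            pred = f"[ERROR] {e}"
        secs = time.perf_counter() - start
        name = getattr(path, "name", str(path))  # works for Path objects and plain strings
        results.append(OCRResult(filename=name, model=model, pred_text=pred, secs=secs))
    return results
```

With the functions in this notebook, the Tesseract run would then read `tesseract_results = run_model("tesseract", ocr_tesseract, images, load_image)`.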

Let's turn the list into a DataFrame for a better look at our OCR results.

In [20]:
df = pd.DataFrame([r.__dict__ for r in tesseract_results])
df

| | filename | model | pred_text | secs |
|---|---|---|---|---|
| 0 | 1.jpg | tesseract | `1. Fluimucil 600mg\n#15\n\nSig: Dissolve 1 tab...` | 0.806565 |
| 1 | 10.jpg | tesseract | `Amoxicillin 500mg cap #21\nSig: take 1 cap 3x ...` | 0.293336 |
| 2 | 2.jpg | tesseract | `Fluimucil 600mg #15\nDissoWve 1 tablet in % gl...` | 0.36137 |
| 3 | 3.jpg | tesseract | `Fluimucil 600mg #15\nDissolve 1 tablet in % gl...` | 0.417509 |
| 4 | 4.jpg | tesseract | `1. DecolgeniNeozep\n#15\nSig: Take 1 tablet 3x...` | 0.522823 |
| 5 | 5.jpg | tesseract | `Decolgen/Neozep #15\n1 tablet 3x/day for 5 day...` | 0.501405 |
| 6 | 6.jpg | tesseract | `Decolgen/Neozep #15\n1 tablet three times a da...` | 0.783092 |
| 7 | 7.jpg | tesseract | `1. Betadine Gargle\na\nSig: Gargle 3x/day\n\n2...` | 0.571959 |
| 8 | 8.jpg | tesseract | `Betadine Gargle a\nGargle 3x/day\n\nDiffiam Lo...` | 0.562165 |
| 9 | 9.jpg | tesseract | `Betadine Gargle a\nGargle three times a day\n\...` | 0.446683 |


Because the DataFrame view shows newline characters as literal `\n` and truncates long strings, let's print each prediction in full instead.

In [21]:
for r in tesseract_results:
  print("[" + r.filename + "]")
  print(r.pred_text)
  print("\n")

[1.jpg]
1. Fluimucil 600mg
#15

Sig: Dissolve 1 tablet in % glass water once a
day for 5 days

2. Immunosin 500mg
#21

Sig: Take 1 tablet 3x/day for 7 days


[10.jpg]
Amoxicillin 500mg cap #21
Sig: take 1 cap 3x a day x 1 week


[2.jpg]
Fluimucil 600mg #15
DissoWve 1 tablet in % glass water 1x/day.

Immunosin 500mg #21
axiday


[3.jpg]
Fluimucil 600mg #15
Dissolve 1 tablet in % glass water once a day.

Immunosin 500mg #21
Three times a day


[4.jpg]
1. DecolgeniNeozep
#15
Sig: Take 1 tablet 3x/day for 5 days

2. Loratadine 10mg
#21

Sig: Take 1 tablet at breakfast for 14 days


[5.jpg]
Decolgen/Neozep #15
1 tablet 3x/day for 5 days

Loratadine 10mg #21
1 tablet 1x/day at breakfast for 14 days


[6.jpg]
Decolgen/Neozep #15
1 tablet three times a day for 5 days

Loratadine 10mg #21
1 tablet once a day at breakfast for 14 days


[7.jpg]
1. Betadine Gargle
a
Sig: Gargle 3x/day

2. Difflam lozenges
#15
Sig: Every 4 hours


[8.jpg]
Betadine Gargle a
Gargle 3x/day

Diffiam Lozenges #15
1x/4ho

# **2.0: EasyOCR**


In [22]:
import easyocr

In [27]:
# The GPU is used only when the USE_CUDA environment variable is set to "1".
easyocr_reader = easyocr.Reader(['en'], gpu=os.environ.get("USE_CUDA", "0") == "1")




In [28]:
def ocr_easyocr(img_rgb: np.ndarray) -> str:
    result = easyocr_reader.readtext(img_rgb, detail=0, paragraph=True)
    return "\n".join([r.strip() for r in result if r.strip()])
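With `detail=0`, `readtext` returns only the recognized strings. With the default `detail=1` and `paragraph=False`, each detected region comes back as a `(bounding_box, text, confidence)` entry, which makes it possible to drop low-confidence regions before joining the text. The sketch below uses the hypothetical helper `join_confident_text`; the 0.3 threshold is an arbitrary assumption, and the sample detections are made up rather than taken from this notebook's images.

```python
from typing import List, Sequence, Tuple

Detection = Tuple[Sequence, str, float]  # (bounding box, text, confidence)

def join_confident_text(detections: List[Detection], min_conf: float = 0.3) -> str:
    """Keep non-empty regions whose confidence is at least `min_conf`, one per line."""
    kept = [text.strip() for _bbox, text, conf in detections
            if conf >= min_conf and text.strip()]
    return "\n".join(kept)

# Made-up detections, shaped like the detailed EasyOCR output.
sample = [
    ([[0, 0], [90, 0], [90, 20], [0, 20]], "Amoxicillin 500mg", 0.92),
    ([[0, 25], [40, 25], [40, 45], [0, 45]], "x~", 0.08),
]
print(join_confident_text(sample))  # prints: Amoxicillin 500mg
```

Note that filtering changes what is being benchmarked, so any threshold should be reported alongside the WER/CER numbers.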

In [29]:
easyocr_results = []

In [30]:
current_model = "easyocr"

for img_path in images:
    img = load_image(img_path)
    start = time.time()
    try:
        pred = ocr_easyocr(img)
    except Exception as e:
        pred = f"[ERROR] {e}"
    secs = time.time() - start
    easyocr_results.append(OCRResult(filename=img_path.name, model=current_model, pred_text=pred, secs=secs))



In [31]:
eocr_df = pd.DataFrame([r.__dict__ for r in easyocr_results])
eocr_df

| | filename | model | pred_text | secs |
|---|---|---|---|---|
| 0 | 1.jpg | easyocr | `Fluimucil 60Omg #15 Sig: Dissolve tablet in Yz...` | 8.663459 |
| 1 | 10.jpg | easyocr | `Amoxicillin 500mg cap #21 Sig: take cap 3x a d...` | 7.846941 |
| 2 | 2.jpg | easyocr | `Fluimucil 600mg Dissolve tablet in Yz glass wa...` | 8.157607 |
| 3 | 3.jpg | easyocr | `Fluimucil 600mg Dissolve tablet in Yz glass wa...` | 7.195216 |
| 4 | 4.jpg | easyocr | `Decolgen Neozep #15 Sig: Take tablet 3xlday fo...` | 8.505757 |
| 5 | 5.jpg | easyocr | `Decolgen Neozep tablet 3xlday for 5 days\n#15\...` | 7.480144 |
| 6 | 6.jpg | easyocr | `Decolgen Neozep tablet three times day for 5 d...` | 7.817087 |
| 7 | 7.jpg | easyocr | `Betadine Gargle\nSig: Gargle 3xlday\nDifflam l...` | 7.916207 |
| 8 | 8.jpg | easyocr | `Betadine Gargle Gargle 3xlday\nDifflam Lozenge...` | 8.448358 |
| 9 | 9.jpg | easyocr | `Betadine Gargle Gargle three times day\nDiffla...` | 11.85946 |
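The `secs` column can be summarized per model to compare runtimes. The sketch below uses only the standard library, with the hypothetical helper `mean_secs_per_model`; the sample rows are the first three timings per model from the result tables in this notebook. These timings come from a single run, so they are indicative only.

```python
from collections import defaultdict
from statistics import mean

def mean_secs_per_model(rows):
    """Average the `secs` value of (model, secs) pairs for each model."""
    grouped = defaultdict(list)
    for model, secs in rows:
        grouped[model].append(secs)
    return {model: mean(values) for model, values in grouped.items()}

# First three timings per model, copied from the result tables in this notebook.
rows = [
    ("tesseract", 0.806565), ("tesseract", 0.293336), ("tesseract", 0.36137),
    ("easyocr", 8.663459), ("easyocr", 7.846941), ("easyocr", 8.157607),
]
for model, avg in mean_secs_per_model(rows).items():
    print(f"{model}: {avg:.2f} s per image")
```

With the full result lists, the same input can be built as `[(r.model, r.secs) for r in tesseract_results + easyocr_results]`.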


In [32]:
for r in easyocr_results:
  print("[" + r.filename + "]")
  print(r.pred_text)
  print("\n")

[1.jpg]
Fluimucil 60Omg #15 Sig: Dissolve tablet in Yz glass water once a day for 5 days
Immunosin 500mg #21 Sig: Take 1 tablet 3xlday for 7 days


[10.jpg]
Amoxicillin 500mg cap #21 Sig: take cap 3x a day X week


[2.jpg]
Fluimucil 600mg Dissolve tablet in Yz glass water Ixlday:
#15
Immunosin 50Omg 3xlday
#21


[3.jpg]
Fluimucil 600mg Dissolve tablet in Yz glass water once day:
#15
Immunosin 50Omg Three times day
#21


[4.jpg]
Decolgen Neozep #15 Sig: Take tablet 3xlday for 5 days
Loratadine 1Omg #21 Sig: Take tablet at breakfast for 14 days


[5.jpg]
Decolgen Neozep tablet 3xlday for 5 days
#15
Loratadine 1Omg tablet Ixlday at breakfast for 14 days
#21


[6.jpg]
Decolgen Neozep tablet three times day for 5 days
#15
Loratadine 1Omg tablet once day at breakfast for 14 days
#21


[7.jpg]
Betadine Gargle
Sig: Gargle 3xlday
Difflam lozenges #15 Sig: Every 4 hours


[8.jpg]
Betadine Gargle Gargle 3xlday
Difflam Lozenges Ixl4hours
#15


[9.jpg]
Betadine Gargle Gargle three times day
Difflam

# **3.0: PaddleOCR**

# **4.0: DocTR**

# **5.0: TrOCR**

# **6.0: Keras-OCR**