<a href="https://colab.research.google.com/github/Treepyy/ocr-models/blob/main/OCR_Models_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# OCR Benchmark Notebook

This notebook **compares multiple OCR libraries** on your own images. It includes:
- Tesseract (via `pytesseract`), EasyOCR, PaddleOCR, and DocTR
- Optional cells for Keras-OCR and TrOCR (Hugging Face)
- A simple evaluation using **WER/CER (via `jiwer`)**
- Runtime timing per image + per model, and a summary chart
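
As a rough illustration of what WER and CER measure, the sketch below computes both in plain Python as edit distance divided by reference length. It is not `jiwer`'s implementation, and the helper names `edit_distance` and `error_rate` are hypothetical; the notebook itself relies on `jiwer` for the real scores.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """WER when unit == "word", CER otherwise (hypothetical helper)."""
    ref = reference.split() if unit == "word" else list(reference)
    hyp = hypothesis.split() if unit == "word" else list(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


wer = error_rate("the quick brown fox", "the quick brown box", unit="word")
cer = error_rate("abcd", "abxd", unit="char")
```

Both example calls give 0.25: one wrong word out of four, and one wrong character out of four. Because the denominator is the reference length, either metric can exceed 1.0 when the OCR output contains many insertions.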

> **Folder layout:**
>
> - `data/images/` → test images (PNG/JPG/TIFF/PDF-as-images)
> - `data/ground_truth.csv` → optional ground-truth file with columns: `filename,text`  
>   (filenames must exactly match the image files in `data/images/`)
>
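
The layout above can be checked before running any model. The following is a minimal sketch using only the standard library; `load_ground_truth` and `unmatched_entries` are hypothetical helper names, and the CSV is assumed to be UTF-8 with a `filename,text` header row.

```python
import csv
from pathlib import Path


def load_ground_truth(csv_path):
    """Return {filename: text}; an empty dict if the optional CSV is absent."""
    csv_path = Path(csv_path)
    if not csv_path.exists():
        return {}
    with csv_path.open(newline="", encoding="utf-8") as f:
        return {row["filename"]: row["text"] for row in csv.DictReader(f)}


def unmatched_entries(ground_truth, images_dir):
    """CSV filenames that have no file of exactly that name in images_dir."""
    images_dir = Path(images_dir)
    present = set()
    if images_dir.is_dir():
        present = {p.name for p in images_dir.iterdir() if p.is_file()}
    return sorted(name for name in ground_truth if name not in present)
```

Calling `unmatched_entries(load_ground_truth("data/ground_truth.csv"), "data/images")` should return an empty list when every ground-truth row has a matching image; any names it returns usually point to a typo or a differing file extension.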



In [2]:

# --- Install Python packages (run once per environment) ---

# DocTR is published on PyPI as `python-doctr`; the `doctr` package is an unrelated project.
%pip -q install pytesseract easyocr paddleocr "python-doctr[torch]" jiwer opencv-python pillow matplotlib rapidfuzz

# Extras (uncomment as needed):
# %pip -q install keras-ocr  # requires TensorFlow 2.x and may be heavy
# %pip -q install transformers timm accelerate  # for TrOCR (Hugging Face)
# %pip -q install mmocr==1.0.1 mmengine==0.10.4 mmdet==3.2.0  # complex; only if you want OpenMMLab



**The Tesseract binary is required** for `pytesseract`, which is only a Python wrapper around it.

In [3]:
!apt-get install -y tesseract-ocr

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.
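
To confirm that the binary is actually reachable from Python, a small standard-library check along these lines may help. `tesseract_version` is a hypothetical helper, and it assumes the executable is named `tesseract` and is on `PATH`.

```python
import shutil
import subprocess


def tesseract_version():
    """First line of `tesseract --version`, or None if the binary is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else None


print("Tesseract:", tesseract_version() or "not found on PATH")
```

If the binary lives somewhere outside `PATH`, `pytesseract.pytesseract.tesseract_cmd` can be set to its full path instead.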



## System Notes
- A GPU is optional but speeds up the deep learning models (EasyOCR, PaddleOCR, DocTR, Keras-OCR, TrOCR).
- If you see CUDA errors, switch to CPU by installing CPU-only wheels or by disabling the GPU in each library (for example, `gpu=False` in `easyocr.Reader`).
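
One library-agnostic way to fall back to CPU is to hide the CUDA devices from the process before any deep learning framework is imported. The sketch below assumes it runs at the top of the notebook, before the OCR imports; `force_cpu` is a hypothetical helper name.

```python
import os


def force_cpu():
    """Hide all CUDA devices from frameworks imported after this call."""
    # An empty value means "no visible devices" to the CUDA runtime.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


force_cpu()
```

The variable only takes effect if it is set before CUDA is initialized, so after changing it in a running notebook the kernel generally needs a restart.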


In [1]:

import os, time, json, string, glob, io, math
from pathlib import Path
from dataclasses import dataclass
from typing import Dict, Callable, List, Tuple

import numpy as np
import cv2
from PIL import Image
import matplotlib.pyplot as plt

# Evaluation
import jiwer

# OCR libraries
import pytesseract
import easyocr
from paddleocr import PaddleOCR
from doctr.models import ocr_predictor
from doctr.io import DocumentFile

DATA_DIR = Path("data")
IMAGES_DIR = DATA_DIR / "images"
GT_CSV = DATA_DIR / "ground_truth.csv"

IMAGES_DIR.mkdir(parents=True, exist_ok=True)
print("Expecting images in:", IMAGES_DIR.resolve())
print("Optional ground-truth CSV at:", GT_CSV.resolve())


Expecting images in: /content/data/images
Optional ground-truth CSV at: /content/data/ground_truth.csv
