GitMithril/distributed-urdu-ocr

Distributed Deep Learning-Based Urdu OCR & Restoration System

This project is an end-to-end Optical Character Recognition (OCR) pipeline designed specifically for Urdu text. It uses a two-model architecture: a U-Net-based image restoration model cleans degraded documents, and a Conv-Transformer sequence-to-sequence model then transcribes the text.

The system is trained on a combination of printed Nastaleeq text (MMU-OCR-21) and handwritten text (UHWR) to improve robustness across writing styles and image quality.


🚀 Project Status

  • Phase 1: Data Preprocessing & Pipeline Construction [COMPLETED]
  • Phase 2: Deep Learning Image Restoration (U-Net) [COMPLETED]
  • Phase 3: Deep Learning OCR (Conv-Transformer) [COMPLETED]
  • Phase 4: Pipeline Integration, Metrics & Visualization [COMPLETED]
  • Phase 5: Real-world Application / Deployment [COMPLETED]

🛠️ Architecture Overview

1. Data Preprocessing (Phase 1)

  • Standardization: All images are padded/scaled to a uniform 128×2048 size to prevent cropping of wide handwritten images while maintaining aspect ratios.
  • Stochastic Degradation: Synthetic noise (Gaussian blur, salt & pepper, low contrast, affine skew) is applied on-the-fly to train the restoration model.
  • Splitting: 70% Train / 15% Val / 15% Test with zero data leakage.
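
The standardization step can be illustrated with its geometry alone. The sketch below is not taken from preprocessing.py: it assumes that an image is scaled by a single factor until it fits inside the 128×2048 canvas and that the leftover space is padded evenly on both sides. The name `fit_to_canvas`, the centering choice, and the decision to upscale small images are all illustrative assumptions.

```python
def fit_to_canvas(width, height, canvas_w=2048, canvas_h=128):
    """Return (new_w, new_h, pad_left, pad_top) for an aspect-preserving fit.

    Hypothetical helper: the project's real padding side and
    interpolation are not documented in this README.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")

    # One scale factor for both axes keeps the aspect ratio; taking the
    # smaller candidate guarantees that neither side overflows the canvas.
    scale = min(canvas_w / width, canvas_h / height)
    new_w = max(1, min(canvas_w, round(width * scale)))
    new_h = max(1, min(canvas_h, round(height * scale)))

    # Assumption: remaining space is split evenly (centered placement).
    pad_left = (canvas_w - new_w) // 2
    pad_top = (canvas_h - new_h) // 2
    return new_w, new_h, pad_left, pad_top
```

Under these assumptions a very wide 4096×128 line is shrunk to 2048×64 and padded vertically rather than cropped, which is the behavior the standardization bullet describes.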

2. Image Restoration Model (Phase 2)

  • Architecture: U-Net with a pre-trained ResNet-34 encoder (via segmentation-models-pytorch).
  • Goal: Take noisy, degraded, or skewed document images and reconstruct clean, high-contrast text.
  • Metrics: Mean Squared Error (MSE), PSNR, and SSIM.
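
For reference, MSE and PSNR are related by PSNR = 10·log10(MAX² / MSE), where MAX is the largest possible pixel value. The following is a minimal pure-Python sketch of that relationship, assuming images are flattened to lists of floats in [0, 1]; it is not the project's metric code, and SSIM is left out because it needs windowed statistics.

```python
import math


def mse(clean, restored):
    """Mean squared error between two equally sized pixel sequences."""
    if len(clean) != len(restored) or not clean:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((a - b) ** 2 for a, b in zip(clean, restored)) / len(clean)


def psnr(clean, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(clean, restored)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)
```

A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01 and therefore a PSNR of 20 dB; higher PSNR means the restored image is closer to the clean target.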

3. OCR Sequence Model (Phase 3)

  • Architecture:
    • Encoder: Custom 7-block CNN backbone that extracts spatial features and pools the 128×2048 image down to a sequence of 128 tokens (d_model=256).
    • Decoder: 6-layer Transformer (3 encoder layers / 3 decoder layers) using standard sinusoidal positional encoding and self-attention.
  • Training Strategy: Uses a source-weighted Cross-Entropy loss (UHWR samples are weighted 3.0×, MMU-OCR-21 samples 1.0×) so that the model prioritizes difficult handwritten text despite the imbalance between the two sources.
  • Decoding: Character-level Beam Search with length penalty.
  • Vocabulary: 173 tokens covering Urdu characters, digits, punctuation, and control tokens (<PAD>, <SOS>, <EOS>, <UNK>).
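
The source-weighted loss can be sketched without any framework. The example below is an illustration, not the code in ocr_trainer.py: the 3.0 and 1.0 weights come from the description above, while the padding index of 0, the normalization by the sum of weights, and the function names are assumptions.

```python
import math

# Weights as stated in this README.
SOURCE_WEIGHTS = {"UHWR": 3.0, "MMU-OCR-21": 1.0}
PAD_ID = 0  # assumption: index of <PAD> in the vocabulary


def token_nll(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def source_weighted_loss(batch_logits, batch_targets, batch_sources):
    """Weighted mean of per-token cross-entropy, ignoring <PAD> targets.

    batch_logits[i][t] is the logit list for token t of sample i.
    Assumption: the loss is normalized by the total weight of real tokens.
    """
    total, weight_sum = 0.0, 0.0
    for seq_logits, seq_targets, source in zip(
        batch_logits, batch_targets, batch_sources
    ):
        weight = SOURCE_WEIGHTS[source]
        for logits, target in zip(seq_logits, seq_targets):
            if target == PAD_ID:
                continue  # padding positions contribute nothing
            total += weight * token_nll(logits, target)
            weight_sum += weight
    return total / weight_sum if weight_sum else 0.0
```

With one UHWR token and one MMU-OCR-21 token in a batch, the handwritten token contributes three quarters of the loss, so its gradient dominates in the same proportion.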

💻 Installation & Setup

To run this project locally, ensure you have Python installed, then set up the PyTorch environment:

# 1. Create and activate a virtual environment (Windows shown; on Linux/macOS use: source torch-env/bin/activate)
python -m venv torch-env
torch-env\Scripts\activate

# 2. Upgrade pip
python -m pip install --upgrade pip

# 3. Install PyTorch with CUDA 12.1 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 4. Install remaining project dependencies
pip install -r requirements.txt

Optional: Install Tesseract OCR (Urdu) for baseline comparison

benchmark.py compares this project against Tesseract (pytesseract, language code urd).
To install Tesseract and ensure the Urdu language pack is present, run the included Windows script from the project root:

.\install_tesseract_urdu.bat

If the script reports urd in tesseract --list-langs, the Tesseract baseline is ready for comparison.
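
The comparison is scored with character and word error rates (CER/WER). As a rough illustration of how such scores are usually defined, here is a minimal sketch based on Levenshtein distance. It is not the code in benchmark.py; the function names and the choice to split words on whitespace are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edits per reference character.

    Python strings are sequences of Unicode code points, so Urdu
    diacritics count as separate characters here.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: edits per reference word (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, which is worth keeping in mind when averaging scores over a test set.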

Optional: Jupyter Notebook Kernel

If you plan to run or modify project-notebook.ipynb, you can register the environment as a Jupyter kernel:

pip install ipykernel
python -m ipykernel install --user --name=torch-env --display-name "Python (torch-env)"

Optional: Train on Kaggle / Google Colab

If you want to train the OCR model on a powerful cloud GPU without the memory overhead of the full project exploration pipeline, you can use the standalone train_ocr_only.ipynb notebook. This headless notebook features automatic resumption from checkpoints and supports PATH_MAPPING to resolve nested directories on Kaggle without needing to modify your generated CSV splits.
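
The exact format of PATH_MAPPING is not shown in this README. One plausible reading is a prefix-rewrite table applied to each path read from the split CSVs, sketched below. The dictionary contents, the Kaggle directory, and the function name `resolve_path` are all hypothetical.

```python
# Hypothetical mapping: prefix as written in the split CSVs -> location
# of the same folder inside a (made-up) Kaggle dataset mount.
PATH_MAPPING = {
    "data/UHWR": "/kaggle/input/uhwr/UHWR/UHWR",
}


def resolve_path(csv_path, mapping):
    """Rewrite the longest matching prefix of a CSV path.

    Backslashes are normalized first so CSVs generated on Windows
    still match. Paths with no matching prefix are returned unchanged
    (apart from that normalization).
    """
    normalized = csv_path.replace("\\", "/")
    for prefix in sorted(mapping, key=len, reverse=True):
        base = prefix.rstrip("/")
        if normalized == base or normalized.startswith(base + "/"):
            return mapping[prefix].rstrip("/") + normalized[len(base):]
    return normalized
```

Rewriting paths at load time, rather than editing the CSVs, keeps the same split files usable both locally and in the cloud notebook.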


📁 Project Structure

├── splits/                    # Contains generated train/val/test CSVs
├── checkpoints/               # Saved model weights (.pth) and vocab.json
├── models/
│   ├── vocab.py               # Vocabulary builder and tokenization
│   ├── restoration_model.py   # U-Net architecture & builder
│   ├── ocr_model.py           # CNN + Transformer architecture
│   ├── ocr_trainer.py         # Custom training loops & weighted loss
│   └── pipeline.py            # End-to-End inference pipeline wrapper
├── datasets/
│   ├── restoration_dataset.py # On-the-fly degradation dataloader
│   └── ocr_dataset.py         # Sequence padding & source-weight dataloader
├── preprocessing.py           # Standardization and augmentation pipelines
├── benchmark.py               # Benchmarking script to evaluate CER/WER against Tesseract
├── install_tesseract_urdu.bat # Installs Tesseract + Urdu language pack for baseline comparison
├── project-notebook.ipynb     # Main orchestrator for training & visualization
└── requirements.txt           # Dependencies

🚢 Running the Backend Server (Local, No Docker)

The FastAPI server exposes both the Hadoop-backed endpoints (/api/...) and the direct RunPod endpoints (/api/DL/...). To run locally without Docker:

1. Configure credentials — edit deployment/hadoop/.env:

RUNPOD_API_KEY=<your key>
RUNPOD_ENDPOINT_ID=<your endpoint id>

2. Create venv, install dependencies, and start the server:

# From the project root
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
pip install -r requirements.txt

cd deployment/hadoop
PYTHONPATH=../.. python -m uvicorn app:app --host 0.0.0.0 --port 8000

The server starts at http://localhost:8000. Interactive API docs are available at http://localhost:8000/docs.

DL endpoints (no Hadoop required):

| Method | Path | Description |
|--------|------|-------------|
| POST | /api/DL/process-batch | Upload a .zip of images; returns job_id |
| GET | /api/DL/jobs/{job_id}/status | Poll job state and progress |
| GET | /api/DL/jobs/{job_id}/results | Fetch recognized text per image |

Optional env vars for tuning:

| Variable | Default | Description |
|----------|---------|-------------|
| OCR_MODE | real | Set to mock to skip RunPod calls during testing |
| DL_MAX_WORKERS | 4 | Parallel document threads |
| DL_RUNPOD_TIMEOUT_SECONDS | 120 | Max wait per RunPod job |
| DL_RUNPOD_POLL_INTERVAL_SECONDS | 2 | Polling interval |
| DL_RUNPOD_MAX_RETRIES | 3 | Retries on transient API errors |
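
To show how these variables and their defaults fit together, here is a minimal sketch of reading them into a typed settings object. The variable names and defaults are the ones listed above. The `DLSettings` class and `load_settings` function are illustrative and may not match how the server actually parses its environment.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DLSettings:
    """Hypothetical container for the tuning variables listed above."""
    ocr_mode: str
    max_workers: int
    runpod_timeout_seconds: float
    runpod_poll_interval_seconds: float
    runpod_max_retries: int


def load_settings(env=None):
    """Read the variables from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return DLSettings(
        ocr_mode=env.get("OCR_MODE", "real"),
        max_workers=int(env.get("DL_MAX_WORKERS", "4")),
        runpod_timeout_seconds=float(
            env.get("DL_RUNPOD_TIMEOUT_SECONDS", "120")),
        runpod_poll_interval_seconds=float(
            env.get("DL_RUNPOD_POLL_INTERVAL_SECONDS", "2")),
        runpod_max_retries=int(env.get("DL_RUNPOD_MAX_RETRIES", "3")),
    )
```

Passing a plain dictionary instead of the real environment, for example `load_settings({"OCR_MODE": "mock"})`, makes the parsing easy to exercise in tests.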

Example request:

curl -X POST http://localhost:8000/api/DL/process-batch \
  -F "file=@test_batch.zip"
# returns: {"job_id": "dl-a1b2c3d4", "message": "Batch processing started."}

curl http://localhost:8000/api/DL/jobs/dl-a1b2c3d4/status
curl http://localhost:8000/api/DL/jobs/dl-a1b2c3d4/results
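
The same status and results calls can be scripted. The sketch below polls the status endpoint and then fetches the results, using only the Python standard library. The endpoint paths come from the table above; the response schema is not documented here, so the "state" field and its "completed" / "failed" values are assumptions to adjust against the interactive docs at /docs.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"
# Assumed values of the assumed "state" field; adjust to the real schema.
DONE_STATES = {"completed"}
FAILED_STATES = {"failed"}


def get_json(url):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def wait_for_results(job_id, fetch=get_json, poll_interval=2.0,
                     timeout=600.0, sleep=time.sleep):
    """Poll the status endpoint, then return the results payload.

    `fetch` and `sleep` are injectable so the loop can be tested
    without a running server.
    """
    status_url = f"{BASE_URL}/api/DL/jobs/{job_id}/status"
    results_url = f"{BASE_URL}/api/DL/jobs/{job_id}/results"
    deadline = time.monotonic() + timeout
    while True:
        status = fetch(status_url)
        state = str(status.get("state", "")).lower()
        if state in DONE_STATES:
            return fetch(results_url)
        if state in FAILED_STATES:
            raise RuntimeError(f"job {job_id} failed: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in {timeout} s")
        sleep(poll_interval)
```

A call such as `wait_for_results("dl-a1b2c3d4")` would block until the job reaches one of the assumed terminal states or the timeout expires.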

🐳 Running with Docker (Full Hadoop Cluster)

docker-compose up --build

Services: FastAPI on :8000, YARN RM UI on :8088, HDFS NameNode UI on :9870, History Server on :8188.


RunPod Serverless Inference

Use the serverless inference image in deployment/inference/:

docker buildx build \
  --platform linux/amd64 \
  -f deployment/inference/Dockerfile \
  -t <registry>/<repo>:runpod-serverless \
  --push .

Set the RunPod worker environment:

  • DEVICE=cuda
  • RESTORATION_CKPT=/app/checkpoints/best_restoration_model.pth
  • OCR_CKPT=/app/checkpoints/best_ocr_model.pth
  • VOCAB_PATH=/app/checkpoints/vocab.json

Notes:

  • Build with --platform linux/amd64 for RunPod workers.
  • Keep OCR checkpoint and vocab.json from the same training run.
  • Loaders support raw state_dict, wrapped checkpoints (for example model_state_dict), and module.-prefixed keys.
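
The three checkpoint layouts mentioned in the last note can be normalized with a small helper before the weights are loaded. This is a sketch of the idea, not the project's loader: `model_state_dict` is the wrapper key named above, while the extra `state_dict` key and the function name are assumptions.

```python
# "model_state_dict" is named in the notes above; "state_dict" is an
# additional wrapper key assumed here for illustration.
WRAPPER_KEYS = ("model_state_dict", "state_dict")
DATA_PARALLEL_PREFIX = "module."


def extract_state_dict(checkpoint):
    """Return a plain {parameter_name: tensor} mapping.

    Handles a raw state_dict, a wrapped checkpoint, and keys saved
    from a DataParallel-style wrapper that start with "module.".
    """
    state = checkpoint
    for key in WRAPPER_KEYS:
        if isinstance(state.get(key), dict):
            state = state[key]  # unwrap one level
            break
    cleaned = {}
    for name, value in state.items():
        if name.startswith(DATA_PARALLEL_PREFIX):
            name = name[len(DATA_PARALLEL_PREFIX):]
        cleaned[name] = value
    return cleaned
```

In a PyTorch setting the result would typically be passed to `model.load_state_dict(...)` after reading the file with `torch.load(path, map_location="cpu")`.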

About

A Hadoop HDFS-based Nastaleeq Urdu image restoration and OCR pipeline that helps digitize old and degraded Urdu text, such as old government documents, with a minimal error rate.
