Distributed Deep Learning-Based Urdu OCR & Restoration System

This project is an end-to-end Optical Character Recognition (OCR) pipeline designed specifically for Urdu text. It uses a dual-model architecture: a U-Net-based image restoration model that cleans degraded documents, followed by a Conv-Transformer sequence-to-sequence model that transcribes the text.

The system is trained on a combination of printed Nastaleeq text (MMU-OCR-21) and handwritten text (UHWR) so that it handles both writing styles and a range of image qualities.

🚀 Project Status

Phase 1: Data Preprocessing & Pipeline Construction — [COMPLETED]
Phase 2: Deep Learning Image Restoration (U-Net) — [COMPLETED]
Phase 3: Deep Learning OCR (Conv-Transformer) — [COMPLETED]
Phase 4: Pipeline Integration, Metrics & Visualization — [COMPLETED]
Phase 5: Real-world Application / Deployment — [COMPLETED]

🛠️ Architecture Overview

1. Data Preprocessing (Phase 1)

Standardization: All images are padded/scaled to a uniform 128×2048 size to prevent cropping of wide handwritten images while maintaining aspect ratios.
Stochastic Degradation: Synthetic noise (Gaussian blur, salt & pepper, low contrast, affine skew) is applied on-the-fly to train the restoration model.
Splitting: 70% Train / 15% Val / 15% Test with zero data leakage.
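The standardization step above comes down to a small geometry calculation. The following is a minimal sketch, assuming the image is scaled by a single factor to fit inside the 128×2048 canvas and the leftover space is padded; `fit_to_canvas` is a hypothetical helper, and the actual logic in preprocessing.py (including which side receives the padding) may differ.

```python
TARGET_H, TARGET_W = 128, 2048  # canvas size stated in the README


def fit_to_canvas(height, width, target_h=TARGET_H, target_w=TARGET_W):
    """Return (new_h, new_w, pad_h, pad_w) for an aspect-preserving resize.

    Hypothetical helper: the image is scaled by one factor so that it fits
    inside the canvas, and the remaining space is left for padding.
    """
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    # A single scale factor for both axes keeps the aspect ratio intact.
    scale = min(target_h / height, target_w / width)
    new_h = max(1, min(target_h, round(height * scale)))
    new_w = max(1, min(target_w, round(width * scale)))
    return new_h, new_w, target_h - new_h, target_w - new_w


print(fit_to_canvas(64, 512))    # height-limited: (128, 1024, 0, 1024)
print(fit_to_canvas(100, 4000))  # width-limited:  (51, 2048, 77, 0)
```

A very wide handwritten line is shrunk until its width fits, so nothing is cropped; a short line is scaled up to full height and padded horizontally.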

2. Image Restoration Model (Phase 2)

Architecture: Pre-trained ResNet-34 U-Net (via segmentation-models-pytorch).
Goal: Take noisy, degraded, or skewed document images and reconstruct clean, high-contrast text.
Metrics: Mean Squared Error (MSE), PSNR, and SSIM.
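For reference, MSE and PSNR from the metrics above are related by PSNR = 10 · log10(MAX² / MSE). The sketch below works on flat lists of pixel values and assumes intensities are normalized to [0, 1]; the function names are illustrative and not taken from the project. SSIM is left out because it needs windowed statistics and is usually taken from a library.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized flat pixel sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a, b, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    err = mse(a, b)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / err)


clean = [0.0, 0.5, 1.0, 1.0]
restored = [0.1, 0.5, 0.9, 1.0]
print(round(mse(clean, restored), 4), round(psnr(clean, restored), 2))  # 0.005 23.01
```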

3. OCR Sequence Model (Phase 3)

Architecture:
- Encoder: Custom 7-block CNN Backbone that extracts spatial features, pooling the 128×2048 image down to a sequence of 128 tokens (d_model=256).
- Decoder: Transformer with 6 layers in total (3 encoder layers / 3 decoder layers), using standard sinusoidal positional encoding and self-attention.
Training Strategy: Source-weighted Cross-Entropy loss (UHWR samples are weighted 3.0×, MMU-OCR-21 samples 1.0×) so that the model prioritizes the harder handwritten text despite the imbalance between the two sources.
Decoding: Character-level Beam Search with length penalty.
Vocabulary: 173 tokens covering Urdu characters, digits, punctuation, and control tokens (<PAD>, <SOS>, <EOS>, <UNK>).
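The source-weighted loss from the training strategy above can be illustrated without any framework. This is a sketch under stated assumptions, not the code in ocr_trainer.py: it assumes `<PAD>` has index 0, averages the token losses within each sequence, and normalizes by the sum of the source weights. All function names are hypothetical.

```python
import math

SOURCE_WEIGHTS = {"UHWR": 3.0, "MMU-OCR-21": 1.0}  # weights stated in the README
PAD_ID = 0  # assumption: <PAD> is token 0


def token_cross_entropy(logits, target):
    """Cross-entropy for one token from raw logits (log-sum-exp form)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target]


def source_weighted_loss(batch):
    """batch: list of (source, per_token_logits, target_ids) tuples."""
    total, weight_sum = 0.0, 0.0
    for source, logits_seq, targets in batch:
        # Padding positions do not contribute to the loss.
        losses = [token_cross_entropy(logits, t)
                  for logits, t in zip(logits_seq, targets) if t != PAD_ID]
        if not losses:
            continue  # sequence is all padding
        weight = SOURCE_WEIGHTS[source]
        total += weight * sum(losses) / len(losses)
        weight_sum += weight
    # Assumption: normalize by total weight so the loss scale stays comparable.
    return total / weight_sum if weight_sum else 0.0


batch = [
    ("UHWR", [[0.0, 0.0, 0.0]], [1]),         # uncertain handwritten sample
    ("MMU-OCR-21", [[0.0, 10.0, 0.0]], [1]),  # confident printed sample
]
print(source_weighted_loss(batch))
```

In PyTorch the same effect is commonly achieved by calling `torch.nn.functional.cross_entropy` with `reduction="none"` and `ignore_index` set to the padding id, then multiplying each sample's loss by its source weight before averaging.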

💻 Installation & Setup

To run this project locally, ensure you have Python installed, then set up the PyTorch environment:

# 1. Create and activate a virtual environment
python -m venv torch-env
torch-env\Scripts\activate        # Linux/macOS: source torch-env/bin/activate

# 2. Upgrade pip
python -m pip install --upgrade pip

# 3. Install PyTorch with CUDA 12.1 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 4. Install remaining project dependencies
pip install -r requirements.txt

Optional: Install Tesseract OCR (Urdu) for baseline comparison

benchmark.py compares this project against Tesseract (pytesseract, language code urd).
To install Tesseract and ensure the Urdu language pack is present, run the included Windows script from the project root:

.\install_tesseract_urdu.bat

If the script reports urd in tesseract --list-langs, the Tesseract baseline is ready for comparison.
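The comparison against Tesseract is reported as CER and WER. As a minimal sketch of those two metrics (not the code in benchmark.py), both can be computed from a Levenshtein edit distance. The sketch assumes characters are Unicode code points and words are whitespace-separated; the project may normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


reference = "اردو زبان"
hypothesis = "اردو زبن"   # one character missing in the second word
print(round(cer(reference, hypothesis), 3), wer(reference, hypothesis))
```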

Optional: Jupyter Notebook Kernel

If you plan to run or modify project-notebook.ipynb, register the environment as a Jupyter kernel:

pip install ipykernel
python -m ipykernel install --user --name=torch-env --display-name "Python (torch-env)"

Optional: Train on Kaggle / Google Colab

If you want to train the OCR model on a powerful cloud GPU without the memory overhead of the full project exploration pipeline, you can use the standalone train_ocr_only.ipynb notebook. This headless notebook features automatic resumption from checkpoints and supports PATH_MAPPING to resolve nested directories on Kaggle without needing to modify your generated CSV splits.

📁 Project Structure

├── splits/                    # Contains generated train/val/test CSVs
├── checkpoints/               # Saved model weights (.pth) and vocab.json
├── models/
│   ├── vocab.py               # Vocabulary builder and tokenization
│   ├── restoration_model.py   # U-Net architecture & builder
│   ├── ocr_model.py           # CNN + Transformer architecture
│   ├── ocr_trainer.py         # Custom training loops & weighted loss
│   └── pipeline.py            # End-to-End inference pipeline wrapper
├── datasets/
│   ├── restoration_dataset.py # On-the-fly degradation dataloader
│   └── ocr_dataset.py         # Sequence padding & source-weight dataloader
├── preprocessing.py           # Standardization and augmentation pipelines
├── benchmark.py               # Benchmarking script to evaluate CER/WER against Tesseract
├── install_tesseract_urdu.bat # Installs Tesseract + Urdu language pack for baseline comparison
├── project-notebook.ipynb     # Main orchestrator for training & visualization
└── requirements.txt           # Dependencies

🚢 Running the Backend Server (Local, No Docker)

The FastAPI server exposes both the Hadoop-backed endpoints (/api/...) and the direct RunPod endpoints (/api/DL/...). To run locally without Docker:

1. Configure credentials — edit deployment/hadoop/.env:

RUNPOD_API_KEY=<your key>
RUNPOD_ENDPOINT_ID=<your endpoint id>

2. Create a virtual environment, install dependencies, and start the server:

# From the project root
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
pip install -r requirements.txt

cd deployment/hadoop
PYTHONPATH=../.. python -m uvicorn app:app --host 0.0.0.0 --port 8000

The server starts at http://localhost:8000. Interactive API docs are available at http://localhost:8000/docs.

DL endpoints (no Hadoop required):

| Method | Path | Description |
| --- | --- | --- |
| `POST` | `/api/DL/process-batch` | Upload a `.zip` of images; returns `job_id` |
| `GET` | `/api/DL/jobs/{job_id}/status` | Poll job state and progress |
| `GET` | `/api/DL/jobs/{job_id}/results` | Fetch recognized text per image |

Optional env vars for tuning:

| Variable | Default | Description |
| --- | --- | --- |
| `OCR_MODE` | `real` | Set to `mock` to skip RunPod calls during testing |
| `DL_MAX_WORKERS` | `4` | Parallel document threads |
| `DL_RUNPOD_TIMEOUT_SECONDS` | `120` | Max wait per RunPod job (seconds) |
| `DL_RUNPOD_POLL_INTERVAL_SECONDS` | `2` | Polling interval (seconds) |
| `DL_RUNPOD_MAX_RETRIES` | `3` | Retries on transient API errors |
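One way to read these variables with their documented defaults is sketched below. The variable names and defaults come from the table above; the `DLSettings` class, the `load_settings` function, and the rule that `OCR_MODE` accepts only `real` or `mock` are assumptions made for illustration and may not match how the server parses its configuration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DLSettings:
    """Tuning knobs for the DL endpoints, with the documented defaults."""
    ocr_mode: str = "real"
    max_workers: int = 4
    runpod_timeout_seconds: float = 120.0
    runpod_poll_interval_seconds: float = 2.0
    runpod_max_retries: int = 3


def load_settings(env=None):
    """Build DLSettings from a mapping of strings (defaults to os.environ)."""
    env = os.environ if env is None else env
    mode = env.get("OCR_MODE", "real").strip().lower()
    # Assumption: only the two modes named in the table are valid.
    if mode not in {"real", "mock"}:
        raise ValueError(f"OCR_MODE must be 'real' or 'mock', got {mode!r}")
    return DLSettings(
        ocr_mode=mode,
        max_workers=int(env.get("DL_MAX_WORKERS", "4")),
        runpod_timeout_seconds=float(env.get("DL_RUNPOD_TIMEOUT_SECONDS", "120")),
        runpod_poll_interval_seconds=float(env.get("DL_RUNPOD_POLL_INTERVAL_SECONDS", "2")),
        runpod_max_retries=int(env.get("DL_RUNPOD_MAX_RETRIES", "3")),
    )


print(load_settings({"OCR_MODE": "mock", "DL_MAX_WORKERS": "8"}))
```

Passing a plain dictionary instead of `os.environ` keeps the parsing easy to test without touching the real environment.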

Example request:

curl -X POST http://localhost:8000/api/DL/process-batch \
  -F "file=@test_batch.zip"
# returns: {"job_id": "dl-a1b2c3d4", "message": "Batch processing started."}

curl http://localhost:8000/api/DL/jobs/dl-a1b2c3d4/status
curl http://localhost:8000/api/DL/jobs/dl-a1b2c3d4/results
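The same submit-then-poll flow can be scripted. The sketch below covers only the polling half and uses the standard library. The endpoint paths come from the table above, but the shape of the status response is not documented here: the `status` field and the `completed` / `failed` values are assumptions, so check http://localhost:8000/docs for the actual schema before relying on them.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"
# Assumption: the status payload has a "status" field with these terminal values.
TERMINAL_STATES = {"completed", "failed"}


def get_json(path):
    """GET a JSON document from the local server."""
    with urllib.request.urlopen(BASE_URL + path, timeout=30) as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch=get_json, interval=2.0, timeout=120.0,
                 sleep=time.sleep):
    """Poll the status endpoint until the job reaches a terminal state."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch(f"/api/DL/jobs/{job_id}/status")
        if status.get("status") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in {timeout}s")
        sleep(interval)


# Usage against a running server (job id taken from the POST response):
#   final = wait_for_job("dl-a1b2c3d4")
#   results = get_json("/api/DL/jobs/dl-a1b2c3d4/results")
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised with a fake server in tests.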

🐳 Running with Docker (Full Hadoop Cluster)

docker-compose up --build

Services: FastAPI on :8000, YARN RM UI on :8088, HDFS NameNode UI on :9870, History Server on :8188.

RunPod Serverless Inference

Use the serverless inference image in deployment/inference/:

docker buildx build \
  --platform linux/amd64 \
  -f deployment/inference/Dockerfile \
  -t <registry>/<repo>:runpod-serverless \
  --push .

Set the RunPod worker environment:

DEVICE=cuda
RESTORATION_CKPT=/app/checkpoints/best_restoration_model.pth
OCR_CKPT=/app/checkpoints/best_ocr_model.pth
VOCAB_PATH=/app/checkpoints/vocab.json

Notes:

Build with --platform linux/amd64 for RunPod workers.
Keep OCR checkpoint and vocab.json from the same training run.
Loaders support raw state_dict, wrapped checkpoints (for example model_state_dict), and module.-prefixed keys.
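The three checkpoint layouts in the last note can be normalized with a few lines of dictionary handling. This is an illustrative sketch rather than the project's loader: `normalize_state_dict` is a hypothetical name, and treating `state_dict` as a second wrapper key is an assumption (the note above only names `model_state_dict`).

```python
def normalize_state_dict(checkpoint,
                         wrapper_keys=("model_state_dict", "state_dict")):
    """Return a plain {parameter_name: value} mapping from a loaded checkpoint.

    Handles a raw state_dict, a wrapped checkpoint, and keys carrying the
    "module." prefix that torch.nn.DataParallel adds.
    """
    state = checkpoint
    for key in wrapper_keys:
        # Unwrap {"model_state_dict": {...}, "epoch": ...} style checkpoints.
        if isinstance(state, dict) and isinstance(state.get(key), dict):
            state = state[key]
            break
    prefix = "module."
    return {
        (name[len(prefix):] if name.startswith(prefix) else name): value
        for name, value in state.items()
    }


raw = {"conv.weight": 1, "conv.bias": 2}
wrapped = {"epoch": 7, "model_state_dict": {"module.conv.weight": 1,
                                            "module.conv.bias": 2}}
print(normalize_state_dict(raw) == normalize_state_dict(wrapped))  # True
```

The resulting mapping is what would be passed to `model.load_state_dict(...)` after loading the file with `torch.load`.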
