This project is an end-to-end Optical Character Recognition (OCR) pipeline designed specifically for Urdu text. It features a dual-model architecture: a U-Net based Image Restoration model to clean degraded documents, followed by a Conv-Transformer sequence-to-sequence model to transcribe the text.
The system is trained on a combination of printed Nastaleeq text (MMU-OCR-21) and handwritten text (UHWR), so the models see both printed and handwritten styles at varying image quality.
- Phase 1: Data Preprocessing & Pipeline Construction — [COMPLETED]
- Phase 2: Deep Learning Image Restoration (U-Net) — [COMPLETED]
- Phase 3: Deep Learning OCR (Conv-Transformer) — [COMPLETED]
- Phase 4: Pipeline Integration, Metrics & Visualization — [COMPLETED]
- Phase 5: Real-world Application / Deployment — [COMPLETED]
- Standardization: All images are padded/scaled to a uniform `128×2048` size to prevent cropping of wide handwritten images while maintaining aspect ratios.
- Stochastic Degradation: Synthetic noise (Gaussian blur, salt & pepper, low contrast, affine skew) is applied on-the-fly to train the restoration model.
- Splitting: 70% Train / 15% Val / 15% Test with zero data leakage.
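The standardization step can be illustrated with a small geometry helper. This is a minimal sketch, not the code in `preprocessing.py`: the function name `fit_to_canvas` is hypothetical, and centering the resized image on the canvas is an assumption (the project may align the text differently, for example to the right edge for right-to-left script).

```python
def fit_to_canvas(height, width, target_h=128, target_w=2048):
    """Return (new_h, new_w, pad_top, pad_left) for an aspect-preserving fit.

    The image is scaled by a single factor so that it fits inside the
    target canvas, then the leftover space is split as padding.
    ASSUMPTION: padding is centered; the real pipeline may differ.
    """
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")

    # One scale factor for both axes keeps the aspect ratio intact.
    scale = min(target_h / height, target_w / width)
    new_h = max(1, min(target_h, round(height * scale)))
    new_w = max(1, min(target_w, round(width * scale)))

    # Remaining space becomes padding instead of cropping the image.
    pad_top = (target_h - new_h) // 2
    pad_left = (target_w - new_w) // 2
    return new_h, new_w, pad_top, pad_left
```

A very wide line is limited by the canvas width and padded vertically, while a short line is limited by the canvas height and padded horizontally; in neither case is any part of the image cropped.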
- Architecture: Pre-trained `ResNet-34` U-Net (via `segmentation-models-pytorch`).
- Goal: Take noisy, degraded, or skewed document images and reconstruct clean, high-contrast text.
- Metrics: Mean Squared Error (MSE), PSNR, and SSIM.
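For reference, PSNR follows directly from MSE. The sketch below uses plain Python lists of pixel values in `[0, 1]` to keep it dependency-free; the function names are illustrative, and the project presumably computes these metrics on tensors instead.

```python
import math


def mse(restored, target):
    """Mean squared error between two equally sized pixel sequences."""
    if len(restored) != len(target) or not restored:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((r - t) ** 2 for r, t in zip(restored, target)) / len(restored)


def psnr(restored, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    err = mse(restored, target)
    if err == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / err)
```

Higher PSNR means the restored image is closer to the clean target. SSIM is not shown here because it needs windowed local statistics rather than a single global error.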
- Architecture:
  - Encoder: Custom 7-block CNN Backbone that extracts spatial features, pooling the `128×2048` image down to a sequence of `128` tokens (`d_model=256`).
  - Decoder: 6-layer Transformer (3 Encoder / 3 Decoder) utilizing standard sinusoidal positional encoding and self-attention.
- Training Strategy: Implements a source-weighted Cross-Entropy loss (UHWR samples are weighted 3.0×, MMU-OCR-21 weighted 1.0×) to force the model to prioritize difficult handwritten text despite class imbalance.
- Decoding: Character-level Beam Search with length penalty.
- Vocabulary: 173 tokens covering Urdu characters, digits, punctuation, and control tokens (`<PAD>`, `<SOS>`, `<EOS>`, `<UNK>`).
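The source-weighted loss described under Training Strategy can be sketched in plain Python. Only the two weights (3.0 and 1.0) come from this README; the function names, the batch layout, and the choice to normalize by the sum of weights are assumptions, and padding tokens are not handled. The actual loss in `ocr_trainer.py` may be structured differently.

```python
import math

# Weights stated in this README; the dictionary itself is illustrative.
SOURCE_WEIGHTS = {"UHWR": 3.0, "MMU-OCR-21": 1.0}


def token_cross_entropy(logits, target):
    """Cross-entropy for one token, computed from raw logits."""
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[target]


def source_weighted_loss(batch):
    """Weighted mean of per-sample losses.

    batch: list of (source, logits_per_token, target_ids).
    ASSUMPTION: the batch loss is normalized by the sum of weights.
    """
    total, weight_sum = 0.0, 0.0
    for source, logits_seq, targets in batch:
        per_token = [token_cross_entropy(l, t) for l, t in zip(logits_seq, targets)]
        sample_loss = sum(per_token) / len(per_token)
        weight = SOURCE_WEIGHTS[source]
        total += weight * sample_loss
        weight_sum += weight
    return total / weight_sum
```

With these weights, an error on a handwritten sample moves the batch loss three times as much as the same error on a printed sample, which is what pushes the optimizer toward the harder UHWR data.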
To run this project locally, ensure you have Python installed, then set up the PyTorch environment:
# 1. Create and activate a virtual environment (Windows shown; on Linux/macOS use: source torch-env/bin/activate)
python -m venv torch-env
torch-env\Scripts\activate
# 2. Upgrade pip
python -m pip install --upgrade pip
# 3. Install PyTorch with CUDA 12.1 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# 4. Install remaining project dependencies
pip install -r requirements.txt

`benchmark.py` compares this project against Tesseract (`pytesseract`, language code `urd`).
To install Tesseract and ensure the Urdu language pack is present, run the included Windows script from the project root:
.\install_tesseract_urdu.bat

If the script reports `urd` in `tesseract --list-langs`, the Tesseract baseline is ready for comparison.
If you plan to run or modify `project-notebook.ipynb`, you can register the environment as a Jupyter kernel:
pip install ipykernel
python -m ipykernel install --user --name=torch-env --display-name "Python (torch-env)"

If you want to train the OCR model on a powerful cloud GPU without the memory overhead of the full project exploration pipeline, you can use the standalone `train_ocr_only.ipynb` notebook. This headless notebook features automatic resumption from checkpoints and supports `PATH_MAPPING` to resolve nested directories on Kaggle without needing to modify your generated CSV splits.
├── splits/ # Contains generated train/val/test CSVs
├── checkpoints/ # Saved model weights (.pth) and vocab.json
├── models/
│ ├── vocab.py # Vocabulary builder and tokenization
│ ├── restoration_model.py # U-Net architecture & builder
│ ├── ocr_model.py # CNN + Transformer architecture
│ ├── ocr_trainer.py # Custom training loops & weighted loss
│ └── pipeline.py # End-to-End inference pipeline wrapper
├── datasets/
│ ├── restoration_dataset.py # On-the-fly degradation dataloader
│ └── ocr_dataset.py # Sequence padding & source-weight dataloader
├── preprocessing.py # Standardization and augmentation pipelines
├── benchmark.py # Benchmarking script to evaluate CER/WER against Tesseract
├── install_tesseract_urdu.bat # Installs Tesseract + Urdu language pack for baseline comparison
├── project-notebook.ipynb # Main orchestrator for training & visualization
└── requirements.txt # Dependencies
The FastAPI server exposes both the Hadoop-backed endpoints (/api/...) and the direct RunPod endpoints (/api/DL/...). To run locally without Docker:
1. Configure credentials — edit deployment/hadoop/.env:
RUNPOD_API_KEY=<your key>
RUNPOD_ENDPOINT_ID=<your endpoint id>
2. Create venv, install dependencies, and start the server:
# From the project root
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
cd deployment/hadoop
PYTHONPATH=../.. python -m uvicorn app:app --host 0.0.0.0 --port 8000

The server starts at http://localhost:8000. Interactive API docs are available at http://localhost:8000/docs.
DL endpoints (no Hadoop required):
| Method | Path | Description |
|---|---|---|
| POST | `/api/DL/process-batch` | Upload a `.zip` of images; returns `job_id` |
| GET | `/api/DL/jobs/{job_id}/status` | Poll job state and progress |
| GET | `/api/DL/jobs/{job_id}/results` | Fetch recognized text per image |
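The endpoints above form a submit, poll, fetch flow. The polling step could be scripted roughly as follows. This is a sketch under assumptions: the response schema is not documented here, so the `"status"` field and the `"completed"` / `"failed"` values are guesses that should be checked against http://localhost:8000/docs, and the helper names are hypothetical.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"


def get_json(path):
    """GET a path on the local server and decode the JSON body."""
    with urllib.request.urlopen(BASE_URL + path) as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch=get_json, poll_seconds=2.0, timeout_seconds=120.0,
                 done_states=("completed", "failed"), sleep=time.sleep):
    """Poll the status endpoint until the job reaches a terminal state.

    ASSUMPTION: the status payload has a "status" field whose terminal
    values are listed in done_states; adjust to the real schema.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch(f"/api/DL/jobs/{job_id}/status")
        if status.get("status") in done_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in {timeout_seconds}s")
        sleep(poll_seconds)
```

Once `wait_for_job` returns, the recognized text can be fetched with `get_json(f"/api/DL/jobs/{job_id}/results")`. Passing `fetch` and `sleep` as parameters keeps the loop testable without a running server.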
Optional env vars for tuning:
| Variable | Default | Description |
|---|---|---|
| `OCR_MODE` | `real` | Set to `mock` to skip RunPod calls during testing |
| `DL_MAX_WORKERS` | `4` | Parallel document threads |
| `DL_RUNPOD_TIMEOUT_SECONDS` | `120` | Max wait per RunPod job |
| `DL_RUNPOD_POLL_INTERVAL_SECONDS` | `2` | Polling interval |
| `DL_RUNPOD_MAX_RETRIES` | `3` | Retries on transient API errors |
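As an illustration of how these variables and their defaults fit together, the sketch below reads them into a small settings object. The variable names and defaults are taken from the table; the `DLSettings` class and `load_settings` function are hypothetical and are not the server's actual configuration code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DLSettings:
    """Hypothetical container for the tuning variables in the table."""
    ocr_mode: str
    max_workers: int
    runpod_timeout_seconds: float
    runpod_poll_interval_seconds: float
    runpod_max_retries: int


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default), applying defaults."""
    env = os.environ if env is None else env
    return DLSettings(
        ocr_mode=env.get("OCR_MODE", "real"),
        max_workers=int(env.get("DL_MAX_WORKERS", "4")),
        runpod_timeout_seconds=float(env.get("DL_RUNPOD_TIMEOUT_SECONDS", "120")),
        runpod_poll_interval_seconds=float(env.get("DL_RUNPOD_POLL_INTERVAL_SECONDS", "2")),
        runpod_max_retries=int(env.get("DL_RUNPOD_MAX_RETRIES", "3")),
    )
```

For local testing without RunPod credentials, `load_settings({"OCR_MODE": "mock"})` mirrors setting `OCR_MODE=mock` in the environment.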
Example request:
curl -X POST http://localhost:8000/api/DL/process-batch \
-F "file=@test_batch.zip"
# returns: {"job_id": "dl-a1b2c3d4", "message": "Batch processing started."}
curl http://localhost:8000/api/DL/jobs/dl-a1b2c3d4/status
curl http://localhost:8000/api/DL/jobs/dl-a1b2c3d4/results

To run the full stack with Docker:
docker-compose up --build

Services: FastAPI on :8000, YARN RM UI on :8088, HDFS NameNode UI on :9870, History Server on :8188.
Use the serverless inference image in deployment/inference/:
docker buildx build \
--platform linux/amd64 \
-f deployment/inference/Dockerfile \
-t <registry>/<repo>:runpod-serverless \
--push .

Set the RunPod worker environment:
DEVICE=cuda
RESTORATION_CKPT=/app/checkpoints/best_restoration_model.pth
OCR_CKPT=/app/checkpoints/best_ocr_model.pth
VOCAB_PATH=/app/checkpoints/vocab.json
Notes:
- Build with `--platform linux/amd64` for RunPod workers.
- Keep OCR checkpoint and `vocab.json` from the same training run.
- Loaders support raw `state_dict`, wrapped checkpoints (for example `model_state_dict`), and `module.`-prefixed keys.
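The three checkpoint layouts mentioned in the last note can be normalized with a few lines of dictionary handling. This is a sketch of the idea rather than the project's loader: `extract_state_dict` is a hypothetical name, and it operates on plain dictionaries so that it can be read without PyTorch installed.

```python
def extract_state_dict(checkpoint, wrapper_keys=("model_state_dict",)):
    """Return a flat state dict from a raw or wrapped checkpoint.

    Handles three layouts: a raw state dict, a dict that wraps the
    weights under one of wrapper_keys, and keys carrying the "module."
    prefix that torch.nn.DataParallel adds when saving.
    """
    state = checkpoint
    for key in wrapper_keys:
        # Unwrap if the weights are nested under a known wrapper key.
        if key in state and isinstance(state[key], dict):
            state = state[key]
            break

    prefix = "module."
    return {
        (name[len(prefix):] if name.startswith(prefix) else name): value
        for name, value in state.items()
    }
```

In a real loader the input would typically come from `torch.load(path, map_location="cpu")`, and the result would be passed to the model's `load_state_dict`.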