# Arabic OCR Correction — Colab Inference

Run LLM inference for Arabic OCR correction on a Google Colab GPU.

**Three-step workflow:**
1. **Local** — `python pipelines/run_phase2.py --mode export` → upload `inference_input.jsonl` to Google Drive
2. **Here** — run this notebook (output goes directly to Drive — survives disconnects)
3. **Local** — download `corrections.jsonl` from Drive, then `python pipelines/run_phase2.py --mode analyze`
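
Before moving on to step 3, it can help to confirm that the run actually finished. The sketch below compares record counts in the two JSONL files. It assumes that `scripts/infer.py` writes one output line per input record, which this notebook does not confirm, and the helper names `count_records` and `progress` are hypothetical.

```python
def count_records(path):
    """Count non-blank lines in a JSONL file; a missing file counts as 0."""
    try:
        with open(path, encoding="utf-8") as f:
            return sum(1 for line in f if line.strip())
    except FileNotFoundError:
        return 0


def progress(input_path, output_path):
    """Return (done, total), assuming one output line per input record."""
    return count_records(output_path), count_records(input_path)


# Paths match the Drive layout used in this notebook.
DRIVE_DIR = "/content/drive/MyDrive/arabic-ocr"
done, total = progress(
    f"{DRIVE_DIR}/inference_input.jsonl",
    f"{DRIVE_DIR}/corrections.jsonl",
)
print(f"{done}/{total} records corrected")
```

If the two numbers differ, re-running Cell 3 should pick up where the previous run stopped, per the resume behaviour noted in that cell.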

**Before running:**
- Upload `inference_input.jsonl` to `MyDrive/arabic-ocr/` in Google Drive
- Set `REPO_URL` in Cell 2 to your GitHub repo
- Select **Runtime → Change runtime type → GPU (T4)**
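
The checklist above can also be verified in code once Drive is mounted (after Cell 1). This is a minimal sketch, not part of the project: `preflight` is a hypothetical helper, and it treats the presence of `nvidia-smi` on the PATH as a rough proxy for a GPU runtime.

```python
import os
import shutil


def preflight(input_path):
    """Return a list of problems found; an empty list means ready to run."""
    problems = []
    if not os.path.isfile(input_path):
        problems.append(f"missing input file: {input_path}")
    elif os.path.getsize(input_path) == 0:
        problems.append(f"input file is empty: {input_path}")
    # Assumption: GPU runtimes expose nvidia-smi on the PATH.
    if shutil.which("nvidia-smi") is None:
        problems.append("no GPU runtime detected (nvidia-smi not on PATH)")
    return problems


for problem in preflight("/content/drive/MyDrive/arabic-ocr/inference_input.jsonl"):
    print("WARNING:", problem)
```

No output means both checks passed. Each warning corresponds to one of the bullets above.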

In [None]:
# Cell 1 — Mount Drive and install dependencies
from google.colab import drive
drive.mount('/content/drive')

!pip install transformers accelerate huggingface_hub pyyaml tqdm -q

In [None]:
# Cell 2 — Clone the project repo
# For private repos: https://YOUR_TOKEN@github.com/USERNAME/Arabic-Post-OCR-Correction.git
REPO_URL = "https://github.com/YOUR_USERNAME/Arabic-Post-OCR-Correction.git"
PROJECT_DIR = "/content/project"

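# Skip the clone if the directory already exists, so re-running this cell does not fail
![ -d {PROJECT_DIR} ] || git clone {REPO_URL} {PROJECT_DIR}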

In [None]:
# Cell 3 — Run inference (output to Drive — resumes automatically on reconnect)
DRIVE_DIR = "/content/drive/MyDrive/arabic-ocr"

!python {PROJECT_DIR}/scripts/infer.py \
    --input  {DRIVE_DIR}/inference_input.jsonl \
    --output {DRIVE_DIR}/corrections.jsonl \
    --model  Qwen/Qwen3-4B-Instruct-2507

In [None]:
# Cell 4 — (Optional) Add HF sync as a backup in addition to Drive
# import os
# HF_REPO  = "YOUR_HF_USERNAME/arabic-ocr-corrections"
# HF_TOKEN = os.environ.get("HF_TOKEN", "")  # read from the environment; avoid pasting a real token into the notebook
#
# !python {PROJECT_DIR}/scripts/infer.py \
#     --input  {DRIVE_DIR}/inference_input.jsonl \
#     --output {DRIVE_DIR}/corrections.jsonl \
#     --model  Qwen/Qwen3-4B-Instruct-2507 \
#     --hf-repo  {HF_REPO} \
#     --hf-token {HF_TOKEN} \
#     --sync-every 50