# Arabic OCR Correction — Kaggle Inference

Run LLM inference for Arabic OCR correction on a Kaggle GPU kernel.

**Three-step workflow:**
1. **Local** — `python pipelines/run_phase2.py --mode export` → upload `inference_input.jsonl` to a Kaggle dataset
2. **Here** — run this notebook (resumes automatically on session timeout)
3. **Local** — `python pipelines/run_phase2.py --mode analyze`

**Before running:**
- Set `REPO_URL` in Cell 2 to your GitHub repo
- Set `HF_REPO` in Cell 3 to your HuggingFace dataset repo (for cross-session sync)
- Add `HF_TOKEN` as a Kaggle secret (Add-ons → Secrets) and attach it to this notebook; pasting the token directly into Cell 3 also works but is not recommended
- Enable **GPU T4 x2** accelerator and **Internet**
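
The `HF_TOKEN` item above can be sketched as a small helper. Kaggle exposes secrets through the `kaggle_secrets` module rather than as environment variables. The environment-variable fallback (for local runs) and the name `resolve_hf_token` are assumptions for illustration, not part of the project.

```python
import os


def resolve_hf_token(name: str = "HF_TOKEN") -> str:
    """Return the token from Kaggle secrets, else the environment, else ""."""
    try:
        # kaggle_secrets is only importable inside a Kaggle kernel.
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(name)
    except Exception:
        # Not on Kaggle, or the secret is not attached to this notebook.
        return os.environ.get(name, "")
```

An empty return value means no token was found, so it is worth checking before passing it to `--hf-token`.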

In [None]:
# Cell 1 — Install dependencies
!pip install transformers accelerate huggingface_hub pyyaml tqdm -q
# Uncomment for 4-bit quantization (needed on P100 or low-VRAM T4):
# !pip install bitsandbytes -q

In [None]:
# Cell 2 — Clone the project repo
# For private repos: https://YOUR_TOKEN@github.com/USERNAME/Arabic-Post-OCR-Correction.git
REPO_URL = "https://github.com/YOUR_USERNAME/Arabic-Post-OCR-Correction.git"
PROJECT_DIR = "/kaggle/working/project"

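# Skip the clone when the directory already exists (e.g. when re-running this cell).
!test -d {PROJECT_DIR} || git clone {REPO_URL} {PROJECT_DIR}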

In [None]:
# Cell 3 — Run inference
# HF sync keeps progress safe across session timeouts — re-run this cell to resume.
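from kaggle_secrets import UserSecretsClient

HF_REPO  = "YOUR_HF_USERNAME/arabic-ocr-corrections"  # HuggingFace dataset repo
# Kaggle secrets are not exported as environment variables; read them via UserSecretsClient.
HF_TOKEN = UserSecretsClient().get_secret("HF_TOKEN")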

# Adjust --input to match your Kaggle dataset name
!python {PROJECT_DIR}/scripts/infer.py \
    --input  /kaggle/input/YOUR_DATASET_NAME/inference_input.jsonl \
    --output /kaggle/working/corrections.jsonl \
    --model  Qwen/Qwen3-4B-Instruct-2507 \
    --hf-repo  {HF_REPO} \
    --hf-token {HF_TOKEN} \
    --sync-every 100
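
Resuming after a timeout generally means skipping records that already have a correction in the output file. The sketch below illustrates that idea only; it is not the logic in `scripts/infer.py`, and the `sample_id` field name and both helper names are hypothetical.

```python
import json
from pathlib import Path


def load_done_ids(output_path, key="sample_id"):
    """Collect the IDs already present in a partial JSONL output file."""
    path = Path(output_path)
    done = set()
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)[key])
            except (json.JSONDecodeError, KeyError, TypeError):
                # A timeout can leave a truncated last line; skip it.
                continue
    return done


def pending_records(records, done_ids, key="sample_id"):
    """Yield only the input records that still need inference."""
    for record in records:
        if record[key] not in done_ids:
            yield record
```

Under these assumptions, a record whose output line was cut off mid-write is treated as not done and is processed again on the next run.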

In [None]:
# Cell 4 — (Optional) Run a quick smoke test first (50 samples from KHATT-train)
# Uncomment and run this before the full run in Cell 3 to validate the setup.
# !python {PROJECT_DIR}/scripts/infer.py \
#     --input  /kaggle/input/YOUR_DATASET_NAME/inference_input.jsonl \
#     --output /kaggle/working/corrections_test.jsonl \
#     --model  Qwen/Qwen3-4B-Instruct-2507 \
#     --datasets KHATT-train --limit 50