## MonReader - part 3

----

### Multimodal OCR (No Pre-processing): VLM vs. Tesseract

**Objective.**  
Evaluate a **Vision–Language Model (VLM)** that performs OCR **directly from raw page images** (no deskew, no binarization, no line/word segmentation). We’ll later compare its verbatim transcription quality against **Tesseract** on the same pages.

We use two sources:
- *The Chamber* — John Grisham *(English)*
- *A onda que se ergueu no mar* — Ruy Castro *(Portuguese)*

**Why this experiment.**  
VLMs can read document text straight from RGB photos by leveraging learned visual invariances (rotation, lighting, curvature). The aim is to measure how far a “no-preprocessing” VLM can go versus a classical pipeline, and to identify the situations where simple conditioning (e.g., deskew) still helps.

**Minimal Pipeline Overview (this part).**  

1. **F – VLM (raw)**: Feed the original page photo to the model with a *verbatim transcription* prompt; capture JSON output `{language, lines}` and a `.txt` view.  
2. **G – Compare**: Compute CER/WER against gold text (and Tesseract), plus latency and error tags.
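
The CER/WER metrics named in step G can be sketched as below. This is a minimal, dependency-free illustration rather than the scoring code used in this project: the helper names `levenshtein`, `cer`, and `wer` are hypothetical, and it assumes whitespace tokenization for words with no text normalization (case, accents, and punctuation all count as errors).

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost == 0
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)
```

For example, `cer("cílio", "cilio")` counts one substituted character out of five, which is why accent handling should be decided before comparing engines on the Portuguese pages.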

> In this first section we only set up the dataset, verify image quality, and prepare folders for the VLM run, **no pre-processing**.


----


#### Imports and Environment

In [1]:
from pathlib import Path
import shutil
import numpy as np
import pandas as pd
from PIL import Image
import cv2
import matplotlib.pyplot as plt


In [2]:
BASE = Path.cwd()
DATA_DIR = BASE / "data"
BOOK_DIR = DATA_DIR / "books"
WORK_DIR = BASE / "work"

ENG_BOOK_DIR = BOOK_DIR / "The_Chamber-John_Grisham"
POR_BOOK_DIR = BOOK_DIR / "A_onda_que_se_ergueu_no_mar-Ruy_Castro"

ENG_IMG_DIR = ENG_BOOK_DIR / "images"
POR_IMG_DIR = POR_BOOK_DIR / "images"

for p in [BOOK_DIR, WORK_DIR, ENG_BOOK_DIR, POR_BOOK_DIR, ENG_IMG_DIR, POR_IMG_DIR]:
    p.mkdir(parents=True, exist_ok=True)


----

### Step F — VLM OCR (GGUF local, no pre-processing)

**Goal.**  
Use a **quantized GGUF** build of *Llama 3.2-Vision Instruct* to transcribe book-page photos directly (no deskew, no binarization).  
We start with a **single-image smoke test**, then we’ll scale to the full dataset.

**Why GGUF?**  
GGUF files are pre-quantized, self-contained weights that run efficiently on local GPUs through the `llama.cpp` engine (used by LM Studio and Ollama).  

They trade a small amount of accuracy for large VRAM savings, which suits an 11 GB card such as the GTX 1080 Ti.
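
A back-of-the-envelope estimate makes the VRAM argument concrete. The sketch below assumes the 11B-parameter variant of the model and counts the weights only; it ignores the KV cache, activations, and the fact that real GGUF quantization types use slightly more than their nominal bits per weight. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 11e9  # assumption: the 11B variant of Llama 3.2-Vision

for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the FP16 weights alone (about 22 GB) would not fit in 11 GB of VRAM, while a 4-bit build (about 5.5 GB) leaves headroom for the image encoder and context.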



In [3]:
import base64
import json
import time
from pathlib import Path

import requests

### Setting up Ollama for Local Multimodal Inference

Before running the OCR prompting steps, we first set up **Ollama**, a lightweight local engine for running quantized large language and vision models (GGUF format) efficiently on consumer GPUs.

**Installation**
1. Go to [https://ollama.com/download](https://ollama.com/download)
2. Download and install the correct version depending on your OS.
3. Open a terminal and verify the installation:
   ```bash
   ollama --version
   ollama list
   ```
4. Pull the multimodal model:
   ```bash
   ollama pull llama3.2-vision
   ```


#### Prompt design

We’ll use a **verbatim OCR prompt**. The model must output text *exactly* as it appears, with preserved line breaks and punctuation.  
We ask for JSON to keep parsing simple.
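
As a possible variation on this design, Ollama's `/api/generate` endpoint accepts a separate `system` field and a `format` field that constrains the reply to valid JSON. The sketch below only builds such a payload; it has not been run in this notebook, the helper name `build_payload` and the prompt wording are hypothetical, and `temperature: 0` is an assumption aimed at more repeatable transcriptions.

```python
import json


def build_payload(model: str, image_b64: str, num_predict: int = 2048) -> dict:
    """Build an /api/generate request body that asks for JSON-only output."""
    system = (
        "You are an OCR transcriber. Output exactly the text you see. "
        "Preserve line breaks and punctuation."
    )
    prompt = (
        "Transcribe this image verbatim. Return JSON with two keys: "
        '"language" (one of "eng", "por", "guess") and '
        '"lines" (a list of strings, one per printed line).'
    )
    return {
        "model": model,
        "system": system,          # role instructions, kept apart from the task
        "prompt": prompt,          # the task itself
        "images": [image_b64],     # base64-encoded image data
        "format": "json",          # ask Ollama to constrain output to valid JSON
        "stream": False,
        "options": {"num_predict": num_predict, "temperature": 0},
    }


# Example with a placeholder image string (not a real image).
example = build_payload("llama3.2-vision", "AAAA")
print(json.dumps(example, indent=2)[:200])
```

Spelling out the allowed language values in prose, instead of embedding `eng|por|guess` inside a JSON template, is meant to reduce the chance that the model copies the placeholder verbatim.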


In [4]:
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.2-vision"
# Reuses POR_IMG_DIR so the notebook does not depend on a machine-specific absolute path.
IMG = POR_IMG_DIR / "pag12.JPEG"
assert IMG.exists(), f"Image not found {IMG}"
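
Before sending an image, it can help to confirm that the Ollama server is reachable and that the model has been pulled. The sketch below queries the `/api/tags` endpoint, which lists local models. It uses `urllib` from the standard library to stay dependency-free, assumes the default port already used in `OLLAMA_URL`, and the helper names `fetch_tags` and `model_is_listed` are hypothetical.

```python
import json
import urllib.error
import urllib.request

TAGS_URL = "http://localhost:11434/api/tags"  # assumption: default Ollama port


def model_is_listed(tags: dict, model: str) -> bool:
    """True if `model` (with or without a ':tag' suffix) is in an /api/tags reply."""
    names = [m.get("name", "") for m in tags.get("models", [])]
    return any(n == model or n.split(":")[0] == model for n in names)


def fetch_tags(url: str = TAGS_URL, timeout: float = 5.0):
    """Return the parsed /api/tags reply, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as reply:
            return json.load(reply)
    except (urllib.error.URLError, OSError, ValueError):
        return None


# Offline illustration with a hand-written reply of the expected shape.
sample = {"models": [{"name": "llama3.2-vision:latest"}]}
print(model_is_listed(sample, "llama3.2-vision"))
```

In the notebook, this would be used as `tags = fetch_tags()` followed by a check that `tags is not None and model_is_listed(tags, MODEL)`.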


In [5]:
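def b64_image(p: Path) -> str:
    """Return the file at `p` as a base64-encoded ASCII string."""
    # read_bytes() opens and closes the file, so no handle is left open
    return base64.b64encode(p.read_bytes()).decode("ascii")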

In [6]:
SYSTEM_PROMPT = (
    "You are an OCR transcriber. Output exactly the text you see. "
    "Preserve line breaks and punctuation. "
    "Return ONLY valid JSON with keys {\"language\":\"eng|por|guess\",\"lines\":[\"...\"]}. "
    "Transcribe this image verbatim."
)

# SYSTEM_PROMPT = (
#     "You are an OCR transcriber. Return ONLY valid JSON:\n"
#     '{"language":"eng|por|guess","lines":["..."]}\n'
#     "Transcribe the image verbatim. Preserve line breaks and punctuation."
# )


In [7]:
num_predict_values = [4096, 2048, 1024, 512]

In [12]:
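payload = {
    "model": MODEL,
    "prompt": SYSTEM_PROMPT,  # sent in the user "prompt" field, despite the variable name
    "images": [b64_image(IMG)],  # base64-encoded image data
    "stream": False,  # return a single JSON object instead of a token stream
    "options": {"num_predict": 4096},  # cap on generated tokens; overridden in the sweep below
}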


In [None]:
resps = []
for num_predict in num_predict_values:
    payload["options"]["num_predict"] = num_predict
    t0 = time.perf_counter()
    resp = requests.post(OLLAMA_URL, json=payload, timeout=600)
    lat = time.perf_counter() - t0  # wall-clock latency of the request only
    resps.append(resp)
    print("HTTP", resp.status_code, f"{num_predict=} {lat:.1f}s")
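
Wall-clock latency alone does not show whether a run stopped because it reached the `num_predict` cap. The sketch below reads the metadata that Ollama returns with a non-streamed reply (`eval_count`, and `eval_duration` in nanoseconds). It assumes that `done_reason` is reported as `"length"` when generation stops at the token cap, which should be checked against the installed Ollama version. The helper name `summarize_run` is hypothetical.

```python
def summarize_run(data: dict, num_predict: int) -> dict:
    """Extract token count, decoding speed, and a truncation flag from a reply."""
    eval_count = data.get("eval_count")   # number of generated tokens
    eval_ns = data.get("eval_duration")   # generation time in nanoseconds
    tokens_per_s = None
    if eval_count and eval_ns:
        tokens_per_s = eval_count / (eval_ns / 1e9)
    # Assumption: done_reason == "length" marks a stop at the token cap;
    # the eval_count comparison is a fallback if the field is absent.
    truncated = data.get("done_reason") == "length" or (
        eval_count is not None and eval_count >= num_predict
    )
    return {
        "num_predict": num_predict,
        "eval_count": eval_count,
        "tokens_per_s": tokens_per_s,
        "truncated": truncated,
    }
```

Applied to the sweep, this would be `summarize_run(resp.json(), n)` for each `n, resp` pair in `zip(num_predict_values, resps)`. A truncated run usually also leaves the JSON unterminated, which is one reason the parser may fall back to plain text.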

In [None]:
# print the responses for each 'num_predict' value

for resp in resps:
    if resp.status_code != 200:
        print("Error:", resp.text[:1000])
    else:
        data = resp.json()
        text = data.get("response", "")

        # Extract the JSON object if the model adds extra characters (like a trailing ".")
        start = text.find("{")
        end = text.rfind("}")
        candidate = text[start:end+1] if (start != -1 and end != -1 and end > start) else text

        try:
            js = json.loads(candidate)
            lang = js.get("language", "guess")
            lines = js.get("lines", [])
            if isinstance(lines, str):  # just in case it returns a single string
                lines = lines.splitlines()
        except Exception:
            # fallback: treat as plain text
            lang = "guess"
            lines = text.splitlines()

        print(f"Detected language: {lang} | Lines: {len(lines)}")
        print("\n".join(lines[:60]))
        print(80*"=")



Detected language: eng|por|guess | Lines: 2
A trilha sonora de um país ideal
Iha que coisa mais linda: as garotas de Ipanema-1961 tomavam cuba-libre, dirigiam Kharman-Ghas e voavam pela Panair. Usavam frasqueira, vestido-tubinho, cilio postico, peruca, laque. Diziam-se existencialistas, adoravam arte abstrata e não perdiam um filme da Nouvelle Vague. Seus pontos eram o Beco das Garrafas, a Cinemateca, o Arpoador. Iam a praia com a camisa social do irmão e, sob esta, um biquini que, de tão insolente, fazia o sangue dos rapazes ferver da maneira mais inconveniente. Tudo isso passou. A querida Panair nunca mais voou, a Nouvelle Vague é um filme em preto e branco e ninguém mais toma cuba-libre — quem pensaria hoje em misturar rum com Coca-Cola? Quanto a quele biquini, era mesmo insolente, em-bora, por padres subsequentes, sua calça contivesse pano para fabricar dois ou três para-ques. Dito assim, é como se, em 1961, o céu do Brasil ainda fosse povoado por pterodáctilos. Mas, há uma exceção
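
The output above shows the model echoing the placeholder `eng|por|guess` as the language value. A small post-processing step can normalize that field and write the `{language, lines}` JSON together with the `.txt` view described in step F. This is a minimal sketch: the helper names and the output layout (one `.json` and one `.txt` per page) are assumptions, not the project's established format.

```python
import json
from pathlib import Path

VALID_LANGS = {"eng", "por"}


def normalize_language(lang) -> str:
    """Map anything other than 'eng' or 'por' (e.g. an echoed placeholder) to 'guess'."""
    lang = str(lang).strip().lower()
    return lang if lang in VALID_LANGS else "guess"


def save_transcription(lines, lang, out_dir: Path, stem: str):
    """Write <stem>.json and <stem>.txt into out_dir and return both paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    record = {"language": normalize_language(lang), "lines": list(lines)}
    json_path = out_dir / f"{stem}.json"
    txt_path = out_dir / f"{stem}.txt"
    # ensure_ascii=False keeps accented Portuguese characters readable in the file
    json_path.write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    txt_path.write_text("\n".join(record["lines"]) + "\n", encoding="utf-8")
    return json_path, txt_path
```

In this notebook the call could look like `save_transcription(lines, lang, WORK_DIR / "vlm_raw", IMG.stem)`, where the `vlm_raw` folder name is hypothetical.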