## MonReader - part 3

----

### Multimodal OCR (No Pre-processing): VLM vs. Tesseract

**Objective.**  
Evaluate a **Vision–Language Model (VLM)** that performs OCR **directly from raw page images** (no deskew, no binarization, no line/word segmentation). We’ll later compare its verbatim transcription quality against **Tesseract** on the same pages.

We use two sources:
- *The Chamber* — John Grisham *(English)*
- *A onda que se ergueu no mar* — Ruy Castro *(Portuguese)*

**Why this experiment.**  
VLMs can read document text straight from RGB photos by leveraging learned visual invariances (rotation, lighting, curvature). The aim is to measure how far a “no-preprocessing” VLM can go versus a classical pipeline, and to identify the situations where simple conditioning (e.g., deskew) still helps.

**Minimal Pipeline Overview (this part).**  

1. **F – VLM (raw)**: Feed the original page photo to the model with a *verbatim transcription* prompt; capture JSON output `{language, lines}` and a `.txt` view.  
2. **G – Compare**: Compute CER/WER against gold text (and Tesseract), plus latency and error tags.
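
Step G relies on character and word error rates. A minimal sketch of both follows, assuming a plain Levenshtein distance normalized by the reference length. The names `levenshtein`, `cer`, and `wer` are illustrative, and no text normalization (case, diacritics, whitespace) is applied here.

```python
from typing import Sequence


def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length in characters."""
    return levenshtein(ref, hyp) / max(len(ref), 1)


def wer(ref: str, hyp: str) -> float:
    """Word error rate: edit distance over reference length in whitespace tokens."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, which is worth keeping in mind if a model repeats its output.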

> In this first section we only set up the dataset, verify image quality, and prepare folders for the VLM run. **No pre-processing** is applied.


----


#### Imports and Environment

In [1]:
from pathlib import Path
import shutil
import numpy as np
import pandas as pd
from PIL import Image
import cv2
import matplotlib.pyplot as plt


In [2]:
BASE = Path.cwd()
DATA_DIR = BASE / "data"
BOOK_DIR = DATA_DIR / "books"
WORK_DIR = BASE / "work"

ENG_BOOK_DIR = BOOK_DIR / "The_Chamber-John_Grisham"
POR_BOOK_DIR = BOOK_DIR / "A_onda_que_se_ergueu_no_mar-Ruy_Castro"

ENG_IMG_DIR = ENG_BOOK_DIR / "images_lr"
POR_IMG_DIR = POR_BOOK_DIR / "images_lr"

for p in [BOOK_DIR, WORK_DIR, ENG_BOOK_DIR, POR_BOOK_DIR, ENG_IMG_DIR, POR_IMG_DIR]:
    p.mkdir(parents=True, exist_ok=True)
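
Once the folders exist, a quick inventory helps confirm which pages are present and in what order. The sketch below is only an illustration: `IMAGE_EXTS`, `natural_key`, and `list_pages` are hypothetical names, and the extension set is an assumption. It sorts file names naturally so that `pag2` comes before `pag12`.

```python
import re
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}  # assumed set of page-image extensions


def natural_key(p: Path):
    """Sort key that compares digit runs as integers ('pag2' before 'pag12')."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", p.stem)]


def list_pages(folder: Path) -> list[Path]:
    """Return the image files in `folder`, in natural page order."""
    files = [p for p in folder.iterdir()
             if p.is_file() and p.suffix.lower() in IMAGE_EXTS]
    return sorted(files, key=natural_key)
```

For example, `list_pages(POR_IMG_DIR)` would return the Portuguese page photos in reading order, assuming their names follow the `pagN` pattern.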


----

### Step F — VLM OCR (GGUF local, no pre-processing)

**Goal.**  
Use a **quantized GGUF** build of *Llama 3.2-Vision Instruct* to transcribe book-page photos directly (no deskew, no binarization).  
We start with a **single-image smoke test**, then we’ll scale to the full dataset.

**Why GGUF?**  
GGUF files are pre-quantized, self-contained weight files that run efficiently on local GPUs through the `llama.cpp` engine (used by LM Studio and Ollama).  

They trade a few points of accuracy for large VRAM savings, which suits an older consumer card such as the GTX 1080 Ti.
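
The VRAM saving can be estimated from the parameter count and the bits stored per weight. The sketch below is a back-of-the-envelope calculation, assuming an 11-billion-parameter model and roughly 4.5 bits per weight for a 4-bit quantization with its scale metadata. Both figures are illustrative assumptions, not measurements from this notebook, and the estimate ignores the KV cache, activations, and image-encoder buffers.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 11e9  # assumption: 11B-parameter variant of the model

for label, bits in [("fp16", 16), ("8-bit", 8), ("~4-bit", 4.5)]:
    print(f"{label:>7}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

Comparing the fp16 row with the 4-bit row shows why quantization decides whether the weights fit on a single consumer GPU at all.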



In [3]:
import base64
import json
import re
import time
from pathlib import Path

import requests

### Setting up Ollama for Local Multimodal Inference

Before running the OCR prompting steps, we first set up **Ollama**, a lightweight local engine for running quantized large language and vision models (GGUF format) efficiently on consumer GPUs.

**Installation**
1. Go to [https://ollama.com/download](https://ollama.com/download)
2. Download and install the correct version depending on your OS.
3. Open a Terminal and verify the installation:
   ```bash
   ollama --version
   ollama list
   ```
4. Pull the multimodal model:
   ```bash
   ollama pull llama3.2-vision
   ```


#### Prompt design

We’ll use a **verbatim OCR prompt**. The model must output text *exactly* as it appears, with preserved line breaks and punctuation.  
We ask for JSON to keep parsing simple.
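
Because the requested JSON has a small fixed shape (`language` plus `lines`), each response can be checked before it is scored. The sketch below shows one such check. `validate_ocr_json` is a hypothetical helper, and the accepted language codes are an assumption taken from the prompt used in this notebook.

```python
ALLOWED_LANGS = {"eng", "por", "guess"}  # codes requested in the prompt


def validate_ocr_json(obj) -> list[str]:
    """Return a list of schema problems; an empty list means the object is usable."""
    if not isinstance(obj, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    lang = obj.get("language")
    if not isinstance(lang, str) or lang not in ALLOWED_LANGS:
        problems.append(f"unexpected language: {lang!r}")
    lines = obj.get("lines")
    if not isinstance(lines, list):
        problems.append("'lines' is missing or not an array")
    elif not all(isinstance(x, str) for x in lines):
        problems.append("'lines' contains non-string items")
    return problems
```

Returning a list of problems rather than raising makes it easy to log every malformed response during a batch run and keep going.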


In [4]:
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.2-vision"
IMG = Path(r"E:\Devs\pyEnv-1\Apziva\MonReader\data\books\A_onda_que_se_ergueu_no_mar-Ruy_Castro\images_lr\pag12.JPEG")
assert IMG.exists(), f"Image not found {IMG}"
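
Before sending a page, it can help to confirm that the local server responds and that the model has been pulled. The sketch below queries Ollama's `/api/tags` endpoint, which lists local models. It uses the standard library rather than `requests`, and `TAGS_URL`, `model_in_tags`, and `ollama_has_model` are illustrative names. The address is assumed to be the default local one.

```python
import json
import urllib.request

TAGS_URL = "http://localhost:11434/api/tags"  # assumed default local address


def model_in_tags(tags: dict, model: str) -> bool:
    """True if `model` matches a local model name, with or without a ':tag' suffix."""
    names = [m.get("name", "") for m in tags.get("models", [])]
    return any(n == model or n.split(":")[0] == model for n in names)


def ollama_has_model(model: str, url: str = TAGS_URL, timeout: float = 5.0) -> bool:
    """Query the local server; return False if it is unreachable or replies oddly."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return model_in_tags(json.load(resp), model)
    except (OSError, ValueError):
        return False
```

A call such as `ollama_has_model(MODEL)` returning `False` suggests either that the server is not running or that `ollama pull` has not been executed yet.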


In [5]:
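def b64_image(p: Path) -> str:
    """Return the file's bytes as a base64-encoded ASCII string."""
    # read_bytes() opens and closes the file, so no handle is left open
    return base64.b64encode(p.read_bytes()).decode("ascii")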

In [6]:
# Version 0.2
# SYSTEM_PROMPT = (
#     "You are an OCR transcriber. Output exactly the text you see. "
#     "Preserve line breaks and punctuation. "
#     "Return ONLY valid JSON with keys {\"language\":\"eng|por|guess\",\"lines\":[\"...\"]}. "
#     "Transcribe this image verbatim."
# )

# Version 0.1
# SYSTEM_PROMPT = (
#     "You are an OCR transcriber. Return ONLY valid JSON:\n"
#     '{"language":"eng|por|guess","lines":["..."]}\n'
#     "Transcribe the image verbatim. Preserve line breaks and punctuation."
# )

# Current version (active); earlier drafts are kept above for reference
SYSTEM_PROMPT = """
You are an OCR transcriber.

Return ONLY one valid JSON object with keys:
- "language": one of ["eng","por","guess"]
- "lines": an array of strings, one per line in reading order

Rules:
- Do NOT repeat the JSON object.
- Do NOT include any text outside the single JSON object.
- Preserve line breaks and punctuation exactly as seen.
- If unsure about a character, copy it as best you can (do not explain).

Transcribe the image verbatim.
""".strip()




In [7]:
num_predict_values = [4096, 2048, 1024, 512]

In [None]:
resps = []

for num_predict in num_predict_values:
    t0 = time.time()
    
    payload = {
        "model": MODEL,
        "prompt": SYSTEM_PROMPT,
        "images": [b64_image(IMG)],
        "format": "json",
        "stream": True,
        "options": {
            "temperature": 0,
            "top_p": 1,
            "repeat_penalty": 1.15,
            "num_predict": num_predict,
        },
    }
    
    chunks = []
    status_code = None
    
    try:
        
        with requests.post(OLLAMA_URL, json=payload, stream=True, timeout=(10, 3600)) as r:
            status_code = r.status_code
            r.raise_for_status()
            
            for line in r.iter_lines(decode_unicode=True):
                if not line:
                    continue
                
                try:
                    obj = json.loads(line)
                except json.JSONDecodeError:
                    # If Ollama ever emits a non-JSON line, skip or log it
                    continue
                
                if "error" in obj and obj["error"]:
                    raise RuntimeError(f"Ollama error: {obj['error']}")
                
                chunks.append(obj.get("response", ""))
                
                if obj.get("done"):
                    break
                
        text = "".join(chunks)
        lat = time.time() - t0
        resps.append({"num_predict": num_predict, "latency_s": lat, "text": text})
        print("HTTP", status_code, f"{num_predict=} {lat:.1f}s")
        
    except Exception as e:
        lat = time.time() - t0
        print(f"FAILED {num_predict=} after {lat:.1f}s: {e}")
        resps.append({"num_predict": num_predict, "latency_s": lat, "text": None, "error": str(e)})
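
The final streamed chunk usually carries generation metadata in addition to `done`. The sketch below collects it, assuming the final object includes `done_reason`, `eval_count`, and `eval_duration` (in nanoseconds) as described in the Ollama API documentation. These fields may be absent in other versions, so every lookup falls back to `None`. `summarize_stream` is an illustrative helper that accepts any iterable of NDJSON lines.

```python
import json


def summarize_stream(lines):
    """Join 'response' fragments and pull metadata from the final chunk, if any."""
    chunks, final = [], {}
    for line in lines:
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if not isinstance(obj, dict):
            continue
        chunks.append(obj.get("response", ""))
        if obj.get("done"):
            final = obj
            break
    count, dur_ns = final.get("eval_count"), final.get("eval_duration")
    tok_per_s = count / dur_ns * 1e9 if count and dur_ns else None
    return {
        "text": "".join(chunks),
        "done_reason": final.get("done_reason"),
        "eval_count": count,
        "tokens_per_s": tok_per_s,
    }
```

If `done_reason` is reported as `length`, the `num_predict` cap was probably reached, which helps separate truncated transcriptions from complete ones when comparing the four settings.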




In [9]:

def extract_first_valid_json(text: str):
    """
    Find and parse the first valid JSON object embedded in text.
    Returns (obj, n_candidates) where:
      - obj is a dict if found, else None
      - n_candidates is how many {...} blocks we saw (rough proxy for repetition)
    """
    # Rough count of {...} blocks; non-greedy so repeated objects are counted separately.
    n_candidates = len(re.findall(r"\{.*?\}", text, flags=re.DOTALL))
    # Decode from each "{" so nested braces, or a "}" inside a string, do not cut the object short.
    decoder = json.JSONDecoder()
    for m in re.finditer(r"\{", text):
        try:
            obj, _ = decoder.raw_decode(text, m.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj, n_candidates
    return None, n_candidates

def coerce_lines(js):
    """Normalize the 'lines' field into a list[str]."""
    lines = js.get("lines", [])
    if isinstance(lines, str):
        return lines.splitlines()
    if isinstance(lines, list):
        return [str(x) for x in lines]
    return [str(lines)]
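
The pipeline overview calls for a JSON record and a `.txt` view of each transcription. The sketch below writes both files. `save_transcription` is a hypothetical helper, and the output folder name in the usage note is an assumption rather than part of the original notebook.

```python
import json
from pathlib import Path


def save_transcription(out_dir: Path, stem: str, language: str,
                       lines: list[str]) -> tuple[Path, Path]:
    """Write <stem>.json and <stem>.txt under out_dir; return both paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    json_path = out_dir / f"{stem}.json"
    txt_path = out_dir / f"{stem}.txt"
    # ensure_ascii=False keeps accented characters readable in the JSON file
    json_path.write_text(
        json.dumps({"language": language, "lines": lines},
                   ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    txt_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return json_path, txt_path
```

A call such as `save_transcription(WORK_DIR / "vlm_raw", IMG.stem, lang, lines)` would store one pair of files per page, where `vlm_raw` is a placeholder folder name.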


In [10]:
# print the responses for each 'num_predict' value

for r in resps:
    num_predict = r.get("num_predict")
    text = r.get("text")

    if not text:
        print(f"num_predict={num_predict} | EMPTY or ERROR")
        print(80 * "=")
        continue

    # Attempt 1: parse a single JSON object by trimming junk around it
    start = text.find("{")
    end = text.rfind("}")
    candidate = (
        text[start:end+1]
        if (start != -1 and end != -1 and end > start)
        else text
    )

    parsed_ok = False
    lang = "guess"
    lines = []
    json_objects_found = 0

    try:
        js = json.loads(candidate)
        lang = js.get("language", "guess")
        lines = coerce_lines(js)
        parsed_ok = True
        json_objects_found = 1  # we parsed one (assume single-object case)
    except Exception:
        # Attempt 2: handle repeated JSON objects / messy streams
        js, json_objects_found = extract_first_valid_json(text)
        if js is not None:
            lang = js.get("language", "guess")
            lines = coerce_lines(js)
            parsed_ok = True
        else:
            # Final fallback: plain text split
            lines = text.splitlines()

    print(
        f"num_predict={num_predict} | "
        f"parsed_json={parsed_ok} | "
        f"json_objs≈{json_objects_found} | "
        f"language={lang} | "
        f"lines={len(lines)} | "
        f"chars={len(text)}"
    )
    print("\n".join(lines[:60]))
    print(80 * "=")


num_predict=4096 | parsed_json=True | json_objs≈1 | language=por | lines=28 | chars=1888
A trilha sonora de um pais ideal
O
lha que coisa mais linda: as garotas de Ipanema-1961
tomavam cuba-libre, dirigiam Kharman-Glias e voavam
pela Panair. Usavam frasqueira, vestido-tubinho, cilio
postico, perua, laque. Diziam-se existencialistas, adoravam
arte abstrata e nao perdiam um filme da Nouvelle Vague.
Seus pontos eram o Beco das Garrafas, a Cinemateca, o Arpoador.
Iam a praia com a camisa social do irmao e, sob esta, um biquini que
de tao insolente, fazia o sangue dos rapazes ferver da maneira
mais incoveniente.
Tudo isso passou. A querida Panair nunca mais voou, a
Nouvelle Vague e um filme em preto e branco e ninguem mais
toma cuba-libre - quem pensaria hoje em misturar rum com
Coca-Cola? Quanto aquele biquini, era mesmo insolente, em-
bora, por padroes subsequentes, sua calinha contivesse pano
para fabricar dois ou tres para-ques. Dito assim, e como se, em
1961, o ceu do Brasil ainda foss