## MonReader - Part 4

----

### Text-to-Speech Generation from OCR Transcriptions (Audiobook-Style Rendering)

**Objective.**  
Transform the page-level OCR transcriptions produced in **MonReader – Part 3** into **spoken audio**, enabling end-to-end document accessibility and audiobook-style consumption.  
This part evaluates modern **Text-to-Speech (TTS)** models on long-form literary text, focusing on intelligibility, prosody, multilingual robustness, and generation efficiency.

We use the same two sources as previous parts:
- *The Chamber* — John Grisham *(English)*
- *A onda que se ergueu no mar* — Ruy Castro *(Portuguese)*

All text input is sourced from the verified per-page `.txt` outputs generated by the best-performing OCR model (**Qwen2.5-VL**), ensuring a clean and stable foundation for speech synthesis.

**Why this experiment.**  
While OCR converts visual documents into text, **Text-to-Speech closes the accessibility loop**: it enables hands-free reading, audio archiving, and downstream multimodal applications. This benefits blind and low-vision readers, researchers, and anyone else who needs fast, fully automatic, high-quality document scanning in bulk.

Long-form literary TTS presents challenges beyond short prompts: paragraph continuity, punctuation-aware prosody, proper-noun pronunciation, multilingual phonemes, and stability over extended generation windows. This experiment measures how well current TTS systems handle these constraints in a realistic, book-scale setting.

**Proposed Pipeline Overview**

1. **I – Data Preparation**  
   Normalize and structure OCR text for speech synthesis: page ordering, light TTS-oriented text normalization (hyphen joins, punctuation handling), language tagging, and chunking to respect model limits.

2. **J – TTS Model Benchmarking**  
   Evaluate three leading Text-to-Speech models on representative excerpts from both books.  
   Comparison dimensions include generation speed, pronunciation accuracy, prosody, multilingual handling (English vs. Portuguese), and long-form stability.

3. **K – Full Audio Generation**  
   Using the selected TTS model and fixed configuration, generate per-page audio files for both books.  
   Outputs include structured audio manifests and reproducible file layouts suitable for later merging into chapter- or book-level recordings.

> This part prioritizes **reproducibility and minimal intervention**: the OCR pipeline remains unchanged, and only lightweight, TTS-specific normalization is applied to the extracted text.
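
Generation speed in Step J can be quantified in several ways. One common choice is the real-time factor (synthesis time divided by audio duration, where values below 1 mean faster than real time). The sketch below is an illustration only: `timed_synthesis` and `fake_tts` are hypothetical names, and the assumption that a synthesis call returns the audio duration in seconds is mine, not something this notebook specifies.

```python
import time
from typing import Callable, Dict


def timed_synthesis(synthesize: Callable[[str], float], text: str) -> Dict[str, float]:
    """Time one synthesis call and derive a real-time factor.

    `synthesize` is assumed (hypothetically) to return the duration of the
    generated audio in seconds.
    """
    start = time.perf_counter()
    audio_s = synthesize(text)
    elapsed_s = time.perf_counter() - start
    return {
        "chars": len(text),
        "elapsed_s": elapsed_s,
        "audio_s": audio_s,
        # Guard against empty audio to avoid a division by zero.
        "rtf": elapsed_s / audio_s if audio_s > 0 else float("inf"),
    }


def fake_tts(text: str) -> float:
    """Stand-in for a real model: pretend each character yields 60 ms of audio."""
    return 0.06 * len(text)


result = timed_synthesis(fake_tts, "Hello")
print(result)
```

Swapping `fake_tts` for a thin wrapper around each candidate model would keep the timing logic identical across the benchmark.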


----


#### Imports and Environment

In [1]:
from pathlib import Path
import re
import json
import time
import math
import unicodedata
import pandas as pd
from typing import List, Dict, Iterable

In [2]:
BASE = Path.cwd()
WORK_DIR = BASE / "work"

# OCR outputs from Part 3
STEPH_DIR = WORK_DIR / "stepH_qwen2p5vl_full"

# Part 4 outputs
STEP4_DIR = WORK_DIR / "step4_tts"
STEP4_DIR.mkdir(parents=True, exist_ok=True)


----

### Step I — Data Preparation for Text-to-Speech

**Goal.**  
Transform OCR-derived page outputs into **clean, continuous, speech-ready text** while preserving semantic content and correct reading order.

This step performs **strictly minimal, TTS-oriented processing**, with no OCR correction or semantic rewriting. The focus is on structural consistency and speech robustness.

**Processing stages in this step:**
1. inspection of per-page OCR JSON structure
2. explicit verification of page ordering
3. extraction and concatenation of page-level text
4. removal of OCR and formatting artifacts
5. repair of hyphenated and line-wrapped words
6. generation of canonical continuous prose
7. sentence-aware chunking to respect TTS model limits

No linguistic enhancement, paraphrasing, or content modification is applied. The output of this step serves as the **single source of truth** for all subsequent Text-to-Speech benchmarking and audio generation.
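
Two of the stages listed above, hyphen/line-wrap repair and sentence-aware chunking, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used later in this notebook: the function names, the 400-character default limit, and the sample lines are all hypothetical. The hyphen rule is a heuristic and would wrongly merge genuine compounds that happen to break at a line end (for example Portuguese words such as "cuba-libre").

```python
import re
from typing import List


def join_wrapped_lines(lines: List[str]) -> str:
    """Join OCR lines into paragraphs; blank lines mark paragraph breaks."""
    paragraphs: List[str] = []
    current = ""
    for raw in lines:
        line = raw.strip()
        if not line:  # a blank line closes the current paragraph
            if current:
                paragraphs.append(current)
            current = ""
        elif current.endswith("-") and line[0].islower():
            # Heuristic: trailing hyphen + lowercase continuation = split word.
            current = current[:-1] + line
        else:
            current = f"{current} {line}".strip()
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)


# Split after sentence-final punctuation followed by whitespace.
# Abbreviations such as "Mr." will also trigger a split.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def chunk_sentences(text: str, max_chars: int = 400) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `max_chars`.

    A single sentence longer than `max_chars` is kept as one oversize chunk.
    """
    chunks: List[str] = []
    current = ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Invented sample lines, not taken from either book.
sample = ["The firm was accus-", "tomed to long", "trials.", "", "He waited."]
prose = join_wrapped_lines(sample)
chunks = chunk_sentences("One. Two! Three? Four.", max_chars=10)
print(prose)
print(chunks)
```

Chunking on sentence boundaries rather than at a fixed character offset keeps each TTS call on a natural prosodic unit, at the cost of uneven chunk sizes.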




#### Step I.1 - Inspection of per-page OCR JSON structure

Each page has one JSON file, stored in the following layout:

```text
work/
└─ stepH_qwen2p5vl_full/
   └─ <BOOK_ID>/
      └─ json/
         └─ pagXX.json
```


In [12]:
from pprint import pprint

In [None]:
# Print the JSON structure of page 12 of the Portuguese book
BOOK_ID = "A_onda_que_se_ergueu_no_mar-Ruy_Castro"
PAGE = "pag12"

# Reuse STEPH_DIR (defined above) instead of repeating the path literal
json_path = STEPH_DIR / BOOK_ID / "json" / f"{PAGE}.json"

with open(json_path, "r", encoding="utf-8") as f:
    data = json.load(f)

pprint(data)

{'image': 'pag12.JPEG',
 'image_path': 'e:\\Devs\\pyEnv-1\\Apziva\\MonReader\\data\\books\\A_onda_que_se_ergueu_no_mar-Ruy_Castro\\images\\pag12.JPEG',
 'latency_s': 870.2552971839905,
 'model': 'qwen2.5vl:latest',
 'options': {'num_predict': 1536,
             'repeat_penalty': 1.25,
             'stop': ['}\n', '}\r\n', '}'],
             'temperature': 0,
             'top_p': 1},
 'parsed': {'degenerate': False,
            'json_objs': 0,
            'language': 'guess',
            'lines': ['{',
                      '  "language": "por",',
                      '  "lines": [',
                      '    "A trilha",',
                      '    "sonora de um",',
                      '    "país ideal",',
                      '    "",',
                      '    "Olha que coisa mais linda: as garotas de '
                      'Ipanema-1961",',
                      '    "tomavam cuba-libre, dirigiam Kharman-Ghias e '
                      'voavam",',
                      '   

#### Step I.2 - Explicit verification of page ordering

Page filenames such as `pag2.JPEG` and `pag10.JPEG` sort incorrectly under plain string sorting, which places `pag10` before `pag2`.
We therefore define an explicit "natural sort" that extracts the numeric page id and sorts by it. The helper functions live in the `MonReader_tools.py` module.
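
The source of `MonReader_tools.py` is not shown here, so the following is only a sketch of how such a natural sort could work. `natural_page_key` is a hypothetical name, and taking the first run of digits in the file stem is an assumption that holds for names like `pagXX.json`.

```python
import re
from pathlib import Path

_DIGITS = re.compile(r"(\d+)")


def natural_page_key(name: str) -> int:
    """Return the first integer in the file stem, e.g. 'pag10.json' -> 10."""
    match = _DIGITS.search(Path(name).stem)
    if match is None:
        raise ValueError(f"no page number found in {name!r}")
    return int(match.group(1))


names = ["pag10.json", "pag2.json", "pag0.json"]

lexicographic = sorted(names)               # plain string comparison
natural = sorted(names, key=natural_page_key)  # numeric page id

print(lexicographic)
print(natural)
```

Raising on a missing number, rather than silently sorting such files first or last, makes stray files in the directory visible early.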


In [16]:
import importlib
import MonReader_tools

importlib.reload(MonReader_tools)

from MonReader_tools import (
    page_number_from_name, list_pages_sorted, verify_page_order
)

In [17]:
# Run verification for both books (Step H JSON outputs)

BOOKS = [
    "The_Chamber-John_Grisham",
    "A_onda_que_se_ergueu_no_mar-Ruy_Castro",
]

dfs = {}
for b in BOOKS:
    dfs[b] = verify_page_order(b, which="json", preview=50)
    display(dfs[b].head(20))


>>> The_Chamber-John_Grisham (json)
Directory: e:\Devs\pyEnv-1\Apziva\MonReader\work\stepH_qwen2p5vl_full\The_Chamber-John_Grisham\json
Order preview:
 - pag0.json
 - pag2.json
 - pag4.json
 - pag6.json
 - pag8.json
 - pag10.json
 - pag12.json
 - pag14.json
 - pag16.json
 - pag18.json
 - pag20.json
 - pag22.json


|   | name | page_n |
|---|------|--------|
| 0 | pag0.json | 0 |
| 1 | pag2.json | 2 |
| 2 | pag4.json | 4 |
| 3 | pag6.json | 6 |
| 4 | pag8.json | 8 |
| 5 | pag10.json | 10 |
| 6 | pag12.json | 12 |
| 7 | pag14.json | 14 |
| 8 | pag16.json | 16 |
| 9 | pag18.json | 18 |



>>> A_onda_que_se_ergueu_no_mar-Ruy_Castro (json)
Directory: e:\Devs\pyEnv-1\Apziva\MonReader\work\stepH_qwen2p5vl_full\A_onda_que_se_ergueu_no_mar-Ruy_Castro\json
Order preview:
 - pag12.json
 - pag16.json
 - pag18.json
 - pag20.json
 - pag22.json
 - pag24.json
 - pag26.json
 - pag28.json
 - pag32.json
 - pag36.json
 - pag40.json
 - pag44.json


|   | name | page_n |
|---|------|--------|
| 0 | pag12.json | 12 |
| 1 | pag16.json | 16 |
| 2 | pag18.json | 18 |
| 3 | pag20.json | 20 |
| 4 | pag22.json | 22 |
| 5 | pag24.json | 24 |
| 6 | pag26.json | 26 |
| 7 | pag28.json | 28 |
| 8 | pag32.json | 32 |
| 9 | pag36.json | 36 |
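
In the rows shown, *The Chamber* advances by 2 throughout, while *A onda que se ergueu no mar* steps unevenly (12 to 16, 28 to 32, 32 to 36). Whether those gaps are intentional is not stated here. A small check like the one below could make them explicit; `page_steps` is a hypothetical helper, and treating 2 as the expected step is an assumption based only on the preview.

```python
from typing import List, Tuple


def page_steps(page_numbers: List[int]) -> List[Tuple[int, int, int]]:
    """Return (previous, current, step) for each consecutive pair of pages."""
    return [(a, b, b - a) for a, b in zip(page_numbers, page_numbers[1:])]


# Page numbers copied from the preview of the Portuguese book.
pages = [12, 16, 18, 20, 22, 24, 26, 28, 32, 36]

steps = page_steps(pages)
strictly_increasing = all(step > 0 for _, _, step in steps)

# Assumed expected step of 2; anything else is reported for manual review.
uneven = [(a, b) for a, b, step in steps if step != 2]

print(strictly_increasing)
print(uneven)
```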


#### Step I.3 - Extraction and concatenation of page-level text
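
In the payload printed in Step I.1, `parsed["lines"]` holds the raw model output, including JSON scaffolding such as `{` and `"lines": [`, rather than clean text lines. The sketch below shows one way the quoted text lines could be recovered from such a payload. It is an assumption-laden illustration: `unwrap_lines` and `page_text` are hypothetical names, only the fields visible in that printout are used, and a final line cut off mid-string would be silently skipped.

```python
import json
import re
from typing import Dict, List

# Matches a line that consists of exactly one JSON string literal,
# optionally followed by a comma. Key/value lines such as
# '"language": "por",' do not match.
_STRING_LINE = re.compile(r'^\s*("(?:[^"\\]|\\.)*"),?\s*$')


def unwrap_lines(raw_lines: List[str]) -> List[str]:
    """Extract the text of lines that are standalone JSON string literals."""
    out: List[str] = []
    for raw in raw_lines:
        match = _STRING_LINE.match(raw)
        if match:
            out.append(json.loads(match.group(1)))  # decodes escapes
    return out


def page_text(payload: Dict) -> str:
    """Return one page's text, unwrapping JSON scaffolding when present."""
    lines = payload.get("parsed", {}).get("lines", [])
    unwrapped = unwrap_lines(lines)
    # Fall back to the lines as-is if nothing looked like a quoted string.
    return "\n".join(unwrapped if unwrapped else lines)


# Shortened copy of the structure printed in Step I.1.
example = {
    "parsed": {
        "lines": [
            "{",
            '  "language": "por",',
            '  "lines": [',
            '    "A trilha",',
            '    "sonora de um",',
            '    "país ideal",',
            '    "",',
        ]
    }
}

text = page_text(example)
print(text)
```

Concatenating `page_text` results in natural page order would then give the book-level text that the later stages operate on.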