# 03 – Linking pages to high-resolution scans

Every PPA page can be traced back to a TIFF or JPEG image on disk. The helper
module `corppa.utils.path_utils` builds those image paths from identifiers. In
this notebook we will:

1. recap the identifier anatomy (work id → volume id → image filename),
2. resolve **one Gale page** to its relative image path, and
3. optionally display a thumbnail if the file is locally available.

> **Note**: as of April 2025 the helpers cover Gale volumes only; requests for
> HathiTrust and EEBO images raise `NotImplementedError` until their path
> patterns are added.

In [None]:
import gzip
import json
import pathlib
from pprint import pprint

import pandas as pd
from corppa.utils import path_utils as pu

DATA_DIR   = pathlib.Path('..') / 'shared_data' / 'ppa_corpus_2025-02-03_1308'
PAGES_FILE = DATA_DIR / 'ppa_pages.jsonl.gz'
META_CSV   = DATA_DIR / 'ppa_metadata.csv'

## 1  Identifier anatomy refresher

| level | example | meaning |
|-------|---------|---------|
| **work id** | `CW012706` | unique to a bibliographic item (title-level) |
| **page id** | `CW0127060085-p1` | volume (85) + page (1) |
| **HT id**   | `mdp.39015012345678` | HathiTrust volume id (when source = HathiTrust) |

`path_utils` provides small functions to hop between these layers without
memorising string hacks.
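
To make the page-id row of the table concrete, here is a minimal sketch that splits a page id of the form shown above into its prefix and page number. `split_page_id` is a hypothetical helper written for this illustration, not part of `corppa`, and the `-p<N>` suffix is inferred from the single example in the table.

```python
def split_page_id(page_id: str) -> tuple[str, int]:
    """Split 'PREFIX-pN' into (PREFIX, N).

    Hypothetical helper: the '-p<N>' suffix format is an assumption
    based on the example id in the table above.
    """
    prefix, sep, page_part = page_id.rpartition("-p")
    if not sep or not page_part.isdigit():
        raise ValueError(f"unexpected page id format: {page_id!r}")
    return prefix, int(page_part)


prefix, page_number = split_page_id("CW0127060085-p1")
print(prefix, page_number)
```

In real code, prefer whatever `path_utils` exposes for this job; the sketch only shows the shape of the identifier.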

In [None]:
demo = 'mdp.39015012345678'
print('encode_htid →', pu.encode_htid(demo))
print('decode_htid →', pu.decode_htid(pu.encode_htid(demo)))
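
The demo id above contains no special characters, so it does not show why encoding matters. HathiTrust ids built on ARK identifiers contain `:` and `/`, which are awkward in file paths. The sketch below assumes a Pairtree-style substitution (`/` to `=`, `:` to `+`, `.` to `,`) applied to the part after the library prefix. Whether `encode_htid` does exactly this is an assumption, and the `sketch_*` functions are illustrative stand-ins rather than the library's code.

```python
# Pairtree-style single-character substitutions (assumed convention;
# the hex-escaping step of full Pairtree cleaning is omitted here).
_ENCODE = str.maketrans({"/": "=", ":": "+", ".": ","})
_DECODE = str.maketrans({"=": "/", "+": ":", ",": "."})


def sketch_encode_htid(htid: str) -> str:
    """Encode the part after the first '.', leaving the library prefix as is."""
    prefix, sep, ident = htid.partition(".")
    return prefix + sep + ident.translate(_ENCODE)


def sketch_decode_htid(encoded: str) -> str:
    """Reverse sketch_encode_htid."""
    prefix, sep, ident = encoded.partition(".")
    return prefix + sep + ident.translate(_DECODE)


ark = "uc2.ark:/13960/t4mk66f1d"
encoded = sketch_encode_htid(ark)
print(encoded)
print(sketch_decode_htid(encoded) == ark)
```

The round trip in this simplified version only holds for ids that do not already contain `=`, `+`, or `,`; full Pairtree cleaning escapes those characters first.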

## 2  Pick one **Gale** work so the path util can succeed

In [None]:
meta_df = pd.read_csv(META_CSV, dtype=str)
gale_work = meta_df.loc[meta_df['source'] == 'Gale', 'work_id'].iloc[0]
print('Chosen Gale work →', gale_work)

In [None]:
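# Stream the gzipped JSONL lazily: one page record per line
def iter_pages():
    with gzip.open(PAGES_FILE, 'rt', encoding='utf-8') as fh:
        for line in fh:
            yield json.loads(line)

# Grab the first page of the chosen work; next() stops reading at the first
# match and raises StopIteration if the work has no pages in the file
page = next(p for p in iter_pages() if p['work_id'] == gale_work)
pprint({k: page[k] for k in ('id', 'label', 'order')})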
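
The streaming pattern used above can be tried without the full corpus. The sketch below writes a tiny gzipped JSONL file with made-up records, then uses `next()` with a default so that a missing work returns `None` instead of raising `StopIteration`. The file name and record values are invented for the demonstration.

```python
import gzip
import json
import tempfile
from pathlib import Path


def iter_jsonl_gz(path):
    """Yield one parsed record per line without loading the whole file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)


# Build a tiny stand-in file (made-up records) so the sketch runs anywhere.
records = [
    {"id": "A-p1", "work_id": "A", "order": 1},
    {"id": "B-p1", "work_id": "B", "order": 1},
    {"id": "B-p2", "work_id": "B", "order": 2},
]
sample = Path(tempfile.mkdtemp()) / "pages.jsonl.gz"
with gzip.open(sample, "wt", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

# next() with a default returns None instead of raising StopIteration.
first_b = next((p for p in iter_jsonl_gz(sample) if p["work_id"] == "B"), None)
missing = next((p for p in iter_jsonl_gz(sample) if p["work_id"] == "Z"), None)
print(first_b, missing)
```

Because the generator is lazy, the search for work `B` stops at the second line; only the search for the absent work `Z` reads the whole file.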

## 3  Resolve the image path

In [None]:
img_rel = pu.get_image_relpath(page['work_id'], page['order'])
print('Relative image path →', img_rel)

If the image directory is available locally (e.g. mounted under
`/shared_data/gale_images`), point `IMG_ROOT` in the next cell at the directory
that `img_rel` is relative to, and you can preview the scan:

In [None]:
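# OPTIONAL: show a thumbnail if the file exists
from PIL import Image, ImageOps
from IPython.display import display  # implicit in notebooks; explicit import also works in scripts
IMG_ROOT = pathlib.Path('../shared_data/')  # adjust for your setup

img_path = IMG_ROOT / img_rel
if img_path.exists():
    # The context manager closes the source file once the thumbnail is built
    with Image.open(img_path) as img:
        # ImageOps.fit scales and center-crops to exactly 400x550 pixels
        thumb = ImageOps.fit(img, (400, 550))
    display(thumb)
else:
    print('Image file not found locally →', img_path)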
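
Since scans may be stored as TIFF or JPEG, a local copy might use a different extension or letter case than the resolved path. The sketch below tries a few suffixes before giving up. `first_existing` and its suffix list are hypothetical, not part of `corppa`, and the directory layout in the demo is made up.

```python
import tempfile
from pathlib import Path


def first_existing(root, relpath, suffixes=(".tif", ".TIF", ".jpg", ".jpeg")):
    """Return the first existing variant of root/relpath, or None.

    Hypothetical helper: the candidate suffixes are an assumption
    about how local copies of the scans might be named.
    """
    base = Path(root) / relpath
    for suffix in suffixes:
        candidate = base.with_suffix(suffix)
        if candidate.exists():
            return candidate
    return None


# Demo with an empty placeholder file in a temporary directory.
root = Path(tempfile.mkdtemp())
(root / "Gale" / "stub" / "vol").mkdir(parents=True)
(root / "Gale" / "stub" / "vol" / "1.jpg").touch()

found = first_existing(root, "Gale/stub/vol/1.tif")
not_found = first_existing(root, "Gale/stub/vol/2.tif")
print(found, not_found)
```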

### What did we just do?

1. **Selected** a Gale work because Gale image paths are implemented.
2. **Streamed** the large JSONL to grab its first page.
3. **Called** `get_image_relpath(work_id, order)`, which internally
   * finds the *volume id* for that page,
   * derives the stub directory for Gale,
   * concatenates everything into `Gale/<stub>/<volume>/<order>.tif`.
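
The final concatenation step can be sketched with `pathlib`. This is only an illustration of the `Gale/<stub>/<volume>/<order>.tif` pattern with placeholder values. How the real helper derives the stub, or whether it pads the page number, is not shown here, and `sketch_gale_relpath` is a hypothetical name.

```python
from pathlib import PurePosixPath


def sketch_gale_relpath(stub: str, volume_id: str, order: int) -> PurePosixPath:
    """Join the pieces into Gale/<stub>/<volume>/<order>.tif.

    Hypothetical stand-in for the last step of get_image_relpath;
    stub and volume_id are taken as given rather than derived.
    """
    return PurePosixPath("Gale") / stub / volume_id / f"{order}.tif"


relpath = sketch_gale_relpath("STUB", "VOLUME_ID", 1)
print(relpath)  # Gale/STUB/VOLUME_ID/1.tif
```

Returning a path object rather than a string keeps the result easy to join onto a local image root, as the thumbnail cell does with `IMG_ROOT / img_rel`.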

You can now use the same helper inside data loaders, dashboards, or
annotation tools to pull the exact scan for any page.
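
In a data loader, pages from sources without path rules need to be skipped rather than crash the loop. The sketch below wraps a resolver and turns `NotImplementedError` into `None`. `fake_resolver` is a made-up stand-in so the example runs without the corpus, and its rule that ids starting with `CW` are Gale is an assumption for the demo only. In practice you would pass `pu.get_image_relpath` instead.

```python
def resolve_paths(pages, resolver):
    """Yield (page id, image path or None) for each page record."""
    for page in pages:
        try:
            yield page["id"], resolver(page["work_id"], page["order"])
        except NotImplementedError:
            # Source without path rules yet: keep the page, flag the gap.
            yield page["id"], None


def fake_resolver(work_id, order):
    """Made-up resolver: treats 'CW...' ids as Gale, rejects the rest."""
    if not work_id.startswith("CW"):
        raise NotImplementedError(f"no path rule for {work_id}")
    return f"Gale/stub/{work_id}/{order}.tif"


pages = [
    {"id": "CW012706-p1", "work_id": "CW012706", "order": 1},
    {"id": "mdp.39015012345678-p1", "work_id": "mdp.39015012345678", "order": 1},
]
resolved = list(resolve_paths(pages, fake_resolver))
print(resolved)
```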

---
**Next steps**

* Add HathiTrust & EEBO path rules to `path_utils.py` (PRs welcome!).