# 01 – PPA Corpus Overview

This notebook gives a *bird's-eye tour* of the Princeton Prosody Archive (PPA) corpus, which is stored on disk as three files:

| file | description | size (≈) |
|------|-------------|-----------|
| `ppa_pages.jsonl.gz` | gzip-compressed JSON Lines, one page per line: the text of every page plus minimal page-level metadata | **1.8 GB compressed**, ~4.2 GB uncompressed |
| `ppa_metadata.csv`  | work‑level bibliographic metadata | 3.4 MB |
| `ppa_metadata.json` | same as above but in JSON | 5.4 MB |

Pages join to works on the shared key **`work_id`**. All code below streams the page file, so the ~4.2 GB of uncompressed text is *never* loaded into memory at once.

In [1]:
import gzip, json, pathlib, itertools, pandas as pd, textwrap
from pprint import pprint

DATA_DIR = pathlib.Path('..') / 'shared_data' / 'ppa_corpus_2025-02-03_1308'
PAGES_FILE = DATA_DIR / 'ppa_pages.jsonl.gz'
META_CSV   = DATA_DIR / 'ppa_metadata.csv'
META_JSON  = DATA_DIR / 'ppa_metadata.json'

## 1. Peek inside the page corpus (first 3 records)

In [3]:
with gzip.open(PAGES_FILE, 'rt', encoding='utf-8') as fh:
    sample_pages = list(itertools.islice(fh, 3))

for i, raw in enumerate(sample_pages, 1):
    page = json.loads(raw)
    print(f'— Page #{i}')
    pprint({k: page[k] for k in ('id','work_id','order','label','tags')})
    snippet = textwrap.shorten(page['text'].replace('\n',' '), width=120)
    print('text →', snippet)
    print('-' * 80)

— Page #1
{'id': 'A01224.1',
 'label': '[1]',
 'order': 1,
 'tags': ['dedication'],
 'work_id': 'A01224'}
text → To the Right excellent and most honorable Ladie, the Ladie Marie, Countesse of Pembroke. VOi, pia nympha, tuum, [...]
--------------------------------------------------------------------------------
— Page #2
{'id': 'A01224.10',
 'label': '[10]',
 'order': 10,
 'tags': ['book'],
 'work_id': 'A01224'}
text → Boscan 3. Booke. Los altares delante estauan puestos, Ardiendo encima d'ellos toda Arabia. Cap. 5. Of the [...]
--------------------------------------------------------------------------------
— Page #3
{'id': 'A01224.100',
 'label': '[100]',
 'order': 100,
 'tags': ['book'],
 'work_id': 'A01224'}
text → 〈 in non-Latin alphabet 〉. 〈 in non-Latin alphabet 〉. This is the Prosopopoeia of Peleus, which is thus left off; [...]
--------------------------------------------------------------------------------


Each JSON object holds:
* **`id`** — unique page identifier (`<work_id>.<page_number>`).
* **`work_id`** — identifier of the work the page belongs to; joins to the work-level metadata.
* **`order`** — numeric page sequence within the work.
* **`label`** — original printed folio/page label (e.g. `A2r`, `[1]`).
* **`tags`** — semantic tags (e.g. `['dedication']`, `['title-page']`).
* **`text`** — full OCR transcription (UTF‑8).
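
The `id` layout and the `order` field can be put to work together. The sketch below is a minimal illustration on toy records, assuming the page part always follows the *last* dot in the `id` (some work ids contain other punctuation, e.g. `A01514-pnp`). `split_page_id` is a hypothetical helper, not part of any PPA tooling.

```python
def split_page_id(page_id: str) -> tuple[str, str]:
    """Split '<work_id>.<page_number>' on the LAST dot (assumed layout)."""
    work_id, _, page_part = page_id.rpartition(".")
    return work_id, page_part

# Toy records shaped like the sample pages above (not real corpus rows).
pages = [
    {"id": "A01224.1", "order": 1},
    {"id": "A01224.10", "order": 10},
    {"id": "A01224.100", "order": 100},
    {"id": "A01224.2", "order": 2},
]

# Sorting by the numeric `order` field restores reading order,
# which the string-sorted ids do not provide.
in_reading_order = sorted(pages, key=lambda p: p["order"])
print([p["id"] for p in in_reading_order])
print(split_page_id("A01224.100"))
```

Sorting on `order` rather than on `id` matters because the sample above lists `A01224.10` directly after `A01224.1`.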

> **Why JSONL?**  You can stream line‑by‑line without loading the whole file—perfect for 4 GB of UTF‑8 text!

## 2. Load work‑level metadata (CSV)

In [2]:
metadata_df = pd.read_csv(META_CSV, dtype=str)
print(f'{len(metadata_df):,} works loaded')
metadata_df.head()

7,122 works loaded


|  | work_id | source_id | cluster_id | title | author | pub_year | publisher | pub_place | collections | work_type | source | source_url | sort_title | subtitle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A01224 | A01224 | A01224 | The Arcadian rhetorike: or The præcepts of rhe... | Fraunce, Abraham, fl. 1587-1633 | 1588 | Thomas Orwin | At London | ['Linguistic', 'Literary'] | full-work | EEBO-TCP | http://name.umdl.umich.edu/A01224.0001.001 | Arcadian rhetorike: or The præcepts of rhetori... | Greeke, Latin, English, Italian, French, Spani... |
| 1 | A01225 | A01225 | A01225 | The Countesse of Pembrokes Emanuel | Fraunce, Abraham, fl. 1587-1633 | 1591 | [By Thomas Orwyn] for William Ponsonby, dwelli... | Printed at London | ['Literary', 'Original Bibliography'] | full-work | EEBO-TCP | http://name.umdl.umich.edu/A01225.0001.001 | Countesse of Pembrokes Emanuel Conteining the ... | Conteining the natiuity, passion, buriall, and... |
| 2 | A01227 | A01227 | A01227 | The Countesse of Pembrokes Yuychurch | Fraunce, Abraham, fl. 1587-1633 | 1591 | Thomas Orwyn for William Ponsonby, dwelling in... | London | ['Literary', 'Original Bibliography'] | full-work | EEBO-TCP | http://name.umdl.umich.edu/A01227.0001.001 | Countesse of Pembrokes Yuychurch Conteining th... | Conteining the affectionate life, and vnfortun... |
| 3 | A01514-pnp | A01514 | A01514-pnp | Certayne notes of Instruction concerning the m... | Gascoigne, George, 1542?-1577. | 1575 | By H. Bynneman for Richard Smith. These bookes... | Imprinted at London | ['Literary'] | excerpt | EEBO-TCP | https://quod.lib.umich.edu/e/eebo2/A01514.0001... | Certayne notes of Instruction concerning the m... |  |
| 4 | A03670 | A03670 | A03670 | Horace his arte of poetrie, pistles, and satyr... | Horace | 1567 | In Fletestrete, nere to S. Dunstones Churche, ... | Imprinted at London | ['Literary'] | full-work | EEBO-TCP | http://name.umdl.umich.edu/A03670.0001.001 | Horace his arte of poetrie, pistles, and satyr... | and to the Earle of Ormounte by Tho. Drant add... |


In [3]:
import json, ast, pandas as pd
from collections import Counter
import matplotlib.pyplot as plt

# ---- 1. publication years (numeric) ---------------------------------
pub_year = pd.to_numeric(metadata_df['pub_year'], errors='coerce').astype('Int64')
print("-" * 50, "\nTop-5 publication years")
print(pub_year.value_counts().head(5), "\n")

# ---- 2. sources & work types ----------------------------------------
print("-" * 50, "\nSource counts")
print(metadata_df['source'].value_counts().rename_axis('source')
        .reset_index(name='works').head(10))

print("-" * 50, "\nWork-type counts")
print(metadata_df['work_type'].value_counts().rename_axis('work_type')
        .reset_index(name='works'))

# ---- 3. collections (multivalued) -----------------------------------
coll_counter, bad_rows = Counter(), 0
for raw in metadata_df['collections'].dropna():
    parsed = None
    for parser in (json.loads, ast.literal_eval):
        try:
            parsed = parser(raw)
            break
        except Exception:
            continue
    if parsed is None:
        bad_rows += 1
        continue
    coll_counter.update(parsed)

coll_series = pd.Series(coll_counter, name='works').sort_values(ascending=False)
print("-" * 50, f"\nCollection membership counts (bad rows skipped: {bad_rows})")
display(coll_series)

-------------------------------------------------- 
Top-5 publication years
pub_year
1922    126
1920    106
1913    101
1912     93
1779     93
Name: count, dtype: Int64 

-------------------------------------------------- 
Source counts
       source  works
0  HathiTrust   5539
1        Gale   1517
2    EEBO-TCP     66
-------------------------------------------------- 
Work-type counts
   work_type  works
0  full-work   6097
1    article    661
2    excerpt    364
-------------------------------------------------- 
Collection membership counts (bad rows skipped: 0)


Literary                  4292
Linguistic                3430
Original Bibliography      986
Typographically Unique     708
Word Lists                 403
Dictionaries               180
Uncategorized               62
Name: works, dtype: int64

**Key columns:**
* `work_id` — primary key; joins to pages.
* `title`, `author`, `pub_year`, `publisher`, `pub_place` — bibliographic info.
* `collections` — thematic groupings (a JSON array in `.json`; a stringified list such as `['Linguistic', 'Literary']` in `.csv`, which must be parsed before use).
* `source` — provenance: one of *EEBO-TCP*, *HathiTrust*, or *Gale*.
* `work_type` — *full‑work*, *excerpt*, *article*.

Both CSV & JSON represent the same 7,122 rows; choose whichever is convenient.
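
Because the CSV stores `collections` as a stringified list, it helps to parse it once into real lists and let pandas do the counting. The following is a minimal sketch on a toy frame standing in for `metadata_df`; `parse_collections` and the `collection_list` column are hypothetical names introduced here for illustration.

```python
import ast
import pandas as pd

# Toy stand-in for metadata_df; values mimic the CSV preview above.
toy_meta = pd.DataFrame({
    "work_id": ["A01224", "A01225", "A01514-pnp"],
    "collections": [
        "['Linguistic', 'Literary']",
        "['Literary', 'Original Bibliography']",
        "['Literary']",
    ],
})

def parse_collections(raw):
    """Parse a stringified list; return [] for missing or malformed values."""
    if not isinstance(raw, str):
        return []
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    return list(value) if isinstance(value, (list, tuple)) else []

toy_meta["collection_list"] = toy_meta["collections"].apply(parse_collections)

# One row per (work, collection) pair, then count works per collection.
membership = toy_meta.explode("collection_list")
counts = membership["collection_list"].value_counts()
print(counts)
```

The exploded frame keeps `work_id` on every row, so it can also be used to filter works by collection before joining to pages.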

## 3. Joining pages with their work metadata

The page file holds full text **but almost no bibliographic context**; the
metadata table holds rich work-level details **but no page text**. Linking the two by their shared key **`work_id`** lets you:

* retrieve a work’s title/author/year while analyzing its pages,
* group or filter pages by attributes such as source, collection, or publication date,
* display scans or transcriptions alongside catalog information.

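The snippet below makes two streaming passes over the 1.8 GB page file (memory-safe) to:

1. skip ahead to a single page and take its `work_id` as an arbitrary example;
2. pull the first ten pages of that work as they occur in the file, to inspect their printed labels (file order appears to follow the `id` string, so it is not necessarily reading order);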
3. look up the work’s bibliographic record for confirmation.

This round-trip pattern is the foundation for any downstream analysis or visualization.

In [None]:
import itertools, gzip, json

def page_iter():
    """Yield one parsed page dict at a time (keeps RAM usage tiny)."""
    with gzip.open(PAGES_FILE, 'rt', encoding='utf-8') as fh:
        for line in fh:
            yield json.loads(line)

# -- 1. pick an arbitrary work by sampling 1 page --
some_work_id = next(itertools.islice(page_iter(), 10000, 10001))['work_id']
print("Chosen work_id →", some_work_id)

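# -- 2. collect the first 10 pages of that work in file order; islice stops
#       reading once 10 matches are found instead of scanning the whole file --
pages_for_work = list(itertools.islice(
    (p for p in page_iter() if p['work_id'] == some_work_id), 10))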

# -- 3. fetch the bibliographic record from the metadata table --
meta_row = metadata_df.set_index('work_id').loc[some_work_id]
print(meta_row[['title', 'author', 'pub_year']].to_string())

# display the original print labels for a quick sanity-check
print("First 5 page labels:", [p['label'] for p in pages_for_work[:5]])

Chosen work_id → A52335
title       The English historical library, or, A short vi...
author                           Nicolson, William, 1655-1727
pub_year                                                 1696
First 5 page labels: ['[1]', '[10]', '66', '67', '68']
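
The same join also works in the other direction: select works by a metadata attribute first, then stream only their pages. The sketch below uses toy in-memory data with made-up ids so it runs without the corpus. `iter_pages_for_works` is a hypothetical helper; in this notebook you would pass it `page_iter()` and a column of `metadata_df` instead.

```python
import pandas as pd

def iter_pages_for_works(pages, wanted_work_ids):
    """Lazily yield only pages whose work_id is in the wanted set."""
    wanted = set(wanted_work_ids)  # set membership is cheap per page
    for page in pages:
        if page["work_id"] in wanted:
            yield page

# Toy stand-ins for metadata_df and page_iter() (ids and titles are invented).
toy_meta = pd.DataFrame({
    "work_id": ["w1", "w2", "w3"],
    "source": ["EEBO-TCP", "EEBO-TCP", "Gale"],
    "title": ["Toy title one", "Toy title two", "Toy title three"],
})
toy_pages = [
    {"id": "w1.1", "work_id": "w1", "label": "[1]"},
    {"id": "w3.1", "work_id": "w3", "label": "1"},
    {"id": "w2.5", "work_id": "w2", "label": "5"},
]

# Step 1: choose works from the metadata table.
eebo_ids = toy_meta.loc[toy_meta["source"] == "EEBO-TCP", "work_id"]
titles = toy_meta.set_index("work_id")["title"]

# Step 2: stream pages, keeping only those works, and attach the title.
selected = list(iter_pages_for_works(toy_pages, eebo_ids))
for page in selected:
    print(page["id"], "|", titles[page["work_id"]])
```

Building the set of wanted ids once keeps the per-page check cheap, which matters when the loop runs over roughly two million pages.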


## 4. Counting pages efficiently
A full pass such as `sum(1 for _ in gzip.open(...))` takes roughly 20 to 30 s (about 21 s wall time in the run recorded below), so the cell is commented out. Uncomment it if you really need the figure.

In [None]:
#%%time
#page_count = sum(1 for _ in gzip.open(PAGES_FILE, 'rt', encoding='utf-8'))
#print(f'Total pages: {page_count:,}')

Total pages: 1,982,024
CPU times: user 20.5 s, sys: 432 ms, total: 20.9 s
Wall time: 21.2 s
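
If you are paying for a full pass anyway, a `Counter` can tally pages per work in the same pass, and the grand total then falls out as the sum. This is a minimal sketch, assuming every line carries a `work_id` field as in the samples above. It writes a tiny gzipped JSONL file to a temporary directory so it runs without the corpus, and `count_pages_per_work` is a hypothetical helper.

```python
import gzip
import json
import tempfile
from collections import Counter
from pathlib import Path

def count_pages_per_work(path):
    """Stream a gzipped JSONL file once and tally pages per work_id."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            counts[json.loads(line)["work_id"]] += 1
    return counts

# Build a tiny stand-in file (toy records, not real corpus data).
toy_records = [
    {"id": "w1.1", "work_id": "w1", "text": "first"},
    {"id": "w1.2", "work_id": "w1", "text": "second"},
    {"id": "w2.1", "work_id": "w2", "text": "third"},
]
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy_pages.jsonl.gz"
    with gzip.open(toy_path, "wt", encoding="utf-8") as fh:
        for record in toy_records:
            fh.write(json.dumps(record) + "\n")
    per_work = count_pages_per_work(toy_path)

print(per_work.most_common())
print("total pages:", sum(per_work.values()))
```

On the real corpus you would call `count_pages_per_work(PAGES_FILE)`. Expect it to be slower than the bare line count because every line is also parsed as JSON.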


---
### Take‑aways
* **Pages** are stored as JSONL (streamable).
* **Works** metadata lives in CSV/JSON (easily loaded via pandas).
* Join the two via `work_id`.

➡️ Proceed to **02_filtering_pages.ipynb** to learn how to carve out subsets with `corppa` utilities.