# 02 – Filtering pages with `corppa`

The `corppa.utils.filter` helpers let you stream-filter the 1.8 GB PPA page
file without ever loading it fully into RAM. This notebook shows how to:

* keep only **certain works** (by `work_id` or via a metadata query),
* keep **specific page ranges** within those works,
* include or exclude pages by **tags / labels**, and
* **write** the result to a new compressed JSONL file.

A full pass over the page file took about nine seconds in the run recorded below, so each cell should finish in well under a minute on a laptop.
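
The streaming idea can be sketched with the standard library alone. This is an illustration of the general technique, not `corppa`'s implementation: `iter_jsonl_gz`, the toy records, and the temporary file are hypothetical stand-ins so the snippet runs without the real corpus, and the `work_id` field on page records is an assumption.

```python
import gzip
import json
import tempfile
from pathlib import Path


def iter_jsonl_gz(path):
    """Yield one parsed record per line from a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield json.loads(line)


# Tiny stand-in file (hypothetical data, not the PPA corpus).
demo_path = Path(tempfile.mkdtemp()) / "demo_pages.jsonl.gz"
demo_records = [
    {"id": "w1.1", "work_id": "w1", "order": 1},
    {"id": "w1.2", "work_id": "w1", "order": 2},
    {"id": "w2.1", "work_id": "w2", "order": 1},
]
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    for record in demo_records:
        handle.write(json.dumps(record) + "\n")

# Only one record is held in memory at a time while filtering.
kept_ids = [page["id"] for page in iter_jsonl_gz(demo_path) if page["work_id"] == "w1"]
print(kept_ids)
```

Because the reader is a generator, memory use stays roughly constant regardless of file size; only the pages you choose to keep accumulate.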

In [2]:
import gzip
import itertools
import json
import pathlib

import pandas as pd
from corppa.utils.filter import filter_pages, save_filtered_corpus
from tqdm.auto import tqdm

# --- data locations ---
DATA_DIR = pathlib.Path('..') / 'shared_data' / 'ppa_corpus_2025-02-03_1308'
PAGES_FILE = DATA_DIR / 'ppa_pages.jsonl.gz'
META_CSV   = DATA_DIR / 'ppa_metadata.csv'

## A. Filter entire works by a metadata query

Suppose we want *all* pages from works **published in 1890**, according to the `pub_year` column of the metadata CSV.

In [3]:
metadata_df = pd.read_csv(META_CSV, dtype=str)
work_ids_1890 = metadata_df.loc[metadata_df['pub_year'] == '1890', 'work_id'].tolist()
print(f"Works published in 1890: {len(work_ids_1890)}")

Works published in 1890: 48
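
The query compares against the string `'1890'` because the CSV was read with `dtype=str`. The same comparison can be reproduced without pandas, since `csv.DictReader` also yields every field as text. The rows below are invented toy data, and only the `work_id` and `pub_year` column names are taken from the cell above.

```python
import csv
import io

# Toy stand-in for ppa_metadata.csv (hypothetical rows).
toy_csv = io.StringIO(
    "work_id,pub_year\n"
    "demo-a,1890\n"
    "demo-b,1891\n"
    "demo-c,1890\n"
    "demo-d,\n"  # missing year: compared as an empty string, so never matches
)

# Every value is a string here, so the year must be compared as text.
toy_ids_1890 = [
    row["work_id"] for row in csv.DictReader(toy_csv) if row["pub_year"] == "1890"
]
print(toy_ids_1890)
```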


In [4]:
# collect first 3 matches just for demo
sample_work_ids = work_ids_1890[:3]
print('Demo set →', sample_work_ids)

pages_iter = filter_pages(PAGES_FILE, work_ids=sample_work_ids, disable_progress=False)
page_count = 0
for page in tqdm(pages_iter, desc='streaming'):
    page_count += 1
print(f"\nTotal pages kept: {page_count}")

Demo set → ['chi.12153205-p526', 'coo1.ark:/13960/t2m623920', 'hvd.ah2545']


Filtering: checked 1,982,024 pages, selected 447 | elapsed: 00:09
streaming: 447it [00:09, 47.49it/s] 


Total pages kept: 447
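
When a per-work breakdown is more useful than a single total, a running tally keeps the loop streaming. A minimal sketch over toy records, assuming each page record carries a `work_id` field (an assumption; the page schema is not shown in this notebook). `count_pages_by_work` is a hypothetical helper, and in practice the iterator returned by `filter_pages` would take the place of `toy_pages`.

```python
from collections import Counter


def count_pages_by_work(pages):
    """Tally pages per work from any iterable of page dicts."""
    counts = Counter()
    for page in pages:
        counts[page["work_id"]] += 1
    return counts


# An iterator, so the records are consumed one at a time.
toy_pages = iter(
    [
        {"id": "w1.1", "work_id": "w1"},
        {"id": "w1.2", "work_id": "w1"},
        {"id": "w2.1", "work_id": "w2"},
    ]
)
per_work = count_pages_by_work(toy_pages)
print(per_work.most_common())
```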





## B. Filter by **page ranges** inside a work

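Here we take a single work and keep only pages 1–5 and 20–25, matched against each page's `order` value. Note that the recorded run returns no pages: the first sample ID ends in `-p526`, which suggests an excerpt whose pages begin around order 526, so none of the requested orders exist for it. Pick orders that actually occur in the work to get a non-empty result.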

In [5]:
one_work = sample_work_ids[0]
# dict format expected by filter_pages → {work_id: [list-of-orders]}
work_pages = {one_work: list(range(1, 6)) + list(range(20, 26))}

subset = list(filter_pages(PAGES_FILE, work_pages=work_pages, disable_progress=True))
print(f"Kept {len(subset)} pages for {one_work}. Labels → {[p['label'] for p in subset[:5]]}")

Kept 0 pages for chi.12153205-p526. Labels → []
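
An empty result like this is easiest to diagnose by looking at which `order` values a work actually has. The sketch below reimplements the page-range idea in plain Python on toy data; `select_page_ranges` and `order_span` are hypothetical helpers, not `corppa` functions, and the guess that a `-p526` suffix marks an excerpt starting at order 526 is an assumption based only on the shape of the ID.

```python
def select_page_ranges(pages, work_pages):
    """Yield pages whose order is listed for their work in work_pages."""
    wanted = {work_id: set(orders) for work_id, orders in work_pages.items()}
    for page in pages:
        if page["order"] in wanted.get(page["work_id"], ()):
            yield page


def order_span(pages, work_id):
    """Return (min, max) order seen for a work, or None if it has no pages."""
    orders = [p["order"] for p in pages if p["work_id"] == work_id]
    return (min(orders), max(orders)) if orders else None


# Toy excerpt whose pages start at order 526 (hypothetical data).
toy_pages = [
    {"work_id": "demo-p526", "order": n, "label": str(n)} for n in range(526, 531)
]

requested = {"demo-p526": list(range(1, 6)) + list(range(20, 26))}
empty_subset = list(select_page_ranges(toy_pages, requested))

span = order_span(toy_pages, "demo-p526")

# Re-request orders that fall inside the observed span.
adjusted = {"demo-p526": list(range(526, 529))}
adjusted_subset = list(select_page_ranges(toy_pages, adjusted))

print(len(empty_subset), span, len(adjusted_subset))
```

Checking the span first costs one extra pass over the file but avoids silently writing an empty subset.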


## C. Include / exclude by tag or label

Let’s keep pages tagged *`poem`* but **exclude** any whose label equals `[advertisement]`.

In [None]:
kept = filter_pages(
    PAGES_FILE,
    include_any_tags=['poem'],
    exclude_labels=['[advertisement]'],
    disable_progress=True,
)
first_five = list(itertools.islice(kept, 5))
print([{'id': p['id'], 'label': p['label'], 'tags': p['tags']} for p in first_five])
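
One plausible reading of the two keyword arguments is "keep a page if it carries at least one of the included tags and its label is not in the excluded list". The predicate below sketches that reading on toy pages. `keep_page` is a hypothetical helper, and whether `filter_pages` combines the conditions in exactly this way is an assumption worth checking against the library.

```python
def keep_page(page, include_any_tags=(), exclude_labels=()):
    """Return True if the page passes both the tag and the label test."""
    tags = set(page.get("tags") or [])
    # Include test: at least one requested tag must be present.
    if include_any_tags and not tags.intersection(include_any_tags):
        return False
    # Exclude test: the label must not be in the blocked list.
    return page.get("label") not in set(exclude_labels)


toy_pages = [
    {"id": "p1", "label": "12", "tags": ["poem"]},
    {"id": "p2", "label": "[advertisement]", "tags": ["poem"]},
    {"id": "p3", "label": "13", "tags": ["prose"]},
    {"id": "p4", "label": "iv", "tags": ["poem", "footnote"]},
]

kept_demo_ids = [
    page["id"]
    for page in toy_pages
    if keep_page(page, include_any_tags=["poem"], exclude_labels=["[advertisement]"])
]
print(kept_demo_ids)
```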

## D. Save a filtered corpus to disk

The helper writes a **new gzipped JSONL** file that you can chain into later
workflows or share with collaborators.

In [None]:
OUT_FILE = DATA_DIR / 'subset_pages.jsonl.gz'
save_filtered_corpus(
    PAGES_FILE,
    OUT_FILE,
    work_ids=sample_work_ids,
    include_any_tags=['poem'],
    disable_progress=False,
)
print('Written →', OUT_FILE.relative_to(DATA_DIR.parent.parent))
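
After writing a subset it is worth confirming that the file reads back cleanly. The round trip below is a standard-library sketch on toy records; `write_jsonl_gz` is a hypothetical stand-in used only so the snippet runs on its own, and the same read-back check could be pointed at `OUT_FILE`.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_jsonl_gz(pages, path):
    """Stream records into a gzipped JSONL file; return how many were written."""
    written = 0
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for page in pages:
            handle.write(json.dumps(page, ensure_ascii=False) + "\n")
            written += 1
    return written


# Generator of toy pages (hypothetical data).
toy_pages = ({"id": f"demo.{n}", "work_id": "demo", "order": n} for n in range(1, 4))
check_path = Path(tempfile.mkdtemp()) / "subset_check.jsonl.gz"
n_written = write_jsonl_gz(toy_pages, check_path)

# Read the file back line by line; json.loads raises if a line is malformed.
with gzip.open(check_path, "rt", encoding="utf-8") as handle:
    read_back = [json.loads(line) for line in handle]

print(n_written, len(read_back))
```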

## E. Doing the same from the command line

```bash
# keep only poems from works published in 1890
corppa-filter \
  --pages ../shared_data/ppa_corpus_2025-02-03_1308/ppa_pages.jsonl.gz \
  --output subset_1890_poems.jsonl.gz \
  --work-ids $(python -c "import pandas as pd; df = pd.read_csv('../shared_data/ppa_corpus_2025-02-03_1308/ppa_metadata.csv', dtype=str); print(' '.join(df.loc[df['pub_year'] == '1890', 'work_id']))") \
  --include-any-tags poem
```

*(The inline `$(...)` substitution relies on shell word splitting and is easy to mis-quote; pre-computing a text file of work IDs is the more robust option.)*
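
Pre-computing the IDs could look like the sketch below, which writes one ID per line and then assembles a safely quoted command string with `shlex.join`. The IDs and file name are invented, and the CLI flags are copied verbatim from the example above without being checked against the installed `corppa-filter`, so treat the command string as illustrative.

```python
import shlex
import tempfile
from pathlib import Path

# Hypothetical IDs; in the notebook this would be work_ids_1890.
toy_work_ids = ["demo-a", "demo.ark:/00000/x1", "demo-c"]

# One ID per line is easy to diff, version, and share.
ids_file = Path(tempfile.mkdtemp()) / "work_ids_1890.txt"
ids_file.write_text("\n".join(toy_work_ids) + "\n", encoding="utf-8")

# Read the IDs back and build the argument list explicitly.
loaded_ids = ids_file.read_text(encoding="utf-8").split()
command = shlex.join(
    [
        "corppa-filter",
        "--pages", "ppa_pages.jsonl.gz",
        "--output", "subset_1890_poems.jsonl.gz",
        "--work-ids", *loaded_ids,  # flag name taken from the example above
        "--include-any-tags", "poem",
    ]
)
print(command)
```

Building the argument list in Python sidesteps nested shell quoting, and `shlex.join` quotes any argument that would otherwise be split or expanded by the shell.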

---
➡️ Proceed to `03_images_and_paths.ipynb` to learn how to resolve a page to its high-resolution scan and build image paths!