# Corpus Management Demo

This notebook demonstrates downloading and preprocessing corpora using the `reducelang.corpus` module.


In [None]:
# %load_ext autoreload
# %autoreload 2

from pathlib import Path

from reducelang.corpus import CORPUS_REGISTRY, download_corpus, preprocess_corpus, generate_datacard
from reducelang.corpus.extractors import get_extractor
from reducelang.corpus.registry import list_corpora, get_corpus_spec
from reducelang.alphabet import ENGLISH_ALPHABET, ROMANIAN_ALPHABET


The corpus registry maps `(language, corpus_name)` to specifications including URLs, formats, and licenses.


In [None]:
print("Available corpora:")
for lang, corpus in list_corpora():
    print(f"  {lang}/{corpus}")
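
To make the `(language, corpus_name)` keying concrete, here is a minimal sketch of how such a registry could be laid out. It is not the `reducelang` implementation: the `CorpusSpec` fields, the `REGISTRY` contents, the placeholder URL, and the `list_keys` and `lookup` helpers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CorpusSpec:
    """Hypothetical registry entry; the field names are assumptions."""
    url: str
    format: str
    license: str
    sha256: Optional[str] = None


# Keys are (language, corpus_name) tuples. The URL and license are placeholders.
REGISTRY = {
    ("en", "text8"): CorpusSpec(
        url="https://example.org/text8.zip",
        format="zip",
        license="placeholder",
    ),
}


def list_keys(registry):
    """Return all (language, corpus_name) pairs in sorted order."""
    return sorted(registry)


def lookup(registry, language, name):
    """Fetch one spec, raising a readable error for unknown corpora."""
    try:
        return registry[(language, name)]
    except KeyError:
        raise KeyError(f"No corpus registered for {language}/{name}") from None


for lang, corpus in list_keys(REGISTRY):
    print(f"  {lang}/{corpus}")
```

Using a tuple key keeps lookups to a single dictionary access and makes it easy to list every corpus for one language by filtering on the first element.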


Download a small English corpus (text8) to demonstrate the workflow.


In [None]:
spec = get_corpus_spec("en", "text8")
raw_path = Path("data/corpora/en/latest/raw/text8/text8.zip")
raw_path.parent.mkdir(parents=True, exist_ok=True)
download_corpus(spec.url, raw_path, spec.sha256)
print(f"Downloaded to {raw_path}")
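
The call above passes `spec.sha256` alongside the URL, which suggests the download is checked against a known digest. How `download_corpus` does this internally is not shown here, so the following is only a sketch of a typical checksum check using the standard library. The helper names `sha256_of_file` and `verify_checksum` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_sha256):
    """Return True if the file matches, or if no digest was supplied."""
    if expected_sha256 is None:
        return True
    return sha256_of_file(path) == expected_sha256.lower()
```

A caller would typically run `verify_checksum(raw_path, spec.sha256)` right after the download and delete the file if the result is `False`.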


Preprocess the corpus using the configured English alphabet.


In [None]:
extractor = get_extractor(spec.extractor_class)
output_path = Path("data/corpora/en/latest/processed/text8.txt")
output_path.parent.mkdir(parents=True, exist_ok=True)
metadata = preprocess_corpus(raw_path, output_path, ENGLISH_ALPHABET, extractor)
print(f"Processed {metadata['char_count']} characters")
print(f"Coverage: {metadata['coverage']:.4f}")
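
The `char_count` and `coverage` values can be illustrated with a small stand-alone sketch. It assumes that normalization lowercases the text and drops characters outside the alphabet, and that coverage is the fraction of input characters that survive. The definitions in `reducelang` may differ, and `TOY_ALPHABET` and `normalize` are hypothetical names.

```python
import string

# Toy alphabet: lowercase ASCII letters plus space (an assumption for this sketch).
TOY_ALPHABET = frozenset(string.ascii_lowercase + " ")


def normalize(text, alphabet):
    """Lowercase the text, keep only in-alphabet characters, and report stats."""
    lowered = text.lower()
    kept = [ch for ch in lowered if ch in alphabet]
    coverage = len(kept) / len(lowered) if lowered else 0.0
    return "".join(kept), {"char_count": len(kept), "coverage": coverage}


cleaned, stats = normalize("Hi, you!", TOY_ALPHABET)
print(cleaned)                      # hi you
print(f"{stats['coverage']:.4f}")   # 0.7500
```

Here six of the eight input characters are in the alphabet, so the coverage is 0.75.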


Generate a data card documenting the corpus snapshot and preprocessing.


In [None]:
import json

from reducelang.corpus.datacard import generate_datacard

# Write the data card next to the processed corpus, then pretty-print it.
datacard_path = Path("data/corpora/en/latest/processed/text8_datacard.json")
generate_datacard(corpus_spec=spec, metadata=metadata, output_path=datacard_path, snapshot_date="latest")
print(json.dumps(json.loads(datacard_path.read_text(encoding="utf-8")), indent=2))
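
For readers unfamiliar with data cards, the sketch below shows one plausible shape for such a file: a small JSON document that records which corpus was processed, when, and with what result. The field names and the `write_datacard` helper are assumptions for illustration and are not the schema that `generate_datacard` produces.

```python
import json
from pathlib import Path


def write_datacard(output_path, corpus_name, language, snapshot_date, metadata):
    """Write a minimal JSON data card and return the dictionary that was saved."""
    card = {
        "corpus": corpus_name,
        "language": language,
        "snapshot_date": snapshot_date,
        "preprocessing": dict(metadata),  # copy so later edits do not leak in
    }
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(
        json.dumps(card, indent=2, sort_keys=True), encoding="utf-8"
    )
    return card
```

Storing the preprocessing statistics with the snapshot label makes it possible to tell later which version of the corpus a result was computed on.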


Together, the cells above cover the full corpus management pipeline: download → extract → normalize → document. The CLI command `reducelang prep` automates this workflow.
