Sync/extraction tooling for the OpenAlex scholarly metadata snapshot. The dataset itself lives on HuggingFace (Git LFS via Xet storage).

| Path | Description |
|---|---|
| sync/ | Python tooling: download from S3, extract relationship tables to Parquet, manage the snapshot |
| openalex-snapshot/ | Git submodule: source data and extracted tables |

All sync commands run from this directory (the repo root). The submodule must be initialised so openalex-snapshot/data/ exists.

    git clone https://github.com/Mearman/OpenAlex.git
    cd OpenAlex
    git submodule update --init
    pip install -r sync/requirements.txt

OpenAlex publishes the snapshot on AWS S3, freely accessible:

    # Download all entities
    python3 -m sync download
    # Download a single entity
    python3 -m sync download --entity works

Files are saved as part_XXXX.jsonl.gz (renamed from S3's part_XXXX.gz so HuggingFace's dataset viewer detects the format).
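The rename can be pictured as a small path helper. This is a minimal sketch, not the code in sync/: the function name `hf_shard_name` is hypothetical, and the example key (entity and date) is made up for illustration.

```python
from pathlib import PurePosixPath


def hf_shard_name(s3_key: str) -> str:
    """Map a shard name ending in .gz to one ending in .jsonl.gz."""
    path = PurePosixPath(s3_key)
    if path.name.endswith(".jsonl.gz"):
        return str(path)  # already renamed, leave untouched
    if not path.name.endswith(".gz"):
        raise ValueError(f"not a gzip shard: {s3_key}")
    stem = path.name[: -len(".gz")]  # e.g. "part_0000"
    return str(path.with_name(f"{stem}.jsonl.gz"))


# Illustrative key only; the date is an example value.
print(hf_shard_name("data/works/updated_date=2024-01-01/part_0000.gz"))
```

Making the helper idempotent (already-renamed files pass through unchanged) keeps a repeated download from producing names like part_0000.jsonl.jsonl.gz.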
Each source shard produces one Parquet file per relationship type (e.g. work_abstracts, author_institutions):

    # Extract everything (skips already-completed shards)
    python3 -m sync extract
    # Extract a single entity
    python3 -m sync extract --entity works
    # Distributed across two machines
    python3 -m sync extract --slice-index 0 --slice-total 2 # machine 1
    python3 -m sync extract --slice-index 1 --slice-total 2 # machine 2

Upload the extracted Parquet files:

    # Upload all untracked parquet files (smallest-first, 50 per batch)
    python3 -m sync upload
    # Custom batch size
    python3 -m sync upload --batch-size 100

Data layout:

    data/{entity}/
      updated_date=YYYY-MM-DD/part_XXXX.jsonl.gz # source data (from S3)
      {relationship_type}/
        {entity}__updated_date=...__part_XXXX.parquet # extracted tables

For example, works produces relationship tables for abstracts, authorships, references, concepts, keywords, locations, and more.
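To make the naming pattern concrete, the sketch below derives an extracted-table path from a source shard path. It is an illustration only, not the extractor's actual code: `parquet_path` is a hypothetical name, and it assumes that the elided `...` in the pattern stands for the shard's YYYY-MM-DD date and that the relationship directory sits directly under the entity directory.

```python
from pathlib import PurePosixPath

SOURCE_SUFFIX = ".jsonl.gz"


def parquet_path(source: str, relationship_type: str) -> str:
    """Derive the extracted-table path for one source shard (illustrative)."""
    src = PurePosixPath(source)
    # Expected shape: data/{entity}/updated_date=YYYY-MM-DD/part_XXXX.jsonl.gz
    entity_dir = src.parent.parent
    date_dir = src.parent.name
    if not date_dir.startswith("updated_date="):
        raise ValueError(f"unexpected source path: {source}")
    if not src.name.endswith(SOURCE_SUFFIX):
        raise ValueError(f"unexpected shard name: {src.name}")
    part = src.name[: -len(SOURCE_SUFFIX)]  # e.g. "part_0000"
    # Assumption: the date directory name is embedded verbatim in the filename.
    filename = f"{entity_dir.name}__{date_dir}__{part}.parquet"
    return str(entity_dir / relationship_type / filename)


# Example values (entity, date, relationship type) are for illustration.
print(parquet_path(
    "data/works/updated_date=2024-01-01/part_0000.jsonl.gz",
    "work_abstracts",
))
```

Embedding the entity and date in the filename keeps each Parquet file traceable to its source shard even when files from many dates share one relationship directory.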

| | |
|---|---|
| Host | https://huggingface.co/datasets/Mearman/OpenAlex |
| Format | JSONL source (.jsonl.gz) + Parquet relationship tables |
| License | CC0 (public domain) |

Entities covered: Works, Authors, Sources, Institutions, Publishers, Funders, Awards, Topics, Concepts, Fields, Subfields, Domains.