OpenAlex Research Data

Sync/extraction tooling for the OpenAlex scholarly metadata snapshot. The dataset itself lives on HuggingFace (Git LFS via Xet storage).

What's here

Path	Description
`sync/`	Python tooling: download from S3, extract relationship tables to Parquet, manage the snapshot
`openalex-snapshot/`	Git submodule: source data and extracted tables

Quick Start

Run all sync commands from this directory (the repo root). The submodule must be initialised first so that `openalex-snapshot/data/` exists.

git clone https://github.com/Mearman/OpenAlex.git
cd OpenAlex
git submodule update --init

Install dependencies

pip install -r sync/requirements.txt

Download from S3

OpenAlex publishes the snapshot on AWS S3, freely accessible:

# Download all entities
python3 -m sync download

# Download a single entity
python3 -m sync download --entity works

Files are saved as part_XXXX.jsonl.gz (renamed from S3's part_XXXX.gz so HuggingFace's dataset viewer detects the format).
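The rename can be illustrated with a small sketch. The helper name `local_name` and the example key are hypothetical and are not taken from the `sync` package; the sketch only shows the `.gz` to `.jsonl.gz` mapping described above.

```python
from pathlib import PurePosixPath


def local_name(s3_key: str) -> str:
    """Return the local filename for an S3 key (hypothetical helper).

    A plain `.gz` suffix becomes `.jsonl.gz`; other names are left unchanged.
    """
    name = PurePosixPath(s3_key).name
    if name.endswith(".gz") and not name.endswith(".jsonl.gz"):
        return name[: -len(".gz")] + ".jsonl.gz"
    return name


# Illustrative key only; real keys may differ.
print(local_name("data/works/updated_date=2024-01-01/part_0000.gz"))
# prints: part_0000.jsonl.gz
```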

Extract relationship tables to Parquet

Each source shard produces one Parquet file per relationship type (e.g. work_abstracts, author_institutions):

# Extract everything (skips already-completed shards)
python3 -m sync extract

# Extract a single entity
python3 -m sync extract --entity works

# Distributed across two machines
python3 -m sync extract --slice-index 0 --slice-total 2   # machine 1
python3 -m sync extract --slice-index 1 --slice-total 2   # machine 2
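One way to picture the slicing is a round-robin split over the sorted shard list. This is an assumption for illustration: the actual assignment rule used by `sync extract` is not documented here, and `shards_for_slice` is a hypothetical name.

```python
def shards_for_slice(shards, slice_index, slice_total):
    """Pick every slice_total-th shard, starting at slice_index.

    Assumed round-robin rule; the real tool may partition differently.
    """
    if slice_total < 1 or not 0 <= slice_index < slice_total:
        raise ValueError("need 0 <= slice_index < slice_total")
    # Sorting first gives every machine the same view of the shard order.
    return sorted(shards)[slice_index::slice_total]


shards = [f"part_{i:04d}.jsonl.gz" for i in range(5)]
machine_1 = shards_for_slice(shards, 0, 2)  # parts 0000, 0002, 0004
machine_2 = shards_for_slice(shards, 1, 2)  # parts 0001, 0003
```

Under this rule the slices are disjoint and together cover every shard, so each machine can work without coordinating with the other.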

Upload to HuggingFace

# Upload all untracked parquet files (smallest-first, 50 per batch)
python3 -m sync upload

# Custom batch size
python3 -m sync upload --batch-size 100
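The smallest-first batching can be sketched as follows. The function name and the dictionary of file sizes are hypothetical; the sketch shows only the ordering and grouping, not the upload itself.

```python
from typing import Dict, Iterator, List


def batches_smallest_first(sizes: Dict[str, int], batch_size: int = 50) -> Iterator[List[str]]:
    """Yield lists of paths ordered by size, at most batch_size per list."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Ties on size are broken by path so the order is deterministic.
    ordered = sorted(sizes, key=lambda path: (sizes[path], path))
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


sizes = {"a.parquet": 300, "b.parquet": 100, "c.parquet": 200}
for batch in batches_smallest_first(sizes, batch_size=2):
    print(batch)
# prints: ['b.parquet', 'c.parquet']
# prints: ['a.parquet']
```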

Entity layout

data/{entity}/
  updated_date=YYYY-MM-DD/part_XXXX.jsonl.gz      # source data (from S3)
  {relationship_type}/
    {entity}__updated_date=...__part_XXXX.parquet  # extracted tables

For example, works produces relationship tables for abstracts, authorships, references, concepts, keywords, locations, and more.
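The layout above implies a mapping from a source shard to its extracted tables, sketched here. `parquet_path` is a hypothetical helper, and the sketch assumes the elided `updated_date=...` segment of the Parquet filename is the full name of the source date directory.

```python
from pathlib import PurePosixPath

SOURCE_SUFFIX = ".jsonl.gz"


def parquet_path(source: str, relationship_type: str) -> str:
    """Derive the Parquet path for one source shard and relationship type.

    Assumes the layout data/{entity}/updated_date=YYYY-MM-DD/part_XXXX.jsonl.gz.
    """
    src = PurePosixPath(source)
    if not src.name.endswith(SOURCE_SUFFIX):
        raise ValueError(f"not a source shard: {source}")
    entity_dir = src.parent.parent          # data/{entity}
    date_dir = src.parent.name              # updated_date=YYYY-MM-DD
    part = src.name[: -len(SOURCE_SUFFIX)]  # part_XXXX
    filename = f"{entity_dir.name}__{date_dir}__{part}.parquet"
    return str(entity_dir / relationship_type / filename)


print(parquet_path("data/works/updated_date=2024-01-01/part_0000.jsonl.gz", "work_abstracts"))
# prints: data/works/work_abstracts/works__updated_date=2024-01-01__part_0000.parquet
```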

Dataset

Host	https://huggingface.co/datasets/Mearman/OpenAlex
Format	JSONL source (`.jsonl.gz`) + Parquet relationship tables
License	CC0 (public domain)

Entities

Works, Authors, Sources, Institutions, Publishers, Funders, Awards, Topics, Concepts, Fields, Subfields, Domains.
