Unified access to molecular benchmark datasets, with one-click upload to public data repositories.
molhub gives every molecular dataset the same interface — index by position, iterate sample by sample, or slice with subsets — and ships uploaders for publishing your own data to HuggingFace Hub and Figshare. Samples carry atomic coordinates, forces, and computed properties in a consistent structure, so downstream code works identically across QM9, revMD17, 3BPA, or your own CSV files.
Under active development. Public APIs may change between minor releases.
| Module | Capability |
|---|---|
| molhub.dataset (protocols) | MapDataset and IterableDataset runtime-checkable protocols, plus TargetSchema declaring graph-level vs atom-level targets |
| molhub.dataset (helpers) | InMemoryDataset wraps a list of frames; SubsetDataset slices any map-style dataset by index |
| QM9Source | 130,831 small organic molecules with quantum-chemical properties; auto-downloads and caches from Figshare |
| RevMD17Source | Revised MD17 molecular-dynamics trajectories for 10 molecules, with energies and forces |
| ThreeBPASource | 3BPA temperature-transferability benchmark; extended-XYZ files at 300 K / 600 K / 1200 K |
| CSVDataset | Generic CSV loader for local files or remote URLs; zero extra dependencies, each row becomes a Frame |
| HuggingFaceUploader | Upload files, folders, and datasets to a HuggingFace Hub repository |
| FigshareUploader | Create Figshare articles and upload files via the Figshare REST API |
Each sample is a molpy Frame: the atoms block holds per-atom data (element, x, y, z, number, and optionally fx, fy, fz), while graph-level targets such as energy live in frame.metadata.
pip install molhub # core (datasets + protocols + Figshare uploader)
pip install molhub[huggingface] # + HuggingFace Hub upload support
pip install molhub[dev] # + dev tooling (pytest, pytest-cov, pytest-mock, ruff)

Requires Python >= 3.10. Core dependencies: molcrafts-molpy >= 0.3.0, tqdm, requests >= 2.28.
from molhub.dataset import QM9Source, CSVDataset
# Load QM9 — auto-downloads & caches the 130k-molecule tarball
qm9 = QM9Source("./data/qm9")
print(len(qm9)) # 130831
frame = qm9[42] # one sample carries atoms + computed properties
print(frame.metadata) # {'A': ..., 'B': ..., 'U0': ..., ...}
# Load any CSV from a URL or local file
ds = CSVDataset("https://zenodo.org/records/14980914/files/LAMALAB_CURATED_Tg_structured.csv")
print(ds.headers) # ['labels.SMILES', 'labels.Exp_Tg(K)', ...]
frame = ds[0]
print(frame.metadata["labels.Exp_Tg(K)"]) # 373.0

Every dataset conforms to one of two runtime-checkable protocols, so downstream code can consume them generically.
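As a rough illustration of what "runtime-checkable" buys downstream code, the sketch below defines two stand-in protocols with `typing.Protocol` and dispatches on them. `MapLike`, `IterLike`, `ToyMap`, `ToyStream`, and `first_sample` are hypothetical names invented for this example; molhub's own protocol definitions may differ.

```python
from typing import Any, Iterator, Protocol, runtime_checkable


@runtime_checkable
class MapLike(Protocol):
    """Stand-in for a map-style protocol (hypothetical, not molhub's class)."""

    def __len__(self) -> int: ...
    def __getitem__(self, idx: int) -> Any: ...


@runtime_checkable
class IterLike(Protocol):
    """Stand-in for a streaming protocol (hypothetical, not molhub's class)."""

    def __iter__(self) -> Iterator[Any]: ...


class ToyMap:
    """Random-access toy dataset; note that it does not inherit from MapLike."""

    def __init__(self, samples: list) -> None:
        self._samples = samples

    def __len__(self) -> int:
        return len(self._samples)

    def __getitem__(self, idx: int) -> Any:
        return self._samples[idx]


class ToyStream:
    """Streaming toy dataset that only supports iteration."""

    def __init__(self, samples: list) -> None:
        self._samples = samples

    def __iter__(self) -> Iterator[Any]:
        return iter(self._samples)


def first_sample(dataset: object) -> Any:
    """Fetch the first sample, preferring random access when it is available."""
    if isinstance(dataset, MapLike):
        return dataset[0]
    if isinstance(dataset, IterLike):
        return next(iter(dataset))
    raise TypeError(f"unsupported dataset type: {type(dataset).__name__}")
```

Keep in mind that `isinstance` against a runtime-checkable protocol only verifies that the named methods exist; it does not check their signatures or return types.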
MapDataset: index-addressable; supports len() and random access by integer position. Use it when samples fit in memory or are backed by a random-access store.

from molpy import Frame  # sample type; assumes molpy exports Frame at the top level
from molhub.dataset import MapDataset

class MyDataset:
    def __len__(self) -> int: ...
    def __getitem__(self, idx: int) -> Frame: ...
    @property
    def source_id(self) -> str: ...

assert isinstance(MyDataset(), MapDataset)  # structural check; no inheritance needed

IterableDataset: streaming; samples are yielded one at a time. Use it for lazy file readers, on-the-fly generation, or datasets too large to hold in memory.
from collections.abc import Iterator

from molpy import Frame  # sample type; assumes molpy exports Frame at the top level
from molhub.dataset import IterableDataset

class MyStream:
    def __iter__(self) -> Iterator[Frame]: ...
    @property
    def source_id(self) -> str: ...

assert isinstance(MyStream(), IterableDataset)

TargetSchema: each dataset declares where its targets live, so batching and collation logic knows what to expect:
| Target level | Location | Example |
|---|---|---|
| graph_level | frame.metadata | energy, homo, lumo |
| atom_level | atoms block columns | fx, fy, fz |
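To make the graph-level vs atom-level split concrete, here is a minimal collation sketch. `ToySchema`, `ToyFrame`, and `collate` are hypothetical stand-ins written for this example; the real TargetSchema fields and the molpy Frame API are not shown in this README and may look different. The numbers are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ToySchema:
    """Hypothetical stand-in for a target schema (field names are assumed)."""

    graph_level: list = field(default_factory=list)
    atom_level: list = field(default_factory=list)


@dataclass
class ToyFrame:
    """Hypothetical stand-in for a frame: per-atom columns plus metadata."""

    atoms: dict      # column name -> list with one entry per atom
    metadata: dict   # graph-level values, one per frame


def collate(frames: list, schema: ToySchema) -> dict:
    """Gather targets from a list of frames according to the schema.

    Graph-level targets yield one value per frame; atom-level targets are
    concatenated across frames, so their length is the total atom count.
    """
    batch = {name: [] for name in schema.graph_level + schema.atom_level}
    for frame in frames:
        for name in schema.graph_level:
            batch[name].append(frame.metadata[name])
        for name in schema.atom_level:
            batch[name].extend(frame.atoms[name])
    # Record how many atoms each frame contributed, so atom-level values
    # can be mapped back to their frame later.
    batch["n_atoms"] = [len(f.atoms["element"]) for f in frames]
    return batch


# Made-up values, for illustration only.
water = ToyFrame(
    atoms={"element": ["O", "H", "H"], "fx": [0.1, -0.05, -0.05]},
    metadata={"energy": -76.4},
)
h2 = ToyFrame(
    atoms={"element": ["H", "H"], "fx": [0.2, -0.2]},
    metadata={"energy": -1.17},
)
schema = ToySchema(graph_level=["energy"], atom_level=["fx"])
batch = collate([water, h2], schema)
```

Here `batch["energy"]` has one entry per frame, while `batch["fx"]` has one entry per atom across the whole batch. That difference in length is why collation code needs to know each target's level.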
InMemoryDataset and SubsetDataset are convenience helpers for wrapping and slicing:

from molhub.dataset import InMemoryDataset, SubsetDataset
# `frames` is a list of Frame objects you already hold; `qm9` is the QM9Source loaded earlier
inmem = InMemoryDataset(frames, name="my-frames") # wrap a list
subset = SubsetDataset(qm9, indices=[0, 10, 100, 1000]) # slice by index

QM9 (QM9Source): 130,831 small organic molecules with quantum-chemical properties. Reference: Ramakrishnan et al., Scientific Data 1, 140022 (2014).
from molhub.dataset import QM9Source
source = QM9Source("./data/qm9", targets=["U0", "H", "gap"])
# Auto-downloads from Figshare, caches locally, lazy-loads on first access

Graph-level targets: A, B, C, mu, alpha, homo, lumo, gap, r2, zpve, U0, U, H, G, Cv.
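One plausible reading of the `targets` argument is that it narrows each sample's metadata to the named properties. The sketch below shows that idea with a plain dict. `select_targets` is a hypothetical helper, and how QM9Source actually handles `targets` internally is an assumption here; only the target names come from the list above.

```python
# Target names as listed above.
QM9_TARGETS = (
    "A", "B", "C", "mu", "alpha", "homo", "lumo", "gap",
    "r2", "zpve", "U0", "U", "H", "G", "Cv",
)


def select_targets(metadata: dict, targets: list) -> dict:
    """Keep only the requested targets, rejecting names QM9 does not define."""
    unknown = [name for name in targets if name not in QM9_TARGETS]
    if unknown:
        raise ValueError(f"unknown QM9 targets: {unknown}")
    return {name: metadata[name] for name in targets}


# Made-up values, for illustration only.
sample_metadata = {"U0": -40.5, "gap": 0.5, "mu": 0.0}
picked = select_targets(sample_metadata, ["U0", "gap"])
```

Validating names up front surfaces a typo such as "u0" as a clear error, not as a KeyError deep inside a training loop.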
revMD17 (RevMD17Source): revised MD17 molecular-dynamics trajectories (10 molecules). Reference: Christensen & von Lilienfeld, MLST (2020).

from molhub.dataset import RevMD17Source
source = RevMD17Source("./data/revmd17", molecule="aspirin")
frame = source[100]
# atoms block: element, x, y, z, number, fx, fy, fz
# metadata: {'energy': ...}

Available molecules: aspirin, azobenzene, benzene, ethanol, malonaldehyde, naphthalene, paracetamol, salicylic, toluene, uracil.
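Once the fx, fy, fz columns are in hand, per-atom force magnitudes follow directly. The sketch below assumes the three columns can be read as equal-length sequences of floats; how to pull them out of a molpy Frame is not covered here, and `force_magnitudes` is a hypothetical helper.

```python
import math


def force_magnitudes(fx: list, fy: list, fz: list) -> list:
    """Return the Euclidean norm of the force on each atom."""
    if not (len(fx) == len(fy) == len(fz)):
        raise ValueError("fx, fy, fz must have one entry per atom")
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(fx, fy, fz)]


# Made-up force components for two atoms, for illustration only.
norms = force_magnitudes([3.0, 0.0], [4.0, 0.0], [0.0, 2.0])
print(norms)  # [5.0, 2.0]
```

The largest per-atom norm in a frame is a handy sanity check when comparing trajectories, since unusually large forces often point to a unit mismatch.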
3BPA (ThreeBPASource): temperature-transferability benchmark at 300 K, 600 K, and 1200 K. Reference: Kovacs et al., J. Chem. Theory Comput. (2021).

from molhub.dataset import ThreeBPASource
train = ThreeBPASource("./data/3bpa/train_300K.xyz", tag="train_300K")
test_600 = ThreeBPASource("./data/3bpa/test_600K.xyz", tag="test_600K")

CSVDataset: generic CSV loader that works on local files and remote URLs. No extra dependencies.
from molhub.dataset import CSVDataset
ds = CSVDataset("/path/to/data.csv")
ds = CSVDataset("https://example.com/data.csv")
# cached to $MOLHUB_CACHE_DIR, falling back to ~/.cache/molhub

HuggingFace Hub: requires pip install molhub[huggingface].
from molhub.uploader import HuggingFaceUploader
uploader = HuggingFaceUploader(token="hf_...") # or set HF_TOKEN
uploader.upload_file("data/qm9.tar.bz2", repo_id="my-org/qm9", path_in_repo="raw/qm9.tar.bz2")
uploader.upload_folder("data/processed/", repo_id="my-org/dataset", path_in_repo="processed/")
uploader.upload_dataset("data/dataset.h5", repo_id="my-org/new-dataset") # create + upload

Figshare: uses the core requests dependency; no extra install needed.
from molhub.uploader import FigshareUploader
uploader = FigshareUploader(token="...") # or set FIGSHARE_TOKEN
result = uploader.upload_dataset(
    "data/qm9.tar.bz2",
    title="QM9 dataset",
    description="Raw QM9 molecular properties dataset",
    tags=["quantum-chemistry", "molecules"],
)

| Project | Role |
|---|---|
| molpy | Python toolkit — the shared molecular data model & workflow layer |
| molrs | Rust core — molecular data structures & compute kernels (native + WASM) |
| molpack | Packmol-grade molecular packing (Rust + Python) |
| molvis | WebGL molecular visualization & editing |
| molexp | Workflow & experiment-management platform |
| molnex | Molecular machine-learning framework |
| molq | Unified job queue — local / SLURM / PBS / LSF |
| molcfg | Layered configuration library |
| mollog | Structured logging, stdlib-compatible |
| molhub | Molecular dataset hub — this repo |
| molmcp | MCP server for the ecosystem |
| molrec | Atomistic record specification |

BSD-3-Clause. See LICENSE.