Summary
Database.add_arrays_batch(...) rejects valid variable-size molecules when custom atom_properties are appended across different n_atoms values.
The failure happens even when each individual atom property has the correct per-structure shape, e.g. (natoms, 3) for a vector per-atom property. The first write locks the schema using the full per-record payload size (natoms * elem_bytes) instead of the per-atom slot size (elem_bytes), so the next molecule with a different natoms fails schema validation.
This affects at least atompack-db==0.3.0.
Minimal reproduction
import tempfile
from pathlib import Path
import atompack
import numpy as np
work = Path(tempfile.mkdtemp())
out = atompack.Database(str(work / "merged.atp"), overwrite=True)
for natoms in [20, 29]:
    positions = np.zeros((1, natoms, 3), dtype=np.float32)
    atomic_numbers = np.ones((1, natoms), dtype=np.uint8)
    energy = np.array([0.0], dtype=np.float64)
    forces = np.zeros((1, natoms, 3), dtype=np.float32)
    out.add_arrays_batch(
        positions,
        atomic_numbers,
        energy=energy,
        forces=forces,
        atom_properties={
            "teacher_forces": np.zeros((1, natoms, 3), dtype=np.float32),
            "hidden_scalar": np.zeros((1, natoms), dtype=np.float32),
        },
    )
out.flush()
Actual result
The second append fails:
ValueError: Invalid data: Schema mismatch for section 'teacher_forces': expected SchemaEntry { type_tag: 4, per_atom: true, elem_bytes: 12, slot_bytes: 240 }, got SchemaEntry { type_tag: 4, per_atom: true, elem_bytes: 12, slot_bytes: 348 }
The values correspond to:
240 = 20 * 3 * sizeof(float32)
348 = 29 * 3 * sizeof(float32)
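The arithmetic can be checked with a few lines of standard-library Python. This is only a sketch: `payload_bytes` is a hypothetical helper, not part of atompack, and it assumes a 4-byte float32.

```python
import struct

FLOAT32_BYTES = struct.calcsize("<f")  # size of one float32 in bytes


def payload_bytes(natoms: int, components: int = 3, elem_size: int = FLOAT32_BYTES) -> int:
    """Hypothetical helper: full per-record payload size of a per-atom property."""
    return natoms * components * elem_size


# Payload sizes for the two molecules in the reproduction.
sizes = {natoms: payload_bytes(natoms) for natoms in (20, 29)}
print(sizes)  # {20: 240, 29: 348}

# Dividing by natoms recovers the same per-atom size for both molecules.
per_atom = {natoms: size // natoms for natoms, size in sizes.items()}
print(per_atom)  # {20: 12, 29: 12}
```

Both reported slot_bytes values reduce to the same 12 bytes per atom, which is the size-independent quantity the schema could lock on.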
Expected result
This should succeed. teacher_forces is a per-atom vector property with shape (natoms, 3) for each molecule. Its schema should be independent of molecule size:
per_atom: true, elem_bytes: 12, slot_bytes: 12
Similarly, scalar per-atom properties shaped (natoms,) should use slot_bytes = sizeof(dtype), not natoms * sizeof(dtype).
Diagnosis
The issue appears to be in the Python/Rust batch path, not in SOA decoding itself.
In atompack-py/src/database_batch.rs, extract_vec3_column() stores the per-record payload size in BatchSectionColumn.slot_bytes:
slot_bytes: expected_rows * 3 * std::mem::size_of::<T>(),
Then schema_section_from_column() passes that same value into the database schema:
fn schema_section_from_column(column: &BatchSectionColumn) -> DatabaseSchemaSection {
    schema_section(column.kind, &column.key, column.type_tag, column.slot_bytes)
}
For KIND_ATOM_PROP, this means schema slot_bytes becomes molecule-size dependent. The first molecule locks e.g. 240; the second molecule with a different natoms produces e.g. 348; schema merge rejects it.
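The locking behavior described here can be modeled in pure Python. This is a toy model, not atompack code: `SchemaLock`, `vec3_entry`, and `append_all` are hypothetical names, and only the `SchemaEntry` fields and the values 4, 12, 240, and 348 are taken from the error message above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaEntry:
    type_tag: int
    per_atom: bool
    elem_bytes: int
    slot_bytes: int


class SchemaLock:
    """Toy stand-in for the database schema: the first write locks each section."""

    def __init__(self):
        self._locked = {}

    def merge(self, key, entry):
        # setdefault stores the first entry seen and returns it on later calls.
        expected = self._locked.setdefault(key, entry)
        if expected != entry:
            raise ValueError(
                f"Schema mismatch for section {key!r}: expected {expected}, got {entry}"
            )


def vec3_entry(natoms, use_payload_size):
    """Build the schema entry for a float32 vec3 per-atom property."""
    elem_bytes = 3 * 4  # three float32 components
    slot_bytes = natoms * elem_bytes if use_payload_size else elem_bytes
    return SchemaEntry(type_tag=4, per_atom=True, elem_bytes=elem_bytes, slot_bytes=slot_bytes)


def append_all(natoms_list, use_payload_size):
    """Append one molecule per natoms value; raises on the first schema mismatch."""
    lock = SchemaLock()
    for natoms in natoms_list:
        lock.merge("teacher_forces", vec3_entry(natoms, use_payload_size))
    return True


# Payload-sized slot_bytes: the second molecule is rejected (240 vs 348).
try:
    append_all([20, 29], use_payload_size=True)
except ValueError as exc:
    print("rejected:", exc)

# Per-atom slot_bytes: both molecules share the same entry and are accepted.
print(append_all([20, 29], use_payload_size=False))  # True
```

In this model the only difference between the failing and passing runs is which size goes into `slot_bytes`, which matches the diagnosis that decoding is not involved.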
Built-in forces do not hit this because the fast path explicitly uses 12 for TYPE_VEC3_F32:
schema_sections.push(schema_section(KIND_BUILTIN, "forces", TYPE_VEC3_F32, 12));
The raw SOA/schema parsing path also seems to compute the desired per-atom schema correctly: per-atom arrays and vec3 fields use elem_bytes as schema slot_bytes.
Suggested fix direction
Separate the two concepts currently represented by BatchSectionColumn.slot_bytes:
- per-record payload stride used for slicing batch buffers, e.g. natoms * 3 * sizeof(T)
- schema slot bytes used for schema locking, e.g. 3 * sizeof(T) for per-atom vec3 and sizeof(T) for per-atom scalar arrays
For KIND_ATOM_PROP, schema_section_from_column() should likely emit schema slot_bytes = type_tag_elem_bytes(type_tag) for numeric per-atom fields, while preserving the current per-record payload stride for slicing.
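One way to picture the separation is the Python sketch below. It is illustrative only and says nothing about how the Rust code should be structured: `BatchColumn`, `payload_stride`, `schema_slot_bytes`, `make_vec3_column`, and the `ELEM_BYTES` table with its string tags are all hypothetical stand-ins for the real types and for `type_tag_elem_bytes`.

```python
from dataclasses import dataclass

# Hypothetical element sizes per type tag; the real mapping lives in atompack.
ELEM_BYTES = {"vec3_f32": 12, "f32_array": 4}


@dataclass(frozen=True)
class BatchColumn:
    key: str
    type_tag: str
    per_atom: bool
    payload_stride: int  # bytes per record, used only for slicing the batch buffer

    def schema_slot_bytes(self) -> int:
        """Size locked into the schema; independent of natoms for atom properties."""
        if self.per_atom:
            return ELEM_BYTES[self.type_tag]
        return self.payload_stride


def make_vec3_column(key: str, natoms: int) -> BatchColumn:
    """Build a per-atom float32 vec3 column for one record of natoms atoms."""
    return BatchColumn(key, "vec3_f32", True, natoms * ELEM_BYTES["vec3_f32"])


small = make_vec3_column("teacher_forces", 20)
large = make_vec3_column("teacher_forces", 29)
print(small.payload_stride, large.payload_stride)            # 240 348
print(small.schema_slot_bytes(), large.schema_slot_bytes())  # 12 12
```

The stride still varies with molecule size, so buffer slicing is unchanged, while the value offered to the schema stays constant across records.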
Context
This came up while trying to merge AtomPack shards containing variable-size atomistic structures with cached model outputs:
- built-in DFT energy and forces
- graph-level teacher energy as molecule metadata
- teacher force predictions as an atom property, shape (natoms, 3)
- hidden representations as atom properties, shape (natoms,) or vector channels
Writing same-natoms shards works. Merging/appending across different natoms fails on the custom atom-property schema lock.