add_arrays_batch rejects variable-size atom_properties due to per-atom schema slot_bytes #33

@Ramlaoui

Summary

Database.add_arrays_batch(...) rejects valid variable-size molecules when batches carrying custom atom_properties are appended with different natoms values.

The failure happens even when each individual atom property has the correct per-structure shape, e.g. (natoms, 3) for a vector per-atom property. The first write locks the schema using the full per-record payload size (natoms * elem_bytes) instead of the per-atom slot size (elem_bytes), so the next molecule with a different natoms fails schema validation.

This affects at least atompack-db==0.3.0.

Minimal reproduction

import tempfile
from pathlib import Path

import atompack
import numpy as np

work = Path(tempfile.mkdtemp())
out = atompack.Database(str(work / "merged.atp"), overwrite=True)

# Append two single-molecule batches with different atom counts.
for natoms in [20, 29]:
    positions = np.zeros((1, natoms, 3), dtype=np.float32)
    atomic_numbers = np.ones((1, natoms), dtype=np.uint8)
    energy = np.array([0.0], dtype=np.float64)
    forces = np.zeros((1, natoms, 3), dtype=np.float32)

    out.add_arrays_batch(
        positions,
        atomic_numbers,
        energy=energy,
        forces=forces,
        atom_properties={
            # Per-atom vec3 property: (batch, natoms, 3).
            "teacher_forces": np.zeros((1, natoms, 3), dtype=np.float32),
            # Per-atom scalar property: (batch, natoms).
            "hidden_scalar": np.zeros((1, natoms), dtype=np.float32),
        },
    )

out.flush()

Actual result

The second append fails:

ValueError: Invalid data: Schema mismatch for section 'teacher_forces': expected SchemaEntry { type_tag: 4, per_atom: true, elem_bytes: 12, slot_bytes: 240 }, got SchemaEntry { type_tag: 4, per_atom: true, elem_bytes: 12, slot_bytes: 348 }

The two slot_bytes values are the full per-record payload sizes of the two molecules:

  • 240 = 20 * 3 * sizeof(float32)
  • 348 = 29 * 3 * sizeof(float32)
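The arithmetic can be checked with a small stdlib-only sketch. The helper names (`per_atom_slot_bytes`, `payload_bytes`) are illustrative and not part of atompack; the numbers are the ones from the error message.

```python
import struct

# Size of one float32 in bytes, using the standard (non-native) layout.
FLOAT32_BYTES = struct.calcsize("<f")


def per_atom_slot_bytes(components: int, itemsize: int = FLOAT32_BYTES) -> int:
    """Bytes stored for ONE atom; independent of molecule size."""
    return components * itemsize


def payload_bytes(natoms: int, components: int, itemsize: int = FLOAT32_BYTES) -> int:
    """Bytes stored for ONE record, i.e. all atoms of one molecule."""
    return natoms * per_atom_slot_bytes(components, itemsize)


for natoms in (20, 29):
    # Columns: natoms, per-record payload size, per-atom slot size.
    print(natoms, payload_bytes(natoms, 3), per_atom_slot_bytes(3))
```

The per-record size changes with natoms (240 and 348), while the per-atom size stays at 12, which matches the elem_bytes reported in both schema entries.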

Expected result

This should succeed. teacher_forces is a per-atom vector property with shape (natoms, 3) for each molecule. Its schema should be independent of molecule size:

per_atom: true, elem_bytes: 12, slot_bytes: 12

Similarly, scalar per-atom properties shaped (natoms,) should use slot_bytes = sizeof(dtype), not natoms * sizeof(dtype). For the float32 hidden_scalar in the reproduction, that is 4.

Diagnosis

The issue appears to be in the Python/Rust batch path, not in SOA decoding itself.

In atompack-py/src/database_batch.rs, extract_vec3_column() stores the per-record payload size in BatchSectionColumn.slot_bytes:

slot_bytes: expected_rows * 3 * std::mem::size_of::<T>(),

Then schema_section_from_column() passes that same value into the database schema:

fn schema_section_from_column(column: &BatchSectionColumn) -> DatabaseSchemaSection {
    schema_section(column.kind, &column.key, column.type_tag, column.slot_bytes)
}

For KIND_ATOM_PROP, this means schema slot_bytes becomes molecule-size dependent. The first molecule locks e.g. 240; the second molecule with a different natoms produces e.g. 348; schema merge rejects it.
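The failure mode can be modeled in a few lines of Python. This is a simplified stand-in for the schema lock, not atompack's implementation: `SchemaEntry` mirrors the fields printed in the error message, while `SchemaLock` and `entry_from_payload` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaEntry:
    type_tag: int
    per_atom: bool
    elem_bytes: int
    slot_bytes: int


class SchemaLock:
    """Hypothetical model: the first write fixes each section's entry."""

    def __init__(self):
        self._locked = {}

    def merge(self, key, entry):
        # Store the entry on first sight, otherwise compare against the lock.
        expected = self._locked.setdefault(key, entry)
        if expected != entry:
            raise ValueError(
                f"Schema mismatch for section {key!r}: "
                f"expected {expected}, got {entry}"
            )


def entry_from_payload(natoms):
    # Models the reported behavior: slot_bytes carries the per-record
    # payload size (natoms * elem_bytes) instead of the per-atom size.
    return SchemaEntry(type_tag=4, per_atom=True, elem_bytes=12, slot_bytes=natoms * 12)


lock = SchemaLock()
lock.merge("teacher_forces", entry_from_payload(20))  # locks slot_bytes=240
try:
    lock.merge("teacher_forces", entry_from_payload(29))  # slot_bytes=348
except ValueError as err:
    print(err)
```

Because the locked entry embeds natoms, any later record with a different atom count compares unequal, even though type_tag, per_atom, and elem_bytes all agree.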

Built-in forces do not hit this because the fast path explicitly uses 12 for TYPE_VEC3_F32:

schema_sections.push(schema_section(KIND_BUILTIN, "forces", TYPE_VEC3_F32, 12));

The raw SOA/schema parsing path also seems to compute the desired per-atom schema correctly: per-atom arrays and vec3 fields use elem_bytes as schema slot_bytes.

Suggested fix direction

Separate the two concepts currently represented by BatchSectionColumn.slot_bytes:

  1. per-record payload stride used for slicing batch buffers, e.g. natoms * 3 * sizeof(T)
  2. schema slot bytes used for schema locking, e.g. 3 * sizeof(T) for per-atom vec3 and sizeof(T) for per-atom scalar arrays

For KIND_ATOM_PROP, schema_section_from_column() should likely emit schema slot_bytes = type_tag_elem_bytes(type_tag) for numeric per-atom fields, while preserving the current per-record payload stride for slicing.
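A minimal Python sketch of that separation, assuming a column carries enough information to derive both values. `ColumnModel`, `payload_stride`, and `schema_slot_bytes` are hypothetical names; the real change would be in the Rust `BatchSectionColumn`, and this model covers only the per-atom vec3 and scalar cases discussed here.

```python
from dataclasses import dataclass

FLOAT32_BYTES = 4


@dataclass(frozen=True)
class ColumnModel:
    key: str
    components: int  # 3 for a per-atom vec3, 1 for a per-atom scalar
    itemsize: int    # bytes per component, e.g. 4 for float32
    natoms: int

    @property
    def payload_stride(self) -> int:
        """Per-record stride used to slice batch buffers; varies with natoms."""
        return self.natoms * self.components * self.itemsize

    @property
    def schema_slot_bytes(self) -> int:
        """Per-atom size used for schema locking; independent of natoms."""
        return self.components * self.itemsize


small = ColumnModel("teacher_forces", 3, FLOAT32_BYTES, natoms=20)
large = ColumnModel("teacher_forces", 3, FLOAT32_BYTES, natoms=29)

print(small.payload_stride, large.payload_stride)        # 240 348
print(small.schema_slot_bytes, large.schema_slot_bytes)  # 12 12
```

With the two values kept apart, slicing still uses the molecule-dependent stride, while the value handed to the schema is identical for both molecules and would pass an equality-based lock.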

Context

This came up while trying to merge AtomPack shards containing variable-size atomistic structures with cached model outputs:

  • built-in DFT energy and forces
  • graph-level teacher energy as molecule metadata
  • teacher force predictions as atom property (natoms, 3)
  • hidden representations as atom properties (natoms,) or vector channels

Writing same-natoms shards works. Merging/appending across different natoms fails on the custom atom-property schema lock.
