Enhancement request
LadybugDB would benefit from a native bulk-load / fast-ingest mode optimized for scenarios where durability-per-write is not required, such as:
- Replaying a write-ahead log into a fresh database
- Initial data loading from an external source of truth
- Migration/recovery operations
Motivation
We are using LadybugDB as the embedded graph database for Graphiti (knowledge graph) in the Liminis notes application. Our graphiti integration keeps its own JSONL WAL (separate from LadybugDB's db.wal) so we can recover the database by replaying mutations. Production workspace has 16,996 WAL files containing ~200K+ mutations.
Replay takes 25+ minutes, even though the machine is neither CPU-bound nor I/O-throughput-bound. The dominant cost is per-write commit durability (presumably fsync on db.wal per transaction).
Key insight
During replay, durability-per-write is not valuable. The source of truth is the external WAL file. If replay crashes mid-way, the correct recovery is to discard the partial DB and restart from scratch. fsync'ing each write only slows things down without providing any real safety.
What we'd like
A mode where:
- All mutations go into a single logical transaction (or batch)
- fsync is deferred to a single call at the end
- Indices are built on final state (we already do this)
- Optionally: checksums skipped, checkpoints deferred
Potential shapes:
-
Database-level flag at open time:
```python
kuzu.Database(path, bulk_load_mode=True)
```
-
Explicit transactions exposed via the Python `Connection` API:
```python
conn.execute("BEGIN TRANSACTION")
for mutation in wal:
conn.execute(mutation.cypher, mutation.params)
conn.execute("COMMIT")
```
(We see BEGIN/COMMIT in Kuzu's Cypher grammar but haven't verified it works via the Python `execute()` call.)
-
Native `COPY FROM` bulk loader like Kuzu upstream — but this requires the data in a specific format (CSV, Parquet) and doesn't fit our MERGE-heavy mutation stream.
Current workaround options we're considering
- `auto_checkpoint=False` at Database construction — defers single final checkpoint
- `enable_checksums=False` — small speedup
- Wrapping replay in `BEGIN TRANSACTION ... COMMIT` if that works
Context
Expected impact
If bulk-load mode can cut fsync-per-write, we estimate 10-100x speedup on replay, turning 25-minute replays into 30 seconds — making the WAL a much more ergonomic recovery tool.
Enhancement request
LadybugDB would benefit from a native bulk-load / fast-ingest mode optimized for scenarios where durability-per-write is not required, such as:
Motivation
We are using LadybugDB as the embedded graph database for Graphiti (knowledge graph) in the Liminis notes application. Our graphiti integration keeps its own JSONL WAL (separate from LadybugDB's db.wal) so we can recover the database by replaying mutations. Production workspace has 16,996 WAL files containing ~200K+ mutations.
Replay takes 25+ minutes, even though the machine is neither CPU-bound nor I/O-throughput-bound. The dominant cost is per-write commit durability (presumably fsync on db.wal per transaction).
Key insight
During replay, durability-per-write is not valuable. The source of truth is the external WAL file. If replay crashes mid-way, the correct recovery is to discard the partial DB and restart from scratch. fsync'ing each write only slows things down without providing any real safety.
What we'd like
A mode where:
Potential shapes:
Database-level flag at open time:
```python
kuzu.Database(path, bulk_load_mode=True)
```
Explicit transactions exposed via the Python `Connection` API:
```python
conn.execute("BEGIN TRANSACTION")
for mutation in wal:
conn.execute(mutation.cypher, mutation.params)
conn.execute("COMMIT")
```
(We see BEGIN/COMMIT in Kuzu's Cypher grammar but haven't verified it works via the Python `execute()` call.)
Native `COPY FROM` bulk loader like Kuzu upstream — but this requires the data in a specific format (CSV, Parquet) and doesn't fit our MERGE-heavy mutation stream.
Current workaround options we're considering
Context
Expected impact
If bulk-load mode can cut fsync-per-write, we estimate 10-100x speedup on replay, turning 25-minute replays into 30 seconds — making the WAL a much more ergonomic recovery tool.