Fix DuckDB staging table schema evolution support #605

Merged
jfrench9 merged 2 commits into main from bugfix/duckdb-staging-evolution
Apr 7, 2026

@jfrench9 jfrench9 commented Apr 7, 2026

Summary

Refactors the DuckDB table management layer to properly handle schema evolution in staging tables. Previously, staging tables could fail or produce incorrect results when the schema of incoming data diverged from the existing table definition. This change ensures that new or altered columns are gracefully incorporated without requiring manual table drops or migrations.

Key Accomplishments

  • Schema evolution in DuckDB manager: Extended robosystems/graph_api/core/duckdb/manager.py with robust logic (~50 new lines) to detect schema mismatches between incoming data and existing tables, and to automatically reconcile them (e.g., adding new columns, handling type changes).
  • Staging processor alignment: Updated the SEC ingestion staging processor to leverage the new schema-evolution-aware table management, ensuring the pipeline correctly stages data even as upstream schemas change over time.
  • Defensive table creation/update flow: The manager now inspects existing table schemas before writes, reducing the risk of silent data loss or ingestion failures caused by column drift.
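The column-addition path described above can be sketched as a small planning step. This is a minimal illustration, not the code in `manager.py`: the function names `plan_column_additions` and `quote_ident`, the table name, and the column names are all hypothetical, and the sketch assumes both schemas are already available as name-to-type mappings (for example, read from `DESCRIBE` on the table and on the incoming parquet file). It covers added columns only; type changes on shared columns are out of scope here.

```python
from typing import Mapping


def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def plan_column_additions(
    table: str,
    existing: Mapping[str, str],
    incoming: Mapping[str, str],
) -> list[str]:
    """Return ALTER TABLE statements for incoming columns the table lacks.

    Names are compared case-insensitively, since DuckDB treats identifiers
    as case-insensitive. Columns present in both schemas are left alone.
    """
    known = {name.lower() for name in existing}
    statements = []
    for name, sql_type in incoming.items():
        if name.lower() not in known:
            statements.append(
                f"ALTER TABLE {quote_ident(table)} "
                f"ADD COLUMN {quote_ident(name)} {sql_type}"
            )
            known.add(name.lower())  # guard against duplicates in the input
    return statements


# Hypothetical schemas: the incoming data carries one extra column.
existing = {"cik": "VARCHAR", "filed_at": "DATE"}
incoming = {"CIK": "VARCHAR", "filed_at": "DATE", "form_type": "VARCHAR"}

statements = plan_column_additions("staging_filings", existing, incoming)
# One statement, adding "form_type"; matching schemas yield an empty list.
```

Each returned statement could then be passed to a DuckDB connection's `execute` before the write. Existing rows receive NULL in an added column, which is why the existing columns keep their data.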

Breaking Changes

None expected. The changes are additive and backward-compatible: existing tables with matching schemas continue to work as before, and tables with schema mismatches that previously errored are now evolved automatically.

Testing Notes

  • Verify that staging ingestion succeeds when the target DuckDB table does not yet exist (fresh creation path).
  • Verify that staging ingestion succeeds when the target table already exists with the same schema (no-op evolution path).
  • Verify that staging ingestion succeeds when the target table exists but is missing columns present in the new data (column-addition evolution path).
  • Confirm that no data is lost in existing columns when schema evolution adds new columns.
  • Test with SEC adapter ingestion end-to-end to validate the integration between the staging processor and the updated manager.

Infrastructure Considerations

  • No new dependencies or infrastructure changes required.
  • DuckDB storage files may grow slightly if schema evolution adds nullable columns to existing tables; monitor disk usage in environments with frequent schema changes.
  • Consider periodic validation or compaction of evolved tables in long-running environments to ensure query performance remains stable.

🤖 Generated with Claude Code

Branch Info:

  • Source: bugfix/duckdb-staging-evolution
  • Target: main
  • Type: bugfix

Co-Authored-By: Claude <noreply@anthropic.com>

- Updated the DuckDBTableManager to automatically add new columns from parquet files to the existing table schema during ingestion.
- Adjusted the DuckDBStager to set drop_on_retry to False for improved stability during retries.
Comment thread robosystems/graph_api/core/duckdb/manager.py Fixed
@jfrench9 jfrench9 merged commit c60629c into main Apr 7, 2026
7 checks passed
@jfrench9 jfrench9 deleted the bugfix/duckdb-staging-evolution branch April 7, 2026 16:13