feat(self-packaging #44): embed:// DuckDB filesystem #54
Merged
Conversation
This was referenced May 22, 2026
Part of #40, depends on #43 (shared ArchiveContents). Lets SQL templates say `read_csv('embed://data/cities.csv')` and have them resolve to the in-memory archive when the binary is bundled.

- `src/include/duckdb_embed_fs.hpp` + `src/duckdb_embed_fs.cpp`:
  - `EmbeddedFileSystem` subclasses `duckdb::FileSystem`
  - `EmbeddedFileHandle` is the per-open-file state (non-owning data pointer + streaming position); shares the same `std::shared_ptr<const ArchiveEntries>` set by #43 so there's one decompressed map for both readers
  - overrides: OpenFile, Read (streaming + positional), GetFileSize, Seek, **SeekPosition**, **Glob**, FileExists, CanHandleFile, OnDiskFile, CanSeek, GetName
  - `Glob` accepts both exact paths and `*` / `?` patterns; returns paths sorted, with the `embed://` prefix preserved
  - `RegisterEmbeddedFileSystem()` helper: looks up `FileProviderFactory::GetBundleContents()`, walks via `DatabaseManager::getInstance()->getConnection()` to reach the `duckdb::Connection` wrapper, fetches the VFS via `FileSystem::GetFileSystem(context)`, and calls `RegisterSubSystem` on it; no-op when no bundle / DB not yet up.
- `src/main.cpp`: after `initializeDatabase`, call `RegisterEmbeddedFileSystem()` once. Logged at INFO when it registers; silent when there's no bundle.

The Glob + SeekPosition overrides are the spike-found requirement -- the base `FileSystem` throws "not implemented" for both rather than no-oping, and `read_csv()` exercises both before returning a row.
Tests (`test/cpp/duckdb_embed_fs_test.cpp`, 10 cases):

- CanHandleFile recognises embed:// paths
- OpenFile returns handle / throws on missing
- Read (streaming) advances position; second read returns 0 at EOF
- Read (positional) does not move the cursor
- Seek + SeekPosition agree
- Glob expands `*.csv` to the right entries (sorted)
- Glob exact-match for non-wildcard / empty for missing
- **proof-of-life:** spin up an in-memory DuckDB, register the embed FS, run `SELECT name FROM read_csv('embed://data/people.csv') ORDER BY name`, assert Alice / Bob / Carol
- **proof-of-life with glob:** same with `read_csv('embed://data/*.csv', union_by_name=true)` returning 6 rows

This is the spike's behaviour #9 (the most important test of the whole epic), and it now runs natively against the flapi build of DuckDB rather than the spike's standalone amalgamation.

Verified:

- 10/10 new tests pass.
- Existing 588 tests still pass; same 2 pre-existing DuckDB ASan leaks on main (`QueryExecutor type coverage`, `DuckDBResult RAII`) remain unchanged.
- Filesystem-mode smoke test (`flapi --validate-config -c examples/flapi.yaml`) loads cleanly with no embed FS messages.

Closes #44.
6b4007e to 1f0bb8f
This was referenced May 24, 2026
Part of epic #40. Stacked on #53 which is stacked on #52 which is stacked on #51.
Summary
Adds the DuckDB-side reader for the embed scheme. After this PR lands, the bundle is fully observable from both halves of flapi: config loading via `IFileProvider` (#43) AND SQL-side `read_csv()` / `read_parquet()` / `read_json()` / etc. via DuckDB's `VirtualFileSystem`.

- `src/include/duckdb_embed_fs.hpp` + `src/duckdb_embed_fs.cpp` -- `EmbeddedFileSystem` subclassing `duckdb::FileSystem`, plus the `RegisterEmbeddedFileSystem()` helper that wires it onto the running DuckDB instance.
- `src/main.cpp` -- one new call after `initializeDatabase`, logged at INFO when it actually registers, silent when there's no bundle.
Why two file-system layers (and not one)
`IFileProvider` covers config loading and SQL-template reading from flapi C++ code. But SQL templates themselves contain DuckDB calls like `read_csv('data/foo.csv')`, and those bypass `IFileProvider` entirely -- they go to DuckDB's `VirtualFileSystem`. Hence two adapters over one shared `std::shared_ptr<const ArchiveEntries>`: one decompressed map, two readers.
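To make the "one decompressed map, two readers" shape concrete, here is a minimal sketch in plain C++ with no DuckDB dependency. Everything in it is an assumption for illustration: the `ArchiveEntries` alias is a guess at the shape of the shared type, and `ConfigSideReader`, `SqlSideReader` and `ByteView` are hypothetical names, not the classes added by this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the shared archive type: bundle path -> decompressed bytes.
using ArchiveEntries = std::map<std::string, std::vector<std::uint8_t>>;

// Plays the role of the config-side adapter: copies a whole entry out as text.
class ConfigSideReader {
public:
    explicit ConfigSideReader(std::shared_ptr<const ArchiveEntries> entries)
        : entries_(std::move(entries)) {
        assert(entries_ != nullptr);
    }

    // Returns true and fills `out` when the path exists in the archive.
    bool ReadAll(const std::string &path, std::string &out) const {
        auto it = entries_->find(path);
        if (it == entries_->end()) return false;
        out.assign(it->second.begin(), it->second.end());
        return true;
    }

private:
    std::shared_ptr<const ArchiveEntries> entries_;
};

// Non-owning view into an entry; valid only while the shared map is alive.
struct ByteView {
    const std::uint8_t *data;
    std::size_t size;
};

// Plays the role of the SQL-side adapter: resolves embed:// URLs without copying.
class SqlSideReader {
public:
    explicit SqlSideReader(std::shared_ptr<const ArchiveEntries> entries)
        : entries_(std::move(entries)) {
        assert(entries_ != nullptr);
    }

    // Returns {nullptr, 0} for URLs outside the embed:// scheme or for missing entries.
    ByteView Open(const std::string &url) const {
        static const std::string kPrefix = "embed://";
        if (url.compare(0, kPrefix.size(), kPrefix) != 0) return ByteView{nullptr, 0};
        auto it = entries_->find(url.substr(kPrefix.size()));
        if (it == entries_->end()) return ByteView{nullptr, 0};
        return ByteView{it->second.data(), it->second.size()};
    }

private:
    std::shared_ptr<const ArchiveEntries> entries_;
};
```

Both readers hold the same `shared_ptr`, so the bytes are decompressed once and the view handed to the SQL side points straight into that map. The `const` on the pointee is what makes the non-owning pointer safe to share: neither reader can mutate or reallocate an entry.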
The override set
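The spike's hard-won lesson: the base `FileSystem` throws "not implemented" for `Glob` and `SeekPosition` rather than no-oping, and `read_csv()` exercises both before returning a row.
So we override:
- `OpenFile`: `EmbeddedFileHandle` over the entry's bytes
- `Read(handle, buf, n)`: streaming read
- `Read(handle, buf, n, location)`: positional read
- `GetFileSize`
- `Seek` / `SeekPosition`
- `Glob`: expands `embed://data/*.csv` to the matching entries (sorted)
- `CanHandleFile`: recognises `embed://` paths so `VirtualFileSystem` routes them here
- `FileExists`
- `OnDiskFile`: `false` (relevant for DuckDB's random-read optimisations)
- `CanSeek`: `true`
- `GetName`: `embed`

Registration in main.cpp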
Called once after `initializeDatabase()`. The helper is a no-op when:
- no bundle was registered by `detectAndRegisterEmbeddedBundle()` (#43)
- `DatabaseManager` is not initialised

Each failure mode silently returns false; the binary keeps running.
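The guard logic described above can be sketched without DuckDB. This is an illustration under stated assumptions, not the helper from this PR: `FakeVirtualFileSystem` is a hypothetical stand-in for the filesystem object the real helper registers on, `RegisterEmbeddedFileSystemSketch` is a hypothetical name, and the `ArchiveEntries` alias is a guess at the shared type's shape.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the shared archive type: bundle path -> decompressed bytes.
using ArchiveEntries = std::map<std::string, std::vector<std::uint8_t>>;

// Hypothetical stand-in for the virtual filesystem; it only records registered names.
struct FakeVirtualFileSystem {
    std::vector<std::string> subsystems;
    void RegisterSubSystem(const std::string &name) { subsystems.push_back(name); }
};

// Returns true only when a filesystem was actually registered.
// Every failure mode returns false without throwing, so startup continues.
bool RegisterEmbeddedFileSystemSketch(const std::shared_ptr<const ArchiveEntries> &bundle,
                                      FakeVirtualFileSystem *vfs) {
    if (!bundle) return false;         // no bundle: plain filesystem mode
    if (vfs == nullptr) return false;  // database not initialised yet
    vfs->RegisterSubSystem("embed");
    assert(!vfs->subsystems.empty());  // postcondition: something was registered
    return true;
}
```

Returning a `bool` instead of throwing lets the caller decide whether to log: a `true` result maps to the INFO line, and `false` stays silent.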
Tests (10 cases / proof-of-life included)
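- `CanHandleFile`: accepts `embed://`, rejects `/`, `s3://`, bare paths
- `OpenFile` for known entry: returns a handle
- `OpenFile` for missing entry: throws `IOException`
- `Read` (streaming): advances position; second read returns 0 at EOF
- `Read` (positional): does not move the cursor
- `Seek` + `SeekPosition`: agree
- `Glob('embed://data/*.csv')`: expands to the matching entries (sorted)
- `Glob` exact-match / missing: exact match for a non-wildcard path, empty for a missing one
- `read_csv`: `SELECT name FROM read_csv('embed://data/people.csv') ORDER BY name` returns Alice/Bob/Carol
- `read_csv` with glob: `read_csv('embed://data/*.csv', union_by_name=true)` returns 6 rows

The two "proof-of-life" tests are spike behaviour #9 -- the hardest-won evidence that the design actually works. They now run against the flapi build of DuckDB, not the spike's standalone amalgamation, so this is real validation that the integration transplants cleanly.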
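For readers who want the read and glob semantics the tests pin down in one place, here is a small standalone sketch. It is not the PR's implementation. `Handle`, `ReadStreaming`, `ReadAt`, `GlobMatch` and `Glob` are hypothetical names, the `ArchiveEntries` alias is a guess at the shared type's shape, and treating `*` and `?` as not crossing `/` is an assumption borrowed from shell-style globbing rather than something the PR states.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the shared archive type: bundle path -> decompressed bytes.
using ArchiveEntries = std::map<std::string, std::vector<std::uint8_t>>;

// Per-open-file state: non-owning data pointer plus a streaming position.
struct Handle {
    const std::uint8_t *data;
    std::size_t size;
    std::size_t pos;
};

// Streaming read: copies up to n bytes from the cursor and advances it; 0 at EOF.
std::size_t ReadStreaming(Handle &h, void *buf, std::size_t n) {
    assert(h.pos <= h.size);
    n = std::min(n, h.size - h.pos);
    if (n == 0) return 0;
    std::memcpy(buf, h.data + h.pos, n);
    h.pos += n;
    return n;
}

// Positional read: copies from an explicit offset and leaves the cursor alone.
std::size_t ReadAt(const Handle &h, void *buf, std::size_t n, std::size_t location) {
    if (location >= h.size) return 0;
    n = std::min(n, h.size - location);
    if (n == 0) return 0;
    std::memcpy(buf, h.data + location, n);
    return n;
}

// Wildcard match with '*' (any run) and '?' (one char); neither crosses '/' (assumption).
bool GlobMatch(const std::string &pattern, const std::string &text) {
    std::size_t p = 0, t = 0, mark = 0;
    std::size_t star = std::string::npos;
    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            star = p++;  // remember the star; first try matching it against nothing
            mark = t;
        } else if (p < pattern.size() &&
                   (pattern[p] == text[t] || (pattern[p] == '?' && text[t] != '/'))) {
            ++p;
            ++t;
        } else if (star != std::string::npos && text[mark] != '/') {
            p = star + 1;  // backtrack: let the last star absorb one more character
            t = ++mark;
        } else {
            return false;
        }
    }
    while (p < pattern.size() && pattern[p] == '*') ++p;
    return p == pattern.size();
}

// Expands an embed:// URL (exact path or pattern) to sorted matches, prefix preserved.
std::vector<std::string> Glob(const ArchiveEntries &entries, const std::string &url) {
    static const std::string kPrefix = "embed://";
    std::vector<std::string> out;
    if (url.compare(0, kPrefix.size(), kPrefix) != 0) return out;
    const std::string pattern = url.substr(kPrefix.size());
    for (const auto &kv : entries) {
        if (GlobMatch(pattern, kv.first)) out.push_back(kPrefix + kv.first);
    }
    std::sort(out.begin(), out.end());
    return out;
}
```

An exact path is just a pattern with no wildcards, so the same `Glob` call covers both the exact-match and missing-entry cases: it returns one element or an empty vector. Keeping the `embed://` prefix on the results matters because the caller reopens each returned path through the same scheme routing.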
Test plan
- 10/10 new tests pass.
- Existing 588 tests still pass; the same 2 pre-existing DuckDB AddressSanitizer leaks on main remain unchanged.
- `flapi --validate-config -c examples/flapi.yaml` runs with no "Registered embed://" log line (no bundle in this binary -- filesystem mode untouched).
Closes #44. Part of #40. Stacked on #53 -> #52 -> #51.