feat(self-packaging #42): selfpath + bundle_locator (EOCD reverse scan) #52

Closed
jrosskopf wants to merge 1 commit into feature/gh-41-libarchive-archive-io from feature/gh-42-selfpath-bundle-locator

@jrosskopf
Contributor

Part of epic #40. Depends on #51 (sets the base branch -- once #51 lands, GitHub will offer a one-click rebase onto main).

Summary

Two small modules that together let the running binary discover whether
a ZIP archive has been appended to it.

  • src/selfpath.{hpp,cpp} -- cross-platform self-binary path
    (/proc/self/exe / _NSGetExecutablePath / GetModuleFileNameW).
  • src/bundle_locator.{hpp,cpp} -- reverse-scan a file for a ZIP
    End-of-Central-Directory record, returning BundleLocation{offset, size}
    or nullopt.
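
For reviewers unfamiliar with the three platform calls, here is a minimal sketch of how a self-path lookup can be written. It is not the code in src/selfpath.cpp: the name GetSelfPathSketch, the std::optional return type, and the initial buffer sizes are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <optional>
#include <string>
#include <vector>

#if defined(_WIN32)
#include <windows.h>
#elif defined(__APPLE__)
#include <mach-o/dyld.h>
#else
#include <unistd.h>
#endif

// Hypothetical stand-in for GetSelfPath(); the real signature may differ.
std::optional<std::filesystem::path> GetSelfPathSketch() {
#if defined(_WIN32)
    std::vector<wchar_t> buf(MAX_PATH);
    for (;;) {
        const DWORD n = GetModuleFileNameW(nullptr, buf.data(),
                                           static_cast<DWORD>(buf.size()));
        if (n == 0) return std::nullopt;
        if (n < buf.size()) return std::filesystem::path(std::wstring(buf.data(), n));
        buf.resize(buf.size() * 2);  // result was truncated: grow and retry
    }
#elif defined(__APPLE__)
    std::vector<char> buf(1024);
    std::uint32_t size = static_cast<std::uint32_t>(buf.size());
    if (_NSGetExecutablePath(buf.data(), &size) != 0) {
        buf.resize(size);  // the call stored the required size in `size`
        if (_NSGetExecutablePath(buf.data(), &size) != 0) return std::nullopt;
    }
    return std::filesystem::path(buf.data());
#else
    std::vector<char> buf(256);
    for (;;) {
        // readlink does not NUL-terminate and silently truncates.
        const ssize_t n = readlink("/proc/self/exe", buf.data(), buf.size());
        if (n < 0) return std::nullopt;
        if (static_cast<std::size_t>(n) < buf.size()) {
            return std::filesystem::path(
                std::string(buf.data(), static_cast<std::size_t>(n)));
        }
        buf.resize(buf.size() * 2);  // possibly truncated: grow and retry
    }
#endif
}
```

The growth loops matter on Linux and Windows because both calls truncate silently when the buffer is too small.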

Why

This is the runtime-detection half of self-packaging. The next sub-issue
(#43, EmbeddedArchiveFileProvider) calls LocateBundleInSelf() once
at startup; if it returns a location, the bytes at that range are fed to
archive_io::ReadArchive (PR #51) and become the in-memory config tree.
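
As a rough illustration of the hand-off, the sketch below copies a byte range out of a file, which is what a caller would do with loc.offset and loc.size before passing the bytes on. ReadFileRange is a hypothetical helper, not part of this PR, and the call into archive_io::ReadArchive is left out because its signature is not shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: copy [offset, offset + size) out of `path`.
// Returns nullopt if the file cannot be opened or the range is short.
std::optional<std::vector<char>> ReadFileRange(const std::string& path,
                                               std::uint64_t offset,
                                               std::uint64_t size) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return std::nullopt;

    in.seekg(static_cast<std::streamoff>(offset));
    std::vector<char> bytes(static_cast<std::size_t>(size));
    in.read(bytes.data(), static_cast<std::streamsize>(bytes.size()));

    // A short read means the claimed range extends past EOF.
    if (in.gcount() != static_cast<std::streamsize>(bytes.size())) {
        return std::nullopt;
    }
    return bytes;
}
```

A caller would invoke it as ReadFileRange(self_path, loc.offset, loc.size) and treat nullopt the same as "no bundle".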

What the locator checks

| Check | Rejected as | Why |
| --- | --- | --- |
| EOCD signature 0x06054b50 not present in tail | nullopt | obviously no ZIP |
| multi-disk archive | nullopt | not a single-binary use case |
| entries_this != entries_total | nullopt | inconsistent record |
| comment length doesn't fit in scanned tail | nullopt | record claims more than file has |
| bytes after EOCD+comment are non-zero | nullopt | not a clean append |
| central dir doesn't fit before EOCD | nullopt | impossible layout |
| recorded central dir offset > central dir's offset in the file | nullopt | overflow / corruption |
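
The record-level checks in the table can be sketched as a parser over the 22 fixed EOCD bytes. The field offsets follow the ZIP format; the names EocdRecord and ParseEocd, and the choice to treat any non-zero disk number as multi-disk, are assumptions for illustration rather than the code in src/bundle_locator.cpp.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Illustrative field names; all EOCD integers are little-endian.
struct EocdRecord {
    std::uint16_t entries_this;
    std::uint16_t entries_total;
    std::uint32_t cd_size;
    std::uint32_t cd_offset;
    std::uint16_t comment_length;
};

inline std::uint16_t ReadLe16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

inline std::uint32_t ReadLe32(const std::uint8_t* p) {
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Hypothetical parser covering the signature, multi-disk, and
// entry-count rows of the table. Layout checks are done elsewhere.
std::optional<EocdRecord> ParseEocd(const std::array<std::uint8_t, 22>& raw) {
    const std::uint8_t* p = raw.data();
    if (ReadLe32(p) != 0x06054b50u) return std::nullopt;  // no signature

    const std::uint16_t disk_number = ReadLe16(p + 4);
    const std::uint16_t cd_start_disk = ReadLe16(p + 6);
    if (disk_number != 0 || cd_start_disk != 0) return std::nullopt;  // multi-disk

    const EocdRecord rec{ReadLe16(p + 8), ReadLe16(p + 10), ReadLe32(p + 12),
                         ReadLe32(p + 16), ReadLe16(p + 20)};
    if (rec.entries_this != rec.entries_total) return std::nullopt;  // inconsistent
    return rec;
}
```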

The padding tolerance is the load-bearing bit. The spike caught
libarchive's default 10240-byte tar-block rounding pushing the EOCD off
file-EOF. We scan a tail of 22 + 65535 + 65536 bytes and accept any
amount of zero padding after the EOCD+comment, up to that budget.
PR #51 already neutralises this on the writer side via
archive_write_set_bytes_in_last_block(a, 1), but the reader is
defensive too.
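
A minimal sketch of the reverse scan with padding tolerance, operating on a tail buffer that has already been read into memory. FindEocdInTail and kTailBudget are hypothetical names. One assumption to flag: this sketch keeps scanning when a candidate fails a check, whereas the real locator may return nullopt immediately.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

constexpr std::size_t kEocdFixedSize = 22;
// Tail budget from the description: EOCD + max comment + padding allowance.
constexpr std::size_t kTailBudget = kEocdFixedSize + 65535 + 65536;

// Hypothetical helper: index within `tail` of an EOCD record whose comment
// fits in the buffer and is followed only by zero bytes, or nullopt.
std::optional<std::size_t> FindEocdInTail(const std::vector<std::uint8_t>& tail) {
    if (tail.size() < kEocdFixedSize) return std::nullopt;

    // Walk backwards from the last position where a full record could start.
    for (std::size_t i = tail.size() - kEocdFixedSize + 1; i-- > 0;) {
        // 0x06054b50 is stored little-endian: 'P' 'K' 0x05 0x06.
        if (tail[i] != 0x50 || tail[i + 1] != 0x4b ||
            tail[i + 2] != 0x05 || tail[i + 3] != 0x06) {
            continue;
        }
        const std::size_t comment_len =
            static_cast<std::size_t>(tail[i + 20] | (tail[i + 21] << 8));
        const std::size_t record_end = i + kEocdFixedSize + comment_len;
        if (record_end > tail.size()) continue;  // comment does not fit

        bool only_zero_padding = true;
        for (std::size_t j = record_end; j < tail.size(); ++j) {
            if (tail[j] != 0) { only_zero_padding = false; break; }
        }
        if (only_zero_padding) return i;
    }
    return std::nullopt;
}
```

The budget works out to 131093 bytes, so the scan stays cheap even for large host binaries.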

How BundleLocation is computed

ZIP layout in the file:

[leading host binary bytes][local file headers][central dir][EOCD][trailing pad]
                           ^                                ^
                           bundle_start                     eocd_file_offset

bundle_start = eocd_file_offset - cd_size - cd_offset_in_archive
bundle_end   = eocd_file_offset + 22 + comment_length
loc.offset   = bundle_start
loc.size     = bundle_end - bundle_start

So loc.offset is the byte you seek() to, and loc.size is what you
read -- the result is a valid standalone ZIP that ReadArchive can
parse directly without any libarchive offset-shifting magic.
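
The arithmetic above, together with the two layout checks from the table, can be sketched as follows. ComputeBundleLocation is a hypothetical function name and the struct field types are assumed; only the formulas come from this description.

```cpp
#include <cstdint>
#include <optional>

// Assumed shape of BundleLocation{offset, size}.
struct BundleLocation {
    std::uint64_t offset;
    std::uint64_t size;
};

// Hypothetical helper: derive the bundle range from EOCD fields, rejecting
// layouts where the subtraction would underflow.
std::optional<BundleLocation> ComputeBundleLocation(
    std::uint64_t eocd_file_offset, std::uint32_t cd_size,
    std::uint32_t cd_offset_in_archive, std::uint16_t comment_length) {
    // Central directory must fit before the EOCD.
    if (cd_size > eocd_file_offset) return std::nullopt;
    const std::uint64_t cd_file_offset = eocd_file_offset - cd_size;

    // Recorded offset cannot exceed where the directory actually sits.
    if (cd_offset_in_archive > cd_file_offset) return std::nullopt;

    const std::uint64_t bundle_start = cd_file_offset - cd_offset_in_archive;
    const std::uint64_t bundle_end = eocd_file_offset + 22 + comment_length;
    return BundleLocation{bundle_start, bundle_end - bundle_start};
}
```

Worked example with made-up sizes: 4096 leading bytes, 100 bytes of local file headers, and a 50-byte central directory put the EOCD at file offset 4246. Then bundle_start = 4246 - 50 - 100 = 4096 and bundle_end = 4246 + 22 + 0 = 4268, so loc.size = 172, which is 100 + 50 + 22.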

Tests (red-then-green TDD)

| # | Test | What it asserts |
| --- | --- | --- |
| 1 | locate appended ZIP | 4 KiB random + ZIP → offset == 4096, size == zip.size() |
| 2 | padding tolerance | + 10240 zeros → still located at same offset |
| 3 | no signature | random bytes, scrubbed → nullopt |
| 4 | malformed EOCD | fake sig at EOF with cd_size = 0xffffffff → nullopt |
| 5 | truncation | append valid ZIP, lop 1 KiB → nullopt |
| 6 | GetSelfPath | returns existing path |
| 7 | LocateBundleInSelf | unbundled test binary → nullopt |

7/7 Test #19: LocateBundleInSelf returns nullopt when no bundle is present_test  Passed
100% tests passed, 0 tests failed out of 7

Fixtures reuse archive_io::WriteArchive from #51 -- so the locator is
tested against real libarchive output, not synthetic ZIPs.

Test plan

  • All 7 new tests pass.
  • Existing 590 tests still pass; the same 2 pre-existing
    AddressSanitizer leaks in DuckDB internals (QueryExecutor type coverage, DuckDBResult RAII) from 96806ac on main remain.
  • CI cross-platform once stacked PRs land.

Closes #42. Part of #40. Stacked on #51.

Part of #40, depends on #41. Two small modules that let the running
binary discover an appended ZIP bundle.

- `src/selfpath.{hpp,cpp}` -- cross-platform self-binary path:
  - Linux:   readlink("/proc/self/exe")
  - macOS:   _NSGetExecutablePath
  - Windows: GetModuleFileNameW (with buffer growth loop)

- `src/bundle_locator.{hpp,cpp}` -- reverse-scan a file for a ZIP
  End-of-Central-Directory record:
  - reads tail buffer of 22 + max_comment + 64 KiB padding budget
  - reverse-scans for the 0x06054b50 signature
  - validates: single-disk archive, entry counts match, comment
    length fits in tail, anything after comment must be zero
    (padding tolerance), central directory must fit before EOCD
  - returns BundleLocation{offset, size} or nullopt
  - `LocateBundleInSelf()` convenience wrapper over GetSelfPath()

Tests (`test/cpp/bundle_locator_test.cpp`, 7 cases):
- locate a ZIP appended to 4 KiB of random leading bytes
- tolerate 10 KiB of trailing zero padding (the spike-caught case)
- nullopt when no EOCD signature exists in random data
- nullopt when an EOCD signature has impossible cd_size / cd_offset
- nullopt when the bundle is truncated by 1 KiB from EOF
- selfpath returns an existing path
- LocateBundleInSelf returns nullopt against the unbundled test binary

Fixtures reuse `archive_io::WriteArchive` (#41) to produce real ZIPs.

Closes #42.