Skip to content

Verify: BINARY types preserve data integrity (legacy #147)#21

Merged
HenryNebula merged 5 commits intodevfrom
worktree-triage+147-binary-utf8
Apr 23, 2026
Merged

Verify: BINARY types preserve data integrity (legacy #147)#21
HenryNebula merged 5 commits intodevfrom
worktree-triage+147-binary-utf8

Conversation

@HenryNebula
Copy link
Copy Markdown
Owner

Summary

Fixes #20 (legacy baztian/jaydebeapi#147): Verifies that BINARY/VARBINARY data round-trips correctly through the Arrow JDBC adapter without data loss.

The upstream jaydebeapi had a bug where binary data was decoded as UTF-8 strings via str(java_val), corrupting non-UTF-8 bytes. This does not affect jaydebeapiarrow because the Arrow JDBC adapter handles binary types natively, returning Python bytes objects.

Changes

  • Added mockBinaryResult(byte[]) to MockConnection — mocks getBinaryStream() (the method the Arrow consumer actually calls), getBytes(), wasNull(), and getObject()
  • Added 3 test cases in test_mock.py:
    • test_binary_non_utf8_bytes_preserved — verifies bytes like 0x80, 0xff, 0xfe survive round-trip
    • test_binary_all_byte_values — all 256 byte values round-trip correctly
    • test_binary_empty — empty binary data handled correctly

Test plan

  • Added 3 test cases reproducing the reported concern
  • Verified all 63 mock tests pass
  • Binary data returns as bytes (not str), confirming no UTF-8 decoding occurs

Closes #20

Generated with Claude Code

@HenryNebula
Copy link
Copy Markdown
Owner Author

Also need to add integration tests for external DB besides mock

@HenryNebula HenryNebula force-pushed the worktree-triage+147-binary-utf8 branch from e74e894 to 73e8149 Compare April 23, 2026 07:11
HenryNebula and others added 4 commits April 23, 2026 10:02
The legacy issue (baztian/jaydebeapi#147) about BINARY types being
decoded as UTF-8 strings does not affect jaydebeapiarrow — the Arrow
JDBC adapter handles binary types natively and returns Python bytes.

Add mockBinaryResult to MockConnection and three test cases that
verify binary data round-trips correctly, including non-UTF-8 byte
sequences (0x80, 0xff, 0xfe) and all 256 byte values.

Closes #20

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Addresses PR review feedback requesting integration tests beyond
the mock driver. Tests that binary data containing non-UTF-8
bytes (0x80, 0xff, 0xfe) round-trips correctly through the Arrow
path against a real HSQLDB database.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address reviewer feedback: verify binary data integrity on an external
database. The PostgresTest override tests the full 256-byte spectrum
and common non-UTF-8 sequences that historically get corrupted.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rebase onto dev restored timestamp tests that were previously removed.
Add Drill-specific test_binary_non_utf8_roundtrip using CTAS (Drill
doesn't support parameterized INSERT for binary). Skip on Trino since
the memory connector doesn't support VARBINARY round-trip via CTAS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@HenryNebula HenryNebula force-pushed the worktree-triage+147-binary-utf8 branch from 2e12898 to 4a92bb1 Compare April 23, 2026 14:14
Drill cannot create VARBINARY columns with non-UTF-8 bytes via CTAS
(hex literal conversion is unsupported). Binary data integrity is
already verified via mock tests, HSQLDB, and PostgreSQL integration
tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@HenryNebula HenryNebula merged commit 293a105 into dev Apr 23, 2026
14 checks passed
@HenryNebula HenryNebula deleted the worktree-triage+147-binary-utf8 branch April 23, 2026 16:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant