
feat(core): hash entity identifiers if too long for db engine #846

Merged
AlessandroPomponio merged 1 commit into main from ap_382_hash_entity_identifier_when_too_long
Apr 14, 2026

@AlessandroPomponio
Member

@AlessandroPomponio AlessandroPomponio commented Apr 14, 2026

Summary

This PR adds automatic hashing for long entity identifiers to prevent database field length violations. The entity identifier is used as a primary key in SQLSampleStore. When an entity identifier exceeds 700 characters, it is now hashed using SHA-256. This keeps it within the 768-character limit that InnoDB (MySQL) imposes on indexed columns, while the identifier stays deterministic and, for practical purposes, unique.

NOTE: this issue stems from a limitation of the InnoDB engine on MySQL when used with the utf8mb4 charset. Creating a VARCHAR column longer than 768 characters with an index on it (such as a primary key) fails with the following error:

SQL Error [1071] [42000]: Specified key was too long; max key length is 3072 bytes

utf8mb4 uses up to 4 bytes per character, so 3072 / 4 = 768 characters is the maximum possible length.
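The arithmetic above can be checked with a short sketch. The constant names are illustrative and do not come from the PR; the figures are the ones quoted in the error message and the note.

```python
# Figures from the PR description: InnoDB's 3072-byte index key limit and
# utf8mb4's worst case of 4 bytes per character. Constant names are illustrative.
MAX_KEY_BYTES = 3072
MAX_BYTES_PER_CHAR_UTF8MB4 = 4

# Worst case: every character in the indexed column needs 4 bytes.
max_indexed_chars = MAX_KEY_BYTES // MAX_BYTES_PER_CHAR_UTF8MB4
print(max_indexed_chars)  # 768

# A character outside the Basic Multilingual Plane takes 4 bytes in UTF-8.
print(len("😀".encode("utf-8")))  # 4
```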

Resolves #382

Files Changed

📄 orchestrator/schema/entity.py

Modified the entity_identifier_from_properties_and_values function to handle long identifiers. The function now checks whether the generated identifier exceeds 700 characters (a safe threshold below the 768-character database limit). If it does, the identifier is hashed using SHA-256 and prefixed with "hash-" to indicate that it is a hashed value. Short identifiers remain human-readable for debugging purposes.
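A minimal sketch of the behaviour described above, not the code from the PR. The helper name `shorten_identifier_if_needed`, the constant names, and the UTF-8 encoding step are assumptions; only the 700-character threshold, SHA-256, and the "hash-" prefix come from the description.

```python
import hashlib

# Assumed names: the PR text gives the threshold and the prefix, but not
# the identifiers used for them in orchestrator/schema/entity.py.
MAX_IDENTIFIER_LENGTH = 700
HASH_PREFIX = "hash-"


def shorten_identifier_if_needed(identifier: str) -> str:
    """Return short identifiers unchanged; replace long ones with a prefixed SHA-256 hex digest."""
    if len(identifier) <= MAX_IDENTIFIER_LENGTH:
        # Short identifiers stay human-readable.
        return identifier
    # Assumption: the identifier is encoded as UTF-8 before hashing.
    digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()
    return f"{HASH_PREFIX}{digest}"
```

Because the hex digest is always 64 characters, a hashed identifier built this way is 69 characters long regardless of the input length, which matches the figure reported in the tests.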

📄 tests/schema/test_entity.py

Added comprehensive test coverage for the new identifier hashing functionality:

  • test_entity_identifier_short_not_hashed: Verifies short identifiers remain unchanged and human-readable
  • test_entity_identifier_long_hashed: Confirms long identifiers (4539+ chars) are properly hashed with the "hash-" prefix and stay within database limits (69 chars total)
  • test_entity_identifier_different_points_different_identifiers: Ensures different input points produce unique identifiers for both short and long cases
  • test_entity_identifier_threshold_boundary: Tests edge cases at exactly the 700-character threshold to verify the cutoff works correctly
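The checks listed above could look roughly like the following sketch. It runs against a hypothetical stand-in (`identifier_stand_in`) that works on raw strings, whereas the real tests exercise entity_identifier_from_properties_and_values with points; the test bodies and shortened names are assumptions, not the contents of tests/schema/test_entity.py.

```python
import hashlib


def identifier_stand_in(raw: str) -> str:
    # Hypothetical stand-in for the behaviour described in the PR,
    # not the real function from orchestrator/schema/entity.py.
    if len(raw) <= 700:
        return raw
    return "hash-" + hashlib.sha256(raw.encode("utf-8")).hexdigest()


def test_short_not_hashed():
    raw = "x.1-y.2"
    assert identifier_stand_in(raw) == raw


def test_long_hashed():
    hashed = identifier_stand_in("p" * 4539)
    assert hashed.startswith("hash-")
    assert len(hashed) == 69  # 5-character prefix + 64 hex digits


def test_different_inputs_differ():
    # Distinct inputs should map to distinct identifiers, short or long.
    assert identifier_stand_in("a") != identifier_stand_in("b")
    assert identifier_stand_in("a" * 800) != identifier_stand_in("b" * 800)


def test_threshold_boundary():
    # Exactly 700 characters is kept; 701 characters is hashed.
    assert identifier_stand_in("z" * 700) == "z" * 700
    assert identifier_stand_in("z" * 701).startswith("hash-")
```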

Signed-off-by: Alessandro Pomponio <alessandro.pomponio1@ibm.com>
@michael-johnston
Member

We should probably note in the title and description which DB has the restriction, e.g. MySQL has an x limit on primary keys. Since entity id is primary …

We should also note whether SQLite has a similar issue.

@AlessandroPomponio AlessandroPomponio changed the title from "feat(core): hash entity identifiers if too long for db" to "feat(core): hash entity identifiers if too long for db engine" Apr 14, 2026
@AlessandroPomponio
Member Author

@michael-johnston let me know if this is better.
Feel free to make suggestions otherwise.

@AlessandroPomponio AlessandroPomponio added this pull request to the merge queue Apr 14, 2026
Merged via the queue into main with commit 104ed79 Apr 14, 2026
19 checks passed
@AlessandroPomponio AlessandroPomponio deleted the ap_382_hash_entity_identifier_when_too_long branch April 14, 2026 16:16
Development

Successfully merging this pull request may close these issues.

feat: handle case where entity identifier is too long for the DB

2 participants