Long identifier truncation improvements #6564

Closed
labkey-adam wants to merge 89 commits into develop from fb_long_identifiers

@labkey-adam labkey-adam commented Apr 15, 2025

Rationale

Not heeding database-specific truncation rules for identifiers has led to several issues related to long column, table, and alias names, particularly those with non-ASCII characters. Here are two example issues that are resolved by this PR:

In these cases, DomainImpl.generateStorageColumnName() would call an AliasManager.decideAlias() variant that usually returned the name unchanged as the storage column name, without first making it "legal," which led to silent truncation to 63 UTF-8 bytes on PostgreSQL. The first fix attempt was to call a different method to ensure makeLegalName() was invoked when generating storage names, but this led to some undesirable behavior, such as provisioned columns named "Group" and "User" being stored as "group_" and "user_" (they're keywords on PostgreSQL), which then confused index creation (which of course should know the storage names, but currently doesn't). IMO, provisioned column storage names should match the column names as much as possible (I'm less concerned about this for aliases). We're already quoting them appropriately. So the fix here is to stop using AliasManager to generate storage column names; instead, a new simple class, StorageNameGenerator, is now responsible for generating legal and unique storage column names. It calls the dialect-specific truncation method and uniquifies names with a suffix counter, but otherwise it leaves the characters as is. This means we'll start seeing special characters in provisioned tables' column names.
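For illustration, here is a minimal sketch of the truncate-then-uniquify approach described above. It is not the actual StorageNameGenerator code: the class name StorageNameSketch, the Truncator interface, the ASCII_63 stand-in, and the "_N" suffix format are all assumptions made for this example. The point it shows is that the suffix length is reserved before truncating, so the final name still fits the limit.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; names and suffix format are assumptions, not the PR's code.
final class StorageNameSketch
{
    /** Stand-in for a dialect-specific truncation method. */
    interface Truncator
    {
        String truncate(String name, int reservedBytes);
    }

    /** ASCII-only stand-in for a 63-byte limit (one char == one byte). */
    static final Truncator ASCII_63 = (name, reserved) -> {
        int max = 63 - reserved;
        return name.length() <= max ? name : name.substring(0, max);
    };

    private final Truncator truncator;
    // Names already handed out; compared case-sensitively (an assumption).
    private final Set<String> used = new HashSet<>();

    StorageNameSketch(Truncator truncator)
    {
        this.truncator = truncator;
    }

    /** Returns a name that fits the limit and has not been returned before. */
    String generate(String columnName)
    {
        String candidate = truncator.truncate(columnName, 0);
        int counter = 0;
        while (!used.add(candidate))
        {
            // Reserve room for the suffix before truncating, then append it.
            String suffix = "_" + (++counter);
            candidate = truncator.truncate(columnName, suffix.length()) + suffix;
        }
        return candidate;
    }

    public static void main(String[] args)
    {
        StorageNameSketch generator = new StorageNameSketch(ASCII_63);
        System.out.println(generator.generate("Group")); // Group
        System.out.println(generator.generate("Group")); // Group_1
    }
}
```

Characters are left as is (no keyword mangling such as "group_"); only length and uniqueness are enforced, which matches the intent stated above.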

Changes

  • Introduce StringUtilsLabKey.truncateStartToUtf8ByteLimit() that truncates from the right end of the string. Add tests. Simplify overly complex truncateToUtf8ByteLimit().
  • Move the truncation and "make legal" methods to SqlDialect to allow these to be dialect-specific
  • Implement correct truncation in PostgreSQL dialect
  • Pass in a @NotNull SqlDialect to more AliasManager calls
  • Introduce FallBackDialect for the unfortunate cases where a SqlDialect is not provided to AliasManager; it implements conservative truncation rules that work across all supported databases
  • Eliminate useLegacyMaxLength flag
  • Force callers to specify the number of characters they need reserved for suffixes, etc. Previous code made blanket assumptions that were often incorrect, unnecessary, and/or redundant.
  • Switch to surrogate-pair-aware truncation so we don't end up with half characters in our names
  • Widen StorageColumnName and mvIndicatorStorageColumnName to ensure even the longest generated names will fit
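As a rough sketch of what surrogate-pair-aware truncation to a UTF-8 byte limit can look like (this is not the actual StringUtilsLabKey or SqlDialect code; the class and method names below are hypothetical), the idea is to walk the string by code point, add up each code point's UTF-8 width, and stop before the limit would be exceeded. Because the cut always falls on a code point boundary, a surrogate pair is never split.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the PR's implementation.
final class Utf8TruncationSketch
{
    /** Number of bytes a single code point occupies in UTF-8. */
    static int utf8Length(int codePoint)
    {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    /**
     * Returns the longest prefix of s whose UTF-8 encoding fits in maxBytes,
     * cutting only on code point boundaries. An unpaired surrogate is counted
     * as 3 bytes, which is conservative (the result can only be shorter).
     */
    static String truncateToUtf8ByteLimit(String s, int maxBytes)
    {
        int bytes = 0;
        int i = 0;
        while (i < s.length())
        {
            int codePoint = s.codePointAt(i);
            int width = utf8Length(codePoint);
            if (bytes + width > maxBytes)
                break;
            bytes += width;
            i += Character.charCount(codePoint); // 2 chars for a surrogate pair
        }
        return s.substring(0, i);
    }

    public static void main(String[] args)
    {
        // 40 two-byte characters are 80 bytes, over a 63-byte limit.
        String truncated = truncateToUtf8ByteLimit("\u00e9".repeat(40), 63);
        // 31 characters, 62 bytes
        System.out.println(truncated.length() + " characters, "
            + truncated.getBytes(StandardCharsets.UTF_8).length + " bytes");
    }
}
```

A caller that needs room for a suffix would pass a smaller limit (for example, 63 minus the suffix's byte length), in line with the change above that makes callers state how many characters they need reserved.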

@labkey-jeckels labkey-jeckels left a comment

Minor suggestions. I haven't tested the code.

@labkey-adam labkey-adam closed this May 2, 2025
@labkey-adam

Merged to #6498

@labkey-adam labkey-adam deleted the fb_long_identifiers branch May 3, 2025 14:40