Long identifier truncation improvements#6564
Closed
labkey-adam wants to merge 89 commits intodevelopfrom
Closed
Conversation
ColumnInfo.getAlias() ColumnInfo.getSelectName()
some fuzz testing
fix double-quoting (makeLegalIdentifier) of names in specimen land
…e awkward truncation rule
labkey-jeckels
approved these changes
Apr 22, 2025
Contributor
labkey-jeckels
left a comment
There was a problem hiding this comment.
Minor suggestions. I haven't test the code.
… method names, new surrogate-pair-aware char truncation methods, test for broken surrogates. Also, code review feedback.
Contributor
Author
|
Merged to #6498 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale
Not heeding database-specific truncation rules for identifiers has led to several issues related to long column, table, and alias names, particularly those with non-ASCII characters. Here are two example issues that are resolved by this PR:
In these cases,
DomainImpl.generateStorageColumnName()would call anAliasManager.decideAlias()variant that usually returned the name as the storage column name, without first making it "legal," which lead to silent truncation to 63 UTF-8 bytes on PostgreSQL. First fix attempt was to call a different method to ensure makeLegalName() was invoked when generating storage names, but this led to some undesirable behavior, such as provisioned columns named "Group" and "User" being stored as "group_" and "user_" (they're keywords on PostgreSQL), which then confused index creation (which of course should know the storage names, but they don't currently). IMO, provisioned column storage names should match the column names as much as possible (I'm less concerned about this for aliases). We're already quoting them appropriately. So the fix here is to stop using AliasManager to generate storage column names; instead, a new simple classStorageNameGeneratoris now responsible for generating legal and unique storage column names. It calls the dialect-specific truncation method and uniquifies names with a suffix counter, but otherwise it leaves the characters as is. This means we'll start seeing special characters in provisioned tables' column names.Changes
StringUtilsLabKey.truncateStartToUtf8ByteLimit()that truncates from the right end of the string. Add tests. Simplify overly complextruncateToUtf8ByteLimit().@NotNullSqlDialectto more AliasManager callsFallBackDialectfor the unfortunate cases where a SqlDialect is not provided to AliasManager; it implements conservative truncation rules that work across all supported databasesuseLegacyMaxLengthflagStorageColumnNameandmvIndicatorStorageColumnNameto ensure even the longest generated names will fit