This repository has been archived by the owner on Mar 14, 2024. It is now read-only.
This resolves issues our Algolia indexing job has had with multi-byte content, following up on #6952.
I've also updated the test suite to be a bit more representative of the issues we're seeing, including using both single- and multi-byte characters in the mocked data.
From what I could tell, there were two root causes for the current problems:
JSON size considerations
Previously, `byteof` was used to find the size of a given record (and therefore how many bytes needed to be trimmed). But the Algolia API cares about the size of the serialized JSON object.
In this PR, I've switched off of `byteof` and instead serialize an object to JSON first, and then count those bytes, to match what Algolia checks.
String truncation
Previously, the trimming logic attempted to find the percentage of the string that needed to be removed by dividing the number of bytes that needed to be dropped by the total bytes in the string, and then called `slice()` to remove that percentage of characters from the end.
This approach only works if the entire string is made up of characters with a uniform number of bytes (i.e. all one byte, or all two bytes, etc.). When dealing with a string that contains characters with varying byte sizes, there is no guarantee that removing a certain number of characters from the end will reduce the size by a proportional amount. For instance, if the string contains a large number of multi-byte characters at the start, and mostly single-byte characters at the end, then removing the final 25% of characters will not reduce the overall byte size by 25%.
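To make the mismatch concrete, here is a small illustration using Node's built-in `Buffer.byteLength`. The sample string and variable names are made up for this example and are not taken from the indexing job itself.

```javascript
// Hypothetical sample: 30 three-byte characters followed by 30 one-byte characters.
const text = "日本語".repeat(10) + "a".repeat(30);

const totalBytes = Buffer.byteLength(text, "utf8"); // 30 * 3 + 30 * 1 = 120

// Drop the final 25% of *characters*, as a proportional slice would.
const keptChars = Math.floor(text.length * 0.75); // 45 of 60 characters
const sliced = text.slice(0, keptChars);
const slicedBytes = Buffer.byteLength(sliced, "utf8"); // 30 * 3 + 15 * 1 = 105

// Only 15 of 120 bytes were removed: 12.5%, not the intended 25%.
const bytesRemovedRatio = (totalBytes - slicedBytes) / totalBytes;
console.log({ totalBytes, slicedBytes, bytesRemovedRatio });
```

Because the characters at the end are the cheap single-byte ones, cutting a quarter of the characters removes only an eighth of the bytes, so the record could still be over the limit.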
Rather than try to write code that will truncate a string properly while respecting multi-byte character boundaries, I'm turning to the `truncate-utf8-bytes` module, which seems to work as expected.
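As a rough sketch of how the two fixes fit together (this is not the code in this PR): the helper names `jsonByteSize`, `truncateToBytes`, and `fitRecord` are hypothetical, and `truncateToBytes` is a simplified standard-library stand-in for a byte-aware truncation helper, not the implementation of `truncate-utf8-bytes`. It assumes the oversized field is a string.

```javascript
// Size of a record as it would be sent: UTF-8 bytes of the serialized JSON.
function jsonByteSize(record) {
  return Buffer.byteLength(JSON.stringify(record), "utf8");
}

// Hypothetical stand-in for a byte-aware truncation helper.
// Keeps whole code points until the next one would exceed maxBytes.
function truncateToBytes(str, maxBytes) {
  let bytes = 0;
  let out = "";
  for (const ch of str) {
    // for...of iterates by code point, so surrogate pairs are never split
    const size = Buffer.byteLength(ch, "utf8");
    if (bytes + size > maxBytes) break;
    bytes += size;
    out += ch;
  }
  return out;
}

// Trim one string field until the serialized record fits within maxBytes.
function fitRecord(record, field, maxBytes) {
  let result = { ...record };
  let excess = jsonByteSize(result) - maxBytes;
  while (excess > 0 && result[field].length > 0) {
    const fieldBytes = Buffer.byteLength(result[field], "utf8");
    const target = Math.max(0, fieldBytes - excess);
    result = { ...result, [field]: truncateToBytes(result[field], target) };
    // Re-measure the serialized form, since JSON escaping can add bytes.
    excess = jsonByteSize(result) - maxBytes;
  }
  return result;
}

// Usage with made-up data mixing single- and multi-byte characters.
const record = { title: "héllo", content: "日本語".repeat(100) + "abc" };
const fitted = fitRecord(record, "content", 100);
console.log(jsonByteSize(record), jsonByteSize(fitted));
```

Measuring the serialized object rather than the raw string accounts for keys, quotes, and escape sequences, and re-measuring after each truncation avoids relying on any assumed bytes-per-character ratio.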