This repository has been archived by the owner on Mar 14, 2024. It is now read-only.

Fix Algolia multi-byte content indexing #7406

Merged
jeffposnick merged 3 commits into main from algolia-record-size-fix on Feb 25, 2022


Contributor

@jeffposnick jeffposnick commented Feb 24, 2022

This resolves issues our Algolia indexing job has had with multi-byte content, following up on #6952.

I've also updated the test suite to be a bit more representative of the issues we're seeing, including using both single- and multi-byte characters in the mocked data.

From what I could tell, there were two root causes for the current problems:

JSON size considerations

Previously, byteof was used to find the size of a given record (and therefore how many bytes needed to be trimmed). But the Algolia API cares about the size of the serialized JSON object.

In this PR, I've switched off of byteof and instead serialize an object to JSON first, and then count those bytes, to match what Algolia checks.
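To illustrate the difference, here is a minimal sketch of measuring a record's serialized size in Node.js. The helper name `jsonByteSize` and the sample record are hypothetical and not taken from this PR's diff; the point is only that the JSON keys, quotes, and punctuation count toward the total, on top of the bytes in the field values.

```javascript
// Hypothetical helper (not from the PR diff): size in bytes of a record
// once it has been serialized to JSON and encoded as UTF-8.
function jsonByteSize(record) {
  return Buffer.byteLength(JSON.stringify(record), 'utf8');
}

// Sample record mixing single- and multi-byte characters (made up for illustration).
const sampleRecord = {title: 'Café', content: '日本語'};

// Bytes in the field values alone: 'Café' is 5 bytes, '日本語' is 9 bytes.
const valueBytes =
  Buffer.byteLength(sampleRecord.title, 'utf8') +
  Buffer.byteLength(sampleRecord.content, 'utf8');

// Bytes in the serialized JSON, which also includes keys, quotes, and punctuation.
const serializedBytes = jsonByteSize(sampleRecord);

console.log(valueBytes, serializedBytes); // 14 39
```

Any trimming that targets a byte budget should therefore compare against the serialized size rather than the size of the individual values.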

String truncation

Previously, the trimming logic attempted to find the percentage of the string that needed to be removed by dividing the total bytes in the string by the number of bytes that needed to be dropped, and then called slice() to remove that percentage of characters from the start.

This approach would only work if the entire string is made up of characters with a uniform number of bytes (i.e. all one byte, or all two bytes, etc.). When dealing with a string that contains characters with varying byte sizes, there is no guarantee that removing a certain number of characters from the end will reduce the size by a proportional amount. For instance, if the string contains a large number of multi-byte characters at the start, and mostly single-byte characters at the end, then removing the final 25% of characters will not reduce the overall byte size by 25%.
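A small sketch makes the mismatch concrete. The string below is made up for illustration: six three-byte characters followed by six one-byte characters. Dropping the final 25% of its characters removes far less than 25% of its bytes.

```javascript
// Six 3-byte characters followed by six 1-byte characters (illustrative data).
const mixedText = '日本語日本語' + 'abcdef';

const totalBytes = Buffer.byteLength(mixedText, 'utf8'); // 6 * 3 + 6 * 1 = 24

// Keep the first 75% of the characters, i.e. drop the final 25%.
const keptChars = Math.floor(mixedText.length * 0.75); // 9 of 12 characters
const keptText = mixedText.slice(0, keptChars);

const keptBytes = Buffer.byteLength(keptText, 'utf8'); // 6 * 3 + 3 * 1 = 21

// Fraction of bytes actually removed: 3 / 24 = 0.125, not 0.25.
const removedFraction = (totalBytes - keptBytes) / totalBytes;

console.log(totalBytes, keptBytes, removedFraction); // 24 21 0.125
```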

Rather than try to write code that will truncate a string properly while respecting multi-byte character boundaries, I'm turning to the truncate-utf8-bytes module, which seems to work as expected.

netlify bot commented Feb 24, 2022

✔️ Deploy Preview for web-dev-staging ready!

🔨 Explore the source changes: 8fab63a

🔍 Inspect the deploy log: https://app.netlify.com/sites/web-dev-staging/deploys/6218f17643cee1000707f1b9

😎 Browse the preview: https://deploy-preview-7406--web-dev-staging.netlify.app

algolia.js (review thread resolved)
@jeffposnick added the "DO NOT MERGE" label (Actively working on but experimental) on Feb 24, 2022
@jeffposnick added the "$-presubmit" label (Add label to run presubmit tests.) and removed the "DO NOT MERGE" label on Feb 25, 2022
@github-actions bot removed the "$-presubmit" label on Feb 25, 2022
@jeffposnick merged commit 2a97404 into main on Feb 25, 2022
@jeffposnick deleted the algolia-record-size-fix branch on February 25, 2022 at 15:18
2 participants