Fix Algolia multi-byte content indexing #7406

jeffposnick · 2022-02-24T22:48:25Z

This resolves issues our Algolia indexing job has had with multi-byte content, following up on #6952.

I've also updated the test suite to be a bit more representative of the issues we're seeing, including using both single- and multi-byte characters in the mocked data.

From what I could tell, there were two root causes for the current problems:

JSON size considerations

Previously, byteof was used to find the size of a given record (and therefore how many bytes needed to be trimmed). But the Algolia API cares about the size of the serialized JSON object.

In this PR, I've switched off of byteof and instead serialize an object to JSON first, and then count those bytes, to match what Algolia checks.
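To illustrate why the serialized size is the number that matters, here is a minimal Node.js sketch. The `jsonByteLength` helper and the sample record are hypothetical and are not taken from algolia.js. The comparison against a raw count of the string values is only an illustration and makes no claim about how byteof computes sizes.

```javascript
// Hypothetical helper (not from algolia.js): the size in bytes of a record
// once it has been serialized to JSON and encoded as UTF-8.
function jsonByteLength(record) {
  return Buffer.byteLength(JSON.stringify(record), 'utf8');
}

// Made-up record mixing multi-byte and single-byte characters.
const record = {title: 'こんにちは', content: 'hello'};

// Counting only the UTF-8 bytes of the string values ignores the keys,
// quotes, colons, commas, and braces that the serialized form also contains.
const valueBytes = Object.values(record).reduce(
  (sum, value) => sum + Buffer.byteLength(value, 'utf8'),
  0
);

console.log(valueBytes); // 20
console.log(jsonByteLength(record)); // 45
```

The gap between the two numbers grows with the number of fields, so a record that looks small enough by a raw count can still be over the limit once serialized.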

String truncation

Previously, the trimming logic attempted to find the percentage of the string that needed to be removed by dividing the number of bytes that needed to be dropped by the total bytes in the string, and then called slice() to remove that percentage of characters from the end.

This approach would only work if the entire string is made up of characters with a uniform number of bytes (i.e. all one byte, or all two bytes, etc.). When dealing with a string that contains characters with varying byte sizes, there is no guarantee that removing a certain number of characters from the end will reduce the size by a proportional amount. For instance, if the string contains a large number of multi-byte characters at the start, and mostly single-byte characters at the end, then removing the final 25% of characters will not reduce the overall byte size by 25%.
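The mismatch can be reproduced with a small, made-up string (this is an illustration, not the data from the indexing job): 30 three-byte characters followed by 10 one-byte characters.

```javascript
// Made-up example: 30 three-byte characters, then 10 one-byte characters.
const text = 'あ'.repeat(30) + 'a'.repeat(10);
const totalBytes = Buffer.byteLength(text, 'utf8');

// Drop the final 25% of the characters, as a proportional slice() would.
const keepChars = Math.floor(text.length * 0.75);
const trimmed = text.slice(0, keepChars);
const trimmedBytes = Buffer.byteLength(trimmed, 'utf8');

console.log(totalBytes); // 100
console.log(trimmedBytes); // 90
```

Removing a quarter of the characters here removes only a tenth of the bytes, so the trimmed string can still be over the size limit.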

Rather than try to write code that will truncate a string properly while respecting multi-byte character boundaries, I'm turning to the truncate-utf8-bytes module, which seems to work as expected.
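For readers unfamiliar with what byte-aware truncation involves, the sketch below shows the general idea. It is a hypothetical helper named `truncateToBytes`, written only for illustration. It is not the truncate-utf8-bytes implementation and not the code in this PR, which relies on the module instead.

```javascript
// Hypothetical helper for illustration only. It keeps whole characters
// until adding the next one would exceed the byte budget.
function truncateToBytes(text, maxBytes) {
  let result = '';
  let bytes = 0;
  // for...of iterates by code point, so a surrogate pair (for example an
  // emoji) is treated as one unit and is never split in half.
  for (const char of text) {
    const charBytes = Buffer.byteLength(char, 'utf8');
    if (bytes + charBytes > maxBytes) {
      break;
    }
    result += char;
    bytes += charBytes;
  }
  return result;
}

// 'a' is 1 byte, 'あ' is 3 bytes, '😀' is 4 bytes, 'b' is 1 byte.
console.log(truncateToBytes('aあ😀b', 5)); // 'aあ'
console.log(truncateToBytes('aあ😀b', 8)); // 'aあ😀'
```

The result can be shorter than the budget (4 bytes for a budget of 5 above), because a character that does not fit is dropped whole rather than cut mid-sequence.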

netlify · 2022-02-24T22:48:30Z

✔️ Deploy Preview for web-dev-staging ready!

🔨 Explore the source changes: 8fab63a

🔍 Inspect the deploy log: https://app.netlify.com/sites/web-dev-staging/deploys/6218f17643cee1000707f1b9

😎 Browse the preview: https://deploy-preview-7406--web-dev-staging.netlify.app

algolia.js

jeffposnick added 2 commits February 24, 2022 16:38

Fix logic that trims Algolia records

b794bb9

Switch to truncate-utf8-bytes

289b63d

jeffposnick requested a review from devnook February 24, 2022 22:48

pullapprove bot requested a review from PaulKinlan February 24, 2022 22:48

jeffposnick commented on algolia.js Feb 24, 2022 (resolved)

jeffposnick added the "DO NOT MERGE" label (Actively working on but experimental) Feb 24, 2022

devnook approved these changes Feb 25, 2022

Cleanup

8fab63a

jeffposnick added the "$-presubmit" label (Add label to run presubmit tests.) and removed the "DO NOT MERGE" label (Actively working on but experimental) Feb 25, 2022

github-actions bot removed the "$-presubmit" label (Add label to run presubmit tests.) Feb 25, 2022

jeffposnick merged commit 2a97404 into main Feb 25, 2022

jeffposnick deleted the algolia-record-size-fix branch February 25, 2022 15:18
