\u0000 Unicode escaping of Hebrew characters uses 6 characters per Hebrew character #603

Closed
bandleader opened this issue Nov 29, 2020 · 1 comment

bandleader commented Nov 29, 2020

Currently, Sefaria's API seems to send Hebrew text in Unicode-escaped JSON strings. This uses 6 characters per Hebrew character. The following is a single verse in Kohelet:

{
  "he": ["\u05d0\u05b7\u05dc\u05be\u05ea\u05bc\u05b0\u05d1\u05b7\u05d4\u05b5\u05a8\u05dc \u05e2\u05b7\u05dc\u05be\u05e4\u05bc\u05b4\u059c\u05d9\u05da\u05b8 \u05d5\u05b0\u05dc\u05b4\u05d1\u05bc\u05b0\u05da\u05b8\u05a7 \u05d0\u05b7\u05dc\u05be\u05d9\u05b0\u05de\u05b7\u05d4\u05b5\u059b\u05e8 \u05dc\u05b0\u05d4\u05d5\u05b9\u05e6\u05b4\u05a5\u05d9\u05d0 \u05d3\u05b8\u05d1\u05b8\u0596\u05e8 \u05dc\u05b4\u05e4\u05b0\u05e0\u05b5\u05a3\u05d9 \u05d4\u05b8\u05d0\u05b1\u05dc\u05b9\u05d4\u05b4\u0591\u05d9\u05dd \u05db\u05bc\u05b4\u05a3\u05d9 \u05d4\u05b8\u05d0\u05b1\u05dc\u05b9\u05d4\u05b4\u05a4\u05d9\u05dd \u05d1\u05bc\u05b7\u05e9\u05c1\u05bc\u05b8\u05de\u05b7\u0599\u05d9\u05b4\u05dd\u0599 \u05d5\u05b0\u05d0\u05b7\u05ea\u05bc\u05b8\u05a3\u05d4 \u05e2\u05b7\u05dc\u05be\u05d4\u05b8\u05d0\u05b8\u0594\u05e8\u05b6\u05e5 \u05e2\u05b7\u05bd\u05dc\u05be\u05db\u05bc\u05b5\u059b\u05df \u05d9\u05b4\u05d4\u05b0\u05d9\u05a5\u05d5\u05bc \u05d3\u05b0\u05d1\u05b8\u05e8\u05b6\u0596\u05d9\u05da\u05b8 \u05de\u05b0\u05e2\u05b7\u05d8\u05bc\u05b4\u05bd\u05d9\u05dd\u05c3"]
}

(And ironically, it turns out to be about King Shlomo advising us to keep our words short 😀)

Is there a reason you aren't simply using Unicode characters in the JSON? Do some clients (that support Unicode) not support Unicode in JSON? That would be very surprising, especially since the JSON spec actually says the text "shall" be encoded in Unicode, UTF-8 by default. UTF-8 would encode each Hebrew character in 2 bytes instead of 6, and UTF-16 would also take 2 bytes per character.
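To make the 6-versus-2 difference concrete, here is a minimal sketch that serializes a three-character fragment of the verse above both ways and counts UTF-8 bytes. The helper names `escapeNonAscii` and `utf8Length` are hypothetical and only imitate what an ASCII-only JSON serializer does; they are not Sefaria's actual code.

```javascript
// Hypothetical helper: escape every non-ASCII UTF-16 code unit as \uXXXX,
// imitating a JSON serializer configured to emit ASCII only.
function escapeNonAscii(json) {
  return json.replace(/[\u0080-\uffff]/g, (c) =>
    "\\u" + c.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

// Hypothetical helper: number of bytes the string occupies when sent as UTF-8.
const utf8Length = (s) => new TextEncoder().encode(s).length;

// First three code points of the verse quoted above (alef, patah, lamed).
const verse = { he: ["\u05d0\u05b7\u05dc"] };

const raw = JSON.stringify(verse);   // Hebrew characters emitted as-is
const escaped = escapeNonAscii(raw); // each Hebrew character becomes \uXXXX

console.log(utf8Length(escaped)); // 29 = 11 ASCII bytes + 3 characters * 6 bytes
console.log(utf8Length(raw));     // 17 = 11 ASCII bytes + 3 characters * 2 bytes
```

Both strings parse back to the same value with `JSON.parse`, so clients that already handle the escaped form should not need any change to read the unescaped one.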

Trying a few different texts, with commentaries and without, I saw data savings ranging between 27% and 59%. (To easily test this, navigate to an API URL like this one, wait for it to load, and paste this one-liner into the JS console:)

[document.body.innerText].map(x=>[x.length, new TextEncoder().encode(JSON.stringify(JSON.parse(x))).length]).map(x=>[...x, Math.round(100*(1-x[1]/x[0])) + "% saved"])[0]

Thanks in advance!

Notes

  1. It's also possible to use a custom encoding to represent Hebrew in 1 byte per character or even less, but that might be out of scope.
  2. At some point I thought I saw Sefaria returning mixed escaped strings and actual Unicode, but I can't find where now.
  3. A related suggestion would be to optionally remove cantillation marks (trop) and/or vowelisation (nikud) from the text before sending. I can post as a separate issue if you like. This is far more feasible if we go with a GraphQL API (GraphQL API #602).
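
As a rough illustration of note 3, the sketch below strips cantillation marks and vowel points on the client side. The function names are hypothetical, and the character ranges are an assumption based on the Unicode Hebrew block (accents at U+0591 to U+05AF; points at U+05B0 to U+05BD, U+05BF, U+05C1, U+05C2 and U+05C7). Maqaf (U+05BE) and sof pasuq (U+05C3) are punctuation and are left in place. Whether meteg (U+05BD) belongs with the vowels or the trop is a judgment call; here it is removed with the points.

```javascript
// Assumed ranges from the Unicode Hebrew block; adjust to taste.
const CANTILLATION = /[\u0591-\u05AF]/g;                 // trop (accents)
const NIKUD = /[\u05B0-\u05BD\u05BF\u05C1\u05C2\u05C7]/g; // vowel points, dagesh, shin/sin dots

// Hypothetical helpers: remove one class of marks and keep everything else.
const stripCantillation = (s) => s.replace(CANTILLATION, "");
const stripNikud = (s) => s.replace(NIKUD, "");

// One word from the verse above, with both vowels and a cantillation mark.
const word = "\u05d3\u05b8\u05d1\u05b8\u0596\u05e8";

const vowelsOnly = stripCantillation(word);   // letters + nikud
const lettersOnly = stripNikud(vowelsOnly);   // consonants only

console.log(word.length, vowelsOnly.length, lettersOnly.length); // 6 5 3
```

Doing this on the server instead would reduce the payload further, since every removed mark is a full code point.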

blockspeiser (Contributor) commented

Thanks for flagging this. We've just made this fix for all of our APIs.
