Airsequel/double-x-encoding

Double X Encoding

Double X Encoding is a scheme for encoding any Unicode string using only characters from [0-9a-zA-Z_]. In that respect it is similar to URL percent-encoding, which also maps arbitrary text onto a restricted alphabet. It is especially useful for generating GraphQL IDs.

Constraints for the encoding scheme:

  1. Common IDs like file_format, fileFormat, FileFormat, FILE_FORMAT, __file_format__, … must not be altered
  2. Support all Unicode characters
  3. Characters in the ASCII range must yield shorter encodings than other Unicode characters
  4. Optional encoding of leading digits (as in 1_file_format) to satisfy ID schemes that do not allow them (e.g. GraphQL's)

Examples

| Input | Output |
|---|---|
| camelCaseId | camelCaseId |
| snake_case_id | snake_case_id |
| __Schema | __Schema |
| doxxing | doxxing |
| DOXXING | DOXXXXXXING |
| id with spaces | idXX0withXX0spaces |
| id-with.special$chars! | idXXDwithXXEspecialXX4charsXX1 |
| id_with_ümläutß | id_with_XXaaapmmlXXaaaoeutXXaaanp |
| Emoji: 😅 | EmojiXXGXX0XXbpgaf |
| Multi Byte Emoji: 👨‍🦲 | MultiXX0ByteXX0EmojiXXGXX0XXbpegiXXacaanXXbpjlc |
| \u{100000} | XXYbaaaaa |
| \u{10ffff} | XXYbapppp |

With the optional encoding of leading digits and double underscores enabled (necessary for GraphQL ID generation):

| Input | Output |
|---|---|
| 1FileFormat | XXZ1FileFormat |
| __index__ | XXRXXRindexXXRXXR |

Explanation

The encoding scheme is based on the following rules:

  1. All characters in [0-9A-Za-z_] are encoded as is, except for the sequence XX
  2. XX is encoded as XXXXXX
  3. All other printable characters in the ASCII range are encoded as a sequence of 3 characters: XX[0-9A-W]
  4. All other Unicode code points up to U+FFFFF (e.g. emoji) are encoded as a sequence of 7 characters: XX[a-p]{5}, where the 5 characters are the code point's hexadecimal representation written in an alternative hex alphabet that runs from a to p instead of 0 to f
  5. All Unicode code points in the Supplementary Private Use Area-B (U+100000 to U+10FFFF) are encoded as a sequence of 9 characters: XXY[a-p]{6}
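
The five rules above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the names `double_x_encode`, `alt_hex`, and `ASCII_CODES` are hypothetical, and the full symbol table for rule 3 is inferred from the examples by assuming that the special characters are numbered in code point order. The sketch also assumes that non-printable ASCII characters fall through to rule 4.

```python
import string

# Assumption: the 33 printable ASCII characters outside [0-9A-Za-z] are
# numbered in code point order using the symbols 0-9 followed by A-W.
# This agrees with the README examples (space -> 0, ! -> 1, $ -> 4,
# - -> D, . -> E, : -> G, _ -> R), but the complete table is inferred.
ASCII_SPECIALS = [chr(c) for c in range(0x20, 0x7F) if not chr(c).isalnum()]
SYMBOLS = string.digits + string.ascii_uppercase[:23]  # "0".."9", "A".."W"
ASCII_CODES = dict(zip(ASCII_SPECIALS, SYMBOLS))

# Alternative hex alphabet: 0-9a-f is written as a-p.
HEX_TO_ALT = str.maketrans("0123456789abcdef", "abcdefghijklmnop")


def alt_hex(code_point: int, width: int) -> str:
    """Zero-padded hex representation of a code point in the a-p alphabet."""
    return format(code_point, f"0{width}x").translate(HEX_TO_ALT)


def double_x_encode(text: str) -> str:
    """Encode text following rules 1-5 (optional encodings not included)."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if text.startswith("XX", i):
            # Rule 2: a literal XX is escaped as XXXXXX.
            out.append("XXXXXX")
            i += 2
            continue
        if ch.isascii() and (ch.isalnum() or ch == "_"):
            # Rule 1: characters in [0-9A-Za-z_] pass through unchanged.
            out.append(ch)
        elif ch in ASCII_CODES:
            # Rule 3: other printable ASCII characters become XX[0-9A-W].
            out.append("XX" + ASCII_CODES[ch])
        elif ord(ch) <= 0xFFFFF:
            # Rule 4: code points up to U+FFFFF become XX[a-p]{5}.
            # Assumption: ASCII control characters also take this path.
            out.append("XX" + alt_hex(ord(ch), 5))
        else:
            # Rule 5: U+100000 to U+10FFFF become XXY[a-p]{6}.
            out.append("XXY" + alt_hex(ord(ch), 6))
        i += 1
    return "".join(out)


print(double_x_encode("id-with.special$chars!"))
print(double_x_encode("Emoji: \U0001F605"))
```

As a worked case for rule 4, `ü` is U+00FC, whose 5-digit hex form `000fc` becomes `aaapm` in the a-p alphabet, giving `XXaaapm`.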

If the optional leading digit encoding is enabled, a leading digit is encoded as XXZ[0-9].

If the optional double underscore encoding is enabled, double underscores are encoded as XXRXXR.
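
The two optional encodings exist because GraphQL names must match `[_A-Za-z][_0-9A-Za-z]*` and names starting with `__` are reserved for introspection. The following Python sketch illustrates both options as a post-processing step on a string that is already Double X encoded. The helper `apply_graphql_options` and its parameters are hypothetical, and handling the options as a separate pass is an assumption made for brevity. How an odd number of consecutive underscores should be treated is not specified above; here `str.replace` simply rewrites non-overlapping pairs from left to right.

```python
import re

# GraphQL name pattern as given in the GraphQL specification.
GRAPHQL_NAME = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")


def apply_graphql_options(
    encoded: str,
    leading_digit: bool = True,
    double_underscore: bool = True,
) -> str:
    """Apply the optional encodings to an already encoded string.

    Hypothetical helper: the input is assumed to contain only
    characters from [0-9A-Za-z_].
    """
    if double_underscore:
        # Each pair of underscores becomes XXRXXR (XXR encodes "_").
        encoded = encoded.replace("__", "XXRXXR")
    if leading_digit and encoded and encoded[0] in "0123456789":
        # A leading digit d becomes XXZd, so the result starts with a letter.
        encoded = "XXZ" + encoded
    return encoded


for name in ["1FileFormat", "__index__"]:
    result = apply_graphql_options(name)
    print(name, "->", result, bool(GRAPHQL_NAME.fullmatch(result)))
```

Both example inputs consist only of characters from [0-9A-Za-z_], so the base rules leave them unchanged and the helper can be applied to them directly.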

Installation

  • Haskell: Via Hackage
  • Other languages:
    The code is not yet available via common package managers. Please copy the code into your project for the time being.