In [3]:
text = "Ｕｎｉｃｏｄｅ! 🅤🅝🅘🅒🅞🅓🅔‽ 🇺‌🇳‌🇮‌🇨‌🇴‌🇩‌🇪! 😄 The very name strikes fear and awe into the hearts of programmers worldwide. We all know we ought to “support Unicode” in our software (whatever that means—like using wchar_t for all the strings, right?). But Unicode can be abstruse, and diving into the thousand-page Unicode Standard plus its dozens of supplementary annexes, reports, and notes can be more than a little intimidating. I don’t blame programmers for still finding the whole thing mysterious, even 30 years after Unicode’s inception. A few months ago, I got interested in Unicode and decided to spend some time learning more about it in detail. In this article, I’ll give an introduction to it from a programmer’s point of view. I’m going to focus on the character set and what’s involved in working with strings and files of Unicode text. However, in this article I’m not going to talk about fonts, text layout/shaping/rendering, or localization in detail—those are separate issues, beyond my scope (and knowledge) here. Diversity and Inherent Complexity The Unicode Codespace Codespace Allocation Scripts Usage Frequency Encodings UTF-8 UTF-16 Combining Marks Canonical Equivalence Normalization Forms Grapheme Clusters And More… Diversity and Inherent Complexity As soon as you start to study Unicode, it becomes clear that it represents a large jump in complexity over character sets like ASCII that you may be more familiar with. It’s not just that Unicode contains a much larger number of characters, although that’s part of it. Unicode also has a great deal of internal structure, features, and special cases, making it much more than what one might expect a mere “character set” to be. We’ll see some of that later in this article. When confronting all this complexity, especially as an engineer, it’s hard not to find oneself asking, “Why do we need all this? Is this really necessary? 
Couldn’t it be simplified?” However, Unicode aims to faithfully represent the entire world’s writing systems. The Unicode Consortium’s stated goal is “enabling people around the world to use computers in any language”. And as you might imagine, the diversity of written languages is immense! To date, Unicode supports 135 different scripts, covering some 1100 languages, and there’s still a long tail of over 100 unsupported scripts, both modern and historical, which people are still working to add. Given this enormous diversity, it’s inevitable that representing it is a complicated project. Unicode embraces that diversity, and accepts the complexity inherent in its mission to include all human writing systems. It doesn’t make a lot of trade-offs in the name of simplification, and it makes exceptions to its own rules where necessary to further its mission. Moreover, Unicode is committed not just to supporting texts in any single language, but also to letting multiple languages coexist within one text—which introduces even more complexity. Most programming languages have libraries available to handle the gory low-level details of text manipulation, but as a programmer, you’ll still need to know about certain Unicode features in order to know when and how to apply them. It may take some time to wrap your head around it all, but don’t be discouraged—think about the billions of people for whom your software will be more accessible through supporting text in their language. Embrace the complexity! The Unicode Codespace Let’s start with some general orientation. The basic elements of Unicode—its “characters”, although that term isn’t quite right—are called code points. Code points are identified by number, customarily written in hexadecimal with the prefix “U+”, such as U+0041 “A” latin capital letter a or U+03B8 “θ” greek small letter theta. Each code point also has a short name, and quite a few other properties, specified in the Unicode Character Database. 
The set of all possible code points is called the codespace. The Unicode codespace consists of 1,114,112 code points. However, only 128,237 of them—about 12% of the codespace—are actually assigned, to date. There’s plenty of room for growth! Unicode also reserves an additional 137,468 code points as “private use” areas, which have no standardized meaning and are available for individual applications to define for their own purposes. Codespace Allocation To get a feel for how the codespace is laid out, it’s helpful to visualize it. Below is a map of the entire codespace, with one pixel per code point. It’s arranged in tiles for visual coherence; each small square is 16×16 = 256 code points, and each large square is a “plane” of 65,536 code points. There are 17 planes altogether."
tokens = text.encode("utf-8")    # raw UTF-8 bytes of the text
tokens = list(map(int, tokens))  # convert to a list of ints in the range 0..255
tokens

[239,
 188,
 181,
 239,
 189,
 142,
 239,
 189,
 137,
 239,
 189,
 131,
 239,
 189,
 143,
 239,
 189,
 132,
 239,
 189,
 133,
 33,
 32,
 240,
 159,
 133,
 164,
 240,
 159,
 133,
 157,
 240,
 159,
 133,
 152,
 240,
 159,
 133,
 146,
 240,
 159,
 133,
 158,
 240,
 159,
 133,
 147,
 240,
 159,
 133,
 148,
 226,
 128,
 189,
 32,
 240,
 159,
 135,
 186,
 226,
 128,
 140,
 240,
 159,
 135,
 179,
 226,
 128,
 140,
 240,
 159,
 135,
 174,
 226,
 128,
 140,
 240,
 159,
 135,
 168,
 226,
 128,
 140,
 240,
 159,
 135,
 180,
 226,
 128,
 140,
 240,
 159,
 135,
 169,
 226,
 128,
 140,
 240,
 159,
 135,
 170,
 33,
 32,
 240,
 159,
 152,
 132,
 32,
 84,
 104,
 101,
 32,
 118,
 101,
 114,
 121,
 32,
 110,
 97,
 109,
 101,
 32,
 115,
 116,
 114,
 105,
 107,
 101,
 115,
 32,
 102,
 101,
 97,
 114,
 32,
 97,
 110,
 100,
 32,
 97,
 119,
 101,
 32,
 105,
 110,
 116,
 111,
 32,
 116,
 104,
 101,
 32,
 104,
 101,
 97,
 114,
 116,
 115,
 32,
 111,
 102,
 32,
 112,
 114,
 111,
 103,
 114,
 97,
 109,
 109,
 101
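
The output above is cut off, but its first entries already show that UTF-8 is a variable-width encoding: the opening fullwidth letters take three bytes each, the symbols after them take four, and plain ASCII takes one. A minimal sketch to make that visible, using a short sample string chosen for illustration (written with escapes so the code points are unambiguous):

```python
# Show how many UTF-8 bytes each character needs.
# U+FF35 fullwidth "U", U+1F164 circled "U" symbol, U+203D interrobang, ASCII "!"
sample = "\uFF35\U0001F164\u203D!"

for ch in sample:
    encoded = list(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded}")

# Number of bytes per character, in order.
byte_lengths = [len(ch.encode("utf-8")) for ch in sample]
```

The first three values printed for U+FF35 are the same `239, 188, 181` that open the token list, which is why the list is so much longer than the character count of the text.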

In [5]:
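# Count how often each adjacent pair of token ids occurs.
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):  # consecutive (left, right) pairs
        counts[pair] = counts.get(pair, 0) + 1
    return counts

pairs = get_stats(tokens)
# Most frequent pairs first, as (count, pair) tuples.
print(sorted(((v, k) for k, v in pairs.items()), reverse=True))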

[(142, (101, 32)), (93, (115, 32)), (91, (32, 97)), (84, (32, 116)), (81, (105, 110)), (68, (116, 32)), (63, (116, 104)), (62, (32, 105)), (60, (226, 128)), (60, (101, 114)), (54, (100, 101)), (52, (99, 111)), (51, (114, 101)), (50, (104, 101)), (49, (105, 116)), (48, (32, 115)), (47, (110, 32)), (47, (97, 114)), (47, (97, 110)), (45, (100, 32)), (45, (44, 32)), (43, (111, 100)), (43, (101, 115)), (40, (110, 103)), (40, (32, 111)), (39, (116, 101)), (39, (108, 101)), (39, (32, 99)), (38, (116, 105)), (38, (111, 114)), (38, (97, 108)), (35, (114, 32)), (35, (110, 116)), (34, (111, 110)), (34, (111, 32)), (33, (105, 99)), (33, (97, 116)), (32, (110, 100)), (32, (108, 32)), (32, (101, 110)), (32, (46, 32)), (29, (116, 111)), (29, (32, 112)), (28, (121, 32)), (28, (104, 97)), (28, (32, 119)), (27, (32, 109)), (26, (115, 116)), (26, (110, 105)), (26, (108, 108)), (26, (105, 115)), (26, (32, 85)), (25, (116, 115)), (25, (111, 117)), (25, (111, 102)), (25, (32, 108)), (24, (128, 153)), (23, (

In [10]:
# The pair with the highest count is the first candidate to merge.
top_pair = max(pairs, key=pairs.get)
top_pair

(101, 32)
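
Because every token at this stage is still a raw byte, the winning pair can be read back as text. A quick check, with the pair hard-coded from the output above so the snippet runs on its own:

```python
# Interpret the most frequent pair as the two bytes it stands for.
top_pair = (101, 32)  # copied from the output above
decoded = bytes(top_pair).decode("utf-8")
print(repr(decoded))
```

The pair is the letter "e" followed by a space, which is unsurprising for English prose.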

In [17]:
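def merge(ids, pair, idx):
    """Return a copy of ids with every occurrence of pair replaced by idx."""
    newids = []
    i = 0
    while i < len(ids):
        # Match only when a right-hand neighbour exists and both ids agree.
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2  # skip both members of the merged pair
        else:
            newids.append(ids[i])
            i += 1
    return newids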

# Use 256 as the new id: 0..255 are taken by raw bytes (99 is already the byte for "c").
lst = merge(tokens, top_pair, 256)
print(lst)

[239, 188, 181, 239, 189, 142, 239, 189, 137, 239, 189, 131, 239, 189, 143, 239, 189, 132, 239, 189, 133, 33, 32, 240, 159, 133, 164, 240, 159, 133, 157, 240, 159, 133, 152, 240, 159, 133, 146, 240, 159, 133, 158, 240, 159, 133, 147, 240, 159, 133, 148, 226, 128, 189, 32, 240, 159, 135, 186, 226, 128, 140, 240, 159, 135, 179, 226, 128, 140, 240, 159, 135, 174, 226, 128, 140, 240, 159, 135, 168, 226, 128, 140, 240, 159, 135, 180, 226, 128, 140, 240, 159, 135, 169, 226, 128, 140, 240, 159, 135, 170, 33, 32, 240, 159, 152, 132, 32, 84, 104, 256, 118, 101, 114, 121, 32, 110, 97, 109, 256, 115, 116, 114, 105, 107, 101, 115, 32, 102, 101, 97, 114, 32, 97, 110, 100, 32, 97, 119, 256, 105, 110, 116, 111, 32, 116, 104, 256, 104, 101, 97, 114, 116, 115, 32, 111, 102, 32, 112, 114, 111, 103, 114, 97, 109, 109, 101, 114, 115, 32, 119, 111, 114, 108, 100, 119, 105, 100, 101, 46, 32, 87, 256, 97, 108, 108, 32, 107, 110, 111, 119, 32, 119, 256, 111, 117, 103, 104, 116, 32, 116, 111, 32, 226, 128, 156, 115,
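
The full output is hard to check by eye, so a toy input helps confirm what `merge` does. The sketch below repeats the function so it runs standalone; the input lists are made-up values for illustration, and 256 is used as the new id because it is the first value that cannot be a raw byte.

```python
def merge(ids, pair, idx):
    """Return a copy of ids with every occurrence of pair replaced by idx."""
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

# Only the adjacent (6, 7) is replaced; the lone 6 before it is kept.
toy = merge([5, 6, 6, 7, 9, 1], (6, 7), 256)
print(toy)

# Matches are consumed left to right, so overlapping candidates merge once.
overlap = merge([1, 1, 1], (1, 1), 256)
print(overlap)
```

The second call shows a design consequence of the left-to-right scan: in a run of three equal ids, the first two are merged and the third is left over.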

In [None]:
# BPE helpers: pair counting and pair merging

def get_stats(ids):
    count = {}
    for pair in zip(ids, ids[1:]):
        count[pair] = count.get(pair, 0) + 1
    return count

def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i+1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids
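
The cell above stops at the two helpers. One way they might be combined into a training loop is sketched below: repeatedly find the most frequent pair, merge it, and assign the next free id starting at 256. The names `train_bpe` and `decode`, the sample string, and the merge count are assumptions made for illustration, not part of the notebook.

```python
def get_stats(ids):
    """Count occurrences of each adjacent pair of ids."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    """Return a copy of ids with every occurrence of pair replaced by idx."""
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

def train_bpe(ids, num_merges):
    """Hypothetical helper: run up to num_merges rounds of pair merging."""
    ids = list(ids)  # work on a copy
    merges = {}      # (left, right) -> new id, in the order learned
    for step in range(num_merges):
        stats = get_stats(ids)
        if not stats:
            break  # fewer than two tokens left, nothing to merge
        pair = max(stats, key=stats.get)
        idx = 256 + step  # ids 0..255 are reserved for raw bytes
        ids = merge(ids, pair, idx)
        merges[pair] = idx
    return ids, merges

def decode(ids, merges):
    """Hypothetical helper: expand merged ids back into text."""
    vocab = {i: bytes([i]) for i in range(256)}
    # Dicts keep insertion order, so earlier merges are defined first.
    for (left, right), idx in merges.items():
        vocab[idx] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

sample_text = "aaabdaaabac"  # illustrative input
sample = list(sample_text.encode("utf-8"))
compressed, merges = train_bpe(sample, num_merges=3)
print(len(sample), "->", len(compressed), "tokens")
print(merges)
print(decode(compressed, merges) == sample_text)
```

On this sample, three merges shrink 11 byte tokens to 5, and decoding recovers the original string. When two pairs tie for the highest count, `max` returns whichever was seen first, so a different tie-breaking rule would learn a different merge table.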
