## UNDERSTANDING UNICODE

1. `chr(0)` returns `'\x00'`, the null character (code point 0).

2. `__repr__` shows it as the escape sequence `'\x00'` so that it is visible, while `print` writes the raw character to stdout, where it has no visible glyph and looks empty.

3. Inside a larger string the same holds: the repr shows `\x00`, but printing the string shows nothing at that position. The character is still there and counts toward the length.

In [3]:
chr(0)

'\x00'

In [4]:
print(chr(0))

 


In [5]:
"this text is " + chr(0)

'this text is \x00'

In [6]:
print("this text is " + chr(0))

this text is  


In [7]:
s = "this text is " + chr(0)

len(s), len("this text is ") #so it does add to the length

(14, 13)
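
The observations above can be checked in one place. This is a small sketch; the variable names `null_char` and `text_with_null` are only illustrative.

```python
null_char = chr(0)
text_with_null = "this text is " + null_char

# repr() spells the character out as an escape sequence (6 visible characters).
print(repr(null_char))           # '\x00'
print(len(repr(null_char)))      # 6

# The string itself holds a single character whose code point is 0.
print(len(null_char))            # 1
print(ord(null_char))            # 0

# The invisible character still counts toward the length of the string.
print(len(text_with_null))       # 14
```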

## PROBLEM 2

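* There are no out-of-vocabulary inputs at the byte level: any string can be encoded to UTF-8 and represented as a list of integers in the range 0 to 255.

1. UTF-8 is preferred over UTF-16 and UTF-32 because it is more byte-efficient for ASCII-heavy text (one byte per ASCII character) while still covering every Unicode character, including emojis. Most datasets are already stored in UTF-8, so training on it requires no conversion.

2. Some characters take multiple bytes in UTF-8. The function decodes one byte at a time, so it fails on any multi-byte character. The whole byte sequence should be decoded at once.

3. `é` encodes into two bytes, `[195, 169]`, and neither byte decodes on its own.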
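
To make the byte-efficiency point concrete, here is a rough comparison of how many bytes a few sample strings need in each encoding. The helper `encoded_sizes` and the sample strings are assumptions made for illustration. The `-le` codec variants are used so that no byte-order mark is counted.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes each encoding needs for `text`."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

for sample in ["hello", "é", "€"]:
    print(repr(sample), encoded_sizes(sample))

# 'hello' {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
# 'é' {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
# '€' {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
```

UTF-8 is not the smallest choice for every single character (see `€`), but it is for ASCII, which is why it tends to win on mostly-English text.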

In [8]:
test_str = "hello there"
s = test_str.encode("utf-8")
print(s)

b'hello there'


In [9]:
print(type(s))

<class 'bytes'>


In [10]:
test_str = "hello\x00 there      "
s = test_str.encode("utf-8")
"".join([bytes([k]).decode("utf-8") for k in list(s)])

'hello\x00 there      '

In [11]:
print(bytes([104, 101, 108, 108]).decode())

hell


In [12]:
bytes([104,101])
bytes([104])

b'h'

In [13]:
def decode_utf8_bytes_to_str_wrong(bytestring: bytes) -> str:
    # Wrong on purpose: each byte is decoded in isolation, which breaks multi-byte characters.
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])

decode_utf8_bytes_to_str_wrong("\x00helloé".encode("utf-8"))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data
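
For contrast, a minimal sketch of the fix: decode the whole byte sequence in one call. The function names `decode_whole_sequence` and `decode_byte_by_byte` are hypothetical and chosen only for this example.

```python
def decode_whole_sequence(bytestring: bytes) -> str:
    # Decoding everything at once keeps multi-byte characters intact.
    return bytestring.decode("utf-8")

def decode_byte_by_byte(bytestring: bytes) -> str:
    # Same idea as the failing cell: every byte is decoded in isolation.
    return "".join(bytes([b]).decode("utf-8") for b in bytestring)

data = "\x00helloé".encode("utf-8")
print(repr(decode_whole_sequence(data)))   # '\x00helloé'

try:
    decode_byte_by_byte(data)
except UnicodeDecodeError as err:
    # 0xc3 is the first byte of 'é' and cannot be decoded without the byte after it.
    print("byte-by-byte decoding failed on byte", hex(err.object[err.start]))
```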

In [14]:
list("€".encode("utf-8"))

[226, 130, 172]

In [15]:
bytes([226])

b'\xe2'

In [16]:
list('é'.encode('utf-8'))

[195, 169]
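
The byte lists above follow the UTF-8 layout: the first byte of a multi-byte character is a lead byte, and every following byte is a continuation byte with the bit pattern `10xxxxxx`. A small sketch to check this, where `is_continuation_byte` is a helper name assumed for the example:

```python
def is_continuation_byte(b: int) -> bool:
    """True if `b` has the bit pattern 10xxxxxx used by UTF-8 continuation bytes."""
    return (b & 0b1100_0000) == 0b1000_0000

for ch in ["h", "é", "€"]:
    encoded = ch.encode("utf-8")
    print(ch, list(encoded), [is_continuation_byte(b) for b in encoded])

# h [104] [False]
# é [195, 169] [False, True]
# € [226, 130, 172] [False, True, True]
```

A continuation byte can never start a character, and a lead byte is incomplete without its continuation bytes, which is why decoding one byte at a time fails.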

## PROBLEM 3

### NOTES

* Our initial vocabulary size is 256

* We pre-tokenize the dataset to split it into word-like chunks (pre-tokens). Otherwise merges could cross word and punctuation boundaries, and `dog` and `dog!` could end up as unrelated tokens.

* Without pre-tokenization, every merge requires counting pair occurrences over the whole corpus again. With pre-tokenization we store each unique pre-token once together with its count. Say a pre-token has the three bytes [1,2,3], [1,2] occurs many times, and we merge it into 4, so the pre-token becomes [4,3]. We still know the count of 4: it is the total count of [1,2], because counts are additive across pre-tokens.
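
The additive-count argument can be sketched with made-up numbers. The pre-token counts below are hypothetical, and `count_pairs` is a helper name assumed for this example.

```python
from collections import Counter

# Hypothetical pre-token counts: (1, 2, 3) was seen 5 times, (1, 2) was seen 2 times.
pretoken_counts = {(1, 2, 3): 5, (1, 2): 2}

def count_pairs(counts: dict[tuple[int, ...], int]) -> Counter:
    """Count adjacent pairs, weighting each pair by its pre-token's frequency."""
    pair_counts: Counter = Counter()
    for token, freq in counts.items():
        for left, right in zip(token, token[1:]):
            pair_counts[(left, right)] += freq
    return pair_counts

before = count_pairs(pretoken_counts)
print(before[(1, 2)])   # 7

# Merge (1, 2) -> 4 inside each unique pre-token; the frequencies do not change.
merged_counts = {(4, 3): 5, (4,): 2}
count_of_4 = sum(freq * token.count(4) for token, freq in merged_counts.items())
print(count_of_4)       # 7, the same total that (1, 2) had before the merge
```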



In [17]:
import regex as re
PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

s = list(re.finditer(PAT, "hello how are you doing, ज़रूर! यहाँ एक वाक्य है हिंदी में:"))

In [18]:
s[0].span()

(0, 5)

In [19]:
list('hello'.encode('utf-8'))

[104, 101, 108, 108, 111]
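
To see why pre-tokenization keeps `dog` and `dog!` related, here is a sketch with a simplified pattern. `SIMPLE_PAT` is an assumption: it is a rough stand-in for `PAT` written for the standard-library `re` module (imported under the alias `std_re` so it does not shadow `regex`), and it does not handle contractions or Unicode categories the way `PAT` does.

```python
import re as std_re  # standard library; aliased so it does not replace `regex as re`

# Simplified stand-in for PAT: optional leading space + word, punctuation run, or whitespace.
SIMPLE_PAT = r" ?\w+| ?[^\w\s]+|\s+"

def pretokenize(text: str) -> list[str]:
    """Split text into pre-tokens using the simplified pattern."""
    return std_re.findall(SIMPLE_PAT, text)

print(pretokenize("dog"))            # ['dog']
print(pretokenize("dog!"))           # ['dog', '!']
print(pretokenize("a dog! a dog"))   # ['a', ' dog', '!', ' a', ' dog']
```

Because `!` is split off into its own pre-token, no merge can ever fuse it with `dog`.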

In [20]:
import collections
import regex as re

PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
txt_file = "/Users/athekunal/Desktop/Stanford-cs336/Stanford-cs336-learning/stanford-cs336/data/TinyStoriesV2-GPT4-valid.txt"

def convert_to_bytes(text: str) -> list[int]:
    bytes_list: list[int] = []
    for match in re.finditer(PAT, text):
        bytes_list.extend(match.group().encode("utf-8"))
    return bytes_list


In [21]:
import collections
from typing import TypeAlias

KeyBytes: TypeAlias = tuple[int, ...]
BytesCount: TypeAlias = int

a = "low low low low low"
b = "lower lower widest widest widest"
c = "newest newest newest newest newest newest"
def convert_to_bytes(text: str) -> collections.defaultdict[KeyBytes, BytesCount]:
    bytes_dict: collections.defaultdict[KeyBytes, BytesCount] = collections.defaultdict(
        int
    )
    for match in re.finditer(PAT, text):
        bytes_key = tuple(match.group().encode("utf-8"))
        # Count every adjacent byte pair inside this pre-token directly,
        # without storing the pre-token counts first.
        for i in range(len(bytes_key) - 1):
            bytes_dict[(bytes_key[i], bytes_key[i + 1])] += 1
    return bytes_dict

In [22]:
convert_to_bytes(b)

defaultdict(int,
            {(108, 111): 2,
             (111, 119): 2,
             (119, 101): 2,
             (101, 114): 2,
             (32, 108): 1,
             (32, 119): 3,
             (119, 105): 3,
             (105, 100): 3,
             (100, 101): 3,
             (101, 115): 3,
             (115, 116): 3})

In [23]:
bytes_dict: collections.defaultdict[KeyBytes, BytesCount] = collections.defaultdict(int)
# Count how often each pre-token (as a tuple of byte values) occurs in `b`.
for match in re.finditer(PAT, b):
    bytes_key = tuple(match.group().encode("utf-8"))
    bytes_dict[bytes_key] += 1

In [24]:
bytes_dict

defaultdict(int,
            {(108, 111, 119, 101, 114): 1,
             (32, 108, 111, 119, 101, 114): 1,
             (32, 119, 105, 100, 101, 115, 116): 3})

In [25]:
VocabDict: TypeAlias = collections.defaultdict[KeyBytes, BytesCount]

vocab_dict: VocabDict = collections.defaultdict(int)

for bd_key, bd_val in bytes_dict.items():
    for i in range(len(bd_key)-1):
        # Weight each pair by how often its pre-token occurs, not by 1.
        vocab_dict[(bd_key[i], bd_key[i+1])] += bd_val

In [26]:
from collections import OrderedDict
sorted_dict = OrderedDict(sorted(vocab_dict.items(), key=lambda item: item[1], reverse=True))
sorted_dict

OrderedDict([((32, 119), 3),
             ((119, 105), 3),
             ((105, 100), 3),
             ((100, 101), 3),
             ((101, 115), 3),
             ((115, 116), 3),
             ((108, 111), 2),
             ((111, 119), 2),
             ((119, 101), 2),
             ((101, 114), 2),
             ((32, 108), 1)])

In [27]:
top1 = max(vocab_dict.items(), key=lambda item: item[1])
top1[0], top1[1]

((32, 119), 3)

In [28]:
a = (108, 111, 119, 101, 114, 108,114, 108,111)
b = (108,111)



In [29]:
def replace_all_subsequences(a, b, curr_max_vocab):
    """Replace every non-overlapping occurrence of tuple b in tuple a with curr_max_vocab."""
    if b == ():
        # Nothing to search for, so return the input unchanged.
        return a
    n, m = len(a), len(b)
    result = []
    i = 0
    while i <= n - m:
        if a[i:i+m] == b:
            result.append(curr_max_vocab)
            i += m
        else:
            result.append(a[i])
            i += 1
    # Append the tail, which is too short to contain another match
    result.extend(a[i:])
    return tuple(result)


replace_all_subsequences(a,(),66)

(108, 111, 119, 101, 114, 108, 114, 108, 111)
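
Putting the pieces together, the following is a rough sketch of a training loop over pre-token counts: count pairs weighted by frequency, pick the most frequent pair, merge it everywhere, repeat. The names `count_pairs`, `merge_pair`, and `train_merges` are assumptions for this example. Breaking ties by the lexicographically greater pair is also an assumption; the cells above simply take whatever `max` returns first.

```python
from collections import Counter

def count_pairs(pretokens: dict[tuple[int, ...], int]) -> Counter:
    """Count adjacent pairs, weighted by how often each pre-token occurs."""
    pair_counts: Counter = Counter()
    for token, freq in pretokens.items():
        for pair in zip(token, token[1:]):
            pair_counts[pair] += freq
    return pair_counts

def merge_pair(token: tuple[int, ...], pair: tuple[int, int], new_id: int) -> tuple[int, ...]:
    """Replace every non-overlapping occurrence of `pair` in `token` with `new_id`."""
    out: list[int] = []
    i = 0
    while i < len(token):
        if i + 1 < len(token) and (token[i], token[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(token[i])
            i += 1
    return tuple(out)

def train_merges(pretokens: dict[tuple[int, ...], int], num_merges: int, first_new_id: int = 256):
    """Run up to `num_merges` merge steps; return the merges and the updated pre-tokens."""
    merges: list[tuple[tuple[int, int], int]] = []
    for step in range(num_merges):
        pair_counts = count_pairs(pretokens)
        if not pair_counts:
            break  # every pre-token is a single symbol, nothing left to merge
        # Assumed tie-break: highest count first, then the lexicographically greater pair.
        best_pair = max(pair_counts, key=lambda p: (pair_counts[p], p))
        new_id = first_new_id + step
        merged: Counter = Counter()
        for token, freq in pretokens.items():
            merged[merge_pair(token, best_pair, new_id)] += freq
        pretokens = dict(merged)
        merges.append((best_pair, new_id))
    return merges, pretokens

# Pre-token counts for "lower lower widest widest widest".
example = {
    (108, 111, 119, 101, 114): 1,
    (32, 108, 111, 119, 101, 114): 1,
    (32, 119, 105, 100, 101, 115, 116): 3,
}
merges, updated = train_merges(example, num_merges=1)
print(merges)    # [((119, 105), 256)]
print(updated)
```

Only the pre-tokens that contain the chosen pair change, so a more efficient version could update the pair counts incrementally instead of recounting on every step.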