## UNDERSTANDING UNICODE

1. `chr(0)` returns `'\x00'`, the null character (U+0000)

2. `__repr__` shows the character as the escape sequence `'\x00'`, while `print` writes the actual null character to stdout, where it is invisible

3. Inside a string, the repr still shows it as `\x00`; when printed, the character produces no visible output, although it still counts toward the string's length

In [3]:
chr(0)

'\x00'

In [4]:
print(chr(0))

 


In [5]:
"this text is " + chr(0)

'this text is \x00'

In [6]:
print("this text is " + chr(0))

this text is  


In [7]:
s = "this text is " + chr(0)

len(s), len("this text is ") #so it does add to the length

(14, 13)

## PROBLEM 2

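* There are no out-of-vocabulary inputs, since every sequence of characters can be represented as a list of integers (its bytes)

1. UTF-8 is preferred over UTF-16 and UTF-32 because it is more byte-efficient for typical text: ASCII characters take one byte each, while the rest of Unicode, including emojis, is still covered. Most datasets are stored in UTF-8, so it is also the natural choice for training

2. Some UTF-8 characters use multiple bytes, so the function that decodes one byte at a time is wrong: a single byte of a multi-byte character is not valid UTF-8 on its own. The byte string should be decoded all at once

3. `é` encodes into two bytes, `[195, 169]`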
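
To make the byte-efficiency point concrete, the sketch below compares encoded sizes across the three encodings. The sample strings are made up for illustration, and the little-endian codec names are used only so that no byte-order mark is counted.

```python
# Hypothetical sample strings, chosen only for illustration.
samples = ["hello there", "é", "€"]

# The "-le" variants skip the byte-order mark that plain "utf-16"/"utf-32" prepend.
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

# Map each sample to its encoded length in bytes under every encoding.
sizes = {
    text: {enc: len(text.encode(enc)) for enc in encodings}
    for text in samples
}

for text, by_encoding in sizes.items():
    print(repr(text), by_encoding)
```

ASCII-heavy text is smallest in UTF-8, but a character such as `€` takes three bytes in UTF-8 and only two in UTF-16, so the advantage depends on the text.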

In [8]:
test_str = "hello there"
s = test_str.encode("utf-8")
print(s)

b'hello there'


In [9]:
print(type(s))

<class 'bytes'>


In [10]:
test_str = "hello\x00 there      "
s = test_str.encode("utf-8")
"".join([bytes([k]).decode("utf-8") for k in list(s)])

'hello\x00 there      '

In [11]:
print(bytes([104, 101, 108, 108]).decode())

hell


In [12]:
bytes([104,101])
bytes([104])

b'h'

In [13]:
def decode_utf8_bytes_to_str_wrong(bytestring: bytes) -> str:
    # Decodes each byte in isolation, which breaks on multi-byte characters
    return "".join([bytes([b]).decode("utf-8") for b in bytestring])

decode_utf8_bytes_to_str_wrong("\x00helloé".encode("utf-8"))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data
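
The error above comes from handing the decoder one byte of a multi-byte character at a time. A minimal sketch of the fix is to decode the whole byte string in one call. Both function names below are assumptions for illustration, not names from the notebook. For input that arrives in chunks, the standard library's incremental decoder buffers partial characters between calls.

```python
import codecs

def decode_utf8_bytes_to_str(bytestring: bytes) -> str:
    # Decode the full sequence at once so multi-byte characters stay intact.
    return bytestring.decode("utf-8")

def decode_utf8_chunks(chunks) -> str:
    # Incremental decoding: incomplete characters are buffered between calls.
    decoder = codecs.getincrementaldecoder("utf-8")()
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises UnicodeDecodeError if a character was left unfinished.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)

data = "\x00helloé".encode("utf-8")
whole = decode_utf8_bytes_to_str(data)
# Feed the same bytes one at a time; the decoder joins the two bytes of é.
streamed = decode_utf8_chunks(bytes([b]) for b in data)
print(repr(whole), repr(streamed))
```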

In [14]:
list("€".encode("utf-8"))

[226, 130, 172]

In [15]:
bytes([226])

b'\xe2'

In [16]:
list('é'.encode('utf-8'))

[195, 169]
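
As a small check of which byte sequences are valid UTF-8, the sketch below tries to decode a few candidates built from the bytes shown above. The helper name `is_valid_utf8` and the labels are made up for this example.

```python
def is_valid_utf8(raw: bytes) -> bool:
    # A sequence is valid if strict decoding succeeds.
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

candidates = {
    "full é": bytes([195, 169]),
    "lead byte only": bytes([195]),
    "continuation byte only": bytes([169]),
    "lead byte + ASCII": bytes([195, 40]),
    "full €": bytes([226, 130, 172]),
    "truncated €": bytes([226, 130]),
}
results = {name: is_valid_utf8(raw) for name, raw in candidates.items()}

for name, ok in results.items():
    print(f"{name}: {ok}")
```

Only the complete sequences decode. `[195, 40]` is an example of a two-byte sequence that does not decode to any character, because the lead byte `0xC3` must be followed by a continuation byte.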

## PROBLEM 3

### NOTES

* Our initial vocabulary size is 256

* We pre-tokenize the dataset to break it into word-like chunks (pre-tokens) and only merge within a chunk; otherwise merges could cross word boundaries, and `dog` and `dog!` could end up as unrelated tokens

* If we don't pre-tokenize, then every time we merge we have to count the occurrences again over the whole corpus. If we pre-tokenize, we only keep a count per unique pre-token. Say a pre-token has three bytes [1,2,3], and [1,2] occurs many times, so we merge it to 4 and the pre-token becomes [4,3]. We still know the count of 4: it is the total count of [1,2], because pair counts are additive over pre-tokens
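
The additive counting described above can be sketched as one merge step over a table of pre-token frequencies. This is a minimal illustration, not the assignment's reference implementation: the pre-token counts are invented, the function names are hypothetical, and breaking ties by the lexicographically greater pair is an assumption.

```python
from collections import Counter

def pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by each pre-token's frequency."""
    counts = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts

def apply_merge(word_freqs, pair, new_id):
    """Replace every occurrence of `pair` with `new_id` inside each pre-token."""
    merged = Counter()
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

# Hypothetical pre-token counts: "dog" seen 5 times, "dot" seen 3 times.
word_freqs = Counter({tuple(b"dog"): 5, tuple(b"dot"): 3})

counts = pair_counts(word_freqs)
# Most frequent pair; ties broken by the greater pair (an assumption).
best = max(counts, key=lambda p: (counts[p], p))
# 256 is the first free id after the 256 byte values.
word_freqs = apply_merge(word_freqs, best, 256)
print(best, counts[best], dict(word_freqs))
```

The pair `(100, 111)` (`d`, `o`) gets count 5 + 3 = 8 without rescanning any text, which is the point of keeping counts per unique pre-token.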



In [2]:
import regex

exclude = ["<|endoftext|>", "<|startoftext|>", "<|pad|>"]
specials = "|".join(regex.escape(s) for s in exclude)

PAT = rf"""
   (?P<skip>{specials})(*SKIP)(*FAIL)    # 1) skip any of the specials
 | '(?:[sdmt]|ll|ve|re)                  # 2) contractions
 | [ ]?\p{{L}}+                          # 3) letters, with optional leading space
 | [ ]?\p{{N}}+                          # 4) numbers, with optional leading space
 | [ ]?[^\s\p{{L}}\p{{N}}]+              # 5) other punctuation, opt. lead-space
 | \s+(?!\S)                             # 6) whitespace not followed by a non-whitespace character
 | \s+                                   # 7) any whitespace
"""

token_re = regex.compile(PAT, regex.VERBOSE)


In [4]:
text = "hello there <|endoftext|>"
for i in token_re.finditer(text):
    print(i)

<regex.Match object; span=(0, 5), match='hello'>
<regex.Match object; span=(5, 11), match=' there'>
<regex.Match object; span=(11, 14), match=' <|'>
<regex.Match object; span=(14, 23), match='endoftext'>
<regex.Match object; span=(23, 25), match='|>'>


In [6]:
import regex

# 1) List exactly the sequences you want to see as whole tokens:
SPECIALS = ["<|endoftext|>", "<|startoftext|>", "<|pad|>"]
special_pattern = "|".join(regex.escape(s) for s in SPECIALS)

# 2) Build a master regex with the specials up front:
PAT = rf"""
   # — match any of your specials, first
   {special_pattern}

   | '(?:[sdmt]|ll|ve|re)   # contractions
   | \p{{L}}+               # letter sequences
   | \p{{N}}+               # number sequences
   | [^\s\p{{L}}\p{{N}}]+   # other punctuation
   | \s+                    # whitespace
"""

token_re = regex.compile(PAT, regex.VERBOSE)

def tokenize(text):
    return [m.group(0) for m in token_re.finditer(text)]

# demo
text = "hello there <|endoftext|> friend!"
print(tokenize(text))
# -> ['hello', ' ', 'there', ' ', '<|endoftext|>', ' ', 'friend', '!']


['hello', ' ', 'there', ' ', '<|endoftext|>', ' ', 'friend', '!']


In [11]:
import regex   # 3rd-party regex module (needed for \p{} Unicode props)

SPECIALS = ["<|endoftext|>", "<|startoftext|>", "<|pad|>"]
special_alt = "|".join(regex.escape(s) for s in SPECIALS)

PAT = rf"""
    [ ]?(?:{special_alt})              # 1) special tags, optional leading space
  | '(?:[sdmt]|ll|ve|re)               # 2) contractions
  | [ ]?\p{{L}}+                       # 3) letters
  | [ ]?\p{{N}}+                       # 4) numbers
  | [ ]?[^\s\p{{L}}\p{{N}}]+           # 5) punctuation
  | \s+(?!\S)                          # 6) whitespace not followed by a non-whitespace character
  | \s+                                # 7) other whitespace
"""

token_re = regex.compile(PAT, regex.VERBOSE)

# quick demo
text = "hello there <|endoftext|> friend!"
print([m.group(0) for m in token_re.finditer(text)])
# ['hello', ' there', ' <|endoftext|>', ' friend', '!']


['hello', ' there', ' <|endoftext|>', ' friend', '!']
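
An alternative to placing the specials inside the master pattern is to split the text on them first and pre-tokenize only the ordinary chunks. This is a different technique from the cells above. The sketch uses only the standard-library `re` module, and `split_on_specials` is a hypothetical helper name.

```python
import re

SPECIALS = ["<|endoftext|>", "<|startoftext|>", "<|pad|>"]

# A capturing group makes re.split keep the delimiters in its output.
special_re = re.compile("(" + "|".join(re.escape(s) for s in SPECIALS) + ")")

def split_on_specials(text):
    """Return ordinary chunks and special tokens in order, dropping empty strings."""
    return [part for part in special_re.split(text) if part]

parts = split_on_specials("hello there <|endoftext|> friend!")
print(parts)
```

Each special token comes out as its own element, so the ordinary chunks can be passed to the pre-tokenization regex while the specials are mapped straight to their ids.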


In [1]:
import tokenizer

tok = tokenizer.Tokenizer.from_files(
    "/Users/athekunal/Desktop/Stanford-cs336/Stanford-cs336-learning/stanford-cs336/tinystories-val.json",
    special_tokens=None
)

In [7]:
text = "hey there"
print(tok.decode(tok.encode(text)))

hey there


In [9]:
tok.decode([72, 417, 108, 111, 261,272])

[72, 101, 108, 108, 111, 32, 116, 104, 101, 114, 101]


'Hello there'

In [6]:
bytes([72, 101, 108, 108, 111, 32, 116, 104, 101, 65])

b'Hello theA'

In [24]:
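import torch

d = 512  # embedding dimension

# One rotation frequency per pair of dimensions: theta_k = 10000^(-2k/d), k = 0..d/2-1
theta_vals = 10_000**((-2 * torch.arange(0,d//2,1))/d)
position = torch.arange(d)

# cos_terms = torch.cos(theta_vals)

# Repeat each frequency twice so it lines up with the d dimensions.
# NOTE: this is an elementwise product, so entry i is i * theta_(i//2);
# a table over sequence positions would use an outer product instead.
m_theta = position*torch.repeat_interleave(theta_vals,repeats=2)

m_theta_cos = torch.cos(m_theta)
m_theta_sin = torch.sin(m_theta)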



In [26]:
theta_vals.shape

torch.Size([256])