# Exercise 1: Implementing Character-Level Tokenization

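### Advantages:
1. No unknown tokens for text drawn from the same alphabet - every character seen while building the vocabulary has its own token.
2. Smaller vocabulary size - only the unique characters need to be represented.

### Disadvantages:
1. Longer sequences - every character, including spaces and punctuation, becomes its own token, so a sentence turns into many more tokens.
2. Loss of word-level meaning - a single character carries little semantic information, so a model has to learn words from character patterns.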
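
To make the first disadvantage concrete, the sketch below compares token counts for one sentence from the opening of the text. Splitting on whitespace is only a stand-in for word-level tokenization, chosen here for illustration.

```python
# Sample sentence taken from the opening of the text used in this exercise
sample = "I HAD always thought Jack Gisburn rather a cheap genius"

char_tokens = list(sample)    # character-level: one token per character
word_tokens = sample.split()  # naive word-level stand-in: split on whitespace

print("Character-level tokens:", len(char_tokens))
print("Word-level tokens:     ", len(word_tokens))
print("Characters per word token:", round(len(char_tokens) / len(word_tokens), 1))
```

The same sentence needs several times as many tokens at the character level, which means longer input sequences for any model that consumes them.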

## Implementation:

### Step 1: Load the text data

In [5]:
# Load the text data

with open(
    "/Users/sadiahzahoor/Desktop/AI Research/LLMs /LLM's from Scratch/the-verdict.txt",
    "r",
) as file:
    text = file.read()

print("Total number of characters in the text: ", len(text))
print(text[:100])

Total number of characters in the text:  20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


### Step 2: Create a character level Vocabulary

In [6]:
unique_chars = sorted(set(text))  # Sorted list of the unique characters in the text

print("Total number of unique characters: ", len(unique_chars))
print(unique_chars)

Total number of unique characters:  62
['\n', ' ', '!', '"', "'", '(', ')', ',', '-', '.', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


### Step 3: Create character-to-token and token-to-character mappings

In [16]:
char_to_token = {char: i for i, char in enumerate(unique_chars)}
token_to_char = {i: char for i, char in enumerate(unique_chars)}

# Print the first 10 entries of each mapping
print(list(char_to_token.items())[:10])
print(list(token_to_char.items())[:10])

[('\n', 0), (' ', 1), ('!', 2), ('"', 3), ("'", 4), ('(', 5), (')', 6), (',', 7), ('-', 8), ('.', 9)]
[(0, '\n'), (1, ' '), (2, '!'), (3, '"'), (4, "'"), (5, '('), (6, ')'), (7, ','), (8, '-'), (9, '.')]
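
The two dictionaries are meant to be exact inverses of each other. A quick sanity check along the following lines can confirm that. It is a minimal sketch that uses a short stand-in string (a phrase from the text) instead of the full file, so it runs on its own.

```python
# Stand-in text (assumption: a short phrase instead of the full file)
sample = "a good fellow enough"

unique_chars = sorted(set(sample))
char_to_token = {char: i for i, char in enumerate(unique_chars)}
token_to_char = {i: char for i, char in enumerate(unique_chars)}

# Every character must map to an ID that maps back to the same character
round_trip_ok = all(token_to_char[char_to_token[c]] == c for c in unique_chars)

# IDs should be contiguous, starting at 0
ids_contiguous = sorted(token_to_char) == list(range(len(unique_chars)))

print("Round trip OK:", round_trip_ok)
print("IDs contiguous:", ids_contiguous)
```

Because the characters are sorted before being enumerated, the ID assignment is deterministic: the same text always produces the same mapping.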


### Step 4: Create a character-level tokenizer

In [19]:
class CharacterLevelTokenizerv1:
    """Character-level tokenizer whose vocabulary is the set of characters in `text`."""

    def __init__(self, text):
        self.text = text
        self.unique_chars = sorted(set(text))
        self.char_to_token = {char: i for i, char in enumerate(self.unique_chars)}
        self.token_to_char = {i: char for i, char in enumerate(self.unique_chars)}

    def encode(self, text):
        # Raises KeyError for any character that is not in the vocabulary
        return [self.char_to_token[char] for char in text]

    def decode(self, token_ids):
        return ''.join(self.token_to_char[token_id] for token_id in token_ids)


# Test the CharacterLevelTokenizer
tokenizer = CharacterLevelTokenizerv1(text)

# Test encoding
encoded = tokenizer.encode("Hello, world!")
print(encoded)

decoded = tokenizer.decode(encoded)
print(decoded)

[20, 40, 47, 47, 50, 7, 1, 58, 50, 53, 47, 39, 2]
Hello, world!


### Remarks:

The character-level tokenizer has no unknown tokens for the text it was built from: every character in that text is in the vocabulary, so the vocabulary size equals the number of unique characters. A character that never appears in the text is a different matter. It has no entry in char_to_token, so encode raises a KeyError. This text contains no digits, for example, so trying to tokenize a number fails.

In [22]:
# Test a character that is not in the vocabulary

encoded = tokenizer.encode("Hello, 10")
print(encoded)

KeyError: '1'
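
One way to see which characters would trigger this error before calling encode is to compare the input against the vocabulary. The helper below, find_unknown_chars, is a hypothetical name introduced for illustration, and the small vocabulary is a stand-in for the one built from the text.

```python
def find_unknown_chars(text, vocab):
    """Return the sorted unique characters of `text` that are missing from `vocab`."""
    return sorted(set(text) - set(vocab))


# Stand-in vocabulary (assumption: a few letters, comma, and space; no digits)
vocab = set("Helo, wrd")

missing = find_unknown_chars("Hello, 10", vocab)
print(missing)
```

An empty result means the input can be encoded without a KeyError. Anything else lists the characters that need handling.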

### Step 5: Implement a character-level tokenizer with an unknown token

In [29]:
with open(
    "/Users/sadiahzahoor/Desktop/AI Research/LLMs /LLM's from Scratch/the-verdict.txt",
    "r",
) as file:
    text = file.read()

unique_chars = sorted(set(text))
extended_unique_chars = unique_chars + ['<|unk|>']  # The last ID is reserved for unknown characters

print(f"Total number of unique characters: {len(extended_unique_chars)}")
print(f"Last 5 characters of the unique characters: {extended_unique_chars[-5:]}")

Total number of unique characters: 63
Last 5 characters of the unique characters: ['w', 'x', 'y', 'z', '<|unk|>']


In [31]:
# Create a new vocabulary
char_to_token = {char: i for i, char in enumerate(extended_unique_chars)}
token_to_char = {i: char for i, char in enumerate(extended_unique_chars)}

In [34]:
# Create a new tokenizer

class CharacterLevelTokenizerv2:
    """Character-level tokenizer that maps out-of-vocabulary characters to <|unk|>."""

    def __init__(self, text):
        self.text = text
        self.unique_chars = sorted(set(text)) + ['<|unk|>']
        self.char_to_token = {char: i for i, char in enumerate(self.unique_chars)}
        self.token_to_char = {i: char for i, char in enumerate(self.unique_chars)}
        self.unk_id = self.char_to_token['<|unk|>']

    def encode(self, text):
        # Characters missing from the vocabulary fall back to the <|unk|> ID
        return [self.char_to_token.get(char, self.unk_id) for char in text]

    def decode(self, token_ids):
        return ''.join(self.token_to_char[token_id] for token_id in token_ids)

# Test the CharacterLevelTokenizer with Unknown Tokens
tokenizer = CharacterLevelTokenizerv2(text)

# Test encoding
encoded = tokenizer.encode("Hello, 10")
print(encoded)

decoded = tokenizer.decode(encoded)
print(decoded)


[20, 40, 47, 47, 50, 7, 1, 62, 62]
Hello, <|unk|><|unk|>
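
The output above also shows the cost of the unknown token: encoding no longer fails, but decoding cannot recover the original characters. The sketch below makes that loss explicit and counts how many tokens were replaced. It is self-contained and uses a tiny stand-in corpus instead of the file. The function names are assumptions introduced for illustration, not part of the tokenizer classes above.

```python
UNK = "<|unk|>"


def build_vocab(text):
    """Map each unique character to an ID, with the unknown token last."""
    chars = sorted(set(text)) + [UNK]
    return {ch: i for i, ch in enumerate(chars)}


def encode(text, char_to_token):
    unk_id = char_to_token[UNK]
    return [char_to_token.get(ch, unk_id) for ch in text]


def decode(token_ids, char_to_token):
    token_to_char = {i: ch for ch, i in char_to_token.items()}
    return "".join(token_to_char[i] for i in token_ids)


# Stand-in corpus (assumption: contains no digits, like the text in this exercise)
vocab = build_vocab("Hello, world!")

original = "Hello, 10"
ids = encode(original, vocab)
restored = decode(ids, vocab)
unk_count = ids.count(vocab[UNK])

print(restored)
print("Round trip preserved the input:", restored == original)
print(f"{unk_count} of {len(ids)} tokens are unknown")
```

Tracking the share of unknown tokens is a simple way to judge whether the vocabulary covers the inputs well enough. A high share suggests building the vocabulary from a larger or more varied text.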
