In [48]:
# import libraries
import re

Loading the data

In [49]:
with open("Data/the-verdict.txt", 'r',encoding='utf-8') as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))

Total number of characters: 20479


In [50]:
# print the first 99 characters
print(raw_text[:99])

I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


### Tokenization

Using some simple example text, we can use the re.split command with the
following syntax to split a text on whitespace characters:

In [51]:
text = "hello, world. This, is a test"
result = re.split(r'(\s)',text)
print(result)

['hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test']
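
The parentheses in the pattern matter: re.split keeps the matched delimiters in the result only when the pattern contains a capturing group. A minimal sketch on the same example text, where the variable names are chosen here for illustration:

```python
import re

sample = "hello, world. This, is a test"

# Without a capturing group, the whitespace delimiters are discarded.
no_group = re.split(r'\s', sample)
print(no_group)

# With a capturing group, each matched delimiter is kept as its own list entry.
with_group = re.split(r'(\s)', sample)
print(with_group)
```

Keeping the delimiters is what later allows punctuation marks to survive as tokens of their own.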


Note that the simple tokenization scheme above mostly works for separating
the example text into individual words. However, some words are still
attached to punctuation characters that we want to have as separate list
entries. We also refrain from making all text lowercase, because capitalization
helps LLMs distinguish between proper nouns and common nouns, understand
sentence structure, and learn to generate text with proper capitalization.

Let's modify the regular expression so that it splits on whitespace (\s),
commas, and periods ([,.]):

In [52]:
result = re.split(r'([,.]|\s)', text)
print(result)

['hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test']


A small remaining issue is that the list still includes whitespace characters
and empty strings. Optionally, we can remove these redundant entries safely as
follows:

In [53]:
result = [item for item in result if item.strip()]
print(result)

['hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test']
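
To see where the empty strings come from and why the filter removes them, here is a minimal sketch on a shorter input (the input string and variable names are assumptions for illustration). When two delimiters are adjacent, such as a comma followed by a space, re.split reports the empty text between them as its own entry:

```python
import re

pieces = re.split(r'([,.]|\s)', "a, b")
print(pieces)

# "".strip() and " ".strip() both return "", which is falsy,
# so empty and whitespace-only entries are dropped by the filter.
kept = [p for p in pieces if p.strip()]
print(kept)
```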


Handle other types of punctuation (,.:?_!"()'--):

In [59]:
text_2 = "Hello, word. Is this-- a test?"
result = re.split(r'([,.:?_!"()\']|--|\s)',text_2)
# remove whitespace-only and empty entries
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'word', '.', 'Is', 'this', '--', 'a', 'test', '?']
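
The split-and-filter steps can be wrapped in a small function so that they are easy to reuse. This is only a sketch: simple_tokenize and TOKEN_PATTERN are hypothetical names introduced here, and the pattern is the one from the cell above.

```python
import re

# Pattern from the cell above: punctuation, double dashes, or one whitespace character.
TOKEN_PATTERN = r'([,.:?_!"()\']|--|\s)'

def simple_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    pieces = re.split(TOKEN_PATTERN, text)
    # Drop empty and whitespace-only entries.
    return [piece.strip() for piece in pieces if piece.strip()]

tokens = simple_tokenize("Hello, word. Is this-- a test?")
print(tokens)
```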


Now that we have a basic tokenizer working, let's apply it to Edith Wharton's
entire short story:

In [61]:
preprocessed = re.split(r'([,.?_!"()\']|--|\s)',raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))

4649


The above print statement outputs 4649, which is the number of tokens in this
text (excluding whitespace).

In [63]:
print(preprocessed[:10])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius']


### Converting tokens into token IDs

In the previous section, we tokenized Edith Wharton's short story and
assigned it to a Python variable called preprocessed. Let's now create a list
of all unique tokens and sort them alphabetically to determine the vocabulary
size:

In [67]:
all_words = sorted(list(set(preprocessed)))
vocab_size = len(all_words)
print(vocab_size)

1159


The line all_words = sorted(list(set(preprocessed))) performs the following
operations:

set(preprocessed): preprocessed is the list of tokens produced in the previous
section. set(...) builds a set from it. A set is an unordered collection of
unique elements, so converting preprocessed to a set removes all duplicate
tokens.

list(...): converts the set back into a list. Sets have no defined order,
whereas lists preserve the order of their elements. Since sorted accepts any
iterable and always returns a list, this step is optional.

sorted(...): sorts the tokens alphabetically. Python compares strings by
Unicode code point, so punctuation and uppercase letters come before lowercase
letters.
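
The same deduplicate-and-sort step can be tried on a toy token list. This is a minimal sketch; the token list is an assumption made up for illustration.

```python
tokens = ["the", "cat", ",", "The", "cat", "."]

# set() removes the duplicate "cat"; sorted() returns a new, ordered list.
unique_sorted = sorted(set(tokens))
print(unique_sorted)
print(len(unique_sorted))
```

The punctuation marks come first, followed by the capitalized "The" and then the lowercase words, which mirrors the ordering of the story's vocabulary.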

After determining that the vocabulary size is 1,159 via the above code, we
create the vocabulary and print its first 52 entries for illustration purposes:

In [78]:
# Creating a vocabulary
vocab = {token:integer for integer, token in enumerate(all_words)}
for i,item in enumerate(vocab.items()):
    print(item)
    if i > 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Carlo;', 25)
('Chicago', 26)
('Claude', 27)
('Come', 28)
('Croft', 29)
('Destroyed', 30)
('Devonshire', 31)
('Don', 32)
('Dubarry', 33)
('Emperors', 34)
('Florence', 35)
('For', 36)
('Gallery', 37)
('Gideon', 38)
('Gisburn', 39)
('Gisburns', 40)
('Grafton', 41)
('Greek', 42)
('Grindle', 43)
('Grindle:', 44)
('Grindles', 45)
('HAD', 46)
('Had', 47)
('Hang', 48)
('Has', 49)
('He', 50)
('Her', 51)


As we can see, based on the output above, the dictionary contains individual
tokens associated with unique integer labels. Our next goal is to apply this
vocabulary to convert new text into token IDs.
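
The vocabulary listing also contains entries such as 'Carlo;' and 'Grindle:'. These arise because the pattern applied to the story does not include ':' or ';', so those characters stay attached to the preceding word. The sketch below compares that pattern with a wider one; wider_pattern, split_tokens, and the sample string are assumptions for illustration, not part of the notebook.

```python
import re

story_pattern = r'([,.?_!"()\']|--|\s)'     # pattern applied to the story above
wider_pattern = r'([,.:;?_!"()\']|--|\s)'   # assumption: same pattern plus ':' and ';'

def split_tokens(text, pattern):
    """Split on the pattern and drop empty or whitespace-only entries."""
    return [p.strip() for p in re.split(pattern, text) if p.strip()]

sample = "Monte Carlo; Mrs. Grindle: yes"
narrow_tokens = split_tokens(sample, story_pattern)
wide_tokens = split_tokens(sample, wider_pattern)
print(narrow_tokens)
print(wide_tokens)
```

Switching to the wider pattern would change the token count, the vocabulary size, and every token ID shown in this notebook, so the outputs would need to be regenerated.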

Let's implement a complete tokenizer class in Python with an encoder method
that splits text into tokens and carries out the string-to-integer mapping to
produce token IDs via the vocabulary. In addition, we implement a decoder
method that carries out the reverse integer-to-string mapping to convert the
token IDs back into text.

In [94]:
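# Implementing a simple text tokenizer
class SimpleTokenizerV1:
    def __init__(self, vocab):
        # Store the token-to-ID mapping and build the inverse ID-to-token mapping
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encoder(self, text):
        # Split on punctuation, double dashes, and whitespace, keeping the delimiters
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        # Drop empty and whitespace-only entries
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Map each token to its ID; a token missing from the vocabulary raises KeyError
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decoder(self, ids):
        # Map IDs back to tokens and join them with single spaces
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that the join inserted before punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text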

Let's instantiate a new tokenizer object from the SimpleTokenizerV1 class
and tokenize a passage from Edith Wharton's short story to try it out in
practice:

In [95]:
# instantiate the tokenizer with the vocabulary as its argument
tokenizer = SimpleTokenizerV1(vocab=vocab)

In [96]:
# creating the test text 
text_3 = """It's the last he painted, you know," Mrs. Gisburn said with pardonable pride. "The last but one," she corrected herself--"but the other doesn't count, because he destroyed it."""

In [97]:
# computing the token IDs 
ids = tokenizer.encoder(text_3)
print(ids)

[58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7, 1, 96, 615, 246, 745, 5, 1, 901, 298, 551, 6, 1, 246, 1013, 751, 363, 2, 995, 301, 5, 211, 541, 337, 596, 7]


Next, let's see if we can turn these token IDs back into text using the decoder
method:

In [98]:
print(tokenizer.decoder(ids))

It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride." The last but one," she corrected herself --" but the other doesn' t count, because he destroyed it.


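Based on the output above, we can see that the decoder method converted the
token IDs back into text that contains every token of the original passage.
The spacing is not restored exactly, though: the substitution in decoder only
removes whitespace before punctuation, so a space remains after apostrophes
and around some quotation marks and double dashes (for example, It' s and
herself --" but).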
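
The round trip can also be checked programmatically on a toy vocabulary. The sketch below repeats a compact copy of the class so that it runs on its own; the toy vocabulary, the sample sentence, and the variable names are assumptions for illustration. It compares the decoded string with the input and checks what happens when a word is missing from the vocabulary.

```python
import re

class SimpleTokenizerV1:
    # Compact copy of the class defined above, repeated so this sketch is self-contained.
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encoder(self, text):
        pieces = re.split(r'([,.?_!"()\']|--|\s)', text)
        pieces = [p.strip() for p in pieces if p.strip()]
        return [self.str_to_int[s] for s in pieces]

    def decoder(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

# Toy vocabulary (an assumption for illustration, not the story's vocabulary).
toy_tokens = sorted(set(["It", "'", "s", "the", "last", ",", "."]))
toy_vocab = {token: idx for idx, token in enumerate(toy_tokens)}
toy_tokenizer = SimpleTokenizerV1(toy_vocab)

original = "It's the last, the last."
round_trip = toy_tokenizer.decoder(toy_tokenizer.encoder(original))
print(round_trip)
print(round_trip == original)  # False: a space remains after the apostrophe

# A word that is not in the vocabulary cannot be encoded.
try:
    toy_tokenizer.encoder("Hello")
    unknown_ok = True
except KeyError:
    unknown_ok = False
print(unknown_ok)
```

Both effects are limitations of this first version: the decoder does not reproduce the original spacing around apostrophes, and the encoder only handles tokens that occurred in the text used to build the vocabulary.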