# Loading a Short Story as a Text Sample in Python

**Step 1: Create Tokens**

In [7]:
with open("/content/drive/MyDrive/llm/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print('Number of characters : ', len(raw_text))
# Print the first 99 characters (the slice [:99] stops before index 99)
print(raw_text[:99])

Number of characters :  20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


- Tokenize the full text (20,479 characters) into individual words and special characters; these tokens are later turned into embeddings.

- Split the text into a list of tokens on whitespace and special characters.

In [8]:
import re # Regular expression

text_test = "Hello, world. this, is a test!"
result = re.split(r'([,.!]|\s)', text_test)

print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'this', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '!', '']
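The parentheses in the pattern matter here: `re.split` only returns the delimiters when the pattern contains a capturing group. The short comparison below is a sketch on the same sample sentence; the variable names `without_group` and `with_group` are introduced only for illustration.

```python
import re

sample = "Hello, world. this, is a test!"

# No capturing group: matched delimiters are discarded from the result.
without_group = re.split(r'[,.!]|\s', sample)

# Capturing group: every matched delimiter is returned as its own list item.
with_group = re.split(r'([,.!]|\s)', sample)

print(without_group)
print(with_group)
```

Without the group the punctuation is lost entirely, which is why the tokenizer keeps the pattern wrapped in `(...)` and filters out the unwanted items afterwards.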


- Remove the redundant empty strings and whitespace-only items safely

In [9]:
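# Keep only items that are non-empty after stripping whitespace.
# This drops '' and ' ' as well as any other whitespace-only item,
# unlike the narrower check: item not in ['', ' ']
result = [item for item in result if item.strip()]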
print(result)

['Hello', ',', 'world', '.', 'this', ',', 'is', 'a', 'test', '!']


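- Here we remove the whitespace tokens because this simple text sample does not depend on them. For text where spacing carries meaning (for example, indented code), keeping them would be the safer choice.

- Full process: split on a larger set of punctuation, then filter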

In [10]:
text_test = 'I HAD always! thought Jack Gisburn rather? a cheap genius--though a good fellow enough--so it was no! '
# Tokenization scheme: split on punctuation, double dashes, and whitespace
result = re.split(r'([.,:;!_?"]|--|\s)', text_test)
result = [i for i in result if i.strip()]
print(result)

['I', 'HAD', 'always', '!', 'thought', 'Jack', 'Gisburn', 'rather', '?', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', '!']
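The split-then-filter steps above can be wrapped in a small helper so the same scheme is reused with different patterns. This is a minimal sketch; the function name `tokenize` and its `pattern` parameter are assumptions made for illustration, not part of the notebook.

```python
import re

def tokenize(text, pattern=r'([.,:;!_?"]|--|\s)'):
    """Split text on the delimiter pattern and drop empty or whitespace-only items."""
    parts = re.split(pattern, text)
    return [part for part in parts if part.strip()]

sample = 'I HAD always! thought Jack Gisburn rather? a cheap genius--though a good fellow enough--so it was no! '
tokens = tokenize(sample)
print(tokens)
print(len(tokens))
```

Passing a different `pattern` (for example one that also includes parentheses and the apostrophe) changes which characters become separate tokens without touching the filtering step.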


- Apply the tokenizer to the whole story (the pattern now also splits on parentheses and apostrophes)

In [11]:
preprocessed = re.split(r'([,.:;_!?"()\']|--|\s)', raw_text)
preprocessed = [item for item in preprocessed if item.strip()]
print(preprocessed[:20])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was']
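The pattern used on the story treats the apostrophe and parentheses as delimiters. One consequence worth checking is that contractions are broken into three tokens. The sketch below uses a made-up sentence (not taken from the story) to show the effect.

```python
import re

# Same pattern as the cell above.
pattern = r'([,.:;_!?"()\']|--|\s)'

# Hypothetical sample sentence, chosen to contain an apostrophe and parentheses.
sample = "It's here (really)."

tokens = [t for t in re.split(pattern, sample) if t.strip()]
print(tokens)
# ['It', "'", 's', 'here', '(', 'really', ')', '.']
```

Whether splitting `It's` into `It`, `'`, `s` is acceptable depends on the use case; for this small vocabulary it keeps the rule set simple.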


In [12]:
print('Total number of tokens : ', len(preprocessed))

Total number of tokens :  4690


**Step 2: Create Token IDs**

- Build the set of unique tokens and sort it alphabetically; its length is the vocabulary size

In [13]:
all_words = sorted(set(preprocessed))
print(all_words[:15])
print(len(all_words))

['!', '"', "'", '(', ')', ',', '--', '.', ':', ';', '?', 'A', 'Ah', 'Among', 'And']
1130
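The ordering in the output above follows from how Python compares strings: by Unicode code point, so punctuation sorts before uppercase letters, which sort before lowercase letters. A small sketch on a made-up token list (an assumption for illustration) shows both the deduplication and the ordering.

```python
# Hypothetical token list containing a duplicate and mixed case.
tokens = ['the', 'cat', ',', 'The', 'dog', '!', 'the']

# set() removes the duplicate 'the'; sorted() orders by code point.
unique_sorted = sorted(set(tokens))

print(unique_sorted)       # ['!', ',', 'The', 'cat', 'dog', 'the']
print(len(unique_sorted))  # 6
```

Because case is preserved, `The` and `the` count as two different vocabulary entries.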


- Create the vocabulary itself: a dictionary that maps each unique token to an integer ID

In [14]:
vocab = {token : i for i, token in enumerate(all_words)}


In [15]:
for i, v in enumerate(vocab.items()):
  if i > 25:
    break
  print(v)

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)


- Implement a tokenizer class

In [30]:
class SimpleTokenizer:
  def __init__(self):
    # Uses the global `vocab` built above (token -> ID)
    self.str_to_int = vocab
    # Inverse mapping (ID -> token) for decoding
    self.int_to_str = {idx: tok for tok, idx in vocab.items()}

  def encoder(self, txt):
    # Same split-then-filter scheme that was used to build the vocabulary
    processed = re.split(r'([.,;?()!_:\'"]|--|\s)', txt)
    processed = [item.strip() for item in processed if item.strip()]

    # Raises KeyError for any token that is not in the vocabulary
    ids = [self.str_to_int[t] for t in processed]
    return ids

  def decoder(self, ids):
    txt = " ".join([self.int_to_str[idx] for idx in ids])
    # Remove the whitespace that join() put before punctuation marks
    txt = re.sub(r'\s+([,."\'?!()])', r'\1', txt)
    return txt

token = SimpleTokenizer()
print(token.encoder('Ah At Among !'))
print(token.decoder(token.encoder('Ah At Among !')))


[12, 18, 13, 0]
Ah At Among!
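The encoder above looks tokens up directly in the vocabulary, so any word that does not occur in the story raises a `KeyError`. One common way to handle this is a fallback token for unseen words. The sketch below is an assumption, not part of the notebook: the class name `SimpleTokenizerWithUnk`, the `<|unk|>` token string, and the four-entry `toy_vocab` are all hypothetical and chosen only to keep the example self-contained.

```python
import re

class SimpleTokenizerWithUnk:
  """Hypothetical variant: unseen tokens map to a fallback ID instead of raising KeyError."""

  def __init__(self, vocab, unk_token="<|unk|>"):
    # Copy so the caller's vocabulary is left untouched.
    self.str_to_int = dict(vocab)
    # Assumption: existing IDs are 0..len(vocab)-1, so len() is the next free ID.
    self.str_to_int.setdefault(unk_token, len(self.str_to_int))
    self.unk_id = self.str_to_int[unk_token]
    self.int_to_str = {idx: tok for tok, idx in self.str_to_int.items()}

  def encoder(self, txt):
    parts = re.split(r'([.,;?()!_:\'"]|--|\s)', txt)
    tokens = [p for p in parts if p.strip()]
    # dict.get() falls back to the unknown-token ID for unseen tokens.
    return [self.str_to_int.get(t, self.unk_id) for t in tokens]

  def decoder(self, ids):
    txt = " ".join(self.int_to_str[idx] for idx in ids)
    return re.sub(r'\s+([,."\'?!()])', r'\1', txt)

# Toy vocabulary for illustration only (not the story's vocabulary).
toy_vocab = {'!': 0, 'Ah': 1, 'Among': 2, 'At': 3}
tok = SimpleTokenizerWithUnk(toy_vocab)

ids = tok.encoder('Ah Hello At !')
print(ids)               # [1, 4, 3, 0]
print(tok.decoder(ids))  # Ah <|unk|> At!
```

The trade-off is that decoding is no longer a faithful round trip: every unseen word comes back as the same placeholder.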
