<a href="https://colab.research.google.com/github/Swyam17/LLM/blob/main/LlmTokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**S1**: CREATING TOKENS

In [None]:
with open("/content/the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 20781
THE VERDICT
June 1908

I had always thought Jack Gisburn rather a cheap genius--though a

good fell


In [None]:
import re
text = "Hello, world. This , is a test."
result = re.split(r'(\s)', text)  # split on whitespace; the capturing group keeps each whitespace character as its own item
print(result)

['Hello,', ' ', 'world.', ' ', 'This', ' ', ',', ' ', 'is', ' ', 'a', ' ', 'test.']


In [None]:
import re
text = "Hello, world. This , is a test."
result = re.split(r'([,.]|\s)', text)  # also split on commas and periods so each becomes a separate token
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ' ', '', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


In [None]:
result = [item for item in result if item.strip()]  # drop empty strings and whitespace-only items
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


REMOVING WHITESPACES OR NOT


When developing a simple tokenizer, whether we should encode whitespaces as
separate characters or just remove them depends on our application and its
requirements. Removing whitespaces reduces the memory and computing
requirements. However, keeping whitespaces can be useful if we train models that
are sensitive to the exact structure of the text (for example, Python code, which is
sensitive to indentation and spacing). Here, we remove whitespaces for simplicity
and brevity of the tokenized outputs. Later, we will switch to a tokenization scheme
that includes whitespaces.
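
As a rough illustration of this trade-off, the sketch below tokenizes one indented line of Python code in both ways. The sample line `code_line` is hypothetical and not part of the notebook's data. Keeping the whitespace tokens preserves the indentation, while dropping them gives a shorter token list.

```python
import re

# Hypothetical line of Python source with 4-space indentation
code_line = "    return x + 1"

# Split on whitespace; the capturing group keeps each whitespace character
pieces = re.split(r'(\s)', code_line)

# Option A: keep whitespace tokens, dropping only the empty strings re.split produces
with_ws = [p for p in pieces if p != ""]

# Option B: drop whitespace tokens too, as done in this notebook
without_ws = [p for p in pieces if p.strip()]

print(with_ws)
print(without_ws)

# Only option A can reconstruct the original line exactly
print("".join(with_ws) == code_line)
```

With option A the original line can be rebuilt exactly by joining the tokens. With option B the indentation is lost.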

TOKENIZATION

The tokenization scheme devised above works well on the simple sample text. Let's
extend it so that it can also handle other types of punctuation, such as
question marks, quotation marks, and the double dashes seen earlier in the opening
characters of Edith Wharton's short story, along with additional special characters:

In [None]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item for item in result if item.strip()]  # drop empty strings and whitespace-only items
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['THE', 'VERDICT', 'June', '1908', 'I', 'had', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to']


In [None]:
print(len(preprocessed))

4667


**S2**: CREATING TOKEN IDS

In [None]:
all_words = sorted(set(preprocessed))  # unique tokens (words and punctuation) in sorted order
vocab_size = len(all_words)  # number of unique tokens
print(vocab_size)

1148


In [None]:
# map each unique token to an integer ID
vocab = {token: integer for integer, token in enumerate(all_words)}

In [None]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
('1908', 8)
(':', 9)
(';', 10)
('?', 11)
('A', 12)
('AM', 13)
('Ah', 14)
('Among', 15)
('And', 16)
('Are', 17)
('Arrt', 18)
('As', 19)
('At', 20)
('Be', 21)
('Begin', 22)
('Burlington', 23)
('But', 24)
('By', 25)
('Carlo', 26)
('Chicago', 27)
('Claude', 28)
('Come', 29)
('Croft', 30)
('Destroyed', 31)
('Devonshire', 32)
('Don', 33)
('Dubarry', 34)
('Emperors', 35)
('End', 36)
('FELT', 37)
('Florence', 38)
('For', 39)
('Gallery', 40)
('Gideon', 41)
('Gisburn', 42)
('Gisburns', 43)
('Grafton', 44)
('Greek', 45)
('Grindle', 46)
('Grindles', 47)
('HAD', 48)
('HAS', 49)
('HAVE', 50)


ENCODING AND DECODING

Step 1: Store the vocabulary as a class attribute so the encode and decode methods can access it

Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens

Step 3: Process input text into token IDs

Step 4: Convert token IDs back into text

Step 5: Remove spaces before the specified punctuation

In [None]:
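class SimpleTokenzerV1:
  def __init__(self, vocab):
    # vocab maps token strings to integer IDs
    self.str_to_int = vocab
    # inverse mapping: integer ID -> token string
    self.int_to_str = {i: s for s, i in vocab.items()}

  def encode(self, text):
    # use the same split pattern that built the vocabulary (including ';')
    preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    # strip whitespace and drop empty items
    preprocessed = [
        item.strip() for item in preprocessed if item.strip()
    ]
    # look up each token's ID; a token missing from the vocabulary raises KeyError
    ids = [self.str_to_int[s] for s in preprocessed]
    return ids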

  def decode(self, ids):
    # join the tokens with single spaces
    text = " ".join([self.int_to_str[i] for i in ids])
    # remove the spaces before punctuation marks
    text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
    return text

Convert text to token IDs:

In [None]:
tokenizer = SimpleTokenzerV1(vocab)
text = """"It's the last he painted, you know,"
          Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 62, 2, 868, 1006, 619, 550, 764, 5, 1144, 614, 5, 1, 76, 7, 42, 869, 1126, 772, 812, 7]


Convert token IDs back to text:

In [None]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

ADDING SPECIAL TOKENS

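We need special tokens because the input may contain words that are not in the vocabulary, and the tokenizer must be able to handle them.

We use SimpleTokenzerV2, which supports two new tokens: <|unk|> for unknown words and <|endoftext|> for marking the boundary between unrelated texts.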
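
To see why a fallback token is needed, here is a minimal sketch with a toy vocabulary. The names `toy_vocab`, `tokens`, `unk_id`, and `safe_ids` are made up for illustration and are much smaller than the notebook's real vocabulary. Looking tokens up by direct indexing, as SimpleTokenzerV1 does, fails on an unseen word, while falling back to the <|unk|> ID does not.

```python
# Toy vocabulary (hypothetical, for illustration only)
toy_vocab = {"do": 0, "like": 1, "tea": 2, "you": 3, "<|unk|>": 4}

tokens = ["do", "you", "like", "coffee"]  # "coffee" is not in toy_vocab

# Direct indexing raises KeyError on the unseen token
try:
    ids = [toy_vocab[t] for t in tokens]
except KeyError as err:
    print("KeyError for token:", err)

# Falling back to the <|unk|> ID keeps encoding from failing
unk_id = toy_vocab["<|unk|>"]
safe_ids = [toy_vocab.get(t, unk_id) for t in tokens]
print(safe_ids)
```

The cost of the fallback is that the original word cannot be recovered when decoding, because every unknown word maps to the same ID.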

In [None]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|unk|>", "<|endoftext|>"])
vocab = {token: integer for integer, token in enumerate(all_tokens)}

In [None]:
len(vocab.items())

1150

The vocabulary size has increased by 2, from 1148 to 1150.

In [None]:
# print the last 5 entries of the updated vocabulary
for item in list(vocab.items())[-5:]:
    print(item)

('younger', 1145)
('your', 1146)
('yourself', 1147)
('<|unk|>', 1148)
('<|endoftext|>', 1149)


In [None]:
class SimpleTokenzerV2:
  def __init__(self, vocab):
    self.str_to_int = vocab
    self.int_to_str = {i:s for s,i in vocab.items()}

  def encode(self, text):
    # use the same split pattern that built the vocabulary (including ';')
    preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    preprocessed = [item.strip() for item in preprocessed if item.strip()]
    # replace any token that is not in the vocabulary with the <|unk|> token
    preprocessed = [
        item if item in self.str_to_int else "<|unk|>"
        for item in preprocessed
    ]
    ids = [self.str_to_int[s] for s in preprocessed]
    return ids

  def decode(self, ids):
    text = " ".join([self.int_to_str[i] for i in ids])  # join the tokens with single spaces
    # remove the spaces before punctuation marks and double dashes
    text = re.sub(r'\s+([,.?!"()\']|--)', r'\1', text)
    return text

In [None]:
tokenizer = SimpleTokenzerV2(vocab)
text1 = "Swyam, do you like tea?"
text2 = "In the sunlit terraces of palace."

# join the two unrelated texts with the <|endoftext|> separator
text = "<|endoftext|> ".join([text1, text2])
print(text)

Swyam, do you like tea?<|endoftext|> In the sunlit terraces of palace.


In [None]:
tokenizer.encode(text)

[1148, 5, 372, 1144, 645, 993, 11, 1149, 61, 1006, 974, 1002, 740, 1148, 7]

In [None]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of <|unk|>.'

Other special tokens that some tokenizers use:

[BOS] (beginning of sequence): This token marks the start of a text. It
signifies to the LLM where a piece of content begins.

[EOS] (end of sequence): This token is positioned at the end of a text,
and is especially useful when concatenating multiple unrelated texts,
similar to <|endoftext|>. For instance, when combining two different
Wikipedia articles or books, the [EOS] token indicates where one article
ends and the next one begins.

[PAD] (padding): When training LLMs with batch sizes larger than one,
the batch might contain texts of varying lengths. To ensure all texts have
the same length, the shorter texts are extended or "padded" using the
[PAD] token, up to the length of the longest text in the batch.
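
The padding step can be sketched in a few lines of plain Python. The token IDs in `batch` and the choice of `pad_id = 0` are assumptions made up for this example. In practice the [PAD] ID is whatever the vocabulary assigns to that token.

```python
# Hypothetical token-ID sequences of different lengths (values are made up)
batch = [[5, 12, 7], [8, 3], [4, 9, 2, 6, 1]]
pad_id = 0  # assumed ID reserved for the [PAD] token

# Length of the longest sequence in the batch
max_len = max(len(seq) for seq in batch)

# Extend each shorter sequence with pad_id up to max_len
padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

for row in padded:
    print(row)
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular array.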