### Section 2.1

Section 2.1 gives a brief summary of what word embeddings are, for readers who are new to the topic or want a quick recap.

### Section 2.2 (Tokenizing text)


The coding starts in section 2.2, which explains how to tokenize text.
We download the full text of 'The Verdict', a short story available on Wikisource (the cell below fetches a copy from the LLMs-from-scratch GitHub repository).


In [1]:
import os
import urllib.request

if not os.path.exists("the-verdict.txt"):
    url = ("https://raw.githubusercontent.com/rasbt/"
           "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
           "the-verdict.txt")
    file_path = "the-verdict.txt"
    urllib.request.urlretrieve(url, file_path)




The file is short, just over 20,000 characters, which should be sufficient for building a basic model. The goal is to tokenize every word in this document as a first step in the model-building process.

In [2]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total number of characters:", len(raw_text))

Total number of characters: 20479


We split the text on punctuation, double dashes, and whitespace, then drop the whitespace-only pieces. What remains is the list of 'tokens': words and punctuation characters.

In [3]:
import re

preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
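
To see what the split does, here is a minimal sketch that applies the same regular expression to a short made-up sentence (the `sample` string is an assumption for illustration, not part of the story). Because the pattern is wrapped in a capturing group, `re.split` keeps the punctuation as separate items instead of discarding it.

```python
import re

# Hypothetical sample sentence, used only to illustrate the split
sample = "Hello, world. Is this-- a test?"

# Same pattern as above: punctuation, double dashes, or whitespace
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Drop empty strings and whitespace-only items
tokens = [p.strip() for p in pieces if p.strip()]
print(tokens)
```

Note that this scheme treats every punctuation mark as its own token and throws away whitespace, so the original spacing cannot be recovered exactly.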

### Section 2.3 (Converting tokens into token IDs)

We build the 'vocabulary' by collecting the unique tokens and sorting them.

In [4]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print("Total number of unique tokens:", vocab_size)

Total number of unique tokens: 1130


We then assign a unique integer ID to every token, based on its position in the sorted list.

In [5]:
vocab = {token:integer for integer,token in enumerate(all_words)}
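
As a small illustration of the mapping, the sketch below builds a vocabulary from a hypothetical six-token list (`toy_tokens` is an assumption, not taken from the story). Duplicates collapse into one entry, and punctuation sorts before letters because of its lower code points.

```python
# Hypothetical token list with a repeated word
toy_tokens = ["the", "cat", ",", "the", "dog", "."]

# Unique tokens in sorted order, each paired with its index
toy_vocab = {token: integer for integer, token in enumerate(sorted(set(toy_tokens)))}

for token, integer in toy_vocab.items():
    print(integer, repr(token))
```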

- We create a `Tokenizer` class with two primary methods:
  - **`encode`**: Splits text into tokens and converts each token into its token ID, returning a list of IDs.
  - **`decode`**: Converts a list of token IDs back into text, removing the spaces that the join would otherwise leave before punctuation.
- This class serves as a foundational utility for tokenizing text and reconstructing it from token IDs.


In [6]:
class Tokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [7]:
tokenizer = Tokenizer(vocab)

text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""

In [8]:
ids = tokenizer.encode(text)
print(ids)
tokenizer.decode(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'
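
One limitation of this tokenizer is that `encode` looks up every token with `self.str_to_int[s]`, so any token missing from the vocabulary raises a `KeyError`. A minimal sketch of that failure, assuming a hypothetical four-entry vocabulary rather than the one built from the story:

```python
# Hypothetical tiny vocabulary; "Hello" is deliberately absent
toy_vocab = {",": 0, ".": 1, "he": 2, "painted": 3}
tokens = ["Hello", ",", "he", "painted", "."]

missing = None
try:
    ids = [toy_vocab[t] for t in tokens]
except KeyError as err:
    # The KeyError carries the token that could not be found
    missing = err.args[0]
    print("Not in vocabulary:", missing)
```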

### Section 2.4 (Adding special context tokens)

We add two special tokens to the vocabulary: `<|unk|>`, which stands in for any word that is not part of the vocabulary, and `<|endoftext|>`, which separates two unrelated text sources.

In [9]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

The updated tokenizer replaces any token that is missing from the vocabulary with `<|unk|>` before looking up IDs:

In [10]:
class Tokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int 
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text
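
The sketch below shows how the two special tokens behave in practice. It is self-contained, so it re-creates a compact tokenizer under the hypothetical name `ToyTokenizer` with a made-up eight-entry vocabulary. It uses `dict.get` with a default instead of the list comprehension above, which should give the same result.

```python
import re

class ToyTokenizer:  # hypothetical stand-in for the updated Tokenizer
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        pieces = [p.strip() for p in pieces if p.strip()]
        unk_id = self.str_to_int["<|unk|>"]
        # Unknown tokens fall back to the <|unk|> ID
        return [self.str_to_int.get(p, unk_id) for p in pieces]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that the join leaves before punctuation
        return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

# Made-up vocabulary: six regular tokens plus the two special tokens
toy_words = sorted({"the", "last", "he", "painted", ",", "."})
toy_words.extend(["<|endoftext|>", "<|unk|>"])
toy_vocab = {token: integer for integer, token in enumerate(toy_words)}

tok = ToyTokenizer(toy_vocab)

# Two unrelated texts joined by the separator token
text1 = "he painted the last."
text2 = "Hello, the last."
joined = " <|endoftext|> ".join((text1, text2))

ids = tok.encode(joined)
decoded = tok.decode(ids)
print(ids)
print(decoded)
```

The word "Hello" is not in the toy vocabulary, so it comes back as `<|unk|>` after decoding. The original word is lost, which is one motivation for moving to a subword scheme.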

### Section 2.5 (Byte pair encoding)

Byte pair encoding (BPE) is a subword tokenization algorithm that splits text into units sitting between whole words and individual characters. Because a rare or unseen word can be broken down into smaller pieces that are already in the vocabulary, BPE can encode it without falling back to an `<|unk|>` token. This is why it is used in language models such as GPT-2 and GPT-3. (BERT uses the related WordPiece algorithm instead.)
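
To make the idea concrete, here is a simplified sketch of the core training step: find the most frequent adjacent pair of symbols and merge it into one new symbol. It works on characters rather than bytes, and the helper names `most_frequent_pair` and `merge_pair` are hypothetical. It is not how `tiktoken` or GPT-2 implement BPE, only an illustration of the merge principle.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the most common adjacent pair in a list of symbols."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Start from individual characters of a made-up training string
symbols = list("low lower lowest")

# Two merge rounds are enough to form the shared subword "low"
for _ in range(2):
    pair = most_frequent_pair(symbols)
    symbols = merge_pair(symbols, pair)

print(symbols)
```

After two merges the three words share the subword `low`, while the less frequent endings remain as single characters. A real BPE vocabulary is built by repeating this step many thousands of times over a large corpus.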

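We use `tiktoken`, OpenAI's open-source BPE tokenizer library, rather than implementing BPE ourselves. (It is reportedly about 5 times faster than comparable open-source tokenizers! 😮)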

In [11]:
!python --version

Python 3.7.6


In [None]:
!conda install python=3.9

Collecting package metadata (current_repodata.json): done
Solving environment: | 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - defaults/osx-64::bokeh==1.4.0=py37_0
  - defaults/noarch::flask==1.1.1=py_0
  - defaults/osx-64::spyder==4.0.1=py37_0
  - defaults/noarch::sphinx==2.4.0=py_0
  - defaults/noarch::pytest-astropy==0.8.0=py_0
  - defaults/osx-64::astropy==4.0=py37h1de35cc_0
  - pytorch/osx-64::torchaudio==0.9.0=py37
  - defaults/osx-64::pytest-arraydiff==0.3=py37h39e3cac_0
  - defaults/osx-64::pytest==5.3.5=py37_0
  - pytorch/osx-64::torchvision==0.10.0=py37_cpu
  - defaults/noarch::pytest-astropy-header==0.1.2=py_0
  - defaults/noarch::pytest-doctestplus==0.5.0=py_0
  - defaults/osx-64::distributed==2.11.0=py37_0
  - defaults/noarch::conda-verify==3.4.2=py_1
  - defaults/noarch::pytest-openfiles==0.4.0=py_0
  - defaults/osx-64::pytest-remotedata==0.3.2=py37_0
  - pytorch/osx-64::pytorch==1.9.0=

In [12]:
!pip install tiktoken

ERROR: Ignored the following versions that require a different python version: 0.1.1 Requires-Python >=3.9; 0.1.2 Requires-Python >=3.8; 0.2.0 Requires-Python >=3.8; 0.3.0 Requires-Python >=3.8; 0.3.1 Requires-Python >=3.8; 0.3.2 Requires-Python >=3.8; 0.3.3 Requires-Python >=3.8; 0.4.0 Requires-Python >=3.8; 0.5.0 Requires-Python >=3.8; 0.5.1 Requires-Python >=3.8; 0.5.2 Requires-Python >=3.8; 0.6.0 Requires-Python >=3.8; 0.7.0 Requires-Python >=3.8; 0.8.0 Requires-Python >=3.9
ERROR: Could not find a version that satisfies the requirement tiktoken (from versions: none)
ERROR: No matching distribution found for tiktoken

In [13]:
import tiktoken
import importlib.metadata  # a bare `import importlib` does not guarantee the submodule is loaded

print("tiktoken version:", importlib.metadata.version("tiktoken"))

ModuleNotFoundError: No module named 'tiktoken'
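
The import fails because the install never succeeded: every `tiktoken` release listed in the pip output requires Python 3.8 or newer, and this kernel runs Python 3.7.6. A minimal sketch of a pre-flight check, assuming a minimum of 3.8 taken from those `Requires-Python` values (the newest release listed asks for 3.9):

```python
import sys
import importlib.util

# Assumption: lowest Requires-Python value shown in the pip output
MIN_PYTHON = (3, 8)

python_ok = sys.version_info >= MIN_PYTHON
# find_spec returns None when a top-level package cannot be found
tiktoken_installed = importlib.util.find_spec("tiktoken") is not None

if not python_ok:
    print("Python %d.%d is too old for tiktoken; upgrade the environment first."
          % sys.version_info[:2])
elif not tiktoken_installed:
    print("tiktoken is not installed; try `pip install tiktoken`.")
else:
    print("tiktoken is available.")
```

Running a check like this before the install cell makes the cause of the failure visible up front, instead of surfacing later as a `ModuleNotFoundError`.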