<a href="https://colab.research.google.com/github/RaviChandraVeeramachaneni/fastai-huggingface_experiments/blob/main/fastai%2BHF_week2_Tokenizer_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install the Transformers and Datasets libraries to run this notebook.

In [1]:
!pip install -qq transformers[sentencepiece]
!pip install -qq datasets

from transformers import pipeline

Build a tokenizer from scratch

Step1: Download the WikiText-103 dataset (516M of text) and extract it

In [2]:
!wget "https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip"
!unzip wikitext-103-raw-v1.zip

--2021-07-23 23:39:59--  https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.129.240
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.129.240|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 191984949 (183M) [application/zip]
Saving to: ‘wikitext-103-raw-v1.zip’


2021-07-23 23:40:04 (40.0 MB/s) - ‘wikitext-103-raw-v1.zip’ saved [191984949/191984949]

Archive:  wikitext-103-raw-v1.zip
   creating: wikitext-103-raw/
  inflating: wikitext-103-raw/wiki.test.raw  
  inflating: wikitext-103-raw/wiki.valid.raw  
  inflating: wikitext-103-raw/wiki.train.raw  


### Task: Let's build and train a Byte-Pair Encoding (BPE) tokenizer.
    - Start with all the characters present in the training corpus as tokens.
    - Identify the most common pair of adjacent tokens and merge it into one token.
    - Repeat until the vocabulary (i.e., the number of tokens) has reached the size we want.
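
The three steps above can be sketched in plain Python. This is a toy illustration only, not how the `tokenizers` library implements BPE internally; the function name `train_bpe`, the tiny corpus, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Store each distinct word as a tuple of symbols, with its frequency.
    corpus = Counter(tuple(word) for word in words)
    # Step 1: the initial vocabulary is every character in the corpus.
    vocab = sorted({ch for symbols in corpus for ch in symbols})
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        # Assumption: ties go to the lexicographically smallest pair.
        best = max(sorted(pairs), key=pairs.get)
        merged = best[0] + best[1]
        # Rewrite every word, replacing occurrences of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
        merges.append(best)
        vocab.append(merged)
    # Step 3: the loop stops once the requested number of merges is reached.
    return vocab, merges

# Hypothetical toy corpus: word repeated by its frequency.
words = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
vocab, merges = train_bpe(words, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(vocab)   # ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug']
```

Here the stopping rule is a fixed number of merges; a target vocabulary size is the same rule, since each merge adds exactly one token.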

Step2: Import Tokenizer and BPE, then instantiate a tokenizer with a BPE model

In [3]:
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

Step3: To train our tokenizer on the wikitext files, we will need to instantiate a trainer, in this case a BpeTrainer

In [4]:
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

Step4: Use a pre-tokenizer to split the input into words first, so that no token spans more than one word

In [5]:
from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()
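
To see what this splitting looks like, here is a rough plain-Python approximation. The `tokenizers` documentation describes `Whitespace` as splitting on the regex `\w+|[^\w\s]+`; the helper name `pre_tokenize` is made up for this sketch, and Python's `re` may treat some Unicode characters differently from the library.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (punctuation, symbols).
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")

def pre_tokenize(text):
    """Return (piece, (start, end)) pairs, approximating Whitespace()."""
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]

pieces = pre_tokenize("my week-2 learning")
print(pieces)
# [('my', (0, 2)), ('week', (3, 7)), ('-', (7, 8)), ('2', (8, 9)), ('learning', (10, 18))]
```

Note how "week-2" becomes three pieces, so BPE merges can never join "week" with the hyphen or the digit.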

Step5: Training

In [6]:
files = [f"wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
tokenizer.train(files, trainer)

Step6: Save the tokenizer to a single file that contains its full configuration and vocabulary

In [7]:
tokenizer.save("tokenizer-wiki.json")

Step7: Reload the tokenizer from that file

In [8]:
tokenizer = Tokenizer.from_file("tokenizer-wiki.json")

Step8: Use the tokenizer we just trained; the output is an Encoding object

In [9]:
output = tokenizer.encode("This is my week-2 learning from fastAI and hf study group")

Checking the Tokens

In [10]:
print(output.tokens)

['This', 'is', 'my', 'week', '-', '2', 'learning', 'from', 'fast', 'AI', 'and', 'h', 'f', 'study', 'group']


Check the IDs: the ids attribute contains the index of each of those tokens in the tokenizer’s vocabulary

In [11]:
print(output.ids)

[5521, 5031, 5454, 5830, 17, 22, 12018, 5108, 7930, 11571, 5025, 76, 74, 7506, 5733]
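
The relationship between tokens and IDs is a simple table lookup. The sketch below rebuilds a small slice of that table from the two outputs printed above; the dictionary names are made up for illustration, and the real vocabulary is far larger. On the actual tokenizer, `tokenizer.token_to_id` and `tokenizer.id_to_token` do the same lookups.

```python
# Tokens and IDs copied from the outputs printed above.
tokens = ['This', 'is', 'my', 'week', '-', '2', 'learning', 'from',
          'fast', 'AI', 'and', 'h', 'f', 'study', 'group']
ids = [5521, 5031, 5454, 5830, 17, 22, 12018, 5108,
       7930, 11571, 5025, 76, 74, 7506, 5733]

# Partial token -> id table, plus its inverse.
token_to_id = dict(zip(tokens, ids))
id_to_token = {i: t for t, i in token_to_id.items()}

print(token_to_id["learning"])                  # 12018
print([id_to_token[i] for i in ids] == tokens)  # True
```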


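Check the offsets: each entry is the (start, end) character span of a token in the original text. Here we look at the token at index 6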

In [12]:
print(output.offsets[6])

(18, 26)


Map the offsets back to the original text to confirm they point at the right span



In [13]:
sentence = "This is my week-2 learning from fastAI and hf study group"
sentence[18:26]

'learning'
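
The same check can be run for every token at once. The sketch below does not call the tokenizer; it recomputes the spans by scanning the sentence left to right, which only works here because every token appears verbatim in the text (no normalization, no "[UNK]"). The helper name `find_offsets` is made up for this example.

```python
def find_offsets(text, tokens):
    """Locate each token in `text`, left to right, and return (start, end) spans."""
    offsets, cursor = [], 0
    for token in tokens:
        start = text.index(token, cursor)  # raises ValueError if a token is missing
        end = start + len(token)
        offsets.append((start, end))
        cursor = end  # keep searching after the previous token
    return offsets

sentence = "This is my week-2 learning from fastAI and hf study group"
tokens = ['This', 'is', 'my', 'week', '-', '2', 'learning', 'from',
          'fast', 'AI', 'and', 'h', 'f', 'study', 'group']

offsets = find_offsets(sentence, tokens)
# Every span must slice back to exactly its token.
assert all(sentence[s:e] == t for (s, e), t in zip(offsets, tokens))
print(offsets[6])  # (18, 26)
```

With the real Encoding object, the equivalent loop would iterate over `zip(output.tokens, output.offsets)` and compare `sentence[start:end]` with each token.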

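### Step9: Post-Processing Steps
    - Post-processing adds special tokens such as "[CLS]" or "[SEP]" to the encoded output
    - TemplateProcessing is the most commonly used post-processor
    - In the templates, $A and $B stand for the first and second sentence, and ":1" sets the type ID of that piece to 1 (the default is 0)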

In [14]:
from tokenizers.processors import TemplateProcessing

tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

Step10: Encode the same sentence as before and check that the special tokens were added

In [15]:
output = tokenizer.encode("This is my week-2 learning from fastAI and hf study group")
print(output.tokens)

['[CLS]', 'This', 'is', 'my', 'week', '-', '2', 'learning', 'from', 'fast', 'AI', 'and', 'h', 'f', 'study', 'group', '[SEP]']


In [18]:
output = tokenizer.encode("This is my week-2 learning", "from fastAI and hf study group")
print(output.tokens)

['[CLS]', 'This', 'is', 'my', 'week', '-', '2', 'learning', '[SEP]', 'from', 'fast', 'AI', 'and', 'h', 'f', 'study', 'group', '[SEP]']


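Check the type IDs: tokens from the first sentence (including "[CLS]" and its "[SEP]") get 0, tokens from the second sentence and its "[SEP]" get 1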

In [19]:
print(output.type_ids)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
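
The tokens and type IDs printed above follow directly from the two templates. As a rough plain-Python sketch (the function `apply_template` is hypothetical and hard-codes the templates used in this notebook, it is not part of the `tokenizers` API):

```python
def apply_template(tokens_a, tokens_b=None):
    """Mimic "[CLS] $A [SEP]" and "[CLS] $A [SEP] $B:1 [SEP]:1".

    Returns (tokens, type_ids). Pieces marked ":1" get type ID 1.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    type_ids = [0] * len(tokens)
    if tokens_b is not None:
        second = tokens_b + ["[SEP]"]
        tokens += second
        type_ids += [1] * len(second)
    return tokens, type_ids

first = ['This', 'is', 'my', 'week', '-', '2', 'learning']
second = ['from', 'fast', 'AI', 'and', 'h', 'f', 'study', 'group']

pair_tokens, pair_type_ids = apply_template(first, second)
print(pair_tokens)
print(pair_type_ids)
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Calling `apply_template(first)` with a single sentence gives the "[CLS] ... [SEP]" form with all type IDs equal to 0.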
