<a href="https://colab.research.google.com/github/Satwikram/NLP-Implementations/blob/main/Preprocessing/Building%20WordPiece%20BERT%20Tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Author: Satwik Ram K

### Setup

In [None]:
!pip install datasets
!pip install tokenizers
!pip install transformers

### Imports

In [36]:
import datasets
from datasets import load_dataset

import os
from tqdm.auto import tqdm
import re
from pathlib import Path
from tokenizers import BertWordPieceTokenizer

from transformers import BertTokenizer

In [4]:
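# count the datasets on the Hugging Face Hub (recent `datasets` releases
# deprecate this helper in favor of huggingface_hub.list_datasets)
len(datasets.list_datasets())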

2664

### Downloading Dataset

In [5]:
dataset = load_dataset(
    "imdb",
    "plain_text",
    split="train[:5000]"
)


Downloading and preparing dataset imdb/plain_text (download: 80.23 MiB, generated: 127.02 MiB, post-processed: Unknown size, total: 207.25 MiB) to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...


Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.


In [6]:
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 5000
})

In [7]:
os.makedirs("./imdb", exist_ok=True)

file_count = 0

for sample in tqdm(dataset):
    # collapse runs of whitespace (including newlines) into single spaces
    text = re.sub(r"\s+", " ", sample["text"])

    # write each review to its own file, so 5,000 reviews give 5,000 files
    with open(f'./imdb/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
        fp.write(text)
    file_count += 1


In [24]:
paths = [str(x) for x in Path('/content/imdb/').rglob('*.txt')]
paths[:5]

['/content/imdb/text_3437.txt',
 '/content/imdb/text_23.txt',
 '/content/imdb/text_1735.txt',
 '/content/imdb/text_2828.txt',
 '/content/imdb/text_234.txt']
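
The loop above produces one small file per review. If fewer, larger files are preferred, the reviews can be grouped before writing. The following is a minimal sketch of that idea, not the notebook's method: `write_in_batches`, `_flush`, and the `batch_size` value are hypothetical names chosen for illustration, and the sample texts are made up.

```python
import re
import tempfile
from pathlib import Path


def _flush(batch, out_dir, index):
    """Write one batch to text_<index>.txt, one text per line."""
    path = out_dir / f"text_{index}.txt"
    path.write_text("\n".join(batch), encoding="utf-8")
    return path


def write_in_batches(texts, out_dir, batch_size=1000):
    """Write texts to out_dir, batch_size texts per file (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    batch = []
    for text in texts:
        # Collapse whitespace so each text stays on a single line.
        batch.append(re.sub(r"\s+", " ", text))
        if len(batch) == batch_size:
            written.append(_flush(batch, out_dir, len(written)))
            batch = []
    if batch:  # write the final, possibly smaller, batch
        written.append(_flush(batch, out_dir, len(written)))
    return written


# Usage with made-up sample texts; swap in the reviews from the dataset.
demo_dir = Path(tempfile.mkdtemp())
demo_paths = write_in_batches(["a  b", "c\nd", "e"], demo_dir, batch_size=2)
print([p.name for p in demo_paths])  # ['text_0.txt', 'text_1.txt']
```

Either layout works with `tokenizer.train(files=...)`, since it only needs a list of text file paths.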

### Building WordPiece BERT Tokenizer

There are a few important arguments to take note of here. During initialization we have:

**clean_text**: cleans text by removing control characters and replacing all whitespace with spaces.

**handle_chinese_chars**: whether the tokenizer puts spaces around Chinese characters (if found in the dataset).

**strip_accents**: whether we remove accents; when True this will make é → e, ô → o, etc.

**lowercase**: if True the tokenizer will view capital and lowercase characters as equal; A == a, B == b, etc.
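
To make the normalization options concrete, here is a rough standard-library approximation of what three of them do to a string. It is a sketch for intuition only, not the library's implementation: the `normalize` function is a hypothetical helper, and `handle_chinese_chars` is left out.

```python
import unicodedata


def normalize(text, clean_text=True, strip_accents=True, lowercase=True):
    """Rough stdlib approximation of the options above (hypothetical helper)."""
    if clean_text:
        chars = []
        for ch in text:
            if ch.isspace():
                chars.append(" ")  # every whitespace character becomes a plain space
            elif unicodedata.category(ch).startswith("C"):
                continue  # drop control characters
            else:
                chars.append(ch)
        text = "".join(chars)
    if strip_accents:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    if lowercase:
        text = text.lower()
    return text


print(normalize("Café\tCrème\x00"))  # cafe creme
print(normalize("Café", strip_accents=False, lowercase=False))  # Café
```

With all three options switched off, as in the initialization cell further down, the text reaches the tokenizer unchanged, so "Café" and "cafe" end up as different vocabulary entries.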

And during training, we use:

**vocab_size**: the maximum number of tokens in the vocabulary; training can stop below this limit if the corpus does not yield that many tokens. During later tokenization of text, words that cannot be built from the vocabulary are assigned an [UNK] token, which loses information. We should try to minimize this when possible.

**min_frequency**: minimum frequency for a pair of tokens to be merged.

**special_tokens**: a list of the special tokens that BERT uses.

**limit_alphabet**: maximum number of different characters.

**wordpieces_prefix**: the prefix added to pieces that continue a word (like ##board in our earlier examples).
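
To show how the `##` prefix and the [UNK] token come into play once a vocabulary exists, here is a minimal sketch of the greedy longest-match-first splitting commonly used for WordPiece at tokenization time. It covers splitting one word against a fixed vocabulary, not the training procedure, and `wordpiece_tokenize` and `toy_vocab` are made up for illustration.

```python
def wordpiece_tokenize(word, vocab, prefix="##", unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word (illustrative sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking until one is in vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"snow", "##board", "##ing", "board"}
print(wordpiece_tokenize("snowboarding", toy_vocab))  # ['snow', '##board', '##ing']
print(wordpiece_tokenize("skiing", toy_vocab))        # ['[UNK]']
```

The second call shows why a vocabulary that is too small hurts: a word with no matching pieces collapses into a single [UNK].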

In [33]:
# initialize
tokenizer = BertWordPieceTokenizer(
    vocab=None,
    clean_text=False,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=False
)
# and train
tokenizer.train(files=paths, vocab_size=100000, min_frequency=2,
                limit_alphabet=1000, wordpieces_prefix='##',
                special_tokens=[
                    '[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'])

In [34]:
tokenizer

Tokenizer(vocabulary_size=37716, model=BertWordPiece, unk_token=[UNK], sep_token=[SEP], cls_token=[CLS], pad_token=[PAD], mask_token=[MASK], clean_text=False, handle_chinese_chars=False, strip_accents=False, lowercase=False, wordpieces_prefix=##)

In [29]:
os.makedirs('./vocab', exist_ok=True)
# no filename prefix, so the file is saved as vocab.txt, the name BertTokenizer looks for
tokenizer.save_model("vocab")

['vocab/vocab.txt']

### Using this vocab in BERT Tokenizer

In [38]:
tokenizer = BertTokenizer.from_pretrained('/content/vocab')

file /content/vocab/config.json not found


In [39]:
tokenizer

PreTrainedTokenizer(name_or_path='/content/vocab', vocab_size=37697, model_max_len=1000000000000000019884624838656, is_fast=False, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [41]:
tokenizer('Ok. hi. bye. h')

{'input_ids': [2, 12394, 18, 13104, 18, 25639, 18, 76, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
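
The first and last IDs in this output are 2 and 3. Assuming the special tokens received IDs in the order they were listed during training ([PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4), these are [CLS] and [SEP]. The sketch below rebuilds the same dictionary by hand to show the layout of a single-sentence input; `build_single_sequence` is a hypothetical helper, not a `transformers` function.

```python
def build_single_sequence(token_ids, cls_id=2, sep_id=3):
    """Wrap token IDs the way a single-sentence BERT input is laid out (sketch)."""
    input_ids = [cls_id] + list(token_ids) + [sep_id]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # one segment, so all zeros
        "attention_mask": [1] * len(input_ids),  # no padding, so all ones
    }


# Word-piece IDs taken from the output above, without the first and last entries.
encoded = build_single_sequence([12394, 18, 13104, 18, 25639, 18, 76])
print(encoded["input_ids"])
```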

In [42]:
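# read the vocabulary back: one token per line, and the line index is the token ID
with open('./vocab/vocab.txt', 'r', encoding='utf-8') as fp:
    vocab = fp.read().splitlines()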
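
Because the vocabulary file holds one token per line, a token's ID is simply its line index. The sketch below shows the lookup in both directions on a tiny stand-in file; `toy_lines` and the temporary path are made up for illustration, and the real file has tens of thousands of lines.

```python
import tempfile
from pathlib import Path

# A tiny stand-in vocab file with the special tokens first, as in training.
toy_lines = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hi", "##gh"]
toy_path = Path(tempfile.mkdtemp()) / "vocab.txt"
toy_path.write_text("\n".join(toy_lines) + "\n", encoding="utf-8")

# splitlines() avoids the empty trailing entry that split("\n") can leave behind.
toy_vocab = toy_path.read_text(encoding="utf-8").splitlines()

token_to_id = {token: index for index, token in enumerate(toy_vocab)}
id_to_token = dict(enumerate(toy_vocab))

print(token_to_id["[CLS]"], token_to_id["[SEP]"])  # 2 3
print([id_to_token[i] for i in (2, 5, 3)])          # ['[CLS]', 'hi', '[SEP]']
```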