<a href="https://colab.research.google.com/github/Satwikram/NLP-Implementations/blob/main/BERT/Adding%20Domain%20Specific%20tokens%20for%20BERT%20Tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Author: Satwik Ram K

### Setup

In [None]:
!pip install transformers
!pip install tokenizers
!pip install datasets

### Imports

In [1]:
import datasets
from datasets import load_dataset

import os
from tqdm.auto import tqdm
import re
from pathlib import Path
from tokenizers import BertWordPieceTokenizer

from transformers import BertTokenizer, AutoModelForMaskedLM, AutoTokenizer, TFAutoModel

In [2]:
len(datasets.list_datasets())

2678

### Downloading the datasets

In [3]:
dataset = load_dataset(
    "imdb",
    "plain_text",
    split="train[:5000]"
)

Resolving data files:   0%|          | 0/5000 [00:00<?, ?it/s]

Using custom data configuration imdb-05b5f779e0edea84


Downloading and preparing dataset text/imdb to /root/.cache/huggingface/datasets/text/imdb-05b5f779e0edea84/0.0.0/d86c40dad297bdddf277b406c6a59f0250b5318c400bf23d420a31aff88c84c4...


Dataset text downloaded and prepared to /root/.cache/huggingface/datasets/text/imdb-05b5f779e0edea84/0.0.0/d86c40dad297bdddf277b406c6a59f0250b5318c400bf23d420a31aff88c84c4. Subsequent calls will reuse this data.


In [4]:
os.makedirs("./imdb", exist_ok=True)

text_data = []
file_count = 0

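for sample in tqdm(dataset):
    # collapse runs of whitespace (including newlines) into single spaces
    sample = re.sub(r"\s+", " ", sample["text"])
    text_data.append(sample)

    # write each review to its own file (text_0.txt, text_1.txt, ...)
    with open(f'./imdb/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
        fp.write('\n'.join(text_data))
    # reset the buffer and move on to the next file index
    text_data = []
    file_count += 1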

  0%|          | 0/5000 [00:00<?, ?it/s]
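
The loop above writes one file per review. If fewer, larger files are preferred, the reviews could be grouped into batches before writing. A minimal sketch, assuming a hypothetical helper `write_in_batches` and a `batch_size` parameter, neither of which is part of the original notebook:

```python
import re
import tempfile
from pathlib import Path


def write_in_batches(texts, out_dir, batch_size=1000):
    """Write texts to out_dir, batch_size texts per file, one text per line.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(texts), batch_size):
        # collapse whitespace so every text stays on a single line
        batch = [re.sub(r"\s+", " ", t) for t in texts[start:start + batch_size]]
        path = out_dir / f"text_{start // batch_size}.txt"
        path.write_text("\n".join(batch), encoding="utf-8")
        paths.append(str(path))
    return paths


# toy usage with made-up reviews written to a temporary directory
demo_dir = tempfile.mkdtemp()
demo_paths = write_in_batches(["a  b", "c\nd", "e"], demo_dir, batch_size=2)
print(demo_paths)
```

The returned list of paths can be passed to the tokenizer trainer in the same way as the `paths` list built in the next cell.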

In [5]:
paths = [str(x) for x in Path('/content/imdb/').rglob('*.txt')]
paths[:5]

['/content/imdb/text_3.txt',
 '/content/imdb/text_327.txt',
 '/content/imdb/text_1929.txt',
 '/content/imdb/text_1787.txt',
 '/content/imdb/text_4710.txt']

### Building WordPiece BERT Tokenizer

In [6]:
# initialize
tokenizer = BertWordPieceTokenizer(
    vocab=None,
    clean_text=False,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=False
)
# and train
tokenizer.train(files=paths, vocab_size=100000, min_frequency=2,
                limit_alphabet=1000, wordpieces_prefix='##',
                special_tokens=[
                    '[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'])

In [7]:
tokenizer

Tokenizer(vocabulary_size=37716, model=BertWordPiece, unk_token=[UNK], sep_token=[SEP], cls_token=[CLS], pad_token=[PAD], mask_token=[MASK], clean_text=False, handle_chinese_chars=False, strip_accents=False, lowercase=False, wordpieces_prefix=##)
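
At encoding time, a WordPiece tokenizer splits each word greedily into the longest vocabulary pieces it can find, marking non-initial pieces with the `##` prefix. A minimal pure-Python sketch of that lookup, using a made-up toy vocabulary; the function name `wordpiece_split` is hypothetical and this is not the `tokenizers` implementation:

```python
def wordpiece_split(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first split of a single word (illustrative only)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # no piece fits: the whole word maps to the unknown token
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# toy vocabulary, not taken from the trained tokenizer
toy_vocab = {"un", "##watch", "##able", "watch", "##ed"}
print(wordpiece_split("unwatchable", toy_vocab))  # ['un', '##watch', '##able']
print(wordpiece_split("watched", toy_vocab))      # ['watch', '##ed']
print(wordpiece_split("xyz", toy_vocab))          # ['[UNK]']
```

This covers only the lookup step. The training step performed by `tokenizer.train` above, which decides which pieces enter the vocabulary, is not shown.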

### Saving the Vocab

In [8]:
os.makedirs('./vocab', exist_ok=True)
tokenizer.save_model("vocab")

['vocab/vocab.txt']

In [9]:
tokens = list(tokenizer.get_vocab().keys())

In [10]:
tokens[:10]

['recite',
 'HAPP',
 'zones',
 'restore',
 'YW',
 'Ust',
 '##entia',
 'Taco',
 'Toronto',
 'gamut']

### Adding these tokens to the BERT tokenizer

In [11]:
checkpoint = "bert-base-cased"
bert_tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [12]:
model = TFAutoModel.from_pretrained(checkpoint)

Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


### Let's increase the vocabulary of the BERT model and tokenizer

In [13]:
num_added_toks = bert_tokenizer.add_tokens(tokens)

In [15]:
print('We have added', num_added_toks, 'tokens')

We have added 21229 tokens
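
`add_tokens` skips tokens that are already in the base vocabulary, which is consistent with only 21229 of the 37716 trained tokens being added. The list passed in above also contains the special tokens and `##` continuation pieces from the newly trained tokenizer. One option, sketched here as an assumption rather than the notebook's method, is to filter such entries out first. `select_new_tokens` is a hypothetical helper and the vocabularies are toy data:

```python
def select_new_tokens(candidates, base_vocab, special_tokens, prefix="##"):
    """Keep whole-word candidates that the base vocabulary does not contain.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    seen = set()
    selected = []
    for tok in candidates:
        # drop special tokens and subword continuation pieces
        if tok in special_tokens or tok.startswith(prefix):
            continue
        # drop tokens the base tokenizer already knows, and duplicates
        if tok in base_vocab or tok in seen:
            continue
        seen.add(tok)
        selected.append(tok)
    return selected


# toy data standing in for the trained tokens and the BERT vocabulary
toy_candidates = ["[PAD]", "gamut", "##entia", "Toronto", "the", "gamut"]
toy_base_vocab = {"[PAD]", "the", "Toronto"}
toy_special = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}

new_tokens = select_new_tokens(toy_candidates, toy_base_vocab, toy_special)
print(new_tokens)  # ['gamut']
```

With the real objects, `bert_tokenizer.get_vocab()` could serve as the base vocabulary, and the filtered list could then be passed to `bert_tokenizer.add_tokens`.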


Note: `resize_token_embeddings` expects to receive the full size of the new vocabulary, i.e., the length of the tokenizer.

In [16]:
model.resize_token_embeddings(len(bert_tokenizer))

<transformers.models.bert.modeling_tf_bert.TFBertEmbeddings at 0x7f065ac63b50>

In [20]:
bert_tokenizer.vocab_size

28996
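
`vocab_size` reports only the base vocabulary and does not count tokens added afterwards, which is why it still shows 28996. The size passed to `resize_token_embeddings` was `len(bert_tokenizer)`, which does include the added tokens. A small arithmetic check using the numbers printed above, assuming every token counted by `add_tokens` received a new id:

```python
base_vocab_size = 28996  # bert_tokenizer.vocab_size, as printed above
num_added_toks = 21229   # value returned by add_tokens, as printed above

# expected value of len(bert_tokenizer) under the assumption stated above
expected_total = base_vocab_size + num_added_toks
print(expected_total)  # 50225
```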