# Data prep: Tokenizer Training (Mockup)
Copyright (C) 2021 ServiceNow, Inc.

This notebook was used to experiment with training a custom (geology-specific) tokenizer.

It is likely not runnable without revisions, since some of the training data may no longer exist.

---

# Plan
* Test out tokenizer training process. 
    * Follow [🤗 Tokenizers Quicktour](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) to train a BPE tokenizer from scratch. This is just to check the process.
    * Use data from Tianyi’s blind text concatenation here: `/nrcan_p2/data/03_primary/v1/all_text.txt` (used a smaller subset)
* Create two tokenizers
    1. Train a tokenizer from scratch for BERT, including pre- and post-processing, and WordPiece trainer, following [🤗 Tokenization Pipeline Tutorial](https://huggingface.co/docs/tokenizers/python/latest/pipeline.html).
        * Actually trained 2, with different amounts of data (~5 vs. 500 MB)
    2. Create a second tokenizer that is a BERT standard tokenizer (wikipedia data subset, also uses the above pipeline)
* Examine and compare tokenizers
    * Pick a subset of definitely geological sentences
    * Compare what the tokenizers do with technical terms and with common English.

---

# Setup

Manually check that data is where I expect it to be:

In [42]:
# !ls ../../data/03_primary/v1

Huggingface `tokenizers` and `transformers` should be installed in the container already.

## Set up data

### Create a smaller text file

The whole dataset (`data/03_primary/v1/all_text.txt`) is 2.17 GB, whereas the tutorial uses a 516 MB dataset. Training the tokenizer on the full 2.17 GB is very slow, which is why I started with a smaller subset. :)

The cells below are kept for documentation only. Their relative paths assume a notebook located in `workspace/[SUBFOLDER]/`.

In [34]:
# Ran this in the command line in data/03_primary/. 
# Could uncomment and add directory structure to run here.
# ~5MB
#!head -c 5000000 all_text.txt > short_text_5M.txt

In [45]:
# ~500MB
# !head -c 500000000 ../../data/03_primary/v1/all_text.txt > ../../data/03_primary/v1/med_text_500M.txt
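
For anyone who prefers to stay in Python, the sketch below does roughly what `head -c` does in the cells above: it copies the first N bytes of a file into a new file. The function name `head_bytes` and the `chunk_size` parameter are illustrative and not part of the original workflow.

```python
from pathlib import Path


def head_bytes(src, dst, n_bytes, chunk_size=1 << 20):
    """Copy the first n_bytes of src into dst, like `head -c n_bytes src > dst`.

    Returns the number of bytes actually written (less than n_bytes if the
    source file is shorter).
    """
    remaining = n_bytes
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while remaining > 0:
            # Read in chunks so a multi-GB source never sits fully in memory.
            chunk = fin.read(min(chunk_size, remaining))
            if not chunk:  # source ended before n_bytes were copied
                break
            fout.write(chunk)
            remaining -= len(chunk)
    return Path(dst).stat().st_size


# Example call with the paths used in this notebook (left commented out,
# since the data may no longer exist):
# head_bytes("../../data/03_primary/v1/all_text.txt",
#            "../../data/03_primary/v1/med_text_500M.txt", 500_000_000)
```

Like `head -c`, a byte-based cut can land in the middle of a multi-byte UTF-8 character, so the last character of the subset may be invalid.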

### Download data from wikipedia for comparison

In [69]:
# %%bash
# (cd /nrcan_p2/data/01_raw/toy/wiki;
# wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip;
# unzip wikitext-103-raw-v1.zip)

Data is at:

`/nrcan_p2/data/01_raw/toy/wiki/wikitext-103-raw/`

and includes: 
* `wiki.test.raw`
* `wiki.train.raw`
* `wiki.valid.raw`

---

# Build BPE tokenizer from scratch (learn workflow)

## Train tokenizer

In [1]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

In [21]:
# Instantiate Tokenizer with a BPE model
tokenizer = Tokenizer(BPE())

In [4]:
# Instantiate a trainer (BpeTrainer) for training on geological text
# Default values: vocab_size=30000, min_frequency=0
trainer = BpeTrainer(special_tokens=['[UNK]', '[CLS]', '[SEP]', '[PAD]', '[MASK]'])

NOTE: May want to add to the pre-tokenizer to split on numbers as well.

In [5]:
# Include pre-tokenizer to split inputs into words (split on whitespace)
tokenizer.pre_tokenizer = Whitespace()
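
The `Whitespace` pre-tokenizer is documented as splitting on the regex `\w+|[^\w\s]+`. The sketch below reproduces that pattern with Python's `re` module to show what the split looks like, and adds a hypothetical digit-aware variant along the lines of the note above. The function names are illustrative, and Python's regex engine may differ from the library's on unusual Unicode input, so treat this as an approximation only.

```python
import re

# Runs of word characters, or runs of non-word, non-space characters.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

# Same idea, but every digit becomes its own piece:
#   \d        a single digit
#   [^\W\d]+  a run of word characters that are not digits
#   [^\w\s]+  a run of punctuation or symbols
DIGIT_AWARE = re.compile(r"\d|[^\W\d]+|[^\w\s]+")


def split_like_whitespace(text):
    """Approximate the Whitespace pre-tokenizer with a plain regex."""
    return WORD_OR_PUNCT.findall(text)


def split_with_digits(text):
    """Hypothetical variant that also isolates individual digits."""
    return DIGIT_AWARE.findall(text)


print(split_like_whitespace("Hello, y'all!"))
print(split_with_digits("U238 at 450m"))
```

The library also ships a `Digits` pre-tokenizer that can be chained with `Whitespace`, which may be the more direct route for the real pipeline.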

In [6]:
# Set location for text files (~5 MB subset of the corpus)
files = ['../../data/03_primary/v1/short_text_5M.txt']

In [7]:
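# Argument order as used when this notebook was run; newer `tokenizers`
# releases expect tokenizer.train(files, trainer) instead.
tokenizer.train(trainer, files)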
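
For intuition about what BPE training does, here is a minimal pure-Python sketch of merge learning: repeatedly find the most frequent adjacent symbol pair and fuse it into a new symbol. This is not the library's implementation. The function name `learn_bpe_merges` and the tie-breaking rule are assumptions, and it ignores `vocab_size`, `min_frequency`, and special tokens.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words (toy version)."""
    # Each distinct word becomes a tuple of single-character symbols,
    # stored with the number of times it occurs.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:  # every word is already a single symbol
            break
        # Most frequent pair wins; ties are broken by the pair itself so the
        # result is deterministic (an assumption, not the library's rule).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


print(learn_bpe_merges(["quartz", "quartzite", "quartzite", "magma"], num_merges=3))
```

Frequent character sequences in the training text become single tokens first, which is why a geology corpus can be expected to yield whole-word tokens for common geological terms.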

In [9]:
# Save model and reinstate with unknown token (or it won't be used)
token_files = tokenizer.model.save("../../data/06_models/tokenizers/testing", "bpe_geo_short")
tokenizer.model = BPE.from_file(*token_files, unk_token="[UNK]")

In [10]:
# Save as one file that contains all configuration and vocabulary
tokenizer.save("../../data/06_models/tokenizers/testing/bpe_geo_short.json")
# Can reload with:
# tokenizer = Tokenizer.from_file("../../data/06_models/tokenizers/testing/bpe_geo_short.json")

### Take a quick look at the new tokenizer

In [17]:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
print(output.ids)

['H', 'ello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '[UNK]', '?']
[45, 14735, 17, 93, 12, 221, 6, 4111, 154, 1697, 0, 36]


In [16]:
output = tokenizer.encode("This geo sentence includes proteozoic, quartzite, and magma.")
print(output.tokens)
print(output.ids)

['This', 'geo', 'sent', 'ence', 'includes', 'pro', 'te', 'ozoic', ',', 'quartzite', ',', 'and', 'magma', '.']
[503, 977, 9388, 374, 3726, 186, 428, 975, 17, 2822, 17, 129, 9624, 19]


---

# Build WordPiece tokenizer for BERT

Two tokenizers are trained in this section: same workflow, same parameters, different amounts of geology text (~5 MB, ~500 MB).

I have not done any sort of thorough assessment to determine whether the additional text makes a functional difference.

## Setup

In [2]:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece # BPE
from tokenizers import normalizers
from tokenizers.normalizers import Lowercase, NFD, StripAccents
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordPieceTrainer

## Set up training pipeline

### Instantiate tokenizer

In [3]:
# Instantiate Tokenizer with a WordPiece model
bert_tokenizer = Tokenizer(WordPiece())
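
As background for the outputs later in this notebook, the sketch below shows how a WordPiece model is generally understood to encode a single word once a vocabulary exists: greedy longest-match-first, with `##` marking pieces that continue a word. The function name and the toy vocabularies are illustrative and are not taken from the trained tokenizers.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first encoding of one word (toy version)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # If any part of the word cannot be matched, the whole word
            # maps to the unknown token.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


geo_like_vocab = {"prote", "##ozoic", "quartzite"}
general_vocab = {"quart", "##zi", "##te"}
print(wordpiece_encode("quartzite", geo_like_vocab))
print(wordpiece_encode("quartzite", general_vocab))
```

This also illustrates why a character missing from the vocabulary, such as an emoji, comes out as `[UNK]`.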

### Normalization

Tasks:
* Unicode normalization
* Lowercase all text
* Remove accents

In [4]:
bert_tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
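
The three normalization tasks can be approximated with the standard library, which makes it easy to see what the text looks like before it reaches the pre-tokenizer. This is a sketch of the idea, not the library's code: the function name `normalize_text` is illustrative, and the assumption that accent stripping equals dropping Unicode "Mark" characters after NFD decomposition is mine.

```python
import unicodedata


def normalize_text(text):
    """Approximate NFD -> Lowercase -> StripAccents with the stdlib."""
    # NFD splits accented characters into a base character plus
    # separate combining marks (e.g. "é" -> "e" + U+0301).
    decomposed = unicodedata.normalize("NFD", text)
    lowered = decomposed.lower()
    # Drop the combining marks (Unicode categories Mn, Mc, Me).
    return "".join(
        ch for ch in lowered if not unicodedata.category(ch).startswith("M")
    )


print(normalize_text("Précambrien"))
```

The order matters: the accents can only be removed as separate characters after NFD has decomposed them.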

### Pre-tokenizer

Split on: 
* Whitespace 
* Punctuation

In [5]:
bert_tokenizer.pre_tokenizer = Whitespace()

### Post-processing template

Set up for both single sequences and sequence/sentence pairs

In [6]:
bert_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)
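
To make the template concrete, the sketch below imitates it in plain Python: `$A` and `$B` are the two input sequences, and the `:1` suffix assigns type id 1 to the second sequence and its closing `[SEP]`. The function name `apply_bert_template` is illustrative. The ids `1` and `2` in the template have to match the ids that training assigns to `[CLS]` and `[SEP]`, which here follow their positions in the trainer's `special_tokens` list.

```python
# Same order as the special_tokens list passed to the trainer in this notebook.
SPECIAL_TOKENS = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]


def apply_bert_template(tokens_a, tokens_b=None):
    """Imitate the single/pair templates; returns (tokens, type_ids)."""
    # single: [CLS] $A [SEP], all with type id 0
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    type_ids = [0] * len(tokens)
    if tokens_b is not None:
        # pair: ... $B:1 [SEP]:1, second sequence and its [SEP] get type id 1
        second = list(tokens_b) + ["[SEP]"]
        tokens += second
        type_ids += [1] * len(second)
    return tokens, type_ids


# The template ids (1 and 2) line up with the list positions.
print(SPECIAL_TOKENS.index("[CLS]"), SPECIAL_TOKENS.index("[SEP]"))
print(apply_bert_template(["magma"], ["quartzite"]))
```

If the order of the special tokens in the trainer changes, the ids in the template need to change with it.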

## Train tokenizer on geology text

Steps:
* Train tokenizer on geology file(s)
* Save model files
* Reload and reinstate with unknown token as `[UNK]`
* Save full model

5 MB text
* This is pretty quick.

In [21]:
# Train
trainer = WordPieceTrainer(
    vocab_size=30522, 
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
text_files = ['/nrcan_p2/data/03_primary/metadata/metadata_nosentences_5M.txt']
bert_tokenizer.train(trainer, text_files)

# Save
model_files = bert_tokenizer.model.save(
    "/nrcan_p2/data/06_models/tokenizers/testing", 
    "bert_geo_meta_short"
)

# Reload with unknown token
bert_tokenizer.model = WordPiece.from_file(*model_files, unk_token="[UNK]")

# Save full model
bert_tokenizer.save("/nrcan_p2/data/06_models/tokenizers/testing/bert_geo_meta_short.json")

500 MB text
* This takes a while (8-9 minutes in a notebook in a toolkit container).

In [39]:
bert_tokenizer2 = Tokenizer(WordPiece())
bert_tokenizer2.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
bert_tokenizer2.pre_tokenizer = Whitespace()
bert_tokenizer2.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)

# Train
trainer = WordPieceTrainer(
    vocab_size=30522, 
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
text_files = ['../../data/03_primary/v1/med_text_500M.txt']
bert_tokenizer2.train(trainer, text_files)

# Save
model_files = bert_tokenizer2.model.save(
    "../../data/06_models/tokenizers/testing", 
    "bert-geo-med"
)

# Reload with unknown token
bert_tokenizer2.model = WordPiece.from_file(*model_files, unk_token="[UNK]")

# Save full model
bert_tokenizer2.save("../../data/06_models/tokenizers/testing/bert_geo_med.json")

### Take a look at new BERT tokenizer(s)

First tokenizer (small):

In [31]:
output = bert_tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
print(output.ids)

['[CLS]', 'hell', '##o', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?', '[SEP]']
[1, 18884, 111, 17, 67, 12, 464, 6, 721, 214, 1105, 0, 36, 2]


In [40]:
output = bert_tokenizer.encode("This geo sentence includes proteozoic, quartzite, and magma.")
print(output.tokens)
print(output.ids)

['[CLS]', 'this', 'geo', 'sent', '##ence', 'includes', 'prote', '##ozoic', ',', 'quartzite', ',', 'and', 'magma', '.', '[SEP]']
[1, 308, 323, 9829, 420, 3773, 19496, 1093, 17, 2558, 17, 184, 9132, 19, 2]


Second tokenizer (medium):

In [41]:
output = bert_tokenizer2.encode("This geo sentence includes proteozoic, quartzite, and magma.")
print(output.tokens)
print(output.ids)

['[CLS]', 'this', 'geo', 'sent', '##ence', 'includes', 'prote', '##ozoic', ',', 'quartzite', ',', 'and', 'magma', '.', '[SEP]']
[1, 1042, 2005, 5599, 1149, 3384, 20157, 1849, 39, 2296, 39, 892, 4649, 41, 2]


# Quick wiki tokenizer for comparison

This takes a little while. I did not time it, but it was probably around 5 minutes.

In [63]:
wiki_tokenizer = Tokenizer(WordPiece())
wiki_tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
wiki_tokenizer.pre_tokenizer = Whitespace()
wiki_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)

# Train - same trainer as above, with different text files
trainer = WordPieceTrainer(
    vocab_size=30522, 
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
text_files_wiki = [f"/nrcan_p2/data/01_raw/toy/wiki/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
wiki_tokenizer.train(trainer, text_files_wiki)

# Save
model_files_wiki = wiki_tokenizer.model.save(
    "/nrcan_p2/data/06_models/tokenizers/testing", 
    "bert-wiki"
)

# Reload with unknown token
wiki_tokenizer.model = WordPiece.from_file(*model_files_wiki, unk_token="[UNK]")

# Save full model
wiki_tokenizer.save("/nrcan_p2/data/06_models/tokenizers/testing/bert_wiki.json")

# Compare tokenizers manually

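Preliminarily, this suggests that training a tokenizer on geological data could be a useful part of our experimental pipeline: it changes how geological terms are tokenized. For example, the geology tokenizers keep `quartzite` whole, while the Wikipedia tokenizer splits it into `quart`, `##zi`, `##te`. The 5 MB and 500 MB geology tokenizers tokenize this sentence identically. Much more experimentation would be needed to determine whether any of this affects downstream task accuracy.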

In [68]:
geo_sentence = "This geo sentence includes proteozoic, quartzite, and magma."
print("Trained on wikipedia:")
print(wiki_tokenizer.encode(geo_sentence).tokens)
print("")
print("Trained on 5MB geo data:")
print(bert_tokenizer.encode(geo_sentence).tokens)
print("")
print("Trained on 500MB geo data:")
print(bert_tokenizer2.encode(geo_sentence).tokens)

Trained on wikipedia:
['[CLS]', 'this', 'ge', '##o', 'sentence', 'includes', 'prote', '##ozo', '##ic', ',', 'quart', '##zi', '##te', ',', 'and', 'magma', '.', '[SEP]']

Trained on 5MB geo data:
['[CLS]', 'this', 'geo', 'sent', '##ence', 'includes', 'prote', '##ozoic', ',', 'quartzite', ',', 'and', 'magma', '.', '[SEP]']

Trained on 500MB geo data:
['[CLS]', 'this', 'geo', 'sent', '##ence', 'includes', 'prote', '##ozoic', ',', 'quartzite', ',', 'and', 'magma', '.', '[SEP]']
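
A small helper can make this comparison less manual by listing which words each tokenizer broke into several pieces. The sketch below works on the token lists printed above, copied in by hand. The function names are illustrative, and it assumes `##` marks a continuation piece, as in the outputs.

```python
SPECIALS = {"[CLS]", "[SEP]"}


def group_into_words(tokens, prefix="##"):
    """Group WordPiece tokens into per-word lists, skipping [CLS]/[SEP]."""
    words = []
    for tok in tokens:
        if tok in SPECIALS:
            continue
        if tok.startswith(prefix) and words:
            words[-1].append(tok)  # continuation of the previous word
        else:
            words.append([tok])
    return words


def fragmented_words(tokens, prefix="##"):
    """Map each word that was split into 2+ pieces to its pieces."""
    result = {}
    for pieces in group_into_words(tokens, prefix):
        if len(pieces) > 1:
            word = pieces[0] + "".join(p[len(prefix):] for p in pieces[1:])
            result[word] = pieces
    return result


wiki_tokens = ['[CLS]', 'this', 'ge', '##o', 'sentence', 'includes', 'prote',
               '##ozo', '##ic', ',', 'quart', '##zi', '##te', ',', 'and',
               'magma', '.', '[SEP]']
geo_tokens = ['[CLS]', 'this', 'geo', 'sent', '##ence', 'includes', 'prote',
              '##ozoic', ',', 'quartzite', ',', 'and', 'magma', '.', '[SEP]']

print("wiki:", fragmented_words(wiki_tokens))
print("geo: ", fragmented_words(geo_tokens))
```

Running the same helper over a larger set of geological sentences would give a rough fragmentation count per tokenizer, which is one possible starting point for the further experimentation mentioned above.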
