<div style="font-variant: small-caps; 
      font-weight: normal; 
      font-size: 35px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Tokenization
  </div> 
  
<div style="
      font-weight: normal; 
      font-size: 25px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
      Tokenizer fitting (clinical-trials)
  </div> 


  <div style="
      font-size: 15px; 
      line-height: 12px; 
      text-align: center; 
      padding: 15px; 
      margin: 10px;">
  Jean-baptiste AUJOGUE - Hybrid Intelligence
  </div> 

  
  <div style=" float:right; 
      font-size: 12px; 
      line-height: 12px; 
  padding: 10px 15px 8px;">
  December 2022
  </div>

<a id="TOC"></a>

#### Table of Contents

1. [Dataset](#data) <br>
2. [ALBERT finetuning](#albert) <br>
3. [Inference](#inference) <br>



#### Reference

- Huggingface full list of [tutorial notebooks](https://github.com/huggingface/transformers/tree/main/notebooks) (see also [here](https://huggingface.co/docs/transformers/main/notebooks#pytorch-examples))
- Huggingface full list of [training scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch)
- Huggingface [tutorial notebook](https://github.com/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb) on language models
- Huggingface [course](https://huggingface.co/course/chapter7/3?fw=tf) on language models
- Huggingface [training script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py) on language models
- Albert [original training protocol](https://github.com/google-research/albert)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys
import re
import random
import copy
import string
from itertools import chain

# data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from datasets import (
    Dataset, 
    DatasetDict,
    ClassLabel, 
    Features, 
    Sequence, 
    Value,
    load_from_disk,
)
from transformers import AlbertConfig, AutoConfig, DataCollatorForLanguageModeling

# DL
import torch
from gensim.models import Word2Vec
import transformers
from transformers import (
    AutoTokenizer, 
    AutoModelForMaskedLM, 
    TrainingArguments, 
    Trainer,
    pipeline,
    set_seed,
)
import evaluate

# viz
from IPython.display import HTML



#### Transformers settings

In [3]:
transformers.__version__

'4.22.2'

In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [5]:
# make training deterministic
set_seed(42)

#### Custom paths & imports

In [6]:
path_to_repo = os.path.dirname(os.path.dirname(os.getcwd()))
path_to_data = os.path.join(path_to_repo, 'datasets', 'clinical trials CTTI')
path_to_save = os.path.join(path_to_repo, 'saves', 'MLM')
path_to_src  = os.path.join(path_to_repo, 'src')

In [7]:
sys.path.insert(0, path_to_src)

#### Constants

In [8]:
dataset_name = 'clinical-trials-ctti'
base_model_name = "albert-base-v2"
final_model_name = "albert-small-clinical-trials"

<a id="data"></a>

# 1. Dataset

[Table of contents](#TOC)

We load the corpus into an instance of the `datasets.Dataset` class, with one line of the corpus file per row. 

Note that this class is distinct from the fairly generic `torch.utils.data.Dataset` class. 

## 1.1 Load Clinical Trials corpus

[Table of contents](#TOC)

In [9]:
with open(os.path.join(path_to_data, '{}.txt'.format(dataset_name)), 'r', encoding = 'utf-8') as f:
    texts = f.readlines()

In [10]:
dataset = Dataset.from_dict({'text': texts}, features = Features({'text': Value(dtype = 'string')}))

In [11]:
dataset[:3]

{'text': ['This study will test the ability of extended release nifedipine (Procardia XL), a blood pressure medication, to permit a decrease in the dose of glucocorticoid medication children take to treat congenital adrenal hyperplasia (CAH). This protocol is designed to assess both acute and chronic effects of the calcium channel antagonist, nifedipine, on the hypothalamic-pituitary-adrenal axis in patients with congenital adrenal hyperplasia. The multicenter trial is composed of two phases and will involve a double-blind, placebo-controlled parallel design. The goal of Phase I is to examine the ability of nifedipine vs. placebo to decrease adrenocorticotropic hormone (ACTH) levels, as well as to begin to assess the dose-dependency of nifedipine effects. The goal of Phase II is to evaluate the long-term effects of nifedipine; that is, can attenuation of ACTH release by nifedipine permit a decrease in the dosage of glucocorticoid needed to suppress the HPA axis? Such a decrease would, 

## 1.2 Build Clinical-Albert-small tokenizer

[Table of contents](#TOC)


In [12]:
def batch_iterator(dataset, batch_size = 512):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i: i + batch_size]['text']
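
The iterator above relies on the fact that slicing a `datasets.Dataset` returns a dict of columns rather than a list of rows. The following is a minimal sketch of that behavior, assuming a hypothetical stand-in class `ToyDataset` (not part of any library) so that it runs without the corpus:

```python
class ToyDataset:
    """Hypothetical stand-in that mimics the column-oriented slicing of datasets.Dataset."""

    def __init__(self, texts):
        self._texts = list(texts)

    def __len__(self):
        return len(self._texts)

    def __getitem__(self, key):
        # a slice returns a dict of columns, as `dataset[:3]` does above
        return {'text': self._texts[key]}


def batch_iterator(dataset, batch_size = 512):
    # yield successive lists of raw texts, batch_size at a time
    for i in range(0, len(dataset), batch_size):
        yield dataset[i: i + batch_size]['text']


toy = ToyDataset(['doc {}'.format(i) for i in range(5)])
batches = list(batch_iterator(toy, batch_size = 2))
print(batches)
```

The last batch is simply shorter when the dataset size is not a multiple of `batch_size`.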

In [13]:
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer = tokenizer.train_new_from_iterator(batch_iterator(dataset, batch_size = 512), vocab_size = 5000)
tokenizer.save_pretrained(os.path.join(path_to_save, final_model_name, 'tokenizer'))

('C:\\Users\\jb\\Desktop\\NLP\\Internal - Transformers for NLP\\saves\\MLM\\albert-small-clinical-trials\\tokenizer\\tokenizer_config.json',
 'C:\\Users\\jb\\Desktop\\NLP\\Internal - Transformers for NLP\\saves\\MLM\\albert-small-clinical-trials\\tokenizer\\special_tokens_map.json',
 'C:\\Users\\jb\\Desktop\\NLP\\Internal - Transformers for NLP\\saves\\MLM\\albert-small-clinical-trials\\tokenizer\\tokenizer.json')

In [51]:
tokenizer = AutoTokenizer.from_pretrained(os.path.join(path_to_save, final_model_name, 'tokenizer'))

In [15]:
texts[0], tokenizer.decode(tokenizer(dataset[0]["text"])['input_ids'])

('This study will test the ability of extended release nifedipine (Procardia XL), a blood pressure medication, to permit a decrease in the dose of glucocorticoid medication children take to treat congenital adrenal hyperplasia (CAH). This protocol is designed to assess both acute and chronic effects of the calcium channel antagonist, nifedipine, on the hypothalamic-pituitary-adrenal axis in patients with congenital adrenal hyperplasia. The multicenter trial is composed of two phases and will involve a double-blind, placebo-controlled parallel design. The goal of Phase I is to examine the ability of nifedipine vs. placebo to decrease adrenocorticotropic hormone (ACTH) levels, as well as to begin to assess the dose-dependency of nifedipine effects. The goal of Phase II is to evaluate the long-term effects of nifedipine; that is, can attenuation of ACTH release by nifedipine permit a decrease in the dosage of glucocorticoid needed to suppress the HPA axis? Such a decrease would, in turn, 

In [58]:
import json

In [64]:
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer_json = json.loads(tokenizer._tokenizer.to_str())
tokenizer_json

{'version': '1.0',
 'truncation': None,
 'padding': None,
 'added_tokens': [{'id': 0,
   'content': '<pad>',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 1,
   'content': '<unk>',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 2,
   'content': '[CLS]',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 3,
   'content': '[SEP]',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 4,
   'content': '[MASK]',
   'single_word': False,
   'lstrip': True,
   'rstrip': False,
   'normalized': False,
   'special': True}],
 'normalizer': {'type': 'Sequence',
  'normalizers': [{'type': 'Replace',
    'pattern': {'String': '``'},
    'content': '"'},
   {'type': 'Replace', 'pattern': {'String': "''"}, 'content': '"'},
   

In [65]:
tokenizer_new = AutoTokenizer.from_pretrained(os.path.join(path_to_save, final_model_name, 'tokenizer'))
tokenizer_new_json = json.loads(tokenizer_new._tokenizer.to_str())
tokenizer_new_json

{'version': '1.0',
 'truncation': None,
 'padding': None,
 'added_tokens': [{'id': 0,
   'content': '<pad>',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 1,
   'content': '<unk>',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 2,
   'content': '[CLS]',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 3,
   'content': '[SEP]',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 4,
   'content': '[MASK]',
   'single_word': False,
   'lstrip': True,
   'rstrip': False,
   'normalized': False,
   'special': True}],
 'normalizer': {'type': 'Sequence',
  'normalizers': [{'type': 'Replace',
    'pattern': {'String': '``'},
    'content': '"'},
   {'type': 'Replace', 'pattern': {'String': "''"}, 'content': '"'},
   

In [102]:
from transformers import AlbertTokenizerFast

tokenizer_new2 = AlbertTokenizerFast.from_pretrained(os.path.join(path_to_save, final_model_name, 'tokenizer'))
tokenizer_new2_json = json.loads(tokenizer_new2._tokenizer.to_str())
tokenizer_new2_json

{'version': '1.0',
 'truncation': None,
 'padding': None,
 'added_tokens': [{'id': 0,
   'content': '<pad>',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 1,
   'content': '<unk>',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 2,
   'content': '[CLS]',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 3,
   'content': '[SEP]',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 4,
   'content': '[MASK]',
   'single_word': False,
   'lstrip': True,
   'rstrip': False,
   'normalized': False,
   'special': True}],
 'normalizer': {'type': 'Sequence',
  'normalizers': [{'type': 'Replace',
    'pattern': {'String': '``'},
    'content': '"'},
   {'type': 'Replace', 'pattern': {'String': "''"}, 'content': '"'},
   

In [74]:
for k in tokenizer_json:
    if tokenizer_json[k] != tokenizer_new_json[k]:
        print('KO :', k)
    else:
        print('OK :', k)
        
print('-----')
for k in tokenizer_json['model']:
    if tokenizer_json['model'][k] != tokenizer_new_json['model'][k]:
        print('KO :', k)
    else:
        print('OK :', k)

OK : version
OK : truncation
OK : padding
OK : added_tokens
OK : normalizer
OK : pre_tokenizer
OK : post_processor
OK : decoder
KO : model
-----
OK : type
OK : unk_id
KO : vocab
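
The two-level comparison above can be generalized to any depth. The following is a rough sketch, assuming a hypothetical helper `diff_paths` and two made-up toy configurations (the vocabulary entries and scores are illustrative, not taken from the real tokenizers):

```python
def diff_paths(a, b, prefix = ''):
    """Return the dotted paths at which two nested dicts differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        paths = []
        for k in sorted(set(a) | set(b)):
            path = prefix + '.' + k if prefix else k
            if k not in a or k not in b:
                # key present on one side only
                paths.append(path)
            else:
                paths.extend(diff_paths(a[k], b[k], path))
        return paths
    # leaves (and lists) are compared as a whole
    return [] if a == b else [prefix]


# toy stand-ins for tokenizer_json and tokenizer_new_json (illustrative values)
old = {'version': '1.0', 'model': {'type': 'Unigram', 'unk_id': 1, 'vocab': [['<pad>', 0.0], ['▁the', -3.1]]}}
new = {'version': '1.0', 'model': {'type': 'Unigram', 'unk_id': 1, 'vocab': [['<pad>', 0.0], ['▁of', -3.4]]}}

print(diff_paths(old, new))
```

On the toy inputs only the `model.vocab` path is reported, which mirrors the `KO : model` / `KO : vocab` lines above.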


In [77]:
words = [wc[0] for wc in tokenizer_json['model']['vocab']]
len(words)

30000

In [95]:
len([w for w in words if set(w) & set([str(i) for i in range(10)])])

1582

In [97]:
[w for w in words if set(w) & set(['(', ')', '[', ']'])]

['[CLS]', '[SEP]', '[MASK]', '(', ')', ']', '▁[', '[', '];', '▁[]']

In [79]:
words_new = [wc[0] for wc in tokenizer_new_json['model']['vocab']]
len(words_new)

5000

In [98]:
[w for w in words_new if set(w) & set(['(', ')', '[', ']'])]

['[CLS]',
 '[SEP]',
 '[MASK]',
 ')',
 '▁(',
 ').',
 '),',
 '▁(p',
 '▁(c',
 's)',
 '▁(a',
 '▁(s',
 '▁(m',
 'd)',
 '▁(t',
 '▁(b',
 '▁(h',
 '▁(e',
 '▁(i',
 '▁(d',
 '(',
 'c)',
 '▁(n',
 '▁(f',
 '▁[',
 '▁(r',
 ']',
 'p)',
 '▁(g',
 'r)',
 '▁(l',
 '▁(1',
 '▁1)',
 ');',
 '▁2)',
 'm)',
 '▁(in',
 '▁(e.g.',
 '▁(v',
 's),',
 'g)',
 '▁(w',
 '2)',
 '1)',
 's).',
 '):',
 'v)',
 'ct)',
 '▁(n=',
 '▁(as',
 '▁(i.e.',
 '▁(or',
 '▁(1)',
 '▁(2',
 '▁(2)',
 '▁(4',
 '▁(e.g.,',
 '▁(re',
 '(s)',
 '▁(3',
 '▁(>',
 '▁(5',
 '▁(i.e.,',
 '▁(with',
 '▁(the',
 '▁(including',
 '▁(cr',
 '3)',
 '▁(6',
 '▁(bmi)',
 '▁(<',
 'rt)',
 '▁(for',
 '▁(3)',
 '▁(ecog)',
 '▁(20',
 '▁(iv)',
 '▁ii)',
 '▁(see',
 '▁(uln)',
 '▁(pk)',
 '[',
 '(such',
 '▁(mtd)',
 '▁(mri)',
 '▁(hiv)',
 '▁(week',
 '(about',
 '▁(defined',
 '▁(visit',
 '▁(primary',
 '▁(baseline',
 '▁(inclusive)',
 '▁(approximately',
 '(which',
 '▁(nsclc)',
 '▁(covid-19)']
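
To compare vocabularies of different sizes, it may help to look at the share of tokens containing a given character class rather than at the raw lists. A small sketch, assuming a hypothetical helper `tokens_containing` and a made-up toy vocabulary:

```python
def tokens_containing(vocab, chars):
    """Return the tokens of vocab that contain at least one character of chars."""
    chars = set(chars)
    return [w for w in vocab if chars & set(w)]


# toy vocabulary (illustrative, not the fitted one)
toy_vocab = ['[CLS]', '▁(', ').', '▁study', '▁(bmi)', '▁2)', 'ed', '▁20']

brackets = tokens_containing(toy_vocab, '()[]')
digits = tokens_containing(toy_vocab, '0123456789')

print('brackets: {} / {}'.format(len(brackets), len(toy_vocab)))
print('digits  : {} / {}'.format(len(digits), len(toy_vocab)))
```

Applied to `words` and `words_new`, the same helper would give comparable ratios for the 30000-token and 5000-token vocabularies.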

In [103]:
words_new2 = [wc[0] for wc in tokenizer_new2_json['model']['vocab']]
len(words_new2)

5000

In [104]:
[w for w in words_new2 if set(w) & set(['(', ')', '[', ']'])]

['[CLS]',
 '[SEP]',
 '[MASK]',
 ')',
 '▁(',
 ').',
 '),',
 '▁(p',
 '▁(c',
 's)',
 '▁(a',
 '▁(s',
 '▁(m',
 'd)',
 '▁(t',
 '▁(b',
 '▁(h',
 '▁(e',
 '▁(i',
 '▁(d',
 '(',
 'c)',
 '▁(n',
 '▁(f',
 '▁[',
 '▁(r',
 ']',
 'p)',
 '▁(g',
 'r)',
 '▁(l',
 '▁(1',
 '▁1)',
 ');',
 '▁2)',
 'm)',
 '▁(in',
 '▁(e.g.',
 '▁(v',
 's),',
 'g)',
 '▁(w',
 '2)',
 '1)',
 's).',
 '):',
 'v)',
 'ct)',
 '▁(n=',
 '▁(as',
 '▁(i.e.',
 '▁(or',
 '▁(1)',
 '▁(2',
 '▁(2)',
 '▁(4',
 '▁(e.g.,',
 '▁(re',
 '(s)',
 '▁(3',
 '▁(>',
 '▁(5',
 '▁(i.e.,',
 '▁(with',
 '▁(the',
 '▁(including',
 '▁(cr',
 '3)',
 '▁(6',
 '▁(bmi)',
 '▁(<',
 'rt)',
 '▁(for',
 '▁(3)',
 '▁(ecog)',
 '▁(20',
 '▁(iv)',
 '▁ii)',
 '▁(see',
 '▁(uln)',
 '▁(pk)',
 '[',
 '(such',
 '▁(mtd)',
 '▁(mri)',
 '▁(hiv)',
 '▁(week',
 '(about',
 '▁(defined',
 '▁(visit',
 '▁(primary',
 '▁(baseline',
 '▁(inclusive)',
 '▁(approximately',
 '(which',
 '▁(nsclc)',
 '▁(covid-19)']

In [10]:
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

In [57]:
tokenizer.tokenize("This study hyperplasia (CAH). This protocol is designed to assess both acute and chronic effects of the calcium channel antagonist, nifedipine, on the hypothalamic-pituitary-adrenal axis in patients with congenital adrenal hyperplasia. The multicenter trial is composed of two phases and will involve a double-blind, placebo-controlled parallel design. The goal of Phase I is to examine the ability of nifedipine vs. placebo to decrease adrenocorticotropic hormone (ACTH) levels, as well as to begin to assess the dose-dependency of nifedipine effects. The goal of Phase II is to evaluate the long-term effects of nifedipine; that is, can attenuation of ACTH release by nifedipine permit a decrease in the dosage of glucocorticoid needed to suppress the HPA axis? Such a decrease would, in turn, reduce the deleterious effects of glucocorticoid treatment in CAH. diagnosed with Congenital Adrenal Hyperplasia (CAH), normal ECG during baseline evaluation, history of liver disease, or elevated liver function tests, history of cardiovascular disease\n")

['▁this',
 '▁study',
 '▁hyper',
 'plas',
 'i',
 'a',
 '▁(c',
 'a',
 'h',
 ').',
 '▁this',
 '▁protocol',
 '▁is',
 '▁designed',
 '▁to',
 '▁assess',
 '▁both',
 '▁acute',
 '▁and',
 '▁chronic',
 '▁effects',
 '▁of',
 '▁the',
 '▁calcium',
 '▁channel',
 '▁antagonist',
 ',',
 '▁',
 'n',
 'i',
 'f',
 'ed',
 'i',
 'pine',
 ',',
 '▁on',
 '▁the',
 '▁hypo',
 'th',
 'al',
 'a',
 'mic',
 '-',
 'p',
 'itu',
 'i',
 't',
 'ary',
 '-',
 'a',
 'd',
 're',
 'nal',
 '▁axis',
 '▁in',
 '▁patients',
 '▁with',
 '▁congenital',
 '▁adrenal',
 '▁hyper',
 'plas',
 'i',
 'a',
 '.',
 '▁the',
 '▁multicenter',
 '▁trial',
 '▁is',
 '▁com',
 'posed',
 '▁of',
 '▁two',
 '▁phase',
 's',
 '▁and',
 '▁will',
 '▁involve',
 '▁',
 'a',
 '▁double-blind,',
 '▁placebo',
 '-controlled',
 '▁parallel',
 '▁design',
 '.',
 '▁the',
 '▁goal',
 '▁of',
 '▁phase',
 '▁',
 'i',
 '▁is',
 '▁to',
 '▁examine',
 '▁the',
 '▁ability',
 '▁of',
 '▁',
 'n',
 'i',
 'f',
 'ed',
 'i',
 'pine',
 '▁vs.',
 '▁placebo',
 '▁to',
 '▁decrease',
 '▁',
 'a',
 'd',
 're'

In [53]:
s = "This study will test the ability of extended release nifedipine (Procardia XL), a blood pressure medication, to permit a decrease in the dose of glucocorticoid medication children take to treat congenital adrenal hyperplasia (CAH). This protocol is designed to assess both acute and chronic effects of the calcium channel antagonist, nifedipine, on the hypothalamic-pituitary-adrenal axis in patients with congenital adrenal hyperplasia. The multicenter trial is composed of two phases and will involve a double-blind, placebo-controlled parallel design. The goal of Phase I is to examine the ability of nifedipine vs. placebo to decrease adrenocorticotropic hormone (ACTH) levels, as well as to begin to assess the dose-dependency of nifedipine effects. The goal of Phase II is to evaluate the long-term effects of nifedipine; that is, can attenuation of ACTH release by nifedipine permit a decrease in the dosage of glucocorticoid needed to suppress the HPA axis? Such a decrease would, in turn, reduce the deleterious effects of glucocorticoid treatment in CAH. diagnosed with Congenital Adrenal Hyperplasia (CAH), normal ECG during baseline evaluation, history of liver disease, or elevated liver function tests, history of cardiovascular disease\n"
tokens = tokenizer.tokenize(s)
ids = tokenizer(s)['input_ids'][1:-1]
list(zip(tokens, ids))

[('▁this', 34),
 ('▁study', 29),
 ('▁will', 16),
 ('▁test', 138),
 ('▁the', 6),
 ('▁ability', 619),
 ('▁of', 8),
 ('▁extended', 2795),
 ('▁release', 1782),
 ('▁', 5),
 ('n', 31),
 ('i', 19),
 ('f', 36),
 ('ed', 39),
 ('i', 19),
 ('pine', 1920),
 ('▁(', 47),
 ('pro', 574),
 ('cardia', 3154),
 ('▁', 5),
 ('x', 80),
 ('l', 42),
 ('),', 67),
 ('▁', 5),
 ('a', 10),
 ('▁blood', 86),
 ('▁pressure', 277),
 ('▁medication', 354),
 (',', 7),
 ('▁to', 13),
 ('▁per', 206),
 ('mit', 1180),
 ('▁', 5),
 ('a', 10),
 ('▁decrease', 729),
 ('▁in', 14),
 ('▁the', 6),
 ('▁dose', 115),
 ('▁of', 8),
 ('▁glucocorticoid', 4020),
 ('▁medication', 354),
 ('▁children', 200),
 ('▁take', 523),
 ('▁to', 13),
 ('▁treat', 1096),
 ('▁congenital', 2336),
 ('▁adrenal', 3968),
 ('▁hyper', 590),
 ('plas', 2166),
 ('i', 19),
 ('a', 10),
 ('▁(c', 220),
 ('a', 10),
 ('h', 48),
 (').', 66),
 ('▁this', 34),
 ('▁protocol', 281),
 ('▁is', 24),
 ('▁designed', 589),
 ('▁to', 13),
 ('▁assess', 173),
 ('▁both', 195),
 ('▁acute', 255),

In [31]:
tokenizer.tokenize('( presentation)')

['▁', '(', '▁presentation', ')']

In [26]:
tokenizer.__class__

transformers.models.albert.tokenization_albert_fast.AlbertTokenizerFast

In [99]:
from transformers import AlbertTokenizerFast

In [111]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')



In [116]:
# tokenizer = tokenizer.train_new_from_iterator(
#     text_iterator = [['lol', 'a(lol)', 'a(lol) ', 'a(lol )', 'a( lol)', 'a( lol )']], 
#     vocab_size = 50,
#     special_tokens_map = {'[UNK]': '<unk>', '[PAD]': '<pad>'}
# )

In [12]:
from tokenizers import decoders, models, normalizers, pre_tokenizers, processors, trainers, Tokenizer

import json

In [10]:
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

In [13]:
json.loads(tokenizer._tokenizer.to_str())

{'version': '1.0',
 'truncation': None,
 'padding': None,
 'added_tokens': [{'id': 0,
   'content': '<pad>',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 1,
   'content': '<unk>',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 2,
   'content': '[CLS]',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 3,
   'content': '[SEP]',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 4,
   'content': '[MASK]',
   'single_word': False,
   'lstrip': True,
   'rstrip': False,
   'normalized': False,
   'special': True}],
 'normalizer': {'type': 'Sequence',
  'normalizers': [{'type': 'Replace',
    'pattern': {'String': '``'},
    'content': '"'},
   {'type': 'Replace', 'pattern': {'String': "''"}, 'content': '"'},
   

In [14]:
# raw strings avoid invalid escape sequences in the regex patterns
parenthesis_1 = r"(\(\))"
parenthesis_2 = r"\)"
behavior = "isolated"

custom_pre_tokenizer = pre_tokenizers.Sequence([
    #pre_tokenizers.WhitespaceSplit(),
    pre_tokenizers.Whitespace(),
    #pre_tokenizers.Split(Regex(parenthesis_1), behavior=behavior),
    #pre_tokenizers.Split(parenthesis_2, behavior=behavior),
    #pre_tokenizers.Punctuation(behavior = 'isolated'),
    pre_tokenizers.Metaspace(replacement="▁", add_prefix_space=True),
])

In [15]:
# tokenizer = tokenizer.train_new_from_iterator(
#     text_iterator = [t.replace('(', '(\t') for t in ['lol', 'a(lol)', 'a (lol) ', 'a(lol )', 'a( lol)', 'a( lol )']], 
#     vocab_size = 50,
# )

In [16]:
tokenizer = tokenizer.train_new_from_iterator(
    text_iterator = ['lol', 'a(lol)', 'a (lol) ', 'a(lol )', 'a( lol)', 'a( lol )', 'a( lol)', 'a( lol)a', 'a ( lol)'], 
    vocab_size = 10,
)

In [17]:
tokenizer.backend_tokenizer.normalizer.normalize_str("a (lol)")

'a (lol)'

In [18]:
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("a (lol)")

[('▁a', (0, 1)), ('▁(lol)', (2, 7))]
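
The output above shows why parentheses end up glued to words: the pre-tokenizer only splits on whitespace, so `(lol)` reaches the Unigram model as a single piece. The following is a simplified pure-Python imitation, assuming a hypothetical function `metaspace_pre_tokenize`; the real `WhitespaceSplit` and `Metaspace` components handle more cases than this sketch:

```python
import re


def metaspace_pre_tokenize(text, replacement = '▁'):
    """Split on whitespace and prefix each piece with the replacement character."""
    return [
        (replacement + m.group(), (m.start(), m.end()))
        for m in re.finditer(r'\S+', text)
    ]


pieces = metaspace_pre_tokenize('a (lol)')
print(pieces)
```

Each piece keeps its character offsets in the input string, and no boundary is introduced around `(` or `)`.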

In [19]:
tokenizer.tokenize('a (lol)')

['▁a', '▁(', 'lol)']

In [20]:
lol = {v: k for k, v in tokenizer.get_vocab().items()}
lol = {i: lol[i] for i in range(len(lol))}
lol

{0: '<pad>',
 1: '<unk>',
 2: '[CLS]',
 3: '[SEP]',
 4: '[MASK]',
 5: '▁a(',
 6: '▁lol)',
 7: ')',
 8: '▁',
 9: '▁lol',
 10: '▁(',
 11: 'lol)',
 12: '▁a',
 13: '▁a(lol',
 14: '(',
 15: 'l',
 16: 'o',
 17: 'a'}

In [86]:
tokenizer.tokenize('a (lol)')

['▁', 'a', '▁', '(', '▁', 'l', 'o', 'l', '▁', ')']

In [21]:
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

In [174]:
# see the different pre-tokenizers at https://huggingface.co/docs/tokenizers/components#pretokenizers

custom_pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),
    pre_tokenizers.Metaspace(replacement = "▁", add_prefix_space = True),
    pre_tokenizers.Digits(),
    pre_tokenizers.Punctuation(behavior = "isolated"),
])
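
The intent of this sequence is to cut each whitespace-delimited word further, so that digit runs and punctuation marks become separate pieces before the Unigram model sees them. A rough pure-Python sketch of that intent, assuming a hypothetical function `pre_tokenize` and ASCII punctuation only (the library components rely on Unicode categories and may differ in details):

```python
import re
import string

# ASCII punctuation only: an assumption of this sketch
PUNCT = re.escape(string.punctuation)
# a digit run, a single punctuation mark, or a run of anything else
PIECE = re.compile(r'\d+|[{p}]|[^\d{p}]+'.format(p = PUNCT))


def pre_tokenize(text, replacement = '▁'):
    pieces = []
    for word in text.split():
        # prefix the word with the metaspace marker, then isolate digits and punctuation
        pieces.extend(PIECE.findall(replacement + word))
    return pieces


pieces = pre_tokenize('3-metlola (lol)')
print(pieces)
```

With such boundaries in place, a trained vocabulary can no longer contain tokens that mix letters with parentheses or digits.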

In [175]:
new_tokenizer = Tokenizer(models.Unigram())
new_tokenizer.normalizer = tokenizer.backend_tokenizer.normalizer
new_tokenizer.pre_tokenizer =  custom_pre_tokenizer #tokenizer.backend_tokenizer.pre_tokenizer # 
new_tokenizer.post_processor = tokenizer.backend_tokenizer.post_processor
new_tokenizer.decoder = tokenizer.backend_tokenizer.decoder

In [176]:
# vocab_size can behave really unexpectedly,
# see https://github.com/huggingface/tokenizers/issues/903

trainer = trainers.UnigramTrainer(vocab_size = 1000, special_tokens=["[CLS]", "[SEP]", "<unk>", "<pad>", "[MASK]"], unk_token="<unk>")
new_tokenizer.train_from_iterator(['lol lol,', 'mama', 'ha', 'loli', 'a(lol).', '3-metlol a (lol lol lolilol lolol) .', 'a(celol , mam) .', 'a. (ha, mama lol)', 'a( lol )', 'a( mam lol)', 'a( lol)a', 'a ( lol)'], trainer=trainer)

In [177]:
from transformers import AlbertTokenizerFast

new_tokenizer = AlbertTokenizerFast(tokenizer_object=new_tokenizer)

In [178]:
new_tokenizer.tokenize('3-metlola (lol)')

['▁', '3', '-', 'm', 'e', 't', 'lol', 'a', '▁', '(', 'lol', ')']

In [179]:
lol = {v: k for k, v in new_tokenizer.get_vocab().items()}
lol = {i: lol[i] for i in range(len(lol))}
lol

{0: '[CLS]',
 1: '[SEP]',
 2: '<unk>',
 3: '<pad>',
 4: '[MASK]',
 5: '▁',
 6: 'a',
 7: ')',
 8: '(',
 9: '▁lol',
 10: 'lol',
 11: '.',
 12: '▁mam',
 13: ',',
 14: 'e',
 15: 'h',
 16: '▁loli',
 17: 'ol',
 18: 'm',
 19: 'o',
 20: 'i',
 21: 't',
 22: 'c',
 23: 'l',
 24: '-',
 25: '3'}

In [17]:
import json

In [18]:
tokenizer_json = json.loads(tokenizer._tokenizer.to_str())
tokenizer_json

{'version': '1.0',
 'truncation': None,
 'padding': None,
 'added_tokens': [{'id': 0,
   'content': '<pad>',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 1,
   'content': '<unk>',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 2,
   'content': '[CLS]',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 3,
   'content': '[SEP]',
   'single_word': False,
   'lstrip': False,
   'rstrip': False,
   'normalized': False,
   'special': True},
  {'id': 4,
   'content': '[MASK]',
   'single_word': False,
   'lstrip': True,
   'rstrip': False,
   'normalized': False,
   'special': True}],
 'normalizer': {'type': 'Sequence',
  'normalizers': [{'type': 'Replace',
    'pattern': {'String': '``'},
    'content': '"'},
   {'type': 'Replace', 'pattern': {'String': "''"}, 'content': '"'},
   

In [26]:
from tokenizers import NormalizedString, Regex
from tokenizers.normalizers import Normalizer, StripAccents

In [37]:
class CustomNormalizer:
    def normalize(self, normalized: NormalizedString):
        # Most of these can be replaced by a `Sequence` combining some provided Normalizer
        # (i.e. Sequence([NFKC(), Replace(Regex(r"\s+"), " "), Lowercase()])),
        # and it should be the preferred way. That being said, here is an example of the kind
        # of things that can be done here:
        normalized.nfkd()
        normalized.replace(Regex('``'), '"')
        normalized.replace(Regex("''"), '"')
        # raw string avoids an invalid escape sequence; a tab is inserted after each '('
        normalized.replace(Regex(r'\('), '(\t')
        normalized.lowercase()

In [38]:
tokenizer.backend_tokenizer.normalizer = Normalizer.custom(CustomNormalizer())

In [39]:
tokenizer.backend_tokenizer.normalizer.normalize_str("à (lol)")

'à (\tlol)'
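
The same normalization steps can be reproduced with the standard library, which makes it easier to see what each one does. A minimal sketch, assuming a hypothetical function `normalize` that mirrors the steps of `CustomNormalizer`; like the class above, it does not strip accents, so the accent survives as a combining character:

```python
import unicodedata


def normalize(text):
    # NFKD decomposes accented characters into base character + combining mark
    text = unicodedata.normalize('NFKD', text)
    text = text.replace('``', '"').replace("''", '"')
    # insert a tab after each opening parenthesis
    text = text.replace('(', '(\t')
    return text.lower()


out = normalize('À (LOL)')
print(repr(out))
print(len(out))
```

The result is one character longer than its rendering suggests, since the decomposed accent counts as a separate code point.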

In [40]:
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("a (lol)")

[('▁a', (0, 1)), ('▁(lol)', (2, 7))]

In [41]:
tokenizer.tokenize('a(lol)')

['▁a(', '▁lol', ')']

In [135]:
tokenizer.decode(tokenizer('a (lol)')['input_ids'])

'[CLS] a ( lol)[SEP]'

In [82]:
json.loads(tokenizer._tokenizer.to_str())['normalizer']['normalizers']

[{'type': 'Replace', 'pattern': {'String': '``'}, 'content': '"'},
 {'type': 'Replace', 'pattern': {'String': "''"}, 'content': '"'},
 {'type': 'NFKD'},
 {'type': 'StripAccents'},
 {'type': 'Lowercase'},
 {'type': 'Precompiled',
  'precompiled_charsmap': 'ALQCAACEAAAAAACAAQAAgMz8AgC4BQAAhyIAgMzkAgC4PQAAeyIAgMzsAgC4BQAAiyIAgMw8AADNvAAAmwkAgJ4JAIChCQCAgx0AAIAZAACBGQAAPR0AgDUdAIBNHQCARR0AgIAxAACBMQAApAkAgIkxAAA9WAMAPEgDAEAKAIA+aAMAAYUAAIQBAQADjQAAAokAAAWVAAAEkQAAB50AAAaZAAAJqQAACKEAAAutAAAKpQAADbkAAAy9AAAPvQAADrkAABHFAAAQwQAAE80AABLJAAAV1QAAFNEAABfdAAAW2QAAGeUAABjhAAAb7QAAGukAAB31AAAc8QAAH/0AAB75AABhOAkAZR0AgGNADgBi8AgAZSgPAGSADgBn2A8AZvAPAGlwDABoMAwAa/AMAGrYDABtSA0AbBwNAG8QEgBubA0ARgoAgHAMEwBzqBMAcuwTAHUoEAB0TBAAd9ARAHYUEAB50BYAePQQAF0dAIB69BYAdR0AgG0dAIB/fQEAhgwAgEGAAgDeCwCAQxgAAELAAABFSAAARGAAAEeQBgBGhAEASSgGAEhsAQBLOAcASvAHAE1wBwBMRAcAT/AEAE7MBACnCQCAUCwFAFOgCgBSEAUAVQAKAFRQCgBX0AgAVhALAFlICABYuAgAhBEAAFo8CACA9QAAgZ0AANgLAIAtHQCAg2kCAIJFAgCBNQIAgDUCAIdtAwCGVQMAgTkAAIRlAgAXDACAigEEAInV

[Table of contents](#TOC)