## exBERT Approach
For exBERT, we augment the original BERT’s embedding layer with an extension embedding layer
and corresponding domain-specific extension vocabulary, and add an extension module to each
transformer layer.

In [None]:
#TO DO https://mccormickml.com/2020/06/22/domain-specific-bert-tutorial/

Run in terminal:
    
conda create -y --name exbert2 python==3.8
conda activate exbert2
conda install -y ipykernel
conda install -y ipython_genutils
ipython kernel install --user --name=exbert2

pip install datasets git+https://github.com/huggingface/transformers/
pip install torch
pip install boto3
pip install tensorflow

pip install transformers
pip install pytorch-pretrained-bert
pip install pandas
pip install fastai
conda install pytorch torchvision -c pytorch

#### Load Dataset

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
!ls drive/MyDrive/Colab\ Notebooks/GitHub

bc5cdr-ner  exBERT
bluebert    nd00333_AZMLND_Optimizing_a_Pipeline_in_Azure-Starter_Files


In [None]:
import pandas as pd
df = pd.read_csv("data/paragrafs.csv")
df.head(1)

FileNotFoundError: ignored

In [None]:
text = df["txts"]
text.to_csv("data/paragraphs.txt", sep='\n', index=False, header=False)

In [None]:
text_file = "data/paragraphs.txt"

## Extension Of Vocabulary And Embedding Layer

1. Derive an extension vocabulary from the target domain (biomedical for this paper) corpus via WordPiece (Wu et al., 2016), while keeping the original general vocabulary used by BERT unchanged. 

In [None]:
# !pip install tokenizers
from tokenizers import BertWordPieceTokenizer

# initialize
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=False
)


In [None]:
# and train
tokenizer.train(files=text_file, vocab_size=30_000, min_frequency=2,
                limit_alphabet=1000, wordpieces_prefix='##',
                special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'])






In [None]:
tokenizer.save_model('./output', 'WordPiece')

['./output/WordPiece-vocab.txt']

2. Delete any token already present in the original general vocabulary from the extension vocabulary to ensure the extension vocabulary is an absolute complement to the original vocabulary. 

In [None]:
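# Read both vocabularies; the context managers close the files afterwards
with open("./output/WordPiece-vocab.txt", "r") as new_vocab:
    new_voc_file = new_vocab.readlines()
with open("./pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16_tf/vocab.txt", "r") as bert_vocab:
    bert_vocab_file = bert_vocab.readlines()

# Keep only tokens absent from the BERT vocabulary (set lookup is faster than a list scan)
bert_vocab_set = set(bert_vocab_file)
ext_vocab = [x for x in new_voc_file if x not in bert_vocab_set]
print(ext_vocab)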

['A\n', 'B\n', 'C\n', 'D\n', 'E\n', 'F\n', 'G\n', 'H\n', 'I\n', 'J\n', 'K\n', 'L\n', 'M\n', 'N\n', 'O\n', 'P\n', 'Q\n', 'R\n', 'S\n', 'T\n', 'U\n', 'V\n', 'W\n', 'X\n', 'Y\n', 'Z\n', '##F\n', '##A\n', '##R\n', '##C\n', '##T\n', '##I\n', '##O\n', '##D\n', '##H\n', '##B\n', '##M\n', '##G\n', '##V\n', '##Q\n', '##L\n', '##S\n', '##W\n', '##U\n', '##X\n', '##N\n', '##K\n', '##Z\n', '##P\n', '##Y\n', '##E\n', '##J\n', 'pati\n', 'Th\n', '##LL\n', 'CLL\n', 'resp\n', '##atment\n', 'The\n', '##erap\n', 'wh\n', 'dis\n', '##erapy\n', '##ymp\n', 'respons\n', 'comp\n', '##ymph\n', 'In\n', 'lymph\n', '##tinib\n', 'rel\n', '##rutinib\n', 'exp\n', '##xim\n', 'inf\n', '##ression\n', '##brutinib\n', 'CD\n', '##edi\n', '##clud\n', '##orm\n', 'prog\n', 'ibrutinib\n', '##ecti\n', '##fter\n', '##inical\n', 'eff\n', '##umab\n', 'rece\n', '##itu\n', 'includ\n', '##ocy\n', '##ximab\n', '##ituximab\n', 'medi\n', 'rituximab\n', 'analys\n', '##ased\n', '##anc\n', '##actor\n', '##alu\n', 'cons\n', 'fol\n', '##xic\
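
The list above contains entries such as 'A', 'B', and 'The'. One possible reading is that the new tokenizer was trained with `lowercase=False` while the BlueBERT vocabulary is uncased, so cased tokens never match. The sketch below shows an order-preserving complement on toy data. The function name `complement_vocab` and the `lowercase_base_match` switch are illustrative assumptions, not part of the exBERT code.

```python
def complement_vocab(new_tokens, base_tokens, lowercase_base_match=False):
    """Return tokens from new_tokens that are absent from base_tokens, keeping order.

    lowercase_base_match is a hypothetical switch: when True, a token is also
    dropped if its lowercased form is in the base vocabulary.
    """
    base = {t.strip() for t in base_tokens}
    seen = set()
    result = []
    for raw in new_tokens:
        tok = raw.strip()
        if not tok or tok in seen:
            continue  # skip blank lines and repeated tokens
        key = tok.lower() if lowercase_base_match else tok
        if tok in base or key in base:
            continue  # already covered by the base vocabulary
        seen.add(tok)
        result.append(tok)
    return result


# Toy data only; these are not the real vocabularies
base = ["[PAD]", "the", "a", "##s", "lymph"]
new = ["[PAD]\n", "The\n", "A\n", "ibrutinib\n", "lymph\n", "##LL\n"]

print(complement_vocab(new, base))
print(complement_vocab(new, base, lowercase_base_match=True))
```

With exact matching, 'The' and 'A' survive in the extension list; with the lowercase check they are treated as already covered. Which behavior is appropriate depends on whether the downstream tokenizer lowercases its input.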

In [None]:
with open("./output/extention-vocab.txt", "w") as f:
    for item in ext_vocab:
        f.write("%s" % item)

In [None]:
print("Extension Vocab Length Is: ", len(ext_vocab))

Extension Vocab Length Is:  13824


In [None]:
print("Bert Vocab Length Is: ", len(bert_vocab_file))

Bert Vocab Length Is:  30522


In [None]:
filenames = ["./pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16_tf/vocab.txt", "./output/extention-vocab.txt"]
with open("./output/overall_voc.txt", "w") as outfile:
    for fname in filenames:
        with open(fname) as infile:
            for line in infile:
                outfile.write(line)

In [None]:
overall_v = open("./output/overall_voc.txt","r")
overall_voc_file = overall_v.readlines()
print("Overall Vocab Length Is: ", len(overall_voc_file))

Overall Vocab Length Is:  44346
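
Because the overall vocabulary file is the original vocabulary followed by the extension vocabulary, extension tokens receive ids that start right after the last original id. A minimal sketch of that id layout on toy tokens follows; `build_overall_vocab` is a hypothetical helper, not a function from the exBERT repository.

```python
def build_overall_vocab(base_tokens, ext_tokens):
    """Map each token to an id; extension ids start right after the base ids."""
    token_to_id = {tok: i for i, tok in enumerate(base_tokens)}
    offset = len(base_tokens)
    for j, tok in enumerate(ext_tokens):
        if tok in token_to_id:
            raise ValueError(f"token {tok!r} is already present in the vocabulary")
        token_to_id[tok] = offset + j
    return token_to_id


# Toy tokens only
base_tokens = ["[PAD]", "[UNK]", "is", "a"]
ext_tokens = ["thalamus", "ibrutinib"]
vocab = build_overall_vocab(base_tokens, ext_tokens)
print(vocab["is"], vocab["thalamus"], vocab["ibrutinib"])

# Size check using the counts printed in this notebook
print(30522 + 13824)
```

The same offset is what lets a model decide, from the id alone, whether a token belongs to the original or the extension vocabulary.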


We then add a corresponding embedding layer for the extension vocabulary, which is randomly initialized at the beginning and can be optimized during pre-training. 
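
The idea of a second, randomly initialized embedding table can be sketched in plain Python. This is a toy illustration under stated assumptions, not the exBERT implementation: the class name `ExtendedEmbedding`, the Gaussian initialization, and the default `std=0.02` are all choices made here for the example.

```python
import random


class ExtendedEmbedding:
    """Toy lookup: ids below base_size use the pre-trained table, the rest use the extension table."""

    def __init__(self, base_table, ext_size, dim, seed=0, std=0.02):
        self.base_table = base_table      # stands in for pre-trained vectors
        self.base_size = len(base_table)
        rng = random.Random(seed)
        # Extension vectors start from small random values (assumed initializer)
        self.ext_table = [[rng.gauss(0.0, std) for _ in range(dim)]
                          for _ in range(ext_size)]

    def lookup(self, token_ids):
        vectors = []
        for tid in token_ids:
            if tid < self.base_size:
                vectors.append(self.base_table[tid])
            else:
                # Shift the id so the first extension token maps to row 0
                vectors.append(self.ext_table[tid - self.base_size])
        return vectors


dim = 4
base_table = [[float(i)] * dim for i in range(3)]   # 3 toy "pre-trained" rows
emb = ExtendedEmbedding(base_table, ext_size=2, dim=dim)
out = emb.lookup([0, 2, 3, 4])
print(len(out), out[1])
```

In a real model both tables would be trainable tensors, and only the training setup decides whether the original table stays frozen.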

The overall vocabulary, containing 30,522 (original) and 13,824 (extension) tokens, is used for tokenizing input text. This approach contrasts with SciBERT (Beltagy et al., 2019), which replaces the entire vocabulary and then pre-trains the model from scratch. We tried different extension vocabulary sizes and found that increasing the vocabulary size has only a small impact on performance (e.g., increasing the extension vocabulary size by an additional 12K words improves performance by only 0.0041 F1 score). This is because there is no clear drop-off in the frequency of occurrence across the vocabulary. 

In this experiment we will be using pretrained BlueBert model downloaded from here:
https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/NCBI-BERT/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16.zip

In [None]:
# Import generic wrappers
from transformers import AutoModel, AutoTokenizer 
# Define the model repo
model_name = "bionlp/bluebert_pubmed_mimic_uncased_L-24_H-1024_A-16" 



In [None]:
# Download pytorch model
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of the model checkpoint at bionlp/bluebert_pubmed_mimic_uncased_L-24_H-1024_A-16 were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Save a trained model
torch_model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model it-self

In [None]:
output_dir_torch = "./pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16_torch"
torch_model_to_save.save_pretrained(output_dir_torch)

In [None]:
#In terminal, download pretrained model (tf)

#cd pretrained_files

#wget https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/NCBI-BERT/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16.zip

#unzip NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16.zip

#mv NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16 NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16_tf

In [None]:
!ls ./pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16_tf

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
bert_config.json                    bert_model.ckpt.meta
bert_model.ckpt.data-00000-of-00001 vocab.txt
bert_model.ckpt.index


In [None]:
!mkdir ./pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16_tok
output_dir_tok = "./pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16_tok"
tokenizer.save_pretrained(output_dir_tok)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


('./pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16_tok/tokenizer_config.json',
 './pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16_tok/special_tokens_map.json',
 './pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16_tok/vocab.txt',
 './pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16_tok/added_tokens.json',
 './pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16_tok/tokenizer.json')

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
import pytorch_pretrained_bert
from pytorch_pretrained_bert.tokenization import BertTokenizer
from pytorch_pretrained_bert.modeling import BertForSequenceClassification
from pytorch_pretrained_bert.optimization import BertAdam
from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE

In [None]:
import os
tf_path = os.path.abspath("./pretrained_files/bert_model/bert_model.ckpt")
tf_path

'/Users/lsolis/Documents/GitHub/exBERT/pretrained_files/bert_model/bert_model.ckpt'

In [None]:
import tensorflow as tf
# Load weights from TF model
init_vars = tf.train.list_variables(tf_path)


In [None]:
import os
# Reuse the PyTorch output directory defined earlier in this notebook
output_model_file = os.path.join(output_dir_torch, "pytorch_model.bin")

In [None]:
torch.save(torch_model_to_save.state_dict(), output_model_file)


In [None]:
stat_dict = torch.load(output_model_file)


In [None]:
model.load_state_dict(stat_dict)

<All keys matched successfully>

In [None]:
model.load_state_dict(stat_dict, strict=False)

<All keys matched successfully>

In [None]:
stat_dict = torch.load(output_model_file, map_location='cpu')

In [None]:
model_state_dict = torch.load(output_model_file)
model_state_dict

OrderedDict([('embeddings.position_ids',
              tensor([[  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
                        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
                        28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,  41,
                        42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,  55,
                        56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,  68,  69,
                        70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,
                        84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,  97,
                        98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
                       112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
                       126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139,
                       140, 1

In [None]:
# Load a trained model that you have fine-tuned
model_state_dict = torch.load(output_model_file)
model = BertForSequenceClassification.from_pretrained(args.bert_model, state_dict=model_state_dict)
#model.to(device)

In [None]:
!python data_preprocess.py -voc ./output/WordPiece-vocab.txt -ls 512 -dp ./data/paragraphs.txt -n_c 5 -rd 1 -sp ./output/test_run_data.pkl

In [None]:
!echo $PATH

/Users/lsolis/opt/anaconda3/bin:/Users/lsolis/opt/anaconda3/condabin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin


In [None]:
!PATH=/Users/lsolis/opt/anaconda3/envs/exbert/lib/python3.7/site-packages:/Users/lsolis/opt/anaconda3/envs/exbert/bin:/Users/lsolis/opt/anaconda3/condabin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin

In [None]:
!echo $PATH

/Users/lsolis/opt/anaconda3/bin:/Users/lsolis/opt/anaconda3/condabin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin


In [None]:
!python Pretraining.py -e 1 -b 256 -sp ./output/exBERT/ -dv -1 -lr 1e-04 -str exBERT -config ./pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16/bert_config.json ./config_and_vocab/exBERT/bert_config_ex_s3.json -vocab ./output/WordPiece-vocab.txt -pm_p_tf ./pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16/bert_model.ckpt.index -dp ./output/test_run_data.pkl -ls 512 -p 1 -t_ex_only ""



Traceback (most recent call last):
  File "/Users/lsolis/Documents/GitHub/exBERT/Pretraining.py", line 10, in <module>
    import torch as t
ModuleNotFoundError: No module named 'torch'


In [None]:
python Pretraining.py -e 1 \
  -b 256 \
  -sp path_to_storage \
  -dv 0 1 2 3 -lr 1e-04 \
  -str exBERT \
  -config path_to_config_file_of_the_OFF_THE_SHELF_MODEL ./config_and_vocab/exBERT/bert_config_ex_s3.json \
  -vocab ./config_and_vocab/exBERT/exBERT_vocab.txt \
  -pm_p path_to_state_dict_of_the_OFF_THE_SHELF_MODEL \
  -dp path_to_your_training_data \
  -ls 128 \
  -p 1

In [None]:
python Pretraining.py -e 1 \
  -b 256 \
  -sp ./output/exBERT/ \
  -dv -1 \
  -lr 1e-04 \
  -str exBERT \
  -config ./pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16/ ./config_and_vocab/exBERT/bert_config_ex_s3.json \
  -vocab ./output/WordPiece-vocab.txt \
  -pm_p ./pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16/vocab.txt \
  -dp ./output/test_run_data.pkl \
  -ls 128 \
  -p 1 \
  -t_ex_only ""

In [None]:
python Pretraining.py -e 1 -b 256 -sp ./output/exBERT/ -dv -1 -lr 1e-04 -str exBERT -config ./pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16/ ./config_and_vocab/exBERT/bert_config_ex_s3.json -vocab ./output/WordPiece-vocab.txt -pm_p ./pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16/vocab.txt \
  -dp ./output/test_run_data.pkl \
  -ls 128 \
  -p 1 \
  -t_ex_only ""

In [None]:
bert_vocab_file = './pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16/vocab.txt'
my_vocab_file = './output/WordPiece-vocab.txt'


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


grep: brackets ([ ]) not balanced


CalledProcessError: Command 'b'bert_vocab_file=./pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16/vocab.txt\nmy_vocab_file=./output/WordPiece-vocab.txt\ngrep -v -f $bert_vocab_file $my_vocab_file\n'' returned non-zero exit status 2.

In [None]:
my_vocab_file

In [None]:
from transformers import BertTokenizer
from transformers import BertModel
import torch
#!mkdir /Users/lsolis/Documents/GitHub/exBERT/pretrained_files
blue_dir = '/Users/lsolis/Documents/GitHub/exBERT/pretrained_files/NCBI_BERT_pubmed_mimic_uncased_L-24_H-1024_A-16/'
tokenizer = BertTokenizer.from_pretrained(blue_dir)
model = BertModel.from_pretrained(blue_dir)
input_ids = torch.tensor(tokenizer.encode("This is a sample text.")).unsqueeze(0)  
outputs = model(input_ids)

In [None]:
pretrained_weights = 'bionlp/bluebert_pubmed_mimic_uncased_L-24_H-1024_A-16'

tokenizer_en = GPT2TokenizerFast.from_pretrained(pretrained_weights)
tokenizer_en.pad_token = tokenizer_en.eos_token


Further, increasing vocabulary size increases time-to-convergence, so to bound the convergence time we choose a relatively small extension vocabulary size.
As illustrated in Figure 1(a), the output embedding of a given sentence consists of embedding vectors from both the original and extension embedding layers. Taking the sentence ‘Thalamus is a part of brain’ as an example, our overall vocabulary will tokenize it into six tokens (‘thalamus’, ‘is’, ‘a’, ‘part’, ‘of’, ‘brain’), with the embedding vector of ‘thalamus’ coming from the extension embedding layer and all other tokens’ embedding vectors from the original pre-trained embedding layer. Without the extension vocabulary, the original BERT might have tokenized ‘thalamus’ into three tokens (‘tha’, ‘##lam’, ‘##us’), compared to ‘thalamus’ tokenized as a single word under our method.
Therefore, by adding the extension vocabulary and corresponding embedding layer, exBERT enables more meaningful tokenization of input text.
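
The effect of an extension token on tokenization can be reproduced with a small greedy longest-match-first WordPiece splitter. This is a sketch on a toy vocabulary; the helper names `wordpiece` and `tokenize` are illustrative, and the real tokenizers also handle punctuation, casing options, and maximum word length.

```python
def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first WordPiece split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(sentence, vocab):
    return [p for w in sentence.lower().split() for p in wordpiece(w, vocab)]


# Toy vocabularies, not the real BERT or extension vocabularies
base_vocab = {"tha", "##lam", "##us", "is", "a", "part", "of", "brain"}
overall_vocab = base_vocab | {"thalamus"}

sentence = "Thalamus is a part of brain"
print(tokenize(sentence, base_vocab))
print(tokenize(sentence, overall_vocab))
```

With only the base vocabulary the first word is split into three pieces; once 'thalamus' is available as a whole token, the longest match wins and the word stays intact.
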
However, there are still two issues: (1) Embedding vectors of the extension vocabulary are unknown to the pre-trained BERT model, (2) Distribution of token representation in the original vocabulary may experience a shift from the general domain to the target domain due to the use of different sentence styles, formality, intent, and so on. For example, the same word in the context of different domains may have different representations. We address these issues by applying a weighted combination mechanism that allows the original BERT model and extension module to cooperate.
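
A weighted combination of two module outputs can be sketched as a gate in (0, 1) that blends them per token. The exact form used by exBERT is not given in this notebook, so the version below assumes a sigmoid of a linear function of the layer input; `weighted_combination`, `gate_weights`, and `gate_bias` are hypothetical names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def weighted_combination(h, base_out, ext_out, gate_weights, gate_bias=0.0):
    """Blend two module outputs for one token with a scalar gate in (0, 1)."""
    # Assumed gate: sigmoid of a linear map of the layer input h
    w = sigmoid(sum(wi * hi for wi, hi in zip(gate_weights, h)) + gate_bias)
    mixed = [w * b + (1.0 - w) * e for b, e in zip(base_out, ext_out)]
    return w, mixed


h = [0.5, -1.0, 2.0]          # toy layer input for one token
base_out = [1.0, 1.0, 1.0]    # toy output of the original BERT layer
ext_out = [3.0, 3.0, 3.0]     # toy output of the extension module

# Zero gate weights give w = 0.5, so the result is the plain average
w, mixed = weighted_combination(h, base_out, ext_out, gate_weights=[0.0, 0.0, 0.0])
print(w, mixed)
```

Because the gate depends on the input, the blend can lean on the original layer for general-domain tokens and on the extension module for domain-specific ones, which is the cooperation the paragraph above describes.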

#### 1. Byte Level BPE (BBPE) tokenizer
https://medium.com/@pierre_guillou/byte-level-bpe-an-universal-tokenizer-but-aff932332ffe

In [None]:
# Byte Level BPE (BBPE) tokenizers from Transformers and Tokenizers (Hugging Face libraries)

# 1. Get the pre-trained GPT2 Tokenizer (pre-training with an English corpus)
from transformers import GPT2TokenizerFast

pretrained_weights = 'gpt2'
tokenizer_en = GPT2TokenizerFast.from_pretrained(pretrained_weights)
tokenizer_en.pad_token = tokenizer_en.eos_token

In [None]:

pretr_dir = '/Users/lsolis/Documents/GitHub/exBERT/pretrained_files/'
tokenizer_en.save_pretrained(pretr_dir)
#To load from local:
#tokenizer_en = GPT2TokenizerFast.from_pretrained(pretr_dir)

Downloading: 100%|█████████████████████████| 0.99M/0.99M [00:00<00:00, 1.93MB/s]
Downloading: 100%|████████████████████████████| 446k/446k [00:00<00:00, 828kB/s]
Downloading: 100%|█████████████████████████| 1.29M/1.29M [00:00<00:00, 2.16MB/s]
Downloading: 100%|██████████████████████████████| 665/665 [00:00<00:00, 144kB/s]


('/Users/lsolis/Documents/GitHub/exBERT/pretrained_files/tokenizer_config.json',
 '/Users/lsolis/Documents/GitHub/exBERT/pretrained_files/special_tokens_map.json',
 '/Users/lsolis/Documents/GitHub/exBERT/pretrained_files/vocab.json',
 '/Users/lsolis/Documents/GitHub/exBERT/pretrained_files/merges.txt',
 '/Users/lsolis/Documents/GitHub/exBERT/pretrained_files/added_tokens.json',
 '/Users/lsolis/Documents/GitHub/exBERT/pretrained_files/tokenizer.json')

Check GPT2 tokenizer vocab

In [None]:
import itertools

print('--------- vocab ---------')
print()

print('vocab file names: ', tokenizer_en.vocab_files_names)
print()

for k, v in tokenizer_en.pretrained_vocab_files_map.items():
    print(k)
    for kk, vv in v.items():
        print('- ', kk, ':', vv)
    print()
    
print('vocab_size: ', tokenizer_en.vocab_size)
print()
num = 50
print(f'First {num} items of the vocab: {dict(itertools.islice(tokenizer_en.get_vocab().items(), num))}')


--------- vocab ---------

vocab file names:  {'vocab_file': 'vocab.json', 'merges_file': 'merges.txt', 'tokenizer_file': 'tokenizer.json'}

vocab_file
-  gpt2 : https://huggingface.co/gpt2/resolve/main/vocab.json
-  gpt2-medium : https://huggingface.co/gpt2-medium/resolve/main/vocab.json
-  gpt2-large : https://huggingface.co/gpt2-large/resolve/main/vocab.json
-  gpt2-xl : https://huggingface.co/gpt2-xl/resolve/main/vocab.json
-  distilgpt2 : https://huggingface.co/distilgpt2/resolve/main/vocab.json

merges_file
-  gpt2 : https://huggingface.co/gpt2/resolve/main/merges.txt
-  gpt2-medium : https://huggingface.co/gpt2-medium/resolve/main/merges.txt
-  gpt2-large : https://huggingface.co/gpt2-large/resolve/main/merges.txt
-  gpt2-xl : https://huggingface.co/gpt2-xl/resolve/main/merges.txt
-  distilgpt2 : https://huggingface.co/distilgpt2/resolve/main/merges.txt

tokenizer_file
-  gpt2 : https://huggingface.co/gpt2/resolve/main/tokenizer.json
-  gpt2-medium : https://huggingface.co/gpt2-

In [None]:
# 2. Train a Byte Level BPE (BBPE) tokenizer on our text

# Get GPT2 tokenizer_en vocab size
ByteLevelBPE_tokenizer_en_vocab_size = tokenizer_en.vocab_size
ByteLevelBPE_tokenizer_en_vocab_size


50257

In [None]:
# ByteLevelBPETokenizer Represents a Byte-level BPE as introduced by OpenAI with their GPT-2 model
from tokenizers import ByteLevelBPETokenizer

ByteLevelBPE_tok_en = ByteLevelBPETokenizer()

# Get list of paths to corpus files
text_file = "data/paragraphs.txt"

# Customize training with RoBERTa-style special tokens
# (the pre-trained GPT-2 tokenizer itself uses <|endoftext|>)
ByteLevelBPE_tok_en.train(files=text_file,
                          vocab_size=ByteLevelBPE_tokenizer_en_vocab_size,
                          min_frequency=2,
                          special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])






In [None]:
# Get sequence length max of 1024
ByteLevelBPE_tok_en.enable_truncation(max_length=1024)

# save tokenizer
ByteLevelBPE_tok_en.save("output/ByteLevelBPE_tok_en")

ByteLevelBPE_tok_en.save_model("output/")

['output/vocab.json', 'output/merges.txt']

In [None]:
# open the file in read mode; the context manager closes it afterwards
with open("file1.txt", "r") as my_file:
    # read the whole file
    data = my_file.read()

# split the text into a list wherever a newline ('\n') is seen
data_into_list = data.split("\n")
print(data_into_list)

In [None]:
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

In [None]:
# Customize training on the corpus file defined above
tokenizer.train(files=text_file,
                vocab_size=50265,
                min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

# Save vocab.json and merges.txt to disk
!mkdir -p "models/roberta"
tokenizer.save_model("models/roberta")