# Pretraining

In [1]:
import sys
sys.path.append("..")
from mangoes.modeling import BERTWordPieceTokenizer, BERTForMaskedLanguageModeling

Users can pretrain BERT on a corpus using both objectives from the original paper, masked language modeling (MLM) and next sentence prediction (NSP), or using the MLM objective alone.
In this example, we use only the MLM objective; the code looks much the same when NSP is included.
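
For intuition about what the MLM objective asks of the model, the sketch below applies the masking scheme described in the original BERT paper to a list of token ids: roughly 15% of positions are selected, and of those, 80% are replaced with the mask token, 10% with a random token, and 10% are left unchanged. This is not the mangoes implementation; the function name `mask_tokens`, the token ids, and the use of `-100` as an ignore label (the PyTorch cross-entropy convention) are illustrative assumptions.

```python
import random

IGNORE_INDEX = -100  # label value conventionally skipped by PyTorch's cross-entropy loss


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), mlm_probability=0.15, seed=None):
    """Return (masked_inputs, labels) following the BERT 80/10/10 masking scheme."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never mask special tokens, and only select about 15% of the rest.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


# Hypothetical ids for illustration: 2 = [CLS], 3 = [SEP], 4 = [MASK]
masked, labels = mask_tokens([2, 17, 52, 301, 88, 3], mask_id=4, vocab_size=1000,
                             special_ids={2, 3}, seed=0)
print(masked, labels)
```

Only positions whose label is not `-100` contribute to the loss, which is why the model is trained on a small fraction of each sequence per pass.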

Before we train a BERT model from scratch, we can train a subword tokenizer on our corpus (setting tokenizer parameters in the initialization function call), then save it to a directory:

In [2]:
corpus_path = "./data/wiki_article_en"
tokenizer_dir = "./tok_dir/"
model_output_dir = "./model_ckpts/"

tokenizer = BERTWordPieceTokenizer(lowercase=False)
tokenizer.train(corpus_path, vocab_size=1000)
tokenizer.save(tokenizer_dir)

'./tok_dir/tokenizer.json'
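
To illustrate how a trained WordPiece vocabulary is used at encoding time, here is a minimal sketch of greedy longest-match-first segmentation of a single word. The toy vocabulary and the function name `wordpiece_tokenize` are made up for this example; a real vocabulary would come from the tokenizer trained above, and the library's own encoder handles normalization and pre-tokenization as well.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking until a vocab entry matches.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, purely illustrative.
toy_vocab = {"an", "##arch", "##ism", "free", "##dom", "[UNK]"}
print(wordpiece_tokenize("anarchism", toy_vocab))
print(wordpiece_tokenize("freedom", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A small `vocab_size` such as 1000 means most words are split into several short pieces, which lengthens sequences but keeps the embedding matrix small.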

Next, we'll initialize a BERT MLM class, passing in the saved tokenizer path and setting model hyperparameters in the initialization function:

In [3]:
model = BERTForMaskedLanguageModeling(tokenizer_dir, hidden_size=252, intermediate_size=256, num_hidden_layers=2)

# optionally, users can use a pretrained tokenizer provided by Hugging Face, for example:
# model = BERTForMaskedLanguageModeling("bert-base-cased", hidden_size=252, intermediate_size=256, num_hidden_layers=2)



## Training 

We can then train the model on the same corpus. There are a few ways to call the train function. The simplest is to pass the raw data as an argument and supply training arguments as keyword arguments.

In [4]:
model.train(train_text=corpus_path, output_dir=model_output_dir, num_train_epochs=5, learning_rate=0.00005, 
            max_len=256, logging_steps=40)

Step,Training Loss
40,6.788
80,6.547
120,6.4446
160,6.3613
200,6.3326
240,6.2982
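
One way to read these numbers: a model that guesses uniformly over a 1000-token vocabulary has a cross-entropy of ln(1000), about 6.91, so the first logged loss is only slightly better than chance. The snippet below assumes the logged value is the mean cross-entropy in nats (the usual convention for PyTorch-based trainers) and converts it to perplexity.

```python
import math

vocab_size = 1000                      # matches the tokenizer trained above
uniform_loss = math.log(vocab_size)    # cross-entropy of a uniform guess, in nats

logged_losses = {40: 6.788, 240: 6.2982}  # first and last values from the log above
for step, loss in logged_losses.items():
    perplexity = math.exp(loss)        # effective number of equally likely candidates
    print(f"step {step}: loss={loss:.3f}, perplexity={perplexity:.0f}, "
          f"uniform baseline={uniform_loss:.3f}")
```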


Alternatively, users can pass instantiated torch.utils.data.Dataset objects instead of the raw data:

In [5]:
from mangoes.modeling import MangoesLineByLineDataset

eval_corpus_path = "./data/wiki_article_fr"
train_dataset = MangoesLineByLineDataset(corpus_path, model.tokenizer, max_len=256)
eval_dataset = MangoesLineByLineDataset(eval_corpus_path, model.tokenizer, max_len=256)

model.train(train_dataset=train_dataset, eval_dataset=eval_dataset, output_dir=model_output_dir, 
            num_train_epochs=4, learning_rate=0.00005, logging_steps=40, evaluation_strategy="epoch")

Epoch,Training Loss,Validation Loss,Runtime,Samples Per Second
1,6.2766,6.284849,0.0973,102.811
2,6.1477,6.228829,0.0926,107.956
3,6.1018,6.24219,0.0873,114.544
4,6.1007,6.324665,0.1005,99.513
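
For readers unfamiliar with map-style datasets, the sketch below shows the general shape of a line-by-line dataset: one example per non-empty line, truncated to `max_len` tokens. It is not the `MangoesLineByLineDataset` implementation. The class name `ToyLineByLineDataset` and the whitespace "tokenizer" are stand-ins; a real version would subclass `torch.utils.data.Dataset` and call the model's subword tokenizer.

```python
import os
import tempfile


class ToyLineByLineDataset:
    """Map-style dataset: implements __len__ and __getitem__, one example per line."""

    def __init__(self, path, tokenize, max_len=256):
        with open(path, encoding="utf-8") as f:
            # Skip blank lines so every index maps to real text.
            lines = [line.strip() for line in f if line.strip()]
        # Tokenize once up front and truncate each example to max_len tokens.
        self.examples = [tokenize(line)[:max_len] for line in lines]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return {"input_ids": self.examples[idx]}


# Usage with a throwaway file and a whitespace tokenizer (stand-in for a subword tokenizer).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first line of text\n\nsecond line\n")
dataset = ToyLineByLineDataset(tmp.name, tokenize=str.split, max_len=3)
os.remove(tmp.name)
print(len(dataset), dataset[0])
```

Building the datasets yourself is mainly useful when the training and evaluation data need different preprocessing or when the same tokenized data is reused across runs.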


Another option is to instantiate a transformers.Trainer and pass it to the train() function. This is shown in the fine-tuning demos.

## Inference

After the BERT model is pretrained, we can use it to get embeddings or to predict masked tokens. As shown in the feature extraction demo, users can use the predict or generate_outputs functions. In this case, predict gives a direct prediction for the masked token, while generate_outputs returns the masked token scores (logits) and, if requested, the hidden-state embeddings and attention matrices:

In [6]:
input_text = f"An important current within anarchism is free {model.tokenizer.mask_token} ."

predictions = model.predict(input_text, top_k=1)
print(predictions)

[{'sequence': 'An important current within anarchism is free the.', 'score': 0.03034188225865364, 'token': 160, 'token_str': 'the'}]


In [7]:
outputs = model.generate_outputs(input_text, output_hidden_states=True, output_attentions=True)
print(outputs.keys())
print(outputs["logits"].shape)
print(outputs["logits"][0][-2][model.tokenizer.convert_tokens_to_ids("the")])


dict_keys(['logits', 'hidden_states', 'attentions', 'offset_mappings'])
torch.Size([1, 10, 1000])
tensor(3.0384)


In [8]:
import torch.nn.functional as F

# softmax turns the raw logits into probabilities over the vocabulary
probs = F.softmax(outputs["logits"], dim=-1)
print(probs[0][-2][model.tokenizer.convert_tokens_to_ids("the")])


tensor(0.0303)
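
The relationship between a raw logit and the score reported by predict can be reproduced without a model. The sketch below implements a numerically stable softmax over a made-up logit vector and then ranks the entries, which is one plausible way to obtain a `top_k` result; how mangoes does this internally is an assumption here, and the helper names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k=1):
    """Return (index, probability) pairs for the k most likely entries."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


toy_logits = [3.0, 1.0, 0.5, -2.0]  # made-up scores over a 4-token vocabulary
probs = softmax(toy_logits)
print(top_k(probs, k=2))
```

Because softmax normalizes over the whole vocabulary, a top score as low as 0.03 indicates that the model spreads its probability mass over many candidate tokens.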


## Saving the Model

After the model is trained, it can be saved using the save() function. This is useful for fine-tuning the model later on a specific task.

In [9]:
model.save("./model_output/", save_tokenizer=True)