# Finetuning a masked language model

source: https://huggingface.co/course/chapter7/3?fw=pt

## Housekeeping

In [24]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the followin line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!apt install git-lfs

E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?


In [25]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [26]:
# !git config --global user.email "gretel_depaepe@me.com"
# !git config --global user.name "Gretel"

In [27]:
# from huggingface_hub import notebook_login

# notebook_login()

## Picking a pretrained model for masked language modeling

<font color='purple' size=4>Using a pre-trained model as is, is not difficult.  Well, except perhaps, deciding which one to choose as there are many to choose from.  We picked DistilBERT, which is a masked language model. In masked language models, random words in the input data are masked and the model needs to learn how to predict the most likely word in that spot.
DistilBERT is trained on a dataset consisting of 11k unpublished books and English Wikipedia, just like BERT. 

In [63]:
from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /home/jupyter/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.21.1",
  "vocab_size": 30522
}

loading weights file https://huggingface.co/distilbert-base-uncased/resolve/main/pytorch_model.bin from cache at /home/jupyter/.cache/huggingface/transforme

<font color='purple' size=4>But, with around 67 million parameters, DistilBERT is approximately two times smaller than the BERT base model, which roughly translates into a two-fold speedup in training. That being said, be prepared to be very patient and don’t even think about retraining or training without access to at least one GPU.

In [64]:
distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

'>>> DistilBERT number of parameters: 67M'
'>>> BERT number of parameters: 110M'


<font color='purple' size=4>So how good is the pre-trained model when it comes to DS9 specific data?

In [65]:
text = "Commander [MASK] is talking over the intercom."

For pretrained models, the predictions depend on the corpus the model was trained on, since it learns to pick up the statistical patterns present in the data. Like BERT, DistilBERT was pretrained on the English Wikipedia and BookCorpus datasets, so we expect the predictions for [MASK] to reflect these domains. To predict the mask we need DistilBERT’s tokenizer to produce the inputs for the model, so let’s download that from the Hub as well:

In [66]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /home/jupyter/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.21.1",
  "vocab_size": 30522
}

loading file https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt from cache at /home/jupyter/.cache/huggingface/transformers/0e1bbfda7f63a

With a tokenizer and a model, we can now pass our text example to the model, extract the logits, and print out the top 5 candidates:

In [67]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> Commander mccoy is talking over the intercom.'
'>>> Commander jones is talking over the intercom.'
'>>> Commander jameson is talking over the intercom.'
'>>> Commander armstrong is talking over the intercom.'
'>>> Commander vance is talking over the intercom.'


<font color='purple' size=4>We can see from the outputs that the model’s predictions are based in what it learned from English Wikipedia. Let’s see how we can change this domain to something a bit more niche — Deep Space Nine Scripts!



## The dataset

In [33]:
from datasets import load_dataset

dataset = load_dataset("Gretel/deep_space_9_dataset")
dataset

Using custom data configuration Gretel--deep_space_9_dataset-e59f46fc56ac0435
Reusing dataset parquet (/home/jupyter/.cache/huggingface/datasets/Gretel___parquet/Gretel--deep_space_9_dataset-e59f46fc56ac0435/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['title', 'text'],
        num_rows: 38933
    })
})

Let’s take a look at a few samples to get an idea of what kind of text we’re dealing with. 

In [34]:
sample = dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Text: {row['text']}'")
    print(f"'>>> Title: {row['title']}'")

Loading cached shuffled indices for dataset at /home/jupyter/.cache/huggingface/datasets/Gretel___parquet/Gretel--deep_space_9_dataset-e59f46fc56ac0435/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-7402f667f0a44306.arrow



'>>> Text:  and she looks up at'
'>>> Title: The Circle'

'>>> Text:  somehow     it has literally brought him back to     life'
'>>> Title: Battle Lines'

'>>> Text:  as Quark watches'
'>>> Title: Emissary'


## Preprocessing the data

For both auto-regressive and masked language modeling, a common preprocessing step is to concatenate all the examples and then split the whole corpus into chunks of equal size. This is quite different from our usual approach, where we simply tokenize individual examples. Why concatenate everything together? The reason is that individual examples might get truncated if they’re too long, and that would result in losing information that might be useful for the language modeling task!

So to get started, we’ll first tokenize our corpus as usual, but without setting the truncation=True option in our tokenizer. We’ll also grab the word IDs if they are available, as we will need them later on to do whole word masking. We’ll wrap this in a simple function, and while we’re at it we’ll remove the text and title columns since we don’t need them any longer:

In [35]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "title"]
)
tokenized_datasets

Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/Gretel___parquet/Gretel--deep_space_9_dataset-e59f46fc56ac0435/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-bcb27e138117d060.arrow


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 38933
    })
})

Now that we’ve tokenized our movie reviews, the next step is to group them all together and split the result into chunks. But how big should these chunks be? This will ultimately be determined by the amount of GPU memory that you have available, but a good starting point is to see what the model’s maximum context size is. This can be inferred by inspecting the model_max_length attribute of the tokenizer:

In [36]:
tokenizer.model_max_length

512

We’ll pick something a bit smaller that can fit in memory:

In [37]:
chunk_size = 128

Now comes the fun part. To show how the concatenation works, let’s take a few reviews from our tokenized training set and print out the number of tokens per review:

In [38]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Text {idx} length: {len(sample)}'")

'>>> Text 0 length: 981'
'>>> Text 1 length: 30'
'>>> Text 2 length: 26'


We can then concatenate all these examples with a simple dictionary comprehension, as follows:

In [39]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated text length: {total_length}'")

'>>> Concatenated text length: 1037'


Great, the total length checks out — so now let’s split the concatenated reviews into chunks of the size given by block_size. To do so, we iterate over the features in concatenated_examples and use a list comprehension to create slices of each feature. The result is a dictionary of chunks for each feature:

In [40]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 13'


As you can see in this example, the last chunk will generally be smaller than the maximum chunk size. There are two main strategies for dealing with this:

Drop the last chunk if it’s smaller than chunk_size.
Pad the last chunk until its length equals chunk_size.
We’ll take the first approach here, so let’s wrap all of the above logic in a single function that we can apply to our tokenized datasets:

In [41]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split by chunks of max_len
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

Note that in the last step of group_texts() we create a new labels column which is a copy of the input_ids one. As we’ll see shortly, that’s because in masked language modeling the objective is to predict randomly masked tokens in the input batch, and by creating a labels column we provide the ground truth for our language model to learn from.

Let’s now apply group_texts() to our tokenized datasets using our trusty Dataset.map() function:

In [42]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/Gretel___parquet/Gretel--deep_space_9_dataset-e59f46fc56ac0435/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-e2cd530ba659950d.arrow


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 20704
    })
})

Let’s also check out what the labels look like for masked language modeling:

In [43]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

'/ 25 / 92 - c star trek : deep space nine " emissary " cast benjamin sisko picard jake sisko con officer miles o\'brien tactical officer kira nerys ops officer odo vulcan captain nog monk # 1 quark monk # 2 kai opaka jennifer julian bashir ferengi pit boss jadzia dax a lieutenant keiko female trans. chief gul duxat chancellor ( on monitor ) cardassian off. # 1 computer voice cardassian off. # 2 female computer voice batter alien gul jasad cardassian officer bajoran bureaucrat ( on monitor ) doran'

In [44]:
tokenizer.decode(lm_datasets["train"][1]["labels"])

'/ 25 / 92 - c star trek : deep space nine " emissary " cast benjamin sisko picard jake sisko con officer miles o\'brien tactical officer kira nerys ops officer odo vulcan captain nog monk # 1 quark monk # 2 kai opaka jennifer julian bashir ferengi pit boss jadzia dax a lieutenant keiko female trans. chief gul duxat chancellor ( on monitor ) cardassian off. # 1 computer voice cardassian off. # 2 female computer voice batter alien gul jasad cardassian officer bajoran bureaucrat ( on monitor ) doran'

As expected from our group_texts() function above, this looks identical to the decoded input_ids — but then how can our model possibly learn anything? We’re missing a key step: inserting [MASK] tokens at random positions in the inputs! Let’s see how we can do this on the fly during fine-tuning using a special data collator.

Fine-tuning a masked language model is almost identical to fine-tuning a sequence classification model. The only difference is that we need a special data collator that can randomly mask some of the tokens in each batch of texts. Fortunately, 🤗 Transformers comes prepared with a dedicated DataCollatorForLanguageModeling for just this task. We just have to pass it the tokenizer and an mlm_probability argument that specifies what fraction of the tokens to mask. We’ll pick 15%, which is the amount used for BERT and a common choice in the literature:

In [45]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

To see how the random masking works, let’s feed a few examples to the data collator. Since it expects a list of dicts, where each dict represents a single chunk of contiguous text, we first iterate over the dataset before feeding the batch to the collator. We remove the "word_ids" key for this data collator as it does not expect it:

In [46]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] star trek : deep space [MASK] " emissar [MASK] " # 40511 - 721 telep [MASK] by michael piller story by rick berman michael [MASK]er the writing credits may not be final and should not be [MASK] for publicity or advertising purposes without first checking with the [MASK] legal department. copyright 1992 paramount [MASK] corporation. all rights reserved. this script is not for [MASK] or reproduction. no one is authorized to dispose of same. [MASK] lost or [MASK], please notify the script department. return to script department rev. final [MASK] paramount pictures corporation. august 10, 1992 star trek : dsaurus " emissary [MASK] rev [MASK] final 08'

'>>> / [MASK] [unused404] 92 emi c star trek [MASK] deep space [MASK] " emissary " cast benjamin sisko picard jake sisko con officer miles o'brien tactical [MASK] kira nerys ops officer odo [MASK] captain nog monk # 1 [MASK]ark monk # gabrielle kai op [MASK] jennifer [MASK] bashir ferengi [MASK] [MASK] jadzia dax a lieutenant kei

Nice, it worked! We can see that the [MASK] token has been randomly inserted at various locations in our text. These will be the tokens which our model will have to predict during training — and the beauty of the data collator is that it will randomize the [MASK] insertion with every batch!

When training models for masked language modeling, one technique that can be used is to mask whole words together, not just individual tokens. This approach is called whole word masking. If we want to use whole word masking, we will need to build a data collator ourselves. A data collator is just a function that takes a list of samples and converts them into a batch, so let’s do this now! We’ll use the word IDs computed earlier to make a map between word indices and the corresponding tokens, then randomly decide which words to mask and apply that mask on the inputs. Note that the labels are all -100 except for the ones corresponding to mask words.

In [47]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id

    return default_data_collator(features)

Next, we can try it on the same samples as before:

In [48]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] star [MASK] : deep space [MASK] " emissary " # 40511 - 721 teleplay by [MASK] piller story by rick berman michael piller [MASK] writing credits may not be final [MASK] should not be used for publicity or [MASK] purposes without first [MASK] with the television legal department. copyright 1992 [MASK] [MASK] corporation. all rights reserved. this script [MASK] not for publication or reproduction [MASK] no one is authorized to dispose of [MASK]. if lost [MASK] destroyed, please notify [MASK] script department. return to script department rev [MASK] final draft paramount pictures [MASK]. august 10, 1992 star trek : ds9 " emissary " [MASK]. final 08'

'>>> / 25 / 92 - c [MASK] trek [MASK] deep [MASK] [MASK] " [MASK] [MASK] [MASK] " cast benjamin sisko [MASK] [MASK] jake [MASK] [MASK] [MASK] officer miles o [MASK] brien tactical officer kira nerys ops officer odo vulcan captain nog [MASK] # 1 quark monk # 2 kai opaka jennifer julian bashir ferengi pit [MASK] jadzia dax a lieutena

Now that we have two data collators, the rest of the fine-tuning steps are standard. Training can take a while 😭, so we’ll first downsample the size of the training set to a few thousand examples. Hopefully, we’ll still get a pretty decent language model! A quick way to downsample a dataset in 🤗 Datasets is via the Dataset.train_test_split() function

In [49]:
train_size = 20704
test_size = int(0.1 * train_size)
train_size = train_size - test_size

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

Loading cached split indices for dataset at /home/jupyter/.cache/huggingface/datasets/Gretel___parquet/Gretel--deep_space_9_dataset-e59f46fc56ac0435/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-c67b9104d13f3fa7.arrow and /home/jupyter/.cache/huggingface/datasets/Gretel___parquet/Gretel--deep_space_9_dataset-e59f46fc56ac0435/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-67dcdceacfe4e7a9.arrow


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 18634
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 2070
    })
})

## Fine-tuning DistilBERT with the Trainer API

<font color='purple' size=4>What happens if we try to finetune the model?  We decided to train for 20 epochs which took about an hour on a machine with 4 vCPUs, 26 GB of RAM and 1 NVIDIA Tesla T4. 

In [50]:
from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-ds9",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    num_train_epochs=20,
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    # push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
    remove_unused_columns=False
)

Here we tweaked a few of the default options, including logging_steps to ensure we track the training loss with each epoch. We’ve also used fp16=True to enable mixed-precision training, which gives us another boost in speed. By default, the Trainer will remove any columns that are not part of the model’s forward() method. This means that if you’re using the whole word masking collator, you’ll also need to set remove_unused_columns=False to ensure we don’t lose the word_ids column during training.

We now have all the ingredients to instantiate the Trainer. Here we just use the standard data_collator, but you can try the whole word masking collator and compare the results as an exercise:

In [51]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=whole_word_masking_data_collator,
)

Using cuda_amp half precision backend


Unlike other tasks like text classification or question answering where we’re given a labeled corpus to train on, with language modeling we don’t have any explicit labels. So how do we determine what makes a good language model? Like with the autocorrect feature in your phone, a good language model is one that assigns high probabilities to sentences that are grammatically correct, and low probabilities to nonsense sentences. To give you a better idea of what this looks like, you can find whole sets of “autocorrect fails” online, where the model in a person’s phone has produced some rather funny (and often inappropriate) completions!

Assuming our test set consists mostly of sentences that are grammatically correct, then one way to measure the quality of our language model is to calculate the probabilities it assigns to the next word in all the sentences of the test set. High probabilities indicates that the model is not “surprised” or “perplexed” by the unseen examples, and suggests it has learned the basic patterns of grammar in the language. There are various mathematical definitions of perplexity, but the one we’ll use defines it as the exponential of the cross-entropy loss. Thus, we can calculate the perplexity of our pretrained model by using the Trainer.evaluate() function to compute the cross-entropy loss on the test set and then taking the exponential of the result:

In [52]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 2070
  Batch size = 64


>>> Perplexity: 5.04


<font color='purple' size=4>A lower perplexity score means a better language model. Let’s see if we can lower it by fine-tuning!

In [53]:
trainer.train()

***** Running training *****
  Num examples = 18634
  Num Epochs = 20
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 5840


Epoch,Training Loss,Validation Loss
1,0.6562,0.526399
2,0.5242,0.491724
3,0.4929,0.465428
4,0.4783,0.458215
5,0.4647,0.451102
6,0.4539,0.445852
7,0.4488,0.428906
8,0.4427,0.425711
9,0.4336,0.422856
10,0.4305,0.41848


***** Running Evaluation *****
  Num examples = 2070
  Batch size = 64
Saving model checkpoint to distilbert-base-uncased-finetuned-ds9/checkpoint-500
Configuration saved in distilbert-base-uncased-finetuned-ds9/checkpoint-500/config.json
Model weights saved in distilbert-base-uncased-finetuned-ds9/checkpoint-500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 2070
  Batch size = 64
***** Running Evaluation *****
  Num examples = 2070
  Batch size = 64
Saving model checkpoint to distilbert-base-uncased-finetuned-ds9/checkpoint-1000
Configuration saved in distilbert-base-uncased-finetuned-ds9/checkpoint-1000/config.json
Model weights saved in distilbert-base-uncased-finetuned-ds9/checkpoint-1000/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 2070
  Batch size = 64
***** Running Evaluation *****
  Num examples = 2070
  Batch size = 64
Saving model checkpoint to distilbert-base-uncased-finetuned-ds9/checkpoint-1500
Configuration saved in distilbert-bas

TrainOutput(global_step=5840, training_loss=0.44875546742792, metrics={'train_runtime': 3539.0753, 'train_samples_per_second': 105.304, 'train_steps_per_second': 1.65, 'total_flos': 1.235072291346432e+16, 'train_loss': 0.44875546742792, 'epoch': 20.0})

In [54]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 2070
  Batch size = 64


>>> Perplexity: 1.50


<font color='purple' size=4>We were indeed able to reduce the Perplexity from 5.04 to 1.5.

In [55]:
trainer.save_model("ds9_finetuned")
tokenizer.save_pretrained("ds9_finetuned")

Saving model checkpoint to ds9_finetuned
Configuration saved in ds9_finetuned/config.json
Model weights saved in ds9_finetuned/pytorch_model.bin
tokenizer config file saved in ds9_finetuned/tokenizer_config.json
Special tokens file saved in ds9_finetuned/special_tokens_map.json


('ds9_finetuned/tokenizer_config.json',
 'ds9_finetuned/special_tokens_map.json',
 'ds9_finetuned/vocab.txt',
 'ds9_finetuned/added_tokens.json',
 'ds9_finetuned/tokenizer.json')

## Test model

In [56]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="ds9_finetuned"
)

loading configuration file ds9_finetuned/config.json
Model config DistilBertConfig {
  "_name_or_path": "ds9_finetuned",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.21.1",
  "vocab_size": 30522
}

loading configuration file ds9_finetuned/config.json
Model config DistilBertConfig {
  "_name_or_path": "ds9_finetuned",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",


In [57]:
text = "Commander [MASK] is talking over the intercom."
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> commander kira is talking over the intercom.
>>> commander dax is talking over the intercom.
>>> commander ross is talking over the intercom.
>>> commander carlson is talking over the intercom.
>>> commander wainwright is talking over the intercom.


In [58]:
text = "He is drinking a cup of [MASK]."
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> he is drinking a cup of coffee.
>>> he is drinking a cup of tea.
>>> he is drinking a cup of water.
>>> he is drinking a cup of ale.
>>> he is drinking a cup of wine.


In [59]:
text = "Some beeps and the image of Gul [MASK] appears on the monitors"
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> some beeps and the image of gul amin appears on the monitors
>>> some beeps and the image of gul du appears on the monitors
>>> some beeps and the image of gul omar appears on the monitors
>>> some beeps and the image of gul goran appears on the monitors
>>> some beeps and the image of gul ali appears on the monitors


In [60]:
text = "Two to [MASK] up."
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> two to give up.
>>> two to catch up.
>>> two to stand up.
>>> two to set up.
>>> two to one up.


In [61]:
text = "The Ferengis follow the Rules of [MASK]."
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> the ferengis follow the rules of acquisition.
>>> the ferengis follow the rules of obedience.
>>> the ferengis follow the rules of battle.
>>> the ferengis follow the rules of nature.
>>> the ferengis follow the rules of existence.


In [68]:
text = "In the Bajoran religion, the [MASK] is worshipped as the Celestial Temple of the Prophets"
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> in the bajoran religion, the shrine is worshipped as the celestial temple of the prophets
>>> in the bajoran religion, the temple is worshipped as the celestial temple of the prophets
>>> in the bajoran religion, the moon is worshipped as the celestial temple of the prophets
>>> in the bajoran religion, the place is worshipped as the celestial temple of the prophets
>>> in the bajoran religion, the mountain is worshipped as the celestial temple of the prophets


<font color='purple' size=4> As you can see, the results are a bit disappointing.  Now obviously one could spend days or even weeks trying out different hyperparameters etc.  And it has to be said that many claim to get pretty decent result fine tuning a pre-trained model, especially when trained afterwards on anothe task such as a classification for example.  It all depends on the specific use case.
