In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import collections
import math

from datasets import Dataset
from transformers import AutoTokenizer, TFAutoModelForMaskedLM, DataCollatorForLanguageModeling, create_optimizer, pipeline
from transformers.data.data_collator import tf_default_data_collator
from transformers.keras_callbacks import PushToHubCallback
from huggingface_hub import notebook_login

## Dataset


Importing dataframe from a previous notebook

In [3]:
%store -r DT_rally_speaches_dataset
df = DT_rally_speaches_dataset
df

Unnamed: 0,Location,Month,Year,filename,content
0,Battle Creek,Dec,2019,BattleCreekDec19_2019.txt,Thank you. Thank you. Thank you to Vice Presid...
1,Bemidji,Sep,2020,BemidjiSep18_2020.txt,There's a lot of people. That's great. Thank y...
2,Charleston,Feb,2020,CharlestonFeb28_2020.txt,Thank you. Thank you. Thank you. All I can say...
3,Charlotte,Mar,2020,CharlotteMar2_2020.txt,"I want to thank you very much. North Carolina,..."
4,Cincinnati,Aug,2019,CincinnatiAug1_2019.txt,Thank you all. Thank you very much. Thank you ...
5,Colorador Springs,Feb,2020,ColoradorSpringsFeb20_2020.txt,"Hello Colorado. We love Colorado, most beautif..."
6,Dallas,Oct,2019,DallasOct17_2019.txt,Thank you. Thank you very much. Hello Dallas. ...
7,Des Moines,Jan,2020,DesMoinesJan30_2020.txt,I worked so hard for this state. I worked so h...
8,Fayetteville,Sep,2020,FayettevilleSep19_2020.txt,"What a crowd, what a crowd. Get those people o..."
9,Fayetteville,Sep,2019,FayettevilleSep9_2019.txt,Thank you everybody. Thank you and Vice Presi...


## Model

In [4]:
model_checkpoint = "distilbert-base-uncased"
model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)
model.summary()

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForMaskedLM: ['activation_13']
- This IS expected if you are initializing TFDistilBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertForMaskedLM were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForMaskedLM for predictions without further training.


Model: "tf_distil_bert_for_masked_lm"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 vocab_transform (Dense)     multiple                  590592    
                                                                 
 vocab_layer_norm (LayerNorm  multiple                 1536      
 alization)                                                      
                                                                 
 vocab_projector (TFDistilBe  multiple                 23866170  
 rtLMHead)                                                       
                                                                 
Total params: 66,985,530
Trainable params: 66,985,530
Non-trainable params: 0
__________________________

## Tokenizer

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Let's pick a text to test the base model on

In [6]:
# text = "This is a great [MASK]."
text = "Make [MASK] great"
# text = "[MASK] virus"
# text = "kung [MASK]"

In [7]:
base_model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)



In [8]:
inputs = tokenizer(text, return_tensors="np")
token_logits = base_model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = np.argwhere(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
# We negate the array before argsort to get the largest, not the smallest, logits
top_5_tokens = np.argsort(-mask_token_logits)[:5].tolist()

for token in top_5_tokens:
    print(f">>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}")

>>> Make yourself great
>>> Make it great
>>> Make thee great
>>> Make me great
>>> Make yourselves great


It is easiest to apply the tokenizer through a Dataset's map() method, so let's convert the pandas dataframe:

In [9]:
dataset = Dataset.from_pandas(df)
dataset

Dataset({
    features: ['Location', 'Month', 'Year', 'filename', 'content'],
    num_rows: 35
})

Let's set up a tokenize function that can then be mapped onto the dataset. If using a fast tokenizer we can also store the word ids for whole word masking later on. We can also drop the columns that will not be required for this task.

Since we are working with very long texts we cannot simply truncate the excess, since that would lose us most of the dataset. Instead we can split the texts into chunks small enough to fit the model.

In [10]:
def tokenize_function(examples):
    result = tokenizer(examples["content"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_dataset = dataset.map(
    tokenize_function, batched=True, remove_columns=['Location', 'Month', 'Year', 'filename', 'content']
)
tokenized_dataset

Map:   0%|          | 0/35 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (24291 > 512). Running this sequence through the model will result in indexing errors


Dataset({
    features: ['input_ids', 'attention_mask', 'word_ids'],
    num_rows: 35
})

Let's check the model's max context length in order to determine the size of the chunks

In [11]:
tokenizer.model_max_length

512

The capabilities of your machine are also a factor when picking the chunk size. If the machine is short on memory it might be better to pick a smaller number than the maximum the model can handle.

In [12]:
chunk_size = 256

Let's check the number of tokens per speech

In [13]:
for idx, sample in enumerate(tokenized_dataset["input_ids"]):
    print(f"'>>> Rally {idx} length: {len(sample)}'")

'>>> Rally 0 length: 24291'
'>>> Rally 1 length: 22976'
'>>> Rally 2 length: 12491'
'>>> Rally 3 length: 8802'
'>>> Rally 4 length: 10662'
'>>> Rally 5 length: 15759'
'>>> Rally 6 length: 13867'
'>>> Rally 7 length: 15730'
'>>> Rally 8 length: 22452'
'>>> Rally 9 length: 12007'
'>>> Rally 10 length: 13599'
'>>> Rally 11 length: 14241'
'>>> Rally 12 length: 12027'
'>>> Rally 13 length: 13050'
'>>> Rally 14 length: 18351'
'>>> Rally 15 length: 16629'
'>>> Rally 16 length: 11906'
'>>> Rally 17 length: 12482'
'>>> Rally 18 length: 19059'
'>>> Rally 19 length: 15646'
'>>> Rally 20 length: 19325'
'>>> Rally 21 length: 13165'
'>>> Rally 22 length: 11902'
'>>> Rally 23 length: 8570'
'>>> Rally 24 length: 15375'
'>>> Rally 25 length: 14479'
'>>> Rally 26 length: 12752'
'>>> Rally 27 length: 16218'
'>>> Rally 28 length: 3016'
'>>> Rally 29 length: 14459'
'>>> Rally 30 length: 15064'
'>>> Rally 31 length: 12457'
'>>> Rally 32 length: 8942'
'>>> Rally 33 length: 14664'
'>>> Rally 34 length: 8255'


Let's take a look at the full length of the entire dataset at once:

In [14]:
tokenized_dataset_dict = tokenized_dataset.to_dict()
# tokenized_dataset_dict = tokenized_dataset[:2]
concatenated_dataset = {
    k: sum(tokenized_dataset_dict[k], []) for k in tokenized_dataset_dict.keys()
}
total_length = len(concatenated_dataset["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

'>>> Concatenated reviews length: 494670'


In [15]:
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_dataset.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 256'
'>>> Chunk length: 256'
'>>> Chunk length: 256'
...

And now to put it all in a function to map to our dataset:

In [16]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_dataset = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_dataset[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_dataset.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result

At the end of the group_texts function we create a labels column which is a copy of the input_ids. That is needed in masked language modeling in order to provide the ground truth for our language model to learn from.
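To make the concatenate-then-split logic easier to follow, here is a minimal sketch of the same idea on made-up data. The name `toy_group_texts`, the tiny token lists and the chunk size of 4 are all hypothetical and only serve as an illustration; the real function above works on the tokenizer's output.

```python
# Toy version of the chunking step: tiny "token" lists and a chunk size of 4
toy_chunk_size = 4


def toy_group_texts(examples, chunk_size=toy_chunk_size):
    # Concatenate every column across all rows into one long list
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole chunk
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # Labels start out as a copy of the unmasked input ids
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


# Two made-up "speeches" of 5 and 6 tokens
toy_batch = {
    "input_ids": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]],
    "attention_mask": [[1] * 5, [1] * 6],
}
toy_result = toy_group_texts(toy_batch)
print(toy_result["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(toy_result["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note how the second chunk spans the boundary between the two toy speeches, and how the last three tokens are dropped because they do not fill a whole chunk.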

Now let's map the function to the tokenized dataset:

In [17]:
lm_datasets = tokenized_dataset.map(group_texts, batched=True)
lm_datasets

Map:   0%|          | 0/35 [00:00<?, ? examples/s]

Dataset({
    features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
    num_rows: 1932
})

By grouping and then splitting the text into chunks we have ended up with many more examples (1,932 instead of 35), but those examples contain nearly all of the data present in our texts, most of which would have been lost had we truncated each speech.
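A quick back-of-the-envelope check of these numbers, using only figures printed earlier in this notebook. It assumes all 35 speeches were processed in a single batch by map() (which I believe is the case, since the default batch size is larger than 35) and that every speech is longer than the model's 512-token limit, which the per-speech lengths above confirm.

```python
total_tokens = 494_670      # concatenated length printed earlier
chunk_size = 256            # chunk size chosen above
num_speeches = 35
model_max_length = 512

# Chunking: every full chunk is kept, only the final partial chunk is dropped
full_chunks, leftover = divmod(total_tokens, chunk_size)
print(full_chunks, leftover)  # 1932 78

# Truncation: at most model_max_length tokens would survive per speech
kept_by_truncation = num_speeches * model_max_length
print(f"{kept_by_truncation / total_tokens:.1%}")  # 3.6%
```

So chunking discards only 78 tokens, whereas truncating each speech to 512 tokens would have kept under 4% of the data.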

Let's have a look at the first chunk of the first rally speech, decoded from the tokens using the .decode() method:

In [18]:
tokenizer.decode(lm_datasets[0]["input_ids"])

'[CLS] thank you. thank you. thank you to vice president pence. he\'s a good guy. we\'ve done a great job together. and merry christmas, michigan. thank you, michigan. what a victory we had in michigan. what a victory was that. one of the greats. was that the greatest evening? but i\'m thrilled to be here with thousands of hardworking patriots as we celebrate the miracle of christmas, the greatness of america and the glory of god. thank you very much. and did you notice that everybody is saying merry christmas again? did you notice? saying merry christmas. i remember when i first started this beautiful trip, this beautiful journey, i just said to the first lady, " you are so lucky. i took you on this fantastic journey. it\'s so much fun. they want to impeach you. they want to do worse than that. " by the way, by the way, by the way, it doesn\'t really feel like we\'re being impeached. the country is doing better than ever before. we did nothing wrong. we did nothing wrong. and we have 

And now the labels of that same speech:

In [19]:
tokenizer.decode(lm_datasets[0]["labels"])

'[CLS] thank you. thank you. thank you to vice president pence. he\'s a good guy. we\'ve done a great job together. and merry christmas, michigan. thank you, michigan. what a victory we had in michigan. what a victory was that. one of the greats. was that the greatest evening? but i\'m thrilled to be here with thousands of hardworking patriots as we celebrate the miracle of christmas, the greatness of america and the glory of god. thank you very much. and did you notice that everybody is saying merry christmas again? did you notice? saying merry christmas. i remember when i first started this beautiful trip, this beautiful journey, i just said to the first lady, " you are so lucky. i took you on this fantastic journey. it\'s so much fun. they want to impeach you. they want to do worse than that. " by the way, by the way, by the way, it doesn\'t really feel like we\'re being impeached. the country is doing better than ever before. we did nothing wrong. we did nothing wrong. and we have 

We have the exact same thing in both columns as is to be expected. 

## Fine-tuning DistilBERT with Keras
The next step is to insert the mask tokens into the input ids, which we do with a data collator. All we need to pass it is the tokenizer that we are using and the <b>mlm_probability</b> argument that specifies what fraction of the tokens to mask:

In [20]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

Let's take a look at the masked texts which the collator produces. It expects a list of dicts, where each dict represents a single chunk of contiguous text, so we first need to pull individual samples out of the dataset before feeding the batch to the collator. We remove the "word_ids" key for this data collator as it does not expect it:

In [21]:
#samples = lm_datasets.to_list()
samples = [lm_datasets[i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



'>>> [CLS] thank you. thank you [MASK] thank [MASK] to vice [MASK] pen [MASK]. he'[MASK] a good guy. inactive've done a [MASK] job together [MASK] and [MASK] christmas, michigan. thank you, michigan. what a victory we [MASK] in michigan. what 294 victory was that. one of the greats. was that the [MASK] evening? but i'[MASK] thrilled to be [MASK] with [MASK] of hardworking patriots [MASK] we [MASK] the miracle of christmas, the great [MASK] of america and the glory of god. thank you very much. and did you notice that everybody is saying merry christmas again? did you notice? saying merry [MASK] [MASK] i remember when i first started this beautiful trip, this beautiful journey, i just said to the first lady, " [MASK] are so lucky. i took you on this fantastic journey. [MASK]'s so much fun [MASK] they want to impeach you. they want [MASK] do worse than that [MASK] "bolic the way, by [MASK] [MASK], [MASK] the way, [MASK] doesn [MASK] [MASK] really feel like we're being [MASK]eached. the c

We can see that tokens at random positions in our text have been replaced with the [MASK] token. These are the tokens which our model will have to predict during training, and the masked positions are randomised again for each batch.
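The output above also contains a few odd tokens such as "inactive" or "294" that are not in the original speech. As far as I understand the default behaviour of DataCollatorForLanguageModeling, this is expected: of the tokens selected for prediction, roughly 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged. Below is a minimal pure-Python sketch of that rule. The function `toy_mlm_mask` is hypothetical and simplified (for instance, it does not protect special tokens), and the mask id and vocabulary size are illustrative values passed in by hand.

```python
import random


def toy_mlm_mask(input_ids, mask_token_id, vocab_size, mlm_probability=0.15, seed=0):
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 marks positions ignored by the loss
    for i, token_id in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected for prediction
        labels[i] = token_id  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id              # about 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # about 10%: random token
        # remaining about 10%: keep the original token
    return masked, labels


# Illustrative values only: 100 made-up token ids, a mask id of 103, a vocabulary of 30522
toy_ids = list(range(1000, 1100))
masked_ids, toy_labels = toy_mlm_mask(toy_ids, mask_token_id=103, vocab_size=30522)
print(sum(label != -100 for label in toy_labels), "of", len(toy_ids), "tokens selected")
```

Only the selected positions carry a real label, so the loss is computed on those positions alone.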

When training models for masked language modeling, one technique that can be used is to mask whole words together, not just individual tokens. This approach is called whole word masking. If we want to use whole word masking, we will need to build a data collator ourselves. A data collator is just a function that takes a list of samples and converts them into a batch, so let’s do this now! We’ll use the word IDs computed earlier to make a map between word indices and the corresponding tokens, then randomly decide which words to mask and apply that mask on the inputs. Note that the labels are all -100 (the value the loss ignores) except for the ones corresponding to masked words:

In [22]:
wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return tf_default_data_collator(features)
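The word-to-token map is the heart of this collator, so here is that part on its own as a small sketch. The helper name `toy_word_to_token_map` and the example word ids are made up; the word ids imagine a sequence where the first word was split into three word pieces.

```python
import collections


def toy_word_to_token_map(word_ids):
    # Same grouping logic as in the collator: consecutive tokens that share
    # a word id end up in the same list of token positions
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is not None:  # None marks special tokens such as [CLS]
            if word_id != current_word:
                current_word = word_id
                current_word_index += 1
            mapping[current_word_index].append(idx)
    return dict(mapping)


# Hypothetical word ids: [CLS], a word split into three pieces, a one-piece word, [SEP]
toy_word_ids = [None, 0, 0, 0, 1, None]
print(toy_word_to_token_map(toy_word_ids))  # {0: [1, 2, 3], 1: [4]}
```

Masking word 0 then means masking token positions 1, 2 and 3 together, which is exactly what separates whole word masking from token-level masking.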

Let's test it on the same sample as before:

In [23]:
# samples = lm_datasets.to_list()
samples = [lm_datasets[i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] thank [MASK]. thank [MASK] [MASK] [MASK] [MASK] to vice president pence [MASK] he's a good guy. we [MASK] ve done [MASK] great [MASK] together. and merry christmas, michigan. [MASK] you [MASK] michigan. what a victory we [MASK] in michigan. [MASK] a victory was that. one of the greats. was [MASK] the greatest [MASK] [MASK] but i'm thrilled to [MASK] here with [MASK] [MASK] hardworking patriots [MASK] we celebrate the miracle of christmas, the greatness of america and the glory [MASK] god. thank you very much. and did you notice [MASK] everybody is saying merry christmas again? [MASK] you notice? saying merry christmas. i [MASK] when i first started [MASK] beautiful trip [MASK] this beautiful journey, i [MASK] said to [MASK] first lady, [MASK] you are so lucky. [MASK] [MASK] you [MASK] this [MASK] [MASK]. it [MASK] s [MASK] [MASK] [MASK]. they want to impeach you [MASK] [MASK] want to do worse than that. " by the way, by the way [MASK] by the way, it doesn't really feel like

### Train/Test Split
We now need to split the data into train and test datasets. We can make use of the Dataset.train_test_split() method to do the split based on supplied ratios for train and test size:

In [24]:
train_size = round(0.9 * len(lm_datasets))
test_size = (len(lm_datasets) - train_size)

dataset_split = lm_datasets.train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
dataset_split

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1739
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 193
    })
})

The model requires <b>tf.data</b> datasets as its inputs, which can be achieved using the <b>prepare_tf_dataset()</b> method; it uses the given model to automatically infer which columns should go into the dataset.

In [25]:
tf_train_dataset = model.prepare_tf_dataset(
    dataset_split["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=32,
)

tf_eval_dataset = model.prepare_tf_dataset(
    dataset_split["test"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=32,
)

### Hugging Face Hub Login
I want to be able to save my latest model to the Hugging Face Hub, which is quite straightforward, especially when using the PushToHubCallback, which uploads the latest version of the model when it is done training. I can then load the model much like I would load any other model on the Hub, to make sure I am always using the latest version:

In [26]:
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [27]:
# Load model from Huggingface hub
model = TFAutoModelForMaskedLM.from_pretrained("distilbert-base-uncased-finetuned-dt-rally-speeches")

All model checkpoint layers were used when initializing TFDistilBertForMaskedLM.

All the layers of TFDistilBertForMaskedLM were initialized from the model checkpoint at distilbert-base-uncased-finetuned-dt-rally-speeches.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForMaskedLM for predictions without further training.


In [28]:
# Load model from local dir
# save_dir = "C:/Users/ksbon/Desktop/Jupyter/repos/Hugging Face models/distilbert-base-uncased-finetuned-dt-rally-speeches"
# model = TFAutoModelForMaskedLM.from_pretrained(save_dir)

## Optimizer and hyperparameters

I can use the create_optimizer() function from the Hugging Face Transformers library, which gives me an <b>AdamW</b> optimizer with linear learning rate decay. I will also use the model’s built-in loss, which is the default when no loss is specified as an argument to compile().

I have experimented a bit with the learning rate and have found that a learning rate of 2e-5 with a weight decay rate of 0.01 performs very well. Initially I used a higher learning rate (2e-3 and then 2e-4), which was beneficial only at the start when the model still had lots to learn from this particular dataset.

I also tried reducing the warmup steps, but that seemed to make the model's loss more volatile, with rapid shifts at the start of training.
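To picture what warmup followed by linear decay means, here is a simplified sketch of such a schedule in plain Python. It is my own approximation of the general shape, not the library's implementation; the function name `toy_lr` and the total of 5,000 training steps are hypothetical values chosen only to make the curve easy to read.

```python
def toy_lr(step, init_lr=2e-5, num_warmup_steps=1_000, num_train_steps=5_000):
    # Warmup phase: ramp linearly from 0 up to init_lr
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    # Decay phase: fall linearly from init_lr down to 0
    decay_steps = num_train_steps - num_warmup_steps
    progress = min(step - num_warmup_steps, decay_steps) / decay_steps
    return init_lr * (1.0 - progress)


for step in (0, 500, 1_000, 3_000, 5_000):
    print(step, f"{toy_lr(step):.1e}")
```

With fewer warmup steps the learning rate reaches its peak sooner, which fits the more volatile early loss I saw when I shortened the warmup.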

In [29]:
# from transformers import AdamWeightDecay
# optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

In [30]:
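# len(tf_train_dataset) is the number of batches in one epoch
num_train_steps = len(tf_train_dataset)
# num_train_steps = 250
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    # Longer than num_train_steps above, so the learning rate is still warming up
    # after one pass over the training data
    num_warmup_steps=1_000,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01
)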
model.compile(optimizer=optimizer, 
#               metrics=['accuracy']
             )

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [31]:
# Train in mixed-precision float16
# tf.keras.mixed_precision.set_global_policy("mixed_float16")

Let's also set up the PushToHubCallback from transformers.keras_callbacks:

In [32]:
# Push to hub
model_name = model_checkpoint.split("/")[-1]
save_dir = "C:/Users/ksbon/Desktop/Jupyter/repos/Hugging Face models/distilbert-base-uncased-finetuned-dt-rally-speeches"

push_to_hub_callback = PushToHubCallback(
#     output_dir=f"{model_name}-finetuned-dt-rally-speeches", 
    output_dir=save_dir, 
    tokenizer=tokenizer,
#     hub_model_id="distilbert-base-uncased-finetuned-dt-rally-speeches",
#     hub_token="YOUR_HF_TOKEN",  # never commit a real token
    save_strategy="no" # getting an error with anything other than no
)

C:\Users\ksbon\Desktop\Jupyter\repos\Hugging Face models\distilbert-base-uncased-finetuned-dt-rally-speeches is already a clone of https://huggingface.co/Shmendel/distilbert-base-uncased-finetuned-dt-rally-speeches. Make sure you pull the latest changes with `repo.git_pull()`.


## Training
I am commenting out the training part. At this stage I've run it for a total of about 30 epochs.

In [33]:
# model.fit(x=tf_train_dataset, 
#           validation_data=tf_eval_dataset,
#           epochs=10,
#           callbacks=[push_to_hub_callback]
#          )

In [34]:
# Manual model push to hub
# model.push_to_hub("distilbert-base-uncased-finetuned-dt-rally-speeches")

## Results
Let's have a look at the results. One way to determine the model's performance is of course the loss, but it has its limitations. Unlike other tasks like text classification or question answering where we’re given a labeled corpus to train on, with language modeling we don’t have any explicit labels.

One way to measure the quality of a language model is to look at the probabilities it assigns to the correct tokens in the test set (for a masked language model, the masked tokens). High probabilities indicate that the model is not “surprised” or “perplexed” by the unseen examples, which suggests it has learned the patterns of the language. Perplexity is the exponential of the cross-entropy loss, so we can calculate the perplexity of our fine-tuned model by using the model.evaluate() method to compute the cross-entropy loss on the test set and then taking the exponential of the result:

In [35]:
eval_loss = model.evaluate(tf_eval_dataset)
print(f"Perplexity: {math.exp(eval_loss):.2f}")

Perplexity: 5.94
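To build some intuition for this number, here is a small sketch that computes perplexity directly from the probabilities a model assigns to the correct tokens. The helper `perplexity` and the probability lists are made up for illustration; the only figure taken from this notebook is the 5.94 above.

```python
import math


def perplexity(token_probs):
    # Mean negative log-likelihood (cross-entropy in nats) of the correct tokens
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)


# A model that always gives the right token a probability of 1/4
# is as "perplexed" as a uniform guess among 4 options
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 2))  # 4.0

# Going the other way: a perplexity of 5.94 corresponds to this loss
print(round(math.log(5.94), 2))  # 1.78
```

Read this way, a perplexity of 5.94 means that on average the model is about as uncertain as if it were choosing between roughly six equally likely tokens at each masked position.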


The perplexity score is decent but let's take a look at the results in practice. To do so I will pass the model a few of Donald Trump's favourite phrases to see if I get the suggestions which I expect. I can do the same thing with the base model as well and compare the results.

Let's start by setting up a function which takes the model and the text I want to test it on and prints out the results:

In [36]:
def fill_mask(model, text):
    # Wrap the given model in a fill-mask pipeline
    mask_filler = pipeline(
        task="fill-mask",
        model=model,
        tokenizer=tokenizer,
    )

    # Print each predicted sequence with its score
    preds = mask_filler(text)
    for pred in preds:
        print(f">>> {pred['sequence']} -> {pred['score']:.2f}")
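The score printed next to each suggestion is, as far as I understand the fill-mask pipeline, the softmax probability of that token at the masked position. The sketch below shows that conversion from raw logits to ranked probabilities in plain Python. The helper `top_k_scores` and the four-entry "vocabulary" of logits are hypothetical; a real model produces one logit per vocabulary entry.

```python
import math


def top_k_scores(logits, k=5):
    # Numerically stable softmax over all candidate tokens
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token ids by probability, highest first
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Made-up logits for a vocabulary of four tokens
toy_logits = [2.0, 0.5, 3.0, -1.0]
for token_id, score in top_k_scores(toy_logits, k=2):
    print(token_id, round(score, 2))
```

Because the probabilities are spread over the whole vocabulary, even a top suggestion can have a fairly low score when many tokens are plausible.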

Let's also load the base version which I started with:

In [37]:
base_model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)



## Comparison

#### 1. Fake News

The first phrase I am targeting is <b>"fake news"</b> since it is one of his favourites:

In [52]:
text = "It's [MASK] news"

fill_mask(base_model, text)

>>> it's good news -> 0.12
>>> it's breaking news -> 0.04
>>> it's bad news -> 0.04
>>> it's fox news -> 0.03
>>> it's cbs news -> 0.03


In [53]:
fill_mask(model, text)

>>> it's fake news -> 0.20
>>> it's good news -> 0.14
>>> it's the news -> 0.09
>>> it's bad news -> 0.04
>>> it's fox news -> 0.04


<b>Fake news</b> is not part of the original model's suggestions but the fine tuned model does indeed have it as the top option which is the behavior I was going for.

#### 2. Make America great

"Make America great again" is the phrase I came across the most in my everyday life during his presidency. The phrase is of course very common in his rally speeches as well, so let's see what we get:

In [54]:
# text = 'We will make America [MASK] again!'
text = 'Let\'s make America [MASK] again!'  # great is second
# text = 'The [MASK] American nation'  # great is second

fill_mask(base_model, text)

>>> let's make america proud again! -> 0.14
>>> let's make america laugh again! -> 0.07
>>> let's make america dance again! -> 0.05
>>> let's make america happy again! -> 0.04
>>> let's make america sing again! -> 0.03


In [55]:
fill_mask(model, text)

>>> let's make america proud again! -> 0.42
>>> let's make america great again! -> 0.07
>>> let's make america strong again! -> 0.07
>>> let's make america happy again! -> 0.03
>>> let's make america stronger again! -> 0.02


Not exactly the result I expected. <b>great</b> is in second place, which is not terrible, but the confidence is rather low (0.07). Some more training would likely improve it but I'll leave it as is for now. <b>proud</b>, on the other hand, is at the very top with a much higher confidence (0.42). I dove into the data a bit and it does seem that he uses "proud" more often than "great". "Proud" is also present in more diverse phrases like "the proud American nation", which would explain it showing up at the top.
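One simple way to run this kind of check is to count how often each candidate word appears across the speeches. The sketch below is not the exact check I ran; it uses a hypothetical helper `count_words` and two made-up sentences so that it runs on its own. On the real data the texts would come from the content column of the dataframe.

```python
import re
from collections import Counter


def count_words(texts, words):
    # Count occurrences of the given lowercase words across all texts
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t in words)
    return counts


# Made-up sentences standing in for the speeches
toy_speeches = [
    "We are one proud nation. A proud people.",
    "We will make America great again.",
]
toy_counts = count_words(toy_speeches, {"proud", "great"})
print(toy_counts["proud"], toy_counts["great"])  # 2 1
```

Raw counts ignore context, so this only hints at why one word outranks another at a particular masked position.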

#### 3. One great nation

While going over the data I spotted this phrase which was being used quite often so let's see what we get:

In [58]:
text = 'One [MASK] nation'

fill_mask(base_model, text)

>>> one hundred nation -> 0.04
>>> one - nation -> 0.04
>>> one sovereign nation -> 0.02
>>> one african nation -> 0.02
>>> one nation nation -> 0.02


In [59]:
fill_mask(model, text)

>>> one great nation -> 0.09
>>> one proud nation -> 0.04
>>> one strong nation -> 0.04
>>> one - nation -> 0.04
>>> one more nation -> 0.03


Great is at the top followed by proud, which I am happy with.

## Conclusion