# Fine-tuning a Masked Language Model


There are a few cases where you'll want to first fine-tune the language model on your data, before training a task-specific head. For example, if your dataset contains legal contracts or scientific articles, a vanilla Transformer model like BERT will typically treat the domain-specific words in your corpus as rare tokens, and the resulting performance may be less than satisfactory. By fine-tuning the language model on in-domain data you can boost the performance of many downstream tasks, which means you usually only have to do this step once!

This process of fine-tuning a pretrained language model on in-domain data is usually called **domain adaptation**.

We'll use DistilBERT, which was trained using [knowledge distillation](https://en.wikipedia.org/wiki/Knowledge_distillation).

In [1]:
!pip install -q git+https://github.com/huggingface/transformers
!pip install -q git+https://github.com/huggingface/accelerate
!pip install -q datasets evaluate wandb

In [2]:
from huggingface_hub import notebook_login

notebook_login()

In [3]:
from huggingface_hub import get_full_repo_name

model_name = "distilbert-base-uncased-finetuned-imdb-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name



'tchoud8/distilbert-base-uncased-finetuned-imdb-accelerate'

In [6]:
import wandb

wandb.init(project="distilbert-base-uncased-finetuned-imdb-accelerate", entity="tchoud8")

In [7]:
from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

'>>> DistilBERT number of parameters: 67M'
'>>> BERT number of parameters: 110M'


In [8]:
text = "This is a great [MASK]."

In [9]:
from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'


## The Dataset

To showcase **domain adaptation**, we'll use the famous **Large Movie Review Dataset (or IMDb for short)**, which is a corpus of movie reviews that is often used to benchmark sentiment analysis models. By fine-tuning DistilBERT on this corpus, we expect the language model will adapt its vocabulary from the factual data of Wikipedia that it was pretrained on to the more subjective elements of movie reviews.

In [10]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

We can see that the train and test splits each consist of 25,000 reviews, while there is an unlabeled split called unsupervised that contains 50,000 reviews.

In [11]:
sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")


'>>> Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...'
'>>> Label: 1'

'>>> Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stu

For both auto-regressive and masked language modeling, a common preprocessing step is to concatenate all the examples and then split the whole corpus into chunks of equal size. This is quite different from our usual approach, where we simply tokenize individual examples. Why concatenate everything together? The reason is that individual examples might get truncated if they're too long, and that would result in losing information that might be useful for the language modeling task!

So to get started, we'll first tokenize our corpus as usual, but without setting the truncation=True option in our tokenizer. We'll also grab the word IDs if they are available (which they will be if we're using a fast tokenizer, as described in Chapter 6), as we will need them later on to do whole word masking:

In [12]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)

print(tokenizer.model_max_length)

tokenized_datasets

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors


512


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

Since DistilBERT is a BERT-like model, we can see that the encoded texts consist of the input_ids and attention_mask that we've seen in other chapters, as well as the word_ids we added.

The value of tokenizer.model_max_length printed above is derived from the tokenizer_config.json file associated with a checkpoint; in this case we can see that the context size is 512 tokens, just like with BERT.

Using the full context size can be demanding on GPU memory, so in order to run our experiments on GPUs like those found on Google Colab, we'll pick something a bit smaller that can fit in memory:

In [13]:
chunk_size = 128

# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

'>>> Review 0 length: 363'
'>>> Review 1 length: 304'
'>>> Review 2 length: 133'


We can then concatenate all these examples with a simple dictionary comprehension:

In [14]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

'>>> Concatenated reviews length: 800'
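To see what chunking does to a sequence of this length, here is a small sketch that uses stand-in token IDs with the same lengths as the three reviews (363, 304, and 133). The fake_reviews lists are placeholders, not real tokenizer output.

```python
chunk_size = 128

# Stand-in token IDs with the same lengths as the three reviews above.
review_lengths = [363, 304, 133]
fake_reviews = [list(range(n)) for n in review_lengths]

# Concatenate, then slice into consecutive chunks of at most chunk_size tokens.
concatenated = sum(fake_reviews, [])
chunks = [
    concatenated[i : i + chunk_size]
    for i in range(0, len(concatenated), chunk_size)
]

for chunk in chunks:
    print(f"'>>> Chunk length: {len(chunk)}'")
```

With 800 tokens this gives six full chunks of 128 tokens followed by a shorter chunk of 32 tokens.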


When we split the concatenated sequence into chunks of chunk_size tokens, the last chunk will generally be smaller than the maximum chunk size: here, 800 tokens give six chunks of 128 tokens and a final chunk of 32. There are two main strategies for dealing with this:

- Drop the last chunk if it's smaller than chunk_size.
- Pad the last chunk until its length equals chunk_size.

We'll take the first approach here, so let's wrap all of the above logic in a single function that we can apply to our tokenized datasets:

In [15]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result


lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})
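For comparison, the second strategy (padding the last chunk instead of dropping it) could be sketched roughly as follows. This is only an illustration, not the approach used in this notebook: the function name group_texts_with_padding is hypothetical, pad_token_id=0 is an assumption (in practice you would pass tokenizer.pad_token_id), and the word_ids column is left out for brevity.

```python
def group_texts_with_padding(examples, chunk_size, pad_token_id=0):
    """Chunk concatenated examples, padding the final chunk instead of dropping it."""
    input_ids = sum(examples["input_ids"], [])
    attention_mask = sum(examples["attention_mask"], [])
    result = {"input_ids": [], "attention_mask": [], "labels": []}
    for start in range(0, len(input_ids), chunk_size):
        ids = input_ids[start : start + chunk_size]
        mask = attention_mask[start : start + chunk_size]
        n_pad = chunk_size - len(ids)  # 0 for every chunk except possibly the last
        result["input_ids"].append(ids + [pad_token_id] * n_pad)
        result["attention_mask"].append(mask + [0] * n_pad)
        # -100 is the label value that PyTorch's cross-entropy loss ignores by default
        result["labels"].append(ids + [-100] * n_pad)
    return result


# Two tiny made-up "reviews" of 4 and 3 tokens, chunked into blocks of 4
toy = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1]],
}
padded = group_texts_with_padding(toy, chunk_size=4)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding keeps every token at the cost of some wasted computation on the padded positions, whereas dropping loses at most chunk_size - 1 tokens per batch passed to the function.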

You can see that grouping and then chunking the texts has produced many more examples than our original 25,000 for the train and test splits. That's because we now have examples involving contiguous tokens that span across multiple examples from the original corpus. Chunks that straddle the boundary between two reviews contain the special [SEP] and [CLS] tokens. Let's decode one of the chunks:

In [16]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

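This particular chunk falls entirely inside a single review (the one about I Am Curious - Yellow), so it contains no [SEP] or [CLS] tokens; a chunk that covers the end of one review and the start of the next would contain both. Let's also check out what the labels look like for masked language modeling: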

In [17]:
tokenizer.decode(lm_datasets["train"][1]["labels"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

As expected from our group_texts() function above, this looks identical to the decoded input_ids. But then how can our model possibly learn anything? We're missing a key step: inserting [MASK] tokens at random positions in the inputs! Let's see how we can do this on the fly during fine-tuning using a special data collator.


### DataCollator

🤗 Transformers comes with a dedicated DataCollatorForLanguageModeling for just this task. We just have to pass it the tokenizer and an mlm_probability argument that specifies what fraction of the tokens to mask. We'll pick 15%, which is the amount used for BERT and a common choice in the literature:

In [18]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
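To build some intuition for what happens inside the collator, here is a rough pure-Python sketch of BERT-style masking: each non-special token is selected with probability mlm_probability, and a selected token is replaced by [MASK] 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The function mask_tokens is hypothetical and written for illustration only; the IDs 101, 102, 103 and the vocabulary size 30522 are assumed values for the bert-base-uncased vocabulary.

```python
import random


def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids,
                mlm_probability=0.15, seed=None):
    """BERT-style masking on a list of token IDs; returns (masked_inputs, labels)."""
    rng = random.Random(seed)
    inputs, labels = list(input_ids), []
    for pos, token in enumerate(input_ids):
        # Special tokens are never selected; others are selected with mlm_probability
        if token in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)  # this position is ignored by the loss
            continue
        labels.append(token)  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = mask_token_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[pos] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


# Assumed IDs: 101 = [CLS], 102 = [SEP], 103 = [MASK]; the rest are arbitrary
ids = [101] + list(range(1000, 1020)) + [102]
masked, labels = mask_tokens(
    ids, mask_token_id=103, vocab_size=30522, special_ids={101, 102}, seed=0
)
print(masked)
print(labels)
```

The random-replacement branch explains why unexpected words can show up in the collator's output next to the [MASK] tokens.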

To see how the random masking works, let's feed a few examples to the data collator. Since it expects a list of dicts, where each dict represents a single chunk of contiguous text, we first iterate over the dataset before feeding the batch to the collator. We remove the "word_ids" key for this data collator as it does not expect it:

In [19]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")



You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



'>>> [CLS] i rented [MASK] [MASK] curious - yellow from my video store because of all the controversy that surrounded lids when it was [MASK] released in [MASK]. i also heard that at [MASK] it was seized by [MASK]. s. customs if it ever [MASK] to enter this [MASK], [MASK] being a fan of films considered " controversial " i really had to see this for myself [MASK] < br / > < [MASK] / > the [MASK] is centered around a [MASK] lent drama student named lena who wants to [MASK] [MASK] she [MASK] about life. in particular she wants to focus her [MASK]s to making some sort of documentary on what the average swede thought about [MASK] political issues such'

'>>> as the vietnam war [MASK] race issues in the united states. in between [MASK] politicians and ordinary den [MASK]ns of stockholm about their opinions on politics, she has [MASK] [MASK] her drama teacher, classmates, and married men. < br / [MASK] < br / > [MASK] [MASK] me [MASK] i am curious - yellow is [MASK] 40 years ago, this was c

You can also replace tokenizer.decode() with tokenizer.convert_ids_to_tokens() to see that sometimes only a single token from a given word is masked, while the others are left intact:

In [20]:
for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.convert_ids_to_tokens(chunk)}'")


'>>> ['[CLS]', 'i', '[MASK]', 'i', 'am', 'curious', '-', 'yellow', 'from', 'my', 'video', 'store', 'because', 'of', 'all', 'the', 'controversy', 'that', 'surrounded', 'it', 'when', 'it', 'was', 'first', 'released', '[MASK]', '1967', '.', 'i', 'also', 'heard', 'that', 'at', 'first', 'it', 'was', 'seized', 'by', 'u', '.', 's', '.', 'customs', 'if', 'it', 'ever', 'tried', 'to', 'enter', 'this', 'country', ',', 'therefore', 'being', 'a', 'fan', 'of', 'films', '[MASK]', '"', 'controversial', '"', '[MASK]', 'really', 'had', 'to', '[MASK]', 'this', '[MASK]', 'myself', '.', '[MASK]', 'br', '/', '>', '[MASK]', 'br', '/', '>', '[MASK]', 'plot', 'is', 'centered', 'around', 'a', 'young', 'swedish', 'drama', 'cox', 'named', 'lena', 'who', 'wants', 'to', 'learn', 'everything', 'she', 'can', 'about', 'life', '.', 'in', 'particular', 'she', 'wants', 'to', 'focus', 'her', 'attention', '##s', 'to', 'making', '[MASK]', 'sort', 'of', 'documentary', '[MASK]', 'what', 'the', 'average', 'sw', '##ede', 'thou
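The word_ids column we kept earlier is what makes whole word masking possible: instead of masking individual tokens, we mask every token that belongs to a selected word. The sketch below is a simplified, hypothetical helper (whole_word_mask and its wwm_probability parameter are illustrative names, and the token IDs are made up); it only replaces tokens with [MASK] and skips the random-replacement step.

```python
import random


def whole_word_mask(input_ids, word_ids, mask_token_id,
                    wwm_probability=0.2, seed=None):
    """Mask all tokens of randomly chosen words; returns (masked_inputs, labels)."""
    rng = random.Random(seed)
    # Group consecutive token positions that share a word ID.
    # None marks special tokens and also resets the grouping between reviews.
    words, previous = [], None
    for pos, word_id in enumerate(word_ids):
        if word_id is None:
            previous = None
            continue
        if word_id != previous:
            words.append([])  # a new word starts here
        words[-1].append(pos)
        previous = word_id

    inputs, labels = list(input_ids), [-100] * len(input_ids)
    for positions in words:
        if rng.random() < wwm_probability:
            for pos in positions:
                labels[pos] = input_ids[pos]
                inputs[pos] = mask_token_id
    return inputs, labels


# Made-up IDs for: [CLS] sw ##ede thought [SEP]
ids = [101, 25430, 14728, 2245, 102]
word_ids = [None, 0, 0, 1, None]
print(whole_word_mask(ids, word_ids, mask_token_id=103, wwm_probability=1.0))
```

With this scheme the two pieces of a word such as 'sw' and '##ede' are always masked together or not at all.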

## Training

Now that we have a data collator, the rest of the fine-tuning steps are standard. Training can take a while on Google Colab, so we'll first downsample the size of the training set to a few thousand examples.

In [21]:
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

## Fine-tuning DistilBERT with the Trainer API

In [22]:
from transformers import TrainingArguments

batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
)

Here we tweaked a few of the default options, including logging_steps to ensure we track the training loss with each epoch. We've also used fp16=True to enable mixed-precision training, which gives us another boost in speed. By default, the Trainer will remove any columns that are not part of the model's forward() method. This means that if you're using a whole word masking collator, you'll also need to set remove_unused_columns=False to ensure we don't lose the word_ids column during training.

Note that you can specify the name of the repository you want to push to with the hub_model_id argument (in particular, you will have to use this argument to push to an organization). For instance, to push the model to the huggingface-course organization, you would add hub_model_id="huggingface-course/distilbert-finetuned-imdb" to TrainingArguments. By default, the repository used will be in your namespace and named after the output directory you set, so in our case it will be "tchoud8/distilbert-base-uncased-finetuned-imdb".

We now have all the ingredients to instantiate the Trainer. Here we just use the standard data_collator, but you can try the whole word masking collator and compare the results as an exercise:

In [23]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)



Unlike other tasks like text classification or question answering where we're given a labeled corpus to train on, with language modeling we don't have any explicit labels. So how do we determine what makes a good language model? Like with the autocorrect feature in your phone, a good language model is one that assigns high probabilities to sentences that are grammatically correct, and low probabilities to nonsense sentences. To give you a better idea of what this looks like, you can find whole sets of "autocorrect fails" online, where the model in a person's phone has produced some rather funny (and often inappropriate) completions!

Assuming our test set consists mostly of sentences that are grammatically correct, then one way to measure the quality of our language model is to calculate the probabilities it assigns to the next word in all the sentences of the test set. High probabilities indicate that the model is not "surprised" or "perplexed" by the unseen examples, and suggest it has learned the basic patterns of grammar in the language. There are various mathematical definitions of perplexity, but the one we'll use defines it as the exponential of the cross-entropy loss. Thus, we can calculate the perplexity of our pretrained model by using trainer.evaluate() to compute the cross-entropy loss on the test set and then taking the exponential of the result:

In [24]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 21.94
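The definition used here can be written out in a few lines. In this sketch the helper names perplexity_from_loss and perplexity_from_probs are hypothetical, and the probabilities are made up; the point is only that exponentiating the mean cross-entropy is the same as exponentiating the average negative log-probability the model assigns to the correct tokens.

```python
import math


def perplexity_from_loss(mean_cross_entropy):
    """Perplexity as the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


def perplexity_from_probs(token_probs):
    """Equivalent form: exp of the average negative log-probability of the correct tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that spreads its belief evenly over 4 candidates has a perplexity of about 4
print(perplexity_from_probs([0.25, 0.25, 0.25, 0.25]))
# A more confident model gets a lower score
print(perplexity_from_probs([0.9, 0.8, 0.95, 0.7]))
```

One way to read the number: a perplexity of N means the model is, on average, about as uncertain as if it were choosing uniformly among N candidates for each masked token.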


A lower perplexity score means a better language model, and we can see here that our starting model has a somewhat large value. Let's see if we can lower it by fine-tuning! To do that, we first run the training loop:

In [25]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 2.7024 | 2.496793 |
| 2 | 2.5794 | 2.428111 |
| 3 | 2.5354 | 2.450894 |


TrainOutput(global_step=471, training_loss=2.6049823275037634, metrics={'train_runtime': 159.5259, 'train_samples_per_second': 188.057, 'train_steps_per_second': 2.952, 'total_flos': 994208670720000.0, 'train_loss': 2.6049823275037634, 'epoch': 3.0})
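As a sanity check, the global_step value reported above can be reproduced with a little arithmetic. This sketch assumes a single device, no gradient accumulation, and the Trainer's default of 3 epochs (num_train_epochs was not set in the TrainingArguments above).

```python
import math

train_examples = 10_000
batch_size = 64
num_epochs = 3  # assumed Trainer default, since num_train_epochs was not set

# The last batch of each epoch is partial, so we round up
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# Same expression as the logging_steps computed earlier in the notebook
logging_steps = train_examples // batch_size

print(steps_per_epoch, total_steps, logging_steps)  # 157 471 156
```

Since logging_steps (156) is just below the number of steps per epoch (157), the training loss is logged roughly once per epoch.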

In [26]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 11.16


In [27]:
trainer.push_to_hub()

'https://huggingface.co/tchoud8/distilbert-base-uncased-finetuned-imdb/tree/main/'

## Fine-tuning DistilBERT with 🤗 Accelerate

We saw that DataCollatorForLanguageModeling applies random masking with each evaluation, so we'll see some fluctuations in our perplexity scores with each training run. One way to eliminate this source of randomness is to apply the masking once on the whole test set, and then use the default data collator in 🤗 Transformers to collect the batches during evaluation. To see how this works, let's implement a simple function that applies the masking on a batch, similar to our first encounter with DataCollatorForLanguageModeling:

In [28]:
def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = data_collator(features)
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}
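The first line of insert_random_mask() is fairly dense, so here it is in isolation on a tiny made-up batch. With batched=True, the map() method hands the function a dict of columns, while the data collator expects a list of per-example dicts; the comprehension converts one into the other.

```python
# A made-up batch in the "dict of columns" layout
batch = {
    "input_ids": [[101, 2023, 102], [101, 2307, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}

# zip(*batch.values()) walks the columns in parallel, yielding one tuple per example;
# zip(batch, t) then pairs each column name with that example's value.
features = [dict(zip(batch, t)) for t in zip(*batch.values())]

for feature in features:
    print(feature)
```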

Next, we'll apply this function to our test set and drop the unmasked columns so we can replace them with the masked ones. You can use whole word masking by replacing the data_collator above with the appropriate one, in which case you should remove the first line here:

In [29]:
downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])
eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)

We can then set up the dataloaders as usual, but we'll use the default_data_collator from 🤗 Transformers for the evaluation set:

In [30]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

In [31]:
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [32]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

In [33]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [34]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
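To get a feel for what the "linear" schedule does, here is a small stand-alone sketch of the rule as we understand it: the learning rate ramps up during warmup (none here) and then decays linearly to zero at the last training step. The function linear_schedule_lr is a hypothetical helper for illustration, not the library's implementation, and 471 is the number of training steps for 3 epochs of 157 batches.

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


# Learning rate at the start of training and at the end of each epoch
for step in (0, 157, 314, 471):
    print(step, linear_schedule_lr(step, base_lr=5e-5, num_training_steps=471))
```

With no warmup, the learning rate starts at its full value of 5e-5 and has dropped by one third after each of the three epochs.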

There is just one last thing to do before training: create a model repository on the Hugging Face Hub! We can use the 🤗 Hub library to first generate the full name of our repo:

In [40]:
from huggingface_hub import get_full_repo_name

model_name = "distilbert-base-uncased-finetuned-imdb-mlm-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'tchoud8/distilbert-base-uncased-finetuned-imdb-mlm-accelerate'

Then we create and clone the repository using the Repository class from 🤗 Hub. First create the repository on the Hugging Face Hub (named "distilbert-base-uncased-finetuned-imdb-mlm-accelerate"), then run the cells below:

In [36]:
# !sudo apt -qq install git-lfs
# !git config --global credential.helper store

In [41]:
from huggingface_hub import Repository

output_dir = model_name
repo = Repository(output_dir, clone_from=repo_name)

Cloning https://huggingface.co/tchoud8/distilbert-base-uncased-finetuned-imdb-mlm-accelerate into local empty directory.


### Training & Evaluation

In [42]:
from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    total_train_loss = 0.0  # To track total training loss for the epoch
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

        total_train_loss += loss.item()  # Add current batch loss to total training loss
        progress_bar.update(1)

    # Calculate average training loss for the epoch
    avg_train_loss = total_train_loss / len(train_dataloader)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    # Log perplexity to wandb
    wandb.log({"Perplexity": perplexity})

    # Update model's config with perplexity
    model.config.perplexity = perplexity

    print(f">>> Epoch {epoch}:")
    print(f"Training Loss: {avg_train_loss}")
    print(f"Validation Loss: {torch.mean(losses)}")
    print(f"Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)

    # Update and save model's config
    unwrapped_model.config.save_pretrained(output_dir)

    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

# Close wandb run
wandb.finish()


>>> Epoch 0:
Training Loss: 2.657470791203201
Validation Loss: 2.462477684020996
Perplexity: 11.733848321632662
>>> Epoch 1:
Training Loss: 2.5095403741119773
Validation Loss: 2.4211950302124023
Perplexity: 11.259306489817392
>>> Epoch 2:
Training Loss: 2.4733242700054388
Validation Loss: 2.4022672176361084
Perplexity: 11.048196673044224
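The evaluation part of the loop repeats each batch's mean loss batch_size times and then truncates the result to the size of the evaluation set. The sketch below reproduces that bookkeeping with plain Python lists, assuming a single process so that gather has nothing to combine; dataset_mean_loss is a hypothetical helper and the loss values are made up.

```python
def dataset_mean_loss(batch_losses, batch_size, dataset_size):
    """Average per-batch mean losses as if every example carried its batch's loss."""
    per_example = []
    for loss in batch_losses:
        per_example.extend([loss] * batch_size)  # mirrors loss.repeat(batch_size)
    per_example = per_example[:dataset_size]  # drop copies beyond the real examples
    return sum(per_example) / len(per_example)


# 5 examples with batch_size=2 give batches of 2, 2 and 1 examples
print(dataset_mean_loss([1.0, 2.0, 4.0], batch_size=2, dataset_size=5))  # 2.0
# A plain average over batches would overweight the small final batch
print(sum([1.0, 2.0, 4.0]) / 3)
```

In our case the 1,000 evaluation examples form 15 full batches of 64 and a final batch of 40, so truncating to len(eval_dataset) ensures the last batch only counts for 40 examples.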


wandb run history:

| Metric | History |
|---|---|
| Perplexity | █▃▁ |
| eval/loss | █▂▁▁▁ |
| eval/runtime | █▁▁▁▁ |
| eval/samples_per_second | ▁▇█▇█ |
| eval/steps_per_second | ▁▇█▇█ |
| train/epoch | ▁▁▄▅████ |
| train/global_step | ▁▃▃▆▆███████ |
| train/learning_rate | █▄▁ |
| train/loss | █▃▁ |
| train/total_flos | ▁ |

wandb run summary:

| Metric | Value |
|---|---|
| Perplexity | 11.0482 |
| eval/loss | 2.41189 |
| eval/runtime | 1.923 |
| eval/samples_per_second | 520.03 |
| eval/steps_per_second | 8.32 |
| train/epoch | 3.0 |
| train/global_step | 471.0 |
| train/learning_rate | 0.0 |
| train/loss | 2.5354 |
| train/total_flos | 994208670720000.0 |


## Inference

In [43]:
text = ["How is the [MASK] today?", "I want to know [MASK] opinion"]

In [44]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="tchoud8/distilbert-base-uncased-finetuned-imdb-mlm-accelerate"
)

In [45]:
preds = mask_filler(text)

for pred in preds:
    print("-"*80)
    for item in pred:
        print(f">>> {item['sequence']}")

--------------------------------------------------------------------------------
>>> how is the weather today?
>>> how is the day today?
>>> how is the mood today?
>>> how is the news today?
>>> how is the school today?
--------------------------------------------------------------------------------
>>> i want to know my opinion
>>> i want to know your opinion
>>> i want to know his opinion
>>> i want to know the opinion
>>> i want to know their opinion

