<a href="https://colab.research.google.com/github/MUmairAB/Masked-Language-Model-Fine-Tuning-with-HuggingFace-Transformers/blob/main/Masked_Language_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Masked Language Modeling (MLM)

Masked Language Modeling (MLM) is a popular task in natural language processing that aims to predict missing or masked words within a given text. It serves as a valuable approach to enhance language understanding and modeling capabilities. The procedure for Masked Language Modeling involves several key steps.

1. Initially, a large text corpus is gathered and pre-processed. This typically includes concatenating multiple texts into a single corpus and then splitting it into smaller chunks of manageable size. These chunks serve as the input data for the MLM task.

2. Next, the text chunks undergo tokenization, where each word or subword is assigned a unique token. This process enables the model to understand and process the text at a granular level.

3. To train the model for MLM, certain tokens within the text are randomly masked or replaced with a special **[MASK]** token. The objective is for the model to predict the original masked tokens based on the context provided by the surrounding words. This helps the model develop a deeper understanding of the language and improve its ability to generate coherent and meaningful text.

4. The masked chunks, along with their corresponding labels (the original tokens that were masked), are then used to train a language model. During training, the model learns to predict the correct tokens at the masked positions, taking into account the contextual information provided by the surrounding words.
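To make steps 3 and 4 concrete, here is a minimal sketch on a toy sentence. It assumes whitespace-separated words stand in for real tokens, and `mask_tokens` is a hypothetical helper written for illustration, not part of any library.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a toy MLM example.

    labels holds the original token at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # Hide the token from the model and remember it as the target
            masked.append(MASK)
            labels.append(tok)
        else:
            # Unmasked positions carry no training target
            masked.append(tok)
            labels.append(None)
    return masked, labels

tokens = "the movie was surprisingly good and the cast was great".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

The model only ever sees `masked`; the loss is computed against `labels` at the positions where a label is present.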


Once training is complete, the model can be used for various downstream tasks such as:

- text completion,
- sentiment analysis,
- question answering, or
- other tasks where predicting masked tokens plays a crucial role.


In this notebook, we'll use the pretrained **"DistilBERT-base-uncased"** language model from 🤗 HuggingFace and fine-tune it on the [IMDB movie reviews](https://huggingface.co/datasets/imdb) dataset.


**Let's begin**

In [1]:
#Install transformers library
!pip install transformers

Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [2]:
#Set the seed value
SEED = 4243

In [3]:
#Download the model
from transformers import TFAutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"

model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)

Downloading (…)lve/main/config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

All PyTorch model weights were used when initializing TFDistilBertForMaskedLM.

All the weights of TFDistilBertForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForMaskedLM for predictions without further training.


In [4]:
#Download the DistilBERT's tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

## Model predictions before fine-tuning

In [5]:
#Sample text
text = "This is a great [MASK]."

In [6]:
#Use the model to predict the MASK and print the top 5 candidates
import numpy as np
import tensorflow as tf

inputs = tokenizer(text, return_tensors="np")
token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = np.argwhere(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
# We negate the array before argsort to get the largest, not the smallest, logits
top_5_tokens = np.argsort(-mask_token_logits)[:5].tolist()

for token in top_5_tokens:
    print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))

This is a great deal.
This is a great success.
This is a great adventure.
This is a great idea.
This is a great feat.
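The ranking step above works on raw logits. As a small sketch of the same idea in plain Python, the snippet below sorts toy logits in descending order and also converts them to probabilities with a softmax. The logits are made-up values and `top_k_with_probs` is a hypothetical helper, not a library function.

```python
import math

def top_k_with_probs(logits, k=5):
    """Return the k highest-scoring indices with their softmax probabilities."""
    # Subtract the maximum before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sorting on the negated logit puts the largest values first
    order = sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    return [(i, probs[i]) for i in order]

toy_logits = [0.1, 2.5, -1.0, 3.2, 0.7, 1.9]
for idx, p in top_k_with_probs(toy_logits, k=3):
    print(idx, round(p, 3))
```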


## Dataset

We'll use the IMDB dataset. It can be accessed [here](https://huggingface.co/datasets/imdb).

In [7]:
#Install the datasets library
!pip install datasets

Collecting datasets
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.14-py310-none-any.whl (134 kB)
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets-2.13.

In [8]:
#Download the dataset from HuggingFace
from datasets import load_dataset

dataset = load_dataset(path="imdb")

Downloading builder script:   0%|          | 0.00/4.31k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.59k [00:00<?, ?B/s]

Downloading and preparing dataset imdb/plain_text to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0...


Downloading data:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [9]:
#View the data fields
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [10]:
dataset = dataset["train"]
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [11]:
#Inspect the data

#Let's print 5 random samples
samples = dataset.shuffle(seed=SEED).select(range(5))

for sample in samples:
    print("Review:",sample["text"])
    print("Label:",sample["label"])
    print("\n\t\t\t\t%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n")

Review: A must see by all - if you have to borrow your neighbors kid to see this one. Easily one of the best animation/cartoons released in a long-time. It took the the movies Antz to a whole new level. Do not mistake the two as being the same movie - although in principle the movies plot is similiar. Just go and enjoy.
Label: 1

				%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Review: Nothing new in this hackneyed romance with characters put into unbelievable situations, speaking dialogue that borders on the ridiculous. This is an example of another movie put into production before serious script problems were solved. Don't waste your time.
Label: 0

				%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Review: This was one of the worst movies I have ever seen. Branaugh seemed to have so much trouble remembering his accent that he couldn't deliver his lines. The plot was definitely not worthy of John Grisham's name. No wonder it was never published as a book or released in theaters. I didn't e

## Pre-processing

To carry out the Masked Language Modeling task, the text needs some pre-processing. First, the tokenized reviews are concatenated into a single sequence, which is then split into smaller, fixed-size chunks. These chunks are what the model processes.

What does the model learn from these chunks? For it to learn anything, **[MASK]** tokens must be inserted at random positions in the text. The model is then trained to predict the correct token at each masked position.
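The combine-then-split step can be sketched on toy data. The token IDs below are made up, and `chunk_token_ids` is a hypothetical helper that only illustrates the idea.

```python
def chunk_token_ids(sequences, chunk_size):
    """Concatenate token ID lists and cut the result into fixed-size chunks."""
    # Concatenate all sequences into one long list
    flat = [tok for seq in sequences for tok in seq]
    # Drop the tail that does not fill a whole chunk
    usable = (len(flat) // chunk_size) * chunk_size
    return [flat[i:i + chunk_size] for i in range(0, usable, chunk_size)]

toy = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
chunks = chunk_token_ids(toy, chunk_size=4)
print(chunks)
```

Note that chunk boundaries ignore the original sequence boundaries, so a chunk can contain the end of one review and the start of the next, and the number of chunks generally differs from the number of input texts.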

In [12]:
#Function to tokenize the text and extract word IDs
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    #Add the word IDs column
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


#Use the map() method to tokenize the text
# Use batched=True to speed-up the fast tokenizer
tokenized_datasets = dataset.map(
                                function=tokenize_function,
                                batched=True,
                                remove_columns=["text", "label"]
                                )
#Have a look at the tokenized Dataset
tokenized_datasets

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (720 > 512). Running this sequence through the model will result in indexing errors


Dataset({
    features: ['input_ids', 'attention_mask', 'word_ids'],
    num_rows: 25000
})

In [13]:
#Define a function to concatenate all the text together
# and then split it into smaller chunks
chunk_size = 128

def group_texts(examples):

    #Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}

    #Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    #We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size

    #Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    #Create a new labels column to be used by the
    # model to learn the ground truth
    result["labels"] = result["input_ids"].copy()
    return result

In [14]:
#Apply the group_texts function using the map() method
processed_dataset = tokenized_datasets.map(function=group_texts,
                                           batched=True)

#Have a look at the processed Dataset
processed_dataset

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Dataset({
    features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
    num_rows: 61291
})

Pre-processing is now complete: the text was concatenated and then divided into chunks of 128 tokens. The resulting data has an **"input_ids"** column whose content is identical to the **"labels"** column. The insertion of **[MASK]** tokens at random positions is still pending; we will address it next.

First, let's verify that the **"input_ids"** column matches the **"labels"** column.

In [15]:
#Compare the input_ids and labels fields of the sample at index 10
k = 10
print(tokenizer.decode(processed_dataset[k]["input_ids"]))
print("\n\t\t%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n")
print(tokenizer.decode(processed_dataset[k]["labels"]))

scrape to find value in its boring pseudo revolutionary political spewings.. but if it weren't for the censorship scandal, it would have been ignored, then forgotten. < br / > < br / > instead, the " i am blank, blank " rhythymed title was repeated endlessly for years as a titilation for porno films ( i am curious, lavender - for gay films, i am curious, black - for blaxploitation films, etc.. ) and every ten years or so the thing rises from the dead, to be viewed by a new generation of suckers who want

		%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

scrape to find value in its boring pseudo revolutionary political spewings.. but if it weren't for the censorship scandal, it would have been ignored, then forgotten. < br / > < br / > instead, the " i am blank, blank " rhythymed title was repeated endlessly for years as a titilation for porno films ( i am curious, lavender - for gay films, i am curious, black - for blaxploitation films, etc.. ) and every ten years or so the thing 

In [16]:
#Instantiate the data collator and set it to
# mask 15% of the tokens
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm_probability=0.15)
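A detail worth knowing: as documented for this collator, a token selected for masking is not always replaced with [MASK]. About 80% of the selected tokens become [MASK], about 10% are swapped for a random token, and about 10% are left unchanged. The sketch below imitates only that per-token decision; `corrupt_token` and the toy vocabulary are hypothetical, and the real collator works on tensors of token IDs.

```python
import random

def corrupt_token(token, vocab, rng, mask_token="[MASK]"):
    """Apply the 80/10/10 replacement rule to a token already chosen for masking."""
    r = rng.random()
    if r < 0.8:
        return mask_token         # 80%: replace with the mask token
    if r < 0.9:
        return rng.choice(vocab)  # 10%: replace with a random vocabulary token
    return token                  # 10%: keep the original token

rng = random.Random(0)
toy_vocab = ["film", "actor", "plot", "scene", "music"]
outcomes = [corrupt_token("movie", toy_vocab, rng) for _ in range(10_000)]
print("masked:", outcomes.count("[MASK]") / len(outcomes))
print("kept:", outcomes.count("movie") / len(outcomes))
```

This is why decoded batches can contain unexpected words in place of the original ones, in addition to [MASK] tokens.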

**Before applying it to the whole dataset, let's check how it behaves.**

In [17]:
#Select the first 5 samples
samples = processed_dataset.select(range(5))

#Convert the datasets.arrow_dataset.Dataset object to a list
samples = list(samples)

#Remove the word_ids column, as the data collator does not expect it
for sample in samples:
    _ = sample.pop("word_ids")

#Apply the data collator
collated_data = data_collator(samples)

#Print the input IDs
for i, sample in enumerate(collated_data["input_ids"]):
    print(f"Input ID {i}:", tokenizer.decode(sample))
    print("\n\t\t\t%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n")

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Input ID 0: [CLS] i rented i am curious - yellow from my video store because of all the [MASK] that surrounded it when [MASK] was first released in mcdowell. i also heard [MASK] at first it [MASK] seized by u [MASK] s. [MASK] if it ever tried [MASK] enter this country, therefore being a fan of films considered " controversial " i [MASK] had to see this [MASK] myself. < br / > < br [MASK] > the plot is centered around a young strapped drama student named [MASK] who wants to learn [MASK] she can about life. in particular she wants to focus her attentions to [MASK] some sort of [MASK] on what the average swede thought about [MASK] political issues [MASK]

			%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Input ID 1: [MASK] [MASK] vietnam war and race issues in [MASK] united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics [MASK] she [MASK] sex with [MASK] [MASK] teacher, classmates, [MASK] married men. < br / [MASK] [MASK] [MASK] / >

**We can see that the data collator has inserted [MASK] tokens into the text. However, it masks individual tokens rather than whole words, so we need to define a new data collator ourselves.**

In [18]:
#Define a data collator that will mask a whole word
import collections
import numpy as np

from transformers.data.data_collator import tf_default_data_collator

#Set the Whole Word Masking probability to 20%
wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return tf_default_data_collator(features)
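To see what the word-to-token mapping in the collator looks like, here is the same loop run on a hand-written `word_ids` list. The list is hypothetical: it assumes a six-token input where one word was split into three subword tokens, and `words_to_token_indices` is a helper introduced only for this illustration.

```python
import collections

def words_to_token_indices(word_ids):
    """Group token positions by the word they belong to."""
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is None:
            continue  # special tokens such as [CLS] and [SEP] have no word
        if word_id != current_word:
            # A new word starts at this token
            current_word = word_id
            current_word_index += 1
        mapping[current_word_index].append(idx)
    return dict(mapping)

# Hypothetical word_ids: [CLS], three subwords of word 0, word 1, [SEP]
toy_word_ids = [None, 0, 0, 0, 1, None]
print(words_to_token_indices(toy_word_ids))
```

Masking word 0 then means masking token positions 1, 2, and 3 together, which is exactly what distinguishes whole word masking from token-level masking.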

In [19]:
#Apply the custom data collator function to the first 5 samples
samples = processed_dataset.select(range(5))

#Convert the datasets.arrow_dataset.Dataset object to a list
samples = list(samples)

#Apply our custom data collator on the same samples
custom_collated_data = whole_word_masking_data_collator(samples)

#Print the input IDs
for i, sample in enumerate(custom_collated_data["input_ids"]):
    print(f"Input ID {i}:", tokenizer.decode(sample))
    print("\n\t\t\t%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%\n")

Input ID 0: [CLS] i rented [MASK] am [MASK] - yellow [MASK] my [MASK] store because of all the [MASK] [MASK] surrounded it [MASK] [MASK] was first [MASK] in 1967. [MASK] also [MASK] that at first it [MASK] seized by u. s. customs if it ever tried to enter this country [MASK] therefore [MASK] a fan of [MASK] considered " controversial " i really had [MASK] see this for myself [MASK] < br / > < br / > the plot [MASK] centered around a young swedish drama student named lena who wants to [MASK] everything [MASK] can about life [MASK] in particular [MASK] [MASK] to focus her [MASK] [MASK] to making some [MASK] of documentary [MASK] [MASK] the average [MASK] [MASK] [MASK] about [MASK] [MASK] issues [MASK]

			%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Input ID 1: as the vietnam war and race issues [MASK] the united states [MASK] [MASK] between asking politicians and ordinary denizens of stockholm about their opinions on politics [MASK] she has sex with her drama teacher, classmates, and m

## Train the model

Concatenating the text and splitting it into chunks has increased the dataset size considerably, from **25,000** reviews to **61,291** chunks. To keep the training time manageable, we'll downsample the dataset.

In [20]:
#Downsample the dataset
train_size = 15000
test_size = int(0.2 * train_size)

downsampled_dataset = processed_dataset.train_test_split(
        train_size=train_size,
        test_size=test_size,
        seed=SEED
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 15000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 3000
    })
})
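As a quick sanity check on these numbers, here is a small sketch of the arithmetic. The batch size of 32 is the one used in later cells of this notebook, and the batch counts assume each split is batched in full, keeping the last partial batch.

```python
import math

total_chunks = 61291               # rows in processed_dataset
train_size = 15000
test_size = int(0.2 * train_size)  # 20% of the training size
batch_size = 32                    # batch size used later in this notebook

kept = train_size + test_size
print(f"kept {kept} of {total_chunks} chunks ({kept / total_chunks:.1%})")
print("train batches per epoch:", math.ceil(train_size / batch_size))
print("test batches:", math.ceil(test_size / batch_size))
```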

In [21]:
import pandas as pd

## Prepare the tf.data object
The easiest way to prepare the dataset is the **prepare_tf_dataset()** method. However, it did not work here.

```
#Training data
train_data = model.prepare_tf_dataset(
    downsampled_dataset["train"],
    collate_fn = whole_word_masking_data_collator,
    shuffle = True,
    batch_size = 32
)

#Test data
test_data = model.prepare_tf_dataset(
    downsampled_dataset["test"],
    collate_fn = whole_word_masking_data_collator,
    shuffle = True,
    batch_size = 32
)
```

It throws a weird error:

```
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-27-0a4ca6add0ca> in <cell line: 4>()
      2
      3 #Training data
----> 4 train_data = model.prepare_tf_dataset(
      5     downsampled_dataset["train"],
      6     collate_fn = whole_word_masking_data_collator,

1 frames
/usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py in requires_backends(obj, backends)
   1012     failed = [msg.format(name) for available, msg in checks if not available()]
   1013     if failed:
-> 1014         raise ImportError("".join(failed))
   1015
   1016

ImportError:
TFBertForMaskedLM requires the 🤗 Datasets library but it was not found in your environment. You can install it with:
    pip install datasets
In a notebook or a colab, you can install it by executing a cell with
    !pip install datasets
then restarting your kernel.

Note that if you have a local folder named `datasets` or a local python file named `datasets.py` in your current
working directory, python may try to import this instead of the 🤗 Datasets library. You should rename this folder or
that python file if that's the case. Please note that you may need to restart your runtime after installation.


---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------
```

So, the alternative is **Dataset.to_tf_dataset()**. This method has its own issues, as highlighted in this [HuggingFace Discussion Thread](https://discuss.huggingface.co/t/dataset-object-has-no-attribute-to-tf-dataset/12099) and this [StackOverflow Thread](https://stackoverflow.com/questions/73937384/attribute-error-datasetdict-object-has-no-attribute-to-tf-dataset). Fortunately, there is a workaround: convert each split to a Pandas DataFrame, convert it back to a Dataset object, and then call to_tf_dataset() on the result. Let's implement it.

In [22]:
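#Convert the Dataset splits to Pandas DataFrames
#(the outputs recorded below, with 3,000 rows and 94 steps per epoch, come from a
# run in which pd_train was built from the "test" split)
pd_train = pd.DataFrame(downsampled_dataset["train"])
pd_test = pd.DataFrame(downsampled_dataset["test"])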

In [23]:
#Convert the Pandas DataFrames back to Dataset objects
from datasets import Dataset
ds_train = Dataset.from_pandas(pd_train)
ds_test = Dataset.from_pandas(pd_test)

In [25]:
pd_train

| | input_ids | attention_mask | word_ids | labels |
|---|---|---|---|---|
| 0 | [16021, 26310, 1010, 1000, 2030, 3352, 2019, 2... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [342, 342, 343, 344, 345, 346, 347, 348, 349, ... | [16021, 26310, 1010, 1000, 2030, 3352, 2019, 2... |
| 1 | [101, 17685, 1998, 2014, 2402, 2684, 17217, 20... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1... | [101, 17685, 1998, 2014, 2402, 2684, 17217, 20... |
| 2 | [2926, 2043, 2049, 25405, 1999, 26352, 9709, 1... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [138, 139, 140, 141, 142, 143, 143, 144, 144, ... | [2926, 2043, 2049, 25405, 1999, 26352, 9709, 1... |
| 3 | [1005, 1056, 3477, 2005, 4312, 2021, 2001, 198... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [1017, 1018, 1019, 1020, 1021, 1022, 1023, 102... | [1005, 1056, 3477, 2005, 4312, 2021, 2001, 198... |
| 4 | [2190, 14939, 1007, 1998, 1010, 2007, 2023, 45... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [565, 566, 567, 568, 569, 570, 571, 572, 573, ... | [2190, 14939, 1007, 1998, 1010, 2007, 2023, 45... |
| ... | ... | ... | ... | ... |
| 2995 | [1996, 6355, 7091, 3065, 2008, 1996, 2088, 200... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [388, 389, 390, 391, 392, 393, 394, 395, 396, ... | [1996, 6355, 7091, 3065, 2008, 1996, 2088, 200... |
| 2996 | [1010, 26211, 2100, 11999, 3248, 2019, 9461, 1... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [203, 204, 204, 205, 206, 207, 208, 209, 210, ... | [1010, 26211, 2100, 11999, 3248, 2019, 9461, 1... |
| 2997 | [2049, 8141, 1010, 2008, 2003, 1996, 2069, 579... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [355, 356, 357, 358, 359, 360, 361, 362, 363, ... | [2049, 8141, 1010, 2008, 2003, 1996, 2069, 579... |
| 2998 | [2019, 3535, 2000, 13883, 1996, 9530, 22461, 4... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | [41, 42, 43, 44, 45, 46, 46, 47, 48, 49, 50, 5... | [2019, 3535, 2000, 13883, 1996, 9530, 22461, 4... |


In [34]:
#Prepare the tf.data object for train data
tf_train_dataset = ds_train.to_tf_dataset(
    columns=['input_ids', 'attention_mask'],
    label_cols=["labels"],
    batch_size=32,
    collate_fn=data_collator,
    shuffle=True
)

In [33]:
#Prepare the tf.data object for test data
tf_test_dataset = ds_test.to_tf_dataset(
    columns=['input_ids', 'attention_mask'],
    label_cols=["labels"],
    batch_size=32,
    collate_fn=data_collator,
    shuffle=True
)

Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 


In [35]:
from transformers import create_optimizer

num_epochs = 10

num_train_steps = len(tf_train_dataset) * num_epochs
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=1000,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)
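create_optimizer returns an optimizer together with its learning-rate schedule. As a rough picture of what such a schedule can look like, here is a plain-Python sketch of linear warmup followed by linear decay to zero. This shape is an assumption made for illustration rather than a copy of the library's implementation, `lr_at_step` is a hypothetical helper, and `num_train_steps=4690` is a made-up value (469 batches per epoch for 10 epochs).

```python
def lr_at_step(step, init_lr=2e-5, num_warmup_steps=1000, num_train_steps=4690):
    """Linear warmup to init_lr, then linear decay to zero (assumed shape)."""
    if num_warmup_steps >= num_train_steps:
        raise ValueError("warmup must be shorter than the total number of steps")
    if step < num_warmup_steps:
        # Warmup phase: ramp up proportionally to the step count
        return init_lr * step / num_warmup_steps
    # Decay phase: go from init_lr down to zero over the remaining steps
    decay_steps = num_train_steps - num_warmup_steps
    progress = min(step - num_warmup_steps, decay_steps) / decay_steps
    return init_lr * (1.0 - progress)

for s in (0, 500, 1000, 2845, 4690):
    print(s, f"{lr_at_step(s):.2e}")
```

One thing worth checking with any warmup schedule is that the warmup is shorter than the total number of training steps. With 94 batches per epoch, as in the training log in this notebook, 10 epochs give 940 steps, which is below the 1,000 warmup steps, so these values may need adjusting.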

In [30]:
#Log in to the HF account
from huggingface_hub import notebook_login

notebook_login()


In [31]:
from transformers.keras_callbacks import PushToHubCallback
callback = PushToHubCallback(
    output_dir="bert-based-MaskedLM", tokenizer=tokenizer
)

Cloning https://huggingface.co/MUmairAB/bert-based-MaskedLM into local empty directory.


In [36]:
#First 10 Epochs
model.fit(tf_train_dataset,
          validation_data=tf_test_dataset,
          callbacks=[callback],
          epochs=num_epochs)

Epoch 1/10
 6/94 [>.............................] - ETA: 47s - loss: 3.3092



Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fca4dd79630>

In [37]:
#Second 10 Epochs
model.fit(tf_train_dataset,
          validation_data=tf_test_dataset,
          callbacks=[callback],
          epochs=num_epochs)

Epoch 1/10
 6/94 [>.............................] - ETA: 50s - loss: 2.4428



Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fca7ed9c310>

**Since the validation loss has started to increase, we'll stop further training.**

In [38]:
model.summary()

Model: "tf_distil_bert_for_masked_lm"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 vocab_transform (Dense)     multiple                  590592    
                                                                 
 vocab_layer_norm (LayerNorm  multiple                 1536      
 alization)                                                      
                                                                 
 vocab_projector (TFDistilBe  multiple                 23866170  
 rtLMHead)                                                       
                                                                 
Total params: 66,985,530
Trainable params: 66,985,530
Non-trainable params: 0
__________________________

## Model testing

Let's evaluate the performance of the model on sample data.

In [39]:
#Get the test dataset
test_set = processed_dataset.select(range(1000))

In [40]:
#Convert the Dataset object to a Pandas DataFrame
pd_test_set = pd.DataFrame(test_set)

In [41]:
#Convert the Pandas DataFrame back to a Dataset
ds_test_set = Dataset.from_pandas(pd_test_set)

In [42]:
#Prepare the tf.data object for the test set
tf_test_set = ds_test_set.to_tf_dataset(
    columns=['input_ids', 'attention_mask'],
    label_cols=["labels"],
    batch_size=32,
    collate_fn=data_collator,
    shuffle=True
)

Old behaviour: columns=['a'], labels=['labels'] -> (tf.Tensor, tf.Tensor)  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor)  
New behaviour: columns=['a'],labels=['labels'] -> ({'a': tf.Tensor}, {'labels': tf.Tensor})  
             : columns='a', labels='labels' -> (tf.Tensor, tf.Tensor) 


In [43]:
#Calculate the Perplexity
import math

eval_loss = model.evaluate(tf_test_set)
print(f"Perplexity: {math.exp(eval_loss):.2f}")

Perplexity: 10.72
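Perplexity is the exponential of the average cross-entropy loss, so it can be read as the effective number of equally likely choices the model is hesitating between at each masked position. The sketch below computes it from made-up probabilities and inverts the relation for the value reported above; `perplexity` and `toy_probs` are hypothetical and only illustrate the formula.

```python
import math

def perplexity(token_probs):
    """Exponential of the mean negative log-likelihood of the scored tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities the model assigned to the correct masked tokens
toy_probs = [0.5, 0.25, 0.125, 0.0625]
print(perplexity(toy_probs))

# Inverting the relation: the average loss implied by a perplexity of 10.72
print(math.log(10.72))
```

A lower perplexity means the model assigns higher probability to the correct tokens; a model that always predicted the right token with probability 1 would have a perplexity of 1.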
