# Masked Language Modelling

## Load the cleaned data

In [None]:
from datasets import load_dataset

twitter_data = load_dataset("cayjobla/twitter-sentiment-classification")

In [2]:
twitter_data["train"][0]

{'tweet_id': 1753253621,
 'sentiment': 8,
 'content': '@aminorjourney - We owe you a LOT.'}

## Tokenize the dataset

In [4]:
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("cayjobla/distilbert-base-uncased-finetuned-twitter")

In [5]:
test_ids = tokenizer(twitter_data["train"][0]["content"])['input_ids']
tokenizer.decode(test_ids)

'[CLS] @ aminorjourney - we owe you a lot. [SEP]'

In [6]:
def tokenize_function(examples):
    result = tokenizer(examples["content"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

tokenized_datasets = twitter_data.map(
    # Remove sentiment classification since our task is different
    tokenize_function, batched=True, remove_columns=["tweet_id", "content", "sentiment"]
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 32000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 8000
    })
})

## Concatenate and chunk

In [7]:
# Get sample lengths
tokenized_samples = tokenized_datasets["train"][:10]
for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Tweet {idx} length: {len(sample)}'")

'>>> Tweet 0 length: 11'
'>>> Tweet 1 length: 8'
'>>> Tweet 2 length: 16'
'>>> Tweet 3 length: 16'
'>>> Tweet 4 length: 36'
'>>> Tweet 5 length: 6'
'>>> Tweet 6 length: 7'
'>>> Tweet 7 length: 22'
'>>> Tweet 8 length: 24'
'>>> Tweet 9 length: 24'


In [8]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated tweets length: {total_length}'")

'>>> Concatenated tweets length: 170'


In [9]:
chunk_size = 128
chunks = {
    k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, t in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 128'
'>>> Chunk length: 42'


In [10]:
def group_texts(examples, chunk_size=128):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
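
To see what `group_texts` does without loading the dataset, here is a minimal sketch on made-up token IDs with a small chunk size. The toy batch, the name `group_texts_toy`, and `chunk_size=4` are assumptions for illustration only. The logic follows the function above, except that it flattens with `itertools.chain` instead of `sum(..., [])`, which avoids repeated list copying.

```python
from itertools import chain


def group_texts_toy(examples, chunk_size=4):
    # Flatten each column (a list of lists) into one long list
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    # Every column has the same flattened length, so measure any one of them
    total_length = len(concatenated["input_ids"])
    # Round down to a multiple of chunk_size, dropping the trailing remainder
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # Labels start as a copy of the inputs; the collator masks them later
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result


# Made-up IDs: three short "tweets" with 4, 3 and 6 tokens (13 in total)
toy_batch = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 3, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1, 1, 1]],
}
grouped = group_texts_toy(toy_batch)
print(grouped["input_ids"])
# [[101, 7, 8, 102], [101, 9, 102, 101], [5, 6, 4, 3]]
```

The 13 tokens yield three full chunks, and the final `102` is dropped because it cannot fill a fourth chunk. Chunks also cut across tweet boundaries, which is why the decoded training example further down ends mid-sentence.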

In [11]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 5219
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1342
    })
})

In [12]:
tokenizer.decode(lm_datasets["train"][0]["input_ids"])

"[CLS] @ aminorjourney - we owe you a lot. [SEP] [CLS] chilling feeling really nice.. [SEP] [CLS] i'm soooo sleepy but i'm not a home just yet [SEP] [CLS] @ sarahsss i wish i had friends i could spend the night with [SEP] [CLS] @ johncmayer ur really the sweetest person ever! thanks for making everyone's dreams come true.. ( p. s ) my dream is for u 2twitter me back x [SEP] [CLS] juss boredd,! [SEP] [CLS] @ novawildstar damn right! [SEP] [CLS] @ marjicurran1 looking forward to your gig in ireland!!! see ya there! [SEP] [CLS] home from ghosts of girlfriends"

## Collate Data

In [13]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors='tf')
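
By default, `DataCollatorForLanguageModeling` selects each non-special token with probability `mlm_probability`; of the selected tokens, 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged, while labels are set to `-100` everywhere else. The NumPy sketch below imitates that rule so it can be inspected in isolation. It is not the library's implementation: `mask_tokens_sketch` and its parameters are hypothetical names, and the example IDs are arbitrary apart from 101, 102 and 103, the `[CLS]`, `[SEP]` and `[MASK]` IDs of the uncased BERT vocabulary.

```python
import numpy as np


def mask_tokens_sketch(input_ids, mask_id, vocab_size, special_ids,
                       mlm_probability=0.15, seed=0):
    rng = np.random.default_rng(seed)
    input_ids = np.array(input_ids)
    labels = input_ids.copy()

    # Special tokens such as [CLS] and [SEP] are never selected
    can_mask = ~np.isin(input_ids, list(special_ids))
    selected = (rng.random(input_ids.shape) < mlm_probability) & can_mask

    # Positions that were not selected are ignored by the loss
    labels[~selected] = -100

    # One draw per position decides which of the three replacements applies
    roll = rng.random(input_ids.shape)
    masked = input_ids.copy()
    use_mask = selected & (roll < 0.8)
    use_random = selected & (roll >= 0.8) & (roll < 0.9)
    masked[use_mask] = mask_id
    masked[use_random] = rng.integers(0, vocab_size, size=int(use_random.sum()))
    # The remaining selected positions (roll >= 0.9) keep their original token
    return masked, labels


# Arbitrary example IDs wrapped in [CLS] (101) and [SEP] (102)
ids = [101, 2057, 12533, 2017, 1037, 2843, 1012, 102]
masked, labels = mask_tokens_sketch(
    ids, mask_id=103, vocab_size=30522, special_ids={101, 102}, mlm_probability=0.5
)
print(masked)
print(labels)
```

A higher `mlm_probability` is used in the call only so that a short sequence is likely to show some masked positions.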

In [14]:
# Token based collator
samples = [lm_datasets["train"][i] for i in range(3)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")

'>>> [CLS] @ aminorjourney - we owe you a lot. [SEP] [CLS] chilling feeling really nice.. [SEP] [CLS] i'm soooo sleepy but i'm not a [MASK] just yet [SEP] [CLS] @ sarahsss i wish i had friends i [MASK] spend the night with [SEP] [CLS] @ johncmayer ur [MASK] the sweetest person ever! thanks for making everyone's dreams come true [MASK]. ( p. [MASK] ) my dream is for [MASK] 2twitter me back x [SEP] [CLS] juss boredd,! [SEP] [CLS] @ novawildstar damn right! [SEP] [CLS] @ marjic [MASK]an1 looking forward to your gig in ireland!!! see ya there! [SEP] [CLS] home from ghosts of [MASK]'

'>>> past with [MASK] lovely lud [MASK] not exactly high [MASK], but a good date movie! [SEP] [CLS] @ passed u r " happiness pollinator " 4 shizzle. good friend you are.. [MASK] am smiling! [SEP] [CLS] @ bam _ hall is that why you aint answer my call? [MASK] we were homies!? [SEP] [CLS] @ macquid no! strange, each time we [MASK] i [MASK] like i have come home and yet i have no spanish [MASK] that i know of. [

In [15]:
import collections
import numpy as np

from transformers.data.data_collator import tf_default_data_collator

wwm_probability = 0.2

# Word-based collator
def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        word2idxs = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
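        for token_idx, word_id in enumerate(word_ids):
            if word_id is None:
                # Special token: reset so that word IDs restarting at 0 in the
                # next tweet are not merged with the previous tweet's last word
                current_word = None
                continue
            if word_id != current_word:
                current_word = word_id
                current_word_index += 1
            word2idxs[current_word_index].append(token_idx)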

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, len(word2idxs))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in word2idxs[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return tf_default_data_collator(features)
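
The word-grouping step of the collator can be checked on its own. Below is a minimal sketch that runs it on a hand-written `word_ids` list of the shape a fast tokenizer returns: `None` for special tokens and a repeated integer for the pieces of one word. The function name `build_word_to_token_map` and the example list are made up for illustration. This sketch resets the tracked word at every `None`, so a tweet ending in word 0 and a following tweet starting with word 0 stay separate.

```python
import collections


def build_word_to_token_map(word_ids):
    word2idxs = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for token_idx, word_id in enumerate(word_ids):
        if word_id is None:
            # Special token: forget the previous word
            current_word = None
            continue
        if word_id != current_word:
            # A new word starts here; give it the next running index
            current_word = word_id
            current_word_index += 1
        word2idxs[current_word_index].append(token_idx)
    return dict(word2idxs)


# Hypothetical chunk: a one-word tweet, then a tweet whose first word has two pieces
word_ids = [None, 0, None, None, 0, 0, 1, None]
print(build_word_to_token_map(word_ids))
# {0: [1], 1: [4, 5], 2: [6]}
```

Because word IDs restart at 0 in every tweet, the running `current_word_index` is what keeps words from different tweets in the same chunk apart.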

In [16]:
# Word-based collator
samples = [lm_datasets["train"][i] for i in range(3)]

for chunk in whole_word_masking_data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] @ aminorjourney - we owe you a lot [MASK] [SEP] [CLS] chilling feeling really nice. [MASK] [SEP] [CLS] [MASK]'m [MASK] sleepy [MASK] i'm not a home just yet [SEP] [CLS] @ sarahsss [MASK] [MASK] i [MASK] friends i could spend the night with [SEP] [CLS] [MASK] [MASK] [MASK] really the sweetest person ever [MASK] thanks for making [MASK] [MASK] s dreams come true.. ( p. [MASK] ) my dream is for u 2twitter me back x [SEP] [CLS] juss boredd [MASK]! [SEP] [CLS] @ novawildstar damn [MASK]! [SEP] [CLS] @ [MASK] [MASK] [MASK] [MASK] [MASK] looking forward to your gig in ireland!!! see ya [MASK]! [SEP] [CLS] home [MASK] [MASK] of [MASK]'

'>>> past with my lovely luddite not [MASK] high theatre, but [MASK] good date movie [MASK] [SEP] [CLS] [MASK] rosehwang u r " happiness pollinator " [MASK] shizzle [MASK] good friend you are.. [MASK] am [MASK]! [SEP] [CLS] @ bam _ hall [MASK] [MASK] why you aint answer my call [MASK] thought we [MASK] homies! [MASK] [SEP] [CLS] [MASK] macquid no! s

## Load the model to fine-tune

In [17]:
from transformers import TFAutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForMaskedLM: ['activation_13']
- This IS expected if you are initializing TFDistilBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertForMaskedLM were initialized from the model checkpoint at distilbert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForMaskedLM for predictions without further training.


In [18]:
model.summary()

Model: "tf_distil_bert_for_masked_lm"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 vocab_transform (Dense)     multiple                  590592    
                                                                 
 vocab_layer_norm (LayerNorm  multiple                 1536      
 alization)                                                      
                                                                 
 vocab_projector (TFDistilBe  multiple                 23866170  
 rtLMHead)                                                       
                                                                 
Total params: 66,985,530
Trainable params: 66,985,530
Non-trainable params: 0
__________________________

In [19]:
lm_datasets = lm_datasets.remove_columns(["word_ids"])

In [20]:
tf_train_dataset = model.prepare_tf_dataset(
    lm_datasets["train"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=128,
)

tf_eval_dataset = model.prepare_tf_dataset(
    lm_datasets["test"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=128,
)

In [22]:
from transformers import create_optimizer
from transformers.keras_callbacks import PushToHubCallback
import tensorflow as tf

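# One pass over the training set: len() of the batched dataset is the number of steps per epoch
num_train_steps = len(tf_train_dataset)
optimizer, schedule = create_optimizer(
    init_lr=2e-5,
    # NOTE: with 5219 chunks and batch_size=128 an epoch is only about 40 steps,
    # so this 100-step warm-up never completes and the peak rate is not reached
    num_warmup_steps=100,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)
model.compile(optimizer=optimizer)

# Train in mixed-precision float16
# NOTE: Keras reads this policy when layers are created, so set it before
# loading the model if it should apply to the model's layers
tf.keras.mixed_precision.set_global_policy("mixed_float16")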

# model_name = model_checkpoint.split("/")[-1]
# callback = PushToHubCallback(
#     output_dir=f"{model_name}-finetuned-twitter-mask", tokenizer=tokenizer
# )

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.
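
The interaction between the warm-up length and the number of training steps is easy to check by hand. The sketch below assumes that `create_optimizer` applies a linear warm-up to `init_lr` followed by a linear decay to zero, which is my reading of its defaults, and that the last partial batch is dropped. `lr_at_step` is a hypothetical helper, not a library function.

```python
def lr_at_step(step, init_lr=2e-5, num_warmup_steps=100, num_train_steps=40):
    """Linear warm-up to init_lr, then linear decay to zero (a sketch)."""
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    decay_steps = max(1, num_train_steps - num_warmup_steps)
    progress = min(1.0, (step - num_warmup_steps) / decay_steps)
    return init_lr * (1.0 - progress)


# Assumed: 5219 training chunks, batch size 128, last partial batch dropped
steps_per_epoch = 5219 // 128
peak_lr = max(
    lr_at_step(s, num_train_steps=steps_per_epoch) for s in range(steps_per_epoch)
)
print(f"steps per epoch: {steps_per_epoch}, highest learning rate reached: {peak_lr:.2e}")
```

Under these assumptions a single epoch ends while the schedule is still warming up, so the learning rate stays well below `2e-5`. Training for more epochs (with `num_train_steps` scaled to match) or shortening the warm-up would let the schedule reach its peak.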


In [23]:
import math

eval_loss = model.evaluate(tf_eval_dataset)
print(f"Perplexity: {math.exp(eval_loss):.2f}")

Perplexity: 1758471.09


In [25]:
model.fit(tf_train_dataset, validation_data=tf_eval_dataset)  # callbacks=[callback]



<keras.callbacks.History at 0x7f313c341490>

In [26]:
eval_loss = model.evaluate(tf_eval_dataset)
print(f"Perplexity: {math.exp(eval_loss):.2f}")

Perplexity: 5900.57
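
Perplexity here is the exponential of the mean cross-entropy loss, so the two reported values can be converted back to losses for easier comparison. A minimal sketch, with helper names of my own choosing:

```python
import math


def perplexity_from_loss(mean_cross_entropy):
    # Same conversion as in the evaluation cells: exp of the loss
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(perplexity):
    # Inverse conversion: natural log of the perplexity
    return math.log(perplexity)


# Perplexities reported above, before and after the single training epoch
for label, ppl in [("before fine-tuning", 1758471.09), ("after one epoch", 5900.57)]:
    print(f"{label}: perplexity {ppl:.2f} -> loss {loss_from_perplexity(ppl):.3f}")
```

On the loss scale the change is from roughly 14.4 to roughly 8.7, which is a more readable measure of progress than the ratio of the two perplexities.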


In [27]:
model_name = "distilbert-base-uncased-finetuned-twitter-mask"
tokenizer.save_pretrained(model_name)
model.save_pretrained(model_name)

## Pipeline for our model

In [24]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="cayjobla/distilbert-base-uncased-finetuned-twitter-mask"
)

All model checkpoint layers were used when initializing TFDistilBertForMaskedLM.

All the layers of TFDistilBertForMaskedLM were initialized from the model checkpoint at cayjobla/distilbert-base-uncased-finetuned-twitter.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForMaskedLM for predictions without further training.


In [25]:
text = "I went to the [MASK]"
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> i went to the pain
>>> i went to thepcom
>>> i went to the prod
>>> i went to theffee
>>> i went to the fav
