In this tutorial, we use a pretrained transformer model to perform a fill-in-the-blank task,
commonly known as **Fill Mask**. A fill-mask model predicts the word hidden behind a mask placeholder from the surrounding context.
We'll use the `bert-base-uncased` model, a common choice for this task, whose mask token is `[MASK]`.
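
To make the fill-in-the-blank idea concrete, here is a minimal sketch of how an input sentence for such a model could be built by hand. The helper `mask_word` and the constant `MASK_TOKEN` are hypothetical names introduced for illustration; they are not part of the `transformers` library.

```python
import re

MASK_TOKEN = "[MASK]"  # mask placeholder used by bert-base-uncased


def mask_word(sentence: str, word: str, mask_token: str = MASK_TOKEN) -> str:
    """Replace the first whole-word occurrence of `word` with the mask token.

    Hypothetical helper for illustration only.
    """
    pattern = r"\b" + re.escape(word) + r"\b"
    # A function replacement avoids any escape handling in the mask token.
    masked, count = re.subn(pattern, lambda _match: mask_token, sentence, count=1)
    if count == 0:
        raise ValueError(f"{word!r} not found in sentence")
    return masked


print(mask_word("The capital of France is Paris.", "Paris"))
```

The resulting string has the same shape as the example sentence passed to the pipeline later in this tutorial.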

# Import the necessary libraries

In [1]:
from transformers import pipeline


# Create a Fill-Mask pipeline


In [2]:
# Initialize a fill-mask pipeline using the `bert-base-uncased` model.
fill_masker = pipeline("fill-mask", model="bert-base-uncased")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).





# Test the pipeline with an example sentence


In [3]:
sentence = "The capital of France is [MASK]."
print(fill_masker(sentence))

[{'score': 0.4167894423007965, 'token': 3000, 'token_str': 'paris', 'sequence': 'the capital of france is paris.'},
 {'score': 0.07141634821891785, 'token': 22479, 'token_str': 'lille', 'sequence': 'the capital of france is lille.'},
 {'score': 0.06339266151189804, 'token': 10241, 'token_str': 'lyon', 'sequence': 'the capital of france is lyon.'},
 {'score': 0.04444744810461998, 'token': 16766, 'token_str': 'marseille', 'sequence': 'the capital of france is marseille.'},
 {'score': 0.030297260731458664, 'token': 7562, 'token_str': 'tours', 'sequence': 'the capital of france is tours.'}]
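
The pipeline returns a list of dictionaries, one per candidate, each carrying a `score` and a `token_str`. As a rough sketch of how that structure might be post-processed, the snippet below ranks candidates and drops low-scoring ones. The `summarize` helper and its `min_score` threshold are hypothetical, and the scores are the values printed above, rounded to four decimals.

```python
# Predictions copied from the output above (scores rounded, other keys omitted).
predictions = [
    {"score": 0.4168, "token_str": "paris"},
    {"score": 0.0714, "token_str": "lille"},
    {"score": 0.0634, "token_str": "lyon"},
    {"score": 0.0444, "token_str": "marseille"},
    {"score": 0.0303, "token_str": "tours"},
]


def summarize(predictions, min_score=0.05):
    """Return (token, score) pairs at or above `min_score`, best first.

    Hypothetical helper; `min_score` is an arbitrary illustrative cutoff.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], p["score"]) for p in ranked if p["score"] >= min_score]


for token, score in summarize(predictions):
    print(f"{token:<10} {score:.1%}")
```

With this cutoff only the three strongest candidates remain, with `paris` first.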


# Fine-Tuning a Fill Mask Model


In [6]:
# To fine-tune a fill-mask model on a custom dataset, follow these steps.
# Requires the `datasets` package (pip install datasets); without it the
# import below fails with ModuleNotFoundError.
from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Using a sample dataset here for demonstration
dataset = load_dataset('bookcorpus', split='train[:1%]')

# Load the pretrained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Preprocess the dataset: tokenize only. Padding, masking, and labels are
# produced per batch by the data collator below.
def preprocess_data(examples):
    return tokenizer(examples['text'], truncation=True, max_length=512)

train_data = dataset.map(preprocess_data, batched=True, remove_columns=["text"])

# Randomly mask 15% of the tokens in each batch and build the matching labels.
# Copying input_ids into labels without masking anything would leave the model
# nothing to predict.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Define training arguments and initialize Trainer.
# Evaluation is disabled by default, so no evaluation strategy is set.
training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    data_collator=data_collator
)

# Fine-tune the model
trainer.train()


ModuleNotFoundError: No module named 'datasets'
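
Masked-language-model training usually hides a random subset of tokens and scores the model only on those positions. The sketch below illustrates that labelling scheme in plain Python, assuming BERT-style 80/10/10 masking and the `-100` label value that PyTorch's cross-entropy loss ignores by default. The function `mask_tokens` and the token ids in `example_ids` are illustrative assumptions, not code from `transformers`.

```python
import random

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for one sequence using 80/10/10 masking.

    Hypothetical helper for illustration; operates on plain Python lists.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; select other tokens with mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the loss is computed only at selected positions
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return inputs, labels


# Illustrative ids: 101 = [CLS], 102 = [SEP], 103 = [MASK] in bert-base-uncased.
example_ids = [101, 1996, 3007, 1997, 2605, 2003, 3000, 1012, 102]
inputs, labels = mask_tokens(
    example_ids, mask_id=103, vocab_size=30522,
    special_ids={101, 102}, rng=random.Random(0),
)
print(inputs)
print(labels)
```

Positions whose label stays at `-100` contribute nothing to the loss, so the model is trained only on the tokens it had to guess.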