We're going to use the `wikitext` (https://huggingface.co/datasets/wikitext) dataset with the `distilbert-base-cased` (https://huggingface.co/distilbert-base-cased) model checkpoint. 

Start by loading the `wikitext-2-raw-v1` version of that dataset, and take the 11th example (index 10) of the train split.
We'll tokenize this example using the checkpoint's tokenizer, and we'll mask the sixth token (index 5) of the sequence. 

When the `distilbert-base-cased` checkpoint is used to fill in that masked token (the sixth token, index 5), what is the most probable predicted token? Please provide the decoded token, not its ID. 

Tips: 
- You might find the transformers docs (https://huggingface.co/docs/transformers/index) useful. 
- You might find the datasets docs (https://huggingface.co/docs/datasets/index) useful. 
- You might also be interested in the Hugging Face course (https://huggingface.co/course/chapter1/1).
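
One detail the task leaves open is what "index 5" counts from. `tokenizer.tokenize()` returns the bare tokens, while calling the tokenizer directly also adds the `[CLS]` and `[SEP]` special tokens, which shifts every real token one position to the right. Here is a minimal sketch of the difference in plain Python. The token list and the `mask_at` helper are made up for illustration; they are not the real wikitext example or a transformers API.

```python
# Hypothetical tokens, for illustration only (not the real wikitext example).
tokens = ["The", "game", "began", "development", "in", "2010", ",", "carrying"]


def mask_at(tokens, index, mask_token="[MASK]"):
    """Return a copy of `tokens` with the token at `index` replaced by the mask."""
    masked = list(tokens)
    masked[index] = mask_token
    return masked


# Convention A: index 5 counts positions in the bare token list,
# which is what tokenizer.tokenize() gives us.
bare = mask_at(tokens, 5)

# Convention B: index 5 counts positions after the special tokens are added,
# so index 0 is [CLS] and the sixth position holds the fifth real token.
with_special = mask_at(["[CLS]"] + tokens + ["[SEP]"], 5)

print(bare)
print(with_special)
```

The two conventions mask different words, so the predicted token can differ depending on which one we pick. It is worth stating the choice next to the answer.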

In [1]:
# ! pip install datasets evaluate transformers[sentencepiece]
from transformers import DistilBertTokenizer, DistilBertForMaskedLM
import torch
import datasets

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-cased")
model = DistilBertForMaskedLM.from_pretrained("distilbert-base-cased")

dataset = datasets.load_dataset("wikitext", "wikitext-2-raw-v1")
train_dataset = dataset["train"]

# Take the 11th example
example = train_dataset[10]

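# Tokenize the example. Note that tokenize() does not add the [CLS] and [SEP]
# special tokens, so index 5 below counts positions in the bare token list.
tokens = tokenizer.tokenize(example["text"])

# Mask the 6th token (index 5)
tokens[5] = tokenizer.mask_token

# Convert the tokens to their IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)

# Convert input_ids to a tensor and add a batch dimension
input_ids = torch.tensor(input_ids).unsqueeze(0)

# Run the model without tracking gradients and take the logits
with torch.no_grad():
    logits = model(input_ids).logits

# Unmask the 6th token: pick the highest-scoring vocabulary ID at index 5.
# Pass the ID to decode() as a list; a bare int can come back as
# space-separated characters with this tokenizer class.
predicted_id = torch.argmax(logits[0, 5]).item()
predicted_token = tokenizer.decode([predicted_id])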

print("The most probable predicted token is:", predicted_token)




The most probable predicted token is: mechanic
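
The cell above picks the prediction by taking the argmax over the model's raw scores (logits) at the masked position, without converting them to probabilities first. That is enough to answer "most probable", because softmax preserves the ordering of the scores. Here is a small sketch in plain Python to check that. The four-word vocabulary and the logit values are made up for illustration and are not model output.

```python
import math

# Hypothetical vocabulary and logits, for illustration only (not model output).
vocab = ["engineer", "mechanic", "pilot", "doctor"]
logits = [1.2, 3.4, 0.3, -0.5]


def softmax(xs):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)

# Index of the largest logit and index of the largest probability
best_by_logit = max(range(len(logits)), key=logits.__getitem__)
best_by_prob = max(range(len(probs)), key=probs.__getitem__)

print(vocab[best_by_logit], vocab[best_by_prob])
```

If we also want to report how confident the model is, we would apply softmax to the logits at the masked position and read off the probability of the chosen ID.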

