We're going to use the wikitext (link) dataset with the distilbert-base-cased (link) model checkpoint.

Start by loading the wikitext-2-raw-v1 version of that dataset, and take the 11th example (index 10) of the train split.
We'll tokenize this using the appropriate tokenizer, and we'll mask the sixth token (index 5) the sequence.

When using the distilbert-base-cased checkpoint to unmask that (sixth token, index 5) token, what is the most probable predicted token (please provide the decoded token, and not the ID)?

Tips:
- You might find the transformers docs (link) useful.
- You might find the datasets docs (link) useful.
- You might also be interested in the Hugging Face course (link).


https://huggingface.co/spaces/internships/internships-2023

Tokenization is the process of breaking a string of text into individual words, phrases or other meaningful elements, known as tokens. In this case, the appropriate tokenizer for the distilbert-base-cased model checkpoint is the DistilBertTokenizer, which is trained to tokenize text in a way that is consistent with the way the model was trained on.

Masking is a technique used in transformer-based models like BERT to replace a token in the input text with a special token, [MASK], during the training process. In this case, the sixth token (index 5) in the sequence will be replaced with the [MASK] token. This is done so that the model can learn to predict the original token, which is now hidden. The model is trained to predict the original token based on the context provided by the other tokens in the sequence.

In summary, tokenization is the process of breaking the text into individual tokens and masking is the process of hiding a token in the input sequence to be predicted by the model during the training process.

In [1]:
baseUrl = "https://datasets-server.huggingface.co/first-rows?dataset=wikitext&config=wikitext-2-raw-v1&split=train"


In [2]:
baseUrl

'https://datasets-server.huggingface.co/first-rows?dataset=wikitext&config=wikitext-2-raw-v1&split=train'

In [3]:
import pandas as pd

with open('dataset_infos.json', encoding='utf-8-sig') as f_input:
    df = pd.read_json(f_input)

df.to_csv('dataset_infos.csv', encoding='utf-8', index=False)

In [4]:
df = pd.read_csv('dataset_infos.csv')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17 entries, 0 to 16
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   wikitext-103-v1      13 non-null     object
 1   wikitext-2-v1        13 non-null     object
 2   wikitext-103-raw-v1  13 non-null     object
 3   wikitext-2-raw-v1    13 non-null     object
dtypes: object(4)
memory usage: 672.0+ bytes


In [6]:
df['wikitext-2-raw-v1']

0      The WikiText language modeling dataset is a c...
1     @misc{merity2016pointer,\n      title={Pointer...
2     https://blog.einstein.ai/the-wikitext-long-ter...
3     Creative Commons Attribution-ShareAlike 4.0 In...
4     {'text': {'dtype': 'string', 'id': None, '_typ...
5                                                   NaN
6                                                   NaN
7                                                   NaN
8                                              wikitext
9                                     wikitext-2-raw-v1
10    {'version_str': '1.0.0', 'description': None, ...
11    {'test': {'name': 'test', 'num_bytes': 1305092...
12    {'https://s3.amazonaws.com/research.metamind.i...
13                                              4721645
14                                                  NaN
15                                             13526117
16                                             18247762
Name: wikitext-2-raw-v1, dtype: object

In [7]:


# from transformers import pipeline, set_seed
# generator = pipeline('text-generation', model='gpt2')
# generator("Hello, I like to play cricket,", max_length=60, num_return_sequences=7)



In [9]:
!pip install nlp



In [12]:
import torch
import transformers
import nlp

# Load the Wikitext-2 dataset
dataset = nlp.load_dataset('wikitext', 'wikitext-2-raw-v1')

# Get the 11th example (index 10) of the train split
example = dataset['train'][10]

# Load the DistilBERT model and tokenizer
model = transformers.DistilBertModel.from_pretrained('distilbert-base-cased')
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-cased')

# Tokenize the example
input_ids = tokenizer.encode(example['text'], return_tensors='pt')

# Mask the sixth token (index 5) in the sequence
masked_input_ids = input_ids.clone()
masked_input_ids[:, 5] = tokenizer.mask_token_id

# Use the model to predict the most probable token for the masked token
output = model(masked_input_ids)[0]
prediction_scores, prediction_indexes = output[:, 5, :].max(dim=-1)
# prediction_scores, prediction_indexes = output[:, 5, :].max(dim=-1)

# Decode the predicted token ID to obtain the actual token
predicted_token = tokenizer.decode(prediction_indexes, skip_special_tokens=True)

# Replace the masked token with the predicted token in the input sequence
decoded_input_ids = input_ids.squeeze().tolist()
decoded_input_ids[5] = prediction_indexes.item()
decoded_input = tokenizer.decode(decoded_input_ids, skip_special_tokens=True)

print(f'Input: {example["text"]}')
print(f'Predicted token: {predicted_token}')
print(f'Decoded input: {decoded_input}')

2023-01-27 21:58:14.018584: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Downloading:   0%|          | 0.00/8.14k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/5.81k [00:00<?, ?B/s]

Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.91 MiB, post-processed: Unknown sizetotal: 17.41 MiB) to /Users/ryantalbot/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/8e456126357b4411737ead54576f99321fc077a0d4b64e4a724ab3454ba5b730...


Downloading:   0%|          | 0.00/4.72M [00:00<?, ?B/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

Dataset wikitext downloaded and prepared to /Users/ryantalbot/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/8e456126357b4411737ead54576f99321fc077a0d4b64e4a724ab3454ba5b730. Subsequent calls will reuse this data.


Downloading:   0%|          | 0.00/411 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/251M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Downloading:   0%|          | 0.00/208k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Input:  The game 's battle system , the BliTZ system , is carried over directly from Valkyira Chronicles . During missions , players select each unit using a top @-@ down perspective of the battlefield map : once a character is selected , the player moves the character around the battlefield in third @-@ person . A character can only act once per @-@ turn , but characters can be granted multiple turns at the expense of other characters ' turns . Each character has a field and distance of movement limited by their Action Gauge . Up to nine characters can be assigned to a single mission . During gameplay , characters will call out if something happens to them , such as their health points ( HP ) getting low or being knocked out by enemy attacks . Each character has specific " Potentials " , skills unique to each character . They are divided into " Personal Potential " , which are innate skills that remain unaltered unless otherwise dictated by the story and can either help or impede a ch

In [11]:
# import sys
# sys.path

['/Users/ryantalbot/Desktop/bookcamp/huggingface',
 '/Users/ryantalbot/opt/anaconda3/envs/tf2/lib/python39.zip',
 '/Users/ryantalbot/opt/anaconda3/envs/tf2/lib/python3.9',
 '/Users/ryantalbot/opt/anaconda3/envs/tf2/lib/python3.9/lib-dynload',
 '',
 '/Users/ryantalbot/.local/lib/python3.9/site-packages',
 '/Users/ryantalbot/opt/anaconda3/envs/tf2/lib/python3.9/site-packages',
 '/Users/ryantalbot/.local/lib/python3.9/site-packages/IPython/extensions',
 '/Users/ryantalbot/.ipython']

In [None]:
# /Users/yufeng/anaconda3/envs/py33/bin/python -m pip install plotly

/Users/ryantalbot/opt/anaconda3/envs/tf2/bin/python3 -m pip install nlp