This notebook explores adding pseudoword embeddings to a BERT model as new vocabulary entries. First, the imports:

In [28]:
from transformers import BertForMaskedLM, BertTokenizer
import torch
import numpy as np

Load the pseudoword embeddings:

In [29]:
pseudowords = np.load("../../out/pseudowords-avg.npy")
pseudowords

array([[-0.6139378 , -0.09849278,  0.38503495, ...,  0.5829185 ,
        -0.03312979, -0.32081214],
       [ 0.7645389 , -0.9078201 , -0.83330715, ...,  0.29633465,
         0.2754484 ,  0.01427615],
       [-0.04470551, -0.13107753, -0.6122573 , ...,  0.9067127 ,
         0.5997687 , -0.05785817],
       ...,
       [-0.41056955, -0.25118613, -0.1638469 , ..., -0.26522723,
         1.182461  ,  0.3514442 ],
       [ 0.188755  , -0.4069652 ,  0.12774336, ..., -0.12126191,
        -0.5085608 ,  0.632808  ],
       [-0.34621328,  0.5956651 , -0.95337975, ...,  0.05335583,
        -1.1837399 ,  0.24092862]], dtype=float32)
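Before the matrix is appended to BERT's embedding table, it is worth checking that its layout is what the model expects. The following is a minimal sketch, assuming one row per pseudoword and a hidden size of 768 (the width of the `bert-base-cased` embedding layer printed further down). `check_embedding_matrix` is a hypothetical helper, and the random array only stands in for the loaded file:

```python
import numpy as np

def check_embedding_matrix(matrix, hidden_size=768):
    """Validate a (n_tokens, hidden_size) matrix and return n_tokens."""
    if matrix.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {matrix.ndim} dimension(s)")
    if matrix.shape[1] != hidden_size:
        raise ValueError(f"expected {hidden_size} columns, got {matrix.shape[1]}")
    if not np.isfinite(matrix).all():
        raise ValueError("matrix contains NaN or infinite values")
    return matrix.shape[0]

# Stand-in for np.load(...): 19 random vectors of width 768 (assumed sizes).
rng = np.random.default_rng(0)
demo = rng.normal(size=(19, 768)).astype(np.float32)
n_tokens = check_embedding_matrix(demo)
print(n_tokens)
```

In the notebook itself the call would be `check_embedding_matrix(pseudowords)`.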

Build the list of new token names:

In [30]:
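import csv
import json

# Despite the .txt extension, this file contains JSON.
with open("../../data/pseudowords/MaPP_all.txt") as json_file:
    data = json.load(json_file)

# Surface form of each target word in its query sentence (not used below).
new_tokens = [d["query"].split()[d["query_idx"]] for d in data]

# Map column 0 of the CSV to a label built from columns 2 and 5, e.g. 'in1'.
with open("../../data/pseudowords/MaPP_Dataset.csv") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    labels = {row[0]: row[2] + row[5] for row in csv_reader}

# Collect the labels in order of first appearance, without duplicates.
bert_tokens = []

for d in data:
    try:
        label = labels[d["target1"]]
    except KeyError:
        # Retry with surrounding whitespace stripped from the key.
        label = labels[d["target1"].strip()]
    if label not in bert_tokens:
        bert_tokens.append(label)
bert_tokens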

['in1',
 'in2',
 'for1',
 'for2',
 'for3',
 'started1',
 'started2',
 'had1',
 'had4',
 'had5',
 'about1',
 'about2',
 'with1',
 'with2',
 'with3',
 'on1',
 'on2',
 'run1',
 'run2']
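The label lookup above can be tried out on toy data. This is a minimal sketch: the rows below are invented for illustration (only the column positions 0, 2 and 5 mirror the notebook's code), and `build_labels` and `unique_in_order` are hypothetical helper names.

```python
import csv
import io

def build_labels(csv_text):
    """Map column 0 to the concatenation of columns 2 and 5."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=",")
    return {row[0]: row[2] + row[5] for row in reader}

def unique_in_order(items):
    """Drop duplicates while keeping the order of first appearance."""
    return list(dict.fromkeys(items))

# Invented rows; columns 1, 3 and 4 are unused filler.
toy_csv = (
    "t1,x,in,x,x,1\n"
    "t2,x,in,x,x,2\n"
    "t3,x,for,x,x,1\n"
)
labels = build_labels(toy_csv)

# Targets may repeat and may carry stray whitespace, hence the strip().
targets = ["t1", "t2 ", "t1", "t3"]
tokens = unique_in_order(labels[t.strip()] for t in targets)
print(tokens)  # ['in1', 'in2', 'for1']
```

`dict.fromkeys` keeps insertion order, so it deduplicates in a single pass instead of scanning the list for every label.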

Load the vanilla BERT model:

In [31]:
model = BertForMaskedLM.from_pretrained('bert-base-cased', return_dict=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model.bert.embeddings.word_embeddings

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Embedding(28996, 768, padding_idx=0)

Append the pseudoword vectors to the existing word embeddings:

In [32]:
combined_embeddings = torch.cat((model.bert.embeddings.word_embeddings.weight, torch.tensor(pseudowords)), dim=0)
model.bert.embeddings.word_embeddings = torch.nn.Embedding.from_pretrained(combined_embeddings)
model.bert.embeddings.word_embeddings

Embedding(29015, 768)
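`torch.nn.Embedding.from_pretrained` freezes the weights by default (`freeze=True`) and does not carry over the original layer's `padding_idx`, which is why the layer printed above no longer shows `padding_idx=0`. If the vectors should stay trainable and the padding index should be kept, both can be passed explicitly. A minimal sketch on a toy table, with made-up sizes (10 tokens, width 4) in place of BERT's:

```python
import torch

# Toy stand-ins: a 10-token table of width 4 and 3 new vectors (assumed sizes).
old_embedding = torch.nn.Embedding(10, 4, padding_idx=0)
new_vectors = torch.randn(3, 4)

# detach() drops the autograd history before concatenating.
combined = torch.cat((old_embedding.weight.detach(), new_vectors), dim=0)

# freeze=False keeps the weights trainable; padding_idx is passed on explicitly.
new_embedding = torch.nn.Embedding.from_pretrained(
    combined,
    freeze=False,
    padding_idx=old_embedding.padding_idx,
)
print(new_embedding)
```

Whether the pseudoword vectors should be frozen depends on the experiment; the sketch only shows where that choice is made.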

Register the new token names with the tokenizer and resize the model to match:

In [33]:
tokenizer.add_tokens(bert_tokens)
model.resize_token_embeddings(len(tokenizer))

You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 29015. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


Embedding(29015, 768)
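An alternative order, sketched here under assumptions, is to let `resize_token_embeddings` grow the table first and then copy the new vectors into the added rows. The example uses a tiny, randomly initialised BERT (made-up sizes, nothing is downloaded) rather than `bert-base-cased`, and random vectors in place of the pseudowords:

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Tiny randomly initialised model; all sizes are invented for the demo.
config = BertConfig(
    vocab_size=30,
    hidden_size=16,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=32,
)
model = BertForMaskedLM(config)

n_new = 3
new_vectors = torch.randn(n_new, 16)

# Grow the input embeddings (and the tied output layer) by n_new rows.
model.resize_token_embeddings(30 + n_new)

# Overwrite the freshly added rows with the new vectors.
with torch.no_grad():
    model.get_input_embeddings().weight[-n_new:] = new_vectors

print(model.get_input_embeddings())
```

With this order the embedding module is never replaced, so its `padding_idx` and its tie to the output layer are handled by the library.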

Try it with an example:

In [47]:
tokenized_text = tokenizer.tokenize("The [MASK]")
masked_index = tokenized_text.index("[MASK]")
tokenized_text

['The', '[MASK]', '[PAD]']

Convert the tokens to indices:

In [48]:
input_ids = tokenizer.convert_tokens_to_ids(tokenized_text)
input_ids = torch.tensor([input_ids])
input_ids

tensor([[1109,  103,    0]])

Predict the token:

In [49]:
with torch.no_grad():
    outputs = model(input_ids)
    predictions = outputs.logits

predicted_token_id = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_token_id])[0]

predicted_token

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


'for2'
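The warning above appears because the input contains a `[PAD]` token (ID 0) but no `attention_mask` was passed. A minimal sketch of building one from the `input_ids` shown earlier, assuming the padding ID is 0 as it is for this tokenizer:

```python
import torch

pad_token_id = 0  # [PAD] in bert-base-cased; tokenizer.pad_token_id in the notebook
input_ids = torch.tensor([[1109, 103, 0]])

# 1 for real tokens, 0 for padding positions.
attention_mask = (input_ids != pad_token_id).long()
print(attention_mask)  # tensor([[1, 1, 0]])
```

The mask would then be passed along with the IDs, as in `model(input_ids, attention_mask=attention_mask)`.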

Predict the top 100 tokens:

In [50]:
top_k = 100
# A single topk call returns both the scores and the matching token IDs.
# Note that the scores are raw logits, not probabilities.
top = torch.topk(predictions[0, masked_index], top_k)
predicted_token_ids = top.indices
predicted_token_logits = top.values

# Convert the predicted token IDs back to tokens
predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_token_ids)

# Print the top-k predictions and their logits
for token, logit in zip(predicted_tokens, predicted_token_logits):
    print(token, logit)

for2 tensor(60.4169)
in1 tensor(42.1652)
for3 tensor(27.5228)
with2 tensor(26.8077)
about2 tensor(26.3791)
on1 tensor(24.1114)
. tensor(20.7530)
, tensor(13.6920)
had1 tensor(11.8560)
in2 tensor(10.7599)
; tensor(10.3517)
had4 tensor(10.2989)
? tensor(9.3468)
... tensor(8.9554)
) tensor(8.0753)
of tensor(7.8683)
one tensor(7.8091)
: tensor(7.7398)
with1 tensor(7.7063)
" tensor(7.5942)
with3 tensor(7.1962)
and tensor(7.1646)
have tensor(7.1489)
for tensor(6.8958)
- tensor(6.8416)
( tensor(6.6311)
to tensor(6.6089)
however tensor(6.4687)
or tensor(6.4409)
has tensor(6.2858)
he tensor(6.1440)
10 tensor(6.0997)
was tensor(6.0678)
had tensor(6.0156)
direction tensor(6.0082)
she tensor(5.8671)
with tensor(5.8534)
once tensor(5.7480)
! tensor(5.6943)
center tensor(5.6774)
which tensor(5.6481)
but tensor(5.6247)
hundred tensor(5.6120)
being tensor(5.5830)
instead tensor(5.5702)
calling tensor(5.5174)
50 tensor(5.4752)
called tensor(5.4552)
small tensor(5.4548)
is tensor(5.4308)
having tensor(5
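The scores printed above are raw logits, which is why they are far larger than 1. If probabilities are wanted, a softmax over the vocabulary dimension converts them. A minimal sketch on a toy logit vector (the values are invented, not taken from the model):

```python
import torch

# Invented logits for a 5-token vocabulary.
logits = torch.tensor([6.0, 4.0, 2.0, 0.5, -1.0])

# Softmax turns logits into a probability distribution that sums to 1.
probs = torch.softmax(logits, dim=-1)

top = torch.topk(probs, k=3)
for token_id, prob in zip(top.indices.tolist(), top.values.tolist()):
    print(token_id, round(prob, 4))
```

Softmax preserves the ranking, so the order of the predictions is the same as with logits; only the scale changes.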

Predict the most probable word that is not part of the new embeddings:

In [51]:
# The original bert-base-cased vocabulary has 28996 entries (IDs 0-28995);
# the pseudoword tokens were appended after it.
original_vocab_size = 28996

# Sort the token IDs by logit once, then take the first one that belongs
# to the original vocabulary.
ranked_token_ids = torch.argsort(predictions[0, masked_index], descending=True)
for token_id in ranked_token_ids.tolist():
    if token_id < original_vocab_size:
        predicted_token_id = token_id
        break

# Convert the predicted token ID back to a token
predicted_token = tokenizer.convert_ids_to_tokens([predicted_token_id])[0]

print(predicted_token)

.


Predict the top 5 words that are not part of the new embeddings:

In [46]:
# Get the logits for the masked position
predicted_token_probs = predictions[0, masked_index]
original_vocab_size = 28996  # IDs 0-28995 belong to the original vocabulary
# Create a list to store the top 5 predictions and their logits
top_5_predictions = []

# Sort once, then collect the 5 highest-scoring tokens from the original vocabulary
ranked_token_ids = torch.argsort(predicted_token_probs, descending=True)
for token_id in ranked_token_ids.tolist():
    if len(top_5_predictions) >= 5:
        break
    if token_id < original_vocab_size:
        predicted_token = tokenizer.convert_ids_to_tokens([token_id])[0]
        top_5_predictions.append((predicted_token, predicted_token_probs[token_id].item()))

# Print the top 5 predictions and their logits
for token, logit in top_5_predictions:
    print(token, logit)

. 21.43332862854004
, 13.855920791625977
; 10.54966926574707
? 9.460063934326172
... 9.084609031677246
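The same filtering can be done without a Python loop by setting the logits of the added tokens to negative infinity before calling `torch.topk`. A minimal sketch on toy data, assuming (as in the notebook) that the new tokens occupy the IDs at the end of the vocabulary; the sizes and scores are invented:

```python
import torch

original_vocab_size = 6  # stands in for 28996
# Invented scores; the last two entries play the role of the added tokens.
logits = torch.tensor([0.1, 2.0, -1.0, 3.0, 0.5, 1.5, 9.0, 8.0])

# Copy, then rule out every ID at or beyond the original vocabulary size.
masked = logits.clone()
masked[original_vocab_size:] = float("-inf")

top = torch.topk(masked, k=3)
print(top.indices.tolist())  # [3, 1, 5]
```

Cloning first leaves the original logits untouched, so the unfiltered ranking is still available afterwards.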
