This notebook explores adding pseudoword embeddings to a BERT model as new vocabulary entries:

In [157]:
from transformers import BertForMaskedLM, BertTokenizer
import torch
import numpy as np

Load the pseudoword embeddings:

In [158]:
pseudowords = np.load("../out/pseudowords-avg.npy")
pseudowords

array([[-0.6139378 , -0.09849278,  0.38503495, ...,  0.5829185 ,
        -0.03312979, -0.32081214],
       [ 0.7645389 , -0.9078201 , -0.83330715, ...,  0.29633465,
         0.2754484 ,  0.01427615],
       [-0.04470551, -0.13107753, -0.6122573 , ...,  0.9067127 ,
         0.5997687 , -0.05785817],
       ...,
       [-0.41056955, -0.25118613, -0.1638469 , ..., -0.26522723,
         1.182461  ,  0.3514442 ],
       [ 0.188755  , -0.4069652 ,  0.12774336, ..., -0.12126191,
        -0.5085608 ,  0.632808  ],
       [-0.34621328,  0.5956651 , -0.95337975, ...,  0.05335583,
        -1.1837399 ,  0.24092862]], dtype=float32)

Collect the names of the new tokens:

In [159]:
import csv
import json

with open("./pseudowords/MaPP_all.txt") as json_file:
    data = json.load(json_file)

# The word at position `query_idx` of each query
new_tokens = [d["query"].split()[d["query_idx"]] for d in data]

# Map each key in column 0 to a label built from columns 2 and 5
with open("./pseudowords/MaPP_Dataset.csv") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    labels = {row[0]: row[2] + row[5] for row in csv_reader}

# Collect the labels in order of first appearance, without duplicates
bert_tokens = []

for d in data:
    try:
        label = labels[d["target1"]]
    except KeyError:
        # Fall back to the key with surrounding whitespace removed
        label = labels[d["target1"].strip()]
    if label not in bert_tokens:
        bert_tokens.append(label)
bert_tokens

['in1',
 'in2',
 'for1',
 'for2',
 'for3',
 'started1',
 'started2',
 'had1',
 'had4',
 'had5',
 'about1',
 'about2',
 'with1',
 'with2',
 'with3',
 'on1',
 'on2',
 'run1',
 'run2']
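The label-building step above can be tried without the dataset files. The following is a minimal sketch on made-up rows: the keys `t1` to `t3`, the filler `x` columns, and the helper `unique_labels` are hypothetical, and the reading of column 2 as the word and column 5 as a number is an assumption inferred from labels such as `about1`.

```python
import csv
import io

# Hypothetical rows: column 0 = key, column 2 = word, column 5 = number.
csv_text = (
    "t1,x,about,x,x,1\n"
    "t2,x,about,x,x,2\n"
    "t3,x,run,x,x,1\n"
)
labels = {row[0]: row[2] + row[5] for row in csv.reader(io.StringIO(csv_text))}

# Hypothetical entries; the second key carries trailing whitespace.
data = [{"target1": "t1"}, {"target1": "t2 "}, {"target1": "t1"}, {"target1": "t3"}]


def unique_labels(entries, labels):
    """Return the labels in order of first appearance, without duplicates."""
    seen = {}
    for entry in entries:
        key = entry["target1"]
        label = labels.get(key)
        if label is None:
            # Same fallback as in the notebook: retry with whitespace stripped
            label = labels[key.strip()]
        seen.setdefault(label, None)  # dicts preserve insertion order
    return list(seen)


bert_tokens = unique_labels(data, labels)
print(bert_tokens)
```

Using a dict instead of `label not in bert_tokens` avoids a linear scan per entry, which makes no practical difference for 19 labels but scales better.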

Load the vanilla BERT model:

In [160]:
model = BertForMaskedLM.from_pretrained('bert-base-cased', return_dict=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model.bert.embeddings.word_embeddings

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Embedding(28996, 768, padding_idx=0)

Append the pseudoword embeddings to the existing embedding matrix:

In [161]:
combined_embeddings = torch.cat((model.bert.embeddings.word_embeddings.weight, torch.tensor(pseudowords)), dim=0)
model.bert.embeddings.word_embeddings = torch.nn.Embedding.from_pretrained(combined_embeddings)
model.bert.embeddings.word_embeddings

Embedding(29015, 768)
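Note that the extended layer is printed without `padding_idx=0`, and that `torch.nn.Embedding.from_pretrained` freezes the weights by default (`freeze=True`). Neither matters for pure inference, but both could matter if the model is fine-tuned afterwards. The following is a minimal sketch on toy sizes (6 + 2 rows of width 4 instead of 28996 + 19 rows of width 768) showing how both settings could be passed explicitly; it is an illustration, not what the notebook does.

```python
import torch

# Toy sizes for illustration only.
old = torch.nn.Embedding(6, 4, padding_idx=0)
new_rows = torch.randn(2, 4)

# Stack the new rows below the existing ones.
combined = torch.cat((old.weight.detach(), new_rows), dim=0)

# freeze=False keeps the weights trainable; padding_idx is passed explicitly
# because from_pretrained does not inherit it from the old layer.
extended = torch.nn.Embedding.from_pretrained(combined, freeze=False, padding_idx=0)

print(extended)
print(extended.weight.requires_grad)
```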

Add the new token names to the tokenizer:

In [162]:
tokenizer.add_tokens(bert_tokens)
model.resize_token_embeddings(len(tokenizer))

You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embeding dimension will be 29015. This might induce some performance reduction as *Tensor Cores* will not be available. For more details  about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


Embedding(29015, 768)

Try it with an example:

In [181]:
tokenized_text = tokenizer.tokenize("[CLS] The lecture will be just about1 a [MASK]. [SEP]")
masked_index = tokenized_text.index("[MASK]")
tokenized_text

['[CLS]',
 'The',
 'lecture',
 'will',
 'be',
 'just',
 'about1',
 'a',
 '[MASK]',
 '.',
 '[SEP]']

Convert the tokens to indices:

In [182]:
input_ids = tokenizer.convert_tokens_to_ids(tokenized_text)
input_ids = torch.tensor([input_ids])
input_ids

tensor([[  101,  1109, 10309,  1209,  1129,  1198, 29006,   170,   103,   119,
           102]])
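The ID 29006 for `about1` is what one would expect if the new tokens receive consecutive IDs in list order, starting at the original vocabulary size of 28996. That ordering is also what makes row `i` of the pseudoword array line up with token `i`. The following sketch checks the arithmetic under that assumption, and under the further assumption that none of the labels was already in the vocabulary:

```python
original_vocab_size = 28996  # size of the embedding layer before extension

bert_tokens = [
    "in1", "in2", "for1", "for2", "for3", "started1", "started2",
    "had1", "had4", "had5", "about1", "about2", "with1", "with2",
    "with3", "on1", "on2", "run1", "run2",
]

# Assumed mapping: the i-th new token gets ID original_vocab_size + i.
expected_ids = {
    token: original_vocab_size + i for i, token in enumerate(bert_tokens)
}

print(expected_ids["about1"])
print(max(expected_ids.values()) + 1)  # size of the extended vocabulary
```

In the notebook itself, the same check could be run against the real tokenizer by comparing `expected_ids` with `tokenizer.convert_tokens_to_ids(bert_tokens)`.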

Predict the token:

In [183]:
with torch.no_grad():
    outputs = model(input_ids)
    predictions = outputs.logits

predicted_token_id = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_token_id])[0]

predicted_token

'run2'

Predict the top 100 tokens:

In [184]:
top_k = 100
# Call topk once and reuse both results
top = torch.topk(predictions[0, masked_index], top_k)
predicted_token_ids = top.indices
predicted_token_probs = top.values  # raw logits, not probabilities

# Convert the predicted token IDs back to tokens
predicted_tokens = tokenizer.convert_ids_to_tokens(predicted_token_ids)

# Print the top predictions and their scores (logits)
for token, prob in zip(predicted_tokens, predicted_token_probs):
    print(token, prob)

run2 tensor(50.2523)
started2 tensor(46.4366)
had4 tensor(40.3415)
run1 tensor(34.4864)
had1 tensor(33.7096)
about2 tensor(33.7002)
had5 tensor(25.1552)
for2 tensor(12.1103)
minute tensor(10.4596)
lecture tensor(9.8213)
moment tensor(9.7789)
day tensor(9.5306)
bit tensor(9.5272)
little tensor(9.4727)
show tensor(9.3757)
preview tensor(8.8964)
game tensor(8.8520)
start tensor(8.8175)
second tensor(8.8044)
demonstration tensor(8.7517)
touch tensor(8.6196)
surprise tensor(8.4973)
performance tensor(8.4794)
week tensor(8.4368)
conference tensor(8.3706)
presentation tensor(8.2188)
meeting tensor(8.2018)
joke tensor(8.1999)
thing tensor(8.1657)
concert tensor(8.1434)
test tensor(8.1418)
distraction tensor(8.0433)
rehearsal tensor(7.9748)
dream tensor(7.9736)
visit tensor(7.9571)
date tensor(7.7271)
while tensor(7.6887)
scratch tensor(7.6475)
few tensor(7.6338)
walk tensor(7.6190)
break tensor(7.5936)
thought tensor(7.5277)
movie tensor(7.5129)
coincidence tensor(7.4856)
story tensor(7.4786)
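The values printed above are raw logits, not probabilities. If probabilities are wanted, a softmax over the vocabulary dimension converts them. The following is a minimal sketch on a made-up five-token vocabulary; the logit values are hypothetical and only loosely modelled on the scale seen above.

```python
import torch

# Hypothetical logits for a five-token vocabulary.
logits = torch.tensor([50.25, 46.44, 10.46, 9.82, 9.78])

# Softmax turns the logits into a distribution that sums to 1.
probs = torch.softmax(logits, dim=-1)

top = torch.topk(probs, k=3)
for token_id, p in zip(top.indices.tolist(), top.values.tolist()):
    print(token_id, f"{p:.4f}")
```

Because softmax is monotonic, the ranking is the same as for the logits; only the scale changes. With gaps as large as the ones above, nearly all probability mass ends up on the first few tokens.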


Predict the most probable word that is not part of the new embeddings:

In [185]:
original_vocab_size = 28996  # vocabulary size before the new tokens were added
# Rank all token IDs by score, best first
ranked_ids = torch.argsort(predictions[0, masked_index], descending=True)
# Take the highest-ranked token whose ID lies in the original vocabulary (ID < 28996)
predicted_token_id = next(i for i in ranked_ids.tolist() if i < original_vocab_size)

# Convert the predicted token ID back to a token
predicted_token = tokenizer.convert_ids_to_tokens([predicted_token_id])[0]

print(predicted_token)

minute


Predict the top 5 words that are not part of the new embeddings:

In [186]:
# Get the scores (logits) for the masked position
predicted_token_probs = predictions[0, masked_index]
original_vocab_size = 28996  # vocabulary size before the new tokens were added
# Rank all token IDs once, best first
ranked_ids = torch.argsort(predicted_token_probs, descending=True).tolist()
# Create a list to store the top 5 predictions and their scores
top_5_predictions = []

# Find the top 5 predicted tokens from the original vocabulary (ID < 28996)
for token_id in ranked_ids:
    if len(top_5_predictions) >= 5:
        break
    if token_id < original_vocab_size:
        predicted_token = tokenizer.convert_ids_to_tokens([token_id])[0]
        top_5_predictions.append((predicted_token, predicted_token_probs[token_id].item()))

# Print the top 5 predictions and their scores (logits)
for token, prob in top_5_predictions:
    print(token, prob)

minute 10.4595947265625
lecture 9.821345329284668
moment 9.778871536254883
day 9.530567169189453
bit 9.527246475219727
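An alternative to looping over the ranked IDs is to mask the logits of the added tokens with negative infinity and call `torch.topk` once. The following is a minimal sketch on a toy vocabulary of 6 original and 2 added tokens; the sizes and logit values are hypothetical.

```python
import torch

original_vocab_size = 6  # toy value; 28996 in the notebook

# Hypothetical logits; the last two entries belong to added tokens.
logits = torch.tensor([0.1, 2.0, 0.5, 1.5, 0.3, 1.0, 9.0, 8.0])

# Work on a copy so the original scores stay available.
masked = logits.clone()
masked[original_vocab_size:] = float("-inf")

top = torch.topk(masked, k=3)
print(top.indices.tolist())  # [1, 3, 5]
```

This relies on the added tokens occupying the highest IDs, which holds here because they were appended after the original vocabulary.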
