## Set Up Everything

In [1]:
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

In [9]:
import transformers
import torch

## Load Tokenizer and Language Model

In [44]:
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.AutoModel.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Do Some Tokenization Tests

In [121]:
tokenizer.tokenize("This is a test!", add_special_tokens=True)

['[CLS]', 'this', 'is', 'a', 'test', '!', '[SEP]']

In [120]:
tokenizer.tokenize("We are practicing tokenization some more", add_special_tokens=True)

['[CLS]',
 'we',
 'are',
 'practicing',
 'token',
 '##ization',
 'some',
 'more',
 '[SEP]']
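
The `token` / `##ization` split above comes from WordPiece, which breaks a word into the longest vocabulary pieces it can find, marking continuation pieces with `##`. Below is a minimal sketch of that greedy longest-match-first idea. The `wordpiece` function and `toy_vocab` are hypothetical illustrations, not the tokenizer's actual implementation or vocabulary, and the sketch skips the lowercasing and punctuation handling the real tokenizer performs.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word (illustrative sketch only)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, far smaller than the real one.
toy_vocab = {"token", "##ization", "practicing", "some", "more"}

print(wordpiece("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece("some", toy_vocab))          # ['some']
print(wordpiece("xyz", toy_vocab))           # ['[UNK]']
```

Because `tokenization` is not in the toy vocabulary as a whole word, the sketch falls back to the longest prefix it does know (`token`) and then matches the remainder as a continuation piece.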

In [122]:
tokenizer.encode("This is a test!")

[101, 2023, 2003, 1037, 3231, 999, 102]

## Token Embeddings Have Some Interesting Properties

In [66]:
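def get_embedding(word):
    # Encode as [CLS] word [SEP]; the word must map to exactly one token.
    input_ids = tokenizer.encode(word, add_special_tokens=True, return_tensors="pt")
    assert input_ids.shape[1] == 3, "word must tokenize to a single token"
    with torch.no_grad():
        hidden_states = model(input_ids, output_hidden_states=True).hidden_states
    # hidden_states[0] is the embedding-layer output, before any transformer layer.
    # Take batch 0, position 1 (the word itself, right after [CLS]).
    return hidden_states[0][0, 1, :]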

Cosine similarity measures how similar two vectors are: it is the cosine of the angle between them, so it is 1 for vectors pointing in the same direction, 0 for orthogonal vectors, and -1 for opposite vectors.
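
To make the definition concrete, here is a minimal pure-Python sketch of the formula, the dot product divided by the product of the vector lengths. The helper name `cosine_similarity` and the example vectors are chosen here for illustration. It is meant to mirror what `torch.cosine_similarity` computes for two 1-D tensors, without that function's handling of near-zero norms.

```python
import math


def cosine_similarity(u, v):
    """cos(angle between u and v) = (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal: 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite: -1.0
```

Only the direction of the vectors matters: scaling either vector by a positive constant leaves the result unchanged.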

In [91]:
def compare(word1, word2):
    w1_emb = get_embedding(word1)
    w2_emb = get_embedding(word2)
    return torch.cosine_similarity(w1_emb, w2_emb, dim=0)

In [93]:
compare("king", "queen")

tensor(0.6223)

In [124]:
compare("king", "man")

tensor(0.2913)

In [125]:
compare("king", "woman")

tensor(0.2266)

In [127]:
compare("queen", "man")

tensor(0.1750)

In [128]:
compare("queen", "woman")

tensor(0.3524)