## Vectorization and Embedding 
- In the tokenization example, the word “tokenizer” was split into the known subword tokens “token” and “##izer”, where the “##” prefix marks a token that continues the previous one instead of starting a new word.

- Subword tokenization keeps common words as single tokens, as word-level tokenization does, while splitting rarer words into smaller known pieces. This keeps the vocabulary size reasonable. For example, the `bert-base-uncased` tokenizer used in the tokenization example has a vocabulary of only 30,522 tokens.

- BERT paper: https://arxiv.org/abs/1810.04805

## Tokenization

### Using the pre-trained bert-base-uncased tokenizer to tokenize a sample sentence

In [3]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("This is an example of the bert tokenizer")
print(tokens)

['this', 'is', 'an', 'example', 'of', 'the', 'bert', 'token', '##izer']
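
The split of “tokenizer” into “token” and “##izer” can be reproduced with a greedy longest-match-first search, which is the general idea behind WordPiece-style tokenizers. The following is a minimal sketch under stated assumptions: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, the vocabulary contains only the nine tokens from the output above, and the real tokenizer also performs normalization and punctuation handling that are omitted here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no subword matched, so the word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, NOT the real bert-base-uncased vocabulary.
toy_vocab = {"this", "is", "an", "example", "of", "the", "bert", "token", "##izer"}

sentence = "this is an example of the bert tokenizer"
tokens = [t for word in sentence.split() for t in wordpiece_tokenize(word, toy_vocab)]
print(tokens)
```

Because “tokenizer” is not in the toy vocabulary, the search falls back to the longest known prefix, “token”, and then matches the remainder as the continuation piece “##izer”.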


## Token IDs

- Each token is mapped to an integer ID, which is its index in the tokenizer's vocabulary.

In [4]:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

[2023, 2003, 2019, 2742, 1997, 1996, 14324, 19204, 17629]
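
Conceptually, this conversion is a table lookup. The sketch below rebuilds a tiny token-to-ID table from the nine tokens and IDs printed above and looks tokens up in both directions. The names `vocab` and `inverse_vocab` are illustrative only; the real vocabulary has 30,522 entries and is managed by the tokenizer object.

```python
# Tokens and IDs copied from the outputs above.
tokens = ["this", "is", "an", "example", "of", "the", "bert", "token", "##izer"]
token_ids = [2023, 2003, 2019, 2742, 1997, 1996, 14324, 19204, 17629]

# A vocabulary behaves like a token -> ID table (only nine entries here).
vocab = dict(zip(tokens, token_ids))
# The inverse table maps IDs back to tokens.
inverse_vocab = {i: t for t, i in vocab.items()}

ids = [vocab[t] for t in ["token", "##izer"]]
print(ids)
print([inverse_vocab[i] for i in ids])
```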


## Token Embeddings

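- The returned vector has 768 dimensions, which matches the hidden size of the `bert-base-uncased` model. It comes from the model's input word-embedding table, so it is a static vector for the token and does not depend on the surrounding sentence.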

In [6]:
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# get the embedding vector for the word "example"
example_token_id = tokenizer.convert_tokens_to_ids(["example"])[0]
example_embedding = model.embeddings.word_embeddings(torch.tensor([example_token_id]))

print(example_embedding.shape)

torch.Size([1, 768])
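
A word-embedding layer is essentially a matrix with one row per vocabulary entry, and looking up a token ID returns the corresponding row. The sketch below illustrates this with a randomly initialised `torch.nn.Embedding` that uses the sizes quoted in this section (30,522 tokens, 768 dimensions). It is an assumption-laden illustration: the weights are random, not the trained BERT weights, so only the shapes and the lookup behaviour carry over.

```python
import torch

torch.manual_seed(0)

# Randomly initialised table with the sizes quoted in this section.
# These are NOT the trained BERT weights.
embedding = torch.nn.Embedding(num_embeddings=30522, embedding_dim=768)

token_id = 2742  # ID of "example" in the token ID output above
vector = embedding(torch.tensor([token_id]))
print(vector.shape)

# The lookup returns exactly the row of the weight matrix at index token_id.
print(torch.equal(vector[0], embedding.weight[token_id]))
```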


## Vector comparison

- We can use these vectors to compare their similarities by using PyTorch’s cosine similarity function.

- Cosine similarity measures how similar two vectors are based on their direction, ignoring their length. It is often used in natural language processing to compare the content of two texts.

- To calculate the cosine similarity, we take the cosine of the angle between the two vectors. The more closely the vectors point in the same direction, the more similar they are; the more they point in opposite directions, the less similar they are.

- The result is a number between -1 and 1, where 1 means the vectors point in the same direction, 0 means they are orthogonal (unrelated), and -1 means they point in opposite directions.
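
The calculation described above is the dot product of the two vectors divided by the product of their lengths. The following is a minimal sketch in plain Python; the helper name `cosine_similarity` and the small two-dimensional vectors are illustrative assumptions and are not part of the BERT example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction but different length: result is approximately 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: result is 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
# Opposite direction: result is approximately -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Note that scaling a vector does not change the result, which is why `[1.0, 2.0]` and `[2.0, 4.0]` count as maximally similar.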

In [7]:
king_token_id = tokenizer.convert_tokens_to_ids(["king"])[0]
king_embedding = model.embeddings.word_embeddings(torch.tensor([king_token_id]))

queen_token_id = tokenizer.convert_tokens_to_ids(["queen"])[0]
queen_embedding = model.embeddings.word_embeddings(torch.tensor([queen_token_id]))

cos = torch.nn.CosineSimilarity(dim=1)
similarity = cos(king_embedding, queen_embedding)
print(similarity[0])

tensor(0.6469, grad_fn=<SelectBackward0>)


In [8]:
similarity = cos(example_embedding, queen_embedding)
print(similarity[0])

tensor(0.2392, grad_fn=<SelectBackward0>)


- The king and queen vectors have a cosine similarity of 0.6469, while the example and queen vectors have a similarity of only 0.2392. The king and queen vectors are therefore closer in our vector space than the example and queen vectors. This suggests that the model's embeddings capture some of the “meaning” of these words and can differentiate between them.