# BERT Embeddings
## Featuring: DistilBERT on PyTorch

This notebook is adapted from the Hugging Face example code: https://huggingface.co/docs/transformers/model_doc/distilbert

### Preliminaries: Load the text tokenizer and deep learning model

In [None]:
import torch
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

### Create some input text and convert to tokens

In [None]:
inputs = tokenizer(["Hello, my dog is cute",
                    "Hello, my dog is ugly",
                    "Hello, my dog is not cute",
                    "The cat ran fast",
                    "The game ended in a cat"],
                   padding=True,
                   return_tensors="pt")
inputs

### Process the inputs and obtain the CLS token embeddings (whole sentence meaning)

After this step, each sentence is represented as a single 768-dimensional vector. Texts with similar meanings should yield similar vectors (close in distance and/or with a high dot product), and texts with dissimilar meanings should yield dissimilar vectors.

In [None]:
with torch.no_grad():
    output = model(**inputs).last_hidden_state[:,0,:]
output.shape
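
The slice `[:, 0, :]` keeps the first token position of every sequence, which is where this tokenizer places the `[CLS]` token. The following is a minimal sketch on a random stand-in tensor; the sizes 5, 9, and 768 are illustrative assumptions rather than actual model output, and the names `last_hidden_state` and `cls_embeddings` are introduced here only for illustration.

```python
import torch

# Stand-in for model(**inputs).last_hidden_state: (batch, seq_len, hidden).
# The sizes are illustrative assumptions, not real model output.
batch_size, seq_len, hidden_size = 5, 9, 768
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# Keep position 0 of every sequence (where the tokenizer places [CLS]).
cls_embeddings = last_hidden_state[:, 0, :]

print(cls_embeddings.shape)  # torch.Size([5, 768])
```

The sequence dimension disappears, leaving one vector per input sentence.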

### Quick, unscaled dot-product comparison between all embeddings

With the dot product, higher values indicate greater similarity. Because the values are unscaled, they also depend on the length of each vector. We could also calculate distances, which might be more appropriate in some cases.

In [None]:
output @ output.T
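
Since the dot product above is unscaled, one common way to remove the effect of vector length is cosine similarity: normalize each vector to unit length, then take the dot product. The sketch below uses a small hypothetical tensor `emb` in place of `output` so that the numbers are easy to check by hand; it is one possible scaling, not necessarily the one the original example intends.

```python
import torch
import torch.nn.functional as F

# Small hypothetical embeddings standing in for `output` (3 vectors, 4 dims).
emb = torch.tensor([[1.0, 0.0, 0.0, 0.0],
                    [2.0, 0.0, 0.0, 0.0],
                    [0.0, 3.0, 0.0, 0.0]])

# Unscaled dot products: sensitive to the length of each vector.
raw = emb @ emb.T

# Scale every row to unit L2 norm, then take dot products again.
unit = F.normalize(emb, p=2, dim=1)
cosine = unit @ unit.T  # cosine similarities, each in [-1, 1]

print(raw)
print(cosine)
```

Rows 0 and 1 point in the same direction, so their cosine similarity is 1 even though their raw dot products differ. To apply this to the notebook's embeddings, `emb` would be replaced by `output`.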

### Euclidean distances between all embeddings

In [None]:
from scipy.spatial.distance import pdist, squareform
squareform(pdist(output.numpy()))
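
Euclidean distance and the dot product are related by ‖a − b‖² = ‖a‖² + ‖b‖² − 2 a·b, so for vectors of equal length a larger dot product means a smaller distance. The sketch below checks this identity with NumPy on hypothetical stand-in vectors (`emb` is not model output); with the default Euclidean metric, `squareform(pdist(emb))` should give the same matrix.

```python
import numpy as np

# Hypothetical stand-in embeddings (3 vectors, 2 dims), not model output.
emb = np.array([[0.0, 0.0],
                [3.0, 4.0],
                [6.0, 8.0]])

gram = emb @ emb.T          # pairwise dot products
sq_norms = np.diag(gram)    # squared length of each vector

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, clipped at 0 to absorb rounding error.
sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram, 0.0)
dists = np.sqrt(sq_dists)

print(dists)
```

The result is symmetric with zeros on the diagonal, matching the layout of the distance matrix computed above.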

### Obtain some summary/visualization info on the model...

In [None]:
from torchinfo import summary
from torchview import draw_graph

In [None]:
summary(model)

In [None]:
model_graph = draw_graph(model, input_data=inputs)
model_graph.visual_graph