<a href="https://colab.research.google.com/github/Taaniya/exploring-gpt2-language-model/blob/main/Visualizing_gpt2_token_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook visualizes the word token embeddings of GPT-2 in the TensorBoard Embedding Projector.

In [None]:
! pip install transformers

In [None]:
from transformers import GPT2TokenizerFast, GPT2LMHeadModel
import tensorflow as tf
from tensorboard.plugins import projector

import os
from tqdm import tqdm

In [None]:
model = GPT2LMHeadModel.from_pretrained('gpt2', from_tf=True)

word_embeddings = model.transformer.wte.weight      # Word Token Embeddings 
position_embeddings = model.transformer.wpe.weight  # Word Position Embeddings 

All TF 2.0 model weights were used when initializing GPT2LMHeadModel.

All the weights of GPT2LMHeadModel were initialized from the TF 2.0 model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.


In [None]:
print(word_embeddings.shape)

print(position_embeddings.shape)

torch.Size([50257, 768])
torch.Size([1024, 768])
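
The two shapes read as (rows, embedding dimension): one 768-dimensional vector per vocabulary token and one per position in the 1024-token context window. The quick arithmetic sketch below shows what that implies for the amount of data handed to the projector. The 4 bytes per value assumes float32 weights, which is an assumption and not something the output above states.

```python
# Shapes copied from the printed output above.
vocab_size, hidden_dim = 50257, 768
max_positions = 1024

# Assumption: weights are stored as float32 (4 bytes per value).
BYTES_PER_FLOAT32 = 4

token_params = vocab_size * hidden_dim
position_params = max_positions * hidden_dim
token_mib = token_params * BYTES_PER_FLOAT32 / 2**20

print(f"token embedding parameters:    {token_params:,}")
print(f"position embedding parameters: {position_params:,}")
print(f"token embedding table size:    {token_mib:.1f} MiB")
```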


In [None]:
# Create the logging directory (no error if it already exists).
log_dir = './logs/vocab/'
os.makedirs(log_dir, exist_ok=True)

In [None]:
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

In [None]:
tokenizer.pretrained_vocab_files_map

{'vocab_file': {'gpt2': 'https://huggingface.co/gpt2/resolve/main/vocab.json',
  'gpt2-medium': 'https://huggingface.co/gpt2-medium/resolve/main/vocab.json',
  'gpt2-large': 'https://huggingface.co/gpt2-large/resolve/main/vocab.json',
  'gpt2-xl': 'https://huggingface.co/gpt2-xl/resolve/main/vocab.json',
  'distilgpt2': 'https://huggingface.co/distilgpt2/resolve/main/vocab.json'},
 'merges_file': {'gpt2': 'https://huggingface.co/gpt2/resolve/main/merges.txt',
  'gpt2-medium': 'https://huggingface.co/gpt2-medium/resolve/main/merges.txt',
  'gpt2-large': 'https://huggingface.co/gpt2-large/resolve/main/merges.txt',
  'gpt2-xl': 'https://huggingface.co/gpt2-xl/resolve/main/merges.txt',
  'distilgpt2': 'https://huggingface.co/distilgpt2/resolve/main/merges.txt'},
 'tokenizer_file': {'gpt2': 'https://huggingface.co/gpt2/resolve/main/tokenizer.json',
  'gpt2-medium': 'https://huggingface.co/gpt2-medium/resolve/main/tokenizer.json',
  'gpt2-large': 'https://huggingface.co/gpt2-large/resolve/ma

In [None]:
tokenizer.vocab

**Create a list of vocab tokens sorted by their index**

In [None]:
vocab_list = sorted(tokenizer.vocab.items(), key=lambda x:x[1])

**Verify that the resulting list is sorted by token index.** The raw vocab dictionary is not in index order, whereas the first entries of `vocab_list` are.

In [None]:
for k,v in tokenizer.vocab.items():
    if v < 10:
        print(k, v)

* 9
$ 3
% 4
) 8
& 5
! 0
" 1
( 7
# 2
' 6


In [None]:
vocab_list[:10]

[('!', 0),
 ('"', 1),
 ('#', 2),
 ('$', 3),
 ('%', 4),
 ('&', 5),
 ("'", 6),
 ('(', 7),
 (')', 8),
 ('*', 9)]
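
The visual comparison above covers only the first ten entries. A minimal sketch of a programmatic check follows. It uses a small hypothetical `toy_vocab` dictionary in place of `tokenizer.vocab`, and the helper name `is_sorted_and_contiguous` is made up for illustration.

```python
def is_sorted_and_contiguous(vocab_list):
    """Return True if the (token, index) pairs run 0, 1, 2, ... with no gaps."""
    return [idx for _, idx in vocab_list] == list(range(len(vocab_list)))

# Hypothetical stand-in for tokenizer.vocab, deliberately out of order.
toy_vocab = {"#": 2, "!": 0, "$": 3, '"': 1}

# Same sorting expression as the notebook uses for the real vocab.
toy_vocab_list = sorted(toy_vocab.items(), key=lambda x: x[1])

print(toy_vocab_list)
print(is_sorted_and_contiguous(toy_vocab_list))
print(is_sorted_and_contiguous(list(toy_vocab.items())))
```

Applied to the real data, `is_sorted_and_contiguous(vocab_list)` should hold, since label i in the metadata file has to describe row i of the embedding matrix.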

**Save the sorted token labels from the vocab as a metadata file**

In [None]:
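# Write one token per line, in index order, so that line i labels embedding row i.
# UTF-8 keeps byte-level BPE characters such as 'Ġ' intact instead of replacing them with '?'.
with open(os.path.join(log_dir, 'metadata.tsv'), "w", encoding="utf-8") as f:
    for word, _ in tqdm(vocab_list):
        f.write("{}\n".format(word))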

100%|██████████| 50257/50257 [00:00<00:00, 358917.79it/s]
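
The projector matches line i of the metadata file to row i of the embedding tensor, and a single-column metadata file carries no header row, so the file should contain exactly one line per embedding row. A minimal sketch of that check follows. It uses a temporary directory and a hypothetical three-token vocabulary instead of the real files, and `count_lines` is an illustrative helper.

```python
import os
import tempfile

def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

# Hypothetical stand-ins for vocab_list and the embedding row count.
toy_vocab_list = [("!", 0), ("Ġthe", 1), ("Ġcat", 2)]
num_embedding_rows = 3

with tempfile.TemporaryDirectory() as tmp_dir:
    metadata_path = os.path.join(tmp_dir, "metadata.tsv")
    # One label per line, no header, written in index order.
    with open(metadata_path, "w", encoding="utf-8") as f:
        for token, _ in toy_vocab_list:
            f.write(token + "\n")
    num_labels = count_lines(metadata_path)

print(num_labels == num_embedding_rows)
```

For the real run, the same comparison would be between the line count of `metadata.tsv` in `log_dir` and `word_embeddings.shape[0]`.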


**Save the word embeddings**

In [None]:
embeddings = tf.Variable(model.transformer.wte.weight.detach().numpy())
checkpoint = tf.train.Checkpoint(embedding=embeddings)
checkpoint.save(os.path.join(log_dir, "embedding.ckpt"))

'./logs/vocab/embedding.ckpt-1'

Finally, set up the TensorBoard projector configuration. This writes a `projector_config.pbtxt` file to the log directory.

In [None]:
# Set up config.
config = projector.ProjectorConfig()
embedding = config.embeddings.add()

# The name of the tensor will be suffixed by `/.ATTRIBUTES/VARIABLE_VALUE`.
embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
embedding.metadata_path = 'metadata.tsv'
projector.visualize_embeddings(log_dir, config)
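
The suffix in the comment above comes from how `tf.train.Checkpoint` names saved variables: the keyword used when constructing the checkpoint (`embedding=...` in the save cell) becomes the prefix. A small sketch of that naming rule follows. `checkpoint_tensor_name` is a hypothetical helper written for illustration and is not part of TensorFlow or TensorBoard.

```python
# Suffix appended to variables saved through tf.train.Checkpoint.
CHECKPOINT_SUFFIX = "/.ATTRIBUTES/VARIABLE_VALUE"

def checkpoint_tensor_name(attribute_name):
    """Build the tensor name for a variable saved as Checkpoint(<attribute_name>=...)."""
    return attribute_name + CHECKPOINT_SUFFIX

tensor_name = checkpoint_tensor_name("embedding")
print(tensor_name)
```

To confirm the name against the saved files, `tf.train.list_variables(log_dir)` lists the variable names and shapes stored in the checkpoint.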

Run TensorBoard to visualize the embeddings. Use UMAP for faster, cleaner visualizations. Search for a few keywords to find their nearest neighbours, both in the 3D space and in the drop-down list.

In [None]:
%load_ext tensorboard
%tensorboard --logdir ./logs/vocab/
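
Nearest neighbours can also be computed directly from an embedding matrix, for example by cosine similarity. A minimal pure-Python sketch of that lookup follows. The tokens and 3-dimensional vectors are made up for illustration and are not GPT-2 values, and `nearest_neighbours` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbours(query, labels, vectors, k=2):
    """Return the k labels most similar to `query`, excluding the query itself."""
    query_vec = vectors[labels.index(query)]
    scored = [
        (cosine_similarity(query_vec, vec), label)
        for label, vec in zip(labels, vectors)
        if label != query
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:k]]

# Made-up toy data: NOT real GPT-2 embeddings.
labels = ["cat", "dog", "car", "truck"]
vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]

print(nearest_neighbours("cat", labels, vectors, k=1))
```

With the real 50257 x 768 matrix, the same idea would normally be written with vectorised array operations, since a Python loop over every token is slow.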

#### References

1. [TensorBoard Embedding Projector](https://www.tensorflow.org/tensorboard/tensorboard_projector_plugin)

2. [How to visualize text embeddings with TensorBoard](https://towardsdatascience.com/how-to-visualize-text-embeddings-with-tensorboard-47e07e3a12fb)

3. [huggingface/transformers issue #1458](https://github.com/huggingface/transformers/issues/1458)