<a href="https://colab.research.google.com/github/Taaniya/exploring-gpt2-language-model/blob/main/Visualizing_gpt2_token_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook visualizes the word token embeddings of GPT-2 in the TensorBoard Embedding Projector.

In [None]:
! pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.3 tokenizers-0.13.2 transformers-4.27.4


In [None]:
from transformers import GPT2TokenizerFast, GPT2LMHeadModel
import tensorflow as tf
from tensorboard.plugins import projector

import re
import os
from tqdm import tqdm

In [None]:
model = GPT2LMHeadModel.from_pretrained('gpt2')

word_embeddings = model.transformer.wte.weight      # token embeddings: one row per vocabulary token
position_embeddings = model.transformer.wpe.weight  # position embeddings: one row per sequence position


In [None]:
print(word_embeddings.shape)

print(position_embeddings.shape)

torch.Size([50257, 768])
torch.Size([1024, 768])
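
The two shapes read as 50257 vocabulary tokens and 1024 positions, each with a 768-dimensional vector. As a minimal sketch of how GPT-2 combines the two tables at its input (a token's row plus its position's row), the toy example below uses plain Python lists in place of the real tensors. The tiny tables and the `embed_inputs` helper are illustrative assumptions, not part of this notebook.

```python
from typing import List


def embed_inputs(token_ids: List[int],
                 wte: List[List[float]],
                 wpe: List[List[float]]) -> List[List[float]]:
    """Return wte[token] + wpe[position] for each token in the sequence."""
    if len(token_ids) > len(wpe):
        raise ValueError("sequence is longer than the position table")
    return [
        [t + p for t, p in zip(wte[tok], wpe[pos])]
        for pos, tok in enumerate(token_ids)
    ]


# Toy tables: 4 "tokens" and 3 "positions" with 2 dimensions each
# (the real GPT-2 tables are 50257 x 768 and 1024 x 768).
toy_wte = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_wpe = [[0.5, 0.5], [0.25, 0.25], [0.0, 0.0]]

toy_inputs = embed_inputs([3, 1], toy_wte, toy_wpe)
print(toy_inputs)
```

Only the token table (`wte`) is exported to the projector in this notebook; the position table is loaded here for inspection.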


In [None]:
# create logging directory
log_dir='./logs/vocab/'

if not os.path.exists(log_dir):
    os.makedirs(log_dir)

In [None]:
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')


In [None]:
tokenizer.pretrained_vocab_files_map

{'vocab_file': {'gpt2': 'https://huggingface.co/gpt2/resolve/main/vocab.json',
  'gpt2-medium': 'https://huggingface.co/gpt2-medium/resolve/main/vocab.json',
  'gpt2-large': 'https://huggingface.co/gpt2-large/resolve/main/vocab.json',
  'gpt2-xl': 'https://huggingface.co/gpt2-xl/resolve/main/vocab.json',
  'distilgpt2': 'https://huggingface.co/distilgpt2/resolve/main/vocab.json'},
 'merges_file': {'gpt2': 'https://huggingface.co/gpt2/resolve/main/merges.txt',
  'gpt2-medium': 'https://huggingface.co/gpt2-medium/resolve/main/merges.txt',
  'gpt2-large': 'https://huggingface.co/gpt2-large/resolve/main/merges.txt',
  'gpt2-xl': 'https://huggingface.co/gpt2-xl/resolve/main/merges.txt',
  'distilgpt2': 'https://huggingface.co/distilgpt2/resolve/main/merges.txt'},
 'tokenizer_file': {'gpt2': 'https://huggingface.co/gpt2/resolve/main/tokenizer.json',
  'gpt2-medium': 'https://huggingface.co/gpt2-medium/resolve/main/tokenizer.json',
  'gpt2-large': 'https://huggingface.co/gpt2-large/resolve/ma

In [None]:
tokenizer.vocab

{'German': 16010,
 'ICS': 19505,
 'ĠPotter': 14179,
 'Ġprosecutor': 13683,
 'Ġmanslaughter': 35083,
 'Ġget': 651,
 '701': 41583,
 'Ġinserts': 42220,
 'Phys': 43215,
 'udd': 4185,
 'Ġchildhood': 9963,
 'ĠBaker': 14372,
 'Ġacademy': 21531,
 'Ġrelax': 8960,
 'Ġbooking': 25452,
 'Ġoppressed': 25765,
 '644': 29173,
 'ilar': 1794,
 'Ġwake': 7765,
 'ized': 1143,
 'ĠBarn': 11842,
 'Ġallowing': 5086,
 'ĠâĶĤ': 19421,
 'ĠSch': 3059,
 'orthodox': 42539,
 'ĠSergeant': 26541,
 'Supplement': 42615,
 '993': 44821,
 'Ġhugging': 46292,
 'åį': 39355,
 'Ġlegendary': 13273,
 'Ġprize': 11596,
 'FB': 26001,
 'DP': 6322,
 'Ġdrummer': 34269,
 'Ġhistorically': 15074,
 'ĠPhysical': 16331,
 'Ġdeterior': 17975,
 'ĠRoot': 20410,
 'Ġparasite': 38473,
 'uci': 42008,
 'ĠTransgender': 44002,
 '244': 25707,
 'Ġbaskets': 46530,
 'Ġcost': 1575,
 'Ġback': 736,
 'Ġ164': 25307,
 'Ġnow': 783,
 'Ġpie': 2508,
 'Ġstarted': 2067,
 'ĠBard': 25654,
 'aez': 47246,
 '774': 47582,
 'Ġconfid': 47830,
 'ĠSeeking': 48160,
 'Ġemploys': 24

**Create a list of vocabulary tokens sorted by their index**

In [None]:
vocab_list = sorted(tokenizer.vocab.items(), key=lambda x:x[1])

**Verify that the resulting list is sorted by token index.** The raw vocab dict is not ordered by index (first cell below), whereas `vocab_list` is (second cell).

In [None]:
for k,v in tokenizer.vocab.items():
    if v < 10:
        print(k, v)

* 9
$ 3
% 4
) 8
& 5
! 0
" 1
( 7
# 2
' 6


In [None]:
vocab_list[:10]

[('!', 0),
 ('"', 1),
 ('#', 2),
 ('$', 3),
 ('%', 4),
 ('&', 5),
 ("'", 6),
 ('(', 7),
 (')', 8),
 ('*', 9)]
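
Printing the first ten entries is only a spot check. A minimal sketch of a full check is given below: it confirms that, after sorting, the entry in row `i` has index `i`, which is what lets row `i` of the metadata file label row `i` of the embedding matrix. The helper names and the toy vocabulary are hypothetical; with the real tokenizer one would pass `tokenizer.vocab` instead.

```python
from typing import Dict, List, Tuple


def sort_vocab_by_index(vocab: Dict[str, int]) -> List[Tuple[str, int]]:
    """Sort (token, index) pairs by index, as done for vocab_list."""
    return sorted(vocab.items(), key=lambda item: item[1])


def has_contiguous_indices(vocab_list: List[Tuple[str, int]]) -> bool:
    """True if the entry at row i carries index i for every row."""
    return all(idx == row for row, (_, idx) in enumerate(vocab_list))


# Toy vocabulary in arbitrary order (assumption: stands in for tokenizer.vocab).
toy_vocab = {"*": 2, "!": 0, '"': 1}
toy_list = sort_vocab_by_index(toy_vocab)

print(toy_list)
print(has_contiguous_indices(toy_list))
```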

**Save the sorted token labels as the projector's metadata file**

In [None]:
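# Write one label per line, in embedding-row order, to the projector's metadata file.

with open(os.path.join(log_dir, 'metadata.tsv'), "w") as f:
    for word, idx in tqdm(vocab_list):
        # Characters outside ISO-8859-1 (including the 'Ġ' space marker) become '?';
        # str() of the bytes object shows non-ASCII bytes as \xNN escapes,
        # so every label stays on a single ASCII line.
        line = str(word.encode(encoding='iso-8859-1', errors='replace'))
        # Strip the b'...' or b"..." wrapper that str() adds around the bytes.
        line = re.sub("^b'", "", line)
        line = re.sub('^b"', "", line)
        line = re.sub("'$", "", line)
        line = re.sub('"$', '', line)
        f.write("{}\n".format(line))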

100%|██████████| 50257/50257 [00:00<00:00, 358917.79it/s]
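
With the ISO-8859-1 encoding used above, the `Ġ` character that GPT-2's byte-level BPE uses for a leading space is written as `?`. As one possible alternative, the sketch below maps that marker (and the `Ċ` newline marker) to visible substitutes before writing. `readable_label` and `write_metadata` are hypothetical helpers, and the choice of `_` and `\n` as substitutes is an assumption, not something the notebook prescribes.

```python
import io

SPACE_MARKER = "\u0120"    # 'Ġ': byte-level BPE spelling of a space byte
NEWLINE_MARKER = "\u010a"  # 'Ċ': byte-level BPE spelling of a newline byte


def readable_label(token: str) -> str:
    """Hypothetical helper: make a vocab token readable on one TSV line."""
    label = token.replace(SPACE_MARKER, "_").replace(NEWLINE_MARKER, "\\n")
    # Defensive: a raw tab would start a second column in a TSV file.
    return label.replace("\t", " ")


def write_metadata(tokens, handle) -> int:
    """Write one label per line (no header) and return the line count."""
    count = 0
    for token in tokens:
        handle.write(readable_label(token) + "\n")
        count += 1
    return count


# In-memory demo; a real run would pass an open file and the sorted tokens.
buffer = io.StringIO()
written = write_metadata(["German", "\u0120Potter", "\u010a"], buffer)
print(written)
print(buffer.getvalue())
```

The returned count is worth comparing against the number of embedding rows (50257 here), since the projector pairs metadata lines with rows by position.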


**Save the word embeddings**

In [None]:
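# Copy the PyTorch token-embedding matrix into a TensorFlow variable
# and save it as a checkpoint that the projector can read.
embeddings = tf.Variable(model.transformer.wte.weight.detach().numpy())
checkpoint = tf.train.Checkpoint(embedding=embeddings)
checkpoint.save(os.path.join(log_dir, "embedding.ckpt"))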

'./logs/vocab/embedding.ckpt-1'

Finally, set up the TensorBoard projector's configuration. This writes a configuration file named `projector_config.pbtxt` to the log directory.

In [None]:
# Set up config.
config = projector.ProjectorConfig()
embedding = config.embeddings.add()

# The name of the tensor will be suffixed by `/.ATTRIBUTES/VARIABLE_VALUE`.
embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
embedding.metadata_path = 'metadata.tsv'
projector.visualize_embeddings(log_dir, config)
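
Before launching TensorBoard it can help to confirm that the log directory holds everything the projector needs. The sketch below is a hypothetical stdlib-only check. The file names it looks for (`metadata.tsv`, `projector_config.pbtxt`, the `checkpoint` state file, and files starting with `embedding.ckpt-1.`) are assumptions based on the steps above and may differ across TensorFlow versions.

```python
import os
import tempfile

# Assumed file names, taken from the steps in this notebook.
EXPECTED_FILES = ("metadata.tsv", "projector_config.pbtxt", "checkpoint")


def missing_projector_files(log_dir, checkpoint_prefix="embedding.ckpt-1"):
    """Return the expected files that are absent from log_dir."""
    present = set(os.listdir(log_dir))
    missing = [name for name in EXPECTED_FILES if name not in present]
    # The checkpoint itself is stored as several files sharing one prefix.
    if not any(name.startswith(checkpoint_prefix + ".") for name in present):
        missing.append(checkpoint_prefix + ".*")
    return missing


# Demo on a temporary directory that only contains the metadata file.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "metadata.tsv"), "w").close()
    demo_missing = missing_projector_files(demo_dir)

print(demo_missing)
```

In this notebook the call would be `missing_projector_files(log_dir)`, and an empty list would suggest that all steps completed.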

Run TensorBoard to visualize the embeddings. Select UMAP in the projector for faster and cleaner visualizations. Search for a few keywords and inspect their nearest neighbours, both in the 3D space and in the drop-down list.

In [None]:
%load_ext tensorboard
%tensorboard --logdir ./logs/vocab/
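
The nearest-neighbour lookup done in the projector can also be approximated offline. The sketch below ranks neighbours by cosine similarity using only the standard library. The labels and 2-dimensional vectors are toy data made up for illustration, and `nearest_neighbours` is a hypothetical helper. With the real model one would use the rows of the embedding matrix and the tokens from `vocab_list`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def nearest_neighbours(query, labels, vectors, k=2):
    """Return the k labels whose vectors are most similar to the query's."""
    query_vec = vectors[labels.index(query)]
    scored = [
        (label, cosine_similarity(query_vec, vec))
        for label, vec in zip(labels, vectors)
        if label != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in scored[:k]]


# Toy data (assumption): two loose clusters in two dimensions.
toy_labels = ["king", "queen", "apple", "pear"]
toy_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]

print(nearest_neighbours("king", toy_labels, toy_vectors, k=1))
```

A pure-Python loop like this is fine for a handful of vectors; for all 50257 rows a vectorized library would be the more practical choice.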

#### References

1. [Tensorboard embedding projector](https://www.tensorflow.org/tensorboard/tensorboard_projector_plugin)

2. https://towardsdatascience.com/how-to-visualize-text-embeddings-with-tensorboard-47e07e3a12fb

3. https://github.com/huggingface/transformers/issues/1458