# Assignment - "CBOW" Word2Vec Embedding

In this assignment you will implement a Word2Vec model from the paper http://arxiv.org/abs/1301.3781. To be more precise, you will implement the "CBOW" architecture, not the "Skip-gram" architecture. The CBOW architecture is described in section 3.1 of the paper.

As you have already learnt, Word2Vec is a model that learns word embeddings, which represent words as vectors. The idea is that similar words get similar vectors. For example, the words "cat" and "dog" will have similar vectors because they are both animals. The words "cat" and "car" will have less similar vectors because they have little in common. The words "cat" and "the" will have very different vectors because they are not similar at all.
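
To make "similar vectors" concrete, here is a minimal sketch that compares vectors with cosine similarity, the same measure used in the evaluation part of this notebook. The three-dimensional vectors are hand-picked for illustration only; they are an assumption, not learned embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked toy vectors, purely illustrative (not learned embeddings).
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(f"cat~dog: {sim_cat_dog:.3f}, cat~car: {sim_cat_car:.3f}")
```

With these toy values, "cat" and "dog" point in almost the same direction, while "cat" and "car" do not.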

The assignment is divided into 4 parts:

1. Data Preparation
2. CBOW Model Architecture
3. Training
4. Evaluation

## Dependencies

In [87]:
%matplotlib inline

In [88]:
# install required dependencies
!pip install tqdm torch torchtext datasets torchinfo plotly pandas numpy scikit-learn



In [89]:
import os
import re
import json
from functools import partial
from typing import List, Callable, Tuple

from tqdm.auto import tqdm

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

from torchinfo import summary  # nice library for model summary

import numpy as np

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

from datasets import load_dataset  # huggingface datasets library

import plotly.express as px
import plotly.graph_objects as go

import pandas as pd

In [9]:
torch_seed = 0
torch.manual_seed(torch_seed)  # seed the default generator so runs are repeatable

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == 'cuda':
    torch.cuda.manual_seed(torch_seed)

In [10]:
print(f"Currently using {device}")

Currently using cpu


In [11]:
# You can change parameters if you want
train_batch_size = 96
val_batch_size = 96
shuffle = True

# Parameters about CBOW model architecture and Vocab.
CBOW_N_WORDS = 4 # context_size, window_size is 2*context_size + 1

MIN_WORD_FREQUENCY = 50
MAX_SEQUENCE_LENGTH = 256

EMBED_DIMENSION = 300
EMBED_MAX_NORM = 1

## 1. Data Preparation

For this project you will be using the WikiText-2 dataset. The dataset is a collection of Wikipedia articles. You can read more about the dataset here:

https://huggingface.co/datasets/wikitext
https://huggingface.co/datasets/wikitext/viewer/wikitext-2-raw-v1/test

In [12]:
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
train_dataset = datasets["train"]
val_dataset = datasets['validation']
test_dataset = datasets['test']


In [13]:
# Take a look at an example sentence
train_dataset["text"][11]

" Troops are divided into five classes : Scouts , Shocktroopers , Engineers , Lancers and Armored Soldier . Troopers can switch classes by changing their assigned weapon . Changing class does not greatly affect the stats gained while in a previous class . With victory in battle , experience points are awarded to the squad , which are distributed into five different attributes shared by the entire squad , a feature differing from early games ' method of distributing to different unit types . \n"

After loading the data you will need to preprocess it so that it can be properly tokenized and used for training.

In [14]:
def clean_sentence(sentence: str) -> str:
    pattern = r"[^A-Za-z ]+"
    clean = re.sub(pattern, "", sentence.replace("\n", " "))
    return clean


clean_sentence(train_dataset["text"][11])  # the multiple spaces are not an issue, the tokenizer can handle them.

' Troops are divided into five classes  Scouts  Shocktroopers  Engineers  Lancers and Armored Soldier  Troopers can switch classes by changing their assigned weapon  Changing class does not greatly affect the stats gained while in a previous class  With victory in battle  experience points are awarded to the squad  which are distributed into five different attributes shared by the entire squad  a feature differing from early games  method of distributing to different unit types   '

We will be using the torchtext tokenizer to tokenize our sentences. The tokenizer will also lowercase all the words. You can read more about the tokenizer here: https://pytorch.org/text/stable/data_utils.html#get-tokenizer

In [15]:
tokenizer = get_tokenizer("basic_english", language="en")

def yield_tokens(datasets):
    for sentence in datasets:
        yield tokenizer(clean_sentence(sentence))

We will be using the torchtext build_vocab_from_iterator function to build our vocabulary. You can read more about the function here: https://pytorch.org/text/stable/vocab.html#torchtext.vocab.build_vocab_from_iterator

In [16]:
vocab = build_vocab_from_iterator(yield_tokens(train_dataset["text"]), min_freq=MIN_WORD_FREQUENCY)
vocab.set_default_index(len(vocab) - 1)  # unknown tokens map to the last index, which they share with the last in-vocabulary word
vocab_size = len(vocab)
vocab_size

3848

In [17]:
vocab.get_itos()[:10]  # first 10 words

['the', 'of', 'and', 'in', 'to', 'a', 'was', 's', 'on', 'as']
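
The vocabulary step above can be sketched without torchtext to show what `min_freq` and the default index do. This is a simplified stand-in, not the library's implementation: `build_vocab`, `toy_corpus` and `lookup` are hypothetical names, and breaking frequency ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def build_vocab(token_lists, min_freq):
    """Map tokens seen at least `min_freq` times to indices, most frequent first."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Sort by descending frequency, then alphabetically to break ties.
    kept = sorted(
        (tok for tok, n in counts.items() if n >= min_freq),
        key=lambda tok: (-counts[tok], tok),
    )
    return {tok: idx for idx, tok in enumerate(kept)}

toy_corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]
toy_vocab = build_vocab(toy_corpus, min_freq=2)
default_index = len(toy_vocab) - 1  # fallback for out-of-vocabulary tokens

def lookup(tokens):
    """Replace each token with its index, falling back to the default index."""
    return [toy_vocab.get(tok, default_index) for tok in tokens]

print(toy_vocab)
print(lookup(["the", "cat", "sat"]))
```

Here "cat" is too rare to enter the vocabulary, so it falls back to the last index and becomes indistinguishable from "sat". Reserving a dedicated unknown token would avoid that collision.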

We need a collate function that turns a batch of raw text into the CBOW training format. It should slide a window over the tokenized batch and, for each window, use the CBOW_N_WORDS words to the left and to the right of the center as the input and the center word as the target output.  
Make the collate function return the CBOW dataset as tensors.

In [18]:

# In the end we want all of our sentences to be tokenized and the tokens to be replaced with their vocab indexes.
def to_token_idxs(sentence: str) -> List[int]:
    return vocab(tokenizer(sentence))

to_token_idxs("troops are divided into five classes")

[605, 25, 1579, 51, 149, 2386]

![cbow](https://user-images.githubusercontent.com/74028313/204695601-51d44a38-4bd3-4a69-8891-2854aa57c034.png)

In [19]:
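# batch = a list of raw text lines; they are joined, cleaned, tokenized and then split into sliding windows,
# e.g. "troops are divided into five" and then "are divided into five classes" for context_size=2.
# For these two windows the output is context [605, 25, 51, 149] with target 1579 and then
# context [25, 1579, 149, 2386] with target 51, where each number is the vocab index of the word.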
def collate(batch: List[str], clean_fn: Callable, to_token_idxs_fn: Callable, context_size: int):
    batch_input, batch_output = [], []
    clean_batch = clean_fn("".join(batch))
    token_idxs = to_token_idxs_fn(clean_batch)
    window_size = 2 * context_size + 1
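    # note: this range stops one short of the last full window;
    # use `len(token_idxs) - window_size + 1` to include it
    for idx in range(len(token_idxs) - window_size):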
        window = token_idxs[idx:idx + window_size]
        batch_output.append(window.pop(context_size))  # remove middle index and put it in output
        batch_input.append(window)

    return batch_input, batch_output

In [20]:
# check our work
collate(["Troops are divided into five classes: Scouts, Shocktroopers, Engineers, Lancers and Armored Soldier.",
         "Troopers can switch classes by changing their assigned weapon."]
        ,clean_sentence ,to_token_idxs, 2)

([[605, 25, 51, 149],
  [25, 1579, 149, 2386],
  [1579, 51, 2386, 3847],
  [51, 149, 3847, 3847],
  [149, 2386, 3847, 2315],
  [2386, 3847, 2315, 3847],
  [3847, 3847, 3847, 2],
  [3847, 2315, 2, 3332],
  [2315, 3847, 3332, 3847],
  [3847, 2, 3847, 99],
  [2, 3332, 99, 3847],
  [3332, 3847, 3847, 2386],
  [3847, 99, 2386, 13],
  [99, 3847, 13, 2506],
  [3847, 2386, 2506, 27],
  [2386, 13, 27, 1006]],
 [1579,
  51,
  149,
  2386,
  3847,
  3847,
  2315,
  3847,
  2,
  3332,
  3847,
  99,
  3847,
  2386,
  13,
  2506])
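
The index lists above are easier to read as words. The sketch below builds the same kind of (context, target) pairs directly from tokens. `cbow_pairs` is a hypothetical helper, not part of the notebook, and unlike the `range(len(token_idxs) - window_size)` loop in `collate` it also includes the final full window.

```python
def cbow_pairs(tokens, context_size):
    """Return (context, target) pairs for every full window in `tokens`."""
    window_size = 2 * context_size + 1
    pairs = []
    # + 1 so that the final full window is included as well
    for start in range(len(tokens) - window_size + 1):
        window = tokens[start:start + window_size]
        target = window[context_size]  # the center word
        context = window[:context_size] + window[context_size + 1:]
        pairs.append((context, target))
    return pairs

example_tokens = "troops are divided into five classes".split()
example_pairs = cbow_pairs(example_tokens, context_size=2)
for context, target in example_pairs:
    print(context, "->", target)
```

Six tokens and a window of five give two pairs: "divided" predicted from its four neighbours, then "into".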

In [21]:
# partial is used here to pass arguments to the collate function, could also use lambda https://discuss.pytorch.org/t/supplying-arguments-to-collate-fn/25754
train_dataloader = DataLoader(
    train_dataset['text'],
    batch_size=train_batch_size,
    shuffle=shuffle,
    collate_fn=partial(collate, clean_fn=clean_sentence, to_token_idxs_fn=to_token_idxs, context_size=CBOW_N_WORDS),
)

val_dataloader = DataLoader(
    val_dataset['text'],
    batch_size=val_batch_size,
    shuffle=shuffle,
    collate_fn=partial(collate, clean_fn=clean_sentence, to_token_idxs_fn=to_token_idxs, context_size=CBOW_N_WORDS),
)

In [22]:
# Check our dimensions
print(f"Training X: {len(train_dataset['text'])}")
print(f"Validation X: {len(val_dataset['text'])}")
print(f"Training dataloader: {len(train_dataloader)} batches of {train_batch_size}")
print(f"Validation dataloader: {len(val_dataloader)} batches of {val_batch_size}")
for idx, batch in enumerate(train_dataloader):
    X, y = batch
    y = torch.tensor(y, device=device, dtype=torch.int32)
    X = torch.tensor(X, device=device, dtype=torch.int32)
    print(f"y {y.shape} {y}")
    print(f"X {X.shape} {X}")
    break

Training X: 36718
Validation X: 3760
Training dataloader: 383 batches of 96
Validation dataloader: 40 batches of 96
y torch.Size([3679]) tensor([ 162, 3847, 2043,  ...,   93,   49,  474], dtype=torch.int32)
X torch.Size([3679, 8]) tensor([[ 243,    2, 2213,  ..., 2043,    2,    0],
        [   2, 2213,    8,  ...,    2,    0, 1756],
        [2213,    8,  162,  ...,    0, 1756,  967],
        ...,
        [  49,    2,   49,  ...,  474,   14,  421],
        [   2,   49, 2666,  ...,   14,  421,    3],
        [  49, 2666,   93,  ...,  421,    3,   93]], dtype=torch.int32)


## 2. CBOW Model Architecture
![image](https://user-images.githubusercontent.com/74028313/204701161-cd9df4bf-78b8-4b4d-b8b7-ed4a3b5c3922.png)

The main idea of the CBOW model is to predict the center (target) word from its context words. As the simple architecture above shows, the 2 x CBOW_N_WORDS input words are mapped to the projection layer. Converting each word to an embedding requires a lookup table, for which we use torch's nn.Embedding. After the context embeddings are combined, a shallow linear layer predicts the target word, and the prediction is compared with the index of the center word using cross-entropy loss. Finally, the embedding layer (lookup table) of the trained model serves as the word embeddings.
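
The description above can be written compactly as follows. The symbols are assumptions introduced here for illustration: $C$ stands for CBOW_N_WORDS, $d$ for the embedding dimension, $E \in \mathbb{R}^{|V| \times d}$ for the embedding table, $W \in \mathbb{R}^{|V| \times d}$ and $b \in \mathbb{R}^{|V|}$ for the linear layer, and $w_t$ for the center word. The sketch ignores the `max_norm` renormalisation of the embedding rows.

$$
h = \frac{1}{2C} \sum_{\substack{-C \le j \le C \\ j \ne 0}} E_{w_{t+j}},
\qquad
z = W h + b,
\qquad
\mathcal{L} = -\log \frac{\exp(z_{w_t})}{\sum_{v=1}^{|V|} \exp(z_v)}
$$

Here $h$ is the averaged context embedding, $z$ holds one score per vocabulary word, and $\mathcal{L}$ is the cross-entropy loss for a single window.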

In [90]:
class CBOW(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int, max_norm: float):
        super(CBOW, self).__init__()
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim, max_norm=max_norm,
                                      norm_type=2.0) # store word embeddings and retrieve them using indices -> lookup table
        self.projection = nn.Linear(in_features=embedding_dim, out_features=vocab_size)

    def forward(self, x):
        # x: [batch_size, 2 * context_size] vocab indices of the surrounding tokens,
        # e.g. [batch_size, 4] for window_size = 5 (2 tokens to the left and 2 to the right)
        # look up the embedding of each token (think of a one-hot vector times a |V| x embed_dim weight matrix)
        x = self.embedding(x)  # [batch_size, 4, embed_dim]
        # average the embeddings of the tokens in the window. In the paper they use the mean, which is better than the sum:
        # https://groups.google.com/g/word2vec-toolkit/c/HlJyFACiVPE?pli=1
        x = torch.mean(x, 1)  # [batch_size, embed_dim]
        # [batch_size, |V|] raw scores; after softmax they should be as close to the one-hot vector of the center word as possible
        y_pred = self.projection(x)
        return y_pred

In [24]:
summary(CBOW(vocab_size, EMBED_DIMENSION, EMBED_MAX_NORM), input_size=[train_batch_size, CBOW_N_WORDS*2], dtypes=[torch.int32])

Layer (type:depth-idx)                   Output Shape              Param #
CBOW                                     [96, 3848]                --
├─Embedding: 1-1                         [96, 8, 300]              1,154,400
├─Linear: 1-2                            [96, 3848]                1,158,248
Total params: 2,312,648
Trainable params: 2,312,648
Non-trainable params: 0
Total mult-adds (M): 222.01
Input size (MB): 0.00
Forward/backward pass size (MB): 4.80
Params size (MB): 9.25
Estimated Total Size (MB): 14.05
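
The parameter counts in the summary can be reproduced by hand, which is a quick sanity check on the layer shapes. The variable names below are chosen for this sketch only.

```python
n_vocab = 3848   # vocabulary size reported above
n_dim = 300      # EMBED_DIMENSION

embedding_params = n_vocab * n_dim        # lookup table, one row per word, no bias
linear_params = n_dim * n_vocab + n_vocab  # weight matrix plus one bias per output word
total_params = embedding_params + linear_params

print(f"{embedding_params:,} + {linear_params:,} = {total_params:,}")
print(f"size at 4 bytes per float32 parameter: {total_params * 4 / 1e6:.2f} MB")
```

The two layers are almost the same size; the linear layer is larger only by its bias vector.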

## 3. Training

Let's train our CBOW model. Implement the _train_epoch and _validate_epoch functions.
- First train the model with a constant learning rate; then, if you are interested, you can try a learning rate scheduler.
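
As one possible learning rate schedule, a linear decay can be expressed as a multiplier on the base learning rate. This is a sketch under assumptions: `linear_decay_factor` is a hypothetical helper, and the base rate and epoch count simply mirror the values used later in this notebook.

```python
def linear_decay_factor(epoch, total_epochs):
    """Multiplier for the base learning rate: 1.0 at epoch 0, falling linearly towards 0."""
    return (total_epochs - epoch) / total_epochs

base_lr = 0.025
total_epochs = 10
schedule = [base_lr * linear_decay_factor(e, total_epochs) for e in range(total_epochs)]
print([round(lr, 4) for lr in schedule])
```

In PyTorch such a function could be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, with the scheduler's `step()` called once per epoch after the optimizer updates.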

In [25]:
def get_one_hot_tensor(vocab_size: int, token_idxs: List[int], device: str):
    one_hot_tensor = torch.zeros(len(token_idxs), vocab_size)
    for idx, token_idx in enumerate(token_idxs):
        one_hot_tensor[idx][token_idx] = 1

    return one_hot_tensor.to(device=device)


get_one_hot_tensor(5, [1, 3, 4], "cpu")

tensor([[0., 1., 0., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1.]])
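
With a one-hot target, cross-entropy reduces to the negative log-probability of the correct class, so the one-hot vector and the plain class index carry the same information. The pure-Python sketch below checks this on made-up logits; all names in it are illustrative.

```python
import math

def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_one_hot(logits, one_hot):
    """-sum_v one_hot[v] * log(softmax(logits)[v])"""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def cross_entropy_index(logits, target_idx):
    """-log(softmax(logits)[target_idx])"""
    return -math.log(softmax(logits)[target_idx])

toy_logits = [2.0, 0.5, -1.0, 0.0, 1.0]  # made-up scores for a 5-word vocabulary
toy_target = 3
toy_one_hot = [1.0 if i == toy_target else 0.0 for i in range(len(toy_logits))]

loss_a = cross_entropy_one_hot(toy_logits, toy_one_hot)
loss_b = cross_entropy_index(toy_logits, toy_target)
print(loss_a, loss_b)
```

`nn.CrossEntropyLoss` accepts either form of target in recent PyTorch versions, so passing the integer indices directly would avoid building a batch_size x vocab_size tensor for every batch.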

In [26]:
class Train_CBOW:

    def __init__(
            self,
            model,
            epochs,
            train_dataloader,
            val_dataloader,
            loss_function,
            optimizer,
            device,
            model_dir,
            lr_scheduler=None
    ):
        self.model = model
        self.epochs = epochs
        self.train_dataloader = train_dataloader
        self.val_dataloader = val_dataloader
        self.loss_function = loss_function
        self.optimizer = optimizer
        self.device = device
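        self.model_dir = model_dir
        self.lr_scheduler = lr_scheduler  # optional, stepped once per epoch

        self.loss = {"train": [], "val": []}
        self.model.to(self.device)

    def train(self):
        for epoch in range(self.epochs):
            self._train_epoch()
            self._validate_epoch()
            if self.lr_scheduler is not None:
                self.lr_scheduler.step()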
            print(
                "Epoch: {}/{}, Train Loss={:.5f}, Val Loss={:.5f}".format(
                    epoch + 1,
                    self.epochs,
                    self.loss["train"][-1],
                    self.loss["val"][-1],
                )
            )

    def _train_epoch(self):
        self.model.train()
        loss_list = []

        for idx, batch in tqdm(enumerate(self.train_dataloader)):
            X, y = batch
            y = get_one_hot_tensor(self.model.vocab_size, y, self.device)
            X = torch.tensor(X, device=self.device)

            self.optimizer.zero_grad()  # clear gradient

            y_pred = self.model(X)  # forward

            loss = self.loss_function(y_pred, y)  # loss
            loss_list.append(loss.detach())  # detach so the autograd graph of each batch is not kept in memory
            loss.backward()  # backward
            self.optimizer.step()  # optimize

        epoch_loss = torch.mean(torch.stack(loss_list))
        self.loss["train"].append(epoch_loss)

    def _validate_epoch(self):
        self.model.eval()
        loss_list = []

        with torch.no_grad():
            for idx, batch in enumerate(self.val_dataloader):
                X, y = batch
                y = get_one_hot_tensor(self.model.vocab_size, y, self.device)
                X = torch.tensor(X, device=self.device)

                y_pred = self.model(X)  # forward
                loss = self.loss_function(y_pred, y)  # loss
                loss_list.append(loss)

        epoch_loss = torch.mean(torch.stack(loss_list))
        self.loss["val"].append(epoch_loss)

    def save_model(self):
        model_path = os.path.join(self.model_dir, "model.pt")
        torch.save(self.model, model_path)

    def save_loss(self):
        loss_path = os.path.join(self.model_dir, "loss.json")
        with open(loss_path, "w") as fp:
            self.loss["train"] = [tensor.item() for tensor in self.loss["train"]]
            self.loss["val"] = [tensor.item() for tensor in self.loss["val"]]
            json.dump(self.loss, fp)

In [27]:
learning_rate = 0.025
epochs = 10

model = CBOW(vocab_size=vocab_size, embedding_dim=EMBED_DIMENSION, max_norm=EMBED_MAX_NORM)
loss_function = nn.CrossEntropyLoss() # takes raw values hence no softmax or activation in last layer
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

In [28]:
model_dir = './model/'
# Make the model directory if it does not exist, to save the model weights
os.makedirs(model_dir, exist_ok=True)

In [29]:
trainer = Train_CBOW(
    model=model,
    epochs=epochs,
    train_dataloader=train_dataloader,
    val_dataloader=val_dataloader,
    loss_function=loss_function,
    optimizer=optimizer,
    device=device,
    model_dir=model_dir,
)

trainer.train()
print("Training finished.")


Epoch: 1/10, Train Loss=5.45493, Val Loss=5.19436
Epoch: 2/10, Train Loss=5.16582, Val Loss=5.08723
Epoch: 3/10, Train Loss=5.07077, Val Loss=5.03903
Epoch: 4/10, Train Loss=5.01649, Val Loss=5.00611
Epoch: 5/10, Train Loss=4.98120, Val Loss=4.99464
Epoch: 6/10, Train Loss=4.95540, Val Loss=4.99271
Epoch: 7/10, Train Loss=4.93559, Val Loss=5.00314
Epoch: 8/10, Train Loss=4.92298, Val Loss=4.99118
Epoch: 9/10, Train Loss=4.90922, Val Loss=4.98492
Epoch: 10/10, Train Loss=4.89740, Val Loss=4.98430
Training finished.
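
To put the loss values in the log above into perspective, they can be compared with the loss of a model that guesses uniformly over the vocabulary, and converted to perplexity by exponentiating. This is a rough reading that assumes the reported epoch loss, a mean over batch means in nats, is a fair estimate of the per-token cross-entropy.

```python
import math

n_vocab = 3848              # vocabulary size reported earlier
final_train_loss = 4.89740  # epoch 10, from the log above
final_val_loss = 4.98430

uniform_loss = math.log(n_vocab)  # loss of a model that guesses uniformly
train_perplexity = math.exp(final_train_loss)
val_perplexity = math.exp(final_val_loss)

print(f"uniform baseline loss: {uniform_loss:.3f}")
print(f"train perplexity: {train_perplexity:.1f}")
print(f"val perplexity:   {val_perplexity:.1f}")
```

A perplexity in the low hundreds means the model is, on average, about as uncertain as if it were choosing among that many equally likely words, compared with all 3848 for the uniform baseline.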


In [30]:
# save model
trainer.save_model()
trainer.save_loss()

vocab_path = os.path.join(model_dir, "vocab.pt")
torch.save(vocab, vocab_path)

In [40]:
fig = go.Figure()
fig.add_trace(go.Scatter(y=trainer.loss["train"],
                    mode='lines',
                    name='Training'))
fig.add_trace(go.Scatter(y=trainer.loss["val"],
                    mode='lines',
                    name='Validation'))

# Edit the layout
fig.update_layout(title='CBOW from Scratch Training',
                   xaxis_title='Epochs',
                   yaxis_title='Loss')

fig.show()

## 4. Evaluation

Now that we have trained our model let's load our embeddings and see if we achieved our goals.

In [41]:
# reload saved model and vocab
model = torch.load(os.path.join(model_dir, "model.pt"), map_location=device)
vocab = torch.load(os.path.join(model_dir, "vocab.pt"))

# embedding is model's first layer
embeddings = list(model.parameters())[0]
embeddings = embeddings.cpu().detach().numpy()

In [42]:
embeddings_df = pd.DataFrame(embeddings)
for token, idx in vocab.get_stoi().items():
    embeddings_df.loc[idx, "token"] = token

# move to front
word_column = embeddings_df.pop('token')
embeddings_df.insert(0, 'token', word_column)

In [43]:
embeddings_df.head(10)

Unnamed: 0,token,0,1,2,3,4,5,6,7,8,...,290,291,292,293,294,295,296,297,298,299
0,the,0.085537,-0.016568,-0.111821,-0.149764,-0.028471,0.000603,-0.04949,-0.002673,0.041732,...,-0.021008,-0.021215,-0.002445,-0.08568,0.111842,-0.02627,-0.132318,0.028821,-0.042134,0.049591
1,of,0.047074,0.047841,0.063443,0.101381,-0.016897,0.028919,0.07904,-0.084996,-0.022181,...,0.036332,-0.088694,-0.02032,0.036297,-0.097317,-0.069241,0.109762,0.062826,-0.105972,-0.131777
2,and,-0.026058,-0.085627,0.035302,0.002652,-0.039583,0.034764,0.044013,0.000181,-0.006052,...,0.016324,0.03879,0.013179,-0.010727,-0.078934,-0.083221,0.136999,0.038349,-0.064423,-0.153362
3,in,-0.001893,-0.143155,0.041051,0.041067,0.01465,0.055739,0.005157,-0.062392,-0.002958,...,0.024252,-0.009995,-0.015803,0.012579,0.062386,-0.11871,0.137043,0.074272,-0.012258,-0.0844
4,to,0.013015,-0.078569,-0.085096,0.078846,-0.031175,0.059051,0.06193,0.083512,0.035657,...,-0.038487,0.093689,-0.082679,-0.074912,-0.043002,-0.086919,0.053446,-0.03534,-0.168364,-0.090428
5,a,0.038808,-0.059927,-0.054437,-0.016322,0.005057,0.110044,-0.032128,0.009845,-0.025099,...,0.008005,0.014596,-0.054981,-0.028679,0.080726,-0.070232,-0.112523,-0.038893,0.058399,0.088377
6,was,-0.058699,0.08393,0.046581,0.041677,0.049776,0.043054,-0.102079,-0.026219,-0.049468,...,-0.054353,-0.042243,-0.008797,-0.03377,-0.094414,-0.090585,-0.079607,0.024692,-0.019707,0.072998
7,s,-0.088716,-0.041882,-0.113224,0.029534,0.01696,-0.023771,-0.049168,0.071771,0.105385,...,0.03711,-0.049074,0.093437,-0.059651,0.121915,-0.010175,0.10491,-0.051979,0.001587,-0.045091
8,on,-0.046836,-0.081999,-0.030816,0.06982,0.031225,0.020886,-0.114216,-0.059555,0.086598,...,-0.104576,-0.001023,-0.083507,-0.082064,-0.034946,0.028136,0.088369,0.042889,-0.055468,-0.099856
9,as,-0.063262,-0.050761,0.10281,0.010499,0.060799,0.015572,0.061786,0.021662,-0.007403,...,0.110467,-0.019976,0.020759,-0.051339,-0.04722,-0.011521,0.081332,0.094458,-0.122637,-0.087101


In [44]:
def get_cosine_sim(word_vec, other_vec):
    return (word_vec @ other_vec) / (np.linalg.norm(word_vec) * np.linalg.norm(other_vec))

In [55]:
def get_k_similar(token: str, vocab, k: int = 10) -> List[Tuple[str, float]]:
    stoi = vocab.get_stoi()
    embedded_token = embeddings[stoi[token]]  # uses the global `embeddings` matrix loaded above
    all_sims = {}
    for other, idx in stoi.items():
        if other == token:
            continue  # skip the token itself
        all_sims[other] = get_cosine_sim(embedded_token, embeddings[idx])

    sorted_sims = sorted(all_sims.items(), key=lambda item: item[1], reverse=True)
    return sorted_sims[:k]

In [56]:
get_k_similar("king", vocab)

[('queen', 0.42077157),
 ('reign', 0.4196033),
 ('walter', 0.39870682),
 ('bce', 0.39367795),
 ('djedkare', 0.39288175),
 ('title', 0.38532743),
 ('philip', 0.37316322),
 ('emperor', 0.36861202),
 ('iv', 0.3546911),
 ('earl', 0.35148412)]

In [61]:
get_k_similar("football", vocab)

[('champions', 0.37642378),
 ('promotion', 0.3525813),
 ('premier', 0.345919),
 ('hockey', 0.3364137),
 ('uefa', 0.33595797),
 ('basketball', 0.33354732),
 ('speaking', 0.3208413),
 ('creation', 0.31150877),
 ('baseball', 0.31065193),
 ('seasons', 0.30927154)]

PCA

In [66]:
def add_colors(df):
    df['color'] = 'blue'  # Set the default color to blue

    df.loc[df['token'].isin([token[0] for token in get_k_similar("king", vocab)]), 'color'] = 'green'
    df.loc[df['token'].isin([token[0] for token in get_k_similar("football", vocab)]), 'color'] = 'red'
    return df

In [67]:
pca_2d = PCA(n_components=2)
pca_2d_embed = pca_2d.fit_transform(embeddings)

In [68]:
pca_2d_df = pd.DataFrame(pca_2d_embed, columns=['1', '2'])
pca_2d_df['token'] = embeddings_df['token']

In [69]:
add_colors(pca_2d_df)

Unnamed: 0,1,2,token,color
0,0.490491,-0.066907,the,blue
1,-0.043893,-0.359104,of,blue
2,0.281104,-0.157985,and,blue
3,0.151017,-0.176460,in,blue
4,0.263181,-0.326058,to,blue
...,...,...,...,...
3843,0.055110,0.197413,venues,blue
3844,0.023335,0.050650,wagner,blue
3845,-0.074348,-0.013033,walking,blue
3846,-0.331274,0.044510,weapon,blue


In [70]:
fig = px.scatter(pca_2d_df, x='1', y='2', hover_data=['token'], color='color')
fig.show()

In [71]:
pca_3d = PCA(n_components=3)
pca_3d_embed = pca_3d.fit_transform(embeddings)
pc_3d_df = pd.DataFrame(pca_3d_embed, columns=['1', '2', '3'])
pc_3d_df['token'] = embeddings_df['token']
add_colors(pc_3d_df)

Unnamed: 0,1,2,3,token,color
0,0.491297,-0.074712,-0.270989,the,blue
1,-0.046086,-0.349329,0.361900,of,blue
2,0.282108,-0.163096,0.165907,and,blue
3,0.150328,-0.173487,0.238598,in,blue
4,0.263372,-0.325367,0.344548,to,blue
...,...,...,...,...,...
3843,0.054356,0.204657,-0.115351,venues,blue
3844,0.022319,0.048062,-0.050985,wagner,blue
3845,-0.073929,-0.018098,-0.118971,walking,blue
3846,-0.330519,0.043476,0.140330,weapon,blue


In [73]:
fig = px.scatter_3d(pc_3d_df, x='1', y='2', z='3', hover_data=['token'], color='color')
fig.show()

TSNE

In [78]:
tsne_2d = TSNE(n_components=2, random_state=42)
tsne_2d_embed = tsne_2d.fit_transform(embeddings)
tsne_2d_df = pd.DataFrame(tsne_2d_embed, columns=['1', '2'])
tsne_2d_df['token'] = embeddings_df['token']
add_colors(tsne_2d_df)

Unnamed: 0,1,2,token,color
0,24.759232,-23.001451,the,blue
1,20.486557,-16.522123,of,blue
2,30.210815,-18.116652,and,blue
3,13.711433,-27.067392,in,blue
4,36.774929,-9.025466,to,blue
...,...,...,...,...
3843,13.186694,0.833658,venues,blue
3844,13.521516,-0.258973,wagner,blue
3845,-15.701785,-11.146845,walking,blue
3846,-24.271585,7.102196,weapon,blue


In [80]:
fig = px.scatter(tsne_2d_df, x='1', y='2', hover_data=['token'], color='color')
fig.show()

In [84]:
tsne_3d = TSNE(n_components=3, random_state=42)  # assumption: the seed was missing in the source; 42 mirrors the 2D run above
tsne_3d_embed = tsne_3d.fit_transform(embeddings)
tsne_3d_df = pd.DataFrame(tsne_3d_embed, columns=['1', '2', '3'])
tsne_3d_df['token'] = embeddings_df['token']
add_colors(tsne_3d_df)

Unnamed: 0,1,2,3,token,color
0,20.282921,-18.083191,-40.111698,the,blue
1,7.457191,-47.639168,-7.999765,of,blue
2,13.757533,-48.776878,-16.180811,and,blue
3,39.074379,-42.774544,0.797148,in,blue
4,17.170401,-59.881382,4.181213,to,blue
...,...,...,...,...,...
3843,-10.962580,-16.082119,20.360903,venues,blue
3844,42.336063,10.872888,5.906294,wagner,blue
3845,22.565176,25.802588,-23.105036,walking,blue
3846,12.889104,32.915993,30.908831,weapon,blue


In [85]:
fig = px.scatter_3d(tsne_3d_df, x='1', y='2', z='3', hover_data=['token'], color='color')
fig.show()