# Exploring Language Model's Embeddings

<a href="https://colab.research.google.com/drive/1he7aM1o8b8TtCWRStXFHpXVO7LYbQ5Mm" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>

Return to the [castle](https://github.com/Nkluge-correa/TeenyTinyCastle).

In language modeling, embeddings are dense vector representations of words or tokens in a continuous vector space. These embeddings are learned while training a model. Dedicated methods such as [Word2Vec](https://paperswithcode.com/method/skip-gram-word2vec) and [GloVe](https://paperswithcode.com/method/glove) exist for this purpose, while in neural language models like GPT and BERT the embedding layer is treated like any other layer of the network, i.e., a collection of parameters learned via gradient descent and backpropagation. Embeddings are crucial in language modeling because they represent words in a way that is both meaningful and computationally efficient, enabling the model to learn complex patterns and relationships within language data.

Since embeddings are the foundation for various natural language processing tasks, it would be nice to explore their inner workings more.

![word-embeddings](https://lena-voita.github.io/resources/lectures/word_emb/lookup_table.gif)

[Source](https://lena-voita.github.io/nlp_course/word_embeddings.html).

You can think of an embedding layer as a lookup table. Imagine a large table with rows and columns, much like a spreadsheet. Each row represents a unique item, and each column represents some attribute or feature of that item. For example, if you're dealing with words, each row might represent a word, and each column could represent a linguistic feature. In this lookup table, each row is assigned a unique index. This index serves as the "address" to access the corresponding row. So, if you want to retrieve information about a specific word, you look up that word's index in the table, and it will lead you to the row containing the relevant information about that word.

In a nutshell, this is an embedding layer. An embedding layer maps each word (or token) from a high-dimensional one-hot encoded space to a lower-dimensional dense vector space. Each word is represented by a unique vector in this space. And just like in the lookup table, each word is associated with a unique index.
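
To make the lookup-table view concrete, here is a minimal sketch showing that multiplying a one-hot vector by an embedding matrix returns the same row as indexing the matrix directly. The 5x3 matrix and the names `embedding_matrix`, `one_hot`, and `token_index` are made up for illustration.

```python
import numpy as np

# Toy embedding matrix: 5 tokens, 3 dimensions (values are arbitrary).
embedding_matrix = np.arange(15, dtype=float).reshape(5, 3)

token_index = 2

# One-hot encoding of the token: a vector of zeros with a single 1.
one_hot = np.zeros(5)
one_hot[token_index] = 1.0

# Multiplying the one-hot vector by the matrix selects a single row...
via_matmul = one_hot @ embedding_matrix

# ...which is exactly what a direct row lookup returns.
via_lookup = embedding_matrix[token_index]

print(via_matmul)
print(np.array_equal(via_matmul, via_lookup))
```

This is why embedding layers are implemented as index lookups in practice: the result is the same, but no multiplication by a mostly-zero vector is needed.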

Below, we have a minimalistic example demonstrating this relationship between words (or tokens), their integer representations (their index on the embedding matrix), and their vector representation (the actual vector that represents that specific token).

In [None]:
import numpy as np

# Define a small vocabulary.
vocab = ["apple", "banana", "orange", "grape", "pineapple"]

# Define corresponding vectors for each word (for demonstration purposes).
word_to_vec = {
    "apple": np.array([0.1, 0.2, 0.3]), # Index 0
    "banana": np.array([0.2, 0.3, 0.4]), # Index 1
    "orange": np.array([0.3, 0.4, 0.5]), # Index 2 ...
    "grape": np.array([0.4, 0.5, 0.6]),
    "pineapple": np.array([0.5, 0.6, 0.7])
}

# Function to retrieve vector representation of a word from the vocabulary
def get_vector(word):
    return word_to_vec.get(word, np.zeros(3))  # Return zero vector if word not found

# Test the vocabulary and lookup table
word = "banana"
print(f"Word: '{word}', Integer representation: {list(word_to_vec.keys()).index(word)}, Vector representation: {get_vector(word)}")

word = "pineapple"  # Another word from the vocabulary (index 4)
print(f"Word: '{word}', Integer representation: {list(word_to_vec.keys()).index(word)}, Vector representation: {get_vector(word)}")


Word: 'banana', Integer representation: 1, Vector representation: [0.2 0.3 0.4]
Word: 'pineapple', Integer representation: 4, Vector representation: [0.5 0.6 0.7]


However, instead of handcrafting these vectors, let a neural network find them for us. In this notebook, just as [done before](https://github.com/Nkluge-correa/TeenyTinyCastle/blob/master/ML-Explainability/NLP/model_maker.ipynb), we will create a language model for sentiment analysis. However, we only want to train the embedding layer, so do not pay attention to the rest of the architecture (it is merely a husk). As our learning signal, we will use the same dataset we used to train our first language models ([`sentiment-analysis-dataset`](https://huggingface.co/datasets/AiresPucrs/sentiment-analysis)), available in English and Portuguese on the Hugging Face Hub! 🤗

In [None]:
!pip install datasets -q

from datasets import load_dataset

dataset = load_dataset('AiresPucrs/sentiment-analysis', split = 'train')

display(dataset)



Dataset({
    features: ['text', 'label'],
    num_rows: 85089
})

Now, before training our embeddings, we need to extract a vocabulary of our corpus. And we will do that by using the [`TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer from the Keras API.

In [None]:
import numpy as np
import tensorflow as tf

vocab_size = 10000 # Size of the vocabulary
sequence_length = 100 # Maximum sequence length to consider

# Create an instance of the `TextVectorization` layer
vectorization_layer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length
    )

# Fit the `TextVectorization` layer (which is basically a tokenizer) to the text
vectorization_layer.adapt(dataset['text'])

# Get the vocabulary out of the `TextVectorization` layer
embedding_vocabulary = vectorization_layer.get_vocabulary()

# Save the vocabulary for further use (the `with` block closes the file automatically)
with open('embedding-vocabulary.txt', 'w', encoding='utf-8') as fp:
    for word in embedding_vocabulary:
        fp.write(f"{word}\n")
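
To give an intuition for what a tokenizer like this does, here is a rough pure-Python approximation, not the actual Keras implementation. It assumes lowercasing, punctuation stripping, and whitespace splitting, with index 0 reserved for padding (`""`) and index 1 for unknown tokens (`"[UNK]"`). The helper names `standardize`, `build_vocabulary`, and `encode` are hypothetical.

```python
import re
from collections import Counter

def standardize(text):
    # Lowercase and strip punctuation (an approximation of the default behavior).
    return re.sub(r"[^\w\s]", "", text.lower())

def build_vocabulary(texts, max_tokens):
    # Reserved entries first, then the most frequent tokens of the corpus.
    counts = Counter()
    for text in texts:
        counts.update(standardize(text).split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocabulary, sequence_length):
    # Map tokens to indices (1 for unknown tokens), then truncate or pad with 0.
    index = {token: i for i, token in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

texts = ["The movie was great!", "The movie was bad.", "Great cast, great movie"]
toy_vocabulary = build_vocabulary(texts, max_tokens=6)
encoded = encode("The movie was awful", toy_vocabulary, sequence_length=6)

print(toy_vocabulary)
print(encoded)
```

The word "awful" never appears in the toy corpus, so it is mapped to index 1, and the two trailing zeros are padding.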

Now that we have a dataset and a tokenizer that can translate words to integers, we can begin training our model. To create an ML split, we will use the [`train_test_split`](https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/main_classes#datasets.Dataset.train_test_split) method to break down our corpus into training and test sets.



In [None]:
# Split the dataset
dataset = dataset.train_test_split(test_size=0.1)

# Vectorize the text samples and turn the labels into floats.
x_train = vectorization_layer(dataset['train']['text'])
y_train = np.array(dataset['train']['label']).astype(float)
x_val = vectorization_layer(dataset['test']['text'])
y_val = np.array(dataset['test']['label']).astype(float)

print('Training Inputs: ', x_train.shape)
print('Validation Inputs: ', x_val.shape)

Training Inputs:  (76580, 100)
Validation Inputs:  (8509, 100)


For the purposes of this notebook, we will create a simple model with a 16-dimensional embedding. This means that every word will be associated with a vector of size 16. Since we have a vocabulary with 10000 words, we have an embedding layer (matrix) of shape (10000, 16).

In [None]:
# Dimensionality of the embedding layer
embed_size = 16

# Layers of this neural network are defined as functions via the Functional API
inputs = tf.keras.Input(shape=(None,), dtype="int32", name='input')
x = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embed_size,
                              input_length=sequence_length,
                              name='embedding')(inputs)

x = tf.keras.layers.GlobalAveragePooling1D(name='globa_average')(x)
x = tf.keras.layers.Dense(embed_size, activation="relu", name='dense')(x)

# The output layer is a single unbounded neuron (it outputs a logit)
outputs = tf.keras.layers.Dense(1, name='output')(x)
model = tf.keras.Model(inputs, outputs)

model._name="Embedding-16"

model.compile(loss=tf.losses.BinaryCrossentropy(from_logits = True),
              optimizer='adam',
              metrics=['accuracy'])

display(model.summary())

print(f"Shape of the Embedding matrix: {model.layers[1].weights[0].shape}")

Model: "Embedding-16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input (InputLayer)          [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 16)          160000    
                                                                 
 globa_average (GlobalAvera  (None, 16)                0         
 gePooling1D)                                                    
                                                                 
 dense (Dense)               (None, 16)                272       
                                                                 
 output (Dense)              (None, 1)                 17        
                                                                 
Total params: 160289 (626.13 KB)
Trainable params: 160289 (626.13 KB)
Non-trainable params: 0 (0.00 Byte)
______________

None

Shape of the Embedding matrix: (10000, 16)
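
The parameter counts in the summary above can be reproduced by hand. The short sketch below redoes the arithmetic, assuming each parameter is stored as a 32-bit float (4 bytes), which is consistent with the reported 626.13 KB.

```python
vocab_size = 10000
embed_size = 16

# Embedding layer: one 16-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embed_size

# Hidden dense layer: a 16x16 weight matrix plus 16 biases.
dense_params = embed_size * embed_size + embed_size

# Output layer: 16 weights plus 1 bias.
output_params = embed_size * 1 + 1

total_params = embedding_params + dense_params + output_params
size_kb = total_params * 4 / 1024  # assuming 4 bytes per parameter

print(embedding_params, dense_params, output_params, total_params)
print(round(size_kb, 2))
```

Almost all of the parameters sit in the embedding matrix, which is why the rest of the architecture can be treated as a husk.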


Since our model only has a small number of units to form the latent dimension, it will quickly overfit. Using [`EarlyStopping`](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping), even with a max of 20 epochs, should stop the training under 10 epochs if we set the patience to something like 5 or 6.
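
To see how patience interacts with the epoch budget, here is a simplified simulation of the early-stopping rule. It is a sketch, not the Keras implementation: it ignores options such as `min_delta` and `baseline`, and the validation losses are invented for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            # No improvement: stop once we have waited `patience` epochs.
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical losses: the model improves for 3 epochs, then starts to overfit.
val_losses = [0.60, 0.45, 0.40, 0.41, 0.43, 0.44, 0.46, 0.47, 0.50, 0.52]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=5)

print(stop_epoch, best_epoch)
```

With these made-up numbers, training stops at epoch 8 and the best weights come from epoch 3, which is the role `restore_best_weights=True` plays in the real callback.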


In [None]:
callbacks = [
    # Define how to save the model
    tf.keras.callbacks.ModelCheckpoint("embedding-model-16.keras",
                                                save_best_only=True),
    # Define when to stop the training
    tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                                patience=5,
                                                verbose=1,
                                                mode="auto",
                                                baseline=None,
                                                restore_best_weights=True)
    ]

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

model.fit(x_train,
          y_train,
          epochs=20,
          validation_data=(x_val, y_val),
          callbacks=callbacks,
          verbose=1)

The next cell downloads the already trained language model. You can skip it if you trained the model in the cell above.

In [None]:
!pip install huggingface_hub -q

from huggingface_hub import hf_hub_download

# Download the model
hf_hub_download(repo_id="AiresPucrs/embedding-model-16",
                filename="embedding-model-16.keras",
                local_dir="./",
                repo_type="model"
                )

# Download the embedding vocabulary txt file
hf_hub_download(repo_id="AiresPucrs/embedding-model-16",
                filename="embedding-vocabulary.txt",
                local_dir="./",
                repo_type="model"
                )

Now, let us retrieve our embedding layer. We will also use our vocabulary to create a lookup table (a dictionary) where the keys will be the words in our vocabulary and the values will be their vector representations.

In [None]:
import numpy as np
import tensorflow as tf

# Load the model
model = tf.keras.models.load_model('embedding-model-16.keras')

# Load the vocabulary (the `with` block closes the file automatically)
with open('embedding-vocabulary.txt', encoding='utf-8') as fp:
    embedding_vocabulary = [line.strip() for line in fp]

# Get the embedding matrix back
embeddings = model.get_layer('embedding').get_weights()[0]

# Create a lookup table for the words and their respective vector representations
words_embeddings = {}

# Iterate through the vocabulary, pairing each word with its row in the embedding matrix
for i, word in enumerate(embedding_vocabulary):
    # Skip index 0 (""), since it is just the padding token.
    if i == 0:
        continue
    words_embeddings[word] = embeddings[i]

print("Embeddings Dimensions: ", np.array(list(words_embeddings.values())).shape)
print("Vocabulary Size: ", len(words_embeddings.keys()))

Embeddings Dimensions:  (9999, 16)
Vocabulary Size:  9999


Now, one way to explore the relationship between these embedding vectors is to measure how similar they are. **Cosine similarity** is a similarity measure between two non-zero vectors of an inner product space. It measures the cosine of the angle between the two vectors and returns a value between -1 and 1, where 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.

Here, we can use cosine similarity to compare word embeddings in natural language processing. By calculating the cosine similarity between two word embeddings, we can measure how similar the two words are in meaning and context.

The equation for cosine similarity is:

$$\text{Cosine Similarity}(u, v) = \frac{u \cdot v}{\lVert u \rVert \lVert v \rVert} = \cos(\theta)$$

Where:

- $u$ and $v$ are the two vectors being compared.
- $\cdot$ represents the dot product operation.
- $\lVert u \rVert$ and $\lVert v \rVert$ are the magnitudes of the two vectors.
- $\theta$ is the angle between them.

To calculate the cosine similarity between two word embeddings, we plug the embeddings in as the vectors $u$ and $v$ in the above equation.
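
As a quick sanity check of the formula, the sketch below evaluates it on three toy pairs of vectors (chosen for illustration) that cover the extreme cases: same direction, orthogonal, and opposite. The helper name `cosine` is hypothetical.

```python
import numpy as np

def cosine(u, v):
    # Dot product divided by the product of the vector magnitudes.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 0.0])

same_direction = cosine(u, 3 * u)                   # scaling does not change the angle
orthogonal = cosine(u, np.array([-2.0, 1.0, 0.0]))  # dot product is zero
opposite = cosine(u, -u)                            # angle of 180 degrees

print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0, and -1. Note that the first pair has different magnitudes: cosine similarity only looks at direction.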

In [None]:
from numpy.linalg import norm

def cosine_similarity(word1, word2, dictionary):
    """
    Computes the cosine similarity between two given words.

    Parameters:
    -----------
    word1, word2 : str
        The two words to be compared.
    dictionary : dict
        A dictionary with words as keys and their corresponding embeddings as values.

    Returns:
    --------
    The cosine similarity score (float).
    """
    return np.dot(dictionary[word1], dictionary[word2])/(norm(dictionary[word1])*norm(dictionary[word2]))

cos = cosine_similarity("wonderful", "horrible",words_embeddings)
print(f"""Cosine Similarity between 'wonderful' with 'horrible': {cos}""")

cos = cosine_similarity("good", "bad",words_embeddings)
print(f"""Cosine Similarity between 'good' with 'bad': {cos}""")

Cosine Similarity between 'wonderful' with 'horrible': -0.9910120368003845
Cosine Similarity between 'good' with 'bad': -0.7805653810501099


As we can see, the word embeddings for **"wonderful"** and **"horrible"**, just like the embeddings for **"good"** and **"bad"**, point in almost opposite directions (as they should for this sentiment-focused embedding layer). Meanwhile, similar adjectives should show a high positive value (and they do!).

In [None]:
cos = cosine_similarity("good", "beautiful",words_embeddings)
print(f"""Cosine Similarity between 'good' with 'beautiful': {cos}""")

Cosine Similarity between 'good' with 'beautiful': 0.8989092707633972


Now, let us create a function to get the most similar word embeddings according to the cosine similarity measure.

In [None]:
import numpy as np
import pandas as pd
from numpy.linalg import norm
from IPython.display import Markdown

def compute_cosine_table(string, dictionary,
                         vocabulary):
    """
    Computes the cosine similarity between a given word and all other words in a dictionary.

    Parameters:
    -----------
    string : str
        The word to compare against.
    dictionary : dict
        A dictionary with words as keys and their corresponding embeddings as values.
    vocabulary : list
        A list of words in the dictionary.

    Returns:
    --------
    A pandas DataFrame with the closest matches to the input word and their
    corresponding similarity scores. The index of the DataFrame is set
    to the closest matches.
    """

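    # Copy the vocabulary, drop the query word, and skip index 0 (the padding token "")
    candidates = vocabulary.copy()
    candidates.remove(string)
    candidates = candidates[1:]

    query = dictionary[string]
    cos = [np.dot(query, dictionary[word]) / (norm(query) * norm(dictionary[word]))
           for word in candidates]

    # Sort ascending: the head holds the most different words, the tail the most similar
    return pd.DataFrame({"Closest Match": candidates, "Similarity Score": cos})\
        .sort_values("Similarity Score", ascending=True)\
        .set_index('Closest Match')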

df = compute_cosine_table("horrible",
        words_embeddings,
        embedding_vocabulary)

print("Cosine Similarity (most different word embeddings:)")
display(Markdown(df.head(5).to_markdown()))

print("Cosine Similarity (most similar word embeddings:)")
display(Markdown(df.tail(5).to_markdown()))

Cosine Similarity (most different word embeddings:)


| Closest Match   |   Similarity Score |
|:----------------|-------------------:|
| devito          |          -0.996515 |
| fears           |          -0.995742 |
| funniest        |          -0.995613 |
| unique          |          -0.995567 |
| appreciate      |          -0.995236 |

Cosine Similarity (most similar word embeddings:)


| Closest Match   |   Similarity Score |
|:----------------|-------------------:|
| stinker         |           0.997925 |
| talentless      |           0.998165 |
| unwatchable     |           0.998458 |
| worst           |           0.998905 |
| terrible        |           0.998943 |

**"Terrible"** is the word embedding most similar to **"horrible"**, which makes sense, given that this model's embeddings were learned by an optimizer trying to find parameters that could differentiate negative from positive comments.

However, the reason why **"devito"** is as antagonistic to **"horrible"** as **"funniest"** is open to interpretation (e.g., maybe the IMDB portion of our dataset is biased toward liking Danny DeVito).
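
The similarity table above is computed with a Python loop over the vocabulary. As a sketch of an alternative, the same kind of ranking can be obtained with a single matrix product once every row is normalized to unit length. The function `most_similar` and the toy words and vectors below are hypothetical, chosen only to keep the example self-contained.

```python
import numpy as np

def most_similar(query, words, matrix, top_k=3):
    """Rank `words` by cosine similarity to `query`; `matrix[i]` is the vector of `words[i]`."""
    # Normalize each row so that a dot product equals the cosine similarity.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = unit @ unit[words.index(query)]
    # Sort from most to least similar and drop the query word itself.
    order = np.argsort(-scores)
    return [(words[i], float(scores[i])) for i in order if words[i] != query][:top_k]

toy_words = ["good", "great", "bad", "awful"]
toy_matrix = np.array([[1.0, 0.1], [0.9, 0.2], [-1.0, 0.1], [-0.9, -0.1]])

ranking = most_similar("good", toy_words, toy_matrix, top_k=2)
print(ranking)
```

With the objects from this notebook, `words` could be `list(words_embeddings.keys())` and `matrix` could be `np.array(list(words_embeddings.values()))`.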

Another way to explore these similarities is through visual representations. For this, we will create a 3D projection of our embedding space. This allows us to visualize the embedding space and the proximity between word embeddings in a geometrical sense. Even though cosine similarity is a measure of (as the name says) similarity and not of distance, we can project our 16-dimensional space onto a 3D space where distances between points can be inspected visually. We can achieve this by using [`t-SNE`](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html), something we used before in [this notebook](https://github.com/Nkluge-correa/TeenyTinyCastle/blob/master/ML-Intro-Course/8_Fashion_MNIST.ipynb).

Let us first turn our word embeddings and vocabulary into a data frame.


In [None]:
# `words_embeddings` already excludes index 0 (the padding token), so every row here is a real vocabulary entry.
df = pd.DataFrame(np.array(list(words_embeddings.values())),
                  columns=[f'embedding_{i}' for i in range(embeddings.shape[1])],
                  index=list(words_embeddings.keys()))

display(df.head(10))

| Word | embedding_0 | embedding_1 | embedding_2 | embedding_3 | embedding_4 | embedding_5 | embedding_6 | embedding_7 | embedding_8 | embedding_9 | embedding_10 | embedding_11 | embedding_12 | embedding_13 | embedding_14 | embedding_15 |
|:-----|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| [UNK] | -0.010852 | -0.016661 | -0.101499 | -0.035356 | 0.036173 | -0.036556 | 0.034414 | 0.019626 | 0.111396 | -0.062554 | -0.268549 | -0.037651 | -0.11064 | 0.057772 | 0.019258 | 0.09291 |
| the | -0.009815 | -0.038481 | -0.038047 | 0.009836 | -0.002357 | 1.6e-05 | -0.007963 | -0.010238 | 0.142858 | -0.063763 | -0.394119 | -0.05874 | -0.111451 | -0.041375 | -0.037418 | 0.124044 |
| and | -0.057627 | -0.042199 | -0.088547 | -0.002199 | -0.106559 | -0.047287 | -0.068991 | 0.004988 | 0.060553 | -0.043085 | -0.480912 | -0.090603 | -0.103486 | 0.019498 | 0.031873 | 0.182327 |
| a | 0.053509 | 0.013593 | -0.087281 | 0.007861 | 0.022452 | -0.09036 | -0.029521 | 0.023927 | 0.089998 | -0.066138 | -0.382833 | -0.065598 | -0.052045 | 0.027191 | -0.026402 | 0.079612 |
| to | 0.045605 | -0.03627 | -0.022603 | -0.025769 | -0.011813 | -0.070337 | -0.019752 | -0.022252 | 0.101837 | -0.021529 | -0.382733 | 0.003982 | -0.027381 | 0.017267 | -0.026032 | 0.102821 |
| of | -0.018664 | -0.019801 | -0.134763 | 0.057086 | 0.024884 | 0.000384 | -0.026023 | 0.064676 | 0.104811 | -0.05723 | -0.291771 | 0.006383 | -0.028979 | 0.004829 | -0.049015 | 0.042955 |
| is | -0.002275 | -0.007564 | -0.092545 | 0.009008 | -0.032594 | 0.023053 | 0.00339 | 0.006978 | 0.07358 | -0.063714 | -0.384378 | -0.00486 | -0.094561 | -0.004869 | -0.004495 | 0.087992 |
| in | 0.001678 | -0.032698 | -0.144504 | -0.044736 | -0.049127 | 0.012565 | -0.032565 | 0.021187 | 0.057413 | -0.060246 | -0.321496 | -0.036165 | -0.048146 | -0.011753 | 0.016123 | 0.036803 |
| i | -0.031512 | 0.040961 | -0.046268 | -0.026744 | -0.024982 | -0.025752 | 0.015338 | -0.03261 | 0.071521 | -0.000712 | -0.293674 | -0.026488 | -0.070524 | 0.002184 | -0.018233 | 0.074001 |
| it | -0.023752 | -0.023493 | -0.016742 | -0.050926 | -0.067329 | 0.049322 | -0.063472 | 0.09592 | 0.020589 | -0.025841 | -0.454201 | -0.064458 | -0.087016 | -0.045967 | -0.009504 | 0.13623 |


Now, let us implement **t-SNE** for dimensionality reduction. But before doing that, we need to keep some things in mind.

t-SNE ([t-distributed Stochastic Neighbor Embedding](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding)) is optimized using gradient descent and is sensitive to the choice of hyperparameters. For example, the right perplexity and number of iterations depend on the specific dataset and the goals of the analysis. Still, here are some general guidelines that can help you choose appropriate values:

- **Perplexity** controls the balance between local and global aspects of the data. Values between 5 and 50 are often used for small to medium datasets. Larger datasets may need a higher value, typically between 50 and 100, to capture global structure. However, a perplexity that is too high may wash out local structure.

- The number of **iterations** determines how much computation and time the t-SNE optimization takes. Between 1000 and 5000 iterations is often sufficient for small to medium datasets to reach a stable embedding, while larger datasets may need more. Stopping too early is the bigger risk: the projection may not have converged yet, so it is worth checking that the layout no longer changes when more iterations are added.

> Note: To learn how to "[Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/)", we recommend this publication.
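
To get a feel for these hyperparameters before running the full computation, the sketch below projects a tiny synthetic dataset with two different perplexity values. The two Gaussian clusters and the names `toy_points` and `projections` are invented for illustration, and the number of iterations is left at scikit-learn's default.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Two well-separated toy clusters of 30 points each in 16 dimensions.
toy_points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(30, 16)),
    rng.normal(loc=5.0, scale=0.5, size=(30, 16)),
])

projections = {}
for perplexity in (5, 30):
    # Perplexity must be smaller than the number of samples (60 here).
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    projections[perplexity] = tsne.fit_transform(toy_points)
    print(perplexity, projections[perplexity].shape, float(tsne.kl_divergence_))
```

Plotting both projections side by side is a cheap way to see how much the layout depends on perplexity before committing to a long run on the full vocabulary.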

Get ready for this computation, which might take a while...

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=3, verbose=1, perplexity=40, n_iter=2000)
tsne_results = tsne.fit_transform(df.values)

[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 9999 samples in 0.001s...
[t-SNE] Computed neighbors for 9999 samples in 0.802s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9999
[t-SNE] Computed conditional probabilities for sample 2000 / 9999
[t-SNE] Computed conditional probabilities for sample 3000 / 9999
[t-SNE] Computed conditional probabilities for sample 4000 / 9999
[t-SNE] Computed conditional probabilities for sample 5000 / 9999
[t-SNE] Computed conditional probabilities for sample 6000 / 9999
[t-SNE] Computed conditional probabilities for sample 7000 / 9999
[t-SNE] Computed conditional probabilities for sample 8000 / 9999
[t-SNE] Computed conditional probabilities for sample 9000 / 9999
[t-SNE] Computed conditional probabilities for sample 9999 / 9999
[t-SNE] Mean sigma: 0.047531
[t-SNE] KL divergence after 250 iterations with early exaggeration: 78.496521
[t-SNE] KL divergence after 1750 iterations: 2.128672


You can use the cell below to plot these results as a 3D scatter plot.

In [None]:
import plotly.express as px

fig = px.scatter_3d(
    tsne_results, x=0, y=1, z=2, color=df.index,
    labels={'0': 't-SNE 1', '1': 't-SNE 2', '2': 't-SNE 3'}
)

fig.update_layout(
    template='ggplot2',
    title=f'Word Embeddings in 3D'
                  )

fig.show()

To measure the distance between two vectors, you can use the [`numpy.linalg.norm`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html) function. It computes the norm (length) of a vector, so applying it to the difference $a - b$ gives the Euclidean distance between the vectors $a$ and $b$.
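
As a small worked example, with toy vectors `a` and `b` chosen for illustration, the sketch below computes the distance both through the norm of the difference vector and by writing out the square root of the summed squared differences.

```python
import numpy as np
from numpy.linalg import norm

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# The norm of the difference vector is the Euclidean distance between a and b.
distance = norm(a - b)

# Same result written out explicitly: sqrt((-3)^2 + (-4)^2 + 0^2).
manual = np.sqrt(np.sum((a - b) ** 2))

print(distance, manual)  # 5.0 5.0
```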


In [None]:
from numpy.linalg import norm

tsne_df = pd.DataFrame(tsne_results, index=df.index)

def calculate_embedding_distance(word1, word2, df):
    """
    Computes the Euclidean distance between the vectors of two given words.

    Parameters:
    -----------
    word1, word2 : str
        The two words to be compared.
    df : pandas.DataFrame
        A pandas.DataFrame where the index holds the words and the
        columns hold the components of their vectors.

    Returns:
    --------
    The Euclidean distance (float).
    """
    return norm(df.loc[word1].values - df.loc[word2].values)

distance = calculate_embedding_distance("good", "bad", tsne_df)
print(f"""Distance from "good" to "bad": {distance}""")

distance = calculate_embedding_distance("wonderfully", "horrible", tsne_df)
print(f"""Distance from "wonderfully" to "horrible": {distance}""")

distance = calculate_embedding_distance("terrible", "horrible", tsne_df)
print(f"""Distance from "terrible" to "horrible": {distance}""")

distance = calculate_embedding_distance("terrible", "bad", tsne_df)
print(f"""Distance from "terrible" to "bad": {distance}""")

Distance from "good" to "bad": 98.20780181884766
Distance from "wonderfully" to "horrible": 117.50151062011719
Distance from "terrible" to "horrible": 4.233308792114258
Distance from "terrible" to "bad": 29.911943435668945


Putting together everything we learned thus far, we can start building clusters of similar words. For example, we can use cosine similarity to find which words are most similar to "terrible".

Let us create a list with the 30 most similar (and the 30 most different) word embeddings relative to the "terrible" word embedding.

In [None]:
df_cosine_similarity = compute_cosine_table("terrible", words_embeddings, embedding_vocabulary)

word_list = []

for word in list(df_cosine_similarity.head(30).index):
    word_list.append(word)

for word in list(df_cosine_similarity.tail(30).index):
    word_list.append(word)

Now, we will select only the 3D projections of these words (a.k.a. the "terrible" similarity cluster).

In [None]:
def select_word_projections(word_list, df):
    """
    Selects only the vectors associated with the given words in a
    pandas.DataFrame.

    Parameters:
    -----------
    word_list : list
        A list of words to be selected.
    df : pandas.DataFrame
        A pandas.DataFrame where the index holds the words and the
        columns hold the components of their vectors.

    Returns:
    --------
    A pandas.DataFrame with the "X", "Y", and "Z"
    components of the selected words.
    """
    tsne_df = df.copy().reset_index()
    arr = np.array([0, 0, 0])

    for word in word_list:
        index = tsne_df.loc[tsne_df['index'] == word].index[0]
        arr = np.vstack((arr, tsne_df.drop('index', axis=1).values[index, :]))

    arr = np.delete(arr, (0), axis=0)

    result_df = pd.DataFrame(arr, index=word_list, columns=['X', 'Y', 'Z'])

    return result_df

cluster = select_word_projections(word_list, tsne_df)

display(cluster.head(10))
display(cluster.tail(10))


| Word          |          X |         Y |         Z |
|:--------------|-----------:|----------:|----------:|
| thank         | -70.664452 | -2.004013 | 13.265845 |
| devito        | -76.898949 | -5.869706 |  4.459921 |
| funniest      | -76.293297 | -4.913204 | 13.592978 |
| fears         | -76.167618 |  6.471135 | -6.288749 |
| superbly      | -75.346542 | -4.158202 |  6.171668 |
| unforgettable | -73.322281 |  1.381741 | -6.733866 |
| fortunately   | -73.753799 |  5.193653 | -7.617327 |
| thanks        | -72.030792 | -1.305513 | 13.535428 |
| unique        |  -75.21376 |  3.367316 | -6.043014 |
| appreciate    | -75.496796 |  3.537618 | -4.359625 |


| Word          |         X |         Y |          Z |
|:--------------|----------:|----------:|-----------:|
| uninteresting | 32.075939 | 44.290012 |  -4.929207 |
| insulting     | 31.510633 | 41.974789 |   10.89453 |
| unwatchable   | 32.308258 | 44.451508 |  -6.943865 |
| trite         | 31.373411 | 41.739697 |    6.89088 |
| uninspired    | 33.901012 | 43.864555 |  -5.942081 |
| stinker       | 32.371578 | 42.836239 |  -5.633059 |
| waste         | 33.274792 | 40.719261 | -11.087099 |
| worst         | 34.355034 | 37.856327 |  -8.249153 |
| horrible      | 31.988035 |  42.31868 |  -9.424381 |
| awful         | 33.691402 | 39.472031 |  -9.870434 |


Now, let us visualize these word embedding projections in 3D space.

In [None]:
import plotly.express as px

fig = px.scatter_3d(
    cluster, x="X", y="Y", z="Z", color=cluster.index,
    labels={0: 'X', 1: 'Y', 2: 'Z'}
)

fig.update_layout(
    template='ggplot2',
    title=f'<b>The "<i>terrible</i>" cluster</b>',
    )

fig.show()


We now understand which tokens will push a classification more to one side than the other. In general, exploring embeddings and embedding layers can provide valuable insights into how language models represent language, which can then be used to improve the interpretability and explainability of such models.

This same type of analysis can be used to explore models whose outputs are embeddings themselves, like [`SentenceTransformers`](https://www.sbert.net/), which are models used to generate raw embeddings that serve different types of downstream applications (e.g., semantic search). Let us perform a test using all the tools we learned and some open-source models and datasets.

> **Note: To learn more about the use of text embeddings and the SentenceTransformer framework, we recommend "[Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084)."**

First, let us download samples of text that can be easily separated into classes: the AG News dataset, built from AG's corpus of more than 1 million news articles.

In [64]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("ag_news", split="train")

# Convert the dataset into a pandas.dataframe for easy manipulation
dataset = dataset.to_pandas()

# Let us select only 100 samples from each class and change the integer
# value of the labels by their string counterparts.
world_news = dataset[dataset['label'] == 0].head(100).reset_index(drop=True)
world_news['label'] = "World News"
sports_news = dataset[dataset['label'] == 1].head(100).reset_index(drop=True)
sports_news['label'] = "Sports News"
business_news = dataset[dataset['label'] == 2].head(100).reset_index(drop=True)
business_news['label'] = "Business News"
sci_news = dataset[dataset['label'] == 3].head(100).reset_index(drop=True)
sci_news['label'] = "Science News"

df = pd.concat([world_news, sports_news, business_news, sci_news]).reset_index(drop=True)

display(df)

|     | text | label |
|----:|:-----|:------|
| 0   | Venezuelans Vote Early in Referendum on Chavez... | World News |
| 1   | S.Koreans Clash with Police on Iraq Troop Disp... | World News |
| 2   | Palestinians in Israeli Jails Start Hunger Str... | World News |
| 3   | Seven Georgian soldiers wounded as South Osset... | World News |
| 4   | Rwandan Troops Arrive in Darfur (AP) AP - Doze... | World News |
| ... | ... | ... |
| 395 | Saudis: Bin Laden associate surrenders \\"(CNN... | Science News |
| 396 | Mozilla Exceptions (mexception) \\For some rea... | Science News |
| 397 | Ron Regan Jr is My Kinda Guy \\"Now that the c... | Science News |
| 398 | Al Qaeda member surrenders \\"RIYADH, Saudi Ar... | Science News |


Now, let us embed all of our samples into a 384-dimensional dense vector using the [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) SentenceTransformer, available on the Hub! 🤗

In [65]:
!pip install -U sentence-transformers -q

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Embed every sample and append them to this list
embeddings = []

for sentence in df.text:
  embedding = model.encode(sentence)
  embeddings.append(embedding)

df['embeddings'] = embeddings



Now that we have embeddings to represent our sentences, all the previous techniques apply, such as using cosine similarity to compare instances:

In [60]:
def compare_embeddings(df, class1, class2):
    """
    Print the cosine similarity between random embeddings of two classes.

    Parameters:
    df (DataFrame): DataFrame containing 'label' and 'embeddings' columns.
    class1 (str): Name of the first class.
    class2 (str): Name of the second class.

    Returns:
    None. The cosine similarity is printed, not returned.
    """

    # Get random embeddings for class1 and class2
    A = df.loc[df['label'] == class1, 'embeddings'].sample(n=1).iloc[0]
    B = df.loc[df['label'] == class2, 'embeddings'].sample(n=1).iloc[0]

    # Calculate cosine similarity
    cos_similarity = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

    # Print the result
    print(f"Similarity from a random sample of class '{class1}' and '{class2}': {cos_similarity}")

compare_embeddings(df, "World News", "World News")
compare_embeddings(df, "World News", "Sports News")
compare_embeddings(df, "World News", "Business News")
compare_embeddings(df, "World News", "Science News")

Similarity from a random sample of class 'World News' and 'World News': 0.02805277518928051
Similarity from a random sample of class 'World News' and 'Sports News': -0.011308781802654266
Similarity from a random sample of class 'World News' and 'Business News': 0.11066465079784393
Similarity from a random sample of class 'World News' and 'Science News': -0.0026667313650250435
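
A single random pair is a noisy estimate, so the numbers above will change from run to run. One possible refinement, sketched here under stated assumptions, is to average the cosine similarity over all pairs of samples from two classes. The function `mean_cosine_similarity` is hypothetical, and the synthetic arrays below merely stand in for the real sentence embeddings.

```python
import numpy as np

def mean_cosine_similarity(A, B):
    """Average cosine similarity over all pairs (a, b), where rows of A and B are vectors."""
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    # Entry (i, j) of the product is the cosine similarity between A[i] and B[j].
    return float((A_unit @ B_unit.T).mean())

rng = np.random.default_rng(0)

# Synthetic stand-ins for two classes: noisy copies of two different directions.
class_a = np.array([1.0, 0.0, 0.0]) + 0.1 * rng.normal(size=(50, 3))
class_b = np.array([0.0, 1.0, 0.0]) + 0.1 * rng.normal(size=(50, 3))

within = mean_cosine_similarity(class_a, class_a)   # includes each vector paired with itself
between = mean_cosine_similarity(class_a, class_b)

print(within, between)
```

With the data frame from this notebook, the inputs could be built with `np.vstack(df.loc[df['label'] == "World News", 'embeddings'])` and the equivalent expression for another class.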


We can also build a 3D projection of this embedding space, which helps visualize how these vectors can help us group/separate data.

In [66]:
tsne = TSNE(n_components=3, verbose=1, perplexity=100, n_iter=2000)
tsne_results = tsne.fit_transform(np.vstack(df.embeddings)) # `np.vstack` stacks the arrays vertically to create a 2D array

fig = px.scatter_3d(
    tsne_results, x=0, y=1, z=2, color=df.label,
    labels={0: 'X', 1: 'Y', 2: 'Z'}
)

fig.update_layout(
    template='ggplot2',
    title=f'<b>The "<i>AG News</i>" cluster</b>',
    )

fig.show()

[t-SNE] Computing 301 nearest neighbors...
[t-SNE] Indexed 400 samples in 0.000s...
[t-SNE] Computed neighbors for 400 samples in 0.024s...
[t-SNE] Computed conditional probabilities for sample 400 / 400
[t-SNE] Mean sigma: 0.419129
[t-SNE] KL divergence after 100 iterations with early exaggeration: 45.606083
[t-SNE] KL divergence after 500 iterations: 0.709099


We could now restart our investigation by examining the embeddings of our SentenceTransformer (i.e., not the output embeddings, but the embeddings of the model itself).

```python
from transformers import AutoTokenizer, AutoModel

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

vocabulary = tokenizer.get_vocab() # 30522 tokens
embeddings = model.embeddings.word_embeddings.weight # shape: torch.Size([30522, 384])

# Explore the embeddings of all-MiniLM-L6-v2 ...

```

You can definitely do this as an exercise (explore the embeddings of open-source models). This type of research can help improve our understanding of such models, and it was on a similar basis that "[Glitch Tokens](https://www.alignmentforum.org/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation)" (like "SolidGoldMagikarp") were discovered.

---

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).
