# Exploring Language Models through `Word Embeddings`

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

`Word embeddings` are a type of representation that captures the meaning and context of words in a numerical form. They are used extensively in `natural language processing` tasks such as language modeling, sentiment analysis, and machine translation. Word embeddings are typically learned from large text corpora using unsupervised learning techniques such as `word2vec`. However, embeddings can also be trained as just another layer in a neural network. The resulting word embeddings are high-dimensional vectors representing each word in the vocabulary.

These vectors are designed so that words with similar meanings and contexts have similar vector representations, while words with different meanings and contexts have dissimilar representations.

This property makes `word embeddings` a powerful tool for many NLP tasks, as they can be used to find semantic relationships between words, cluster similar words together, and even perform arithmetic operations such as word analogies. Exploring them can also help us better understand how language models "_understand_" language.

![word-embeddings](https://lena-voita.github.io/resources/lectures/word_emb/lookup_table.gif)
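As the animation above suggests, an embedding layer can be thought of as a lookup table: each token index selects one row of a weight matrix. Here is a minimal NumPy sketch of that idea, assuming a made-up five-token vocabulary and randomly initialized vectors (none of these names or values come from the model trained in this notebook):

```python
import numpy as np

# Hypothetical toy vocabulary (index 0 is reserved for padding).
vocabulary = ["", "[UNK]", "good", "bad", "movie"]
token_to_index = {token: i for i, token in enumerate(vocabulary)}

# Randomly initialized embedding matrix: one 4-dimensional row per token.
rng = np.random.default_rng(seed=0)
embedding_matrix = rng.normal(size=(len(vocabulary), 4))

# Looking up a sentence is just indexing rows of the matrix.
sentence = ["good", "movie"]
indices = [token_to_index.get(token, token_to_index["[UNK]"]) for token in sentence]
sentence_vectors = embedding_matrix[indices]

print(indices)                 # token ids
print(sentence_vectors.shape)  # (number of tokens, embedding size)
```

Training an embedding layer means adjusting the rows of this matrix so that they become useful for the task at hand.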

In this notebook, we create a language model for sentiment analysis using the `Keras` API and `TensorFlow`. We only want to train the embedding layer, so do not pay attention to the rest of the architecture (it is merely a husk).

We will be using a dataset that was put together by combining several datasets for sentiment classification available on [Kaggle](https://www.kaggle.com/):

- The `IMDB 50K` [dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv): _50K movie reviews for natural language processing or text analytics._
- The `Twitter US Airline Sentiment` [dataset](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment): _originated from [Crowdflower's Data for Everyone library](http://www.crowdflower.com/data-for-everyone)._
- Our `google_play_apps_review` dataset: _built using the `google_play_scraper` in [this notebook](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/64d0693c28786ce42149411bec8b3b42520fc4df/ML%20Explainability/NLP%20Interpreter%20(en)/scrape(en).ipynb)._
- The `EcoPreprocessed` [dataset](https://www.kaggle.com/datasets/pradeeshprabhakar/preprocessed-dataset-sentiment-analysis): _scraped Amazon product reviews_.

The final result is the `sentiment_analysis_dataset.csv` file, available for download at [this link](https://drive.google.com/uc?export=download&id=1_ijhnVLHddM7Cm3R3vfqBB-svw6iNfpv). It is also available in [Portuguese](https://drive.google.com/uc?export=download&id=1YCIzGqcdlHSy-GvghRp0U5USUhuOVEE3)!

Both versions (English and Portuguese) already come preprocessed. The `cleaning` function we used is this:

```python

import re
from unidecode import unidecode

def custom_standardization(input_data):
    clean_text = input_data.lower().replace("<br />", " ")
    clean_text = re.sub(r"[-()\"#/@;:<>{}=~|.?,]", ' ', clean_text)
    clean_text = re.sub(' +', ' ', clean_text)
    return unidecode(clean_text)

```
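To see what these steps do to a review, here is a small sketch that repeats the same lowercase, `<br />`, punctuation, and whitespace steps on a made-up sentence. It assumes ASCII-only input, so the final `unidecode` call is left out; `basic_cleaning` and `sample` are illustrative names, not part of the original pipeline.

```python
import re

def basic_cleaning(text):
    # Same steps as `custom_standardization`, minus the `unidecode` call
    # (assumption: the input contains only ASCII characters).
    clean_text = text.lower().replace("<br />", " ")
    clean_text = re.sub(r"[-()\"#/@;:<>{}=~|.?,]", " ", clean_text)
    clean_text = re.sub(" +", " ", clean_text)
    return clean_text

sample = "Great movie!<br />Loved it... (10/10)"
cleaned = basic_cleaning(sample)
print(repr(cleaned))  # 'great movie! loved it 10 10 '
```

Note that characters outside the regular expression, such as `!` and apostrophes, are kept, and that a trailing space can remain because the function does not strip the result.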

In [1]:
import pandas as pd
import urllib.request

# PT-BR https://drive.google.com/uc?export=download&id=1YCIzGqcdlHSy-GvghRp0U5USUhuOVEE3
# EN https://drive.google.com/uc?export=download&id=1_ijhnVLHddM7Cm3R3vfqBB-svw6iNfpv

urllib.request.urlretrieve(
    'https://drive.google.com/uc?export=download&id=1_ijhnVLHddM7Cm3R3vfqBB-svw6iNfpv', 
    'sentiment_analysis_dataset_en.csv'
)

df = pd.read_csv('sentiment_analysis_dataset_en.csv')

display(df.head(10))

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,1
1,a wonderful little production the filming tech...,1
2,i thought this was a wonderful way to spend ti...,1
3,basically there's a family where a little boy ...,0
4,petter mattei's love in the time of money is a...,1
5,probably my all time favorite movie a story of...,1
6,i sure would like to see a resurrection of a u...,1
7,this show was an amazing fresh & innovative id...,0
8,encouraged by the positive comments about this...,0
9,if you like original gut wrenching laughter yo...,1


Before training our `embeddings`, we need to extract a vocabulary from our corpus. We will do that using the `TextVectorization` layer from the [`Keras`](https://keras.io/) API. `TextVectorization` is a text preprocessing layer in `Keras` that converts text into a numerical format that can be used as input to a machine learning model.

We will use `word tokenization` for the purposes of this notebook (interpreting the embedding layer of a language model). However, there are many other possible tokenization schemes (e.g., `bi-gram tokenization`). Below are some common types of tokenization used in natural language processing:

- `Word tokenization`: This involves breaking a text into words or tokens, where words are defined as sequences of characters separated by spaces or punctuation marks.
- `Character tokenization`: This involves breaking a text into individual characters. This approach can be useful when dealing with languages that do not use spaces to separate words.
- `Subword tokenization`: This involves breaking words into smaller units, called subwords or subword units, which can be used to build up words in a language. This approach is commonly used in neural machine translation.
- `Byte pair encoding`: This subword tokenization method starts from individual characters (or bytes) and repeatedly merges the most frequent pair of adjacent symbols into a new subword unit.
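The schemes above can be sketched in a few lines of plain Python. The snippet below is only an illustration on a made-up three-word text: it shows word tokenization, character tokenization, and a single merge step in the spirit of byte pair encoding (real tokenizers repeat the merge many times over a large corpus). The helper `merge_pair` is a hypothetical name.

```python
import re
from collections import Counter

text = "low lower lowest"

# Word tokenization: keep runs of word characters.
word_tokens = re.findall(r"\w+", text.lower())

# Character tokenization: every character of a word becomes a token.
char_tokens = list(word_tokens[0])

# One byte-pair-encoding merge step: count adjacent symbol pairs over all
# words and merge the most frequent pair into a single subword.
words_as_symbols = [tuple(word) for word in word_tokens]
pair_counts = Counter()
for symbols in words_as_symbols:
    pair_counts.update(zip(symbols, symbols[1:]))
best_pair = max(pair_counts, key=pair_counts.get)

def merge_pair(symbols, pair):
    # Replace every occurrence of `pair` in `symbols` by the merged symbol.
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

merged_words = [merge_pair(symbols, best_pair) for symbols in words_as_symbols]
print(word_tokens, char_tokens, best_pair, merged_words)
```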

In the cell below, we create our `TextVectorization` layer and adapt it to our text corpus, setting the vocabulary size to 10,000 words. Afterward, we use `train_test_split` to break down our corpus into training and validation sets (no test set here because we only care about training the `embedding layer`).

In [2]:
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split


vocab_size = 10000
sequence_length = 100

vectorization_layer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length
    )

vectorization_layer.adapt(df.review)
english_embedding_vocabulary = vectorization_layer.get_vocabulary()

# Save the vocabulary to disk, one token per line.
with open(r'models/english_embedding_vocabulary.txt', 'w', encoding='utf-8') as fp:
    for word in english_embedding_vocabulary:
        fp.write("%s\n" % word)

x_train, x_val, y_train, y_val = train_test_split(
    df.review, df.sentiment, test_size=0.1, random_state=42)

x_train = vectorization_layer(x_train)
y_train = np.array(y_train).astype(float)
x_val = vectorization_layer(x_val)
y_val = np.array(y_val).astype(float)

print('Training Inputs: ', x_train.shape)
print('Validation Inputs: ', x_val.shape)

Training Inputs:  (76580, 100)
Validation Inputs:  (8509, 100)
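The shapes above show that every review became a row of exactly 100 integers. As a rough mental model of what the vectorization step does, here is a simplified pure-Python imitation (not the actual `TextVectorization` implementation): it builds a frequency-ranked vocabulary, reserves index 0 for padding and index 1 for unknown words, and pads or truncates each text to a fixed length. The function names and the toy corpus are assumptions made for illustration.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    # Index 0 is the padding token and index 1 the out-of-vocabulary token;
    # the remaining slots go to the most frequent words.
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocabulary, sequence_length):
    index = {word: i for i, word in enumerate(vocabulary)}
    ids = [index.get(word, 1) for word in text.split()]
    ids = ids[:sequence_length]                      # truncate long texts
    return ids + [0] * (sequence_length - len(ids))  # pad short texts with 0

toy_corpus = ["a good movie", "a bad movie", "a movie"]
toy_vocabulary = build_vocabulary(toy_corpus, max_tokens=5)
encoded = vectorize("a really good movie", toy_vocabulary, sequence_length=6)

print(toy_vocabulary)  # ['', '[UNK]', 'a', 'movie', 'good']
print(encoded)         # [2, 1, 4, 3, 0, 0]
```

With `max_tokens=5`, the word "bad" does not make it into the vocabulary, and the unseen word "really" is mapped to the `[UNK]` index.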


For the purposes of this notebook, we will create a simple model with a 16-dimensional embedding.

> Note: in Keras/TensorFlow, you can name the layers of your model (and the model itself) as you prefer. This is helpful if you need to retrieve a certain portion of your model later.

In [3]:
embed_size = 16

inputs = tf.keras.Input(shape=(None,), dtype="int32", name='input')
x = tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=embed_size,
                              input_length=sequence_length,
                              name='embedding')(inputs)

x = tf.keras.layers.GlobalAveragePooling1D(name='globa_average')(x)
x = tf.keras.layers.Dense(embed_size, activation="relu", name='dense')(x)

outputs = tf.keras.layers.Dense(1, name='output')(x)

# Name the model through the constructor instead of the private `_name` attribute.
model_16 = tf.keras.Model(inputs, outputs, name="EnglishEmbedding16")

model_16.compile(loss=tf.losses.BinaryCrossentropy(from_logits = True),
              optimizer='adam',
              metrics=['accuracy'])

display(model_16.summary())

Model: "EnglishEmbedding16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input (InputLayer)          [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 16)          160000    
                                                                 
 globa_average (GlobalAverag  (None, 16)               0         
 ePooling1D)                                                     
                                                                 
 dense (Dense)               (None, 16)                272       
                                                                 
 output (Dense)              (None, 1)                 17        
                                                                 
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
__________________________________________

None
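The parameter counts in the summary above can be checked by hand: the embedding layer stores one 16-dimensional vector per vocabulary entry, and each dense layer has one weight per input-output pair plus one bias per output. A short arithmetic sketch:

```python
vocab_size, embed_size = 10000, 16

embedding_params = vocab_size * embed_size           # one vector per token
dense_params = embed_size * embed_size + embed_size  # weights + biases
output_params = embed_size * 1 + 1                   # weights + bias
total_params = embedding_params + dense_params + output_params

print(embedding_params, dense_params, output_params, total_params)
# 160000 272 17 160289
```

Almost all of the model's parameters live in the embedding layer, which is the part we care about here.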

We showed how to plot the history of your model's training using the `history` dictionary provided by the `fit` function in this [earlier notebook](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/55dec4959b0121ab7a441a4448c25dcf66d085da/ML%20Intro%20Course/8_Fashion_MNIST.ipynb). However, you can achieve the same thing with only two lines of code and the `TensorBoard` package. You can log all these results by setting a `logs` folder and instructing your model to save the relevant info to this folder via a Keras `callback`.

In [4]:
%load_ext tensorboard
log_folder = 'logs'

Since our model only has a small number of units to form the latent dimension, it will quickly overfit. Using `EarlyStopping`, even with a max of 20 epochs, should stop the training under 10 epochs if we set the patience to something like 5 or 6.

In [5]:
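callbacks = [
    # Keep only the best model seen during training.
    tf.keras.callbacks.ModelCheckpoint("models/english_embedding_vocabulary_16.keras",
                                       save_best_only=True),
    # Write the training logs to the folder defined above.
    tf.keras.callbacks.TensorBoard(log_dir=log_folder),
    # Stop when the validation loss has not improved for 5 epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                     patience=5,
                                     verbose=1,
                                     mode="auto",
                                     baseline=None,
                                     restore_best_weights=True),
]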

print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

model_16.fit(x_train,
          y_train,
          epochs=20,
          validation_data=(x_val, y_val),
          callbacks=callbacks,
          verbose=1)

Version:  2.10.1
Eager mode:  True
GPU is available
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 7: early stopping


<keras.callbacks.History at 0x1aee066cb50>
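The log above ends at epoch 7 even though we asked for 20 epochs. The sketch below imitates, in a simplified way, the patience logic behind early stopping. The validation losses are invented for illustration (they are not the values from this run), and `early_stopping_epoch` is a hypothetical helper, not a Keras function.

```python
def early_stopping_epoch(val_losses, patience):
    """Returns (epoch where training stops, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs elapsed since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses: the best value appears at epoch 2.
toy_val_losses = [0.50, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47]
stopped_at, best_at = early_stopping_epoch(toy_val_losses, patience=5)
print(stopped_at, best_at)  # 7 2
```

With a patience of 5, a model whose best validation loss occurs at epoch 2 stops at epoch 7, and `restore_best_weights=True` then brings back the weights from the best epoch.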

Let us now visualize our training with the logs from `TensorBoard`.

In [None]:
%tensorboard --logdir logs

If you do not want to train the models, you can load the trained versions in the cell below. But first, you need to download them (instructions are in the `models` folder).

As said before, `word embeddings` are a powerful technique used in natural language processing to represent words or tokens in a high-dimensional space. By mapping each token to a point in this space, token embeddings provide a way to translate the meaning of words into a geometric representation. 

In this representation, tokens with similar meanings are clustered together, while tokens with different meanings are far apart. This allows language models to capture the relationships between words in a geometric sense. The high dimensionality of the space also allows for complex relationships to be represented, such as the relationships between synonyms and antonyms and even social constructs like gender. 

Below, we will load our saved model and its vocabulary and retrieve the trained word embeddings of our `embedding layer`. We will use our saved vocabulary and these vectors to create a dictionary of `{"words":"embeddings"}`.

In [7]:
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model('models/english_embedding_vocabulary_16.keras')

with open('models/english_embedding_vocabulary.txt', encoding='utf-8') as fp:
    english_embedding_vocabulary = [line.strip() for line in fp]

embeddings = model.get_layer('embedding').get_weights()[0]

words_embeddings = {}

# Map each word of the vocabulary to its embedding vector.
for i, word in enumerate(english_embedding_vocabulary):
    # Skip token 0 (""), since it is just the padding token.
    if i == 0:
        continue
    words_embeddings[word] = embeddings[i]

print("Embeddings Dimensions: ", np.array(list(words_embeddings.values())).shape)
print("Vocabulary Size: ", len(words_embeddings.keys()))

Embeddings Dimensions:  (9999, 16)
Vocabulary Size:  9999


One way to explore the relationship of these embedding vectors is by how _similar_ they are.

Cosine similarity is a similarity measure between two non-zero vectors of an inner product space. It is the cosine of the angle between the two vectors and ranges from -1 to 1, where 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.

We can use cosine similarity to compare word embeddings in natural language processing. By calculating the cosine similarity between two word embeddings, we can measure how similar the two words are in meaning and context.

The equation for cosine similarity is:

$$\text{cosine\_similarity}(u, v) = \frac{u \cdot v}{\lVert u \rVert \lVert v \rVert} = \cos(\theta)$$


Where:

- $u$ and $v$ are the two vectors being compared.
- $\cdot$ represents the dot product operation. 
- $\lVert u \rVert$ and $\lVert v \rVert$ are the magnitudes of the two vectors.
- $\theta$ is the angle between them.

To calculate the cosine similarity between two word embeddings, we simply plug the embeddings in as the vectors $u$ and $v$ in the above equation.
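Before applying the formula to real embeddings, it can help to check the three reference values (1, 0, and -1) on small hand-picked vectors. The vectors below are arbitrary 2D examples chosen for illustration, and `cosine` is a throwaway helper name:

```python
import numpy as np

def cosine(u, v):
    # Dot product divided by the product of the vector magnitudes.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0])
same_direction = cosine(u, np.array([2.0, 4.0]))   # u scaled by 2
orthogonal = cosine(u, np.array([-2.0, 1.0]))      # perpendicular to u
opposite = cosine(u, np.array([-1.0, -2.0]))       # u flipped

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Note that scaling a vector does not change its cosine similarity to other vectors: only the direction matters, not the magnitude.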

In [8]:
from numpy.linalg import norm

def cosine_similarity(word1, word2, dictionary):
    """
    Computes the cosine similarity between two given words.

    Parameters:
    -----------
    word1, word2 : str
        The two words to be compared.
    dictionary : dict
        A dictionary with words as keys and their corresponding embeddings as values.

    Returns:
    --------
    The cosine similarity score (float).
    """
    return np.dot(dictionary[word1], dictionary[word2])/(norm(dictionary[word1])*norm(dictionary[word2]))

cos = cosine_similarity("wonderful", "horrible",words_embeddings) 
print(f"""Cosine Similarity between 'wonderful' with 'horrible': {cos}""")

cos = cosine_similarity("good", "bad",words_embeddings) 
print(f"""Cosine Similarity between 'good' with 'bad': {cos}""")

Cosine Similarity between 'wonderful' with 'horrible': -0.9784237146377563
Cosine Similarity between 'good' with 'bad': -0.9407784342765808


The word embeddings for "wonderful" and "horrible", just like those for "good" and "bad", point in almost opposite directions (as they should).

Similar adjectives, on the other hand (something really important for sentiment analysis models), should show a high positive similarity (and they do!).

In [9]:
cos = cosine_similarity("good", "beautiful",words_embeddings) 
print(f"""Cosine Similarity between 'good' with 'beautiful': {cos}""")

Cosine Similarity between 'good' with 'beautiful': 0.9373324513435364


Now, let us create a function to get the most similar word embeddings according to the cosine similarity measure. Ordering the vocabulary by similarity also gives us the most "_different_" word embeddings.
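Ranking a vocabulary by similarity amounts to computing the cosine similarity between one vector and every row of the embedding matrix. As a side note, this can be done with a single matrix product in NumPy. The sketch below uses a made-up three-word matrix (not the trained embeddings) and a hypothetical helper name, `cosine_to_all`; the cell that follows does the same with an explicit loop over the notebook's dictionary.

```python
import numpy as np

def cosine_to_all(query_index, matrix):
    # Normalize every row to unit length; the dot product between unit
    # vectors is their cosine similarity. Assumes no row is all zeros.
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return unit_rows @ unit_rows[query_index]

# Made-up 2D "embeddings" for three words.
toy_words = ["great", "nice", "awful"]
toy_matrix = np.array([[1.0, 1.0],
                       [2.0, 1.5],
                       [-1.0, -0.5]])

scores = cosine_to_all(0, toy_matrix)                 # similarity to "great"
ranking = [toy_words[i] for i in np.argsort(scores)]  # ascending order
print(ranking)  # ['awful', 'nice', 'great']
```

Sorting in ascending order puts the most different word first and the query word itself (similarity 1) last.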

In [10]:
import numpy as np
import pandas as pd
from numpy.linalg import norm
from IPython.display import Markdown 

def compute_cosine_table(string, dictionary, 
                         vocabulary):
    """
    Computes the cosine similarity between a given word and all other words in a dictionary.
    
    Parameters:
    -----------
    string : str
        The word to compare against.
    dictionary : dict
        A dictionary with words as keys and their corresponding embeddings as values.
    vocabulary : list
        A list of words in the dictionary.
    
    Returns:
    --------
    A pandas DataFrame with the closest matches to the input word and their 
    corresponding similarity scores. The index of the DataFrame is set 
    to the closest matches.
    """

    l = vocabulary.copy()
    l.remove(string)

    cos = []
    for word in l[1::]:

        cosine = np.dot(dictionary[string],
                dictionary[word])/(norm(dictionary[string])*norm(dictionary[word]))
        cos.append(cosine)

    return pd.DataFrame({"Closest Match": l[1::],f"Similarity Score": cos})\
        .sort_values(f"Similarity Score", ascending=True)\
        .set_index('Closest Match')

df = compute_cosine_table("horrible", 
        words_embeddings, 
        english_embedding_vocabulary)

print("Cosine Similarity (most different word embeddings:)")
display(Markdown(df.head(5).to_markdown()))

print("Cosine Similarity (most similar word embeddings:)")
display(Markdown(df.tail(5).to_markdown()))

Cosine Similarity (most different word embeddings:)


| Closest Match   |   Similarity Score |
|:----------------|-------------------:|
| pleasure        |          -0.993557 |
| tear            |          -0.992812 |
| thank           |          -0.992007 |
| pleasant        |          -0.991167 |
| rewarded        |          -0.990919 |

Cosine Similarity (most similar word embeddings:)


| Closest Match   |   Similarity Score |
|:----------------|-------------------:|
| waste           |           0.996396 |
| hrs             |           0.99642  |
| sucks           |           0.996422 |
| hoping          |           0.996789 |
| terrible        |           0.997209 |

"_terrible_" is the word embedding most similar to "_horrible_", which makes sense. However, the reason why "_devito_" is as antagonistic to "_horrible_" as "_funniest_" is open to interpretation (e.g., maybe the IMDB portion of our dataset is biased toward liking Danny DeVito).

TensorFlow makes available a tool for projecting `word embeddings` into a 3D space. You can use their [word projector](https://projector.tensorflow.org/) by saving your vocabulary and word embeddings into special files called `metadata.tsv` and `vectors.tsv`. Give it a try if you want to.
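A possible way to produce these two files is sketched below: `vectors.tsv` holds one embedding per line with tab-separated components, and `metadata.tsv` holds one label per line in the same order. The helper `save_for_projector`, the toy words, and the toy vectors are assumptions made for illustration; the files are written to a temporary folder.

```python
import os
import tempfile
import numpy as np

def save_for_projector(words, vectors, folder):
    # vectors.tsv: one embedding per line, components separated by tabs.
    np.savetxt(os.path.join(folder, "vectors.tsv"), vectors, delimiter="\t")
    # metadata.tsv: one label per line, in the same order as the vectors.
    with open(os.path.join(folder, "metadata.tsv"), "w", encoding="utf-8") as fp:
        for word in words:
            fp.write(f"{word}\n")

# Made-up words and vectors, just to show the file layout.
toy_words = ["good", "bad"]
toy_vectors = np.array([[0.1, 0.2, 0.3], [-0.1, -0.2, -0.3]])

output_folder = tempfile.mkdtemp()
save_for_projector(toy_words, toy_vectors, output_folder)
print(sorted(os.listdir(output_folder)))  # ['metadata.tsv', 'vectors.tsv']
```

With the objects from this notebook, the inputs would be `list(words_embeddings.keys())` and `np.array(list(words_embeddings.values()))`.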

Below, we save these files in the `TensorBoard` "logs" folder and then use them to start the `tensorboard projector`, which is integrated into `TensorBoard`.

In [11]:
from tensorboard.plugins import projector

model = tf.keras.models.load_model('models/english_embedding_vocabulary_16.keras')

with open('models/english_embedding_vocabulary.txt', encoding='utf-8') as fp:
    english_embedding_vocabulary = [line.strip() for line in fp]

# Save the weights we want to analyze as a `tf.Variable`. Note that the first
# row represents the padding token, so we just skip it.
embeddings = tf.Variable(model.get_layer('embedding').get_weights()[0][1:])
checkpoint = tf.train.Checkpoint(embedding=embeddings)
checkpoint.save("logs/embedding.ckpt")

# Save the labels separately, one per line. Note that the first
# value represents the padding token, so we just skip it.
with open('logs/metadata.tsv', "w", encoding='utf-8') as fp:
    for word in english_embedding_vocabulary[1:]:
        fp.write("{}\n".format(word))

# Set up the configuration settings to start the projector.
config = projector.ProjectorConfig()
embedding = config.embeddings.add()

# The tensor name is the checkpoint key of the saved variable, and the
# metadata path is relative to the log directory.
embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
embedding.metadata_path = 'metadata.tsv'
projector.visualize_embeddings("logs", config)

Now, just start `TensorBoard`, choose the "logs" folder as the log directory, and select the `Projector` dashboard.

In [None]:
%tensorboard --logdir logs

But we can also create our own projections! 

Let us create a 3D projection of our embedding space. This can allow us to visualize the embedding space, and the proximity between word embeddings, in a geometrical sense.

Cosine similarity is, as the name says, a measure of similarity, not of distance. However, we can project our 16-dimensional space onto a 3D space where distances can be inspected visually. We can achieve this by using `t-SNE`, something we briefly explained in [this notebook](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/55dec4959b0121ab7a441a4448c25dcf66d085da/ML%20Intro%20Course/8_Fashion_MNIST.ipynb).

Let us first turn our word embeddings and vocabulary into a DataFrame.

In [20]:
# `words_embeddings` already excludes the padding token. Here we also skip its
# first entry, the out-of-vocabulary token ("[UNK]").
words = list(words_embeddings.keys())[1:]
vectors = np.array([words_embeddings[word] for word in words])

df = pd.DataFrame(vectors,
                  columns=[f'embedding_{i}' for i in range(vectors.shape[1])],
                  index=words)

display(df.head(10))

Unnamed: 0,embedding_0,embedding_1,embedding_2,embedding_3,embedding_4,embedding_5,embedding_6,embedding_7,embedding_8,embedding_9,embedding_10,embedding_11,embedding_12,embedding_13,embedding_14,embedding_15
the,-0.076551,-0.014924,0.067767,0.007883,-0.016938,0.268177,0.03446,-0.048748,0.036562,-0.015586,-0.021275,-0.058763,-0.019428,-0.112663,-0.064345,0.112612
and,-0.029695,0.041913,0.070794,-0.018414,0.0352,0.250137,-0.021369,-0.022792,0.000395,0.018333,0.013924,-0.025006,-0.108997,-0.071116,-0.037261,0.122586
a,-0.093505,0.011508,-0.039603,-0.004146,-0.004097,0.314304,-0.002159,-0.057122,-0.028662,0.023549,0.064215,-0.035054,0.002804,-0.086173,-0.019819,0.081942
to,-0.09033,0.012954,0.038419,0.05603,-0.042045,0.376835,0.024688,-0.058342,-0.048245,0.085957,0.003867,-0.02727,-0.081651,-0.115291,-0.002087,0.116711
of,-0.08283,0.011401,0.049799,-0.015487,-0.04042,0.154868,0.02133,-0.010035,-0.012892,0.026445,0.030382,-0.038912,-0.073732,-0.038778,-0.050055,0.037295
is,-0.032904,0.06092,0.040417,0.04345,0.016395,0.314284,0.028085,-0.065494,0.032696,-0.010763,0.026772,0.000486,-0.010015,-0.095048,-0.04895,0.041197
in,-0.04931,0.071329,0.041701,0.030343,0.019761,0.264715,0.020251,-0.091105,0.005416,0.059387,0.001725,-0.036669,-0.066259,-0.089162,-0.033258,0.056866
i,0.01059,-0.022202,0.000624,-0.035277,-0.008554,0.165885,0.006468,-0.073313,0.029857,-0.016918,0.01882,-0.066553,-0.091927,-0.053967,-0.126327,0.076479
it,0.034613,0.026286,0.045939,-0.014202,0.018118,0.255664,-0.037551,-0.037671,0.047684,-0.013476,0.076358,-0.117027,-0.120247,-0.097833,-0.074516,0.054467
this,-0.063159,-0.029197,-0.038168,0.108564,-0.039919,0.372734,-0.000651,0.033397,-0.051223,0.040571,-0.001527,0.005195,-0.002434,-0.08687,-0.042902,0.064033


Now, let us implement `t-SNE` for dimensionality reduction. But before doing that, we need to keep some things in mind.

`t-SNE` (t-distributed Stochastic Neighbor Embedding) is optimized using gradient descent and is sensitive to the choice of hyperparameters. For example, appropriate values for the perplexity and the number of iterations depend on the specific dataset and the goals of the analysis. Still, here are some general guidelines that can help you choose them:

### `Perplexity`

- `Perplexity` controls the balance between local and global aspects of the data. A `perplexity` value of $5$ to $50$ is often used for small to medium datasets.
- A higher `perplexity` value may be needed for larger datasets to capture global structure. However, a `perplexity` value that is too high may result in the loss of local structure. Generally, a `perplexity` value of $50$ to $100$ is often used for larger datasets.
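In the original `t-SNE` formulation, the perplexity of a probability distribution is defined as $2^{H}$, where $H$ is its Shannon entropy in bits, and it can be read as a smooth measure of the effective number of neighbors each point considers. A small sketch of that definition (the helper name `perplexity` and the example distributions are illustrative assumptions):

```python
import numpy as np

def perplexity(probabilities):
    # Perplexity = 2 ** H(P), with H the Shannon entropy in bits.
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]  # terms with zero probability contribute nothing
    entropy = -np.sum(p * np.log2(p))
    return float(2.0 ** entropy)

# A point that weighs 8 neighbors equally has a perplexity of 8.
uniform_over_8 = perplexity(np.full(8, 1 / 8))

# A point dominated by a single neighbor has a perplexity close to 1.
peaked = perplexity([0.97, 0.01, 0.01, 0.01])

print(round(uniform_over_8, 3), round(peaked, 3))
```

This is why low perplexity values emphasize local structure (few effective neighbors), while high values make each point take a broader neighborhood into account.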

### `Iterations`

- The number of `iterations` determines the amount of computation and time needed to optimize the `t-SNE` algorithm.
- For small to medium datasets, a number of `iterations` between $1000$ and $5000$ is often sufficient to obtain a stable embedding.
- A higher number of `iterations` may be needed for larger datasets to obtain a stable embedding. However, a very high number of `iterations` may result in overfitting, where the embedding captures noise in the data instead of the underlying structure.

You can learn how to "_[Use t-SNE Effectively](https://distill.pub/2016/misread-tsne/)_" with this publication.

In [29]:
from sklearn.manifold import TSNE
import plotly.express as px
import plotly.offline as py


tsne = TSNE(n_components=3, verbose=1, perplexity=40, n_iter=2000)
tsne_results = tsne.fit_transform(df.values)


fig = px.scatter_3d(
    tsne_results, x=0, y=1, z=2, color=df.index,
    labels={'0': 't-SNE 1', '1': 't-SNE 2', '2': 't-SNE 3'}
)
fig.update_layout(template='plotly_white',
                  title=f'Word Embeddings in 3D',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')

fig.show()
py.plot(fig, filename='Word Embeddings in 3D.html', auto_open=False)

[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Indexed 9999 samples in 0.000s...
[t-SNE] Computed neighbors for 9999 samples in 0.290s...
[t-SNE] Computed conditional probabilities for sample 1000 / 9999
[t-SNE] Computed conditional probabilities for sample 2000 / 9999
[t-SNE] Computed conditional probabilities for sample 3000 / 9999
[t-SNE] Computed conditional probabilities for sample 4000 / 9999
[t-SNE] Computed conditional probabilities for sample 5000 / 9999
[t-SNE] Computed conditional probabilities for sample 6000 / 9999
[t-SNE] Computed conditional probabilities for sample 7000 / 9999
[t-SNE] Computed conditional probabilities for sample 8000 / 9999
[t-SNE] Computed conditional probabilities for sample 9000 / 9999
[t-SNE] Computed conditional probabilities for sample 9999 / 9999
[t-SNE] Mean sigma: 0.047531
[t-SNE] KL divergence after 250 iterations with early exaggeration: 78.511284
[t-SNE] KL divergence after 1800 iterations: 2.132993


'Word Embeddings in 3D.html'

To measure the distance between two vectors $a$ and $b$, you can use the `numpy.linalg.norm` function: applied to the difference $a - b$, it returns the Euclidean distance between them.

The equation for measuring the Euclidean distance between two 3D vectors can be expressed as follows:

$$\text{distance} = \| \mathbf{a} - \mathbf{b} \| = \sqrt{(a_1-b_1)^2 + (a_2-b_2)^2 + (a_3-b_3)^2}$$

where:

- $\mathbf{a}$ and $\mathbf{b}$ are the two 3D vectors.
- $a_1, a_2, a_3$ and $b_1, b_2, b_3$ are their respective components.
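As a quick worked example, the snippet below evaluates this formula for two arbitrary 3D vectors (chosen so that the result is a round number) and checks that it matches `numpy.linalg.norm` applied to their difference:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

# Component-wise formula: sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = sqrt(9 + 16 + 0)
by_formula = np.sqrt(np.sum((a - b) ** 2))

# Same result using the norm of the difference vector.
by_norm = np.linalg.norm(a - b)

print(by_formula, by_norm)  # 5.0 5.0
```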

In [37]:
from numpy.linalg import norm

tsne_df = pd.DataFrame(tsne_results, index=df.index)

def calculate_embedding_distance(word1, word2, df):
    """
    Computes the Euclidean distance between the vectors of two given words.

    Parameters:
    -----------
    word1, word2 : str
        The two words to be compared.
    df : pandas.DataFrame
        A DataFrame whose index contains words and whose
        columns are the components of each vector.

    Returns:
    --------
    The Euclidean distance (float).
    """
    return norm(df.loc[word1].values - df.loc[word2].values)

distance = calculate_embedding_distance("good", "bad", tsne_df)
print(f"""Distance from "good" to "bad": {distance}""")

distance = calculate_embedding_distance("wonderfully", "horrible", tsne_df)
print(f"""Distance from "wonderfully" to "horrible": {distance}""")

distance = calculate_embedding_distance("terrible", "horrible", tsne_df)
print(f"""Distance from "terrible" to "horrible": {distance}""")

distance = calculate_embedding_distance("terrible", "bad", tsne_df)
print(f"""Distance from "terrible" to "bad": {distance}""")

Distance from "good" to "bad":  114.304016
Distance from "wonderfully" to "horrible":  104.82702
Distance from "terrible" to "horrible":  4.1755075
Distance from "terrible" to "bad":  34.340137


Putting together everything we learned thus far, we can start creating similar sub-clusters. For example, we can use cosine similarity to compute what words are more similar to "devito".

Let us create a list with the 30 most similar (and different) word embeddings of the "devito" word embedding.

In [38]:
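# `compute_cosine_table` expects the word, the embeddings dictionary,
# and the vocabulary list (in this order).
df_cosine_similarity = compute_cosine_table("devito",
                                            words_embeddings,
                                            english_embedding_vocabulary)

# The table is sorted in ascending order of similarity, so the head holds
# the 30 most different words and the tail holds the 30 most similar ones.
word_list = list(df_cosine_similarity.head(30).index) + \
    list(df_cosine_similarity.tail(30).index)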

Now, we will select only the 3D projections of these words (a.k.a. the "_devito_" similarity cluster).

In [41]:

def select_word_projections(word_list, df):
    """
    Selects the 3D projections of a given list of words.

    Parameters:
    -----------
    word_list : list
        A list of words to be selected.
    df : pandas.DataFrame
        A DataFrame whose index contains words and whose
        columns are the components of each projected vector.

    Returns:
    --------
    A pandas.DataFrame with the "X", "Y", and "Z"
    components of the selected words.
    """
    # Use the DataFrame passed as an argument (not the global `tsne_df`)
    # and select the rows in the same order as `word_list`.
    selected = df.loc[word_list].copy()
    selected.columns = ["X", "Y", "Z"]
    return selected

cluster = select_word_projections(word_list, tsne_df)

display(cluster.head(10))

Unnamed: 0,X,Y,Z
journey,-77.105576,9.452806,-6.690478
worries,-74.205162,-7.504502,12.889542
appreciate,-77.210968,4.283458,0.652169
wax,-76.62101,4.872957,-2.49561
funniest,-74.65374,-6.989667,17.09643
bourne,-77.230721,-5.43428,8.964151
thanks,-70.037895,-3.553062,16.745853
edge,-78.159035,4.993025,-3.280926
fortunately,-76.363831,6.520238,-2.631117
devito,-76.851402,-6.721216,7.851744


Now, let us visualize these word embedding projections in 3D space.

In [43]:

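# Make sure the three components are named "X", "Y", and "Z", so that
# Plotly can find them by column name.
plot_df = cluster.set_axis(["X", "Y", "Z"], axis=1)

fig = px.scatter_3d(
    plot_df, x="X", y="Y", z="Z", color=plot_df.index
)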
fig.update_layout(template='plotly_dark',
                  title=f'<b>The "<i>devito</i>" Similarity Cluster</b>',
                  paper_bgcolor='rgba(0, 0, 0, 0)',
                  plot_bgcolor='rgba(0, 0, 0, 0)')

fig.show()

In summary, the exploration of `word embeddings` and `embedding layers` can provide valuable insights into how language models are representing language. These insights can be used to improve the interpretability and explainability of such models, as well as to identify potential biases or limitations in them.

There are other ways we can use `word embeddings` and `embedding layers` besides semantic similarity analysis. For example:

- `Embedding layers` can be used in deep learning models to extract features from text inputs. By analyzing the weights associated with each feature, researchers can identify which words or phrases are most important for predicting a particular outcome. This is something we have already done in our `integrated gradients` [notebook](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/55dec4959b0121ab7a441a4448c25dcf66d085da/ML%20Explainability/NLP%20Interpreter/integrated_gradients_in%20_keras_nlp.ipynb).
- Researchers can modify the embeddings of individual words or phrases and observe how the model's predictions change. This is one of the ways to adversarially attack language models, something we explored in this [notebook](https://github.com/Nkluge-correa/teeny-tiny_castle/blob/55dec4959b0121ab7a441a4448c25dcf66d085da/ML%20Adversarial/adversarial_text_attack.ipynb).

Another good example of how this type of research can help improve our understanding of such models is the recent discovery of "_[Glitch Tokens](https://www.alignmentforum.org/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation)_" (like "_SolidGoldMagikarp_") that provoke unpredictable behavior in some of the most powerful language models ever created.

---

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).