# Word2Vec: visualisation

In this notebook, we show how Word2Vec vectors can be visualised.

**Disclaimer:** Some of the steps in this notebook are more advanced than what we have covered so far.

We will start by importing `matplotlib`, a library for creating visualisations in Python.

We will also import the `PCA` class from the `sklearn` library. PCA stands for **Principal Component Analysis**, one of several methods for 'dimensionality reduction'. It projects each word vector (which generally has 100, 200 or more 'dimensions') onto only the first few principal 'components', so that we obtain lower-dimensional data while preserving as much of the data's variation as possible.
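
To make the idea concrete, here is a minimal sketch on made-up data rather than on real word vectors. The names `toy_vectors`, `toy_pca` and `toy_2d` are hypothetical, and the random numbers only stand in for embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for word vectors: 50 "words" with 100 dimensions each.
rng = np.random.default_rng(seed=0)
toy_vectors = rng.normal(size=(50, 100))

# Keep only the first two principal components.
toy_pca = PCA(n_components=2)
toy_2d = toy_pca.fit_transform(toy_vectors)

print(toy_vectors.shape)  # (50, 100)
print(toy_2d.shape)       # (50, 2)

# Fraction of the total variance captured by each of the two components.
print(toy_pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute gives a rough idea of how much of the original variation survives the projection. With only two components it is often a small share, which is worth keeping in mind when reading a 2D plot.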

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

We will also import two modules we're already familiar with:

In [None]:
from gensim.models import KeyedVectors
import pandas as pd

Load the embeddings of a model:

In [None]:
our_vectors = KeyedVectors.load_word2vec_format("models/test-model-vectors.txt")

We first apply PCA to the whole vocabulary in our model and create a dataframe from it:

In [None]:
vocab = list(our_vectors.index_to_key)

X = our_vectors[vocab]

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# We're storing the transformed vectors into a dataframe:
df = pd.DataFrame(X_pca, index=vocab)

Let's have a quick look at how the vectors have been transformed through dimensionality reduction:

In [None]:
df.head(10)

As you can see, there are now only 2 dimensions ('columns'). Compare this with the number of dimensions of a vector before the transformation:

In [None]:
our_vectors['prisoner']

Now we are going to use the two dimensions of our new `df` as coordinates to visualise the vectors in a scatter plot.

In [None]:
tokens_to_plot = ["prisoner", "man", "london", "liverpool", "john", "king", "duchess", "naples", "greece", "henry", "inspector"]

Let's filter the dataframe to only those tokens we're interested in plotting:

**Warning!** The following cell will fail with a `KeyError` if `tokens_to_plot` contains a token that is not in the vocabulary of your model!

In [None]:
df2 = df.loc[tokens_to_plot] # subset of df only containing the terms in our tokens_to_plot list as keys/index
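
One way to avoid that failure is to check which tokens are in the index before subsetting. The following is a minimal sketch and not part of the workflow above: `toy_df`, `wanted`, `present` and `missing` are hypothetical names, and the coordinates are made up for illustration.

```python
import pandas as pd

# Hypothetical 2D coordinates standing in for the PCA dataframe.
toy_df = pd.DataFrame(
    [[0.1, 0.2], [0.3, -0.1], [-0.4, 0.5]],
    index=["prisoner", "man", "london"],
)

wanted = ["prisoner", "london", "atlantis"]  # "atlantis" is not in the index

# Split the requested tokens into those present in the index and those missing.
present = [t for t in wanted if t in toy_df.index]
missing = [t for t in wanted if t not in toy_df.index]

print("Skipping:", missing)

# Subsetting with only the tokens that exist cannot raise a KeyError.
toy_subset = toy_df.loc[present]
print(toy_subset)
```

The same pattern should carry over to `df` and `tokens_to_plot`, at the cost of silently dropping words unless you print the missing ones as done here.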

Let's plot the terms!

In [None]:
# our two dimensions to be used as coordinates
x = df2[0]
y = df2[1]

# just some size adjustment...
fig = plt.figure(figsize=(20, 15), dpi=80)

# plot x and y
plt.scatter(x, y)

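# add labels to the dots so we know what each of them refers to
# (df2 is indexed by token, so we pair each label with its coordinates
# instead of looking them up by integer position)
for txt, x_i, y_i in zip(df2.index, x, y):
    plt.annotate(txt, (x_i, y_i))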

# add a grid to help understand distances a bit better
plt.grid()

# show the plot
plt.show()

OK, this visualisation may not make a lot of sense.

However, here we are only plotting the embeddings of a very small test model, trained on too little data to capture meaningful relationships between the words.

✏️ **Exercise:**

Load a different model (a real model, not a test model) and plot some words:

In [None]:
# Write your code here:

✏️ **Exercise:**

Can you think how this could be used as a way to perform 'bias detection'? Experiment with a model of your choice.

In [None]:
# Write your code here: