# Word2Vec: using pre-trained models

## Background

Introduced in 2013, Word2Vec has had a major impact on natural language processing and its applications.

Word2Vec is a technique for computing **word vectors** (also called **word embeddings**).

**Word vectors** are numerical representations of words (i.e. words represented as sequences of numbers). These are created (trained, learned) from a corpus of texts. The set of word embeddings created from a corpus is called a model.

**Intuition:** if the algorithm can guess correctly which words fit a certain context, it has some understanding of their meanings. So the model encodes contexts of use. But can we say word vectors represent a word's meaning?
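To make "context" a little more concrete, here is a minimal sketch of how a sentence can be turned into (target, context) pairs with a sliding window. The function name `context_pairs`, the window size and the example sentence are made up for illustration; actual Word2Vec training involves considerably more than this.

```python
# Illustrative only: reading (target, context) pairs off a sentence.
# `context_pairs`, the window size and the sentence are assumptions for this sketch.
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "she ate a ripe apple".split()
pairs = context_pairs(sentence, window=1)
print(pairs)
```

With `window=1`, each word is paired only with its immediate neighbours; a larger window brings in more distant words as context.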

A word2vec model that can be shared and used out-of-the-box is called a **pre-trained model**.

## Loading and using a pretrained w2v model

We will use a popular library called Gensim to load Word2Vec models and explore vector spaces.

First, we import `KeyedVectors` from `gensim.models`: this is the class Gensim uses to store and query word vectors.

Documentation: https://radimrehurek.com/gensim/models/keyedvectors.html

In [None]:
from gensim.models import KeyedVectors

### Option 1: downloading pre-trained embeddings with Gensim

Gensim allows us to directly download and use some pre-trained models off-the-shelf ([see more info here](https://github.com/RaRe-Technologies/gensim-data#models)).

Some of these models are very large and can take a long time to download and load. For demonstration purposes, we will use a small set of 50-dimensional GloVe vectors, which Gensim loads in the same way as Word2Vec vectors.

We can import the `downloader` module from Gensim and use it as follows to download one of the available pre-trained models:

In [None]:
# Import the Gensim `downloader`:
from gensim import downloader

In [None]:
# Download pre-trained vectors (GloVe, 50 dimensions, trained on Wikipedia and Gigaword)
# and load them into a variable called `wiki_vectors`:
wiki_vectors = downloader.load("glove-wiki-gigaword-50")

In [None]:
# It's important to always know the data type:
type(wiki_vectors)

In [None]:
# KeyedVectors allow word vectors to be accessed as if from a dictionary:
print(wiki_vectors["apple"])

In [None]:
# The same works for any other word in the vocabulary:
print(wiki_vectors["pear"])

### Option 2: loading a model from elsewhere

Word2Vec models are trained and shared all the time, and most of them are not available through the Gensim downloader. In fact, not all of them are easy to find: sometimes it takes a bit of archaeology to discover whether something useful to you already exists and has been shared. For example, it's common for researchers to train their own models and upload them to Zenodo or another research repository, and some institutions have their own repositories.

For example, you can find word vectors in many different languages [here](https://fasttext.cc/docs/en/crawl-vectors.html#models):
> E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages. In Proceedings of the International Conference on Language Resources and Evaluation, 2018.

Pre-trained models often come as:
* either a vectors-only file (i.e. just a file containing the word and associated vector),
* or as full models (we will see this in the next notebook).

Most pretrained models are shared as a vectors-only file. Full models are generally kept only if the model may need to be updated at some point.
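To give an idea of what a vectors-only file can look like, here is a small sketch that writes and reads a made-up three-word file in the plain-text word2vec format (a header line with vocabulary size and vector size, then one word per line followed by its numbers). The file name, the values and the helper `read_vectors` are invented for illustration; in practice you would let Gensim do the parsing.

```python
import os
import tempfile

# A made-up, three-word "model" in the word2vec text format:
# header "<vocabulary size> <vector size>", then one word and its vector per line.
toy_file_content = """3 4
apple 0.1 0.2 0.3 0.4
pear 0.1 0.25 0.3 0.35
oxford -0.5 0.0 0.9 0.1
"""

def read_vectors(path):
    """Parse a word2vec-style text file into a dict of word -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        vocab_size, vector_size = (int(x) for x in f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != vector_size + 1:
                continue  # skip blank or malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    assert len(vectors) == vocab_size, "header and content disagree"
    return vectors

# Write the toy file to a temporary folder, then read it back:
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-vectors.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(toy_file_content)
    toy_vectors = read_vectors(path)

print(toy_vectors["apple"])
```

Opening a downloaded `.txt` model in a text editor (if it is not too large) is a quick way to check whether it follows this layout.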

To load a file with vectors only, we use the `KeyedVectors` class from `gensim.models`:

In [None]:
# We need to import KeyedVectors from gensim:
from gensim.models import KeyedVectors

As an example, we will use a Word2Vec model that was trained on a 4.2-billion-word corpus of 19th-century British newspapers.

These embeddings were trained to investigate semantic change in the lexicon of mechanization in 19th-century British newspapers, by Nilo Pedrazzini and Barbara McGillivray:
> Pedrazzini, Nilo & Barbara McGillivray. 2022. Diachronic word embeddings from 19th-century British newspapers [Data set]. Zenodo. DOI: https://doi.org/10.5281/zenodo.7181682

Download the embeddings from [here](https://zenodo.org/record/7181682). You only need one of the `.txt` files (the code below assumes `1860s-vectors.txt`). Put it in the `models/` folder, which should be inside `Sessions/` (create the `models` folder if needed).

You can then use the `.load_word2vec_format()` method to load the embeddings:

In [None]:
# Load the 1860s embeddings into a new variable `victorian_vectors`:
victorian_vectors = KeyedVectors.load_word2vec_format('models/1860s-vectors.txt')

In [None]:
# Check the data type of `victorian_vectors`:
print(type(victorian_vectors))

In [None]:
# Get the embedding for "apple":
print(victorian_vectors["apple"])

Let's take a moment to compare the "apple" embeddings from these two different models: they are completely different!

Word embeddings, on their own, are pretty useless. Also, they are not comparable between different models (unless you align them, but that's for another time!).

Word embeddings only make sense relative to each other within the same model.
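A toy illustration of one reason for this: the orientation of a vector space is arbitrary. In the sketch below (made-up 2-D vectors, hypothetical names), the second "model" is just the first one rotated by 90 degrees. Similarities within each model are identical, yet comparing the same word across the two models gives a meaningless result.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rotate(v, degrees):
    """Rotate a 2-D vector counter-clockwise by `degrees`."""
    t = math.radians(degrees)
    x, y = v
    return [x * math.cos(t) - y * math.sin(t), x * math.sin(t) + y * math.cos(t)]

# Made-up 2-D vectors standing in for one model's embeddings:
model_a = {"apple": [1.0, 0.2], "pear": [0.9, 0.4], "oxford": [-0.3, 1.0]}
# A second "model": the same space, rotated by 90 degrees.
model_b = {word: rotate(vec, 90) for word, vec in model_a.items()}

within_a = cosine(model_a["apple"], model_a["pear"])   # apple vs pear in model A
within_b = cosine(model_b["apple"], model_b["pear"])   # apple vs pear in model B
across = cosine(model_a["apple"], model_b["apple"])    # apple (A) vs apple (B)

print(round(within_a, 3), round(within_b, 3), round(across, 3))
```

The first two numbers agree, while the third is (numerically) zero even though both vectors stand for the same word.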

## Exploring the vector space

### Vectors

First, what is a vector and what does it look like?

In [None]:
# Get the vector for the word "apple":
apple_vector = wiki_vectors["apple"]

In [None]:
# Print the content of the variable `apple_vector`:
print(apple_vector)

In [None]:
# Print the length of `apple_vector`:
len(apple_vector)

In [None]:
# What's the data type of this?
type(apple_vector)

**Note:** A numpy array ([numpy.ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html)) represents a multidimensional array of fixed-size items. When it only has one dimension, it looks very similar to a list, [but still is different!](https://numpy.org/doc/stable/user/absolute_beginners.html#whats-the-difference-between-a-python-list-and-a-numpy-array)

### Vocabulary

The vocabulary of the model is the set of unique words in the model. Often, when training a model, a strict limit is placed on the vocabulary, for example by keeping only the most common words or by removing words that occur fewer than a given number of times in the whole corpus.
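As a rough sketch of what such a limit does, the snippet below counts words in a made-up mini corpus and applies two hypothetical settings, `min_count` and `max_vocab`. The names and values are assumptions for illustration, not the settings used for any of the models in this notebook.

```python
from collections import Counter

# Made-up mini corpus; `min_count` and `max_vocab` are hypothetical settings.
corpus = [
    "the engine drives the machine",
    "the machine drives the loom",
    "a loom is a machine",
]
min_count = 2   # drop words seen fewer than 2 times
max_vocab = 3   # then keep at most the 3 most frequent words

# Count every word, filter out rare ones, and cap the vocabulary size:
counts = Counter(word for sentence in corpus for word in sentence.split())
frequent = [(w, c) for w, c in counts.most_common() if c >= min_count]
vocabulary = [w for w, c in frequent[:max_vocab]]

print(counts)
print(vocabulary)
```

Any word that does not survive this step (here, for instance, "engine") simply has no vector in the resulting model.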

The size of the vocabulary can be found using the `len()` function:

In [None]:
# Find the size of the vocabulary of the model trained on 19thC English newspapers:
print(len(victorian_vectors))

# Find the size of the vocabulary of model trained on modern data:
print(len(wiki_vectors))

Now, let's inspect which words are part of the model. You can get them with `.index_to_key`:

In [None]:
# Get the words from the model, store them in variable `w2v_words`:
w2v_words = list(victorian_vectors.index_to_key)

In [None]:
# Print the first five words of the vocabulary:
print(w2v_words[:5])

In [None]:
# Check if the word "DHOxSS" is part of the vocabulary:
print("DHOxSS" in w2v_words)

In [None]:
# Check if the word "Oxford" is part of the vocabulary:
print("Oxford" in w2v_words)

In [None]:
# Ok, then check if the word "oxford" is part of the vocabulary:
print("oxford" in w2v_words)

There are several decisions involved when training a model: How many words do we keep? Do we lower-case the corpus? And many, many more.

Another decision concerns corpus size: it is hard to define when a training corpus is 'large' enough. As a (very) rough rule of thumb, perhaps **at least** around 500K words are necessary to train embeddings on corpora of a very specific style/genre/topic, and 2-4 million words for somewhat more diverse sources. Ultimately, there is no 'right' number, and perhaps more is not necessarily better (or maybe it is... depending on your corpus and its quality). A good read on the topic is [this one](https://aclanthology.org/W17-6908.pdf).

Note that the choice of the model and training decisions will have a huge impact on the word vectors you'll get.

### Similarity, relatedness, analogy

So what is the advantage of representing words as vectors?

We can calculate the similarity between two vectors. This is usually done with **cosine similarity**, which Wikipedia defines as "the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths".

When we calculate the cosine similarity between word vectors, behind the scenes we are actually ranking words relative to one another, depending on the size of the angle between any given word vector and the vectors of the other words.
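The definition translates almost word for word into code. The sketch below uses plain Python and made-up 2-D vectors (chosen so that the angles are easy to picture); the function name `cosine_similarity` is ours, not Gensim's.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    return dot / (length_u * length_v)

# Made-up 2-D vectors:
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # angle of 0 degrees
right_angle = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # angle of 90 degrees
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # angle of 180 degrees

print(same_direction, right_angle, opposite)
```

Note that the length of the vectors does not matter, only their direction: `[1.0, 2.0]` and `[2.0, 4.0]` count as maximally similar.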

![](images/w2v.png)

Gensim's `KeyedVectors` class allows us to find similarities, relatedness and analogies between words.

Let's see some useful functions:
* `doesnt_match()`
* `most_similar()`
* `similarity()`
* `n_similarity()`

#### The `.doesnt_match()` method

Given a list of words, it guesses which element is semantically different from the rest.

In [None]:
wiki_vectors.doesnt_match(["oxford", "carrot", "pear", "zucchini"])
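One plausible way to pick the odd one out, roughly along the lines of what Gensim does, is to average the (length-normalised) vectors of all the words and return the word that is least similar to that average. The sketch below does this with made-up 2-D vectors; `odd_one_out` is a hypothetical helper, not Gensim's implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    length = math.sqrt(sum(a * a for a in v))
    return [a / length for a in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean of all unit vectors."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit([sum(units[w][d] for w in words) / len(words) for d in range(dims)])
    # For unit vectors, the dot product is the cosine similarity:
    scores = {w: sum(a * b for a, b in zip(units[w], mean)) for w in words}
    return min(scores, key=scores.get)

# Made-up vectors: three vegetables/fruits pointing one way, "oxford" another.
toy = {
    "carrot": [0.9, 0.1],
    "pear": [0.8, 0.3],
    "zucchini": [0.95, 0.2],
    "oxford": [-0.2, 1.0],
}
outlier = odd_one_out(["oxford", "carrot", "pear", "zucchini"], toy)
print(outlier)
```

The method always returns something, even when no word is an obvious outlier, so the answer is best read as "least central", not as a judgement about meaning.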

#### The `.most_similar()` method

Given a word, `.most_similar()` returns the closest neighbours to the query word. The argument `topn` determines the number of neighbours to return.

In [None]:
# Example 1: The top 10 most similar words to 'oxford':
wiki_vectors.most_similar("oxford", topn=10)

In [None]:
# Example 2: The top 10 most similar words to 'carrot':
wiki_vectors.most_similar("carrot", topn=10)

In [None]:
# Example 3: The top 10 most similar words to 'apple':
wiki_vectors.most_similar("apple", topn=10)

In [None]:
# Example 4: The top 10 most similar words to 'apple', but using one of the Victorian models:
victorian_vectors.most_similar("apple", topn=10)

The `most_similar()` method can also be used to perform some vector arithmetic. Since word vectors are vectors, we can do maths with them and observe interesting patterns.

A classic: what word is to _woman_ the way _king_ is to _man_? We can try to figure this out by simple additions and subtractions:

`woman + king - man = ?`

In [None]:
wiki_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=10)

Or also:

`man + queen - woman = ?`

In [None]:
wiki_vectors.most_similar(positive=['man', 'queen'], negative=['woman'], topn=10)

This method is often used to find analogies. What is to Paris the way England is to London? Or:

`paris + england - london = ?`

In [None]:
wiki_vectors.most_similar(positive=['paris', 'england'], negative=['london'])
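To see what happens behind these queries, here is a simplified sketch with made-up 2-D vectors: add the `positive` vectors, subtract the `negative` ones, then rank the remaining words by cosine similarity to the result. The helper `analogy` is hypothetical, and Gensim's own implementation differs in its details (for instance in how vectors are normalised).

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative, vectors, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative), skipping the query words."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for w in positive:
        target = [t + x for t, x in zip(target, vectors[w])]
    for w in negative:
        target = [t - x for t, x in zip(target, vectors[w])]
    query_words = set(positive) | set(negative)
    ranked = sorted(
        ((w, cosine(target, v)) for w, v in vectors.items() if w not in query_words),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:topn]

# Made-up vectors: dimension 0 loosely "royalty", dimension 1 loosely "gender".
toy = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.05],
}
result = analogy(positive=["woman", "king"], negative=["man"], vectors=toy)
print(result)
```

The query words themselves are left out of the ranking (Gensim's `most_similar()` does the same), which is worth keeping in mind when interpreting analogy results: the nearest vector to `woman + king - man` may well be `king` itself.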

#### The `.similarity()` method

This method allows you to find the similarity between words...

(... actually, remember that this is the similarity between the vector representation of two words learned from a specific corpus, **HUGE DIFFERENCE!**)

In [None]:
# Get the similarity between two words:
wiki_vectors.similarity("bush", "president")

In [None]:
# Get the similarity between two words:
wiki_vectors.similarity("trump", "president")

In [None]:
# Of course, that will change if we use one of our Victorian models:
victorian_vectors.similarity("bush", "president")

In [None]:
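# And, of course, a word missing from the vocabulary (as "obama" most likely is in 1860s newspapers) raises a KeyError: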
victorian_vectors.similarity("obama", "president")
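When a word may be missing from a model, it can help to check before querying. The sketch below uses a plain dictionary with made-up vectors as a stand-in for a model, and a hypothetical helper `missing_words`; the same `in` check should also work directly on a `KeyedVectors` object.

```python
# Stand-in for a model: a plain dict with made-up vectors
# (the real object would be a KeyedVectors instance).
toy_vectors = {"bush": [0.2, 0.9], "president": [0.8, 0.3]}

def missing_words(words, vectors):
    """Return the words that have no vector in `vectors`."""
    return [w for w in words if w not in vectors]

query = ["obama", "president"]
absent = missing_words(query, toy_vectors)

if absent:
    print("Not in the vocabulary:", absent)
else:
    print("All words found")
```

A missing word tells you something about the corpus (period, spelling, preprocessing, frequency cut-offs), so it is often worth noting, not just silently skipping.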

#### The `n_similarity()` method

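This method computes the cosine similarity between two sets of words: each set is first reduced to the mean of its word vectors, and the two mean vectors are then compared.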

In [None]:
print(wiki_vectors.n_similarity(["oxford", "university", "summer", "school"], ["learning", "oxfordshire"]))

In [None]:
print(wiki_vectors.n_similarity(["oxford", "university", "summer", "school"], ["sun", "beach", "holidays"]))
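A minimal sketch of comparing two sets of words: average the vectors in each set, then take the cosine similarity of the two averages. The vectors and the helper names (`mean_vector`, `set_similarity`) are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(words, vectors):
    """Element-wise average of the vectors of `words`."""
    dims = len(vectors[words[0]])
    return [sum(vectors[w][d] for w in words) / len(words) for d in range(dims)]

def set_similarity(words_a, words_b, vectors):
    """Cosine similarity between the mean vectors of two word sets."""
    return cosine(mean_vector(words_a, vectors), mean_vector(words_b, vectors))

# Made-up 2-D vectors:
toy = {
    "oxford": [0.2, 0.9], "university": [0.3, 0.8], "school": [0.4, 0.7],
    "learning": [0.35, 0.75], "sun": [0.9, 0.1], "beach": [0.8, 0.2],
}
study = set_similarity(["oxford", "university", "school"], ["learning"], toy)
holiday = set_similarity(["oxford", "university", "school"], ["sun", "beach"], toy)
print(round(study, 3), round(holiday, 3))
```

Because each set is collapsed into a single average, one unusual word in a small set can shift the result noticeably.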

### Bias in word embeddings

Depending on the texts the vectors have been trained on, the respective vectors tend to <u>**reflect the bias they have acquired in the natural language**</u> as is represented in the corpus. Bias, of course, can be reflected across several variables (sexual orientation, ethnicity, political leaning, etc.).

Depending on the task, **this can be an important ethical issue**: it may not be if your model is meant to capture the semantics of words from a specific historical period, but it may be so if the vectors are used to improve certain recommendation systems, for example.

In [None]:
# What is to woman the way doctor is to man?
wiki_vectors.most_similar(positive=['woman', 'doctor'], negative=['man'])

In [None]:
# What is to man the way doctor is to woman?
wiki_vectors.most_similar(positive=['man', 'doctor'], negative=['woman'])

Word embeddings have been used in research to quantify biases and stereotypes, for example in [this famous PNAS paper](https://www.pnas.org/doi/full/10.1073/pnas.1720347115).
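As a very simple illustration of how such a quantification might start (this is not the method used in the paper above), one can compare how similar a word is to "woman" versus "man". The vectors below are made up, and deliberately built to encode a stereotype so that the behaviour of the measure is visible; they are not results from any real model. `gender_lean` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up vectors, constructed on purpose to contain a stereotype:
toy = {
    "man": [1.0, 0.1],
    "woman": [0.1, 1.0],
    "engineer": [0.9, 0.3],
    "nurse": [0.2, 0.9],
    "teacher": [0.6, 0.6],
}

def gender_lean(word, vectors):
    """Positive: closer to 'woman'. Negative: closer to 'man'. Near zero: no lean."""
    return cosine(vectors[word], vectors["woman"]) - cosine(vectors[word], vectors["man"])

leans = {w: gender_lean(w, toy) for w in ["engineer", "nurse", "teacher"]}
for word, lean in leans.items():
    print(word, round(lean, 3))
```

On a real model, you would replace `toy` with lookups into the loaded vectors and look at many words at once, since a single pair of reference words is a very noisy measure.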

✏️ **Exercise:**

Load a different model (in a different language, for example) and explore similarities between words.

In [None]:
# Experiment and type your code here:



✏️ **Exercise:**

With the model you have loaded, can you identify any biases inherited from the training data?

In [None]:
# Experiment and type your code here:

