# Python Text Analysis: Word Embeddings

<div class="alert alert-success">  
    
### Learning Objectives 

* Recognize differences between bag-of-words representations and word embeddings.
* Learn how word embeddings capture the meaning of words.
* Calculate cosine similarity to capture linguistic concepts.
* Understand that word embedding models can be biased, and develop approaches to uncover these biases.
</div>

### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br> 

### Sections
1. [Understand Word Embeddings](#section1)
2. [Word Similarity](#section2)
3. [Word Analogy](#section3)
4. [Bias in Word Embeddings](#section4)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In Part 2, we converted text data to a numerical representation with Bag-of-Words (BoW) and then TF-IDF. These methods rely heavily on word frequency but largely ignore the relative positions of words, so rich semantic and syntactic information is left uncaptured beyond the independent frequencies of words.

We need a more powerful tool that can represent the rich semantics (and more) of our text data. In the final part of this series, we will dive into word embeddings, a method widely used in more advanced Natural Language Processing (NLP) tasks. We'll make extensive use of the `gensim` package, which hosts a range of word embedding models, including `word2vec` and `glove`, the two models we'll explore today.

<a id='section1'></a>

# Understand Word Embeddings

As famously put by British Linguist J.R. Firth:

> **You shall know a word by the company it keeps.**

This quote captures the essence of word embeddings, which take the numerical representation of text a step further.

Recall from Part 2 that a BoW representation is a **sparse** matrix. Its dimension is determined by the vocabulary size and the number of documents. Importantly, a sparse matrix like BoW is interpretable: the cell values refer to the count of a word in a document. Oftentimes the cell values are zeros, because many words simply do not appear in a particular document.

Likewise, we can think of word embeddings as a matrix, but this time a **dense** matrix whose cell values are real numbers. Word embeddings project a word's meaning onto a high-dimensional vector space, which is why they are also called **word vectors**. A word vector is essentially an array of real numbers; its length, as we'll see today, can be as low as 50 or as high as 300 (or even higher in Large Language Models). These real numbers do not make explicit sense to us, but that does not mean they are meaningless. The meanings of words, semantic or syntactic, are captured by the vector representation, which we will return to shortly.

BoW:
- Sparse matrix
- Dimension: $D$ x $V$, where rows are **D**ocuments and columns are words in the **V**ocabulary.
- Interpretable: e.g., in a financial document, "bank" and "banker" could appear a lot of times but not "bane".

<img src='../images/bow-illustration-2.png' alt="BoW" width="500">

Word embeddings:
- Dense matrix
- Dimension: $V$ x $D$, where rows are **V**ocabulary and columns are vectors with dimension **D**.
- Not immediately interpretable

<img src='../images/bow-illustration-3.png' alt="BoW" width="500">
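To make the contrast concrete, here is a minimal sketch that builds both kinds of matrix for a made-up three-document corpus. The corpus, the embedding size of 4, and the random values standing in for learned embeddings are all assumptions for illustration; real embeddings are learned from data rather than drawn at random.

```python
from collections import Counter

import numpy as np

# Toy corpus (made up for illustration)
docs = [
    "the bank approved the loan",
    "the banker left the bank",
    "the river bank flooded",
]

# Vocabulary: every unique token, sorted for a stable column order
tokenized = [doc.split() for doc in docs]
vocab = sorted({token for doc in tokenized for token in doc})

# BoW: documents x vocabulary matrix of raw counts
bow = np.array([[Counter(doc)[word] for word in vocab] for doc in tokenized])

# Word embeddings: vocabulary x dimensions matrix of real numbers.
# Random values stand in for learned embeddings here.
rng = np.random.default_rng(seed=0)
embedding_dim = 4
embeddings = rng.normal(size=(len(vocab), embedding_dim))

print("BoW shape (documents x vocabulary):", bow.shape)
print("Embedding shape (vocabulary x dimensions):", embeddings.shape)
print("Vector for 'bank':", embeddings[vocab.index("bank")])
```

Each BoW cell can be read directly as a count, whereas a row of the embedding matrix is just a list of real numbers with no obvious per-cell meaning.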

Today, we are going to explore two widely used word embedding models, `word2vec` and `glove`. We will use the package `gensim` to access both models, so let's install gensim first.

## Install `gensim`

In [2]:
# Run if you do not have gensim installed
# !pip install gensim

In [3]:
import gensim
import gensim.downloader as api
from gensim.models import KeyedVectors

## `word2vec`

Before diving into `word2vec`, let's talk a bit of history. The idea of word vectors, i.e., projecting a word's meaning onto a vector space, has been around for a long time. The `word2vec` model, proposed by [Mikolov et al.](https://arxiv.org/abs/1310.4546) in 2013, introduced an efficient way to learn word embeddings and has since stimulated a new wave of research on the topic.

The key question asked in this paper is: how do we go about learning a good vector representation from the data?

Mikolov et al. proposed two approaches: the **continuous bag-of-words (CBOW)** and the **skip-gram (SG)**. Both are similar in that we use the vector representation of a token to try and predict what the nearby tokens are with a shallow neural network.   

Take the following sentence from Merriam-Webster, for example. If our target token is $w_t$, "banks", the context tokens would be the preceding tokens $w_{t-2}, w_{t-1}$ and the following ones $w_{t+1}, w_{t+2}$. This corresponds to a **window size** of 2: 2 words on either side of the target word. Similarly, when we move on to the next target token, the context window (tokens underlined) moves as well.

<img src='../images/target_word.png' alt="Target word" width="500">

In the continuous bag-of-words model, our goal is to predict the target token, given the context tokens. In the skip-gram model, the task is to predict the context tokens from the target token. This is the reverse of the continuous bag-of-words, and is a harder task, since we have to predict more from less information.

<img src='../images/word2vec-model.png' alt="word2vec" width="550">

**CBOW** (Left):
- **Input**: context tokens
- **Inner dimension**: embedding layer
- **Output**: the target token

**Skip-gram** (Right):
- **Input**: the target token
- **Inner dimension**: embedding layer
- **Output**: context tokens
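The context window and the two prediction directions can be sketched in a few lines of plain Python. The sentence below is made up (it is not the Merriam-Webster example in the figure), and `context_window` is a hypothetical helper written for this illustration; it only prepares training examples and does not train anything.

```python
def context_window(tokens, position, window=2):
    """Return the tokens within `window` positions of the target token."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


# Made-up example sentence
tokens = "the river overflowed its banks after heavy rain".split()

cbow_examples = []   # (context tokens, target token): predict target from context
skipgram_pairs = []  # (target token, context token): predict context from target

for position, target in enumerate(tokens):
    context = context_window(tokens, position, window=2)
    cbow_examples.append((context, target))
    skipgram_pairs.extend((target, neighbor) for neighbor in context)

# The example whose target token is "banks"
print(cbow_examples[tokens.index("banks")])
print(skipgram_pairs[:4])
```

Note that tokens near the start or end of the sentence get a shorter context, because the window is cut off at the sentence boundary.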

The above figure illustrates the direction of prediction. It also serves as a schematic representation of a neural network, i.e., the mechanics underlying the training of `word2vec`. The input and output are known to us, represented by **one-hot encodings** in Mikolov et al. The **hidden layer**, the inner dimension in-between the input and the output, is the vector representation that we are trying to find out. 
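One way to see why the hidden layer doubles as the embedding: multiplying a one-hot input by the hidden-layer weight matrix simply selects one row of that matrix. The sketch below covers only this input-to-hidden step, with a made-up four-word vocabulary and random weights standing in for learned ones; it leaves out the output layer and the training itself.

```python
import numpy as np

# Made-up vocabulary and embedding size
vocab = ["bank", "banker", "river", "money"]
vocab_size, embedding_dim = len(vocab), 3

# Hidden-layer weights: one row per word. Random values stand in for
# the weights that training would normally learn.
rng = np.random.default_rng(seed=0)
hidden_weights = rng.normal(size=(vocab_size, embedding_dim))

# One-hot encoding of "river": all zeros except a 1 at the word's index
index = vocab.index("river")
one_hot = np.zeros(vocab_size)
one_hot[index] = 1.0

# Multiplying the one-hot vector by the weight matrix selects one row,
# which is that word's embedding
embedding = one_hot @ hidden_weights

print(one_hot)
print(embedding)
print(np.allclose(embedding, hidden_weights[index]))
```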

We won't go into the specifics of training, but this should give a brief idea of where embeddings come from. The `word2vec` model we will be interacting with today is **pre-trained**, meaning that the embeddings have already been trained on a large corpus (or a number of corpora). The pre-trained `word2vec` and `glove`, as well as other models, are available through `gensim`.

Let's take a look at a few of them!

In [4]:
# Get word embedding models
gensim_models = list(api.info()['models'].keys())

for model in gensim_models:
    print(model)

fasttext-wiki-news-subwords-300
conceptnet-numberbatch-17-06-300
word2vec-ruscorpora-300
word2vec-google-news-300
glove-wiki-gigaword-50
glove-wiki-gigaword-100
glove-wiki-gigaword-200
glove-wiki-gigaword-300
glove-twitter-25
glove-twitter-50
glove-twitter-100
glove-twitter-200
__testing_word2vec-matrix-synopsis


The one named `word2vec-google-news-300` is what we are looking for! The model name is usually formatted as `model-corpora-dimension`, so this is a `word2vec` model that is trained on Google News, and the embedding has 300 dimensions. 
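As a small illustration of this naming pattern, the sketch below splits a model name into its three parts. `parse_model_name` is a hypothetical helper written for this example, not part of `gensim`, and it assumes the name follows the usual pattern; names such as `conceptnet-numberbatch-17-06-300` do not split as cleanly.

```python
def parse_model_name(name):
    """Split a name of the form model-corpus-dimension into its parts."""
    # The dimension is the last hyphen-separated field
    prefix, dimension = name.rsplit("-", 1)
    # The model is the first field; the rest is the corpus
    model, corpus = prefix.split("-", 1)
    return model, corpus, int(dimension)


for name in ["word2vec-google-news-300", "glove-wiki-gigaword-50", "glove-twitter-25"]:
    print(parse_model_name(name))
```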

We can retrieve this model in two ways:
- Downloading it via `api.load()`
- Downloading the model as a zip file beforehand and then loading it in with `KeyedVectors.load_word2vec_format()`

In [None]:
# Run the following line if your local machine has plenty of memory
#wv = api.load("word2vec-google-news-300")

The parameter `binary` indicates whether the model file is in binary format (signaled by the extension `.bin`).

In [6]:
# Alternatively, load the model in
wv = KeyedVectors.load_word2vec_format('../data/GoogleNews-vectors-negative300.bin', binary=True)


Accessing the actual word vectors can be done by treating the word vector model as a dictionary. 

For example, let's take a look at the word vector for "banana":

In [None]:
wv['banana']

We can take a look at the shape of the "banana" vector. As promised, it is a 1-D array that holds 300 values.

In [None]:
wv['banana'].size

These values appear to be random floats. However, now that the word has been transformed into a vector, we can more easily perform computations on it. 

Let's take a look at a few examples!

<a id='section2'></a>

# Word Similarity

The first question we can ask is: What words are similar to "bank"? In vector space, we'd expect similar words to have vectors that are closer to each other.

There are many metrics for measuring vector similarity, one of the most useful being [**cosine similarity**](https://en.wikipedia.org/wiki/Cosine_similarity). Cosine similarity ranges from -1 to 1: vectors pointing in the same direction have a cosine similarity of 1, orthogonal vectors have a cosine similarity of 0, and vectors pointing in opposite directions have a cosine similarity of -1.
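As a quick illustration, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below applies that definition to three made-up 2-D vectors; `cosine_similarity` is a helper defined here for illustration, not a `gensim` function.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


a = np.array([1.0, 2.0])
same_direction = cosine_similarity(a, np.array([2.0, 4.0]))
perpendicular = cosine_similarity(a, np.array([-2.0, 1.0]))
opposite = cosine_similarity(a, np.array([-1.0, -2.0]))

print(same_direction, perpendicular, opposite)
```

The three values come out at approximately 1, 0, and -1. Because only the angle matters, scaling a vector (as in the first case) does not change the result.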

`gensim` provides a function called `most_similar()` that lets us find the words most similar to a queried word. The output is a list of tuples, each holding a word and its cosine similarity to the queried word.

Let's give it a shot!

In [None]:
wv.most_similar(['bank'])

It looks like the most similar vectors to "bank" are other financial terms!

Recall that `word2vec` is trained to capture a word's meaning based on contextual information. These results pop up because these words commonly appear in contexts similar to those of the word "bank".
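Conceptually, a most-similar query ranks every other word in the vocabulary by its cosine similarity to the query word. The sketch below does this by brute force over a handful of made-up 3-D vectors; `toy_most_similar` and the vectors are assumptions for illustration and are not how `gensim` stores or searches its vectors internally.

```python
import numpy as np

# Made-up 3-d vectors; real embeddings are learned and much longer
toy_vectors = {
    "bank":   np.array([0.9, 0.1, 0.0]),
    "banker": np.array([0.8, 0.2, 0.1]),
    "loan":   np.array([0.7, 0.0, 0.2]),
    "river":  np.array([0.1, 0.9, 0.1]),
    "banana": np.array([0.0, 0.1, 0.9]),
}


def toy_most_similar(query, vectors, topn=3):
    """Rank all other words by cosine similarity to the query word."""
    q = vectors[query] / np.linalg.norm(vectors[query])
    scores = []
    for word, vec in vectors.items():
        if word == query:
            continue  # skip the query word itself
        scores.append((word, float(np.dot(q, vec / np.linalg.norm(vec)))))
    # Highest similarity first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


toy_result = toy_most_similar("bank", toy_vectors)
print(toy_result)
```

With these hand-picked values, the words that share the first dimension with "bank" rank highest, which mirrors how words from similar contexts end up close together in a trained model.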

In addition to querying the most similar words, we can also ask the model to return the cosine similarity between two words by calling the function `similarity()`.

Let's go ahead and check out the similarities between the following pairs of words.

In [None]:
# bank with capitalized B
wv.similarity('Bank', 'river')

In [None]:
# the present participle of bank
wv.similarity('banking', 'river')

In [None]:
# the word stem
wv.similarity('bank', 'river')

In [None]:
# bank in plural
wv.similarity('banks', 'river')

🔔 **Question**: Why do "banks" and "river" appear to have higher similarity than the other pairs?

## 🥊 Challenge 1: Doesn't Match

We have a list of tuples for coffee-noun pairs. Let's find out which coffee drink is most commonly associated with the word "coffee," and which one is not:

- Complete the for loop to calculate the cosine similarity between each pair.

Next, look up the documentation for the [`doesnt_match`](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#word2vec-demo) function. We will use it to identify the verb in the following list (one cell below) that does not seem to belong.

- Use `doesnt_match` to find the verb that is unlikely to fit within the group.
  

In [None]:
coffee_nouns = [
    ('coffee', 'espresso'),
    ('coffee', 'cappuccino'),
    ('coffee', 'latte'),
    ('coffee', 'americano'),
    ('coffee', 'irish'),
]

# for w1, w2 in coffee_nouns:
#     similarity = # YOUR CODE HERE
#     print(f"{w1}, {w2}, {similarity}")

In [None]:
for w1, w2 in coffee_nouns:
    similarity = wv.similarity(w1, w2)
    print(f"{w1}, {w2}, {similarity}")

In [None]:
coffee_verbs = ['brew', 'drip', 'pour', 'make', 'grind', 'roast']

# verb_doesnt_match = # YOUR CODE HERE
# verb_doesnt_match

In [None]:
verb_doesnt_match = wv.doesnt_match(coffee_verbs)
verb_doesnt_match

<a id='section3'></a>

# Word Analogy

One of the most famous usages of `word2vec` is via word analogies. For example:

`man : king :: woman : queen`

Oftentimes, a word analogy like this is visualized with a parallelogram, as shown in the following figure, which is adapted from [Ethayarajh et al. (2019)](https://aclanthology.org/P19-1315.pdf).

<img src='../images/word_analogy.png' alt="Word analogy" width="450">

The upper side (the difference between `man` and `woman`) should approximate the lower side (the difference between `king` and `queen`); the vector difference represents the meaning of `female`.

- $\mathbf{V}_{\text{man}} - \mathbf{V}_{\text{woman}} \approx \mathbf{V}_{\text{king}} - \mathbf{V}_{\text{queen}}$

Similarly, the left side (the difference between `king` and `man`) should approximate the right side (the difference between `queen` and `woman`); the vector difference represents the meaning of `royal`.

- $\mathbf{V}_{\text{king}} - \mathbf{V}_{\text{man}} \approx \mathbf{V}_{\text{queen}} - \mathbf{V}_{\text{woman}}$

We can take either equation and rearrange it:

- $\mathbf{V}_{\text{king}} - \mathbf{V}_{\text{man}} + \mathbf{V}_{\text{woman}} \approx \mathbf{V}_{\text{queen}}$

If the vectors of `king`, `man`, and `woman` are known, by vector arithmetic we should be able to get a vector that approximates the meaning of `queen`.
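Before turning to the real model, the arithmetic can be checked on a toy example. The four 2-D vectors below are hand-made so that the first value loosely stands for "royalty" and the second for "gender"; they are assumptions for illustration, not values from `word2vec`.

```python
import numpy as np

# Hand-made 2-d vectors: first value ~ "royalty", second ~ "gender"
toy = {
    "man":   np.array([0.2, -0.9]),
    "woman": np.array([0.1, 1.0]),
    "king":  np.array([1.0, -0.8]),
    "queen": np.array([0.9, 0.9]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# king - man + woman
result = toy["king"] - toy["man"] + toy["woman"]

# Rank every word by cosine similarity to the resulting vector
ranking = sorted(toy, key=lambda word: cosine(result, toy[word]), reverse=True)

print(result)
print(ranking)
```

In this idealized setting the resulting vector lands closest to `queen`. Real embeddings are noisier, so the ranking from an actual model need not be as clean.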

Let's implement it!

⚠️ **Warning:** In all these operations, we set `norm=True`, and renormalize. That's because different vectors might be of different lengths, so the normalization puts everything on a common scale.

In [None]:
# Calculate "royal" vector difference
difference = wv.get_vector('king', norm=True) - wv.get_vector('man', norm=True) 

# Add on woman
difference += wv.get_vector('woman', norm=True)

# Renormalize vector
difference = difference / np.linalg.norm(difference)

In [None]:
# What is the most similar vector?
wv.most_similar(difference)

🔔 **Question**: The word "queen" is the second most similar one. Why does "king" have the highest similarity score?

Carrying out these operations can be done in one swoop with the `most_similar` function. 

We pass in two arguments, `positive` and `negative`, where `positive` holds the words that we want the output to be similar to, and `negative` holds the words we'd like the output to be dissimilar to.

In [None]:
wv.most_similar(positive=['woman', 'king'], negative=['man'])

## 🥊 Challenge 2: Woman is to Homemaker?

[Bolukbasi et al. (2016)](https://arxiv.org/pdf/1607.06520) is a thought-provoking investigation of gender bias in word embeddings. They focus primarily on word analogies, especially those that reveal gender stereotyping! Let's run a couple of examples discussed in the paper, using the `most_similar` function we've just learned.

The following code block contains a few examples we can pass to the `positive` argument: we want the output to be similar to, for example, `woman` and `chairman`, and at the same time we specify that it should be dissimilar to `man`. We'll print the top result by indexing to the 0th item.

Let's complete the following for loop.

In [None]:
positive_pair = [['woman', 'chairman'],
                 ['woman', 'doctor'], 
                 ['woman', 'computer_programmer']]
negative_word = 'man'

## YOUR CODE HERE
# for example in positive_pair:
#     result = ...
#     print(f"man is to {example[1]} as woman is to {result[0][0]}")

In [None]:
for example in positive_pair:
    result = wv.most_similar(positive=example, negative=[negative_word])
    print(f"man is to {example[1]} as woman is to {result[0][0]}")

🔔 **Question**: What have you found? Are these results surprising?

<a id='section4'></a>

# Bias in Word Embeddings

### `glove` 

Any form of stereotyping is disturbing. Now that we know gender bias is indeed present in the pre-trained embeddings, let's take a closer look at it!

We will switch gears to a smaller embedding model, pre-trained `glove`, starting from this section. Let's load it with the `api.load()` function.

The model we load is trained on Wikipedia and Gigaword (news data). Check out the [documentation](https://nlp.stanford.edu/projects/glove/) if you want to learn more!

In [None]:
glove = api.load('glove-wiki-gigaword-50')

Let's double check the size of the embedding vector.

In [None]:
glove['banana'].size

### Semantic Axis

To investigate gender bias in word embeddings, we first need a vector representation that captures the concept of gender. The idea is to construct **a semantic axis** (or "SemAxis") for a concept. Such a concept is often complex and cannot simply be denoted by a single word. It is also often fluid, in that its meaning spans from one end to the other. Once we have the vector representation of the concept, we can project a list of terms onto that axis and see whether each term is more aligned with one end or the other.

The method comes from [An et al. 2018](https://aclanthology.org/P18-1228/). We first need to come up with two lists of pole words that are opposed to each other.

- $\mathbf{V}_{\text{plus}} = \{v_{1}^{+}, v_{2}^{+}, v_{3}^{+}, ..., v_{n}^{+}\}$

- $\mathbf{V}_{\text{minus}} = \{v_{1}^{-}, v_{2}^{-}, v_{3}^{-}, ..., v_{n}^{-}\}$

We take the mean of each vector set to represent the core meaning of that set. 

- $\mathbf{V}_{\text{plus}} = \frac{1}{n}\sum_{i=1}^{n}v_{i}^{+}$

- $\mathbf{V}_{\text{minus}} = \frac{1}{n}\sum_{j=1}^{n}v_{j}^{-}$

Next we take the difference between the two means to represent the corresponding semantic axis. 

- $\mathbf{V}_{\text{axis}} = \mathbf{V}_{\text{plus}} - \mathbf{V}_{\text{minus}}$

Projecting a specific term onto the semantic axis is, as we've learned above, operationalized as taking the cosine similarity between the word's vector and the semantic axis vector. A positive value indicates that the term is closer to the $\mathbf{V}_{\text{plus}}$ end, and a negative value indicates proximity to the $\mathbf{V}_{\text{minus}}$ end.

- $\text{score}(w) = \cos(v_{w}, \mathbf{V}_{\text{axis}})$
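The steps above can be traced on a toy example. The 2-D pole vectors and the two test vectors below are hand-made assumptions for illustration; with real embeddings each vector would come from the model instead.

```python
import numpy as np

# Hand-made 2-d vectors standing in for the embeddings of the pole words
plus_vectors = np.array([[1.0, 0.2], [0.8, 0.0], [0.9, 0.1]])
minus_vectors = np.array([[-1.0, 0.1], [-0.8, 0.3], [-0.9, 0.2]])

# Mean of each pole set, then the difference of the means gives the axis
v_plus = plus_vectors.mean(axis=0)
v_minus = minus_vectors.mean(axis=0)
v_axis = v_plus - v_minus


def score(word_vector, axis):
    """Cosine similarity between a word vector and the semantic axis."""
    return float(np.dot(word_vector, axis)
                 / (np.linalg.norm(word_vector) * np.linalg.norm(axis)))


score_a = score(np.array([0.7, 0.5]), v_axis)   # vector leaning toward the plus pole
score_b = score(np.array([-0.6, 0.5]), v_axis)  # vector leaning toward the minus pole

print(v_axis)
print(score_a, score_b)
```

The first score is positive and the second is negative, matching the reading of the sign described above.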

⚠️ **Warning:** A binary distinction of gender is a simplification of the diversity and complexity of gender identities. This method is limited, as it is only capable of constructing two polarities. Along the way, we'll discover how much stereotyping is encoded in it.

## 🥊 Challenge 3: Construct a Semantic Axis

Now it's your turn! We have two sets of pole words for "female" and "male". These are example words tested in Bolukbasi et al., 2016. We will get the embeddings for these words from glove to calculate the gender axis. 

The cell for the function `get_semaxis` provides some starting code. Complete the function. If everything runs, the embedding size of the semantic axis should be the same as the size of the input vector. 

In [None]:
# Define two sets of pole words (examples from Bolukbasi et al., 2016)
female = ['she', 'woman', 'female', 'daughter', 'mother', 'girl']
male = ['he', 'man', 'male', 'son', 'father', 'boy']

In [None]:
# def get_semaxis(list1, list2, model, embedding_size):
#     '''Calculate the embedding of a semantic axis given two lists of pole words.'''

#     # Step 1: Get the embeddings for terms in each list
#     # vplus = ...
#     # vminus = ...

#     # Step 2: Calculate the mean embeddings for each list
#     # vplus_mean = ...
#     # vminus_mean = ...

#     # Step 3: Get the difference between two means
#     # sem_axis = ...

#     # Sanity check
#     assert sem_axis.size == embedding_size
    
#     return sem_axis

In [None]:
def get_semaxis(list1, list2, model, embedding_size):
    '''Calculate the embedding of a semantic axis given two lists of pole words.'''

    # Step 1: Get the embeddings for terms in each list
    vplus = [model[term] for term in list1]
    vminus = [model[term] for term in list2]

    # Step 2: Calculate the mean embeddings for each list
    vplus_mean = np.mean(vplus, axis=0)
    vminus_mean = np.mean(vminus, axis=0)

    # Step 3: Get the difference between two means
    sem_axis = vplus_mean - vminus_mean

    # Sanity check
    assert sem_axis.size == embedding_size
    
    return sem_axis

In [None]:
# Plug in the gender lists to calculate the semantic axis for gender
gender_axis = get_semaxis(list1=female, 
                          list2=male, 
                          model=glove, 
                          embedding_size=50)
gender_axis

We have the gender axis ready! The next step is to project a list of terms onto the gender axis. We can continue with the occupation terms we've tested previously.

Before we go ahead and calculate the cosine similarity, let's first rate the following occupation terms. Use your intuition!

The rating should fall within $[-1, 1]$: a negative value means the term is closer to the male end, and a positive value means it is closer to the female end.

In [None]:
# Define a list of occupation terms (examples taken from Bolukbasi et al., 2016)
occupations = ['engineer',
               'nurse',
               'designer',
               'receptionist',
               'banker',
               'librarian',
               'architect',
               'hairdresser',
               'philosopher']

In [None]:
# Rate the following occupation terms
occ_rating = {'engineer': -0.4,
              'nurse': 0.6,
              'designer': -0.1,
              'receptionist': 0.5,
              'banker': -0.4,
              'librarian': 0.5,
              'architect': -0.4,
              'hairdresser': 0.5,
              'philosopher': -0.1
             }

In [None]:
# Calculate cosine similarity between a given word and the axis
def get_projection(word, model, axis):
    '''Get the projection of a word onto a semantic axis'''
    
    word_norm = model[word] / np.linalg.norm(model[word])
    axis_norm = axis / np.linalg.norm(axis)
    projection = np.dot(word_norm, axis_norm) 
    
    return projection

In [None]:
occ_projections = {word: get_projection(word, glove, gender_axis) for word in occupations}
occ_projections 

## Visualize the Projection

Now that we have calculated the projection of each occupation term onto the gender axis, let's plot these values to gain a more straightforward understanding of how much gender stereotyping is hidden in these terms.

We will use a bar plot to visualize them, with the color of each bar corresponding to the proximity of a term to an end.

In [None]:
from matplotlib.colors import Normalize

def plot_semantic_axis(projections, title, xlab):
    '''Return a horizontal bar plot of the projections.'''

    # Sort the projections in descending order
    projection_sorted = sorted(projections.items(), key=lambda term: term[1], reverse=True)

    # Extract the terms
    terms = [term_value[0] for term_value in projection_sorted]

    # Extract corresponding values of projections
    values = [term_value[1] for term_value in projection_sorted]

    # Take the absolute values for gradient color fill
    values_abs = np.abs(values)
    norm = Normalize(vmin=min(values_abs), vmax=max(values_abs))
    cmap = plt.get_cmap("YlOrBr")  
    colors = [cmap(norm(value)) for value in values_abs]

    plt.figure(figsize=(8, 6))  
    plt.barh(terms, values, color=colors)
    plt.grid(axis="x", linestyle=":", alpha=0.5)
    limit = np.max(values_abs) + 0.05
    plt.xlim(-limit, limit)
    plt.xlabel(xlab)
    plt.title(title)
    plt.show()

We will visualize the projections alongside your self-ratings.

🔔 **Question**: Do you find the results surprising or expected? Let's pause for a minute to discuss why stereotyping exists in word embeddings.

In [None]:
title1 = 'Projections onto the gender axis'
title2 = 'Self-rated projections onto the gender axis'
xlab = 'Gender-stereotypical occupation terms'

plot_semantic_axis(occ_projections, title1, xlab)
plot_semantic_axis(occ_rating, title2, xlab)

## 🎬 **Demo**: The Class Axis

In addition to projecting terms onto a single axis, we can also project terms onto two axes and plot the results on a scatter plot, where the coordinates correspond to projections onto the two axes.
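
To make the idea of two-axis coordinates concrete, here is a minimal sketch using made-up 3-D vectors rather than real GloVe embeddings. The names `cosine`, `toy_axis_x`, `toy_axis_y`, and `toy_words` are illustrative assumptions and do not appear elsewhere in this notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors for illustration only (not real embeddings)
toy_axis_x = np.array([1.0, 0.0, 0.0])  # stands in for the first semantic axis
toy_axis_y = np.array([0.0, 1.0, 0.0])  # stands in for the second semantic axis
toy_words = {
    'term_a': np.array([0.8, 0.6, 0.0]),
    'term_b': np.array([-0.6, 0.0, 0.8]),
}

# Each term becomes one (x, y) point: its projection onto each axis
coords = {term: (cosine(vec, toy_axis_x), cosine(vec, toy_axis_y))
          for term, vec in toy_words.items()}
print(coords)
```

In this toy setup the two axes happen to be orthogonal. Two semantic axes built from real embeddings need not be, so the resulting scatter plot is best read as two separate similarity scores per term rather than as a strict geometric decomposition.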

Social class is another dimension that has been frequently discussed in the literature. In this demo, we'll create a semantic axis for social class, using two sets of pole words representing the two ends of class, as described in [Kozlowski et al. 2019](https://journals.sagepub.com/doi/full/10.1177/0003122419877135).

First, we'll project a list of sports terms onto both the gender and social class axes, similar to the method used in [Kozlowski et al. 2019](https://journals.sagepub.com/doi/full/10.1177/0003122419877135). We'll visualize the results on a scatter plot, with the x-axis representing gender and the y-axis representing social class. The coordinates of a term on this plot correspond to its projections onto these axes.

Next, we'll repeat the process to visualize occupation terms, which will give us a rough idea of how much a term is biased towards either end of these two dimensions.

Let's dive in!

In [None]:
# Define two sets of pole words of social class (examples taken from Kozlowski et al., 2019)
poor = ['poor', 'poorer', 'poorest', 'poverty', 'inexpensive', 'impoverished', 'cheap']
rich = ['rich', 'richer', 'richest', 'affluence', 'expensive', 'wealthy', 'luxury']

class_axis = get_semaxis(list1=poor, 
                         list2=rich, 
                         model=glove,
                         embedding_size=50)

We will project sports terms onto the social class axis to see whether some sports are associated more with "high" society and others with "low" society.

In [None]:
# Define a list of sports terms (examples taken from Kozlowski et al., 2019)
sports = ['camping', 
          'boxing', 
          'bowling', 
          'baseball', 
          'soccer', 
          'tennis', 
          'golf', 
          'basketball', 
          'skiing', 
          'sailing', 
          'volleyball']

Next, let's use the `get_projection` function to calculate the cosine similarity between each sports term and each of the two axes (gender and class).

In [None]:
proj_spt_class = {word: get_projection(word, glove, class_axis) for word in sports}
proj_spt_gender = {word: get_projection(word, glove, gender_axis) for word in sports}

Finally, let's plot the results in a scatter plot!

In [None]:
plt.figure(figsize=(9, 7))

# Use scatter plot to visualize the results
plt.scatter(list(proj_spt_gender.values()), 
            list(proj_spt_class.values()), 
            color='cornflowerblue',
            s=75)

# Add text label to each dot
for term in sports:
    plt.annotate(term, 
                 (proj_spt_gender[term], proj_spt_class[term]), 
                 fontsize=10)

# Add more annotations to four corners of the plot
plt.annotate('Male/High', (-0.48, -0.28), color='gray', horizontalalignment='left')
plt.annotate('Female/High', (0.48, -0.28), color='gray', horizontalalignment='right')
plt.annotate('Male/Low', (-0.48, 0.27), color='gray', horizontalalignment='left')
plt.annotate('Female/Low', (0.48, 0.27), color='gray', horizontalalignment='right')

# Add reference lines to each semantic axis
plt.hlines(xmin=-1, xmax=1, y=0, color='lightcoral', linewidth=1, linestyle=':')
plt.vlines(ymin=-1, ymax=1, x=0, color='lightcoral', linewidth=1, linestyle=':')

# Other parameter settings
plt.xlim(-0.5, 0.5)
plt.ylim(-0.3, 0.3)
plt.grid(True, linestyle=':')
plt.xlabel('Projection onto Gender')
plt.ylabel('Projection onto Class')
plt.show();

🔔 **Question**: Voilà! Our scatter plot looks great. Let's take a minute to unpack the plot and discuss the following questions:
- Which sports term is most biased toward the male end, and which toward the female end?
- Which sport seems to be gender-neutral?
- Which sports term is most biased toward high social class, and which toward low social class?
- Which sport seems to be class-neutral?

OK! Let's return to the occupation terms. We first need their projections onto both axes.

In [None]:
proj_occ_gender = {word: get_projection(word, glove, gender_axis) for word in occupations}
proj_occ_class = {word: get_projection(word, glove, class_axis) for word in occupations}

Next, let's visualize the results in a scatter plot.

In [None]:
plt.figure(figsize=(9, 7))

# Use scatter plot to visualize the results
plt.scatter(list(proj_occ_gender.values()), 
            list(proj_occ_class.values()), 
            color='tan', 
            s=75)

# Add text label to each dot
for term in occupations:
    plt.annotate(term, 
                 (proj_occ_gender[term], proj_occ_class[term]), 
                 fontsize=10)

# Add more annotations to four corners of the plot
plt.annotate('Male/High', (-0.48, -0.48), color='gray', horizontalalignment='left')
plt.annotate('Female/High', (0.48, -0.48), color='gray', horizontalalignment='right')
plt.annotate('Male/Low', (-0.48, 0.45), color='gray', horizontalalignment='left')
plt.annotate('Female/Low', (0.48, 0.45), color='gray', horizontalalignment='right')

# Add reference lines to each semantic axis
plt.hlines(xmin=-1, xmax=1, y=0, color='lightcoral', linewidth=1, linestyle=':')
plt.vlines(ymin=-1, ymax=1, x=0, color='lightcoral', linewidth=1, linestyle=':')

# Other parameter settings
plt.xlim(-0.5, 0.5)
plt.ylim(-0.5, 0.5)
plt.grid(True, linestyle=':')
plt.xlabel('Projection onto Gender')
plt.ylabel('Projection onto Class')
plt.show();

🔔 **Question**: We already know how much each term is biased toward male or female. Let's focus on the projections onto the social class axis.
- Which occupation is most biased toward high social class, and which toward low social class?
- Which occupation seems to be class-neutral?

We will wrap up this workshop with these two plots. Hopefully, they leave you with some food for thought as you explore word embeddings further. Axes for gender and social class have been widely researched, but the semantic axis method lets us investigate much more. It is useful for capturing the abstract meaning of many notions, such as an axis of coldness, an axis of kindness, and so on.
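
As a starting point for building your own axis, here is a minimal sketch of a temperature ("coldness") axis. It runs on made-up 3-D vectors rather than real embeddings. The helper names `build_axis` and `project` are hypothetical, and the construction used here (mean of one pole minus mean of the other) is an assumption about one common way to build a semantic axis; it may differ from the `get_semaxis` function used in this notebook.

```python
import numpy as np

# Made-up 3-D embeddings for illustration only (not real GloVe vectors)
toy_model = {
    'hot':    np.array([0.9, 0.1, 0.0]),
    'warm':   np.array([0.7, 0.3, 0.0]),
    'cold':   np.array([-0.8, 0.2, 0.0]),
    'chilly': np.array([-0.6, 0.4, 0.0]),
    'soup':   np.array([0.5, 0.5, 0.5]),
    'snow':   np.array([-0.5, 0.5, 0.5]),
}

def build_axis(pole_a, pole_b, model):
    """Hypothetical helper: mean of pole_a vectors minus mean of pole_b vectors."""
    mean_a = np.mean([model[w] for w in pole_a], axis=0)
    mean_b = np.mean([model[w] for w in pole_b], axis=0)
    return mean_a - mean_b

def project(word, model, axis):
    """Cosine similarity between a word vector and the axis."""
    vec = model[word]
    return float(np.dot(vec, axis) / (np.linalg.norm(vec) * np.linalg.norm(axis)))

# Positive values lean toward the "hot" pole, negative toward the "cold" pole
temperature_axis = build_axis(['hot', 'warm'], ['cold', 'chilly'], toy_model)
toy_projections = {w: project(w, toy_model, temperature_axis) for w in ['soup', 'snow']}
print(toy_projections)
```

To try this on real data, you could swap `toy_model` for a loaded embedding model that supports `model[word]` lookups and replace the pole words with terms that exist in its vocabulary.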

<div class="alert alert-success">

## ❗ Key Points

* Pre-trained word embeddings like `word2vec` and `glove` incorporate contextual information into their representations of words' meanings.
* Similarity between words is conveniently measured with cosine similarity.
* We can explore biases in word embeddings using semantic axes.

</div>