# Tutorial // Exploring Gender Bias in Word Embedding

## https://learn.responsibly.ai/word-embedding

Powered by [`responsibly`](https://docs.responsibly.ai/) - Toolkit for auditing and mitigating bias and fairness of machine learning systems 🔎🤖🧰

# Part Three: Motivation - Why Use Word Embeddings?

## 3.1 - [NLP (Natural Language Processing)](https://en.wikipedia.org/wiki/Natural_language_processing)
**Very partial** list of tasks


### 1. Classification
- Fake news classification
- Toxic comment classification
- Review rating (sentiment analysis)
- Hiring decision making by CV
- Automated essay scoring

### 2. Machine Translation

### 3. Information Retrieval
- Search engine
- Plagiarism detection

### 4. Conversational Chatbot

### 5. Coreference Resolution
![](../images/corefexample.png)
<small>Source: [Stanford Natural Language Processing Group](https://nlp.stanford.edu/projects/coref.shtml)</small>

<br><br><br><br>

## 3.2 - Machine Learning (NLP) Pipeline
<br>
<div style="border: 1px solid; padding: 50px; margin: 10px">
 <h2>
 
Data → Representation → (Structured) Inference → Prediction   

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;↑

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Auxiliary Corpus/Model
 </h2>
</div>
<br>

<small>Source: [Kai-Wei Chang (UCLA) - What It Takes to Control Societal Bias in Natural Language Processing](https://www.youtube.com/watch?v=RgcXD_1Cu18)</small>

### 3.3 - Essential Question - How to represent language to a machine?

We need some kind of *dictionary* 📖 to transform/encode

... from a human representation (words) 🗣 🔡

... to a machine representation (numbers) 🤖 🔢

<br><br><br><br>

## First Attempt

### Idea: Bag of Words (for a document)
![](../images/bow.png)
<small>Source: Zheng, A. & Casari, A. (2018). Feature Engineering for Machine Learning. O'Reilly Media.</small>

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ['it', 'they', 'puppy', 'and', 'cat', 'aardvark', 'cute', 'extremely', 'not']

vectorizer = CountVectorizer(vocabulary=vocabulary)

In [None]:
sentence = 'it is a puppy and it is extremely cute'

### Bag of words

In [None]:
vectorizer.fit_transform([sentence]).toarray()

In [None]:
vectorizer.fit_transform(['it is not a puppy and it is extremely cute']).toarray()

In [None]:
vectorizer.fit_transform(['it is a puppy and it is extremely not cute']).toarray()

🦄 Read more about scikit-learn's text feature extraction [here](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).
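The counting idea can also be sketched in plain Python. This is a minimal sketch, assuming simple whitespace tokenization (`CountVectorizer` uses its own tokenizer), and the helper name `bag_of_words` is hypothetical. It shows why the two sentences containing "not" end up with the same vector: a bag of words keeps the counts and throws away the word order.

```python
from collections import Counter

vocabulary = ['it', 'they', 'puppy', 'and', 'cat', 'aardvark', 'cute', 'extremely', 'not']


def bag_of_words(text, vocabulary):
    """Count how many times each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    # words outside the vocabulary (e.g. "is", "a") are simply ignored
    return [counts[word] for word in vocabulary]


bow_a = bag_of_words('it is not a puppy and it is extremely cute', vocabulary)
bow_b = bag_of_words('it is a puppy and it is extremely not cute', vocabulary)

print(bow_a)           # [2, 0, 1, 1, 0, 0, 1, 1, 1]
print(bow_a == bow_b)  # True: different meaning, same representation
```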

### One-hot representation

In [None]:
[vectorizer.fit_transform([word]).toarray()
 for word in sentence.split()
 if word in vocabulary]

### The problem with one-hot representation

![](../images/audio-image-text.png)
<small>Source: [Tensorflow Documentation](https://www.tensorflow.org/tutorials/representation/word2vec)</small>

[Color Picker](https://www.google.com/search?q=color+picker)
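One possible reading of the figure and the color picker link, sketched under assumptions: with one-hot vectors every pair of distinct items is equally far apart, so the representation carries no notion of similarity. A dense representation, such as the RGB values of a color, does. The color names and the use of Euclidean distance here are choices made for illustration only.

```python
from math import dist  # Euclidean distance, Python 3.8+

names = ['red', 'orange', 'blue']

# one-hot: a separate axis for every name
one_hot = {name: [1.0 if i == j else 0.0 for j in range(len(names))]
           for i, name in enumerate(names)}

# dense: standard RGB values of the same colors
rgb = {'red': (255, 0, 0), 'orange': (255, 165, 0), 'blue': (0, 0, 255)}

one_hot_red_orange = dist(one_hot['red'], one_hot['orange'])
one_hot_red_blue = dist(one_hot['red'], one_hot['blue'])
rgb_red_orange = dist(rgb['red'], rgb['orange'])
rgb_red_blue = dist(rgb['red'], rgb['blue'])

# one-hot: both distances are the same, nothing says red is closer to orange
print(one_hot_red_orange, one_hot_red_blue)

# RGB: red is closer to orange than to blue
print(rgb_red_orange, rgb_red_blue)
```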

<br><br><br><br>

## 3.4 - 💎 Idea: Embedding a word in an n-dimensional space

### Distributional Hypothesis
> "a word is characterized by the company it keeps" - [John Rupert Firth](https://en.wikipedia.org/wiki/John_Rupert_Firth)

**Distance ~ Meaning Similarity**


### 🦄 Examples (algorithms and pre-trained models)
- [Word2Vec](https://code.google.com/archive/p/word2vec/)
- [GloVe](https://nlp.stanford.edu/projects/glove/)
- [fastText](https://fasttext.cc/)
- [ELMo](https://allennlp.org/elmo) (contextualized)

#### 🦄 Training: using *word-context* relationships from a corpus. See: [The Illustrated Word2vec by Jay Alammar](http://jalammar.github.io/illustrated-word2vec/)
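As a rough illustration of what a *word-context* relationship is, the sketch below only collects (word, context) pairs with a sliding window over a sentence. This is the data-preparation idea, not the training algorithm itself. The function name `context_pairs` and the window size are assumptions made for this example.

```python
def context_pairs(tokens, window=2):
    """Pair every token with the tokens at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = 'the quick brown fox jumps'.split()
pairs = context_pairs(tokens, window=1)

for pair in pairs:
    print(pair)
```

Words that tend to appear with the same context words produce similar sets of pairs, which is what pushes their vectors close together during training.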

#### 🦄 State of the Art - Contextual Word Embedding → Language Models
- [The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) by Jay Alammar](http://jalammar.github.io/illustrated-bert/)
- Microsoft - [NLP Best Practices](https://github.com/microsoft/nlp-recipes)
- [Tracking Progress in Natural Language Processing](https://nlpprogress.com/)

# Part Four: Playing with Word2Vec word embedding!

[Word2Vec](https://code.google.com/archive/p/word2vec/) - Google News - 100B tokens, 3M vocab, cased, 300d vectors - only lowercase vocab extracted

Loaded using the [`responsibly`](http://docs.responsibly.ai) package: the function `responsibly.we.load_w2v_small` returns a [`gensim`](https://radimrehurek.com/gensim/) [`KeyedVectors`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors) object.


## 4.1 - Basic Properties

In [None]:
# 🛠️⚡ ignore warnings
# generally, you shouldn't do that, but for this tutorial we'll do so for the sake of simplicity

import warnings
warnings.filterwarnings('ignore')

In [None]:
from responsibly.we import load_w2v_small

w2v_small = load_w2v_small()

In [None]:
# vocabulary size

len(w2v_small.vocab)

In [None]:
# get the vector of the word "home"

print('home =', w2v_small['home'])

In [None]:
# the word embedding dimension, in this case, is 300

len(w2v_small['home'])

In [None]:
# all the word vectors are normalized (= have norm equal to one)

from numpy.linalg import norm

norm(w2v_small['home'])

In [None]:
# 🛠️ make sure that all the vectors are normalized!

from numpy.testing import assert_almost_equal

length_vectors = norm(w2v_small.vectors, axis=1)

assert_almost_equal(actual=length_vectors,
                    desired=1,
                    decimal=5)

## 4.2 - 💎 Measuring Distance between Words

![](../images/sphere.png)

<small>Source: [Wikipedia](https://en.wikipedia.org/wiki/File:Sphere_wireframe_10deg_6r.svg)</small>

### Measure of Similarity: [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity)
- Measures the cosine of the angle between two vectors.
- Ranges from 1 (same vector) to -1 (opposite/antipodal vector)
- In Python, for normalized vectors (NumPy arrays), use the `@` (at) operator!
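Why is a plain dot product enough for normalized vectors? A minimal pure-Python sketch with made-up 2-dimensional vectors (the helper names are hypothetical): cosine similarity divides the dot product by both lengths, so when both lengths are 1 the division changes nothing.

```python
from math import sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return sqrt(dot(v, v))


def normalize(v):
    length = norm(v)
    return [x / length for x in v]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v, for vectors of any length."""
    return dot(u, v) / (norm(u) * norm(v))


u = [3.0, 4.0]  # made-up vectors, both of length 5
v = [4.0, 3.0]

cos_uv = cosine_similarity(u, v)
dot_raw = dot(u, v)                         # not a cosine: depends on the lengths
dot_unit = dot(normalize(u), normalize(v))  # equals the cosine similarity

print(cos_uv, dot_raw, dot_unit)
```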

In [None]:
w2v_small['cat'] @ w2v_small['cat']

In [None]:
w2v_small['cat'] @ w2v_small['cats']

In [None]:
from math import acos, degrees

degrees(acos(w2v_small['cat'] @ w2v_small['cats']))

In [None]:
w2v_small['cat'] @ w2v_small['dog']

In [None]:
degrees(acos(w2v_small['cat'] @ w2v_small['dog']))

In [None]:
w2v_small['cat'] @ w2v_small['cow']

In [None]:
degrees(acos(w2v_small['cat'] @ w2v_small['cow']))

In [None]:
w2v_small['cat'] @ w2v_small['graduated']

In [None]:
degrees(acos(w2v_small['cat'] @ w2v_small['graduated']))
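A small practical note on the angle computation, as a hedged sketch: because of floating-point rounding, the dot product of two unit vectors can come out slightly above 1 or below -1, and `acos` raises a `ValueError` for such inputs. Clamping first avoids that. The helper name `angle_in_degrees` is hypothetical.

```python
from math import acos, degrees


def angle_in_degrees(cosine):
    """Convert a cosine similarity to an angle, tolerating rounding errors."""
    clamped = max(-1.0, min(1.0, cosine))
    return degrees(acos(clamped))


print(angle_in_degrees(1.0000001))  # would fail without clamping
print(angle_in_degrees(0.0))        # perpendicular vectors
print(angle_in_degrees(-1.0))       # opposite vectors
```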

💎 In general, using word embeddings to encode words as input for NLP systems (*) improves their performance compared to a one-hot representation.

\* Sometimes the embedding is learned as part of the NLP system.

## 4.3 - 🛠️ Visualizing Word Embeddings in 2D using t-SNE

<small>Source: [Google's Seedbank](https://research.google.com/seedbank/seed/pretrained_word_embeddings)</small>

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# take the words ranked 200 to 600 by frequency in the corpus
words = w2v_small.index2word[200:600]

# convert the words to vectors
embeddings = [w2v_small[word] for word in words]

# perform T-SNE
words_embedded = TSNE(n_components=2).fit_transform(embeddings)

# ... and visualize!
plt.figure(figsize=(20, 20))
for i, label in enumerate(words):
    x, y = words_embedded[i, :]
    plt.scatter(x, y)
    plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                 ha='right', va='bottom', size=11)
plt.show()

### Extra: [Tensorflow Embedding Projector](http://projector.tensorflow.org)
⚡ Be cautious: It is easy to see "patterns" that are not really there.

## 4.4 - Most Similar

What are the most similar (= closest) words to a given word?

In [None]:
w2v_small.most_similar('cat')
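The idea behind such a lookup can be sketched as a brute-force search: score every other word by its dot product with the query and keep the top ones. The 2-dimensional unit vectors below are made up for illustration (they are not taken from the model), and this `most_similar` function is a toy stand-in, not the `gensim` implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# hypothetical unit vectors, invented for this example
toy_vectors = {
    'cat': (1.0, 0.0),
    'cats': (0.96, 0.28),
    'dog': (0.8, 0.6),
    'graduated': (0.0, 1.0),
}


def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity (dot product of unit vectors)."""
    query = vectors[word]
    scores = [(other, dot(query, vector))
              for other, vector in vectors.items()
              if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


neighbors = most_similar('cat', toy_vectors)
print(neighbors)
```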

### EXTRA: Doesn't Match

Given a list of words, which one doesn't match?

The word whose vector is furthest from the mean of all the words' vectors.

In [None]:
w2v_small.doesnt_match('breakfast cereal dinner lunch'.split())
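A minimal sketch of that idea, assuming unit-length vectors: average the vectors, then pick the word least similar to the average. The 2-dimensional vectors are invented for illustration, so the result says nothing about what the real model returns, and this `doesnt_match` is a toy version rather than the `gensim` code.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# hypothetical unit vectors, invented for this example
toy_vectors = {
    'breakfast': (0.96, 0.28),
    'lunch': (1.0, 0.0),
    'dinner': (0.96, -0.28),
    'cereal': (0.28, 0.96),
}


def doesnt_match(words, vectors):
    """Return the word whose vector is least similar to the mean vector."""
    dimensions = len(vectors[words[0]])
    mean = [sum(vectors[word][d] for word in words) / len(words)
            for d in range(dimensions)]
    # for unit vectors, a smaller dot product with the mean means a larger angle
    return min(words, key=lambda word: dot(vectors[word], mean))


odd_one = doesnt_match('breakfast cereal dinner lunch'.split(), toy_vectors)
print(odd_one)
```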

## 4.5 - Vector Arithmetic

![](../images/vector-addition.png)

<small>Source: [Wikipedia](https://commons.wikimedia.org/wiki/File:Vector_add_scale.svg)</small>

In [None]:
# nature + science = ?

w2v_small.most_similar(positive=['nature', 'science'])

## 4.6 - 💎 Vector Analogy

![](../images/linear-relationships.png)
<small>Source: [Tensorflow Documentation](https://www.tensorflow.org/tutorials/representation/word2vec)</small>

In [None]:
# man:king :: woman:?
# king - man + woman = ?

w2v_small.most_similar(positive=['king', 'woman'],
                       negative=['man'])

In [None]:
w2v_small.most_similar(positive=['big', 'smaller'],
                       negative=['small'])
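The arithmetic behind an analogy can be sketched on an idealized toy example. The 3-dimensional vectors below are made up so that the relation is perfectly linear (think of the axes as "person", "royal", "female"). Real embeddings are far messier, and the helper `analogy` is hypothetical.

```python
from math import sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (sqrt(dot(u, u)) * sqrt(dot(v, v)))


# hypothetical vectors, invented so that the analogy holds exactly
toy_vectors = {
    'man': (1.0, 0.0, 0.0),
    'woman': (1.0, 0.0, 1.0),
    'king': (1.0, 1.0, 0.0),
    'queen': (1.0, 1.0, 1.0),
    'child': (1.0, 0.0, 0.5),
}


def analogy(a, b, c, vectors):
    """a:b :: c:? computed as b - a + c, ranked by cosine similarity."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    return sorted(vectors, key=lambda word: cosine(vectors[word], target),
                  reverse=True)


ranking = analogy('man', 'king', 'woman', toy_vectors)
print(ranking[0])  # queen
```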

## 4.7 - Think about a DIRECTION in word embedding as a RELATION

# $\overrightarrow{she} - \overrightarrow{he}$
# $\overrightarrow{smaller} - \overrightarrow{small}$
# $\overrightarrow{Spain} - \overrightarrow{Madrid}$


**⚡ Direction is not a word vector by itself!**
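A hedged sketch of working with a direction, using invented 2-dimensional unit vectors (not values from the model). It shows one concrete symptom of the point above: the difference of two unit vectors does not have length 1, unlike the word vectors. After normalizing it, other words can be projected onto the direction with a dot product.

```python
from math import sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# hypothetical unit vectors, invented for this example
toy_vectors = {
    'she': (0.6, 0.8),
    'he': (0.6, -0.8),
    'queen': (0.8, 0.6),
    'king': (0.8, -0.6),
}

# the direction is a difference of two word vectors
direction = [s - h for s, h in zip(toy_vectors['she'], toy_vectors['he'])]
direction_length = sqrt(dot(direction, direction))  # not 1

# normalize the direction, then project words onto it
unit_direction = [x / direction_length for x in direction]
projections = {word: dot(toy_vectors[word], unit_direction)
               for word in ('queen', 'king')}

print(direction_length)
print(projections)  # positive leans towards "she", negative towards "he"
```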

### ⚡ But it doesn't work all the time...

In [None]:
w2v_small.most_similar(positive=['forward', 'up'],
                       negative=['down'])

It might be because we have the phrase "looking forward", which is associated with "excitement" in the data.

⚡🦄 Keep in mind that the word embedding was generated by learning the co-occurrence of words, so the fact that it *empirically* exhibits "concept arithmetic" doesn't necessarily mean it learned it! In fact, it seems it didn't.
See: [king - man + woman is queen; but why? by Piotr Migdał](https://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html)

🦄 EXTRA: [Demo - Word Analogies Visualizer by Julia Bazińska](https://lamyiowce.github.io/word2viz/)

⚡🦄 In fact, `w2v_small.most_similar` finds the closest word which *is not one* of the given ones. This is a real methodological issue. Nowadays, it is not common practice to evaluate word embeddings with analogies.

You can use [`responsibly.we.most_similar`](https://docs.responsibly.ai/word-embedding-bias.html#responsibly.we.utils.most_similar) for the unrestricted version.
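To see why the restriction matters, here is a toy sketch with invented 3-dimensional vectors (not taken from the model). The helper `analogy` and its `exclude_inputs` flag are hypothetical, and the sketch skips the unit normalization of the real vectors. When the offset between "man" and "woman" is small, the unrestricted nearest neighbour of king - man + woman is simply "king" itself.

```python
from math import sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (sqrt(dot(u, u)) * sqrt(dot(v, v)))


# hypothetical vectors, invented for this example
toy_vectors = {
    'man': (1.0, 0.0, 0.0),
    'woman': (1.0, 0.0, 0.3),
    'king': (1.0, 1.0, 0.0),
    'queen': (1.0, 0.6, 0.5),
}


def analogy(a, b, c, vectors, exclude_inputs):
    """a:b :: c:? as the word closest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors
                  if not (exclude_inputs and word in (a, b, c))]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


restricted = analogy('man', 'king', 'woman', toy_vectors, exclude_inputs=True)
unrestricted = analogy('man', 'king', 'woman', toy_vectors, exclude_inputs=False)

print(restricted)    # queen
print(unrestricted)  # king
```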