# A Quick Introduction to Word2Vec with Python and Gensim
April 7, 2022

## Setting up and getting some data

In [None]:
# We need a resource for our data
import nltk

In [None]:
# If you are running this in Colab, also run this cell; otherwise you can skip it
import warnings
warnings.filterwarnings('ignore')

In [None]:
# We are going to use the brown corpus
nltk.download('brown')
from nltk.corpus import brown

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


Let's start by printing a few sentences out of the "brown" corpus, to get an idea of what the data looks like.

In [None]:
brown_sent = brown.sents()
print(brown_sent[:3])

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ['The', 'September-October', 'term', 'jury', 'had', 'been', 'charged', 'by', 'Fulton', 'Superior', 'Court', 'Judge', 'Durwood', 'Pye', 'to', 'investigate', 'reports', 'of', 'possible', '``', 'irregularities', "''", 'in', 'the', 'hard-fought', 'primary', 'which', 'was', 'won', 'by', 'Mayor-nominate', 'Ivan', 'Allen', 'Jr.', '.']]


## Building the Model

We don't want to build the whole model from scratch, so we will use the Gensim library instead.

In [None]:
from gensim.models import Word2Vec

We can now build an instance of the model!

In [None]:
# This is the whole model for the brown corpus (it might take a few minutes)!
brown_model = Word2Vec(brown_sent)

Let's look at an example!

In [None]:
test1 = brown_model.wv.most_similar('blue')
print("Most similar to 'blue':\n", test1[:3])

Most similar to 'blue':
 [('red', 0.9627671241760254), ('gray', 0.9619156718254089), ('green', 0.9587184190750122)]


## Refining the model

Word2Vec takes a broad range of parameters. In our example above, we only chose where to get our sentences from, and we used the *default* settings for the rest. Let's now look at the most relevant ones (the full list is at https://radimrehurek.com/gensim/models/word2vec.html). The names below are the Gensim 3.x names that the code in this notebook uses; Gensim 4.0 and later renamed `size` to `vector_size` and `iter` to `epochs`.

- **size**: The dimensionality of our embeddings (i.e. the length of each word vector).
- **window**: The maximum distance between the target word and the words counted as its context. The window size affects the type of similarity captured in the embeddings.
- **negative**: The number of negative samples (incorrect training-pair instances) that are drawn for each positive (correct) one.
- **sg**: Training algorithm -- 1 for skip-gram; otherwise CBOW.
- **min_count**: Ignores all words with a total frequency lower than this.
- **iter**: Number of iterations (epochs) over the corpus.
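
To make the **window** parameter more concrete, here is a minimal sketch of how (target, context) pairs could be collected for skip-gram training. The helper `skipgram_pairs` is hypothetical and written only for illustration; Gensim's real implementation does extra things (such as subsampling very frequent words) that this sketch ignores.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# The first four tokens of the second Brown sentence printed earlier
sentence = ['The', 'jury', 'further', 'said']

pairs_w1 = skipgram_pairs(sentence, window=1)
pairs_w2 = skipgram_pairs(sentence, window=2)

print(pairs_w1)
print(len(pairs_w1), len(pairs_w2))
```

A wider window produces more pairs per word, so each word is trained against a broader, more topical context.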

So let's now train our model by explicitly setting some of these parameters!

In [None]:
# This is the whole model (it's going to take a few minutes!)
brown_model = Word2Vec(brown_sent, size = 300, window = 5, negative = 5, sg = 1, min_count = 5, iter = 10)
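
The effect of **min_count** is easy to check by hand. The sketch below counts tokens in a few toy sentences and keeps only those that occur at least `min_count` times. The helper `filter_vocab` and the toy sentences are assumptions made for illustration, not part of Gensim.

```python
from collections import Counter


def filter_vocab(sentences, min_count):
    """Keep only the words whose total frequency is at least `min_count`."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, n in counts.items() if n >= min_count}


# Hypothetical toy corpus, already tokenized like brown.sents()
toy_sentences = [
    ['the', 'jury', 'said'],
    ['the', 'election', 'was', 'conducted'],
    ['the', 'jury', 'had', 'been', 'charged'],
]

kept = filter_vocab(toy_sentences, min_count=2)
print(sorted(kept))
```

Words below the threshold get no vector at all, which is why asking the trained model about a rare word raises an error.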

In [None]:
# We can do the same kind of test as before, this time with 'game'
test = brown_model.wv.most_similar('game')
print("Most similar to 'game':\n", test[:10])

Most similar to 'game':
 [('golf', 0.7113315463066101), ('sunny', 0.6928216218948364), ('baseball', 0.6793524026870728), ('Beethoven', 0.6783057451248169), ("week's", 0.6782501339912415), ('tournament', 0.6765882968902588), ('games', 0.674275279045105), ('Bears', 0.673066258430481), ('booking', 0.6693858504295349), ('orchestra', 0.6671175956726074)]


## Evaluating the Model

We are going to rely on our own **human intuitions** to decide how well the model is doing!

In [None]:
sim = brown_model.wv.similarity("cup", "water")
print("How similar is 'cup' to 'water':\n", sim)

sim = brown_model.wv.similarity("cup", "book")
print("How similar is 'cup' to 'book':\n", sim)

How similar is 'cup' to 'water':
 0.5959407
How similar is 'cup' to 'book':
 0.18765117
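
The score returned by `similarity` is the cosine similarity between the two word vectors. Here is a minimal sketch of that computation in plain Python. The three vectors are made up for illustration and are not the real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-d vectors (real embeddings here have 300 dimensions)
cup = [1.0, 2.0, 0.0]
water = [2.0, 3.0, 1.0]
book = [-1.0, 0.5, 3.0]

print(cosine_similarity(cup, water))
print(cosine_similarity(cup, book))
```

Values close to 1 mean the vectors point in nearly the same direction; values near 0 mean they are unrelated.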


In [None]:
brown_test = brown_model.wv.most_similar('child')
print("Most similar to 'child':\n", brown_test[:3])

Most similar to 'child':
 [('teacher', 0.6705455780029297), ('autistic', 0.6585035920143127), ('parent', 0.6541589498519897)]


We can do more complex comparisons, but some results will be less intuitive than others!

In [None]:
brown_test = brown_model.wv.most_similar(positive = ['child'], negative = ['person'])
print("Most similar to 'child' but dissimilar to 'person':\n", brown_test[:3])

Most similar to 'child' but dissimilar to 'person':
 [('health', 0.3028700649738312), ('children', 0.25111696124076843), ('fever', 0.21571321785449982)]


### Let's try a few more interesting tests.

Which word is a mismatch in the sequence?

In [None]:
mismatch = brown_model.wv.doesnt_match(['teacher','professor','doctor','red','athlete','runner'])
print(mismatch)

red
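
How does the model pick the mismatch? As far as I understand it, `doesnt_match` averages the (length-normalized) vectors of all the words and returns the word that is least similar to that average. Below is a minimal sketch of the idea; the helper names and the 2-d vectors are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def odd_one_out(vectors):
    """Return the word whose vector is least similar to the mean direction."""
    units = {w: unit(v) for w, v in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[k] for u in units.values()) / len(units) for k in range(dim)])
    score = {w: sum(a * b for a, b in zip(u, mean)) for w, u in units.items()}
    return min(score, key=score.get)


# Made-up vectors: three "profession" words point one way, 'red' another
toy = {
    'teacher': [0.9, 0.1],
    'professor': [0.8, 0.2],
    'doctor': [0.85, 0.3],
    'red': [0.1, 0.9],
}

print(odd_one_out(toy))  # prints: red
```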


Maybe not **just** semantic relations?

In [None]:
mismatch = brown_model.wv.doesnt_match(['running','swimming','singing','paper','reading','booking','catch'])
print(mismatch)

paper


In [None]:
compare = brown_model.wv.similarity('walk','walked') 
print("The similarity between 'walk' and 'walked':\n", compare)

compare = brown_model.wv.similarity('look','looked') 
print("The similarity between 'look' and 'looked':\n", compare)

compare = brown_model.wv.similarity('look','walk') 
print("The similarity between 'look' and 'walk':\n", compare)

The similarity between 'walk' and 'walked':
 0.6559779
The similarity between 'look' and 'looked':
 0.6430508
The similarity between 'look' and 'walk':
 0.527116


## The choice of training data

Like the other parameters that we looked at, the **choice of training data** (our corpus) is essential in driving model performance.
For example, consider a very famous test case for Word2Vec: is the model able to derive the fact that "woman" is to "queen" what "man" is to "king"?

We can represent this question algebraically as:

$$\text{vector}(\text{woman}) + \text{vector}(\text{king}) - \text{vector}(\text{man}) = \text{vector}(\text{queen})$$
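
To see what this arithmetic does before running it on real embeddings, here is a minimal sketch with hypothetical 2-d vectors (the numbers and the `analogy` helper are made up for illustration). Gensim's `most_similar` also normalizes the vectors before combining them, which this sketch skips.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(vectors, positive, negative):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, vectors[w])]
    for w in negative:
        query = [q - x for q, x in zip(query, vectors[w])]
    # The input words themselves are excluded from the candidates
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# Made-up vectors: first axis is roughly "royalty", second roughly "gender"
toy = {
    'man': [0.1, 1.0],
    'woman': [0.1, -1.0],
    'king': [1.0, 0.9],
    'queen': [1.0, -0.9],
    'apple': [-1.0, 0.1],
}

print(analogy(toy, positive=['woman', 'king'], negative=['man']))
```

In real embeddings the result is only approximately equal to a word vector, so the answer is the *nearest* word rather than an exact match.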

In [None]:
test = brown_model.wv.most_similar(positive=['woman','king'], negative=['man'], topn=1)
print(test)

[('singing', 0.72732013463974)]


We got a *weird* result!

However, consider that the Brown corpus is fairly small (about 1M words) and fairly old. What would happen if we used a bigger, more recent corpus?

### Working with a pretrained model

Luckily, NLTK includes a pre-trained model: a pruned sample of a model trained on 100 billion words from the Google News dataset. The full model (about 3 GB) is available from https://code.google.com/p/word2vec/.

In [None]:
# we need to get the data
from nltk.data import find
nltk.download('word2vec_sample')

# we are going to use a pruned set
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))

[nltk_data] Downloading package word2vec_sample to /root/nltk_data...
[nltk_data]   Unzipping models/word2vec_sample.zip.


In [None]:
# This time we are **not** training it from scratch, we are just loading it in (it is still going to take a bit)!
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)
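
With `binary=False`, the file is read as plain text: a header line with the vocabulary size and the vector length, followed by one line per word with the word and its numbers. The sketch below parses a tiny string in that layout. The `read_word2vec_text` helper and the sample values are hypothetical; in practice `load_word2vec_format` does this for us.

```python
import io


def read_word2vec_text(handle):
    """Parse the word2vec text layout into a {word: [floats]} dictionary."""
    header = handle.readline().split()
    vocab_size, vector_size = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != vector_size:
            raise ValueError(f"wrong vector length for {word!r}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError("header does not match the number of words read")
    return vectors


# Made-up sample: 3 words, 2 dimensions each
sample = io.StringIO("3 2\njoy 0.1 0.2\nking 0.3 -0.4\nqueen 0.5 0.6\n")
vectors = read_word2vec_text(sample)
print(vectors['king'])
```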

Let's do a sanity check!

In [None]:
model.most_similar("joy")[:10]

[('elation', 0.6732936501502991),
 ('joyful', 0.6633968353271484),
 ('delight', 0.655024528503418),
 ('excitement', 0.6531193256378174),
 ('thrill', 0.630203902721405),
 ('happiness', 0.6182849407196045),
 ('joyous', 0.6128898859024048),
 ('jubilation', 0.6043694615364075),
 ('pleasure', 0.6032876968383789),
 ('sadness', 0.5980007648468018)]

Let's try our example once more!

In [None]:
model.most_similar(positive=['woman','king'], negative=['man'], topn = 1)

[('queen', 0.7118192911148071)]

In [None]:
model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 1)

[('France', 0.7884092330932617)]

We can do more! Let's track **semantic shifts** (e.g. historical changes in meaning).

In [None]:
change1 = brown_model.wv.most_similar('gay')
print("Most similar to 'gay' in the brown corpus:\n", change1[:5])

Most similar to 'gay' in the brown corpus:
 [('gaiety', 0.8070326447486877), ('sad', 0.8015825748443604), ('wonderfully', 0.7922199964523315), ('Schwarzkopf', 0.7886055707931519), ('witty', 0.7874990701675415)]


In [None]:
change2 = model.most_similar('gay')
print("Most similar to 'gay' in Google News:\n", change2[:5])

Most similar to 'gay' in Google News:
 [('homosexual', 0.8145634531974792), ('homosexuals', 0.7562745809555054), ('lesbians', 0.7516927719116211), ('queer', 0.6972684264183044), ('Gay', 0.6740463376045227)]
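
One simple way to put a number on this shift is to measure how much the two neighbor lists overlap. The sketch below uses the Jaccard index (shared words divided by all words) on the two lists printed above. The `jaccard` helper is an assumption for illustration, not a Gensim function.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Neighbor words copied from the two outputs above
brown_neighbors = ['gaiety', 'sad', 'wonderfully', 'Schwarzkopf', 'witty']
google_neighbors = ['homosexual', 'homosexuals', 'lesbians', 'queer', 'Gay']

overlap = jaccard(brown_neighbors, google_neighbors)
print(overlap)  # prints: 0.0
```

An overlap of 0 means the two models place the word in completely different neighborhoods. With both models loaded, the lists could be built with something like `[w for w, _ in model.most_similar('gay', topn=5)]`.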


## Biases

Relying on frequency patterns in human-generated data to make inferences has some problems...

In [None]:
compare1 = model.similarity('she','engineer')
print("The similarity between 'she' and 'engineer':\n", compare1)

compare2 = model.similarity('he','engineer')
print("The similarity between 'he' and 'engineer':\n", compare2)

The similarity between 'she' and 'engineer':
 0.0032564793
The similarity between 'he' and 'engineer':
 0.107617


In [None]:
compare1 = model.similarity('woman','nurse')
print("The similarity between 'woman' and 'nurse':\n", compare1)

compare2 = model.similarity('man','nurse')
print("The similarity between 'man' and 'nurse':\n", compare2)

The similarity between 'woman' and 'nurse':
 0.44135568
The similarity between 'man' and 'nurse':
 0.25472283
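
To compare such pairs at a glance, we can look at the *difference* between the two similarities. The sketch below wraps that in a small helper; `association_gap` and `lookup` are hypothetical names, and the scores are simply copied from the outputs above so that the cell runs without the model.

```python
def association_gap(similarity, target, attr_a, attr_b):
    """Positive if `target` is closer to `attr_a`, negative if closer to `attr_b`."""
    return similarity(attr_a, target) - similarity(attr_b, target)


# Similarity scores copied from the outputs above
scores = {
    ('she', 'engineer'): 0.0032564793,
    ('he', 'engineer'): 0.107617,
    ('woman', 'nurse'): 0.44135568,
    ('man', 'nurse'): 0.25472283,
}


def lookup(w1, w2):
    return scores[(w1, w2)]


engineer_gap = association_gap(lookup, 'engineer', 'she', 'he')
nurse_gap = association_gap(lookup, 'nurse', 'woman', 'man')
print(engineer_gap)
print(nurse_gap)
```

With the pretrained model loaded, `model.similarity` can be passed in place of `lookup`, for example `association_gap(model.similarity, 'engineer', 'she', 'he')`.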


**Exercise 1. (5 points)**

Pick 2 words of your choice and use the code below to extract their 3 closest words from the Brown semantic space and the Google semantic space. Which model better captures your intuition? Sum up your considerations in a few sentences.

In [None]:
# Word 1
# Brown: modify this code with a word of your choice

brown_test = brown_model.wv.most_similar('lazy')
print("Brown Corpus, most similar to 'lazy':\n", brown_test[:4])

# Google: modify this code with a word of your choice

google_test = model.most_similar("lazy")
print("Google Corpus, most similar to 'lazy':\n", google_test[:4])

Brown Corpus, most similar to 'lazy':
 [('redhead', 0.878506064414978), ('Sabella', 0.8738782405853271), ('selfish', 0.8700441122055054), ('lukewarm', 0.8653857707977295)]
Google Corpus, most similar to 'lazy':
 [('slothful', 0.5775077939033508), ('stupid', 0.5557514429092407), ('dumb', 0.5406800508499146), ('tired', 0.5291409492492676)]


In [None]:
# Word 2
# Brown: modify this code with a word of your choice

brown_test = brown_model.wv.most_similar('baby')
print("Brown Corpus, most similar to 'baby':\n", brown_test[:4])

# Google: modify this code with a word of your choice

google_test = model.most_similar("baby")
print("Google Corpus, most similar to 'baby':\n", google_test[:4])

Brown Corpus, most similar to 'baby':
 [('cook', 0.7868125438690186), ('uncle', 0.7821295857429504), ('Carla', 0.778669536113739), ('Grandma', 0.7778461575508118)]
Google Corpus, most similar to 'baby':
 [('newborn', 0.8206996917724609), ('babies', 0.7815852165222168), ('infant', 0.7726625800132751), ('child', 0.65739905834198)]


*Well, the Google Corpus tends to fit the bill rather well in each instance - definitely following my intuition. Brown gave reliably strange comparisons for both 'lazy' and 'baby', hilariously finding 'redhead' to be the most similar to lazy and 'cook' as most similar to baby. The 60s were a strange time indeed.*

**Exercise 2. (5 points)**

Think of two more cases of implicit biases that you can test the model on (they can be based on gender as above, but it would be even better if you could think of other dimensions for bias). Then, modify the code below by switching w1 and w2 with words of your choice to test your idea. Did the model output what you expected? Summarize your conclusions in a couple of sentences.

In [None]:
#Bias example 1

compare3 = model.similarity('virus', 'computer')
print("The similarity between virus and computer:\n", compare3)

compare3s = model.similarity('virus','cell')
print("The similarity between virus and cell:\n", compare3s)

The similarity between virus and computer:
 0.27622563
The similarity between virus and cell:
 0.20255555


In [None]:
#Bias example 2

compare4 =  model.similarity('American', 'fear')
print("The similarity between American and fear:\n", compare4)

compare4s = model.similarity('Arab','fear')
print("The similarity between Arab and fear:\n", compare4s)

The similarity between American and fear:
 0.07104741
The similarity between Arab and fear:
 0.15490128


#Write your consideration here

There is some interesting bias present in this corpus.

When testing two different 'virus' hosts, the model associates 'virus' more strongly with 'computer' (0.28) than with 'cell' (0.20). In reality, the term 'virus' should be equally similar regardless of host type.

This one was somewhat expected: the model associates 'fear' more strongly with 'Arab' (0.15) than with 'American' (0.07), so switching to a different demographic cuts the association roughly in half. This bias is likely due to news articles on terrorism in the Google News corpus.