# Word embeddings

## An alternative approach
Can we define words by the company they keep?  
"If A and B have almost identical environments we say that they are synonyms" (Zellig Harris, 1954)

## Vector representations
One-hot encodings are long and sparse  
Alternative: **dense vectors**  
short (length 50-1000) + dense (most elements are non-zero)
### Benefits
1. Easier to use in ML  
2. Offer better generalization capabilities
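
To make the generalization point concrete, here is a minimal sketch comparing the two representations. The three-word vocabulary and the 2-d dense values are hand-picked assumptions for illustration; they are not learned embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

vocab = ['cat', 'dog', 'car']  # toy vocabulary (assumption)

# one-hot: one dimension per vocabulary word, with a single 1 in that word's slot
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# dense: hand-picked 2-d values for illustration only, not learned
dense = {'cat': [0.9, 0.1], 'dog': [0.8, 0.2], 'car': [0.1, 0.9]}

# any two different one-hot vectors are orthogonal, so they carry no notion of similarity
print(dot(one_hot['cat'], one_hot['dog']))  # 0.0

# dense vectors can place 'cat' closer to 'dog' than to 'car'
print(dot(dense['cat'], dense['dog']) > dot(dense['cat'], dense['car']))  # True
```

With one-hot vectors a model has to learn about 'cat' and 'dog' separately; with dense vectors, what it learns about one can transfer to words that sit nearby.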

## 1. Train your own embeddings
Instead of counting terms, we train a classifier on a prediction task: 'does A occur near B'?  
The learned classifier weights become our embeddings   
We can create our own word embeddings using gensim (an open-source library for unsupervised topic modeling and NLP) and train them on the Brown corpus  
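
As a rough illustration of the 'does A occur near B' task, the sketch below generates the positive (target, context) pairs that such a classifier would be trained on, in the spirit of the skip-gram variant of word2vec. The function name `skipgram_pairs` and the `window` parameter are hypothetical, and gensim's actual training procedure and defaults may differ.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for words at most `window` positions apart."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield target, tokens[j]

sentence = 'the quick brown fox jumps'.split()  # toy sentence (assumption)
pairs = list(skipgram_pairs(sentence, window=1))
print(pairs)
```

Each pair is a positive example ('A does occur near B'); negative examples would be drawn by pairing the target with randomly sampled words.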

In [1]:
# set up libraries & data
import gensim
from nltk.corpus import brown
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to C:\Users\Zee
[nltk_data]     Tech\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [2]:
# train the model
model = gensim.models.Word2Vec(brown.sents())



In [3]:
# save a copy, for later re-use
model.save('brown.embedding')

In [4]:
# we can load models on demand
brown_model = gensim.models.Word2Vec.load('brown.embedding')

In [5]:
# how many words? (index_to_key needs gensim >= 4.0; gensim 3.x exposes wv.vocab instead)
wv = brown_model.wv
len(wv.index_to_key if hasattr(wv, 'index_to_key') else wv.vocab)

In [None]:
# how many dimensions?
len(brown_model.wv['university'])

In [None]:
# calculate similarity between terms
brown_model.wv.similarity('university','school')
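
The `similarity` score is the cosine of the angle between the two word vectors. The following is a minimal pure-Python sketch of that measure on made-up vectors; the helper name `cosine_similarity` is ours and not part of gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up vectors for illustration
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])                 # -1
print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, scaling a vector does not change it, so the values always fall between -1 and 1.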

In [8]:
# find similar terms
brown_model.wv.most_similar('university', topn=5)

[('membership', 0.9543715119361877),
 ('profession', 0.9533947110176086),
 ('neighborhood', 0.9528996348381042),
 ('congregation', 0.951430082321167),
 ('selection', 0.9481860399246216)]

In [9]:
brown_model.wv.most_similar('lemon', topn=5)

[('marble', 0.9657975435256958),
 ('pension', 0.9649026989936829),
 ('frankfurters', 0.9639797806739807),
 ('herd', 0.9625921249389648),
 ('towel', 0.9624603986740112)]

In [10]:
brown_model.wv.most_similar('government', topn=5)

[('nation', 0.9277928471565247),
 ('policy', 0.9251207709312439),
 ('power', 0.9233179092407227),
 ('Christian', 0.9233003258705139),
 ('education', 0.9203839302062988)]

## 2. Use pre-trained embeddings
We can load pre-built embeddings, e.g. a sample from a model trained on 100 billion words from the Google News Dataset

In [11]:
import nltk
nltk.download('word2vec_sample')

[nltk_data] Downloading package word2vec_sample to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping models/word2vec_sample.zip.


True

In [12]:
# load a pre-built model
from nltk.data import find
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
news_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

In [13]:
# how many terms?
len(news_model.key_to_index)

43981

In [14]:
# how many dimensions?
len(news_model['university'])

300

In [15]:
# are they any better?
news_model.most_similar(positive=['university'], topn = 5)

[('universities', 0.7003918290138245),
 ('faculty', 0.6780906915664673),
 ('undergraduate', 0.6587096452713013),
 ('campus', 0.6434987783432007),
 ('college', 0.638526976108551)]

In [16]:
news_model.most_similar(positive=['lemon'], topn = 5)

[('lemons', 0.646256148815155),
 ('apricot', 0.6199417114257812),
 ('avocado', 0.5922889113426208),
 ('fennel', 0.5873182415962219),
 ('coriander', 0.5828487277030945)]

In [17]:
news_model.most_similar(positive=['government'], topn = 5)

[('Government', 0.7132059335708618),
 ('governments', 0.6521531939506531),
 ('administration', 0.5462369322776794),
 ('legislature', 0.5307289361953735),
 ('parliament', 0.5268454551696777)]

## 3. Perform vector algebra
We can use embeddings to perform verbal reasoning, e.g. A is to B as C is to...  
e.g. 'man is to king as woman is to...'  
vec(“king”) - vec(“man”) + vec(“woman”) ≈ vec(“queen”)
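
The arithmetic can be traced by hand on a toy example. The 2-d vectors below are invented so that one dimension loosely tracks 'royalty' and the other 'gender'; real embeddings are learned and their dimensions are not interpretable like this. The `analogy` helper is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# hand-made vectors (assumption): [royalty, gender]
vectors = {
    'king':  [0.9, 0.9],
    'queen': [0.9, -0.9],
    'man':   [0.1, 0.9],
    'woman': [0.1, -0.9],
    'apple': [-0.8, 0.1],
}

def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), leaving out the query words."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# 'man is to king as woman is to...'
print(analogy('man', 'king', 'woman'))  # queen
```

Leaving the query words out of the candidates matters: the nearest neighbour of the target vector is often one of the inputs themselves.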

In [18]:
news_model.most_similar(positive=['woman','king'], negative=['man'], topn = 5)

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902430415153503),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236843228340149)]

In [19]:
# the order of the positive terms does not affect the result
news_model.most_similar(positive=['king', 'woman'], negative=['man'], topn = 5)

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902430415153503),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236843228340149)]

In [20]:
# encyclopaedic knowledge
news_model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 5)

[('France', 0.7884091734886169),
 ('Belgium', 0.6197876930236816),
 ('Spain', 0.5664774179458618),
 ('Italy', 0.5654898881912231),
 ('Switzerland', 0.560969352722168)]

In [21]:
# syntactic patterns (verbs)
news_model.most_similar(positive=['has','be'], negative=['have'], topn = 5)

[('is', 0.6774996519088745),
 ('was', 0.5710028409957886),
 ('remains', 0.47552669048309326),
 ('been', 0.4538103938102722),
 ('being', 0.4456518888473511)]

In [22]:
# syntactic patterns (adjectives)
news_model.most_similar(positive=['longest','short'], negative=['long'], topn = 5)

[('shortest', 0.5145131945610046),
 ('steepest', 0.42448338866233826),
 ('first', 0.40251171588897705),
 ('flattest', 0.40171927213668823),
 ('consecutive', 0.39518705010414124)]

In [23]:
# more encyclopaedic knowledge
news_model.most_similar(positive=['blue','tulip'], negative=['sky'], topn = 5)

[('purple', 0.5252774953842163),
 ('tulips', 0.4938238859176636),
 ('brown', 0.4907749593257904),
 ('pink', 0.4860530197620392),
 ('maroon', 0.48056456446647644)]

In [24]:
# syntactic knowledge (pronouns)
news_model.most_similar(positive=['him','she'], negative=['he'], topn = 5)

[('her', 0.804938554763794),
 ('herself', 0.6881042718887329),
 ('me', 0.5886672139167786),
 ('She', 0.5803765058517456),
 ('woman', 0.5470799207687378)]

In [25]:
# lexical knowledge
news_model.most_similar(positive=['light','long'], negative=['dark'], topn = 5)

[('short', 0.4077243208885193),
 ('longer', 0.3670078217983246),
 ('lengthy', 0.36229580640792847),
 ('Long', 0.36003556847572327),
 ('continuous', 0.34982720017433167)]

In [26]:
# Find the odd one out
news_model.doesnt_match('breakfast cereal dinner lunch'.split())

'cereal'
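
One way to pick the odd one out is to average the (unit-length) word vectors and return the word least similar to that average. The sketch below follows that idea on invented 2-d vectors; `odd_one_out` is a hypothetical helper, and gensim's own implementation may differ in detail.

```python
import math

def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# invented vectors (assumption): the three meals point in a similar direction
toy = {
    'breakfast': [0.9, 0.1],
    'lunch':     [0.8, 0.2],
    'dinner':    [0.85, 0.15],
    'cereal':    [0.3, 0.9],
}

def odd_one_out(words):
    """Return the word whose unit vector is least similar to the group mean."""
    normed = {w: unit(toy[w]) for w in words}
    mean = unit([sum(col) / len(words) for col in zip(*normed.values())])
    # dot product of unit vectors equals their cosine similarity
    return min(words, key=lambda w: sum(a * b for a, b in zip(normed[w], mean)))

print(odd_one_out('breakfast cereal dinner lunch'.split()))  # cereal
```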

### Quiz questions + homework

In [27]:
news_model.most_similar(positive=['be','has'], negative=['is'], topn = 5)

[('have', 0.7298133969306946),
 ('Had', 0.5585500597953796),
 ('subsequently', 0.54400634765625),
 ('had', 0.5378279685974121),
 ('recently', 0.5339969992637634)]

In [28]:
brown_model.wv.most_similar('university', topn=5)

[('membership', 0.9543715119361877),
 ('profession', 0.9533947110176086),
 ('neighborhood', 0.9528996348381042),
 ('congregation', 0.951430082321167),
 ('selection', 0.9481860399246216)]

In [29]:
news_model.most_similar('university', topn=5)

[('universities', 0.7003918290138245),
 ('faculty', 0.6780906915664673),
 ('undergraduate', 0.6587096452713013),
 ('campus', 0.6434987783432007),
 ('college', 0.638526976108551)]

In [30]:
brown_model.wv.most_similar('college', topn=5)

[('university', 0.9235467314720154),
 ('treatment', 0.9229241609573364),
 ('selection', 0.921475887298584),
 ('persons', 0.9177237153053284),
 ('generations', 0.9159466624259949)]

In [31]:
news_model.most_similar('college', topn=5)

[('colleges', 0.6560819149017334),
 ('university', 0.6385270357131958),
 ('school', 0.6081898808479309),
 ('collegiate', 0.6081600189208984),
 ('undergraduate', 0.5866836905479431)]

In [32]:
news_model.similarity('university','turtle')

0.054503173

In [33]:
news_model.similarity('university', 'school')

0.5080746

In [34]:
news_model.similarity('university', 'factory')

0.13740103

In [35]:
news_model.similarity('university', 'supermarket')

0.13598306

In [None]:
news_model.similarity