# Word embeddings

## An alternative approach
Can we define words by the company they keep?  
"If A and B have almost identical environments we can say that they are synonyms" (Zellig Harris, 1954)

## Vector representations
One-hot encodings are long and sparse  
Alternative: **dense vectors**  
short (length 50-1000) + dense (most elements are non-zero)
### Benefits
1. Easier to use in ML  
2. Offer better generalization capabilities
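
To make the generalization point concrete, here is a minimal sketch comparing one-hot and dense vectors. The three-word vocabulary and the 2-dimensional dense vectors are hand-picked assumptions for illustration, not values from any trained model.

```python
import math

vocab = ["university", "college", "lemon"]

def one_hot(word, vocab):
    """Vector with a single 1 at the word's vocabulary index, zeros elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical dense vectors, chosen by hand for this example.
dense = {
    "university": [0.9, 0.1],
    "college": [0.8, 0.2],
    "lemon": [0.1, 0.9],
}

# Any two distinct one-hot vectors are orthogonal, so their similarity is 0.
one_hot_sim = cosine(one_hot("university", vocab), one_hot("college", vocab))

# Dense vectors can place related words close together.
dense_sim = cosine(dense["university"], dense["college"])
dense_far = cosine(dense["university"], dense["lemon"])

print(one_hot_sim, round(dense_sim, 3), round(dense_far, 3))
```

With one-hot vectors every pair of different words looks equally unrelated, so a model cannot transfer what it learned about one word to a similar word. Dense vectors leave room for that.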

## 1. Train your own embeddings
Instead of counting terms, we train a classifier on a prediction task: 'does A occur near B'?  
The learned classifier weights become our embeddings   
We can create our own word embeddings using gensim (an open-source library for unsupervised topic modeling and NLP)  
and train them on the Brown corpus  
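
The prediction task 'does A occur near B?' needs labelled examples. The sketch below shows one way to build them from raw text: word pairs seen within a small window get label 1, and randomly drawn pairs get label 0. The function name `training_pairs`, the window size, and the extra vocabulary words are assumptions for illustration; gensim builds its training data internally and may differ in detail.

```python
import random

def training_pairs(sentence, vocab, window=2, negatives_per_positive=1, seed=0):
    """Build (word, other_word, label) examples for 'does A occur near B?'.

    label 1: other_word was seen within `window` positions of word.
    label 0: other_word was drawn at random from the vocabulary. Treating a
    random draw as a non-neighbour is a simplification: it can occasionally
    pick a true neighbour.
    """
    rng = random.Random(seed)
    examples = []
    for i, word in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            examples.append((word, sentence[j], 1))
            for _ in range(negatives_per_positive):
                examples.append((word, rng.choice(vocab), 0))
    return examples

sentence = "the quick brown fox jumps".split()
vocab = sentence + ["lemon", "towel", "marble"]  # hypothetical extra words
examples = training_pairs(sentence, vocab)
positives = [(a, b) for a, b, label in examples if label == 1]
print(positives[:4])
```

A classifier trained to separate the two labels has to give neighbouring words compatible weights, and those weights are what we keep as embeddings.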

In [2]:
# set up libraries & data
import gensim
from nltk.corpus import brown
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /home/jovyan/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [3]:
# train the model (gensim 4 defaults: 100-dimensional vectors,
# words occurring fewer than 5 times are dropped)
model = gensim.models.Word2Vec(brown.sents())

In [4]:
model

<gensim.models.word2vec.Word2Vec at 0x7f89c16ccc10>

In [5]:
# save a copy, for later re-use
model.save('brown.embedding')

In [6]:
# we can load models on demand
brown_model = gensim.models.Word2Vec.load('brown.embedding')

In [7]:
# how many words?
len(brown_model.wv.index_to_key)

15173

In [11]:
# how many dimensions?
len(brown_model.wv['university'])

100

In [8]:
# what does a word vector look like?
brown_model.wv['university']

array([ 0.1123121 ,  0.25699806,  0.20049234,  0.11121298, -0.06425545,
       -0.33215448,  0.203627  ,  0.347207  , -0.30725956, -0.28279257,
        0.16802801, -0.22751583,  0.19008662,  0.1674023 ,  0.24391298,
       -0.16353872,  0.26953107, -0.14396475, -0.5229727 , -0.5176938 ,
        0.28552625, -0.11481584,  0.48428005,  0.07612962, -0.05926904,
       -0.13284211, -0.23031797,  0.01688239, -0.2100438 ,  0.23463464,
        0.2175537 , -0.06709159,  0.28098744, -0.37103644, -0.1776211 ,
        0.05884128, -0.18912491, -0.05309489, -0.34382743, -0.05950236,
        0.0128465 , -0.26842707,  0.18020464,  0.11521347,  0.20342   ,
       -0.00789078, -0.01889006, -0.02543117,  0.09976728,  0.31008193,
        0.01597053, -0.28622678, -0.25172043, -0.19010486, -0.10741533,
       -0.22977918,  0.17676017,  0.052156  , -0.06199336, -0.07945454,
        0.02969945,  0.20052046, -0.03786999, -0.16787542, -0.18871498,
        0.45194843,  0.02368015,  0.2838704 , -0.26320067,  0.36

In [12]:
# calculate similarity between terms
brown_model.wv.similarity('university','school')

0.8153878
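
The score returned by `similarity` is the cosine of the angle between the two word vectors. For vectors $\mathbf{u}$ and $\mathbf{v}$:

$$
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
$$

As a small worked example with made-up 2-dimensional vectors $\mathbf{u} = (1, 2)$ and $\mathbf{v} = (2, 1)$:

$$
\cos(\mathbf{u}, \mathbf{v}) = \frac{1 \cdot 2 + 2 \cdot 1}{\sqrt{1^2 + 2^2} \, \sqrt{2^2 + 1^2}} = \frac{4}{5} = 0.8
$$

Values close to 1 mean the vectors point in nearly the same direction, values near 0 mean they are nearly orthogonal.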

In [13]:
# find similar terms
brown_model.wv.most_similar('university', topn=5)

[('membership', 0.9542639255523682),
 ('profession', 0.9534994959831238),
 ('neighborhood', 0.9527466297149658),
 ('congregation', 0.9513265490531921),
 ('selection', 0.9480083584785461)]
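
Conceptually, a nearest-neighbour query such as `most_similar` ranks every other word by its similarity to the query word. The sketch below does this by brute force on a toy dictionary. The function is a hypothetical stand-in and the 3-dimensional vectors are invented for illustration; gensim's implementation is vectorised and may differ in detail.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=5):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical vectors, hand-picked so that education words cluster together.
vectors = {
    "university": [0.9, 0.8, 0.1],
    "college": [0.8, 0.9, 0.2],
    "school": [0.7, 0.6, 0.3],
    "lemon": [0.1, 0.2, 0.9],
    "towel": [0.2, 0.1, 0.8],
}

neighbours = most_similar("university", vectors, topn=2)
print([word for word, score in neighbours])
```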

In [14]:
brown_model.wv.most_similar('lemon', topn=5)

[('marble', 0.9657989740371704),
 ('pension', 0.9648841619491577),
 ('frankfurters', 0.9640069007873535),
 ('herd', 0.9626672267913818),
 ('towel', 0.9624289870262146)]

In [15]:
brown_model.wv.most_similar('government', topn=5)

[('nation', 0.9279634952545166),
 ('policy', 0.9250583648681641),
 ('Christian', 0.9233180284500122),
 ('power', 0.9231869578361511),
 ('education', 0.9204767346382141)]

## 2. Use pre-trained embeddings
We can load pre-built embeddings, e.g. a sample from a model trained on 100 billion words from the Google News Dataset

In [16]:
import nltk
nltk.download('word2vec_sample')

[nltk_data] Downloading package word2vec_sample to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping models/word2vec_sample.zip.


True

In [18]:
# load a pre-built model
from nltk.data import find
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
news_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)
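
`binary=False` tells gensim that the file is in the plain-text word2vec format: a header line with the vocabulary size and the number of dimensions, then one line per word with the word followed by its vector components. The parser below is a minimal sketch of that layout; the function name and the two-word sample are made up for illustration, and gensim's own loader handles more cases (for example the binary format and encoding options).

```python
import io

def read_word2vec_text(handle):
    """Parse plain-text word2vec data from an open text file handle."""
    # Header line: "<vocabulary size> <vector dimensions>"
    vocab_size, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError(f"expected {vocab_size} words, found {len(vectors)}")
    return vectors, dim

# A tiny in-memory sample standing in for a real embeddings file.
sample = io.StringIO("2 3\nlemon 0.1 0.2 0.3\ntowel 0.4 0.5 0.6\n")
vectors, dim = read_word2vec_text(sample)
print(dim, sorted(vectors))
```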

In [19]:
# how many terms?
len(news_model.key_to_index)

43981

In [20]:
# how many dimensions?
len(news_model['university'])

300

In [21]:
# are they any better?
news_model.most_similar(positive=['university'], topn = 5)

[('universities', 0.7003918290138245),
 ('faculty', 0.6780906915664673),
 ('undergraduate', 0.6587096452713013),
 ('campus', 0.6434987783432007),
 ('college', 0.638526976108551)]

In [22]:
news_model.most_similar(positive=['lemon'], topn = 5)

[('lemons', 0.646256148815155),
 ('apricot', 0.6199417114257812),
 ('avocado', 0.5922889113426208),
 ('fennel', 0.5873182415962219),
 ('coriander', 0.5828487277030945)]

In [23]:
news_model.most_similar(positive=['government'], topn = 5)

[('Government', 0.7132059335708618),
 ('governments', 0.6521531939506531),
 ('administration', 0.5462369322776794),
 ('legislature', 0.5307289361953735),
 ('parliament', 0.5268454551696777)]

## 3. Perform vector algebra
We can use embeddings to perform verbal reasoning, e.g. A is to B as C is to...  
e.g. 'man is to king as woman is to...'  
vec(“king”) - vec(“man”) + vec(“woman”) ≈ vec(“queen”)
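
The arithmetic can be sketched on toy vectors: add the positive terms, subtract the negative ones, then look for the nearest remaining word. The 2-dimensional vectors below are invented so that one axis roughly tracks "royalty" and the other "female". The `analogy` function is a hypothetical illustration; gensim's `most_similar` may differ in details such as how vectors are normalised and weighted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive, negative, vectors, topn=1):
    """Add the positive vectors, subtract the negative ones, and return
    the words closest to the result, excluding the query words themselves."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    exclude = set(positive) | set(negative)
    scored = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Hypothetical vectors: axis 0 ~ "royalty", axis 1 ~ "female".
vectors = {
    "king": [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man": [0.1, 0.1],
    "woman": [0.1, 0.9],
    "lemon": [0.2, -0.6],
}

result = analogy(positive=["king", "woman"], negative=["man"], vectors=vectors)
print(result[0][0])
```

Excluding the query words matters: without it, the nearest neighbour of the combined vector is often one of the inputs.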

In [24]:
news_model.most_similar(positive=['woman','king'], negative=['man'], topn = 5)

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902430415153503),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236843228340149)]

In [25]:
# same query with the positive terms swapped: the order does not change the result
news_model.most_similar(positive=['king', 'woman'], negative=['man'], topn = 5)

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902430415153503),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236843228340149)]

In [26]:
# encyclopaedic knowledge
news_model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 5)

[('France', 0.7884091734886169),
 ('Belgium', 0.6197876930236816),
 ('Spain', 0.5664774179458618),
 ('Italy', 0.5654898881912231),
 ('Switzerland', 0.560969352722168)]

In [27]:
# syntactic patterns (verbs)
news_model.most_similar(positive=['has','be'], negative=['have'], topn = 5)

[('is', 0.6774996519088745),
 ('was', 0.5710028409957886),
 ('remains', 0.47552669048309326),
 ('been', 0.4538103938102722),
 ('being', 0.4456518888473511)]

In [28]:
# syntactic patterns (adjectives)
news_model.most_similar(positive=['longest','short'], negative=['long'], topn = 5)

[('shortest', 0.5145131945610046),
 ('steepest', 0.42448338866233826),
 ('first', 0.40251171588897705),
 ('flattest', 0.40171927213668823),
 ('consecutive', 0.39518705010414124)]

In [29]:
# more encyclopaedic knowledge
news_model.most_similar(positive=['blue','tulip'], negative=['sky'], topn = 5)

[('purple', 0.5252774953842163),
 ('tulips', 0.4938238859176636),
 ('brown', 0.4907749593257904),
 ('pink', 0.4860530197620392),
 ('maroon', 0.48056456446647644)]

In [30]:
# syntactic knowledge (pronouns)
news_model.most_similar(positive=['him','she'], negative=['he'], topn = 5)

[('her', 0.804938554763794),
 ('herself', 0.6881042718887329),
 ('me', 0.5886672139167786),
 ('She', 0.5803765058517456),
 ('woman', 0.5470799207687378)]

In [31]:
# lexical knowledge
news_model.most_similar(positive=['light','long'], negative=['dark'], topn = 5)

[('short', 0.4077243208885193),
 ('longer', 0.3670078217983246),
 ('lengthy', 0.36229580640792847),
 ('Long', 0.36003556847572327),
 ('continuous', 0.34982720017433167)]

In [32]:
# Find the odd one out
news_model.doesnt_match('breakfast cereal dinner lunch'.split())

'cereal'
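
One plausible way to find the odd one out is to average the (length-normalised) vectors of all the words and pick the word least similar to that average. The sketch below does this on invented 2-dimensional vectors; the function name and the values are assumptions for illustration, and gensim's `doesnt_match` may differ in detail.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean unit vector."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # For unit vectors the dot product equals the cosine similarity.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hypothetical vectors: the three meals point one way, 'cereal' another.
vectors = {
    "breakfast": [0.9, 0.2],
    "lunch": [0.8, 0.3],
    "dinner": [0.85, 0.25],
    "cereal": [0.3, 0.9],
}

result = odd_one_out("breakfast cereal dinner lunch".split(), vectors)
print(result)
```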

### Quiz questions + homework

In [33]:
news_model.most_similar(positive=['be','has'], negative=['is'], topn = 5)

[('have', 0.7298133969306946),
 ('Had', 0.5585500597953796),
 ('subsequently', 0.54400634765625),
 ('had', 0.5378279685974121),
 ('recently', 0.5339969992637634)]

In [34]:
brown_model.wv.most_similar('university', topn=5)

[('membership', 0.9542639255523682),
 ('profession', 0.9534994959831238),
 ('neighborhood', 0.9527466297149658),
 ('congregation', 0.9513265490531921),
 ('selection', 0.9480083584785461)]

In [35]:
news_model.most_similar('university', topn=5)

[('universities', 0.7003918290138245),
 ('faculty', 0.6780906915664673),
 ('undergraduate', 0.6587096452713013),
 ('campus', 0.6434987783432007),
 ('college', 0.638526976108551)]

In [36]:
brown_model.wv.most_similar('college', topn=5)

[('university', 0.9233660101890564),
 ('treatment', 0.9230210781097412),
 ('selection', 0.9215536713600159),
 ('persons', 0.9178534150123596),
 ('generations', 0.9160250425338745)]

In [37]:
news_model.most_similar('college', topn=5)

[('colleges', 0.6560819149017334),
 ('university', 0.6385270357131958),
 ('school', 0.6081898808479309),
 ('collegiate', 0.6081600189208984),
 ('undergraduate', 0.5866836905479431)]

In [38]:
news_model.similarity('university','turtle')

0.054503173

In [39]:
news_model.similarity('university', 'school')

0.5080746

In [40]:
news_model.similarity('university', 'factory')

0.13740103

In [41]:
news_model.similarity('university', 'supermarket')

0.13598306
