# Word embeddings

## An alternative approach
Can we define words by the company they keep?  
"If A and B have almost identical environments we can say that they are synonyms" (Zellig Harris, 1954)

## Vector representations
One-hot encodings are long and sparse  
Alternative: **dense vectors**  
short (length 50-1000) and dense (most elements are non-zero)

### Benefits
1. Easier to use in ML  
2. Offer better generalization capabilities
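
The contrast between one-hot and dense vectors can be sketched with a toy example. The three-word vocabulary and the dense values below are invented for illustration only; real embeddings are learned from data, as the following sections show.

```python
# Toy vocabulary; the words and dense values are made up for illustration.
vocab = ["university", "college", "lemon"]

def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    return [1.0 if w == word else 0.0 for w in vocab]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 3-dimensional dense vectors (hand-written, not learned).
dense = {
    "university": [0.9, 0.8, 0.1],
    "college":    [0.8, 0.9, 0.2],
    "lemon":      [0.1, 0.0, 0.9],
}

# One-hot: any two distinct words have dot product 0,
# so no word looks more similar to one word than to another.
one_hot_score = dot(one_hot("university"), one_hot("college"))

# Dense: related words can point in similar directions.
dense_related = dot(dense["university"], dense["college"])
dense_unrelated = dot(dense["university"], dense["lemon"])

print(one_hot_score, dense_related > dense_unrelated)
```

With one-hot vectors, a model that has learned something about "university" gets no help with "college"; with dense vectors, nearby words can share what was learned.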

## 1. Train your own embeddings
Instead of counting terms, we train a classifier on a prediction task: 'does A occur near B?'  
The learned classifier weights become our embeddings.  
We can create our own word embeddings using gensim (an open-source library for unsupervised topic modeling and NLP)  
and train them on the Brown corpus.  
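
To make the 'does A occur near B?' task concrete, here is a minimal sketch of how positive training pairs could be collected from a sentence with a sliding window. The helper `context_pairs` and the example sentence are hypothetical; this illustrates the idea only and is not gensim's internal implementation, which offers several training variants.

```python
def context_pairs(tokens, window=2):
    """Yield (target, context) pairs for words at most `window` positions apart."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield target, tokens[j]

sentence = ["the", "student", "attends", "university"]
pairs = list(context_pairs(sentence, window=1))
print(pairs)
```

Each pair is a positive example ("these two words do occur near each other"). Word2vec-style training with negative sampling additionally draws random word pairs as negative examples, so the classifier has something to contrast against.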

In [4]:
# set up libraries & data
import gensim
import nltk
from nltk.corpus import brown
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [5]:
# train the model
model = gensim.models.Word2Vec(brown.sents())

In [6]:
# save a copy, for later re-use
model.save('brown.embedding')

In [7]:
# we can load models on demand
brown_model = gensim.models.Word2Vec.load('brown.embedding')

In [8]:
# how many words? (gensim 3.x API; gensim 4+ exposes the vocabulary as wv.key_to_index)
len(brown_model.wv.vocab)

15173

In [9]:
# how many dimensions? (inspect the vector; its length is the dimensionality)
brown_model.wv['university']

array([-0.00802389, -0.3220791 ,  0.06604723, -0.05892321, -0.07192575,
        0.01295918,  0.2520807 , -0.10333798, -0.06010169,  0.00328946,
        0.1397381 ,  0.09316388, -0.04719595,  0.02978518,  0.23821244,
       -0.07591572,  0.06442582, -0.40731215,  0.1494775 ,  0.02817419,
        0.10142963,  0.08470541, -0.03422831, -0.14311703, -0.01740327,
        0.20569564, -0.33188426,  0.02663272,  0.26308233, -0.18549116,
        0.11897259,  0.12331799, -0.34421152,  0.4221925 ,  0.18478112,
        0.12846982,  0.00246079, -0.13260533, -0.12906106,  0.01467364,
        0.26817647, -0.12917444, -0.48004118,  0.5958531 , -0.05170608,
        0.19868603, -0.463152  ,  0.06841221, -0.0343422 ,  0.13069806,
       -0.18202719, -0.04968045,  0.11151163,  0.12091953,  0.08191173,
       -0.57515806,  0.3172483 , -0.19306669, -0.01037446,  0.13933685,
       -0.13659793, -0.21690078,  0.00756574, -0.08519218, -0.2516058 ,
       -0.1576002 , -0.08060304, -0.14884232,  0.13748868, -0.13

In [10]:
# calculate similarity between terms
brown_model.wv.similarity('university','school')

0.84121317
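
The similarity score reported here is the cosine of the angle between the two word vectors. A minimal pure-Python sketch follows, using made-up 2-dimensional vectors and assuming neither vector is all zeros:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (assumes both are non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative vectors, not taken from any trained model.
a = [1.0, 0.0]
b = [1.0, 1.0]   # 45 degrees away from a
c = [-1.0, 0.0]  # opposite direction to a

print(cosine_similarity(a, a))  # same direction
print(cosine_similarity(a, b))  # partially aligned
print(cosine_similarity(a, c))  # opposite direction
```

Cosine similarity ranges from -1 to 1 and ignores vector length, so it compares directions only.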

In [11]:
# find similar terms
brown_model.wv.most_similar('university', topn=5)

[('membership', 0.9640309810638428),
 ('neighborhood', 0.9619176983833313),
 ('selection', 0.959403932094574),
 ('payment', 0.9581350088119507),
 ('battle', 0.9574347734451294)]
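
Conceptually, `most_similar` ranks every other word in the vocabulary by cosine similarity to the query word and returns the top `topn`. The sketch below mimics that with a hypothetical `toy_vectors` dictionary of hand-written values; it is a simplified illustration, not gensim's optimized implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(word, vectors, topn=5):
    """Rank all other words by cosine similarity to `word`, best first."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hand-written toy vectors, for illustration only.
toy_vectors = {
    "university": [0.9, 0.8, 0.1],
    "college":    [0.8, 0.9, 0.2],
    "school":     [0.7, 0.6, 0.3],
    "lemon":      [0.1, 0.0, 0.9],
}

ranked = most_similar("university", toy_vectors, topn=2)
print(ranked)
```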

In [12]:
brown_model.wv.most_similar('lemon', topn=5)

[('frankfurters', 0.9775074124336243),
 ('marble', 0.9770973920822144),
 ('dealer', 0.9745524525642395),
 ('panel', 0.9719914197921753),
 ('Swiss', 0.9719628095626831)]

In [13]:
brown_model.wv.most_similar('government', topn=5)

[('education', 0.9351464509963989),
 ('Christian', 0.9324058294296265),
 ('power', 0.9306133985519409),
 ('policy', 0.9276884198188782),
 ('national', 0.9238537549972534)]

## 2. Use pre-trained embeddings
We can load pre-built embeddings, e.g. a sample from a model trained on 100 billion words from the Google News Dataset

In [14]:
import nltk
nltk.download('word2vec_sample')

[nltk_data] Downloading package word2vec_sample to /root/nltk_data...
[nltk_data]   Unzipping models/word2vec_sample.zip.


True

In [15]:
# load a pre-built model
from nltk.data import find
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
news_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

In [16]:
# how many terms? (gensim 3.x API; in gensim 4+ use len(news_model.key_to_index))
len(news_model.vocab)

43981

In [17]:
# how many dimensions?
len(news_model['university'])

300

In [18]:
# are they any better?
news_model.most_similar(positive=['university'], topn = 5)

[('universities', 0.7003918886184692),
 ('faculty', 0.6780906915664673),
 ('undergraduate', 0.6587096452713013),
 ('campus', 0.6434987783432007),
 ('college', 0.638526976108551)]

In [19]:
news_model.most_similar(positive=['lemon'], topn = 5)

[('lemons', 0.646256148815155),
 ('apricot', 0.6199417114257812),
 ('avocado', 0.5922889113426208),
 ('fennel', 0.5873183012008667),
 ('coriander', 0.5828486680984497)]

In [20]:
news_model.most_similar(positive=['government'], topn = 5)

[('Government', 0.7132059931755066),
 ('governments', 0.6521531343460083),
 ('administration', 0.5462368726730347),
 ('legislature', 0.5307289361953735),
 ('parliament', 0.5268454551696777)]

## 3. Perform vector algebra
We can use embeddings to solve analogies of the form 'A is to B as C is to ...',  
e.g. 'man is to king as woman is to ...'  
vec(“king”) - vec(“man”) + vec(“woman”) ≈ vec(“queen”)
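
The arithmetic can be sketched with hand-made 2-dimensional vectors whose axes loosely stand for "royalty" and "gender". The vectors and the `analogy` helper are hypothetical and chosen so that the example works out exactly; real embeddings only approximate this. As a simplification, the sketch combines raw vectors, whereas gensim normalizes them first; like gensim, it excludes the input words from the candidates.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Solve 'a is to b as c is to ?' via the word nearest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three input words, otherwise one of them often wins.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Axes: [royalty, gender]; values are invented for illustration.
toy_vectors = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0, 1.0],
    "woman":  [0.0, -1.0],
    "prince": [1.0, 0.9],
}

answer = analogy("man", "king", "woman", toy_vectors)
print(answer)
```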

In [21]:
news_model.most_similar(positive=['woman','king'], negative=['man'], topn = 5)

[('queen', 0.7118192911148071),
 ('monarch', 0.6189673542976379),
 ('princess', 0.5902431011199951),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236842632293701)]

In [22]:
# the order of the positive terms does not change the result
news_model.most_similar(positive=['king', 'woman'], negative=['man'], topn = 5)

[('queen', 0.7118192911148071),
 ('monarch', 0.6189673542976379),
 ('princess', 0.5902431011199951),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236842632293701)]

In [23]:
# encyclopaedic knowledge
news_model.most_similar(positive=['Paris','Germany'], negative=['Berlin'], topn = 5)

[('France', 0.7884092330932617),
 ('Belgium', 0.6197876930236816),
 ('Spain', 0.5664774179458618),
 ('Italy', 0.5654898881912231),
 ('Switzerland', 0.560969352722168)]

In [24]:
# syntactic patterns (verbs)
news_model.most_similar(positive=['has','be'], negative=['have'], topn = 5)

[('is', 0.6774996519088745),
 ('was', 0.5710027813911438),
 ('remains', 0.47552669048309326),
 ('been', 0.4538104236125946),
 ('being', 0.4456518888473511)]

In [25]:
# syntactic patterns (adjectives)
news_model.most_similar(positive=['longest','short'], negative=['long'], topn = 5)

[('shortest', 0.5145131349563599),
 ('steepest', 0.42448338866233826),
 ('first', 0.40251171588897705),
 ('flattest', 0.4017193019390106),
 ('consecutive', 0.3951870799064636)]

In [26]:
# more encyclopaedic knowledge
news_model.most_similar(positive=['blue','tulip'], negative=['sky'], topn = 5)

[('purple', 0.5252774953842163),
 ('tulips', 0.4938238859176636),
 ('brown', 0.490774929523468),
 ('pink', 0.4860530495643616),
 ('maroon', 0.48056459426879883)]

In [27]:
# syntactic knowledge (pronouns)
news_model.most_similar(positive=['him','she'], negative=['he'], topn = 5)

[('her', 0.8049386143684387),
 ('herself', 0.6881042718887329),
 ('me', 0.5886671543121338),
 ('She', 0.5803765058517456),
 ('woman', 0.5470799207687378)]

In [28]:
# lexical knowledge
news_model.most_similar(positive=['light','long'], negative=['dark'], topn = 5)

[('short', 0.4077243208885193),
 ('longer', 0.3670077919960022),
 ('lengthy', 0.36229580640792847),
 ('Long', 0.3600355386734009),
 ('continuous', 0.34982722997665405)]

In [29]:
# Find the odd one out
news_model.doesnt_match('breakfast cereal dinner lunch'.split())

'cereal'
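
One way to find the odd one out is to average the (length-normalized) vectors of all the words and pick the word that is least similar to that average. The sketch below applies this to hand-written 2-dimensional toy vectors; the `odd_one_out` helper and the values are hypothetical, and the real method works on the model's learned vectors.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least aligned with the group mean."""
    unit = {w: normalize(vectors[w]) for w in words}
    dims = len(next(iter(unit.values())))
    mean = [sum(unit[w][d] for w in words) / len(words) for d in range(dims)]
    # Lowest dot product with the mean = furthest from the group's centre.
    return min(words, key=lambda w: sum(a * b for a, b in zip(unit[w], mean)))

# Invented vectors: the three meals point one way, 'cereal' another.
toy_vectors = {
    "breakfast": [0.9, 0.2],
    "lunch":     [0.8, 0.3],
    "dinner":    [0.85, 0.25],
    "cereal":    [0.2, 0.9],
}

result = odd_one_out("breakfast cereal dinner lunch".split(), toy_vectors)
print(result)
```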

### Quiz questions + homework

In [None]:
news_model.most_similar(positive=['be','has'], negative=['is'], topn = 5)

[('have', 0.7298133373260498),
 ('Had', 0.5585501194000244),
 ('subsequently', 0.54400634765625),
 ('had', 0.5378279685974121),
 ('recently', 0.5339970588684082)]

In [None]:
brown_model.wv.most_similar('university', topn=5)

[('frankfurter', 0.9596022367477417),
 ('craft', 0.9578697085380554),
 ('cafeteria', 0.9577983617782593),
 ('railroads', 0.95762038230896),
 ('enjoyment', 0.9542069435119629)]

In [None]:
news_model.most_similar('university', topn=5)

[('universities', 0.7003918290138245),
 ('faculty', 0.6780907511711121),
 ('undergraduate', 0.6587096452713013),
 ('campus', 0.6434987187385559),
 ('college', 0.6385269165039062)]

In [None]:
brown_model.wv.most_similar('college', topn=5)

[('minute', 0.9388023018836975),
 ('jazz', 0.9376934766769409),
 ('treatment', 0.9367307424545288),
 ('search', 0.9360219240188599),
 ('prices', 0.9350869059562683)]

In [None]:
news_model.most_similar('college', topn=5)

[('colleges', 0.6560818552970886),
 ('university', 0.6385270357131958),
 ('school', 0.6081898212432861),
 ('collegiate', 0.6081600189208984),
 ('undergraduate', 0.5866836309432983)]

In [None]:
brown_model.wv.similarity('university','turtle')

0.8196627

In [30]:
news_model.similarity('university', 'school')

0.50807464

In [None]:
news_model.similarity('university', 'factory')

0.13740103

In [None]:
news_model.similarity('university', 'supermarket')

0.13598306

In [None]:
news_model.similarity('university', 'turtle')

0.05450318