# Word embeddings

## An alternative approach
Can we define words by the company they keep?  
"If A and B have almost the identical environments we can say that they are synonyms" (Zelig Harris, 1954)

## Vector representations
One-hot encodings are long and sparse  
Alternative: **dense vectors**  
short (length 50-1000) + dense (most elements are non-zero)
### Benefits
1. Easier to use in ML  
2. Offer better generalization capabilities
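
To make the contrast concrete, here is a minimal sketch using NumPy. The five-word vocabulary and the dense values are invented for illustration; they are not learned from any corpus.

```python
import numpy as np

# Hypothetical toy vocabulary (illustration only)
vocab = ["university", "college", "school", "lemon", "turtle"]

# One-hot: one dimension per vocabulary word, a single 1 per vector
one_hot = np.eye(len(vocab))

# Any two different words have dot product 0, so one-hot vectors
# say nothing about which words are related
uni, col = vocab.index("university"), vocab.index("college")
one_hot_overlap = float(one_hot[uni] @ one_hot[col])

# Dense: a few real-valued dimensions per word (values invented here)
dense = np.array([
    [0.9, 0.8, 0.1],   # university
    [0.8, 0.9, 0.2],   # college
    [0.7, 0.6, 0.3],   # school
    [0.1, 0.2, 0.9],   # lemon
    [0.0, 0.3, 0.8],   # turtle
])
dense_overlap = float(dense[uni] @ dense[col])

print(one_hot.shape, one_hot_overlap)
print(dense.shape, round(dense_overlap, 2))
```

With a realistic vocabulary the one-hot vectors would have tens of thousands of dimensions, while the dense vectors would stay at a few hundred.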

## 1. Train your own embeddings
Instead of counting terms, we train a classifier on a prediction task: does A occur near B?  
The learned classifier weights become our embeddings.  
We can train our own word embeddings with gensim (an open-source library for unsupervised topic modeling and NLP), here on the Brown corpus.  
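
The "does A occur near B" task needs training examples. A minimal sketch of how positive examples could be collected from a sentence is shown here. The function name `context_pairs`, the window size, and the example sentence are assumptions for illustration; gensim's actual training does more than this (for instance, it also needs examples of words that do not occur together).

```python
def context_pairs(sentence, window=2):
    """Return (target, context) pairs for words within `window` positions."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, sentence[j]))
    return pairs

# Hypothetical example sentence, already tokenized
sentence = ["the", "university", "opened", "a", "new", "library"]
pairs = context_pairs(sentence, window=1)
print(pairs[:4])
```

Each pair is one "yes, these occur near each other" example for the classifier.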

In [5]:
# set up libraries & data
import gensim
import nltk
from nltk.corpus import brown
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [6]:
# train the model with gensim's defaults
# (100-dimensional vectors; words seen fewer than 5 times are dropped)
brown_model = gensim.models.Word2Vec(brown.sents())

In [7]:
# with write permissions you could save a copy for later re-use
# brown_model.save('brown.embedding')
# then load models on demand
# brown_model = gensim.models.Word2Vec.load('brown.embedding')

In [8]:
# how many words?
# (in gensim >= 4.0 use len(brown_model.wv.key_to_index) instead)
len(brown_model.wv.vocab)

15173

In [9]:
# how many dimensions? look at the vector for one word
# (index via .wv; indexing the model directly is deprecated)
brown_model.wv['university']


array([-0.1547524 ,  0.24442221, -0.00260162, -0.01158985,  0.02937359,
       -0.21278664,  0.11085042,  0.32451573,  0.15049177,  0.15711203,
       -0.23298538,  0.32490495, -0.202932  ,  0.15748681,  0.12574184,
       -0.14732419,  0.5301341 ,  0.14839508, -0.15313883, -0.25825414,
       -0.05845566, -0.08132599,  0.21642452,  0.2368574 ,  0.24337627,
       -0.4523013 , -0.09021856,  0.00466608,  0.28294778, -0.15896355,
        0.23846267,  0.01632281,  0.05606301,  0.08980975,  0.3702068 ,
        0.00096992, -0.04549228, -0.13190529,  0.3555484 , -0.05920343,
       -0.00702829, -0.0497788 , -0.12982413,  0.15383112,  0.00960286,
        0.2505941 , -0.40065917,  0.22192399, -0.18494073, -0.25310516,
        0.17462648, -0.05410967,  0.02853094, -0.0895878 ,  0.20794699,
       -0.20602362,  0.06715866,  0.18540701, -0.35270718, -0.06614354,
        0.16386633, -0.00570062,  0.04548644, -0.05183437, -0.0554895 ,
        0.21736522, -0.22498292,  0.06570208, -0.27874961, -0.28

In [10]:
# calculate similarity between terms
brown_model.wv.similarity('university','school')

0.8463591
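
The score returned by `similarity` is the cosine similarity of the two word vectors. A minimal NumPy sketch of that calculation follows; the helper name `cosine_similarity` and the two-dimensional inputs are illustrative only.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors chosen so the results are easy to check by hand
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Because the measure depends only on direction, the length of a vector does not affect the score.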

In [11]:
# find similar terms
brown_model.wv.most_similar('university', topn=5)

[('membership', 0.9634189009666443),
 ('neighborhood', 0.9625349044799805),
 ('selection', 0.9584420919418335),
 ('battle', 0.9573919177055359),
 ('congregation', 0.9569629430770874)]

In [12]:
brown_model.wv.most_similar('lemon', topn=5)

[('frankfurters', 0.9746100306510925),
 ('marble', 0.9745819568634033),
 ('dealer', 0.974488377571106),
 ('neat', 0.9741830825805664),
 ('pension', 0.9727776050567627)]

In [13]:
brown_model.wv.most_similar('government', topn=5)

[('education', 0.9376094341278076),
 ('Christian', 0.9356517791748047),
 ('power', 0.9337419867515564),
 ('policy', 0.9314901232719421),
 ('national', 0.9245917201042175)]
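
Conceptually, finding similar terms means scoring every word in the vocabulary against the query and keeping the best few. The sketch below shows that idea in NumPy. The function name `nearest_neighbours` and the two-dimensional toy vectors are assumptions for illustration; gensim's own implementation is more optimized and has more options.

```python
import numpy as np

def nearest_neighbours(word, words, vectors, topn=5):
    """Rank all other words by cosine similarity to `word`."""
    # Normalize rows so a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = unit @ unit[words.index(word)]
    order = np.argsort(-scores)  # highest score first
    ranked = [(words[i], float(scores[i])) for i in order if words[i] != word]
    return ranked[:topn]

# Invented toy vectors (not model output)
words = ["university", "college", "lemon", "apricot"]
vectors = np.array([
    [1.0, 0.1],   # university
    [0.9, 0.2],   # college
    [0.1, 1.0],   # lemon
    [0.2, 0.9],   # apricot
])
neighbours = nearest_neighbours("lemon", words, vectors, topn=2)
print([w for w, _ in neighbours])
```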

## 2. Use pre-trained embeddings
We can load pre-built embeddings, e.g. a sample from a model trained on about 100 billion words of the Google News dataset.

In [14]:
import nltk
nltk.download('word2vec_sample')

[nltk_data] Downloading package word2vec_sample to /root/nltk_data...
[nltk_data]   Unzipping models/word2vec_sample.zip.


True

In [15]:
# load a pre-built model
from nltk.data import find
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
news_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

In [16]:
# how many terms?
# (in gensim >= 4.0 use len(news_model.key_to_index) instead)
len(news_model.vocab)

43981

In [17]:
# how many dimensions?
len(news_model['university'])

300

In [18]:
# are they any better?
news_model.most_similar(positive=['university'], topn=5)

[('universities', 0.7003918886184692),
 ('faculty', 0.6780906915664673),
 ('undergraduate', 0.6587096452713013),
 ('campus', 0.6434987783432007),
 ('college', 0.638526976108551)]

In [19]:
news_model.most_similar(positive=['lemon'], topn=5)

[('lemons', 0.646256148815155),
 ('apricot', 0.6199417114257812),
 ('avocado', 0.5922889113426208),
 ('fennel', 0.5873183012008667),
 ('coriander', 0.5828486680984497)]

In [20]:
news_model.most_similar(positive=['government'], topn=5)

[('Government', 0.7132059931755066),
 ('governments', 0.6521531343460083),
 ('administration', 0.5462368726730347),
 ('legislature', 0.5307289361953735),
 ('parliament', 0.5268454551696777)]

### Your turn:
Find the 5 most similar words for each of the following:
university, college, school, factory, farm, pig, dog, cat, apple, lemon, table, chair

In [21]:
brown_model.wv.most_similar('university', topn=5)

[('membership', 0.9634189009666443),
 ('neighborhood', 0.9625349044799805),
 ('selection', 0.9584420919418335),
 ('battle', 0.9573919177055359),
 ('congregation', 0.9569629430770874)]

In [22]:
news_model.most_similar('university', topn=5)

[('universities', 0.7003918886184692),
 ('faculty', 0.6780906915664673),
 ('undergraduate', 0.6587096452713013),
 ('campus', 0.6434987783432007),
 ('college', 0.638526976108551)]

### Your turn:
Using each embedding model, rank the following words by their similarity to ‘university’: college, school, factory, supermarket, turtle  
Which model puts them in the most sensible order?

In [23]:
brown_model.wv.similarity('university','college')


0.9368422

In [24]:
news_model.similarity('university', 'college')

0.638527
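
One way to do the ranking exercise without comparing numbers by hand is a small helper that sorts candidates by score. The sketch below is an assumption-laden illustration: `rank_by_similarity` is a hypothetical helper, and `toy_scores` holds invented numbers so the code runs without a trained model.

```python
def rank_by_similarity(similarity, target, candidates):
    """Sort candidates from most to least similar to `target`."""
    scored = [(word, similarity(target, word)) for word in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Stand-in scores so the sketch is self-contained.
# These numbers are invented for illustration, not model output.
toy_scores = {"college": 0.6, "school": 0.5, "factory": 0.2,
              "supermarket": 0.15, "turtle": 0.05}

def toy_similarity(target, word):
    return toy_scores[word]

candidates = ["turtle", "school", "college", "supermarket", "factory"]
ranking = rank_by_similarity(toy_similarity, "university", candidates)
print([word for word, _ in ranking])
```

With the models loaded as in the cells above, `toy_similarity` could be replaced by `news_model.similarity` or `brown_model.wv.similarity`, since both take two words and return a score.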