# Word embeddings

## An alternative approach
Can we define words by the company they keep?  
"If A and B have almost identical environments we say that they are synonyms" (Zellig Harris, 1954)

## Vector representations
One-hot encodings are long and sparse  
Alternative: **dense vectors**  
short (length 50-1000) + dense (most elements are non-zero)
### Benefits
1. Easier to use in ML  
2. Offer better generalization capabilities
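
To illustrate the generalization point, here is a minimal sketch comparing one-hot and dense vectors. The three-word vocabulary and the dense values are made up for illustration, not learned from data.

```python
vocab = ["university", "school", "lemon"]

def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's position in the vocabulary."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# One-hot vectors of two different words never share a non-zero position,
# so every pair of distinct words looks equally unrelated.
print(dot(one_hot("university", vocab), one_hot("school", vocab)))  # 0
print(dot(one_hot("university", vocab), one_hot("lemon", vocab)))   # 0

# Made-up dense vectors (illustrative values only): related words can
# point in similar directions, unrelated words need not.
dense = {
    "university": [0.9, 0.8, 0.1],
    "school":     [0.8, 0.9, 0.2],
    "lemon":      [0.1, 0.2, 0.9],
}
print(dot(dense["university"], dense["school"]))
print(dot(dense["university"], dense["lemon"]))
```

With one-hot vectors a model has to learn about each word separately; with dense vectors, what it learns about 'university' can carry over to 'school'.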

## 1. Train your own embeddings
Instead of counting terms, we train a classifier on a prediction task: 'does A occur near B'?  
The learned classifier weights become our embeddings   
We can create our own word embeddings using gensim (an open-source library for unsupervised topic modeling and NLP)  
and train them on the Brown corpus  
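
The 'does A occur near B' task needs positive examples: pairs of words that really do appear close together. A minimal sketch of how such pairs might be collected from one tokenised sentence is shown here. The function name `context_pairs` and the `window` size are assumptions for illustration; this is not gensim's internal implementation, which also draws negative examples and applies further sampling tricks.

```python
def context_pairs(sentence, window=2):
    """Collect (target, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, target in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, sentence[j]))
    return pairs

sentence = ["the", "university", "opened", "today"]
pairs = context_pairs(sentence, window=1)
for target, context in pairs:
    print(target, "->", context)
```

A classifier trained to separate pairs like these from randomly drawn pairs ends up with weights that place words with similar contexts close together.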

In [None]:
# set up libraries & data
import gensim
import nltk
from nltk.corpus import brown
nltk.download('brown')  # fetch the Brown corpus if it is not already installed

In [None]:
# train the model
brown_model = gensim.models.Word2Vec(brown.sents())

In [None]:
# with write permissions you could save a copy for later re-use
# brown_model.save('brown.embedding')
# then load models on demand
# brown_model = gensim.models.Word2Vec.load('brown.embedding')

In [None]:
# how many words?
len(brown_model.wv.key_to_index)

In [None]:
# how many dimensions?
len(brown_model.wv['university'])

In [None]:
# calculate similarity between terms
brown_model.wv.similarity('university','school')

In [None]:
# find similar terms
brown_model.wv.most_similar('university', topn=5)
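
The similarity score used by gensim here is the cosine of the angle between the two word vectors, and finding similar terms amounts to ranking the rest of the vocabulary by that score. A minimal sketch in plain Python, using made-up three-dimensional vectors (`toy_vectors` and `nearest` are illustrative names, not part of gensim):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented vectors for illustration only
toy_vectors = {
    "university": [0.9, 0.8, 0.1],
    "college":    [0.9, 0.7, 0.1],
    "school":     [0.8, 0.9, 0.2],
    "lemon":      [0.1, 0.2, 0.9],
}

def nearest(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(cosine(toy_vectors["university"], toy_vectors["school"]))
print(nearest("university", toy_vectors, topn=2))
```

Because cosine similarity ignores vector length, two words count as similar when their vectors point in the same direction, regardless of how frequent either word is.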

In [None]:
brown_model.wv.most_similar('lemon', topn=5)

In [None]:
brown_model.wv.most_similar('government', topn=5)

## 2. Use pre-trained embeddings
We can load pre-built embeddings, e.g. a sample from a model trained on 100 billion words from the Google News Dataset

In [None]:
import nltk
nltk.download('word2vec_sample')

In [None]:
# load a pre-built model
from nltk.data import find
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
news_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)
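
The file loaded above is in the word2vec text format: a header line giving the vocabulary size and vector length, then one line per word with the word followed by its values. A rough sketch of a reader for that layout, run on a tiny in-memory sample with invented values (`read_word2vec_text` is an illustrative name; in practice `load_word2vec_format` does this work and more):

```python
import io

def read_word2vec_text(handle):
    """Parse word2vec text format: 'count dims' header, then 'word v1 v2 ...' lines."""
    header = handle.readline().split()
    vocab_size, dims = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dims:
            raise ValueError(f"expected {dims} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError(f"header promised {vocab_size} words, found {len(vectors)}")
    return vectors

# Tiny invented sample standing in for a real embeddings file
sample = io.StringIO("2 3\nuniversity 0.1 0.2 0.3\nlemon 0.4 0.5 0.6\n")
toy = read_word2vec_text(sample)
print(len(toy), len(toy["university"]))  # 2 3
```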

In [None]:
# how many terms?
len(news_model.key_to_index)

In [None]:
# how many dimensions?
len(news_model['university'])

In [None]:
# are they any better?
news_model.most_similar(positive=['university'], topn=5)

In [None]:
news_model.most_similar(positive=['lemon'], topn=5)

In [None]:
news_model.most_similar(positive=['government'], topn=5)

### Your turn:
Find the 5 most similar words for each of the following:
university, college, school, factory, farm, pig, dog, cat, apple, lemon, table, chair

In [None]:
brown_model.wv.most_similar('university', topn=5)

In [None]:
news_model.most_similar('university', topn=5)

### Your turn:
Using each embedding model, rank the degrees of similarity between the word ‘university’ and each of the following: college, school, factory, supermarket, turtle  
Which model puts them in the most sensible order?

In [None]:
brown_model.wv.similarity('university', 'college')

In [None]:
news_model.similarity('university', 'college')
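
Some of the exercise words may be missing from a model's vocabulary (for example, words too rare in the Brown corpus), in which case `similarity` raises a `KeyError`. One possible helper for the ranking task is sketched below. `rank_by_similarity` is an illustrative name, and `ToyVectors` is a hypothetical stand-in with made-up scores so the sketch runs without a trained model; it only mimics the `key_to_index` and `similarity` members of gensim's `KeyedVectors`.

```python
def rank_by_similarity(kv, target, candidates):
    """Sort candidates by similarity to `target`; list words the model does not know."""
    known = [w for w in candidates if w in kv.key_to_index]
    missing = [w for w in candidates if w not in kv.key_to_index]
    ranked = sorted(
        ((w, kv.similarity(target, w)) for w in known),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked, missing

class ToyVectors:
    """Hypothetical stand-in for KeyedVectors, with invented similarity scores."""
    def __init__(self, scores):
        self._scores = scores
        self.key_to_index = {w: i for i, w in enumerate(["university", *scores])}

    def similarity(self, w1, w2):
        # Returns the made-up score for the second word; w1 is ignored in this toy
        return self._scores[w2]

toy = ToyVectors({"college": 0.8, "school": 0.6, "factory": 0.2})
ranked, missing = rank_by_similarity(
    toy, "university", ["school", "turtle", "college", "factory"]
)
print(ranked)
print(missing)
```

With the real models, pass `brown_model.wv` or `news_model` as the first argument; the target word itself must also be in the vocabulary.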