# Word Embedding

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import spacy


from scipy import spatial

The cell below loads the model that contains the word embeddings. You will need to install it on your system _separately_ from spaCy. This is done with:

```bash
python -m spacy download en_core_web_lg
```

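You can choose `en_core_web_sm`, `en_core_web_md` or `en_core_web_lg` (small, medium, large). The larger models are generally more accurate. Note that the small model does not ship with static word vectors, so prefer `_md` or `_lg` for this notebook.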

In [None]:
# Uncomment the last line and run this cell once.
# It downloads a pretrained spaCy model with word vectors; this may take a while.
# !python -m spacy download en_core_web_lg

In [None]:
# load the downloaded model
nlp = spacy.load('en_core_web_lg')

`en_core_web_lg` gives us an embedding vector of length 300 for each word. This is the dimensionality of the embedding space, not the number of parameters in the model.

In [None]:
# find the vector of the word "hello" and print its shape
nlp('hello').vector.shape

The following cell measures the similarity of three words with each other (and with themselves).

In [None]:
tokens = nlp('cat lion pet')
for t1 in tokens:
    for t2 in tokens:
        print(t1.text, t2.text, t1.similarity(t2))
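
To illustrate what the loop above prints, here is a minimal sketch of a pairwise similarity table. It assumes the similarity is the cosine of the angle between two vectors, which is spaCy's default estimate for `similarity`. The 3-dimensional vectors and the names `toy_vectors`, `cosine` and `table` are made up for the example and are not real embeddings.

```python
import math

# Hypothetical toy vectors, NOT real spaCy embeddings.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "lion": [0.8, 0.3, 0.2],
    "pet": [0.7, 0.9, 0.0],
}

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same nested loop as in the cell above, but over the toy vectors.
table = {
    (w1, w2): cosine(v1, v2)
    for w1, v1 in toy_vectors.items()
    for w2, v2 in toy_vectors.items()
}

for (w1, w2), score in table.items():
    print(w1, w2, round(score, 3))
```

Each word has a similarity of 1 with itself, and the table is symmetric: the score for (cat, lion) equals the score for (lion, cat).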

###  <span style="color:red"> Exercise 1 </span>
Repeat the above with 3 words of your choice.

Below is a classic exercise. We take three words and look for other words that fit a relationship between them. For example, which words relate to food and fruit but not to burgers?

In [None]:
# Words to Vectors
food = nlp('food')
fruit = nlp('fruit')
burger = nlp('burger')

In [None]:
# How similar are these words?
food_fruit = food.similarity(fruit)
fruit_burger = fruit.similarity(burger)
food_burger = food.similarity(burger)

In [None]:
print(f"Food Fruit similarity: {food_fruit}")
print(f"Fruit Burger similarity: {fruit_burger}")
print(f"Food Burger similarity: {food_burger}")

In [None]:
# This vector is built by subtracting the burger vector from food and adding fruit
new_vector = food.vector - burger.vector + fruit.vector
new_vector
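
The arithmetic above is plain element-wise subtraction and addition. A minimal sketch with made-up 2-D vectors may help build intuition. Here the two axes are assumed to mean roughly "is food" and "is healthy", which is purely illustrative, since real embedding axes have no such clean meaning. The names `toy`, `add`, `sub` and `toy_new_vector` are hypothetical.

```python
# Hypothetical 2-D toy vectors: axis 0 ~ "is food", axis 1 ~ "is healthy".
toy = {
    "food": [1.0, 0.5],
    "burger": [1.0, 0.0],
    "fruit": [1.0, 1.0],
}

def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    """Element-wise difference of two equal-length vectors."""
    return [a - b for a, b in zip(u, v)]

# food - burger + fruit, computed element by element
toy_new_vector = add(sub(toy["food"], toy["burger"]), toy["fruit"])
print(toy_new_vector)  # [1.0, 1.5]
```

In this toy setting, subtracting `burger` removes what `food` shares with it, and adding `fruit` pushes the result further along the second axis.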

A helper function for measuring cosine similarity:

In [None]:
def cosine_similarity(vec1, vec2): 
    return 1 - spatial.distance.cosine(vec1, vec2)
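
`spatial.distance.cosine` returns the cosine *distance*, which is why the helper subtracts it from 1. One case worth keeping in mind is a word without an embedding, whose vector is all zeros: the cosine is then undefined because the formula divides by zero. Below is a minimal pure-Python sketch of a guarded variant, assuming we want 0.0 in that case. `safe_cosine_similarity` is a hypothetical name, not part of SciPy or spaCy.

```python
import math

def safe_cosine_similarity(vec1, vec2):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: treat "no vector" as "no similarity" rather than NaN or an error.
        return 0.0
    return dot / (norm1 * norm2)

print(safe_cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(safe_cosine_similarity([1.0, 2.0], [0.0, 0.0]))  # zero vector
```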

In [None]:
# run the below to have words in spacy cache for vocab
random_words = nlp('house good pizza cars fun chess vegetable fries of apple juice kingdom burger prince house wicked this banana home island car truck dog rice')

In [None]:
# Given a group of words see which word is closest

similarities = []
for word in random_words:
    if word.has_vector and word.is_alpha and word.is_lower:
        similarities.append((cosine_similarity(new_vector, word.vector), word.text, word))

In [None]:
print(sorted(similarities, reverse=True)[:10])
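
`random_words` contains `house` twice, so it also appears twice in `similarities`. In addition, plain `sorted` on tuples falls back to comparing the later tuple elements whenever two scores tie. Here is a small sketch of one way to handle both, using made-up scores instead of real similarities. `toy_similarities` and `top_unique` are hypothetical names.

```python
# Hypothetical (score, word) pairs standing in for the real similarities.
toy_similarities = [
    (0.61, "pizza"),
    (0.72, "banana"),
    (0.55, "burger"),
    (0.72, "banana"),  # duplicate entry
    (0.80, "apple"),
    (0.55, "burger"),  # duplicate entry
]

def top_unique(pairs, n=10):
    """Return the n best-scoring words, keeping each word only once."""
    best = {}
    for score, word in pairs:
        # Keep the highest score seen for each word.
        if word not in best or score > best[word]:
            best[word] = score
    # Sort on the score alone, highest first.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]

print(top_unique(toy_similarities, n=3))  # ['apple', 'banana', 'pizza']
```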

In [None]:
# For a vector like food - burger + fruit we would hope to see food- and
# fruit-related words ranked highest; check whether the output below agrees

for similarity, word, _ in sorted(similarities, reverse=True)[:20]:
    print(word)

###  <span style="color:red"> Exercise 2 </span>
Repeat the above by adding the vectors of 4 words of your choice. Hint: `list(nlp.vocab.strings)` gives you the full list of strings in the vocabulary.

If we stack word embedding vectors on top of one another, we can display them as an image. This lets us see which axes of the vector representations have similar values across words.

This is just an interesting thing to do really... we don't really know what these axes represent.

In [None]:
grid = []
for similarity, word, embedding in sorted(similarities, reverse=True)[:20]:
    grid.append(embedding.vector)
grid = np.array(grid)
print(grid.shape)

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))
ax.imshow(grid, interpolation='nearest', cmap="gray")

## Loading Text

We're going to load _Dracula_, which can be downloaded from Project Gutenberg [here](http://www.gutenberg.org/cache/epub/345/pg345.txt).

In [None]:
# read the whole book and close the file before running the pipeline on it
with open("pg345.txt", encoding="utf-8") as f:
    doc = nlp(f.read())

In [None]:
# unique alphabetic words in the text that have a vector
tokens = list(set([w.text for w in doc if w.is_alpha and w.has_vector]))

A helper function to get the word vector.

In [None]:
def vec(s):
    return nlp(s).vector 
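
Note that `nlp(s)` returns a `Doc`, and by default spaCy's `Doc.vector` is the average of its token vectors. For a single word this is just that word's vector, but for a phrase `vec` returns a mean. A toy sketch of that averaging follows, with made-up 3-dimensional vectors. `toy_token_vectors` and `average_vector` are hypothetical names.

```python
# Hypothetical toy token vectors, NOT real spaCy embeddings.
toy_token_vectors = {
    "blood": [1.0, 0.0, 2.0],
    "red": [3.0, 2.0, 0.0],
}

def average_vector(words, vectors):
    """Element-wise mean of the vectors of the given words."""
    rows = [vectors[w] for w in words]
    return [sum(column) / len(rows) for column in zip(*rows)]

print(average_vector(["blood"], toy_token_vectors))         # [1.0, 0.0, 2.0]
print(average_vector(["red", "blood"], toy_token_vectors))  # [2.0, 1.0, 1.0]
```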

Make sure things work (the cell below should return `True`):

In [None]:
# this line checks if the word "dog" is more similar to "puppy" than "trousers" to "octopus"
cosine_similarity(vec('dog'), vec('puppy')) > cosine_similarity(vec('trousers'), vec('octopus'))

In [None]:
def spacy_closest(token_list, vec_to_check, n=10):
    return sorted(
        token_list,
        key=lambda x: cosine_similarity(vec_to_check, vec(x)),
        reverse=True
    )[:n]
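
`spacy_closest` runs the full pipeline on every candidate word each time it is called, which can be slow for a large vocabulary. One possible alternative is to compute each vector once, store it in a dictionary, and use `heapq.nlargest` to pick the top `n`. The sketch below shows the idea with made-up 2-D vectors. `toy_word_vectors`, `cosine` and `closest_words` are hypothetical names.

```python
import heapq
import math

# Hypothetical precomputed vectors; in the notebook these could come from vec(word).
toy_word_vectors = {
    "blood": [1.0, 0.1],
    "vein": [0.9, 0.2],
    "garlic": [0.1, 1.0],
    "castle": [-0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closest_words(word_vectors, vec_to_check, n=10):
    """Return the n words whose stored vectors are most similar to vec_to_check."""
    return heapq.nlargest(
        n,
        word_vectors,
        key=lambda word: cosine(vec_to_check, word_vectors[word]),
    )

print(closest_words(toy_word_vectors, toy_word_vectors["blood"], n=2))
```

As with `spacy_closest`, the query word itself comes first when it is part of the candidate set.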

This will find the 30 closest words to "blood" in Dracula:

In [None]:
spacy_closest(tokens, vec("blood"), 30)

###  <span style="color:red"> Exercise 3 </span>
Try to do the same thing with a different text. Can you think of word + text pairs that will have a lot of similar words or not many? E.g. a collection of medieval poetry may not have many words similar to "computer"