# Ungraded lab 2: Manipulating word embeddings 

*Copyrighted material*

**Objectives:** Use NumPy functions to apply the most common linear algebra operations in Python

By the end of this module you will be able to:
* Use a pretrained word embedding to map words to vectors
* Use a pretrained word embedding to map a sentence to a vector
* Visualize word embeddings and explore their semantics by using PCA to reduce their dimensionality
* Create your own word embedding by training an Artificial Neural Network

**From Wikipedia:**

"_Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension._

_Methods to generate this mapping include neural networks,[1] dimensionality reduction on the word co-occurrence matrix,[2][3][4] probabilistic models,[5] explainable knowledge base method,[6] and explicit representation in terms of the context in which words appear.[7]_

_Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in NLP tasks such as syntactic parsing[8] and sentiment analysis.[9]_"

In [None]:
import pandas as pd
import csv
import numpy as np
import pickle

# Load the pretrained embedding subset; the context manager closes the file afterwards
with open("word_embeddings_subset.p", "rb") as f:
    word_embeddings = pickle.load(f)
len(word_embeddings) # there should be 243 words that will be used in this assignment

Now that the model is loaded, we can look at how words are represented. First, note that _word_embeddings_ is a dictionary. Each word is a key, and the value is its corresponding vector representation. From previous labs, you should remember that you can access any entry in the dictionary by using square brackets:

In [None]:
countryVector = word_embeddings['country'] # Get the vector representation for country
print(type(countryVector)) # Print the type of the vector. Note it is a numpy array
print(countryVector) # Print the values of the vector.  

It is important to note that each vector is stored as a NumPy array, which lets you apply linear algebra operations to it.

The vectors have a size of 300, while the vocabulary of the Google News embedding is around 3 million words!

In [None]:
#Get the vector for a word:
def vec(w):
    return word_embeddings[w]

## Operating on word embeddings

If word embeddings are vectors, then you can operate on them using linear algebra operators. But sometimes you need a visual representation of your data before you start using it.

In the next cell, you will plot the embeddings of some words. Even though plotting the dots gives you an idea of where the words lie, the arrow representation helps to visualize how well aligned the vectors are.

In [None]:
import matplotlib.pyplot as plt # Import matplotlib

words = ['oil', 'gas', 'happy', 'sad', 'city', 'town', 'village', 'country', 'continent', 'petroleum', 'joyful']

bag2d = np.array([vec(word) for word in words]) # Convert each word to its vector representation

fig, ax = plt.subplots(figsize = (10, 10)) # Create custom size image

col1 = 3 # Select the column for the x axis
col2 = 2 # Select the column for the y axis

# Draw an arrow for each word
for word in bag2d:
    ax.arrow(0, 0, word[col1], word[col2], head_width=0.005, head_length=0.005, fc='r', ec='r', width = 1e-5)

    
ax.scatter(bag2d[:, col1], bag2d[:, col2]); # Plot a dot for each word

# Add the word label over each dot in the scatter plot
for i in range(0, len(words)):
    ax.annotate(words[i], (bag2d[i, col1], bag2d[i, col2]))


plt.show()

Note that similar words, such as village and town, or petroleum, oil, and gas, tend to point in the same direction. Also note that even though sad and happy appear close to each other, their vectors point in opposite directions.

In this chart you can inspect the angles and distances between the words. Some words are close under both kinds of measure.
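
The two kinds of measure mentioned above, angle and distance, can be computed directly with NumPy. The following is a minimal sketch on toy 2-D vectors that are made up for illustration (they are not taken from the embedding); the helper names `cosine_similarity` and `euclidean_distance` are hypothetical and not part of the lab.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, -1 = opposite)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean_distance(u, v):
    """Length of the difference vector u - v."""
    return np.linalg.norm(u - v)

# Toy vectors, invented for this example only
a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, but longer
c = np.array([-1.0, 0.0])  # opposite direction to a

print(cosine_similarity(a, b), euclidean_distance(a, b))
print(cosine_similarity(a, c), euclidean_distance(a, c))
```

Here `a` and `b` point the same way even though they are one unit apart, while `a` and `c` point in opposite directions. The two measures answer different questions, which is why words can look close in the plot and still be poorly aligned.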

## Word distance

Now let's plot the words 'sad', 'happy', 'town', and 'village', and display the vector from village to town and the vector from sad to happy. All of this can be done with linear algebra operations in NumPy.

In [None]:
words = ['sad', 'happy', 'town', 'village']

bag2d = np.array([vec(word) for word in words]) # Convert each word to its vector representation

fig, ax = plt.subplots(figsize = (10, 10)) # Create custom size image

col1 = 3 # Select the column for the x axis
col2 = 2 # Select the column for the y axis

# Draw an arrow for each word
for word in bag2d:
    ax.arrow(0, 0, word[col1], word[col2], head_width=0.0005, head_length=0.0005, fc='r', ec='r', width = 1e-5)
    
# Draw the vector difference between village and town
village = vec('village')
town = vec('town')
diff = town - village
ax.arrow(village[col1], village[col2], diff[col1], diff[col2], fc='b', ec='b', width = 1e-5)

# Draw the vector difference between sad and happy
sad = vec('sad')
happy = vec('happy')
diff = happy - sad
ax.arrow(sad[col1], sad[col2], diff[col1], diff[col2], fc='b', ec='b', width = 1e-5)


ax.scatter(bag2d[:, col1], bag2d[:, col2]); # Plot a dot for each word

# Add the word label over each dot in the scatter plot
for i in range(0, len(words)):
    ax.annotate(words[i], (bag2d[i, col1], bag2d[i, col2]))


plt.show()


## Linear algebra on word embeddings

As you saw in this week's videos, you can perform algebra on word embeddings to find word analogies. Let's see how to do it in Python with NumPy.

For example, you can get the "norm" of a given word vector in the embedding:

In [None]:
print(np.linalg.norm(vec('town')))
print(np.linalg.norm(vec('sad')))
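
As a small illustration of what `np.linalg.norm` computes by default, the sketch below checks it against the square root of the sum of squares on a toy vector (invented for this example, not taken from the embedding). It also shows how dividing by the norm gives a unit-length vector; the variable names are illustrative only.

```python
import numpy as np

v = np.array([3.0, 4.0])  # toy vector, not a real word embedding

norm_builtin = np.linalg.norm(v)       # default: Euclidean (L2) norm
norm_manual = np.sqrt(np.sum(v * v))   # same quantity written out by hand

unit = v / norm_builtin                # same direction as v, length 1

print(norm_builtin, norm_manual)
print(unit, np.linalg.norm(unit))
```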

## Predicting capitals

Now, by applying vector difference and addition, you can create the vector representation for a new word. For example, we can say that the vector difference between France and Paris represents the concept of Capital.

So, you can start from the city Madrid, move in the direction of the concept of Capital, and obtain something you would expect to be close to the country of which Madrid is the capital.

In [None]:
capital = vec('France') - vec('Paris')
country = vec('Madrid') + capital

print(country[0:5]) # Print the first 5 values of the vector

We can observe that the vector 'country', which we expect to be the same as the vector for Spain, does not match it exactly.

In [None]:
diff = country - vec('Spain')
print(diff[0:10])

So, you have to look for the word in your embedding that is closest to the candidate country. If your word embedding works as expected, the most similar word should be Spain. Let's define a function that helps us do it. We will represent our word embedding as a DataFrame, which facilitates the lookup operations.

In [None]:
# Create a DataFrame out of the embedding dictionary. This facilitates the algebraic operations
keys = word_embeddings.keys()
data = []
for key in keys:
    data.append(word_embeddings[key])

embedding = pd.DataFrame(data=data, index=keys)
# Define a function to find the closest word to a vector:
def find_closest_word(v, k = 1):
    # Note: k is accepted but not used; only the single closest word is returned
    # Calculate the vector difference from each word to the input vector
    diff = embedding.values - v 
    # Get the squared norm of each difference vector,
    # i.e. the squared Euclidean distance from each word to the input vector
    delta = np.sum(diff * diff, axis=1)
    # Find the index of the minimum distance in the array
    i = np.argmin(delta)
    # Return the row name for this item
    return embedding.iloc[i].name
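
The function above takes a parameter `k` but always returns a single word. As a sketch of how the `k` closest words could be returned instead, the example below uses `np.argsort` on the same squared distances. The name `find_k_closest_words` and the toy DataFrame are assumptions made for illustration; they are not part of the lab.

```python
import numpy as np
import pandas as pd

# Toy embedding with made-up 2-D vectors, only for illustration
toy_embedding = pd.DataFrame(
    data=[[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]],
    index=['a', 'b', 'c', 'd'],
)

def find_k_closest_words(emb, v, k=1):
    """Return the k row names of emb closest to v (hypothetical helper)."""
    diff = emb.values - v                 # difference from each row to v
    delta = np.sum(diff * diff, axis=1)   # squared Euclidean distances
    nearest = np.argsort(delta)[:k]       # positions of the k smallest distances
    return list(emb.index[nearest])

closest_two = find_k_closest_words(toy_embedding, np.array([0.9, 0.1]), k=2)
print(closest_two)
```

Looking at several neighbours rather than one can make it easier to judge whether the expected word is at least near the top.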


In [None]:
# Print some rows of the embedding as a DataFrame
embedding.head(10)

Now let's find the name that corresponds to our numerical country:

In [None]:
find_closest_word(country)

## Predicting other Countries

In [None]:
find_closest_word(vec('Italy') - vec('Rome') + vec('Madrid'))

In [None]:
print(find_closest_word(vec('Berlin') + capital))
print(find_closest_word(vec('Beijing') + capital))

However, it does not always work. :(

In [None]:
print(find_closest_word(vec('Lisbon') + capital))
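
When the nearest neighbour is not the expected word, two variations are sometimes tried: ranking candidates by cosine similarity instead of Euclidean distance, and excluding the query words themselves from the candidates. The sketch below shows both on toy vectors invented for this example. The helper `closest_by_cosine` is hypothetical, and nothing here guarantees that it changes the result for the Lisbon case above.

```python
import numpy as np

# Toy vectors, made up for illustration (not from the embedding)
toy_vectors = {
    'x': np.array([1.0, 0.0]),
    'y': np.array([0.9, 0.1]),
    'z': np.array([0.0, 1.0]),
}

def closest_by_cosine(vectors, v, exclude=()):
    """Return the word whose vector has the highest cosine similarity to v,
    skipping any word listed in exclude (hypothetical helper)."""
    best_word, best_score = None, -np.inf
    for word, w in vectors.items():
        if word in exclude:
            continue
        score = np.dot(w, v) / (np.linalg.norm(w) * np.linalg.norm(v))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

query = np.array([2.0, 0.0])
print(closest_by_cosine(toy_vectors, query))
print(closest_by_cosine(toy_vectors, query, exclude=('x',)))
```

Without exclusions the best match is `'x'`, which points in exactly the same direction as the query; once `'x'` is excluded, the next best aligned word, `'y'`, is returned.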

## Represent a sentence as a vector

You can represent a whole sentence as a vector by summing the vectors of all the words in the sentence. Let's see.

In [None]:
doc = "Spain petroleum city king"
vdoc = [vec(x) for x in doc.split(" ")]
doc2vec = np.sum(vdoc, axis = 0)
doc2vec

In [None]:
find_closest_word(doc2vec)
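
The summing step can be wrapped in a small helper. The sketch below is one possible version under two assumptions that are not part of the lab: words missing from the vocabulary are skipped rather than raising a `KeyError`, and an optional `average` flag divides by the number of words so that longer sentences do not get larger vectors. The name `sentence_vector` and the toy vectors are invented for illustration.

```python
import numpy as np

# Toy vocabulary with made-up 2-D vectors
toy_vectors = {
    'good': np.array([1.0, 2.0]),
    'movie': np.array([3.0, 0.0]),
}

def sentence_vector(sentence, vectors, average=False):
    """Sum (or average) the vectors of the known words in a sentence.

    Assumption: unknown words are ignored instead of raising an error.
    """
    word_vecs = [vectors[w] for w in sentence.split() if w in vectors]
    if not word_vecs:
        raise ValueError("no known words in sentence")
    total = np.sum(word_vecs, axis=0)
    return total / len(word_vecs) if average else total

summed = sentence_vector("good movie", toy_vectors)
averaged = sentence_vector("a good movie", toy_vectors, average=True)
print(summed, averaged)
```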