# Understanding Word Embeddings for Neural Machine Translation – Exploring the Google News Model

In this notebook, you can further explore word embeddings for NMT using a much larger model than the one in the previous notebook on the fundamentals of word embeddings (which we'll call the *base notebook* from now on). Here, we'll explore Google's gigantic News model, which was trained on far more text and has a much larger vocabulary and a higher vector dimensionality than the glove-wiki-gigaword-100 model explored in the base notebook. Accordingly, the Google News model captures more, and more fine-grained, semantic relations between its words. Unlike the gigaword model in the base notebook, the Google News model can also be queried for multi-word phrases by joining the words with an underscore, e.g., *New_York*.

The code cells for exploring the model will remain unchanged, but the documentation will be much less explicit, since you are already familiar with the basic idea of word embeddings and word embedding models. You can always consult the base notebook for reference purposes.

## 0 Housekeeping
First, we run the same housekeeping steps as we did in the base notebook.

In [None]:
# Upgrade to the newest version of pip and install or upgrade the gensim library (if necessary)
!pip install --upgrade pip
!pip install --upgrade gensim

In this notebook, we'll use the pretrained word2vec-google-news-300 model, which is also provided in the [gensim-data](https://github.com/RaRe-Technologies/gensim-data) repository. Since this model was originally trained using word2vec, no prior GloVe → word2vec conversion was necessary in this case. The word2vec-google-news-300 model was trained on about 100 billion(!) words from the Google News dataset and has a vocabulary of about 3 million unique words and phrases. The corresponding vectors have a dimensionality of 300. The model size is about 1.66 GB, so loading it will take quite some time. With an internet connection speed of 500 Mbit/s, loading the model takes about 10 minutes. So, run the code below and grab a coffee or two.

In [None]:
# Import the pretrained word2vec-google-news-300 model from gensim-data and store it in 'word_embeddings'
import gensim.downloader as api

word_embeddings = api.load('word2vec-google-news-300')

It took a while, but now you're ready to go!

## 1 Exploring the Google News model
Recall that the size of the embedding matrix of a word embedding model is the product of the vocabulary size and the dimensionality of the embedding vectors. So, the size of our embedding matrix is $3{,}000{,}000 \times 300 = 900{,}000{,}000$. In other words, the embedding matrix of the Google News model has about 900 million entries!

In [None]:
# Generate a list of the 150 most frequent words in the model
word_embeddings.index_to_key[:150]
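
To make the size of the embedding matrix more tangible, here is a back-of-the-envelope calculation. It is only a sketch: the vocabulary size is the rounded figure quoted above, and storing each value as a 32-bit float is an assumption made for illustration.

```python
# Back-of-the-envelope size of the embedding matrix.
# Assumption: each value is stored as a 32-bit float (4 bytes).
vocab_size = 3_000_000      # approximate vocabulary size quoted above
vector_dim = 300            # dimensionality of each word vector
bytes_per_value = 4         # float32 (assumed)

n_entries = vocab_size * vector_dim
size_gb = n_entries * bytes_per_value / 1e9

print(f"{n_entries:,} entries, roughly {size_gb:.1f} GB in memory")
```

Under these assumptions, the matrix needs several gigabytes of RAM once it is loaded, so make sure your machine has enough free memory before loading the model.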

## 2 Exploring individual word vectors
When we query our model/embedding matrix as a look-up table, it will return an array of 300 floating point numbers, corresponding to a 300-dimensional vector. In the code cell below, we ask the model to give us the word vector for the multi-word unit *New York*. Remember that the underscore (`_`) is used when you want to query the model for such multi-word units. Again, feel free to modify the code in order to explore other word vectors in the model.

In [None]:
# Display individual word vectors
word_embeddings['New_York']
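
The look-up idea can be sketched with a toy example. The vocabulary, the four-dimensional vectors and the helper `lookup()` below are hypothetical and only serve as an illustration. The dictionary name `key_to_index` mirrors the attribute of the same name in gensim, but here it is just a plain Python dictionary.

```python
import numpy as np

# Toy vocabulary and embedding matrix (hypothetical values, not taken from the real model)
key_to_index = {"New_York": 0, "father": 1, "car": 2}
embedding_matrix = np.array([
    [0.1, 0.3, -0.2, 0.5],
    [0.7, -0.1, 0.4, 0.0],
    [-0.3, 0.2, 0.9, -0.6],
], dtype=np.float32)

def lookup(word):
    """Return the row of the embedding matrix that belongs to `word`."""
    return embedding_matrix[key_to_index[word]]

vector = lookup("New_York")
print(vector.shape)  # one value per dimension
print(vector)
```

Each word is mapped to a row index, and the word vector is simply that row of the matrix. In the real model, the row has 300 values instead of 4.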

As mentioned in the base notebook, the standard transformer architecture for NMT works with a vector dimensionality of 512, so the vector representations in these systems will still be bigger than the vector representations in our big Google News model.

## 3 Exploring the most similar words
Exploring the most similar words requires some heavy calculation on the part of our model, so don't be surprised if the notebook is busy for a minute or two (subsequent queries will be much quicker). If you query the model for the words most similar to *father*, you'll see that the list now also contains a multi-word unit (*eldest_son*).

In [None]:
# Query the model for the words most similar to the word 'father'
word_embeddings.most_similar('father')
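
Why is this query computationally heavy? The query vector has to be compared with every other vector in the vocabulary. The following sketch illustrates the principle with cosine similarity on a handful of toy vectors. The vectors and the helper `most_similar_sketch()` are made up for illustration and are not gensim's actual implementation.

```python
import numpy as np

# Toy vectors (hypothetical values chosen for illustration only)
words = ["father", "son", "mother", "government"]
vectors = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.2],
    [0.9, 0.5, 0.0],
    [0.1, 0.0, 0.9],
])

def most_similar_sketch(query, topn=2):
    """Rank all other words by cosine similarity to `query`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = unit[words.index(query)]
    scores = unit @ q                      # cosine similarity to every word
    order = np.argsort(-scores)            # highest similarity first
    return [(words[i], float(scores[i])) for i in order if words[i] != query][:topn]

ranking = most_similar_sketch("father")
print(ranking)
```

With four words this is instantaneous. With about 3 million words and 300 dimensions, the same comparison involves a very large matrix, which is why the first query takes a while.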

## 4 Exploring the least similar words
Again, you can also query the model for words most dissimilar to another word, but the results may not make that much sense to us.

In [None]:
# Using the parameter 'negative=' reverses the most_similar() function
word_embeddings.most_similar(negative=['father'])

## 5 Calculating the similarity between two specific words

In [None]:
# Calculate semantic similarity between words using the similarity() function
word_embeddings.similarity('father', 'nephew')
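
The value returned by *similarity()* is the cosine similarity of the two word vectors. For two 300-dimensional vectors $\mathbf{a}$ and $\mathbf{b}$ (the symbols are introduced here for illustration only), it can be written as:

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_{i=1}^{300} a_i b_i}{\sqrt{\sum_{i=1}^{300} a_i^2}\,\sqrt{\sum_{i=1}^{300} b_i^2}}$$

The result lies between $-1$ and $1$. Values close to $1$ mean that the two vectors point in almost the same direction.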

In [None]:
# Import the libraries required to visualise the word embedding vectors
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Create the visual representations
plt.figure(figsize=(30,3))
sns.heatmap([word_embeddings["father"], word_embeddings["son"], word_embeddings["government"]], xticklabels=True,
            yticklabels=["father", "son", "government"], cbar=True, vmin=-1, vmax=1, linewidths=0.7)
plt.show()

## 6 Identifying semantic outliers

In [None]:
# Identify semantic outliers using the doesnt_match() function
word_embeddings.doesnt_match(['father', 'mother', 'uncle', 'car'])
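
One plausible way to find such an outlier is to average the normalised vectors of all words in the list and then pick the word that is least similar to this average. The sketch below follows this idea on toy vectors. The values and the helper `doesnt_match_sketch()` are hypothetical, and the sketch is only an approximation of what the library does internally.

```python
import numpy as np

# Toy vectors (hypothetical values chosen for illustration only)
toy_vectors = {
    "father": np.array([0.9, 0.8, 0.1]),
    "mother": np.array([0.9, 0.5, 0.0]),
    "uncle":  np.array([0.8, 0.9, 0.2]),
    "car":    np.array([0.1, 0.0, 0.9]),
}

def doesnt_match_sketch(words):
    """Return the word whose unit vector is least similar to the group mean."""
    unit = np.array([toy_vectors[w] / np.linalg.norm(toy_vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)
    scores = unit @ mean                   # cosine similarity to the mean direction
    return words[int(np.argmin(scores))]

outlier = doesnt_match_sketch(["father", "mother", "uncle", "car"])
print(outlier)
```

The three kinship terms point in a similar direction, so they dominate the mean, and *car* ends up furthest away from it.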

## 7 Identifying analogies
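An analogy of the form 'a is to b as c is to ?' is queried by passing b and c as positive words and a as the negative word. For 'Germany' is to 'Berlin' as 'France' is to '?', this means adding the vectors for *Berlin* and *France* and subtracting the vector for *Germany*. The argument order matters: passing *Germany* and *Berlin* as positive words and *France* as the negative word computes a different vector and will not return the expected city. As in the model used in the base notebook, analogy identification is not fully reliable even in our bigger Google News model, so have a go and try to find examples where the model identifies the correct analogy and examples where it fails.

In [None]:
# Identify top 3 analogy candidates
word_embeddings.most_similar(positive=['Berlin', 'France'], negative=['Germany'], topn=3)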
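
The vector arithmetic behind an analogy query can be sketched with two-dimensional toy vectors. All values and the helper `analogy_sketch()` are hypothetical. The sketch works on raw vectors and is a simplification of what gensim does.

```python
import numpy as np

# Toy 2-D vectors (hypothetical): first axis ~ "which country", second axis ~ "capital-ness"
toy_vectors = {
    "Germany": np.array([1.0, 0.0]),
    "Berlin":  np.array([1.0, 1.0]),
    "France":  np.array([-1.0, 0.0]),
    "Paris":   np.array([-1.0, 1.0]),
    "Rome":    np.array([0.0, 1.0]),
}

def analogy_sketch(a, b, c):
    """Solve 'a is to b as c is to ?' via the vector b - a + c."""
    target = toy_vectors[b] - toy_vectors[a] + toy_vectors[c]
    best_word, best_score = None, -np.inf
    for word, vec in toy_vectors.items():
        if word in (a, b, c):              # the query words themselves are excluded
            continue
        score = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

answer = analogy_sketch("Germany", "Berlin", "France")
print(answer)
```

In gensim terms, 'a is to b as c is to ?' corresponds to `most_similar(positive=[b, c], negative=[a])`. In the toy space, the relation between country and capital is a perfectly regular offset, which is rarely the case in a real model.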

## 8 Visualizing the spatial arrangement of word embeddings using t-SNE
Run the code below to import the libraries required for visualising the word embeddings in the model.

In [None]:
# Import the libraries required to perform t-SNE dimensionality reductions and visualizations
import numpy as np
import matplotlib.pyplot as plt
 
from sklearn.manifold import TSNE

In the code cell below, you can change the argument *topn=30* in the call to *similar_by_word()* in order to determine the number of words to be plotted. If *similar_by_word()* is called without the *topn* argument, the default value is 10 (see the base notebook).

In [None]:
# Define a function which displays our word embeddings in a two-dimensional scatter plot
def display_tsne_reduction(model, word):
    
    arr = np.empty((0, model.vector_size), dtype='float')
    word_labels = [word]

    # Get the words most similar to our input word
    similar_words = model.similar_by_word(word, topn=30)
    
    # Add the vector for each of these words to an array
    arr = np.append(arr, np.array([model[word]]), axis=0)
    for wrd_score in similar_words:
        wrd_vector = model[wrd_score[0]]
        word_labels.append(wrd_score[0])
        arr = np.append(arr, np.array([wrd_vector]), axis=0)
        
    # Calculate the t-SNE coordinates for 2 dimensions
    # (the perplexity must be smaller than the number of plotted words)
    tsne = TSNE(n_components=2, random_state=0, perplexity=min(30, len(word_labels) - 1))
    Y = tsne.fit_transform(arr)

    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    
    # Define the visual representation of our scatter plot
    plt.scatter(x_coords, y_coords)

    for label, x, y in zip(word_labels, x_coords, y_coords):
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
    plt.xlim(x_coords.min() - 0.00005, x_coords.max() + 0.00005)
    plt.ylim(y_coords.min() - 0.00005, y_coords.max() + 0.00005)
    plt.show()

Run the following code to visualize the top *n* words that are closest to the word passed as the second argument to the *display_tsne_reduction()* function.

In [None]:
# Display word embeddings in two-dimensional space
display_tsne_reduction(word_embeddings, 'father')

## 9 Tips for further reading
The reading tips are the same as in the base notebook. If you read through the tips listed here, you'll acquire a profound understanding of the concepts covered in the two notebooks.  
- [Alammar, Jay (2019): The Illustrated Word2vec](http://jalammar.github.io/illustrated-word2vec/)
- [Collis, Jaron (2017): Glossary of Deep Learning: Word Embeddings](https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca)  
- [Jedamski, Derek (2020): Advanced NLP with Python for Machine Learning](https://www.linkedin.com/learning/advanced-nlp-with-python-for-machine-learning), LinkedIn Learning course (free for students of TH Köln)
- [McCormick, Chris (2016): Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)  
- [McCormick, Chris (2017): Word2Vec Tutorial Part 2 - Negative Sampling](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/)
- [official gensim documentation](https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html)