# Word Embeddings

This notebook introduces word embeddings. An embedding is essentially a vector representation of some object or concept, in this case words. Word embeddings can be trained to create these vectors for a given vocabulary, and the vectors can then be used by other systems to perform AI tasks. Word embeddings are an active field of research, and many new ideas appear every year.  
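
To make the idea of "words as vectors" concrete, here is a minimal sketch that compares words by the angle between their vectors (cosine similarity). The three 4-dimensional vectors are made up for illustration and are not fastText values; real fastText vectors have 300 dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up toy vectors, for illustration only (NOT real fastText values).
toy_embeddings = {
    "cat": np.array([0.9, 0.8, 0.1, 0.0]),
    "dog": np.array([0.8, 0.9, 0.2, 0.1]),
    "car": np.array([0.1, 0.0, 0.9, 0.8]),
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

In this toy setup "cat" and "dog" point in nearly the same direction, while "cat" and "car" do not. Trained embeddings tend to show the same kind of pattern for related words.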

The specific type of word embeddings we will use here is called fastText and was developed by Facebook. Pretrained embeddings are freely available, so we do not need to train them ourselves; we can simply download them and work from there.

#### Global Setup

In [2]:
try:
    with open("../global_setup.py") as setupfile:
        exec(setupfile.read())
except FileNotFoundError: 
    pass

#### Local Setup

In [3]:
from src.text.word_embedding.fast_text_usage import get_fasttext_model
from notebooks.exercises.src.text import word_embedding_viz
from notebooks.exercises.src.text import fasttext_document_visualisation

## FastText model

Here we load the fastText model. You will have to download it if you haven't already, and the cell below will tell you where to find it. Note that the English file is large (roughly 10 GB). Once the data is downloaded, fastText is loaded into memory, which takes some additional time and needs enough free RAM to hold the model.

In [4]:
fasttext_model = get_fasttext_model(lang="en")

Getting FastText data.
Data for FastText not found. 
It can be downloaded from https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip. 
Do you want to download this file now? - Note that this is a LARGE file.
[y/N]y
Downloading:




Saving to file
File downloaded.
Getting FastText data.
	Using zip file.
	Unzipping into temporary file: C:\Users\anfro\AppData\Local\Temp\tmpgvbx1uz9
	Loading model from temporary file.
	Removing temporary file.


Exception: fastText: Cannot load C:\Users\anfro\AppData\Local\Temp\tmpgvbx1uz9 due to C++ extension failed to allocate the memory
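
If loading fails with a memory error like the one above, one possible workaround is to read only the most frequent words from a plain-text `.vec` file instead of the full binary model. The sketch below assumes the downloaded archive also contains such a file, and `load_vec_subset` is a hypothetical helper, not part of this repository or of fastText itself. The `.vec` layout is a header line `<n_words> <dim>` followed by one `<word> <v1> ... <v_dim>` line per word. Note that the visualizers below expect the object returned by `get_fasttext_model`, so this is only a way to inspect vectors, not a drop-in replacement.

```python
import io
import numpy as np

def load_vec_subset(lines, limit):
    """Read at most `limit` word vectors from text in the .vec layout.

    `lines` is any iterable of strings: a header "<n_words> <dim>" followed
    by one "<word> <v1> ... <v_dim>" line per word.
    """
    lines = iter(lines)
    _n_words, dim = (int(x) for x in next(lines).split())
    words, rows = [], []
    for line in lines:
        if len(words) >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines (e.g. tokens containing spaces)
        words.append(parts[0])
        rows.append([float(x) for x in parts[1:]])
    vectors = np.array(rows, dtype=np.float32).reshape(len(rows), dim)
    return words, vectors

# Tiny in-memory stand-in for a real .vec file (values are made up).
sample = io.StringIO(
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    "of 0.5 0.6 0.7 0.8\n"
    "and 0.9 1.0 1.1 1.2\n"
)
words, vectors = load_vec_subset(sample, limit=2)
print(words, vectors.shape)
```

With a real file you would pass an open file handle instead of the `StringIO` object. Memory use then grows with `limit` rather than with the full vocabulary.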

## Word Vectors
We will now look at a large number of word vectors. The vectors have 300 dimensions! This is far too many for a human to visualize geometrically. We can, however, compute something called a Principal Component Analysis (PCA). You will learn much more about this method later, but in short it lets us find the few dimensions along which the vectors vary the most (the most movement of the vectors and the most "action" to see). If we take the 3 dimensions with the most variance, we can plot them in a 3D plot!  
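
As a rough illustration of what PCA does, the sketch below centres a set of vectors, finds the directions of largest variance with a singular value decomposition, and keeps the first three. The input is random data standing in for word vectors, and `pca_project` is a hypothetical helper written for this example; the visualizer in this notebook may compute its projection differently.

```python
import numpy as np

def pca_project(vectors, n_components=3):
    """Project row vectors onto their top principal components."""
    X = np.asarray(vectors, dtype=float)
    centered = X - X.mean(axis=0)  # PCA works on mean-centred data
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    projected = centered @ components.T
    # Share of the total variance captured by each kept component.
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]

rng = np.random.default_rng(0)
fake_word_vectors = rng.normal(size=(50, 300))  # random stand-in for 50 word vectors
points_3d, explained_ratio = pca_project(fake_word_vectors, n_components=3)
print(points_3d.shape)
```

Each row of `points_3d` is one word reduced to three coordinates, which is what a 3D scatter plot needs. `explained_ratio` tells you how much of the original variance survives the reduction.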

Let's try that!  
Below you can randomly sample some words and plot them in 3D PCA space.  
Watch out and don't pick too many samples! Your computer probably won't be able to handle it ;)

In [None]:
%matplotlib notebook
visualizer = word_embedding_viz.CompleteWordEmbeddingVisualizer(fasttext_model=fasttext_model)

Okay so this is definitely way too many words and dimensions for us to understand!  
Let's therefore look into some specific words in the next section. 

### Looking into specific words
Below we pick out 2 dimensions based on the vectors between points, that is, the differences between word vectors.  

There are a couple of categories below which you can investigate, and you can include or exclude rows and columns of the table for the plot. You can also select between two kinds of vector planes (the view you are looking at). We can use PCA as in the last section, but we can also use a different method which is specialized for the differences between the vectors (in the menu it is called something with "SVD difference").
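
The implementation behind the "SVD difference" option is not shown in this notebook, so the following is only one plausible reading of the idea: take the difference vector for each word pair, run an SVD on those differences without mean-centring, and use the two strongest directions as the viewing plane. The helper `difference_plane` and the synthetic data are assumptions made for this sketch.

```python
import numpy as np

def difference_plane(vectors_a, vectors_b):
    """Return two orthonormal axes spanning the main directions of (b - a)."""
    differences = np.asarray(vectors_b, dtype=float) - np.asarray(vectors_a, dtype=float)
    # No mean-centring: we want the directions of the differences themselves.
    _, _, vt = np.linalg.svd(differences, full_matrices=False)
    return vt[:2]

# Synthetic stand-in for word pairs: each partner is its base word shifted
# by the same offset plus a little noise (NOT real fastText data).
rng = np.random.default_rng(0)
dim = 300
shared_offset = np.zeros(dim)
shared_offset[0] = 5.0
base = rng.normal(size=(4, dim))
partner = base + shared_offset + 0.1 * rng.normal(size=(4, dim))

axes = difference_plane(base, partner)
base_2d = base @ axes.T        # 2D coordinates of the first word in each pair
partner_2d = partner @ axes.T  # 2D coordinates of the second word in each pair
print(base_2d.shape, partner_2d.shape)
```

In this reading, PCA picks the plane where the points themselves spread out the most, while the difference-based plane is chosen so that the arrows between paired points are as visible as possible. Keep that distinction in mind for the exercise below.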

In [None]:
%matplotlib notebook
visualizer = word_embedding_viz.WordEmbeddingVisualizer(fasttext_model=fasttext_model)

**Exercise**  
- *What method is best for plotting the differences of vectors?*  
    Answer
- *What method is best for plotting points alone?*  
    Answer

## Document embeddings
We now have an idea that word embeddings have some important information about words.  
We will try to use the embeddings of the words for analysing documents.  

Below are two tabs. The first tab allows you to search for words on Wikipedia and fetch the word-embeddings of the text. The second tab lets you write text-documents yourself.  

The texts are used to compute vectors representing the documents, which can then be plotted in 3D.  
Press "Do Document Embeddings" to show the plot, and use the dropdown menu to select the method used to create the document vectors. 
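
One common way to turn word vectors into a document vector is simply to average them. The sketch below shows that idea with a made-up 3-dimensional vocabulary; `document_vector` is a hypothetical helper, and the methods offered in the dropdown menu may differ from it.

```python
import numpy as np

def document_vector(text, word_vectors, dim):
    """Average the vectors of the known words in `text` (zeros if none are known)."""
    tokens = text.lower().split()
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Made-up toy vocabulary, for illustration only (NOT real fastText values).
toy_vectors = {
    "cats": np.array([1.0, 0.0, 0.0]),
    "purr": np.array([0.0, 1.0, 0.0]),
    "cars": np.array([0.0, 0.0, 1.0]),
}
doc = document_vector("Cats purr loudly", toy_vectors, dim=3)
print(doc)
```

Averaging ignores word order and treats every word as equally important, which is why other weighting schemes are often tried as alternatives.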

In [None]:
%matplotlib notebook
doc_view = fasttext_document_visualisation.DocumentEmbeddingVisualiser(fasttext_model=fasttext_model)