> **"Deep Thought, do you have the answer to everything, the secret of the universe, the great question of life and everything?"** <br>
> **"Yes. I don't think you are going to like it. All right, ready? The answer to the great question is <span style="color:red">42</span>."**

Word embeddings are interesting for three reasons: 1) they reduce much of the complexity of NLP to operations on vectors, 2) they work surprisingly well with a basic neural network rather than fancy deep learning, and 3) the intuition behind them is simple and useful in many other areas.

In this post, we will:

1. go over the concept of embedding
2. look at the intuition behind it
3. see how it is used in NLP-related machine learning algorithms.



## 1) Embedding - Equation Solver for King - Man + Woman = Queen

Put simply, word embedding is a way to encode a word as a dense vector of continuous numbers (a distributed representation). Embeddings matter for the performance of many real-world applications: search engines, ads, ranking, spam detection, automatic text tagging, or next-word prediction on your phone. Without realizing it, you already use word embeddings in many applications every day.

Transforming text into vectors is similar to how we transform colors into the RGB color model, in which red, green, and blue light are added together to reproduce a broad array of colors, e.g. RED + GREEN = YELLOW. Can we do the same kind of intuitive vector operation on words?
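
The colour analogy can be written down directly as vector addition. This is a minimal sketch, assuming 8-bit RGB channels; the constant names are only for illustration.

```python
import numpy as np

# RGB colours as 3-dimensional vectors (red, green, blue), each channel in 0..255.
RED = np.array([255, 0, 0])
GREEN = np.array([0, 255, 0])
YELLOW = np.array([255, 255, 0])

# Adding the two vectors channel by channel gives the vector for yellow.
mixed = RED + GREEN
print(mixed.tolist(), np.array_equal(mixed, YELLOW))
```

Word embeddings aim for the same property: adding and subtracting vectors should produce a vector that still means something.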

![Image for post](https://i.loli.net/2021/01/05/hLBPjQ3ZYXgAHUS.png)

It turns out that not only can we do this, we can extract many more insights using a shallow neural network (1 hidden layer), where most people would have expected it to require much more complicated deep learning. It seems magical that you can do arithmetic on words.

| Expression      | Nearest Token     | 
| :------------- | :----------: |
|  Paris - France + Italy | `Rome`   | 
|  bigger - big + cold | `colder`   | 
|  sushi - Japan + Germany | `bratwurst`   | 
|  Cu - Copper + gold | `Au`   | 
|  Windows - Microsoft + Google | `Android`   | 
|  Montreal Canadiens - Montreal + Toronto | `Toronto Maple Leafs`   | 
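
Each row of the table can be read as "compute a - b + c, then look up the nearest token". Here is a minimal sketch of that lookup, assuming cosine similarity as the distance measure. The three-dimensional `toy` vectors are made up by hand for illustration and are not real embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(embeddings, a, b, c):
    """Return the token closest to a - b + c, ignoring the three input tokens."""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

# Hand-made vectors: dimension 0 ~ "is a capital", 1 ~ "France", 2 ~ "Italy".
toy = {
    "paris": np.array([1.0, 1.0, 0.0]),
    "france": np.array([0.0, 1.0, 0.0]),
    "rome": np.array([1.0, 0.0, 1.0]),
    "italy": np.array([0.0, 0.0, 1.0]),
    "pasta": np.array([0.0, 0.1, 0.9]),
}

print(solve_analogy(toy, "paris", "france", "italy"))
```

Excluding the input tokens from the candidates is a common convention, because the result of the arithmetic often stays very close to one of the inputs.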


## 2) Resources

Additional resources on word embeddings and word2vec.

  * GloVe pre-trained embeddings built on Wikipedia 2014 [(6B tokens, 400K vocab, 50d, 100d, 200d, & 300d vectors, 822 MB download)](http://nlp.stanford.edu/data/glove.6B.zip)
  * Chris McCormick [Word2Vec Tutorial - The Skip-Gram Model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
  * Mikolov, first paper on word2vec [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)
  * Mikolov, second paper on word2vec, performance improvements [Distributed Representations of Words and Phrases and their Compositionality](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
  * TensorFlow [word2vec Tutorial](https://www.tensorflow.org/tutorials/word2vec)
  * Jay Alammar [The Illustrated Word2vec](https://jalammar.github.io/illustrated-word2vec/)
  * GloVe [GloVe, Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)




## 3) Word Modeling

### 3.1) One-Hot Encoding

This is the traditional way of encoding language numerically. You first build a vocabulary of distinct words. For the sentence "the quick brown fox jumps over the lazy dog", the vocabulary is "the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog" ("the" appears twice in the sentence but only once in the vocabulary). Each word is then encoded as a vector of length 8 that is all zeros except for a single 1 at the word's position in the vocabulary.

* "dog" is represented as [0, 0, 0, 0, 0, 0, 0, 1]
* "fox" is represented as [0, 0, 0, 1, 0, 0, 0, 0]
* "fox jumps over" is represented as a matrix, with one row per word:
  * "fox":   [0, 0, 0, 1, 0, 0, 0, 0]
  * "jumps": [0, 0, 0, 0, 1, 0, 0, 0]
  * "over":  [0, 0, 0, 0, 0, 1, 0, 0]
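
The encoding can be sketched in a few lines. This assumes each distinct word is kept once in the vocabulary, in order of first appearance, so the repeated "the" counts once and the vectors have length 8. The helper names `one_hot` and `encode` are only for illustration.

```python
import numpy as np

sentence = "the quick brown fox jumps over the lazy dog"

# dict.fromkeys keeps first-appearance order and drops the repeated "the".
vocab = list(dict.fromkeys(sentence.split()))
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector of zeros with a single 1 at the word's vocabulary position."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

def encode(text):
    """Stack one one-hot row per word, giving a (words x vocabulary) matrix."""
    return np.stack([one_hot(w) for w in text.split()])

print(one_hot("dog").tolist())
print(encode("fox jumps over").tolist())
```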

This representation is fairly straightforward, but its simplicity comes with sparsity, in both storage and information density. If we assume roughly 170K words in English, a matrix with 170K columns for every encoded sentence is very inefficient: each row contains a single 1 and the rest is zeros.
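
To put a number on the sparsity, here is a back-of-the-envelope calculation. The 170K figure is the rough vocabulary size from the paragraph above; the sentence length of 10 and the 4-byte element sizes are assumptions chosen for illustration.

```python
import numpy as np

VOCAB_SIZE = 170_000  # rough English vocabulary size used in the text
SENTENCE_LEN = 10     # hypothetical sentence length

# Dense one-hot matrix: one float32 per (word, vocabulary entry) cell.
dense_bytes = SENTENCE_LEN * VOCAB_SIZE * np.dtype(np.float32).itemsize

# Storing only the position of each 1: one int32 per word.
index_bytes = SENTENCE_LEN * np.dtype(np.int32).itemsize

# Share of cells in the one-hot matrix that are non-zero.
nonzero_fraction = SENTENCE_LEN / (SENTENCE_LEN * VOCAB_SIZE)

print(dense_bytes, index_bytes, nonzero_fraction)
```

Under these assumptions the dense matrix takes megabytes while the same information fits in a few dozen bytes of indices.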


### 3.2) Word Vector 

Strictly speaking, a word vector is similar to one-hot encoding: both represent a word as a vector of numbers. The difference is that a word vector is dense and much shorter. You can think of a vector as a position in a multidimensional space, called a vector space.

This chart illustrates a 3-dimensional vector space. In practice, vector spaces usually have 50, 150, or 500 dimensions; 3 dimensions are used here only for illustration. Each location (x, y, z) in the space represents a word.

![](https://miro.medium.com/max/2276/1*5uxHqA0wO0u9d3aanPY7Ww.png)



Let's take a look at a concrete example. Below is a set of pre-trained word vectors, made available by the [GloVe](https://nlp.stanford.edu/projects/glove/) project. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

The file `glove.6B.50d.txt` was pre-trained on Wikipedia 2014, with 6B tokens and a 400K-word uncased vocabulary. Each word maps to a 50-dimensional vector. The same download also contains 100d, 200d, and 300d vectors.

In [None]:
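import numpy as np
from scipy import spatial
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Map each word to its 50-dimensional GloVe vector.
embeddings_dict = {}

# Each line holds a word followed by its vector components, separated by spaces.
# The file contains non-ASCII tokens, so read it explicitly as UTF-8.
with open("/kaggle/input/glove6b50dtxt/glove.6B.50d.txt", "r", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype="float32")
        embeddings_dict[word] = vector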


Let's print out "king". Keep in mind this is a vector of 50 dimensions. You can try other words by changing the key passed to `embeddings_dict`; the vocabulary is uncased, so use lowercase.

In [None]:
print(embeddings_dict["king"])

Raw numbers are hard to read. Let's map the numbers to colors so we can compare the vectors visually. This time, let's plot "king", "man", "woman", and "queen", together with the result of king - man + woman.

In [None]:
import matplotlib.pyplot as plt

# One row per vector; the fourth row is the result of the arithmetic.
labels = ["king", "man", "woman", "king-man+woman", "queen"]
rows = [embeddings_dict["king"],
        embeddings_dict["man"],
        embeddings_dict["woman"],
        embeddings_dict["king"] - embeddings_dict["man"] + embeddings_dict["woman"],
        embeddings_dict["queen"]]

plt.figure(figsize=(20, 2))
img = plt.imshow(rows, interpolation='nearest', cmap='Set3')
# Label each row on the y-axis instead of drawing text over the image.
plt.yticks(range(len(labels)), labels)
plt.show()
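
The colour bars give a visual impression; to check the result numerically you can rank words by their distance to the computed vector. Below is a minimal sketch, assuming Euclidean distance. The helper `closest_words` and the two-dimensional `toy` vectors (roughly "royalty" and "gender") are made up for illustration. With the real GloVe data you could pass `embeddings_dict` instead of `toy`, and the ranking you get there may differ.

```python
import numpy as np

def closest_words(embeddings, vector, n=3):
    """Return the n words whose vectors are nearest to `vector` (Euclidean distance)."""
    return sorted(embeddings, key=lambda w: np.linalg.norm(embeddings[w] - vector))[:n]

# Hand-made 2-d vectors: dimension 0 ~ royalty, dimension 1 ~ gender.
toy = {
    "king": np.array([1.0, 1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

target = toy["king"] - toy["man"] + toy["woman"]
ranked = closest_words(toy, target)
print(ranked)
```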

## 4) Word2Vec

Word2Vec is one of the most popular examples of word embedding in practice. The core idea behind Word2Vec is the Distributional Hypothesis: words that appear in similar contexts tend to have similar meanings. A model that can predict a word from its neighbouring words, or vice versa, is therefore likely to capture the contextual meaning of the word.

### 4.1) Why does it work so well?

Consider filling in the blank in "would you like a cup of \___ ?" or "I like my \___ black". Intuitively, when you see "cup of" or "black" as <span style="color:red">**context**</span>, the words that appear near them tend to have similar meanings. As a result, when you train on a large number of documents, words with similar <span style="color:red">**context**</span> are pulled close together.

However, tea and coffee are different drinks, and they mean different things in other contexts, e.g. in Asia you would often buy good tea as a gift, but probably not coffee. Those differences are captured in other dimensions. The more dimensions the word vector model has, the more expressive it is.
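
The "shared context" idea can be made concrete by counting which words appear near each word. The sketch below uses a tiny made-up corpus and a window of two words on each side; both are assumptions for illustration, and real training uses far more text.

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus built around the fill-in-the-blank examples.
corpus = [
    "would you like a cup of tea",
    "would you like a cup of coffee",
    "i like my tea black",
    "i like my coffee black",
    "the fox jumps over the dog",
]

def context_counts(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start, stop = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

counts = context_counts(corpus)
shared_tea_coffee = set(counts["tea"]) & set(counts["coffee"])
shared_tea_fox = set(counts["tea"]) & set(counts["fox"])
print(sorted(shared_tea_coffee), sorted(shared_tea_fox))
```

In this corpus "tea" and "coffee" share all of their context words while "tea" and "fox" share none, which is the signal a model can use to place the two drinks close together.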



### 4.2) Skip-gram Model


There are two architectures for implementing Word2Vec: the CBOW (Continuous Bag-of-Words) model and the Skip-gram model.

![](https://github.com/udacity/embeddings-cn/raw/694a69c4a8b0711e48d2253e9880d94c72ff60cb/assets/word2vec_architectures.png)


**The Skip-gram model** passes in a word as input and tries to predict the words surrounding it in the neighbouring context. In this way, we can train the network to learn representations for words that show up in similar contexts.

**CBOW** works the other way around: it passes in the context as input and tries to predict the word within it.

Let's look at one example: "the quick brown fox jumps over the lazy dog". With a window of two words on each side, the Skip-gram model uses "jumps" as input and predicts "brown", "fox", "over", and "the" as context, while CBOW takes "brown", "fox", "over", and "the" as input and predicts "jumps" as output.

Suppose there is a magic neural network model that takes "jumps" as input, one-hot encoded (e.g. "0100" for a four-word vocabulary), and learns to predict "brown", "fox", "over", and "the". Once it is trained, the weights of the hidden layer below that belong to "jumps" become the **embedding vector** of the word "jumps".
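
The training examples for both architectures can be generated from the same sentence. This is a minimal sketch assuming a window of two words on each side; the function names are only for illustration.

```python
def context_of(tokens, i, window=2):
    """Words within `window` positions of index i, excluding the word itself."""
    start, stop = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(start, stop) if j != i]

def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (input word, context word to predict) pair per neighbour."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_of(tokens, i, window)]

def cbow_examples(tokens, window=2):
    """CBOW: one (context words as input, word to predict) example per position."""
    return [(context_of(tokens, i, window), target)
            for i, target in enumerate(tokens)]

tokens = "the quick brown fox jumps over the lazy dog".split()

jumps_pairs = [pair for pair in skipgram_pairs(tokens) if pair[0] == "jumps"]
jumps_cbow = cbow_examples(tokens)[tokens.index("jumps")]
print(jumps_pairs)
print(jumps_cbow)
```

For "jumps", Skip-gram yields four pairs with "jumps" as the input, while CBOW yields a single example with the four neighbours as input and "jumps" as the target.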

![](https://i.loli.net/2021/01/07/xbINVzceMqtpJfR.png)
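
Why the hidden-layer weights act as the embedding follows from the arithmetic: multiplying a one-hot vector by the weight matrix simply selects one row. The sketch below shows this with a randomly initialised (untrained) matrix; the vocabulary order and the embedding size of 3 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
embedding_dim = 3

# Hidden-layer weights: one row per vocabulary word (untrained, random values).
W_hidden = rng.normal(size=(len(vocab), embedding_dim))

# One-hot input for "jumps".
index = vocab.index("jumps")
one_hot = np.zeros(len(vocab))
one_hot[index] = 1.0

# The matrix product picks out exactly the row that belongs to "jumps".
hidden = one_hot @ W_hidden
print(np.allclose(hidden, W_hidden[index]))
```

This is why, in practice, the multiplication is usually replaced by a direct row lookup.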








### 4.3) Word2Vec - Implementation of Skip-gram Model


There is a good tutorial from [TensorFlow](https://www.tensorflow.org/tutorials/text/word2vec). 
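
For a feel of what such an implementation does under the hood, here is a minimal NumPy sketch of Skip-gram training. It is not the tutorial's code: it swaps in a full softmax over the vocabulary instead of negative sampling, and the function name, hyperparameters, and single-sentence corpus are assumptions chosen to keep the example small.

```python
import numpy as np

def train_skipgram(tokens, dim=8, window=2, epochs=300, lr=0.1, seed=0):
    """Train a tiny full-softmax Skip-gram model with plain SGD."""
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    # (center index, context index) training pairs.
    pairs = [(idx[tokens[i]], idx[tokens[j]])
             for i in range(len(tokens))
             for j in range(max(0, i - window), min(len(tokens), i + window + 1))
             if j != i]

    rng = np.random.default_rng(seed)
    W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # embeddings
    W_out = rng.normal(scale=0.1, size=(dim, len(vocab)))  # output weights

    losses = []
    for _ in range(epochs):
        total = 0.0
        for center, context in pairs:
            h = W_in[center].copy()              # hidden layer = embedding row
            scores = h @ W_out                   # one score per vocabulary word
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()                 # softmax
            total += -np.log(probs[context])     # cross-entropy loss

            grad_scores = probs.copy()
            grad_scores[context] -= 1.0          # d(loss)/d(scores)
            grad_h = W_out @ grad_scores         # d(loss)/d(h)

            W_out -= lr * np.outer(h, grad_scores)
            W_in[center] -= lr * grad_h
        losses.append(total / len(pairs))
    return vocab, W_in, losses

tokens = "the quick brown fox jumps over the lazy dog".split()
vocab, embeddings, losses = train_skipgram(tokens)
print(embeddings.shape, losses[0], losses[-1])
```

Each row of `embeddings` is the learned vector for one vocabulary word. On a single sentence the vectors are not meaningful; the point is only to show the training loop.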



## 5) Tools


If you are interested in other visualizations of word embeddings, [**Embedding Projector**](http://projector.tensorflow.org/) is a web application that interactively visualizes embeddings by reading them from a model and rendering them in two or three dimensions. Here is a visualisation of ten thousand MNIST images, coloured by their label.


![Image for post](https://miro.medium.com/max/1280/1*cemzBbyaQndIDYiCAbjdiw.gif)
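
To view your own vectors in the projector, you can upload them as tab-separated files. The sketch below assumes the layout the projector's upload dialog describes: one vector per line in the first file, and one label per line (single column, no header row) in an optional metadata file. The helper name `to_projector_tsv` and the two sample vectors are made up for illustration.

```python
import io
import numpy as np

def to_projector_tsv(embeddings):
    """Return (vectors_tsv, metadata_tsv) strings for a word -> vector mapping."""
    vectors, metadata = io.StringIO(), io.StringIO()
    for word, vec in embeddings.items():
        # One tab-separated vector per line, with the matching label on the same line number.
        vectors.write("\t".join(f"{x:.6f}" for x in vec) + "\n")
        metadata.write(word + "\n")
    return vectors.getvalue(), metadata.getvalue()

sample = {
    "king": np.array([0.5, -1.0]),
    "queen": np.array([0.25, 2.0]),
}
vectors_tsv, metadata_tsv = to_projector_tsv(sample)
print(vectors_tsv)
print(metadata_tsv)
```

Writing the two strings to files (for example `vectors.tsv` and `metadata.tsv`, names of your choice) gives you something to load in the projector.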




## 6) Lessons Learned

I wrote this in a Jupyter notebook and drew some of the charts with [Excalidraw](https://excalidraw.com).

1. Jupyter doesn't handle state management well. Save often and set up checkpoints.

2. There are plenty of resources on the internet; this is a well-studied area with a lot of valuable insights from others.

3. I didn't find many practical resources on best practices for testing model training. I need to dig deeper.

