# 3. Word Embeddings
This section serves as an introduction to many of the ideas we will touch on in my NLP section, which covers deep learning applied to natural language processing.

You will notice that a lot of the RNN examples that we go over, as well as many elsewhere on the web, use word sequences as examples. Why is that?

1. Language is an easy topic to comprehend! We speak, read, and write every single day, which makes it rather intuitive to deal with. If you are reading this post, then you are unavoidably using those abilities at this very moment.
2. RNNs allow us to stop treating sentences as a **bag-of-words**.

Let's focus on number 2 above for a moment. As an example, consider the sentence:

> "Dogs love cats and I"

It _almost_ has the correct grammatical structure, but its meaning is most certainly different from that of the original sentence, even though it uses exactly the same words:

> "I love dogs and cats"

So, a lot of information (in the quantitative sense) is thrown away when you use bag-of-words. At this point I am assuming that you have gone through my posts on Logistic Regression and the intro to NLP, which both cover sentiment analysis and use bag-of-words. In case you have not, let me define bag-of-words quickly.

## 1. Bag-of-Words
Consider the task of sentiment analysis, where we are trying to determine whether a sentence is positive or negative. A positive sentence may be:

> "Wow, today is a great day!"

While a negative sentence may be:

> "Ugh, this movie is absolutely terrible."

In order to turn each sentence into an input for the classifier, we first start with a vector of 0's of size $V$ (our vocabulary size), so there is an entry for every individual word:

```
X = [0] * V  # one entry per vocabulary word, so len(X) == V
```

We keep track of which word goes with which index using a dictionary, `word2idx`. Now, for every word in the sentence, we will set the corresponding index in the vector to `1`, or perhaps some other frequency measure:

```
X[idx_of_word] = 1
```

So, there is a nonzero value for every word that appears in the sentence, and everywhere else zero:

```
X = [0,1,0,0,...,1]
```

You can see how, given this vector, it wouldn't be easy to determine the correct order of the words in the sentence. It isn't completely impossible, if for instance the words are such that there is only one possible ordering, but generally some information is lost.
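To make the loss of word order concrete, here is a minimal sketch of the bag-of-words construction, applied to the two "dogs and cats" sentences from earlier. The toy vocabulary and the helper name `bag_of_words` are assumptions made for illustration, and NumPy is used for the vector.

```python
import numpy as np

def bag_of_words(sentence, word2idx):
    """Return a binary vector of length V marking which vocabulary words occur."""
    x = np.zeros(len(word2idx))
    for word in sentence.lower().split():
        x[word2idx[word]] = 1
    return x

# Toy vocabulary built from the example sentences (illustrative only).
vocab = ["i", "love", "dogs", "and", "cats"]
word2idx = {word: idx for idx, word in enumerate(vocab)}

x1 = bag_of_words("I love dogs and cats", word2idx)
x2 = bag_of_words("Dogs love cats and I", word2idx)

# Both sentences use the same words, so their vectors are identical.
print(np.array_equal(x1, x2))  # True
```

The classifier receives the same input for both sentences, so it has no way to tell them apart.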

Now, what happens when you have the two similar sentences:

> "Today is a good day."

And:

> "Today is _not_ a good day."

Well, these lead to nearly the exact same input vector; the only difference is the entry for _not_, `X[word2idx["not"]] = 1`. This is a known drawback of bag-of-words: it is notoriously bad at handling negation. Now, given what we know about RNNs, you can imagine that they may be good at this because they keep state! For instance, once the RNN has seen the word _not_, it may negate everything that comes after it.
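As a quick check of the negation problem, the sketch below encodes both sentences and lists the positions where the two vectors disagree. The six-word vocabulary and its index assignment are assumptions chosen for this example.

```python
import numpy as np

# Toy vocabulary for the two example sentences (illustrative only).
word2idx = {"today": 0, "is": 1, "not": 2, "a": 3, "good": 4, "day": 5}

def bag_of_words(tokens, word2idx):
    """Binary bag-of-words vector for a list of tokens."""
    x = np.zeros(len(word2idx))
    for token in tokens:
        x[word2idx[token]] = 1
    return x

positive = bag_of_words("today is a good day".split(), word2idx)
negative = bag_of_words("today is not a good day".split(), word2idx)

# Indices where the two vectors disagree.
diff = np.flatnonzero(positive != negative)
print(diff)  # [2], the index of "not"
```

A single differing entry is all the classifier has to work with, even though the sentiment has flipped entirely.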

## 2. Word Embeddings
This brings us to a central question: how _do_ we treat words in deep learning? The popular method at the moment, which has produced very impressive results, is the use of word embeddings, or word vectors. Given a vocabulary of size $V$, we choose a dimensionality $D$ that is much smaller, $D \ll V$, and then map each word to a vector somewhere in the $D$-dimensional space. By training a model to do things like predict the next word, or predict the surrounding words, we get vectors (word embeddings) that can be manipulated via arithmetic to produce analogies such as:


> king - man $\approx$ queen - woman

The question now is: how do we use word embeddings with recurrent neural networks? To accomplish this, we simply create an embedding layer in the RNN. The input arrives as a one-hot encoded word, and in the next layer it becomes a $D$-dimensional vector.

<img src="https://drive.google.com/uc?id=1Q2eh1IRL0qxB05p-xb4TJvYSwEHZP7HO" width="500">
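One way to see what the embedding layer does: multiplying a one-hot vector by an embedding matrix just selects one row of that matrix. The sketch below checks this with NumPy. The sizes and the random matrix are stand-ins, not learned values.

```python
import numpy as np

V, D = 6, 3  # toy vocabulary size and embedding dimension (assumed values)
rng = np.random.default_rng(0)
W_e = rng.standard_normal((V, D))  # random stand-in for a learned embedding matrix

idx = 4  # index of some word in the vocabulary
one_hot = np.zeros(V)
one_hot[idx] = 1

# Multiplying a one-hot row vector by W_e picks out row idx of W_e.
embedded = one_hot @ W_e
print(np.allclose(embedded, W_e[idx]))  # True
```

In practice this means the one-hot vector never has to be built; indexing into the matrix gives the same result with far less work.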

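The embedding layer requires the word embedding transformation matrix to be a $V \times D$ matrix, where the $i$th row is the word vector for the $i$th word. For reference, all of the matrix dimensions are below, where $M$ is the size of the hidden layer and $K$ is the number of output classes: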

$$W_e \in \mathbb{R}^{V \times D}$$

$$W_x \in \mathbb{R}^{D \times M}$$

$$W_h \in \mathbb{R}^{M \times M}$$

$$W_o \in \mathbb{R}^{M \times K}$$
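To check that these shapes fit together, here is a rough sketch of a forward pass through a simple recurrent unit. It assumes a `tanh` hidden activation and a softmax output, omits bias terms, and uses random weights and made-up sizes, so it only illustrates the dimensions.

```python
import numpy as np

V, D, M, K = 10, 4, 5, 3  # toy sizes (assumed values)
rng = np.random.default_rng(0)

W_e = rng.standard_normal((V, D))  # word embeddings
W_x = rng.standard_normal((D, M))  # input-to-hidden weights
W_h = rng.standard_normal((M, M))  # hidden-to-hidden weights
W_o = rng.standard_normal((M, K))  # hidden-to-output weights

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

word_indices = [1, 7, 3]  # a toy input sequence of length T = 3
h = np.zeros(M)           # initial hidden state
for idx in word_indices:
    x = W_e[idx]                    # (D,) embedding lookup
    h = np.tanh(x @ W_x + h @ W_h)  # (M,) new hidden state
    y = softmax(h @ W_o)            # (K,) output probabilities

print(x.shape, h.shape, y.shape)  # (4,) (5,) (3,)
```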

Two questions will naturally arise at this point. The first being:

1. How do we train this model?

The answer to this is our old friend, gradient descent. We will also see later, when we do Word2Vec, that there are some variations on the cross-entropy error function that help speed up training. The second question is:

2. What are the targets? 

This is a good question because language models don't necessarily have targets. You can attempt to learn word embeddings on a sentiment analysis task, so your targets could be movie ratings or some kind of movie score. Your targets could also be next word prediction as we discussed before. Again, if we use Word2Vec, the targets will also change based on the particular Word2Vec method we use. 
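For the next-word prediction case, the targets come straight from the input sequence shifted by one position. Here is a minimal sketch, assuming a toy vocabulary and no special start or end tokens (a real pipeline may add those).

```python
# Toy vocabulary and sentence (illustrative only).
word2idx = {"i": 0, "love": 1, "dogs": 2, "and": 3, "cats": 4}
sentence = ["i", "love", "dogs", "and", "cats"]

indices = [word2idx[word] for word in sentence]

# At each time step the target is simply the word that comes next.
inputs = indices[:-1]
targets = indices[1:]

for x, t in zip(inputs, targets):
    print(x, "->", t)
```

No labels have to be collected by hand here: the text itself supplies them.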

## 3. Word Analogies with Word Embeddings
We are now going to go over how you actually can perform calculations that show that:

> king - man $\approx$ queen - woman

It is quite simple, but worth going through so that intuitions can start forming about this entire process.

We can start by rewriting the above as:

> king - man + woman = ?

Then there are two main steps:

1. Convert the three words on the left to their word embeddings. For example:

```
def vec(word):
    return Word_embedding[word2idx[word]]  # row of the embedding matrix for this word

v0 = vec("king") - vec("man") + vec("woman")
```

And `v0` is just a point in a continuous $D$-dimensional space, which contains an infinite number of possible vectors!

2. We then want to find the "closest" actual word in our vocabulary to `v0`, and return that word.

Why do we need to do that? Well, the result of `vec(king) - vec(man) + vec(woman)` just gives us a vector. There is no way to map directly from vectors to words, since a vector space is continuous, and that would require an infinite number of words. So, the idea is that we just find the closest word.

### 3.1 Distance
There are various ways of defining distance in the context above. Sometimes, we will simply use the (squared) _Euclidean distance_:

$$\text{Squared Euclidean distance: } \|a - b\|^2$$

Squaring does not change which word is closest, so it is fine for ranking.

It is also common to use the _cosine distance_:

$$\text{Cosine Distance: } cosine\_distance(a, b) = 1 - \frac{a^Tb}{\|a\| \; \|b\|}$$

In this latter form only the angle $\theta$ between $a$ and $b$ matters, because:

$$a^Tb = \|a\| \; \|b\| \cos(\theta)$$

Recall the values of the cosine at a few key angles:

$$\cos(0^\circ) = 1, \; \cos(90^\circ) = 0, \; \cos(180^\circ) = -1$$

When two vectors are closer, $\cos(\theta)$ is bigger. So, we want our distance to be:

$$\text{Distance} = 1 - \cos(\theta)$$

During training we normalize all of the word vectors so that their length is 1. At this point we can say that all of the word embeddings lie on the unit sphere, and $a^Tb = \cos(\theta)$.
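Both distance measures are short to write with NumPy. The sketch below is one possible implementation; the function names and the example vectors are chosen for illustration.

```python
import numpy as np

def euclidean_distance_sq(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    diff = a - b
    return diff @ diff

def cosine_distance(a, b):
    """1 - cos(theta), where theta is the angle between a and b."""
    return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([0.0, 2.0])  # orthogonal to a
c = np.array([3.0, 0.0])  # same direction as a, different length

print(cosine_distance(a, b))        # 1.0 (90 degrees apart)
print(cosine_distance(a, c))        # 0.0 (same direction, length ignored)
print(euclidean_distance_sq(a, c))  # 4.0
```

Note how `a` and `c` are at cosine distance 0 even though their Euclidean distance is not 0. Once all vectors are normalized to length 1, the two measures rank neighbors the same way.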

### 3.2 Find the best word
Once we have our distance function, how do we actually find the closest word? The simplest way is to just look at every word in the vocabulary and compute the distance between its vector and your expression vector. Keep track of the smallest distance, and then return the corresponding word.

```
min_dist = float("inf")
best_word = ""
for word, idx in word2idx.items():
    v1 = Word_embedding[idx]
    d = dist(v0, v1)  # distance from the expression vector to this word
    if d < min_dist:
        min_dist = d
        best_word = word

print("The best word is:", best_word)
```

We may want to leave out the words from the left side of the equation, in this case _king, man_, and _woman_. Note that we will not be using this on our upcoming poetry data, since it doesn't have the kind of vocabulary we are looking for. We are more interested in things like nouns when we do word analogies. We want to compare kings and queens, men and women, occupations, etc. We will look more at word analogies later on. 
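Putting the pieces together, here is a small self-contained sketch of the whole analogy search, including the step of leaving out the input words. The embeddings are hand-built so that the arithmetic works out exactly; they are not learned, and the function name `find_analogy` is an assumption for this example. The loop over the vocabulary is replaced by one matrix product, which computes all cosine distances at once.

```python
import numpy as np

# Hand-built toy embeddings (illustrative only, not learned):
# the dimensions are roughly (royal, male, female, fruit).
word2idx = {"king": 0, "queen": 1, "man": 2, "woman": 3, "apple": 4}
Word_embedding = np.array([
    [1.0, 1.0, 0.0, 0.0],  # king
    [1.0, 0.0, 1.0, 0.0],  # queen
    [0.0, 1.0, 0.0, 0.0],  # man
    [0.0, 0.0, 1.0, 0.0],  # woman
    [0.0, 0.0, 0.0, 1.0],  # apple
])

def find_analogy(w1, w2, w3):
    """Return the word closest to vec(w1) - vec(w2) + vec(w3) by cosine distance."""
    v0 = (Word_embedding[word2idx[w1]]
          - Word_embedding[word2idx[w2]]
          + Word_embedding[word2idx[w3]])
    # Cosine distance from v0 to every row of the embedding matrix at once.
    norms = np.linalg.norm(Word_embedding, axis=1) * np.linalg.norm(v0)
    distances = 1 - (Word_embedding @ v0) / norms
    # Leave out the words that appear on the left side of the equation.
    for w in (w1, w2, w3):
        distances[word2idx[w]] = np.inf
    idx2word = {idx: word for word, idx in word2idx.items()}
    return idx2word[int(np.argmin(distances))]

print(find_analogy("king", "man", "woman"))  # queen
```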

## 4. Representing a Sequence of Words as a Sequence of Word Embeddings
Let's quickly go over one small detail from the upcoming code that may be slightly confusing. We have a word embedding matrix, $W_e$, of size $V \times D$ ($V$ = vocabulary size, $D$ = word vector dimensionality), and an input sequence of word indexes of length $T$. We would like to get a sequence of word vectors that represents the sentence, which is a $T \times D$ matrix. However, we need to update the word embeddings via backpropagation, so the $T \times D$ matrix we get after grabbing the word vectors cannot itself be the input into the neural network.
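One way to read this: the input to the model is the sequence of indexes, and the lookup into $W_e$ happens inside the model, so gradients can reach the embedding rows that were used. The NumPy sketch below illustrates the lookup and a single hand-rolled update step. The sizes, the placeholder gradient, and the learning rate are all assumptions, and a deep learning framework would normally handle the gradient for you.

```python
import numpy as np

V, D = 8, 3  # toy vocabulary size and embedding dimension (assumed values)
rng = np.random.default_rng(0)
W_e = rng.standard_normal((V, D))

word_indices = np.array([2, 5, 2, 7])  # a sequence of length T = 4

# Integer-array indexing pulls out one row of W_e per time step.
X = W_e[word_indices]
print(X.shape)  # (4, 3)

# Because X is computed from W_e, a gradient with respect to X can be
# routed back to the rows that were used. np.add.at accumulates correctly
# when an index repeats (index 2 appears twice here).
grad_X = np.ones((len(word_indices), D))  # placeholder gradient
learning_rate = 0.1
W_before = W_e.copy()
np.add.at(W_e, word_indices, -learning_rate * grad_X)
```

Only rows 2, 5, and 7 of `W_e` change here; every other word vector is left untouched by this sequence.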