## Embeddings

Similar words occur in similar contexts. This idea is used to map words to small vectors called *embeddings*, which end up close to each other when words have similar meanings and far apart when they don't.

## Word2Vec
### Skip-Gram model

1. For each word, map it to an embedding, initially a random one.
2. Use the embedding to predict the context of a word. The context is simply the words that are nearby (in a window around the original word).
3. The model used to predict a nearby word is a logistic regression.
4. Use t-SNE to visually examine the resulting distances between words.
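
As a rough illustration of step 2, the sketch below lists the (word, nearby word) pairs that a skip-gram model would be trained on. The function name `skipgram_pairs`, the `window` parameter, and the example sentence are assumptions made for illustration, not part of the original notes.

```python
def skipgram_pairs(tokens, window=1):
    """Return (word, context_word) pairs for every word within `window` positions."""
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((word, tokens[j]))
    return pairs


# Hypothetical example sentence.
pairs = skipgram_pairs("the cat sits on the mat".split(), window=1)
for word, context_word in pairs[:3]:
    print(word, "->", context_word)
```

Each pair is one training example: the embedding of the first word is fed to the classifier, which should assign a high probability to the second word.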

<img src="https://raw.githubusercontent.com/Runze/ud730-deep-learning-class-notes/master/screenshots/lesson-5-word2vec.png" alt="alt text" width="700">

To compare embeddings, use cosine similarity (cosine distance is 1 minus this value):

$$\frac{\textbf {A} \cdot \textbf {B}}{\Vert \textbf {A} \Vert_2 \Vert \textbf {B} \Vert_2} = \frac {\sum_{i=1}^{n}{A_{i}B_{i}}}{{\sqrt {\sum_{i=1}^{n}{A_{i}^{2}}}}{\sqrt {\sum_{i=1}^{n}{B_{i}^{2}}}}}$$
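
The formula above can be computed directly. This is a minimal sketch using plain Python lists; the function name `cosine_similarity` is an assumption, and it assumes neither vector is all zeros (otherwise the denominator is 0).

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [3.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 2.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
```

Because the vectors are normalized by their lengths, only the angle between them matters: vectors pointing the same way score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1.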

To speed up training, use *sampled softmax*: instead of computing the softmax over the whole vocabulary, randomly sample a subset of the negatives, i.e., the words that are not the target (the 0s in the one-hot-encoded labels):

<img src="https://raw.githubusercontent.com/Runze/ud730-deep-learning-class-notes/master/screenshots/lesson-5-word2vec_sampled-softmax.png" alt="alt text" width="700">
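
A simplified sketch of the idea: the cross-entropy loss is computed over the target plus a handful of randomly chosen negatives rather than the full vocabulary. The function name, the uniform sampling, and the example scores are all assumptions; practical implementations typically use a non-uniform sampler and correct for it, which this sketch leaves out.

```python
import math
import random


def sampled_softmax_loss(scores, target, num_sampled, rng):
    """Cross-entropy over the target and `num_sampled` uniformly sampled negatives."""
    negatives = [i for i in range(len(scores)) if i != target]
    sampled = rng.sample(negatives, num_sampled)   # sample without replacement
    logits = [scores[i] for i in [target] + sampled]
    # Numerically stable log-sum-exp over the reduced candidate set.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - scores[target]


# Hypothetical scores for a 6-word vocabulary; word 0 is the target.
scores = [2.0, 0.5, -1.0, 0.0, 1.5, -0.5]
rng = random.Random(0)
loss_sampled = sampled_softmax_loss(scores, target=0, num_sampled=2, rng=rng)
loss_full = sampled_softmax_loss(scores, target=0, num_sampled=len(scores) - 1, rng=rng)
```

With `num_sampled` equal to the number of negatives, the function reduces to the ordinary softmax cross-entropy; with fewer samples, the normalizer only sums over the sampled candidates, which is where the savings come from.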

### Continuous Bag-of-Words model (CBOW)

CBOW predicts target words (e.g. 'mat') from source context words ('the cat sits on the'), while the skip-gram does the inverse and predicts source context-words from the target words. This inversion might seem like an arbitrary choice, but statistically it has the effect that CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets. However, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets. ([Source](https://www.tensorflow.org/tutorials/word2vec))
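
To make "treating an entire context as one observation" concrete, the sketch below builds CBOW training examples, one per target word, each carrying its whole context. The function name `cbow_examples` and the `window` parameter are assumptions made for illustration.

```python
def cbow_examples(tokens, window=2):
    """Return one (context_words, target) example per position in `tokens`."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # words before the target
        right = tokens[i + 1:i + window + 1]      # words after the target
        examples.append((left + right, target))
    return examples


examples = cbow_examples("the cat sits on the mat".split(), window=2)
context, target = examples[-1]
print(context, "->", target)
```

Skip-gram would instead split each of these into one example per context word, so the same sentence yields more (and noisier) observations.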

## Recurrent neural networks

Like ConvNets, recurrent neural networks share parameters too; instead of sharing them across space, they share them over time.

<img src="https://raw.githubusercontent.com/Runze/ud730-deep-learning-class-notes/master/screenshots/lesson-5-rnn.png" alt="alt text" width="700">

### Exploding and vanishing gradients

As we backpropagate through time, we apply all these derivatives to the same parameters $w$ (because they are shared), which creates a lot of correlated updates and makes stochastic gradient descent unstable. Either the gradients grow exponentially and we end up with *exploding gradients*, or they shrink to 0 and we end up with *vanishing gradients*, which makes the model remember only the recent past.
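
A toy illustration of why this happens, assuming a scalar linear recurrence $h_t = w \, h_{t-1}$ (a deliberate simplification that ignores inputs and nonlinearities): the gradient of the last state with respect to the first is $w$ multiplied by itself once per time step. The function name is hypothetical.

```python
def gradient_through_time(w, steps):
    """Gradient of h_T with respect to h_0 when h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w   # the same shared weight is applied at every step
    return grad


exploding = gradient_through_time(1.5, steps=50)   # |w| > 1: grows exponentially
vanishing = gradient_through_time(0.5, steps=50)   # |w| < 1: shrinks toward 0
print(exploding, vanishing)
```

Only a weight very close to 1 keeps the gradient at a usable scale over many steps, which is why long sequences are hard to train on with a plain RNN.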

- To prevent exploding gradients, we can cap the norm of the gradient.
- To prevent vanishing gradients, we can use LSTM.
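
Capping the gradient norm can be sketched as follows: if the L2 norm of the gradient exceeds a threshold, rescale the gradient so its norm equals the threshold and its direction is unchanged. The function name `clip_by_norm` and the threshold value are assumptions made for illustration.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale `grad` so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)          # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]


clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)    # norm 5 is scaled down to 1
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5 is left alone
```

Clipping limits the size of each update without changing its direction, so it addresses exploding gradients but does nothing for vanishing ones.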

### LSTM (long short-term memory)