# WORD REPRESENTATION

* NLP:
* word embedding (geometry+arithmetic for words)
* debias word embeddings

## Word Representation:

One-hot notation: given a vocabulary $V$, "Man" is encoded as $o_{5391}$. Disadvantage: it treats each word as a thing unto itself and doesn't help the model generalize across words. Distances between one-hot vectors carry no semantic information: the inner product between any two different one-hot vectors is zero, so every pair of words is equally far apart.
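A quick numerical check of that claim, as a minimal sketch. The vocabulary size and the indices for "woman" and "apple" are made up for illustration; only 5391 for "man" comes from these notes.

```python
import numpy as np

vocab_size = 10_000  # assumed vocabulary size


def one_hot(index, size=vocab_size):
    """Return a one-hot vector with a single 1 at position `index`."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec


# 9853 and 456 are hypothetical indices, chosen only for the example.
man, woman, apple = one_hot(5391), one_hot(9853), one_hot(456)

# Inner products between different words are always zero ...
dot_man_woman = man @ woman
dot_man_apple = man @ apple

# ... so every pair of distinct words is at the same distance, sqrt(2).
dist_man_woman = np.linalg.norm(man - woman)
dist_man_apple = np.linalg.norm(man - apple)
```

"man" is exactly as far from "woman" as from "apple", so the representation gives the model nothing to generalize from.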

Instead, *featurized representation*: word embedding.


| FEAT | man | woman | king | queen | apple | orange |
| --- | --- | --- | --- | --- | --- | --- |
| Gender | -1 | 1 | -0.95 | 0.97 | 0.01 | 0 |
| Royal | 0.01 | 0.02 | 0.93 | 0.95 | 0 | -0.01 |
| Age | 0.03 | 0.02 | 0.7 | 0.69 | 0.03 | -0.01 |
| Food | 0.05 | 0 | 0.04 | 0.02 | 0.95 | 0.97 |

So now the vector for "Man", $e_{5391}$, has as many dimensions as the embedding has features (usually 50 to 1000), and the model will have an easier time finding semantic similarities.
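To make this concrete, here is a small sketch that takes four columns of the table above as 4-dimensional vectors and compares distances. The choice of plain Euclidean distance is just for illustration.

```python
import numpy as np

# Columns of the table above; feature order: gender, royal, age, food.
embedding = {
    "king":   np.array([-0.95, 0.93, 0.70, 0.04]),
    "queen":  np.array([0.97, 0.95, 0.69, 0.02]),
    "apple":  np.array([0.01, 0.00, 0.03, 0.95]),
    "orange": np.array([0.00, -0.01, -0.01, 0.97]),
}


def distance(word_a, word_b):
    """Euclidean distance between the feature vectors of two words."""
    return float(np.linalg.norm(embedding[word_a] - embedding[word_b]))


dist_apple_orange = distance("apple", "orange")  # two fruits: small
dist_apple_king = distance("apple", "king")      # fruit vs. monarch: large
```

Unlike with one-hot vectors, "apple" now sits much closer to "orange" than to "king".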

* We will see how to learn embeddings. As with features learned by NNs in general, the learned embedding dimensions won't be as explicit as the ones in the table.

* Visualizing embeddings: t-SNE

# USING WORD EMBEDDINGS

## NLP applications:

* Named entity recognition example: instead of one-hot vectors, we now pipe the input words through an embedding before passing them to the BRNN (or other model). Advantages:
  * NN will have an easier time learning generalizations, since changing "orange farmer" to "apple farmer" won't be a big gap
  * NN will react better to rare words: even if the training set didn't contain them, the embedding may put them close to words that the RNN was trained on, so it knows what to do (transfer learning).
  * Embeddings are learned unsupervised and fast, so they can be trained on very large unlabeled corpora (1-100 billion words is quite reasonable).
  * The input becomes much denser (e.g. 300 dense features instead of a 10k-dimensional one-hot vector). Not always an advantage.
    
Specific steps:

1. Learn word embedding from large text corpus, or use pre-trained one
2. Transfer embedding to new task with smaller training set
3. Optionally: continue finetuning the embeddings with new data (only if our dataset is big enough)
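The steps above can be sketched roughly as follows. This is only an illustration under loud assumptions: the "pre-trained" matrix is random, the word ids and labels are hypothetical, and the downstream model is a plain logistic regression per word rather than a BRNN.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_features = 10_000, 300

# Step 1: stand-in for a pre-trained embedding (random values, illustration only).
E_pretrained = rng.normal(scale=0.1, size=(n_features, vocab_size))
E = E_pretrained.copy()

# Step 2: a small labelled task; word ids and tags are hypothetical.
word_ids = np.array([12, 857, 3021, 44, 9100, 7])
is_entity = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
X = E[:, word_ids].T  # (6, 300) dense inputs instead of (6, 10000) one-hots


def task_loss(w, b):
    """Mean binary cross-entropy of a logistic regression on the embedded words."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return float(-np.mean(is_entity * np.log(p) + (1 - is_entity) * np.log(1 - p)))


w, b = np.zeros(n_features), 0.0
initial_loss = task_loss(w, b)
learning_rate = 0.5
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = (p - is_entity) / len(is_entity)
    w -= learning_rate * (X.T @ grad)  # only the task weights are updated
    b -= learning_rate * grad.sum()
final_loss = task_loss(w, b)

# Step 3 (optional, not shown): with enough task data, the used columns of E
# could be updated as well instead of staying frozen.
```

Keeping `E` frozen is the conservative default here; fine-tuning it only makes sense when the new dataset is large enough not to wreck the pre-trained geometry.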

Also:
* Embeddings have proven useful for: named entity recognition, text summarization, coreference resolution, parsing...
* They weren't as useful for: language modelling, machine translation (especially if you already have lots of data for that task).

**Transfer learning**: by replacing the one-hots with the embeddings, algorithms generalize better and/or learn from less data. The key is the relation between the two data sizes: the transfer pays off when the embedding was trained on far more text than the task's own training set contains.

## Finally

Relation to face encoding: for face recognition a CNN learns an "embedding" for a face and decides on top of that whether it is the same person. This is similar. **One difference: for face recognition you want the NN to be able to generalize to new, unseen pictures. Word embeddings have a fixed vocabulary, and we learn one embedding per vocabulary element and that's it**.




# PROPERTIES OF WORD EMBEDDINGS

* Analogy reasoning: maybe not the most important but representative of embeddings: man $\rightarrow$ woman, king $\rightarrow$ ??

* Arithmetic: $e_{man}-e_{woman} \approx e_{king}-e_{queen}$. The difference in both is basically gender. So $e_{king}-e_{man}+e_{woman} \approx e_{queen}$ (*Mikolov et al. 2013*). In general, find the word $w$ that maximizes $sim(e_w, e_{king}-e_{man}+e_{woman})$.
* Note that this arithmetic doesn't usually work on t-SNE because of its non-linear mapping.
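The analogy search can be tried directly on the hand-made feature table from the word representation section. This is a toy sketch: a real embedding has learned features and a far larger vocabulary, and excluding the three query words from the candidates is a common convention assumed here, not something stated in these notes.

```python
import numpy as np

# Feature order: gender, royal, age, food (values from the table in these notes).
emb = {
    "man":    np.array([-1.00, 0.01, 0.03, 0.05]),
    "woman":  np.array([1.00, 0.02, 0.02, 0.00]),
    "king":   np.array([-0.95, 0.93, 0.70, 0.04]),
    "queen":  np.array([0.97, 0.95, 0.69, 0.02]),
    "apple":  np.array([0.01, 0.00, 0.03, 0.95]),
    "orange": np.array([0.00, -0.01, -0.01, 0.97]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(a, b, c):
    """Solve 'a is to b as c is to ?', skipping the three query words."""
    query = emb[c] - emb[a] + emb[b]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], query))


answer = analogy("man", "woman", "king")  # "queen" with these toy vectors
```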

## Similarity functions:

### Cosine similarity
$sim(u,v) = \frac{u^Tv}{\lVert u \rVert_2 \lVert v \rVert_2}$

It is 1 if both vectors point in the same direction, 0 if they are perpendicular, and -1 if they point in opposite directions.

### Euclidean distance

Squared Euclidean distance $\lVert u-v \rVert^2$: a measure of dissimilarity (smaller means more similar, so it has to be minimized). Works as well, but is less often used.
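A minimal sketch of both functions, with hand-picked 2-D vectors (purely illustrative) that hit the three landmark values of the cosine and show one practical difference: cosine similarity ignores vector length, while the Euclidean distance does not.

```python
import numpy as np


def cosine_similarity(u, v):
    """u.v / (|u| |v|): depends only on the angle between u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def squared_euclidean(u, v):
    """|u - v|^2: depends on both the angle and the lengths."""
    diff = u - v
    return float(diff @ diff)


u = np.array([1.0, 2.0])
same_direction = np.array([2.0, 4.0])   # u scaled by 2
perpendicular = np.array([-2.0, 1.0])
opposite = np.array([-1.0, -2.0])

sim_same = cosine_similarity(u, same_direction)
sim_perp = cosine_similarity(u, perpendicular)
sim_opp = cosine_similarity(u, opposite)
dist_same = squared_euclidean(u, same_direction)
```

`u` and `same_direction` are maximally similar under the cosine, yet their squared Euclidean distance is not zero.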


# EMBEDDING MATRIX

When learning a "word embedding", an "embedding matrix" is created.

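The matrix $E$ has shape `(num_features, len(vocabulary))`. Multiplying it by a one-hot vector selects one column: $E \cdot o_{idx} = E_{:, idx} =: e_{idx}$, which is also written $e_w$ for the word $w$ at index $idx$.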

Initialize it randomly and optimize with gradient descent.
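A short sketch of the column-selection identity, with assumed sizes and a random $E$. It also hints at why implementations usually index the column directly rather than performing the full matrix-vector product.

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, vocab_size = 300, 10_000  # assumed sizes

# Random initialization, as it would be before any training.
E = rng.normal(scale=0.01, size=(num_features, vocab_size))

idx = 6257  # index used for "orange" in these notes
o = np.zeros(vocab_size)
o[idx] = 1.0

e_matmul = E @ o      # full product: num_features * vocab_size multiplications
e_lookup = E[:, idx]  # direct column lookup: same result, no arithmetic
```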

# LEARNING WORD EMBEDDINGS

Historically, researchers started with relatively complex algorithms and simplified them over time, realizing that the simpler versions still work. Following the same path gives a better intuition.


## Embedding through NN:

| I | want | a | glass | of | orange |
| --- | --- | --- | --- | --- | --- |
| 4343 | 9665 | 1 | 3852 | 6163 | 6257 |

Building a neural language model is a reasonable way to train an embedding. $E$ maps each high-dimensional one-hot input to a low-dimensional vector. The embeddings of a given "history of words" are concatenated and fed into a softmax, and the network is trained with gradient descent to predict the next word. This also trains $E$, since it is pushed towards a fit that lets the model generalize well and quickly, and this usually means that the embedded geometry of the words tends to reflect their semantics in a simple, linear way.
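A minimal NumPy sketch of that setup: one forward pass and one gradient step on a single training example. The sizes, the learning rate and the target index are assumptions for illustration, and there is no hidden layer, matching the description above rather than any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_features = 10_000, 50  # assumed sizes

context_ids = [4343, 9665, 1, 3852, 6163, 6257]  # "I want a glass of orange"
target_id = 4834  # hypothetical index for the next word, e.g. "juice"

E = rng.normal(scale=0.01, size=(n_features, vocab_size))
W = rng.normal(scale=0.01, size=(vocab_size, n_features * len(context_ids)))
b = np.zeros(vocab_size)


def forward(E, W, b):
    """Concatenate the context embeddings and apply a softmax layer."""
    h = np.concatenate([E[:, i] for i in context_ids])
    logits = W @ h + b
    logits = logits - logits.max()  # for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return h, probs


h, probs = forward(E, W, b)
loss_before = -np.log(probs[target_id])  # cross-entropy for one example

# Backward pass: gradient of the loss w.r.t. the logits is (probs - one_hot).
grad_logits = probs.copy()
grad_logits[target_id] -= 1.0
grad_h = W.T @ grad_logits  # computed before W is modified

learning_rate = 0.5
W -= learning_rate * np.outer(grad_logits, h)
b -= learning_rate * grad_logits
for k, i in enumerate(context_ids):
    # Each context word's column of E receives its own slice of grad_h.
    E[:, i] -= learning_rate * grad_h[k * n_features:(k + 1) * n_features]

_, probs_after = forward(E, W, b)
loss_after = -np.log(probs_after[target_id])
```

The point to notice is the last loop: the prediction error flows back into the columns of `E`, which is how the embedding gets trained as a side effect of next-word prediction.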

## Towards simpler models:

It turns out that the "history of words" can be replaced by other contexts: the N *surrounding words* also work, and **even just the previous word** works well (for learning embeddings, not for language modelling). Using a single nearby word as context is the basis for the *skip-gram* algorithm.

# WORD2VEC

Much simpler than learning a language model on top of the embedding. Mikolov et al. 2013

## Skip-grams:

* context-to-target mapping:

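Instead of always using a fixed context (such as the previous word), pick a random word in the text as context and then a random target word within some window around it. The target can come before or after the context, but cannot be the context position itself. The supervised learning problem built on top of these pairs is very hard and is unlikely to be solved well, but solving it is not the goal: trying to solve it is what learns a good word embedding.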
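The sampling of (context, target) pairs might look like the sketch below. The function name, the window of 5 positions and the uniform choice of the context are assumptions for illustration; it also assumes the text has at least two tokens.

```python
import random


def sample_skipgram_pairs(tokens, n_pairs, window=5, seed=0):
    """Draw random (context, target) word pairs from a token list.

    The target is drawn uniformly from the positions within `window`
    of the context position, excluding the context position itself.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        c = rng.randrange(len(tokens))
        lo = max(0, c - window)
        hi = min(len(tokens) - 1, c + window)
        t = rng.choice([i for i in range(lo, hi + 1) if i != c])
        pairs.append((tokens[c], tokens[t]))
    return pairs


sentence = "I want a glass of orange juice".split()
pairs = sample_skipgram_pairs(sentence, n_pairs=10)
```

Each pair is one supervised training example: given the first word, predict the second.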

## Skip-gram Model:

* Vocab size = 10k words (some have 1M+)
* context "orange" $o_{orange} \rightarrow o_{juice}$ target mapping.
* $o_{orange} \rightarrow E \rightarrow e_{orange} \rightarrow Softmax \rightarrow o_{juice}$. Note that the softmax layer has its own weights and biases to be trained.
* Loss function is the (categorical) cross-entropy: $\mathcal{L}(\hat y, y) = - \sum_i y_i \log(\hat y_i)$, where $y$ is the one-hot target and $\hat y$ is the softmax output.

* **problem**: computational cost of the softmax: to normalize, it has to sum one exponential for every output unit (as many as the vocabulary size). **solution**: hierarchical softmax. Instead of computing a whole softmax of size, say, $2^a$, we can use a tree (a kind of Huffman tree) of binary classifiers with depth $a$.
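To illustrate the hierarchical softmax idea, here is a toy sketch that assumes a perfectly balanced binary tree over $2^a$ words with heap-style node numbering. The real method uses a Huffman-like tree that puts frequent words closer to the root; all parameters here are random and all names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, n_features = 4, 8  # toy sizes: vocabulary of 2**4 = 16 words
vocab_size = 2 ** depth

# One parameter vector per internal node, heap-indexed 1..vocab_size-1
# (row 0 is unused; children of node n are 2n and 2n+1).
node_params = rng.normal(size=(vocab_size, n_features))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def word_probability(word, e_context):
    """P(word | context) as a product of `depth` binary left/right decisions."""
    node, prob = 1, 1.0
    for level in reversed(range(depth)):
        go_right = (word >> level) & 1  # next bit of the word index, MSB first
        p_right = sigmoid(node_params[node] @ e_context)
        prob *= p_right if go_right else 1.0 - p_right
        node = 2 * node + go_right
    return prob


e_context = rng.normal(size=n_features)  # stand-in for an embedding e_c
probs = np.array([word_probability(w, e_context) for w in range(vocab_size)])
```

Each word's probability needs only `depth` sigmoid evaluations instead of a sum over the whole vocabulary, and the leaf probabilities still sum to 1 by construction.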

Further details in the paper. No more time spent because there are simpler models (like *negative sampling*).

## How to sample the context?

Words like "the, of, a..." occur very frequently in text, so sampling contexts uniformly would produce many repeated pairs that are not very informative for the embedding. Downweighting them makes training more efficient and effective, so in practice more common words are given a lower probability of being sampled.
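One concrete heuristic of this kind is the subsampling rule from the word2vec papers (Mikolov et al. 2013), which keeps a word occurrence with probability $\min(1, \sqrt{t / f(w)})$, where $f(w)$ is the word's relative frequency. The sketch below applies it to a tiny made-up corpus. The threshold is set unrealistically high so that the effect is visible on 13 tokens; the papers suggest values around $10^{-5}$ for large corpora.

```python
import math
from collections import Counter


def keep_probabilities(tokens, threshold):
    """Per-word probability of keeping an occurrence: min(1, sqrt(t / f(w)))."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        word: min(1.0, math.sqrt(threshold / (count / total)))
        for word, count in counts.items()
    }


# Toy corpus, for illustration only.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
keep = keep_probabilities(tokens, threshold=0.05)
```

Here "the" (4 of 13 tokens) is kept less often than "sat" (2 of 13), which in turn is kept less often than "cat" (1 of 13).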


## CBOW (continuous bag of words):

The paper referenced above (Mikolov et al. 2013) also discusses this model. It predicts the middle word from its surrounding words, and has its own advantages and disadvantages compared to skip-gram. The main problem with skip-gram is the softmax bottleneck, which can be fixed.

# NEGATIVE SAMPLING

Modify skip-gram to overcome the time complexity problem and make it more efficient.

## Defining a new learning problem:

