# Week 2

## Word representation

* Problems with the standard one-hot representation of words: the model finds it hard to learn relationships between words (king is to queen as man is to woman):
  * Inner product between any two distinct one-hot vectors is 0.
  * Euclidean distance between any pair of distinct one-hot vectors is the same.
* It would be better to learn a set of features for each word that help to categorise it (e.g. gender, royal-ness, age, fruit-ness).
* An embedding learns relevant features (usually not easy to interpret), generating a dense vector of 300 (or however many) dimensions for each word.
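
The two one-hot problems above can be checked numerically. This is a minimal sketch, assuming a toy 5-word vocabulary (the word list and variable names are illustrative, not from the course):

```python
import numpy as np

# Toy vocabulary (an assumption for illustration only).
vocab = ["man", "woman", "king", "queen", "apple"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector for vocab[i]

man, woman, apple = one_hot[0], one_hot[1], one_hot[4]

# Inner product of any two distinct one-hot vectors is 0,
# so "man" looks no closer to "woman" than to "apple".
dot_man_woman = float(man @ woman)
dot_man_apple = float(man @ apple)

# Euclidean distance between any two distinct one-hot vectors is sqrt(2).
dist_man_woman = float(np.linalg.norm(man - woman))
dist_man_apple = float(np.linalg.norm(man - apple))

print(dot_man_woman, dot_man_apple)
print(dist_man_woman, dist_man_apple)
```

Both pairs give the same inner product and the same distance, which is why the representation carries no notion of similarity.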

## Using word embeddings

* Named entity recognition example: what happens if you come across words that your model never saw during training in the standard RNN approach? Enter embeddings.
* Embeddings can be trained on much larger corpora of text (1B - 100B words is not uncommon).
* Embeddings allow for transfer learning: models using pretrained embeddings can be trained on smaller corpora (à la ImageNet models in the image classification world).
  * Embeddings can be fine-tuned with new data (if the dataset is big enough).
* Embeddings have been useful for: entity recognition, text summarisation, co-reference resolution, parsing.
* Less useful for: language modeling, machine translation.

### Properties of word embeddings

* Word embeddings can help with "analogy reasoning".
  * Answering the question: "Man is to woman as king is to ___ ?"
* With 4-dimensional embeddings for man and woman, with features for gender, royalness, age and foodness, you could subtract one from the other:

    ```
    e_man - e_woman = [-2, 0, 0, 0]
    ```
    
  If you did the same with king and queen, you may end up with a similar result:
   
    ```
    e_king - e_queen = [-2, 0, 0, 0]
    ```
   
  * What it captures is that the main difference within each word pair is gender.
  * Ideas first published in paper [Linguistic Regularities in Continuous Space Word Representations](https://www.aclweb.org/anthology/N13-1090).
* Formally, you'd aim to find the $e_{w}$ that satisfies this expression:
   
   $e_{man} - e_{woman} \approx e_{king} - e_{w}$
   
   * Find the word $w$ whose embedding $e_{w}$ maximises the similarity expression (i.e. has the highest degree of similarity): $\text{sim}(e_{w}, e_{king} - e_{man} + e_{woman})$
   
* Similarity function: cosine similarity:
  * $\text{sim}(u, v) = \frac{u^{T}v}{||u||_2 ||v||_2}$
    * Without the denominator, it's just the inner product: if the vectors are similar, the product will be large.
    * The denominator normalises the result to between -1 and 1: vectors pointing in the same direction score 1, opposite directions score -1, and orthogonal vectors score 0.
  * Could also use squared distance: $||u-v||^2$ (since it's a measure of dissimilarity, you should take the negative to maximise).
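
The analogy search with cosine similarity can be sketched as follows. The 4-d embedding values (ordered as gender, royalness, age, foodness) are made up for illustration, and `cosine_similarity` and `complete_analogy` are hypothetical helper names. Excluding the three input words from the candidates is a common convention, assumed here rather than taken from the notes:

```python
import numpy as np

# Hypothetical 4-d embeddings: [gender, royalness, age, foodness].
# The numbers are illustrative only, not learned values.
embeddings = {
    "man":   np.array([-1.00,  0.01, 0.03, 0.09]),
    "woman": np.array([ 1.00,  0.02, 0.02, 0.01]),
    "king":  np.array([-0.95,  0.93, 0.70, 0.02]),
    "queen": np.array([ 0.97,  0.95, 0.69, 0.01]),
    "apple": np.array([ 0.00, -0.01, 0.03, 0.95]),
}

def cosine_similarity(u, v):
    """sim(u, v) = u.v / (||u||_2 * ||v||_2)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def complete_analogy(a, b, c, embeddings):
    """Return the word w that best satisfies 'a is to b as c is to w'."""
    target = embeddings[c] - embeddings[a] + embeddings[b]
    # Assumption: skip the input words so the answer is a new word.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

answer = complete_analogy("man", "woman", "king", embeddings)
print(answer)  # queen
```

With real pretrained embeddings the vocabulary is much larger, so the search is usually done as one matrix-vector product rather than a Python loop.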

### Embedding matrix



* The embedding matrix $E$ will be a `(300, 10000)` matrix (if you have a 10k vocab and 300 latent factors).
* You could look up a word's embedding using its `(10000, 1)` one-hot vector $o_j$: multiplying the embedding matrix by it selects column $j$, the 300 latent factors for the word:
  $E \cdot o_j = e_j$
* In practice you'd generally just look up the embedding of a word directly (e.g. a dictionary or index lookup) rather than multiplying, but it's easier to represent mathematically as a matrix product.
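
A small sketch showing that the matrix product and the direct column lookup give the same vector. `E` here is a random stand-in for a learned embedding matrix, and the word index `j` is arbitrary:

```python
import numpy as np

n_features, vocab_size = 300, 10000
rng = np.random.default_rng(0)
# Random stand-in for a learned embedding matrix (an assumption).
E = rng.normal(size=(n_features, vocab_size))

j = 6257  # arbitrary word index
o_j = np.zeros((vocab_size, 1))
o_j[j] = 1.0  # one-hot column vector for word j

e_via_matmul = E @ o_j   # shape (300, 1): multiplies by 9,999 zeros
e_via_lookup = E[:, j]   # shape (300,): just reads column j

print(e_via_matmul.shape, e_via_lookup.shape)
print(np.allclose(e_via_matmul[:, 0], e_via_lookup))
```

The lookup avoids a `(300, 10000)` by `(10000, 1)` multiplication in which almost every term is zero.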

### Learning word embeddings

* In the early days of learning word embeddings, the algorithms were quite complicated but over time, researchers discovered simpler ways to do it.
* If you started with a sentence and took away the last word:
  
    "I     want   a    glass   of     orange   ___."
     4343  9665   1    3852    6161   6257    
     
    * You would then look up the embedding for each word:
     * "I"       $o_{4343} \longrightarrow E \longrightarrow e_{4343}$
     * "want"    $o_{9665} \longrightarrow E \longrightarrow e_{9665}$
     * "a"       $o_{1}    \longrightarrow E \longrightarrow e_{1}$
     * "glass"   $o_{3852} \longrightarrow E \longrightarrow e_{3852}$
     * "of"      $o_{6161} \longrightarrow E \longrightarrow e_{6161}$
     * "orange"  $o_{6257} \longrightarrow E \longrightarrow e_{6257}$
    * Then pass all the embeddings into a neural network layer, which feeds into a softmax layer. The softmax classifies over the 10k vocabulary words and outputs the predicted next word.
    * The size of the input to the network depends on the "fixed historical window" hyperparameter. In other words, you could choose to use only the last n words to predict the output.
    * The input would have n times the number of latent factors; with a window of the last 4 words in the above example: `4 * 300 = 1200` values.
    * Use backprop to find the ideal word embeddings for the task.
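
A rough sketch of the forward pass described above, assuming a window of the last 4 words, a single `tanh` hidden layer of 128 units, and randomly initialised parameters (the hidden size, activation and all variable names are assumptions). Training with backprop would update `E` along with the other weights:

```python
import numpy as np

rng = np.random.default_rng(0)
# n_hidden and the tanh activation are assumptions for this sketch.
vocab_size, n_features, window, n_hidden = 10000, 300, 4, 128

E = rng.normal(scale=0.01, size=(n_features, vocab_size))          # embedding matrix
W1 = rng.normal(scale=0.01, size=(n_hidden, window * n_features))  # hidden layer weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.01, size=(vocab_size, n_hidden))           # softmax layer weights
b2 = np.zeros(vocab_size)

def predict_next_word_probs(context_ids):
    """Embed the last `window` words, concatenate, then hidden layer and softmax."""
    x = np.concatenate([E[:, i] for i in context_ids[-window:]])  # shape (1200,)
    h = np.tanh(W1 @ x + b1)
    logits = W2 @ h + b2
    logits -= logits.max()  # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()  # probability for each of the 10k words

sentence_ids = [4343, 9665, 1, 3852, 6161, 6257]  # "I want a glass of orange"
probs = predict_next_word_probs(sentence_ids)
print(probs.shape, probs.sum())
```

Because only the last 4 word indices are used, the input size stays fixed at `4 * 300` regardless of sentence length.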
    
* More complex algorithms:
  * Given a sentence: "I want a glass of orange juice to go along with my cereal." try to predict one of the middle words.
    * You might feed into the neural net the 4 words on the left & right.
    * You could also just pass in the single previous word.
    * You could also just take 1 nearby word (glass, cereal etc) - this is the idea behind the skip-gram model.

### Word2Vec

* "A simpler and computationally more efficient way to learn embeddings".
* Given the sentence: "I want a glass of orange juice to go along with my cereal", you pick a context -> target pair randomly (called a 'skip-gram'), e.g. `orange` and `juice`, or `orange` and `glass`, or `orange` and `my`.
* The model probably won't do well on this prediction problem, but success on the problem isn't the goal: you want to learn a good embedding.
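
Sampling context -> target pairs could look like the sketch below. The helper name `sample_skip_gram_pairs`, the window of ±10 words and the uniform sampling of the context word are assumptions for illustration, not details given in the notes:

```python
import random

def sample_skip_gram_pairs(tokens, window=10, n_pairs=5, seed=0):
    """Pick a random context word, then a random target within +/- `window` words."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        c = rng.randrange(len(tokens))  # position of the context word
        lo, hi = max(0, c - window), min(len(tokens) - 1, c + window)
        # Any nearby position except the context word itself.
        t = rng.choice([i for i in range(lo, hi + 1) if i != c])
        pairs.append((tokens[c], tokens[t]))
    return pairs

sentence = "I want a glass of orange juice to go along with my cereal"
tokens = sentence.split()
pairs = sample_skip_gram_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

Each pair becomes one supervised training example: the context word is the input and the target word is the label.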