# Introduction to Word Embeddings
## Word Representation
- 1-hot representation

V = [a, aaron, ...., zulu, <UNK>]
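
As a minimal sketch of the 1-hot representation, the snippet below builds the vector for a word in a toy vocabulary. The six-word vocabulary and the helper name `one_hot` are assumptions made for illustration; words outside the vocabulary fall back to `<UNK>`.

```python
# Toy vocabulary (hypothetical); a real one would hold thousands of words.
vocab = ["a", "aaron", "king", "queen", "zulu", "<UNK>"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list with a single 1 at the word's index and 0 elsewhere."""
    index = word_to_index.get(word, word_to_index["<UNK>"])
    vector = [0] * len(vocab)
    vector[index] = 1
    return vector

king = one_hot("king")
queen = one_hot("queen")
# The inner product of two different 1-hot vectors is always 0, so this
# representation carries no notion of similarity between words.
dot = sum(k * q for k, q in zip(king, queen))
```

Because every pair of distinct words has inner product 0, "king" is no closer to "queen" than to any other word.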

- featurized representation: each word is a vector of weights over categorical features, so similar words get similar vectors

- Visualizing word embeddings with t-SNE or [UMAP](https://github.com/lmcinnes/umap/issues):
 the characteristics of a word are embedded into a lower-dimensional space.

## Using word embeddings
- Named entity recognition example

## Transfer learning and word embeddings
- 1 Learn word embeddings from a large text corpus (1-100B words)
    - or download pre-trained embeddings
- 2 Transfer the embeddings to a new task with a smaller training set (say, 100k words)
- 3 Optional: Continue to fine-tune the word embeddings with the new data.

Face encoding is a similar kind of embedding: it embeds the characteristics of a face into a vector.

## Properties of word embeddings
- Analogies using word vectors
\begin{eqnarray}
e_{man} - e_{woman} \approx e_{king} - e_{queen}
\end{eqnarray}
[Mikolov et al., 2013, Linguistic regularities in continuous space word representations](https://www.aclweb.org/anthology/N13-1090)

- Cosine similarity
\begin{eqnarray}
    \arg\max_w \; sim(e_w, e_{king}-e_{man}+e_{woman}), \qquad sim(u, v) = \frac{u^T v}{\|u\|_2 \|v\|_2}
\end{eqnarray}
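
The analogy search can be sketched as follows: compute $e_{king}-e_{man}+e_{woman}$ and return the word whose embedding has the highest cosine similarity to it. The 3-dimensional vectors and the helper names `cosine_similarity` and `complete_analogy` are hypothetical, hand-picked so the arithmetic is easy to follow; real embeddings are learned and have far more dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
embeddings = {
    "man":   [1.0, 0.0, 0.1],
    "woman": [-1.0, 0.0, 0.1],
    "king":  [1.0, 1.0, 0.2],
    "queen": [-1.0, 1.0, 0.2],
    "apple": [0.0, -0.5, 1.0],
}

def cosine_similarity(u, v):
    """sim(u, v) = (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def complete_analogy(a, b, c):
    """'a is to b as c is to ?': find w maximizing sim(e_w, e_c - e_a + e_b)."""
    target = [ec - ea + eb
              for ea, eb, ec in zip(embeddings[a], embeddings[b], embeddings[c])]
    # Exclude the three query words so they cannot be returned as the answer.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

answer = complete_analogy("man", "woman", "king")
```

With these toy vectors, `answer` is `"queen"`. Excluding the query words from the candidates is a common design choice, since the nearest vector is otherwise often one of the inputs.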

## Embedding matrix
Matrix $E$ ($word \times category$) stores one embedding vector per vocabulary word.
- In practice, use a specialized function to look up an embedding, rather than multiplying $E$ by a 1-hot vector.
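
A small sketch of why a lookup is preferred over a matrix product: both give the same vector, but the product multiplies every entry of $E$ by a value that is almost always 0. The layout assumed here (features as rows, words as columns) and the function names are illustrative choices, not a fixed convention.

```python
# Hypothetical embedding matrix E: 3 features (rows) x 4 words (columns).
E = [
    [0.1, 0.5, -0.3, 0.9],
    [0.7, -0.2, 0.4, 0.0],
    [-0.6, 0.8, 0.2, 0.3],
]
vocab_size = len(E[0])

def embed_by_matmul(word_index):
    """Multiply E by a 1-hot vector: touches every entry of E."""
    one_hot = [1 if j == word_index else 0 for j in range(vocab_size)]
    return [sum(row[j] * one_hot[j] for j in range(vocab_size)) for row in E]

def embed_by_lookup(word_index):
    """Read the word's column directly: touches one entry per feature."""
    return [row[word_index] for row in E]

same = embed_by_matmul(2) == embed_by_lookup(2)
```

The product costs work proportional to (features × vocabulary size), while the lookup costs work proportional to the number of features only.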

# Learning Word Embeddings: word2vec & GloVe
## Learning word embeddings

Neural language model
[Bengio et al., 2003, A neural probabilistic language model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)

Context/target pairs: for example, use the last 4 words as the 'context' and the following word as the 'target'; other choices of context are also possible.

## Word2vec
- Skipgrams [Mikolov et al., 2013, Efficient estimation of word representations in vector space](https://arxiv.org/abs/1301.3781)
- Model: vocab size = 10,000 (10k)
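
One way to sketch how skip-gram training pairs could be generated: for each context word, pick a target at random from the words within a small window around it. The window size, the seed, the example sentence, and the function name `skipgram_pairs` are all assumptions for illustration.

```python
import random

def skipgram_pairs(tokens, window=2, seed=0):
    """Return (context, target) pairs: for each context word, pick one
    target uniformly from the words within `window` positions of it."""
    rng = random.Random(seed)
    pairs = []
    for i, context in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        neighbours = [j for j in range(lo, hi) if j != i]
        if neighbours:
            j = rng.choice(neighbours)
            pairs.append((context, tokens[j]))
    return pairs

# Hypothetical example sentence.
sentence = "I want a glass of orange juice to go along with my cereal".split()
pairs = skipgram_pairs(sentence, window=2)
```

Each pair then becomes one supervised example for the model that predicts the target from the context.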

- Problem with softmax classification: the denominator sums over the whole vocabulary, which makes every training step slow.
\begin{eqnarray}
    p(t|c) = \frac{e^{\theta_t^T e_c}}{\sum_{j=1}^{10,000}e^{\theta_j^T e_c}}
\end{eqnarray}
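
The softmax above can be sketched directly to make its cost visible: computing a single probability requires one dot product per vocabulary word. The toy parameters (a 4-word vocabulary with 2-dimensional vectors) and the function name are hypothetical.

```python
import math

def softmax_probability(target, theta, e_c):
    """p(t|c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c)."""
    scores = [sum(a * b for a, b in zip(theta_j, e_c)) for theta_j in theta]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    return exps[target] / sum(exps)  # the sum runs over the whole vocabulary

# Hypothetical toy setup: 4-word vocabulary, 2-dimensional vectors.
theta = [[0.2, 0.1], [1.0, -0.5], [-0.3, 0.8], [0.0, 0.0]]
e_c = [0.5, 1.0]
probs = [softmax_probability(t, theta, e_c) for t in range(len(theta))]
```

With a 10,000-word vocabulary the list `scores` would hold 10,000 dot products for every training example, which motivates cheaper alternatives.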

## Negative Sampling
Defining a new learning problem [Mikolov et al., 2013. Distributed representations of words and phrases and their compositionality](https://arxiv.org/abs/1310.4546)
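
A rough sketch of the new learning problem: each observed (context, target) pair becomes a positive example with label 1, and $k$ randomly drawn words become negative examples with label 0, so training reduces to a handful of binary classifications instead of a full softmax. Sampling negatives uniformly, the toy vocabulary, and the function names are simplifying assumptions made here.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_examples(context, target, vocab, k, rng):
    """One positive (context, target, 1) plus k random negatives (context, word, 0)."""
    examples = [(context, target, 1)]
    candidates = [w for w in vocab if w != target]
    for _ in range(k):
        examples.append((context, rng.choice(candidates), 0))
    return examples

def logistic_loss(theta, e, examples):
    """Sum of binary cross-entropy terms, one per (context, word, label) example."""
    loss = 0.0
    for context, word, label in examples:
        p = sigmoid(sum(a * b for a, b in zip(theta[word], e[context])))
        loss -= math.log(p) if label == 1 else math.log(1.0 - p)
    return loss

rng = random.Random(0)
vocab = ["orange", "juice", "king", "book", "the", "of"]  # hypothetical
examples = make_examples("orange", "juice", vocab, k=4, rng=rng)
# Hypothetical parameters: every vector starts at small random values.
theta = {w: [rng.uniform(-0.1, 0.1) for _ in range(3)] for w in vocab}
e = {w: [rng.uniform(-0.1, 0.1) for _ in range(3)] for w in vocab}
loss = logistic_loss(theta, e, examples)
```

Each update touches only $k + 1$ output vectors rather than the whole vocabulary.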

## GloVe word vectors
- GloVe: global vectors for word representation
[Pennington et al., 2014. GloVe: Global vectors for word representation](https://nlp.stanford.edu/pubs/glove.pdf)

minimize $\sum_{i=1}^{10,000}\sum_{j=1}^{10,000} f(X_{ij}) (\theta_i^T e_j + b_i + b'_j - \log X_{ij})^2$
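
The GloVe objective can be sketched in a few lines, with both bias terms added inside the squared error. The weighting function shown (a capped power of the count, with `x_max` and `alpha` as assumed defaults) is one possible choice satisfying $f(0)=0$; the function names and the toy co-occurrence counts are hypothetical.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): 0 when the count is 0, capped at 1 for frequent pairs."""
    if x == 0:
        return 0.0
    return min(1.0, (x / x_max) ** alpha)

def glove_loss(X, theta, e, b, b_prime):
    """Sum over i, j of f(X_ij) * (theta_i . e_j + b_i + b'_j - log X_ij)^2."""
    total = 0.0
    for i, row in enumerate(X):
        for j, count in enumerate(row):
            f = weight(count)
            if f == 0.0:
                continue  # f(0) = 0 lets us skip the undefined log(0)
            dot = sum(a * c for a, c in zip(theta[i], e[j]))
            total += f * (dot + b[i] + b_prime[j] - math.log(count)) ** 2
    return total

# Hypothetical 2-word co-occurrence counts, with all parameters set to zero
# only so that the loss value is easy to check by hand.
X = [[0, 2], [2, 0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
loss = glove_loss(X, zeros, zeros, [0.0, 0.0], [0.0, 0.0])
```

With all parameters at zero, only the two non-zero counts contribute, each adding $f(2)\,(\log 2)^2$ to the loss.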

# Applications using Word Embeddings

## Sentiment Classification
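Sentiment classification with an RNN is a many-to-one problem.
Example: 'completely lacking in good services, good foods.'
This sentence contains the word 'good' twice, but it actually means 'pretty bad'; a model that ignores word order can be misled, whereas an RNN reads the words in sequence.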
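
To see why counting 'good' is misleading, here is a toy sketch that scores the review by averaging per-word values and ignoring word order. The 1-dimensional "sentiment" scores and the function name are invented for illustration and are not learned embeddings.

```python
# Hypothetical 1-dimensional "sentiment" scores, hand-picked for illustration.
sentiment = {"completely": 0.0, "lacking": -0.8, "in": 0.0,
             "good": 0.9, "services": 0.1, "foods": 0.1}

def average_score(tokens):
    """Average the per-word scores, ignoring word order entirely."""
    return sum(sentiment.get(t, 0.0) for t in tokens) / len(tokens)

review = "completely lacking in good services good foods".split()
score = average_score(review)
# Any reordering of the words yields the same average.
shuffled_score = average_score(sorted(review))
```

The average comes out positive because 'good' appears twice, even though the review is negative, and shuffling the words does not change it. A sequence model can instead learn that 'lacking in' flips what follows.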

## Debiasing word embeddings
### The problem of bias in word embeddings
- Man:Woman as King:Queen
- Man:Computer_programmer as Woman:Homemaker (a horrible mistake)
- Father:Doctor as Mother:Nurse (also a mistake)

Word embeddings can reflect the gender, ethnicity, age, sexual orientation, and other biases of the text used to train the model.

[Bolukbasi et al., 2016. Man is to computer programmer as woman is to homemaker?](https://arxiv.org/abs/1607.06520)

### Addressing bias in word embeddings
- 1 Identify bias direction
- 2 Neutralize: For every word that is not definitional, project to get rid of bias
- 3 Equalize pairs

Open question: how should we define what counts as bias?
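
The first two steps can be sketched with plain vector arithmetic: take a bias direction $g$, then remove from a non-definitional word its projection onto $g$. Using a single word pair to define $g$ is a simplification assumed here, the vectors are hypothetical, and the equalize step is not shown.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bias_direction(e_a, e_b):
    """Step 1 (simplified): difference of one definitional pair as the bias direction."""
    return [a - b for a, b in zip(e_a, e_b)]

def neutralize(e, g):
    """Step 2: subtract the projection of e onto g, leaving e orthogonal to g."""
    scale = dot(e, g) / dot(g, g)
    return [x - scale * gx for x, gx in zip(e, g)]

# Hypothetical 3-dimensional vectors for illustration.
e_man, e_woman = [1.0, 0.2, 0.0], [-1.0, 0.2, 0.0]
e_doctor = [0.4, 0.9, 0.3]
g = bias_direction(e_man, e_woman)
e_doctor_debiased = neutralize(e_doctor, g)
```

After neutralizing, the vector for the non-definitional word has no component along $g$, while its other components are unchanged.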

# Practice questions
## Q1 Suppose you learn a word embedding for a vocabulary of 10000 words. Then the embedding vectors should be 10000 dimensional, so as to capture the full range of variation and meaning in those words.
### Answer: False

## Q2 What is t-SNE?
- A linear transformation that allows us to solve analogies on word vectors
- A non-linear dimensionality reduction technique
- A supervised learning algorithm for learning word embeddings
- An open-source sequence modeling library
### Answer: A non-linear dimensionality reduction technique

## Q3 Suppose you download a pre-trained word embedding which has been trained on a huge corpus of text. You then use this word embedding to train an RNN for a language task of recognizing if someone is happy from a short snippet of text, using a small training set.

Then even if the word “ecstatic” does not appear in your small training set, your RNN might reasonably be expected to recognize “I’m ecstatic” as deserving a label $y=1$.

### Answer: True

## Q4 Which of these equations do you think should hold for a good word embedding? (Check all that apply)

## Q5 Let $E$ be an embedding matrix, and let $e_{1234}$ be a one-hot vector corresponding to word 1234. Then to get the embedding of word 1234, why don’t we call $E\times e_{1234}$ in Python?
### Answer: It is computationally wasteful.

## Q6 When learning word embeddings, we create an artificial task of estimating $P(target|context)$. It is okay if we do poorly on this artificial prediction task; the more important by-product of this task is that we learn a useful set of word embeddings.
### Answer: True

## Q7 In the word2vec algorithm, you estimate $P(t|c)$, where $t$ is the target word and $c$ is a context word. How are $t$ and $c$ chosen from the training set? Pick the best answer.
- $c$ is a sequence of several words immediately before $t$.
- $c$ is the sequence of all the words in the sentence before $t$.
- $c$ and $t$ are chosen to be nearby words.
- $c$ is the one word that comes immediately before $t$.
### Answer: $c$ and $t$ are chosen to be nearby words.

## Q8 Suppose you have a 10000 word vocabulary, and are learning 500-dimensional word embeddings. The word2vec model uses the following softmax function:
\begin{eqnarray}
    p(t|c) = \frac{e^{\theta_t^T e_c}}{\sum_{j=1}^{10,000}e^{\theta_j^T e_c}}
\end{eqnarray}
Which of these statements are correct? Check all that apply.

- $\theta_t$ and $e_c$ are both 500 dimensional vectors.
- $\theta_t$ and $e_c$ are both 10000 dimensional vectors.
- $\theta_t$ and $e_c$ are both trained with an optimization algorithm such as Adam or gradient descent.
- After training, we should expect $\theta_t$ to be very close to $e_c$ when $t$ and $c$ are the same word.
### Answer:
- $\theta_t$ and $e_c$ are both 500 dimensional vectors.
- $\theta_t$ and $e_c$ are both trained with an optimization algorithm such as Adam or gradient descent.

## Q9 Suppose you have a 10000 word vocabulary, and are learning 500-dimensional word embeddings. The GloVe model minimizes this objective:

minimize $\sum_{i=1}^{10,000}\sum_{j=1}^{10,000} f(X_{ij}) (\theta_i^T e_j + b_i + b'_j - \log X_{ij})^2$

Which of these statements are correct? Check all that apply.

- $\theta_i$ and $e_j$ should be initialized to 0 at the beginning of training.
- $\theta_i$ and $e_j$ should be initialized randomly at the beginning of training.
- $X_{ij}$ is the number of times word $i$ appears in the context of word $j$.
- The weighting function $f(\cdot)$ must satisfy $f(0)=0$.

### Answer:
- $\theta_i$ and $e_j$ should be initialized randomly at the beginning of training.
- $X_{ij}$ is the number of times word $i$ appears in the context of word $j$.
- The weighting function $f(\cdot)$ must satisfy $f(0)=0$.

## Q10 You have trained word embeddings using a text dataset of $m_1$ words. You are considering using these word embeddings for a language task, for which you have a separate labeled dataset of $m_2$ words. Keeping in mind that using word embeddings is a form of transfer learning, under which of these circumstances would you expect the word embeddings to be helpful?
### Answer: $m_1 \gg m_2$