# CS224n: NLP with Deep Learning

# Lecture 2: Word Vectors and Word Senses

## Finishing Word vectors

* King - Man ≈ the idea of kingship without the "man" part
* + Woman = add the idea of "woman" to it
* => we get (a vector close to) Queen
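
A minimal sketch of this analogy arithmetic, assuming hand-picked 2-D toy vectors (the `toy` dictionary and the helper names `cosine` and `analogy` are hypothetical, not from the lecture). Real word vectors are learned and much higher-dimensional; this only shows the mechanics of "subtract, add, then look up the nearest word".

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
# Learned embeddings do not have such readable axes.
toy = {
    "king":   np.array([1.0, 1.0]),
    "queen":  np.array([1.0, -1.0]),
    "man":    np.array([0.0, 1.0]),
    "woman":  np.array([0.0, -1.0]),
    "prince": np.array([0.7, 0.9]),
    "apple":  np.array([-1.0, 0.2]),
}

def analogy(a, b, c, vectors):
    """Word closest to vectors[a] - vectors[b] + vectors[c], excluding a, b, c."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman", toy))  # queen
```

The input words are excluded from the candidates because, with real embeddings, the nearest neighbor of the result is often one of the inputs themselves.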

Words can be plotted on a scatter plot, after projecting their vectors down to 2 dimensions.

### Word2vec : Vectorization

We have 2 big matrices, each with one row per word:
* $U$, whose rows are the outside word vectors
* $V$, whose rows are the center word vectors

$U = \begin{bmatrix}[outside\,word\,vector\,1]
\\ \vdots
\\ [outside\,word\,vector\,n] \end{bmatrix}$

$V = \begin{bmatrix}[center\,word\,vector\,1]
\\ \vdots
\\ [center\,word\,vector\,n] \end{bmatrix}$

We multiply $U$ by a center word vector $v_{4}$:

$\begin{aligned}
U \cdot v_{4} &= \begin{bmatrix}[outside\,word\,vector\,1]
\\ \vdots
\\ [outside\,word\,vector\,n] \end{bmatrix}
\cdot v_{4}
\\ &=\begin{bmatrix}[similarity\,with\,u_{1}]
\\ \vdots
\\ [similarity\,with\,u_{n}] \end{bmatrix}
\end{aligned}$
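
The product above can be sketched in NumPy. The sizes and the random values are assumptions chosen for illustration; the point is that a single matrix-vector product yields all the dot-product similarities at once.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dim = 5, 3  # hypothetical vocabulary size and vector dimension

U = rng.normal(size=(n_words, dim))  # one outside word vector per row
V = rng.normal(size=(n_words, dim))  # one center word vector per row

v4 = V[3]          # center word 4 (the notes count from 1, NumPy from 0)
scores = U @ v4    # entry i is the dot product of u_(i+1) with v4

# Same result computed one row at a time, for comparison.
loop_scores = np.array([U[i] @ v4 for i in range(n_words)])
print(np.allclose(scores, loop_scores))  # True
```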

## Optimization

## Count-based methods

## GloVe

## Evaluating word vectors

## Word Senses

## How to get the meaning of a word?

### WordNet: a dictionary of synonyms

* Use a dictionary such as WordNet, which stores the synonyms of each word

Problems:
* Missing nuances
* Missing new words / new meanings of words (e.g. ninja)
* Can't compute accurate word similarity if the words aren't in the same synonym set

Traditional NLP:
* Everything until 2012
* Words were regarded as discrete symbols -> usage of one-hot vectors

### One-hot encoding

Problems:
* No notion of similarity: word vectors for 'hotel' and 'motel' are orthogonal: similarity = 0
* Big dimensions (200k-1m words)
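
A minimal sketch of the similarity problem, assuming a hypothetical three-word vocabulary (a real one would have hundreds of thousands of entries, hence the large dimension):

```python
import numpy as np

vocab = ["hotel", "motel", "ninja"]  # hypothetical tiny vocabulary
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

# Two different words never share their non-zero position,
# so their dot product is always 0.
print(one_hot("hotel") @ one_hot("motel"))  # 0.0
```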

### Distributional semantics

**A word's meaning is given by the words that frequently appear close-by**

* A word vector = a smaller, but dense vector
* Typical dimensions:
    * Min: 50
    * Average: 300
    * Max: 2000

## Word2Vec

* Word2vec = a framework for learning word vectors

* Closeness in the vector space ~ Word similarity

### General Idea
* Initialize each word vector randomly
* For each position in the text:
    * a center word *c*
    * context ('outside') words *o*
    * Calculate $P(o|c)$ or $P(c|o)$
    * Adjust the word vectors to maximize this probability
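
The "for each position" step can be sketched as a generator of (center, outside) pairs. The function name `center_outside_pairs`, the window radius `m`, and the example sentence are assumptions for illustration; the probability and update steps are left out here.

```python
def center_outside_pairs(tokens, m):
    """Yield (center, outside) pairs for every position, with window radius m."""
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            # Skip the center itself and positions outside the text.
            if j == 0 or not 0 <= t + j < len(tokens):
                continue
            yield center, tokens[t + j]

tokens = "problems turning into banking crises".split()
pairs = list(center_outside_pairs(tokens, m=2))

# Outside words seen when "into" is the center word.
print([o for c, o in pairs if c == "into"])
# ['problems', 'turning', 'banking', 'crises']
```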

Let's maximize the Likelihood, with respect to all the $\theta$ variables

**Likelihood**

$$ L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \leq j \leq m \\ j \neq 0}} P(w_{t+j} | w_{t}; \theta) $$

Let's maximize $L(\theta)$, i.e. minimize $-L(\theta)$. In practice we minimize the averaged negative log-likelihood, which has the same optimum.

### Objective function $J(\theta)$

$$ J(\theta) = - \frac{1}{T} \log L(\theta) $$
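
Taking the log turns the double product of the likelihood into a double sum, which is the form that is easier to differentiate. Written out with the same notation as the likelihood:

$$ J(\theta) = - \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \log P(w_{t+j} | w_{t}; \theta) $$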

**How do we calculate the probability $P(w_{i} | w_{j})$ ?**

$$ P(o | c) = \frac {\exp(u_{o}^{T} \cdot v_{c}) } {\sum\limits_{w \in V} \exp(u_{w}^{T} \cdot v_{c}) } $$

The denominator sums over all $w \in V$ so that $\sum\limits_{o \in V} P(o|c) = 1$

`P(o|c) = softmax([dot_product(u_w, v_c) for w in vocabulary])[o]`

i.e. it is the similarity between our center word and the outside word *o*, normalized over every word in the vocabulary

**SoftMax**

Thus, by using softmax, we get a probability distribution:
* Max: because it amplifies the probability of the largest elements
* Soft: because it still assigns some probability to the smaller elements
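
A minimal sketch of softmax and of $P(o|c)$ in NumPy, assuming random toy vectors. Subtracting the maximum score before exponentiating is a common numerical-stability trick that is not in the notes; it does not change the result. The function names are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Map a vector of real scores to a probability distribution."""
    shifted = scores - np.max(scores)  # stability trick, result is unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()

def prob_outside_given_center(U, v_c):
    """P(o | c) for every word o: softmax over the dot products u_w . v_c."""
    return softmax(U @ v_c)

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 4))  # hypothetical: 6 words, 4 dimensions
v_c = rng.normal(size=4)     # hypothetical center word vector

p = prob_outside_given_center(U, v_c)
print(np.isclose(p.sum(), 1.0))               # True: probabilities sum to 1
print(np.argmax(p) == np.argmax(U @ v_c))     # True: largest score wins ("max")
print(bool(np.all(p > 0)))                    # True: nothing gets zero ("soft")
```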

### Our parameters $\theta$

$\theta$ is the concatenation of all the word vectors of our vocabulary

* Every word has 2 word vectors
* With $d$-dimensional vectors and $V$ words:

$$ \theta \in \mathbb{R}^{2dV} $$
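
A small sketch of how $\theta$ could be laid out, assuming toy sizes and small random initial values. The order of concatenation (center vectors first, then outside vectors) is an arbitrary choice made here, not something the notes specify.

```python
import numpy as np

d, vocab_size = 4, 6  # hypothetical dimension d and vocabulary size V
rng = np.random.default_rng(0)

V_center = rng.normal(scale=0.1, size=(vocab_size, d))   # center vectors v_w
U_outside = rng.normal(scale=0.1, size=(vocab_size, d))  # outside vectors u_w

# Flatten both matrices and join them into one long parameter vector.
theta = np.concatenate([V_center.ravel(), U_outside.ravel()])
print(theta.shape)  # (48,), i.e. 2 * d * V
```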

### Calculating the gradient

$$ \frac{\partial }{\partial v_{c}} \log P(o|c) = u_{o} - \sum_{x=1}^{V} p(x|c)u_{x} $$

Slope = observed representation of our context word - what our model thinks the context should look like

Here "what our model thinks the context should look like" is the expectation $\sum_{x} p(x|c)u_{x}$: the average of all outside vectors, weighted by their probability under the model.

$ Slope\, with\, respect\, to\, center\, word\, vector = Actual\, context\, word - Expected\, context\, word $
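
The "observed minus expected" formula can be checked numerically. The sketch below assumes random toy vectors and compares the analytic gradient with a central finite-difference estimate; the function names, sizes, and step size `eps` are illustrative choices.

```python
import numpy as np

def log_prob(U, v_c, o):
    """log P(o | c) under the softmax model, computed stably."""
    scores = U @ v_c
    top = scores.max()
    return scores[o] - (top + np.log(np.sum(np.exp(scores - top))))

def grad_center(U, v_c, o):
    """Gradient of log P(o | c) w.r.t. v_c: observed u_o minus expected u_x."""
    scores = U @ v_c
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return U[o] - p @ U  # p @ U is sum_x p(x|c) * u_x

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 4))  # hypothetical outside vectors
v_c = rng.normal(size=4)     # hypothetical center vector
o = 2                        # index of the observed outside word

analytic = grad_center(U, v_c, o)

# Central finite differences, one coordinate of v_c at a time.
eps = 1e-6
numeric = np.zeros_like(v_c)
for k in range(v_c.size):
    step = np.zeros_like(v_c)
    step[k] = eps
    numeric[k] = (log_prob(U, v_c + step, o) - log_prob(U, v_c - step, o)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

When the model already predicts the observed word with probability close to 1, the expected vector approaches $u_{o}$ and the gradient shrinks toward zero.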