## Introduction to Word2Vec
#### The model

Word2Vec is a simple word-embedding neural network with a single hidden layer, based on the study of Le & Mikolov (2014). It learns word embeddings from plain text.

The model assumes the *Distributional Hypothesis*: words are characterized by the words they hang out with. This idea is used to estimate the probability of two words occurring near each other.

## Introduction

In NLP, one of the most popular fixed-length feature representations is **bag-of-words**.

However, bag-of-words features have two major weaknesses:
- they lose the ordering of the words
- they also ignore semantics of the words. 
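
The first weakness can be illustrated with a few lines of standard-library Python. This is only a sketch: `bag_of_words` is a helper name introduced here for illustration, and tokenization is assumed to be a plain whitespace split.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count how often each whitespace-separated token occurs."""
    return Counter(sentence.lower().split())


a = bag_of_words("the prince loves the princess")
b = bag_of_words("the princess loves the prince")

# The two sentences say different things, but once word order is
# discarded their bag-of-words representations are identical.
print(a == b)
```

The second weakness shows up in the same representation: *prince* and *princess* are simply two unrelated keys, so nothing in the counts says they are closer to each other than to *the*.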

**Le & Mikolov (2014)** propose a distributed representation of sentences and documents, which they call *Paragraph Vector*.

It's an unsupervised algorithm that learns fixed-length feature representations from **variable-length pieces of text**, such as sentences, paragraphs, and documents. The algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives the algorithm the potential to overcome the weaknesses of bag-of-words models.

Empirical results show that **Paragraph Vectors outperform bag-of-words models** as well as other techniques for text representation. The paper reports new **state-of-the-art** results on several **text classification** and **sentiment analysis** tasks.


## Motivation

After training, words with similar meanings are mapped to similar positions in the vector space. The differences between word vectors also carry meaning. For example, the word vectors can be used to answer analogy questions using simple vector algebra: “King” - “man” + “woman” = “Queen”.
These properties of word vectors are useful in many NLP tasks, such as language modelling and understanding, and machine translation.

## Language Model

We concatenate the paragraph vector with several word vectors from a paragraph and predict the following word in the given context. Both word vectors and paragraph vectors are trained with stochastic gradient descent and backpropagation.
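
The concatenation step can be sketched in NumPy as follows. All names here (`D`, `W`, `U`, `b`, `predict_next`) and all sizes are assumptions made for illustration, the weights are random and untrained, and a plain softmax is used for the output layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: vocabulary, vector length, number of paragraphs, context words.
V, N, n_paragraphs, k = 8, 4, 3, 2

D = rng.normal(scale=0.1, size=(N, n_paragraphs))  # one column per paragraph
W = rng.normal(scale=0.1, size=(N, V))             # one column per word
U = rng.normal(scale=0.1, size=(V, N * (k + 1)))   # softmax weights
b = np.zeros(V)                                    # softmax bias


def predict_next(paragraph_id, context_ids):
    """Concatenate the paragraph vector with the context word vectors
    and return a probability distribution over the next word."""
    h = np.concatenate([D[:, paragraph_id]] + [W[:, i] for i in context_ids])
    scores = U @ h + b
    exp_s = np.exp(scores - scores.max())  # shift for numerical stability
    return exp_s / exp_s.sum()


probs = predict_next(paragraph_id=0, context_ids=[1, 5])
print(probs.shape, probs.sum())
```

Because the vectors are concatenated rather than averaged, the input to the softmax layer has length `N * (k + 1)` and the position of each context word is preserved.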

The proposed neural network language model works as follows:

 - each word is represented by a one-hot vector
 - if we use a multi-word context, the vectors are averaged
 - this context vector is the input to a neural network, which tries to predict the next word.
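
The three steps above can be sketched as a forward pass in NumPy. This is a minimal illustration with random, untrained weights, not the notebook's own implementation. The toy vocabulary, the hidden size `N`, and the names `one_hot` and `predict_next` are assumptions, and word vectors are stored as columns of `W` and `W_prime`.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["<s>", "the", "prince", "loves", "skateboarding", "</s>"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 3  # vocabulary size, hidden layer size (assumed)

W = rng.normal(scale=0.1, size=(N, V))        # input-to-hidden weights
W_prime = rng.normal(scale=0.1, size=(N, V))  # hidden-to-output weights


def one_hot(word):
    """Return the one-hot vector of a vocabulary word."""
    x = np.zeros(V)
    x[word_to_idx[word]] = 1.0
    return x


def predict_next(context):
    """Average the context one-hot vectors and predict the next word."""
    x = np.mean([one_hot(w) for w in context], axis=0)
    h = W @ x                    # equals the average of the context word vectors
    u = W_prime.T @ h            # one score per vocabulary word
    exp_u = np.exp(u - u.max())  # shift for numerical stability
    return exp_u / exp_u.sum()


probs = predict_next(["the", "prince"])
print(dict(zip(vocab, probs.round(3))))
```

With untrained weights the distribution is close to uniform. Training adjusts `W` and `W_prime` so that the observed next word receives a higher probability.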

### Results

After training, the word vectors are mapped into a vector space such that semantically similar words have similar vector representations.

In the study, the authors extend the model beyond the word level to achieve phrase-level or sentence-level representations.

Paragraph Vector is less complex than other methods that have tried to achieve similar representations, such as **word weighting functions** (which require task-specific tuning) and **parse trees**, and it outperforms them.

Paragraph Vector takes a general approach. It is capable of constructing representations of input sequences of any length.

## One-Word Context

As mentioned before, every word is mapped to a unique vector, represented by a column in a matrix $W$. These vectors are then used as features for predicting the next word in a sentence.
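
As a small sketch of this lookup, multiplying $W$ by a one-hot vector selects exactly one column. The toy vocabulary, the sizes, and the values in `W` are arbitrary and chosen only for illustration.

```python
import numpy as np

vocab = ["<s>", "the", "prince", "loves", "</s>"]
V, N = len(vocab), 3  # vocabulary size, vector length (assumed)

# Arbitrary values, used only so the selected column is easy to spot.
W = np.arange(N * V, dtype=float).reshape(N, V)

idx = vocab.index("prince")
x = np.zeros(V)
x[idx] = 1.0  # one-hot vector for "prince"

h = W @ x  # matrix product with a one-hot vector
print(h)
print(W[:, idx])  # the same column, obtained by direct indexing
```

In practice the direct indexing form is used, since it avoids a full matrix product for every word.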

The objective is to maximize the average log probability:

[EQ]

In the paper, hierarchical softmax is used for faster training, but for simplicity I will just use a regular softmax regression as the multiclass classifier:

[EQ]


#### Softmax Regression

Word2Vec is a very simple neural network with a single hidden layer.

In [1]:
sentences = ['<s> the prince loves skateboarding in the park </s>', 
             '<s> the princess loves the prince but the princess hates skateboarding </s>',
             '<s> skateboarding in the park is popular </s>',
             '<s> the prince is popular but the prince hates attention </s>',
             '<s> the princess loves attention but the princess hates the park </s>']

In [None]:
model = Word2Vec_1WordContext(sentences)
W, vocab = model.train()

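Using the dot product $W_I \cdot W'^T_O$, we compute a similarity score between an input word, for example *prince*, and an output word, for example *hates*. A larger score means the model considers the output word more likely to follow the input word.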

Now using softmax regression, we can compute the posterior probability $P(w_O|w_I)$:

$$ P(w_O|w_I) = y_O = \frac{\exp(W_I \cdot W'^T_O)}{\sum^V_{j=1} \exp(W_I \cdot W'^T_j)} $$
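
A minimal NumPy sketch of this softmax follows, assuming word vectors are stored as columns of `W` and `W_prime`. The weights are random, and the indices `input_idx` and `output_idx` are arbitrary placeholders. Subtracting the maximum score before exponentiating does not change the result but avoids overflow.

```python
import numpy as np


def softmax(u):
    """Numerically stable softmax over a vector of scores."""
    exp_u = np.exp(u - np.max(u))
    return exp_u / exp_u.sum()


rng = np.random.default_rng(0)
V, N = 6, 3  # vocabulary size, hidden layer size (assumed)
W = rng.normal(size=(N, V))        # input word vectors (columns)
W_prime = rng.normal(size=(N, V))  # output word vectors (columns)

input_idx, output_idx = 2, 4  # arbitrary word indices

h = W[:, input_idx]  # W_I, the vector of the input word
u = W_prime.T @ h    # u[j] = W_I . W'^T_j for every word j
y = softmax(u)
p = y[output_idx]    # P(w_O | w_I)
print(p)
```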

### Updating the hidden-to-output layer weights

The loss function to minimize is $E = -\log P(w_O|w_I)$.

The prediction error is $e_j = y_j - t_j$, where $y_j = P(w_j|w_I)$ is the predicted probability of word $w_j$, and $t_j$ is 1 if $w_j$ is the actual output word and 0 otherwise.

To obtain the gradient on the hidden-to-output weights, we compute $e_j \cdot h$, where $h$ is a copy of the vector corresponding to the input word (this only holds for a single-word context). Finally, using stochastic gradient descent with a learning rate $\nu$, we obtain the weight update equation for the hidden-to-output layer weights:

$$W'^{T (t)}_j = W'^{T (t-1)}_j - \nu \cdot e_j \cdot h$$
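
One such update step can be sketched as follows. The error is taken here as predicted minus target, $e_j = y_j - t_j$, so that subtracting $\nu \cdot e_j \cdot h$ performs gradient descent on $E$. The function name `update_output_weights`, the sizes, and the learning rate are assumptions for illustration.

```python
import numpy as np


def softmax(u):
    exp_u = np.exp(u - np.max(u))
    return exp_u / exp_u.sum()


def update_output_weights(W_prime, h, target, lr):
    """One SGD step on the hidden-to-output weights for one training pair."""
    y = softmax(W_prime.T @ h)
    t = np.zeros_like(y)
    t[target] = 1.0            # one-hot target
    e = y - t                  # prediction error, e_j = y_j - t_j
    grad = np.outer(h, e)      # column j holds e_j * h
    return W_prime - lr * grad, e


rng = np.random.default_rng(0)
V, N = 6, 3
W_prime = rng.normal(size=(N, V))  # output word vectors (columns)
h = rng.normal(size=N)             # hidden layer, i.e. the input word vector
target = 4                         # index of the actual output word (arbitrary)

loss_before = -np.log(softmax(W_prime.T @ h)[target])
W_prime_new, e = update_output_weights(W_prime, h, target, lr=0.05)
loss_after = -np.log(softmax(W_prime_new.T @ h)[target])
print(loss_before, loss_after)
```

The step raises the score of the target word and lowers the scores of all other words in proportion to their predicted probability.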


### Updating the input-to-hidden layer weights

Next, we backpropagate the prediction errors to the input-to-hidden weights.
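
A sketch of this step for a single-word context, using the standard gradient of softmax with the loss $E = -\log P(w_O|w_I)$: the gradient with respect to the hidden layer is $\sum_j e_j \cdot W'^T_j$ with $e_j = y_j - t_j$, and only the vector of the input word is changed. The function name, sizes, and learning rate are assumptions for illustration.

```python
import numpy as np


def softmax(u):
    exp_u = np.exp(u - np.max(u))
    return exp_u / exp_u.sum()


def update_input_vector(W, W_prime, input_idx, target, lr):
    """One SGD step on the input-to-hidden weights for one training pair."""
    h = W[:, input_idx]
    y = softmax(W_prime.T @ h)
    t = np.zeros_like(y)
    t[target] = 1.0
    e = y - t                       # prediction error, e_j = y_j - t_j
    EH = W_prime @ e                # sum over j of e_j * W'[:, j], i.e. dE/dh
    W_new = W.copy()
    W_new[:, input_idx] -= lr * EH  # only the input word's vector changes
    return W_new


def loss(W, W_prime, input_idx, target):
    return -np.log(softmax(W_prime.T @ W[:, input_idx])[target])


rng = np.random.default_rng(0)
V, N = 6, 3
W = rng.normal(size=(N, V))
W_prime = rng.normal(size=(N, V))
input_idx, target = 2, 4  # arbitrary word indices

W_new = update_input_vector(W, W_prime, input_idx, target, lr=0.05)
print(loss(W, W_prime, input_idx, target), loss(W_new, W_prime, input_idx, target))
```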

### Multi-word context

In [23]:
model = Word2Vec_nWordContext(sentences, learning_rate=1.0, context_size=4)
W, vocab = model.train()

In [24]:
model.graph_vector_space()

### Paragraph Vector

Backpropagation

In [17]:
model = Doc2Vec_nWordContext(sentences, learning_rate=1.0, context_size=2)
W, vocab = model.train()