# Word Embedding : A walk through

### <span style = "color:#DF7E22"> Core Concept : </span>

- Embedding transforms a **<span style="color:blue">Sparse</span>** representation into a **<span style="color:blue">Higher density</span>** representation that carries **<span style="color:blue">More solid context information</span>** in a **<span style="color:blue">Vector space</span>**. 

- The original idea is to replace the sparse one-hot vectors that are used to represent each word in Natural Language Processing (NLP).

- This technique is a kind of alternative to the **<span style="color:blue">Autoencoder</span>**.

- Because learning an embedding usually means fitting an **<span style="color:blue">Unnormalized problem</span>** (the model's scores only become probabilities after normalizing over the whole vocabulary), it requires a lot of computation. There are various approximations for interpreting and implementing it. 

- In this post, I am going to walk through the TensorFlow implementation, which is based on this **[Paper](https://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf)**






###  <span style = "color:#DF7E22">Key Words :</span>

- Continuous Bag of Words (CBOW)
- The Skip-Gram 
- Hierarchical Softmax : for fast computation in O(log(M)), where M is the vocabulary size 
 - Binary Tree 
- Noise Contrastive Estimation (an alternative to the Hierarchical Softmax) 
 - NCE loss ( learn by comparison ) 
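
To make the O(log(M)) claim for the hierarchical softmax concrete, here is a minimal sketch in plain Python. It assumes a complete binary tree over a toy vocabulary of 8 words, and the node scores are made-up numbers (in a real model they would come from the embeddings). The helper name `word_probability` and the heap-style node numbering are my own choices for illustration, not something taken from the paper or from TensorFlow.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def word_probability(word_id, node_scores, depth):
    """P(word) as a product of binary decisions on the root-to-leaf path.

    Assumes a complete binary tree with 2**depth leaves, stored heap-style:
    internal nodes are numbered 1 .. 2**depth - 1 and node n has the
    children 2n (left) and 2n + 1 (right). node_scores[n] is the
    (hypothetical) score for turning right at node n.
    """
    prob = 1.0
    node = 1  # start at the root
    for level in range(depth):
        # Bits of word_id, most significant first, pick the branch: 1 = right.
        go_right = (word_id >> (depth - 1 - level)) & 1
        p_right = sigmoid(node_scores[node])
        prob *= p_right if go_right else 1.0 - p_right
        node = 2 * node + go_right
    return prob


depth = 3                # only 3 decisions per word ...
vocab_size = 2 ** depth  # ... for a vocabulary of M = 8 words

# Made-up scores for the 7 internal nodes (index 0 is unused).
node_scores = [0.0, 0.3, -1.2, 0.8, 2.0, -0.5, 0.1, -0.7]

probs = [word_probability(w, node_scores, depth) for w in range(vocab_size)]
print(probs)
print(sum(probs))  # the probabilities already sum to 1 (up to rounding)
```

Each word costs `depth = log2(M)` sigmoid evaluations, and the probabilities add up to 1 by construction, so no sum over the whole vocabulary is needed.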

[Noise-contrastive estimation](http://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf), as described in the paper:

> We train our models using noise-contrastive estimation, a method for fitting unnormalized models, adapted to neural language modelling in [14]. NCE is based on the reduction of density estimation to probabilistic binary classification. The basic idea is to train a logistic regression classifier to discriminate between samples from the data distribution and samples from some “noise” distribution, based on the ratio of probabilities of the sample under the model and the noise distribution.
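
The "learn by comparison" idea can be sketched in a few lines of plain Python. This is a minimal sketch, assuming the usual NCE formulation in which the model score is treated as an unnormalized log-probability and shifted by log(k * P_n(w)), where k is the number of noise samples and P_n is the noise distribution. The function name `nce_loss` and all the numbers below are made up for illustration; this is not the TensorFlow code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def nce_loss(data_score, data_noise_prob, noise_scores, noise_probs):
    """NCE loss for one data word and k noise words sharing one context.

    Every score s is shifted by log(k * P_n(w)). The sigmoid of the shifted
    score is the classifier's probability that the word came from the data
    rather than from the noise distribution.
    """
    k = len(noise_scores)
    # The data word should be classified as "data".
    delta = data_score - math.log(k * data_noise_prob)
    loss = -math.log(sigmoid(delta))
    # Each noise word should be classified as "noise".
    for s, p in zip(noise_scores, noise_probs):
        delta = s - math.log(k * p)
        loss -= math.log(1.0 - sigmoid(delta))
    return loss


# Toy setting: uniform noise over 4 words, k = 2 noise samples.
# A model that scores the data word high and the noise words low ...
good_fit = nce_loss(2.0, 0.25, [-1.0, -2.0], [0.25, 0.25])
# ... versus a model that does the opposite.
bad_fit = nce_loss(-2.0, 0.25, [1.0, 2.0], [0.25, 0.25])
print(good_fit, bad_fit)  # the first loss is the smaller one
```

The loss only touches the data word and the k noise words, so its cost does not grow with the size of the vocabulary.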

###  <span style = "color:#DF7E22">Math Interpretation : </span>


- **<span style="color:blue"> Neural Probabilistic Language Models </span>**

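 - $ P_{\theta}^{h}(w) = \frac{\exp(s_{\theta}(w, h))}{\sum_{\hat{w}} \exp(s_{\theta}(\hat{w}, h))}$
 - $h$ is the context, $w$ is the word being predicted, and the sum in the denominator runs over every word $\hat{w}$ in the vocabulary. 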


- **<span style="color:blue"> Scalable Log-bilinear Models </span>**

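 - $ \hat{q}(h) = \sum_{i=1}^{n} c_{i} \odot r_{w_{i}} $ 
 - $ \hat{q}(h) $ : predicted feature vector for the context $h = w_{1}, \dots, w_{n}$ 
 - $ c_{i} $ : weight vector for the context word in position i 
 - $ r_{w_{i}} $ : context-word feature vector of the word $w_{i}$ in position i 
 - $ \odot $ denotes element-wise multiplication. 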
 - The context can consist of words preceding, following, or surrounding the word being predicted. 
 
 
- **<span style="color:blue"> The Scoring Function </span>**
 - $ s_{\theta} (w,h) = \hat{q}(h)^\intercal q_{w} + b_{w}$
 - This computes the similarity between the predicted feature vector $\hat{q}(h)$ and the feature vector $q_{w}$ for word w, plus a per-word bias $b_{w}$.
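
The three formulas above fit together as follows. This is a minimal sketch in plain Python with a toy vocabulary of 4 words and 2-dimensional vectors. The names `R`, `Q`, `b` and `C` (standing for $r_w$, $q_w$, $b_w$ and $c_i$) and all the values are made up for illustration.

```python
import math


def context_vector(context_ids, position_weights, context_embeddings):
    """q_hat(h) = sum_i c_i (element-wise *) r_{w_i}."""
    dim = len(position_weights[0])
    q_hat = [0.0] * dim
    for c_i, w_i in zip(position_weights, context_ids):
        r = context_embeddings[w_i]
        for d in range(dim):
            q_hat[d] += c_i[d] * r[d]
    return q_hat


def score(q_hat, target_embedding, bias):
    """s_theta(w, h) = q_hat(h)^T q_w + b_w."""
    return sum(a * t for a, t in zip(q_hat, target_embedding)) + bias


def softmax(scores):
    """Normalize the scores over the whole vocabulary."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up parameters: 4 words, 2 dimensions, a context of 2 positions.
R = [[0.1, 0.2], [0.4, -0.3], [-0.5, 0.6], [0.7, 0.8]]   # context vectors r_w
Q = [[0.2, -0.1], [0.0, 0.5], [0.3, 0.3], [-0.4, 0.9]]   # target vectors q_w
b = [0.0, 0.1, -0.1, 0.05]                               # biases b_w
C = [[1.0, 0.5], [0.5, 1.0]]                             # position weights c_i

context_ids = [1, 3]  # the context h consists of word 1 and word 3
q_hat = context_vector(context_ids, C, R)
scores = [score(q_hat, Q[w], b[w]) for w in range(len(Q))]
probs = softmax(scores)
print(q_hat)
print(probs)
```

Note that `softmax` has to score every word in the vocabulary to get a single probability. That full pass is the expensive part that the hierarchical softmax and NCE try to avoid.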


###  <span style = "color:#DF7E22">Code Implementation : </span>


###  <span style = "color:#DF7E22">Reference : </span>

- [A nice blog : Word Embedding and Autoencoder](https://ayearofai.com/lenny-2-autoencoders-and-word-embeddings-oh-my-576403b0113a#.rcrh3ybeb)

- [Tensorflow tutorials](https://www.tensorflow.org/versions/r0.12/tutorials/word2vec/index.html)

- [TF tutorial in Chinese](http://www.jeyzhang.com/tensorflow-learning-notes-3.html)

- [A nice script that implements embedding with Skip-gram and CBOW with hierarchical softmax](https://github.com/deborausujono/word2vecpy/blob/master/word2vec.py)