# Feature Extraction Using Word2Vec
## Prediction-based word embedding

Prerequisite: an understanding of how a neural network works and how its weights are updated.  
https://www.analyticsvidhya.com/blog/2017/05/neural-network-from-scratch-in-python-and-r/

Word2vec assigns probabilities to words and proved to be state of the art for tasks like word analogies and word similarities. It can solve analogies such as King - man + woman = Queen.

https://en.wikipedia.org/wiki/Word2vec  

Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words.[1][4] According to the authors' note,[5] CBOW is faster while skip-gram is slower but does a better job for infrequent words.  

Word2vec is not a single algorithm but a combination of two techniques:  
1. CBOW (continuous bag of words)  
2. Skip-gram model  

Both of these are shallow neural networks which map word(s) to the target variable which is also a word(s). Both of these techniques learn weights which act as word vector representations. Let us discuss both these methods separately and gain intuition into their working.  

##### B.2.1. CBOW (Continuous Bag of words)  
CBOW works by predicting the probability of a word given a context. A context may be a single word or a group of words. But for simplicity, I will take a single context word and try to predict a single target word.  

Suppose we have a corpus C = “Hey, this is sample corpus using only one context word.” and we have defined a context window of 1. This corpus may be converted into a training set for a CBOW model as follows. The input is shown below. The matrix on the right in the image below contains the one-hot encoded form of the input on the left.  
https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/04205949/cbow1.png


The target for a single data point, say data point 4, is shown below.

| Hey | this | is | sample | corpus | using | only | one | context | word |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
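
As a rough illustration of how the corpus becomes training data, the sketch below builds the vocabulary, the one-hot vectors and the (context, target) pairs for a window of 1. The tokenisation rule and the helper names `one_hot` and `cbow_pairs` are assumptions made for this example; the ordering of the pairs may differ from the table in the image.

```python
import re

corpus = "Hey, this is sample corpus using only one context word."

# Keep alphabetic tokens only, in first-seen order (an assumed preprocessing choice).
tokens = re.findall(r"[A-Za-z]+", corpus)
vocab = list(dict.fromkeys(tokens))
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a length-V list with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector


def cbow_pairs(tokens, window=1):
    """Return (context_word, target_word) pairs for a single-context CBOW."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((tokens[j], target))
    return pairs


pairs = cbow_pairs(tokens)
print(len(vocab), "words in the vocabulary")
print(pairs[:4])
print(one_hot("sample"))
```

The one-hot vector printed for "sample" has its single 1 in the fourth position, matching the target row shown above.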
 
The matrix shown in the above image is sent into a shallow neural network with three layers: an input layer, a hidden layer and an output layer. The output layer is a softmax layer, which normalises the output scores into probabilities that sum to 1. Now let us see how the forward propagation will work to calculate the hidden layer activation.

Let us first see a diagrammatic representation of the CBOW model.

https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/04224109/Screenshot-from-2017-06-04-22-40-29.png

The matrix representation of the above image for a single data point is below.

https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/04222108/Screenshot-from-2017-06-04-22-19-202.png

The flow is as follows:  
1. The input layer and the target are both one-hot encoded vectors of size [1 X V]. Here V=10 in the above example.
2. There are two sets of weights: one between the input and the hidden layer, and a second between the hidden and the output layer. Input-hidden layer matrix size = [V X N], hidden-output layer matrix size = [N X V], where N is the number of dimensions we choose to represent our word in. It is arbitrary and a hyper-parameter of the neural network. N is also the number of neurons in the hidden layer. Here, N=4.
3. There is no activation function between the layers (more specifically, the activation is linear).
4. The input is multiplied by the input-hidden weights, and the result is called the hidden activation. Because the input is one-hot, this is simply the corresponding row of the input-hidden matrix copied.
5. The hidden activation gets multiplied by the hidden-output weights and the output is calculated.
6. The error between output and target is calculated and propagated back to re-adjust the weights.
7. The weights between the hidden layer and the output layer are taken as the word vector representation of the word. (Many implementations, gensim included, instead expose the input-hidden weights as the word vectors; both matrices hold one N-dimensional vector per word.)
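
The forward pass in the steps above can be sketched in a few lines. This is a minimal illustration in plain Python with randomly initialised weights, not a trained model; the helper names `matvec` and `softmax` and the chosen context index are assumptions for the example.

```python
import math
import random

V, N = 10, 4  # vocabulary size and hidden size from the example above
random.seed(0)

# Randomly initialised weights (illustrative values only).
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]   # [V x N]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]  # [N x V]


def matvec(row, matrix):
    """Multiply a [1 x R] row vector by an [R x C] matrix."""
    return [sum(row[r] * matrix[r][c] for r in range(len(row)))
            for c in range(len(matrix[0]))]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


context_index = 2  # hypothetical choice of context word
x = [1 if i == context_index else 0 for i in range(V)]  # one-hot input, [1 x V]

hidden = matvec(x, W_in)        # linear hidden layer, [1 x N]
scores = matvec(hidden, W_out)  # raw output scores, [1 x V]
probs = softmax(scores)         # predicted distribution over the vocabulary

print(hidden)
print(W_in[context_index])  # same values: the hidden layer is a copied row
```

Multiplying a one-hot vector by the input-hidden matrix only selects one row, which is why real implementations use a table lookup instead of a matrix product.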

We saw the above steps for a single context word. Now, what if we have multiple context words? The image below describes the architecture for multiple context words.  
https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/04220606/Screenshot-from-2017-06-04-22-05-44.png

Below is a matrix representation of the above architecture for easier understanding.  
https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/04221550/Screenshot-from-2017-06-04-22-14-311.png

The image above takes 3 context words and predicts the probability of a target word. The input can be assumed as taking three one-hot encoded vectors in the input layer as shown above in red, blue and green.  

So, the input layer will have three [1 X V] vectors as shown above, and the output layer will have one [1 X V] vector. The rest of the architecture is the same as for a 1-context CBOW.  

The steps remain the same, only the calculation of hidden activation changes. Instead of just copying the corresponding rows of the input-hidden weight matrix to the hidden layer, an average is taken over all the corresponding rows of the matrix. We can understand this with the above figure. The average vector calculated becomes the hidden activation. So, if we have three context words for a single target word, we will have three initial hidden activations which are then averaged element-wise to obtain the final activation.  
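
The averaging step can be sketched as follows. The weights are random placeholders and the helper name `cbow_hidden` and the three context indices are assumptions for this example.

```python
import random

V, N = 10, 4
random.seed(1)

# Input-hidden weight matrix with illustrative random values, [V x N].
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]


def cbow_hidden(context_indices, W_in):
    """Average the input-hidden rows of all context words, element-wise."""
    rows = [W_in[i] for i in context_indices]
    return [sum(column) / len(rows) for column in zip(*rows)]


# Three hypothetical context words, identified by their vocabulary indices.
hidden = cbow_hidden([1, 3, 4], W_in)
print(hidden)
```

With a single context index the function returns the corresponding row unchanged, so the 1-context case is a special case of the same calculation.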

For both the single context word and the multiple context word case, I have shown the images only up to the calculation of the hidden activations, since this is the part where CBOW differs from a simple MLP network. The steps after the calculation of the hidden layer are the same as those of the MLP described in the article Understanding and Coding Neural Networks from scratch.  

**The differences between MLP and CBOW**  

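1. The objective function in the MLP is MSE (mean squared error), whereas in CBOW it is the negative log likelihood of a word given a set of context words, i.e. $-\log p(w_o \mid w_i)$, where $p(w_o \mid w_i)$ is the softmax output of the network:  

$$p(w_o \mid w_i) = \frac{\exp(u_{w_o})}{\sum_{w=1}^{V} \exp(u_{w})}$$

$w_o$: output word  
$w_i$: context words  
$u_w$: output-layer score of word $w$ before the softmax  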

2. The gradients of the error with respect to the hidden-output weights and the input-hidden weights are different, since an MLP (generally) has sigmoid activations but CBOW has linear activations. The method used to calculate the gradients, however, is the same as for an MLP.  
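
To make the objective concrete, the sketch below computes the negative log likelihood for one example and checks numerically that its gradient with respect to the output scores is the predicted probabilities minus the one-hot target. The scores and the target index are made-up values for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nll(scores, target):
    """Negative log likelihood of the target word under the softmax."""
    return -math.log(softmax(scores)[target])


scores = [0.2, -1.0, 0.5, 1.5]  # hypothetical output scores for a 4-word vocabulary
target = 2                      # hypothetical index of the true output word

# Analytic gradient of the loss with respect to the scores: output minus target.
probs = softmax(scores)
analytic = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

# Central finite differences as an independent check.
eps = 1e-6
numeric = []
for i in range(len(scores)):
    up = scores[:]
    down = scores[:]
    up[i] += eps
    down[i] -= eps
    numeric.append((nll(up, target) - nll(down, target)) / (2 * eps))

print(analytic)
print(numeric)
```

This "output minus target" vector is the error that gets propagated back through the two weight matrices.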

**Advantages of CBOW:**  

1. Being probabilistic in nature, it is generally expected to perform better than deterministic methods.  
2. It is low on memory. It does not have the huge RAM requirements of a co-occurrence matrix approach, which needs to store three huge matrices.  
 
**Disadvantages of CBOW:**  

1. CBOW takes the average of the context of a word (as seen above in calculation of hidden activation). For example, Apple can be both a fruit and a company but CBOW takes an average of both the contexts and places it in between a cluster for fruits and companies.  
2. Training a CBOW from scratch can take forever if not properly optimized.  

##### B.2.2 Skip – Gram model  
Skip-gram follows the same topology as CBOW. It just flips CBOW’s architecture on its head. The aim of skip-gram is to predict the context given a word. Let us take the same corpus that we built our CBOW model on: C = “Hey, this is sample corpus using only one context word.” Let us construct the training data.  

The input vector for skip-gram is going to be similar to a 1-context CBOW model. Also, the calculations up to hidden layer activations are going to be the same. The difference will be in the target variable. Since we have defined a context window of 1 on both the sides, there will be “two” one hot encoded target variables and “two” corresponding outputs as can be seen by the blue section in the image.  
https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/04235354/Capture1.png  
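
A small sketch of how the skip-gram training data could be constructed with a window of 1 on both sides. The helper name `skipgram_examples` is an assumption for this example, and words at the edges of the corpus naturally get only one context word.

```python
tokens = "Hey this is sample corpus using only one context word".split()


def skipgram_examples(tokens, window=1):
    """Return (center_word, context_words) for every position in the corpus."""
    examples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        examples.append((center, left + right))
    return examples


examples = skipgram_examples(tokens)
for center, contexts in examples[:4]:
    print(center, "->", contexts)
```

Each context word in the list becomes one one-hot encoded target for the given input word.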

Two separate errors are calculated with respect to the two target variables and the two error vectors obtained are added element-wise to obtain a final error vector which is propagated back to update the weights.  

The weights between the input and the hidden layer are taken as the word vector representation after training. The loss function or objective is of the same type as in the CBOW model.  

The skip-gram architecture is shown below.  
https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/05000515/Capture2-276x300.png  

For a better understanding, a matrix-style structure with the calculation is shown below.  
https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/05122225/skip.png  


Let us break down the above image.  

Input layer  size – [1 X V], Input hidden weight matrix size – [V X N], Number of neurons in hidden layer – N, Hidden-Output weight matrix size – [N X V], Output layer size – C [1 X V]  

In the above example, C is the number of context words=2, V= 10, N=4  

1. The row in red is the hidden activation corresponding to the input one-hot encoded vector. It is basically the corresponding row of the input-hidden matrix copied.
2. The yellow matrix is the weight matrix between the hidden layer and the output layer.
3. The blue matrix is obtained by the matrix multiplication of the hidden activation and the hidden-output weights. There will be two rows calculated for the two target (context) words.
4. Each row of the blue matrix is converted into its softmax probabilities individually, as shown in the green box.
5. The grey matrix contains the one-hot encoded vectors of the two context words (targets).
6. The error is calculated by subtracting the first row of the grey matrix (target) from the first row of the green matrix (output) element-wise. This is repeated for the next row. Therefore, for n target context words, we will have n error vectors.
7. An element-wise sum is taken over all the error vectors to obtain a final error vector.
8. This error vector is propagated back to update the weights.
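
The breakdown above can be sketched as a single training step. This is a simplified illustration using plain gradient descent on a full softmax with random starting weights; the learning rate, the word indices and the helper names are assumptions, and real word2vec implementations use faster approximations.

```python
import math
import random

V, N, LEARNING_RATE = 10, 4, 0.05
random.seed(0)

W_in = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]   # [V x N]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]  # [N x V]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(center):
    """Copy the hidden row for the input word and compute output probabilities."""
    hidden = W_in[center][:]
    scores = [sum(hidden[n] * W_out[n][v] for n in range(N)) for v in range(V)]
    return hidden, softmax(scores)


def loss(center, contexts):
    """Sum of negative log likelihoods over all context (target) words."""
    _, probs = forward(center)
    return -sum(math.log(probs[c]) for c in contexts)


def train_step(center, contexts):
    hidden, probs = forward(center)
    # One error vector per context word (output minus one-hot target),
    # summed element-wise into a single error vector.
    error = [0.0] * V
    for c in contexts:
        for v in range(V):
            error[v] += probs[v] - (1.0 if v == c else 0.0)
    # Error propagated back to the hidden layer, using the pre-update weights.
    grad_hidden = [sum(W_out[n][v] * error[v] for v in range(V)) for n in range(N)]
    for n in range(N):
        for v in range(V):
            W_out[n][v] -= LEARNING_RATE * hidden[n] * error[v]
        W_in[center][n] -= LEARNING_RATE * grad_hidden[n]
    return error


center, contexts = 3, [2, 4]  # "sample" with contexts "is" and "corpus"
before = loss(center, contexts)
error = train_step(center, contexts)
after = loss(center, contexts)
print(before, after)
```

Only one row of the input-hidden matrix changes per step, because only the row of the input word contributed to the hidden activation.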

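**Advantages of Skip-Gram Model:**  

1. The skip-gram model is described as being able to capture two semantics for a single word, e.g. Apple the company and apple the fruit. Note that standard word2vec still stores one vector per vocabulary entry, so the two senses get separate vectors only when they appear as distinct tokens (for example, differing in case).  
2. Skip-gram with negative sampling generally outperforms the other methods.  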
 
This is an excellent interactive tool to visualise CBOW and skip-gram in action. I would suggest going through this link for a better understanding.  

### C. Word2Vec use case scenarios  
Since word embeddings or word vectors are numerical representations of contextual similarities between words, they can be manipulated and made to perform amazing tasks like:  

1. Finding the degree of similarity between two words.  
`model.similarity('woman','man')`  
0.73723527  
2. Finding odd one out.  
`model.doesnt_match('breakfast cereal dinner lunch'.split())`  
'cereal'  
3. Amazing things like woman + king - man = queen  
`model.most_similar(positive=['woman','king'],negative=['man'],topn=1)`  
queen: 0.508  
4. Probability of a text under the model  
`model.score(['The fox jumped over the lazy dog'.split()])`  
0.21  

Below is one interesting visualisation of word2vec.  
https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/05003425/graph1.jpg  

The above image is a t-SNE representation of word vectors in 2 dimensions, and you can see that two contexts of apple have been captured. One is a fruit and the other a company.  

5.  It can be used to perform Machine Translation.  
https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/06/05003807/ml.png  

The above graph is a bilingual embedding with Chinese in green and English in yellow. If we know the words having similar meanings in Chinese and English, the above bilingual embedding can be used to translate one language into the other.  
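
The similarity and odd-one-out queries in this section come down to cosine similarity between vectors. The sketch below shows the idea on hand-picked 3-dimensional vectors; the numbers are hypothetical and chosen only for illustration, and `odd_one_out` is a simplified version of the idea rather than the library's implementation.

```python
import math

# Hypothetical toy vectors, not taken from any trained model.
vectors = {
    "breakfast": [0.9, 0.1, 0.2],
    "lunch": [0.8, 0.2, 0.1],
    "dinner": [0.85, 0.15, 0.3],
    "cereal": [0.1, 0.9, 0.4],
}


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    """Cosine similarity: 1 for the same direction, 0 for orthogonal vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def odd_one_out(words):
    """Return the word least similar to the mean of the unit-length vectors."""
    units = [[x / norm(vectors[w]) for x in vectors[w]] for w in words]
    mean = [sum(column) / len(units) for column in zip(*units)]
    return min(words, key=lambda w: cosine(vectors[w], mean))


print(cosine(vectors["breakfast"], vectors["lunch"]))
print(odd_one_out("breakfast cereal dinner lunch".split()))
```

Cosine similarity ignores vector length and compares direction only, which is why it is the usual choice for comparing word vectors.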

### D. Using pre-trained word vectors  

We are going to use Google’s pre-trained model. It contains word vectors for a vocabulary of 3 million words trained on around 100 billion words from the Google News dataset. The download link for the model is given below. Beware, it is a 1.5 GB download.  

https://github.com/mmihaltz/word2vec-GoogleNews-vectors

```python
from gensim.models import KeyedVectors

# Load the downloaded model (binary word2vec format)
model = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)

# The model is loaded. It can be used to perform all of the tasks mentioned above.

# Get the word vector of a word
dog = model['dog']

# Perform the king/queen magic
print(model.most_similar(positive=['woman', 'king'], negative=['man']))

# Pick the odd one out
print(model.doesnt_match("breakfast cereal dinner lunch".split()))

# Print the similarity index
print(model.similarity('woman', 'man'))
```

Output:

```
[('queen', 0.7118192911148071), ('monarch', 0.6189674139022827), ('princess', 0.5902431607246399), ('crown_prince', 0.5499460697174072), ('prince', 0.5377321243286133), ('kings', 0.5236844420433044), ('Queen_Consort', 0.5235945582389832), ('queens', 0.5181134343147278), ('sultan', 0.5098593235015869), ('monarchy', 0.5087411999702454)]
cereal
0.76640123
```


### E. Training your own word vectors
We will be training our own word2vec model on a custom corpus. For training the model we will be using gensim, and the steps are illustrated below.  

Word2vec requires a list-of-lists format for training, where every document is a list and every such list contains the tokens of that document. I won’t be covering the preprocessing part here. So let’s take an example list of lists to train our word2vec model.  

```python
from gensim.models import Word2Vec

sentence = [['Neeraj', 'Boy'], ['Sarwan', 'is'], ['good', 'boy']]

# Train word2vec on 3 sentences.
# gensim 4.0+ names the dimensionality parameter `vector_size`; older releases call it `size`.
model = Word2Vec(sentence, min_count=1, vector_size=300, workers=4)
```

Let us try to understand the parameters of this model.

sentence – the list of lists that makes up our corpus  
min_count=1 – the threshold value for the words. Words with a frequency lower than this are ignored; only words that occur at least this many times are included in the model.  
vector_size=300 (size=300 in gensim releases before 4.0) – the number of dimensions in which we wish to represent our word. This is the size of the word vector.  
workers=4 – the number of worker threads used for parallelization  
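
Since the preprocessing is not covered here, the following is only a rough sketch of how raw text could be turned into the list-of-lists format, and of what a frequency threshold like `min_count` does to the vocabulary. The example documents, the regular expression and the helper names are assumptions, and `kept_vocabulary` only mimics the effect of the parameter rather than reproducing gensim's code.

```python
import re
from collections import Counter

# Hypothetical raw documents, invented for this example.
documents = [
    "Neeraj is a good boy.",
    "Sarwan is a good boy too.",
    "Word vectors need a lot of text.",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


# One list of tokens per document: the format the trainer expects.
sentences = [tokenize(doc) for doc in documents]


def kept_vocabulary(sentences, min_count):
    """Words that occur at least min_count times across the whole corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word for word, count in counts.items() if count >= min_count}


print(sentences[0])
print(sorted(kept_vocabulary(sentences, min_count=2)))
```

Lowercasing is a design choice here: without it, 'Boy' and 'boy' are treated as two unrelated words, as in the training example above.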

```python
# Using the model
# The newly trained model can be used like the pre-trained one; its vectors live in model.wv.

# Print the similarity index
print(model.wv.similarity('boy', 'Boy'))
```

Output:

```
0.06083992
```


https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/

**Evaluation:**
Mikolov et al. (2013)[1] develop an approach to assessing the quality of a word2vec model which draws on the semantic and syntactic patterns discussed above. They developed a set of 8,869 semantic relations and 10,675 syntactic relations which they use as a benchmark to test the accuracy of a model. When assessing the quality of a vector model, a user may draw on this accuracy test which is implemented in word2vec,[17] or develop their own test set which is meaningful to the corpora which make up the model. This approach offers a more challenging test than simply arguing that the words most similar to a given test word are intuitively plausible.[1]
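
The idea behind such an analogy benchmark can be sketched on a toy scale. The vectors and the three questions below are hypothetical and hand-constructed so that the arithmetic works out; a real evaluation would run thousands of questions against trained vectors.

```python
import math

# Hypothetical toy vectors with dimensions loosely meaning (royalty, male, female).
vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "prince": [0.6, 1.0, 0.0],
    "princess": [0.6, 0.0, 1.0],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def solve_analogy(a, b, c):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]  # skip the question words
    return max(candidates, key=lambda w: cosine(vectors[w], query))


# Each question is (a, b, c, expected answer).
questions = [
    ("man", "king", "woman", "queen"),
    ("man", "prince", "woman", "princess"),
    ("king", "queen", "prince", "princess"),
]

correct = sum(solve_analogy(a, b, c) == expected for a, b, c, expected in questions)
accuracy = correct / len(questions)
print(accuracy)
```

Accuracy here is simply the fraction of questions whose nearest neighbour matches the expected word, which is the same measure the benchmark reports per relation type.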

**Continuous Bag of Words** vs **Skip-Gram**
Accuracy can be improved in a number of ways, including the choice of model architecture (CBOW or Skip-Gram), increasing the training data set, increasing the number of vector dimensions, and increasing the window size of words considered by the algorithm. Each of these improvements comes with the cost of increased computational complexity and therefore increased model generation time.[1]

In models using large corpora and a high number of dimensions, the skip-gram model yields the highest overall accuracy, and consistently produces the highest accuracy on semantic relationships, as well as yielding the highest syntactic accuracy in most cases. However, the CBOW is less computationally expensive and yields similar accuracy results.[1]

Accuracy increases overall as the number of words used increases, and as the number of dimensions increases. Mikolov et al.[1] report that doubling the amount of training data results in an increase in computational complexity equivalent to doubling the number of vector dimensions.

**Word2Vec** vs **LSA**
Altszyler et al. (2017) [18] studied Word2vec performance in two semantic tests for different corpus sizes. They found that Word2vec has a steep learning curve, outperforming another word-embedding technique (LSA) when it is trained with a medium to large corpus (more than 10 million words). However, with a small training corpus LSA showed better performance. Additionally, they show that the best parameter setting depends on the task and the training corpus. Nevertheless, for skip-gram models trained on medium-sized corpora, 50 dimensions, a window size of 15 and 10 negative samples seem to be a good parameter setting.
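
Expressed as gensim keyword arguments, that parameter setting might look like the fragment below. The dictionary name is an assumption for this example, the parameter names follow gensim 4.x, and the commented-out call assumes a `sentences` variable holding a list of token lists.

```python
# Settings reported above for skip-gram models on medium-sized corpora.
skipgram_params = {
    "sg": 1,            # 1 selects skip-gram, 0 selects CBOW
    "vector_size": 50,  # number of dimensions (`size` in gensim releases before 4.0)
    "window": 15,       # context window size
    "negative": 10,     # number of negative samples
}

# Assumed usage with gensim installed:
# from gensim.models import Word2Vec
# model = Word2Vec(sentences, **skipgram_params)
print(skipgram_params)
```

As the paragraph above notes, the best setting depends on the task and the corpus, so these values are a starting point rather than a recommendation for every case.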