## Word Embeddings

The Keras documentation describes the Embedding layer as turning positive integers (indexes) into dense vectors of fixed size. More generally, word embedding is a technique from the field of NLP in which words are encoded as dense vectors in a high-dimensional space, so that the vectors carry meaning. Each word has a specific position within the vector space. This position is learned from the text during training and is based on the surrounding words. As a result of training, semantically similar words end up closer together in the vector space.
______

#### Bag of Words

A simple bag of words gives the same representation to sentences with different meanings. In addition, a huge text dataset would be represented by vectors made up of many zeros and only a few ones, which is inefficient and leads to data sparsity.


* “Netflix is better than Maxdome” 
    * Bag of words: [Netflix] [is] [better] [than] [Maxdome]<br><br>
* One Hot Encoding: index of the specific word becomes a one, the rest becomes a zero
    * Maxdome → 00001
    * better → 00100
    * Netflix → 10000
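
The encoding above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the vocabulary is assumed to be built in order of first appearance, and the helper name `one_hot_vector` is chosen for illustration.

```python
# Minimal one-hot encoding sketch for the example sentence (pure Python).
sentence = "Netflix is better than Maxdome"

# Vocabulary in order of first appearance: word -> index.
vocab = {}
for word in sentence.split():
    vocab.setdefault(word, len(vocab))


def one_hot_vector(word, vocab):
    """Return a list with a 1 at the word's index and 0 everywhere else."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


for word in ("Maxdome", "better", "Netflix"):
    bits = "".join(str(bit) for bit in one_hot_vector(word, vocab))
    print(word, "→", bits)
```

Each vector is as long as the vocabulary, which is why this representation grows quickly for real datasets.
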
______________

#### One Hot Encoding - Problems:
1. Word ordering information is lost:
    * Netflix is better than Maxdome vs. Maxdome is better than Netflix
    * Different meaning but same representation<br><br>
2. Data sparsity:
    * many zeros and few ones (imagine 20,000 zeros)<br><br>
3. Words as atomic symbols
    * with one-hot vectors, cat and dog are exactly as far apart as cat and apple
    * but cats and dogs should be closer together (both are animals)
    * with word embeddings, semantic similarity and relations are learned from the data<br><br>
    
      <img src="images/WE1.png" alt="WE" style="width: 600px;"/><br><br>
    
4. Very hard to find higher level features when using One Hot Encoding
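
Problems 1 and 3 can be checked directly. The sketch below uses only the standard library; the three-word vocabulary and the helper `euclidean` are assumptions made for illustration.

```python
from collections import Counter
import math

a = "Netflix is better than Maxdome".split()
b = "Maxdome is better than Netflix".split()

# Problem 1: word order is lost, both sentences give the same bag of words.
same_bag = Counter(a) == Counter(b)
print(same_bag)

# Problem 3: every pair of distinct one-hot vectors is equally far apart.
vocab = ["cat", "dog", "apple"]
one_hot = {
    word: [1 if i == j else 0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


d_cat_dog = euclidean(one_hot["cat"], one_hot["dog"])
d_cat_apple = euclidean(one_hot["cat"], one_hot["apple"])
print(d_cat_dog == d_cat_apple)
```

Because the distances are identical, a model that sees only one-hot vectors gets no hint that cat and dog are related.
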
______

### Example

Below we show an example of why using dense vectors has a computational benefit when working with deep learning models such as CNNs. First, imagine you have the sentence "deep learning is very deep". Next, you have to decide how long the embedding vector should be (usually a length of 32 or 50). For this example we use a length of 6 factors per index to keep it readable.


<img src="images/Emb.png" alt="Emb" style="width: 350px;"/>


Now, instead of ending up with huge one-hot encoded vectors, an embedding matrix keeps the size of each vector much smaller. The embedded vectors are learned during the training process. This is computationally efficient when working with very large datasets. Below you see an example of the embedding vector (one row of the embedding matrix) for the word deep:

### deep = [.32, .02, .48, .21, .56, .15]<br><br>
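
To make the lookup concrete, here is a minimal sketch of an embedding matrix for the example sentence. Only the vector for "deep" comes from the text; all other rows are random placeholders, and the names `word_to_index`, `embed` and `one_hot_times_matrix` are assumptions for illustration.

```python
import random

sentence = "deep learning is very deep".split()

# Integer-encode the words in order of first appearance.
word_to_index = {}
for word in sentence:
    word_to_index.setdefault(word, len(word_to_index))

embedding_dim = 6
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# One row per vocabulary word, randomly initialised (training would adjust them).
embedding_matrix = [
    [round(rng.uniform(0, 1), 2) for _ in range(embedding_dim)]
    for _ in word_to_index
]
# Use the vector quoted in the text for "deep"; the other rows are made up.
embedding_matrix[word_to_index["deep"]] = [.32, .02, .48, .21, .56, .15]


def embed(word):
    """Embedding lookup: pick the row that belongs to the word's index."""
    return embedding_matrix[word_to_index[word]]


def one_hot_times_matrix(word):
    """Same result the expensive way: one-hot vector times the matrix."""
    size = len(word_to_index)
    one_hot = [1 if i == word_to_index[word] else 0 for i in range(size)]
    return [
        sum(one_hot[i] * embedding_matrix[i][j] for i in range(size))
        for j in range(embedding_dim)
    ]


encoded = [word_to_index[w] for w in sentence]
print(encoded)        # integer indexes of the five words
print(embed("deep"))  # the 6-element vector for "deep"
```

The lookup returns the same vector as the one-hot multiplication but touches a single row, which is where the computational saving comes from.
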


___




<img src="images/vectors.png" alt="vectors" style="width: 600px;"/><br><br>
* Relationships are also learned, for example information about gender<br><br>
* The vector difference [woman - man] is approximately the same as [queen - king]
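
The vector arithmetic can be tried out on toy data. The 2-D vectors below are hand-picked and purely hypothetical (axis 0 stands for "royalty", axis 1 for "gender"); real embeddings are learned and only approximately satisfy these relations.

```python
import math

# Hypothetical 2-D toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "man":   [0.2, 0.1],
    "woman": [0.2, 0.9],
    "king":  [0.9, 0.1],
    "queen": [0.9, 0.9],
}


def subtract(u, v):
    return [x - y for x, y in zip(u, v)]


def add(u, v):
    return [x + y for x, y in zip(u, v)]


def distance(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# The "gender" offset is the same for both word pairs.
offset_common = subtract(vectors["woman"], vectors["man"])
offset_royal = subtract(vectors["queen"], vectors["king"])
print(offset_common, offset_royal)

# king - man + woman should land closest to queen.
target = add(subtract(vectors["king"], vectors["man"]), vectors["woman"])
nearest = min(vectors, key=lambda w: distance(vectors[w], target))
print(nearest)
```
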
______

#### Word Embeddings - Benefits:
1. Trained in a completely unsupervised way, so you don't need labeled data <br><br>

2. Reduce data sparsity: instead of a huge number of 0/1 values, you work with a small number of float values<br><br>

3. Semantic hashing
    * the vectors represent information about the meaning of words
    * semantic hashing: semantically similar words are closer together <br><br>

4. Freely available for out of the box usage
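
Pretrained embeddings are often distributed as plain text files with one word per line, followed by the values of its vector (the GloVe text files, for example, use this layout). The sketch below parses such a file; the in-memory sample and its numbers are made up, and `load_vectors` is a hypothetical helper, not part of any library.

```python
import io

# Stand-in for a downloaded embedding file; the numbers are made up.
sample_file = io.StringIO(
    "good 0.1 0.3 0.5\n"
    "great 0.2 0.3 0.4\n"
    "bad -0.4 0.1 -0.2\n"
)


def load_vectors(handle):
    """Parse lines of the form 'word v1 v2 ...' into a dict of float lists."""
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


pretrained = load_vectors(sample_file)
print(len(pretrained), "words,", len(pretrained["good"]), "dimensions each")
```

With a real file you would pass an opened file handle instead of the `io.StringIO` object.
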
______

## Train Your Own Embedding Layer

#### In a Nutshell:

In the following you will see an example of how to learn a word embedding with a neural network. The example aims to show in detail how Keras supports word embeddings for deep learning. First, the input data has to be integer encoded, so that each word is represented by a unique integer. Then the Embedding layer is initialized with random weights and learns an embedding for all words in the training dataset. We define a small problem with ten text documents and classify them as positive "1" or negative "0". We use a Keras Sequential model for this task.

In [7]:
# import packages
import re
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import one_hot
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

In [11]:
# define the documents and classify them as positive “1” or negative “0”

docs = ['good job', 'great work', 'well done', 'great effort', 'amazing',
        'bad', 'poor work', 'very weak', 'could have done better', 'not good at all']
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

______
#### One Hot Encoding

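Next, each document has to be integer encoded, so that the Embedding layer receives sequences of integers as input. We use the Keras one_hot() function, which hashes each word to an integer. Despite its name, it returns integer indexes rather than one-hot vectors, and because it hashes, two different words can receive the same integer. The vocabulary size is set to 20, slightly larger than the number of distinct words in the documents.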

In [12]:
# one-hot encode the documents in docs

embedding_vocab_size = 20
encoded_docs = [one_hot(d, embedding_vocab_size) for d in docs]
print(encoded_docs)

[[13, 16], [19, 5], [16, 17], [19, 9], [9], [17], [12, 5], [13, 9], [17, 8, 17, 5], [7, 13, 18, 11]]
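
In the output above, several words share an integer (for example, "job" and "well" both map to 16). The sketch below illustrates why hashing can cause such collisions. It is not the Keras implementation: we assume the index is computed as `hash % (n - 1) + 1` and use md5 as a stable stand-in hash, so the integers will differ from the ones printed above.

```python
import hashlib


def hashed_index(word, n):
    """Map a word to an integer in [1, n - 1]; 0 stays free for padding."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1


def encode(doc, n):
    """Integer-encode a document word by word."""
    return [hashed_index(word, n) for word in doc.lower().split()]


docs = ['good job', 'great work', 'well done', 'great effort', 'amazing',
        'bad', 'poor work', 'very weak', 'could have done better',
        'not good at all']
n = 20
encoded = [encode(doc, n) for doc in docs]
print(encoded)

# Group the distinct words by index to reveal collisions.
buckets = {}
for word in sorted({w for doc in docs for w in doc.split()}):
    buckets.setdefault(hashed_index(word, n), []).append(word)
collisions = {i: words for i, words in buckets.items() if len(words) > 1}
print(collisions)
```

A larger `n` makes collisions less likely, while an explicit word-to-index dictionary avoids them entirely.
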


______
#### Padding
Because neural networks expect input vectors of a fixed size, we have to pad each document so that all of them have the same length. Here the documents are padded with zeros at the end, up to a length of 4 words.

In [13]:
# pad the documents to a maximum length of 4 words

max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[13 16  0  0]
 [19  5  0  0]
 [16 17  0  0]
 [19  9  0  0]
 [ 9  0  0  0]
 [17  0  0  0]
 [12  5  0  0]
 [13  9  0  0]
 [17  8 17  5]
 [ 7 13 18 11]]
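
What the padding step does can be sketched in plain Python. The helper `pad_post` is hypothetical and simplified: it pads at the end and, as our own simplification, also cuts overly long sequences at the end.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end with `value` up to `maxlen` entries."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]  # cut sequences that are too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


example = pad_post([[13, 16], [9], [17, 8, 17, 5]], maxlen=4)
for row in example:
    print(row)
```

Note that Keras's `pad_sequences` pads at the front by default; the zeros at the end of the output above correspond to `padding='post'`.
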


______
#### Keras Embedding Layer
Finally, we will define our Embedding layer as part of a neural network model for binary classification.
The Embedding layer has a vocabulary size of 20 and an input length of 4 words. We choose an embedding space of 32 dimensions. Importantly, the output from the Embedding layer will be 4 vectors of 32 dimensions each, one for each word. In the end, we flatten this to a single 128-element vector to pass on to the Dense output layer.

In [14]:
# define the model
model = Sequential()
model.add(Embedding(input_dim=embedding_vocab_size, 
                    output_dim=32, 
                    input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
# summarize the model
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 4, 32)             640       
_________________________________________________________________
flatten_1 (Flatten)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 769
Trainable params: 769
Non-trainable params: 0
_________________________________________________________________
None
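
The numbers in the summary can be verified by hand. The following sketch recomputes them from the layer sizes; the variable names are chosen for illustration.

```python
vocab_size = 20     # input_dim of the Embedding layer
embedding_dim = 32  # output_dim of the Embedding layer
input_length = 4    # words per padded document

# Embedding: one 32-dimensional vector per vocabulary index.
embedding_params = vocab_size * embedding_dim

# Flatten: 4 vectors of 32 values become one long vector, no weights.
flattened_size = input_length * embedding_dim

# Dense(1): one weight per flattened input plus one bias.
dense_params = flattened_size * 1 + 1

total_params = embedding_params + dense_params
print(embedding_params, flattened_size, dense_params, total_params)
```
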


In [22]:
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 100.000000


The trained model learned the training dataset perfectly, which is not surprising because it is evaluated on the same ten documents it was trained on.
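
To see what the two reported quantities mean, here is a minimal sketch of binary cross-entropy and accuracy in plain Python. The predicted probabilities are hypothetical values, not outputs of the model above, and the 0.5 threshold is the usual convention for a sigmoid output.

```python
import math

labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# Hypothetical sigmoid outputs, one per document (not real model outputs).
predictions = [0.9, 0.8, 0.7, 0.95, 0.6, 0.1, 0.3, 0.2, 0.4, 0.05]


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Share of predictions on the correct side of the threshold."""
    correct = sum(
        1 for y, p in zip(y_true, y_pred) if (p > threshold) == bool(y)
    )
    return correct / len(y_true)


loss = binary_crossentropy(labels, predictions)
accuracy = binary_accuracy(labels, predictions)
print('Loss: %f' % loss)
print('Accuracy: %f' % (accuracy * 100))
```

Accuracy can reach 100 percent while the loss is still above zero, because accuracy only checks on which side of the threshold each prediction falls.
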