## Word Embeddings

#### Keras Definition: “Turns positive integers (indexes) into dense vectors of fixed size”
______

#### Bag of Words
* “Netflix is better than Maxdome” 
    * Bag of words: [Netflix] [is] [better] [than] [Maxdome]<br><br>
* One-hot encoding: the position of the word in the vocabulary is set to one and all other positions are set to zero
    * Maxdome → 00001
    * better → 00100
    * Netflix → 10000
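
The encoding above can be reproduced in a few lines. This is a minimal sketch that assumes the vocabulary is ordered as the words appear in the example sentence; the helper name `one_hot_vector` is made up for illustration.

```python
# Vocabulary in the order the words appear in the example sentence.
vocab = ["Netflix", "is", "better", "than", "Maxdome"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot_vector(word):
    """Return a list with a 1 at the word's index and 0 everywhere else."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

for word in ["Maxdome", "better", "Netflix"]:
    # Join the bits so the output matches the notation used above.
    print(word, "".join(str(bit) for bit in one_hot_vector(word)))
# prints:
# Maxdome 00001
# better 00100
# Netflix 10000
```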
______________

#### Problems:
1. Word ordering information is lost:
    * Netflix is better than Maxdome vs. Maxdome is better than Netflix
    * Different meaning but same representation<br><br>
2. Data sparsity:
    * many zeros and very few ones (imagine a vector with 20,000 entries, all but one of them zero)<br><br>
3. Words as atomic symbols
    * in a one-hot encoding, cat and dog are exactly as far apart as cat and apple
    * semantically, however, cat and dog should be closer together (both are animals)<br><br>
4. Very hard to find higher level features
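
Problems 1 and 3 are easy to check by hand. The sketch below assumes a plain word-count bag of words and a hypothetical three-word vocabulary for the distance comparison; the helper names are illustrative only.

```python
import math
from collections import Counter

def bag_of_words(sentence):
    """Count word occurrences, ignoring word order."""
    return Counter(sentence.split())

a = bag_of_words("Netflix is better than Maxdome")
b = bag_of_words("Maxdome is better than Netflix")
same_bag = (a == b)  # True: the two sentences get the same representation

# Hypothetical three-word vocabulary, one-hot encoded.
one_hot = {"cat": [1, 0, 0], "dog": [0, 1, 0], "apple": [0, 0, 1]}

def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

cat_dog = euclidean(one_hot["cat"], one_hot["dog"])
cat_apple = euclidean(one_hot["cat"], one_hot["apple"])
print(same_bag, cat_dog == cat_apple)
# prints True True
```

Any two distinct one-hot vectors differ in exactly two positions, so every pair of words ends up at the same distance.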
______

#### Example
<img src="images/Emb.png" alt="Emb" style="width: 350px;"/>
* Next, create the embedding matrix<br><br>
* Decide how long each vector should be (a length of 32 or 50 is common)<br><br>
* Here we assign 6 values per index to keep the example readable<br><br>

___
### The embedding vector for the word deep (one row of the embedding matrix):
* ### deep = [.32, .02, .48, .21, .56, .15]<br><br>

* Instead of ending up with huge one-hot encoded vectors, an embedding matrix keeps the size of a vector much smaller<br><br>
* The embedded vectors are learned during the training process<br><br>
* This is computationally efficient when using very big datasets
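
Conceptually, an embedding lookup just selects one row of the matrix. The sketch below assumes a hypothetical three-word vocabulary; only the row for "deep" comes from the example, the other rows are made-up values. It also shows that the lookup gives the same result as multiplying a one-hot vector by the matrix, without ever building the one-hot vector.

```python
# Hypothetical vocabulary; only the "deep" row is taken from the example above.
word_to_index = {"deep": 0, "learning": 1, "is": 2}
embedding_matrix = [
    [0.32, 0.02, 0.48, 0.21, 0.56, 0.15],  # deep
    [0.10, 0.75, 0.33, 0.08, 0.61, 0.27],  # learning (made-up values)
    [0.44, 0.19, 0.05, 0.90, 0.12, 0.38],  # is (made-up values)
]

def embed(word):
    """Look up the embedding by selecting the row at the word's index."""
    return embedding_matrix[word_to_index[word]]

def embed_via_one_hot(word):
    """Same result, computed as one-hot vector times embedding matrix."""
    n_rows, n_dims = len(embedding_matrix), len(embedding_matrix[0])
    one_hot = [1 if r == word_to_index[word] else 0 for r in range(n_rows)]
    return [
        sum(one_hot[r] * embedding_matrix[r][c] for r in range(n_rows))
        for c in range(n_dims)
    ]

print(embed("deep"))
# prints [0.32, 0.02, 0.48, 0.21, 0.56, 0.15]
```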

___



#### Word Embeddings
<img src="images/WE1.png" alt="WE" style="width: 600px;"/>
______

<img src="images/collobert.png" alt="collobert" style="width: 600px;"/>
* Example of the closest words found in a word embedding space
* France → European countries, Jesus → religious words
* The semantic similarity of these words and the relations between them are learned entirely from the data
____________

#### Recap
<img src="images/vectors.png" alt="vectors" style="width: 600px;"/><br><br>
* Relationships are learned as well, for example the notion of gender<br><br>
* Remember: the vector difference [woman - man] is approximately the same as [queen - king]
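
The offset idea can be illustrated with toy numbers. The 2-dimensional vectors below are invented for this sketch (first axis loosely "royalty", second axis loosely "gender"); in embeddings learned from real data the two differences are only approximately equal.

```python
# Made-up 2-D vectors: [royalty, gender]. Not taken from a trained model.
vectors = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [3.0, 0.0],
    "queen": [3.0, 1.0],
}

def subtract(u, v):
    """Element-wise difference of two vectors."""
    return [x - y for x, y in zip(u, v)]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(u, v)]

gender_offset = subtract(vectors["woman"], vectors["man"])
royal_offset = subtract(vectors["queen"], vectors["king"])
print(gender_offset, royal_offset)
# prints [0.0, 1.0] [0.0, 1.0]

# king - man + woman lands on queen in this toy space.
analogy = add(subtract(vectors["king"], vectors["man"]), vectors["woman"])
print(analogy == vectors["queen"])
# prints True
```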
______

#### Benefits
1. Trained in a completely unsupervised way:
    * you don't need labeled data<br><br>
2. Reduced data sparsity:
    * instead of a huge number of 0/1 entries, you work with a small number of float values<br><br>
3. Semantic hashing:
    * the vectors appear to carry semantic information about the words
    * semantically similar words lie closer together<br><br>
4. Freely available for out-of-the-box use
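
"Closer together" is commonly measured with cosine similarity. The following sketch uses invented 3-dimensional vectors, not values from a trained model, and the helper names are hypothetical.

```python
import math

# Made-up embeddings for illustration only.
embeddings = {
    "cat":   [0.9, 0.8, 0.1],
    "dog":   [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def nearest(word):
    """Return the other word whose vector is most similar to `word`."""
    others = [w for w in embeddings if w != word]
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))

print(nearest("cat"))
# prints dog
```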

______

## Train Your Own Embedding Layer

#### In a Nutshell:
* Word embeddings provide a dense representation of words and their relative meanings<br><br>

* They are an improvement over the sparse representations used in simpler bag-of-words models<br><br>

* Word embeddings can be learned from text data and reused among projects<br><br>

* They can also be learned as part of fitting a neural network on text data

#### How to train your own embedding layer in Keras

* It requires the input data to be integer encoded, so that each word is represented by a unique integer<br><br>
* The Embedding layer is initialized with random weights and learns an embedding for every word in the training dataset<br><br>
* We will define a small problem with ten text documents and classify them as positive "1" or negative "0"<br><br>
* We are using a Keras Sequential Model here to finish the task

In [1]:
# import packages
import re
import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import one_hot
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

Using TensorFlow backend.


In [23]:
# define the documents and label each as positive (1) or negative (0)

docs = ['well done', 'good work', 'great effort', 'nice work', 'excellent',
        'weak', 'poor effort', 'not good', 'poor work', 'could have done better']
# NumPy array rather than a plain list, which is the safer target type for model.fit
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

______
#### One Hot Encoding
* Next, we integer encode each document, so that the Embedding layer receives sequences of integers as input<br><br>

* We use the Keras one_hot() function, which hashes each word to an integer index. Despite its name, it does not return one-hot vectors<br><br>

* We set the vocabulary size to 20, more than the 14 distinct words in the documents, to reduce the probability of collisions from the hash function. Collisions can still occur: in the output below, 'great' and 'nice' both map to 10

In [35]:
# integer encode the documents in docs by hashing each word

embedding_vocab_size = 20
encoded_docs = [one_hot(d, embedding_vocab_size) for d in docs]
print(encoded_docs)

[[17, 13], [5, 8], [10, 7], [10, 8], [6], [5], [4, 7], [15, 5], [4, 8], [2, 13, 13, 7]]
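
To see why hashing can produce collisions, here is a rough pure-Python sketch of hash-based integer encoding. It is not the Keras implementation: the MD5 hash, the modulo scheme and the name `hash_encode` are assumptions chosen to make the result reproducible, so the indices will differ from the output above.

```python
import hashlib

def hash_encode(text, n):
    """Map each word to an integer in [1, n - 1] using a stable MD5 hash."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is never produced, so it stays free for padding.
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

docs = ['well done', 'good work', 'poor work']
encoded = [hash_encode(d, 20) for d in docs]
print(encoded)

# With more distinct words than available indices, collisions are unavoidable.
many_words = " ".join("word%d" % i for i in range(25))
unique_indices = set(hash_encode(many_words, 20))
print(len(unique_indices), "distinct indices for 25 distinct words")
```

The same word always receives the same index, but two different words may share one. A larger vocabulary size makes that less likely without ruling it out.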


______
#### Padding
* Pad each document so that all of them have the same length

In [36]:
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[17 13  0  0]
 [ 5  8  0  0]
 [10  7  0  0]
 [10  8  0  0]
 [ 6  0  0  0]
 [ 5  0  0  0]
 [ 4  7  0  0]
 [15  5  0  0]
 [ 4  8  0  0]
 [ 2 13 13  7]]
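
What post-padding does can be sketched without Keras. The helper `pad_post` below is hypothetical; it appends zeros on the right and, as an assumption mirroring the default truncation behaviour of `pad_sequences`, keeps the last `maxlen` items of sequences that are too long.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to length `maxlen`."""
    padded = []
    for seq in sequences:
        trimmed = list(seq)[-maxlen:]  # drop leading items of over-long sequences
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

# Three of the encoded documents from the output above.
encoded_docs = [[17, 13], [6], [2, 13, 13, 7]]
print(pad_post(encoded_docs, maxlen=4))
# prints [[17, 13, 0, 0], [6, 0, 0, 0], [2, 13, 13, 7]]
```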


______
#### Keras Embedding Layer
* We will define our Embedding layer as part of a neural network model for binary classification

* The Embedding has a vocabulary of 20 indices and an input length of 4 words

* We choose an embedding space of 32 dimensions

* Importantly, the output from the Embedding layer will be 4 vectors of 32 dimensions each, one for each word

* We flatten this to one 128-element vector (4 × 32) to pass on to the Dense output layer
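
The data flow through the three layers can be traced in plain Python. This sketch uses the sizes from the code cell that follows (20 indices, 32 dimensions, input length 4) and random numbers in place of learned weights, so the resulting probability is meaningless; only the shapes matter.

```python
import math
import random

random.seed(0)
vocab_size, embed_dim, max_length = 20, 32, 4

# Random values stand in for the weights the layers would learn.
embedding = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
             for _ in range(vocab_size)]
dense_w = [random.uniform(-0.1, 0.1) for _ in range(max_length * embed_dim)]
dense_b = 0.0

def forward(padded_doc):
    """Embedding lookup, flatten, then a single sigmoid unit."""
    vectors = [embedding[i] for i in padded_doc]   # 4 vectors of 32 values
    flat = [x for vec in vectors for x in vec]     # one 128-element vector
    z = sum(w * x for w, x in zip(dense_w, flat)) + dense_b
    return 1.0 / (1.0 + math.exp(-z))              # probability in (0, 1)

prob = forward([17, 13, 0, 0])  # the padded encoding of 'well done'
print(len(dense_w), 0.0 < prob < 1.0)
# prints 128 True
```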

In [37]:
# define the model
model = Sequential()
model.add(Embedding(input_dim=embedding_vocab_size, 
                    output_dim=32, 
                    input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
# summarize the model
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 4, 32)             640       
_________________________________________________________________
flatten_3 (Flatten)          (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 129       
Total params: 769
Trainable params: 769
Non-trainable params: 0
_________________________________________________________________
None
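
The parameter counts in the summary can be verified with simple arithmetic, assuming the Dense layer has one weight per input plus a single bias:

```python
vocab_size, embed_dim, max_length = 20, 32, 4

embedding_params = vocab_size * embed_dim   # one 32-value row per index
flatten_size = max_length * embed_dim       # 4 vectors of 32 values each
dense_params = flatten_size * 1 + 1         # 128 weights + 1 bias
total_params = embedding_params + dense_params

print(embedding_params, flatten_size, dense_params, total_params)
# prints 640 128 129 769
```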


In [41]:
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 100.000000


#### Result: the model fits the training data perfectly, which is not surprising because it is evaluated on the same ten documents it was trained on.
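
For reference, an accuracy figure like the one above can be computed as sketched here, assuming predicted probabilities are thresholded at 0.5. The probabilities are invented for illustration and are not the model's actual outputs.

```python
def binary_accuracy(labels, probabilities, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(y == y_hat) for y, y_hat in zip(labels, predictions))
    return correct / len(labels)

labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# Made-up probabilities, one per training document.
probabilities = [0.91, 0.84, 0.77, 0.69, 0.88, 0.35, 0.12, 0.41, 0.20, 0.08]

accuracy = binary_accuracy(labels, probabilities)
print('Accuracy: %f' % (accuracy * 100))
# prints Accuracy: 100.000000
```

A score measured on the training documents says little about unseen text; a held-out set would be needed for that.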