## Word Embeddings in Keras

A word embedding is a class of approaches for representing words and documents using a
dense vector representation. It is an improvement over the more traditional bag-of-words
encoding schemes, where large sparse vectors were used to represent each word or to score each
word within a vector representing an entire vocabulary. These representations were sparse
because the vocabularies were vast, so a given word or document would be represented by a
large vector composed mostly of zero values.
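
To make the sparsity concrete, here is a minimal sketch in plain Python that builds bag-of-words count vectors. The toy documents and the helper name `bag_of_words` are assumptions made for illustration and do not come from any library.

```python
# Toy corpus (illustrative assumption); real vocabularies hold many thousands of words.
toy_docs = ["good work", "poor effort", "great effort and good work"]

# Build a sorted vocabulary from every distinct word in the corpus.
vocab = sorted({word for doc in toy_docs for word in doc.split()})

def bag_of_words(doc, vocab):
    """Return one count per vocabulary word for a single document."""
    words = doc.split()
    return [words.count(term) for term in vocab]

vectors = [bag_of_words(doc, vocab) for doc in toy_docs]
for doc, vec in zip(toy_docs, vectors):
    zeros = vec.count(0)
    print(f"{doc!r}: {vec} ({zeros} of {len(vec)} entries are zero)")
```

Even with a six-word vocabulary, the two short documents are mostly zeros; with a realistic vocabulary the proportion of zeros grows much larger.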

Instead, in an embedding, words are represented by dense vectors where a vector represents
the projection of the word into a continuous vector space. The position of a word within the
vector space is learned from text and is based on the words that surround the word when it is
used. The position of a word in the learned vector space is referred to as its embedding. Two
popular methods for learning word embeddings from text are:

- Word2Vec
- GloVe

### Keras Embedding Layer
Keras offers an Embedding layer that can be used for neural networks on text data. It requires
that the input data be integer encoded, so that each word is represented by a unique integer.
This data preparation step can be performed using the Tokenizer API also provided with
Keras.

The Embedding layer is initialized with random weights and will learn an embedding for all
of the words in the training dataset.
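
The integer-encoding step can be sketched in plain Python as follows. This only illustrates the idea and is not the Tokenizer API itself; the helper names `build_word_index` and `encode` and the toy documents are assumptions made for the example. Index 0 is left unused so that it can serve as a padding value.

```python
def build_word_index(docs):
    """Map each distinct lowercase word to a unique integer, starting at 1."""
    word_index = {}
    for doc in docs:
        for word in doc.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def encode(doc, word_index):
    """Replace every known word in a document with its integer index."""
    return [word_index[w] for w in doc.lower().split() if w in word_index]

toy_docs = ["good work", "nice work", "poor work"]
word_index = build_word_index(toy_docs)
encoded = [encode(d, word_index) for d in toy_docs]
print(word_index)   # {'good': 1, 'work': 2, 'nice': 3, 'poor': 4}
print(encoded)      # [[1, 2], [3, 2], [4, 2]]
```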

The Embedding layer is defined as the first hidden layer of a network. It is typically
configured with three arguments:

- `input_dim`: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0 and 10, then the size of the vocabulary would be 11 words.
- `output_dim`: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.
- `input_length`: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents consist of 1000 words, this would be 1000.
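
Conceptually, the layer acts as a lookup table with one row per vocabulary entry and one column per embedding dimension. The NumPy sketch below mimics that lookup to show how the three arguments determine the shapes. It is not the Keras implementation, and the random weights, the sequence length of 5, and the example batch are assumptions for illustration.

```python
import numpy as np

input_dim = 11      # vocabulary size: integer codes 0..10
output_dim = 32     # size of each word vector
input_length = 5    # words per input sequence (illustrative value)

rng = np.random.default_rng(0)
# One row of weights per vocabulary entry, randomly initialised before training.
weights = rng.normal(size=(input_dim, output_dim))

# A batch of two integer-encoded sequences, each of length input_length.
batch = np.array([[1, 4, 10, 0, 0],
                  [7, 7, 2, 3, 0]])

# Indexing with the integer codes selects one weight row per word.
embedded = weights[batch]
print(embedded.shape)   # (2, 5, 32)
```

Every occurrence of the same integer (for example the two 7s in the second sequence) receives the same vector, which is what lets the network learn one embedding per word.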

### Learning an Embedding
In this section, we learn a word embedding while fitting a neural
network on a text classification problem. We define a small problem with 10
text documents, each a comment about a piece of work a student submitted. Each text
document is classified as positive (1) or negative (0). This is a simple sentiment analysis problem.
First, we define the documents and their class labels.

In [2]:
import numpy as np
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding

# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels (1 = positive, 0 = negative)
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

Using TensorFlow backend.


In [3]:
# Integer-encode the documents. Other schemes such as TF-IDF exist,
# but the Embedding layer requires integer word indices.
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)

[[18, 39], [18, 36], [27, 12], [24, 36], [32], [17], [10, 12], [38, 18], [10, 36], [31, 17, 39, 4]]
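
Despite its name, `one_hot` hashes each word to an integer between 1 and `vocab_size - 1` instead of assigning indices from a dictionary, so different words can share a code. In the output above, 18 appears for both 'Well' and 'Good'. The sketch below reproduces the idea with an MD5-based hash so that the result is stable across runs. The helper `hashed_index` is hypothetical, and its codes are not expected to match the Keras output shown.

```python
import hashlib

def hashed_index(word, n):
    """Hash a word to a stable integer in 1..n-1 (0 stays free for padding)."""
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

vocab_size = 50
words = ["well", "done", "good", "work", "great", "effort", "nice",
         "excellent", "weak", "poor", "not", "could", "have", "better"]
codes = {w: hashed_index(w, vocab_size) for w in words}
print(codes)

# Group words by code to reveal any collisions in this small vocabulary.
by_code = {}
for w, c in codes.items():
    by_code.setdefault(c, []).append(w)
collisions = {c: ws for c, ws in by_code.items() if len(ws) > 1}
print("collisions:", collisions)
```

Choosing a `vocab_size` well above the number of distinct words reduces the chance of collisions but does not rule them out.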


In [13]:
# pad every document to the same length, since Keras expects fixed-length input
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[18 39  0  0]
 [18 36  0  0]
 [27 12  0  0]
 [24 36  0  0]
 [32  0  0  0]
 [17  0  0  0]
 [10 12  0  0]
 [38 18  0  0]
 [10 36  0  0]
 [31 17 39  4]]
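
What post-padding does can be sketched in a few lines of plain Python. The helper `pad_post` is a hypothetical stand-in written for this example. It simply cuts sequences longer than `maxlen` at the end, which is a simplification and does not reproduce the truncation options of `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to maxlen.

    Sequences longer than maxlen keep only their first maxlen items
    (a simplification chosen for this sketch).
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

encoded = [[18, 39], [32], [31, 17, 39, 4]]
padded = pad_post(encoded, maxlen=4)
print(padded)
# [[18, 39, 0, 0], [32, 0, 0, 0], [31, 17, 39, 4]]
```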


Next, we define the Embedding layer as part of the model.
The Embedding layer has a vocabulary of 50 and an input length of 4. We choose a
small embedding space of 8 dimensions. The model is a simple binary classification model.
Importantly, the output from the Embedding layer will be 4 vectors of 8 dimensions each, one
for each word. We flatten this to a single 32-element vector to pass on to the Dense output layer.

In [6]:
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 4, 8)              400       
_________________________________________________________________
flatten_1 (Flatten)          (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
=================================================================
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
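
The parameter counts in the summary can be checked by hand. A short sketch of the arithmetic, using the values from the model above (the variable names are chosen for this example):

```python
vocab_size = 50      # rows in the embedding table
embedding_dim = 8    # columns in the embedding table
max_length = 4       # words per padded document

# Embedding: one vector of embedding_dim weights per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# Flatten has no weights; it only reshapes (4, 8) into 32 values.
flat_size = max_length * embedding_dim

# Dense(1): one weight per flattened input plus a single bias.
dense_params = flat_size * 1 + 1

total_params = embedding_params + dense_params
print(embedding_params, flat_size, dense_params, total_params)
# 400 32 33 433
```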


The output of the Embedding layer is a 4 x 8 matrix, which the Flatten layer reshapes into a
32-element vector.
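
These shape changes can be traced with a NumPy sketch of the forward pass. It uses random, untrained weights and is not how Keras computes the result internally; the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, max_length = 50, 8, 4

# Untrained stand-ins for the layer weights.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))
dense_w = rng.normal(size=(max_length * embedding_dim, 1))
dense_b = np.zeros(1)

# Two padded documents taken from the padded output shown earlier.
batch = np.array([[18, 39, 0, 0],
                  [31, 17, 39, 4]])

embedded = embedding_table[batch]            # (2, 4, 8): one 8-d vector per word
flat = embedded.reshape(len(batch), -1)      # (2, 32): the 4 vectors laid end to end
logits = flat @ dense_w + dense_b            # (2, 1): weighted sum plus bias
probs = 1.0 / (1.0 + np.exp(-logits))        # sigmoid, one probability per document

print(embedded.shape, flat.shape, probs.shape)
# (2, 4, 8) (2, 32) (2, 1)
```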

In [15]:
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model on the training documents
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy * 100))

Accuracy: 100.000000


This is a very simple model, and it is evaluated on the same 10 documents it was trained on, so the high accuracy is not surprising.