# Embedding Layer

Keras offers an Embedding layer that can be used for neural networks on text data.

It requires that the input data be integer encoded so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.
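
To make the idea of integer encoding concrete, here is a minimal sketch in plain Python. The helpers `build_word_index` and `encode` are hypothetical names used only for illustration and are not part of Keras. The Keras Tokenizer assigns indices by word frequency, whereas this sketch simply numbers words in the order they are first seen.

```python
# Hypothetical helpers (not part of Keras): map each word to a unique integer.
def build_word_index(docs):
    word_index = {}
    for doc in docs:
        for word in doc.lower().split():
            if word not in word_index:
                # Start at 1 so that 0 stays free for padding.
                word_index[word] = len(word_index) + 1
    return word_index


def encode(doc, word_index):
    # Replace every known word with its integer; unknown words are skipped.
    return [word_index[w] for w in doc.lower().split() if w in word_index]


sample_docs = ['good work', 'poor work', 'not good']
word_index = build_word_index(sample_docs)
encoded = [encode(d, word_index) for d in sample_docs]

print(word_index)  # {'good': 1, 'work': 2, 'poor': 3, 'not': 4}
print(encoded)     # [[1, 2], [3, 2], [4, 1]]
```

Index 0 is left unused so that it can later serve as the padding value.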

The Embedding layer is initialized with random weights and learns an embedding for every word in the training dataset. You must specify input_dim, the size of the vocabulary, and output_dim, the size of the embedding vector space. You can optionally specify input_length, the number of words in each input sequence.
layer = Embedding(input_dim, output_dim, input_length=??)
For example, for a vocabulary of 200 words, a distributed representation of 32 dimensions and an input length of 50 words:
layer = Embedding(200, 32, input_length=50)
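
Conceptually, an embedding layer is a lookup table with one row per vocabulary index. The sketch below imitates that lookup in plain Python for the sizes used above. The name `embed` and the initialization range are assumptions made for illustration, not the Keras implementation.

```python
import random

input_dim, output_dim, input_length = 200, 32, 50

random.seed(0)
# One row of output_dim random weights per vocabulary index.
# The range (-0.05, 0.05) is an illustrative choice.
weights = [[random.uniform(-0.05, 0.05) for _ in range(output_dim)]
           for _ in range(input_dim)]


def embed(sequence):
    # Look up the weight row for every integer in the sequence.
    return [weights[i] for i in sequence]


# A made-up input document: 50 word indices drawn from the vocabulary.
sequence = [random.randrange(input_dim) for _ in range(input_length)]
vectors = embed(sequence)

print(len(vectors), len(vectors[0]))  # 50 32
```

Each document of 50 integers becomes a 50 x 32 matrix. During training, the rows of the table are the values that get updated.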

# Embedding with Model

The Embedding layer can be used as the front-end of a deep learning model to provide a rich distributed representation of words, and importantly this representation can be learned as part of training the deep learning model.

For example, the snippet below defines and compiles a neural network with an embedding input layer and a dense output layer for a document classification problem.

When the model is trained on examples of padded documents and their associated output labels, both the network weights and the distributed representation are tuned to the specific data.

In [1]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
# define problem
vocab_size = 100
max_length = 32
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())

Using TensorFlow backend.


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 32, 8)             800       
_________________________________________________________________
flatten_1 (Flatten)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 257       
Total params: 1,057
Trainable params: 1,057
Non-trainable params: 0
_________________________________________________________________
None
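
The parameter counts in the summary above can be checked by hand. The short calculation below reproduces them from the values of vocab_size, the embedding size and max_length used in the cell. The variable names are chosen here for illustration.

```python
vocab_size, embedding_dim, max_length = 100, 8, 32

# One embedding_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# Flatten has no weights; it only reshapes (32, 8) into 256 values.
flatten_units = max_length * embedding_dim
# One weight per flattened input plus a single bias for the one output unit.
dense_params = flatten_units * 1 + 1

print(embedding_params, flatten_units, dense_params)  # 800 256 257
print(embedding_params + dense_params)                # 1057
```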


In [2]:
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
# define documents
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!',
		'Weak',
		'Poor effort!',
		'not good',
		'poor work',
		'Could have done better.']
# define class labels
labels = [1,1,1,1,1,0,0,0,0,0]
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

[[44, 36], [18, 41], [41, 7], [17, 41], [3], [5], [21, 7], [9, 18], [21, 41], [39, 7, 36, 42]]
[[44 36  0  0]
 [18 41  0  0]
 [41  7  0  0]
 [17 41  0  0]
 [ 3  0  0  0]
 [ 5  0  0  0]
 [21  7  0  0]
 [ 9 18  0  0]
 [21 41  0  0]
 [39  7 36 42]]
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 4, 8)              400       
_________________________________________________________________
flatten_2 (Flatten)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
None
Accuracy: 80.000001
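
Note that in the encoded output above, 41 stands for both "great" and "work", and 7 stands for both "effort" and "have". This happens because one_hot is based on hashing, so two different words can land on the same integer. The sketch below illustrates the effect in plain Python. The helpers `hashed_index` and `find_collisions` are hypothetical, and md5 is used instead of Python's built-in hash only so that results are stable across runs, so the integers will not match the Keras output.

```python
import hashlib
from collections import defaultdict


def hashed_index(word, n):
    # md5 is used here only to get a hash that is stable across runs.
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    # Map into 1..n-1 so that 0 stays reserved for padding.
    return int(digest, 16) % (n - 1) + 1


def find_collisions(words, n):
    # Group words by hashed index and keep only indices shared by 2+ words.
    buckets = defaultdict(set)
    for w in words:
        buckets[hashed_index(w, n)].add(w)
    return {i: sorted(ws) for i, ws in buckets.items() if len(ws) > 1}


words = ['well', 'done', 'good', 'work', 'great', 'effort', 'nice',
         'excellent', 'weak', 'poor', 'not', 'could', 'have', 'better']

# Compare a small and a large hashing space.
for n in (50, 1000):
    print(n, find_collisions(words, n))
```

A larger vocab_size makes collisions less likely. If every word must have its own integer, a fitted Tokenizer is one alternative, since it builds an explicit word index.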


In [36]:
# define documents
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!',
		'Weak',
		'Poor effort!',
		'not good',
		'poor work',
		'Could have done better.']
# define class labels
labels = [1,1,1,1,1,0,0,0,0,0]

In [25]:
from nltk import word_tokenize
doclist=[]
for doc in docs:
    doclist.append(word_tokenize(doc))
print(doclist)        

[['Well', 'done', '!'], ['Good', 'work'], ['Great', 'effort'], ['nice', 'work'], ['Excellent', '!'], ['Weak'], ['Poor', 'effort', '!'], ['not', 'good'], ['poor', 'work'], ['Could', 'have', 'done', 'better', '.']]


In [28]:
from gensim.models import Word2Vec
model=Word2Vec(doclist,min_count=1)

for word in model.wv.vocab:
    print(word , end=' ')

Well done ! Good work Great effort nice Excellent Weak Poor not good poor Could have better . 

In [39]:
from keras.preprocessing.text import Tokenizer
tokens=Tokenizer()
tokens.fit_on_texts(docs)
print(docs)
print(tokens.word_docs)

['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!', 'Weak', 'Poor effort!', 'not good', 'poor work', 'Could have done better.']
{'done': 2, 'well': 1, 'good': 2, 'work': 3, 'great': 1, 'effort': 2, 'nice': 1, 'excellent': 1, 'weak': 1, 'poor': 2, 'not': 1, 'could': 1, 'better': 1, 'have': 1}
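
The word_docs attribute printed above counts the number of documents each word appears in, not the total number of occurrences. The plain-Python sketch below reproduces those counts for the same documents. The `tokenize` helper is a hypothetical approximation of the Tokenizer defaults (lowercasing and punctuation filtering), not the Keras code itself.

```python
import string
from collections import Counter

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
        'Weak', 'Poor effort!', 'not good', 'poor work',
        'Could have done better.']

# Replace punctuation with spaces (an approximation of the Tokenizer filters).
PUNCT_TO_SPACE = str.maketrans(string.punctuation,
                               ' ' * len(string.punctuation))


def tokenize(doc):
    return doc.lower().translate(PUNCT_TO_SPACE).split()


# Count each word at most once per document.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(tokenize(doc)))

print(doc_freq['work'], doc_freq['done'], doc_freq['well'])  # 3 2 1
```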
