### Learning An Embedding

In this section, we will look at how we can learn a word embedding while fitting a neural network on a text classification problem.

We will define a small problem where we have 10 text documents, each with a comment about a piece of work a student submitted. Each text document is classified as positive “1” or negative “0”. This is a simple sentiment analysis problem.

In [1]:
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding

In [2]:
# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])

### One Hot Representation


In [3]:
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)

[[47, 10], [11, 47], [6, 45], [8, 47], [6], [15], [4, 45], [39, 11], [4, 47], [43, 42, 10, 41]]
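
Despite its name, `one_hot` returns one integer per word rather than a one-hot vector. It lowercases the text, strips punctuation, and hashes each word into the range 1 to `vocab_size - 1`. Different words can collide: in the output above, "Well" and "work" both map to 47, and "Great" and "Excellent" both map to 6. The default hash is Python's built-in `hash`, which is not stable across runs, so the exact integers may differ when the notebook is re-run. The sketch below illustrates the idea with a stable hash. `hash_encode` is a hypothetical helper written for this example, and the choice of MD5 is an assumption made for reproducibility; it is not the code the notebook ran.

```python
import hashlib
import string

def hash_encode(text, n):
    """Map each word of `text` to an integer in [1, n) via a stable hash."""
    # lowercase and replace punctuation with spaces, then split on whitespace
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    words = text.lower().translate(table).split()
    # MD5 is used only as a deterministic hash; index 0 stays free for padding
    return [int(hashlib.md5(w.encode('utf-8')).hexdigest(), 16) % (n - 1) + 1
            for w in words]

docs = ['Good work', 'not good', 'Could have done better.']
encoded = [hash_encode(d, 50) for d in docs]
print(encoded)
```

The same word always receives the same index regardless of case or punctuation, so "Good" and "good" share an entry. A larger `n` lowers the chance of collisions.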


### Using Pad Sequences 


In [4]:
# pad every document to the length (in words) of the longest document
sent_length = max(len(sen.split()) for sen in docs)
embedded_docs = pad_sequences(encoded_docs, padding='pre', maxlen=sent_length)
print(embedded_docs)

[[ 0  0 47 10]
 [ 0  0 11 47]
 [ 0  0  6 45]
 [ 0  0  8 47]
 [ 0  0  0  6]
 [ 0  0  0 15]
 [ 0  0  4 45]
 [ 0  0 39 11]
 [ 0  0  4 47]
 [43 42 10 41]]
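
What `padding='pre'` does can be sketched in plain Python: zeros are added on the left until every row has the same length, and rows that are too long lose items from the front. `pad_pre` is a hypothetical stand-in written for this example, not the Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and, if needed, left-truncate) each sequence to `maxlen`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]          # keep the last `maxlen` items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

# three of the encoded documents from the output above
encoded_docs = [[47, 10], [6], [43, 42, 10, 41]]
for row in pad_pre(encoded_docs, maxlen=4):
    print(row)
```

Index 0 is safe to use as the padding value because the word indices produced earlier start at 1.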


### Embedding Layer Output Without a Dense Layer

In [5]:
# feature representation: each word index maps to an 8-dimensional vector
model=Sequential()
model.add(Embedding(vocab_size, 8, input_length=sent_length))
model.compile('adam','mse')

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 4, 8)              400       
Total params: 400
Trainable params: 400
Non-trainable params: 0
_________________________________________________________________


In [6]:
model.predict(embedded_docs)

array([[[ 0.03621672,  0.01339883,  0.04481817, -0.01903031,
         -0.00732581,  0.02661328, -0.02585479,  0.02550535],
        [ 0.03621672,  0.01339883,  0.04481817, -0.01903031,
         -0.00732581,  0.02661328, -0.02585479,  0.02550535],
        [-0.01808726, -0.03990116,  0.03457019, -0.0126509 ,
          0.04557369, -0.03543545, -0.00449588,  0.04883726],
        [-0.03252991,  0.0085272 , -0.04137795, -0.01106777,
         -0.00585307, -0.00744376, -0.00409549, -0.02560116]],

       [[ 0.03621672,  0.01339883,  0.04481817, -0.01903031,
         -0.00732581,  0.02661328, -0.02585479,  0.02550535],
        [ 0.03621672,  0.01339883,  0.04481817, -0.01903031,
         -0.00732581,  0.02661328, -0.02585479,  0.02550535],
        [-0.02627642,  0.02711337, -0.02349795,  0.03750106,
          0.04942321,  0.04389988, -0.02694153, -0.03764254],
        [-0.01808726, -0.03990116,  0.03457019, -0.0126509 ,
          0.04557369, -0.03543545, -0.00449588,  0.04883726]],

       [[ 0.

In [7]:
print(model.predict(embedded_docs)[0])

[[ 0.03621672  0.01339883  0.04481817 -0.01903031 -0.00732581  0.02661328
  -0.02585479  0.02550535]
 [ 0.03621672  0.01339883  0.04481817 -0.01903031 -0.00732581  0.02661328
  -0.02585479  0.02550535]
 [-0.01808726 -0.03990116  0.03457019 -0.0126509   0.04557369 -0.03543545
  -0.00449588  0.04883726]
 [-0.03252991  0.0085272  -0.04137795 -0.01106777 -0.00585307 -0.00744376
  -0.00409549 -0.02560116]]
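
The first two rows of the output above are identical because both positions hold the padding index 0, and an embedding layer is a lookup table with one row per index. A minimal sketch of that lookup follows. The table is filled with random values in a small range chosen as an assumption to resemble the magnitudes above; the seed and the `embed` helper are hypothetical, and no training is involved.

```python
import random

vocab_size, embed_dim = 50, 8
random.seed(0)  # arbitrary seed, only to make the sketch repeatable

# one row of `embed_dim` weights per word index (50 * 8 = 400 parameters)
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

def embed(padded_doc):
    """Replace each word index with its row of the table."""
    return [table[i] for i in padded_doc]

vectors = embed([0, 0, 47, 10])   # first padded document from above
print(len(vectors), len(vectors[0]))   # words per document, vector length
print(vectors[0] == vectors[1])        # both padding positions share row 0
```

During training, the rows of this table are the weights that get updated, which is how the embedding is learned.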


### Embedding Layer with Dense Layer 

In [8]:
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=sent_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 4, 8)              400       
_________________________________________________________________
flatten (Flatten)            (None, 32)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
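
The parameter counts in the summary can be checked by hand. The short calculation below reproduces them from the vocabulary size, embedding width, and padded length used in this section; the variable names are chosen for this example only.

```python
vocab_size, embed_dim, sent_length = 50, 8, 4

embedding_params = vocab_size * embed_dim        # one vector per word index
flatten_units = sent_length * embed_dim          # (4, 8) reshaped to (32,)
dense_params = flatten_units * 1 + 1             # 32 weights + 1 bias
total_params = embedding_params + dense_params

print(embedding_params, flatten_units, dense_params, total_params)
```

The Flatten layer has no parameters of its own; it only reshapes the embedding output so the Dense layer can consume it.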


In [9]:
# fit the model
model.fit(embedded_docs, labels, epochs=50, verbose=0)
# evaluate the model on the same documents it was trained on
loss, accuracy = model.evaluate(embedded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 89.999998
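
An accuracy of 90% on 10 documents corresponds to 9 correct predictions. The sketch below shows how such a figure is computed by thresholding sigmoid outputs at 0.5, and why it may print as 89.999998 rather than 90: storing 0.9 as a 32-bit float, which is assumed here to be how the metric was held, loses a little precision. The probabilities are hypothetical values invented for illustration, not the model's actual outputs.

```python
import struct

labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# hypothetical predicted probabilities, invented for illustration only
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.1, 0.2, 0.3, 0.4, 0.2]

# a probability above 0.5 counts as a positive prediction
preds = [1 if p > 0.5 else 0 for p in probs]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

# round-trip through 32-bit float storage, as a float32 metric would be
accuracy32 = struct.unpack('f', struct.pack('f', accuracy))[0]
report = 'Accuracy: %f' % (accuracy32 * 100)
print(report)
```

Because the score is measured on the training documents, it shows how well the model fits this data and says little about how it would handle unseen comments.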
