

* The output of the Embedding layer is a 2D matrix for each input document, with one embedding vector for each word in the input sequence of words.
* If you wish to connect a Dense layer directly to an Embedding layer, you must first flatten the 2D output matrix to a 1D vector using the Flatten layer.

* Embedding(input_dim=VOCAB_SIZE, output_dim=SIZE_OF_EMBEDDING, input_length=INP_SEQ_LENGTH)
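
As a rough illustration of the shapes involved, the lookup that an Embedding layer performs can be imitated with plain NumPy indexing. This is only a sketch: the random `embedding_matrix` and the sizes below are assumptions chosen for illustration, not Keras internals.

```python
import numpy as np

vocab_size = 50      # number of distinct integer codes (assumed)
embedding_dim = 8    # size of each word vector (assumed)

rng = np.random.default_rng(0)
# One row per integer code; a real layer learns these values during training.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# Two padded documents, already integer encoded.
batch = np.array([[21, 31, 0, 0],
                  [41, 16, 31, 43]])

# Lookup: each integer is replaced by its row, giving one 2D matrix per document.
embedded = embedding_matrix[batch]
print(embedded.shape)    # (2, 4, 8)

# Flatten: each 4 x 8 matrix becomes a single vector of 32 values.
flattened = embedded.reshape(len(batch), -1)
print(flattened.shape)   # (2, 32)
```

The flattened vector of 32 values per document is what a Dense layer would receive after a Flatten layer.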

# Keras 
# vectorising
* Keras expects text inputs to be vectorized (converted to numbers) before they reach the model
* https://keras.io/preprocessing/text
* Keras provides the one_hot() function that creates a hash of each word as an efficient integer encoding
* Alternatives are bag-of-words (BoW) or TF-IDF vectors
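
To make the hashing idea concrete, here is a minimal sketch of the hashing trick in plain Python. It uses MD5 so that results are repeatable across runs. The helper name `hash_encode` and the simple whitespace tokenisation are assumptions made for illustration; this is not a drop-in replacement for `one_hot()`.

```python
import hashlib

def hash_encode(text, n):
    """Map each word to an integer in [1, n - 1] using a stable hash."""
    codes = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Modulo keeps the code in range; +1 keeps 0 free for padding.
        codes.append(int(digest, 16) % (n - 1) + 1)
    return codes

codes = hash_encode("good work", 50)
print(codes)
```

Because different words can hash to the same code, collisions are possible. A larger `n` makes them less likely but does not rule them out.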

# Padding
* To feed text to a neural network, all inputs must have the same length
* https://machinelearningmastery.com/data-preparation-variable-length-input-sequences-sequence-prediction/
* Keras provides the pad_sequences() function for this
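
What padding at the end of a sequence does can be sketched in a few lines of plain Python. The helper `pad_post` is hypothetical and only covers zero padding (and cutting) at the end; `pad_sequences()` has more options.

```python
def pad_post(sequences, maxlen=None, value=0):
    """Pad (or cut) every sequence at the end so all have length maxlen."""
    if maxlen is None:
        # Default to the longest sequence in the data.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        trimmed = list(seq[:maxlen])
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

print(pad_post([[21, 31], [32], [41, 16, 31, 43]]))
# [[21, 31, 0, 0], [32, 0, 0, 0], [41, 16, 31, 43]]
```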


In [1]:
import numpy as np
from keras.preprocessing.text import one_hot, Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding

Using TensorFlow backend.


In [2]:
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!',
		'Weak',
		'Poor effort!',
		'not good',
		'poor work',
		'Could have done better.']
# define class labels
labels = np.array([1,1,1,1,1,0,0,0,0,0])


# one_hot() function that creates a hash of each word as an efficient integer encoding
* I don’t think that encoding the strings with one_hot() is ideal. Even with the recommended vocab size (50), I still got collisions, which defeat the purpose.
* Even the documentation states that uniqueness is not guaranteed.
* Keras’ Tokenizer() is more reliable, since it gives each distinct word its own index
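
A collision-free alternative is to build an explicit word-to-index dictionary, which is the general idea behind a tokenizer. The sketch below is plain Python; the helpers `tokenize` and `build_vocab` are hypothetical and use simplified punctuation handling. It illustrates the idea rather than reproducing the behaviour of Keras' `Tokenizer`, which orders its indices differently.

```python
import string

def tokenize(text):
    """Lowercase, drop ASCII punctuation, split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def build_vocab(texts):
    """Give each distinct word its own index, starting at 1 (0 stays free for padding)."""
    vocab = {}
    for text in texts:
        for word in tokenize(text):
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

sample_docs = ['Well done!', 'Good work', 'not good', 'poor work']
vocab = build_vocab(sample_docs)
encoded = [[vocab[w] for w in tokenize(d)] for d in sample_docs]
print(vocab)    # {'well': 1, 'done': 2, 'good': 3, 'work': 4, 'not': 5, 'poor': 6}
print(encoded)  # [[1, 2], [3, 4], [5, 3], [6, 4]]
```

Since every distinct word gets its own entry, two different words can never share a code, at the cost of keeping the dictionary in memory.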

In [5]:
# integer encode the documents 
vocab_size = 50 # 50 is larger than needed to reduce the probability of collisions from the hash function
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)

[[21, 31], [15, 46], [49, 16], [6, 46], [32], [45], [27, 16], [49, 15], [27, 46], [41, 16, 31, 43]]


# The word indices are 1-offset, which leaves 0 free (it is used as the padding value below).

# Preparing input for the neural network (padding)
* Q) What impact does the type of padding have on model performance for a task such as sentence classification?
* A) Use a Masking input layer, which ignores padded values. This means that padded inputs have no impact on learning.
# CNNs don't support masking; the Embedding layer and LSTM do
* CNNs therefore have to learn from the padded data directly
* A Masking layer ignores the padded data
* https://stackoverflow.com/questions/49961683/how-to-use-the-result-of-embedding-with-mask-zero-true-in-keras
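
The effect of masking can be illustrated with NumPy: a boolean mask marks the real tokens, and padded positions are left out of a pooled average. This is a hand-written sketch of the idea with made-up values, not how Keras implements masking internally.

```python
import numpy as np

# One padded document: two real tokens followed by two padding zeros.
token_ids = np.array([21, 31, 0, 0])

# Toy embeddings for the four positions (values chosen arbitrarily).
embedded = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [9.0, 9.0],   # padding position
                     [9.0, 9.0]])  # padding position

mask = token_ids != 0                      # True where a real token sits
masked_mean = embedded[mask].mean(axis=0)  # average over real tokens only
plain_mean = embedded.mean(axis=0)         # padding leaks into this average

print(masked_mean)  # [2. 3.]
print(plain_mean)   # [5.5 6. ]
```

Without the mask, the values at the padded positions shift the average, which is the kind of influence masking is meant to remove.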

In [6]:
max_length = 4 # can also be passed explicitly to pad_sequences via maxlen
padded_docs = pad_sequences(encoded_docs, padding='post') # by default the length of the longest sequence in the corpus is used, i.e. 4
print(padded_docs)

[[21 31  0  0]
 [15 46  0  0]
 [49 16  0  0]
 [ 6 46  0  0]
 [32  0  0  0]
 [45  0  0  0]
 [27 16  0  0]
 [49 15  0  0]
 [27 46  0  0]
 [41 16 31 43]]


# Padding and Truncating
* To feed text to a neural network, inputs must have the same length
  - Either all sequences in the entire data-set have the same length
  - Or every sequence within a batch has the same length
* Choosing the amount to pad
   - padding to the longest sequence wastes memory on texts that are short
   - truncating to the shortest sequence discards important values from the longer ones
   - so we pick a length in between
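
One possible way to pick an in-between length is to cover most, but not all, documents, for example by taking a high percentile of the sequence lengths. The helper name `choose_maxlen` and the 90% coverage figure are assumptions for illustration, not a rule from these notes.

```python
import math

def choose_maxlen(sequences, coverage=0.9):
    """Smallest length that fits at least `coverage` of the sequences untruncated."""
    lengths = sorted(len(seq) for seq in sequences)
    # Position of the sequence at the requested coverage fraction.
    index = math.ceil(coverage * len(lengths)) - 1
    return lengths[max(index, 0)]

encoded_docs = [[21, 31], [15, 46], [49, 16], [6, 46], [32],
                [45], [27, 16], [49, 15], [27, 46], [41, 16, 31, 43]]
print(choose_maxlen(encoded_docs))        # 2
print(choose_maxlen(encoded_docs, 1.0))   # 4
```

On this tiny corpus, a length of 2 fits nine of the ten documents, and only the longest one would be truncated.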

# The Embedding layer output is a 2D matrix, so to connect it to a Dense layer I flatten it, i.e. 4 x 8 = 32

In [7]:
model = Sequential()
model.add(Embedding(input_dim=4,output_dim=8,input_length=max_length)) # 4 x 8 embedding matrix; input_dim=4 is too small for word indices up to 49 (it should be vocab_size)
model.add(Flatten())
model.add(Dense(1,activation='sigmoid'))

# compile model
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['acc'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 4, 8)              32        
_________________________________________________________________
flatten_1 (Flatten)          (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
Total params: 65
Trainable params: 65
Non-trainable params: 0
_________________________________________________________________
None
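
The parameter counts in the summary can be checked by hand. The short calculation below uses the same layer sizes as the model above; the variable names are only for illustration.

```python
input_dim = 4      # rows in the embedding matrix
output_dim = 8     # columns in the embedding matrix
input_length = 4   # words per padded document

embedding_params = input_dim * output_dim        # 4 * 8 = 32
flattened_size = input_length * output_dim       # 4 * 8 = 32 values per document
dense_params = flattened_size * 1 + 1            # 32 weights + 1 bias = 33
total_params = embedding_params + dense_params   # 32 + 33 = 65

print(embedding_params, flattened_size, dense_params, total_params)  # 32 32 33 65
```

Note that the embedding count depends on `input_dim`, so an embedding matrix with 50 rows would have 50 * 8 = 400 parameters instead of 32.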


In [8]:
%%time
# fit the model

model.fit(padded_docs, labels, epochs=5, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=1)
print('Accuracy: %f' % (accuracy*100))


IndexError: index 49 is out of bounds for size 4
Apply node that caused the error: AdvancedSubtensor1(embedding_1/embeddings, Elemwise{Cast{int32}}.0)
Toposort index: 35
Inputs types: [TensorType(float32, matrix), TensorType(int32, vector)]
Inputs shapes: [(4, 8), (40,)]
Inputs strides: [(32, 4), (4,)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Reshape{2}(AdvancedSubtensor1.0, MakeVector{dtype='int64'}.0)]]

Backtrace when the node is created(use Theano flag traceback.limit=N to make it longer):
  File "c:\users\me\appdata\local\programs\python\python37\lib\site-packages\IPython\core\interactiveshell.py", line 3018, in run_cell_async
    interactivity=interactivity, compiler=compiler, result=result)
  File "c:\users\me\appdata\local\programs\python\python37\lib\site-packages\IPython\core\interactiveshell.py", line 3183, in run_ast_nodes
    if (yield from self.run_code(code, result)):
  File "c:\users\me\appdata\local\programs\python\python37\lib\site-packages\IPython\core\interactiveshell.py", line 3265, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-7-150e6839aec9>", line 2, in <module>
    model.add(Embedding(input_dim=4,output_dim=8,input_length=max_length)) #4 x 8
  File "c:\users\me\appdata\local\programs\python\python37\lib\site-packages\keras\engine\sequential.py", line 165, in add
    layer(x)
  File "c:\users\me\appdata\local\programs\python\python37\lib\site-packages\keras\engine\base_layer.py", line 457, in __call__
    output = self.call(inputs, **kwargs)
  File "c:\users\me\appdata\local\programs\python\python37\lib\site-packages\keras\layers\embeddings.py", line 141, in call
    out = K.gather(self.embeddings, inputs)
  File "c:\users\me\appdata\local\programs\python\python37\lib\site-packages\keras\backend\theano_backend.py", line 515, in gather
    y = reference[indices]

HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
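
The `IndexError` above comes from the embedding lookup: the embedding matrix has only 4 rows (`input_dim=4`), but the encoded documents contain indices up to 49. The NumPy sketch below reproduces the same kind of failure and shows that a table with `vocab_size` rows avoids it. It illustrates the lookup only and is not the Keras code path; in the model, the corresponding change is to pass `input_dim=vocab_size` to `Embedding`.

```python
import numpy as np

vocab_size = 50
padded_doc = np.array([49, 16, 0, 0])   # one padded document from above

too_small = np.zeros((4, 8))            # mirrors input_dim=4, output_dim=8
try:
    too_small[padded_doc]
    lookup_failed = False
except IndexError as err:
    lookup_failed = True
    print("lookup failed:", err)

big_enough = np.zeros((vocab_size, 8))  # one row per possible index 0..49
vectors = big_enough[padded_doc]
print(vectors.shape)                    # (4, 8)
```

Since `one_hot(d, vocab_size)` produces indices between 1 and `vocab_size - 1`, and 0 is used for padding, `vocab_size` rows are enough to cover every index.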