<a href="https://colab.research.google.com/github/PearlSikka/language-ninja/blob/master/Understanding_word_embeddings_in_Keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd

In [None]:
import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding,Dense,Flatten
from tensorflow.keras.models import Sequential

In [None]:
sentences=['I love dogs','Lets go outside and play with dog','Beautiful sunrise today, wake up with positivity and shine on','Lets learn Keras and embeddings using embedding layer']

In [None]:
tokenizer=Tokenizer(num_words=30,lower=True,split=' ')

In [None]:
# fit_on_texts builds the vocabulary in place and returns None, so there is nothing to assign
tokenizer.fit_on_texts(sentences)

In [None]:
tokenizer.word_index

{'and': 1,
 'beautiful': 11,
 'dog': 10,
 'dogs': 6,
 'embedding': 23,
 'embeddings': 21,
 'go': 7,
 'i': 4,
 'keras': 20,
 'layer': 24,
 'learn': 19,
 'lets': 2,
 'love': 5,
 'on': 18,
 'outside': 8,
 'play': 9,
 'positivity': 16,
 'shine': 17,
 'sunrise': 12,
 'today': 13,
 'up': 15,
 'using': 22,
 'wake': 14,
 'with': 3}

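word_index returns the vocabulary as a dictionary that maps each unique word to its integer index. Indices start at 1 and are assigned in order of decreasing word frequency, so the most frequent word ('and') gets index 1.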
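
To make the ordering concrete, here is a minimal pure-Python sketch that reproduces the indices shown above. It is not the Tokenizer's actual implementation: `build_word_index` and `PUNCTUATION` are hypothetical names, and the set of stripped characters is a simplifying assumption.

```python
from collections import Counter

sentences = [
    "I love dogs",
    "Lets go outside and play with dog",
    "Beautiful sunrise today, wake up with positivity and shine on",
    "Lets learn Keras and embeddings using embedding layer",
]

# Simplifying assumption: only these characters are stripped before splitting.
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'


def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, most frequent first."""
    strip_table = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(strip_table).split())
    # sorted() is stable, so equally frequent words keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    # Numbering starts at 1, which leaves 0 free to act as the padding value.
    return {word: index for index, (word, _) in enumerate(ranked, start=1)}


word_index = build_word_index(sentences)
print(word_index["and"], word_index["lets"], word_index["with"], word_index["layer"])
# 1 2 3 24
```

Ties are broken by first appearance, which is why 'lets' (seen in sentence 2 before 'with') gets index 2 and 'with' gets index 3.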

In [None]:
seq=tokenizer.texts_to_sequences(sentences)

In [None]:
padded=pad_sequences(seq)

padded holds the padded sequences. By default, pad_sequences pads with zeros at the front (padding='pre') until every sequence is as long as the longest one.

In [None]:
for i,doc in enumerate(padded):
  print("Document",i+1,"has padded encoding as :",doc)

Document 1 has padded encoding as : [0 0 0 0 0 0 0 4 5 6]
Document 2 has padded encoding as : [ 0  0  0  2  7  8  1  9  3 10]
Document 3 has padded encoding as : [11 12 13 14 15  3 16  1 17 18]
Document 4 has padded encoding as : [ 0  0  2 19 20  1 21 22 23 24]
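
The pre-padding seen in this output can be reproduced by hand. The sketch below is only an illustration of the idea, not how pad_sequences is implemented; `pad_pre` is a hypothetical helper, and the integer sequences are read off the padded output above.

```python
def pad_pre(sequences, value=0):
    """Hypothetical helper: left-pad every sequence to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [[value] * (max_len - len(seq)) + list(seq) for seq in sequences]


# Integer sequences for the four sentences, without padding.
seq = [
    [4, 5, 6],
    [2, 7, 8, 1, 9, 3, 10],
    [11, 12, 13, 14, 15, 3, 16, 1, 17, 18],
    [2, 19, 20, 1, 21, 22, 23, 24],
]

padded_by_hand = pad_pre(seq)
for row in padded_by_hand:
    print(row)
```

The third sentence is the longest (10 words), so it sets the common length and receives no padding itself.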


We'll build a Sequential model and use an Embedding layer.

In [None]:
model=Sequential()

In [None]:
vocab_size=30
output_len=5
# model.add returns None, so the layer is added without assigning the result
model.add(Embedding(input_dim=vocab_size,output_dim=output_len,input_length=padded.shape[1]))

The Embedding layer converts integer word indices (equivalent to sparse one-hot vectors) into dense vectors.

The Embedding layer takes several parameters:
input_dim: size of the vocabulary (every word index must be smaller than input_dim)
output_dim: dimensionality of the dense vector produced for each word index. Every word is represented by a vector of length output_dim.
input_length: length of each input sequence, here the max length of the padded sequences.
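
The "sparse to dense" idea can be sketched in NumPy: looking up a row of a weight matrix gives the same result as multiplying a one-hot vector by that matrix. The `weights` array below is a random stand-in for the layer's trainable matrix, and the value range is an assumption for illustration, not values taken from the model.

```python
import numpy as np

vocab_size, output_len = 30, 5

# Hypothetical stand-in for the layer's weights: one row of length 5 per word index.
rng = np.random.default_rng(0)
weights = rng.uniform(-0.05, 0.05, size=(vocab_size, output_len))

# Padded encoding of 'I love dogs' from the output above.
doc = np.array([0, 0, 0, 0, 0, 0, 0, 4, 5, 6])

# Dense lookup: pick one row of the weight matrix per word index.
looked_up = weights[doc]                # shape (10, 5)

# Sparse equivalent: one-hot rows multiplied by the weight matrix.
one_hot = np.eye(vocab_size)[doc]       # shape (10, 30)
via_matmul = one_hot @ weights          # shape (10, 5)

same_result = np.allclose(looked_up, via_matmul)
print(looked_up.shape, same_result)
# (10, 5) True
```

The lookup avoids ever building the 30-wide one-hot vectors, which is the practical reason to index rows rather than multiply.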


In [None]:
model.output_shape

(None, 10, 5)

output_shape has dimensions (None, input_length, output_dim), where None stands for the batch size. Each word index is embedded into a vector of length output_dim.

In [None]:
# Flatten takes no layer argument; it flattens the output of the previous layer
model.add(Flatten())

In [None]:
model.output_shape

(None, 50)
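
As a rough illustration of why the shape becomes (None, 50), flattening keeps the batch axis and merges the remaining axes, 10 * 5 = 50. The NumPy sketch below uses a made-up `embedded` array as a stand-in for the Embedding output and assumes row-major ordering.

```python
import numpy as np

# Stand-in for the Embedding output: 4 documents, 10 positions, 5 dimensions.
embedded = np.arange(4 * 10 * 5, dtype=np.float32).reshape(4, 10, 5)

# Keep the batch axis, merge the rest into a single feature axis.
flat = embedded.reshape(embedded.shape[0], -1)
print(flat.shape)
# (4, 50)

# Each flattened row is the word vectors laid end to end:
# values 0-4 are the vector at position 0, values 5-9 the vector at position 1.
first_vector_kept = np.array_equal(flat[0, :5], embedded[0, 0])
second_vector_kept = np.array_equal(flat[0, 5:10], embedded[0, 1])
```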

In [None]:
# learning_rate replaces the deprecated lr argument
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),loss='binary_crossentropy',metrics=['acc'])

In [None]:
print(model.summary())

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 10, 5)             150       
_________________________________________________________________
flatten_4 (Flatten)          (None, 50)                0         
Total params: 150
Trainable params: 150
Non-trainable params: 0
_________________________________________________________________
None
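
The parameter counts in this summary can be checked with a little arithmetic: the Embedding layer stores one vector of length output_dim per vocabulary index, and Flatten has nothing to learn. The variable names below mirror the ones used earlier in the notebook.

```python
vocab_size = 30   # input_dim of the Embedding layer
output_len = 5    # output_dim of the Embedding layer

embedding_params = vocab_size * output_len  # one 5-dimensional row per index
flatten_params = 0                          # Flatten only reshapes its input

total_params = embedding_params + flatten_params
print(total_params)
# 150
```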


In [None]:
embeddings=model.predict(padded)



In [None]:
embeddings.shape

(4, 50)

In [None]:
print(embeddings)

[[-3.64207402e-02  4.60023619e-02  1.82082392e-02  4.49717753e-02
  -1.08600035e-02 -3.64207402e-02  4.60023619e-02  1.82082392e-02
   4.49717753e-02 -1.08600035e-02 -3.64207402e-02  4.60023619e-02
   1.82082392e-02  4.49717753e-02 -1.08600035e-02 -3.64207402e-02
   4.60023619e-02  1.82082392e-02  4.49717753e-02 -1.08600035e-02
  -3.64207402e-02  4.60023619e-02  1.82082392e-02  4.49717753e-02
  -1.08600035e-02 -3.64207402e-02  4.60023619e-02  1.82082392e-02
   4.49717753e-02 -1.08600035e-02 -3.64207402e-02  4.60023619e-02
   1.82082392e-02  4.49717753e-02 -1.08600035e-02  4.60126512e-02
   2.81035192e-02 -4.02567759e-02  2.40106322e-02 -1.39630325e-02
   2.82880776e-02 -2.39621289e-02  4.07682396e-02  2.92933919e-02
   1.71529166e-02 -2.97401678e-02 -3.55838165e-02 -1.22873411e-02
   1.67148449e-02  3.50773074e-02]
 [-3.64207402e-02  4.60023619e-02  1.82082392e-02  4.49717753e-02
  -1.08600035e-02 -3.64207402e-02  4.60023619e-02  1.82082392e-02
   4.49717753e-02 -1.08600035e-02 -3.6420

In [None]:
embed_re=embeddings.reshape(-1,padded.shape[1],output_len)

In [None]:
embed_re.shape

(4, 10, 5)

In [None]:
embed_re

array([[[-3.64207402e-02,  4.60023619e-02,  1.82082392e-02,
          4.49717753e-02, -1.08600035e-02],
        [-3.64207402e-02,  4.60023619e-02,  1.82082392e-02,
          4.49717753e-02, -1.08600035e-02],
        [-3.64207402e-02,  4.60023619e-02,  1.82082392e-02,
          4.49717753e-02, -1.08600035e-02],
        [-3.64207402e-02,  4.60023619e-02,  1.82082392e-02,
          4.49717753e-02, -1.08600035e-02],
        [-3.64207402e-02,  4.60023619e-02,  1.82082392e-02,
          4.49717753e-02, -1.08600035e-02],
        [-3.64207402e-02,  4.60023619e-02,  1.82082392e-02,
          4.49717753e-02, -1.08600035e-02],
        [-3.64207402e-02,  4.60023619e-02,  1.82082392e-02,
          4.49717753e-02, -1.08600035e-02],
        [ 4.60126512e-02,  2.81035192e-02, -4.02567759e-02,
          2.40106322e-02, -1.39630325e-02],
        [ 2.82880776e-02, -2.39621289e-02,  4.07682396e-02,
          2.92933919e-02,  1.71529166e-02],
        [-2.97401678e-02, -3.55838165e-02, -1.22873411e-02,
    

Reshaping makes it easier to see that we have 4 documents (the size of the corpus), each consisting of 10 words (the max length), with each word mapped to a 5-dimensional vector.
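
The repeated rows in the output above follow from the lookup: every position holding the same index receives the same vector. The sketch below illustrates this with random stand-in weights (an assumption, not the model's actual values) and the padded encodings printed earlier.

```python
import numpy as np

padded = np.array([
    [0, 0, 0, 0, 0, 0, 0, 4, 5, 6],
    [0, 0, 0, 2, 7, 8, 1, 9, 3, 10],
    [11, 12, 13, 14, 15, 3, 16, 1, 17, 18],
    [0, 0, 2, 19, 20, 1, 21, 22, 23, 24],
])

# Hypothetical random weights standing in for the layer (30 indices x 5 dimensions).
rng = np.random.default_rng(0)
weights = rng.uniform(-0.05, 0.05, size=(30, 5))

embed_re = weights[padded]  # shape (4, 10, 5)

# The 7 padding positions of document 1 all share the vector for index 0.
padding_rows_match = np.allclose(embed_re[0, :7], weights[0])

# 'and' (index 1) gets the same vector in documents 2, 3 and 4.
and_rows_match = (
    np.allclose(embed_re[1, 6], embed_re[2, 7])
    and np.allclose(embed_re[2, 7], embed_re[3, 5])
)

print(embed_re.shape, padding_rows_match, and_rows_match)
# (4, 10, 5) True True
```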