### Word Embedding - Practical Implementation using Keras

https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

See the Word Embedding Intuition part to understand how word embeddings work.

In [2]:
from tensorflow.keras.preprocessing.text import one_hot

### sentences
sent = [
    'the glass of milk',
    'the glass of juice',
    'the cup of tea',
    'I am a good boy',
    'I am a good developer',
    'understand the meaning of words',
    'your videos are good',
]
sent

['the glass of milk',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'your videos are good']

In [3]:
### Vocabulary size
voc_size=10000

Keras offers an Embedding layer that can be used for neural networks on text data.

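It requires integer-encoded input, so each word must be represented by an integer index. This preparation step can be performed with the Tokenizer API or with the one_hot function, both of which ship with Keras. Despite its name, one_hot does not return one-hot vectors: it hashes each word to an integer in the range [1, voc_size - 1], so two different words can occasionally end up with the same index.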
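One way to picture integer encoding is a dictionary that assigns each distinct word its own index. The following is a minimal pure-Python sketch of that idea, not the Keras Tokenizer itself; the helper names `build_word_index` and `encode` are hypothetical, and index 0 is deliberately left unused so it can serve as a padding value later.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each distinct lowercased word to a unique integer, starting at 1."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # More frequent words get smaller indices; 0 is left free for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(sentence, word_index):
    """Replace every word of a sentence with its integer index."""
    return [word_index[w] for w in sentence.lower().split()]

sentences = ['the glass of milk', 'the glass of juice', 'the cup of tea']
word_index = build_word_index(sentences)
encoded = [encode(s, word_index) for s in sentences]
print(word_index)
print(encoded)
```

Unlike a hashing-based encoder, a dictionary like this never assigns two words the same index, at the cost of having to store the vocabulary.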

#### One Hot Representation

In [4]:
### one_hot hashes each word to an integer index in the range [1, voc_size - 1]
onehot_repr = [one_hot(words, voc_size) for words in sent]
print(onehot_repr)

# the - 8143 , glass - 9713 , of - 9755 , milk - 9432

[[8143, 9713, 9755, 9432], [8143, 9713, 9755, 244], [8143, 2678, 9755, 7746], [3660, 246, 6445, 8285, 4368], [3660, 246, 6445, 8285, 8173], [1470, 8143, 737, 9755, 5438], [3546, 8271, 8839, 8285]]
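The indices above come from hashing each word into a fixed range rather than from a stored vocabulary. The sketch below imitates that behaviour in pure Python under stated assumptions: it uses MD5 so the result is repeatable, whereas Keras `one_hot` uses Python's built-in `hash` by default, so the actual numbers will differ and may change between runs. The function name `hash_encode` is hypothetical.

```python
import hashlib

def hash_encode(text, n):
    """Hash each word into the range [1, n - 1]; index 0 is never produced."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode('utf-8')).hexdigest()
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

voc_size = 10000
first = hash_encode('the glass of milk', voc_size)
second = hash_encode('the glass of juice', voc_size)
print(first)
print(second)

# With a tiny index space, collisions are unavoidable:
# seven distinct words cannot fit into four available indices.
words = 'the glass of milk juice cup tea'
small = hash_encode(words, 5)
print(len(set(small)) < len(words.split()))
```

The same word always maps to the same index, which is why the first three entries of the two sentences match. A larger `voc_size` makes collisions rarer but also makes the embedding matrix larger.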


### Using Pad Sequences

Before passing the integer-encoded sentences to the Embedding layer, every sequence must have the same length so that the input forms a rectangular matrix.
Some sentences are shorter than others, so we pad the shorter ones with 0 until they match the length of the longest sentence.
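To make the padding step concrete, here is a minimal pure-Python sketch of "pre" padding. It is an illustration of the idea, not the Keras implementation; the helper name `pad_pre` is hypothetical, and it assumes that sequences longer than `maxlen` should keep their last `maxlen` items.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad every sequence with `value` so all have length `maxlen`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last maxlen items if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

short_and_long = [[8143, 9713, 9755, 9432], [3660, 246, 6445, 8285, 4368]]
result = pad_pre(short_and_long, maxlen=5)
print(result)
```

Padding at the front keeps the real words at the end of each row, which is the layout the `padding='pre'` option produces.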


In [9]:
max([len(sen.split(' ')) for sen in sent ])

5

In [10]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

### Find the maximum number of words in any sentence
# sent_length=8
sent_length = max(len(sen.split(' ')) for sen in sent)
embedded_docs = pad_sequences(onehot_repr, padding='pre', maxlen=sent_length)
print(embedded_docs)

[[   0 8143 9713 9755 9432]
 [   0 8143 9713 9755  244]
 [   0 8143 2678 9755 7746]
 [3660  246 6445 8285 4368]
 [3660  246 6445 8285 8173]
 [1470 8143  737 9755 5438]
 [   0 3546 8271 8839 8285]]


### Embedding Layer 

The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.

It is a flexible layer that can be used in a variety of ways, such as:

1. It can be used alone to learn a word embedding that can be saved and used in another model later.
2. It can be used as part of a deep learning model where the embedding is learned along with the model itself.
3. It can be used to load a pre-trained word embedding model, a type of transfer learning.

In [11]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.models import Sequential
import numpy as np

### Each word is represented by a vector of 10 features (the embedding dimension)
model=Sequential()
model.add(Embedding(voc_size,10,input_length=sent_length))
model.compile('adam','mse')

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 5, 10)             100000    
Total params: 100,000
Trainable params: 100,000
Non-trainable params: 0
_________________________________________________________________
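The parameter count in the summary follows from treating the layer as a lookup table with one row per word index: 10000 rows of 10 weights each. The sketch below imitates that table in pure Python so the shapes can be checked by hand. It is a conceptual illustration only, not how Keras stores its weights; the names `table` and `embed` are hypothetical, and the uniform range of -0.05 to 0.05 is an assumption about the default initializer.

```python
import random

voc_size, dim = 10000, 10
rng = random.Random(0)
# One row of `dim` small random weights per possible word index.
table = [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(voc_size)]

def embed(indices):
    """Look up the weight row for every index in a padded sentence."""
    return [table[i] for i in indices]

sentence_0 = [0, 8143, 9713, 9755, 9432]   # 0, the, glass, of, milk
sentence_1 = [0, 8143, 9713, 9755, 244]    # 0, the, glass, of, juice
vectors_0 = embed(sentence_0)
vectors_1 = embed(sentence_1)

print(len(vectors_0), len(vectors_0[0]))   # 5 10
print(voc_size * dim)                      # 100000
print(vectors_0[1] == vectors_1[1])        # True: same index, same vector
```

Because the lookup depends only on the index, a word that appears in several sentences always receives the same vector, and the padding index 0 receives a vector of its own as well.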


### Output of Embedding Layer -- Embedded Matrix

In [12]:
print(model.predict(embedded_docs))  # Matrix for all sentences 

[[[ 0.01849422  0.02723768  0.04391457  0.03099874 -0.04433843
   -0.01770258 -0.04808167 -0.00996188  0.02483911 -0.02960135]
  [-0.01706839  0.04825843  0.015813    0.04609114 -0.01626893
    0.03503921 -0.00550652 -0.00065093  0.02116298  0.01123443]
  [ 0.01693672  0.01889813  0.04780388  0.0183457  -0.03132993
    0.01287115  0.04079077 -0.02996605  0.01374892 -0.03480309]
  [-0.02500593 -0.01240395  0.04258927  0.02297049 -0.03576022
    0.01126082  0.03173603 -0.01215279  0.01983956 -0.01372912]
  [-0.03959063  0.01205043 -0.04634528  0.00621315  0.01087252
    0.03528966  0.00103041 -0.04187227  0.01751666  0.01359773]]

 [[ 0.01849422  0.02723768  0.04391457  0.03099874 -0.04433843
   -0.01770258 -0.04808167 -0.00996188  0.02483911 -0.02960135]
  [-0.01706839  0.04825843  0.015813    0.04609114 -0.01626893
    0.03503921 -0.00550652 -0.00065093  0.02116298  0.01123443]
  [ 0.01693672  0.01889813  0.04780388  0.0183457  -0.03132993
    0.01287115  0.04079077 -0.02996605  0.0137

In [13]:
embedded_docs[0]

array([   0, 8143, 9713, 9755, 9432])

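#### Every word in sentence 0 is now converted into a vector of dimension 10

The padded sequence for sentence 0 corresponds to the tokens: 0 (padding), the, glass, of, milk. Each of the five positions, including the padding index, gets its own 10-dimensional vector.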

In [14]:
print(model.predict(embedded_docs)[0])

[[ 0.01849422  0.02723768  0.04391457  0.03099874 -0.04433843 -0.01770258
  -0.04808167 -0.00996188  0.02483911 -0.02960135]
 [-0.01706839  0.04825843  0.015813    0.04609114 -0.01626893  0.03503921
  -0.00550652 -0.00065093  0.02116298  0.01123443]
 [ 0.01693672  0.01889813  0.04780388  0.0183457  -0.03132993  0.01287115
   0.04079077 -0.02996605  0.01374892 -0.03480309]
 [-0.02500593 -0.01240395  0.04258927  0.02297049 -0.03576022  0.01126082
   0.03173603 -0.01215279  0.01983956 -0.01372912]
 [-0.03959063  0.01205043 -0.04634528  0.00621315  0.01087252  0.03528966
   0.00103041 -0.04187227  0.01751666  0.01359773]]
