## **Word Embeddings Using the Embedding Layer in Keras**

In [12]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import one_hot

In [4]:
print(tf.__version__)

2.10.0


In [6]:
sentences = ['This program iterates through the list fishes and compares each element with its adjacent element.' ,
             'If the left-side element is greater than the right-side element, it removes the right-side element from the list.', 
             'It continues this process until it reaches the end of the list. Finally, it prints the modified list.']
print(sentences)

['This program iterates through the list fishes and compares each element with its adjacent element.', 'If the left-side element is greater than the right-side element, it removes the right-side element from the list.', 'It continues this process until it reaches the end of the list. Finally, it prints the modified list.']


In [7]:
vocabulary_size = 10000

## **One Hot Representation**

In [10]:
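# one_hot hashes every word of a sentence to an integer index (it does not return a one-hot vector)
one_hot_repr = [one_hot(sentence, vocabulary_size) for sentence in sentences]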
print(one_hot_repr)

[[2152, 43, 7595, 4083, 529, 6450, 5929, 6456, 4518, 8567, 5455, 4313, 6607, 5718, 5455], [3064, 529, 2839, 6112, 5455, 3852, 5949, 8432, 529, 8929, 6112, 5455, 8502, 5442, 529, 8929, 6112, 5455, 5053, 529, 6450], [8502, 454, 2152, 8513, 2816, 8502, 3041, 529, 2154, 766, 529, 6450, 2216, 8502, 7631, 529, 8629, 6450]]


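- Despite its name, `one_hot` does not return one-hot vectors. It splits each sentence into words and hashes every word to an integer index between 1 and `vocabulary_size`.
- For example, **the** maps to 529, so every occurrence of **the** in any sentence gets the same index. Because the mapping is a hash, two different words can occasionally share an index.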
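
The hashing idea described above can be sketched in plain Python. This is a minimal illustration, not the Keras implementation: the helper names `tokenize`, `hash_index`, and `hashed_sequence` are hypothetical, the tokenizer is only an approximation of how Keras splits text, and the MD5-based hash is chosen here for reproducibility, so the indices will not match the ones printed above.

```python
import hashlib
import re

def tokenize(text):
    # Approximation: lowercase, then keep runs of letters and digits,
    # so punctuation and hyphens act as word separators.
    return re.findall(r"[a-z0-9]+", text.lower())

def hash_index(word, n):
    # Fold a stable hash of the word into the range 1 .. n-1,
    # leaving index 0 free (it is used later as the padding value).
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def hashed_sequence(text, n):
    # One integer per word; the same word always gets the same integer.
    return [hash_index(word, n) for word in tokenize(text)]

seq = hashed_sequence("The list and the modified list.", 10000)
print(seq)
```

In the printed list, positions 0 and 3 (both **the**) hold the same index, and so do positions 1 and 5 (both **list**).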

- The sentences in our data have different lengths.
- To train a neural network on them, every input sequence must have the same length.
- That is why we pad the sequences, either at the start (pre-padding) or at the end (post-padding).

## **Word Embedding Representation**

In [15]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
import numpy as np

In [28]:
[len(sentence.split()) for sentence in sentences]

[15, 18, 18]

- `pad_sequences` pads every sentence that has fewer words than the maximum sentence length.
- It adds 0s either at the start (`padding='pre'`) or at the end (`padding='post'`).
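
To make the two padding modes concrete, here is a small pure-Python sketch of the idea. The function `pad_sequence` is a hypothetical stand-in written for illustration, not the Keras function. It assumes `maxlen >= 1` and, like the Keras default, drops items from the front when a sequence is too long.

```python
def pad_sequence(seq, maxlen, padding="pre", value=0):
    # Assumes maxlen >= 1. Keep only the last `maxlen` items if the
    # sequence is too long (truncation from the front).
    seq = list(seq)[-maxlen:]
    # Number of filler values needed to reach maxlen.
    fill = [value] * (maxlen - len(seq))
    # 'pre' puts the filler before the sequence, anything else after it.
    return fill + seq if padding == "pre" else seq + fill

pre_padded = pad_sequence([7, 8, 9], 5, padding="pre")
post_padded = pad_sequence([7, 8, 9], 5, padding="post")
truncated = pad_sequence([1, 2, 3, 4, 5, 6], 4)
print(pre_padded, post_padded, truncated)
```

Pre-padding keeps the real tokens at the end of the sequence, which is often preferred for recurrent models that read left to right; post-padding keeps them at the start.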

In [26]:
max_sentence_len = max([len(sentence.split()) for sentence in sentences])
embedded_sentences = pad_sequences(one_hot_repr, padding='pre', maxlen=max_sentence_len)
print(embedded_sentences)

[[   0    0    0 2152   43 7595 4083  529 6450 5929 6456 4518 8567 5455
  4313 6607 5718 5455]
 [6112 5455 3852 5949 8432  529 8929 6112 5455 8502 5442  529 8929 6112
  5455 5053  529 6450]
 [8502  454 2152 8513 2816 8502 3041  529 2154  766  529 6450 2216 8502
  7631  529 8629 6450]]


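- The first sentence has 15 words and the maximum length is 18, so three 0s are added at the start of the sentence.
- Note that the second row starts with 6112 rather than 3064. `one_hot` splits hyphenated words such as **left-side** into two tokens, so the second sentence has 21 indices, and `pad_sequences` drops the first three to fit `maxlen=18`.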
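
Counting whitespace-separated words can undercount what `one_hot` actually produces, because punctuation such as the hyphen in **left-side** splits one word into two tokens. The check below reuses the index lists printed earlier and measures their lengths directly; the variable names `token_lengths` and `max_token_len` are introduced here for illustration only.

```python
# Index lists copied from the one_hot output printed earlier.
one_hot_repr = [
    [2152, 43, 7595, 4083, 529, 6450, 5929, 6456, 4518, 8567, 5455, 4313,
     6607, 5718, 5455],
    [3064, 529, 2839, 6112, 5455, 3852, 5949, 8432, 529, 8929, 6112, 5455,
     8502, 5442, 529, 8929, 6112, 5455, 5053, 529, 6450],
    [8502, 454, 2152, 8513, 2816, 8502, 3041, 529, 2154, 766, 529, 6450,
     2216, 8502, 7631, 529, 8629, 6450],
]

# Length of each encoded sentence, measured on the tokens themselves.
token_lengths = [len(seq) for seq in one_hot_repr]
max_token_len = max(token_lengths)
print(token_lengths, max_token_len)  # [15, 21, 18] 21
```

Using a length computed this way as `maxlen` would keep every token of the longest sentence, at the cost of a wider padded array.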

In [29]:
embedded_sentences.shape

(3, 18)

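- The input size is now the same for every sentence, regardless of its original length.
- Next, we convert these integer indices into dense vectors using the Keras `Embedding` layer. Unlike Word2Vec, this layer is not pretrained: it starts from random weights that are learned during training.
- For the `Embedding` layer we have to define the feature dimension, that is, the size of the vector created for each word.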
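
Conceptually, an embedding layer is a lookup table with one row per index and one column per feature. The NumPy sketch below illustrates that idea under stated assumptions: the random matrix is a stand-in for weights that a real layer would learn, the uniform range is chosen only to resemble typical small initial values, and the token ids are a made-up mini-batch.

```python
import numpy as np

vocabulary_size = 10000
feature_dim = 10

# Stand-in for the layer's weights: one row of feature_dim values per index.
rng = np.random.default_rng(0)
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocabulary_size, feature_dim))

# A made-up batch of 2 padded sequences with 4 token ids each.
token_ids = np.array([[0, 0, 529, 6450],
                      [529, 43, 2152, 529]])

# Integer indexing replaces every id with its row of the matrix.
vectors = embedding_matrix[token_ids]
print(vectors.shape)          # (2, 4, 10)
print(embedding_matrix.size)  # 100000
```

The number of entries in the table, `vocabulary_size * feature_dim`, is the parameter count of such a layer.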

In [30]:
feature_dim = 10

In [32]:
model = Sequential()
model.add(Embedding(vocabulary_size, feature_dim, input_length = max_sentence_len))
model.compile('adam', 'mse')

In [33]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 18, 10)            100000    
                                                                 
Total params: 100,000
Trainable params: 100,000
Non-trainable params: 0
_________________________________________________________________


In [35]:
sent_1 = model.predict(embedded_sentences[0])



In [38]:
sent_1

array([[-0.04958859, -0.00029375,  0.0021411 , -0.03382953,  0.00518345,
        -0.02229652, -0.04717842, -0.015091  , -0.00550201, -0.04358589],
       [-0.04958859, -0.00029375,  0.0021411 , -0.03382953,  0.00518345,
        -0.02229652, -0.04717842, -0.015091  , -0.00550201, -0.04358589],
       [-0.04958859, -0.00029375,  0.0021411 , -0.03382953,  0.00518345,
        -0.02229652, -0.04717842, -0.015091  , -0.00550201, -0.04358589],
       [-0.0321614 , -0.04003553,  0.01037462, -0.04467133, -0.01149772,
         0.02548233,  0.01720269,  0.02731569, -0.0404562 ,  0.02892831],
       [ 0.04773477,  0.03213054, -0.00138054,  0.04010015,  0.00552629,
         0.00278773, -0.01416502,  0.03941151, -0.02107297,  0.04168974],
       [-0.03581516, -0.03247398, -0.00897322, -0.03929315,  0.01203708,
        -0.01428032, -0.04475442, -0.0170147 ,  0.02185266,  0.02517873],
       [-0.04949846,  0.03666887, -0.00311844, -0.01789719,  0.02648691,
        -0.01334971,  0.02618753, -0.04231563

In [39]:
embedded_sentences[0]

array([   0,    0,    0, 2152,   43, 7595, 4083,  529, 6450, 5929, 6456,
       4518, 8567, 5455, 4313, 6607, 5718, 5455])

- In `embedded_sentences[0]`, the first three entries are the padding index 0.
- Hence the first three vectors in the embedding layer's output are identical: the same index always maps to the same vector.
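
The same effect can be reproduced with a plain lookup table. In this sketch, `table` is a random stand-in for the layer's weights (an assumption, not the trained model above), and the ids are the first padded sentence shown earlier. It also shows the shape difference between a single 1-D sentence and a batch containing one sentence, which is the `(1, 18)` form that matches the model's `(None, 18)` input.

```python
import numpy as np

# Random stand-in for the embedding weights (10000 indices, 10 features).
rng = np.random.default_rng(1)
table = rng.uniform(-0.05, 0.05, size=(10000, 10))

# First padded sentence, copied from the output above.
first_sentence = np.array([0, 0, 0, 2152, 43, 7595, 4083, 529, 6450, 5929,
                           6456, 4518, 8567, 5455, 4313, 6607, 5718, 5455])

out = table[first_sentence]  # one vector per position, shape (18, 10)

# The three padding positions all look up row 0, so their vectors match.
same_padding = np.array_equal(out[0], out[1]) and np.array_equal(out[1], out[2])
# Positions 13 and 17 both hold index 5455, so their vectors match too.
same_word = np.array_equal(out[13], out[17])

# Adding a leading batch axis turns the sentence into a batch of one.
batch = first_sentence[np.newaxis, :]  # shape (1, 18)
batch_out = table[batch]               # shape (1, 18, 10)
print(same_padding, same_word, out.shape, batch_out.shape)
```

With the notebook's array, slicing as `embedded_sentences[:1]` instead of indexing with `[0]` keeps that batch axis.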