# Word Embedding using Embedding Layer in Keras

## Steps
1. Sentences
2. One-hot representation: each word is mapped to an integer index in the vocabulary
3. One-hot representation -> Keras Embedding layer, which builds an embedding matrix
4. Embedding matrix <br>
voc_size = 10000, dimension = 10 <br>
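
To make steps 2 to 4 concrete, here is a minimal plain-Python sketch (not Keras code) of why an integer index is enough: multiplying a one-hot vector by an embedding matrix simply selects one row of that matrix. The toy sizes, the matrix values, and the helper names `one_hot_vector` and `vector_times_matrix` are assumptions made up for this illustration.

```python
# Toy sizes chosen only for illustration (the notebook uses voc_size = 10000, dim = 10).
vocab_size, dim = 5, 3

# A hypothetical embedding matrix: one row of `dim` features per vocabulary index.
embedding_matrix = [[float(10 * row + col) for col in range(dim)]
                    for row in range(vocab_size)]

def one_hot_vector(index, size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_times_matrix(vec, matrix):
    """Multiply a row vector by a matrix using plain dot products."""
    return [sum(v * row[col] for v, row in zip(vec, matrix))
            for col in range(len(matrix[0]))]

word_index = 3
via_matmul = vector_times_matrix(one_hot_vector(word_index, vocab_size), embedding_matrix)
via_lookup = embedding_matrix[word_index]

print(via_matmul)                # [30.0, 31.0, 32.0]
print(via_matmul == via_lookup)  # True
```

This is why the sentences only need to be converted to indices: the lookup gives the same result as the full one-hot multiplication without building 10000-element vectors.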

In [1]:
from tensorflow.keras.preprocessing.text import one_hot

In [2]:
#  Sentences
sent = ['the glass of mikl',
'the glass of juice',
'the cup of tea',
'I am a good boy',
'I am a good developer',
'understand the meaning of words',
'your videos are good']

In [3]:
sent

['the glass of mikl',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'your videos are good']

In [4]:
# Vocabulary Size
voc_size = 10000

In [5]:
# Map every word of every sentence to an integer index in the range [1, voc_size)
onehot_repr = [one_hot(words, voc_size) for words in sent]
print(onehot_repr)

[[1143, 3621, 1585, 7849], [1143, 3621, 1585, 2573], [1143, 3113, 1585, 4858], [2856, 8249, 4003, 5090, 5480], [2856, 8249, 4003, 5090, 6130], [4587, 1143, 276, 1585, 7207], [9327, 8688, 3123, 5090]]
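
Despite its name, `one_hot` returns integer indices rather than one-hot vectors: as far as I know, it lowercases and splits the text, then hashes each word into the range 1 to `voc_size - 1`, so two different words can collide on the same index. The sketch below imitates that idea with a stable MD5 hash. The function name `hashed_indices` is hypothetical, and the indices it produces will not match the notebook output above, because the default hash used by Keras is different (and may vary between Python sessions).

```python
import hashlib

def hashed_indices(text, n):
    """Map each word of `text` to an integer in [1, n - 1] using a stable hash."""
    words = text.lower().split()
    return [int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
            for w in words]

voc_size = 10000
sent = ['the glass of mikl', 'the glass of juice']
encoded = [hashed_indices(s, voc_size) for s in sent]
print(encoded)

# The shared prefix "the glass of" gets the same indices in both sentences.
print(encoded[0][:3] == encoded[1][:3])  # True
```

The same property is visible in the notebook output: `the`, `glass` and `of` map to 1143, 3621 and 1585 wherever they appear.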


## Word Embedding Representation

The embedding layer expects every input sequence to have the same length, but our sentences contain different numbers of words.<br>
To bring them all to one fixed length we use <b>pad_sequences</b>.

In [7]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
import numpy as np

In [8]:
sent_length = 8
embedded_docs = pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[   0    0    0    0 1143 3621 1585 7849]
 [   0    0    0    0 1143 3621 1585 2573]
 [   0    0    0    0 1143 3113 1585 4858]
 [   0    0    0 2856 8249 4003 5090 5480]
 [   0    0    0 2856 8249 4003 5090 6130]
 [   0    0    0 4587 1143  276 1585 7207]
 [   0    0    0    0 9327 8688 3123 5090]]


In <b>pad_sequences</b>:<br>
1. maxlen: the length every sequence should have after padding. Sequences longer than maxlen are truncated.
2. padding: where the padding is added, 'pre' (before the words) or 'post' (after the words). Some of our sentences have 4 words and some have 5, so zeros are added until every sentence has maxlen entries.
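
The behaviour described above can be sketched in plain Python. This is not the Keras implementation, only a small stand-in named `pad` (a hypothetical helper) that assumes the same defaults as `pad_sequences`: pad with 0, and pad or truncate at the front unless told otherwise.

```python
def pad(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Pad (or truncate) every sequence to exactly `maxlen` entries."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops entries from the front, 'post' drops them from the end
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded

first_sentence = [1143, 3621, 1585, 7849]  # taken from the one_hot output above

print(pad([first_sentence], maxlen=8))
# [[0, 0, 0, 0, 1143, 3621, 1585, 7849]]
print(pad([first_sentence], maxlen=8, padding="post"))
# [[1143, 3621, 1585, 7849, 0, 0, 0, 0]]
print(pad([[1, 2, 3, 4, 5]], maxlen=3))
# [[3, 4, 5]]
```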

In [9]:
dim = 10  # embedding dimension, i.e. the number of features learned per word

In [14]:
model = Sequential()
# The Embedding layer maps each integer index to a dense, trainable vector of `dim` features.
# Note: newer Keras releases may warn that `input_length` is deprecated; it is kept to match the output below.
model.add(Embedding(voc_size, dim, input_length=sent_length))
model.compile('adam','mse')

In [15]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 8, 10)             100000    
Total params: 100,000
Trainable params: 100,000
Non-trainable params: 0
_________________________________________________________________
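
The numbers in the summary can be checked by hand. The short sketch below assumes the embedding layer stores exactly one vector of `dim` weights per vocabulary index and has no bias term; `num_sentences` is a name introduced here for the 7 example sentences. The `None` in the output shape is the batch dimension, which becomes 7 when all sentences are passed at once.

```python
voc_size = 10000   # vocabulary size used above
dim = 10           # features per word
sent_length = 8    # padded sentence length
num_sentences = 7  # sentences in `sent`

# One row of `dim` weights per vocabulary index
embedding_params = voc_size * dim
print(embedding_params)  # 100000

# Every padded sentence becomes a (sent_length, dim) matrix
output_shape = (num_sentences, sent_length, dim)
print(output_shape)  # (7, 8, 10)
```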


In [16]:
print(model.predict(embedded_docs))

[[[ 0.03080073  0.00323174 -0.02918334  0.00377128 -0.04070873
    0.00661297 -0.01305697  0.03724326  0.02602698 -0.03102778]
  [ 0.03080073  0.00323174 -0.02918334  0.00377128 -0.04070873
    0.00661297 -0.01305697  0.03724326  0.02602698 -0.03102778]
  [ 0.03080073  0.00323174 -0.02918334  0.00377128 -0.04070873
    0.00661297 -0.01305697  0.03724326  0.02602698 -0.03102778]
  [ 0.03080073  0.00323174 -0.02918334  0.00377128 -0.04070873
    0.00661297 -0.01305697  0.03724326  0.02602698 -0.03102778]
  [ 0.01040239 -0.02902021 -0.04402343  0.02820868 -0.01183882
    0.0257639  -0.0473035   0.03282339  0.03609082 -0.02451554]
  [ 0.01546143  0.04945992  0.04310026  0.01056584 -0.04892234
   -0.01590864 -0.028433   -0.02458343 -0.00854567  0.00221869]
  [-0.00821316 -0.04623196  0.01155024  0.01278696  0.00680412
   -0.04244255 -0.01131557  0.03568219  0.04489153  0.04329354]
  [ 0.03797194 -0.03704823  0.00010245  0.00759044 -0.00157211
   -0.03532404 -0.03511066  0.03577269  0.014524

In [17]:
embedded_docs[0]

array([   0,    0,    0,    0, 1143, 3621, 1585, 7849])

In [19]:
print(model.predict(embedded_docs)[0]) 

[[ 0.03080073  0.00323174 -0.02918334  0.00377128 -0.04070873  0.00661297
  -0.01305697  0.03724326  0.02602698 -0.03102778]
 [ 0.03080073  0.00323174 -0.02918334  0.00377128 -0.04070873  0.00661297
  -0.01305697  0.03724326  0.02602698 -0.03102778]
 [ 0.03080073  0.00323174 -0.02918334  0.00377128 -0.04070873  0.00661297
  -0.01305697  0.03724326  0.02602698 -0.03102778]
 [ 0.03080073  0.00323174 -0.02918334  0.00377128 -0.04070873  0.00661297
  -0.01305697  0.03724326  0.02602698 -0.03102778]
 [ 0.01040239 -0.02902021 -0.04402343  0.02820868 -0.01183882  0.0257639
  -0.0473035   0.03282339  0.03609082 -0.02451554]
 [ 0.01546143  0.04945992  0.04310026  0.01056584 -0.04892234 -0.01590864
  -0.028433   -0.02458343 -0.00854567  0.00221869]
 [-0.00821316 -0.04623196  0.01155024  0.01278696  0.00680412 -0.04244255
  -0.01131557  0.03568219  0.04489153  0.04329354]
 [ 0.03797194 -0.03704823  0.00010245  0.00759044 -0.00157211 -0.03532404
  -0.03511066  0.03577269  0.01452455 -0.00815407]]
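
In the output above, the first four rows are identical because the first four positions of `embedded_docs[0]` all hold the padding index 0, and every occurrence of an index is mapped to the same vector. Since the model has not been trained, these vectors are just the layer's random initial weights. The plain-Python sketch below mimics that lookup. The `weights` table is a stand-in filled with small uniform random values (the range -0.05 to 0.05 and the seed are assumptions), so its numbers will not match the notebook output.

```python
import random

random.seed(0)  # arbitrary seed so the sketch is repeatable
voc_size, dim = 10000, 10

# Stand-in for the untrained layer's weights: one random vector per index
weights = [[random.uniform(-0.05, 0.05) for _ in range(dim)]
           for _ in range(voc_size)]

padded_sentence = [0, 0, 0, 0, 1143, 3621, 1585, 7849]  # embedded_docs[0]

# The "prediction" for a sentence is a row lookup for each index
vectors = [weights[i] for i in padded_sentence]

print(len(vectors), len(vectors[0]))                         # 8 10
print(vectors[0] == vectors[1] == vectors[2] == vectors[3])  # True
print(vectors[3] == vectors[4])                              # False
```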
