## Word Embedding

A word embedding is a technique that converts words into numeric vectors so that a machine learning model (such as an RNN or another neural network) can process them.

But not just any numbers: these vectors capture the meaning of the words!



Unlike "one-hot encoding" (where every word is a long vector of 0s with a single 1), word embeddings:

✅ Capture semantic meaning

✅ Understand similarity (e.g., "king" and "queen" are close)

✅ Can be pretrained (like Word2Vec, GloVe, or FastText)

✅ Can be learned during model training (in deep learning)
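To make the similarity point concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the 2-D "embedding" values are hand-picked for illustration (they are assumptions, not output from a trained model). One-hot vectors for different words always have a cosine similarity of 0, while dense vectors can place related words close together.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot vectors over a toy 3-word vocabulary.
one_hot_vecs = {
    "king":  [1, 0, 0],
    "queen": [0, 1, 0],
    "milk":  [0, 0, 1],
}

# Hypothetical 2-D embeddings, chosen by hand (not from a trained model).
dense_vecs = {
    "king":  [0.9, 0.8],
    "queen": [0.8, 0.9],
    "milk":  [-0.7, 0.1],
}

one_hot_sim = cosine(one_hot_vecs["king"], one_hot_vecs["queen"])
dense_sim = cosine(dense_vecs["king"], dense_vecs["queen"])
dense_far = cosine(dense_vecs["king"], dense_vecs["milk"])

print("one-hot king/queen:", one_hot_sim)
print("dense   king/queen:", round(dense_sim, 3))
print("dense   king/milk: ", round(dense_far, 3))
```

With one-hot vectors every pair of distinct words looks equally unrelated. With dense vectors the geometry can carry meaning.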

In [2]:
from tensorflow.keras.preprocessing.text import one_hot

In [3]:
### Sentences
sent=["The glass of milk",
      "the glass of juice",
      "the cup of tea",
      "I am a good boy",
      "I am a good developer",
      "Understand the meaning of words",
      "Your videos are good",]

In [38]:
## Define the vocabulary size (the range of integer indices words can be mapped to)
voc_size=10000

In [16]:
one_hot_repr=[]

In [17]:
## Encode each sentence as a list of integer word indices
for sentence in sent:
    embed=one_hot(sentence,voc_size)
    one_hot_repr.append(embed)
    

In [18]:
one_hot_repr

[[9749, 2017, 6965, 1199],
 [9749, 2017, 6965, 3223],
 [9749, 5679, 6965, 4849],
 [4520, 7722, 8584, 9280, 8589],
 [4520, 7722, 8584, 9280, 9530],
 [6986, 9749, 2873, 6965, 4317],
 [7108, 9468, 3896, 9280]]
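Note that the output above is a list of integer indices per sentence, not a set of 0/1 vectors. Despite its name, Keras's one_hot hashes each word into the range 1 to voc_size - 1, so two different words can occasionally collide on the same index. The sketch below imitates that idea in plain Python. The helper name hashed_indices and the use of MD5 are assumptions made for illustration: Keras uses its own tokenization and hash function, so the actual numbers will differ.

```python
import hashlib

def hashed_indices(text, n):
    """Map each lowercased, whitespace-split word to an index in [1, n - 1].

    Illustrative only: this is not Keras's implementation, so the indices
    will not match the notebook output.
    """
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is never produced, which leaves it free for padding.
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

a = hashed_indices("The glass of milk", 10000)
b = hashed_indices("the glass of juice", 10000)
print(a)
print(b)
```

As in the notebook output, the shared words ("the", "glass", "of") get the same index in both sentences, and only the last word differs.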

In [21]:
## Word embedding representation
## All sentences must be padded to the same length

from tensorflow.keras.layers import Embedding
from tensorflow.keras.utils import pad_sequences
from tensorflow.keras.models import Sequential
import numpy as np 

In [40]:
sent_length=8
embedded_docs=pad_sequences(one_hot_repr,padding="pre",maxlen=sent_length)
## Padding makes every sentence the same length by adding zeros
## before ("pre") or after ("post") the word indices.

In [41]:
embedded_docs

array([[   0,    0,    0,    0, 9749, 2017, 6965, 1199],
       [   0,    0,    0,    0, 9749, 2017, 6965, 3223],
       [   0,    0,    0,    0, 9749, 5679, 6965, 4849],
       [   0,    0,    0, 4520, 7722, 8584, 9280, 8589],
       [   0,    0,    0, 4520, 7722, 8584, 9280, 9530],
       [   0,    0,    0, 6986, 9749, 2873, 6965, 4317],
       [   0,    0,    0,    0, 7108, 9468, 3896, 9280]], dtype=int32)
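What pad_sequences does with padding="pre" can be sketched in a few lines of plain Python. The helper pad_pre is a hypothetical name used only for this illustration. It mirrors the default behaviour as I understand it: shorter sequences get zeros on the left, and longer ones lose items from the front.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen`.

    Sequences longer than `maxlen` keep only their last `maxlen` items.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop items from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

rows = pad_pre(
    [[9749, 2017, 6965, 1199], [4520, 7722, 8584, 9280, 8589]],
    maxlen=8,
)
for row in rows:
    print(row)
```

The two rows printed here match the first and fourth rows of embedded_docs above.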

In [39]:
## Feature representation
dim=10

In [44]:
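model=Sequential()
## The sequence length comes from model.build below; input_length is
## deprecated in recent Keras versions, so it is omitted here.
model.add(Embedding(voc_size,dim))
model.compile("adam","mse")
model.build(input_shape=(None,sent_length))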



In [45]:
model.summary()

In [46]:
model.predict(embedded_docs)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 164ms/step


array([[[ 0.0460086 , -0.02964122, -0.0456552 ,  0.04406799,
         -0.03412241, -0.00128101,  0.0064195 ,  0.03992717,
         -0.04468503, -0.03317754],
        [ 0.0460086 , -0.02964122, -0.0456552 ,  0.04406799,
         -0.03412241, -0.00128101,  0.0064195 ,  0.03992717,
         -0.04468503, -0.03317754],
        [ 0.0460086 , -0.02964122, -0.0456552 ,  0.04406799,
         -0.03412241, -0.00128101,  0.0064195 ,  0.03992717,
         -0.04468503, -0.03317754],
        [ 0.0460086 , -0.02964122, -0.0456552 ,  0.04406799,
         -0.03412241, -0.00128101,  0.0064195 ,  0.03992717,
         -0.04468503, -0.03317754],
        [-0.01562619,  0.02111569, -0.02959358, -0.02308006,
         -0.04576334, -0.00557416,  0.01598623, -0.04797106,
         -0.01160733, -0.02685769],
        [ 0.02978203,  0.02898136,  0.02160995,  0.04875701,
          0.03592726,  0.01998505,  0.02680739,  0.02279563,
         -0.03511987, -0.01837898],
        [ 0.04226874,  0.04118694,  0.01170339, -0.0

1. voc_size is the vocabulary size: the number of distinct integer indices available for words and tokens (10,000 here).
2. Each word is represented as a dense vector with dim dimensions (10 here).
3. sent_length is the fixed sentence length (8 here) that pad_sequences pads every sentence to.
4. Define the model using Sequential().
5. Add an Embedding layer, which maps each word index (0 to 9999, i.e. voc_size - 1) to a dim-dimensional dense vector.
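Conceptually, the Embedding layer is a lookup table with voc_size rows and dim columns (100,000 weights here). The NumPy sketch below imitates the lookup. The random table is a stand-in and an assumption, since the real layer chooses its own initial weights and then updates them during training.

```python
import numpy as np

voc_size, dim = 10000, 10
rng = np.random.default_rng(0)

# Stand-in for the layer's weight matrix: one row of `dim` floats per index.
# The random values are an assumption, not the weights Keras would create.
table = rng.uniform(-0.05, 0.05, size=(voc_size, dim))

padded = np.array([
    [0, 0, 0, 0, 9749, 2017, 6965, 1199],
    [0, 0, 0, 4520, 7722, 8584, 9280, 8589],
])

# Integer-array indexing looks up one table row per word index.
vectors = table[padded]
print(vectors.shape)  # (2, 8, 10): sentences x words x embedding size

# Every padding position (index 0) returns the same row of the table.
print(np.array_equal(vectors[0, 0], vectors[0, 3]))
```

This is why the first four vectors in the predict output above are identical: they all belong to the padding index 0.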