
# Word Embedding Techniques using Embedding Layer in Keras

An embedding layer maps each word index to a dense, trainable vector, so every word in the vocabulary gets its own learned feature representation.

In [None]:
import tensorflow as tf

In [None]:
### sentences
sent = ['the glass of milk',
        'the glass of juice',
        'the cup of tea',
        'I am a good boy',
        'I am a good developer',
        'understand the meaning of words',
        'your videos are good']

In [None]:
### Vocabulary size: every word is mapped to an integer index in the range [1, voc_size)
voc_size = 10000

onehot_repr = [tf.keras.preprocessing.text.one_hot(words, voc_size) for words in sent]
print(onehot_repr)


'''
One-Hot Encoding Process

Vocabulary size:
voc_size sets the range of integer indices that words can be mapped to, not the number of distinct words in the data. Here it is 10,000, so every index falls between 1 and 9,999.

What one_hot does:
Despite its name, the one_hot function from tf.keras.preprocessing.text does not return one-hot vectors. It takes each sentence from the sent list, lowercases it, splits it into words, and hashes every word to an integer in the range [1, voc_size).
The same word always gets the same integer within one run, but two different words can collide on the same integer, so the mapping is not guaranteed to be unique.

List comprehension:
The list comprehension applies one_hot to each sentence in sent and collects one list of integers per sentence.

Here is how the processing works step by step:

Step 1: Tokenization
Each sentence is lowercased and split into words.
The sentences above contain these distinct words (with "I" lowercased to "i"):
"the", "glass", "of", "milk", "juice", "cup", "tea", "i", "am", "a", "good", "boy", "developer", "understand", "meaning", "words", "your", "videos", "are"

Step 2: Hashing
Each word is hashed to an integer in [1, voc_size). No vocabulary is built or stored.

Step 3: Integer sequences
Each sentence becomes the list of its words' integers.
For illustration, with small made-up indices:
"the glass of milk" might be converted to [1, 2, 3, 4]
"the glass of juice" might be converted to [1, 2, 3, 5]
The first three integers match because the first three words are the same.
'''

[[8341, 9030, 6657, 7169], [8341, 9030, 6657, 4069], [8341, 7816, 6657, 2073], [6209, 5317, 6458, 4081, 6168], [6209, 5317, 6458, 4081, 6556], [6742, 8341, 4803, 6657, 8730], [2236, 9678, 8151, 4081]]


Each inner list in the output above, such as [8341, 9030, 6657, 7169] for the first sentence, is the integer encoding of one sentence in sent. Each number is an index (within a vocabulary of size 10,000) corresponding to a word in that sentence. The encoding comes from a hashing mechanism that converts words to integers; it does not produce actual one-hot vectors, and two different words can occasionally share an index.

These indices are what we feed to the embedding layer. They are computed directly from each word's hash rather than looked up in a stored dictionary.
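To make the hashing step concrete, here is a minimal pure-Python sketch of the idea. It is not the Keras implementation: the helper name `hash_encode`, the use of `hashlib.md5` (chosen so the result is reproducible across runs), and the simplified regex tokenizer are all assumptions made for illustration.

```python
import hashlib
import re

def hash_encode(sentence, vocab_size):
    """Map each word to an integer in [1, vocab_size - 1] by hashing.

    Illustrative stand-in only; the tokenizer and hash function are assumptions.
    """
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    encoded = []
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is never produced, so it stays free for padding.
        encoded.append(int(digest, 16) % (vocab_size - 1) + 1)
    return encoded

sentences = ["the glass of milk", "the glass of juice"]
encoded = [hash_encode(s, 10000) for s in sentences]
print(encoded)

# With a tiny index range, different words are forced to share indices.
tiny = hash_encode("the glass of milk juice cup tea", 4)
print(tiny)
```

The two sentences share their first three integers because they share their first three words. In the second call, seven distinct words are squeezed into only three possible indices, so collisions are unavoidable; a large `voc_size` makes them rare but does not rule them out.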

In [None]:
import numpy as np

In [None]:
sent_length=8
embedded_docs=tf.keras.preprocessing.sequence.pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

'''
This line pads the sequences in onehot_repr so that they all have the same length (sent_length):

onehot_repr is the list of integer-encoded sentences.
padding='pre' means the padding is added to the beginning of any sequence shorter than sent_length.
maxlen=sent_length makes every sequence exactly sent_length long; longer sequences are truncated (from the front by default).
The padding value is 0.

With sent_length = 8, every sentence here ends up exactly 8 elements long, with zeros filling the front.

This prepares the data for input into a neural network, which typically requires fixed-length input.
Example with sent_length = 5:
onehot_repr = [
    [3, 1, 6, 2],
    [6, 5, 7],
    [8, 4, 5, 7, 2]
]
then, embedded_docs = [
    [0, 3, 1, 6, 2],
    [0, 0, 6, 5, 7],
    [8, 4, 5, 7, 2]
]
'''

[[   0    0    0    0 8341 9030 6657 7169]
 [   0    0    0    0 8341 9030 6657 4069]
 [   0    0    0    0 8341 7816 6657 2073]
 [   0    0    0 6209 5317 6458 4081 6168]
 [   0    0    0 6209 5317 6458 4081 6556]
 [   0    0    0 6742 8341 4803 6657 8730]
 [   0    0    0    0 2236 9678 8151 4081]]

The embedding layer receives its input as a single batch tensor, so every sentence in the batch must have the same number of word indices. That is why we use pad_sequences before passing the data to the layer.
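The padding rule itself is simple enough to write by hand. The following is a minimal sketch of 'pre' padding with front truncation; `pad_pre` is a hypothetical helper written for illustration, not part of Keras, and it works on plain Python lists rather than arrays.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

example = [[3, 1, 6, 2], [6, 5, 7], [8, 4, 5, 7, 2, 9]]
result = pad_pre(example, maxlen=5)
for row in result:
    print(row)
# [0, 3, 1, 6, 2]
# [0, 0, 6, 5, 7]
# [4, 5, 7, 2, 9]
```

The third sequence has six items, so its first item is dropped to fit the length of five. Short sequences gain zeros at the front, which keeps the real words next to the end of the sequence.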

In [None]:
model=tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(voc_size,10,input_length=sent_length))

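'''
This line adds an embedding layer to the model, where:

voc_size is the size of the vocabulary (10,000 here), i.e. the number of rows in the embedding matrix.
10 is the dimension of the dense embedding vectors (each word is represented as a 10-dimensional vector).
input_length=sent_length specifies the length of the input sequences (i.e., how many word indices per input sentence). Newer Keras releases may warn that this argument is deprecated, since the length can be inferred from the input.

This layer converts the integer word indices produced by one_hot into dense vectors of fixed size (10 in this case).
For a batch of padded sentences, the output has shape (number of sentences, sent_length, 10).
'''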

model.compile('adam','mse')

model.summary()
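Conceptually, an embedding layer is a lookup table: index i selects row i of a weight matrix, which gives the same result as multiplying a one-hot vector by that matrix. The sketch below shows this in pure Python under stated assumptions: the tiny sizes, the random initial values, and the helper names `embed` and `one_hot_times_table` are all hypothetical and chosen for readability, not taken from Keras.

```python
import random

random.seed(0)
vocab_size, dim = 6, 3  # tiny hypothetical sizes for readability

# The "embedding matrix": one row of `dim` weights per vocabulary index.
table = [[random.uniform(-0.05, 0.05) for _ in range(dim)]
         for _ in range(vocab_size)]

def embed(batch):
    """Replace every integer index with its row of the table."""
    return [[table[idx] for idx in seq] for seq in batch]

def one_hot_times_table(idx):
    """Same result via an explicit one-hot vector and a matrix product."""
    one_hot = [1.0 if i == idx else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[i] * table[i][j] for i in range(vocab_size))
            for j in range(dim)]

batch = [[0, 0, 2, 5], [0, 1, 3, 4]]  # two padded "sentences" of length 4
out = embed(batch)
shape = (len(out), len(out[0]), len(out[0][0]))
print(shape)  # (2, 4, 3)
```

The direct lookup avoids ever building the one-hot vectors, which is why integer indices are enough as input. In a real model the rows of the table are trainable weights that are updated during training; here they stay at their random initial values.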



In [None]:
print(model.predict(embedded_docs))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 215ms/step
[[[-2.2126103e-02  4.5900952e-02  1.1782099e-02  1.2308203e-02
   -6.6900142e-03  3.2391716e-02 -3.2273054e-02 -1.6047001e-02
   -5.6654923e-03  2.1574821e-02]
  [-2.2126103e-02  4.5900952e-02  1.1782099e-02  1.2308203e-02
   -6.6900142e-03  3.2391716e-02 -3.2273054e-02 -1.6047001e-02
   -5.6654923e-03  2.1574821e-02]
  [-2.2126103e-02  4.5900952e-02  1.1782099e-02  1.2308203e-02
   -6.6900142e-03  3.2391716e-02 -3.2273054e-02 -1.6047001e-02
   -5.6654923e-03  2.1574821e-02]
  [-2.2126103e-02  4.5900952e-02  1.1782099e-02  1.2308203e-02
   -6.6900142e-03  3.2391716e-02 -3.2273054e-02 -1.6047001e-02
   -5.6654923e-03  2.1574821e-02]
  [ 1.9436333e-02  4.3055002e-02  3.7182570e-03 -1.4660060e-02
    4.1585598e-02  2.2574726e-02  4.4214595e-02  3.7256096e-02
   -4.1778814e-02  3.2539282e-02]
  [ 4.7121000e-02 -3.6291622e-02  6.0387477e-03  3.3755932e-02
   -3.1792477e-02  4.0713739e-02  4.4197328e-03 -1.8629491e-02
 

In [None]:
embedded_docs[0]

array([   0,    0,    0,    0, 8341, 9030, 6657, 7169], dtype=int32)
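The row above starts with four padding zeros, so an embedding lookup returns the same vector at those four positions, and a different vector for each of the four real words. The sketch below reproduces that pattern with a hypothetical stand-in for the layer's weights; the random values are an assumption and are not the model's actual weights.

```python
import random

random.seed(1)
dim = 10
row = [0, 0, 0, 0, 8341, 9030, 6657, 7169]  # embedded_docs[0]

# Hypothetical stand-in for the layer's weights: one vector per index in use.
table = {idx: [random.uniform(-0.05, 0.05) for _ in range(dim)]
         for idx in sorted(set(row))}

vectors = [table[idx] for idx in row]

# Positions that hold the same index receive the same vector.
same_as_first = [pos for pos, vec in enumerate(vectors) if vec == vectors[0]]
print(same_as_first)  # [0, 1, 2, 3]
```

This is why repeated rows at the start of an embedding output are expected for pre-padded input. Keras's Embedding layer also has a `mask_zero` argument for telling downstream layers to ignore index 0; it is not used in this notebook.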

In [None]:
print(model.predict(embedded_docs)[0])

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 42ms/step
[[-0.0221261   0.04590095  0.0117821   0.0123082  -0.00669001  0.03239172
  -0.03227305 -0.016047   -0.00566549  0.02157482]
 [-0.0221261   0.04590095  0.0117821   0.0123082  -0.00669001  0.03239172
  -0.03227305 -0.016047   -0.00566549  0.02157482]
 [-0.0221261   0.04590095  0.0117821   0.0123082  -0.00669001  0.03239172
  -0.03227305 -0.016047   -0.00566549  0.02157482]
 [-0.0221261   0.04590095  0.0117821   0.0123082  -0.00669001  0.03239172
  -0.03227305 -0.016047   -0.00566549  0.02157482]
 [ 0.01943633  0.043055    0.00371826 -0.01466006  0.0415856   0.02257473
   0.0442146   0.0372561  -0.04177881  0.03253928]
 [ 0.047121   -0.03629162  0.00603875  0.03375593 -0.03179248  0.04071374
   0.00441973 -0.01862949  0.03250908  0.00180762]
 [ 0.01062462 -0.04493208 -0.02328709 -0.03729993  0.02037457  0.02284164
   0.04222066  0.03588844  0.04060311 -0.02541548]
 [ 0.04528215 -0.01976923 -0.01649962  0.03962335 -0.