
#what is word embedding?
Word embedding is the representation of words as dense numeric vectors, arranged so that words with similar meanings end up with similar vectors. The similarity between two words can then be measured, for example, with the cosine similarity of their vectors. Unlike sparse one-hot encodings, embeddings preserve the semantic similarity of words, and a downstream model can use the learned features to focus on the words that matter most in a sentence.
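
As a small illustration of the cosine similarity mentioned above, here is a minimal sketch in plain Python. The three word vectors are made-up values chosen only to show the idea; they do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, made up for illustration only.
milk = [0.9, 0.1, 0.3]
juice = [0.8, 0.2, 0.4]
developer = [-0.2, 0.9, -0.5]

print(cosine_similarity(milk, juice))      # close to 1: similar direction
print(cosine_similarity(milk, developer))  # negative: different direction
```

A value near 1 means the two vectors point in almost the same direction, which is how "similar" words are expected to look once an embedding has been trained.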

#steps and parameters for building a word embedding model
1. Build a word dictionary, or fix a vocabulary size, so that every word in the corpus maps to an integer.
2. Encode each word with Keras `one_hot`, which, despite its name, returns one integer index per word rather than a one-hot vector.
3. Collect the index values for every sentence in the corpus.
4. Pass the index sequences (padded to equal length) through the embedding layer.
5. The embedding layer maps each index to a dense vector whose values are learned during training.
6. Choose the number of dimensions used to represent each word.
7. The embedding therefore gives a feature representation of each word in the corpus.
8. The dimensions are the number of features Keras stores for each word vector.
9. The vocabulary size is the range of integer indexes that words are mapped into; it is also the number of rows in the embedding matrix.
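
The steps above can be sketched without Keras to show what an embedding layer does at its core: a table lookup. This is only a sketch under assumptions: it uses an explicit word dictionary instead of Keras hashing, the two-sentence `corpus` and `dim = 4` are picked for illustration, and the matrix is filled with random numbers, where a real layer would learn the values during training.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

corpus = ["the glass of milk", "the cup of tea"]

# Build a word -> index dictionary; index 0 is reserved for padding.
vocab = {}
for sentence in corpus:
    for word in sentence.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1

vocab_size = len(vocab) + 1   # +1 for the padding index 0
dim = 4                       # number of features per word (hypothetical)

# The embedding matrix has one row of `dim` values per index.
embedding_matrix = [[random.uniform(-0.05, 0.05) for _ in range(dim)]
                    for _ in range(vocab_size)]

# Turn a sentence into indexes, then look up one row per index.
indexes = [vocab[w] for w in "the cup of milk".split()]
vectors = [embedding_matrix[i] for i in indexes]

print(indexes)                        # [1, 5, 3, 4]
print(len(vectors), len(vectors[0]))  # 4 4
```

Each word becomes one row of the matrix, so a sentence of 4 words becomes 4 vectors of `dim` features each.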

In [0]:
from tensorflow.keras.preprocessing.text import one_hot

In [0]:
#SENTENCES

In [0]:
sent= ["the glass of milk", 
"the glass of juice",
"the cup of tea",
"I am a good boy",
"I am a good developer",
"understand the meaning of words",
"your videos are good"]

In [3]:
sent

['the glass of milk',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'your videos are good']

In [0]:
#vocabulary size or the size of the dictionary
voc_size= 10000

In [0]:
#one hot representation

In [6]:
#one_hot maps each word of a sentence to an integer index below voc_size; the same word always gets the same index
onehot_repr= [one_hot(words,voc_size) for words in sent]
print(onehot_repr)

[[7371, 713, 3825, 87], [7371, 713, 3825, 3482], [7371, 1056, 3825, 3630], [7440, 3361, 5867, 5639, 1672], [7440, 3361, 5867, 5639, 619], [4604, 7371, 3181, 3825, 4163], [6462, 1416, 699, 5639]]
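
Note that the output above holds integer indexes, not one-hot vectors, and that a repeated word such as "the" receives the same index (7371) in every sentence. The idea can be sketched with a stable hash. The helper `hashed_indexes` below is hypothetical and uses MD5 only to make the result repeatable; it is an illustration of hashing words into a fixed range, not the Keras implementation, so its numbers will not match the output above.

```python
import hashlib

def hashed_indexes(text, n):
    """Map each word to an integer in [1, n - 1] with a stable hash.

    Illustrative stand-in for the idea behind one_hot, not the Keras code.
    """
    words = text.lower().split()
    return [int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
            for w in words]

a = hashed_indexes("the glass of milk", 10000)
b = hashed_indexes("the glass of juice", 10000)
print(a)
print(b)
print(a[:3] == b[:3])  # True: the shared words get the same indexes
```

Because the indexes come from a hash, two different words can occasionally collide on the same index; a vocabulary size much larger than the number of distinct words makes that less likely.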


In [0]:
#word embedding representation: build the embedding matrix with the desired number of dimensions/features

In [0]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential 

In [0]:
import numpy as np

In [0]:
#padding adds 0s before ("pre") or after ("post") each sequence so that all sentences have equal length

In [12]:
sent_length= 8
embedded_docs= pad_sequences(onehot_repr, padding="pre", maxlen=sent_length)
print(embedded_docs)

[[   0    0    0    0 7371  713 3825   87]
 [   0    0    0    0 7371  713 3825 3482]
 [   0    0    0    0 7371 1056 3825 3630]
 [   0    0    0 7440 3361 5867 5639 1672]
 [   0    0    0 7440 3361 5867 5639  619]
 [   0    0    0 4604 7371 3181 3825 4163]
 [   0    0    0    0 6462 1416  699 5639]]
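
To make the padding step concrete, here is a minimal plain-Python sketch of "pre" padding. The function `pad_pre` is a hypothetical helper written for this illustration; it assumes `maxlen` is positive and that over-long sequences should keep their last `maxlen` items.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; keep the last `maxlen` items if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]              # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

rows = pad_pre([[7371, 713, 3825, 87], [7440, 3361, 5867, 5639, 1672]], maxlen=8)
for row in rows:
    print(row)
# [0, 0, 0, 0, 7371, 713, 3825, 87]
# [0, 0, 0, 7440, 3361, 5867, 5639, 1672]
```

The two rows match the first and fourth rows of the padded output above.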


In [0]:
#creating the embedding layer by first specifying the dimensions

In [0]:
dim=10


In [0]:
model= Sequential()
model.add(Embedding(voc_size, dim, input_length=sent_length))
model.compile("adam", "mse")

In [17]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 8, 10)             100000    
Total params: 100,000
Trainable params: 100,000
Non-trainable params: 0
_________________________________________________________________
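
The numbers in the summary can be checked by hand: the embedding layer stores one row per possible index and one column per feature, so its parameter count is the vocabulary size times the embedding dimension. A short arithmetic sketch, reusing the values from the cells above:

```python
voc_size = 10000   # rows: one per possible word index
dim = 10           # columns: features per word
sent_length = 8    # padded sentence length

embedding_params = voc_size * dim
output_shape = (None, sent_length, dim)   # None stands for the batch size

print(embedding_params)   # 100000
print(output_shape)       # (None, 8, 10)
```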


In [0]:
#to inspect the embedded docs, run the untrained model; the values shown are the layer's random initial weights

In [19]:
print(model.predict(embedded_docs))

[[[-0.0033476  -0.0451025   0.00142052  0.04966414 -0.03840326
   -0.00164687 -0.00499056 -0.04337609  0.02201051  0.01279452]
  [-0.0033476  -0.0451025   0.00142052  0.04966414 -0.03840326
   -0.00164687 -0.00499056 -0.04337609  0.02201051  0.01279452]
  [-0.0033476  -0.0451025   0.00142052  0.04966414 -0.03840326
   -0.00164687 -0.00499056 -0.04337609  0.02201051  0.01279452]
  [-0.0033476  -0.0451025   0.00142052  0.04966414 -0.03840326
   -0.00164687 -0.00499056 -0.04337609  0.02201051  0.01279452]
  [ 0.00010144 -0.02305658 -0.00533779  0.02455579  0.01535118
    0.03841901 -0.02659302  0.01966382 -0.02032864 -0.03127053]
  [-0.02546822 -0.00756079 -0.01709424  0.02064774 -0.02138015
   -0.01184344  0.04587782  0.03234509  0.02728191  0.04127885]
  [ 0.01703197 -0.00676344  0.01575842  0.02993752  0.01173741
    0.02881073 -0.00572356 -0.03846942  0.01307753 -0.0162832 ]
  [-0.04157982  0.02239454  0.02086986 -0.01517365  0.00215524
   -0.02604993  0.00931741 -0.01557238 -0.0498278  -0.0324008 ]]
 ...

In [22]:
embedded_docs[0]

array([   0,    0,    0,    0, 7371,  713, 3825,   87], dtype=int32)

In [0]:
#each index in the padded sentence gets converted into a 10-dimensional vector

In [23]:
print(model.predict(embedded_docs)[0])

[[-0.0033476  -0.0451025   0.00142052  0.04966414 -0.03840326 -0.00164687
  -0.00499056 -0.04337609  0.02201051  0.01279452]
 [-0.0033476  -0.0451025   0.00142052  0.04966414 -0.03840326 -0.00164687
  -0.00499056 -0.04337609  0.02201051  0.01279452]
 [-0.0033476  -0.0451025   0.00142052  0.04966414 -0.03840326 -0.00164687
  -0.00499056 -0.04337609  0.02201051  0.01279452]
 [-0.0033476  -0.0451025   0.00142052  0.04966414 -0.03840326 -0.00164687
  -0.00499056 -0.04337609  0.02201051  0.01279452]
 [ 0.00010144 -0.02305658 -0.00533779  0.02455579  0.01535118  0.03841901
  -0.02659302  0.01966382 -0.02032864 -0.03127053]
 [-0.02546822 -0.00756079 -0.01709424  0.02064774 -0.02138015 -0.01184344
   0.04587782  0.03234509  0.02728191  0.04127885]
 [ 0.01703197 -0.00676344  0.01575842  0.02993752  0.01173741  0.02881073
  -0.00572356 -0.03846942  0.01307753 -0.0162832 ]
 [-0.04157982  0.02239454  0.02086986 -0.01517365  0.00215524 -0.02604993
   0.00931741 -0.01557238 -0.0498278  -0.0324008 ]]

In [0]:
#comparing with embedded_docs[0]: each of the 8 indexes has been converted into a 10-dimensional vector, and the four padding 0s all share the same vector
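
The repeated rows in the output follow from how an embedding lookup works: an index selects a row of the embedding matrix, so equal indexes (here the padding 0s) always give equal vectors. The sketch below uses a tiny made-up matrix `E` to show that the lookup is equivalent to multiplying a one-hot vector by the matrix. All names and values are hypothetical and chosen for illustration.

```python
# Tiny hypothetical embedding matrix: 4 indexes, 3 features each.
E = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9],
     [1.0, 1.1, 1.2]]

def one_hot_vector(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, mat):
    """Multiply a row vector by a matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]

padded_sentence = [0, 0, 2, 3]   # two padding positions, then two word indexes

by_lookup = [E[i] for i in padded_sentence]
by_matmul = [vec_mat(one_hot_vector(i, len(E)), E) for i in padded_sentence]

print(by_lookup == by_matmul)        # True: lookup equals one-hot times matrix
print(by_lookup[0] == by_lookup[1])  # True: both padding 0s share one vector
```

The lookup form is what makes embeddings cheap: no large one-hot vector is ever built, only a row is read.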