# Word Embeddings



*   Several methods already exist for converting words into numerical values or vectors, e.g. BOW (Bag of Words) and TF-IDF.
*   However, these methods have disadvantages, most notably that they carry no semantic information.
*   A method called **one hot encoding** was also introduced.

**One hot encoding**

*   Consider a sentence s1: "I like eating apples".
*   And a sentence s2: "I like eating mangoes".
*   Marking each word position as present, both sentences are represented by the same vector, [1, 1, 1, 1].
*   Comparing the two sentences position by position to measure their similarity gives [1, 1, 1, 0], since only the last word differs (based on index).
*   Across many sentences or a huge corpus, many such near-identical vectors are obtained, so the semantics are lost: the encoding records that "apples" and "mangoes" differ, but not how closely related they are.
*   This is where word embeddings come into the picture.
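
To make the limitation concrete, the sketch below builds one-hot vectors over the combined vocabulary of s1 and s2 in plain Python. The names `vocab`, `one_hot_vector` and `dot` are illustrative helpers, not part of any library. Every pair of distinct words has a dot product of 0, so "apples" is no closer to "mangoes" than it is to "like".

```python
# Vocabulary built from both example sentences (sorted for a stable order).
s1 = "I like eating apples"
s2 = "I like eating mangoes"
vocab = sorted(set(s1.lower().split()) | set(s2.lower().split()))
index = {word: i for i, word in enumerate(vocab)}

def one_hot_vector(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

apples = one_hot_vector("apples")
mangoes = one_hot_vector("mangoes")
like = one_hot_vector("like")

print(vocab)
print(dot(apples, mangoes))  # 0: distinct words never overlap
print(dot(apples, like))     # 0: an unrelated word scores exactly the same
print(dot(apples, apples))   # 1: a word only matches itself
```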

**Word embeddings**


*   Word embeddings can also be described as a feature-based representation.
*   For a huge corpus with, say, a vocabulary of 10,000 words, a set of 300 features can be chosen and every word represented by a vector of dimension 300.
*   With one-hot encoding, each vector would instead have dimension 10,000, one entry per vocabulary word.
*   In the previous example, "apples" and "mangoes" would both score highly on a feature such as fruit while differing on other features, so their vectors are close but not identical, which preserves the semantics.
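
A toy version of this idea is sketched below with three hand-picked features. Both the feature names and the values are invented for illustration; real embeddings are learned from data and their dimensions are usually not interpretable. Cosine similarity is used here as one common way to compare such vectors.

```python
import math

# Hypothetical 3-feature vectors: [is_fruit, is_food_related, is_action].
# The values are made up for illustration, not learned from data.
toy_embeddings = {
    "apples":  [0.9, 0.8, 0.0],
    "mangoes": [0.8, 0.9, 0.1],
    "eating":  [0.0, 0.6, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

fruit_pair = cosine_similarity(toy_embeddings["apples"], toy_embeddings["mangoes"])
mixed_pair = cosine_similarity(toy_embeddings["apples"], toy_embeddings["eating"])

print(f"apples vs mangoes: {fruit_pair:.3f}")
print(f"apples vs eating:  {mixed_pair:.3f}")
```

Unlike the one-hot case, the two fruits now score much closer to each other than a fruit and a verb do.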


Word embeddings provide a dense representation of words and their relative meanings. They are an improvement over the sparse representations used in simpler bag-of-words models. Word embeddings can be learned from text data and reused across projects. They can also be learned as part of fitting a neural network on text data.










**Implementation of word embeddings**

Keras provides an `Embedding` layer out of the box. Before it can be used, the sentences are tokenized, each word is converted to an integer index with `one_hot`, and the embedding dimension (the number of features per word) is defined.

**Implementation**

In [1]:
import tensorflow as tf

In [2]:
tf.__version__

'2.7.0'

In [3]:
from tensorflow.keras.preprocessing.text import one_hot

In [4]:
sentences = [
             "I am a good girl",
             "I am an engineer",
             "The sum rises in the east",
             "Live life to the fullest",
             "I am a developer",
             "I need a cup of tea",
             "I can understand",
             "my work is good",
             "I like apples",
             "I don't like mangoes"
]

In [5]:
#define the vocabulary size
vocab_size = 1000


In [7]:
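# one_hot hashes every word to an integer index between 1 and vocab_size - 1;
# despite its name it returns indices, not 0/1 vectors
onehot_repr = [one_hot(sentence, vocab_size) for sentence in sentences]
print(onehot_repr)

# Each sentence becomes a list of word indices, one index per word,
# so every inner list is as long as its sentence.
# In this run: 159 => I, 39 => am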

[[159, 39, 847, 860, 438], [159, 39, 937, 721], [641, 240, 207, 585, 641, 126], [324, 323, 441, 641, 42], [159, 39, 847, 264], [159, 996, 847, 840, 923, 122], [159, 228, 660], [494, 192, 634, 860], [159, 95, 358], [159, 722, 95, 372]]
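
Because `one_hot` hashes words into a fixed range, two different words can end up with the same index, and since it relies on Python's built-in `hash` by default, the indices may differ between sessions. One alternative is an explicit word-to-index dictionary, sketched below in plain Python. `build_vocab` and `encode` are hypothetical helper names, not Keras functions, and index 0 is left unused so it can serve as the padding value.

```python
sentences = ["I like apples", "I don't like mangoes"]

def build_vocab(texts):
    """Give each distinct lowercased word an index starting at 1 (0 is kept for padding)."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def encode(text, vocab):
    """Map a sentence to its list of word indices."""
    return [vocab[word] for word in text.lower().split()]

vocab = build_vocab(sentences)
encoded = [encode(s, vocab) for s in sentences]

print(vocab)
print(encoded)
```

With this approach every word has a unique index and the same input yields the same indices on every run, at the cost of having to store the dictionary.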


**Embedding representation**

In [8]:


from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences #to make length of sentence equal
from tensorflow.keras.models import Sequential
import numpy as np

In [9]:
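# Pad the index lists so that every sentence has the same length

sent_length = 8
# padding='pre' adds zeros at the front of each sequence
embedded_docs = pad_sequences(onehot_repr, padding='pre', maxlen=sent_length)
print(embedded_docs)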

[[  0   0   0 159  39 847 860 438]
 [  0   0   0   0 159  39 937 721]
 [  0   0 641 240 207 585 641 126]
 [  0   0   0 324 323 441 641  42]
 [  0   0   0   0 159  39 847 264]
 [  0   0 159 996 847 840 923 122]
 [  0   0   0   0   0 159 228 660]
 [  0   0   0   0 494 192 634 860]
 [  0   0   0   0   0 159  95 358]
 [  0   0   0   0 159 722  95 372]]
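
What `padding='pre'` with `maxlen` does can be sketched without Keras. The function `pad_pre` below is a hypothetical stand-in that reproduces the rows shown above; for sequences longer than `maxlen` it keeps the last `maxlen` items, which is intended to mirror the default `truncating='pre'` behaviour of `pad_sequences`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to maxlen; keep the last maxlen items if longer."""
    padded = []
    for seq in sequences:
        # Drop items from the front when the sequence is too long.
        trimmed = list(seq[max(len(seq) - maxlen, 0):])
        padding = [value] * (maxlen - len(trimmed))
        padded.append(padding + trimmed)
    return padded

# Two of the index lists from the one_hot output above.
examples = [[159, 39, 847, 860, 438], [159, 228, 660]]
for row in pad_pre(examples, maxlen=8):
    print(row)
```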


In [11]:
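# Define the number of features (embedding dimension) per word

dim = 10

# Define the model: a single Embedding layer that maps each word index to a vector of size dim
model = Sequential()
model.add(Embedding(vocab_size, dim, input_length=sent_length))
model.compile('adam', 'mse')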
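
Conceptually, an embedding layer is a trainable lookup table with one row per vocabulary index. The sketch below imitates that with a randomly initialised table in plain Python. It illustrates the idea only and is not the Keras implementation; `embedding_table` and `embed` are made-up names, and the range of the random values is an assumption chosen to resemble the magnitudes the untrained model prints.

```python
import random

vocab_size = 1000
dim = 10

random.seed(0)  # fixed seed so the sketch is repeatable
# One row of `dim` small random values per vocabulary index.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(dim)]
    for _ in range(vocab_size)
]

def embed(padded_sentence):
    """Replace every word index with its row from the table."""
    return [embedding_table[idx] for idx in padded_sentence]

sentence = [0, 0, 0, 159, 39, 847, 860, 438]  # first padded sentence
vectors = embed(sentence)

print(len(vectors), len(vectors[0]))  # 8 word positions, 10 features each
print(vectors[0] == vectors[1])       # True: both positions hold padding index 0
print(vectors[3] == vectors[4])       # indices 159 and 39 select different rows
```

This also explains why the first rows of each predicted sentence are identical: the leading positions all contain the padding index 0 and therefore look up the same row. For the 10 padded sentences, the model output has shape (10, 8, 10).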


In [12]:
print(model.predict(embedded_docs))

[[[-4.35585752e-02  2.48516239e-02 -2.92364508e-03 -3.21096778e-02
   -2.64827739e-02 -1.70924067e-02 -2.56507993e-02  3.78690101e-02
    1.09915622e-02  3.79628055e-02]
  [-4.35585752e-02  2.48516239e-02 -2.92364508e-03 -3.21096778e-02
   -2.64827739e-02 -1.70924067e-02 -2.56507993e-02  3.78690101e-02
    1.09915622e-02  3.79628055e-02]
  [-4.35585752e-02  2.48516239e-02 -2.92364508e-03 -3.21096778e-02
   -2.64827739e-02 -1.70924067e-02 -2.56507993e-02  3.78690101e-02
    1.09915622e-02  3.79628055e-02]
  [-3.02689206e-02  3.42058204e-02  2.78411545e-02  4.27877903e-03
   -1.29576772e-03 -1.48093104e-02  2.78568603e-02  4.89644073e-02
   -2.68476848e-02  3.92807610e-02]
  [-2.10841894e-02 -4.46150303e-02  2.94293091e-03  4.07879986e-02
   -4.13882621e-02 -3.67621779e-02  3.36346142e-02 -2.34319698e-02
    4.72513922e-02 -2.14796141e-03]
  [-3.15110236e-02 -4.76594940e-02  1.51516683e-02 -2.73773074e-02
    1.14548206e-03  4.41222824e-02 -3.46930996e-02 -1.93387400e-02
   -3.92709859e-

In [13]:
print(model.predict(embedded_docs)[0])

[[-4.3558575e-02  2.4851624e-02 -2.9236451e-03 -3.2109678e-02
  -2.6482774e-02 -1.7092407e-02 -2.5650799e-02  3.7869010e-02
   1.0991562e-02  3.7962805e-02]
 [-4.3558575e-02  2.4851624e-02 -2.9236451e-03 -3.2109678e-02
  -2.6482774e-02 -1.7092407e-02 -2.5650799e-02  3.7869010e-02
   1.0991562e-02  3.7962805e-02]
 [-4.3558575e-02  2.4851624e-02 -2.9236451e-03 -3.2109678e-02
  -2.6482774e-02 -1.7092407e-02 -2.5650799e-02  3.7869010e-02
   1.0991562e-02  3.7962805e-02]
 [-3.0268921e-02  3.4205820e-02  2.7841154e-02  4.2787790e-03
  -1.2957677e-03 -1.4809310e-02  2.7856860e-02  4.8964407e-02
  -2.6847685e-02  3.9280761e-02]
 [-2.1084189e-02 -4.4615030e-02  2.9429309e-03  4.0787999e-02
  -4.1388262e-02 -3.6762178e-02  3.3634614e-02 -2.3431970e-02
   4.7251392e-02 -2.1479614e-03]
 [-3.1511024e-02 -4.7659494e-02  1.5151668e-02 -2.7377307e-02
   1.1454821e-03  4.4122282e-02 -3.4693100e-02 -1.9338740e-02
  -3.9270986e-02  4.0844705e-02]
 [ 1.2421895e-02 -2.9974306e-02  8.6203218e-06 -1.6243923e