## Word Embeddings - Using Embedding Layer

<p style='text-align: justify;'>
Text is one of the most widespread forms of sequence data. It can be understood as either a sequence of characters or a sequence of words. Like all other neural networks, deep-learning models don’t take raw text as input: they only work with numeric tensors. The process of transforming text into numeric vectors is called 'vectorizing'.</p>

<p style='text-align: justify;'>
The different units into which you can break down text (words, characters, or n-grams) are called tokens, and breaking text into such tokens is called tokenization. All text-vectorization processes consist of applying some tokenization scheme and then associating numeric vectors with the generated tokens.</p>
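<p style='text-align: justify;'>
As a rough illustration of the three kinds of tokens mentioned above, here is a minimal sketch in plain Python. The sample sentence and the variable names are only for illustration, and real tokenizers handle punctuation and casing more carefully.</p>

```python
# Illustrative only: three simple ways to split text into tokens.
text = "India is a great country"

# Word-level tokens: lower-case the text and split on whitespace.
word_tokens = text.lower().split()

# Character-level tokens: every character (spaces dropped here for brevity).
char_tokens = [ch for ch in text.lower() if ch != " "]

# Word bigrams (n-grams with n = 2): each pair of adjacent words.
bigrams = list(zip(word_tokens, word_tokens[1:]))

print(word_tokens)      # ['india', 'is', 'a', 'great', 'country']
print(char_tokens[:5])  # ['i', 'n', 'd', 'i', 'a']
print(bigrams[0])       # ('india', 'is')
```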

<p style='text-align: justify;'>
There are two common ways to associate a vector with each token: one-hot encoding of words and word embeddings.

We will explore word embeddings in this article.</p>
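<p style='text-align: justify;'>
To make the contrast concrete, the sketch below builds a one-hot vector and sets it next to a dense vector. The 5-word vocabulary, the helper 'one_hot' and the dense values are all made up for this illustration.</p>

```python
# Hypothetical 5-word vocabulary used only for this illustration.
vocab = ["country", "i", "a", "india", "live"]

def one_hot(word, vocab):
    """Return a vector with a single 1 at the word's position in vocab."""
    return [1 if w == word else 0 for w in vocab]

# One-hot: as many dimensions as there are words, almost all zeros.
print(one_hot("india", vocab))  # [0, 0, 0, 1, 0]

# Embedding: a short dense vector of floats (the values here are made up).
dense_example = {"india": [0.12, -0.48, 0.33]}
print(dense_example["india"])
```

<p style='text-align: justify;'>
The one-hot vector grows with the vocabulary, while the length of the dense vector is chosen independently of the vocabulary size.</p>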

<p style='text-align: justify;'>
Word embeddings associate a dense vector with each token (word) in a text corpus. These word vectors are low-dimensional floating-point vectors, and they are learned from data. There are two ways to obtain word embeddings:
    
- Learn word embeddings with an 'Embedding' layer.

- Load a pretrained model that has been trained on a different text corpus and get the word vectors from that model.
</p>

### Embedding Layer  

<p style='text-align: justify;'>
Word vectors (word embeddings) are learned by feeding words into a neural network model and optimizing its weights by backpropagation. These weights are then the word vectors for each token in the whole text corpus.</p>

<p style='text-align: justify;'>
Keras has an 'Embedding' layer that gives us these word vectors. It works like a dictionary that maps integer indices (which stand for specific words) to dense vectors: it takes integers as input, looks them up in an internal dictionary, and returns the associated vectors.</p>
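<p style='text-align: justify;'>
The lookup idea can be sketched without Keras. The snippet below is not the Keras implementation: 'embedding_table' and 'embed' are hypothetical names, the sizes are arbitrary, and the small random values only stand in for weights that a real layer would hold.</p>

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

num_indices = 10  # number of rows: one per integer index 0..9 (arbitrary size)
vector_size = 4   # length of each dense vector (arbitrary size)

# The "internal dictionary" behaves like a table: row i is the vector for index i.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(vector_size)]
    for _ in range(num_indices)
]

def embed(indices):
    """Look up the vector for each integer index in a sequence."""
    return [embedding_table[i] for i in indices]

vectors = embed([2, 5, 6])
print(len(vectors), len(vectors[0]))  # 3 4
```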

<p style='text-align: justify;'>
Before we jump into the Embedding layer, let's first understand what inputs it takes. We cannot feed raw words directly into this layer; the tokens from the text corpus need to be preprocessed first. How do we prepare the text data for the Embedding layer?</p>
    
<p style='text-align: justify;'>
The steps are: tokenize the text data, clean it (remove special characters, numbers, etc.), and convert the tokens of each document in the corpus into a sequence of integers. These sequences of integers, one per document, are then fed into the Embedding layer. Keras has modules that cover all of these steps.</p>

<p style='text-align: justify;'>
Suppose we have the text corpus below, containing 3 text documents. Our task is to prepare this corpus for the Embedding layer, which will return the word vectors.</p>
    
    corpus = [
              'I live in a country name India.',     --- Document 1
              'India is a great country.',           --- Document 2
              'I love my country very much.'         --- Document 3
             ]


In [1]:
corpus = [
          'I live in a country name India.',     
          'India is a great country.',           
          'I love my country very much.'         
         ]

<p style='text-align: justify;'>
As we know, vectorizing words involves a tokenization scheme that splits the text data into a list of tokens (words). We also need to clean the text data so that it does not contain special characters, digits, etc. We may also want to use only a certain number of words from the whole text. Keras has a module 'preprocessing.text' with a class 'Tokenizer' that helps us with these steps.
</p>

In [4]:
import tensorflow as tf
vocab_size = 14  # 13 unique words plus 1, because index 0 is reserved and never assigned to a word

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words = vocab_size,
                                                  lower = True,
                                                  filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                                                  split = ' ',
                                                  char_level = False
                                                 )


tokenizer.fit_on_texts(corpus)
sequences = tokenizer.texts_to_sequences(corpus)

print('Text Corpus -> Sequence of Word Indexes.... \n', sequences)

Text Corpus -> Sequence of Word Indexes.... 
 [[2, 5, 6, 3, 1, 7, 4], [4, 8, 3, 9, 1], [2, 10, 11, 1, 12, 13]]


<p style='text-align: justify;'>
The arguments are as follows:

    num_words : the maximum number of words to keep, based on word frequency. Only the (num_words - 1) most frequent words are kept. Our corpus has 13 unique words, and since we want word vectors for all of them, we have set num_words = 14, which means the (14 - 1) = 13 most frequent unique words in the corpus are considered.

    lower : True converts tokens to lower case.
    
    filters : the characters to strip from the text before tokenizing.
    
    split : the separator used for tokenization. ' ' splits a document on single spaces.
    
    char_level : True tokenizes the text into characters; False tokenizes it into words.
</p>

<p style='text-align: justify;'>
    
The fit_on_texts() method of the tokenizer takes a text corpus, applies the tokenization rules above, and builds the word index from the word frequencies.
</p>

<p style='text-align: justify;'>
Finally, the texts_to_sequences() method converts the tokens of each document in the corpus into a sequence of integers. Let's see where those integers come from. The word index itself is built when fit_on_texts() is called; texts_to_sequences() then looks each token up in it.
   
Step 1: An OrderedDict is created where each key is a unique word and each value is the frequency of that word in the whole corpus.

Step 2: The entries are sorted by value in descending order; words with equal frequency keep their order of first appearance.
    
Step 3: An integer index is assigned to each key sequentially, starting from 1.
    
So in our example, let's first create the OrderedDict in the form (word, frequency):  
> ('i', 2), ('live', 1), ('in', 1), ('a', 2), ('country', 3),  
  ('name', 1), ('india', 2), ('is', 1), ('great', 1),  
  ('love', 1), ('my', 1), ('very', 1), ('much', 1)  
    
Next, sort the dictionary by value in descending order:  
> ('country', 3), ('i', 2), ('a', 2), ('india', 2),  
  ('live', 1), ('in', 1), ('name', 1), ('is', 1),   
  ('great', 1), ('love', 1), ('my', 1), ('very', 1), ('much', 1)
    
Next, assign an index to each word in the sorted dictionary:  
> {  
       'country' : 1, 'i': 2 , 'a': 3, 'india' : 4,  
       'live': 5, 'in' : 6, 'name' : 7, 'is' : 8,  
       'great' : 9, 'love' : 10, 'my' : 11, 'very' : 12, 'much' : 13  
  }
    
Since num_words = 14, the top (14 - 1) = 13 words are considered, which here means every word in the corpus.</p>
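<p style='text-align: justify;'>
The three steps can be reproduced in plain Python. This is a sketch of the idea rather than the Tokenizer's own code: it only lower-cases the text and removes full stops, which is enough for this corpus but is a simplification of the 'filters' argument.</p>

```python
from collections import OrderedDict

corpus = [
    'I live in a country name India.',
    'India is a great country.',
    'I love my country very much.',
]

# Step 1: count each word, remembering the order of first appearance.
word_counts = OrderedDict()
for document in corpus:
    # Simplified cleaning: lower-case and drop the full stop only.
    for word in document.lower().replace('.', '').split(' '):
        word_counts[word] = word_counts.get(word, 0) + 1

# Step 2: sort by frequency, highest first. Python's sort is stable,
# so words with equal counts keep their order of first appearance.
sorted_counts = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)

# Step 3: assign indices sequentially, starting from 1.
word_index = {word: i for i, (word, _) in enumerate(sorted_counts, start=1)}

print(word_index)
# {'country': 1, 'i': 2, 'a': 3, 'india': 4, 'live': 5, 'in': 6, 'name': 7,
#  'is': 8, 'great': 9, 'love': 10, 'my': 11, 'very': 12, 'much': 13}
```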

<p style='text-align: justify;'>
The first document in our corpus is 'I live in a country name India.'. 'I' is in our word list and its index is 2; the index of 'live' is 5, the index of 'in' is 6, and so on. The final sequence for document 1 is [2, 5, 6, 3, 1, 7, 4].  
    
Similarly, the sequence for the tokens in document 2, 'India is a great country.', is [4, 8, 3, 9, 1],  
    
and the sequence for the tokens in document 3, 'I love my country very much.', is [2, 10, 11, 1, 12, 13].
</p>
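<p style='text-align: justify;'>
The lookup, together with the effect of num_words, can be sketched as follows. The helper 'to_sequence' is hypothetical and uses the same simplified cleaning (lower-casing and removing full stops); it keeps only words whose index is below num_words, which is why num_words = 14 keeps all 13 words.</p>

```python
# Word index for the example corpus, as derived in the text.
word_index = {
    'country': 1, 'i': 2, 'a': 3, 'india': 4, 'live': 5, 'in': 6, 'name': 7,
    'is': 8, 'great': 9, 'love': 10, 'my': 11, 'very': 12, 'much': 13,
}

def to_sequence(document, word_index, num_words):
    """Map each known word to its index, keeping only indices below num_words."""
    words = document.lower().replace('.', '').split(' ')
    return [
        word_index[w]
        for w in words
        if w in word_index and word_index[w] < num_words
    ]

doc = 'I live in a country name India.'
print(to_sequence(doc, word_index, num_words=14))  # [2, 5, 6, 3, 1, 7, 4]
print(to_sequence(doc, word_index, num_words=5))   # [2, 3, 1, 4]
```

<p style='text-align: justify;'>
With num_words = 5 only the 4 most frequent words survive, so the less frequent words are silently dropped from the sequence.</p>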

<p style='text-align: justify;'>
    So in the code above we have preprocessed our text corpus into sequences of integers, where the integers correspond to the indices of the words in each document. Remember, we still do not have word vectors; we are just preparing our text data to feed into the neural network.
</p>

<p style='text-align: justify;'>
    The next step in our data preparation is to build a 2D array of shape (batch_size, sequence_length), because this 2D array is what gets passed as input to the Embedding layer.
     So far we have a list of sequences: [[2, 5, 6, 3, 1, 7, 4], [4, 8, 3, 9, 1], [2, 10, 11, 1, 12, 13]]. We need a 2D array where batch_size=3 (because we have 3 documents in our corpus) and sequence_length=7 (because the longest sequence has 7 items). Shorter sequences are padded with 0s, so the required 2D array is:
    $$\begin{bmatrix} 2 & 5 & 6 & 3 & 1 & 7 & 4 \\ 4 & 8 & 3 & 9 & 1 & 0 & 0 \\ 2 & 10 & 11 & 1 & 12 & 13 & 0 \end{bmatrix}$$
    
Keras has a module 'preprocessing.sequence' that helps prepare this 2D array.
</p>

In [5]:
max_len = 7  # length of the longest document, in tokens
padded_seq = tf.keras.preprocessing.sequence.pad_sequences(sequences=sequences,
                                                           maxlen=max_len,
                                                           truncating='pre',
                                                           padding='post')
print('2D Array: \n\n', padded_seq)
print('\n 2D Array Shape ',padded_seq.shape)

2D Array: 

 [[ 2  5  6  3  1  7  4]
 [ 4  8  3  9  1  0  0]
 [ 2 10 11  1 12 13  0]]

 2D Array Shape  (3, 7)


<p style='text-align: justify;'>
Arguments:
    
    maxlen = maximum length of a sequence; 7 in this example.  
    
    sequences = list of sequences of integers.  
    
    truncating = how to truncate a sequence that is longer than maxlen. 'pre' removes values from the beginning, 'post' removes values from the end.
    
    padding = how to pad a sequence that is shorter than maxlen. 'pre' adds 0s at the beginning, 'post' adds 0s at the end.  
</p>

<p style='text-align: justify;'>
We have maxlen=7. The 2nd sequence has length 5, which is shorter than maxlen by 2, so two 0s need to be padded, and since padding='post' they are added at the end. The 3rd sequence has length 6, which is shorter than maxlen by 1, so a single 0 is padded at the end.
    
This is how we converted a list of sequences of word indices into a 2D array of shape (3, 7), where 3 is the sample size (number of sequences) and 7 is the maximum sequence length.
</p>
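<p style='text-align: justify;'>
The padding and truncating rules can be written out in a few lines of plain Python. The function 'pad_sequence' below is a hypothetical helper that mirrors the behaviour described above for a single sequence; it is a sketch, not the Keras implementation, and it assumes maxlen is at least 1.</p>

```python
def pad_sequence(seq, maxlen, padding='post', truncating='pre', value=0):
    """Pad or truncate one sequence to exactly maxlen items (maxlen >= 1)."""
    if len(seq) > maxlen:
        # 'pre' drops items from the start, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    # 'post' appends the padding, 'pre' puts it in front.
    return seq + pad if padding == 'post' else pad + seq

sequences = [[2, 5, 6, 3, 1, 7, 4], [4, 8, 3, 9, 1], [2, 10, 11, 1, 12, 13]]
padded = [pad_sequence(s, maxlen=7) for s in sequences]
for row in padded:
    print(row)
# [2, 5, 6, 3, 1, 7, 4]
# [4, 8, 3, 9, 1, 0, 0]
# [2, 10, 11, 1, 12, 13, 0]

# Truncation with the default truncating='pre' keeps the last maxlen items.
print(pad_sequence([1, 2, 3, 4, 5], maxlen=3))  # [3, 4, 5]
```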

<p style='text-align: justify;'>
Finally, with the help of the Keras preprocessing modules, our text corpus is ready to be fed into the Keras Embedding layer to generate word embeddings.
</p>

<p style='text-align: justify;'>
We will build a Sequential model with an Embedding layer.</p>

In [7]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=8, input_length=max_len))

<p style='text-align: justify;'>
The Embedding layer takes the following arguments:  

    input_length : length of each sequence in the input 2D array. In this case it is 7.
        
    input_dim : size of the vocabulary, i.e. the largest word index plus 1. Our corpus has 13 words indexed 1 to 13, and index 0 is used for padding, so we set 14. We want word embeddings for all 13 words.  
        
    output_dim : dimension of each generated word vector. It can be any size, and a suitable value depends on the size of the corpus. We have set 8.
</p>

Let's compile the model and then get the word embeddings using the model's predict method, passing the padded sequences as input.

In [9]:
model.compile(optimizer='adam', loss='categorical_crossentropy')
word_embeddings = model.predict(padded_seq)

In [10]:
print('Shape of word_embeddings ', word_embeddings.shape)
print(' \n ', word_embeddings)

Shape of word_embeddings  (3, 7, 8)
 
  [[[-0.03448143  0.02527387  0.04875959  0.03743121  0.00962205
    0.0462733  -0.00197356 -0.03287669]
  [-0.04859023  0.00589529  0.04115    -0.04330999  0.04140783
   -0.02147198  0.02231913  0.04064867]
  [-0.00984266  0.0279462  -0.02953575 -0.00323793  0.00286321
    0.01004422 -0.03504807 -0.01536544]
  [-0.01424579  0.02937368  0.0331848   0.00166159  0.03057465
   -0.00994849  0.03966172  0.04327631]
  [ 0.03490904 -0.045705    0.02629853 -0.0056543  -0.03333612
   -0.01934866 -0.01188357 -0.00877807]
  [-0.02354031 -0.01780294 -0.03162633 -0.00448536 -0.04305381
   -0.01177998  0.04270966  0.03985044]
  [ 0.02116651 -0.03881351  0.01568277  0.02281796  0.04359093
    0.01584264 -0.03330912 -0.0208941 ]]

 [[ 0.02116651 -0.03881351  0.01568277  0.02281796  0.04359093
    0.01584264 -0.03330912 -0.0208941 ]
  [-0.04705135 -0.02723608  0.02537857  0.02739319 -0.00563302
   -0.00999812 -0.00948316 -0.0281214 ]
  [-0.01424579  0.02937368  0.0

<p style='text-align: justify;'>
Notice that the shape of the word embeddings is (3, 7, 8). It is a 3D tensor of shape (batch_size, input_length, output_dim). The Embedding layer accepts a 2D array of sequences and returns a 3D array, which can be passed as input to an RNN or a ConvNet.
</p>
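<p style='text-align: justify;'>
Where the extra dimension comes from can be seen in a small sketch that replaces every integer of the padded 2D array with a row of 8 floats. The table here is filled with random stand-in values and the name 'table' is hypothetical; only the shapes and the lookup pattern are the point.</p>

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable
vocab_size, output_dim = 14, 8

# One row of 8 floats per integer index 0..13 (random stand-in values).
table = [
    [random.uniform(-0.05, 0.05) for _ in range(output_dim)]
    for _ in range(vocab_size)
]

padded_seq = [
    [2, 5, 6, 3, 1, 7, 4],
    [4, 8, 3, 9, 1, 0, 0],
    [2, 10, 11, 1, 12, 13, 0],
]

# Replace every integer with its row: 2D (3, 7) becomes 3D (3, 7, 8).
embedded = [[table[i] for i in row] for row in padded_seq]
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (3, 7, 8)

# The same index always yields the same vector, wherever it appears:
# index 4 ('india') ends document 1 and starts document 2.
print(embedded[0][6] == embedded[1][0])  # True
```

<p style='text-align: justify;'>
The same pattern is visible in the printed output above: the last vector of the first document and the first vector of the second document are identical, because both positions hold index 4.</p>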

<p style='text-align: justify;'>
Let's interpret the 3D array returned by the Embedding layer. The layer has generated word vectors of 8 dimensions. From this 3D array we can read off the word vector of each word in our corpus.
</p>

In [25]:
# Position (doc_indx, word_indx) in the 3D array holds the vector of that word.
for doc_indx, document in enumerate(corpus):
    print('\n Document: ', document)
    for word_indx, word in enumerate(document.split(' ')):
        print('\n ----- Word ---- ', word)
        print(' ----- Word Vector ----- \n ', word_embeddings[doc_indx, word_indx, :])


 Document:  I live in a country name India.

 ----- Word ----  I
 ----- Word Vector ----- 
  [-0.03448143  0.02527387  0.04875959  0.03743121  0.00962205  0.0462733
 -0.00197356 -0.03287669]

 ----- Word ----  live
 ----- Word Vector ----- 
  [-0.04859023  0.00589529  0.04115    -0.04330999  0.04140783 -0.02147198
  0.02231913  0.04064867]

 ----- Word ----  in
 ----- Word Vector ----- 
  [-0.00984266  0.0279462  -0.02953575 -0.00323793  0.00286321  0.01004422
 -0.03504807 -0.01536544]

 ----- Word ----  a
 ----- Word Vector ----- 
  [-0.01424579  0.02937368  0.0331848   0.00166159  0.03057465 -0.00994849
  0.03966172  0.04327631]

 ----- Word ----  country
 ----- Word Vector ----- 
  [ 0.03490904 -0.045705    0.02629853 -0.0056543  -0.03333612 -0.01934866
 -0.01188357 -0.00877807]

 ----- Word ----  name
 ----- Word Vector ----- 
  [-0.02354031 -0.01780294 -0.03162633 -0.00448536 -0.04305381 -0.01177998
  0.04270966  0.03985044]

 ----- Word ----  India.
 ----- Word Vector ----- 
  [

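You may be wondering what the numbers in these word vectors are. They are the weights of the Embedding layer, one row of weights per word index. Because we only called predict and never trained the model, the values shown here are still the layer's random initial weights; when the layer is trained as part of a model, these weights are learned using backpropagation.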
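<p style='text-align: justify;'>
To give a feel for what learning these weights by backpropagation means, here is a toy sketch of a single gradient-descent step on a tiny embedding table. Everything in it is an assumption made for illustration: the 3 x 2 table, the target vector, the squared-error loss and the learning rate. A real model would compute the gradient from its own loss.</p>

```python
# Toy embedding table: 3 indices, 2 dimensions (made-up starting values).
table = [[0.0, 0.0], [0.1, -0.1], [0.2, 0.3]]
target = [1.0, 1.0]   # hypothetical target vector for the word at index 1
learning_rate = 0.1
index = 1             # the only token that appears in this training example

# Loss = sum((e - target)^2), so d(loss)/d(e) = 2 * (e - target).
grad = [2 * (e - t) for e, t in zip(table[index], target)]

# Gradient descent updates only the row that was looked up.
table[index] = [e - learning_rate * g for e, g in zip(table[index], grad)]

print(table[index])        # moved towards the target
print(table[0], table[2])  # rows that were not looked up stay unchanged
```

<p style='text-align: justify;'>
Only the rows whose indices appear in the input receive an update, which is how each word ends up with its own learned vector.</p>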

Hope this article helps you to understand Keras Embedding Layer.