# Word Embeddings in Keras

- Embeddings are not hand-crafted. Instead, they are learned during neural network training.

- Word embeddings are a way of representing words as input to a deep learning model, and they are widely used in NLP. Each word is represented as a dense vector of a predefined dimension. A higher dimension gives the vector more capacity to capture the syntactic and semantic properties of the word, at the cost of more parameters to train.

- A word in any language has a meaning, but that meaning depends on the context in which the word appears. A single word can occur in many contexts, sometimes more than ten, so it needs a rich representation.
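
To make "a vector of a predefined dimension" concrete, the sketch below builds a tiny embedding table in plain Python. The vocabulary, the dimension of 4, and the random starting values are assumptions chosen for illustration. In a real model the values start out random in a similar way and are then adjusted by training.

```python
import random

# Hypothetical toy vocabulary: each word gets a row index in the table.
vocab = {"nice": 0, "food": 1, "poor": 2, "service": 3}
embedding_dim = 4  # the "predefined dimension" of every word vector

# One row per word, one column per dimension. Values start random;
# training would nudge them so that related words end up close together.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(len(vocab))
]

def embed(word):
    """Look up the vector for a word by its row index."""
    return embedding_table[vocab[word]]

sentence = ["nice", "food"]
vectors = [embed(w) for w in sentence]
print(len(vectors), "words ->", len(vectors[0]), "numbers each")
```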



## Techniques to compute word embeddings
### 1. Using supervised learning

- Take an NLP problem and try to solve it; the word embeddings are learned as a side effect. For instance, in sentiment classification, positive examples could be "Nice food." and "The sandwich was too delicious.", and negative examples could be "Poor quality food." and "I will never eat that food again."

### 2. Using self-supervised learning
- Word2vec
- GloVe


Traditionally, integer values and one-hot vectors are used to represent words. This applies not just to words in NLP: any categorical variable in structured data can be viewed the same way. The one-hot representation has drawbacks:

1. The length of each word vector equals the total number of unique words in the dictionary. In NLP applications, this makes the vectors very long.

2. One-hot vectors cannot express any relationship between the values of a variable, because they carry no information about the features of those values. Example: for day of the week, weekdays are related to one another, and so are weekend days, but a one-hot representation cannot make that distinction.
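
Both drawbacks can be seen in a few lines of plain Python. This is a minimal sketch; the `one_hot_vector` helper and the day-of-week vocabulary are illustrative and not part of Keras.

```python
days = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

def one_hot_vector(item, vocabulary):
    """Return a list with a single 1 at the item's position."""
    return [1 if v == item else 0 for v in vocabulary]

def dot(a, b):
    """Dot product, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))

sat = one_hot_vector("sat", days)
sun = one_hot_vector("sun", days)
mon = one_hot_vector("mon", days)

# Drawback 1: the vector is as long as the vocabulary.
print(len(sat))        # 7 here; far larger for a real text corpus

# Drawback 2: every pair of distinct items looks equally unrelated.
print(dot(sat, sun))   # 0: two weekend days
print(dot(sat, mon))   # 0: a weekend day and a weekday
```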



## Pre-processing with Keras tokenizer:

We will use the Keras tokenizer for the pre-processing needed to clean up the data.

First, create a Keras `Tokenizer` object. Then call its `fit_on_texts` method, passing the dataset as a list of text samples. This fits the tokenizer to the dataset, after which its other methods can be used to transform the data.

### tokenizer.texts_to_sequences():

This method tokenizes the input text by splitting it into word tokens and replacing each token with an integer. Each unique word is assigned its own integer. For example, the sentence "I don't like movies because movies are not real" becomes "5, 6, 20, 9, 12, 9, 22, 3, 23", where the integers are arbitrary values chosen for illustration. The word "movies" maps to 9 both times it appears. The text is thus converted into a sequence of integers that stand in for the word tokens.
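
As a rough picture of what fitting and converting do, here is a plain-Python sketch. It is not the Keras implementation: the helper names `fit_word_index` and `to_sequence` are made up for this example, and it only lowercases and splits on whitespace. It does follow the same idea of giving more frequent words smaller integers, starting at 1 so that 0 stays free for padding.

```python
from collections import Counter

def fit_word_index(texts):
    """Build a word -> integer map, most frequent words first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Replace each known word with its integer; unknown words are skipped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

corpus = ["I don't like movies because movies are not real"]
word_index = fit_word_index(corpus)
sequence = to_sequence(corpus[0], word_index)

print(word_index["movies"])  # 1, the most frequent word
print(sequence)              # [2, 3, 4, 1, 5, 1, 6, 7, 8]
```

Note that "movies" maps to the same integer in both positions, which is the property the example sentence above relies on.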

### pad_sequences(list_tokenized_train, maxlen=maxlen):

`pad_sequences` takes two arguments here. The first is the tokenized text as lists of integers, produced from the dataset by the `texts_to_sequences()` method. The second, `maxlen`, is the length to which every sequence is padded or truncated. We can set `maxlen` by analysing the lengths of the sentences in the dataset. Ideally, take the length of the longest sentence after excluding outliers that are extremely long.
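
One way to do that analysis is to look at the distribution of sequence lengths and pick a high percentile instead of the maximum. The sketch below uses plain Python; the 90th-percentile cutoff, the `choose_maxlen` helper, and the sample lengths are assumptions for illustration, not a rule from Keras.

```python
import math

def choose_maxlen(sequences, percentile=90):
    """Return the length that covers `percentile` percent of the sequences."""
    lengths = sorted(len(seq) for seq in sequences)
    # Nearest-rank percentile: the smallest length with at least
    # `percentile` percent of the sequences at or below it.
    rank = max(1, math.ceil(percentile * len(lengths) / 100))
    return lengths[rank - 1]

# Hypothetical tokenized corpus: nine short reviews and one very long outlier.
list_tokenized_train = [[1] * n for n in (3, 4, 4, 5, 5, 5, 6, 6, 7, 120)]

maxlen = choose_maxlen(list_tokenized_train)
print(maxlen)  # 7: the 120-token outlier no longer dictates the padded length
```

The chosen value can then be passed as `maxlen` to `pad_sequences`. Sequences longer than `maxlen` are truncated, by default from the front.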

In [17]:
# Libraries

import numpy as np
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding



In [18]:
# First, I will define the documents and their class labels.
reviews = ['nice food',
        'amazing restaurant',
        'too good',
        'just loved it!',
        'will go again',
        'horrible food',
        'never go there',
        'poor service',
        'poor quality',
        'needs improvement']

sentiment = np.array([1,1,1,1,1,0,0,0,0,0])

In [19]:
one_hot("amazing restaurant",30)


[22, 6]

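`one_hot` takes a review and returns one integer per word, each in the range 1 to `vocab_size - 1` (here 1 to 29). In this run, "amazing" maps to 22 and "restaurant" maps to 6. The integers come from hashing, so they are not guaranteed to be unique, and they can change between Python sessions.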
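
Despite its name, `one_hot` does not build one-hot vectors: it hashes each word into an integer bucket. The sketch below imitates that idea in plain Python so the consequences are easy to see. It assumes an MD5-based hash to stay reproducible between runs (Python's built-in `hash` for strings changes from process to process unless `PYTHONHASHSEED` is fixed), and `hashed_index` is a name made up for this example.

```python
import hashlib

def hashed_index(word, n):
    """Map a word to an integer in the range 1 .. n-1 using a stable hash."""
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

vocab_size = 30
words = [f"word{i}" for i in range(100)]  # 100 distinct made-up words
indices = [hashed_index(w, vocab_size) for w in words]

# Index 0 is never produced, which leaves it free for padding.
print(min(indices) >= 1, max(indices) <= vocab_size - 1)  # True True

# Only 29 buckets exist, so 100 distinct words must share some of them.
print(len(set(indices)) < len(words))  # True
```

Collisions mean two different words can end up sharing one embedding row. Choosing a `vocab_size` comfortably larger than the number of distinct words makes collisions less likely but never rules them out.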

In [20]:
vocab_size = 30
encoded_reviews = [one_hot(d, vocab_size) for d in reviews]
print(encoded_reviews)

[[26, 18], [22, 6], [8, 20], [18, 18, 16], [15, 4, 12], [28, 18], [22, 4, 26], [17, 14], [17, 11], [21, 14]]


The encoded reviews have different lengths, so we need to pad them to give every review the same shape.

In [21]:
# With max_length = 4, every review becomes a row of 4 integers, even if it has only 2 words.

max_length = 4
padded_reviews = pad_sequences(encoded_reviews, maxlen=max_length, padding='post')
print(padded_reviews)

[[26 18  0  0]
 [22  6  0  0]
 [ 8 20  0  0]
 [18 18 16  0]
 [15  4 12  0]
 [28 18  0  0]
 [22  4 26  0]
 [17 14  0  0]
 [17 11  0  0]
 [21 14  0  0]]


In [22]:
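embedding_vector_size = 5  # number of dimensions in each word vector

model = Sequential()
# Map each of the max_length integers to a trainable 5-dimensional vector
model.add(Embedding(vocab_size, embedding_vector_size, input_length=max_length, name="embedding"))
# Flatten the (max_length, 5) output into one vector of 20 values
model.add(Flatten())
# Single sigmoid unit: outputs the probability that the review is positive
model.add(Dense(1, activation='sigmoid'))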
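
It can help to trace the shapes and parameter counts of these three layers by hand. The arithmetic below is a sketch based on the values used in this notebook (`vocab_size = 30`, a vector size of 5, `max_length = 4`); the variable names are only for illustration.

```python
vocab_size = 30         # rows in the embedding table
embedding_dim = 5       # numbers per word vector
max_length = 4          # tokens per padded review

# Embedding: each of the 4 integers is replaced by a 5-number vector.
embedding_output_shape = (max_length, embedding_dim)   # (4, 5) per review
embedding_params = vocab_size * embedding_dim          # one vector per index

# Flatten: the 4 x 5 grid becomes one long vector, with no parameters.
flatten_size = max_length * embedding_dim              # 20 per review

# Dense(1): one weight per flattened value plus one bias.
dense_params = flatten_size * 1 + 1

print(embedding_params, dense_params, embedding_params + dense_params)
# 150 21 171
```

These counts should match what `model.summary()` reports for the three layers.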

In [23]:
# Features and labels (all 10 reviews are used for training; there is no separate test set)
X = padded_reviews
y = sentiment

In [24]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()  # prints the layer table itself; it returns None, so print() is not needed


In [25]:
model.fit(X, y, epochs=50, verbose=0)


<keras.src.callbacks.history.History at 0x7af09d5184a0>

In [26]:
# Evaluate the model on the same data it was trained on
loss, accuracy = model.evaluate(X, y)
accuracy

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 256ms/step - accuracy: 0.9000 - loss: 0.6367


0.8999999761581421

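The model classifies 9 of the 10 reviews correctly, an accuracy of 90%. Because it is evaluated on the same reviews it was trained on, this is training accuracy and says little about how the model would perform on unseen reviews.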

In [27]:
# Learned embedding matrix: one row per vocabulary index
weights = model.get_layer('embedding').get_weights()[0]
len(weights)

30

In [28]:
weights[13]


array([-0.00845361, -0.03643029,  0.01936955,  0.04812371, -0.01868276],
      dtype=float32)

In [29]:
weights[4]


array([-0.04690833, -0.02410579,  0.04898885,  0.02167631,  0.0403937 ],
      dtype=float32)

In [30]:
weights[16]


array([-0.0568514 ,  0.00153835, -0.08294711,  0.07071232,  0.08833166],
      dtype=float32)
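
A common way to compare two learned vectors is cosine similarity. The sketch below applies it to the rows printed above for indices 4 and 16. The `cosine_similarity` helper is written here for illustration and is not part of Keras. With only ten training reviews, the resulting number should not be read as a meaningful measure of word similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Values copied from the weights[4] and weights[16] outputs above.
w4 = [-0.04690833, -0.02410579, 0.04898885, 0.02167631, 0.0403937]
w16 = [-0.0568514, 0.00153835, -0.08294711, 0.07071232, 0.08833166]

print(round(cosine_similarity(w4, w4), 6))  # 1.0: a vector compared with itself
print(cosine_similarity(w4, w16))           # a value between -1 and 1
```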