### Word Embeddings - Keras

### Problems with One-Hot Encoded Feature Vector Approaches
A potential drawback of one-hot encoded feature vector approaches such as N-grams, bag of words and TF-IDF is that the feature vector for each document can be huge. For instance, if you have half a million unique words in your corpus and you want to represent a sentence that contains 10 words, your feature vector will be a half-million-dimensional vector in which at most 10 entries are non-zero.
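
To make the sparsity concrete, here is a minimal NumPy sketch. The vocabulary size matches the example above, but the word indices are made-up values chosen only for illustration.

```python
import numpy as np

vocab_size = 500_000  # assumed number of unique words in the corpus

# Hypothetical vocabulary indices of the 10 words in one sentence
word_indices = np.array(
    [12, 873, 4051, 4052, 19_004, 77_310, 120_455, 250_000, 333_333, 499_999]
)

# Bag-of-words style vector: one slot per vocabulary word
sentence_vector = np.zeros(vocab_size, dtype=np.int8)
sentence_vector[word_indices] = 1

nonzero = int(sentence_vector.sum())
print(nonzero, "non-zero entries out of", sentence_vector.size)
```

Almost the whole vector is spent storing zeros, which is the cost that dense embeddings avoid.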

### Word Embeddings
In word embeddings, every word is represented as an n-dimensional dense vector. Words that are similar have similar vectors. Word embedding techniques such as GloVe and Word2Vec have proven to be very effective at converting words into corresponding dense vectors. The vector size is small, and its entries are real-valued, so the vector is not dominated by zeros.
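
As a rough illustration of "similar words have similar vectors", the sketch below compares made-up 4-dimensional vectors with cosine similarity. The numbers are invented for the example and are not trained embeddings.

```python
import numpy as np

# Made-up 4-dimensional vectors, for illustration only (not trained embeddings)
vectors = {
    "excellent": np.array([0.9, 0.8, 0.1, 0.3]),
    "fantastic": np.array([0.8, 0.9, 0.2, 0.3]),
    "boring": np.array([-0.7, -0.6, 0.4, 0.1]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_close = cosine_similarity(vectors["excellent"], vectors["fantastic"])
sim_far = cosine_similarity(vectors["excellent"], vectors["boring"])
print(sim_close, sim_far)
```

With these values the two positive words score close to 1, while the positive and negative pair scores below 0.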

### Implementation of Word Embedding with Keras
> To implement word embeddings, the Keras library contains a layer called ``Embedding()``. The embedding layer is implemented in the form of a class in Keras and is normally used as a first layer in the sequential model for NLP tasks.

[Read More](https://stats.stackexchange.com/questions/270546/how-does-keras-embedding-layer-work)

> Embedding(200, 32, input_length=50)

* The first parameter in the embedding layer is the size of the vocabulary or the **total number of unique words in a corpus**. It must be larger than the largest integer index that appears in the input.
* The second parameter is the number of the **dimensions for each word vector**. For instance, if you want each word vector to have 32 dimensions, you will specify 32 as the second parameter.
* And finally, the third parameter is the **length of the input sentence**, that is, the number of word indices in each input sequence.
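
Conceptually, an embedding layer is a trainable lookup table. The NumPy sketch below mimics the forward pass for the `Embedding(200, 32, input_length=50)` example. It is a stand-in under that assumption, not the Keras implementation, and the batch size of 4 is arbitrary.

```python
import numpy as np

vocab_size, embedding_dim, input_length = 200, 32, 50

rng = np.random.default_rng(0)
# Stand-in for the layer's trainable weight matrix: one row per word index
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 4 integer-encoded sentences, each 50 indices long
batch = rng.integers(low=0, high=vocab_size, size=(4, input_length))

# The forward pass is a row lookup: each index is replaced by its vector
output = embedding_matrix[batch]

print(output.shape)           # (4, 50, 32)
print(embedding_matrix.size)  # 6400 trainable values
```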

### Custom Word Embeddings
> We are going to create our custom word embedding.

In [2]:
import numpy as np
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
from tensorflow import keras

### Data

In [4]:
corpus = [
    'This is an excellent movie',
    'The move was fantastic I like it',
    'You should watch it is brilliant',
    'Exceptionally good',
    'Wonderfully directed and executed I like it',
    'Its a fantastic series',
    'Never watched such a brillent movie',
    'It is a Wonderful movie',
    "horrible acting",
    'waste of money',
    'pathetic picture',
    'It was very boring',
    'I did not like the movie',
    'The movie was horrible',
    'I will not recommend',
    'The acting is pathetic'
]
sentiments = np.array([1 if i< 8 else 0 for i in range(16)])
sentiments

array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

> The first `8` are positive `reviews` about the movie and the last `8` are negative reviews.

> A sentiment of `0` means a negative review about the movie and `1` means a positive review. As we know, the `Embedding()` layer takes the size of the `vocabulary`, or the number of `unique` words. We want to find the total number of `unique` words in the corpus.

In [5]:
from nltk.tokenize import word_tokenize

In [8]:
all_words = []
for sent in corpus:
    words = word_tokenize(sent)
    for word in words:
        all_words.append(word)
len(all_words)

71

In [9]:
unique_words = list(set(all_words))
len(unique_words)

45

> The Embedding layer expects the words to be in **numeric form**. Therefore, we need to convert the sentences in our corpus to numbers. One way to convert text to numbers is by using the ``one_hot`` function from the ``keras.preprocessing.text`` module. The function takes a ``sentence`` and the ``total length of the vocabulary`` and returns the ``sentence in numeric form``. Despite its name, ``one_hot`` hashes each word to an integer index, so two different words can occasionally receive the same index.

In [10]:
voc_len = len(unique_words) + 5  # unique words plus 5 extra slots as headroom

In [15]:
embedded_sentences = [one_hot(sent, voc_len) for sent in corpus]
embedded_sentences

[[11, 29, 1, 13, 23],
 [40, 1, 46, 8, 26, 41, 36],
 [21, 12, 38, 36, 29, 41],
 [19, 27],
 [28, 42, 12, 21, 26, 41, 36],
 [28, 16, 8, 18],
 [45, 2, 4, 16, 47, 23],
 [36, 29, 16, 36, 23],
 [16, 8],
 [25, 47, 24],
 [17, 43],
 [36, 46, 47, 19],
 [26, 49, 15, 41, 40, 23],
 [40, 23, 46, 16],
 [26, 38, 15, 33],
 [40, 8, 29, 17]]
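
In the output above, "It" and "Wonderful" in the eighth sentence both map to `36`, which is a hash collision. The sketch below shows roughly how hashing-based encoding of this kind works. `hash_encode` is a hypothetical helper written for this example, and it uses MD5 so the result is stable across runs, so its indices will not match the ones shown above.

```python
import hashlib

def hash_encode(sentence, n):
    """Map each word to an integer in [1, n - 1] with a stable hash (illustrative helper)."""
    encoded = []
    for word in sentence.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is left unused so it can serve as the padding value later
        encoded.append(int(digest, 16) % (n - 1) + 1)
    return encoded

sample = ["This is an excellent movie", "The movie was horrible"]
encoded = [hash_encode(sentence, 50) for sentence in sample]
print(encoded)
```

The same word always gets the same index ("movie" in both sentences), but because the index space is small, unrelated words may share one too.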

> The embedding layer expects sentences to be of equal size. However, our encoded sentences are of different sizes. One way to make all the sentences a uniform size is to extend every sentence to the length of the longest sentence. Let's first find the longest sentence in our corpus, and then we will pad all the other sentences to that length.

In [19]:
word_count = lambda sentence: len(word_tokenize(sentence))
longest_sentence = max(corpus, key=word_count)
len_longest_sentence = len(word_tokenize(longest_sentence))
len_longest_sentence

7

> To give every sentence the same length, we pad the sentences shorter than `7` with `0`s using `pad_sequences`. The first parameter is the list of **encoded sentences of unequal sizes**, the second parameter is the **size of the longest sentence** (the target length, `maxlen`), and the last parameter is **padding**, where `post` adds the padding at the end of each sentence.

In [20]:
padded_sents = pad_sequences(embedded_sentences, len_longest_sentence, padding="post")

In [21]:
padded_sents

array([[11, 29,  1, 13, 23,  0,  0],
       [40,  1, 46,  8, 26, 41, 36],
       [21, 12, 38, 36, 29, 41,  0],
       [19, 27,  0,  0,  0,  0,  0],
       [28, 42, 12, 21, 26, 41, 36],
       [28, 16,  8, 18,  0,  0,  0],
       [45,  2,  4, 16, 47, 23,  0],
       [36, 29, 16, 36, 23,  0,  0],
       [16,  8,  0,  0,  0,  0,  0],
       [25, 47, 24,  0,  0,  0,  0],
       [17, 43,  0,  0,  0,  0,  0],
       [36, 46, 47, 19,  0,  0,  0],
       [26, 49, 15, 41, 40, 23,  0],
       [40, 23, 46, 16,  0,  0,  0],
       [26, 38, 15, 33,  0,  0,  0],
       [40,  8, 29, 17,  0,  0,  0]])
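
For intuition, post-padding can be sketched in a few lines of NumPy. `pad_post` is a hypothetical, simplified stand-in for `pad_sequences`. One difference to be aware of: it truncates over-long sequences at the end, whereas Keras truncates from the front by default.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad (or truncate) each sequence to maxlen (simplified illustration)."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]
        padded[row, :len(trimmed)] = trimmed
    return padded

# Two of the encoded sentences from above
example = pad_post([[11, 29, 1, 13, 23], [19, 27]], maxlen=7)
print(example)
```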

### Creating a Simple Model

In [30]:
model = keras.Sequential([
    tf.keras.layers.Embedding(voc_len, 20, input_length=len_longest_sentence),
    keras.layers.Flatten(),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation='sigmoid')
])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 7, 20)             1000      
_________________________________________________________________
flatten_2 (Flatten)          (None, 140)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 16)                2256      
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 17        
Total params: 3,273
Trainable params: 3,273
Non-trainable params: 0
_________________________________________________________________


> We create a ``Sequential`` model and add the ``Embedding`` layer as its first layer. The vocabulary size is specified by ``voc_len``. The dimension of each word vector is ``20`` and the ``input_length`` is the length of the longest sentence, which is ``7``. Next, the output of the ``Embedding`` layer is flattened so that it can be fed directly to the densely connected layer. Since this is a ``binary classification`` problem, we use the ``sigmoid`` function as the activation function of the final dense layer.
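
The parameter counts in the summary can be checked by hand. The arithmetic below uses `voc_len = 50` (45 unique words plus 5) and the layer sizes from the model definition.

```python
voc_len, embedding_dim, input_length = 50, 20, 7

embedding_params = voc_len * embedding_dim     # one 20-d vector per word index
flattened_size = input_length * embedding_dim  # 7 * 20 = 140
dense_1_params = flattened_size * 16 + 16      # weights + biases
dense_2_params = 16 * 1 + 1                    # weights + bias

total = embedding_params + dense_1_params + dense_2_params
print(embedding_params, dense_1_params, dense_2_params, total)  # 1000 2256 17 3273
```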

### Compiling the Model

In [49]:
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # `lr` is a deprecated alias
              loss=keras.losses.BinaryCrossentropy(),
              metrics=['accuracy']
             )
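
For reference, binary cross-entropy averages the negative log-probability assigned to the true label. The NumPy sketch below is a simplified version of that formula, not the Keras implementation, and the probabilities are made up for the example.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with clipping to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return float(-np.mean(losses))

# Made-up sigmoid outputs for one positive and one negative review
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confident correct predictions give a loss near 0, while confident wrong ones are penalised heavily.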

### Training the Model

> First, we shuffle the dataset and then split it into training and test sets as usual.

In [50]:
data = np.column_stack([padded_sents, sentiments])
data

array([[11, 29,  1, 13, 23,  0,  0,  1],
       [40,  1, 46,  8, 26, 41, 36,  1],
       [21, 12, 38, 36, 29, 41,  0,  1],
       [19, 27,  0,  0,  0,  0,  0,  1],
       [28, 42, 12, 21, 26, 41, 36,  1],
       [28, 16,  8, 18,  0,  0,  0,  1],
       [45,  2,  4, 16, 47, 23,  0,  1],
       [36, 29, 16, 36, 23,  0,  0,  1],
       [16,  8,  0,  0,  0,  0,  0,  0],
       [25, 47, 24,  0,  0,  0,  0,  0],
       [17, 43,  0,  0,  0,  0,  0,  0],
       [36, 46, 47, 19,  0,  0,  0,  0],
       [26, 49, 15, 41, 40, 23,  0,  0],
       [40, 23, 46, 16,  0,  0,  0,  0],
       [26, 38, 15, 33,  0,  0,  0,  0],
       [40,  8, 29, 17,  0,  0,  0,  0]])

In [51]:
np.random.shuffle(data)
data

array([[40, 23, 46, 16,  0,  0,  0,  0],
       [16,  8,  0,  0,  0,  0,  0,  0],
       [28, 16,  8, 18,  0,  0,  0,  1],
       [36, 46, 47, 19,  0,  0,  0,  0],
       [45,  2,  4, 16, 47, 23,  0,  1],
       [40,  8, 29, 17,  0,  0,  0,  0],
       [26, 38, 15, 33,  0,  0,  0,  0],
       [19, 27,  0,  0,  0,  0,  0,  1],
       [21, 12, 38, 36, 29, 41,  0,  1],
       [17, 43,  0,  0,  0,  0,  0,  0],
       [11, 29,  1, 13, 23,  0,  0,  1],
       [26, 49, 15, 41, 40, 23,  0,  0],
       [36, 29, 16, 36, 23,  0,  0,  1],
       [28, 42, 12, 21, 26, 41, 36,  1],
       [40,  1, 46,  8, 26, 41, 36,  1],
       [25, 47, 24,  0,  0,  0,  0,  0]])

In [52]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, random_state=33, test_size= .2)

In [53]:
# The first 7 columns are the padded word indices; the last column is the label
X_train = train[:, :7]
X_test = test[:, :7]

y_test = test[:, -1]
y_train = train[:, -1]

In [64]:
EPOCHS = 3
BATCH_SIZE = 10
VALIDATION_SET = (X_test, y_test)
model.fit(X_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE, verbose=2, validation_data=VALIDATION_SET)

Epoch 1/3
2/2 - 0s - loss: 0.0722 - accuracy: 1.0000 - val_loss: 0.5488 - val_accuracy: 0.7500
Epoch 2/3
2/2 - 0s - loss: 0.0693 - accuracy: 1.0000 - val_loss: 0.5456 - val_accuracy: 0.7500
Epoch 3/3
2/2 - 0s - loss: 0.0669 - accuracy: 1.0000 - val_loss: 0.5420 - val_accuracy: 0.7500


<tensorflow.python.keras.callbacks.History at 0x1cb4c5836a0>

### Making Predictions

In [61]:
np.round(model.predict(X_test[:])), y_test

(array([[0.],
        [1.],
        [0.],
        [1.]], dtype=float32),
 array([0, 1, 0, 0]))
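
To compare the rounded predictions with the true labels programmatically, a short NumPy check is enough. The arrays below are copied from the output above.

```python
import numpy as np

# Values copied from the prediction output above
predicted = np.array([[0.], [1.], [0.], [1.]], dtype=np.float32)
actual = np.array([0, 1, 0, 0])

# Flatten the (4, 1) predictions to shape (4,) before comparing
predicted_labels = predicted.ravel().astype(int)
accuracy = float((predicted_labels == actual).mean())
print(accuracy)  # 0.75
```

Three of the four test reviews are classified correctly, which matches the `val_accuracy` of `0.7500` reported during training.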

> Those are the basics of `custom word embeddings`.

### Loading Pretrained Word Embeddings

[Docs](https://stackabuse.com/python-for-nlp-word-embeddings-for-deep-learning-in-keras/)