### Word Embeddings

In this notebook we are going to learn how to create and work with word embedding vectors. We are going to create a simple classification model based on the IMDB dataset and take a quick look at the word embeddings it learns.



> Since this notebook focuses on training our own word embeddings rather than on the model architecture and other details, I will explain only some of the code cells and skip the rest.


### Imports


In [1]:
import os, re, shutil, string
from datetime import datetime

from tensorflow import keras
import tensorflow as tf

tf.__version__

'2.6.0'

### Dataset 

We are going to use the `get_file` method from keras utils to download the [IMDB movie review dataset](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz), and we will load it using `text_dataset_from_directory`, following [this tutorial](https://www.tensorflow.org/tutorials/keras/text_classification).

In [2]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                    untar=True, cache_dir='.',
                                    cache_subdir='')
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
os.listdir(dataset_dir)

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


['README', 'train', 'imdb.vocab', 'test', 'imdbEr.txt']

After the above cell has executed, we will have the following folder structure:

```
aclImdb
  test
    pos
    neg
  train
    pos
    neg
    unsup
```

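We want to remove the `unsup` directory from the train folder, because `text_dataset_from_directory` treats every subdirectory as a class and we only want the `pos` and `neg` classes.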

In [3]:
train_dir = os.path.join(dataset_dir, "train")
test_dir = os.path.join(dataset_dir, "test")

unsup_dir = os.path.join(train_dir, "unsup")
if os.path.exists(unsup_dir):
  shutil.rmtree(unsup_dir)


In [4]:
BATCH_SIZE = 1024
SEED = 42


In [5]:
tf.random.set_seed(SEED)

In [6]:
train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=BATCH_SIZE, validation_split=0.2, 
    subset='training', seed=SEED)
val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=BATCH_SIZE, validation_split=0.2, 
    subset='validation', seed=SEED)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.


### Checking some examples.

In [7]:
for text_batch, label_batch in train_ds.take(1):
  for i in range(5):
    print(label_batch[i].numpy(), text_batch.numpy()[i])

1 b"I first saw Rob Roy twelve years ago. With little money for entertainment, I rented it for my fianc\xc3\xa9 and I to watch on a bone chilling winter's night. The movie I had wanted was gone, so I rented this instead, not expecting much, and was very much surprised with how good it was. I just recently watched it again, and loved it every bit as much as the first time. <br /><br />For those unfamiliar with the story, it's about Scottish outlaw Robert Roy MacGregor, a cattleman and folk hero. From the little I know about the man and his story, liberties have been taken with the facts, but it's a movie, not a textbook, and so the filmmakers can be excused. Basically, the plot of the movie is that Rob Roy borrows money from the Marquis of Montrose to buy cattle which he then intends to sell and reap a large profit from. But, his plan is foiled when the friend entrusted with the money is robbed of the cash and murdered in the forest. Our hero finds himself on the run after failing to se

### Configuring the dataset for performance


In [8]:
AUTOTUNE = tf.data.experimental.AUTOTUNE
train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

### Text preprocessing

Next, define the dataset preprocessing steps required for our sentiment classification model. Initialize a TextVectorization layer with the desired parameters to vectorize movie reviews. 


* [Text Classification](https://www.tensorflow.org/tutorials/keras/text_classification).

In [9]:
# Create a custom standardization function to strip HTML break tags '<br />'.
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '')
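To see what this standardization does without running TensorFlow, here is a minimal plain-Python sketch of the same three steps (lowercase, replace `<br />` with a space, strip punctuation). The name `standardize_py` and the sample review are made up for illustration; the notebook itself uses the `tf.strings` version above.

```python
import re
import string


def standardize_py(text: str) -> str:
    """Plain-Python analogue of custom_standardization (illustrative only)."""
    lowered = text.lower()
    # Replace HTML line breaks with a space so neighbouring words do not merge.
    no_breaks = lowered.replace("<br />", " ")
    # Remove every ASCII punctuation character.
    return re.sub("[%s]" % re.escape(string.punctuation), "", no_breaks)


sample = "Great movie!<br /><br />Loved it."
print(repr(standardize_py(sample)))
```

Note that each `<br />` becomes its own space, so two consecutive tags leave two spaces behind. That should be harmless here, since the text is split on whitespace afterwards.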


In [10]:
# Vocabulary size and number of words in a sequence.
vocab_size = 10000
sequence_length = 100

In [12]:
# Use the text vectorization layer to normalize, split, and map strings to
# integers. Note that the layer uses the custom standardization defined above.
# Set output_sequence_length because the samples do not all have the same length.
vectorize_layer = keras.layers.experimental.preprocessing.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)
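Conceptually, `adapt` counts tokens and keeps the most frequent ones, and the layer then maps each token to its index, padding or truncating to a fixed length. The sketch below mimics that idea in plain Python on a toy corpus. `build_vocab` and `vectorize` are hypothetical helpers, not part of Keras, and the real layer may differ in details such as tie-breaking between equally frequent tokens.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Keep the most frequent tokens; index 0 is padding, index 1 is unknown."""
    counts = Counter(token for text in texts for token in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(text, vocab, sequence_length):
    """Map tokens to indices, then truncate or zero-pad to sequence_length."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(token, 1) for token in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


corpus = ["the movie was good", "the movie was bad", "the end"]
toy_vocab = build_vocab(corpus, max_tokens=5)
print(toy_vocab)
print(vectorize("the movie was great fun", toy_vocab, sequence_length=8))
```

Words outside the vocabulary ("great", "fun") map to index 1, and the unused positions at the end are filled with index 0.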


In [15]:
vectorize_layer.vocabulary_size()

10000

### Creating a model.


* The [`TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization) layer transforms strings into vocabulary indices. We have already initialized `vectorize_layer` as a TextVectorization layer and built its vocabulary by calling `adapt` on `text_ds`. Now `vectorize_layer` can be used as the first layer of our end-to-end classification model, feeding transformed strings into the Embedding layer.

* The [`Embedding`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: `(batch, sequence, embedding)`.

* The [`GlobalAveragePooling1D`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GlobalAveragePooling1D) layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.

* The fixed-length output vector is piped through a fully-connected ([`Dense`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense)) layer with 16 hidden units.

* The last layer is densely connected with a single output node. 
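The embedding lookup and the pooling step described above can be sketched in plain Python for a single example. The 4-word, 2-dimensional table below is invented for illustration; `embed` and `global_average_pool` are hypothetical helpers, not Keras functions.

```python
def embed(ids, table):
    """Look up one embedding vector (a row of the table) per word index."""
    return [table[i] for i in ids]


def global_average_pool(vectors):
    """Average over the sequence dimension, giving one fixed-length vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy embedding table: one row per vocabulary index, 2 dimensions each.
table = [
    [0.0, 0.0],  # index 0: padding
    [1.0, 1.0],  # index 1: [UNK]
    [2.0, 4.0],  # index 2
    [4.0, 8.0],  # index 3
]

ids = [2, 3, 0, 0]                    # a padded sequence of length 4
sequence = embed(ids, table)          # shape (sequence, embedding) = (4, 2)
pooled = global_average_pool(sequence)
print(pooled)  # [1.5, 3.0]
```

The padded positions take part in the average as well. As far as I can tell, this matches the model in this notebook, because the `Embedding` layer is created without `mask_zero=True` and so passes no mask to the pooling layer.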



In [17]:
embedding_dim=16

model = keras.Sequential([
  vectorize_layer,
  keras.layers.Embedding(vocab_size, embedding_dim, name="embedding"),
  keras.layers.GlobalAveragePooling1D(),
  keras.layers.Dense(16, activation='relu'),
  keras.layers.Dense(1)
], name="my_model")

# Compiling the model

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
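Because the last `Dense(1)` layer has no activation, the model outputs a raw score (a logit) rather than a probability, which is why the loss is created with `from_logits=True`. The sketch below shows, in plain Python, how a logit relates to a probability and to the binary cross-entropy. The helper names are hypothetical, and the stable formula is the standard one for sigmoid cross-entropy, written here as an illustration rather than as Keras's exact code.

```python
import math


def sigmoid(z):
    """Convert a logit into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(y_true, z):
    """Binary cross-entropy computed directly from a logit (numerically stable)."""
    return max(z, 0.0) - z * y_true + math.log1p(math.exp(-abs(z)))


logit = 2.0
print(sigmoid(logit))            # probability of the positive class
print(bce_from_logit(1, logit))  # small loss: the prediction agrees with the label
print(bce_from_logit(0, logit))  # larger loss: the prediction disagrees
```

To turn the trained model's predictions into probabilities, apply a sigmoid to its outputs in the same way.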


In [20]:
model.summary()

Model: "my_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, 100)               0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 16)           160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
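The parameter counts in the summary can be checked by hand: the embedding layer stores one vector per vocabulary entry, and each dense layer has one weight per input-output pair plus one bias per output. A quick check of that arithmetic (the variable names here are only for illustration):

```python
vocab_size = 10000
embedding_dim = 16
hidden_units = 16

embedding_params = vocab_size * embedding_dim                # one 16-d vector per word
dense_params = embedding_dim * hidden_units + hidden_units   # weights + biases
output_params = hidden_units * 1 + 1                         # weights + bias

total = embedding_params + dense_params + output_params
print(embedding_params, dense_params, output_params, total)
# 160000 272 17 160289
```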


### Training the model.

In [19]:
model.fit(
    train_ds,
    validation_data=val_ds, 
    epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7fa0ed15a490>

### Retrieving the trained word embeddings and saving them to disk.

Next we will retrieve the word embeddings learned during model training and save them to disk. The weights matrix is of shape `(vocab_size, embedding_dimension)`.

In [21]:
vocab = vectorize_layer.get_vocabulary()
print(vocab[:10])

['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it']


In [24]:
assert len(vocab) == vectorize_layer.vocabulary_size() == 10_000, "failed equality" 

### Getting the embedding weights

In [25]:
weights = model.get_layer("embedding").get_weights()[0]
weights.shape

(10000, 16)

In [26]:
weights[:2]

array([[ 0.02068263,  0.01661886,  0.02658953, -0.03932418,  0.01821075,
         0.1190527 ,  0.0630055 ,  0.09768607,  0.04270636, -0.01584527,
        -0.01829978, -0.06380946, -0.01145105,  0.12682624, -0.01018622,
         0.00402084],
       [ 0.00280237,  0.06672003,  0.04118211, -0.00034019,  0.04540986,
         0.13215236,  0.00456196,  0.03950495,  0.12480214, -0.02775864,
         0.050193  , -0.19038193, -0.04517735,  0.07841194,  0.08379428,
        -0.02369561]], dtype=float32)

### Using the embedding projector.
We will write the weights to disk so that we can use the [Embedding Projector](http://projector.tensorflow.org/). The files will be in tab-separated (`tsv`) format. The two files are:

1. a file of vectors (containing the embeddings),
2. a file of metadata (containing the words).

In [28]:
# Use context managers so both files are closed even if writing fails.
with open('vecs.tsv', 'w', encoding='utf-8') as out_v, \
     open('meta.tsv', 'w', encoding='utf-8') as out_m:
  for num, word in enumerate(vocab):
    if num == 0:
      continue  # skipping the padding token from the vocab
    vec = weights[num]
    out_m.write(word + "\n")
    out_v.write('\t'.join(str(x) for x in vec) + "\n")
print("Done...")


Done...
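Before uploading, it can help to confirm that the two files line up: the same number of rows in each, and the same number of columns in every vector row. The sketch below does this with the standard library only. `check_projector_files` is a hypothetical helper, and the tiny files it writes to a temporary directory are stand-ins for the real `vecs.tsv` and `meta.tsv`.

```python
import csv
import os
import tempfile


def check_projector_files(vecs_path, meta_path):
    """Return (rows, dims) if the files are consistent, else raise ValueError."""
    with open(vecs_path, encoding="utf-8", newline="") as f:
        vectors = [[float(x) for x in row] for row in csv.reader(f, delimiter="\t")]
    with open(meta_path, encoding="utf-8") as f:
        words = [line.rstrip("\n") for line in f]
    if len(vectors) != len(words):
        raise ValueError(f"{len(vectors)} vectors but {len(words)} words")
    dims = {len(vector) for vector in vectors}
    if len(dims) != 1:
        raise ValueError(f"inconsistent vector lengths: {sorted(dims)}")
    return len(vectors), dims.pop()


# Toy stand-ins for the real files, written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
toy_vecs = os.path.join(tmp_dir, "vecs.tsv")
toy_meta = os.path.join(tmp_dir, "meta.tsv")
with open(toy_vecs, "w", encoding="utf-8") as f:
    f.write("0.1\t0.2\n0.3\t0.4\n")
with open(toy_meta, "w", encoding="utf-8") as f:
    f.write("[UNK]\nthe\n")

print(check_projector_files(toy_vecs, toy_meta))  # (2, 2)
```

On the files written by this notebook, the same call should report 9999 rows (the vocabulary minus the padding token) and 16 dimensions.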


In [29]:
from google.colab import files

In [30]:
files.download('vecs.tsv')
files.download('meta.tsv')


### Visualize the embeddings

To visualize the embeddings, upload them to the embedding projector.

Open the [Embedding Projector](http://projector.tensorflow.org/) (this can also run in a local TensorBoard instance).

* Click on "Load data".

* Upload the two files you created above: `vecs.tsv` and `meta.tsv`.

The embeddings you have trained will now be displayed. You can search for words to find their closest neighbors. For example, try searching for "beautiful". You may see neighbors like "wonderful". 
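The same kind of neighbor search can be approximated offline with cosine similarity. The sketch below uses plain Python and a tiny invented embedding table, so the words and numbers are purely illustrative; `cosine` and `nearest` are hypothetical helpers. With the notebook's own data, a mapping such as `dict(zip(vocab, weights))` could take the place of `toy_embeddings`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(word, embeddings, k=2):
    """Return the k words whose vectors are most similar to the given word's."""
    query = embeddings[word]
    scored = [(cosine(query, vector), other)
              for other, vector in embeddings.items() if other != word]
    return [other for _, other in sorted(scored, reverse=True)[:k]]


# Invented 3-dimensional vectors, for illustration only.
toy_embeddings = {
    "beautiful": [0.9, 0.8, 0.1],
    "wonderful": [0.8, 0.9, 0.2],
    "terrible": [-0.7, -0.9, 0.1],
    "boring": [-0.6, -0.5, 0.3],
}

print(nearest("beautiful", toy_embeddings, k=1))  # ['wonderful']
```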

Note: Experimentally, you may be able to produce more interpretable embeddings by using a simpler model. Try deleting the `Dense(16)` layer, retraining the model, and visualizing the embeddings again.

Note: Typically, a much larger dataset is needed to train more interpretable word embeddings. This tutorial uses a small IMDb dataset for the purpose of demonstration.


### Saving word embeddings in the way `glove.6B` does. 

We are going to save these word embeddings the same way we did [at the end of this notebook](https://github.com/CrispenGari/keras-api/blob/main/14_NLP/00_Sentiment_Analyisis/00_Sentiment_Analysis_With_A_Closer_Look_Plus_Embeddings.ipynb), where the `txt` file of word vectors looks as follows:

```txt
the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658
to 0.68047 -0.039263 0.30186 -0.17792 0.42962 0.032246 -0.41376 0.13228 -0.29847 -0.085253 0.17118 0.22419 -0.10046 -0.4365
```

In [31]:
# Each line holds a word followed by its space-separated vector components.
with open('word-embeddings.16d.txt', 'w', encoding='utf-8') as out_v:
  for num, word in enumerate(vocab):
    if num == 0:
      continue  # skipping the padding token from the vocab
    vec = weights[num]
    out_v.write(f"{word} ")
    out_v.write(' '.join(str(x) for x in vec) + "\n")
print("Done...")

Done...
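A file in this format can be read back into a word-to-vector dictionary with a few lines of plain Python. This is a minimal sketch: `load_embeddings` is a hypothetical helper, and the two-line file it reads is a toy stand-in for `word-embeddings.16d.txt`, written to a temporary directory.

```python
import os
import tempfile


def load_embeddings(path):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings


# Toy stand-in for the real file.
toy_path = os.path.join(tempfile.mkdtemp(), "toy-embeddings.2d.txt")
with open(toy_path, "w", encoding="utf-8") as f:
    f.write("[UNK] 0.1 0.2\nthe 0.3 -0.4\n")

loaded = load_embeddings(toy_path)
print(loaded["the"])  # [0.3, -0.4]
```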


In [32]:
files.download('word-embeddings.16d.txt')
