<a href="https://colab.research.google.com/github/JpChii/ML-Projects/blob/main/NLP_word_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Embeddings

Machine learning models like numbers, and they really don't like anything else. To get useful information out of text with machine learning, the first step is to convert the text into numbers. This post discusses a few methods for doing that.

## One-Hot Encoding

One-hot encoding turns each word into a vector of numbers. Create a zero vector whose length is the size of the vocabulary (the number of unique words in the data) and assign `1` at the index of the word. The result is a sparse vector, meaning most of its entries are zeros. To encode a sentence, concatenate the one-hot encodings of its words.

Consider a vocabulary of 15,000 words: each one-hot vector holds a single `1` and 14,999 zeros, so more than 99.99% of the encoded data is zeros, which is really inefficient for training.
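To make the sparsity concrete, here is a minimal plain-Python sketch. The five-word vocabulary and the `one_hot` helper are made up for illustration and are not part of the notebook's pipeline.

```python
# Toy vocabulary; the words are invented for illustration.
vocab = ["the", "movie", "was", "great", "boring"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return a list of zeros with a single 1 at the word's index."""
    vector = [0] * len(word_to_index)
    vector[word_to_index[word]] = 1
    return vector

# Encode a sentence as one one-hot vector per word.
sentence = ["the", "movie", "was", "great"]
encoded = [one_hot(w, word_to_index) for w in sentence]
for word, vector in zip(sentence, encoded):
    print(f"{word:>6} -> {vector}")

# Fraction of zeros in a single one-hot vector for a 15,000-word vocabulary.
vocab_size = 15_000
zero_fraction = (vocab_size - 1) / vocab_size
print(f"Zeros per vector: {zero_fraction:.4%}")  # Zeros per vector: 99.9933%
```

Each vector grows with the vocabulary while still carrying only one non-zero entry, which is where the inefficiency comes from.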

## Integer Encoding with unique numbers

Let's switch to using a unique number for each word in the vocabulary. This is more efficient than the approach above because we get a dense vector instead of a sparse one.

But with this method we lose information the model needs to make sense of the data: the relationship between words. The integer assigned to a word is arbitrary, so similar words do not get similar encodings. This makes integer encoding challenging for models to interpret, and a feature-weight combination learned on these integers is not meaningful.

This is where **embeddings** come in.
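The following sketch shows why the integers carry no meaning. The vocabulary, the alphabetical id assignment, and reserving `0` for padding are all assumptions chosen for illustration.

```python
# Toy vocabulary in alphabetical order; ids start at 1 (0 is assumed to be padding).
vocab = ["boring", "great", "movie", "the", "was", "wonderful"]
word_to_id = {word: i + 1 for i, word in enumerate(vocab)}

sentence = "the movie was great".split()
encoded = [word_to_id[w] for w in sentence]
print(encoded)  # [4, 3, 5, 2]

# The numeric gap between ids says nothing about meaning:
gap_similar = abs(word_to_id["great"] - word_to_id["wonderful"])
gap_opposite = abs(word_to_id["great"] - word_to_id["boring"])
print(gap_similar, gap_opposite)  # 4 1
```

Here "great" sits numerically closer to "boring" than to "wonderful", so a weight multiplied by these ids would be acting on an arbitrary ordering.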

## Word Embeddings

A word embedding is an efficient, dense representation in which similar words have similar encodings. Embeddings are floating-point vectors whose values don't need to be set manually. Instead, the values are learned during training, in the same way as the weights of a dense layer. The length of the vector is a parameter you specify.

The embedding length commonly ranges from 8 dimensions for small datasets up to 1024 dimensions for large ones. A higher-dimensional embedding can capture finer relationships between words but requires more data to learn.

In this notebook, we'll use the [IMDB review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) to practice word embedding and then perform sentiment analysis on the data.
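As a rough illustration of "similar words have similar encodings", the sketch below compares vectors with cosine similarity. The 4-dimensional values are hand-picked for the example, not learned, and `cosine_similarity` is a helper defined here rather than something the notebook uses.

```python
import math

# Hand-picked 4-dimensional vectors, invented for illustration only.
embeddings = {
    "great":     [0.9, 0.8, 0.1, 0.0],
    "wonderful": [0.8, 0.9, 0.2, 0.1],
    "boring":    [-0.7, -0.8, 0.1, 0.3],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_close = cosine_similarity(embeddings["great"], embeddings["wonderful"])
sim_far = cosine_similarity(embeddings["great"], embeddings["boring"])
print(f"great vs wonderful: {sim_close:.3f}")
print(f"great vs boring:    {sim_far:.3f}")
```

With trained embeddings the same comparison is what tools like the Embedding Projector use to find a word's nearest neighbours.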

In [None]:
# Setting up imports
import io
import os
import re
import shutil
import string
import tensorflow as tf

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [None]:
# Downloading the dataset
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                  untar=True,
                                  cache_dir='.',
                                  cache_subdir='')

Downloading data from http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [None]:
dataset_dir = os.path.join(os.path.dirname(dataset), "aclImdb")
dataset_dir

'./aclImdb'

In [None]:
os.listdir(dataset_dir)

['train', 'imdbEr.txt', 'README', 'imdb.vocab', 'test']

In [None]:
!wget https://raw.githubusercontent.com/JpChii/ML-Tools/main/dl_helper.py

--2021-06-01 14:47:42--  https://raw.githubusercontent.com/JpChii/ML-Tools/main/dl_helper.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17979 (18K) [text/plain]
Saving to: ‘dl_helper.py’


2021-06-01 14:47:42 (53.9 MB/s) - ‘dl_helper.py’ saved [17979/17979]



In [None]:
from dl_helper import walk_through_dir

In [None]:
walk_through_dir(dataset_dir)

There are 2 directories and 3 images in './aclImdb'.
There are 3 directories and 5 images in './aclImdb/train'.
There are 0 directories and 50000 images in './aclImdb/train/unsup'.
There are 0 directories and 12500 images in './aclImdb/train/pos'.
There are 0 directories and 12500 images in './aclImdb/train/neg'.
There are 2 directories and 3 images in './aclImdb/test'.
There are 0 directories and 12500 images in './aclImdb/test/pos'.
There are 0 directories and 12500 images in './aclImdb/test/neg'.


We have positive and negative reviews in separate folders in the train directory. In addition, there is `/aclImdb/train/unsup`, which is not needed for training, so this directory will be deleted next.

In [None]:
# Setting up train directory
train_dir = os.path.join(dataset_dir, "train")
walk_through_dir(train_dir)

There are 3 directories and 5 images in './aclImdb/train'.
There are 0 directories and 50000 images in './aclImdb/train/unsup'.
There are 0 directories and 12500 images in './aclImdb/train/pos'.
There are 0 directories and 12500 images in './aclImdb/train/neg'.


In [None]:
# Removing unsup directory
remove_dir = os.path.join(train_dir, "unsup")
shutil.rmtree(remove_dir)

## Creating dataset

Creating a [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) using [tf.keras.preprocessing.text_dataset_from_directory](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text_dataset_from_directory).

In [None]:
batch_size = 32
seed = 42
train_ds = tf.keras.preprocessing.text_dataset_from_directory(directory=train_dir,
                                                              batch_size=batch_size,
                                                              label_mode="binary",
                                                              validation_split=0.2,
                                                              subset='training',
                                                              seed=seed)

val_ds = tf.keras.preprocessing.text_dataset_from_directory(directory=train_dir,
                                                            batch_size=batch_size,
                                                            label_mode="binary",  # same label format as train_ds
                                                            validation_split=0.2,
                                                            subset='validation',
                                                            seed=seed)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.


Great, now the data is loaded. Let's check a few examples for a better understanding.

In [None]:
for text_batch, label_batch in train_ds.take(1):
  for i in range(5):
    if label_batch[i].numpy() == [0.]:
      print(f"Label: {label_batch[i].numpy()}")
      print(f"Text: {text_batch[i].numpy()}\n")

Label: [0.]
Text: b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'

Label: [0.]
Text: b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as

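Going through a few random texts shows that `0` is negative and `1` is positive.

## Configure dataset for performance

`cache()` and `prefetch()` are used here: `cache()` keeps the data in memory after it is loaded off disk, and `prefetch()` overlaps data preprocessing with model execution during training. For more, check out the [data performance guide](https://www.tensorflow.org/guide/data_performance).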

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

### Using the embedding layer

Keras makes it easy with its [Embedding layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding).

The Embedding layer can be understood as a lookup table that maps integer indices (which stand for specific words) to their embeddings (dense vectors). The dimensionality (or width) of the embeddings is a parameter to experiment with to see what works well for the problem, similar to the number of neurons in a dense layer.

In [None]:
# Embed a 1,000 word vocabulary into 5 dimensions
embedding_layer = tf.keras.layers.Embedding(input_dim=1000, # length of the vocabulary
                                            output_dim=5) # Output shape of the embeddings

The weights of the embedding layer are initialized randomly, like those of any other layer. The weights are learned through backpropagation during training.

Once training is complete, similar words will have similar embeddings.

If you pass integers to an embedding layer, the result replaces each integer with the corresponding vector from the embedding table.

In [None]:
result = embedding_layer(tf.constant([1,2,3]))
result.numpy()

array([[-0.04985893, -0.02151388, -0.01218064, -0.01874789,  0.03003483],
       [ 0.03881696, -0.03618266,  0.04983819, -0.04446222,  0.03230344],
       [ 0.04521865, -0.0280888 , -0.02505544,  0.04856518,  0.02527351]],
      dtype=float32)
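One way to connect this back to one-hot encoding: looking up row `i` of the table gives the same result as multiplying a one-hot vector by the table, just without storing all the zeros. The sketch below checks this in plain Python; the table values and both helper functions are made up for illustration and do not reflect how Keras implements the layer internally.

```python
# A tiny embedding table: 4 "words", 3 dimensions. Values are arbitrary.
table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def lookup(index, table):
    """Embedding as a lookup: return the row for this index."""
    return table[index]

def one_hot_matmul(index, table):
    """The same result via a one-hot vector multiplied by the table."""
    one_hot = [1.0 if i == index else 0.0 for i in range(len(table))]
    n_dims = len(table[0])
    return [
        sum(one_hot[row] * table[row][col] for row in range(len(table)))
        for col in range(n_dims)
    ]

for idx in [1, 2, 3]:
    print(idx, lookup(idx, table), one_hot_matmul(idx, table))
```

The lookup does a single row read, while the one-hot route touches every row of the table, which is why the dense lookup is the cheaper of the two.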

For text or sequence problems, the Embedding layer takes a 2D tensor of integers, of shape `(samples, sequence length)` where each entry is a sequence of integers. It can embed sequences of variable lengths.

We can feed the embedding layer above batches of shape `(32, 10)` (a batch of 32 sequences of length 10).

The returned tensor has one more axis than the input; the embedding vectors are aligned along the new last axis. Pass it a `(2, 3)` input batch and the output is `(2, 3, N)`.

In [None]:
# Named input_batch to avoid shadowing Python's built-in input()
input_batch = tf.constant([[0,1,2], [3,4,5]])
print(f"Input shape: {input_batch.shape}")
embedding_result = embedding_layer(input_batch)
print(f"Embedded input shape: {embedding_result.shape}")

Input shape: (2, 3)
Embedded input shape: (2, 3, 5)


When passing a batch of sequences as input, an embedding layer returns a 3D floating point tensor of shape `(samples, sequence_length, embedding_dimensionality)`.
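The extra axis can be reproduced with nested lists. This is only a sketch of the shape behaviour: `embed_batch` and `shape` are hypothetical helpers, and the table holds placeholder values rather than trained weights.

```python
def embed_batch(batch, table):
    """Replace every integer in a 2D batch with its row from the table."""
    return [[table[token_id] for token_id in sequence] for sequence in batch]

def shape(nested):
    """Return the shape of a regular (non-ragged) nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

# 6 tokens, 5 dimensions each; row i is filled with the value i as a placeholder.
table = [[float(i)] * 5 for i in range(6)]
batch = [[0, 1, 2], [3, 4, 5]]

embedded = embed_batch(batch, table)
print(shape(batch))     # (2, 3)
print(shape(embedded))  # (2, 3, 5)
```

Every integer becomes a vector of length 5, so `(samples, sequence_length)` turns into `(samples, sequence_length, embedding_dimensionality)`.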

## Text Processing

So far, the embedding layer can only return embedding vectors for integer indices (one per word), and what we have right now is raw text. To convert the text into numbers, a step also called tokenization, we'll set up a `TextVectorization` layer.

In [None]:
# Create a custom standardization function: lowercase the text,
# strip HTML break tags, then remove punctuation
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation), '')
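To see what this standardization does to a review, here is a plain-Python mirror of the same three steps. `standardize_py` and the sample string are hypothetical; it uses `re` and `str` methods in place of `tf.strings`, so it is an approximation for inspection, not something to pass to `TextVectorization`.

```python
import re
import string

def standardize_py(text):
    """Lowercase, replace '<br />' with a space, then drop punctuation."""
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", stripped_html)

sample = "Great movie!<br /><br />Loved it."
cleaned = standardize_py(sample)
print(repr(cleaned))  # 'great movie  loved it'
```

The two break tags leave two spaces behind; that is harmless here because the next step splits on whitespace.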

In [None]:
# Vocabulary size and number of words in a sequence
vocab_size = 10000
sequence_length = 100

# Use the text vectorization layer to normalize, split, and map strings to integers
# It uses the custom_standardization function defined above
# Set a fixed output sequence length, since the reviews are not all the same length
vectorize_layer = TextVectorization(standardize=custom_standardization,
                                    max_tokens=vocab_size,
                                    output_mode='int',
                                    output_sequence_length=sequence_length)

In [None]:
# Make a text-only dataset (no labels) and call adapt to build the vocabulary
text_ds = train_ds.map(lambda x, y:x)
vectorize_layer.adapt(text_ds)
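The sketch below imitates, in plain Python, what `adapt` and the layer do in `int` mode: count words, keep the most frequent ones, reserve index `0` for padding and index `1` for out-of-vocabulary words, then pad or truncate to a fixed length. `build_vocab` and `vectorize` are hypothetical helpers and the texts are toy data; the real layer also applies the standardization step and handles details this sketch skips.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Keep the most frequent words; slots 0 and 1 are padding and unknown."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, sequence_length):
    """Map words to ids, then truncate or zero-pad to sequence_length."""
    word_to_id = {word: i for i, word in enumerate(vocab)}
    ids = [word_to_id.get(word, 1) for word in text.split()]  # 1 = unknown word
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

texts = ["great movie", "great film", "great movie ending"]
vocab = build_vocab(texts, max_tokens=4)
print(vocab)                                        # ['', '[UNK]', 'great', 'movie']
print(vectorize("great movie ending", vocab, 5))    # [2, 3, 1, 0, 0]
print(vectorize("great great movie film", vocab, 3))  # [2, 2, 3]
```

Short texts are padded with `0` and long ones are cut off, which is why every row reaching the embedding layer has the same length.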

Now let's build the model.

In [None]:
embedding_dim = 16

# Build model
model = Sequential([
                    vectorize_layer,
                    Embedding(vocab_size, embedding_dim, name="embedding"),
                    GlobalAveragePooling1D(),
                    Dense(16, activation="relu"),
                    Dense(1)
])

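# Compile the model
# The final Dense(1) has no activation, so the model outputs logits:
# the loss needs from_logits=True and accuracy is thresholded at 0.0
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.0)])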

In [None]:
# Fit the model
model.fit(train_ds,
          validation_data=val_ds,
          epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x7f46d51446d0>

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, 100)               0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 16)           160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________


## Retrieve the trained word embeddings and save them to disk

In [None]:
weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

Let's save them to disk and visualize them in the [Embedding Projector](http://projector.tensorflow.org/).

In [None]:
# Context managers close both files even if writing fails part-way
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
  for index, word in enumerate(vocab):
    if index == 0:
      continue  # skip 0, it's padding.
    vec = weights[index]
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")
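A quick way to sanity-check the two files is to read them back and confirm that each vector line lines up with one word. The sketch below does this on a tiny stand-in: the four-word `vocab`, the 2-dimensional `weights`, and the temporary directory are all assumptions made so the example runs without a trained model.

```python
import io
import os
import tempfile

# Hypothetical stand-ins for the trained weights and vocabulary.
vocab = ["", "[UNK]", "great", "movie"]
weights = [[0.0, 0.0], [0.1, -0.1], [0.5, 0.25], [-0.3, 0.75]]

with tempfile.TemporaryDirectory() as tmp_dir:
    vectors_path = os.path.join(tmp_dir, "vectors.tsv")
    metadata_path = os.path.join(tmp_dir, "metadata.tsv")

    # Write one tab-separated vector and one word per line, skipping padding.
    with io.open(vectors_path, "w", encoding="utf-8") as out_v, \
         io.open(metadata_path, "w", encoding="utf-8") as out_m:
        for index, word in enumerate(vocab):
            if index == 0:
                continue
            out_v.write("\t".join(str(x) for x in weights[index]) + "\n")
            out_m.write(word + "\n")

    # Read both files back while the temporary directory still exists.
    with io.open(vectors_path, encoding="utf-8") as f:
        vectors = [[float(x) for x in line.rstrip("\n").split("\t")] for line in f]
    with io.open(metadata_path, encoding="utf-8") as f:
        words = [line.rstrip("\n") for line in f]

print(len(vectors), len(words))  # 3 3
print(words[1], vectors[1])      # great [0.5, 0.25]
```

If the two line counts ever differ, the word labels in the projector will be shifted against their vectors.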

In [None]:
try:
  from google.colab import files
  files.download('vectors.tsv')
  files.download('metadata.tsv')
except Exception:
  pass
