# Design Pattern 2 - Embeddings (Chapter 2)

## Introduction to Design Pattern

In the previous pattern (Hashed Feature) we considered the case where one of our categorical input features has too many categories to handle sensibly with a one-hot encoding. There we used hashing, which converts the categorical values to integers and groups them somewhat arbitrarily into fewer buckets. It works reasonably well, but because the grouping is arbitrary it ignores any relationship between categories, and there are better alternatives.

Embeddings are a more sophisticated technique that maps each input to a low-dimensional vector of trainable weights, so that the relationships between inputs can be learned and preserved. In this notebook we use the built-in functionality of TensorFlow to show how to set up embeddings for categorical data, starting with a simple example from the original repo and following up with a real-world example that handles text data.

Note: this notebook is an introduction to embeddings only and does not explain how to train them within a deep neural network. Please look at the original example for more on this:

* https://github.com/GoogleCloudPlatform/ml-design-patterns/blob/master/02_data_representation/embeddings.ipynb

### Imports

In [1]:
import io
import numpy as np
import pandas as pd
import tensorflow as tf

# In recent TensorFlow releases this layer is also exposed as tf.keras.layers.TextVectorization.
from tensorflow.keras.layers.experimental import preprocessing

## Simple example using sample data

Let's look at the baby weight sample data from the original repo notebook.

In [2]:
baby_data = pd.read_csv("./data/babyweight_sample.csv") 
print(baby_data.head(5))
print(baby_data.shape)

   weight_pounds  is_male  mother_age  plurality  gestation_weeks
0       5.269048    false          15  Single(1)               28
1       6.375769  Unknown          15  Single(1)               30
2       7.749249     true          42  Single(1)               31
3       1.250021     true          14   Twins(2)               25
4       8.688418     true          15  Single(1)               31
(999, 5)


In [3]:
print(baby_data.plurality.unique())

['Single(1)' 'Twins(2)' 'Triplets(3)' 'Multiple(2+)' 'Quadruplets(4)']


You can see that the 'plurality' column is a categorical text variable. Because it is ordinal data (there is a natural ordering from low to high), we can assign numbers to the categories, as shown in the following cell:

In [4]:
CLASSES = {
    'Single(1)': 0,
    'Multiple(2+)': 1,
    'Twins(2)': 2,
    'Triplets(3)': 3,
    'Quadruplets(4)': 4,
    'Quintuplets(5)': 5
}

N_CLASSES = len(CLASSES)

plurality_class = [CLASSES[plurality] for plurality in baby_data.plurality]


Let's print the first 5 examples:

In [5]:
print(baby_data.plurality[:5].values)
print(plurality_class[:5])

['Single(1)' 'Single(1)' 'Single(1)' 'Twins(2)' 'Single(1)']
[0, 0, 0, 2, 0]


Now we set up an embedding layer using TensorFlow.

We supply the arguments 'input_dim' and 'output_dim':

*  input_dim is the size of the vocabulary. For plurality this is 6 (N_CLASSES): the five categories seen in the sample plus 'Quintuplets(5)'.
*  output_dim is the dimension of the embedding we want to create.

In [6]:
EMBED_DIM = 3

embedding_layer = tf.keras.layers.Embedding(
    input_dim=N_CLASSES, output_dim=EMBED_DIM, name='plurality_embedding')
embeds = embedding_layer(tf.constant(plurality_class))

The variable 'embeds' is a two-dimensional tensor containing the embedding values for plurality for each row of data. Nothing has been trained yet, so these are the layer's random initial weights. Let's inspect it:

In [7]:
print(embeds.shape)
print(embeds[:5])

(999, 3)
tf.Tensor(
[[-0.03500198 -0.00201459 -0.03848321]
 [-0.03500198 -0.00201459 -0.03848321]
 [-0.03500198 -0.00201459 -0.03848321]
 [ 0.03934843  0.04179189 -0.04929756]
 [-0.03500198 -0.00201459 -0.03848321]], shape=(5, 3), dtype=float32)
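
The repeated rows in the output above follow from the fact that an embedding layer is a lookup table with one row per class id. The NumPy sketch below illustrates this under stated assumptions: `table` is a hypothetical stand-in for the layer's weights (in the notebook they could be read with `embedding_layer.get_weights()[0]`), filled with small random values. It also shows that the lookup gives the same result as one-hot encoding the ids and multiplying by the table.

```python
import numpy as np

INPUT_DIM, OUTPUT_DIM = 6, 3

# Hypothetical stand-in for the embedding weights: one row per class id.
rng = np.random.default_rng(0)
table = rng.uniform(-0.05, 0.05, size=(INPUT_DIM, OUTPUT_DIM))

# The first five plurality class ids printed earlier.
class_ids = np.array([0, 0, 0, 2, 0])

# Embedding lookup: select the row that belongs to each id.
looked_up = table[class_ids]

# Equivalent formulation: one-hot encode the ids, then multiply by the table.
one_hot = np.eye(INPUT_DIM)[class_ids]
via_matmul = one_hot @ table

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

The lookup avoids ever materialising the one-hot matrix, which matters once the vocabulary is large.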


#### We can now use the embedding to learn the relationship between plurality and birth weight using a simple model

In [8]:
baby_model = tf.keras.models.Sequential([
        embedding_layer,
        tf.keras.layers.Dense(1)
])

baby_model.compile(
   optimizer='adam',
   loss='mse',
   metrics=[tf.keras.metrics.MeanAbsoluteError()]
)

In [9]:
baby_model.fit(tf.constant(plurality_class), baby_data.weight_pounds, batch_size=1, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fae100da0d0>
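
Because the class id is the model's only input, the best it can do under an MSE loss is to predict the mean weight of each class. The sketch below computes those per-class means for a handful of made-up samples (the numbers are hypothetical, not taken from the baby weight data). It is meant as a reference point for what the trained model's predictions should approach, not as a replacement for training.

```python
from collections import defaultdict

# Hypothetical toy data: (plurality class id, weight in pounds).
samples = [(0, 7.5), (0, 7.0), (0, 8.0), (2, 5.0), (2, 5.5), (3, 4.0)]

totals = defaultdict(float)
counts = defaultdict(int)
for class_id, weight in samples:
    totals[class_id] += weight
    counts[class_id] += 1

# The per-class mean is the constant prediction that minimises squared error
# for all samples sharing that class id.
class_means = {class_id: totals[class_id] / counts[class_id] for class_id in totals}
print(class_means)
```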

## Real world example

### Setting up an embedding for categorical land-use data

Let's load the data and look at it...

In [10]:
land_use_cats = pd.read_csv('./data/land_use_categories.csv')

print(land_use_cats)

print(land_use_cats.shape)

    land_cat_id                               land_cat_description
0             1             discontinuous low density urban fabric
1             2          discontinuous medium density urban fabric
2             3           discontinuous dense density urban fabric
3             4                            continuous urban fabric
4             5  industrial commericial public military private...
5             6        discontinuous very low density urban fabric
6             7                                  green urban areas
7             8                      sports and leisure facilities
8             9                                           pastures
9            10                           arable land annual crops
10           11                                         port areas
11           12             fast transit roads and associated land
12           13                                isolated structures
13           14                    other roads and associated 

This is not ordinal data, so although we have a unique id ('land_cat_id') it is meaningless as an indicator of the relationship between categories.

However, the text in 'land_cat_description' does contain information which we can use.

We are going to use the text processing capabilities of TensorFlow to create an embedding.

#### Convert the input text data to TF format

Normally you would just use a representative sample from a large dataset, but since our data is small we use all of it.

In [11]:
data = tf.constant(list(land_use_cats['land_cat_description']))
labels = tf.constant(list(land_use_cats['land_cat_id']))

#### Instantiate a TextVectorization object and create the 'vocabulary'

In [12]:
text_vectorizer = preprocessing.TextVectorization(output_mode="int")
# Index the vocabulary via `adapt()`
text_vectorizer.adapt(data)
# You can retrieve the vocabulary we indexed via get_vocabulary()
vocab = text_vectorizer.get_vocabulary()
print("Vocabulary:", vocab, len(vocab))

Vocabulary: ['', '[UNK]', 'urban', 'land', 'fabric', 'and', 'discontinuous', 'density', 'associated', 'sites', 'roads', 'low', 'areas', 'without', 'wetlands', 'water', 'very', 'vegetation', 'use', 'units', 'transit', 'structures', 'sports', 'rock', 'railways', 'public', 'private', 'port', 'pastures', 'other', 'moor', 'mineral', 'military', 'medium', 'leisure', 'isolated', 'industrial', 'herbaceous', 'green', 'grass', 'glacier', 'forests', 'fast', 'facilities', 'extraction', 'dunes', 'dump', 'dense', 'current', 'crops', 'continuous', 'construction', 'commericial', 'beaches', 'bare', 'arable', 'annual', 'airports'] 58
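
To make the vocabulary above less of a black box, here is a rough pure-Python sketch of what the vectorizer does with its default settings: lowercase the text, strip punctuation, split on whitespace, and index tokens by descending frequency, with index 0 reserved for padding and index 1 for unknown tokens. The three descriptions and the helper names (`standardize`, `vectorize`) are hypothetical, and the alphabetical tie-breaking between equally frequent tokens is an assumption that may differ from the real layer.

```python
import re
import string
from collections import Counter

# Hypothetical mini-corpus in the style of the land-use descriptions.
descriptions = ["continuous urban fabric", "green urban areas", "Port areas!"]

def standardize(text):
    """Lowercase the text and remove punctuation."""
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text.lower())

counts = Counter(tok for d in descriptions for tok in standardize(d).split())

# Most frequent tokens first; ties broken alphabetically (an assumption).
tokens = sorted(counts, key=lambda tok: (-counts[tok], tok))

# Index 0 is reserved for padding, index 1 for out-of-vocabulary tokens.
vocab = ["", "[UNK]"] + tokens
index = {tok: i for i, tok in enumerate(vocab)}

def vectorize(text):
    """Map each token to its vocabulary index, or 1 if it is unknown."""
    return [index.get(tok, 1) for tok in standardize(text).split()]

print(vocab)
print(vectorize("green urban forests"))
```

Here 'forests' was never seen, so it maps to the '[UNK]' index.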


#### Create an Embedding model

We can make the output size of the embedding ('output_dim') anything we want.

In [13]:
EMBED_DIM = 3

inputs = tf.keras.layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=EMBED_DIM, name='embedding_layer')(x)
outputs = tf.keras.layers.GlobalAveragePooling1D()(x)

land_use_model = tf.keras.Model(inputs, outputs)

print(land_use_model.summary())

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization (TextVect (None, None)              0         
_________________________________________________________________
embedding_layer (Embedding)  (None, None, 3)           174       
_________________________________________________________________
global_average_pooling1d (Gl (None, 3)                 0         
Total params: 174
Trainable params: 174
Non-trainable params: 0
_________________________________________________________________
None


Now we can use the embedding to encode our data and examine the results:

In [14]:
encoded_data = land_use_model(data)
print(encoded_data.shape)
print(encoded_data[:5])

(24, 3)
tf.Tensor(
[[ 8.84700101e-03  9.66095924e-03  1.03270253e-02]
 [ 1.04699237e-02 -3.12716402e-05  1.48955649e-02]
 [ 1.11859245e-02  8.99216812e-03  2.35498250e-02]
 [-3.26201995e-03 -5.70814125e-03  1.04185445e-02]
 [-4.25992021e-03 -7.26009393e-03  3.13125644e-03]], shape=(5, 3), dtype=float32)
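
Each row above is one description reduced to a single vector: every token is looked up in the embedding table and the token vectors are then averaged. The NumPy sketch below mimics that under stated assumptions: `table` is a hypothetical stand-in for the untrained weights, and the token ids are read off the vocabulary printed earlier. Since no masking is configured in the model, the padding positions (id 0) are assumed to be included in the average.

```python
import numpy as np

VOCAB_SIZE, EMBED_DIM = 58, 3

# Hypothetical stand-in for the embedding weights (one row per vocabulary entry).
rng = np.random.default_rng(0)
table = rng.uniform(-0.05, 0.05, size=(VOCAB_SIZE, EMBED_DIM))

# Token ids for two descriptions, padded with 0 to a common length.
token_ids = np.array([
    [6, 11, 7, 2, 4],   # "discontinuous low density urban fabric"
    [50, 2, 4, 0, 0],   # "continuous urban fabric" plus two padding tokens
])

# Look up one vector per token, then average over the token axis.
token_vectors = table[token_ids]       # shape (2, 5, 3)
pooled = token_vectors.mean(axis=1)    # shape (2, 3)

print(token_vectors.shape, pooled.shape)
```

Averaging makes the result independent of word order, which is acceptable for short category descriptions like these.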


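To train the embedding we rebuild the model with a Dense output layer and compile it. The target used below, 'land_cat_id', is an arbitrary id, so this only demonstrates the mechanics of training.

In [15]: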
EMBED_DIM = 3

inputs = tf.keras.layers.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
x = tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=EMBED_DIM, name='embedding_layer')(x)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
outputs = tf.keras.layers.Dense(1)(x)

land_use_model = tf.keras.Model(inputs, outputs)

print(land_use_model.summary())

land_use_model.compile(optimizer='adam',
              loss='mse',
              metrics=['mae'])


Model: "functional_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization (TextVect (None, None)              0         
_________________________________________________________________
embedding_layer (Embedding)  (None, None, 3)           174       
_________________________________________________________________
global_average_pooling1d_1 ( (None, 3)                 0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 4         
Total params: 178
Trainable params: 178
Non-trainable params: 0
_________________________________________________________________
None
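
The parameter counts in the two summaries can be checked by hand. The embedding holds one vector of size EMBED_DIM per vocabulary entry, and the Dense layer holds one weight per input plus a bias. A small sketch of that arithmetic:

```python
VOCAB_SIZE = 58   # len(vocab) printed earlier
EMBED_DIM = 3
DENSE_UNITS = 1

# One EMBED_DIM-sized vector for each vocabulary entry.
embedding_params = VOCAB_SIZE * EMBED_DIM

# One weight per input feature per unit, plus one bias per unit.
dense_params = EMBED_DIM * DENSE_UNITS + DENSE_UNITS

total_params = embedding_params + dense_params
print(embedding_params, dense_params, total_params)
```

The text vectorization and pooling layers add no trainable parameters, which is why the totals match the sums of these two terms.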


In [16]:
land_use_model.fit(data, labels, batch_size=1, epochs=10, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fadc01df6d0>

### So how do we visualise a trained embedding?

We can do that qualitatively using the [TensorFlow Embedding Projector](http://projector.tensorflow.org/).

First we need to extract the trained embedding layer into a new model.

#### Baby data example

In [17]:
plurality_embedding = tf.keras.Model(inputs=baby_model.input,
                outputs=baby_model.get_layer("plurality_embedding").output)

print(plurality_embedding.summary())

Model: "functional_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
plurality_embedding_input (I [(None, None)]            0         
_________________________________________________________________
plurality_embedding (Embeddi (None, None, 3)           18        
Total params: 18
Trainable params: 18
Non-trainable params: 0
_________________________________________________________________
None


In [18]:
preds = tf.squeeze(plurality_embedding.predict(tf.constant(plurality_class)))

In [19]:
# Write one row per sample: the vector file holds the tab-separated embedding
# values, the metadata file holds the matching plurality label.
with io.open('./babydata_vecs.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('./babydata_meta.tsv', 'w', encoding='utf-8') as out_m:
    for i in range(preds.shape[0]):
        vec = preds[i].numpy()
        out_m.write(str(baby_data.plurality[i]) + '\n')
        out_v.write('\t'.join([str(x) for x in vec]) + '\n')
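
The loop above writes one row per sample, so the files contain many identical vectors. A possible alternative is to export one row per category. The sketch below shows the idea with in-memory buffers and a hypothetical `weights` list standing in for the trained table (in the notebook it could be read with `baby_model.get_layer("plurality_embedding").get_weights()[0]`). The values are made up for illustration.

```python
import io

# Hypothetical stand-in for trained embedding weights, one row per class id.
weights = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.2, 0.2]]
labels = ["Single(1)", "Multiple(2+)", "Twins(2)"]

vecs_buf = io.StringIO()
meta_buf = io.StringIO()

for label, vec in zip(labels, weights):
    # Vectors file: tab-separated values, one embedding per line.
    vecs_buf.write("\t".join(str(x) for x in vec) + "\n")
    # Metadata file: the label for the vector on the same line number.
    meta_buf.write(label + "\n")

vecs_tsv = vecs_buf.getvalue()
meta_tsv = meta_buf.getvalue()
print(vecs_tsv)
print(meta_tsv)
```

Replacing the buffers with real files gives a much smaller pair of TSV files with one point per category.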