# [Keras + Universal Sentence Encoder = Deep Meter](https://www.dlology.com/blog/keras-meets-universal-sentence-encoder-transfer-learning-for-text-data/) 

This notebook creates an autoencoder using the Universal Sentence Encoder (USE). The autoencoder takes a line of verse as input and outputs its CMUdict syllables. The dataset is the subset of Allison Parrish's Project Gutenberg poetry archive whose lines scan as iambic pentameter.

The notebook is based on Chengwei Zhang's example of wrapping the USE inside a larger TensorFlow model that is saved as a Keras model (without saving the USE itself in the TF model).

The Universal Sentence Encoder makes getting sentence-level embeddings as easy as it has historically been to look up the embeddings for individual words. The sentence embeddings can then be used to compute sentence-level semantic similarity and to get better performance on downstream classification tasks with less supervised training data.
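
As a rough illustration of the similarity computation mentioned above, the sketch below compares vectors with cosine similarity. The three short vectors are made-up stand-ins, not real encoder output (the USE embeddings in this notebook have 512 dimensions), and `cosine_similarity` is a helper name chosen for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional "embeddings", for illustration only.
emb_a = [0.10, 0.30, -0.20, 0.05]
emb_b = [0.12, 0.28, -0.18, 0.07]   # nearly parallel to emb_a
emb_c = [-0.30, 0.05, 0.25, -0.10]  # points in a different direction

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(sim_ab > sim_ac)
```

With real embeddings, the same function would be applied to rows of the array returned by `session.run(embed(messages))`.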

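Each line has 10 syllable positions, and each position is a one-hot choice among roughly 6k syllables. Concatenating the 10 one-hot vectors gives a target vector with several 1s, so the task is treated as multi-label classification.

Changes for multi-label classification:
- sigmoid activation instead of softmax
- binary_crossentropy loss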
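
To make these two changes concrete, here is a small pure-Python sketch of what a sigmoid output with binary cross-entropy computes. The logits and targets are invented values, and the functions are simplified stand-ins for the Keras `sigmoid` activation and `binary_crossentropy` loss, not their actual implementations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(targets, probs, eps=1e-7):
    """Mean of the per-output binary cross-entropy terms."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(targets)

# Two concatenated one-hot slots of 3 classes each: the target holds two 1s.
targets = [0, 1, 0,   0, 0, 1]
logits = [-2.0, 3.0, -1.0,   -3.0, -2.0, 4.0]  # made-up model outputs

probs = [sigmoid(z) for z in logits]
loss = binary_crossentropy(targets, probs)

# Unlike softmax, the sigmoid outputs are independent and need not sum to 1.
print(sum(probs), loss)
```

A softmax over the whole vector would force all outputs to compete for a total probability of 1, which does not fit a target that contains one 1 per syllable position.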

The input text is tab-separated with two columns: the line of verse, then a nested list of syllables with one inner list per word.
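
A minimal sketch of reading one such record follows. The syllable column is copied from the `sentence_labels` example later in the notebook; the text column and the `parse_line` helper are illustrative assumptions rather than an actual line from the data files.

```python
from ast import literal_eval
from itertools import chain

def parse_line(line):
    """Split one tab-separated record into (text, flat syllable list)."""
    text, raw_syllables = line.rstrip("\n").split("\t")
    nested = literal_eval(raw_syllables)  # one list of syllables per word
    return text, list(chain.from_iterable(nested))

# Hypothetical record: the text column is illustrative only.
sample = ("And stubborn brass and tin and solid gold\t"
          "[['AE N D'], ['S T AH', 'B ER N'], ['B R AE S'], ['AE N D'], "
          "['T IH N'], ['AE N D'], ['S AA', 'L AH D'], ['G OW L D']]\n")

text, sylls = parse_line(sample)
print(text)
print(len(sylls), sylls[:3])
```

A line of iambic pentameter should flatten to exactly 10 syllables, which is what the encoding step checks before filling in a label row.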


# Getting Started

This section sets up the environment for access to the Universal Sentence Encoder on TF Hub and provides examples of applying the encoder to words, sentences, and paragraphs.

In [4]:
# Install the latest Tensorflow version.
!pip3 install --quiet "tensorflow>=1.7"
# Install TF-Hub.
!pip3 install --quiet tensorflow-hub
%cd /content
!git clone https://github.com/LanceNorskog/deep_meter || true
%cd deep_meter
!git pull
# could not figure out how to read gzipped files as text!
!gunzip -qf blobs/*.gz || true
!gunzip -qf prepped_data/*.gz || true

/content
Cloning into 'deep_meter'...
remote: Enumerating objects: 175, done.
remote: Counting objects: 100% (175/175), done.
remote: Compressing objects: 100% (123/123), done.
remote: Total 175 (delta 90), reused 127 (delta 45), pack-reused 0
Receiving objects: 100% (175/175), 20.67 MiB | 31.79 MiB/s, done.
Resolving deltas: 100% (90/90), done.
/content/deep_meter
Already up to date.
gzip: blobs/*.gz: No such file or directory
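
The `gunzip` step works around reading the compressed files directly. As an alternative sketch, Python's standard `gzip` module can open a `.gz` file in text mode with `gzip.open(path, "rt")`. The file name and contents below are placeholders created just for this demonstration, not files from the repository.

```python
import gzip
import os
import tempfile

# Write a small placeholder .gz file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt.gz")
with gzip.open(path, "wt", encoding="utf-8") as out:
    out.write("first line\tcolumn two\n")
    out.write("second line\tcolumn two\n")

# Mode "rt" yields decoded str lines instead of bytes.
with gzip.open(path, "rt", encoding="utf-8") as f:
    lines = [line.rstrip("\n").split("\t") for line in f]

print(lines)
```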


In [0]:
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
import keras.layers as layers
from keras.models import Model
from keras import backend as K
np.random.seed(10)

In [0]:
import syllables
from itertools import chain
from ast import literal_eval
from sklearn.preprocessing import MultiLabelBinarizer

In [0]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3" #@param ["https://tfhub.dev/google/universal-sentence-encoder/2", "https://tfhub.dev/google/universal-sentence-encoder-large/3"]

In [8]:
# Import the Universal Sentence Encoder's TF Hub module
embed = hub.Module(module_url)
embed_size = embed.get_output_info_dict()['default'].get_shape()[1].value

INFO:tensorflow:Using /tmp/tfhub_modules to cache modules.
INFO:tensorflow:Downloading TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder-large/3'.
INFO:tensorflow:Downloaded TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder-large/3'.


In [9]:
# Compute a representation for each message, showing various lengths supported.
# Read only the text column of the first line; the context manager closes the file.
with open("prepped_data/gutenberg.iambic_pentameter") as f:
  line = f.readline().split("\t")[0]
print(line)
messages = [line]

with tf.Session() as session:
  session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  message_embeddings = session.run(embed(messages))
message_embeddings[0][:3]

From their Creator, and transgress his Will
INFO:tensorflow:Saver not created because there are no variables in the graph to restore


array([0.09510787, 0.01377319, 0.0609108 ], dtype=float32)

In [0]:
def flatten(data):
  return list(chain.from_iterable(data))
train_text_raw = []
train_labels_raw = []
with open("prepped_data/gutenberg.iambic_pentameter.train") as file:
  for line in file:
    parts = line.split("\t")
    train_text_raw.append(parts[0])
    # second column is a nested list literal: one list of syllables per word
    train_labels_raw.append(flatten(literal_eval(parts[1])))
    
num_training = len(train_text_raw)
    
sentence_labels = flatten([['AE N D'], ['S T AH', 'B ER N'], ['B R AE S'], ['AE N D'], ['T IH N'], ['AE N D'], ['S AA', 'L AH D'], ['G OW L D']])
other_sentence = flatten([['T UW'], ['M IY'], ['IH T'], ['F L OW Z'], ['AH'], ['S AH', 'L AH N'], ['S T R IY M'], ['AH V'], ['T EH R Z']])
third_sentence = flatten([['DH AH'], ['N OW', 'B AH L'], ['S IY', 'M AH N'], ['HH UW'], ['W IH TH', 'HH EH L D'], ['DH AH'], ['HH AE N D']])
    


In [27]:
# syllables in descending order of occurrence - 6k in gutenberg.iambic_pentameter, 15k total
# clamp to most common 100 syllables while debugging- use NCE to get all syllables or interesting number
# 98 + pause + wildcard
num_syllables = 100 
# iambic pentameter
num_symbols = 10
syll_mgr = syllables.syllables(num_syllables)
train_labels_encoded = np.zeros((num_training, num_symbols * num_syllables))
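# fail_* record the last indices and encoding touched, for debugging a failed lookup
fail_i = 0
fail_j = 0
fail_enc = 0
for i in range(num_training):
  fail_i = i
  # skip lines that are not exactly num_symbols syllables long; their rows stay all zero
  if len(train_labels_raw[i]) != num_symbols:
    continue
  for j in range(num_symbols):
    fail_j = j
    fail_enc = -1
    fail_enc = syll_mgr.get_encoding(train_labels_raw[i][j])
    # position j owns the slice [j * num_syllables, (j + 1) * num_syllables)
    train_labels_encoded[i][j * num_syllables + fail_enc] = 1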
  
print(train_labels_encoded.shape)

(62320, 1000)
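
The label layout built above gives each of the `num_symbols` positions its own block of `num_syllables` slots. The following sketch shows the same indexing on a tiny scale, and how such a vector could be decoded back into syllables. The three-entry vocabulary and the `encode`/`decode` helpers are hypothetical; they stand in for the `syllables` module, whose API is not reproduced here.

```python
# Hypothetical toy vocabulary; the notebook gets indices from syll_mgr.get_encoding().
vocab = ["AE N D", "T IH N", "G OW L D"]
index_of = {s: i for i, s in enumerate(vocab)}
num_syllables = len(vocab)
num_symbols = 2  # the notebook uses 10

def encode(sylls):
    """One block of num_syllables slots per position, with a single 1 in each block."""
    vec = [0] * (num_symbols * num_syllables)
    for j, s in enumerate(sylls):
        vec[j * num_syllables + index_of[s]] = 1
    return vec

def decode(vec):
    """Take the highest-scoring slot in each block."""
    out = []
    for j in range(num_symbols):
        block = vec[j * num_syllables:(j + 1) * num_syllables]
        out.append(vocab[block.index(max(block))])
    return out

vec = encode(["T IH N", "G OW L D"])
print(vec)
print(decode(vec))
```

The same per-block argmax could be applied to the sigmoid outputs of the model to turn a 1000-wide prediction into 10 syllables.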


## Wrap embed module in a Lambda layer
Explicitly cast the input as a string

In [0]:
def UniversalEmbedding(x):
    # Squeeze only the feature axis so that a batch of one stays a rank-1 tensor.
    return embed(tf.squeeze(tf.cast(x, tf.string), axis=[1]), signature="default", as_dict=True)["default"]

## Assemble model

In [30]:
input_text = layers.Input(shape=(1,), dtype=tf.string)
embedding = layers.Lambda(UniversalEmbedding, output_shape=(embed_size,))(input_text)
dense = layers.Dense(512, activation='relu')(embedding)
pred = layers.Dense(num_syllables * num_symbols, activation='sigmoid')(dense)
model = Model(inputs=[input_text], outputs=pred)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

INFO:tensorflow:Saver not created because there are no variables in the graph to restore
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 1)                 0         
_________________________________________________________________
lambda_1 (Lambda)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 512)               262656    
_________________________________________________________________
dense_2 (Dense)              (None, 1000)              513000    
Total params: 775,656
Trainable params: 775,656
Non-trainable params: 0
_________________________________________________________________
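
The parameter counts in the summary can be checked by hand: a `Dense` layer has one weight per input-output pair plus one bias per output. The `dense_params` helper below exists only for this check. The Lambda layer reports 0 parameters because the encoder's weights live in the TF Hub module rather than in Keras layers.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

embed_size = 512                       # width of the USE embedding
hidden = dense_params(embed_size, 512)
output = dense_params(512, 100 * 10)   # num_syllables * num_symbols
total = hidden + output
print(hidden, output, total)  # 262656 513000 775656
```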


In [50]:
train_text = df_train['text'].tolist()
train_text = np.array(train_text, dtype=object)[:, np.newaxis]

train_label = np.asarray(pd.get_dummies(df_train.label), dtype = np.int8)

NameError: ignored

In [0]:
train_text.shape

(500, 1)

In [0]:
train_label.shape

(500, 6)

In [0]:
train_label[:3]

array([[0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 0]], dtype=int8)

In [0]:
df_test = get_dataframe('test_data.txt')

In [0]:
test_text = df_test['text'].tolist()
test_text = np.array(test_text, dtype=object)[:, np.newaxis]
test_label = np.asarray(pd.get_dummies(df_test.label), dtype = np.int8)

## Train Keras model and save weights
This trains and saves only our Keras layers, not the embed module's weights.

In [0]:
with tf.Session() as session:
  K.set_session(session)
  session.run(tf.global_variables_initializer())
  session.run(tf.tables_initializer())
  history = model.fit(train_text, 
            train_label,
            validation_data=(test_text, test_label),
            epochs=10,
            batch_size=32)
  model.save_weights('./model.h5')

Train on 500 samples, validate on 500 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [0]:
!ls -alh | grep model.h5

## Make predictions

In [0]:
new_text = ["In what year did the titanic sink ?", "What is the highest peak in California ?", "Who invented the light bulb ?"]
new_text = np.array(new_text, dtype=object)[:, np.newaxis]
with tf.Session() as session:
  K.set_session(session)
  session.run(tf.global_variables_initializer())
  session.run(tf.tables_initializer())
  model.load_weights('./model.h5')  
  predicts = model.predict(new_text, batch_size=32)

In [0]:
predicts

array([[8.4159707e-05, 6.0455163e-04, 8.8598131e-04, 7.6786731e-05,
        4.0342723e-04, 9.9794513e-01],
       [1.6934730e-03, 2.2273099e-03, 2.4022337e-02, 2.4095874e-03,
        6.6579401e-01, 3.0385330e-01],
       [5.7516660e-04, 9.6968655e-04, 4.5365933e-02, 9.5041847e-01,
        2.1153719e-03, 5.5547140e-04]], dtype=float32)

In [0]:
categories = df_train.label.cat.categories.tolist()
predict_logits = predicts.argmax(axis=1)
print("Categories: {0}".format(categories))
predict_labels = [categories[logit] for logit in predict_logits]
predict_labels

Categories: ['ABBR', 'DESC', 'ENTY', 'HUM', 'LOC', 'NUM']


['NUM', 'LOC', 'HUM']
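
The argmax step above can be reproduced without TensorFlow. This sketch copies the `predicts` values and category list from the output shown above and uses plain Python lists in place of NumPy arrays. Note that what the notebook calls `predict_logits` are class indices, not logits.

```python
categories = ['ABBR', 'DESC', 'ENTY', 'HUM', 'LOC', 'NUM']
predicts = [
    [8.4159707e-05, 6.0455163e-04, 8.8598131e-04, 7.6786731e-05, 4.0342723e-04, 9.9794513e-01],
    [1.6934730e-03, 2.2273099e-03, 2.4022337e-02, 2.4095874e-03, 6.6579401e-01, 3.0385330e-01],
    [5.7516660e-04, 9.6968655e-04, 4.5365933e-02, 9.5041847e-01, 2.1153719e-03, 5.5547140e-04],
]

# Index of the largest score in each row, then the matching category name.
indices = [row.index(max(row)) for row in predicts]
labels = [categories[i] for i in indices]
print(indices, labels)  # [5, 4, 3] ['NUM', 'LOC', 'HUM']
```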