# [Keras + Universal Sentence Encoder = Deep Meter](https://www.dlology.com/blog/keras-meets-universal-sentence-encoder-transfer-learning-for-text-data/)

This notebook creates an autoencoder using the Universal Sentence Encoder. The autoencoder output is CMUdict syllables. The dataset is that subset of Allison Parrish's Project Gutenberg poetry archive which happens to scan in iambic pentameter.

The notebook is based on Chengwei Zhang's example of wrapping the USE inside a larger TensorFlow model that is saved as a Keras model (without saving the USE itself in the TF model).

The Universal Sentence Encoder makes getting sentence-level embeddings as easy as it has historically been to look up the embeddings for individual words. The sentence embeddings can then be trivially used to compute sentence-level meaning similarity as well as to enable better performance on downstream classification tasks using less supervised training data.
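As a rough illustration of how sentence embeddings can be compared, the sketch below computes cosine similarity between vectors. The three 4-dimensional vectors are made-up stand-ins for real 512-dimensional USE embeddings, and `cosine_similarity` is a helper written for this example, not part of TF Hub.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (assumption): stand-ins for 512-dimensional USE embeddings.
emb_a = [0.1, 0.3, -0.2, 0.9]
emb_b = [0.1, 0.25, -0.1, 0.8]   # made up to lie close to emb_a
emb_c = [-0.7, 0.1, 0.6, -0.2]   # made up to point elsewhere

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(sim_ab > sim_ac)
```

With real embeddings, a higher similarity would suggest that two lines are closer in meaning.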

Since there are 10 one-hot values for 10 sets of 6k syllables, this is "multi-label classification".

Changes for multi-label classification:
- sigmoid activation instead of softmax
- binary_crossentropy

Text format is tab-separated, 2 columns: the first is the text, the second is a multi-level array of syllables.

Multi-output version

Use ARPAbet directly instead of syllables
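To make the two-column text format concrete, here is a minimal sketch of parsing one line and flattening it to ARPAbet phonemes. The sample line is hypothetical (not taken from the dataset), and `parse_line` is a name invented for this example. It assumes each syllable is a string of space-separated phonemes.

```python
from ast import literal_eval

def parse_line(line):
    """Split one 'text<TAB>syllables' line into (text, flat phoneme list)."""
    text, raw = line.split("\t")
    words = literal_eval(raw)  # nested list: words -> syllable strings
    phonemes = [ph for word in words for syl in word for ph in syl.split(" ")]
    return text, phonemes

# Hypothetical sample line; the real files may differ in detail.
sample = "the cat\t[['DH AH'], ['K AE T']]"
text, phonemes = parse_line(sample)
print(text, phonemes)
```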

# Getting Started

This section sets up the environment for access to the Universal Sentence Encoder on TF Hub and provides examples of applying the encoder to words, sentences, and paragraphs.

In [1]:
# Install the latest Tensorflow version.
#!pip3 install --quiet "tensorflow>=1.7"
# Install TF-Hub.
#!pip3 install --quiet tensorflow-hub
#%cd /content
!git clone https://github.com/LanceNorskog/deep_meter || true
%cd /content/deep_meter
!git pull
# could not figure out how to read gzipped files as text!
!gunzip -qf blobs/*.gz || true
!gunzip -qf prepped_data/*.gz || true

fatal: destination path 'deep_meter' already exists and is not an empty directory.
/content/deep_meter
Already up to date.
gzip: blobs/*.gz: No such file or directory
gzip: prepped_data/*.gz: No such file or directory


In [2]:
# boilerplate from base notebook
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
import keras.layers as layers
from keras.models import Model
from keras import backend as K
from keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.optimizers import Nadam, Adam
import gc
from google.colab import files
from google.colab import drive

import pickle
np.random.seed(10)

Using TensorFlow backend.


In [0]:
# github deep_meter code
import utils
# should not need this to use utils.flatten but is true anyway?
from itertools import chain, product
import subprocess
import arpabets
import decodewords
import cmudict
import readprepped
# misc for this notebook
from ast import literal_eval

import scipy


In [0]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3" #@param ["https://tfhub.dev/google/universal-sentence-encoder/2", "https://tfhub.dev/google/universal-sentence-encoder-large/3"]

In [0]:
# Read classified poetry lines: text <tab> [['syll', 'la', 'ble'], ...]
# Each syllable string holds space-separated ARPAbet phonemes.
# Returns (['words', ...], labels), where labels has shape
# (num_symbols, num_lines, num_arpabets) and each [j][i] row is one-hot.
def get_data(filename, arpabet_mgr, num_symbols, max_lines=1000000):
    stop_arpabet = 0  # index of the stop symbol, also used as padding
    num_arpabets = arpabet_mgr.get_size()
    with open(filename, 'r') as f:
        lines = f.read().splitlines()
    text_lines = []
    text_arpabets = []
    for line in lines[:max_lines]:
        parts = line.split("\t")
        syllables = literal_eval(parts[1])
        # flatten words -> syllables -> phonemes
        arpas = [x for s in syllables for p in s for x in p.split(' ')]
        # keep only lines short enough to leave room for a stop symbol
        if len(arpas) < num_symbols:
            text_lines.append(str(parts[0]))
            text_arpabets.append(arpas)
    num_lines = len(text_lines)
    label_array = np.zeros((num_symbols, num_lines, num_arpabets), dtype=np.int8)
    for i in range(num_lines):
        for j in range(num_symbols):
            # default to the stop symbol (padding, or an unknown phoneme)
            label_array[j][i][stop_arpabet] = 1
            # variable-length list of phonemes
            if j < len(text_arpabets[i]):
                enc = arpabet_mgr.get_encoding(text_arpabets[i][j])
                if 0 <= enc < num_arpabets:
                    # clear the stop bit first so enc == stop_arpabet stays one-hot
                    label_array[j][i][stop_arpabet] = 0
                    label_array[j][i][enc] = 1

    return (text_lines, label_array)
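The label layout produced above (position first, then line, then symbol) can be sketched with plain lists. The vocabulary `toy_encoding` and the helper `one_hot_labels` are hypothetical and exist only for this illustration. The real notebook uses the `arpabets` manager and a NumPy array.

```python
def one_hot_labels(phoneme_lines, encoding, num_symbols, stop_index=0):
    """Build labels[position][line] = one-hot list, padded with the stop symbol."""
    num_classes = len(encoding)
    labels = []
    for j in range(num_symbols):
        per_line = []
        for phonemes in phoneme_lines:
            row = [0] * num_classes
            if j < len(phonemes) and phonemes[j] in encoding:
                row[encoding[phonemes[j]]] = 1
            else:
                # past the end of the line, or an unknown phoneme
                row[stop_index] = 1
            per_line.append(row)
        labels.append(per_line)
    return labels

# Hypothetical 4-symbol vocabulary; index 0 plays the role of the stop symbol.
toy_encoding = {".": 0, "AH": 1, "K": 2, "T": 3}
labels = one_hot_labels([["K", "AH", "T"], ["AH"]], toy_encoding, num_symbols=4)
print(labels[0])
```

Indexing by position first means `labels[j]` is exactly the target for the j-th output head of a multi-output model.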


In [6]:
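# ARPAbet symbols (phonemes + stop + pause) are indexed in descending order of occurrence.
# Iambic pentameter: 10 syllables per line.
meter_syllables = 10
# budget of 4 phonemes per syllable
num_symbols = 4 * meter_syllables
arpabets_mgr = arpabets.arpabets()
num_arpabets = arpabets_mgr.get_size() 
# inverse-frequency class weights: the most common symbol gets 1.0, rarer symbols get more
arpabets_weights = {}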
counts = arpabets_mgr.get_counts()
maxim = np.max(counts)
for i in range(len(counts)):
  if counts[i] > 0:
    arpabets_weights[i] = 1/(counts[i]/maxim)
  else:
    arpabets_weights[i] = 0

print(arpabets_weights)

{0: 0, 1: 0, 2: 1.0, 3: 1.2445861272947125, 4: 1.5197315436241612, 5: 1.5209564750134337, 6: 1.5881610324028617, 7: 1.7561656584457885, 8: 1.8720238095238095, 9: 2.0694571376348017, 10: 2.2044392523364484, 11: 2.43745963401507, 12: 2.8447236180904523, 13: 3.42571860816944, 14: 3.814690026954178, 15: 3.8510204081632655, 16: 3.9810126582278484, 17: 4.132116788321168, 18: 4.1701657458563535, 19: 4.319725295688668, 20: 4.45748031496063, 21: 4.933333333333333, 22: 4.95059029296021, 23: 4.959264126149803, 24: 5.501457725947522, 25: 5.588351431391906, 26: 5.788343558282208, 27: 5.827071538857437, 28: 5.918452692106639, 29: 7.409685863874346, 30: 7.797520661157026, 31: 9.819601040763226, 32: 15.055851063829786, 33: 15.197315436241611, 34: 16.923766816143498, 35: 19.42024013722127, 36: 20.003533568904594, 37: 25.328859060402685, 38: 34.94444444444444, 39: 59.904761904761905, 40: 205.85454545454544}
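The weighting scheme above is inverse frequency, normalized so the most common symbol has weight 1.0 and unseen symbols have weight 0. A small sketch with made-up counts (the name `inverse_frequency_weights` and the numbers are assumptions for illustration):

```python
def inverse_frequency_weights(counts):
    """Map class index -> max(counts) / count, or 0 for classes never seen."""
    most_common = max(counts)
    return {i: (most_common / c if c > 0 else 0) for i, c in enumerate(counts)}

# Made-up counts: two unused symbols, then symbols of decreasing frequency.
toy_counts = [0, 0, 400, 200, 100]
weights = inverse_frequency_weights(toy_counts)
print(weights)
```

A symbol seen a quarter as often as the most common one gets four times the weight, which is why the rarest symbol in the output above reaches a weight above 200.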


In [7]:
(train_text, train_label) = get_data('prepped_data/gutenberg.iambic_pentameter.train', arpabets_mgr, num_symbols, max_lines=45000)
print(len(train_text))
print(train_label.shape)

(test_text, test_label) = get_data('prepped_data/gutenberg.iambic_pentameter.test', arpabets_mgr, num_symbols)
print(len(test_text))
print(test_label.shape)

45000
(40, 45000, 41)
4474
(40, 4474, 41)


## Embed training & test text

In [8]:
# Import the Universal Sentence Encoder's TF Hub module
embed = hub.Module(module_url)
# important?
embed_size = embed.get_output_info_dict()['default'].get_shape()[1].value

# Reduce logging output.
tf.logging.set_verbosity(tf.logging.ERROR)
print(type(train_text))
#train_text_t = tf.convert_to_tensor(train_text, dtype='string', name='training_text')
with tf.Session() as session:
  session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  train_embeddings = session.run(embed(train_text))
  test_embeddings = session.run(embed(test_text))
train_text_d = np.array(train_embeddings)
test_text_d = np.array(test_embeddings)
print(train_text_d.shape)
print(test_text_d.shape)
# conserve space
embed = None
train_text = None
train_embeddings = None
K.clear_session()
gc.collect()


INFO:tensorflow:Using /tmp/tfhub_modules to cache modules.
<class 'list'>
(45000, 512)
(4474, 512)


0

In [9]:
print(test_text_d.shape)
# slow
num_epochs = 50
# Despite its name, adam_0001 holds a Nadam optimizer with learning rate 0.005.
adam_0001 = tf.contrib.opt.NadamOptimizer(0.005)

(4474, 512)


## Assemble model

In [10]:
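dropout=0.5
input_embeddings = layers.Input(shape=(512,), dtype=tf.float32, name='Input')
dropout_input = layers.Dropout(dropout)(input_embeddings)
dense = layers.Dense(1024, activation='relu', name='Convoluted')(dropout_input)
# NOTE: the next two Dropout layers take input_embeddings, not the previous
# Dense output, so 'Convoluted' and 'Midway' are never connected to the outputs.
# The summary below shows only Input -> Dropout -> Smooth.
# Pass `dense` instead of `input_embeddings` to chain all three Dense layers.
dense = layers.Dropout(dropout)(input_embeddings)
dense = layers.Dense(2048, activation='relu', name='Midway')(dense)
dense = layers.Dropout(dropout)(input_embeddings)
dense = layers.Dense(4096, activation='relu', name='Smooth')(dense)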
pred_array = []
loss_array = []
names_array = []
for i in range(num_symbols):
  name = 'Flatout'+"{:0>2d}".format(i)
  pred_array.append(layers.Dense(num_arpabets, activation='softmax', name=name)(dense))
  loss_array.append('categorical_crossentropy')
  names_array.append(name)
model = Model(inputs=input_embeddings, outputs=pred_array)
model.compile(loss=loss_array, 
              optimizer=adam_0001, 
              metrics=['categorical_accuracy'])
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input (InputLayer)              (None, 512)          0                                            
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, 512)          0           Input[0][0]                      
__________________________________________________________________________________________________
Smooth (Dense)                  (None, 4096)         2101248     dropout_3[0][0]                  
__________________________________________________________________________________________________
Flatout00 (Dense)               (None, 41)           167977      Smooth[0][0]                     
__________________________________________________________________________________________________
Flatout01 
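The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The sketch below reproduces the two figures shown above and, assuming all 40 output heads have the same size, the total for the layers that are actually connected. `dense_params` is a helper written for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

smooth = dense_params(512, 4096)   # 'Smooth' layer fed by the 512-d embedding
head = dense_params(4096, 41)      # one 'Flatout' softmax head over 41 symbols
total = smooth + 40 * head         # 40 heads, one per output position
print(smooth, head, total)
```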

## Train Keras model and save weights
This only trains and saves our Keras layers, not the embed module's weights.

In [0]:
use_saved_model=True

print(train_label.shape)
if not use_saved_model or not os.path.exists('./model.h5'):
  with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    train_labels = []
    test_labels = []
    for i in range(num_symbols):
        train_labels.append(np.array(train_label[i]))
        test_labels.append(np.array(test_label[i]))
    history = model.fit(train_text_d, 
            train_labels,
            validation_data=(test_text_d, test_labels),
            epochs=num_epochs,
            #callbacks = [EarlyStopping(patience=2)],
            batch_size=32,
            class_weight=arpabets_weights,
            verbose=2
            )
    model.save_weights('./model.h5')


(40, 45000, 41)
Train on 45000 samples, validate on 4474 samples
Epoch 1/50
 - 125s - loss: 90.3715 - Flatout00_loss: 1.4319 - Flatout01_loss: 1.6888 - Flatout02_loss: 2.4626 - Flatout03_loss: 3.3143 - Flatout04_loss: 3.2718 - Flatout05_loss: 3.1429 - Flatout06_loss: 3.0981 - Flatout07_loss: 3.2801 - Flatout08_loss: 3.4039 - Flatout09_loss: 3.4119 - Flatout10_loss: 3.3521 - Flatout11_loss: 3.2810 - Flatout12_loss: 3.2900 - Flatout13_loss: 3.3365 - Flatout14_loss: 3.3616 - Flatout15_loss: 3.3431 - Flatout16_loss: 3.3098 - Flatout17_loss: 3.3070 - Flatout18_loss: 3.3319 - Flatout19_loss: 3.3534 - Flatout20_loss: 3.3655 - Flatout21_loss: 3.3664 - Flatout22_loss: 3.3783 - Flatout23_loss: 3.3652 - Flatout24_loss: 3.2918 - Flatout25_loss: 3.0709 - Flatout26_loss: 2.6747 - Flatout27_loss: 2.1436 - Flatout28_loss: 1.5673 - Flatout29_loss: 1.0442 - Flatout30_loss: 0.6287 - Flatout31_loss: 0.3538 - Flatout32_loss: 0.1799 - Flatout33_loss: 0.0847 - Flatout34_loss: 0.0399 - Flatout35_loss: 0.0169 

In [12]:
!ls -l | grep model.h5

drive.mount('/content/gdrive')
!ls /content/gdrive/'My Drive'/'Colab Notebooks'

!cp model.h5 /content/gdrive/My\ Drive/Colab\ Notebooks/model.h5


-rw-r--r-- 1 root root 35407584 Nov  7 06:39 model.h5
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
'Copy of Interactive textgenrnn Demo w  GPU'
'Copy of Semantic Similarity with TF-Hub Universal Encoder'
'Copy of Semantic Similarity with TF-Hub Universal Encoder Lite'
'Copy of The Annotated "Attention is All You Need".ipynb'
'Copy of Transfer Learning - Semantic Similarity with TF-Hub Universal Encoder'
'Copy of Transfer Learning - Semantic Similarity with TF-Hub Universal Encoder (1)'
 data
 Deep_Meter_ARPAbet.ipynb
'Deep Meter: Keras + Universal Sentence Encoder + CMUdict + Project Gutenberg poetry archives'
 Deep_Meter_Multi.ipynb
'Example: Keras with TF-Hub Universal Encoder'
 model.h5
 predictions.pkl
'Sloan 1'
 Untitled0.ipynb


In [0]:
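# summarize history for accuracy
# A multi-output model records one metric per output head
# (e.g. 'Flatout00_categorical_accuracy'), so average over the heads.
acc_keys = [name + '_categorical_accuracy' for name in names_array]
plt.plot(np.mean([history.history[k] for k in acc_keys], axis=0))
plt.plot(np.mean([history.history['val_' + k] for k in acc_keys], axis=0))
plt.title('model accuracy (mean over outputs)')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()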


In [0]:
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

## Make predictions

In [13]:

#new_text = np.array(new_text, dtype=object)[:, np.newaxis]
with tf.Session() as session:
  K.set_session(session)
  session.run(tf.global_variables_initializer())
  session.run(tf.tables_initializer())
  model.load_weights('./model.h5')  
  predicts = model.predict(test_text_d, batch_size=32)

train_text = None
train_text_d = None
train_label = None
train_labels = None
print(len(predicts))
print(predicts[0].shape)
K.clear_session()
gc.collect()
model = None
with open("./predictions.pkl", "wb") as f:
    pickle.dump(predicts, f, pickle.HIGHEST_PROTOCOL)
!cp ./predictions.pkl /content/gdrive/My\ Drive/Colab\ Notebooks/./predictions.pkl
!ls -l /content/gdrive/My\ Drive/Colab\ Notebooks/*.pkl

40
(4474, 41)
-rw------- 1 root root 29351354 Nov  7 07:00 '/content/gdrive/My Drive/Colab Notebooks/predictions.pkl'


In [14]:
# Collect candidate phonemes from each output head.
# arpabet_arrays[line][symbol] lists every phoneme whose probability exceeds the threshold.
threshold = 0.05
num_tests = 2
arpabet_arrays = [[[] for _ in range(num_symbols)] for _ in range(num_tests)]
score_arrays = [[[] for _ in range(num_symbols)] for _ in range(num_tests)]

total = 0  # avoid shadowing the built-in sum()
count = 0
for i in range(num_symbols):
  for j in range(num_tests):
    for k in range(num_arpabets):
      if predicts[i][j][k] > threshold:
        arpabet_arrays[j][i].append(arpabets_mgr.get_arpabet(k))
        score_arrays[j][i].append(predicts[i][j][k])
    total += len(score_arrays[j][i])
    count += 1
print("Mean length = {0}".format(total/count))

predicts = None
    

Mean length = 4.05
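The selection step can be illustrated on a single softmax output. The probabilities and the four-symbol vocabulary below are made up, and `candidates_above` is a name chosen for this sketch.

```python
def candidates_above(probs, symbols, threshold=0.05):
    """Return (symbol, probability) pairs whose probability exceeds threshold."""
    return [(s, p) for s, p in zip(symbols, probs) if p > threshold]

# Made-up softmax output for one position of one line.
toy_symbols = [".", "AH", "N", "T"]
toy_probs = [0.02, 0.60, 0.30, 0.08]

picked = candidates_above(toy_probs, toy_symbols)
strict = candidates_above(toy_probs, toy_symbols, threshold=0.5)
print(picked)
print(strict)
```

A low threshold such as 0.05 keeps several candidates per position, consistent with the mean length of about 4 reported above. A higher threshold narrows the list at the risk of dropping the correct phoneme.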


In [15]:

#for i in range(num_symbols):
#  print(len(arpabets[i][0]))
        
  
print("Arpabets[0]: {0}".format(arpabet_arrays[1]))
print("Scores[0]: {0}".format(score_arrays[1]))

#sample =  [['AE'], ['N'], ['D'], ['W'], ['AH', 'EH'], ['N', 'T', 'DH']]
##for x in product(*sample):
#  print(x)

decoder = decodewords.Decoder(cmudict.CMUDict().get_reverse_dict(), arpabets_mgr)
for i in range(num_tests):
  print(score_arrays[i])
  # WARNING: the Cartesian product grows multiplicatively with the number of
  # candidates per position, so this is only feasible for very short candidate lists.
  alist = list(product(*arpabet_arrays[i]))
  slist = list(product(*score_arrays[i]))
  stotals = [decodewords.sum_scores(a, s) for a, s in zip(alist, slist)]
  # np.argsort sorts ascending, so index 0 selects the smallest total
  topindex = np.argsort(stotals)[0]
  print("Top score = {0}".format(stotals[topindex]))
  atest = alist[topindex]
  print(len(slist))
  # release the (potentially huge) candidate lists
  alist = None
  slist = None
  print(arpabet_arrays[i])
  for s in decoder.decode_sentence(atest, 12):
    print(s)


Arpabets[0]: [['AH', 'T', 'DH', 'AE', 'F'], ['AH', 'N', 'IH', 'AE', 'AO', 'UW'], ['T', 'D', 'R', 'S', 'W', 'HH'], ['IH', 'DH', 'W', 'HH', 'AO', 'AY'], ['AH', 'N', 'R', 'L', 'M', 'EH', 'IY', 'UW'], ['AH', 'N', 'IH', 'T', 'DH'], ['AH', 'N', 'IH', 'T', 'DH', 'Z'], ['AH', 'N', 'IH', 'T', 'S', 'Z'], ['N', 'IH', 'R', 'S', 'M'], ['AH', 'N', 'IH', 'T', 'R', 'AE'], ['AH', 'N', 'IH', 'T', 'D', 'R', 'IY'], ['AH', 'N', 'IH', 'T', 'D', 'DH', 'Z'], ['AH', 'N', 'IH', 'T', 'D', 'Z', 'M'], ['AH', 'N', 'L', 'EH'], ['AH', 'N', 'IH', 'T', 'R'], ['AH', 'N', 'IH', 'T', 'R'], ['AH', 'N', 'IH', 'T', 'DH'], ['AH', 'N', 'IH', 'D', 'Z'], ['AH', 'N', 'IH', 'T', 'D', 'R'], ['AH', 'N', 'IH', 'T', 'D', 'Z'], ['AH', 'N', 'IH', 'T', 'D', 'R', 'M'], ['AH', 'N', 'IH', 'T', 'R'], ['AH', 'N', 'IH', 'T', 'D', 'R', 'L', 'Z'], ['.', 'AH', 'IH', 'R', 'Z'], ['.', 'N', 'IH', 'D', 'S'], ['.', 'D', 'Z'], ['.', 'N'], ['.'], ['.'], ['.'], ['.'], ['.'], ['.'], ['.'], ['.'], ['.'], ['.'], ['.'], ['.'], ['.']]
Scores[0]: [[0.054922525

Exception ignored in: <bound method BaseSession._Callable.__del__ of <tensorflow.python.client.session.BaseSession._Callable object at 0x7fc5e48c34e0>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1455, in __del__
    self._session._session, self._handle, status)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.CancelledError: Session has been closed.


KeyboardInterrupt: ignored
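The exhaustive search over `product(...)` is a plausible reason the cell had to be interrupted: the number of combinations is the product of the candidate-list lengths. The sketch below counts combinations and shows a greedy per-position pick as one cheap alternative. The greedy decoder is a swapped-in technique for illustration, not the notebook's method, and it ignores whatever `decodewords.sum_scores` measures across positions. All names and data here are hypothetical.

```python
from functools import reduce
from operator import mul

def num_combinations(candidate_lists):
    """Size of the Cartesian product of the per-position candidate lists."""
    return reduce(mul, (len(c) for c in candidate_lists), 1)

def greedy_decode(candidate_lists, score_lists):
    """Pick the highest-scoring candidate independently at each position."""
    best = []
    for cands, scores in zip(candidate_lists, score_lists):
        idx = max(range(len(scores)), key=scores.__getitem__)
        best.append(cands[idx])
    return best

# Toy data: three positions with 2, 3 and 1 candidates.
toy_candidates = [["AH", "T"], ["N", "IH", "AE"], ["D"]]
toy_scores = [[0.7, 0.2], [0.1, 0.5, 0.3], [0.9]]

combos = num_combinations(toy_candidates)
greedy = greedy_decode(toy_candidates, toy_scores)
# 30 positions with 5 candidates each already give 5**30 combinations.
big = num_combinations([["x"] * 5] * 30)
print(combos, greedy, big)
```

A beam search would sit between the two extremes, keeping only the best few partial sequences at each position.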

In [18]:
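# NOTE: this cell appears to be left over from the base notebook. df_train is
# never defined here (and predicts was set to None above), hence the NameError.
categories = df_train.label.cat.categories.tolist()
predict_logits = predicts.argmax(axis=1)
print("Categories: {0}".format(categories))
predict_labels = [categories[logit] for logit in predict_logits]
predict_labels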

NameError: ignored

In [0]:

os.remove('./model.h5')