<a href="https://colab.research.google.com/github/CMPSC-310-AI-Spring2023/project03_nlp-neurotic-networks/blob/main/ai_project3_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Handle imports

In [1]:
import tensorflow as tf

import numpy as np
import os
import time

In [2]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


Load the text data (one large .srt file combining 11 movies) and print the number of characters to verify that the text was read successfully

In [3]:
text = open('/content/drive/My Drive/CMPSC310_ColabNotebooks/AI_Project3/11-movie-synop.srt', 'rb').read().decode(encoding='utf-8')
print(f'Length of text: {len(text)} characters')

Length of text: 103083 characters


Sample beginning of subtitle text to ensure correct file has been loaded

In [4]:
print(text[:250])

The story takes place in the future where the greenhouse gases have caused the polar icecaps to melt, flooding coastal cities. To combat over-population, people wishing to have children must apply for a license.

The film starts in the offices of a


Build the vocabulary (the sorted set of unique characters in the text) and print its size

In [5]:
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

77 unique characters


Demonstrate splitting text into tokens

In [6]:
example_texts = ['abcdefg', 'xyz']

chars = tf.strings.unicode_split(example_texts, input_encoding='UTF-8')
chars

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>

Create a lookup layer

In [7]:
ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary=list(vocab), mask_token=None)

Convert tokens into numeric character IDs

In [8]:
ids = ids_from_chars(chars)
ids

<tf.RaggedTensor [[50, 51, 52, 53, 54, 55, 56], [73, 74, 75]]>

Create the inverse lookup layer, which converts numeric IDs back into characters (so we can see a human-readable end product)

In [9]:
chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)

Convert the IDs back into characters and rejoin them to form full text strings

In [10]:
tf.strings.reduce_join(chars_from_ids(ids), axis=-1).numpy()

array([b'abcdefg', b'xyz'], dtype=object)

Condense above code into callable function for future use

In [11]:
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)
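
The round trip above can be sketched without TensorFlow. This is a minimal illustration, assuming that index 0 is reserved for the out-of-vocabulary token `[UNK]` (which would explain why the model later reports a vocabulary size of 78 for 77 unique characters). The helper names `build_lookup`, `to_ids`, and `to_chars` are hypothetical and are not part of the TensorFlow API.

```python
# Illustrative stand-in for the two lookup layers; not TensorFlow's implementation.

def build_lookup(vocab):
    # Index 0 is kept for unknown characters, so real characters start at 1.
    return {ch: i + 1 for i, ch in enumerate(vocab)}

def to_ids(text, table):
    # Characters missing from the table map to 0, the "[UNK]" slot.
    return [table.get(ch, 0) for ch in text]

def to_chars(ids, table):
    # Invert the table and join the characters back into one string.
    inverse = {i: ch for ch, i in table.items()}
    return "".join(inverse.get(i, "[UNK]") for i in ids)

toy_vocab = sorted(set("abcxyz"))
toy_table = build_lookup(toy_vocab)
toy_ids = to_ids("cab?", toy_table)

print(toy_ids)                       # [3, 1, 2, 0]
print(to_chars(toy_ids, toy_table))  # cab[UNK]
```

The `?` is not in the toy vocabulary, so it falls into slot 0 and comes back as `[UNK]` rather than as the original character.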

Actually vectorize the subtitle text using above demonstration code

In [12]:
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
all_ids

<tf.Tensor: shape=(103083,), dtype=int64, numpy=array([44, 57, 54, ..., 68, 69, 11])>

Convert vector into a data stream of character IDs

In [13]:
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)

Retrieve first ten characters from IDs in data stream to demonstrate correctness

In [14]:
for ids in ids_dataset.take(10):
    print(chars_from_ids(ids).numpy().decode('utf-8'))

T
h
e
 
s
t
o
r
y
 


Set `seq_length`, which is the length of a single training sequence (in this case, 100 characters)

In [15]:
seq_length = 100

Demonstrate calling batch to split up subtitles data into training sequences; view the first training sequence to affirm correctness

In [16]:
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

for seq in sequences.take(1):
  print(chars_from_ids(seq))

tf.Tensor(
[b'T' b'h' b'e' b' ' b's' b't' b'o' b'r' b'y' b' ' b't' b'a' b'k' b'e'
 b's' b' ' b'p' b'l' b'a' b'c' b'e' b' ' b'i' b'n' b' ' b't' b'h' b'e'
 b' ' b'f' b'u' b't' b'u' b'r' b'e' b' ' b'w' b'h' b'e' b'r' b'e' b' '
 b't' b'h' b'e' b' ' b'g' b'r' b'e' b'e' b'n' b'h' b'o' b'u' b's' b'e'
 b' ' b'g' b'a' b's' b'e' b's' b' ' b'h' b'a' b'v' b'e' b' ' b'c' b'a'
 b'u' b's' b'e' b'd' b' ' b't' b'h' b'e' b' ' b'p' b'o' b'l' b'a' b'r'
 b' ' b'i' b'c' b'e' b'c' b'a' b'p' b's' b' ' b't' b'o' b' ' b'm' b'e'
 b'l' b't' b','], shape=(101,), dtype=string)


Rejoin above output into a human-readable string

In [17]:
for seq in sequences.take(5):
  print(text_from_ids(seq).numpy())

b'The story takes place in the future where the greenhouse gases have caused the polar icecaps to melt,'
b' flooding coastal cities. To combat over-population, people wishing to have children must apply for a'
b' license.\r\n\r\nThe film starts in the offices of a company called Cybertronics, where its owner, Profes'
b'sor Allen Hobby, wishes to push mecha technology, to make a creation that can love. When his colleagu'
b"es mention their 'love units' Hobby corrects them: he is not talking physical love, but emotional lov"


Define `split_input_target`, a function that can extract both an input text and a target text from a single training sequence

In [18]:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

Demonstrate utility of `split_input_target`

In [19]:
split_input_target(list("Tensorflow"))

(['T', 'e', 'n', 's', 'o', 'r', 'f', 'l', 'o'],
 ['e', 'n', 's', 'o', 'r', 'f', 'l', 'o', 'w'])

Split all training sequences in subtitle data into input and target texts

In [20]:
dataset = sequences.map(split_input_target)

Demonstrate `split_input_target` on first training sequence from subtitle data (input text is first 101 characters *except* the final character; target text is first 101 characters *except* the first character)

In [21]:
for input_example, target_example in dataset.take(1):
    print("Input :", text_from_ids(input_example).numpy())
    print("Target:", text_from_ids(target_example).numpy())

Input : b'The story takes place in the future where the greenhouse gases have caused the polar icecaps to melt'
Target: b'he story takes place in the future where the greenhouse gases have caused the polar icecaps to melt,'


Shuffle subtitle training sequences into randomly ordered batches

In [22]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE))

dataset

<_PrefetchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int64, name=None), TensorSpec(shape=(64, 100), dtype=tf.int64, name=None))>
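
As a quick sanity check on the pipeline above, the arithmetic below works out how many training sequences and batches the two `drop_remainder=True` calls leave, using the character count printed earlier. It is a sketch in plain Python; the variable names are chosen for illustration only.

```python
text_length = 103083  # characters, as printed when the file was loaded
seq_length = 100
batch_size = 64

# batch(seq_length + 1, drop_remainder=True) keeps only full 101-character chunks.
num_sequences = text_length // (seq_length + 1)
leftover_chars = text_length % (seq_length + 1)

# batch(BATCH_SIZE, drop_remainder=True) keeps only full batches of 64 sequences.
batches_per_epoch = num_sequences // batch_size
leftover_sequences = num_sequences % batch_size

print(num_sequences, leftover_chars)          # 1020 63
print(batches_per_epoch, leftover_sequences)  # 15 60
```

Each epoch therefore consists of 15 batches. Because the dataset holds 1020 sequences, the shuffle buffer of 10000 is large enough to hold all of them at once.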

Define constants required for the training model

In [23]:
# Length of the vocabulary in StringLookup Layer
vocab_size = len(ids_from_chars.get_vocabulary())

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

Define the `keras` training model

In [24]:
class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

Create an instance of the `keras` model called `model`

In [25]:
model = MyModel(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

Check the shape of the model's output on one batch

In [26]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 78) # (batch_size, sequence_length, vocab_size)


Produce a summary of the model's characteristics

In [27]:
model.summary()

Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  19968     
                                                                 
 gru (GRU)                   multiple                  3938304   
                                                                 
 dense (Dense)               multiple                  79950     
                                                                 
Total params: 4,038,222
Trainable params: 4,038,222
Non-trainable params: 0
_________________________________________________________________
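
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU uses the Keras default `reset_after=True`, under which each of the three gates has an input kernel, a recurrent kernel, and two bias vectors; the result matching the summary is consistent with that assumption.

```python
vocab_size = 78       # 77 unique characters plus the [UNK] slot
embedding_dim = 256
rnn_units = 1024

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# GRU: 3 gates, each with input weights, recurrent weights, and 2 bias vectors.
gru_params = 3 * (embedding_dim * rnn_units + rnn_units * rnn_units + 2 * rnn_units)

# Dense: weight matrix from rnn_units to vocab_size, plus one bias per output.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
# 19968 3938304 79950 4038222
```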


Begin generating prediction from the first batch as an example

In [28]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

Show prediction product (in character indices)

In [29]:
sampled_indices

array([38, 57, 76, 52, 29, 53, 13, 64,  9, 51,  9, 40,  9, 68, 33, 27, 25,
       28, 39, 66, 47, 51, 54, 60, 62, 24, 49, 20, 26, 34,  0, 52, 75, 20,
       62, 62, 49, 10, 66, 65, 52, 71, 61, 74, 47, 74, 27, 19, 54, 60, 19,
       52, 36, 44, 37,  5, 28, 24, 25, 42, 15, 20, 51, 50, 75, 54,  2, 41,
       17, 66,  0, 14,  9, 48, 33, 34, 12, 32, 19, 16, 29, 25, 58, 37, 32,
       50, 71, 72, 30, 75, 40, 75, 53, 46, 32, 26, 49, 60,  4, 71])

Show prediction output without any prior training

In [30]:
print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())

Input:
 b"the Swinton's swimming pool, where it seems that David was attempting to drown Martin.\r\n\r\nHenry deci"

Next Char Predictions:
 b'Mh\xc2\xb0cDd0o,b,O,sHB?CNqWbekm;Z7AI[UNK]cz7mmZ-qpcvlyWyB6ek6cKTL"C;?R27baze\rP4q[UNK]1,YHI/G63D?iLGavwEzOzdVGAZk!v'


Define the loss function; the model returns logits rather than probabilities, so set the `from_logits` flag to `True`

In [31]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

Compute the mean loss on the example batch, which will be used to check that the untrained model is properly initialized (its exponential should be close to the vocabulary size)

In [32]:
example_batch_mean_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", example_batch_mean_loss)

Prediction shape:  (64, 100, 78)  # (batch_size, sequence_length, vocab_size)
Mean loss:         tf.Tensor(4.3574033, shape=(), dtype=float32)


Verify that exponential of mean loss is roughly equivalent to vocab size 

In [33]:
tf.exp(example_batch_mean_loss).numpy()

78.054184
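
The reasoning behind this check: an untrained model should spread probability roughly evenly over all 78 classes, and the cross-entropy of a uniform guess over N classes is ln(N). The short sketch below compares that baseline with the loss value printed above; the variable names are illustrative.

```python
import math

vocab_size = 78
observed_loss = 4.3574033  # mean loss reported for the untrained model

# Cross-entropy of a uniform prediction over vocab_size classes.
uniform_loss = math.log(vocab_size)

# exp(loss) can be read as the effective number of equally likely choices.
effective_choices = math.exp(observed_loss)

print(round(uniform_loss, 4))       # 4.3567
print(round(effective_choices, 2))  # 78.05
```

A starting loss far above ln(78) would suggest that the initial logits are too large, that is, the model is confidently wrong before any training.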

Compile the model with the Adam optimizer (default settings) and the loss defined above

In [34]:
model.compile(optimizer='adam', loss=loss)

Save checkpoints during training

In [35]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

Set number of training epochs

In [36]:
EPOCHS = 150

Execute training across set number of epochs (show training times per epoch to see progress)

In [37]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/150
...
Epoch 77/150
Epoch 78

Define behavior for making a single step prediction

In [38]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
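
To show what the temperature division and the `-inf` mask do to the sampling distribution, here is a minimal pure-Python sketch. It is not the code used by `OneStep`; the function `sample_with_temperature` and its parameters are hypothetical, and a softmax followed by weighted sampling stands in for `tf.random.categorical`.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, blocked=(), rng=random):
    # Scale the logits: a temperature below 1 sharpens the distribution,
    # a temperature above 1 flattens it.
    scaled = [value / temperature for value in logits]
    # Blocked indices get -inf, so their probability becomes exactly 0.
    for index in blocked:
        scaled[index] = -math.inf
    # Numerically stable softmax (exp(-inf) evaluates to 0.0).
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    probs = [weight / total for weight in weights]
    # Draw one index according to the resulting probabilities.
    choice = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return choice, probs

toy_logits = [2.0, 1.0, 0.5, 0.1]  # index 0 plays the role of "[UNK]"
choice, probs_t1 = sample_with_temperature(toy_logits, temperature=1.0, blocked=[0])
_, probs_t05 = sample_with_temperature(toy_logits, temperature=0.5, blocked=[0])

print(choice)
print([round(p, 3) for p in probs_t1])
print([round(p, 3) for p in probs_t05])
```

In this toy case the blocked index is never drawn even though it has the largest logit, and halving the temperature moves probability toward the strongest remaining candidate.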

Instantiate one step prediction

In [39]:
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)

Predict a series of subtitles!

In [80]:
start = time.time()
states = None
next_char = tf.constant(['The story takes place'])
result = [next_char]

for n in range(450):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

The story takes place in the future where the greenhouse gases have caused the polar icecaps to melt, with the rest of the robot childs and gives him the bad news that his estimate of years is now down to days before hereve codes with the apsears in video clips behind the black hole Gquan't stand till be a trap. As David and Joe leave that he should have listened on Eirth tells Eirth leaves Dopli; the Brue Farile sniving in the future where the greenhouse gases have  

________________________________________________________________________________

Run time: 1.1397690773010254
