<a href="https://colab.research.google.com/github/CMPSC-310-AI-Spring2023/project03_nlp-neurotic-networks/blob/main/ai_project3_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Handle imports

In [53]:
import tensorflow as tf

import numpy as np
import os
import time

Load the subtitle data (a single .srt file containing 11 movie scripts) and print the number of characters to verify that the text was retrieved successfully

In [54]:
from google.colab import drive

drive.mount('/content/drive')

text = open('/content/drive/My Drive/CMPSC310_ColabNotebooks/AI_Project3/11-movie-mashup.srt', 'rb').read().decode(encoding='utf-8')
print(f'Length of text: {len(text)} characters')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Length of text: 1034770 characters


Sample beginning of subtitle text to ensure correct file has been loaded

In [55]:
print(text[:250])

1
00:01:02,896 --> 00:01:05,898
[WAVES CRASHING]

2
00:01:12,322 --> 00:01:14,698
MAN: <i>Those were the years
after the icecaps had melted...</i>

3
00:01:14,866 --> 00:01:17,826
<i>...because of the greenhouse gases...</i>

4
00:01:17


Retrieve the vocabulary (the sorted set of unique characters in the text) and print its size

In [56]:
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

90 unique characters


Demonstrate splitting text into tokens

In [57]:
example_texts = ['abcdefg', 'xyz']

chars = tf.strings.unicode_split(example_texts, input_encoding='UTF-8')
chars

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>

Create a `StringLookup` layer that maps each character to a numeric ID

In [58]:
ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary=list(vocab), mask_token=None)

Convert tokens into numeric character IDs

In [59]:
ids = ids_from_chars(chars)
ids

<tf.RaggedTensor [[61, 62, 63, 64, 65, 66, 67], [84, 85, 86]]>
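
As a rough illustration of what the lookup layers do, the following pure-Python sketch builds the same kind of mapping by hand. It assumes index 0 is reserved for the `[UNK]` token, as `StringLookup` does when `mask_token=None`. The helper names (`build_lookup`, `encode`, `decode`) and the sample text are hypothetical and not part of the notebook.

```python
def build_lookup(text):
    """Map each unique character to an ID, reserving 0 for unknown characters."""
    vocab = sorted(set(text))
    char_to_id = {ch: i + 1 for i, ch in enumerate(vocab)}  # 0 is kept for [UNK]
    id_to_char = ["[UNK]"] + vocab
    return char_to_id, id_to_char


def encode(s, char_to_id):
    """Convert a string to a list of IDs; unseen characters map to 0."""
    return [char_to_id.get(ch, 0) for ch in s]


def decode(ids, id_to_char):
    """Convert a list of IDs back into a string."""
    return "".join(id_to_char[i] for i in ids)


# Hypothetical sample text standing in for the subtitle file.
char_to_id, id_to_char = build_lookup("abcdefg xyz")
sample_ids = encode("abc", char_to_id)
print(sample_ids)
print(decode(sample_ids, id_to_char))
print(encode("q", char_to_id))  # "q" is not in the sample text
```

Because one slot is reserved for `[UNK]`, the lookup vocabulary is one entry larger than the set of unique characters.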

Create the inverse lookup layer, which converts numeric IDs back into characters (so we can see a human-readable end product)

In [60]:
chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)

Rejoin IDs-converted-into-characters to form full text strings

In [61]:
tf.strings.reduce_join(chars_from_ids(ids), axis=-1).numpy()

array([b'abcdefg', b'xyz'], dtype=object)

Condense above code into callable function for future use

In [62]:
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)

Vectorize the full subtitle text using the lookup layer demonstrated above

In [63]:
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
all_ids

<tf.Tensor: shape=(1034770,), dtype=int64, numpy=array([19,  2,  1, ...,  1,  2,  1])>

Convert vector into a data stream of character IDs

In [64]:
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)

Retrieve first ten characters from IDs in data stream to demonstrate correctness

In [65]:
for ids in ids_dataset.take(10):
    print(chars_from_ids(ids).numpy().decode('utf-8'))

1



0
0
:
0
1
:
0


Set `seq_length`, which is the length of a single training sequence (in this case, 100 characters)

In [66]:
seq_length = 100

Demonstrate calling batch to split up subtitles data into training sequences; view the first training sequence to affirm correctness

In [67]:
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

for seq in sequences.take(1):
  print(chars_from_ids(seq))

tf.Tensor(
[b'1' b'\r' b'\n' b'0' b'0' b':' b'0' b'1' b':' b'0' b'2' b',' b'8' b'9'
 b'6' b' ' b'-' b'-' b'>' b' ' b'0' b'0' b':' b'0' b'1' b':' b'0' b'5'
 b',' b'8' b'9' b'8' b'\r' b'\n' b'[' b'W' b'A' b'V' b'E' b'S' b' ' b'C'
 b'R' b'A' b'S' b'H' b'I' b'N' b'G' b']' b'\r' b'\n' b'\r' b'\n' b'2'
 b'\r' b'\n' b'0' b'0' b':' b'0' b'1' b':' b'1' b'2' b',' b'3' b'2' b'2'
 b' ' b'-' b'-' b'>' b' ' b'0' b'0' b':' b'0' b'1' b':' b'1' b'4' b','
 b'6' b'9' b'8' b'\r' b'\n' b'M' b'A' b'N' b':' b' ' b'<' b'i' b'>' b'T'
 b'h' b'o' b's' b'e'], shape=(101,), dtype=string)


Rejoin the first five training sequences into human-readable strings

In [68]:
for seq in sequences.take(5):
  print(text_from_ids(seq).numpy())

b'1\r\n00:01:02,896 --> 00:01:05,898\r\n[WAVES CRASHING]\r\n\r\n2\r\n00:01:12,322 --> 00:01:14,698\r\nMAN: <i>Those'
b' were the years\r\nafter the icecaps had melted...</i>\r\n\r\n3\r\n00:01:14,866 --> 00:01:17,826\r\n<i>...becau'
b'se of the greenhouse gases...</i>\r\n\r\n4\r\n00:01:17,994 --> 00:01:20,662\r\n<i>...and the oceans had risen'
b'\r\nto drown so many cities...</i>\r\n\r\n5\r\n00:01:20,830 --> 00:01:23,332\r\n<i>...along all the shorelines\r'
b'\nof the world.</i>\r\n\r\n6\r\n00:01:23,541 --> 00:01:27,711\r\n<i>Amsterdam, Venice, New York...</i>\r\n\r\n7\r\n0'


Define `split_input_target`, a function that can extract both an input text and a target text from a single training sequence

In [69]:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

Demonstrate utility of `split_input_target`

In [70]:
split_input_target(list("Tensorflow"))

(['T', 'e', 'n', 's', 'o', 'r', 'f', 'l', 'o'],
 ['e', 'n', 's', 'o', 'r', 'f', 'l', 'o', 'w'])

Split all training sequences in subtitle data into input and target texts

In [71]:
dataset = sequences.map(split_input_target)

Demonstrate `split_input_target` on first training sequence from subtitle data (input text is first 101 characters *except* the final character; target text is first 101 characters *except* the first character)

In [72]:
for input_example, target_example in dataset.take(1):
    print("Input :", text_from_ids(input_example).numpy())
    print("Target:", text_from_ids(target_example).numpy())

Input : b'1\r\n00:01:02,896 --> 00:01:05,898\r\n[WAVES CRASHING]\r\n\r\n2\r\n00:01:12,322 --> 00:01:14,698\r\nMAN: <i>Thos'
Target: b'\r\n00:01:02,896 --> 00:01:05,898\r\n[WAVES CRASHING]\r\n\r\n2\r\n00:01:12,322 --> 00:01:14,698\r\nMAN: <i>Those'


Shuffle subtitle training sequences into randomly ordered batches

In [73]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE))

dataset

<_PrefetchDataset element_spec=(TensorSpec(shape=(64, 100), dtype=tf.int64, name=None), TensorSpec(shape=(64, 100), dtype=tf.int64, name=None))>
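
The dataset sizes implied by these settings can be worked out by hand. The sketch below is plain arithmetic on the values reported in this notebook (1034770 characters, 100-character sequences, batches of 64) and assumes `drop_remainder=True` at both batching steps. The variable names are illustrative only.

```python
text_length = 1034770  # characters, as printed when the file was loaded
chunk_length = 100 + 1  # seq_length + 1, so input and target overlap by one
batch_size = 64

# First batching step: split the ID stream into chunks of 101 IDs.
num_sequences = text_length // chunk_length
leftover_chars = text_length % chunk_length

# Second batching step: group the sequences into training batches.
steps_per_epoch = num_sequences // batch_size
leftover_sequences = num_sequences % batch_size

print(num_sequences, leftover_chars)         # 10245 25
print(steps_per_epoch, leftover_sequences)   # 160 5
```

Note that `BUFFER_SIZE = 10000` is slightly smaller than the number of sequences, so the shuffle is close to, but not exactly, a full shuffle of the data.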

Define constants required for the training model

In [74]:
# Size of the vocabulary in the StringLookup layer (90 characters plus the [UNK] token = 91)
vocab_size = len(ids_from_chars.get_vocabulary())

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

Define the `keras` training model

In [75]:
class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

Create an instance of the model, called `model`

In [76]:
model = MyModel(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

Check the shape of the model's output

In [77]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 91) # (batch_size, sequence_length, vocab_size)


Produce a summary of the model's characteristics

In [78]:
model.summary()

Model: "my_model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     multiple                  23296     
                                                                 
 gru_1 (GRU)                 multiple                  3938304   
                                                                 
 dense_1 (Dense)             multiple                  93275     
                                                                 
Total params: 4,054,875
Trainable params: 4,054,875
Non-trainable params: 0
_________________________________________________________________
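
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras GRU default of `reset_after=True`, which gives each of the three gates two bias vectors. The function name `count_params` is hypothetical.

```python
def count_params(vocab_size, embedding_dim, rnn_units):
    """Return (embedding, gru, dense) parameter counts for the model above."""
    embedding = vocab_size * embedding_dim
    # Three gates, each with input weights, recurrent weights, and
    # two bias vectors (assumes reset_after=True).
    gru = 3 * (embedding_dim * rnn_units + rnn_units * rnn_units + 2 * rnn_units)
    # Dense layer: one weight per (unit, character) pair plus one bias per character.
    dense = rnn_units * vocab_size + vocab_size
    return embedding, gru, dense


embedding, gru, dense = count_params(vocab_size=91, embedding_dim=256, rnn_units=1024)
print(embedding, gru, dense, embedding + gru + dense)
```

The GRU accounts for almost all of the parameters, so `rnn_units` is the setting that most affects model size.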


Sample next-character predictions for the first example in the batch

In [79]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

Show prediction product (in character indices)

In [80]:
sampled_indices

array([48, 14, 60, 90, 14, 25, 24, 43, 66, 77, 20, 64, 16, 43, 55, 63, 90,
       20, 69, 54, 75, 47, 69, 79, 27,  8, 19,  8, 39, 16, 53, 37, 90,  3,
       40, 89,  3, 65, 33, 26, 47, 56, 44, 44, 67, 24, 13, 63, 31, 36,  9,
       75, 44, 21, 71, 80,  5, 77, 20, 42,  1, 81, 41, 27, 80, 38, 72, 78,
       30, 74, 52, 65, 90, 40, 14,  4, 40, 69, 12, 32,  4, 12,  0, 63, 42,
       56, 77, 37, 21, 68,  7, 36, 40, 35, 49,  7, 12, 53, 47, 52])

Show prediction output without any prior training

In [81]:
print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())

Input:
 b"817\r\nCome on, baby.\r\n\r\n534\r\n00:47:59,335 --> 00:48:02,254\r\nCharles isn't as bad as some.\r\n\r\n535\r\n00:"

Next Char Predictions:
 b'P,]\xe2\x99\xaa,76Kfq2d.KWc\xe2\x99\xaa2iVoOis9%1%G.UE\xe2\x99\xaa H\xc3\xaa eA8OXLLg6*c>D&oL3kt"q2J\nuI9tFlr=nTe\xe2\x99\xaaH,!Hi)?!)[UNK]cJXqE3h$DHCQ$)UOT'


Define the loss function; `from_logits` is set to `True` because the model outputs raw logits rather than probabilities

In [82]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

Generate mean loss, which will be used to ascertain that untrained model is properly initialized (when compared against vocabulary size)

In [83]:
example_batch_mean_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", example_batch_mean_loss)

Prediction shape:  (64, 100, 91)  # (batch_size, sequence_length, vocab_size)
Mean loss:         tf.Tensor(4.5080585, shape=(), dtype=float32)


Verify that exponential of mean loss is roughly equivalent to vocab size 

In [84]:
tf.exp(example_batch_mean_loss).numpy()

90.74547
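
The reasoning behind this check: a model that assigns equal probability to every character has a cross-entropy loss of ln(vocab_size), so the exponential of that loss equals the vocabulary size. The sketch below compares that baseline with the mean loss reported above, using only the standard library. The variable names are illustrative.

```python
import math

n_classes = 91             # vocabulary size, including [UNK]
observed_loss = 4.5080585  # mean loss printed for the untrained model

# Cross-entropy of a uniform guess over n_classes characters.
uniform_loss = math.log(n_classes)

print(f"uniform baseline loss: {uniform_loss:.4f}")
print(f"observed loss:         {observed_loss:.4f}")
print(f"exp(observed loss):    {math.exp(observed_loss):.2f}")
```

An initial loss far above this baseline would suggest a badly scaled initialization.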

Compile the model with the Adam optimizer (default settings) and the loss defined above

In [85]:
model.compile(optimizer='adam', loss=loss)

Save checkpoints during training

In [86]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

Set number of training epochs

In [87]:
EPOCHS = 30

Execute training across set number of epochs (show training times per epoch to see progress)

In [88]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


Define behavior for making a single step prediction

In [89]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states

Instantiate the one-step prediction model

In [90]:
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)
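
To make the roles of `temperature` and the `[UNK]` mask concrete, here is a minimal pure-Python sketch of the same sampling idea, independent of TensorFlow. The function name `sample_next_id` and the example logits are hypothetical, and the sketch does not handle the case where every ID is banned.

```python
import math
import random


def sample_next_id(logits, temperature=1.0, banned_ids=(), rng=random):
    """Sample one ID from logits after temperature scaling and masking."""
    # Lower temperatures sharpen the distribution; higher ones flatten it.
    scaled = [x / temperature for x in logits]
    # A logit of -inf gives the banned ID a probability of exactly zero.
    for i in banned_ids:
        scaled[i] = -math.inf
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


rng = random.Random(0)
example_logits = [5.0, 1.0, 3.0, 2.0]  # made-up values; ID 0 plays the role of [UNK]
draws = [sample_next_id(example_logits, banned_ids=[0], rng=rng) for _ in range(500)]
print(sorted(set(draws)))
```

With a temperature well below 1.0 the sampler almost always picks the highest-scoring allowed ID, which tends to produce more repetitive text. Values above 1.0 produce more varied but less coherent output.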

Predict a series of subtitles!

In [91]:
start = time.time()
states = None
next_char = tf.constant(['1\n'])
result = [next_char]

for _ in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

1
00:20:45,125 --> 00:21:05,341
Sometimes, we've got Kronole hell
to know that.

254
00:23:28,911 --> 00:23:23,208
They can't rig the man!

433
00:51:05,090 --> 00:51:15,637
<i>If we lose the scientific
for Penpaser, I was an awform.

298
00:29:50,932 --> 00:30:00,567
Hey, boy!

706
01:07:13,820 --> 01:07:05,874
CASE: Go! Go! You can't say him whatever.

470
01:03:30,687 --> 01:03:36,347
There's been and extinction.

1888
01:52:29,000 --> 01:52:29,832
developed ending to the storm,
artise cameras will slow

1075
01:03:13,740 --> 01:03:14,340
Dave any shands and fatherful.

431
00:33:11,730 --> 00:33:12,722
(FOOTSTEPS APPROACHING)

187
00:12:32,899 --> 00:12:23,066
You want a real motival.

655
01:06:42,024 --> 01:06:55,430
He wants to get home...

1748
01:43:40,291 --> 01:43:42,953
Just got to go here they are.

536
00:58:15,934 --> 00:58:28,070
Okay. Looks forgy telling
you to put them onl of us.

71
00:04:25,832 --> 00:04:02,291