<a href="https://colab.research.google.com/github/DanB1421/DATA602/blob/main/Brilliant_Problem_Set_12-qederf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


In the Google shared drive (/602/data), the file enron.txt is a subset of the Enron Corpus, a collection of over 500,000 emails from senior management of Enron Corporation, written in the period leading up to its collapse in 2001. The subset comprises the text of about 15,000 emails available through the TensorFlow Datasets (TFDS) source aeslc (Annotated Enron Subject Line Corpus).

Using this dataset, construct a neural net that generates 50 random characters, beginning with the sequence `The`, sampled from the distribution of text in the file.

This exercise can be replicated using any of the following sources in the texts and documentation:

* **Raschka** - Character-level language modeling in TensorFlow, pages 600-613
* **Géron** - Generating Shakespearean Text Using a Character RNN, pages 526-534
* **TensorFlow documentation** [Text Generation with an RNN](https://www.tensorflow.org/text/tutorials/text_generation)


Adjust the temperature ($\alpha$ in Raschka) to avoid repeating text. A GPU runtime is advised for fitting the model; even with a GPU, training may take several hours.
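
To make the role of temperature concrete, here is a minimal sketch in plain Python. The logits are made up for illustration, and `softmax_with_temperature` is a hypothetical helper, not part of TensorFlow. It divides the logits by the temperature before the softmax, which mirrors the `predicted_logits/self.temperature` step used later in this notebook.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-character vocabulary (illustration only).
logits = [2.0, 1.0, 0.5, 0.1]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A lower temperature concentrates probability on the most likely character, which tends to produce safer but more repetitive text. A higher temperature flattens the distribution, giving more varied but less coherent text.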


In [None]:
from google.colab import drive
import numpy as np
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
#read the file content into the variable corpus
with open('/content/drive/Shareddrives/DS602-F22/Data/enron.txt', 'r', encoding='utf8') as f:
  corpus = f.read()

In [None]:
import tensorflow as tf

import os
import time

In [None]:
print(f'Length of text: {len(corpus)} characters')

Length of text: 14307407 characters


In [None]:
print(corpus[:250])

Greg and Mark:  Attached is a draft of the very short story that will accompany your profiles in Enron Business.
(PR management has approved.)
The purpose is simply to introduce you and quickly address the issue that's on everyone's mind, the stock p


In [None]:
vocab = sorted(set(corpus))
print(f'{len(vocab)} unique characters')

96 unique characters


In [None]:
example_texts = ['abcdefg', 'xyz']

chars = tf.strings.unicode_split(example_texts, input_encoding='UTF-8')
chars

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>

In [None]:
ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary=list(vocab), mask_token=None)

In [None]:
ids = ids_from_chars(chars)
ids

<tf.RaggedTensor [[67, 68, 69, 70, 71, 72, 73], [90, 91, 92]]>

In [None]:
chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)

In [None]:
chars = chars_from_ids(ids)
chars

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>

In [None]:
tf.strings.reduce_join(chars, axis=-1).numpy()

array([b'abcdefg', b'xyz'], dtype=object)

In [None]:
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)

In [None]:
all_ids = ids_from_chars(tf.strings.unicode_split(corpus, 'UTF-8'))
all_ids

<tf.Tensor: shape=(14307407,), dtype=int64, numpy=array([42, 84, 71, ..., 67, 85,  2])>

In [None]:
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)
for ids in ids_dataset.take(10):
    print(chars_from_ids(ids).numpy().decode('utf-8'))

G
r
e
g
 
a
n
d
 
M


In [None]:
seq_length = 50

In [None]:
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

for seq in sequences.take(1):
  print(chars_from_ids(seq))

tf.Tensor(
[b'G' b'r' b'e' b'g' b' ' b'a' b'n' b'd' b' ' b'M' b'a' b'r' b'k' b':'
 b' ' b' ' b'A' b't' b't' b'a' b'c' b'h' b'e' b'd' b' ' b'i' b's' b' '
 b'a' b' ' b'd' b'r' b'a' b'f' b't' b' ' b'o' b'f' b' ' b't' b'h' b'e'
 b' ' b'v' b'e' b'r' b'y' b' ' b's' b'h' b'o'], shape=(51,), dtype=string)


In [None]:
for seq in sequences.take(5):
  print(text_from_ids(seq).numpy())

b'Greg and Mark:  Attached is a draft of the very sho'
b'rt story that will accompany your profiles in Enron'
b' Business.\n(PR management has approved.)\nThe purpos'
b'e is simply to introduce you and quickly address th'
b"e issue that's on everyone's mind, the stock price."


In [None]:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

In [None]:
split_input_target(list("Tensorflow"))

(['T', 'e', 'n', 's', 'o', 'r', 'f', 'l', 'o'],
 ['e', 'n', 's', 'o', 'r', 'f', 'l', 'o', 'w'])

In [None]:
dataset = sequences.map(split_input_target)

In [None]:
for input_example, target_example in dataset.take(1):
    print("Input :", text_from_ids(input_example).numpy())
    print("Target:", text_from_ids(target_example).numpy())

Input : b'Greg and Mark:  Attached is a draft of the very sh'
Target: b'reg and Mark:  Attached is a draft of the very sho'


In [None]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE))

dataset

<_PrefetchDataset element_spec=(TensorSpec(shape=(64, 50), dtype=tf.int64, name=None), TensorSpec(shape=(64, 50), dtype=tf.int64, name=None))>
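
As a rough check on the size of the training job, the number of sequences and batches per epoch can be worked out from the values above. This is a sketch in plain Python that only redoes the arithmetic. It assumes the corpus length printed earlier and that both `batch` calls drop their remainders, as in the code.

```python
corpus_length = 14_307_407  # characters, from len(corpus) above
seq_length = 50
batch_size = 64

# Each example consumes seq_length + 1 characters; the leftover tail is dropped.
num_sequences = corpus_length // (seq_length + 1)

# Only full batches are kept, so the final partial batch is dropped too.
steps_per_epoch = num_sequences // batch_size

print(num_sequences)    # 280537
print(steps_per_epoch)  # 4383
```

Each epoch therefore runs a few thousand gradient steps, which is consistent with the advice to use a GPU runtime.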

In [None]:
# Length of the vocabulary in StringLookup Layer
vocab_size = len(ids_from_chars.get_vocabulary())

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [None]:
class MyModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

In [None]:
model = MyModel(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

In [None]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 50, 97) # (batch_size, sequence_length, vocab_size)


In [None]:
model.summary()

Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  24832     
                                                                 
 gru (GRU)                   multiple                  3938304   
                                                                 
 dense (Dense)               multiple                  99425     
                                                                 
Total params: 4,062,561
Trainable params: 4,062,561
Non-trainable params: 0
_________________________________________________________________
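
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras GRU default `reset_after=True`, which gives each of the three weight sets (update gate, reset gate, candidate state) two bias vectors. It also uses a vocabulary size of 97, that is, the 96 characters plus the `[UNK]` token.

```python
vocab_size = 97      # 96 characters plus the [UNK] token
embedding_dim = 256
rnn_units = 1024

# Embedding: one vector of length embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# GRU: three weight sets, each with an input kernel, a recurrent kernel,
# and two bias vectors (assumes reset_after=True).
gru_params = 3 * (embedding_dim * rnn_units + rnn_units * rnn_units + 2 * rnn_units)

# Dense: weight matrix plus one bias per output logit.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params

print(embedding_params)  # 24832
print(gru_params)        # 3938304
print(dense_params)      # 99425
print(total_params)      # 4062561
```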


In [None]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

In [None]:
sampled_indices

array([63, 90, 67, 60, 84, 92, 79, 16, 71, 12, 95,  0, 41, 55, 41, 79, 25,
       38, 20, 75,  3, 28, 14, 93, 42, 51, 27, 26,  3, 85, 64, 21,  2, 33,
       41, 91, 93, 22, 68, 49,  2, 78,  6, 57, 76,  4, 68, 62, 22, 51])

In [None]:
print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())

Input:
 b' rules are and guess what we can expect.\nInstead h'

Next Char Predictions:
 b'\\xaYrzm-e)}[UNK]FTFm6C1i 9+{GP87 s]2\n>Fy{3bN\nl#Vj!b[3P'


In [None]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
example_batch_mean_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", example_batch_mean_loss)

Prediction shape:  (64, 50, 97)  # (batch_size, sequence_length, vocab_size)
Mean loss:         tf.Tensor(4.5728498, shape=(), dtype=float32)


In [None]:
tf.exp(example_batch_mean_loss).numpy()

96.819626
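
The exponential of the mean loss is close to the vocabulary size of 97. That is what one would expect from an untrained model, whose predictions are roughly uniform over the vocabulary. A short sketch of the comparison in plain Python, using the loss value printed above:

```python
import math

vocab_size = 97
observed_loss = 4.5728498  # mean loss printed above

# Cross-entropy of a perfectly uniform prediction over the vocabulary.
uniform_loss = math.log(vocab_size)

print(round(uniform_loss, 4))             # 4.5747
print(round(math.exp(observed_loss), 2))  # 96.82
```

A value of `exp(loss)` well above the vocabulary size at initialization would suggest a badly initialized model or a mismatch between logits and targets.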

In [None]:
model.compile(optimizer='adam', loss=loss)

In [None]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [None]:
EPOCHS = 10
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
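
The prediction mask works because adding negative infinity to a logit drives its softmax probability to exactly zero. Here is a minimal sketch in plain Python with made-up logits, in which index 0 stands in for `[UNK]`. The `softmax` helper is hypothetical and written only for this illustration.

```python
import math

def softmax(logits):
    """Softmax that tolerates -inf entries (they get probability 0)."""
    m = max(x for x in logits if x != -math.inf)
    exps = [math.exp(x - m) for x in logits]  # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits; index 0 plays the role of [UNK].
logits = [1.5, 0.2, -0.3, 0.9]
mask = [-math.inf, 0.0, 0.0, 0.0]

masked_logits = [l + m for l, m in zip(logits, mask)]
probs = softmax(masked_logits)

print([round(p, 3) for p in probs])
```

The masked entry can never be sampled, and the remaining probabilities are renormalized over the other characters.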

In [None]:
one_step_model = OneStep(model, chars_from_ids, ids_from_chars, temperature=0.5)

In [None]:
start = time.time()
states = None
next_char = tf.constant(['The'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

The contact will be contacted by the end of the day.
If you have any questions, please contact me at Caroline in your conference call that will be available to arrive at a value of the formal application with the same to subsidize a payment of the most possible to continue to be undertaken in the event of such better start time, and if the first notice of the program is in a process of successful natural gas gas consumers that have been assigned to the gas and out of the  continuing to the contact information in the new building.
The team is to pursue our call yesterday to participate in the Commission and the commercial team to perform the restricted account with the contracts before your best people will be sure that you are not in the present to the draft of the PA and I will be provided to the Enron Corp. San Diego to the following the term sheet to be attending a massive under the Star Conference to  accomplish the company (10 days) and we are currently in  the office of the Calif

In [None]:
start = time.time()
states = None
next_char = tf.constant(['The', 'The', 'The', 'The', 'The'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result, '\n\n' + '_'*80)
print('\nRun time:', end - start)

tf.Tensor(
[b'The outside contest can be determined at the  fact of the entire company.\nThe person will be out of the office and then mentioned to the continued markets.\nI have asked you to pass the Enron short and possible  and speak with your Enron has another trading on the Enron Corp. Savings Plan address for the YPR market.\nAs part of the consumer the comments which we will be offers and the public accounts from the Conference Room.\nWe need to have a contribution to the policy  in the new database.\nThis is the first month to get some time to sign the greatest contact list of the press that I wanted to make sure that the schedule will be done to take a look at the phone given the deadline for a different day.\nThis is the list with the contracts that we can send the confirmation from the company with the security such actions for the extent that we can receive an executed form of agreement with the Master Agreement.\nIn the meantime, the continued car concerns are the counterp