# T-725 Natural Language Processing: Lab 5
In today's lab, we will be working with neural networks, using GRUs and Transformers for text generation.

To begin with, do the following:
* Select `"File" > "Save a copy in Drive"` to create a local copy of this notebook that you can edit.
* **Select `"Runtime" > "Change runtime type"`, and make sure that you have "Hardware accelerator" set to "GPU"**
* Select `"Runtime" > "Run all"` to run the code in this notebook.

In [1]:
import os
import warnings

# Suppress some warnings from TensorFlow about deprecated functions
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

## Generating text with neural networks
Let's create a neural language model and use it to generate some text. This time, we will use character embeddings rather than word embeddings. They are created in exactly the same way, and are often used together in neural network-based models. One benefit of using character embeddings is that we can generate words that our model has never seen before.

The model takes as input a sequence of characters and predicts which character is most likely to follow. We will generate text by repeatedly predicting and appending the next character to a string. First, however, we need some text to train it on.


In [2]:
# Based on the following tutorial:
# https://www.tensorflow.org/tutorials/text/text_generation

import tensorflow as tf
import numpy as np
import os
import time

# Let's download some text by Shakespeare to train our model
url = 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt'
path_to_file = tf.keras.utils.get_file('shakespeare.txt', url)

with open(path_to_file, encoding='utf-8') as f:
  shakespeare = f.read()

print("First 250 characters:")
print(shakespeare[:250])

print ("Length of text: {:,} characters".format(len(shakespeare)))

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
First 250 characters:
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

Length of text: 1,115,394 characters


Now we can create training examples for our model. Each example will be a pair of strings: one input string containing 100 characters, and a target string that is one character ahead. For example, the first pair we create is:

**Input string**:  `'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'`

**Target string**: `'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '`

However, before we can start training, we need to convert our text into a list of integers, where each integer represents a different character. For example, "First Citizen" becomes:

```
Character:   F   i   r   s   t      C   i   t   i   z   e   n
Integer:   [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52]
```

In [3]:
BATCH_SIZE = 64  # Batch size
BUFFER_SIZE = 10000  # Buffer size to shuffle the dataset

def split_input_target(chunk):
  # Create (input_string, output_string) pairs
  input_text = chunk[:-1]
  target_text = chunk[1:]
  return input_text, target_text

def prepare_text(text):
  # The unique characters in the file
  vocab = sorted(set(text))
  print ('{} unique characters'.format(len(vocab)))

  # Creating a mapping from unique characters to indices
  char_map = {
      'char_to_index': {char: index for index, char in enumerate(vocab)},
      'index_to_char': np.array(vocab)
  }

  text_as_int = np.array([char_map['char_to_index'][c] for c in text])

  # The maximum length sentence we want for a single input in characters
  seq_length = 100
  examples_per_epoch = len(text) // (seq_length+1)

  # Create training examples / targets
  char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
  sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
  dataset = sequences.map(split_input_target)

  # (TF data is designed to work with possibly infinite sequences,
  # so it doesn't attempt to shuffle the entire sequence in memory. Instead,
  # it maintains a buffer in which it shuffles elements).
  dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

  return dataset, vocab, examples_per_epoch, char_map

Now we can create and train the neural network.

In [34]:
import os

def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)


def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
      tf.keras.layers.Embedding(vocab_size,
                                embedding_dim,
                                batch_input_shape=[batch_size, None]),
      tf.keras.layers.GRU(rnn_units,
                          return_sequences=True,
                          recurrent_initializer='glorot_uniform',
                          stateful=True),
      tf.keras.layers.Dense(vocab_size)
  ])

  return model


def create_model(text, epochs=3, embedding_dim = 256, rnn_units = 1024):
  dataset, vocab, examples_per_epoch, char_map = prepare_text(text)

  vocab_size = len(vocab)  # Length of the vocabulary in chars

  model = build_model(vocab_size, embedding_dim, rnn_units, BATCH_SIZE)

  # Compile the model
  model.compile(optimizer='adam', loss=loss)

  # Create checkpoints once the model has been trained
  checkpoint_dir = './training_checkpoints'
  checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
  checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
      filepath=checkpoint_prefix,
      save_weights_only=True)

  # Train the model
  history = model.fit(
      dataset,
      epochs=epochs,
      callbacks=[checkpoint_callback])

  tf.train.latest_checkpoint(checkpoint_dir)
  model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
  model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
  model.build(tf.TensorShape([1, None]))

  return model, char_map

In [5]:
shake_model, shake_chars = create_model(shakespeare)

65 unique characters
Epoch 1/3
Epoch 2/3
Epoch 3/3


Now that we've trained our model, we can finally use it to generate some text. The following function takes a model and a string as input, and continually predicts and appends the next character to the string until it becomes 1,000 characters long.

In [16]:
def generate_text(model, char_map, start_string, temperature=1.0, num_generate = 1000):
    # Evaluation step (generating text using the learned model)
    # Low temperatures results in more predictable text.
    # Higher temperatures results in more surprising text.
    if not start_string:
        print("start_string can't be empty")
        return ""

    # Converting our start string to numbers (vectorizing)
    input_eval = [char_map['char_to_index'][s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty string to store our results
    text_generated = [""] * num_generate

    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # using a categorical distribution to predict the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # We pass the predicted word as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated[i] = char_map['index_to_char'][predicted_id]

    return (start_string + ''.join(text_generated))

Let's generate some text!

In [9]:
print(generate_text(shake_model, shake_chars, "ROMEO: ", temperature=1.0))

ROMEO: that not the moly tome:
The proters of piece and teasure farthough, remeant strak.
Till: the medy of for whoshings the Heard I have you; ear hearantis
fothind holes of yours in liage.

CLINFORDEY:
Shile I have deach, you queen smeer.

KINGURIIA:
Ballay, lo, hash voward, xit fornes you what is to musca,
He is the boting do be feathis.

HERMIONE:
O, gentle,--his lime.

GLOUCESTIO:
Now, that you mean a
promence merncio endy?

WISTHES MIIRAND:
Vencless of call, affer prame
yiur entigred and fetty. mark is bug her
We twoo, been, when sheal, when I strak.

NICHAMDINE:
First I and kiss upon tho biving was.

JUCHESS OF YORK:
End with yet.

DUKE OF AUEE:
Madam, the stread wey to cut one; rid that woo, When I deer epert;
Is cried; as him, goin reme but before.

SICINIUS:
Pet my heart thisp:
she stiel not and roth pursapp.

JULIET:
O herly, My repart.
Hip I connent she conterness.
Shippinio's me subrece agbear for your will
Like makeaid I Then,--haves? Whancestixing heard to a inful;
And w

# Assignment
Answer the following questions and hand in your solution in Canvas before 8:30 on Monday morning, October 2nd. Remember to save your file before uploading it.

## Question 1
The `temperature` parameter of `generate_text()`, defined earlier in the notebook, controls how predictable the generated text will be. The lower the temperature, the more the function will tend to append the most likely character (according to the model's prediction). A higher temperature introduces some randomness, leading to more unpredictable text.

The text we generated above used a temperature of 1.0. Try generating more text using the Shakespeare model, once using a temperature of 0.2 and again using a temperature of 0.8.

In [11]:
# Your solution here
text_1 = generate_text(shake_model, shake_chars, "ROMEO: ", temperature=.2)
text_2 = generate_text(shake_model, shake_chars, "ROMEO: ", temperature=.8)

print("################### TEMP: 0.2")
print(text_1)

print("################### TEMP: 0.8")
print(text_2)

################### TEMP: 0.2
ROMEO: the could not the duke of the seath.

LEONTES:
I do not the did the suppection of the seement.

MENENIUS:
What is the world the death of the stard of the death.

COMINIUS:
What is the strange of the world the death.

KING RICHARD III:
Why, the world when the death of the death, the father with the marries
And the death of the comes to the dread with the will in the courter of the world and the courter from the streather with the command.

KING RICHARD III:
What is the stranger the did of the world and death,
And when I shall not the death and leave the trumper of the seath.

KING RICHARD III:
Who, I have stay the world not the dear of the world be not the commander that well the death.

MERCUTIO:
A may a thing in the courter that shall be the world,
And what the strange of the did of the marries and the more that I shall be not the company.

KING RICHARD III:
What is the companter the death of the companter to the streather
As the death of the death

## Question 2
NLTK's `names` corpus contains a list of approximately 8,000 English names. Train a new model on `names_raw` for at least 20 epochs using the `create_model(text, epochs=n)` function defined earlier. Use the trained model to generate a list of names (with the `generate_text` function defined earlier), starting with your own first name. Your name should not contain any non-English characters, and should end with an `\n`.

Print out the names that do not appear in the training data. Do you get any actual names (or at least names that sound plausible)?

In [14]:
# Don't modify this code cell
import nltk
from nltk.corpus import names
nltk.download('names', quiet=True)

# Print out a few examples
names_raw = names.raw()
names_unique = set(names_raw.split())
names_raw = "\n".join(names_unique)
print(names_raw.splitlines()[:5])

['Englebert', 'Ahmad', 'Ibrahim', 'Astrid', 'Aylmer']


In [21]:
# Your solution here
names_model, names_chars = create_model(names_raw, epochs=20)

55 unique characters
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [36]:
names_gen_raw = generate_text(names_model, names_chars, "Andrea\n", temperature=0.2)

In [38]:
def print_names_info(generated, train_set):
    names_gen = generated.split("\n")

    unique_names = list(set(names_gen))

    print(f"Generated {len(names_gen)} names ({len(unique_names)} unique)")

    novel_names = [name for name in unique_names if not (name in train_set)]

    print(f"Generated {len(novel_names)} new names")
    print(novel_names[:20])

print_names_info(names_gen_raw, names_raw)

Generated 154 names (102 unique)
Generated 53 new names
['Saberta', 'Marise', 'Annelle', 'Sabelle', 'Annella', 'Maritta', 'Beris', 'Berrin', 'Gerista', 'Sheris', 'Terie', 'Arissa', 'Sharina', 'Alerine', 'Jonne', 'Anelle', 'Gerina', 'Jorina', 'Sheria', 'Telly']


##Question 3
The size of the model can make a difference when it comes to performance. Create a new model that has twice the number of hidden units as the previous model and double the size of the embeddings. How does the performance change? What happens if you decrease these parameters?

In [35]:
# Your solution here
# 512-dimensional embeddings, 2048 RNN units
names_model2, names_chars2 = create_model(names_raw, epochs=20, embedding_dim=256*2, rnn_units=1024*2)

55 unique characters
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [39]:
names_gen_raw = generate_text(names_model2, names_chars2, "Andrea\n", temperature=0.2)
print_names_info(names_gen_raw, names_raw)

Generated 149 names (80 unique)
Generated 37 new names
['Derina', 'Merisa', 'Annelle', 'Annella', 'Josine', 'Jelina', 'Andelina', 'Charie', 'Andelin', 'Anelle', 'Carise', 'Daris', 'Sheria', 'Cherista', 'Darina', 'Anda', 'Darila', 'Andelle', 'Marella', 'Andeline']


In [51]:
# 128-dimensional embeddings, 512 RNN units
names_model3, names_chars3 = create_model(names_raw, epochs=20, embedding_dim=256//2, rnn_units=1024//2)

names_gen_raw = generate_text(names_model3, names_chars3, "Andrea\n", temperature=0.2)
print_names_info(names_gen_raw, names_raw)

55 unique characters
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Generated 155 names (76 unique)
Generated 36 new names
['Berine', 'Cardie', 'Marrin', 'Derina', 'Lanne', 'Tarie', 'Sharis', 'Sharin', 'Danne', 'Marista', 'Sherin', 'Marrie', 'Carche', 'Brista', 'Derin', 'Lerin', 'Cariste', 'Alben', 'Alenna', 'Darina']


## Question 4
Transformer large language models can also generate text. The following code imports a pretrained GPT-2 model from Huggingface's Transformer library. This model can then be used directly to generate text, given a prompt as context. Alter the prompt to have the transformer model (GPT-2) generate an engaging story beginning using one of the following story starters:


*   It was the day the moon fell.
*   Am I in heaven?  What happened to me?
*   Wandering through the graveyard it felt like something was watching me.
*   Three of us.  We were the only ones left, the only ones to make it to the island.

There are several different methods to choose from to generate the text (as seen in the commented out lines below). Try out the different methods and play with the parameters. This [blogpost](https://huggingface.co/blog/how-to-generate) explains their differences.

Which method has the best performance?

Can GPT-2 generate Shakespere?

In [40]:
# Uncomment if transformers is not installed
!pip install transformers

Collecting transformers
  Downloading transformers-4.33.3-py3-none-any.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m51.3 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m295.0/295.0 kB[0m [31m37.2 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m95.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m79.2 MB/s[0m eta [36m0:00:0

In [54]:
# Do not modify this code
# https://huggingface.co/docs/transformers/main_classes/text_generation

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")

model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Today I believe we can finally"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_length=100) # Greedy search
#outputs = model.generate(input_ids, max_length=100, num_beams=5, no_repeat_ngram_size=3, early_stopping=True) # Beam search
#outputs = model.generate(input_ids, do_sample=True, max_length=100, top_k=0, temperature=0.7) # Sampling
#outputs = model.generate(input_ids, do_sample=True, max_length=100, top_k=50) # Top-k
#outputs = model.generate(input_ids, do_sample=True, max_length=100, top_k=50, top_p=0.92) # Top-p

tokenizer.batch_decode(outputs, skip_special_tokens=True)

['Today I believe we can finally get to the point where we can make a difference in the lives of the people of the United States of America.\n\nI believe that we can make a difference in the lives of the people of the United States of America.\n\nI believe that we can make a difference in the lives of the people of the United States of America.\n\nI believe that we can make a difference in the lives of the people of the United States of America.\n\n']

In [79]:
# Your solution here
def generate_from_prompt(prompt, model, mode="greedy", max_length=100):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    if mode == "greedy":
        args = {}
    elif mode == "beam":
        args = {"num_beams": 5, "no_repeat_ngram_size": 3, "early_stopping": True}
    elif mode == "sampling":
        args = {"do_sample": True, "top_k": 0, "temperature": 0.7}
    elif mode == "topk":
        args = {"do_sample": True, "top_k": 50}
    elif mode == "topp":
        args = {"do_sample": True, "top_k": 50, "top-p": 0.92}
    else:
        return ""

    outputs = model.generate(input_ids, pad_token_id=tokenizer.eos_token_id, max_length=max_length, **args)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

In [75]:
print(generate_from_prompt("Today I believe we can finally", model)[0]) # Greedy

Today I believe we can finally get to the point where we can make a difference in the lives of the people of the United States of America.

I believe that we can make a difference in the lives of the people of the United States of America.

I believe that we can make a difference in the lives of the people of the United States of America.

I believe that we can make a difference in the lives of the people of the United States of America.




In [76]:
print(generate_from_prompt("It was the day the moon fell.", model, mode="beam")[0]) # Beam search

It was the day the moon fell.

"It was a beautiful day," she said. "It was beautiful. It was beautiful."


In [77]:
print(generate_from_prompt("Wandering through the graveyard it felt like something was watching me.", model, mode="sampling")[0]) # Sampling

Wandering through the graveyard it felt like something was watching me.
I looked around. There were the two guards that I recognized. One was unarmed, with a pistol, the other a single-handed assault rifle.
They both looked at me with an expression of disapproval.
"I don't know what to say," I said.
The guard said nothing. He glanced at me. "You're not doing anything wrong."
I looked back.
I didn't have anything to


In [65]:
print(generate_from_prompt("Am I in heaven? What happened to me?", model, mode="topk")[0]) # Top-k

Am I in heaven? What happened to me? Oh, God, please, if you hear me, be sure to pray." I bowed in prayer. I was about to pray another minute but I was no longer willing to pray because I realized my answer was not going very well and I was now standing at a cross and not feeling well. What seemed to me to be one of those "things", was more than a little disturbing. Now the prayer changed. What I saw was a world wide


In [68]:
print(generate_from_prompt("Wandering through the graveyard it felt like something was watching me.", model, mode="topp")[0]) # Top-p

Wandering through the graveyard it felt like something was watching me. But when I was finished, I turned around and looked for where to find the keys. No one was there.

The man in the black hood pulled a lever and pulled a button. Then, he gave me a hug. I started to run for the door. As I passed by the other guy, I saw the one in the hood that was holding the key. The boy in the black hood started running as fast as


In [73]:
print(generate_from_prompt(shakespeare[:256], model, mode="sampling")[0]) # Sampling

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:

I refuse to give up the fight for the people.

First Citizen:

You must
