# T-725 Natural Language Processing: Lab 5
In today's lab, we will be working with neural networks, using GRUs and Transformers for text generation.

To begin with, do the following:
* Select `"File" > "Save a copy in Drive"` to create a local copy of this notebook that you can edit.
* **Select `"Runtime" > "Change runtime type"`, and make sure that you have "Hardware accelerator" set to "GPU"**
* Select `"Runtime" > "Run all"` to run the code in this notebook.

In [None]:
import os
import warnings

# Suppress some warnings from TensorFlow about deprecated functions
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

## Generating text with neural networks
Let's create a neural language model and use it to generate some text. This time, we will use character embeddings rather than word embeddings. They are created in exactly the same way, and are often used together in neural network-based models. One benefit of using character embeddings is that we can generate words that our model has never seen before.

The model takes as input a sequence of characters and predicts which character is most likely to follow. We will generate text by repeatedly predicting and appending the next character to a string. First, however, we need some text to train it on.


In [None]:
# Based on the following tutorial:
# https://www.tensorflow.org/tutorials/text/text_generation

import tensorflow as tf
import numpy as np
import os
import time

# Let's download some text by Shakespeare to train our model
url = 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt'
path_to_file = tf.keras.utils.get_file('shakespeare.txt', url)

with open(path_to_file, encoding='utf-8') as f:
  shakespeare = f.read()

print("First 500 characters:")
print(shakespeare[:500])

print("Length of text: {:,} characters".format(len(shakespeare)))

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
1115394/1115394 ━━━━━━━━━━━━━━━━━━━━ 1s 1us/step
First 500 characters:
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor
Length of text: 1,115,394 characters


Now we can create training examples for our model. Each example will be a pair of strings: one input string containing 100 characters, and a target string that is one character ahead. For example, the first pair we create is:

**Input string**:  `'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'`

**Target string**: `'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '`
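
As a minimal sketch of how such a pair is built, the snippet below applies the same one-character shift to a short toy string. The string and variable names are chosen only for illustration; the notebook itself works on 101-character chunks of the full text.

```python
# Toy chunk; the real chunks are SEQUENCE_LENGTH + 1 = 101 characters long.
chunk = "First Citizen"

# The input drops the last character, the target drops the first one,
# so target[i] is always the character that follows input[i].
input_text = chunk[:-1]
target_text = chunk[1:]

print(repr(input_text))   # 'First Citize'
print(repr(target_text))  # 'irst Citizen'
```

Both strings have the same length, and at every position the target holds the character the model should predict next.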

However, before we can start training, we need to convert our text into a list of integers, where each integer represents a different character. For example, "First Citizen" becomes:

```
Character:   F   i   r   s   t      C   i   t   i   z   e   n
Integer:   [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52]
```
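
The mapping can be sketched on a toy string as follows. This is an illustration only: the vocabulary here is built from the 13 characters of "First Citizen" rather than from the whole Shakespeare text, so the integers differ from the ones shown above.

```python
text = "First Citizen"

# Sorted list of the unique characters; a character's position is its integer id.
vocab = sorted(set(text))
char_to_index = {char: index for index, char in enumerate(vocab)}

# Encode the text as integers, then decode it again to check the round trip.
encoded = [char_to_index[char] for char in text]
decoded = "".join(vocab[index] for index in encoded)

print(vocab)
print(encoded)
print(decoded)
```

Decoding the integer list returns the original string, which is what lets the generated integers be turned back into text later on.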

In [None]:
# Hyper-parameters:

BATCH_SIZE = 64  # Batch size
BUFFER_SIZE = 10000  # Buffer size to shuffle the dataset
SEQUENCE_LENGTH = 100  # Length of input sequence
EMBEDDING_DIMENSION = 65  # Embedding dimension
RNN_UNITS = 1024  # Number of RNN units

In [None]:
def split_input_target(chunk):
  # Create (input_string, output_string) pairs
  input_text = chunk[:-1]
  target_text = chunk[1:]
  return input_text, target_text

def prepare_text(text):
  # The unique characters in the file
  vocab = sorted(set(text))
  print ('{} unique characters'.format(len(vocab)))

  # Creating a mapping from unique characters to indices
  char_map = {
      'char_to_index': {char: index for index, char in enumerate(vocab)},
      'index_to_char': np.array(vocab)
  }

  text_as_int = np.array([char_map['char_to_index'][c] for c in text])

  # The maximum length sentence we want for a single input in characters
  seq_length = SEQUENCE_LENGTH
  examples_per_epoch = len(text) // (seq_length+1)

  # Create training examples / targets
  char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
  sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
  dataset = sequences.map(split_input_target)

  # (TF data is designed to work with possibly infinite sequences,
  # so it doesn't attempt to shuffle the entire sequence in memory. Instead,
  # it maintains a buffer in which it shuffles elements).
  dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

  return dataset, vocab, examples_per_epoch, char_map

Now we can create and train the neural network.
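
Before training, it can be useful to check how many batches one epoch will contain. The sketch below repeats, in plain Python, the two integer divisions that `prepare_text` performs through `batch(..., drop_remainder=True)`; the variable names are illustrative and the text length is the one printed for the Shakespeare file.

```python
text_length = 1_115_394   # characters in the Shakespeare text
sequence_length = 100     # SEQUENCE_LENGTH
batch_size = 64           # BATCH_SIZE

# Each training sequence uses sequence_length + 1 characters
# (100 input characters plus one extra for the shifted target).
num_sequences = text_length // (sequence_length + 1)

# Incomplete batches are dropped, so we use integer division again.
steps_per_epoch = num_sequences // batch_size

print(num_sequences, steps_per_epoch)  # 11043 172
```

The 172 steps agree with the step count Keras reports per epoch when training on the Shakespeare text.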

In [None]:
import os

def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)


def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
      tf.keras.layers.Embedding(vocab_size,
                                embedding_dim),
      tf.keras.layers.GRU(rnn_units,
                          return_sequences=True,
                          recurrent_initializer='glorot_uniform',
                          stateful=True),
      tf.keras.layers.Dense(vocab_size)
  ])

  return model

In [None]:
def create_model(text, epochs=3):
  dataset, vocab, examples_per_epoch, char_map = prepare_text(text)

  train_model = build_model(len(vocab), EMBEDDING_DIMENSION, RNN_UNITS, BATCH_SIZE)
  train_model.compile(optimizer='adam', loss=loss)

  train_model.fit(dataset, epochs=epochs)

  pred_model = build_model(len(vocab), EMBEDDING_DIMENSION, RNN_UNITS, batch_size=1)
  pred_model.build(input_shape=(1, 100))
  pred_model.set_weights(train_model.get_weights())

  return pred_model, char_map

In [None]:
shakes_model, shakes_chars = create_model(shakespeare, epochs=3)

65 unique characters
Epoch 1/3
172/172 ━━━━━━━━━━━━━━━━━━━━ 18s 53ms/step - loss: 3.2485
Epoch 2/3
172/172 ━━━━━━━━━━━━━━━━━━━━ 18s 55ms/step - loss: 2.0859
Epoch 3/3
172/172 ━━━━━━━━━━━━━━━━━━━━ 11s 51ms/step - loss: 1.7928


In [None]:
shakes_model.save_weights('shakes_model.weights.h5')

In [None]:
# Ignore. Use only if Colab fails.
#dataset, vocab, examples_per_epoch, char_map = prepare_text(shakespeare)
#mini_data = dataset.take(1)
#newshake = build_model(len(vocab), EMBEDDING_DIMENSION, RNN_UNITS, batch_size=1)
#newshake.build(input_shape=(1, 100))
#newshake.summary()
#newshake.load_weights('shakes_model.weights.h5')


Now that we've trained our model, we can finally use it to generate some text. The following function takes a model, a character map and a start string as input, and repeatedly predicts and appends the next character until it has generated 1,000 new characters.

In [None]:
def generate_text(model, char_map, start_string, temperature=1.0):
  # Evaluation step (generating text using the learned model)
  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  if not start_string:
    print("start_string can't be empty")
    return ""

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [char_map['char_to_index'][s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store the generated characters
  text_generated = []

  # Here batch size == 1
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(char_map['index_to_char'][predicted_id])

  return (start_string + ''.join(text_generated))

Let's generate some text!

In [None]:
#### If connected to GPU
print(generate_text(shakes_model, shakes_chars, "ROMEO: ", temperature=1))

#### If not connected to GPU
#print(generate_text(newshake, char_map, "ROMEO: ", temperature=1.0))

ROMEO: it do, you!

BLONDOLLOLA:
If my sholing undoct?

LIRANDUSS:
The veny spares we him this viPlknd,
3rill, his nothe to their surfe
Ester know you grace Preame and my proceinger to the tols'ber and crusemow, efor
Yid should not anw the,
Live mime to sen me, ne't out strens, but what
the offort fif Rome to me she!

GLOUCESTER:
Foil sim, sair more.

Shepesion.


LATWANEE:
Lowe they.

HIRG RICHARD III:
Leady?

CLARENCE:
Tullos as touch tome anown.

MENCUTBY:
Hen curners fot my desw, if the dusher there a divies.

PETISIUS:
You werl, they no morring theirt, I sme' oth deedsgess,
So rable knows thry befope eatle me to d, withs,
Maytersulf aresad to her him.

HENRINGARO:
Ye't fores, Mysex'd, bet the musseds of do bot.

VETBYEX:
Whereagy owf off-You keep me for him!

CRISIO:
May friar all she beft ove quernes sough,
I have prief not, at thou didst for what is have will
as ofs,
Bares I am contrushich! iblris sut you sull, mane kits
ladsell my foomwerd, ang fet me should it,
Come at ele tha

# Assignment
Answer the following questions and hand in your solution in Canvas before 23:59 on Friday, September 27th. Remember to save your file before uploading it.

## Question 1
The `temperature` parameter of `generate_text()`, defined earlier in the notebook, controls how predictable the generated text will be. The lower the temperature, the more the function will tend to append the most likely character (according to the model's prediction). A higher temperature introduces some randomness, leading to more unpredictable text.
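
To make the effect of the temperature concrete, here is a minimal NumPy sketch. It assumes the model outputs unnormalized scores (logits), which are divided by the temperature before being turned into probabilities, as `generate_text()` does before sampling. The function name and the three logit values are made up for illustration.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()  # subtract the maximum for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Hypothetical scores for three candidate characters.
logits = [2.0, 1.0, 0.1]

for temperature in (0.2, 0.8, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, np.round(probs, 3))
```

At a low temperature almost all of the probability mass sits on the highest-scoring character, so sampling behaves nearly like always picking the most likely one. At higher temperatures the distribution is flatter and less likely characters are sampled more often.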

The text we generated above used a temperature of 1.0. Try generating more text using the Shakespeare model:

(a) once using a temperature of 0.2 and

(b) again using a temperature of 0.8

and describe the difference.

In [None]:
print(generate_text(shakes_model, shakes_chars, "ROMEO: ", temperature=0.8))

ROMEO: I pays his death!

CLAMELO:
I metel, a king, thou hore thos:
Now the enders of a muster spring!

KATHARINA:
this to word is dustare is my seazer tome:
Year, the tone, of inators, the king!

GLOUCESTER:
He story'd bey, and by thee,
Bear to us all;
The wall the follow stay's not forting compinge of Duckay.

NAUTIRA:
I come the deppecity on the many, but his geetly fir,
And hin their doner she sceet, and not, for shall I thou dost within their resund
Whil me, be hone, if, I will not condeath of miscedtion
The exelt if and the kenders as hands to cindon jeinged;
And I say, who thought thou are misolare:
Which ther, for I will not let the words, and, and his wish as cenalt.

Faret,
You not charge my like a dead, Luck in menelity.

DUKE VINCENTIO:
He the mourestert, like to make.

GLIORCES:
Geath my lord unouthmon! I think, be the bewnem tous one tother, to theer conser.

ISARE:
Are thre my nower lies: they need a ser;
I he thy did my son, stay, mure them no would have they have be my

In [None]:
print(generate_text(shakes_model, shakes_chars, "ROMEO: ", temperature=0.2))

ROMEO: and the death the sentless of him.

ESCALUS:
Now my lord, I will not the master of him.

BENVOLIO:
I may and the marrer is the dead of her and the seem and she stand him and the sees,
And the man and the this death the good for the good marries,
And the prove the best of my sone of her and the were as the dead.

LEONTES:
And the to may and the death of him and the will of his hands,
And the hand of his prose to me here and the death.

SICINIUS:
I may and the better to the son of her and the best of his life
And the the strenger to me and the stand of his propes the will of his parsent of my hands,
I will not she have me and the the stands of him and his for a parron many and the death,
And he have do strenged the father and the best the tongue
That have the more of her the the stands of his father stay
The father of the to the father of him.

LEONTES:
What is the father son of my souls,
And the the stand of her fartent the dead of the will of him and the warms and my souls,
And 

With a temperature of 0.2, the generated text becomes more predictable: common words such as "the", "and", "of" and "death" are repeated frequently. The structure sometimes resembles verse, but the sentences are meaningless. With a temperature of 0.8, the text is more varied and feels more random. Many of the generated words are not real words, the output makes little sense, and it resembles real verse even less.

## Question 2
NLTK's `names` corpus contains a list of approximately 8,000 English names. Train a new model on `names_raw` for at least 20 epochs using the `create_model(text, epochs=n)` function defined earlier. Use the trained model to generate a list of names (with the `generate_text` function defined earlier), starting with your own first name. Your name should not contain any non-English characters, and should end with a newline character (`\n`).

Print out the names that do not appear in the training data.

(a) Do you get any actual names (or at least names that sound plausible)?

In [None]:
# Don't modify this code cell
import nltk
from nltk.corpus import names
nltk.download('names')

# Print out a few examples
names_raw = names.raw()
names_unique = set(names_raw.split())
names_raw = "\n".join(names_unique)
print(names_raw.splitlines()[:5])

['Shaylah', 'Noelani', 'Josefa', 'Breanne', 'Tildy']


[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.


In [25]:
# Your solution here
names_model, names_chars = create_model(names_raw, epochs=20)

55 unique characters
Epoch 1/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 176ms/step - loss: 5.1690
Epoch 2/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 181ms/step - loss: 3.8288
Epoch 3/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 180ms/step - loss: 3.2416
Epoch 4/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 184ms/step - loss: 2.7925
Epoch 5/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 183ms/step - loss: 2.5377
Epoch 6/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 187ms/step - loss: 2.4289
Epoch 7/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 185ms/step - loss: 2.3787
Epoch 8/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 181ms/step - loss: 2.3391
Epoch 9/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 185ms/step - loss: 2.3030
Epoch 10/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 183ms/step - l

In [27]:
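# Use a separate variable name so the imported nltk `names` corpus is not overwritten
generated_text = generate_text(names_model, names_chars, "Leonardo\n", temperature=0.1)

In [28]:
generated_names = set(generated_text.split("\n"))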
starting_names = set(names_raw.split("\n"))
never_seen_names = generated_names.difference(starting_names)
print(never_seen_names)


{'Carille', 'Carista', 'Ronne', 'Tarrie', 'Annele', 'Teris', 'Theri', 'Danne', 'Theria', 'Sherina', 'Therina', 'Tarie', 'Cherista', 'Caris', 'Sheria', 'Tarisa', 'Mariella', 'Carisa', 'Carilie', 'Marile', 'Tarelle', 'Annelle', 'Marille', 'Taria', 'Marista', 'Elis', 'Tarina'}


Yes, some of the generated names are real or at least plausible, such as Mariella or Caris.

## Question 3
The size of the model can make a difference when it comes to performance. Create a new model that has twice the number of hidden units as the previous model and double the size of the embeddings.

(a) How does the performance change?

(b) What happens if you decrease these parameters?
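
To get a feel for how much larger the doubled model is, the number of trainable parameters can be estimated by hand. The sketch below assumes the three-layer architecture from `build_model` (Embedding, GRU, Dense) and Keras' default GRU variant (`reset_after=True`), which carries two bias vectors per weight block. The helper function names are hypothetical, and the vocabulary size of 55 is the one reported for the names data.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per character.
    return vocab_size * embedding_dim

def gru_params(input_dim, units):
    # Three weight blocks (update gate, reset gate, candidate state), each with
    # an input kernel, a recurrent kernel and two bias vectors.
    # Assumes Keras' default reset_after=True.
    return 3 * units * (input_dim + units + 2)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output.
    return input_dim * output_dim + output_dim

def total_params(vocab_size, embedding_dim, rnn_units):
    return (embedding_params(vocab_size, embedding_dim)
            + gru_params(embedding_dim, rnn_units)
            + dense_params(rnn_units, vocab_size))

vocab_size = 55
baseline = total_params(vocab_size, embedding_dim=65, rnn_units=1024)
doubled = total_params(vocab_size, embedding_dim=130, rnn_units=2048)

print(f"baseline: {baseline:,} parameters")
print(f"doubled:  {doubled:,} parameters")
print(f"ratio:    {doubled / baseline:.2f}")
```

Because the recurrent kernel grows with the square of the number of units, doubling both settings roughly quadruples the parameter count, which is worth keeping in mind when comparing models trained on only about 8,000 names.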

In [29]:
EMBEDDING_DIMENSION = 130  # Embedding dimension (double the original 65)
RNN_UNITS = 2048  # Number of RNN units

def create_my_model(text, epochs=3):
  dataset, vocab, examples_per_epoch, char_map = prepare_text(text)

  train_model = build_model(len(vocab), EMBEDDING_DIMENSION, RNN_UNITS, BATCH_SIZE)
  train_model.compile(optimizer='adam', loss=loss)

  train_model.fit(dataset, epochs=epochs)

  pred_model = build_model(len(vocab), EMBEDDING_DIMENSION, RNN_UNITS, batch_size=1)
  pred_model.build(input_shape=(1, 100))
  pred_model.set_weights(train_model.get_weights())

  return pred_model, char_map

names_model, names_chars = create_my_model(names_raw, epochs=20)


55 unique characters
Epoch 1/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 182ms/step - loss: 5.2084
Epoch 2/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 185ms/step - loss: 3.7766
Epoch 3/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 184ms/step - loss: 3.1355
Epoch 4/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 187ms/step - loss: 2.7457
Epoch 5/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 185ms/step - loss: 2.5192
Epoch 6/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 183ms/step - loss: 2.4264
Epoch 7/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 184ms/step - loss: 2.3682
Epoch 8/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 185ms/step - loss: 2.3220
Epoch 9/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 185ms/step - loss: 2.2947
Epoch 10/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 186ms/step - l

In [30]:
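# Use a separate variable name so the imported nltk `names` corpus is not overwritten
generated_text = generate_text(names_model, names_chars, "Leonardo\n", temperature=0.1)

In [31]:
generated_names = set(generated_text.split("\n"))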
starting_names = set(names_raw.split("\n"))
never_seen_names = generated_names.difference(starting_names)
print(never_seen_names)


{'Sanne', 'Linne', 'Carista', 'Andella', 'Mar', 'Annelle', 'Lonetta', 'Bertin', 'Mariell', 'Andelle', 'Annella', 'Sheria', 'Lanne', 'Carisa', 'Linna', 'Lanella'}


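Increasing the number of RNN units and the embedding dimension makes the generated names sound more plausible than those of the previous model, although the training loss over the epochs shown is nearly the same (about 2.29 versus 2.30 after epoch 9). Conversely, if we decrease these values, the generated names no longer sound like real names.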

## Question 4
Transformer large language models can also generate text. The following code imports a pretrained GPT-2 model from Hugging Face's Transformers library. This model can then be used directly to generate text, given a prompt as context. Alter the prompt to have the transformer model (GPT-2) generate an engaging story beginning using one of the following story starters:


*   It was the day the moon fell.
*   Am I in heaven?  What happened to me?
*   Wandering through the graveyard it felt like something was watching me.
*   Three of us.  We were the only ones left, the only ones to make it to the island.

There are several different methods to choose from to generate the text (as seen in the commented-out lines below). Try out the different methods and play with the parameters. This [blog post](https://huggingface.co/blog/how-to-generate) explains their differences.

(a) Which method has the best performance?

(b) Can GPT-2 generate Shakespeare?
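
As a rough illustration of how two of these methods differ, the sketch below applies top-k and top-p (nucleus) filtering to a small, made-up next-token distribution. It is a simplified NumPy version written for this explanation, not the code used inside the Transformers library, and the function names are hypothetical.

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most likely tokens and renormalize."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[-k:]          # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose total probability reaches p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]        # most likely token first
    cumulative = np.cumsum(probs[order])
    # Number of tokens needed for the cumulative probability to reach p.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# Hypothetical distribution over four candidate tokens.
probs = [0.5, 0.3, 0.15, 0.05]

print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.9))
```

Top-k always keeps a fixed number of candidates, while top-p keeps a variable number depending on how peaked the distribution is. In both cases the next token is then sampled from the renormalized distribution.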

In [None]:
# Comment out this line if transformers is already installed
!pip install transformers



In [None]:
# Do not modify this code
# https://huggingface.co/docs/transformers/main_classes/text_generation

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")

gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
# Only change the prompt and comment or uncomment the different generation lines

prompt = "Am I in heaven? What happened to me?"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

#outputs = gpt2_model.generate(input_ids, pad_token_id=tokenizer.eos_token_id, max_length=100) # Greedy search
#outputs = gpt2_model.generate(input_ids, max_length=100, num_beams=5, no_repeat_ngram_size=3, early_stopping=True) # Beam search
#outputs = gpt2_model.generate(input_ids, do_sample=True, max_length=100, top_k=0, temperature=0.7) # Sampling
#outputs = gpt2_model.generate(input_ids, do_sample=True, max_length=100, top_k=50) # Top-k
outputs = gpt2_model.generate(input_ids, do_sample=True, max_length=100, top_k=50, top_p=0.92) # Top-p

tokenizer.batch_decode(outputs, skip_special_tokens=True)

### To suppress the warning, add:
# pad_token_id=tokenizer.eos_token_id
# for example: outputs = gpt2_model.generate(input_ids, pad_token_id=tokenizer.eos_token_id, max_length=100)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['Am I in heaven? What happened to me? I\'m so sorry to hear your story."\n\n"I\'m not."\n\n"Did I lose anything?"\n\n"If it\'s because of my actions and that they don\'t understand."\n\n"Don\'t blame yourself!"\n\nWhen I came back from the afterlife, I still didn\'t know what had happened.\n\n"I\'m sorry about what happened to me and the world that you have lost."\n\n"']

Judging by the text generated by each method, I think beam search produced the output that makes the most sense.
After several attempts, I was unable to make the model generate Shakespeare, so I believe it cannot do that.