# T-725 Natural Language Processing: Lab 5
In today's lab, we will be working with neural networks, using GRUs and Transformers for text generation.

To begin with, do the following:
* Select `"File" > "Save a copy in Drive"` to create a local copy of this notebook that you can edit.
* **Select `"Runtime" > "Change runtime type"`, and make sure that you have "Hardware accelerator" set to "GPU"**
* Select `"Runtime" > "Run all"` to run the code in this notebook.

In [1]:
import os
import warnings

# Silence TensorFlow's C++ log messages (info, warnings, and errors)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

## Generating text with neural networks
Let's create a neural language model and use it to generate some text. This time, we will use character embeddings rather than word embeddings. They are created in exactly the same way, and are often used together in neural network-based models. One benefit of using character embeddings is that we can generate words that our model has never seen before.

The model takes as input a sequence of characters and predicts which character is most likely to follow. We will generate text by repeatedly predicting and appending the next character to a string. First, however, we need some text to train it on.
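The predict-and-append loop described above can be sketched without a neural network. The example below is only an illustration: it swaps the neural model for simple bigram counts (how often each character follows another), and the names `train_bigram_counts` and `generate` are hypothetical helpers, not part of this notebook.

```python
import random
from collections import Counter, defaultdict

def train_bigram_counts(text):
    # Count how often each character is followed by each other character
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts

def generate(counts, start, length, seed=0):
    # Repeatedly sample the next character and append it to the string
    rng = random.Random(seed)
    out = start
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last character was never followed by anything
        chars, weights = zip(*followers.items())
        out += rng.choices(chars, weights=weights, k=1)[0]
    return out

counts = train_bigram_counts("hello world, hello there")
print(generate(counts, "h", 20))
```

The GRU model trained below follows the same loop, but its prediction depends on the whole preceding sequence (through the hidden state) rather than on the last character alone.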


In [2]:
# Based on the following tutorial:
# https://www.tensorflow.org/tutorials/text/text_generation

import tensorflow as tf
import numpy as np
import os
import time

# Let's download some text by Shakespeare to train our model
url = 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt'
path_to_file = tf.keras.utils.get_file('shakespeare.txt', url)

with open(path_to_file, encoding='utf-8') as f:
  shakespeare = f.read()

print("First 250 characters:")
print(shakespeare[:250])

print ("Length of text: {:,} characters".format(len(shakespeare)))

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt
1115394/1115394 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
First 250 characters:
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

Length of text: 1,115,394 characters


Now we can create training examples for our model. Each example will be a pair of strings: one input string containing 100 characters, and a target string that is one character ahead. For example, the first pair we create is:

**Input string**:  `'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'`

**Target string**: `'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '`
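The pairing above can be sketched in plain Python. This is a simplified illustration, assuming a sequence length of 10 instead of 100 so that the result is easy to read; the notebook itself builds the pairs with `tf.data` further down.

```python
def split_input_target(chunk):
    # The target is the input shifted one character ahead
    return chunk[:-1], chunk[1:]

text = "First Citizen:\nBefore we proceed any further, hear me speak."
seq_length = 10  # assumed value for illustration; the notebook uses 100

# Cut the text into chunks of seq_length + 1 characters, dropping the remainder
chunk_size = seq_length + 1
chunks = [text[i:i + chunk_size]
          for i in range(0, len(text) - chunk_size + 1, chunk_size)]

pairs = [split_input_target(chunk) for chunk in chunks]
for input_text, target_text in pairs[:2]:
    print(repr(input_text), "->", repr(target_text))
```

Each chunk holds one more character than the sequence length, so that both the input and the target contain exactly `seq_length` characters.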

However, before we can start training, we need to convert our text into a list of integers, where each integer represents a different character. For example, "First Citizen" becomes:

```
Character:   F   i   r   s   t      C   i   t   i   z   e   n
Integer:   [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52]
```
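A minimal sketch of this conversion is shown below. It builds the vocabulary from the short string alone, so the integers differ from the ones above, which come from the full Shakespeare vocabulary. The variable names are chosen for illustration.

```python
text = "First Citizen"

# The vocabulary is the sorted set of unique characters
vocab = sorted(set(text))
char_to_index = {char: index for index, char in enumerate(vocab)}
index_to_char = vocab

# Encode the text as integers, then decode it back to check the round trip
encoded = [char_to_index[c] for c in text]
decoded = "".join(index_to_char[i] for i in encoded)

print(encoded)
print(decoded)
```

The `prepare_text()` function below does the same thing, storing both mappings in a single `char_map` dictionary.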

In [3]:
# Hyper-parameters:

BATCH_SIZE = 64  # Batch size
BUFFER_SIZE = 10000  # Buffer size to shuffle the dataset
SEQUENCE_LENGTH = 100  # Length of input sequence
EMBEDDING_DIMENSION = 65  # Embedding dimension
RNN_UNITS = 1024  # Number of RNN units

In [4]:
def split_input_target(chunk):
  # Create (input_string, output_string) pairs
  input_text = chunk[:-1]
  target_text = chunk[1:]
  return input_text, target_text

def prepare_text(text):
  # The unique characters in the file
  vocab = sorted(set(text))
  print ('{} unique characters'.format(len(vocab)))

  # Creating a mapping from unique characters to indices
  char_map = {
      'char_to_index': {char: index for index, char in enumerate(vocab)},
      'index_to_char': np.array(vocab)
  }

  text_as_int = np.array([char_map['char_to_index'][c] for c in text])

  # The maximum length sentence we want for a single input in characters
  seq_length = SEQUENCE_LENGTH
  examples_per_epoch = len(text) // (seq_length+1)

  # Create training examples / targets
  char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
  sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
  dataset = sequences.map(split_input_target)

  # (TF data is designed to work with possibly infinite sequences,
  # so it doesn't attempt to shuffle the entire sequence in memory. Instead,
  # it maintains a buffer in which it shuffles elements).
  dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

  return dataset, vocab, examples_per_epoch, char_map

Now we can create and train the neural network.

In [5]:
import os

def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)


def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
      tf.keras.layers.Embedding(vocab_size,
                                embedding_dim),
      tf.keras.layers.GRU(rnn_units,
                          return_sequences=True,
                          recurrent_initializer='glorot_uniform',
                          stateful=True),
      tf.keras.layers.Dense(vocab_size)
  ])

  return model

In [6]:
def create_model(text, epochs=3):
  dataset, vocab, examples_per_epoch, char_map = prepare_text(text)

  train_model = build_model(len(vocab), EMBEDDING_DIMENSION, RNN_UNITS, BATCH_SIZE)
  train_model.compile(optimizer='adam', loss=loss)

  train_model.fit(dataset, epochs=epochs)

  pred_model = build_model(len(vocab), EMBEDDING_DIMENSION, RNN_UNITS, batch_size=1)
  pred_model.build(input_shape=(1, 100))
  pred_model.set_weights(train_model.get_weights())

  return pred_model, char_map

In [7]:
shakes_model, shakes_chars = create_model(shakespeare, epochs=3)

65 unique characters
Epoch 1/3
172/172 ━━━━━━━━━━━━━━━━━━━━ 17s 52ms/step - loss: 3.2984
Epoch 2/3
172/172 ━━━━━━━━━━━━━━━━━━━━ 13s 54ms/step - loss: 2.1107
Epoch 3/3
172/172 ━━━━━━━━━━━━━━━━━━━━ 13s 52ms/step - loss: 1.8166


In [8]:
shakes_model.save_weights('shakes_model.weights.h5')

In [9]:
# Ignore. Use only if Colab fails.
#dataset, vocab, examples_per_epoch, char_map = prepare_text(shakespeare)
#mini_data = dataset.take(1)
#newshake = build_model(len(vocab), EMBEDDING_DIMENSION, RNN_UNITS, batch_size=1)
#newshake.build(input_shape=(1, 100))
#newshake.summary()
#newshake.load_weights('shakes_model.weights.h5')


Now that we've trained our model, we can finally use it to generate some text. The following function takes a model and a string as input, and continually predicts and appends the next character to the string until it becomes 1,000 characters long.

In [10]:
def generate_text(model, char_map, start_string, temperature=1.0):
  # Evaluation step (generating text using the learned model)
  # Lower temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  if not start_string:
    print("start_string can't be empty")
    return ""

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [char_map['char_to_index'][s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store the generated characters
  text_generated = []

  # Here batch size == 1
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(char_map['index_to_char'][predicted_id])

  return (start_string + ''.join(text_generated))

Let's generate some text!

In [11]:
#### If connected to GPU
print(generate_text(shakes_model, shakes_chars, "ROMEO: ", temperature=1.0))

#### If not connected to GPU
#print(generate_text(newshake, char_map, "ROMEO: ", temperature=1.0))

ROMEO: Connotched pyeeth of Gonio salvio;
And, the LictarjUS:
Norgent of Say, thou they petsiae him mortings princh,
Andiss-pite a pasion bight I ag Henarde
Endert; the was, and my your gost you batke armove
I'll vealing doors of the eardy diss.

MacPestaderoble, 'tis glacent.

RICH:
Fir liftleitnds, stragit

Senatord mine: leath of she man,
Doth prin death se dy lastige
On epeity manner, Thatch swean preisies.

AUWOLYCUS:
Nake an gruater,
Then were you hade,
She it ot name onternown no terk?
The pravose ithel past, that the gods,
A swall'd us the feads, bornobme atreen;
You creed theepf I coundiny ware.
O, fake you willst dirsment, to my excreat;
And withous of the clook's all to wat suchin of you gruck
To that youbjeath by faith's, as whece Handion marres;
Made bear the besticiesbage, well here grees;
And you that she prayel for minesty very steents Odlumabenged,
Farath,
And a Inlot may both haturalle, ane child no
freme pards for uppaitience a fattleme?
USGAinge:
Buke is not or what

# Assignment
Answer the following questions and hand in your solution in Canvas before 23:59 on Friday, September 27th. Remember to save your file before uploading it.

## Question 1
The `temperature` parameter of `generate_text()`, defined earlier in the notebook, controls how predictable the generated text will be. The lower the temperature, the more the function will tend to append the most likely character (according to the model's prediction). A higher temperature introduces some randomness, leading to more unpredictable text.
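To make the effect of temperature concrete, here is a minimal sketch of how dividing the model's scores (logits) by the temperature reshapes the probabilities that the next character is sampled from. The function name `softmax_with_temperature` and the three toy logits are assumptions made for illustration; `generate_text()` performs the same division before sampling with `tf.random.categorical`.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Divide the logits by the temperature, then apply a numerically stable softmax
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # toy scores for three candidate characters
for temperature in (0.2, 0.8, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a low temperature almost all of the probability mass lands on the highest-scoring character, while temperatures closer to 1.0 leave the alternatives a realistic chance of being sampled.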

The text we generated above used a temperature of 1.0. Try generating more text using the Shakespeare model:

(a) once using a temperature of 0.2 and

(b) again using a temperature of 0.8

and describe the difference.

In [22]:
print("[Temperature = 0.2]")
print(generate_text(shakes_model, shakes_chars, "ROMEO: ", temperature=0.2))
print("\n[Temperature = 0.8]")
print(generate_text(shakes_model, shakes_chars, "ROMEO: ", temperature=0.8))


[Temperature = 0.2]
ROMEO: his fair a warting the death,
When the compent of the warring of the with a stranged him the courter and the dore of the world and the with a strange of the will of the with of the courtes and make
That the hand a strange the courter and the counter
And the with the will of the words of the warts of the warrs and the death,
And the will be the wor a thing in the wartion
That the wast should be the best of the warth,
And the bear the stand of the compented the death,
And the compost the words what the triends and the were of the words
And the death of the wife of the courter of the with him the diest.

SICINIUS:
Nor the compost shall be the will be a mander and string
And the will of the words and the words of the warting of the cause
And the will of the will be a stranger the death,
And the words and be the death and shall be stranges
That the warring of the warting of the reath,
And the warrow in the death of the words of the warse of the wares of the compente

As the example above shows, changing the temperature changes how much randomness goes into choosing each character: a higher value produces more unpredictable words, and a lower value produces more predictable ones. For instance, with a temperature of 0.8, although some words are grammatically and syntactically correct, most of them are malformed (e.g. 'hur', 'fentlemen', 'Forgats '), resulting in nonsense words. Conversely, a temperature of 0.2 generated more correct words, grouped properly into sentences.

## Question 2
NLTK's `names` corpus contains a list of approximately 8,000 English names. Train a new model on `names_raw` for at least 20 epochs using the `create_model(text, epochs=n)` function defined earlier. Use the trained model to generate a list of names (with the `generate_text` function defined earlier), starting with your own first name. Your name should not contain any non-English characters, and should end with an `\n`.

Print out the names that do not appear in the training data.

(a) Do you get any actual names (or at least names that sound plausible)?

In [29]:
# Don't modify this code cell
import nltk
from nltk.corpus import names
nltk.download('names')

# Print out a few examples
names_raw = names.raw()
names_unique = set(names_raw.split())
names_raw = "\n".join(names_unique)
print(names_raw.splitlines()[:5])

['Ingelbert', 'Calhoun', 'Deloria', 'Luelle', 'Janenna']


[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!


In [30]:
names_model, names_char = create_model(names_raw, epochs=20)

55 unique characters
Epoch 1/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 56ms/step - loss: 4.1121
Epoch 2/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step - loss: 3.8060
Epoch 3/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 52ms/step - loss: 3.4931
Epoch 4/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 51ms/step - loss: 3.0316
Epoch 5/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 52ms/step - loss: 2.7211
Epoch 6/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 52ms/step - loss: 2.5297
Epoch 7/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step - loss: 2.4488
Epoch 8/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step - loss: 2.4045
Epoch 9/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 51ms/step - loss: 2.3732
Epoch 10/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step - loss: 2.347

In [32]:
result = generate_text(names_model, names_char, "Samuele\n", temperature=0.2)
list_names_generated = result.split('\n')

{name for name in list_names_generated if name not in names_unique}

{'Alera',
 'Alie',
 'Alona',
 'Andor',
 'Annel',
 'Annisa',
 'Arisa',
 'Beris',
 'Berisa',
 'Carelie',
 'Carich',
 'Carila',
 'Carile',
 'Carille',
 'Caris',
 'Carisa',
 'Charia',
 'Charie',
 'Deris',
 'Ellin',
 'Harila',
 'Jasa',
 'Landy',
 'Lelina',
 'Marila',
 'Marile',
 'Marilia',
 'Robella',
 'Rone',
 'Sana',
 'Sanelle',
 'Sheria',
 'Sherin',
 'Sheris',
 'Tarina',
 'Tharie'}

When generating text with a temperature of 0.2, the model returned a list of names in which most of the generated strings were not actual names. However, a few, such as "Ellin" and "Lelina", were real names that the model managed to produce, albeit inconsistently.

## Question 3
The size of the model can make a difference when it comes to performance. Create a new model that has twice the number of hidden units as the previous model and double the size of the embeddings.

(a) How does the performance change?

(b) What happens if you decrease these parameters?
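To get a feel for how much the model grows or shrinks with these settings, the sketch below estimates the number of trainable parameters in the Embedding, GRU, and Dense layers. It assumes Keras's default GRU configuration (`reset_after=True`, which uses two bias vectors per weight block), and the helper names are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per character in the vocabulary
    return vocab_size * embedding_dim

def gru_params(input_dim, units):
    # Three weight blocks (update gate, reset gate, candidate state), each with
    # input weights, recurrent weights, and two bias vectors (reset_after=True assumed)
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output
    return input_dim * output_dim + output_dim

def total_params(vocab_size, embedding_dim, rnn_units):
    return (embedding_params(vocab_size, embedding_dim)
            + gru_params(embedding_dim, rnn_units)
            + dense_params(rnn_units, vocab_size))

# Vocabulary size of 55 as reported for the names corpus
for embedding_dim, rnn_units in [(32, 512), (65, 1024), (130, 2048)]:
    print(embedding_dim, rnn_units, f"{total_params(55, embedding_dim, rnn_units):,}")
```

Because the recurrent weights scale with the square of the number of units, doubling `RNN_UNITS` roughly quadruples the size of the GRU layer, which is consistent with the longer time per training step for the larger model.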

In [37]:
RNN_UNITS = 2048
EMBEDDING_DIMENSION = 130
names_model, names_char = create_model(names_raw, epochs=20)

55 unique characters
Epoch 1/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 173ms/step - loss: 5.5778
Epoch 2/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 175ms/step - loss: 3.8584
Epoch 3/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 175ms/step - loss: 3.3458
Epoch 4/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 175ms/step - loss: 2.8460
Epoch 5/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 176ms/step - loss: 2.5691
Epoch 6/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 178ms/step - loss: 2.4409
Epoch 7/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 177ms/step - loss: 2.3815
Epoch 8/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 179ms/step - loss: 2.3397
Epoch 9/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 180ms/step - loss: 2.3088
Epoch 10/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 180ms/step - l

In [38]:
result = generate_text(names_model, names_char, "Samuele\n", temperature=0.2)
list_names_generated = result.split('\n')

{name for name in list_names_generated if name not in names_unique}

{'Alise',
 'Andel',
 'Annelle',
 'Bertin',
 'Carila',
 'Carille',
 'Caris',
 'Carisa',
 'Carisse',
 'Charie',
 'Coris',
 'Corista',
 'Coriste',
 'Ferina',
 'Lelin',
 'Lenitte',
 'Lonie',
 'Ma',
 'Marile',
 'Marilie',
 'Marille',
 'Mariss',
 'Marista',
 'Marrie',
 'Meria',
 'Merista',
 'Ronie',
 'Rosan'}

In [39]:
RNN_UNITS = 512
EMBEDDING_DIMENSION = 32
names_model, names_char = create_model(names_raw, epochs=20)

55 unique characters
Epoch 1/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 29ms/step - loss: 3.9508
Epoch 2/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 24ms/step - loss: 3.4789
Epoch 3/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step - loss: 3.3960
Epoch 4/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step - loss: 3.1716
Epoch 5/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step - loss: 3.0560
Epoch 6/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step - loss: 2.8795
Epoch 7/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step - loss: 2.6966
Epoch 8/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step - loss: 2.5557
Epoch 9/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step - loss: 2.4785
Epoch 10/20
[1m8/8[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step - loss: 2.430

In [40]:
result = generate_text(names_model, names_char, "Samuele\n", temperature=0.2)
list_names_generated = result.split('\n')

{name for name in list_names_generated if name not in names_unique}

{'Alele',
 'Alelie',
 'Alelin',
 'Alelle',
 'Alere',
 'Alle',
 'Anarin',
 'Ande',
 'Anele',
 'Arelon',
 'Barie',
 'Carilia',
 'Carise',
 'Charie',
 'Charina',
 'Danele',
 'Danne',
 'Darie',
 'Darina',
 'Darine',
 'Garine',
 'Gerin',
 'Gerina',
 'Jeline',
 'Jona',
 'Jonna',
 'Jonne',
 'Lanan',
 'Larie',
 'Lelina',
 'Leline',
 'Lerina',
 'Leris',
 'Lorina',
 'Marelle',
 'Marile',
 'Marine',
 'Ronne',
 'Sariele',
 'Sarin',
 'Sellie',
 'Share',
 'Sharin',
 'Sharina',
 'Sharine',
 'Sharta',
 'Shene',
 'Silla',
 'Sorelie',
 'Sorin',
 'Tarie'}

**QUESTION (A)**: Increasing the number of hidden units and the size of the embeddings improves the model's performance slightly, recognizing names like 'Ellin' or 'Leline' as actual names instead of "random names" (as reported in the example above).



**QUESTION (B)**: By decreasing the number of hidden units and the size of embeddings, the model’s performance tends to decline. It generates more repetitive and structurally simplistic outputs, with fewer realistic names and a higher occurrence of artificial or nonsensical strings, as seen in the earlier example.




## Question 4
Transformer large language models can also generate text. The following code imports a pretrained GPT-2 model from Hugging Face's Transformers library. This model can then be used directly to generate text, given a prompt as context. Alter the prompt to have the transformer model (GPT-2) generate an engaging story beginning using one of the following story starters:


*   It was the day the moon fell.
*   Am I in heaven?  What happened to me?
*   Wandering through the graveyard it felt like something was watching me.
*   Three of us.  We were the only ones left, the only ones to make it to the island.

There are several different methods to choose from to generate the text (as seen in the commented out lines below). Try out the different methods and play with the parameters. This [blogpost](https://huggingface.co/blog/how-to-generate) explains their differences.

(a) Which method has the best performance?

(b) Can GPT-2 generate Shakespeare?
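The generation methods in this question differ mainly in how the next token is chosen from the model's probability distribution. The sketch below illustrates greedy search, top-k filtering, and top-p (nucleus) filtering on a made-up four-token distribution. The function names and the probabilities are assumptions for illustration only; they are not the Transformers library's implementation.

```python
def greedy_choice(probs):
    # Greedy search always picks the single most probable token
    return max(probs, key=probs.get)

def top_k_filter(probs, k):
    # Keep only the k most probable tokens, then renormalize
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[token] for token in kept)
    return {token: probs[token] / total for token in kept}

def top_p_filter(probs, p):
    # Keep the smallest set of top tokens whose cumulative probability reaches p
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[token] for token in kept)
    return {token: probs[token] / total for token in kept}

# Toy next-token distribution (made-up numbers)
probs = {"the": 0.5, "a": 0.3, "moon": 0.15, "fell": 0.05}

print(greedy_choice(probs))
print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.9))
```

After filtering, a sampling method draws the next token at random from the remaining distribution, which is why the top-k and top-p outputs vary between runs while greedy search does not.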

In [16]:
# Install transformers if it is not already available
!pip install transformers



In [17]:
# Do not modify this code
# https://huggingface.co/docs/transformers/main_classes/text_generation

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")

gpt2_model = AutoModelForCausalLM.from_pretrained("gpt2")


In [60]:
# Only change the prompt and comment or uncomment the different generation lines

prompt = "I think it was a good day"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Greedy search
outputs = gpt2_model.generate(input_ids, pad_token_id=tokenizer.eos_token_id, max_length=100)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

### To suppress the warning, add:
# pad_token_id=tokenizer.eos_token_id
# for example: outputs = gpt2_model.generate(input_ids, pad_token_id=tokenizer.eos_token_id, max_length=100)

['I think it was a good day for the team," said coach Mike Krzyzewski. "We had a great day. We had a great game. We had a great game. We had a great game. We had a great game. We had a great game. We had a great game. We had a great game. We had a great game. We had a great game. We had a great game. We had a great game. We had a great game. We had']

In [61]:
# Beam search
outputs = gpt2_model.generate(input_ids, max_length=100, pad_token_id=tokenizer.eos_token_id, num_beams=5, no_repeat_ngram_size=3, early_stopping=True)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

['I think it was a good day," he said. "It was a great day for me. I think it\'s been a great year for me."\n\nHe said he was happy to be back on the field.\n\n"It\'s been great to play in front of a lot of people and to be able to play with the guys that I\'ve been with for the last couple of years. It\'s been good to get back to where I want to be. I feel like I']

In [62]:
# Sampling
outputs = gpt2_model.generate(input_ids, do_sample=True, pad_token_id=tokenizer.eos_token_id, max_length=100, top_k=0, temperature=0.7)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

['I think it was a good day," said Mr. Redford, who had also worked at the hotel. "I think it was a good day for everybody."\n\nThe reopening of the Hotel Cajun — and the reopening of the hotel lobby — was supposed to be a case of neighborhood celebration and celebration. It was not.\n\nBut that, according to the hotel\'s owner, was not the case.\n\n"It was a very productive project that was being done']

In [63]:
# Top-k
outputs = gpt2_model.generate(input_ids, pad_token_id=tokenizer.eos_token_id, do_sample=True, max_length=100, top_k=50)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

['I think it was a good day."\n\nThat\'s right. I\'m sure fans still love his work, but they\'re still willing to pay for his performances. I\'m sure they want to see him get an even more exciting display — but let\'s start with his role-playing on the stage; his performance from my perspective. But I think it\'s safe to say most of the fans will want to see the "Mafia" actor play a different role.\n\nI still think']

In [64]:
# Top-p
outputs = gpt2_model.generate(input_ids, do_sample=True, pad_token_id=tokenizer.eos_token_id, max_length=100, top_k=50, top_p=0.92)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

['I think it was a good day."\n\nIn the afternoon, Mr. Clinton\'s campaign sent out an official statement calling Mr. Comey\'s dismissal an "error" and blaming the "incidents of his tenure."\n\nOn the morning of his resignation announcement, Mr. Comey tweeted: "The Trump campaign has called for an immediate and thorough investigation. Our goal is to get the truth."\n\nThe president has defended his firing, saying there was a lack of good intelligence about Russia\'s']

In [66]:
prompt = "But, soft! what light through yonder window breaks?"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids

models = {
    'Greedy search' : gpt2_model.generate(input_ids, pad_token_id=tokenizer.eos_token_id, max_length=100),
    'Beam search' : gpt2_model.generate(input_ids, max_length=100, num_beams=5, no_repeat_ngram_size=3, early_stopping=True, pad_token_id=tokenizer.eos_token_id),
    'Sampling' : gpt2_model.generate(input_ids, do_sample=True, max_length=100, top_k=0, temperature=0.7, pad_token_id=tokenizer.eos_token_id),
    'Top-k' : gpt2_model.generate(input_ids, do_sample=True, max_length=100, top_k=50, pad_token_id=tokenizer.eos_token_id),
    'Top-p' : gpt2_model.generate(input_ids, do_sample=True, max_length=100, top_k=50, pad_token_id=tokenizer.eos_token_id, top_p=0.92)
}


for n, out in models.items():
  print(n)
  print(
      tokenizer.batch_decode(out, skip_special_tokens=True)
  )
  print()

Greedy search
["But, soft! what light through yonder window breaks?\n\nI'm not sure what to do.\n\nI'm not sure what to do.\n\nI'm not sure what to do.\n\nI'm not sure what to do.\n\nI'm not sure what to do.\n\nI'm not sure what to do.\n\nI'm not sure what to do.\n\nI'm not sure what to do.\n\nI'm not sure what to"]

Beam search
["But, soft! what light through yonder window breaks?\n\nI can't help it.\n\nIt's not like I'm going to be able to do anything about it. It's just that I'm not going to have the time or the energy to do it. I'm just going to go out there and do what I want to do. I don't know if I can do it or not, but I know I can. I know that I can, and I know"]

Sampling
["But, soft! what light through yonder window breaks?\n\nSo much light, so much light…\n\nAnd my favorite spot in the world…\n\n…is the Red Lion!\n\nWe can't find it.\n\nWe're stuck! We're stuck!\n\nThere's a massive pile of blueberries, and we're stuck!\n\nWe can't find our way out.\n\nNobody is there!\n\nI

**QUESTION (A)**: Beam search produced the most coherent and contextually relevant output. Unlike the other methods, it avoided topic drift and repetition, maintaining a realistic tone aligned with the original sports-related input. Its structured narrative and clarity make it the best choice in this case.

**QUESTION (B)**: None of the generation methods produced authentic Shakespearean text beyond the initial quoted line. While they all began with a genuine verse from Romeo and Juliet, the continuations were modern, incoherent, or stylistically inconsistent with Shakespeare’s language. Therefore, none of the outputs can be considered true Shakespeare; they merely echoed his words without capturing his poetic form or Elizabethan tone.
