# **Third Attempt at generating text using RNN**

So far I've tried to generate Shakespeare-like text using TensorFlow. In my first few attempts I followed the Udacity course and tried to train my own model to perform text generation. I didn't get very far, as the models I tried to train were far too complex and I saw very few results. The model from the TensorFlow guide was far simpler and trained successfully.

I thought I'd have another go at training my own text generation model, this time on something a bit more interesting...

Text Generation Model trained on [Anime Quotes](https://www.kaggle.com/datasets/tarundalal/anime-quotes)


P.S.
I'm most likely going to steal some stuff from both the Udacity and TensorFlow guides.

# **Import Dependencies**

In [1]:
import tensorflow as tf
import numpy as np
import urllib.request
import csv

print(tf.__version__)


2.8.2


# **Download the dataset**

In [2]:
# I downloaded the dataset from this link
url = "https://www.kaggle.com/datasets/tarundalal/anime-quotes/download?datasetVersionNumber=1"


In [3]:
!pwd

/content


I extracted the CSV file and uploaded it to the `/content` folder.

In [4]:
# read the csvfile
anime_quotes = []

# the csv file contains Quote, character, Anime. For this task we are only
# interested in the quote so we would only get the first column from each row.
with open('AnimeQuotes.csv') as csv_file:
  csv_reader = csv.reader(csv_file, delimiter=',')
  for row in csv_reader:
    anime_quotes.append(row[0])

print(anime_quotes[:10])


['Quote', 'People’s lives don’t end when they die, it ends when they lose faith.', 'If you don’t take risks, you can’t create a future!', 'If you don’t like your destiny, don’t accept it.', 'When you give up, that’s when the game ends.', 'All we can do is live until the day we die. Control what we can…and fly free.', 'Forgetting is like a wound. The wound may heal, but it has already left a scar.', 'It’s just pathetic to give up on something before you even give it a shot.”', 'If you don’t share someone’s pain, you can never understand them.', 'Whatever you lose, you’ll find it again. But what you throw away you’ll never get back.']


In [5]:
# remove the header
anime_quotes = anime_quotes[1:]

print(anime_quotes[:10])
print(len(anime_quotes))


['People’s lives don’t end when they die, it ends when they lose faith.', 'If you don’t take risks, you can’t create a future!', 'If you don’t like your destiny, don’t accept it.', 'When you give up, that’s when the game ends.', 'All we can do is live until the day we die. Control what we can…and fly free.', 'Forgetting is like a wound. The wound may heal, but it has already left a scar.', 'It’s just pathetic to give up on something before you even give it a shot.”', 'If you don’t share someone’s pain, you can never understand them.', 'Whatever you lose, you’ll find it again. But what you throw away you’ll never get back.', 'We don’t have to know what tomorrow holds! That’s why we can live for everything we’re worth today!”']
121


# **Prepare the text**

The main task here is to generate anime quotes from our own seed text. For this we need a set of features and labels to train the model on.

<br>

**Set features and labels**   
The features and labels need to reflect the task, so the feature should be an initial piece of text and the label should be the text that follows it.

From what I've seen, there are two ways to approach this: we can create a model which predicts the next char, or one which predicts the next word. I'll try out both ways of preparing the text:
- Predicting the next char
- Predicting the next probable word


In this Colab I'll build a model to predict the next probable char.
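To make the difference between the two granularities concrete, here is a small sketch on one quote from the dataset (written with plain apostrophes). The whitespace split is only a stand-in for word-level tokenization; a real word-level pipeline would also deal with punctuation.

```python
# Toy comparison of char-level and word-level tokens for one quote.
quote = "If you don't take risks, you can't create a future!"

# Char-level: every character, including spaces and punctuation, is a token.
char_tokens = list(quote)

# Word-level: naive whitespace split, so punctuation stays glued to the words.
word_tokens = quote.split()

print(len(char_tokens), char_tokens[:6])
print(len(word_tokens), word_tokens[:4])
print("char vocab:", len(set(char_tokens)), "| word vocab:", len(set(word_tokens)))
```

Char-level sequences are much longer but draw from a small vocabulary, which suits a dataset with only 121 quotes.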



In [6]:
# combine the contents of the list into a single string
all_anime_quotes = " ".join(anime_quotes)


num_char = len(all_anime_quotes)
unique_chars = set(all_anime_quotes)
vocab_size = len(unique_chars)

print(all_anime_quotes)
print(f"Unique_chars: {unique_chars}")
print(f"Total number of characters in all_anime_quotes: {num_char}")
print(f"Vocabulary size: {vocab_size}")


People’s lives don’t end when they die, it ends when they lose faith. If you don’t take risks, you can’t create a future! If you don’t like your destiny, don’t accept it. When you give up, that’s when the game ends. All we can do is live until the day we die. Control what we can…and fly free. Forgetting is like a wound. The wound may heal, but it has already left a scar. It’s just pathetic to give up on something before you even give it a shot.” If you don’t share someone’s pain, you can never understand them. Whatever you lose, you’ll find it again. But what you throw away you’ll never get back. We don’t have to know what tomorrow holds! That’s why we can live for everything we’re worth today!” Why should I apologize for being a monster? Has anyone ever apologized for turning me into one? People become stronger because they have memories they can’t forget. I’ll leave tomorrow’s problems to tomorrow’s me. If you wanna make people dream, you’ve gotta start by believing in that dream you

In [7]:
print(sorted(unique_chars))


[' ', '!', ',', '-', '.', ':', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\xa0', '’', '“', '”', '…']


To recap, these are the steps we are going to take for the text generation model.

**Preparing the text**
- We are going to tokenize the text char by char, converting each char into an integer token.
- From the tokens we then create sequences of length 100, which are our features. Each label is the same sequence shifted one char ahead, so the label at every position is the char that follows the corresponding feature char.
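The two preparation steps can be tried on a toy string first. This is a plain-Python sketch with a sequence length of 4 instead of 100; the names `toy_text`, `toy_char_to_id` and `toy_pairs` are made up for the illustration.

```python
# Toy version of the preparation steps: tokenize, chunk, then shift by one.
toy_text = "hello world"
toy_vocab = sorted(set(toy_text))
toy_char_to_id = {ch: i for i, ch in enumerate(toy_vocab)}
toy_id_to_char = {i: ch for ch, i in toy_char_to_id.items()}
toy_tokens = [toy_char_to_id[ch] for ch in toy_text]

seq_len = 4          # the notebook uses 100
chunk = seq_len + 1  # one extra token so the label can be shifted by one

# Non-overlapping chunks; a trailing chunk that is too short is dropped.
chunks = [toy_tokens[i:i + chunk]
          for i in range(0, len(toy_tokens) - chunk + 1, chunk)]

# Feature = all but the last token, label = all but the first token.
toy_pairs = [(c[:-1], c[1:]) for c in chunks]

for feature, label in toy_pairs:
    print(repr("".join(toy_id_to_char[t] for t in feature)), "->",
          repr("".join(toy_id_to_char[t] for t in label)))
```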

**Model training**
- We then train an RNN model on the features and labels.

**Text generation**
- We then generate text from a seed phrase using the trained model.

Define a dictionary to map each char to a token.

In [8]:
# define a dictionary to map the char into token
char_to_token = dict([(char, token) for token, char in enumerate(unique_chars)])
print(char_to_token)


{'u': 0, '.': 1, 'T': 2, 'm': 3, 'o': 4, 'Y': 5, 'j': 6, 'e': 7, 'F': 8, 'n': 9, '’': 10, ':': 11, 'k': 12, 'J': 13, 'P': 14, '…': 15, 'z': 16, '“': 17, '”': 18, 'h': 19, 'l': 20, 'a': 21, 'S': 22, 't': 23, 'A': 24, 'b': 25, 'y': 26, 'f': 27, 'M': 28, '-': 29, 'v': 30, 'E': 31, 'H': 32, 'V': 33, 'G': 34, '?': 35, 'W': 36, 'K': 37, 'O': 38, 'p': 39, '!': 40, 's': 41, 'w': 42, ' ': 43, 'd': 44, ',': 45, 'D': 46, 'x': 47, 'c': 48, 'R': 49, 'C': 50, 'B': 51, 'N': 52, 'L': 53, 'r': 54, 'g': 55, '\xa0': 56, 'U': 57, 'q': 58, 'i': 59, 'I': 60}


In [9]:
# create a dictionary with inverted mapping
token_to_char = dict([(token, char) for char, token in char_to_token.items()])
print(token_to_char)


{0: 'u', 1: '.', 2: 'T', 3: 'm', 4: 'o', 5: 'Y', 6: 'j', 7: 'e', 8: 'F', 9: 'n', 10: '’', 11: ':', 12: 'k', 13: 'J', 14: 'P', 15: '…', 16: 'z', 17: '“', 18: '”', 19: 'h', 20: 'l', 21: 'a', 22: 'S', 23: 't', 24: 'A', 25: 'b', 26: 'y', 27: 'f', 28: 'M', 29: '-', 30: 'v', 31: 'E', 32: 'H', 33: 'V', 34: 'G', 35: '?', 36: 'W', 37: 'K', 38: 'O', 39: 'p', 40: '!', 41: 's', 42: 'w', 43: ' ', 44: 'd', 45: ',', 46: 'D', 47: 'x', 48: 'c', 49: 'R', 50: 'C', 51: 'B', 52: 'N', 53: 'L', 54: 'r', 55: 'g', 56: '\xa0', 57: 'U', 58: 'q', 59: 'i', 60: 'I'}


In [10]:
# Sanity check
print(f"A has token {char_to_token['A']}")
print(f"{char_to_token['A']} represents {token_to_char[char_to_token['A']]}")


A has token 24
24 represents A
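One thing to watch: the mapping above is built by iterating over a `set`, and Python does not guarantee the same iteration order for a set of strings across interpreter sessions. If the runtime restarts, the same char could get a different token, which would not match a model trained earlier. A possible fix, sketched here on a short sample with hypothetical names (`stable_char_to_token`, `stable_token_to_char`), is to sort the vocabulary before enumerating it.

```python
# Build the mapping from a sorted vocabulary so the ids are reproducible.
sample_text = "People's lives don't end when they die."

stable_vocab = sorted(set(sample_text))
stable_char_to_token = {ch: i for i, ch in enumerate(stable_vocab)}
stable_token_to_char = {i: ch for ch, i in stable_char_to_token.items()}

# Round trip: encode the text and decode it back.
encoded = [stable_char_to_token[ch] for ch in sample_text]
decoded = "".join(stable_token_to_char[t] for t in encoded)

print(stable_char_to_token)
print(decoded == sample_text)
```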


In [11]:
# Convert the text data into sequences
sequences = []
for char in all_anime_quotes:
  token = char_to_token[char]
  sequences.append(token)

print(sequences)
print(f"Length of sequence: {len(sequences)}")


[14, 7, 4, 39, 20, 7, 10, 41, 43, 20, 59, 30, 7, 41, 43, 44, 4, 9, 10, 23, 43, 7, 9, 44, 43, 42, 19, 7, 9, 43, 23, 19, 7, 26, 43, 44, 59, 7, 45, 43, 59, 23, 43, 7, 9, 44, 41, 43, 42, 19, 7, 9, 43, 23, 19, 7, 26, 43, 20, 4, 41, 7, 43, 27, 21, 59, 23, 19, 1, 43, 60, 27, 43, 26, 4, 0, 43, 44, 4, 9, 10, 23, 43, 23, 21, 12, 7, 43, 54, 59, 41, 12, 41, 45, 43, 26, 4, 0, 43, 48, 21, 9, 10, 23, 43, 48, 54, 7, 21, 23, 7, 43, 21, 43, 27, 0, 23, 0, 54, 7, 40, 43, 60, 27, 43, 26, 4, 0, 43, 44, 4, 9, 10, 23, 43, 20, 59, 12, 7, 43, 26, 4, 0, 54, 43, 44, 7, 41, 23, 59, 9, 26, 45, 43, 44, 4, 9, 10, 23, 43, 21, 48, 48, 7, 39, 23, 43, 59, 23, 1, 43, 36, 19, 7, 9, 43, 26, 4, 0, 43, 55, 59, 30, 7, 43, 0, 39, 45, 43, 23, 19, 21, 23, 10, 41, 43, 42, 19, 7, 9, 43, 23, 19, 7, 43, 55, 21, 3, 7, 43, 7, 9, 44, 41, 1, 43, 24, 20, 20, 43, 42, 7, 43, 48, 21, 9, 43, 44, 4, 43, 59, 41, 43, 20, 59, 30, 7, 43, 0, 9, 23, 59, 20, 43, 23, 19, 7, 43, 44, 21, 26, 43, 42, 7, 43, 44, 59, 7, 1, 43, 50, 4, 9, 23, 54, 4, 20, 43, 

In [12]:
# Chunk the token stream into sequences of sequence_length + 1 tokens
# (100 input tokens plus one extra so the label can be shifted by one)
sequence_length = 100
sequences_as_tf_data = tf.data.Dataset.from_tensor_slices(sequences).batch(sequence_length+1, drop_remainder=True)


In [13]:
# display the 2 samples from the dataset
for sample in sequences_as_tf_data.take(2):
  print("".join([token_to_char[token] for token in sample.numpy()]))

People’s lives don’t end when they die, it ends when they lose faith. If you don’t take risks, you ca
n’t create a future! If you don’t like your destiny, don’t accept it. When you give up, that’s when t


In [14]:
# Split sequences into features and labels
def split_sequence(sequence):
  feature = sequence[:-1]
  label = sequence[1:]
  return feature, label


In [15]:
# try it out on the 2 samples
for sample in sequences_as_tf_data.take(2):
  (feature, label) = split_sequence(sample.numpy())
  print("\nFeature, Label pair")
  print("".join([token_to_char[token] for token in feature]))
  print("".join([token_to_char[token] for token in label]))



Feature, Label pair
People’s lives don’t end when they die, it ends when they lose faith. If you don’t take risks, you c
eople’s lives don’t end when they die, it ends when they lose faith. If you don’t take risks, you ca

Feature, Label pair
n’t create a future! If you don’t like your destiny, don’t accept it. When you give up, that’s when 
’t create a future! If you don’t like your destiny, don’t accept it. When you give up, that’s when t


In [16]:
# Apply the split_sequence function to the dataset
feature_label_data = sequences_as_tf_data.map(split_sequence)

for feature, label in feature_label_data.take(1):
  print(feature)
  print(label)


tf.Tensor(
[14  7  4 39 20  7 10 41 43 20 59 30  7 41 43 44  4  9 10 23 43  7  9 44
 43 42 19  7  9 43 23 19  7 26 43 44 59  7 45 43 59 23 43  7  9 44 41 43
 42 19  7  9 43 23 19  7 26 43 20  4 41  7 43 27 21 59 23 19  1 43 60 27
 43 26  4  0 43 44  4  9 10 23 43 23 21 12  7 43 54 59 41 12 41 45 43 26
  4  0 43 48], shape=(100,), dtype=int32)
tf.Tensor(
[ 7  4 39 20  7 10 41 43 20 59 30  7 41 43 44  4  9 10 23 43  7  9 44 43
 42 19  7  9 43 23 19  7 26 43 44 59  7 45 43 59 23 43  7  9 44 41 43 42
 19  7  9 43 23 19  7 26 43 20  4 41  7 43 27 21 59 23 19  1 43 60 27 43
 26  4  0 43 44  4  9 10 23 43 23 21 12  7 43 54 59 41 12 41 45 43 26  4
  0 43 48 21], shape=(100,), dtype=int32)


We have our text data prepared. The input to the model is a sequence of tokens, and the label is the same sequence shifted one char ahead.

In [17]:
# might be easier to convert sequences to strings if we define a function to convert it

def convert_sequence_to_string(sequence):
  string = "".join([token_to_char[token] for token in sequence])
  return string


In [18]:
for feature, label in feature_label_data.take(1):
  print(convert_sequence_to_string(feature.numpy()))
  print(convert_sequence_to_string(label.numpy()))

People’s lives don’t end when they die, it ends when they lose faith. If you don’t take risks, you c
eople’s lives don’t end when they die, it ends when they lose faith. If you don’t take risks, you ca


In [19]:
# create a batched dataset
batched_dataset = (feature_label_data.batch(1))

Although this seems unnecessary considering that there are really only 124 samples, RNN models expect a batch dimension.

# **Define the RNN model**

In [20]:
vocab_size = 61  # same value as len(unique_chars) computed earlier
Embedding_dim = 128
GRU_units = 256


In [21]:
# define a text generation model
Anime_qoutes_model = tf.keras.Sequential([tf.keras.layers.Embedding(vocab_size, Embedding_dim),
                                         tf.keras.layers.GRU(units=GRU_units, dropout=0.5, 
                                                             recurrent_dropout=0.25,
                                                             return_sequences=True),
                                         tf.keras.layers.Dense(units=vocab_size, activation="softmax")])




In [22]:
# Compile the model
Anime_qoutes_model.compile(optimizer='adam',
                           loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                           )

Anime_qoutes_model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 128)         7808      
                                                                 
 gru (GRU)                   (None, None, 256)         296448    
                                                                 
 dense (Dense)               (None, None, 61)          15677     
                                                                 
Total params: 319,933
Trainable params: 319,933
Non-trainable params: 0
_________________________________________________________________


Some notes that were lost from an unsaved version:
- The data needs a batch dimension for the RNN.
- SparseCategoricalCrossentropy is used for multiclass classification when the expected labels are integers rather than one-hot encoded vectors.
- Use CategoricalCrossentropy for multiclass classification with one-hot encoded labels.
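The two losses compute the same quantity and differ only in the label format. A small NumPy sketch of that idea, with made-up probabilities and labels (this is a hand-rolled calculation, not the Keras implementation):

```python
import numpy as np

# Made-up predicted distributions for 2 samples over 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])

int_labels = np.array([0, 2])       # integer labels ("sparse" format)
one_hot = np.eye(3)[int_labels]     # the same labels, one-hot encoded

# Sparse version: pick the probability of the true class by index.
sparse_ce = -np.log(probs[np.arange(len(int_labels)), int_labels]).mean()

# Categorical version: the one-hot vector masks out every other class.
categorical_ce = -(one_hot * np.log(probs)).sum(axis=1).mean()

print(sparse_ce, categorical_ce)
```

With a vocabulary of 61 chars and 100 time steps per sample, integer labels avoid building a `(100, 61)` one-hot array for every sequence.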

# **Train the model**

**Run forward pass and get the loss**

In [23]:
# get a single feature and label from the batch dataset
for batch_feature, batch_label in batched_dataset.take(1):
  print(batch_feature[0])
  print(batch_label[0])
  

tf.Tensor(
[14  7  4 39 20  7 10 41 43 20 59 30  7 41 43 44  4  9 10 23 43  7  9 44
 43 42 19  7  9 43 23 19  7 26 43 44 59  7 45 43 59 23 43  7  9 44 41 43
 42 19  7  9 43 23 19  7 26 43 20  4 41  7 43 27 21 59 23 19  1 43 60 27
 43 26  4  0 43 44  4  9 10 23 43 23 21 12  7 43 54 59 41 12 41 45 43 26
  4  0 43 48], shape=(100,), dtype=int32)
tf.Tensor(
[ 7  4 39 20  7 10 41 43 20 59 30  7 41 43 44  4  9 10 23 43  7  9 44 43
 42 19  7  9 43 23 19  7 26 43 44 59  7 45 43 59 23 43  7  9 44 41 43 42
 19  7  9 43 23 19  7 26 43 20  4 41  7 43 27 21 59 23 19  1 43 60 27 43
 26  4  0 43 44  4  9 10 23 43 23 21 12  7 43 54 59 41 12 41 45 43 26  4
  0 43 48 21], shape=(100,), dtype=int32)


In [24]:
# display the string
print(convert_sequence_to_string(batch_feature[0].numpy()))
print(convert_sequence_to_string(batch_label[0].numpy()))


People’s lives don’t end when they die, it ends when they lose faith. If you don’t take risks, you c
eople’s lives don’t end when they die, it ends when they lose faith. If you don’t take risks, you ca


In [25]:
# pass the feature to the untrained model and get the prediction
batch_prediction = Anime_qoutes_model(batch_feature)
print(batch_prediction)


tf.Tensor(
[[[0.01619977 0.01635637 0.01622239 ... 0.01632152 0.01634176 0.01633183]
  [0.01631493 0.01619919 0.01645799 ... 0.01638399 0.01644187 0.01625898]
  [0.01631034 0.01635299 0.01657584 ... 0.0165051  0.01636226 0.01628372]
  ...
  [0.01632326 0.01626946 0.0162732  ... 0.01599888 0.01616622 0.01669499]
  [0.01605609 0.01623164 0.01612335 ... 0.01595752 0.01617189 0.01665997]
  [0.01644229 0.01666695 0.01607653 ... 0.01617    0.01625697 0.01625678]]], shape=(1, 100, 61), dtype=float32)


In [26]:
# something interesting to show:
# try passing just a single feature instead of the batched feature.
try:
  batch_prediction = Anime_qoutes_model(batch_feature[0])
  print(batch_prediction)
except Exception as e:
  print("Sorry that's a no no")
  print(f"{e}")

Sorry that's a no no
Exception encountered when calling layer "sequential" (type Sequential).

Input 0 of layer "gru" is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: (100, 128)

Call arguments received:
  • inputs=tf.Tensor(shape=(100,), dtype=int32)
  • training=False
  • mask=None


The GRU layer expects an input with 3 dimensions: `[batch dimension, time steps, time step dimension]`. Here the embedding layer turned the unbatched `(100,)` input into `(100, 128)`, which has only 2.
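If a single unbatched sequence ever needs to go through the model, the missing batch axis can be added by hand. A NumPy sketch with a stand-in array (`single_feature` is made up here); on a tensor, `tf.expand_dims(batch_feature[0], axis=0)` would do the same job.

```python
import numpy as np

# Stand-in for one tokenized sequence of 100 tokens.
single_feature = np.arange(100)

# Two equivalent ways of adding a leading batch axis of size 1.
batched_a = single_feature[np.newaxis, :]
batched_b = np.expand_dims(single_feature, axis=0)

print(single_feature.shape, "->", batched_a.shape)
```

With shape `(1, 100)` going in, the embedding layer produces `(1, 100, 128)`, which is the 3-dimensional input the GRU asks for.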

In [27]:
# convert the prediction into a readable text
predicted_char_list = []
for prob_distribution in batch_prediction[0].numpy():
  char_id = np.argmax(prob_distribution)
  predicted_char_list.append(token_to_char[char_id])

print(predicted_char_list)


['?', 'B', 'K', '?', 'F', 'B', 'K', 'x', 'N', 'N', 'F', '!', 'K', '!', 'N', 'N', 'K', 'K', 'K', 'R', 'U', 'K', 'K', 'g', 'H', 'V', 'V', 'K', 'K', 'K', 'R', 'R', 'K', '?', 'f', 'O', 'V', 'K', '?', 'f', 'V', 'R', 'f', 'K', 'K', 'g', '-', 'H', 'V', 'N', 'K', 'K', 'K', 'R', 'R', 'K', '?', 'f', 'F', 'V', 'N', 'K', 'K', 'w', 'f', 'V', 'R', 'R', 'J', 'f', 'V', 'p', 'f', '?', 'f', 'f', 'f', 'V', 'K', 'K', 'K', 'R', 'U', 'R', 'R', 'R', 'g', '!', 'K', 'V', '.', '’', '-', '?', 'f', '?', 'f', 'f', 'f', 'R']


In [28]:
predicted_text = "".join(predicted_char_list)
print(predicted_text)
print(len(predicted_text))


?BK?FBKxNNF!K!NNKKKRUKKgHVVKKKRRK?fOVK?fVRfKKg-HVNKKKRRK?fFVNKKwfVRRJfVpf?fffVKKKRURRRg!KV.’-?f?fffR
100
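The row-by-row loop used for decoding can also be written as one vectorized `np.argmax` call over the last axis. The sketch below uses a random stand-in for the model output (`fake_prediction` is made up), so only the shapes matter, not the values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a model output: 1 batch, 100 time steps, 61 classes.
fake_prediction = rng.random((1, 100, 61))
fake_prediction /= fake_prediction.sum(axis=-1, keepdims=True)

# Vectorized: most likely class at every time step in one call.
vectorized_ids = np.argmax(fake_prediction[0], axis=-1)

# Loop version, as in the cell above, for comparison.
loop_ids = np.array([np.argmax(row) for row in fake_prediction[0]])

print(vectorized_ids.shape, np.array_equal(vectorized_ids, loop_ids))
```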


In [29]:
# get the loss of the model
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
example_loss = loss(batch_label, batch_prediction).numpy()
print(example_loss)


4.112594


  return dispatch_target(*args, **kwargs)


I need to make sense of the warning from the cell above: the loss was created with `from_logits=True`, even though the model's last layer already applies softmax. I've found this [thread](https://stackoverflow.com/questions/67848962/selecting-loss-and-metrics-for-tensorflow-model) on Stack Overflow that gives better guidance on selecting the loss and metrics, and on the difference between softmax and sigmoid activations.

**TLDR Summary**
- We can use sparse_categorical_accuracy as a metric for a classification task where the label is an integer and not a one-hot encoded vector.
- Similarly, we can use sparse_categorical_crossentropy as the loss function for a classification task in the same scenario.

<br>

- In cases where the labels are one-hot encoded vectors, we can use categorical_accuracy as the metric and categorical_crossentropy as the loss function.

<br>

- Softmax is commonly used as the activation function in the output layer for classification tasks. It produces a probability distribution, so the outputs of the layer sum to 1.
Generally, if the model outputs a probability distribution, you'll need to set `from_logits=False`.

<br>

So what are Logits???   
[Another stack overflow thread](https://stackoverflow.com/questions/34240703/what-are-logits-what-is-the-difference-between-softmax-and-softmax-cross-entrop)

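**Summary**

Logits are the raw, unnormalized scores that a model's final layer produces before any activation is applied. Passing them through softmax turns them into a probability distribution. `from_logits=True` tells the loss to apply that softmax itself, so it should only be set when the output layer has no softmax activation.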
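To make the term concrete, here is a small NumPy sketch with made-up scores. It is a hand-rolled illustration of what the `from_logits` flag means in general, not a reproduction of what Keras did in this notebook (the two loss cells here printed the same value, which suggests Keras handled the mismatch itself).

```python
import numpy as np

def softmax(x):
    """Turn raw scores (logits) into a probability distribution."""
    shifted = x - np.max(x)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])  # made-up raw scores for 3 classes
probs = softmax(logits)
true_class = 0

# Loss given probabilities: take the log directly (the from_logits=False idea).
loss_from_probs = -np.log(probs[true_class])

# Loss given logits: softmax is applied inside the loss (the from_logits=True idea).
loss_from_logits = -np.log(softmax(logits)[true_class])

# Mistake: probabilities treated as logits, so softmax is applied twice.
loss_double_softmax = -np.log(softmax(probs)[true_class])

print(probs, probs.sum())
print(loss_from_probs, loss_from_logits, loss_double_softmax)
```

The first two losses agree; the third is different because the second softmax flattens the distribution.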



In [30]:
# define the loss without setting the from_logits argument to be true
loss=tf.keras.losses.SparseCategoricalCrossentropy()
example_loss = loss(batch_label, batch_prediction).numpy()
print(example_loss)


4.112594


Happy days.
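As a sanity check on the 4.11 value: a model that spreads probability evenly over all 61 chars has a cross-entropy of ln(61). The untrained predictions printed earlier are all close to 1/61 ≈ 0.0164, so the loss should land near that number. The variable names below are made up for the check.

```python
import math

assumed_vocab_size = 61   # number of unique chars in the dataset
observed_loss = 4.112594  # value printed by the untrained model above

# Cross-entropy of a perfectly uniform prediction over the vocabulary.
uniform_loss = math.log(assumed_vocab_size)

print(uniform_loss)
print(abs(uniform_loss - observed_loss))
```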

As annoying as it is, I think it's best to keep the errors in the notebook so that I, and anyone else reading this, can easily see the issues, lessons and solutions I came across.

---

**Redefine the model**

In [31]:
Anime_qoutes_model = tf.keras.Sequential([tf.keras.layers.Embedding(vocab_size,
                                                                    Embedding_dim),
                                          tf.keras.layers.GRU(units=GRU_units,
                                                              dropout=0.5,
                                                              recurrent_dropout=0.25,
                                                              return_sequences=True),
                                          tf.keras.layers.Dense(units=vocab_size, activation="softmax")])

Anime_qoutes_model.compile(optimizer='adam',
                           loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                           metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])



In [32]:
# define the model callbacks
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath="./model_checkpoints/model_epoch_{epoch}_loss_{loss}",
                                                               monitor='loss',
                                                               save_best_only=True)
early_stopping_callback = tf.keras.callbacks.EarlyStopping(monitor='loss', min_delta=0.1, patience=4)
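It is worth seeing what `min_delta=0.1` with `patience=4` implies: an epoch only counts as an improvement if the loss drops by more than 0.1, so a run of small improvements can end training well before epoch 22. The function below is a simplified plain-Python illustration of that patience rule with made-up loss values; it is not the Keras implementation.

```python
def epochs_until_stop(losses, min_delta=0.1, patience=4):
    """Return the 1-based epoch at which a simplified patience rule stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            # Improvement larger than min_delta: remember it, reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(losses)  # never triggered: ran through every epoch

# Made-up losses: big drops at first, then improvements smaller than 0.1.
made_up_losses = [3.2, 2.9, 2.7, 2.66, 2.64, 2.63, 2.62, 2.61, 2.60]
print(epochs_until_stop(made_up_losses))
```

If training stops earlier than expected, a smaller `min_delta` is one knob to try.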

In [None]:
history = Anime_qoutes_model.fit(batched_dataset, epochs=22,
                                 callbacks=[model_checkpoint_callback, early_stopping_callback])

Epoch 1/22
Epoch 2/22
Epoch 3/22

I have a feeling the final model is going to overfit the dataset; it might be best to stop at the 10th epoch.

# **Generate text using the trained model**

In [None]:
# if you know, you know
Seed_word = "Never going to take me down"
Seed_word_id = []
for char in Seed_word:
  # convert each char of the seed into its token
  Seed_word_id.append(char_to_token[char])

print(Seed_word)
print(Seed_word_id)


In [None]:
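for i in range(1000):
  # add the batch dimension the model expects: shape (1, current_length)
  model_input = np.array([Seed_word_id])
  prediction = Anime_qoutes_model.predict(model_input, verbose=0)
  # only the distribution at the last time step predicts the next char
  predicted_char = int(np.argmax(prediction[0, -1]))
  Seed_word_id.append(predicted_char)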


In [None]:
print("Initial seed text: ", Seed_word)
print(convert_sequence_to_string(Seed_word_id))
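Always taking the most likely char tends to make generated text repeat itself. A common alternative is to sample the next token from the predicted distribution, optionally reshaped by a temperature. The helper below is a NumPy sketch of that idea; `sample_next_token` and its parameters are hypothetical and not part of the notebook above.

```python
import numpy as np

def sample_next_token(probabilities, temperature=1.0, rng=None):
    """Sample a token id from a probability distribution.

    temperature < 1 sharpens the distribution (closer to argmax),
    temperature > 1 flattens it (more random output).
    """
    if rng is None:
        rng = np.random.default_rng()
    # Work in log space; the small constant avoids log(0).
    scaled = np.log(np.asarray(probabilities, dtype=np.float64) + 1e-12) / temperature
    scaled -= scaled.max()  # numerical stability before exponentiating
    weights = np.exp(scaled)
    weights /= weights.sum()
    return int(rng.choice(len(weights), p=weights))

# Made-up distribution over 3 tokens.
example_probs = [0.1, 0.7, 0.2]
rng = np.random.default_rng(0)
print([sample_next_token(example_probs, temperature=1.0, rng=rng) for _ in range(10)])
```

In the generation loop, the distribution at the last time step (`prediction[0, -1]`) would be the input to such a helper.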
