# **Third Attempt at generating text using an RNN**

So far I've tried to generate Shakespeare-like text using TensorFlow. With my first few attempts I went off the Udacity course and tried to train my own model to perform text generation. I didn't get very far, as the models I tried to train were far too complex and I saw very few results. Following the guide on TensorFlow, the trained model was far simpler and was trained successfully.

I thought I'd have another go at training my own text generation model, this time on something a bit more interesting...

Text Generation Model trained on [Anime Quotes](https://www.kaggle.com/datasets/tarundalal/anime-quotes)


P.S.
I'm most likely going to steal some stuff from both the Udacity and TensorFlow guides.

# **Import Dependencies**

In [1]:
import tensorflow as tf
import numpy as np
import urllib.request
import csv

print(tf.__version__)


2.8.2


# **Download the dataset**

In [None]:
# i downloaded the dataset from this link
url = "https://www.kaggle.com/datasets/tarundalal/anime-quotes/download?datasetVersionNumber=1"


In [None]:
!pwd

Extracted the csv file and loaded it into the contents folder

In [2]:
# read the csvfile
anime_quotes = []

# the csv file contains Quote, character, Anime. For this task we are only
# interested in the quote so we would only get the first column from each row.
with open('AnimeQuotes.csv') as csv_file:
  csv_reader = csv.reader(csv_file, delimiter=',')
  for row in csv_reader:
    anime_quotes.append(row[0])

print(anime_quotes[:10])


['Quote', 'People’s lives don’t end when they die, it ends when they lose faith.', 'If you don’t take risks, you can’t create a future!', 'If you don’t like your destiny, don’t accept it.', 'When you give up, that’s when the game ends.', 'All we can do is live until the day we die. Control what we can…and fly free.', 'Forgetting is like a wound. The wound may heal, but it has already left a scar.', 'It’s just pathetic to give up on something before you even give it a shot.”', 'If you don’t share someone’s pain, you can never understand them.', 'Whatever you lose, you’ll find it again. But what you throw away you’ll never get back.']


In [3]:
# remove the header
anime_quotes = anime_quotes[1:]

print(anime_quotes[:10])
print(len(anime_quotes))


['People’s lives don’t end when they die, it ends when they lose faith.', 'If you don’t take risks, you can’t create a future!', 'If you don’t like your destiny, don’t accept it.', 'When you give up, that’s when the game ends.', 'All we can do is live until the day we die. Control what we can…and fly free.', 'Forgetting is like a wound. The wound may heal, but it has already left a scar.', 'It’s just pathetic to give up on something before you even give it a shot.”', 'If you don’t share someone’s pain, you can never understand them.', 'Whatever you lose, you’ll find it again. But what you throw away you’ll never get back.', 'We don’t have to know what tomorrow holds! That’s why we can live for everything we’re worth today!”']
121


# **Prepare the text**

The main task here is to generate anime quotes from our own seed text. For this we need a set of features and labels to train the model on.

<br>

**Set features and labels**   
The features and labels need to reflect the task, so the feature should be an initial piece of text and the label should be the text that follows it.

From what I've seen there are 2 ways we can approach this: we can create a model which predicts the next char, or one which predicts the next word. I'll try out the different methods to prepare the text:
- Predicting the next char
- Predicting the next probable word


In this Colab I'll build a model to predict the next probable char.
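To make the difference between the two approaches concrete, here is a small sketch on a single quote (typed with straight apostrophes for simplicity). The whitespace split is only a naive stand-in for a real word tokenizer, and the variable names are made up for this example.

```python
# Illustrative only: compare character-level and word-level units for one quote.
quote = "If you don't take risks, you can't create a future!"

char_units = list(quote)    # one unit per character, spaces and punctuation included
word_units = quote.split()  # naive whitespace split; punctuation stays attached to words

char_vocab = sorted(set(char_units))
word_vocab = sorted(set(word_units))

print(f"chars: {len(char_units)} units, vocabulary of {len(char_vocab)}")
print(f"words: {len(word_units)} units, vocabulary of {len(word_vocab)}")
```

Char-level gives long sequences over a tiny vocabulary, while word-level gives short sequences over a vocabulary that grows with the dataset.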



In [4]:
# combine the contents of the list into a single string
all_anime_quotes = " ".join(anime_quotes)


num_char = len(all_anime_quotes)
unique_chars = set(all_anime_quotes)
vocab_size = len(unique_chars)

print(all_anime_quotes)
print(f"Unique_chars: {unique_chars}")
print(f"Total number of characters in all_anime_quotes: {num_char}")
print(f"Vocabulary size: {vocab_size}")


People’s lives don’t end when they die, it ends when they lose faith. If you don’t take risks, you can’t create a future! If you don’t like your destiny, don’t accept it. When you give up, that’s when the game ends. All we can do is live until the day we die. Control what we can…and fly free. Forgetting is like a wound. The wound may heal, but it has already left a scar. It’s just pathetic to give up on something before you even give it a shot.” If you don’t share someone’s pain, you can never understand them. Whatever you lose, you’ll find it again. But what you throw away you’ll never get back. We don’t have to know what tomorrow holds! That’s why we can live for everything we’re worth today!” Why should I apologize for being a monster? Has anyone ever apologized for turning me into one? People become stronger because they have memories they can’t forget. I’ll leave tomorrow’s problems to tomorrow’s me. If you wanna make people dream, you’ve gotta start by believing in that dream you

In [5]:
print(sorted(unique_chars))


[' ', '!', ',', '-', '.', ':', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\xa0', '’', '“', '”', '…']


To recap, these are the steps we are going to take for the text generation model.

**Preparing the text**
- We tokenize the text char by char, converting each char into an integer token.
- From the tokens we then create sequences of length 100, which are our features. Each label is the same sequence shifted one char ahead, so every position in the label holds the char that follows the matching position in the feature.

**Model training**
- We then train an RNN model on the features and labels.

**Text generation**
- We then generate text from a seed string using the trained model.
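The generation step in the recap can be sketched without TensorFlow. This is only a rough outline under stated assumptions: `next_char_probs` is a hypothetical callable standing in for the trained model, and `toy_model` is a made-up stand-in so the loop can run on its own.

```python
import random

def generate_text(seed_text, next_char_probs, num_chars, rng=None):
    """Extend seed_text one char at a time.

    next_char_probs is a hypothetical callable: it takes the text so far and
    returns a dict mapping each candidate char to its probability.
    """
    rng = rng or random.Random()
    text = seed_text
    for _ in range(num_chars):
        probs = next_char_probs(text)
        chars = list(probs.keys())
        weights = list(probs.values())
        # Sample from the distribution instead of always taking the most
        # likely char, which tends to make the output repeat itself.
        text += rng.choices(chars, weights=weights, k=1)[0]
    return text

# Made-up stand-in for a trained model: strongly prefers the next letter in "abc".
def toy_model(text):
    order = "abc"
    preferred = order[(order.index(text[-1]) + 1) % len(order)]
    return {c: (0.98 if c == preferred else 0.01) for c in order}

print(generate_text("a", toy_model, num_chars=8, rng=random.Random(0)))
```

With a real model, the callable would tokenize the text, run a prediction and turn the output for the last position back into per-char probabilities.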

Define dictionaries to map chars to tokens and back.

In [6]:
# define a dictionary to map the char into token
char_to_token = dict([(char, token) for token, char in enumerate(unique_chars)])
print(char_to_token)


{'P': 0, '“': 1, 'I': 2, 'W': 3, 'U': 4, 'F': 5, 'M': 6, 'Y': 7, 'e': 8, 'b': 9, '’': 10, 's': 11, 'B': 12, '\xa0': 13, 'R': 14, '”': 15, 'N': 16, 'i': 17, 'y': 18, 'm': 19, '?': 20, '.': 21, 'z': 22, 'E': 23, 'V': 24, 'D': 25, 'u': 26, 'f': 27, '…': 28, '-': 29, 'n': 30, 't': 31, 'T': 32, 'o': 33, 'k': 34, 'r': 35, 'A': 36, 'l': 37, '!': 38, 'j': 39, 'H': 40, 'g': 41, 'G': 42, 'L': 43, 'p': 44, 'c': 45, 'x': 46, 'q': 47, ' ': 48, 'O': 49, 'v': 50, ':': 51, 'J': 52, 'h': 53, ',': 54, 'K': 55, 'a': 56, 'S': 57, 'd': 58, 'C': 59, 'w': 60}


In [7]:
# create a dictionary with inverted mapping
token_to_char = dict([(token, char) for char, token in char_to_token.items()])
print(token_to_char)


{0: 'P', 1: '“', 2: 'I', 3: 'W', 4: 'U', 5: 'F', 6: 'M', 7: 'Y', 8: 'e', 9: 'b', 10: '’', 11: 's', 12: 'B', 13: '\xa0', 14: 'R', 15: '”', 16: 'N', 17: 'i', 18: 'y', 19: 'm', 20: '?', 21: '.', 22: 'z', 23: 'E', 24: 'V', 25: 'D', 26: 'u', 27: 'f', 28: '…', 29: '-', 30: 'n', 31: 't', 32: 'T', 33: 'o', 34: 'k', 35: 'r', 36: 'A', 37: 'l', 38: '!', 39: 'j', 40: 'H', 41: 'g', 42: 'G', 43: 'L', 44: 'p', 45: 'c', 46: 'x', 47: 'q', 48: ' ', 49: 'O', 50: 'v', 51: ':', 52: 'J', 53: 'h', 54: ',', 55: 'K', 56: 'a', 57: 'S', 58: 'd', 59: 'C', 60: 'w'}


In [8]:
# Sanity check
print(f"A has token {char_to_token['A']}")
print(f"{char_to_token['A']} represents {token_to_char[char_to_token['A']]}")


A has token 36
36 represents A
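One thing to watch: the mapping above is built by iterating over a `set`, and the iteration order of a set of strings can differ between Python sessions, so the same char may get a different token after a runtime restart. That matters if a saved checkpoint is loaded later. A possible variation (not what this notebook does) is to sort the chars first. The `stable_` names below are made up for the example.

```python
text = "People's lives don't end when they die"

# Sorting the unique chars makes the token ids reproducible between runs.
vocab = sorted(set(text))
stable_char_to_token = {char: token for token, char in enumerate(vocab)}
stable_token_to_char = {token: char for char, token in stable_char_to_token.items()}

# Round trip: encode the text to tokens and decode it back.
encoded = [stable_char_to_token[char] for char in text]
decoded = "".join(stable_token_to_char[token] for token in encoded)
print(decoded == text)
```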


In [9]:
# Convert the text data into sequences
sequences = []
for char in all_anime_quotes:
  token = char_to_token[char]
  sequences.append(token)

print(sequences)
print(f"Length of sequence: {len(sequences)}")


[0, 8, 33, 44, 37, 8, 10, 11, 48, 37, 17, 50, 8, 11, 48, 58, 33, 30, 10, 31, 48, 8, 30, 58, 48, 60, 53, 8, 30, 48, 31, 53, 8, 18, 48, 58, 17, 8, 54, 48, 17, 31, 48, 8, 30, 58, 11, 48, 60, 53, 8, 30, 48, 31, 53, 8, 18, 48, 37, 33, 11, 8, 48, 27, 56, 17, 31, 53, 21, 48, 2, 27, 48, 18, 33, 26, 48, 58, 33, 30, 10, 31, 48, 31, 56, 34, 8, 48, 35, 17, 11, 34, 11, 54, 48, 18, 33, 26, 48, 45, 56, 30, 10, 31, 48, 45, 35, 8, 56, 31, 8, 48, 56, 48, 27, 26, 31, 26, 35, 8, 38, 48, 2, 27, 48, 18, 33, 26, 48, 58, 33, 30, 10, 31, 48, 37, 17, 34, 8, 48, 18, 33, 26, 35, 48, 58, 8, 11, 31, 17, 30, 18, 54, 48, 58, 33, 30, 10, 31, 48, 56, 45, 45, 8, 44, 31, 48, 17, 31, 21, 48, 3, 53, 8, 30, 48, 18, 33, 26, 48, 41, 17, 50, 8, 48, 26, 44, 54, 48, 31, 53, 56, 31, 10, 11, 48, 60, 53, 8, 30, 48, 31, 53, 8, 48, 41, 56, 19, 8, 48, 8, 30, 58, 11, 21, 48, 36, 37, 37, 48, 60, 8, 48, 45, 56, 30, 48, 58, 33, 48, 17, 11, 48, 37, 17, 50, 8, 48, 26, 30, 31, 17, 37, 48, 31, 53, 8, 48, 58, 56, 18, 48, 60, 8, 48, 58, 17, 8, 

In [10]:
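# Split the token stream into chunks of sequence_length + 1 tokens, so each
# chunk can later be split into a 100-token feature and a 100-token label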
sequence_length = 100
sequences_as_tf_data = tf.data.Dataset.from_tensor_slices(sequences).batch(sequence_length+1, drop_remainder=True)


In [11]:
# display 2 samples from the dataset
for sample in sequences_as_tf_data.take(2):
  print("".join([token_to_char[token] for token in sample.numpy()]))

People’s lives don’t end when they die, it ends when they lose faith. If you don’t take risks, you ca
n’t create a future! If you don’t like your destiny, don’t accept it. When you give up, that’s when t


In [12]:
# Split sequences into features and labels
def split_sequence(sequence):
  feature = sequence[:-1]
  label = sequence[1:]
  return feature, label


In [13]:
# try it out on the 2 samples
for sample in sequences_as_tf_data.take(2):
  (feature, label) = split_sequence(sample.numpy())
  print("\nFeature, Label pair")
  print("".join([token_to_char[token] for token in feature]))
  print("".join([token_to_char[token] for token in label]))



Feature, Label pair
People’s lives don’t end when they die, it ends when they lose faith. If you don’t take risks, you c
eople’s lives don’t end when they die, it ends when they lose faith. If you don’t take risks, you ca

Feature, Label pair
n’t create a future! If you don’t like your destiny, don’t accept it. When you give up, that’s when 
’t create a future! If you don’t like your destiny, don’t accept it. When you give up, that’s when t


In [14]:
# Apply the split_sequence function to the dataset
feature_label_data = sequences_as_tf_data.map(split_sequence)

for feature, label in feature_label_data.take(1):
  print(feature)
  print(label)


tf.Tensor(
[ 0  8 33 44 37  8 10 11 48 37 17 50  8 11 48 58 33 30 10 31 48  8 30 58
 48 60 53  8 30 48 31 53  8 18 48 58 17  8 54 48 17 31 48  8 30 58 11 48
 60 53  8 30 48 31 53  8 18 48 37 33 11  8 48 27 56 17 31 53 21 48  2 27
 48 18 33 26 48 58 33 30 10 31 48 31 56 34  8 48 35 17 11 34 11 54 48 18
 33 26 48 45], shape=(100,), dtype=int32)
tf.Tensor(
[ 8 33 44 37  8 10 11 48 37 17 50  8 11 48 58 33 30 10 31 48  8 30 58 48
 60 53  8 30 48 31 53  8 18 48 58 17  8 54 48 17 31 48  8 30 58 11 48 60
 53  8 30 48 31 53  8 18 48 37 33 11  8 48 27 56 17 31 53 21 48  2 27 48
 18 33 26 48 58 33 30 10 31 48 31 56 34  8 48 35 17 11 34 11 54 48 18 33
 26 48 45 56], shape=(100,), dtype=int32)


We have our text data prepared. The input to the model is a sequence of tokens and the label is the same sequence shifted one char ahead.
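Because the chunks come from `.batch(sequence_length+1, drop_remainder=True)`, they sit back to back and do not overlap, so a small dataset yields only a handful of training examples. A sliding window is a common alternative that produces many more. The sketch below is plain Python on a stand-in token list, purely to compare the counts; it is not the approach used in this notebook and both helper names are made up.

```python
def non_overlapping_chunks(tokens, chunk_len):
    """Back-to-back chunks; any leftover tokens at the end are dropped."""
    return [tokens[i:i + chunk_len]
            for i in range(0, len(tokens) - chunk_len + 1, chunk_len)]

def sliding_windows(tokens, chunk_len, step=1):
    """Overlapping chunks, each starting `step` tokens after the previous one."""
    return [tokens[i:i + chunk_len]
            for i in range(0, len(tokens) - chunk_len + 1, step)]

tokens = list(range(1000))  # stand-in for the real token list
chunk_len = 101             # sequence_length + 1

print(len(non_overlapping_chunks(tokens, chunk_len)))
print(len(sliding_windows(tokens, chunk_len)))
```

The trade-off is that overlapping windows are highly correlated with each other, so more examples does not mean proportionally more information.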

In [15]:
# might be easier to convert sequences to strings if we define a function to convert it

def convert_sequence_to_string(sequence):
  string = "".join([token_to_char[token] for token in sequence])
  return string


In [16]:
for feature, label in feature_label_data.take(1):
  print(convert_sequence_to_string(feature.numpy()))
  print(convert_sequence_to_string(label.numpy()))

People’s lives don’t end when they die, it ends when they lose faith. If you don’t take risks, you c
eople’s lives don’t end when they die, it ends when they lose faith. If you don’t take risks, you ca


In [17]:
# create a batched dataset
batched_dataset = (feature_label_data.batch(1))

# **Define the RNN model**

In [24]:
vocab_size = 61
Embedding_dim = 128
GRU_units = 256


In [25]:
# define a text generation model
Anime_qoutes_model = tf.keras.Sequential([tf.keras.layers.Embedding(vocab_size, Embedding_dim),
                                         tf.keras.layers.GRU(units=GRU_units, dropout=0.5, 
                                                             recurrent_dropout=0.25,
                                                             return_sequences=True),
                                         tf.keras.layers.Dense(units=vocab_size, activation="softmax")])


In [26]:
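# Compile the model
# The Dense output layer already applies softmax, so the loss receives
# probabilities rather than logits: from_logits must be False
Anime_qoutes_model.compile(optimizer='adam',
                           loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                           )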

Anime_qoutes_model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, None, 128)         7808      
                                                                 
 gru_2 (GRU)                 (None, None, 256)         296448    
                                                                 
 dense_2 (Dense)             (None, None, 61)          15677     
                                                                 
Total params: 319,933
Trainable params: 319,933
Non-trainable params: 0
_________________________________________________________________


Some notes I lost from an unsaved version:
- The data needs a batch dimension for the RNN.
- `SparseCategoricalCrossentropy` is used for multiclass classification when the expected labels are integers rather than one-hot encoded labels.
- Use `CategoricalCrossentropy` loss for multiclass classification with one-hot encoded labels.
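A related detail is the `from_logits` flag, which has to agree with what the last layer emits: raw scores (logits) or softmax probabilities. The sketch below computes the sparse cross-entropy by hand for one made-up prediction to show what goes wrong in a plain implementation when probabilities are treated as logits. It is only an illustration of the maths, not how Keras handles the case internally.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_ce_from_probs(probs, label):
    """Cross-entropy for one example with an integer label."""
    return -math.log(probs[label])

def sparse_ce_from_logits(logits, label):
    return sparse_ce_from_probs(softmax(logits), label)

logits = [2.0, 0.5, -1.0]  # made-up scores for a 3-char vocabulary
label = 0                  # integer label, as in the sparse variant
probs = softmax(logits)

correct = sparse_ce_from_probs(probs, label)
# Mistake: probabilities passed where logits are expected, so softmax runs twice.
double_softmax = sparse_ce_from_logits(probs, label)
print(correct, double_softmax)
```

The second value is larger because the extra softmax flattens the distribution, which makes the model look less confident than it is.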

# **Train the model**

In [27]:
# define the model callbacks
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath="./model_checkpoints/model_epoch_{epoch}_loss_{loss}",
                                                               monitor='loss',
                                                               save_best_only=True)
early_stopping_callback = tf.keras.callbacks.EarlyStopping(monitor='loss', min_delta=0.1, patience=4)
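To get a feel for what `min_delta=0.1` and `patience=4` mean together, here is a simplified model of the patience logic in plain Python. It is a sketch of the idea rather than the Keras code, and the loss values are invented for the example.

```python
def early_stopping_epoch(losses, min_delta, patience):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch only counts as an improvement if the loss drops by more than
    min_delta below the best loss seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Invented loss curve: a big drop at first, then small steady gains.
losses = [2.50, 2.00, 1.95, 1.93, 1.92, 1.91, 1.90]

print(early_stopping_epoch(losses, min_delta=0.1, patience=4))
print(early_stopping_epoch(losses, min_delta=0.0, patience=4))
```

With a threshold of 0.1, slow but steady progress is treated as no progress, so training can stop while the loss is still going down.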

In [28]:
history = Anime_qoutes_model.fit(batched_dataset, epochs=22,
                                 callbacks=[model_checkpoint_callback, early_stopping_callback])

Epoch 1/22
  return dispatch_target(*args, **kwargs)
Epoch 2/22
Epoch 3/22
Epoch 4/22
Epoch 5/22
Epoch 6/22
Epoch 7/22
Epoch 8/22
Epoch 9/22
KeyboardInterrupt: ignored

I have a feeling the final model is going to overfit the dataset, so it might be best to stop at the 10th epoch.