<a href="https://colab.research.google.com/github/Harrow-Enigma/ai-lecture-series-summer21/blob/main/GRU_Shakespeare_Generation_(Interactive).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GRU Shakespeare Generation

Train an AI to write like Shakespeare!


Copyright 2021 Team Enigma

In [None]:
# Copyright 2021 Team Enigma

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

#     http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Creating our AI

__SPOILER ALERT__: Implementing a multi-layer recurrent neural network (RNN) from scratch and then training it would take an absurdly long time, well beyond what we can do in a lecture. Therefore, to speed up the process, we will be using _TensorFlow_, a popular machine learning library, and _Keras_, a high-level framework built on top of it.

In [None]:
# Import TensorFlow, Keras, and relevant libraries
import tensorflow as tf
from tensorflow import keras
import os
import numpy as np
import time

### Downloading the data
We are going to teach an AI how to generate realistic-looking text of a certain type based on data in a _.txt_ file. Here are a few prepared styles for you to choose from:


| Text Type  | Description | Address / URL |
| ------------- | ------------- | ------------- |
| Haikus | A form of 3-line poetry, originating from Japan. | [https://raw.githubusercontent.com/PerceptronV/Apollo-Psi/master/Haikus.txt](https://raw.githubusercontent.com/PerceptronV/Apollo-Psi/master/Haikus.txt) |
| Shakespeare | _Everyone knows him_ | [https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt](https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt) |
| Emily Dickinson | American poet. Her works were unconventional and often centred around themes like death or religion. | [https://raw.githubusercontent.com/PerceptronV/Miscellaneous/master/Emily_Dickinson.txt](https://raw.githubusercontent.com/PerceptronV/Miscellaneous/master/Emily_Dickinson.txt) |
| Virgil | __Latin__ full text from the Roman epic poem _The Aeneid_ by Virgil | [https://raw.githubusercontent.com/PerceptronV/Miscellaneous/master/aeneid-lat.txt](https://raw.githubusercontent.com/PerceptronV/Miscellaneous/master/aeneid-lat.txt) |



Choose the type you'd like, copy its link address, then run the next block and paste in the URL.

_(To customise, you can also enter the URL of any text file you like, as long as it is encoded in UTF-8 and is publicly accessible.)_

In [None]:
URL = input("Enter the URL of the text file you'd like to use:\n")

if os.path.exists(URL):
  fp = URL

else:
  if os.path.exists('src.txt'):
    os.remove('src.txt')
  fp = keras.utils.get_file('src.txt', URL, cache_subdir = '/content')

### Reading and tokenizing the text
Before we do anything, we have to load the text in the file you've downloaded. Then we convert all the characters in that text into integers, called _tokens_. The whole process is called _tokenization_.

In [None]:
text = open(fp,'rb').read().decode(encoding='utf-8')  # Reading the file

vocab = sorted(set(text))  # Creating a list of unique characters
TOKEN_SIZE = len(vocab)

# Create a mapping from character to token
get_tok = {u:i for i, u in enumerate(vocab)}

# Create a mapping from token to character
get_char = np.array(vocab)

In [None]:
# Let's visualize the mapping between the first set of tokens and characters
# \n is the newline character

print('token\t----->\tcorresponding character\n')
for i in range(0,7):
  print('{}\t----->\t{}'.format(i, get_char[i].replace('\n', '\\n')))

In [None]:
# Defining helper functions to make tokenizing and decoding easier

def tokenize(s):
  return [get_tok[i] for i in s]

def decode(l):
  ret=''
  for i in l:
    ret+=get_char[i]
  return ret
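To see the round trip on something small, here is a minimal sketch that builds the same kind of character-to-token mapping from a toy string instead of the downloaded file. The names `toy_text`, `toy_vocab`, `toy_tok`, `toy_char`, `toy_tokenize` and `toy_decode` are made up for this illustration and are not used elsewhere in the notebook.

```python
# Toy illustration of character-level tokenization (all names are hypothetical).
toy_text = "to be or not to be"

toy_vocab = sorted(set(toy_text))                     # unique characters, sorted
toy_tok = {ch: i for i, ch in enumerate(toy_vocab)}   # character -> token
toy_char = toy_vocab                                  # token -> character (by index)

def toy_tokenize(s):
    return [toy_tok[ch] for ch in s]

def toy_decode(tokens):
    return ''.join(toy_char[t] for t in tokens)

tokens = toy_tokenize("to be")
print(tokens)               # [6, 4, 0, 1, 2]
print(toy_decode(tokens))   # to be
```

Because the vocabulary is sorted, the space character gets token 0 here and the letters follow in alphabetical order.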

### Creating a dataset
The training process is really simple. We separate the text into sequences of tokens, all of the same length. We then give the model an input token, and train it to predict the most probable next token.

This is like the next-word prediction function on your smartphones; our AI just specializes in the text you're giving it.

In [None]:
# Converting the whole text into tokens and turning it into tf.data.Dataset

token_text = tokenize(text)
token_data = tf.data.Dataset.from_tensor_slices(token_text)

In [None]:
# Breaking the text into sequences

SEQ_LEN = 5   # This is the sequence length. You can tweak this value.

seq_data = token_data.batch(SEQ_LEN, drop_remainder = True)

print('Example sequence:\n')
for i in seq_data.take(1):
  print(decode(i).replace('\n', '\\n'))
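As a rough plain-Python picture of what batching with `drop_remainder = True` does to a token stream, the sketch below cuts a list into equal-length pieces and discards the incomplete tail. The helper `chunk` and the list `toy_tokens` are hypothetical stand-ins, not part of the notebook's pipeline.

```python
# Plain-Python sketch of fixed-length chunking with the remainder dropped.
def chunk(tokens, seq_len):
    n_full = len(tokens) // seq_len   # number of complete sequences
    return [tokens[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

toy_tokens = list(range(12))          # stand-in for a tokenized text
chunks = chunk(toy_tokens, 5)
print(chunks)  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
```

The last two tokens (10 and 11) do not fill a whole sequence, so they are left out.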

In [None]:
BATCH_SZ = 64  # The batch size parameter. It's usually between 64 and 256.

# Creating input-target pairs

def to_inp_targ(seq):
  '''
    Function for turning an individual sequence into input-target pairs
  '''
  return seq[:-1], seq[1:]

# Applying the to_inp_targ function to all sequences, then batching the dataset
dataset = seq_data.map(to_inp_targ)
batch_dataset = dataset.batch(BATCH_SZ, drop_remainder = True)

In [None]:
# Visualizing input-target pairs

for inp_seq, out_seq in dataset.take(1):
  print('Input')
  print(decode(inp_seq).replace('\n', '\\n'))
  print('\nOutput')
  print(decode(out_seq).replace('\n', '\\n'))
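The shift-by-one idea behind the input-target pairs can also be seen on an ordinary string. This is only an illustrative sketch; `split_input_target` is a hypothetical name that mirrors the slicing used above.

```python
# Each input character is paired with the character that follows it.
def split_input_target(seq):
    return seq[:-1], seq[1:]

inp, targ = split_input_target("Hello")
print(inp)    # Hell
print(targ)   # ello

for given, expected in zip(inp, targ):
    print(repr(given), '->', repr(expected))
```

So a sequence of length `SEQ_LEN` yields `SEQ_LEN - 1` prediction examples, one per position.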

### Making the model
We're going to make our own model on top of the `keras.Model` framework.

The model takes in a token, turns it into an embedding vector, then passes it through a Recurrent Neural Network, a GRU in this case. The result then goes through a fully connected layer whose number of neurons equals the token size (i.e. the number of different characters we have in our text), because we want the model to output a score for every possible next token.

During training, after the model predicts the next token, we feed in the correct next token from the data instead of the token the model predicted. This is called teacher forcing, and it makes training faster.

In [None]:
class Model(keras.Model):
  def __init__(self, batch_size, embedding_dim, units, token_size, dropout):
    super(Model,self).__init__(name = 'Model')
    
    # We now define the layers we need.
    # Note: The input layer is only used for declaring the input shape;
    # as you can see, it isn't really referred to below in the call function.
    self.input_layer = keras.Input(batch_shape = (batch_size, None))

    self.embedding_layer = keras.layers.Embedding(token_size, embedding_dim)

    self.rnn_layer = keras.layers.GRU(units, return_sequences = True,
                                      recurrent_initializer = 'glorot_uniform',
                                      dropout = dropout, stateful = True)
    
    self.out = keras.layers.Dense(token_size)


  def call(self, inp):
    # Defining how the input propagates through our neural network
    # The following code has been scrambled. Re-arrange the lines so
    # that the input can propagate in the correct order!
    
    x = self.embedding_layer(inp)
    return self.out(x)
    x = self.rnn_layer(x)

In [None]:
# Model parameters

EMB_DIM = 256  # This is the size of the embedding vector
UNITS = 1024  # This is the number of RNN units in the model

'''
0 <= dropout < 1

Dropout is the fraction of the GRU's inputs that are randomly switched
off (set to zero) at each training step.

I.e.  A dropout of 0 means that nothing is switched off: the model
      uses everything it's given at every moment in the training.

Dropout is used to combat overfitting, which happens when your model
learns the data so well it begins to memorize it.

So, if you find that after training below, the model memorises the
dataset it's given instead of generating anything new, try increasing
the dropout value.
'''
DROPOUT = 0.0
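To give a feel for what a dropout value does, here is a generic sketch of "inverted" dropout on a plain list of numbers. It is an illustration under stated assumptions rather than a description of Keras internals: `toy_dropout` is a hypothetical helper, and the rescaling of surviving values by `1 / (1 - rate)` is the common textbook convention.

```python
import random

def toy_dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale the survivors."""
    if not 0 <= rate < 1:
        raise ValueError("rate must satisfy 0 <= rate < 1")
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)  # seeded so that runs are repeatable
print(toy_dropout([1.0, 1.0, 1.0, 1.0], 0.0, rng))  # nothing is dropped
print(toy_dropout([1.0, 1.0, 1.0, 1.0], 0.5, rng))  # roughly half become 0.0, the rest 2.0
```

With a rate of 0 the values pass through unchanged, which matches the default `DROPOUT = 0.0` above.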

In [None]:
# Initialize the model
model = Model(BATCH_SZ, EMB_DIM, UNITS, TOKEN_SIZE, DROPOUT)

In [None]:
print('Example output shape:')
for i in batch_dataset.take(1):
  out = model(i[0])
  print(out.shape)

In [None]:
model.summary()

### Defining the loss function and optimizer
Since the task is about classifying the most probable next token, we use the cross-entropy loss.
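
For reference, the standard cross-entropy loss for a single prediction can be written as follows. The symbols are introduced here purely for illustration: $V$ is the number of distinct tokens, $z_i$ is the model's raw output score (logit) for token $i$, $\hat{y}_i$ is the resulting predicted probability, and $y_i$ is 1 for the correct next token $c$ and 0 for every other token.

$$
\hat{y}_i = \frac{e^{z_i}}{\sum_{j=1}^{V} e^{z_j}}, \qquad
L = -\sum_{i=1}^{V} y_i \log \hat{y}_i = -\log \hat{y}_c
$$

The loss is small when the model assigns a high probability to the correct token and grows as that probability shrinks.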

---

As for the optimizer, we will use the Adam algorithm to help train our model weights. More information on Adam is [available here](https://arxiv.org/pdf/1412.6980.pdf), but it's essentially just a modified version of gradient descent.

In [None]:
'''
Remark: There's nothing magical about the name
`sparse_categorical_crossentropy`. "Sparse" just means the labels are
integer tokens rather than one-hot vectors, and `from_logits = True`
means the model outputs raw scores rather than probabilities.
The loss function is still calculating the cross-entropy loss.
'''
def loss_function(true, pred):
  return keras.losses.sparse_categorical_crossentropy(true, pred,
                                                      from_logits = True)

'''
Learning rate is the size of each gradient descent step. You can fine-
tune this parameter to make your training more successful.
If the loss decreases too slowly, increase the learning rate. If the
loss drops quickly but rises rapidly after a while, decrease the learning
rate.
'''
LEARNING_RATE = 1e-3
optimizer = tf.keras.optimizers.Adam(LEARNING_RATE)
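To make the loss a little more concrete, the sketch below computes the cross-entropy for a single prediction directly from raw scores, using only the standard library. It is a hand-rolled illustration, not the Keras implementation; the function name and the toy scores are assumptions.

```python
import math

def sparse_cross_entropy_from_logits(label, logits):
    """Cross-entropy for one prediction, given an integer label and raw scores."""
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]  # equals -log(softmax(logits)[label])

uniform = sparse_cross_entropy_from_logits(0, [0.0, 0.0, 0.0])    # model has no preference
confident = sparse_cross_entropy_from_logits(0, [5.0, 0.0, 0.0])  # confident and right
wrong = sparse_cross_entropy_from_logits(1, [5.0, 0.0, 0.0])      # confident and wrong
print(uniform, confident, wrong)
```

With three equally scored tokens the loss is `log(3)`; it falls towards 0 when the correct token gets the highest score and rises sharply when the model is confidently wrong.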

### Training our network
We are going to implement the training loop and gradient descent here.

In [None]:
def train_step(input, labels):
  
  with tf.GradientTape() as tape:
    predictions = model(input)
    loss = loss_function(labels, predictions)
    
    sum_loss = tf.reduce_sum(loss, axis = 1)
  
  # Calculate gradients with respect to model variables
  gradients = tape.gradient(sum_loss, model.trainable_variables)

  # Update weights with the optimizer, based on the gradients
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))

  return tf.reduce_mean(loss)
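The role of the learning rate is easiest to see on a toy problem. The sketch below uses plain gradient descent (not Adam) to minimise the made-up function $f(w) = (w - 3)^2$, whose minimum is at $w = 3$; the function, the helper name and the step count are all assumptions chosen for illustration.

```python
def gradient_descent(learning_rate, steps=20, w=0.0):
    """Minimise f(w) = (w - 3)^2 with plain gradient descent."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)      # derivative of (w - 3)^2
        w -= learning_rate * grad   # step against the gradient
    return w

for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

After 20 steps the small learning rate has only crept part of the way towards 3, the moderate one is close to it, and the large one overshoots further on every step and diverges.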

We're just iterating over the data in the code below. Nothing mysterious.

In [None]:
def train(epochs):

  for i in range(epochs):
    total_loss = []
    start = time.time()

    for e, batch in enumerate(batch_dataset):
      # Due to the way the dataset is structured, each batch includes
      # two components: batch[0] is the inputs, batch[1] is the labels

      step_loss = train_step(batch[0], batch[1])
      total_loss.append(step_loss)

      if (e + 1) % 10 == 0:
        print('Batch: {}\tElapsed time: {}s\tLoss: {}'.format(
            e + 1, time.time()-start, np.mean(np.asarray(total_loss))
        ))
    
    print('> Epoch: {}\tTime: {}s\tLoss: {}\n'.format(
        i+1, time.time()-start, np.mean(np.asarray(total_loss))
    ))

In [None]:
'''
Epochs is the number of full passes over the dataset that we train the
AI for. Generally, the more you train, the better. But there is a point
at which the model starts to overfit, or in other words, starts to model
the data so well it memorises it.

If overfitting happens to you, try reducing the number of epochs or
increasing the dropout parameter above.

If the AI isn't generating meaningful stuff after you've trained it, try
increasing the number of epochs, or changing the model structure altogether.
'''

EPOCHS = 32

### _Actually_ running the __`train`__ function
Behold - the FUN!!!

But, training does take a few minutes. Be patient and watch the loss drop.


In [None]:
train(EPOCHS)

In [None]:
# Let's save our model weights to immortalize its legacy

model.save_weights('weights.h5')

## Text generation
Getting started with our AI!

In [None]:
'''
We first have to instantiate a new model, because we built our
first model with a fixed batch size. However, now we need to
generate text one sample at a time, so we have to build another
model with an input batch size of 1.

It's just a technical detail.
'''

gen_model = Model(1, EMB_DIM, UNITS, TOKEN_SIZE, DROPOUT)
gen_model.build((1, None))
gen_model.load_weights('weights.h5')

### Sampling

The following code feeds a prompt to the model, then samples the next token according to the probabilities the model assigns to each candidate.

Note that before sampling, the model's raw output scores (logits) are divided by a _temperature_. A higher temperature gives tokens with a lower assigned probability a better chance of being sampled, often making the generated text more surprising. As the temperature approaches 0, sampling becomes essentially the same as always selecting the token with the highest probability.

Temperature must be a positive number (it does not have to be an integer). Tweak its value to balance the creativity of the text against its coherence.
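
The effect of temperature can be checked on a handful of made-up scores. This is a minimal standard-library sketch, assuming the scores are converted to probabilities with a softmax; `softmax_with_temperature` and `toy_logits` are hypothetical names.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the scores by the temperature, then turn them into probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]
for t in (0.2, 1.0, 5.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a temperature of 0.2 almost all of the probability sits on the top-scoring token, while at 5.0 the three tokens become nearly equally likely.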

<br/>

### The generation procedure

Once the next token is sampled, it is appended to the original prompt. The extended prompt then serves as the context for the next prediction: the model produces next-token probabilities, we sample another token, append it, and so on. (In the code below, only the newly sampled token is fed back in at each step; the stateful GRU keeps the earlier context in its hidden state.)

This process continues until the model has generated the desired number of tokens, the `generation_length`.

In [None]:
def generate_text(temperature, generation_length, prompt):
    
    # Turning input text into tokens
    input = tf.expand_dims(tokenize(prompt), 0)

    output = []  # Initializing a list to store generated tokens
    gen_model.reset_states()  # Resetting the hidden states of our RNN model

    for i in range(generation_length):
      
      preds = gen_model.predict(input)  # Making a next-token prediction with our model
      preds = tf.squeeze(preds, 0)
      preds = preds / temperature  # Dividing distribution by temperature

      # Sampling from the distribution
      id = tf.random.categorical(preds, num_samples=1)[-1,0].numpy()

      input = tf.expand_dims([id],0)
      output.append(id)
      
    return prompt + decode(output)

In [None]:
# Defining generation parameters

TEMPERATURE = 0.2
GENERATION_LENGTH = 300

### Generation!!

In [None]:
prompt = 'There was once a drunk sailor'
result = generate_text(TEMPERATURE, GENERATION_LENGTH, prompt)

print(result)