<a href="https://colab.research.google.com/github/GiuliaLanzillotta/exercises/blob/master/Writing_like_Dante.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Character-level text prediction
The inspiration for this notebook is drawn from [this beautiful blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).


> #### But what is our goal today? 
We’ll train **RNN character-level language models**. That is, we’ll give the RNN a huge chunk of text and ask it to model the probability distribution of the next character in the sequence given a sequence of previous characters. This will then allow us to generate new text one character at a time.
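To make "the probability distribution of the next character" concrete, here is a minimal sketch that estimates it by simply counting which character follows which in a toy string. This is not the RNN we train below, only a count-based stand-in that conditions on a single previous character; the function name `next_char_distribution` and the toy text are illustrative assumptions.

```python
from collections import Counter, defaultdict

def next_char_distribution(text, context):
    """Estimate P(next char | previous char == context) by counting pairs."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    total = sum(counts[context].values())
    # Normalise the counts into probabilities
    return {ch: n / total for ch, n in counts[context].items()}

toy_text = "nel mezzo del cammin di nostra vita"
dist = next_char_distribution(toy_text, "e")
print(dist)  # 'e' is followed by 'l' twice and by 'z' once
```

The RNN plays the same role, except that it conditions on the whole preceding sequence rather than on one character.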


Let's have some fun!

 
We'll get the data from Project Gutenberg.
> ### About the dataset: 
**Project Gutenberg** offers a large collection of free books that can be retrieved in plain text for a variety of languages.
<br> I have picked the first cantica (*Inferno*) of the **Divina Commedia** by *Dante Alighieri* (Italian version) for this notebook. The full text is available [here](http://www.gutenberg.org/cache/epub/1009/pg1009.txt).








In [0]:
!wget -nv 'http://www.gutenberg.org/cache/epub/1009/pg1009.txt' -O 'divina_commedia.txt'

2020-04-01 09:51:13 URL:http://www.gutenberg.org/cache/epub/1009/pg1009.txt [221280/221280] -> "divina_commedia.txt" [1]


In [0]:
!head -80 'divina_commedia.txt' | tail -20

  Tant è amara che poco è più morte;
  ma per trattar del ben chi vi trovai,
  dirò de laltre cose chi vho scorte.

  Io non so ben ridir com i vintrai,
  tant era pien di sonno a quel punto
  che la verace via abbandonai.

  Ma poi chi fui al piè dun colle giunto,
  là dove terminava quella valle
  che mavea di paura il cor compunto,

  guardai in alto e vidi le sue spalle
  vestite già de raggi del pianeta
  che mena dritto altrui per ogne calle.

  Allor fu la paura un poco queta,
  che nel lago del cor mera durata
  la notte chi passai con tanta pieta.



## Preprocessing 
---
We now have to preprocess this large .txt file so that it can be fed to the model. <br>

Our steps will be: 

    1. Cut off the sections that do not belong to the original text.
    2. Build a vocabulary for our inputs: <br> 
      Since the inputs are single characters, the vocabulary contains every distinct character that appears in the text.
    3. Create the batches: the text, as one long sequence of characters, is split into multiple batches of fixed-length sequences.

### 1. Cut off the added parts 

In [0]:
# Let's cut the parts that were not included in the original text 
!wc -l 'divina_commedia.txt'

6949 divina_commedia.txt


In [0]:
!head -30 'divina_commedia.txt'

In [0]:
!tail -410 'divina_commedia.txt'

In [0]:
# So we have to cut the first 30 lines and roughly the last 410.
# tail -6919 skips the first 30 lines (6949 - 30 = 6919);
# head -6505 then drops the final 414 lines (6919 - 6505 = 414).
!tail -6919 'divina_commedia.txt' | head -6505 > 'divina_commedia_cut.txt'
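The same trimming can be written in Python as a slice over the list of lines. This is only a sketch on a toy list; `trim_lines` is a hypothetical helper, and the real header and footer sizes are the ones used in the shell command above.

```python
def trim_lines(lines, skip_head, keep):
    """Drop the first `skip_head` lines, then keep the next `keep` lines."""
    return lines[skip_head:skip_head + keep]

# Toy stand-in for the real file: 3 header lines, 5 text lines, 2 footer lines
toy = ["header"] * 3 + ["verse {}".format(i) for i in range(5)] + ["footer"] * 2
body = trim_lines(toy, skip_head=3, keep=5)
print(body)
```

For the real file this would correspond to `skip_head=30` and `keep=6505`.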

In [0]:
!head -10 'divina_commedia_cut.txt'
!tail -10 'divina_commedia_cut.txt'








  LA DIVINA COMMEDIA
  di Dante Alighieri

  e sanza cura aver dalcun riposo,

  salimmo sù, el primo e io secondo,
  tanto chi vidi de le cose belle
  che porta l ciel, per un pertugio tondo.

  E quindi uscimmo a riveder le stelle.





### 2. Build the vocabulary 

In [0]:
import numpy as np
import pickle
import os
import collections

In [0]:
# Parameters :
seq_length = 50
batch_size = 128 
encoding = 'utf-8'
seed = np.random.RandomState(42)

In [0]:
# Files' locations :
input_file = './divina_commedia_cut.txt'
vocab_file = './vocab.pkl' # where we'll save the vocabulary 
tensor_file = './data.npy' # where we'll save the encoded text

In [0]:
# The vocabulary will contain all the characters that we can find in the text 
import codecs
with codecs.open(input_file, "r", encoding=encoding) as f:
    data = f.read()
counter = collections.Counter(data)
count_pairs = sorted(counter.items(), key=lambda x: -x[1])
chars, _ = zip(*count_pairs)
vocab_size = len(chars)
vocab = dict(zip(chars, range(len(chars))))

# save the characters (ordered by frequency) using pickle;
# the vocabulary can be rebuilt from this ordering
with open(vocab_file, 'wb') as f:
    pickle.dump(chars, f)
# save the text, encoded as a sequence of vocabulary indices,
# in a numpy vector (tensor)
tensor = np.array(list(map(vocab.get, data)))
np.save(tensor_file, tensor)

In [0]:
print("loaded vocabulary with {} letters".format(vocab_size))

loaded vocabulary with 73 letters


In [0]:
# Inverse vocabulary: maps each index back to its character
vocab_inv = {v: k for k, v in vocab.items()}
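A quick way to convince ourselves that the two dictionaries are consistent is an encode/decode round trip. The sketch below rebuilds both mappings on a toy string in the same way as above (most frequent character gets index 0); `build_vocab` and the `toy_` names are illustrative, not part of the notebook's pipeline.

```python
import collections

def build_vocab(text):
    counter = collections.Counter(text)
    # Sort by descending frequency, so index 0 is the most common character
    chars = [c for c, _ in sorted(counter.items(), key=lambda x: -x[1])]
    char_to_idx = {c: i for i, c in enumerate(chars)}
    idx_to_char = {i: c for c, i in char_to_idx.items()}
    return char_to_idx, idx_to_char

toy_vocab, toy_vocab_inv = build_vocab("la notte")
encoded = [toy_vocab[c] for c in "la notte"]              # characters -> indices
decoded = "".join(toy_vocab_inv[i] for i in encoded)      # indices -> characters
print(encoded, decoded)
```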

### 3. Create the batches 

In [0]:
num_batches = int(tensor.size/(batch_size*seq_length))
tensor = tensor[:num_batches * batch_size * seq_length] # drop the trailing characters that do not fill a whole batch

In [0]:
# We now build input and output tensors: 
# the output tensor (the sequence of characters that have to be predicted)
# is the input tensor shifted by 1 to the left, with the first character
# wrapping around to the end
# Example: 
#   input tensor = "H e l l o"
#   output tensor = " e l l o H"
xdata = tensor # the input tensor 
ydata = np.copy(tensor) # the output tensor 
ydata[:-1] = xdata[1:]
ydata[-1] = xdata[0]
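The shift-with-wrap-around above is the same operation as `np.roll` with a shift of -1. A minimal check on a toy vector (the integers are arbitrary stand-ins for character indices):

```python
import numpy as np

x = np.array([7, 4, 11, 11, 14])   # stand-in indices for "H e l l o"

# Manual version, as in the cell above
y_manual = np.copy(x)
y_manual[:-1] = x[1:]
y_manual[-1] = x[0]

# Equivalent one-liner: shift every element one position to the left
y_rolled = np.roll(x, -1)
print(y_manual, y_rolled)
```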

In [0]:
x_batches = np.split(xdata.reshape(batch_size, -1),num_batches, 1)
y_batches = np.split(ydata.reshape(batch_size, -1),num_batches, 1)
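It is worth seeing what `reshape` followed by `np.split` does to the order of the characters. In the toy sketch below (sizes chosen only for readability), the text is first cut into `batch_size` long rows, and each batch then takes a `seq_length`-wide slice from every row:

```python
import numpy as np

toy_batch_size, toy_seq_length = 2, 3
toy = np.arange(12)                                   # 12 "characters": 0 .. 11
toy_num_batches = toy.size // (toy_batch_size * toy_seq_length)

rows = toy.reshape(toy_batch_size, -1)                # one long row per batch entry
toy_batches = np.split(rows, toy_num_batches, 1)      # split along the time axis

for k, batch in enumerate(toy_batches):
    print(k, batch.tolist())
```

So entry `i` of batch `k + 1` continues the text of entry `i` of batch `k`, while different entries of the same batch come from distant parts of the text.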

In [0]:
# A quick look at the batches, to make sure everything matches our expectations 
b = x_batches[2]
t = y_batches[2]
print('total of {} batches of shape: {}'.format(len(x_batches), b.shape))
print('content of batch 2, entry 0, time steps 0 to 9')
print('input : {}'.format(b[0, :10]))
print('target: {}'.format(t[0, :10]))

total of 30 batches of shape: (128, 50)
content of batch 2, entry 0, time steps 0 to 9
input : [ 2  5  8  4  0 39 11 12 11 12]
target: [ 5  8  4  0 39 11 12 11 12 11]


In [0]:
print('input : {}'.format([vocab_inv[i] for i in b[0, :20]]))
print('target: {}'.format([vocab_inv[i] for i in t[0, :20]]))

input : ['a', 'n', 't', 'o', ' ', 'I', '\r', '\n', '\r', '\n', '\r', '\n', ' ', ' ', 'N', 'e', 'l', ' ', 'm', 'e']
target: ['n', 't', 'o', ' ', 'I', '\r', '\n', '\r', '\n', '\r', '\n', ' ', ' ', 'N', 'e', 'l', ' ', 'm', 'e', 'z']


In [0]:
# Helper function: this will be useful to scan the batches 
def next_batch(x_batches, y_batches, pointer):
  """ 
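  Returns the batch at the current position and advances the pointer. 
  Parameters: 
    - x_batches: list of numpy arrays, each of shape [batch_size, seq_length]
    - y_batches: list of numpy arrays, each of shape [batch_size, seq_length]
    - pointer: int 
        Index of the batch to return. 
  Returns: 
    - x: numpy array of shape [batch_size, seq_length]
    - y: numpy array of shape [batch_size, seq_length]
    - pointer : int
        New position of the pointer 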
  """
  x, y = x_batches[pointer], y_batches[pointer]
  pointer += 1
  return x, y, pointer
pointer = 0

In [0]:
# Helper function to reshuffle the batches. 
# To use when starting a new epoch
def reshuffle(x_batches, y_batches):
  """
  Permutes the order of the input and output batches.
  """
  idx = seed.permutation(len(x_batches))
  x_batches = [x_batches[i] for i in idx]
  y_batches = [y_batches[i] for i in idx]
  return x_batches, y_batches

## Model
---
In this section we will build and train an *LSTM* from scratch using TensorFlow 1.x APIs. <br>
The steps are the following: 

    1. Build the model. 
    2. Set the loss and the optimizer.
    3. Train the model for a few steps.
    4. Save the model. 

### Build the model

In [0]:
# Hyperparameters
learning_rate = 1e-3
hidden_size = 256 #Size of one LSTM hidden layer
num_layers = 2 #How many LSTM layers to use
print_every_steps = 20 #How often to print progress to the console
log_dir= "/tmp/tensorflow/divina_rnn/logs" #Where to store summaries and checkpoints


In [0]:
%tensorflow_version 1.x
import tensorflow as tf
from tensorflow.python.util import deprecation
deprecation._PRINT_DEPRECATION_WARNINGS = False
def rnn_lstm(inputs, seq_lengths, hidden_size=hidden_size, num_layers=num_layers):
    """
    Builds an RNN with LSTM cells.
    Parameters
    -----------
    - inputs: tensor
        The input tensor to the RNN, of shape `[batch_size, seq_length]`,
        holding integer character indices.
    - seq_lengths: 
        Tensor of shape `[batch_size]` specifying the total number 
        of time steps per sequence.
    - hidden_size: int
        The number of units for each LSTM cell.
    - num_layers: int
        The number of stacked LSTM layers (one cell per layer).
    Returns 
    ------------
    The initial state, final state, predicted logits and probabilities.
    """
    # one-hot encoding of the inputs
    # the resulting shape is `[batch_size, seq_length, vocab_size]`
    input_one_hot = tf.one_hot(inputs, vocab_size, axis=-1)
    
    # create a list of all LSTM cells we want
    cells = [tf.contrib.rnn.LSTMCell(hidden_size) for _ in range(num_layers)]
    
    # we stack the cells together and create one big RNN cell
    cell = tf.contrib.rnn.MultiRNNCell(cells)
    
    # we need to set an initial state for the cells
    batch_size = tf.shape(inputs)[0]
    initial_state = cell.zero_state(batch_size, dtype=tf.float32)
    
    # now we are ready to unroll the graph
    # outputs has shape [batch_size, seq_length, hidden_size]
    outputs, final_state = tf.nn.dynamic_rnn(cell=cell,
                                             initial_state=initial_state,
                                             inputs=input_one_hot,
                                             sequence_length=seq_lengths)


    # Mapping the output to the vocabulary space with a dense layer 
    # FLATTENING: 
    max_seq_length = tf.shape(inputs)[1] 
    outputs_flat = tf.reshape(outputs, [-1, hidden_size]) # [batch_size*seq_length, hidden_size]
    
    # dense layer: hidden_size -> vocab_size 
    weights = tf.Variable(tf.truncated_normal([hidden_size, vocab_size], stddev=0.1))
    bias = tf.Variable(tf.constant(0.1, shape=[vocab_size]))
    logits_flat = tf.matmul(outputs_flat, weights) + bias
    
    # reshape back
    logits = tf.reshape(logits_flat, [batch_size, max_seq_length, vocab_size])
    
    # activate to turn logits into probabilities
    probs = tf.nn.softmax(logits)
    
    # we return the initial and final states because this will be useful later
    return initial_state, final_state, logits, probs
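The flatten, dense, reshape sequence at the end of `rnn_lstm` can look like a detour, so here is a NumPy sketch of the shapes involved. It uses random numbers in place of the LSTM outputs and toy sizes (`B`, `T`, `H`, `V` are assumptions for illustration only), and it checks that the result equals applying the dense layer directly on the last axis.

```python
import numpy as np

rng = np.random.RandomState(0)
B, T, H, V = 4, 5, 8, 3                 # toy batch, time, hidden and vocab sizes
outputs = rng.randn(B, T, H)            # stand-in for the LSTM outputs
W = rng.randn(H, V)
b = rng.randn(V)

# Flatten -> dense -> reshape, mirroring the steps in rnn_lstm
logits_a = (outputs.reshape(-1, H) @ W + b).reshape(B, T, V)
# Same dense layer applied directly to the last axis
logits_b = outputs @ W + b

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(logits_a)
print(logits_a.shape, np.allclose(logits_a, logits_b))
```

Each `probs[i, t]` is a distribution over the vocabulary for the character that follows time step `t` of sequence `i`.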

In [0]:
# create input placeholders
with tf.name_scope("input"):
    # shape is `[batch_size, seq_length]`, both are dynamic
    text_input = tf.placeholder(tf.int32, [None, None], name='x-input')
    # shape of target is same as shape of input
    text_target = tf.placeholder(tf.int32, [None, None], name='y-input')
    # sequence length placeholder
    seq_lengths = tf.placeholder(tf.int32, [None], name='seq-lengths')

In [0]:
# build the model
initial_state, final_state, logits, probs = rnn_lstm(text_input,seq_lengths)




### Defining the loss and the optimizer

In [0]:
# define the loss
with tf.name_scope("cross-entropy"):
    # The loss operation: 
    cross_entropy_loss = tf.contrib.seq2seq.sequence_loss( # special loss for sequential output
        logits, text_target,  
        weights=tf.ones_like(text_input, dtype=tf.float32)) # uniform weights: every time step counts equally
    tf.summary.scalar('cross_entropy_loss', cross_entropy_loss)
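With all weights equal to one, and assuming the default averaging over both time steps and batch entries, the loss above reduces to the mean cross-entropy over every character in the batch. The NumPy sketch below is my own re-implementation of that quantity for illustration, not the TensorFlow code; `mean_sequence_cross_entropy` is a hypothetical name.

```python
import numpy as np

def mean_sequence_cross_entropy(logits, targets):
    """Mean over batch and time of -log softmax(logits)[target]."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    b_idx, t_idx = np.indices(targets.shape)
    picked = log_probs[b_idx, t_idx, targets]   # log-probability of each true character
    return -picked.mean()

# A model that predicts a uniform distribution over 73 characters
uniform_logits = np.zeros((2, 4, 73))
toy_targets = np.zeros((2, 4), dtype=int)
baseline = mean_sequence_cross_entropy(uniform_logits, toy_targets)
print(baseline, np.log(73))
```

The uniform baseline, `log(73)`, is a handy reference point: a training loss well below it means the model has learned something about the text.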

In [0]:
# check number of trainable parameters
def count_trainable_parameters():
    """Counts the number of trainable parameters in the current default graph."""
    tot_count = 0
    for v in tf.trainable_variables():
        v_count = 1
        for d in v.get_shape():
            v_count *= d.value
        tot_count += v_count
    return tot_count
print("Number of trainable parameters: {}".format(count_trainable_parameters()))

Number of trainable parameters: 881993
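We can reproduce this number by hand. The sketch below assumes standard LSTM cells (four gates, no peepholes or projection), each gate having a weight matrix over the concatenated input and hidden state plus a bias; `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_size, hidden_size):
    # 4 gates, each with weights over [input, hidden] and one bias vector
    return 4 * ((input_size + hidden_size) * hidden_size + hidden_size)

vocab_size, hidden_size = 73, 256
layer_1 = lstm_params(vocab_size, hidden_size)    # input is the one-hot character
layer_2 = lstm_params(hidden_size, hidden_size)   # input is the output of layer 1
dense = hidden_size * vocab_size + vocab_size     # output projection and its bias
total = layer_1 + layer_2 + dense
print(layer_1, layer_2, dense, total)
```

The total matches the 881993 parameters reported above, and shows that most of them live in the second LSTM layer.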


In [0]:
# create the optimizer
global_step = tf.Variable(1, name='global_step', trainable=False)
with tf.name_scope("train"):
    optim = tf.train.AdamOptimizer(learning_rate)
    params = tf.trainable_variables()
    gradients = tf.gradients(cross_entropy_loss, params)
    # We use gradient clipping to address the exploding gradients issue 
    clipped_gradients, _ = tf.clip_by_global_norm(gradients, 5)
    # Finally the train step operation: 
    train_step = optim.apply_gradients(zip(clipped_gradients, params), global_step=global_step)
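Clipping by global norm rescales all gradients by the same factor whenever their joint norm exceeds the threshold, so their direction is preserved. Here is a NumPy sketch of that rule as I understand it; it is a re-implementation for illustration, not TensorFlow's code.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    # Norm of all gradients taken together, as if they were one long vector
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale is 1 when the norm is already within the threshold
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
new_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm, new_norm)
```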

### Training the model

In [0]:
# Training step helper function 
def do_train_step(num_steps, summary_op, pointer=pointer, x_batches=x_batches, y_batches=y_batches):
    """Perform as many training steps as specified."""
    for i in range(num_steps):
      step = tf.train.global_step(sess, global_step)

      # Get the next batch of data 
      if pointer >= num_batches:
        # Initialise the new epoch 
        pointer = 0
        x_batches, y_batches = reshuffle(x_batches,y_batches)  
      x, y, pointer = next_batch(x_batches,y_batches,pointer)
      feed_dict = {text_input: x, 
                    text_target: y, 
                    seq_lengths: [x.shape[1]]*x.shape[0]}
      
      # Run the optimization over the data and evaluate the loss
      summary, train_loss, _ = sess.run([summary_op, cross_entropy_loss, train_step],
                                        feed_dict=feed_dict)
      
      writer_train.add_summary(summary, step)
      if step % print_every_steps == 0:
          print('[{}] Cross-Entropy Loss Training [{:.3f}]'.format(step, train_loss)) 

In [0]:
# Create the session
sess = tf.InteractiveSession()

# Initialize all variables
sess.run(tf.global_variables_initializer())

summaries_merged = tf.summary.merge_all()
writer_train = tf.summary.FileWriter(log_dir + '/train', sess.graph)

In [0]:
pointer = 0
do_train_step(5000, summaries_merged, pointer)

[1020] Cross-Entropy Loss Training [1.682]
[1040] Cross-Entropy Loss Training [1.640]
[1060] Cross-Entropy Loss Training [1.699]
[1080] Cross-Entropy Loss Training [1.658]
[1100] Cross-Entropy Loss Training [1.652]
[1120] Cross-Entropy Loss Training [1.664]
[1140] Cross-Entropy Loss Training [1.652]
[1160] Cross-Entropy Loss Training [1.644]
[1180] Cross-Entropy Loss Training [1.598]
[1200] Cross-Entropy Loss Training [1.640]
[1220] Cross-Entropy Loss Training [1.626]
[1240] Cross-Entropy Loss Training [1.605]
[1260] Cross-Entropy Loss Training [1.586]
[1280] Cross-Entropy Loss Training [1.615]
[1300] Cross-Entropy Loss Training [1.596]
[1320] Cross-Entropy Loss Training [1.596]
[1340] Cross-Entropy Loss Training [1.563]
[1360] Cross-Entropy Loss Training [1.608]
[1380] Cross-Entropy Loss Training [1.571]
[1400] Cross-Entropy Loss Training [1.581]
[1420] Cross-Entropy Loss Training [1.542]
[1440] Cross-Entropy Loss Training [1.540]
[1460] Cross-Entropy Loss Training [1.552]
[1480] Cros
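To put these numbers in context: the cross-entropy is measured in nats, so its exponential is the per-character perplexity, roughly the number of characters the model is still "hesitating" between. A small sketch, using one of the logged values and the vocabulary size of 73:

```python
import math

vocab_size = 73
uniform_loss = math.log(vocab_size)   # loss of a model that guesses uniformly
logged_loss = 1.540                   # value logged at step 1440 above
perplexity = math.exp(logged_loss)

print('uniform baseline loss: {:.3f}'.format(uniform_loss))
print('perplexity at loss {:.3f}: {:.2f}'.format(logged_loss, perplexity))
```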

### Saving (and loading) the model

In [0]:
saver = tf.train.Saver(var_list=tf.trainable_variables(), max_to_keep=2)
saver.save(sess, os.path.join(log_dir, 'checkpoints', 'model_name'), global_step)

'/tmp/tensorflow/divina_rnn/logs/checkpoints/model_name-6002'

In [0]:
# loading the model
ckpt_path = tf.train.latest_checkpoint(os.path.join(log_dir, 'checkpoints'))
saver.restore(sess, ckpt_path)

INFO:tensorflow:Restoring parameters from /tmp/tensorflow/divina_rnn/logs/checkpoints/model_name-6002


## Generating text 
We will use the model we have just trained to generate new text. 
<br> *How to do this?*
<br> We generate text character by character: at each time step we sample a character from the predicted distribution, feed it back to the model as the next input (together with the LSTM state from the previous step), and repeat the whole process.
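Stripped of all TensorFlow details, the generation loop looks like the sketch below. The function `toy_step` is a hypothetical stand-in for the trained network: it returns a distribution over the next character and an updated state. Here it is deliberately deterministic so that the result is easy to follow.

```python
import numpy as np

def generate(step_fn, first_char, num_steps, rng):
    """Autoregressive loop: sample a character, feed it back, repeat."""
    state, text, current = None, first_char, first_char
    for _ in range(num_steps):
        chars, p, state = step_fn(current, state)
        current = chars[rng.choice(len(chars), p=p)]   # sample the next character
        text += current
    return text

# Hypothetical stand-in model: 'a' is always followed by 'b', and 'b' by 'a'
def toy_step(char, state):
    chars = ["a", "b"]
    p = [0.0, 1.0] if char == "a" else [1.0, 0.0]
    return chars, p, state

generated = generate(toy_step, "a", 5, np.random.RandomState(0))
print(generated)
```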

In [0]:
def sample(prime_text, num_steps, vocab = vocab):
    """
    Sample `num_steps` characters from the model and initialize it with `prime_text`.
    Parameters: 
    ---------------
    - prime_text: str.
        A string that we want to initialize the RNN with.
    - num_steps: int.
        Integer specifying how many characters we want to predict after `prime_text`.
    Returns:
    ---------------
        str
        The `prime_text` plus prediction.
    """
    
    input_prime = [vocab[c] for c in prime_text]
    
    # Feed the prime sequence into the model. 
    feed_dict = {text_input: [input_prime],
                 seq_lengths: [len(input_prime)]}
    state, out_probs = sess.run([final_state, probs], feed_dict=feed_dict)
    
    next_char_probs = out_probs[0, -1] # the output of the model is a probability distr.
    # over all the characters in the vocabulary. We sample from this categorical:
    def weighted_pick(p_dist):
        cs = np.cumsum(p_dist)
        idx = int(np.sum(cs < np.random.rand()))
        # guard against rounding: the cumulative sum may end slightly below 1
        return min(idx, len(p_dist) - 1)
    next_char = weighted_pick(next_char_probs)
    predicted_text = vocab_inv[next_char]
    
    # one character has already been sampled above, so we sample the
    # remaining `num_steps - 1` characters
    for _ in range(num_steps - 1):
        feed_dict = {text_input: [[next_char]],
                     seq_lengths: [1],
                     initial_state: state}
        
        state, out_probs = sess.run([final_state, probs], feed_dict=feed_dict)
        next_char = weighted_pick(out_probs[0, -1])
        predicted_text += vocab_inv[next_char]   
    
    return prime_text + predicted_text
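The `weighted_pick` helper samples from a categorical distribution by comparing a uniform random number with the cumulative sums. A quick empirical check on a toy distribution shows that the sampled frequencies approach the target probabilities. In this sketch the generator is passed explicitly for reproducibility and the index is clamped to the last valid position; both are my additions for the example.

```python
import numpy as np

def weighted_pick(p_dist, rng):
    cs = np.cumsum(p_dist)
    # Count how many cumulative sums lie below a uniform draw in [0, 1)
    idx = int(np.sum(cs < rng.rand()))
    return min(idx, len(p_dist) - 1)

rng = np.random.RandomState(0)
p = np.array([0.1, 0.6, 0.3])
draws = [weighted_pick(p, rng) for _ in range(20000)]
freqs = np.bincount(draws, minlength=3) / len(draws)
print(freqs)
```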

In [0]:
print(sample('Il ', 1000))

Il chi move?».

  E io a lui: «Dor, fosse, tra miggia,
  non li oche giaco, se Cercoresto daicra
  feruto alcundon par de Fuorchi sonno,
  me steso le sue cadëa in avro si tecco.

  Tattir si poccia a corpo par savilsa;
  e chio forse di quanto a terra soscruto,
  mi fu maca scricciatà il focose
  al figuboro e deggea feder puoso.

  Per convien la bercine dol posova
  de lun cheggio, e a le piaggi traddume.

  Per li vede; anoma sonesta,
  chi conaviesti mitri con umanto,
  di cotando vi là dener savermi?

  Quand ionime nuver vosta dieto,
  di tembili a de loro, ov io viva sicisu;
  perch io non li confeani a De chosì forta.

  Onde vi rispavar per laere partia
  con grando amenta, quanda gnata fava:
  volsi per adirgorer, Tu per vinte;
  ma quell altra è malerento mungo,
  piangendo così da ciascuna folse,
  e volse più chïoltro e disse anosa;
  nellon che tra la vertù prea suppe,

  e gridò edi che piangea con cigimutor;
  a quei dise:: 

In [0]:
# cleanup
sess.close()