In [1]:
import numpy as np
import tensorflow as tf

## Purpose
Generate text using a character-based recurrent neural network (RNN). We will use the novel *Great Expectations* by Charles Dickens as training data. Once trained, the network should take a character sequence such as `thousan` and predict the next character in the sequence, `d`. Calling the model repeatedly on the growing sequence produces longer passages of text.
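
The repeated-calling idea can be sketched without a trained network. In the minimal sketch below, `generate` and `toy_predict_next` are hypothetical names introduced only for illustration; `toy_predict_next` is a stand-in that continues a fixed word, not a real model.

```python
def generate(predict_next, seed, n_chars):
    """Extend `seed` by `n_chars` characters, one prediction at a time."""
    sequence = seed
    for _ in range(n_chars):
        # Feed the evolving sequence back in and append the predicted character
        sequence += predict_next(sequence)
    return sequence


def toy_predict_next(sequence, word="thousand"):
    """Hypothetical stand-in for a trained model: continues a fixed word."""
    return word[len(sequence) % len(word)]


result = generate(toy_predict_next, "thousan", 1)
print(result)  # thousand
```

A trained network would take the place of `toy_predict_next`, but the loop around it would have the same shape.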

In [2]:
text_url = 'https://www.gutenberg.org/files/1400/1400-0.txt' # Great Expectations by Charles Dickens
file_path = tf.keras.utils.get_file('1400-0.txt', text_url) # Downloads to cache if it isn't already there

In [3]:
with open(file_path) as fp:
    text = fp.read()

print(f'Length of text: {len(text)} characters')

Length of text: 1013445 characters


The first 824 characters are not part of the book. They are notes and licensing information from Project Gutenberg and shouldn't be part of the training data, so let's remove them.

In [4]:
text = text[824:]

In [5]:
print(text[:300])

Chapter I

My father's family name being Pirrip, and my Christian name Philip, my
infant tongue could make of both names nothing longer or more explicit
than Pip. So, I called myself Pip, and came to be called Pip.

I give Pirrip as my father's family name, on the authority of his
tombstone and my s


Next, let's create a mapping from each distinct character to an integer (and the inverse mapping) so that the characters can be represented as integers.

In [13]:
unique_chars = sorted(set(text)) # Gets distinct values
char_to_int = {char:i for i, char in enumerate(unique_chars)}
int_to_char = {v:k for k, v in char_to_int.items()}

In [7]:
# Sample output
for (k, v), _ in zip(char_to_int.items(), range(10)):
    print(f"{repr(k):4s}: {v}")

'\n': 0
' ' : 1
'!' : 2
'$' : 3
'%' : 4
'&' : 5
"'" : 6
'(' : 7
')' : 8
'*' : 9
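
As a quick check that the two mappings are inverses of each other, a short string can be encoded and then decoded again. This is a minimal sketch on a toy sample; the `toy_`-prefixed names are introduced here only so the notebook's real variables are not overwritten.

```python
toy_sample = "Pip, and came to be called Pip."

# Same construction as above, applied to a toy string
toy_chars = sorted(set(toy_sample))
toy_char_to_int = {char: i for i, char in enumerate(toy_chars)}
toy_int_to_char = {i: char for char, i in toy_char_to_int.items()}

# Encode to integers, then decode back to characters
toy_encoded = [toy_char_to_int[char] for char in toy_sample]
toy_decoded = "".join(toy_int_to_char[i] for i in toy_encoded)

print(toy_decoded == toy_sample)  # True
```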


In [8]:
book_vector = np.array([char_to_int[char] for char in text])

# Sample mapping
print(f"{text[10:27]} ----> {book_vector[10:27]}")


My father's fami ----> [ 0 40 78  1 59 54 73 61 58 71  6 72  1 59 54 66 62]


In [18]:
# Length, in characters, of each input sequence fed to the model
sequence_length = 100
# Each example consumes sequence_length + 1 characters (the input plus its shifted target)
examples_per_epoch = len(text) // (sequence_length + 1)

In [17]:
char_dataset = tf.data.Dataset.from_tensor_slices(book_vector)

# Sanity check
for char in char_dataset.take(8):
    print(int_to_char[char.numpy()])

C
h
a
p
t
e
r
 


In [19]:
# Each chunk holds sequence_length + 1 = 101 characters, so that splitting it
# into input and target yields two sequences of sequence_length characters each
sequences = char_dataset.batch(sequence_length + 1, drop_remainder=True)
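
To see what chunking with `drop_remainder=True` does, here is a minimal NumPy analogue on a ten-element toy vector. It is not the `tf.data` implementation, only an illustration of the same grouping; `toy_vector`, `toy_length` and `toy_chunks` are names made up for this sketch.

```python
import numpy as np

toy_vector = np.arange(10)     # stands in for book_vector
toy_length = 3                 # stands in for sequence_length
chunk_size = toy_length + 1    # one extra element for the shifted target

# Keep only whole chunks; the leftover elements at the end are dropped
n_chunks = len(toy_vector) // chunk_size
toy_chunks = toy_vector[:n_chunks * chunk_size].reshape(n_chunks, chunk_size)

print(toy_chunks)
# [[0 1 2 3]
#  [4 5 6 7]]
```

The last two elements (8 and 9) do not fill a whole chunk, so they are discarded, which keeps every chunk the same length.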

In [20]:
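def split_input_target(chunk):
    """Split a chunk into an input sequence and its target sequence.

    The target is the input shifted one character along, so each input
    character is paired with the character that follows it.

    Example
    -------
    "Hello" -> input "Hell", target "ello"
    """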
    input_text = chunk[:-1]
    target_text = chunk[1:]
    
    return input_text, target_text

dataset = sequences.map(split_input_target)
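
The effect of the split is easiest to see on a plain string. The sketch below applies the same slicing to a toy chunk; `toy_split` is a hypothetical copy of the function above, renamed so it does not shadow it.

```python
def toy_split(chunk):
    """Same slicing as split_input_target, applied to a plain string."""
    return chunk[:-1], chunk[1:]


toy_input, toy_target = toy_split("thousand")

# At each position the model sees one input character and should predict the next one
for step, (input_char, target_char) in enumerate(zip(toy_input, toy_target)):
    print(f"step {step}: input {input_char!r} -> target {target_char!r}")
```

Here the input is `thousan` and the target is `housand`, so the final input character `n` is paired with the target `d`.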