## 3.2. Increasing dataset size

The next thing we'll try is increasing the size of our dataset. In the previous training runs we used a small subset of the book "Don Quijote de la Mancha" containing 169KB of text.

The problem is that what we're really trying to do is teach Spanish to our RNN. And, let's be honest, it's quite hard to learn a language from scratch by reading only 169K characters (a few chapters of a book): we'd pick up some words and maybe even a few sentences, but we wouldn't really learn the language.

To address this, we'll greatly increase the size of the dataset. We'll use the entire "Don Quijote de la Mancha" and append another very famous Spanish novel, "La Regenta" by Leopoldo Alas. Combined, they give us a dataset of about 4MB (more than 20x the previous one). Although this will slow down training a lot, it will very likely bring a big improvement in our results.
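To get a feeling for what "more than 20x" means in practice, here is a rough back-of-the-envelope sketch. It assumes about one byte per character (accented characters may take more, depending on the encoding), uses the approximate sizes quoted above, and the helper name `n_patterns` is just made up for illustration:

```python
# Rough estimate of how the training set grows with the larger corpus.
# Sizes are the approximate figures from the text, assuming ~1 byte per character.
old_chars = 169 * 1000        # ~169KB subset used so far
new_chars = 4 * 1000 * 1000   # ~4MB for both books combined
seq_length = 100              # window length used in this notebook


def n_patterns(n_chars, seq_length):
    """Number of (input window, next char) pairs produced by a stride-1 sliding window."""
    return max(n_chars - seq_length, 0)


growth = new_chars / float(old_chars)
old_patterns = n_patterns(old_chars, seq_length)
new_patterns = n_patterns(new_chars, seq_length)

# After normalization X holds float64 values: 8 bytes each
x_bytes = new_patterns * seq_length * 8

print("Dataset growth: about %.1fx" % growth)
print("Training patterns: %d -> %d" % (old_patterns, new_patterns))
print("Approximate size of X in memory: %.1f GB" % (x_bytes / 1e9))
```

So besides the longer training time, the input array alone may need a few gigabytes of RAM under these assumptions, which is worth keeping in mind before running the cell.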

Let's start with the code:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Using Theano backend.
Using gpu device 0: GeForce GTX 960M (CNMeM is disabled, cuDNN not available)


The next step is to read both books and combine them into a single dataset; then we'll proceed with the usual preprocessing, model definition, and training.
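As a reminder of what the preprocessing does, here is a minimal sketch of the same sliding-window encoding on a toy string. The sample text and the tiny window length are stand-ins chosen for readability, and the one-hot step uses plain NumPy in place of `np_utils.to_categorical`:

```python
import numpy as np

text = "en un lugar de la mancha"   # toy stand-in for the merged books
seq_length = 5                      # tiny window for readability (the notebook uses 100)

# Map each unique character to an integer
chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}
n_vocab = len(chars)

# Slide a window over the text: each window predicts the character that follows it
dataX, dataY = [], []
for i in range(len(text) - seq_length):
    window = text[i:i + seq_length]
    target = text[i + seq_length]
    dataX.append([char_to_int[c] for c in window])
    dataY.append(char_to_int[target])

n_patterns = len(dataX)

# Reshape to [samples, time steps, features] and scale into [0, 1)
X = np.reshape(dataX, (n_patterns, seq_length, 1)) / float(n_vocab)

# One-hot encode the targets with plain NumPy
y = np.zeros((n_patterns, n_vocab))
y[np.arange(n_patterns), dataY] = 1.0

print(X.shape, y.shape)
```

Every character position (except the last `seq_length` ones) yields one training pattern, which is why the number of patterns grows almost one-to-one with the size of the text.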

In [None]:
# Load the books, merge them, and convert the result to lowercase
filename1 = "El ingenioso hidalgo don Quijote de la Mancha.txt"
book1 = open(filename1).read()
filename2 = "La Regenta.txt"
book2 = open(filename2).read()
book = book1 + book2
book = book.lower()

# Create mapping of unique chars to integers, and its reverse
chars = sorted(list(set(book)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

# Summarizing the loaded data
n_chars = len(book)
n_vocab = len(chars)
print("Total Characters: %d" % n_chars)
print("Total Vocab: %d" % n_vocab)

# Prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []

# Iterating over the book
for i in range(0, n_chars - seq_length, 1):
    sequence_in = book[i:i + seq_length]
    sequence_out = book[i + seq_length]
    
    # Converting each char to its corresponding int
    sequence_in_int = [char_to_int[char] for char in sequence_in]
    sequence_out_int = char_to_int[sequence_out]

    # Appending the result to the current data 
    dataX.append(sequence_in_int)
    dataY.append(sequence_out_int)
n_patterns = len(dataX)
print("Total Patterns: %d" % n_patterns)

# Reshaping X to be [samples, time steps, features]
X = np.reshape(dataX, (n_patterns, seq_length, 1))
# Normalizing
X = X / float(n_vocab)
# One hot encode the output variable
y = np_utils.to_categorical(dataY)

# Define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))

# Starting from a checkpoint (if we set one)
checkpoint_file = ""  # path to a saved weights file; leave empty to train from scratch
if checkpoint_file:
    model.load_weights(checkpoint_file)

# Amount of epochs that we still have to run
epochs_run = 0
epochs_left = 50 - epochs_run

# Define the checkpoints structure
filepath="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# Compiling the model
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Fitting the model
model.fit(X, y, nb_epoch=epochs_left, batch_size=64, callbacks=callbacks_list)

(We won't see the results here because I actually ran this code on another machine rather than directly in the notebook; as you can imagine, it takes a loooooong time.)
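Since a run this long will probably be interrupted and resumed a few times, it can help to pick the checkpoint file automatically. The following is only a sketch: the helper names `parse_checkpoint` and `best_checkpoint` and the example filenames are made up for illustration, and it relies solely on the filename pattern used above. Whether the epoch number in the filename is zero- or one-based depends on the Keras version, so check that before deriving `epochs_run` from it.

```python
import re

# Filename pattern written by ModelCheckpoint in this notebook:
# "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
CHECKPOINT_RE = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")


def parse_checkpoint(filename):
    """Return (epoch_number, loss) from a checkpoint filename, or None if it doesn't match."""
    match = CHECKPOINT_RE.search(filename)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


def best_checkpoint(filenames):
    """Return the matching filename with the lowest recorded loss, or None if none match."""
    parsed = [(parse_checkpoint(f), f) for f in filenames]
    parsed = [(info, f) for info, f in parsed if info is not None]
    if not parsed:
        return None
    return min(parsed, key=lambda item: item[0][1])[1]


# Hypothetical filenames, for illustration only
example_files = [
    "weights-improvement-00-2.9000.hdf5",
    "weights-improvement-07-1.8500.hdf5",
    "notes.txt",
]
latest = best_checkpoint(example_files)
epoch_number, loss = parse_checkpoint(latest)
print(latest, epoch_number, loss)
```

In the notebook, the selected filename would be the value to pass to `model.load_weights`, with the list of candidates coming from something like `os.listdir(".")`.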