# Character-by-character text generation using a language model
![title](pics/text_generation.png)

In [1]:
# Imports

#keras imports
import keras
from keras import layers

#general imports
from IPython.display import display, Markdown #just to display markdown
import random
import numpy as np
import sys

Using TensorFlow backend.


## Downloading and parsing our initial text file

In [2]:
path = keras.utils.get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
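# Read the corpus with an explicit encoding and close the file afterwards
with open(path, encoding='utf-8') as f:
    text = f.read().lower()

print("Corpus length:" + str(len(text)))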

display(Markdown("### Initial text:"))
print(text[:500]+"\n")

Downloading data from https://s3.amazonaws.com/text-datasets/nietzsche.txt
Corpus length:600893


### Initial text:

preface


supposing that truth is a woman--what then? is there not ground
for suspecting that all philosophers, in so far as they have been
dogmatists, have failed to understand women--that the terrible
seriousness and clumsy importunity with which they have usually paid
their addresses to truth, have been unskilled and unseemly methods for
winning a woman? certainly she has never allowed herself to be won; and
at present every kind of dogma stands with sad and discouraged mien--if,
indeed, it s



## Vectorizing partially-overlapping sequences of characters

In [3]:
# Length of extracted character sequences
maxlen = 60
# We sample a new sequence every `step` characters
step = 3
# This holds our extracted sequences
sentences = []
# This holds the targets (the follow-up characters)
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))
# List of unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping unique characters to their index in `chars`
char_indices = {char: index for index, char in enumerate(chars)}
# Next, one-hot encode the characters into binary arrays.
print('Vectorization...')
# `np.bool` was removed in recent NumPy releases; the built-in `bool` is equivalent here
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

    
print ("\nafter data encoding:")
print('encoded text shape:', x.shape)
print('encoded target shape:', y.shape, "\n")
 
print("\nFirst 5 data samples & targets\n")


print ("text sentence")
print (sentences[:5])

print ("\ncorresponding characters to predict")
print(next_chars[:5])

Number of sequences: 200278
Unique characters: 57
Vectorization...

after data encoding:
encoded text shape: (200278, 60, 57)
encoded target shape: (200278, 57) 


First 5 data samples & targets

text sentence
['preface\n\n\nsupposing that truth is a woman--what then? is the', 'face\n\n\nsupposing that truth is a woman--what then? is there ', 'e\n\n\nsupposing that truth is a woman--what then? is there not', '\nsupposing that truth is a woman--what then? is there not gr', 'pposing that truth is a woman--what then? is there not groun']

corresponding characters to predict
['r', 'n', ' ', 'o', 'd']
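
To make the windowing and one-hot shapes easier to follow, here is a minimal sketch of the same procedure on a toy string. The names `toy_text`, `toy_maxlen`, `toy_step`, and the other `toy_*` variables are illustrative assumptions and are not part of the notebook; the logic follows the vectorization cell above.

```python
import numpy as np

# Toy stand-ins for `text`, `maxlen` and `step` (illustrative values only)
toy_text = "hello world"
toy_maxlen = 4
toy_step = 3

# Slide a window of `toy_maxlen` characters, moving `toy_step` characters each time
windows, targets = [], []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i:i + toy_maxlen])
    targets.append(toy_text[i + toy_maxlen])

# Build the vocabulary and the character -> index lookup
toy_chars = sorted(set(toy_text))
toy_index = {c: i for i, c in enumerate(toy_chars)}

# One-hot encode: (num_windows, window_length, vocab_size) and (num_windows, vocab_size)
toy_x = np.zeros((len(windows), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(windows):
    for t, c in enumerate(window):
        toy_x[i, t, toy_index[c]] = True
    toy_y[i, toy_index[targets[i]]] = True

print(windows, targets)
print(toy_x.shape, toy_y.shape)
```

Each time step of each window has exactly one active entry, which is why the real arrays have shapes `(200278, 60, 57)` and `(200278, 57)`.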


## A single-layer LSTM model for next-character prediction

In [4]:
print("Starting model architecture development")
model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))
optimizer = keras.optimizers.RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()

Starting model architecture development
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (None, 128)               95232     
_________________________________________________________________
dense_1 (Dense)              (None, 57)                7353      
=================================================================
Total params: 102,585
Trainable params: 102,585
Non-trainable params: 0
_________________________________________________________________
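
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with an input kernel, a recurrent kernel, and a bias) and a dense layer with a bias; the helper names `lstm_param_count` and `dense_param_count` are hypothetical and not part of Keras.

```python
def lstm_param_count(input_dim, units):
    # 4 gates, each with (input_dim x units) + (units x units) weights and `units` biases
    return 4 * ((input_dim + units) * units + units)


def dense_param_count(input_dim, units):
    # One weight per input/output pair, plus one bias per output unit
    return input_dim * units + units


lstm_params = lstm_param_count(input_dim=57, units=128)
dense_params = dense_param_count(input_dim=128, units=57)
print(lstm_params, dense_params, lstm_params + dense_params)
```

With 57 unique characters and 128 LSTM units this reproduces the 95232, 7353, and 102,585 figures reported by `model.summary()`.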


## Training the language model and sampling from it
Given a trained model and a seed text snippet, we generate new text by repeatedly:
- Drawing from the model a probability distribution over the next character, given the text available so far
- Reweighting the distribution to a certain "temperature"
- Sampling the next character at random according to the reweighted distribution
- Adding the new character at the end of the available text
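
As a rough illustration of the reweighting step, the sketch below applies the same log, divide, exponentiate, normalize recipe to a made-up three-character distribution. The names `reweight` and `toy_probs` are assumptions for this example only, and subtracting the maximum before exponentiating is an added numerical-stability choice that does not change the result.

```python
import numpy as np


def reweight(probs, temperature):
    """Sharpen (temperature < 1) or flatten (temperature > 1) a distribution."""
    probs = np.asarray(probs, dtype="float64")
    logits = np.log(probs) / temperature
    # Subtracting the max is for numerical stability; it cancels in the ratio
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


# Hypothetical next-character probabilities (not model output)
toy_probs = [0.5, 0.3, 0.2]
for temperature in [0.2, 1.0, 1.2]:
    print(temperature, np.round(reweight(toy_probs, temperature), 3))
```

Low temperatures concentrate almost all probability on the most likely character, which gives repetitive but safe text; a temperature of 1.0 leaves the distribution unchanged; higher temperatures flatten it and make surprising characters more likely.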

In [5]:
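def sample(preds, temperature=1.0):
    """Draw one character index from `preds` after temperature reweighting."""
    preds = np.asarray(preds).astype('float64')
    # Scale the log-probabilities: low temperature sharpens, high temperature flattens
    preds = np.log(preds) / temperature
    # Exponentiate and renormalize so the values sum to 1 again
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # Draw a single one-hot sample and return the index of the chosen character
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)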

### The text generation loop

In [12]:
for epoch in range(1, 10):
    print('epoch', epoch)
    # Fit the model for 1 epoch on the available training data
    model.fit(x, y,
              batch_size=2048,
              epochs=1)
    # Select a text seed at random
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated_text = text[start_index: start_index + maxlen]
    print('\n--- Generating with seed: "' + generated_text + '"')
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('\n------ temperature:', temperature)
        sys.stdout.write(generated_text)
        # We generate 400 characters
        for i in range(400):
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.
            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]
            generated_text += next_char
            generated_text = generated_text[1:]
            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

epoch 1
Epoch 1/1

--- Generating with seed: "l be relegated to the
physiological sciences and to the hist"

------ temperature: 0.2
l be relegated to the
physiological sciences and to the history of the same are something that the superficial the sense and all the contrary and suffering the sense of the sense of the same the spirit of the sense of the same as the sense of the strength of the sense of the superficial the spirit of the same as a soul and the superficial and the superficial the sense of the superior of the end the superficial and as a soul and the strength of the spirit o

------ temperature: 0.5
e superficial and as a soul and the strength of the spirit of the higher in the fact is not one of the extent as a man is the same will be a man is not can no longer the super-al enough to character in the bad are consequently always the beguner man and the devil. for his injustial should have no right more is a will pridest the spirit of the fact, the father of the higher to co


appearly: formher inclidains time wade; nature as bring, heartakes
enders over expedience of ruling imse.--palger's sporl, how ascerpecteduage
emmanicism, for order
of thus
becanse" of seriousnes un the case," herethis pro
epoch 6
Epoch 1/1

--- Generating with seed: " the principal causes which have kept the type of
"man" upon"

------ temperature: 0.2
 the principal causes which have kept the type of
"man" upon the constraint to the fact that the earth and something which has not the spirit and there is not to be and are there is not to be and desires the sense of the same to the strength and superstition of the strength of the sense of the superiority of the sense of the spirit and still relation of the spirit and there is not to be and something which we conscience of the superiority of the sense of t

------ temperature: 0.5
ing which we conscience of the superiority of the sense of the constant science, as they are world for a thinking is a men of the more strong every word somet