# Text Generation with Neural Networks

Import necessary packages for preprocessing, model building, etc. We follow the steps described in the theoretical part of this summer school as follows:

0. Define Research Goal (already done)
1. Retrieve Data
2. Prepare Data
3. Explore Data
4. Model Data
5. Present and automate Model

In [45]:
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils import get_file
from keras import backend as K
import numpy as np
import random
import sys
import io

# 1. Retrieve Data

Load your data! You can pick up data from anywhere: plain text, HTML, source code, etc.
You can either download it automatically with the Keras `get_file` function or download it manually and import it into this notebook.
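If you download the file manually, reading it could look like the sketch below. The helper name `load_corpus` and the temporary demo file are assumptions made for illustration; only the UTF-8 decoding and the lowercasing mirror what this notebook does.

```python
import io
import os
import tempfile


def load_corpus(path, encoding="utf-8"):
    """Read a text file and return its content in lowercase."""
    with io.open(path, encoding=encoding) as f:
        return f.read().lower()


# Demo with a throwaway file standing in for a manually downloaded corpus.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with io.open(demo_path, "w", encoding="utf-8") as f:
        f.write("Hallo. ")
    demo_text = load_corpus(demo_path)

print("corpus length:", len(demo_text))
```

For your own data, replace `demo_path` with the location of the downloaded file.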

## Example Data Set
[trump.txt](https://raw.githubusercontent.com/harshilkamdar/trump-tweets/master/trump.txt)

In [46]:
path = get_file('trump.txt', origin='https://raw.githubusercontent.com/harshilkamdar/trump-tweets/master/trump.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()

print('corpus length:', len(text))

corpus length: 2264441


# 2. Prepare Data

As described in the theoretical part of this workshop, we need to convert our text into a numerical representation (here: one-hot-encoded characters) that can be processed by the neural network we define later.


## 2.1. Create Classes 
The goal of this step is to have a variable which contains the distinct characters of the text. Characters can be letters, digits, punctuation marks, new lines, spaces, etc.

### Example:
Let's assume we have the following text as input: "hallo. "

After the following step, we want to have all distinct characters, i.e.:

``[ "h", "a", "l", "o", ".", " " ] ``
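As a minimal sketch, the example can be reproduced with a set and a sort, the same two operations applied to the full corpus in this notebook. The variable names are chosen only for this illustration. Note that sorting orders the characters by code point, so the space and the period come before the letters.

```python
example_text = "hallo. "

# A set removes duplicates (the second "l"); sorting gives a stable order.
example_chars = sorted(set(example_text))

print(example_chars)
print("total chars:", len(example_chars))
```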


In [47]:
chars = sorted(list(set(text)))
print('total chars:', len(chars))

total chars: 176


## 2.2. Create Training Set

In this section we create our training set from the text. The idea is to map a sequence of characters to a class, where a class is one of the distinct characters defined in the previous task. In other words, each sequence of characters is paired with the character that follows it, so that the model can later learn which characters tend to come after specific sequences. The sequence length is a free parameter, so try out different sequence lengths.

### Example:
Our text is still: "hallo. "
Sequence length: 2 (i.e. 2 characters predict the next character)

The result (training set) should be defined as follows:

``
Sequences --> Class
 "ha"    --> "l"
 "al"    --> "l"
 "ll"    --> "o"
 "lo"    --> "."
 "o."    --> " "
``

You can read the previous example like this: sequence "ha" predicts the next character "l", sequence "al" predicts the next character "l", and so on.
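The toy example can be generated with a small sliding-window helper. This is a sketch; the function name `make_sequences` is an assumption and does not appear elsewhere in the notebook.

```python
def make_sequences(text, seqlen, step=1):
    """Pair every window of `seqlen` characters with the character after it."""
    sequences, next_chars = [], []
    for i in range(0, len(text) - seqlen, step):
        sequences.append(text[i:i + seqlen])
        next_chars.append(text[i + seqlen])
    return sequences, next_chars


example_sequences, example_classes = make_sequences("hallo. ", seqlen=2)
for seq, cls in zip(example_sequences, example_classes):
    print(repr(seq), "-->", repr(cls))
```

A larger `step` skips windows and therefore produces fewer, less overlapping training pairs.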

In [4]:
seqlen = 30 # Sequence length parameter
step = 5   # Number of characters by which the window is shifted through the text
sequences = []  # List of sequences
char_class = [] # Corresponding class of each sequence

for i in range(0, len(text) - seqlen, step):
    sequences.append(text[i: i + seqlen])
    char_class.append(text[i + seqlen])
print('#no sequences:', len(sequences))

#no sequences: 452883
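The number of sequences follows directly from the corpus length, the sequence length and the step. A quick sanity check, using the values printed in this notebook:

```python
corpus_length = 2264441  # printed as "corpus length" above
seqlen = 30
step = 5

# The loop visits the start positions 0, 5, 10, ... below corpus_length - seqlen.
n_sequences = len(range(0, corpus_length - seqlen, step))
print("#no sequences:", n_sequences)
```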


## 2.3. Check your Data

Now that we processed our data, it's time to understand what we have built so far.

In [5]:
for sequence, next_char in zip(sequences[:10], char_class[:10]):
    print(sequence, ":", next_char)

.@jebbush was terrible on face :  
bush was terrible on face the  : n
was terrible on face the natio : n
errible on face the nation tod : a
le on face the nation today. b : e
 face the nation today. being  : a
 the nation today. being at 2% :  
nation today. being at 2% and  : f
n today. being at 2% and falli : n
ay. being at 2% and falling se : e


In [6]:
# Print from 1st to 10th character 
chars[:10]

['\n', ' ', '!', '"', '#', '$', '%', '&', "'", '(']

In [7]:
# Print the characters at indices 150 to 159 :-)
chars[150:160]

['🔅', '😁', '😂', '😃', '😄', '😆', '😇', '😈', '😉', '😊']

## 2.4. Vectorization of Training Sequences

The following section describes the desired form of our final training set. 

text: "hallo. ".
As defined above, we have a number of sequences, each mapping to the next character in the text (e.g. "ha" maps to "l"). First, we transform each sequence into a one-hot-encoded matrix with one row per character of the sequence and one column per distinct character.

**Example:** 
sequence "ha" maps to the following matrix

|     |  h  |  a  |  l  |  o  |  .  | ' ' |
|-----|-----|-----|-----|-----|-----|-----|
|  h  |  1  |  0  |  0  |  0  |  0  |  0  |
|  a  |  0  |  1  |  0  |  0  |  0  |  0  |

next sequence "al" maps to the following matrix

|     |  h  |  a  |  l  |  o  |  .  | ' ' |
|-----|-----|-----|-----|-----|-----|-----|
|  a  |  0  |  1  |  0  |  0  |  0  |  0  |
|  l  |  0  |  0  |  1  |  0  |  0  |  0  |

... And so on

## 2.5. Vectorization of Target Classes

We build our target classes in the same way as the training set. We need a one-hot-encoded vector for each target (which is a single character).

**Example:** for target char "l" the vector looks like this

|     |  h  |  a  |  l  |  o  |  .  | ' ' |
|-----|-----|-----|-----|-----|-----|-----|
|  l  |  0  |  0  |  1  |  0  |  0  |  0  |
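The tables above can be reproduced with plain Python lists. This is a sketch for the toy text only; the names `alphabet`, `one_hot` and `encode_sequence` are assumptions made for this illustration, and the column order follows the tables rather than a sorted character list.

```python
alphabet = ["h", "a", "l", "o", ".", " "]  # column order of the tables above
index_of = {char: i for i, char in enumerate(alphabet)}


def one_hot(char):
    """Return a vector with a single 1 at the position of `char`."""
    row = [0] * len(alphabet)
    row[index_of[char]] = 1
    return row


def encode_sequence(sequence):
    """Return one one-hot row per character of the sequence."""
    return [one_hot(char) for char in sequence]


print(encode_sequence("ha"))  # matrix for the sequence "ha"
print(one_hot("l"))           # vector for the target character "l"
```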

In [8]:
# Indexed characters as dictionary
char_indices = dict((c, i) for i, c in enumerate(chars))

# Both matrices are initialized with zeros (False).
# Note: the alias np.bool was removed in NumPy 1.24, so the builtin bool is used.
training_set = np.zeros((len(sequences), seqlen, len(chars)), dtype=bool)
target_char = np.zeros((len(sequences), len(chars)), dtype=bool)
for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        training_set[i, t, char_indices[char]] = 1
    target_char[i, char_indices[char_class[i]]] = 1

# 3. Explore Data

In [9]:
# Let's check the shape of the training_set

training_set.shape

(452883, 30, 176)

Output: (x, y, z)

    x = number of training sequences
    y = sequence length (window size used to predict the next character)
    z = number of distinct characters in the text (for one-hot encoding)
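The shape also tells us how much memory the training set needs. A rough estimate, assuming the array is stored as a boolean NumPy array, which uses one byte per element:

```python
n_sequences, window, n_chars = 452883, 30, 176  # shape printed above

n_elements = n_sequences * window * n_chars
n_bytes = n_elements * 1  # assumption: one byte per boolean element

print(n_bytes, "bytes, roughly", round(n_bytes / 1e9, 2), "GB")
```

If this does not fit into memory on your machine, a larger `step` or a shorter sequence length reduces the size.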

In [10]:
# Let's check the shape of the target_char (act as our target classes)

target_char.shape

(452883, 176)

Output: (x, y)

    x = number of training sequences
    y = number of distinct characters (one-hot encoding of the character that follows each sequence)


# 4. Model Data

Let's get down to business! Create your own model!

First of all check how to configure a model in Keras (see [keras doc](https://keras.io/models/about-keras-models/#about-keras-models)) 

In [51]:
# TODO Create your model in a variable called "model"

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_2 (LSTM)                (None, 128)               156160    
_________________________________________________________________
dense_2 (Dense)              (None, 176)               22704     
_________________________________________________________________
activation_2 (Activation)    (None, 176)               0         
Total params: 178,864
Trainable params: 178,864
Non-trainable params: 0
_________________________________________________________________


In [37]:
def getNextCharIdx(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
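To see what the temperature does, the rescaling step can be looked at in isolation. The following is a pure-Python sketch of the same idea (divide the log-probabilities by the temperature, then renormalize); the function names are assumptions, and every probability is assumed to be greater than zero so that the logarithm is defined.

```python
import math
import random


def apply_temperature(probs, temperature=1.0):
    """Rescale a probability distribution; assumes every entry is > 0."""
    scaled = [math.log(p) / temperature for p in probs]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(probs, temperature=1.0):
    """Draw one index according to the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, temperature)
    return random.choices(range(len(adjusted)), weights=adjusted, k=1)[0]


example_probs = [0.5, 0.3, 0.2]
for temperature in [1.0, 0.5, 0.1]:
    adjusted = apply_temperature(example_probs, temperature)
    print(temperature, [round(p, 3) for p in adjusted])
```

A temperature of 1.0 leaves the distribution unchanged. Lower temperatures concentrate the probability mass on the most likely character, so the generated text becomes more conservative and more repetitive.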

In [43]:
# Creation of reverse char index, to get the char for the predicted class
indices_char = dict((i, c) for i, c in enumerate(chars))

def on_epoch_end(epoch, logs):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    start_index = random.randint(0, len(text) - seqlen - 1)
    for diversity in [1, 0.1, 0.5]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + seqlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(280):
            x_pred = np.zeros((1, seqlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = getNextCharIdx(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
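The core of the generation loop is the sliding window: after each prediction the oldest character is dropped and the new one is appended. The sketch below isolates that mechanism with a hypothetical stand-in for the model (`dummy_next_char`, which simply repeats the first character of the window), so it runs without a trained network.

```python
def dummy_next_char(window):
    # Hypothetical stand-in for the model: repeats the window's first character.
    return window[0]


def generate(seed, n_chars, predict_next=dummy_next_char):
    """Extend `seed` by `n_chars` characters, one prediction at a time."""
    window = seed
    generated = seed
    for _ in range(n_chars):
        next_char = predict_next(window)
        generated += next_char
        window = window[1:] + next_char  # drop the oldest, append the newest
    return generated


print(generate("abc", 3))
```

In the callback above, the role of `predict_next` is played by one-hot encoding the window, calling `model.predict` and sampling with `getNextCharIdx`.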

In [44]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

# 5. Evaluate Model

We are now at the sweet part of the task! Let's fit our model and see what it prints!

In [50]:
model.fit(training_set, target_char,
          batch_size=128,
          epochs=50,
          callbacks=[print_callback])

Epoch 1/50
 16256/452883 [>.............................] - ETA: 5:05 - loss: 3.0575

KeyboardInterrupt: 
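To get a feeling for the training time, it helps to know how many weight updates one epoch contains. A small calculation, assuming that the last, incomplete batch is processed as its own step:

```python
import math

n_sequences = 452883  # number of training sequences
batch_size = 128      # as passed to model.fit

batches_per_epoch = math.ceil(n_sequences / batch_size)
last_batch_size = n_sequences - (batches_per_epoch - 1) * batch_size

# The interrupted run above had processed 16256 samples.
batches_done = 16256 // batch_size

print("batches per epoch:", batches_per_epoch)
print("size of the last batch:", last_batch_size)
print("batches finished before the interrupt:", batches_done)
```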