![alt text](https://www.msengineering.ch/typo3conf/ext/msengineering/Resources/Public/Images/Logo/mse-full.svg "MSE Logo") 

# AnTeDe Practical Work 8: Name Generation with RNN

by Fabian Märki

## Summary
The aim of this lab is to get an understanding of building an RNN model using Keras. The task is to train a character-level language model that generates new baby names (but feel free to change this to, e.g., new start-up names or city names).
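
To make the idea of a character-level language model concrete, the following minimal sketch shows how a single name can be turned into (input window, next character) training pairs. It follows the padding scheme used by `load_data` further below ('#' at the start, '*' at the end), but `make_training_pairs` is a hypothetical helper written for illustration, and the window length of 3 is chosen only to keep the output short.

```python
def make_training_pairs(name, length_of_sequence=3, padding_start="#", padding_end="*"):
    """Return (input window, next character) pairs for a single name."""
    # Pad both ends so that the first window consists of start padding only
    padded = padding_start * length_of_sequence + name + padding_end * length_of_sequence
    pairs = []
    # Slide a fixed-size window over the padded name, one character at a time
    for i in range(len(padded) - length_of_sequence):
        window = padded[i:i + length_of_sequence]
        target = padded[i + length_of_sequence]
        pairs.append((window, target))
    return pairs


for window, target in make_training_pairs("eva"):
    print(repr(window), "->", repr(target))
```

The model sees each window as input and learns to predict the character that follows it. Predicting the end padding is what lets a generated name stop.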

### Source
- https://github.com/JKH4/name-generator/blob/master/dev/2018-05-18_JKH_NameGen-Main.ipynb

This lab contains assignments (although most of the code is given). <font color='red'>Questions are written in red.</font>

In [None]:
from __future__ import print_function
import numpy as np
import pandas as pd
import random
import sys
import io
import os
import tensorflow as tf
import matplotlib.pyplot as plt
import json

from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, GRU, SimpleRNN, Bidirectional, InputLayer
from keras.optimizers import Adam
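# get_file and to_categorical are exported by keras.utils directly
# (the data_utils / np_utils submodules are not available in recent Keras versions)
from keras.utils import get_file
from keras.utils import to_categorical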
from sklearn.metrics import confusion_matrix
from more_itertools import sort_together

In [None]:
physical_devices = tf.config.experimental.list_physical_devices('GPU')
if physical_devices:
    print("Run on GPU")
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
    #config.gpu_options.per_process_gpu_memory_fraction = 0.4

### Your Task

Below you find working code that needs some improvement.

<font color='red'>Your task is to get hands-on experience with Keras and RNNs by trying different options for building and tuning an RNN. 
<br>After you have created a model with good performance, please write a short summary of your experience: what worked well, what did not work well, what influenced the performance of the model, whether you observed strange behavior, how you analyzed the data, how you estimated the performance of your model, and which further improvements you would consider.</font>
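
One way to estimate the performance is to hold out part of the training sequences for validation. The sketch below is an assumption about how this could be done, not part of the given code: `split_train_validation` is a hypothetical helper, and the dummy arrays only mimic the layout of `x_train` and `y_train` (28 characters is an arbitrary choice).

```python
import numpy as np


def split_train_validation(x, y, validation_fraction=0.1, seed=42):
    """Shuffle the samples and hold out a fraction of them for validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    n_val = max(1, int(len(x) * validation_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx])


# Dummy data with the same layout as x_train / y_train:
# (samples, sequence length, chars) and (samples, chars)
x = np.zeros((100, 5, 28))
y = np.zeros((100, 28))
(x_tr, y_tr), (x_val, y_val) = split_train_validation(x, y)
print(x_tr.shape, x_val.shape)
```

The held-out part could then be stored as `parameters["validation_data"] = (x_val, y_val)`, which `train_model` passes on to `model.fit`. Note that windows from the same name can end up on both sides of such a split, so the validation accuracy may be optimistic.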

<font color='red'>Modify the function `create_model` according to your intuition on how the model could be improved. Options you might want to try (please experiment with at least four of them):</font>

- try different RNN types (SimpleRNN, LSTM, GRU - see [here](https://keras.io/layers/recurrent))
- try different number of RNN units (see `parameters`)
- use regularization techniques (e.g. dropout)
- use options provided by the RNN types (e.g. arguments "dropout" and "recurrent_dropout" - see [here](https://keras.io/layers/recurrent))
- stack several RNN layers (see [here](https://keras.io/getting-started/sequential-model-guide/) and search for "Stacked LSTM for sequence classification")
- try different optimizers (see [here](https://keras.io/optimizers/))
- get more inspiration from [here](https://ruder.io/deep-learning-nlp-best-practices)

<font color='red'>You might also want to have a look at the `parameters` variable below and modify it according to your needs.</font>
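
As a starting point for such experiments, the sketch below combines three of the options above: a different RNN type (GRU), two stacked recurrent layers, and dropout. It is one possible configuration under assumptions, not a recommended solution: `create_stacked_model`, the unit count and the dropout rate are hypothetical, and the imports assume the Keras version bundled with TensorFlow 2.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, GRU


def create_stacked_model(length_of_sequence, number_of_chars, rnn_units=32, dropout_rate=0.2):
    model = Sequential()
    model.add(Input(shape=(length_of_sequence, number_of_chars)))
    # return_sequences=True passes the hidden state of every time step to the next recurrent layer
    model.add(GRU(rnn_units, return_sequences=True,
                  dropout=dropout_rate, recurrent_dropout=dropout_rate))
    # The last recurrent layer only returns its final hidden state
    model.add(GRU(rnn_units))
    model.add(Dropout(dropout_rate))
    # One output probability per character
    model.add(Dense(number_of_chars, activation="softmax"))
    return model


model = create_stacked_model(length_of_sequence=5, number_of_chars=28)
print(model.output_shape)
```

Every recurrent layer except the last one needs `return_sequences=True`, otherwise the next recurrent layer receives a single vector instead of a sequence.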

<font color='green'>Please provide a summary of your experience in this box.</font>

In [None]:
def create_model(parameters):
    length_of_sequence = parameters["trainset_infos"]['length_of_sequence']
    number_of_chars = parameters["trainset_infos"]['number_of_chars']
    
    model = Sequential()
    model.add(InputLayer(input_shape=(length_of_sequence, number_of_chars)))
    model.add(SimpleRNN(parameters["rnn_units"]))
    model.add(Dense(number_of_chars, activation='softmax'))
    
    if parameters.get("verbose"):
        model.summary()
        
    return model

def load_data(parameters):
    length_of_sequence = 5
    padding_start = '#'
    padding_end = '*'
    file_url = parameters["file_url"]

    text = ''
    with io.open(get_file(os.path.basename(file_url), origin=file_url), encoding='utf-8') as f:
        text = f.read().lower()

    names = pd.read_csv(io.StringIO(text), names=['name'], comment='#', header=None)
    names['name'] = names['name'].map(lambda n: n.replace(padding_start, ''))    # replace characters used for training
    names['name'] = names['name'].map(lambda n: n.replace(padding_end, ''))    # replace characters used for training
    names['name'] = names['name'].map(lambda n: padding_start + n + padding_end) 
    
    data_dict = {}
    data_dict['name_list'] = names['name']
    data_dict['char_list'] = sorted(list(set(data_dict['name_list'].str.cat() + '*')))
    data_dict['char_to_ix'] = { ch:i for i,ch in enumerate(data_dict['char_list']) }
    data_dict['ix_to_char'] = { i:ch for i,ch in enumerate(data_dict['char_list']) }
           
    # Extract target names to list (currently '#name*')
    training_names = data_dict['name_list'].tolist()
    
    # Extract padding characters
    padding_start = training_names[0][0]
    padding_end = training_names[0][-1]

    # Extract target character convertors
    # This will be used to convert a character to its "one hot index" and vice versa (cf Keras to_categorical())
    c2i = data_dict['char_to_ix']
    i2c = data_dict['ix_to_char']
    
    # Extract the target number of characters in all target names
    # This will be used to convert character index in its "one hot" representation (cf Keras to_categorical())
    number_of_chars = len(data_dict['char_list'])
    
    # Pad target names with enough (length_of_sequence) padding characters (result '##...##name**...**')
    # The goal is to be sure that, for each name, the first training sample is X[0] = '##...##'
    # and Y[0] = first actual character of the name
    training_names = [
        padding_start * (length_of_sequence - 1) + n + padding_end * (length_of_sequence - 1) for n in training_names
    ]

    # Init X and Y as list
    X_list = []
    Y_list = []

    # Init counter for visual feedback
    counter = 0 if parameters["verbose"] else None
    
    for name in training_names:
        # Slide a window on the name, one character at a time
        for i in range(max(1, len(name) - length_of_sequence)):
            # Extract the new sequence and the character following this sequence
            new_sequence = name[i:i + length_of_sequence]
            target_char = name[i + length_of_sequence]
            
            # Add the new sequence to X (input of the model)
            X_list.append([to_categorical(c2i[c], number_of_chars) for c in new_sequence])
            # Add the following character to Y (target to be predicted by the model)
            Y_list.append(to_categorical(c2i[target_char], number_of_chars))

        # visual feedback
        if parameters["verbose"]:
            counter += 1
            if counter % 100 == 0:
                print(counter)
            else:
                print('.', end='')
            
    # make sure the number of elements aligns with the batch size
    offset = len(X_list) % parameters["batch_size"]
    if offset != 0:
        elements_to_copy = parameters["batch_size"] - offset
        X_list.extend(X_list[:elements_to_copy])
        Y_list.extend(Y_list[:elements_to_copy])
        
    # Convert X and Y to numpy array
    x_train = np.array(X_list)
    y_train = np.array(Y_list)
    
    # Extract the number of training samples
    m = len(x_train)
    
    # Create a description of the trainset
    parameters["trainset_infos"] = {
        'length_of_sequence': length_of_sequence,
        'number_of_chars': number_of_chars,
        'm': m,
        'padding_start': padding_start,
        'padding_end': padding_end,
    }

    if parameters["verbose"]:
        print(
            '\n{} names split into {} training sequences of {} encoded chars!'.format(counter, m, length_of_sequence)
        )

    # Visual feedbacks
    if parameters["verbose"]:
        print('X shape: {}'.format(x_train.shape))
        print('Y shape: {}'.format(y_train.shape))

        print('X[0] = {}'.format(x_train[0]))
        print('Y[0] = {}'.format(y_train[0]))

        print('Training set size: {}'.format(m))
        print('length_of_sequence: {}'.format(length_of_sequence))
        print('number_of_chars: {}'.format(number_of_chars))
        print('some names: {}'.format(names['name'][:5]))
    
                 
    parameters["x_train"] = x_train
    parameters["y_train"] = y_train
    parameters["word2index"] = c2i
    parameters["index2word"] = i2c                


def load_embeddings(word2index, word2embedding, embedding_dim, input_length = None, trainable=False):
    #return gensim_embedding_model.get_keras_embedding(train_embeddings=train_embeddings)
    
    embedding_matrix = np.zeros((len(word2index) + 1, embedding_dim))
    
    for word, i in word2index.items():
        embedding_vector = word2embedding.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
            
    if input_length is None:
        return Embedding(len(word2index) + 1, 
                         embedding_dim,
                         weights=[embedding_matrix],
                         trainable=trainable)
    else:
        return Embedding(len(word2index) + 1, 
                         embedding_dim,
                         weights=[embedding_matrix],
                         input_length=input_length,
                         trainable=trainable)

def create_embedding(parameters):
    if parameters["embedding_use_pretrained"]:
        embedding_model = parameters["embedding_model"]
        word2embedding = {word: embedding_model[word] for word, vector in embedding_model.vocab.items()}
        embedding_dim = embedding_model.vector_size
        embedding_layer = load_embeddings(parameters["word2index"], word2embedding, embedding_dim)
        return embedding_layer
    else:
        return Embedding(parameters["max_words"], parameters["embedding_dim"], input_length=parameters["maxlen"])


def compile_model(model, parameters):
    # 'learning_rate' replaces the deprecated 'lr' argument
    optimizer = Adam(learning_rate = parameters["learning_rates"][parameters["iter"]])

    model.compile(loss=parameters["loss_function"], optimizer=optimizer, metrics = parameters["metrics"])

        
def train_model(model, parameters):
    i = parameters["iter"]
    
    # Train the model
    h = model.fit(
        parameters["x_train"], parameters["y_train"],
        validation_data = parameters.get("validation_data"),
        batch_size = parameters["batch_size"],
        callbacks = parameters.get("callbacks"),
        initial_epoch = parameters["total_epochs"],
        epochs = parameters["total_epochs"] + parameters["epochs_to_run"][i]
    )

    history = parameters["history"]
    # Update history
    for key, val in h.history.items():
        col = history.get(key)
        
        if col is None:
            col = np.array([])
        
        history[key] = np.append(col, val)
        
    
    # Update the training session info
    parameters['total_epochs'] += parameters['epochs_to_run'][i]
    
    
def plot_class_balance(y, title=''):
    (unique, counts) = np.unique(y, return_counts=True)
    (unique, counts) = sort_together([unique, counts])

    plt.bar(unique, counts, align='center')
    plt.xticks(np.arange(len(unique)), unique)
    plt.xlabel('label')
    plt.ylabel('count')
    plt.title(title)

    plt.show()
    
    
def plot_confusion_matrix(y_true, y_pred, title=''):
    classes = list(set(list(y_true) + list(y_pred)))
    classes.sort()

    cmm = confusion_matrix(y_true, y_pred)

    print('Set Population: {}'.format(cmm.sum()))
    print('Accuracy: {:.4f}'.format(float(cmm.trace()) / cmm.sum()))

    plt.figure(figsize=(10, 8))
    plt.imshow(cmm / cmm.sum(), interpolation='nearest', cmap='Blues')
    plt.title(title)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.colorbar()

    plt.ylim(-0.5, len(classes)-0.5)

    if classes is not None:
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=45, size='x-large')
        plt.yticks(tick_marks, classes, size='x-large')

    for y in range(cmm.shape[0]):
        for x in range(cmm.shape[1]):
            if cmm[y, x] > 0:
                plt.text(x, y, '%.0i' % cmm[y, x],
                         horizontalalignment='center',
                         verticalalignment='center')
    plt.show()
    

def plot_training_session(parameters, plots=["accuracy", "val_accuracy"]):
    history = parameters["history"]
    
    x = range(len(history[plots[0]]))
    
    for label in plots:
        vals = history.get(label)
        if vals is not None:
            plt.plot(x, vals, label = label)
        else:
            print("There is no data for", label)
    
    plt.xlabel('epoch')
    plt.title('Progress')
    plt.legend()
    plt.show()
    
def generate_name(model, parameters, start_char = None, name_max_length = 25):
    '''
    Generate a name with the RNN model
    
    ## Inputs:
    model (Keras model): the trained model
    parameters (dict): the parameters (feedback is printed if parameters["verbose"] is set)
    start_char (string): optional first character of the name (chosen by the model if None)
    name_max_length (integer): max size of the generated name
    ## Outputs:
    generated_name (string): name generated by the RNN
    dict with a few numbers about this generated name
        probability: probability to generate this name (product of the probabilities of the selected characters)
        gap: gap between best name and this name (cumulative sum of gaps between selected character and best character)
        
    '''
    trainset_infos = parameters["trainset_infos"]
    # Extract the number of unique character in trainset
    dict_size = trainset_infos["number_of_chars"]
    
    # Extract the size of an input sequence
    sequence_length = trainset_infos["length_of_sequence"]
    
    # Extract the utility dictionaries that convert a character to its one-hot index and vice versa
    # (in this context a 'word' is a single character)
    i2c = parameters["index2word"]
    c2i = parameters["word2index"]
    
    # Extract padding character
    padding_start = trainset_infos["padding_start"]
    
    # Init a name full of padding_start character
    generated_name = padding_start * (sequence_length + name_max_length)

    # Init counters
    probability = 1
    gap = 0

    if start_char is not None:
        generated_name = generated_name[:(sequence_length - 1)] + start_char + generated_name[sequence_length:]
    
    # Generate new character from current sequence
    for i in range(name_max_length):
        # Extract current sequence from generated character
        x_char = generated_name[i:i+sequence_length]
        
        # Convert current sequence to one hot vector
        x_cat = np.array([[to_categorical(c2i[c], dict_size) for c in x_char]])
        
        # Predict new character probabilities
        # Actually this output a list of probabilities for each character
        p = model.predict(x_cat)

        # Extract the best character (and its probability)
        best_char = i2c[np.argmax(p)]
        best_char_prob = np.max(p)

        # Choose a random character index according to their probabilities (and its probability)
        new_char_index = np.random.choice(range(dict_size), p = p.ravel())
        new_char_prob = p[0][new_char_index]
        
        # Convert the index to an actual character
        new_char = i2c[new_char_index]
                
        # Update the generated name with the new character
        generated_name = generated_name[:sequence_length+i] + new_char + generated_name[sequence_length+i+1:]
        
        # Update counters
        probability *= new_char_prob # probabilities are multiplied
        gap += best_char_prob-new_char_prob # gaps are summed

        # Show some feedbacks
        if parameters["verbose"]:
            print(
                'i={} new_char: {} ({:.3f}) [best:  {} ({:.3f}), diff: {:.3f}, prob: {:.3f}, gap: {:.3f}]'.format(
                    i,
                    new_char,
                    new_char_prob,
                    best_char,
                    best_char_prob,
                    best_char_prob-new_char_prob,
                    probability,
                    gap
                )
            )

        # Stop the prediction loop if it reached a 'padding_end' character
        if (new_char == trainset_infos['padding_end']):
            break
    
    # Clean the generated name
    generated_name = generated_name.strip('#*')
    
    # Show some feedback
    if parameters["verbose"]:
        print('{} (probs: {:.6f}, gap: {:.6f})'.format(generated_name, probability, gap))

    return generated_name, {'probability': probability, 'gap': gap}

<font color='red'>You will need to tune the parameters (you probably want to have a look at 'rnn_units', and 'epochs_to_run').  Please indicate as comments the values you tried, and the best values you keep.</font>

<font color='red'>The arrays 'epochs_to_run' and 'learning_rates' are read in parallel: element i defines the number of epochs and the learning rate of training round i. E.g., in the first training round use 3 epochs with a learning rate of 0.03, in the second training round use 5 epochs and decrease the learning rate to 0.001, etc.</font>
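
The intended schedule can be spelled out with a few lines of plain Python. `training_rounds` is a hypothetical helper used only for illustration. It computes, for each round, the `initial_epoch` and `epochs` values that `train_model` hands to `model.fit`, together with the learning rate of that round.

```python
def training_rounds(epochs_to_run, learning_rates):
    """List the epoch range and learning rate of every training round."""
    if len(epochs_to_run) != len(learning_rates):
        raise ValueError("epochs_to_run and learning_rates need the same length")
    rounds = []
    total_epochs = 0
    for n_epochs, learning_rate in zip(epochs_to_run, learning_rates):
        rounds.append({
            "initial_epoch": total_epochs,      # epoch index at which this round starts
            "epochs": total_epochs + n_epochs,  # epoch index at which this round stops
            "learning_rate": learning_rate,
        })
        total_epochs += n_epochs
    return rounds


for r in training_rounds([3, 5, 7], [0.03, 0.001, 0.0003]):
    print(r)
```

With the default values the three rounds cover epochs 0 to 3, 3 to 8 and 8 to 15, so the model is trained for 15 epochs in total.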

In [None]:
parameters = {
    "verbose": True,
    "max_words": 20000,
    "max_sequence_length": 1000,
    "maxlen": 100,
    "rnn_units": 8,
    "embedding_use_pretrained": False,
    "embedding_fine_tune": False,
    "embedding_dim": 128,
    "iter": 0,
    "epochs_to_run": [3, 5, 7],
    "learning_rates": [0.03, 0.001, 0.0003],
    "total_epochs": 0,
    "loss_function": "categorical_crossentropy", 
    "metrics": ["accuracy"],
    "batch_size": 32,
    "history": {},
    "file_url": "https://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/male.txt"
    #"file_url": "https://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/female.txt"
}

In [None]:
load_data(parameters)

In [None]:
parameters["iter"] = 0
parameters["total_epochs"] = 0
parameters["history"] = {}

model = create_model(parameters)
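for i in range(len(parameters["epochs_to_run"])):
    # Select the training round before compiling and training,
    # so that epochs_to_run[i] and learning_rates[i] are the values actually used
    parameters["iter"] = i
    # Recompiling keeps the model weights but resets the optimizer state
    compile_model(model, parameters)
    train_model(model, parameters)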

In [None]:
plot_training_session(parameters, plots=["accuracy"])

In [None]:
generate_name(model, parameters, start_char = 'j')

In [None]:
generate_name(model, parameters)
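
Besides the name, `generate_name` returns the two numbers 'probability' and 'gap'. The following minimal sketch shows how they are accumulated, using made-up probabilities instead of model predictions. `score_choices` is a hypothetical helper, and the two distributions are invented for illustration.

```python
def score_choices(step_probabilities, chosen_indices):
    """Accumulate probability (product) and gap (sum) over all sampling steps."""
    probability = 1.0
    gap = 0.0
    for p, chosen in zip(step_probabilities, chosen_indices):
        # Probability of the whole name: product of the selected characters' probabilities
        probability *= p[chosen]
        # Gap: how far the selected character is from the most likely one
        gap += max(p) - p[chosen]
    return probability, gap


# Two sampling steps over a vocabulary of three characters (made-up values)
steps = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
chosen = [0, 2]  # step 1 picks the most likely character, step 2 does not
probability, gap = score_choices(steps, chosen)
print(probability, gap)
```

A gap of 0 means that every sampled character was also the most likely one. Larger values indicate that the sampling deviated from the greedy choice.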