# University of Twente - Natural Language Processing - Workshop


## Introduction
In this workshop we are going to generate texts using various Neural Language Models. The goal of the workshop is to give you some experience with these models and allow you to tweak their behavior to see what impact your changes have.

### Notes
- Google Colab sessions are not persistent and run for at most 12 hours. To keep your progress, connect the notebook to Google Drive and save your work there.
- GPU (and TPU) support is available. You can enable it by going to 'Runtime' -> 'Change Runtime Type' and selecting a GPU. This resets the runtime, so you will need to re-run the code cells.
- Many datasets that you can use in this workshop are freely available online, for example the fastai datasets, the datasets bundled with Keras, and Kaggle.

# Basic LSTM in Keras

In this section we are going to use LSTMs to generate texts using Nietzsche's writings as input. We follow this [example](https://keras.io/examples/lstm_text_generation/) from the Keras team.

In [0]:
# Imports
import io
import random
import sys

import numpy as np
import pandas as pd

from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils import get_file

In [0]:
# Check if we are using the GPU. You should see: '/device:GPU:0'
import tensorflow as tf
tf.test.gpu_device_name()

In [0]:
# Settings
dataset = 'https://s3.amazonaws.com/text-datasets/nietzsche.txt'
keep_fraction = 0.3

maxlen = 40   # maximum sequence size (in characters)
step = 3      # step size to create sequences

lstm_learning_rate = 0.01
lstm_activation = 'softmax'  # activation of the Dense output layer

batch_size = 128
num_epochs = 20

In [0]:
# Load corpus - Nietzsche

# Download the dataset, read the text and lowercase it.
path = get_file(
    'dataset.txt',
    origin=dataset)
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()

# The dataset contains over 90k words, which is rather small but still takes too long to train during this workshop.
# Let's reduce the size of the dataset by keeping only the first part of the text.
# Feel free to remove this step.
number_of_characters_to_keep = int(len(text) * keep_fraction)
text = text[:number_of_characters_to_keep]


chars = sorted(list(set(text)))
print('Number of unique characters:', len(chars))

# We assign an integer value to each unique character
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# The mapping
print('Character indices:', char_indices)

In [0]:
# Create snippets of the text by sliding over it.
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
    
print('Number of sequences created:', len(sentences))

print('Examples of sequences:', sentences[:5])
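
To see what the sliding window produces, the sketch below repeats the same loop on a tiny string. The names `toy_text`, `toy_maxlen` and `toy_step` are made up for this illustration and are not the workshop settings.

```python
# Toy version of the sliding-window step above (illustrative values only).
toy_text = 'hello world'
toy_maxlen = 4   # window size in characters
toy_step = 3     # how far the window moves each time

toy_sentences = []
toy_next_chars = []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    toy_sentences.append(toy_text[i: i + toy_maxlen])   # model input
    toy_next_chars.append(toy_text[i + toy_maxlen])     # character to predict

for window, target in zip(toy_sentences, toy_next_chars):
    print(repr(window), '->', repr(target))
```

Each window is paired with the single character that follows it. A smaller step gives more, heavily overlapping training examples.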

In [0]:
# Create vectors of the input data
# np.bool was removed in recent NumPy releases; the built-in bool works in all versions.
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
    
print(f'We\'ve created {len(sentences)} sequences, of length {maxlen}, with {len(chars)} unique characters.')
# Now we've created a boolean representation of that.
print('X dimensions:', x.shape)
print('Y dimensions:', y.shape)
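
The boolean tensors above are one-hot encodings: one row per character position, one column per unique character. A minimal sketch on a three-letter vocabulary, where `one_hot_encode` and the `toy_` names are hypothetical helpers introduced only for this example:

```python
import numpy as np

toy_chars = sorted(set('abca'))  # three unique characters: a, b, c
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}


def one_hot_encode(sequence, char_to_index):
    """Return a (len(sequence), vocabulary size) boolean matrix."""
    encoded = np.zeros((len(sequence), len(char_to_index)), dtype=bool)
    for t, char in enumerate(sequence):
        encoded[t, char_to_index[char]] = True
    return encoded


toy_encoded = one_hot_encode('cab', toy_char_indices)
print(toy_encoded.astype(int))
```

Every row contains exactly one `True`, in the column of the character at that position.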

In [0]:
# Create a simple model with one LSTM of 128 units
# Feel free to add more or tweak the settings!
lstm_model = Sequential()
lstm_model.add(LSTM(128, input_shape=(maxlen, len(chars))))
lstm_model.add(Dense(len(chars), activation=lstm_activation))

lstm_optimizer = RMSprop(lstm_learning_rate)
lstm_model.compile(loss='categorical_crossentropy', optimizer=lstm_optimizer)

# Note, we are not training the model yet

In [0]:
# Helper functions
def sample(preds, diversity=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / diversity
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = lstm_model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
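
The `diversity` argument of `sample` works like a temperature. The sketch below applies the same log, divide and renormalise steps to a made-up distribution, without the random draw, so you can see how the probabilities shift. `reweight` and `toy_preds` are hypothetical names used only here.

```python
import numpy as np


def reweight(preds, diversity=1.0):
    """Rescale a probability array the same way sample() does before drawing."""
    preds = np.asarray(preds).astype('float64')
    scaled = np.exp(np.log(preds) / diversity)
    return scaled / np.sum(scaled)


toy_preds = [0.5, 0.3, 0.2]  # made-up next-character probabilities
for diversity in [0.2, 1.0, 1.2]:
    print(diversity, np.round(reweight(toy_preds, diversity), 3))
```

A diversity below 1 concentrates the probability mass on the most likely character, which gives safer but more repetitive text. A diversity above 1 flattens the distribution, which gives more varied but more error-prone text.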

In [0]:
# Train the model. This will take some time; lower num_epochs in the settings for a quicker run.
lstm_model.fit(x, y,
               batch_size=batch_size,
               epochs=num_epochs,
               callbacks=[print_callback])

In [0]:
# Yes! We've created a model and are now able to annoy Reddit, Facebook, and so on, with our fab texts!

def generate_random_text(length=120):
    start_index = random.randint(0, len(text) - maxlen - 1)
    diversity = 1.0
    generated = ''
    sentence = text[start_index: start_index + maxlen]
    for i in range(length):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.

        preds = lstm_model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]

        sentence = sentence[1:] + next_char

        generated += next_char
    return generated

In [0]:
generate_random_text(100)

## Challenge
Now try changing the model and dataset. Can you improve the results?

How about changing the data from character to word based?
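
One possible starting point for the word-based variant is sketched below. It assumes a simple regular-expression tokenizer and a tiny made-up corpus; `toy_corpus`, `word_maxlen` and `word_step` are illustrative names, not part of the workshop code.

```python
import re

toy_corpus = 'the cat sat on the mat and the cat slept'

# Split the text into lowercase word tokens instead of characters.
toy_words = re.findall(r"[a-z']+", toy_corpus.lower())
word_vocab = sorted(set(toy_words))
word_to_index = {w: i for i, w in enumerate(word_vocab)}

word_maxlen = 3  # window size in words
word_step = 1    # how far the window moves each time

word_sequences = []
next_words = []
for i in range(0, len(toy_words) - word_maxlen, word_step):
    word_sequences.append(toy_words[i: i + word_maxlen])
    next_words.append(toy_words[i + word_maxlen])

print('Vocabulary size:', len(word_vocab))
print('First sequence:', word_sequences[0], '->', next_words[0])
```

On a real corpus the word vocabulary is far larger than the character set, so one-hot inputs become very wide. Feeding integer word indices into an `Embedding` layer is a common alternative.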

In [0]:
# Insert modified code here


# CNN in Keras

This time we'll create a CNN-based classifier using the pre-trained [GloVe](http://nlp.stanford.edu/projects/glove/) word embeddings. This example is based on [this](https://github.com/keras-team/keras/blob/master/examples/pretrained_word_embeddings.py) script from the Keras team.

In [0]:
# Imports
import io
import os
import sys
import tarfile
import zipfile
from pathlib import Path

import numpy as np
import pandas as pd
import requests

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, GlobalMaxPooling1D
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model
from keras.initializers import Constant

In [0]:
# Settings
glove_embeddings_url = 'http://nlp.stanford.edu/data/glove.6B.zip'
newsgroup_data_url = 'http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz'

base_dir = Path('.')
glove_dir = base_dir / 'glove.6B'
text_data_dir = base_dir / '20_newsgroup'
max_sequence_length = 1000
max_num_words = 20000
embedding_dim = 100
validation_split = 0.2

In [0]:
# Download and extract data
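glove_res = requests.get(glove_embeddings_url)
with zipfile.ZipFile(io.BytesIO(glove_res.content)) as zf:
    zf.extractall(glove_dir)

news_res = requests.get(newsgroup_data_url)
# tarfile.open() takes in-memory data through the fileobj argument.
# The handle is named 'tar' so that it does not shadow the TensorFlow alias 'tf'.
with tarfile.open(fileobj=io.BytesIO(news_res.content), mode='r:gz') as tar:
    tar.extractall(base_dir)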

In [0]:
# Index word vectors
embeddings_index = {}
with open(glove_dir / 'glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, 'f', sep=' ')
        embeddings_index[word] = coefs
        
print('Found %s word vectors.' % len(embeddings_index))
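
Each line of a GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses three made-up lines with invented three-dimensional vectors (real GloVe vectors have 50 to 300 dimensions) and compares them with cosine similarity. `parse_glove_line` and `cosine_similarity` are hypothetical helpers written for this example.

```python
import numpy as np


def parse_glove_line(line):
    """Split one 'word v1 v2 ...' line into the word and a float32 vector."""
    word, coefs = line.split(maxsplit=1)
    return word, np.array(coefs.split(), dtype='float32')


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Invented vectors, chosen only to make the comparison visible.
toy_lines = [
    'king 0.5 0.7 0.1',
    'queen 0.4 0.8 0.1',
    'apple -0.6 0.1 0.9',
]
toy_index = dict(parse_glove_line(line) for line in toy_lines)

print('king vs queen:', cosine_similarity(toy_index['king'], toy_index['queen']))
print('king vs apple:', cosine_similarity(toy_index['king'], toy_index['apple']))
```

With the real `embeddings_index` you could try the same comparison on words of your choice, provided they occur in the GloVe vocabulary.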

In [0]:
# Process text data
texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids
for name in sorted(os.listdir(text_data_dir)):
    path = os.path.join(text_data_dir, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                # The newsgroup posts are not valid UTF-8, so read them as latin-1.
                with open(fpath, encoding='latin-1') as f:
                    t = f.read()
                    i = t.find('\n\n')  # skip header
                    if 0 < i:
                        t = t[i:]
                    texts.append(t)
                labels.append(label_id)

print('Found %s texts.' % len(texts))

In [0]:
# Tokenize
tokenizer = Tokenizer(num_words=max_num_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)


In [0]:
# Find unique words
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

In [0]:
# Create data and label tensors
data = pad_sequences(sequences, maxlen=max_sequence_length)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
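
With its default arguments, `pad_sequences` pads short sequences with zeros at the front and cuts long sequences by dropping tokens from the front. The NumPy sketch below mimics that default behaviour on two made-up sequences. `pad_pre` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    padded = np.full((len(sequences), maxlen), value, dtype='int32')
    for row, seq in enumerate(sequences):
        if not seq:
            continue                      # an empty sequence stays all padding
        trimmed = seq[-maxlen:]           # keep only the last maxlen tokens
        padded[row, -len(trimmed):] = trimmed
    return padded


toy_sequences = [[4, 9], [7, 1, 3, 8, 2, 6]]
toy_padded = pad_pre(toy_sequences, maxlen=4)
print(toy_padded)
```

Index 0 is safe to use as padding because the `Tokenizer` starts numbering words at 1.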

In [0]:
# Split into training and validation sets
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(validation_split * data.shape[0])

x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

In [0]:
# Prepare embedding matrix
num_words = min(max_num_words, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    if i >= max_num_words:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

In [0]:
# Create the embedding layer
# Use the lowercase names defined in the settings cell.
embedding_layer = Embedding(num_words,
                            embedding_dim,
                            embeddings_initializer=Constant(embedding_matrix),
                            input_length=max_sequence_length,
                            trainable=False)

In [0]:
# Create the model
sequence_input = Input(shape=(max_sequence_length,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])
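
To get a feel for how the convolution and pooling layers shrink the sequence, the sketch below traces the number of time steps through the stack. It assumes the Keras defaults used above: 'valid' padding, stride 1 for `Conv1D`, and non-overlapping windows for `MaxPooling1D`. The helper names are made up for this example.

```python
def conv_output_length(length, kernel_size):
    """Time steps left after a 'valid' 1D convolution with stride 1."""
    return length - kernel_size + 1


def pool_output_length(length, pool_size):
    """Time steps left after non-overlapping max pooling."""
    return length // pool_size


steps = 1000  # max_sequence_length in the settings
trace = [steps]
for layer in ['conv', 'pool', 'conv', 'pool', 'conv']:
    if layer == 'conv':
        steps = conv_output_length(steps, kernel_size=5)
    else:
        steps = pool_output_length(steps, pool_size=5)
    trace.append(steps)

print(' -> '.join(str(s) for s in trace))
```

`GlobalMaxPooling1D` then reduces the remaining time steps to a single vector of 128 values, one per filter. You can compare the trace with the output of `model.summary()`.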

In [0]:
# Train the model
model.fit(x_train, y_train,
          batch_size=128,
          epochs=10,
          validation_data=(x_val, y_val))

In [0]:
# Use the model for classification
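
A possible way to fill in this cell is sketched below. It turns a matrix of class probabilities into newsgroup names by taking the most probable class per row. `decode_predictions` and the `toy_` values are hypothetical and stand in for real model output.

```python
import numpy as np


def decode_predictions(probabilities, label_to_id):
    """Map each row of class probabilities to the name of its best class."""
    id_to_label = {i: name for name, i in label_to_id.items()}
    best_ids = np.argmax(probabilities, axis=1)
    return [id_to_label[int(i)] for i in best_ids]


# Made-up stand-ins for labels_index and for the output of model.predict().
toy_labels_index = {'alt.atheism': 0, 'comp.graphics': 1, 'sci.space': 2}
toy_probabilities = np.array([[0.1, 0.2, 0.7],
                              [0.8, 0.1, 0.1]])

toy_predicted = decode_predictions(toy_probabilities, toy_labels_index)
print(toy_predicted)
```

In the notebook, the probabilities would come from something like `model.predict(x_val)` and the mapping from `labels_index`. New texts would first need the same `tokenizer.texts_to_sequences` and `pad_sequences` steps as the training data.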

In [0]:
# How could you use the embeddings and the CNN to generate texts?

# AWD-LSTM in fastai

This time we use the AWD-LSTM in fastai to generate movie reviews.

The complete example can be found here: './fastai-nlp-course/5-nn-imdb.ipynb'