In [1]:
%cd /content/drive/My Drive/Colab Notebooks/nlp/apps/language_models

/content/drive/My Drive/Colab Notebooks/nlp/apps/language_models


# What we need to do

- You will start by converting a line of text into a tensor.
- Then you will create a generator to feed data into the model.
- You will train a neural network to predict the next set of characters of a defined length.
- You will use embeddings for each character and feed them as inputs to your model.
    - Many natural language tasks rely on using embeddings for predictions.

- Your model will convert each character to its embedding, run the embeddings through a Gated Recurrent Unit (GRU), and run the result through a linear layer to predict the next set of characters.

- You will get the embeddings;
- Stack the embeddings on top of each other;
- Run them through two layers with a relu activation in the middle;
- Finally, you will compute the softmax. 

To predict the next character:
- Use the softmax output and identify the character with the highest probability.
- The character with the highest probability is the prediction for the next character.
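
As a minimal sketch of that last step, assuming a hypothetical four-character vocabulary and made-up scores (neither comes from the trained model), the snippet below applies a log-softmax and picks the character with the highest probability:

```python
import math

def log_softmax(scores):
    """Turn raw scores into log probabilities (numerically stable)."""
    highest = max(scores)
    log_norm = highest + math.log(sum(math.exp(s - highest) for s in scores))
    return [s - log_norm for s in scores]

# Hypothetical vocabulary and scores, for illustration only.
vocab = ['a', 'b', 'c', 'd']
scores = [0.5, 2.0, -1.0, 0.1]

log_probs = log_softmax(scores)
# index of the largest log probability
best_index = max(range(len(log_probs)), key=lambda i: log_probs[i])
next_char = vocab[best_index]
print(next_char)  # b
```

Always taking the maximum is deterministic, so the same prefix always produces the same continuation.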

In [2]:
!pip install -q -U trax

In [3]:
import trax
import trax.fastmath.numpy as np
import numpy
import random
import itertools
from trax import fastmath



# Get the data

We will use the Sherlock novels as our data. We will treat each line as a sentence and, because we are going to predict characters instead of words, convert each sentence into characters. Each line of characters is then stored in a list. In other words, we are going to have a list of lists of characters.
Finally, we will create a generator that takes batch_size and max_length, where max_length is the length of the longest sentence.



In [4]:
path = '/content/drive/MyDrive/Colab Notebooks/nlp/apps/data/sherlock_novels.txt'
testing_path = '/content/drive/MyDrive/Colab Notebooks/nlp/apps/data/study in scarlet.txt'
output_dir = '/content/drive/MyDrive/Colab Notebooks/nlp/apps/language_models/models'

In [5]:
def get_sentences(path):
    """
    Reads a txt file and returns each line (sentence)
    in a list
    """
    with open(path) as f:
        sentences = f.readlines()

    return sentences

def get_max_length(sentences):
    """
    Takes a list of sentences and search for the
    longest one.
    """
    sentence = max(sentences, key=len)
    max_length = len(sentence)
    return max_length, sentence


# Preprocess



In [6]:
def preprocess(sentences):
    """
    Takes a list of sentences to clean and lowercase them
    """
    for i, sentence in enumerate(sentences):
        sentences[i] = sentence.strip().lower()

    return sentences


# Create train and validation set

In [7]:
def create_train_val(sentences):
    """
    It takes a list of sentences and divides them into
    90% train and 10% validation
    """
    n = len(sentences)
    pct = int(n * 0.9)
    train = sentences[:pct]
    validation = sentences[pct:]

    return train, validation

# Convert sentences to tensors

Now, we need to convert our sentences into numbers so that we can feed them into our model.

In [8]:
def sentence2tensor(sentence, end_token=1):
    """
    It takes the sentence and transforms each
    character to a number
    """
    tensor = [ord(char) for char in sentence]
    # append the end token to the sentence
    tensor.append(end_token)

    return tensor
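
To make the encoding concrete, here is a small round-trip sketch. `encode` restates the same `ord`-based idea so the block runs on its own, and `tensor2sentence` is a hypothetical helper (not part of the notebook) that assumes 1 is the end token and 0 is padding:

```python
def encode(sentence, end_token=1):
    """Same idea as sentence2tensor: code points plus an end token."""
    return [ord(char) for char in sentence] + [end_token]

def tensor2sentence(tensor, end_token=1, pad_token=0):
    """Hypothetical inverse: stop at the end token, skip padding."""
    chars = []
    for value in tensor:
        if value == end_token:
            break
        if value != pad_token:
            chars.append(chr(value))
    return "".join(chars)

tensor = encode("holmes")
print(tensor)                   # [104, 111, 108, 109, 101, 115, 1]
print(tensor2sentence(tensor))  # holmes
```

Since `ord` returns Unicode code points, a character with a code point of 256 or more (a curly quote, for example) would fall outside a vocabulary of size 256, so such characters may need to be replaced or filtered first.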

    
    


# Generate batches

We will convert our text sentences into numpy arrays and add padding to each sentence. The amount of padding is determined by max_length, the length of the longest sentence in our corpus.

The batch is a tuple with three values: inputs, targets, mask. Mask will be 1 for all non-padding tokens.
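
The padding and mask for a single sentence can be sketched as follows. `pad_and_mask` is a hypothetical helper used only for illustration, and it assumes 0 is the padding value:

```python
def pad_and_mask(tensor, max_length, pad_token=0):
    """Right-pad a list of ids and build the matching 0/1 mask."""
    padded = tensor + [pad_token] * (max_length - len(tensor))
    mask = [0 if value == pad_token else 1 for value in padded]
    return padded, mask

# "hi" as code points plus the end token 1, padded to length 6.
padded, mask = pad_and_mask([104, 105, 1], max_length=6)
print(padded)  # [104, 105, 1, 0, 0, 0]
print(mask)    # [1, 1, 1, 0, 0, 0]
```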

In [9]:
def generate_batch(batch_size, max_length, sentences, sentence2tensor=sentence2tensor, shuffle=True):
    """
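    It takes a list of sentences and yields batches forever.
    Each batch is a tuple (inputs, targets, mask) of arrays with
    shape (batch_size, max_length), padded with zeros.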
    """
    index = 0
    current_batch = []
    num_sentences = len(sentences)

    # create an array with the indexes of sentences that can be shuffled
    sentences_index = [*range(num_sentences)]

    if shuffle:
        random.shuffle(sentences_index)

    while True:
        if index >= num_sentences:
            # reset index if we used all the sentences
            index = 0

            if shuffle:
                random.shuffle(sentences_index)

        # get a sentence
        sentence = sentences[sentences_index[index]]

        if len(sentence) < max_length:
            current_batch.append(sentence)

        index += 1

        # check if we already have our desired batch_size
        if len(current_batch) == batch_size:
            batch = []
            mask = []
            for batch_sentence in current_batch:
                # convert the batch sentence to a tensor
                tensor = sentence2tensor(batch_sentence)

                # add the padding
                pad = [0] * (max_length - len(tensor))
                tensor_padded = tensor + pad

                batch.append(tensor_padded)
                mask_tensor = [0 if i == 0 else 1 for i in tensor_padded]
                mask.append(mask_tensor)

            # convert the padded tensor into a trax tensor
            trax_batch = np.array(batch)
            trax_mask = np.array(mask)

            # yield two copies of the batch and mask
            yield trax_batch, trax_batch, trax_mask

            # reset current_batch to an empty list
            current_batch = []
                


Once we have our function to generate batches, we need a way to cycle over them so that we can train for multiple epochs. One way to do it is with the itertools.cycle() function.

```python
import itertools

infinite_generator = itertools.cycle(generate_batch(batch_size, max_length, sentences))
```
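
To see what `itertools.cycle` does, here is a small sketch with a hypothetical finite generator (`finite_batches` stands in for one pass over the data and is not part of the notebook):

```python
import itertools

def finite_batches():
    """Hypothetical finite generator standing in for one pass over the data."""
    for batch_id in range(3):
        yield batch_id

# cycle() replays the saved items once the generator is exhausted
infinite_generator = itertools.cycle(finite_batches())
first_seven = list(itertools.islice(infinite_generator, 7))
print(first_seven)  # [0, 1, 2, 0, 1, 2, 0]
```

Note that `itertools.cycle` keeps a copy of every item it has yielded. Because `generate_batch` already loops forever with `while True`, the wrapper is probably optional here, and calling `next()` on the generator directly would avoid storing the batches.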

# Create the model

In [10]:
def create_model(vocab_size=256, emb_dim=300, n_layers=2, mode='train'):
    """
    Returns a GRU model.
    Args:
        vocab_size: int. the amount of unique char
        emb_dim: int. embeddings depth
        n_layers: int. number of GRU layers
        mode: str
    returns:
        model: a trax model
    """
    model = trax.layers.Serial(
        trax.layers.ShiftRight(mode=mode),
        trax.layers.Embedding(vocab_size, emb_dim),
        [trax.layers.GRU(emb_dim) for _ in range(n_layers)],
        trax.layers.Dense(vocab_size),
        trax.layers.LogSoftmax()
    )

    return model



# Training

In [11]:
sentences = get_sentences(path)
pre_sentences = preprocess(sentences)
# get_max_length returns the len and the sentence
max_length, _ = get_max_length(pre_sentences)
batch_size = 32
train, val = create_train_val(pre_sentences)

num_sentences = len(train)
n_steps = int(num_sentences / batch_size)
epochs = 100


In [12]:
# num_sentences = len(pre_sentences)
# print(num_sentences)
# print(int(num_sentences / batch_size))

def train_model(model, train_sentences, val_sentences, generate_batch, 
                max_length, learning_rate=0.0001, batch_size=32, n_steps=1, output_dir='models/'):
    """
    It trains our trax model
    Args:
        model: trax model
        train_sentences: list
        val_sentences: list
        generate_batch: func
        max_length: int. it is the length of the longest sentence
        learning_rate: float. Learning rate for the Adam optimizer
        batch_size: int
        n_steps: int. Number of steps to perform
        output_dir: str
    returns:
        a trax Training loop for the model.
    """
    print(f'This is the number of steps needed to finish training: {n_steps}')
    # use the training split passed in, not the global list of sentences
    train_generator = generate_batch(batch_size, max_length, train_sentences)
    infinite_train_generator = itertools.cycle(train_generator)

    val_generator = generate_batch(batch_size, max_length, val_sentences)
    infinite_val_generator = itertools.cycle(val_generator)

    train_task = trax.supervised.training.TrainTask(
        labeled_data=infinite_train_generator,
        loss_layer=trax.layers.CrossEntropyLoss(),
        optimizer=trax.optimizers.Adam(learning_rate),
        n_steps_per_checkpoint=500
    )

    val_task = trax.supervised.training.EvalTask(
        labeled_data=infinite_val_generator,
        metrics=[trax.layers.CrossEntropyLoss(), trax.layers.Accuracy()],
        n_eval_batches=10
    )

    training_loop = trax.supervised.training.Loop(model, train_task, 
                                                  eval_tasks=val_task, output_dir=output_dir)
    
    training_loop.run(n_steps=n_steps)

    return training_loop


In [13]:
# training_loop = train_model(create_model(), train, val, generate_batch, max_length, learning_rate=0.001, 
#                             n_steps=n_steps * epochs, output_dir=output_dir)

# Evaluation

To evaluate language models, we usually use perplexity, which is a measure of how well a probability model predicts a sample. 

$$P(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}$$

In practice, we usually take the log of that formula. This lets us use the log probabilities we get as output of our `RNN` directly, and it turns exponents into products and products into sums, which makes the computation simpler and more efficient. We should also take care of the padding: padding positions must be excluded when calculating the perplexity, because including them would make the perplexity look artificially good.


$$\log P(W) = {\log\big(\sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}\big)}$$

$$ = {\log\big({\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}\big)^{\frac{1}{N}}}$$ 

$$ = {\log\big({\prod_{i=1}^{N}{P(w_i| w_1,...,w_{i-1})}}\big)^{-\frac{1}{N}}} $$
$$ = -\frac{1}{N}{\log\big({\prod_{i=1}^{N}{P(w_i| w_1,...,w_{i-1})}}\big)} $$
$$ = -\frac{1}{N}{\big({\sum_{i=1}^{N}{\log P(w_i| w_1,...,w_{i-1})}}\big)} $$
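
As a worked example of the last line of the derivation, the sketch below computes the log perplexity for one short sequence in plain Python. The uniform "model" and the target ids are made up for illustration, and 0 is assumed to be the padding id:

```python
import math

def log_perplexity(log_probs, targets, pad_token=0):
    """Average negative log probability of the non-padding targets.

    log_probs[i][k] is the log probability of token k at position i.
    """
    total = 0.0
    count = 0
    for position, target in enumerate(targets):
        if target == pad_token:
            continue  # padding does not count towards N
        total += log_probs[position][target]
        count += 1
    return -total / count

# Hypothetical model over 4 tokens that is uniform at every position.
uniform = [math.log(0.25)] * 4
log_probs = [uniform, uniform, uniform, uniform]
targets = [2, 3, 1, 0]  # the last position is padding

log_ppx = log_perplexity(log_probs, targets)
print(log_ppx, math.exp(log_ppx))
```

A model that is uniform over 4 tokens has a perplexity of 4 (log perplexity of about 1.386), so perplexity can be read as the effective number of equally likely choices the model is hesitating between.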

In [25]:
def calculate_perplexity(predictions, targets):
    """
    It takes the predictions and targets, where predictions
    is a tensor of log probabilities.
    Args:
        predictions: trax.array. Log probabilities predicted by the model
            for a batch of sentences of the text
        targets: trax.array. The actual batch of tensors corresponding
            to the sentences of the text
    returns:
        log_perplexity: float. This is the log perplexity of our model
    """
    # we use trax.layers.one_hot to transform the target into the same dimension
    total_log_perplexity = np.sum(predictions * trax.layers.one_hot(targets, predictions.shape[-1]), axis=-1)
    non_pad = 1.0 - np.equal(targets, 0)

    # get rid of the padding
    perplexity = total_log_perplexity * non_pad
    log_perplexity = np.sum(perplexity) / np.sum(non_pad)

    return -log_perplexity


In [26]:
# Testing 
model = create_model()
model.init_from_file(output_dir + '/' + 'model.pkl.gz')
batch = next(generate_batch(batch_size, max_length, pre_sentences, shuffle=False))
preds = model(batch[0])
log_ppx = calculate_perplexity(preds, batch[1])
print('The log perplexity and perplexity of your model are respectively', log_ppx, np.exp(log_ppx))

The log perplexity and perplexity of your model are respectively 1.0361327 2.8182967


# Generate text with the model

We will now use our own language model to generate new sentences. To do that, we need to draw samples from a Gumbel distribution.

The Gumbel Probability Density Function (PDF) is defined as: 

$$ f(z) = {1\over{\beta}}e^{-(z+e^{-z})} $$

where: $$ z = {(x - \mu)\over{\beta}}$$

The maximum of a sample of random variables following an exponential distribution approaches the Gumbel distribution as the sample size grows. The prediction in the last step of the Recurrent Neural Network (`RNN`) we are using for text generation is also chosen as a maximum value. For that reason, the Gumbel distribution is used to sample from a categorical distribution.
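
The sampling idea can be checked empirically. Adding independent standard Gumbel noise to each log probability and taking the argmax (often called the Gumbel-max trick) draws an index from the categorical distribution. The sketch below uses plain Python and a made-up three-way distribution; `gumbel_max_sample` is a hypothetical stand-in that scales the noise by a temperature in the same way as the notebook's `gumbel_sample`:

```python
import math
import random
from collections import Counter

def gumbel_max_sample(log_probs, temperature=1.0, rng=random):
    """Add Gumbel noise to each log probability and take the argmax."""
    noisy = []
    for log_p in log_probs:
        u = rng.uniform(1e-6, 1.0 - 1e-6)  # avoid log(0)
        g = -math.log(-math.log(u))        # standard Gumbel draw
        noisy.append(log_p + g * temperature)
    return max(range(len(noisy)), key=lambda i: noisy[i])

rng = random.Random(0)   # fixed seed so the run is repeatable
probs = [0.6, 0.3, 0.1]  # hypothetical categorical distribution
log_probs = [math.log(p) for p in probs]

n_draws = 20000
counts = Counter(gumbel_max_sample(log_probs, rng=rng) for _ in range(n_draws))
frequencies = [counts[i] / n_draws for i in range(len(probs))]
print(frequencies)  # close to [0.6, 0.3, 0.1]

# With temperature 0 the noise vanishes and the draw is the plain argmax.
greedy = gumbel_max_sample(log_probs, temperature=0.0, rng=rng)
print(greedy)  # 0
```

With a temperature of 1 the empirical frequencies approach the original probabilities, while lower temperatures make the draws more concentrated on the most likely index.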

In [44]:
def gumbel_sample(log_probabilities, temperature=1.0):
    """Gumbel sampling from a categorical distribution."""
    u = numpy.random.uniform(low=1e-6, high=1.0 - 1e-6, size=log_probabilities.shape)
    g = -np.log(-np.log(u))

    return np.argmax(log_probabilities + g * temperature, axis=-1)

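def prediction(num_chars, prefix):
    """
    Generates up to num_chars characters after prefix using the
    global model. Stops early if the model predicts the end token.
    """
    inp = [ord(char) for char in prefix]
    result = [c for c in prefix]
    max_len = len(prefix) + num_chars
    for _ in range(num_chars):
        # pad the current input with zeros up to max_len
        cur_inp = np.array(inp + [0] * (max_len - len(inp)))
        outp = model(cur_inp[None, :])  # Add batch dim.
        # sample the next character from the log probabilities at the current position
        next_char = gumbel_sample(outp[0, len(inp)])
        inp += [int(next_char)]

        if inp[-1] == 1:
            break # EOS
        result.append(chr(int(next_char)))

    return "".join(result)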



In [46]:
# Create 10 sentences using our language model.
# Each sentence ends after max_length characters or when the model
# predicts the EOS token.
for _ in range(10):
    print(prediction(max_length, ''))

"already?" asked vicarage.
me without your measure."
i had taken abar that the game he need, to some tongania, whatever
raised streets, and i saw a prompt fellow which was broken out,
there any trace of the man."
it outside the room, sooner which had happened.
and round the dooches, something in the state which had been so
"but it was easy. what frequent?"
"have you arranged silverton extended figures at the fall, holmes,
business, and my pipe is so close towards me on one side, as was
