# Preparations


Execute the following code blocks to configure the session and import relevant modules.

In [2]:
%config InlineBackend.figure_format ='retina'
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [3]:
import keras
keras.__version__

'2.11.0'

In [159]:
from pathlib import Path

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

#from tensorflow.keras.optimizer_v2.adam import Adam
# Alternatively:
from tensorflow.keras.optimizers import Adam


## Running on colab
You can use this [link](https://colab.research.google.com/github/NBISweden/workshop-neural-nets-and-deep-learning/blob/rnn_labs/session_recurrentNeuralNetworks/lab_language_modelling/language_modelling.ipynb) to run the notebook on Google Colab. If you do so, you should first make a copy to your own Google Drive before you start working on the notebook. Otherwise the changes you make to the notebook will not be saved.

## Download data
You can download and extract the data to your local machine by executing the cell below. This is useful if you haven't extracted the archive in the repository, or if you are running the notebook outside of the repository (e.g. on Colab).

In [158]:
# Run this cell if you don't have the data, e.g. when running the notebook on Colab. It downloads the archive from GitHub and extracts it into the `data` directory.
data_directory = Path('data')
data_url = "https://github.com/NBISweden/workshop-neural-nets-and-deep-learning/raw/rnn_labs/session_recurrentNeuralNetworks/lab_language_modelling/data/jane_austen.zip"

if not (data_directory / 'jane_austen').exists():
    if not data_directory.exists():
        data_directory.mkdir(parents=True)
    if not (data_directory / 'jane_austen.zip').exists():
        from urllib.request import urlretrieve
        urlretrieve(data_url, data_directory / 'jane_austen.zip')
    if (data_directory / 'jane_austen.zip').exists():
        import zipfile
        with zipfile.ZipFile(data_directory / 'jane_austen.zip') as zf:
            zf.extractall(data_directory)        


# Lab session: Language Modelling

A long-standing application of RNNs is to model natural language. In the field of computational linguistics, a "Language Model" is a model which assigns probabilities to strings of a language. In this lab we will train an RNN to perform this task.

We desire a model which can give us the probability of observing a string in a language, e.g. $P(\text{the}, \text{cat}, \text{sat}, \text{on}, \text{the}, \text{mat})$. Directly modelling this joint probability distribution is problematic, in particular because we will pretty much only see single examples of most strings in our data. The way we model this instead is by using the "chain rule of probability". A joint probability distribution can be broken down into products of conditional distributions according to:

$$
\begin{align*}
P(x_1, \dots, x_n) = P(x_1) \prod_{i=2}^{n} P(x_i | x_{1}, \dots, x_{i-1})
\end{align*}
$$

So with the example of $P(\text{the}, \text{cat}, \text{sat}, \text{on}, \text{the}, \text{mat})$ we can actually model this as
$$
\begin{align*}
P(\text{the}, \text{cat}, \text{sat}, \text{on}, \text{the}, \text{mat}) = P(\text{the})P(\text{cat} | \text{the})P(\text{sat} | \text{the}, \text{cat})P(\text{on} | \text{the}, \text{cat}, \text{sat})P(\text{the}| \text{the}, \text{cat}, \text{sat}, \text{on})P(\text{mat}| \text{the}, \text{cat}, \text{sat}, \text{on}, \text{the})
\end{align*}
$$

As you can see, this is a kind of autoregressive structure, and we'll model it using a recurrent neural network. The equation below illustrates how we model this for the fourth word in the example sentence:


$$
\begin{align*}
P(\text{on} | \text{the}, \text{cat}, \text{sat}) = f(\text{the}, \text{cat}, \text{sat}; \theta)
\end{align*}
$$

Here $f(\bullet ; \theta)$ is our learnable model, our RNN. We then train this model as we would train a classifier; for the example input we try to make it assign all probability to the word "$\text{on}$". We can then compute the probability of whole sequences by taking the product of these conditional probabilities. In practice, when we maximize the log likelihood, the loss decomposes into a sum of the conditional log probabilities.
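
The decomposition can be checked numerically. The sketch below is not part of the lab code: the conditional probabilities are made-up numbers chosen only for illustration. It shows that the product of the conditionals and the exponentiated sum of their logarithms give the same joint probability, and that the negative sum is the quantity a cross-entropy loss accumulates.

```python
import math

# Hypothetical conditional probabilities P(x_i | x_1, ..., x_{i-1}) for the
# example sentence; the numbers are invented purely for illustration.
conditional_probabilities = [
    ("the", 0.05),  # P(the)
    ("cat", 0.01),  # P(cat | the)
    ("sat", 0.02),  # P(sat | the, cat)
    ("on", 0.30),   # P(on | the, cat, sat)
    ("the", 0.40),  # P(the | the, cat, sat, on)
    ("mat", 0.03),  # P(mat | the, cat, sat, on, the)
]

# Chain rule: the joint probability is the product of the conditionals.
joint_probability = math.prod(p for _, p in conditional_probabilities)

# Equivalent form used during training: a sum of conditional log probabilities.
log_probability = sum(math.log(p) for _, p in conditional_probabilities)

# The negative log likelihood is what we minimize.
negative_log_likelihood = -log_probability

print(joint_probability, math.exp(log_probability), negative_log_likelihood)
```

Working with sums of logarithms also avoids the numerical underflow that a product of many small probabilities would cause for long sequences.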

## Prepare data

In this lab we'll work with discrete input data. We'll start with the simplest approach: _tokenizing_ the string at each character. Tokenizing refers to the process of taking a raw string of characters and converting it into a sequence of tokens. In language modelling, the tokens are typically larger units, e.g. words or word pieces, but we'll start out simply. As an example we'll use the collected works of Jane Austen (under `data/jane_austen`). The texts are formatted as markdown but stored in `.txt` files, so we search for all such files in the directory.

In [5]:
from pathlib import Path
data_directory = Path('data/jane_austen')
data_files = sorted(data_directory.glob('*.txt'))
data_files

[PosixPath('data/jane_austen/Jane Austen, Emma (1815).txt'),
 PosixPath('data/jane_austen/Jane Austen, Juvenilia (1787–93).txt'),
 PosixPath('data/jane_austen/Jane Austen, Lady Susan (1871).txt'),
 PosixPath('data/jane_austen/Jane Austen, Lesley Castle (1792).txt'),
 PosixPath('data/jane_austen/Jane Austen, Love and Freindship (1790).txt'),
 PosixPath('data/jane_austen/Jane Austen, Mansfield Park (1814).txt'),
 PosixPath('data/jane_austen/Jane Austen, Northanger Abbey (1818).txt'),
 PosixPath('data/jane_austen/Jane Austen, Persuasion (1818).txt'),
 PosixPath('data/jane_austen/Jane Austen, Pride and Prejudice (1813).txt'),
 PosixPath('data/jane_austen/Jane Austen, Sense and Sensibility (1811).txt'),
 PosixPath('data/jane_austen/Jane Austen, The history of England (1791).txt')]

The data is small, so we can actually ingest the whole dataset. We will create a list of strings which can later be used to train our models. 

In [6]:
loaded_data = []
for data_file in data_files:
    with open(data_file, encoding='utf-8') as fp:  # The texts contain non-ASCII characters, so we state the encoding explicitly
        loaded_data.append(fp.read())
loaded_data


['# Emma\n\n✍ Jane Austen\n\n## Volume I\n\n### Chapter I\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her.\n\nShe was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister’s marriage, been mistress of his house from a very early period. Her mother had died too long ago for her to have more than an indistinct remembrance of her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.\n\nSixteen years had Miss Taylor been in Mr.\xa0Woodhouse’s family, less as a governess than a friend, very fond of both daughters, but particularly of Emma. Between *them* it was more the intimacy of sisters. Even before Miss Taylor had ceased to hold the nominal office of governess,

### The tokenizer

In this example we'll use a very simple tokenizer, but in general you can think of tokenization as an important preprocessing step when creating a language model.

It's important to realize that the model will see each token as its own discrete, atomic symbol. It will not be able to _see_ beforehand that tokens might have similarities. For example, if we tokenize by splitting the text on whitespace and punctuation we might get the tokens "book" and "bookseller". From the model's point of view these will be as similar as the words "book" and "profiting". In the field of computational linguistics, tokenization was long an important consideration for making the model learn to exploit similarities between words.

In our example, we tokenize by simply breaking the text down into its most basic units, the characters of the text. This makes the problem a lot harder for the model: it will need to spend quite a lot of time learning which sequences of characters actually form words.



In [7]:
def character_tokenizer(text):
    """Return a sequence of tokens for the given text, one token per character."""
    return list(text)
    

In [8]:
tokenized_data = [list(character_tokenizer(text)) for text in loaded_data]
tokenized_data[0]

['#',
 ' ',
 'E',
 'm',
 'm',
 'a',
 '\n',
 '\n',
 '✍',
 ' ',
 'J',
 'a',
 'n',
 'e',
 ' ',
 'A',
 'u',
 's',
 't',
 'e',
 'n',
 '\n',
 '\n',
 '#',
 '#',
 ' ',
 'V',
 'o',
 'l',
 'u',
 'm',
 'e',
 ' ',
 'I',
 '\n',
 '\n',
 '#',
 '#',
 '#',
 ' ',
 'C',
 'h',
 'a',
 'p',
 't',
 'e',
 'r',
 ' ',
 'I',
 '\n',
 '\n',
 'E',
 'm',
 'm',
 'a',
 ' ',
 'W',
 'o',
 'o',
 'd',
 'h',
 'o',
 'u',
 's',
 'e',
 ',',
 ' ',
 'h',
 'a',
 'n',
 'd',
 's',
 'o',
 'm',
 'e',
 ',',
 ' ',
 'c',
 'l',
 'e',
 'v',
 'e',
 'r',
 ',',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 'r',
 'i',
 'c',
 'h',
 ',',
 ' ',
 'w',
 'i',
 't',
 'h',
 ' ',
 'a',
 ' ',
 'c',
 'o',
 'm',
 'f',
 'o',
 'r',
 't',
 'a',
 'b',
 'l',
 'e',
 ' ',
 'h',
 'o',
 'm',
 'e',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 'h',
 'a',
 'p',
 'p',
 'y',
 ' ',
 'd',
 'i',
 's',
 'p',
 'o',
 's',
 'i',
 't',
 'i',
 'o',
 'n',
 ',',
 ' ',
 's',
 'e',
 'e',
 'm',
 'e',
 'd',
 ' ',
 't',
 'o',
 ' ',
 'u',
 'n',
 'i',
 't',
 'e',
 ' ',
 's',
 'o',
 'm',
 'e',
 ' ',
 'o',
 'f',
 '

### Embeddings

In a neural network, everything needs to be represented as a vector of real numbers. If the data we want to operate on is _categorical_ we need to first change the representation into dense vectors. We will do this by creating a look-up table, where each discrete token in our data (e.g. the characters of the text when using character tokenization) is represented by its _own_ vector of real values. We will create these vectors using random values and the vector associated with a particular discrete token is called its _embedding_.

To make this lookup efficient, we will create an intermediary representation, where we assign each token an integer value in the range of $[1, n]$, where $n$ is the number of tokens we have. We will reserve the integer code $0$ for the empty token used for padding; positions holding this code are masked out so that the model ignores them.

The embedding vectors will then be collected in a matrix, with one vector per row. The embedding process can then be efficiently implemented by just indexing into this matrix with the token integer code.
To create these encodings, we will create a map from all tokens we expect in our input to a set of integers using a Python dictionary. We will also create a special _unknown_ `<UNK>` token, which is used whenever we see an input token we didn't expect. This is often used in practice if we need to limit the number of tokens. For example, we might choose to use only the 50000 most common words in our data, even though there might be millions. Any word less common than these 50000 will be replaced by the `<UNK>` token.
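
As a minimal sketch of the lookup described above, the NumPy example below uses a made-up five-entry vocabulary (`toy_encoding_map` and the dimension 3 are assumptions for illustration, not the values used later in the lab). It shows that embedding a sequence amounts to indexing rows of a matrix with integer codes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary: code 0 is the empty token, code 1 is <UNK>.
toy_encoding_map = {"<EMPTY>": 0, "<UNK>": 1, "a": 2, "b": 3, "c": 4}
embedding_dimension = 3

# One row per integer code, filled with random values.
embedding_matrix = rng.normal(size=(len(toy_encoding_map), embedding_dimension))

# Encode a string; 'z' is not in the vocabulary, so it maps to <UNK>.
unk_code = toy_encoding_map["<UNK>"]
codes = np.array([toy_encoding_map.get(ch, unk_code) for ch in "abz"])

# Embedding is just indexing rows of the matrix with the integer codes.
embedded = embedding_matrix[codes]

print(codes)           # integer codes for 'a', 'b' and the unknown 'z'
print(embedded.shape)  # one vector of length embedding_dimension per token
```

In Keras the `Embedding` layer plays the role of `embedding_matrix`, with the difference that its rows are updated during training.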

In [9]:
from collections import Counter # We will use a counter to keep track of which tokens are the most common
vocabulary_counter = Counter()
for tokenized_text in tokenized_data:
    vocabulary_counter.update(tokenized_text)
vocabulary_counter.most_common()

[(' ', 761113),
 ('e', 426703),
 ('t', 289220),
 ('a', 263957),
 ('o', 261579),
 ('n', 241421),
 ('i', 215755),
 ('r', 207083),
 ('s', 206598),
 ('h', 206493),
 ('d', 140730),
 ('l', 131991),
 ('u', 97980),
 ('m', 83808),
 ('c', 77034),
 ('f', 76155),
 ('w', 75743),
 ('y', 75100),
 ('g', 64571),
 (',', 60093),
 ('p', 51561),
 ('b', 49888),
 ('v', 37043),
 ('.', 35377),
 ('\n', 22017),
 ('k', 19677),
 ('I', 17258),
 ('M', 11018),
 (';', 10762),
 ('“', 9300),
 ('”', 9209),
 ('T', 5947),
 ('—', 5646),
 ('x', 5583),
 ('H', 5311),
 ('’', 5267),
 ('S', 5147),
 ('\xa0', 5118),
 ('E', 4780),
 ('C', 4473),
 ('W', 4400),
 ('q', 4082),
 ('*', 3954),
 ('A', 3925),
 ('B', 3894),
 ('!', 3661),
 ('j', 3494),
 ('-', 3060),
 ('?', 2763),
 ('L', 2682),
 ('F', 2435),
 ('Y', 2027),
 ('D', 2010),
 ('N', 1848),
 ('J', 1636),
 ('z', 1576),
 ('P', 1365),
 ('O', 1294),
 ('R', 1138),
 (':', 1032),
 ('G', 858),
 ('#', 815),
 ('K', 643),
 ('(', 509),
 (')', 509),
 ('V', 425),
 ('U', 327),
 ('‘', 281),
 ('X', 247)

In [10]:
# We now take the vocabulary_counter and create a dictionary from each token to an integer code.
# We reserve the first two code positions for the empty token (used for padding sequences) and
# the unknown token
token_encoding_map = dict()
token_encoding_map['<EMPTY>'] = 0
token_encoding_map['<UNK>'] = 1
i = 2
for token, count in vocabulary_counter.most_common():
    token_encoding_map[token] = i
    i += 1
token_encoding_map

{'<EMPTY>': 0,
 '<UNK>': 1,
 ' ': 2,
 'e': 3,
 't': 4,
 'a': 5,
 'o': 6,
 'n': 7,
 'i': 8,
 'r': 9,
 's': 10,
 'h': 11,
 'd': 12,
 'l': 13,
 'u': 14,
 'm': 15,
 'c': 16,
 'f': 17,
 'w': 18,
 'y': 19,
 'g': 20,
 ',': 21,
 'p': 22,
 'b': 23,
 'v': 24,
 '.': 25,
 '\n': 26,
 'k': 27,
 'I': 28,
 'M': 29,
 ';': 30,
 '“': 31,
 '”': 32,
 'T': 33,
 '—': 34,
 'x': 35,
 'H': 36,
 '’': 37,
 'S': 38,
 '\xa0': 39,
 'E': 40,
 'C': 41,
 'W': 42,
 'q': 43,
 '*': 44,
 'A': 45,
 'B': 46,
 '!': 47,
 'j': 48,
 '-': 49,
 '?': 50,
 'L': 51,
 'F': 52,
 'Y': 53,
 'D': 54,
 'N': 55,
 'J': 56,
 'z': 57,
 'P': 58,
 'O': 59,
 'R': 60,
 ':': 61,
 'G': 62,
 '#': 63,
 'K': 64,
 '(': 65,
 ')': 66,
 'V': 67,
 'U': 68,
 '‘': 69,
 'X': 70,
 '1': 71,
 '2': 72,
 '0': 73,
 '3': 74,
 '>': 75,
 '8': 76,
 '5': 77,
 '[': 78,
 ']': 79,
 '7': 80,
 '4': 81,
 'Q': 82,
 '9': 83,
 '6': 84,
 '_': 85,
 '&': 86,
 '…': 87,
 '\\': 88,
 '✍': 89,
 'é': 90,
 'Z': 91,
 '£': 92,
 '"': 93,
 '–': 94,
 'ê': 95,
 '☞': 96,
 '^': 97,
 'à': 98,
 '❧':

In [11]:
# We also create the inverse dictionary so that we can go the other way around
inverse_token_encoding_map = {i: token for token, i in token_encoding_map.items()}

In [12]:
def encode_tokenized_text(tokenized_text):
    unk_code = token_encoding_map['<UNK>']
    # By using .get() on the dictionary instead of subscript (token_encoding_map[c]) 
    # we can supply a default value to use if that token isn't in the encoding map. 
    # This allows us to handle out of vocabulary tokens by simply replacing them with the <UNK> token (its encoding actually)
    encoded_text = [token_encoding_map.get(c, unk_code) for c in tokenized_text]
    return encoded_text

def decode_encoded_text(encoded_text):
    decoded_text = [inverse_token_encoding_map.get(x, '<UNK>') for x in encoded_text]
    return decoded_text

We now have what we need to create our datasets. We will encode all the texts into integer sequences, and then decode them again to make sure the process worked.

In [13]:
# We start by printing the first 100 characters from the first text
print("Tokenized data:", tokenized_data[0][:100])

encoded_data = []
for tokenized_text in tokenized_data:
    encoded_text = encode_tokenized_text(tokenized_text)
    encoded_data.append(encoded_text)

print("Encoded data:  ", encoded_data[0][:100])
test_decode = decode_encoded_text(encoded_data[0])
print("Decoded data:  ", test_decode[:100])

Tokenized data: ['#', ' ', 'E', 'm', 'm', 'a', '\n', '\n', '✍', ' ', 'J', 'a', 'n', 'e', ' ', 'A', 'u', 's', 't', 'e', 'n', '\n', '\n', '#', '#', ' ', 'V', 'o', 'l', 'u', 'm', 'e', ' ', 'I', '\n', '\n', '#', '#', '#', ' ', 'C', 'h', 'a', 'p', 't', 'e', 'r', ' ', 'I', '\n', '\n', 'E', 'm', 'm', 'a', ' ', 'W', 'o', 'o', 'd', 'h', 'o', 'u', 's', 'e', ',', ' ', 'h', 'a', 'n', 'd', 's', 'o', 'm', 'e', ',', ' ', 'c', 'l', 'e', 'v', 'e', 'r', ',', ' ', 'a', 'n', 'd', ' ', 'r', 'i', 'c', 'h', ',', ' ', 'w', 'i', 't', 'h', ' ']
Encoded data:   [63, 2, 40, 15, 15, 5, 26, 26, 89, 2, 56, 5, 7, 3, 2, 45, 14, 10, 4, 3, 7, 26, 26, 63, 63, 2, 67, 6, 13, 14, 15, 3, 2, 28, 26, 26, 63, 63, 63, 2, 41, 11, 5, 22, 4, 3, 9, 2, 28, 26, 26, 40, 15, 15, 5, 2, 42, 6, 6, 12, 11, 6, 14, 10, 3, 21, 2, 11, 5, 7, 12, 10, 6, 15, 3, 21, 2, 16, 13, 3, 24, 3, 9, 21, 2, 5, 7, 12, 2, 9, 8, 16, 11, 21, 2, 18, 8, 4, 11, 2]
Decoded data:   ['#', ' ', 'E', 'm', 'm', 'a', '\n', '\n', '✍', ' ', 'J', 'a', 'n', 'e', ' ', 'A', 'u',
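
The round trip above only involves tokens that are in the vocabulary. The sketch below shows what happens when the vocabulary is limited, using a made-up string and a hypothetical limit of the two most common tokens (`toy_text`, `toy_map` and the limit are assumptions for illustration). Every other token is replaced by `<UNK>`, so the decoded text is no longer identical to the input.

```python
from collections import Counter

# Hypothetical toy data: 'a' and 'b' are common, ' ' and 'c' are rare.
toy_text = list("abba cab")
toy_counter = Counter(toy_text)

# Keep only the 2 most common tokens, after the two reserved codes.
toy_map = {"<EMPTY>": 0, "<UNK>": 1}
for token, _ in toy_counter.most_common(2):
    toy_map[token] = len(toy_map)
toy_inverse_map = {code: token for token, code in toy_map.items()}

# Tokens outside the limited vocabulary fall back to the <UNK> code.
encoded = [toy_map.get(token, toy_map["<UNK>"]) for token in toy_text]
decoded = [toy_inverse_map[code] for code in encoded]

print(encoded)
print("".join(decoded))
```

The encoding is therefore lossy as soon as the vocabulary is truncated, which is the trade-off made when limiting the number of tokens.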

### What about one-hot vectors?

As a general rule of thumb, you should never represent categorical variables as one-hot vectors when using them as inputs to neural networks. It's computationally wasteful and makes optimization with momentum optimizers messier. Instead, use an `Embedding` layer and encode the input as integers. If you like, you can think of this as a sparse representation of one-hot vectors, where the integer is essentially the index of the "hot" bit in the vector. One case where you might want to use one-hot vectors is for the targets together with the `CategoricalCrossentropy` loss, but here it's also better to stick with integer encoded categorical variables and use `SparseCategoricalCrossentropy`, unless your targets have their probability mass spread over more than one class value.
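
The equivalence can be illustrated with plain NumPy. This is a sketch with made-up sizes and random values, not the Keras implementation: multiplying one-hot vectors with an embedding matrix gives the same result as indexing the matrix, and the cross-entropy computed from one-hot targets equals the one computed from integer targets.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dimension = 5, 3  # hypothetical toy sizes
embedding_matrix = rng.normal(size=(vocab_size, embedding_dimension))

codes = np.array([2, 0, 4])          # integer encoded tokens
one_hot = np.eye(vocab_size)[codes]  # the same tokens as one-hot rows

# Inputs: a matrix product with one-hot rows just selects rows of the matrix.
embedded_via_matmul = one_hot @ embedding_matrix
embedded_via_lookup = embedding_matrix[codes]
print(np.allclose(embedded_via_matmul, embedded_via_lookup))

# Targets: cross-entropy with one-hot targets vs. integer targets.
logits = rng.normal(size=(len(codes), vocab_size))
probabilities = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
dense_loss = -(one_hot * np.log(probabilities)).sum(axis=1)
sparse_loss = -np.log(probabilities[np.arange(len(codes)), codes])
print(np.allclose(dense_loss, sparse_loss))
```

The one-hot versions do `vocab_size` times more multiplications to arrive at the same numbers, which is the waste referred to above.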

## A data sampler - `keras.utils.Sequence`

For convenience we will create a data sampler. Its job will be to supply the training with mini-batches of text sequences from all our Jane Austen books. In practice it's a good idea to create such a wrapper for our dataset since it allows us to control the batches delivered to the training loop while still allowing the data loading to be done in sequence. 

In this case we will take the simplest possible approach to just randomly sample subsequences from all of our texts.

We will incorporate the above preprocessing steps into this class, so that it encapsulates all the bookkeeping information we need to encode and decode data.

In [143]:
from keras.utils import Sequence, pad_sequences
from collections import Counter # We will use a counter to keep track of which tokens are the most common
import numpy 

class RandomTextDataset(Sequence):
    def __init__(self, text_files, context_length, batch_size, tokenizer=character_tokenizer, unk_string = '<UNK>', empty_string='<EMPTY>', rng=None, max_vocab=None) -> None:
        super().__init__()
        if rng is None:
            rng = numpy.random.default_rng()
        self.rng = rng
                
        self.context_length = context_length
        self.batch_size = batch_size
        
        self.unk_string = unk_string
        self.empty_string = empty_string
        self.max_vocab = max_vocab
        
        self.tokenzier = tokenizer
        
        # Load the files passed to the constructor (not the global data_files variable)
        loaded_data = []
        for data_file in text_files:
            with open(data_file, encoding='utf-8') as fp:
                loaded_data.append(fp.read())
        
        tokenized_data = [list(tokenizer(text)) for text in loaded_data]
        
        vocabulary_counter = Counter()
        for tokenized_text in tokenized_data:
            vocabulary_counter.update(tokenized_text)
            
        # We now take the vocabulary_counter and create a dictionary from each token to an integer code.
        # We reserve the first two code positions for the empty token (used for padding sequences) and 
        # the unknown token
        self.token_encoding_map = dict()
        self.token_encoding_map[self.empty_string] = 0
        self.token_encoding_map[self.unk_string] = 1
        i = 2
        for token, count in vocabulary_counter.most_common(self.max_vocab):
            self.token_encoding_map[token] = i
            i += 1

        # We also create the inverse dictionary so that we can go the other way around
        self.inverse_token_encoding_map = {i: token for token, i in self.token_encoding_map.items()}
        self.encoded_texts = [self.encode_tokenized_text(text) for text in tokenized_data]
        
        # The +1 is because we create a input and target sequence by taking one which is context_length+1, 
        # and then shift the target one step to the left while dropping the last token for the input
        self.n = sum(len(encoded_text)//((self.context_length+1)*self.batch_size) for encoded_text in self.encoded_texts)  
    
    def __len__(self):
        return self.n
    
    def __getitem__(self, item):
        # We don't actually sample the text round-robin, instead we just take any text. 
        input_sequences, target_sequences = zip(*[self.sample_sequence() for i in range(self.batch_size)])
        pad_size_input = max(len(s) for s in input_sequences)
        input_sequences_padded = pad_sequences(input_sequences, pad_size_input, padding="post", value=0)
        target_sequences_padded = pad_sequences(target_sequences, pad_size_input, padding="post", value=0)
        return input_sequences_padded, target_sequences_padded
        
        
    def sample_sequence(self):
        text_to_sample = self.rng.integers(len(self.encoded_texts))
        sampled_text = self.encoded_texts[text_to_sample]
        start_index = self.rng.integers(len(sampled_text)- self.context_length - 1)
        sampled_sequence = sampled_text[start_index: start_index+self.context_length+1]
        input_sequence = sampled_sequence[:-1]
        target_sequence = sampled_sequence[1:]
        return input_sequence, target_sequence
    
    def encode_text(self, text):
        tokenized_text = self.tokenize_text(text)
        encoded_text = self.encode_tokenized_text(tokenized_text)
        return encoded_text
    
    def tokenize_text(self, text):
        return self.tokenzier(text)
    
    def encode_tokenized_text(self, tokenized_text):
        unk_code = self.token_encoding_map[self.unk_string]
        # By using .get() on the dictionary instead of subscript (token_encoding_map[c]) 
        # we can supply a default value to use if that token isn't in the encoding map. 
        # This allows us to handle out of vocabulary tokens by simply replacing them with the <UNK> token (its encoding actually)
        encoded_text = [self.token_encoding_map.get(c, unk_code) for c in tokenized_text]
        return encoded_text

    def decode_encoded_text(self, encoded_text):
        decoded_text = [self.inverse_token_encoding_map.get(x, self.unk_string) for x in encoded_text]
        return decoded_text

    def get_vocab_size(self):
        return len(self.token_encoding_map)
        
        

In [144]:
## Define some data parameters
CONTEXT_LENGTH = 256  # The length of the sequences we will train on
BATCH_SIZE = 64  # How many examples we'll process per batch.

In [147]:
dataset = RandomTextDataset(data_files, CONTEXT_LENGTH, BATCH_SIZE, tokenizer=character_tokenizer, max_vocab=None)  # If you set max_vocab to an integer n, only the most frequent n tokens will be used. The remainder will be replaced by <UNK>

In [146]:
dataset[0]

(array([[17, 17,  2, ...,  2,  8,  7],
        [13,  5,  8, ..., 21,  2, 14],
        [ 4, 11,  8, ...,  8,  7, 20],
        ...,
        [ 6, 14,  2, ...,  2, 15,  5],
        [ 7, 10, 21, ..., 20,  2, 15],
        [11,  5,  4, ...,  9,  2,  1]], dtype=int32),
 array([[17,  2, 17, ...,  8,  7, 16],
        [ 5,  8, 12, ...,  2, 14,  7],
        [11,  8,  7, ...,  7, 20,  2],
        ...,
        [14,  2, 11, ..., 15,  5,  4],
        [10, 21,  2, ...,  2, 15, 19],
        [ 5,  4,  2, ...,  2,  1,  3]], dtype=int32))
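
Each row of the second array is the corresponding row of the first array shifted one step to the left. The sketch below repeats the slicing done in `sample_sequence()` on a short made-up character list, with a hypothetical context length of 5 and a fixed start index instead of a random one.

```python
# Hypothetical toy text and context length, chosen only for illustration.
toy_tokens = list("Emma Woodhouse")
context_length = 5
start_index = 0  # the dataset class draws this index at random

# Take context_length + 1 tokens, then split into input and target.
window = toy_tokens[start_index:start_index + context_length + 1]
input_sequence = window[:-1]   # everything except the last token
target_sequence = window[1:]   # everything except the first token

# At every position the target is the token that follows the input token.
for input_token, target_token in zip(input_sequence, target_sequence):
    print(repr(input_token), "->", repr(target_token))
```

This way a single sequence of length `context_length` provides one next-token prediction task for every position.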

## The model

We're now ready to start training a model, but there are some things we need to decide on first. We will train our RNNs (LSTMs actually) using _truncated_ Back-Propagation Through Time. The truncation comes from the fact that we can't train the model on whole books; we simply don't have enough memory to do so. Instead we train the model by showing it _truncated_ parts of the books. Ideally, we want to show it sequences as long as we can fit into the memory of our GPUs, so the sequence length (`CONTEXT_LENGTH`) will depend on that. We also want to make use of the parallel nature of the GPU, so we train on multiple sequences in parallel in a mini-batch.

In [127]:
import keras.losses


epochsVal = 100
learnRateVal = 0.01
batchSizeVal = 10
opt = Adam(learning_rate=learnRateVal)

embedding_dimension = 32
rnn_dimension = 128
output_projection_dimension = 128

num_embeddings = dataset.get_vocab_size()
model = Sequential()
model.add(Embedding(num_embeddings, embedding_dimension, mask_zero=True))
# Add an LSTM layer; return_sequences=True gives one output per time step, so we get a prediction at every position
model.add(LSTM(units=rnn_dimension, return_sequences=True, activation="tanh", unit_forget_bias=True, recurrent_dropout=0, dropout=0.2, use_bias=True))
# Add dense layer with activation for categorical output
model.add(Dense(output_projection_dimension, activation="relu"))
model.add(Dense(num_embeddings))
# Compile model using loss function for categorical data

# We use the sparse loss since we've integer encoded the targets.
# We also set from_logits=True, since the model doesn't apply the softmax explicitly
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(loss=loss_fn, optimizer=opt, metrics=["accuracy"])
model.summary()

Model: "sequential_16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_16 (Embedding)    (None, None, 32)          3328      
                                                                 
 lstm_17 (LSTM)              (None, None, 128)         82432     
                                                                 
 dense_18 (Dense)            (None, None, 128)         16512     
                                                                 
 dense_19 (Dense)            (None, None, 104)         13416     
                                                                 
Total params: 115,688
Trainable params: 115,688
Non-trainable params: 0
_________________________________________________________________
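
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers, with the vocabulary size of 104 read off the output shape of the last layer in the summary above.

```python
vocab_size = 104  # read from the summary: output shape of the last Dense layer
embedding_dimension = 32
rnn_dimension = 128
output_projection_dimension = 128

# Embedding: one vector of length embedding_dimension per token code.
embedding_params = vocab_size * embedding_dimension

# LSTM: four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (rnn_dimension * (embedding_dimension + rnn_dimension) + rnn_dimension)

# Dense layers: a weight matrix plus one bias per output unit.
projection_params = rnn_dimension * output_projection_dimension + output_projection_dimension
output_params = output_projection_dimension * vocab_size + vocab_size

total_params = embedding_params + lstm_params + projection_params + output_params
print(embedding_params, lstm_params, projection_params, output_params, total_params)
```

Most of the parameters sit in the LSTM layer, which is worth keeping in mind when changing `rnn_dimension`.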


## Training the model
We're now ready to train the model. Do this by running the `fit()` method of the model object. You should see a steady drop in loss and a rise in accuracy. For this problem we're not tracking the performance on a development (validation) set, so it's hard to tell whether the model overfits. It likely doesn't have the capacity to do so, and for the purpose of this lab we can allow it to overfit.

In [150]:
model.fit(dataset, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f29ec446910>

## Generating text - ChatLSTM

Now that we've trained our model, we can try using it to generate some text. First, choose a text to prompt it with by setting the `prompt` variable below. You can change the length of the generated text by setting `generation_length` to some desired value. A third parameter, `temperature`, controls how random the sampling of the next character is. It should be a float value strictly larger than $0$. If it's set to $1$, the next character is sampled from the learnt probability distribution of the model. As the temperature gets closer to $0$, sampling gets closer to always picking the most probable character, and as the temperature grows above $1$, the distribution approaches the uniform distribution over the next character.
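
The effect of the temperature can be seen on a small made-up example. In the sketch below, `temperature_softmax` is a hypothetical helper written for this illustration and `toy_logits` are invented values; the same three logits are turned into probabilities at a low, a neutral and a high temperature.

```python
import numpy as np

def temperature_softmax(logits, temperature):
    """Softmax of logits / temperature (hypothetical helper for illustration)."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - scaled.max()  # subtracting the max avoids overflow
    exponentiated = np.exp(scaled)
    return exponentiated / exponentiated.sum()

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate characters

low = temperature_softmax(toy_logits, 0.1)       # close to a greedy choice
neutral = temperature_softmax(toy_logits, 1.0)   # the model's own distribution
high = temperature_softmax(toy_logits, 10.0)     # close to uniform

for name, p in [("low", low), ("neutral", neutral), ("high", high)]:
    print(name, np.round(p, 3))
```

A low temperature tends to give safe but repetitive text, while a high temperature gives more varied text with more mistakes.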

In [160]:
prompt = "It is a truth universally acknowledged, that "
generation_length = 100
temperature = 0.3
random_seed = 1729  # Set to None to use a different random seed each time this cell is executed
rng = np.random.default_rng(random_seed)

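def softmax(logits, temperature):
    # Use float64 and subtract the maximum before exponentiating for numerical stability;
    # this doesn't change the resulting probabilities
    temperature_scaled_logits = np.asarray(logits, dtype=np.float64) / temperature
    temperature_scaled_logits = temperature_scaled_logits - np.max(temperature_scaled_logits)
    exponentiated_logits = np.exp(temperature_scaled_logits)
    return exponentiated_logits / np.sum(exponentiated_logits)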


encoded_prompt = np.array(dataset.encode_text(prompt), dtype=np.int32)
for i in range(generation_length):
    next_token_logits = model.predict(encoded_prompt[None, ...], verbose=0)[0, -1]
    p =  softmax(next_token_logits, temperature)
    sampled_token = rng.choice(len(p), p=p)
    encoded_prompt = np.concatenate([encoded_prompt, [sampled_token]])
    
generated_text = "".join(dataset.decode_encoded_text(encoded_prompt))
print(generated_text)

It is a truth universally acknowledged, that the first attention of the same as the death of the contrary of his absence of his reproacher of the


## Exercises

Now that we've tested the language model, let's try to modify it and see what effect it might have.

### Exercise 1

We've trained the model by just taking the text in as is. Often it's a good idea to preprocess the text to make the learning easier. 
 
 - Create a new tokenizer that converts each character to lower case before returning it.
 - Do this by implementing a new function `lower_case_character_tokenizer()`.
 - Look at the [`str class`](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str) in the Python documentation for suitable methods to convert a string to lower case.
 - When you create the dataset, give this function as the tokenizer input instead. 
 - Now train a model with this new dataset. Do you notice any performance difference after you've trained for the same amount of epochs?


### Exercise 2

Training the neural network to predict the next character requires a lot of capacity. 
- Try adding two more `LSTM` layers to the model and train it for the same number of epochs as the previous model.
- Can you see any difference in performance?
- Try increasing the dimensionality of the `LSTM` layers. What effect does it have on the model?

### Exercise 3

Instead of increasing the model capacity to solve the problem, we can try to change the data.
- Implement a _new tokenizer_ that splits the text on white space and punctuation instead of at each character.
- You can use a regular expression and the [`re module`](https://docs.python.org/3/library/re.html) to split the string. Below is a useful snippet:
  ```python
  import re
  tokenized_text = re.split(r"[;.!:, \-_'\n\t#]+", loaded_data[0])
  ```
  The part inside the brackets of the regular expression lists all the characters we want to remove. The list might not be exhaustive.

- The tokenizer should remove any empty strings from the result, these would behave weirdly in the model. You can use something like:
  ```python
  tokenized_text = [token for token in tokenized_text if token]
  ```

- Use this new white space tokenizer as the tokenizer input to the dataset class. You should probably also set the `max_vocab` parameter to something appropriate, like 10000. This limits the number of distinct words the dataset uses to this number.

- Now train the model using this new tokenizer. What can you say about loss? After training it for some epochs and trying it, how does the text generation change?

### Exercise 4

Instead of using the collected works of Jane Austen, try to train a language model on your own dataset or on the Shakespeare dataset also provided in the `data` directory.
