# Preparations


Execute the following code blocks to configure the session and import relevant modules.

In [None]:
%config InlineBackend.figure_format ='retina'
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
import keras
keras.__version__

'2.15.0'

In [None]:
from pathlib import Path

import numpy as np
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

#from tensorflow.keras.optimizer_v2.adam import Adam
# Alternatively:
from tensorflow.keras.optimizers import Adam


## Running on colab
You can use this [link](https://colab.research.google.com/github/NBISweden/workshop-neural-nets-and-deep-learning/blob/master/session_recurrentNeuralNetworks/lab_language_modelling/language_modelling.ipynb) to run the notebook on Google Colab. If you do so, you should first save a copy to your own Google Drive before you start working on the notebook. Otherwise the changes you make will not be saved.

## Download data
You can download and extract the data to your local machine by executing the cell below. This is useful if you haven't extracted the archive in the repository, or are running the notebook outside of the repository (e.g. on Colab).

In [None]:
# Run this cell if you don't have the data, for example when running the notebook on Colab. It downloads the archive from GitHub and extracts it into the data directory.
data_directory = Path('data')
data_url = "https://github.com/NBISweden/workshop-neural-nets-and-deep-learning/raw/master/session_recurrentNeuralNetworks/lab_language_modelling/data/jane_austen.zip"

if not (data_directory / 'jane_austen').exists():
    if not data_directory.exists():
        data_directory.mkdir(parents=True)
    if not (data_directory / 'jane_austen.zip').exists():
        from urllib.request import urlretrieve
        urlretrieve(data_url, 'data/jane_austen.zip')
    if (data_directory / 'jane_austen.zip').exists():
        import zipfile
        with zipfile.ZipFile(data_directory / 'jane_austen.zip') as zf:
            zf.extractall(data_directory)


# Lab session: Language Modelling

A long-standing application of RNNs is to model natural language. In the field of computational linguistics, a "language model" is a model which assigns probabilities to strings of a language. In this lab we will train an RNN to perform this task.

We desire a model which can give us the probability of observing a string in a language, e.g. $P(\text{the}, \text{cat}, \text{sat}, \text{on}, \text{the}, \text{mat})$. Directly modelling this joint probability distribution is problematic, in particular because we will pretty much only see single examples of most strings in our data. The way we model this instead is by using the "chain rule of probability". A joint probability distribution can be broken down into products of conditional distributions according to:

$$
\begin{align*}
P(x_1, \dots, x_n) = P(x_1) \prod_{i=2}^{n} P(x_i | x_{1}, \dots, x_{i-1})
\end{align*}
$$

So with the example of $P(\text{the}, \text{cat}, \text{sat}, \text{on}, \text{the}, \text{mat})$ we can actually model this as
$$
\begin{align*}
P(\text{the}, \text{cat}, \text{sat}, \text{on}, \text{the}, \text{mat}) = P(\text{the})P(\text{cat} | \text{the})P(\text{sat} | \text{the}, \text{cat})P(\text{on} | \text{the}, \text{cat}, \text{sat})P(\text{the}| \text{the}, \text{cat}, \text{sat}, \text{on})P(\text{mat}| \text{the}, \text{cat}, \text{sat}, \text{on}, \text{the})
\end{align*}
$$


<div style="display: flex; justify-content: center;">
<img src="https://github.com/NBISweden/workshop-neural-nets-and-deep-learning/blob/master/session_recurrentNeuralNetworks/lab_language_modelling/images/language_modelling.gif?raw=1" width="500"/>
</div>

As you can see, this is an autoregressive structure, and we'll model it using a recurrent neural network. The equation below illustrates how we model this for the fourth word in the example sentence:


$$
\begin{align*}
P(\text{on} | \text{the}, \text{cat}, \text{sat}) = f(\text{the}, \text{cat}, \text{sat}; \theta)
\end{align*}
$$

Here $f(\bullet ; \theta)$ is our learnable model, our RNN. We then train this model as we would train a classifier; for the example input we try to make it assign all probability to the word "$\text{on}$". We can then use the model to compute the probability of whole sequences by taking the product of the conditional probabilities. In practice, when we maximize the log likelihood, the loss decomposes into a sum of the conditional log probabilities.
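
As a small numeric illustration of the decomposition above, the sketch below multiplies a list of conditional probabilities and compares the result with the sum of their logarithms. The probability values are made up for the example; they do not come from a trained model.

```python
import math

# Hypothetical conditional probabilities P(x_i | x_1, ..., x_{i-1}) for the six
# words of "the cat sat on the mat". The numbers are invented for illustration.
conditional_probabilities = [0.05, 0.01, 0.2, 0.3, 0.4, 0.1]

# Chain rule: the joint probability is the product of the conditionals
joint_probability = math.prod(conditional_probabilities)

# In log space the product turns into a sum
log_probability = sum(math.log(p) for p in conditional_probabilities)

# The training loss for this sequence is the negative log likelihood
negative_log_likelihood = -log_probability

print(joint_probability, math.exp(log_probability), negative_log_likelihood)
```

Working with sums of log probabilities is also numerically safer, since the product of many small probabilities quickly underflows for long sequences.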

## Prepare data

In this lab we'll work with discrete input data. We'll start with the simplest approach and _tokenize_ the string at each character. Tokenizing refers to the process of taking a raw string of characters and converting it into a sequence of tokens. In language modelling, the tokens are typically larger units, e.g. words or word pieces, but we'll start out simply. As an example we'll use the collected works of Jane Austen (under `data/jane_austen`). The texts are Markdown-formatted but stored as `.txt` files, so we search for all such files in the directory.

In [None]:
from pathlib import Path
data_directory = Path('data/jane_austen')
data_files = sorted(data_directory.glob('*.txt'))
data_files

[PosixPath('data/jane_austen/Jane_Austen_Emma_(1815).txt'),
 PosixPath('data/jane_austen/Jane_Austen_Juvenilia_(1787_93).txt'),
 PosixPath('data/jane_austen/Jane_Austen_Lady_Susan_(1871).txt'),
 PosixPath('data/jane_austen/Jane_Austen_Lesley_Castle_1792).txt'),
 PosixPath('data/jane_austen/Jane_Austen_Love_and_Freindship_(1790).txt'),
 PosixPath('data/jane_austen/Jane_Austen_Mansfield_Park_(1814).txt'),
 PosixPath('data/jane_austen/Jane_Austen_Northanger_Abbey_(1818).txt'),
 PosixPath('data/jane_austen/Jane_Austen_Persuasion_(1818).txt'),
 PosixPath('data/jane_austen/Jane_Austen_Pride_and_Prejudice_(1813).txt'),
 PosixPath('data/jane_austen/Jane_Austen_Sense_and_Sensibility_(1811).txt'),
 PosixPath('data/jane_austen/Jane_Austen_The_history_of_England_(1791).txt')]

The data is small, so we can actually ingest the whole dataset. We will create a list of strings which can later be used to train our models.

In [36]:
loaded_data = []
for data_file in data_files:
    with open(data_file) as fp:
        loaded_data.append(fp.read())

print(loaded_data[0][0:100])
# print(loaded_data[1])

# Emma

✍ Jane Austen

## Volume I

### Chapter I

Emma Woodhouse, handsome, clever, and rich, with 
---------------------------------------


### The tokenizer

In this example we'll use a very simple tokenizer, but in general you can think of this as an important preprocessing step when creating a language model.

It's important to realize that the model will see each token as its own discrete, atomic symbol. It will not be able to _see_ beforehand that tokens might have similarities. For example, if we tokenize by splitting the text on whitespace and punctuation we might get the tokens "book" and "bookseller". From the model's point of view these will be as similar as the words "book" and "profiting". In the field of computational linguistics, tokenization was long an important consideration for helping the model exploit similarities between words.

In our example, we tokenize by simply breaking the text down into its most basic unit, the characters of the text. This makes the problem a lot harder for the model, since it will need to spend quite a lot of time learning which sequences of characters actually form words.
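
To make the difference between the two granularities concrete, here is a minimal sketch that tokenizes one sentence both ways. The example sentence and the small punctuation set in the regular expression are assumptions chosen for illustration, not part of the lab data.

```python
import re

sentence = "The bookseller sold a book."

# Character-level tokens: every character, including spaces, is its own symbol
character_tokens = list(sentence.lower())

# Word-level tokens: split on whitespace and a few punctuation marks,
# dropping the empty strings that re.split can produce
word_tokens = [token for token in re.split(r"[ .,;:!?\n]+", sentence.lower()) if token]

print(len(character_tokens), character_tokens[:5])
print(word_tokens)
```

At the word level, "book" and "bookseller" become two unrelated symbols, while at the character level they share the same first four tokens but the sequences become much longer.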



In [100]:
def character_tokenizer(text):
    """Return the lower-cased characters of `text` as a list of tokens.

    `text` can be a string or any iterable of single-character strings.
    """
    return [character.lower() for character in text]


In [87]:
tokenized_data = [list(character_tokenizer(text)) for text in loaded_data]
tokenized_data[0][0:10]

['#', ' ', 'e', 'm', 'm', 'a', '\n', '\n', '✍', ' ']

### Embeddings

In a neural network, everything needs to be represented as a vector of real numbers. If the data we want to operate on is _categorical_ we need to first change the representation into dense vectors. We will do this by creating a look-up table, where each discrete token in our data (e.g. the characters of the text when using character tokenization) is represented by its _own_ vector of real values. We will create these vectors using random values and the vector associated with a particular discrete token is called its _embedding_.

To efficiently make this lookup, we will create an intermediary representation, where we assign each token an integer value in the range of $[1, n]$, where $n$ is the number of tokens we have. We will reserve the integer code $0$ for invalid tokens, and if it's part of the input it will be replaced by the zero vector.

The embedding vectors will then be collected in a matrix, with one vector per row. The embedding process can then be efficiently implemented by just indexing into this matrix with the token integer code.
To create these encodings, we will create a map from all tokens we expect in our input to a set of integers using a Python dictionary. We will also create a special _unknown_ token, `<UNK>`, which is used whenever we see an input token we didn't expect. This is often used in practice when we need to limit the number of tokens. For example, we might choose to use only the 50000 most common words in our data, even though there might be millions; any word less common than these is replaced by the `<UNK>` token.
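
The lookup described above can be sketched with plain NumPy indexing. The vocabulary size, embedding dimension and integer codes below are arbitrary values chosen for illustration, and row 0 is set to zeros by hand to mimic the padding behaviour described in the text; a Keras `Embedding` layer manages a trainable matrix of this kind for you.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: 5 integer codes, 3-dimensional embeddings
vocab_size, embedding_dimension = 5, 3

# One row of random values per integer code
embedding_matrix = rng.normal(size=(vocab_size, embedding_dimension))
embedding_matrix[0] = 0.0  # code 0 (padding) maps to the zero vector in this sketch

# An encoded sequence of integer codes, where code 0 is padding
encoded = np.array([2, 4, 0, 2])

# Embedding is just row indexing: one vector per token
embedded = embedding_matrix[encoded]
print(embedded.shape)
```

Identical codes always receive identical vectors, so the two occurrences of code 2 end up with the same embedding.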

In [88]:
from collections import Counter # We will use a counter to keep track of which tokens are the most common
vocabulary_counter = Counter()
for tokenized_text in tokenized_data:
    vocabulary_counter.update(character_tokenizer(tokenized_text))
vocabulary_counter.most_common()

[(' ', 761113),
 ('e', 431483),
 ('t', 295167),
 ('a', 267882),
 ('o', 262873),
 ('n', 243269),
 ('i', 233013),
 ('h', 211804),
 ('s', 211745),
 ('r', 208221),
 ('d', 142740),
 ('l', 134673),
 ('u', 98307),
 ('m', 94826),
 ('c', 81507),
 ('w', 80143),
 ('f', 78590),
 ('y', 77127),
 ('g', 65429),
 (',', 60093),
 ('b', 53782),
 ('p', 52926),
 ('v', 37468),
 ('.', 35377),
 ('\n', 22017),
 ('k', 20320),
 (';', 10762),
 ('“', 9300),
 ('”', 9209),
 ('x', 5830),
 ('—', 5646),
 ('’', 5267),
 ('j', 5130),
 ('\xa0', 5118),
 ('q', 4128),
 ('*', 3954),
 ('!', 3661),
 ('-', 3060),
 ('?', 2763),
 ('z', 1583),
 (':', 1032),
 ('#', 815),
 ('(', 509),
 (')', 509),
 ('‘', 281),
 ('1', 153),
 ('2', 97),
 ('0', 69),
 ('3', 61),
 ('>', 61),
 ('8', 56),
 ('5', 55),
 ('[', 54),
 (']', 54),
 ('7', 51),
 ('4', 48),
 ('9', 42),
 ('6', 41),
 ('_', 38),
 ('&', 34),
 ('…', 20),
 ('\\', 16),
 ('✍', 11),
 ('é', 9),
 ('£', 6),
 ('"', 5),
 ('–', 4),
 ('ê', 4),
 ('☞', 3),
 ('^', 3),
 ('à', 3),
 ('❧', 2),
 ('<', 2),
 ('

In [89]:
# We now take the vocabulary_counter and create a dictionary from each token to an integer code
# We reserve the first two code positions for the empty token (used for padding sequences) and
# the unknown token
token_encoding_map = dict()
token_encoding_map['<EMPTY>'] = 0 # for padding
token_encoding_map['<UNK>'] = 1 # for unknown tokens not included in the vocabulary
i = 2
for token, count in vocabulary_counter.most_common():
    token_encoding_map[token] = i
    i += 1
token_encoding_map

{'<EMPTY>': 0,
 '<UNK>': 1,
 ' ': 2,
 'e': 3,
 't': 4,
 'a': 5,
 'o': 6,
 'n': 7,
 'i': 8,
 'h': 9,
 's': 10,
 'r': 11,
 'd': 12,
 'l': 13,
 'u': 14,
 'm': 15,
 'c': 16,
 'w': 17,
 'f': 18,
 'y': 19,
 'g': 20,
 ',': 21,
 'b': 22,
 'p': 23,
 'v': 24,
 '.': 25,
 '\n': 26,
 'k': 27,
 ';': 28,
 '“': 29,
 '”': 30,
 'x': 31,
 '—': 32,
 '’': 33,
 'j': 34,
 '\xa0': 35,
 'q': 36,
 '*': 37,
 '!': 38,
 '-': 39,
 '?': 40,
 'z': 41,
 ':': 42,
 '#': 43,
 '(': 44,
 ')': 45,
 '‘': 46,
 '1': 47,
 '2': 48,
 '0': 49,
 '3': 50,
 '>': 51,
 '8': 52,
 '5': 53,
 '[': 54,
 ']': 55,
 '7': 56,
 '4': 57,
 '9': 58,
 '6': 59,
 '_': 60,
 '&': 61,
 '…': 62,
 '\\': 63,
 '✍': 64,
 'é': 65,
 '£': 66,
 '"': 67,
 '–': 68,
 'ê': 69,
 '☞': 70,
 '^': 71,
 'à': 72,
 '❧': 73,
 '<': 74,
 '🎭': 75,
 '👥': 76,
 'æ': 77}

In [90]:
# We also create the inverse dictionary so that we can go the other way around
inverse_token_encoding_map = {i: token for token, i in token_encoding_map.items()}
inverse_token_encoding_map

{0: '<EMPTY>',
 1: '<UNK>',
 2: ' ',
 3: 'e',
 4: 't',
 5: 'a',
 6: 'o',
 7: 'n',
 8: 'i',
 9: 'h',
 10: 's',
 11: 'r',
 12: 'd',
 13: 'l',
 14: 'u',
 15: 'm',
 16: 'c',
 17: 'w',
 18: 'f',
 19: 'y',
 20: 'g',
 21: ',',
 22: 'b',
 23: 'p',
 24: 'v',
 25: '.',
 26: '\n',
 27: 'k',
 28: ';',
 29: '“',
 30: '”',
 31: 'x',
 32: '—',
 33: '’',
 34: 'j',
 35: '\xa0',
 36: 'q',
 37: '*',
 38: '!',
 39: '-',
 40: '?',
 41: 'z',
 42: ':',
 43: '#',
 44: '(',
 45: ')',
 46: '‘',
 47: '1',
 48: '2',
 49: '0',
 50: '3',
 51: '>',
 52: '8',
 53: '5',
 54: '[',
 55: ']',
 56: '7',
 57: '4',
 58: '9',
 59: '6',
 60: '_',
 61: '&',
 62: '…',
 63: '\\',
 64: '✍',
 65: 'é',
 66: '£',
 67: '"',
 68: '–',
 69: 'ê',
 70: '☞',
 71: '^',
 72: 'à',
 73: '❧',
 74: '<',
 75: '🎭',
 76: '👥',
 77: 'æ'}

In [91]:
def encode_tokenized_text(tokenized_text):
    unk_code = token_encoding_map['<UNK>']
    # By using .get() on the dictionary instead of subscript (token_encoding_map[c])
    # we can supply a default value to use if that token isn't in the encoding map.
    # This allows us to handle out of vocabulary tokens by simply replacing them with the <UNK> token (its encoding actually)
    encoded_text = [token_encoding_map.get(c, unk_code) for c in tokenized_text]
    return encoded_text

def decode_encoded_text(encoded_text):
    decoded_text = [inverse_token_encoding_map.get(x, '<UNK>') for x in encoded_text]
    return decoded_text

We now have what we need to create our datasets. We will encode them all into integer sequences, and then go the other way around to make sure the process worked.

In [92]:
# We start by printing the first 100 characters from the first text
print("Tokenized data:", tokenized_data[0][:100])

encoded_data = []
for tokenized_text in tokenized_data:
    encoded_text = encode_tokenized_text(tokenized_text)
    encoded_data.append(encoded_text)

print("Encoded data:  ", encoded_data[0][:100])
test_decode = decode_encoded_text(encoded_data[0])
print("Decoded data:  ", test_decode[:100])

Tokenized data: ['#', ' ', 'e', 'm', 'm', 'a', '\n', '\n', '✍', ' ', 'j', 'a', 'n', 'e', ' ', 'a', 'u', 's', 't', 'e', 'n', '\n', '\n', '#', '#', ' ', 'v', 'o', 'l', 'u', 'm', 'e', ' ', 'i', '\n', '\n', '#', '#', '#', ' ', 'c', 'h', 'a', 'p', 't', 'e', 'r', ' ', 'i', '\n', '\n', 'e', 'm', 'm', 'a', ' ', 'w', 'o', 'o', 'd', 'h', 'o', 'u', 's', 'e', ',', ' ', 'h', 'a', 'n', 'd', 's', 'o', 'm', 'e', ',', ' ', 'c', 'l', 'e', 'v', 'e', 'r', ',', ' ', 'a', 'n', 'd', ' ', 'r', 'i', 'c', 'h', ',', ' ', 'w', 'i', 't', 'h', ' ']
Encoded data:   [43, 2, 3, 15, 15, 5, 26, 26, 64, 2, 34, 5, 7, 3, 2, 5, 14, 10, 4, 3, 7, 26, 26, 43, 43, 2, 24, 6, 13, 14, 15, 3, 2, 8, 26, 26, 43, 43, 43, 2, 16, 9, 5, 23, 4, 3, 11, 2, 8, 26, 26, 3, 15, 15, 5, 2, 17, 6, 6, 12, 9, 6, 14, 10, 3, 21, 2, 9, 5, 7, 12, 10, 6, 15, 3, 21, 2, 16, 13, 3, 24, 3, 11, 21, 2, 5, 7, 12, 2, 11, 8, 16, 9, 21, 2, 17, 8, 4, 9, 2]
Decoded data:   ['#', ' ', 'e', 'm', 'm', 'a', '\n', '\n', '✍', ' ', 'j', 'a', 'n', 'e', ' ', 'a', 'u', 's', '

### What about one-hot vectors?

As a general rule of thumb, you should never represent categorical variables as one-hot vectors when using them as inputs to neural networks. It's wasteful on computation and makes optimization with momentum optimizers messier. Instead, use an `Embedding` layer and encode the input as integers. If you like, you can think of this as a sparse representation of one-hot vectors, where the integer is essentially the index of the "hot" bit in the vector. One case where you might want to use one-hot vectors is for the targets and the `CategoricalCrossentropy` loss, but here it's also better to stick with integer-encoded categorical variables and use `SparseCategoricalCrossentropy` unless your targets have their probability mass spread over more than one class value.
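
The equivalence between the two representations can be checked numerically. The sketch below, with arbitrary sizes and codes chosen for illustration, shows that multiplying one-hot vectors with a weight matrix gives the same result as indexing the matrix with the integer codes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only
vocab_size, embedding_dimension = 6, 4
embedding_matrix = rng.normal(size=(vocab_size, embedding_dimension))

encoded = np.array([3, 1, 5])

# One-hot route: build a (3, 6) matrix with a single 1 per row, then multiply
one_hot = np.eye(vocab_size)[encoded]
via_matmul = one_hot @ embedding_matrix

# Integer route: index the rows directly
via_lookup = embedding_matrix[encoded]

print(np.allclose(via_matmul, via_lookup))
```

The one-hot route spends most of its work multiplying by zeros, and that waste grows with the vocabulary size, while the lookup cost does not.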

## A data sampler - `keras.utils.Sequence`

For convenience we will create a data sampler. Its job will be to supply the training with mini-batches of text sequences from all our Jane Austen books. In practice it's a good idea to create such a wrapper for our dataset since it allows us to control the batches delivered to the training loop while still allowing the data loading to be done in sequence.

In this case we will take the simplest possible approach to just randomly sample subsequences from all of our texts.

We will incorporate the above preprocessing steps into this class, so that it encapsulates all the bookkeeping information we need to encode and decode data.
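
Each training example consists of an input sequence and a target sequence that is the same text shifted by one position, so that the target at every step is the next token. The helper below is a hypothetical stand-alone sketch of that idea, assuming the text is already encoded as a list of integers.

```python
def make_input_target(encoded_text, start_index, context_length):
    """Cut a window of context_length + 1 tokens and split it into input and target."""
    window = encoded_text[start_index:start_index + context_length + 1]
    input_sequence = window[:-1]   # all tokens except the last one
    target_sequence = window[1:]   # the same tokens shifted one step ahead
    return input_sequence, target_sequence


encoded_text = list(range(10))  # stand-in for an encoded book
input_sequence, target_sequence = make_input_target(encoded_text, start_index=2, context_length=4)
print(input_sequence)
print(target_sequence)
```

At position `i`, the model reads `input_sequence[:i + 1]` and is trained to predict `target_sequence[i]`.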

In [93]:
from keras.utils import Sequence, pad_sequences
from collections import Counter # We will use a counter to keep track of which tokens are the most common
import numpy

class RandomTextDataset(Sequence):
    def __init__(self, text_files, context_length, batch_size, tokenizer=character_tokenizer, unk_string = '<UNK>', empty_string='<EMPTY>', rng=None, max_vocab=None) -> None:
        super().__init__()
        if rng is None:
            rng = numpy.random.default_rng()
        self.rng = rng

        self.context_length = context_length
        self.batch_size = batch_size

        self.unk_string = unk_string
        self.empty_string = empty_string
        self.max_vocab = max_vocab

        self.tokenzier = tokenizer

        loaded_data = []
        for data_file in text_files:
            with open(data_file) as fp:
                loaded_data.append(fp.read())

        tokenized_data = [list(tokenizer(text)) for text in loaded_data]

        vocabulary_counter = Counter()
        for tokenized_text in tokenized_data:
            vocabulary_counter.update(tokenized_text)

        # We now take the vocabulary_counter and create a dictionary from each token to an integer code
        # We reserve the first two code positions for the empty token (used for padding sequences) and
        # the unknown token
        self.token_encoding_map = dict()
        self.token_encoding_map[self.empty_string] = 0
        self.token_encoding_map[self.unk_string] = 1
        i = 2
        for token, count in vocabulary_counter.most_common(self.max_vocab):
            self.token_encoding_map[token] = i
            i += 1

        # We also create the inverse dictionary so that we can go the other way around
        self.inverse_token_encoding_map = {i: token for token, i in self.token_encoding_map.items()}
        self.encoded_texts = [self.encode_tokenized_text(text) for text in tokenized_data]

        # The +1 is because we create a input and target sequence by taking one which is context_length+1,
        # and then shift the target one step to the left while dropping the last token for the input
        self.n = sum(len(encoded_text)//((self.context_length+1)*self.batch_size) for encoded_text in self.encoded_texts)

    def __len__(self):
        return self.n

    def __getitem__(self, item):
        # We don't actually sample the text round-robin, instead we just take any text.
        input_sequences, target_sequences = zip(*[self.sample_sequence() for i in range(self.batch_size)])
        pad_size_input = max(len(s) for s in input_sequences)
        input_sequences_padded = pad_sequences(input_sequences, pad_size_input, padding="post", value=0)
        target_sequences_padded = pad_sequences(target_sequences, pad_size_input, padding="post", value=0)
        return input_sequences_padded, target_sequences_padded


    def sample_sequence(self):
        text_to_sample = self.rng.integers(len(self.encoded_texts))
        sampled_text = self.encoded_texts[text_to_sample]
        start_index = self.rng.integers(len(sampled_text)- self.context_length - 1)
        sampled_sequence = sampled_text[start_index: start_index+self.context_length+1]
        input_sequence = sampled_sequence[:-1]
        target_sequence = sampled_sequence[1:]
        return input_sequence, target_sequence

    def encode_text(self, text):
        tokenized_text = self.tokenize_text(text)
        encoded_text = self.encode_tokenized_text(tokenized_text)
        return encoded_text

    def tokenize_text(self, text):
        return self.tokenzier(text)

    def encode_tokenized_text(self, tokenized_text):
        unk_code = self.token_encoding_map[self.unk_string]
        # By using .get() on the dictionary instead of subscript (token_encoding_map[c])
        # we can supply a default value to use if that token isn't in the encoding map.
        # This allows us to handle out of vocabulary tokens by simply replacing them with the <UNK> token (its encoding actually)
        encoded_text = [self.token_encoding_map.get(c, unk_code) for c in tokenized_text]
        return encoded_text

    def decode_encoded_text(self, encoded_text):
        decoded_text = [self.inverse_token_encoding_map.get(x, self.unk_string) for x in encoded_text]
        return decoded_text

    def get_vocab_size(self):
        return len(self.token_encoding_map)



In [94]:
## Define some data parameters
CONTEXT_LENGTH = 256  # The length of the sequences we will train on
BATCH_SIZE = 64  # How many examples we'll process per batch.

In [95]:
dataset = RandomTextDataset(data_files, CONTEXT_LENGTH, BATCH_SIZE, tokenizer=character_tokenizer, max_vocab=None)  # If you set max_vocab to an integer n, only the most frequent n tokens will be used. The remainder will be replaced by <UNK>

In [96]:
dataset[0]

(array([[ 8,  7,  2, ..., 20, 11,  5],
        [ 2,  7,  6, ...,  5,  4,  2],
        [ 2,  5, 13, ...,  6, 24,  3],
        ...,
        [23,  6, 10, ...,  5, 19,  2],
        [10,  9,  3, ...,  8, 10,  2],
        [16,  9,  2, ..., 12,  2,  6]], dtype=int32),
 array([[ 7,  2,  5, ..., 11,  5,  4],
        [ 7,  6,  4, ...,  4,  2, 19],
        [ 5, 13,  4, ..., 24,  3, 11],
        ...,
        [ 6, 10,  3, ..., 19,  2,  4],
        [ 9,  3,  2, ..., 10,  2,  5],
        [ 9,  2, 10, ...,  2,  6, 14]], dtype=int32))

In [97]:
dataset[0][0][0].shape

(256,)
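
The effect of the `max_vocab` parameter can be illustrated on a tiny example. The sketch below uses a made-up sentence and word-level tokens, and only mirrors the idea behind the dataset class: tokens outside the most frequent `max_vocab` entries are encoded as `<UNK>`.

```python
from collections import Counter

# Made-up, word-level example text
tokens = "the cat sat on the mat and the cat ran".split()
counter = Counter(tokens)

max_vocab = 2  # deliberately tiny limit for illustration

# Codes 0 and 1 are reserved, as in the lab
token_to_code = {"<EMPTY>": 0, "<UNK>": 1}
for token, _count in counter.most_common(max_vocab):
    token_to_code[token] = len(token_to_code)

# Anything outside the kept vocabulary falls back to the <UNK> code
encoded = [token_to_code.get(token, token_to_code["<UNK>"]) for token in tokens]
print(token_to_code)
print(encoded)
```

With character tokens the vocabulary is small enough that no limit is needed, which is why `max_vocab=None` is used above.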

## The model

We're now ready to start training a model, but there are some things we need to decide on first. We will train our RNNs (LSTMs actually) using _truncated_ Backpropagation Through Time. The truncation comes from the fact that we can't train the model on whole books; we simply don't have enough memory to do so. Instead we train the model by showing it _truncated_ parts of the books. Ideally, we want to show it sequences as long as we can fit into the memory of our GPUs, so the context length will depend on that. We also want to make use of the parallel nature of the GPU, so we train on multiple sequences in parallel in a mini-batch.

In [98]:
import keras.losses


epochsVal = 100
learnRateVal = 0.01
batchSizeVal = 10
opt = Adam(learning_rate=learnRateVal)

embedding_dimension = 32
rnn_dimension = 128
output_projection_dimension = 128

num_embeddings = dataset.get_vocab_size()
model = Sequential()
model.add(Embedding(num_embeddings, embedding_dimension, mask_zero=True))
# Add an LSTM layer; return_sequences=True makes it output one vector per time step, so we get a prediction for every position
model.add(LSTM(units=rnn_dimension, return_sequences=True, activation="tanh", unit_forget_bias=True, recurrent_dropout=0, dropout=0.2, use_bias=True))
# Add dense layer with activation for categorical output
model.add(Dense(output_projection_dimension, activation="relu"))
model.add(Dense(num_embeddings))
# Compile model using loss function for categorical data

# or model.add(Dense(num_embeddings,activation = 'softmax'))   change↓: from_logits=False
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)  # We use the sparse loss since we've integer encoded the targets. We also set from_logits=True, since we're not applying the softmax explicitly
model.compile(loss=loss_fn, optimizer=opt, metrics=["accuracy"])
model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, None, 32)          2496      
                                                                 
 lstm_4 (LSTM)               (None, None, 128)         82432     
                                                                 
 dense_8 (Dense)             (None, None, 128)         16512     
                                                                 
 dense_9 (Dense)             (None, None, 78)          10062     
                                                                 
Total params: 111502 (435.55 KB)
Trainable params: 111502 (435.55 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
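
The parameter counts in the summary can be reproduced by hand from the layer sizes. The sketch below assumes the vocabulary size of 78 shown in the last layer of the summary, and uses the fact that an LSTM holds four sets of weights (input, forget and output gates plus the cell candidate), each with input weights, recurrent weights and a bias.

```python
vocab_size = 78  # taken from the output shape of the last layer in the summary
embedding_dimension = 32
rnn_dimension = 128
output_projection_dimension = 128

# Embedding: one vector per integer code
embedding_params = vocab_size * embedding_dimension

# LSTM: 4 weight sets, each with input weights, recurrent weights and a bias
lstm_params = 4 * (rnn_dimension * (embedding_dimension + rnn_dimension) + rnn_dimension)

# Dense layers: weight matrix plus bias
dense_params = rnn_dimension * output_projection_dimension + output_projection_dimension
output_params = output_projection_dimension * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params + output_params
print(embedding_params, lstm_params, dense_params, output_params, total_params)
```

Most of the parameters sit in the LSTM layer, so changing `rnn_dimension` has the largest effect on model size.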


## Training the model
We're now ready to train the model. Do this by running the `fit()` method of the model object. You will see a steady drop in loss and a rise in accuracy. For this problem we're not tracking performance on a development (validation) set, so it's hard to tell if the model overfits. It likely doesn't have the capacity to do so, and for the purpose of this lab we can allow it to overfit.

In [99]:
model.fit(dataset, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7a88cb373490>

## Generating text - ChatLSTM

Now that we've trained our model we can try using it to generate some text. First, you can choose a text to prompt it with by setting the `prompt` variable below. You can change the length of the generated text by setting `generation_length` to some desired value. There's a third parameter, `temperature`, which allows you to control how random the sampling of the next character should be. It should be a float value strictly larger than $0$. If it's set to $1$, the learnt probability distribution of the model is used to sample the next character. As the temperature gets closer to $0$, sampling gets closer to always picking the most probable character, and as the temperature grows above $1$, the distribution approaches the uniform distribution over the next character.
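
To get a feel for the temperature parameter before using it with the model, the sketch below applies a temperature-scaled softmax to three made-up logits. The function name `softmax_with_temperature` and the logit values are assumptions for this illustration only.

```python
import numpy as np


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled = scaled - np.max(scaled)  # does not change the result, avoids overflow
    exponentiated = np.exp(scaled)
    return exponentiated / np.sum(exponentiated)


example_logits = [2.0, 1.0, 0.0]  # made-up values, not model output
for temperature in (0.1, 1.0, 10.0):
    probabilities = softmax_with_temperature(example_logits, temperature)
    print(temperature, np.round(probabilities, 3))
```

A low temperature puts almost all probability on the largest logit, while a high temperature flattens the distribution towards uniform.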

In [74]:
prompt = "I want to go home"
generation_length = 100
temperature = 0.3
random_seed = 1729  # Set to None to use a different random seed each time this cell is executed
rng = np.random.default_rng(random_seed)

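def softmax(logits, temperature):
    # Divide by the temperature before normalizing. Subtracting the maximum
    # leaves the result unchanged but avoids overflow in np.exp
    temperature_scaled_logits = logits/temperature
    temperature_scaled_logits = temperature_scaled_logits - np.max(temperature_scaled_logits)
    exponentiated_logits = np.exp(temperature_scaled_logits)
    return exponentiated_logits / np.sum(exponentiated_logits)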


encoded_prompt = np.array(dataset.encode_text(prompt), dtype=np.int32)
for i in range(generation_length):
    next_token_logits = model.predict(encoded_prompt[None, ...], verbose=0)[0, -1]
    p =  softmax(next_token_logits, temperature)
    sampled_token = rng.choice(len(p), p=p)
    encoded_prompt = np.concatenate([encoded_prompt, [sampled_token]])

generated_text = "".join(dataset.decode_encoded_text(encoded_prompt))
print(generated_text)

I want to go home of his particular and the same of the happiness of his sister was too much and desirable of the sam


In [101]:
prompt = "I want to go home"

encoded_prompt = np.array(dataset.encode_text(prompt), dtype=np.int32)
for i in range(generation_length):
    next_token_logits = model.predict(encoded_prompt[None, ...], verbose=0)[0, -1]
    p =  softmax(next_token_logits, temperature)
    sampled_token = rng.choice(len(p), p=p)
    encoded_prompt = np.concatenate([encoded_prompt, [sampled_token]])

generated_text = "".join(dataset.decode_encoded_text(encoded_prompt))
print(generated_text)

i want to go home she had always so much and her attention to the reader of the moment which i was a pleasure of her 


## Exercises

Now that we've tested the language model, let's try to modify it and see what effect it might have.

### Exercise 1

We've trained the model by just taking the text in as is. Often it's a good idea to preprocess the text to make the learning easier.

 - Create a new tokenizer that converts each character to lower case before returning it.
 - Do this by implementing a new function `lower_case_character_tokenizer()`.
 - Look at the [`str` class](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str) in the Python documentation for suitable methods to convert to lower case.
 - When you create the dataset, give this function as the tokenizer input instead.
 - Now train a model with this new dataset. Do you notice any performance difference after you've trained for the same amount of epochs?


### Exercise 2

Training the neural network to predict the next character requires a lot of capacity.
- Try adding two more `LSTM` layers to the model and train it for the same number of epochs as the previous model.
- Can you see any difference in performance?
- Try increasing the dimensionality of the `LSTM` layers. What effect does it have on the model?

### Exercise 3

Instead of increasing the model capacity to solve the problem, we can try to change the data.
- Implement a _new tokenizer_ that, instead of splitting the text at each character, splits it on whitespace and punctuation.
- You can use a regular expression and the [`re` module](https://docs.python.org/3/library/re.html) to split the string. Below is a useful snippet:
  ```python
  import re
  tokenized_text = re.split(r"[;.!:, \-_'\n\t#]+", loaded_data[0])
  ```
  The part inside the brackets of the regular expression lists all the characters we want to remove. The list might not be exhaustive.

- The tokenizer should remove any empty strings from the result, these would behave weirdly in the model. You can use something like:
  ```python
  tokenized_text = [token for token in tokenized_text if token]
  ```

- Use this new whitespace tokenizer as the tokenizer input to the dataset class. You should probably also set the `max_vocab` parameter to something appropriate like 10000. This limits the number of distinct words the dataset uses to this number.

- Now train the model using this new tokenizer. What can you say about the loss? After training it for some epochs and trying it, how does the text generation change?

### Exercise 4

Instead of using the collected works of Jane Austen, try to train a language model on your own dataset or on the Shakespeare dataset also provided in the `data` directory.


### Exercise 5

Two other variables are important for training RNNs, the context length (the length of sequences we actually train the model on) and batch size.
- Double the batch size (set in the cell where you create the dataset) and train the model for 5 epochs. What happens to the training time?
- Reset the batch size (halve it again), then double the context length and train the model for 5 epochs. Does the change have a similar effect as when you doubled the batch size?

