# Importing the packages 

In [58]:
# Deep learning framework 
import tensorflow as tf 

# Array math 
import numpy as np

# Layers 
from keras import layers

# High level overview 

This project shows how to build an autoregressive model that predicts the next character from the preceding characters. The text we will use is the famous Lithuanian poem "Eglė žalčių karalienė" by Salomėja Nėris. The text is located in the `input/input.txt` file.

Before diving deep into the modeling with real data, I will cover some modeling concepts. The concepts are:

* Loss for such models 
* Embedding layer 
* Attention layer 
* Decoder layer 

We will use small data examples so that everything fits in our heads and we build a strong intuition about what happens inside the model when it uses large matrices, many hidden units, and so on.

# Reading the data 

In [7]:
# Reading the text 
with open('input/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Printing the total length of the text
print('Length of text: {} characters'.format(len(text)))

# Printing out the first 200 characters in text
print(text[:200])

Length of text: 16654 characters
﻿EGLĖ DUODA ŽODĮ...

Lenktyniuoja, raitos
Bangos ties krantu. –
Trys žvejų mergaitės
Sukasi ratu.

Jas rausvai nudažo
Saulė leisdamos. –
Krykščia jos ir grąžos
Ligi sutemos.

– Jau saulelė miršta!
Šok


# Creating the character index 

The model that we will create will use individual characters for prediction. Thus, we need to create the character index. The character index is a dictionary that maps each character to a unique integer.

In [10]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)

print(f"All the unique characters in the text: {''.join(chars)}\nSize of the vocabulary: {vocab_size}")

All the unique characters in the text: 
 !,.:;?ABDEGIJKLMNOPRSTUVYabdegijklmnoprstuvyzĄąČčĖėęĮįŠšŪūųŽž–﻿
Size of the vocabulary: 65


In [14]:
# Creating a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

base_string = 'eglė'

print(f"Base string: {base_string}\nEncoded: {encode(base_string)}\nDecoded: {decode(encode(base_string))}")

Base string: eglė
Encoded: [30, 31, 35, 52]
Decoded: eglė


The word eglė is split into 4 characters: e, g, l, ė. The token index for e is 30, g is 31, l is 35 and ė is 52. 

Thus the full sequence of the word eglė is: `[30, 31, 35, 52]`.

# Creating the train and test sequences 

In [19]:
# Encoding everything in the text
encoded = encode(text)

# Converting the encoded text to a tensor
encoded = tf.cast(encoded, tf.int32)

# Printing the first 200 encoded characters
print(f"The first 200 token index sequence:\n{encoded[:200].numpy()}")

# Splitting the encoded text into train and test parts
n = len(encoded)
train_size = int(n*0.9)
train_text = encoded[:train_size]
test_text = encoded[train_size:]
print(f"Length of the train text: {len(train_text)}\nLength of the test text: {len(test_text)}")

The first 200 token index sequence:
[64 11 12 16 51  1 10 24 19 10  8  1 61 19 10 54  4  4  4  0  0 16 30 37
 34 42 45 37 32 43 38 33 27  3  1 40 27 32 42 38 41  0  9 27 37 31 38 41
  1 42 32 30 41  1 34 40 27 37 42 43  4  1 63  0 23 40 45 41  1 62 44 30
 33 60  1 36 30 40 31 27 32 42 52 41  0 22 43 34 27 41 32  1 40 27 42 43
  4  0  0 14 27 41  1 40 27 43 41 44 27 32  1 37 43 29 27 62 38  0 22 27
 43 35 52  1 35 30 32 41 29 27 36 38 41  4  1 63  0 15 40 45 34 57 50 32
 27  1 33 38 41  1 32 40  1 31 40 48 62 38 41  0 16 32 31 32  1 41 43 42
 30 36 38 41  4  0  0 63  1 14 27 43  1 41 27 43 35 30 35 52  1 36 32 40
 57 42 27  2  0 56 38 34]
Length of the train text: 14988
Length of the test text: 1666
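
The split above yields two long token streams. To train a next-character model, these streams still have to be cut into input and target windows. The following is a minimal sketch, assuming a hypothetical helper `make_windows` and a window length `block_size` (neither name appears in the original notebook). The target window is the input window shifted by one position, so every position in the input has the following character as its label.

```python
import numpy as np

def make_windows(token_ids, block_size):
    """Cut a 1-D token stream into (input, target) windows shifted by one token."""
    token_ids = np.asarray(token_ids)
    n_windows = len(token_ids) - block_size
    x = np.stack([token_ids[i:i + block_size] for i in range(n_windows)])
    y = np.stack([token_ids[i + 1:i + block_size + 1] for i in range(n_windows)])
    return x, y

# Toy stream: the tokens of "eglė", a space (index 1) and "e" again
toy_stream = [30, 31, 35, 52, 1, 30]
x, y = make_windows(toy_stream, block_size=4)

print(x)  # [[30 31 35 52], [31 35 52  1]]
print(y)  # [[31 35 52  1], [35 52  1 30]]
```

The same helper could be applied to `train_text.numpy()` and `test_text.numpy()`; the window length is a hyperparameter.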


# Loss function 

When predicting the next token index, the loss function is the categorical crossentropy. It is the standard loss for multi-class classification when the target is a one-hot encoded vector.

The formula for the loss is: 

$$
\begin{align}
L = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)
\end{align}
$$

where $y_i$ is the $i$-th element of the one-hot encoded target, $\hat{y}_i$ is the predicted probability for class $i$, and $n$ is the number of classes.

In our case we would have 65 classes, since we have 65 unique characters.

The intuition here is that we want to maximize the probability of the correct class because then the log(probability) will be close to 0 and hence the loss will be small. 

For example, 

Let us say that we have 3 classes: a, b, c. 

The target is a. Thus the one hot encoded target is:

$$
\begin{align}
y = [1, 0, 0]
\end{align}
$$

The predicted probability vector is:

$$
\begin{align}
\hat{y} = [0.8, 0.1, 0.1]
\end{align}
$$

The loss is:

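$$
\begin{align}
L &= -\sum_{i=1}^{n} y_i \log(\hat{y}_i) \\
  &= -\left(1 \cdot \log(0.8) + 0 \cdot \log(0.1) + 0 \cdot \log(0.1)\right) \\
  &= -\log(0.8) \approx 0.223
\end{align}
$$

Here $\log$ denotes the natural logarithm, which is the convention used by Keras.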
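
The arithmetic of the three-class example can be checked numerically. This is a small NumPy sketch rather than the Keras loss itself. Note that `np.log` is the natural logarithm; a base-10 logarithm would give about 0.097 instead. The last lines add a reference point that is not in the original text: the loss of a model that assigns the same probability to each of the 65 characters.

```python
import numpy as np

# One-hot target (class "a") and predicted probabilities from the example
y = np.array([1.0, 0.0, 0.0])
y_hat = np.array([0.8, 0.1, 0.1])

# Categorical crossentropy: only the term of the correct class survives
loss = -np.sum(y * np.log(y_hat))
print(f"{loss:.4f}")  # 0.2231

# Reference point: uniform guessing over a vocabulary of 65 characters
uniform_loss = -np.log(1.0 / 65)
print(f"{uniform_loss:.3f}")  # 4.174
```

A freshly initialized model should start close to the uniform value, and training should push the loss below it.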

# Embedding layer 

The embedding layer maps each token index to a vector. It is a trainable layer: its weights are initialized randomly and then updated during training.

Because our vocabulary has 65 characters, the embedding matrix has 65 rows. For illustration, let the embedding dimension be 10, so the matrix has 65 rows and 10 columns.

Each number in the matrix will be trained during the training process.

In [47]:
# Importing the embedding layer 
from keras.layers import Embedding

# Initiating the embedding layer 
embedding_dim = 10
embedding = Embedding(vocab_size, embedding_dim)

# Setting the example index
example_index = 30

# Decoding 
example_char = decode([example_index])

# Looking up the embedding vector of the example token index
embedding_vector = embedding(example_index)

# Printing out the shape of the embedding layer
print(f"Shape of the embedding layer: {embedding.weights[0].shape} | The number of trainable weights: {embedding.weights[0].shape[0] * embedding.weights[0].shape[1]}")

# Printing the token index, the character and its embedding vector
print(f"The token index: {example_index}\nThe actual letter {example_char}\nThe vector representing that letter:\n{embedding_vector}")

Shape of the embedding layer: (65, 10) | The number of trainable weights: 650
The token index: 30
The actual letter e
The vector representing that letter:
[ 0.02279686 -0.00544538  0.03994528  0.01177512  0.00980796 -0.03409342
  0.03105379  0.03771725  0.03322155  0.02564735]


If the sequence length is 4, then the output shape of the embedding layer will be `[4, 10]`: one embedding vector for each token in the sequence.

For example, let's encode the word "eglė" and get the embedding vector for each token.

In [57]:
word = 'eglė'
word_seq = encode(word)

# Passing the word through the embedding layer
word_embedding = embedding(np.array(word_seq))

print(f"The word: {word}\nThe encoded word: {word_seq}\nThe embedding vector of the word:\n{word_embedding}")

The word: eglė
The encoded word: [30, 31, 35, 52]
The embedding vector of the word:
[[ 0.02279686 -0.00544538  0.03994528  0.01177512  0.00980796 -0.03409342
   0.03105379  0.03771725  0.03322155  0.02564735]
 [-0.01839764 -0.04626665  0.02475411  0.02320589  0.00434644  0.01509551
  -0.03369321  0.00931333  0.03201846  0.02168529]
 [-0.04623183  0.01552046  0.00478268 -0.03365133 -0.04867652 -0.04938136
   0.03261245  0.03320005 -0.02372136 -0.04863739]
 [-0.03661405 -0.04686065  0.03424572 -0.00874369 -0.03142884 -0.03757707
  -0.03044369 -0.04448486  0.00237327  0.00550954]]


If our batch contains 8 sequences, then the output shape of the embedding layer will be `[8, 4, 10]`: one embedding vector for each token of each sequence in the batch.

The output dimensions can be interpreted as follows:

* 8 - batch size
* 4 - sequence length
* 10 - feature dimension 

This (batch, sequence, feature) naming convention is commonly used in the deep learning community.
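
Conceptually, the embedding layer is a table lookup: each token index selects one row of the weight matrix. The sketch below reproduces the `[8, 4, 10]` shape with plain NumPy indexing. The names `table` and `batch_ids` and the random values are illustrative assumptions, not the weights of the Keras layer above.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 65, 10
# Stand-in for the trainable embedding matrix (small random values)
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# A batch of 8 sequences with 4 token indices each
batch_ids = rng.integers(0, vocab_size, size=(8, 4))

# Integer indexing picks one row of the table per token index
batch_embeddings = table[batch_ids]
print(batch_embeddings.shape)  # (8, 4, 10)

# The lookup is equivalent to multiplying one-hot vectors by the table
one_hot = np.eye(vocab_size)[batch_ids]          # (8, 4, 65)
print(np.allclose(one_hot @ table, batch_embeddings))  # True
```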

# Attention layer

The self-attention layer calculates attention weights for each pair of tokens in the sequence. The weights are computed from the query and key vectors and are then used to form a weighted average of the value vectors. The query, key and value vectors are obtained by passing the output of the embedding layer through three separate dense layers.

In [59]:
class SelfAttentionLayer(layers.Layer):
    def __init__(self, embedding_dim):
        super(SelfAttentionLayer, self).__init__()
        # The number of features in the input 
        self.embedding_dim = embedding_dim

        # Self-attention part; 
        # It is called self-attention because we create the query, key, and value matrices from the same input
        self.query_layer = layers.Dense(embedding_dim)
        self.key_layer = layers.Dense(embedding_dim)
        self.value_layer = layers.Dense(embedding_dim)
        
    def call(self, inputs):
        # Compute query, key, and value matrices
        query = self.query_layer(inputs)
        key = self.key_layer(inputs)
        value = self.value_layer(inputs)
        
        # Compute dot product attention scores
        scores = tf.matmul(query, key, transpose_b=True)
        scores_scaled = tf.divide(scores, tf.math.sqrt(tf.cast(self.embedding_dim, tf.float32)))
        attention_weights = tf.nn.softmax(scores_scaled, axis=-1)
        
        # Apply attention weights to value matrix
        output = tf.matmul(attention_weights, value)
        return output

The attention calculation starts with calculating the query, key and value vectors. 

In [63]:
q = layers.Dense(embedding_dim, use_bias=False)(word_embedding)
k = layers.Dense(embedding_dim, use_bias=False)(word_embedding)
v = layers.Dense(embedding_dim, use_bias=False)(word_embedding)

print(f"Query matrix:\n{q}\nKey matrix:\n{k}\nValue matrix:\n{v}")
print(f"The shapes of the tensors: \nQuery: {q.shape}\nKey: {k.shape}\nValue: {v.shape}")

Query matrix:
[[-0.04093811  0.0112277   0.01796059 -0.03300705  0.02126562  0.02079885
  -0.02377996  0.04060603 -0.02131523 -0.00091095]
 [-0.00381637 -0.0443967   0.01795177  0.03025293  0.00765426 -0.05867576
  -0.0300079  -0.044669   -0.00123588  0.02930452]
 [ 0.0469352   0.07168187 -0.0287734  -0.03954661 -0.01353524  0.02381645
   0.02240428  0.03126669 -0.0058535  -0.05523315]
 [ 0.02974571 -0.02309038 -0.03442734  0.02481633  0.00480846 -0.02746624
  -0.00255225 -0.00447242 -0.03464481  0.01547391]]
Key matrix:
[[ 0.0664301  -0.00572214 -0.03418593 -0.00950872 -0.05302523 -0.02517207
   0.00384649  0.00208169  0.02986644 -0.01498902]
 [ 0.02763128  0.03465866 -0.00711504  0.04442833  0.02406945  0.02182924
  -0.00517627  0.04050875  0.01062051 -0.01249776]
 [ 0.02054819 -0.00717465 -0.01168453 -0.00354873  0.00078363 -0.04589725
  -0.05217977 -0.01644216  0.05031686  0.00067306]
 [ 0.00364839 -0.01843851 -0.02806041  0.0614133   0.01793488  0.02942345
  -0.01362184  0.0100990

As we can see, each embedding vector is passed through a dense layer. Because no activation function is specified, each of these layers is a linear projection of its input.

Then, the scores and the final attention weights are calculated.

In [68]:
scores = tf.matmul(q, k, transpose_b=True)
scores_scaled = tf.divide(scores, tf.math.sqrt(tf.cast(embedding_dim, tf.float32)))
attention_weights = tf.nn.softmax(scores_scaled, axis=-1)

print(f"Initial attention weights: \n{attention_weights}\nShape: {attention_weights.shape}")

Initial attention weights: 
[[0.24976097 0.2501995  0.24999113 0.2500484 ]
 [0.24992102 0.2496806  0.2503477  0.2500507 ]
 [0.25032815 0.25026578 0.24976029 0.2496458 ]
 [0.25007668 0.24991512 0.24996834 0.25003985]]
Shape: (4, 4)


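The attention weights in each row sum to 1 and describe how much each token's value vector contributes to the output for that position. The intuition is that, after training, the weights are high for the tokens that are important for predicting the next token. Here the layers are untrained and the embedding values are small, so the weights are nearly uniform (about 0.25 each).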

The final attention vector is calculated by multiplying the attention weights with the value vectors.

In [70]:
attention_output = tf.matmul(attention_weights, v)

print(f"Attention output: \n{attention_output}\nShape: {attention_output.shape}")

Attention output: 
[[ 0.01298132 -0.00344365 -0.04788522  0.01498111 -0.01377805  0.02582581
  -0.00594918 -0.0157879   0.00324314  0.00091845]
 [ 0.01299661 -0.00343155 -0.04791439  0.01499181 -0.0137539   0.02587295
  -0.00599003 -0.01580316  0.003248    0.00092159]
 [ 0.01295136 -0.0034396  -0.04785607  0.0149993  -0.0138049   0.02582769
  -0.00596882 -0.01576768  0.00322966  0.00089588]
 [ 0.01297594 -0.00343441 -0.04788569  0.01499329 -0.01377541  0.02584481
  -0.00596803 -0.01578397  0.0032364   0.00090656]]
Shape: (4, 10)
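
In the calculation above every token attends to every other token, including later ones. For next-character prediction a causal mask is normally applied, so that position $t$ only attends to positions up to and including $t$. The following NumPy sketch illustrates the idea; it is not the notebook's implementation, and the function name `causal_attention` and the random `q`, `k`, `v` matrices are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v):
    """Scaled dot-product attention in which a token cannot see later tokens."""
    seq_len, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)
    # True above the diagonal marks "future" positions
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 10)) for _ in range(3))
masked_output, masked_weights = causal_attention(q, k, v)

print(np.round(masked_weights, 2))  # lower-triangular, each row sums to 1
print(masked_output.shape)          # (4, 10)
```

The first token can only attend to itself, so its output equals its own value vector.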


The whole calculation above corresponds to one attention *head*. The attention layer can have multiple heads; the number of heads is a hyperparameter.

Each head has its own query, key and value projections.
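
A common way to implement several heads is to split the feature dimension into equally sized slices, run attention on each slice independently, and concatenate the results. The sketch below assumes 2 heads of size 5 for an embedding dimension of 10; `multi_head_attention` and the random projection matrices are hypothetical names for illustration. The output projection that usually follows the concatenation is omitted to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, num_heads):
    seq_len, dim = x.shape
    head_dim = dim // num_heads

    def split_heads(t):
        # (seq_len, dim) -> (num_heads, seq_len, head_dim)
        return t.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    # Attention is computed separately for every head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)           # (num_heads, seq_len, seq_len)
    heads = weights @ v                          # (num_heads, seq_len, head_dim)
    # Concatenate the heads back into (seq_len, dim)
    return heads.transpose(1, 0, 2).reshape(seq_len, dim), weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10))                     # 4 tokens, 10 features
w_q, w_k, w_v = (rng.normal(size=(10, 10)) for _ in range(3))

mh_output, mh_weights = multi_head_attention(x, w_q, w_k, w_v, num_heads=2)
print(mh_output.shape)   # (4, 10)
print(mh_weights.shape)  # (2, 4, 4)
```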

# Decoder model

The decoder model refers to the stacking of layers in the following order from the paper "Attention is all you need":

* Embedding layer
* Multi-head attention layer
* Residual connection
* Layer normalization
* Feed forward layer
* Residual connection
* Layer normalization

The residual connection adds the input of a layer to its output. It gives gradients a direct path through the network, which helps mitigate the vanishing gradient problem.

Layer normalization normalizes the features of each token to zero mean and unit variance, usually followed by a learned scale and shift. It keeps activations in a stable range, which helps stabilize training.

The feed forward layer is a layer that has 2 dense layers with relu activation function in between. The output of the feed forward layer is the input of the next layer.
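
The ordering listed above (sublayer, residual connection, layer normalization) can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions: the learned scale and shift of layer normalization are left out, the hidden size of 40 is an arbitrary choice, and `uniform_attention` is a stand-in that simply averages all tokens instead of computing real attention weights.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize the features of every token to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, w1, b1, w2, b2):
    # Two dense layers with a ReLU in between
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

def uniform_attention(x):
    # Stand-in for self-attention: every token receives the average of all tokens
    seq_len = x.shape[0]
    return np.full((seq_len, seq_len), 1.0 / seq_len) @ x

def decoder_block(x, attention_fn, ffn_params):
    x = layer_norm(x + attention_fn(x))                # attention + residual + norm
    x = layer_norm(x + feed_forward(x, *ffn_params))   # feed forward + residual + norm
    return x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 10))                      # 4 tokens, 10 features
ffn_params = (rng.normal(size=(10, 40)), np.zeros(40),
              rng.normal(size=(40, 10)), np.zeros(10))

block_output = decoder_block(tokens, uniform_attention, ffn_params)
print(block_output.shape)  # (4, 10)
```

Because the block keeps the input shape, several such blocks can be stacked one after another.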

In [71]:
# Full decoder implementation
class Decoder(layers.Layer):
    def __init__(self, embedding_dim, vocab_size):
        super(Decoder, self).__init__()
        self.embedding_dim = embedding_dim
        self.vocab_size = vocab_size
        self.embedding = layers.Embedding(vocab_size, embedding_dim)
        self.attention = SelfAttentionLayer(embedding_dim)
        # Hidden dense layer with ReLU activation (the Keras argument is `use_bias`, not `bias`)
        self.dense = layers.Dense(vocab_size, use_bias=False, activation='relu')
        # Output layer; named `output_layer` because `output` is a reserved property of Keras layers
        self.output_layer = layers.Dense(vocab_size, activation='softmax')

    def call(self, inputs):
        # inputs: integer token indices of shape (batch, sequence length)
        # Pass the token indices through the embedding layer
        # Result shape: (batch, sequence length, embedding_dim)
        token_embedding = self.embedding(inputs)
        
        # Pass the token embeddings through the self-attention layer
        attention_output = self.attention(token_embedding)
        
        # Adding the residual connection
        attention_output = attention_output + token_embedding

        # Pass the attention output through the dense layer
        dense_output = self.dense(attention_output)

        # Pass the dense output through the output layer
        # Result shape: (batch, sequence length, vocab_size)
        output = self.output_layer(dense_output)
        
        return output