### Required Libraries

In [131]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

### Positional Encoding

In [132]:
"""
Because the network uses neither convolutional nor recurrent layers, it needs another way to represent the position
of words relative to one another: 'I saw a movie.' is not the same as 'Movie a I saw.' To provide this, sine and
cosine functions map each dimension of the positional encoding to a sinusoid of a different wavelength. The
encodings have the same number of dimensions as d_model (i.e. d_model = embedding dimensions) so that they can be
summed with the token embeddings. Dropout is then applied to that sum to reduce overfitting before it is passed to
the first multi-head attention sub-layer.
"""

class PositionalEncoder(nn.Module):
    def __init__(self, d_model, dropout, maxlen):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        
        # Positional Encoding Sinusoid Formula
        pos_encoding = torch.zeros(maxlen, d_model)
        pos = torch.arange(0, maxlen, dtype=torch.float).view(-1, 1) # positions 0, 1, ..., maxlen - 1, shape (maxlen, 1)
        denom = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0)) / d_model) # 1 / 10000^(2i/d_model)
        pos_encoding[:, 0::2] = torch.sin(pos * denom)
        pos_encoding[:, 1::2] = torch.cos(pos * denom)
        
        # save the encodings as a buffer (no gradients), shaped (maxlen, 1, d_model)
        pos_encoding = pos_encoding.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pos_encoding', pos_encoding)
        
    def forward(self, token_embedding):
        # add positional encodings to (sequence length, batch_size, d_model) embeddings, then apply dropout
        return self.dropout(token_embedding + self.pos_encoding[:token_embedding.size(0), :])
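
The sinusoid table built in the constructor can be checked against the closed form PE(pos, 2i) = sin(pos / 10000^(2i/d_model)). The block below is a minimal NumPy sketch of that check; `sinusoidal_table` is a hypothetical helper written for illustration (it is not part of the notebook) and it assumes d_model is even.

```python
import numpy as np

def sinusoidal_table(maxlen: int, d_model: int) -> np.ndarray:
    """Return a (maxlen, d_model) table of sinusoidal positional encodings."""
    pos = np.arange(maxlen, dtype=np.float64).reshape(-1, 1)   # (maxlen, 1)
    two_i = np.arange(0, d_model, 2, dtype=np.float64)         # 0, 2, 4, ...
    inv_freq = np.exp(two_i * (-np.log(10000.0) / d_model))    # 10000^(-2i/d_model)
    table = np.zeros((maxlen, d_model))
    table[:, 0::2] = np.sin(pos * inv_freq)                    # even dimensions
    table[:, 1::2] = np.cos(pos * inv_freq)                    # odd dimensions
    return table

table = sinusoidal_table(maxlen=6, d_model=8)

# Spot-check one entry against the closed form.
pos, i = 3, 2
expected = np.sin(pos / 10000 ** (2 * i / 8))
assert np.isclose(table[pos, 2 * i], expected)
```

Computing the frequencies as exp(-2i * ln(10000) / d_model) gives the same values as dividing by 10000^(2i/d_model) directly, which is why the comment on `denom` describes a reciprocal.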

### Transformer Network

In [133]:
"""
Transformer network architecture declaration. The basic architecture requires the number of tokens (i.e. the total number of unique words);
d_model, the number of dimensions the encoder/decoder expects as input (the embedding size); the number of heads used by the multi-head
attention sub-layers (i.e. the number of parallel linear projections of Q, K, & V); the number of encoder layers, each a multi-head attention
(normed) & feed forward (normed) sequence; the number of decoder layers, each a masked multi-head attention (normed), multi-head attention
(normed), & feed forward (normed) sequence; feed forward, the number of hidden dimensions in the feed forward network; and lastly dropout,
the fraction of activations dropped when passing from one sub-layer to the next (i.e. in the masked multi-head attention, multi-head
attention, & feed forward sub-layers).


Encoder
Inputs are fed into the embedding layer of the encoder, where the tokenized inputs are transformed into vectors with d_model dimensions.
Next the embeddings are positionally encoded as described in the PositionalEncoder class above. Dropout is applied to the result, which is
then passed to the 1st sub-layer of the encoder and, through a residual connection, added back to that sub-layer's output.

The multi-head attention layer (1st sub-layer) receives the positionally encoded embeddings and computes a weighted sum of the values, where
each weight comes from the compatibility function between a query & the keys (i.e. each word is weighted by its similarity to the other words).
This is done in h heads (h being the number of heads), each with its own learnable linear projections; the head outputs are concatenated,
projected once more, and then normalized.

The feed forward network (2nd sub-layer) receives these normalized projections and transforms them so that they are better suited as input
to the next multi-head attention layer. This result is normalized and passed to the next encoder layer; the output of the last encoder
layer is passed to the unmasked multi-head attention layer in the decoder.

Decoder
The decoder takes the target sequence shifted right by one position, embeds it, then positionally encodes it (described above).
As mentioned, dropout is applied to the positionally encoded embeddings, which are passed to the 1st sub-layer and carried forward to the
2nd & 3rd sub-layers through residual connections.

The masked multi-head attention layer (1st sub-layer) receives the positionally encoded embeddings and operates the same as the normal
multi-head attention layer (described above), but it masks positions to prevent them from attending to subsequent positions (i.e. it makes
sure a position can only be predicted from the positions that precede it). The result is normalized then passed to the unmasked multi-head
attention layer.

The unmasked multi-head attention layer (2nd sub-layer) receives the result from the 1st sub-layer as its queries, and the normalized output
of the encoder's feed forward network as its keys & values. It then performs the same computation as the multi-head attention layer described
in the encoder. This result is normalized then fed into the feed forward network.

The feed forward network (3rd sub-layer) performs the same operations as described in the encoder. The result is then normalized and passed to
a linear function so that softmax can be applied to compute the probabilities for the next predicted word.

Note: residual connections are used to mitigate the vanishing gradient issue as gradients are back-propagated when learning

Defaults
d_model:
defaults to 512 dimensions (fewer dimensions means less information passed between sub-layers but cheaper computation,
while more dimensions means more information at a higher computational cost)

n_head:
defaults to 8 (the number of scaled dot-product attention layers equals the number of heads, i.e. the number of
learnable linear projections depends on the number of heads; d_model must be divisible by n_head)

n_encoder:
defaults to 6 and defines the number of encoder layers stacked in the transformer network

n_decoder:
defaults to 6 and defines the number of decoder layers stacked in the transformer network

feed_forward:
defaults to 2048 hidden dimensions (increasing dimensions may help improve fit of input to next multi-head attention layer)

dropout:
defaults to 0.1 or 10%. Dropout helps prevent overfitting, but dropping too many activations may prevent the network from learning
anything, while dropping too few may let the network overfit
"""

class Transformer(nn.Module):

    # Constructor
    def __init__(self, n_tokens: int, d_model: int=512, n_head: int=8, n_encoder: int=6, n_decoder: int=6, feed_forward: int=2048, dropout: float=0.1):         
        super().__init__()
        self.model_type = 'Transformer'
        self.d_model = d_model

        # transformer layers
        self.embedding = nn.Embedding(n_tokens, d_model)
        self.pos_encoder = PositionalEncoder(d_model=d_model, dropout=dropout, maxlen=5000)
        self.transformer = nn.Transformer(d_model=d_model, nhead=n_head, num_encoder_layers=n_encoder, 
                                        num_decoder_layers=n_decoder, dim_feedforward=feed_forward, dropout=dropout)
        self.out = nn.Linear(d_model, n_tokens)
        
    def forward(self, src: torch.LongTensor, tgt: torch.LongTensor, tgt_mask=None, src_pad_mask=None, tgt_pad_mask=None) -> torch.Tensor:
        # src shape: (batch_size, src seq_length), tgt shape: (batch_size, tgt seq_length)

        # embedding shape: (batch_size, sequence length, d_model), scaled by sqrt(d_model)
        src = self.embedding(src) * np.sqrt(self.d_model)
        tgt = self.embedding(tgt) * np.sqrt(self.d_model)

        # reshape to (sequence length, batch_size, d_model) before positional encoding:
        # PositionalEncoder indexes positions along dim 0, and nn.Transformer expects
        # sequence-first inputs by default
        src = src.permute(1,0,2)
        tgt = tgt.permute(1,0,2)
        src = self.pos_encoder(src)
        tgt = self.pos_encoder(tgt)

        # transformer output shape: (tgt sequence length, batch_size, d_model)
        decoder_out = self.transformer(src, tgt, tgt_mask=tgt_mask, src_key_padding_mask=src_pad_mask, tgt_key_padding_mask=tgt_pad_mask)

        # project to vocabulary logits, shape: (tgt sequence length, batch_size, n_tokens)
        # softmax is not applied here because nn.CrossEntropyLoss expects raw logits
        out = self.out(decoder_out)
        return out
    
    # create mask for tgt
    def get_tgt_mask(self, seq_length: int) -> torch.Tensor:
        # keep decoder from peeking ahead (i.e. show one word at a time in the sequence)
        mask = torch.tril(torch.ones(seq_length, seq_length) == 1) # Lower triangular matrix
        mask = mask.float()
        mask = mask.masked_fill(mask == 0, float('-inf')) # Convert zeros to -inf
        mask = mask.masked_fill(mask == 1, float(0.0)) # Convert ones to 0
        return mask
    
    # create a boolean mask that is True wherever the input holds the pad value
    def create_pad_mask(self, matrix: torch.LongTensor, pad_val: int) -> torch.BoolTensor:
        # matrix = [3, 2, 1, 8, 0, 0, 0], pad_val = 0 -> [False, False, False, False, True, True, True]
        return (matrix == pad_val)
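
To see what the additive mask from `get_tgt_mask` does inside attention, the sketch below implements single-head scaled dot-product attention by hand and applies a causal mask of the same form (0 where attention is allowed, -inf where it is not). `causal_mask` and `scaled_dot_product_attention` are hypothetical helpers for illustration only; `nn.Transformer` performs the equivalent computation internally with several heads and learned projections.

```python
import math
import torch

def causal_mask(seq_length: int) -> torch.Tensor:
    """Additive mask: 0.0 on and below the diagonal, -inf above it."""
    allowed = torch.tril(torch.ones(seq_length, seq_length, dtype=torch.bool))
    mask = torch.zeros(seq_length, seq_length)
    return mask.masked_fill(~allowed, float('-inf'))

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(q @ k^T / sqrt(d_k) + mask) @ v for a single head."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (seq, seq) compatibility scores
    if mask is not None:
        scores = scores + mask                          # -inf removes future positions
    weights = torch.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
seq_length, d_k = 4, 8
q = torch.randn(seq_length, d_k)
k = torch.randn(seq_length, d_k)
v = torch.randn(seq_length, d_k)

out, weights = scaled_dot_product_attention(q, k, v, causal_mask(seq_length))

# Every weight above the diagonal is exactly zero: position t ignores positions > t.
assert torch.triu(weights, diagonal=1).abs().sum().item() == 0.0
```

Because the diagonal is always allowed, every row keeps at least one finite score, so the softmax never produces NaN.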

In [134]:
def generate_random_data(n):
    SOS_token = np.array([2])
    EOS_token = np.array([3])
    length = 8

    data = []

    # 1,1,1,1,1,1 -> 1,1,1,1,1
    for i in range(n // 3):
        X = np.concatenate((SOS_token, np.ones(length), EOS_token))
        y = np.concatenate((SOS_token, np.ones(length), EOS_token))
        data.append([X, y])

    # 0,0,0,0 -> 0,0,0,0
    for i in range(n // 3):
        X = np.concatenate((SOS_token, np.zeros(length), EOS_token))
        y = np.concatenate((SOS_token, np.zeros(length), EOS_token))
        data.append([X, y])

    # 1,0,1,0 -> 1,0,1,0,1
    for i in range(n // 3):
        X = np.zeros(length)
        start = np.random.randint(0, 2)  # upper bound is exclusive, so this yields 0 or 1

        X[start::2] = 1

        y = np.zeros(length)
        if X[-1] == 0:
            y[::2] = 1
        else:
            y[1::2] = 1

        X = np.concatenate((SOS_token, X, EOS_token))
        y = np.concatenate((SOS_token, y, EOS_token))

        data.append([X, y])

    np.random.shuffle(data)

    return data


def batchify_data(data, batch_size=16, padding=False, padding_token=-1):
    batches = []
    for idx in range(0, len(data), batch_size):
        # Skip the trailing chunk if it holds fewer than batch_size samples
        if idx + batch_size <= len(data):
            # Optionally bring every sequence in the batch to the same length
            # by right-padding with the PAD token.
            if padding:
                # Get the longest sequence (source or target) in the batch
                max_batch_length = 0
                for X, y in data[idx : idx + batch_size]:
                    max_batch_length = max(max_batch_length, len(X), len(y))

                # Append padding tokens to each source and target until it reaches the max length
                for seq_idx in range(batch_size):
                    X, y = data[idx + seq_idx]
                    data[idx + seq_idx] = [
                        np.concatenate((X, np.full(max_batch_length - len(X), padding_token))),
                        np.concatenate((y, np.full(max_batch_length - len(y), padding_token))),
                    ]

            batches.append(np.array(data[idx : idx + batch_size]).astype(np.int64))

    print(f"{len(batches)} batches of size {batch_size}")

    return batches

In [135]:
train_data = generate_random_data(9000)
val_data = generate_random_data(3000)

train_dataloader = batchify_data(train_data)
val_dataloader = batchify_data(val_data)

print(f'Example input: {train_data[0][0]}\nExample Output: {train_data[0][1]}\nExample Batch Shape (batch size, word/next word, sequence length)\n{train_dataloader[0].shape}')

562 batches of size 16
187 batches of size 16
Example input: [2. 1. 1. 1. 1. 1. 1. 1. 1. 3.]
Example Output: [2. 1. 1. 1. 1. 1. 1. 1. 1. 3.]
Example Batch Shape (batch size, word/next word, sequence length)
(16, 2, 10)


In [136]:
def train_loop(net: Transformer, optimizer: optim.Optimizer, loss_fn: nn.Module, dataloader: list, epochs: int=3, device=None):
    
    net.train()
    net_loss = 0
    n = len(dataloader)

    for epoch in range(epochs):
        print(f'{"-"*25}Epoch Started{"-"*25}')
        samples_trained = 0
    
        for i, data in enumerate(dataloader, 0):

            batch_size = len(data)

            # separate source & target then transform to tensor (move to device)
            x, y = data[:, 0], data[:, 1]
            x, y = torch.LongTensor(x).to(device), torch.LongTensor(y).to(device)

            # shifting one to the right (predict next word)
            input_tgt = y[:,:-1]
            output_tgt = y[:,1:]
            
            # mask out pad for multi-head attention & mask out next positions for masked multi-head attention
            seq_length = input_tgt.size(1)
            tgt_mask = net.get_tgt_mask(seq_length).to(device)
            # src_pad_mask = net.create_pad_mask(input_tgt, pad_val=0)
            # tgt_pad_mask = net.create_pad_mask(output_tgt, pad_val=0)

            # Standard training except we pass in input_tgt and tgt_mask
            pred = net(x, input_tgt, tgt_mask)

            # Permute pred to (batch_size, n_tokens, sequence length), the layout CrossEntropyLoss expects
            pred = pred.permute(1, 2, 0)      
            loss = loss_fn(pred, output_tgt)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            net_loss += loss.detach().item()

            samples_trained += batch_size
            if (i + 1) % (n // 4) == 0:
                print(f'{samples_trained}/{batch_size * n} Samples Trained Loss: {loss.item()}')
        print(f'{"-"*25}Epoch Complete{"-"*25}')
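    # average over every batch seen in every epoch, and return it so callers can use the value
    avg_loss = net_loss / (n * epochs)
    print(f'Average Loss: {avg_loss}')
    return avg_loss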
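
The target shift and the permute before the loss are the two steps in the loop that are easiest to get wrong, so here is a small standalone sketch of just those shapes. The random `logits` tensor is a stand-in for the model output and is an assumption for illustration; no model is trained here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_length, n_tokens = 2, 10, 4

# Target batch, shape (batch_size, seq_length)
y = torch.randint(0, n_tokens, (batch_size, seq_length))

# Teacher forcing: the decoder sees tokens 0..T-2 and must predict tokens 1..T-1
input_tgt = y[:, :-1]    # (batch_size, seq_length - 1)
output_tgt = y[:, 1:]    # (batch_size, seq_length - 1)

# Stand-in for the model output, shape (tgt sequence length, batch_size, n_tokens)
logits = torch.randn(seq_length - 1, batch_size, n_tokens)

loss_fn = nn.CrossEntropyLoss()

# CrossEntropyLoss takes class scores on dim 1: (batch_size, n_tokens, sequence length)
loss = loss_fn(logits.permute(1, 2, 0), output_tgt)

# Equivalent formulation: flatten every position into one long batch
flat = loss_fn(logits.permute(1, 0, 2).reshape(-1, n_tokens), output_tgt.reshape(-1))
assert torch.allclose(loss, flat)
```

Both forms average the per-token loss over all batch_size * (seq_length - 1) positions, so they agree under the default mean reduction.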
            
        


In [137]:
net = Transformer(n_tokens=4, d_model=512, n_head=2, n_encoder=3, n_decoder=3, feed_forward=2048, dropout=0.1)
optimizer = optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # fall back to CPU when no GPU is present
net.to(device);

In [138]:
loss = train_loop(net=net, optimizer=optimizer, loss_fn=loss_fn, dataloader=train_dataloader, epochs=1, device=device)

-------------------------Epoch Started-------------------------
2240/8992 Samples Trained Loss: 0.2915380001068115
4480/8992 Samples Trained Loss: 0.1993098258972168
6720/8992 Samples Trained Loss: 0.19259800016880035
8960/8992 Samples Trained Loss: 0.19939610362052917
-------------------------Epoch Complete-------------------------
Average Loss: 0.24705622626263052
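
`val_dataloader` is built above but never used. The following is a minimal sketch of a validation pass, assuming the same batch layout (batch size, source/target, sequence length) and the same `net(x, input_tgt, tgt_mask)` call as `train_loop`. `validation_loop` is a hypothetical helper, not part of the original notebook.

```python
import torch
import torch.nn as nn

def validation_loop(net, loss_fn, dataloader, device=None):
    """Return the mean loss over all batches without updating any weights."""
    net.eval()                       # disable dropout
    total_loss = 0.0
    with torch.no_grad():            # no gradients are needed for evaluation
        for data in dataloader:
            x = torch.as_tensor(data[:, 0], dtype=torch.long, device=device)
            y = torch.as_tensor(data[:, 1], dtype=torch.long, device=device)

            # same shift as in training: feed tokens 0..T-2, predict tokens 1..T-1
            input_tgt, output_tgt = y[:, :-1], y[:, 1:]
            tgt_mask = net.get_tgt_mask(input_tgt.size(1)).to(device)

            pred = net(x, input_tgt, tgt_mask)   # (sequence length, batch_size, n_tokens)
            total_loss += loss_fn(pred.permute(1, 2, 0), output_tgt).item()
    return total_loss / len(dataloader)
```

With the objects defined in this notebook it could be called as `validation_loop(net, loss_fn, val_dataloader, device=device)`; remember to call `net.train()` again before any further training.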


In [139]:
loss