### Required Libraries

In [6]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

### Positional Encoding

In [7]:
"""
Because the network has neither convolutional nor recurrent layers, it needs another way to know where each word sits
relative to the others: 'I saw a movie.' is not the same as 'Movie a I saw.' To provide this, each dimension of the
positional encoding corresponds to a sinusoid of a different wavelength, with sine on the even dimensions and cosine on
the odd ones. The encodings have the same number of dimensions as d model (i.e. d model = embedding dimensions), so they
can be summed with the token embeddings. Dropout is then applied to that sum to reduce overfitting before it is passed
to the first sub-layer.
"""

class PositionalEncoder(nn.Module):
    def __init__(self, d_model, dropout, maxlen):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        
        # Positional Encoding Sinusoid Formula
        pos_encoding = torch.zeros(maxlen, d_model)
        pos = torch.arange(0, maxlen, dtype=torch.float).view(-1, 1) # column vector of positions 0, 1, ..., maxlen - 1
        denom = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0)) / d_model) # 1 / 10000^(2i/d_model)
        pos_encoding[:, 0::2] = torch.sin(pos * denom)
        pos_encoding[:, 1::2] = torch.cos(pos * denom)
        
        # save encodings as a buffer (no gradients), shape: (maxlen, 1, d_model)
        pos_encoding = pos_encoding.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pos_encoding', pos_encoding)
        
    def forward(self, token_embedding):
        # add positional encoding, then apply dropout; token_embedding shape: (sequence length, batch_size, d_model)
        return self.dropout(token_embedding + self.pos_encoding[:token_embedding.size(0), :])
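
As a sanity check on the sinusoid formula, the following dependency-free sketch computes the same table in plain Python. The helper name `sinusoid_table` is hypothetical and is not part of the class above; it assumes an even `d_model`, as the class does.

```python
import math
from typing import List

def sinusoid_table(maxlen: int, d_model: int) -> List[List[float]]:
    """Return a maxlen x d_model table: sine on even columns, cosine on odd columns."""
    table = []
    for pos in range(maxlen):
        row = []
        for i in range(0, d_model, 2):  # i plays the role of 2i in the formula
            angle = pos / (10000.0 ** (i / d_model))
            row.append(math.sin(angle))  # column i
            row.append(math.cos(angle))  # column i + 1
        table.append(row)
    return table

table = sinusoid_table(maxlen=4, d_model=6)
for pos, row in enumerate(table):
    print(pos, [round(v, 4) for v in row])
```

Position 0 always encodes to alternating 0 and 1, because sin(0) = 0 and cos(0) = 1; later columns change more slowly with position since their wavelengths are longer.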

### Transformer Network

In [8]:
"""
Transformer network architecture declaration. The basic architecture requires the number of tokens (i.e. the total number of unique words),
d model, the number of dimensions the encoder/decoder expects as input (the embedding size), the number of heads used for multi-head
attention (i.e. the number of parallel linear projections of Q, K, & V), the number of encoder layers, which sets how many multi-head
attention (normed) & feed forward (normed) sequences are stacked, the number of decoder layers, which sets how many masked multi-head
attention (normed), multi-head attention (normed), & feed forward (normed) sequences are stacked, feed forward, the number of hidden
dimensions in the feed forward network, and lastly dropout, the fraction of activations dropped when passing to another sub-layer
(i.e. the fraction dropped in masked multi-head attention, multi-head attention, & the feed forward network).


Encoder
Inputs are fed into the embedding layer of the encoder, where the tokenized inputs are transformed into vectors with d model dimensions.
Next the embeddings are positionally encoded as described in the PositionalEncoder class above. Dropout is applied to the result, which
is then passed to the 1st sub-layer of the encoder. Each sub-layer adds its input to its output through a residual connection before
normalization.

The multi-head attention layer (1st sub-layer) receives the positionally encoded embeddings and, for each position, computes a weighted
sum of the values, where the weights come from the compatibility function between that position's query and every key (i.e. each word is
weighted by its similarity to the other words). This is done in parallel over h (h being the number of heads) learned linear projections;
the outputs of the heads are concatenated, projected once more, added to the sub-layer input, and normalized.

The feed forward network (2nd sub-layer) receives this normalized result and transforms each position independently so that it is
better suited as input to the next multi-head attention layer. This result is normalized, and the output of the last encoder layer is
passed to the unmasked multi-head attention layer in the decoder.

Decoder
The decoder takes the target sequence shifted right, embeds it, then positionally encodes it (described above).
As in the encoder, dropout is applied to the positionally encoded embeddings before they are passed to the 1st sub-layer,
and residual connections carry each sub-layer's input around it.

The masked multi-head attention layer (1st sub-layer) receives the positionally encoded embeddings and operates the same as the normal
multi-head attention layer (described above), but it masks the attention scores to prevent positions from attending to subsequent
positions (i.e. it makes sure a position can only be predicted from the positions that precede it). The result is normalized then
passed to the unmasked multi-head attention layer.

The unmasked multi-head attention layer (2nd sub-layer) receives the result from the 1st sub-layer, which supplies the queries, and the
normalized output of the encoder, which supplies the keys and values. It then performs the same computation as the multi-head attention
layer described in the encoder. This result is normalized then fed into the feed forward network.

The feed forward network (3rd sub-layer) performs the same operations as described in the encoder. The result is then normalized and passed to
a linear layer so that softmax can be applied to compute the probabilities for the next predicted word.

Note: residual connections are used to mitigate the vanishing gradient issue as gradients are back-propagated when learning.

Defaults
d_model:
model defaults to 512 dimensions (fewer dimensions mean less information passed between sub-layers but cheaper computation,
while more dimensions mean more information at a higher computational cost)

n_head:
defaults to 8 (the number of scaled dot-product attention layers equals the number of heads, i.e. the number of
learnable linear projections depends on the number of heads; d_model must be divisible by n_head)

n_encoder:
defaults to 6 and defines the number of encoder layers stacked in the transformer network

n_decoder:
defaults to 6 and defines the number of decoder layers stacked in the transformer network

feed_forward:
defaults to 2048 hidden dimensions (increasing dimensions may help improve fit of input to next multi-head attention layer)

dropout: 
defaults to 0.1 or 10%. Dropout helps prevent overfitting, but dropping too many inputs may prevent the network from learning
anything, while dropping too few may let the network overfit
"""

class Transformer(nn.Module):

    # Constructor
    def __init__(self, n_tokens: int, d_model: int=512, n_head: int=8, n_encoder: int=6, n_decoder: int=6, feed_forward: int=2048, dropout: float=0.1):         
        super().__init__()
        self.model_type = 'Transformer'
        self.d_model = d_model

        # transformer layers
        self.embedding = nn.Embedding(n_tokens, d_model)
        self.pos_encoder = PositionalEncoder(d_model=d_model, dropout=dropout, maxlen=5000)
        self.transformer = nn.Transformer(d_model=d_model, nhead=n_head, num_encoder_layers=n_encoder, 
                                        num_decoder_layers=n_decoder, dim_feedforward=feed_forward, dropout=dropout)
        self.out = nn.Linear(d_model, n_tokens)
        
    def forward(self, src: torch.LongTensor, tgt: torch.LongTensor, tgt_mask=None, src_pad_mask=None, tgt_pad_mask=None) -> torch.Tensor:
        # src shape: (batch_size, src seq_length), tgt shape: (batch_size, tgt seq_length)

        # embedding, scaled by sqrt(d_model); shape: (batch_size, sequence length, d_model)
        src = self.embedding(src) * np.sqrt(self.d_model)
        tgt = self.embedding(tgt) * np.sqrt(self.d_model)

        # reshape to (sequence length, batch_size, d_model) before positional encoding,
        # because PositionalEncoder indexes positions along dimension 0
        src = src.permute(1,0,2)
        tgt = tgt.permute(1,0,2)
        src = self.pos_encoder(src)
        tgt = self.pos_encoder(tgt)

        # transformer output shape: (tgt sequence length, batch_size, d_model)
        decoder_out = self.transformer(src, tgt, tgt_mask=tgt_mask, src_key_padding_mask=src_pad_mask, tgt_key_padding_mask=tgt_pad_mask)

        # project to vocabulary logits, shape: (tgt sequence length, batch_size, n_tokens);
        # softmax is not applied here because Cross Entropy Loss applies it internally
        out = self.out(decoder_out)
        return out
    
    # create mask for tgt
    def get_tgt_mask(self, seq_length: int) -> torch.Tensor:
        # keep decoder from peeking ahead (i.e. show one word at a time in the sequence)
        mask = torch.tril(torch.ones(seq_length, seq_length) == 1) # Lower triangular matrix
        mask = mask.float()
        mask = mask.masked_fill(mask == 0, float('-inf')) # Convert zeros to -inf
        mask = mask.masked_fill(mask == 1, float(0.0)) # Convert ones to 0
        return mask
    
    # create boolean mask that is True at padding positions so attention ignores them
    def create_pad_mask(self, matrix: torch.LongTensor, pad_val: int) -> torch.BoolTensor:
        # matrix = [3, 2, 1, 8, 0, 0, 0], pad_val = 0 -> [False, False, False, False, True, True, True]
        return (matrix == pad_val)
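
To make the effect of the target mask concrete, here is a minimal sketch in plain Python that builds the same additive mask (0 on and below the diagonal, `-inf` above it) and applies it to a row-wise softmax. The helpers `causal_mask` and `softmax` and the all-ones `raw_scores` are illustrative assumptions; they are not how `nn.Transformer` is implemented internally.

```python
import math

def causal_mask(seq_length):
    """Additive mask: 0.0 where attention is allowed, -inf for future positions."""
    return [[0.0 if col <= row else float('-inf') for col in range(seq_length)]
            for row in range(seq_length)]

def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
raw_scores = [[1.0] * 4 for _ in range(4)]  # placeholder attention scores, all equal

# Adding the mask before softmax drives the weight of future positions to zero.
weights = [softmax([s + m for s, m in zip(score_row, mask_row)])
           for score_row, mask_row in zip(raw_scores, mask)]
for row in weights:
    print([round(w, 2) for w in row])
```

Each row attends only to itself and earlier positions: the first row puts all its weight on position 0, and the last row spreads it evenly over all four.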

In [9]:
def generate_dataset(n_tokens, n_samples, minlen, maxlen, pad_val=0, batch_size=64):
    # build a copy task: each sample is a random token sequence, right-padded to maxlen, used as both source and target
    samples = []
    for _ in range(n_samples):
        seq_len = np.random.randint(minlen, maxlen + 1) # random length in [minlen, maxlen]
        seq = np.random.randint(1, n_tokens + 1, seq_len) # tokens in [1, n_tokens]; the default pad_val of 0 is never drawn
        if seq_len < maxlen:
            pad = np.full(maxlen - seq_len, pad_val, dtype=seq.dtype)
            seq = np.append(seq, pad)
        samples.append(seq)
    samples = np.array(samples, dtype=np.int64)
    dataset = torch.utils.data.TensorDataset(torch.from_numpy(samples), torch.from_numpy(samples))
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
    return dataloader

In [19]:
def train(net, optimizer, loss_fn, dataloader, epochs=3, epoch_pct=0.25, device='cpu'):
    net.train()
    n = len(dataloader.dataset)
    m = len(dataloader)

    acc_loss = 0

    for epoch in range(epochs):

        samples_trained = 0
        
        for i, batch in enumerate(dataloader, 0):
            batch_size = batch[0].size(0)
            # get inputs and move to device
            inputs, labels = batch
            src, tgt = inputs.to(device), labels.to(device)

            # shift target: decoder input drops the last token, expected output drops the first
            tgt_input, tgt_output = tgt[:, :-1], tgt[:, 1:]

            # mask out pad tokens and mask target input to keep decoder from cheating
            tgt_len = tgt_input.size(1)
            tgt_mask = net.get_tgt_mask(tgt_len).to(device)
            src_pad_mask = net.create_pad_mask(src, pad_val=0).to(device)
            tgt_pad_mask = net.create_pad_mask(tgt_input, pad_val=0).to(device)

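            # compute prediction (unnormalized logits; the loss function applies softmax)
            pred = net(src, tgt_input, tgt_mask, src_pad_mask, tgt_pad_mask)
            pred = pred.permute(1, 2, 0) # (seq length, batch_size, n_tokens) -> (batch_size, n_tokens, seq length) for the loss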

            # compute loss and backpropagate
            loss = loss_fn(pred, tgt_output)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            acc_loss += loss.detach().item()
            samples_trained += batch_size
            
            if (i + 1) % max(1, int(m * epoch_pct)) == 0: # max(1, ...) avoids modulo by zero on small dataloaders
                print(f'{(i + 1) / m * 100:.0f}% of epoch completed\n{samples_trained if samples_trained < n else n}/{n} Samples trained\nLoss: {loss.item():.4f}')
        print(f'Epoch {epoch + 1} complete\n{n}/{n} Samples trained\nLoss: {loss.item():.4f}')
    print(f'Training complete\nAverage loss: {acc_loss / (m * epochs):.4f}')
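
The target shift and padding mask used in the training loop can be sketched on a single sequence in plain Python. The helpers `shift_target` and `pad_mask` and the example token values are hypothetical, chosen only to mirror `tgt[:, :-1]`, `tgt[:, 1:]`, and `create_pad_mask` above.

```python
PAD = 0

def shift_target(tgt):
    """Split a target sequence into the decoder input and the labels it should predict."""
    return tgt[:-1], tgt[1:]

def pad_mask(seq, pad_val=PAD):
    """True marks padding positions that attention should ignore."""
    return [tok == pad_val for tok in seq]

tgt = [7, 3, 9, 4, PAD, PAD]
tgt_input, tgt_output = shift_target(tgt)

# With the causal mask, step t sees tgt_input[:t + 1] and is scored against tgt_output[t].
for step, label in enumerate(tgt_output):
    print(f'step {step}: decoder sees {tgt_input[:step + 1]} -> should predict {label}')
print(pad_mask(tgt_input))
```

In this sketch, as in the training loop, the trailing labels are padding and still count toward the loss. `nn.CrossEntropyLoss` accepts an `ignore_index` argument that could exclude them, which the notebook does not use.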

In [23]:
n_tokens = 10000
n_samples = 10000
minlen = 15
maxlen = 30

dataloader = generate_dataset(n_tokens, n_samples, minlen, maxlen)
dataiter = iter(dataloader)
batch = next(dataiter)

inputs, labels = batch
print(f'Batch shape: {inputs.size()}\nExample source: {inputs[0]}\nShifted target input: {labels[0][:-1]}\nShifted target output: {labels[0][1:]}')

Batch shape: torch.Size([64, 30])
Example source: tensor([1413, 1187, 2377,  183, 5033, 6109, 9794, 7173, 6105, 6650,  938, 3494,
        8649, 6451, 8038, 5288, 9154, 7700, 9356, 4300, 9183, 2765,    0,    0,
           0,    0,    0,    0,    0,    0])
Shifted target input: tensor([1413, 1187, 2377,  183, 5033, 6109, 9794, 7173, 6105, 6650,  938, 3494,
        8649, 6451, 8038, 5288, 9154, 7700, 9356, 4300, 9183, 2765,    0,    0,
           0,    0,    0,    0,    0])
Shifted target output: tensor([1187, 2377,  183, 5033, 6109, 9794, 7173, 6105, 6650,  938, 3494, 8649,
        6451, 8038, 5288, 9154, 7700, 9356, 4300, 9183, 2765,    0,    0,    0,
           0,    0,    0,    0,    0])


In [24]:
net = Transformer(n_tokens=n_tokens + 1, d_model=512, n_head=2, n_encoder=3, n_decoder=3, feed_forward=2048, dropout=0.1)
optimizer = optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Training on {"gpu" if device == torch.device("cuda") else "cpu"}')
net.to(device);

Training on gpu


In [25]:
train(net, optimizer, loss_fn, dataloader, epochs=1, device=device)

25% of epoch completed
2496/10000 Samples trained
Loss: 7.0962
50% of epoch completed
4992/10000 Samples trained
Loss: 6.7021
75% of epoch completed
7488/10000 Samples trained
Loss: 7.6499
99% of epoch completed
9984/10000 Samples trained
Loss: 7.2695
Epoch 1 complete
10000/10000 Samples trained
Loss: 7.3284
Training complete
Average loss: 7.1726
