## Reversing a Sequence


I can't get my Transformer to work! Let's make the problem even simpler.

Given a sequence of numbers, simply reverse the sequence.

```
input = 0 1 5 9 0 3 5 2 5
reversed = 5 2 5 3 0 9 5 1 0
```

In [1]:
import torch
import random
import numpy as np

from torch.utils.data import Dataset, DataLoader
from transformers import BertConfig, BertModel

from radam import RAdam
from utils import get_accuracy, get_output_for_example, train

In [2]:
def generate_data(num_examples: int, seq_len: int, vocab_size: int):
    inputs = np.random.randint(0, vocab_size, size=(num_examples, seq_len))
    outputs = np.ascontiguousarray(np.flip(inputs, 1)) #PyTorch can't handle negative strides

    return inputs, outputs

In [3]:
generate_data(num_examples=1, seq_len=10, vocab_size=10)

(array([[8, 9, 7, 4, 1, 7, 1, 6, 9, 9]]),
 array([[9, 9, 6, 1, 7, 1, 4, 7, 9, 8]]))

In [4]:
class ToyDataset(Dataset):
   
    def __init__(self, num_examples, sequence_length, vocab_size):
        self.items, self.labels = generate_data(num_examples, sequence_length, vocab_size)
        
    def __getitem__(self, idx):
        
        x = torch.Tensor(self.items[idx]).long()
        y = torch.Tensor(self.labels[idx]).long()
        return x.cuda(), y.cuda()
    
    def __len__(self):
        return len(self.items)

In [5]:
SEQ_LENGTH = 5
VOCAB_SIZE = 10

TRN_EXAMPLES = 128
VAL_EXAMPLES = 12
BATCH_SIZE = 16
LR = 1e-4

In [6]:
train_ds = ToyDataset(num_examples=TRN_EXAMPLES, sequence_length=SEQ_LENGTH, vocab_size=VOCAB_SIZE)
valid_ds = ToyDataset(num_examples=VAL_EXAMPLES, sequence_length=SEQ_LENGTH, vocab_size=VOCAB_SIZE)
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE)
valid_dl = DataLoader(train_ds, batch_size=BATCH_SIZE)

In [7]:
class ToyModel(torch.nn.Module):
    """
    Wrapper around a BERT model
    """
    
    def __init__(self, vocab_size):
        super().__init__()
        
        # Untrained BERT Model
        config = BertConfig(vocab_size_or_config_json_file=vocab_size)
        self.bert_model = BertModel(config)
        self.linear = torch.nn.Linear(in_features=768, out_features=vocab_size)
        
    def forward(self, x):    
        out, _ = self.bert_model(x)
        out = self.linear(out)
        return out    

In [8]:
model = ToyModel(VOCAB_SIZE)
model = model.cuda()

In [9]:
# NOTE: We use RAdam to avoid having to use warmup
# If we use regular Adam, this usually won't converge for long sequences
optimizer = RAdam(model.parameters(), lr=LR)
loss_fn = torch.nn.CrossEntropyLoss()

In [10]:
train(model, train_dl, valid_dl, loss_fn, optimizer, num_epochs=15, print_every=100)

Epoch:	 0 	Step:	 0 	Loss:	 2.3936386108398438
Epoch:	 0 			Valid Accuracy	 0.134375
Epoch:	 1 	Step:	 0 	Loss:	 2.263537883758545
Epoch:	 1 			Valid Accuracy	 0.275
Epoch:	 2 	Step:	 0 	Loss:	 2.0607337951660156
Epoch:	 2 			Valid Accuracy	 0.35625
Epoch:	 3 	Step:	 0 	Loss:	 1.768286943435669
Epoch:	 3 			Valid Accuracy	 0.4015625
Epoch:	 4 	Step:	 0 	Loss:	 1.6182388067245483
Epoch:	 4 			Valid Accuracy	 0.446875
Epoch:	 5 	Step:	 0 	Loss:	 1.4450860023498535
Epoch:	 5 			Valid Accuracy	 0.5109375
Epoch:	 6 	Step:	 0 	Loss:	 1.3252371549606323
Epoch:	 6 			Valid Accuracy	 0.56406254
Epoch:	 7 	Step:	 0 	Loss:	 1.2489956617355347
Epoch:	 7 			Valid Accuracy	 0.61875004
Epoch:	 8 	Step:	 0 	Loss:	 1.090881109237671
Epoch:	 8 			Valid Accuracy	 0.675
Epoch:	 9 	Step:	 0 	Loss:	 1.0152145624160767
Epoch:	 9 			Valid Accuracy	 0.72187495
Epoch:	 10 	Step:	 0 	Loss:	 0.8539468050003052
Epoch:	 10 			Valid Accuracy	 0.803125
Epoch:	 11 	Step:	 0 	Loss:	 0.752048671245575
Epoch:	 11 			Vali

In [11]:
# An example from the training set
x, y = train_ds[0]
y_hat = get_output_for_example(model, x)

print("X:\t", x)
print("y:\t", y)
print("y_hat:\t", y_hat.squeeze())

X:	 tensor([2, 2, 4, 6, 4], device='cuda:0')
y:	 tensor([4, 6, 4, 2, 2], device='cuda:0')
y_hat:	 tensor([4, 6, 4, 2, 2], device='cuda:0')


In [12]:
# An out-of-sample example
x = torch.from_numpy(np.arange(SEQ_LENGTH)).long().cuda()
y = torch.flip(x, dims=(0,))
y_hat = get_output_for_example(model, x)

print("X:\t", x)
print("y:\t", y)
print("y_hat:\t", y_hat.squeeze())

X:	 tensor([0, 1, 2, 3, 4], device='cuda:0')
y:	 tensor([4, 3, 2, 1, 0], device='cuda:0')
y_hat:	 tensor([4, 3, 2, 1, 0], device='cuda:0')


### Longer Sequences

Now that we've got something simple working, let's try with longer sequences.

Note that we're going to hold `TRN_EXAMPLES` fixed for now.

In [13]:
# Longer sequence
SEQ_LENGTH = 100
# The rest should be the same as above
VOCAB_SIZE = 10
TRN_EXAMPLES = 128
VAL_EXAMPLES = 12
BATCH_SIZE = 16
LR = 1e-4

In [14]:
train_ds = ToyDataset(num_examples=TRN_EXAMPLES, sequence_length=SEQ_LENGTH, vocab_size=VOCAB_SIZE)
valid_ds = ToyDataset(num_examples=VAL_EXAMPLES, sequence_length=SEQ_LENGTH, vocab_size=VOCAB_SIZE)
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE)
valid_dl = DataLoader(train_ds, batch_size=BATCH_SIZE)

In [15]:
model = ToyModel(VOCAB_SIZE)
model = model.cuda()

In [16]:
# NOTE: We use RAdam to avoid having to use warmup
# If we use regular Adam, this usually won't converge for long sequences
optimizer = RAdam(model.parameters(), lr=LR)
loss_fn = torch.nn.CrossEntropyLoss()

In [17]:
train(model, train_dl, valid_dl, loss_fn, optimizer, num_epochs=100, print_every=100)

Epoch:	 0 	Step:	 0 	Loss:	 2.484060764312744
Epoch:	 0 			Valid Accuracy	 0.10148437
Epoch:	 1 	Step:	 0 	Loss:	 2.3821005821228027
Epoch:	 1 			Valid Accuracy	 0.104453124
Epoch:	 2 	Step:	 0 	Loss:	 2.3782126903533936
Epoch:	 2 			Valid Accuracy	 0.10296875
Epoch:	 3 	Step:	 0 	Loss:	 2.340113878250122
Epoch:	 3 			Valid Accuracy	 0.113984376
Epoch:	 4 	Step:	 0 	Loss:	 2.3262312412261963
Epoch:	 4 			Valid Accuracy	 0.12
Epoch:	 5 	Step:	 0 	Loss:	 2.3109114170074463
Epoch:	 5 			Valid Accuracy	 0.12671874
Epoch:	 6 	Step:	 0 	Loss:	 2.300466775894165
Epoch:	 6 			Valid Accuracy	 0.13382813
Epoch:	 7 	Step:	 0 	Loss:	 2.2993288040161133
Epoch:	 7 			Valid Accuracy	 0.1359375
Epoch:	 8 	Step:	 0 	Loss:	 2.2852816581726074
Epoch:	 8 			Valid Accuracy	 0.1425
Epoch:	 9 	Step:	 0 	Loss:	 2.2793452739715576
Epoch:	 9 			Valid Accuracy	 0.15539062
Epoch:	 10 	Step:	 0 	Loss:	 2.2700774669647217
Epoch:	 10 			Valid Accuracy	 0.15609375
Epoch:	 11 	Step:	 0 	Loss:	 2.2544608116149902
Epoch

Epoch:	 92 			Valid Accuracy	 0.9998437
Epoch:	 93 	Step:	 0 	Loss:	 0.00048018962843343616
Epoch:	 93 			Valid Accuracy	 1.0
Epoch:	 94 	Step:	 0 	Loss:	 0.0004154345369897783
Epoch:	 94 			Valid Accuracy	 1.0
Epoch:	 95 	Step:	 0 	Loss:	 0.0005855700583197176
Epoch:	 95 			Valid Accuracy	 1.0
Epoch:	 96 	Step:	 0 	Loss:	 0.0010000172769650817
Epoch:	 96 			Valid Accuracy	 1.0
Epoch:	 97 	Step:	 0 	Loss:	 0.0003944307682104409
Epoch:	 97 			Valid Accuracy	 0.99976563
Epoch:	 98 	Step:	 0 	Loss:	 0.00048108608461916447
Epoch:	 98 			Valid Accuracy	 1.0
Epoch:	 99 	Step:	 0 	Loss:	 0.0005362260271795094
Epoch:	 99 			Valid Accuracy	 1.0


So we can still learn a solution to this problem with a training set of just 128 examples, it just takes us a lot longer!

### Larger Vocab

Let's try the same thing but with a larger vocab. 

Note that we'll hold all other parameters at their original values.

In [18]:
# Longer sequence
SEQ_LENGTH = 10
# The rest should be the same as above
VOCAB_SIZE = 1000
TRN_EXAMPLES = 128
VAL_EXAMPLES = 12
BATCH_SIZE = 16
LR = 1e-4

In [19]:
train_ds = ToyDataset(num_examples=TRN_EXAMPLES, sequence_length=SEQ_LENGTH, vocab_size=VOCAB_SIZE)
valid_ds = ToyDataset(num_examples=VAL_EXAMPLES, sequence_length=SEQ_LENGTH, vocab_size=VOCAB_SIZE)
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE)
valid_dl = DataLoader(train_ds, batch_size=BATCH_SIZE)

In [20]:
model = ToyModel(VOCAB_SIZE)
model = model.cuda()

In [21]:
# NOTE: We use RAdam to avoid having to use warmup
# If we use regular Adam, this usually won't converge for long sequences
optimizer = RAdam(model.parameters(), lr=LR)
loss_fn = torch.nn.CrossEntropyLoss()

In [22]:
train(model, train_dl, valid_dl, loss_fn, optimizer, num_epochs=40, print_every=100)

Epoch:	 0 	Step:	 0 	Loss:	 7.012752532958984
Epoch:	 0 			Valid Accuracy	 0.00078125
Epoch:	 1 	Step:	 0 	Loss:	 6.968691349029541
Epoch:	 1 			Valid Accuracy	 0.003125
Epoch:	 2 	Step:	 0 	Loss:	 6.857752799987793
Epoch:	 2 			Valid Accuracy	 0.00625
Epoch:	 3 	Step:	 0 	Loss:	 6.7144598960876465
Epoch:	 3 			Valid Accuracy	 0.00703125
Epoch:	 4 	Step:	 0 	Loss:	 6.581044673919678
Epoch:	 4 			Valid Accuracy	 0.0171875
Epoch:	 5 	Step:	 0 	Loss:	 6.428334712982178
Epoch:	 5 			Valid Accuracy	 0.03671875
Epoch:	 6 	Step:	 0 	Loss:	 6.266859531402588
Epoch:	 6 			Valid Accuracy	 0.08359375
Epoch:	 7 	Step:	 0 	Loss:	 6.009130001068115
Epoch:	 7 			Valid Accuracy	 0.1359375
Epoch:	 8 	Step:	 0 	Loss:	 5.766554832458496
Epoch:	 8 			Valid Accuracy	 0.215625
Epoch:	 9 	Step:	 0 	Loss:	 5.511596202850342
Epoch:	 9 			Valid Accuracy	 0.33359376
Epoch:	 10 	Step:	 0 	Loss:	 5.242946624755859
Epoch:	 10 			Valid Accuracy	 0.5023438
Epoch:	 11 	Step:	 0 	Loss:	 4.986907005310059
Epoch:	 11 			

We can also learn a solution to this problem in roughly 30-50 epochs.

### Larger Vocab and Longer Sequences

Let's make it about as hard as we can. 

We'll crank up both the vocabular and sequence length while keeping everything else the same.

In [23]:
# Longer sequence and larger vocabular
SEQ_LENGTH = 100
VOCAB_SIZE = 1000
# The rest should be the same as above
TRN_EXAMPLES = 128
VAL_EXAMPLES = 12
BATCH_SIZE = 16
LR = 1e-4

In [24]:
train_ds = ToyDataset(num_examples=TRN_EXAMPLES, sequence_length=SEQ_LENGTH, vocab_size=VOCAB_SIZE)
valid_ds = ToyDataset(num_examples=VAL_EXAMPLES, sequence_length=SEQ_LENGTH, vocab_size=VOCAB_SIZE)
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE)
valid_dl = DataLoader(train_ds, batch_size=BATCH_SIZE)

In [25]:
model = ToyModel(VOCAB_SIZE)
model = model.cuda()

In [26]:
# NOTE: We use RAdam to avoid having to use warmup
# If we use regular Adam, this usually won't converge for long sequences
optimizer = RAdam(model.parameters(), lr=LR)
loss_fn = torch.nn.CrossEntropyLoss()

In [27]:
train(model, train_dl, valid_dl, loss_fn, optimizer, num_epochs=150, print_every=100)

Epoch:	 0 	Step:	 0 	Loss:	 7.081050395965576
Epoch:	 0 			Valid Accuracy	 0.00078124995
Epoch:	 1 	Step:	 0 	Loss:	 7.065215587615967
Epoch:	 1 			Valid Accuracy	 0.00070312497
Epoch:	 2 	Step:	 0 	Loss:	 7.002694606781006
Epoch:	 2 			Valid Accuracy	 0.0015624999
Epoch:	 3 	Step:	 0 	Loss:	 6.952529430389404
Epoch:	 3 			Valid Accuracy	 0.00140625
Epoch:	 4 	Step:	 0 	Loss:	 6.907620906829834
Epoch:	 4 			Valid Accuracy	 0.002421875
Epoch:	 5 	Step:	 0 	Loss:	 6.8764729499816895
Epoch:	 5 			Valid Accuracy	 0.0027343747
Epoch:	 6 	Step:	 0 	Loss:	 6.852637767791748
Epoch:	 6 			Valid Accuracy	 0.00375
Epoch:	 7 	Step:	 0 	Loss:	 6.813115119934082
Epoch:	 7 			Valid Accuracy	 0.0052343747
Epoch:	 8 	Step:	 0 	Loss:	 6.747159481048584
Epoch:	 8 			Valid Accuracy	 0.009921875
Epoch:	 9 	Step:	 0 	Loss:	 6.6566009521484375
Epoch:	 9 			Valid Accuracy	 0.013828125
Epoch:	 10 	Step:	 0 	Loss:	 6.5482306480407715
Epoch:	 10 			Valid Accuracy	 0.021015625
Epoch:	 11 	Step:	 0 	Loss:	 6.43660

Epoch:	 93 			Valid Accuracy	 0.99953127
Epoch:	 94 	Step:	 0 	Loss:	 0.13630490005016327
Epoch:	 94 			Valid Accuracy	 0.9996875
Epoch:	 95 	Step:	 0 	Loss:	 0.13207511603832245
Epoch:	 95 			Valid Accuracy	 0.9992969
Epoch:	 96 	Step:	 0 	Loss:	 0.12662336230278015
Epoch:	 96 			Valid Accuracy	 0.99953127
Epoch:	 97 	Step:	 0 	Loss:	 0.11802639812231064
Epoch:	 97 			Valid Accuracy	 0.99976563
Epoch:	 98 	Step:	 0 	Loss:	 0.11407341808080673
Epoch:	 98 			Valid Accuracy	 0.99976563
Epoch:	 99 	Step:	 0 	Loss:	 0.10969527065753937
Epoch:	 99 			Valid Accuracy	 0.9998437
Epoch:	 100 	Step:	 0 	Loss:	 0.10536570847034454
Epoch:	 100 			Valid Accuracy	 0.99976563
Epoch:	 101 	Step:	 0 	Loss:	 0.10417982935905457
Epoch:	 101 			Valid Accuracy	 0.9996875
Epoch:	 102 	Step:	 0 	Loss:	 0.09748704731464386
Epoch:	 102 			Valid Accuracy	 0.9998437
Epoch:	 103 	Step:	 0 	Loss:	 0.09466937929391861
Epoch:	 103 			Valid Accuracy	 1.0
Epoch:	 104 	Step:	 0 	Loss:	 0.0924793928861618
Epoch:	 104 		

This also works. 

We've basically made a very expensive `reversed()`