
# Deep Learning CS6073 Assignment 8

    By Akhil Kanna Devarashetti
    
### Question:

    This programming assignment is based on https://github.com/pytorch/examples/tree/master/word_language_model
    But we will only run the Transformer.
    Download train.txt, valid.txt, and test.txt to ./data/wikitext-2/.
    You may need to run python main.py with specification of the selection of Transformer, or python main2.py, 
    which along with model2.py, is a simplified version only for the Transformer and with a few epochs.
    We need data.py to start
    and generate.py to show the learning result.
    Show that you indeed have spent time in studying and running the programs.
    
_____
 
 To show that I studied the program and the Transformer model, I have added comments to each part of the program.
 I split the complete program into separate parts so that I can describe what happens at each step.
 
 The goal of this code is to predict the next word given a sequence of words.
 A Transformer model can do this quite well without using a recurrent architecture.
 
 In this notebook, I run the model on a very small dataset to check how things work.
 The run with the complete dataset is in the *other notebook*, titled `full_execution.ipynb`.

In [0]:
import math
import torch
import torch.nn as nn
import os
from io import open
import torch.nn.functional as F

In [0]:
bptt = 4 # Reduced from 20 to check how it works.
loginterval = 200

In [3]:
!apt-get install tree  # If tree doesn't work

Reading package lists... Done
Building dependency tree       
Reading state information... Done
tree is already the newest version (1.7.0-5).
0 upgraded, 0 newly installed, 0 to remove and 25 not upgraded.


### Downloading the dataset into `data/wikitext-2` using a bash script

In [4]:
!mkdir -p data/wikitext-2
!curl https://raw.githubusercontent.com/pytorch/examples/master/word_language_model/data/wikitext-2/test.txt > data/wikitext-2/test.txt
!curl https://raw.githubusercontent.com/pytorch/examples/master/word_language_model/data/wikitext-2/train.txt > data/wikitext-2/train.txt
!curl https://raw.githubusercontent.com/pytorch/examples/master/word_language_model/data/wikitext-2/valid.txt > data/wikitext-2/valid.txt
!ls data/wikitext-2/

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1227k  100 1227k    0     0  27.2M      0 --:--:-- --:--:-- --:--:-- 27.2M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.2M  100 10.2M    0     0   117M      0 --:--:-- --:--:-- --:--:--  117M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1095k  100 1095k    0     0  27.4M      0 --:--:-- --:--:-- --:--:-- 27.4M
test.txt  train.txt  valid.txt


## Let's take a small dataset to analyse what's happening.

In [5]:
!mkdir -p data/sample/
!echo -e 'a quick brown fox jumps over the lazy dog\na lazy dog is not a brown fox' > data/sample/test.txt
!echo -e "the quick brown fox moves faster than the lazy dog\nthis is because the quick brown fox is smarter than the lazy dog" > data/sample/train.txt
!echo -e "a quick brown fox jumps over the lazy dog\na lazy dog is not a brown fox" > data/sample/valid.txt
# !apt-get install tree  # If tree doesn't work
!tree data

data
├── sample
│   ├── test.txt
│   ├── train.txt
│   └── valid.txt
└── wikitext-2
    ├── test.txt
    ├── train.txt
    └── valid.txt

2 directories, 6 files


In [6]:
!cat data/sample/train.txt

the quick brown fox moves faster than the lazy dog
this is because the quick brown fox is smarter than the lazy dog


## The Dictionary class saves the mapping between words and their indices.
## Every word is assigned an integer based on the order of its first occurrence.

    Example input: "a quick brown fox jumps over the lazy dog\na lazy dog is not a brown fox <eos>"

    Output: 
    {'a': 0, 'quick': 1, 'brown': 2, 'fox': 3, 'jumps': 4, 'over': 5, 'the': 6, 'lazy': 7, 'dog': 8, 'is': 9, 'not': 10, '<eos>': 11}

In [0]:
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)

In [8]:
sample_input = "a quick brown fox jumps over the lazy dog\na lazy dog is not a brown fox <eos>"

sample_dict = Dictionary()

for word in sample_input.split():
    sample_dict.add_word(word)

print(sample_dict.word2idx)

{'a': 0, 'quick': 1, 'brown': 2, 'fox': 3, 'jumps': 4, 'over': 5, 'the': 6, 'lazy': 7, 'dog': 8, 'is': 9, 'not': 10, '<eos>': 11}
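
The reverse mapping `idx2word` is what turns word IDs back into text. Below is a minimal sketch of a round trip, assuming the same `Dictionary` class as above (repeated so the snippet runs on its own). The helper names `encode` and `decode` are illustrative and not part of the original program.

```python
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]


def encode(dictionary, text):
    # Hypothetical helper: map each whitespace-separated word to its ID,
    # adding unseen words to the dictionary on the way.
    return [dictionary.add_word(word) for word in text.split()]


def decode(dictionary, ids):
    # Hypothetical helper: map IDs back to words through idx2word.
    return " ".join(dictionary.idx2word[i] for i in ids)


demo_dict = Dictionary()
demo_ids = encode(demo_dict, "a lazy dog is not a brown fox")
print(demo_ids)                      # [0, 1, 2, 3, 4, 0, 5, 6]
print(decode(demo_dict, demo_ids))   # a lazy dog is not a brown fox
```

The repeated word 'a' maps to the same ID both times, which is why the ID list contains 0 twice.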


### The Corpus class builds one shared dictionary from all three datasets and converts each dataset into a tensor of word IDs.

    For example:
    test data = "a quick brown fox jumps over the lazy dog\na lazy dog is not a brown fox"
    corpus.test = tensor([14,  1,  2,  3, 15, 16,  0,  7,  8,  9, 14,  7,  8, 11, 17, 14,  2,  3,  9])

Note that 'a' occurs three times, so its ID 14 appears three times too. The ID 9 is the `<eos>` token appended to the end of each line.

In [0]:
class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
        self.test = self.tokenize(os.path.join(path, 'test.txt'))

    def tokenize(self, path):
        """Tokenizes a text file."""
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r', encoding="utf8") as f:
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r', encoding="utf8") as f:
            idss = []
            for line in f:
                words = line.split() + ['<eos>']
                ids = []
                for word in words:
                    ids.append(self.dictionary.word2idx[word])
                idss.append(torch.tensor(ids).type(torch.int64))
            ids = torch.cat(idss)

        return ids

In [0]:
batch_size = 4 # 20
eval_batch_size = 3 # 10

#corpus = Corpus('./data/wikitext-2')
corpus = Corpus('./data/sample')

In [11]:
print(f'Dictionary: {corpus.dictionary.word2idx}')
print('\nTest data:')
!cat data/sample/test.txt
print(f'\ncorpus.test: {corpus.test}')

Dictionary: {'the': 0, 'quick': 1, 'brown': 2, 'fox': 3, 'moves': 4, 'faster': 5, 'than': 6, 'lazy': 7, 'dog': 8, '<eos>': 9, 'this': 10, 'is': 11, 'because': 12, 'smarter': 13, 'a': 14, 'jumps': 15, 'over': 16, 'not': 17}

Test data:
a quick brown fox jumps over the lazy dog
a lazy dog is not a brown fox

corpus.test: tensor([14,  1,  2,  3, 15, 16,  0,  7,  8,  9, 14,  7,  8, 11, 17, 14,  2,  3,
         9])


### As described in the documentation, the batchify function takes the complete dataset (a 1-D tensor of word IDs), trims the remainder that does not fit, and arranges the rest into `bsz` columns.

In [0]:
def batchify(data, bsz):
    # Work out how cleanly we can divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data

In [13]:
print(corpus.test)
sample_batches = batchify(corpus.test, 3)
print(sample_batches)

tensor([14,  1,  2,  3, 15, 16,  0,  7,  8,  9, 14,  7,  8, 11, 17, 14,  2,  3,
         9])
tensor([[14,  0,  8],
        [ 1,  7, 11],
        [ 2,  8, 17],
        [ 3,  9, 14],
        [15, 14,  2],
        [16,  7,  3]])
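
The column layout is easier to see with consecutive integers in place of word IDs. The sketch below repeats `batchify` so that it runs on its own and applies it to `torch.arange(19)`, which has the same length as `corpus.test`. The variable name `demo` is illustrative.

```python
import torch

def batchify(data, bsz):
    nbatch = data.size(0) // bsz                 # rows per column
    data = data.narrow(0, 0, nbatch * bsz)       # drop the remainder
    return data.view(bsz, -1).t().contiguous()   # one column per stream

demo = batchify(torch.arange(19), 3)
print(demo.shape)             # torch.Size([6, 3])
print(demo[:, 0].tolist())    # [0, 1, 2, 3, 4, 5]
print(demo[:, 1].tolist())    # [6, 7, 8, 9, 10, 11]
print(demo[:, 2].tolist())    # [12, 13, 14, 15, 16, 17]
```

Each column is a contiguous slice of the original sequence, and the 19th element (18) is dropped because 19 is not divisible by 3.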


### get_batch slices a chunk of at most bptt rows starting at row i. The source is the output of batchify(), and the target is the same chunk shifted down by one row and flattened.

In [0]:
def get_batch(source, i):
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len].view(-1)
    return data, target

In [15]:
print(get_batch(sample_batches, 0))
print(get_batch(sample_batches, 1))

(tensor([[14,  0,  8],
        [ 1,  7, 11],
        [ 2,  8, 17],
        [ 3,  9, 14]]), tensor([ 1,  7, 11,  2,  8, 17,  3,  9, 14, 15, 14,  2]))
(tensor([[ 1,  7, 11],
        [ 2,  8, 17],
        [ 3,  9, 14],
        [15, 14,  2]]), tensor([ 2,  8, 17,  3,  9, 14, 15, 14,  2, 16,  7,  3]))
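
The relation between data and target can be checked directly. This is a small sketch that passes `bptt` as a parameter (an assumption made only so the snippet is self-contained) and reuses the batchified test data printed above.

```python
import torch

def get_batch(source, i, bptt=4):
    # Same logic as above; bptt is a parameter here instead of a global.
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len].view(-1)
    return data, target

# The batchified test data printed above.
source = torch.tensor([[14, 0, 8], [1, 7, 11], [2, 8, 17],
                       [3, 9, 14], [15, 14, 2], [16, 7, 3]])

data, target = get_batch(source, 0)
# Reshape the flat targets back to (seq_len, batch) for comparison.
target_2d = target.view(data.shape)
# Row t of the target equals row t + 1 of the source.
print(torch.equal(target_2d, source[1:5]))    # True

# The last chunk is shorter because one row must remain as the target.
data_last, target_last = get_batch(source, 4)
print(data_last.shape, target_last.shape)     # torch.Size([1, 3]) torch.Size([3])
```

So for every position the model is asked to predict the word that follows it in the same column.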


In [0]:
train_data = batchify(corpus.train, batch_size)
val_data = batchify(corpus.valid, eval_batch_size)
test_data = batchify(corpus.test, eval_batch_size)

In [17]:
print(f'\nword_idx: \n{corpus.dictionary.word2idx}')
print(f'\nTrain data: {train_data.shape}\n{train_data}')


word_idx: 
{'the': 0, 'quick': 1, 'brown': 2, 'fox': 3, 'moves': 4, 'faster': 5, 'than': 6, 'lazy': 7, 'dog': 8, '<eos>': 9, 'this': 10, 'is': 11, 'because': 12, 'smarter': 13, 'a': 14, 'jumps': 15, 'over': 16, 'not': 17}

Train data: torch.Size([6, 4])
tensor([[ 0,  6, 11, 11],
        [ 1,  0, 12, 13],
        [ 2,  7,  0,  6],
        [ 3,  8,  1,  0],
        [ 4,  9,  2,  7],
        [ 5, 10,  3,  8]])


### The cell below is the implementation of Positional Encoder in PyTorch

In [0]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):

        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)
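
The table built in `__init__` follows the usual sinusoidal pattern: even columns hold sin(pos / 10000^(2i / d_model)) and odd columns hold the matching cosine. The sketch below rebuilds the table outside the module with small illustrative sizes (`d_model = 8`, `max_len = 16`, not the notebook's values) and checks two entries.

```python
import math
import torch

d_model, max_len = 8, 16   # small illustrative sizes (assumptions)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# Position 0: every sine entry is 0 and every cosine entry is 1.
print(pe[0].tolist())   # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

# Check one entry against the closed form sin(pos / 10000^(2i / d_model)).
pos, i = 5, 1
expected = math.sin(pos / 10000 ** (2 * i / d_model))
print(abs(pe[pos, 2 * i].item() - expected) < 1e-5)   # True
```

Because the table depends only on the position and the column index, it is stored with `register_buffer` and is not a trained parameter.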

### The cell below is the implementation of the Transformer network. It consists of the following flow for a *SINGLE* sample of a batch:

## 1. Input:
Each sample in a batch is an array of size S in which each element is an integer: the ID of a word in the vocabulary.

Input shape: (S)  (S = sequence length = bptt; 4 in this notebook, reduced from 20)

Example: [1, 45, 3, ...]

## 2. Embedding:
The embedding layer generates a real-valued array of size E for each integer (word). The code also multiplies the embeddings by sqrt(E).

Output shape: (S, E)   (E = embedding size = 100)

Example output: [(1.2, 0.3, 4.5), (x, x, x), ... (x, x, x)] for E = 3

## 3. Positional Encoding:
The positional encoding adds a fixed sine/cosine pattern to each embedding so that the model knows each word's position in the sequence.

Output shape: (S, E)

Example output: [(1.2, 0.3, 4.5), (x, x, x), ... (x, x, x)] for E = 3

## 4. Transformer Encoding:
This module carries out the rest of the Transformer processing (self-attention with queries, keys and values, followed by a feed-forward layer) internally.

Output shape: (S, E)

Example output: [(1.2, 0.3, 4.5), (x, x, x), ... (x, x, x)] for E = 3

## 5. Decoder:
This is a fully connected layer that outputs a vector of size V (the vocabulary size) for each position.

Output shape: (S, V)  (V = vocabulary size)

Example output: [(1.2, 0.3), (x, x), ... (x, x)] for V = 2

## 6. Softmax:
This module converts each array of size V into a probability distribution. The code uses `log_softmax`, so the actual outputs are log-probabilities; the example shows the corresponding probabilities. The word with the highest probability is the predicted next word.

Output shape: (S, V)

Example output: [(0.7, 0.3), (x, x), ... (x, x)] for V = 2
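
The shapes in the steps above can be traced with a few standalone layers. This is only a sketch: it uses a batch of N sequences, so every shape gains a middle dimension N, it leaves out the positional encoding (which does not change the shape), and the variable names are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

S, N, E, V = 4, 3, 100, 18   # sequence length, batch size, embedding size, vocabulary size
tokens = torch.randint(0, V, (S, N))          # word IDs, shape (S, N)

embedding = nn.Embedding(V, E)
encoder_layer = nn.TransformerEncoderLayer(d_model=E, nhead=2, dim_feedforward=64)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
decoder = nn.Linear(E, V)

x = embedding(tokens) * math.sqrt(E)          # (S, N, E)
h = encoder(x)                                # (S, N, E)
logits = decoder(h)                           # (S, N, V)
log_probs = F.log_softmax(logits, dim=-1)     # (S, N, V)

print(x.shape, h.shape, log_probs.shape)
# Exponentiating the log-probabilities gives distributions that sum to 1.
print(torch.allclose(log_probs.exp().sum(dim=-1), torch.ones(S, N), atol=1e-5))
# The predicted next word at each position is the index of the largest entry.
next_word_ids = log_probs.argmax(dim=-1)      # (S, N)
```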

In [0]:
class TransformerModel(nn.Module):
    """Container module with an embedding encoder, a Transformer encoder, and a linear decoder."""

    def __init__(self, ntoken, ninp, nhead, nhid, nlayers, dropout=0.5):
        super(TransformerModel, self).__init__()
        from torch.nn import TransformerEncoder, TransformerEncoderLayer
        self.model_type = 'Transformer'
        self.src_mask = None
        self.pos_encoder = PositionalEncoding(ninp, dropout)
        encoder_layers = TransformerEncoderLayer(ninp, nhead, nhid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.encoder = nn.Embedding(ntoken, ninp)
        self.ninp = ninp
        self.decoder = nn.Linear(ninp, ntoken)

        self.init_weights()

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src, has_mask=True):
        print(f'\nSource: {src.shape}')
        src = self.encoder(src)
        print(f'after embedding = {src.shape}')
        src = src * math.sqrt(self.ninp)
        print(f'Multiplier = {math.sqrt(self.ninp)}')
        print(f'after mult with embedding = {src.shape}')
        src = self.pos_encoder(src)
        print(f'after positional enc = {src.shape}')
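        # self.src_mask is never set in this notebook, so it is None here and
        # has_mask has no effect: attention is not restricted to earlier positions.
        output = self.transformer_encoder(src, self.src_mask)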
        print(f'after transformer enc = {output.shape}')
        output = self.decoder(output)
        print(f'after decoder = {output.shape}')
        softmax_output = F.log_softmax(output, dim=-1)
        print(f'after softmax = {softmax_output.shape}')
        return softmax_output
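
For next-word prediction, the encoder is usually given a causal mask so that position t cannot attend to later positions. Below is a minimal sketch of such a mask and of its effect on a small encoder. The helper `causal_mask` and the layer sizes are illustrative assumptions, not part of this notebook's model.

```python
import torch
import torch.nn as nn

def causal_mask(size):
    # Hypothetical helper: -inf above the diagonal blocks attention to
    # later positions; 0.0 elsewhere leaves the attention scores unchanged.
    return torch.triu(torch.full((size, size), float('-inf')), diagonal=1)

mask = causal_mask(4)
print(mask)

layer = nn.TransformerEncoderLayer(d_model=8, nhead=2, dim_feedforward=16, dropout=0.0)
encoder = nn.TransformerEncoder(layer, num_layers=1)
encoder.eval()

x = torch.randn(4, 1, 8)            # (S, N, E)
x_changed = x.clone()
x_changed[3] = torch.randn(1, 8)    # alter only the last position

with torch.no_grad():
    out = encoder(x, mask=mask)
    out_changed = encoder(x_changed, mask=mask)

# With the mask, the outputs at positions 0..2 do not depend on position 3.
print(torch.allclose(out[:3], out_changed[:3], atol=1e-5))
```

A mask built this way could be passed as the second argument of `self.transformer_encoder(...)` in `forward`.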


### Initializing the values for the architecture

In [0]:
emsize = 100  # E = Embedding size
nhead = 2     # Number of attention heads in each encoder layer
nhid = 64     # Size of the feed-forward hidden layer in each encoder layer
nlayers = 2   # Number of stacked encoder layers
dropout = 0.2 # Dropout probability
ntokens = len(corpus.dictionary)

model = TransformerModel(ntokens, emsize, nhead, nhid, nlayers, dropout)
criterion = nn.NLLLoss()
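
`nn.NLLLoss` expects log-probabilities, which is why the model ends with `log_softmax`. The sketch below, on random illustrative data, checks that this pairing gives the same value as `nn.CrossEntropyLoss` applied to the raw decoder outputs, and the same value as averaging the negative log-probability of each correct word by hand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 18)             # 5 predictions over an 18-word vocabulary
targets = torch.randint(0, 18, (5,))    # the correct next-word IDs

log_probs = F.log_softmax(logits, dim=-1)
nll = nn.NLLLoss()(log_probs, targets)
ce = nn.CrossEntropyLoss()(logits, targets)
# Pick the log-probability of the correct word in each row and average.
manual = -log_probs[torch.arange(5), targets].mean()

print(torch.allclose(nll, ce), torch.allclose(nll, manual))   # True True
```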

### Defining the training loop

In [0]:
def train():
    model.train()
    total_loss = 0.
    for batch, i in enumerate(range(0, train_data.size(0) - 1, bptt)):
        data, targets = get_batch(train_data, i)
        model.zero_grad()
        output = model(data)
        loss = criterion(output.view(-1, ntokens), targets)
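        loss.backward()
        # Note: no parameter update follows backward() in this cell, so the
        # weights stay at their initial values; lr is only used in the log line.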

        total_loss += loss.item()

        if batch % loginterval == 0 and batch > 0:
            cur_loss = total_loss / loginterval
            print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | '
                  'loss {:5.2f} | ppl {:8.2f}'.format(
                epoch, batch, len(train_data) // bptt, lr,
                cur_loss, math.exp(cur_loss)))
            total_loss = 0
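
A parameter update normally follows `loss.backward()`. Below is a minimal sketch of one plain SGD step with gradient clipping, using a small linear model as a stand-in for the Transformer. The names `toy_model` and `loss_value` and the values `lr = 0.1` and `clip = 0.25` are assumptions chosen for illustration, not settings taken from this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Linear(4, 3)   # stand-in for the Transformer
criterion = nn.NLLLoss()
lr, clip = 0.1, 0.25          # illustrative values (assumptions)

inputs = torch.randn(8, 4)
targets = torch.randint(0, 3, (8,))

def loss_value():
    # Hypothetical helper: current loss without tracking gradients.
    with torch.no_grad():
        return criterion(torch.log_softmax(toy_model(inputs), dim=-1), targets).item()

before = loss_value()
toy_model.zero_grad()
loss = criterion(torch.log_softmax(toy_model(inputs), dim=-1), targets)
loss.backward()
# Clip the gradient norm to keep a single step from becoming too large.
torch.nn.utils.clip_grad_norm_(toy_model.parameters(), clip)
# Plain SGD: move each parameter against its gradient.
with torch.no_grad():
    for p in toy_model.parameters():
        p.add_(p.grad, alpha=-lr)
after = loss_value()
print(before, after)
```

The same two steps (clipping, then the in-place update) could be placed right after `loss.backward()` in `train()`, with `model` in place of `toy_model`.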


In [0]:
def evaluate(data_source):
    # Turn on evaluation mode which disables dropout.
    model.eval()
    total_loss = 0.
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, bptt):
            data, targets = get_batch(data_source, i)
            output = model(data)
            output_flat = output.view(-1, ntokens)
            total_loss += len(data) * criterion(output_flat, targets).item()
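    print(data_source.shape)
    # The loop above covers len(data_source) - 1 rows (the last row is only a target),
    # so dividing by len(data_source) slightly underestimates the mean loss.
    return total_loss / (len(data_source))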
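
The printed perplexity is the exponential of the mean loss. A useful baseline is a model that assigns the same probability to each of the 18 words in the sample dictionary: its loss is ln 18 and its perplexity is 18. The second part of the sketch shows the row-weighted average that `evaluate` accumulates, using made-up chunk losses.

```python
import math

vocab_size = 18                        # size of the sample dictionary
uniform_loss = math.log(vocab_size)    # NLL when every word gets probability 1/18
print(round(uniform_loss, 2))             # 2.89
print(round(math.exp(uniform_loss), 2))   # 18.0

# (rows in chunk, mean loss of chunk): made-up numbers for illustration
chunk_losses = [(4, 2.0), (1, 3.0)]
total_rows = sum(rows for rows, _ in chunk_losses)
weighted = sum(rows * loss for rows, loss in chunk_losses) / total_rows
print(weighted)   # 2.2
```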

In [23]:
# Loop over epochs.
lr = 20
best_val_loss = None
epochs = 5
# At any point you can hit Ctrl + C to break out of training early.
try:
    for epoch in range(1, epochs + 1):
        train()
        val_loss = evaluate(val_data)
        print('-' * 89)
        print('| end of epoch {:3d} | valid loss {:5.2f} | '
              'valid ppl {:8.2f}'.format(epoch, val_loss, math.exp(val_loss)))
        print('-' * 89)
        # Save the model if the validation loss is the best we've seen so far.
        if not best_val_loss or val_loss < best_val_loss:
            with open('model.pt', 'wb') as f:
                torch.save(model, f)
            best_val_loss = val_loss
        else:
            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            lr /= 4.0
except KeyboardInterrupt:
    print('-' * 89)
    print('Exiting from training early')


Source: torch.Size([4, 4])
after embedding = torch.Size([4, 4, 100])
Multiplier = 10.0
after mult with embedding = torch.Size([4, 4, 100])
after positional enc = torch.Size([4, 4, 100])
after transformer enc = torch.Size([4, 4, 100])
after decoder = torch.Size([4, 4, 18])
after softmax = torch.Size([4, 4, 18])

Source: torch.Size([1, 4])
after embedding = torch.Size([1, 4, 100])
Multiplier = 10.0
after mult with embedding = torch.Size([1, 4, 100])
after positional enc = torch.Size([1, 4, 100])
after transformer enc = torch.Size([1, 4, 100])
after decoder = torch.Size([1, 4, 18])
after softmax = torch.Size([1, 4, 18])

Source: torch.Size([4, 3])
after embedding = torch.Size([4, 3, 100])
Multiplier = 10.0
after mult with embedding = torch.Size([4, 3, 100])
after positional enc = torch.Size([4, 3, 100])
after transformer enc = torch.Size([4, 3, 100])
after decoder = torch.Size([4, 3, 18])
after softmax = torch.Size([4, 3, 18])

Source: torch.Size([1, 3])
after embedding = torch.Size([1, 



In [24]:
test_loss = evaluate(test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)))
print('=' * 89)


Source: torch.Size([4, 3])
after embedding = torch.Size([4, 3, 100])
Multiplier = 10.0
after mult with embedding = torch.Size([4, 3, 100])
after positional enc = torch.Size([4, 3, 100])
after transformer enc = torch.Size([4, 3, 100])
after decoder = torch.Size([4, 3, 18])
after softmax = torch.Size([4, 3, 18])

Source: torch.Size([1, 3])
after embedding = torch.Size([1, 3, 100])
Multiplier = 10.0
after mult with embedding = torch.Size([1, 3, 100])
after positional enc = torch.Size([1, 3, 100])
after transformer enc = torch.Size([1, 3, 100])
after decoder = torch.Size([1, 3, 18])
after softmax = torch.Size([1, 3, 18])
torch.Size([6, 3])
| End of training | test loss  2.52 | test ppl    12.44
