<a href="https://colab.research.google.com/github/vegarab/code2seq-reproducibility-challenge/blob/feat%2Fmodel/notebooks/code2seq_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Code2Seq model

This notebook is used to experiment with the implementation of the code2seq model in PyTorch.

In [0]:
import torch
import torch.nn.functional as F
import torchbearer
from torchbearer import callbacks
from torch import nn
from torch import optim
from torch.utils.data import DataLoader

from config import Config
from loader import Code2SeqDataset

seed = 7
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

This setup should eventually move to a top-level `code2seq.py` script that loads these parameters from a config file.

In [0]:
config = Config.get_debug_config(None)

In [169]:
test_set = Code2SeqDataset(config.TEST_PATH, config=config)
#train_set = Code2SeqDataset(config.TRAIN_PATH, config=config)
#val_set = Code2SeqDataset(config.VAL_PATH, config=config)

test_loader = DataLoader(test_set, batch_size=config.BATCH_SIZE, shuffle=True)
#train_loader = DataLoader(train_set, batch_size=config.BATCH_SIZE, shuffle=True)
#val_loader = DataLoader(val_set, batch_size=config.BATCH_SIZE, shuffle=True)

Num training samples: 691974
Dictionaries loaded.
Loaded subtoken vocab. size: 73906
Loaded target word vocab. size: 11319
Loaded nodes vocab. size: 323
Processing...



100%|██████████| 57088/57088 [00:00<00:00, 72903.53it/s]

This should be placed in `model.py` once final so that it can be imported. I am reluctant to move it there until it is somewhat functional, since keeping it in the notebook makes the workflow a bit easier.

In [0]:
class Encoder(nn.Module):
    def __init__(self, subtoken_input_dim, nodes_vocab_size):
        super(Encoder, self).__init__()

        # Pretty sure the subtokens share a single embedding even though they are shown as two in the architecture figure??
        # ... and are concatenated as encode_token(value(v1)) and encode_token(value(vl)) - basically encode the first and last token
        # All embeddings need to have the same dim since they are concatenated before being passed to the fc network?
        # Need to look at how these are handled during the forward pass... concat?
        # z = tanh(Wh[encode_path(v1,...,vl);encode_token(value(v1));encode_token(value(vl))])
        self.embedding_subtokens = nn.Embedding(subtoken_input_dim, config.EMBEDDINGS_SIZE) 
        self.embedding_paths = nn.Embedding(nodes_vocab_size, config.EMBEDDINGS_SIZE)

        # LSTM(input_size, hidden_size)
        # This encoder is bidirectional. Not sure what the hidden size is defined as here
        # Could be RNN_SIZE/2 from the paper implementation.. ? n in the paper?
        # Pretty sure the input_size has to follow the output of the embeddings. 
        # This is also why I think they need to be the same for the two different embeddings
        # Dropout?
        self.lstm = nn.LSTM(config.EMBEDDINGS_SIZE, config.RNN_SIZE//2, 
                            bidirectional=True, 
                            num_layers=2,
                            dropout=(1 - config.RNN_DROPOUT_KEEP_PROB))

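        # Linear(in_features, out_features)
        # in_features must match the concatenation built in forward():
        # both LSTM directions (2 * RNN_SIZE//2) plus the start and end token embeddings
        # out_features is l in the paper? or vocab size?
        # as far as I can see, they have disabled the bias in their dense layers
        self.lin = nn.Linear(2 * (config.RNN_SIZE//2) + 2 * config.EMBEDDINGS_SIZE,
                             config.TARGET_VOCAB_MAX_SIZE, bias=False)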

    def forward(self, start_leaf, ast_path, end_leaf, target):
        #batch, max_e, _ = start_leaf.size()
        encode_start = self.embedding_subtokens(start_leaf.long())
        encode_end   = self.embedding_subtokens(end_leaf.long())

        encode_start = encode_start.sum(1)
        encode_end = encode_end.sum(1)
        print('ast', ast_path.size())

        ast_embedding = self.embedding_paths(ast_path.long())
        print('ast emb', ast_embedding.size())

        #lengths = torch.tensor([(path==0).nonzero()[0].data 
                                #for path in ast_path])
        #packed = nn.utils.rnn.pack_padded_sequence(ast_embedding, lengths)
        lstm_output, (hidden, cell) = self.lstm(ast_embedding) 
        print(hidden.size())
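        # Keep only the last layer: its final forward and backward states
        hidden = hidden[-2:, :, :]
        print(hidden.size())
        print()
        # Join the two directions per sequence instead of reshaping with a
        # hard-coded batch size of 9: (2, batch, H) -> (batch, 2 * H).
        # Assumes dim 1 of the LSTM input is the batch dimension (batch_first=False).
        hidden = torch.cat([hidden[0], hidden[1]], dim=1)
        print(hidden.size())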

        print()
        print(encode_start.size())
        print(encode_end.size())
        encode_all = torch.cat([hidden, encode_start, encode_end], dim=1)

        encode_all = self.lin(encode_all)
        # F.tanh is deprecated; torch.tanh is the supported equivalent
        encode_all = torch.tanh(encode_all)

        return encode_all
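
The commented-out lines in `forward` hint at packing the padded AST paths before the LSTM. The following is a minimal, self-contained sketch of that idea together with the concatenation `z = tanh(Wh[...])` from the comments above. It is not the notebook's final implementation: the toy sizes, the tensor layouts `(batch, max_path_len)` and `(batch, max_subtokens)`, the use of index 0 as padding, and the single-layer LSTM are all assumptions made for illustration.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

# Hypothetical toy sizes, not the values from Config
SUBTOKEN_VOCAB, NODE_VOCAB, EMB, HID, OUT = 12, 10, 4, 3, 5

# Assumption: index 0 is the padding index in both vocabularies
subtoken_emb = nn.Embedding(SUBTOKEN_VOCAB, EMB, padding_idx=0)
node_emb = nn.Embedding(NODE_VOCAB, EMB, padding_idx=0)
lstm = nn.LSTM(EMB, HID, bidirectional=True, batch_first=True)
W_h = nn.Linear(2 * HID + 2 * EMB, OUT, bias=False)

# Three path-contexts, padded with 0
paths = torch.tensor([[5, 2, 7, 0, 0],
                      [3, 1, 0, 0, 0],
                      [4, 6, 8, 9, 2]])                    # (batch, max_path_len)
start = torch.tensor([[1, 2, 0], [3, 0, 0], [4, 5, 6]])   # (batch, max_subtokens)
end = torch.tensor([[7, 0, 0], [8, 9, 0], [2, 0, 0]])     # (batch, max_subtokens)

# True path lengths, so the LSTM never reads the padding
lengths = (paths != 0).sum(dim=1)
packed = pack_padded_sequence(node_emb(paths), lengths,
                              batch_first=True, enforce_sorted=False)
_, (h_n, _) = lstm(packed)

# Final forward and backward states of the last layer: (batch, 2 * HID)
path_enc = torch.cat([h_n[-2], h_n[-1]], dim=1)

# A token is encoded as the sum of its subtoken embeddings;
# padding_idx=0 makes the padded positions contribute zeros
start_enc = subtoken_emb(start).sum(dim=1)
end_enc = subtoken_emb(end).sum(dim=1)

# z = tanh(Wh[path; start token; end token])
z = torch.tanh(W_h(torch.cat([path_enc, start_enc, end_enc], dim=1)))
print(z.size())
```

With packing, the encoding of a short path does not depend on how much padding follows it, which is not the case when the padded tensor is fed to the LSTM directly.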

In [0]:
class Decoder(nn.Module):
    def __init__(self):
        super(Decoder, self).__init__()
        
        # DECODER_SIZE used in config. Is this k in the paper?
        # input should be the output of the FC network.. but is it still vocab size after
        # tanh activation and softmax?
        self.lstm = nn.LSTM(config.TARGET_VOCAB_MAX_SIZE, config.DECODER_SIZE, num_layers=config.NUM_DECODER_LAYERS)

        # as far as I can see, they have disabled the bias in their dense layers
        # out_features here is set to be the vocab size?
        self.lin = nn.Linear(config.DECODER_SIZE, config.TARGET_VOCAB_MAX_SIZE, bias=False)

    def forward(self):
        pass

This should probably be part of a `Code2Seq` class that contains an encoder and decoder?

In [0]:
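class Code2Seq(nn.Module):
    def __init__(self, subtoken_input_dim, nodes_vocab_size):
        super(Code2Seq, self).__init__()
        # The encoder needs both vocabulary sizes; the decoder currently takes no arguments
        self.encoder = Encoder(subtoken_input_dim, nodes_vocab_size)
        self.decoder = Decoder()

    def attention(self, *args, **kwargs):
        pass

    def forward(self):
        pass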

The fully connected network is supposed to use a `tanh` function and `softmax` output?

$$p(y_t | y_{<t}, z_1, \dots, z_n) = \text{softmax}(W_s\tanh(W_c[c_t;h_t]))$$
$$\boldsymbol{\alpha}^t = \text{softmax}(h_t W_a \boldsymbol{z})$$
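
To make the two formulas concrete, here is a hedged sketch of a single decoding step in PyTorch. The class name `AttentionStep`, the choice of one shared dimension `d` for the decoder state and the encoded paths, and the context vector computed as the attention-weighted sum of the `z_i` (the usual definition, which the formulas above do not spell out) are assumptions for illustration, not the notebook's final design.

```python
import torch
from torch import nn

torch.manual_seed(0)


class AttentionStep(nn.Module):
    """One decoding step: attend over the encoded paths, then predict a token."""

    def __init__(self, d, vocab_size):
        super().__init__()
        # Bias-free dense layers, matching the comments in the Encoder/Decoder cells
        self.W_a = nn.Linear(d, d, bias=False)
        self.W_c = nn.Linear(2 * d, d, bias=False)
        self.W_s = nn.Linear(d, vocab_size, bias=False)

    def forward(self, h_t, z):
        # h_t: (batch, d) decoder state, z: (batch, n, d) encoded path-contexts
        scores = torch.bmm(z, self.W_a(h_t).unsqueeze(2)).squeeze(2)  # (batch, n)
        alpha = torch.softmax(scores, dim=1)                          # attention weights
        # Assumed context vector: weighted sum of the z_i, shape (batch, d)
        c_t = torch.bmm(alpha.unsqueeze(1), z).squeeze(1)
        # softmax(W_s tanh(W_c [c_t; h_t]))
        logits = self.W_s(torch.tanh(self.W_c(torch.cat([c_t, h_t], dim=1))))
        return torch.softmax(logits, dim=1), alpha


step = AttentionStep(d=8, vocab_size=20)
probs, alpha = step(torch.randn(2, 8), torch.randn(2, 5, 8))
print(probs.size(), alpha.size())
```

So `tanh` is applied to the combined context and decoder state, and `softmax` appears twice: once over the `n` paths for the attention weights and once over the target vocabulary for the output. For training, it is common to return the logits and use `nn.CrossEntropyLoss` rather than an explicit softmax.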