<a href="https://colab.research.google.com/github/vegarab/code2seq-reproducibility-challenge/blob/feat%2Fmodel/notebooks/code2seq_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Code2Seq model

This notebook experiments with a PyTorch implementation of the code2seq model.

In [0]:
import torch
import torch.nn.functional as F
import torchbearer
from torchbearer import callbacks
from torch import nn
from torch import optim
from torch.utils.data import DataLoader

from config import Config
from loader import Code2SeqDataset

seed = 7
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

The setup above should eventually be moved to a top-level `code2seq.py` file that loads these parameters from a config file.

In [0]:
config = Config.get_debug_config(None)

In [3]:
test_set = Code2SeqDataset('test', config=config)
#train_set = Code2SeqDataset(config.TRAIN_PATH, config=config)
#val_set = Code2SeqDataset(config.VAL_PATH, config=config)

test_loader = DataLoader(test_set, batch_size=config.BATCH_SIZE, shuffle=True, num_workers=config.NUM_WORKERS)
#train_loader = DataLoader(train_set, batch_size=config.BATCH_SIZE, shuffle=True)
#val_loader = DataLoader(val_set, batch_size=config.BATCH_SIZE, shuffle=True)

Num training samples: 691974
Dictionaries loaded.
Loaded subtoken vocab. size: 73906
Loaded target word vocab. size: 11319
Loaded nodes vocab. size: 323


The model classes below should be placed in `model.py` once final so that they can be imported. I am reluctant to move them there until they are somewhat functional, because keeping them in the notebook makes the workflow a bit easier.

In [0]:
class Encoder(nn.Module):
    def __init__(self, subtoken_input_dim, nodes_vocab_size):
        super(Encoder, self).__init__()

        # z = tanh(Wh[encoder_path(v1,...,vl);encode_token(value(v1));encode_token(value(vl))])
        self.embedding_subtokens = nn.Embedding(subtoken_input_dim, config.EMBEDDINGS_SIZE) 
        self.embedding_paths = nn.Embedding(nodes_vocab_size, config.EMBEDDINGS_SIZE)

        self.num_layers = 2
        self.lstm = nn.LSTM(config.EMBEDDINGS_SIZE, config.RNN_SIZE//2, 
                            bidirectional=True, 
                            num_layers=self.num_layers,
                            dropout=(1 - config.RNN_DROPOUT_KEEP_PROB),
                            batch_first=True)

        # Linear(in_features, out_features)
        # in_features matches the concatenation at the end of forward():
        # start_embed (dim) + path_aggregated (rnn_size) + end_embed (dim)
        self.lin = nn.Linear(config.EMBEDDINGS_SIZE * 2 + config.RNN_SIZE,
                             config.DECODER_SIZE, bias=False)

    def forward(self, start_leaf, ast_path, end_leaf, target, start_leaf_mask, end_leaf_mask,
                target_mask, context_mask, ast_path_lengths):
        # (batch, max_contexts, max_name_parts, dim)
        start_embed = self.embedding_subtokens(start_leaf.long())
        end_embed = self.embedding_subtokens(end_leaf.long())

        # (batch, max_contexts, max_path_length+1, dim)
        path_embed = self.embedding_paths(ast_path.long())

        # (batch, max_contexts, max_name_parts, 1)
        end_leaf_mask = end_leaf_mask.unsqueeze(-1)
        start_leaf_mask = start_leaf_mask.unsqueeze(-1)

        # (batch, max_contexts, dim)
        start_embed = torch.sum(start_embed * start_leaf_mask, dim=2)
        end_embed = torch.sum(end_embed * end_leaf_mask, dim=2)

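        max_context = path_embed.size()[1]
        # (batch * max_contexts, max_path_length+1, dim)
        # Use the tensor's own path dimension so the view cannot mis-shape the data
        flat_paths = path_embed.view(-1, path_embed.size(2), config.EMBEDDINGS_SIZE)

        lstm_output, (hidden, cell) = self.lstm(flat_paths)
        # hidden is (num_layers * 2, batch * max_contexts, rnn_size // 2); the
        # last two entries are the final layer's forward and backward states
        hidden = hidden[-2:, :, :]
        hidden = hidden.transpose(0, 1)
        # (batch * max_contexts, rnn_size)
        final_rnn_state = torch.reshape(hidden, (hidden.size()[0], -1))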

        # (batch, max_contexts, rnn_size)
        path_aggregated = torch.reshape(final_rnn_state,
                                        (-1, max_context, config.RNN_SIZE))


        # (batch, max_contexts, dim * 2 + rnn_size)
        context_embed = torch.cat([start_embed, path_aggregated, end_embed], dim=-1)

        # nn.Linear acts on the last dimension, so the 3-D
        # (batch, max_contexts, _) input can be passed directly
        context_embed = torch.tanh(self.lin(context_embed))

        return context_embed
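
The tensor shapes in the encoder can be checked in isolation. The following is a minimal sketch with toy sizes chosen only for illustration (they are not the values in `Config`), and it reuses one random leaf tensor for both the start and end leaves to stay short. It walks through the masked subtoken sum, the final state of a two-layer bidirectional LSTM, and the concatenation that feeds the linear layer.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration (not the values in Config)
batch, max_contexts, max_name_parts, path_len = 2, 3, 4, 5
embed_dim, rnn_size, decoder_size = 8, 6, 10

# Masked sum over subtokens: padding positions contribute zero
leaf_embed = torch.randn(batch, max_contexts, max_name_parts, embed_dim)
leaf_mask = torch.ones(batch, max_contexts, max_name_parts)
leaf_mask[..., 2:] = 0  # pretend only the first two subtokens are real
leaf_sum = torch.sum(leaf_embed * leaf_mask.unsqueeze(-1), dim=2)

# Bidirectional LSTM over the flattened paths
lstm = nn.LSTM(embed_dim, rnn_size // 2, num_layers=2,
               bidirectional=True, batch_first=True)
flat_paths = torch.randn(batch * max_contexts, path_len, embed_dim)
output, (hidden, _) = lstm(flat_paths)

# hidden: (num_layers * 2, batch * max_contexts, rnn_size // 2).
# The last two entries are the top layer's forward and backward states.
top = hidden[-2:].transpose(0, 1).reshape(batch * max_contexts, rnn_size)
path_aggregated = top.reshape(batch, max_contexts, rnn_size)

# Concatenate [start; path; end] and project to the decoder size
context = torch.cat([leaf_sum, path_aggregated, leaf_sum], dim=-1)
lin = nn.Linear(2 * embed_dim + rnn_size, decoder_size, bias=False)
z = torch.tanh(lin(context))
print(z.shape)  # torch.Size([2, 3, 10])
```

The width of the concatenated tensor is `2 * embed_dim + rnn_size`, which is the `in_features` the linear layer needs.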

In [0]:
class Decoder(nn.Module):
    def __init__(self):
        super(Decoder, self).__init__()
        
        # DECODER_SIZE used in config. Is this k in the paper?
        # input should be the output of the FC network.. but is it still vocab size after
        # tanh activation and softmax?
        self.lstm = nn.LSTM(config.TARGET_VOCAB_MAX_SIZE, config.DECODER_SIZE, num_layers=config.NUM_DECODER_LAYERS)

        # as far as I can see, they have disabled the bias in their dense layers
        # out_features here is set to be the vocab size?
        self.lin = nn.Linear(config.DECODER_SIZE, config.TARGET_VOCAB_MAX_SIZE, bias=False)

    def forward(self):
        pass
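
One possible answer to the input-size question in the comments above is to feed the LSTM an embedding of the previously generated target token instead of a vocabulary-sized vector. This is an assumption about a common seq2seq arrangement, not something this notebook or the paper excerpt confirms. The sketch below shows a single decoding step under that assumption, with hypothetical toy sizes and names (`target_embedding`, `out_proj`), and it leaves out attention and the `tanh` layer.

```python
import torch
from torch import nn

# Hypothetical toy sizes; the real values would come from Config
target_vocab_size, embed_dim, decoder_size, num_layers = 20, 8, 10, 1
batch = 4

target_embedding = nn.Embedding(target_vocab_size, embed_dim)
decoder_lstm = nn.LSTM(embed_dim, decoder_size, num_layers=num_layers)
out_proj = nn.Linear(decoder_size, target_vocab_size, bias=False)

# One decoding step: previous token ids -> embeddings -> LSTM -> logits
prev_tokens = torch.randint(0, target_vocab_size, (batch,))
step_input = target_embedding(prev_tokens).unsqueeze(0)  # (1, batch, embed_dim)
h0 = torch.zeros(num_layers, batch, decoder_size)
c0 = torch.zeros(num_layers, batch, decoder_size)
h_t, (h_n, c_n) = decoder_lstm(step_input, (h0, c0))

logits = out_proj(h_t.squeeze(0))      # (batch, target_vocab_size)
probs = torch.softmax(logits, dim=-1)  # distribution over target words
```

Under this arrangement the LSTM input size is the embedding size, and only the output projection has the vocabulary size as a dimension.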


This should probably be part of a `Code2Seq` class that contains an encoder and decoder?

In [0]:
class Code2Seq(nn.Module):
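    def __init__(self, subtoken_input_dim, nodes_vocab_size):
        super(Code2Seq, self).__init__()
        # Vocabulary sizes are passed through to the encoder
        self.encoder = Encoder(subtoken_input_dim, nodes_vocab_size)
        # Decoder.__init__ currently takes no arguments
        self.decoder = Decoder()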

    def attention(self, *args, **kwargs):
        pass

    def forward(self):
        pass

The fully connected network is supposed to use a `tanh` activation and a `softmax` output?

$$p(y_t \mid y_{<t}, z_1, \dots, z_n) = \text{softmax}(W_s\tanh(W_c[c_t;h_t]))$$
$$\boldsymbol{\alpha}^t = \text{softmax}(h_t W_a \mathbf{z})$$
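
The two formulas can be read as one attention step followed by an output projection. The sketch below is a minimal illustration with toy sizes. It assumes that the context vector $c_t$ is the attention-weighted sum of the encoded contexts, $c_t = \sum_i \alpha_i^t z_i$, and that $z$ has the same width as the decoder state. The layer names `W_a`, `W_c` and `W_s` simply mirror the symbols and are not taken from any existing code.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy sizes for illustration only
batch, n_contexts, decoder_size, vocab_size = 2, 5, 10, 20

z = torch.randn(batch, n_contexts, decoder_size)  # encoded contexts z_1..z_n
h_t = torch.randn(batch, decoder_size)            # decoder state at step t

# Bias-free linear layers standing in for the weight matrices
W_a = nn.Linear(decoder_size, decoder_size, bias=False)
W_c = nn.Linear(2 * decoder_size, decoder_size, bias=False)
W_s = nn.Linear(decoder_size, vocab_size, bias=False)

# alpha^t = softmax(h_t W_a z): one score per context
# (nn.Linear applies the transposed weight, which is immaterial for a learned matrix)
scores = torch.bmm(z, W_a(h_t).unsqueeze(-1)).squeeze(-1)  # (batch, n_contexts)
alpha = torch.softmax(scores, dim=-1)

# c_t = sum_i alpha_i z_i (assumed definition of the context vector)
c_t = torch.bmm(alpha.unsqueeze(1), z).squeeze(1)          # (batch, decoder_size)

# p(y_t | ...) = softmax(W_s tanh(W_c [c_t; h_t]))
hidden = torch.tanh(W_c(torch.cat([c_t, h_t], dim=-1)))
p_y = torch.softmax(W_s(hidden), dim=-1)                   # (batch, vocab_size)
```

In this reading, `tanh` is applied to the combined context and decoder state, and `softmax` is applied only to the final vocabulary-sized projection.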