# Baseline for JB test assignment
Method name prediction is a popular problem in the ML for SE (machine learning for software engineering) domain. Beyond its practical value, it is a common benchmark for models that aim to understand source code.

The authors of [code2seq](https://github.com/tech-srl/code2seq) proposed several datasets for method name prediction. To speed up experiments, we will use **only 10%** of the Java-small dataset. **Please do not use any data other than the selected 10% to train and validate models.**

The goal of this task is to **improve the quality of a method name prediction model**. As a solution, you can submit either a modified notebook or a link to a GitHub repository. In both cases, please document everything you try and report which ideas helped the most.

To make experimenting easier, we provide a simple pipeline:
* Data loading and preparation
* A baseline encoder-decoder model that uses pre-trained [CodeBERT](https://github.com/microsoft/CodeBERT) as the encoder and a BERT decoder
* Computation of widely used metrics for this task
* Training of the baseline model and reporting of the results



## Data collection

Here, we generate a subsample of ~10% of the methods from the Java-small dataset.

In [1]:
import os
import re
from pathlib import Path
from tree_sitter import Language, Parser, TreeCursor, Node
from typing import List, Tuple, Dict
from tqdm.auto import tqdm
from collections import namedtuple

from helper import *

### Extract 10% of files from each project

In [2]:
DATA_ROOT = Path("data/java-small")
TRAIN_ROOT = DATA_ROOT / "training"
VAL_ROOT = DATA_ROOT / "validation"
TEST_ROOT = DATA_ROOT / "test"
K = 10

train_files = extract_files_subsample(TRAIN_ROOT, K)
val_files = extract_files_subsample(VAL_ROOT, K)
test_files = extract_files_subsample(TEST_ROOT, K)
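
The implementation of `extract_files_subsample` lives in `helper` and is not shown here. As a rough illustration only, one way to draw such a per-project subsample is sketched below. The function name `subsample_java_files`, the deterministic "every K-th file" rule, and the one-directory-per-project layout are all assumptions, not the helper's actual behavior.

```python
from pathlib import Path
from typing import List


def subsample_java_files(root: Path, k: int) -> List[Path]:
    """Keep every k-th .java file of each project directory under root.

    Hypothetical stand-in for the helper: assumes each immediate
    subdirectory of root is one project.
    """
    selected: List[Path] = []
    for project in sorted(p for p in root.iterdir() if p.is_dir()):
        # Sort so that the selection is reproducible across runs
        files = sorted(project.rglob("*.java"))
        selected.extend(files[::k])
    return selected
```

With `k = 10`, this keeps roughly 10% of the files in every project, so small and large projects are represented proportionally.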

In [3]:
len(train_files), len(val_files), len(test_files)

(8944, 188, 527)

### Extract methods from all files

In [4]:
train_methods = extract_methods_from_files(train_files)
val_methods = extract_methods_from_files(val_files)
test_methods = extract_methods_from_files(test_files)


### Extract method names and remove them from code

We transform method names into sentences by splitting them on CamelCase and snake_case boundaries.
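
The splitting step can be sketched as follows. This is a minimal illustration, not the code used by `prepare_samples` (which is defined in `helper`). The function name `split_method_name` and the exact regular expression are assumptions.

```python
import re
from typing import List


def split_method_name(name: str) -> List[str]:
    """Split a method name into lowercase sub-tokens (hypothetical helper)."""
    tokens: List[str] = []
    # snake_case: split on underscores first; empty parts yield no tokens
    for part in name.split("_"):
        # CamelCase: an acronym followed by a word, a (capitalized) word,
        # a trailing acronym, or a run of digits
        tokens.extend(
            re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
        )
    return [token.lower() for token in tokens]


print(" ".join(split_method_name("getFileName")))   # get file name
print(" ".join(split_method_name("parseXMLFile")))  # parse xml file
print(" ".join(split_method_name("to_string")))     # to string
```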

For more accurate handling, we should also filter out abstract methods, overloaded methods, and methods with an empty body, and properly handle recursive method calls. These steps are omitted in this assignment for simplicity.

In [5]:
train_samples = prepare_samples(train_methods)
val_samples = prepare_samples(val_methods)
test_samples = prepare_samples(test_methods)

## Baseline model: CodeBERT

To develop the baseline, we will use the [transformers](https://github.com/huggingface/transformers) library and PyTorch.

In [6]:
from transformers import (
    AutoTokenizer, 
    AutoModel, 
    BertGenerationDecoder, 
    BertGenerationConfig,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
from datasets import Dataset
import torch
from torch import nn
import numpy as np

### Prepare data for training



In [7]:
INPUT_LENGTH = 128
OUTPUT_LENGTH = 10
BATCH_SIZE = 32

In [8]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

In [9]:
def sample_to_input(batch: Dict[str, List[str]]) -> Dict[str, List]:
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["code"], 
        padding="max_length", truncation=True, max_length=INPUT_LENGTH
    )
    outputs = tokenizer(
        batch["name"],
        padding="max_length", truncation=True, max_length=OUTPUT_LENGTH
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    # Decoder attention mask makes sure that we don't look into the future.
    batch["decoder_attention_mask"] = [
        [
            [
                int(i >= j and attention_mask[i])
                for j in range(OUTPUT_LENGTH)
            ]
            for i in range(OUTPUT_LENGTH)
        ]
        for attention_mask in outputs.attention_mask   
    ]
    batch["labels"] = outputs.input_ids

    # HuggingFace's loss computation ignores positions labeled -100, so
    # padding tokens are replaced with -100 to exclude them from the loss
    batch["masked_labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels] 
        for labels in batch["labels"]
    ]

    return batch

def create_dataset(samples: List[Dict[str, str]]) -> Dataset:
    dataset = Dataset.from_list(samples)
    dataset = dataset.map(
        sample_to_input, 
        batched=True, 
        batch_size=BATCH_SIZE, 
        remove_columns=["name", "code"]
    )
    dataset.set_format(
        type="torch", columns=[
            "input_ids", 
            "attention_mask", 
            "decoder_attention_mask", 
            "labels",
            "masked_labels",
        ],
    )
    return dataset
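
To make the two masking steps in `sample_to_input` concrete, here is a small sketch on a toy sequence of four positions, the last of which is padding. The helper names `causal_mask` and `mask_padding`, the token ids, and the pad id of 1 are illustrative assumptions.

```python
from typing import List


def causal_mask(padding_mask: List[int]) -> List[List[int]]:
    """Row i marks the positions j that position i may attend to."""
    n = len(padding_mask)
    return [
        # Position i sees only positions j <= i; padded rows are all zeros
        [int(i >= j and padding_mask[i]) for j in range(n)]
        for i in range(n)
    ]


def mask_padding(labels: List[int], pad_token_id: int) -> List[int]:
    """Replace padding ids with -100 so the loss skips them."""
    return [-100 if token == pad_token_id else token for token in labels]


for row in causal_mask([1, 1, 1, 0]):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [0, 0, 0, 0]

print(mask_padding([0, 120, 2, 1], pad_token_id=1))  # [0, 120, 2, -100]
```

The lower-triangular shape is what prevents a position from attending to later tokens of the target name.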

In [10]:
train_dataset = create_dataset(train_samples)
val_dataset = create_dataset(val_samples)
test_dataset = create_dataset(test_samples)


In [11]:
small_dataset = create_dataset(train_samples[:10 * BATCH_SIZE])


### Setup model

As a baseline, we train a seq2seq model with pre-trained CodeBERT as an encoder and a BERT decoder trained from scratch.

In [12]:
class BaselineCodeBERT(nn.Module):

    def __init__(self):
        super(BaselineCodeBERT, self).__init__()
        self.encoder = AutoModel.from_pretrained("microsoft/codebert-base")
        self.config = BertGenerationConfig(
            vocab_size=self.encoder.config.vocab_size,
            hidden_size=self.encoder.config.hidden_size,
            num_hidden_layers=4,
            num_attention_heads=4,
            intermediate_size=1024,
            is_decoder=True,
            add_cross_attention=True,
            decoder_start_token_id=tokenizer.cls_token_id,
            max_length=OUTPUT_LENGTH,
        )
        self.decoder = BertGenerationDecoder(self.config)
        self.main_input_name = "input_ids"

    def forward(
        self, 
        input_ids, 
        attention_mask,
        decoder_attention_mask,
        labels,
        masked_labels,
    ):
        seq_embedding = self.encoder(
            input_ids=input_ids, 
            attention_mask=attention_mask
        )[0]
        output = self.decoder(
            input_ids=labels, 
            attention_mask=decoder_attention_mask,
            encoder_hidden_states=seq_embedding,
            encoder_attention_mask=attention_mask,
            labels=masked_labels,
        )
        return output

    @torch.no_grad()
    def generate(
        self, 
        input_ids, 
        attention_mask,
        max_length=None,
        **kwargs
    ):
        input_ids = input_ids.to(self.encoder.device)
        attention_mask = attention_mask.to(self.encoder.device)
        seq_embedding = self.encoder(
            input_ids=input_ids, 
            attention_mask=attention_mask
        )[0]

        if max_length is None:
            max_length = self.config.max_length

        batch_size = len(input_ids)
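        # Create the start-token column directly on the decoder's device
        bos_column = torch.full(
            (batch_size, 1), tokenizer.bos_token_id, device=self.decoder.device
        )
        labels = bos_column
        for i in range(1, max_length):
            # Causal (lower-triangular) mask on the same device as the inputs,
            # consistent with the mask used during training
            decoder_attention_mask = torch.tril(
                torch.ones(batch_size, i, i, device=labels.device)
            )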
            predictions = self.decoder(
                input_ids=labels, 
                attention_mask=decoder_attention_mask,
                encoder_hidden_states=seq_embedding,
                encoder_attention_mask=attention_mask,
            ).logits.argmax(-1)[:, -1]
            labels = torch.cat((labels, predictions.view(-1, 1)), -1)
        return labels
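
The `generate` method above performs greedy decoding: at every step it appends the highest-scoring next token to the sequence. The loop can be illustrated without any model by the sketch below, where `step_fn` is a hypothetical stand-in for the decoder that returns one score per vocabulary entry for the next position.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    step_fn: Callable[[List[int]], Sequence[float]],
    bos_id: int,
    max_length: int,
) -> List[int]:
    """Grow a sequence by repeatedly taking the argmax of the next-token scores."""
    tokens = [bos_id]
    for _ in range(1, max_length):
        scores = step_fn(tokens)
        # Index of the highest score, i.e. the argmax over the vocabulary
        next_token = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_token)
    return tokens


def toy_step(tokens: List[int]) -> List[float]:
    """Toy scorer over a 5-token vocabulary: prefers (last token + 1) mod 5."""
    preferred = (tokens[-1] + 1) % 5
    return [1.0 if token == preferred else 0.0 for token in range(5)]


print(greedy_decode(toy_step, bos_id=0, max_length=4))  # [0, 1, 2, 3]
```

Like the method above, this sketch always produces `max_length` tokens and does not stop early at an end-of-sequence token.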

In [13]:
model = BaselineCodeBERT()

### Evaluation metrics

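To evaluate the model, we use a common metric for method name prediction: an analogue of the F-score for sequences. Precision, recall, and F1 are computed over the sets of sub-tokens in the predicted and reference names and then averaged over all examples.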
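
As a worked illustration for a single pair of names, consider the minimal sketch below. The helper name `name_f1` is hypothetical and not part of the notebook.

```python
from typing import Tuple


def name_f1(label: str, pred: str) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over the sets of sub-tokens of two names."""
    label_tokens = set(label.split())
    pred_tokens = set(pred.split())
    n_true = len(label_tokens & pred_tokens)

    precision = n_true / len(pred_tokens) if pred_tokens else 0.0
    recall = n_true / len(label_tokens) if label_tokens else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall > 0
        else 0.0
    )
    return precision, recall, f1


precision, recall, f1 = name_f1("get file name", "get name")
print(precision, recall, f1)
```

For the reference `get file name` and the prediction `get name`, both predicted sub-tokens are correct (precision 1), two of the three reference sub-tokens are recovered (recall 2/3), and F1 is 0.8. Because sub-tokens are compared as sets, their order and repetitions do not affect the score.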

In [14]:
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    
    # Eliminate predictions after first EOS token
    got_eos = np.zeros(len(pred_ids), dtype=bool)
    for i in range(pred_ids.shape[1]):
        got_eos |= pred_ids[:, i] == tokenizer.eos_token_id
        pred_ids[:, i][got_eos] = tokenizer.eos_token_id

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    precision, recall, f1score = 0, 0, 0
    n_examples = 0
    # Use distinct loop names so the `pred` argument is not shadowed
    for label_text, pred_text in zip(label_str, pred_str):
        label_tokens = set(label_text.strip().split())
        pred_tokens = set(pred_text.strip().split())
        n_true = len(label_tokens & pred_tokens)
        n_label = len(label_tokens)
        n_pred = len(pred_tokens)

        p = n_true / n_pred if n_pred > 0 else 0.
        r = n_true / n_label if n_label > 0 else 0.
        f1 = (
            2 * p * r / (p + r) 
            if p + r > 0 
            else 0.
        )

        precision += p
        recall += r
        f1score += f1
        
        n_examples += 1

    return {
        "precision": precision / n_examples,
        "recall": recall / n_examples,
        "f1": f1score / n_examples,
    }
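
The first loop of `compute_metrics` discards everything the model generated after the first EOS token, since generation always runs for the full output length. A list-based sketch of the same idea is shown below. The helper name `truncate_after_eos` and the token ids are illustrative assumptions.

```python
from typing import List


def truncate_after_eos(batch_ids: List[List[int]], eos_id: int) -> List[List[int]]:
    """Overwrite every token from the first EOS onward with EOS."""
    result = []
    for ids in batch_ids:
        seen_eos = False
        row = []
        for token in ids:
            seen_eos = seen_eos or token == eos_id
            row.append(eos_id if seen_eos else token)
        result.append(row)
    return result


print(truncate_after_eos([[0, 7, 2, 9, 9], [0, 5, 6, 8, 2]], eos_id=2))
# [[0, 7, 2, 2, 2], [0, 5, 6, 8, 2]]
```

Once the trailing tokens are replaced with EOS, decoding with `skip_special_tokens=True` drops them from the predicted name.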

### Training pipeline

In [15]:
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    learning_rate=4e-3,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    output_dir="./",
    logging_steps=10,
    save_steps=500,
    eval_steps=200,
    num_train_epochs=1,
)

In [16]:
%env WANDB_PROJECT=JB_test
# instantiate trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

env: WANDB_PROJECT=JB_test
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
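
The warning above names the environment variable that silences it. A minimal sketch, assuming it runs in the first cell of the notebook before the tokenizer is used:

```python
import os

# Disable tokenizer parallelism explicitly, as the warning suggests,
# so that forking worker processes no longer triggers the message
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```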


In [17]:
trainer.train()

***** Running training *****
  Num examples = 80598
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 630
  Number of trainable parameters = 188910169
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: martslaaf (use `wandb login --relogin` to force relogin)






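| Step | Training Loss | Validation Loss | Precision | Recall | F1 |
|------|---------------|-----------------|-----------|--------|-----|
| 200 | 5.5341 | 7.505348 | 0.0 | 0.0 | 0.0 |
| 400 | 5.6114 | 7.264141 | 0.0 | 0.0 | 0.0 |
| 600 | 5.5165 | 7.240836 | 0.0 | 0.0 | 0.0 |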


***** Running Evaluation *****
  Num examples = 2337
  Batch size = 128
***** Running Evaluation *****
  Num examples = 2337
  Batch size = 128
Saving model checkpoint to ./checkpoint-500
Trainer.model is not a `PreTrainedModel`, only saving its state dict.
***** Running Evaluation *****
  Num examples = 2337
  Batch size = 128


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=630, training_loss=5.661440791780986, metrics={'train_runtime': 467.8822, 'train_samples_per_second': 172.261, 'train_steps_per_second': 1.346, 'total_flos': 0.0, 'train_loss': 5.661440791780986, 'epoch': 1.0})