# Baseline for JB test assignment
Method name prediction is a popular problem in the ML for SE (machine learning for software engineering) domain. In addition to its practical value, it serves as a popular benchmark for models that aim to understand source code.

In the [code2seq](https://github.com/tech-srl/code2seq) work, the authors proposed several datasets for method name prediction. To speed up experiments, we will use **only 10%** of the Java-small dataset. **Please do not use data other than the selected 10% to train and validate models.**

The goal of this task is to **improve the quality of a method name prediction model**. As a solution, you can submit either a modified notebook or a link to a GitHub repository. In both cases, we ask you to document everything you try and to report which ideas helped the most!

To make experimenting easier, we provide a simple pipeline:
* Data loading and preparation
* Baseline encoder-decoder model that uses pre-trained [CodeBERT](https://github.com/microsoft/CodeBERT) as an encoder and a BERT decoder
* Computation of widely used metrics for this task
* Training of the baseline model and a report of the results



## Data collection

Here, we generate a subsample of ~10% of the methods from the Java-small dataset.

In [1]:
# !pip install tree_sitter

In [2]:
# !pip install --upgrade transformers datasets

In [3]:
# !mkdir data && \
#     cd data && \
#     wget https://s3.amazonaws.com/code2seq/datasets/java-small.tar.gz && \
#     tar -xzf java-small.tar.gz

In [4]:
import os
import re
from pathlib import Path
from tree_sitter import Language, Parser, TreeCursor, Node
from typing import List, Tuple, Dict
from tqdm.auto import tqdm
from collections import namedtuple

os.environ['CUDA_DEVICE_ORDER']= 'PCI_BUS_ID'

gpus = [2]
os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(map(str, gpus))

NUM_GPUS = len(gpus)

### Download data

### Extract 10% of files from each project

In [5]:
DATA_ROOT = Path("data/java-small")
TRAIN_ROOT = DATA_ROOT / "training"
VAL_ROOT = DATA_ROOT / "validation"
TEST_ROOT = DATA_ROOT / "test"
K = 10

def extract_files_subsample(root: Path, kth: int) -> List[Path]:
    """
    Extract every kth file from each project and return all file paths
    """
    files = []
    for project in os.listdir(root):
        project_files = sorted(os.listdir(root / project))[::kth]
        files += [root / project / filename for filename in project_files]
    return files

train_files = extract_files_subsample(TRAIN_ROOT, K)
val_files = extract_files_subsample(VAL_ROOT, K)
test_files = extract_files_subsample(TEST_ROOT, K)

In [6]:
len(train_files), len(val_files), len(test_files)

(8944, 188, 527)
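
The slicing `[::kth]` used in `extract_files_subsample` keeps the files at sorted positions 0, k, 2k, and so on, so a project with n files contributes ceil(n / k) of them. A minimal illustration, assuming made-up file names (the real ones come from `os.listdir`):

```python
import math

# Hypothetical file names, for illustration only.
project_files = [f"File{i:02d}.java" for i in range(23)]
K = 10

# Same slicing as in extract_files_subsample: sort, then take every K-th file.
subsample = sorted(project_files)[::K]
print(subsample)

# A project with n files contributes ceil(n / K) files.
print(len(subsample) == math.ceil(len(project_files) / K))
```

Since every non-empty project contributes at least one file, the sampled share can be slightly above 10% for small projects.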

### Setup tree-sitter to parse Java


In [7]:
!git clone https://github.com/tree-sitter/tree-sitter-java

Language.build_library('build/my-languages.so', ['tree-sitter-java'])
JAVA_LANGUAGE = Language('build/my-languages.so', 'java')
parser = Parser()
parser.set_language(JAVA_LANGUAGE)



### Extract methods from all files

In [8]:
METHOD_TYPE = "method_declaration"
IDENTIFIER_TYPE = "identifier"
MASK_TOKEN = b"<mask>"

In [9]:
def read(filename: Path) -> bytes:
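    # Read the file as text and re-encode it as UTF-8 bytes for tree-sitter;
    # the context manager makes sure the file handle is closed.
    with open(filename, "r") as file:
        return bytes(file.read(), "utf-8")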

In [10]:
def traverse(cursor: TreeCursor, extracted_methods: List[Node]):

    if cursor.node.type == METHOD_TYPE:
        extracted_methods.append(cursor.node)

    if cursor.goto_first_child():
        traverse(cursor, extracted_methods)
    
    if cursor.goto_next_sibling():
        traverse(cursor, extracted_methods)
    else:
        cursor.goto_parent()

In [11]:
def extract_methods_from_files(files: List[Path]) -> List[Node]:
    extracted_methods = []
    
    for filepath in tqdm(files):
        content = read(filepath)
        parsed_tree = parser.parse(content)
        cursor = parsed_tree.walk()
        try:
            traverse(cursor, extracted_methods)
        except RecursionError:
            pass
    
    return extracted_methods

train_methods = extract_methods_from_files(train_files)
val_methods = extract_methods_from_files(val_files)
test_methods = extract_methods_from_files(test_files)
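
The recursive `traverse` recurses both into children and into siblings, so large files can exceed Python's recursion limit; in that case the `except RecursionError` branch keeps only the methods collected up to that point. One possible alternative, sketched here and not part of the baseline, is a traversal with an explicit stack. The helper name `collect_nodes` is hypothetical; it relies only on the `type` and `children` attributes of tree-sitter nodes, and `FakeNode` is a stand-in so that the snippet runs without a parser.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeNode:
    """Stand-in for tree_sitter.Node exposing only `type` and `children`."""
    type: str
    children: List["FakeNode"] = field(default_factory=list)


def collect_nodes(root, node_type: str) -> list:
    """Return all nodes of `node_type` in pre-order, without recursion."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.type == node_type:
            found.append(node)
        # Push children in reverse so that the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return found


tree = FakeNode("program", [
    FakeNode("class_declaration", [
        FakeNode("method_declaration", [FakeNode("identifier")]),
        FakeNode("method_declaration", [FakeNode("identifier")]),
    ]),
])
print(len(collect_nodes(tree, "method_declaration")))
```

With a real parse tree, the call would look like `collect_nodes(parsed_tree.root_node, METHOD_TYPE)`.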


### Extract method names and mask them in the code

We transform method names into sentences by splitting them by CamelCase and snake_case.

For more accurate handling, we should also filter out abstract and overloaded methods and methods with an empty body, and properly handle recursive method calls. These steps are omitted in this assignment for simplicity.

In [12]:
split_regex = re.compile("(?<=[a-z])(?=[A-Z])|_|[0-9]|(?<=[A-Z])(?=[A-Z][a-z])|\\s+")

def split_token(token: str):
    return split_regex.split(token)

def prepare_sample(method_root: Node) -> Dict:
    name = None
    name_span = None
    for child in method_root.children:
        if child.type == IDENTIFIER_TYPE:
            name = child.text
            name_span = (
                child.start_byte - method_root.start_byte, 
                child.end_byte - method_root.start_byte
            )
            break
    
    if name is None:
        return None

    code = method_root.text
    code = code[:name_span[0]] + MASK_TOKEN + code[name_span[1]:]

    return {
        "name": " ".join(split_token(name.decode())).lower(), 
        "code": code.decode()
    }

def prepare_samples(method_roots: List[Node]) -> List[Dict[str, str]]:
    samples = [prepare_sample(method) for method in method_roots]
    samples = [sample for sample in samples if sample is not None]
    return samples
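
To see what the splitting does to a few names, here is a small standalone check. It reuses the same pattern as `split_regex` (written as a raw string); the helper `name_to_sentence` is hypothetical and, unlike `prepare_sample`, additionally drops the empty pieces that appear when a name ends with a digit.

```python
import re

# Same pattern as split_regex above, written as a raw string.
split_regex = re.compile(r"(?<=[a-z])(?=[A-Z])|_|[0-9]|(?<=[A-Z])(?=[A-Z][a-z])|\s+")


def name_to_sentence(name: str) -> str:
    # Hypothetical helper: split the name, drop empty pieces, lowercase the rest.
    parts = [part for part in split_regex.split(name) if part]
    return " ".join(parts).lower()


for name in ["findCounter", "get_group_names", "parseXMLFile", "value1"]:
    print(name, "->", name_to_sentence(name))
```

Note that digits act as separators, so they never appear in the target sentence.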

In [13]:
train_samples = prepare_samples(train_methods)
val_samples = prepare_samples(val_methods)
test_samples = prepare_samples(test_methods)
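
The baseline skips the filtering steps mentioned earlier. As a rough sketch of one of them, the snippet below drops samples whose method has no body (abstract or interface methods) or an empty body. The helpers `has_nonempty_body` and `filter_samples` are hypothetical, and the check is a plain text heuristic on the `code` field: braces inside annotations or string literals can fool it, so a check on the parsed method node would likely be more robust.

```python
from typing import Dict, List


def has_nonempty_body(code: str) -> bool:
    """Text heuristic: True if the method has a `{...}` body with some content."""
    start = code.find("{")
    end = code.rfind("}")
    if start == -1 or end < start:
        return False  # no body at all, e.g. an abstract or interface method
    return code[start + 1:end].strip() != ""


def filter_samples(samples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    return [sample for sample in samples if has_nonempty_body(sample["code"])]


# Made-up samples in the same format as the output of prepare_samples.
samples = [
    {"name": "get size", "code": "public int <mask>() { return size; }"},
    {"name": "run", "code": "public abstract void <mask>();"},
    {"name": "close", "code": "public void <mask>() { }"},
]
print([sample["name"] for sample in filter_samples(samples)])
```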

## Baseline model: CodeBERT

To develop the baseline, we will use the [transformers](https://github.com/huggingface/transformers) library and PyTorch.

In [14]:
from transformers import (
    AutoTokenizer, 
    AutoModel, 
    BertGenerationDecoder, 
    BertGenerationConfig,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
from datasets import Dataset
import torch
from torch import nn
import numpy as np

### Prepare data for training



In [15]:
INPUT_LENGTH = 128
OUTPUT_LENGTH = 10
BATCH_SIZE = 32 // NUM_GPUS

In [16]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base-mlm")

In [17]:
def sample_to_input(batch: Dict[str, List[str]]) -> Dict[str, List]:
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["code"], 
        padding="max_length", truncation=True, max_length=INPUT_LENGTH
    )
    outputs = tokenizer(
        batch["name"],
        padding="max_length", truncation=True, max_length=OUTPUT_LENGTH
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    # Decoder attention mask makes sure that we don't look into the future.
    batch["decoder_attention_mask"] = [
        [
            [
                int(i >= j and attention_mask[i])
                for j in range(OUTPUT_LENGTH)
            ]
            for i in range(OUTPUT_LENGTH)
        ]
        for attention_mask in outputs.attention_mask   
    ]
    batch["labels"] = outputs.input_ids

    # HuggingFace's implementation of BERT treats -100 as ignored tokens for 
    # loss computation
    batch["masked_labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels] 
        for labels in batch["labels"]
    ]

    return batch

def create_dataset(samples: List[Dict[str, str]]) -> Dataset:
    dataset = Dataset.from_list(samples)
    dataset = dataset.map(
        sample_to_input, 
        batched=True, 
        batch_size=BATCH_SIZE, 
        remove_columns=["name", "code"]
    )
    dataset.set_format(
        type="torch", columns=[
            "input_ids", 
            "attention_mask", 
            "decoder_attention_mask", 
            "labels",
            "masked_labels",
        ],
    )
    return dataset
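
To make the two masks in `sample_to_input` more concrete, here is a toy version with a target length of 4. Row `i` of the decoder mask lists the positions that position `i` may attend to, and rows that belong to padding are zeroed out. The padding pattern and the token ids below are made up for illustration, including the pad id.

```python
OUTPUT_LENGTH = 4
padding_mask = [1, 1, 1, 0]  # hypothetical target: 3 real tokens, 1 pad token

# Same expression as in sample_to_input: lower-triangular, empty rows for padding.
causal_mask = [
    [int(i >= j and padding_mask[i]) for j in range(OUTPUT_LENGTH)]
    for i in range(OUTPUT_LENGTH)
]
for row in causal_mask:
    print(row)

PAD_ID = 1  # assumed pad id, for illustration only
labels = [0, 713, 2, PAD_ID]

# Padding positions get -100 so that the loss ignores them.
masked_labels = [-100 if token == PAD_ID else token for token in labels]
print(masked_labels)
```

The value -100 matches the default `ignore_index` of PyTorch's cross-entropy loss, which is why padded positions do not contribute to the training loss.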

In [18]:
train_dataset = create_dataset(train_samples)
val_dataset = create_dataset(val_samples)
test_dataset = create_dataset(test_samples)


In [19]:
small_dataset = create_dataset(train_samples[:10 * BATCH_SIZE])


### Setup model

As a baseline, we train a seq2seq model with pre-trained CodeBERT as an encoder and a BERT decoder trained from scratch.

In [20]:
class BaselineCodeBERT(nn.Module):

    def __init__(self):
        super(BaselineCodeBERT, self).__init__()
        self.encoder = AutoModel.from_pretrained("microsoft/codebert-base-mlm")
        self.config = BertGenerationConfig(
            vocab_size=self.encoder.config.vocab_size,
            hidden_size=self.encoder.config.hidden_size,
            num_hidden_layers=4,
            num_attention_heads=4,
            intermediate_size=1024,
            is_decoder=True,
            add_cross_attention=True,
            decoder_start_token_id=tokenizer.cls_token_id,
            bos_token_id=tokenizer.bos_token_id,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
            max_length=OUTPUT_LENGTH,
        )
        self.decoder = BertGenerationDecoder(self.config)
        self.main_input_name = "input_ids"
        

    def forward(
        self, 
        input_ids, 
        attention_mask,
        decoder_attention_mask,
        labels,
        masked_labels,
    ):
        seq_embedding = self.encoder(
            input_ids=input_ids, 
            attention_mask=attention_mask
        )[0]
        output = self.decoder(
            input_ids=labels, 
            attention_mask=decoder_attention_mask,
            encoder_hidden_states=seq_embedding,
            encoder_attention_mask=attention_mask,
            labels=masked_labels,
        )
        return output

    @torch.no_grad()
    def generate(
        self,
        input_ids, 
        attention_mask,
        max_length=None,
        num_beams=5,
        **kwargs
    ):
        input_ids = input_ids.to(self.encoder.device)
        attention_mask = attention_mask.to(self.encoder.device)
        seq_embedding = self.encoder(
            input_ids=input_ids, 
            attention_mask=attention_mask
        )[0]

        if max_length is None:
            max_length = self.config.max_length

        batch_size = len(input_ids)
        bos_column = torch.full((batch_size, 1), tokenizer.bos_token_id).to(self.decoder.device)
        
        # beam initialization
        decoder_attention_mask = torch.ones(batch_size, 1, 1, device=self.decoder.device)
        logits = self.decoder(
            input_ids=bos_column,
            attention_mask=decoder_attention_mask,
            encoder_hidden_states=seq_embedding,
            encoder_attention_mask=attention_mask,
        ).logits[:, -1, :]
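        # convert to log-probabilities so that beam scores are sums of log-probabilities
        log_probs, predictions = torch.topk(
            torch.nn.functional.log_softmax(logits, dim=-1), num_beams, -1
        )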
        
        labels = bos_column.repeat(1, num_beams).unsqueeze(2)
        labels = torch.cat((labels, predictions.unsqueeze(2)), -1)
        
        # initialize mask for special tokens
        masked_special_labels = (predictions == tokenizer.eos_token_id) | (predictions == tokenizer.pad_token_id)
        
        # beam loop
        for i in range(2, max_length):
            decoder_attention_mask = torch.ones(batch_size, i, i, device=self.decoder.device)
            logits = self.decoder(
                input_ids=labels.view(batch_size * num_beams, -1),
                attention_mask=decoder_attention_mask.repeat_interleave(num_beams, dim=0),
                encoder_hidden_states=seq_embedding.repeat_interleave(num_beams, dim=0),
                encoder_attention_mask=attention_mask.repeat_interleave(num_beams, dim=0)
            ).logits[:, -1, :].view(batch_size, num_beams, -1)
            
            # working in log-probabilities domain
            logits = torch.nn.functional.log_softmax(logits, dim=-1) 
            
            # expand current beam
            candidates_logits, candidates_indices = torch.topk(logits, num_beams, -1)
            
            # do not take into account special tokens
            candidates_logits[masked_special_labels] = 0
            candidates_indices[masked_special_labels] = tokenizer.pad_token_id
            
            # get new top-k predictions
            scores = log_probs[:, :, None] + candidates_logits
            log_probs, indices = torch.topk(scores.view(batch_size, -1), num_beams, dim=-1)
            predictions = candidates_indices.view(batch_size, -1).gather(1, indices)
            
            # select correct labels for new predictions
            labels = labels.repeat_interleave(num_beams, dim=1)
            ranger = torch.arange(batch_size)
            labels = labels[ranger[:, None], indices]
            
            # append new predictions
            labels = torch.cat((labels, predictions.unsqueeze(2)), -1)
            masked_special_labels = (predictions == tokenizer.eos_token_id) | (predictions == tokenizer.pad_token_id)

        return labels[:, 0, :]
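
The idea behind the beam loop in `generate` may be easier to follow on a toy example. The sketch below is a generic textbook beam search in plain Python, not a copy of the method above: it works on a single example, and a finished hypothesis is carried over as one candidate. The names `beam_search` and `toy_model`, as well as the probabilities, are made up for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"


def beam_search(
    next_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    num_beams: int,
    max_length: int,
) -> Tuple[str, ...]:
    """Return the highest-scoring sequence found by beam search."""
    beams: List[Tuple[float, Tuple[str, ...]]] = [(0.0, ())]
    for _ in range(max_length):
        candidates = []
        for score, seq in beams:
            if seq and seq[-1] == EOS:
                # Finished hypotheses are carried over unchanged.
                candidates.append((score, seq))
                continue
            for token, log_p in next_log_probs(seq).items():
                candidates.append((score + log_p, seq + (token,)))
        # Keep the num_beams best hypotheses by cumulative log-probability.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams[0][1]


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    # Hypothetical next-token distribution that depends only on the prefix.
    table = {
        (): {"get": 0.6, "find": 0.4},
        ("get",): {"x": 0.55, "y": 0.45},
        ("find",): {"counter": 0.9, EOS: 0.1},
    }
    probs = table.get(prefix, {EOS: 1.0})
    return {token: math.log(p) for token, p in probs.items()}


print(beam_search(toy_model, num_beams=1, max_length=3))
print(beam_search(toy_model, num_beams=2, max_length=3))
```

With one beam the search is greedy and commits to "get" (total probability 0.6 * 0.55 = 0.33), while two beams keep "find" alive long enough to reach "find counter" (0.4 * 0.9 = 0.36).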

In [21]:
model = BaselineCodeBERT()
model.load_state_dict(torch.load('./mlm_backbone/checkpoint-75500/pytorch_model.bin'))

Some weights of the model checkpoint at microsoft/codebert-base-mlm were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.decoder.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<All keys matched successfully>

In [30]:
model.eval()

sample = test_dataset[10:20]
labels_ids = sample['labels']
label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
for _ in range(1):
    pred_ids = model.generate(sample['input_ids'],
                  sample['attention_mask'], num_beams=1)
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    for a, b in zip(pred_str, label_str):
        print(a, ":", b)

testencode : test concat missing target
test copying with empty file : test concat file on file
test drop files : test concat on self
put group : add group
add group : add group
find counter : find counter
get : find counter
find counter : find counter
get ignored groups : get group names
iterator : iterator


In [31]:
model.eval()

sample = test_dataset[10:20]
labels_ids = sample['labels']
label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
for _ in range(1):
    pred_ids = model.generate(sample['input_ids'],
                  sample['attention_mask'], num_beams=10)
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    for a, b in zip(pred_str, label_str):
        print(a, ":", b)

testencode : test concat missing target
test copying : test concat file on file
test drop files : test concat on self
put group : add group
add group : add group
find counter : find counter
get : find counter
find counter : find counter
get format : get group names
iterator : iterator


### Evaluation metrics

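To evaluate the model, we use a common metric for method name prediction: an analogue of the F-score for sequences, computed over the sets of sub-tokens in the predicted and the reference names and averaged over examples.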

In [24]:
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    
    # Eliminate predictions after first EOS token
    got_eos = np.zeros(len(pred_ids), dtype=bool)
    for i in range(pred_ids.shape[1]):
        got_eos |= pred_ids[:, i] == tokenizer.eos_token_id
        pred_ids[:, i][got_eos] = tokenizer.eos_token_id

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    precision, recall, f1score = 0, 0, 0
    n_examples = 0
    for label, pred in zip(label_str, pred_str):
        label_tokens = set(label.strip().split())
        pred_tokens = set(pred.strip().split())
        n_true = len(label_tokens & pred_tokens)
        n_label = len(label_tokens)
        n_pred = len(pred_tokens)

        p = n_true / n_pred if n_pred > 0 else 0.
        r = n_true / n_label if n_label > 0 else 0.
        f1 = (
            2 * p * r / (p + r) 
            if p + r > 0 
            else 0.
        )

        precision += p
        recall += r
        f1score += f1
        
        n_examples += 1

    return {
        "precision": precision / n_examples,
        "recall": recall / n_examples,
        "f1": f1score / n_examples,
    }
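
As a worked example of the per-sample computation inside `compute_metrics`, the standalone function below scores two prediction/label pairs taken from the sample output shown earlier. The name `name_f1` is hypothetical; the logic mirrors the loop body above.

```python
def name_f1(label: str, pred: str):
    """Precision, recall and F1 over the sets of sub-tokens of two names."""
    label_tokens = set(label.strip().split())
    pred_tokens = set(pred.strip().split())
    n_true = len(label_tokens & pred_tokens)

    p = n_true / len(pred_tokens) if pred_tokens else 0.0
    r = n_true / len(label_tokens) if label_tokens else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1


for label, pred in [
    ("get group names", "get ignored groups"),
    ("test concat file on file", "test copying"),
]:
    print(pred, ":", label, "->", tuple(round(v, 3) for v in name_f1(label, pred)))
```

In the first pair only "get" matches, so precision, recall and F1 are all 1/3. In the second pair the repeated "file" counts once because tokens are compared as sets, which gives precision 1/2, recall 1/4 and F1 1/3.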

### Training pipeline

In [25]:
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    output_dir="./mlm_backbone",
    logging_steps=10,
    save_steps=500,
    eval_steps=500,
    num_train_epochs=30,
    save_total_limit=10,
    generation_num_beams=10,
)

In [26]:
from transformers import get_cosine_schedule_with_warmup

class CustomTrainer(Seq2SeqTrainer):
    def create_optimizer_and_scheduler(self, num_training_steps: int):
        # Adam with a cosine learning-rate schedule instead of the Trainer defaults
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=5e-5)
        self.lr_scheduler = get_cosine_schedule_with_warmup(
            optimizer=self.optimizer,
            num_warmup_steps=self.args.get_warmup_steps(num_training_steps),
            num_training_steps=num_training_steps,
        )
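
For intuition about the schedule, the multiplier applied to the base learning rate can be written down directly. The function below is a plain-Python sketch of a cosine schedule with linear warmup (half a cosine period, decaying to zero); the name `cosine_with_warmup` and the step counts are made up for illustration.

```python
import math


def cosine_with_warmup(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Learning-rate multiplier: linear warmup, then half-cosine decay to zero."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Hypothetical run: 100 warmup steps out of 1000 training steps.
for step in [0, 50, 100, 550, 1000]:
    print(step, round(cosine_with_warmup(step, 100, 1000), 3))
```

In this example the multiplier rises linearly to 1.0 at step 100, is back at 0.5 halfway through the decay (step 550) and reaches 0.0 at the last step. Since the training arguments above do not set any warmup, the schedule in this notebook presumably starts decaying from the first step.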

In [27]:
# instantiate trainer
trainer = CustomTrainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

### Test results

In [29]:
trainer.evaluate(
    test_dataset
)

***** Running Evaluation *****
  Num examples = 5161
  Batch size = 32




{'eval_loss': 4.849123954772949,
 'eval_precision': 0.476474197506943,
 'eval_recall': 0.442788403871526,
 'eval_f1': 0.44701666393198985,
 'eval_runtime': 187.0618,
 'eval_samples_per_second': 27.59,
 'eval_steps_per_second': 0.866}