# Bug Localization - 🤗 Transformers

In this notebook we train a bug localization model. The model takes as input the error description and the source code (a combination of natural language, NL, and programming language, PL) and outputs a binary mask (0/1) with one value per input token. It learns to make predictions for the PL part of the input from the true labels of the tokens in the buggy source code. We obtain the true labels from the diff between the buggy source code and the accepted source code (a modified version of the buggy code that solves the problem). Finally, we convert the predicted token mask into a character-level buggy/non-buggy mask so that the results can be displayed in a more human-readable format.

## Imports

For this notebook we use the `transformers` and `datasets` libraries from Hugging Face.

In [1]:
!pip install datasets transformers sentencepiece



In [2]:
import os
import json
import torch

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import RobertaTokenizerFast, RobertaForTokenClassification, Trainer, TrainingArguments, DataCollatorForTokenClassification
from datasets import load_dataset

from difflib import SequenceMatcher
from tqdm.notebook import tqdm
from IPython.display import HTML, display
from functools import partial

codenet_root = 'C:/Users/Hp/00000000000000 Defence/'

os.environ["WANDB_DISABLED"] = "true"
os.environ["TOKENIZERS_PARALLELISM"] = "false"



## Preprocess Data

### Load Data

We load the data from the CSV files. For each example we then compute the diff between the original (buggy) source code and the changed (accepted) source code. For each modification found by the sequence matcher, the affected characters of the buggy code are labeled 1; all other characters are labeled 0.
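As a minimal sketch of this labeling step, the standalone snippet below uses only `difflib` and plain lists. The function name `char_mask_from_diff` and the two sample strings are illustrative assumptions and are not taken from the dataset.

```python
from difflib import SequenceMatcher

def char_mask_from_diff(buggy_code, correct_code):
    """Return one 0/1 label per character of buggy_code (1 = touched by the diff)."""
    matcher = SequenceMatcher(None, buggy_code, correct_code)
    mask = [0] * len(buggy_code)
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Pure insertions have i1 == i2, so mark the character at the insertion point.
        for k in range(i1, min(max(i1 + 1, i2), len(mask))):
            mask[k] = 1
    return mask

buggy = "a = b - c"      # hypothetical buggy line
correct = "a = b + c"    # hypothetical accepted line
mask = char_mask_from_diff(buggy, correct)
print(mask)  # only the '-' at index 6 is marked
```

Identical inputs produce an all-zero mask, which is how accepted submissions can serve as negative examples.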

In [3]:
dataset = load_dataset("csv", data_files={"train": codenet_root+"filtered_data_train.csv", "test": codenet_root+"filtered_data_test.csv"})

Using custom data configuration default-e1f8db9fd35de7cb
Found cached dataset csv (C:/Users/Hp/.cache/huggingface/datasets/csv/default-e1f8db9fd35de7cb/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317)
100%|████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  6.03it/s]


In [4]:
dataset["train"]

Dataset({
    features: ['buggy_code', 'correct_code', 'problem_id', 'buggy_code_submission_id', 'correct_code_submission_id'],
    num_rows: 3063
})

In [5]:
train_dataset = dataset["train"].train_test_split(train_size=0.1)
test_dataset = dataset["test"].train_test_split(test_size=0.1)

In [6]:
train_dataset

DatasetDict({
    train: Dataset({
        features: ['buggy_code', 'correct_code', 'problem_id', 'buggy_code_submission_id', 'correct_code_submission_id'],
        num_rows: 306
    })
    test: Dataset({
        features: ['buggy_code', 'correct_code', 'problem_id', 'buggy_code_submission_id', 'correct_code_submission_id'],
        num_rows: 2757
    })
})

In [7]:
test_dataset

DatasetDict({
    train: Dataset({
        features: ['buggy_code', 'correct_code', 'problem_id', 'buggy_code_submission_id', 'correct_code_submission_id'],
        num_rows: 689
    })
    test: Dataset({
        features: ['buggy_code', 'correct_code', 'problem_id', 'buggy_code_submission_id', 'correct_code_submission_id'],
        num_rows: 77
    })
})

### Label Tokens

To label the tokens, we tokenize the source code and then convert the character mask into a token-level mask. The tokenizer output has a useful method for this task, `char_to_token`, which maps the index of a character to the index of the token that contains it. We concatenate the tokens of the NL part with those of the PL part and set the NL labels to -100 so that they are ignored when the loss is computed. The result is an object that contains the input ids of each token in the format `[cls] NL [sep] PL [sep]`, the attention mask (ones for content, zeros for padding), and the labels (-100 for the NL part and the padding, ones/zeros for the PL tokens). Note that the cell below passes only the source code to the tokenizer, so in this version there is no NL segment.
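The character-to-token projection can be sketched without downloading a tokenizer. The snippet below is an illustration under stated assumptions: `toy_offsets` is a hypothetical whitespace tokenizer standing in for a fast tokenizer's offset mapping, and a single special token is assumed at each end of the sequence.

```python
import re

IGNORE_INDEX = -100  # label value that the loss function ignores

def toy_offsets(text):
    """Hypothetical stand-in for a tokenizer: one token per run of non-space characters."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_mask_to_token_labels(char_mask, offsets):
    """A token is labeled 1 if any character it covers is labeled 1."""
    labels = [IGNORE_INDEX]                      # leading special token
    for start, end in offsets:
        labels.append(int(any(char_mask[start:end])))
    labels.append(IGNORE_INDEX)                  # trailing special token
    return labels

code = "a = b - c"
char_mask = [0, 0, 0, 0, 0, 0, 1, 0, 0]          # only '-' is buggy
offsets = toy_offsets(code)
token_labels = char_mask_to_token_labels(char_mask, offsets)
print(offsets)
print(token_labels)
```

Combining character labels with a logical OR means that a token is flagged as soon as one of its characters was changed by the diff, which matches the `|=` update used in the notebook cell.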

In [8]:
def generate_char_mask(buggy_code, correct_code):
    s = SequenceMatcher(None, buggy_code, correct_code)
    opcodes = [x for x in s.get_opcodes() if x[0] != "equal"]
    
    buggy_labels = np.zeros_like(list(buggy_code), dtype=np.int32)
    for _, i1, i2, _, _ in opcodes:
        buggy_labels[i1: max(i1+1, i2)] = 1

    return buggy_labels.tolist()

def tokenize_and_align_labels(tokenizer, example):
    example = {
        "buggy_code": example["buggy_code"] + example["correct_code"], 
        "correct_code": example["correct_code"] + example["correct_code"], 
        "error_class_extra": ["Accepted" for _ in example["correct_code"]]
    }
    
    y = [generate_char_mask(x_o, x_c) for (x_o, x_c) in zip(example["buggy_code"], example["correct_code"])]
    X_tokenized = tokenizer(text=example["buggy_code"], padding=True, truncation=True)
    
    labels = np.zeros_like(X_tokenized.input_ids, dtype=np.int32) - 100
    for i, y_i in enumerate(y):
        for j, y_i_j in enumerate(y_i):
            idx = X_tokenized.char_to_token(i, j, sequence_index=0)  # only the code is tokenized, so it is sequence 0
            if idx is None:
                continue
            if labels[i, idx] == -100:
                labels[i, idx] = y_i_j
            else:
                labels[i, idx] |= y_i_j
            
    X_tokenized["labels"] = labels.tolist()
    return X_tokenized

tokenizer = RobertaTokenizerFast.from_pretrained("microsoft/codebert-base")

train_dataset = train_dataset.map(partial(tokenize_and_align_labels, tokenizer), batched=True, batch_size=4, remove_columns=train_dataset["train"].column_names)
test_dataset = test_dataset.map(partial(tokenize_and_align_labels, tokenizer), batched=True, batch_size=4, remove_columns=test_dataset["train"].column_names)


## Train

### Training Setup
We load the token classification model from `microsoft/codebert-base` and set the basic parameters for the trainer. We also define the metrics to report: precision, recall, F1 score and accuracy, at both the token level and the document level. For bug localization, an exact match score would arguably be the most informative metric, since we want to find the exact location of the bug that is described in natural language; the code below does not compute it.
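To make the token-level metrics concrete, here is a small worked example in plain Python. It is a sketch only: the helper name `token_level_counts` and the toy predictions and labels are assumptions made for illustration, and positions labeled -100 are skipped in the same way as in the notebook's metric code.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the metrics

def token_level_counts(predictions, labels):
    """Count TP, FP, FN, TN over all tokens whose label is not IGNORE_INDEX."""
    tp = fp = fn = tn = 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if label == IGNORE_INDEX:
                continue
            if pred == 1 and label == 1:
                tp += 1
            elif pred == 1 and label == 0:
                fp += 1
            elif pred == 0 and label == 1:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

# Toy batch of two sequences (hypothetical values).
labels = [[-100, 0, 1, 1, -100], [-100, 0, 0, -100, -100]]
predictions = [[0, 0, 1, 0, 1], [1, 1, 0, 0, 0]]

tp, fp, fn, tn = token_level_counts(predictions, labels)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(tp, fp, fn, tn)
print(precision, recall, f1, accuracy)
```

Here five tokens carry real labels, so the counts are one true positive, one false positive, one false negative and two true negatives.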

In [None]:
training_args = TrainingArguments(
    output_dir='codebert-base-buggy-token-classification',          # output directory
    num_train_epochs=3,                                             # total number of training epochs
    per_device_train_batch_size=4,                                  # batch size per device during training
    per_device_eval_batch_size=4,                                   # batch size for evaluation
    warmup_steps=500,                                               # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                                              # strength of weight decay
    logging_dir='./logs',                                           # directory for storing logs
    logging_steps=1_000,                                            # Steps to report the loss value
    save_strategy="no",                                             # do not save checkpoints
)

model = RobertaForTokenClassification.from_pretrained("microsoft/codebert-base")
data_collator = DataCollatorForTokenClassification(tokenizer, padding=True)

In [None]:
def document_level_metrics(true_predictions, true_labels):
    tp = 0
    fp = 0
    fn = 0
    tn = 0

    for t_pred, t_label in zip(true_predictions, true_labels):
        # A document counts as buggy if at least one of its tokens is labeled 1.
        pred_buggy = 1 in t_pred
        ref_buggy = 1 in t_label

        if pred_buggy and ref_buggy:
            tp += 1
        if pred_buggy and not ref_buggy:
            fp += 1
        if not pred_buggy and ref_buggy:
            fn += 1
        if not pred_buggy and not ref_buggy:
            tn += 1

    return {
        "document_precision": tp / (tp + fp) if (tp + fp) != 0 else 0,
        "document_recall": tp / (tp + fn) if (tp + fn) != 0 else 0,
        "document_f1": (2 * tp) / (2 * tp + fp + fn) if (2 * tp + fp + fn) != 0 else 0,
        "document_accuracy": (tp + tn) / (tp + fp + fn + tn) if (tp + fp + fn + tn) != 0 else 0,
    }
            

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictionss = [
        [p for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labelss = [
        [l for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    
    true_predictions = [p for pred in true_predictionss for p in pred]
    true_labels = [p for pred in true_labelss for p in pred]

    
    return {
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
        "accuracy": accuracy_score(true_labels, true_predictions),
        **document_level_metrics(true_predictionss, true_labelss)
    }

In [None]:
trainer = Trainer(
    model=model,                         
    args=training_args,                  
    train_dataset=train_dataset["train"],         
    eval_dataset=train_dataset["test"],            
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

In [None]:
trainer.evaluate()

In [None]:
trainer.evaluate(test_dataset)

## Inference

To perform inference we need the error description and the source code. The inputs are tokenized and the tokens are concatenated; the model then outputs the logits of the buggy/non-buggy classes for each token in the input. We keep only the labels of the PL tokens and map each token back to the word it belongs to. Finally, we obtain the character labels from the word labels using the inverse of the mapping from the preprocessing stage: given a word index, it returns the span of characters contained in that word. The `predict` function below takes only the source code as input.

For readability, the source code is displayed with the predicted buggy characters in red and the true buggy characters in blue.
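The inverse mapping from token predictions back to a character mask can be sketched as follows. This is an illustration under assumptions: the `(start, end)` offsets are written by hand here, whereas in the notebook they come from the tokenizer, and the function name `token_labels_to_char_mask` is hypothetical.

```python
def token_labels_to_char_mask(text_length, offsets, token_labels):
    """Spread each token's 0/1 prediction over the characters that the token covers."""
    mask = [0] * text_length
    for (start, end), label in zip(offsets, token_labels):
        if label == 1:
            mask[start:end] = [1] * (end - start)
    return mask

code = "a = b - c"
offsets = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]  # hand-written character spans, one per token
token_labels = [0, 0, 0, 1, 0]                      # hypothetical model output: '-' is buggy
char_mask = token_labels_to_char_mask(len(code), offsets, token_labels)
print(char_mask)
```

Characters that no token covers, such as the spaces in this example or anything cut off by truncation, keep the label 0.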

In [None]:
def predict(tokenizer, model, source):
    if not isinstance(source, list):
        source = [source]
    
    tokenized_inputs = tokenizer(text=source, padding=True, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**tokenized_inputs)['logits']
    tokenized_labels = np.argmax(logits.cpu().numpy(), 2)
    
    all_labels = []
    for i in range(tokenized_labels.shape[0]):
        # Character-level mask; characters cut off by truncation stay 0.
        labels = np.zeros(len(source[i]), dtype=np.int32)
        for j, label in enumerate(tokenized_labels[i]):
            # Only the source code is tokenized, so it is sequence 0;
            # special and padding tokens map to None and are skipped.
            if tokenized_inputs.token_to_sequence(i, j) != 0:
                continue

            word_id = tokenized_inputs.token_to_word(i, j)
            cs = tokenized_inputs.word_to_chars(i, word_id, sequence_index=0)
            if cs.start == cs.end:
                continue
            # A character is buggy if any token of its word is predicted buggy.
            labels[cs.start:cs.end] |= int(label)
        
        all_labels.append(labels.tolist())
    
    return all_labels

import html

def color_source(source_code, mask, color='red'):
    text = ""
    for i, char in enumerate(source_code):
        norm_color = 'black'
        if char == ' ':
            char = "•"
            norm_color = 'lightgrey'
        if char == '\n':
            char = "↵\n"
            norm_color = 'lightgrey'
        # Escape <, > and & so that the source code is not interpreted as HTML.
        text += f'<span style="color:{color if mask[i] == 1 else norm_color};">{html.escape(char)}</span>'
    return "<pre>" + text + "</pre>"

def display_example(source_code, mask, true_mask):
    display(HTML("<h2>The source code that is predicted buggy:\n</h2>"))
    display(HTML(color_source(source_code, mask, color='red')))

    display(HTML("<h2>The source code that is buggy:\n</h2>"))
    display(HTML(color_source(source_code, true_mask, color='blue')))

    
viz_data = dataset["test"]
for i in range(50):
    source_code = viz_data[i]["buggy_code"]
    source_code_changed = viz_data[i]["correct_code"]
    true_mask = generate_char_mask(source_code, source_code_changed)
    mask = predict(tokenizer, model, source_code)[0]

    display(HTML(f"<h1>Example {i}</h1>"))
    display_example(source_code, mask, true_mask)