# NLBSE'22 Tool Competition

### Introduction

The first edition of the NLBSE’22 tool competition is on automatic **issue report classification**, an important task in issue management and prioritization.

For the competition, we provide a dataset of more than 800k issue reports extracted from real open-source projects, each labeled as a bug, an enhancement, or a question. You are invited to use this dataset to evaluate your classification approaches and compare the results against a proposed baseline approach (based on FastText).


### Participation

If you want to participate, you must:
* Train and tune a multi-class classifier using the provided [training set](https://tickettagger.blob.core.windows.net/datasets/github-labels-top3-803k-train.tar.gz).
* Evaluate your classifier on the provided [test set](https://tickettagger.blob.core.windows.net/datasets/github-labels-top3-803k-test.tar.gz).
* Write a paper (4 pages max.) describing:
    * The architecture and details of the classifier
    * The procedure used to pre-process the data
    * The procedure used to tune the classifier on the training set
    * The results of your classifier on the test set
    * Additional info.: provide a link to your code/tool with proper documentation on how to run it
* **Submit the paper** by the deadline (see below). **Email the paper to the tool competition organizers**: Oscar Chaparro (oscarch@wm.edu) and Rafael Kallis (rk@rafaelkallis.com) 

All submissions must conform to the [ICSE’22 formatting and submission instructions](https://conf.researchr.org/track/icse-2022/icse-2022-papers#how-to-submit).

Papers do not need to be double-blinded.

### Important dates

* Paper/tool submission: **February 21, 2022**
* Acceptance and competition results notification: **March 4, 2022**
* Camera-ready paper submission: **March 18, 2022**

All dates are Anywhere on Earth (AoE).



### Submission acceptance and competition

Submissions will be evaluated and accepted based on **correctness** and **reproducibility**, defined by the following criteria:
* Clarity and detail of the paper content
* Availability of the code/tool, including the training/tuning/evaluation pipeline, released as open-source
* Correct training/tuning/evaluation of your code/tool on the provided data
* Reporting of the metrics and results outlined below
* Clarity of the code documentation

The accepted submissions will be published in the workshop proceedings.

The submissions will be ranked based on the $F_1$ score (defined below) achieved by the proposed classifiers on the test set, as indicated in the papers.

The submission with the highest $F_1$ score will be the winner of the competition.

### Referencing

Since you will be using our dataset (and possibly this Colab notebook) as well as the original work behind the dataset, please cite the following references in your paper:

```
@inproceedings{nlbse2022,
  author={Kallis, Rafael and Chaparro, Oscar and Di Sorbo, Andrea and Panichella, Sebastiano},
  title={NLBSE'22 Tool Competition},
  booktitle={Proceedings of The 1st International Workshop on Natural Language-based Software Engineering (NLBSE'22)},
  year={2022}
}
```

```
@article{ticket-tagger-scp,
  author={Rafael Kallis and Andrea {Di Sorbo} and Gerardo Canfora and Sebastiano Panichella},
  title={Predicting issue types on GitHub},
  journal={Science of Computer Programming},
  volume={205},
  pages={102598},
  year={2021},
  issn={0167-6423},
  doi={10.1016/j.scico.2020.102598},
  url={https://www.sciencedirect.com/science/article/pii/S0167642320302069}
}
  ```

```
@INPROCEEDINGS{ticket-tagger,
  author={Kallis, Rafael and Di Sorbo, Andrea and Canfora, Gerardo and Panichella, Sebastiano},
  booktitle={2019 IEEE International Conference on Software Maintenance and Evolution (ICSME)}, 
  title={Ticket Tagger: Machine Learning Driven Issue Classification}, 
  year={2019},
  volume={},
  number={},
  pages={406-409},
  doi={10.1109/ICSME.2019.00070}
}
  ```


### Training 

You are provided with a [training set](https://tickettagger.blob.core.windows.net/datasets/github-labels-top3-803k-train.tar.gz) encompassing more than 700,000 labeled issue reports extracted from real open-source projects.

Participants are free to select and transform variables from the training set as they please, but **no new sources can be added**.
In other words, any inputs or features used to create the classifier must be derived from the provided training set.
Participants may preprocess, sample, apply over/under-sampling, select a subset of the attributes, perform feature engineering, hold out part of the training set as a validation set for model fine-tuning, etc. Please contact us if you have any questions about this.
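
As an illustration of holding out a validation set from the provided data only, here is a minimal sketch of a stratified split using the Python standard library. The function name `stratified_split`, the `valid_fraction` default, and the toy labels are assumptions made for this example and are not part of the competition material.

```python
import random
from collections import defaultdict

def stratified_split(labels, valid_fraction=0.2, seed=0):
    """Return (train_idx, valid_idx) so that each class keeps roughly
    the same share in both parts. Only indices into the provided
    training set are produced, so no external data is introduced."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, valid_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_valid = round(len(indices) * valid_fraction)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

# toy example (hypothetical labels, not taken from the dataset)
toy_labels = ["bug"] * 50 + ["enhancement"] * 40 + ["question"] * 10
train_idx, valid_idx = stratified_split(toy_labels)
```

Stratifying by label is one possible design choice; it keeps the minority `question` class represented in the validation part, which a purely random split may not guarantee on small samples.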

Each issue report contains the following metadata:
- Label
- Issue title
- Issue body
- Issue URL
- Repository URL
- Creation timestamp
- Author association
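
As a rough illustration of these fields, the sketch below parses one record with the standard `csv` module. The column names are the ones visible in the `trainset.head(5)` output of the submission template; the `Issue` class, the `parse_issues` helper, and the inline sample row are hypothetical.

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class Issue:
    label: str
    title: str
    body: str
    issue_url: str
    repository_url: str
    created_at: str
    author_association: str

def parse_issues(csv_file):
    """Yield Issue records from a CSV file object with the dataset's columns."""
    for row in csv.DictReader(csv_file):
        yield Issue(
            label=row["issue_label"],
            title=row["issue_title"],
            body=row["issue_body"],
            issue_url=row["issue_url"],
            repository_url=row["repository_url"],
            created_at=row["issue_created_at"],
            author_association=row["issue_author_association"],
        )

# hypothetical sample row, made up for illustration (empty body)
sample = io.StringIO(
    "issue_url,issue_label,issue_created_at,issue_author_association,"
    "repository_url,issue_title,issue_body\n"
    "https://example.org/issues/1,bug,2021-01-01T00:00:00Z,NONE,"
    "https://example.org/repo,Crash on start,\n"
)
issues = list(parse_issues(sample))
```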

Each issue is labeled with exactly one class that indicates the issue type: bug, enhancement, or question.

The distribution of (722,899) issues in the training set is:
* bug:            361,239 (50%)
* enhancement:    299,287 (41.4%)
* question:        62,373 (8.6%)

### Evaluation 

> Note: for correct LaTeX rendering, open this notebook in Colab.

Submissions are evaluated based on their class-detection performance over the provided [test set](https://tickettagger.blob.core.windows.net/datasets/github-labels-top3-803k-test.tar.gz). 

The distribution of (80,518) issues in the test set is:
* bug:            40,152 (49.9%)
* enhancement:    33,290 (41.3%)
* question:        7,076 (8.8%)

The evaluation must be performed on the entire test set only. **Important:** you may apply any preprocessing or feature engineering on this dataset except sampling, rebalancing, undersampling or oversampling techniques.

Classification performance is measured using the $F_1$ score over all three classes.

A submission (i.e., paper) in the tool competition must provide:
- Precision $P_c$, for each class $c$
- Recall $R_c$, for each class $c$
- $F_{1,c}$ score, the harmonic mean of $P_c$ and $R_c$, for each class $c$
- Precision $P$, micro-averaged over all classes
- Recall $R$, micro-averaged over all classes
- $F_1$ score, the harmonic mean of $P$ and $R$

These metrics are defined as follows:
\begin{align}
P_c &= \frac{ TP_c }{ TP_c + FP_c } & 
R_c &= \frac{ TP_c }{ TP_c + FN_c } &
F_{1,c} &= 2 \cdot \frac{ P_c \cdot R_c }{ P_c + R_c }\\
P &= \frac{ \sum_c TP_c }{ \sum_c \left( TP_c + FP_c \right) } & 
R &= \frac{ \sum_c TP_c }{ \sum_c \left( TP_c + FN_c \right) } &
F_1 &= 2 \cdot \frac{ P \cdot R }{ P + R }
\end{align}

where $TP_c$, $FP_c$ and $FN_c$ represent true positives, false positives and false negatives over a class $c$, respectively.
Micro-averaging was chosen as the cross-class aggregation method due to the class imbalance present in the data.
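
The definitions above can be sketched in plain Python as follows. This is an illustrative implementation, not the official scoring script; the function name `evaluate` and the toy labels are assumptions made for this example.

```python
from collections import Counter

def evaluate(y_true, y_pred, classes):
    """Compute per-class and micro-averaged (precision, recall, F1)
    following the definitions above."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted, but wrongly
            fn[t] += 1  # true class t was missed

    def prf(tp_, fp_, fn_):
        precision = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
        recall = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        return precision, recall, f1

    per_class = {c: prf(tp[c], fp[c], fn[c]) for c in classes}
    micro = prf(sum(tp[c] for c in classes),
                sum(fp[c] for c in classes),
                sum(fn[c] for c in classes))
    return per_class, micro

# toy labels (hypothetical, not taken from the dataset)
y_true = ["bug", "bug", "enhancement", "question", "question"]
y_pred = ["bug", "enhancement", "enhancement", "bug", "question"]
per_class, micro = evaluate(y_true, y_pred, ["bug", "enhancement", "question"])
```

Because each issue receives exactly one predicted label, every misclassification counts once as a false positive (for the predicted class) and once as a false negative (for the true class). The micro-averaged $P$, $R$ and $F_1$ therefore coincide with each other and with accuracy, which is why the baseline reports identical global values.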

Please note that whilst all of the above measures must be provided for acceptance, the submissions will **only** be ranked by their $F_1$ score.

### Colab Notebook

Colab notebooks provide a convenient format for publishing work and make it easier for readers to reproduce results. Colab provides an interactive Python runtime framework that can be accessed from a web browser. 
Colab requires no configuration, provides free access to GPUs and allows for easy sharing between collaborators.

Participants can use a cloud-hosted infrastructure (recommended), connect to a local runtime or not use Colab at all. We do not impose any constraints on the technologies used. You can read more about Colab [here](https://colab.research.google.com/notebooks/intro.ipynb). An example of a GPU-accelerated Colab notebook can be found [here](https://colab.research.google.com/notebooks/gpu.ipynb).


## Submission Template

Participants are encouraged, but not required, to use the following code as a template for their submission. 
The example below trains a FastText model, which is provided as a baseline.

In [None]:
import os
import time
import numpy as np
import pandas as pd
import seaborn as sns
import gensim
import sklearn.metrics
from tqdm.auto import tqdm

In [None]:
# persistent file storage
# https://colab.research.google.com/notebooks/io.ipynb
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### Training

In [None]:
# download the training set if it does not exist
if not os.path.isfile("github-labels-top3-803k-train.csv"):
  !curl "https://tickettagger.blob.core.windows.net/datasets/github-labels-top3-803k-train.tar.gz" | tar -xz

trainset = pd.read_csv("github-labels-top3-803k-train.csv")



In [None]:
trainset.head(5)

Unnamed: 0.1,Unnamed: 0,issue_url,issue_label,issue_created_at,issue_author_association,repository_url,issue_title,issue_body
0,0,https://api.github.com/repos/eamodio/vscode-gi...,bug,2021-01-02T18:07:30Z,NONE,https://api.github.com/repos/eamodio/vscode-gi...,Welcome screen on every editor window is very ...,I just discovered Gitlens and find the functio...
1,1,https://api.github.com/repos/binwiederhier/pco...,bug,2020-12-31T18:19:31Z,OWNER,https://api.github.com/repos/binwiederhier/pcopy,"""pcopy invite"" and ""pcopy paste abc:"" does not...",
2,2,https://api.github.com/repos/binwiederhier/pco...,bug,2021-01-03T04:33:36Z,OWNER,https://api.github.com/repos/binwiederhier/pcopy,"UI: Modal overlay is half transparent, shouldn...",
3,3,https://api.github.com/repos/Sothatsit/RoyalUr...,enhancement,2020-12-25T00:46:00Z,OWNER,https://api.github.com/repos/Sothatsit/RoyalUr...,Make the loading screen scale with browser win...,Currently the loading wheel is a fixed size in...
4,4,https://api.github.com/repos/Malivil/TTT-Custo...,bug,2021-01-02T21:36:57Z,OWNER,https://api.github.com/repos/Malivil/TTT-Custo...,Spectator - Investigate a way to strip weapons...,To bring magneto stick floating


In [None]:
trainset.groupby("issue_label").size()

issue_label
bug            361103
enhancement    299374
question        62422
dtype: int64

In [None]:
# preprocessing can be customized by participants
def preprocess(row):
  # concatenate title and body, then collapse repeated whitespace into single spaces
  # note: a missing title or body (NaN) becomes the literal string "nan" via str()
  doc = ""
  doc += str(row.issue_title)
  doc += " "
  doc += str(row.issue_body)
  # https://radimrehurek.com/gensim/parsing/preprocessing.html
  doc = gensim.parsing.preprocessing.strip_multiple_whitespaces(doc)
  return doc
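
Since preprocessing may be customized, one possible variation is sketched below using only the standard library: it treats missing values as empty strings, replaces URLs with a placeholder token, and collapses whitespace. The names `normalize_text` and `is_missing`, the `<url>` token, and the regular expressions are illustrative assumptions; this is not the baseline's preprocessing, and whether it helps would need to be checked on a validation set.

```python
import math
import re

URL_RE = re.compile(r"https?://\S+")
WHITESPACE_RE = re.compile(r"\s+")

def is_missing(value):
    """True for None and float NaN (how pandas represents empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def normalize_text(title, body):
    """Join title and body, mask URLs, and collapse whitespace."""
    parts = ["" if is_missing(v) else str(v) for v in (title, body)]
    doc = " ".join(parts)
    doc = URL_RE.sub("<url>", doc)
    return WHITESPACE_RE.sub(" ", doc).strip()

example = normalize_text("Crash  on\nstart", float("nan"))
```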

In [None]:
# transform dataset into fasttext format
# https://fasttext.cc/docs/en/supervised-tutorial.html#getting-and-preparing-the-data

# RNG used to hold out about 2/9 of the training set as a validation set for fastText hyperparameter autotuning
random = np.random.default_rng(0)

with open("train.txt", "w") as train_f, open("valid.txt", "w") as valid_f:
  for row in tqdm(trainset.itertuples(), desc="Transform to fastText format", total=len(trainset)):
    doc = f"__label__{row.issue_label} {preprocess(row)}\n"
    is_train = random.uniform() < 7/9
    f = train_f if is_train else valid_f
    f.write(doc)

!wc -l "valid.txt"
!wc -l "train.txt"

!head -n 10 "train.txt"


160535 valid.txt
562364 train.txt
__label__bug Welcome screen on every editor window is very tedious I just discovered Gitlens and find the functionality useful, thank you to all who contribute. I have about a dozen editor windows open, and the install process added a Gitlens welcome tab to each and every one of them. Combined with the snowflake effect, all of the sudden VScode was consuming 300-400% cpu and my fan was raging, as soon as I hunted them all down everything was back to fine. The welcome note content is great (although putting it on _all_ the windows is a bit much, don't know how much control you have on that). But overall it was a bit of a sour first-use experience, just wanted to provide that feedback.
__label__bug "pcopy invite" and "pcopy paste abc:" does not check if clipboard exists nan
__label__bug UI: Modal overlay is half transparent, shouldn't be nan
__label__enhancement Make the loading screen scale with browser window size Currently the loading wheel is a fixed
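
As the output above shows, each line of the fastText input consists of a `__label__<class>` prefix followed by the text of one example. A small sketch of writing and reading that format is given below; the helper names `to_fasttext_line` and `parse_label` are hypothetical and not part of the fastText API.

```python
LABEL_PREFIX = "__label__"

def to_fasttext_line(label, text):
    """Format one example; newlines in the text are collapsed to spaces
    because fastText reads one example per line."""
    text = " ".join(text.split())
    return f"{LABEL_PREFIX}{label} {text}\n"

def parse_label(raw_label):
    """Strip the prefix from a label such as '__label__bug'."""
    if not raw_label.startswith(LABEL_PREFIX):
        raise ValueError(f"unexpected label: {raw_label!r}")
    return raw_label[len(LABEL_PREFIX):]

line = to_fasttext_line("bug", "UI: Modal overlay\nis half transparent")
```

The prefix is 9 characters long, which is why the evaluation cell of this template slices predicted labels with `[9:]`.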

In [None]:
!pip install fasttext
import fasttext

# https://fasttext.cc/docs/en/python-module.html#train_supervised-parameters
model = fasttext.train_supervised("train.txt", 
                                  minCount=50,
                                  autotuneValidationFile="valid.txt",
                                  autotuneDuration=5*60)
model.quantize()
# use a single timestamp so the local and Drive copies share the same file name
timestamp = int(time.time())
model.save_model(f"github-labels-{timestamp}.ftz")
model.save_model(f"drive/MyDrive/github-labels-{timestamp}.ftz")
model


<fasttext.FastText._FastText at 0x7f072b9246d0>

#### Evaluation

In [None]:
if not os.path.isfile("github-labels-top3-803k-test.csv"):
  !curl "https://tickettagger.blob.core.windows.net/datasets/github-labels-top3-803k-test.tar.gz" | tar -xz

testset = pd.read_csv("github-labels-top3-803k-test.csv")
testset.groupby("issue_label").size()
#testset




issue_label
bug            40288
enhancement    33203
question        7027
dtype: int64

In [None]:
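# collect the true and predicted labels over the entire test set
y_true = []
y_pred = []

for row in tqdm(testset.itertuples(), desc="Benchmarking Inference Performance", total=len(testset)):
  # predict() returns (labels, probabilities); take the top label and strip the "__label__" prefix (9 characters)
  pred = model.predict(preprocess(row))[0][0][9:]
  y_true.append(row.issue_label)
  y_pred.append(pred)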

for label in ["bug", "enhancement", "question"]:
  P_c = sklearn.metrics.precision_score(y_true, y_pred, average=None, labels=[label])[0]
  R_c = sklearn.metrics.recall_score(y_true, y_pred, average=None, labels=[label])[0]
  F1_c = sklearn.metrics.f1_score(y_true, y_pred, average=None, labels=[label])[0]
  print(f"=*= {label} =*=")
  print(f"precision:\t{P_c:.4f}")
  print(f"recall:\t\t{R_c:.4f}")
  print(f"F1 score:\t{F1_c:.4f}")
  print()


P = sklearn.metrics.precision_score(y_true, y_pred, average='micro')
R = sklearn.metrics.recall_score(y_true, y_pred, average='micro')
F1 = sklearn.metrics.f1_score(y_true, y_pred, average='micro')

print("=*= global =*=")
print(f"precision:\t{P:.4f}")
print(f"recall:\t\t{R:.4f}")
print(f"F1 score:\t{F1:.4f}")


=*= bug =*=
precision:	0.8314
recall:		0.8725
F1 score:	0.8515

=*= enhancement =*=
precision:	0.8155
recall:		0.8464
F1 score:	0.8307

=*= question =*=
precision:	0.6521
recall:		0.3502
F1 score:	0.4557

=*= global =*=
precision:	0.8162
recall:		0.8162
F1 score:	0.8162
