diff --git a/_data/authors.yml b/_data/authors.yml
index daa61473f526..5a566726eb6c 100644
--- a/_data/authors.yml
+++ b/_data/authors.yml
@@ -578,4 +578,23 @@ Lyle Ungar:
       url: "mailto:ungar@cis.upenn.edu"
     - label: "Website"
       icon: "fas fa-fw fa-link"
-      url: "https://www.cis.upenn.edu/~ungar/"
\ No newline at end of file
+      url: "https://www.cis.upenn.edu/~ungar/"
+
+Arthur Wayne:
+  name: "Arthur Wayne"
+  avatar: "https://arthursolwayne.com/pfp.png"
+  bio: "Researcher at Penn"
+  home: "https://arthursolwayne.com/"
+  links:
+    - label: "Email"
+      icon: "fas fa-fw fa-envelope-square"
+      url: "mailto:artwayne@seas.upenn.edu"
+    - label: "Website"
+      icon: "fas fa-fw fa-link"
+      url: "https://arthursolwayne.com"
+    - label: "Twitter"
+      icon: "fab fa-fw fa-twitter-square"
+      url: "https://x.com/arthursolwayne"
+    - label: "GitHub"
+      icon: "fab fa-fw fa-github"
+      url: "https://github.com/arthursolwayne"
\ No newline at end of file
diff --git a/_posts/2025-02-19-BAP.md b/_posts/2025-02-19-BAP.md
new file mode 100644
index 000000000000..93a4d6e86f53
--- /dev/null
+++ b/_posts/2025-02-19-BAP.md
@@ -0,0 +1,699 @@
+---
+published: true
+title: "Where’s the Bug? Attention Probing for Scalable Fault Localization"
+excerpt: "Attention probing of LLMs enables accurate bug localization without direct localization labels, code execution, or the largest of LLMs."
+header:
+  overlay_color: "#000"
+  overlay_filter: "0.5"
+  overlay_image: assets/images/BAP/waldo.jpg
+  teaser: assets/images/BAP/bap_fig1.png
+  actions:
+    - label: "Paper"
+      url: "https://arxiv.org/abs/2502.13966"
+    - label: "Code"
+      url: "https://github.com/adaminsky/BAP"
+authors:
+  - Adam Stein|equal
+  - Arthur Wayne|equal
+  - Aaditya Naik
+  - Mayur Naik
+  - Eric Wong
+
+step1:
+  - url: "/assets/images/BAP/codeprobes_1_blog.png"
+    image_path: "/assets/images/BAP/codeprobes_1_blog.png"
+    alt: ""
+    title: ""
+step2:
+  - url: "/assets/images/BAP/codeprobes_2_blog.png"
+    image_path: "/assets/images/BAP/codeprobes_2_blog.png"
+    alt: ""
+    title: ""
+step3:
+  - url: "/assets/images/BAP/codeprobes_3_blog.png"
+    image_path: "/assets/images/BAP/codeprobes_3_blog.png"
+    alt: ""
+    title: ""
+---
+
+> Locating bugs in code remains a challenging problem for both humans and automated systems.
+> Existing methods require executable test cases, training on costly line-level annotations, or resource-intensive LLMs.
+> We present the Bug Attention Probe (BAP), a method that learns to localize bugs without direct localization labels through an attention mechanism over latent code tokens.
+> BAP identifies buggy lines of code by assigning high attention weights to the tokens comprising the buggy line.
+> BAP not only outperforms traditional fault localization baselines and prompting of large-scale LLMs, but also requires a fraction of their computational cost.
+
+Pinpointing the location of bugs in code, known as fault localization (FL), is a central challenge in software engineering. As large language models (LLMs) become increasingly capable at code-related tasks, automating FL becomes even more useful and important. For example, LLMs are now used to propose complete bug fixes from just a user's bug report, but their effectiveness is limited by their ability to first locate the bug.
+
+In software engineering, this problem has been studied extensively, but the proposed methods often require code execution and still perform poorly overall.
+Deep learning has recently enabled test- and execution-free FL by training models with line-level supervision (i.e., labels for which lines have a bug and which do not), but this form of supervision is costly to collect. Such training can now even be avoided entirely by prompting the best LLMs.
+
+The limitations of existing FL approaches make them hard to apply to real-world code.
+LLMs enable FL on arbitrary code, empowering tools that autonomously fix bugs without special setup or training, but this comes at a cost.
+While an individual LLM call is cheap, relying on the best LLMs becomes very expensive on large, rapidly evolving codebases where the model may be evaluated on every code file for every code change. This leads us to our central question:
+
+How do we achieve strong FL performance without tests, strong supervision, or large-scale models?
+{: .notice--info}
+
+To this end, we propose the *Bug Attention Probe* (BAP), which elicits strong localization performance from a small LLM with a lightweight form of supervision that requires no line-level localization labels. Before presenting BAP, we first introduce the problem of FL.
+
+## Fault Localization (FL)
+What is a bug in the first place? For our purposes, a code fragment either has a bug or it does not, so we define a bug as a property $$b: \mathcal{P}\rightarrow \{0, 1\}$$ where $$\mathcal{P}$$ is the space of all possible programs. A fragment of code $$c\in\mathcal{P}$$ is buggy if $$b(c)=1$$.
+
+We can now say what it means to localize a bug, using the notion of a counterfactual explanation.
+In particular, a program has a bug localized to line $$i$$ if modifying only line $$i$$ would remove the bug. Not all bugs can be localized to a single line: sometimes multiple lines must be modified to make a code fragment bug-free. The ground-truth FL label for a buggy code fragment is therefore a set of one or more line numbers that must be modified to remove the bug. For example, a bug whose fix touches lines 3 and 6 has the label $$\{3, 6\}$$.
+
+We can thus think of an FL method as a mapping from a buggy code fragment to the set of lines we should change to remove the bug, $$l: \mathcal{P}\rightarrow 2^{\mathbb{Z}^+}$$. An FL method is directly useful since it outputs which lines of code should be modified to fix a bug, but performing FL in practice presents several challenges.
+
+### Challenges in FL
+
+We identify three main challenges faced by existing FL methods.
+
+1. **Need for Strong Supervision**: Learning-based FL methods train on datasets of code fragments paired with fault localization line numbers. Such datasets require large amounts of manual effort to create, since building them presupposes a working method for fault localization. It is much easier to collect large amounts of code labelled only as buggy or not, since there are many accurate bug detection signals such as failing tests.
+2. **Localizing Multi-line Bugs**: Most real-world bugs span multiple lines, but existing FL methods mostly focus on the single-line case.
+3. **Resource Efficiency**: LLMs are state-of-the-art for FL, but only the most powerful ones perform well, and they are expensive, especially when called repeatedly.
+
+To address these challenges, we next present our method based on LLM probing.
+
+## Attention Probing for FL
+
+BAP is built around the insight that an LLM's attention mechanism can implicitly reveal the most suspicious parts of code.
+If we train a small model to classify code as buggy or not (weak supervision for the task of FL), its learned attention weights may reveal which tokens drive that decision.
+Summing those attention weights at the line level then produces a human-interpretable localization, with no explicit line-level bug labels needed.
+
+BAP consists of three steps: training with weak supervision, token-level attention extraction, and line-level attention aggregation. We walk through each step below.
+
+**Step 1: Weak Supervision.** BAP trains on a bug detection dataset: each sample is labeled just "buggy" or "clean", and no line-level information is needed. While the training signal is only a binary label, BAP uses an attention mechanism over a frozen LLM's latent representations to implicitly learn to localize the bug.
+
+{% include gallery id="step1" %}
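+
+To make Step 1 concrete, here is a minimal PyTorch-style sketch of the probe and its training loop. The names, dimensions, and single-head design are illustrative assumptions for exposition rather than the exact BAP implementation (see our [code](https://github.com/adaminsky/BAP) for that), and `dataset` stands in for precomputed frozen-LLM hidden states paired with binary labels.
+
+```python
+# Illustrative sketch: an attention probe over frozen LLM hidden states,
+# trained only with binary buggy/clean labels (weak supervision).
+import torch
+import torch.nn as nn
+
+class AttentionProbe(nn.Module):
+    def __init__(self, hidden_dim: int, head_dim: int = 64):
+        super().__init__()
+        self.query = nn.Linear(hidden_dim, head_dim)
+        self.key = nn.Linear(hidden_dim, head_dim)
+        self.value = nn.Linear(hidden_dim, head_dim)
+        self.classify = nn.Linear(head_dim, 1)  # buggy-vs-clean logit
+
+    def forward(self, hidden):
+        # hidden: (seq_len, hidden_dim) latent code tokens from the frozen LLM
+        q = self.query(hidden[-1:])          # the final token acts as the query
+        k, v = self.key(hidden), self.value(hidden)
+        attn = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)  # (1, seq_len)
+        pooled = attn @ v                    # attention-weighted summary
+        return self.classify(pooled).squeeze(), attn.squeeze(0)
+
+probe = AttentionProbe(hidden_dim=2048)
+optimizer = torch.optim.Adam(probe.parameters(), lr=1e-4)
+loss_fn = nn.BCEWithLogitsLoss()
+
+# Placeholder data: (hidden_states, label) pairs where label is 1.0 iff buggy.
+dataset = [(torch.randn(128, 2048), torch.tensor(1.0))]
+for hidden, label in dataset:
+    logit, _ = probe(hidden)     # only the detection logit receives supervision
+    loss = loss_fn(logit, label)
+    optimizer.zero_grad()
+    loss.backward()
+    optimizer.step()
+```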
+
+**Step 2: Attention Extraction.** After BAP is trained on the detection task, we run it on a code fragment and collect each code token's attention score to the final token, giving token-level localization information. These raw scores are not immediately interpretable on their own, since programmers operate at the code or line level rather than the token level.
+
+{% include gallery id="step2" %}
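+
+Continuing the sketch above, extraction might look as follows. The model here is just an example of a small open LLM whose hidden size matches the probe; note that the scores come from the trained probe's attention, not from the LLM's own attention heads.
+
+```python
+# Illustrative sketch: encode a fragment with the frozen LLM, then read off
+# the trained probe's per-token attention scores.
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_name = "meta-llama/Llama-3.2-1B"  # example model with hidden size 2048
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+llm = AutoModelForCausalLM.from_pretrained(model_name)
+
+code = 'if (age >= 18) {\n    System.out.println("You\'re a minor!");\n}'
+inputs = tokenizer(code, return_tensors="pt")
+with torch.no_grad():
+    out = llm(**inputs, output_hidden_states=True)
+hidden = out.hidden_states[-1][0]  # (seq_len, hidden_dim) from the last layer
+
+_, token_attn = probe(hidden)      # one localization score per code token
+```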
+
+**Step 3: Line-level Ranking.** For each code line, we sum the attention scores of all tokens belonging to that line to obtain line-level localization scores. The line(s) with the highest scores are flagged as most likely to contain the bug.
+
+{% include gallery id="step3" %}
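+
+Finally, the token scores can be pooled by line using the tokenizer's character offsets. Again a sketch with our own naming rather than BAP's API; special tokens simply fall into line 0 here.
+
+```python
+# Illustrative sketch: sum the probe's token scores over each source line.
+def line_scores(code, token_attn, offsets):
+    line_starts = [0] + [i + 1 for i, ch in enumerate(code) if ch == "\n"]
+    scores = [0.0] * len(line_starts)
+    for (start, _end), score in zip(offsets, token_attn.tolist()):
+        line = sum(s <= start for s in line_starts) - 1  # line containing token
+        scores[line] += score
+    return scores
+
+offsets = tokenizer(code, return_offsets_mapping=True)["offset_mapping"]
+ranked = sorted(enumerate(line_scores(code, token_attn, offsets)),
+                key=lambda kv: kv[1], reverse=True)
+print(ranked[0])  # (line_index, score) of the most suspicious line
+```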
+
+## A Toy Example
+We now demonstrate BAP on a simple example.
+Consider the following Java code snippet, which we wrote (without much thought):
+```java
+Integer vote;
+public void addVote(int age) {
+    if (age >= 18) {
+        System.out.println("You're a minor!");
+    } else {
+        vote++;
+        System.out.println("You can vote!");
+    }
+}
+```
+If you look closely at the code, there are two major bugs! The first is that the condition of the if statement is inverted (`>= 18` prints "You're a minor!"). The second is subtler: the `vote` field is an `Integer` reference that is never initialized, so it is `null` and incrementing it throws a `NullPointerException`. How would we apply BAP to this code, and what would the output look like?
+
+### BAP Attends to Buggy Lines
+
+| BAP score | Code |
+| ---: | :--- |
+| 0.03 | `Integer vote;` |
+| 0.07 | `public void addVote(int age) {` |
+| **0.28** | `if (age >= 18) {` |
+| 0.05 | `System.out.println("You're a minor!");` |
+| 0.04 | `} else {` |
+| **0.28** | `vote++;` |
+| 0.10 | `System.out.println("You can vote!");` |
+| 0.02 | `}` |
+| 0.01 | `}` |
+
+Figure 1: BAP's attention visualization showing line-level bug detection scores. Higher scores indicate lines that the model identifies as likely containing bugs. The scores correctly highlight both bugs: the inverted condition (`age >= 18`) and the potential null reference error (`vote++`).
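+
+Reading the ranking off these scores is straightforward. The numbers below are exactly those in Figure 1 (with the two closing braces disambiguated so the keys are unique):
+
+```python
+# Figure 1's line-level scores; ranking them recovers both bugs.
+scores = {
+    "Integer vote;": 0.03,
+    "public void addVote(int age) {": 0.07,
+    "if (age >= 18) {": 0.28,
+    'System.out.println("You\'re a minor!");': 0.05,
+    "} else {": 0.04,
+    "vote++;": 0.28,
+    'System.out.println("You can vote!");': 0.10,
+    "}  // closes else": 0.02,
+    "}  // closes method": 0.01,
+}
+top2 = sorted(scores, key=scores.get, reverse=True)[:2]
+print(top2)  # ['if (age >= 18) {', 'vote++;'] -- the two real bugs
+```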
+
+After training on code examples labelled with just "buggy" or "clean," Figure 1 shows that BAP's line-level attention highlights `if (age >= 18)` and `vote++;` as the top suspects for buggy lines. No test harness or line-level annotations were needed, only examples of what buggy and non-buggy code snippets tend to look like.
+
+In reality, the bugs we evaluate on are far less trivial and much more closely resemble real-world bug encounters.
+
+## BAP Evaluation
+We evaluate BAP on eight diverse fault localization datasets spanning Python, Java, and C bugs, and show that BAP outperforms the baselines on top-1 FL accuracy. The first result below summarizes top-1 FL accuracy across all eight datasets for BAP compared to GPT-4o prompting and LLMAO, a recent deep learning FL approach.
+
+Because BAP produces a ranking over every line of code in a given snippet, multiple suspicious lines can be identified simultaneously. The second result below covers BAP's performance on multi-line bug localization, where BAP again outperforms the baselines.
+
+BAP is also significantly more efficient than prompting, outperforming large open-weight models at a small fraction of the computational cost. Even a 1B parameter base model with BAP outperforms zero-shot prompting of 90B parameter models and GPT-4o, as the third result below shows.
+
+- **Top-1 accuracy:** Averaged across all eight datasets, BAP improves top-1 accuracy by 34.6% over the strongest baseline and by 93.4% over zero-shot prompting of GPT-4o.
+- **Multi-line bugs:** On multi-line bugs, BAP captures more relevant lines than typical single-line localizers, showing better precision in its top-k lines.
+- **Efficiency:** BAP's smaller memory requirements and lower FLOPs mean it can be integrated into continuous integration workflows or potentially even local development IDEs.
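+
+For reference, a prediction counts as correct at top-1 when the probe's highest-ranked line is one of the ground-truth buggy lines. Here is a minimal sketch of this standard top-k FL metric (our formulation for illustration, not the paper's evaluation code):
+
+```python
+# Top-k fault localization accuracy over a benchmark of programs.
+def topk_accuracy(ranked_lines, buggy_lines, k=1):
+    """ranked_lines[i]: predicted ranking for program i (most suspicious first);
+    buggy_lines[i]: set of ground-truth buggy line numbers for program i."""
+    hits = sum(
+        any(line in truth for line in pred[:k])
+        for pred, truth in zip(ranked_lines, buggy_lines)
+    )
+    return hits / len(buggy_lines)
+
+# Example: two programs, one of which is localized correctly at top-1.
+print(topk_accuracy([[3, 7], [2, 5]], [{3}, {9, 10}], k=1))  # 0.5
+```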
+
+## Future Directions
+
+We are excited about future research on leveraging LLM probing for FL and related problems. One current limitation of BAP is that we evaluated only on code fragments of at most a few hundred lines; extending BAP to full repositories is an interesting direction. We also have yet to incorporate BAP into a full bug repair pipeline, where the FL information from BAP is handed to an LLM that proposes fixes. Finally, beyond code bugs, can an approach like BAP find "bugs" in non-code data? For instance, can probing an LLM's latent representations let us localize a mistake in an answer to the particular erroneous reasoning step?
+
+## Conclusion
+
+The best LLMs are now highly capable at FL, enabling FL in many real-world settings where existing techniques are inapplicable due to their need for execution scripts, test cases, or custom models. These LLMs, however, are expensive.
+BAP demonstrates that strong line-level FL can be elicited from a lightweight LLM by probing with just weak supervision. We show that BAP performs strongly across eight FL benchmarks, localizes multi-line bugs, and uses significantly less compute than LLM prompting of comparable performance.
+
+Check out our **[Paper](https://arxiv.org/abs/2502.13966)** and **[Code](https://github.com/adaminsky/BAP)** if you are interested in learning more or using BAP.
+
+```bibtex
+@article{stein2025s,
+  title={Where's the Bug? Attention Probing for Scalable Fault Localization},
+  author={Stein, Adam and Wayne, Arthur and Naik, Aaditya and Naik, Mayur and Wong, Eric},
+  journal={arXiv preprint arXiv:2502.13966},
+  year={2025}
+}
+```
\ No newline at end of file
diff --git a/assets/images/BAP/bap_fig1.png b/assets/images/BAP/bap_fig1.png
new file mode 100644
index 000000000000..44c91185dd8a
Binary files /dev/null and b/assets/images/BAP/bap_fig1.png differ
diff --git a/assets/images/BAP/codeprobes_1.png b/assets/images/BAP/codeprobes_1.png
new file mode 100644
index 000000000000..811084b80893
Binary files /dev/null and b/assets/images/BAP/codeprobes_1.png differ
diff --git a/assets/images/BAP/codeprobes_2.png b/assets/images/BAP/codeprobes_2.png
new file mode 100644
index 000000000000..757e9f827299
Binary files /dev/null and b/assets/images/BAP/codeprobes_2.png differ
diff --git a/assets/images/BAP/codeprobes_3.png b/assets/images/BAP/codeprobes_3.png
new file mode 100644
index 000000000000..6efe116eb48f
Binary files /dev/null and b/assets/images/BAP/codeprobes_3.png differ
diff --git a/assets/images/BAP/waldo.jpg b/assets/images/BAP/waldo.jpg
new file mode 100644
index 000000000000..da123a361363
Binary files /dev/null and b/assets/images/BAP/waldo.jpg differ