A new benchmark for measuring LLMs' capability to detect bugs in large codebases.
- Similar to the Needle In The Haystack benchmark, the Bug In The Haystack benchmark uses randomly assembled Python source code as the background noise and a syntactic bug as the needle.
- This measures an LLM's capability to retrieve code-related information across a very large context window, which is useful for SWE agents and co-pilot applications.
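The construction described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual preprocessing code: the names `SNIPPETS`, `inject_missing_parenthesis`, and `build_haystack` are hypothetical, and the way the target depth is mapped to a snippet position is an assumption.

```python
import random

# Hypothetical pool of clean Python snippets used as haystack material.
SNIPPETS = [
    "def fahrenheit_to_celsius(fahrenheit):\n"
    "    return (fahrenheit - 32) * 5.0/9.0",
    "def is_prime(num):\n"
    "    if num <= 1:\n"
    "        return False\n"
    "    for i in range(2, int(num**0.5) + 1):\n"
    "        if num % i == 0:\n"
    "            return False\n"
    "    return True",
]


def inject_missing_parenthesis(line):
    """Remove the last closing parenthesis from a line (the 'needle')."""
    idx = line.rfind(")")
    if idx == -1:
        raise ValueError("line has no closing parenthesis to remove")
    return line[:idx] + line[idx + 1:]


def build_haystack(snippets, num_snippets, target_depth, seed=0):
    """Assemble random snippets and corrupt one at the given relative depth.

    Returns the line-numbered haystack and the 1-based line number of the bug.
    Assumption: a depth in [0, 1] selects which snippet gets corrupted.
    """
    rng = random.Random(seed)
    chosen = [rng.choice(snippets) for _ in range(num_snippets)]
    bug_idx = min(int(target_depth * num_snippets), num_snippets - 1)

    lines = []
    bug_line = None
    for i, snippet in enumerate(chosen):
        if i > 0:
            lines.append("")  # blank line between snippets
        snippet_lines = snippet.splitlines()
        if i == bug_idx:
            # Corrupt the first line (the def line) of the chosen snippet.
            snippet_lines[0] = inject_missing_parenthesis(snippet_lines[0])
            bug_line = len(lines) + 1
        lines.extend(snippet_lines)

    numbered = "\n".join(
        f"{n} | {text}".rstrip() for n, text in enumerate(lines, start=1)
    )
    return numbered, bug_line


if __name__ == "__main__":
    haystack, bug_line = build_haystack(SNIPPETS, num_snippets=4, target_depth=0.5)
    print(haystack)
    print("Bug is on line", bug_line)
```

Increasing `num_snippets` grows the context length, while `target_depth` moves the bug between the start (0.0) and the end (1.0) of the assembled code.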
1 | def fahrenheit_to_celsius(fahrenheit):
2 |     return (fahrenheit - 32) * 5.0/9.0
3 |
4 | def is_prime(num:
5 |     if num <= 1:
6 |         return False
7 |     for i in range(2, int(num**0.5) + 1):
8 |         if num % i == 0:
9 |             return False
10|     return True
Answer: 4, missing_parenthesis
- *All models were evaluated on their latest versions.
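The sample answer above (`Answer: 4, missing_parenthesis`) pairs a line number with a bug type, so a model response in that format can be checked mechanically. The sketch below shows one way to do that. The names `BugAnswer`, `parse_answer`, and `is_correct` are hypothetical, and requiring an exact match on both fields is an assumption rather than the benchmark's documented scoring rule.

```python
import re
from typing import NamedTuple, Optional


class BugAnswer(NamedTuple):
    line: int
    bug_type: str


# Matches "Answer: <line number>, <bug_type>" anywhere in the response.
ANSWER_RE = re.compile(r"Answer:\s*(\d+)\s*,\s*([A-Za-z_]+)")


def parse_answer(response: str) -> Optional[BugAnswer]:
    """Extract the line number and bug type, or None if the format is absent."""
    match = ANSWER_RE.search(response)
    if match is None:
        return None
    return BugAnswer(int(match.group(1)), match.group(2))


def is_correct(response: str, expected_line: int, expected_type: str) -> bool:
    """Assumed scoring rule: both the line and the bug type must match exactly."""
    return parse_answer(response) == BugAnswer(expected_line, expected_type)
```

For the sample above, `is_correct("Answer: 4, missing_parenthesis", 4, "missing_parenthesis")` evaluates to `True`, while a response that names a different line or omits the `Answer:` prefix is counted as incorrect.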
Comparison (Target Depth @ 0.5)

- GPT-4o
- GPT-4-Turbo
- Claude 3 Opus
- Gemini 1.5 Pro
- Gemini 1.5 Flash
- GPT-3.5-Turbo
- Codestral
- Llama3-70B
- Command-R+
- Gemini-1.0-Pro
- CodeQwen1.5 (Zero-Shot)
- notebooks/bug_in_the_code_stack_python_source_code_preprocessing.ipynb contains the Colab notebook for data processing.
- notebooks/bug_in_the_code_stack_experiment_openai.ipynb contains the Colab notebook for running the experiment on OpenAI models.
- notebooks/bug_in_the_code_stack_experiment_litellm_openai_gpt4.ipynb contains the Colab notebook for running the experiment on GPT-4 with LiteLLM.
- notebooks/bug_in_the_code_stack_experiment_litellm_openai_gpt35.ipynb contains the Colab notebook for running the experiment on GPT-3.5 with LiteLLM.
- notebooks/bug_in_the_code_stack_experiment_litellm_anthropic_claude3.ipynb contains the Colab notebook for running the experiment on Claude 3 Opus with LiteLLM.
- notebooks/bug_in_the_code_stack_experiment_litellm_cohere_commandr.ipynb contains the Colab notebook for running the experiment on Command-R+ with LiteLLM.
- notebooks/bug_in_the_code_stack_experiment_litellm_meta_llama3.ipynb contains the Colab notebook for running the experiment on Llama3 70B with LiteLLM.
- notebooks/bug_in_the_code_stack_experiment_mistral_codestral.ipynb contains the Colab notebook for running the experiment on Mistral Codestral with LiteLLM.
- notebooks/bug_in_the_code_stack_experiment_qwen_codeqwen_local.ipynb contains the Colab notebook for running the experiment on CodeQwen1.5 locally. Make sure to run this on Colab with an A100 GPU.
- notebooks/bug_in_the_code_stack_experiment_genai_gemini10.py contains the Python script for running the experiment on Gemini-1.0-Pro with the Generative AI package. Make sure to run this locally (it doesn't work on Colab).
- notebooks/bics_helper_result_graphs.ipynb contains helper functions for plotting the result graphs.
- notebooks/bics_result_analysis_graphs.ipynb contains code to analyze the properties of codegen-focused models compared to larger, general-purpose models.
- datasets/bug_in_the_code_stack_alpaca_dataset.csv is the preprocessed dataset used for the experiment.
- All notebooks and datasets can also be found at Bug In The Code Stack Google Drive. If you don't have access, please send an access request to techandy42@gmail.com.












