diff --git a/codemmlu/index.html b/codemmlu/index.html
index f0fa840..9f15ec4 100644
--- a/codemmlu/index.html
+++ b/codemmlu/index.html
@@ -26,7 +26,8 @@
- The ability of CodeLLMs to generate executable and functionally correct code at the repository-level scale remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at the repository-level scale. RepoExec focuses on three main aspects: executability, functional correctness through automated test case generation with high coverage rate, and carefully crafted cross-file contexts to accurately generate code. Our work explores a controlled scenario where developers specify necessary code dependencies, challenging the model to integrate these accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel in utilizing provided dependencies and demonstrating debugging capabilities. We also introduce a new instruction-tuned dataset that focuses on code dependencies and demonstrate that CodeLLMs fine-tuned on our dataset have a better capability to leverage these dependencies effectively. RepoExec aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios. The dataset and source code can be found at https://github.com/FSoft-AI4Code/RepoExec.
+ Recent advancements in Code Large Language Models (CodeLLMs) have predominantly focused on open-ended code generation tasks, often neglecting the critical aspect of code understanding and comprehension. To bridge this gap, we present CodeMMLU, a comprehensive multiple-choice question-answer benchmark designed to evaluate the depth of software and code understanding in LLMs. CodeMMLU includes over 10,000 questions sourced from diverse domains, encompassing tasks such as code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks, CodeMMLU assesses models' ability to reason about code rather than merely generate it, providing deeper insights into their grasp of complex software concepts and systems. Our extensive evaluation reveals that even state-of-the-art models face significant challenges with CodeMMLU, highlighting deficiencies in comprehension beyond code generation. By underscoring the crucial relationship between code understanding and effective generation, CodeMMLU serves as a vital resource for advancing AI-assisted software development, ultimately aiming to create more reliable and capable coding assistants.
- RepoExec is a pioneering benchmark that places a strong emphasis on the executability and correctness of generated code. Unlike traditional benchmarks, RepoExec ensures that the code not only compiles but also performs as intended in real-world scenarios. This is achieved through an automated system that verifies installation and runtime requirements, and dynamically generates high-coverage test cases.
+ We introduce CodeMMLU, a novel benchmark designed to evaluate CodeLLMs' ability to understand and comprehend code through multiple-choice question answering (MCQA). This approach enables a deeper assessment of how CodeLLMs grasp coding concepts, moving beyond mere generation capabilities. Inspired by the MMLU dataset from natural language understanding, CodeMMLU offers a robust and easily evaluable methodology with the following key features:
- Key Features of RepoExec:
+
+ Our key contributions are:
-
+
- The experiments conducted using RepoExec have provided several valuable insights into the capabilities of LLMs in code generation:
-
+ |  | Model name | Size (B) | Syntactic knowledge | Semantic knowledge | Real-world tasks | CodeMMLU |
+ |---|---|---|---|---|---|---|
+ | Closed-source models |  |  |  |  |  |  |
+ | Anthropic | Claude-3-sonnet@20240229 | - | 67.22 | 66.08 | 38.26 | 53.97 |
+ | OpenAI | GPT-4o-2024-05-13 | - | 60.41 | 57.82 | 77.18 | 67.0 |
+ |  | GPT-3.5-turbo-0613 | - | 61.68 | 53.64 | 45.26 | 51.7 |
+ | Open-source models |  |  |  |  |  |  |
+ | Meta Llama | CodeLlama-34b-Instruct-hf | 34 | 56.81 | 46.93 | 23.55 | 38.73 |
+ |  | Meta-Llama-3-70B | 70 | 63.38 | 57.64 | 35.29 | 48.98 |
+ |  | Meta-Llama-3-70B-Instruct | 70 | 64.90 | 62.96 | 60.84 | 62.45 |
+ |  | Meta-Llama-3.1-70B | 70 | 64.09 | 59.00 | 8.22 | 37.56 |
+ |  | Meta-Llama-3.1-70B-Instruct | 70 | 64.42 | 62.25 | 56.11 | 60 |
+ | Mistral | Mistral-7B-Instruct-v0.3 | 7 | 54.42 | 51.25 | 31.85 | 43.33 |
+ |  | Mixtral-8x7B-Instruct-v0.1 | 46.7 | 61.17 | 54.89 | 24.90 | 42.96 |
+ |  | Codestral-22B-v0.1 | 22 | 60.34 | 52.11 | 37.86 | 47.6 |
+ | Phi | Phi-3-medium-128k-instruct | 14 | 58.54 | 54.56 | 37.89 | 48.03 |
+ |  | Phi-3-mini-128k-instruct | 3.8 | 53.01 | 48.65 | 22.36 | 37.93 |
+ | Qwen | Qwen2-57B-A14B-Instruct | 57 | 61.34 | 57.48 | 30.48 | 46.34 |
+ |  | CodeQwen1.5-7B-Chat | 7 | 49.66 | 46.58 | 56.37 | 49.82 |
+ | Yi | Yi-1.5-34B-Chat | 34 | 58.32 | 55.59 | 40.27 | 49.39 |
+ |  | Yi-1.5-9B-Chat | 9 | 55.64 | 55.06 | 37.15 | 47.23 |
+ | DeepSeek | DeepSeek-coder-7b-instruct-v1.5 | 7 | 56.67 | 47.90 | 28.46 | 41.21 |
+ |  | DeepSeek-coder-33b-instruct | 33 | 53.65 | 46.11 | 21.47 | 36.6 |
+ |  | DeepSeek-moe-16b-chat | 16.4 | 31.74 | 35.43 | 27.33 | 31.01 |
+ |  | DeepSeek-Coder-V2-Lite-Instruct | 16 | 59.91 | 54.76 | 33.62 | 46.51 |
+ | InternLM | InternLM2-5-20b-chat | 20 | 57.85 | 55.51 | 30.44 | 44.89 |
+ | StarCoder2 | StarCoder2-15b-instruct-v0.1 | 15 | 56.58 | 49.07 | 42.79 | 47.94 |
+ For more benchmark details, please check HERE
+
+
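The added page text describes CodeMMLU as a multiple-choice question-answering (MCQA) benchmark, which makes scoring straightforward: format each question with lettered choices, read a letter out of the model's reply, and compute accuracy. As a rough illustration of that evaluation style only, here is a minimal sketch; the example items, prompt wording, answer-extraction regex, and `toy_model` stub are all assumptions made for this sketch and are not the official CodeMMLU data or harness.

```python
# Minimal MCQA-style scoring sketch (illustrative only, not the CodeMMLU harness).
import re
from typing import Callable

# Hypothetical items: each has a question, lettered choices, and a gold answer letter.
ITEMS = [
    {
        "question": "What does `len([1, 2, 3])` return in Python?",
        "choices": {"A": "2", "B": "3", "C": "4", "D": "It raises TypeError"},
        "answer": "B",
    },
    {
        "question": "Which defect does `while True: pass` most likely introduce?",
        "choices": {"A": "Memory leak", "B": "Race condition",
                    "C": "Infinite loop", "D": "Integer overflow"},
        "answer": "C",
    },
]

def build_prompt(item: dict) -> str:
    """Format one multiple-choice question as a plain-text prompt."""
    lines = [item["question"]]
    lines += [f"{letter}. {text}" for letter, text in item["choices"].items()]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def extract_choice(completion: str) -> str | None:
    """Pull the first standalone A-D letter out of a model completion."""
    match = re.search(r"\b([ABCD])\b", completion.upper())
    return match.group(1) if match else None

def evaluate(model: Callable[[str], str]) -> float:
    """Return accuracy of `model` (a prompt -> completion callable) on ITEMS."""
    correct = 0
    for item in ITEMS:
        predicted = extract_choice(model(build_prompt(item)))
        correct += int(predicted == item["answer"])
    return correct / len(ITEMS)

if __name__ == "__main__":
    # Stand-in "model" that always answers B; a real run would call an LLM here.
    toy_model = lambda prompt: "The answer is B."
    print(f"accuracy = {evaluate(toy_model):.2f}")
```

In a fuller harness, the only pieces that change are swapping `toy_model` for an actual LLM call and replacing the toy items with real benchmark questions; the prompt/parse/score loop stays the same.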