diff --git a/codemmlu/index.html b/codemmlu/index.html index f0fa840..9f15ec4 100644 --- a/codemmlu/index.html +++ b/codemmlu/index.html @@ -26,7 +26,8 @@ CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs - + + @@ -172,7 +173,7 @@

CodeMMLU: A Multi-Task Benchmark for As - @@ -226,7 +227,7 @@

Abstract

- The ability of CodeLLMs to generate executable and functionally correct code at the repository-level scale remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at the repository-level scale. RepoExec focuses on three main aspects: executability, functional correctness through automated test case generation with high coverage rate, and carefully crafted cross-file contexts to accurately generate code. Our work explores a controlled scenario where developers specify necessary code dependencies, challenging the model to integrate these accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel in utilizing provided dependencies and demonstrating debugging capabilities. We also introduce a new instruction-tuned dataset that focuses on code dependencies and demonstrate that CodeLLMs fine-tuned on our dataset have a better capability to leverage these dependencies effectively. RepoExec aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios. The dataset and source code can be found at https://github.com/FSoft-AI4Code/RepoExec. + Recent advancements in Code Large Language Models (CodeLLMs) have predominantly focused on open-ended code generation tasks, often neglecting the critical aspect of code understanding and comprehension. To bridge this gap, we present CodeMMLU, a comprehensive multiple-choice question-answer benchmark designed to evaluate the depth of software and code understanding in LLMs. CodeMMLU includes over 10,000 questions sourced from diverse domains, encompassing tasks such as code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks, CodeMMLU assesses models' ability to reason about code rather than merely generate it, providing deeper insights into their grasp of complex software concepts and systems. Our extensive evaluation reveals that even state-of-the-art models face significant challenges with CodeMMLU, highlighting deficiencies in comprehension beyond code generation. By underscoring the crucial relationship between code understanding and effective generation, CodeMMLU serves as a vital resource for advancing AI-assisted software development, ultimately aiming to create more reliable and capable coding assistants.

@@ -241,23 +242,33 @@

Abstract

Overview

- RepoExec is a pioneering benchmark that places a strong emphasis on the executability and correctness of generated code. Unlike traditional benchmarks, RepoExec ensures that the code not only compiles but also performs as intended in real-world scenarios. This is achieved through an automated system that verifies installation and runtime requirements, and dynamically generates high-coverage test cases. + We introduce CodeMMLU, a novel benchmark designed to evaluate CodeLLMs' ability to understand and comprehend code through multiple-choice question answering (MCQA). This approach enables a deeper assessment of how CodeLLMs grasp coding concepts, moving beyond mere generation capabilities. Inspired by the MMLU dataset from natural language understanding, CodeMMLU offers a robust and easily evaluable methodology with the following key features (a brief scoring sketch follows the list):

- Key Features of RepoExec:

    -
  • Enhanced Executability: RepoExec goes beyond match-based evaluation to ensure that the generated code can be executed in real-world environments. This involves verifying that the code can be installed and run, addressing a critical aspect of real-world applicability.
  • -
  • Dynamic Test Case Generation: One of the standout features of RepoExec is its sophisticated mechanism for generating test cases. These test cases are designed to thoroughly assess the functionality of the generated code, ensuring that it performs the intended tasks correctly.
  • -
  • Dependency Usage Evaluation: RepoExec evaluates how effectively LLMs utilize code dependencies. This involves analyzing whether the models can accurately integrate and manage external libraries and dependencies, which is crucial for creating functional software at a repository level.
  • -
  • Dependency Invocation Rate (DIR): A novel metric introduced by RepoExec, the Dependency Invocation Rate measures how frequently and effectively generated code invokes dependencies. This metric provides deeper insights into the integration capabilities of LLMs, highlighting their potential for creating more complex and interconnected software systems.
  • +
  • Comprehensiveness: CodeMMLU comprises over 10,000 questions curated from diverse, high-quality sources, mitigating potential bias from limited evaluation data.
  • +
  • Diversity in task, domain, and language: The dataset covers a wide spectrum of software knowledge, including general QA, code generation, defect detection, and code repair across various domains and more than 10 programming languages.
  • +
+
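To make the MCQA setup above concrete, here is a minimal sketch of how a CodeMMLU-style item could be represented and scored. The field names (question, choices, answer) and the letter-extraction rule are illustrative assumptions, not the benchmark's actual schema or official scoring code.

```javascript
// Minimal sketch of MCQA-style scoring; field names and the answer-extraction
// rule are assumptions for illustration, not CodeMMLU's actual schema.
const exampleItem = {
  question: "What is printed by `console.log([1, 2, 3].length)`?",
  choices: ["A. 2", "B. 3", "C. undefined", "D. TypeError"],
  answer: "B",
};

// Pull the first standalone A-D letter out of a model's free-form reply.
function extractChoice(reply) {
  const match = reply.match(/\b([ABCD])\b/);
  return match ? match[1] : null;
}

// Accuracy (%) over an array of { item, reply } pairs.
function accuracy(results) {
  const correct = results.filter(
    ({ item, reply }) => extractChoice(reply) === item.answer
  ).length;
  return (100 * correct) / results.length;
}

// Example: accuracy([{ item: exampleItem, reply: "The answer is B." }]) === 100
```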

+

+ CodeMMLU enables us to assess LLMs' capabilities in coding and software tasks from a novel perspective, extending beyond traditional code generation and completion. Our analysis reveals several notable findings: (1) previously unexplored bias issues in CodeLLMs, aligning with those observed in natural language MCQA tasks; (2) GPT-4 consistently achieving the highest average performance among closed-source models, while (3) the Meta-Llama family demonstrating the greatest accuracy among open-source models; (4) scaling laws related to model size were partially observed within the same model family but not across different families, suggesting the significant influence of pretraining datasets, methodologies, and model architectures; (5) advanced prompting techniques, such as Chain-of-Thought (CoT), consistently degraded performance, raising concerns about CodeLLMs' reasoning abilities on complex, step-by-step tasks; and (6) benchmarks like HumanEval, when converted from open-ended code generation to MCQA format, show that LLMs perform worse on MCQA, casting doubt on their real capability to understand and comprehend code. These findings highlight the current shortcomings of CodeLLMs and the intricate relationship between model architecture, training data quality, and evaluation methods in determining performance on software-related tasks. +
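As a purely hypothetical illustration of finding (6), the sketch below recasts an open-ended generation problem as a multiple-choice item by pairing a reference solution with distractor implementations and shuffling the option order (which also matters for the answer-position bias noted in finding (1)). The distractor strategy shown is an assumption, not the procedure used to build CodeMMLU.

```javascript
// Hypothetical sketch of turning a generation-style task into an MCQA item.
// The distractor strategy and shuffling are assumptions, not CodeMMLU's pipeline.
function toMcqaItem(taskDescription, referenceSolution, distractors) {
  const letters = ["A", "B", "C", "D"];
  // Shuffle so the correct option does not always sit in the same position.
  const options = [referenceSolution, ...distractors]
    .map((code) => ({ code, key: Math.random() }))
    .sort((a, b) => a.key - b.key)
    .map(({ code }) => code);

  return {
    question: `${taskDescription}\nWhich implementation is correct?`,
    choices: options.map((code, i) => `${letters[i]}. ${code}`),
    answer: letters[options.indexOf(referenceSolution)],
  };
}

// Usage sketch:
// toMcqaItem("Return the length of a string s.",
//            "s => s.length",
//            ["s => s.length - 1", "s => s.size", "s => s.count()"]);
```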

+

+

+

+ Our key contributions are: +

    +
  • We present the first MCQA benchmark for software and coding-related knowledge, addressing the need for diverse evaluation scenarios in the code domain. CodeMMLU enables the evaluation of LLMs' alignment with human inference in the software knowledge domain, similar to advancements in the NLP field.
  • +
  • CodeMMLU provides a thorough assessment of LLM capabilities, ensuring a substantial number of samples and diversity across tasks, domains, and languages. This enables a more nuanced understanding of an LLM's strengths and weaknesses, facilitating the development of models better aligned with the complexities and demands of the software domain.
  • +
  • Our experiments offer critical insights into LLM performance, highlighting the impact of factors such as model size, model family, and prompting techniques. This provides essential information to the community on effectively utilizing LLMs for specific tasks and domains in software engineering.

- -
Figure 1: Data Processing Pipeline of RepoExec
+ +
Overview of the CodeMMLU data creation pipeline. The blue area describes the process of collecting raw multiple-choice questions (MCQs) from open-source internet sources for the knowledge test set, while the pipeline for real-world problems is shown in the orange area.
@@ -274,16 +285,247 @@

Overview

Evaluation Results

- The experiments conducted using RepoExec have provided several valuable insights into the capabilities of LLMs in code generation: -

    -
  • Correctness: Pretrained LLMs have shown a high level of correctness in the code they generate. This means that the code produced by these models is syntactically accurate and adheres to the basic structure expected by programming languages.
  • -
  • Dependency Management and Debugging: Instruction-tuned models, on the other hand, excel in managing dependencies and debugging. These models have demonstrated a better ability to handle the complexities of integrating external libraries and resolving issues that arise during the execution of the code.
  • -
+ CodeMMLU revealed significant performance differences across models, as shown in the table below. OpenAI's GPT-4o outperformed all models on CodeMMLU, demonstrating its strength across diverse tasks. Notably, despite not being the latest model, the instruction-tuned version of Meta-Llama-3-70B achieved the highest score among the open-source models from the eight families evaluated. While LLMs perform well on knowledge-based tasks, they struggle with real-world problems, particularly defect detection. A small ranking sketch over these summary scores follows the table.

-
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Model family | Model name | Size (B) | Syntactic knowledge | Semantic knowledge | Real-world tasks | CodeMMLU
Closed-source models
Anthropic | Claude-3-sonnet@20240229 | - | 67.22 | 66.08 | 38.26 | 53.97
OpenAI | GPT-4o-2024-05-13 | - | 60.41 | 57.82 | 77.18 | 67.0
OpenAI | GPT-3.5-turbo-0613 | - | 61.68 | 53.64 | 45.26 | 51.7
Open-source models
Meta Llama | CodeLlama-34b-Instruct-hf | 34 | 56.81 | 46.93 | 23.55 | 38.73
Meta Llama | Meta-Llama-3-70B | 70 | 63.38 | 57.64 | 35.29 | 48.98
Meta Llama | Meta-Llama-3-70B-Instruct | 70 | 64.90 | 62.96 | 60.84 | 62.45
Meta Llama | Meta-Llama-3.1-70B | 70 | 64.09 | 59.00 | 8.22 | 37.56
Meta Llama | Meta-Llama-3.1-70B-Instruct | 70 | 64.42 | 62.25 | 56.11 | 60
Mistral | Mistral-7B-Instruct-v0.3 | 7 | 54.42 | 51.25 | 31.85 | 43.33
Mistral | Mixtral-8x7B-Instruct-v0.1 | 46.7 | 61.17 | 54.89 | 24.90 | 42.96
Mistral | Codestral-22B-v0.1 | 22 | 60.34 | 52.11 | 37.86 | 47.6
Phi | Phi-3-medium-128k-instruct | 14 | 58.54 | 54.56 | 37.89 | 48.03
Phi | Phi-3-mini-128k-instruct | 3.8 | 53.01 | 48.65 | 22.36 | 37.93
Qwen | Qwen2-57B-A14B-Instruct | 57 | 61.34 | 57.48 | 30.48 | 46.34
Qwen | CodeQwen1.5-7B-Chat | 7 | 49.66 | 46.58 | 56.37 | 49.82
Yi | Yi-1.5-34B-Chat | 34 | 58.32 | 55.59 | 40.27 | 49.39
Yi | Yi-1.5-9B-Chat | 9 | 55.64 | 55.06 | 37.15 | 47.23
DeepSeek | DeepSeek-coder-7b-instruct-v1.5 | 7 | 56.67 | 47.90 | 28.46 | 41.21
DeepSeek | DeepSeek-coder-33b-instruct | 33 | 53.65 | 46.11 | 21.47 | 36.6
DeepSeek | DeepSeek-moe-16b-chat | 16.4 | 31.74 | 35.43 | 27.33 | 31.01
DeepSeek | DeepSeek-Coder-V2-Lite-Instruct | 16 | 59.91 | 54.76 | 33.62 | 46.51
InternLM | InternLM2-5-20b-chat | 20 | 57.85 | 55.51 | 30.44 | 44.89
StarCoder2 | StarCoder2-15b-instruct-v0.1 | 15 | 56.58 | 49.07 | 42.79 | 47.94
+
Summary of LLM family performance on CodeMMLU: evaluation results (accuracy %) of different language models across CodeMMLU tasks.
+
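For readers who want to slice the summary above programmatically, here is a small usage sketch assuming a results object shaped like the leaderboard's results.json in this change, where pass@1.complete holds the overall CodeMMLU accuracy and the *_accuracy fields hold the per-category scores. It only ranks models by the reported score; it does not recompute the aggregate from the per-task numbers.

```javascript
// Sketch only: assumes entries shaped like leaderboards/codemmlu/results.json,
// e.g. { "Meta-Llama-3-70B-Instruct": { "pass@1": { complete: 62.45 },
//        syntactic_accuracy: 64.9, semantic_accuracy: 62.96, realtask_accuracy: 60.84 } }.
function rankByCodeMMLU(results) {
  return Object.entries(results)
    .map(([name, entry]) => ({ name, codemmlu: entry["pass@1"].complete }))
    .filter(({ codemmlu }) => codemmlu !== null)
    .sort((a, b) => b.codemmlu - a.codemmlu);
}

// Example: rankByCodeMMLU(resultsJson)[0].name would be "gpt-4o-2024-05-13"
// for the data shipped in this change (67.0 CodeMMLU).
```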
+ +
+

+ For more benchmark details, please check 👉 HERE 👈 +

+
+
+ +
CodeMMLU accuracy by task across LLMs. While knowledge tasks follow the scaling law, real-world tasks pose greater challenges to LLMs, highlighting the influence of instruction tuning and data quality when evaluating on CodeMMLU.
@@ -292,7 +534,7 @@

Evaluation Results

-
+ diff --git a/codemmlu/static/images/codemmlu-logo.png b/codemmlu/static/images/codemmlu-logo.png new file mode 100644 index 0000000..235d66f Binary files /dev/null and b/codemmlu/static/images/codemmlu-logo.png differ diff --git a/codemmlu/static/images/data-creation-flow.png b/codemmlu/static/images/data-creation-flow.png new file mode 100644 index 0000000..13125d6 Binary files /dev/null and b/codemmlu/static/images/data-creation-flow.png differ diff --git a/codemmlu/static/images/task_detail.png b/codemmlu/static/images/task_detail.png new file mode 100644 index 0000000..8d3fbdf Binary files /dev/null and b/codemmlu/static/images/task_detail.png differ diff --git a/leaderboards/codemmlu/_results.json b/leaderboards/codemmlu/_results.json new file mode 100644 index 0000000..12ae803 --- /dev/null +++ b/leaderboards/codemmlu/_results.json @@ -0,0 +1,301 @@ +{ + "CodeLlama-34B-Instruct": { + "link": "https://huggingface.co/codellama/CodeLlama-34b-hf", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 38.73 + }, + "prompted": true, + "size": 34, + "direct_complete": false, + "lazy": false, + "elo_mle": 942 + }, + "Meta-Llama-3-70B": { + "link": "https://huggingface.co/meta-llama/Meta-Llama-3-70B", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 48.98 + }, + "prompted": false, + "size": 70, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Meta-Llama-3-70B-Instruct": { + "link": "https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 62.45 + }, + "prompted": true, + "size": 70, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Meta-Llama-3.1-70B-Instruct": { + "link": "https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 60 + }, + "prompted": true, + "size": 70, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Meta-Llama-3.1-70B": { + "link": "https://huggingface.co/meta-llama/Llama-3.1-70B", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 37.56 + }, + "prompted": false, + "size": 70, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Mistral-7B-Instruct-v0.3": { + "link": "https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 43.33 + }, + "prompted": true, + "size": 7, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Mixtral-8x7B-Instruct-v0.1": { + "link": "https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 42.96 + }, + "prompted": true, + "size": 7, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Codestral-22B-v0.1": { + "link": "https://huggingface.co/mistralai/Codestral-22B-v0.1", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 47.6 + }, + "prompted": true, + "size": 22, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Phi-3-medium-128k-instruct": { + "link": "https://huggingface.co/microsoft/Phi-3-medium-128k-instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 48.03 + }, + "prompted": true, + "size": 14, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Phi-3-mini-128k-instruct": { + "link": "https://huggingface.co/microsoft/Phi-3-mini-128k-instruct", + "open-data": "None", + "pass@1": { + 
"instruct": null, + "complete": 37.93 + }, + "prompted": true, + "size": 3.8, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Qwen2-57B-A14B-Instruct": { + "link": "https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 46.34 + }, + "prompted": true, + "size": 57, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "CodeQwen1.5-7B-Chat": { + "link": "https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 49.82 + }, + "prompted": true, + "size": 7, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Yi-1.5-34B-Chat": { + "link": "https://huggingface.co/01-ai/Yi-1.5-34B-Chat", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 49.39 + }, + "prompted": true, + "size": 34, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Yi-1.5-9B-Chat": { + "link": "https://huggingface.co/01-ai/Yi-1.5-9B-Chat", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 47.23 + }, + "prompted": true, + "size": 9, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "DeepSeek-coder-7b-instruct-v1.5": { + "link": "https://huggingface.co/deepseek-ai/deepseek-coder-7b-instruct-v1.5", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 41.21 + }, + "prompted": true, + "size": 7, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "DeepSeek-coder-33b-instruct": { + "link": "https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 36.6 + }, + "prompted": true, + "size": 33, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "DeepSeek-moe-16b-chat": { + "link": "https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 31.01 + }, + "prompted": true, + "size": 16.4, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "DeepSeek-Coder-V2-Lite-Instruct": { + "link": "https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 46.51 + }, + "prompted": true, + "size": 16, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "InternLM2-5-20b-chat": { + "link": "https://huggingface.co/internlm/internlm2_5-20b-chat", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 44.89 + }, + "prompted": true, + "size": 20, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "StarCoder2-15b-instruct-v0.1": { + "link": "https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 47.94 + }, + "prompted": true, + "size": 15, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Claude-3-sonnet@20240229": { + "link": "", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 53.97 + }, + "prompted": true, + "size": null, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "GPT-4o-2024-05-13": { + "link": "", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 67 + }, + "prompted": true, + "size": null, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "GPT-3.5-turbo-0613": { + "link": "", + "open-data": null, + "pass@1": { + "instruct": null, + "complete": 51.7 + }, + "prompted": 
true, + "size": null, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + } +} \ No newline at end of file diff --git a/leaderboards/codemmlu/images/codemmlu-logo.png b/leaderboards/codemmlu/images/codemmlu-logo.png new file mode 100644 index 0000000..235d66f Binary files /dev/null and b/leaderboards/codemmlu/images/codemmlu-logo.png differ diff --git a/leaderboards/codemmlu/index.html b/leaderboards/codemmlu/index.html index 64513d3..9fc39c4 100644 --- a/leaderboards/codemmlu/index.html +++ b/leaderboards/codemmlu/index.html @@ -19,7 +19,7 @@ section */ + .btn-group-lg > .btn, .btn-lg { + padding: 0.5rem 1rem; + font-size: 1.25rem; + border-radius: 0.3rem; + } + + .btn-outline-hard { + color: #ff6b6b; + border: 2px solid #ff6b6b; + background-color: transparent; + } + + .btn-outline-hard:hover, + .btn-check:checked + .btn-outline-hard, + .btn-outline-hard:active { + color: #fff; + background-color: #ff6b6b !important; + border-color: #ff6b6b; + } + + .btn-outline-full { + color: #4ecdc4; + border: 2px solid #4ecdc4; + background-color: transparent; + } + + .btn-outline-full:hover, + .btn-check:checked + .btn-outline-full, + .btn-outline-full:active { + color: #fff; + background-color: #4ecdc4 !important; + border-color: #4ecdc4; + } @@ -467,7 +501,7 @@

🙏 Acknowledgements

], }; - const theaders = ["Model", "Accuracy"]; + const theaders = ["Model", "Syntactic Accuracy", "Semantic Accuracy", "Real-task Accuracy", "CodeMMLU"]; // score: 'complete', 'instruct' const displayTable = (table, score) => { @@ -490,7 +524,7 @@

🙏 Acknowledgements

theaders.forEach(function (header) { var th = document.createElement("th"); th.textContent = header; - if (header == "Pass@1") { + if (header == "CodeMMLU") { th.style.backgroundColor = "#EEFFEE"; } headerRow.appendChild(th); @@ -554,7 +588,24 @@

🙏 Acknowledgements

// promptedSymbol.textContent = "๐Ÿ’™"; // modelCell.appendChild(promptedSymbol); // } + + // Add Syntactic Accuracy column + + dataRow.appendChild(modelCell); + + var syntacticCell = document.createElement("td"); + syntacticCell.textContent = row["syntactic_accuracy"] || "-"; + dataRow.appendChild(syntacticCell); + + var semanticCell = document.createElement("td"); + semanticCell.textContent = row["semantic_accuracy"] || "-"; + dataRow.appendChild(semanticCell); + + var rtaskCell = document.createElement("td"); + rtaskCell.textContent = row["realtask_accuracy"] || "-"; + dataRow.appendChild(rtaskCell); + var passCell = document.createElement("td"); passCell.classList.add("text-nowrap"); if (lazy) { diff --git a/leaderboards/codemmlu/results.json b/leaderboards/codemmlu/results.json index 12ae803..6e06758 100644 --- a/leaderboards/codemmlu/results.json +++ b/leaderboards/codemmlu/results.json @@ -1,299 +1,848 @@ { - "CodeLlama-34B-Instruct": { - "link": "https://huggingface.co/codellama/CodeLlama-34b-hf", + "claude-3-sonnet@20240229": { + "link": "claude-3-sonnet@20240229", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 53.97 + }, + "realtask_accuracy": 38.26, + "syntactic_accuracy": 67.22, + "semantic_accuracy": 66.08, + "prompted": false, + "size": null, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "gpt-4o-2024-05-13": { + "link": "gpt-4o-2024-05-13", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 67.0 + }, + "realtask_accuracy": 77.18, + "syntactic_accuracy": 60.41, + "semantic_accuracy": 57.81, + "prompted": false, + "size": null, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "gpt-3.5-turbo-0613": { + "link": "gpt-3.5-turbo-0613", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 51.7 + }, + "realtask_accuracy": 45.26, + "syntactic_accuracy": 61.68, + "semantic_accuracy": 53.65, + "prompted": false, + "size": null, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "CodeLlama-7b-Instruct-hf": { + "link": "CodeLlama-7b-Instruct-hf", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 27.01 + }, + "realtask_accuracy": 4.78, + "syntactic_accuracy": 50.14, + "semantic_accuracy": 41.22, + "prompted": true, + "size": 7.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "CodeLlama-7b-Python-hf": { + "link": "CodeLlama-7b-Python-hf", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 29.49 + }, + "realtask_accuracy": 19.36, + "syntactic_accuracy": 38.7, + "semantic_accuracy": 36.87, + "prompted": false, + "size": 7.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "CodeLlama-13b-Instruct-hf": { + "link": "CodeLlama-13b-Instruct-hf", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 30.25 + }, + "realtask_accuracy": 10.53, + "syntactic_accuracy": 50.58, + "semantic_accuracy": 43.0, + "prompted": true, + "size": 13.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "CodeLlama-13b-Python-hf": { + "link": "CodeLlama-13b-Python-hf", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 29.82 + }, + "realtask_accuracy": 56.98, + "syntactic_accuracy": 12.89, + "semantic_accuracy": 4.88, + "prompted": false, + "size": 13.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "CodeLlama-13b-hf": { + "link": "CodeLlama-13b-hf", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 28.51 + }, + 
"realtask_accuracy": 6.65, + "syntactic_accuracy": 50.58, + "semantic_accuracy": 42.95, + "prompted": false, + "size": 13.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "CodeLlama-34b-Instruct-hf": { + "link": "CodeLlama-34b-Instruct-hf", "open-data": "None", "pass@1": { "instruct": null, "complete": 38.73 }, + "realtask_accuracy": 23.55, + "syntactic_accuracy": 56.8, + "semantic_accuracy": 46.93, "prompted": true, - "size": 34, + "size": 34.0, "direct_complete": false, "lazy": false, - "elo_mle": 942 + "elo_mle": 874 + }, + "CodeLlama-34b-Python-hf": { + "link": "CodeLlama-34b-Python-hf", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 9.4 + }, + "realtask_accuracy": 9.37, + "syntactic_accuracy": 15.57, + "semantic_accuracy": 5.34, + "prompted": false, + "size": 34.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 }, "Meta-Llama-3-70B": { - "link": "https://huggingface.co/meta-llama/Meta-Llama-3-70B", + "link": "Meta-Llama-3-70B", "open-data": "None", "pass@1": { "instruct": null, "complete": 48.98 }, + "realtask_accuracy": 35.29, + "syntactic_accuracy": 63.38, + "semantic_accuracy": 57.64, "prompted": false, - "size": 70, + "size": 70.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Meta-Llama-3-70B-Instruct": { - "link": "https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct", + "link": "Meta-Llama-3-70B-Instruct", "open-data": "None", "pass@1": { "instruct": null, "complete": 62.45 }, + "realtask_accuracy": 60.84, + "syntactic_accuracy": 64.9, + "semantic_accuracy": 62.96, "prompted": true, - "size": 70, + "size": 70.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "Meta-Llama-3.1-70B-Instruct": { - "link": "https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct", + "Meta-Llama-3-8B": { + "link": "Meta-Llama-3-8B", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 51.89 + }, + "realtask_accuracy": 53.84, + "syntactic_accuracy": 54.14, + "semantic_accuracy": 47.8, + "prompted": false, + "size": 8.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Meta-Llama-3-8B-Instruct": { + "link": "Meta-Llama-3-8B-Instruct", "open-data": "None", "pass@1": { "instruct": null, - "complete": 60 + "complete": 46.04 }, + "realtask_accuracy": 38.38, + "syntactic_accuracy": 58.1, + "semantic_accuracy": 48.21, "prompted": true, - "size": 70, + "size": 8.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Meta-Llama-3.1-70B": { - "link": "https://huggingface.co/meta-llama/Llama-3.1-70B", + "link": "Meta-Llama-3.1-70B", "open-data": "None", "pass@1": { "instruct": null, "complete": 37.56 }, + "realtask_accuracy": 8.22, + "syntactic_accuracy": 64.09, + "semantic_accuracy": 59.0, + "prompted": false, + "size": 70.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Meta-Llama-3.1-70B-Instruct": { + "link": "Meta-Llama-3.1-70B-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 60.0 + }, + "realtask_accuracy": 56.11, + "syntactic_accuracy": 64.41, + "semantic_accuracy": 62.25, + "prompted": true, + "size": 70.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Meta-Llama-3.1-8B": { + "link": "Meta-Llama-3.1-8B", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 42.06 + }, + "realtask_accuracy": 31.58, + "syntactic_accuracy": 53.95, + "semantic_accuracy": 48.09, "prompted": false, - "size": 70, + "size": 8.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + 
}, + "Meta-Llama-3.1-8B-Instruct": { + "link": "Meta-Llama-3.1-8B-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 45.22 + }, + "realtask_accuracy": 35.7, + "syntactic_accuracy": 56.54, + "semantic_accuracy": 50.36, + "prompted": true, + "size": 8.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Mistral-7B-Instruct-v0.1": { + "link": "Mistral-7B-Instruct-v0.1", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 45.55 + }, + "realtask_accuracy": 41.49, + "syntactic_accuracy": 52.74, + "semantic_accuracy": 46.16, + "prompted": true, + "size": 6.7, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Mistral-7B-Instruct-v0.2": { + "link": "Mistral-7B-Instruct-v0.2", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 39.14 + }, + "realtask_accuracy": 26.01, + "syntactic_accuracy": 52.14, + "semantic_accuracy": 47.97, + "prompted": true, + "size": 7.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Mistral-7B-Instruct-v0.3": { - "link": "https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3", + "link": "Mistral-7B-Instruct-v0.3", "open-data": "None", "pass@1": { "instruct": null, "complete": 43.33 }, + "realtask_accuracy": 31.85, + "syntactic_accuracy": 54.42, + "semantic_accuracy": 51.25, "prompted": true, - "size": 7, + "size": 7.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Mixtral-8x7B-Instruct-v0.1": { - "link": "https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1", + "link": "Mixtral-8x7B-Instruct-v0.1", "open-data": "None", "pass@1": { "instruct": null, - "complete": 42.96 + "complete": 40.93 }, + "realtask_accuracy": 13.49, + "syntactic_accuracy": 61.17, + "semantic_accuracy": 54.89, "prompted": true, - "size": 7, + "size": 46.7, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Codestral-22B-v0.1": { - "link": "https://huggingface.co/mistralai/Codestral-22B-v0.1", + "link": "Codestral-22B-v0.1", "open-data": "None", "pass@1": { "instruct": null, "complete": 47.6 }, - "prompted": true, - "size": 22, + "realtask_accuracy": 37.86, + "syntactic_accuracy": 60.34, + "semantic_accuracy": 52.11, + "prompted": false, + "size": 22.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Phi-3-medium-128k-instruct": { - "link": "https://huggingface.co/microsoft/Phi-3-medium-128k-instruct", + "link": "Phi-3-medium-128k-instruct", "open-data": "None", "pass@1": { "instruct": null, "complete": 48.03 }, + "realtask_accuracy": 37.89, + "syntactic_accuracy": 58.54, + "semantic_accuracy": 54.56, + "prompted": true, + "size": 14.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Phi-3-medium-4k-instruct": { + "link": "Phi-3-medium-4k-instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 50.95 + }, + "realtask_accuracy": 43.17, + "syntactic_accuracy": 58.42, + "semantic_accuracy": 56.34, "prompted": true, - "size": 14, + "size": 14.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Phi-3-mini-128k-instruct": { - "link": "https://huggingface.co/microsoft/Phi-3-mini-128k-instruct", + "link": "Phi-3-mini-128k-instruct", "open-data": "None", "pass@1": { "instruct": null, "complete": 37.93 }, + "realtask_accuracy": 22.36, + "syntactic_accuracy": 53.01, + "semantic_accuracy": 48.65, "prompted": true, "size": 3.8, "direct_complete": false, "lazy": false, "elo_mle": 874 }, + "Phi-3-mini-4k-instruct": { + "link": "Phi-3-mini-4k-instruct", + "open-data": "None", + "pass@1": { + 
"instruct": null, + "complete": 39.99 + }, + "realtask_accuracy": 27.63, + "syntactic_accuracy": 54.73, + "semantic_accuracy": 46.65, + "prompted": true, + "size": 3.8, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Phi-3-small-8k-instruct": { + "link": "Phi-3-small-8k-instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 43.69 + }, + "realtask_accuracy": 26.81, + "syntactic_accuracy": 57.6, + "semantic_accuracy": 56.92, + "prompted": true, + "size": 7.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Phind-CodeLlama-34B-v2": { + "link": "Phind-CodeLlama-34B-v2", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 39.96 + }, + "realtask_accuracy": 25.51, + "syntactic_accuracy": 57.57, + "semantic_accuracy": 47.47, + "prompted": false, + "size": 7.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Qwen2-0.5B-Instruct": { + "link": "Qwen2-0.5B-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 34.21 + }, + "realtask_accuracy": 29.55, + "syntactic_accuracy": 38.58, + "semantic_accuracy": 37.53, + "prompted": true, + "size": 0.5, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Qwen2-1.5B-Instruct": { + "link": "Qwen2-1.5B-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 34.03 + }, + "realtask_accuracy": 15.18, + "syntactic_accuracy": 51.54, + "semantic_accuracy": 47.5, + "prompted": true, + "size": 1.5, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, "Qwen2-57B-A14B-Instruct": { - "link": "https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct", + "link": "Qwen2-57B-A14B-Instruct", "open-data": "None", "pass@1": { "instruct": null, "complete": 46.34 }, + "realtask_accuracy": 30.48, + "syntactic_accuracy": 61.34, + "semantic_accuracy": 57.48, + "prompted": true, + "size": 57.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Qwen2-7B": { + "link": "Qwen2-7B", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 53.28 + }, + "realtask_accuracy": 49.3, + "syntactic_accuracy": 58.31, + "semantic_accuracy": 55.23, + "prompted": false, + "size": 7.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Qwen2-7B-Instruct": { + "link": "Qwen2-7B-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 51.3 + }, + "realtask_accuracy": 42.66, + "syntactic_accuracy": 59.9, + "semantic_accuracy": 57.08, "prompted": true, - "size": 57, + "size": 7.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "CodeQwen1.5-7B": { + "link": "CodeQwen1.5-7B", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 42.56 + }, + "realtask_accuracy": 36.76, + "syntactic_accuracy": 52.51, + "semantic_accuracy": 43.65, + "prompted": false, + "size": 7.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "CodeQwen1.5-7B-Chat": { - "link": "https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat", + "link": "CodeQwen1.5-7B-Chat", "open-data": "None", "pass@1": { "instruct": null, "complete": 49.82 }, - "prompted": true, - "size": 7, + "realtask_accuracy": 56.37, + "syntactic_accuracy": 49.66, + "semantic_accuracy": 41.18, + "prompted": false, + "size": 7.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Yi-1.5-34B-Chat": { - "link": "https://huggingface.co/01-ai/Yi-1.5-34B-Chat", + "link": "Yi-1.5-34B-Chat", "open-data": "None", "pass@1": { "instruct": null, "complete": 49.39 }, - 
"prompted": true, - "size": 34, + "realtask_accuracy": 40.27, + "syntactic_accuracy": 58.31, + "semantic_accuracy": 55.59, + "prompted": false, + "size": 34.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Yi-1.5-6B-Chat": { + "link": "Yi-1.5-6B-Chat", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 44.13 + }, + "realtask_accuracy": 33.57, + "syntactic_accuracy": 55.1, + "semantic_accuracy": 50.91, + "prompted": false, + "size": 6.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Yi-1.5-9B-Chat": { - "link": "https://huggingface.co/01-ai/Yi-1.5-9B-Chat", + "link": "Yi-1.5-9B-Chat", "open-data": "None", "pass@1": { "instruct": null, "complete": 47.23 }, - "prompted": true, - "size": 9, + "realtask_accuracy": 37.14, + "syntactic_accuracy": 55.64, + "semantic_accuracy": 55.06, + "prompted": false, + "size": 9.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "DeepSeek-coder-7b-instruct-v1.5": { - "link": "https://huggingface.co/deepseek-ai/deepseek-coder-7b-instruct-v1.5", + "deepseek-coder-33b-base": { + "link": "deepseek-coder-33b-base", "open-data": "None", "pass@1": { "instruct": null, - "complete": 41.21 + "complete": 6.69 }, - "prompted": true, - "size": 7, + "realtask_accuracy": 11.05, + "syntactic_accuracy": 0.0, + "semantic_accuracy": 5.33, + "prompted": false, + "size": 33.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "DeepSeek-coder-33b-instruct": { - "link": "https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct", + "deepseek-coder-33b-instruct": { + "link": "deepseek-coder-33b-instruct", "open-data": "None", "pass@1": { "instruct": null, "complete": 36.6 }, + "realtask_accuracy": 21.46, + "syntactic_accuracy": 53.64, + "semantic_accuracy": 45.43, "prompted": true, - "size": 33, + "size": 33.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "DeepSeek-moe-16b-chat": { - "link": "https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat", + "deepseek-coder-6.7b-base": { + "link": "deepseek-coder-6.7b-base", "open-data": "None", "pass@1": { "instruct": null, - "complete": 31.01 + "complete": 27.06 + }, + "realtask_accuracy": 4.8, + "syntactic_accuracy": 49.45, + "semantic_accuracy": 41.81, + "prompted": false, + "size": 6.7, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "deepseek-coder-6.7b-instruct": { + "link": "deepseek-coder-6.7b-instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 29.4 }, + "realtask_accuracy": 8.54, + "syntactic_accuracy": 50.8, + "semantic_accuracy": 42.94, "prompted": true, - "size": 16.4, + "size": 6.7, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "DeepSeek-Coder-V2-Lite-Instruct": { - "link": "https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct", + "deepseek-coder-7b-base-v1.5": { + "link": "deepseek-coder-7b-base-v1.5", "open-data": "None", "pass@1": { "instruct": null, - "complete": 46.51 + "complete": 37.48 }, + "realtask_accuracy": 17.19, + "syntactic_accuracy": 58.79, + "semantic_accuracy": 50.35, + "prompted": false, + "size": 7.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "deepseek-coder-7b-instruct-v1.5": { + "link": "deepseek-coder-7b-instruct-v1.5", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 41.21 + }, + "realtask_accuracy": 28.46, + "syntactic_accuracy": 56.67, + "semantic_accuracy": 47.9, "prompted": true, - "size": 16, + "size": 7.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - 
"InternLM2-5-20b-chat": { - "link": "https://huggingface.co/internlm/internlm2_5-20b-chat", + "deepseek-moe-16b-base": { + "link": "deepseek-moe-16b-base", "open-data": "None", "pass@1": { "instruct": null, - "complete": 44.89 + "complete": 29.31 + }, + "realtask_accuracy": 18.53, + "syntactic_accuracy": 39.98, + "semantic_accuracy": 36.56, + "prompted": false, + "size": 16.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "deepseek-moe-16b-chat": { + "link": "deepseek-moe-16b-chat", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 31.01 }, + "realtask_accuracy": 27.33, + "syntactic_accuracy": 31.74, + "semantic_accuracy": 35.43, "prompted": true, - "size": 20, + "size": 16.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "StarCoder2-15b-instruct-v0.1": { - "link": "https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1", + "DeepSeek-Coder-V2-Lite-Base": { + "link": "DeepSeek-Coder-V2-Lite-Base", "open-data": "None", "pass@1": { "instruct": null, - "complete": 47.94 + "complete": 40.88 + }, + "realtask_accuracy": 23.47, + "syntactic_accuracy": 59.44, + "semantic_accuracy": 51.71, + "prompted": false, + "size": 16.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "DeepSeek-Coder-V2-Lite-Instruct": { + "link": "DeepSeek-Coder-V2-Lite-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 46.51 }, + "realtask_accuracy": 33.62, + "syntactic_accuracy": 59.91, + "semantic_accuracy": 54.75, "prompted": true, - "size": 15, + "size": 16.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "Claude-3-sonnet@20240229": { - "link": "", + "internlm2_5-20b-chat": { + "link": "internlm2_5-20b-chat", "open-data": "None", "pass@1": { "instruct": null, - "complete": 53.97 + "complete": 44.89 }, + "realtask_accuracy": 30.43, + "syntactic_accuracy": 57.85, + "semantic_accuracy": 55.51, "prompted": true, - "size": null, + "size": 20.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "GPT-4o-2024-05-13": { - "link": "", + "internlm2_5-7b-chat": { + "link": "internlm2_5-7b-chat", "open-data": "None", "pass@1": { "instruct": null, - "complete": 67 + "complete": 42.64 }, + "realtask_accuracy": 27.43, + "syntactic_accuracy": 57.32, + "semantic_accuracy": 53.13, "prompted": true, - "size": null, + "size": 7.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "GPT-3.5-turbo-0613": { - "link": "", - "open-data": null, + "starcoder2-15b-instruct-v0.1": { + "link": "starcoder2-15b-instruct-v0.1", + "open-data": "None", "pass@1": { "instruct": null, - "complete": 51.7 + "complete": 47.94 }, + "realtask_accuracy": 42.78, + "syntactic_accuracy": 56.57, + "semantic_accuracy": 49.07, "prompted": true, - "size": null, + "size": 15.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "starcoder2-7b": { + "link": "starcoder2-7b", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 35.64 + }, + "realtask_accuracy": 27.42, + "syntactic_accuracy": 45.87, + "semantic_accuracy": 39.77, + "prompted": false, + "size": 7.0, "direct_complete": false, "lazy": false, "elo_mle": 874