diff --git a/codemmlu/index.html b/codemmlu/index.html index f0fa840..9f15ec4 100644 --- a/codemmlu/index.html +++ b/codemmlu/index.html @@ -26,7 +26,8 @@ CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs - + + @@ -172,7 +173,7 @@

CodeMMLU: A Multi-Task Benchmark for As - @@ -226,7 +227,7 @@

Abstract

- The ability of CodeLLMs to generate executable and functionally correct code at the repository-level scale remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at the repository-level scale. RepoExec focuses on three main aspects: executability, functional correctness through automated test case generation with high coverage rate, and carefully crafted cross-file contexts to accurately generate code. Our work explores a controlled scenario where developers specify necessary code dependencies, challenging the model to integrate these accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel in utilizing provided dependencies and demonstrating debugging capabilities. We also introduce a new instruction-tuned dataset that focuses on code dependencies and demonstrate that CodeLLMs fine-tuned on our dataset have a better capability to leverage these dependencies effectively. RepoExec aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios. The dataset and source code can be found at https://github.com/FSoft-AI4Code/RepoExec. + Recent advancements in Code Large Language Models (CodeLLMs) have predominantly focused on open-ended code generation tasks, often neglecting the critical aspect of code understanding and comprehension. To bridge this gap, we present CodeMMLU, a comprehensive multiple-choice question-answer benchmark designed to evaluate the depth of software and code understanding in LLMs. CodeMMLU includes over 10,000 questions sourced from diverse domains, encompassing tasks such as code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks, CodeMMLU assesses models' ability to reason about code rather than merely generate it, providing deeper insights into their grasp of complex software concepts and systems. Our extensive evaluation reveals that even state-of-the-art models face significant challenges with CodeMMLU, highlighting deficiencies in comprehension beyond code generation. By underscoring the crucial relationship between code understanding and effective generation, CodeMMLU serves as a vital resource for advancing AI-assisted software development, ultimately aiming to create more reliable and capable coding assistants.

@@ -241,23 +242,33 @@

Abstract

Overview

- RepoExec is a pioneering benchmark that places a strong emphasis on the executability and correctness of generated code. Unlike traditional benchmarks, RepoExec ensures that the code not only compiles but also performs as intended in real-world scenarios. This is achieved through an automated system that verifies installation and runtime requirements, and dynamically generates high-coverage test cases. + We introduce CodeMMLU, a novel benchmark designed to evaluate CodeLLMs' ability to understand and comprehend code through multiple-choice question answering (MCQA). This approach enables a deeper assessment of how CodeLLMs grasp coding concepts, moving beyond mere generation capabilities. Inspired by the MMLU dataset from natural language understanding, CodeMMLU offers a robust and easily evaluable methodology with the following key features (a brief scoring sketch follows the list):

- Key Features of RepoExec:

    -
  • Enhanced Executability: RepoExec goes beyond match-based evaluation to ensure that the generated code can be executed in real-world environments. This involves verifying that the code can be installed and run, addressing a critical aspect of real-world applicability.
  • -
  • Dynamic Test Case Generation: One of the standout features of RepoExec is its sophisticated mechanism for generating test cases. These test cases are designed to thoroughly assess the functionality of the generated code, ensuring that it performs the intended tasks correctly.
  • -
  • Dependency Usage Evaluation: RepoExec evaluates how effectively LLMs utilize code dependencies. This involves analyzing whether the models can accurately integrate and manage external libraries and dependencies, which is crucial for creating functional software at a repository level.
  • -
  • Dependency Invocation Rate (DIR): A novel metric introduced by RepoExec, the Dependency Invocation Rate measures how frequently and effectively generated code invokes dependencies. This metric provides deeper insights into the integration capabilities of LLMs, highlighting their potential for creating more complex and interconnected software systems.
  • +
  • Comprehensiveness: CodeMMLU comprises over 10,000 questions curated from diverse, high-quality sources, mitigating potential bias from limited evaluation data.
  • +
  • Diversity in task, domain, and language: The dataset covers a wide spectrum of software knowledge, including general QA, code generation, defect detection, and code repair across various domains and more than 10 programming languages.
  • +
+
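To make the MCQA setup above concrete, here is a minimal sketch of how a CodeMMLU-style item could be represented and scored. The field names (question, choices, answer) and the letter-extraction rule are illustrative assumptions, not the benchmark's actual schema or official scoring code.

```javascript
// Minimal sketch of MCQA-style scoring; field names and the answer-extraction
// rule are assumptions for illustration, not CodeMMLU's actual schema.
const exampleItem = {
  question: "What is printed by `console.log([1, 2, 3].length)`?",
  choices: ["A. 2", "B. 3", "C. undefined", "D. TypeError"],
  answer: "B",
};

// Pull the first standalone A-D letter out of a model's free-form reply.
function extractChoice(reply) {
  const match = reply.match(/\b([ABCD])\b/);
  return match ? match[1] : null;
}

// Accuracy (%) over an array of { item, reply } pairs.
function accuracy(results) {
  const correct = results.filter(
    ({ item, reply }) => extractChoice(reply) === item.answer
  ).length;
  return (100 * correct) / results.length;
}

// Example: accuracy([{ item: exampleItem, reply: "The answer is B." }]) === 100
```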

+

+ CodeMMLU enables us to assess LLMs' capabilities in coding and software tasks from a novel perspective, extending beyond traditional code generation and completion. Our analysis reveals several notable findings: (1) previously unexplored bias issues in CodeLLMs, aligning with those observed in natural language MCQA tasks; (2) GPT-4 consistently achieving the highest average performance among closed-source models, while (3) the Meta-Llama family demonstrating the greatest accuracy among open-source models; (4) scaling laws related to model size were partially observed within the same model family but not across different families, suggesting the significant influence of pretraining datasets, methodologies, and model architectures; (5) advanced prompting techniques, such as Chain-of-Thought (CoT), consistently degraded performance, raising concerns about CodeLLMs' reasoning abilities on complex, step-by-step tasks; and (6) benchmarks like HumanEval, when converted from open-ended code generation to MCQA format, show that LLMs perform worse on MCQA, casting doubt on their real capability to understand and comprehend code. These findings highlight the current shortcomings of CodeLLMs and the intricate relationship between model architecture, training data quality, and evaluation methods in determining performance on software-related tasks. +
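As a purely hypothetical illustration of finding (6), the sketch below recasts an open-ended generation problem as a multiple-choice item by pairing a reference solution with distractor implementations and shuffling the option order (which also matters for the answer-position bias noted in finding (1)). The distractor strategy shown is an assumption, not the procedure used to build CodeMMLU.

```javascript
// Hypothetical sketch of turning a generation-style task into an MCQA item.
// The distractor strategy and shuffling are assumptions, not CodeMMLU's pipeline.
function toMcqaItem(taskDescription, referenceSolution, distractors) {
  const letters = ["A", "B", "C", "D"];
  // Shuffle so the correct option does not always sit in the same position.
  const options = [referenceSolution, ...distractors]
    .map((code) => ({ code, key: Math.random() }))
    .sort((a, b) => a.key - b.key)
    .map(({ code }) => code);

  return {
    question: `${taskDescription}\nWhich implementation is correct?`,
    choices: options.map((code, i) => `${letters[i]}. ${code}`),
    answer: letters[options.indexOf(referenceSolution)],
  };
}

// Usage sketch:
// toMcqaItem("Return the length of a string s.",
//            "s => s.length",
//            ["s => s.length - 1", "s => s.size", "s => s.count()"]);
```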

+

+

+

+ Our key contributions are: +

    +
  • We present the first MCQA benchmark for software and coding-related knowledge, addressing the need for diverse evaluation scenarios in the code domain. CodeMMLU enables the evaluation of LLMs' alignment with human inference in the software knowledge domain, similar to advancements in the NLP field.
  • +
  • CodeMMLU provides a thorough assessment of LLM capabilities, ensuring a substantial number of samples and diversity across tasks, domains, and languages. This enables a more nuanced understanding of an LLM's strengths and weaknesses, facilitating the development of models better aligned with the complexities and demands of the software domain.
  • +
  • Our experiments offer critical insights into LLM performance, highlighting the impact of factors such as model size, model family, and prompting techniques. This provides essential information to the community on effectively utilizing LLMs for specific tasks and domains in software engineering.

- -
Figure 1: Data Processing Pipeline of RepoExec
+ +
Overview of the CodeMMLU data creation pipeline. The blue area describes the process of collecting raw multiple-choice questions (MCQs) from open-source internet sources for the knowledge test set, while the pipeline for real-world problems is shown in the orange area.
@@ -274,16 +285,247 @@

Overview

Evaluation Results

- The experiments conducted using RepoExec have provided several valuable insights into the capabilities of LLMs in code generation: -

    -
  • Correctness: Pretrained LLMs have shown a high level of correctness in the code they generate. This means that the code produced by these models is syntactically accurate and adheres to the basic structure expected by programming languages.
  • -
  • Dependency Management and Debugging: Instruction-tuned models, on the other hand, excel in managing dependencies and debugging. These models have demonstrated a better ability to handle the complexities of integrating external libraries and resolving issues that arise during the execution of the code.
  • -
+ CodeMMLU revealed significant performance differences across models, as shown in the table below. OpenAI's GPT-4o outperformed all models on CodeMMLU, demonstrating its strength across diverse tasks. Notably, despite not being the latest model, the instruction-tuned version of Meta-Llama-3-70B achieved the highest score among the open-source models from the eight families evaluated. While LLMs perform well on knowledge-based tasks, they struggle with real-world problems, particularly defect detection. A small ranking sketch over these summary scores follows the table.

-
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Model family | Model name | Size (B) | Syntactic knowledge | Semantic knowledge | Real-world tasks | CodeMMLU
Closed-source models
Anthropic | Claude-3-sonnet@20240229 | - | 67.22 | 66.08 | 38.26 | 53.97
OpenAI | GPT-4o-2024-05-13 | - | 60.41 | 57.82 | 77.18 | 67.0
OpenAI | GPT-3.5-turbo-0613 | - | 61.68 | 53.64 | 45.26 | 51.7
Open-source models
Meta Llama | CodeLlama-34b-Instruct-hf | 34 | 56.81 | 46.93 | 23.55 | 38.73
Meta Llama | Meta-Llama-3-70B | 70 | 63.38 | 57.64 | 35.29 | 48.98
Meta Llama | Meta-Llama-3-70B-Instruct | 70 | 64.90 | 62.96 | 60.84 | 62.45
Meta Llama | Meta-Llama-3.1-70B | 70 | 64.09 | 59.00 | 8.22 | 37.56
Meta Llama | Meta-Llama-3.1-70B-Instruct | 70 | 64.42 | 62.25 | 56.11 | 60
Mistral | Mistral-7B-Instruct-v0.3 | 7 | 54.42 | 51.25 | 31.85 | 43.33
Mistral | Mixtral-8x7B-Instruct-v0.1 | 46.7 | 61.17 | 54.89 | 24.90 | 42.96
Mistral | Codestral-22B-v0.1 | 22 | 60.34 | 52.11 | 37.86 | 47.6
Phi | Phi-3-medium-128k-instruct | 14 | 58.54 | 54.56 | 37.89 | 48.03
Phi | Phi-3-mini-128k-instruct | 3.8 | 53.01 | 48.65 | 22.36 | 37.93
Qwen | Qwen2-57B-A14B-Instruct | 57 | 61.34 | 57.48 | 30.48 | 46.34
Qwen | CodeQwen1.5-7B-Chat | 7 | 49.66 | 46.58 | 56.37 | 49.82
Yi | Yi-1.5-34B-Chat | 34 | 58.32 | 55.59 | 40.27 | 49.39
Yi | Yi-1.5-9B-Chat | 9 | 55.64 | 55.06 | 37.15 | 47.23
DeepSeek | DeepSeek-coder-7b-instruct-v1.5 | 7 | 56.67 | 47.90 | 28.46 | 41.21
DeepSeek | DeepSeek-coder-33b-instruct | 33 | 53.65 | 46.11 | 21.47 | 36.6
DeepSeek | DeepSeek-moe-16b-chat | 16.4 | 31.74 | 35.43 | 27.33 | 31.01
DeepSeek | DeepSeek-Coder-V2-Lite-Instruct | 16 | 59.91 | 54.76 | 33.62 | 46.51
InternLM | InternLM2-5-20b-chat | 20 | 57.85 | 55.51 | 30.44 | 44.89
StarCoder2 | StarCoder2-15b-instruct-v0.1 | 15 | 56.58 | 49.07 | 42.79 | 47.94
+
Summary of LLM family performance on CodeMMLU: evaluation results (accuracy %) of different language models across CodeMMLU tasks.
+
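For readers who want to slice the summary above programmatically, here is a small usage sketch assuming a results object shaped like the leaderboard's results.json in this change, where pass@1.complete holds the overall CodeMMLU accuracy and the *_accuracy fields hold the per-category scores. It only ranks models by the reported score; it does not recompute the aggregate from the per-task numbers.

```javascript
// Sketch only: assumes entries shaped like leaderboards/codemmlu/results.json,
// e.g. { "Meta-Llama-3-70B-Instruct": { "pass@1": { complete: 62.45 },
//        syntactic_accuracy: 64.9, semantic_accuracy: 62.96, realtask_accuracy: 60.84 } }.
function rankByCodeMMLU(results) {
  return Object.entries(results)
    .map(([name, entry]) => ({ name, codemmlu: entry["pass@1"].complete }))
    .filter(({ codemmlu }) => codemmlu !== null)
    .sort((a, b) => b.codemmlu - a.codemmlu);
}

// Example: rankByCodeMMLU(resultsJson)[0].name would be "gpt-4o-2024-05-13"
// for the data shipped in this change (67.0 CodeMMLU).
```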
+ +
+

+ For more benchmark details, please check 👉 HERE 👈 +

+
+
+ +
CodeMMLU accuracy by task across LLMs. While knowledge tasks follow the scaling law, real-world tasks pose greater challenges to LLMs, highlighting the influence of instruction tuning and data quality when evaluating on CodeMMLU.
@@ -292,7 +534,7 @@

Evaluation Results

-
+ diff --git a/codemmlu/static/images/codemmlu-logo.png b/codemmlu/static/images/codemmlu-logo.png new file mode 100644 index 0000000..235d66f Binary files /dev/null and b/codemmlu/static/images/codemmlu-logo.png differ diff --git a/codemmlu/static/images/data-creation-flow.png b/codemmlu/static/images/data-creation-flow.png new file mode 100644 index 0000000..13125d6 Binary files /dev/null and b/codemmlu/static/images/data-creation-flow.png differ diff --git a/codemmlu/static/images/task_detail.png b/codemmlu/static/images/task_detail.png new file mode 100644 index 0000000..8d3fbdf Binary files /dev/null and b/codemmlu/static/images/task_detail.png differ diff --git a/leaderboards/codemmlu/_results.json b/leaderboards/codemmlu/_results.json new file mode 100644 index 0000000..12ae803 --- /dev/null +++ b/leaderboards/codemmlu/_results.json @@ -0,0 +1,301 @@ +{ + "CodeLlama-34B-Instruct": { + "link": "https://huggingface.co/codellama/CodeLlama-34b-hf", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 38.73 + }, + "prompted": true, + "size": 34, + "direct_complete": false, + "lazy": false, + "elo_mle": 942 + }, + "Meta-Llama-3-70B": { + "link": "https://huggingface.co/meta-llama/Meta-Llama-3-70B", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 48.98 + }, + "prompted": false, + "size": 70, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Meta-Llama-3-70B-Instruct": { + "link": "https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 62.45 + }, + "prompted": true, + "size": 70, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Meta-Llama-3.1-70B-Instruct": { + "link": "https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 60 + }, + "prompted": true, + "size": 70, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Meta-Llama-3.1-70B": { + "link": "https://huggingface.co/meta-llama/Llama-3.1-70B", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 37.56 + }, + "prompted": false, + "size": 70, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Mistral-7B-Instruct-v0.3": { + "link": "https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 43.33 + }, + "prompted": true, + "size": 7, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Mixtral-8x7B-Instruct-v0.1": { + "link": "https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 42.96 + }, + "prompted": true, + "size": 7, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Codestral-22B-v0.1": { + "link": "https://huggingface.co/mistralai/Codestral-22B-v0.1", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 47.6 + }, + "prompted": true, + "size": 22, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Phi-3-medium-128k-instruct": { + "link": "https://huggingface.co/microsoft/Phi-3-medium-128k-instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 48.03 + }, + "prompted": true, + "size": 14, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Phi-3-mini-128k-instruct": { + "link": "https://huggingface.co/microsoft/Phi-3-mini-128k-instruct", + "open-data": "None", + "pass@1": { + 
"instruct": null, + "complete": 37.93 + }, + "prompted": true, + "size": 3.8, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Qwen2-57B-A14B-Instruct": { + "link": "https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 46.34 + }, + "prompted": true, + "size": 57, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "CodeQwen1.5-7B-Chat": { + "link": "https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 49.82 + }, + "prompted": true, + "size": 7, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Yi-1.5-34B-Chat": { + "link": "https://huggingface.co/01-ai/Yi-1.5-34B-Chat", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 49.39 + }, + "prompted": true, + "size": 34, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Yi-1.5-9B-Chat": { + "link": "https://huggingface.co/01-ai/Yi-1.5-9B-Chat", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 47.23 + }, + "prompted": true, + "size": 9, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "DeepSeek-coder-7b-instruct-v1.5": { + "link": "https://huggingface.co/deepseek-ai/deepseek-coder-7b-instruct-v1.5", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 41.21 + }, + "prompted": true, + "size": 7, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "DeepSeek-coder-33b-instruct": { + "link": "https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 36.6 + }, + "prompted": true, + "size": 33, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "DeepSeek-moe-16b-chat": { + "link": "https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 31.01 + }, + "prompted": true, + "size": 16.4, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "DeepSeek-Coder-V2-Lite-Instruct": { + "link": "https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 46.51 + }, + "prompted": true, + "size": 16, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "InternLM2-5-20b-chat": { + "link": "https://huggingface.co/internlm/internlm2_5-20b-chat", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 44.89 + }, + "prompted": true, + "size": 20, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "StarCoder2-15b-instruct-v0.1": { + "link": "https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 47.94 + }, + "prompted": true, + "size": 15, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Claude-3-sonnet@20240229": { + "link": "", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 53.97 + }, + "prompted": true, + "size": null, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "GPT-4o-2024-05-13": { + "link": "", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 67 + }, + "prompted": true, + "size": null, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "GPT-3.5-turbo-0613": { + "link": "", + "open-data": null, + "pass@1": { + "instruct": null, + "complete": 51.7 + }, + "prompted": 
true, + "size": null, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + } +} \ No newline at end of file diff --git a/leaderboards/codemmlu/images/codemmlu-logo.png b/leaderboards/codemmlu/images/codemmlu-logo.png new file mode 100644 index 0000000..235d66f Binary files /dev/null and b/leaderboards/codemmlu/images/codemmlu-logo.png differ diff --git a/leaderboards/codemmlu/index.html b/leaderboards/codemmlu/index.html index 64513d3..9fc39c4 100644 --- a/leaderboards/codemmlu/index.html +++ b/leaderboards/codemmlu/index.html @@ -19,7 +19,7 @@ section */ + .btn-group-lg > .btn, .btn-lg { + padding: 0.5rem 1rem; + font-size: 1.25rem; + border-radius: 0.3rem; + } + + .btn-outline-hard { + color: #ff6b6b; + border: 2px solid #ff6b6b; + background-color: transparent; + } + + .btn-outline-hard:hover, + .btn-check:checked + .btn-outline-hard, + .btn-outline-hard:active { + color: #fff; + background-color: #ff6b6b !important; + border-color: #ff6b6b; + } + + .btn-outline-full { + color: #4ecdc4; + border: 2px solid #4ecdc4; + background-color: transparent; + } + + .btn-outline-full:hover, + .btn-check:checked + .btn-outline-full, + .btn-outline-full:active { + color: #fff; + background-color: #4ecdc4 !important; + border-color: #4ecdc4; + } @@ -467,7 +501,7 @@

🙏 Acknowledgements

], }; - const theaders = ["Model", "Accuracy"]; + const theaders = ["Model", "Syntactic Accuracy", "Semantic Accuracy", "Real-task Accuracy", "CodeMMLU"]; // score: 'complete', 'instruct' const displayTable = (table, score) => { @@ -490,7 +524,7 @@

🙏 Acknowledgements

theaders.forEach(function (header) { var th = document.createElement("th"); th.textContent = header; - if (header == "Pass@1") { + if (header == "CodeMMLU") { th.style.backgroundColor = "#EEFFEE"; } headerRow.appendChild(th); @@ -554,7 +588,24 @@

🙏 Acknowledgements

// promptedSymbol.textContent = "๐Ÿ’™"; // modelCell.appendChild(promptedSymbol); // } + + // Add Syntactic Accuracy column + + dataRow.appendChild(modelCell); + + var syntacticCell = document.createElement("td"); + syntacticCell.textContent = row["syntactic_accuracy"] || "-"; + dataRow.appendChild(syntacticCell); + + var semanticCell = document.createElement("td"); + semanticCell.textContent = row["semantic_accuracy"] || "-"; + dataRow.appendChild(semanticCell); + + var rtaskCell = document.createElement("td"); + rtaskCell.textContent = row["realtask_accuracy"] || "-"; + dataRow.appendChild(rtaskCell); + var passCell = document.createElement("td"); passCell.classList.add("text-nowrap"); if (lazy) { diff --git a/leaderboards/codemmlu/results.json b/leaderboards/codemmlu/results.json index 12ae803..6e06758 100644 --- a/leaderboards/codemmlu/results.json +++ b/leaderboards/codemmlu/results.json @@ -1,299 +1,848 @@ { - "CodeLlama-34B-Instruct": { - "link": "https://huggingface.co/codellama/CodeLlama-34b-hf", + "claude-3-sonnet@20240229": { + "link": "claude-3-sonnet@20240229", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 53.97 + }, + "realtask_accuracy": 38.26, + "syntactic_accuracy": 67.22, + "semantic_accuracy": 66.08, + "prompted": false, + "size": null, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "gpt-4o-2024-05-13": { + "link": "gpt-4o-2024-05-13", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 67.0 + }, + "realtask_accuracy": 77.18, + "syntactic_accuracy": 60.41, + "semantic_accuracy": 57.81, + "prompted": false, + "size": null, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "gpt-3.5-turbo-0613": { + "link": "gpt-3.5-turbo-0613", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 51.7 + }, + "realtask_accuracy": 45.26, + "syntactic_accuracy": 61.68, + "semantic_accuracy": 53.65, + "prompted": false, + "size": null, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "CodeLlama-7b-Instruct-hf": { + "link": "CodeLlama-7b-Instruct-hf", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 27.01 + }, + "realtask_accuracy": 4.78, + "syntactic_accuracy": 50.14, + "semantic_accuracy": 41.22, + "prompted": true, + "size": 7.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "CodeLlama-7b-Python-hf": { + "link": "CodeLlama-7b-Python-hf", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 29.49 + }, + "realtask_accuracy": 19.36, + "syntactic_accuracy": 38.7, + "semantic_accuracy": 36.87, + "prompted": false, + "size": 7.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "CodeLlama-13b-Instruct-hf": { + "link": "CodeLlama-13b-Instruct-hf", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 30.25 + }, + "realtask_accuracy": 10.53, + "syntactic_accuracy": 50.58, + "semantic_accuracy": 43.0, + "prompted": true, + "size": 13.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "CodeLlama-13b-Python-hf": { + "link": "CodeLlama-13b-Python-hf", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 29.82 + }, + "realtask_accuracy": 56.98, + "syntactic_accuracy": 12.89, + "semantic_accuracy": 4.88, + "prompted": false, + "size": 13.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "CodeLlama-13b-hf": { + "link": "CodeLlama-13b-hf", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 28.51 + }, + 
"realtask_accuracy": 6.65, + "syntactic_accuracy": 50.58, + "semantic_accuracy": 42.95, + "prompted": false, + "size": 13.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "CodeLlama-34b-Instruct-hf": { + "link": "CodeLlama-34b-Instruct-hf", "open-data": "None", "pass@1": { "instruct": null, "complete": 38.73 }, + "realtask_accuracy": 23.55, + "syntactic_accuracy": 56.8, + "semantic_accuracy": 46.93, "prompted": true, - "size": 34, + "size": 34.0, "direct_complete": false, "lazy": false, - "elo_mle": 942 + "elo_mle": 874 + }, + "CodeLlama-34b-Python-hf": { + "link": "CodeLlama-34b-Python-hf", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 9.4 + }, + "realtask_accuracy": 9.37, + "syntactic_accuracy": 15.57, + "semantic_accuracy": 5.34, + "prompted": false, + "size": 34.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 }, "Meta-Llama-3-70B": { - "link": "https://huggingface.co/meta-llama/Meta-Llama-3-70B", + "link": "Meta-Llama-3-70B", "open-data": "None", "pass@1": { "instruct": null, "complete": 48.98 }, + "realtask_accuracy": 35.29, + "syntactic_accuracy": 63.38, + "semantic_accuracy": 57.64, "prompted": false, - "size": 70, + "size": 70.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Meta-Llama-3-70B-Instruct": { - "link": "https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct", + "link": "Meta-Llama-3-70B-Instruct", "open-data": "None", "pass@1": { "instruct": null, "complete": 62.45 }, + "realtask_accuracy": 60.84, + "syntactic_accuracy": 64.9, + "semantic_accuracy": 62.96, "prompted": true, - "size": 70, + "size": 70.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "Meta-Llama-3.1-70B-Instruct": { - "link": "https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct", + "Meta-Llama-3-8B": { + "link": "Meta-Llama-3-8B", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 51.89 + }, + "realtask_accuracy": 53.84, + "syntactic_accuracy": 54.14, + "semantic_accuracy": 47.8, + "prompted": false, + "size": 8.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Meta-Llama-3-8B-Instruct": { + "link": "Meta-Llama-3-8B-Instruct", "open-data": "None", "pass@1": { "instruct": null, - "complete": 60 + "complete": 46.04 }, + "realtask_accuracy": 38.38, + "syntactic_accuracy": 58.1, + "semantic_accuracy": 48.21, "prompted": true, - "size": 70, + "size": 8.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Meta-Llama-3.1-70B": { - "link": "https://huggingface.co/meta-llama/Llama-3.1-70B", + "link": "Meta-Llama-3.1-70B", "open-data": "None", "pass@1": { "instruct": null, "complete": 37.56 }, + "realtask_accuracy": 8.22, + "syntactic_accuracy": 64.09, + "semantic_accuracy": 59.0, + "prompted": false, + "size": 70.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Meta-Llama-3.1-70B-Instruct": { + "link": "Meta-Llama-3.1-70B-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 60.0 + }, + "realtask_accuracy": 56.11, + "syntactic_accuracy": 64.41, + "semantic_accuracy": 62.25, + "prompted": true, + "size": 70.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Meta-Llama-3.1-8B": { + "link": "Meta-Llama-3.1-8B", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 42.06 + }, + "realtask_accuracy": 31.58, + "syntactic_accuracy": 53.95, + "semantic_accuracy": 48.09, "prompted": false, - "size": 70, + "size": 8.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + 
}, + "Meta-Llama-3.1-8B-Instruct": { + "link": "Meta-Llama-3.1-8B-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 45.22 + }, + "realtask_accuracy": 35.7, + "syntactic_accuracy": 56.54, + "semantic_accuracy": 50.36, + "prompted": true, + "size": 8.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Mistral-7B-Instruct-v0.1": { + "link": "Mistral-7B-Instruct-v0.1", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 45.55 + }, + "realtask_accuracy": 41.49, + "syntactic_accuracy": 52.74, + "semantic_accuracy": 46.16, + "prompted": true, + "size": 6.7, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Mistral-7B-Instruct-v0.2": { + "link": "Mistral-7B-Instruct-v0.2", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 39.14 + }, + "realtask_accuracy": 26.01, + "syntactic_accuracy": 52.14, + "semantic_accuracy": 47.97, + "prompted": true, + "size": 7.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Mistral-7B-Instruct-v0.3": { - "link": "https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3", + "link": "Mistral-7B-Instruct-v0.3", "open-data": "None", "pass@1": { "instruct": null, "complete": 43.33 }, + "realtask_accuracy": 31.85, + "syntactic_accuracy": 54.42, + "semantic_accuracy": 51.25, "prompted": true, - "size": 7, + "size": 7.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Mixtral-8x7B-Instruct-v0.1": { - "link": "https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1", + "link": "Mixtral-8x7B-Instruct-v0.1", "open-data": "None", "pass@1": { "instruct": null, - "complete": 42.96 + "complete": 40.93 }, + "realtask_accuracy": 13.49, + "syntactic_accuracy": 61.17, + "semantic_accuracy": 54.89, "prompted": true, - "size": 7, + "size": 46.7, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Codestral-22B-v0.1": { - "link": "https://huggingface.co/mistralai/Codestral-22B-v0.1", + "link": "Codestral-22B-v0.1", "open-data": "None", "pass@1": { "instruct": null, "complete": 47.6 }, - "prompted": true, - "size": 22, + "realtask_accuracy": 37.86, + "syntactic_accuracy": 60.34, + "semantic_accuracy": 52.11, + "prompted": false, + "size": 22.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Phi-3-medium-128k-instruct": { - "link": "https://huggingface.co/microsoft/Phi-3-medium-128k-instruct", + "link": "Phi-3-medium-128k-instruct", "open-data": "None", "pass@1": { "instruct": null, "complete": 48.03 }, + "realtask_accuracy": 37.89, + "syntactic_accuracy": 58.54, + "semantic_accuracy": 54.56, + "prompted": true, + "size": 14.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Phi-3-medium-4k-instruct": { + "link": "Phi-3-medium-4k-instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 50.95 + }, + "realtask_accuracy": 43.17, + "syntactic_accuracy": 58.42, + "semantic_accuracy": 56.34, "prompted": true, - "size": 14, + "size": 14.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Phi-3-mini-128k-instruct": { - "link": "https://huggingface.co/microsoft/Phi-3-mini-128k-instruct", + "link": "Phi-3-mini-128k-instruct", "open-data": "None", "pass@1": { "instruct": null, "complete": 37.93 }, + "realtask_accuracy": 22.36, + "syntactic_accuracy": 53.01, + "semantic_accuracy": 48.65, "prompted": true, "size": 3.8, "direct_complete": false, "lazy": false, "elo_mle": 874 }, + "Phi-3-mini-4k-instruct": { + "link": "Phi-3-mini-4k-instruct", + "open-data": "None", + "pass@1": { + 
"instruct": null, + "complete": 39.99 + }, + "realtask_accuracy": 27.63, + "syntactic_accuracy": 54.73, + "semantic_accuracy": 46.65, + "prompted": true, + "size": 3.8, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Phi-3-small-8k-instruct": { + "link": "Phi-3-small-8k-instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 43.69 + }, + "realtask_accuracy": 26.81, + "syntactic_accuracy": 57.6, + "semantic_accuracy": 56.92, + "prompted": true, + "size": 7.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Phind-CodeLlama-34B-v2": { + "link": "Phind-CodeLlama-34B-v2", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 39.96 + }, + "realtask_accuracy": 25.51, + "syntactic_accuracy": 57.57, + "semantic_accuracy": 47.47, + "prompted": false, + "size": 7.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Qwen2-0.5B-Instruct": { + "link": "Qwen2-0.5B-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 34.21 + }, + "realtask_accuracy": 29.55, + "syntactic_accuracy": 38.58, + "semantic_accuracy": 37.53, + "prompted": true, + "size": 0.5, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Qwen2-1.5B-Instruct": { + "link": "Qwen2-1.5B-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 34.03 + }, + "realtask_accuracy": 15.18, + "syntactic_accuracy": 51.54, + "semantic_accuracy": 47.5, + "prompted": true, + "size": 1.5, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, "Qwen2-57B-A14B-Instruct": { - "link": "https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct", + "link": "Qwen2-57B-A14B-Instruct", "open-data": "None", "pass@1": { "instruct": null, "complete": 46.34 }, + "realtask_accuracy": 30.48, + "syntactic_accuracy": 61.34, + "semantic_accuracy": 57.48, + "prompted": true, + "size": 57.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Qwen2-7B": { + "link": "Qwen2-7B", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 53.28 + }, + "realtask_accuracy": 49.3, + "syntactic_accuracy": 58.31, + "semantic_accuracy": 55.23, + "prompted": false, + "size": 7.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Qwen2-7B-Instruct": { + "link": "Qwen2-7B-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 51.3 + }, + "realtask_accuracy": 42.66, + "syntactic_accuracy": 59.9, + "semantic_accuracy": 57.08, "prompted": true, - "size": 57, + "size": 7.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "CodeQwen1.5-7B": { + "link": "CodeQwen1.5-7B", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 42.56 + }, + "realtask_accuracy": 36.76, + "syntactic_accuracy": 52.51, + "semantic_accuracy": 43.65, + "prompted": false, + "size": 7.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "CodeQwen1.5-7B-Chat": { - "link": "https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat", + "link": "CodeQwen1.5-7B-Chat", "open-data": "None", "pass@1": { "instruct": null, "complete": 49.82 }, - "prompted": true, - "size": 7, + "realtask_accuracy": 56.37, + "syntactic_accuracy": 49.66, + "semantic_accuracy": 41.18, + "prompted": false, + "size": 7.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Yi-1.5-34B-Chat": { - "link": "https://huggingface.co/01-ai/Yi-1.5-34B-Chat", + "link": "Yi-1.5-34B-Chat", "open-data": "None", "pass@1": { "instruct": null, "complete": 49.39 }, - 
"prompted": true, - "size": 34, + "realtask_accuracy": 40.27, + "syntactic_accuracy": 58.31, + "semantic_accuracy": 55.59, + "prompted": false, + "size": 34.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "Yi-1.5-6B-Chat": { + "link": "Yi-1.5-6B-Chat", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 44.13 + }, + "realtask_accuracy": 33.57, + "syntactic_accuracy": 55.1, + "semantic_accuracy": 50.91, + "prompted": false, + "size": 6.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, "Yi-1.5-9B-Chat": { - "link": "https://huggingface.co/01-ai/Yi-1.5-9B-Chat", + "link": "Yi-1.5-9B-Chat", "open-data": "None", "pass@1": { "instruct": null, "complete": 47.23 }, - "prompted": true, - "size": 9, + "realtask_accuracy": 37.14, + "syntactic_accuracy": 55.64, + "semantic_accuracy": 55.06, + "prompted": false, + "size": 9.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "DeepSeek-coder-7b-instruct-v1.5": { - "link": "https://huggingface.co/deepseek-ai/deepseek-coder-7b-instruct-v1.5", + "deepseek-coder-33b-base": { + "link": "deepseek-coder-33b-base", "open-data": "None", "pass@1": { "instruct": null, - "complete": 41.21 + "complete": 6.69 }, - "prompted": true, - "size": 7, + "realtask_accuracy": 11.05, + "syntactic_accuracy": 0.0, + "semantic_accuracy": 5.33, + "prompted": false, + "size": 33.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "DeepSeek-coder-33b-instruct": { - "link": "https://huggingface.co/deepseek-ai/deepseek-coder-33b-instruct", + "deepseek-coder-33b-instruct": { + "link": "deepseek-coder-33b-instruct", "open-data": "None", "pass@1": { "instruct": null, "complete": 36.6 }, + "realtask_accuracy": 21.46, + "syntactic_accuracy": 53.64, + "semantic_accuracy": 45.43, "prompted": true, - "size": 33, + "size": 33.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "DeepSeek-moe-16b-chat": { - "link": "https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat", + "deepseek-coder-6.7b-base": { + "link": "deepseek-coder-6.7b-base", "open-data": "None", "pass@1": { "instruct": null, - "complete": 31.01 + "complete": 27.06 + }, + "realtask_accuracy": 4.8, + "syntactic_accuracy": 49.45, + "semantic_accuracy": 41.81, + "prompted": false, + "size": 6.7, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "deepseek-coder-6.7b-instruct": { + "link": "deepseek-coder-6.7b-instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 29.4 }, + "realtask_accuracy": 8.54, + "syntactic_accuracy": 50.8, + "semantic_accuracy": 42.94, "prompted": true, - "size": 16.4, + "size": 6.7, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "DeepSeek-Coder-V2-Lite-Instruct": { - "link": "https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct", + "deepseek-coder-7b-base-v1.5": { + "link": "deepseek-coder-7b-base-v1.5", "open-data": "None", "pass@1": { "instruct": null, - "complete": 46.51 + "complete": 37.48 }, + "realtask_accuracy": 17.19, + "syntactic_accuracy": 58.79, + "semantic_accuracy": 50.35, + "prompted": false, + "size": 7.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "deepseek-coder-7b-instruct-v1.5": { + "link": "deepseek-coder-7b-instruct-v1.5", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 41.21 + }, + "realtask_accuracy": 28.46, + "syntactic_accuracy": 56.67, + "semantic_accuracy": 47.9, "prompted": true, - "size": 16, + "size": 7.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - 
"InternLM2-5-20b-chat": { - "link": "https://huggingface.co/internlm/internlm2_5-20b-chat", + "deepseek-moe-16b-base": { + "link": "deepseek-moe-16b-base", "open-data": "None", "pass@1": { "instruct": null, - "complete": 44.89 + "complete": 29.31 + }, + "realtask_accuracy": 18.53, + "syntactic_accuracy": 39.98, + "semantic_accuracy": 36.56, + "prompted": false, + "size": 16.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "deepseek-moe-16b-chat": { + "link": "deepseek-moe-16b-chat", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 31.01 }, + "realtask_accuracy": 27.33, + "syntactic_accuracy": 31.74, + "semantic_accuracy": 35.43, "prompted": true, - "size": 20, + "size": 16.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "StarCoder2-15b-instruct-v0.1": { - "link": "https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1", + "DeepSeek-Coder-V2-Lite-Base": { + "link": "DeepSeek-Coder-V2-Lite-Base", "open-data": "None", "pass@1": { "instruct": null, - "complete": 47.94 + "complete": 40.88 + }, + "realtask_accuracy": 23.47, + "syntactic_accuracy": 59.44, + "semantic_accuracy": 51.71, + "prompted": false, + "size": 16.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "DeepSeek-Coder-V2-Lite-Instruct": { + "link": "DeepSeek-Coder-V2-Lite-Instruct", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 46.51 }, + "realtask_accuracy": 33.62, + "syntactic_accuracy": 59.91, + "semantic_accuracy": 54.75, "prompted": true, - "size": 15, + "size": 16.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "Claude-3-sonnet@20240229": { - "link": "", + "internlm2_5-20b-chat": { + "link": "internlm2_5-20b-chat", "open-data": "None", "pass@1": { "instruct": null, - "complete": 53.97 + "complete": 44.89 }, + "realtask_accuracy": 30.43, + "syntactic_accuracy": 57.85, + "semantic_accuracy": 55.51, "prompted": true, - "size": null, + "size": 20.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "GPT-4o-2024-05-13": { - "link": "", + "internlm2_5-7b-chat": { + "link": "internlm2_5-7b-chat", "open-data": "None", "pass@1": { "instruct": null, - "complete": 67 + "complete": 42.64 }, + "realtask_accuracy": 27.43, + "syntactic_accuracy": 57.32, + "semantic_accuracy": 53.13, "prompted": true, - "size": null, + "size": 7.0, "direct_complete": false, "lazy": false, "elo_mle": 874 }, - "GPT-3.5-turbo-0613": { - "link": "", - "open-data": null, + "starcoder2-15b-instruct-v0.1": { + "link": "starcoder2-15b-instruct-v0.1", + "open-data": "None", "pass@1": { "instruct": null, - "complete": 51.7 + "complete": 47.94 }, + "realtask_accuracy": 42.78, + "syntactic_accuracy": 56.57, + "semantic_accuracy": 49.07, "prompted": true, - "size": null, + "size": 15.0, + "direct_complete": false, + "lazy": false, + "elo_mle": 874 + }, + "starcoder2-7b": { + "link": "starcoder2-7b", + "open-data": "None", + "pass@1": { + "instruct": null, + "complete": 35.64 + }, + "realtask_accuracy": 27.42, + "syntactic_accuracy": 45.87, + "semantic_accuracy": 39.77, + "prompted": false, + "size": 7.0, "direct_complete": false, "lazy": false, "elo_mle": 874