Llama 3 Evaluation Details

This document contains additional context on the settings and parameters for how we evaluated the Llama 3 pre-trained and instruct-aligned models.

Auto-eval benchmark notes

MMLU

  • We report macro averages for the MMLU benchmarks. The micro-average numbers for MMLU are 65.4 and 67.4 for the 8B pre-trained and instruct-aligned models, and 78.9 and 82.0 for the 70B pre-trained and instruct-aligned models, respectively.
  • For the instruct-aligned MMLU evaluation, we ask the model to generate the character of the best choice.
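The difference between the macro and micro averages mentioned above can be illustrated with a minimal sketch. It assumes the macro average is taken over per-subject accuracies, which the notes do not state explicitly; the function name and input format are hypothetical.

```python
from collections import defaultdict


def macro_and_micro_accuracy(results):
    """Return (macro, micro) accuracy.

    `results` is an iterable of (subject, is_correct) pairs.
    Grouping by subject is an assumption made for illustration.
    """
    per_subject = defaultdict(list)
    for subject, is_correct in results:
        per_subject[subject].append(bool(is_correct))

    # Micro average: every question has the same weight.
    total = sum(len(v) for v in per_subject.values())
    micro = sum(sum(v) for v in per_subject.values()) / total

    # Macro average: every subject has the same weight,
    # regardless of how many questions it contains.
    macro = sum(sum(v) / len(v) for v in per_subject.values()) / len(per_subject)
    return macro, micro
```

With four questions in one subject (three correct) and one question in another (incorrect), the micro average is 3/5 while the macro average is (0.75 + 0) / 2, so the two can differ noticeably when subjects are unbalanced.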

AGI English

  • We use the default few-shot and prompt settings as specified here. The score is averaged over the English subtasks.

CommonSenseQA

Winogrande

  • We use a choice-based setup for evaluation: we fill in the blank with each of the two possible choices and then compute the log-likelihood of the suffix. We use 5 shots for evaluation.
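The choice-based setup described above could be sketched roughly as follows. The `suffix_logprob` callable is a hypothetical stand-in for a model call that returns the log-likelihood of `suffix` given `prefix`; the `_` blank marker is likewise an assumption.

```python
def split_on_blank(sentence, option, blank="_"):
    """Fill the blank with `option`; return (prefix, shared suffix)."""
    before, after = sentence.split(blank, 1)
    return before + option, after


def pick_option(sentence, options, suffix_logprob):
    """Return the index of the option whose filled-in prefix makes
    the shared suffix most likely.

    `suffix_logprob(prefix, suffix)` is a hypothetical scoring function.
    """
    scores = []
    for option in options:
        prefix, suffix = split_on_blank(sentence, option)
        scores.append(suffix_logprob(prefix, suffix))
    return max(range(len(options)), key=scores.__getitem__)
```

Because the suffix is identical for both candidates, the two scores cover the same tokens and can be compared directly without length normalization.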

BIG-Bench Hard

  • We use 3-shot chain-of-thought prompting and compute the average exact match over the subsets in this task.

ARC-Challenge

  • We use the ARC-Challenge subset of the ARC benchmark. We use 25 shots and the MMLU setup for evaluation, where we provide all the choices in the prompt and calculate the likelihood over the choice characters.
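A minimal sketch of scoring over choice characters, under stated assumptions: the prompt template below is illustrative only (the notes do not give the exact format), and `char_logprob` is a hypothetical callable that returns the model's log-likelihood of a continuation given the prompt.

```python
import string


def format_question(question, choices):
    """List every choice in the prompt under a letter label.

    This template is an assumption, not the exact prompt used.
    """
    lines = [question]
    for label, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)


def predict_label(question, choices, char_logprob):
    """Return the letter whose continuation is most likely.

    `char_logprob(prompt, continuation)` is a hypothetical scorer.
    """
    prompt = format_question(question, choices)
    labels = string.ascii_uppercase[: len(choices)]
    return max(labels, key=lambda label: char_logprob(prompt, " " + label))
```

In a few-shot setting, the solved demonstrations would be formatted the same way and prepended to the prompt before scoring.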

TriviaQA-WIKI

  • We evaluate on the Wiki validation set and use 5 few-shot examples.

SQuAD

  • We use SQuAD v2 and compute exact match in a 1-shot setting.
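For reference, exact match is commonly computed after light answer normalization. The sketch below follows the convention of the public SQuAD evaluation script (lowercasing, then stripping punctuation, articles, and extra whitespace); whether the same normalization was applied here is an assumption.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """Return 1.0 if the prediction matches any gold answer, else 0.0.

    An empty gold list stands for an unanswerable question (SQuAD v2),
    in which case only an empty prediction counts as correct.
    """
    if not gold_answers:
        gold_answers = [""]
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(g) for g in gold_answers))
```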

QuAC

  • Same setting as Llama 2 (1-shot, F1).
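As a reminder of what a token-level F1 score measures, here is a minimal sketch. It tokenizes on whitespace after lowercasing, which is a simplifying assumption; the exact tokenization and normalization used for QuAC are not specified in these notes.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```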

BoolQ

  • Same setting as Llama 1 and Llama 2 (0-shot, accuracy).

DROP

  • For each validation example, we draw 3 random few-shot examples from the train split.
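The per-example sampling described above might look like the following sketch. Sampling without replacement and the fixed seed are assumptions added for illustration; the function and parameter names are hypothetical.

```python
import random


def build_few_shot_sets(validation_examples, train_examples, k=3, seed=0):
    """Yield (shots, example) pairs, drawing a fresh random set of
    k training examples for every validation example.

    The seed and sampling without replacement are assumptions.
    """
    rng = random.Random(seed)
    for example in validation_examples:
        shots = rng.sample(train_examples, k)
        yield shots, example
```

Drawing new demonstrations for each validation example, rather than fixing one set for the whole run, averages out the effect of any single unusually helpful or unhelpful demonstration.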

GPQA

  • We report 0-shot exact match scores over the possible options using the Main subset for our models and other open-source models (Mistral, Gemma).

HumanEval

  • Same setting as Llama 1 and Llama 2 (pass@1).
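For readers unfamiliar with pass@1, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per problem and c the number that pass the unit tests. The number of samples per problem is not given in these notes, so the inputs here are purely illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn from n
    generated samples of which c are correct, passes the tests."""
    if n - c < k:
        # Fewer than k incorrect samples: any draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_problem_counts, k=1):
    """Average pass@k over problems given (n, c) pairs."""
    scores = [pass_at_k(n, c, k) for n, c in per_problem_counts]
    return sum(scores) / len(scores)
```

For k = 1 the estimator reduces to c / n, so with a single sample per problem pass@1 is simply the fraction of problems whose generated solution passes.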

GSM8K

MATH

Human evaluation notes

This evaluation set contains 1,800 prompts that cover 12 key use cases: asking for advice, brainstorming, classification, closed question answering, coding, creative writing, extraction, inhabiting a character/persona, open question answering, reasoning, rewriting, and summarization.

| Category | Count |
| --- | --- |
| Coding | 150 |
| Mathematical reasoning | 150 |
| Asking for Advice | 150 |
| Brainstorming | 150 |
| Classification | 150 |
| Closed Question Answering | 150 |
| Creative Writing | 150 |
| Extraction | 150 |
| Inhabiting a Character/Persona | 150 |
| Open Question Answering | 150 |
| Rewriting | 150 |
| Summarization | 150 |