
Evaluation of Llama on HellaSwag #539

Closed
SophieLiao0001 opened this issue Jun 1, 2023 · 4 comments

@SophieLiao0001 commented Jun 1, 2023

Hi there, how do you evaluate Llama on HellaSwag? Llama does not have an API where you can pass an argument called echo and get back the log probs. To the best of my knowledge, if we are using Llama from Hugging Face, we can only get logits as model output. How can we make it work on HellaSwag?

@lintangsutawika
Contributor

You should be able to use HF's implementation of Llama.

Also note that there is a tokenization issue which is currently a work in progress: #531

@gakada
Contributor

gakada commented Jun 2, 2023

I evaluated it recently like so:

python main.py --model hf-causal-experimental --model_args pretrained=huggyllama/llama-7b,use_accelerate=True --tasks hellaswag --batch_size auto
|  Task   |Version| Metric |Value|   |Stderr|
|---------|------:|--------|----:|---|-----:|
|hellaswag|      0|acc     | 0.57|±  |0.0049|
|         |       |acc_norm| 0.76|±  |0.0043|

For the implementation, you compute the summed log probabilities (log_softmax of the logits) corresponding to each answer by calling model(question + answer) for each candidate answer; the argmax of those sums is what the model "thinks" is the right answer. Even if the model would generate nonsense that doesn't match the actual answer, we just use it as a classifier.
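As far as I understand the harness's convention, the two metrics in the table above differ only in that final selection step: acc takes the argmax of the raw summed log probabilities, while acc_norm first divides each sum by the byte length of the answer string. A minimal sketch of just that selection step, with made-up score values for illustration:

import torch

# hypothetical summed log probabilities for four candidate answers,
# purely to illustrate the selection step
answers = [' England.', ' Germany.', ' France.', ' Japan.']
scores = torch.tensor([-9.1, -8.7, -2.3, -9.5])

# acc: argmax of the raw summed log probabilities
acc_choice = answers[scores.argmax()]

# acc_norm: divide each sum by the byte length of the answer before the argmax
byte_lens = torch.tensor([len(a.encode('utf-8')) for a in answers])
acc_norm_choice = answers[(scores / byte_lens).argmax()]

print(acc_choice, acc_norm_choice)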

@gakada
Contributor

gakada commented Jun 4, 2023

As an example:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('huggyllama/llama-7b')
model = AutoModelForCausalLM.from_pretrained('huggyllama/llama-7b', device_map='auto', load_in_8bit=True)

test = [
    ('Paris is the capital of', ' England.'),
    ('Paris is the capital of', ' Germany.'),
    ('Paris is the capital of', ' France.'),
    ('Paris is the capital of', ' Japan.'),
]

# encode full sentences, questions, and answers, no padding for simplicity
batched_sentences = tokenizer.batch_encode_plus([q + a for q, a in test], add_special_tokens=False, return_tensors='pt')['input_ids']
batched_questions = tokenizer.batch_encode_plus([q for q, _ in test], add_special_tokens=False, return_tensors='pt')['input_ids']

# run the model on full sentences and get the log probabilities
batched_logprobs = F.log_softmax(model(batched_sentences.cuda())['logits'], dim=-1).cpu()

# take log probabilities corresponding to possible answer tokens
batched_logprobs = batched_logprobs[:, len(batched_questions[0]) - 1 : -1, :]

# get the scores by summing log probabilities corresponding to correct answer tokens, unvectorized
scores = []
for sentence, question, logprobs in zip(batched_sentences, batched_questions, batched_logprobs):
    answer = sentence[len(question):]
    guess = logprobs.argmax(dim=-1)
    print(tokenizer.decode(guess), bool((guess == answer).all()))
    scores.append(float(torch.gather(logprobs, 1, answer.unsqueeze(-1)).sum()))

# predict the answer
test[torch.tensor(scores).argmax()]

It can answer correctly even if the token-by-token guess printed inside the loop is wrong.
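One thing to watch out for in the batched version above is the "no padding for simplicity" shortcut: batch_encode_plus with return_tensors='pt' only works there because all four questions (and all four answers) happen to tokenize to the same length, and the slice uses len(batched_questions[0]) for every row. A minimal unbatched sketch that drops that assumption (while keeping the original's assumption that the question tokens are a prefix of the question + answer tokens), reusing the tokenizer, model, and test defined above:

import torch
import torch.nn.functional as F

scores = []
for question, answer in test:
    # tokenize the question alone and the full question + answer
    q_ids = tokenizer(question, add_special_tokens=False, return_tensors='pt')['input_ids'][0]
    full_ids = tokenizer(question + answer, add_special_tokens=False, return_tensors='pt')['input_ids'][0]
    with torch.no_grad():
        logits = model(full_ids.unsqueeze(0).cuda())['logits'][0].float().cpu()
    logprobs = F.log_softmax(logits, dim=-1)
    # logits at position i predict token i + 1, so the answer tokens
    # full_ids[len(q_ids):] are scored by positions len(q_ids) - 1 .. -2
    answer_ids = full_ids[len(q_ids):]
    answer_logprobs = logprobs[len(q_ids) - 1 : -1]
    scores.append(float(torch.gather(answer_logprobs, 1, answer_ids.unsqueeze(-1)).sum()))

# predict the answer, same as before
print(test[torch.tensor(scores).argmax()])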

@jane644

jane644 commented Aug 17, 2023


Hello, I was wondering how we could alter this code for few-shot prompting. Thank you so much.
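Not the harness's actual HellaSwag formatting, just a minimal sketch of one way to do it: prepend a few solved demonstrations to each question string and keep scoring only the tokens of the final candidate answer, so everything else in the example above stays the same. The fewshot_examples below are made up for illustration:

# made-up demonstrations, purely for illustration
fewshot_examples = [
    ('Berlin is the capital of', ' Germany.'),
    ('Tokyo is the capital of', ' Japan.'),
]
prefix = '\n'.join(q + a for q, a in fewshot_examples) + '\n'

# prepend the demonstrations to every question; the candidate answers are unchanged
few_shot_test = [(prefix + q, a) for q, a in test]

# few_shot_test can then be scored exactly like test above: only the tokens of the
# final answer (after the now longer question) contribute to each score.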
