
Evaluation of Llama on HellaSwag #539

Closed
SophieLiao0001 opened this issue Jun 1, 2023 · 4 comments

@SophieLiao0001 commented Jun 1, 2023

Hi there, how do you evaluate Llama on HellaSwag? Llama does not have an API where you can pass an argument called echo and get back the log probs. To the best of my knowledge, if we are using Llama from Hugging Face, we can only get logits as model output. How can we make it work on HellaSwag?

@lintangsutawika
Contributor

You should be able to use HF's implementation of Llama.

Also note that there is a tokenization issue which is currently a work in progress: #531

@gakada
Contributor

gakada commented Jun 2, 2023

I evaluated it recently like so:

python main.py --model hf-causal-experimental --model_args pretrained=huggyllama/llama-7b,use_accelerate=True --tasks hellaswag --batch_size auto
|  Task   |Version| Metric |Value|   |Stderr|
|---------|------:|--------|----:|---|-----:|
|hellaswag|      0|acc     | 0.57|±  |0.0049|
|         |       |acc_norm| 0.76|±  |0.0043|

For the implementation, you compute the summed log probabilities (log_softmax of the logits) corresponding to each answer by calling model(question + answer) for each candidate answer; the argmax of those sums is what the model "thinks" is the right answer. Even if the model would generate nonsense that doesn't match the actual answer, we just use it as a classifier.
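As far as I understand the harness's convention, the two metrics in the table above differ only in that final selection step: acc takes the argmax of the raw summed log probabilities, while acc_norm first divides each sum by the byte length of the answer string. A minimal sketch of just that selection step, with made-up score values for illustration:

import torch

# hypothetical summed log probabilities for four candidate answers,
# purely to illustrate the selection step
answers = [' England.', ' Germany.', ' France.', ' Japan.']
scores = torch.tensor([-9.1, -8.7, -2.3, -9.5])

# acc: argmax of the raw summed log probabilities
acc_choice = answers[scores.argmax()]

# acc_norm: divide each sum by the byte length of the answer before the argmax
byte_lens = torch.tensor([len(a.encode('utf-8')) for a in answers])
acc_norm_choice = answers[(scores / byte_lens).argmax()]

print(acc_choice, acc_norm_choice)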

@gakada
Contributor

gakada commented Jun 4, 2023

As an example:

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('huggyllama/llama-7b')
model = AutoModelForCausalLM.from_pretrained('huggyllama/llama-7b', device_map='auto', load_in_8bit=True)

test = [
    ('Paris is the capital of', ' England.'),
    ('Paris is the capital of', ' Germany.'),
    ('Paris is the capital of', ' France.'),
    ('Paris is the capital of', ' Japan.'),
]

# encode full sentences, questions, and answers, no padding for simplicity
batched_sentences = tokenizer.batch_encode_plus([q + a for q, a in test], add_special_tokens=False, return_tensors='pt')['input_ids']
batched_questions = tokenizer.batch_encode_plus([q for q, _ in test], add_special_tokens=False, return_tensors='pt')['input_ids']

# run the model on full sentences and get the log probabilities
batched_logprobs = F.log_softmax(model(batched_sentences.cuda())['logits'], dim=-1).cpu()

# take log probabilities corresponding to possible answer tokens
batched_logprobs = batched_logprobs[:, len(batched_questions[0]) - 1 : -1, :]

# get the scores by summing log probabilities corresponding to correct answer tokens, unvectorized
scores = []
for sentence, question, logprobs in zip(batched_sentences, batched_questions, batched_logprobs):
    answer = sentence[len(question):]
    guess = logprobs.argmax(dim=-1)
    print(tokenizer.decode(guess), bool((guess == answer).all()))
    scores.append(float(torch.gather(logprobs, 1, answer.unsqueeze(-1)).sum()))

# predict the answer
test[torch.tensor(scores).argmax()]

It can answer correctly even if the token-by-token guess printed inside the loop is wrong.
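One thing to watch out for in the batched version above is the "no padding for simplicity" shortcut: batch_encode_plus with return_tensors='pt' only works there because all four questions (and all four answers) happen to tokenize to the same length, and the slice uses len(batched_questions[0]) for every row. A minimal unbatched sketch that drops that assumption (while keeping the original's assumption that the question tokens are a prefix of the question + answer tokens), reusing the tokenizer, model, and test defined above:

import torch
import torch.nn.functional as F

scores = []
for question, answer in test:
    # tokenize the question alone and the full question + answer
    q_ids = tokenizer(question, add_special_tokens=False, return_tensors='pt')['input_ids'][0]
    full_ids = tokenizer(question + answer, add_special_tokens=False, return_tensors='pt')['input_ids'][0]
    with torch.no_grad():
        logits = model(full_ids.unsqueeze(0).cuda())['logits'][0].float().cpu()
    logprobs = F.log_softmax(logits, dim=-1)
    # logits at position i predict token i + 1, so the answer tokens
    # full_ids[len(q_ids):] are scored by positions len(q_ids) - 1 .. -2
    answer_ids = full_ids[len(q_ids):]
    answer_logprobs = logprobs[len(q_ids) - 1 : -1]
    scores.append(float(torch.gather(answer_logprobs, 1, answer_ids.unsqueeze(-1)).sum()))

# predict the answer, same as before
print(test[torch.tensor(scores).argmax()])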

@jane644

jane644 commented Aug 17, 2023


Hello, I was wondering how we could alter this code for few-shot prompting. Thank you so much.
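Not the harness's actual HellaSwag formatting, just a minimal sketch of one way to do it: prepend a few solved demonstrations to each question string and keep scoring only the tokens of the final candidate answer, so everything else in the example above stays the same. The fewshot_examples below are made up for illustration:

# made-up demonstrations, purely for illustration
fewshot_examples = [
    ('Berlin is the capital of', ' Germany.'),
    ('Tokyo is the capital of', ' Japan.'),
]
prefix = '\n'.join(q + a for q, a in fewshot_examples) + '\n'

# prepend the demonstrations to every question; the candidate answers are unchanged
few_shot_test = [(prefix + q, a) for q, a in test]

# few_shot_test can then be scored exactly like test above: only the tokens of the
# final answer (after the now longer question) contribute to each score.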
