
add belebele #885

Merged
merged 4 commits into EleutherAI:big-refactor on Oct 12, 2023

Conversation


@juletx (Contributor) commented Sep 30, 2023

Thanks for adding this dataset! The prompt is a bit different from the paper: they use A: instead of A., and they use no description for five-shot evaluation. They also include one space on both sides of \n; I don't know why. Maybe it is only to make the prompt more readable in the paper. We should run some five-shot experiments with LLaMA to validate that we get the same results. This is the prompt with the changes I mentioned:

P: {{flores_passage}} \n Q: {{question.strip()}} \n A: {{mc_answer1}} \n B: {{mc_answer2}} \n C: {{mc_answer3}} \n D: {{mc_answer4}} \n Answer:
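A minimal Python sketch of how that Jinja-style template might be rendered for a single example (the field names follow the Belebele dataset schema; the render_prompt helper and the sample values are hypothetical, not code from this PR):

```python
# Hypothetical rendering of the proposed prompt template.
# Note the single space on both sides of each \n, as discussed above.
TEMPLATE = (
    "P: {flores_passage} \n Q: {question} \n A: {mc_answer1} \n "
    "B: {mc_answer2} \n C: {mc_answer3} \n D: {mc_answer4} \n Answer:"
)

def render_prompt(example: dict) -> str:
    """Fill the template, stripping the question as question.strip() does in Jinja."""
    return TEMPLATE.format(
        flores_passage=example["flores_passage"],
        question=example["question"].strip(),
        mc_answer1=example["mc_answer1"],
        mc_answer2=example["mc_answer2"],
        mc_answer3=example["mc_answer3"],
        mc_answer4=example["mc_answer4"],
    )
```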

@lintangsutawika (Contributor)

@ManuelFay I'm still seeing description in the yamls, do you think this should still be kept?

@ManuelFay (Contributor, Author)

Ok, I'm removing the description. In the paper, from my understanding, there was a 0-shot scenario with a description and a 5-shot one without. Since the exact content of the description was not stated explicitly, let's just do the 5-shot, I guess.

@ManuelFay (Contributor, Author)

I changed the . to : in the prompt and removed the description. As for the spaces around the \n, I left them; I really feel they're there for paper readability, because it makes no sense to me to add useless whitespace otherwise.

@ManuelFay (Contributor, Author) commented Oct 5, 2023

Maybe we can tag @satyanshukla, the author, for his input!
[edit] Opened an issue requesting author review: facebookresearch/belebele#7

@lucasbandarkar commented Oct 6, 2023

> Maybe we can tag @satyanshukla, the author, for his input!
> [edit] Opened an issue requesting author review: facebookresearch/belebele#7

Hey, one of the authors of the paper here, happy to help! Though it's unclear to me what the question is; is it just about : vs. .?

If so, I know we were trying out different prompts, and the one reported is the one that worked best, though I don't know if we tried using . instead.

@ManuelFay (Contributor, Author)

Hey Lucas! Yes, it's essentially about:

  • Are there spaces between the \n and the text, like in the paper, or was that just for readability and the real prompt has no added whitespace?
  • Is there a description of the task in 0-shot settings? If so, what is it? I did not find it in the paper.
  • Lastly, it would be nice to get your approval of the task implementation, so we can call it an "author approved" benchmark implementation; that makes it more official!

Thanks again for everything

@lucasbandarkar commented Oct 6, 2023 via email

@haileyschoelkopf (Contributor) commented Oct 6, 2023

> 1. Just to clarify: am I approving EleutherAI the authority to use my dataset, or am I approving that the implementation is what I want?

This would simply be agreeing that the implementation matches what you used in your paper!

@juletx (Contributor) commented Oct 6, 2023

To clarify point 1: the doubt is not whether \n was used in the prompt. Our doubt is whether there is any whitespace between \n and the text.

@lucasbandarkar commented Oct 6, 2023 via email

@lucasbandarkar
cc-ing @davisliang
So the format did change slightly across the models, because some things worked better for some models than others, but it was all just punctuation/minutiae. The f-string generally looked like this:

f"{instruction}\n###\nPassage:\n{passage}\n###\nQuery:\n{query}\n###\nChoices:\n(A) {A}\n(B) {B}\n(C) {C}\n(D) {D}\n###\nAnswer:\n"

Example:

Given the following passage, query, and answer choices, output the letter corresponding to the correct answer.
###
Passage:
Though many of the animals in the park are used to seeing humans, the wildlife is nonetheless wild and should not be fed or disturbed. According to park authorities, stay at least 100 yards/meters away from bears and wolves and 25 yards/meters from all other wild animals! No matter how docile they may look, bison, elk, moose, bears, and nearly all large animals can attack. Each year, dozens of visitors are injured because they didn't keep a proper distance. These animals are large, wild, and potentially dangerous, so give them their space. In addition, be aware that odors attract bears and other wildlife, so avoid carrying or cooking odorous foods and keep a clean camp.
###
Query:
Which of the following is not mentioned in the passage as a possible cause of wildlife attacks?
###
Choices:
(A) Strong smells
(B) Failure to maintain distance
(C) Feeding the wildlife
(D) Animals that are unfamiliar with humans
###
Answer:

Processing the outputs:
Our response processing looked something like this; we accepted 'A', '(A)', and some other closely related variants.

correct = 0
for item in data[language]:
    qid = item['qid']

    # Strip parentheses so that '(A)' and 'A' are scored the same way
    answer = answers[language][qid].replace('(', '').replace(')', '')
    if answer not in ['A', 'B', 'C', 'D']:
        # Log responses that are not a bare choice letter,
        # then fall back to the first character of the response
        print("###############################")
        print("FAILED: ", answer)
        print("ACTUAL: ", item['answer'])
        answer = answer[0]

    if item['answer'] == answer:
        correct += 1

print(correct / len(data[language]))
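The variant handling described above could also be expressed as a small normalization helper. This is an illustrative sketch, not the authors' actual code; the regex and the set of accepted variants are assumptions:

```python
import re
from typing import Optional

def normalize_answer(raw: str) -> Optional[str]:
    """Map outputs like 'A', '(A)', 'A.', or 'Answer: A' to a bare choice letter.

    Returns None when no standalone choice letter can be recovered,
    so the caller can decide how to handle unparseable responses.
    """
    match = re.search(r"\(?\b([ABCD])\b\)?", raw.strip())
    return match.group(1) if match else None
```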

@juletx (Contributor) commented Oct 10, 2023

Thank you! If I understand correctly, that is the prompt for instruction/chat models, right? The prompt used for 5-shot in-context learning is the one you mention in the paper (removing the extra spaces between \n and the text).

@lucasbandarkar commented Oct 10, 2023 via email

@ManuelFay (Contributor, Author)

Okay then, I guess we are good with the 5-shot one (the one adapted for the LM eval harness). Let's merge?

@lintangsutawika lintangsutawika merged commit 17095c8 into EleutherAI:big-refactor Oct 12, 2023
3 checks passed