
Improve jaqket v2 task #48

Merged — 12 commits merged into Stability-AI:jp-stable on Aug 15, 2023
Conversation

@kumapo commented Jul 9, 2023

TODO

  • reduce the number of tokens contributed by the few-shot examples in the prompts
  • drop few-shot examples when the prompt including them exceeds the model's max_length (see the sketch after this list)
  • decrease TOP_K_LIMIT to 5 to fit within the max_length of some models
  • evaluate some existing models on the jaqket_v2-0.1-0.2 task
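A minimal sketch of the example-dropping strategy described above, assuming a greedy drop-from-the-end loop; the function name, prompt-joining format, and tokenizer interface are illustrative assumptions, not the task's actual code:

    TOP_K_LIMIT = 5  # reduced from 10 so prompts fit the max_length of some models

    def build_prompt(description, fewshot_examples, question, tokenizer, max_length):
        # Drop few-shot examples from the end until the encoded prompt fits.
        examples = list(fewshot_examples)
        while True:
            prompt = "\n\n".join([description, *examples, question])
            if len(tokenizer.encode(prompt)) <= max_length:
                return prompt
            if not examples:
                # Even the 0-shot prompt is too long; fail loudly.
                raise ValueError("0-shot description + QA prompt doesn't fit in max_length")
            examples.pop()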

Evaluation

Almost all models score better on the improved jaqket_v2-0.2 task than on the previous version, even though the improved version has only half the chance of seeing a context that contains the answer (topk=10 vs. topk=5).

Metric                     jaqket_v2-0.1-0.2 (topk=10)   jaqket_v2-0.2-0.2 (topk=5)
EM@open-calm-medium        21.5361                       30.3265
f1@open-calm-medium        26.6475                       34.9970
EM@open-calm-large         31.2751                       44.5876
f1@open-calm-large         36.2288                       49.1384
EM@open-calm-1b            25.4016                       41.9244
f1@open-calm-1b            30.2354                       47.1261
EM@rinna-japanese-gpt-1b   22.6406                       36.2543
f1@rinna-japanese-gpt-1b   40.9337                       56.4699
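For reference, EM and f1 above are the standard extractive-QA metrics, reported as percentages. A minimal sketch, assuming character-level matching after whitespace stripping (a common choice for Japanese QA; the harness's actual normalization and tokenization may differ):

    from collections import Counter

    def exact_match(pred: str, gold: str) -> float:
        # 1.0 iff the prediction matches the reference exactly.
        return float(pred.strip() == gold.strip())

    def f1(pred: str, gold: str) -> float:
        # Harmonic mean of character-level precision and recall,
        # computed over the multiset intersection of characters.
        pred_chars, gold_chars = list(pred.strip()), list(gold.strip())
        common = Counter(pred_chars) & Counter(gold_chars)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_chars)
        recall = overlap / len(gold_chars)
        return 2 * precision * recall / (precision + recall)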

@kumapo changed the title from "Improve jaqketv2" to "Improve jaqket v2 task" on Jul 9, 2023
@kumapo marked this pull request as ready for review on July 10, 2023
@kumapo requested a review from jon-tow as a code owner on July 10, 2023
@kumapo (Author) commented Jul 10, 2023

@mkshing

This PR makes almost all models score better with the improved task than with the current one.

@mkshing requested reviews from mkshing and polm-stability and removed the request for jon-tow on July 11, 2023
@mkshing left a comment

@kumapo Sorry for my late review. I left one comment regarding the fallback. Thanks!

    answering_contexts = [
        {k: v[i] for k, v in doc["ctxs"].items()}
        for i, a in enumerate(doc["ctxs"]["has_answer"]) if a == True
    ]
    if len(answering_contexts) < 1:
@mkshing commented:

How many times is the fallback doc called? There's only one fallback, and that might lead to a lack of randomness. For example, if a model is very good at that fallback and it's called multiple times, then the score would be biased.

@polm-stability (Collaborator) commented:

I think this is OK under the assumption this isn't called very often. I'm not sure what it would mean for the model to be good at the fallback - in that case it would just be like a special prompt, which should be reported, but it isn't leakage or anything.

I also had two separate concerns about this code.

First, when is has_answer not True? It's not clear to me why there would be any question like that in the dataset.

Second, I find this code hard to follow. You're iterating over doc["ctxs"] as a dictionary, but in the fallback doc it's a list. What is the actual structure of this object? Can you give an example in a comment?

@kumapo (Author) commented Jul 23, 2023

@polm-stability

Regarding the first point: has_answer=False entries existed because the contexts were sliced by TOP_K_LIMIT. Now the task does the slicing by top-k and the filtering by has_answer in load_dataset().

As for the second point, I dropped the iteration over doc["ctxs"] and the fallback doc entirely.
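To make the discussion concrete, a hypothetical example of the columnar layout the comprehension implies; only the has_answer key is confirmed by the quoted code, while the title and text keys are assumptions for illustration:

    # Assumed shape of doc["ctxs"]: a dict of parallel lists, one entry per
    # retrieved context ("title" and "text" are hypothetical keys).
    doc = {
        "ctxs": {
            "title": ["Tokyo", "Kyoto"],
            "text": ["Tokyo is the capital of Japan.", "Kyoto was the former capital."],
            "has_answer": [True, False],
        }
    }

    # The comprehension transposes the columnar dict into a list of
    # per-context dicts, keeping only contexts that contain the answer.
    answering_contexts = [
        {k: v[i] for k, v in doc["ctxs"].items()}
        for i, a in enumerate(doc["ctxs"]["has_answer"]) if a
    ]
    # -> [{"title": "Tokyo", "text": "Tokyo is the capital of Japan.", "has_answer": True}]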


(Two further review threads on lm_eval/tasks/ja/jaqket_v2.py were marked outdated and resolved.)
Comment on lines 232 to 234:

    # if there is no example and still the prompt is too long, fail
    if len(ctxs) < 2:
        raise ValueError(f"0-shot description+example doesn't fit in max length. ctx: {ctx}")
@polm-stability (Collaborator) commented:

The comment here doesn't describe what's happening. This checks if there is only one example (besides the main question) and fails if that would have to be removed.

Is this the desired behavior? I would assume that if the question is very long it's better to just do it with no few-shot examples and not fail, but I could understand failing.

Either way the comment should match the behavior here.

@kumapo (Author) commented Jul 23, 2023

@polm-stability

I've updated both the comment and the error message as follows:

        # if there is no example and still the description + QA prompt is too long, fail
        if len(ctxs) < 2:
            raise ValueError(f"description + QA prompt with no example (0-shot) doesn't fit in max_length. ctx: {ctx}")

I hope this clarifies what happens there.

@polm-stability (Collaborator) commented:

I used the wrong account and left some duplicate comments here, sorry for any extra notifications you get.

@mkshing commented Jul 28, 2023

@kumapo Please feel free to request reviews from us again once you've made the fixes. Thanks!

@kumapo (Author) commented Jul 29, 2023

@mkshing @polm-stability

Could you check it again? Thank you.

@mkshing left a comment

@kumapo Thank you for fixing! The task code looks good to me. But could you update the scores across all models?
@mrorii added scores for jaqket_v2-0.1 in #72 and #67. You can skip some internal stablelm models.
Thank you 🙏

@kumapo (Author) commented Jul 30, 2023

Due to the MARC-ja dataset issue, I started the evaluation only on jaqket_v2. After that, I will merge the jaqket_v2 results into result.json.

@kumapo (Author) commented Aug 1, 2023

@mkshing

Could you check the result.json files in my PR?

I've evaluated as many models as possible on the new jaqket_v2 task, except for models with 7B parameters. Sorry that I cannot evaluate the larger models due to my infrastructure limitations.

@mkshing merged commit 5921154 into Stability-AI:jp-stable on Aug 15, 2023