
Improve jaqket v2 task #48

Merged — 12 commits merged into Stability-AI:jp-stable on Aug 15, 2023
Conversation

@kumapo commented Jul 9, 2023

TODO

  • reduce the number of tokens contributed by the few-shot examples in the prompts
  • drop few-shot examples when the prompt including them exceeds the model's max_length (see the sketch after this list)
  • decrease TOP_K_LIMIT to 5 to fit within the max_length of some models
  • evaluate some existing models on the jaqket_v2-0.1-0.2 task
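A minimal sketch of the example-dropping strategy described above, assuming a greedy drop-from-the-end loop; the function name, prompt-joining format, and tokenizer interface are illustrative assumptions, not the task's actual code:

    TOP_K_LIMIT = 5  # reduced from 10 so prompts fit the max_length of some models

    def build_prompt(description, fewshot_examples, question, tokenizer, max_length):
        # Drop few-shot examples from the end until the encoded prompt fits.
        examples = list(fewshot_examples)
        while True:
            prompt = "\n\n".join([description, *examples, question])
            if len(tokenizer.encode(prompt)) <= max_length:
                return prompt
            if not examples:
                # Even the 0-shot prompt is too long; fail loudly.
                raise ValueError("0-shot description + QA prompt doesn't fit in max_length")
            examples.pop()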

Evaluation

Almost all models score better on the improved jaqket_v2-0.2 task than on the previous version, even though the improved version has only half the chance of seeing a context that contains the answer (topk=10 vs. topk=5).

Metric                     jaqket_v2-0.1-0.2 (topk=10)   jaqket_v2-0.2-0.2 (topk=5)
EM@open-calm-medium        21.5361                       30.3265
f1@open-calm-medium        26.6475                       34.9970
EM@open-calm-large         31.2751                       44.5876
f1@open-calm-large         36.2288                       49.1384
EM@open-calm-1b            25.4016                       41.9244
f1@open-calm-1b            30.2354                       47.1261
EM@rinna-japanese-gpt-1b   22.6406                       36.2543
f1@rinna-japanese-gpt-1b   40.9337                       56.4699
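For reference, EM and f1 above are the standard extractive-QA metrics, reported as percentages. A minimal sketch, assuming character-level matching after whitespace stripping (a common choice for Japanese QA; the harness's actual normalization and tokenization may differ):

    from collections import Counter

    def exact_match(pred: str, gold: str) -> float:
        # 1.0 iff the prediction matches the reference exactly.
        return float(pred.strip() == gold.strip())

    def f1(pred: str, gold: str) -> float:
        # Harmonic mean of character-level precision and recall,
        # computed over the multiset intersection of characters.
        pred_chars, gold_chars = list(pred.strip()), list(gold.strip())
        common = Counter(pred_chars) & Counter(gold_chars)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_chars)
        recall = overlap / len(gold_chars)
        return 2 * precision * recall / (precision + recall)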

@kumapo changed the title from "Improve jaqketv2" to "Improve jaqket v2 task" on Jul 9, 2023
@kumapo marked this pull request as ready for review on July 10, 2023
@kumapo requested a review from jon-tow as a code owner on July 10, 2023
@kumapo (Author) commented Jul 10, 2023

@mkshing

This PR makes almost all models score better with the improved task than with the current one.

@mkshing requested reviews from mkshing and polm-stability and removed the request for jon-tow on July 11, 2023
@mkshing left a comment

@kumapo Sorry for my late review. I left one comment regarding the fallback. Thanks!

    answering_contexts = [
        {k: v[i] for k, v in doc["ctxs"].items()}
        for i, a in enumerate(doc["ctxs"]["has_answer"]) if a == True
    ]
    if len(answering_contexts) < 1:
@mkshing commented:

How many times is the fallback doc called? There's only one fallback, and that might lead to a lack of randomness. For example, if a model is very good at that fallback and it's called multiple times, then the score would be biased.

@polm-stability (Collaborator) commented:

I think this is OK under the assumption this isn't called very often. I'm not sure what it would mean for the model to be good at the fallback - in that case it would just be like a special prompt, which should be reported, but it isn't leakage or anything.

I also had two separate concerns about this code.

First, when is has_answer not True? It's not clear to me why there would be any question like that in the dataset.

Second, I find this code hard to follow. You're iterating over doc["ctxs"] as a dictionary, but in the fallback doc it's a list. What is the actual structure of this object? Can you give an example in a comment?

@kumapo (Author) commented Jul 23, 2023

@polm-stability

Regarding the first point: has_answer=False entries existed because the contexts were sliced by TOP_K_LIMIT. Now the task does the slicing by top-k and the filtering by has_answer in load_dataset().

As for the second point, I dropped the iteration over doc["ctxs"] and the fallback doc entirely.
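To make the discussion concrete, a hypothetical example of the columnar layout the comprehension implies; only the has_answer key is confirmed by the quoted code, while the title and text keys are assumptions for illustration:

    # Assumed shape of doc["ctxs"]: a dict of parallel lists, one entry per
    # retrieved context ("title" and "text" are hypothetical keys).
    doc = {
        "ctxs": {
            "title": ["Tokyo", "Kyoto"],
            "text": ["Tokyo is the capital of Japan.", "Kyoto was the former capital."],
            "has_answer": [True, False],
        }
    }

    # The comprehension transposes the columnar dict into a list of
    # per-context dicts, keeping only contexts that contain the answer.
    answering_contexts = [
        {k: v[i] for k, v in doc["ctxs"].items()}
        for i, a in enumerate(doc["ctxs"]["has_answer"]) if a
    ]
    # -> [{"title": "Tokyo", "text": "Tokyo is the capital of Japan.", "has_answer": True}]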


(Two further review threads on lm_eval/tasks/ja/jaqket_v2.py were marked outdated and resolved.)
Comment on lines 232 to 234:

    # if there is no example and still the prompt is too long, fail
    if len(ctxs) < 2:
        raise ValueError(f"0-shot description+example doesn't fit in max length. ctx: {ctx}")
@polm-stability (Collaborator) commented:

The comment here doesn't describe what's happening. This checks if there is only one example (besides the main question) and fails if that would have to be removed.

Is this the desired behavior? I would assume that if the question is very long it's better to just do it with no few-shot examples and not fail, but I could understand failing.

Either way the comment should match the behavior here.

@kumapo (Author) commented Jul 23, 2023

@polm-stability

I've updated both the comment and the error message as follows:

        # if there is no example and still the description + QA prompt is too long, fail
        if len(ctxs) < 2:
            raise ValueError(f"description + QA prompt with no example (0-shot) doesn't fit in max_length. ctx: {ctx}")

I hope this clarifies what happens there.

@polm-stability (Collaborator) commented:

I used the wrong account and left some duplicate comments here, sorry for any extra notifications you get.

@mkshing commented Jul 28, 2023

@kumapo Please feel free to request reviews from us again once you've made the fixes. Thanks!

@kumapo (Author) commented Jul 29, 2023

@mkshing @polm-stability

Could you check it again? Thank you.

@mkshing left a comment

@kumapo Thank you for fixing! The task code looks good to me. But could you update the scores across all models?
@mrorii added scores for jaqket_v2-0.1 in #72 and #67. You can skip some internal stablelm models.
Thank you 🙏

@kumapo (Author) commented Jul 30, 2023

Due to the MARC-ja dataset issue, I started the evaluation only on jaqket_v2. After that, I will merge the jaqket_v2 results into result.json.

@kumapo (Author) commented Aug 1, 2023

@mkshing

Could you check the result.json files in my PR?

I've evaluated as many models as possible on the new jaqket_v2 task, except for models with 7B parameters. Sorry that I cannot evaluate the larger models due to my infrastructure limitations.

@mkshing merged commit 5921154 into Stability-AI:jp-stable on Aug 15, 2023