Improve jaqket v2 task #48
Conversation
This PR improves the jaqket v2 task; almost all models get a better score with the improved task than with the current one.
@kumapo Sorry for my late review. I left one comment regarding the fallback. Thanks!
lm_eval/tasks/ja/jaqket_v2.py
Outdated
```python
    {k: v[i] for k, v in doc["ctxs"].items()}
    for i, a in enumerate(doc["ctxs"]["has_answer"]) if a == True
]
if len(answering_contexts) < 1:
```
How many times is the fallback doc called? There's only one fallback doc, which might lead to a lack of randomness. For example, if a model is very good at that fallback and it's called multiple times, the score would be biased.
I think this is OK under the assumption this isn't called very much. I'm not sure what it would mean for the model to be good at the fallback - in that case it would just be like a special prompt, which should be reported, but isn't leakage or anything.
I also had two separate concerns about this code.
First, when is `has_answer` not `True`? It's not clear to me why there would be any question like that in the dataset.
Second, I find this code hard to follow. You're iterating over `doc["ctxs"]` as a dictionary, but in the fallback doc it's a list. What is the actual structure of this object? Can you give an example in a comment?
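To make the question concrete, here is a small sketch of the columnar dict-of-lists layout that the comprehension under review appears to assume for `doc["ctxs"]`. The field names (`title`, `text`) and values are illustrative assumptions, not the actual dataset schema:

```python
# Assumed columnar structure: a dict of parallel lists, one slot per
# retrieved context (field names are hypothetical).
doc = {
    "ctxs": {
        "title": ["Context A", "Context B"],
        "text": ["no answer here", "contains the answer"],
        "has_answer": [False, True],
    }
}

# The comprehension transposes this dict-of-lists into a list of
# per-context dicts, keeping only contexts whose retrieved passage
# contains the gold answer (`if a` is equivalent to the PR's `if a == True`).
answering_contexts = [
    {k: v[i] for k, v in doc["ctxs"].items()}
    for i, a in enumerate(doc["ctxs"]["has_answer"])
    if a
]
# Each element now looks like:
# {"title": "Context B", "text": "contains the answer", "has_answer": True}
```

A fallback doc that instead stores `ctxs` as a plain list of dicts would break this transposition, which is what the comment above is flagging.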
Regarding the first point, `has_answer=False` existed due to slicing contexts by `TOP_K_LIMIT`. But now the task does the top-k slicing and the `has_answer` filtering in `load_dataset()`.
As for the second point, I dropped iterating over `doc["ctxs"]` and the fallback doc entirely.
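A minimal sketch of the load-time filtering described above, under the assumption that each document carries a dict-of-lists `ctxs` field; the function name, field names, and `TOP_K_LIMIT` value here are illustrative, not the PR's actual code:

```python
# Hypothetical preprocessing applied when the dataset is loaded: slice
# each document's contexts to the top-k, then drop has_answer=False
# entries, so downstream code never sees unanswerable contexts.
TOP_K_LIMIT = 5

def preprocess_ctxs(ctxs: dict) -> dict:
    """Keep only the top-k contexts that contain the answer."""
    keep = [i for i, a in enumerate(ctxs["has_answer"][:TOP_K_LIMIT]) if a]
    return {k: [v[i] for i in keep] for k, v in ctxs.items()}

ctxs = {
    "text": ["c0", "c1", "c2", "c3", "c4", "c5"],
    "has_answer": [False, True, False, True, False, True],
}
filtered = preprocess_ctxs(ctxs)
# Index 5 is sliced away by TOP_K_LIMIT even though it has the answer;
# within the top 5, only indices 1 and 3 survive the filter.
# → {"text": ["c1", "c3"], "has_answer": [True, True]}
```

Doing this once at load time removes the need for both the per-doc comprehension and the fallback doc.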
lm_eval/tasks/ja/jaqket_v2.py
Outdated
```python
# if there is no example and still the prompt is too long, fail
if len(ctxs) < 2:
    raise ValueError(f"0-shot description+example doesn't fit in max length. ctx: {ctx}")
```
The comment here doesn't describe what's happening. This checks if there is only one example (besides the main question) and fails if that would have to be removed.
Is this the desired behavior? I would assume that if the question is very long it's better to just do it with no few-shot examples and not fail, but I could understand failing.
Either way the comment should match the behavior here.
I've updated both the comment and the message as follows:
```python
# if there is no example and still the description + QA prompt is too long, fail
if len(ctxs) < 2:
    raise ValueError(f"description + QA prompt with no example (0-shot) doesn't fit in max_length. ctx: {ctx}")
```
I hope this clarifies what happens there.
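For context, the check being discussed sits inside a loop that drops few-shot examples until the prompt fits. A self-contained sketch of that trimming logic, with hypothetical names and a toy length function (the real task measures token length, not characters):

```python
# Hypothetical few-shot trimming: ctxs[0] is the task description,
# ctxs[-1] the QA prompt itself, and everything in between is a
# few-shot example that can be dropped to fit max_length.
def fit_prompt(ctxs: list[str], max_length: int, length_fn=len) -> list[str]:
    """Drop few-shot examples (middle entries) until the prompt fits."""
    while length_fn("".join(ctxs)) > max_length:
        if len(ctxs) < 3:
            # if there is no example and still the description + QA
            # prompt is too long, fail
            raise ValueError(
                "description + QA prompt with no example (0-shot) "
                "doesn't fit in max_length"
            )
        ctxs.pop(-2)  # remove the last few-shot example
    return ctxs

prompt = fit_prompt(["desc:", "ex1 ", "ex2 ", "Q: ..."], max_length=15)
# → ["desc:", "ex1 ", "Q: ..."]  (one example dropped to fit)
```

Failing here rather than silently falling back to 0-shot matches the behavior the reviewer described; the alternative of continuing with no examples would also be defensible.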
I used the wrong account and left some duplicate comments here; sorry for any extra notifications you get.
@kumapo Please feel free to assign us for review again when you've fixed it. Thanks!
Could you check it again? Thank you.
Due to the MARC-ja dataset issue, I started evaluation only on jaqket_v2.
Could you check? I've evaluated as many models as possible on the new jaqket_v2 task, except for models with 7b params.
TODO
- max_length
- TOP_K_LIMIT to 5 to fit max_length of some models
- evaluation

Almost all models get a better score with the jaqket_v2-0.2 task (the improved version) than with the previous version, even though the improved one has only half the chance of seeing the context that contains the answer (topk=10 vs topk=5).