Is it possible to conduct multi-round evaluations for chat models? For example, I want to study how a chat model can take hints to solve math problems. Say I have a multiple-choice math problem with one correct choice and one hint for each wrong choice. I first ask the model the question and get an answer (using generate_until and some parsing), following this workflow:
User: Solve the following math question, and output a single letter corresponding to your selection. [question]
Agent: B
If this is correct, I stop here. If not, I continue from where the agent left off:
User: Solve the following math question, and output a single letter corresponding to your selection. [question]
Agent: C
User: This is not correct. Consider the hint and try again: [hint corresponding to choice C]
Agent: B
In the evaluation, I want to calculate the percentage of correct first-round answers, as well as the percentage of correct second-round answers among the cases where the first-round answer was incorrect.
Is it possible to do this in the current framework, or is the library not suitable for this kind of evaluation?
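For concreteness, here is a minimal sketch of the two-round loop and the two conditional metrics described above, written outside the harness. The `model_generate` callable, the `problems` dict layout, and the `parse_choice` helper are all illustrative assumptions, not harness APIs:

```python
import re

def parse_choice(text):
    """Extract the first standalone capital letter A-E from a model reply."""
    m = re.search(r"\b([A-E])\b", text)
    return m.group(1) if m else None

def evaluate_two_rounds(model_generate, problems):
    """Score multiple-choice problems over up to two rounds.

    model_generate: callable taking a list of chat messages and returning the
    model's reply string (hypothetical stand-in for a generate_until request).
    problems: dicts with "question", "answer" (the correct letter), and
    "hints" mapping each wrong letter to its hint string.
    """
    first_correct = 0
    second_correct = 0
    second_attempts = 0

    for p in problems:
        prompt = ("Solve the following math question, and output a single "
                  f"letter corresponding to your selection. {p['question']}")
        messages = [{"role": "user", "content": prompt}]
        reply = model_generate(messages)
        choice = parse_choice(reply)

        if choice == p["answer"]:
            first_correct += 1
            continue

        # Round 2: continue the conversation with the hint for the wrong pick.
        second_attempts += 1
        hint = p["hints"].get(choice, "")
        messages += [
            {"role": "assistant", "content": reply},
            {"role": "user",
             "content": f"This is not correct. Consider the hint and try again: {hint}"},
        ]
        if parse_choice(model_generate(messages)) == p["answer"]:
            second_correct += 1

    return {
        "round1_accuracy": first_correct / len(problems),
        # Conditional on the first-round answer being wrong.
        "round2_accuracy": second_correct / second_attempts if second_attempts else 0.0,
    }
```

The point of the sketch is the control flow: round 2 only runs for first-round failures, and the second metric is normalized by the number of second attempts, not by the full problem count.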
Right now I'm working on a solution to your problem!
Take a look at #1571.
There I suggest a way that multi-step and multi-round tasks may be handled. The magic lies in the update_request and update_storage functions; you can customize them to cover your problem.
It would be great if you could take a look at that PR and leave your feedback, so that I can improve it and also draw @haileyschoelkopf's and @lintangsutawika's attention to the desired feature.
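To make the idea above concrete for this hint-taking task: the two hook names come from #1571, but the signatures, arguments, and storage fields below are purely illustrative assumptions, not the PR's actual API:

```python
import re

def parse_choice(text):
    """Extract the first standalone capital letter A-E from a model reply."""
    m = re.search(r"\b([A-E])\b", text)
    return m.group(1) if m else None

def update_storage(storage, doc, response):
    """After round 1, record the model's pick and whether we are done.

    `storage` is hypothetical per-document scratch state carried between
    rounds; `doc` holds the correct "answer" and per-choice "hints".
    """
    storage["round1_choice"] = parse_choice(response)
    storage["done"] = storage["round1_choice"] == doc["answer"]
    return storage

def update_request(messages, doc, storage):
    """Build the round-2 request, or return None when no further round is needed."""
    if storage.get("done") or storage.get("round1_choice") is None:
        return None  # round 1 was correct (or unparseable): stop here
    hint = doc["hints"][storage["round1_choice"]]
    return messages + [{
        "role": "user",
        "content": f"This is not correct. Consider the hint and try again: {hint}",
    }]
```

Under this reading, update_storage decides per document whether another round is warranted, and update_request rewrites the conversation for that next round.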