Reproduction issue with Qwen2.5-7B-Instruct #7

@hzy312

When I use GPT-4o, the model first generates a critique and then gives the final prediction, like this:

{"id": "gsm8k-0", "generator": "Qwen2-7B-Instruct", "problem": "Sue lives in a fun neighborhood. One weekend, the neighbors decided to play a prank on Sue. On Friday morning, the neighbors placed 18 pink plastic flamingos out on Sue's front yard. On Saturday morning, the neighbors took back one third of the flamingos, painted them white, and put these newly painted white flamingos back out on Sue's front yard. Then, on Sunday morning, they added another 18 pink plastic flamingos to the collection. At noon on Sunday, how many more pink plastic flamingos were out than white plastic flamingos?", "steps": ["To find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.", "On Saturday, they take back one third of the flamingos. Since there were 18 flamingos, \(1/3 \times 18 = 6\) flamingos are taken back. So, they have \(18 - 6 = 12\) flamingos left in their possession. Then, they paint these 6 flamingos white and put them back out on Sue's front yard. Now, Sue has the original 12 pink flamingos plus the 6 new white ones. Thus, by the end of Saturday, Sue has \(12 + 6 = 18\) pink flamingos and 6 white flamingos.", "On Sunday, the neighbors add another 18 pink plastic flamingos to Sue's front yard. By the end of Sunday morning, Sue has \(18 + 18 = 36\) pink flamingos and still 6 white flamingos.", "To find the difference, subtract the number of white flamingos from the number of pink flamingos: \(36 - 6 = 30\). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. The answer is \(\boxed{30}\)."], "final_answer_correct": false, "label": 1, "generated_critique": "Let's review the solution paragraph by paragraph:\n\n<paragraph_0>\nTo find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.\n</paragraph_0>\nThis paragraph correctly states the initial number of pink flamingos. No errors here.\n\n<paragraph_1>\nOn Saturday, they take back one third of the flamingos. Since there were 18 flamingos, \(1/3 \times 18 = 6\) flamingos are taken back. So, they have \(18 - 6 = 12\) flamingos left in their possession. Then, they paint these 6 flamingos white and put them back out on Sue's front yard. Now, Sue has the original 12 pink flamingos plus the 6 new white ones. Thus, by the end of Saturday, Sue has \(12 + 6 = 18\) pink flamingos and 6 white flamingos.\n</paragraph_1>\nThis paragraph correctly calculates the number of flamingos taken back, painted, and returned. It also correctly states the number of pink and white flamingos by the end of Saturday. No errors here.\n\n<paragraph_2>\nOn Sunday, the neighbors add another 18 pink plastic flamingos to Sue's front yard. By the end of Sunday morning, Sue has \(18 + 18 = 36\) pink flamingos and still 6 white flamingos.\n</paragraph_2>\nThis paragraph correctly adds the additional pink flamingos on Sunday and states the correct totals. No errors here.\n\n<paragraph_3>\nTo find the difference, subtract the number of white flamingos from the number of pink flamingos: \(36 - 6 = 30\). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. The answer is \(\boxed{30}\).\n</paragraph_3>\nThis paragraph correctly calculates the difference between the number of pink and white flamingos. No errors here.\n\nSince there are no errors in any of the paragraphs, the final answer is \(\boxed{-1}\).", "prediction": -1, "match": false}

But when I use Qwen2.5-7B-Instruct, the model outputs the answer directly, without any critique, and the final F1 score is very different from the score reported in the original paper. The trajectory looks like this:

{"id": "gsm8k-0", "generator": "Qwen2-7B-Instruct", "problem": "Sue lives in a fun neighborhood. One weekend, the neighbors decided to play a prank on Sue. On Friday morning, the neighbors placed 18 pink plastic flamingos out on Sue's front yard. On Saturday morning, the neighbors took back one third of the flamingos, painted them white, and put these newly painted white flamingos back out on Sue's front yard. Then, on Sunday morning, they added another 18 pink plastic flamingos to the collection. At noon on Sunday, how many more pink plastic flamingos were out than white plastic flamingos?", "steps": ["To find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.", "On Saturday, they take back one third of the flamingos. Since there were 18 flamingos, \(1/3 \times 18 = 6\) flamingos are taken back. So, they have \(18 - 6 = 12\) flamingos left in their possession. Then, they paint these 6 flamingos white and put them back out on Sue's front yard. Now, Sue has the original 12 pink flamingos plus the 6 new white ones. Thus, by the end of Saturday, Sue has \(12 + 6 = 18\) pink flamingos and 6 white flamingos.", "On Sunday, the neighbors add another 18 pink plastic flamingos to Sue's front yard. By the end of Sunday morning, Sue has \(18 + 18 = 36\) pink flamingos and still 6 white flamingos.", "To find the difference, subtract the number of white flamingos from the number of pink flamingos: \(36 - 6 = 30\). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. The answer is \(\boxed{30}\)."], "final_answer_correct": false, "label": 1, "generated_critique": "\boxed{1}", "prediction": 1, "match": true}
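To count how many records are affected, here is a minimal sketch that flags critiques consisting of nothing but a boxed answer. It assumes the prediction is read from the last `\boxed{...}` in `generated_critique`; that assumption, and the helper names `extract_prediction`, `is_answer_only`, and `summarize`, are mine for illustration and are not taken from the repo's evaluation code.

```python
import re

# Matches \boxed{<integer>}, e.g. \boxed{-1} or \boxed{3}.
BOXED = re.compile(r"\\boxed\{(-?\d+)\}")


def extract_prediction(critique):
    # Use the LAST boxed integer, since earlier ones may just quote the solution.
    matches = BOXED.findall(critique)
    return int(matches[-1]) if matches else None


def is_answer_only(critique):
    # True when the whole output is a single boxed answer with no reasoning.
    return BOXED.fullmatch(critique.strip()) is not None


def summarize(record):
    # Recompute "prediction" and "match" and add an "answer_only" flag.
    critique = record["generated_critique"]
    prediction = extract_prediction(critique)
    return {
        "id": record["id"],
        "prediction": prediction,
        "match": prediction == record["label"],
        "answer_only": is_answer_only(critique),
    }
```

Running `summarize` over every line of the output file and counting `answer_only` would show whether the short outputs are the norm for this model or only occasional.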

This is strange, because I am using the same code as in your repo. My environment:

transformers==4.47.0
vllm==0.6.4.post1
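For anyone comparing setups, a small sketch that prints the installed versions in the same `name==version` form, using only the standard library. The function name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    # Map each package name to its installed version, or a marker if absent.
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(["transformers", "vllm"]).items():
        print(f"{name}=={ver}")
```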
