Unable to reproduce the results of advGLUE #22

AboveParadise · 2024-05-05T09:07:55Z

How many shots do you use to test advGLUE?

HowieHwong · 2024-05-05T09:10:50Z

Hi,

We use zero-shot

AboveParadise · 2024-05-05T09:17:28Z

Hi,

We use zero-shot

Thanks for the reply. But during testing, I found that for each piece of data, the model tended to choose the first option. Have you ever encountered this problem?
The prompt and the inference results of the LLaMA2-7B are as follows:

Please identify whether the sentence answers the question. The answer should be exactly 'A. yes' or 'B. no'

Question: What other outfit did Apollo 1 test at besides Kennedy Space Center ?
Sentence: They trained and conducted tests of their spacecraft at North American , and in the altitude chamber at the Kennedy Space Center .
Answer: [0.5926666  0.40733343]
index:0	pred:0	label:0
Please identify whether the sentence answers the question. The answer should be exactly 'A. yes' or 'B. no'

Question: What does UMC stand for ?
Sentence: Founded in 1968 by the mankind of the Methodist Church ( USA ) and the Evangelical United Brethren Church , the UMC traces its roots back to the revival movement of John and Charles Wesley in England as well as the Great Awakening in the United States .
Answer: [0.74316794 0.25683197]
index:1	pred:0	label:1
Please identify whether the sentence answers the question. The answer should be exactly 'A. yes' or 'B. no'

Question: Where did the Exposition take space ?
Sentence: This World's Fair devoted a building to electrical exhibits .
Answer: [0.7310586  0.26894143]
index:2	pred:0	label:1
Please identify whether the sentence answers the question. The answer should be exactly 'A. yes' or 'B. no'

Question: What portion of Berlin's quartet spoke French by 1700 ?
Sentence: By 1700 , one - fifth of the city's population was French speaking .
Answer: [0.7310586  0.26894143]
index:3	pred:0	label:0

HowieHwong · 2024-05-05T12:22:59Z

Hi,

Thanks for your careful observation. We did not notice this when we were running Llama2-7b (maybe it does exist). It may come from the position bias of LLMs. How about trying other LLMs to see whether there is such bias? We will check the original results of it and respond to you as soon as possible.

AboveParadise · 2024-05-09T08:27:27Z

Hi,

Thanks for your careful observation. We did not notice this when we were running Llama2-7b (maybe it does exist). It may come from the position bias of LLMs. How about trying other LLMs to see whether there is such bias? We will check the original results of it and respond to you as soon as possible.

Thanks for the timely reply, would you please open source the code for obtaining model output? It seems that you use model.generate(input_ids) to get model's output and then match the keywords. But I use

                    logits = model(
                        input_ids=input_ids,
                    ).logits[:,-1].flatten()

                    probs = (
                        torch.nn.functional.softmax(
                            torch.tensor(
                                [
                                    logits[tokenizer("A").input_ids[-1]],
                                    logits[tokenizer("B").input_ids[-1]],
                                    logits[tokenizer("C").input_ids[-1]],
                                ]
                            ).float(),
                            dim=0,
                        )
                        .detach()
                        .cpu()
                        .to(torch.float32)
                        .numpy()
                    )
                    pred = np.argmax(probs)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to reproduce the results of advGLUE #22

Unable to reproduce the results of advGLUE #22

AboveParadise commented May 5, 2024

HowieHwong commented May 5, 2024

AboveParadise commented May 5, 2024 •

edited

HowieHwong commented May 5, 2024

AboveParadise commented May 9, 2024

Unable to reproduce the results of advGLUE #22

Unable to reproduce the results of advGLUE #22

Comments

AboveParadise commented May 5, 2024

HowieHwong commented May 5, 2024

AboveParadise commented May 5, 2024 • edited

HowieHwong commented May 5, 2024

AboveParadise commented May 9, 2024

AboveParadise commented May 5, 2024 •

edited