
The default performance evaluation shows strange results #295

Open
RomanPlusPlus opened this issue Sep 19, 2021 · 1 comment
Labels
bug Something isn't working

Comments

RomanPlusPlus (Contributor) commented Sep 19, 2021

Hi all,

If one runs the evaluate.py script against our transformation (#230), the results are very strange: the accuracy is implausibly high, considering the dramatic changes the transformation makes to the input text.

Here is the performance of the model aychang/roberta-base-imdb on the test[:20%] split of the imdb dataset
The accuracy on this subset which has 1000 examples = 96.0
Applying transformation:
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:19<00:00, 51.83it/s]
Finished transformation! 1000 examples generated from 1000 original examples, with 1000 successfully transformed and 0 unchanged (1.0 perturb rate)
Here is the performance of the model on the transformed set
The accuracy on this subset which has 1000 examples = 100.0

On the other hand, if we use non-default models, they produce reasonable results (kudos to @sotwi):

roberta-base-SST-2: 94.0 -> 51.0
bert-base-uncased-QQP: 92.0 -> 67.0
roberta-large-mnli: 91.0 -> 43.0

I speculate that the problem in the default test is caused by some deficiency in the aychang/roberta-base-imdb model and/or the imdb dataset, but I'm not familiar enough with the model's inner workings to pinpoint the source of the problem.
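
For anyone looking into this, here is a minimal diagnostic sketch (not part of evaluate.py) for two hypotheses: that the evaluated slice of imdb is effectively single-class (if I recall correctly, the HF imdb splits are ordered by label, so a contiguous slice can be one-sided), and that the default model collapses to a single prediction on text it cannot read. The model and dataset names are taken from the log above; everything else is illustrative.

```python
# Diagnostic sketch only; assumes the `datasets` and `transformers` libraries.
from collections import Counter

from datasets import load_dataset
from transformers import pipeline

# (a) Label distribution of the evaluated slice. If it is (nearly) single-class,
# a model that always predicts that class on garbled input still scores ~100%.
subset = load_dataset("imdb", split="test[:20%]")
print(Counter(subset["label"]))

# (b) What does the default model predict on text it cannot possibly read?
clf = pipeline("text-classification", model="aychang/roberta-base-imdb")
print(clf("𐤀𐤁𐤂 𐤃𐤄𐤅 𐤆𐤇𐤈"))            # unreadable, transformed-style input
print(clf("This movie was terrible."))   # sanity check on plain English
```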


How to reproduce the strange results:

Get the writing_system_replacement transformation from #230.

cd into the NL-Augmenter directory.

Run this:

python3 evaluate.py -t WritingSystemReplacement


Expected results:

a massive drop in accuracy, similar to the results @sotwi obtained with the non-default models above.

Observed results:

a perfect accuracy of 100.0.
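
To narrow down whether the perfect score comes from the model itself or from the evaluation harness, a cross-check run outside evaluate.py might help. The following is only a sketch: the import path and the generate() call assume the transformation in #230 follows the usual transformations/<folder>/transformation.py layout and the SentenceOperation interface, and the label mapping for aychang/roberta-base-imdb may need adjusting to match its model card.

```python
# Rough cross-check outside evaluate.py (sketch; paths and interfaces are assumptions).
from datasets import load_dataset
from transformers import pipeline

# Assumes PR #230 provides this module path and a SentenceOperation-style class.
from transformations.writing_system_replacement.transformation import (
    WritingSystemReplacement,
)

t = WritingSystemReplacement()
clf = pipeline("text-classification", model="aychang/roberta-base-imdb")

subset = load_dataset("imdb", split="test[:20%]").select(range(20))
correct = 0
for example in subset:
    transformed = t.generate(example["text"])[0]
    pred = clf(transformed, truncation=True)[0]["label"]
    # Label mapping is an assumption; check the model's id2label / model card.
    pred_id = 1 if pred.lower().startswith("pos") else 0
    correct += int(pred_id == example["label"])

print(f"accuracy on 20 transformed examples: {correct / 20:.2f}")
```

If this manual pass also reports ~100%, the issue is likely with the model or dataset; if it reports a large drop, the evaluation harness deserves a closer look.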

AbinayaM02 (Collaborator) commented

@tongshuangwu and @ashish3586: Please take a look at this issue on evaluation.
