
The default performance evaluation shows strange results #295

Open
RomanPlusPlus opened this issue Sep 19, 2021 · 1 comment
Labels
bug Something isn't working

Comments

RomanPlusPlus (Contributor) commented Sep 19, 2021

Hi all,

If one runs the evaluate.py script against our transformation (#230), the results are very strange: the accuracy is implausibly high, considering the dramatic changes the transformation makes to the input text.

Here is the performance of the model aychang/roberta-base-imdb on the test[:20%] split of the imdb dataset
The accuracy on this subset which has 1000 examples = 96.0
Applying transformation:
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:19<00:00, 51.83it/s]
Finished transformation! 1000 examples generated from 1000 original examples, with 1000 successfully transformed and 0 unchanged (1.0 perturb rate)
Here is the performance of the model on the transformed set
The accuracy on this subset which has 1000 examples = 100.0

On the other hand, if we use non-default models, they produce reasonable results (kudos to @sotwi):

roberta-base-SST-2: 94.0 -> 51.0
bert-base-uncased-QQP: 92.0 -> 67.0
roberta-large-mnli: 91.0 -> 43.0

I speculate that the problem in the default test is caused by some deficiency in the aychang/roberta-base-imdb model and/or the imdb dataset, but I'm not familiar enough with the model's inner workings to pinpoint the source of the problem.
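
For anyone looking into this, here is a minimal diagnostic sketch (not part of evaluate.py) for two hypotheses: that the evaluated slice of imdb is effectively single-class (if I recall correctly, the HF imdb splits are ordered by label, so a contiguous slice can be one-sided), and that the default model collapses to a single prediction on text it cannot read. The model and dataset names are taken from the log above; everything else is illustrative.

```python
# Diagnostic sketch only; assumes the `datasets` and `transformers` libraries.
from collections import Counter

from datasets import load_dataset
from transformers import pipeline

# (a) Label distribution of the evaluated slice. If it is (nearly) single-class,
# a model that always predicts that class on garbled input still scores ~100%.
subset = load_dataset("imdb", split="test[:20%]")
print(Counter(subset["label"]))

# (b) What does the default model predict on text it cannot possibly read?
clf = pipeline("text-classification", model="aychang/roberta-base-imdb")
print(clf("𐤀𐤁𐤂 𐤃𐤄𐤅 𐤆𐤇𐤈"))            # unreadable, transformed-style input
print(clf("This movie was terrible."))   # sanity check on plain English
```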


How to reproduce the strange results:

Get the writing_system_replacement transformation from #230.

cd into the NL-Augmenter directory.

Run this:

python3 evaluate.py -t WritingSystemReplacement


Expected results:

a massive drop in accuracy, similar to the results @sotwi obtained with the non-default models above.

Observed results:

a perfect accuracy of 100.0.
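
To narrow down whether the perfect score comes from the model itself or from the evaluation harness, a cross-check run outside evaluate.py might help. The following is only a sketch: the import path and the generate() call assume the transformation in #230 follows the usual transformations/<folder>/transformation.py layout and the SentenceOperation interface, and the label mapping for aychang/roberta-base-imdb may need adjusting to match its model card.

```python
# Rough cross-check outside evaluate.py (sketch; paths and interfaces are assumptions).
from datasets import load_dataset
from transformers import pipeline

# Assumes PR #230 provides this module path and a SentenceOperation-style class.
from transformations.writing_system_replacement.transformation import (
    WritingSystemReplacement,
)

t = WritingSystemReplacement()
clf = pipeline("text-classification", model="aychang/roberta-base-imdb")

subset = load_dataset("imdb", split="test[:20%]").select(range(20))
correct = 0
for example in subset:
    transformed = t.generate(example["text"])[0]
    pred = clf(transformed, truncation=True)[0]["label"]
    # Label mapping is an assumption; check the model's id2label / model card.
    pred_id = 1 if pred.lower().startswith("pos") else 0
    correct += int(pred_id == example["label"])

print(f"accuracy on 20 transformed examples: {correct / 20:.2f}")
```

If this manual pass also reports ~100%, the issue is likely with the model or dataset; if it reports a large drop, the evaluation harness deserves a closer look.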

AbinayaM02 (Collaborator) commented

@tongshuangwu and @ashish3586: Please take a look at this issue on evaluation.
