Hi all,

If one runs the evaluate.py script against our transformation (#230), the results are very strange: the performance is too good, considering the dramatic changes the transformation makes.
Here is the performance of the model aychang/roberta-base-imdb on the test[:20%] split of the imdb dataset:
The accuracy on this subset which has 1000 examples = 96.0
Applying transformation:
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:19<00:00, 51.83it/s]
Finished transformation! 1000 examples generated from 1000 original examples, with 1000 successfully transformed and 0 unchanged (1.0 perturb rate)
Here is the performance of the same model on the transformed set:
The accuracy on this subset which has 1000 examples = 100.0
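
For a quick sanity check outside evaluate.py, one can run the default model on a single review before and after the transformation and compare the predictions by hand. This is only a sketch: the model name comes from the log above, while the example review and the placeholder for its transformed version are made up.

```python
from transformers import pipeline

# Spot check of the default model, independent of evaluate.py.
clf = pipeline("text-classification", model="aychang/roberta-base-imdb")

original = "This movie was a complete waste of time. Terrible acting."
# Placeholder: in practice, produce this with the WritingSystemReplacement
# transformation from #230 and paste its output here.
transformed = "<WritingSystemReplacement output for the sentence above>"

print(clf(original))     # expected: a confident negative prediction
print(clf(transformed))  # if this stays equally confident, something is off
```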
On the other hand, non-default models produce reasonable results, i.e. a substantial drop in accuracy on the transformed set (kudos to @sotwi for running these).
I suspect the problem in the default test is caused by some deficiency in the model aychang/roberta-base-imdb and/or the imdb dataset, but I'm not familiar enough with the model's inner workings to pin down the source of the problem.
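
One concrete thing worth checking along those lines (an assumption on my part, not something I have verified): how the model's byte-level BPE tokenizer breaks up text in the replacement script, and how balanced the labels in the evaluation slice actually are. If the model effectively cannot read the transformed text and falls back to one label, and the unshuffled slice happens to be skewed toward that same label, accuracy would look inflated. A minimal probe, with a made-up runic string standing in for the real transformed text:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Probe how the RoBERTa byte-level BPE tokenizer handles a replacement script.
# Nothing maps to <unk> with byte-level BPE, but the resulting pieces can be
# long runs of byte fragments the IMDB-finetuned classifier has never seen.
tok = AutoTokenizer.from_pretrained("aychang/roberta-base-imdb")

sample = "ᛏᚺᛁᛋ ᛗᛟᚡᛁᛖ ᚹᚨᛋ ᚨ ᚹᚨᛋᛏᛖ ᛟᚠ ᛏᛁᛗᛖ"  # stand-in text, not actual #230 output
pieces = tok.tokenize(sample)
print(len(pieces))
print(pieces[:20])

# How balanced is the slice the evaluation runs on?
labels = load_dataset("imdb", split="test[:20%]")["label"]
print(sum(labels), "positive out of", len(labels))
```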
How to reproduce the strange results:
1. Get the writing_system_replacement transformation from #230.
2. cd into the NL-Augmenter directory.
3. Run: python3 evaluate.py -t WritingSystemReplacement
Expected results: a massive drop in accuracy, similar to the results by @sotwi on non-default models mentioned above.
Observed results: a perfect accuracy of 100.0.
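
For comparison, here is a rough standalone sketch of how a non-default model could be scored on the same kind of subset without going through evaluate.py. The model name below is only a placeholder (not necessarily one of the models @sotwi tested), the gold-label mapping has to match whichever model you pick, and the WritingSystemReplacement call is commented out because the transformation lives in #230; its SentenceOperation-style generate() interface is assumed.

```python
from datasets import load_dataset
from transformers import pipeline

# Placeholder model; swap in whichever non-default classifier you want to test.
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")

# Roughly mirrors the 1000-example subset evaluate.py reports; the exact
# sampling evaluate.py uses may differ.
ds = load_dataset("imdb", split="test[:20%]").select(range(1000))

# t = WritingSystemReplacement()  # from #230; SentenceOperation API assumed

correct = 0
for ex in ds:
    text = ex["text"]
    # text = t.generate(text)[0]  # uncomment to score the transformed set
    pred = clf(text, truncation=True)[0]["label"]
    # Label names differ per model (e.g. "pos"/"neg" for the default model).
    gold = "POSITIVE" if ex["label"] == 1 else "NEGATIVE"
    correct += int(pred == gold)

print(f"accuracy = {correct / len(ds):.3f}")
```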