I show that reconstruction error can be used to detect adversarial attacks against encoder-decoder network architectures. The attacks are white-box with respect to the classification and encoder network (in this case a capsule network) and black-box with respect to the decoder network. The method detects ~70% of adversarial attacks at a 5% false positive rate. See Attack-CapsNet.ipynb for implementation details and results.
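As a rough illustration of the detection rule described above, here is a minimal sketch in NumPy. It is not the code from the notebook. It assumes the reconstruction error is a per-example mean squared error and that the threshold is chosen as a quantile of the errors measured on clean inputs. The function names (`reconstruction_error`, `fit_threshold`, `is_adversarial`) are hypothetical and chosen only for this example.

```python
import numpy as np


def reconstruction_error(x, x_hat):
    """Per-example mean squared error between inputs and their reconstructions.

    Assumption: MSE is used as the error metric; the notebook may use another.
    """
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    # Flatten everything except the batch dimension.
    diff = (x - x_hat).reshape(len(x), -1)
    return np.mean(diff ** 2, axis=1)


def fit_threshold(clean_errors, false_positive_rate=0.05):
    """Pick a threshold that roughly `false_positive_rate` of clean inputs exceed."""
    return float(np.quantile(clean_errors, 1.0 - false_positive_rate))


def is_adversarial(errors, threshold):
    """Flag inputs whose reconstruction error is above the threshold."""
    return np.asarray(errors) > threshold
```

In use, the threshold would be fit on reconstruction errors from held-out clean inputs, and the same threshold would then be applied to the errors of incoming inputs. The detection rate at that threshold depends on the attack and the model, so this sketch by itself says nothing about the ~70% figure.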
This was my final project for CMPS 290C: Advanced Machine Learning at UC Santa Cruz. The associated talk can be found at Adversarial-Defenses-Talk.