The full derivation of the Transformer attention gradient, comparing the analytically derived attention gradient with the gradient computed by PyTorch autograd.

If you find this open-source release useful, please cite it in your paper:
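As a minimal sketch of what such a comparison looks like (this is an illustrative example, not necessarily the repository's own script; the tensor sizes and the sum-loss are arbitrary choices), the hand-derived gradient of single-head attention with respect to the query matrix can be checked against PyTorch autograd:

```python
import torch

torch.manual_seed(0)
n, d = 4, 8  # sequence length and head dimension (hypothetical sizes)
Q = torch.randn(n, d, requires_grad=True)
K = torch.randn(n, d)
V = torch.randn(n, d)

# Gradient via PyTorch autograd
S = Q @ K.T / d**0.5          # scaled dot-product scores
A = torch.softmax(S, dim=-1)  # attention weights
O = A @ V                     # attention output
O.sum().backward()            # upstream gradient dL/dO is all-ones
autograd_dQ = Q.grad

# Gradient via the chain rule, derived by hand
G = torch.ones_like(O)                           # dL/dO
dA = G @ V.T                                     # dL/dA
dS = A * (dA - (dA * A).sum(-1, keepdim=True))   # softmax Jacobian-vector product
manual_dQ = dS @ K / d**0.5                      # dL/dQ

print(torch.allclose(autograd_dQ, manual_dQ, atol=1e-6))
```

The two gradients should agree up to floating-point tolerance, confirming the analytic derivation.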
```bibtex
@software{He_The_full_derivation_2022,
  author  = {He, Longxiang},
  month   = may,
  title   = {{The full derivation of Transformer gradient}},
  url     = {https://github.com/Say-Hello2y/Transformer-attention.git},
  version = {0.0.0},
  year    = {2022}
}
```