Ternary and binary neural networks enable multiplication-free computation and promise multiple orders of magnitude efficiency gains over full-precision networks if implemented on specialized hardware. However, since both the parameter and the output space are highly discretized, such networks have proven very difficult to optimize. The difficulties are compounded for the class of transformer text generation models due to the sensitivity of the attention operation to quantization and the noise-compounding effects of autoregressive decoding in the high-cardinality output space. We approach the problem with a mix of statistics-based quantization for the weights and elastic quantization of the activations, and demonstrate the first ternary and binary transformer models on the downstream tasks of summarization and machine translation. Our ternary BART base achieves an R1 score of 41 on the CNN/DailyMail benchmark, which is merely 3.9 points behind the full model while being 16x more efficient. Our binary model, while less accurate, achieves a highly non-trivial score of 35.6. For machine translation, we achieved BLEU scores of 21.7 and 17.6 on the WMT16 En-Ro benchmark, compared with a full-precision mBART model score of 26.8. We also compare our approach in the 8-bit activation setting, where our ternary and even binary weight models can match or outperform the best existing 8-bit weight models in the literature. Our code and models are available at: https://github.com/facebookresearch/Ternary_Binary_Transformer
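To make "statistics-based quantization for the weights" concrete, here is a minimal sketch of per-tensor ternarization in the style of Ternary Weight Networks, where a threshold and scale are derived from simple weight statistics; the exact scheme, threshold factor, and granularity used in the paper may differ, so treat this as an illustration of the general technique rather than the paper's implementation.

```python
import numpy as np

def ternarize(w, delta_factor=0.7):
    """Map each weight to {-alpha, 0, +alpha} using statistics of w.

    delta_factor=0.7 follows the Ternary Weight Networks heuristic;
    it is an assumption, not a value taken from this paper.
    """
    # Threshold below which weights are zeroed, derived from the mean magnitude.
    delta = delta_factor * np.mean(np.abs(w))
    mask = np.abs(w) > delta  # weights that remain nonzero
    # Scale alpha is the mean magnitude of the surviving weights.
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4))
wt = ternarize(w)
# wt takes at most three distinct values: -alpha, 0, +alpha
```

The multiplication-free property follows because a matrix product with such weights reduces to sign-conditional additions plus a single scaling by alpha per output tensor.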