Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights.
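The 12-of-4095 figure corresponds to a balanced binary tree of depth 12 (2^12 - 1 = 4095 neurons), where each inference evaluates only the neurons on a single root-to-leaf path. The following is a minimal single-sample sketch of that conditional traversal; the weight layout, the ReLU activation, and the sign-based routing rule are illustrative assumptions, not the authors' exact FFF design.

```python
import random

def fff_forward(x, w_in, w_out, depth=12):
    """Sketch of fast feedforward (FFF) inference for one input vector.

    The layer holds 2**depth - 1 = 4095 neurons arranged as a balanced
    binary tree, but only the `depth` neurons on one root-to-leaf path
    are evaluated -- 12 of 4095, matching the abstract's numbers.
    """
    d = len(x)
    y = [0.0] * d
    node = 0  # start at the root neuron
    for _ in range(depth):
        # pre-activation of the current node's neuron
        act = sum(xi * wi for xi, wi in zip(x, w_in[node]))
        a = max(act, 0.0)  # ReLU here is an assumption
        for j in range(d):
            y[j] += a * w_out[node][j]
        # the sign of the pre-activation routes to the left or right child
        node = 2 * node + (1 if act > 0 else 2)
    return y

# toy usage: hidden width 8, depth-12 tree => 4095 neurons, 12 evaluated
random.seed(0)
d, depth = 8, 12
n_nodes = 2 ** depth - 1  # 4095
w_in = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_nodes)]
w_out = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_nodes)]
x = [random.gauss(0, 1) for _ in range(d)]
out = fff_forward(x, w_in, w_out, depth)
```

Because the traversal touches only `depth` neurons regardless of tree size, per-sample cost grows logarithmically in the layer width, which is the source of the claimed speedup over a dense feedforward pass.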