We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer pruning methods can complement other PEFT strategies to further reduce the computational resources of finetuning on the one hand, and can improve the memory and latency of inference on the other hand. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
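The abstract only sketches the recipe: score candidate blocks of consecutive layers by how similar the representations entering and leaving the block are, drop the most redundant block, then heal with a small amount of QLoRA finetuning. Below is a minimal illustrative sketch of the block-selection and layer-dropping steps, assuming a Hugging Face Llama-style model run with `output_hidden_states=True`; the angular-distance score and the helper names `choose_block_to_prune` and `drop_layers` are assumptions for illustration, not the authors' exact implementation.

```python
import torch

def choose_block_to_prune(hidden_states, n):
    """Score each block of n consecutive layers by the angular distance between
    the representations entering the block and those leaving it, and return the
    start index of the block whose removal is expected to hurt least.

    hidden_states: sequence of [batch, seq, dim] tensors, one per layer boundary,
    e.g. model(..., output_hidden_states=True).hidden_states. For simplicity,
    only the last token's representation is scored (an assumption).
    """
    best_start, best_dist = None, float("inf")
    for start in range(len(hidden_states) - n):
        x = hidden_states[start][:, -1, :]
        y = hidden_states[start + n][:, -1, :]
        cos = torch.nn.functional.cosine_similarity(x, y, dim=-1).clamp(-1.0, 1.0)
        dist = torch.arccos(cos).mean().item()  # angular distance: smaller = more redundant block
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist

def drop_layers(model, start, n):
    """Remove layers [start, start + n) from a Llama-style decoder stack in place."""
    kept = [layer for i, layer in enumerate(model.model.layers)
            if not (start <= i < start + n)]
    model.model.layers = torch.nn.ModuleList(kept)
    model.config.num_hidden_layers = len(kept)
    return model
```

After the block is dropped, the pruned model would then be quantized and wrapped with LoRA adapters (e.g. via the `peft` library) for the short healing finetune, which is what keeps each experiment within a single A100 GPU as described above.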
URL
Affiliations
Abstract
Translation (by gpt-3.5-turbo)
Summary (by gpt-3.5-turbo)