QLoRA: Efficient Finetuning of Quantized LLMs, Tim Dettmers+, N/A, arXiv'23 #881

AkihikoWatanabe · 2023-07-22T08:22:42Z

URL

https://arxiv.org/abs/2305.14314

Affiliations

Tim Dettmers, N/A
Artidoro Pagnoni, N/A
Ari Holtzman, N/A
Luke Zettlemoyer, N/A

Abstract

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights, (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.
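To make item (a) of the abstract more concrete, here is a minimal sketch of blockwise 4-bit quantization with levels taken from quantiles of a standard normal distribution. It is a simplified illustration, not the NF4 codebook from the paper: as far as I recall, the real NF4 levels are built asymmetrically so that zero is represented exactly, which this version does not do. The function names and the tiny 4-weight block are hypothetical and chosen only for readability.

```python
from statistics import NormalDist


def normal_quantile_levels(bits: int = 4) -> list[float]:
    """Return 2**bits levels: evenly spaced N(0, 1) quantiles rescaled to [-1, 1]."""
    n = 2 ** bits
    # Midpoint probabilities avoid the infinite quantiles at p=0 and p=1.
    probs = [(i + 0.5) / n for i in range(n)]
    raw = [NormalDist().inv_cdf(p) for p in probs]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]


def quantize_block(weights: list[float], levels: list[float]) -> tuple[list[int], float]:
    """Quantize one block: store a per-block absmax constant plus one code per weight."""
    absmax = max(abs(w) for w in weights) or 1.0  # guard against an all-zero block
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes: list[int], absmax: float, levels: list[float]) -> list[float]:
    """Map codes back to approximate weights using the stored absmax constant."""
    return [levels[c] * absmax for c in codes]


if __name__ == "__main__":
    levels = normal_quantile_levels()
    block = [0.5, -1.0, 0.25, 0.0]  # hypothetical weights, for illustration only
    codes, absmax = quantize_block(block, levels)
    print(codes, dequantize_block(codes, absmax, levels))
```

Because the levels follow normal quantiles, they are dense near zero and sparse in the tails, which is the intuition behind using such a data type for roughly normally distributed weights.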

Translation (by gpt-3.5-turbo)

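We propose QLoRA, an efficient finetuning method. It reduces memory usage enough to finetune a 65B-parameter model on a single 48GB GPU while preserving the task performance of 16-bit finetuning. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, Guanaco, outperforms all previously released models on the Vicuna benchmark and reaches 99.3% of ChatGPT's performance level, while requiring only 24 hours of finetuning on a single GPU. QLoRA introduces several innovations that save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights, (b) double quantization to reduce the average memory footprint, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models and provide a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that are infeasible with regular finetuning (33B and 65B parameter models). Our results show that QLoRA finetuning on a small, high-quality dataset yields state-of-the-art results even with models smaller than the previous SoTA. We also provide a detailed analysis of chatbot performance based on human and GPT-4 evaluations, showing that GPT-4 evaluation is a cheap and reasonable alternative to human evaluation. In addition, we find that current chatbot benchmarks are not reliable enough to accurately evaluate chatbot performance levels. We also analyze where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.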

Summary (by gpt-3.5-turbo)

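We propose QLoRA, an efficient finetuning method. It reduces memory usage enough to finetune a 65B-parameter model on a single 48GB GPU while preserving the task performance of 16-bit finetuning. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, Guanaco, outperforms all previously released models on the Vicuna benchmark and reaches 99.3% of ChatGPT's performance level, while requiring only 24 hours of finetuning on a single GPU. QLoRA introduces several innovations that save memory without sacrificing performance: 4-bit NormalFloat (NF4), an information-theoretically optimal new data type; double quantization, which reduces the average memory footprint; and paged optimizers, which manage memory spikes. We use QLoRA to finetune more than 1,000 models and provide a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that are infeasible with regular finetuning (33B and 65B parameter models). Our results show that QLoRA finetuning on a small, high-quality dataset yields state-of-the-art results even with models smaller than the previous SoTA. We also provide a detailed analysis of chatbot performance based on human and GPT-4 evaluations, showing that GPT-4 evaluation is a cheap and reasonable alternative to human evaluation. In addition, we find that current chatbot benchmarks are not reliable enough to accurately evaluate chatbot performance levels. We also analyze where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.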
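The effect of double quantization on the average memory footprint can be checked with a little arithmetic. The sketch below assumes a block size of 64 weights per 32-bit quantization constant, and a second level that stores those constants in 8 bits with one 32-bit constant per 256 of them. These numbers are not stated in this note; they are my recollection of the paper's defaults and should be treated as assumptions. The function names are hypothetical.

```python
def constant_overhead_bits(block_size: int = 64, constant_bits: int = 32) -> float:
    """Bits per parameter spent on one quantization constant per block of weights."""
    return constant_bits / block_size


def double_quant_overhead_bits(
    block_size: int = 64,     # weights per first-level constant (assumed)
    first_bits: int = 8,      # first-level constants stored in 8 bits (assumed)
    second_block: int = 256,  # first-level constants per second-level constant (assumed)
    second_bits: int = 32,    # second-level constants kept in 32 bits (assumed)
) -> float:
    """Bits per parameter when the quantization constants are themselves quantized."""
    first_level = first_bits / block_size
    second_level = second_bits / (block_size * second_block)
    return first_level + second_level


if __name__ == "__main__":
    before = constant_overhead_bits()
    after = double_quant_overhead_bits()
    params = 65e9  # the 65B model size mentioned in the abstract
    saved_gb = (before - after) * params / 8 / 1e9
    print(f"{before} -> {after} bits/param, about {saved_gb:.2f} GB saved at 65B")
```

Under these assumed settings the overhead drops from 0.5 to about 0.127 bits per parameter, which is small per weight but adds up to a few gigabytes at the 65B scale.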

AkihikoWatanabe · 2023-07-22T08:23:08Z

Implementation: https://github.com/artidoro/qlora
It is also available in PEFT.
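For the PEFT route, a setup along the following lines is a reasonable starting point. This is a sketch under stated assumptions, not the configuration used in the paper or the linked repository: the LoRA hyperparameters (`r`, `lora_alpha`, `lora_dropout`) are placeholder values, `load_qlora_model` is a hypothetical helper, and the heavy imports are kept inside the function so that the snippet can be read and imported without `torch`, `transformers`, `peft`, `bitsandbytes`, and `accelerate` installed.

```python
# Keyword arguments for transformers.BitsAndBytesConfig, mirroring the ideas
# in the abstract: 4-bit storage, the NF4 data type, and double quantization.
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}

# Keyword arguments for peft.LoraConfig. The numeric values are placeholders.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
}

# Value for the `optim` argument of transformers.TrainingArguments that
# selects a paged AdamW optimizer.
PAGED_OPTIMIZER = "paged_adamw_32bit"


def load_qlora_model(model_name: str):
    """Load a causal LM in 4-bit and wrap it with trainable LoRA adapters."""
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16, **QUANT_KWARGS
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",  # requires the accelerate package
    )
    # Prepares the frozen quantized base model for adapter training.
    model = prepare_model_for_kbit_training(model)
    # For architectures PEFT does not recognize, `target_modules` must be
    # added to LORA_KWARGS explicitly.
    return get_peft_model(model, LoraConfig(**LORA_KWARGS))
```

Calling `load_qlora_model` with a model identifier needs a CUDA GPU and downloads the weights, so it is left uncalled here.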

AkihikoWatanabe · 2023-07-26T21:38:56Z

Reference: https://twitter.com/hillbig/status/1662946722690236417?s=46&t=TDHYK31QiXKxggPzhZbcAQ

AkihikoWatanabe added action_wanted Pocket labels Jul 22, 2023

AkihikoWatanabe changed the title あ QLoRA: Efficient Finetuning of Quantized LLMs, Tim Dettmers+, N/A, arXiv'23 Jul 22, 2023

AkihikoWatanabe added Efficiency/SpeedUp MachineLearning Quantization and removed action_wanted labels Oct 21, 2023

AkihikoWatanabe added the Adapter/LoRA label Dec 4, 2023
