<h1><b>PEFT (Parameter-Efficient Fine-Tuning) and LoRA (Low-Rank Adaptation): A brief introduction to LoRA</b></h1>
<ul>
  <li>This article is licensed under a <a href=https://creativecommons.org/licenses/by/4.0/>CC BY 4.0</a> License.</li>
  <li><a href=https://github.com/Allen33669/stable_video_diffusion_project>source github</a></li>
</ul>

<h1><b>Please forgive any errors or omissions.</b></h1>
<h2>Content:</h2>
<blockquote>
<h5>
PEFT brief introduction and Why LoRA?<br>
LoRA position / layer<br>
LoRA rank<br>
LoRA learning rate<br>
LoRA architecture<br>
REFERENCES<br>
</h5>
</blockquote>

<h2>PEFT brief introduction and Why LoRA?</h2>
Parameter-Efficient Fine-Tuning (PEFT) is notably cost-effective for fine-tuning. It minimizes the number of trainable parameters and the computational overhead while aiming for performance close to that of full fine-tuning on downstream tasks [1].
<blockquote>
<h3>Source:</h3>
<blockquote>
Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies [2].<br>
Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment [3].<br>
When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method [4].<br>
</blockquote>
</blockquote>
<br>
<blockquote>
<h3>Consideration factors:</h3>
<blockquote>
Model architecture: Will the model structure change?<br>
Training: Forward propagation, backward propagation, optimization algorithms, memory usage, etc.<br>
Inference: Forward propagation, memory usage, etc.<br>
Forgetting: Can the base model parameters be preserved?<br>
Flexibility: Convenient switching of tasks, merging of knowledge, etc.<br>
Stability: Is the input stable?<br>
</blockquote>
</blockquote>
<br>
<blockquote>
<h3>PEFT types:</h3>
<blockquote>
<h5>Selective PEFT:</h5>
"Methods for this category refer to either selectively finetuning a subset of the original model’s parameters while keeping the rest frozen, or introducing a minimal number of additional parameters to train, without altering the original parameters" [1].<br>
<blockquote>
<h6>Advantage:</h6>
It has little impact on the model structure.<br>
No significant impact on inference costs: the method adds few or no parameters to the base model.<br>
<h6>Disadvantage:</h6>
Forgetting: the parameters of the base model are modified, so forgetting may occur.<br>
Memory risk: "Some techniques within this category involve the integration of a masking matrix, which results in a spike in memory usage" [1].<br>
Training time cost risk: "Some methods might lead to longer training periods due to a special selection mechanism. This could potentially offset the benefits of having fewer trainable parameters" [1].<br>
</blockquote>
</blockquote>
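A minimal sketch of the selective idea, assuming a toy parameter dictionary and a hypothetical selection rule that trains only the bias terms. The helper names `select_trainable` and `masked_sgd_step` are illustrative and do not come from [1]. Note that the mask is as large as the full parameter set, which is the kind of memory overhead the quote above warns about.

```python
def select_trainable(params, keyword="bias"):
    """Return a 0/1 mask per parameter: 1 = trainable, 0 = frozen.

    Assumption: parameters whose name contains `keyword` are selected.
    """
    return {name: [1.0 if keyword in name else 0.0] * len(values)
            for name, values in params.items()}


def masked_sgd_step(params, grads, mask, lr=0.1):
    """Apply plain SGD only where the mask is 1; frozen entries stay unchanged."""
    return {name: [p - lr * m * g
                   for p, g, m in zip(params[name], grads[name], mask[name])]
            for name in params}


# Toy model: one weight vector and one bias vector (flattened for simplicity).
params = {"layer0.weight": [0.5, -0.2, 0.1], "layer0.bias": [0.0, 0.3]}
grads = {"layer0.weight": [1.0, 1.0, 1.0], "layer0.bias": [1.0, -1.0]}

mask = select_trainable(params)
updated = masked_sgd_step(params, grads, mask)

trainable = sum(int(m) for values in mask.values() for m in values)
total = sum(len(values) for values in params.values())
print(f"trainable parameters: {trainable}/{total}")
```

Only the two bias entries change in `updated`; the weight entries are returned untouched.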
<br>
<blockquote>
<h5>Additive PEFT:</h5>
"The core idea behind adapters is to learn a set of parameters that can transform the output of one layer into the input of the next layer in a given task-specific way. Adapters are small parameter sets that can be inserted between the layers of FMs. They allow the network to be fine-tuned for a new task without modifying its original parameters" [1].<br>
<blockquote>
<h6>Advantage:</h6>
Keep the original parameters of the base model.<br>
Easy to switch tasks: Can use different adapters for different tasks.<br>
Integrate the knowledge: "Can integrate the knowledge of various tasks without forgetting the knowledge from previous tasks" [1].<br>
<h6>Disadvantage:</h6>
Inference cost: "This category may cause an increase in inference overhead due to the additional computation required by the adapter layer" [1].<br>
Configuration issue: "This category of methods may require careful initialization and training strategies, such as optimal settings of adapter dimensions and sparsity rates" [1].<br>
</blockquote>
</blockquote>
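The adapter idea can be sketched as a small bottleneck with a residual connection. This is only an illustration under assumptions: the ReLU non-linearity, the zero-initialised up-projection, and the toy sizes are choices made here, not details taken from [1]. The two extra matrix-vector products are the additional inference computation mentioned above.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


def adapter(hidden, down, up):
    """Bottleneck adapter with a residual connection: h + up(relu(down(h)))."""
    bottleneck = relu(matvec(down, hidden))
    return [h + d for h, d in zip(hidden, matvec(up, bottleneck))]


hidden_size, bottleneck_size = 4, 2
# Down-projection (bottleneck_size x hidden_size), small arbitrary values.
down = [[0.1, 0.0, -0.1, 0.2],
        [0.0, 0.3, 0.1, 0.0]]
# Up-projection (hidden_size x bottleneck_size) initialised to zero, so the
# adapter starts as an identity mapping and the frozen model is unchanged.
up = [[0.0] * bottleneck_size for _ in range(hidden_size)]

hidden = [1.0, 2.0, 3.0, 4.0]
output = adapter(hidden, down, up)

# Parameters added by one adapter (biases omitted in this sketch).
extra_params = 2 * hidden_size * bottleneck_size
print(output, extra_params)
```

Swapping tasks then amounts to swapping the `down` and `up` matrices while the base model stays fixed.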
<br>
<blockquote>
<h5>Prompt PEFT:</h5>
"This category involves incorporating a carefully designed prompt into the input or the transformer’s layers, aiming to align the input distribution with the original training data and guide the model toward generating the desired output" [1].<br>
<blockquote>
<h6>Advantage:</h6>
Keep the original parameters of the base model: often only the prompt embeddings are adjusted.<br>
Possibly no training cost: hard prompts, for example, do not require retraining the model.<br>
<h6>Disadvantage:</h6>
Performance is sometimes poor compared to other methods.<br>
Poor Transferability: "Some prompts trained for a specific task cannot be directly transferred to other tasks. Because the prompt vectors for each task are optimized based on the data and features of that task, they have strong task-specific characteristics and are not easily generalized across different tasks" [1].<br>
Model dependency: "This category of PEFT relies on the model’s already possessed capabilities. If the FMs have some deficiencies, it is difficult to compensate for these shortcomings through prompt tuning, and the room for performance improvement is limited" [1].<br>
Unstable: "Sometimes relies on human input" [1].<br>
Suboptimal performance: "Including prompt tokens in the input sequence can reduce the effective sequence length, potentially leading to suboptimal performance" [1].<br>
</blockquote>
</blockquote>
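A rough sketch of a soft prompt, assuming trainable prompt vectors are simply prepended to the frozen token embeddings and the result is truncated to a fixed context window. The window size, number of virtual tokens, and helper name are hypothetical. It illustrates why prompt tokens reduce the effective sequence length.

```python
def prepend_soft_prompt(token_embeddings, prompt_embeddings, max_length):
    """Prepend trainable prompt vectors and truncate to the context window."""
    combined = prompt_embeddings + token_embeddings
    return combined[:max_length]


embedding_dim = 3
max_length = 8            # hypothetical context window
num_prompt_tokens = 2     # hypothetical number of virtual tokens

# Trainable prompt vectors (the only parameters that would be updated).
prompt_embeddings = [[0.01] * embedding_dim for _ in range(num_prompt_tokens)]
# Frozen embeddings of 8 input tokens (toy values).
token_embeddings = [[float(i)] * embedding_dim for i in range(8)]

model_input = prepend_soft_prompt(token_embeddings, prompt_embeddings, max_length)
tokens_kept = len(model_input) - num_prompt_tokens
print(f"input tokens that still fit: {tokens_kept} of {len(token_embeddings)}")
```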
<br>
<blockquote>
<h5>Reparameterization PEFT:</h5>
"This technique reparameterizes the low-dimensional representation of the initial model parameters for training while converting the weights back for inference" [1].<br>
<blockquote>
<h6>Advantage:</h6>
Keep the original parameters of the base model.<br>
High flexibility: "It can be applied to almost all mainstream models and is very flexible, allowing for rapid adaptation to new tasks and domains" [1].<br>
Easy to switch tasks: Can use different LoRAs for different tasks.<br>
Integrate the knowledge: Can integrate the knowledge of various tasks without forgetting the knowledge from previous tasks.<br>
No significant impact on inference costs: the low-rank update can be merged into the model weights before inference.<br>
<h6>Disadvantage:</h6>
Hyperparameters sensitivity: "This type of method is sensitive to hyperparameters. Like, the rank of the inserted adaptation matrices significantly impacts the ability to adapt the model to a new task" [1].<br>
Limited representation: "This category of PEFT assumes that model adaptations can be represented using low-rank matrices. In tasks where the feature space is highly complex, this assumption may limit expressiveness and lead to suboptimal performance" [1].<br>
</blockquote>
</blockquote>
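The "train in low rank, convert back for inference" idea can be sketched with a tiny LoRA-style example. The matrix values and the scale are made up for illustration, and the pure-Python helpers are only there to keep the sketch dependency-free. It checks that folding the low-rank update into the frozen weight gives the same output as keeping a separate low-rank path.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Frozen base weight W (2 x 3) and a rank-1 update B (2 x 1) @ A (1 x 3).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
B = [[0.5], [-0.5]]
A = [[1.0, 2.0, 0.0]]
scale = 2.0                       # plays the role of alpha / r (assumed value)
x = [[1.0], [1.0], [1.0]]         # input as a column vector

# Training-time view: base path plus a separate low-rank path.
unmerged = add(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Inference-time view: fold the update into W once, then use a single matmul.
W_merged = add(W, matmul(B, A), scale)
merged = matmul(W_merged, x)

print(unmerged, merged)
```

Because `W_merged` has the same shape as `W`, the merged model needs no extra computation at inference, and keeping `A` and `B` separately allows switching tasks by merging a different pair.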
<br>
</blockquote>
<br>

<h2>LoRA position / layer</h2>
Where to apply LoRA? Transformer layers? First layer? Last layer? All layers?
<blockquote>
<h3>Source:</h3>
<blockquote>
Tied-Lora: Enhancing parameter efficiency of LoRA with weight tying [5].<br>
LoRA-drop: Efficient LoRA Parameter Pruning based on Output Evaluation [6].<br>
AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning [7].<br>
</blockquote>
Refer to Table III [5], "Interestingly, we note that when applying LoRA to a single transformer layer, the lower layers (usually layer 4 or 8) resulted in higher performance than higher layers. This suggests that there is potentially a single low-rank update that can be applied to all layers to boost performance, but it is hard to find a low-rank update for a single-layer that results in strong performance." [5].<br><br>
Refer to Fig. 2 [6], "Observations show that the squared norm of ∆W<sub>i</sub>x<sub>i</sub> for certain layers consistently remains close to zero, indicating that LoRA for these layers has almost no impact on the frozen model. Conversely, some layers show a more significant impact on the frozen model." [6].<br><br>
Refer to Fig. 4 [6], "We observe that the importance distributions differ across datasets, indicating that the importance assigned by LoRA is data-dependent." [6].<br><br>
Refer to Fig. 5 [6], "The results of LoRA for Query and Value are shown in Figure 5 and Figure 12. As the training data increases, the importance order of each layer remains consistent. For LoRA applied to the query matrices, the 10th layer has always been the most important, while the importance of layers 7, 8, and 9 maintains a consistently high level of importance. Indicating that this operation is insensitive to the size of the sampled data and exhibits robustness." [6].<br><br>
Refer to Fig. 1 [7], "We compare the performance of LoRA when fine-tuning specific modules or layers with the same number of trainable parameters. Figure 1a shows that fine-tuning feed-forward networks (FFN) achieves better performance than self-attention modules. In addition, Figure 1b demonstrates that weight matrices in top layers are more important than those in bottom layers." [7].<br><br>
Refer to Fig. 3 [7], "Figure 3 shows the resulting rank of each incremental matrix of DeBERTaV3-base fine-tuned with AdaLoRA. We find that AdaLoRA always prefers to allocating more budget to FFNs and top layers. Such behavior aligns with our empirical conclusions presented in Figure 1 that weight matrices of FFN modules and top layers are more important for model performance." [7].<br><br>
<br>
<br>
<br>
Importance is not uniform: the importance of a LoRA module may vary with the layer, the position (such as Q, K, V, or FFN), the task, etc.<br><br>
Performance: Generally speaking, top layers may be more influential [7], although [5] reports that lower layers performed better when LoRA was applied to a single layer. Increasing the amount of training data may not change the importance ordering of the layers much [6].<br><br>
</blockquote>
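The output-based notion of importance quoted from [6] can be sketched as follows. This is a simplified illustration rather than the procedure of the paper: it assumes the LoRA outputs of each layer have already been collected for a few samples, and the layer names and numbers are made up.

```python
def squared_norm(vector):
    return sum(v * v for v in vector)


def layer_importance(lora_outputs):
    """Average squared norm of each layer's LoRA output, normalised to sum to 1.

    `lora_outputs[layer]` is a list of LoRA output vectors (one per sample).
    """
    raw = {layer: sum(squared_norm(v) for v in outputs) / len(outputs)
           for layer, outputs in lora_outputs.items()}
    total = sum(raw.values())
    return {layer: value / total for layer, value in raw.items()}


# Toy LoRA outputs for three layers and two samples (made-up numbers).
lora_outputs = {
    "layer_0": [[0.0, 0.1], [0.1, 0.0]],
    "layer_6": [[1.0, 1.0], [1.0, -1.0]],
    "layer_11": [[2.0, 0.0], [0.0, 2.0]],
}

importance = layer_importance(lora_outputs)
ranked = sorted(importance, key=importance.get, reverse=True)
print(ranked)
```

A layer whose score stays near zero contributes almost nothing to the frozen model's output, which is the situation described in the quote from Fig. 2 [6].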


<h2>LoRA rank</h2>
What rank is suitable?
<blockquote>
<h3>Source:</h3>
<blockquote>
AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning [7].<br>
Flora: Low-Rank Adapters Are Secretly Gradient Compressors [8].<br>
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA [9].<br>
LoRA Training in the NTK Regime has No Spurious Local Minima [10].<br>
DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation [11].<br>
</blockquote>
Refer to Fig. 2 [7], "Figure 2 illustrates experimental results of fine-tuning DeBERTaV3-base under different budget levels." [7]. Performance generally increases with the proportion of trainable LoRA parameters.<br><br>
Refer to Table I [8], Table II [8], and Table IV [8]. As the rank increases, memory usage increases and performance also increases, but not proportionally. Sometimes the performance gain is smaller than the relative memory increase, and sometimes it is larger.<br><br>
Refer to Fig. 2 [9], "The study (Ding et al., 2022) asserts that fine-tuning on an increased number of parameters tends to perform better, with full-model fine-tuning consistently outperforming parameter efficient methods. Therefore we have reason to conjecture that training with larger ranks should outperform training with smaller ranks. Indeed, as illustrated in figure 2, we find that rsLoRA unlocks this performance increase for larger ranks, while LoRA’s overly aggressive scaling factor collapses and slows learning with larger ranks such that there is little to no performance difference when compared to low ranks." [9].<br><br>
Refer to Fig. 3 [9], "Validating our predictions, we illustrate in figure 3 that LoRA has collapsing gradients with higher ranks, whereas rsLoRA maintains the same gradient norm for each rank at the onset of training, while the norms remain approximately within the same order of magnitude throughout the training process." [9].<br><br>
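To make the two scaling factors concrete: classic LoRA multiplies the low-rank update by alpha / r, while rsLoRA [9] uses alpha / sqrt(r). The sketch below only tabulates both factors; the value of alpha is an arbitrary assumption for illustration.

```python
import math


def lora_scale(alpha, rank):
    """Classic LoRA scaling factor: alpha / r."""
    return alpha / rank


def rslora_scale(alpha, rank):
    """Rank-stabilised scaling factor from [9]: alpha / sqrt(r)."""
    return alpha / math.sqrt(rank)


alpha = 16  # hypothetical value, chosen only for illustration
for rank in (4, 16, 64, 256):
    print(f"r={rank:3d}  LoRA: {lora_scale(alpha, rank):.4f}  "
          f"rsLoRA: {rslora_scale(alpha, rank):.4f}")
```

The alpha / r factor shrinks much faster as the rank grows, which is consistent with the slowed learning at higher ranks described in the quotes above.<br><br>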
Refer to Fig. 2 [10] and Fig. 3 [10]. The training loss of LoRA gradually converges to that of full fine-tuning as the number of epochs increases, which means that training with LoRA needs more epochs. In addition, convergence is faster as the rank increases, but this effect becomes negligible once the rank exceeds a certain level.<br><br>
Refer to Table II [11]. As LoRA's rank increases, performance improves, but with diminishing returns.<br><br>
<br>
<br>
<br>
Performance: Generally speaking, performance increases with LoRA's rank, but with diminishing returns and not in proportion to the rank. Sometimes performance even decreases as the rank increases. As the rank increases, the influence of any single LoRA parameter decreases.<br><br>
Training cost: Although LoRA reduces the number of trainable parameters, it requires more epochs. As the rank increases, the training loss decreases faster, but the effect is negligible once the rank exceeds a threshold.<br><br>
Memory usage: As the LoRA rank increases, memory usage also increases, mainly because the optimizer needs more memory.<br><br>
</blockquote>
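The link between rank, trainable parameters, and optimizer memory can be sketched with simple arithmetic. The projection size of 4096 is a hypothetical choice, and the optimizer-state estimate assumes an Adam-style optimizer that stores two moment values per trainable parameter.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def adam_state_count(trainable_params):
    """Adam keeps two moment estimates per trainable parameter."""
    return 2 * trainable_params


d_in = d_out = 4096  # hypothetical projection size
full = d_in * d_out
for rank in (4, 16, 64):
    lora = lora_param_count(d_in, d_out, rank)
    print(f"r={rank:2d}  LoRA params: {lora:,}  "
          f"share of full matrix: {lora / full:.2%}  "
          f"Adam state values: {adam_state_count(lora):,}")
```

Both counts grow linearly with the rank, so doubling the rank doubles the optimizer state for the adapted matrices, whereas the measured performance gains reported above are usually less than proportional.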

<h2>LoRA learning rate</h2>
<blockquote>
<h3>Source:</h3>
<blockquote>
LoRA+: Efficient Low Rank Adaptation of Large Models [12].<br>
</blockquote>
Refer to Fig. 2 [12], "We observe that both the best train and test losses are consistently achieved by a combination of learning rates where η<sub>B</sub> ≫ η<sub>A</sub>, which validates our analysis in the previous section. Notice also that optimal learning rates (η<sub>A</sub>, η<sub>B</sub>) are generally close to the edge of stability, a well-known behaviour in training dynamics of deep networks (Cohen et al., 2021)." [12].<br><br>
<br>
<br>
<br>
Performance: Generally speaking, the learning rate of the project-up matrix B should be set higher than that of the project-down matrix A; this yields a lower training loss than using the same learning rate for both matrices.<br><br>
</blockquote>
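One way to give B a larger learning rate than A is to place them in separate optimizer parameter groups. The sketch below works on parameter names only; the "lora_A" / "lora_B" naming convention, the helper name, and the numeric values are assumptions for illustration and are not recommendations from [12].

```python
def build_param_groups(param_names, lr_a, lr_ratio):
    """Group parameter names so that B matrices get lr_a * lr_ratio.

    Assumption: LoRA parameters are named with "lora_A" / "lora_B" substrings.
    """
    group_a = [name for name in param_names if "lora_A" in name]
    group_b = [name for name in param_names if "lora_B" in name]
    return [
        {"params": group_a, "lr": lr_a},
        {"params": group_b, "lr": lr_a * lr_ratio},
    ]


param_names = [
    "layers.0.attn.q_proj.lora_A",
    "layers.0.attn.q_proj.lora_B",
    "layers.0.attn.v_proj.lora_A",
    "layers.0.attn.v_proj.lora_B",
]

# lr_a and lr_ratio are placeholder values, not recommendations from [12].
groups = build_param_groups(param_names, lr_a=1e-4, lr_ratio=16)
for group in groups:
    print(group["lr"], group["params"])
```

The list-of-dictionaries layout mirrors the per-parameter-group format accepted by PyTorch optimizers, where "params" would hold the tensors themselves instead of their names.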

<h2>LoRA architecture</h2>
The structure of LoRA is not limited to the classic project-up matrix B and project-down matrix A; there are also variants such as tensor-train-based low-rank adaptation and singular-value-decomposition-based adaptation.
<blockquote>
<h3>Source:</h3>
<blockquote>
AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning [7].<br>
Tensor Train Low-rank Approximation (TT-LoRA): Democratizing AI with Accelerated LLMs [13].<br>
</blockquote>
AdaLoRA: AdaLoRA parameterizes the incremental update in the form of a singular value decomposition (SVD) and then prunes singular values according to an importance score.<br><br>
TT-LoRA: TT-LoRA uses tensor train decomposition. The matrix is decomposed into many small tensor cores. The head and tail cores are 2D, and all the intermediate cores are 3D.<br><br>
Different LoRA structures may result in fewer parameters, but the computational cost required for training varies according to different methods. At the same time, the phenomena observed in the classic LoRA structure may not necessarily apply to other LoRA structures.<br><br>
</blockquote>
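A back-of-the-envelope comparison of parameter counts for the three structures mentioned above. This is a sketch under assumptions: the tensor-train count uses a uniform TT rank and a hypothetical factorisation of a 1024 x 1024 weight into six modes, which may differ from the configurations used in [13], and the SVD-form count ignores any pruning.

```python
def lora_params(d_in, d_out, rank):
    """Classic LoRA: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


def svd_form_params(d_in, d_out, rank):
    """SVD-style update P * diag(lambda) * Q: two factors plus `rank` singular values."""
    return rank * (d_in + d_out) + rank


def tensor_train_params(mode_sizes, tt_rank):
    """Tensor-train cores with a uniform TT rank.

    The first and last cores are 2D (size x rank); the middle cores are 3D
    (rank x size x rank).
    """
    first, *middle, last = mode_sizes
    return (first * tt_rank
            + sum(tt_rank * size * tt_rank for size in middle)
            + tt_rank * last)


d_in = d_out = 1024
# Hypothetical factorisation of the 1024 x 1024 weight into six modes,
# since 16 * 16 * 4 * 4 * 16 * 16 == 1024 * 1024.
mode_sizes = [16, 16, 4, 4, 16, 16]

print(lora_params(d_in, d_out, rank=8))
print(svd_form_params(d_in, d_out, rank=8))
print(tensor_train_params(mode_sizes, tt_rank=8))
```

The counts say nothing about training cost: contracting many small cores or maintaining importance scores adds computation that the parameter count alone does not show.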

<h2>REFERENCES</h2>
<blockquote>
[1] Dan Zhang, Tao Feng, Lilong Xue, Yuandong Wang, Yuxiao Dong, Jie Tang, "Parameter-Efficient Fine-Tuning for Foundation Models," arXiv, arXiv:2501.13787v1, 2025.<br>
[2] Luping Wang, Sheng Chen, Linnan Jiang, Shu Pan, Runze Cai, Sen Yang, Fei Yang, "Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies," arXiv, arXiv:2410.19878v3, 2025.<br>
[3] Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, Fu Lee Wang, "Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment," arXiv, arXiv:2312.12148v1, 2023.<br>
[4] Biao Zhang, Zhongtao Liu, Colin Cherry, Orhan Firat, "When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method," arXiv, arXiv:2402.17193v1, 2024.<br>
[5] Adithya Renduchintala, Tugrul Konuk, Oleksii Kuchaiev, "Tied-Lora: Enhancing parameter efficiency of LoRA with weight tying," arXiv, arXiv:2311.09578v2, 2024.<br>
[6] Hongyun Zhou, Xiangyu Lu, Wang Xu, Conghui Zhu, Tiejun Zhao, Muyun Yang, "LoRA-drop: Efficient LoRA Parameter Pruning based on Output Evaluation," arXiv, arXiv:2402.07721v2, 2024.<br>
[7] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, Tuo Zhao, "AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning," arXiv, arXiv:2303.10512v2, 2023.<br>
[8] Yongchang Hao, Yanshuai Cao, Lili Mou, "Flora: Low-Rank Adapters Are Secretly Gradient Compressors," arXiv, arXiv:2402.03293v2, 2024.<br>
[9] Damjan Kalajdzievski, "A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA," arXiv, arXiv:2312.03732v1, 2023.<br>
[10] Uijeong Jang, Jason D. Lee, Ernest K. Ryu, "LoRA Training in the NTK Regime has No Spurious Local Minima," arXiv, arXiv:2402.11867v3, 2024.<br>  
[11] Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, Ali Ghodsi, "DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation," arXiv, arXiv:2210.07558v2, 2023.<br>  
[12] Soufiane Hayou, Nikhil Ghosh, Bin Yu, "LoRA+: Efficient Low Rank Adaptation of Large Models," arXiv, arXiv:2402.12354v2, 2024.<br>
[13] Afia Anjum, Maksim E. Eren, Ismael Boureima, Boian Alexandrov, Manish Bhattarai, "Tensor Train Low-rank Approximation (TT-LoRA): Democratizing AI with Accelerated LLMs," arXiv, arXiv:2408.01008v1, 2024.<br>
</blockquote>