# What you need to know about fine-tuning

1. PEFT, or Parameter-Efficient Fine-Tuning methods
2. Quantization and the main target precision formats
3. LoRA & QLoRA
4. SFT: Supervised Fine-Tuning
5. RLHF: Reinforcement Learning from Human Feedback
6. Basic system requirements
7. Industry standard options available

# 1. PEFT

Parameter-Efficient Fine-Tuning (PEFT) refers to methods that reduce the number of trainable parameters and the memory used during fine-tuning, while achieving performance comparable to full fine-tuning of LLMs such as Llama 3.

Notable examples: LoRA and QLoRA.

[paper](https://arxiv.org/abs/2312.12148#:~:text=Parameter%20Efficient%20Fine%2DTuning%20(PEFT,performance%20to%20full%20fine%2Dtuning.))
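To make the "fewer trainable parameters" claim concrete, here is a minimal arithmetic sketch. It assumes a LoRA-style adapter made of two small matrices of shapes (d x r) and (r x k) attached to one (d x k) weight matrix; the layer size and rank are hypothetical values chosen for illustration, not taken from any specific model.

```python
def full_finetune_params(d: int, k: int) -> int:
    """Trainable parameters when a d x k weight matrix is updated directly."""
    return d * k


def low_rank_adapter_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r adapter: one d x r and one r x k matrix."""
    return r * (d + k)


# Hypothetical layer shape and rank, for illustration only.
d, k, r = 4096, 4096, 8
full = full_finetune_params(d, k)
adapter = low_rank_adapter_params(d, k, r)

print(f"full fine-tuning: {full} trainable parameters")
print(f"rank-{r} adapter:  {adapter} trainable parameters")
print(f"the adapter trains 1/{full // adapter} of the parameters")
```

The saving grows with the layer size, because the adapter scales with d + k while the full matrix scales with d * k.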

# 2. Quantization and the main target precision formats

- Quantization is the process of converting a floating-point model into a fixed-point or integer representation.
- The benefits include smaller model size, faster inference speed, and lower power consumption.

## Intuition

- Numbers are represented in computers as sequences of bits.
- Integers can be unsigned or signed and have fixed bit lengths (8, 16, 32, 64).
- Real numbers can be represented in fixed-point or floating-point formats. A floating-point representation consists of a sign bit, an exponent, and a mantissa.
- Quantization reduces the precision of a model by storing each value with fewer bits, so only a limited set of values can be represented.
- The process involves selecting appropriate quantization levels (for integers) or bit allocations between exponent and mantissa (for floating-point) so that the loss in accuracy is minimal.
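The sign, exponent, and mantissa fields mentioned above can be inspected directly. The following is a small sketch using only the Python standard library; the helper name `float32_fields` is made up for this note.

```python
import struct


def float32_fields(x: float) -> tuple[int, int, int]:
    """Split a value (rounded to float32) into its sign, exponent and mantissa bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF      # 23 bits, with an implicit leading 1 for normal numbers
    return sign, exponent, mantissa


sign, exponent, mantissa = float32_fields(-6.5)  # -6.5 = -1.625 * 2**2
print("sign:", sign)
print("unbiased exponent:", exponent - 127)
print("mantissa bits:", f"{mantissa:023b}")
```

A lower-precision format keeps the same three fields but gives fewer bits to the exponent, the mantissa, or both.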

## Types of Quantization

- There are three main types of quantization: post-training quantization, quantization-aware training, and dynamic quantization.
- Post-training quantization applies quantization to the weights and activations of a pre-trained floating-point model without retraining.
- Quantization-aware training simulates quantization during training so that the model learns weights that stay accurate once quantized.
- Dynamic quantization quantizes the weights ahead of time and computes the quantization parameters for activations at runtime, based on the input data.
- The paper "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference" provides further information on quantization techniques.
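The difference between a scale fixed ahead of time and one computed at runtime can be sketched in a few lines. This is a simplified illustration with a symmetric int8 scheme; the helper names and the toy numbers are assumptions made for this note, not the API of any library.

```python
def absmax_scale(values, qmax=127):
    """Symmetric scale that maps the largest magnitude onto qmax."""
    return max(abs(v) for v in values) / qmax or 1.0


def quantize_symmetric(values, scale, qmax=127):
    """Round to the nearest integer level and clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


# Static style: the activation scale is fixed once from calibration data.
calibration_batch = [0.1, -0.8, 0.5, 2.0]
static_scale = absmax_scale(calibration_batch)

# Dynamic style: the scale is recomputed from each input at runtime.
runtime_input = [0.02, -0.05, 0.04]
dynamic_scale = absmax_scale(runtime_input)

static_q = quantize_symmetric(runtime_input, static_scale)
dynamic_q = quantize_symmetric(runtime_input, dynamic_scale)
print("static scale :", static_q)   # small inputs land on only a few levels
print("dynamic scale:", dynamic_q)  # the input is spread over the int8 range
```

The trade-off is that the dynamic variant pays for computing the scale on every input, while the static variant depends on how representative the calibration data is.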

In essence, quantization represents weights and activations in lower-precision data types, such as 8-bit integers (int8), instead of the standard 32-bit floating point (float32).

This not only reduces memory usage and power consumption but can also make operations faster, because integer arithmetic is cheaper than floating-point arithmetic. It also makes deployment possible on embedded devices that may only support integer data types.
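A minimal sketch of float32-to-int8 quantization, assuming the common affine scheme in which a real value is approximated as `scale * (q - zero_point)`. The function names and example weights are illustrative and not tied to any framework.

```python
def quantization_params(values, qmin=-128, qmax=127):
    """Pick scale and zero point so [min, max] (extended to include 0) maps onto [qmin, qmax]."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map real values to integer levels, clamping to the int8 range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integer levels back to approximate real values."""
    return [scale * (q - zero_point) for q in q_values]


weights = [-1.0, -0.253, 0.0, 0.417, 1.55]  # toy float weights
scale, zero_point = quantization_params(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", q)
print("max round-trip error:", max_error, "(scale is", scale, ")")
```

Each int8 value takes one byte instead of four, and the round-trip error stays within about half a quantization step as long as no value is clamped.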

# LoRA Notes

[paper](https://arxiv.org/abs/2106.09685)

1. **Low-Rank Decomposition**: A low-rank decomposition represents a large matrix as the product of two much smaller matrices. In LoRA, the pre-trained weight matrix W0 (d x k) is kept frozen, and the update to it is represented as the product BA, where B has dimensions (d x r) and A has dimensions (r x k), with 'r' being the rank of the decomposition and much smaller than d and k.

2. **Efficient Parameter Updates**: Only the small matrices A and B are trained during adaptation; the original parameters are not updated. This reduces the number of trainable parameters and the memory needed for gradients and optimizer state, making adaptation more efficient.

3. **Capturing Important Features**: The low-rank matrices capture the task-specific change to the weights. The LoRA paper's hypothesis is that this change has a low intrinsic rank, so a small r can reach quality comparable to full fine-tuning on the tasks it evaluates.

4. **Storage and Computational Efficiency**: Each task only needs its own small A and B stored alongside a shared base model, which cuts storage and makes switching tasks cheap. The product BA can be merged into the base weights for deployment, so inference adds no extra latency compared with the original model.
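The forward pass with a low-rank update can be sketched in plain Python. This is a toy illustration, not the reference implementation: it assumes the formulation from the LoRA paper, h = W0 x + (alpha / r) B A x, with B initialized to zero and A initialized randomly. The helper names and the tiny sizes are made up for this note.

```python
import random


def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def matmul(X, Y):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


def lora_forward(W0, A, B, x, alpha, r):
    """h = W0 x + (alpha / r) * B (A x); W0 is frozen, only A and B would be trained."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def merge(W0, A, B, alpha, r):
    """Fold the adapter into the base weight: W = W0 + (alpha / r) * B A."""
    BA = matmul(B, A)
    return [[w + (alpha / r) * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, BA)]


random.seed(0)
d, k, r, alpha = 4, 3, 2, 4  # toy sizes; alpha is the LoRA scaling hyperparameter
W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]    # frozen, d x k
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, r x k
B = [[0.0] * r for _ in range(d)]                                  # trainable, d x r, zero init
x = [1.0, -2.0, 0.5]

# With B at zero the adapter contributes nothing, so the model starts from the base behavior.
unchanged = lora_forward(W0, A, B, x, alpha, r) == matvec(W0, x)

B[0][0] = 0.5  # stand-in for what a training step might do to B
h_adapter = lora_forward(W0, A, B, x, alpha, r)
h_merged = matvec(merge(W0, A, B, alpha, r), x)
gap = max(abs(a - b) for a, b in zip(h_adapter, h_merged))
print("starts from base output:", unchanged)
print("adapter vs merged difference:", gap)
```

Merging gives the same output as keeping the adapter separate (up to floating-point rounding), which is why a merged LoRA model runs at the same speed as the original.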

# QLoRA Notes

[paper](https://arxiv.org/abs/2305.14314)

QLoRA is a method for fine-tuning large language models (LLMs) using low-rank adapters (LoRA) on top of a 4-bit quantized base model, achieving performance comparable to full 16-bit fine-tuning.

- LoRA adapters are learnable, low-rank matrices that adapt the weight matrices of an LLM to specific tasks by adding a small number of task-specific parameters.
- QLoRA quantizes the pretrained base model to 4 bits and keeps it frozen. Gradients are backpropagated through the quantized weights into the LoRA adapters, which are the only parameters trained. The paper introduces the 4-bit NormalFloat (NF4) data type, double quantization of the quantization constants, and paged optimizers to make this work.
- The main advantage of QLoRA is a large reduction in memory requirements, which makes it possible to fine-tune larger models on a single GPU.
- To evaluate instruction-following models, the paper uses MMLU (Massive Multitask Language Understanding) as a language-understanding benchmark (reported as 5-shot accuracy), and the Vicuna and OA (OpenAssistant) benchmarks for chatbot evaluation.
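The idea of storing frozen weights in 4 bits with one scale per block can be sketched as follows. This is a simplified stand-in: it uses uniform signed levels in [-7, 7], whereas QLoRA's NF4 data type uses non-uniform levels, and the block size of 4 is chosen only to keep the example small. All names are illustrative.

```python
def quantize_blockwise_4bit(weights, block_size=4):
    """Quantize to signed 4-bit codes in [-7, 7] with one absmax scale per block.

    Uniform levels are an assumption made for simplicity; NF4 is not uniform.
    """
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block) / 7 or 1.0
        scales.append(scale)
        codes.extend(round(w / scale) for w in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Recover approximate weights; in QLoRA this happens on the fly for each forward pass."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# Toy frozen base weights: one block of small values, one block of large values.
frozen = [0.02, -0.07, 0.05, 0.01, 1.4, -0.85, 0.35, 0.65]

codes, scales = quantize_blockwise_4bit(frozen, block_size=4)
restored = dequantize_blockwise(codes, scales, block_size=4)

# A single scale for all weights lets the large values dominate the small ones.
coarse_codes, coarse_scales = quantize_blockwise_4bit(frozen, block_size=8)

print("per-block codes:", codes)
print("single-scale codes:", coarse_codes)
```

With one scale per block the small weights keep distinct codes, while with a single shared scale they all collapse to zero. The trainable LoRA matrices stay in higher precision and are added on top of the dequantized weights.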

# Supervised Fine-Tuning (SFT)

[Llama 2 paper with SFT explanation](https://arxiv.org/pdf/2307.09288)

The model is trained on a dataset of instructions paired with reference responses. Training adjusts the weights of the LLM to minimize the difference between the generated answers and the ground-truth responses, which act as labels.
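A minimal sketch of that objective, assuming the usual next-token cross-entropy loss computed only on response tokens (prompt tokens are masked out). The toy vocabulary, logits, and function names are made up for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sft_loss(logits_per_position, target_ids, loss_mask):
    """Average cross-entropy over positions where loss_mask is 1 (the response tokens)."""
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_position, target_ids, loss_mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    return total / count


# Toy vocabulary of 3 tokens; the first two positions belong to the prompt.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
targets = [0, 1, 2, 0]
mask = [0, 0, 1, 1]

loss = sft_loss(logits, targets, mask)
print("SFT loss on response tokens:", loss)
```

The loss is low where the model puts high probability on the reference token (third position) and higher where it is undecided (fourth position).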

# RLHF

[llama 2 paper with RLHF explanation](https://arxiv.org/pdf/2307.09288)

Reinforcement Learning from Human Feedback (RLHF): The model generates outputs and is trained to maximize a reward signal, for example with PPO (Proximal Policy Optimization). The reward usually comes from a reward model trained on human preference judgments of model outputs.
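The reward signal is commonly learned from pairs of responses where a human marked one as preferred. A minimal sketch of the pairwise ranking loss typically used for this, -log(sigmoid(r_chosen - r_rejected)), is given here; the scores are made-up numbers and the function name is illustrative.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    diff = reward_chosen - reward_rejected
    # softplus(-diff) = max(-diff, 0) + log(1 + exp(-|diff|))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Hypothetical reward-model scores for a human-preferred and a rejected response.
good_ranking = pairwise_reward_loss(2.0, 0.0)  # preferred response scored higher
bad_ranking = pairwise_reward_loss(0.0, 2.0)   # preferred response scored lower
tie = pairwise_reward_loss(1.0, 1.0)

print("correct ranking loss:", good_ranking)
print("wrong ranking loss:  ", bad_ranking)
print("tie loss:            ", tie)
```

Minimizing this loss pushes the reward model to score preferred responses higher; the trained reward model then provides the signal that the policy is optimized against.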

# Basic System Requirements

A consumer-grade GPU with 24GB of VRAM seems to be the low end of the ballpark.
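A rough back-of-the-envelope sketch of why precision matters for that figure. It counts only the memory needed to hold the weights and assumes a hypothetical 8-billion-parameter model. Gradients, optimizer state, and activations need additional memory during training, so real requirements are higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9  # hypothetical model size, for illustration only

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, 16-bit weights alone take most of a 24GB card, which is why 4-bit approaches such as QLoRA are attractive on consumer hardware.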

# Industry Standard Options Available


- [Together AI](https://docs.together.ai/docs/fine-tuning-cli)
- [Llama 3 fine-tuning with AutoTrain by Hugging Face](https://ubiai.tools/how-to-fine-tune-llama3-using-autotrain-a-step-by-step-guide/)
- [Llama 3 fine-tuning on the Alpaca dataset using Unsloth](https://colab.research.google.com/drive/1mPw6P52cERr93w3CMBiJjocdTnyPiKTX#scrollTo=6bZsfBuZDeCL)
- [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) (special mention)
- [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory?tab=readme-ov-file#supported-models) (GitHub repo)
- replicate.ai may have some options too