# Fine-tuning Large Language Models with QLoRA and FSDP

Created by rUv, 'cause I could. This notebook demonstrates how to fine-tune large language models such as Llama 2 70B efficiently on consumer hardware, using Quantized LoRA (QLoRA) and Fully Sharded Data Parallel (FSDP).

Key concepts:
- QLoRA keeps the base model frozen in 4-bit quantized form and trains only the small LoRA adapter layers
- FSDP shards the model's parameters, gradients, and optimizer state across multiple GPUs
- Together, these make it feasible to fine-tune 70B models on hardware like 2x 24GB GPUs
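
As a rough sanity check on the 2x 24GB claim, the weight memory can be estimated with simple arithmetic. This is a back-of-the-envelope sketch: it assumes a nominal 70 billion parameters and an even two-way shard, and it ignores quantization metadata, LoRA adapters, optimizer state, and activations, all of which add overhead.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 70e9  # assumed: a nominal 70B-parameter model
N_GPUS = 2       # the dual-GPU setup described above

bf16_gb = weight_memory_gb(N_PARAMS, 16)      # unquantized bfloat16 weights
four_bit_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantized weights
per_gpu_gb = four_bit_gb / N_GPUS             # evenly sharded across the GPUs

print(f"bf16 weights:  {bf16_gb:.1f} GB")
print(f"4-bit weights: {four_bit_gb:.1f} GB")
print(f"4-bit weights sharded over {N_GPUS} GPUs: {per_gpu_gb:.1f} GB per GPU")
```

Under these assumptions the 4-bit weights come to about 35 GB, or roughly 17.5 GB per GPU once sharded. The headroom left on a 24GB card has to cover everything the estimate leaves out.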

We'll cover installation and the available training options, then walk through detailed examples.

> Note: Treat this as an alpha/preview release. If you're not comfortable with testing and debugging models, we'd suggest holding off for a few months while the community more fully tests the approach.

## Installation

First, let's install the required dependencies. The following steps should work (tested on CUDA 11.7, 11.8 and 12.1):

In [None]:
# Clone the fsdp_qlora repo
!git clone https://github.com/AnswerDotAI/fsdp_qlora

# Install llama-recipes and fastcore
!pip install llama-recipes fastcore --extra-index-url https://download.pytorch.org/whl/test/cu118

# Install bitsandbytes
!pip install "bitsandbytes>=0.43.0"

# Login to Hugging Face to access Llama 2 model
!huggingface-cli login

Optional libraries:
- For HQQ quantization, follow the [HQQ installation instructions](https://github.com/mobiusml/hqq#installation). Make sure to also build the custom kernels.
- For Weights and Biases logging: `!pip install wandb`

PyTorch >= 2.2 is recommended to make use of the native flash-attention 2 kernel.
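
After installing, it can be worth confirming that the environment meets the version minimums mentioned above. The following is a minimal sketch using only the standard library. The helper names (`parse_version`, `meets_minimum`) are hypothetical, and the comparison looks at release numbers only, ignoring pre-release tags and local suffixes such as `+cu118`.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple:
    """Turn '2.2.0+cu118' into (2, 2, 0); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release numbers only, padding the shorter version with zeros."""
    have, need = parse_version(installed), parse_version(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


# Minimums taken from the installation notes above.
MINIMUMS = {"torch": "2.2", "bitsandbytes": "0.43.0"}

for package, minimum in MINIMUMS.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        print(f"{package}: not installed")
        continue
    status = "ok" if meets_minimum(installed, minimum) else f"older than {minimum}"
    print(f"{package} {installed}: {status}")
```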

## Fine-tuning Llama 2 70B on Dual 24GB GPUs

Once installed, change into the `fsdp_qlora` directory (in a notebook, use `%cd fsdp_qlora`, since `!cd` does not persist between commands). Then run the following command to begin fine-tuning Llama 2 70B on the Alpaca dataset with a maximum sequence length of 2048 tokens:

In [None]:
!python train.py \
  --model_name meta-llama/Llama-2-70b-hf \
  --batch_size 2 \
  --context_length 2048 \
  --precision bf16 \
  --train_type qlora \
  --use_gradient_checkpointing true \
  --use_cpu_offload true \
  --dataset alpaca \
  --reentrant_checkpointing true

## Training Options

The training script supports various options for quantization, LoRA training, and mixed precision.

### Quantization
The script supports both HQQ and bitsandbytes for quantization. If using bitsandbytes, pass `--reentrant_checkpointing true` to avoid triggering a bug that results in high memory usage.

### LoRA Training
- `--train_type full`: Full model fine-tuning
- `--train_type lora`: LoRA fine-tuning using the HF PEFT library
- `--train_type custom_lora`: LoRA fine-tuning using a custom LoRA module
- `--train_type qlora`: 4-bit quantized LoRA using the bitsandbytes `Linear4bit` layer with NF4 quantization and HF PEFT
- `--train_type custom_qlora`: 4-bit quantized LoRA using the bitsandbytes `Linear4bit` layer with NF4 quantization and a custom LoRA module
- `--train_type hqq_lora`: 4-bit quantized LoRA using the HQQ library and a custom LoRA module
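
To give a sense of why the LoRA variants above are cheap to train compared with `full`, the sketch below counts trainable parameters for a single linear layer. It assumes the standard LoRA formulation, in which two low-rank factors are trained alongside a frozen dense weight. The 8192 x 8192 layer shape and the rank of 64 are illustrative assumptions, not values read from `train.py`.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


def dense_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight (bias ignored)."""
    return d_in * d_out


D_IN, D_OUT, RANK = 8192, 8192, 64  # assumed shapes, for illustration only

adapter = lora_trainable_params(D_IN, D_OUT, RANK)
dense = dense_params(D_IN, D_OUT)

print(f"dense weight: {dense:,} parameters")
print(f"LoRA adapter: {adapter:,} parameters ({adapter / dense:.2%} of the dense layer)")
```

With these numbers the adapter holds under 2% of the layer's parameters, so gradients and optimizer state are only needed for a small fraction of the model.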

### Mixed Precision
- `--precision bf16`: Pure bfloat16 training
- `--precision fp32`: Pure float32 training  
- `--precision mp_fp16_autocast`: Mixed float16 with autocast
- `--precision mp_bf16_autocast`: Mixed bfloat16 with autocast
- `--precision mp_bf16_buffers_autocast`: Bfloat16 params and float32 buffers with autocast (important for RoPE layer)
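
The trade-offs between these formats can be made concrete with a small experiment. The sketch below emulates bfloat16 by rounding a float32 bit pattern to its top 16 bits (round-to-nearest-even, with NaN handling omitted) and uses the standard library's half-precision packing to probe float16's range. It illustrates the formats in general; it is an assumption, not something stated in the source, that limited bfloat16 precision on position-like values is what the RoPE remark refers to.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (emulated; NaN is not handled)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Round-to-nearest-even on the 16 low bits that bfloat16 drops.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def fits_in_fp16(x: float) -> bool:
    """True if x can be packed as IEEE half precision without overflowing."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


# bfloat16 keeps 8 significant bits, so nearby large integers collapse together.
for position in (255.0, 256.0, 2046.0, 2047.0, 2048.0):
    print(f"{position:7.1f} -> bf16 {to_bf16(position):7.1f}")

# float16 has more precision than bfloat16 but a much smaller range.
print("70000.0 fits in fp16:", fits_in_fp16(70000.0))
print("70000.0 as bf16:", to_bf16(70000.0))
```

Values that act like indices or angles are the ones most affected by the rounding shown in the loop, which is one reason to keep selected buffers in float32 while the parameters stay in bfloat16.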

## Examples

Here are a few detailed examples demonstrating different training configurations.

### Full Model Fine-tuning

In [None]:
# Optionally restrict which GPUs are visible. Use %env here: `!export` only affects its own subshell.
%env CUDA_VISIBLE_DEVICES=4,5

!python train.py \
  --world_size 2 \
  --master_port 12356 \
  --model_name meta-llama/Llama-2-7b-hf \
  --gradient_accumulation_steps 4 \
  --batch_size 8 \
  --context_length 512 \
  --precision bf16 \
  --train_type full \
  --use_gradient_checkpointing true \
  --use_cpu_offload false \
  --use_activation_cpu_offload false \
  --log_to wandb \
  --dataset alpaca
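
It can help to work out how many sequences contribute to each optimizer step in a configuration like this one. The sketch below assumes that `--batch_size` is a per-GPU value and that each of the `--world_size` processes handles its own batch; check `train.py` if that assumption matters for your run. The function name is hypothetical.

```python
def effective_batch_size(per_gpu_batch: int, grad_accum_steps: int, world_size: int) -> int:
    """Sequences contributing to one optimizer step, under the stated assumptions."""
    return per_gpu_batch * grad_accum_steps * world_size


# Values from the full fine-tuning command above.
sequences = effective_batch_size(per_gpu_batch=8, grad_accum_steps=4, world_size=2)

# Upper bound on tokens: every sequence filled to --context_length (512).
max_tokens = sequences * 512

print(f"{sequences} sequences per optimizer step, at most {max_tokens:,} tokens")
```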

### LoRA Fine-tuning with HF PEFT

In [None]:
!python train.py \
  --model_name meta-llama/Llama-2-7b-hf \
  --gradient_accumulation_steps 4 \
  --batch_size 8 \
  --context_length 512 \
  --precision bf16 \
  --train_type lora \
  --use_gradient_checkpointing true \
  --use_cpu_offload false \
  --use_activation_cpu_offload false \
  --log_to wandb \
  --dataset alpaca

### 4-bit QLoRA with Bitsandbytes

In [None]:
!python train.py \
  --model_name meta-llama/Llama-2-7b-hf \
  --gradient_accumulation_steps 4 \
  --batch_size 8 \
  --context_length 512 \
  --precision bf16 \
  --train_type qlora \
  --use_gradient_checkpointing true \
  --use_cpu_offload false \
  --use_activation_cpu_offload false \
  --log_to wandb \
  --dataset alpaca \
  --reentrant_checkpointing true

### 4-bit QLoRA with HQQ

In [None]:
!python train.py \
  --model_name meta-llama/Llama-2-7b-hf \
  --gradient_accumulation_steps 4 \
  --batch_size 8 \
  --context_length 512 \
  --precision bf16 \
  --train_type hqq_lora \
  --use_gradient_checkpointing true \
  --use_cpu_offload false \
  --use_activation_cpu_offload false \
  --log_to wandb \
  --dataset alpaca
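
The four commands above share most of their flags and differ mainly in `--train_type`. One way to keep them in sync is to build the argument list from a shared dictionary. In this sketch the helper names (`BASE_OPTIONS`, `VARIANTS`, `build_command`) are hypothetical; only the flags and values come from the examples above. It prints the commands rather than running them.

```python
import shlex

# Flags common to all four examples.
BASE_OPTIONS = {
    "model_name": "meta-llama/Llama-2-7b-hf",
    "gradient_accumulation_steps": 4,
    "batch_size": 8,
    "context_length": 512,
    "precision": "bf16",
    "use_gradient_checkpointing": "true",
    "use_cpu_offload": "false",
    "use_activation_cpu_offload": "false",
    "log_to": "wandb",
    "dataset": "alpaca",
}

# Per-example differences, as they appear in the commands above.
VARIANTS = {
    "full": {"train_type": "full", "world_size": 2, "master_port": 12356},
    "lora": {"train_type": "lora"},
    "qlora": {"train_type": "qlora", "reentrant_checkpointing": "true"},
    "hqq_lora": {"train_type": "hqq_lora"},
}


def build_command(overrides: dict) -> list:
    """Merge the shared flags with one variant's overrides into an argv list."""
    options = {**BASE_OPTIONS, **overrides}
    command = ["python", "train.py"]
    for flag, value in options.items():
        command += [f"--{flag}", str(value)]
    return command


for name, overrides in VARIANTS.items():
    print(f"# {name}")
    print(shlex.join(build_command(overrides)))
```

To launch one of these from Python instead of a `!` cell, the list can be passed to `subprocess.run(command, check=True)` from inside the `fsdp_qlora` directory.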

## SLURM Training

Here's an example batch script for multi-node training with SLURM. Save it to a file and submit it with `sbatch`:

In [None]:
#!/bin/bash
#SBATCH --job-name=fsdp_qlora
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=12
#SBATCH --mem=480GB
#SBATCH --gres=gpu:8
#SBATCH --time=48:00:00
#SBATCH --output=fsdp_qlora_%j.out

module load anaconda3
source activate fsdp_qlora

srun python train.py \
  --model_name meta-llama/Llama-2-70b-hf \
  --batch_size 2 \
  --context_length 2048 \
  --precision bf16 \
  --train_type qlora \
  --use_gradient_checkpointing true \
  --use_cpu_offload true \
  --dataset alpaca \
  --reentrant_checkpointing true

This script requests 2 nodes with 8 GPUs each (16 GPUs in total). It activates the `fsdp_qlora` conda environment and then launches the training script with `srun`.

The training script will automatically detect it is running under SLURM and set up the distributed process group accordingly. 

Adjust the SLURM parameters like `--nodes`, `--ntasks-per-node`, `--gres` etc. based on your cluster configuration and model size.
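
For readers curious about what detecting SLURM involves, the sketch below derives a process's rank and the world size from environment variables that `srun` sets for each task (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`). It shows the general pattern only: the function name is hypothetical, and this is not necessarily how `train.py` does it.

```python
import os
from typing import Mapping, Optional


def slurm_process_info(env: Mapping[str, str] = os.environ) -> Optional[dict]:
    """Return rank information from SLURM variables, or None outside of srun."""
    if "SLURM_PROCID" not in env:
        return None
    return {
        "rank": int(env["SLURM_PROCID"]),         # global task index
        "world_size": int(env["SLURM_NTASKS"]),   # total number of tasks
        "local_rank": int(env["SLURM_LOCALID"]),  # task index within its node
    }


# Simulated environment for the last task of a 2-node, 8-tasks-per-node job.
fake_env = {"SLURM_PROCID": "15", "SLURM_NTASKS": "16", "SLURM_LOCALID": "7"}
print(slurm_process_info(fake_env))

# Outside of SLURM the variables are absent and the function returns None.
print(slurm_process_info({}))
```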

## Conclusion

In this notebook, we covered how to use QLoRA and FSDP to efficiently fine-tune large language models like Llama 2 70B on consumer hardware. 

The key takeaways are:

- QLoRA keeps the base model frozen in 4-bit quantized form and trains only the small LoRA adapter layers
- FSDP shards the model's parameters, gradients, and optimizer state across multiple GPUs
- Together, these make it feasible to fine-tune 70B models on a small number of GPUs, such as 2x 24GB cards
- Use the appropriate training options for quantization (e.g. bitsandbytes, HQQ), LoRA variants, and mixed precision
- Be aware of current limitations around model loading and saving

We hope this notebook and example training script are helpful for your own large model fine-tuning experiments. Feel free to open issues or PRs in the [fsdp_qlora](https://github.com/AnswerDotAI/fsdp_qlora) repo with any questions or suggestions!

Happy training! 🚀