## Fine-tuning local LLMs with different quantization techniques (`bitsandbytes` and `gptq`)

This notebook gives a quick overview of using various quantization techniques to fine-tune LLMs on memory-constrained commodity hardware. On a free-tier Colab GPU with 16 GiB of memory in particular, fine-tuning even a small (7B) LLM variant requires techniques such as 4-bit quantization or GPTQ to avoid out-of-memory errors at long sequence lengths.
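
To make the memory constraint concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone need at different precisions. The helper `weight_memory_gib` is a hypothetical name introduced for illustration, and the estimate deliberately ignores activations, LoRA parameters, optimizer state, and quantization metadata such as scales, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


n_params = 7e9  # a "7B" model, taken as exactly 7 billion parameters for simplicity

for label, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gib(n_params, bits):.2f} GiB")
```

At 16 bits the weights already take most of a 16 GiB card, which leaves little room for activations at long sequence lengths. At 4 bits they take roughly a quarter of that.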

Install the prerequisite packages.

In [1]:
!git clone https://github.com/taprosoft/llm_finetuning/
%cd llm_finetuning
!pip install -r requirements.txt
!pip install -r cuda_quant_requirements.txt
!wandb disabled

Cloning into 'llm_finetuning'...
remote: Enumerating objects: 61, done.
remote: Counting objects: 100% (61/61), done.
remote: Compressing objects: 100% (48/48), done.
remote: Total 61 (delta 13), reused 54 (delta 9), pack-reused 0
Unpacking objects: 100% (61/61), 84.66 KiB | 1.76 MiB/s, done.
/content/llm_finetuning
Collecting git+https://github.com/huggingface/peft.git (from -r requirements.txt (line 14))
  Cloning https://github.com/huggingface/peft.git to /tmp/pip-req-build-on1mg_ot
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/peft.git /tmp/pip-req-build-on1mg_ot
  Resolved https://github.com/huggingface/peft.git to commit 06fd06a4d2e8ed8c3a253c67d9c3cb23e0f497ad
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting accelerate (from -r requirements.txt (line 3))
  Downloading accelerate-0.20.3-py3-none-an

Download some model weights from the Hugging Face [model hub](https://huggingface.co/models) using the `download_model.py` script.

In [None]:
!mkdir models
# download a 7B GPTQ base model
!python download_model.py TheBloke/open-llama-7b-open-instruct-GPTQ
# download a regular (non-GPTQ) 7B model; a sharded checkpoint is needed because of Colab's memory limit
!python download_model.py CleverShovel/vicuna-7b-v1.3-sharded-bf16

Downloading the model to models/TheBloke_open-llama-7b-open-instruct-GPTQ
100% 9.80k/9.80k [00:00<00:00, 37.1MiB/s]
100% 576/576 [00:00<00:00, 3.43MiB/s]
100% 132/132 [00:00<00:00, 935kiB/s]
100% 3.90G/3.90G [03:33<00:00, 18.3MiB/s]
100% 185/185 [00:00<00:00, 1.55MiB/s]
100% 435/435 [00:00<00:00, 3.35MiB/s]
100% 1.98M/1.98M [00:01<00:00, 1.89MiB/s]
100% 534k/534k [00:00<00:00, 71.7MiB/s]
100% 727/727 [00:00<00:00, 4.63MiB/s]
Downloading the model to models/Epimachok_vicuna-7b-v1.3-sharded-bf16
100% 552/552 [00:00<00:00, 3.70MiB/s]
100% 137/137 [00:00<00:00, 1.05MiB/s]
100% 1.98G/1.98G [00:22<00:00, 86.7MiB/s]
100% 1.99G/1.99G [00:24<00:00, 81.0MiB/s]
 29% 568M/1.99G [01:15<01:21, 17.4MiB/s]

Use the `finetune.py` script to run training or inference. We first evaluate the downloaded models on a public instruction-tuning dataset.

To understand the format of the dataset, take a look at [alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) or the guidelines in the [README](https://github.com/taprosoft/llm_finetuning).

It looks something like this:

```json
[
    {
        "instruction": "do something with the input",
        "input": "input string",
        "output": "output string"
    }
]
```
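
As a small sketch of how records in this format might be checked and turned into a prompt, the snippet below validates the three keys and assembles a prompt string. Both `load_records` and `build_prompt` are hypothetical helpers, and the `### Instruction:` style template is only an illustrative assumption. The template that `finetune.py` actually uses may differ, so check the repository before relying on it.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}


def load_records(text: str) -> list:
    """Parse a JSON list of records and check that each one has the expected keys."""
    records = json.loads(text)
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
    return records


def build_prompt(rec: dict) -> str:
    """Assemble a prompt from one record (hypothetical template, for illustration only)."""
    parts = [f"### Instruction:\n{rec['instruction']}"]
    if rec["input"]:  # the input field can be empty for instruction-only examples
        parts.append(f"### Input:\n{rec['input']}")
    parts.append("### Response:\n")
    return "\n\n".join(parts)


sample = (
    '[{"instruction": "do something with the input", '
    '"input": "input string", "output": "output string"}]'
)
records = load_records(sample)
print(build_prompt(records[0]))
```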

We start with the 7B model in 4-bit quantization mode from `bitsandbytes` (`--mode 4`). Take a look at the output loss and the processing time per step.

In [None]:
!python finetune.py \
    --base_model 'models/CleverShovel_vicuna-7b-v1.3-sharded-bf16' \
    --data_path 'yahma/alpaca-cleaned' \
    --output_dir 'output_lora' \
    --batch_size 32 \
    --micro_batch_size 1 \
    --train_on_inputs True \
    --num_epochs 1 \
    --learning_rate 2e-4 \
    --cutoff_len 1600 \
    --group_by_length \
    --val_set_size 0.05 \
    --eval_steps 0 \
    --logging_steps 5 \
    --save_steps 5 \
    --gradient_checkpointing 1 \
    --mode 4 \
    --eval
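
The `--batch_size 32` and `--micro_batch_size 1` flags in the command above suggest gradient accumulation: only one sequence is held in GPU memory at a time, while the optimizer sees an effective batch of 32. The sketch below assumes the common convention that the number of accumulation steps is `batch_size // micro_batch_size`, and that `--cutoff_len` caps the tokens per sequence. Neither is confirmed here, so check `finetune.py` for the actual behavior.

```python
def accumulation_steps(batch_size: int, micro_batch_size: int) -> int:
    """Micro-batches accumulated before one optimizer step (assumed relationship)."""
    if batch_size % micro_batch_size != 0:
        raise ValueError("batch_size should be a multiple of micro_batch_size")
    return batch_size // micro_batch_size


steps = accumulation_steps(batch_size=32, micro_batch_size=1)

# Upper bound on tokens contributing to one optimizer step,
# assuming every sequence is truncated to cutoff_len = 1600 tokens.
max_tokens_per_update = 32 * 1600

print(f"accumulation steps per update: {steps}")
print(f"max tokens per update: {max_tokens_per_update}")
```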

Now we run the same script in GPTQ quantization mode (`--mode gptq`). This method needs compatible model weights, so we switch to a GPTQ checkpoint (look for `gptq` in the model name). Comparing the two runs, we can see a significant difference in processing time between the quantization methods.

In [None]:
!python finetune.py \
    --base_model 'models/TheBloke_open-llama-7b-open-instruct-GPTQ' \
    --data_path 'yahma/alpaca-cleaned' \
    --output_dir 'output_lora' \
    --batch_size 32 \
    --micro_batch_size 1 \
    --train_on_inputs True \
    --num_epochs 1 \
    --learning_rate 2e-4 \
    --cutoff_len 1600 \
    --group_by_length \
    --val_set_size 0.05 \
    --eval_steps 0 \
    --logging_steps 5 \
    --save_steps 5 \
    --gradient_checkpointing 1 \
    --mode gptq \
    --eval
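
For intuition about what storing weights in 4 bits means, here is a toy sketch of symmetric absmax round-to-nearest quantization on a small block of weights. This is a deliberately simplified stand-in and is neither the scheme `bitsandbytes` uses nor the GPTQ algorithm, both of which are more sophisticated. The function names are hypothetical.

```python
def quantize_absmax_4bit(block):
    """Map floats to signed 4-bit integer codes using one scale per block."""
    # Largest magnitude maps to code 7; fall back to 1.0 for an all-zero block.
    scale = max(abs(w) for w in block) / 7 or 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in block]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)

for w, c, r in zip(weights, codes, restored):
    print(f"weight={w:+.3f}  code={c:+d}  restored={r:+.3f}  error={abs(w - r):.3f}")
```

Each weight is stored as a small integer plus a shared scale, so the rounding error per weight is bounded by half the scale. Real methods reduce this error further with better code books or error-compensating rounding.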

The evaluation loop only reports the loss and run time. To see the model output as text, use the `inference.py` script. Note that inference / generation takes much longer than the evaluation loop because of the additional overhead of the token generation steps. We will use the `exllama` inference backend to speed up inference.

In [None]:
# works around a Colab install issue with ExLlama
!git clone https://github.com/taprosoft/exllama.git
!cd exllama && pip install -e .

In [None]:
!python inference.py \
    --base models/TheBloke_open-llama-7b-open-instruct-GPTQ \
    --mode exllama \
    --data 'yahma/alpaca-cleaned' \
    --selected_ids [0,1,2,3]
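
To illustrate why generation is slower than the evaluation loop, the sketch below counts model forward passes under simple assumptions: evaluation scores all tokens of a sequence in one pass, while autoregressive generation needs one pass per newly generated token. The function names and the figure of 256 new tokens are hypothetical, and the count ignores batching and differences in cost per pass.

```python
def forward_passes_eval(n_sequences: int) -> int:
    """Evaluation: one forward pass scores every token of a sequence at once."""
    return n_sequences


def forward_passes_generate(n_sequences: int, new_tokens: int) -> int:
    """Generation: one forward pass per new token, run sequentially."""
    return n_sequences * new_tokens


n_sequences = 4   # matches the four selected ids in the command above
new_tokens = 256  # hypothetical generation length per sample

print("evaluation passes:", forward_passes_eval(n_sequences))
print("generation passes:", forward_passes_generate(n_sequences, new_tokens))
```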

Now we can start training. On a relatively old GPU such as a T4, training on the Alpaca dataset can take about 20-30 hours. The output checkpoint is stored in `output_lora`. Checkpoints are saved at regular intervals, so you can stop early if needed.

In [None]:
!python finetune.py \
    --base_model 'models/TheBloke_open-llama-7b-open-instruct-GPTQ' \
    --data_path 'yahma/alpaca-cleaned' \
    --output_dir 'output_lora' \
    --batch_size 32 \
    --micro_batch_size 1 \
    --train_on_inputs True \
    --num_epochs 1 \
    --learning_rate 2e-4 \
    --cutoff_len 1600 \
    --group_by_length \
    --val_set_size 0.05 \
    --eval_steps 0 \
    --logging_steps 5 \
    --save_steps 5 \
    --gradient_checkpointing 1 \
    --mode gptq
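
As a rough sanity check on the 20-30 hour estimate, the sketch below converts the flags above into a number of optimizer steps and an implied time per step. It assumes the dataset has roughly 51,760 examples, that `--val_set_size 0.05` is treated as a fraction held out for validation, and that one optimizer step consumes `batch_size` examples. All three are assumptions, so treat the result as an order-of-magnitude guide only.

```python
import math


def optimizer_steps(n_examples: int, val_fraction: float, batch_size: int, epochs: int = 1) -> int:
    """Optimizer steps for a run, assuming a fractional validation split."""
    n_val = round(n_examples * val_fraction)
    n_train = n_examples - n_val
    return math.ceil(n_train / batch_size) * epochs


n_examples = 51_760  # assumed approximate size of the dataset
steps = optimizer_steps(n_examples, val_fraction=0.05, batch_size=32)
print(f"optimizer steps: {steps}")

for hours in (20, 30):
    print(f"{hours} h total -> about {hours * 3600 / steps:.0f} s per optimizer step")
```

With `--save_steps 5`, a checkpoint would be written every five optimizer steps under these assumptions, so stopping early loses only a few steps of work.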