## Recap and Skill Test

In this notebook, you will recap what you have learned about **quantization, Parameter-Efficient Fine-Tuning (PEFT), and Unsloth**. You will apply these techniques to fine-tune a large language model in a memory-efficient way.

### Objectives:

1. **Quantize a Model**: Apply 4-bit quantization when loading a pre-trained model.
2. **Parameter-Efficient Fine-Tuning (PEFT)**: Use adapter layers to fine-tune a model efficiently.
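
A rough back-of-the-envelope estimate shows why 4-bit loading matters. The sketch below assumes about 7.25 billion base parameters, taken as "all params" minus "trainable params" from the Mistral 7B training output at the end of this notebook. It ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so treat the numbers as a lower bound, not a measurement. The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(n_params: int, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: base parameter count = "all params" - "trainable params"
# as reported by print_trainable_parameters() in this notebook's output.
N_PARAMS = 7_289_966_592 - 41_943_040

bf16_gb = weight_memory_gb(N_PARAMS, 16)  # 16 bits per weight
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4 bits per weight

print(f"bfloat16 weights: {bf16_gb:.1f} GB")
print(f"4-bit weights:    {nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink by a factor of four; the memory actually observed during training will be higher because of the ignored terms.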

### Task:

1. Choose a different pre-trained model from the Hugging Face Hub (e.g., Phi 3.5 mini or Mistral 7B Instruct).
2. Load the model using 4-bit quantization.
3. Fine-tune the model on a dataset of your choice (e.g., `timdettmers/openassistant-guanaco`).
4. Evaluate the model's performance before and after fine-tuning.
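
For step 4, one possible metric is perplexity on held-out text, computed the same way before and after fine-tuning. The task does not prescribe a metric, so this is only a sketch: the helper name `perplexity` is hypothetical, and it assumes you have already collected per-token negative log-likelihoods (in nats) from the model's loss.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token), NLL in nats."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Sanity check: a model that assigns probability 1/4 to every token
# has a per-token NLL of ln(4) and therefore a perplexity of about 4.
uniform_nlls = [math.log(4)] * 10
print(perplexity(uniform_nlls))
```

A lower perplexity after fine-tuning on text from the same distribution suggests the adapter has learned something useful; it does not replace inspecting generated answers.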

### Hint:

Some tokenizers do not define a `pad_token`. In that case, you may need to set the `pad_token` manually to another existing token, e.g.:
```
tokenizer.pad_token = tokenizer.unk_token
```
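
Not every tokenizer defines an `unk_token` either, so a slightly more defensive variant can fall back to the `eos_token`. The helper `ensure_pad_token` below is a hypothetical name, and the `SimpleNamespace` object is only a stand-in so the snippet runs without downloading a model; a real Hugging Face tokenizer exposes the same three attributes.

```python
from types import SimpleNamespace


def ensure_pad_token(tokenizer):
    """Set tokenizer.pad_token if it is missing, preferring unk_token, then eos_token."""
    if tokenizer.pad_token is None:
        fallback = tokenizer.unk_token or tokenizer.eos_token
        if fallback is None:
            raise ValueError("tokenizer has neither unk_token nor eos_token to reuse as pad_token")
        tokenizer.pad_token = fallback
    return tokenizer


# Stand-in object for demonstration only (not a real tokenizer).
demo = SimpleNamespace(pad_token=None, unk_token=None, eos_token="</s>")
ensure_pad_token(demo)
print(demo.pad_token)  # </s>
```

Reusing the `eos_token` for padding is common, but it may interfere with the model learning when to stop generating if padded positions are masked from the loss, which is one reason to prefer `unk_token` when it exists.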

In [1]:
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

model_id = 'mistralai/Mistral-7B-Instruct-v0.3'

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.unk_token_id
tokenizer.padding_side = 'right'

data = load_dataset('timdettmers/openassistant-guanaco', split='train')

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    attn_implementation='sdpa',  # 'eager', 'sdpa', or 'flash_attention_2'
    torch_dtype=torch.bfloat16,
)

model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False

peft_config = LoraConfig(
    task_type='CAUSAL_LM',
    r=16,
    lora_alpha=32,  # rule of thumb: lora_alpha = 2 * r
    lora_dropout=0.05,
    bias='none',
    target_modules='all-linear',
)

project_name = 'mistral7b-guanaco'
run_name = '1'

training_arguments = TrainingArguments(
    output_dir=f'{project_name}-{run_name}',
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,  # Gradient checkpointing improves memory efficiency but slows down training,
    # e.g. Mistral 7B with PEFT using bitsandbytes:
    # - enabled: 11 GB GPU RAM and 12 samples/second
    # - disabled: 40 GB GPU RAM and 8 samples/second
    gradient_checkpointing_kwargs={'use_reentrant': False},  # Use the newer non-reentrant implementation that will become the default.
    optim='adamw_torch',  # 'paged_adamw_32bit' can save GPU memory
    learning_rate=2e-4,  # QLoRA suggestions: 2e-4 for 7B or 13B, 1e-4 for 33B or 65B
    warmup_steps=200,
    lr_scheduler_type='cosine',
    logging_strategy='steps',  # 'no', 'epoch' or 'steps'
    logging_steps=50,
    save_strategy='no',  # 'no', 'epoch' or 'steps'
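    max_steps=10,  # short smoke test: with warmup_steps=200 and logging_steps=50, warmup never completes and no loss is logged; increase for a real run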
    bf16=True,  # mixed precision training; matches the bfloat16 dtypes used when loading the model (needs a GPU with bfloat16 support)
    report_to='none',  # disable wandb
)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=data,
    peft_config=peft_config,
    tokenizer=tokenizer,
    packing=False,
    dataset_text_field='text',
    max_seq_length=1024,
)

if hasattr(trainer.model, "print_trainable_parameters"):
    trainer.model.print_trainable_parameters()

result = trainer.train()

# Print statistics:
print(f"Run time: {result.metrics['train_runtime']:.2f} seconds")
print(f"Training speed: {result.metrics['train_samples_per_second']:.1f} samples/s")

# trainer.save_model()

trainable params: 41,943,040 || all params: 7,289,966,592 || trainable%: 0.5753529796148619


Step,Training Loss


Run time: 86.99 seconds
Training speed: 0.9 samples/s
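
The reported numbers can be reproduced by hand. The sketch below recomputes the trainable-parameter count for LoRA with `r=16` on all linear layers of the decoder blocks, and the training throughput. The layer dimensions are assumptions taken from the published Mistral 7B configuration (they are not stated in this notebook), and the throughput check assumes a single GPU.

```python
# Assumed Mistral 7B dimensions (from the model's published config, not from this notebook).
HIDDEN = 4096          # hidden_size
KV_DIM = 1024          # num_key_value_heads (8) * head_dim (128)
INTERMEDIATE = 14336   # intermediate_size of the MLP
NUM_LAYERS = 32        # number of decoder blocks
R = 16                 # LoRA rank, as set in LoraConfig above

# (in_features, out_features) of every linear layer in one decoder block.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds two matrices per layer: A (r x in_features) and B (out_features x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in LINEAR_SHAPES.values())
trainable = per_layer * NUM_LAYERS

ALL_PARAMS = 7_289_966_592  # "all params" from the output above
print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / ALL_PARAMS:.4f}")

# Throughput: samples = max_steps * per-device batch size * accumulation steps (one GPU assumed).
samples = 10 * 8 * 1
print(f"samples/s: {samples / 86.99:.1f}")
```

Under these assumptions the count comes out at 41,943,040 trainable parameters, matching the output above, and 80 samples in 86.99 seconds gives the reported 0.9 samples/s.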
