### Fine-Tuning a Model with QLoRA
Large language models keep getting bigger but, at the same time, we finally have the tools to run fine-tuning and inference on consumer hardware.

Thanks to LoRA, and now QLoRA, we can fine-tune models with billions of parameters without relying on cloud computing and, according to the QLoRA paper, without a significant drop in performance.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "EleutherAI/gpt-neox-20b"

# Load the tokenizer that matches the base model
tokenizer = AutoTokenizer.from_pretrained(model_name)

### Installing libraries 

In [None]:
# ! pip install -q -U bitsandbytes
# ! pip install -q -U git+https://github.com/huggingface/transformers.git 
# ! pip install -q -U git+https://github.com/huggingface/peft.git
# ! pip install -q -U git+https://github.com/huggingface/accelerate.git
# ! pip install -q datasets

### Details of the Quantization Config

- load_in_4bit: The model is loaded into memory with 4-bit precision.
- bnb_4bit_use_double_quant: Enables the double quantization proposed by QLoRA, which also quantizes the quantization constants to save additional memory.
- bnb_4bit_quant_type: The type of quantization. “nf4” stands for 4-bit NormalFloat.
- bnb_4bit_compute_dtype: While the model is loaded and stored in 4-bit, it is partially dequantized when needed and all computations are done in 16-bit precision (bfloat16).
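To get a feel for what these settings buy, here is a rough back-of-the-envelope estimate of weight memory. It is only a sketch: the parameter count of 20 billion is an approximation, the block sizes (64 weights per quantization constant, 256 constants per second-level constant) are the values described in the QLoRA paper, and the estimate treats every parameter as quantized and ignores activations, gradients, and optimizer state.

```python
# Rough weight-memory estimate (illustrative only, not a measurement).
N_PARAMS = 20e9            # assumed, approximate parameter count
BLOCK_SIZE = 64            # weights sharing one quantization constant (QLoRA paper)
SECOND_LEVEL_BLOCK = 256   # constants sharing one second-level constant (QLoRA paper)


def bits_per_param(weight_bits: float, double_quant: bool) -> float:
    """Storage cost per parameter, including the quantization constants."""
    if double_quant:
        # 8-bit constant per block, plus one 32-bit constant per 256 blocks
        overhead = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)
    else:
        # one 32-bit constant per block
        overhead = 32 / BLOCK_SIZE
    return weight_bits + overhead


def gigabytes(n_params: float, bits: float) -> float:
    """Convert a per-parameter bit cost into total gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9


fp16_gb = gigabytes(N_PARAMS, 16)
nf4_gb = gigabytes(N_PARAMS, bits_per_param(4, double_quant=False))
nf4_dq_gb = gigabytes(N_PARAMS, bits_per_param(4, double_quant=True))

print(f"16-bit weights:           {fp16_gb:.2f} GB")
print(f"NF4, no double quant:     {nf4_gb:.2f} GB")
print(f"NF4 + double quantization: {nf4_dq_gb:.2f} GB")
```

Under these assumptions the 4-bit model needs roughly a quarter of the 16-bit footprint, and double quantization trims about another gigabyte. The real footprint will differ, since some layers are typically kept in higher precision.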

In [None]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

### Load the Model 

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quant_config, device_map={"":0})

#### Enable gradient checkpointing

In [None]:
model.gradient_checkpointing_enable()

### Preparing the GPT model for LoRA
This is where we use PEFT. We prepare the model for LoRA, adding trainable adapters to the attention projection (query_key_value) of each layer.

In [None]:
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=8, 
    lora_alpha=32, 
    target_modules=["query_key_value"], 
    lora_dropout=0.05, 
    bias="none", 
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
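To see why so few parameters end up trainable, the arithmetic can be sketched by hand. A LoRA adapter on a linear layer of shape (d_in, d_out) adds two small matrices, A of shape (r, d_in) and B of shape (d_out, r), and scales their product by lora_alpha / r. The hidden size (6144) and layer count (44) below are assumptions about gpt-neox-20b, as is the fused query_key_value projection mapping the hidden size to three times the hidden size; check them against the model config before relying on the numbers.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


HIDDEN = 6144      # assumed hidden size of gpt-neox-20b
N_LAYERS = 44      # assumed number of transformer layers
R = 8              # rank, as in the LoraConfig above
LORA_ALPHA = 32    # as in the LoraConfig above

# query_key_value is assumed to be a fused projection: HIDDEN -> 3 * HIDDEN
per_layer = lora_param_count(HIDDEN, 3 * HIDDEN, R)
total = per_layer * N_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update B @ A

print(f"adapter parameters per layer: {per_layer:,}")
print(f"adapter parameters in total:  {total:,}")
print(f"LoRA scaling factor:          {scaling}")
```

On the actual model, PEFT's `model.print_trainable_parameters()` reports the real trainable and total counts, which is the number to trust over this estimate.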

### Load a sample Dataset

In [None]:
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

In [None]:
import transformers

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        warmup_steps=2,
        max_steps=20,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
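The training arguments above imply an effective batch size and a learning-rate schedule that are easy to work out by hand. The sketch below assumes a single GPU and the Trainer's default linear schedule with warmup (linear ramp from 0 to the peak rate, then linear decay to 0 at max_steps); `linear_schedule_lr` is a hypothetical helper written for illustration, not a library function.

```python
PER_DEVICE_BATCH = 1   # per_device_train_batch_size
GRAD_ACCUM = 8         # gradient_accumulation_steps
WARMUP_STEPS = 2       # warmup_steps
MAX_STEPS = 20         # max_steps
PEAK_LR = 2e-4         # learning_rate


def linear_schedule_lr(step: int) -> float:
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = (MAX_STEPS - step) / (MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * max(0.0, remaining)


# One optimizer step consumes GRAD_ACCUM micro-batches (single GPU assumed).
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
examples_seen = effective_batch * MAX_STEPS
schedule = [linear_schedule_lr(s) for s in range(MAX_STEPS)]

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {MAX_STEPS} steps: {examples_seen}")
for step, lr in enumerate(schedule):
    print(f"step {step:2d}: lr = {lr:.2e}")
```

With these values the run is a short smoke test: it touches only a small slice of the dataset, so it checks that the pipeline works rather than producing a well-tuned model.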


In [None]:
trainer.train()

### Inference

In [None]:
text = "Ask not what your country"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))