In [None]:
!pip install accelerate peft bitsandbytes transformers trl auto-gptq optimum

In [2]:
from huggingface_hub import notebook_login
notebook_login()


# **Quantization**
Quantization techniques reduce memory and computational costs by representing weights and activations with lower-precision data types such as 8-bit integers (int8). This lets you load larger models than would normally fit into memory and speeds up inference. Transformers supports the ***AWQ*** and ***GPTQ*** quantization algorithms, as well as 8-bit and 4-bit quantization with bitsandbytes.
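To make the idea concrete, here is a minimal sketch of symmetric (absmax) int8 quantization in plain Python. It is only an illustration of the principle: the function names and sample weights are made up for this example, and real libraries work on tensors with more elaborate schemes.

```python
def quantize_int8(values):
    """Symmetric (absmax) quantization of a list of floats to int8 codes."""
    absmax = max(abs(v) for v in values)
    # One scale for the whole list: the largest magnitude maps to +/-127.
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.97, -1.20, 0.05]  # illustrative values
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("int8 codes:", codes)
print("max round-trip error:", max_error, "half step:", scale / 2)
```

Each weight now needs one byte plus a shared scale, instead of two bytes in fp16 or four in fp32.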

Quantization techniques that aren’t supported in Transformers can be added with the *HfQuantizer* class.

https://huggingface.co/docs/transformers/quantization

# **AutoGPTQ**

The *AutoGPTQ* library implements the GPTQ algorithm, a post-training quantization technique where each row of the weight matrix is quantized independently to find a version of the weights that minimizes the error. These weights are quantized to int4, but they’re restored to fp16 on the fly during inference. This can cut memory usage by about 4x compared with fp16, because the int4 weights are dequantized in a fused kernel rather than in the GPU’s global memory. You can also expect a speedup in inference, because lower-bitwidth weights take less time to move.
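The storage side of this can be sketched in plain Python: each row gets its own scale and offset, the 4-bit codes are packed two per byte, and the values are restored on demand. This is a simplified illustration with hypothetical helper names. It uses plain round-to-nearest and leaves out the error-compensating weight updates that GPTQ itself performs.

```python
def quantize_row_int4(row):
    """Quantize one row to 4-bit codes (0..15) with its own scale and offset."""
    lo, hi = min(row), max(row)
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [max(0, min(15, round((v - lo) / scale))) for v in row]
    return codes, scale, lo


def pack_int4(codes):
    """Pack pairs of 4-bit codes into single bytes (low nibble first)."""
    if len(codes) % 2:
        codes = codes + [0]  # pad odd-length rows
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_int4(packed, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.extend((byte & 0x0F, byte >> 4))
    return codes[:count]


def dequantize_row(codes, scale, lo):
    """Restore approximate float values from 4-bit codes."""
    return [c * scale + lo for c in codes]


matrix = [
    [0.10, -0.40, 0.25, 0.90, -0.75, 0.00, 0.33, -0.12],
    [1.50, 1.20, -2.00, 0.70, 0.05, -0.60, 2.10, 0.40],
]

packed_rows = []
for row in matrix:
    codes, scale, lo = quantize_row_int4(row)  # each row is quantized independently
    packed_rows.append((pack_int4(codes), scale, lo, len(row)))

fp16_bytes = sum(len(row) for row in matrix) * 2       # 2 bytes per fp16 weight
int4_bytes = sum(len(p) for p, _, _, _ in packed_rows)  # ignores per-row scale/offset
ratio = fp16_bytes / int4_bytes
print("fp16 bytes:", fp16_bytes, "int4 bytes:", int4_bytes, "ratio:", ratio)

# Dequantize the first row on the fly, as an inference kernel would.
packed, scale, lo, count = packed_rows[0]
restored_first_row = dequantize_row(unpack_int4(packed, count), scale, lo)
```

The 4x figure ignores the small overhead of storing a scale and offset per row (or per group of weights in practice).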

***GPTQ is used just like bitsandbytes: a quantization config is passed when the model is loaded.***

# **prepare_model_for_kbit_training**

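In the context of fine-tuning Large Language Models (LLMs), **prepare_model_for_kbit_training** is a PEFT function that prepares an already quantized model (k-bit weights, for example 4-bit or 8-bit) for training. It does not quantize anything itself. It freezes the base model's parameters, enables gradient checkpointing by default, and makes the inputs require gradients so that adapter layers such as LoRA can be trained on top of the frozen quantized weights. This keeps the memory footprint of fine-tuning low.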

# **Difference between BitsAndBytesConfig vs prepare_model_for_kbit_training**

**BitsAndBytesConfig:**

**Function:** This class defines the configuration for how the model will be quantized. It acts as a blueprint for the quantization process.

**Purpose:** It allows you to specify the parameters related to quantization, such as 8-bit or 4-bit loading, the 4-bit quantization type, and the compute dtype.

**Usage:** You create a BitsAndBytesConfig object, setting the desired quantization parameters, and pass it as quantization_config to from_pretrained. The model is quantized as it is loaded.

**prepare_model_for_kbit_training:**

**Function:** This function takes a model that has already been loaded in quantized form and makes it ready for fine-tuning. It does not take a BitsAndBytesConfig object and it does not perform the quantization.

**Purpose:** It applies the training-time adjustments a quantized model needs, such as freezing the base weights and enabling gradient checkpointing.

**Usage:** Once the quantized model is loaded (with BitsAndBytesConfig, or with GPTQConfig as in this notebook), you call prepare_model_for_kbit_training(model) and then attach adapters, for example with get_peft_model.
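The order of the two steps can be sketched as follows. This is a minimal sketch rather than the notebook's own code: the model id is a small stand-in chosen for illustration, the LoRA values simply mirror those used later in this notebook, and running it assumes a CUDA GPU with bitsandbytes installed.

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "facebook/opt-350m"  # assumption: stand-in model, not the one used in this notebook


def load_quantized_model_for_lora(model_id=MODEL_ID):
    # Step 1: describe HOW to quantize. Nothing is quantized yet.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    # Step 2: the config is consumed by from_pretrained, which quantizes while loading.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
    # Step 3: make the already quantized model trainable-friendly (takes the model only).
    model = prepare_model_for_kbit_training(model)
    # Step 4: attach LoRA adapters; only these small matrices receive gradients.
    lora_config = LoraConfig(
        r=16, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
    )
    return get_peft_model(model, lora_config)


if __name__ == "__main__":
    peft_model = load_quantized_model_for_lora()
    peft_model.print_trainable_parameters()
```

The same order applies to GPTQ models: GPTQConfig takes the place of BitsAndBytesConfig in step 1, and steps 2 to 4 stay the same.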

In [3]:
from datasets import load_dataset, Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig, TrainingArguments
from trl import SFTTrainer

def finetune_xwin_lm():
    # Load the instruction dataset and keep the first 10,000 rows.
    data = load_dataset("yahma/alpaca-cleaned", split="train")
    data_df = data.to_pandas()
    data_df = data_df[:10000]
    # Build one training string per row: instruction, optional input, then the answer.
    data_df["text"] = data_df[["input", "instruction", "output"]].apply(lambda x: "###Human: " + x["instruction"] + "\n" + x["input"] + "\n###Assistant: " + x["output"], axis=1)
    print(data_df.iloc[0])
    data = Dataset.from_pandas(data_df)
    tokenizer = AutoTokenizer.from_pretrained("TheBloke/Xwin-LM-7B-V0.1-GPTQ")
    # The tokenizer has no pad token, so reuse the end-of-sequence token for padding.
    tokenizer.pad_token = tokenizer.eos_token
    # The checkpoint is already GPTQ-quantized; the ExLlama kernels are disabled for training.
    # disable_exllama is deprecated; the warning in the output below recommends use_exllama instead.
    quantization_config_loading = GPTQConfig(bits=4, disable_exllama=True)
    model = AutoModelForCausalLM.from_pretrained(
                              "TheBloke/Xwin-LM-7B-V0.1-GPTQ",
                              quantization_config=quantization_config_loading,
                              device_map="auto"
                          )
    # The KV cache is not useful during training and conflicts with gradient checkpointing.
    model.config.use_cache=False
    model.config.pretraining_tp=1
    model.gradient_checkpointing_enable()
    # Freeze the quantized base model and prepare it for adapter training.
    model = prepare_model_for_kbit_training(model)
    peft_config = LoraConfig(
        r=16, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
    )
    # Wrap the model with LoRA adapters; only the adapter weights are trained.
    model = get_peft_model(model, peft_config)
    training_arguments = TrainingArguments(
        output_dir="xwin-finetuned-alpaca-cleaned",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=1,
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=1,
        # A positive max_steps overrides num_train_epochs, so training stops after 250 steps.
        max_steps=250,
        push_to_hub=True
    )
    # The model is already a PEFT model, so passing peft_config again is redundant.
    trainer = SFTTrainer(
        model=model,
        train_dataset=data,
        peft_config=peft_config,
        dataset_text_field="text",
        args=training_arguments,
        tokenizer=tokenizer,
        packing=False,
        max_seq_length=512
    )
    trainer.train()
    # Upload the trained adapter to the Hugging Face Hub.
    trainer.push_to_hub()

if __name__ == "__main__":
    finetune_xwin_lm()




input                                                           
output         1. Eat a balanced and nutritious diet: Make su...
instruction                 Give three tips for staying healthy.
text           ###Human: Give three tips for staying healthy....
Name: 0, dtype: object



Using `disable_exllama` is deprecated and will be removed in version 4.37. Use `use_exllama` instead and specify the version with `exllama_config`.The value of `use_exllama` will be overwritten by `disable_exllama` passed in `GPTQConfig` or stored in your config file.



The cos_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class
The sin_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class




| Step | Training Loss |
|------|---------------|
| 10 | 1.3153 |
| 20 | 1.2148 |
| 30 | 1.1053 |
| 40 | 1.1313 |
| 50 | 1.0389 |
| 60 | 1.0022 |
| 70 | 1.0367 |
| 80 | 1.0517 |
| 90 | 1.0092 |
| 100 | 0.9746 |
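As a rough cross-check of the training setup, the arithmetic below works out how much data the run sees and how many LoRA parameters are trained. The dataset size, batch size, step count, and rank come from the training cell. The single GPU, the Llama-2-7B dimensions (hidden size 4096, 32 layers), and the two default LoRA target modules per layer (q_proj and v_proj) are assumptions, not values stated in this notebook.

```python
# Values copied from the training cell.
dataset_rows = 10_000
per_device_train_batch_size = 8
gradient_accumulation_steps = 1
max_steps = 250
lora_rank = 16

# Assumptions (not stated in the notebook).
num_devices = 1               # single GPU
hidden_size = 4096            # Llama-2-7B hidden size
num_layers = 32               # Llama-2-7B decoder layers
target_modules_per_layer = 2  # PEFT default for Llama: q_proj and v_proj

# Training budget: max_steps takes precedence over num_train_epochs.
examples_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices
examples_seen = examples_per_step * max_steps
fraction_of_epoch = examples_seen / dataset_rows

# Each adapted weight W (out x in) gets A (r x in) and B (out x r).
params_per_module = lora_rank * (hidden_size + hidden_size)
trainable_params = params_per_module * target_modules_per_layer * num_layers
adapter_megabytes_fp32 = trainable_params * 4 / 1e6

print("examples seen:", examples_seen, "fraction of one epoch:", fraction_of_epoch)
print("trainable LoRA parameters:", trainable_params)
print("adapter size in fp32 (MB):", adapter_megabytes_fp32)
```

Under these assumptions the run covers 2,000 of the 10,000 examples, and the adapter holds about 8.4 million parameters, a small fraction of the 7B base model.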



# **Inference**

In [5]:
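from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, GenerationConfig
import torch

# Load the tokenizer that was pushed to the Hub together with the adapter.
tokenizer = AutoTokenizer.from_pretrained("SNV/xwin-finetuned-alpaca-cleaned")

# Note: training used the "###Human: ... ###Assistant: " template,
# while this prompt uses "###Instruction: ... ###Response: ".
inputs = tokenizer("""
###Instruction: I dropped my mobile phone in water. What to do?
###Response: """, return_tensors="pt").to("cuda")

# AutoPeftModelForCausalLM loads the base model named in the adapter config, then the adapter.
model = AutoPeftModelForCausalLM.from_pretrained(
    "SNV/xwin-finetuned-alpaca-cleaned",
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="cuda")

# penalty_alpha only takes effect in contrastive search (do_sample=False);
# with do_sample=True, generation uses top-k sampling with the temperature below.
generation_config = GenerationConfig(
    penalty_alpha=0.6,
    do_sample=True,
    top_k=5,
    temperature=0.5,
    repetition_penalty=1.2,
    max_new_tokens=100
)
outputs = model.generate(**inputs, generation_config=generation_config)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))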




###Instruction: I dropped my mobile phone in water. What to do?
###Response: 1. Turn it off immediately by pressing the power button or flipping its screen upwards, and then remove any battery if possible before putting it into a container of rice for at least two days; this will help absorb moisture from inside the device. After that time has passed, dry the phone thoroughly with paper towels or cloths and gently shake out excess liquid. If there is no visible damage after these steps are taken, you can attempt to turn your phone
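The training cell formats examples with "###Human:" and "###Assistant:", while the inference prompt above uses "###Instruction:" and "###Response:". A fine-tuned model usually responds best to the template it saw during training, so one option is to share a single helper between both cells. The sketch below uses hypothetical function names and reproduces the training template from this notebook.

```python
def build_prompt(instruction, input_text=""):
    """Prompt in the same layout as the training text, up to the assistant marker."""
    return "###Human: " + instruction + "\n" + input_text + "\n###Assistant: "


def build_training_text(instruction, input_text, output):
    """Full training string: prompt followed by the reference answer."""
    return build_prompt(instruction, input_text) + output


# Training side: matches the "text" column built in the fine-tuning cell.
example = build_training_text("Give three tips for staying healthy.", "", "1. Eat a balanced diet.")

# Inference side: same template, with the answer left for the model to generate.
prompt = build_prompt("I dropped my mobile phone in water. What to do?")
print(prompt)
```

The resulting prompt string can be passed to the tokenizer in place of the hand-written one.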
