### Fine-Tuning a Model with QLoRA
- Large language models keep getting bigger, but we now also have the tools to fine-tune them and run inference on consumer hardware.

- With QLoRA, we can fine-tune models with billions of parameters without relying on cloud computing and, according to the QLoRA paper, without a significant drop in performance.
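
To make the hardware claim concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone occupy at different precisions. The 20-billion parameter count is an assumed round figure, and `weight_memory_gb` is a hypothetical helper; activations, optimizer state, and quantization constants are not counted.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Assumption: a round 20 billion parameters, used only for illustration.
n_params = 20e9

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_gb(n_params, bits):6.1f} GB")
```

Going from 16-bit to 4-bit storage cuts the weight footprint by a factor of four, which is what brings large models within reach of a single GPU.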

In [8]:
import torch
import numpy as np
import pandas as pd 
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import json 
model_name = "EleutherAI/gpt-neox-20b"

# Load the tokenizer that matches the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [23]:
# mem_get_info() returns (free_bytes, total_bytes); index 1 is the device's total memory
total_in_GB = round(torch.cuda.mem_get_info()[1] / 1024**3, 2)
max_memory = f"{total_in_GB}GB"
n_gpus = torch.cuda.device_count()
print("Total number of GPUs:", n_gpus)
print("Maximum available memory per GPU:", max_memory)


Total number of GPUs: 1
Maximum available memory per GPU: 4.0GB


### Installing libraries 

In [None]:
# ! pip install -q -U bitsandbytes
# ! pip install -q -U git+https://github.com/huggingface/transformers.git 
# ! pip install -q -U git+https://github.com/huggingface/peft.git
# ! pip install -q -U git+https://github.com/huggingface/accelerate.git
# ! pip install -q datasets

### Details of the Quantization Config

- load_in_4bit: The model is loaded into memory with 4-bit precision.
- bnb_4bit_use_double_quant: Enables the double quantization proposed in QLoRA, which also quantizes the quantization constants to save additional memory.
- bnb_4bit_quant_type: The type of quantization. "nf4" stands for 4-bit NormalFloat.
- bnb_4bit_compute_dtype: The model is loaded and stored in 4-bit, but weights are dequantized as needed and all computations run in 16-bit precision (bfloat16).
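
As a rough illustration of what double quantization saves, the sketch below computes the per-parameter overhead of storing quantization constants. The block sizes (64 weights per block, 256 constants per second-level block) and bit widths follow the figures reported in the QLoRA paper; treat them as assumptions about the defaults rather than values read from this notebook. Both function names are hypothetical.

```python
def single_quant_overhead(block_size: int = 64, const_bits: int = 32) -> float:
    """Extra bits per parameter when each block stores one full-precision constant."""
    return const_bits / block_size

def double_quant_overhead(block_size: int = 64, second_block: int = 256,
                          q_const_bits: int = 8, second_const_bits: int = 32) -> float:
    """Extra bits per parameter when the constants themselves are quantized."""
    first_level = q_const_bits / block_size
    second_level = second_const_bits / (block_size * second_block)
    return first_level + second_level

single = single_quant_overhead()
double = double_quant_overhead()
print(f"without double quantization: {single:.3f} bits/param")
print(f"with double quantization:    {double:.3f} bits/param")
print(f"saving:                      {single - double:.3f} bits/param")
```

The saving per parameter is small, but multiplied across billions of parameters it adds up to a noticeable amount of GPU memory.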

In [None]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
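
To give a feel for storing weights in 4 bits and dequantizing them for computation, here is a simplified sketch of blockwise absmax quantization. It uses 16 evenly spaced levels, which is not the NF4 data type (NF4 places its levels according to a normal distribution), and it is not how bitsandbytes is implemented; the function names are hypothetical.

```python
def quantize_block(values, n_levels=16):
    """Map one block of floats to integer codes in [0, n_levels - 1] using absmax scaling."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero for all-zero blocks
    half = (n_levels - 1) / 2
    codes = [round((v / absmax) * half + half) for v in values]
    return codes, absmax

def dequantize_block(codes, absmax, n_levels=16):
    """Recover approximate floats from the integer codes and the block's absmax constant."""
    half = (n_levels - 1) / 2
    return [((c - half) / half) * absmax for c in codes]

# Made-up weights for illustration only.
block = [0.12, -0.50, 0.33, 0.05, -0.27, 0.48, -0.01, 0.20]
codes, absmax = quantize_block(block)
restored = dequantize_block(codes, absmax)

for original, code, approx in zip(block, codes, restored):
    print(f"{original:+.2f} -> code {code:2d} -> {approx:+.3f}")
```

Each weight is stored as a 4-bit code plus one shared constant per block, and the dequantized values are only approximations of the originals.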

### Load the Model 

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quant_config, device_map={"":0})

#### Enable gradient checkpointing

In [None]:
model.gradient_checkpointing_enable()

### Preprocessing the GPT model for LoRA
This is where we use PEFT. We prepare the model for LoRA by adding trainable adapters to each layer.

In [None]:
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=8, 
    lora_alpha=32, 
    target_modules=["query_key_value"], 
    lora_dropout=0.05, 
    bias="none", 
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
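
To see why the adapters are cheap to train, the sketch below counts the parameters LoRA adds to a single weight matrix with `r=8` and computes the `lora_alpha / r` scaling applied to the adapter output. The matrix dimensions are hypothetical and do not come from the model's actual configuration.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank matrices A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

# Assumption: illustrative sizes for a fused query/key/value projection.
d_in, d_out = 4096, 12288
r, lora_alpha = 8, 32

full = d_in * d_out                     # parameters in the frozen weight matrix
added = lora_param_count(d_in, d_out, r)
scaling = lora_alpha / r                # factor applied to the adapter output

print(f"frozen weights:  {full:,}")
print(f"LoRA parameters: {added:,} ({100 * added / full:.2f}% of the matrix)")
print(f"scaling factor:  {scaling}")
```

Only the low-rank matrices receive gradients, so the optimizer state stays small even when the base model is large.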

### Load a sample Dataset

In [None]:
from datasets import load_dataset
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

### Training the LLM for Sample Dataset 

In [None]:
import transformers

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        warmup_steps=2,
        max_steps=20,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
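
The training arguments above imply an effective batch size and a learning-rate schedule that are easy to work out by hand. The sketch below does so, assuming a single GPU and the Trainer's default linear schedule with warmup; `linear_lr` is a hypothetical helper that mirrors that schedule rather than a call into the library.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
n_gpus = 1          # assumption: one GPU
warmup_steps = 2
max_steps = 20
learning_rate = 2e-4

# Sequences contributing to each optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * n_gpus
sequences_seen = effective_batch * max_steps

def linear_lr(step: int) -> float:
    """Linear warmup from 0 to the peak rate, then linear decay to 0 at max_steps."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    return learning_rate * max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))

print("effective batch size:", effective_batch)
print("sequences seen in total:", sequences_seen)
for step in (0, 1, 2, 10, 20):
    print(f"step {step:2d}: lr = {linear_lr(step):.2e}")
```

With only 20 optimizer steps this run is a smoke test of the pipeline rather than a full fine-tune.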


In [None]:
trainer.train()

### Inference

In [None]:
text = "Ask not what your country"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

### Training for Custom Dataset 

In [7]:
with open("../data/Processed_data/Audio_data.json") as json_file:
    data = json.load(json_file)

## Print an audio conversation
for i,content in enumerate(data['data']):
    print("Text :",content['text'])
    if i>2:
        break

Text : conversation between Customer and relationship manager Harvinder  in en language: Hello. Hello. This is Harvinder from Tripco Services. Are you looking for any mortgage? Yes. Yes. Let's see. Let's say I need a mortgage. Yeah. So I was just calling you that time. You had told me to do that. Okay. No worries. Okay. So it's a good time to talk to you sir. Raghasi mortgage sir. Yeah. Okay. So I just need some few details to check your eligibility. Okay. I just want to know are you residency or non-residency? Residency. Residency. Okay. Mehanu, what's your age sir? Thirty. You're age? Thirty three. Thirty three. Okay. And your salary or your self-employed sir? Ego. What's your salary? Your salary. Okay. Yeah. And Mehanu, like how much your salary? Forty. Forty thousand. Forty thousand. Okay. And Mehanu, since how long you're working in the same company? I was before you know that. Now I joined this one since December. December? Yeah. December means... You can say five months? Five to

In [9]:
pd.DataFrame(data['data']).head()

| Unnamed: 0 | audio_url | text | customer | relationship_manager | language | call duration |
|---|---|---|---|---|---|---|
| 0 | 3d5c6413-397f-41c9-8044-effc34cdc4a2.mp3 | conversation between Customer and relationship... | Harvinder Yesar | | en | 471.384 |
| 1 | 3dd36ed7-a347-4a55-884e-5c26a305daab.mp3 | conversation between Customer Federico and rel... | Federico | Juraro | en | 186.48 |
| 2 | 40b7e8cd-192b-4b74-bc10-ce5d2f84e993.mp3 | conversation between Customer Behrouz and rela... | Behrouz | Juraira | en | 389.52 |
| 3 | 412cf035-fbcc-4d88-baab-8cdaed16bcdc.mp3 | conversation between Customer Ms. Emmy and rel... | Ms. Emmy | Jurara | en | 218.664 |
| 4 | 41b0cbd5-882f-4c62-85bc-944c2e3e7beb.mp3 | conversation between Customer Ms. Vora and rel... | Ms. Laura | Naila | en | 266.328 |
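
One possible way to turn records of this shape into a clean list of transcripts before tokenization is sketched below. The inline sample is made up for illustration and only mimics the `data` / `text` layout of the file; `extract_texts` is a hypothetical helper, not part of any library.

```python
import json

# Made-up sample that mimics the {"data": [{"text": ...}, ...]} layout.
sample_json = """
{"data": [
    {"text": " Hello. This is a sample call. ", "language": "en"},
    {"text": "", "language": "en"},
    {"text": "Second call.", "language": "en"}
]}
"""

def extract_texts(payload):
    """Collect non-empty, whitespace-trimmed transcripts from the records."""
    texts = []
    for record in payload.get("data", []):
        text = (record.get("text") or "").strip()
        if text:
            texts.append(text)
    return texts

payload = json.loads(sample_json)
texts = extract_texts(payload)
print(len(texts), "usable transcripts")
```

Dropping empty transcripts at this stage avoids feeding zero-length sequences to the tokenizer later on.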


### Load the Falcon Model and Tokenizer

In [None]:
Model_Name = 'tiiuae/falcon-7b'

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(Model_Name, 
                                             quantization_config=quant_config, 
                                             device_map="auto",
                                             trust_remote_code=True,
                                             )
tokenizer = AutoTokenizer.from_pretrained(Model_Name)
tokenizer.pad_token  = tokenizer.eos_token  ## Setting Padding token to end of the sequence

In [24]:
def print_trainable_params(model):
    """Print the number of trainable parameters and their share of all parameters."""
    trainable_params = 0
    all_params = 0
    for _, param in model.named_parameters():
        all_params += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print("Trainable params: {} || All params: {} || Trainable %: {:.4f}"
          .format(trainable_params, all_params, 100 * trainable_params / all_params))

print_trainable_params(model)

model.gradient_checkpointing_enable()


In [None]:

from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=8, 
    lora_alpha=32, 
    target_modules=["query_key_value"], 
    lora_dropout=0.05, 
    bias="none", 
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)