<h1 align='center'>Fine Tuning LLAMA2 7B Model </h1>

### Theory:

**Calculating GPU requirements**<br>
One weight parameter takes 32 bits in full precision (float32). <br>
So for the 7-billion-parameter LLAMA2 7B model, memory in bits = $7 \times 10^{9} \times 32$ bits.<br>
8 bits = 1 byte <br>
So, memory requirement = $7 \times 10^{9} \times 32 / 8$ bytes = 28 GB <br>
With 4-bit quantization, each weight is stored in 4 bits instead of 32. <br>
For LLAMA2 7B this gives $7 \times 10^{9} \times 4 / 8$ bytes = 3.5 GB for the weights alone.
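
The same arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate for the weights: it assumes exactly 7 billion parameters and 1 GB = 10^9 bytes, and it ignores activations, optimizer state and quantization constants. The helper name `weight_memory_gb` is made up for this illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 10**9 bytes)."""
    total_bits = n_params * bits_per_param
    total_bytes = total_bits / 8  # 8 bits = 1 byte
    return total_bytes / 1e9


n_params = 7e9  # assumption: exactly 7 billion parameters

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(n_params, bits):.1f} GB")
```

The 32-bit and 4-bit rows reproduce the 28 GB and 3.5 GB figures worked out above.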

**References:**<br>
- https://huggingface.co/blog/hf-bitsandbytes-integration
- https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing#scrollTo=E0Nl5mWL0k2T

### STEP 1 : Installations and Imports

In [3]:
### Mount to g-drive (Sign in required for the mount)
from google.colab import drive
drive.mount("/content/gdrive", force_remount=True)
%cd "/content/gdrive/MyDrive/Python_DA_DS_ML/Gen AI/LLM Fine Tuning"

Mounted at /content/gdrive
/content/gdrive/MyDrive/Python_DA_DS_ML/Gen AI/LLM Fine Tuning


In [4]:
!pip install -q -U bitsandbytes
# !pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install transformers==4.31 #temporary fix required owing to breaking changes on Aug 9th 2023
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
!pip install huggingface_hub -q

Collecting transformers==4.31
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers==4.31)
  Downloading huggingface_hub-0.19.0-py3-none-any.whl (311 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.31)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers==4.31)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x8

In [5]:
# Required when training models/data that are gated on HuggingFace, and required for pushing models to HuggingFace
from huggingface_hub import notebook_login
notebook_login()


In [None]:
### Use the following command for huggingface CLI login
# !huggingface-cli login

In [31]:
### Imports
import torch
import transformers
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from peft import prepare_model_for_kbit_training
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

### STEP 2 : Load Model

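`load_in_4bit`: Load the model weights in 4-bit precision, which cuts weight memory to roughly a quarter of a 16-bit load (an eighth of the 32-bit figure from the Theory section).

`bnb_4bit_use_double_quant`: Use nested quantization, which also quantizes the quantization constants, to save additional memory at no additional performance cost.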

`bnb_4bit_quant_type`: Set quantization data type. The options are either FP4 (4-bit precision), which is the default quantization data type, or NF4 (Normal Float 4), a new 4-bit data type adapted for weights that have been initialized using a normal distribution.

`bnb_4bit_compute_dtype`: Set the computational data type for 4-bit models. Default value: torch.float32
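
To get a feel for what nested (double) quantization saves, the sketch below estimates the per-weight overhead of the quantization constants. The block sizes used here (one 32-bit constant per 64 weights, then 8-bit constants with one 32-bit constant per 256 of them) are the figures reported in the QLoRA paper and are assumed, not verified, to match the bitsandbytes defaults. The function name is hypothetical.

```python
def constant_overhead_bits(block_size=64, const_bits=32, double_quant=False,
                           second_block_size=256, second_const_bits=8):
    """Extra bits per weight spent on storing quantization constants."""
    if not double_quant:
        # one full-precision constant per block of weights
        return const_bits / block_size
    # constants stored in 8 bits, plus one 32-bit constant per second-level block
    return second_const_bits / block_size + const_bits / (block_size * second_block_size)


plain = constant_overhead_bits()
nested = constant_overhead_bits(double_quant=True)
saved_gb = (plain - nested) * 7e9 / 8 / 1e9  # for a 7-billion-parameter model

print(f"overhead without double quantization: {plain:.3f} bits/weight")
print(f"overhead with double quantization:    {nested:.3f} bits/weight")
print(f"approximate saving for 7B weights:    {saved_gb:.2f} GB")
```

Under these assumptions the saving is a few hundred megabytes, small next to the 3.5 GB of 4-bit weights but useful on a memory-constrained GPU.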

In [7]:
def create_bnb_config(load_in_4bit, bnb_4bit_use_double_quant, bnb_4bit_quant_type, bnb_4bit_compute_dtype):
    '''
    Configures model quantization using bitsandbytes to reduce the memory needed for training and inference

    :param load_in_4bit: Load model in 4-bit precision mode
    :param bnb_4bit_use_double_quant: Nested quantization for 4-bit model
    :param bnb_4bit_quant_type: Quantization data type for 4-bit model
    :param bnb_4bit_compute_dtype: Computation data type for 4-bit model
    '''

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=load_in_4bit,
        bnb_4bit_use_double_quant=bnb_4bit_use_double_quant,
        bnb_4bit_quant_type=bnb_4bit_quant_type,
        bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,
    )

    return bnb_config

In [10]:
def load_model(model_name, bnb_config):
    '''
    Loads model and model tokenizer

    :param model_name: Hugging Face model name
    :param bnb_config: Bitsandbytes configuration
    '''

    # Count the available GPUs and cap the memory used on each one (40960 MB = 40 GB)
    n_gpus = torch.cuda.device_count()
    max_memory = '40960MB'

    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map='auto',  # dispatch the model efficiently on the available resources
        max_memory={i: max_memory for i in range(n_gpus)},
    )

    # Load model tokenizer with the user authentication token
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token = True)

    # Set padding token as EOS token
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer

In [9]:
# bnb configs
load_in_4bit = True
bnb_4bit_use_double_quant = True
bnb_4bit_quant_type = 'nf4'
bnb_4bit_compute_dtype = torch.bfloat16
bnb_config = create_bnb_config(load_in_4bit, bnb_4bit_use_double_quant, bnb_4bit_quant_type, bnb_4bit_compute_dtype)
bnb_config

BitsAndBytesConfig {
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "load_in_4bit": true
}

In [18]:
# Load Model
model_name =  "meta-llama/Llama-2-7b-chat-hf"
model, tokenizer = load_model(model_name,bnb_config=bnb_config)

In [19]:
## model has been loaded
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )


### STEP 3 : Training Setup

In [23]:
# Gradient checkpointing trades extra computation for lower memory use during backpropagation:
# activations are recomputed in the backward pass instead of being stored, so each step is slower but needs less memory.
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [24]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [25]:
### Setting up the LoraConfig for model training
config = LoraConfig(
    r=8,
    lora_alpha=32,
    # target_modules=["query_key_value"],
    target_modules=["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj", "self_attn.o_proj"], #specific to Llama models.
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
### Wrap the model with LoRA adapters: the base weights stay frozen and only the small low-rank adapter matrices are trained
model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 8388608 || all params: 3508801536 || trainable%: 0.23907331075678143
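
The trainable-parameter count printed above can be checked by hand. Each adapted `Linear4bit(4096, 4096)` layer gains a `lora_A` matrix of shape `r x in_features` and a `lora_B` matrix of shape `out_features x r`. The sketch below assumes the shapes shown in the model printout (32 decoder layers, hidden size 4096) and the four attention projections targeted in `LoraConfig`; the helper name is made up for this illustration.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


n_layers = 32          # decoder layers, from the model printout
hidden = 4096          # in/out features of q_proj, k_proj, v_proj, o_proj
r = 8                  # LoRA rank from LoraConfig
targets_per_layer = 4  # q_proj, k_proj, v_proj, o_proj

trainable = n_layers * targets_per_layer * lora_param_count(hidden, hidden, r)
all_params = 3508801536  # "all params" value reported by print_trainable_parameters

print(f"trainable params: {trainable}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```

This gives 8,388,608 trainable parameters, matching the reported output. The "all params" figure is taken from the notebook output rather than derived here.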


### STEP 4 : Load Data

Dataset : https://huggingface.co/datasets/Abirate/english_quotes

In [27]:
data = load_dataset("Abirate/english_quotes")

In [28]:
data

DatasetDict({
    train: Dataset({
        features: ['quote', 'author', 'tags'],
        num_rows: 2508
    })
})

In [29]:
## Tokenizing the data
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
data

DatasetDict({
    train: Dataset({
        features: ['quote', 'author', 'tags', 'input_ids', 'attention_mask'],
        num_rows: 2508
    })
})

### STEP 5 : Training

In [41]:
def set_training_args():
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    )
    return args

def set_trainer(model, training_args):
    model.config.use_cache = False  # the KV cache is not used during training; disabling it silences the warnings. Please re-enable for inference!
    trainer = transformers.Trainer(
        model=model,
        train_dataset=data["train"],
        args=training_args,
        # mlm=False selects causal (next-token) language modeling instead of masked language modeling
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    return trainer

In [42]:
training_args = set_training_args()
trainer = set_trainer(model, training_args)
trainer

<transformers.trainer.Trainer at 0x7b8893d393f0>

In [43]:
### Train the model
trainer.train()

Step,Training Loss
1,2.349
2,1.5248
3,1.7539
4,1.4698
5,2.2051
6,1.6167
7,1.7955
8,1.2189
9,2.2919
10,1.7817


TrainOutput(global_step=10, training_loss=1.8007404685020447, metrics={'train_runtime': 57.083, 'train_samples_per_second': 0.701, 'train_steps_per_second': 0.175, 'total_flos': 30825159745536.0, 'train_loss': 1.8007404685020447, 'epoch': 0.02})
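
The metrics in `TrainOutput` follow from the training arguments. The sketch below reproduces them, assuming a single GPU so that the effective batch size is simply `per_device_train_batch_size * gradient_accumulation_steps`. The runtime and dataset size are copied from the outputs above.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 10
num_rows = 2508         # size of the train split
train_runtime = 57.083  # seconds, from TrainOutput

# assumption: one GPU, so no multiplication by the device count
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = effective_batch_size * max_steps

print(f"effective batch size: {effective_batch_size}")
print(f"samples seen:         {samples_seen}")
print(f"epoch fraction:       {samples_seen / num_rows:.2f}")
print(f"samples per second:   {samples_seen / train_runtime:.3f}")
print(f"steps per second:     {max_steps / train_runtime:.3f}")
```

With only 40 of the 2508 quotes seen, this run is a smoke test of the pipeline rather than a meaningful fine-tune; raising `max_steps` is the obvious next step.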

### STEP 6 : Inference

In [45]:
model.config.use_cache = True
model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096, padding_idx=0)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear4bit(
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
              )
              (k_proj): Linear4bit(
                (lora_dropout): Module

In [46]:
# Define a stream *without* function calling capabilities
def stream(user_prompt):
    runtimeFlag = "cuda:0"
    system_prompt = 'You are a helpful assistant that provides accurate and concise responses'

    B_INST, E_INST = "[INST]", "[/INST]"
    B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

    prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{user_prompt.strip()} {E_INST}\n\n"
    inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)
    streamer = transformers.TextStreamer(tokenizer)

    # Despite returning the usual output, the streamer will also print the generated text to stdout.
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500)
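
The prompt template used inside `stream` can be pulled out into a small helper so it can be inspected without a GPU or a loaded model. This is a minimal sketch that mirrors the f-string above; the function name `build_prompt` is hypothetical and not part of the notebook.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


def build_prompt(user_prompt: str, system_prompt: str) -> str:
    """Wrap a system prompt and a user prompt in the Llama 2 chat markers."""
    return (
        f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}"
        f"{user_prompt.strip()} {E_INST}\n\n"
    )


prompt = build_prompt(
    "Provide a very brief comparison of salsa and bachata.",
    "You are a helpful assistant that provides accurate and concise responses",
)
print(prompt)
```

The tokenizer adds the leading `<s>` token itself, which is why it shows up in the streamed output but not in the template.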

In [47]:
stream('Provide a very brief comparison of salsa and bachata.')

<s> [INST] <<SYS>>
You are a helpful assistant that provides accurate and concise responses
<</SYS>>

Provide a very brief comparison of salsa and bachata. [/INST]

Sure, here's a brief comparison between salsa and bachata:

Salsa:

* Originated in Cuba and is a fast-paced, energetic dance style
* Characterized by quick footwork and hip movements
* Typically danced to upbeat, lively music with a strong rhythm

Bachata:

* Originated in the Dominican Republic and is a romantic, sensual dance style
* Characterized by slow, fluid movements and a focus on hip action
* Typically danced to mellow, romantic music with a strong beat

In summary, salsa is fast-paced and energetic, while bachata is slow and sensual.</s>


In [50]:
stream('"So many books and so little time" is a quote of which author ?')

<s> [INST] <<SYS>>
You are a helpful assistant that provides accurate and concise responses
<</SYS>>

"So many books and so little time" is a quote of which author ? [/INST]

The quote "So many books and so little time" is attributed to the author Louisa May Alcott.</s>


In [49]:
stream("'If you tell the truth, you don't have to remember anything' is a quote of which author ?")

<s> [INST] <<SYS>>
You are a helpful assistant that provides accurate and concise responses
<</SYS>>

'If you tell the truth, you don't have to remember anything' is a quote of which author ? [/INST]

The quote "If you tell the truth, you don't have to remember anything" is attributed to Mark Twain.</s>


In [51]:
stream('"Be the change that you wish to see in the world." is a quote of which author?')

<s> [INST] <<SYS>>
You are a helpful assistant that provides accurate and concise responses
<</SYS>>

"Be the change that you wish to see in the world." is a quote of which author [/INST]

The quote "Be the change that you wish to see in the world" is attributed to Mahatma Gandhi.</s>


### STEP 7 : Push Model to HF Hub

In [None]:
# Extract the last portion of the base_model
base_model_name = model_name.split("/")[-1]

# Define the save and push paths
adapter_model = f"KD440/{base_model_name}-fine-tuned-adapters"  # adjust 'KD440' to your HuggingFace username or organisation
new_model = f"KD440/{base_model_name}-fine-tuned"  # adjust 'KD440' to your HuggingFace username or organisation

In [None]:
# Save the model
model.save_pretrained(adapter_model, push_to_hub=True, use_auth_token=True)
# Push the model to the hub
model.push_to_hub(adapter_model, use_auth_token=True)

In [None]:
# Reload the base model in float16 on the CPU. This loads the full original model, not the quantized one (about 14 GB of weights for the 7B model), so a high-RAM environment may be needed.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='cpu', trust_remote_code=True, torch_dtype=torch.float16)

In [None]:
from peft import PeftModel

# Load the PEFT model with the new adapters
model = PeftModel.from_pretrained(
    model,
    adapter_model,
)

In [None]:
model = model.merge_and_unload() # merge adapters with the base model.
model.push_to_hub(new_model, use_auth_token=True, max_shard_size="5GB")
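
Conceptually, merging folds each adapter into its base weight matrix: `W_merged = W + (lora_alpha / r) * B @ A`. The toy example below illustrates that identity in pure Python with tiny matrices. It is a sketch of the idea, not PEFT's implementation, and the `lora_alpha / r` scaling is assumed to match PEFT's default behaviour. All helper names are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, lora_alpha):
    """Return W + (lora_alpha / r) * B @ A, where r is the number of rows of A."""
    r = len(A)
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # base weight, shape (out=2, in=3)
A = [[0.5, -1.0, 0.25]]                  # lora_A, shape (r=1, in=3)
B = [[2.0], [-4.0]]                      # lora_B, shape (out=2, r=1)
x = [[1.0], [2.0], [3.0]]                # input column vector

merged = merge_lora(W, A, B, lora_alpha=4)  # scale = 4 / 1, same ratio as 32 / 8

# Adapter forward pass: base output plus scaled low-rank update
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
adapter_out = [[b[0] + 4.0 * l[0]] for b, l in zip(base_out, lora_out)]

print(merged)
print(matmul(merged, x) == adapter_out)
```

After merging, the model has the same architecture as the base model and no longer needs the `peft` library at inference time, which is why the merged weights are pushed as a separate repository.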

In [None]:
#Push the tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.push_to_hub(new_model, use_auth_token=True)