### Fine-tuning a Llama model
The goal of this stage is to prepare the data that will be used to fine-tune the model. Data preparation is a critical step because the quality and format of your data significantly impact how well your model learns and performs.

First, the code installs several Python libraries that are essential for the process. These libraries help with model acceleration, efficient computation, working with transformers, and training the model.


Let's install the required packages before we start.

In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7


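Because the install cell pins exact versions, it can be worth confirming what actually ended up in the environment before going further. The sketch below compares installed versions against the pins using only the standard library. `PINNED` and `check_versions` are illustrative names and are not part of the original notebook.

```python
from importlib import metadata

# Versions pinned by the install cell above.
PINNED = {
    "accelerate": "0.21.0",
    "peft": "0.4.0",
    "bitsandbytes": "0.40.2",
    "transformers": "4.31.0",
    "trl": "0.4.7",
}


def check_versions(pinned, get_version=metadata.version):
    """Return {package: (wanted, found, matches)}; found is None if not installed."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        report[name] = (wanted, found, found == wanted)
    return report


# Print one line per package so mismatches are easy to spot.
for name, (wanted, found, ok) in check_versions(PINNED).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name:15s} wanted={wanted:8s} found={found} [{status}]")
```

A mismatch does not necessarily mean the notebook will fail, but these libraries change quickly, so a different version is a likely first suspect if a later cell raises an unexpected error.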

Essential Python libraries and modules from the transformers and other packages are imported. These will help load datasets, process data, and define the model architecture.


In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer



Let's start with the first stage: data preparation.

In [None]:
# The instruction dataset to use
dataset_name = "mlabonne/guanaco-llama2-1k"

dataset = load_dataset(dataset_name, split="train")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


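Before training, it helps to look at what a single example contains. The sketch below assumes each row's `text` field is already formatted in the Llama 2 chat template (`<s>[INST] instruction [/INST] response </s>`), which matches the prompt format used for inference later in this notebook but is not verified here. `split_turns` is a hypothetical helper, and the sample string is made up rather than taken from the dataset.

```python
import re

# Matches one "[INST] instruction [/INST] response" turn of the assumed template.
TURN_PATTERN = re.compile(
    r"\[INST\](.*?)\[/INST\](.*?)(?=<s>|</s>|\[INST\]|$)", re.DOTALL
)


def split_turns(text):
    """Return a list of (instruction, response) pairs found in a formatted sample."""
    return [(inst.strip(), resp.strip()) for inst, resp in TURN_PATTERN.findall(text)]


# Made-up sample in the assumed format (not a real dataset row).
sample = "<s>[INST] What is LoRA? [/INST] A parameter-efficient fine-tuning method. </s>"
print(split_turns(sample))
```

With the dataset loaded, the same helper could be applied to a real row, for example `split_turns(dataset[0]["text"])`. An empty list would suggest the rows do not follow the assumed template.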

With the data loaded, it is time to prepare the model and tokenizer.

A tokenizer is loaded using AutoTokenizer.from_pretrained. The tokenizer prepares text data for the model by converting words into tokens (numerical representations) the model can understand.

The model is initially configured to use specific computational optimizations for efficiency. This includes using a 4-bit quantized version if specified, which reduces the model's memory footprint.
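To see why 4-bit loading matters, a rough back-of-the-envelope estimate of the memory needed just to hold the weights is useful. The sketch below assumes a nominal 7 billion parameters and ignores everything else (quantization constants, activations, optimizer state, the CUDA context), so the real footprint will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7_000_000_000  # nominal size of a "7B" model (assumption)

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:8s} ~{weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights need roughly 13 GiB in float16 but only about 3.3 GiB in 4-bit, which is what makes a 7B model practical on a single consumer or free-tier GPU.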



In [None]:
# Model and Tokenizer Configuration Parameters

# Parameter: bnb_4bit_compute_dtype
# Purpose: Specifies the data type for computations when using 4-bit precision
# Here, it's set to use 16-bit floating point numbers (float16)
bnb_4bit_compute_dtype = "float16"

# Parameter: use_4bit
# Purpose: Flag to indicate if the model should be loaded with 4-bit quantized weights
# Using 4-bit weights can significantly reduce model size and memory footprint
use_4bit = True

# Parameter: bnb_4bit_quant_type
# Purpose: Specifies the type of quantization, can be 'fp4' or 'nf4'
# 'nf4' is used here, which stands for normal float 4-bit quantization
bnb_4bit_quant_type = "nf4"

# Parameter: use_nested_quant
# Purpose: Flag to indicate if nested quantization is used for 4-bit models
# Nested quantization is not used in this case
use_nested_quant = False

# Parameter: model_name
# Purpose: Specifies the identifier of the model to be loaded from Hugging Face model hub
# This is the name of the pre-trained model
model_name = "NousResearch/Llama-2-7b-chat-hf"

# Parameter: device_map
# Purpose: Maps model layers to specific devices, like GPUs
# Here, it maps all layers to GPU 0
device_map = {"": 0}

# Prepare the dtype for model computation based on the bnb_4bit_compute_dtype string
# This converts the string 'float16' to the actual torch.float16 data type
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)


bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"



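The last two tokenizer lines in the cell above reuse the end-of-sequence token as the padding token and pad on the right. The toy example below illustrates what that means for a batch. It uses a simple whitespace tokenizer, which is only a stand-in: the real Llama tokenizer splits text into subword pieces, and the names `build_vocab` and `encode_batch` are hypothetical.

```python
def build_vocab(texts, specials=("<eos>", "<unk>")):
    """Assign an integer id to each special token and each distinct word."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_batch(texts, vocab, pad_token="<eos>"):
    """Encode texts to ids, append <eos>, and right-pad each row to the longest one."""
    rows = [
        [vocab.get(w, vocab["<unk>"]) for w in t.lower().split()] + [vocab["<eos>"]]
        for t in texts
    ]
    width = max(len(r) for r in rows)
    pad_id = vocab[pad_token]  # padding reuses the <eos> id, as in the notebook
    input_ids = [r + [pad_id] * (width - len(r)) for r in rows]
    # The attention mask is what distinguishes real tokens (1) from padding (0).
    attention_mask = [[1] * len(r) + [0] * (width - len(r)) for r in rows]
    return input_ids, attention_mask


texts = ["fine tune the model", "train"]
vocab = build_vocab(texts)
input_ids, attention_mask = encode_batch(texts, vocab)
print(input_ids)
print(attention_mask)
```

Because the padding id and the end-of-sequence id are identical, the ids alone cannot tell padding apart from a genuine end of sequence; the attention mask carries that information.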

### Model Building
In this stage, you set up the model with necessary configurations, especially focusing on adjustments that allow the model to learn from your specific dataset effectively.

LoRA configuration: LoRA (Low-Rank Adaptation) adapts a large model by freezing its original weights and training a small number of additional low-rank matrices. The settings below control the size, scaling, and regularization of those matrices.


In [None]:
# Parameters for LoRA (Low-Rank Adaptation) Configuration

# Parameter: lora_alpha
# Purpose: Scaling factor for the LoRA update, which controls how strongly the
#          learned low-rank update contributes relative to the frozen weights.
# The update is scaled by lora_alpha / r, so 16 with r = 64 gives a factor of 0.25.
lora_alpha = 16

# Parameter: lora_dropout
# Purpose: Dropout rate applied to the input of the LoRA layers, which helps
#          prevent overfitting by randomly zeroing activations during training.
# Set to 0.1, so each input element has a 10% chance of being zeroed.
lora_dropout = 0.1

# Parameter: lora_r
# Purpose: The rank of the two low-rank matrices whose product approximates the
#          update to the original weight matrices in the adapted layers.
# This is set to 64, meaning the rank of the adaptation matrices is 64.
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)
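To get a feel for how small the LoRA update is, the number of trainable parameters can be estimated by hand. The sketch below rests on several assumptions that the notebook does not state: a hidden size of 4096 and 32 transformer layers for Llama-2-7B, and LoRA applied only to the query and value projections (which modules are adapted when `target_modules` is not set depends on the peft version). `lora_trainable_params` is an illustrative helper.

```python
def lora_trainable_params(d_in, d_out, r, num_layers, matrices_per_layer):
    """Parameters added by LoRA: each adapted (d_out x d_in) weight gets
    a matrix B of shape (d_out x r) and a matrix A of shape (r x d_in)."""
    per_matrix = r * (d_in + d_out)
    return per_matrix * matrices_per_layer * num_layers


lora_r, lora_alpha = 64, 16          # values from the cell above
hidden_size, num_layers = 4096, 32   # assumed Llama-2-7B dimensions
adapted_per_layer = 2                # assumed: query and value projections only

added = lora_trainable_params(
    hidden_size, hidden_size, lora_r, num_layers, adapted_per_layer
)
scaling = lora_alpha / lora_r

print(f"trainable LoRA parameters: {added:,}")
print(f"update scaling (lora_alpha / r): {scaling}")
```

Under these assumptions LoRA trains about 33.5 million parameters, well under 1% of the roughly 7 billion in the base model. Raising `lora_r` increases that count linearly and, with `lora_alpha` fixed, also shrinks the scaling factor.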


Training arguments: these settings control how the model is trained, including the number of epochs, the batch size, the learning rate, and whether to use mixed-precision training for faster computation.


In [None]:
output_dir = "./results"  # Directory to save the model
num_train_epochs = 3  # Number of training epochs
per_device_train_batch_size = 8  # Batch size per device (GPU/TPU)
gradient_accumulation_steps = 1  # Number of batches to accumulate gradients over before each optimizer update
optim = "adamw_torch"  # Optimizer to use
save_steps = 500  # Save a checkpoint every X update steps
logging_steps = 100  # Log every X update steps
learning_rate = 5e-5  # Learning rate
weight_decay = 0.01  # Weight decay
fp16 = False  # Use 16-bit (mixed) precision training
bf16 = False  # Use bfloat16 precision training
max_grad_norm = 1.0  # Max gradient norm
max_steps = -1  # If > 0: set total number of training steps to perform (overrides num_train_epochs)
warmup_ratio = 0.1  # Ratio of total training steps used for a linear warmup from 0 to learning_rate
group_by_length = False  # Group sequences of roughly the same length together when batching
lr_scheduler_type = "linear"  # Learning rate schedule: linear decay after warmup


In [None]:

training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)
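Several of these arguments interact: the batch size and accumulation steps determine how many optimizer updates an epoch contains, and `warmup_ratio` is expressed as a fraction of that total. The sketch below works through the arithmetic and the shape of a linear schedule with warmup. It is an approximation for a single device, and the figure of 1,000 training sequences is only a stand-in: with packing enabled in the trainer, the real number of sequences will differ. Both function names are hypothetical.

```python
import math


def training_schedule(num_sequences, batch_size, accumulation_steps, epochs, warmup_ratio):
    """Return (effective_batch, steps_per_epoch, total_steps, warmup_steps)."""
    effective_batch = batch_size * accumulation_steps
    steps_per_epoch = math.ceil(num_sequences / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, steps_per_epoch, total_steps, warmup_steps


def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


# Values from the cells above, with an assumed 1,000 training sequences.
eff, per_epoch, total, warmup = training_schedule(1000, 8, 1, 3, 0.1)
print(f"effective batch={eff}, steps/epoch={per_epoch}, total={total}, warmup={warmup}")

for step in (0, warmup, total // 2, total):
    print(step, linear_lr(step, 5e-5, warmup, total))
```

The learning rate peaks at `learning_rate` exactly when warmup ends and reaches zero at the final step, so shortening training with `max_steps` also compresses the whole schedule.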

With the hyperparameters set, let's train the model.

### Model Training and Testing
Initialize the trainer: the `SFTTrainer` from the trl library handles the fine-tuning of the model. It is configured with the model, the training dataset, the LoRA (PEFT) configuration, the tokenizer, and the training arguments set earlier.

In [None]:
trainer = SFTTrainer(
    model=model,  # Your model instance
    train_dataset=dataset, # Your dataset
    peft_config=peft_config, # PEFT configuration
    dataset_text_field="text",
    max_seq_length=512,  # Maximum sequence length for the inputs
    tokenizer=tokenizer,  # Your tokenizer instance
    args=training_arguments,
    packing=True,   # Pack several short examples into each max_seq_length sequence
)
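The `packing=True` option in the cell above deserves a short illustration. The idea is to concatenate tokenized examples into one long stream, separated by an end-of-sequence token, and then cut that stream into fixed-length chunks so that little of each batch is wasted on padding. The sketch below is a simplified version of that idea with toy token ids; it is not trl's implementation, which may differ in details such as buffering and shuffling.

```python
def pack_sequences(tokenized_examples, eos_id, seq_length):
    """Concatenate examples (each followed by eos_id) and cut the stream
    into full chunks of seq_length tokens, dropping any incomplete tail."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
        stream.append(eos_id)
    n_full = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_full)]


# Three toy "tokenized" examples of different lengths.
examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
packed = pack_sequences(examples, eos_id=0, seq_length=4)
print(packed)
```

One consequence is that the number of training sequences no longer equals the number of dataset rows, and a single packed sequence can contain the end of one example followed by the start of another.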



The model is trained using the train method of the SFTTrainer. This method adjusts the model parameters based on the training data to minimize the prediction error.
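For a causal language model, the "prediction error" is typically the average cross-entropy between the model's predicted distribution for the next token and the token that actually follows. The pure-Python sketch below computes that quantity for a toy vocabulary of four tokens. It illustrates the objective only and assumes nothing about how the trainer implements it internally; `next_token_loss` is a hypothetical name.

```python
import math


def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from the logits at position t.

    logits[t] is a list of scores over the vocabulary; needs at least two tokens.
    """
    losses = []
    for position in range(len(token_ids) - 1):
        scores = logits[position]
        target = token_ids[position + 1]
        # Log-softmax via the log-sum-exp trick for numerical stability.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        losses.append(log_norm - scores[target])
    return sum(losses) / len(losses)


tokens = [0, 1, 2]
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3      # the model has no preference
confident = [[0.0, 20.0, 0.0, 0.0],       # strongly predicts token 1 next
             [0.0, 0.0, 20.0, 0.0],       # strongly predicts token 2 next
             [0.0, 0.0, 0.0, 0.0]]        # last position has no target

print(next_token_loss(uniform, tokens))
print(next_token_loss(confident, tokens))
```

A model that guesses uniformly over a vocabulary of size V scores a loss of ln(V), and a model that predicts every next token with near certainty scores close to zero, so a falling training loss means the model is assigning more probability to the tokens that actually occur.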


In [None]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...



Step,Training Loss
100,1.7239
200,1.4466


After training, the model is saved for later use or deployment. The trained model can then generate text based on prompts to evaluate its performance qualitatively.


In [None]:
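# Directory for the saved LoRA adapter (placeholder name; it is not defined elsewhere in this notebook)
new_model = "llama-2-7b-chat-finetuned"
trainer.model.save_pretrained(new_model)

# Generate a response to a test prompt, wrapped in the Llama 2 chat format
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])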
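The prompt string in the cell above follows the Llama 2 chat template. When testing several prompts, it can be convenient to wrap that formatting in small helpers, as in the sketch below. `build_prompt` and `extract_answer` are hypothetical helpers, and the optional system-prompt block uses the `<<SYS>>` convention of the Llama 2 chat format, which this notebook does not otherwise use.

```python
def build_prompt(user_message, system_prompt=None):
    """Wrap a single user message in the Llama 2 chat template."""
    if system_prompt:
        user_message = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_message}"
    return f"<s>[INST] {user_message} [/INST]"


def extract_answer(generated_text):
    """Return only the text after the last [/INST] marker.

    The text-generation output includes the prompt, so this strips it off.
    """
    return generated_text.rsplit("[/INST]", 1)[-1].strip()


print(build_prompt("What is a large language model?"))
print(extract_answer("<s>[INST] Hi [/INST] Hello there."))
```

With the pipeline from the cell above, the helpers could be combined as `extract_answer(pipe(build_prompt(prompt))[0]["generated_text"])` to print only the model's reply.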