## Setting up the environment

1. See ph1-inmemory for CUDA setup
2. Install the required packages

In [None]:

# bitsandbytes: k-bit quantization for PyTorch, which makes large language models more accessible
#%pip install bitsandbytes
# datasets: a library for easily accessing, sharing, and processing datasets for audio, computer vision, and natural language processing (NLP) tasks
#%pip install datasets
# peft (Parameter-Efficient Fine-Tuning): a library for adapting large pre-trained models to downstream applications without fine-tuning all of a model's parameters, which would be prohibitively costly
#%pip install peft


## Run the sample

In [None]:
# pip install setuptools GPUtil datasets bitsandbytes accelerate trl

In [None]:
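# validate that CUDA is available
import torch

# print the CUDA device name (only query it when a GPU is present,
# because get_device_name() raises an error otherwise)
cuda_available = torch.cuda.is_available()
device_name = torch.cuda.get_device_name(0) if cuda_available else "no CUDA device found"
print("CUDA is available? {}\nUsing: {}".format(cuda_available, device_name))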

In [None]:
# set up a helper function to print GPU utilization
import GPUtil


def print_gpu_utilization():
    """Prints GPU usage using GPUtil."""
    gpus = GPUtil.getGPUs()
    for gpu in gpus:
        print(f"GPU {gpu.id}: {gpu.name}, Utilization: {gpu.load * 100:.2f}%")


In [None]:
import torch
from transformers import AutoTokenizer

base_model_id = "microsoft/phi-1"

# from transformers import AutoModel
# checkpoint_directory = "./fine-tuned-model/checkpoint-500"
# 
# model_in_progress = AutoModel.from_pretrained(checkpoint_directory, load_in_8bit=True)
# fine_tuned_tokenizer = AutoTokenizer.from_pretrained(checkpoint_directory)


# Initialize a tokenizer (AutoTokenizer) from the pretrained vocabulary and settings published for base_model_id.
# use_fast=True requests the faster Rust-based tokenizer implementation when one is available.
tokenizer = AutoTokenizer.from_pretrained(base_model_id, use_fast=True)
# pad on the right and reuse the end-of-sequence token as the padding token
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token

The `BitsAndBytesConfig` class holds the configuration options for loading a model through bitsandbytes. It lets you choose the quantization scheme and the bit width in which the model weights are loaded.

In the code below, the `BitsAndBytesConfig` object is initialized with four arguments. `load_in_4bit=True` loads the model with 4-bit quantization. `bnb_4bit_quant_type="nf4"` selects NF4 (4-bit NormalFloat) as the quantization data type for the `bnb.nn.Linear4bit` layers. `bnb_4bit_compute_dtype=compute_dtype` sets the data type used for computation: the weights are stored in 4 bits but dequantized to this type for the matrix multiplications. `bnb_4bit_use_double_quant=True` enables nested quantization, which also quantizes the quantization constants to save additional memory.

`getattr` is a built-in Python function that returns the value of a named attribute of an object. Here it looks up the attribute named `"bfloat16"` on the `torch` module, so `compute_dtype` is `torch.bfloat16`. Passing the name as a string makes it easy to switch the compute data type, for example to `"float16"` on a GPU without bfloat16 support.
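To get a feel for what 4-bit loading saves, the arithmetic can be sketched in plain Python. This is a rough estimate under stated assumptions, not a measurement: the parameter count of 1.3 billion, the block size of 64 weights per scale, and the group of 256 scales per second-level constant are assumptions (the last two follow the description in the QLoRA paper), and the function names are hypothetical. The sketch treats every parameter as quantized and ignores activations, optimizer state, and layers that stay in higher precision.

```python
# Back-of-the-envelope memory estimate for 4-bit weights (illustrative only).

def quant_overhead_bits(block_size=64, double_quant=True, outer_block=256):
    """Bits per parameter spent on quantization constants."""
    if not double_quant:
        # one 32-bit scale per block of weights
        return 32 / block_size
    # 8-bit scales per block, plus one 32-bit second-level constant
    # shared by `outer_block` first-level scales
    return 8 / block_size + 32 / (block_size * outer_block)


def weight_memory_gib(n_params, weight_bits=4, **kwargs):
    """Approximate memory for the quantized weights, in GiB."""
    bits = n_params * (weight_bits + quant_overhead_bits(**kwargs))
    return bits / 8 / 1024**3


n_params = 1.3e9  # assumed parameter count, for illustration only
print(f"16-bit baseline:   {n_params * 16 / 8 / 1024**3:.2f} GiB")
print(f"nf4, single quant: {weight_memory_gib(n_params, double_quant=False):.2f} GiB")
print(f"nf4, double quant: {weight_memory_gib(n_params, double_quant=True):.2f} GiB")
```

Under these assumptions, double quantization lowers the overhead of the quantization constants from 0.5 bits to roughly 0.127 bits per parameter, which is what the `bnb_4bit_use_double_quant` option in this notebook's configuration is for.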

In [None]:
from transformers import BitsAndBytesConfig

compute_dtype = getattr(torch, "bfloat16")
print(f'compute_dtype: {compute_dtype}' )
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)

The *from_pretrained* method is a convenient way to load a pre-trained model for causal language modeling, with options to specify various configurations such as quantization, device mapping, and attention implementation.

The *from_pretrained* method takes several arguments. The first argument is `pretrained_model_name_or_path`, which specifies the name or path of the pre-trained model to load. This argument is required.

The other arguments are passed as keyword arguments (`**kwargs`). The code below provides several of them:
- `trust_remote_code=True`: Allows custom model code defined in the model's Hub repository to be downloaded and executed on your machine. Enable it only for repositories whose code you have reviewed and trust. If set to False, only the model classes built into the transformers library are used.
- `quantization_config=bnb_config`: Specifies the quantization configuration for the model. Quantization stores the weights in a lower-precision format to reduce the model's memory footprint. Here it is set to the 4-bit `bnb_config` defined earlier.
- `device_map={"": 0}`: Specifies where each part of the model is placed. Keys are module names and values are devices. The empty string refers to the whole model, so the entire model is loaded onto GPU 0.
- `torch_dtype="auto"`: Specifies the torch data type for the weights that are not quantized. "auto" takes the data type from the checkpoint instead of defaulting to float32.
- `attn_implementation="eager"`: Selects the attention implementation. Attention is the mechanism that lets the network focus on relevant parts of the input. "eager" uses the plain PyTorch implementation written out in the modeling code rather than an optimized kernel such as SDPA or FlashAttention.

The *from_pretrained* method performs several operations. It sets up the necessary configurations for loading the model. It also checks whether the model has remote code and whether to trust it. If remote code is present and trusted, it dynamically loads the model class and registers it. Otherwise, if the model configuration is recognized, it retrieves the corresponding built-in model class and loads the model.
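The way a `device_map` entry such as `{"": 0}` covers a whole model can be illustrated with a small lookup. This is a simplified sketch of the idea, not the library's code: `resolve_device` is a hypothetical helper, and the module names and the split map are made up for the example.

```python
def resolve_device(module_name, device_map):
    """Return the device for `module_name` using the closest parent entry.

    Hypothetical helper: walks up the dotted module path until a key in
    `device_map` matches; the empty string stands for the model root.
    """
    name = module_name
    while name and name not in device_map:
        # drop the last path component: "a.b.c" -> "a.b" -> "a" -> ""
        name = ".".join(name.split(".")[:-1])
    if name not in device_map:
        raise KeyError(f"no device_map entry covers {module_name!r}")
    return device_map[name]


# The map used in this notebook: everything goes to GPU 0.
whole_model = {"": 0}
print(resolve_device("model.layers.3.self_attn.q_proj", whole_model))

# A made-up split map, only to show that more specific keys win.
split = {"": "cpu", "model.layers.3": 1}
print(resolve_device("model.layers.3.mlp.fc1", split))
print(resolve_device("lm_head", split))
```

With the single root entry, every module resolves to device 0, which is why this form is a common shorthand for "put the whole model on the first GPU".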



In [None]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        trust_remote_code=True,
        quantization_config=bnb_config,
        device_map={"": 0},  # place the whole model on GPU 0
        torch_dtype="auto",
        attn_implementation="eager",
)

We need a dataset that we can use to teach the model about new data. Many are available on Hugging Face, and you can also build your own. Here we will use the Guanaco dataset from Tim Dettmers.

> This dataset is a subset of the Open Assistant dataset, which you can find here: https://huggingface.co/datasets/OpenAssistant/oasst1/tree/main
> This subset of the data only contains the highest-rated paths in the conversation tree, with a total of 9,846 samples.

* https://huggingface.co/datasets/timdettmers/openassistant-guanaco
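The test prompt at the end of this notebook uses `### Human:` and `### Assistant:` turn markers. A minimal sketch of a helper that builds prompts in that shape is shown here. `build_prompt` is a hypothetical name, and the exact spacing around the markers is an assumption copied from that test prompt.

```python
def build_prompt(question):
    """Wrap a question in '### Human:' / '### Assistant:' turn markers.

    The assistant turn is left open so that the model completes it.
    """
    return f"### Human: {question}### Assistant:"


print(build_prompt("Cite 20 famous people."))
```

Keeping inference prompts in the same shape as the training text generally helps the fine-tuned model recognize where its answer should begin.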

In [None]:
from datasets import load_dataset

dataset = load_dataset("timdettmers/openassistant-guanaco")

The `LoraConfig` class configures the LoRA adapter that PEFT adds to the model. The main attributes are:

- `r`: The LoRA attention dimension, also known as the "rank" of the update matrices. It determines the adapter's capacity: a higher rank can capture more nuanced relationships in the data at the cost of more trainable parameters. The default is 8.
- `lora_alpha`: The alpha parameter for LoRA scaling, with a default value of 8. The adapter's output is scaled by `lora_alpha / r` before it is added to the output of the frozen layer, so this parameter controls the balance between the original outputs and the adjustments introduced by LoRA.
- `lora_dropout`: The dropout probability for the LoRA layers, with a default value of 0.0. Dropout is a form of regularization that helps prevent overfitting by randomly setting a fraction of the inputs to 0 at each training step.
- `bias`: Controls which bias parameters are trained. This refers to the bias terms of the model's layers, not to bias in the bias-variance sense. It can be set to "none", "all", or "lora_only". If set to "all" or "lora_only", the corresponding biases will be updated during training.
- `target_modules`: The names of the modules to which the LoRA adapter is applied. A module is a component or layer of the neural network, such as a linear layer, a convolutional layer, or an attention layer. The configuration below targets the `q_proj` and `k_proj` attention projections.
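How `r` and `lora_alpha` translate into numbers can be sketched with simple arithmetic. The values `r=8`, `lora_alpha=8`, and two target modules come from this notebook. The hidden size of 2048 and the 24 layers are assumptions about the base model made for illustration, and the function names are hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update before it is added to the frozen output."""
    return lora_alpha / r


hidden_size = 2048     # assumed width of the projection layers
num_layers = 24        # assumed number of transformer blocks
targets_per_layer = 2  # q_proj and k_proj
r, lora_alpha = 8, 8

per_module = lora_param_count(hidden_size, hidden_size, r)
total = per_module * targets_per_layer * num_layers
frozen = hidden_size * hidden_size * targets_per_layer * num_layers

print(f"scaling factor: {lora_scaling(lora_alpha, r)}")
print(f"LoRA parameters per module: {per_module:,}")
print(f"LoRA parameters in total:   {total:,} ({total / frozen:.2%} of the targeted weights)")
```

Doubling `r` doubles the number of trainable parameters. Doubling `lora_alpha` at a fixed `r` doubles the weight given to the adapter's update.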


In [None]:
from peft import LoraConfig

peft_config = LoraConfig(
        lora_alpha=8,  # Reduced from 16
        lora_dropout=0.05,
        r=8,  # Reduced from 16
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj"],  # other candidates: "v_proj", "fc1", "fc2"
)

The *prepare_model_for_kbit_training* function from the PEFT library is a convenient wrapper that prepares a quantized model for training. It freezes the base model's parameters, casts the remaining non-quantized parameters to 32-bit precision for numerical stability, and enables gradient checkpointing by default.

In [None]:
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

This code initializes a *TrainingArguments* object with various parameters that control the training process. These parameters define settings such as output directory, evaluation strategy, batch sizes, learning rate, optimizer, and more. The specific values chosen for these parameters will depend on the requirements of the training task and the available hardware.
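Two of these settings are easier to reason about with numbers: the effective batch size and the learning-rate schedule. The sketch below uses the values chosen for this run (batch size 1, 3 accumulation steps, learning rate 1e-4, 100 warmup steps, linear schedule). The single GPU and the total of 1000 optimizer steps are assumptions, since the real step count depends on the dataset and on packing. `linear_schedule_lr` is a hypothetical helper that mirrors the usual linear warmup and decay shape.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


per_device_train_batch_size = 1
gradient_accumulation_steps = 3
num_gpus = 1        # assumption: a single GPU
total_steps = 1000  # assumption: depends on dataset size and packing

# sequences that contribute to each optimizer update
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(f"effective batch size: {effective_batch}")

for step in (0, 50, 100, 550, 1000):
    lr = linear_schedule_lr(step, base_lr=1e-4, warmup_steps=100, total_steps=total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

The learning rate peaks at the end of warmup and then falls linearly, so a long warmup relative to the total number of steps leaves little training time at the peak rate.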

In [None]:
from transformers import TrainingArguments

training_arguments = TrainingArguments(
        output_dir="./phi1-results1",
        eval_strategy="steps",
        do_eval=True,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=3,  # Reduced from 12 to 3
        per_device_eval_batch_size=1,
        log_level="debug",
        save_steps=100,
        logging_steps=25, 
        learning_rate=1e-4,
        eval_steps=50,
        optim='paged_adamw_8bit',
        bf16=True,  # change to fp16=True if you are using an older GPU without bfloat16 support
        num_train_epochs=1,  # Reduced from 3 to 1
        warmup_steps=100,
        lr_scheduler_type="linear",
        use_cpu=False
)

The *SFTTrainer* constructor performs various checks and configurations based on the provided arguments. It handles cases where the model is a string identifier, initializes the `PeftModel` if *peft_config* is provided, sets the tokenizer if not specified, handles packing-related arguments, and prepares the datasets based on the specified options.
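The idea behind packing can be sketched on plain lists of token ids. This is a simplified illustration, not TRL's implementation: `pack_sequences` is a hypothetical helper, the token ids are toy values, and details such as buffering and shuffling are left out.

```python
def pack_sequences(tokenized_examples, seq_length, eos_token_id):
    """Concatenate examples (separated by EOS) and cut fixed-length blocks.

    Leftover tokens that do not fill a whole block are dropped.
    """
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
        stream.append(eos_token_id)
    n_blocks = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_blocks)]


# Toy token ids; 0 plays the role of the EOS token.
examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
for block in pack_sequences(examples, seq_length=4, eos_token_id=0):
    print(block)
```

Because every block is completely filled, no compute is spent on padding tokens, at the cost of some blocks containing the end of one example and the start of the next.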

In [None]:
from trl import SFTTrainer

trainer = SFTTrainer(
        model=model,
        train_dataset=dataset['train'],
        eval_dataset=dataset['test'],
        peft_config=peft_config,
        dataset_text_field="text",
        max_seq_length=1024,  # maximum length, in tokens, of each packed sequence
        tokenizer=tokenizer,
        args=training_arguments,
        packing=True
)

In [None]:
trainer.train()

# Save the fine-tuned model (trainer.model is the PEFT-wrapped model, so this writes the LoRA adapter)
tokenizer.save_pretrained("./fine-tuned-model")
trainer.model.save_pretrained("./fine-tuned-model", safe_serialization=True)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

fine_tuned_model = AutoModelForCausalLM.from_pretrained(
    "./fine-tuned-model",
    quantization_config=bnb_config,
    device_map={"": 0},
    torch_dtype="auto",
)

# Load the tokenizer
fine_tuned_tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model", use_fast=True)

# Ensure padding and EOS token settings are the same as during training
fine_tuned_tokenizer.padding_side = 'right'
fine_tuned_tokenizer.pad_token = fine_tuned_tokenizer.eos_token


In [None]:

import time

duration = 0.0
total_length = 0
prompt = []
#prompt.append("Write the recipe for a chicken curry with coconut milk.")
#prompt.append("Translate into French the following sentence: I love bread and cheese!")
prompt.append("### Human: Cite 20 famous people.### Assistant:")
#prompt.append("Where is the moon right now?")

for i in range(len(prompt)):
    model_inputs = fine_tuned_tokenizer(prompt[i], return_tensors="pt").to("cuda:0")
    start_time = time.time()
    output = fine_tuned_model.generate(**model_inputs, max_length=500)[0]
    elapsed = time.time() - start_time  # measure once so that later work does not skew the rate
    duration += elapsed
    total_length += len(output)  # note: the output includes the prompt tokens
    tok_sec_prompt = round(len(output) / elapsed, 3)
    print("Prompt --- %s tokens/second ---" % (tok_sec_prompt))
    print_gpu_utilization()  # the helper prints directly and returns None
    print(fine_tuned_tokenizer.decode(output, skip_special_tokens=True))