<h1> Finetuning Llama2 7b on the TweetSumm customer support dialog summary dataset</h1>

**1. We start by installing the necessary packages and libraries, pinning their relevant versions to avoid issues caused by version updates.**

In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7
!pip install wandb
!pip install huggingface-hub

**2. We log in to the Hugging Face Hub with a personal account to access the formatted instruction dataset and the Llama2 base model.**

In [None]:
from huggingface_hub import login
login(
    new_session=False,
    write_permission=True,
    token='...',
    add_to_git_credential=True,
)

**3. We create three variables holding the names of the base model, the resulting model, and the instruction dataset.**

In [None]:
model_name = "meta-llama/Llama-2-7b-chat-hf"
new_model = "Llama2-7b-Dialog-Summary"
dataset_name = "Marouane50/Dialog-Summarization-Dataset-Formatted"

**4. We load the instruction dataset and store the training and validation sets in two variables.**

In [None]:
from datasets import load_dataset

training_dataset = load_dataset(dataset_name, split="training")
validation_dataset = load_dataset(dataset_name, split="validation")

**5. To reduce memory and computational costs, we quantize the model, representing the model’s weights with lower-precision data types. We use 4-bit NormalFloat (nf4) quantization together with double quantization to compress the base model, as described in the QLoRA paper.**

In [None]:
import torch
import os
from transformers import BitsAndBytesConfig

compute_dtype = getattr(torch, "float16") # Resolves the 16-bit floating point dtype (torch.float16) from the PyTorch library

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # Loads the model with 4-bit quantization to reduce its memory footprint
    bnb_4bit_quant_type="nf4", # Type of 4-bit quantization to use
    bnb_4bit_compute_dtype=compute_dtype, # Sets float16 as the data type to use for computations
    bnb_4bit_use_double_quant=True # Enables double quantization
)
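
As a rough illustration of why this configuration helps, the sketch below estimates the memory taken by the weights alone under three storage schemes. The 7-billion parameter count is an approximation, and the block sizes (64 weights per quantization constant, 256 constants per second-level constant) are the defaults described in the QLoRA paper, assumed here rather than read from bitsandbytes. Activations, gradients and optimizer state are ignored.

```python
# Back-of-the-envelope estimate of weight memory; all figures are assumptions.
N_PARAMS = 7e9            # approximate parameter count of a 7B model
BLOCK_SIZE = 64           # weights sharing one quantization constant (QLoRA paper default)
SECOND_LEVEL_BLOCK = 256  # constants sharing one second-level constant (QLoRA paper default)

def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store n_params weights at the given average bit width."""
    return n_params * bits_per_param / 8 / 1024**3

# 4-bit weights plus one 32-bit constant per block
bits_nf4 = 4 + 32 / BLOCK_SIZE
# Double quantization: constants stored in 8 bits, plus one 32-bit constant per 256 blocks
bits_nf4_double = 4 + 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)

for label, bits in [("float16", 16), ("nf4", bits_nf4), ("nf4 + double quant", bits_nf4_double)]:
    print(f"{label:>20}: {bits:6.3f} bits/param, {weight_memory_gib(N_PARAMS, bits):5.2f} GiB")
```

Under these assumptions, double quantization saves roughly 0.37 bits per parameter on top of plain 4-bit storage.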

**6. We load the base model that we are going to train from the Hugging Face Hub. Then we set some of its configuration values to those recommended for fine-tuning.**

In [None]:
# Loading the base model with QLoRA config
from transformers import AutoModelForCausalLM

# We use the AutoModelForCausalLM class from the transformers library to load the model, as it is suited to text generation tasks.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config = bnb_config,
    device_map = {"": 0},
)
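model.config.use_cache = False # Disables the key/value cache, which is only useful at generation time
model.config.pretraining_tp = 1 # Uses the standard computation of the linear layers (no tensor-parallel emulation)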

**7. We load the tokenizer of the model we're using, and we configure it to avoid token overflow**

In [None]:
from transformers import AutoTokenizer

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=True, use_fast=True)

# Following a recommendation, we set an explicit padding token and pad on the right to avoid overflow issues.
tokenizer.pad_token_id = 18610
tokenizer.padding_side = "right"

**8. We set up the LoRA parameters following the LoRA paper, by creating a parameter-efficient fine-tuning (PEFT) configuration that defines the LoRA-specific parameters. LoRA accelerates the fine-tuning of large language models while consuming less memory by: a) tracking changes to the weights instead of updating them directly; b) decomposing the large matrices of weight changes into smaller matrices that contain the trainable parameters.**

In [None]:
from peft import LoraConfig

peft_config = LoraConfig(
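    lora_alpha=16, # Scaling factor applied to the LoRA update
    lora_dropout=0.1, # Dropout probability applied in the LoRA layers
    r=64, # Rank of the low-rank update matrices
    bias="none", # No bias parameters are trained
    task_type="CAUSAL_LM", # Causal language modeling task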
)
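
To give a sense of scale, the sketch below counts the parameters that LoRA trains with this configuration. The rank and `lora_alpha` come from `peft_config`; the hidden size (4096), the number of decoder layers (32) and the choice of adapting only the query and value projections are assumptions about Llama-2-7B and about the default target modules, not values read from the notebook.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA adapter on a (d_out x d_in) weight adds A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

R = 64                 # rank, as in peft_config
LORA_ALPHA = 16        # as in peft_config
HIDDEN_SIZE = 4096     # assumption: Llama-2-7B hidden size
NUM_LAYERS = 32        # assumption: number of decoder layers in Llama-2-7B
ADAPTED_PER_LAYER = 2  # assumption: adapters on the query and value projections only

per_matrix = lora_trainable_params(HIDDEN_SIZE, HIDDEN_SIZE, R)
total = per_matrix * ADAPTED_PER_LAYER * NUM_LAYERS
full_matrix = HIDDEN_SIZE * HIDDEN_SIZE

print(f"trainable parameters per adapted matrix: {per_matrix:,} (vs {full_matrix:,} in the full matrix)")
print(f"total trainable parameters: {total:,}")
print(f"scaling applied to the update (lora_alpha / r): {LORA_ALPHA / R}")
```

Under these assumptions the adapters hold a few tens of millions of parameters, a small fraction of the roughly 7 billion weights of the base model.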

**9. We create a TrainingArguments instance, which contains all the parameters we can adjust as well as flags for activating different training options. Every parameter has a built-in default, and we can optimise the training by overriding these defaults. The reference values for each parameter are available in the Trainer class documentation (https://huggingface.co/transformers/v3.0.2/main_classes/trainer.html#tftrainingarguments); the values were slightly adjusted to improve the training loss and validation loss of the fine-tuning.**


In [None]:
from transformers import TrainingArguments

# Setting the training parameters
training_arguments = TrainingArguments(
    output_dir="./results", # Folder where the training files will be saved
    num_train_epochs=2, # Number of iterations through the training dataset
    per_device_train_batch_size=4, # Number of samples per device in each training batch
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit", # Type of optimizer used
    save_steps=0,
    evaluation_strategy="steps", 
    eval_steps = 0.1, # Evaluation frequency; a value below 1 is read as a fraction of the total training steps
    logging_steps=20, # Sets how often logging information is recorded
    learning_rate=2e-4, # Initial (peak) learning rate used by the optimizer
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine", # Type of learning rate schedule used during training
    report_to="wandb" # Reports the training metrics to Weights & Biases for visualisation
)
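
The interaction between `learning_rate`, `warmup_ratio` and the cosine scheduler can be sketched in plain Python. This is an illustration of a linear warmup followed by a cosine decay to zero, which is how the cosine schedule is commonly described; it is not the library's own code. `TOTAL_STEPS` is a hypothetical value, since the real number of steps depends on the dataset size and batch size.

```python
import math

def cosine_lr(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 500  # hypothetical; depends on dataset size and batch size
# eval_steps=0.1 interpreted as a fraction of the total number of steps
eval_every = math.ceil(TOTAL_STEPS * 0.1)

for step in (0, 15, 250, 500):
    print(f"step {step:3d}: lr = {cosine_lr(step, TOTAL_STEPS):.2e}")
print(f"evaluation every {eval_every} steps")
```

The learning rate climbs to its peak during the first 3% of the steps, then decays smoothly, which is why the value passed as `learning_rate` is best read as a maximum rather than a constant.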

**10. We initialize an instance of SFTTrainer, the supervised fine-tuning trainer from the trl library, to configure the parameters used for fine-tuning.**

In [None]:
from trl import SFTTrainer
# Setting the fine-tuning parameters
trainer = SFTTrainer(
    model=model, 
    train_dataset=training_dataset, # Training set
    eval_dataset=validation_dataset, # Validation set
    peft_config=peft_config, # LoRA configuration
    dataset_text_field="text", # Name of the dataset column that contains the training text
    max_seq_length=4096, # Optional maximum length, in tokens, of the sequences used during training
    tokenizer=tokenizer, # Tokenizer for the base model used
    args=training_arguments, 
    packing=False,
)

# Fine-tuning the model with the previously set training parameters
trainer.train()

In [None]:
# Saving the trained model
trainer.save_model(new_model)

**11. We run the text generation pipeline with our base model using the Llama2 chat prompt format.**

In [None]:
from transformers import pipeline
# Default system prompt for a dialog summarization task
system_prompt = "The following text is a conversation between a human and an AI agent. Write a summary of the conversation."

# The instruction text
prompt = """user: Hello, i'm looking to get a refund for a computer that i bought from your store. /n agent: Could you please let me know the reason why you want a refund ? /n user: The screen was broken. /n agent: Ok, i will send your request to the customer service department and you will be notified in 5 working days. /n user: Ok, thank you."""

# The pipeline function from the transformers library provides a simple way to run the model
pipe = pipeline(
    task="text-generation", # Type of task the model will have to perform
    model=model, # Model used in the pipeline
    tokenizer=tokenizer, # Tokenizer compatible with the model
    max_length=300 # Optional, limits the total number of tokens (prompt plus generated text)
)

result = pipe(f"[INST] <<SYS>> {system_prompt} <</SYS>> {prompt} [/INST]")  # We pass the formatted prompt to the pipeline function
print(result[0]['generated_text']) # The result is in the form of a list, we select the generated_text entry to get the model's output
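
Since the generated text contains the prompt followed by the completion, it can be convenient to wrap the prompt formatting and the extraction of the answer in two small helpers. The functions `build_prompt` and `extract_answer` below are hypothetical names, and the template simply mirrors the f-string used in this notebook. The `demo_output` string is a stand-in for `result[0]['generated_text']`, not a real model output.

```python
def build_prompt(system_prompt: str, dialog: str) -> str:
    """Wrap a system prompt and a dialog in the chat format used in this notebook."""
    return f"[INST] <<SYS>> {system_prompt} <</SYS>> {dialog} [/INST]"

def extract_answer(generated_text: str) -> str:
    """Return only the text that follows the last [/INST] marker."""
    return generated_text.rsplit("[/INST]", 1)[-1].strip()

demo_prompt = build_prompt("Write a summary of the conversation.", "user: Hi. agent: Hello.")
# Stand-in for result[0]['generated_text']: the prompt followed by a made-up completion
demo_output = demo_prompt + " The user greets the agent."
print(extract_answer(demo_output))
```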

<b>12. We reload the model in FP16 and merge it with the LoRA weights. This is when we incorporate the trained weights into the base model, which results in the fine-tuned model. </b>

In [None]:
from peft import PeftModel

# Reloading the base model in FP16
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True, # Optimizes memory usage during loading 
    return_dict=True, # Dictionary output format
    torch_dtype=torch.float16, # Loads the model in 16-bit floating point precision to reduce memory usage and computation time
    device_map={"": 0},
)

model = PeftModel.from_pretrained(base_model, new_model) # Loads a parameter-efficient fine-tuning model from the base model and the trained LoRA weights
model = model.merge_and_unload() # This method merges the trained LoRA weights into the base model

# We reload the tokenizer of the Llama2 model to be used in the new text generation pipeline
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.padding_side = "right" # Line of code to avoid overflow during text generation
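
The arithmetic behind merging LoRA weights can be illustrated on a toy example: the merged weight is the frozen weight plus the product of the two low-rank matrices, scaled by `lora_alpha / r`. The sketch below uses plain Python lists and made-up values; the helper names are hypothetical and the real merge operates on the model's tensors.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A)."""
    scaling = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scaling * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# Toy values, for illustration only: a 2x3 weight with a rank-1 adapter
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen base weight (d_out x d_in)
A = [[0.5, 0.0, -1.0]]                  # trained LoRA matrix (r x d_in)
B = [[1.0], [2.0]]                      # trained LoRA matrix (d_out x r)

merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)
```

Once merged, the model has the same shape as the base model and no longer needs the adapter at inference time.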

**13. After merging the weights, we run the text generation pipeline with our new fine-tuned model.**

In [None]:
system_prompt = "The following text is a conversation between a user and an AI agent. Write a summary of the conversation."
#prompt = "user: Hello, i'm looking to get a refund for a computer that i bought from your store. agent: Could you please let me know the reason why you want a refund ? user: The screen was broken. agent: Ok, i will send your request to the customer service department and you will be notified in 5 working days. user: Ok, thank you."
prompt = "user: SO I Can't castfrom my app to my TV?Really? agent: As long as both devices are connected to your home network you should be able to cast content. If either device is not connected to your home network then casting is blocked due to our network agreements. ^RT user: Spectrum Live TV and Spectrum Internet. But No option to cast. It's frustrating. agent: It should be under Settings, then Display, then Cast. You should see a list of compatible devices on your network to cast to. The instructions for Apple Airplay are different and can be found here as well. ^RT user: Nah. Not even an option. See agent: Are you streaming to a chromecast or directly to the TV? ^RT user: It's the APP. I'm wanting to stream/cast to the Chromecast. agent: Are you able to reboot your modem and then log out of the app and see if the display option comes up for you when you log back in? ^RT user: I will attempt that. Give me a few minutes to see if that helps. agent: Sure thing. I'll be here once everything is done rebooting. user: There is not a 'Display' setting under settings even after reboot of everything. agent: I would be happy to get this issue escalated for you. Can you please DM the service address and phone number?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=1000)
result = pipe(f"<s>[INST] <<SYS>> {system_prompt} <</SYS>> {prompt} [/INST]")
print(result[0]['generated_text'])

**14. We store the trained model and tokenizer on the Hugging Face Hub to evaluate the model with the ROUGE benchmark.**

In [None]:
model.push_to_hub("Marouane50/Llama2-Dialog-Summarization", check_pr=True)

tokenizer.push_to_hub("Marouane50/Llama2-Dialog-Summarization",check_pr=True)
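
For readers unfamiliar with ROUGE, the sketch below computes a simplified ROUGE-1 score (unigram overlap) between a reference summary and a generated one. It lowercases and splits on whitespace only, without the stemming or tokenization rules of a full ROUGE implementation, so its numbers will differ from those of an evaluation library. The two example summaries are made up.

```python
from collections import Counter

def rouge1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall and F1 on lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Each shared word counts as many times as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(1, sum(cand_counts.values()))
    recall = overlap / max(1, sum(ref_counts.values()))
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up reference and model summary, for illustration only
scores = rouge1(
    reference="the customer asks for a refund for a broken screen",
    candidate="the customer wants a refund for a broken computer screen",
)
print(scores)
```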

References:
Training arguments: https://huggingface.co/transformers/v3.0.2/main_classes/trainer.html#tftrainingarguments
Fine-tuning: https://huggingface.co/docs/transformers/training
Open-source code: https://colab.research.google.com/drive/1p68M5E5fZ7kSa7nA-e-20489nuFSXVp2?usp=sharing