<a href="https://colab.research.google.com/github/DanieleCecca/NLP-project/blob/main/FinetuningLLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning an LLM with distillation
*Daniele Cecca*

*Matr. 914358*

*MSc Artificial Intelligence for Science and Technology*

*Email: d.cecca@campus.unimib.it*

This notebook fine-tunes an LLM with Unsloth.
The model is trained with LoRA on the answers produced by Gemini or GPT-4, which act as the teacher for distillation.

For the moment, fine-tuning is done sentence by sentence. A possible future experiment is to pass the entire context, the same one that is given to the larger model.


## Utility

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import torch
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
import pandas as pd
import json
from unsloth.chat_templates import standardize_sharegpt
from datasets import Dataset


NotImplementedError: Unsloth: No NVIDIA GPU found? Unsloth currently only supports GPUs!

### Utility functions

In [None]:
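def formatting_prompts_func(examples):
    # Render each conversation into one training string using the tokenizer's chat template.
    # Relies on the global `tokenizer`, so the model must be loaded before this is called.
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }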

## Data preparation

In [None]:
DATA_PATH='/content/drive/MyDrive/Nlp/data/dataset_gemini.csv'
TRAIN_PATH='/content/drive/MyDrive/Nlp/data/training_gemini.csv'
TEST_PATH='/content/drive/MyDrive/Nlp/data/test_gemini.csv'

In [None]:
df=pd.read_csv(TRAIN_PATH, encoding='unicode_escape')
df.head(5)

In [None]:
prompt='''voglio che assegni ad ogni parola della frase una label, seguendo il formato IOB, usato per i task di NER(name entity recognition).
Le entità sono le seguenti:
-PROBLEMA
-ESAME
-FARMACI
-OPERAZIONE\n

frase:
'''

In [None]:
df['frase'] = df['frase'].apply(lambda x: prompt + x)

In [None]:
df.frase[2]
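
To make the target format concrete, the sketch below tags a tokenized sentence in IOB style, using the underscore spelling (`B_PROBLEMA`, `I_PROBLEMA`, `O`) that the evaluation prompt in the Test section lists. The helper `to_iob`, the example sentence, and the span format are illustrative assumptions, not part of the dataset code.

```python
def to_iob(words, spans):
    """Return one IOB label per word.

    spans: list of (start, end, entity) with `end` exclusive, indexing into `words`.
    """
    labels = ["O"] * len(words)
    for start, end, entity in spans:
        labels[start] = f"B_{entity}"      # first word of the entity
        for i in range(start + 1, end):
            labels[i] = f"I_{entity}"      # remaining words of the entity
    return labels

# Hypothetical example sentence (not taken from the dataset)
words = "Paziente con dolore toracico trattato con aspirina".split()
spans = [(2, 4, "PROBLEMA"), (6, 7, "FARMACI")]
labels = to_iob(words, spans)
print(" ".join(labels))
```

The assistant turn used for training is expected to hold exactly one such label per whitespace-separated word.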

We now use the Llama-3.1 format for conversation style finetunes.
Llama-3 renders multi turn conversations like below:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hey there! How are you?<|eot_id|><|start_header_id|>user<|end_header_id|>

I'm great thanks!<|eot_id|>
```
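
The layout above can be reproduced with a few lines of plain Python. This is a simplified sketch for illustration only: `render_llama3` is a hypothetical helper, and the real chat template applied by the tokenizer may add further pieces (for example a system header).

```python
def render_llama3(messages):
    """Simplified rendering of a conversation in the Llama-3 chat layout."""
    text = "<|begin_of_text|>"
    for msg in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token
        text += f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
        text += f"{msg['content']}<|eot_id|>"
    return text

messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
    {"role": "user", "content": "I'm great thanks!"},
]
rendered = render_llama3(messages)
print(rendered)
```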

Unsloth's `get_chat_template` function returns a tokenizer that uses the requested chat template. It supports `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3` and more.

In [None]:
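# `tokenizer` is returned by FastLanguageModel.from_pretrained (see the Model section),
# so that cell has to run before this one.
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)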

In [None]:
df["conversations"] = df.apply(
    lambda x: [
        {"content": x["frase"], "role": "user"},
        {"content": x["label"], "role": "assistant"}
    ], axis=1
)

# Now convert the DataFrame into a HuggingFace Dataset, dropping the old columns
dataset = Dataset.from_pandas(df.drop(columns=["frase", "label"]))

In [None]:
dataset

In [None]:
dataset['conversations'][0]

We now use `standardize_sharegpt` to convert ShareGPT style datasets into HuggingFace's generic format. This changes the dataset from looking like:
```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
to
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
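
The key renaming shown above amounts to a small mapping. The sketch below only illustrates the idea and is not Unsloth's implementation; `ROLE_MAP` and `to_role_content` are hypothetical names.

```python
# ShareGPT speaker names mapped to the generic role names
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_role_content(turns):
    """Convert ShareGPT-style turns (from/value) to role/content dictionaries."""
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in turns]

sharegpt = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
converted = to_role_content(sharegpt)
print(converted)
```

Conversations that already use `role` and `content`, like the ones built from the DataFrame in this notebook, are already in the target format.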

In [None]:
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

In [None]:
dataset[5]["conversations"]

In [None]:
dataset[5]["text"]

## Model

Load the model

In [None]:
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = False # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

Next, we attach LoRA adapters to the attention and MLP projection layers.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
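
To give a feel for what `r` and `lora_alpha` control, the sketch below counts the parameters that standard LoRA adds to a single weight matrix and computes the scaling factor `lora_alpha / r` applied to the low-rank update. The 3072 x 3072 shape is an illustrative assumption and is not read from the model; the helper names are hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (r, d_in) and B has shape (d_out, r)
    return r * d_in + d_out * r

def lora_scaling(lora_alpha, r):
    """Standard (non rank-stabilized) LoRA scaling applied to the update B @ A."""
    return lora_alpha / r

r, lora_alpha = 16, 16        # values used in the cell above
d_in = d_out = 3072           # illustrative square projection (assumption)
added = lora_param_count(d_in, d_out, r)
full = d_in * d_out
print(f"LoRA adds {added} parameters vs {full} in the frozen matrix "
      f"({100 * added / full:.2f}%), scaling = {lora_scaling(lora_alpha, r)}")
```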

## Train

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)
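
The arguments above imply an effective batch size and a learning-rate curve that are easy to check by hand. The sketch below assumes a single GPU and re-implements a linear warmup followed by linear decay in plain Python; `linear_lr` is a hypothetical helper written for illustration, not the scheduler object the trainer actually uses.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=60):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)

# per_device_train_batch_size * gradient_accumulation_steps (single GPU assumed)
effective_batch = 2 * 4
examples_seen = effective_batch * 60   # max_steps optimizer updates

print(f"effective batch size: {effective_batch}, examples seen: {examples_seen}")
for step in (0, 1, 5, 30, 60):
    print(step, linear_lr(step))
```

With `max_steps = 60` the run is a short trial; setting `num_train_epochs` instead would cover the whole training set.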

We also use Unsloth's `train_on_responses_only` function so that the loss is computed only on the assistant outputs and the user's inputs are ignored.

In [None]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
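
Conceptually, training on responses only means replacing the label of every non-assistant token with `-100`, the value that PyTorch's cross-entropy loss ignores by default. The toy sketch below shows the idea on a list of string tokens; it is a simplified illustration under the assumption that each marker is a single token, not Unsloth's implementation.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_non_responses(tokens, instruction_marker, response_marker):
    """Keep labels only for tokens that follow a response marker."""
    labels = []
    in_response = False
    for tok in tokens:
        if tok == response_marker:
            in_response = True
            labels.append(IGNORE_INDEX)    # the marker itself is not a target
        elif tok == instruction_marker:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(tok if in_response else IGNORE_INDEX)
    return labels

# Toy token stream; "<user>" and "<assistant>" stand in for the real header strings
tokens = ["<user>", "tag", "this", "<assistant>", "O", "B_ESAME", "<eot>"]
labels = mask_non_responses(tokens, "<user>", "<assistant>")
print(labels)
```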

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
trainer_stats = trainer.train()

### Show final memory and time stats

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

## Save model

In [None]:
model.save_pretrained("/content/drive/MyDrive/Nlp/lora_llama")  # Local saving
tokenizer.save_pretrained("/content/drive/MyDrive/Nlp/lora_llama")

## Inference streaming without LoRA

To load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "/content/drive/MyDrive/Nlp/lora_llama", # Folder where the LoRA adapters were saved
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Describe a tall tower in the capital of France."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)
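
The `temperature` and `min_p` arguments shape the distribution that tokens are sampled from. The sketch below illustrates the idea on toy logits: the scores are divided by the temperature, turned into probabilities, and every token whose probability falls below `min_p` times the top probability is dropped. It is a conceptual illustration with a hypothetical helper; the exact order and details inside the library may differ.

```python
import math

def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return renormalized probabilities after temperature scaling and min-p filtering."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]        # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)                   # cutoff relative to the top token
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]

logits = [4.0, 3.0, 0.0, -3.0]   # toy scores for a 4-token vocabulary
probs = min_p_filter(logits)
print([round(p, 3) for p in probs])
```

For a labeling task such as NER, a lower temperature generally makes the output more repeatable.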

## Inference streaming with LoRA

In [None]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "/content/drive/MyDrive/Nlp/lora_llama", # Folder where the LoRA adapters were saved
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Assegna ad ogni parola."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

## Test

#### Utility functions

In [None]:
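_loaded_models = {}  # cache so that each model is loaded only once

def sendLLM(prompt_text, model='/content/drive/MyDrive/Nlp/lora_llama'):
    # Load the fine-tuned model on first use and reuse it on later calls
    if model not in _loaded_models:
        from unsloth import FastLanguageModel
        llm, tok = FastLanguageModel.from_pretrained(
            model_name = model, # YOUR MODEL YOU USED FOR TRAINING
            max_seq_length = max_seq_length,
            dtype = dtype,
            load_in_4bit = load_in_4bit,
        )
        FastLanguageModel.for_inference(llm) # Enable native 2x faster inference
        _loaded_models[model] = (llm, tok)
    llm, tok = _loaded_models[model]

    # Send the actual prompt (instructions + sentence) as the user turn
    messages = [
        {"role": "user", "content": prompt_text},
    ]
    inputs = tok.apply_chat_template(
        messages,
        tokenize = True,
        add_generation_prompt = True, # Must add for generation
        return_tensors = "pt",
    ).to("cuda")

    outputs = llm.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                           temperature = 1.5, min_p = 0.1)
    # Decode only the newly generated tokens, dropping the prompt and special tokens
    return tok.decode(outputs[0][inputs.shape[1]:], skip_special_tokens = True)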

In [None]:
# TODO: inspect the response output
def create_dataset_IOB(list_sentences, prompt_base, model='lora_llama'):
  data = []  # List that collects the rows

  for sentence in list_sentences:
      prompt_text = prompt_base + sentence
      response = sendLLM(prompt_text, model)  # Assuming this function returns the correct output

      data.append({'frase': sentence, 'label': response})  # Add the row to the list

  dataset_IOB = pd.DataFrame(data)  # Build the DataFrame once the loop has finished
  return dataset_IOB

In [None]:
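def preprocess_iob_dataframe(df):
    """Explode each (sentence, labels) row into one row per word; rows with mismatched counts are skipped."""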
    new_data = []

    for index, row in df.iterrows():
        sentence = row['frase']
        labels = row['label']

        words = sentence.split()
        label_list = labels.split()

        if len(words) != len(label_list):
            print(f"Mismatch in row {index}, sentence '{sentence}'. Skipping this row.")
            continue

        for i in range(len(words)):
            new_data.append([index, words[i], label_list[i], sentence])

    return pd.DataFrame(new_data, columns=["#Sentence", "Word", "Label", "Sentence"])
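
Besides the length check in `preprocess_iob_dataframe`, a model response can also contain tags outside the expected set. The sketch below is one possible sanity check, assuming the underscore-style tags listed in the evaluation prompt; `check_labels`, `ENTITIES`, and `ALLOWED` are hypothetical names introduced for illustration.

```python
ENTITIES = ["PROBLEMA", "ESAME", "FARMACI", "OPERAZIONE"]
# "O" plus a B_ and an I_ tag for every entity type
ALLOWED = {"O"} | {f"{prefix}_{entity}" for prefix in ("B", "I") for entity in ENTITIES}

def check_labels(sentence, response):
    """Return a list of problems found in a whitespace-separated label string."""
    words, labels = sentence.split(), response.split()
    problems = []
    if len(words) != len(labels):
        problems.append(f"expected {len(words)} labels, got {len(labels)}")
    unknown = sorted(set(labels) - ALLOWED)
    if unknown:
        problems.append(f"unknown labels: {unknown}")
    return problems

# Hypothetical sentence and responses
print(check_labels("eseguita TAC torace", "O B_ESAME I_ESAME"))
print(check_labels("eseguita TAC torace", "O B_ESAME"))
```

Logging these problems per sentence makes it easier to see whether the model fails on length, on tag spelling, or on both.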


#### Evaluation

In [None]:
prompt_base="""Sei un medico e voglio che mi  assegni ad ogni parola una label, seguendo il formato IOB, usato per i task di NER.
Le entità sono le seguenti:
-PROBLEMA
-ESAME
-FARMACI
-OPERAZIONE


Con problema intendiamo anche malattie. Mentre con esame , intendiamo sia esame clinico che strumentale. Considera sia il nome dell'esame sia il valore-risultato dell'esame che puo' essere sia numerico che categorico. Lo stesso vale per i problemi per i farmaci e per le operazioni.

Fai attenzione il numero di label deve essere uguale al numero di parole:
O
I_PROBLEMA
B_PROBLEMA
I_ESAME
B_ESAME
I_FARMACI
B_FARMACI
I_OPERAZIONE
B_OPERAZIONE

context:
"""


In [None]:
df_test=pd.read_csv(TEST_PATH, encoding='unicode_escape')
list_sentences=df_test['frase']

In [None]:
df_prediction=create_dataset_IOB(list_sentences,prompt_base,model='/content/drive/MyDrive/Nlp/lora_llama')

In [None]:
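gold = preprocess_iob_dataframe(df_test)
pred = preprocess_iob_dataframe(df_prediction)
# seqeval expects one list of tags per sentence; keep only sentences parsed in both frames
common = sorted(set(gold['#Sentence']) & set(pred['#Sentence']))
labels = [gold.loc[gold['#Sentence'] == i, 'Label'].tolist() for i in common]
predictions = [pred.loc[pred['#Sentence'] == i, 'Label'].tolist() for i in common]

In [None]:
# Requires the seqeval package (pip install seqeval)
from seqeval.metrics import classification_report
print(classification_report(labels, predictions))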
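
The report scores whole entities rather than individual words. The sketch below illustrates that idea with a micro-averaged precision, recall, and F1 over entity spans, assuming underscore-style tags such as `B_PROBLEMA`. It is a simplified stand-in written for illustration, not seqeval's code, and it may differ from seqeval on edge cases.

```python
def extract_entities(tags):
    """Return the set of (entity, start, end) spans in one IOB-tagged sentence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):     # sentinel closes the last span
        prefix, _, entity = tag.partition("_")
        continues = prefix == "I" and entity == current
        if current is not None and not continues:
            entities.add((current, start, i))        # close the open span
            current = None
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, entity               # open a new span
    return entities

def entity_f1(gold_sentences, pred_sentences):
    """Micro-averaged precision, recall and F1 over exact entity matches."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Hypothetical gold and predicted tags for one sentence
gold = [["O", "B_PROBLEMA", "I_PROBLEMA", "O", "B_FARMACI"]]
pred = [["O", "B_PROBLEMA", "O", "O", "B_FARMACI"]]
scores = entity_f1(gold, pred)
print(scores)
```

In this toy case the truncated `PROBLEMA` span counts as a miss even though its first word is tagged correctly, which is why entity-level scores are usually lower than word-level accuracy.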