<a href="https://colab.research.google.com/github/Ilvecho/LLM_fine_tuning/blob/main/LoRA_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we perform the actual LoRA fine-tuning of our model.

We use the data scraped in the Web_Scraping notebook and then processed in the Docs_elaboration notebook.

Thanks to those processing steps, the data is already available in the desired JSON format.
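
As a rough illustration of that format, the sketch below checks that a record carries the three fields the code in this notebook reads (`id`, `question`, `answer`). The field names come from the notebook; the example values and the `is_valid_record` helper are hypothetical, and whether the files are stored as a JSON array or as JSON lines is not stated here.

```python
import json

REQUIRED_FIELDS = ("id", "question", "answer")

# Hypothetical record: field names are taken from the notebook code, values are made up.
example_record = {
    "id": 0,
    "question": "Che cos'è l'onboarding dei dipendenti?",
    "answer": "Risposta di esempio.",
}

def is_valid_record(record: dict) -> bool:
    """Return True when every required field is present and non-empty."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)

# Round-trip through JSON, as would happen when the file is written and read back.
line = json.dumps(example_record, ensure_ascii=False)
print(line)
print(is_valid_record(json.loads(line)))
```

A check like this can be run over every record before training to catch missing fields early.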

Note: the code may still contain some issues, in particular with the data structures. Apologies if so.

In [None]:
import numpy as np
import pandas as pd
import torch
import os
import re
import json
import random
import pickle
import plotly.graph_objects as go

from google.colab import userdata
from google.colab import files,drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [None]:
!pip install trl transformers datasets torch peft
!pip install -qU accelerate
!pip install -qU bitsandbytes
!pip install thefuzz

In [None]:
from datasets import load_dataset, Dataset, DatasetDict

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, BitsAndBytesConfig, GenerationConfig, pipeline
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model, AutoPeftModelForCausalLM, PeftConfig, PeftModel
from trl import SFTTrainer
from thefuzz import fuzz

# Initialization

### Load datasets

Load the created dataset

In [None]:
# train_data.json and test_data.json are stored on the mounted Google Drive
data_files = {'train':'/content/gdrive/MyDrive/Syllog/train_data.json',
              'test':'/content/gdrive/MyDrive/Syllog/test_data.json'}
dataset = load_dataset('json',data_files=data_files)

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

There is also a smaller version of the dataset.

It was used to test the pipeline before moving to the full dataset. Note that running the cell below overwrites the `dataset` variable loaded above.

In [None]:
data_files = {'train':'/content/gdrive/MyDrive/Syllog/small_data_train.json',
              'test':'/content/gdrive/MyDrive/Syllog/small_data_test.json'}
dataset = load_dataset('json',data_files=data_files)

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

### Model and tokenizer

Load the model and configure it to use 4bit quantization (because of RAM limitations)

In [None]:
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype= torch.bfloat16,
        bnb_4bit_use_double_quant= False,
)

model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,  # 4-bit loading is already requested by bnb_config
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )

# Disable the key/value cache: it is not needed for training and conflicts with gradient checkpointing
model.config.use_cache = False
# Recompute activations during the backward pass instead of storing them, trading compute for memory
model.gradient_checkpointing_enable()
# Tensor-parallel degree used during pretraining; 1 keeps the standard linear-layer computation
model.config.pretraining_tp = 1

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]
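
To see why 4-bit loading helps with the RAM limitation, here is a minimal back-of-the-envelope sketch. The parameter count of roughly 7.24 billion is an assumption about Mistral-7B, and the estimate covers the weights only (no activations, optimizer state, or quantization overhead).

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7.24e9  # assumed approximate parameter count of Mistral-7B

for bits in (16, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

The 16-bit estimate is in line with the two checkpoint shards in the download log above (9.94G and 4.54G), while the 4-bit estimate is about a quarter of that.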

Load the tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side='right')
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True # Append the EOS token to every tokenized sequence; note that this also applies to prompts at inference time
tokenizer.add_bos_token = True
# tokenizer.add_bos_token, tokenizer.add_eos_token

tokenizer_config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

Before proceeding with the fine-tuning, let's first evaluate the performance of the base (non-fine-tuned) model.

# Original model Generation

Here we combine text generation with the LLM and the post-processing of the generated string.

In [None]:
# Define the pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer = tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

Note that the system prompt used for test generation with the base model is similar in spirit to the prompts given to ChatGPT to create the dataset, which are reported below: both limit the length of the answers.

template_general_questions =

    I provide you with the following context: '''{transcript}'''.
    You must identify the general topic that is discussed in the provided context.
    Once the general topic is identified, you need to generate 5 pairs of Question-Answer on the general topic.
    Since the questions are generic, the answers must be at least 2 sentences (but do not go above 6 sentences).



template_specific_questions =

    I provide you with the following context: '''{transcript}'''.
    You must identify the general topic that is discussed in the provided context.
    Once the general topic is identified, one related sub-topic covered in the provided context.
    In the output list all the identified sub-topics in a numbered list. You can use it to double check that the identified sub-topics are five.
    Create two Question-Answer pair for said sub-topic. Double check that they are two.
    Since the question are specific to a sub-topic, the answer must be at most four sentences long.
    Repeat the above actions for five different sub-topics covered in the context.
    Before providing the output, review your answer and make sure that five sub topics have been identified.

system =

    You are a helpful assistant that reads documents, understand their content, and generate Question-Answer pairs.
    Your output will be used to perform supervised fine tuning of a LLM - keep it in mind when formulating both the question and the answer.
    The desired output format is the following:
    - The first line of the output should be "Topic:" followed by the topic identified in the provided document
    - Identify the questions with "Question:" and the answers with "Answer:"
    - each question and each answer need to be in one line only. The result of this is that each line will start either with "Question:" or with "Answer:"
    Avoid referring to any Named Entity in the questions, unless extremely relevant for the document content.
    Email addresses and phone numbers are not relevant for me - do not mention them at any time.


In [None]:
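# Resume from the answers saved by a previous run (start from test_answers = [] on the very first run)
with open('/content/gdrive/MyDrive/Syllog/test_full_raw_answers.pkl', 'rb') as file:
  test_answers = pickle.load(file)

cont = 0
for row in dataset['test']:

  # Skip the rows that were already processed in a previous run
  if cont < 33:
    cont += 1
    continue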

  ##############################################
  #############     GENERATION     #############
  ##############################################

  system_message = "Sei un assistente AI utile e conciso. Rispondi in massimo cinque frasi, va bene anche usarne meno."

  prompt_template=f"""<|im_start|>Sistema: {system_message}<|im_end|>
  <|im_start|>Utente: {row['question']}<|im_end|>
  <|im_start|>Assistente: """

  # Call the pipeline also with args to be passed to the model
  sequences = pipe(
      prompt_template,
      max_new_tokens=200,
      do_sample=False,
      return_full_text=False,
      num_return_sequences=1,
      eos_token_id=tokenizer.eos_token_id,
      pad_token_id=tokenizer.eos_token_id,
      decoder_start_token_id=0,
  )

  answer = sequences[0]['generated_text']

  ##############################################
  #############     PROCESSING     #############
  ##############################################

  # If there is the end tag, let's just consider what's before it
  if '<|im_end|>' in answer:
    answer = answer.split('<|im_end|>')[0]

  # Then, we want to remove the numbers of the numbered item list
  answer = re.sub(r'\d+\.\s*', '- ', answer)

  # Next, verify that the sentences generated by the model are not near-duplicates of each other.
  # A complete answer ends with punctuation, so the last element of the split is empty.
  # A non-empty last element means the generation was cut off by the max-token limit, so we drop that fragment.
  answer = answer.strip()
  sentences = re.split(r'[.?!:;]', answer)

  if len(sentences[-1]) > 0:
    answer = answer[:-len(sentences[-1])]

  # If there are multiple sentences, check that they are different from each other
  if len(sentences) > 1:
    sentences = sentences[:-1]

    # Build the Fuzzy matching matrix
    size = len(sentences)
    fuzz_match = np.zeros((size, size))

    for i, sentence in enumerate(sentences):
      for j, compare in enumerate(sentences):
        # Leave the diagonal at zero so that a sentence never matches itself
        if i == j:
          continue
        score = fuzz.token_set_ratio(sentence, compare)
        fuzz_match[i][j] = score

    # Discard sentences with high score
    max_score = np.max(fuzz_match)
    argmax_score = np.argmax(fuzz_match)

    while max_score > 80:
      # Find the two matching sentences
      i = argmax_score // size
      j = argmax_score % size

      # print(f'Size: {size}, argmax: {argmax_score}, i: {i}, j: {j}')

      # out of the two, find the one with the highest average score (the sentence on average more similar to all the others)
      if fuzz_match[i].mean() < fuzz_match[j].mean():
        to_delete = j
      else:
        assert fuzz_match[i].mean() >= fuzz_match[j].mean()
        to_delete = i

      # Delete sentence from the fuzz match
      fuzz_match = np.delete(fuzz_match, to_delete, axis=0)
      fuzz_match = np.delete(fuzz_match, to_delete, axis=1)

      # Since we are deleting one sentence, we need to reduce the size as well
      size -= 1

      # Delete sentence from sentences
      sentences.pop(to_delete)


      # Values for the new While cycle
      max_score = np.max(fuzz_match)
      argmax_score = np.argmax(fuzz_match)

    output = ''

    for sentence in sentences:
      idx = answer.find(sentence)

      if idx != -1 and idx + len(sentence) < len(answer):
          punctuation = answer[idx + len(sentence)]
          output += sentence.strip() + punctuation + '\n'
      else:
          print("Substring not found or character after the substring does not exist.")

  else:
    assert len(sentences) == 1
    output = sentences[0]

  new_element = {"id": row['id'],
                 "answer_raw_model": output}

  test_answers.append(new_element)

  print(f'Question {cont}: ', row['question'])
  print(f'Generated output: {output}')
  # print(f'Reference output: ', row['answer'])
  print('###################################\n')

  with open('/content/gdrive/MyDrive/Syllog/test_full_raw_answers.pkl', 'wb') as file:
    pickle.dump(test_answers, file)

  cont += 1


Question 33:  Quali sono alcuni esempi di rischi fisici sul posto di lavoro?
Generated output: - Infortuni fisici, - Stress, - Inquinamento, - Radiazioni, - Incendi, - Incidenti, - Accidenti, - Malattie, - Infezioni, - Malattie professionali, - Malattie respiratorie, - Malattie cardiovascolari, - Malattie gastrointestinali, - Malattie nervose, - Malattie muscoloscheletriche, - Malattie endocrine, - Malattie renali, - Malattie cutanee, - Malattie oculari, - Malattie orali, - Malattie gen
###################################

Question 34:  Qual è stata la prima legge federale di vasta portata volta a proteggere i lavoratori americani?
Generated output: La prima legge federale di vasta portata volta a proteggere i lavoratori americani fu la legge sul lavoro di 1938, nota anche come legge sul lavoro e sul salario di - La legge ha istituito il Dipartimento del Lavoro degli Stati Uniti e ha stabilito il salario minimo nazionale, il limite massimo di ore lavorative settimanali e le regole per 

# Fine Tuning

In this section we perform the actual fine-tuning of the model.

Prepare the model for training and create the LoRA configuration.

In [None]:
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=4,                # Very few training samples are available, so use a low rank to reduce resource requirements
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
)
model = get_peft_model(model, peft_config)

Print the trainable parameters

In [None]:
trainable_params = 0
all_param = 0
for _, param in model.named_parameters():
    all_param += param.numel()
    if param.requires_grad:
        trainable_params += param.numel()
print(
    f"trainable params: {trainable_params} || all params: {all_param} || trainable %: {100 * trainable_params / all_param}"
)

trainable params: 5767168 || all params: 3757838336 || trainable %: 0.153470359401858
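
The trainable count above can be reproduced by hand. Each LoRA adapter on a linear layer adds two matrices, A (rank x in_features) and B (out_features x rank), so it contributes rank * (in_features + out_features) parameters. The sketch below assumes the layer dimensions of the published Mistral-7B configuration (hidden size 4096, key/value projection size 1024, MLP size 14336, 32 layers); these values are not stated in this notebook.

```python
# Assumed Mistral-7B dimensions (taken from the published model config, not from this notebook)
HIDDEN = 4096
KV_DIM = 1024          # 8 key/value heads x 128 dimensions each
INTERMEDIATE = 14336
N_LAYERS = 32
RANK = 4               # same rank as in the LoraConfig above

# (in_features, out_features) of each targeted projection
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # A is (rank x in_features), B is (out_features x rank)
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * N_LAYERS
print(f"per layer: {per_layer}, total: {total}")
```

Under these assumptions the total matches the 5767168 trainable parameters printed above. The "all params" figure is lower than 7 billion most likely because 4-bit weights are stored packed, so the reported percentage should be read with that in mind.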


Define the hyperparameters for the training

In [None]:
training_arguments = TrainingArguments(
    output_dir="/content/gdrive/MyDrive/Syllog/full_results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="none"                 # We don't have a W&B account; in many transformers versions None enables every installed tracker, "none" disables them
)

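The data must be reshaped into a format suitable for fine-tuning.

Hence, we define a formatting function and pass it to the trainer. Depending on the TRL version, the trainer may call this function with a batch (a dict of lists) rather than a single row, so it must return one string per example.

In [None]:
def prompt_instruction_format(sample):
  system_prompt = 'Sei un assistente AI utile e conciso. Rispondi in massimo cinque frasi, va bene anche usarne meno.'
  # Accept both a single row and a batch (dict of lists)
  questions, answers = sample['question'], sample['answer']
  if isinstance(questions, str):
    questions, answers = [questions], [answers]
  return [f"""<|im_start|>Sistema: {system_prompt}<|im_end|><|im_start|>Utente: {q}<|im_end|><|im_start|>Assistente: {a}<|im_end|>""" for q, a in zip(questions, answers)]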

Create the trainer

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    peft_config=peft_config,
    max_seq_length=200,
    tokenizer=tokenizer,
    formatting_func=prompt_instruction_format,
    args=training_arguments,
    packing= False,
)

Map:   0%|          | 0/645 [00:00<?, ? examples/s]

Map:   0%|          | 0/59 [00:00<?, ? examples/s]

We don't have a Weights & Biases (W&B) account, so we cannot use it to visualize the training metrics

In [None]:
#import wandb
#wandb.init(mode='disabled')

os.environ['WANDB_DISABLED'] = 'true'

Perform the actual training

In [None]:
trainer.train()



Step,Training Loss


TrainOutput(global_step=1, training_loss=1.7766133546829224, metrics={'train_runtime': 5.9683, 'train_samples_per_second': 0.168, 'train_steps_per_second': 0.168, 'total_flos': 8539712716800.0, 'train_loss': 1.7766133546829224, 'epoch': 1.0})
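
As a sanity check on a training log, the expected number of optimizer steps can be estimated from the dataset size and the batch settings. The sketch below is an approximation (the exact rounding may differ between library versions) and the `expected_steps` helper is hypothetical. The 645 examples are the size of the train split shown in the `Map` progress output above.

```python
import math

def expected_steps(n_examples: int, per_device_batch: int,
                   grad_accum: int = 1, n_devices: int = 1, epochs: int = 1) -> int:
    """Rough estimate of the number of optimizer steps for a training run."""
    batches_per_epoch = math.ceil(n_examples / (per_device_batch * n_devices))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs

# Settings from the TrainingArguments above: batch size 4, no accumulation, 1 epoch
print(expected_steps(645, per_device_batch=4))
```

The estimate is well above the `global_step=1` reported in the output above, which suggests the trainer saw far fewer examples than the dataset contains. One possible cause is a formatting function that returns a single string for a whole batch.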

In [None]:
trainer.save_model('/content/gdrive/MyDrive/Syllog/full_results/tuned_model')

# Generation with tuned model

Convert the test answers from a list to a dict keyed by ID, so that each answer can later be looked up by its ID

In [None]:
with open('/content/gdrive/MyDrive/Syllog/test_full_raw_answers.pkl', 'rb') as file:
  test_answers = pickle.load(file)

answers_dict = {}

for answer in test_answers:
  answers_dict[answer['id']] = answer

with open('/content/gdrive/MyDrive/Syllog/answers_full_dict.pkl', 'wb') as file:
  pickle.dump(answers_dict, file)


Load the tuned model

In [None]:
PEFT_MODEL = '/content/gdrive/MyDrive/Syllog/full_results/tuned_model'

config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

# Load the Lora model
model = PeftModel.from_pretrained(model, PEFT_MODEL)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

We use a pipeline for inference.

This is not strictly needed: the model can also be used directly for inference.
Feel free to modify the code, and if you do, drop me an email: massimoterzi@hotmail.it
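
For reference, a minimal sketch of pipeline-free inference is shown below. The `generate_answer` helper is hypothetical; it relies only on the standard tokenizer call, `model.generate`, and `tokenizer.decode`, and it assumes `model` and `tokenizer` are the objects loaded in the previous cell.

```python
def generate_answer(model, tokenizer, prompt: str, max_new_tokens: int = 200) -> str:
    """Greedy generation without a pipeline; returns only the newly generated text."""
    # Tokenize the prompt and move the tensors to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens so that only the continuation is decoded
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
```

Slicing off the prompt tokens plays the same role as `return_full_text=False` in the pipeline call.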

In [None]:
# Define the pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer = tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonF

Generation with the model.

Note that we apply the **same post-processing steps** used for the output of the raw model
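
The de-duplication idea behind that post-processing can be sketched in isolation. The version below is a simplified greedy variant, not the notebook's method: it keeps the first occurrence of each sentence, whereas the cell below removes the sentence with the higher mean similarity. It also uses `difflib` from the standard library as a stand-in for `thefuzz`, so the scores are not identical to `fuzz.token_set_ratio`.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity on a 0-100 scale (stand-in for a fuzzy ratio)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def drop_near_duplicates(sentences: list[str], threshold: float = 80) -> list[str]:
    """Keep a sentence only if it is not too similar to any sentence already kept."""
    kept: list[str] = []
    for sentence in sentences:
        if all(similarity(sentence, other) <= threshold for other in kept):
            kept.append(sentence)
    return kept

sentences = [
    "La formazione è importante",
    "La formazione è importante",
    "Il salario minimo è fissato per legge",
]
print(drop_near_duplicates(sentences))
```

The threshold of 80 mirrors the value used in the notebook's loop; with a different similarity function it would likely need re-tuning.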

In [None]:
with open('/content/gdrive/MyDrive/Syllog/answers_full_dict.pkl', 'rb') as file:
  answers_dict = pickle.load(file)

cont = 0

for row in dataset['test']:

  ##############################################
  #############     GENERATION     #############
  ##############################################

  system_message = "Sei un assistente AI utile e conciso. Rispondi in massimo cinque frasi, va bene anche usarne meno."

  prompt_template=f"""<|im_start|>Sistema: {system_message}<|im_end|>
  <|im_start|>Utente: {row['question']}<|im_end|>
  <|im_start|>Assistente: """

  # Call the pipeline also with args to be passed to the model
  sequences = pipe(
      prompt_template,
      max_new_tokens=200,
      do_sample=False,
      return_full_text=False,
      num_return_sequences=1,
      eos_token_id=tokenizer.eos_token_id,
      pad_token_id=tokenizer.eos_token_id,
      decoder_start_token_id=0,
  )

  answer = sequences[0]['generated_text']

  ##############################################
  #############     PROCESSING     #############
  ##############################################

  # If there is the end tag, let's just consider what's before it
  if '<|im_end|>' in answer:
    answer = answer.split('<|im_end|>')[0]

  # Then, we want to remove the numbers of the numbered item list
  answer = re.sub(r'\d+\.\s*', '- ', answer)

  # Next, verify that the sentences generated by the model are not near-duplicates of each other.
  # A complete answer ends with punctuation, so the last element of the split is empty.
  # A non-empty last element means the generation was cut off by the max-token limit, so we drop that fragment.
  answer = answer.strip()
  sentences = re.split(r'[.?!:;]', answer)

  if len(sentences[-1]) > 0:
    answer = answer[:-len(sentences[-1])]

  # If there are multiple sentences, check that they are different from each other
  if len(sentences) > 1:
    sentences = sentences[:-1]

    # Build the Fuzzy matching matrix
    size = len(sentences)
    fuzz_match = np.zeros((size, size))

    for i, sentence in enumerate(sentences):
      for j, compare in enumerate(sentences):
        # Leave the diagonal at zero so that a sentence never matches itself
        if i == j:
          continue
        score = fuzz.token_set_ratio(sentence, compare)
        fuzz_match[i][j] = score

    # Discard sentences with high score
    max_score = np.max(fuzz_match)
    argmax_score = np.argmax(fuzz_match)

    while max_score > 80:
      # Find the two matching sentences
      i = argmax_score // size
      j = argmax_score % size

      # print(f'Size: {size}, argmax: {argmax_score}, i: {i}, j: {j}')

      # out of the two, find the one with the highest average score (the sentence on average more similar to all the others)
      if fuzz_match[i].mean() < fuzz_match[j].mean():
        to_delete = j
      else:
        assert fuzz_match[i].mean() >= fuzz_match[j].mean()
        to_delete = i

      # Delete sentence from the fuzz match
      fuzz_match = np.delete(fuzz_match, to_delete, axis=0)
      fuzz_match = np.delete(fuzz_match, to_delete, axis=1)

      # Since we are deleting one sentence, we need to reduce the size as well
      size -= 1

      # Delete sentence from sentences
      sentences.pop(to_delete)


      # Values for the new While cycle
      max_score = np.max(fuzz_match)
      argmax_score = np.argmax(fuzz_match)

    output = ''

    for sentence in sentences:
      idx = answer.find(sentence)

      if idx != -1 and idx + len(sentence) < len(answer):
          punctuation = answer[idx + len(sentence)]
          output += sentence.strip() + punctuation + '\n'
      else:
          print("Substring not found or character after the substring does not exist.")

  else:
    assert len(sentences) == 1
    output = sentences[0]

  assert row['id'] in answers_dict.keys()

  answers_dict[row['id']]['answer_tuned_model'] = output

  print(f'Question {cont}: ', row['question'])
  print(f'Generated output: {output}')
  # print(f'Reference output: ', row['answer'])
  print('###################################\n')

  with open('/content/gdrive/MyDrive/Syllog/answers_full_dict.pkl', 'wb') as file:
    pickle.dump(answers_dict, file)

  cont += 1


Question 0:  Quali sono le responsabilità principali di un responsabile della conformità delle risorse umane?
Generated output: - Assicurarsi che le procedure e le politiche di conformità siano adeguatamente implementate e applicate.
- Monitorare e valutare la conformità delle risorse umane.
- Gestire le violazioni e le correzioni.
- Fornire formazione e supporto per la conformità delle risorse umane.
- Collaborare con altri dipartimenti per garantire la conformità generale.

###################################

Question 1:  Che cos'è la conformità delle risorse umane?
Generated output: La conformità delle risorse umane è un processo di valutazione che verifica se le risorse umane di un'organizzazione sono conformi ai requisiti e alle norme stabiliti.

###################################

Question 2:  Come definisce il Dipartimento del Lavoro degli Stati Uniti la condotta molesta?
Generated output: La condotta molesta è definita dal Dipartimento del Lavoro degli Stati Uniti come un com



Question 7:  Quali sono le potenziali sfide che sorgono quando le caratteristiche, le competenze e gli interessi di un dipendente non si allineano bene con il suo lavoro?
Generated output: - L'assenza di motivazione e la mancanza di interesse nel lavoro possono portare a una ridotta produttività e a un'alta assenza.
- La mancanza di competenze e abilità necessarie per svolgere il lavoro può portare a difficoltà e problemi.
- La mancanza di opportunità di crescita e sviluppo può portare a una sensazione di stagnazione e mancanza di motivazione.

###################################

Question 8:  Che cos'è l'onboarding dei dipendenti?
Generated output: L'onboarding dei dipendenti è il processo di assimilazione di un nuovo dipendente in un'organizzazione.
Questo processo può includere la formazione, l'integrazione e l'assimilazione del dipendente in un'organizzazione.

###################################

Question 9:  Quali sono i vantaggi di un orientamento per i nuovi dipendenti ben orch

# Performance evaluation

We use common NLP libraries (NLTK, spaCy) for the performance evaluation.

The idea is to compute the similarity between the generated outputs (both by the raw model and the tuned model) and the reference output of the Test set.

In [None]:
!pip install thefuzz
!pip install spacy
!python -m spacy download it_core_news_sm

In [None]:
import string

from thefuzz import fuzz

import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import spacy
# Load the italian model
nlp = spacy.load("it_core_news_sm")

In the pre-processing we:
- lowercase the text and remove punctuation
- remove the stop words
- split the sentences into tokens
- extract the lemma of each token

Since the sentences to be analyzed are in Italian, we need libraries and models that support that language

Then, after the pre-processing, we are simply going to compute the **cosine similarity** between the reference output and the generated output.
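
For intuition, cosine similarity compares the direction of two vectors and ignores their length. The sketch below computes it on plain Python lists; in the notebook the vectors come from spaCy, whose `Doc.similarity` uses cosine similarity over the document vectors by default. Note that small spaCy pipelines such as `it_core_news_sm` do not ship static word vectors, so the scores should be read as a coarse signal.

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal
```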

In [None]:
nltk.download('stopwords')
italian_stopwords = stopwords.words('italian')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
with open('/content/gdrive/MyDrive/Syllog/answers_full_dict.pkl', 'rb') as file:
  answers_dict = pickle.load(file)

In [None]:
raw_results = []
tuned_results = []

# The generation loops store only the model outputs, so fall back to the test set for the reference answers
references = {row['id']: row['answer'] for row in dataset['test']}
punctuation_table = str.maketrans('', '', string.punctuation)

def preprocess(text):
  # Lowercase and remove punctuation
  text = text.lower().translate(punctuation_table)
  # Remove stop words
  text = ' '.join([word for word in text.split() if word not in italian_stopwords])
  # Lemmatization
  text = ' '.join([token.lemma_ for token in nlp(text)])
  return nlp(text)

for key in answers_dict.keys():
  ans_raw_doc = preprocess(answers_dict[key]['answer_raw_model'])
  ans_tuned_doc = preprocess(answers_dict[key]['answer_tuned_model'])
  ans_doc = preprocess(answers_dict[key].get('answer') or references[key])

  # Similarity score
  score_raw = ans_doc.similarity(ans_raw_doc)
  raw_results.append(score_raw)
  score_tuned = ans_doc.similarity(ans_tuned_doc)
  tuned_results.append(score_tuned)

  print(f'Raw model score: {score_raw} vs Tuned model score: {score_tuned}')


Raw model score: 0.8020824087014337 vs Tuned model score: 0.9260227190615818
Raw model score: 0.8286242581933615 vs Tuned model score: 0.8777372285979718
Raw model score: 0.9198674068067116 vs Tuned model score: 0.946362257362798
Raw model score: 0.9179835326036908 vs Tuned model score: 0.9179835326036908
Raw model score: 0.8436379380114625 vs Tuned model score: 0.8584787073924983
Raw model score: 0.7008524064236885 vs Tuned model score: 0.5201808986843949
Raw model score: 0.8798836936483461 vs Tuned model score: 0.9442174425166351
Raw model score: 0.8845289385757162 vs Tuned model score: 0.8997057239149233
Raw model score: 0.8318938959168984 vs Tuned model score: 0.8560446700574239
Raw model score: 0.8050501294476652 vs Tuned model score: 0.8230686087598567
Raw model score: 0.6812185142298836 vs Tuned model score: 0.8864186014124764
Raw model score: 0.890043292098842 vs Tuned model score: 0.8381881608912514
Raw model score: 0.8769699616362508 vs Tuned model score: 0.8937899728530032
R

Save the results

In [None]:
with open('/content/gdrive/MyDrive/Syllog/raw_results.pkl', 'wb') as file:
  pickle.dump(raw_results, file)

with open('/content/gdrive/MyDrive/Syllog/tuned_results.pkl', 'wb') as file:
  pickle.dump(tuned_results, file)

Summary statistics (median and quartiles) of the similarity scores

In [None]:
print(f"  Raw model -> Median {np.median(raw_results):.3}, Q1: {np.percentile(raw_results, 25):.3}, Q3: {np.percentile(raw_results, 75):.3}")
print(f"Tuned model -> Median {np.median(tuned_results):.3}, Q1: {np.percentile(tuned_results, 25):.3}, Q3: {np.percentile(tuned_results, 75):.3}")

  Raw model -> Median 0.882, Q1: 0.825, Q3: 0.918
Tuned model -> Median 0.911, Q1: 0.858, Q3: 0.926
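
Besides the medians, a per-question comparison shows how often the tuned model actually scores higher. The sketch below uses only the first 13 score pairs visible in the log above, rounded to three decimals, so it is an illustration rather than a result for the full test set.

```python
# First 13 score pairs from the log above, rounded to three decimals
raw = [0.802, 0.829, 0.920, 0.918, 0.844, 0.701, 0.880,
       0.885, 0.832, 0.805, 0.681, 0.890, 0.877]
tuned = [0.926, 0.878, 0.946, 0.918, 0.858, 0.520, 0.944,
         0.900, 0.856, 0.823, 0.886, 0.838, 0.894]

wins = sum(t > r for r, t in zip(raw, tuned))     # tuned model scores higher
ties = sum(t == r for r, t in zip(raw, tuned))    # identical score
losses = sum(t < r for r, t in zip(raw, tuned))   # raw model scores higher

print(f"tuned better: {wins}, tie: {ties}, raw better: {losses}")
```

The same three counters can be computed on the complete `raw_results` and `tuned_results` lists.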


Box plot of the two score distributions

In [None]:
trace1 = go.Box(y=raw_results, name='Raw model results')
trace2 = go.Box(y=tuned_results, name='Tuned model results')

layout = go.Layout(title='Performance comparison between raw and fine-tuned models',
                    yaxis=dict(title='Similarity to ground truth'))

fig = go.Figure(data=[trace1, trace2], layout=layout)
fig.show()


# Conclusion

The dataset used for the fine-tuning was very small compared to industry standards.

Even so, we observe a slight improvement in the similarity scores of the tuned model over the raw version: the median rises from 0.882 to 0.911.

Qualitatively, the answers generated by the tuned model are often quite complete and to the point, sometimes at the expense of fluency.

This behavior may be connected to the system prompts we defined for generation and training, and also to the prompts used to create the dataset itself.

We also need to keep in mind that both models (raw and tuned) run with **4-bit quantization**, which can lower the overall quality of the answers.

Despite all the above, we are very satisfied with the results: not so much with the output of the tuned LLM itself, but with the fact that we are now **more familiar** with the LoRA fine-tuning process and all its hurdles.