<a href="https://colab.research.google.com/github/Ilvecho/Happy-Customers/blob/main/LoRA_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we are going to perform the actual LoRA fine tuning of our model.

We will use the data scraped in the Web_Scraping notebook and then elaborated in the Docs_elaboration notebook.

Thanks to the processing steps, we have already available data in the desired JSON format.

In [1]:
import numpy as np
import pandas as pd
import torch
import os
import re
import json
import random
import pickle
import plotly.graph_objects as go

from google.colab import userdata
from google.colab import files,drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [3]:
!pip install trl transformers datasets torch peft
!pip install -qU accelerate
!pip install -qU bitsandbytes
!pip install thefuzz

Collecting trl
  Downloading trl-0.7.11-py3-none-any.whl (155 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m155.3/155.3 kB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m
Collecting datasets
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m536.7/536.7 kB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0m
Collecting peft
  Downloading peft-0.8.2-py3-none-any.whl (183 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m183.4/183.4 kB[0m [31m25.9 MB/s[0m eta [36m0:00:00[0m
Collecting accelerate (from trl)
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m280.0/280.0 kB[0m [31m32.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting tyro>=0.5.11 (from trl)
  Downloading tyro-0.7.3-py3-none-any.whl (79 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m79.8/79.8 kB[0m [31m13.0 MB/s[0m eta [36m0:00:00[0

In [4]:
from datasets import load_dataset, Dataset, DatasetDict

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, BitsAndBytesConfig, GenerationConfig, pipeline
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model, AutoPeftModelForCausalLM
from trl import SFTTrainer
from thefuzz import fuzz

# Initialization

### Load datasets

Load the created dataset

In [None]:
#train & test.json are in same folder as the jupyter notebook
data_files = {'train':'/content/gdrive/MyDrive/Syllog/train_data.json',
              'test':'/content/gdrive/MyDrive/Syllog/test_data.json'}
dataset = load_dataset('json',data_files=data_files)

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Load a small chunck of the created dataset:
the idea is to reduce the number of samples because of the limited available resources.

In [None]:
with open('/content/gdrive/MyDrive/Syllog/data.json', 'r', encoding='utf-8') as json_file:
  data = json.load(json_file)

small_data = []
# We identified a random topic, which is covered by data samples IDs from 20 to 58 included
for id in range(20, 59):
  sample = data[id]
  small_data.append(sample)

with open('/content/gdrive/MyDrive/Syllog/small_data.json', 'w', encoding='utf-8') as json_file:
  json.dump(small_data, json_file, ensure_ascii=False)

In [None]:
small_data = Dataset.from_json('/content/gdrive/MyDrive/Syllog/small_data.json')

split_dataset = small_data.train_test_split(test_size=0.25)

# Create a DatasetDict object
dataset = DatasetDict({
    'train': split_dataset['train'].shuffle(),
    'test': split_dataset['test'].shuffle()
})

with open('/content/gdrive/MyDrive/Syllog/small_data_split.pkl', 'wb') as file:
  pickle.dump(dataset, file)

Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
with open('/content/gdrive/MyDrive/Syllog/small_data_split.pkl', 'rb') as file:
  dataset = pickle.load(file)

Saved the abvoe train & test split in JSON files (code has been deleted, unfortunately). Now we can load them in a dataset variable:

In [7]:
data_files = {'train':'/content/gdrive/MyDrive/Syllog/small_data_train.json',
              'test':'/content/gdrive/MyDrive/Syllog/small_data_test.json'}
dataset = load_dataset('json',data_files=data_files)

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

### Model and tokenizer

Load the model and configure it to use 4bit quantization (because of RAM limitations)

In [8]:
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype= torch.bfloat16,
        bnb_4bit_use_double_quant= False,
)

model_name = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(
        model_name,
        load_in_4bit=True,
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )

# We want each generation to be independent & save memory
model.config.use_cache = False
# The backprop gradient is computed not using all parameters, to save memory
model.gradient_checkpointing_enable()
# Makes training faster but a little less accurate
model.config.pretraining_tp = 1

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.94G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Load the tokenizer

In [49]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side='left')
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True # It was true, changed to False but I am not sure the change was enforced
tokenizer.add_bos_token = True
# tokenizer.add_bos_token, tokenizer.add_eos_token

Before proceeding with the fine tuning, let's first evaluate the performance of the model non fine tuned

# Dev playground

In [None]:
# Define the pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer = tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# prompt = "<|prompter|>Answer in maximum four sentences: What are the best ways to reduce Employee turnover?</s><|assistant|>"
# prompt = "<s>How to resolve conflict in the workplace?</s>"
# system_message = "<s>You are a useful and concise AI assistant. You are a pro at using bullet points when needed. You are allowed to use maximum five sentences for your answer</s>"

#prompt_template=f"""<|im_start|>System: {system_message}<|im_end|>
#<|im_start|>User: {prompt}<|im_end|>
#<|im_start|>Assistant: """

#prompt = "Perché è importante che le organizzazioni no-profit sviluppino un programma di formazione per i donatori e quale impatto può avere?"
prompt = "Quali sono le potenziali sfide che sorgono quando le caratteristiche, le competenze e gli interessi di un dipendente non si allineano bene con il suo lavoro?"
system_message = "Sei un assistente AI utile e conciso. Rispondi in massimo cinque frasi, va bene anche usarne meno."

prompt_template=f"""<|im_start|>Sistema: {system_message}<|im_end|>
<|im_start|>Utente: {prompt}<|im_end|>
<|im_start|>Assistente: """

# Call the pipeline also with args to be passed to the model
sequences = pipe(
    prompt_template,
    max_new_tokens=200,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    decoder_start_token_id=0,
)

answer = sequences[0]['generated_text']
print(answer)

1. L'assenza di motivazione e la mancanza di interesse nel lavoro possono portare a una diminuzione della produttività e della qualità del lavoro. 2. La mancanza di competenze e abilità necessarie per svolgere il lavoro può portare a difficoltà e problemi. 3. La mancanza di opportunità di crescita e sviluppo può portare a una sensazione di stagnazione e mancanza di motivazione. 4. La mancanza di un clima di lavoro positivo e motivante può portare a un'atmosfera negativa e a un'alta assenza. 5. La mancanza di un'adeguata compensazione e riconoscimento può portare a una mancanza di motivazione e una sensazione di insoddisfazione.



Let's do some basic processing of the output:
- We look for the closing tag '<|im_end|>' and we cut the answer there
- We remove all the numbers of the numbered list
- We split the answer in sentences using the classical sentences delimiters [ . ? ! : ; ]
- We build a matrix containing all the Fuzzy matching scores for all the sentences. The score function used is the **fuzzy set match** because we are interested in the words used in each sentence
- If there are two sentences with a match greater than 80 (i.e. extremely similar), then we remove the sentence that has the highest average matching score

In [None]:
#answer = "1. Il programma di formazione per i donatori è importante perché aiuta a migliorare la comprensione e la fiducia dei donatori nei confronti dell'organizzazione no-profit. 2. Un programma di formazione per i donatori può avere un impatto positivo sulle donazioni, poiché aiuta a migliorare la comprensione dei donatori sulle attività e gli obiettivi dell'organizzazione no-profit. 3. Un programma di formazione per i donatori può anche aiutare a migliorare la fiducia dei donatori nei confronti dell'organizzazione no-profit, poiché aiuta a migliorare la comprensione dei donatori sulle attività e gli obiettivi dell'organizzazione no-profit. 4. Un program"
# answer = "1. Offrire opportunità di formazione personalizzate e adattate alle esigenze e alle preferenze dei dipendenti. 2. Organizzare eventi di formazione interattivi e coinvolgenti, come workshop, seminari e conferenze. 3. Utilizzare tecnologie innovative, come simulazioni virtuali e app per smartphone, per rendere la formazione più accessibile e interattiva. 4. Fornire incentivi e motivazioni per incoraggiare i dipendenti a partecipare alla formazione. 5. Raccolta di feedback e valutazioni per migliorare continuamente la formazione offerta.<|im_end|> <|im_start|>Utente: Quali sono le principali sfide che i datori di lavoro devono affrontare nel fornire formazione ai"
answer = "1. L'assenza di motivazione e la mancanza di interesse nel lavoro possono portare a una diminuzione della produttività e della qualità del lavoro. 2. La mancanza di competenze e abilità necessarie per svolgere il lavoro può portare a difficoltà e problemi. 3. La mancanza di opportunità di crescita e sviluppo può portare a una sensazione di stagnazione e mancanza di motivazione. 4. La mancanza di un clima di lavoro positivo e motivante può portare a un'atmosfera negativa e a un'alta assenza. 5. La mancanza di un'adeguata compensazione e riconoscimento può portare a una mancanza di motivazione e una sensazione di insoddisfazione."
#answer = "An effective employee onboarding program should include the following elements: 1. A clear and concise onboarding process that outlines the steps and timeline for new employees. 2. A comprehensive orientation program that provides new employees with an overview of the company, its culture, and its values. 3. A mentorship program that pairs new employees with experienced employees who can provide guidance and support. 4. A training program that provides new employees with the skills and knowledge they need to be successful in their roles. 5. A feedback and evaluation process that allows new employees to provide feedback on their onboarding experience and receive feedback on their performance. 6. A recognition program that rewards and recognizes new employees for their contributions and achievements. 7. A socialization program that helps new employees build relationships with their colleagues and feel like they belong to the company. 8. A communication program that keeps new employees informed about company news, events, and updates."


In [None]:
# If there is the end tag, let's just consider what's before it
if '<|im_end|>' in answer:
  answer = answer.split('<|im_end|>')[0]

# Then, we want to remove the numbers of the numbered item list
answer = re.sub(r'\d+\.\s*', '- ', answer)

# Then, what we want  to do is to verify that each sentence generated by the model is not similar to the others
# We want to discard the last element as the model will always close a sentence with a dot.
# If no dot is present, it means that the generation was interrupted because of the max tokens limit
sentences = re.split(r'[.?!:;]', answer.strip())

if len(sentences[-1]) > 0:
  answer = answer[:-len(sentences[-1])]

sentences = sentences[:-1]

# Build the Fuzzy matching matrix
size = len(sentences)
fuzz_match = np.zeros((size, size))

for i, sentence in enumerate(sentences):
  for j, compare in enumerate(sentences):
    if sentence is compare:
      continue
    else:
      score = fuzz.token_set_ratio(sentence,compare)
      fuzz_match[i][j] = score

# Discard sentences with high score
max_score = np.max(fuzz_match)
argmax_score = np.argmax(fuzz_match)

while max_score > 80:
  # Find the two matching sentences
  i = argmax_score // size
  j = argmax_score % size

  # out of the two, find the one with the highest average score (the sentence on average more similar to all the others)
  if fuzz_match[i].mean() < fuzz_match[j].mean():
    to_delete = j
  else:
    assert fuzz_match[i].mean() >= fuzz_match[j].mean()
    to_delete = i

  # Delete sentence from the fuzz match
  fuzz_match = np.delete(fuzz_match, to_delete, axis=0)
  fuzz_match = np.delete(fuzz_match, to_delete, axis=1)

  # Delete sentence from sentences
  sentences.pop(to_delete)

  # Values for the new While cycle
  max_score = np.max(fuzz_match)
  argmax_score = np.argmax(fuzz_match)

Now I want to join back the sentences using the original punctuation

In [None]:
output = ''

for sentence in sentences:
  idx = answer.find(sentence)

  if idx != -1 and idx + len(sentence) < len(answer):
      punctuation = answer[idx + len(sentence)]
      output += sentence.strip() + punctuation + '\n'
  else:
      print("Substring not found or character after the substring does not exist.")

print(output)

- L'assenza di motivazione e la mancanza di interesse nel lavoro possono portare a una diminuzione della produttività e della qualità del lavoro.
- La mancanza di competenze e abilità necessarie per svolgere il lavoro può portare a difficoltà e problemi.
- La mancanza di opportunità di crescita e sviluppo può portare a una sensazione di stagnazione e mancanza di motivazione.
- La mancanza di un clima di lavoro positivo e motivante può portare a un'atmosfera negativa e a un'alta assenza.
- La mancanza di un'adeguata compensazione e riconoscimento può portare a una mancanza di motivazione e una sensazione di insoddisfazione.



# Original model Generation

Here we put together the Generation with LLM and the string processing

In [None]:
# Define the pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer = tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

Note that the system prompt used in the Test generation from the base model is similar to the prompt used in ChatGPT to create the dataset:

template_general_questions =

    I provide you with the following context: '''{transcript}'''.
    You must identify the general topic that is discussed in the provided context.
    Once the general topic is identified, you need to generate 5 pairs of Question-Answer on the general topic.
    Since the questions are generic, the answers must be at least 2 sentences (but do not go above 6 sentences).



template_specific_questions =

    I provide you with the following context: '''{transcript}'''.
    You must identify the general topic that is discussed in the provided context.
    Once the general topic is identified, one related sub-topic covered in the provided context.
    In the output list all the identified sub-topics in a numbered list. You can use it to double check that the identified sub-topics are five.
    Create two Question-Answer pair for said sub-topic. Double check that they are two.
    Since the question are specific to a sub-topic, the answer must be at most four sentences long.
    Repeat the above actions for five different sub-topics covered in the context.
    Before providing the output, review your answer and make sure that five sub topics have been identified.

system =

    You are a helpful assistant that reads documents, understand their content, and generate Question-Answer pairs.
    Your output will be used to perform supervised fine tuning of a LLM - keep it in mind when formulating both the question and the answer.
    The desired output format is the following:
    - The first line of the output should be "Topic:" followed by the topic identified in the provided document
    - Identify the questions with "Question:" and the answers with "Answer:"
    - each question and each answer need to be in one line only. The result of this is that each line will start either with "Question:" or with "Answer:"
    Avoid referring to any Named Entity in the questions, unless extremely relevant for the document content.
    Email addresses and phone numbers are not relevant for me - do not mention them at any time.


In [None]:
test_answers = []

for row in dataset['test']:

  ##############################################
  #############     GENERATION     #############
  ##############################################

  system_message = "Sei un assistente AI utile e conciso. Rispondi in massimo cinque frasi, va bene anche usarne meno."

  prompt_template=f"""<|im_start|>Sistema: {system_message}<|im_end|>
  <|im_start|>Utente: {row['question']}<|im_end|>
  <|im_start|>Assistente: """

  # Call the pipeline also with args to be passed to the model
  sequences = pipe(
      prompt_template,
      max_new_tokens=200,
      do_sample=False,
      return_full_text=False,
      num_return_sequences=1,
      eos_token_id=tokenizer.eos_token_id,
      pad_token_id=tokenizer.eos_token_id,
      decoder_start_token_id=0,
  )

  answer = sequences[0]['generated_text']

  ##############################################
  #############     PROCESSING     #############
  ##############################################

  # If there is the end tag, let's just consider what's before it
  if '<|im_end|>' in answer:
    answer = answer.split('<|im_end|>')[0]

  # Then, we want to remove the numbers of the numbered item list
  answer = re.sub(r'\d+\.\s*', '- ', answer)

  # Then, what we want  to do is to verify that each sentence generated by the model is not similar to the others
  # We want to discard the last element as the model will always close a sentence with a dot.
  # If no dot is present, it means that the generation was interrupted because of the max tokens limit
  sentences = re.split(r'[.?!:;]', answer.strip())

  if len(sentences[-1]) > 0:
    answer = answer[:-len(sentences[-1])]

  sentences = sentences[:-1]

  # Build the Fuzzy matching matrix
  size = len(sentences)
  fuzz_match = np.zeros((size, size))

  for i, sentence in enumerate(sentences):
    for j, compare in enumerate(sentences):
      if sentence is compare:
        continue
      else:
        score = fuzz.token_set_ratio(sentence,compare)
        fuzz_match[i][j] = score

  # Discard sentences with high score
  max_score = np.max(fuzz_match)
  argmax_score = np.argmax(fuzz_match)

  while max_score > 80:
    # Find the two matching sentences
    i = argmax_score // size
    j = argmax_score % size

    # out of the two, find the one with the highest average score (the sentence on average more similar to all the others)
    if fuzz_match[i].mean() < fuzz_match[j].mean():
      to_delete = j
    else:
      assert fuzz_match[i].mean() >= fuzz_match[j].mean()
      to_delete = i

    # Delete sentence from the fuzz match
    fuzz_match = np.delete(fuzz_match, to_delete, axis=0)
    fuzz_match = np.delete(fuzz_match, to_delete, axis=1)

    # Delete sentence from sentences
    sentences.pop(to_delete)

    # Values for the new While cycle
    max_score = np.max(fuzz_match)
    argmax_score = np.argmax(fuzz_match)

  output = ''

  for sentence in sentences:
    idx = answer.find(sentence)

    if idx != -1 and idx + len(sentence) < len(answer):
        punctuation = answer[idx + len(sentence)]
        output += sentence.strip() + punctuation + '\n'
    else:
        print("Substring not found or character after the substring does not exist.")

  new_element = {"id": row['id'],
                 "answer_raw_model": output}

  test_answers.append(new_element)

  print(f'Question: ', row['question'])
  print(f'Generated output: {output}')
  print(f'Reference output: ', row['answer'])
  print('###################################\n')


  with open('/content/gdrive/MyDrive/Syllog/test_raw_answers.pkl', 'wb') as file:
    pickle.dump(test_answers, file)


# Fine Tuning

In this section we are going to perform the actual fine tuning of the model.

In [31]:
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=4,                # The number of data samples is extemely limited, so let's use a low rank to reduce resources requirements
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
)
model = get_peft_model(model, peft_config)

In [46]:
training_arguments = TrainingArguments(
    output_dir="/content/gdrive/MyDrive/Syllog/results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to=None
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


We actually need to modify the data to make it in a format suitable for fine tuning.

Hence, we define a formatting function and then pass it to the trainer

In [37]:
def prompt_instruction_format(sample):
  system_prompt = 'Sei un assistente AI utile e conciso. Rispondi in massimo cinque frasi, va bene anche usarne meno.'
  return [f"""<|im_start|>Sistema: {system_prompt}<|im_end|><|im_start|>Utente: {sample['question']}<|im_end|><|im_start|>Assistente: {sample['answer']}<|im_end|>"""]

In [None]:
!pip install wandb

In [44]:
os.environ['WANDB_DISABLED'] = 'true'

In [50]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    peft_config=peft_config,
    max_seq_length=200,
    tokenizer=tokenizer,
    formatting_func=prompt_instruction_format,
    args=training_arguments,
    packing= False,
)

Map:   0%|          | 0/29 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

In [51]:
trainer.train()



Step,Training Loss


TrainOutput(global_step=1, training_loss=2.1028406620025635, metrics={'train_runtime': 5.6536, 'train_samples_per_second': 0.177, 'train_steps_per_second': 0.177, 'total_flos': 8539712716800.0, 'train_loss': 2.1028406620025635, 'epoch': 1.0})