<a href="https://colab.research.google.com/github/Ilvecho/FineTuned_LLM/blob/main/LoRA_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we perform the actual LoRA fine-tuning of our model.

We use the data scraped in the Web_Scraping notebook and then processed in the Docs_elaboration notebook.

Thanks to those processing steps, the data is already available in the desired JSON format.
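
As a rough illustration of what such a file might look like, the sketch below writes two records and reads them back. The field names `question` and `answer` are the ones the formatting function later in this notebook reads; the one-object-per-line layout, the temporary path and the sample contents are assumptions made for the example, not a description of the actual files.

```python
import json
import os
import tempfile

# Hypothetical records: only the "question"/"answer" keys come from the notebook.
records = [
    {"question": "What is LoRA?", "answer": "A parameter-efficient fine-tuning method."},
    {"question": "Why quantize to 4 bits?", "answer": "To fit the model in limited memory."},
]

# Write one JSON object per line (JSON Lines), a layout that the 'json'
# loader of the datasets library can read.
path = os.path.join(tempfile.mkdtemp(), "train_data.json")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back and check that every record has both fields.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

has_both_fields = all({"question", "answer"} <= set(r) for r in loaded)
print(len(loaded), has_both_fields)
```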

In [1]:
import numpy as np
import pandas as pd
import torch
import os
import re
import json
import random
import pickle
import plotly.graph_objects as go

from google.colab import userdata
from google.colab import files,drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Use Transformers library

In [None]:
!pip install trl transformers datasets torch peft
!pip install -qU accelerate
!pip install -qU bitsandbytes

In [None]:
from datasets import load_dataset

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, BitsAndBytesConfig, GenerationConfig, pipeline
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model, AutoPeftModelForCausalLM
from trl import SFTTrainer

Load the created dataset

In [None]:
# train_data.json and test_data.json are stored in the mounted Google Drive folder
data_files = {'train':'/content/gdrive/MyDrive/Syllog/train_data.json',
              'test':'/content/gdrive/MyDrive/Syllog/test_data.json'}
dataset = load_dataset('json',data_files=data_files)

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Load the model and configure it to use 4-bit quantization (because of memory limitations)
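
To give a feeling for why quantization helps, here is a back-of-the-envelope estimate of the memory taken by the weights alone at different precisions. The parameter count of 7e9 is a round-number assumption, and the estimate ignores activations, optimizer state and the small overhead of the quantization constants.

```python
# Rough weight-memory estimate for a ~7B-parameter model.
# ASSUMPTION: 7e9 parameters, weights only.
n_params = 7e9

bytes_per_param = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "4-bit (nf4)": 0.5,
}

# Memory in GB (1 GB = 1e9 bytes) for each precision
weight_gb = {name: n_params * size / 1e9 for name, size in bytes_per_param.items()}

for name, gb in weight_gb.items():
    print(f"{name:>12}: {gb:.1f} GB")
```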

In [None]:
!pip install accelerate
!pip install bitsandbytes

In [None]:
!pip install -i https://test.pypi.org/simple/ bitsandbytes

Looking in indexes: https://test.pypi.org/simple/


In [None]:
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype= torch.bfloat16,
        bnb_4bit_use_double_quant= False,
)

model_name = "mistralai/Mistral-7B-v0.1"
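# The 4-bit settings are passed only through quantization_config:
# also passing load_in_4bit=True here conflicts with it
model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )

# Disable the KV cache: it only helps generation and is incompatible with gradient checkpointing
model.config.use_cache = False
# Recompute activations during the backward pass instead of storing them (less memory, more compute)
model.gradient_checkpointing_enable()
# Tensor-parallelism degree from pretraining (a Llama-style config field); 1 keeps the standard linear-layer computation
model.config.pretraining_tp = 1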

ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes `pip install -i https://test.pypi.org/simple/ bitsandbytes` or pip install bitsandbytes` 

Load the tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side='left')
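# Append </s> to every encoded text. This is useful for training, but it also ends up after the prompt at generation time.
# Whether changing this flag after loading actually takes effect has not been verified.
tokenizer.add_eos_token = True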
tokenizer.add_bos_token = True
# tokenizer.add_bos_token, tokenizer.add_eos_token

Before proceeding with the fine-tuning, let's first evaluate the performance of the model without any fine-tuning

In [None]:
# Define the pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer = tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Config that can be passed to model.generate
# CURRENTLY NOT USED
generation_config = GenerationConfig(
    eos_token_id=model.config.eos_token_id,
    # The model defines no pad token, so EOS is reused (as in the pipeline call below)
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=300,
    num_return_sequences=1,
)

# prompt = "<|prompter|>Answer in maximum four sentences: What are the best ways to reduce Employee turnover?</s><|assistant|>"
# prompt = "<s>How to resolve conflict in the workplace?</s>"
# system_message = "<s>You are a useful and concise AI assistant. You are a pro at using bullet points when needed. You are allowed to use maximum five sentences for your answer</s>"

#prompt_template=f"""<|im_start|>System: {system_message}<|im_end|>
#<|im_start|>User: {prompt}<|im_end|>
#<|im_start|>Assistant: """

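# The prompts are kept in Italian, the language of the fine-tuning data.
# English: "Why is it important for non-profit organizations to develop a training program for donors, and what impact can it have?"
prompt = "Perché è importante che le organizzazioni no-profit sviluppino un programma di formazione per i donatori e quale impatto può avere?"
# English: "You are a useful and concise AI assistant. Answer in at most five sentences; using fewer is fine too."
system_message = "Sei un assistente AI utile e conciso. Rispondi in massimo cinque frasi, va bene anche usarne meno."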

prompt_template=f"""<|im_start|>Sistema: {system_message}<|im_end|>
<|im_start|>Utente: {prompt}<|im_end|>
<|im_start|>Assistente: """

# Call the pipeline, also passing the generation arguments forwarded to the model
sequences = pipe(
    prompt_template,
    max_new_tokens=200,
    do_sample=False,          # greedy decoding, so the output is deterministic
    return_full_text=False,   # return only the generated continuation, not the prompt
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)

answer = sequences[0]['generated_text']
print(answer)

# Use the generate method directly
# inputs = tokenizer(prompt, return_tensors="pt")
# outputs = model.generate(**inputs, generation_config=generation_config)
# print(tokenizer.batch_decode(outputs, skip_special_tokens=False))

Let's do some basic processing of the output

In [2]:
! pip install thefuzz

Collecting thefuzz
  Downloading thefuzz-0.22.1-py3-none-any.whl (8.2 kB)
Collecting rapidfuzz<4.0.0,>=3.0.0 (from thefuzz)
  Downloading rapidfuzz-3.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.4 MB)
Installing collected packages: rapidfuzz, thefuzz
Successfully installed rapidfuzz-3.6.1 thefuzz-0.22.1


In [4]:
from thefuzz import process
from thefuzz import fuzz

In [38]:
# answer = "1. Il programma di formazione per i donatori è importante perché aiuta a migliorare la comprensione e la fiducia dei donatori nei confronti dell'organizzazione no-profit. 2. Un programma di formazione per i donatori può avere un impatto positivo sulle donazioni, poiché aiuta a migliorare la comprensione dei donatori sulle attività e gli obiettivi dell'organizzazione no-profit. 3. Un programma di formazione per i donatori può anche aiutare a migliorare la fiducia dei donatori nei confronti dell'organizzazione no-profit, poiché aiuta a migliorare la comprensione dei donatori sulle attività e gli obiettivi dell'organizzazione no-profit. 4. Un program"
#answer = "1. Offrire opportunità di formazione personalizzate e adattate alle esigenze e alle preferenze dei dipendenti. 2. Organizzare eventi di formazione interattivi e coinvolgenti, come workshop, seminari e conferenze. 3. Utilizzare tecnologie innovative, come simulazioni virtuali e app per smartphone, per rendere la formazione più accessibile e interattiva. 4. Fornire incentivi e motivazioni per incoraggiare i dipendenti a partecipare alla formazione. 5. Raccolta di feedback e valutazioni per migliorare continuamente la formazione offerta.<|im_end|> <|im_start|>Utente: Quali sono le principali sfide che i datori di lavoro devono affrontare nel fornire formazione ai"
answer = "An effective employee onboarding program should include the following elements: 1. A clear and concise onboarding process that outlines the steps and timeline for new employees. 2. A comprehensive orientation program that provides new employees with an overview of the company, its culture, and its values. 3. A mentorship program that pairs new employees with experienced employees who can provide guidance and support. 4. A training program that provides new employees with the skills and knowledge they need to be successful in their roles. 5. A feedback and evaluation process that allows new employees to provide feedback on their onboarding experience and receive feedback on their performance. 6. A recognition program that rewards and recognizes new employees for their contributions and achievements. 7. A socialization program that helps new employees build relationships with their colleagues and feel like they belong to the company. 8. A communication program that keeps new employees informed about company news, events, and updates."


In [39]:
# If the end tag is present, keep only the text before it
if '<|im_end|>' in answer:
  answer = answer.split('<|im_end|>')[0]

# Remove the numbering of numbered list items (e.g. "1. ")
answer = re.sub(r'\d+\.\s*', '', answer)

# Split the answer into sentences, so that each one can be compared with the others.
# A complete answer ends with punctuation, so the last element of the split is an empty string.
# A non-empty last element means the generation was cut off by the max tokens limit:
# in that case the incomplete sentence is dropped from both the answer and the list.
sentences = re.split(r'[.?!:;]', answer.strip())

if len(sentences[-1]) > 0:
  answer = answer[:-len(sentences[-1])]
  sentences = sentences[:-1]
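
The same steps can be wrapped in a small helper, which makes it easier to check what happens with a truncated answer. This is only a sketch: the function name is made up for the example and, unlike the cell above, it also strips whitespace and discards empty pieces.

```python
import re

def split_into_sentences(answer, end_tag="<|im_end|>"):
    """Return (cleaned_answer, sentences) for a generated answer."""
    # Keep only the text before the end tag (no-op if the tag is absent)
    text = answer.split(end_tag)[0]
    # Remove list numbering such as "1. " and surrounding whitespace
    text = re.sub(r"\d+\.\s*", "", text).strip()
    pieces = re.split(r"[.?!:;]", text)
    # A non-empty last piece is a sentence cut off by the token limit
    if pieces[-1]:
        text = text[: -len(pieces[-1])]
    # The last piece is either empty or truncated: drop it in both cases
    pieces = pieces[:-1]
    sentences = [p.strip() for p in pieces if p.strip()]
    return text.strip(), sentences

truncated = "1. First point. 2. Second point. 3. Trunc"
cleaned, parts = split_into_sentences(truncated)
print(cleaned)
print(parts)
```

On the truncated example, the incomplete third item is removed and only the two complete sentences remain.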


In [40]:
size = len(sentences)
fuzz_match = np.zeros((size, size))

for i, sentence in enumerate(sentences):
  for j, compare in enumerate(sentences):
    # Skip the diagonal: a sentence is not compared with itself
    if i == j:
      continue
    fuzz_match[i][j] = fuzz.token_set_ratio(sentence, compare)

fuzz_match

array([[ 0., 57., 48., 49., 45., 48., 53., 47., 47.,  0.],
       [57.,  0., 59., 55., 57., 65., 61., 54., 55.,  0.],
       [48., 59.,  0., 64., 64., 57., 64., 63., 64.,  0.],
       [49., 55., 64.,  0., 58., 63., 58., 57., 57.,  0.],
       [45., 57., 64., 58.,  0., 54., 55., 70., 51.,  0.],
       [48., 65., 57., 63., 54.,  0., 57., 54., 54.,  0.],
       [53., 61., 64., 58., 55., 57.,  0., 57., 62.,  0.],
       [47., 54., 63., 57., 70., 54., 57.,  0., 58.,  0.],
       [47., 55., 64., 57., 51., 54., 62., 58.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
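
One possible way to use such pairwise scores is to flag sentence pairs above a threshold as repetitions. The sketch below is self-contained and therefore uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.token_set_ratio`, so its scores are not the ones shown above; the function name and the threshold of 80 are arbitrary assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicates(sentences, threshold=80.0):
    """Return (i, j, score) for sentence pairs with similarity >= threshold."""
    flagged = []
    # combinations() visits each unordered pair once, so the diagonal is skipped
    for (i, a), (j, b) in combinations(enumerate(sentences), 2):
        score = 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            flagged.append((i, j, round(score, 1)))
    return flagged

# Hypothetical sentences: the first two differ by a single letter
sample = [
    "A training program helps donors understand the organisation",
    "A training program helps donors understand the organization",
    "Feedback should be collected regularly",
]
pairs = near_duplicates(sample)
print(pairs)
```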

# NOT YET RELEVANT

In [None]:
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
)
model = get_peft_model(model, peft_config)
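
To get an idea of how many parameters this configuration trains, recall that LoRA adds two low-rank matrices to each adapted layer, for a total of `r * (d_in + d_out)` extra parameters per layer. The estimate below is a sketch: the layer shapes are assumptions taken from the published Mistral-7B configuration (hidden size 4096, key/value projection size 1024, MLP size 14336, 32 layers) and are not read from the loaded model.

```python
# LoRA adds A (r x d_in) and B (d_out x r) to each adapted linear layer,
# i.e. r * (d_in + d_out) trainable parameters per layer.
r = 64
lora_alpha = 16

# ASSUMPTION: shapes from the published Mistral-7B config
hidden, kv_dim, intermediate, n_layers = 4096, 1024, 14336, 32
shapes = {  # module name -> (d_in, d_out)
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
}

per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
total_trainable = per_layer * n_layers

# The low-rank update is multiplied by lora_alpha / r before being added
scaling = lora_alpha / r

print(f"trainable LoRA parameters: {total_trainable:,}")
print(f"scaling factor: {scaling}")
```

On a real model, `model.print_trainable_parameters()` from PEFT reports the actual count.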

In [None]:
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="wandb"
)
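
The arguments above determine how many optimizer steps one epoch takes and how often a checkpoint is written. The arithmetic can be sketched as follows; the dataset size and the single-GPU setup are hypothetical values chosen for the example.

```python
import math

# Values from the TrainingArguments above
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_train_epochs = 1
save_steps = 25

# ASSUMPTIONS: hypothetical dataset size, single GPU
n_train_examples = 1000
n_devices = 1

# Number of examples contributing to each optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * n_devices
steps_per_epoch = math.ceil(n_train_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
# A checkpoint is written every save_steps optimizer steps
n_checkpoints = total_steps // save_steps

print(effective_batch, total_steps, n_checkpoints)
```

Note that, to our knowledge, `lr_scheduler_type="constant"` does not apply the warmup ratio; `"constant_with_warmup"` is the variant that does.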

The data still needs to be converted into a prompt format suitable for fine-tuning.

Hence, we define a formatting function and pass it to the trainer.

In [None]:
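def prompt_instruction_format(sample):
  # The tokenizer already prepends <s> (add_bos_token=True), so it is not repeated here.
  # The lines are joined explicitly so that no code indentation leaks into the training text.
  return (
      "[INST] Generate an answer to the Input question with the information you have in your memory. "
      "If you are not sure about the answer, say so rather than making something up.\n"
      f"### Input:{sample['question']} [/INST]\n"
      f"{sample['answer']}"
  )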
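
To see what a formatted training example looks like, the sketch below applies a simplified, hypothetical stand-in for the formatting function to one made-up record. The shortened instruction text and the sample contents are assumptions; only the `[INST] ... [/INST]` layout and the `question`/`answer` fields follow the notebook.

```python
def build_training_text(question, answer):
    """Simplified stand-in for the notebook's formatting function."""
    instruction = "Answer the Input question; say so if you are not sure."
    return f"[INST] {instruction}\n### Input: {question} [/INST]\n{answer}"

# Hypothetical record with the same fields as the dataset
sample = {
    "question": "What does LoRA train?",
    "answer": "Small low-rank adapter matrices.",
}

text = build_training_text(sample["question"], sample["answer"])

# Everything up to [/INST] is the prompt; what follows is the target answer
prompt_part, completion_part = text.split("[/INST]\n")

print(text)
```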

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    peft_config=peft_config,
    max_seq_length= None,
    dataset_text_field="text",
    tokenizer=tokenizer,
    formatting_func=prompt_instruction_format,
    args=training_arguments,
    packing= False,
)

In [None]:
trainer.train()