<a href="https://colab.research.google.com/github/Ilvecho/FineTuned_LLM/blob/main/LoRA_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we perform the actual LoRA fine-tuning of our model.

We use the data scraped in the Web_Scraping notebook and then processed in the Docs_elaboration notebook.

Thanks to those processing steps, the data is already available in the required JSON format.
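
As a rough illustration of that format, the sketch below writes and reads back a tiny file with one JSON object per line. The `question` and `answer` field names are the ones used by the formatting function later in this notebook. The sample record, the `validate_record` helper and the temporary file are made up for the example, and the real files may be laid out differently (for instance as a single JSON array).

```python
import json
import os
import tempfile

# Hypothetical record: only the field names come from this notebook.
records = [
    {"question": "What is LoRA?", "answer": "A parameter-efficient fine-tuning method."},
]

def validate_record(record):
    """Return True if 'question' and 'answer' are both non-empty strings."""
    return all(
        isinstance(record.get(key), str) and bool(record[key].strip())
        for key in ("question", "answer")
    )

# Write one JSON object per line (JSON Lines), a layout load_dataset('json', ...) accepts.
path = os.path.join(tempfile.mkdtemp(), "sample_data.json")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back and check every record.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

all_valid = all(validate_record(r) for r in loaded)
print(all_valid)
```

Running a check like this before training can catch empty or missing fields early, before they turn into malformed prompts.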

In [1]:
import numpy as np
import pandas as pd
import torch
import os
import re
import json
import random
import pickle
import plotly.graph_objects as go

from google.colab import userdata
from google.colab import files,drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load components

In [3]:
!pip install trl transformers datasets torch peft
!pip install -qU accelerate
!pip install -qU bitsandbytes

In [5]:
from datasets import load_dataset

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, BitsAndBytesConfig, GenerationConfig, pipeline
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model, AutoPeftModelForCausalLM
from trl import SFTTrainer

Load the created dataset

In [None]:
# train_data.json and test_data.json are stored on the mounted Google Drive
data_files = {'train':'/content/gdrive/MyDrive/Syllog/train_data.json',
              'test':'/content/gdrive/MyDrive/Syllog/test_data.json'}
dataset = load_dataset('json',data_files=data_files)


Load the model and configure it to use 4-bit quantization, which is needed because of memory limitations.

In [6]:
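bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store the weights in 4 bits
        bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
        bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the matrix multiplications
        bnb_4bit_use_double_quant=False,        # do not quantize the quantization constants
)

model_name = "mistralai/Mistral-7B-v0.1"
# 4-bit loading is requested through quantization_config only: also passing
# load_in_4bit=True is redundant, and some transformers versions raise an error for it
model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )

# Disable the key/value cache: it only helps at generation time and
# does not work together with gradient checkpointing
model.config.use_cache = False
# Recompute activations during the backward pass instead of storing them, to save memory
model.gradient_checkpointing_enable()
# Llama-style setting: 1 selects the standard (non tensor-parallel) linear-layer computation;
# it is probably a no-op for Mistral
model.config.pretraining_tp = 1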

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
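
To get a feeling for why 4-bit loading helps, here is a back-of-the-envelope estimate of the weight memory. It starts from the size of the bf16 checkpoint, which is stored as two safetensors shards of 9.94 GB and 4.54 GB. The overhead term is an assumption: it models one fp32 scaling constant per block of 64 weights, which is the usual NF4 setup without double quantization. Layers that stay in 16 bits (for example the embeddings), activations, LoRA weights and optimizer state are all ignored, so the real footprint will be higher.

```python
# Rough weight-memory estimate; ignores activations, LoRA weights and optimizer state.
shard_sizes_gb = [9.94, 4.54]          # bf16 safetensors shards of the checkpoint
bf16_gb = sum(shard_sizes_gb)          # weights stored at 2 bytes per parameter
n_params_billion = bf16_gb / 2         # GB / (bytes per parameter) -> billions of parameters

bytes_per_param_4bit = 0.5             # 4 bits per weight
# Assumption: one fp32 scaling constant per block of 64 weights (no double quantization)
overhead_bytes_per_param = 4 / 64

nf4_gb = n_params_billion * (bytes_per_param_4bit + overhead_bytes_per_param)

print(f"parameters:    ~{n_params_billion:.2f}B")
print(f"bf16 weights:  ~{bf16_gb:.2f} GB")
print(f"4-bit weights: ~{nf4_gb:.2f} GB")
```

Under these assumptions the quantized weights take roughly 4 GB instead of about 14.5 GB, which is what makes the model fit on a single Colab GPU.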



Load the tokenizer

In [23]:
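# Set the BOS/EOS options at load time so that they take effect when the tokenizer is built
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side='left',
                                          add_eos_token=False, add_bos_token=False)
# The Mistral tokenizer has no padding token; reuse EOS so that batches can be padded
tokenizer.pad_token = tokenizer.eos_token
# tokenizer.add_bos_token, tokenizer.add_eos_token  # uncomment to inspect the flags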

Before proceeding with the fine-tuning, let's first evaluate the performance of the base model, without any fine-tuning.

In [25]:
# Define the pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer = tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Config that can be passed to model.generate
# CURRENTLY NOT USED
generation_config = GenerationConfig(
    eos_token_id=model.config.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,  # the model has no pad token, so pad with EOS
    max_new_tokens=300,
    num_return_sequences=1,
)

# prompt = "<|prompter|>Answer in maximum four sentences: What are the best ways to reduce Employee turnover?</s><|assistant|>"

prompt = "<s>What are the best ways to reduce Employee turnover?</s>"
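system_message = "<s>You are a useful AI assistant good at giving short and concise answers.</s>"

# ChatML-style template, kept for reference only: it is not passed to the pipeline call, which receives `prompt`
prompt_template=f"""<|im_start|>system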
                    {system_message}<|im_end|>
                    <|im_start|>user
                    {prompt}<|im_end|>
                    <|im_start|>assistant
                    """

# Call the pipeline; the extra keyword arguments are forwarded to model.generate
sequences = pipe(
    prompt,
    max_new_tokens=200,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    decoder_start_token_id=0,
)
print(sequences[0]['generated_text'])

# Use the generate method directly
# inputs = tokenizer(prompt, return_tensors="pt")
# outputs = model.generate(**inputs, generation_config=generation_config)
# print(tokenizer.batch_decode(outputs, skip_special_tokens=False))



A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


й

Employee turnover is a major problem for many businesses. It can be costly and time-consuming to replace employees, and it can also lead to decreased productivity and morale. There are a number of ways to reduce employee turnover, but some of the most effective methods include:

1. Offering competitive salaries and benefits.

2. Providing opportunities for professional development and growth.

3. Creating a positive work environment.

4. Encouraging open communication and feedback.

5. Recognizing and rewarding employees for their hard work and contributions.

6. Offering flexible work arrangements.

7. Providing opportunities for employees to give back to the community.

8. Offering employee assistance programs.

9. Providing opportunities for employees to socialize and build relationships.

10. Offering employee discounts and perks.

11.


In [None]:
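# Freeze the base weights and prepare the quantized model for training
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=16,       # scaling factor: the update is scaled by lora_alpha / r
    lora_dropout=0.1,    # dropout applied on the LoRA branch
    r=64,                # rank of the low-rank update matrices
    bias="none",         # bias terms are not trained
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"]  # attention projections + MLP gate
)
# Wrap the model with trainable LoRA adapters
model = get_peft_model(model, peft_config)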
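
The number of parameters this configuration trains can be estimated by hand. For a frozen linear layer with `d_in` inputs and `d_out` outputs, LoRA adds two matrices of shapes `r x d_in` and `d_out x r`, that is `r * (d_in + d_out)` parameters. The layer dimensions below are assumptions taken from the published Mistral-7B-v0.1 configuration, they are not printed anywhere in this notebook, so treat the result as an estimate.

```python
# Assumed Mistral-7B-v0.1 dimensions (from the published model config, not from this notebook).
hidden_size = 4096
kv_size = 1024            # 8 key/value heads x 128 dims (grouped-query attention)
intermediate_size = 14336
num_layers = 32
r = 64                    # LoRA rank from the LoraConfig

# (input features, output features) of each targeted projection
target_shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_size),
    "v_proj": (hidden_size, kv_size),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
}

def lora_params(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to a frozen d_in x d_out layer."""
    return rank * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in target_shapes.values())
total = per_layer * num_layers
print(f"{per_layer:,} trainable parameters per layer, {total:,} in total")
```

With `lora_alpha=16` and `r=64`, the standard LoRA scaling factor `lora_alpha / r` is 0.25. Calling `model.print_trainable_parameters()` on the PEFT model reports the actual count, which can be compared with this estimate.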

In [None]:
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
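    warmup_ratio=0.03,             # ignored by the "constant" scheduler below
    group_by_length=True,          # batch samples of similar length to reduce padding
    lr_scheduler_type="constant",  # use "constant_with_warmup" for warmup_ratio to take effect
    report_to="wandb"              # requires the wandb package and a logged-in account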
)
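
To see how these settings translate into optimizer steps and checkpoints, here is a small arithmetic sketch. The dataset size of 1,000 examples is purely hypothetical (the real size is not shown in this notebook), and a single GPU is assumed. The formula is an approximation of what the trainer does, good enough for planning.

```python
import math

num_examples = 1_000              # hypothetical; the real dataset size is not shown here
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_train_epochs = 1
num_devices = 1                   # assumption: a single Colab GPU
save_steps = 25

# Number of examples consumed per optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
# One checkpoint is written every save_steps optimizer steps
num_checkpoints = total_steps // save_steps

print(effective_batch, steps_per_epoch, total_steps, num_checkpoints)
```

Since each checkpoint is written to `output_dir`, a larger dataset with `save_steps=25` can fill the Colab disk quickly. Raising `save_steps` is one way to limit that.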

The data still has to be converted into prompts in a format suitable for fine-tuning.

Hence, we define a formatting function and then pass it to the trainer.

In [None]:
def prompt_instruction_format(sample):
  # The continuation lines of the f-string are not indented, so no stray spaces end up in the prompt
  return f"""<s>[INST] Generate an answer to the Input question with the information you have in your memory. If you are not sure about the answer, say so rather than making something up.
### Input:{sample['question']} [/INST]
{sample['answer']}
"""
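
Depending on the TRL version, the trainer may call the formatting function on a whole batch (a dict of lists) rather than on a single sample, and expect a list of strings back. The sketch below shows what a batched variant could look like. The name `prompt_instruction_format_batched` and the sample batch are hypothetical, and whether this variant is needed depends on the installed TRL release.

```python
INSTRUCTION = (
    "Generate an answer to the Input question with the information you have in your memory. "
    "If you are not sure about the answer, say so rather than making something up."
)

def prompt_instruction_format_batched(batch):
    """Hypothetical batched variant: takes a dict of lists, returns a list of prompts."""
    texts = []
    for question, answer in zip(batch["question"], batch["answer"]):
        texts.append(f"<s>[INST] {INSTRUCTION}\n### Input:{question} [/INST]\n{answer}")
    return texts

# Tiny made-up batch in the column-oriented layout used by batched dataset mapping
batch = {
    "question": ["What is LoRA?", "What is NF4?"],
    "answer": ["A low-rank adapter method.", "A 4-bit data type."],
}
formatted = prompt_instruction_format_batched(batch)
print(formatted[0])
```

Printing one formatted example like this is also a cheap way to inspect the exact text the model will be trained on.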

In [None]:
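trainer = SFTTrainer(
    model=model,                     # already wrapped with LoRA adapters by get_peft_model
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    peft_config=peft_config,
    max_seq_length=None,             # fall back to the trainer's default maximum length
    tokenizer=tokenizer,
    # dataset_text_field is omitted: the text comes from formatting_func, and in some
    # TRL versions a text field takes precedence over the formatting function
    formatting_func=prompt_instruction_format,
    args=training_arguments,
    packing=False,                   # one example per sequence, no concatenation
)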

In [None]:
trainer.train()