<a href="https://colab.research.google.com/github/Addaci/marinelives-collaboratory/blob/main/distilabel_on_easy_mode_%2B_llm_sft.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Synthetic data generation & LLM finetuning

In this notebook we'll explore fine-tuning an LLM on synthetic data using distilabel's easy-to-use pipeline API. The `InstructionResponsePipeline` class lets you generate a dataset based on a prompt.

We will generate an SFT dataset and fine-tune a SmolLM2 model. Supervised fine-tuning (SFT) includes instruction tuning, which trains a model on example instruction-response pairs so that it learns to respond in the way those examples define.

## 2. Setup and Installation

Install the necessary libraries.


In [None]:
# Install PyTorch & other libraries
%pip install -qqq torch

# Install Hugging Face libraries
%pip install  --upgrade -qqq \
  "transformers==4.46.3" \
  "datasets==3.1.0" \
  "accelerate==1.1.1" \
  "evaluate==0.4.3" \
  "bitsandbytes==0.44.1" \
  "trl==0.12.1" \
  "peft==0.13.2"

If you are using a GPU with Ampere architecture (e.g. NVIDIA A10G or RTX 4090/3090) or newer, you can use Flash Attention. Flash Attention is a method that reorders the attention computation and leverages classical techniques (tiling, recomputation) to significantly speed it up and reduce memory usage from quadratic to linear in sequence length. The TL;DR: it accelerates training up to 3x. Learn more at [FlashAttention](https://github.com/Dao-AILab/flash-attention/tree/main).
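To make the quadratic-versus-linear claim concrete, here is a back-of-the-envelope sketch. It only counts the bytes needed to hold one full attention score matrix per head in bf16, and compares that with a quantity that grows linearly with sequence length. The function names and the head dimension of 64 are illustrative assumptions, not measurements of the flash-attn kernels.

```python
BYTES_PER_BF16 = 2  # bf16 stores each value in 2 bytes


def naive_score_matrix_bytes(seq_len: int) -> int:
    """Bytes for one full seq_len x seq_len score matrix (per head)."""
    return seq_len * seq_len * BYTES_PER_BF16


def per_token_state_bytes(seq_len: int, head_dim: int = 64) -> int:
    """Bytes for one row of head_dim values per token (grows linearly)."""
    return seq_len * head_dim * BYTES_PER_BF16


for n in (512, 2048, 8192):
    quadratic = naive_score_matrix_bytes(n)
    linear = per_token_state_bytes(n)
    print(f"seq_len={n}: score matrix {quadratic} B, per-token state {linear} B")
```

Doubling the sequence length quadruples the first number but only doubles the second, which is why avoiding the full score matrix matters most for long sequences.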

_Note: If your machine has less than 96GB of RAM and lots of CPU cores, reduce the number of `MAX_JOBS`. On the `g6.2xlarge` we used `4`._

In [None]:
import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'
# install flash-attn
!pip install -qqq ninja packaging
!MAX_JOBS=4 pip install -q flash-attn --no-build-isolation

In [None]:
!pip install -qqq git+https://github.com/argilla-io/distilabel.git@develop
!pip install -qqq huggingface_hub

from huggingface_hub import notebook_login

notebook_login()


_Installing flash attention can take quite a bit of time (10-45 minutes)._

## 3. Create Synthetic Dataset

Now we are going to generate a dataset based on a system prompt that defines the LLM we want to train. In this example we define an LLM that generates product descriptions.

Under the hood, distilabel uses the [Magpie approach](https://distilabel.argilla.io/dev/components-gallery/tasks/magpiegenerator/). If you want to build a more complex pipeline, you should explore the Magpie tasks directly.
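As a rough illustration of the Magpie idea (not distilabel's actual implementation), the sketch below sends an instruction-tuned model a prompt that stops right after the user-turn header, so the model itself writes a plausible user instruction, and then asks for the response in a second call. The Llama 3 style chat markers are an assumption made for illustration, and `fake_complete` is a hypothetical stand-in for a real LLM call that only exists to make the sketch runnable.

```python
from typing import Callable

# Llama 3 style chat markers (assumed here for illustration).
BOS = "<|begin_of_text|>"
HEADER = "<|start_header_id|>{role}<|end_header_id|>\n\n"
EOT = "<|eot_id|>"


def pre_query_prompt(system_prompt: str) -> str:
    """Prompt that ends right after the user header, so the model writes the user turn."""
    return BOS + HEADER.format(role="system") + system_prompt + EOT + HEADER.format(role="user")


def magpie_pair(system_prompt: str, complete: Callable[[str], str]) -> dict:
    """Produce one instruction/response pair using two completion calls."""
    prefix = pre_query_prompt(system_prompt)
    instruction = complete(prefix).strip()
    # Close the user turn and open the assistant turn to obtain the response.
    response_prompt = prefix + instruction + EOT + HEADER.format(role="assistant")
    response = complete(response_prompt).strip()
    return {"instruction": instruction, "response": response}


def fake_complete(prompt: str) -> str:
    """Hypothetical stub that returns canned text instead of calling a model."""
    if prompt.endswith(HEADER.format(role="assistant")):
        return "A compact insulated steel bottle that keeps drinks cold."
    return "Write a description for: steel water bottle, 500 ml, insulated."


pair = magpie_pair("You write succinct product descriptions.", fake_complete)
print(pair["instruction"])
print(pair["response"])
```

Because the instruction is sampled from the model rather than supplied by hand, the system prompt is the main lever for steering what kind of data gets generated.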

In [None]:
from distilabel.pipeline import InstructionResponsePipeline

system_prompt = """You are an e-commerce product description writer. You write succinct product descriptions based on semi-structured text."""

pipeline = InstructionResponsePipeline(system_prompt=system_prompt)

distiset = pipeline.run()




In [None]:
distiset["default"]["train"].to_pandas()

| | instruction | response | model_name | distilabel_metadata |
|---|---|---|---|---|
| 0 | I run a small business, and I're struggling wi... | As a marketing guru, I'd be happy to help you ... | meta-llama/Meta-Llama-3.1-8B-Instruct | {'statistics_magpie_generator_0': {'input_toke... |
| 1 | I'm a new business owner of a small plant nurs... | A pricing strategy in retail indeed refers to ... | meta-llama/Meta-Llama-3.1-8B-Instruct | {'statistics_magpie_generator_0': {'input_toke... |
| 2 | I have a new travel company and I're excited t... | Congratulations on your new travel company. Gi... | meta-llama/Meta-Llama-3.1-8B-Instruct | {'statistics_magpie_generator_0': {'input_toke... |
| 3 | I am the owner of a boutique hotel and you are... | To increase direct bookings on your website an... | meta-llama/Meta-Llama-3.1-8B-Instruct | {'statistics_magpie_generator_0': {'input_toke... |
| 4 | I'm launching a new product: a social media ma... | To determine the most effective strategy for l... | meta-llama/Meta-Llama-3.1-8B-Instruct | {'statistics_magpie_generator_0': {'input_toke... |
| 5 | I've been in the e-commerce industry for 5 yea... | To increase traffic and conversion rates by 30... | meta-llama/Meta-Llama-3.1-8B-Instruct | {'statistics_magpie_generator_0': {'input_toke... |
| 6 | I am the owner of a small food delivery busine... | While hiding the menu item can temporarily dra... | meta-llama/Meta-Llama-3.1-8B-Instruct | {'statistics_magpie_generator_0': {'input_toke... |
| 7 | I'm an entrepreneur, starting my own business ... | Building a marketing plan from scratch starts ... | meta-llama/Meta-Llama-3.1-8B-Instruct | {'statistics_magpie_generator_0': {'input_toke... |
| 8 | I'm a business owner with a relatively small m... | As a marketing guru, I'd be happy to help. Her... | meta-llama/Meta-Llama-3.1-8B-Instruct | {'statistics_magpie_generator_0': {'input_toke... |
| 9 | What are some marketing concepts that could ap... | Creating an app for buying and selling car par... | meta-llama/Meta-Llama-3.1-8B-Instruct | {'statistics_magpie_generator_0': {'input_toke... |


## 3. Prepare the dataset

To use the dataset in TRL, we need to represent the samples as conversations. TRL will then apply the model's chat template to format the messages before passing them to the LLM. We will also save the dataset to disk.
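To give a feel for what "chat format" means, here is a simplified sketch of how a list of messages could be rendered into a single training string. It assumes ChatML-style markers (`<|im_start|>` and `<|im_end|>`); the real rendering is defined by the tokenizer's own chat template and applied with `tokenizer.apply_chat_template`, so treat `render_chatml` as a hypothetical helper for illustration only.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {'role', 'content'} dicts as one ChatML-style string."""
    text = "".join(
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        text += "<|im_start|>assistant\n"
    return text


example = [
    {"role": "system", "content": "You write product descriptions."},
    {"role": "user", "content": "Describe: steel bottle, 500 ml."},
    {"role": "assistant", "content": "A compact 500 ml steel bottle."},
]
print(render_chatml(example))
```

During training all three turns are rendered; at inference time only the system and user turns are rendered, followed by the open assistant turn.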

In [None]:
from datasets import load_dataset

def create_conversation(sample):
  return {
    "messages": [
      {"role": "system", "content": system_prompt},
      {"role": "user", "content": sample["instruction"]},
      {"role": "assistant", "content": sample["response"]}
    ]
  }

# Take the generated train split from the distiset
dataset = distiset["default"]["train"]

# Convert dataset to OAI messages
dataset = dataset.map(create_conversation, remove_columns=dataset.features, batched=False)
# Split the dataset: 90% for training, 10% held out for testing
dataset = dataset.train_test_split(test_size=0.1)

# save datasets to disk
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")


## 4. Fine-tune LLM using `trl` and the `SFTTrainer`



In [None]:
from datasets import load_dataset

# Load jsonl data from disk
dataset = load_dataset("json", data_files="train_dataset.json", split="train")

Next, we will load our LLM. In this example we will use SmolLM2-360M-Instruct. As the name suggests, it is a small LLM with 360M parameters that has been instruction fine-tuned. That makes it a good starting point for further fine-tuning on a specific use case or domain, represented here by our synthetically generated dataset.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import setup_chat_format

# Hugging Face model id
model_id = "HuggingFaceTB/SmolLM2-360M-Instruct"

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = 'right' # to prevent warnings
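For intuition about what 4-bit loading buys, the sketch below estimates the memory needed for the weights alone at different precisions. It is a rough calculation under stated assumptions: the parameter count is the nominal 360M from the model name, and quantization constants, activations, gradients, and optimizer state are all ignored.

```python
def weight_memory_mb(num_params: int, bits_per_param: float) -> float:
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


NUM_PARAMS = 360_000_000  # nominal size taken from the model name (assumption)

for label, bits in (("fp32", 32), ("bf16", 16), ("nf4", 4)):
    print(f"{label}: ~{weight_memory_mb(NUM_PARAMS, bits):.0f} MB")
```

For a model this small the savings are modest in absolute terms; the same arithmetic becomes much more significant for multi-billion-parameter models.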



Now, let's define training in TRL. We can use `peft` for parameter-efficient fine-tuning to reduce the compute load of training. The `SFTTrainer` supports a native integration with `peft`, which makes it easy to efficiently tune LLMs using, e.g., QLoRA. We only need to create our `LoraConfig` and provide it to the trainer. Our `LoraConfig` parameters are based on the [QLoRA paper](https://arxiv.org/pdf/2305.14314.pdf) and Sebastian Raschka's [blog post](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms).
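As a rough guide to what the rank `r` and `lora_alpha` control, the sketch below counts the extra trainable parameters LoRA adds to a single linear layer and computes the scaling factor applied to the adapter update in standard LoRA. The 1024 x 1024 layer size is a hypothetical example, not a dimension taken from SmolLM2.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two matrices per layer: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Standard LoRA scales the update B @ A by lora_alpha / r."""
    return lora_alpha / r


d_in, d_out = 1024, 1024  # hypothetical layer size for illustration
full = d_in * d_out
adapter = lora_trainable_params(d_in, d_out, r=256)

print(f"full layer: {full} params, LoRA adapter: {adapter} params")
print(f"scaling: {lora_scaling(lora_alpha=128, r=256)}")
```

A smaller `r` shrinks the adapter further at the cost of capacity, while the ratio of `lora_alpha` to `r` sets how strongly the adapter update is weighted against the frozen base weights.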

In [None]:
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig

peft_config = LoraConfig(
    lora_alpha=128,
    lora_dropout=0.05,
    r=256,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# SFTConfig extends TrainingArguments with the SFT-specific options
# (max_seq_length, packing, dataset_kwargs), which avoids the deprecation
# warning raised when they are passed to SFTTrainer directly.
args = SFTConfig(
    output_dir="llama-3-2-1b-commerce",
    num_train_epochs=3,
    per_device_train_batch_size=4, # can be reduced to save VRAM
    gradient_accumulation_steps=8, # 4 x 8 = effective batch size of 32 per device
    gradient_checkpointing=True, # gradient checkpointing can save memory
    optim="adamw_torch_fused",
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    push_to_hub=True,
    max_seq_length=2048,
    packing=True,
    dataset_kwargs={
        "add_special_tokens": False,  # We template with special tokens
        "append_concat_token": False, # No need to add additional separator token
    },
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
    processing_class=tokenizer, # replaces the deprecated `tokenizer` argument
)

Start training our model by calling the `train()` method on our `Trainer` instance. This will start the training loop and train our model for 3 epochs. Since we are using a PEFT method, we will only save the adapted model weights and not the full model.

In [None]:
trainer.train()
trainer.save_model()
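To relate the batch settings to the length of a training run, here is a small arithmetic sketch. It uses the `per_device_train_batch_size=4` and `gradient_accumulation_steps=8` values from the training arguments; the single device and the count of 1,000 packed sequences are hypothetical, and the trainer's exact rounding may differ, so read the result as a rough estimate.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of sequences that contribute to one optimizer step."""
    return per_device * grad_accum * num_devices


def approx_optimizer_steps(num_sequences: int, batch: int, epochs: int) -> int:
    """Rough estimate of optimizer steps over the whole run."""
    return math.ceil(num_sequences / batch) * epochs


batch = effective_batch_size(per_device=4, grad_accum=8)
print(f"effective batch size: {batch}")
print(f"approx. steps for 1000 sequences, 3 epochs: {approx_optimizer_steps(1000, batch, 3)}")
```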

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.



In [None]:
del model
del trainer
torch.cuda.empty_cache()

### Merge the fine-tuned LoRA adapter into the original model

When using QLoRA, we only train adapters and not the full model. This means that when saving the model during training, we only save the adapter weights. If you want to save the full model, which makes it easier to use with Text Generation Inference, you can merge the adapter weights into the model weights using the `merge_and_unload` method and then save the model with the `save_pretrained` method. This saves a standard model that can be used for inference without `peft`.
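Conceptually, merging folds the low-rank update into the base weight, so that a single matrix gives the same output as the base layer plus the adapter. The toy sketch below checks this for standard LoRA, where the merged weight is `W + (lora_alpha / r) * B @ A`. It uses tiny hand-written matrices and plain Python lists; all values are made up for illustration and it is not how `peft` implements the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[x * s for x in row] for row in a]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (2 x 2)
B = [[1.0], [0.0]]            # LoRA B matrix (2 x r), with r = 1
A = [[0.5, -1.0]]             # LoRA A matrix (r x 2)
scaling = 2.0 / 1             # lora_alpha / r
x = [[1.0], [2.0]]            # input as a column vector

# Output with the adapter kept separate: W x + scaling * B (A x)
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Output after merging the adapter into the weight
W_merged = add(W, scale(matmul(B, A), scaling))
merged_out = matmul(W_merged, x)

print(adapter_out, merged_out)
```

Because the two outputs agree, the merged model no longer needs the adapter at inference time.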


In [None]:
from peft import AutoPeftModelForCausalLM

# Load PEFT model on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
    args.output_dir,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# Merge LoRA and base model and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained(args.output_dir,safe_serialization=True, max_shard_size="2GB")

## 5. Compare untrained, trained, and synthetic responses

Finally, we can review the results of training by comparing the outputs of the models. We will compare the outputs of the untrained SmolLM2, the trained SmolLM2, and the Llama-3.1-8B model that was used to create the synthetic dataset.



In [None]:
import torch
from transformers import AutoTokenizer, pipeline, AutoModelForCausalLM

model_id = f"./{args.output_dir}"

# Load Model with PEFT adapter
model = AutoModelForCausalLM.from_pretrained(
  model_id,
  device_map="auto",
  torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [None]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

In [None]:
# Load the untrained base model for comparison
untrained_pipe = pipeline("text-generation", model=AutoModelForCausalLM.from_pretrained(
  "HuggingFaceTB/SmolLM2-360M-Instruct",
  device_map="auto",
  torch_dtype=torch.float16,
), tokenizer=AutoTokenizer.from_pretrained(model_id))

Let's load our test dataset and try to generate a response to one of its instructions.

In [None]:
from datasets import load_dataset
from random import randint

# Load our test dataset
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
rand_idx = randint(0, len(eval_dataset) - 1)  # randint includes both endpoints

# Test on sample
prompt = pipe.tokenizer.apply_chat_template(eval_dataset[rand_idx]["messages"][:2], tokenize=False, add_generation_prompt=True)
# Greedy decoding: temperature, top_k and top_p have no effect when do_sample=False
generation_kwargs = dict(max_new_tokens=256, do_sample=False, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)
outputs = pipe(prompt, **generation_kwargs)
untrained_outputs = untrained_pipe(prompt, **generation_kwargs)

print(f"Query:\n{eval_dataset[rand_idx]['messages'][1]['content']}")
print(f"\n\n ## Original Answer:\n{eval_dataset[rand_idx]['messages'][2]['content']}")
print(f"\n\n ## Generated Answer:\n{outputs[0]['generated_text'][len(prompt):].strip()}")
print(f"\n\n ## Untrained Generated Answer:\n{untrained_outputs[0]['generated_text'][len(prompt):].strip()}")

Query:
I'm an entrepreneur, starting my own business and I want to create a marketing plan.  What should I focus on first?

As a marketing guru, I recommend you start with identifying your ideal target audience to create a marketing plan that resonates with them. Based on this, you'll need to create buyer personas, which are semi-fictional representations of your target audience based on market research. This way, you can understand their pain points, interests, behaviors, and motivations, making it easier to tailor your marketing efforts to effectively engage and attract them.

To begin, ask yourself the following questions to gather information about your target audience:

1. What is our business model? (e.g., B2B, B2C, product, service, subscription, etc.)
2. What products or services are we offering? 
3. Who is our potential audience (demographics: age, location, occupation, interests, income, etc.)?
4. What are the key issues or pain points that our customers might experience, whi