# Task definition


Fine-tuning an LLM requires a clear understanding of the outcome you want, because that understanding guides both model selection and dataset creation. Not every task needs fine-tuning, however: consider existing fine-tuned models or API solutions before building your own.

This example focuses on using LLMs to generate SQL queries from natural language instructions. The task benefits from fine-tuning because it is complex and requires knowledge of both the data schema and SQL.

# Setup development environment

In [1]:
# Install PyTorch & other libraries
!pip install -U torch tensorboard

# Install Hugging Face libraries
!pip install --upgrade \
  transformers \
  datasets \
  accelerate \
  evaluate \
  bitsandbytes


# install peft & trl from github
!pip install git+https://github.com/huggingface/trl@a3c5b7178ac4f65569975efadc97db2f3749c65e --upgrade
!pip install git+https://github.com/huggingface/peft@4a1559582281fc3c9283892caea8ccef1d6f5a4f --upgrade
!pip install git+https://github.com/huggingface/diffusers


For significant training speedups, consider using Flash Attention if you have a GPU with the Ampere architecture or newer (e.g., NVIDIA A10G, RTX 3090/4090). Flash Attention computes exact attention using tiling and recomputation, so the full attention matrix is never materialized in GPU memory. This reduces attention memory usage from quadratic in sequence length to linear and can accelerate training by up to 3x.
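
To make the quadratic growth concrete, here is a rough back-of-the-envelope sketch of how much memory a fully materialized attention score matrix would need for a single layer and a single sample. The helper name `attention_scores_bytes`, the 16 heads, and the 2 bytes per value (bf16) are illustrative assumptions, not properties of any particular model.

```python
def attention_scores_bytes(seq_len, num_heads=16, bytes_per_value=2):
    """Bytes needed to store one (seq_len x seq_len) score matrix per head.

    num_heads and bytes_per_value are assumed values for illustration.
    """
    return num_heads * seq_len * seq_len * bytes_per_value


for n in (512, 1024, 2048, 4096):
    mib = attention_scores_bytes(n) / 2**20
    print(f"seq_len={n:>5}: {mib:8.1f} MiB")
```

Doubling the sequence length quadruples this number, which is the cost that standard attention pays and that Flash Attention avoids by never storing the full matrix.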

In [13]:
import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'
# install flash-attn
!pip install ninja packaging
!MAX_JOBS=4 pip install flash-attn --no-build-isolation

AssertionError: Hardware not supported for Flash Attention

To keep track of different versions of our model as we train it, we'll leverage the Hugging Face Hub as a remote storage solution. This means the model itself, along with any training logs and relevant information, will be automatically uploaded to the Hub.

To use the Hub, you'll need a Hugging Face account. Once you've created one, we'll use the login function from the huggingface_hub library to log in and store your access token on your local machine.

In [14]:
from getpass import getpass
token = getpass('Enter the secret value: ')

Enter the secret value: ··········


In [15]:

from huggingface_hub import login

login(
  token=token, # token entered via getpass in the previous cell
  add_to_git_credential=True
)


# Dataset preparation

To train our model, we can leverage an existing dataset called sql-create-context. This dataset conveniently includes natural language instructions, schema definitions, and the corresponding SQL queries all in one place.

Even better, recent versions of the trl library recognize popular formats for instruction and conversation datasets. We only need to convert our sql-create-context data into one of these supported formats, and trl handles the rest of the processing. The supported formats include the following (see https://huggingface.co/docs/trl/en/sft_trainer):

- Conversational format
- Instruction format
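
As a rough illustration of the two formats, the sketch below builds one sample of each kind as a plain Python dictionary. The field names (`messages`, `role`, `content`, `prompt`, `completion`) follow the trl documentation linked above; the helper `detect_format` is hypothetical and only exists to show how the two shapes differ.

```python
# Conversational format: one list of role/content messages per sample
conversational_sample = {
    "messages": [
        {"role": "system", "content": "You are a text to SQL query translator."},
        {"role": "user", "content": "Which team was founded in 1970?"},
        {"role": "assistant", "content": 'SELECT team FROM table_name_48 WHERE founded = "1970"'},
    ]
}

# Instruction format: one prompt/completion pair per sample
instruction_sample = {
    "prompt": "Which team was founded in 1970?",
    "completion": 'SELECT team FROM table_name_48 WHERE founded = "1970"',
}


def detect_format(sample):
    """Hypothetical helper: classify a sample by the keys it contains."""
    if "messages" in sample:
        return "conversational"
    if "prompt" in sample and "completion" in sample:
        return "instruction"
    return "unknown"


print(detect_format(conversational_sample))
print(detect_format(instruction_sample))
```

This tutorial uses the conversational format, since the schema fits naturally into a system message.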


In [3]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("b-mc2/sql-create-context", split="train")
dataset = dataset.shuffle().select(range(5000))


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [4]:
print(dataset)
print(dataset["question"][0])
print(dataset["context"][0])
print(dataset["answer"][0])

Dataset({
    features: ['answer', 'question', 'context'],
    num_rows: 5000
})
what's the party with opponent being marcy kaptur (d) 75.3% randy whitman (r) 24.7%
CREATE TABLE table_1341522_38 (party VARCHAR, opponent VARCHAR)
SELECT party FROM table_1341522_38 WHERE opponent = "Marcy Kaptur (D) 75.3% Randy Whitman (R) 24.7%"


In [5]:
# System prompt template; {schema} is filled with each sample's CREATE TABLE statement
system_message = """You are an text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.
SCHEMA:
{schema}"""

def create_conversation(sample):
  return {
    "messages": [
      {"role": "system", "content": system_message.format(schema=sample["context"])},
      {"role": "user", "content": sample["question"]},
      {"role": "assistant", "content": sample["answer"]}
    ]
  }

# Convert every sample to the OAI messages format
dataset = dataset.map(create_conversation, remove_columns=dataset.features, batched=False)
# split the 5,000 samples into 4,000 training samples and 1,000 test samples
dataset = dataset.train_test_split(test_size=1000)

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

In [18]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['messages'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['messages'],
        num_rows: 1000
    })
})


In [19]:
print(dataset["train"][0]["messages"])

[{'content': 'You are an text to SQL query translator. Users will ask you questions in English and you will generate a SQL query based on the provided SCHEMA.\nSCHEMA:\nCREATE TABLE table_name_48 (team VARCHAR, founded VARCHAR)', 'role': 'system'}, {'content': 'Which team was founded in 1970?', 'role': 'user'}, {'content': 'SELECT team FROM table_name_48 WHERE founded = "1970"', 'role': 'assistant'}]


In [6]:
# save datasets to disk
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")

Creating json from Arrow format:   0%|          | 0/4 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

476118

# Fine tuning with trl

It's time to fine-tune our model! We'll use SFTTrainer from the trl library, which makes it easy to run supervised fine-tuning of large language models (LLMs) on specific tasks.

SFTTrainer builds on the Trainer from the transformers library, inheriting features like logging, evaluation, and checkpointing. But SFTTrainer adds some extra superpowers to make our lives easier:

- Understanding different data formats: It can handle data formatted for conversations or instructions, making it versatile.
- Focusing on completions: It can compute the loss only on the completion (the response) instead of on the prompt as well.
- Packing data efficiently: It can concatenate several short samples into one fixed-length sequence, so fewer tokens are wasted on padding and training runs faster.
- Fine-tuning efficiently (PEFT): It supports techniques like QLoRA to train the model with fewer resources.
- Preparing the model for conversations: It can get the model ready for conversational tasks, for example by adding special tokens.
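
To give an intuition for packing, here is a minimal sketch of the general idea: tokenized samples are joined into one stream, separated by an end-of-sequence token, and the stream is cut into blocks of equal length. This is not trl's actual implementation; the function name `pack_sequences` and the toy token ids are assumptions made for illustration.

```python
def pack_sequences(tokenized_samples, max_seq_length, eos_token_id):
    """Concatenate samples (each followed by EOS) and cut into fixed-length blocks."""
    stream = []
    for ids in tokenized_samples:
        stream.extend(ids)
        stream.append(eos_token_id)
    # Keep only complete blocks; the trailing remainder is dropped in this sketch
    n_blocks = len(stream) // max_seq_length
    return [
        stream[i * max_seq_length:(i + 1) * max_seq_length]
        for i in range(n_blocks)
    ]


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # toy token ids
packed = pack_sequences(samples, max_seq_length=4, eos_token_id=0)
print(packed)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Every block is completely filled, so no padding tokens are needed, at the cost of samples sometimes being split across two blocks.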

In [7]:

# Load jsonl data from disk
dataset = load_dataset("json", data_files="train_dataset.json", split="train")

Generating train split: 0 examples [00:00, ? examples/s]

Correctly preparing the LLM and tokenizer is crucial when training chat/conversational models. We need to add new special tokens to the tokenizer and the model, and teach the model to understand the different roles in a conversation. trl provides a convenient method called setup_chat_format, which:

- Adds special tokens to the tokenizer, e.g. <|im_start|> and <|im_end|>, to indicate the start and end of a message in a conversation.
- Resizes the model's embedding layer to accommodate the new tokens.
- Sets the chat_template of the tokenizer, which is used to format the input data into a chat-like format. The default is chatml from OpenAI.
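
To show what ChatML-formatted text looks like, the sketch below renders a list of messages by hand. It is a simplified stand-in for what the tokenizer's chat template produces, not the template itself; the function name `render_chatml` is hypothetical.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render role/content messages as ChatML text (simplified illustration)."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "system", "content": "You are a text to SQL query translator."},
    {"role": "user", "content": "Which team was founded in 1970?"},
]
print(render_chatml(messages, add_generation_prompt=True))
```

In practice you would call `tokenizer.apply_chat_template` rather than building the string yourself, as the evaluation cells later in this notebook do.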

In [16]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import setup_chat_format

# Hugging Face model id
model_id =  "meta-llama/Llama-3.2-3B-Instruct"

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    #attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = 'right' # to prevent warnings

# set chat template to OAI chatML, remove if you start from a fine-tuned model
model, tokenizer = setup_chat_format(model, tokenizer)



OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct.
403 Client Error. (Request ID: Root=1-685ed674-09b691a54a68d2786c13f0f7;557f6f31-5970-4e3e-a82c-78640d8a718a)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct/resolve/main/config.json.
Your request to access model meta-llama/Llama-3.2-3B-Instruct is awaiting a review from the repo authors.

In [8]:
# %pip install --upgrade trl
# %pip install transformers==4.38.2
from transformers.generation.utils import top_k_top_p_filtering

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

In [None]:
print(model)

Gemma2ForCausalLM(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256002, 2304, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x Gemma2DecoderLayer(
        (self_attn): Gemma2SdpaAttention(
          (q_proj): Linear4bit(in_features=2304, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2304, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=2304, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2304, bias=False)
          (rotary_emb): Gemma2RotaryEmbedding()
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear4bit(in_features=2304, out_features=9216, bias=False)
          (up_proj): Linear4bit(in_features=2304, out_features=9216, bias=False)
          (down_proj): Linear4bit(in_features=9216, out_features=2304, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm()
        (post_attention_layernorm): Gemma2RMSNorm

SFTTrainer integrates natively with peft, which makes it easy to tune LLMs efficiently with methods such as QLoRA. We only need to create a LoraConfig and pass it to the trainer. Our LoraConfig parameters follow the QLoRA paper and Sebastian Raschka's blog post.

In [None]:
from peft import LoraConfig

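# LoRA config based on QLoRA paper & Sebastian Raschka experiment
peft_config = LoraConfig(
    lora_alpha=128,               # scaling factor: the low-rank update is multiplied by lora_alpha / r
    lora_dropout=0.05,            # dropout applied to the LoRA layers
    r=64,                         # rank of the low-rank update matrices
    bias="none",                  # do not train bias terms
    target_modules="all-linear",  # adapt all linear layers
    task_type="CAUSAL_LM",
)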
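
To see why this configuration is cheap to train, the sketch below counts the trainable parameters LoRA adds to a single linear layer and compares them with the full weight matrix. The layer shape is the q_proj shape from the `print(model)` output above; the helper `lora_param_count` is a hypothetical name used only for this illustration.

```python
def lora_param_count(in_features, out_features, r):
    """Trainable parameters of one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


in_features, out_features = 2304, 2048  # q_proj shape from the model printout
r, lora_alpha = 64, 128                 # values from the LoraConfig above

full_params = in_features * out_features
lora_params = lora_param_count(in_features, out_features, r)
scaling = lora_alpha / r                # factor applied to the low-rank update

print(f"full weight matrix: {full_params:,} parameters")
print(f"LoRA adapter:       {lora_params:,} parameters ({lora_params / full_params:.1%})")
print(f"scaling factor:     {scaling}")
```

With these values the adapter holds roughly 6% of the parameters of the frozen layer it modifies.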

In [None]:
import bitsandbytes as bnb

def find_all_linear_names(model):
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            names = name.split(".")
            # model-specific
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)
print(find_all_linear_names(model))

['up_proj', 'k_proj', 'v_proj', 'q_proj', 'gate_proj', 'down_proj', 'o_proj']


In [None]:
# hyperparameters
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="7b-text-to-sql",             # directory to save and repository id
    num_train_epochs=3,                     # number of training epochs
    per_device_train_batch_size=1,          # batch size per device during training
    gradient_accumulation_steps=2,          # accumulate gradients over 2 batches before each optimizer update
    # With accumulation, weights are updated once every gradient_accumulation_steps batches instead of after every batch.
    # The effective batch size becomes per_device_train_batch_size * gradient_accumulation_steps (per device).
    # Here that is 1 * 2 = 2; with a batch size of 8 and 2 accumulation steps it would be 16.
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch_fused",              # use fused adamw optimizer
    logging_steps=10,                       # log every 10 steps
    save_strategy="epoch",                  # save checkpoint every epoch
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    bf16=True,                              # use bfloat16 precision
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                      # warmup ratio based on QLoRA paper (the plain "constant" scheduler applies no warmup)
    lr_scheduler_type="constant",           # use constant learning rate scheduler
    push_to_hub=True,                       # push model to hub
    report_to="tensorboard",                # report metrics to tensorboard
)
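
The effect of gradient accumulation can be checked on a toy problem. The sketch below uses a one-parameter model y = w * x with a mean squared error loss and shows that averaging the gradients of two equally sized micro-batches gives the same gradient as one pass over the full batch. The data, the model, and the function name `grad_mse` are made up for illustration; a real trainer does the same bookkeeping on tensors.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]  # toy (x, y) pairs
w = 0.5

# Gradient from one large batch
full_batch_grad = grad_mse(w, data)

# Same gradient from two accumulated micro-batches
accum_steps = 2
micro_batch_size = len(data) // accum_steps
accumulated_grad = 0.0
for step in range(accum_steps):
    micro_batch = data[step * micro_batch_size:(step + 1) * micro_batch_size]
    # Divide by accum_steps so the accumulated sum is an average
    accumulated_grad += grad_mse(w, micro_batch) / accum_steps

print(full_batch_grad, accumulated_grad)
```

This is why a small per-device batch size combined with accumulation can stand in for a larger batch when GPU memory is limited.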

In [None]:
from trl import SFTTrainer

max_seq_length = 500 # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={
        "add_special_tokens": False,  # We template with special tokens
        "append_concat_token": False, # No need to add additional separator token
    }
)


Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
# start training, the model will be automatically saved to the hub and the output directory
trainer.train()

# save model
trainer.save_model()
# Since we are using a PEFT method, only the adapter weights are saved, not the full model.

It is strongly recommended to train Gemma2 models with the `eager` attention implementation instead of `sdpa`. Use `eager` with `AutoModelForCausalLM.from_pretrained('<path-to-checkpoint>', attn_implementation='eager')`.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss |
|------|---------------|
| 10   | 1.4188 |
| 20   | 0.8134 |
| 30   | 0.7508 |
| 40   | 0.7459 |
| 50   | 0.7191 |
| 60   | 0.7322 |
| 70   | 0.704  |
| 80   | 0.6869 |
| 90   | 0.6862 |
| 100  | 0.6335 |


In [None]:
# free the memory again
del model
del trainer
torch.cuda.empty_cache()

Optional: Merge the LoRA adapter into the original model

In [None]:
# from peft import AutoPeftModelForCausalLM

# # Load PEFT model on CPU
# model = AutoPeftModelForCausalLM.from_pretrained(
#     args.output_dir,
#     torch_dtype=torch.float16,
#     low_cpu_mem_usage=True,
# )
# # Merge LoRA and base model and save
# merged_model = model.merge_and_unload()
# merged_model.save_pretrained(args.output_dir,safe_serialization=True, max_shard_size="2GB")
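
Conceptually, merging folds the scaled low-rank update into the base weights, so the merged layer produces the same output as the base layer plus the adapter. The sketch below checks this on tiny hand-written matrices using plain Python lists. All values and helper names (`matmul`, `add_scaled`) are illustrative assumptions; peft performs the equivalent operation on the real weight tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element by element."""
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (2 x 2)
B = [[0.5], [-1.0]]           # LoRA B matrix (out_features x r)
A = [[2.0, 0.0]]              # LoRA A matrix (r x in_features)
lora_alpha, r = 2, 1
scale = lora_alpha / r
x = [[1.0], [1.0]]            # input as a column vector

# Adapter kept separate: W x + scale * B (A x)
adapter_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Adapter merged: (W + scale * B A) x
W_merged = add_scaled(W, matmul(B, A), scale)
merged_out = matmul(W_merged, x)

print(adapter_out, merged_out)  # both are [[5.0], [3.0]]
```

After merging, inference no longer needs the peft wrapper, at the cost of storing a full copy of the model weights.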

# Test and evaluation

In [None]:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, pipeline

peft_model_id = "./7b-text-to-sql"
# peft_model_id = args.output_dir

# Load Model with PEFT adapter
model = AutoPeftModelForCausalLM.from_pretrained(
  peft_model_id,
  device_map="auto",
  torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
# load into pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

In [None]:
from datasets import load_dataset
from random import randint


# Load our test dataset
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
rand_idx = randint(0, len(eval_dataset) - 1)  # randint includes both endpoints

# Test on sample
prompt = pipe.tokenizer.apply_chat_template(eval_dataset[rand_idx]["messages"][:2], tokenize=False, add_generation_prompt=True)
# Greedy decoding: sampling parameters (temperature, top_k, top_p) have no effect when do_sample=False
outputs = pipe(prompt, max_new_tokens=256, do_sample=False, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)

print(f"Query:\n{eval_dataset[rand_idx]['messages'][1]['content']}")
print(f"Original Answer:\n{eval_dataset[rand_idx]['messages'][2]['content']}")
print(f"Generated Answer:\n{outputs[0]['generated_text'][len(prompt):].strip()}")

In [None]:
from tqdm import tqdm


def evaluate(sample):
    prompt = pipe.tokenizer.apply_chat_template(sample["messages"][:2], tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95, eos_token_id=pipe.tokenizer.eos_token_id, pad_token_id=pipe.tokenizer.pad_token_id)
    predicted_answer = outputs[0]['generated_text'][len(prompt):].strip()
    if predicted_answer == sample["messages"][2]["content"]:
        return 1
    else:
        return 0

success_rate = []
number_of_eval_samples = 1000
# iterate over eval dataset and predict
for s in tqdm(eval_dataset.shuffle().select(range(number_of_eval_samples))):
    success_rate.append(evaluate(s))

# compute accuracy
accuracy = sum(success_rate)/len(success_rate)

print(f"Accuracy: {accuracy*100:.2f}%")
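
Exact string matching is a strict metric: a generated query that differs from the reference only in whitespace or a trailing semicolon counts as wrong. As one possible refinement, the sketch below normalizes both strings before comparing them. The helpers `normalize_sql` and `is_match` are hypothetical and deliberately conservative; they do not touch letter case or quoting, because changing those could alter string literals inside the query.

```python
import re


def normalize_sql(query):
    """Strip surrounding whitespace and a trailing semicolon, collapse whitespace runs."""
    query = query.strip().rstrip(";").strip()
    return re.sub(r"\s+", " ", query)


def is_match(predicted, reference):
    """Compare two SQL strings after light normalization."""
    return normalize_sql(predicted) == normalize_sql(reference)


reference = 'SELECT team FROM table_name_48 WHERE founded = "1970"'
predicted = 'SELECT team  FROM table_name_48\nWHERE founded = "1970";'
print(is_match(predicted, reference))  # True
```

A stricter evaluation would execute both queries against a database and compare the results, which this sketch does not attempt.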