# Fine-Tuning on the OpenAssistant Dataset

This notebook fine-tunes `meta-llama/Llama-3.2-3B` with 4-bit quantization and LoRA adapters on OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus.

In [1]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb

import warnings

# Suppress all warnings
warnings.filterwarnings('ignore')

## Dataset

In [2]:
from datasets import load_dataset

dataset_name = "OpenAssistant/oasst1"
dataset = load_dataset(dataset_name, split="train")
print(len(dataset))
print(type(dataset[0]))
dataset[0]

84437
<class 'dict'>


{'message_id': '6ab24d72-0181-4594-a9cd-deaf170242fb',
 'parent_id': None,
 'user_id': 'c3fe8c76-fc30-4fa7-b7f8-c492f5967d18',
 'created_date': '2023-02-05T14:23:50.983374+00:00',
 'text': 'Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.',
 'role': 'prompter',
 'lang': 'en',
 'review_count': 3,
 'review_result': True,
 'deleted': False,
 'rank': None,
 'synthetic': False,
 'model_name': None,
 'detoxify': {'toxicity': 0.00044308538781479,
  'severe_toxicity': 3.252684837207198e-05,
  'obscene': 0.00023475120542570949,
  'identity_attack': 0.0001416115992469713,
  'insult': 0.00039489680784754455,
  'threat': 4.075629112776369e-05,
  'sexual_explicit': 2.712695459194947e-05},
 'message_tree_id': '6ab24d72-0181-4594-a9cd-deaf170242fb',
 'tree_state': 'ready_for_export',
 'emojis': {'name': ['+1', '_skip_reply', '_skip_ranking'],
  'count': [10

## Model

In [3]:
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-3.2-3B"

# Read the Hugging Face access token from the environment instead of hardcoding it
hf_token = os.environ.get("HF_TOKEN")

# 4-bit NF4 quantization with float16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    token=hf_token,
)
model.config.use_cache = False  # disable the KV cache for training

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, token=hf_token)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.special_tokens_map

`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 2/2 [00:33<00:00, 16.84s/it]


{'bos_token': '<|begin_of_text|>',
 'eos_token': '<|end_of_text|>',
 'pad_token': '<|end_of_text|>'}

In [4]:
import bitsandbytes as bnb

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # needed for 16 bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

modules = find_all_linear_names(model)
modules

['q_proj', 'v_proj', 'k_proj', 'o_proj', 'down_proj', 'up_proj', 'gate_proj']

In [5]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (up_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=3072, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((3072,), eps=1e

## Preparing Dataset

In [6]:
"""
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 05 Oct 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I'm fine thanks<|eot_id|><|start_header_id|>user<|end_header_id|>

What's your favorite thing to do in London?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Watch a football game.<|eot_id|>
"""

from itertools import groupby
from datasets import Dataset

train_dataset = []

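# Group messages by 'message_tree_id'. itertools.groupby only merges consecutive
# rows, so this relies on each tree's messages being stored contiguously.
for msg_tree, group in groupby(dataset, key=lambda x: x['message_tree_id']):
    last_msg_tree = "<|begin_of_text|>"

    # Append every message of the tree in dataset order. Sibling replies
    # (alternative answers to the same parent) are concatenated as well.
    for row in group: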
        role = 'user' if row['role'] == 'prompter' else 'assistant'
        last_msg_tree += f"<|start_header_id|>{role}<|end_header_id|>\n\n{row['text']}<|eot_id|>"

    train_dataset.append({"text": last_msg_tree})

# Convert to Hugging Face dataset
train_dataset = Dataset.from_list(train_dataset)
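
The loop above concatenates every message in a tree, including sibling replies. As an alternative, a minimal sketch that keeps a single root-to-leaf thread per tree might look like the following. The helper names `best_thread` and `to_llama3_text` and the toy rows are hypothetical; the sketch assumes each row carries `message_id`, `parent_id`, `role`, `text`, and `rank` as in the sample record shown earlier, and that a lower `rank` marks a preferred reply.

```python
from collections import defaultdict


def best_thread(rows):
    """Follow the preferred reply at each level of one message tree."""
    children = defaultdict(list)
    root = None
    for row in rows:
        if row["parent_id"] is None:
            root = row
        else:
            children[row["parent_id"]].append(row)

    thread = []
    node = root
    while node is not None:
        thread.append(node)
        replies = list(children.get(node["message_id"], []))
        # Assumption: lower rank is better; unranked replies (rank None) sort last.
        replies.sort(key=lambda r: (r["rank"] is None, r["rank"] or 0))
        node = replies[0] if replies else None
    return thread


def to_llama3_text(thread):
    """Render a thread in the same chat layout used by the loop above."""
    text = "<|begin_of_text|>"
    for row in thread:
        role = "user" if row["role"] == "prompter" else "assistant"
        text += f"<|start_header_id|>{role}<|end_header_id|>\n\n{row['text']}<|eot_id|>"
    return text


# Toy tree: one prompt with two competing assistant replies.
toy_tree = [
    {"message_id": "a", "parent_id": None, "role": "prompter", "text": "Hi", "rank": None},
    {"message_id": "b", "parent_id": "a", "role": "assistant", "text": "Hello!", "rank": 0},
    {"message_id": "c", "parent_id": "a", "role": "assistant", "text": "Hey.", "rank": 1},
]
thread_text = to_llama3_text(best_thread(toy_tree))
print(thread_text)
```

With this variant each training example alternates strictly between user and assistant turns, at the cost of discarding the lower-ranked replies.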

In [7]:
print(len(train_dataset))
print()
print(train_dataset[0])

9846

{'text': '<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nCan you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2

## Inference before Fine-Tuning

In [8]:
from transformers import (
    pipeline,
)

In [9]:
# Run the text-generation pipeline with the base model (before fine-tuning)
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

""")
print(result[0]['generated_text'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


<|begin_of_text|><|start_header_id|>user<|end_header_id|>

What is a large language model?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

What is a large language model? [followed only by a long run of whitespace characters]

## Fine-Tuning

In [10]:
tokenizer.add_special_tokens({
    "eos_token": "<|eot_id|>"
})
tokenizer.pad_token = tokenizer.eos_token
tokenizer.special_tokens_map

{'bos_token': '<|begin_of_text|>',
 'eos_token': '<|eot_id|>',
 'pad_token': '<|eot_id|>'}

In [11]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules
)
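
To get a feel for how many weights this configuration trains, the arithmetic can be sketched as follows. It assumes the standard LoRA parameterization, in which each adapted linear layer with `in` inputs and `out` outputs gains two low-rank matrices holding `r * (in + out)` parameters in total. The layer shapes and the layer count (28) are taken from the printed model summary above; the variable names are illustrative.

```python
# Layer shapes (in_features, out_features) as printed in the model summary.
shapes = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}
num_layers = 28
rank = 64  # same value as lora_r

# Each adapter adds an (in x r) and an (r x out) matrix.
lora_params = num_layers * sum(rank * (i + o) for i, o in shapes.values())
# Frozen, quantized weights of the layers being adapted.
base_params = num_layers * sum(i * o for i, o in shapes.values())

print(f"LoRA trainable parameters: {lora_params:,}")
print(f"Adapted base weights:      {base_params:,}")
print(f"Trainable fraction:        {lora_params / base_params:.2%}")
```

The embedding, normalization, and `lm_head` weights are left out of both counts, since none of them appear in `target_modules`.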

In [12]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 500
warmup_ratio = 0.03
lr_scheduler_type = "constant"  # plain "constant" ignores warmup_ratio; "constant_with_warmup" would apply it

training_arguments = TrainingArguments(
    report_to="none",
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    gradient_checkpointing=True,
)

In [13]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

Map: 100%|██████████| 9846/9846 [00:03<00:00, 2701.84 examples/s]
max_steps is given, it will override any value given in num_train_epochs


In [14]:
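# Keep normalization layers in float32 for numerical stability.
# nn.Module.to() converts the module in place, so no reassignment is needed.
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module.to(torch.float32)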

In [15]:
trainer.train()

Step,Training Loss
10,1.6744
20,1.5576
30,1.5271
40,1.5849
50,1.947
60,1.5375
70,1.5592
80,1.4539
90,1.5631
100,1.8612


TrainOutput(global_step=500, training_loss=1.5913713779449463, metrics={'train_runtime': 8113.3649, 'train_samples_per_second': 0.986, 'train_steps_per_second': 0.062, 'total_flos': 6.500987999639962e+16, 'train_loss': 1.5913713779449463, 'epoch': 0.8123476848090982})
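
The reported `epoch` and throughput values can be reproduced from the training arguments with a few lines of arithmetic. This is a sketch under two assumptions: training ran on a single device, and the trainer counts one epoch as the number of batches divided by the gradient accumulation steps. The variable names are illustrative.

```python
import math

num_examples = 9846        # size of train_dataset
per_device_batch = 4       # per_device_train_batch_size
grad_accum = 4             # gradient_accumulation_steps
max_steps = 500
train_runtime = 8113.3649  # seconds, from TrainOutput

# Batches per pass over the data; the last batch may be partial.
batches_per_epoch = math.ceil(num_examples / per_device_batch)
# Optimizer updates per epoch (fractional, as assumed above).
updates_per_epoch = batches_per_epoch / grad_accum
epochs = max_steps / updates_per_epoch
samples_seen = max_steps * per_device_batch * grad_accum

print(f"effective batch size: {per_device_batch * grad_accum}")
print(f"epochs completed:     {epochs:.4f}")
print(f"samples per second:   {samples_seen / train_runtime:.3f}")
print(f"steps per second:     {max_steps / train_runtime:.3f}")
```

In other words, 500 steps at an effective batch size of 16 cover a little over four fifths of the 9846 conversations.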

## Inference after Fine-Tuning

In [17]:
# Run the text-generation pipeline with the fine-tuned model
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"""<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

""")
print(result[0]['generated_text'])

<|begin_of_text|><|start_header_id|>user<|end_header_id|>

What is a large language model?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

A large language model is a type of artificial intelligence (AI) model that is trained on vast amounts of text data to understand and generate human language. Large language models are capable of understanding and generating text in a wide range of contexts, including writing, translation, and conversation. They are often used in applications such as chatbots, language translation, and text summarization. The term "large" refers to the size of the model's parameters, which are the weights that the model uses to learn from the training data. Large language models are becoming increasingly important in many fields, from natural language processing to computer vision, and are used in a variety of applications. However, large language models are also controversial, as they can be used for harmful purposes such as generating fake news or creating
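
Because `generated_text` echoes the prompt, it can be convenient to keep only the assistant's reply. The following is a minimal sketch, assuming the prompt layout used in this notebook; the helper name `extract_reply` is hypothetical and the sample string is made up for illustration.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"
END_OF_TURN = "<|eot_id|>"


def extract_reply(generated_text: str) -> str:
    """Return the text of the last assistant turn, without template markers."""
    # Everything after the last assistant header is the model's reply.
    _, _, reply = generated_text.rpartition(ASSISTANT_HEADER)
    # Cut at the end-of-turn marker if the model emitted one.
    reply, _, _ = reply.partition(END_OF_TURN)
    return reply.strip()


# Made-up example in the same layout as the pipeline output.
sample = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "What is a large language model?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "A model trained on text.<|eot_id|>"
)
print(extract_reply(sample))
```

If the assistant header is missing, the helper returns the whole input stripped of surrounding whitespace, so it is safe to call on arbitrary output.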