<a href="https://colab.research.google.com/github/CrashingGuru/FGAN-Build-a-thon/blob/main/Notebooks2023/Argilla_Supervised_Finetuning-v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Created: 5 Feb 2024  

Aaron, Evangel, Frank, Othniel, Kennedy, Victor, Vishnu.

Modification History: Created to set up the ArgillaTrainer for Supervised Finetuning.
Refer to https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/feedback/training-llm-mistral-sft.html for examples.

Description:

This notebook pulls the annotated dataset from a remote Argilla server and uses it for Supervised Finetuning.

Prerequisites:

The following notebooks and steps have already been run:

0. Create the raw dataset in the HF Hub.

1. Manually create an HF Spaces deployment of Argilla.

2. Configure the Argilla dataset.

3. Add records to the Argilla dataset from the raw dataset.

4. Annotate the dataset in the UI.

Finally, this notebook pulls the dataset from Argilla and applies the annotated data for Supervised Finetuning.


## Install Libraries

Install the latest version of Argilla in Colab, along with other libraries and models used in this notebook.

In [4]:
!pip install argilla datasets



Prerequisites

Deploy Argilla Server on [HF Spaces](https://huggingface.co/new-space?template=argilla/argilla-template-space).


More info on Installation [here](../getting_started/installation/deployments/deployments.html).

## Secrets needed




* `ARGILLA_API_URL`: It is the url of the Argilla Server.
  * If you're using HF Spaces, it is constructed as `https://[your-owner-name]-[your_space_name].hf.space`.
* `ARGILLA_API_KEY`: It is the API key of the Argilla Server. It is `owner` by default.
* `HF_TOKEN`: It is the Hugging Face API token. It is only needed if you're using a [private HF Space](https://docs.argilla.io/en/latest/getting_started/installation/deployments/huggingface-spaces.html#deploy-argilla-on-spaces). You can configure it in your profile: [Setting > Access Tokens](https://huggingface.co/settings/tokens).
* `workspace`: admin
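
The Space URL pattern in the list above can be sketched as a small helper. The function name `hf_space_url` and the owner and Space names are hypothetical placeholders; the helper only fills in the `https://[your-owner-name]-[your_space_name].hf.space` template literally and does not check that the Space exists.

```python
def hf_space_url(owner: str, space_name: str) -> str:
    """Fill in the HF Spaces URL template (hypothetical helper)."""
    return f"https://{owner}-{space_name}.hf.space"


# Placeholder names; replace them with your own owner and Space name.
api_url_example = hf_space_url("my-owner", "my-argilla-space")
print(api_url_example)
```

The resulting string is what would be stored as the `ARGILLA_API_URL` secret.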


In [5]:
import argilla as rg
from argilla._constants import DEFAULT_API_KEY

In [6]:
from google.colab import userdata
import argilla as rg

# Read the Argilla server URL and API key from Colab secrets.
api_url = userdata.get('my_argilla_url')
api_key = userdata.get('my_argilla_key')

rg.init(api_url=api_url, api_key=api_key)




# # If you want to use your private HF Space
# rg.init(extra_headers={"Authorization": f"Bearer {hf_token}"})

This may lead to potential compatibility issues during your experience.
To ensure a seamless and optimized connection, we highly recommend aligning your client version with the server version.


In [7]:
hf_token = userdata.get('my_hf_write_token')
from huggingface_hub import login
login(hf_token)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [7]:
!pip install accelerate

Collecting accelerate
  Downloading accelerate-0.26.1-py3-none-any.whl (270 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.26.1


In [8]:
# Reference: https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/feedback/training-llm-mistral-sft.html

# Pull the annotated dataset from the Argilla server.
dataset = rg.FeedbackDataset.from_argilla(name="fgan-annotate-dataset", workspace="admin")


In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id


In [2]:
from typing import Any, Dict, Optional
from argilla.feedback import TrainingTask

# Set to True to skip records that have no annotation, or that were marked "Bad" or discarded.
ANNOTATED_ONLY = False

def formatting_func(sample: Dict[str, Any]) -> Optional[str]:
    if ANNOTATED_ONLY:
        # Discard if there are no annotations...
        if not sample["quality"]:
            return

        # or if it is annotated as "Bad" or discarded.
        first_annotation = sample["quality"][0]
        if first_annotation["value"] == "Bad" or first_annotation["status"] == "discarded":
            return

    # Filter out responses that are likely low quality
    if len(sample["response"]) <= 2:
        return

    # Add </s><s> between all prompt-response pairs
    prompt = sample["prompt"]
    prompt = prompt.replace("<human>:", f"{tokenizer.eos_token}{tokenizer.bos_token}<human>:")
    prompt = prompt[prompt.find("<human>:"):]
    # Add response and optionally the background to the full text.
    output = prompt + " " + sample["response"]
    if sample["background"]:
        output = sample["background"] + " " + output
    output = output + tokenizer.eos_token
    # We expect one less <s> than </s>, because the Mistral tokenizer will automatically add the BOS
    # at the start of the text when this text is tokenized. When that's done, the format will be exactly
    # what we want
    assert output.count("<s>") + 1 == output.count("</s>")
    return output

task = TrainingTask.for_supervised_fine_tuning(formatting_func)
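
To make the `</s><s>` handling above concrete, here is a minimal, self-contained sketch that mirrors the string manipulation in `formatting_func` without loading the tokenizer. The constants `BOS` and `EOS`, the helper name `format_sample`, and the sample text are assumptions made for illustration; the token strings match the `bos_token` and `eos_token` values printed later in this notebook.

```python
BOS, EOS = "<s>", "</s>"  # assumed to match the Mistral tokenizer's special tokens


def format_sample(background: str, prompt: str, response: str) -> str:
    """Mirror of the formatting logic above, using plain strings (hypothetical helper)."""
    # Separate every turn with </s><s>, then drop the separator before the first turn.
    prompt = prompt.replace("<human>:", f"{EOS}{BOS}<human>:")
    prompt = prompt[prompt.find("<human>:"):]
    output = prompt + " " + response
    if background:
        output = background + " " + output
    return output + EOS


# Made-up two-turn conversation, for illustration only.
text = format_sample(
    "Background: AN.",
    "<human>: hi <bot>: hello <human>: what is AN? <bot>:",
    "autonomous networks",
)
print(text)
```

With two turns, the text contains one `<s>` and two `</s>`, which is the invariant the `assert` in `formatting_func` checks.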

In [9]:
formatted_dataset = dataset.prepare_for_training(framework="trl", task=task)
formatted_dataset



Dataset({
    features: ['id', 'text'],
    num_rows: 98
})

In [10]:
print(formatted_dataset[80]["text"])

Background: Testbed for 5G Connected Artificial Intelligence on 
Virtualized Networks <human>: what is a use case for deployment of AI (artificial intelligence) in testbeds for AN (autonomous networks)? <bot>: testbed setup for a 5G mobile network with a virtualized and orchestrated structure, using containers, which focuses on integration to artificial intelligence (AI) applications is an example of a use case for deployment of AI (artificial intelligence) in testbeds for AN (autonomous networks).</s>


In [11]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
data_collator([tokenizer(formatted_dataset[0]["text"])])

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': tensor([[    1, 24316, 28747,  8862, 28779,   659,  4775,  5938,  4469,   354,
          3745,  6196,   607,  9488, 28713, 28723,  8862, 28779,  3232,  2071,
           356, 25009,   607, 12167,  7193,   272,   938,  4469,   304,  8862,
         28779, 28733, 28738,   318, 28777, 28740, 28770,   659,  4775,  1287,
           938,  4469,   356, 25009,   607, 12167, 28723,  2957,   938,  4469,
           460, 20577,  2458,   778,   989,  2191, 13187, 10085,   356,  3161,
           590,   460,  5202,   298,  4993,   302, 25009,   607, 12167,   442,
          5202,   298,   272,  6421, 16582,   302, 25009,   607, 12167, 28723,
           523, 18529,  9670,   693,  2034, 28714,  8137,   938,  4469,   354,
         25009,   607, 12167, 28804,   523, 10093,  9670,  8862, 28779,  2034,
         28714,  8137,   938,  4469,   354, 25009,   607, 12167,  2818,   356,
           272,   771,   302, 28705,  8862, 28779,  3232,  2071,   356, 25009,
           607, 12167,   304,  8862, 2

In [12]:
from transformers import DataCollatorForSeq2Seq, BatchEncoding

class DataCollatorForSeq2SeqCopyLabels(DataCollatorForSeq2Seq):
    def __call__(self, features, return_tensors=None) -> BatchEncoding:
        for feature in features:
            if "labels" not in feature:
                feature["labels"] = feature["input_ids"].copy()
        return super().__call__(features, return_tensors=return_tensors)
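
A likely motivation for this custom collator: the tokenizer reuses EOS as its padding token, and `DataCollatorForLanguageModeling` replaces every label equal to the pad token id with -100, so the real end-of-sequence token would be ignored by the loss as well. The sketch below imitates both labeling strategies in plain Python. It is an illustration under that assumption, not the collators' actual implementation; the helper names and token ids are hypothetical.

```python
PAD_ID = EOS_ID = 2   # pad token reuses EOS, as configured for the tokenizer above
IGNORE_INDEX = -100   # label value skipped by the cross-entropy loss


def lm_style_labels(input_ids):
    """Mask every pad-id token, which also masks the genuine EOS."""
    return [IGNORE_INDEX if token == PAD_ID else token for token in input_ids]


def copy_style_labels(input_ids, padded_length):
    """Copy the labels first, then left-pad only the extra positions with -100."""
    labels = list(input_ids)
    return [IGNORE_INDEX] * (padded_length - len(labels)) + labels


ids = [1, 10, 11, EOS_ID]          # made-up sequence: BOS, two tokens, EOS
print(lm_style_labels(ids))        # EOS label is lost
print(copy_style_labels(ids, 6))   # EOS label is kept
```

Keeping the EOS label means the model is still trained to end its answer.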

In [13]:
data_collator = DataCollatorForSeq2SeqCopyLabels(tokenizer)
data_collator([tokenizer(formatted_dataset[0]["text"])])

{'input_ids': tensor([[    1, 24316, 28747,  8862, 28779,   659,  4775,  5938,  4469,   354,
          3745,  6196,   607,  9488, 28713, 28723,  8862, 28779,  3232,  2071,
           356, 25009,   607, 12167,  7193,   272,   938,  4469,   304,  8862,
         28779, 28733, 28738,   318, 28777, 28740, 28770,   659,  4775,  1287,
           938,  4469,   356, 25009,   607, 12167, 28723,  2957,   938,  4469,
           460, 20577,  2458,   778,   989,  2191, 13187, 10085,   356,  3161,
           590,   460,  5202,   298,  4993,   302, 25009,   607, 12167,   442,
          5202,   298,   272,  6421, 16582,   302, 25009,   607, 12167, 28723,
           523, 18529,  9670,   693,  2034, 28714,  8137,   938,  4469,   354,
         25009,   607, 12167, 28804,   523, 10093,  9670,  8862, 28779,  2034,
         28714,  8137,   938,  4469,   354, 25009,   607, 12167,  2818,   356,
           272,   771,   302, 28705,  8862, 28779,  3232,  2071,   356, 25009,
           607, 12167,   304,  8862, 2

In [14]:
from typing import Optional
import torch
from transformers import TrainerCallback, TrainerControl, TrainerState, GenerationConfig, TrainingArguments, PreTrainedModel, PreTrainedTokenizer


class GenerationCallback(TrainerCallback):
    def __init__(self, prompt: str) -> None:
        super().__init__()
        self.prompt = prompt

    def on_evaluate(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, model: Optional[PreTrainedModel] = None, tokenizer: Optional[PreTrainedTokenizer] = None, **kwargs):
        # Tokenize the prompt and send it to the right device
        inputs = tokenizer(self.prompt, return_tensors="pt").to(model.device)

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                generation_config=GenerationConfig(
                    max_new_tokens=50,
                    pad_token_id=tokenizer.pad_token_id,
                    eos_token_id=tokenizer.eos_token_id,
                ),
            )
            print(tokenizer.batch_decode(outputs, skip_special_tokens=False)[0])


generation_callback = GenerationCallback("<human>: what are autonomous networks? <bot>:")

In [16]:
!pip install "trl>=0.5.0"

In [17]:
from argilla.feedback import ArgillaTrainer

trainer = ArgillaTrainer(
    dataset=dataset,
    model=model,
    tokenizer=tokenizer,
    task=task,
    framework="trl",
    train_size=0.99,
)

INFO:ArgillaTrainer:            ArgillaBaseTrainer info:
            _________________________________________________________________
            These baseline params are fixed:
                dataset: RemoteFeedbackDataset(
   id=9ffba1ef-487a-4cf4-b8b5-232250bec9e4
   name=fgan-annotate-dataset
   workspace=Workspace(id=1a10cae7-24f3-46e9-8363-5c200c65535f, name=admin, inserted_at=2024-02-05 15:07:36.991913, updated_at=2024-02-05 15:07:36.991913)
   url=https://vishnuramov-itu-t-build-a-thon.hf.space/dataset/9ffba1ef-487a-4cf4-b8b5-232250bec9e4/annotation-mode
   fields=[RemoteTextField(id=UUID('0ebfe0c0-6f1e-465c-a75c-e7dc67333d9e'), client=None, name='background', title='Background', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('ee68e515-a38b-434c-b871-f97dbdbd0858'), client=None, name='prompt', title='Prompt', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('2e7bc7bd-b3ff-4d2b-b08f-cfdb6b4104c8'), client=None, name='response'

In [19]:
!pip install peft

Collecting peft
  Downloading peft-0.8.2-py3-none-any.whl (183 kB)
Installing collected packages: peft
Successfully installed peft-0.8.2


In [20]:
from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
trainer.update_config(
    data_collator=data_collator,
    callbacks=[generation_callback],
    peft_config=peft_config,
    max_seq_length=1024,
)

INFO:ArgillaTrainer:Updated parameters:
_________________________________________________________________
'SFTTrainer'
data_collator: DataCollatorForSeq2SeqCopyLabels(tokenizer=LlamaTokenizerFast(name_or_path='mistralai/Mistral-7B-v0.1', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}, model=None, padding=True, max_length=None, pad_to_multiple_o
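
As a rough worked example of what the `LoraConfig` above implies, the following sketch counts the adapter weights added to `q_proj` and `v_proj`. The model dimensions (hidden size 4096, 32 layers, a 1024-wide `v_proj` output) are assumptions about Mistral-7B-v0.1 that are not stated in this notebook; check them against the model's `config.json` before relying on the numbers.

```python
# Assumed Mistral-7B-v0.1 dimensions (not stated in this notebook).
HIDDEN_SIZE = 4096   # input width of q_proj and v_proj, output width of q_proj
KV_DIM = 1024        # output width of v_proj (grouped-query attention)
NUM_LAYERS = 32

R = 8                # LoRA rank, from the config above
LORA_ALPHA = 16      # LoRA alpha, from the config above


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Weights in one adapter: A is (r x in_features), B is (out_features x r)."""
    return r * (in_features + out_features)


per_layer = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, R) + lora_params(HIDDEN_SIZE, KV_DIM, R)
total_trainable = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(per_layer, total_trainable, scaling)
```

Under these assumptions, only a few million weights are trained, a small fraction of the 7B base model.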

In [21]:
trainer.update_config(
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,
    eval_accumulation_steps=16,
    # A positive max_steps takes precedence over num_train_epochs.
    max_steps=3000,
    logging_steps=50,
    learning_rate=5e-5,
    save_strategy="no",
    evaluation_strategy="steps",
    eval_steps=500,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    remove_unused_columns=False,
    fp16=True,
    num_train_epochs=1,  # has no effect while max_steps is set
)

INFO:ArgillaTrainer:Updated parameters:
_________________________________________________________________
'SFTTrainer'
data_collator: DataCollatorForSeq2SeqCopyLabels(tokenizer=LlamaTokenizerFast(name_or_path='mistralai/Mistral-7B-v0.1', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s
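
The schedule implied by these training arguments can be estimated with some quick arithmetic. This is a sketch under stated assumptions: roughly 97 training rows (the 98 formatted rows with `train_size=0.99`), a single device, and no gradient accumulation. The exact split size depends on how the trainer partitions the data.

```python
import math

NUM_TRAIN_ROWS = 97   # assumption: about 99% of the 98 formatted rows
BATCH_SIZE = 3        # per_device_train_batch_size, single device assumed
MAX_STEPS = 3000
WARMUP_RATIO = 0.1
EVAL_STEPS = 500

steps_per_epoch = math.ceil(NUM_TRAIN_ROWS / BATCH_SIZE)
approx_epochs = MAX_STEPS / steps_per_epoch           # passes over the training data
warmup_steps = math.ceil(MAX_STEPS * WARMUP_RATIO)    # steps of learning-rate warmup
num_evaluations = MAX_STEPS // EVAL_STEPS             # times the generation callback fires

print(steps_per_epoch, round(approx_epochs, 1), warmup_steps, num_evaluations)
```

On such a small dataset, 3000 steps corresponds to many passes over the same examples, so a much smaller `max_steps` may be worth considering to limit overfitting.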

In [23]:
trainer.train("Mistral-7B-v0.1-chat-OIG-3k")

