 <h1><img src="https://distilabel.argilla.io/latest/assets/logo.svg" alt="drawing" width="20"/> 🐭 a-human-can-create-a-DPO-dataset-with-smaller-models-by-looking-at-it</h1>

## Distilabel & Prometheus on **DPO** datasets

 This is a notebook for creating a DPO dataset from an SFT dataset with open source models. It uses a quantized version of [Prometheus 2](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0), [distilabel](https://distilabel.argilla.io/latest/), and [Argilla](https://argilla.io/).

 Here's a handy blogpost on [DPO](https://argilla.io/blog/mantisnlp-rlhf-part-3/). The difference between a DPO dataset and an SFT dataset is that a DPO dataset contains both a 'chosen' and a 'rejected' response for each prompt, whereas an SFT dataset contains a single response.
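
To make the difference concrete, here is a minimal sketch of the two record shapes as plain Python dictionaries. The field names `prompt`, `chosen`, and `rejected` follow the convention used later in this notebook; the `to_dpo_record` helper and the example texts are illustrative assumptions, not part of any library.

```python
from typing import Dict

# An SFT record pairs an instruction with exactly one response.
sft_record: Dict[str, str] = {
    "instruction": "What is 2 + 2?",
    "response": "2 + 2 = 4.",
}


def to_dpo_record(sft: Dict[str, str], alternative: str, sft_is_better: bool) -> Dict[str, str]:
    """Build a DPO record from an SFT record plus a second response (hypothetical helper)."""
    original = sft["response"]
    # The preferred response becomes `chosen`, the other one `rejected`.
    chosen, rejected = (original, alternative) if sft_is_better else (alternative, original)
    return {"prompt": sft["instruction"], "chosen": chosen, "rejected": rejected}


dpo_record = to_dpo_record(sft_record, alternative="2 + 2 = 5.", sft_is_better=True)
print(dpo_record)
```

In this notebook the second response comes from a text generation model and the decision about which one is better comes from Prometheus, followed by human review.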


 ## 0. Install dependencies

 First, we need to install the dependencies for this notebook: `distilabel`, `argilla`, and `llama-cpp-python`.

In [1]:
!CMAKE_ARGS="-DLLAMA_CUDA=on" pip install -qqq llama-cpp-python
!pip install -qqq pynvml
!pip install -qqq --upgrade "distilabel[llama-cpp] @ git+https://github.com/argilla-io/distilabel.git@develop"
!pip install -qqq huggingface_hub argilla


## 0. Authenticate

Next, we'll need to log in to Hugging Face because we're going to create a Space and pull some models.

In [2]:
from huggingface_hub import notebook_login, hf_hub_download, duplicate_space, whoami

notebook_login()

HF_USERNAME = whoami()["name"]


## 1. Configure the distilabel pipeline

Next you can define the project information and configuration.

> I recommend starting from the default configuration and experimenting from there.

---



In [3]:
# @markdown ---
# @markdown ### 🧹 Define the dataset to clean up:

# @markdown What's the name of your project? This will be used to create datasets and spaces so it should be unique.
PROJECT_NAME = "prometheus-text-generation-project" # @param {type:"string"}

# @markdown What SFT dataset on the hub should we start from:
INPUT_DATASET_REPO_ID = "openbmb/UltraInteract_sft" # @param {type:"string"}

# @markdown What are the instruction and response columns named in the dataset?
RESPONSE_COLUMN_NAME = "response" # @param {type:"string"}
INSTRUCTION_COLUMN_NAME = "instruction" # @param {type:"string"}
EVALUATION_RUBRIC = "factual-validity" # @param ["helpfulness", "harmlessness", "honesty", "factual-validity", "reasoning"]

# @markdown ---

# @markdown 🤗 Model Selection

# @markdown Define the quantized models that you want to use for generation and evaluation. Note this notebook is based on Llama-cpp so you will need gguf files, or to adapt the implementation.

# @markdown **Prometheus Evaluation Model**
PROMETHEUS_MODEL_REPO = "AlekseiPravdin/prometheus-7b-v2_0-gguf" # @param {type:"string"}
PROMETHEUS_MODEL_PATH = "prometheus-7b-v2_0.q2_k.gguf" # @param {type:"string"}

# @markdown **Text Generation Model**
GENERATION_MODEL_REPO = "microsoft/Phi-3-mini-4k-instruct-gguf"# @param {type:"string"}
GENERATION_MODEL_PATH = "Phi-3-mini-4k-instruct-q4.gguf" # @param {type:"string"}

# @markdown You could take advantage of different compute options by using other GGUF files in https://huggingface.co/AlekseiPravdin/prometheus-7b-v2_0-gguf

# @markdown ---

# @markdown 📖 Prometheus Model Configuration

# @markdown Refine the Prometheus configuration based on feedback. Start from the defaults.
TEMPERATURE = 0.7 # @param {type:"slider", min:0, max:1, step:0.1}
MAX_TOKENS = 512 # @param {type:"slider", min:64, max:2048, step:64}
NUM_SAMPLES = 10 # @param {type:"slider", min:5, max:500, step:10}
# @markdown ---

DATASET_REPO_ID = f"{HF_USERNAME}/{PROJECT_NAME}"
SPACE_REPO_ID = f"{HF_USERNAME}/{PROJECT_NAME}-argilla"
ARGILLA_URL = f"https://{HF_USERNAME}-{PROJECT_NAME}-argilla.hf.space"

prometheus_path = hf_hub_download(
            repo_id=PROMETHEUS_MODEL_REPO, filename=PROMETHEUS_MODEL_PATH, repo_type="model"
)
phi3_path = hf_hub_download(
            repo_id=GENERATION_MODEL_REPO, filename=GENERATION_MODEL_PATH, repo_type="model"
)
duplicate_space(
    from_id="argilla/argilla-template-space",
    to_id=SPACE_REPO_ID,
    private=False,
    exist_ok=True,
)

prometheus-7b-v2_0.q2_k.gguf:   0%|          | 0.00/2.72G [00:00<?, ?B/s]

Phi-3-mini-4k-instruct-q4.gguf:   0%|          | 0.00/2.39G [00:00<?, ?B/s]

RepoUrl('https://huggingface.co/spaces/skuma307/prometheus-text-generation-project-argilla', endpoint='https://huggingface.co', repo_type='space', repo_id='skuma307/prometheus-text-generation-project-argilla')
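
The identifiers at the end of the configuration cell are derived from the username and the project name with plain f-strings. Here is a small sketch of that derivation as a function, which can be handy for checking the values before any download or Space duplication happens. `derive_ids` is a hypothetical helper and `my-user` is a placeholder username.

```python
from typing import Dict


def derive_ids(username: str, project_name: str) -> Dict[str, str]:
    """Mirror the f-strings used in the configuration cell (hypothetical helper)."""
    return {
        "dataset_repo_id": f"{username}/{project_name}",
        "space_repo_id": f"{username}/{project_name}-argilla",
        "argilla_url": f"https://{username}-{project_name}-argilla.hf.space",
    }


ids = derive_ids("my-user", "prometheus-text-generation-project")
for key, value in ids.items():
    print(f"{key}: {value}")
```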

## 2. Define human feedback task in Argilla

We will use Argilla to review the dataset that we distil.

In [4]:
import argilla as rg

rg.init(api_url=ARGILLA_URL,api_key="owner.apikey")

dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="instruction"),
        rg.TextField(name="generation"),
    ],
    questions=[
        rg.RatingQuestion(
            name="result",
            description="How would you rate the quality of the answer?",
            values=[1, 2, 3, 4, 5],
        ),
        rg.TextQuestion(
            name="feedback",
            description="Feedback on the quality.",
            required=False,
        ),
    ],
    guidelines="Please, read the question carefully and try to answer it as accurately as possible."
)

dataset.push_to_argilla(name=PROJECT_NAME, workspace="admin")

INFO:argilla.client.feedback.dataset.local.mixins:✓ Dataset succesfully pushed to Argilla
INFO:argilla.client.feedback.dataset.local.mixins:RemoteFeedbackDataset(
   id=887b32a1-de98-4002-ab0b-bc3d021e744c
   name=prometheus-text-generation-project
   workspace=Workspace(id=fa94ed90-0503-4789-88fb-5cfc50cb6847, name=admin, inserted_at=2024-05-29 11:40:08.087356, updated_at=2024-05-29 11:40:08.087356)
   url=https://skuma307-prometheus-text-generation-project-argilla.hf.space/dataset/887b32a1-de98-4002-ab0b-bc3d021e744c/annotation-mode
   fields=[RemoteTextField(id=UUID('04464e4e-58c5-4ad9-8d9c-1027894aebc5'), client=None, name='instruction', title='Instruction', required=True, type='text', use_markdown=False), RemoteTextField(id=UUID('a7397084-96dc-4442-825c-b222f6ea8828'), client=None, name='generation', title='Generation', required=True, type='text', use_markdown=False)]
   questions=[RemoteRatingQuestion(id=UUID('45fa0406-14c3-489c-9c1d-c24bd3019289'), client=None, name='result', ti


## 3. Define distilabel pipeline
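
The cell in this section wires six steps into a directed acyclic graph with `connect`. As a rough illustration of the resulting execution order, the same edges can be fed to the standard library's `graphlib.TopologicalSorter` (Python 3.9+). This only models the shape of the graph; it is not how distilabel schedules steps internally.

```python
from graphlib import TopologicalSorter

# Each key maps to the steps that must finish before it can run,
# copied from the `connect` calls in the pipeline definition.
predecessors = {
    "generate_with_phi3": {"load_dataset"},
    "combine_columns": {"load_dataset", "generate_with_phi3"},
    "prometheus": {"combine_columns"},
    "keep_columns": {"prometheus"},
    "push_to_argilla": {"keep_columns"},
}

order = list(TopologicalSorter(predecessors).static_order())
print(" -> ".join(order))
```

`combine_columns` has two predecessors because it needs both the response already in the dataset and the new Phi-3 generation before Prometheus can compare them.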

In [5]:
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadHubDataset, LoadDataFromDicts, CombineColumns, PreferenceToArgilla, step, StepInput
from distilabel.steps.tasks import PrometheusEval, TextGeneration
from distilabel.llms import LlamaCppLLM

@step(inputs=["instruction", "generations", "feedback", "result", "model_name"], outputs=["instruction", "generations", "feedback", "result", "model_name"])
def DPOToArgilla(inputs: StepInput):
    # Nothing to push: pass the batch through and stop.
    if inputs is None:
        yield inputs
        return

    import argilla as rg

    rg.init(
        api_url=ARGILLA_URL,
        api_key="owner.apikey"
    )

    records = []

    for row in inputs:
        # In relative mode Prometheus reports the preferred generation as "A" or "B".
        # Skip rows without a usable verdict so chosen and rejected never coincide.
        result = row.get("result")
        if result not in ("A", "B"):
            continue
        chosen_index = 1 if result == "B" else 0
        rejected_index = 1 - chosen_index
        record = rg.FeedbackRecord(
            fields={
                "prompt": row["instruction"],
                "chosen": row["generations"][chosen_index],
                "rejected": row["generations"][rejected_index]
            },
            suggestions=[
                {
                    "question_name": "preference",
                    "value": [
                        {"rank": 1, "value": "chosen"},
                        {"rank": 2, "value": "rejected"},
                    ],
                },
                {
                    "question_name": "feedback",
                    "value": row["feedback"]
                }
            ]
        )
        records.append(record)
    # The target dataset must already exist in the `admin` workspace and define the
    # `prompt`, `chosen` and `rejected` fields and the `preference` and `feedback` questions.
    dataset = rg.FeedbackDataset.from_argilla(name="honest_preferences", workspace="admin")
    dataset.add_records(records)
    yield inputs


with Pipeline(name="prometheus") as pipeline:

    load_dataset = LoadHubDataset(
        name="load_dataset",
        repo_id=INPUT_DATASET_REPO_ID,
        split="train",
        batch_size=3,
        num_examples=3,
        output_mappings={RESPONSE_COLUMN_NAME:"generation", INSTRUCTION_COLUMN_NAME:"instruction"}
    )


    generate_with_phi3 = TextGeneration(
        name="generate_with_phi3",
        llm=LlamaCppLLM(
            model_path=phi3_path,
            n_ctx=4092
        )
    )

    combine_columns = CombineColumns(
      name="combine_columns",
      columns=["generation", "model_name"],
      output_columns=["generations", "generation_models"],
    )

    prometheus = PrometheusEval(
        name="prometheus",
        llm=LlamaCppLLM(
            model_path=prometheus_path,
            n_ctx=4092
        ),
        mode="relative",
        rubric=EVALUATION_RUBRIC,
        reference=False,
        num_generations=1,
        group_generations=False,
    )

    keep_columns = KeepColumns(
        name="keep_columns",
        columns=["instruction", "generations", "feedback", "result", "model_name"],
    )

    push_to_argilla = DPOToArgilla(
        name="push_to_argilla"
    )

    load_dataset.connect(combine_columns)
    load_dataset.connect(generate_with_phi3)
    generate_with_phi3.connect(combine_columns)
    combine_columns.connect(prometheus)
    prometheus.connect(keep_columns)
    keep_columns.connect(push_to_argilla)
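
`combine_columns` receives the dataset's own response and the Phi-3 generation for each instruction and merges them into a single `generations` list, which Prometheus then compares in relative mode. The following is a conceptual sketch of that data flow in plain Python, not distilabel's implementation. It assumes the verdict in `result` is the letter `"A"` or `"B"`, as the `DPOToArgilla` step does, and that `"A"` refers to the first entry of `generations`. The names `merge_generations` and `split_preference` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def merge_generations(dataset_row: Dict[str, str], model_row: Dict[str, str]) -> Dict[str, object]:
    """Merge the `generation` of two rows that share one instruction into a list."""
    if dataset_row["instruction"] != model_row["instruction"]:
        raise ValueError("rows must belong to the same instruction")
    return {
        "instruction": dataset_row["instruction"],
        # Assumption: the dataset response comes first, the model response second.
        "generations": [dataset_row["generation"], model_row["generation"]],
    }


def split_preference(generations: List[str], result: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected), or None when the verdict is not 'A' or 'B'."""
    if result not in ("A", "B"):
        return None
    chosen_index = 0 if result == "A" else 1
    return generations[chosen_index], generations[1 - chosen_index]


row = merge_generations(
    {"instruction": "Name a prime number.", "generation": "7 is prime."},
    {"instruction": "Name a prime number.", "generation": "9 is prime."},
)
print(split_preference(row["generations"], "A"))
print(split_preference(row["generations"], None))
```

Returning `None` for a missing or unexpected verdict keeps such rows out of the preference dataset instead of producing a pair whose chosen and rejected responses are identical.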


## 4. Run the distilabel pipeline

In [6]:
distiset = pipeline.run(
    parameters={
        generate_with_phi3.name: {
            "llm": {
                "generation_kwargs": {"max_new_tokens": 1024, "temperature": 0.7}
            }
        },
        prometheus.name: {  # type: ignore
            "llm": {
                "model_path": prometheus_path,
                # Use the Prometheus settings chosen in the configuration cell.
                "generation_kwargs": {
                    "max_new_tokens": MAX_TOKENS,
                    "temperature": TEMPERATURE,
                },
            },
        },
    },
)


Generating train split: 0 examples [00:00, ? examples/s]

In [7]:
distiset["default"]["train"].to_pandas()

|   | instruction | generations | feedback | result | model_name |
|---|-------------|-------------|----------|--------|------------|
| 0 |             |             |          |        |            |
| 1 |             |             |          |        |            |
| 2 |             |             |          |        |            |


## 5. Look at the results (Human Feedback)


You can review your records in the Argilla UI.

In [None]:
print(ARGILLA_URL)
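
Once annotators have confirmed or corrected the suggested ranking, the reviewed records can be turned into DPO rows. Below is a minimal sketch, assuming each reviewed record has been exported to a plain dictionary with the `prompt`, `chosen`, and `rejected` fields plus a hypothetical `human_preference` key holding the name of the field the annotator ranked first. The export itself depends on the Argilla version in use and is not shown.

```python
from typing import Dict, Iterable, List


def to_dpo_rows(reviewed: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Apply the human ranking to each record; skip records without a review."""
    rows = []
    for record in reviewed:
        preferred = record.get("human_preference")
        if preferred not in ("chosen", "rejected"):
            continue  # not reviewed yet
        other = "rejected" if preferred == "chosen" else "chosen"
        rows.append(
            {
                "prompt": record["prompt"],
                "chosen": record[preferred],  # what the human ranked first
                "rejected": record[other],
            }
        )
    return rows


reviewed_records = [
    # The annotator agreed with the model's suggestion.
    {"prompt": "p1", "chosen": "good", "rejected": "bad", "human_preference": "chosen"},
    # The annotator disagreed, so the pair is swapped.
    {"prompt": "p2", "chosen": "worse", "rejected": "better", "human_preference": "rejected"},
    # Not reviewed yet, so it is left out.
    {"prompt": "p3", "chosen": "x", "rejected": "y"},
]
dpo_rows = to_dpo_rows(reviewed_records)
print(dpo_rows)
```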