 <h1><img src="https://distilabel.argilla.io/latest/assets/logo.svg" alt="drawing" width="20"/> 🐭 a-human-can-improve-an-SFT-dataset-with-smaller-models-by-looking-at-it</h1>

 ## Distilabel & Prometheus on **SFT** datasets

 This is a notebook for distilling supervised fine-tuning (SFT) datasets with open-source models. It uses a quantized version of [Prometheus 2](https://huggingface.co/prometheus-eval/prometheus-7b-v2.0), [distilabel](https://distilabel.argilla.io/latest/), and [Argilla](https://argilla.io/).

 ## 0. Install dependencies

 First, we need to install the dependencies for this notebook: `distilabel`, `argilla`, and `llama-cpp-python`.

In [None]:
!CMAKE_ARGS="-DLLAMA_CUDA=on" pip install -qqq llama-cpp-python
!pip install -qqq pynvml
!pip install -qqq --upgrade "distilabel[llama-cpp] @ git+https://github.com/argilla-io/distilabel.git@develop"
!pip install -qqq huggingface_hub argilla
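
Because the installs run with `-qqq`, a failed build (for example of `llama-cpp-python`) is easy to miss. The following is a minimal sketch for checking what actually got installed, using only the standard library. The helper name `installed_versions` is hypothetical and not part of any of the libraries above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as used in the pip commands above.
required = ["llama-cpp-python", "distilabel", "argilla", "huggingface-hub", "pynvml"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any entry prints `NOT INSTALLED`, rerunning the matching `pip` command without `-qqq` should show the underlying error.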

## 0. Authenticate

Next, we'll need to log in to Hugging Face, because we're going to create a Space and download some models.

In [None]:
from huggingface_hub import notebook_login, hf_hub_download, duplicate_space, whoami

notebook_login()

HF_USERNAME = whoami()["name"]


## 1. Configure the distilabel pipeline

Next you can define the project information and configuration.

> I recommend starting from the default configuration and experimenting from there.

---



In [None]:
# @markdown ---
# @markdown ### 🧹 Define the dataset to cleanup:

# @markdown What's the name of your project? This will be used to create datasets and spaces so it should be unique.
PROJECT_NAME = "prometheus-text-generation" # @param {type:"string"}

# @markdown What SFT dataset on the hub should we start from:
INPUT_DATASET_REPO_ID = "openbmb/UltraInteract_sft" # @param {type:"string"}

# @markdown What are the instruction and response columns named in the dataset?
RESPONSE_COLUMN_NAME = "response" # @param {type:"string"}
INSTRUCTION_COLUMN_NAME = "instruction" # @param {type:"string"}
EVALUATION_RUBRIC = "factual-validity" # @param ["helpfulness", "harmlessness", "honesty", "factual-validity", "reasoning"]

# @markdown ---

# @markdown 🤗 Model Selection

# @markdown You could take advantage of different compute options by using other GGUF files in https://huggingface.co/AlekseiPravdin/prometheus-7b-v2_0-gguf

QUANTIZED_PROMETHEUS_MODEL_REPO = "AlekseiPravdin/prometheus-7b-v2_0-gguf" # @param {type:"string"}
QUANTIZED_PROMETHEUS_MODEL_PATH = "prometheus-7b-v2_0.q2_k.gguf" # @param {type:"string"}

# @markdown ---

# @markdown 📖 Prometheus Model Configuration

# @markdown Refine the Prometheus configuration based on feedback. Start from the defaults.
TEMPERATURE = 0.7 # @param {type:"slider", min:0, max:1, step:0.1}
MAX_TOKENS = 512 # @param {type:"slider", min:64, max:2048, step:64}
NUM_SAMPLES = 10 # @param {type:"slider", min:5, max:500, step:10}
# @markdown ---

DATASET_REPO_ID = f"{HF_USERNAME}/{PROJECT_NAME}"
SPACE_REPO_ID = f"{HF_USERNAME}/{PROJECT_NAME}-argilla"
ARGILLA_URL = f"https://{HF_USERNAME}-{PROJECT_NAME}-argilla.hf.space"

prometheus_path = hf_hub_download(
    repo_id=QUANTIZED_PROMETHEUS_MODEL_REPO,
    filename=QUANTIZED_PROMETHEUS_MODEL_PATH,
    repo_type="model",
)

duplicate_space(
    from_id="argilla/argilla-template-space",
    to_id=SPACE_REPO_ID,
    private=False,
    exist_ok=True,
)
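
The three derived identifiers in the cell above are plain string templates. As an illustration, the sketch below reproduces them in a small helper so they can be inspected before anything is created. `project_ids` is a hypothetical name and `alice` is a made-up username. It assumes, as the cell above does, that the Space subdomain is simply `<username>-<space name>`; names containing uppercase letters, underscores, or dots may be normalized differently by the Hub.

```python
def project_ids(username: str, project_name: str) -> dict:
    """Build the dataset repo id, Space repo id, and Argilla URL for a project."""
    space_name = f"{project_name}-argilla"
    return {
        "dataset_repo_id": f"{username}/{project_name}",
        "space_repo_id": f"{username}/{space_name}",
        # Assumption: the Space is served at https://<username>-<space name>.hf.space
        "argilla_url": f"https://{username}-{space_name}.hf.space",
    }


ids = project_ids("alice", "prometheus-text-generation")
print(ids["argilla_url"])
# https://alice-prometheus-text-generation-argilla.hf.space
```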


## 2. Define human feedback task in Argilla

We will use Argilla to review the dataset that we distil.

In [None]:
import argilla as rg

rg.init(api_url=ARGILLA_URL, api_key="owner.apikey")

dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="instruction"),
        rg.TextField(name="generation"),
    ],
    questions=[
        rg.RatingQuestion(
            name="result",
            description="How would you rate the quality of the answer?",
            values=[1, 2, 3, 4, 5],
        ),
        rg.TextQuestion(
            name="feedback",
            description="Feedback on the quality.",
            required=False,
        ),
    ],
    guidelines="Please, read the question carefully and try to answer it as accurately as possible."
)

dataset.push_to_argilla(name=PROJECT_NAME, workspace="admin")

## 3. Define distilabel pipeline

In [None]:
from distilabel.steps import StepInput, step


@step(
    inputs=["instruction", "generation", "feedback", "result", "model_name"],
    outputs=["instruction", "generation", "feedback", "result", "model_name"],
)
def GenerationToArgilla(inputs: StepInput):
    # Nothing to push: pass the batch through and stop.
    if not inputs:
        yield inputs
        return

    import argilla as rg

    rg.init(api_url=ARGILLA_URL, api_key="owner.apikey")

    records = []
    for row in inputs:
        # Skip rows that Prometheus could not score.
        if row.get("result") is None:
            continue
        records.append(
            rg.FeedbackRecord(
                fields={
                    "instruction": row["instruction"],
                    "generation": row["generation"],
                },
                # The Prometheus outputs pre-fill the questions as suggestions.
                suggestions=[
                    {"question_name": "result", "value": row["result"]},
                    {"question_name": "feedback", "value": row["feedback"]},
                ],
            )
        )

    dataset = rg.FeedbackDataset.from_argilla(name=PROJECT_NAME, workspace="admin")
    dataset.add_records(records)
    # The step is a generator, so the batch must be yielded rather than returned.
    yield inputs
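
The step above skips rows whose `result` is missing, but a score outside the `RatingQuestion` values (1 to 5) may still be rejected when the suggestion is added. One possible guard is sketched below, assuming `result` arrives as an int or a numeric string. `to_rating` and `VALID_RATINGS` are hypothetical names, not part of distilabel or Argilla.

```python
from typing import Optional

# Mirrors the values of the "result" RatingQuestion defined earlier.
VALID_RATINGS = {1, 2, 3, 4, 5}


def to_rating(result) -> Optional[int]:
    """Return `result` as an int in 1..5, or None if it cannot be used."""
    try:
        rating = int(result)
    except (TypeError, ValueError):
        # None, free text, or any other non-numeric value.
        return None
    return rating if rating in VALID_RATINGS else None


for raw in [4, "3", None, 7, "n/a"]:
    print(repr(raw), "->", to_rating(raw))
```

Inside the step, a row could then be skipped whenever `to_rating(row.get("result"))` returns `None`.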

In [None]:
from distilabel.pipeline import Pipeline
# Only the steps and tasks used in this pipeline are imported.
from distilabel.steps import KeepColumns, LoadHubDataset, step, StepInput
from distilabel.steps.tasks import PrometheusEval
from distilabel.llms import LlamaCppLLM

with Pipeline(name="prometheus") as pipeline:

    load_dataset = LoadHubDataset(
        name="load_dataset",
        repo_id=INPUT_DATASET_REPO_ID,
        split="train",
        batch_size=5,
        num_examples=NUM_SAMPLES,
        output_mappings={RESPONSE_COLUMN_NAME:"generation", INSTRUCTION_COLUMN_NAME:"instruction"}
    )

    prometheus = PrometheusEval(
        name="prometheus",
        llm=LlamaCppLLM(
            model_path=prometheus_path,
            n_ctx=4092
        ),
        mode="absolute",
        rubric=EVALUATION_RUBRIC,  # use the rubric selected in the configuration cell
        reference=False,
        num_generations=1,
        group_generations=False,
    )

    keep_columns = KeepColumns(
        name="keep_columns",
        columns=["instruction", "generation", "result", "model_name", "feedback"],
    )

    push_to_argilla = GenerationToArgilla(
        name="push_to_argilla",
    )

    load_dataset.connect(prometheus)
    prometheus.connect(keep_columns)
    keep_columns.connect(push_to_argilla)
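
The `output_mappings` argument of the loading step is what lets the rest of the pipeline rely on fixed column names (`instruction`, `generation`) regardless of how the source dataset names them. The sketch below only illustrates that renaming effect on a single row in plain Python. It is not distilabel's implementation, and `rename_columns` and the sample row are made up for the example.

```python
def rename_columns(row: dict, mappings: dict) -> dict:
    """Rename the keys of `row` according to `mappings`; other keys are kept as-is."""
    return {mappings.get(key, key): value for key, value in row.items()}


# Hypothetical row using the default column names from the configuration cell.
row = {"instruction": "Add 2 and 3.", "response": "5", "task": "Math"}
mappings = {"response": "generation", "instruction": "instruction"}

print(rename_columns(row, mappings))
# {'instruction': 'Add 2 and 3.', 'generation': '5', 'task': 'Math'}
```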


## 4. Run the distilabel pipeline

In [None]:
distiset = pipeline.run(
    parameters={
        prometheus.name: {  # type: ignore
            "llm": {
                "model_path": prometheus_path,
                "generation_kwargs": {
                    "max_new_tokens": MAX_TOKENS,
                    "temperature": TEMPERATURE,
                },
            },
        },
    },
)
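
Before opening Argilla, a quick look at how the scores are distributed can show whether the chosen rubric and temperature produce a useful spread. The following is a minimal sketch that assumes the pipeline output is available as a list of dicts with a `result` key. How those rows are extracted from `distiset` depends on the distilabel version and is not shown here. `score_histogram` and the sample rows are hypothetical.

```python
from collections import Counter


def score_histogram(rows):
    """Count how often each Prometheus score occurs (None means no score was parsed)."""
    return Counter(row.get("result") for row in rows)


# Made-up rows standing in for the pipeline output.
rows = [{"result": 5}, {"result": 3}, {"result": 5}, {"result": None}]
hist = score_histogram(rows)

# Print numeric scores in ascending order, with unscored rows last.
for score in sorted(hist, key=lambda s: (s is None, s or 0)):
    print(score, hist[score])
```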

## 5. Look at the results (Human Feedback)


You can review your records in the Argilla UI.

In [None]:
print(ARGILLA_URL)