<a href="https://colab.research.google.com/github/HeidiHufford/uplimit-synthetic-data/blob/main/Project_1_getting_started.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 1: getting started

Let's get started with installing the dependencies.

In [None]:
!uv pip install "distilabel[hf-inference-endpoints,openai,ollama] @ git+https://github.com/argilla-io/distilabel.git@develop" "model2vec" "accelerate" "transformers>=4.54.0" "semhash" "datasets<4.0.0" "numpy<2.0.0" -U -q

⚠️ Warning

Google Colab doesn't prompt you to restart the environment after the downgrade to a numpy version below 2.0.0, so you need to do that manually: go to the dropdown next to "Run all" and restart the session. Similarly, if a notebook gets stuck, try a hard reset by disconnecting and deleting the runtime.
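
After restarting, a quick version check can confirm that the session picked up the downgraded package. The sketch below is only an illustration: the helper name `is_below_major` is hypothetical, and it assumes a plain `major.minor.patch` style version string.

```python
from importlib import metadata


def is_below_major(version: str, major: int) -> bool:
    """Return True when the leading component of `version` is lower than `major`."""
    return int(version.split(".")[0]) < major


try:
    installed = metadata.version("numpy")
    print(f"numpy {installed}: below 2.0.0 -> {is_below_major(installed, 2)}")
except metadata.PackageNotFoundError:
    # numpy is not installed in this environment, so there is nothing to check.
    print("numpy is not installed")
```

If the check reports a 2.x version, the session was most likely not restarted after the install cell.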

## Working with Hugging Face

Let's first [get our token](https://huggingface.co/settings/tokens) and then log in.

In [None]:
from huggingface_hub import login

login()

Or use the Google Colab secrets integration.

In [None]:
from huggingface_hub import login
from google.colab import userdata

login(token=(userdata.get('HF_TOKEN')))

## Using Hugging Face datasets

We will use the [`fka/awesome-chatgpt-prompts`](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts) dataset. This dataset holds a pretty neat collection of prompts to use for language models.

In [None]:
from datasets import load_dataset

ds = load_dataset("fka/awesome-chatgpt-prompts")
ds

In [None]:
ds["train"].features

In [None]:
ds["train"][0]

We can then do some cool operations.

In [None]:
def do_cool_things(row):
    row["act_prompt"] = row["act"] + row["prompt"]
    return row

ds = ds.map(do_cool_things)
ds

In [None]:
ds["train"]["act_prompt"][0]

We can also run these operations in batches, which makes them quicker, for example when you need to run inference on the batches.

In [None]:
def do_cool_things(batch):
    row_act_prompts = []
    for act, prompt in zip(batch["act"], batch["prompt"]):
        row_act_prompts.append(act+prompt)
    batch["act_prompt"] = row_act_prompts
    return batch

ds = ds.map(do_cool_things, batched=True)
ds

In [None]:
ds["train"]["act_prompt"][0]
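
The difference between the two versions of `do_cool_things` is the shape of the argument: with `batched=True` the function receives a dictionary that maps each column name to a list of values, not a single row. The sketch below mimics that shape in plain Python. The `iter_batches` helper and the toy rows are hypothetical stand-ins and not part of the `datasets` library.

```python
from typing import Dict, Iterator, List

Batch = Dict[str, List[str]]


def iter_batches(columns: Batch, batch_size: int) -> Iterator[Batch]:
    """Yield slices of a column-oriented table, `batch_size` rows at a time."""
    num_rows = len(next(iter(columns.values())))
    for start in range(0, num_rows, batch_size):
        yield {name: values[start:start + batch_size] for name, values in columns.items()}


def add_act_prompt(batch: Batch) -> Batch:
    # Each key maps to a list with one entry per row in the batch.
    batch["act_prompt"] = [act + prompt for act, prompt in zip(batch["act"], batch["prompt"])]
    return batch


# Hypothetical toy rows standing in for the real dataset.
table = {
    "act": ["Linux Terminal", "Translator", "Poet"],
    "prompt": [": act as a shell.", ": translate my text.", ": write a poem."],
}

processed = [add_act_prompt(batch) for batch in iter_batches(table, batch_size=2)]
print(processed[0]["act_prompt"])
```

Because the function sees many rows at once, anything with a per-call overhead (such as model inference) can be applied to the whole list in one go.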

## Using LLMs

### Serverless HF LLMs

We can then search for a model on Hugging Face and start calling LLMs. Let's use [`meta-llama/Llama-3.2-3B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and find and use the snippet for calling API endpoints. Note that the cells below call the smaller `meta-llama/Llama-3.2-1B-Instruct` variant.


In [None]:
from openai import OpenAI
from huggingface_hub import get_token

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=get_token()
)

messages = [
    {
        "role": "user",
        "content": "What is the capital of France?"
    }
]

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=messages,
    max_tokens=500,
)

print(completion.choices[0].message)

Alternatively, you can also check the model card and [choose a specific inference provider](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct?inference_api=true&inference_provider=auto&language=python&client=openai). In that case, you would change `meta-llama/Llama-3.2-3B-Instruct` to `meta-llama/Llama-3.2-3B-Instruct:<provider_name>`. Let's give it a try! The cell below uses the `auto` suffix; replace it with the name of a provider, for example Together AI, to pin a specific one.

In [None]:
completion = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct:auto",
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ],
)
completion
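
The `<provider_name>` suffix is plain string composition on the model id. A small helper can keep that in one place when you want to switch providers between runs. This is only a sketch: `with_provider` is a hypothetical name, and the valid provider names are whatever the model card lists.

```python
def with_provider(model_id: str, provider: str = "auto") -> str:
    """Append an inference-provider suffix to a Hub model id."""
    if ":" in model_id:
        # Avoid producing ids with two suffixes, such as "model:auto:auto".
        raise ValueError(f"{model_id!r} already carries a provider suffix")
    return f"{model_id}:{provider}"


routed_model = with_provider("meta-llama/Llama-3.2-1B-Instruct")
print(routed_model)
```

The resulting string can be passed as the `model` argument of `client.chat.completions.create`.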

### Serverless LLMs within distilabel

We can then use a prompt from this dataset to call [LLMs with distilabel](https://davidberenstein1957.github.io/distilabel/latest/components-gallery/llms/). Let's see how we can use the [InferenceEndpointsLLM](https://davidberenstein1957.github.io/distilabel/latest/components-gallery/llms/inferenceendpointsllm/#dedicated-inference-endpoints-or-tgi).

In [None]:
from distilabel.models import InferenceEndpointsLLM

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Llama-3.2-1B-Instruct",
)

llm.load()
llm.generate_outputs(inputs=[[
    {"role": "user", "content": "What is the capital of France?"}
]])

### Local LLMs within distilabel

We can also use a local LLM through the `TransformersLLM` implementation. Although the quality of generated examples is limited, this allows us to iterate on pipelines and projects before spending many compute credits.
Some good examples are the [SmolLM2](https://huggingface.co/collections/HuggingFaceTB/smollm2-6723884218bcda64b34d7db9) and [LFM2](https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38) model series.

In [None]:
from distilabel.models.llms import TransformersLLM
import torch

llm = TransformersLLM(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    device_map="auto",
    torch_dtype="bfloat16"
)

llm.load()
llm.generate_outputs(inputs=[[
    {"role": "user", "content": "What is the capital of France?"}
]])

## Using prompt templates

### Custom prompt templates

Besides the built-in templates, we can also define custom templates within [the TextGeneration Task](https://davidberenstein1957.github.io/distilabel/latest/components-gallery/tasks/textgeneration/), where you define a placeholder inside a prompt to inject rows from your dataset.

In [None]:
from distilabel.steps.tasks import TextGeneration
from distilabel.models import TransformersLLM

# initialise the LLM
llm = TransformersLLM(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    device_map="auto",
    torch_dtype="bfloat16"
)

# define your custom template
system_prompt = "You are a helpful assistant that is great at rewriting content."
prompt_template = """Rewrite the following instruction to make it more complex:

{{instruction}}

IMPORTANT. Only return the rewritten instruction and nothing else!
Rewritten instruction:
"""
text_generation = TextGeneration(
    name="exam_generation",
    system_prompt=system_prompt,
    template=prompt_template,
    llm=llm,
    input_batch_size=8,
    columns=["instruction"]
)
text_generation.load()

next(
    text_generation.process([
        {"instruction": "What is the current capital of France?"}
    ]
))
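
To see what the model actually receives, it helps to fill in the placeholder by hand. distilabel documents the template as Jinja2 syntax; the sketch below is a simplified stand-in that uses a regular expression and only handles plain `{{ column }}` placeholders, so it is not how the library renders templates. The helper name `fill_placeholders` is hypothetical.

```python
import re

prompt_template = """Rewrite the following instruction to make it more complex:

{{instruction}}

IMPORTANT. Only return the rewritten instruction and nothing else!
Rewritten instruction:
"""


def fill_placeholders(template: str, row: dict) -> str:
    """Replace each `{{ column }}` placeholder with the matching value from `row`."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda match: str(row[match.group(1)]), template)


rendered = fill_placeholders(
    prompt_template,
    {"instruction": "What is the current capital of France?"},
)
print(rendered)
```

Each name listed in `columns` has to appear as a key in the input row, otherwise there is nothing to substitute for the placeholder.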

### Built-in prompt templates

We can also use these LLMs along with the prompt templates. Prompt templates are called [tasks](https://davidberenstein1957.github.io/distilabel/latest/components-gallery/tasks/). We've already discussed the [EvolInstruct](https://davidberenstein1957.github.io/distilabel/latest/components-gallery/tasks/evolinstruct/), [SelfInstruct](https://davidberenstein1957.github.io/distilabel/latest/components-gallery/tasks/selfinstruct/) and [Magpie](https://davidberenstein1957.github.io/distilabel/latest/components-gallery/tasks/magpie/) templates; let's try one now.

In [None]:
from distilabel.steps.tasks import SelfInstruct
from distilabel.models import TransformersLLM

# initialise the LLM
llm = TransformersLLM(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    device_map="auto",
    torch_dtype="bfloat16"
)

# Consider this as a placeholder for your actual LLM.
self_instruct = SelfInstruct(
    llm=llm
)
self_instruct.load()

next(
    self_instruct.process([
        {"input": "The current capital of France is Paris."}
    ]
))

Note that the Magpie template is slightly different and allows you to define some additional parameters during the initialisation of your LLM, like `tokenizer_id`, `magpie_pre_query_template` and `use_magpie_template`.

As we can see, the column of generated `instructions` is formatted as a list of strings. This means we cannot directly use it in follow-up steps. To solve this, we've come up with formatting operations.

## Using formatting operations

### Built-in formatting using steps

Besides prompt templates, we also support [several formatting operations](https://davidberenstein1957.github.io/distilabel/latest/components-gallery/steps/) to format data and columns. Let's take a look at how to expand columns using the [ExpandColumns](https://davidberenstein1957.github.io/distilabel/latest/components-gallery/steps/expandcolumns/) Step.



In [None]:
from distilabel.steps import ExpandColumns

expand_columns = ExpandColumns(
    columns=["generation"],
)

next(
    expand_columns.process(
        [
            {
                "instruction": "instruction 1",
                "generation": ["generation 1", "generation 2"]}
        ],
    )
)
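
Conceptually, expanding a column turns one row holding a list into one row per list element, with the other columns copied along. The plain-Python sketch below illustrates that idea on the same toy row. It is not the `ExpandColumns` implementation, and `expand_column` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def expand_column(rows: List[Dict[str, Any]], column: str) -> List[Dict[str, Any]]:
    """Emit one output row per element of the list stored in `column`."""
    expanded = []
    for row in rows:
        for value in row[column]:
            # Copy the other columns unchanged and swap in a single element.
            expanded.append({**row, column: value})
    return expanded


rows = [{"instruction": "instruction 1", "generation": ["generation 1", "generation 2"]}]
expanded_rows = expand_column(rows, "generation")
print(expanded_rows)
```

After this step every row holds a single string in `generation`, which is the shape follow-up tasks typically expect.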

If you do not like this interface, we recommend taking your data outside of the distilabel flow and simply editing it with any other tool you prefer, like Pandas or Hugging Face datasets.

## Using LLMs in a distilabel pipeline

Normally, distilabel works with [pipelines](https://davidberenstein1957.github.io/distilabel/latest/sections/getting_started/quickstart/#define-a-custom-pipeline). We can use these to define a custom synthetic data flow. These pipelines are very useful for verifying the generations, caching the generations, serialising the pipeline, and reusing the pipeline when needed. Let's try to rewrite prompts from [a basic instruction dataset](https://huggingface.co/datasets/distilabel-internal-testing/instruction-dataset-mini?viewer_embed=true).

In [None]:
from huggingface_hub import whoami

from distilabel.models import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.steps.tasks import EvolInstruct

# the `with Pipeline()` context manager calls `load()` on components
with Pipeline() as pipeline:
    loader = LoadDataFromHub(
        repo_id="distilabel-internal-testing/instruction-dataset-mini",
        split="test",
        num_examples=1
    )
    evol_instruct = EvolInstruct(
        llm=TransformersLLM(
            model="HuggingFaceTB/SmolLM2-360M-Instruct",
            device_map="auto",
            torch_dtype="bfloat16"
        ),
        num_evolutions=1,
        # ensure correct column mapping
        input_mappings={"instruction": "prompt"},
    )
    loader.connect(evol_instruct)
    # use connect to determine flow of data
    # or use the '>>' operator: `loader >> evol_instruct`
pipeline

In [None]:
distiset = pipeline.run(
    parameters={
        evol_instruct.name: {
            "llm": {
                "generation_kwargs":{
                    "temperature": 0.7,
                    "max_new_tokens": 1024
                }
            }
        }
    }
)
distiset

In [None]:
distiset["default"]["train"]["prompt"], distiset["default"]["train"]["completion"]

## Upload your dataset

You can then push your dataset to Hugging Face and [create a nice dataset card](https://huggingface.co/datasets/uplimit/uplimit-synthetic-data-week-1-basic).

In [None]:
distiset.push_to_hub("davidberenstein1957/example-dataset-distilabel")

## Explore your data

There is [an integration with Nomic AI](https://huggingface.co/blog/MaxNomic/explore-any-hugging-face-dataset-with-nomic-atlas) that allows you to explore, curate and vector search any Hugging Face dataset with Nomic Atlas. Additionally, you could use something like [Argilla](https://huggingface.co/blog/argilla-ui-hub) for a more fine-grained analysis.


Let's start with exploring the data in [Nomic AI](https://atlas.nomic.ai/data/davidmberenstein/distilabel-intel-orca-dpo-pairs/map/58350d76-78cf-4383-ad65-2d4f562dabcf).

## Deduplicate your data

The [Dataset Tools organisation on Hugging Face](https://huggingface.co/collections/Dataset-Tools/models-for-dataset-curation-673c647d85be6398f9ba23d3) holds collections of tools and models to explore data or do feature engineering. For example, there is a really fast embedder based on [Model2Vec](https://github.com/MinishLab/model2vec) that can be used to deduplicate data based on semantic overlap using [semhash](https://github.com/MinishLab/semhash/tree/main/semhash).


In [None]:
from datasets import load_dataset
from semhash import SemHash


ds = load_dataset("argilla/distilabel-intel-orca-dpo-pairs", split="train")

semhash = SemHash.from_records(records=ds["input"])

# Deduplicate the texts
deduplicated_texts = semhash.self_deduplicate(threshold=0.8).deduplicated
print(f"Original dataset: {len(ds)}. Filtered dataset: {len(deduplicated_texts)}. Percentage left: {len(deduplicated_texts)/len(ds):.1%}")
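
To build some intuition for the `threshold` argument, the toy sketch below deduplicates texts by comparing embedding vectors with cosine similarity and dropping any text that is too close to one already kept. It is a simplified illustration under stated assumptions: the vectors are hand-written instead of coming from an embedding model, the greedy loop and the `greedy_deduplicate` helper are hypothetical, and none of this is semhash's actual implementation.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, or 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_deduplicate(
    texts: List[str], vectors: List[Sequence[float]], threshold: float
) -> List[str]:
    """Keep a text only if it is below `threshold` similarity to every kept text."""
    kept_indices: List[int] = []
    for index, vector in enumerate(vectors):
        is_duplicate = any(
            cosine_similarity(vector, vectors[kept]) >= threshold for kept in kept_indices
        )
        if not is_duplicate:
            kept_indices.append(index)
    return [texts[index] for index in kept_indices]


# Hand-written toy vectors; a real run would get them from an embedding model.
texts = [
    "What is the capital of France?",
    "Which city is France's capital?",
    "How do I bake bread?",
]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

unique_texts = greedy_deduplicate(texts, vectors, threshold=0.8)
print(unique_texts)
```

Raising the threshold keeps more near-duplicates, while lowering it removes texts that are only loosely related.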

## Filter data on quality

Similarly, there are models and tools to determine the quality of your texts. A model we could use is the [`HuggingFaceFW/fineweb-edu-classifier`](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) model for educational quality.
[Text Descriptives](https://github.com/HLasse/TextDescriptives) is another Python library you can explore for calculating a large variety of quality metrics from text.

In [None]:
from transformers import pipeline
import pandas as pd
from datasets import load_dataset


ds = load_dataset(
    path="argilla/distilabel-intel-orca-dpo-pairs",
    split="train"
)

pipe = pipeline(
    task="text-classification",
    model="HuggingFaceFW/fineweb-edu-classifier"
)

quality_predictions = pipe(ds["chosen"], truncation=True, verbose=True)


quality_scores = [i["score"] for i in quality_predictions]

df = pd.DataFrame.from_dict(
    {
        "text": ds["chosen"],
        "quality": quality_scores
    }
)
p_to_keep = 0.8
min_score = 0.8
df.sort_values(by="quality", ascending=False, inplace=True)
df = df.head(int(len(df)*p_to_keep))
df = df[df["quality"] > min_score]
print(f"Original dataset: {len(ds)}. Filtered dataset: {len(df)}. Percentage left: {len(df)/len(ds):.1%}")
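
The filtering at the end of the cell applies two cuts in sequence: first keep the top `p_to_keep` fraction by score, then drop anything at or below `min_score`. The sketch below replays that logic in plain Python on made-up scores so the effect of each cut is visible. The `filter_by_quality` helper and the sample data are hypothetical.

```python
from typing import List, Tuple

ScoredText = Tuple[str, float]


def filter_by_quality(
    scored: List[ScoredText], p_to_keep: float, min_score: float
) -> List[ScoredText]:
    """Keep the top `p_to_keep` fraction by score, then require score > `min_score`."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    top = ranked[: int(len(ranked) * p_to_keep)]
    return [item for item in top if item[1] > min_score]


# Made-up scores for illustration only.
scored_texts = [
    ("text a", 0.95),
    ("text b", 0.85),
    ("text c", 0.70),
    ("text d", 0.60),
    ("text e", 0.99),
]

kept = filter_by_quality(scored_texts, p_to_keep=0.8, min_score=0.8)
print([text for text, _ in kept])
```

Here the fraction cut removes the lowest-scoring text and the score cut removes one more, so three of the five texts remain.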