# Generate a dataset for instruction tuning

This notebook walks you through generating a dataset for instruction tuning with the `distilabel` package.

So let's dig into some instruction tuning datasets.

<div style='background-color: lightblue; padding: 10px; border-radius: 5px; margin-bottom: 20px; color:black'>
    <h2 style='margin: 0;color:blue'>Exercise: Generate a dataset for instruction tuning</h2>
    <p>Now that you've seen how instruction tuning datasets are generated, try generating one yourself.</p>
    <p><b>Difficulty Levels</b></p>
    <p>🐢 Generate an instruction tuning dataset</p>
    <p>🐕 Generate a dataset for instruction tuning with seed data</p>
    <p>🦁 Generate a dataset for instruction tuning with seed data and with instruction evolution</p>
</div>

## Install dependencies

Instead of the `hf-transformers` extra, you can also install the `vllm` or `hf-inference-endpoints` extra.

```python
!pip install "distilabel[hf-transformers,outlines,instructor]"
```


## Start synthesizing

As we saw in the previous course content, we can create distilabel pipelines for instruction dataset generation. A bare-minimum pipeline is already provided below. Scale it up to generate a large dataset for instruction tuning, and swap out models, model providers, and generation arguments to see how they affect the quality of the dataset. Experiment small, scale up later.
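One way to keep such experiments organized is to enumerate the combinations you want to compare before running anything. The sketch below is only an illustration: the candidate temperatures and token limits are hypothetical values, and `build_experiment_grid` is a helper invented here, not part of distilabel. It assumes each resulting config would be passed to an LLM class that accepts a `model` and a `generation_kwargs` dictionary, as the distilabel LLM wrappers generally do.

```python
from itertools import product

# Hypothetical candidate values; tune these for your own experiments.
MODELS = ["HuggingFaceTB/SmolLM2-1.7B-Instruct"]
TEMPERATURES = [0.2, 0.7, 1.0]
MAX_NEW_TOKENS = [128, 256]


def build_experiment_grid(models, temperatures, max_new_tokens):
    """Return one config dict per combination of model and generation arguments."""
    return [
        {
            "model": model,
            "generation_kwargs": {"temperature": t, "max_new_tokens": n},
        }
        for model, t, n in product(models, temperatures, max_new_tokens)
    ]


grid = build_experiment_grid(MODELS, TEMPERATURES, MAX_NEW_TOKENS)
for config in grid:
    print(config)
```

Running a handful of rows through each configuration and inspecting the outputs by hand is usually enough to pick a setting before scaling up.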

Check out the [distilabel components gallery](https://distilabel.argilla.io/latest/components-gallery/) for information about the processing classes and how to use them. 

An example of loading data from the Hub instead of dictionaries is provided below.

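```python
from datasets import load_dataset
from distilabel.pipeline import Pipeline

with Pipeline(...) as pipeline:
    ...

if __name__ == "__main__":
    # "my-dataset" is a placeholder; use the ID of a dataset on the Hub
    dataset = load_dataset("my-dataset", split="train")
    # Pass the loaded dataset to the pipeline as its input data
    distiset = pipeline.run(dataset=dataset)
```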

Don't forget to push your dataset to the Hub after running the pipeline!

```python
from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline() as pipeline:
    # Seed row: a prompt asking the model to write a question
    data = LoadDataFromDicts(data=[{"instruction": "Generate a short question about the Hugging Face Smol-Course."}])
    llm = TransformersLLM(model="HuggingFaceTB/SmolLM2-1.7B-Instruct")
    # gen_a writes the question and stores it in the "instruction" column
    gen_a = TextGeneration(llm=llm, output_mappings={"generation": "instruction"})
    # gen_b answers that question and stores the answer in the "response" column
    gen_b = TextGeneration(llm=llm, output_mappings={"generation": "response"})
    data >> gen_a >> gen_b

if __name__ == "__main__":
    # Disable caching so every run regenerates the data
    distiset = pipeline.run(use_cache=False)
    distiset.push_to_hub("huggingface-smol-course-instruction-tuning-dataset")
```
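
For the seed-data difficulty level, the single hard-coded prompt can be replaced with a list of rows built from seed topics. The following is a minimal sketch under stated assumptions: the topics and the prompt template are hypothetical placeholders, and `build_seed_rows` is a helper made up for illustration. It only prepares plain dictionaries with an `instruction` key, which is the column the pipeline above starts from.

```python
# Hypothetical seed topics; replace with topics from your own domain.
SEED_TOPICS = ["datasets", "tokenizers", "fine-tuning"]

# Hypothetical template modeled on the prompt used in the pipeline above.
PROMPT_TEMPLATE = "Generate a short question about {topic} in the Hugging Face Smol-Course."


def build_seed_rows(topics, template=PROMPT_TEMPLATE):
    """Turn topics into rows with an "instruction" column, skipping blanks and duplicates."""
    seen = set()
    rows = []
    for topic in topics:
        topic = topic.strip()
        if not topic or topic.lower() in seen:
            continue
        seen.add(topic.lower())
        rows.append({"instruction": template.format(topic=topic)})
    return rows


seed_rows = build_seed_rows(SEED_TOPICS)
```

The resulting list could then be passed as `LoadDataFromDicts(data=seed_rows)` in place of the single-row list.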

## 🌯 That's a wrap

You've now seen how to generate a dataset for instruction tuning. You could use this to:

- Generate training datasets for instruction tuning.
- Create evaluation datasets for instruction tuning.
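
As a rough illustration of the evaluation use case, generated rows can be split into a training set and a held-out evaluation set. This is a sketch, not a prescribed method: `split_rows` is a hypothetical helper, the rows are dummy data with the `instruction` and `response` columns produced above, and the 20% hold-out fraction is an arbitrary choice.

```python
import random


def split_rows(rows, eval_fraction=0.1, seed=42):
    """Shuffle rows reproducibly and hold out a fraction for evaluation."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A seeded Random instance keeps the split stable across runs
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Dummy rows standing in for generated data
rows = [{"instruction": f"question {i}", "response": f"answer {i}"} for i in range(20)]
train_rows, eval_rows = split_rows(rows, eval_fraction=0.2)
print(len(train_rows), len(eval_rows))
```

Fixing the seed keeps the evaluation set identical between runs, so results from different fine-tuning experiments stay comparable.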

Next

- 🧑‍🏫 Learn about [generating preference datasets](./preference_datasets.md)
- 🏋️‍♂️ Fine-tune a model for instruction tuning with a synthetic dataset based on the [instruction tuning chapter](../../1_instruction_tuning/README.md)
