# 3. Dataset Coding Practice

As mentioned, PyRIT is a framework, and we try not to dictate what you test. We do want to make testing those things as easy as we can.

The last section covered much of _how_ we can add `Seeds`. Requests to include different datasets are a really common GitHub issue, and this section describes _how_ you can include them.

Here are common kinds of datasets within PyRIT:

- Jailbreaks: these help with everything else, since a jailbreak template can wrap other prompts.
- Harms: these can benchmark how models are doing. They can be `SeedObjectives` or `SeedPrompts`.
- System prompts in various forms: adversarial models, scorers, etc.

These can all be included in our source code in various ways: writing your own, adding YAML files, or using a `RemoteDatasetLoader`.

## YAML

A common way to include datasets is via YAML. The format is covered in [seed programming](./2_seed_programming.ipynb). If the license is compatible and the dataset would be useful in PyRIT, we can often include it directly with appropriate attribution. Common locations for YAML files are:

- [jailbreaks](../../../pyrit/datasets/jailbreak/templates/): These will automatically be included via the `TextJailBreak` classes and used in places like the `TextJailBreakConverter`.
- [harms](../../../pyrit/datasets/seed_datasets/local/): These will automatically be loaded with `SeedDatasetProvider` and can be useful for various scenarios.

## Remote Datasets

Sometimes it is preferable to fetch datasets remotely, either for licensing reasons or to stay up to date. These are commonly retrieved from a URL or from a Hugging Face dataset. To include a dataset this way, create a `RemoteDatasetLoader` class. There are helper functions for parsing common formats, caching, and downloading from websites or Hugging Face. These loaders are also automatically loaded with `SeedDatasetProvider`.
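To make "parsing common formats" concrete, here is a minimal sketch of turning tab-separated text into one dictionary per row, which is the shape a loader would then iterate over. It uses only the Python standard library and is not PyRIT's own helper; `parse_tsv` and `SAMPLE_TSV` are hypothetical names, and the sample rows are placeholders rather than real dataset content.

```python
import csv
import io

# Hypothetical sample in the shape of a tab-separated dataset file.
# The column names are illustrative and the rows are placeholders.
SAMPLE_TSV = (
    "Example\tDeceptive Pattern\n"
    "placeholder prompt one\tpattern-a\n"
    "placeholder prompt two\tpattern-b\n"
)


def parse_tsv(text: str) -> list[dict[str, str]]:
    """Parse tab-separated text into one dict per row, keyed by the header names."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


rows = parse_tsv(SAMPLE_TSV)
for row in rows:
    print(row["Deceptive Pattern"], "->", row["Example"])
```

Each row is keyed by the header names, so later code can read fields by column name instead of by position.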

Below is a simplified version of `DarkBenchDataset`.

In [None]:

from pyrit.datasets.seed_datasets.remote.remote_dataset_loader import RemoteDatasetLoader
from pyrit.models import SeedDataset, SeedPrompt

class SimpleDarkBench(RemoteDatasetLoader):

    @property
    def dataset_name(self) -> str:
        return "dark_bench"

    async def fetch_dataset(self, *, cache: bool = True) -> SeedDataset:
        # Fetch from HuggingFace
        data = await self._fetch_from_huggingface(
            dataset_name="apart/darkbench",
            config="default",
            split="train",
            cache=cache,
            data_files="darkbench.tsv",
        )

        # Process into SeedPrompts
        seed_prompts = [
            SeedPrompt(
                value=item["Example"],
                data_type="text",
                dataset_name=self.dataset_name,
                harm_categories=[item["Deceptive Pattern"]],
            )
            for item in data
        ]

        return SeedDataset(seeds=seed_prompts, dataset_name=self.dataset_name)
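Remote data is not always clean, so a loader may want to guard the mapping step. The sketch below isolates that step and skips rows without prompt text. It is an assumption about how one might harden a loader, not what `DarkBenchDataset` does. `StubSeed` and `rows_to_seeds` are hypothetical stand-ins so the example runs without PyRIT installed; in a real loader you would build `SeedPrompt` objects as shown above.

```python
from dataclasses import dataclass, field


@dataclass
class StubSeed:
    """Hypothetical stand-in for a seed prompt; this is not PyRIT's SeedPrompt."""

    value: str
    data_type: str
    dataset_name: str
    harm_categories: list[str] = field(default_factory=list)


def rows_to_seeds(rows: list[dict[str, str]], dataset_name: str) -> list[StubSeed]:
    """Map raw rows to seeds, skipping rows that have no prompt text."""
    seeds = []
    for item in rows:
        text = (item.get("Example") or "").strip()
        if not text:
            # Nothing to send to a target, so drop the row.
            continue
        category = (item.get("Deceptive Pattern") or "").strip()
        seeds.append(
            StubSeed(
                value=text,
                data_type="text",
                dataset_name=dataset_name,
                harm_categories=[category] if category else [],
            )
        )
    return seeds


# Placeholder rows: the second one has no prompt text and is skipped.
sample_rows = [
    {"Example": "placeholder prompt", "Deceptive Pattern": "pattern-a"},
    {"Example": "   ", "Deceptive Pattern": "pattern-b"},
]
seeds = rows_to_seeds(sample_rows, dataset_name="dark_bench")
print(len(seeds))
```

Keeping the mapping in a small function like this also makes it easy to unit test without any network access.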