## Configuring datasets for fine-tuning

Datasets are a core component of fine-tuning workflows that serve as a “steering wheel” to guide LLM generation for a particular use case. Many publicly shared open-source datasets have become popular for fine-tuning LLMs and serve as a great starting point to train your model. torchtune gives you the tools to download external community datasets, load in custom local datasets, or create your own datasets.

### Built-in datasets

To use one of the built-in datasets in the library, simply import and call the dataset builder function. You can see a list of all supported datasets here.

```python
from torchtune.datasets import alpaca_dataset

# Load in tokenizer
tokenizer = ...
dataset = alpaca_dataset(tokenizer)
```

```yaml
# YAML config
dataset:
  _component_: torchtune.datasets.alpaca_dataset
```

```bash
# Command line
tune run full_finetune_single_device --config llama3/8B_full_single_device \
dataset=torchtune.datasets.alpaca_dataset
```
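The `_component_` key names the builder by its dotted import path. As a rough illustration of that idea, the sketch below resolves a dotted path to a callable with `importlib` and calls it with the remaining keys as keyword arguments. The helper `instantiate_component` is hypothetical and is not torchtune’s actual config code. The example uses the standard-library `datetime.timedelta` as a stand-in for a dataset builder so that it runs without torchtune installed.

```python
import importlib


def instantiate_component(config, *args):
    """Resolve config["_component_"] to a callable and call it.

    Hypothetical helper for illustration only; torchtune ships its own
    config utilities, which may behave differently.
    """
    # Every key except _component_ becomes a keyword argument.
    kwargs = {k: v for k, v in config.items() if k != "_component_"}
    # Split "package.module.name" into the module path and the attribute name.
    module_path, _, attr_name = config["_component_"].rpartition(".")
    component = getattr(importlib.import_module(module_path), attr_name)
    return component(*args, **kwargs)


# Stand-in config: a stdlib callable takes the place of a dataset builder.
config = {"_component_": "datetime.timedelta", "hours": 1, "minutes": 30}
value = instantiate_component(config)
print(value)  # 1:30:00
```

Extra positional arguments are passed through unchanged, which is how an object that cannot be written in YAML, such as a tokenizer, could reach the builder in this sketch.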

### Hugging Face datasets

We provide first class support for datasets on the Hugging Face hub. Under the hood, all of our built-in datasets and dataset builders are using Hugging Face’s `load_dataset()` to load in your data, whether local or on the hub.

You can pass in a Hugging Face dataset path to the `source` parameter in any of our builders to specify which dataset on the hub to download. Additionally, all builders accept any keyword-arguments that `load_dataset()` supports. You can see a full list on Hugging Face’s documentation.

```python
from torchtune.datasets import text_completion_dataset

# Load in tokenizer
tokenizer = ...
dataset = text_completion_dataset(
    tokenizer,
    source="allenai/c4",
    # Keyword-arguments that are passed into load_dataset
    split="train",
    data_dir="realnewslike",
)
```

```yml
# YAML config
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: allenai/c4
  split: train
  data_dir: realnewslike
```

```bash
# Command line
tune run full_finetune_single_device --config llama3/8B_full_single_device \
dataset=torchtune.datasets.text_completion_dataset dataset.source=allenai/c4 \
dataset.split=train dataset.data_dir=realnewslike
```
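The `key=value` arguments after `--config` override fields of the YAML config, with dots selecting nested keys. The sketch below illustrates that merge on a plain dictionary. `apply_overrides` is a hypothetical helper written for this example, not part of torchtune, and it keeps every value as a string rather than parsing types.

```python
import copy


def apply_overrides(config, overrides):
    """Return a copy of config with dotted key=value overrides applied.

    Hypothetical helper for illustration only.
    """
    result = copy.deepcopy(config)  # leave the caller's config untouched
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk down the nested dicts, creating levels that are missing.
            node = node.setdefault(name, {})
        node[leaf] = value
    return result


base = {"dataset": {"_component_": "torchtune.datasets.text_completion_dataset"}}
merged = apply_overrides(
    base,
    ["dataset.source=allenai/c4", "dataset.split=train", "dataset.data_dir=realnewslike"],
)
print(merged["dataset"])
```

The merged `dataset` entry ends up with the same keys as the YAML config shown earlier in this section.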

### Setting max sequence length

The default collator, `padded_collate()`, used in all our training recipes pads samples to the longest sequence within each batch, not to a global maximum. To set a global upper limit on sequence length, specify `max_seq_len` in the dataset builder. Any sample in the dataset that is longer than `max_seq_len` is truncated by `truncate()`. After truncation, the tokenizer’s EOS id is still guaranteed to be the last token, except in `TextCompletionDataset`.

Generally, you want the max sequence length returned in each data sample to match the context window size of your model. You can also decrease this value to reduce memory usage depending on your hardware constraints.
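To make the difference between per-batch padding and a global cap concrete, here is a minimal pure-Python sketch. The helpers `truncate_tokens` and `pad_batch` are hypothetical stand-ins written for this example, not torchtune’s `truncate()` or `padded_collate()`, and the EOS id of 2 and padding id of 0 are arbitrary assumptions.

```python
def truncate_tokens(tokens, max_seq_len, eos_id=None):
    """Cut tokens to max_seq_len; if eos_id is given, make it the last token."""
    truncated = tokens[:max_seq_len]  # slicing copies, so the input is not mutated
    if eos_id is not None and truncated and truncated[-1] != eos_id:
        truncated[-1] = eos_id
    return truncated


def pad_batch(batch, padding_id=0):
    """Pad every sample to the longest sample in this batch only."""
    longest = max(len(sample) for sample in batch)
    return [sample + [padding_id] * (longest - len(sample)) for sample in batch]


EOS_ID = 2  # arbitrary value chosen for this example
samples = [[5, 6, 7, 8, 9, 10, EOS_ID], [11, 12, EOS_ID]]

# Global cap: no sample may exceed 4 tokens.
capped = [truncate_tokens(s, max_seq_len=4, eos_id=EOS_ID) for s in samples]
print(capped)  # [[5, 6, 7, 2], [11, 12, 2]]

# Per-batch padding: pad only up to the longest sample in the batch.
print(pad_batch(capped))  # [[5, 6, 7, 2], [11, 12, 2, 0]]
```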

```python
from torchtune.datasets import alpaca_dataset

# Load in tokenizer
tokenizer = ...
dataset = alpaca_dataset(
    tokenizer=tokenizer,
    max_seq_len=4096,
)
```

```yaml
# YAML config
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  max_seq_len: 4096
```

```bash
# Command line
tune run full_finetune_single_device --config llama3/8B_full_single_device \
dataset.max_seq_len=4096
```

### Sample packing

You can use sample packing with any of the single dataset builders by passing in `packed=True`. This requires some pre-processing of the dataset, which may increase the time to first batch, but it can yield significant training speedups depending on the dataset.

```python
from torchtune.datasets import alpaca_dataset, PackedDataset

# Load in tokenizer
tokenizer = ...
dataset = alpaca_dataset(
    tokenizer=tokenizer,
    packed=True,
)
print(isinstance(dataset, PackedDataset))  # True
```

```yaml
# YAML config
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  packed: True
```

```bash
# Command line
tune run full_finetune_single_device --config llama3/8B_full_single_device \
dataset.packed=True
```
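As a rough picture of what packing does, the sketch below greedily concatenates tokenized samples into fixed-size packs so that fewer padding tokens are needed. It is a simplified illustration under stated assumptions: no sample exceeds `max_seq_len`, samples are never split across packs, and attention masking between samples is ignored. `pack_samples` is a hypothetical helper, not the `PackedDataset` implementation.

```python
def pack_samples(samples, max_seq_len, padding_id=0):
    """Greedily concatenate samples into packs of exactly max_seq_len tokens.

    Hypothetical helper for illustration only.
    """
    packs, current = [], []
    for sample in samples:
        if len(sample) > max_seq_len:
            raise ValueError("sample is longer than max_seq_len")
        if len(current) + len(sample) > max_seq_len:
            # The next sample does not fit: pad and close the current pack.
            packs.append(current + [padding_id] * (max_seq_len - len(current)))
            current = []
        current = current + sample
    if current:
        # Pad and close the final, partially filled pack.
        packs.append(current + [padding_id] * (max_seq_len - len(current)))
    return packs


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packs = pack_samples(samples, max_seq_len=6)
print(packs)  # [[1, 2, 3, 4, 5, 0], [6, 7, 8, 9, 10, 0]]
```

In this toy example, four samples fit into two rows of length 6 with two padding tokens in total. The pass over the whole dataset needed to build the packs is the pre-processing cost that delays the first batch.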