Add Dataset Descriptions And Instructions (#358)
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
rasbt and carmocca committed Aug 30, 2023
1 parent 7289da9 commit 241970d
Showing 7 changed files with 144 additions and 17 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -8,6 +8,7 @@ build

# data
data
datasets
checkpoints
out
wandb
8 changes: 1 addition & 7 deletions scripts/prepare_dolly.py
@@ -95,13 +95,7 @@ def download_if_missing(file_path: Path, file_url: str):
f.write(requests.get(file_url).text)


def prepare_sample(
example: dict,
tokenizer: Tokenizer,
max_length: int,
mask_inputs: bool,
ignore_index: int,
):
def prepare_sample(example: dict, tokenizer: Tokenizer, max_length: int, mask_inputs: bool, ignore_index: int):
"""Processes a single sample.
Each sample in the dataset consists of:
2 changes: 2 additions & 0 deletions tutorials/finetune_adapter.md
@@ -22,6 +22,8 @@ python scripts/prepare_alpaca.py --checkpoint_dir checkpoints/stabilityai/stable

or [prepare your own dataset](#tune-on-your-dataset).

For more information about dataset preparation, also see the [prepare_dataset.md](./prepare_dataset.md) tutorial.

## Running the finetuning

```bash
4 changes: 3 additions & 1 deletion tutorials/finetune_full.md
@@ -14,7 +14,9 @@ The steps here only need to be done once:
python scripts/prepare_alpaca.py --checkpoint_dir checkpoints/tiiuae/falcon-7b
```

or [prepare your own dataset](#tune-on-your-dataset).
or [prepare your own dataset](#tune-on-your-dataset).

For more information about dataset preparation, also see the [prepare_dataset.md](./prepare_dataset.md) tutorial.

## Running the finetuning

10 changes: 6 additions & 4 deletions tutorials/finetune_lora.md
@@ -18,11 +18,13 @@ The steps here only need to be done once:

3. Download the data and generate the instruction tuning dataset:

```bash
python scripts/prepare_alpaca.py
```
```bash
python scripts/prepare_alpaca.py --checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b
```

or [prepare your own dataset](#tune-on-your-dataset).

(See [this blog article](https://lightning.ai/blog/how-to-finetune-gpt-like-large-language-models-on-a-custom-dataset) for how to prepare and use custom datasets.)
For more information about dataset preparation, also see the [prepare_dataset.md](./prepare_dataset.md) tutorial.

## Running the finetuning

11 changes: 6 additions & 5 deletions tutorials/neurips_challenge_quickstart.md
@@ -129,13 +129,11 @@ The following command will download and preprocess the Dolly15k dataset for the
```bash
python scripts/prepare_dolly.py \
--checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b \
--destination_path data/dolly-stablelm3b \
--max_seq_length 2048
--destination_path data/dolly-stablelm3b
```

**Important note**

The preprocessed dataset is specific to the StableLM 3B model. If you use a different model like Falcon or Llama 2 later, you'll need to process the dataset with that model checkpoint directory. This is because each model uses a different tokenizer.
> [!NOTE]
> The preprocessed dataset is specific to the StableLM 3B model. If you use a different model like Falcon or Llama 2 later, you'll need to process the dataset with that model checkpoint directory. This is because each model uses a different tokenizer.
&nbsp;

@@ -145,6 +143,9 @@

To accelerate this for testing purposes, edit the [./finetune/lora.py](https://github.com/Lightning-AI/lit-gpt/blob/main/finetune/lora.py) script and change `max_iters = 50000` to `max_iters = 500` at the top of the file.

> [!NOTE]
> The Dolly dataset contains relatively long samples, which could result in out-of-memory issues. The maximum context length that is used for the evaluation, [according to the official competition rules](https://llm-efficiency-challenge.github.io/question), is 2,048 tokens. Hence, it's highly recommended to edit the [`finetune/lora.py` file](https://github.com/Lightning-AI/lit-gpt/blob/main/finetune/lora.py#L37) and change `override_max_seq_length = None` to `override_max_seq_length = 2048`.

The following command finetunes the model:

```bash
125 changes: 125 additions & 0 deletions tutorials/prepare_dataset.md
@@ -0,0 +1,125 @@
# Preparing Datasets

Below is a table of all datasets that are currently supported in Lit-GPT:


| Name         | Task        | Size                | Reference Repo                                                   | Paper / Blog                                                                                                                | Data License                                                                                                                                                                                                      |
|--------------|-------------|---------------------|------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Alpaca       | Finetuning  | 51,759 samples      | [URL](https://github.com/tatsu-lab/stanford_alpaca)              | [URL](https://crfm.stanford.edu/2023/03/13/alpaca.html)                                                                     | Attribution-NonCommercial 4.0 International, [URL](https://crfm.stanford.edu/2023/03/13/alpaca.html)                                                                                                              |
| Alpaca Libre | Finetuning  | 55,370 samples      | [URL](https://github.com/mobarski/alpaca-libre)                  | -                                                                                                                           | CC0/MIT, [URL](https://github.com/mobarski/alpaca-libre)                                                                                                                                                          |
| Dolly        | Finetuning  | 15,011 samples      | [URL](https://github.com/databrickslabs/dolly/tree/master/data)  | [URL](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm)                | CC-BY-SA, [URL](https://github.com/databrickslabs/dolly#model-overview)                                                                                                                                           |
| LIMA         | Finetuning  | 1,084 samples       | [URL](https://huggingface.co/datasets/GAIR/lima)                 | [URL](https://arxiv.org/abs/2305.11206)                                                                                     | "If the source data of LIMA has a stricter license than CC BY-NC-SA, the LIMA dataset follows the same. Otherwise, it follows the CC BY-NC-SA license", [URL](https://huggingface.co/datasets/GAIR/lima#license)  |
| OpenWeb Text | Pretraining | 8,013,769 documents | [URL](https://github.com/jcpeterson/openwebtext)                 | [URL](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)   | Unspecified                                                                                                                                                                                                       |
| RedPajama    | Pretraining | 1.2 T tokens        | [URL](https://github.com/togethercomputer/RedPajama-Data)        | [URL](https://together.ai/blog/redpajama-models-v1)                                                                         | Subset-dependent, [URL](https://github.com/togethercomputer/RedPajama-Data#license)                                                                                                                               |

&nbsp;

## Preparing Finetuning Datasets

Note that the dataset needs to be prepared separately for each type of model since the tokenizers used by the models may differ, resulting in slightly different preprocessed datasets.
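
As a quick illustration of why this matters, the sketch below compares token counts for the same text under two different tokenizers. It uses the Hugging Face `transformers` tokenizers purely as a stand-in (Lit-GPT ships its own `Tokenizer` wrapper), so it is an illustration rather than the library's own preprocessing code:

```python
# Sketch: the same text tokenizes differently under different models' tokenizers,
# which is why each model needs its own preprocessed copy of the dataset.
# Uses Hugging Face transformers as a stand-in for Lit-GPT's own Tokenizer class.
from transformers import AutoTokenizer

text = "Below is an instruction that describes a task. Write a response that completes the request."

falcon_tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
stablelm_tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b")

print(len(falcon_tokenizer.encode(text)))    # token count under the Falcon tokenizer
print(len(stablelm_tokenizer.encode(text)))  # generally a different count under StableLM
```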

For the following examples, we will use the Falcon 7B model; however, the same steps apply to any of the other supported models.

The steps here only need to be done once before preparing the finetuning datasets in the following subsections:

1. Follow the instructions in the [README](../README.md) to install the dependencies.
2. Download and convert the weights following our [guide](download_falcon.md).

&nbsp;

### Alpaca and Alpaca Libre

&nbsp;

**Alpaca**

The Alpaca dataset consists of 52,000 instructions and demonstrations produced by OpenAI's text-davinci-003 engine. It is used for instruction tuning and helps improve a language model's ability to follow instructions.

In its development, the creators leveraged the data generation methodology from the [Self-Instruct framework](https://github.com/yizhongw/self-instruct).

The original [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) dataset can be prepared as follows:

```bash
python scripts/prepare_alpaca.py \
--checkpoint_dir checkpoints/tiiuae/falcon-7b
```
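
Each record in the Alpaca JSON file has `instruction`, `input`, and `output` fields. As a rough sketch of what preparation does (the exact prompt template and tokenization logic live in `scripts/prepare_alpaca.py`), each record is rendered into a prompt/response pair before being tokenized:

```python
# Simplified sketch of Alpaca-style prompt assembly; see scripts/prepare_alpaca.py
# for the template and tokenization actually used by Lit-GPT.
def build_prompt(example: dict) -> str:
    if example.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input that provides "
            "further context. Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n### Response:"
        )
    return (
        "Below is an instruction that describes a task. Write a response that appropriately "
        f"completes the request.\n\n### Instruction:\n{example['instruction']}\n\n### Response:"
    )


record = {"instruction": "Name three primary colors.", "input": "", "output": "Red, blue, and yellow."}
print(build_prompt(record))  # model input
print(record["output"])      # target response appended during preprocessing
```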

&nbsp;

**Alpaca Libre**

[Alpaca Libre](https://github.com/mobarski/alpaca-libre) is a reimplementation of, and alternative to, Alpaca that uses the same data format.

To use Alpaca Libre instead of the original Alpaca dataset, use the following command:

```bash
python scripts/prepare_alpaca.py \
--checkpoint_dir "checkpoints/tiiuae/falcon-7b" \
--data_file_url "https://raw.githubusercontent.com/mobarski/alpaca-libre/main/data/output/alpaca_libre_ok_tasks_v4.json" \
--data_file_name "alpaca_libre_data_cleaned_archive.json" \
--destination_path "data/alpaca_libre"
```

&nbsp;

### Dolly

The Dolly dataset is a publicly available collection of 15k instruction-following entries created by Databricks. It spans multiple behavioral domains, as described in the [InstructGPT paper](https://arxiv.org/abs/2203.02155). These include brainstorming, classification, closed QA, content creation, information retrieval, open QA, and summarization.

The usage is similar to the Alpaca dataset described above. Using Falcon 7B as an example, we can prepare the dataset as follows:

```bash
python scripts/prepare_dolly.py \
  --checkpoint_dir "checkpoints/tiiuae/falcon-7b"
```
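
Each line of the downloaded `databricks-dolly-15k.jsonl` file is a JSON record with `instruction`, `context`, `response`, and `category` fields. The following is a minimal sketch (not the script's exact logic) of how such a record maps onto the Alpaca-style fields used above:

```python
import json

# Sketch: one line of the Dolly .jsonl file mapped onto Alpaca-style fields.
# Field names follow databricks-dolly-15k; see scripts/prepare_dolly.py for the real mapping.
line = (
    '{"instruction": "Summarize the paragraph.", '
    '"context": "Databricks released the Dolly dataset in 2023.", '
    '"response": "Databricks released Dolly in 2023.", '
    '"category": "summarization"}'
)
record = json.loads(line)
example = {
    "instruction": record["instruction"],
    "input": record["context"],  # Dolly's "context" plays the role of Alpaca's "input"
    "output": record["response"],
}
print(example)
```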

&nbsp;

### LIMA

The LIMA dataset is a collection of 1,000 carefully curated prompts and responses, as described in the [LIMA: Less Is More for Alignment](https://arxiv.org/abs/2305.11206) paper. The dataset is sourced from three community Q&A websites: Stack Exchange, wikiHow, and the Pushshift Reddit Dataset. In addition, it also contains prompts and answers written and collected by the authors of the LIMA paper.

The usage is similar to the Dolly dataset described above, except that it requires a Hugging Face access token, which you need to copy from your Hugging Face account. Using Falcon 7B as an example, we can prepare the dataset as follows:

```bash
python scripts/prepare_lima.py \
--checkpoint_dir "checkpoints/tiiuae/falcon-7b" \
--access_token "insert_your_token_here"
```

LIMA contains a handful of multiturn conversations. By default, only the first instruction-response pair from each of these conversations is included. If you want to override this behavior and include the follow-up instructions and responses as well, set `--include_multiturn_conversations True`.
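
Below is a minimal sketch of the difference, assuming each record stores the conversation as an alternating list of user and assistant turns (see `scripts/prepare_lima.py` for the actual processing):

```python
# Sketch: extracting instruction/response pairs from a multiturn conversation.
# Assumes the conversation is an alternating list of user and assistant turns.
def to_pairs(conversation: list[str], include_multiturn: bool = False) -> list[dict]:
    pairs = [
        {"instruction": conversation[i], "output": conversation[i + 1]}
        for i in range(0, len(conversation) - 1, 2)
    ]
    # Default behavior: keep only the first pair of each conversation.
    return pairs if include_multiturn else pairs[:1]


conversation = [
    "How do I hard-boil an egg?",
    "Place the egg in boiling water for about ten minutes.",
    "And for a soft yolk?",
    "Reduce the time to roughly six minutes.",
]
print(to_pairs(conversation))                          # one pair (default)
print(to_pairs(conversation, include_multiturn=True))  # both pairs
```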


&nbsp;

**Finetuning After Data Preparation**

After preparing the dataset, you can finetune the model using the [`finetune/*.py`](../finetune/) scripts, for example,

```bash
python finetune/lora.py \
--checkpoint_dir "checkpoints/tiiuae/falcon-7b" \
--data_dir "data/alpaca_libre" \
--out_dir "out/lora/alpaca"
```

Please read the [tutorials/finetune_*.md](../tutorials) documents for more information about finetuning models.

> [!IMPORTANT]
> Make sure that the `prepare_*.py` and `finetune/*.py` scripts use the same model checkpoint specified via `--checkpoint_dir`.

> [!IMPORTANT]
> By default, the maximum sequence length is obtained from the model configuration file. In case you run into out-of-memory errors, especially with LIMA and Dolly,
> you can try to lower the context length by editing the [`finetune/lora.py` file](https://github.com/Lightning-AI/lit-gpt/blob/main/finetune/lora.py#L37) and changing `override_max_seq_length = None` to `override_max_seq_length = 2048`.
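
For example, after that edit, the relevant line near the top of `finetune/lora.py` would read as follows (a sketch only; the surrounding hyperparameters stay as they are):

```python
# finetune/lora.py, hyperparameter section near the top of the file
override_max_seq_length = 2048  # was: None; caps the context length to reduce memory use
```
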
&nbsp;

## Preparing Pretraining Datasets

In addition to the finetuning datasets described above, Lit-GPT also supports several datasets for pretraining. The pretraining datasets are described in more detail in the following separate tutorial documents:

- [Pretrain Llama 2 on OpenWebText](./pretrain_openwebtext.md)
- [Pretrain Llama 2 on RedPajama](./pretrain_redpajama.md)
