Add Dataset Descriptions And Instructions (#358)

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Lightning-AI · Aug 30, 2023 · 241970d
1 parent 7289da9

Showing 7 changed files with 144 additions and 17 deletions.
diff --git a/.gitignore b/.gitignore
@@ -8,6 +8,7 @@ build
 
 # data
 data
+datasets
 checkpoints
 out
 wandb

diff --git a/scripts/prepare_dolly.py b/scripts/prepare_dolly.py
@@ -95,13 +95,7 @@ def download_if_missing(file_path: Path, file_url: str):
         f.write(requests.get(file_url).text)
 
 
-def prepare_sample(
-    example: dict,
-    tokenizer: Tokenizer,
-    max_length: int,
-    mask_inputs: bool,
-    ignore_index: int,
-):
+def prepare_sample(example: dict, tokenizer: Tokenizer, max_length: int, mask_inputs: bool, ignore_index: int):
     """Processes a single sample.
 
     Each sample in the dataset consists of:

diff --git a/tutorials/finetune_adapter.md b/tutorials/finetune_adapter.md
@@ -22,6 +22,8 @@ python scripts/prepare_alpaca.py --checkpoint_dir checkpoints/stabilityai/stable
 
 or [prepare your own dataset](#tune-on-your-dataset).
 
+For more information about dataset preparation, also see the [prepare_dataset.md](./prepare_dataset.md) tutorial.
+
 ## Running the finetuning
 
 ```bash

diff --git a/tutorials/finetune_full.md b/tutorials/finetune_full.md
@@ -14,7 +14,9 @@ The steps here only need to be done once:
 python scripts/prepare_alpaca.py --checkpoint_dir checkpoints/tiiuae/falcon-7b
 ```
 
-or [prepare your own dataset](#tune-on-your-dataset).
+or [prepare your own dataset](#tune-on-your-dataset). 
+
+For more information about dataset preparation, also see the [prepare_dataset.md](./prepare_dataset.md) tutorial.
 
 ## Running the finetuning
 

diff --git a/tutorials/finetune_lora.md b/tutorials/finetune_lora.md
@@ -18,11 +18,13 @@ The steps here only need to be done once:
 
 3. Download the data and generate the instruction tuning dataset:
 
-   ```bash
-   python scripts/prepare_alpaca.py
-   ```
+```bash
+python scripts/prepare_alpaca.py --checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b
+```
+
+or [prepare your own dataset](#tune-on-your-dataset).
 
-(See [this blog article](https://lightning.ai/blog/how-to-finetune-gpt-like-large-language-models-on-a-custom-dataset) for how to prepare and use custom datasets.)
+For more information about dataset preparation, also see the [prepare_dataset.md](./prepare_dataset.md) tutorial.
 
 ## Running the finetuning
 

diff --git a/tutorials/neurips_challenge_quickstart.md b/tutorials/neurips_challenge_quickstart.md
@@ -129,13 +129,11 @@ The following command will download and preprocess the Dolly15k dataset for the
 ```bash
 python scripts/prepare_dolly.py \
   --checkpoint_dir checkpoints/stabilityai/stablelm-base-alpha-3b \
-  --destination_path data/dolly-stablelm3b \
-  --max_seq_length 2048
+  --destination_path data/dolly-stablelm3b
 ```
 
-**Important note**
-
-The preprocessed dataset is specific to the StableLM 3B model. If you use a different model like Falcon or Llama 2 later, you'll need to process the dataset with that model checkpoint directory. This is because each model uses a different tokenizer.
+> [!NOTE]
+> The preprocessed dataset is specific to the StableLM 3B model. If you use a different model like Falcon or Llama 2 later, you'll need to process the dataset with that model checkpoint directory. This is because each model uses a different tokenizer.
 
 &nbsp;
 
@@ -145,6 +143,9 @@ The preprocessed dataset is specific to the StableLM 3B model. If you use a diff
 
 To accelerate this for testing purposes, edit the [./finetune/lora.py](https://github.com/Lightning-AI/lit-gpt/blob/main/finetune/lora.py) script and change `max_iters = 50000` to `max_iters = 500` at the top of the file.
 
+> [!NOTE]
+> The Dolly dataset has a relatively long context length, which could result in out-of-memory issues. The maximum context length that is used for the evaluation, [according to the official competition rules](https://llm-efficiency-challenge.github.io/question), is 2,048 tokens. Hence, it's highly recommended to edit the  [`finetune/lora.py` file](https://github.com/Lightning-AI/lit-gpt/blob/main/finetune/lora.py#L37) and change `override_max_seq_length = None` to `override_max_seq_length = 2048`.
+
 The following command finetunes the model:
 
 ```bash

diff --git a/tutorials/prepare_dataset.md b/tutorials/prepare_dataset.md
@@ -0,0 +1,125 @@
+# Preparing Datasets
+
+Below is a table of all datasets that are currently supported in Lit-GPT:
+
+
+| Name         | Task        | Size                | Reference Repo                                                  | Paper / Blog                                                                                                              | Data License                                                                                                                                                                                                     |
+|--------------|-------------|---------------------|-----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Alpaca       | Finetuning  | 51,759 samples      | [URL](https://github.com/tatsu-lab/stanford_alpaca)             | [URL](https://crfm.stanford.edu/2023/03/13/alpaca.html)                                                                   | Attribution-NonCommercial 4.0 International, [URL](https://crfm.stanford.edu/2023/03/13/alpaca.html)                                                                                                             |
+| Alpaca Libre | Finetuning  | 55,370 samples      | [URL](https://github.com/mobarski/alpaca-libre)                 | -                                                                                                                         | CC0/MIT, [URL](https://github.com/mobarski/alpaca-libre)                                                                                                                                                         |
+| Dolly        | Finetuning  | 15,011 samples      | [URL](https://github.com/databrickslabs/dolly/tree/master/data) | [URL](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm)              | CC-BY-SA, [URL](https://github.com/databrickslabs/dolly#model-overview)                                                                                                                                          |
+| LIMA         | Finetuning  | 1,084 samples       | [URL](https://huggingface.co/datasets/GAIR/lima)                | [URL](https://arxiv.org/abs/2305.11206)                                                                                   | "If the source data of LIMA has a stricter license than CC BY-NC-SA, the LIMA dataset follows the same. Otherwise, it follows the CC BY-NC-SA license", [URL](https://huggingface.co/datasets/GAIR/lima#license) |
+| OpenWebText  | Pretraining | 8,013,769 documents | [URL](https://github.com/jcpeterson/openwebtext)                | [URL](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) | Unspecified                                                                                                                                                                                                      |
+| RedPajama    | Pretraining | 1.2 T tokens        | [URL](https://github.com/togethercomputer/RedPajama-Data)       | [URL](https://together.ai/blog/redpajama-models-v1)                                                                       | Subset-dependent, [URL](https://github.com/togethercomputer/RedPajama-Data#license)                                                                                                                              |
+
+&nbsp;
+
+## Preparing Finetuning Datasets
+
+Note that the dataset needs to be prepared separately for each type of model since the tokenizers used by the models may differ, resulting in slightly different preprocessed datasets.
+
+For the following examples, we will use a Falcon 7B model. However, the same methods are compatible with all other models as well.
+
+The steps here only need to be done once before preparing the finetuning datasets in the following subsections: 
+
+1. Follow the instructions in the [README](../README.md) to install the dependencies.
+2. Download and convert the weights following our [guide](download_falcon.md).
+
+&nbsp;
+
+### Alpaca and Alpaca Libre
+
+&nbsp;
+
+**Alpaca**
+
+The Alpaca dataset consists of 52,000 instructions and demonstrations produced by OpenAI's text-davinci-003 engine. This data is used for instruction tuning, which improves the ability of language models to follow instructions.
+
+In its development, the creators leveraged the data generation methodology from the [Self-Instruct framework](https://github.com/yizhongw/self-instruct).
+
+The original [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) dataset can be prepared as follows:
+
+```bash
+python scripts/prepare_alpaca.py \
+ --checkpoint_dir checkpoints/tiiuae/falcon-7b
+```
+
+&nbsp;
+
+**Alpaca Libre**
+
+[Alpaca Libre](https://github.com/mobarski/alpaca-libre) is a reimplementation of, or alternative to, Alpaca that uses the same formatting.
+
+To use Alpaca Libre instead of the original Alpaca dataset, use the following command:
+
+```bash
+python scripts/prepare_alpaca.py \
+ --checkpoint_dir "checkpoints/tiiuae/falcon-7b" \
+ --data_file_url "https://raw.githubusercontent.com/mobarski/alpaca-libre/main/data/output/alpaca_libre_ok_tasks_v4.json" \
+ --data_file_name "alpaca_libre_data_cleaned_archive.json" \
+ --destination_path "data/alpaca_libre"
+```
+
+&nbsp;
+
+### Dolly
+
+The Dolly dataset is a publicly available collection of 15k instruction-following entries created by Databricks. It spans multiple behavioral domains, as described in the [InstructGPT paper](https://arxiv.org/abs/2203.02155). These include areas like brainstorming, classification, closed QA, content creation, information retrieval, open QA, and summary generation.
+
+The usage is similar to the Alpaca dataset described above. Using Falcon 7B as an example, we can prepare the dataset as follows:
+
+```bash
+python scripts/prepare_dolly.py \
+ --checkpoint_dir "checkpoints/tiiuae/falcon-7b"
+```
+
+&nbsp;
+
+### LIMA
+
+The LIMA dataset is a collection of 1,000 carefully curated prompts and responses, as described in the [LIMA: Less Is More for Alignment](https://arxiv.org/abs/2305.11206) paper. The dataset is sourced from three community Q&A websites: Stack Exchange, wikiHow, and the Pushshift Reddit Dataset. In addition, it also contains prompts and answers written and collected by the authors of the LIMA paper.
+
+The usage is similar to the Dolly dataset described above, except that it requires a Hugging Face access token that you need to copy and paste from your Hugging Face account. Using Falcon 7B as an example, we can prepare the dataset as follows:
+
+```bash
+python scripts/prepare_lima.py \
+ --checkpoint_dir "checkpoints/tiiuae/falcon-7b" \
+ --access_token "insert_your_token_here"
+```
+
+LIMA contains a handful of multiturn conversations. By default, only the first instruction-response pair from
+each of these multiturn conversations is included. If you want to override this behavior and include the follow-up instructions
+and responses, set `--include_multiturn_conversations True`.
+
+
+&nbsp;
+
+**Finetuning After Data Preparation**
+
+After preparing the dataset, you can finetune the model using the [`finetune/*.py`](../finetune/) scripts, for example,
+
+```bash
+python finetune/lora.py \
+ --checkpoint_dir "checkpoints/tiiuae/falcon-7b" \
+ --data_dir "data/alpaca_libre" \
+ --out_dir "out/lora/alpaca"
+```
+
+Please read the [tutorials/finetune_*.md](../tutorials) documents for more information about finetuning models.
+
+> [!IMPORTANT]
+> Make sure that the `prepare_*.py` and `finetune/*.py` scripts use the same model checkpoint specified via `--checkpoint_dir`.
+
+> [!IMPORTANT]
+> By default, the maximum sequence length is obtained from the model configuration file. If you run into out-of-memory errors, especially with LIMA and Dolly,  
+> you can try to lower the context length by editing the [`finetune/lora.py` file](https://github.com/Lightning-AI/lit-gpt/blob/main/finetune/lora.py#L37) and changing `override_max_seq_length = None` to `override_max_seq_length = 2048`.
+
+&nbsp;
+
+## Preparing Pretraining Datasets
+
+In addition to the finetuning datasets described above, Lit-GPT also supports several datasets for pretraining. The pretraining datasets are described in more detail in the following separate tutorial documents:
+
+- [Pretrain Llama 2 on OpenWebText](./pretrain_openwebtext.md)
+- [Pretrain Llama 2 on RedPajama](./pretrain_redpajama.md)
+
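
Reviewer note (not part of the commit): the `prepare_sample` signature reformatted in `scripts/prepare_dolly.py` takes `max_length`, `mask_inputs`, and `ignore_index`, and the new tutorial stresses that preprocessing depends on the tokenizer and on the context length. The sketch below is only an illustration of what such parameters commonly control; it is not the repository's implementation. `ToyTokenizer`, `prepare_sample_sketch`, the prompt template, and the value `-1` for the ignore index are all assumptions made for this example.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical stand-in for a model tokenizer: one id per whitespace-separated word."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def encode(self, text: str, max_length: int) -> List[int]:
        # Assign ids in order of first appearance, then truncate to the budget.
        ids = [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]
        return ids[:max_length]


def prepare_sample_sketch(
    example: dict,
    tokenizer: ToyTokenizer,
    max_length: int,
    mask_inputs: bool,
    ignore_index: int,
) -> dict:
    """Tokenize one instruction/output pair and build the training labels."""
    # Assumed prompt template; the real scripts define their own.
    prompt = f"Instruction: {example['instruction']} Response:"
    full_text = f"{prompt} {example['output']}"

    prompt_ids = tokenizer.encode(prompt, max_length)
    input_ids = tokenizer.encode(full_text, max_length)

    labels = list(input_ids)
    if mask_inputs:
        # Replace prompt positions so a loss function can skip them.
        labels[: len(prompt_ids)] = [ignore_index] * len(prompt_ids)
    return {"input_ids": input_ids, "labels": labels}


if __name__ == "__main__":
    sample = {"instruction": "Say hi", "output": "hi there"}
    print(prepare_sample_sketch(sample, ToyTokenizer(), 16, True, -1))
```

Because the ids come from the tokenizer's vocabulary, a dataset prepared with one tokenizer cannot be reused with another, which is why the tutorial asks for the same `--checkpoint_dir` in both the `prepare_*.py` and `finetune/*.py` steps. Lowering `max_length` shortens every sample, at the cost of cutting off long responses.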