RAFT Support for chat and completion model formats #417

Merged
merged 2 commits into from
May 10, 2024

Conversation

cedricvidal
Contributor

@cedricvidal cedricvidal commented May 8, 2024

Adds support for converting the dataset to the formats expected when fine-tuning `completion` and `chat` models, as specified here:
https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset

`chat` format:

```
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
```

`completion` format:

```
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
```
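As an illustration only (not the PR's actual code), a raw Q/A record can be mapped to the chat fine-tuning format like this; `question` and `answer` are hypothetical field names, since the real RAFT dataset columns may differ:

```python
import json

def to_chat(record, system_prompt):
    """Map a raw Q/A record to the OpenAI chat fine-tuning format.

    `question` and `answer` are hypothetical field names; the actual
    dataset schema may use different column names.
    """
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": record["question"]},
            {"role": "assistant", "content": record["answer"]},
        ]
    }

record = {"question": "What's the capital of France?", "answer": "Paris."}
# Each dataset row becomes one JSONL line.
line = json.dumps(to_chat(record, "Marv is a factual chatbot."))
print(line)
```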

`raft.py`:

- Supports `jsonl` and `parquet` output types
- Supports `hf`, `chat` and `completion` formats
- `chat` format also accepts a `--output-chat-system-prompt` param to configure the system prompt
- Ignores venv folders
- Adds usage to `--help`:

```
  --output-format {hf,completion,chat}
                        Format to convert the dataset to. Defaults to hf.
  --output-type {parquet,jsonl}
                        Type to export the dataset to. Defaults to jsonl.
  --output-chat-system-prompt OUTPUT_CHAT_SYSTEM_PROMPT
                        The system prompt to use when the output format is chat
```
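The three new flags can be declared with stock `argparse` choices and defaults; this is a sketch of the option surface described above, not the PR's actual parser code:

```python
import argparse

# Illustrative parser mirroring the documented flags; raft.py's real
# parser has many more options (--datapath, --doctype, etc.).
parser = argparse.ArgumentParser()
parser.add_argument("--output-format", choices=["hf", "completion", "chat"],
                    default="hf",
                    help="Format to convert the dataset to. Defaults to hf.")
parser.add_argument("--output-type", choices=["parquet", "jsonl"],
                    default="jsonl",
                    help="Type to export the dataset to. Defaults to jsonl.")
parser.add_argument("--output-chat-system-prompt",
                    help="The system prompt to use when the output format is chat")

args = parser.parse_args(["--output-format", "chat",
                          "--output-chat-system-prompt", "You're a RAG AI"])
print(args.output_format, args.output_type)
```

Invalid choices (e.g. `--output-format csv`) fail fast with a usage error, and the defaults match the help text.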

New `format.py` script to convert a dataset previously generated by `raft.py`:

```
$ python format.py --help
usage: format.py [-h] --input INPUT [--input-type {arrow,jsonl}] --output OUTPUT --output-format {hf,completion,chat}
                 [--output-type {parquet,jsonl}] [--output-chat-system-prompt OUTPUT_CHAT_SYSTEM_PROMPT]

options:
  -h, --help            show this help message and exit
  --input INPUT         Input HuggingFace dataset file
  --input-type {arrow,jsonl}
                        Format of the input dataset. Defaults to arrow.
  --output OUTPUT       Output file
  --output-format {hf,completion,chat}
                        Format to convert the dataset to
  --output-type {parquet,jsonl}
                        Type to export the dataset to. Defaults to jsonl.
  --output-chat-system-prompt OUTPUT_CHAT_SYSTEM_PROMPT
                        The system prompt to use when the output format is chat
```
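The core of such a conversion is a small map-and-dump loop. The sketch below shows the `completion` case with hypothetical column names (`question`, `cot_answer`) and an in-memory buffer standing in for the output file; it is illustrative, not the script's actual implementation:

```python
import io
import json

def to_completion(record):
    # Hypothetical column names; the real RAFT schema may differ.
    return {"prompt": record["question"], "completion": record["cot_answer"]}

records = [
    {"question": "When was UC Berkeley founded?",
     "cot_answer": "It was chartered in 1868."},
]

# StringIO stands in for the --output file; each record becomes one JSONL line.
buf = io.StringIO()
for r in records:
    buf.write(json.dumps(to_completion(r)) + "\n")
print(buf.getvalue(), end="")
```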

How to test `format.py` with the `chat` format:

```
python format.py --input output/data-00000-of-00001.arrow \
    --output output/ucb-short.chat.jsonl \
    --output-format chat \
    --output-chat-system-prompt 'You are an AI expert on UC Berkeley'
```

How to test `format.py` with the `completion` format:

```
python format.py --input output/data-00000-of-00001.arrow \
    --output output/ucb-short.completion.jsonl \
    --output-format completion
```

How to test `raft.py` with the `chat` format:

```
python3 raft.py \
    --datapath $PWD/sample_data/UC_Berkeley_short.pdf \
    --output $PWD/output \
    --distractors 3 \
    --doctype pdf \
    --chunk_size 512 \
    --questions 2 \
    --completion_model gpt-4-turbo \
    --embedding_model text-embedding-ada-002 \
    --output-format chat \
    --output-chat-system-prompt "You're a RAG AI"
```

cedricvidal and others added 2 commits May 8, 2024 14:24
- Ignore venv folders
- Supports JSONL and Parquet output types
- Supports hf, chat and completion formats
- chat format also accepts a `--output-chat-system-prompt` param to configure the system prompt
Owner

@ShishirPatil ShishirPatil left a comment


Does not change functionality.

@ShishirPatil ShishirPatil merged commit ae5f0a2 into ShishirPatil:main May 10, 2024
devanshamin pushed a commit to devanshamin/gorilla that referenced this pull request Jul 9, 2024
Co-authored-by: Shishir Patil <30296397+ShishirPatil@users.noreply.github.com>