RAFT Support for chat and completion model formats #417

Merged
merged 2 commits into from
May 10, 2024

Conversation

cedricvidal
Contributor

@cedricvidal cedricvidal commented May 8, 2024

Adds support for converting the dataset to the formats expected when fine-tuning `completion` and `chat` models, as specified here:
https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset

`chat` format:

```
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
```

`completion` format:

```
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
```
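As an illustration only (not the PR's actual code), a raw Q/A record can be mapped to the chat fine-tuning format like this; `question` and `answer` are hypothetical field names, since the real RAFT dataset columns may differ:

```python
import json

def to_chat(record, system_prompt):
    """Map a raw Q/A record to the OpenAI chat fine-tuning format.

    `question` and `answer` are hypothetical field names; the actual
    dataset schema may use different column names.
    """
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": record["question"]},
            {"role": "assistant", "content": record["answer"]},
        ]
    }

record = {"question": "What's the capital of France?", "answer": "Paris."}
# Each dataset row becomes one JSONL line.
line = json.dumps(to_chat(record, "Marv is a factual chatbot."))
print(line)
```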

`raft.py`:

- Supports `jsonl` and `parquet` output types
- Supports `hf`, `chat` and `completion` formats
- `chat` format also accepts a `--output-chat-system-prompt` param to configure the system prompt
- Ignores venv folders
- Adds usage to `--help`:

```
  --output-format {hf,completion,chat}
                        Format to convert the dataset to. Defaults to hf.
  --output-type {parquet,jsonl}
                        Type to export the dataset to. Defaults to jsonl.
  --output-chat-system-prompt OUTPUT_CHAT_SYSTEM_PROMPT
                        The system prompt to use when the output format is chat
```
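The three new flags can be declared with stock `argparse` choices and defaults; this is a sketch of the option surface described above, not the PR's actual parser code:

```python
import argparse

# Illustrative parser mirroring the documented flags; raft.py's real
# parser has many more options (--datapath, --doctype, etc.).
parser = argparse.ArgumentParser()
parser.add_argument("--output-format", choices=["hf", "completion", "chat"],
                    default="hf",
                    help="Format to convert the dataset to. Defaults to hf.")
parser.add_argument("--output-type", choices=["parquet", "jsonl"],
                    default="jsonl",
                    help="Type to export the dataset to. Defaults to jsonl.")
parser.add_argument("--output-chat-system-prompt",
                    help="The system prompt to use when the output format is chat")

args = parser.parse_args(["--output-format", "chat",
                          "--output-chat-system-prompt", "You're a RAG AI"])
print(args.output_format, args.output_type)
```

Invalid choices (e.g. `--output-format csv`) fail fast with a usage error, and the defaults match the help text.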

New `format.py` script to convert a dataset previously generated by `raft.py`:

```
$ python format.py --help
usage: format.py [-h] --input INPUT [--input-type {arrow,jsonl}] --output OUTPUT --output-format {hf,completion,chat}
                 [--output-type {parquet,jsonl}] [--output-chat-system-prompt OUTPUT_CHAT_SYSTEM_PROMPT]

options:
  -h, --help            show this help message and exit
  --input INPUT         Input HuggingFace dataset file
  --input-type {arrow,jsonl}
                        Format of the input dataset. Defaults to arrow.
  --output OUTPUT       Output file
  --output-format {hf,completion,chat}
                        Format to convert the dataset to
  --output-type {parquet,jsonl}
                        Type to export the dataset to. Defaults to jsonl.
  --output-chat-system-prompt OUTPUT_CHAT_SYSTEM_PROMPT
                        The system prompt to use when the output format is chat
```
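The core of such a conversion is a small map-and-dump loop. The sketch below shows the `completion` case with hypothetical column names (`question`, `cot_answer`) and an in-memory buffer standing in for the output file; it is illustrative, not the script's actual implementation:

```python
import io
import json

def to_completion(record):
    # Hypothetical column names; the real RAFT schema may differ.
    return {"prompt": record["question"], "completion": record["cot_answer"]}

records = [
    {"question": "When was UC Berkeley founded?",
     "cot_answer": "It was chartered in 1868."},
]

# StringIO stands in for the --output file; each record becomes one JSONL line.
buf = io.StringIO()
for r in records:
    buf.write(json.dumps(to_completion(r)) + "\n")
print(buf.getvalue(), end="")
```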

How to test `format.py` with the `chat` format:

```
python format.py --input output/data-00000-of-00001.arrow \
    --output output/ucb-short.chat.jsonl \
    --output-format chat \
    --output-chat-system-prompt 'You are an AI expert on UC Berkeley'
```

How to test `format.py` with the `completion` format:

```
python format.py --input output/data-00000-of-00001.arrow \
    --output output/ucb-short.completion.jsonl \
    --output-format completion
```

How to test `raft.py` with the `chat` format:

```
python3 raft.py \
    --datapath $PWD/sample_data/UC_Berkeley_short.pdf \
    --output $PWD/output \
    --distractors 3 \
    --doctype pdf \
    --chunk_size 512 \
    --questions 2 \
    --completion_model gpt-4-turbo \
    --embedding_model text-embedding-ada-002 \
    --output-format chat \
    --output-chat-system-prompt "You're a RAG AI"
```

cedricvidal and others added 2 commits May 8, 2024 14:24
- Ignore venv folders
- Supports JSONL and Parquet output types
- Supports hf, chat and completion formats
- chat format also accepts a `--output-chat-system-prompt` param to configure the system prompt
Owner

@ShishirPatil ShishirPatil left a comment


Does not change functionality.

@ShishirPatil ShishirPatil merged commit ae5f0a2 into ShishirPatil:main May 10, 2024
devanshamin pushed a commit to devanshamin/gorilla that referenced this pull request Jul 9, 2024
Co-authored-by: Shishir Patil <30296397+ShishirPatil@users.noreply.github.com>