RAFT Support for chat and completion model formats #417
Merged
Conversation
- Ignore venv folders
- Supports JSONL and Parquet output types
- Supports hf, chat and completion formats
- chat format also accepts a `--output-chat-system-prompt` param to configure the system prompt
ShishirPatil approved these changes on May 10, 2024:
Does not change functionality.
devanshamin pushed a commit to devanshamin/gorilla that referenced this pull request on Jul 9, 2024:
Adds support for converting the dataset to the formats expected for fine-tuning `completion` and `chat` models, as specified here: https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset

`chat` format:

```
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}
```

`completion` format:

```
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
```

`raft.py`:

- Supports `jsonl` and `parquet` output types
- Supports `hf`, `chat` and `completion` formats
- `chat` format also accepts a `--output-chat-system-prompt` param to configure the system prompt
- Ignores venv folders
- Added usage to `--help`:

```
--output-format {hf,completion,chat}
                      Format to convert the dataset to. Defaults to hf.
--output-type {parquet,jsonl}
                      Type to export the dataset to. Defaults to jsonl.
--output-chat-system-prompt OUTPUT_CHAT_SYSTEM_PROMPT
                      The system prompt to use when the output format is chat
```

New `format.py` script to convert a dataset previously generated by `raft.py`:

```
$ python format.py --help
usage: format.py [-h] --input INPUT [--input-type {arrow,jsonl}] --output OUTPUT
                 --output-format {hf,completion,chat} [--output-type {parquet,jsonl}]
                 [--output-chat-system-prompt OUTPUT_CHAT_SYSTEM_PROMPT]

options:
  -h, --help            show this help message and exit
  --input INPUT         Input HuggingFace dataset file
  --input-type {arrow,jsonl}
                        Format of the input dataset. Defaults to arrow.
  --output OUTPUT       Output file
  --output-format {hf,completion,chat}
                        Format to convert the dataset to
  --output-type {parquet,jsonl}
                        Type to export the dataset to. Defaults to jsonl.
  --output-chat-system-prompt OUTPUT_CHAT_SYSTEM_PROMPT
                        The system prompt to use when the output format is chat
```

How to test `format.py` with the `chat` format:

```
python format.py --input output/data-00000-of-00001.arrow \
    --output output/ucb-short.chat.jsonl \
    --output-format chat \
    --output-chat-system-prompt 'You are an AI expert on UC Berkeley'
```

How to test `format.py` with the `completion` format:

```
python format.py --input output/data-00000-of-00001.arrow \
    --output output/ucb-short.completion.jsonl \
    --output-format completion
```

How to test `raft.py` with the `chat` format:

```
python3 raft.py \
    --datapath $PWD/sample_data/UC_Berkeley_short.pdf \
    --output $PWD/output \
    --distractors 3 \
    --doctype pdf \
    --chunk_size 512 \
    --questions 2 \
    --completion_model gpt-4-turbo \
    --embedding_model text-embedding-ada-002 \
    --output-format chat \
    --output-chat-system-prompt "You're a RAG AI"
```

Co-authored-by: Shishir Patil <30296397+ShishirPatil@users.noreply.github.com>
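For readers wanting a feel for what the conversion in `format.py` does, here is a minimal sketch of mapping one dataset row to the `chat` and `completion` shapes shown above. The field names `question` and `cot_answer` are assumptions for illustration; the actual column names in the RAFT-generated dataset may differ.

```python
import json

# Hypothetical sketch of the hf -> chat / completion conversion described
# in this PR. Column names ("question", "cot_answer") are assumed, not
# taken from the real RAFT schema.

def to_chat(row, system_prompt=None):
    """Build one `chat`-format record, optionally prefixed with a system message
    (the role of the --output-chat-system-prompt flag)."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": row["question"]})
    messages.append({"role": "assistant", "content": row["cot_answer"]})
    return {"messages": messages}

def to_completion(row):
    """Build one `completion`-format record: a prompt/completion pair."""
    return {"prompt": row["question"], "completion": row["cot_answer"]}

def write_jsonl(rows, path, fmt, system_prompt=None):
    """Serialize rows to a JSONL file in the requested output format."""
    convert = (lambda r: to_chat(r, system_prompt)) if fmt == "chat" else to_completion
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(convert(row)) + "\n")
```

This mirrors the one-JSON-object-per-line layout of the examples above; the system prompt is only injected when the `chat` format is selected, which matches the flag being documented as chat-only.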