
feat: Add LLaMA-3 instruct prompt strategies for fine-tuning #1553

Merged

Conversation

@0-hero (Contributor) commented Apr 20, 2024

Description

This builds on top of and includes the changes in the PRs below.

The FastChat PR from @TJ-Solergibert needs to be merged before this one.

Motivation and Context

Enables fine-tuning models in the LLaMA-3 instruct conversation format, or continuing fine-tuning of LLaMA-3 instruct models.

How has this been tested?

  • Successfully pre-processed a ShareGPT-style conversations dataset into the LLaMA-3 instruct format for fine-tuning
  • Model fine-tuning is in progress

Update

Added changes from TJ's monkeypatch to support the latest FastChat PR lm-sys/FastChat#3259.

@maziyarpanahi (Contributor)

@winglian Any chance this can be merged to unblock the other PR? (thanks for the PR @0-hero)

@TJ-Solergibert (Contributor)

@maziyarpanahi You can install THIS axolotl PR with THIS FastChat PR

@maziyarpanahi (Contributor)

> @maziyarpanahi You can install THIS axolotl PR with THIS FastChat PR

Hi @TJ-Solergibert,
Thanks, I will try that today. I just wanted to be sure they are both finished and ready to be merged before I do that. But it seems they are ready.
Thanks again, I'll give it a shot :)

@upr1ce commented Apr 22, 2024

I think this needs to be updated; at this point it generates some token duplication. This function might need to be changed:

```python
def register_llama3_template(system_message=None):
    system_message = system_message or "You are a helpful assistant."
    bos_token = "<|start_of_conversation|>"
    eos_token = "<|end_of_conversation|>"
    register_conv_template(
        Conversation(
            name="llama3",
            system_template=bos_token + "<|start_header_id|>system<|end_header_id|>\n\n{system_message}" + "<|eot_id|>",
            system_message=system_message,
            roles=["<|start_header_id|>user<|end_header_id|>", "<|start_header_id|>assistant<|end_header_id|>"],
            sep_style=SeparatorStyle.LLAMA3,
            sep="<|eot_id|>",
            sep2="<|start_header_id|>",
        )
    )
```

To (in line with the FastChat implementation):

```python
def register_llama3_template(system_message=None):
    system_message = system_message or "You are a helpful assistant."
    register_conv_template(
        Conversation(
            name="llama3",
            system_template="<|start_header_id|>system<|end_header_id|>\n\n{system_message}" + "<|eot_id|>",
            system_message=system_message,
            roles=["user", "assistant"],
            sep_style=SeparatorStyle.LLAMA3,
            sep="<|eot_id|>",
            sep2="<|start_header_id|>",
        )
    )
```
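
For anyone who wants to eyeball the rendered prompt, here is a minimal sketch using FastChat's public helpers, assuming a FastChat build that already includes SeparatorStyle.LLAMA3 from the PR above:

```python
# Sketch: inspect the prompt produced by the registered "llama3" template.
from fastchat.conversation import get_conv_template

conv = get_conv_template("llama3")
conv.append_message(conv.roles[0], "It is a test")  # user turn
conv.append_message(conv.roles[1], None)            # empty assistant turn -> generation prompt
print(conv.get_prompt())
```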

@0-hero (Contributor, Author) commented Apr 22, 2024

Thanks @upr1ce, yes it needs to be updated. Just waiting for the FastChat merge with the final changes; I'll update this once that happens.

@0-hero (Contributor, Author) commented Apr 24, 2024

@winglian Made the necessary changes; the FastChat PR is merged.

ShareGPT Original

[ { "from": "system", "value": "You are a helpful AI assistant." }, { "from": "user", "value": "Create a nested loop to print every combination of numbers between 0-9, excluding any combination that contains the number 5. Additionally, exclude any combination that contains a repeating digit. Implement the solution without using any built-in functions or libraries to check for repeating digits." }, { "from": "gpt", "value": "Here is an example of a nested loop in Python to print every combination of numbers between 0-9, excluding any combination that contains the number 5 or repeating digits:\n\n```python\nfor i in range(10): # First digit\n for j in range(10): # Second digit\n for k in range(10): # Third digit\n # Checking for the conditions\n if i != 5 and j != 5 and k != 5 and i != j and i != k and j != k:\n print(i, j, k)\n```\n\nThis code will generate and print every combination of three digits between 0-9 that do not contain the number 5 and do not have any repeating digits." } ]

Tokenised LLaMA-3-Instruct Format

````
<|begin_of_text|><|start_of_conversation|><|start_header_id|>system<|end_header_id|>
You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|> Create a nested loop to print every combination of numbers between 0-9, excluding any combination that contains the number 5. Additionally, exclude any combination that contains a repeating digit. Implement the solution without using any built-in functions or libraries to check for repeating digits.<|eot_id|><|start_header_id|>assistant<|end_header_id|> Here is an example of a nested loop in Python to print every combination of numbers between 0-9, excluding any combination that contains the number 5 or repeating digits:

```python
for i in range(10):  # First digit
    for j in range(10):  # Second digit
        for k in range(10):  # Third digit
            # Checking for the conditions
            if i!= 5 and j!= 5 and k!= 5 and i!= j and i!= k and j!= k:
                print(i, j, k)
```

This code will generate and print every combination of three digits between 0-9 that do not contain the number 5 and do not have any repeating digits.<|eot_id|><|end_of_text|>
````

The LLaMA-3 instruct format can be used with this config:

```yaml
datasets:
  - path: bjoernp/Vezora_Tested-22k-Python-Alpaca-sharegpt-filtered
    type: sharegpt
    conversation: llama3
```

Let me know if any changes are required

@maziyarpanahi (Contributor)

@0-hero isn't the actual Llama-3 template (as they have set in their HF tokenizer config) like this:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Who are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I am an AI<|eot_id|><|start_header_id|>user<|end_header_id|>

What's your name?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```

Condensed:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI am an AI<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat's your name?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
```

```python
if message:
    yield f"<|start_header_id|>{role}<|end_header_id|>\n\n", f"{message.strip()}<|eot_id|>"
else:
    yield f"<|start_header_id|>{role}<|end_header_id|>\n\n", ""
```
Review comment:
`return` is missing here.

Review comment:

As @ryj0902 pointed out, when there is no system_message, the BOS token is not added at the moment. Should we do something like the following?

```python
if self.sep_style == SeparatorStyle.LLAMA3:
    if self.system_message:
        # For llama3, the system message is NOT incorporated into the first human instruction
        # All messages follow <|start_header_id|> + role + <|end_header_id|>\n\n + message + <|eot_id|>
        yield "", system_prompt
    for i, (role, message) in enumerate(self.messages):
        if message:
            role_header = f"<|start_header_id|>{role}<|end_header_id|>\n\n"
            if i == 0:
                yield "<|begin_of_text|>" + role_header, f"{message.strip()}<|eot_id|>"
            else:
                yield role_header, f"{message.strip()}<|eot_id|>"
        else:
            yield f"<|start_header_id|>{role}<|end_header_id|>\n\n", ""
    return
```

Review comment:

Unfortunately that's not going to work; it will lead to duplication of <|begin_of_text|>. Example:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

asdf<|eot_id|><|begin_of_text|><|start_header_id|>user<|end_header_id|>

It is a test<|eot_id|><|start_header_id|>assistant<|end_header_id|>

WTF<|eot_id|><|end_of_text|>
```

@upr1ce commented Apr 24, 2024

After the changes, it will generate correct output. Example (formatted):

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

Can you please book a flight for me from New York to Los Angeles?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

I'm sorry, but I'm unable to assist with booking flights.<|eot_id|><|end_of_text|>
```

```python
if self.system_message:
    # For llama3, the system message is NOT incorporated into the first human instruction
    # All messages follow <|start_header_id|> + role + <|end_header_id|>\n\n + message + <|eot_id|>
    yield "", "<|begin_of_text|>" + system_prompt
```
Review comment from @upr1ce (Apr 24, 2024):
This should be `yield "", system_prompt`; otherwise <|begin_of_text|> will be replicated twice, as BOS tokens are added automatically by axolotl. Sorry about the next message; I did not mark it correctly, so it was visible only to me.
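
For reference, a minimal sketch of how the LLAMA3 branch reads with that fix applied (my consolidation of the snippets above, assuming axolotl prepends <|begin_of_text|> itself, so the generator never emits it):

```python
if self.sep_style == SeparatorStyle.LLAMA3:
    if self.system_message:
        # The system message is NOT folded into the first user turn; every message
        # follows <|start_header_id|> + role + <|end_header_id|>\n\n + message + <|eot_id|>
        yield "", system_prompt
    for role, message in self.messages:
        if message:
            yield f"<|start_header_id|>{role}<|end_header_id|>\n\n", f"{message.strip()}<|eot_id|>"
        else:
            yield f"<|start_header_id|>{role}<|end_header_id|>\n\n", ""
    return
```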

Review comment from @0-hero (Author):

I was just working on that, but even with `yield "", system_prompt` I'm seeing <|begin_of_text|> twice. Trying to figure out what's going on there.

Review comment from @upr1ce (Apr 24, 2024):

I think that's a cache issue; clear your Hugging Face datasets cache and it should be fine, something like `rm -r ~/.cache/huggingface/datasets/`, depending on your OS.

@0-hero (Contributor, Author) commented Apr 24, 2024

Thanks @upr1ce, works as expected now:

````
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Create a nested loop to print every combination of numbers between 0-9, excluding any combination that contains the number 5. Additionally, exclude any combination that contains a repeating digit. Implement the solution without using any built-in functions or libraries to check for repeating digits.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Here is an example of a nested loop in Python to print every combination of numbers between 0-9, excluding any combination that contains the number 5 or repeating digits:

```python
for i in range(10):  # First digit
    for j in range(10):  # Second digit
        for k in range(10):  # Third digit
            # Checking for the conditions
            if i!= 5 and j!= 5 and k!= 5 and i!= j and i!= k and j!= k:
                print(i, j, k)
```

This code will generate and print every combination of three digits between 0-9 that do not contain the number 5 and do not have any repeating digits.<|eot_id|><|end_of_text|>
````

Sorry, I missed the last 2 changes when I switched to the 2nd FastChat PR.
@maziyarpanahi @upr1ce this should be ready to use now.

@0-hero (Contributor, Author) commented Apr 26, 2024

@upr1ce I reran the workflows here: https://github.com/0-hero/axolotl/actions

@ryj0902 commented Apr 26, 2024

Shouldn't the bos_token be added regardless of whether the system prompt is present?
Is the instruct prompt implementation now exclusive to the single-message example?


According to the current implementation, if there is no system message (or it is an empty string), the BOS token is not added. Was this intentional?
Or are cases with no system message very rare, and am I just using an unusual use case as an example?

Case 1, system message exists:

```json
{"conversations": [{"from": "system", "value": "asdf"}, {"from": "user", "value": "It is a test"}, {"from": "assistant", "value": "C"}]}
```

Result:

```
[2024-04-26 18:15:33,159] [INFO] [axolotl.check_example_labels:37] [PID:30126] [RANK:0] <|begin_of_text|>(-100, 128000) <|start_header_id|>(-100, 128006) system(-100, 9125) <|end_header_id|>(-100, 128007)

(-100, 271) asdf(-100, 77715) <|eot_id|>(-100, 128009) <|start_header_id|>(-100, 128006) user(-100, 882) <|end_header_id|>(-100, 128007)

(-100, 271) It(-100, 2181)  is(-100, 374)  a(-100, 264)  test(-100, 1296) <|eot_id|>(-100, 128009) <|start_header_id|>(-100, 128006) assistant(-100, 78191) <|end_header_id|>(-100, 128007)

(271, 271) C(34, 34) <|eot_id|>(128009, 128009) <|end_of_text|>(128001, 128001)
```

Case 2, system message is empty:

```json
{"conversations": [{"from": "system", "value": ""}, {"from": "user", "value": "It is a test"}, {"from": "assistant", "value": "C"}]}
```

Result:

```
[2024-04-26 18:18:23,201] [INFO] [axolotl.check_example_labels:37] [PID:32252] [RANK:0] <|start_header_id|>(-100, 128006) user(-100, 882) <|end_header_id|>(-100, 128007)

(-100, 271) It(-100, 2181)  is(-100, 374)  a(-100, 264)  test(-100, 1296) <|eot_id|>(-100, 128009) <|start_header_id|>(-100, 128006) assistant(-100, 78191) <|end_header_id|>(-100, 128007)

(271, 271) C(34, 34) <|eot_id|>(128009, 128009) <|end_of_text|>(128001, 128001)
```

Case 3, system message does not exist:

```json
{"conversations": [{"from": "user", "value": "It is a test"}, {"from": "assistant", "value": "C"}]}
```

Result:

```
[2024-04-26 18:19:28,071] [INFO] [axolotl.check_example_labels:37] [PID:33302] [RANK:0] <|start_header_id|>(-100, 128006) user(-100, 882) <|end_header_id|>(-100, 128007)

(-100, 271) It(-100, 2181)  is(-100, 374)  a(-100, 264)  test(-100, 1296) <|eot_id|>(-100, 128009) <|start_header_id|>(-100, 128006) assistant(-100, 78191) <|end_header_id|>(-100, 128007)

(271, 271) C(34, 34) <|eot_id|>(128009, 128009) <|end_of_text|>(128001, 128001)
[2024-04-26 18:19:28,071] [INFO] [axolotl.check_example_labels:38] [PID:33302] [RANK:0]
```
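
Side note on reading these logs: each pair is (label, token_id), and -100 matches PyTorch's default ignore_index for the loss, so prompt tokens are masked and only the assistant tokens are trained on. A minimal illustration, with the token ids taken from the logs above:

```python
import torch
import torch.nn.functional as F

# -100 is the default ignore_index of cross_entropy, so positions
# labeled -100 (the prompt) contribute nothing to the loss.
logits = torch.randn(4, 128256)                  # 4 positions, llama-3 vocab size
labels = torch.tensor([-100, -100, 34, 128009])  # masked, masked, "C", <|eot_id|>
loss = F.cross_entropy(logits, labels)           # averaged over the 2 unmasked positions
print(loss.item())
```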

Review comment:
Should we pass system_message as one of the keyword arguments?


@winglian (Collaborator) commented May 8, 2024

Here is how llama-3-8b-instruct tokenizes:

```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello, how are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Well, how about yourself?<|eot_id|>
```
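
This should be reproducible with the stock tokenizer; a sketch, assuming access to the gated meta-llama repo on the Hugging Face Hub:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "Well, how about yourself?"},
]
# tokenize=False returns the formatted string rather than token ids
print(tok.apply_chat_template(messages, tokenize=False))
```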

@upr1ce commented May 9, 2024

I am having trouble adjusting it to work in all 3 cases mentioned by @ryj0902. Cases 1 and 3 are easy to fix; case 2 seems weird. If the system message is empty, I am not able to force the inclusion of <|begin_of_text|> without breaking the other cases, e.g. leading to repetition of <|begin_of_text|>. Any help is appreciated @winglian @NanoCode012

@ryj0902 commented May 9, 2024

It would be great if the code could handle every case a user might enter, but I am also aware that case 2 is a very rare edge case.
As for case 2, I honestly don't know what format would be most appropriate, but wouldn't it be possible to respond with a guideline, warning, or assert? A sketch of that idea follows.
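
A minimal sketch of the warning idea, written as a hypothetical standalone guard; the function name and placement are illustrative, not part of this PR:

```python
import warnings

def warn_on_empty_system_message(system_message):
    # Hypothetical guard for case 2: an empty system message currently means
    # <|begin_of_text|> is never emitted, so surface that instead of failing silently.
    if not system_message:
        warnings.warn(
            "llama3 template: empty/missing system message; "
            "<|begin_of_text|> may be omitted from the tokenized prompt."
        )
```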

@winglian force-pushed the feature/llama-3-instruct-support branch from 2470724 to 73a03cd on May 9, 2024 18:55
@winglian (Collaborator) commented May 9, 2024

@0-hero I rebased this against main and added a sharegpt tokenization test. I believe that in order to tokenize properly, you have to include this in your configuration; otherwise it includes an <|end_of_text|> token after each assistant turn:

```yaml
special_tokens:
  eos_token: "<|eot_id|>"
```
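
Putting that together with the dataset stanza from earlier in the thread, a config sketch would look like this (dataset path as in the example above; adjust to your own data):

```yaml
datasets:
  - path: bjoernp/Vezora_Tested-22k-Python-Alpaca-sharegpt-filtered
    type: sharegpt
    conversation: llama3

special_tokens:
  eos_token: "<|eot_id|>"
```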

@maziyarpanahi (Contributor)

> @0-hero I rebased this against main and added a sharegpt tokenization test. I believe that in order to tokenize properly, you have to include this in your configuration; otherwise it includes an <|end_of_text|> token after each assistant turn:
>
> ```yaml
> special_tokens:
>   eos_token: "<|eot_id|>"
> ```

This is true! Unfortunately, instead of fixing their tokenizer_config and changing the eos_token to <|eot_id|>, Meta offered a workaround, which is to include <|eot_id|> in the terminators (stop strings). Thanks for catching this; I'm not sure how much impact it has on the fine-tune.

@MoonRide303

> @0-hero I rebased this against main and added a sharegpt tokenization test. I believe that in order to tokenize properly, you have to include this in your configuration; otherwise it includes an <|end_of_text|> token after each assistant turn:
>
> ```yaml
> special_tokens:
>   eos_token: "<|eot_id|>"
> ```
>
> This is true! Unfortunately, instead of fixing their tokenizer_config and changing the eos_token to <|eot_id|>, Meta offered a workaround, which is to include <|eot_id|> in the terminators (stop strings). Thanks for catching this; I'm not sure how much impact it has on the fine-tune.

They've just fixed the official tokenizer config in the Meta-Llama-3-8B-Instruct repo today:
https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/tokenizer_config.json#L2055
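
A quick way to confirm the fix locally (sketch, assuming access to the gated repo):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(tok.eos_token)  # expect "<|eot_id|>" with the updated tokenizer_config
```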

@winglian force-pushed the feature/llama-3-instruct-support branch from 73a03cd to 9ae5649 on May 10, 2024 14:44
@winglian merged commit 50421c8 into OpenAccess-AI-Collective:main May 11, 2024
7 checks passed
@0-hero (Contributor, Author) commented May 11, 2024

@winglian thanks for taking over! I had to go on a short medical break.
