Support fine-tuning LLaMA3? #264

Closed
cnlinxi opened this issue Jun 17, 2024 · 4 comments

@cnlinxi

cnlinxi commented Jun 17, 2024

Great project!

I tried to fine-tune LLaMA3-8b with training/bash/run_ds3.sh. However, while debugging I found that the following line has no effect for LLaMA-3. Is this expected?

tokenizer.add_eos_token = True
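
For concreteness, this is roughly how I checked it (the model path and prompt below are only illustrative, not the exact script code):

from transformers import AutoTokenizer

# Illustrative repro; the model path is just an example, not the script's config.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tok.add_eos_token = True  # the same attribute the training code sets
ids = tok("hello world").input_ids
print(ids[-1] == tok.eos_token_id)  # False in my run, i.e. no eos token appended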

Thanks in advance for your reply.

@timturing
Contributor

Thank you for opening this issue! I have tested the model on our side, and it appears to be working well.

Here are the input_ids of the first example from the Alpaca dataset:
[screenshot of the tokenized input_ids]

The LLaMA3-8b tokenizer configuration is as follows:

128000: AddedToken("<|begin_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
128001: AddedToken("<|end_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),

This indicates that the eos_token has been added successfully.

The training loss is also within the normal range:

{'loss': 1.5548, 'grad_norm': 18.375739200079522, 'learning_rate': 1e-05, 'epoch': 0.0}    0%|                                                     | 1/1200 [00:13<4:20:01, 13.01s/it]
{'loss': 1.3182, 'grad_norm': 10.779239457063062, 'learning_rate': 1e-05, 'epoch': 0.0}    0%|                                                     | 2/1200 [00:20<3:19:55, 10.01s/it]

There is one possible reason it might not be working for you: the training pipeline saves the tokenized dataset directly to a .pth file and reloads it on later runs. We added this to save time when re-training on large datasets. However, if you switch to a completely new model or a new tokenizer, the cached data may no longer be compatible (e.g., the eos_token might not match, as in this case).

If this is your situation, try deleting the .pth data file and retraining the model. We are working on a fix for this.
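
For illustration, the caching logic is roughly the following sketch (the path and function name are placeholders, not the exact code in this repo):

import os
import torch

CACHE_PATH = "data/tokenized_dataset.pth"  # placeholder path

def load_or_tokenize(tokenizer, raw_texts):
    # Reuse the cached token ids if present; otherwise tokenize and cache them.
    if os.path.exists(CACHE_PATH):
        return torch.load(CACHE_PATH)  # may have been produced by a different tokenizer
    data = [tokenizer(text).input_ids for text in raw_texts]
    torch.save(data, CACHE_PATH)
    return data

So any change to the tokenizer (such as eos handling) only takes effect after the cached file is deleted.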

Please let me know if you continue to experience the problem.

@cnlinxi
Author

cnlinxi commented Jun 19, 2024

@timturing
Thanks for your reply.
In fact, I found a similar issue: huggingface/transformers#30947

the tokenizer for Llama3 is a PreTrainedTokenizerFast, not the LLamaTokenizer or a LlamaTokenizerFast. Though it might actually be good to support an easy way to add bos and eos. Currently what you have to do is update the TemplateProcessor which is fairly annoying (not beginner friendly).
huggingface/transformers#30947 (comment)
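
For reference, the workaround from that thread boils down to replacing the fast tokenizer's post-processor. A minimal sketch of what I did (the template and model path are from my setup and may need adjusting):

from transformers import AutoTokenizer
from tokenizers.processors import TemplateProcessing

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
bos, eos = tok.bos_token, tok.eos_token  # "<|begin_of_text|>", "<|end_of_text|>"

# Override the post-processor of the backing `tokenizers` object so that
# eos is appended (and bos kept); tokenizer.add_eos_token has no effect here.
tok.backend_tokenizer.post_processor = TemplateProcessing(
    single=f"{bos} $A {eos}",
    pair=f"{bos} $A {eos} {bos} $B {eos}",
    special_tokens=[(bos, tok.bos_token_id), (eos, tok.eos_token_id)],
)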

I fine-tuned LLaMA3-8b following the solution in that issue (otherwise the fine-tuned model does not stop generating properly), but there are still some strange prefixes at inference time, which may also be related to the tokenizer.

[screenshot of generated output showing the unexpected prefix]

Also, with the method from that issue, the eos token is always added. That is, neither

tokenizer.add_eos_token = True

nor

tokenizer.add_eos_token = False

has any effect.

How should this be solved? Thank you.

@timturing
Contributor

Thank you for providing the additional information. I have observed the issue as well. As you mentioned, this cannot be resolved with our current code. Should the transformers team develop a solution, we will promptly update our code to incorporate it.

@cnlinxi
Author

cnlinxi commented Jun 30, 2024

OK, closing this issue. Looking forward to the fix.

@cnlinxi cnlinxi closed this as completed Jun 30, 2024