
Conversation

@lllAlexanderlll (Contributor) commented Jul 1, 2025

What does this PR do?

This PR adds support for instruction tuning by:

  1. Introducing a new entry point data apply_chat_template --config_file_path config_files/training/config_lorem_ipsum_sft.yaml, which takes an instruction dataset and converts each structured conversation into a single prompt by applying the chat template given as a Jinja2 template string within the config. Here, we also include indicator tokens to mark the system utterances (see the sketch after this list).
  2. In the modalities training entry point, you can now wrap the collate function in a "LossMaskingCollateFn", which first executes the wrapped collate function and then applies loss masking to each target as specified in the config. This way, only tokens that are part of the assistant's answer enter the loss, so that the model learns to act as a helpful assistant.
  3. Modifying the PackedMemMapDatasetContinuous to allow not re-using the last target token, as this is not wanted in instruction tuning, where we apply truncation and packing.
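
To make step 1 concrete, here is a minimal sketch of what applying such a Jinja2 chat template to one structured conversation could look like. The template string, the role names, and the indicator tokens (^ and $ below) are made-up placeholders for illustration, not the values from the actual config:

```python
# Illustrative sketch only: renders a structured conversation into a single
# prompt and wraps the assistant answers in (hypothetical) indicator tokens,
# so a downstream collate function can find the spans to keep in the loss.
from jinja2 import Template

# Hypothetical chat template; the real one lives as a Jinja2 string in the YAML config.
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{{ message['role'] }}: "
    "{% if message['role'] == 'assistant' %}^{{ message['content'] }}$"
    "{% else %}{{ message['content'] }}{% endif %}\n"
    "{% endfor %}"
)

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is instruction tuning?"},
    {"role": "assistant", "content": "Fine-tuning a model on conversations."},
]

# The entry point would store the rendered string as the "chat" attribute
# of each JSONL record.
chat = Template(CHAT_TEMPLATE).render(messages=conversation)
print(chat)
```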

General Changes

  • New entry point data apply_chat_template --config_file_path config_files/training/config_lorem_ipsum_sft.yaml to convert structured JSONL into JSONL with the new attribute "chat", i.e. the prompt where the chat template was applied
  • A wrapper for collate functions that keeps only the tokens appearing between indicator tokens in the loss (sketched below)
  • A new parameter for the PackedMemMapDatasetContinuous that allows not re-using the last target token
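
The following is a minimal sketch of the loss-masking idea behind the collate-function wrapper, not the actual LossMaskingCollateFn implementation; the indicator token ids and the helper name mask_targets are hypothetical:

```python
# Illustrative sketch: after the wrapped collate function has built the batch,
# every target token outside the spans delimited by the begin/end indicator
# token ids is replaced by an ignore index, so it does not contribute to the loss.
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default


def mask_targets(targets: torch.Tensor, begin_id: int, end_id: int) -> torch.Tensor:
    masked = torch.full_like(targets, IGNORE_INDEX)
    for row in range(targets.size(0)):
        inside = False
        for col in range(targets.size(1)):
            token = targets[row, col].item()
            if token == begin_id:
                inside = True
            elif token == end_id:
                inside = False
            elif inside:
                masked[row, col] = token  # keep tokens between the indicators
    return masked


# Example: begin_id=1 and end_id=2 delimit the span to keep.
targets = torch.tensor([[5, 1, 7, 8, 2, 9]])
print(mask_targets(targets, begin_id=1, end_id=2))
# tensor([[-100, -100,    7,    8, -100, -100]])
```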

Breaking Changes

  • None: the default value for PackedMemMapDatasetContinuous.reuse_last_target is True, which preserves the previous behaviour (see the slicing sketch below)
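
As an illustration of what the new parameter changes (this is my reading of the semantics, not the actual dataset code): since targets are the inputs shifted by one, with reuse_last_target=True consecutive blocks overlap by one token, so the last target token of block i reappears as the first input token of block i+1; with False the blocks are disjoint, which is what truncated-and-packed instruction data needs:

```python
# Hypothetical sketch of the block-slicing semantics of reuse_last_target.
def block_starts(num_tokens: int, block_size: int, reuse_last_target: bool) -> list[int]:
    # Overlapping blocks re-use one token at each boundary; disjoint blocks do not.
    step = block_size - 1 if reuse_last_target else block_size
    return list(range(0, num_tokens - block_size + 1, step))


print(block_starts(10, 4, reuse_last_target=True))   # [0, 3, 6]
print(block_starts(10, 4, reuse_last_target=False))  # [0, 4]
```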

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py). They do not: the warmstart tutorial on main fails; all other tests run through.
  • I have updated the internal changelog (CHANGELOG_DEV.md)

rrutmann and others added 30 commits July 15, 2024 13:18
…data. Change symbol for special tokens, which are actually a single token within the vocab.
Co-authored-by: Alexander Weber <alex.a.weber@gmx.de>
@lllAlexanderlll changed the title from "Draft: Instruction tuning 2025" to "Instruction tuning 2025" on Jul 2, 2025
@lllAlexanderlll requested a review from le1nux on July 2, 2025 13:07
@le1nux (Member) left a comment:

first round of comments

@le1nux (Member) left a comment:

Fabulous work! I especially like the clean implementation and smooth integration.

Apart from some minor comments, I was wondering if we should do the dataset splitting in an extra step. This way the interfaces and configs would become simpler: you only apply the chat template to a single file. If you want more, you just split the JSONL file a priori and call the endpoint multiple times.

What do you think?

@lllAlexanderlll (Contributor, Author) commented:

Thanks Max!
I addressed all your comments. We now use FSDP2 and convert the DCP checkpoints to torch format before generating text with them in the tutorial. I also removed the duplicated files that were present in the tutorial and updated comments and docs as you requested. Thanks for your help!

@le1nux (Member) left a comment:

💯 LGTM :)

@lllAlexanderlll merged commit bd649de into main on Jul 3, 2025
3 checks passed

Labels

enhancement (New feature or request)

Projects

None yet

Development

Successfully merging this pull request may close these issues:

  • Tutorial for instruction tuning of a HF-imported model
  • Support for standard instruction-tuning

4 participants