
Conversation

@lllAlexanderlll (Contributor) commented Jul 1, 2025

What does this PR do?

This PR adds support for instruction tuning by:

  1. Introducing a new entry point data apply_chat_template --config_file_path config_files/training/config_lorem_ipsum_sft.yaml, which takes an instruction dataset and converts each structured conversation into a single prompt by applying the chat template given as a Jinja2 template string within the config. Here, we also include indicator tokens to mark the system utterances (see the sketch after this list).
  2. In the modalities training entry point, you can now wrap the collate function in a "LossMaskingCollateFn", which first executes the wrapped collate function and then applies loss masking to each target as specified in the config. This way, only tokens that are part of the assistant's answer enter the loss, so that the model learns to act as a helpful assistant.
  3. Modifying the PackedMemMapDatasetContinuous to allow not re-using the last target token, as this is not wanted in instruction tuning, where we apply truncation and packing.
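
To make step 1 concrete, here is a minimal sketch of what applying such a Jinja2 chat template to one structured conversation could look like. The template string, the role names, and the indicator tokens (^ and $ below) are made-up placeholders for illustration, not the values from the actual config:

```python
# Illustrative sketch only: renders a structured conversation into a single
# prompt and wraps the assistant answers in (hypothetical) indicator tokens,
# so a downstream collate function can find the spans to keep in the loss.
from jinja2 import Template

# Hypothetical chat template; the real one lives as a Jinja2 string in the YAML config.
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{{ message['role'] }}: "
    "{% if message['role'] == 'assistant' %}^{{ message['content'] }}$"
    "{% else %}{{ message['content'] }}{% endif %}\n"
    "{% endfor %}"
)

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is instruction tuning?"},
    {"role": "assistant", "content": "Fine-tuning a model on conversations."},
]

# The entry point would store the rendered string as the "chat" attribute
# of each JSONL record.
chat = Template(CHAT_TEMPLATE).render(messages=conversation)
print(chat)
```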

General Changes

  • New entry point data apply_chat_template --config_file_path config_files/training/config_lorem_ipsum_sft.yaml to convert structured JSONL into JSONL with the new attribute "chat", i.e. the prompt where the chat template was applied
  • A wrapper for collate functions that keeps only the tokens appearing between indicator tokens in the loss (sketched below)
  • A new parameter for the PackedMemMapDatasetContinuous that allows not re-using the last target token
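
The following is a minimal sketch of the loss-masking idea behind the collate-function wrapper, not the actual LossMaskingCollateFn implementation; the indicator token ids and the helper name mask_targets are hypothetical:

```python
# Illustrative sketch: after the wrapped collate function has built the batch,
# every target token outside the spans delimited by the begin/end indicator
# token ids is replaced by an ignore index, so it does not contribute to the loss.
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default


def mask_targets(targets: torch.Tensor, begin_id: int, end_id: int) -> torch.Tensor:
    masked = torch.full_like(targets, IGNORE_INDEX)
    for row in range(targets.size(0)):
        inside = False
        for col in range(targets.size(1)):
            token = targets[row, col].item()
            if token == begin_id:
                inside = True
            elif token == end_id:
                inside = False
            elif inside:
                masked[row, col] = token  # keep tokens between the indicators
    return masked


# Example: begin_id=1 and end_id=2 delimit the span to keep.
targets = torch.tensor([[5, 1, 7, 8, 2, 9]])
print(mask_targets(targets, begin_id=1, end_id=2))
# tensor([[-100, -100,    7,    8, -100, -100]])
```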

Breaking Changes

  • None: the default value for PackedMemMapDatasetContinuous.reuse_last_target is True, which preserves the previous behaviour (see the slicing sketch below)
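
As an illustration of what the new parameter changes (this is my reading of the semantics, not the actual dataset code): since targets are the inputs shifted by one, with reuse_last_target=True consecutive blocks overlap by one token, so the last target token of block i reappears as the first input token of block i+1; with False the blocks are disjoint, which is what truncated-and-packed instruction data needs:

```python
# Hypothetical sketch of the block-slicing semantics of reuse_last_target.
def block_starts(num_tokens: int, block_size: int, reuse_last_target: bool) -> list[int]:
    # Overlapping blocks re-use one token at each boundary; disjoint blocks do not.
    step = block_size - 1 if reuse_last_target else block_size
    return list(range(0, num_tokens - block_size + 1, step))


print(block_starts(10, 4, reuse_last_target=True))   # [0, 3, 6]
print(block_starts(10, 4, reuse_last_target=False))  # [0, 4]
```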

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py). They do not: the warmstart tutorial on main fails; all other tests run through.
  • I have updated the internal changelog (CHANGELOG_DEV.md)

rrutmann and others added 30 commits July 15, 2024 13:18
…data. Change symbol for special tokens, which are actually a single token within the vocab.
Co-authored-by: Alexander Weber <alex.a.weber@gmx.de>
@lllAlexanderlll changed the title from "Draft: Instruction tuning 2025" to "Instruction tuning 2025" on Jul 2, 2025
@lllAlexanderlll requested a review from le1nux on July 2, 2025 13:07
@le1nux (Member) left a comment:

first round of comments

@le1nux (Member) left a comment:

Fabulous work! I especially like the clean implementation and smooth integration.

Apart from some minor comments, I was wondering if we should do the dataset splitting in an extra step. This way the interfaces and configs would become simpler: you only apply the chat template to a single file. If you want more, you just split the JSONL file a priori and call the endpoint multiple times.

What do you think?

@lllAlexanderlll (Contributor, Author) commented:

Thanks Max!
I addressed all your comments. We now use FSDP2 and convert the DCP checkpoints to torch format before generating text with them in the tutorial. I also removed the duplicated files that were present in the tutorial and updated comments and docs as you requested. Thanks for your help!

@le1nux (Member) left a comment:

💯 LGTM :)

@lllAlexanderlll merged commit bd649de into main on Jul 3, 2025
3 checks passed

Labels

enhancement (New feature or request)

Projects

None yet

Development

Successfully merging this pull request may close these issues:

  • Tutorial for instruction tuning of a HF-imported model
  • Support for standard instruction-tuning

4 participants