Refactoring of Evaluation and adding of evaluate command #264
Conversation
fast_llm/engine/training/config.py
```python
    hint=FieldHint.feature,
    valid=skip_valid_if_none(check_field(Assert.gt, 0)),
)
class TrainingEvaluatorConfig(EvaluatorConfigBase):
```
It should still inherit from IntervalConfig so it's simpler and we don't have the redundant run_interval.
I prefer composition over multiple inheritance here, as IntervalConfig is a property of the evaluator wrapper (which TrainingEvaluatorConfig is), and not of the new entity derived from EvaluatorConfigBase. In my opinion, this makes the code much more readable.
However, multiple inheritance is already used in many places. So if you still prefer that approach after my explanation, I’m happy to refactor accordingly.
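To make the two options under discussion concrete, here is a minimal sketch. The class and field names follow the thread, but the bodies are simplified stand-ins for the actual Fast-LLM config classes, not the real implementation.

```python
# Hypothetical sketch of the two designs discussed above; bodies are
# simplified stand-ins, not the actual Fast-LLM config classes.

class Config:
    pass

class IntervalConfig(Config):
    def __init__(self, interval: int = 0):
        self.interval = interval

class EvaluatorConfigBase(Config):
    def get_evaluator(self):
        raise NotImplementedError

# Option A: composition -- the interval is a field on the wrapper.
class TrainingEvaluatorConfigComposed(EvaluatorConfigBase):
    def __init__(self, run_interval: IntervalConfig):
        self.run_interval = run_interval

# Option B: multiple inheritance -- the wrapper *is* an IntervalConfig,
# so the separate run_interval field disappears.
class TrainingEvaluatorConfigInherited(IntervalConfig, EvaluatorConfigBase):
    pass

composed = TrainingEvaluatorConfigComposed(IntervalConfig(interval=100))
inherited = TrainingEvaluatorConfigInherited(interval=100)
print(composed.run_interval.interval, inherited.interval)  # -> 100 100
```

Composition keeps the interval clearly scoped to the wrapper; inheritance removes one level of nesting in the resulting config at the cost of a wider MRO.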
Multiple inheritance is only needed because of the EvaluatorConfigBase mixin, which is arguably not even necessary. I'd rather prioritize better usage (configs) over marginally simpler code.
OK, I will use multiple inheritance but will retain EvaluatorConfigBase so I don't need to rewrite EvaluatorRunner. Then we will create a separate config for the evaluate command, in addition to it accepting the training config. We need to discuss what it should look like, so I’ve created an issue for it: #285.
```python
    else 0
)

def get_evaluator(
```
Seems unnecessary, we can just call `evaluator.get_evaluator()`.
I explained this in detail in #222 (comment) and #222 (comment), but in short, it is to maintain proper encapsulation.
There is no encapsulation needed though. TrainingEvaluatorConfig is a fixed class with an `evaluator: EvaluatorConfig` field, which is dynamic but has a well-defined get_evaluator method.
Encapsulation would be needed if we allowed for a more generic scenario where evaluator.get_evaluator doesn't exist or has a different signature, e.g. if we allowed for a more generic evaluator or a generalized TrainingEvaluatorConfig that doesn't have an evaluator. I don't really see this happening anytime soon...
Another thing I can't really work around here is that I need to return TrainingEvaluator and not evaluator.get_evaluator(). This is because TrainingEvaluator is responsible for handling whether evaluators should or should not run during training. Neither the concrete evaluators nor the EvaluatorRunner are aware of this—they simply execute.
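A minimal sketch of the gating responsibility described above; the names follow the thread, but the logic is a simplified stand-in for the actual Fast-LLM implementation:

```python
# Hypothetical sketch: only the wrapper knows the training schedule;
# the wrapped evaluator is unaware of it and simply executes.
from typing import Optional

class Evaluator:
    """A concrete evaluator: it just runs when asked."""
    def run(self, iteration: int) -> str:
        return f"evaluated at iteration {iteration}"

class TrainingEvaluator:
    """Wraps an evaluator and decides *whether* it should run now."""
    def __init__(self, evaluator: Evaluator, run_interval: int):
        self.evaluator = evaluator
        self.run_interval = run_interval

    def run(self, iteration: int) -> Optional[str]:
        # Gate on the training schedule; skip if not on the interval.
        if self.run_interval and iteration % self.run_interval == 0:
            return self.evaluator.run(iteration)
        return None

gate = TrainingEvaluator(Evaluator(), run_interval=100)
print(gate.run(100))  # runs
print(gate.run(101))  # skipped -> None
```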
```python
@config_class()
class EvaluatorConfigBase(Config):
```
I don't think that's necessary
I explained this in detail in #222 (comment) and #222 (comment), but in short, it is to maintain proper encapsulation.
I will move it in #282, so we can close this faster.
```python
# @pytest.mark.extra_slow
@requires_cuda
def test_loss_validation_vs_inference(model_and_tokenizer):
```
I don't think this test is worth it; it's kind of trivial.
✨ Description
Creates an Evaluator abstraction so additional evaluators beyond Loss can be added.
Adds an `evaluate` command that accepts the same training config and enables evaluation on the last checkpoint.
Includes some fixes.
Example: specifying multiple LossEvaluators
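For illustration, a hypothetical config fragment with two LossEvaluators. The exact key names are assumptions for this sketch, not taken from the Fast-LLM schema:

```yaml
# Hypothetical sketch only -- key names are assumptions.
training:
  evaluators:
    validation:
      interval: 100
      evaluator:
        type: loss
        iterations: 25
    stack_3b:
      interval: 100
      evaluator:
        type: loss
        iterations: 25
```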
🔍 Type of change
Select all that apply:
📝 Changes
✅ Checklist
Make sure the following tasks are completed before submitting the PR:
General
Testing