Add benchmark reporting tool and mistral example #225
Conversation
would you have plans to generalize this script to other files of the thunder/benchmarks directory?
Yes, if it can be useful for the team I think that would be a great path to continue on. For now this is more about seeing how the pieces fit together and then improving over time.
def _get_git_revision(self) -> str:
    cmd = ["git", "rev-parse", "--short", "HEAD"]
    return subprocess.check_output(cmd).decode("ascii").strip()
I thought it'd make sense to return either thunder's version or the git commit hash
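For illustration, a rough sketch of what that could look like (the thunder.__version__ attribute and the fallback order are assumptions, not something this PR implements):

```python
import subprocess


def _get_revision() -> str:
    # Hypothetical fallback: report the installed thunder version if available,
    # otherwise the short git hash of the current checkout.
    try:
        import thunder  # assumed to expose __version__
        return thunder.__version__
    except (ImportError, AttributeError):
        cmd = ["git", "rev-parse", "--short", "HEAD"]
        return subprocess.check_output(cmd).decode("ascii").strip()
```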
def test_mistral(self, **kwargs):
    self.run_standalone_script("mistral.py", kwargs)
it seems that this requires us to run this script inside of the thunder/benchmarks directory?
It's true! Let me fix that so that we can run from other dirs.
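One possible way to do that (a sketch only; it assumes the runner launches each benchmark script in its own subprocess, with run_standalone_script named after the quoted hunk above):

```python
import os
import subprocess
import sys

# Resolve scripts relative to this file instead of the current working directory,
# so the runner also works when invoked from outside thunder/benchmarks.
BENCHMARKS_DIR = os.path.dirname(os.path.abspath(__file__))


def run_standalone_script(script_name: str, extra_args=()) -> None:
    script_path = os.path.join(BENCHMARKS_DIR, script_name)
    subprocess.check_call([sys.executable, script_path, *extra_args])
```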
def get_model(vocab_size: int) -> GPT:
    config = Config.from_name("Mistral-7B-v0.1")
would there be any other mistral models that we might want to run?
That's actually a good question! I think this can be parametrized so each version of the model gets run on a clean env.
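A sketch of how that parametrization might look (the flag name and the simplified get_model signature are illustrative, not what the PR currently does):

```python
from absl import flags
from thunder.tests.litgpt_model import Config, GPT

# Hypothetical flag so each run can pick a different Mistral (or other litgpt) config.
flags.DEFINE_string("model_name", default="Mistral-7B-v0.1", help="litgpt config name to benchmark")


def get_model() -> GPT:
    config = Config.from_name(flags.FLAGS.model_name)
    return GPT(config)
```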
from thunder.tests.litgpt_model import Config, GPT
from transformers import AutoTokenizer

flags.DEFINE_string("compile", default="eager", help="Specify compile option: thunder|inductor|eager")
would there be an equivalent of argparse's choices?
Yes, it can be done with flags.DEFINE_multi_string("multi_choice", ["a", "b", "c"], ...).
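For reference, absl also provides flags.DEFINE_enum, which rejects values outside a fixed list and is probably the closest match to argparse's choices; a minimal sketch:

```python
from absl import flags

# DEFINE_enum restricts the flag to one of the listed values, similar to argparse's choices.
flags.DEFINE_enum("compile", "eager", ["thunder", "inductor", "eager"],
                  "Specify compile option: thunder|inductor|eager")
```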
At a glance I'm not sure why this needs to be a separate file from https://github.com/Lightning-AI/lightning-thunder/blob/649c3d71de9af13a3247694a4e23289221d0ff72/thunder/benchmarks/benchmark_litgpt.py given the use of lit_gpt.
Could you kindly educate me?
You're right, they are similar. I think the main difference is that that script is not meant to be run nightly. At the moment the Mistral script is just an example of benchmarking and reporting metrics for a model, the main idea being that later on we can integrate these tools with more models and more metrics.
If later on, when the benchmarking tools are more solid, we still see that these files are very similar and serve the same purpose, I agree that we could try to unify them.
filename = os.path.join(out_dir, f"{self._name}_{self._start_time}.json")
git_revision = self._get_git_revision()

with open(filename, mode="w") as json_report:
What does an example output look like?
Below is an example output; I'll add it to the description of the PR too:
{
"test_class": "MistralBenchmark",
"timestamp": "2024-04-19_10-30-23",
"head": "0534668a",
"succeeded": true,
"iter_time": 93333096.0,
"max_memory_allocated": 5.098717696,
"input_len": 4096,
"batch_size": 1,
"max_iters": 100,
"warmup_iters": 20,
"compile": "eager"
}
A notebook explaining how to add a benchmark, and another notebook showing how to run and understand the results of a benchmark, would be very interesting. They don't need to be in this PR. That said, this PR could have more comments explaining how to use its functionality.
I've updated the description of the PR, let me know if there is something else that you would like to know that I might have missed :)
model.to(device=self.device)
self.model = setup_compile(model)

self.mixed_ctx = torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16)
why would autocast be enabled, though the model is already bfloat16?
if __name__ == "__main__":
    absltest.main()
would this support distributed or only single device run?
@classmethod
def setUpClass(cls):
    super().setUpClass()

    if flags.FLAGS.output_dir:
        os.makedirs(flags.FLAGS.output_dir, exist_ok=True)
it feels like this should be setUp, not setUpClass, as the former is called for each test method while the latter runs only once per class.
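Roughly what the suggested change could look like (a sketch only; the output_dir flag definition here is just for illustration):

```python
import os

from absl import flags
from absl.testing import absltest

flags.DEFINE_string("output_dir", default=None, help="Where to write JSON reports")


class Benchmarking(absltest.TestCase):
    def setUp(self):
        # Runs before every test_* method (setUpClass would run only once per class).
        super().setUp()
        if flags.FLAGS.output_dir:
            os.makedirs(flags.FLAGS.output_dir, exist_ok=True)
```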
@classmethod
def tearDownClass(cls) -> None:
    super().tearDownClass()
could you tell me what's expected to be defined in tearDownClass or tearDown?
Closing for re-evaluation
What does this PR do?
As part of #183, this fixes #224 by introducing an abseil-based template that can be used to set up benchmarks and report results in the form of JSON files. This PR also includes an example benchmark of Mistral.
How does this work?
This contribution consists of two main parts: a benchmark scripts template, and a Benchmarking class that provides a place where practitioners can register previously written scripts and parametrize them, making it possible to automate running benchmarks.

Benchmark scripts template

The idea here is that developers can write scripts with the following structure:
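(A simplified sketch of such a script; the actual base class and the exact report_metrics/report_flags signatures in this PR may differ.)

```python
from absl import flags
from absl.testing import absltest

flags.DEFINE_string("compile", default="eager", help="Specify compile option: thunder|inductor|eager")


class MistralBenchmark(absltest.TestCase):  # the PR's template base class may differ
    # Stubs standing in for the reporting helpers provided by the template.
    def report_flags(self):
        pass

    def report_metrics(self, metrics):
        pass

    def test_benchmark(self):
        # ... build the model, run warmup and measured iterations ...
        self.report_flags()                      # record the flag values used
        self.report_metrics({"iter_time": 0.0})  # record the measured metrics


if __name__ == "__main__":
    absltest.main()
```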
The reason behind having a separate script for each benchmark is that it helps with isolation. Running multiple benchmarks in the same process can interfere with some metrics like peak memory.
Within this class practitioners can report metrics and parameters with the methods report_metrics and report_flags respectively. These reported metrics, together with commit/version information, will then be saved in the form of a JSON file like the example shown earlier in this conversation (taken from mistral.py run with python mistral.py --compile=eager --input_len=4096 --batch_size=1).

Benchmark automation
The Benchmarking class in the benchmark_scripts.py file provides a way to automate and parametrize the previously described scripts. In particular, practitioners can register and parametrize a script by adding a method to this class that starts with test_; the parameters are then passed as arguments when calling the my_benchmark.py file. For example:
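(A sketch mirroring the test_mistral registration quoted earlier in this conversation; my_benchmark.py is a placeholder name, and run_standalone_script is the helper this PR adds.)

```python
from absl.testing import absltest


class Benchmarking(absltest.TestCase):  # simplified; defined in benchmark_scripts.py in this PR
    def test_my_benchmark(self, **kwargs):
        # The keyword arguments are forwarded as command-line flags when my_benchmark.py is launched.
        self.run_standalone_script("my_benchmark.py", kwargs)
```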
Design decisions
In both cases I've used abseil's built-in argument parser for convenience.