Add benchmark reporting tool and mistral example #225

Closed
wants to merge 4 commits from the benchmarking-suite branch

Conversation

riccardofelluga
Collaborator

@riccardofelluga commented Apr 18, 2024

What does this PR do?

As part of #183, this fixes #224 by introducing an abseil-based template that can be used to set up benchmarks and report results in the form of JSON files. This PR also includes an example benchmark of Mistral.

How does this work?

This contribution consists of two main parts:

  • It defines a template for creating standalone benchmarking scripts and provides the functionality to record info and metrics. At the end of the run, the recorded information is stored in the form of a JSON file.
  • The Benchmarking class provides a place where practitioners can register previously written scripts and parametrize them. Doing so makes it possible to automate running benchmarks.

Benchmark scripts template

The idea here is that developers can write scripts with the following structure:

from absl.testing import absltest
from benchmark_scripts import Target

class MyBenchmarkClass(Target):
    def setUp(self):
        super().setUp()
        # setup benchmark here

    def tearDown(self) -> None:
        super().tearDown()
        # teardown for the benchmark just ran

    def test_my_benchmark(self):
        # code to be benchmarked
        pass

if __name__ == "__main__":
    absltest.main()

The reason behind having a separate script for each benchmark is that it helps with isolation: running multiple benchmarks in the same process can interfere with some metrics, like peak memory.
Within this class practitioners can report metrics and parameters with the methods report_metrics and report_flags respectively (a sketch of these calls follows the example output below). The reported metrics, together with commit/version information, are then saved in the form of a JSON file like the following:

{
    "test_class": "MistralBenchmark",
    "timestamp": "2024-04-19_10-30-23",
    "head": "0534668a",
    "succeeded": true,
    "iter_time": 93333096.0,
    "max_memory_allocated": 5.098717696,
    "input_len": 4096,
    "batch_size": 1,
    "max_iters": 100,
    "warmup_iters": 20,
    "compile": "eager"
}

(example from mistral.py run with python mistral.py --compile=eager --input_len=4096 --batch_size=1)
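As a rough sketch of how these reporting calls could look inside a test method (the exact signatures of report_flags and report_metrics are not shown in this description, so the dictionary-based call shapes and the timing code below are assumptions):

import time

from absl.testing import absltest
from benchmark_scripts import Target

class MyBenchmarkClass(Target):
    def test_my_benchmark(self):
        # Hypothetical call shapes: record the configuration used for this run,
        # then the metrics measured around the benchmarked code.
        self.report_flags({"compile": "eager", "input_len": 4096, "batch_size": 1})

        start = time.perf_counter_ns()
        # ... code to be benchmarked ...
        self.report_metrics({"iter_time": time.perf_counter_ns() - start})

if __name__ == "__main__":
    absltest.main()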

Benchmark automation

The Benchmarking class in the benchmark_scripts.py file provides a way to automate and parametrize the previously described scripts. In particular, practitioners can register and parametrize a script by adding a method to this class that starts with test_:

    @parameterized.product(
        compile=(
            "eager",
            "inductor",
            "thunder",
        ),
        input_len=(4_096,),
        batch_size=(1,),
    )
    def test_my_script(self, **kwargs):
        self.run_standalone_script("my_benchmark.py", kwargs)

The parameters are then passed as arguments when the my_benchmark.py file is called, as sketched below.
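As a rough sketch of what that dispatch could look like (the actual run_standalone_script in benchmark_scripts.py may differ; the subprocess handling and flag formatting below are assumptions):

import subprocess
import sys

def run_standalone_script(script: str, kwargs: dict) -> None:
    # Turn {"compile": "thunder", "input_len": 4096, ...} into
    # ["--compile=thunder", "--input_len=4096", ...] and run the script in its
    # own process so metrics such as peak memory stay isolated per benchmark.
    args = [f"--{name}={value}" for name, value in kwargs.items()]
    subprocess.run([sys.executable, script, *args], check=True)

For the parametrization above this would invoke, for example, python my_benchmark.py --compile=thunder --input_len=4096 --batch_size=1, matching the flag format shown in the mistral.py example.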

Design decisions

In both cases I've used abseil's built-in argument parser for convenience.

Collaborator

would you have plans to generalize this script to other files in the thunder/benchmarks directory?

Collaborator Author

Yes, if it can be useful for the team I think that would be a great path to continue on. For now this is more about seeing how the pieces fit together and then improving over time.

Comment on lines +35 to +37
def _get_git_revision(self) -> str:
cmd = ["git", "rev-parse", "--short", "HEAD"]
return subprocess.check_output(cmd).decode("ascii").strip()
Collaborator

I thought it'd make sense to return either thunder's version or the git commit hash
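One hedged way to combine the two (a sketch only, not the PR's implementation; the distribution name lightning-thunder is an assumption):

import subprocess
from importlib import metadata

def _get_revision() -> str:
    # Prefer the short git hash when running from a checkout; otherwise fall
    # back to the installed package version (distribution name assumed).
    try:
        cmd = ["git", "rev-parse", "--short", "HEAD"]
        return subprocess.check_output(cmd).decode("ascii").strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return metadata.version("lightning-thunder")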

Comment on lines +112 to +113
def test_mistral(self, **kwargs):
self.run_standalone_script("mistral.py", kwargs)
Collaborator

it seems that this requires us to run this script inside of the thunder/benchmarks directory?

Collaborator Author

It's true! Let me fix that so that we can run from other dirs



def get_model(vocab_size: int) -> GPT:
config = Config.from_name("Mistral-7B-v0.1")
Collaborator

would there be any other mistral models that we might want to run?

Collaborator Author

That's actually a good question! I think this can be parametrized so each version of the model gets run on a clean env.
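A minimal sketch of what that parametrization could look like in the Benchmarking class (the extra model name is hypothetical and would need a matching model_name flag in mistral.py that gets forwarded to Config.from_name):

    @parameterized.product(
        compile=("eager", "inductor", "thunder"),
        model_name=("Mistral-7B-v0.1", "Mistral-7B-Instruct-v0.2"),  # hypothetical list
        input_len=(4_096,),
        batch_size=(1,),
    )
    def test_mistral(self, **kwargs):
        self.run_standalone_script("mistral.py", kwargs)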

from thunder.tests.litgpt_model import Config, GPT
from transformers import AutoTokenizer

flags.DEFINE_string("compile", default="eager", help="Specify compile option: thunder|inductor|eager")
Collaborator

would there be an equivalent of argparse's choices?

Collaborator Author

Yes, it can be done with flags.DEFINE_multi_string("multi_choice", ["a", "b", "c"], ...)
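For reference, absl also provides flags.DEFINE_enum, which restricts a flag to a fixed set of allowed values and is closer to argparse's choices; a minimal sketch for the compile flag defined above:

from absl import flags

flags.DEFINE_enum(
    "compile",
    default="eager",
    enum_values=["eager", "inductor", "thunder"],
    help="Specify compile option",
)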

Collaborator

At a glance I'm not sure why this needs to be a separate file from https://github.com/Lightning-AI/lightning-thunder/blob/649c3d71de9af13a3247694a4e23289221d0ff72/thunder/benchmarks/benchmark_litgpt.py given the use of lit_gpt.

Could you kindly educate me?

Collaborator Author

You're right, they are similar. I think the main difference is that that script is not meant to be run nightly. At the moment the Mistral script is just an example of benchmarking and reporting metrics for a model, the main idea being that later on we can integrate these tools with more models and more metrics.
If, later on when the benchmarking tools are more solid, we still see that these files are very similar and serve the same purpose, I agree that we could try to unify them.

filename = os.path.join(out_dir, f"{self._name}_{self._start_time}.json")
git_revision = self._get_git_revision()

with open(filename, mode="w") as json_report:
Collaborator

What does an example output look like?

Collaborator Author

Below is an example output; I'll add it to the description of the PR too:

{
    "test_class": "MistralBenchmark",
    "timestamp": "2024-04-19_10-30-23",
    "head": "0534668a",
    "succeeded": true,
    "iter_time": 93333096.0,
    "max_memory_allocated": 5.098717696,
    "input_len": 4096,
    "batch_size": 1,
    "max_iters": 100,
    "warmup_iters": 20,
    "compile": "eager"
}

@mruberry
Collaborator

A notebook explaining how to add a benchmark, and another notebook showing how to run and understand the results of a benchmark, would be very interesting. They don't need to be in this PR. That said, this PR could have more comments explaining how to use its functionality.

@riccardofelluga
Collaborator Author

I've updated the description of the PR; let me know if there is something else you would like to know that I might have missed :)

model.to(device=self.device)
self.model = setup_compile(model)

self.mixed_ctx = torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16)
Collaborator

why would autocast be enabled, though the model is already bfloat16?



if __name__ == "__main__":
absltest.main()
Collaborator

would this support distributed or only single-device runs?

Comment on lines +85 to +90
@classmethod
def setUpClass(cls):
super().setUpClass()

if flags.FLAGS.output_dir:
os.makedirs(flags.FLAGS.output_dir, exist_ok=True)
Collaborator

it feels like this should be setUp, not setUpClass, as the former is called once per test method while the latter runs once per class.
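For illustration only (not code from the PR), a minimal sketch of the difference:

from absl.testing import absltest

class Example(absltest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Runs once for the whole test class.
        super().setUpClass()

    def setUp(self):
        # Runs again before every individual test_* method.
        super().setUp()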

Comment on lines +92 to +94
@classmethod
def tearDownClass(cls) -> None:
super().tearDownClass()
Collaborator

could you tell me what's expected to be defined in tearDownClass or tearDown?

@riccardofelluga
Collaborator Author

Closing for re-evaluation

@t-vi deleted the benchmarking-suite branch July 16, 2024 12:48
Successfully merging this pull request may close these issues.

Benchmarking suite that runs scripts