Add benchmark reporting tool and mistral example #225
Conversation
would you have plans to generalize this script to other files of the thunder/benchmarks directory?
Yes, if it can be useful for the team I think that would be a great path to continue on. For now this is more about seeing how the pieces fit together and then improving over time.
def _get_git_revision(self) -> str:
    cmd = ["git", "rev-parse", "--short", "HEAD"]
    return subprocess.check_output(cmd).decode("ascii").strip()
I thought it'd make sense to return either thunder's version or the git commit hash
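For illustration, a rough sketch of what that could look like (the thunder.__version__ attribute and the fallback order are assumptions, not something this PR implements):

```python
import subprocess


def _get_revision() -> str:
    # Hypothetical fallback: report the installed thunder version if available,
    # otherwise the short git hash of the current checkout.
    try:
        import thunder  # assumed to expose __version__
        return thunder.__version__
    except (ImportError, AttributeError):
        cmd = ["git", "rev-parse", "--short", "HEAD"]
        return subprocess.check_output(cmd).decode("ascii").strip()
```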
def test_mistral(self, **kwargs):
    self.run_standalone_script("mistral.py", kwargs)
it seems that this requires us to run this script inside of the thunder/benchmarks directory?
It's true! Let me fix that so that we can run from other dirs.
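One possible way to do that (a sketch only; it assumes the runner launches each benchmark script in its own subprocess, with run_standalone_script named after the quoted hunk above):

```python
import os
import subprocess
import sys

# Resolve scripts relative to this file instead of the current working directory,
# so the runner also works when invoked from outside thunder/benchmarks.
BENCHMARKS_DIR = os.path.dirname(os.path.abspath(__file__))


def run_standalone_script(script_name: str, extra_args=()) -> None:
    script_path = os.path.join(BENCHMARKS_DIR, script_name)
    subprocess.check_call([sys.executable, script_path, *extra_args])
```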
def get_model(vocab_size: int) -> GPT:
    config = Config.from_name("Mistral-7B-v0.1")
would there be any other mistral models that we might want to run?
That's actually a good question! I think this can be parametrized so each version of the model gets run on a clean env.
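A sketch of how that parametrization might look (the flag name and the simplified get_model signature are illustrative, not what the PR currently does):

```python
from absl import flags
from thunder.tests.litgpt_model import Config, GPT

# Hypothetical flag so each run can pick a different Mistral (or other litgpt) config.
flags.DEFINE_string("model_name", default="Mistral-7B-v0.1", help="litgpt config name to benchmark")


def get_model() -> GPT:
    config = Config.from_name(flags.FLAGS.model_name)
    return GPT(config)
```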
from thunder.tests.litgpt_model import Config, GPT
from transformers import AutoTokenizer

flags.DEFINE_string("compile", default="eager", help="Specify compile option: thunder|inductor|eager")
would there be an equivalent of argparse's choices?
Yes, it can be done with flags.DEFINE_multi_string("multi_choice", ["a", "b", "c"], ...).
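For reference, absl also provides flags.DEFINE_enum, which rejects values outside a fixed list and is probably the closest match to argparse's choices; a minimal sketch:

```python
from absl import flags

# DEFINE_enum restricts the flag to one of the listed values, similar to argparse's choices.
flags.DEFINE_enum("compile", "eager", ["thunder", "inductor", "eager"],
                  "Specify compile option: thunder|inductor|eager")
```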
At a glance I'm not sure why this needs to be a separate file from https://github.com/Lightning-AI/lightning-thunder/blob/649c3d71de9af13a3247694a4e23289221d0ff72/thunder/benchmarks/benchmark_litgpt.py given the use of lit_gpt.
Could you kindly educate me?
You're right, they are similar. I think the main difference is that that script is not meant to be run nightly. At the moment the Mistral script is just an example of benchmarking and reporting metrics for a model, the main idea being that later on we can integrate these tools with more models and more metrics.
If later on, when the benchmarking tools are more solid, we still see that these files are very similar and serve the same purpose, I agree that we could try to unify them.
filename = os.path.join(out_dir, f"{self._name}_{self._start_time}.json")
git_revision = self._get_git_revision()

with open(filename, mode="w") as json_report:
What does an example output look like?
Below is an example output; I'll add it to the description of the PR too:
{
"test_class": "MistralBenchmark",
"timestamp": "2024-04-19_10-30-23",
"head": "0534668a",
"succeeded": true,
"iter_time": 93333096.0,
"max_memory_allocated": 5.098717696,
"input_len": 4096,
"batch_size": 1,
"max_iters": 100,
"warmup_iters": 20,
"compile": "eager"
}
A notebook explaining how to add a benchmark, and another notebook showing how to run and understand the results of a benchmark, would be very interesting. They don't need to be in this PR. That said, this PR could have more comments explaining how to use its functionality.
I've updated the description of the PR, let me know if there is something else that you would like to know that I might have missed :)
model.to(device=self.device)
self.model = setup_compile(model)

self.mixed_ctx = torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16)
why would autocast be enabled, though the model is already bfloat16?
if __name__ == "__main__":
    absltest.main()
would this support distributed or only single device run?
@classmethod
def setUpClass(cls):
    super().setUpClass()

    if flags.FLAGS.output_dir:
        os.makedirs(flags.FLAGS.output_dir, exist_ok=True)
it feels like this should be setUp, not setUpClass, as the former is called for each test method while the latter runs only once per class.
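Roughly what the suggested change could look like (a sketch only; the output_dir flag definition here is just for illustration):

```python
import os

from absl import flags
from absl.testing import absltest

flags.DEFINE_string("output_dir", default=None, help="Where to write JSON reports")


class Benchmarking(absltest.TestCase):
    def setUp(self):
        # Runs before every test_* method (setUpClass would run only once per class).
        super().setUp()
        if flags.FLAGS.output_dir:
            os.makedirs(flags.FLAGS.output_dir, exist_ok=True)
```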
@classmethod
def tearDownClass(cls) -> None:
    super().tearDownClass()
could you tell me what's expected to be defined in tearDownClass or tearDown?
Closing for re-evaluation
What does this PR do?
As part of #183, this fixes #224 by introducing an abseil-based template that can be used to set up benchmarks and report results in the form of JSON files. This PR also includes an example benchmark of Mistral.
How does this work?
This contribution consists of two main parts: a benchmark scripts template, and a Benchmarking class that provides a place where practitioners can register previously written scripts and parametrize them, making it possible to automate running benchmarks.

Benchmark scripts template

The idea here is that developers can write scripts with the following structure:
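(A simplified sketch of such a script; the actual base class and the exact report_metrics/report_flags signatures in this PR may differ.)

```python
from absl import flags
from absl.testing import absltest

flags.DEFINE_string("compile", default="eager", help="Specify compile option: thunder|inductor|eager")


class MistralBenchmark(absltest.TestCase):  # the PR's template base class may differ
    # Stubs standing in for the reporting helpers provided by the template.
    def report_flags(self):
        pass

    def report_metrics(self, metrics):
        pass

    def test_benchmark(self):
        # ... build the model, run warmup and measured iterations ...
        self.report_flags()                      # record the flag values used
        self.report_metrics({"iter_time": 0.0})  # record the measured metrics


if __name__ == "__main__":
    absltest.main()
```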
The reason behind having a separate script for each benchmark is that it helps with isolation. Running multiple benchmarks in the same process can interfere with some metrics like peak memory.
Within this class practitioners can report metrics and parameters with the methods report_metrics and report_flags respectively. These reported metrics, together with commit/version information, will then be saved in the form of a JSON file like the example shown earlier in this conversation (taken from mistral.py run with python mistral.py --compile=eager --input_len=4096 --batch_size=1).

Benchmark automation
The Benchmarking class in the benchmark_scripts.py file provides a way to automate and parametrize the previously described scripts. In particular, practitioners can register and parametrize a script by adding a method to this class that starts with test_; the parameters are then passed as arguments when calling the my_benchmark.py file. For example:
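(A sketch mirroring the test_mistral registration quoted earlier in this conversation; my_benchmark.py is a placeholder name, and run_standalone_script is the helper this PR adds.)

```python
from absl.testing import absltest


class Benchmarking(absltest.TestCase):  # simplified; defined in benchmark_scripts.py in this PR
    def test_my_benchmark(self, **kwargs):
        # The keyword arguments are forwarded as command-line flags when my_benchmark.py is launched.
        self.run_standalone_script("my_benchmark.py", kwargs)
```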
Design decisions
In both cases I've used abseil's built-in argument parser for convenience.