Checkpoint merge script #466
Conversation
We definitely need to thoroughly test this before approving.
Ok @sweinbach, I did some pretty extensive testing / debugging of this, and had to change a fair few things to get it to work at all, but I think this is about as good as we're going to get it, for reasons I'll explain further down in the comment (unless I'm missing some bug or error in my implementation). Stuff I changed:
Results: So, the results aren't great. Since, as I mentioned, the duplicated parameters across all ranks are not actually all the same, the merge does result in a loss of accuracy. It's not small, but it's not model-breaking, and I suspect the accuracy could possibly be recovered with a bit of extra tuning.

These are the eval results on lambada for the base model:

{
  "lambada": {
    "ppl": 4.843396407429969,
    "acc": 0.6955171744614788
  }
}

for the merged model (zeroth partition):

{
  "lambada": {
    "ppl": 5.400162748158698,
    "acc": 0.6751406947409276
  }
}

and for the merged model with averaged parameters:

{
  "lambada": {
    "ppl": 5.391328941501356,
    "acc": 0.6794100523966622
  }
}

Differences: The differences between the replicated parameters are mostly very small. These are the summed elementwise differences between replicated parameters in 2 model-parallel ranks of a 20B model:

diff('input_layernorm.weight').sum()
# tensor(-0.0024, dtype=torch.float16)
diff('attention.dense.bias').sum()
# tensor(-0.0009, dtype=torch.float16)
diff('post_attention_layernorm.weight').sum()
# tensor(-0.0088, dtype=torch.float16)

I think it's worth talking to the Megatron and/or DeepSpeed devs to see if they've had a similar problem. If the difference between replicated parameters is something fixable, we can integrate the fix into neox. In theory, data parallelism should result in all these params being equal, but maybe there's something I'm missing.

Anyway, the script works, and I think this can be merged, but maybe we should log a warning that you'll get accuracy loss. Additionally, there are certain settings I know this won't work for at all (geglu, for example), so we should also throw an error for known bad settings.
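The drift check above can be sketched in a few lines. This is a minimal, hypothetical helper, not the script's actual code: it takes two already-loaded state dicts standing in for two model-parallel rank checkpoints and sums the elementwise difference of parameters that data parallelism should keep identical. The key names follow the checkpoint layout discussed above; the tensors are toy stand-ins.

```python
import torch

def replicated_param_diffs(sd_rank0, sd_rank1, replicated_keys):
    """Summed elementwise difference of parameters that should be
    identical across model-parallel ranks (hypothetical helper)."""
    return {k: (sd_rank0[k] - sd_rank1[k]).sum().item() for k in replicated_keys}

# Toy state dicts standing in for two model-parallel rank checkpoints
sd0 = {"input_layernorm.weight": torch.tensor([1.00, 2.00])}
sd1 = {"input_layernorm.weight": torch.tensor([1.01, 2.00])}

diffs = replicated_param_diffs(sd0, sd1, ["input_layernorm.weight"])
# A nonzero sum here means the replicas have drifted apart
```

In practice one would load the rank checkpoints with torch.load and run this over every non-model-parallel key before deciding whether averaging is safe.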
Thank you @sdtblck. Points all taken and valid. I can confirm the observation that the all_equal assumption does not hold for larger models. I checked smaller models, where this does not seem to be an issue. Thanks for the effort!
@sweinbach anything else to add, or would you say this is ready to merge?
tools/merge.py
Outdated
args.pipe_parallel,
args.output_dir,
args.global_step,
args.layernorm_mean,
This line was throwing an error for me because layernorm_mean is not in the args list.
Should be parameter_mean
tools/merge.py
Outdated
# save modified config
with open(output_configs_dir / "config.yml", "w") as f:
    json.dump(config, f, indent=4)
There may be a subtle bug here. In my original config I had a parameter like
"eps": 1.0e-08
This script rewrote that value to
"eps": 1e-08
which led to the value being parsed as a string, not a float. Not sure why that's the case, but just FYI.
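The first half of this is easy to reproduce with the standard library: json.dump writes floats using Python's shortest repr, which drops the ".0" from 1.0e-08. The second half is a YAML quirk: YAML 1.1 loaders such as PyYAML only resolve exponent literals that contain a dot as floats, so the rewritten "1e-08" round-trips out of the config as a string. A minimal sketch of the json side:

```python
import json

# json.dump serializes floats with Python's shortest repr,
# so 1.0e-08 comes back out as "1e-08" (no dot before the exponent).
config = {"eps": 1.0e-08}
text = json.dumps(config)
print(text)  # {"eps": 1e-08}

# json itself round-trips fine; it's the YAML 1.1 float resolver
# (used when the config is later *loaded* as YAML) that treats a
# dotless "1e-08" as a plain string rather than a float.
assert json.loads(text)["eps"] == 1.0e-08
```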
Ugh. We ran into this issue before and I thought we had fixed it. It's possible our "fix" was ultimately to write 1.0 everywhere, though…
Ah, I think if we save it out with yaml instead of json.dump (it is a yaml file, after all) that should fix it. Bit of an oversight there, nice catch!
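The suggested fix can be sketched as below, assuming PyYAML is available. PyYAML's float representer inserts the missing ".0" before the exponent when dumping, so the value survives a save/load round trip as a float:

```python
import yaml  # PyYAML (third-party), assumed available

# Dump the config with yaml.safe_dump instead of json.dump so
# scalars are written in a form the YAML loader resolves back
# to their original types.
config = {"eps": 1.0e-08}
text = yaml.safe_dump(config)

# PyYAML writes "eps: 1.0e-08", which loads back as a float,
# unlike the dotless "1e-08" that json.dump produces.
assert isinstance(yaml.safe_load(text)["eps"], float)
```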
if partition_dim is None:
    print(module_name)
    # just take the 0th partition for non model-parallel weights
    if parameter_mean:
        out_sd[module_name] = torch.mean(torch.stack(partitions), dim=0)
    else:
        out_sd[module_name] = partitions[0]
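For context, the branch above handles replicated weights; sharded weights take the other path and are concatenated along their partition dimension. A toy sketch of both paths, with partitions, partition_dim, and parameter_mean mirroring the names in the snippet (the merge function itself is a stand-in, not the script's actual code):

```python
import torch

def merge(partitions, partition_dim, parameter_mean=True):
    """Merge model-parallel shards of one parameter (illustrative sketch)."""
    if partition_dim is None:
        # Replicated weight: average the copies, or take rank 0's copy
        if parameter_mean:
            return torch.mean(torch.stack(partitions), dim=0)
        return partitions[0]
    # Sharded weight: concatenate along the partition dimension
    return torch.cat(partitions, dim=partition_dim)

# Toy shards from two model-parallel ranks
partitions = [torch.tensor([[1.0, 2.0]]), torch.tensor([[3.0, 4.0]])]

averaged = merge(partitions, None)   # elementwise mean of the replicas
sharded = merge(partitions, 0)       # full (2, 2) weight, rows recombined
```

Averaging rather than taking the zeroth partition is what produced the slightly better lambada numbers reported above.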
If gpt_j_residual is true, shouldn't attention.dense.bias and mlp.dense_4h_to_h.bias also be multiplied (elementwise) by the number of partitions? This line:

gpt-neox/megatron/model/transformer.py, line 626 in d7af1e7:
output = residual + self.reduce(output)

seems to be an allreduce sum that is performed AFTER those biases are added, so it's effectively multiplying those replicated biases by the number of partitions. I think these biases in the merged model need to also be multiplied to compensate for that.
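The effect described above can be shown with a toy pure-Python simulation (all names and numbers here are illustrative, not taken from the codebase): a row-parallel linear layer split across two ranks, where each rank adds the full replicated bias before the allreduce sum, so the bias is counted once per rank.

```python
# Toy simulation of y = x.w + b for a row-parallel linear layer
# split across mp = 2 ranks, with the bias replicated on every rank
# and added BEFORE the allreduce sum.
mp = 2
x = [1.0, 2.0]   # input features, one slice per rank
w = [3.0, 4.0]   # weight slices, one per rank
b = 0.5          # bias, replicated on every rank

# Each rank computes its partial output, including the bias
partials = [x[r] * w[r] + b for r in range(mp)]

# The allreduce sums the partials, so the bias is counted mp times:
# y = x.w + mp * b
y_parallel = sum(partials)

# A naively merged (mp=1) model adds the bias only once...
y_naive = sum(x[r] * w[r] for r in range(mp)) + b
# ...so scaling the merged bias by mp reproduces the parallel output
y_scaled = sum(x[r] * w[r] for r in range(mp)) + mp * b
```

Here y_naive differs from y_parallel by (mp - 1) * b, while y_scaled matches it exactly, which is the compensation the comment is asking for.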
Added an assert for the weights dir not existing (in case the user passes `-s 123` and `global_step123` doesn't exist)
Update merge.py
Closing as superseded by other approaches
This PR removes the old merge script and adds a new one.
I assume the PR for config file management #463 to be merged so that config files are in the global_step* directory.
Note that I have only tested with the config parameters I commonly use. There might be cases where this merge script does not apply because the partition dimensions differ from the ones expected.