
Checkpoint merge script #466

Closed - wants to merge 15 commits

Conversation

sweinbach (Contributor)

This PR removes the old merge script and adds a new one.

I assume the PR for config file management (#463) will be merged, so that config files are in the global_step* directory.

Note that I have only tested with the config parameters I commonly use. There may be cases where this merge script does not apply because partition dimensions differ from the expected ones.

@sweinbach requested a review from a team as a code owner on November 15, 2021, 09:50
@StellaAthena (Member) left a comment

We definitely need to thoroughly test this before approving.

@sdtblck (Contributor) commented Dec 20, 2021

Ok @sweinbach, I did some pretty extensive testing and debugging of this, and had to change a fair few things to get it to work at all, but I think this is about as good as we're going to get it, for reasons I'll explain further down in the comment (unless I'm missing some bug or error in my implementation).

Stuff I changed:

  • Made the function accessible without the argparse args, so we can use it externally if need be.
  • Got rid of the double loop (looping over mp first, then grouped weights). This sped things up a little for a larger model and produced the same outputs (verified with tools/inspect_checkpoints.py before and after).
  • Replaced the load/save directories with the output directory - otherwise, upon reloading, NeoX will still try to load the weights from the parallelized model.
  • Ensured the vocab size in the merged model was correct by modifying the make_vocab_size_divisible_by arg here.
  • Told Megatron not to load RNG states into the resulting model, as they will be incorrect.
  • Removed data_parallel_world_size from the ignored layers - DeepSpeed actually raised an error for me if this wasn't in the checkpoint weights, because of this line.
  • Different strategy for merging - for some reason, the replicated (non-model-parallel) parameters are not all equal across checkpoints. Because of this, we can't rely on elif all_equal(partitions): to catch all the replicated parameters. Instead, I take a rule-based approach: all parameters except the vocab parallel embedding, row parallel linear weights, column parallel linear weights, and column parallel linear biases are replicated (you can see how NVIDIA Megatron does this for reference), and the replicated ones can be merged either by just taking the 0th partition or by averaging all partitions (see the sketch after this list).
  • For efficiency, only the first of the model_states checkpoints is loaded. This was by far the most time-consuming operation (~5 minutes, or about half the runtime for my model), and we really only need the first one, because we're throwing out the optimizer states and everything else is replicated.
  • I also write out the latest global_step file here.
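
To make the rule-based strategy concrete, here is a minimal sketch of the merge rule described above. The name patterns follow GPT-NeoX / Megatron layer naming, but the helper itself and its exact patterns are illustrative, not the code in tools/merge.py.

import torch

# Parameter-name fragments that are actually partitioned across model-parallel
# ranks; everything else is treated as replicated. Illustrative only - the real
# script may key off partition dims rather than names.
ROW_PARALLEL_WEIGHTS = ("attention.dense.weight", "mlp.dense_4h_to_h.weight")
COLUMN_PARALLEL_WEIGHTS = (
    "word_embeddings.weight",            # vocab-parallel embedding
    "attention.query_key_value.weight",
    "mlp.dense_h_to_4h.weight",
)
COLUMN_PARALLEL_BIASES = ("attention.query_key_value.bias", "mlp.dense_h_to_4h.bias")

def merge_partitions(name, partitions, parameter_mean=False):
    """Merge one parameter's model-parallel partitions into a single tensor."""
    if any(key in name for key in ROW_PARALLEL_WEIGHTS):
        # Row-parallel linear weights are split along the input dimension.
        return torch.cat(partitions, dim=1)
    if any(key in name for key in COLUMN_PARALLEL_WEIGHTS + COLUMN_PARALLEL_BIASES):
        # Column-parallel weights/biases and the vocab-parallel embedding are
        # split along the output dimension.
        return torch.cat(partitions, dim=0)
    # Replicated parameter: take rank 0's copy, or average the (slightly
    # divergent) copies across ranks.
    if parameter_mean:
        return torch.mean(torch.stack(partitions), dim=0)
    return partitions[0]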

Results:

So, the results aren't great. As I mentioned, the replicated parameters across the ranks are not actually all the same, so doing the merge does result in a loss of accuracy. The loss is not small, but it's not model-breaking, and I suspect the accuracy could be recovered with a bit of extra tuning.

These are the eval results on lambada for the base model:

{
    "lambada": {
        "ppl": 4.843396407429969,
        "acc": 0.6955171744614788,
    }
}

and for the merged model (zeroth partition):

{
    "lambada": {
        "ppl": 5.400162748158698,
        "acc": 0.6751406947409276,
    }
}

and for the merged model with averaged parameters:

{
    "lambada": {
        "ppl": 5.391328941501356,
        "acc": 0.6794100523966622,
    }
}

Differences:

The differences between the replicated parameters are mostly very small; this is the difference between two input_layernorm.weight tensors in 2 model-parallel ranks of a 20B model:
[screenshot: elementwise difference between the two input_layernorm.weight tensors]
As you can see, mostly the same with very slight differences. Other layers are similar:

diff('input_layernorm.weight').sum()
# tensor(-0.0024, dtype=torch.float16)

diff('attention.dense.bias').sum()
# tensor(-0.0009, dtype=torch.float16)

diff('post_attention_layernorm.weight').sum()
# tensor(-0.0088, dtype=torch.float16)
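
(For context, diff here is presumably a small helper along these lines; the helper itself isn't shown in this thread, so the file names and loading path below are assumptions based on the DeepSpeed pipeline-parallel layer checkpoint naming.)

import torch

# Load the same transformer layer from two model-parallel ranks (hypothetical
# paths; adjust to the actual checkpoint layout).
sd_rank0 = torch.load("layer_03-model_00-model_states.pt", map_location="cpu")
sd_rank1 = torch.load("layer_03-model_01-model_states.pt", map_location="cpu")

def diff(param_name):
    # Elementwise difference of a parameter that should be replicated
    # identically across model-parallel ranks.
    return sd_rank0[param_name] - sd_rank1[param_name]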

I think it's worth talking to the Megatron and/or DeepSpeed devs to see if they've run into a similar problem. If the difference between replicated parameters is something fixable, we can integrate the fix into NeoX. In theory, data parallelism should result in all these params being equal, but maybe there's something I'm missing...

Anyway, the script works. I think this can be merged, but maybe we should log a warning that you'll get some accuracy loss.

Additionally, there are certain settings I know this won't work for at all (geglu, for example), so we should also throw an error for known-bad settings, along the lines of the sketch below.
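
A minimal guard could look something like this (the config key and bad values here are assumptions, sketched for illustration):

# Hypothetical guard for configurations the merge script is known not to
# handle correctly; extend the dict as more bad settings are found.
KNOWN_BAD_SETTINGS = {"activation": ("geglu",)}

def check_mergeable(config):
    for key, bad_values in KNOWN_BAD_SETTINGS.items():
        if config.get(key) in bad_values:
            raise ValueError(
                f"Checkpoint merging is not supported with {key}={config[key]!r}"
            )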

@sweinbach (Contributor, Author)

Thank you @sdtblck. All points taken and valid. I can confirm the observation that the all_equal assumption does not hold for larger models. I checked smaller models, where this does not seem to be an issue.

Thanks for the effort!

sdtblck previously approved these changes Dec 21, 2021

@sdtblck (Contributor) commented Dec 21, 2021

@sweinbach anything else to add, or would you say this is ready to merge?

tools/merge.py Outdated
args.pipe_parallel,
args.output_dir,
args.global_step,
args.layernorm_mean,
@mitchellgordon95 commented on Feb 9, 2022

This line was throwing an error for me because layernorm_mean is not in the args list.

Contributor:

Should be parameter_mean

tools/merge.py Outdated

# save modified config
with open(output_configs_dir / "config.yml", "w") as f:
    json.dump(config, f, indent=4)


There may be a subtle bug here. In my original config I had a parameter like

"eps": 1.0e-08

This script re-wrote that value to

"eps": 1e-08

Which led to the value being parsed as a string, not a float. Not sure why that's the case, but just FYI.

Member:

Ugh. We ran into this issue before and I thought we had fixed it. It’s possible our “fix” was ultimately to write 1.0 everywhere though….

Contributor:

Ah, I think if we save it out with yaml instead of json dump (it is a YAML file, after all) that should fix it. Bit of an oversight there, nice catch!
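
For reference, the likely cause is that PyYAML's default (YAML 1.1) resolver only treats exponent notation as a float when the mantissa contains a decimal point, so 1e-08 falls through to a string. A sketch of the fix, assuming PyYAML and using stand-ins for the variable names in the snippet above:

from pathlib import Path
import yaml

# Quick check of the parsing behaviour (assuming PyYAML as the config loader):
assert isinstance(yaml.safe_load("eps: 1.0e-08")["eps"], float)
assert isinstance(yaml.safe_load("eps: 1e-08")["eps"], str)

# Write the merged config back out as YAML instead of JSON, so float literals
# such as 1.0e-08 round-trip as floats. output_configs_dir and config stand in
# for the variables used in the snippet above.
output_configs_dir = Path("merged/configs")
output_configs_dir.mkdir(parents=True, exist_ok=True)
config = {"optimizer": {"params": {"eps": 1.0e-08}}}
with open(output_configs_dir / "config.yml", "w") as f:
    yaml.dump(config, f, default_flow_style=False)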

Comment on lines +387 to +393
if partition_dim is None:
    print(module_name)
    # just take the 0th partition for non model-parallel weights
    if parameter_mean:
        out_sd[module_name] = torch.mean(torch.stack(partitions), dim=0)
    else:
        out_sd[module_name] = partitions[0]

If gpt_j_residual is true, shouldn't attention.dense.bias and mlp.dense_4h_to_h.bias also be multiplied (elementwise) by the number of partitions?

This line:

output = residual + self.reduce(output)

seems to perform an allreduce sum AFTER those biases are added, so it effectively multiplies those replicated biases by the number of partitions. I think these biases in the merged model also need to be multiplied by that factor to compensate.
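
A sketch of the suggested special case (not code from this PR; the names mirror the snippet above):

# With gpt_j_residual, each model-parallel rank adds these replicated biases
# before the allreduce sum, so the parallel model effectively adds them
# num_partitions times; scale the merged copy to match.
GPT_J_RESIDUAL_BIASES = ("attention.dense.bias", "mlp.dense_4h_to_h.bias")

def merge_replicated(module_name, partitions, gpt_j_residual, num_partitions):
    merged = partitions[0]
    if gpt_j_residual and any(b in module_name for b in GPT_J_RESIDUAL_BIASES):
        merged = merged * num_partitions
    return merged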

HughPH and others added 2 commits April 22, 2022 15:04
Added assert for weights dir not existing (in case the user passes `-s 123` and `global_step123` doesn't exist)
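
(For reference, a guard along these lines; the function and argument names are assumptions, not the committed code:)

import os

def assert_weights_dir_exists(checkpoint_dir, global_step):
    # Fail early if the requested step directory doesn't exist, e.g. the user
    # passed `-s 123` but there is no global_step123 under the checkpoint dir.
    weights_dir = os.path.join(checkpoint_dir, f"global_step{global_step}")
    assert os.path.isdir(weights_dir), f"weights dir {weights_dir} does not exist"
    return weights_dir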
@StellaAthena (Member)

Closing as superseded by other approaches.
