Checkpoint merge script #466
Conversation
We definitely need to thoroughly test this before approving.
Ok @sweinbach, I did some pretty extensive testing / debugging of this, and had to change a fair few things to get it to work at all, but I think this is about as good as we're going to get it, for reasons I'll explain further down in the comment (unless I'm missing some bug or error in my implementation). Stuff I changed:
Results: So, the results aren't great. Since, as I mentioned, the duplicated parameters across all ranks are not actually all the same, the merge does result in a loss of accuracy. It's not small, but it's not model-breaking, and I suspect the accuracy could possibly be recovered with a bit of extra tuning.

These are the eval results on lambada for the base model:

{
  "lambada": {
    "ppl": 4.843396407429969,
    "acc": 0.6955171744614788
  }
}

for the merged model (zeroth partition):

{
  "lambada": {
    "ppl": 5.400162748158698,
    "acc": 0.6751406947409276
  }
}

and for the merged model with averaged parameters:

{
  "lambada": {
    "ppl": 5.391328941501356,
    "acc": 0.6794100523966622
  }
}

Differences: The differences between the replicated parameters are mostly very small. These are the summed elementwise differences between replicated parameters in 2 model-parallel ranks of a 20B model:

diff('input_layernorm.weight').sum()
# tensor(-0.0024, dtype=torch.float16)
diff('attention.dense.bias').sum()
# tensor(-0.0009, dtype=torch.float16)
diff('post_attention_layernorm.weight').sum()
# tensor(-0.0088, dtype=torch.float16)

I think it's worth talking to the Megatron and/or DeepSpeed devs to see if they've had a similar problem. If the difference between replicated parameters is something fixable, we can integrate the fix into neox. In theory, data parallelism should result in all these params being equal, but maybe there's something I'm missing.

Anyway, the script works, and I think this can be merged, but maybe we should log a warning that you'll get accuracy loss. Additionally, there are certain settings I know this won't work for at all (geglu, for example), so we should also throw an error for known bad settings.
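The drift check above can be sketched in a few lines. This is a minimal, hypothetical helper, not the script's actual code: it takes two already-loaded state dicts standing in for two model-parallel rank checkpoints and sums the elementwise difference of parameters that data parallelism should keep identical. The key names follow the checkpoint layout discussed above; the tensors are toy stand-ins.

```python
import torch

def replicated_param_diffs(sd_rank0, sd_rank1, replicated_keys):
    """Summed elementwise difference of parameters that should be
    identical across model-parallel ranks (hypothetical helper)."""
    return {k: (sd_rank0[k] - sd_rank1[k]).sum().item() for k in replicated_keys}

# Toy state dicts standing in for two model-parallel rank checkpoints
sd0 = {"input_layernorm.weight": torch.tensor([1.00, 2.00])}
sd1 = {"input_layernorm.weight": torch.tensor([1.01, 2.00])}

diffs = replicated_param_diffs(sd0, sd1, ["input_layernorm.weight"])
# A nonzero sum here means the replicas have drifted apart
```

In practice one would load the rank checkpoints with torch.load and run this over every non-model-parallel key before deciding whether averaging is safe.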
Thank you @sdtblck. Points all taken and valid. I can confirm the observation that the all_equal assumption does not hold for larger models. I checked smaller models, where this does not seem to be an issue. Thanks for the effort!
@sweinbach anything else to add, or would you say this is ready to merge?
tools/merge.py
Outdated
args.pipe_parallel,
args.output_dir,
args.global_step,
args.layernorm_mean,
This line was throwing an error for me because layernorm_mean is not in the args list.
Should be parameter_mean
tools/merge.py
Outdated
# save modified config
with open(output_configs_dir / "config.yml", "w") as f:
    json.dump(config, f, indent=4)
There may be a subtle bug here. In my original config I had a parameter like
"eps": 1.0e-08
This script rewrote that value to
"eps": 1e-08
which led to the value being parsed as a string, not a float. Not sure why that's the case, but just FYI.
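The first half of this is easy to reproduce with the standard library: json.dump writes floats using Python's shortest repr, which drops the ".0" from 1.0e-08. The second half is a YAML quirk: YAML 1.1 loaders such as PyYAML only resolve exponent literals that contain a dot as floats, so the rewritten "1e-08" round-trips out of the config as a string. A minimal sketch of the json side:

```python
import json

# json.dump serializes floats with Python's shortest repr,
# so 1.0e-08 comes back out as "1e-08" (no dot before the exponent).
config = {"eps": 1.0e-08}
text = json.dumps(config)
print(text)  # {"eps": 1e-08}

# json itself round-trips fine; it's the YAML 1.1 float resolver
# (used when the config is later *loaded* as YAML) that treats a
# dotless "1e-08" as a plain string rather than a float.
assert json.loads(text)["eps"] == 1.0e-08
```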
Ugh. We ran into this issue before and I thought we had fixed it. It's possible our "fix" was ultimately to write 1.0 everywhere, though…
Ah, I think if we save it out with yaml instead of json.dump (it is a yaml file, after all) that should fix it. Bit of an oversight there, nice catch!
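The suggested fix can be sketched as below, assuming PyYAML is available. PyYAML's float representer inserts the missing ".0" before the exponent when dumping, so the value survives a save/load round trip as a float:

```python
import yaml  # PyYAML (third-party), assumed available

# Dump the config with yaml.safe_dump instead of json.dump so
# scalars are written in a form the YAML loader resolves back
# to their original types.
config = {"eps": 1.0e-08}
text = yaml.safe_dump(config)

# PyYAML writes "eps: 1.0e-08", which loads back as a float,
# unlike the dotless "1e-08" that json.dump produces.
assert isinstance(yaml.safe_load(text)["eps"], float)
```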
if partition_dim is None:
    print(module_name)
    # just take the 0th partition for non model-parallel weights
    if parameter_mean:
        out_sd[module_name] = torch.mean(torch.stack(partitions), dim=0)
    else:
        out_sd[module_name] = partitions[0]
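For context, the branch above handles replicated weights; sharded weights take the other path and are concatenated along their partition dimension. A toy sketch of both paths, with partitions, partition_dim, and parameter_mean mirroring the names in the snippet (the merge function itself is a stand-in, not the script's actual code):

```python
import torch

def merge(partitions, partition_dim, parameter_mean=True):
    """Merge model-parallel shards of one parameter (illustrative sketch)."""
    if partition_dim is None:
        # Replicated weight: average the copies, or take rank 0's copy
        if parameter_mean:
            return torch.mean(torch.stack(partitions), dim=0)
        return partitions[0]
    # Sharded weight: concatenate along the partition dimension
    return torch.cat(partitions, dim=partition_dim)

# Toy shards from two model-parallel ranks
partitions = [torch.tensor([[1.0, 2.0]]), torch.tensor([[3.0, 4.0]])]

averaged = merge(partitions, None)   # elementwise mean of the replicas
sharded = merge(partitions, 0)       # full (2, 2) weight, rows recombined
```

Averaging rather than taking the zeroth partition is what produced the slightly better lambada numbers reported above.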
If gpt_j_residual is true, shouldn't attention.dense.bias and mlp.dense_4h_to_h.bias also be multiplied (elementwise) by the number of partitions? This line:

gpt-neox/megatron/model/transformer.py, line 626 in d7af1e7:
output = residual + self.reduce(output)

seems to be an allreduce sum that is performed AFTER those biases are added, so it's effectively multiplying those replicated biases by the number of partitions. I think these biases in the merged model need to also be multiplied to compensate for that.
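The effect described above can be shown with a toy pure-Python simulation (all names and numbers here are illustrative, not taken from the codebase): a row-parallel linear layer split across two ranks, where each rank adds the full replicated bias before the allreduce sum, so the bias is counted once per rank.

```python
# Toy simulation of y = x.w + b for a row-parallel linear layer
# split across mp = 2 ranks, with the bias replicated on every rank
# and added BEFORE the allreduce sum.
mp = 2
x = [1.0, 2.0]   # input features, one slice per rank
w = [3.0, 4.0]   # weight slices, one per rank
b = 0.5          # bias, replicated on every rank

# Each rank computes its partial output, including the bias
partials = [x[r] * w[r] + b for r in range(mp)]

# The allreduce sums the partials, so the bias is counted mp times:
# y = x.w + mp * b
y_parallel = sum(partials)

# A naively merged (mp=1) model adds the bias only once...
y_naive = sum(x[r] * w[r] for r in range(mp)) + b
# ...so scaling the merged bias by mp reproduces the parallel output
y_scaled = sum(x[r] * w[r] for r in range(mp)) + mp * b
```

Here y_naive differs from y_parallel by (mp - 1) * b, while y_scaled matches it exactly, which is the compensation the comment is asking for.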
Added an assert for the weights dir not existing (in case the user passes `-s 123` and `global_step123` doesn't exist)
Update merge.py
Closing as superseded by other approaches
This PR removes the old merge script and adds a new one.
I assume the PR for config file management #463 to be merged so that config files are in the global_step* directory.
Note that I have only tested with the config parameters I commonly use. There might be cases where this merge script does not apply because the partition dimensions differ from the ones expected.