
load GPT-J from HF #39

Open · wants to merge 1 commit into base: master

Conversation

Fireblossom

load the GPT-J checkpoint with a newer version of transformers

@@ -89,6 +91,21 @@ def __init__(self, config, device=None):
**attn_config,
)

# check weights contiguous
Collaborator

Is there a specific reason why we need to check this?

Author

The weights for GPT-J's attention layers are not contiguous, which raises a "Tensors must be contiguous" error in deepspeed.
I found a similar issue here.
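
A fix along the lines described in this reply might look like the minimal sketch below; this is not the exact code in the PR, and `model` is a placeholder for the loaded language model.

import torch

# Minimal sketch: make every parameter contiguous before handing the model
# to deepspeed, since GPT-J attention weights loaded from HF can be
# non-contiguous and deepspeed then raises "Tensors must be contiguous".
def make_weights_contiguous(model: torch.nn.Module) -> None:
    for param in model.parameters():
        if not param.data.is_contiguous():
            param.data = param.data.contiguous()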

@CoEich (Collaborator) commented Nov 28, 2022

Hi,

Thanks again for another PR. It would be nice to use the official HF version for MAGMA. However, the last time we tried to implement this, we noticed a slight difference in model outputs which we could not really get to the bottom of.

I'm very careful with these kinds of changes, so it would be great if you could compare the logits for some example inputs before/after your change.

Best,

Constantin
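
A before/after logit comparison of the kind requested here could look like the following sketch. `build_old_model`, `build_new_model`, and `tokenizer` are placeholders for whatever the repo actually provides, not names from this codebase.

import torch

# Hypothetical sketch: run the same prompts through the old and the new
# loading path and report the largest absolute logit difference.
prompts = ["A picture of a dog", "The capital of France is"]

old_model = build_old_model().eval()
new_model = build_new_model().eval()

with torch.no_grad():
    for text in prompts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        old_logits = old_model(ids).logits
        new_logits = new_model(ids).logits
        print(text, (old_logits - new_logits).abs().max().item())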

@Fireblossom (Author) commented Nov 28, 2022


Hi Constantin,

Thank you for reviewing my changes. Regarding the slight difference, it is hard for me to explain it from example inputs/outputs right now.

However, a change in the structure of the model may be the cause.
In the old HF version, you used GPT-Neo to simulate GPT-J, and GPTNeoMLP shows no activation function in its module structure.
In the new HF version, an explicit activation function, NewGELUActivation(), has been added.

before:

(0): GPTNeoBlock(
      (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
      (attn): GPTNeoAttention(
        (attention): GPTNeoSelfAttention(
          (attn_dropout): Dropout(p=0, inplace=False)
          (resid_dropout): Dropout(p=0, inplace=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (out_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
      )
      (mlp): Sequential(
        (0): GPTNeoMLP(
          (c_fc): Linear(in_features=4096, out_features=16384, bias=True)
          (c_proj): Linear(in_features=16384, out_features=4096, bias=True)
          (dropout): Dropout(p=0, inplace=False)
        )
        (1): Adapter(
          (adapter): Sequential(
            (0): Linear(in_features=4096, out_features=1024, bias=True)
            (1): ReLU()
            (2): Linear(in_features=1024, out_features=4096, bias=True)
          )
        )
      )
    )

At present (without adding adapters):
GPT-Neo:

(0): GPTNeoBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPTNeoAttention(
          (attention): GPTNeoSelfAttention(
            (attn_dropout): Dropout(p=0.0, inplace=False)
            (resid_dropout): Dropout(p=0.0, inplace=False)
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=False)
            (q_proj): Linear(in_features=768, out_features=768, bias=False)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPTNeoMLP(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )

GPT-J:

(0): GPTJBlock(
        (ln_1): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (attn): GPTJAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (out_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): GPTJMLP(
          (fc_in): Linear(in_features=4096, out_features=16384, bias=True)
          (fc_out): Linear(in_features=16384, out_features=4096, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
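
(For reference, a block printout like the GPT-J one above can be reproduced with the current transformers API; the model id EleutherAI/gpt-j-6B is assumed here.)

from transformers import GPTJForCausalLM

# Load the HF GPT-J checkpoint and print its first transformer block,
# which yields a module dump like the GPT-J one shown above.
model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
print(model.transformer.h[0])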

Hope this helps you!

Best,

Changxu

@CoEich (Collaborator) commented Dec 6, 2022

Hmm, this puzzles me a bit. In any case, unless consistent behavior with the old version is ensured (e.g. by checking that all the hidden states are the same for a couple of example inputs), I will not merge these changes.

Let me know if you manage to do it, and thanks for the effort.

Best,

Constantin
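
(A hidden-state version of the check described above could look like the sketch below, assuming both loading paths return HF-style outputs and accept the standard output_hidden_states flag; `old_model`, `new_model`, and `ids` are placeholders for the two models being compared and a tokenized example input.)

import torch

# Hypothetical sketch: compare every layer's hidden states, not just logits.
with torch.no_grad():
    old_out = old_model(ids, output_hidden_states=True)
    new_out = new_model(ids, output_hidden_states=True)

for layer, (h_old, h_new) in enumerate(zip(old_out.hidden_states, new_out.hidden_states)):
    print(layer, (h_old - h_new).abs().max().item())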

@nikisalli

Hello, did you manage to make this work?

@Fireblossom (Author)

Hi,
I can confirm that the modification itself is runnable and that the model can be fine-tuned with the same method as in the MAGMA paper.
As discussed above, the modified branch may produce output that is inconsistent with the checkpoint provided by this repo, so this PR will not be merged.
But I don't have a lot of time to dive into this right now. Maybe I will do it later.

@nikisalli

Hi, thank you for the fast answer!
Do you have a working checkpoint? The default one has some dimensionality differences, and I'd rather not copy and paste the tensors by hand. Can you upload it somewhere?

@Fireblossom (Author)

The data I use for fine-tuning is in a completely different domain, so I'm afraid my checkpoint can't meet your needs right now.

@nikisalli

Ah, ok, thank you anyway.
