
Customized loss value #52

Open

ZN1010 opened this issue Aug 22, 2023 · 4 comments

Comments


ZN1010 commented Aug 22, 2023

In the full parameter update setting, I've recently been trying a new loss function: on top of the original next-token prediction loss I add a regularization term that encourages the weights of certain layers to be as small as possible. But I keep running into strange bugs. Below are the code I added and the error I get:

For example, in lomo_trainer.py:

# weight of the regularization term and its (initially zero) accumulator
lamda, regularization = 1, torch.tensor(0, requires_grad=True, dtype=torch.float32)
self.model.train()
for name, param in self.model.named_parameters():
    if "self_attn.q_proj" in name:
        # gather the full (ZeRO-partitioned) weight, then accumulate its mean
        with GatheredParameters(param):
            regularization = regularization + torch.mean(param)
...
# add the penalty on top of the original next-token prediction loss
loss = get_loss(outs.logits, batch['labels'], self.training_args.clip_loss_value) + lamda * regularization

However, with this change, loss.backward(retain_graph=True) inside grad_norm() in lomo.py fails with RuntimeError: The size of tensor a (0) must match the size of tensor b (4096) at non-singleton dimension 1. My guess is that during backward the weights of the layers I added cannot be found. How can I fix this bug, or is there a better implementation?

Thank you very much!

Collaborator

KaiLv69 commented Aug 22, 2023

Hi, I'd suggest not using GatheredParameters; instead, use torch.mean(param.ds_tensor) and do the gather yourself.
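
A minimal sketch of what this suggestion might look like, assuming ZeRO-3 is active so each parameter exposes its local shard as param.ds_tensor; each rank only sees its own shard here, and whether gradients reach the weights the way LOMO expects is not verified:

lamda = 1.0
regularization = 0.0  # becomes a GPU tensor after the first accumulation
for name, param in self.model.named_parameters():
    if "self_attn.q_proj" in name:
        # mean over the local ZeRO-3 shard only; no gather / re-partition here
        regularization = regularization + param.ds_tensor.float().mean()
loss = get_loss(outs.logits, batch['labels'], self.training_args.clip_loss_value) + lamda * regularization

The penalty computed this way differs across GPUs, so the per-shard means still have to be combined across ranks (the "gather yourself" part), which the later comments discuss.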

Author

ZN1010 commented Aug 22, 2023

Hi, I'd suggest not using GatheredParameters; instead, use torch.mean(param.ds_tensor) and do the gather yourself.

Hi! Could you explain how to do the gather myself? The parameters inside ds_tensor appear to be rearranged, and I still need to know where a given parameter sits in LLaMA (e.g., which mlp/self_attention layer it belongs to). Thanks!

Collaborator

KaiLv69 commented Aug 25, 2023

You can still use if "self_attn.q_proj" in name: to match the parameter by name. ds_tensor holds the partitioned parameter; its size is the original parameter size divided by the number of GPUs. For gathering, you can simply use torch's gather API.
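
As a rough illustration of that gathering step (gather_full_weight is a hypothetical helper, not part of LOMO; it assumes every rank holds an equally sized shard, which holds because ZeRO-3 pads the last shard, and that torch.distributed is already initialized; note that plain dist.all_gather does not propagate gradients back to the shards):

import torch
import torch.distributed as dist

def gather_full_weight(ds_shard):
    # collect the flattened ZeRO-3 shards from every rank and concatenate them
    # into the full flattened parameter (padding included)
    world_size = dist.get_world_size()
    shards = [torch.empty_like(ds_shard) for _ in range(world_size)]
    dist.all_gather(shards, ds_shard.contiguous())
    return torch.cat(shards)

for name, param in self.model.named_parameters():
    if "self_attn.q_proj" in name:  # the name check still works under ZeRO-3
        full_flat = gather_full_weight(param.ds_tensor)
        regularization = regularization + full_flat.float().mean()

Since the penalty here is only a mean, an alternative that avoids materializing the full weight is to sum each local shard and all_reduce the partial sums across ranks.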

Author

ZN1010 commented Sep 8, 2023

Sorry for the late response! I used the recommended approach and gathered param.ds_tensor. But during backward I still hit the same problem as before: loss.backward(retain_graph=True) inside grad_norm() in lomo.py raises RuntimeError: The size of tensor a (0) must match the size of tensor b (4096) at non-singleton dimension 1.

My guess is still that DeepSpeed cannot find these ds_tensors during backward. Is my understanding correct, and is there any way to fix this? Thanks!
