DistConv connections to LBANN are too fragile. #2158

Open
benson31 opened this issue Nov 21, 2022 · 0 comments

Code like the following can cause problems:

# given layers 'input' and 'x_true' of suitable shapes/types/etc
...
x = lbann.Convolution(input, ..., parallel_strategy=<not None>)
y = lbann.L2Norm2(x)
z = lbann.Subtract(x, x_true)
...

It seems that the split layer LBANN's runtime inserts between x and its children, y and z, does not gracefully handle the fact that x's tensors are actually managed by DistConv. I was seeing error messages like:

layer "conv_norm" expected an input tensor stored in a 4096 x 1 matrix from layer "convolution_layer_split", but got a 0 x 0 matrix

To fix this, I replaced x with:

x = lbann.Identity(lbann.Convolution(input, ..., parallel_strategy=<not None>), parallel_strategy=None)

(The explicit parallel_strategy=None is only there to make it very clear that I do NOT want the Identity layer to be DistConv-managed.) This seems to have worked.
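If this pattern shows up in several places, the workaround could be factored into a small helper. The following is only a sketch: the names `distconv_boundary` and `conv_with_boundary` are hypothetical, and the `lbann` module is passed in as an argument rather than imported. It relies only on the `lbann.Convolution` and `lbann.Identity` calls shown above, and it has not been checked against every layer type.

```python
def distconv_boundary(lbann, layer):
    """Wrap `layer` in an Identity layer that is explicitly not DistConv-managed.

    Hypothetical helper: `lbann` is the LBANN Python frontend module,
    passed in by the caller.
    """
    # parallel_strategy=None makes it explicit that the Identity layer
    # itself should not be handled by DistConv.
    return lbann.Identity(layer, parallel_strategy=None)


def conv_with_boundary(lbann, input_layer, parallel_strategy, **conv_kwargs):
    """Build a DistConv-enabled convolution followed by an Identity boundary.

    `conv_kwargs` are forwarded unchanged to lbann.Convolution.
    """
    conv = lbann.Convolution(
        input_layer,
        parallel_strategy=parallel_strategy,
        **conv_kwargs,
    )
    # Children (e.g. L2Norm2, Subtract) should attach to the returned
    # Identity layer, so any split layer LBANN inserts sits after it.
    return distconv_boundary(lbann, conv)
```

With this, `x = conv_with_boundary(lbann, input, <not None>, ...)` would stand in for the nested expression above, and `y` and `z` are built from `x` exactly as before.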

@benson31 benson31 self-assigned this Nov 21, 2022
@benson31 benson31 added the bug label Nov 21, 2022