transfer_action too large #5299

Open
biddisco opened this issue Apr 23, 2021 · 7 comments

@biddisco
Contributor

biddisco commented Apr 23, 2021

I previously implemented a custom bootstrapping routine that lets the libfabric parcelport initialize itself by sending addresses from worker nodes to the root/agas node, so that address vectors can be updated and shared between nodes. This takes place before the big boot barrier (BBB) kicks in and works well, so the worker registration that takes place in BBB functions as expected. However, it does not fit the way other parcelports initialize themselves, because an extra address-swapping step is needed.

I would like to integrate the libfabric parcelport (LF PP) with the method used in the big boot barrier, where nodes register themselves via the main register worker action: it contains the workers' addresses, is sent to the agas root, and the addresses are then broadcast back to the workers. Unfortunately, the serialized worker registration takes around 6052 bytes and does not fit into the default 4096-byte message header used by the libfabric parcelport, which means we must do an RMA operation to fetch the data. However, we cannot perform RMA operations to fetch node addresses before the addresses have been swapped, since we do not know a node's address until we have received it.
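To make the size constraint concrete, here is a minimal sketch of the eager-versus-RMA decision described above. The names `send_path`, `choose_send_path` and `sendable_before_address_exchange` are hypothetical and are not part of HPX or libfabric; the sketch also assumes, as a simplification, that the whole 4096-byte header is available for payload.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration only; none of these names exist in HPX or libfabric.
enum class send_path
{
    eager_in_header,    // payload travels inside the message header
    rma_fetch           // receiver must fetch the payload with an RMA read
};

// Default header size quoted in the issue text.
constexpr std::size_t default_header_capacity = 4096;

// Pick the path a message of the given size would take.
// Simplification: assumes the full header capacity is usable for payload.
inline send_path choose_send_path(std::size_t payload_bytes,
    std::size_t capacity = default_header_capacity)
{
    assert(capacity > 0);
    return payload_bytes <= capacity ? send_path::eager_in_header
                                     : send_path::rma_fetch;
}

// Before addresses are exchanged no RMA read can be issued,
// so only messages that take the eager path can be delivered.
inline bool sendable_before_address_exchange(std::size_t payload_bytes)
{
    return choose_send_path(payload_bytes) == send_path::eager_in_header;
}
```

With the numbers from this issue, a 6052-byte registration parcel falls on the RMA path, while a 16-byte or 48-byte raw address fits in the header.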

The problem is that transfer_action itself consumes 5917 bytes and does not appear to serve much purpose. Other overheads push the total registration parcel size to 6052 bytes. (By comparison, the LF custom registration exchanges 16 bytes from worker to root for a tcp provider and ~48 bytes for a GNI provider.)

Are these massive overheads in action serialization actually necessary?

@hkaiser
Member

hkaiser commented Apr 23, 2021

How did you determine the size of the transfer_action? From what I can see, the transfer_action's size is determined mostly by the sizes of the arguments passed to the remote function (plus the size of the vtbl pointer). Am I missing something?

@biddisco
Contributor Author

I looked at the archive size before serializing the action and after.
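For anyone wanting to repeat the measurement, this is roughly the before/after technique, sketched against a stand-in archive. `byte_archive` and `bytes_written_by` are hypothetical names used for illustration; this is not HPX's output archive, and the length-prefixed string encoding is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for a serialization archive (hypothetical, not the HPX type).
struct byte_archive
{
    std::vector<char> buffer;

    std::size_t size() const { return buffer.size(); }

    // Append raw bytes to the archive.
    void write(void const* data, std::size_t count)
    {
        char const* bytes = static_cast<char const*>(data);
        buffer.insert(buffer.end(), bytes, bytes + count);
    }

    // Assumed encoding: 64-bit length prefix followed by the characters.
    void write_string(std::string const& s)
    {
        std::uint64_t const length = s.size();
        write(&length, sizeof(length));
        write(s.data(), s.size());
    }
};

// Run one serialization step and report how many bytes it added.
template <typename Step>
std::size_t bytes_written_by(byte_archive& ar, Step&& serialize_step)
{
    std::size_t const before = ar.size();
    serialize_step(ar);
    std::size_t const after = ar.size();
    assert(after >= before);    // serialization only appends
    return after - before;
}
```

Wrapping only the action's serialization in `bytes_written_by` isolates its contribution from the rest of the parcel.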

@hkaiser
Member

hkaiser commented Apr 23, 2021

> I looked at the archive size before serializing the action and after.

Ahh. I think this particular action might even have a configuration-defined size, as it transfers the action ids for all defined actions. What we could do is separate the process of making the action ids consistent into a second step.

@biddisco
Contributor Author

OK, that makes sense. I didn't dive too deep into the contents of the action_base or whatever. I will have another look and see if I can factor out just the worker registration part into a smaller parcel.

@biddisco
Contributor Author

You are correct - the bulk of the space is taken up by the typenames registered.
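A rough sketch of why the registered typenames dominate: the size of a name-to-id table grows with every registered action, independently of the few bytes needed for an address. The function name `serialized_registry_size` and the wire layout (64-bit count, length-prefixed names, 32-bit ids) are assumptions made for illustration, not the actual HPX format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical estimate of the bytes needed to ship a table that maps
// registered type names to ids. Assumed layout: 64-bit entry count, then
// per entry a 64-bit name length, the name characters, and a 32-bit id.
inline std::size_t serialized_registry_size(
    std::map<std::string, std::uint32_t> const& ids)
{
    std::size_t total = sizeof(std::uint64_t);    // entry count
    for (auto const& entry : ids)
    {
        assert(!entry.first.empty());    // this sketch expects non-empty names
        total += sizeof(std::uint64_t) + entry.first.size();    // name
        total += sizeof(std::uint32_t);                         // id
    }
    return total;
}
```

Under these assumptions, every additional registered action adds its name length plus 12 bytes, so a build with many actions quickly outgrows a fixed-size header.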

@biddisco
Contributor Author

I've replaced my existing code with a single new function defined in the base parcelport

        // empty interface to allow libfabric parcelport to bootstrap
        virtual void pre_bootstrap_initialization() {}

which is called inside the big boot barrier just before the early parcel is sent. It does nothing in the other parcelports, but allows libfabric to perform an address send using a low-level message that bypasses the parcel creation step. This is simple, works, and saves redesigning the existing big boot barrier init.
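The shape of that hook can be sketched as below. Only `pre_bootstrap_initialization` comes from the change described here; `parcelport_base`, `other_parcelport`, `libfabric_like_parcelport`, `bootstrap` and the `events` log are simplified stand-ins invented for illustration and do not reflect the real HPX classes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for the base parcelport (hypothetical).
struct parcelport_base
{
    virtual ~parcelport_base() = default;

    // Empty by default, so existing parcelports are unaffected.
    virtual void pre_bootstrap_initialization() {}

    std::vector<std::string> events;    // records what happened, in order
    bool early_parcel_sent = false;
};

// A parcelport that does not need the hook.
struct other_parcelport : parcelport_base
{
};

// A parcelport that exchanges addresses before any parcel is created.
struct libfabric_like_parcelport : parcelport_base
{
    void pre_bootstrap_initialization() override
    {
        events.push_back("raw address message");
    }
};

// Stand-in for the point in the boot barrier where the hook is called.
inline void bootstrap(parcelport_base& pp)
{
    assert(!pp.early_parcel_sent);    // hook must run before the early parcel
    pp.pre_bootstrap_initialization();
    pp.events.push_back("early parcel");
    pp.early_parcel_sent = true;
}
```

Calling `bootstrap` on the libfabric-like parcelport records the raw address message first and the early parcel second; the other parcelport records only the early parcel.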

@hkaiser
Member

hkaiser commented Mar 11, 2022

@biddisco I assume this can be closed now, can't it?
