Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LIU-117 Add PlasmaFlight Drops #58

Merged
merged 20 commits into from
Aug 27, 2021
Merged

LIU-117 Add PlasmaFlight Drops #58

merged 20 commits into from
Aug 27, 2021

Conversation

calgray
Copy link
Collaborator

@calgray calgray commented Aug 10, 2021

Summary:
One of the problems of using plasma drops is data not being available to ports located on a different compute node due to plasma only being shared system memory store. Here a new PlasmaFlight drop has been made that uses a fallback machanism to a flight server-client whenever a plasma object does not exist locally.

The flight server uses plasma object IDs as ticket identifiers, plus only reads and writes raw binary data to plasma and the server container is maintained separately at https://github.com/ICRAR/plasmaflight

PlasmaFlight Drops have the parameters:

  • plasma_path: path to the local plasma store (same as plasma drop)
  • object_id: id of the object for all compute nodes (same as plasma drop)
  • flight_path: IP and flight port of the drop owner

Demos:
To demonstrate the functionality the first graph shows the compatibility of the new drops with simple plasma drops and how the flight container can connect to a plasma_store container:
image

In the second data is not copied into PlasmaFlight 5006 and forces the flight grpc+tcp call to network retrieve the object from the other plasma store. The PlasmaFlight Service is also capable of running the flight server and plasma store in the same container.
image
image

Future work:

  • True DataIO interface implementation which due to locking shared memory between reading and writing can't allow for immediate reading before a plasma buffer is sealed, but providing the memory size to the drop in advanced however should reduce the amount of in memory buffers for copying and be interface compatible for data over 4kb.
  • Flight server examples with TLS security are publicly available and should be incorporated in some form.
  • PlasmaFlight has only been tested with a single engine compute node with manually entered IPs. Reliable support for more compute configurations will need some form of IP address macros.
  • Reading a drop from a different compute node will make the assumption that the plasma socket and flight port are the same (unless macros are used and supported per compute node).
  • PlasmaFlight Service as a docker node currently blocks graph completion as the engine waits for the docker app to terminate. A new service type node that does not wait for termination would be ideal for situations such as this.

self._done = True
return data
else:
return []
Copy link
Collaborator Author

@calgray calgray Aug 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seemed to have been a bug before as I found it would read the store contents over and over in a loop in droputils.copyDropContents()

logger = logging.getLogger(__name__)


class PlasmaFlightClient():
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is a copy from https://github.com/ICRAR/plasmaflight/blob/main/icrar/plasmaflight/client/plasmaflight_client.py

An alternative option of getting it here is to add icrar.plasmaflight as a package dependency for daliuge

@coveralls
Copy link

coveralls commented Aug 10, 2021

Coverage Status

Coverage decreased (-0.1%) to 77.103% when pulling 3cf8254 on LIU-117-plasmaflight into 322d297 on master.

@@ -315,6 +320,20 @@ def run(self):
binds = list(set(binds)) # make this a unique list else docker complains
logger.debug("Volume bindings: %r", binds)

portMappings = {} # = {'5005/tcp': 5005, '5006/tcp': 5006}
for mapping in self._portMappings.split(','):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its worth noting this is exposing containers to public networks and a secure connection should be used. There are supposedly other possibilities for network binding such as specifying a private network ip (e.g. docker run -p localhost:5005:5005 or docker run --net=<custom bridge network>)

if not self._writer:
# use client.create and FixedSizeBufferWriter
self._writer = pyarrow.BufferOutputStream()
self._writer.write(data)
Copy link
Collaborator Author

@calgray calgray Aug 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote the plasma writing here as the DataIO interface expects to write the number of bytes in count and calls write multiple times. Tested with a large file and this now works. Ideally if the write size is known in advanced a FixedSizeBufferWriter could be used here to write directly to the plasma mapped memory.

Copy link
Collaborator Author

@calgray calgray Aug 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed sized buffer writer is now used in PlasmaFlightIO

@@ -119,7 +119,7 @@ def _test_runGraphInTwoNMs(
drops.update(dm2._sessions[sessionId].drops)

leaf_drop = drops[leaf_oid]
with droputils.DROPWaiterCtx(self, leaf_drop, 1):
with droputils.DROPWaiterCtx(self, leaf_drop, 5):
Copy link
Collaborator Author

@calgray calgray Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why this sometimes fails as the test uses memory drops and not plasma

Copy link
Collaborator

@james-strauss-uwa james-strauss-uwa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit beyond me to understand completely. I just have a few comments:

  • Is exposing the container to a public networks a serious security issue? Is it just temporary to get the system working? As long as you are happy that the issue isn't serious, or that there is a plan to mitigate the issue, it's fine with me.
  • The change to test_dm.py seems weird, maybe create a ticket to look into that at some stage?

Do you want me to add plasma_path and object_id parameters to Plasma drops by default in EAGLE, and add those two params plus flight_path for PlasmaFlight Drops? I'm happy to do that.

@calgray
Copy link
Collaborator Author

calgray commented Aug 26, 2021

Thanks for reviewing!

The main issue with exposing the container at the moment is that the FlightServer does not use any form of network security so if the IP is discoverable on the internet anything with the ip and port combination can access and modify the data store. If the deployment tends to not change IPs very often a secure connection is likely a good idea. I have added some unit tests for creating a TLS flight server my plan for the security ticket is allow the server and client to be configured with a certificate and private key (possibly stored on the shared filesystem).

I've looked over the changes a bit for anything possibly relating to intermittent the test failure but almost everything exclusively affects plasma. I'll mention in a ticket which one is failing the commit where I first noticed it.

I was under the impression that plasma_path and object_id were already in the default palette, but if not that would be great to add. portMappings should also be added to DockerDrops.

@james-strauss-uwa
Copy link
Collaborator

portMappings is in the Docker drop, but the others are missing from Plasma/PlasmaFlight. I'll add them tomorrow.

@calgray calgray merged commit 6f931bc into master Aug 27, 2021
@calgray calgray deleted the LIU-117-plasmaflight branch August 27, 2021 02:10
awicenec pushed a commit that referenced this pull request May 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants