Communication c++ layer #926
Another thing that would be required is a listener for UCX or TCP. The only thing this listener does is wait for ANY message from ANY endpoint and pass that message along to the bufferAssembler. The bufferAssembler must be "started" before these messages start arriving. This is interesting because it identifies a couple of things. We need functions that are like
|
The job of this class is to send a buffer from one Endpoint to another. It should have no understanding of what payload it is sending. It takes in an rmm::device_buffer and an id that can be used by the BufferAssembler on the receiving side. It has a function to begin transmission so that we can associate each transmission with an assembler that gets made on the other side. It should block until we know the assemblers are made |
The Also for the listener of the messages, is that a responsibility or feature of the Would there be one Where do the incoming message CacheMachine and outgoing message CacheMachine come into play? |
What is the base UCX API that we will use in order to build all of those abstractions?
|
Good idea
Neither. The callback on the UCX listener that responds to incoming messages will use a tag to route the information to the appropriate place. When a message is going to be sent, it first goes over a list of tags that will map to that message. The callback can access a global_context where the mapping of tag to BufferAssembler is stored, and it sends the buffer over to the assembler
A ReceiverClass is generated by the first message, which is begin_transmission. Its assembler just takes that message and puts it into a vector of device_buffers. When it has accumulated them all (on a separate thread, with a condition variable waiting for them all to arrive), it then uses the deserializer to convert those buffers and the metadata from begin_transmission into our table representation.
A thread is polling on the outgoing one to call the send to its specific destination. |
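A minimal sketch of the tag routing and the condition-variable assembler described above, using `std::vector<uint8_t>` as a stand-in for `rmm::device_buffer` (class and member names here are hypothetical, not the actual implementation):

```cpp
#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>
#include <vector>

// Stand-in for rmm::device_buffer so the sketch is self-contained.
using Buffer = std::vector<uint8_t>;

// Accumulates the expected number of buffers for one transmission. A thread
// blocked in wait_all() wakes once every buffer has arrived; at that point
// the deserializer would turn the buffers plus the begin_transmission
// metadata into the table representation.
class BufferAssembler {
public:
    explicit BufferAssembler(size_t expected) : expected_(expected) {}

    void add(Buffer b) {
        std::lock_guard<std::mutex> lk(m_);
        buffers_.push_back(std::move(b));
        if (buffers_.size() == expected_) cv_.notify_all();
    }

    std::vector<Buffer> wait_all() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return buffers_.size() == expected_; });
        return std::move(buffers_);
    }

private:
    size_t expected_;
    std::vector<Buffer> buffers_;
    std::mutex m_;
    std::condition_variable cv_;
};

// Global context consulted by the listener callback: the tag on an incoming
// message decides which assembler receives the buffer.
struct GlobalContext {
    std::map<uint64_t, BufferAssembler*> assemblers;

    bool route(uint64_t tag, Buffer b) {
        auto it = assemblers.find(tag);
        if (it == assemblers.end()) return false;  // no assembler registered
        it->second->add(std::move(b));
        return true;
    }
};
```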
UCP
This is no longer the case; a great deal of work has been done to ensure that. The biggest risk here is proper configuration of the network
Actually, we are using the ucp_ep_h class, like cuML |
I see, great, another question:
How are we going to test these new abstractions? I guess in case UCP doesn't detect any RDMA channel then it will fall back to TCP automagically, right? Or do we need to implement some UCPTCPChannelMessageCarrierSomething in order to support TCP? |
Is the listener in python or c++?
Is this tag the message metadata (CacheData metadata)? Is it used to route the information into the appropriate CacheMachine?
I don't understand
What is the mapping of tag to BufferAssembler? Is the tag associated to a CacheMachine (i.e. a destination?) is there a BufferAssembler per CacheMachine? Is there one BufferAssembler or many? |
The polling thread: I imagine that as soon as a message is pulled from the CacheMachine, it does the sending (serializing, BufferCommunicating) on a separate thread? I imagine we also want that sending on a separate thread to be part of a thread pool with an upper limit on the max number of outgoing messages |
That's a really good pair of questions. For the first: a nightmare/dream about a dragon-dog hybrid that was inadvertently killing people seeking love and attention gave me some great insight on this. A dragon-dog hybrid seemed like a good idea at first, but not everything was thought out when the dragon dog came to life, namely that it would be needy, and that its seeking comfort from others was deadly to those whose affection it wanted. We have tried to add UCX a few times before. Early on we had lots of issues with genuine bugs in UCX. We gave up many times before because we got better performance from TCP. This is definitely no longer the case. We can test these new abstractions quite easily, actually. You can test the Sender and Receiver classes by making a version that takes a BufferSender or BufferReceiver that we KNOW will work, a dummy one, and the same thing for the Assembler, deserializer, and serializer. We can write performance tests with this and dask to verify that it can handle high concurrency and to validate the transfer rate on different hardware. But to do this we must make these things pluggable. As far as falling back, that is handled by how you launched this to begin with. It depends on all the UCX__ env variables |
yes
Correct, it should use a thread pool. Only the begin transmission message should be synchronous, to ensure that you don't start sending a message before it can be assembled |
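A sketch of that threading rule (hypothetical names, standard library only): begin transmission would run synchronously before anything is submitted, and the payload sends go through a small fixed-size pool, which also caps the number of in-flight outgoing messages.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Fixed-size pool: the worker count is the upper limit on concurrent sends.
class SendPool {
public:
    explicit SendPool(size_t workers) {
        for (size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { run(); });
    }
    ~SendPool() {                         // drains the queue, then joins
        {
            std::lock_guard<std::mutex> lk(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lk(m_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [&] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // done_ set and queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
        }
    }
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> threads_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};
```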
Here is the beginning of a working implementation of MessageSender
|
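For readers without access to the snippet above, the shape of such a MessageSender might be roughly this (a hypothetical sketch, not the actual implementation; `Frame` and the in-memory "wire" are stand-ins for real UCX sends of `rmm::device_buffer` payloads):

```cpp
#include <cstdint>
#include <string>
#include <vector>

using Buffer = std::vector<uint8_t>;

// One frame on the hypothetical wire: a tag, optional metadata, payload.
struct Frame { uint64_t tag; std::string metadata; Buffer payload; };

class MessageSender {
public:
    explicit MessageSender(std::vector<Frame>& wire) : wire_(wire) {}

    // begin_transmission carries the metadata the receiver needs to set up
    // its assembler; in the real design this call blocks until acknowledged.
    void begin_transmission(uint64_t tag, std::string metadata) {
        wire_.push_back({tag, std::move(metadata), {}});
    }

    // Payload frames carry only the tag and the raw buffer; the sender has
    // no understanding of what the payload contains.
    void send_buffer(uint64_t tag, Buffer b) {
        wire_.push_back({tag, "", std::move(b)});
    }

private:
    std::vector<Frame>& wire_;
};
```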
@felipeblazing, related to @percy's comment:
I think we will need to test the UCX (high level, not UCP) behavior, because this UCX demo only used the cuda_cpy layer. |
@felipeblazing, related to the Listener and BufferCommunicator (and possibly to GPUCacheDataMetadata):
Another consideration (maybe for future):
|
This is how cuML does it. I am pretty sure they are taking advantage of these protocols. This is set by the endpoints and what capacity those endpoints have
We aren't configuring the interface at all. This is done by ucx-py
This was an issue in the past and should now be resolved. |
Show me a link to what you mean
Please look at the code above
Yes, right now that's exactly how it works in the example I provided above |
Couple observations:
|
serializer converts a BlazingTable ==> metadata and buffers
The template parameters are so that you can mix and match these in case that becomes something that needs to be done. So imagine in the future there is a library FastSend. FastSend, like UCX, can send GPU buffers directly, but it has its own protocol for specifying information about how the buffer is going to be sent. We could change just the serializer here but use the same BufferToFrameFunction, and change out the NodeAddress from UCX to FastSend (again, that's a made-up name) and have a FastSend Sender. Say that FastSend can use both TCP and UCX endpoints. Now the NodeAddress might be TCP or UCX, but the Sender needs to be FastSend. TCP can use the same Serializer as UCX |
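The mix-and-match idea could be sketched like this (policy names hypothetical; the string payload and loopback transport are stand-ins for BlazingTable and a UCX/TCP endpoint): the serializer and the transport are independent template parameters, so a new transport reuses the existing serializer unchanged.

```cpp
#include <cstdint>
#include <string>
#include <vector>

using Buffer = std::vector<uint8_t>;

// Serializer policy. Real code: BlazingTable ==> metadata + device buffers.
struct SimpleSerializer {
    static Buffer serialize(const std::string& payload) {
        return Buffer(payload.begin(), payload.end());
    }
};

// Transport policy: stand-in for a UCX, TCP, or "FastSend" endpoint.
struct LoopbackTransport {
    std::vector<Buffer> sent;
    void send(Buffer b) { sent.push_back(std::move(b)); }
};

// Mixing happens here: swap either parameter without touching the other.
template <class SerializerT, class TransportT>
class Sender {
public:
    explicit Sender(TransportT& t) : transport_(t) {}
    void send(const std::string& payload) {
        transport_.send(SerializerT::serialize(payload));
    }
private:
    TransportT& transport_;
};
```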
remember the MessageSender class doesn't necessarily know what kind of address it's getting
OK, then they should be just one function. Why would SerializerFunction need to return a pointer to a GPU buffer for TCP? Just return the converted CPU buffer directly
That also could be done with interfaces and it's more explicit about what methods are required to be implemented.
that's right only the |
Because if they both return GPU pointers then you have only one serializer function for both of them. The serializer function is complicated; the converting from GPU to CPU is not. Separating these means that you have fewer differences between protocols
The point is that in the future we might not want the NodeAddress to have both UCX and TCP connection information in it. For now we are doing this for expediency. So it might not have a getUCXAddress when it's TCP, or a getTCPAddress when it's UCX
So I do see the point about the Address being tied to the sender. I agree that Sender should be an interface. We could have NodeAddress be a pointer that gets typecast by the sender. |
what about something that looks more like this
|
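One possible shape for the interface-based alternative (a sketch with hypothetical names, not the proposed code): NodeAddress is an opaque polymorphic base, and each concrete Sender downcasts it to the address type it understands, so a TCP-only address never carries a getUCXAddress and vice versa.

```cpp
#include <string>

// Opaque base; concrete senders downcast to the type they expect.
struct NodeAddress {
    virtual ~NodeAddress() = default;
};

struct TCPAddress : NodeAddress {
    std::string ip;
    int port = 0;
};

struct Sender {
    virtual ~Sender() = default;
    virtual std::string describe(const NodeAddress& addr) const = 0;
};

struct TCPSender : Sender {
    std::string describe(const NodeAddress& addr) const override {
        // The sender, not the address, knows which concrete type applies.
        auto* tcp = dynamic_cast<const TCPAddress*>(&addr);
        if (!tcp) return "wrong address type";
        return tcp->ip + ":" + std::to_string(tcp->port);
    }
};
```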
You could have
If at one point that happens, just change the interface. |
Right now this is how we send the metadata for a df
This is the wait for begin transmission
While this guarantees that the message was delivered, it does not guarantee that the other side has initialized its Message_receiver. We should change this function to actually wait for an acknowledgement from the other side. We could do something along these lines: Server 1 sends begin transmission to Server 2, so Server 1 has to listen for the message it sent N times. We could use the final 2 bytes as a representative of which ral_id is acknowledging, if we want to track that. |
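The acknowledgement wait could be as simple as a counter behind a condition variable (hypothetical sketch; the real acks would carry the acknowledging ral_id in their final bytes):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// After sending begin transmission to N peers, the sender blocks in wait()
// until the listener has delivered all N acknowledgements via ack().
class AckWaiter {
public:
    explicit AckWaiter(int expected) : remaining_(expected) {}

    // Called by the listener when an ack message arrives.
    void ack() {
        std::lock_guard<std::mutex> lk(m_);
        if (--remaining_ == 0) cv_.notify_all();
    }

    // Blocks until every peer has acknowledged.
    void wait() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return remaining_ == 0; });
    }

private:
    int remaining_;
    std::mutex m_;
    std::condition_variable cv_;
};
```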
Big bummer realization: there are certain things that we are expected to keep in scope throughout query execution, like info_tag, which, if I am not mistaken, must stay in scope until the callback is called
|
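One common way around this scope problem (a sketch under assumed names, not the actual fix): give ownership of the request state to the completion callback itself via a shared_ptr, so the state's lifetime no longer depends on the caller's scope.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <vector>

// State that must outlive the in-flight request (like info_tag).
struct RequestState {
    std::string info_tag;
};

using Callback = std::function<void()>;

// Stand-in for posting a send: the real code would hand the callback to the
// transport; here we just return it so "completion" can be driven later.
Callback make_send(std::vector<std::string>& completed,
                   std::string tag_bytes) {
    auto state = std::make_shared<RequestState>(RequestState{std::move(tag_bytes)});
    // The lambda captures the shared_ptr, keeping the state in scope until
    // the callback has fired and the lambda itself is destroyed.
    return [state, &completed] { completed.push_back(state->info_tag); };
}
```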
There is no global place to store a representation of the graph. Because communication spans across queries, we need to do something like store a map of query_id ==> graph so we can do things like get the output caches. |
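That map could be sketched as a small thread-safe registry (hypothetical names; `Graph` here is a stand-in for the real execution graph, which would expose the output caches):

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct Graph {                        // stand-in for the execution graph
    std::string output_cache_name;
};

// Global query_id ==> graph lookup, shared by the communication layer so an
// incoming message for any running query can find that query's graph.
class GraphRegistry {
public:
    void put(uint64_t query_id, std::shared_ptr<Graph> g) {
        std::lock_guard<std::mutex> lk(m_);
        graphs_[query_id] = std::move(g);
    }
    std::shared_ptr<Graph> get(uint64_t query_id) {
        std::lock_guard<std::mutex> lk(m_);
        auto it = graphs_.find(query_id);
        return it == graphs_.end() ? nullptr : it->second;
    }
    void erase(uint64_t query_id) {   // drop the entry when the query ends
        std::lock_guard<std::mutex> lk(m_);
        graphs_.erase(query_id);
    }
private:
    std::map<uint64_t, std::shared_ptr<Graph>> graphs_;
    std::mutex m_;
};
```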
Instead of incrementing, you wait until you receive that acknowledgement using something like
and then it is |
We had a pretty interesting experience trying to get performance and correctness by sending all of our messages between nodes using ucx-py and dask. The single-threaded nature of Python, the fact that dask is using tornado.ioloop, and the fact that we were seeing things like coroutines running at the same time if we were awaiting a ucx.send: it has been really hard to troubleshoot, and the performance isn't there for us.
We need to send and receive messages in the C++ layer to remove the issues we had. Seeing as how we have often been hasty in trying to implement UCX as fast as possible, we are going to try to be smart and slow the heck down. Hell, if it takes 3 times as long to develop and 1/2 as long to debug, we will come out ahead :).
I kind of envision a few classes like this
We can use a combination of these things to send a message and receive it on the other end with a listener that passes the buffer to the bufferAssembler. When the buffer assembler is done, the deserializer converts it to a cudf::table and a Metadata that we can use to add the message to the appropriate class