work completion event received with wrong value #44

yaelna · 2019-02-07T10:59:19Z

we tried implementing sever client on top of send recv operations and noticed that sometimes we get a receive completion event following a send operation (or vise versa)

I've managed to reproduce the issue on top of RDMAvsTcpBenchmark ( changed files attached) by saving the last completion event and compering the new event against the old one.
we expect that events will always be send-recv-send-recv-... and found that we get two consecutive similar events.

server output:
2305 [Thread-0] INFO com.ibm.disni - got event type + RDMA_CM_EVENT_ESTABLISHED, srcAddress /192.168.33.137:8881, dstAddress /192.168.33.137:51085
RDMAvsTcpBenchmarkServer::client connection accepted
2308 [Thread-1] INFO com.ibm.disni - cq processing, caught exception but keep going server got IBV_WC_SEND event twice in a row. last id = 2000, current id 2000****
java.lang.RuntimeException: server got IBV_WC_SEND event twice in a row. last id = 2000, current id 2000****
at com.ibm.disni.examples.SendRecvServer$CustomServerEndpoint.dispatchCqEvent(SendRecvServer.java:212)
at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:37)
at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:29)
at com.ibm.disni.RdmaCqProcessor.dispatchCqEvent(RdmaCqProcessor.java:106)
at com.ibm.disni.RdmaCqProcessor.run(RdmaCqProcessor.java:136)
at java.lang.Thread.run(Thread.java:745)
2309 [Thread-1] INFO com.ibm.disni - cq processing, caught exception but keep going server got IBV_WC_RECV event twice in a row. last id = 2001, current id 2001****
java.lang.RuntimeException: server got IBV_WC_RECV event twice in a row. last id = 2001, current id 2001****
at com.ibm.disni.examples.SendRecvServer$CustomServerEndpoint.dispatchCqEvent(SendRecvServer.java:212)
at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:37)
at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:29)
at com.ibm.disni.RdmaCqProcessor.dispatchCqEvent(RdmaCqProcessor.java:106)
at com.ibm.disni.RdmaCqProcessor.run(RdmaCqProcessor.java:136)
at java.lang.Thread.run(Thread.java:745)

changedFiles.zip

patrickstuedi · 2019-02-08T10:20:04Z

Does the problem also occur with the SendRecv Benchmark (https://github.com/zrlio/disni/blob/master/src/test/java/com/ibm/disni/benchmarks/SendRecvClient.java, https://github.com/zrlio/disni/blob/master/src/test/java/com/ibm/disni/benchmarks/SendRecvServer.java)?

yaelna · 2019-02-12T16:36:13Z

I didn't try it, I have made changes in these classes:
https://github.com/zrlio/disni/blob/master/src/test/java/com/ibm/disni/examples/SendRecvClient.java
https://github.com/zrlio/disni/blob/master/src/test/java/com/ibm/disni/examples/SendRecvServer.java

petro-rudenko · 2019-02-15T12:26:34Z

The issue is that disni reuses wc object. In your code you need to do:

lastEvent.set(wc.clone());

After changing this got correct result SEND -> RECV -> SEND -> RECV, etc.

yaelna · 2019-03-19T16:45:25Z

I've changed my code as follow:

public synchronized void dispatchCqEvent(IbvWC wc) throws IOException {
IbvWC clone = wc.clone();
int newOpCode = clone.getOpcode();
IbvWC old = lastEvent.get();
if(old == null){
lastEvent.set(clone);
} else if (old.getOpcode() == newOpCode){
throw new RuntimeException("*******server got "+ IbvWC.IbvWcOpcode.valueOf(newOpCode) +" event twice in a row. last id = "+old.getWr_id()+", current id "+wc.getWr_id()+"***");
}
lastEvent.set(clone);
wcEvents.add(clone);
}
and still I get this error in my logs :
7814 [Thread-1] INFO com.ibm.disni - cq processing, caught exception but keep going server got IBV_WC_SEND event twice in a row. last id = 2000, current id 2001
java.lang.RuntimeException: server got IBV_WC_SEND event twice in a row. last id = 2000, current id 2001
at com.ibm.disni.examples.SendRecvServer$CustomServerEndpoint.dispatchCqEvent(SendRecvServer.java:213)
at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:37)
at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:29)
at com.ibm.disni.RdmaCqProcessor.dispatchCqEvent(RdmaCqProcessor.java:106)
at com.ibm.disni.RdmaCqProcessor.run(RdmaCqProcessor.java:136)
at java.lang.Thread.run(Thread.java:745)

I am using Soft-RoCE to do this check , are you using the real hardware?

petro-rudenko · 2019-03-20T11:12:13Z

@yaelna Can you please run this function:

public void dispatchCqEvent(IbvWC wc) throws IOException {
                        System.out.println("Server got event " + wc.getWr_id() + " : " + IbvWC.IbvWcOpcode.valueOf(wc.getOpcode()) +
                                "Old " + lastEvent.get() + "\n\n");
                        int newOpCode = wc.getOpcode();
                        IbvWC old = lastEvent.get();
                        if(old == null){
                                lastEvent.set(wc.clone());
                                System.out.println(" Last event1 = " + lastEvent.get());
                        } else if (old.getOpcode() == newOpCode){
                                System.out.println( "Last event: " + old  + "current :" + wc);
                                throw new RuntimeException("*******server got "+ IbvWC.IbvWcOpcode.valueOf(newOpCode) +" event twice in a row. last id = "+old.getWr_id()+", current id "+old.getWr_id()+"***********");
                        }
                        lastEvent.set(wc.clone());
                        System.out.println("Last event1 = " + lastEvent.get());
                        wcEvents.add(wc);
                }

Yes i'm running on real Roce device. Regarding SoftRoce, maybe related to #37. Please see my comment:
#37 (comment)

venkatsc · 2019-10-28T20:27:47Z

I am testing Flink with DiSNI and noticed some sporadic duplicated receives. We are already cloning the WC in endpoint classes as suggested in this bug.

Receive complete: 406030 buff address: 139941385133904 seq:405014 
Receive complete: 406031 buff address: 139941385166688 seq:405016 <-- should be 405015
Receive complete: 406032 buff address: 139941385199472 seq:405016

All the receives starting from seq 0 till 405014 received without any repetition. This issue is happening only once in a while.

Server did not repeat the sequence during send, however, on client it is received twice. Any ideas why this could happen?

Additional information: it is tested on InfiniBand HCA mlx-4

venkatsc · 2019-10-31T18:51:59Z

Disclaimer: Following analysis is purely based on code walk through.

The issue could be with the way free() is used. For instance, the provided examples executes the command and frees the resources ,immediately, before WC event received.

This immediate free() of resources could tamper with the work request, as free() will return the direct buffer to which this command is written to the pool, before HCA executing the posted WR. The returned buffer could be allocated to the next WR and tampers the data associated with the already posted WR.

Imagining the HCA reading the WR from the command address and current WR being written to the same address as an interleaved operation, this could explain why we see duplicate WR ids, data in the receive or send completions.

The solution could be to hold command reference until WC is received and call free() on command only on WC receive.

patrickstuedi · 2019-10-31T19:25:36Z

Hi Venkatsc, thanks for posting this. Let me double check the code as well and find out if this is really an issue of the buffer being freed too early, that would be a major bug.

patrickstuedi · 2019-10-31T19:30:44Z

Can you point me to the line in the example with the problematic early free?

venkatsc · 2019-10-31T23:02:50Z

@patrickstuedi I have faced with this issue irregularly while testing code by posting many WRs at once. As per my reading of the code, it is more of documentation fix. Method free() should not be called until WR is completed. But the examples in the repository uses free() immediately after execute() and that may not cause issues when small number of WR posted.

The below line of code is taken from here

endpoint.postSend(endpoint.getWrList_send()).execute().free();

I think, below flow causing the issue when execute and free methods used in immediate succession.

Post a WR with WR_ID: 100 by issuing exec() and the object is written to memory address.

Let us make some assumptions:
- Command is written to the memory address 123
- There are few hundred pending WRs queued at the HCA and the current WR is added to the queue.

Since exec() is non-blocking, as soon as the WR is added to the queue, the free() will return the buffer to the memory pool by calling 1, 2.
Post new WR with WR_ID: 101, this request tries to get the buffer from the pool.
- Let us assume, the buffer with 123 from the pool is assigned to this WR .
Now, following three scenarios can happen
- WR with WR_ID: 100 is processed by HCA already, everything is fine. WR with WR_ID: 101 can override the data at memory address 123 without any issues.
- HCA is currently processing WR with WR_ID: 100 and WR with WR_ID: 101 starts writing its request at the same location. Result: memory corruption and depending on how much HCA read, the result could be duplicate WR with same id, or different ids posting same memory address for data send/receive.
- HCA has not reached the WR_ID:100 for processing. In this case, WR_ID: 101 would replace WR_ID: 100 and add duplicate WR to the queue.

See, zrlio/disni#44 (comment)

patrickstuedi · 2019-11-05T15:05:32Z

in (2), when exec is called, the WR list is actually copied by the user driver to a dedicated memory area from where the NIC is fetching the WR list via DMA. The copied WR lits also needs to have the specific format matching the device so the user driver also creates a slightly different layout. Therefore, the application memory backing the WR list can be freed and re-used right away after exec().

The copying of the WR list can been seen in the default post_send call of libibverbs that eventuallyIcalls into the kernel:

https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/cmd.c#L1247

This is not the post_send call that typiclly gets executed if you run on IB or RoCE beause normally a specific user-driver for a particular NIC is directly memory mapping the NIC queues and populates WRs without kernel involvement, but the copying will be similar.

At least that's my understanding of how things work, happy to discuss.

patrickstuedi · 2019-11-05T15:06:20Z

Note that we are talking about copying WRs, not data, just to not create confusion here.

BernardMetzler · 2019-11-05T16:26:36Z

-----"Patrick Stuedi" <notifications@github.com> wrote: -----

To: "zrlio/disni" ***@***.***> From: "Patrick Stuedi" ***@***.***> Date: 11/05/2019 04:42PM Cc: "Subscribed" ***@***.***> Subject: [EXTERNAL] Re: [zrlio/disni] work completion event received with wrong value (#44) in (2), when exec is called, the WR list is actually copied by the user driver to a dedicated memory area from where the NIC is fetching the WR list via DMA. The copied WR lits also needs to have the specific format matching the device so the user driver also creates a slightly different layout. Therefore, the application memory backing the WR list can be freed and re-used right away after exec(). The copying of the WR list can been seen in the default post_send call of libibverbs that eventuallyIcalls into the kernel: https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/cmd.c# L1247 This is not the post_send call that typiclly gets executed if you run on IB or RoCE beause normally a specific user-driver for a particular NIC is directly memory mapping the NIC queues and populates WRs without kernel involvement, but the copying will be similar. At least that's my understanding of how things work, happy to discuss.

I agree. The work request list is only needed to create driver specific work queue elements. The post_send() or post_receive() call is synchronous. If the call finishes, appropriate work requests got instantiated within the work queue of the driver, or the call fails since the work queue was full. In any case, with the call return, the WR list is under control of the caller again and can be re-used/overwritten, or free'd. Best regards, Bernard.

venkatsc · 2019-11-06T12:34:29Z

https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/cmd.c#L1247

Indeed, you are right. I was under assumption that, libibverbs does not copy the request from the given address. As I mentioned before, the issues is occurring rarely, but after not freeing resource until completion I didn't see the issue. Hence, thought free() could be the issue.

Recently, after posting the comment, I had faced the duplicate response again. Issue is due to application bug that is overriding header in the same memory location.

venkatsc · 2019-11-06T12:38:41Z

Should I remove the comments, to avoid misdirection. As you explained, free() is not the cause.

patrickstuedi · 2019-11-06T20:34:26Z

Please leave the comments, I think it's a good discussion. And it would be great to find the actual cause for this problem..

yaelna closed this as completed Feb 12, 2019

yaelna reopened this Feb 12, 2019

venkatsc added a commit to venkatsc/flink that referenced this issue Nov 1, 2019

Free should not be called untill the WR is complete

4d7516c

See, zrlio/disni#44 (comment)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

work completion event received with wrong value #44

work completion event received with wrong value #44

yaelna commented Feb 7, 2019

patrickstuedi commented Feb 8, 2019

yaelna commented Feb 12, 2019

petro-rudenko commented Feb 15, 2019

yaelna commented Mar 19, 2019 •

edited

petro-rudenko commented Mar 20, 2019 •

edited

venkatsc commented Oct 28, 2019 •

edited

venkatsc commented Oct 31, 2019

patrickstuedi commented Oct 31, 2019

patrickstuedi commented Oct 31, 2019

venkatsc commented Oct 31, 2019

patrickstuedi commented Nov 5, 2019

patrickstuedi commented Nov 5, 2019

BernardMetzler commented Nov 5, 2019 via email

venkatsc commented Nov 6, 2019 •

edited

venkatsc commented Nov 6, 2019

patrickstuedi commented Nov 6, 2019

work completion event received with wrong value #44

work completion event received with wrong value #44

Comments

yaelna commented Feb 7, 2019

patrickstuedi commented Feb 8, 2019

yaelna commented Feb 12, 2019

petro-rudenko commented Feb 15, 2019

yaelna commented Mar 19, 2019 • edited

petro-rudenko commented Mar 20, 2019 • edited

venkatsc commented Oct 28, 2019 • edited

venkatsc commented Oct 31, 2019

patrickstuedi commented Oct 31, 2019

patrickstuedi commented Oct 31, 2019

venkatsc commented Oct 31, 2019

patrickstuedi commented Nov 5, 2019

patrickstuedi commented Nov 5, 2019

BernardMetzler commented Nov 5, 2019 via email

venkatsc commented Nov 6, 2019 • edited

venkatsc commented Nov 6, 2019

patrickstuedi commented Nov 6, 2019

yaelna commented Mar 19, 2019 •

edited

petro-rudenko commented Mar 20, 2019 •

edited

venkatsc commented Oct 28, 2019 •

edited

venkatsc commented Nov 6, 2019 •

edited