Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

work completion event received with wrong value #44

Open
yaelna opened this issue Feb 7, 2019 · 16 comments
Open

work completion event received with wrong value #44

yaelna opened this issue Feb 7, 2019 · 16 comments

Comments

@yaelna
Copy link

yaelna commented Feb 7, 2019

we tried implementing sever client on top of send recv operations and noticed that sometimes we get a receive completion event following a send operation (or vise versa)

I've managed to reproduce the issue on top of RDMAvsTcpBenchmark ( changed files attached) by saving the last completion event and compering the new event against the old one.
we expect that events will always be send-recv-send-recv-... and found that we get two consecutive similar events.

server output:
2305 [Thread-0] INFO com.ibm.disni - got event type + RDMA_CM_EVENT_ESTABLISHED, srcAddress /192.168.33.137:8881, dstAddress /192.168.33.137:51085
RDMAvsTcpBenchmarkServer::client connection accepted
2308 [Thread-1] INFO com.ibm.disni - cq processing, caught exception but keep going server got IBV_WC_SEND event twice in a row. last id = 2000, current id 2000****
java.lang.RuntimeException: server got IBV_WC_SEND event twice in a row. last id = 2000, current id 2000****
at com.ibm.disni.examples.SendRecvServer$CustomServerEndpoint.dispatchCqEvent(SendRecvServer.java:212)
at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:37)
at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:29)
at com.ibm.disni.RdmaCqProcessor.dispatchCqEvent(RdmaCqProcessor.java:106)
at com.ibm.disni.RdmaCqProcessor.run(RdmaCqProcessor.java:136)
at java.lang.Thread.run(Thread.java:745)
2309 [Thread-1] INFO com.ibm.disni - cq processing, caught exception but keep going server got IBV_WC_RECV event twice in a row. last id = 2001, current id 2001****
java.lang.RuntimeException: server got IBV_WC_RECV event twice in a row. last id = 2001, current id 2001****
at com.ibm.disni.examples.SendRecvServer$CustomServerEndpoint.dispatchCqEvent(SendRecvServer.java:212)
at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:37)
at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:29)
at com.ibm.disni.RdmaCqProcessor.dispatchCqEvent(RdmaCqProcessor.java:106)
at com.ibm.disni.RdmaCqProcessor.run(RdmaCqProcessor.java:136)
at java.lang.Thread.run(Thread.java:745)

changedFiles.zip

@yaelna yaelna closed this as completed Feb 12, 2019
@yaelna yaelna reopened this Feb 12, 2019
@petro-rudenko
Copy link
Contributor

The issue is that disni reuses wc object. In your code you need to do:

lastEvent.set(wc.clone());

After changing this got correct result SEND -> RECV -> SEND -> RECV, etc.

@yaelna
Copy link
Author

yaelna commented Mar 19, 2019

I've changed my code as follow:

public synchronized void dispatchCqEvent(IbvWC wc) throws IOException {
IbvWC clone = wc.clone();
int newOpCode = clone.getOpcode();
IbvWC old = lastEvent.get();
if(old == null){
lastEvent.set(clone);
} else if (old.getOpcode() == newOpCode){
throw new RuntimeException("*******server got "+ IbvWC.IbvWcOpcode.valueOf(newOpCode) +" event twice in a row. last id = "+old.getWr_id()+", current id "+wc.getWr_id()+"***");
}
lastEvent.set(clone);
wcEvents.add(clone);
}
and still I get this error in my logs :
7814 [Thread-1] INFO com.ibm.disni - cq processing, caught exception but keep going server got IBV_WC_SEND event twice in a row. last id = 2000, current id 2001
java.lang.RuntimeException: server got IBV_WC_SEND event twice in a row. last id = 2000, current id 2001

at com.ibm.disni.examples.SendRecvServer$CustomServerEndpoint.dispatchCqEvent(SendRecvServer.java:213)
at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:37)
at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:29)
at com.ibm.disni.RdmaCqProcessor.dispatchCqEvent(RdmaCqProcessor.java:106)
at com.ibm.disni.RdmaCqProcessor.run(RdmaCqProcessor.java:136)
at java.lang.Thread.run(Thread.java:745)

I am using Soft-RoCE to do this check , are you using the real hardware?

@petro-rudenko
Copy link
Contributor

petro-rudenko commented Mar 20, 2019

@yaelna Can you please run this function:

public void dispatchCqEvent(IbvWC wc) throws IOException {
                        System.out.println("Server got event " + wc.getWr_id() + " : " + IbvWC.IbvWcOpcode.valueOf(wc.getOpcode()) +
                                "Old " + lastEvent.get() + "\n\n");
                        int newOpCode = wc.getOpcode();
                        IbvWC old = lastEvent.get();
                        if(old == null){
                                lastEvent.set(wc.clone());
                                System.out.println(" Last event1 = " + lastEvent.get());
                        } else if (old.getOpcode() == newOpCode){
                                System.out.println( "Last event: " + old  + "current :" + wc);
                                throw new RuntimeException("*******server got "+ IbvWC.IbvWcOpcode.valueOf(newOpCode) +" event twice in a row. last id = "+old.getWr_id()+", current id "+old.getWr_id()+"***********");
                        }
                        lastEvent.set(wc.clone());
                        System.out.println("Last event1 = " + lastEvent.get());
                        wcEvents.add(wc);
                }

Yes i'm running on real Roce device. Regarding SoftRoce, maybe related to #37. Please see my comment:
#37 (comment)

@venkatsc
Copy link

venkatsc commented Oct 28, 2019

I am testing Flink with DiSNI and noticed some sporadic duplicated receives. We are already cloning the WC in endpoint classes as suggested in this bug.

Receive complete: 406030 buff address: 139941385133904 seq:405014 
Receive complete: 406031 buff address: 139941385166688 seq:405016 <-- should be 405015
Receive complete: 406032 buff address: 139941385199472 seq:405016

All the receives starting from seq 0 till 405014 received without any repetition. This issue is happening only once in a while.

Server did not repeat the sequence during send, however, on client it is received twice. Any ideas why this could happen?

Additional information: it is tested on InfiniBand HCA mlx-4

@venkatsc
Copy link

Disclaimer: Following analysis is purely based on code walk through.

The issue could be with the way free() is used. For instance, the provided examples executes the command and frees the resources ,immediately, before WC event received.

This immediate free() of resources could tamper with the work request, as free() will return the direct buffer to which this command is written to the pool, before HCA executing the posted WR. The returned buffer could be allocated to the next WR and tampers the data associated with the already posted WR.

Imagining the HCA reading the WR from the command address and current WR being written to the same address as an interleaved operation, this could explain why we see duplicate WR ids, data in the receive or send completions.

The solution could be to hold command reference until WC is received and call free() on command only on WC receive.

@patrickstuedi
Copy link
Member

Hi Venkatsc, thanks for posting this. Let me double check the code as well and find out if this is really an issue of the buffer being freed too early, that would be a major bug.

@patrickstuedi
Copy link
Member

Can you point me to the line in the example with the problematic early free?

@venkatsc
Copy link

@patrickstuedi I have faced with this issue irregularly while testing code by posting many WRs at once. As per my reading of the code, it is more of documentation fix. Method free() should not be called until WR is completed. But the examples in the repository uses free() immediately after execute() and that may not cause issues when small number of WR posted.

The below line of code is taken from here

endpoint.postSend(endpoint.getWrList_send()).execute().free();

I think, below flow causing the issue when execute and free methods used in immediate succession.

  1. Post a WR with WR_ID: 100 by issuing exec() and the object is written to memory address.
  • Let us make some assumptions:
    • Command is written to the memory address 123
    • There are few hundred pending WRs queued at the HCA and the current WR is added to the queue.
  1. Since exec() is non-blocking, as soon as the WR is added to the queue, the free() will return the buffer to the memory pool by calling 1, 2.
  2. Post new WR with WR_ID: 101, this request tries to get the buffer from the pool.
    • Let us assume, the buffer with 123 from the pool is assigned to this WR .
  3. Now, following three scenarios can happen
    • WR with WR_ID: 100 is processed by HCA already, everything is fine. WR with WR_ID: 101 can override the data at memory address 123 without any issues.
    • HCA is currently processing WR with WR_ID: 100 and WR with WR_ID: 101 starts writing its request at the same location. Result: memory corruption and depending on how much HCA read, the result could be duplicate WR with same id, or different ids posting same memory address for data send/receive.
    • HCA has not reached the WR_ID:100 for processing. In this case, WR_ID: 101 would replace WR_ID: 100 and add duplicate WR to the queue.

venkatsc added a commit to venkatsc/flink that referenced this issue Nov 1, 2019
@patrickstuedi
Copy link
Member

in (2), when exec is called, the WR list is actually copied by the user driver to a dedicated memory area from where the NIC is fetching the WR list via DMA. The copied WR lits also needs to have the specific format matching the device so the user driver also creates a slightly different layout. Therefore, the application memory backing the WR list can be freed and re-used right away after exec().

The copying of the WR list can been seen in the default post_send call of libibverbs that eventuallyIcalls into the kernel:

https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/cmd.c#L1247

This is not the post_send call that typiclly gets executed if you run on IB or RoCE beause normally a specific user-driver for a particular NIC is directly memory mapping the NIC queues and populates WRs without kernel involvement, but the copying will be similar.

At least that's my understanding of how things work, happy to discuss.

@patrickstuedi
Copy link
Member

Note that we are talking about copying WRs, not data, just to not create confusion here.

@BernardMetzler
Copy link
Member

BernardMetzler commented Nov 5, 2019 via email

@venkatsc
Copy link

venkatsc commented Nov 6, 2019

https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/cmd.c#L1247

Indeed, you are right. I was under assumption that, libibverbs does not copy the request from the given address. As I mentioned before, the issues is occurring rarely, but after not freeing resource until completion I didn't see the issue. Hence, thought free() could be the issue.

Recently, after posting the comment, I had faced the duplicate response again. Issue is due to application bug that is overriding header in the same memory location.

@venkatsc
Copy link

venkatsc commented Nov 6, 2019

Should I remove the comments, to avoid misdirection. As you explained, free() is not the cause.

@patrickstuedi
Copy link
Member

Please leave the comments, I think it's a good discussion. And it would be great to find the actual cause for this problem..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants