
[libbladeRF] Device locks up when simultaneously tuning freq and RX'ing #351

Closed
jynik opened this issue Nov 18, 2014 · 3 comments

jynik commented Nov 18, 2014

It has been reported that libbladeRF intermittently locks up when simultaneously tuning the frequency and RX'ing samples via the synchronous interface, more frequently with higher samplerates/bandwidths.

I'll be investigating this further and tracking the status on this here.

jynik self-assigned this Nov 18, 2014

jynik commented Nov 19, 2014

This has only been replicated on USB 3.0. However, that may simply be a matter of transfers occurring more quickly, so I do not yet want to claim that it is a USB 3.0-only issue.

I've added the libbladeRF_test_freq_hop(.exe) program to master to replicate this. The following is what I have been using to reproduce this:

$ ./libbladeRF_test_freq_hop --rx-sample-rate 28M --rx-bandwidth 28M --rx

In Linux, I seem to run into some timeouts and severe lockups. More debugging is required, but things seem stuck within libusb code. In the past, I have seen this occur when transfers were corrupted due to race conditions.

In Windows, I hit this assertion failure because the transfer is marked in-flight. More digging required, but this smells like a race, perhaps due to the hackish grossness involving the stream lock here.

jynik added the Issue: Bug label Nov 19, 2014

jynik commented Nov 21, 2014

In Windows, it appears that I am able to induce this with libbladeRF_test_freq_hop as noted above, while starting up and closing other programs (momentary CPU load?).

When the aforementioned assertion failure is hit, it is because we expect transfer N to complete next; our assertion expects transfer N to be marked TRANSFER_AVAIL when we get to it. However, we receive a completion event for transfer (N+1) % num_transfers. Either we're misusing libusb in some respect, or libusb does not guarantee that transfer completion callbacks will be made in the order the associated transfers physically went out on the bus. I assume the former is the more likely case, with the out-of-order behavior being the result of a defect elsewhere.

I will commit a band-aid for this, such that we will simply use the next available transfer. However, I would like to keep this observation in the back of my mind for when we dig into issue #328.

Below is a patch to log transfer events, should anyone else care to try to replicate this. (And admittedly, it gives me a place to find it again. ;) )

diff --git a/host/libraries/libbladeRF/src/backend/usb/libusb.c b/host/libraries/libbladeRF/src/backend/usb/libusb.c
index d47602f..e52c72d 100644
--- a/host/libraries/libbladeRF/src/backend/usb/libusb.c
+++ b/host/libraries/libbladeRF/src/backend/usb/libusb.c
@@ -33,6 +33,39 @@
 #   define LIBUSB_HANDLE_EVENTS_TIMEOUT_NSEC    (15 * 1000)
 #endif

+#define NUM_OPERATIONS 4096
+
+typedef enum operation_type {
+    MARKED_AVAILABLE,
+    MARKED_IN_FLIGHT,
+    SUBMITTED
+} operation_type;
+
+struct operation {
+    uint64_t event_no;
+    size_t transfer_idx;
+    size_t num_avail;
+    operation_type type;
+};
+
+struct operation_log {
+    struct operation ops[NUM_OPERATIONS];
+    unsigned int curr_idx;
+    uint64_t op_no;
+};
+
+void log_operation(struct operation_log *log, operation_type t,
+                   size_t transfer_idx, size_t num_avail)
+{
+    log->ops[log->curr_idx].event_no = log->op_no;
+    log->ops[log->curr_idx].transfer_idx = transfer_idx;
+    log->ops[log->curr_idx].type = t;
+    log->ops[log->curr_idx].num_avail = num_avail;
+
+    log->op_no++;
+    log->curr_idx = (log->curr_idx + 1) % NUM_OPERATIONS;
+}
+
 struct bladerf_lusb {
     libusb_device           *dev;
     libusb_device_handle    *handle;
@@ -52,6 +85,8 @@ struct lusb_stream_data {
     size_t i;                           /* Index to next transfer */
     struct libusb_transfer **transfers; /* Array of transfer metadata */
     transfer_status *transfer_status;   /* Status of each transfer */
+    struct operation_log *log;
+    bool reported_out_of_order;
 };

 static inline struct bladerf_lusb * lusb_backend(struct bladerf *dev)
@@ -679,6 +714,7 @@ static void LIBUSB_CALL lusb_stream_cb(struct libusb_transfer *transfer)
     } else {
         stream_data->transfer_status[transfer_i] = TRANSFER_AVAIL;
         stream_data->num_avail++;
+        log_operation(stream_data->log, MARKED_AVAILABLE, transfer_i, stream_data->num_avail);
         pthread_cond_signal(&stream->can_submit_buffer);
     }

@@ -779,6 +815,18 @@ get_next_available_transfer(struct lusb_stream_data *stream_data)

     for (n = 0; n < stream_data->num_transfers; n++) {
         if (stream_data->transfer_status[i] == TRANSFER_AVAIL) {
+
+            /* This is here to give us a place to set a breakpoint and
+             * investigate stream_data->log in a debugger */
+#           if 1
+            if (i != stream_data->i) {
+                if (!stream_data->reported_out_of_order) {
+                    log_debug("First out-of-order transfer occurred!\n");
+                    stream_data->reported_out_of_order = true;
+                }
+            }
+#           endif
+
             stream_data->i = i;
             return stream_data->transfers[i];
         }
@@ -820,6 +868,9 @@ static int submit_transfer(struct bladerf_stream *stream, void *buffer)
     assert(stream_data->num_avail != 0);
     stream_data->num_avail--;

+    log_operation(stream_data->log, MARKED_IN_FLIGHT, prev_idx,
+                  stream_data->num_avail);
+
     /* FIXME We have an inherent issue here with lock ordering between
      *       stream->lock and libusb's underlying event lock, so we
      *       have to drop the stream->lock as a workaround.
@@ -831,10 +882,14 @@ static int submit_transfer(struct bladerf_stream *stream, void *buffer)
      *       Ultimately, we need to review our async scheme and associated
      *       lock schemes.
      */
+
     MUTEX_UNLOCK(&stream->lock);
     status = libusb_submit_transfer(transfer);
     MUTEX_LOCK(&stream->lock);

+    log_operation(stream_data->log, SUBMITTED, prev_idx,
+                  stream_data->num_avail);
+
     if (status != 0) {
         log_error("Failed to submit transfer in %s: %s\n",
                   __FUNCTION__, libusb_error_name(status));
@@ -875,6 +930,7 @@ static int lusb_init_stream(void *driver, struct bladerf_stream *stream,
     stream_data->num_transfers = num_transfers;
     stream_data->num_avail = 0;
     stream_data->i = 0;
+    stream_data->reported_out_of_order = false;

     stream_data->transfers =
         malloc(num_transfers * sizeof(struct libusb_transfer *));
@@ -894,6 +950,11 @@ static int lusb_init_stream(void *driver, struct bladerf_stream *stream,
         goto error;
     }

+    stream_data->log = calloc(1, sizeof(stream_data->log[0]));
+    if (stream_data->log == NULL) {
+        assert(0);
+    }
+
     /* Create the libusb transfers */
     for (i = 0; i < stream_data->num_transfers; i++) {
         stream_data->transfers[i] = libusb_alloc_transfer(0);

jynik added a commit that referenced this issue Nov 21, 2014
The assert() failures in Windows from issue #351 appear to be the
result of transfer completion callbacks arriving out of order. This
seems to happen pretty reliably after an event that momentarily spikes
CPU load.

When we expect transfer number N to be available next,
I'm seeing that transfer [ (N+1) % num_transfers ] is actually coming
back as the next available transfer.

This patch is a band-aid to allow us to just use the next available transfer,
instead of assuming which transfer it will be.

While this should keep things running instead of grinding to a halt, it does
*not* address the underlying issue. Out-of-order transfer completion indicates
users may get buffers out of order, which is unacceptable.
jynik added a commit that referenced this issue Nov 24, 2014
The assert() failures in Windows from issue #351 appear to be the result
of libusb transfer completion callbacks arriving out of order. This
seems to happen pretty reliably after an event that momentarily spikes
CPU load when using higher sample rates.

This has been reproduced on Windows 7 and 8.1 with libusb 1.0.19. The
Cypress backend is not affected.

When we expect transfer number N to be available next, I'm seeing
that transfer [ (N+1) % num_transfers ] is actually coming back as the
next available transfer.

For example we get transfers ... 6, 7, 8, 10, 9, 11, 12 ...

This patch is a band-aid to allow us to just use the next available
transfer, instead of assuming which transfer it will be.

While this should keep things running instead of grinding to a halt [1],
it does *not* address the underlying issue. Out-of-order transfer
completion indicates users may get buffers out of order, which is
unacceptable. As such, we will need to keep digging further, and I
will register a new issue regarding libusb giving us out-of-order
transfer callbacks.

[1] Arguably, a transient bad buffer is better than an entire
    application crashing, now that we're aware of this.

jynik commented Nov 24, 2014

A workaround for the hang/crash has been applied in c875837.

However, there is still an underlying issue. #352 has been opened to track this.

jynik closed this as completed Nov 24, 2014