
[libbladeRF] Device locks up when simultaneously tuning freq and RX'ing #351

Closed
jynik opened this issue Nov 18, 2014 · 3 comments

jynik commented Nov 18, 2014

It has been reported that libbladeRF intermittently locks up when simultaneously tuning the frequency and RX'ing samples via the synchronous interface, more frequently with higher samplerates/bandwidths.

I'll be investigating this further and tracking the status on this here.

jynik self-assigned this Nov 18, 2014

jynik commented Nov 19, 2014

This has only been replicated on USB 3.0. However, that may simply be a matter of transfers occurring more quickly, so I do not yet want to claim that it is a USB 3.0-only issue.

I've added the libbladeRF_test_freq_hop(.exe) program to master to replicate this. The following is what I have been using to reproduce this:

$ ./libbladeRF_test_freq_hop --rx-sample-rate 28M --rx-bandwidth 28M --rx

In Linux, I seem to run into some timeouts and severe lockups. More debugging is required, but things seem stuck within libusb code. In the past, I have seen this occur when transfers were corrupted due to race conditions.

In Windows, I hit this assertion failure because the transfer is marked in-flight. More digging required, but this smells like a race, perhaps due to the hackish grossness involving the stream lock here.

jynik added the Issue: Bug label Nov 19, 2014

jynik commented Nov 21, 2014

In Windows, it appears that I am able to induce this with libbladeRF_test_freq_hop as noted above, while starting up and closing other programs (momentary CPU load?).

When the aforementioned assertion failure is hit, it is because we expect transfer N to complete next; our assertion expects transfer N to be marked TRANSFER_AVAIL when we get to it. However, we receive a completion event for transfer (N+1) % num_transfers. Either we're misusing libusb in some respect, or libusb does not guarantee that transfer completion callbacks will be made in the order the associated transfers physically went out on the bus. I assume the former is the more likely case, with the out-of-order behavior being the result of a defect elsewhere.

I will commit a band-aid for this, such that we will simply use the next available transfer. However, I would like to keep this observation in the back of my mind for when we dig into issue #328.

Below is a patch to log transfer events, should anyone else care to try to replicate this. (And admittedly, it gives me a place to find it again. ;) )

diff --git a/host/libraries/libbladeRF/src/backend/usb/libusb.c b/host/libraries/libbladeRF/src/backend/usb/libusb.c
index d47602f..e52c72d 100644
--- a/host/libraries/libbladeRF/src/backend/usb/libusb.c
+++ b/host/libraries/libbladeRF/src/backend/usb/libusb.c
@@ -33,6 +33,39 @@
 #   define LIBUSB_HANDLE_EVENTS_TIMEOUT_NSEC    (15 * 1000)
 #endif

+#define NUM_OPERATIONS 4096
+
+typedef enum operation_type {
+    MARKED_AVAILABLE,
+    MARKED_IN_FLIGHT,
+    SUBMITTED
+} operation_type;
+
+struct operation {
+    uint64_t event_no;
+    size_t transfer_idx;
+    size_t num_avail;
+    operation_type type;
+};
+
+struct operation_log {
+    struct operation ops[NUM_OPERATIONS];
+    unsigned int curr_idx;
+    uint64_t op_no;
+};
+
+void log_operation(struct operation_log *log, operation_type t,
+                   size_t transfer_idx, size_t num_avail)
+{
+    log->ops[log->curr_idx].event_no = log->op_no;
+    log->ops[log->curr_idx].transfer_idx = transfer_idx;
+    log->ops[log->curr_idx].type = t;
+    log->ops[log->curr_idx].num_avail = num_avail;
+
+    log->op_no++;
+    log->curr_idx = (log->curr_idx + 1) % NUM_OPERATIONS;
+}
+
 struct bladerf_lusb {
     libusb_device           *dev;
     libusb_device_handle    *handle;
@@ -52,6 +85,8 @@ struct lusb_stream_data {
     size_t i;                           /* Index to next transfer */
     struct libusb_transfer **transfers; /* Array of transfer metadata */
     transfer_status *transfer_status;   /* Status of each transfer */
+    struct operation_log *log;
+    bool reported_out_of_order;
 };

 static inline struct bladerf_lusb * lusb_backend(struct bladerf *dev)
@@ -679,6 +714,7 @@ static void LIBUSB_CALL lusb_stream_cb(struct libusb_transfer *transfer)
     } else {
         stream_data->transfer_status[transfer_i] = TRANSFER_AVAIL;
         stream_data->num_avail++;
+        log_operation(stream_data->log, MARKED_AVAILABLE, transfer_i, stream_data->num_avail);
         pthread_cond_signal(&stream->can_submit_buffer);
     }

@@ -779,6 +815,18 @@ get_next_available_transfer(struct lusb_stream_data *stream_data)

     for (n = 0; n < stream_data->num_transfers; n++) {
         if (stream_data->transfer_status[i] == TRANSFER_AVAIL) {
+
+            /* This is here to give us a place to set a breakpoint and
+             * investigate stream_data->log in a debugger */
+#           if 1
+            if (i != stream_data->i) {
+                if (!stream_data->reported_out_of_order) {
+                    log_debug("First out-of-order transfer occurred!\n");
+                    stream_data->reported_out_of_order = true;
+                }
+            }
+#           endif
+
             stream_data->i = i;
             return stream_data->transfers[i];
         }
@@ -820,6 +868,9 @@ static int submit_transfer(struct bladerf_stream *stream, void *buffer)
     assert(stream_data->num_avail != 0);
     stream_data->num_avail--;

+    log_operation(stream_data->log, MARKED_IN_FLIGHT, prev_idx,
+                  stream_data->num_avail);
+
     /* FIXME We have an inherent issue here with lock ordering between
      *       stream->lock and libusb's underlying event lock, so we
      *       have to drop the stream->lock as a workaround.
@@ -831,10 +882,14 @@ static int submit_transfer(struct bladerf_stream *stream, void *buffer)
      *       Ultimately, we need to review our async scheme and associated
      *       lock schemes.
      */
+
     MUTEX_UNLOCK(&stream->lock);
     status = libusb_submit_transfer(transfer);
     MUTEX_LOCK(&stream->lock);

+    log_operation(stream_data->log, SUBMITTED, prev_idx,
+                  stream_data->num_avail);
+
     if (status != 0) {
         log_error("Failed to submit transfer in %s: %s\n",
                   __FUNCTION__, libusb_error_name(status));
@@ -875,6 +930,7 @@ static int lusb_init_stream(void *driver, struct bladerf_stream *stream,
     stream_data->num_transfers = num_transfers;
     stream_data->num_avail = 0;
     stream_data->i = 0;
+    stream_data->reported_out_of_order = false;

     stream_data->transfers =
         malloc(num_transfers * sizeof(struct libusb_transfer *));
@@ -894,6 +950,11 @@ static int lusb_init_stream(void *driver, struct bladerf_stream *stream,
         goto error;
     }

+    stream_data->log = calloc(1, sizeof(stream_data->log[0]));
+    if (stream_data->log == NULL) {
+        assert(0);
+    }
+
     /* Create the libusb transfers */
     for (i = 0; i < stream_data->num_transfers; i++) {
         stream_data->transfers[i] = libusb_alloc_transfer(0);

jynik added a commit that referenced this issue Nov 21, 2014
The assert() failures in Windows from issue #351 appear to be the
result of transfer completion callbacks arriving out of order. This
seems to happen pretty reliably after an event that momentarily spikes
CPU load.

When we expect transfer number N to be available next,
I'm seeing that transfer [ (N+1) % num_transfers ] is actually coming
back as the next available transfer.

This patch is a band-aid to allow us to just use the next available transfer,
instead of assuming which transfer it will be.

While this should keep things running instead of grinding to a halt, it does
*not* address the underlying issue. Out-of-order transfer completion indicates
users may get buffers out of order, which is unacceptable.
jynik added a commit that referenced this issue Nov 24, 2014
The assert() failures in Windows from issue #351 appear to be the result
of libusb transfer completion callbacks arriving out of order. This
seems to happen pretty reliably after an event that momentarily spikes
CPU load when using higher sample rates.

This has been reproduced on Windows 7 and 8.1 with libusb 1.0.19. The
Cypress backend is not affected.

When we expect transfer number N to be available next, I'm seeing
that transfer [ (N+1) % num_transfers ] is actually coming back as the
next available transfer.

For example we get transfers ... 6, 7, 8, 10, 9, 11, 12 ...

This patch is a band-aid to allow us to just use the next available
transfer, instead of assuming which transfer it will be.

While this should keep things running instead of grinding to a halt [1],
it does *not* address the underlying issue. Out-of-order transfer
completion indicates users may get buffers out of order, which is
unacceptable. As such, we will need to keep digging further, and I
will register a new issue regarding libusb giving us out-of-order
transfer callbacks.

[1] Arguably, a transient bad buffer is better than an entire
    application crashing, now that we're aware of this.

jynik commented Nov 24, 2014

A workaround for the hang/crash has been applied in c875837.

However, there is still an underlying issue. #352 has been opened to track this.

jynik closed this as completed Nov 24, 2014