
"trap divide error" while generating calibration table in CLI: cal table dc rx #373

Closed
mambrus opened this issue Feb 11, 2015 · 15 comments
Labels
Issue: Bug (It's a bug and it oughta be fixed)

Comments

@mambrus
Contributor

mambrus commented Feb 11, 2015

@mambrus
Contributor Author

mambrus commented Feb 12, 2015

The previous post's session and attached core files may be irrelevant, as I ran the wrong FW by mistake (1.7.1 instead of 1.8.0, due to a bug in an automation tool that did not consider late git tags).

I corrected the mistake, but there are still problems; actually worse this time. Attaching a new session and a corresponding script that provokes the error.

export BLADERF_LOG_LEVEL="verbose"

Session
syslog
calibrate.sh

The new behavior is worse because bladeRF-cli now hangs badly on the third invocation (which is odd in itself). The process cannot be killed.

[me@host bladeRF :) ]$ sudo kill -9 3007
[me@host bladeRF :) ]$ sudo kill -9 22393
[me@host bladeRF :) ]$ sudo kill -9 22500
[me@host bladeRF :) ]$ sudo kill -9 22558
[me@host bladeRF :) ]$ sudo kill -9 22558
[me@host bladeRF :) ]$ ps -Al | grep bladeRF-cli
0 D 2060735032 3007 1  0  80   0 - 46146 usb_ki pts/32   00:00:00 bladeRF-cli
0 D 2060735032 22393 1  0 80   0 -  7161 usbdev pts/32   00:00:00 bladeRF-cli
0 D 2060735032 22500 1  0 80   0 -  7161 usbdev pts/32   00:00:00 bladeRF-cli
0 D 2060735032 22558 22017  0 80 0 - 7161 usbdev pts/34  00:00:00 bladeRF-cli
0 D 2060735032 22724 31900  0 80 0 - 7161 usbdev pts/32  00:00:00 bladeRF-cli
[me@host bladeRF :) ]$

This state is persistent, and I see no way out of it other than a reboot.

I.e. I can't provide a crashdump at this point, neither from the first session (which detected some problem but didn't crash, just exited) nor from the others that are stuck. I need to make a change in the code to provoke a crash at the point of failure.

  • I suspect two independent errors. The "non-access/hung" part is probably not related to the original error, but could be triggered by it.
  • I further suspect there could be leads in the first reported failure (see previous post), as it seems extremely unlikely to me that the two are unrelated.
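For the record: the 'D' state in the ps listing above means uninterruptible sleep (typically a process stuck inside a kernel USB call), which is why SIGKILL has no effect until that call returns. A minimal way to see where such a process is stuck (the PID is just an example taken from the listing above; /proc/<pid>/stack requires root and a kernel with stack tracing enabled):

    ps -o pid,stat,wchan:32,cmd -C bladeRF-cli    # state and kernel wait channel of each process
    sudo cat /proc/3007/stack                     # kernel stack of one stuck PID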

@mambrus
Copy link
Contributor Author

mambrus commented Feb 12, 2015

A reboot releases the locked-up state. Perhaps it's one of those "one-off" things. Let's not focus on that for the time being...

The script above has been updated with one that does what it was supposed to do.
New session logs based on running the above calibration script:

I.e. two behaviors occur: the division by zero, as with 1.7.1, and a time-out, at roughly the same place in time and code.

To see the relevant diff between them (I'm guessing the interesting parts are at the end):

meld \
   <(wget -O- http://pastebin.com/raw.php?i=vRUkVV5P | sed -Ee 's/\[[0-9]+\]/[---]/') \
   <(wget -O- http://pastebin.com/raw.php?i=BJU41BcS | sed -Ee 's/\[[0-9]+\]/[---]/')

Fortunately, we now have a crashdump.
Files: coredump, symbols and .so are here:

...which makes debugging much easier and lets us see the direct cause of the crash.

libraries/libbladeRF/src/lms.c

             status = lms_set_frequency(dev, BLADERF_MODULE_RX,
                                        lms_frequency_to_hz(&f));

The structure member x of variable f, declared as uint8_t x; /**< VCO division ratio */, is zero.

This matters because in lms_frequency_to_hz it is (indirectly) used as the divisor div in

     return (uint32_t)(((f->reference * pll_coeff) + (div >> 1)) / div);

Why it is zero is a different matter. Actually, all members of struct f are 0 except reference. Hopefully this is enough to solve the issue swiftly :-)
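For reference, a minimal way to confirm this from the crashdump, assuming gdb can find the matching bladeRF-cli binary and the libbladeRF symbols linked above (file names and the frame are only examples; the SIGFPE should land in lms_frequency_to_hz, whose parameter is f):

    # Hypothetical inspection of the core file; adjust paths to the downloaded artifacts
    gdb ./bladeRF-cli ./core -ex 'bt' -ex 'frame 0' -ex 'print *f'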

@mambrus
Contributor Author

mambrus commented Feb 13, 2015

Reproducibility has disappeared. Switching from one cable to another permitted calibration to go through. After that, switching back to the first cable (which had failed at least 20 times in a row) seemed to work OK.

This is unfortunate, because it doesn't mean the problem is gone, and without repeatability debugging becomes hard. I am able to occasionally provoke an error again, but it's not identical.

bladeRF-cli -e "cal table dc tx"

ERROR: status != 0 failed in [rx_avg_magnitude] @ [calibrate_dc.c:481].
ERROR: status != 0 failed in [calibrate_dc_tx] @ [calibrate_dc.c:566]
ERROR: status != 0 failed in [calibrate_dc_gen_tbl] @ [calibrate_dc.c:809]

  Error: Operation timed out

(Additional printouts come from patches on this branch and are needed not only to see the error flow, but also for the line numbers to match.)

Note that the system runs for quite a while before this happens, but when it does fail, it becomes stuck in an error state and consecutive accesses time out much faster (if not immediately).

Behavior is also different from before with respect to consecutive access. Once it has gotten into this state, the unit is still responsive. Examples:

$ bladeRF-cli -e "probe"

  Backend:        libusb
  Serial:         9d39ad51b7873293c26c7883b487d5b1
  USB Bus:        10
  USB Address:    25

...and

 bladeRF-cli -e "cal table dc rx"

ERROR: status != 0 failed in [rx_avg] @ [calibrate_dc.c:247]
ERROR: status != 0 failed in [calibrate_dc_rx] @ [calibrate_dc.c:308]
ERROR: status != 0 failed in [calibrate_dc_gen_tbl] @ [calibrate_dc.c:809]

  Error: Operation timed out

...whereas before (on consecutive access, without a reset by power-cycle in between), the bladeRF responded with No device(s) attached.

When it was in its previous reproducible state, I got as far as determining that the code fails in the function sync_rx(). Unfortunately I have no more details to share, as those debugging sessions were not captured.

At this point I can't say what's wrong. It could be any one of, or a combination of:

  • Cable [timing (no, I don't think)? voltage?]
  • Pins/pads of the female micro USB3 connector [seems unlikely at this point]
  • FW [concurrency? power-consumption spikes due to activation of certain circuitry?]

The reason FW is mentioned is that "timeouts" instead of div-by-zero crashes appeared to become more common in 1.8.0 than in 1.7.1. But that could also be misleading, and no assumptions should be made at this point. If so, this is likely a combination of errors.

If the issue is truly intermittent, it will in my experience definitely turn up again. It should therefore not be let go of when it shows up again, until the cause is nailed down.

@mambrus
Contributor Author

mambrus commented Feb 13, 2015

It seems that cal table dc tx is more likely to trigger the (potentially new) error state. Note: I haven't been able to re-create the error state with cal table dc rx again.

Running the script below failed with the 1st cable 3 times in a row, but with the 2nd cable the bladeRF seems stable. (The failures counted are after disconnecting and reconnecting.)

This gives the cable hypothesis more credibility. If so, I tend to believe voltage instability is the cause (power-carrying cords with too high resistance create dips on the device side when sporadic current loads occur).
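As a rough back-of-the-envelope illustration of that hypothesis (the current spike and cable resistance below are assumed figures, not measurements):

    V_drop = I_spike x R_cable = 1.0 A x 0.5 ohm = 0.5 V

i.e. a nominal 5.0 V VBUS would momentarily dip to roughly 4.5 V at the device during such a spike.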

Each calibration iteration takes a good while, roughly 20 minutes. Will leave test-script running over the weekend.

@jynik
Contributor

jynik commented Feb 13, 2015

If you suspect there to be a power issue, have you tried supplying power via the DC barrel jack, rather than via the USB connection?

Also, note that the TX calibration utilizes full-duplex loopback communication. If you're connecting to a USB 2.0 controller rather than USB 3.0, the power required for full duplex operation may very well be exceeding the Hi-Speed allowance.
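For reference, the nominal bus-power budgets defined by the USB specifications:

    USB 2.0 Hi-Speed:     500 mA x 5 V = 2.5 W per port
    USB 3.0 SuperSpeed:   900 mA x 5 V = 4.5 W per port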

@mambrus
Contributor Author

mambrus commented Feb 13, 2015

  • Re PSU: Not yet; it's next on my list (with a scope attached).
  • Re USB 2.0: No, it's USB 3.0 all the way (cable, controller and connectors).

Tests are now running over the weekend. Sharing the scripts with the community:

mkdir -p ~/.Nuand/bladeRF
cd  ~/.Nuand/bladeRF

Place the scripts in the same directory as above:

Run:

./forever_calib.sh $SID

@mambrus
Contributor Author

mambrus commented Feb 15, 2015

Running a few tests out of the office using the calibrate script and an external power supply.

Pic of the setup

This time with the newest (spare) bladeRF, which has not been in use for more than a few days. (I.e. it's very unlikely there is any micro-USB connector issue, unless a whole batch is bad.)

Somewhat surprised about the results:

  • 1st try, with the external power supply: I accidentally had the current limiter set at 500 mA. The same error flow occurs, it just starts much earlier. The voltage changes were also slow enough to see without a scope; the voltage dropped to about 3.5 V and then the bladeRF started to bail out.
  • 2nd try: setting the current limiter to 1.5 A, the following occurs:
    session log 2nd try

I.e. one full cal table dc rx was successful, but even a short cal dc tx makes the system bail out.

  • 3rd try: full throttle of what my PSU can supply, 3 A. It's old but has plenty of weight (iron) and capacitors. Thinking that should be enough...

Problems occur. The second attempt printed something about a timeout, but the process won't react to SIGINT (i.e. ^C). Trying to kill it via another terminal:

[mambrus@hornet  :) ]$ ps -Al | grep calib
0 S  1000 31200 26101  0  80   0 -  4459 wait   pts/26   00:00:00 calibrate.sh
[mambrus@hornet  :) ]$ sudo kill -9 31200
[mambrus@hornet  :) ]$ ps -Al | grep -i blade
0 S  1000   652 10879  0  80   0 - 66710 hrtime pts/26   00:00:07 bladeRF-cli
[mambrus@hornet  :) ]$ ps -Al | grep calib
[mambrus@hornet  :( ]$  

I.e. SIGKILL, the strongest signal the system possesses, won't kill the underlying stuck process (bladeRF-cli). That's not good...

After a long time it does die, however, saving me from the need to reboot. Always something...

Here's the session log:

I.e. it still fails.

Some notes for the record

  • RX table calibration seems to draw 500 mA on average. Probably much more while peaking, but how much?
  • From the bladeRF schematic (last page) I can see that J70 needs both its jumpers in the opposite position for external power to supply the board.
  • Having the cable connected first, then switching the power on, seems to be the correct power-up sequence (or the FW will hang).

It wouldn't hurt to spread the above points around.

Conclusion so far.

External power didn't seem to improve the situation much; on the contrary, the blade seems less stable. I'm running different HW in a different location, but that shouldn't matter unless something else isn't right anyway.

I continue to believe this is, at least in part, a power-related issue. But having just a good old slow PSU doesn't seem to be enough. Extremely high currents could be an indication of lower-layer bugs (GPIO output driving against output?) or may be normal for certain computations in the FPGA. If so, the current peaks seem to be extremely short, and no PSU in the world will manage to supply the unit with a stable enough current; HW patching might then be needed.

Please advise.

@mambrus
Contributor Author

mambrus commented Feb 16, 2015

Two blades were running forever_calib.sh over the weekend.

Here's their session logs:

Note that the file names contain '*', which can confuse your shell; rename them for convenience. The names are verbatim from the test script above.

Getting raw logs and getting rid of control characters:

sed 's/[[:cntrl:]]/\n/g' file.log

To do some high-level statistical analysis, I've made the following scripts:

Use as follows:

cat file.log | ./analyze.sh
cat file.log | ./analyze_tdiffs.sh

The second script uses the first and just calculates the time each iteration took.
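(As a rough sketch of the idea, not the actual script: if each condensed line carries an epoch timestamp, the per-iteration duration is just the delta between consecutive timestamps.)

    # Hypothetical sketch, assuming the epoch time is in column 2 of the condensed log
    awk 'prev { print $1, $2 - prev } { prev = $2 }' condensed.log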

@jynik
Contributor

jynik commented Feb 16, 2015

Verbose log snippet leading up to a cal timeout:

[DEBUG @ bladerf.c:301] Enable Module: RX - False
[DEBUG @ bladerf.c:301] Enable Module: TX - False
[DEBUG @ bladerf.c:301] Enable Module: RX - True
[DEBUG @ bladerf.c:301] Enable Module: TX - True
[DEBUG @ sync.c:757] sync_tx: Worker is now running.
[DEBUG @ sync.c:314] sync_rx: Worker is idle. Going to reset buf mgmt.
[DEBUG @ sync.c:334] sync_rx: Reset buf_mgmt consumer index
[DEBUG @ sync.c:349] sync_rx: Worker is now running.
[DEBUG @ libusb.c:1270] lusb_submit_stream_buffer: Timed out waiting for a transfer to become availble.
[DEBUG @ sync.c:641] advance_tx_buffer: Failed to advance buffer: Operation timed out
[DEBUG @ bladerf.c:301] Enable Module: TX - False
[DEBUG @ bladerf.c:301] Enable Module: RX - False

  Error: Operation timed out

@mambrus
Contributor Author

mambrus commented Feb 17, 2015

I've found a scenario through extensive automated testing (a few thousand tests) where in one case calib.sh "always" fails, and in another it always succeeds. I wrote an update, but unfortunately the submission didn't go through before I left. Conclusions, however:

  • The error follows the cable (100% certain).
  • It's not due to resistance in the power cords.
  • "Good" cables are the rare ones; most are "bad".

1. It doesn't exclude DC spikes, but if so they are self-inflicted by circuits on the PCB.

2. It could be timing related, but then something is really on the edge; if so, we're talking about the nanosecond region.

@mambrus
Contributor Author

mambrus commented Feb 18, 2015

Here's conclusive evidence that the error follows the cable: lots of calibrations executed in a loop with the previous scripts for two days.

The setup

Only the cable was swapped; ports and bladeRFs remained the same.

One change, an almost digital difference. That's nice when it comes to pinning down a search area :-)

I didn't include the complete session logs this time because they're huge and don't say more than before. Instead, here's the condensed result using the analyze_tdiffs.sh script from before (slightly changed to also permit output of delta time per operation).

Day one

Day two

I admit I didn't see this coming...

Both cables come from Nuand; the white one is the good one. It should be noted that I've tried 4-5 others which were equally bad, some purchased locally. If a batch of these cables went bad, it has to be an awfully big batch :-)

To quote Sherlock Holmes: It is an old maxim of mine that when you have excluded the impossible, whatever remains, however improbable, must be the truth.

These results, combined with last weekend's test runs where the 5 V barrel jack was used with even worse results (see the earlier post), completely kill the "voltage instability due to cable resistance" theory.

Parenthesis

Especially note where the "OK"s in each list from day one stop (i.e. which calibration was executed last). "Nr freqs" is high, possibly the same frequency. That's not entirely conclusive; we could be talking about two issues. Regardless of which error state it is, there's no coming back from either.

How to interpret the tables above (spaces added for table header to make sense):

Iteration-#   Abs-time-since-Epoch   What   Nr-freqs   Result   Duration(s)
1             1424109029             RX     1428       OK       384
1             1424109413             TX     1428       OK       1145
:             :                      :      :          :        :
31            1424154918             RX     1334       FAIL     372
31            1424155290             TX     0          FAIL     107

@mambrus
Contributor Author

mambrus commented Feb 18, 2015

I've tidied up some automation tools and added the scripts used here to the same repo. They are far from production quality and there is no help or man pages whatsoever. But I miss being able to track what I'm doing, and I'd also appreciate some extra watchful eyes in case I mess up.

Feel free to add yourself as a contributor. Consider it a sandbox for staging/wannabe automation tools where portability is less important than getting stuff up and shared quickly. Clone the project below:

Add its root directory to your $PATH. (If you're wary of adding foreign PATHs and careful about what you link, you can instead link specific scripts into a directory already on your $PATH; the scripts are designed for that case. See the sketch below.)
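A minimal example of the linking approach (the clone location and target directory are hypothetical; adjust to your setup):

    # Link only the scripts you need into a directory that is already on your PATH
    ln -s ~/src/bladeRF-automation/brat_forever_calib ~/bin/
    ln -s ~/src/bladeRF-automation/brat_analyze_td ~/bin/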

An example follows; to run the calibration routine as a test to trigger this case:

brat_forever_calib "$(brat_sid 9d39ad51b7873293c26c7883b487d5b1)"
# Replace  this -------------------------^

brat_sid transforms the serial number into a string for bladeRF-cli -p, allowing several blades to run simultaneously. The argument can be left out if running a single blade.

To run the stats analysis:

cat recently_created.log | brat_analyze_td

or

cat recently_created.log | brat_analyze

Last but not least: I've gotten my hands on an industry-grade USB 3.0 cable that we use for an extremely demanding vision system (I can assure you, cables don't get better than this ;-) ). See the picture here: link

Notice the screws. A suggestion for the next version of the bladeRF HW is to make the socket much more rigid and to accommodate this type of industry-grade cable.

I'm running the "magically good" cable and this new cable in tandem overnight. It actually seems to make things worse. I.e. what counts as a "good" vs. "bad" cable seems a bit arbitrary.

@mambrus
Contributor Author

mambrus commented Feb 19, 2015

Results of the nightly run (2015-02-18/19):

NOTE: One was tested with external power, one without. Swapping them for a short run. Each (OK) iteration takes 25 minutes; I will do 5 short iterations...

Results later the same day (2015-02-19):

Some notes for the record

  • A more powerful PSU was used than last weekend (5 A).
  • The ports were also changed this time. (I.e. the blades were swapped instead of the cables.)
  • The results were originally a disaster: both blades were close together, both were transmitting and both were failing. The new "industry grade" cable is longer (3 m). I separated the blades as much as possible, and by doing so I could recreate the same scenario as before where the error travels with the cable.
  • The statistics are too sparse to count as evidence, but the conclusion has been proven before. I re-did this as a sanity test for the automation tools and out of curiosity about how this new (damned expensive) cable would affect the situation.

@jynik added the Issue: Bug label Feb 28, 2015
@jynik
Contributor

jynik commented Sep 4, 2015

I believe the trap divide error here was due to some bad LMS register accesses, which was addressed in FPGA v0.3.3 onward.

There are other issues open for the timeout/hang, so I'll let you close this if you agree.

@jynik
Contributor

jynik commented Dec 4, 2015

Per my previous comment, I believe the issues here were the result of some LMS6 register access failures. Since I will be addressing the hanging under #362, I am marking this closed.

@jynik closed this as completed Dec 4, 2015