"trap divide error" while generating calibration table in CLI: cal table dc rx #373
The session and core files attached in my previous posts may be irrelevant, as I was running the wrong FW by mistake (1.7.1 instead of 1.8.0, due to a bug in an automation tool that did not consider late git tags). I corrected the mistake, but there are still problems; actually, it is worse this time. Attaching a new session and a corresponding script that provokes the error (run with export BLADERF_LOG_LEVEL="verbose"). The new behavior is worse because on the third invocation (which in itself is odd) bladeRF-cli hangs badly and the process cannot be killed:

[me@host bladeRF :) ]$ sudo kill -9 3007
[me@host bladeRF :) ]$ sudo kill -9 22393
[me@host bladeRF :) ]$ sudo kill -9 22500
[me@host bladeRF :) ]$ sudo kill -9 22558
[me@host bladeRF :) ]$ sudo kill -9 22558
[me@host bladeRF :) ]$ ps -Al | grep bladeRF-cli
0 D 2060735032 3007 1 0 80 0 - 46146 usb_ki pts/32 00:00:00 bladeRF-cli
0 D 2060735032 22393 1 0 80 0 - 7161 usbdev pts/32 00:00:00 bladeRF-cli
0 D 2060735032 22500 1 0 80 0 - 7161 usbdev pts/32 00:00:00 bladeRF-cli
0 D 2060735032 22558 22017 0 80 0 - 7161 usbdev pts/34 00:00:00 bladeRF-cli
0 D 2060735032 22724 31900 0 80 0 - 7161 usbdev pts/32 00:00:00 bladeRF-cli
[23059623@seldlx11530 bladeRF :) ]$

This state is persistent and I see no way out of it other than a reboot. I.e. I can't provide a crashdump at this point, neither from the first session (which detected a problem but didn't crash, and exited) nor from the others that are stuck. I would need to change the code to provoke a crash at the point of failure.
A reboot releases the locked situation. Perhaps it's one of those "one-off" things; let's not focus on that for the time being... The script above has been updated with one that does what it was supposed to do. I.e. two behaviors occur: the division by zero, as with 1.7.1, and a time-out, both at about the same place in time and in code. To see the relevant diff between them (I'm guessing the interesting parts are at the end):

meld \
  <(wget -O- http://pastebin.com/raw.php?i=vRUkVV5P | sed -Ee 's/\[[0-9]+\]/[---]/') \
  <(wget -O- http://pastebin.com/raw.php?i=BJU41BcS | sed -Ee 's/\[[0-9]+\]/[---]/')

Fortunately we now have a crashdump, which makes debugging much easier and lets us see the direct cause of the crash, in libraries/libbladeRF/src/lms.c:

status = lms_set_frequency(dev, BLADERF_MODULE_RX,
                           lms_frequency_to_hz(&f));

Structure member … as in:

return (uint32_t)(((f->reference * pll_coeff) + (div >> 1)) / div);

Why is a different matter. Actually, all values of …
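To make the failure mode concrete, here is a minimal standalone sketch. It is not the actual libbladeRF source; the struct layout, the field names other than reference, and the 38.4 MHz reference value are assumptions made for illustration only. It reproduces the shape of the quoted return statement with a zero divider, which the kernel reports as a "trap divide error" (SIGFPE):

/* Hedged sketch, not libbladeRF code: same arithmetic shape as the quoted
 * return statement.  A zero divider (e.g. a zeroed-out structure after a
 * failed LMS register read) makes the integer division trap with SIGFPE. */
#include <stdint.h>
#include <stdio.h>

struct freq_sketch {
    uint32_t reference;  /* reference clock in Hz (illustrative value below) */
    uint64_t pll_coeff;  /* combined integer/fractional PLL word */
    uint32_t div;        /* divider derived from LMS register contents */
};

static uint32_t frequency_to_hz_sketch(const struct freq_sketch *f)
{
    return (uint32_t)((((uint64_t)f->reference * f->pll_coeff) + (f->div >> 1)) / f->div);
}

int main(void)
{
    struct freq_sketch bad = { .reference = 38400000u, .pll_coeff = 1, .div = 0 };
    printf("%u\n", frequency_to_hz_sketch(&bad)); /* divide error (SIGFPE) here */
    return 0;
}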
Reproducibility disappeared. I switched from one cable to another, which permitted calibration to go through. After that, switching back to the first cable (which had failed at least 20 times in a row) seemed to work OK. This is unfortunate: it doesn't mean the problem is gone, but without reproducibility, debugging becomes hard. I am able to occasionally provoke an error again, but it's not identical.
(Additional printouts come from patches on this branch and are needed not only to see the error flow, but also for the line numbers to match.) Note that the system runs for quite a while before this happens, but when it does fail, it becomes stuck in an error state and consecutive accesses time out much faster (if not immediately). The behavior is also different than before w.r.t. consecutive access. Once it has gotten into this state, the unit is responsive. Examples:
...and
...whereas before (on consecutive access, without reset by power-cycle in between accesses) bladerf responded differently. When it was in its previous reproducible state, I got as far as determining which function the failing code is in. At this point I can't say what's wrong. It could be either of, or a combination of:
The reason FW is mentioned is that it appeared to give "timeouts" instead of … If the issue is truly intermittent, it will in my experience definitely turn up again. It should therefore be encouraged not to let this go if it shows up again, until the cause is nailed down.
It seems that … Running the script below failed with the 1st cable 3 times in a row, but with the 2nd cable bladerf seems stable. (The failures accounted for are after disconnecting and reconnecting.) This gives the cable hypothesis more credibility. If so, I tend to believe voltage instability is the cause (power-carrying cords have too high resistance, which creates dips at the device side when sporadic current loads occur). Each calibration iteration takes a good while, roughly 20 minutes. I will leave the test script running over the weekend.
If you suspect there to be a power issue, have you tried supplying power via the DC barrel jack, rather than via the USB connection? Also, note that the TX calibration utilizes full-duplex loopback communication. If you're connecting to a USB 2.0 controller rather than USB 3.0, the power required for full duplex operation may very well be exceeding the Hi-Speed allowance.
Tests are now running over the weekend. Sharing the scripts with the community:

mkdir -p ~/.Nuand/bladeRF
cd ~/.Nuand/bladeRF

Place the scripts in the same directory as above, then run:

./forever_calib.sh $SID
Running a few tests out of the office using the calibrate script and an external power supply. This time with the newest (spare) bladerf, which has not been in use for more than a few days (i.e. it's very unlikely there is any micro-USB connector issue, unless a big batch is bad). Somewhat surprised about the results:
I.e. one full …
Problems occur. The second try printed something about a timeout, but the process won't react to …

[mambrus@hornet :) ]$ ps -Al | grep calib
0 S 1000 31200 26101 0 80 0 - 4459 wait pts/26 00:00:00 calibrate.sh
[mambrus@hornet :) ]$ sudo kill -9 31200
[mambrus@hornet :) ]$ ps -Al | grep -i blade
0 S 1000 652 10879 0 80 0 - 66710 hrtime pts/26 00:00:07 bladeRF-cli
[mambrus@hornet :) ]$ ps -Al | grep calib
[mambrus@hornet :( ]$

I.e. even the strongest signal only bites after a long time: eventually the process does die, saving me from the need to reboot. Always something... Here's the session log. I.e. still failing.

Some notes for the record:
The above points wouldn't hurt to spread around everywhere.

Conclusion so far: external power didn't seem to improve the situation much; on the contrary, it seems the blade is less stable. I'm running different HW at a different location, but that shouldn't matter unless something isn't right anyway. I continue to believe this to be a power-related issue, at least in part, but having just a good old slow PSU doesn't seem to be enough. Extremely high currents could be an indication of lower-layer bugs (GPIO output driving output?) or maybe normal for certain calculations in the FPGA. If so, the current peaks seem to be extremely short and no PSU in the world will manage to supply the unit with a stable enough current; HW patching might be needed. Please advise.
Two blades were running forever_calib.sh over the weekend. Here are their session logs:
Note that the file names contain '*', which can confuse your shell; rename them for ease. The names are verbatim from the test script above. To get the raw logs and strip control characters:

sed 's/[[:cntrl:]]/\n/g' file.log

To do some high-level statistical analysis, I've made the following scripts. Use as follows:

cat file.log | ./analyze.sh
cat file.log | ./analyze_tdiffs.sh

The second script uses the first and just calculates the time each iteration took.
Verbose log snippet leading up to a cal timeout:
I've found a scenario from extensive automated testing (a few thousand tests) where in one case calib.sh "always" fails, and another where it always succeeds. I wrote an update, but unfortunately the submission didn't go through before I left. Conclusions, however:
1. It doesn't exclude DC spikes, but if so they are self-inflicted by circuits on the PCB.
2. It could be timing related, but then something is really on the edge; if so, we're talking the nanosecond region.
Here's conclusive evidence that the error follows the cable. Lots of calibrations were executed in a loop with the previous scripts for two days. Only the cable was swapped; ports and bladerfs remained. One change, an almost digital difference. That's nice when it comes to pinning down a search area :-) I didn't include the complete session logs this time because they're huge and they don't say more than before. Instead, here's the condensed result using the analysis scripts.

Day one

Day two
I admit I didn't see this coming... Both cables come from Nuand; the white one is the good one. It should be noted that I've tried 4-5 others, which were equally bad; some were purchased locally. If a batch of these cables went bad, it has to be an awfully big batch :-) To quote Sherlock Holmes: "It is an old maxim of mine that when you have excluded the impossible, whatever remains, however improbable, must be the truth." These results, combined with last weekend's test runs where the 5V barrel jack was used with completely worse results (see post earlier), kill the "voltage instability due to cable resistance" theory.

Parenthesis: especially note how the "OK"s in each list from day one stop (i.e. which calibration was executed last). "Nr freqs" are high, possibly the same frequency. That's not entirely conclusive; we could be talking about two issues. Regardless of which error state, there's no coming back from either. How to interpret the tables above (spaces added for the table header to make sense):
I've tidied up some automation tools and added the scripts used here to the same repo. They are far from production quality and there's no help nor man pages whatsoever, but I miss the ability to track what I'm doing and also appreciate some extra watchful eyes in case I mess up. Feel free to add yourself as a contributor. Consider it a sandbox for staging/wannabe automation tools where portability is less important than getting stuff up and shared quickly. Clone the project below and add its root directory to your $PATH. (If you're paranoid about adding foreign PATHs, and if you're careful about what to link, you can link just specific scripts into somewhere already on your $PATH; the scripts are designed for that case.) An example follows; to run the calibration routine as a test to trigger this case:

brat_forever_calib "$(brat_sid 9d39ad51b7873293c26c7883b487d5b1)"
# Replace this -------------------------^
To run the stats analysis:

cat recently_created.log | brat_analyze_td

or

cat recently_created.log | brat_analyze

Last but not least: I've got my hands on an industry-grade USB3 cable that we use for an extremely demanding vision system (I can assure you, cables don't get better than this ;-) ). See picture here: link. Notice the screws. A suggestion for the next version of the bladerf HW is to make the socket much more rigid and to accommodate this type of industry-grade cable. I am running this overnight in tandem with the "magically good" cable and this new cable. It actually seems to make things worse, i.e. what counts as a "good" vs. "bad" cable seems a bit arbitrary.
Results of nightly run (2014-02-18/19):

NOTE: one was tested with external power, one without. Swapping them for a short run. Each (OK) iteration takes 25 minutes. Will do 5 short iterations...

Results later the same day (2014-02-19):

Some notes for the record:
I believe the trap divide error here was due to some bad LMS register accesses, which were addressed in FPGA v0.3.3 onward. There are other issues open for the timeout/hang, so I'll let you close this if you agree.
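For what it's worth, here is a minimal sketch of the kind of host-side write-then-read-back check that can catch a failed LMS register access before zeroed values reach the frequency math above. It is not libbladeRF code; the accessor names and the mock register file are made up for illustration:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical accessors backed by a mock register file, standing in for
 * whatever SPI register read/write routines the driver actually uses. */
static uint8_t mock_regs[128];

static int lms_reg_read(uint8_t addr, uint8_t *val)
{
    *val = mock_regs[addr & 0x7f];
    return 0;
}

static int lms_reg_write(uint8_t addr, uint8_t val)
{
    mock_regs[addr & 0x7f] = val;
    return 0;
}

/* Write a register, then read it back.  A mismatch means the access likely
 * failed, in which case downstream code could otherwise end up dividing by
 * a zero divider as seen earlier in this issue. */
static int lms_reg_write_verified(uint8_t addr, uint8_t val)
{
    uint8_t readback = 0;
    int status = lms_reg_write(addr, val);
    if (status == 0) {
        status = lms_reg_read(addr, &readback);
    }
    if (status == 0 && readback != val) {
        status = -1;
    }
    return status;
}

int main(void)
{
    printf("write+verify status: %d\n", lms_reg_write_verified(0x25, 0x3f));
    return 0;
}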
Per my previous comment, I believe the issues here were the result of some LMS6 register access failures. Since I will address the hanging in #362, I am marking this closed.