New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regression tests fail on ARM #304
Comments
First of all, thanks a lot for the test suite, it makes debugging much easier. The decribed build/test issue is not only limited to ARM. i386 suffers under similar issues. https://buildd.debian.org/status/fetch.php?pkg=samtools&arch=i386&ver=1.1-1&stamp=1411604847 I don't see any architecture specific instructions in the code, so the goal should be to get samtools running on most of the other POSIX platforms as well. (like 0.1.19) |
These all go through Travis, so maybe it is something different between the Debian build hosts and Travis (& our own development servers). Is there something specific that you can think of that will differ and could cause such issues? We've already heard of one case where stdin is not a terminal, so I can try exploring that as a possibility. Anything else obvious to try changing? Are there any plans for the debian build hosts to more faithfully represent a typical user at a command line? |
Samtools builds find on amd64 in Debian; it is on i386, ARM and other platforms that it fails, and to my knowledge, travis does not test this, isn't it ? |
You're right. I can reproduce one of these failures (the 87 i386 ones) locally so will investigate, but I cannot reproduce the hanging faidx issue. (I've asked our local systems support for access to an ARM system to test on.) |
Thanks a lot ! |
Bizarrely it turns out to be an issue with the test data more than anything else. BAM allows for auxiliary maximum values of 4294967295, but SAM only permits 2147483647 (unless part of a "B" array). This oddity, that it is legal to have BAM files which are not representable in SAM, has struck me as a bug in the spec for years, but is unlikely to change. Arguably it was a flaw that the 64-bit tests didn't notice this. |
I think the only place in the spec that talks about the range of integers in SAM aux fields is the table at the start of §1.5, which talks about lightly-burnt 32-bit integers. IMHO this is a mistake, and the limits of the range should be unstated as they are elsewhere in the spec. (In particular, IMHO samtools/hts-specs#36 is misreading what that footnote applies to.) So I reckon the correction to the bug in the spec is to say that integer aux fields in SAM are textual things, with the corollary that 4294967295 in a BAM file is indeed representable in SAM. (...which has an effect on how a 32-bit implementation reads in a SAM file...) |
There are two locations in SAM spec that mentions int32_t instead of uint32_t: section 1.5 mentions "i" as signed 32-bit integers (with no equivalent "I" for unsigned); section 4.2 footnote 14 mentions that in SAM all single integers are mapped to int32_t (after having mentioned both int32_t and uint32_t for BAM). So it appears to be deliberate, although I wonder whether the spec was amended after discovering the difference between SAM and BAM rather than vice versa. Practically speaking, this means the sam_parse1() function should probably be using atoi instead of strtol, or maybe having deliberate casts into int32_t after running strtol. However I agree that the SAM specification ought to be a textual one and not limited by binary issues; certainly not more limited than the binary version. For this issue though I'll just make it so that the code works the same on 32-bit and 64-bit builds. Although it's not going to address the hang in faidx on arm. That's got me stumped atm. (Any chance of attaching gdb to it Charles and getting a stack trace?) |
I have access to a "porter box", but I do not know how to make a stack trace. Can you tell me which command to execute ? |
You'd need to have the make test running, but wedged. Then do something like "ps x" to list processes and find the PID of the wedged process, and then do "gdb -p PID", followed by "bt" (or "where") for the backtrace and then "q" to quit out again. |
I get this: #0 0x000588dc in bgzf_read_block ()
#1 0x00059a58 in bgzf_getc ()
#2 0x0005b89c in fai_build_core ()
#3 0x0005c550 in fai_build ()
#4 0x00048abc in faidx_main (argc=argc@entry=2, argv=argv@entry=0xbee50408) at faidx.c:70
#5 0x00017f64 in main (argc=3, argv=0xbee50404) at bamtk.c:165 The process that opened in gdb was |
Thanks Charles. I'm still scratching my head, and as it's gone off into the bgzf implementation it's probably someone one of the other developers are best looking at. (It may stall though while waiting on hardware.) |
You're welcome. Let me know if there are other commands to run later. |
Here is what I get:
The samtools process is still running
gdb -p 2823
I hope this helps a little. |
I am not sure if this is related, but I am getting already strange errors in htslib alone:
|
Bus error is probably due to unaligned data access. I did some experimental work on removing those from htslib (see https://github.com/jkbonfield/htslib/tree/SPARC) and I believe there are some pull requests here too from others, but none of them are perfect yet and we find it hard to check such things right now. |
After commenting out test/sam I get this:
|
Back to the original problem, after 30 min the test is still running, and it looks like an integer overflow.
|
I can see an error in htslib/bgzf.c although not why it happens. Adding error checking would solve it though maybe.
If hread() returns -1 then count is not 0, so it adds it to fp->block_length possibly making that -1, which in turn means the loop in bgzf_getc may count for an awful long time. Adding a "if (count < 0) return -1" in there after the hread call may at least make it fail rather than hang for ages, but it doesn't explain the failure. John? Petr? |
James, I'll take a look at this, but it will be more efficient to wait until we have the 32 bit machine for testing. |
The problem is that htslib/faidx.c
doesn't catch the return code -1 from bgzf_getc Here is an simple test:
On ARM:
on amd64:
|
Here is an easy fix.
|
This is much appreciated, thank you. Fixed by samtools/htslib@14a4a81 |
It's perhaps safer to use int instead. Although in this situation it doesn't matter as fasta is pure 7-bit ASCII, generally getc type calls want to return 0-255 plus -1, so 257 possible values. Hence int and not char (signed or otherwise). Thanks for identifying the cause of the problem. |
Many thanks everybody ! I can also confirm that the tests do not hang anymore on ARM. |
Dear Samtools developers,
Samtools 1.1 is packaged in Debian, where building it has been attempted on multiple hardware architectures. Issue #268 already reports failures on MIPS.
On ARM as well, the test fail, but with a different symptom: just after the first three tests (for
bgzip
), the output stops for at least 300 minutes (when the build is eventually killed by our build farm). This happens on both 32 and 64-bits ARM platforms. Here are links to the build logs.You may then wonder if somebody is using Samtools on ARM platforms, and I do share the same concern. At least, we have been contacted once by researchers who were organising a practical course in their university, using Raspberry Pi as a platform. Perhaps other potential users may advocate for support in this issue record ?
Have a nice day,
Charles Plessy
Debian Med packaging team
https://www.debian.org/devel/debian-med
Tsurumi, Kanagwa, Japan
The text was updated successfully, but these errors were encountered: