Regression tests fail on ARM #304

Open
charles-plessy opened this issue Sep 30, 2014 · 27 comments

Comments

@charles-plessy
Contributor

Dear Samtools developers,

Samtools 1.1 is packaged in Debian, where building it has been attempted on multiple hardware architectures. Issue #268 already reports failures on MIPS.

On ARM as well, the tests fail, but with a different symptom: just after the first three tests (for bgzip), the output stops for at least 300 minutes (until the build is eventually killed by our build farm). This happens on both 32-bit and 64-bit ARM platforms. Here are links to the build logs.

You may then wonder whether anybody is using Samtools on ARM platforms, and I share the same concern. At least, we were contacted once by researchers who were organising a practical course at their university, using Raspberry Pi as a platform. Perhaps other potential users may advocate for support in this issue record?

Have a nice day,

Charles Plessy
Debian Med packaging team
https://www.debian.org/devel/debian-med
Tsurumi, Kanagawa, Japan

@domibel

domibel commented Oct 1, 2014

First of all, thanks a lot for the test suite, it makes debugging much easier.

The described build/test issue is not limited to ARM; i386 suffers from similar issues.

https://buildd.debian.org/status/fetch.php?pkg=samtools&arch=i386&ver=1.1-1&stamp=1411604847
Number of tests:
total .. 342
passed .. 237
failed .. 87
expected failure .. 18
unexpected pass .. 0

I don't see any architecture-specific instructions in the code, so the goal should be to get samtools running on most other POSIX platforms as well (like 0.1.19).

@jkbonfield
Contributor

These all go through Travis, so maybe it is something different between the Debian build hosts and Travis (& our own development servers). Is there something specific that you can think of that will differ and could cause such issues? We've already heard of one case where stdin is not a terminal, so I can try exploring that as a possibility. Anything else obvious to try changing?

Are there any plans for the Debian build hosts to more faithfully represent a typical user at a command line?

@charles-plessy
Contributor Author

Samtools builds fine on amd64 in Debian; it is on i386, ARM and other platforms that it fails, and to my knowledge Travis does not test these, does it?

@jkbonfield
Contributor

You're right. I can reproduce one of these failures (the 87 i386 ones) locally so will investigate, but I cannot reproduce the hanging faidx issue. (I've asked our local systems support for access to an ARM system to test on.)

@charles-plessy
Contributor Author

Thanks a lot!

@jkbonfield
Contributor

Bizarrely, it turns out to be an issue with the test data more than anything else. BAM allows auxiliary integer values up to 4294967295, but SAM only permits values up to 2147483647 (unless part of a "B" array). This oddity, that it is legal to have BAM files which are not representable in SAM, has struck me as a bug in the spec for years, but is unlikely to change.

Arguably it was a flaw that the 64-bit tests didn't notice this.

@jmarshall
Member

I think the only place in the spec that talks about the range of integers in SAM aux fields is the table at the start of §1.5, which talks about lightly-burnt 32-bit integers. IMHO this is a mistake, and the limits of the range should be unstated as they are elsewhere in the spec. (In particular, IMHO samtools/hts-specs#36 is misreading what that footnote applies to.)

So I reckon the correction to the bug in the spec is to say that integer aux fields in SAM are textual things, with the corollary that 4294967295 in a BAM file is indeed representable in SAM. (...which has an effect on how a 32-bit implementation reads in a SAM file...)

@jkbonfield
Contributor

There are two locations in the SAM spec that mention int32_t instead of uint32_t: section 1.5 mentions "i" as signed 32-bit integers (with no equivalent "I" for unsigned); section 4.2 footnote 14 mentions that in SAM all single integers are mapped to int32_t (after having mentioned both int32_t and uint32_t for BAM). So it appears to be deliberate, although I wonder whether the spec was amended after discovering the difference between SAM and BAM rather than vice versa.

Practically speaking, this means the sam_parse1() function should probably be using atoi instead of strtol, or maybe having deliberate casts into int32_t after running strtol. However I agree that the SAM specification ought to be a textual one and not limited by binary issues; certainly not more limited than the binary version.

For this issue though I'll just make it so that the code works the same on 32-bit and 64-bit builds. That's not going to address the hang in faidx on ARM, though. That's got me stumped atm. (Any chance of attaching gdb to it, Charles, and getting a stack trace?)

@charles-plessy
Contributor Author

I have access to a "porter box", but I do not know how to make a stack trace. Can you tell me which command to execute?

@jkbonfield
Contributor

You'd need to have the make test running, but wedged. Then do something like "ps x" to list processes and find the PID of the wedged process, and then do "gdb -p PID", followed by "bt" (or "where") for the backtrace and then "q" to quit out again.

@jkbonfield
Contributor

The i386 issues are partially resolved by #307, but also see #305 for one outstanding issue causing make test to still fail.

However this ticket started with (and should end with) ARM failures, so obviously don't consider it resolved yet.

@charles-plessy
Contributor Author

I get this:

#0  0x000588dc in bgzf_read_block ()
#1  0x00059a58 in bgzf_getc ()
#2  0x0005b89c in fai_build_core ()
#3  0x0005c550 in fai_build ()
#4  0x00048abc in faidx_main (argc=argc@entry=2, argv=argv@entry=0xbee50408) at faidx.c:70
#5  0x00017f64 in main (argc=3, argv=0xbee50404) at bamtk.c:165

The process that opened in gdb was /home/plessy/samtools-1.1/samtools faidx /tmp/mVscbR4vKG/faidx.fa. It was started by /bin/bash -o pipefail -c (/home/plessy/samtools-1.1/samtools faidx /tmp/mVscbR4vKG/faidx.fa) 2> /tmp/NJvtyg_T1y.

@jkbonfield
Contributor

Thanks Charles. I'm still scratching my head, and as it's gone off into the bgzf implementation it's probably something one of the other developers is best placed to look at. (It may stall though while waiting on hardware.)

@charles-plessy
Contributor Author

You're welcome. Let me know if there are other commands to run later.

@domibel

domibel commented Oct 1, 2014

Here is what I get:

~/samtools# make test
cd ../htslib && make bgzip
make[1]: Entering directory `/root/htslib'
gcc -g -Wall -O2 -I. -DSAMTOOLS=1 -c -o bgzip.o bgzip.c
gcc -pthread  -o bgzip bgzip.o libhts.a  -lz
make[1]: Leaving directory `/root/htslib'
REF_PATH=: test/test.pl --exec bgzip=../htslib/bgzip
../htslib/bgzip -c -b 65272 -s 5 /tmp/iEDMnDPgQW/bgzip.dat.gz
.. ok

../htslib/bgzip -c -b 979200 -s 6 /tmp/iEDMnDPgQW/bgzip.dat.gz
.. ok

../htslib/bgzip -c -b 652804 -s 6 /tmp/iEDMnDPgQW/bgzip.dat.gz
.. ok

The samtools process is still running

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 2823 root      20   0  2972  736  500 R  97.7  0.1   6:22.01 /root/samtools/samtools faidx /tmp/iEDMnDPgQW/faidx.fa

gdb -p 2823

143     size_t n = fp->end - fp->begin;
(gdb) n
144     if (n > nbytes) n = nbytes;
(gdb) n
145     memcpy(buffer, fp->begin, n);
(gdb) n
144     if (n > nbytes) n = nbytes;
(gdb) n
bgzf_read_block (fp=0x85068) at bgzf.c:424
424         count = hread(fp->fp, fp->uncompressed_block, BGZF_MAX_BLOCK_SIZE);
(gdb) n
425         if ( count==0 )
(gdb) n
427             fp->block_length = 0;
(gdb) n
428             return 0;
(gdb) n
542 }
(gdb) n
bgzf_getc (fp=0x85068) at bgzf.c:882
882         if (fp->block_length == 0) return -1; /* end-of-file */
(gdb) n
892 }
(gdb) n
fai_build_core (bgzf=0x85068) at faidx.c:126
126                 ++l1;
(gdb) n
127                 if (isgraph(c)) ++l2;
(gdb) n
128             } while ( (c=bgzf_getc(bgzf))>=0 && c != '\n');
(gdb) n
126                 ++l1;
(gdb) print l1
$1 = 366088164
(gdb) n
127                 if (isgraph(c)) ++l2;
(gdb) n
128             } while ( (c=bgzf_getc(bgzf))>=0 && c != '\n');
(gdb) n
126                 ++l1;
(gdb) n
127                 if (isgraph(c)) ++l2;
(gdb) n
128             } while ( (c=bgzf_getc(bgzf))>=0 && c != '\n');
(gdb) n
126                 ++l1;
(gdb) n
127                 if (isgraph(c)) ++l2;
(gdb) n
128             } while ( (c=bgzf_getc(bgzf))>=0 && c != '\n');
(gdb) n
126                 ++l1;
(gdb) print l1
$2 = 366088166
(gdb) n
127                 if (isgraph(c)) ++l2;
(gdb) n
128             } while ( (c=bgzf_getc(bgzf))>=0 && c != '\n');
(gdb) n
126                 ++l1;
(gdb) n
127                 if (isgraph(c)) ++l2;
(gdb) n
128             } while ( (c=bgzf_getc(bgzf))>=0 && c != '\n');
(gdb) n
126                 ++l1;
(gdb) n
127                 if (isgraph(c)) ++l2;

I hope this helps a little.

@domibel

domibel commented Oct 1, 2014

I am not sure if this is related, but I am already getting strange errors in htslib alone:

~/htslib# make test
test/test-regidx
test/fieldarith test/fieldarith.sam
test/hfile
test/sam
make: *** [test] Bus error
gdb test/sam
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/htslib/test/sam...done.
(gdb) run
Starting program: /root/htslib/test/sam 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".

Program received signal SIGBUS, Bus error.
0x0001516a in bam_aux2f (s=0x7a4fe "\333\017I@XddiW\024\213\n\277\005@XZZHello, world!") at sam.c:1121
1121        else if (type == 'f') return *(float*)s;
(gdb) bt
#0  0x0001516a in bam_aux2f (s=0x7a4fe "\333\017I@XddiW\024\213\n\277\005@XZZHello, world!") at sam.c:1121
#1  0x0000a124 in aux_fields1 () at test/sam.c:93
#2  main () at test/sam.c:139

@jkbonfield
Contributor

Bus error is probably due to unaligned data access. I did some experimental work on removing those from htslib (see https://github.com/jkbonfield/htslib/tree/SPARC) and I believe there are some pull requests here too from others, but none of them are perfect yet and we find it hard to check such things right now.

@domibel

domibel commented Oct 1, 2014

After commenting out test/sam I get this:

~/htslib# make test
echo '#define HTS_VERSION "1.1-6-gd2cb7ba-dirty"' > version.h
gcc -g -Wall -O2 -I. -DSAMTOOLS=1 -c -o hts.o hts.c
ar -rc libhts.a kfunc.o knetfile.o kstring.o bgzf.o faidx.o hfile.o hfile_net.o hts.o sam.o synced_bcf_reader.o vcf_sweep.o tbx.o vcf.o vcfutils.o cram/cram_codecs.o cram/cram_decode.o cram/cram_encode.o cram/cram_index.o cram/cram_io.o cram/cram_samtools.o cram/cram_stats.o cram/files.o cram/mFILE.o cram/md5.o cram/open_trace_file.o cram/pooled_alloc.o cram/sam_header.o cram/string_alloc.o cram/thread_pool.o cram/vlen.o cram/zfio.o regidx.o
ranlib libhts.a
gcc -pthread  -o test/fieldarith test/fieldarith.o libhts.a  -lz
gcc  -o test/hfile test/hfile.o libhts.a  -lz
gcc -pthread  -o test/sam test/sam.o libhts.a  -lz
gcc -pthread  -o test/test_view test/test_view.o libhts.a  -lz
gcc -pthread  -o test/test-vcf-api test/test-vcf-api.o libhts.a  -lz
gcc -pthread  -o test/test-vcf-sweep test/test-vcf-sweep.o libhts.a  -lz
gcc -pthread  -o test/test-regidx test/test-regidx.o libhts.a  -lz
test/test-regidx
test/fieldarith test/fieldarith.sam
test/hfile
#test/sam
cd test && REF_PATH=: ./test_view.pl

=== Testing aux#aux.sam, ref aux.fa ===
  ./test_view -S -b aux#aux.sam > aux#aux.tmp.bam
  ./test_view aux#aux.tmp.bam > aux#aux.tmp.bam.sam_
Bus error
FAIL 
  ./compare_sam.pl aux#aux.sam aux#aux.tmp.bam.sam_
EOF on aux#aux.sam
FAIL 
  ./test_view -t aux.fa -S -C aux#aux.sam > aux#aux.tmp.cram
  ./test_view -D aux#aux.tmp.cram > aux#aux.tmp.cram.sam_
Bus error
FAIL 
  ./compare_sam.pl -nomd aux#aux.sam aux#aux.tmp.cram.sam_
EOF on aux#aux.sam
FAIL 
  ./test_view -t aux.fa -C aux#aux.tmp.bam > aux#aux.tmp.bam.cram
  ./test_view -b -D aux#aux.tmp.bam.cram > aux#aux.tmp.bam.cram.bam
  ./test_view aux#aux.tmp.bam.cram.bam > aux#aux.tmp.bam.cram.bam.sam_
Bus error
FAIL 
  ./compare_sam.pl -nomd aux#aux.sam aux#aux.tmp.bam.cram.bam.sam_
EOF on aux#aux.sam
FAIL 

=== Testing c1#bounds.sam, ref c1.fa ===
  ./test_view -S -b c1#bounds.sam > c1#bounds.tmp.bam
  ./test_view c1#bounds.tmp.bam > c1#bounds.tmp.bam.sam_
  ./compare_sam.pl c1#bounds.sam c1#bounds.tmp.bam.sam_
  ./test_view -t c1.fa -S -C c1#bounds.sam > c1#bounds.tmp.cram
  ./test_view -D c1#bounds.tmp.cram > c1#bounds.tmp.cram.sam_
  ./compare_sam.pl -nomd c1#bounds.sam c1#bounds.tmp.cram.sam_
  ./test_view -t c1.fa -C c1#bounds.tmp.bam > c1#bounds.tmp.bam.cram
  ./test_view -b -D c1#bounds.tmp.bam.cram > c1#bounds.tmp.bam.cram.bam
  ./test_view c1#bounds.tmp.bam.cram.bam > c1#bounds.tmp.bam.cram.bam.sam_
  ./compare_sam.pl -nomd c1#bounds.sam c1#bounds.tmp.bam.cram.bam.sam_

[following tests are looking good]
[snip]

Successes 234

Failures  6
make: *** [test] Error 1

@domibel

domibel commented Oct 1, 2014

Back to the original problem, after 30 min the test is still running, and it looks like an integer overflow.

~/htslib# gdb -p 2823
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 2823
Reading symbols from /root/samtools/samtools...done.
Reading symbols from /lib/arm-linux-gnueabihf/libncurses.so.5...(no debugging symbols found)...done.
Loaded symbols for /lib/arm-linux-gnueabihf/libncurses.so.5
Reading symbols from /lib/arm-linux-gnueabihf/libtinfo.so.5...(no debugging symbols found)...done.
Loaded symbols for /lib/arm-linux-gnueabihf/libtinfo.so.5
Reading symbols from /lib/arm-linux-gnueabihf/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/arm-linux-gnueabihf/libm.so.6
Reading symbols from /lib/arm-linux-gnueabihf/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/arm-linux-gnueabihf/libz.so.1
Reading symbols from /lib/arm-linux-gnueabihf/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/arm-linux-gnueabihf/libgcc_s.so.1
Reading symbols from /lib/arm-linux-gnueabihf/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Loaded symbols for /lib/arm-linux-gnueabihf/libpthread.so.0
Reading symbols from /lib/arm-linux-gnueabihf/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/arm-linux-gnueabihf/libc.so.6
Reading symbols from /lib/arm-linux-gnueabihf/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/arm-linux-gnueabihf/libdl.so.2
Reading symbols from /lib/ld-linux-armhf.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux-armhf.so.3
0x40278ef8 in memcpy () from /lib/arm-linux-gnueabihf/libc.so.6
(gdb) n
Single stepping until exit from function memcpy,
which has no line number information.
hread (nbytes=65536, buffer=0x850b0, fp=0x84030) at htslib/hfile.h:146
146     fp->begin += n;
(gdb) n
147     return (n == nbytes)? (ssize_t) n : hread2(fp, buffer, nbytes, n);
(gdb) n
146     fp->begin += n;
(gdb) n
147     return (n == nbytes)? (ssize_t) n : hread2(fp, buffer, nbytes, n);
(gdb) n
bgzf_read_block (fp=0x85068) at bgzf.c:425
425         if ( count==0 )
(gdb) n
427             fp->block_length = 0;
(gdb) n
428             return 0;
(gdb) n
542 }
(gdb) n
bgzf_getc (fp=0x85068) at bgzf.c:882
882         if (fp->block_length == 0) return -1; /* end-of-file */
(gdb) n
892 }
(gdb) n
fai_build_core (bgzf=0x85068) at faidx.c:126
126                 ++l1;
(gdb) n
127                 if (isgraph(c)) ++l2;
n(gdb) nn
Undefined command: "nn".  Try "help".
(gdb) n
128             } while ( (c=bgzf_getc(bgzf))>=0 && c != '\n');
(gdb) n
126                 ++l1;
(gdb) 
127                 if (isgraph(c)) ++l2;
n(gdb) n
128             } while ( (c=bgzf_getc(bgzf))>=0 && c != '\n');
(gdb) n
126                 ++l1;
(gdb) print l1
$1 = -447326328

@jkbonfield
Contributor

I can see an error in htslib/bgzf.c, although not why it happens. Adding error checking might solve it, though.

int bgzf_read_block(BGZF *fp)
{
    uint8_t header[BLOCK_HEADER_LENGTH], *compressed_block;
    int count, size = 0, block_length, remaining;

    // Reading an uncompressed file
    if ( !fp->is_compressed )
    {
        count = hread(fp->fp, fp->uncompressed_block, BGZF_MAX_BLOCK_SIZE);
        if ( count==0 )
        {
            fp->block_length = 0;
            return 0;
        }
        if (fp->block_length != 0) fp->block_offset = 0;
        fp->block_address += count;
        fp->block_length = count;
        return 0;
    }
    ...

If hread() returns -1 then count is not 0, so it is added to fp->block_address and assigned to fp->block_length, making that -1, which in turn means the loop in bgzf_getc may count for an awfully long time.

Adding an "if (count < 0) return -1" in there after the hread call may at least make it fail rather than hang for ages, but it doesn't explain the failure. John? Petr?

@pd3
Member

pd3 commented Oct 2, 2014

James, I'll take a look at this, but it will be more efficient to wait until we have the 32 bit machine for testing.

@domibel

domibel commented Oct 2, 2014

The problem is that htslib/faidx.c

while ( (c=bgzf_getc(bgzf))>=0 ) {

doesn't catch the return code -1 from bgzf_getc, because c is declared as char, which is unsigned on ARM.

Here is a simple test:

#include <iostream>
#include <string>

int x() { return -1; }

int main()
{
    char c;
    c = x();
    std::cout << "x() as int  : " << x() << std::endl;
    std::cout << "x() as char : " << int(c) << std::endl;
}

On ARM:

x() as int  : -1
x() as char : 255

On amd64:

x() as int  : -1
x() as char : -1

@domibel

domibel commented Oct 2, 2014

Here is an easy fix.

diff --git a/faidx.c b/faidx.c
index 75ec84c..7baedf5 100644
--- a/faidx.c
+++ b/faidx.c
@@ -74,7 +74,8 @@ static inline void fai_insert_index(faidx_t *idx, const char *name, int len, int

 faidx_t *fai_build_core(BGZF *bgzf)
 {
-    char c, *name;
+    signed char c;
+    char *name;
     int l_name, m_name;
     int line_len, line_blen, state;
     int l1, l2;

@pd3
Member

pd3 commented Oct 3, 2014

This is much appreciated, thank you. Fixed by samtools/htslib@14a4a81

@jkbonfield
Contributor

It's perhaps safer to use int instead. Although in this situation it doesn't matter as fasta is pure 7-bit ASCII, generally getc type calls want to return 0-255 plus -1, so 257 possible values. Hence int and not char (signed or otherwise).

Thanks for identifying the cause of the problem.

@charles-plessy
Contributor Author

Many thanks, everybody! I can also confirm that the tests no longer hang on ARM.
