Sample order matters? #222

louislamont · 2020-01-04T00:38:39Z

Hi, I noticed some strange behavior when running HISAT2 (v 2.1.0) on many files at once. Some background: The data I'm working with are from 2 flow cells (fcA/B), 4 lanes each(L001/2/3/4), paired end, so we have 16 files per sample (e.g., sample1.L001.fcA.R1.fq, sample1.L001.fcA.R2.fq ... sample1.L004.fcB.R1.fq, sample1.L004.fcB.R2.fq). The issue: When setting up the HISAT2 pipeline, I noticed differences in number of aligned reads depending on the order that the fq files for a given sample were supplied to HISAT2 (meaning, when I set up the script manually, samples were entered by flowcell (L001.fcA, L002.fcA, etc). Using bash scripting provided the same samples, but ordered alphabetically - L001.fcA, L001.fcB, etc.). Here are the two summary files that were produced. By flowcell:

112154948 reads; of these:
  112154948 (100.00%) were paired; of these:
    6283267 (5.60%) aligned concordantly 0 times
    99278753 (88.52%) aligned concordantly exactly 1 time
    6592928 (5.88%) aligned concordantly >1 times
    ----
    6283267 pairs aligned concordantly 0 times; of these:
      624298 (9.94%) aligned discordantly 1 time
    ----
    5658969 pairs aligned 0 times concordantly or discordantly; of these:
      11317938 mates make up the pairs; of these:
        6719600 (59.37%) aligned 0 times
        3760871 (33.23%) aligned exactly 1 time
        837467 (7.40%) aligned >1 times
97.00% overall alignment rate

By lane:

112154948 reads; of these:
  112154948 (100.00%) were paired; of these:
    6284631 (5.60%) aligned concordantly 0 times
    99209992 (88.46%) aligned concordantly exactly 1 time
    6660325 (5.94%) aligned concordantly >1 times
    ----
    6284631 pairs aligned concordantly 0 times; of these:
      624071 (9.93%) aligned discordantly 1 time
    ----
    5660560 pairs aligned 0 times concordantly or discordantly; of these:
      11321120 mates make up the pairs; of these:
        6722040 (59.38%) aligned 0 times
        3758172 (33.20%) aligned exactly 1 time
        840908 (7.43%) aligned >1 times

I set up two scripts that explicitly gave the same sets of reads in different order and was able to reproduce the results. When I used featureCounts to get counts from each bam file, they were pretty close to one another (only 3191/60623 genes had different counts), but it seemed odd to me that specifying the same files in two different orders would lead to differences in alignment and counts. Has anyone else noticed anything like this? Am I doing something wrong, is this a bug, or is HISAT functioning properly? Does this post even make sense?

Here are the commands I ran. Samples ordered by flowcell:

hisat2 -p 20 --rna-strandness RF \
--dta -x path/to/indexes/HS2-ens98 \
--un-conc-gz data/testfiles/unaligned/sample1-unaligned.R%.fq.gz \
-1 data/testfiles/trimmed/sample1.L001.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L001.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcB.R1.trimmed.fq \
-2 data/testfiles/trimmed/sample1.L001.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L001.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcB.R2.trimmed.fq | \
samtools view -q 15 -Su | samtools sort -m 3G -@ 15 - \
-o data/testfiles/bams/sample1-byflowcell.sorted.bam

Samples ordered by lane:

hisat2 -p 20 --rna-strandness RF \
--dta -x path/to/indexes/HS2-ens98 \
--un-conc-gz data/testfiles/unaligned/sample1-unaligned.R%.fq.gz \
-1 data/testfiles/trimmed/sample1.L001.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L001.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcB.R1.trimmed.fq \
-2 data/testfiles/trimmed/sample1.L001.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L001.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcB.R2.trimmed.fq | \
samtools view -q 15 -Su | samtools sort -m 3G -@ 15 - \
-o data/testfiles/bams/sample1-bylane.sorted.bam

The text was updated successfully, but these errors were encountered:

parkchanhee · 2020-01-07T03:36:26Z

Hi @louislamont
When HISAT2 does RNA-alignment, HISAT2 tries to find splice sites from earlier reads' result, and use it.
Splice site information that HISAT2 found may vary slightly depending on reads order.

Could you run it again with '--no-temp-splicesite' option?

louislamont · 2020-01-07T19:32:35Z

Thanks for the reply. When running with the --no-temp-splicesite option, I get the same alignment regardless of file order. Here is the summary output from both scripts:

112154948 reads; of these:
  112154948 (100.00%) were paired; of these:
    6376372 (5.69%) aligned concordantly 0 times
    98844999 (88.13%) aligned concordantly exactly 1 time
    6933577 (6.18%) aligned concordantly >1 times
    ----
    6376372 pairs aligned concordantly 0 times; of these:
      622770 (9.77%) aligned discordantly 1 time
    ----
    5753602 pairs aligned 0 times concordantly or discordantly; of these:
      11507204 mates make up the pairs; of these:
        6873539 (59.73%) aligned 0 times
        3701254 (32.16%) aligned exactly 1 time
        932411 (8.10%) aligned >1 times
96.94% overall alignment rate

I am using HISAT2 as part of the Stringtie lncRNA discovery pipeline... Do you have any recommendations on whether or not to include the --no-temp-splicesite flag?

agalitsyna · 2021-02-02T14:19:39Z

Hi, I run hisat2 with both --no-temp-splicesite and --novel-splicesite-outfile options, and I notice that the reads order is important again.

I use a different test design than @louislamont. I have one fastq file with 2 mln single-end reads of different sizes (from 14 to 64 bps) and split it into two: top 1 mln reads file and bottom 1 mln reads file. Then I run hisat2 v2.2.1 on these three files (--reorder option on) and compare the output for the individual reads.

Top 1 mln reads map identically if processed as a separate file or as the beginning of a larger library. However, bottom 1 mln reads mapping seems to depend on splicing annotation of the upstream processed reads.

I compare four commands:

hisat2 --reorder -p 3 -k 1 -x reference.fa --no-softclip --no-spliced-alignment
results in identical mapping of bottom 1 mln reads.
hisat2 --reorder -p 3 -k 1 -x reference.fa --no-softclip --dta-cufflinks --known-splicesite-infile known-splicesites.txt
results in 3919 reads with differing annotation (~0.39%). The differences include different cigar fields, NH:i tags, mapping status and chromosome). This is expected, as explained in Sample order matters? #222 (comment) by @parkchanhee.
hisat2 --reorder -p 3 -k 1 -x reference.fa --no-softclip --dta-cufflinks --known-splicesite-infile known-splicesites.txt --no-temp-splicesite
results in identical mapping, as expected. As I understand, hisat2 in this mode does not use cumulative information on novel splice sites, and the reads processed upstream do not matter.
hisat2 --reorder -p 3 -k 1 -x reference.fa --no-softclip --dta-cufflinks --known-splicesite-infile known-splicesites.txt --no-temp-splicesite --novel-splicesite-outfile novel-splicesites.txt
results in 3917 reads with differing annotation (again!). This I don't understand. It looks like the writing of splice sites changes the mapping of downstream reads.

To give you some more numbers, if I compare the output of --no-softclip with and without --novel-splicesite-outfile, it produces around 0.43% of different mapping annotations.

Do you recommend using --novel-splicesite-outfile with --no-temp-splicesite? How do I report splicing sites and make sure that I do not depend on the order of the reads, and upstream processed reads in particular?

Genome is not human, but fruitfly, but it should not be relevant.

parkchanhee · 2021-02-04T17:49:40Z

@agalitsyna
If you use both options (--no-temp-splicesite, --novel-splicesite-outfile) at the same time, the --no-temp-splicesite option doesn't work internally, so you will get slightly different results.

Currently, there is no way to generate same result regardless of the order of input reads with using the --novel-splicesite-outfile option.

ewallace mentioned this issue Feb 12, 2020

Do not multi-map rRNA/contaminant reads riboviz/riboviz#127

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sample order matters? #222

Sample order matters? #222

louislamont commented Jan 4, 2020 •

edited

Loading

parkchanhee commented Jan 7, 2020

louislamont commented Jan 7, 2020

agalitsyna commented Feb 2, 2021

parkchanhee commented Feb 4, 2021 •

edited

Loading

Sample order matters? #222

Sample order matters? #222

Comments

louislamont commented Jan 4, 2020 • edited Loading

parkchanhee commented Jan 7, 2020

louislamont commented Jan 7, 2020

agalitsyna commented Feb 2, 2021

parkchanhee commented Feb 4, 2021 • edited Loading

louislamont commented Jan 4, 2020 •

edited

Loading

parkchanhee commented Feb 4, 2021 •

edited

Loading