Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sample order matters? #222

Open
louislamont opened this issue Jan 4, 2020 · 4 comments
Open

Sample order matters? #222

louislamont opened this issue Jan 4, 2020 · 4 comments

Comments

@louislamont
Copy link

louislamont commented Jan 4, 2020

Hi, I noticed some strange behavior when running HISAT2 (v 2.1.0) on many files at once. Some background: The data I'm working with are from 2 flow cells (fcA/B), 4 lanes each(L001/2/3/4), paired end, so we have 16 files per sample (e.g., sample1.L001.fcA.R1.fq, sample1.L001.fcA.R2.fq ... sample1.L004.fcB.R1.fq, sample1.L004.fcB.R2.fq). The issue: When setting up the HISAT2 pipeline, I noticed differences in number of aligned reads depending on the order that the fq files for a given sample were supplied to HISAT2 (meaning, when I set up the script manually, samples were entered by flowcell (L001.fcA, L002.fcA, etc). Using bash scripting provided the same samples, but ordered alphabetically - L001.fcA, L001.fcB, etc.). Here are the two summary files that were produced. By flowcell:

112154948 reads; of these:
  112154948 (100.00%) were paired; of these:
    6283267 (5.60%) aligned concordantly 0 times
    99278753 (88.52%) aligned concordantly exactly 1 time
    6592928 (5.88%) aligned concordantly >1 times
    ----
    6283267 pairs aligned concordantly 0 times; of these:
      624298 (9.94%) aligned discordantly 1 time
    ----
    5658969 pairs aligned 0 times concordantly or discordantly; of these:
      11317938 mates make up the pairs; of these:
        6719600 (59.37%) aligned 0 times
        3760871 (33.23%) aligned exactly 1 time
        837467 (7.40%) aligned >1 times
97.00% overall alignment rate

By lane:

112154948 reads; of these:
  112154948 (100.00%) were paired; of these:
    6284631 (5.60%) aligned concordantly 0 times
    99209992 (88.46%) aligned concordantly exactly 1 time
    6660325 (5.94%) aligned concordantly >1 times
    ----
    6284631 pairs aligned concordantly 0 times; of these:
      624071 (9.93%) aligned discordantly 1 time
    ----
    5660560 pairs aligned 0 times concordantly or discordantly; of these:
      11321120 mates make up the pairs; of these:
        6722040 (59.38%) aligned 0 times
        3758172 (33.20%) aligned exactly 1 time
        840908 (7.43%) aligned >1 times

I set up two scripts that explicitly gave the same sets of reads in different order and was able to reproduce the results. When I used featureCounts to get counts from each bam file, they were pretty close to one another (only 3191/60623 genes had different counts), but it seemed odd to me that specifying the same files in two different orders would lead to differences in alignment and counts. Has anyone else noticed anything like this? Am I doing something wrong, is this a bug, or is HISAT functioning properly? Does this post even make sense?

Here are the commands I ran. Samples ordered by flowcell:

hisat2 -p 20 --rna-strandness RF \
--dta -x path/to/indexes/HS2-ens98 \
--un-conc-gz data/testfiles/unaligned/sample1-unaligned.R%.fq.gz \
-1 data/testfiles/trimmed/sample1.L001.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L001.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcB.R1.trimmed.fq \
-2 data/testfiles/trimmed/sample1.L001.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L001.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcB.R2.trimmed.fq | \
samtools view -q 15 -Su | samtools sort -m 3G -@ 15 - \
-o data/testfiles/bams/sample1-byflowcell.sorted.bam

Samples ordered by lane:

hisat2 -p 20 --rna-strandness RF \
--dta -x path/to/indexes/HS2-ens98 \
--un-conc-gz data/testfiles/unaligned/sample1-unaligned.R%.fq.gz \
-1 data/testfiles/trimmed/sample1.L001.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L001.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcB.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcA.R1.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcB.R1.trimmed.fq \
-2 data/testfiles/trimmed/sample1.L001.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L001.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L002.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L003.fcB.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcA.R2.trimmed.fq,\
data/testfiles/trimmed/sample1.L004.fcB.R2.trimmed.fq | \
samtools view -q 15 -Su | samtools sort -m 3G -@ 15 - \
-o data/testfiles/bams/sample1-bylane.sorted.bam
@parkchanhee
Copy link
Collaborator

Hi @louislamont
When HISAT2 does RNA-alignment, HISAT2 tries to find splice sites from earlier reads' result, and use it.
Splice site information that HISAT2 found may vary slightly depending on reads order.

Could you run it again with '--no-temp-splicesite' option?

@louislamont
Copy link
Author

Thanks for the reply. When running with the --no-temp-splicesite option, I get the same alignment regardless of file order. Here is the summary output from both scripts:

112154948 reads; of these:
  112154948 (100.00%) were paired; of these:
    6376372 (5.69%) aligned concordantly 0 times
    98844999 (88.13%) aligned concordantly exactly 1 time
    6933577 (6.18%) aligned concordantly >1 times
    ----
    6376372 pairs aligned concordantly 0 times; of these:
      622770 (9.77%) aligned discordantly 1 time
    ----
    5753602 pairs aligned 0 times concordantly or discordantly; of these:
      11507204 mates make up the pairs; of these:
        6873539 (59.73%) aligned 0 times
        3701254 (32.16%) aligned exactly 1 time
        932411 (8.10%) aligned >1 times
96.94% overall alignment rate

I am using HISAT2 as part of the Stringtie lncRNA discovery pipeline... Do you have any recommendations on whether or not to include the --no-temp-splicesite flag?

@agalitsyna
Copy link

Hi, I run hisat2 with both --no-temp-splicesite and --novel-splicesite-outfile options, and I notice that the reads order is important again.

I use a different test design than @louislamont. I have one fastq file with 2 mln single-end reads of different sizes (from 14 to 64 bps) and split it into two: top 1 mln reads file and bottom 1 mln reads file. Then I run hisat2 v2.2.1 on these three files (--reorder option on) and compare the output for the individual reads.

Top 1 mln reads map identically if processed as a separate file or as the beginning of a larger library. However, bottom 1 mln reads mapping seems to depend on splicing annotation of the upstream processed reads.

I compare four commands:

  • hisat2 --reorder -p 3 -k 1 -x reference.fa --no-softclip --no-spliced-alignment
    results in identical mapping of bottom 1 mln reads.

  • hisat2 --reorder -p 3 -k 1 -x reference.fa --no-softclip --dta-cufflinks --known-splicesite-infile known-splicesites.txt
    results in 3919 reads with differing annotation (~0.39%). The differences include different cigar fields, NH:i tags, mapping status and chromosome). This is expected, as explained in Sample order matters? #222 (comment) by @parkchanhee.

  • hisat2 --reorder -p 3 -k 1 -x reference.fa --no-softclip --dta-cufflinks --known-splicesite-infile known-splicesites.txt --no-temp-splicesite
    results in identical mapping, as expected. As I understand, hisat2 in this mode does not use cumulative information on novel splice sites, and the reads processed upstream do not matter.

  • hisat2 --reorder -p 3 -k 1 -x reference.fa --no-softclip --dta-cufflinks --known-splicesite-infile known-splicesites.txt --no-temp-splicesite --novel-splicesite-outfile novel-splicesites.txt
    results in 3917 reads with differing annotation (again!). This I don't understand. It looks like the writing of splice sites changes the mapping of downstream reads.

To give you some more numbers, if I compare the output of --no-softclip with and without --novel-splicesite-outfile, it produces around 0.43% of different mapping annotations.

Do you recommend using --novel-splicesite-outfile with --no-temp-splicesite? How do I report splicing sites and make sure that I do not depend on the order of the reads, and upstream processed reads in particular?

Genome is not human, but fruitfly, but it should not be relevant.

@parkchanhee
Copy link
Collaborator

parkchanhee commented Feb 4, 2021

@agalitsyna
If you use both options (--no-temp-splicesite, --novel-splicesite-outfile) at the same time, the --no-temp-splicesite option doesn't work internally, so you will get slightly different results.

Currently, there is no way to generate same result regardless of the order of input reads with using the --novel-splicesite-outfile option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants