Failed to make fai #67

juancresc · 2017-12-01T11:25:06Z

I'm getting this while running:

Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
[fai_build_core] different line length in sequence 'stitch_167'.
        Failed to make fai of stitched genome. Aborting Run

This is the command I've used to run:

./files/libs/ShortStack/ShortStack --readfile files/output/TAEs.21.fasta --outdir files/output/sstack2 --genomefile files/data/Triticum_aestivum.TGACv1.dna.toplevel.fa --bowtie_cores 3

MikeAxtell · 2017-12-01T12:47:33Z

Yes, I've seen this before. It results from an improperly formatted FASTA genome file. The samtools faidx command that ShortStack calls fails. You can verify this by running the command directly:

samtools faidx files/data/Triticum_aestivum.TGACv1.dna.toplevel.fa

If I'm right you'll get the same error and a failure to make the .fai genome index.

When I've seen it before it has been one of two things:

One or more completely empty lines in the FASTA file (i.e. that match the regex ^$).
One or more sequence data lines of unequal length. For instance, a line with 60 nts followed by a line with 70 nts will cause samtools faidx to fail.
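
Both conditions can be checked up front. The following is a minimal Python sketch, not part of ShortStack or samtools: the function name `find_fasta_layout_problems` is hypothetical, and the rule it applies (within one record, every sequence line must match the length of the record's first sequence line, except that the final line may be shorter) is an assumption about what samtools faidx expects, based on the two failure modes listed above.

```python
def find_fasta_layout_problems(path):
    """Return a list of (line_number, message) for suspicious FASTA lines.

    Hypothetical helper: flags empty lines and sequence lines whose length
    deviates from the first sequence line of the same record. A shorter
    line is accepted only if it is the last line of its record.
    """
    problems = []
    width = None        # length of the first sequence line in the current record
    short_line = None   # line number of a shorter-than-width line, if any
    with open(path, "rb") as handle:
        for lineno, raw in enumerate(handle, start=1):
            line = raw.rstrip(b"\r\n")
            if line.startswith(b">"):
                # New record: the previous short line (if any) was its last line.
                width, short_line = None, None
                continue
            if not line:
                problems.append((lineno, "empty line"))
                continue
            if short_line is not None:
                # A short line was followed by more sequence in the same record.
                problems.append((short_line, "short line is not the last line of its record"))
                short_line = None
            if width is None:
                width = len(line)
            elif len(line) > width:
                problems.append((lineno, "line longer than the first line of its record"))
            elif len(line) < width:
                short_line = lineno
    return problems
```

Calling `find_fasta_layout_problems("genome.fa")` (the file name is a placeholder) on a well-formed file should return an empty list. Trailing `\r` characters are stripped before measuring, so this check does not report DOS-style line endings.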

I also noticed that when I googled using just your genome file name 'Triticum_aestivum.TGACv1.dna.toplevel.fa' among the top hits were complaints from other people who've tried to work with that particular file. That is consistent with my hypothesis of an incorrectly formatted (or corrupted) FASTA file for this genome build.

Hope this helps and let me know.

juancresc · 2017-12-04T17:02:11Z

Just to note: I ran the same command twice, one run right after the other. I've also run samtools and it works fine. I don't know why the first run throws the error.

Anyway, it's working fine now.

Run Progress and Messages:

Mon Dec  4 11:58:39 EST 2017
Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
        Failed to make fai of stitched genome. Aborting Run

Run Progress and Messages:
Use of uninitialized value $fai_fields[1] in numeric lt (<) at ./files/libs/ShortStack/ShortStack line 1009, <FAI> line 136610.

Mon Dec  4 12:00:20 EST 2017
Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
        Done

MikeAxtell · 2017-12-04T19:34:12Z

Weird. But I'm glad you are up and running.

-- Michael J. Axtell, Ph.D. Professor of Biology Penn State University http://sites.psu.edu/axtell

juancresc · 2017-12-12T12:12:51Z

Seems like the issue is in the stitched file:


samtools faidx files/output/sstack_toplevelformatted_taes21/wheat_stitched.fasta 
[fai_build_core] different line length in sequence 'stitch_54'.

I'm able to run samtools on the genome file, but not on the stitched one.

MikeAxtell · 2017-12-12T12:35:44Z

Weird. Can you send me the URL where I can find the exact genome file you are working with, so I can replicate the error here? I'll mark this as a bug. Once I get the exact genome file I will be able to begin testing. Thanks, Mike

juancresc · 2017-12-12T12:38:16Z

I'm using this genome:

ftp://ftp.ensemblgenomes.org/pub/plants/release-37/fasta/triticum_aestivum/dna/Triticum_aestivum.TGACv1.dna.toplevel.fa.gz

I'm also using a formatted version, with the same results. For formatting I've used a biopython script:


#!/usr/bin/env python
# -*- coding: utf-8 -*-
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

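def FAfixer(input_fasta, output_fasta):
    """
    Read every record from input_fasta and write it to output_fasta.

    Biopython's FASTA writer re-wraps the sequence lines to a uniform
    width, which is the "formatting" step referred to above.
    """
    buffer_seqs = []
    for seq_record in SeqIO.parse(input_fasta, "fasta"):
        # Copy only the sequence, id and description into a fresh record.
        record = SeqRecord(seq_record.seq, id=seq_record.id, description=seq_record.description)
        buffer_seqs.append(record)
    # Note: all records are held in memory before being written.
    SeqIO.write(buffer_seqs, output_fasta, "fasta")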

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()#pylint: disable=invalid-name
    parser.add_argument("-i", "--input_fasta", help="(.fasta)", required=True)
    parser.add_argument("-o", "--output_fasta", help="(.fasta)", required=True)
    args = parser.parse_args()#pylint: disable=invalid-name
    FAfixer(args.input_fasta, args.output_fasta)

MikeAxtell · 2017-12-12T13:48:53Z

Thanks. I will take a look at this as soon as I can and get back to you.

juancresc · 2017-12-14T11:55:45Z

I'm reviewing the stitched file. I'm not quite sure why it is only 2.6 GB while the genome file is 34.9

MikeAxtell · 2017-12-14T12:41:23Z

Hi again.

I was unable to reproduce the error you report when using the version of the wheat genome you specified along with a publicly available set of wheat small RNA-seq data. I did not modify or polish the genome file in any way except to decompress it from .gz form after download.

Your message stated that the "...the genome file 34.9" [GB]. I'm not sure why that is .. the Triticum_aestivum.TGACv1.dna.toplevel.fa file I retrieved from ensembl at the link you provided is 13G in size when decompressed (13723165766 bytes to be exact), not 34.9G. I suspect your local copy of this genome file has been corrupted and or concatenated with something else.

For comparison my testing environment is:

ShortStack 3.8.3
samtools 1.4.1
RNAfold 2.3.5
bowtie 1.2.1.1

Hope this helps,
Mike

juancresc · 2017-12-14T12:56:38Z

Sorry, that was a typo; the genome file is 13G. I'm running it again and will add all the information I can.

-- Juan Manuel Crescente Eng. Software development and IT management Bioinformatics - PhD Candidate @ INTA/CONICET Please consider the environment before printing this email.

juancresc · 2017-12-14T13:00:44Z

This is my last run:

(wheat.fasta is the 13 GB genome file)

nohup ./files/libs/ShortStack/ShortStack --readfile files/output/TAEs.21.fasta --outdir files/output/sstack_toplevelformatted_taes21 --genomefile /dev/wheat.fasta --bowtie_cores 3 &

ShortStack version 3.8.4
Thu Dec 14 07:46:57 EST 2017
hostname: vzmarcosjuarezubuntu
working directory: /home/trigo/runs/miRNA-MITE
Settings:
RNAfold_version: 2
bowtie_cores: 3
bowtie_m: 50
dicermax: 24
dicermin: 20
error_logfile: files/output/sstack_toplevelformatted_taes21/ErrorLogs.txt
foldsize: 300
genomefile: /dev/wheat.fasta
logfile: files/output/sstack_toplevelformatted_taes21/Log.txt
mincov: 0.5rpm
mismatches: 1
mmap: u
outdir: files/output/sstack_toplevelformatted_taes21
pad: 75
ranmax: 3
readfile: files/output/TAEs.21.fasta
sort_mem: 768M
strand_cutoff: 0.8

Run Progress and Messages:

Thu Dec 14 07:46:58 EST 2017
Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
[fai_build_core] different line length in sequence 'stitch_186'.
        Failed to make fai of stitched genome. Aborting Run

MikeAxtell · 2017-12-14T15:41:30Z

Hi Juan,

I noticed you were using ShortStack 3.8.4. This is the latest commit on github but is not yet an "official" release. See https://github.com/MikeAxtell/ShortStack/releases for the official releases. That said, it shouldn't make a difference because the changes between the latest commit you used and the last official release won't affect this error.

To make sure, I re-tested with the wheat genome on my system with 3.8.4. I had no problems completing a run and was not able to reproduce your error (Log.txt copied below).

So, I'm not sure what to tell you. I can't reproduce the problem here with the same genome. The error message you are getting comes from samtools faidx being unable to index the "stitched" genome, which is an internal construct that ShortStack makes to speed sequence retrievals for RNA folding (highly fragmented genomes really slow down sequence retrievals). The error indicates that the stitched genome is corrupted in some way such that samtools can't index it. But since I can't reproduce the error I'm not sure what else to do. I'm not convinced that it is a ShortStack bug.

You had mentioned that you pre-processed the genome and showed some code. I would suggest re-downloading the original genome fasta file, making sure the checksums are good, rebuilding the bowtie indices, and re-running.

###

ShortStack version 3.8.4
Thu Dec 14 09:43:44 EST 2017
hostname: comp-bc-0143.acib.production.int.aci.ics.psu.edu
working directory: /storage/work/mja18
Settings:
RNAfold_version: 2
bamfile: SS_1/SRR3721341.bam
dicermax: 24
dicermin: 20
error_logfile: SS_384/ErrorLogs.txt
foldsize: 300
genomefile: Triticum_aestivum.TGACv1.dna.toplevel.fa
logfile: SS_384/Log.txt
mincov: 0.5rpm
outdir: SS_384
pad: 75
sort_mem: 16G
strand_cutoff: 0.8

Run Progress and Messages:

Thu Dec 14 09:43:46 EST 2017
Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
        Done
Tally of primary alignments (INCLUDES unmapped, but EXCLUDES secondary, duplicate, failed QC, and supplementary alignments): 8892343
Tally of PLACED primary alignments (EXCLUDES unmapped, secondary, duplicate, failed QC, and supplementary alignments): 5267484

Thu Dec 14 09:51:29 EST 2017
Performing de-novo cluster identification and analyses. At specified mincov of 0.5rpm with 5267484 placed primary reads, mincov is 3 raw alignments
Completed at Thu Dec 14 10:29:10 EST 2017
Performing search for unplaced small RNAs.
Completed at Thu Dec 14 10:30:09 EST 2017

Thu Dec 14 10:30:10 EST 2017
Tally of loci by predominant RNA size (DicerCall):
DicerCall   NotMIRNA   MIRNA
N or NA     8884       0
20          1150       2
21          8590       74
22          1829       19
23          2774       3
24          59828      1

Unplaced small RNAs
Size   MultiMapped   NoAlignments
<20    6437          5707
20     2513          4030
21     6320          18115
22     3057          7317
23     2744          2918
24     10008         6732
>24    3629          6247
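
One way to carry out the checksum comparison suggested above is sketched below. This is an illustrative Python helper, not something ShortStack provides; the function name `file_digest` and the default `md5` algorithm are assumptions, so use whichever digest the download site actually publishes.

```python
import hashlib
import os


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks so that a
    multi-gigabyte genome file never has to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_size_bytes(path):
    """Return the file size in bytes for a quick sanity check."""
    return os.path.getsize(path)
```

The value returned by `file_size_bytes` can be compared against the 13723165766 bytes quoted earlier in this thread for the decompressed wheat genome, and the digest against the value published alongside the download.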

juancresc · 2017-12-15T11:30:23Z

Dear Mike, Thank you for your detailed explanation, I'll continue trying till I get it working. Once I have a solution I'll update you. Thanks again and kind regards,

MikeAxtell · 2018-01-25T15:03:36Z

I am going through old issues today; did you ever resolve this one?

juancresc · 2018-03-12T16:02:05Z

Hi Mike, sorry for the delay. I still haven't had the chance to run it again, but I will soon and will post an update here. Cheers

juancresc · 2018-06-15T19:28:02Z

I was able to run the pipeline with the genome with no problems.

palperbel · 2021-02-03T14:55:27Z

Hi all! I'm having the exact same problem as the one reported here, but I'm not able to solve it.

I'm using an internal reference genome that was built at my company. The reference is at scaffold level and contains around 10000 scaffolds. It has been validated and works fine, but I'm not sure whether the number of scaffolds could be the main reason for the problem.

In one of the comments this was mentioned as a possible source of the problem: "One or more sequence data lines of unequal length. For instance a line with 60 nts, followed by a line with 70nts, will cause samtools faidx to fail"
But this is something that happens in any FASTA reference, right? All the lines in each chr/scaffold are 60 nts except for the last one, which will have a different length (between 1 and 60 nts).

When ShortStack tries to create the index from the stitched file, samtools faidx crashes, and this also happens when I run samtools faidx on the stitched file directly. However, I'm able to run samtools faidx on my reference.fasta. I have checked the stitched file and found some empty lines, which I removed, but it still doesn't work.

Here I report the error while running ShortStack:
###################
Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
Failed to make fai of stitched genome. Aborting Run
###################
Here I report the error while running samtools faidx:
###################
[fai_build_core] different line length in sequence 'stitch_0'.
Could not build fai index ${path}/reference_stitched.fasta.fai
###################

Any idea why this is happening and how to solve it?
Thanks!!

MikeAxtell · 2021-02-03T15:05:51Z

Thanks for the report. Because you can successfully samtools faidx your original genome file, it does indeed sound like the "genome stitching" is to blame somehow; it must, under some circumstances, be making odd-length sequence lines.

If it's possible, you could share your genome file with me (emailing me directly might be best here), and I could try to figure it out. I don't think I'll be able to figure out the bug fix without the example case.

btw you are right that the last sequence line in any multi-FASTA file can be a shorter length. samtools faidx doesn't complain about that .. it complains about internal line-length deviations.

Anyway if you are able to post or send me the offending genome as a test case, I can try and run it down.

Mike

MikeAxtell · 2021-02-03T15:12:59Z

Another quick idea you can look at .. ShortStack assumes (and never bothers checking, bad on me) that your .fasta file has linux-like line endings \n. If you have DOS-type \r\n line endings, that might be the reason stitching is getting fouled up. Anyway, you could quickly convert your .fasta file to force it to linux-style line breaks and see if that helps .. just a thought.
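
The line-ending check and conversion could look like the following minimal Python sketch. The function names `has_crlf` and `convert_to_lf` are hypothetical and not part of ShortStack; the sketch assumes the FASTA file has ordinary short lines, so reading it line by line is reasonable.

```python
def has_crlf(path):
    """Return True if the first line of the file ends with DOS-style \\r\\n.

    Only the first line is inspected, which is a quick heuristic rather
    than a full scan.
    """
    with open(path, "rb") as handle:
        return handle.readline().endswith(b"\r\n")


def convert_to_lf(src, dst):
    """Copy src to dst with every line terminated by a single \\n.

    Returns the number of lines that originally ended in \\r\\n.
    """
    converted = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for raw in fin:
            if raw.endswith(b"\r\n"):
                converted += 1
            # Drop any trailing CR/LF characters, then add a Unix newline.
            fout.write(raw.rstrip(b"\r\n") + b"\n")
    return converted
```

Writing to a separate output file keeps the original genome untouched, so the two can be compared afterwards. Command-line tools such as dos2unix do the same job.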

palperbel · 2021-02-03T15:41:37Z

WOW!!! Thanks so much for your fast reply!! I ran dos2unix to get linux-like line endings and it worked!!!!!!!!!! The stitched index file has been created! :) Now ShortStack is running!!!

Thanks so much again!! You made my day!!!

Paloma

MikeAxtell · 2021-02-03T15:42:58Z

Glad to hear it.

MikeAxtell self-assigned this Dec 1, 2017

MikeAxtell added the question label Dec 1, 2017

MikeAxtell added the bug label Dec 12, 2017

juancresc closed this as completed Jun 15, 2018

MikeAxtell reopened this Feb 3, 2021

MikeAxtell closed this as completed Feb 3, 2021
