Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed to make fai #67

Closed
juancresc opened this issue Dec 1, 2017 · 21 comments
Closed

Failed to make fai #67

juancresc opened this issue Dec 1, 2017 · 21 comments
Assignees

Comments

@juancresc
Copy link

I'm getting this while running:

Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
[fai_build_core] different line length in sequence 'stitch_167'.
        Failed to make fai of stitched genome. Aborting Run

This is the command I've used to run:

./files/libs/ShortStack/ShortStack --readfile files/output/TAEs.21.fasta --outdir files/output/sstack2 --genomefile files/data/Triticum_aestivum.TGACv1.dna.toplevel.fa --bowtie_cores 3

@MikeAxtell MikeAxtell self-assigned this Dec 1, 2017
@MikeAxtell
Copy link
Owner

Yes, I've seen this before. It results from an improperly FASTA format in the genome file. The samtools faidx command that ShortStack is calling fails. You can verify by simply testing the command

samtools faidx files/data/Triticum_aestivum.TGACv1.dna.toplevel.fa

If I'm right you'll get the same error and a failure to make the .fai genome index.

When I've seen it before it has been one of two things:

  1. One or more completely empty lines in the FASTA file (i.e. that match the regex ^$).

  2. One or more sequence data lines of unequal length. For instance a line with 60 nts, followed by a line with 70nts, will cause samtools faidx to fail.

I also noticed that when I googled using just your genome file name 'Triticum_aestivum.TGACv1.dna.toplevel.fa' among the top hits were complaints from other people who've tried to work with that particular file. That is consistent with my hypothesis of an incorrectly formatted (or corrupted) FASTA file for this genome build.

Hope this helps and let me know.

@juancresc
Copy link
Author

Just to notice, the same run, twice one after another. I've also run samtools and works fine. I don't know why the first run throws the error.

Anyway working fine now

Run Progress and Messages:

Mon Dec  4 11:58:39 EST 2017
Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
        Failed to make fai of stitched genome. Aborting Run

Run Progress and Messages:
Use of uninitialized value $fai_fields[1] in numeric lt (<) at ./files/libs/ShortStack/ShortStack line 1009, <FAI> line 136610.

Mon Dec  4 12:00:20 EST 2017
Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
        Done

@MikeAxtell
Copy link
Owner

MikeAxtell commented Dec 4, 2017 via email

@juancresc
Copy link
Author

Seems like the issue is in the stiched file:


samtools faidx files/output/sstack_toplevelformatted_taes21/wheat_stitched.fasta 
[fai_build_core] different line length in sequence 'stitch_54'.

I'm able to run samtools in the genomefile, but not in the stiched

@MikeAxtell
Copy link
Owner

MikeAxtell commented Dec 12, 2017 via email

@MikeAxtell MikeAxtell added the bug label Dec 12, 2017
@juancresc
Copy link
Author

I'm using this genome:

ftp://ftp.ensemblgenomes.org/pub/plants/release-37/fasta/triticum_aestivum/dna/Triticum_aestivum.TGACv1.dna.toplevel.fa.gz

I'm also using a formatted version, with the same results. For formatting I've used a biopython script:


#!/usr/bin/env python
# -*- coding: utf-8 -*-
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

def FAfixer(input_fasta, output_fasta):
    """
    """
    buffer_seqs = []
    for seq_record in SeqIO.parse(input_fasta, "fasta"):
        record = SeqRecord(seq_record.seq, id=seq_record.id, description=seq_record.description)
        buffer_seqs.append(record)
    SeqIO.write(buffer_seqs, output_fasta, "fasta")

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()#pylint: disable=invalid-name
    parser.add_argument("-i", "--input_fasta", help="(.fasta)", required=True)
    parser.add_argument("-o", "--output_fasta", help="(.fasta)", required=True)
    args = parser.parse_args()#pylint: disable=invalid-name
    FAfixer(args.input_fasta, args.output_fasta)

@MikeAxtell
Copy link
Owner

MikeAxtell commented Dec 12, 2017 via email

@juancresc
Copy link
Author

I'm reviewing the stiched file, not quite sure why it is only 2.6 GB while the genome file is 34.9

@MikeAxtell
Copy link
Owner

MikeAxtell commented Dec 14, 2017 via email

@juancresc
Copy link
Author

juancresc commented Dec 14, 2017 via email

@juancresc
Copy link
Author

juancresc commented Dec 14, 2017 via email

@MikeAxtell
Copy link
Owner

MikeAxtell commented Dec 14, 2017 via email

@juancresc
Copy link
Author

juancresc commented Dec 15, 2017 via email

@MikeAxtell
Copy link
Owner

I am going through old issues today; did you ever resolve this one?

@juancresc
Copy link
Author

juancresc commented Mar 12, 2018 via email

@juancresc
Copy link
Author

I was able to run the pipeline with the genome with no problems.

@palperbel
Copy link

Hi all! Im having the exact same problem as the one reported here but im not able to solve it.

Im using an internal reference genome that was done at my company. The only thing is that this reference is at scaffold level and it contains around 10000 scaffolds but it has been validated and it works fine, but not sure if it could be the main reason for the problem.

In one of the coments this was mention as a possible source of the problem: "One or more sequence data lines of unequal length. For instance a line with 60 nts, followed by a line with 70nts, will cause samtools faidx to fail"
But this is something that happens in any fasta reference right? all the lines in each chr/scaffold are 60 nts except for the last one that will have different length (between 1 - 60 nts).

When ShortStack is trying to create the index from the stitched file, samtools faidx crash, and this also happens when I try to run it directly with samtools faidx. However im able to run samtools faidx in my reference.fasta. I have check the stitched file and there were some empty lines that I have removed but still doesnt work.

Here I report the error while running ShortStack:
###################
Genome has more than 50 segments and more than two of them are < 1Mb. Stitching short references to improve performance ...
Failed to make fai of stitched genome. Aborting Run
###################
Here I report the error while running samtools faidx:
###################
[fai_build_core] different line length in sequence 'stitch_0'.
Could not build fai index ${path}/reference_stitched.fasta.fai
###################

Any idea why this is happening and how to solve it?
Thanks!!

@MikeAxtell
Copy link
Owner

Thanks for the report. Because you can successfully samtools faidx your original genome file, it does indeed sound like the "genome stitching" is to blame somehow; it must, under some circumstances, be making odd-length sequence lines.

If it's possible, you could share your genome file with me (directly email me might be best here), and I could try to figure it out. I don't think I'll be able to figure out the bug fix without the example case.

btw you are right that the last sequence line in any multi-FASTA file can be a shorter length. samtools faidx doesn't complain about that .. it complains about internal line-length deviations.

Anyway if you are able to post or send me the offending genome as a test case, I can try and run it down.

  • Mike

@MikeAxtell MikeAxtell reopened this Feb 3, 2021
@MikeAxtell
Copy link
Owner

Another quick idea you can look at .. ShortStack assumes (and never bother checking, bad on me) that your .fasta file has linux-like line endings \n. If you have DOS-type \r\n line endings, it might be the reason stitching is getting fouled up. Anyway you could quickly convert your .fasta file to force it to linux-style line breaks and see if that helps .. just a thought.

@palperbel
Copy link

WOW!!! Thanks so much for your fast reply!! I have done dos2unix in order to get linux-like line ending and it worked!!!!!!!!!! The stitched index file has been created! :) Now ShortStack is running!!!

Thanks so much again!! you make my day!!!

Paloma

@MikeAxtell
Copy link
Owner

Glad to hear it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants