STAR-Fusion Requires Missing File in CTAT Resource #15

Closed
DarioS opened this issue Oct 12, 2016 · 14 comments

Comments

@DarioS

DarioS commented Oct 12, 2016

I ran STAR-Fusion in the mode that uses existing STAR outputs. It runs for a couple of minutes, outputs candidate fusions, then terminates with an error about a missing index file:

Error: cannot locate GRCh37_gencode_v19_CTAT_lib/blast_pairs.idx

Indeed, there is no index provided by the downloaded resource:

$ ls GRCh37_gencode_v19_CTAT_lib
blast_pairs.outfmt6.gz  ref_annot.gtf  ref_cdna.fasta  ref_genome.fa  ref_genome.jf.min2.kmers

Perhaps the resources should be updated to contain all necessary files for fusion finding.

@brianjohnhaas
Contributor

Hi,

The various index files are very large (including the STAR index), and some of them may be platform dependent, so instead we provide the basic data files and you run the prep step as per:
https://github.com/FusionFilter/FusionFilter/wiki
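
Roughly along these lines (just a sketch; check the wiki for the exact option names in the FusionFilter version you have):

prep_genome_lib.pl \
     --genome_fa GRCh37_gencode_v19_CTAT_lib/ref_genome.fa \
     --gtf GRCh37_gencode_v19_CTAT_lib/ref_annot.gtf \
     --blast_pairs GRCh37_gencode_v19_CTAT_lib/blast_pairs.outfmt6.gz \
     --CPU 4

That way the STAR index, blast_pairs.idx, and the other derived files end up in the genome lib dir with the names the tools expect.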

best,

~b


@DarioS
Author

DarioS commented Oct 13, 2016

The first step of preparation is creating STAR indices with --sjdbGTFfile, which I already did when using STAR directly with --runMode genomeGenerate before subsequently mapping to the genome. Is there a good reason to do it again? It takes an hour and seems redundant. Perhaps prep_genome_lib.pl could have an option to skip this step. The only substantial difference is that I used --sjdbOverhang 99 because the STAR manual states:

Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads.

whereas your preparation script has a read-length option (default 100) and passes it unmodified to STAR. Perhaps --max_readlength should have 1 subtracted from it before it is passed on to STAR. In the meantime, a workaround could be to set --max_readlength to 99, as long as that doesn't have side effects in other preparation stages that I'm unaware of.
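
For reference, the index I already have was built roughly like this (paths are placeholders rather than my actual ones):

STAR --runMode genomeGenerate --genomeDir star_index \
     --genomeFastaFiles ref_genome.fa --sjdbGTFfile ref_annot.gtf \
     --sjdbOverhang 99 --runThreadN 8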

Although distributing the large STAR index would make the download too large, the BLAST-pairs index is only 42 MB. The BLAST result, which is already included in the download, is 174 MB, so including the index would only increase the download modestly.

@brianjohnhaas
Contributor

I agree that the .idx file should be easily distributable. Note that it only takes a minute or so to build that idx file, so I hope it doesn't involve too much effort on the part of the user. The prep lib also generates various files that are named in a certain way, so to be less problem-prone it's best for the user to just run the prep_lib script and ensure that everything the tools look for in the genome_lib_dir is there and named as expected.

The .idx file is generated using DB_File (Berkeley DB), and this is highly platform dependent, so one built on Linux won't necessarily work on other flavors of Unix (not as common these days anyway), or on Mac or Windows.
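
To illustrate (a minimal sketch, not the actual index_blast_pairs.pl code, and the key below is made up): the .idx is essentially a Perl hash tied to a Berkeley DB file, so its on-disk format follows whatever libdb the local Perl/DB_File was built against.

#!/usr/bin/env perl
use strict;
use warnings;
use DB_File;   # Berkeley DB bindings; the file format depends on the local libdb build
use Fcntl;     # exports O_CREAT and O_RDWR

my %idx;
# A file written here may not be readable by a DB_File linked against a
# different Berkeley DB version or on a different platform.
tie(%idx, 'DB_File', 'blast_pairs.idx', O_CREAT | O_RDWR, 0644, $DB_BTREE)
    or die "Cannot open blast_pairs.idx: $!";

$idx{'GENEA--GENEB'} = 1;   # hypothetical gene-pair key
untie %idx;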

cheers,

~b


@DarioS
Author

DarioS commented Oct 13, 2016

I see. Is it best to set --max_readlength to a value 1 smaller than the actual read length, as I thought?

@brianjohnhaas
Contributor

Probably. I'll check with Alex about this again. The default setting should be fine. I don't like the idea of users having to rebuild indexes each time they have different read lengths. Differences in output are probably negligible based on that setting (my intuition), but you wouldn't want it to be overly short relative to the read lengths you're working with.


@jblachly

Brian,

I asked Alex about this years ago, and your intuition is correct.

https://groups.google.com/forum/#!searchin/rna-star/blachly%7Csort:relevance/rna-star/h9oh10UlvhI/WqaVMukuGDMJ

Bottom line: if you only want to have one index, make the sjdbOverhang one less than your longest expected read length. There is only a minor speed penalty for this, whereas if it is too short (e.g. 49 when you have 75 or 100 nt reads) you face the prospect of false negatives (missed mappings).

So, if you are going to distribute an index or build it only once, I would recommend 149 or 150 (in this era of 2x150 RNA-seq; note that in the thread Alex said an sjdbOverhang of 100 was fine).

James Blachly
Ohio State University


@brianjohnhaas
Contributor

Thanks, James! That was the discussion that I was very vaguely remembering.

much appreciated!

~brian


@arnavaz

arnavaz commented Mar 7, 2018

Hi,
When I follow the instructions on the FusionFilter wiki page for making the index files and run the command, it takes forever to run and eventually crashes because of memory. The command I am running is:

/FusionFilter-master/prep_genome_lib.pl --genome_fa $REF --gtf $GTF --pfam_db PFAM.domtblout.dat.gz --CPU 10

It starts by generating the STAR index files, which I have already done once for getting the STAR alignments, and it stops there.
Can you tell me what is wrong with what I am doing?

Thanks,
Arna

@brianjohnhaas
Contributor

brianjohnhaas commented Mar 7, 2018 via email

@arnavaz

arnavaz commented Mar 8, 2018

Thanks for your prompt response.
Yes, I use the link from the STAR-Fusion wiki: GRCh38_v27_CTAT_lib_Feb092018
I am using the latest STAR-Fusion with STAR/2.4.2a and blast+/2.2.30.

I ran the command one more time, increasing the memory to 60 GB, but still got the following error
(I think it now needs me to install some PerlIO module).

-found STAR at /cluster/tools/software/star/2.4.2a/STAR

-found makeblastdb at /cluster/tools/software/blast+/2.2.30/bin/makeblastdb

-found blastn at /cluster/tools/software/blast+/2.2.30/bin/blastn

-- Skipping CMD: cp /cluster/tools/data/genomes/human/hg38/iGenomes/Sequence/WholeGenomeFasta/genome.fa /cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/ref_genome.fa, checkpoint [/cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/__chkpts/ref_genome.fa.ok] exists.
-- Skipping CMD: samtools faidx /cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/ref_genome.fa, checkpoint [/cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/__chkpts/ref_genome_fai.ok] exists.
-- Skipping CMD: cp /cluster/tools/data/genomes/human/hg38/iGenomes/Annotation/Genes/genes.gtf /cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/ref_annot.gtf, checkpoint [/cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/__chkpts/ref_annot.gtf.ok] exists.
-- Skipping CMD: cp /cluster/home/arna/FusionFilter-master/AnnotFilterRuleLib/AnnotFilterRule.pm /cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/., checkpoint [/cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/__chkpts/annotfiltrule_cp.ok] exists.
-- Skipping CMD: bash -c " set -eof pipefail; /cluster/home/arna/FusionFilter-master/util/gtf_to_exon_gene_records.pl /cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/ref_annot.gtf | sort -k 1,1 -k4,4g -k5,5g | uniq > /cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/ref_annot.gtf.mini.sortu " , checkpoint [/cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/__chkpts/ref_annot.gtf.mini.sortu.ok] exists.

  • Running CMD: STAR --runThreadN 10 --runMode genomeGenerate --genomeDir /cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/ref_genome.fa.star.idx --genomeFastaFiles /cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/ref_genome.fa --limitGenomeGenerateRAM 40419136213 --genomeChrBinNbits 16 --sjdbGTFfile /cluster/tools/data/genomes/human/hg38/iGenomes/Annotation/Genes/genes.gtf --sjdbOverhang 150
  • Running CMD: /cluster/home/arna/FusionFilter-master/util/gtf_to_gene_spans.pl /cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/ref_annot.gtf > /cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/ref_annot.gtf.gene_spans
    -- Skipping CMD: /cluster/home/arna/FusionFilter-master/util/gtf_file_to_feature_seqs.pl /cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/ref_annot.gtf /cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/ref_genome.fa CDSplus > ref_annot.cdsplus.fa, checkpoint [__loc_chkpts/ref_annot.cdsplus.fa.ok] exists.
    -- Skipping CMD: makeblastdb -in ref_annot.cdsplus.fa -dbtype nucl, checkpoint [__loc_chkpts/ref_annot.cdsplus.fa.blidx.ok] exists.
    -- Skipping CMD: blastn -query ref_annot.cdsplus.fa -db ref_annot.cdsplus.fa -max_target_seqs 10000 -outfmt 6 -evalue 1e-10 -num_threads 10 -dust no > ref_annot.cdsplus.allvsall.outfmt6, checkpoint [__loc_chkpts/ref_annot.cdsplus.allvsall.outfmt6.ok] exists.
    -- Skipping CMD: bash -c " set -eof pipefail; /cluster/home/arna/FusionFilter-master/util/blast_outfmt6_replace_trans_id_w_gene_symbol.pl ref_annot.cdsplus.fa ref_annot.cdsplus.allvsall.outfmt6 | gzip > ref_annot.cdsplus.allvsall.outfmt6.genesym.gz" , checkpoint [__loc_chkpts/ref_annot.cdsplus.allvsall.outfmt6.genesym.gz.ok] exists.
  • Running CMD: /cluster/home/arna/FusionFilter-master/util/index_blast_pairs.pl /cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/blast_pairs.idx ref_annot.cdsplus.allvsall.outfmt6.genesym.gz
    Can't locate PerlIO/gzip.pm in @inc (you may need to install the PerlIO::gzip module) (@inc contains: /cluster/home/arna/FusionFilter-master/util/../lib /cluster/tools/software/star/2.4.2a/lib /cluster/tools/software/perl/5.18.1/lib/site_perl/5.18.1/x86_64-linux-thread-multi /cluster/tools/software/perl/5.18.1/lib/site_perl/5.18.1 /cluster/tools/software/perl/5.18.1/lib/5.18.1/x86_64-linux-thread-multi /cluster/tools/software/perl/5.18.1/lib/5.18.1 .) at /cluster/home/arna/FusionFilter-master/util/index_blast_pairs.pl line 9.
    BEGIN failed--compilation aborted at /cluster/home/arna/FusionFilter-master/util/index_blast_pairs.pl line 9.
    Error, cmd: /cluster/home/arna/FusionFilter-master/util/index_blast_pairs.pl /cluster/projects/pughlab/projects/Neurofibroma/CTAT_resource_lib/GRCh38_v27_CTAT_lib_Feb092018/ctat_genome_lib_build_dir/blast_pairs.idx ref_annot.cdsplus.allvsall.outfmt6.genesym.gz died with ret 512 at /cluster/home/arna/FusionFilter-master/lib/Pipeliner.pm line 166.
    Pipeliner::run('Pipeliner=HASH(0xe64358)') called at /cluster/home/arna/FusionFilter-master/prep_genome_lib.pl line 321

@arnavaz

arnavaz commented Mar 8, 2018

Additionally, in the thread above you indicate that this should take a minute or so, but it takes a few hours to reach the error message above.

@brianjohnhaas
Contributor

brianjohnhaas commented Mar 8, 2018 via email

@brianjohnhaas
Contributor

brianjohnhaas commented Mar 8, 2018 via email

@arnavaz

arnavaz commented Mar 8, 2018

Thanks for your help! I will install those modules.
Best,
Arna
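
For anyone else who hits the "Can't locate PerlIO/gzip.pm" error shown above, installing the missing module from CPAN is typically all that's needed (assuming cpanm is available; the plain cpan client works too):

cpanm PerlIO::gzip
# or, without cpanm:
cpan PerlIO::gzip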
