Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error Producing LAVEnder Report #72

Open
RSherman15 opened this issue Oct 18, 2018 · 6 comments
Open

Error Producing LAVEnder Report #72

RSherman15 opened this issue Oct 18, 2018 · 6 comments

Comments

@RSherman15
Copy link

I seem to be getting an error towards the end of the LAVEnder evaluation (just producing a simple report, via a Snakefile exactly as specified in the tutorial/examples). It does produce the individual html reports for the two different aligners I used, but does not produce any sort of combined report, which I think is supposed to be output as well? Additionally, it took about a week to run, which seems to have been pretty much entirely in the step going from es to et files -- I assume this is a bug of some sort, as it seems like evaluation, once the simulation and alignments have already been run, should be relatively quick.

It gives the following error and does not produce any "combined" files:

13 of 16 steps (81%) done

rule 5:

    input: report/0/gp/_combined.gp, report/0/roc/GRCh38_chr4_bowtie.roc, report/0/roc/GRCh38_chr4_bwa.roc

    output: report/0/graphics/_combined_0.svg, report/0/graphics/_combined_1.svg, report/0/graphics/_combined_2.svg, report/0/graphics/_combined_3.svg, report/0/graphics/_combined_4.svg, report/0/graphics/_combined_0.pdf, report/0/graphics/_combined_1.pdf, report/0/graphics/_combined_2.pdf, report/0/graphics/_combined_3.pdf, report/0/graphics/_combined_4.pdf, report/0/tar/report.dir 0.tar

    jobid: 5


plot "report/0/roc/GRCh38_chr4_bowtie.roc" using (($3+$4)/($2+$3+$4)):((($2+$3+$4+$5+$10)/$11)*100) with lp ls 1 ps 0.8 title "  GRCh38_chr4_bowtie" noenhanced,"report/0/roc/GRCh38_chr4_bwa.roc" using (($3+$4)/($2+$3+$4)):((($2+$3+$4+$5+$10)/$11)*100) with lp ls 2 ps 0.8 title "  GRCh38_chr4_bwa" noenhanced,
                                                                                                                                                 ^

"report/0/gp/_combined.gp", line 41: ';' expected


Error in job 5 while creating output files report/0/graphics/_combined_0.svg, report/0/graphics/_combined_1.svg, report/0/graphics/_combined_2.svg, report/0/graphics/_combined_3.svg, report/0/graphics/_combined_4.svg, report/0/graphics/_combined_0.pdf, report/0/graphics/_combined_1.pdf, report/0/graphics/_combined_2.pdf, report/0/graphics/_combined_3.pdf, report/0/graphics/_combined_4.pdf, report/0/tar/report.dir 0.tar.

RuleException:

CalledProcessError in line 51 of /home/rsherman/bin/miniconda3/envs/rnftools_py34/lib/python3.4/site-packages/rnftools/lavender/lavender.snake:

Command '"gnuplot" "report/0/gp/_combined.gp"' returned non-zero exit status 1

  File "/home/rsherman/bin/miniconda3/envs/rnftools_py34/lib/python3.4/site-packages/rnftools/lavender/lavender.snake", line 51, in __rule_8

  File "/home/rsherman/bin/miniconda3/envs/rnftools_py34/lib/python3.4/site-packages/rnftools/lavender/Panel.py", line 324, in create_graphics

  File "/home/rsherman/bin/miniconda3/envs/rnftools_py34/lib/python3.4/site-packages/rnftools/utils/__init__.py", line 45, in shell

  File "/home/rsherman/bin/miniconda3/envs/rnftools_py34/lib/python3.4/concurrent/futures/thread.py", line 54, in run

Exiting because a job execution failed. Look above for error message

Will exit after finishing currently running jobs.

Exiting because a job execution failed. Look above for error message

For reference, here is the Snakefile run, though as mentioned it is basically exactly as specified in the tutorial:

import rnftools

rnftools.lavender.Report(
        # insert a list of directories with BAM files here:
        bam_dirs=["bamAlignments"],
        name="report",
        keep_intermediate_files=True,
        allowed_delta=10,
        default_x_run=[0.01,0.5]
)

rule all: input: rnftools.input()

include: rnftools.include()

Any help would be appreciated in figuring out this error, as well as how to generate the missing report file(s) without having to re-run the full evaluation since it took over a week -- I did have it save intermediate files, so I have the es and et files which were generated.

@karel-brinda
Copy link
Owner

Hi Rachel,

Thank you for reporting the issue. Could you please send me the GP file causing this issue?

Yeah, the RNFtools was developed almost five years ago when R wasn't very common and BioConda didn't exist yet. The original version contained even own package manager which compiled all used packages from sources. This is why RNFtools uses GnuPlot. Unfortunately, GnuPlot is now a bit obsolete and seems to be slightly incompatible across versions; it sometimes fails very strangely.

I am thinking about creating RNFtools 2, which would be more up-to-date with modern technologies and would be better suited for clusters and metagenomic simulations. I plan to incorporate all your comments.

However, I was thinking about not including LAVEnder in the new version. This part of the framework was a little bit overkill – it was developed before modern long-read technologies emerged. At that time, it seemed that the future was "pair-end" -> "many fragments", therefore, the LAVEnder aimed at something what didn't come into practice much.

@karel-brinda
Copy link
Owner

karel-brinda commented Oct 18, 2018

Any help would be appreciated in figuring out this error, as well as how to generate the missing report file(s) without having to re-run the full evaluation since it took over a week -- I did have it save intermediate files, so I have the es and et files which were generated.

It's great that you archived all these files. In this case, it should be possible to run

rnftools et2roc

and then to plot the graphs from the obtained ROC files using GP files from RNFtools examples.

Sorry for this inconvenience.

@RSherman15
Copy link
Author

Thanks for the quick responses even though RNFtools isn't under super active development - I really appreciate it.

Perhaps it would help to clarify what my goals are, and perhaps you can help point me to ways of using RNFtools that might be more efficient, if that's possible with it's current implementation.

I'm primarily interested in the "m" number in the evaluation reports - the number of reads which should be unmapped, but are mapped. I'm basically simulating from two references, and aligning back to only one, and then want the stats on what was correctly/incorrectly aligned from both the reference that should and shouldn't align. RNFtools seems to be a lovely wrapper to do this. From the documentation I understood LAVEnder to be the evaluator once reads have been simulated, named, and aligned. If you're saying LAVEnder is "overkill", is there a way to get out the alignment stats (number of reads mapped, unmapped, etc) without the week-long step of going from the es to et files? It seems to me that even just reading through the alignment files in python to get these stats should be considerably faster than LAVENder is, and I'm not sure what's going on under the hood as I haven't looked into the code base here.

Perhaps I should just be writing something fairly simple myself to get out these stats from the bam alignments, but using RNFtools as a wrapper for simulation + evaluation, and having it produce such a nice report of the stats seems so convenient that I'd love to use it if possible!

@karel-brinda
Copy link
Owner

Thanks for the quick responses even though RNFtools isn't under super active development - I really appreciate it.

Thanks for mentioning this.

Even though RNFtools is not until active development now, I would love to modernize it. However I would need to find a grad student who would be willing to do it as a side project. The motivation could be that and he/she would get a first-author paper (probably in Bioinformatics). I know which parts of the method/tool should be improved, but I don't have time for that. If you by any chance knew about anyone, please let me know.

From the documentation I understood LAVEnder to be the evaluator once reads have been simulated, named, and aligned.

That's correct.

If you're saying LAVEnder is "overkill", is there a way to get out the alignment stats (number of reads mapped, unmapped, etc) without the week-long step of going from the es to et files?

The overkill was designing LAVEnder in such a way that it supports an unlimited number of segments. Then there are many possible combinations of correctly/incorrectly mapped/unmapped segments and the program has to consider all of them, and at the same time evaluate the criteria for all possible mapping qualities.

It seems to me that even just reading through the alignment files in python to get these stats should be considerably faster than LAVENder is, and I'm not sure what's going on under the hood as I haven't looked into the code base here.

The critical part of the code is here:

for line in es_fo:

and especially the following function:

def _vector_of_categories(srs, read_tuple_name, parts):

It definitely can be optimized. At the time of developing, I prioritized correctness and generality over performance. And then I switched to other projects and never got to optimize it.

I'm primarily interested in the "m" number in the evaluation reports - the number of reads which should be unmapped, but are mapped. I'm basically simulating from two references, and aligning back to only one, and then want the stats on what was correctly/incorrectly aligned from both the reference that should and shouldn't align. RNFtools seems to be a lovely wrapper to do this.

Perhaps I should just be writing something fairly simple myself to get out these stats from the bam alignments, but using RNFtools as a wrapper for simulation + evaluation, and having it produce such a nice report of the stats seems so convenient that I'd love to use it if possible!

What sequencing technology do you simulate? If it's Illumina, you could write a simple Python script taking ES files and producing ROC files directly (without the intermediate step).

@RSherman15
Copy link
Author

I am indeed using Illumina (paired end) simulated data, so writing a script to just produce the ROC files directly is probably the way to go - I'll go take a look at the pieces of the code you pointed to. Thanks.

@karel-brinda
Copy link
Owner

In that case I would propose the following:

  1. Using rnftools sam2es (this will check the positions of reads and compare them with the original coordinates)
$ rnftools sam2es
usage: rnftools sam2es [-h] -i file -o file [-d int]

todo

optional arguments:
  -h, --help            show this help message and exit
  -i file, --sam file   SAM/BAM with aligned RNF reads(- for standard input).
  -o file, --es file    Output ES file (evaluated segments, - for standard
                        output).
  -d int, --allowed-delta int
                        Tolerance of difference of coordinates between true
                        (i.e., expected) alignment and real alignment (very
                        important parameter!) (default: 5).

The ES format should be easy to parse, see the following description:

es_fo.write("# RN: read name" + os.linesep)

  1. Writing your own script for ES=>ROC

    With Illumina pair-end reads, you have always 2 segments. There are relatively few combinations that can appear in your case and you are probably interested only in some of them.

  2. Using RNFtools gp scripts for generating the plots

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants