NanoStats.txt generated by NanoPlot Not Recognized #1995

Closed
4 tasks done
joeellis1331 opened this issue Aug 21, 2023 · 17 comments · Fixed by #1997 or #2155
@joeellis1331

Description of bug

Hi,
I recently used NanoPlot 1.41.6 and tried to ingest the NanoStats.txt summary files into a MultiQC report using v1.15, but kept receiving [WARNING] No analysis results found. Cleaning up... I saw this issue post and, compared to the original NanoStats.txt file, mine is slightly different. Maybe a recent update of NanoPlot adjusted this format, unsure!

File that triggers the error

NanoStats_mine.txt
NanoStats_original.txt

MultiQC Error log

multiqc --verbose ./

  /// MultiQC 🔍 | v1.15

[2023-08-21 20:52:03] multiqc                                            [DEBUG  ]  This is MultiQC v1.15
[2023-08-21 20:52:03] multiqc                                            [DEBUG  ]  Command used: /home/ubuntu/.local/bin/multiqc --verbose ./
[2023-08-21 20:52:03] multiqc                                            [DEBUG  ]  Latest MultiQC version is v1.15
[2023-08-21 20:52:03] multiqc                                            [DEBUG  ]  Working dir : /data1/NanoPlot_out
[2023-08-21 20:52:03] multiqc                                            [DEBUG  ]  Template    : default
[2023-08-21 20:52:03] multiqc                                            [DEBUG  ]  Running Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]
[2023-08-21 20:52:03] multiqc                                            [DEBUG  ]  Analysing modules: custom_content, ccs, ngsderive, purple, conpair, lima, peddy, somalier, methylQA, mosdepth, phantompeakqualtools, qualimap, preseq, hifiasm, quast, qorts, rna_seqc, rockhopper, rsem, rseqc, busco, bustools, goleft_indexcov, gffcompare, disambiguate, supernova, deeptools, sargasso, verifybamid, mirtrace, happy, mirtop, sambamba, gopeaks, homer, hops, macs2, theta2, snpeff, gatk, htseq, bcftools, featureCounts, fgbio, dragen, dragen_fastqc, dedup, pbmarkdup, damageprofiler, biobambam2, jcvi, mtnucratio, picard, vep, sentieon, prokka, qc3C, nanostat, samblaster, samtools, sexdeterrmine, eigenstratdatabasetools, bamtools, jellyfish, vcftools, longranger, stacks, varscan2, snippy, umitools, bbmap, bismark, biscuit, diamond, hicexplorer, hicup, hicpro, salmon, kallisto, slamdunk, star, hisat2, tophat, bowtie2, bowtie1, cellranger, snpsplit, odgi, pangolin, nextclade, humid, kat, leehom, librarian, adapterRemoval, bbduk, clipandmerge, cutadapt, flexbar, kaiju, kraken, malt, motus, trimmomatic, sickle, skewer, sortmerna, biobloomtools, fastq_screen, afterqc, fastp, fastqc, filtlong, prinseqplusplus, pychopper, porechop, pycoqc, minionqc, anglerfish, multivcfanalyzer, clusterflow, checkqc, bcl2fastq, bclconvert, interop, ivar, flash, seqyclean, optitype, whatshap
[2023-08-21 20:52:03] multiqc                                            [DEBUG  ]  Using temporary directory for creating report: /tmp/tmp4dx6zuer
[2023-08-21 20:52:03] multiqc                                            [INFO   ]  Search path : /data1/NanoPlot_out
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1726/1726
[2023-08-21 20:52:21] multiqc                                            [DEBUG  ]  Summary of files that were skipped by the search: [skipped_module_specific_max_filesize: 3172] // [skipped_no_match: 1611] // [skipped_ignore_pattern: 470] // [skipped_file_contents_search_errors: 170] // [skipped_filesize_limit: 92]
[2023-08-21 20:52:22] multiqc.plots.bargraph                             [DEBUG  ]  Using matplotlib version 3.7.2
[2023-08-21 20:52:22] multiqc.plots.linegraph                            [DEBUG  ]  Using matplotlib version 3.7.2
[2023-08-21 20:52:22] multiqc                                            [DEBUG  ]  No samples found: custom_content
[2023-08-21 20:52:22] multiqc                                            [DEBUG  ]  No samples found: htseq
[2023-08-21 20:52:22] multiqc                                            [WARNING]  No analysis results found. Cleaning up..
[2023-08-21 20:52:22] multiqc                                            [INFO   ]  MultiQC complete

Before submitting

  • I have read the troubleshooting documentation.
  • I am using the latest release of MultiQC.
  • I have included a full MultiQC log, not truncated.
  • I have attached an input file (.zip if necessary) that triggers the error.
@vladsavelyev
Member

vladsavelyev commented Aug 22, 2023

Hi @joeellis1331, thanks for reporting the bug.

It looks like NanoPlot no longer uses NanoStat directly to write the text report (and NanoStat seems to be discontinued). NanoPlot now calls the nanomath library directly.

It's also mentioned that there is a Rust replacement for NanoStat, called Cramino, though it seems to output a different format, as far as I can see.

As a temporary solution, it looks like nanomath can produce the legacy report if you run NanoPlot with the --tsv_stats flag.

Longer term, I'll look into implementing support for the new format. There was a suggestion to generalise this module to support different tools from the Nano* family; as long as they use the same nanomath functions, the format should probably be similar. I'm not sure whether Cramino will require a separate module.

@vladsavelyev
Member

I think the route forward would be to create a new MultiQC module called nanopack, support the new format there, and leave nanostat unchanged and deprecated.

Happy to do that; however, I'm missing test data. Currently, the NanoStat test data has a bunch of outputs of different types, while for the new format we only have one file (kudos to you for providing it). Would you be able to generate more test data of different types, by any chance, so that we have a structure similar to the one we have for NanoStat in the MultiQC_TestData repo?

@joeellis1331 joeellis1331 closed this as not planned Aug 22, 2023
@joeellis1331
Author

joeellis1331 commented Aug 22, 2023

Hi @vladsavelyev, (apologies for messing with the closed status, my bad)

It would have probably been useful to provide the command I used to run NanoPlot!

NanoPlot -t 4 --verbose --store -o $nameplot --tsv_stats --info_in_report --fastq $fq

where $nameplot is an output directory and $fq is the fastq file input.

Can you clarify what you mean by test data of different types? I ran NanoPlot on ~90 different fastq files so if you mean more samples I can happily provide more?

Additionally, as a note about Cramino: it accepts only BAM/CRAM files and does not take FASTQ files, so it isn't quite a replacement, as it requires alignment.

@joeellis1331 joeellis1331 reopened this Aug 22, 2023
@vladsavelyev
Member

What I meant is that for NanoStat, we have example outputs for fastq, fasta, alignment, etc: https://github.com/ewels/MultiQC_TestData/tree/master/data/modules/nanostat
[Image: Screenshot 2023-08-22 at 16 32 01]

I'm not very familiar with NanoStat and NanoPack, do you know if NanoPlot only works with FastQ or takes BAMs/Fasta as well? Is it only Cramino that supports BAMs?

The reason I'm asking is that I don't want to implement a parser for just one specific file, because the tool might have other use cases that expect a different input and generate other variations of outputs that we want to support as well.

@vladsavelyev
Member

Yep, NanoPlot can take various input:
https://github.com/wdecoster/NanoPlot/
[Image: Screenshot 2023-08-22 at 16 48 06]
https://github.com/wdecoster/NanoPlot/blob/3c45efbf7d2d06b911e63ceca820a5dd92ef4234/nanoplot/NanoPlot.py#L51-L60

I'll start with your example, but it would be good to extend the example data with more.

@joeellis1331
Author

Got it. Right now I only have fastq files processed. I could potentially generate output from summary file (albacore I believe), bam file, and pickle file formats as well. If those file inputs all produce the same NanoStats.txt, would that be helpful to know?

Overall, it looks like a number of tools in NanoPack are focused on BAM/CRAM formats, with a handful focusing on raw read (i.e. FASTA/FASTQ) formats. NanoPlot seems to be the most accommodating, as it accepts fasta, fastq_rich, fastq_minimal, summary (albacore or guppy), bam, ubam, cram, pickle, or feather. The pickle/feather is from NanoPlot itself.

@vladsavelyev
Member

> I could potentially generate from summary file (albacore I believe), bam file, and pickle file formats as well.

That would be great! The more, the better. Ideally, we want to replicate all the old-format examples we have here https://github.com/ewels/MultiQC_TestData/tree/master/data/modules/nanostat in the new format, so that our tests cover all possible scenarios.

@vladsavelyev
Member

Alright, it turned out to be a pretty straightforward change, as the new format has a one-to-one match with the old format. See the pull request #1997

I'm keeping the module named NanoStat for backwards compatibility, even though the NanoStat tool itself is deprecated. I guess it's still pretty clear that the module is for the NanoPack family of tools.
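
To illustrate what a one-to-one match between the two layouts could mean in practice, here is a minimal sketch, not the code from #1997. It assumes the legacy report uses "Label: value" lines and the newer report uses tab-separated "key<TAB>value" lines under a "Metrics dataset" header; the function name `parse_nanostats` and the sample lines are hypothetical.

```python
def parse_nanostats(text):
    """Parse numeric key/value rows from either layout into a dict.

    Illustrative only: this is not MultiQC's parser.
    """
    stats = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Assumed legacy layout: "Mean read length:   1,234.5"
        # Assumed newer layout:  "mean_read_length<TAB>1234.5"
        sep = ":" if ":" in line else "\t"
        key, _, value = line.partition(sep)
        try:
            number = float(value.strip().replace(",", ""))
        except ValueError:
            continue  # header lines and non-numeric rows are skipped
        # Normalise "Mean read length" and "mean_read_length" to one key.
        stats[key.strip().lower().replace(" ", "_")] = number
    return stats


# Made-up sample content, for illustration only.
legacy = "General summary:\nMean read length:     1,234.5\nNumber of reads:   100.0\n"
newer = "Metrics\tdataset\nnumber_of_reads\t100\nmean_read_length\t1234.5\n"

print(parse_nanostats(legacy) == parse_nanostats(newer))
```

With keys normalised this way, both layouts end up in the same dictionary shape, which is one way a single module could serve both formats.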

@vladsavelyev vladsavelyev self-assigned this Aug 22, 2023
@joeellis1331
Author

joeellis1331 commented Aug 22, 2023

> Alright, it turned out to be a pretty straightforward change, as the new format has a one-to-one match with the old format. See the pull request #1997
>
> I'm keeping the module named NanoStat for backwards compatibility, even though the NanoStat tool itself is deprecated. I guess it's still pretty clear that the module is for the NanoPack family of tools.

@vladsavelyev Does this mean that it should work now for NanoPlot?

Additionally, it would take me a few days but I could still generate the output for summary, bam, and fastq with and without the --tsv_stats flag as well if that's still desired?

@vladsavelyev
Member

@joeellis1331, yep, it should work for NanoPlot now!

@joeellis1331
Author

Hi @vladsavelyev, apologies for the simplicity of this question but what is the best way to ensure these changes take effect on my install? I tried to use pip install --upgrade multiqc and that did not seem to result in any changes!

@ewels
Member

ewels commented Aug 28, 2023

@joeellis1331 the pull-request is not yet merged, you can see it here: #1997

Even once merged you'd still need to install the development version, until those changes go out in a stable release.

See docs for installing the development version (boils down to this command):

pip install git+https://github.com/ewels/MultiQC.git

And to try out the as-yet-unmerged pull request, I'd recommend first cloning the MultiQC repository and then using the GitHub CLI to check out the pull request.

eg:

git clone https://github.com/ewels/MultiQC.git
cd MultiQC
gh pr checkout 1997
pip install .
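
After installing, it can help to confirm which version Python actually picks up. The following is a small sketch using only the standard library; the helper name `installed_version` is hypothetical, and what a development install reports as its version string is not something this thread states, so compare it against the stable v1.15 from the log yourself.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints None if multiqc is not installed in this environment.
print(installed_version("multiqc"))
```

If this still prints the stable release number, the upgrade most likely went into a different Python environment than the one running multiqc.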

ewels added a commit that referenced this issue Aug 28, 2023
* NanoStat: support new output format. Fixes #1995

* Docs: more mention of alternative tool names.

Means if anyone searches on the modules listing page for these, it'll show up.

* Split legacy parsing into its own function

---------

Co-authored-by: Phil Ewels <phil.ewels@seqera.io>
@joeellis1331
Author

Hi, after installing the development version I was still unable to get MultiQC to generate a report from my NanoPlot output (i.e. it doesn't recognize the NanoStats.txt files). Is there any additional information I can provide?

It is worth noting that I run MultiQC within a parent directory where subdirectories contain the output from NanoPlot.

@knacko

knacko commented Oct 26, 2023

I had this problem as well. You need to add the search terms in the config file to find the NanoPlot output stats:

Create ./multiqc_config.yaml (or equivalent):

sp:
  nanostat:
    fn: "NanoStats.txt"

Since the file name also has no sample-specific prefix, you have to add the --dirs flag, or only a single file will be shown in MultiQC.

Then run multiqc with:
multiqc . --dirs --config ./multiqc_config.yaml
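
Why identical file names collapse into one sample can be sketched roughly as follows. This is an illustration of the idea only, not MultiQC's actual sample-naming logic; the function `sample_names`, the " | " separator, and the directory names are all assumptions made for the example.

```python
import tempfile
from pathlib import Path


def sample_names(root, filename="NanoStats.txt", use_dirs=False):
    """Return an illustrative sample name for every matching file under root."""
    names = []
    for path in sorted(Path(root).rglob(filename)):
        name = path.stem  # "NanoStats" for every file
        if use_dirs:
            # Assumed separator, used here only to illustrate the idea.
            name = f"{path.parent.name} | {name}"
        names.append(name)
    return names


# Build a throwaway layout: one NanoPlot output directory per sample.
with tempfile.TemporaryDirectory() as root:
    for sample in ("sampleA", "sampleB"):
        out_dir = Path(root) / sample
        out_dir.mkdir()
        (out_dir / "NanoStats.txt").write_text("Metrics\tdataset\n")

    print(sample_names(root))                 # identical names collide
    print(sample_names(root, use_dirs=True))  # directory keeps them apart
```

When every report is called NanoStats.txt, only the parent directory distinguishes the samples, which is why a directory-based prefix is needed.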

@vladsavelyev
Member

@joeellis1331 and @knacko, would you be able to share the NanoStats.txt/NanoStat.txt files that you are using? That would be really valuable for us to see how they are different from the available test data, so we can fix MultiQC to recognise them correctly.

Note that at the moment, MultiQC identifies nanostat files by checking that they contain either a "General summary:" or a "Metrics dataset" line: https://github.com/ewels/MultiQC/blob/master/multiqc/utils/search_patterns.yaml#L444-L451
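
A quick way to check a file against those two strings by hand is sketched below. It is a plain literal substring test written for this thread, not the matching code MultiQC runs; the names `MARKERS` and `find_marker` are hypothetical.

```python
MARKERS = ("General summary:", "Metrics dataset")


def find_marker(text, markers=MARKERS):
    """Return the first marker found verbatim in text, or None."""
    for marker in markers:
        if marker in text:
            return marker
    return None


print(find_marker("General summary:\nMean read length: 1,234.5\n"))
print(find_marker("Metrics dataset\nnumber_of_reads 100\n"))
print(find_marker("some unrelated file\n"))
```

To test a real report, pass in `Path("NanoStats.txt").read_text()` (with `from pathlib import Path`). A result of None means neither string appears exactly as written, for example because the two words are separated by something other than a single space.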

@joeellis1331
Author

joeellis1331 commented Oct 27, 2023

@vladsavelyev Here is an example file. It does follow the "Metrics dataset" header naming convention. The file itself is named "NanoStats.txt"; there is no prefix added to the file name.

NanoStats.txt

@vladsavelyev
Member

vladsavelyev commented Oct 30, 2023

Thank you so much @joeellis1331! The search patterns didn't account for tab characters as a column separator, that's why it was missing the file. I added a fix for that: #2155
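
The difference a tab separator makes can be shown with two regular expressions. Both patterns are hypothetical and written only for this illustration; they are not the patterns changed in #2155.

```python
import re

# Hypothetical patterns, for illustration only.
STRICT = re.compile(r"Metrics dataset")     # expects exactly one space
TOLERANT = re.compile(r"Metrics\s+dataset")  # accepts spaces or tabs

space_header = "Metrics dataset"
tab_header = "Metrics\tdataset"

for header in (space_header, tab_header):
    print(repr(header), bool(STRICT.search(header)), bool(TOLERANT.search(header)))
```

The strict pattern misses the tab-separated header, while the whitespace-tolerant one matches both variants.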
