MultiQC 1.17 fails to recognize fastp files #2138

Closed
4 tasks done
tamuanand opened this issue Oct 19, 2023 · 14 comments · Fixed by #2139
Labels
bug: module Bug in a MultiQC module

Comments

@tamuanand

Description of bug

Hi @vladsavelyev and @ewels

MultiQC 1.17 fails to detect fastp files; the same files work with MultiQC 1.16.

I will attach the verbose output from MultiQC 1.16 as a comment so that you can see the difference(s).

File that triggers the error

HG001.fastp.json
HG002_son.fastp.json
HG003_father.fastp.json
HG004_mother.fastp.json

MultiQC Error log

/ MultiQC 🔍 | v1.17

[2023-10-19 02:47:16] multiqc                                            [DEBUG  ]  This is MultiQC v1.17
[2023-10-19 02:47:16] multiqc                                            [DEBUG  ]  Command used: /usr/local/bin/multiqc /path_to/temp_dir/2023_Oct_19 --force --verbose --interactive
[2023-10-19 02:47:16] multiqc                                            [DEBUG  ]  Could not connect to multiqc.info for version check: module 'packaging.version' has no attribute 'StrictVersion'
[2023-10-19 02:47:16] multiqc                                            [DEBUG  ]  Working dir : /path_to/temp_dir/2023_Oct_19
[2023-10-19 02:47:16] multiqc                                            [DEBUG  ]  Template    : default
[2023-10-19 02:47:16] multiqc                                            [DEBUG  ]  Running Python 3.12.0 | packaged by conda-forge | (main, Oct  3 2023, 08:43:22) [GCC 12.3.0]
[2023-10-19 02:47:16] multiqc                                            [DEBUG  ]  Analysing modules: custom_content, ccs, ngsderive, purple, conpair, lima, peddy, somalier, methylQA, mosdepth, phantompeakqualtools, qualimap, preseq, hifiasm, quast, qorts, rna_seqc, rockhopper, rsem, rseqc, busco, bustools, goleft_indexcov, gffcompare, disambiguate, supernova, deeptools, sargasso, verifybamid, mirtrace, happy, mirtop, sambamba, gopeaks, homer, hops, macs2, theta2, snpeff, gatk, htseq, bcftools, featureCounts, fgbio, dragen, dragen_fastqc, dedup, pbmarkdup, damageprofiler, mapdamage, biobambam2, jcvi, mtnucratio, picard, vep, sentieon, bakta, prokka, qc3C, nanostat, samblaster, samtools, sexdeterrmine, eigenstratdatabasetools, bamtools, jellyfish, vcftools, longranger, stacks, varscan2, snippy, umitools, truvari, bbmap, bismark, biscuit, diamond, hicexplorer, hicup, hicpro, salmon, kallisto, slamdunk, star, hisat2, tophat, bowtie2, bowtie1, cellranger, snpsplit, odgi, pangolin, nextclade, freyja, humid, kat, leehom, librarian, adapterRemoval, bbduk, clipandmerge, cutadapt, flexbar, sourmash, kaiju, bracken, kraken, malt, motus, trimmomatic, sickle, skewer, sortmerna, biobloomtools, fastq_screen, afterqc, fastp, fastqc, filtlong, prinseqplusplus, pychopper, porechop, pycoqc, minionqc, anglerfish, multivcfanalyzer, clusterflow, checkqc, bcl2fastq, bclconvert, interop, ivar, flash, seqyclean, optitype, whatshap
[2023-10-19 02:47:16] multiqc                                            [DEBUG  ]  Using temporary directory for creating report: /tmp/tmpdtsgt7xq
[2023-10-19 02:47:16] multiqc                                            [INFO   ]  Search path : /path_to/temp_dir/2023_Oct_19
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 4/4
[2023-10-19 02:47:16] multiqc                                            [DEBUG  ]  Summary of files that were skipped by the search: []
[2023-10-19 02:47:17] multiqc.plots.bargraph                             [DEBUG  ]  Using matplotlib version 3.8.0
[2023-10-19 02:47:17] multiqc.plots.linegraph                            [DEBUG  ]  Using matplotlib version 3.8.0
[2023-10-19 02:47:17] multiqc                                            [DEBUG  ]  No samples found: custom_content
[2023-10-19 02:47:17] multiqc.modules.fastp.fastp                        [WARNING]  Could not parse sample name from fastp command: HG004_mother.fastp.json
[2023-10-19 02:47:17] multiqc.modules.fastp.fastp                        [WARNING]  Could not parse sample name from fastp command: HG003_father.fastp.json
[2023-10-19 02:47:17] multiqc.modules.fastp.fastp                        [WARNING]  Could not parse sample name from fastp command: HG001.fastp.json
[2023-10-19 02:47:17] multiqc.modules.fastp.fastp                        [WARNING]  Could not parse sample name from fastp command: HG002_son.fastp.json
[2023-10-19 02:47:17] multiqc                                            [DEBUG  ]  No samples found: fastp
[2023-10-19 02:47:17] multiqc.utils.software_versions                    [DEBUG  ]  Reading software versions from config.software_versions
[2023-10-19 02:47:17] multiqc                                            [WARNING]  No analysis results found. Cleaning up..
[2023-10-19 02:47:17] multiqc                                            [INFO   ]  MultiQC complete

Before submitting

  • I have read the troubleshooting documentation.
  • I am using the latest release of MultiQC.
  • I have included a full MultiQC log, not truncated.
  • I have attached an input file (.zip if necessary) that triggers the error.
@tamuanand
Author

Hi @vladsavelyev

Given below is the verbose log with multiqc 1.16

  // MultiQC 🔍 | v1.16

[2023-10-19 02:50:19] multiqc                                            [DEBUG  ]  This is MultiQC v1.16
[2023-10-19 02:50:19] multiqc                                            [DEBUG  ]  Command used: /usr/local/bin/multiqc /path_to/2023_Oct_19 --force --verbose --interactive
[2023-10-19 02:50:20] multiqc                                            [DEBUG  ]  Latest MultiQC version is v1.16
[2023-10-19 02:50:20] multiqc                                            [DEBUG  ]  Working dir : /path_to/2023_Oct_19
[2023-10-19 02:50:20] multiqc                                            [DEBUG  ]  Template    : default
[2023-10-19 02:50:20] multiqc                                            [DEBUG  ]  Running Python 3.11.5 | packaged by conda-forge | (main, Aug 27 2023, 03:34:09) [GCC 12.3.0]
[2023-10-19 02:50:20] multiqc                                            [DEBUG  ]  Analysing modules: custom_content, ccs, ngsderive, purple, conpair, lima, peddy, somalier, methylQA, mosdepth, phantompeakqualtools, qualimap, preseq, hifiasm, quast, qorts, rna_seqc, rockhopper, rsem, rseqc, busco, bustools, goleft_indexcov, gffcompare, disambiguate, supernova, deeptools, sargasso, verifybamid, mirtrace, happy, mirtop, sambamba, gopeaks, homer, hops, macs2, theta2, snpeff, gatk, htseq, bcftools, featureCounts, fgbio, dragen, dragen_fastqc, dedup, pbmarkdup, damageprofiler, mapdamage, biobambam2, jcvi, mtnucratio, picard, vep, sentieon, bakta, prokka, qc3C, nanostat, samblaster, samtools, sexdeterrmine, eigenstratdatabasetools, bamtools, jellyfish, vcftools, longranger, stacks, varscan2, snippy, umitools, bbmap, bismark, biscuit, diamond, hicexplorer, hicup, hicpro, salmon, kallisto, slamdunk, star, hisat2, tophat, bowtie2, bowtie1, cellranger, snpsplit, odgi, pangolin, nextclade, freyja, humid, kat, leehom, librarian, adapterRemoval, bbduk, clipandmerge, cutadapt, flexbar, sourmash, kaiju, kraken, malt, motus, trimmomatic, sickle, skewer, sortmerna, biobloomtools, fastq_screen, afterqc, fastp, fastqc, filtlong, prinseqplusplus, pychopper, porechop, pycoqc, minionqc, anglerfish, multivcfanalyzer, clusterflow, checkqc, bcl2fastq, bclconvert, interop, ivar, flash, seqyclean, optitype, whatshap
[2023-10-19 02:50:20] multiqc                                            [DEBUG  ]  Using temporary directory for creating report: /tmp/tmpsih7vs4f
[2023-10-19 02:50:20] multiqc                                            [INFO   ]  Search path : /path_to/2023_Oct_19
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 4/4
[2023-10-19 02:50:20] multiqc                                            [DEBUG  ]  Summary of files that were skipped by the search: []
[2023-10-19 02:50:21] multiqc.plots.bargraph                             [DEBUG  ]  Using matplotlib version 3.8.0
[2023-10-19 02:50:21] multiqc.plots.linegraph                            [DEBUG  ]  Using matplotlib version 3.8.0
[2023-10-19 02:50:21] multiqc                                            [DEBUG  ]  No samples found: custom_content
[2023-10-19 02:50:21] multiqc.modules.fastp.fastp                        [DEBUG  ]  No duplication rate plot data: HG004_mother.fastp.json
[2023-10-19 02:50:21] multiqc.modules.fastp.fastp                        [DEBUG  ]  No duplication rate plot data: HG003_father.fastp.json
[2023-10-19 02:50:21] multiqc.modules.fastp.fastp                        [DEBUG  ]  No duplication rate plot data: HG001.fastp.json
[2023-10-19 02:50:21] multiqc.modules.fastp.fastp                        [DEBUG  ]  No duplication rate plot data: HG002_son.fastp.json
[2023-10-19 02:50:21] multiqc.modules.fastp.fastp                        [INFO   ]  Found 4 reports
[2023-10-19 02:50:21] multiqc.utils.software_versions                    [DEBUG  ]  Reading software versions from config.software_versions
[2023-10-19 02:50:21] multiqc                                            [DEBUG  ]  Compressing plot data
[2023-10-19 02:50:21] multiqc                                            [INFO   ]  Report      : multiqc_report.html
[2023-10-19 02:50:21] multiqc                                            [INFO   ]  Data        : multiqc_data
[2023-10-19 02:50:21] multiqc                                            [DEBUG  ]  Moving data file from '/tmp/tmpsih7vs4f/multiqc_data' to '/path_to/2023_Oct_19/multiqc_data'
[2023-10-19 02:50:21] multiqc                                            [DEBUG  ]  Full report path: /path_to/2023_Oct_19/multiqc_report.html
[2023-10-19 02:50:21] multiqc                                            [INFO   ]  MultiQC complete

@vladsavelyev
Member

Thank you @tamuanand for reporting the bug. It indeed was introduced while attempting to fix another issue.

Here is the fix: #2139

It now correctly parses the "command" field in the JSON when it uses --in1/--in2 options for the input, and also falls back to using the file name if it failed to parse for any reason.
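The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code from #2139: the function names `input_from_command` and `sample_name` and the list of extensions are assumptions made for the example. It only shows the idea of reading `-i`/`--in1` from the "command" field and falling back to the report file name when nothing can be parsed.

```python
import os
import shlex


def input_from_command(command):
    """Return the value of -i/--in1 from a fastp command line, or None."""
    tokens = shlex.split(command)
    for i, tok in enumerate(tokens[:-1]):
        if tok in ("-i", "--in1"):
            return tokens[i + 1]
    return None


def sample_name(json_path, report):
    """Use the input named in the command; otherwise fall back to the file name."""
    fastq = input_from_command(report.get("command", ""))
    name = os.path.basename(fastq or json_path)
    # Hypothetical extension list, chosen only for this illustration.
    for ext in (".fastq.gz", ".fq.gz", ".fastq", ".fq", ".fastp.json", ".json"):
        if name.endswith(ext):
            return name[: -len(ext)]
    return name


report = {"command": "fastp --in1 HG001_1.fastq.gz --in2 HG001_2.fastq.gz --json HG001.fastp.json"}
print(sample_name("HG001.fastp.json", report))  # HG001_1
print(sample_name("HG001.fastp.json", {}))      # HG001
```

Under these assumptions, a command that names its inputs yields a name derived from the first FastQ file, and a report without a usable command yields a name derived from the JSON file.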

I also added your test example into the test data repo: MultiQC/test-data#295

@vladsavelyev vladsavelyev added the bug: module Bug in a MultiQC module label Oct 19, 2023
@tamuanand
Author

tamuanand commented Oct 19, 2023

Hi @vladsavelyev

I am not sure I am happy with this fix..

With previous versions of multiqc, with this command below, I would get the General Stats table to show up as HG001 for fastp related stats.

fastp --in1 HG001_1.fastq.gz --in2 HG001_2.fastq.gz --out1 HG001_1.trim.fastq.gz --out2 HG001_2.trim.fastq.gz --json HG001.fastp.json --html HG001.fastp.html

With the current fix, it shows up in General Stats table as HG001_1.

  • therefore, this incorrectly conveys the impression that _1 alone had so and so statistics.

Is there a way to have it reported as before, please?

Thanks in advance.

@vladsavelyev
Member

I'm not sure though there is a good way to find that the sample name is HG001 in this case.

The problem here is that the default file name for the JSON file that we need is just fastp.json (https://github.com/OpenGene/fastp#simple-usage), so if we trim the .json/.fastp.json suffix and take the remaining prefix as a sample name, we will end up with just fastp as a sample name in most real life cases.
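A small sketch of the collision described above, assuming a naive "strip the suffix" rule. The helper names `name_from_report_file` and `group_by_name` and the example paths are hypothetical and serve only to illustrate the point.

```python
import os


def name_from_report_file(path):
    """Strip a .fastp.json or .json suffix from the report file name."""
    name = os.path.basename(path)
    for suffix in (".fastp.json", ".json"):
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)]
    return name


def group_by_name(paths):
    """Group report paths by the sample name derived from the file name."""
    groups = {}
    for p in paths:
        groups.setdefault(name_from_report_file(p), []).append(p)
    return groups


# Two runs that kept fastp's default output name, plus one named per sample.
reports = ["sampleA/fastp.json", "sampleB/fastp.json", "HG001.fastp.json"]
for name, paths in group_by_name(reports).items():
    print(name, paths)
```

With these inputs, both default-named reports end up under the single name `fastp`, while the per-sample file keeps `HG001`.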

I can't think of an obvious generalizable solution here that doesn't involve configuration options, e.g. sample_names_replace or the module-specific extra_fn_clean_trim: ["_1", "_2"] (see https://multiqc.info/docs/getting_started/config/#trimming-extensions and https://multiqc.info/docs/getting_started/config/#module).

Can you think of a generalizable solution?

@tamuanand
Author

How about this

  • if sample_name.fastp.json is available, then use it to parse and report in general stat
  • if not, then fall back on parsing from --in1
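The two-step rule proposed above could look something like this. It is only a sketch of the proposal, not the merged implementation; `choose_sample_name` and the regular expressions are assumptions made for the example.

```python
import os
import re


def choose_sample_name(json_filename, command):
    """Prefer <sample>.fastp.json; otherwise derive the name from -i/--in1."""
    base = os.path.basename(json_filename)
    # Rule 1: a report named <sample>.fastp.json carries the sample name itself.
    if base.endswith(".fastp.json") and base != ".fastp.json":
        return base[: -len(".fastp.json")]
    # Rule 2: otherwise look for the first-read input in the command.
    match = re.search(r"(?:^|\s)(?:-i|--in1)\s+(\S+)", command)
    if match:
        fastq = os.path.basename(match.group(1))
        return re.sub(r"\.(fastq|fq)(\.gz)?$", "", fastq)
    return None


cmd = "fastp --in1 HG001_1.fastq.gz --in2 HG001_2.fastq.gz"
print(choose_sample_name("HG001.fastp.json", cmd))  # HG001
print(choose_sample_name("fastp.json", cmd))        # HG001_1
```

The trade-off is that the result depends on how the JSON file happens to be named, so the same command can produce two different sample names.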

@vladsavelyev
Member

That should work, though I'm a bit hesitant because a module normally checks one place for the sample name, which makes the behaviour transparent. Here, the behaviour would be different when the file name is exactly fastp.json. But perhaps it's the best we can do.

@tamuanand
Author

tamuanand commented Oct 19, 2023

I agree - but my question is "why fix/change something that has been working for years now" - especially with something like fastp

I have this in my multiqc_config.yaml - but it does not work

fn_clean_trim:
  - ".FQ2BAM_NO_POSTALT.duplicates_metrics"
  - ".pb.no_postalt"
  - ".pb.postalt"

extra_fn_clean_exts:
  - type: regex
    pattern: "_1$"

Does fn_clean_trim take regex? Something like

 - "_1$"

@vladsavelyev
Member

I agree - but my question is "why fix/change something that has been working for years now" - especially with something like fastp

I know how you feel: it's annoying when something suddenly stops working for no good reason. But it wasn't always working (see #2031), and where it did work, it often worked by accident. For example, it wouldn't have behaved correctly when the command used --in1 and the output was named fastp.json, because it wasn't aware of the --in1 flag either. It would have fallen back to the file name as the sample name, which is fastp.

Anyway, we should have tested all possible scenarios better. Hopefully this is a good opportunity to do so now; thanks for reopening the discussion.

Regarding the config options, you can just use the following in multiqc_config.yaml:

extra_fn_clean_trim: "_1"

It will remove _1 exactly from the end of the sample name.

Note that if you want to use fn_clean_trim, it's better to use extra_fn_clean_trim instead, because otherwise your custom config will override everything in config_defaults.yaml.
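To make the effect of such a trim string concrete, here is a rough sketch. It assumes that trim strings are matched literally at the start or end of the sample name and are not treated as regular expressions; `clean_trim` is a hypothetical helper written for this example, not MultiQC's own function.

```python
def clean_trim(name, trims):
    """Remove each trim string when it appears at the end or start of name."""
    for t in trims:
        if name.endswith(t):
            name = name[: -len(t)]
        if name.startswith(t):
            name = name[len(t):]
    return name


print(clean_trim("HG001_1", ["_1"]))      # HG001
print(clean_trim("HG001_1_run", ["_1"]))  # HG001_1_run (no match at either end)
print(clean_trim("HG001_1", ["_1$"]))     # HG001_1 (a literal "_1$" never matches)
```

Under that assumption, a plain "_1" is enough, while a regex-style "_1$" would be compared character by character and leave the name unchanged. Any sample whose real name ends in _1 would be trimmed as well.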

@tamuanand
Author

Hi @vladsavelyev

Just checking: will there be more fixes based on what we have discussed above?

Thanks.

@vladsavelyev
Member

vladsavelyev commented Oct 19, 2023

@tamuanand, I pushed another change based on our discussion, as you proposed, see the pull-request: #2139

@tamuanand
Author

tamuanand commented Oct 19, 2023

Hi @vladsavelyev

I can confirm that this latest patch/fix works as before without me having to do any config changes like this

extra_fn_clean_trim: "_1"

Will there be a 1.17.1 release or something like that?

Thanks

@ewels
Member

ewels commented Oct 26, 2023

This kind of discussion is tricky as everyone uses bioinformatics tools differently - other users might call the fastp results before_trimming.fastp.json for example (or whatever), so taking the JSON filename as the top priority means they only get a single sample. There's no right or wrong answer here as there's no standard in how the tools are used.

My preference is to use the FastQ in the command line as the top priority - this is the same behaviour as we have in most other modules, such as Trimmomatic, Cutadapt, FastQC etc. I think that consistency with other modules should be maintained where possible.

Trimmomatic has a module-specific configuration option s_name_filenames which we could also do here, to make the filename the top priority for just this module. Hopefully this should give the desired behaviour without too much overhead.

@yeroslaviz

yeroslaviz commented Oct 30, 2023

Is there a solution for that problem by now?

For now, all I can do is manually modify the JSON file

from:

"fastp --in1 rawData/Undetermined_S0_R1_001.fastq.gz --in2 rawData/Undet...

to:

"fastp --i rawData/Undetermined_S0_R1_001.fastq.gz --in2 rawData/Undet

@vladsavelyev
Member

@yeroslaviz, the solution is merged into the master branch of the repository, but it will only be available in the installable package after the next release (1.18), hopefully next week.
