Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new module to support DRAGEN-FastQC #1232

Merged
merged 117 commits into from
Dec 21, 2022
Merged

Add new module to support DRAGEN-FastQC #1232

merged 117 commits into from
Dec 21, 2022

Conversation

bnbowman
Copy link
Contributor

  • This comment contains a description of changes (within reason)
  • CHANGELOG.md has been updated

Illumina is in the process of updating it's DRAGEN Bio-It platform that, among other things, adds a new feature to generate hardware-accelerated primary analysis statistics similar to the widely used FastQC tool. This PR creates a new module that parses the outputs from that new feature and uses it to replicate the plots that a user would expect to see from FastQC, or from MultiQC's existing FastQC module.

NOTE #1: This module doesn't yet properly support sample-quality flags, but it needs to before it's merged. But I wanted to get this PR up to start the discussion before the weekend, while I'm still working on that.

NOTE #2: As part of supporting full graphical parity with FastQC, this PR also includes new plotting functionality to create box-and-whisker plots. I imagine these need a fair amount of work still, particularly for things like pconfig support, but I think some code-review would help differentiate the must-haves from nice-to-haves

  • There is example tool output for tools in the https://github.com/ewels/MultiQC_TestData repository or attached to this PR
  • Code is tested and works locally (including with --lint flag)
  • docs/README.md is updated with link to below
  • docs/modulename.md is created
  • Everything that can be represented with a plot instead of a table is a plot
  • Report sections have a description and help text (with self.add_section)
  • There aren't any huge tables with > 6 columns (explain reasoning if so)
  • Each table column has a different colour scale to its neighbour, which relates to the data (eg. if high numbers are bad, they're red)
  • Module does not do any significant computational work

Bowman and others added 25 commits May 25, 2020 14:18
…cate sample names, rename data to data_by_sample/data_by_mate
Merging Vlad's suggested changes into master
@ewels
Copy link
Member

ewels commented Jun 23, 2020

This is huge, thank you for your work on this @bnbowman! Lots to look through and test so I need to carve out a proper slot of time to work on this, but a couple of very fast speed thoughts:

  • Box plots code looks nice, but I see that you kind of cheated and it's only doing flat matplotlib images 😉
    • This plot type is well overdue, if I update HighCharts then it should be possible to have interactive as well as static boxplots now. It will require quite a bit of new javascript.
    • I'm happy to take this on, and would prefer to keep it in a separate PR if possible
    • I'm very happy that you did the matplotlib stuff already, hopefully I can take your work there and extend it - in the past it's been the matplotlib stuff that has given me the most grief 👍
  • Shared FastQC plot code
    • This really needs to be ported / rewritten as core functionality. I think this is about the third or maybe fourth module that has tried to reuse the code so it's definitely time.
    • More than just moving the files around, I think it could do with some refactoring. Much of this code comes from the very early days of MultiQC and there is probably a better approach with some of it now.
    • Specifically for the base content heatmap, I would like to look into rewriting this using the core MultiQC heatmap functions if possible.

So I think we basically have three separate PRs in one here - boxplots, core FastQC functionality, and DRAGEN. My preferred approach would be for someone (probably me) to tackle the first two in separate PRs. Especially the second point, as I'd like to update other modules such as FastQC as part of the same process. Once those are done we can come back to the DRAGEN part to make use of this new functionality.

This is going to take a fair amount of work but I think that these updates will really improve MultiQC and are often requested. I'll try to spend some time on this once I've got my current batch of @nf_core stuff done.

So this isn't going to be a quick approve + merge PR I'm afraid @bnbowman 😞 I need to finish other stuff and once I start it'll take me a while to implement the above. But the PR is 2k lines of new code which touch at the heart of MultiQC, so maybe that's not surprising 😅

I guess that time will be a factor for you and you'll want this merged ASAP. If so then maybe we can discuss a strategy to break this up a little and start bringing in parts of the new DRAGEN functionality that don't rely on the core MultiQC changes mentioned above. What do you think?

Phil

@ewels
Copy link
Member

ewels commented Jun 23, 2020

ps. I'm not entirely sure what you mean about sample-quality flags?

@bnbowman
Copy link
Contributor Author

bnbowman commented Jul 6, 2020

Sorry for the delayed response - I've been on vacation for the past week and a half (Not to mention dealing with a brand new puppy...)

So I think we basically have three separate PRs in one here - boxplots, core FastQC functionality, and DRAGEN.
My preferred approach would be for someone (probably me) to tackle the first two in separate PRs.

I broadly agree with this. I can help somewhat on the Python-side of the boxplots, and I agree the DRAGEN code is my responsibility, but HighCharts, core infra, and JS in general are not my forte.

So this isn't going to be a quick approve + merge PR I'm afraid

Totally understand. I tried to get people rallied around this course of action early b/c I knew it was going to be a lift, but there was a lot of internal dithering over MultiQC vs. a bespoke solution. So here we are. We're using my personal branch for now for business purposes, so it's not a show-stopper, but it is definitely something we want to get done sooner rather than later.

ps. I'm not entirely sure what you mean about sample-quality flags?

The thing you do where you color some lines green/yellow/red depending on how the sample stacks up against various criteria? And have a little colored box of the number of samples at the top of each section? That thing.

@bnbowman
Copy link
Contributor Author

bnbowman commented Jul 6, 2020

On a side note - I've got a major deadline for another project due at the end of the month, so I won't have much time to work on this until then. But afterwards I should have more flexibility - will reach out to check in and see if there is anything I can help with when my schedule opens up a little.

alexiswl and others added 12 commits February 22, 2022 17:46
* rna_quant_metrics.py
* rna_transcript_cov.py
* time_metrics.py
* trimmer_metrics.py
* dragen_fastqc.py
* This means t/n components have their own mapping metrics.
DRAGEN mapping metrics support NA vlues
* Add scATAC metrics

* missing file

* Update search_patterns.yaml

Co-authored-by: Lee <slee2@illumina.com>
* Fix QC coverage metrics

* Remove debug messages

Co-authored-by: Lee <slee2@illumina.com>
@ewels ewels mentioned this pull request Sep 14, 2022
Co-authored-by: Lee <slee2@illumina.com>
@ewels
Copy link
Member

ewels commented Dec 18, 2022

@andrei-seleznev / @alexiswl / @bnbowman / anyone else who might knows - this PR includes a new kiwisolver dependency, but I can't see it mentioned anywhere in the codebase except setup.py.

Is this new dependency definitely still used? It's one of the main blockers for me from merging currently. I'm basically happy to merge anything within the dragen folders at this point as-is, plus the box plot. The only other thing outside of that scope is this extra dependency.

@alexiswl
Copy link
Contributor

@ewels I'm not sure what the kiwisolver is used for but it was first seen in this commit from @bnbowman.

I remember having the same issue you are seeing when trying to create a MultiQC docker image from this branch and attributed it this kiwi solver issue. Although this issue was resolved in kiwisolver version 1.4.4, trying to build the Dockerfile with this later version of kiwisolver resulted in the same error:

...
  gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Os -fomit-frame-pointer -g -Os -fomit-frame-pointer -g -Os -fomit-frame-pointer -g -DTHREAD_STACK_SIZE=0x100000 -fPIC -I/tmp/pip-build-env-iwfb688u/overlay/lib/python3.8/site-packages/cppy/include -I. -I/usr/include/python3.8 -c py/src/constraint.cpp -o build/temp.linux-x86_64-cpython-38/py/src/constraint.o -std=c++11
  /tmp/pip-build-env-iwfb688u/overlay/lib/python3.8/site-packages/setuptools/config/pyprojecttoml.py:108: _BetaConfiguration: Support for `[tool.setuptools]` in `pyproject.toml` is still *beta*.
    warnings.warn(msg, _BetaConfiguration)
  error: command 'gcc' failed: No such file or directory
  ----------------------------------------
  ERROR: Failed building wheel for kiwisolver
  Running setup.py clean for kiwisolver
  Building wheel for pyyaml (PEP 517): started
  Building wheel for pyyaml (PEP 517): finished with status 'done'
  Created wheel for pyyaml: filename=PyYAML-6.0-cp38-cp38-linux_x86_64.whl size=45331 sha256=480051c2b1516b42060a514652cb283c4dcabb6cac9d829a8bd24409bbe9a21e
  Stored in directory: /root/.cache/pip/wheels/95/84/67/ebeac632c63797cfbeb90128ca41073117721540dad526d213
Successfully built multiqc pyyaml
Failed to build kiwisolver
ERROR: Could not build wheels for kiwisolver which use PEP 517 and cannot be installed directly

A workaround for me was using #1541 which installs (and then removes) the build essentials for compiling pip packages that don't have the available wheels to install.

@ewels
Copy link
Member

ewels commented Dec 19, 2022

That commit was 2 years ago though and presumably to also solve some kind of installation error. I'm hoping that it's no longer needed and we can remove it.

@alexiswl
Copy link
Contributor

That commit was 2 years ago though and presumably to also solve some kind of installation error. I'm hoping that it's no longer needed and we can remove it.

Didn't even think of that! Rebuilt this branch without kiwisolver and was still able to produce dragen and dragen-fastqc module outputs.

@andrei-seleznev
Copy link
Contributor

I got in touch with Brett - indeed it was an old trick to solve local environment dependency issues. Good to be removed.

@ewels
Copy link
Member

ewels commented Dec 19, 2022

Great news! Ok then it's just fixing up these merge conflicts, a few test runs to ease my conscience, and hopefully we're good to go..!

@alexiswl
Copy link
Contributor

@ewels I've put in a PR to @bnbowman to resolve the merge conflicts with bnbowman#35.

alexiswl and others added 2 commits December 19, 2022 16:57
And add gc metrics to general stats
* Try prefixing analysis dirs

* Update CHANGELOG.md

* tried to add conditional execution to actions

* Update CHANGELOG.md

* Add dependency statement

* only overwrite id when not set

* changelog

* always set c_id

* Revert setting force_interactive flag for rich with --no-ansi

* Don't force terminal escape codes for the progress bar

* Extend kallisto module regex to recognize newer output

I've noticed multiqc (v 1.12) didn't recognise some kallisto output (I'm using kb_python 0.26.4).
Having digged a bit, it seems to not work with more recent kallisto output, i.e. this snippet taken from here:
https://github.com/pachterlab/GRNP_2020/blob/daed9c2f204f1c3f6ee0e864c3db93b0baadfc8a/notebooks/FASTQ_processing/ProcessPBMC_NG.ipynb
```
[index] k-mer length: 31
[index] number of targets: 187,626
[index] number of k-mers: 108,619,921
tcmalloc: large alloc 3221225472 bytes == 0x556459b7e000 @  0x7feac4ab5887 0x556458814ad2 0x55645880d061 0x5564587e1372 0x7feac3935bf7 0x5564587e60da
[index] number of equivalence classes: 752,021
[quant] will process sample 1: A_R1.gz
                               A_R2.gz
[quant] will process sample 2: B_R1.gz
                               B_R2.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 170,526,037 reads, 98,632,205 reads pseudoaligned
```
Turns out multiqc only looks for `pair|file` but not sample. Replacing sample for file did do the trick, hence I suggest to add `sample` in the regex pattern.
I haven't tested this, but it should work now.

Here is the code which generates this output:
https://github.com/pachterlab/kallisto/blob/83bde908c403ea4014b5092a243e5c7240f48dd5/src/ProcessReads.cpp#L235

This is the commit which introduced it (already in 2018, so not sure why this hasn't been caught yet)
pachterlab/kallisto@62e9464

* Replace logger.hasHandlers() with logger.handlers

There are cases where configuring logging results in logger.handlers being empty but logger.hasHandlers() returns True: MultiQC#1643

Since the block modified removes based on logger.handlers, the condition to enter the block should check logger.handlers rather than logger.hasHandlers()

* Added description of changes for pull request

* Document 'no_version_check' config option

* Docs tweak

* Fix kwargs for MultiQC plugins

* New config option 'custom_table_header_config'

* Run black

* Update adapterRemoval.py

Returns actual proportion of reads that were collapsed and discarded

* Black format

* Black format

* BlackPython

* Fix chart labels and titles

* Fix chart labels and titles

* Add columns to stats table

Add columns with proportion of collapsed/discarded reads to the general stats table

* Add Columns - Fix format

* Changelog

* Fixed bug when other fields also have a "-" instead of an integer.

* Updated CHANGELOG

* Fixed typos

* Fixed format typo

* Fixed format typo

* Nanostat: Remove HTML escaping

Jinja2 escape() function removed in jinja2 v3.10

I don't think that this escaping should be required. I can't see any effect in the report when I remove it anyway.

* Changelog

* Changing 0 to None

* Skip fields with `-`

* Pangolin 4.0 compatability

Recently pangolin has been updated to version 4.0 and this changes the output CSV file - see: https://github.com/cov-lineages/pangolin/releases/tag/v4.0

This causes the module to fail in its current state as row['qc_status'] already exists and the current replacement triggers a key error by searching for row['status'] which no longer exists. Thanks to @alexomics for tracking down the issue.

* Don't duplicate custom-content section descriptions.

Fixed edge-case bug in custom content where a `description` that doesn't terminate in `.` gave duplicate section descriptions.

* Changelog

* Tidied the verbose log, added summaries for skipped search files to debug log

* Allow sorting of table columns with text contents

* update changelog

* optimize linegraph category comparison

* Somalier: division by zero in sex ploidy plot

* Changelog

* Add time zone

* Update changelog

* Fix typo in bcl2fastq.py

* Handle too long and low complexity

* update changelog

* fix zero division error in sambamba markdup module

* black formatting

* update CHANGELOG.md to address MultiQC#1654

* bclconvert checks RunInfo xml if reads are singleend or pairedend and sets clusterlength appropriately. resolves MultiQC#1697

* Added CITATION.cff file for standardized citations

* fixed formatting of url

* fixed citation formatting

* Run prettier

* Fix module crashing due to missing field in report

* Fix bug where module wouldn't run if all content was within a MultiQC config file

Fixes MultiQC#1686

* nanostat: add check for quality scores

* update CHANGELOG.md

* update CHANGELOG.md

* Custom content: Fix crash when 'info' isn't set

Closes MultiQC#1688

* Added nix flake support

* Update docs/installation.md

Co-authored-by: Phil Ewels <phil@seqera.io>

* Fix zero division error

* Update fastqc.py

* Update fastqc.py

* fix format

* add change log

* fix doc ref

* Don't need Prettier _and_ markdown/yamllint CI

* Just capture the ValueError

* Rich-codex screenshot in the readme

* Corrected 'outdir' flag

Missing a dash for the flag to work.

* Clean up clean_img_paths

* Generate new screengrabs with rich-codex

* Add samtools flagstat column '% Read Mapped'

* update samtools flagstat changelog

* Added try,except for divisions to avoid division by 0 errors

* added the fixing of malt in the change log

* report median read length for fastqc

* add after filtering total reads to general stats table

* GitHub Actions: Tweet about new releases

* Bump to v1.13 for release

* rich-codex screenshots: Manual only, skip git checks

* Generate new screengrabs with rich-codex

* Fix changelog date

* Bump to v1.14dev

* Custom content: Render report even if there's only general stats there

See MultiQC#1756

* Bugfix: Make `config.data_format` work again

* Bump minimum version of Jinja2 to `>=3.0.0`

Closes MultiQC#1642

* Disable search progress bar if running with `--quiet` or `--no-ansi`

Closes MultiQC#1638

* Attempt to cooerce line / scatter x-axes into floats so as not to lose labels

See MultiQC#1242

* Use row 1 as x-axis labels if no sample name.

Closes MultiQC#1242

* Malt: Move changelog up to new version

* Merge changelog up

* Use OrderedDict instead of 'placement'

* Add code comment

* Add CI testing for Python 3.10 and 3.11

* Fix typo

* Quotes so it's 3.10 and not 3.1

* 3.11-dev

* Remove 3.11-dev for Windows

* Move merge markers GHA into lint workflow file

* Shorter job name

* Be more selective about when slow MultiQC test runs fire

- Master only for push event
- Don't run if only docs / markdown

* Run isort

* Remove py2 'from __future__ import print_function'

* Add GitHub actions CI for isort

* Changelog

* Remove all py2 'from __future__ imports'

* Tweak some imports

* Changelog

* added setuptools to flake

* rm emtpy bcftools stats variant depths plot

* moved changelog comment

* adjusted PR num

* fix duplicate heatmap for kraken

* changelog

* use None instead

* First commit of pre-commit

* Comment out all the tests that don't yet work

* Update gene_body_coverage.py

Using a normalized coverage to make genebody coverage plot ( similar to the method used by RSeQC). Us the formula 'norm_cov = ( cov - min(cov ) / ( max(cov) - min(cov) )' to compute normalized coverage

* Update gene_body_coverage.py

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update CHANGELOG.md

* Move changelog entry

* Test for Python 3.11 now that the official release is out

* CI: Use new version of actions/checkout to avoid Node.js depreciation warning

* Remove sample and chromosome before converting to int

This fixes issue-1793

* Remove filtered samples from general stats table

This fixes MultiQC#1780

* Update changelog

* Add additional entries for qualimap when region stats present

* fastp: use passed filter reads instead of after filter total reads

Signed-off-by: Josh Chorlton <jchorlton@gmail.com>

* bclconvert now handles different r1 and r2 lengths instead of assuming they are the same

* updated CHANGELOG.md

* update bustools

* Update CHANGELOG.md

* Remove changelog entry

* Move changelog to entry to correct place

* Fix changelog

* Kraken: Improve heatmap config

* Apply suggestions from code review

Co-authored-by: Phil Ewels <phil.ewels@seqera.io>

* handle singleindex data

* cleanup

* CHANGELOG.md bclconvert fix issue link typo and note single-index paired-end data handled

* Qualimap BamQC: Refactor to parse regexes per section

Also: Fix randomly aggressive Snippy module parsing bug

* HsMetrics: Allow custom columns in General Stats too

* Replace nested loop with list comprehension when parsing output file headers

* CHANGELOG

* Output headers order preserved and Sample is first value

* Fix ubuntu version in GitHub CI to preserve Py3.6 testing.

Python 3.6, I think your days are numbered..

* add back original avg field

Signed-off-by: Josh Chorlton <jchorlton@gmail.com>

* fixes

Signed-off-by: Josh Chorlton <jchorlton@gmail.com>

* update busco colors

Signed-off-by: Josh Chorlton <jchorlton@gmail.com>

* fix: frontmatter yaml formatting issue

* Update docs to use --cl-config instead of --cl_config

Closes MultiQC#1825

* Update multiqc/modules/fastqc/fastqc.py

Co-authored-by: Phil Ewels <phil.ewels@seqera.io>

* Update multiqc/modules/fastqc/fastqc.py

Co-authored-by: Phil Ewels <phil.ewels@seqera.io>

* Update multiqc/modules/fastqc/fastqc.py

Co-authored-by: Phil Ewels <phil.ewels@seqera.io>

* suggestion

Signed-off-by: Josh Chorlton <jchorlton@gmail.com>
Co-authored-by: Erik Danielsson <danielsson.erik.0@gmail.com>
Co-authored-by: Phil Ewels <phil.ewels@scilifelab.se>
Co-authored-by: Ido Tamir <ido.tamir@vbcf.ac.at>
Co-authored-by: seb-mueller <sebm@posteo.de>
Co-authored-by: Jonathan Oribello <Jonathan.d.oribello@gmail.com>
Co-authored-by: NiemannJ <69033839+NiemannJ@users.noreply.github.com>
Co-authored-by: fgvieira <fgarrettvieira@gmail.com>
Co-authored-by: mattloose <matt.loose@nottingham.ac.uk>
Co-authored-by: Josh Chorlton <jchorlton@gmail.com>
Co-authored-by: vladsaveliev <vladislav.savelyev@populationgenomics.org.au>
Co-authored-by: Sam Chorlton <>
Co-authored-by: jethror1 <45037268+jethror1@users.noreply.github.com>
Co-authored-by: Garth Kong <kongga2017@gmail.com>
Co-authored-by: Andrei Seleznev <aseleznev@illumina.com>
Co-authored-by: lew2mz <david.lewis@cchmc.org>
Co-authored-by: Phil Ewels <phil@seqera.io>
Co-authored-by: phue <patrick.huether@imp.ac.at>
Co-authored-by: David Lewis <60514384+IllustratedMan-code@users.noreply.github.com>
Co-authored-by: Chang Y <yech1990@gmail.com>
Co-authored-by: beausoleilmo <beausoleilmo@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jean Mainguy <jean.mainguy@outlook.fr>
Co-authored-by: aidaanva <aida.andrades@gmail.com>
Co-authored-by: SusiJo <susanne.jodoin@gmx.de>
Co-authored-by: Phil Ewels <phil.ewels@seqera.io>
Co-authored-by: TNalpat <thomas.nalpathamkalam@gmail.com>
Co-authored-by: Redmar van den Berg <RedmarvandenBerg@lumc.nl>
Co-authored-by: James Fellows Yates <jfy133@gmail.com>
Co-authored-by: Maarten-vd-Sande <maartenvandersande@hotmail.com>
Co-authored-by: Adam Talbot <adam.talbot@nonacus.com>
Co-authored-by: Oleh Pratsko <olehpratsko@gmail.com>
Co-authored-by: Josh Chorlton <jchorl@users.noreply.github.com>
@ewels
Copy link
Member

ewels commented Dec 20, 2022

Thanks @alexiswl! The module still needs quite a bit of work, I've made a start over on #1829 where I can push to the branch. Nearly there now I think, see comment #1829 (comment)

@ewels
Copy link
Member

ewels commented Dec 21, 2022

Ok, I think I've got to a point where I'm happy to merge 🎉

Caveats:

  • I'm pretty sure that the Dragen (and maybe the dragen_fastqc? or maybe not?) test data is pretty incomplete.
    • If anyone is able to contribute more test data there to cover all module outputs, that would be great.
    • I would be surprised if CI tests continue to pass when new test data is added! But that's kind of the point, we can fix that
  • Some stuff in this PR is still not done, eg. heatmap / hover bars showing sample status.
    • This can be added at a later date if needed
  • I haven't properly read / tested all code, as there's just too much of it for me to do in the time I have
    • I'm hoping that the code changes in Fix merge conflicts for #1232 #1829 are isolated enough that anything that makes it through will not affect other MultiQC users
    • If anyone hits bugs in the Dragen code, please report - or better still, fix with a PR! 🙏🏻

Let's do this.. Now or never 😅 Hopefully this means we can bring you all back onto the main MultiQC distribution / development branch!

Many thanks to everyone who has contributed to this PR. Especially @bnbowman for the bulk of the new code but also to those who have come in to help with the cleanup at such a late stage. Much appreciated.

Phil

ewels added a commit that referenced this pull request Dec 21, 2022
@ewels ewels merged commit d9f7965 into MultiQC:master Dec 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants