Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding BBMap qchist #1021

Merged
merged 11 commits into from Jan 31, 2022
Merged

adding BBMap qchist #1021

merged 11 commits into from Jan 31, 2022

Conversation

massiddamt
Copy link
Contributor

  • This comment contains a description of changes (with reason)
  • CHANGELOG.md has been updated

With this update it is possible to handle BBMap qchist reports.
A plot is generated with the number of counts at each Phred Score value.
A table is included in the general stats section with the percentage of bases with Q >= 30.

@ewels
Copy link
Member

ewels commented Oct 4, 2019

Nice! Ping @boulund as this is BBMap code..

@ewels ewels added this to the MultiQC v1.8 milestone Oct 7, 2019
@boulund
Copy link
Contributor

boulund commented Oct 8, 2019

A welcome addition, thanks for filing a PR to add this functionality!

I haven't had a chance to test it yet, but it looks like the code for qchist finding and parsing that has been added is handled outside of the generic parsing and plotting code here: https://github.com/ewels/MultiQC/blob/b035f081fac0baed23b3e78d48b0dfb3084738c7/multiqc/modules/bbmap/bbmap.py#L38-L72

I know that section of the code is rather... opaque... Sorry for that!

It'll most certainly work, but if more additions like this come in this module will become a hairy mess to maintain (more than it already is...). At least the new code that is added is very clear 👍 . If you think it looks OK @ewels, then by all means merge the PR.

@boulund
Copy link
Contributor

boulund commented Oct 8, 2019

Hmmm 🤔....
I had a look at the docs and saw that we already wrote that the BBMap module supports qchist. Noticed we didn't include the filetype in the search patterns: https://github.com/ewels/MultiQC/blob/master/multiqc/utils/search_patterns.yaml

Cloned MultiQC into the folder, added (the spaces below should be tabs)

bbmap/qahist:
    contents: '#Quality    count1  fraction1   count2  fraction2'
    num_lines: 1

to search_patterns.yaml, and then ran a quick check on a test sample with reformat.sh in=sample1_R#.fq.gz qchist=qchist.txt and ran MultiQC: ./MultiQC/scripts/multiqc ..

Lo and behold! We got a plot in the output report.The correct data is indeed plotted, but the plot is ugly and it's giving the description for aqhist for some reason. Since there is no definition for qchist plotting parameters in the code there's probably a bug lurking in the code here somewhere or else it probably wouldn't have grabbed the description for another filetype. I think the custom code @massiddamt has written is probably easier to reason about and I argue we should keep it and ignore that qchist data might have been plottable via the generic parsing/plotting routines already implemented.

@ewels
Copy link
Member

ewels commented Oct 8, 2019

I'm confused - qchist, qahist, aqhist... Are these all different datasets?

@boulund
Copy link
Contributor

boulund commented Oct 9, 2019

@ewels, yes they are all different datasets. BBMap offers a very wide range of histogram outputs. Here are some of them, as described in the BBMap CLI help:

Histogram output parameters:
bhist=<file>        Base composition histogram by position.
qhist=<file>        Quality histogram by position.
qchist=<file>       Count of bases with each quality value.
aqhist=<file>       Histogram of average read quality.
bqhist=<file>       Quality histogram designed for box plots.
lhist=<file>        Read length histogram.
gchist=<file>       Read GC content histogram.

Histograms for mapped sam/bam files only:
ehist=<file>        Errors-per-read histogram.
qahist=<file>       Quality accuracy histogram of error rates versus quality score.
indelhist=<file>    Indel length histogram.
mhist=<file>        Histogram of match, sub, del, and ins rates by position.
idhist=<file>       Histogram of read count versus percent identity.

Some of these I had no idea what to do with when we wrote the module to begin with, and opted for an approach where we wrote a fairly generic histogram parser with generic histogram plotting so we could cover as many as possible, and then added a few specific variants for the non-histogram data (a handful or so). We were hoping that end-users would request or make their own PRs for additional stuff as they realized they wanted them ... 😓

@massiddamt
Copy link
Contributor Author

In the last commit I have changed the search pattern removing the filename (not useful) and moving from
contents: '#Quality count1 fraction1'
to
contents_re: '#Quality count1 fraction1\n'
In this way I am sure the module is considering only bbmap qchist output, excluding other similar bbmap outputs (ie qahist)

@ewels
Copy link
Member

ewels commented Oct 9, 2019

Nice! Cool that using the \n character works - I'm not aware of any other modules using that trick 😎

@ewels
Copy link
Member

ewels commented Nov 15, 2019

Hi @massiddamt,

When I try running your code on the bbmap test data I get a new plot, but it's empty:

image

Weirdly, hovering over the plot shows data tooltips:

image

Looking at the exported data from the report, it seems as though it's a line graph but with only one point for each line.

Is this what you also see? Do you have any example data that you have been running with?

Phil

Copy link
Member

@ewels ewels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't done a full code review yet, will take a proper look when I get the module to produce a plot as expected..

multiqc/modules/bbmap/bbmap.py Outdated Show resolved Hide resolved
multiqc/modules/bbmap/bbmap.py Outdated Show resolved Hide resolved
@boulund
Copy link
Contributor

boulund commented Nov 15, 2019

Like I hinted at in my previous post above I don't think we need any code changes to enable qchist plotting. It seems it was just an issue with the file contents search pattern that made it match another output file from BBMap.

Hmmm 🤔....
I had a look at the docs and saw that we already wrote that the BBMap module supports qchist. Noticed we didn't include the filetype in the search patterns: https://github.com/ewels/MultiQC/blob/master/multiqc/utils/search_patterns.yaml

Cloned MultiQC into the folder, added (the spaces below should be tabs)

bbmap/qahist:
    contents: '#Quality    count1  fraction1   count2  fraction2'
    num_lines: 1

to search_patterns.yaml, and then ran a quick check on a test sample with reformat.sh in=sample1_R#.fq.gz qchist=qchist.txt and ran MultiQC: ./MultiQC/scripts/multiqc ..

Lo and behold! We got a plot in the output report.The correct data is indeed plotted, but the plot is ugly and it's giving the description for aqhist for some reason. Since there is no definition for qchist plotting parameters in the code there's probably a bug lurking in the code here somewhere or else it probably wouldn't have grabbed the description for another filetype. I think the custom code @massiddamt has written is probably easier to reason about and I argue we should keep it and ignore that qchist data might have been plottable via the generic parsing/plotting routines already implemented.

Adding the line above (preferably with the \n at the end) would circumvent the need for any other code changes, as the generic histogram parser/plotter functionality in the bbmap module should deal with it.

@ewels
Copy link
Member

ewels commented Nov 15, 2019

Yes, but you also said at the end:

I think the custom code @massiddamt has written is probably easier to reason about and I argue we should keep it and ignore that qchist data might have been plottable via the generic parsing/plotting routines already implemented.

@boulund
Copy link
Contributor

boulund commented Nov 15, 2019

😅 ... Haha. Indeed I did, and didn't even read the entire post I quoted :D

If we're planning to go that route and keep the new code then it needs all of those items fixed that you just noted in your review @ewels. If we go the other route we need to change a single line of code and check that the plot looks ok. I'd argue that the latter is easier, lower effort, and quicker. There is no real risk of adding more complexity to the module by keeping the existing code, since nothing really changes.

But of course, I haven't tested it properly, so not sure if it'll work.

@ewels
Copy link
Member

ewels commented Nov 17, 2019

Sure! I’m all for that approach if it’s easy. Fancy opening a PR..? 😉

@ewels ewels modified the milestones: MultiQC v1.8, MultiQC v1.9 Nov 19, 2019
@ewels ewels modified the milestones: MultiQC v1.9, MultiQC v1.10 May 30, 2020
@ewels ewels force-pushed the master branch 2 times, most recently from 3e19e54 to 9f00c19 Compare June 24, 2021 15:11
* Fix broken plot series with 0 counts on log axis
* Rewrite general stats code to avoid double-parsing
* Remove superflous data file save
* Remove bad UserWarning stop if no qchist files found
* Remove double-logging
@ewels
Copy link
Member

ewels commented Jan 31, 2022

ok @massiddamt / @boulund - I figured this PR had languished long enough and I wanted to get it off the books 😅

I went over the code and did a bit of refactoring. I tried to make some bits (eg. general stats) use the generalised BBMap code a bit more, and sorted out some of the issues with the plotting and functionality blocking other BBMap sections.

I'm pretty happy with this now and I think it works fine in my testing. I'll merge, but if you guys get a chance to test it out, that'd be fab!

Thanks again for your contribution @massiddamt and discussion @boulund 🎉

@ewels ewels merged commit 057479f into MultiQC:master Jan 31, 2022
@boulund
Copy link
Contributor

boulund commented Feb 1, 2022

Wow! Great work, thanks! I'll try to give it a spin next time an opportunity opens up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants