Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ngsderive: new 'endedness' submodule #1992

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

a-frantz
Copy link
Contributor

@a-frantz a-frantz commented Aug 18, 2023

Newest subcommand of ngsderive is endedness. At this moment, it only detects Paired-End VS Single-End libraries (as that's what we have access to), but support for more diverse library types are possible.

This PR adds one column to the general stats table (Predicted Endedness). In the ngsderive section of the report it adds a new table with 6-7 columns (depending on what options were enabled during runtime of ngsderive).

5 of the columns are numerical, though they are best represented as a table. The algorithm is explained here, but the TLDR is that a bargraph/linegraph/etc. would not be informative for these numeric values. A table suffices.

  • This comment contains a description of changes (with reason)
  • CHANGELOG.md has been updated
  • There is example tool output for tools in the https://github.com/ewels/MultiQC_TestData repository or attached to this PR
  • Code is tested and works locally (including with --lint flag)
  • docs/README.md is updated with link to below
  • docs/modulename.md is created
  • Everything that can be represented with a plot instead of a table is a plot
  • Report sections have a description and help text (with self.add_section)
  • There aren't any huge tables with > 6 columns (explain reasoning if so)
  • Each table column has a different colour scale to its neighbour, which relates to the data (eg. if high numbers are bad, they're red)
  • Module does not do any significant computational work

Copy link
Member

@vladsavelyev vladsavelyev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding another feature! Looks great, just some comments mostly relevant to formatting.

multiqc/modules/ngsderive/ngsderive.py Outdated Show resolved Hide resolved
multiqc/modules/ngsderive/ngsderive.py Outdated Show resolved Hide resolved
multiqc/modules/ngsderive/ngsderive.py Outdated Show resolved Hide resolved
multiqc/modules/ngsderive/ngsderive.py Outdated Show resolved Hide resolved
multiqc/modules/ngsderive/ngsderive.py Outdated Show resolved Hide resolved
@vladsavelyev vladsavelyev self-requested a review August 22, 2023 11:10
@vladsavelyev
Copy link
Member

Thanks for addressing the suggestions quickly @a-frantz, looks good to me!
Will leave it to @ewels to merge.

@vladsavelyev vladsavelyev changed the title feat: new 'endedness' submodule for ngsderive ngsderive: new 'endedness' submodule Aug 23, 2023
@vladsavelyev vladsavelyev added the awaits-review Awaiting final review and merge. label Aug 23, 2023
@a-frantz
Copy link
Contributor Author

@ewels
The new endedness command (and the old strandedness command) support a parameter called --split-by-rg. This does what it sounds like it does; instead of one entry per sample (BAM), it will have X+1 entries per sample, where X is the number of read groups in the file (plus 1 for an overall entry).
When first writing the strandedness integration a few years ago, I was under the impression that there wasn't a good way in MultiQC to present multiple data points per sample (in the simplest form, think of a table where multiple rows correspond to the same sample). Because of that belief, I simply decided MultiQC should ignore any files with a ReadGroup column header and called it a day.
I'm rethinking that decision, and wanted to check with you if what I want is already present in MultiQC. We've found a decent number of cases where splitting these analyses by read-group has elucidated something interesting. It would be ideal if we can get this slight change of data format through the MultiQC parsing.

@ewels
Copy link
Member

ewels commented Aug 28, 2023

@a-frantz not really I'm afraid, at least - not yet. I've been wanting to add some core functionality to handle groups of samples / results for years. I actually had a stab at it back in 2017 (time flies 🤯 ) in #576 👀

It will happen at some point. But it's very complex and might be one of the things that ends up in the bigger v2 milestone that we're starting to build. Feel free to follow #542 for progress.

In the mean time, could you just treat each as a separate sample - just adding the read group as a suffix to the sample name? It's a little crude, but would get the data into the report at least. I think that Picard MarkDuplicates does this already:

https://github.com/ewels/MultiQC/blob/a5a31290ecc540c377a5569d5912d1758fab389c/docs/modules/picard.md?plain=1#L134-L135

Also DRAGEN, VerifyBamId and some other modules do stuff with read groups too it seems, based on a quick project-wide search.

Copy link
Member

@ewels ewels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of minor points that need fixing up..

multiqc/modules/ngsderive/ngsderive.py Outdated Show resolved Hide resolved
headers["FyLn"] = {
"title": "f+l- count",
"description": "Number of reads with the 'first in template' FLAG set and the 'last in template' FLAG unset",
"format": "{:,d}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use "shared_key": "read_count" and the appropriate number base modifiers. See docs.

Hmm, docs are a bit thin on this. Example here:

https://github.com/ewels/MultiQC/blob/57f3588c5a524f855b0706022e6db44f423588d1/multiqc/modules/ivar/ivar.py#L110-L116

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this automatically scale the values? As I shared in another comment, these data are being checked for exact equality. Any scaling would mask problems users should know about. The value of the "one's" place is important, as off-by-one errors are the most common class of error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that for readability, decimal separators do the job, so there is no need to use read count multipliers that hide the precision that's important here. Unless we add another derived metric, like the difference between f+l- and f-l+ (not sure how creative MultiQC should be here)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok 👍🏻

Worth considering a custom shared key anyway, to preserve bar scaling across columns? I think that read_count might do some magic with modifiers (?) but a custom string shouldn't.

anchor="ngsderive-endedness",
description="""Predicted library endedness provided by ngsderive. For more information, please see
[the documentation](https://stjudecloud.github.io/ngsderive/subcommands/endedness/).""",
plot=table.plot(table_data, headers, config),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to represent these counts as a stacked bar plot, instead of a table? They're all subsets of a whole, right? Or could they sum to > 100%?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do sum to 100%, so a stacked bar plot is possible. I'm just afraid it might be less informative than the raw numbers. Which is admittedly counter-intuitive. Let me explain how these data will be used.

These data are more or less meant as sanity checks. The patterns you expect to see are:
Paired-End sample: f+l- == f-l+ && f+l+==0 && f-l-==0
Single-End sample: f+l+ > 0 && everything else == 0
So in my mind, the easiest way to check those assumptions is with a table, where I can quickly glance and say "Great, these numbers are the same number" or "O no, something is greater than zero". The degree by which a category is off from it's assumed value doesn't really matter that match.

In practice, I'm running this on our data and seeing some off-by-one cases. I'm afraid that presenting this as a barplot would mask an off-by-one case, because in a plot the difference between 3million and 3million+1 would be impossible to see. But in a table, super easy to see a value is off by one.

I could see using the conditional formatting available to tables to color-code some of these cases. Would that be possible? First instinct is "Give each unique numerical value a distinct color" (there will be at most 4). Equal cells will have the same color. Unequal cells would have some contrast. That should facilitate the "at-a-glance" assumption checking.

So far I've ignored the Reads Per Template value. That is supposed to be exactly 2.0 (Paired-End) or exactly 1.0(Single-End). Deviation from those exact values is what's concerning. Any ideas for how to present that more intuitively?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should keep the raw numbers instead of rounding or plotting them.

But perhaps, indeed, these checks can be automated, so conditional formatting to highlight samples that break the condition is a good direction to look into.

Another approach would be to just add one more derived metric into the general stats, that would simply show if the sample passes those conditions or not. Similarly formatted to the existing "predicted instrument" and "predicted endedness" columns.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be also helpful to add a "bad" example into the test data, where a sample doesn't pass the endedness criteria.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me complicate things for us: There is a parameter to this command which enables only processing a subset of the BAM file. When this mode is active, there is obviously some amount of deviation from the exact conditions I've outlined previously, just due to the randomness inherent in only processing a portion of the available reads. When run in this mode, the user supplies a tolerance range for determining the conditions. e.g. we can still call the sample Paired-End so long as 0.45 < (f+l-)/((f+l-)+(f-l+)) < 0.55. This is in contrast to the typical case, when every read is processed, that requires (f+l-)==(f-l+). Similarly RPT gets rounded to the nearest int during checks when run in "subset mode".

Generally we would discourage users from using this mode as it will mask errors. But we felt the need to enable it for consistency with the rest of the ngsderive commands. As you can imagine, checking out every read is the file takes a long time. A user has the option to get a decent enough guess in a few seconds if that's what their use-case calls for.

Another approach would be to just add one more derived metric into the general stats, that would simply show if the sample passes those conditions or not. Similarly formatted to the existing "predicted instrument" and "predicted endedness" columns.

Because of the inconsistency outlined above, MultiQC can't get into the business of checking these conditions itself. I would not want to imply "conditions met" or "conditions unmet" with colors such as "green" and "red". The conditions checking is already done and presented by the value of Endedness. That will be Unknown if no conditions are met (or Paired-End or Single-End were those conditions met). I wouldn't be opposed to coloring Unknown red, but I'm not sure it really adds a whole lot. Same for coloring Paired-End and Single-End green. I could take it or leave it.

I think my first instinct:

First instinct is "Give each unique numerical value a distinct color" (there will be at most 4). Equal cells will have the same color. Unequal cells would have some contrast. That should facilitate the "at-a-glance" assumption checking.

is still our best bet. I haven't come up with anything better. It wouldn't be particularly useful for the niche case that only a subset of the BAM was processed, but I think that's fine. We can't make everyone happy, so let's focus on the predominant use-case.

I wouldn't want to color the RPT cell because it is allowed to deviate some. I suppose >=3 would be universally "a problem" today, but in the future we might find a valid use case for that (circular templates?). So to future-proof, I wouldn't want to call any RPT value "bad".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be also helpful to add a "bad" example into the test data, where a sample doesn't pass the endedness criteria.

https://github.com/ewels/MultiQC_TestData/blob/master/data/modules/ngsderive/test.endedness.without_rpt.txt

This represents the common "off by one" error case

Copy link
Member

@ewels ewels Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is a fairly atypical module 😆 It would be great to write all of this in the description / help text above the table, so that users can read how about to interpret the results.

I can't think of any other modules doing this, but it should be possible to get what you want by setting scale to False and bgcols to a dynamically generated dict of discovered read counts + assigned colours. Then you can effectively colour code the table cells according to their numerical status.

If the actual number isn't very interesting, then I think this makes sense for me. As the normal cell background bars don't really tell you anything.

@a-frantz
Copy link
Contributor Author

@a-frantz not really I'm afraid, at least - not yet. I've been wanting to add some core functionality to handle groups of samples / results for years. I actually had a stab at it back in 2017 (time flies 🤯 ) in #576 👀

It will happen at some point. But it's very complex and might be one of the things that ends up in the bigger v2 milestone that we're starting to build. Feel free to follow #542 for progress.

In the mean time, could you just treat each as a separate sample - just adding the read group as a suffix to the sample name? It's a little crude, but would get the data into the report at least. I think that Picard MarkDuplicates does this already:

https://github.com/ewels/MultiQC/blob/a5a31290ecc540c377a5569d5912d1758fab389c/docs/modules/picard.md?plain=1#L134-L135

Also DRAGEN, VerifyBamId and some other modules do stuff with read groups too it seems, based on a quick project-wide search.

This seems like a decent enough solution. Not ideal, but I think we can make it work. My big concern with taking this approach is that I'm not sure if there's a way to "hide" these suffixed "samples" from other sections, mainly the General Stats table. I think having 7 entries with the same sample prefix in one block of the table, and then 6 of them being blank for every other section of the table would be pretty ugly.

What I'm envisioning: ngsderive inserts an overall entry when --split-by-rg is True. This entry would be parsed according to standard sample cleaning rules and would appear as an entry in the General Stats Table. Every other ngsderive result with a ReadGroup creates a new "sample" by suffixing the overall sample name with the RG ID. These "new samples" don't appear anywhere outside of ngdserive's block of the MultiQC report. No entries go into General Stats.

Is this possible?

@ewels
Copy link
Member

ewels commented Aug 29, 2023

Is this possible?

Yup 👍🏻 The module code dictates which samples + data go into the General Stats table / report section separately, so you should have complete control over what goes where.

Plenty of precedent on this one. For example, FastQC appends suffixes to sample names for the adapter trimming plot.

@ewels ewels added waiting: changes Issue / PR is on hold, waiting for requested changes and removed awaits-review Awaiting final review and merge. labels Aug 30, 2023
@a-frantz
Copy link
Contributor Author

a-frantz commented Sep 8, 2023

@ewels I'm working on a v4 release of ngsderive. It involves some breaking changes that will need to be accommodated for in MultiQC*.

Would you rather me knock this current PR out first and open another PR to address the v4 changes, or do it all in one PR? What's easiest for your workflow?

*the changes relevant to MultiQC are:

  • reporting percentages instead of proportions in the range of 0-1
    • we'll have to detect what the scale is in MultiQC (does this value need to be multiplied by 100 before going into the MultiQC report?)
  • Headers have been made to consistently use PascalCase, instead of the current mishmash of cases
    • MultiQC will need to be able to parse the data of either

@vladsavelyev
Copy link
Member

@a-frantz, it's really up to you. If you already know what will the new ngsderive output look like, perhaps it makes sense to update the code in this PR, and also update the MultiQC_TestData examples.

* master:
  Just run CI on the oldest + newest supported Python versions (MultiQC#2074)
  Picard: fix parsing mixed strings/numbers, account for trailing tab (MultiQC#2083)
  FastQC: add top overrepresented sequences table (MultiQC#2075)
  Add GitHub Actions bot workflow to fix code linting from a PR comment (MultiQC#2082)
  Use custom exception type instead of `UserWarning` when no samples are found. (MultiQC#2049)
  Lint modules for missing `self.add_software_version` (MultiQC#2081)
  Changelog bot: Update docs (MultiQC#2077)
  Changelog action: remove `.capitalize()`, add changelog entry (MultiQC#2080)
  Add action to populate the change log from PR titles triggered by `@multiqc-bot changelog` (MultiQC#2025)

# Conflicts:
#	CHANGELOG.md
#	multiqc/modules/ngsderive/ngsderive.py
Comment on lines +129 to +134
read_group = row.get("ReadGroup")
if read_group and read_group != "overall":
sample_name += "-" + read_group
row["is_read_group"] = True
else:
row["is_read_group"] = False
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vladsavelyev At this stage of development, the only subcommands which support read group level data are strandedness and endedness. That data is not guaranteed to be present, as a user must opt into that level of granularity.

I would like to support two different modes for the strandedness barplot. The default would be the current implementation (only one "overall" entry per-sample). The user could then toggle the "Read Group" mode which shows samples + their read group level entries. I'll need to play around with the sorting here to see what would be most useful. I'm not sure ATM.

I don't think we need any toggles for the endedness table. I think it's suitable to display all found information by default.

Is there an example elsewhere in the code base I could follow for what I'm describing for strandedness? I'm not quite sure the best way to approach the problem.

"reverse": round(float(strandedness.get("ReversePct")) * 100.0, 2),
}
if float(strandedness.get("ForwardPct")) == 0 and float(strandedness.get("ReversePct")) == 0:
# TODO how to present this case? No reads found for this Read Group.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vladsavelyev This is a case I'm stumped over. In v4 of ngsderive, I added a feature where the read group definitions in the BAM header are validated against what appears in the sequences. This now means it's possible for an entry to be NULL, if the read group is defined in the header but lacks any sequences associated with it. There's no data to present here, so it can't go into the barplot. This is most likely an error in the BAM that should be communicated to the user. However I'm not sure how to go about that. I suppose we could log.warning("Read Group info messed up! Yada yada") and just ignore it for the output report... I would like to see it somewhere in the report, but I have no idea about an appropriate way to achieve that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting: changes Issue / PR is on hold, waiting for requested changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants