ngsderive: new 'endedness' submodule #1992

a-frantz · 2023-08-18T14:24:58Z

Newest subcommand of ngsderive is endedness. At this moment, it only detects Paired-End VS Single-End libraries (as that's what we have access to), but support for more diverse library types are possible.

This PR adds one column to the general stats table (Predicted Endedness). In the ngsderive section of the report it adds a new table with 6-7 columns (depending on what options were enabled during runtime of ngsderive).

5 of the columns are numerical, though they are best represented as a table. The algorithm is explained here, but the TLDR is that a bargraph/linegraph/etc. would not be informative for these numeric values. A table suffices.

This comment contains a description of changes (with reason)
CHANGELOG.md has been updated

There is example tool output for tools in the https://github.com/ewels/MultiQC_TestData repository or attached to this PR
Code is tested and works locally (including with --lint flag)
docs/README.md is updated with link to below
docs/modulename.md is created
Everything that can be represented with a plot instead of a table is a plot
Report sections have a description and help text (with self.add_section)
There aren't any huge tables with > 6 columns (explain reasoning if so)
Each table column has a different colour scale to its neighbour, which relates to the data (eg. if high numbers are bad, they're red)
Module does not do any significant computational work

vladsavelyev

Thanks for adding another feature! Looks great, just some comments mostly relevant to formatting.

multiqc/modules/ngsderive/ngsderive.py

vladsavelyev · 2023-08-22T11:12:03Z

Thanks for addressing the suggestions quickly @a-frantz, looks good to me!
Will leave it to @ewels to merge.

a-frantz · 2023-08-25T17:36:36Z

@ewels
The new endedness command (and the old strandedness command) support a parameter called --split-by-rg. This does what it sounds like it does; instead of one entry per sample (BAM), it will have X+1 entries per sample, where X is the number of read groups in the file (plus 1 for an overall entry).
When first writing the strandedness integration a few years ago, I was under the impression that there wasn't a good way in MultiQC to present multiple data points per sample (in the simplest form, think of a table where multiple rows correspond to the same sample). Because of that belief, I simply decided MultiQC should ignore any files with a ReadGroup column header and called it a day.
I'm rethinking that decision, and wanted to check with you if what I want is already present in MultiQC. We've found a decent number of cases where splitting these analyses by read-group has elucidated something interesting. It would be ideal if we can get this slight change of data format through the MultiQC parsing.

ewels · 2023-08-28T12:07:37Z

@a-frantz not really I'm afraid, at least - not yet. I've been wanting to add some core functionality to handle groups of samples / results for years. I actually had a stab at it back in 2017 (time flies 🤯 ) in #576 👀

It will happen at some point. But it's very complex and might be one of the things that ends up in the bigger v2 milestone that we're starting to build. Feel free to follow #542 for progress.

In the mean time, could you just treat each as a separate sample - just adding the read group as a suffix to the sample name? It's a little crude, but would get the data into the report at least. I think that Picard MarkDuplicates does this already:

https://github.com/ewels/MultiQC/blob/a5a31290ecc540c377a5569d5912d1758fab389c/docs/modules/picard.md?plain=1#L134-L135

Also DRAGEN, VerifyBamId and some other modules do stuff with read groups too it seems, based on a quick project-wide search.

ewels

Couple of minor points that need fixing up..

multiqc/modules/ngsderive/ngsderive.py

ewels · 2023-08-28T12:29:19Z

multiqc/modules/ngsderive/ngsderive.py

+        headers["FyLn"] = {
+            "title": "f+l- count",
+            "description": "Number of reads with the 'first in template' FLAG set and the 'last in template' FLAG unset",
+            "format": "{:,d}",


Should use "shared_key": "read_count" and the appropriate number base modifiers. See docs.

Hmm, docs are a bit thin on this. Example here:

https://github.com/ewels/MultiQC/blob/57f3588c5a524f855b0706022e6db44f423588d1/multiqc/modules/ivar/ivar.py#L110-L116

Would this automatically scale the values? As I shared in another comment, these data are being checked for exact equality. Any scaling would mask problems users should know about. The value of the "one's" place is important, as off-by-one errors are the most common class of error.

I think that for readability, decimal separators do the job, so there is no need to use read count multipliers that hide the precision that's important here. Unless we add another derived metric, like the difference between f+l- and f-l+ (not sure how creative MultiQC should be here)

ok 👍🏻

Worth considering a custom shared key anyway, to preserve bar scaling across columns? I think that read_count might do some magic with modifiers (?) but a custom string shouldn't.

ewels · 2023-08-28T12:30:36Z

multiqc/modules/ngsderive/ngsderive.py

+            anchor="ngsderive-endedness",
+            description="""Predicted library endedness provided by ngsderive. For more information, please see
+            [the documentation](https://stjudecloud.github.io/ngsderive/subcommands/endedness/).""",
+            plot=table.plot(table_data, headers, config),


Is it possible to represent these counts as a stacked bar plot, instead of a table? They're all subsets of a whole, right? Or could they sum to > 100%?

They do sum to 100%, so a stacked bar plot is possible. I'm just afraid it might be less informative than the raw numbers. Which is admittedly counter-intuitive. Let me explain how these data will be used.

These data are more or less meant as sanity checks. The patterns you expect to see are:
Paired-End sample: f+l- == f-l+ && f+l+==0 && f-l-==0
Single-End sample: f+l+ > 0 && everything else == 0
So in my mind, the easiest way to check those assumptions is with a table, where I can quickly glance and say "Great, these numbers are the same number" or "O no, something is greater than zero". The degree by which a category is off from it's assumed value doesn't really matter that match.

In practice, I'm running this on our data and seeing some off-by-one cases. I'm afraid that presenting this as a barplot would mask an off-by-one case, because in a plot the difference between 3million and 3million+1 would be impossible to see. But in a table, super easy to see a value is off by one.

I could see using the conditional formatting available to tables to color-code some of these cases. Would that be possible? First instinct is "Give each unique numerical value a distinct color" (there will be at most 4). Equal cells will have the same color. Unequal cells would have some contrast. That should facilitate the "at-a-glance" assumption checking.

So far I've ignored the Reads Per Template value. That is supposed to be exactly 2.0 (Paired-End) or exactly 1.0(Single-End). Deviation from those exact values is what's concerning. Any ideas for how to present that more intuitively?

I agree that we should keep the raw numbers instead of rounding or plotting them.

But perhaps, indeed, these checks can be automated, so conditional formatting to highlight samples that break the condition is a good direction to look into.

Another approach would be to just add one more derived metric into the general stats, that would simply show if the sample passes those conditions or not. Similarly formatted to the existing "predicted instrument" and "predicted endedness" columns.

It would be also helpful to add a "bad" example into the test data, where a sample doesn't pass the endedness criteria.

Let me complicate things for us: There is a parameter to this command which enables only processing a subset of the BAM file. When this mode is active, there is obviously some amount of deviation from the exact conditions I've outlined previously, just due to the randomness inherent in only processing a portion of the available reads. When run in this mode, the user supplies a tolerance range for determining the conditions. e.g. we can still call the sample Paired-End so long as 0.45 < (f+l-)/((f+l-)+(f-l+)) < 0.55. This is in contrast to the typical case, when every read is processed, that requires (f+l-)==(f-l+). Similarly RPT gets rounded to the nearest int during checks when run in "subset mode".

Generally we would discourage users from using this mode as it will mask errors. But we felt the need to enable it for consistency with the rest of the ngsderive commands. As you can imagine, checking out every read is the file takes a long time. A user has the option to get a decent enough guess in a few seconds if that's what their use-case calls for.

Another approach would be to just add one more derived metric into the general stats, that would simply show if the sample passes those conditions or not. Similarly formatted to the existing "predicted instrument" and "predicted endedness" columns.

Because of the inconsistency outlined above, MultiQC can't get into the business of checking these conditions itself. I would not want to imply "conditions met" or "conditions unmet" with colors such as "green" and "red". The conditions checking is already done and presented by the value of Endedness. That will be Unknown if no conditions are met (or Paired-End or Single-End were those conditions met). I wouldn't be opposed to coloring Unknown red, but I'm not sure it really adds a whole lot. Same for coloring Paired-End and Single-End green. I could take it or leave it.

I think my first instinct:

First instinct is "Give each unique numerical value a distinct color" (there will be at most 4). Equal cells will have the same color. Unequal cells would have some contrast. That should facilitate the "at-a-glance" assumption checking.

is still our best bet. I haven't come up with anything better. It wouldn't be particularly useful for the niche case that only a subset of the BAM was processed, but I think that's fine. We can't make everyone happy, so let's focus on the predominant use-case.

I wouldn't want to color the RPT cell because it is allowed to deviate some. I suppose >=3 would be universally "a problem" today, but in the future we might find a valid use case for that (circular templates?). So to future-proof, I wouldn't want to call any RPT value "bad".

It would be also helpful to add a "bad" example into the test data, where a sample doesn't pass the endedness criteria.

https://github.com/ewels/MultiQC_TestData/blob/master/data/modules/ngsderive/test.endedness.without_rpt.txt

This represents the common "off by one" error case

Wow, this is a fairly atypical module 😆 It would be great to write all of this in the description / help text above the table, so that users can read how about to interpret the results.

I can't think of any other modules doing this, but it should be possible to get what you want by setting scale to False and bgcols to a dynamically generated dict of discovered read counts + assigned colours. Then you can effectively colour code the table cells according to their numerical status.

If the actual number isn't very interesting, then I think this makes sense for me. As the normal cell background bars don't really tell you anything.

a-frantz · 2023-08-28T17:04:59Z

@a-frantz not really I'm afraid, at least - not yet. I've been wanting to add some core functionality to handle groups of samples / results for years. I actually had a stab at it back in 2017 (time flies 🤯 ) in #576 👀

It will happen at some point. But it's very complex and might be one of the things that ends up in the bigger v2 milestone that we're starting to build. Feel free to follow #542 for progress.

In the mean time, could you just treat each as a separate sample - just adding the read group as a suffix to the sample name? It's a little crude, but would get the data into the report at least. I think that Picard MarkDuplicates does this already:

https://github.com/ewels/MultiQC/blob/a5a31290ecc540c377a5569d5912d1758fab389c/docs/modules/picard.md?plain=1#L134-L135

Also DRAGEN, VerifyBamId and some other modules do stuff with read groups too it seems, based on a quick project-wide search.

This seems like a decent enough solution. Not ideal, but I think we can make it work. My big concern with taking this approach is that I'm not sure if there's a way to "hide" these suffixed "samples" from other sections, mainly the General Stats table. I think having 7 entries with the same sample prefix in one block of the table, and then 6 of them being blank for every other section of the table would be pretty ugly.

What I'm envisioning: ngsderive inserts an overall entry when --split-by-rg is True. This entry would be parsed according to standard sample cleaning rules and would appear as an entry in the General Stats Table. Every other ngsderive result with a ReadGroup creates a new "sample" by suffixing the overall sample name with the RG ID. These "new samples" don't appear anywhere outside of ngdserive's block of the MultiQC report. No entries go into General Stats.

Is this possible?

ewels · 2023-08-29T22:41:22Z

Is this possible?

Yup 👍🏻 The module code dictates which samples + data go into the General Stats table / report section separately, so you should have complete control over what goes where.

Plenty of precedent on this one. For example, FastQC appends suffixes to sample names for the adapter trimming plot.

a-frantz · 2023-09-08T15:27:35Z

@ewels I'm working on a v4 release of ngsderive. It involves some breaking changes that will need to be accommodated for in MultiQC*.

Would you rather me knock this current PR out first and open another PR to address the v4 changes, or do it all in one PR? What's easiest for your workflow?

*the changes relevant to MultiQC are:

reporting percentages instead of proportions in the range of 0-1
- we'll have to detect what the scale is in MultiQC (does this value need to be multiplied by 100 before going into the MultiQC report?)
Headers have been made to consistently use PascalCase, instead of the current mishmash of cases
- MultiQC will need to be able to parse the data of either

vladsavelyev · 2023-09-16T10:19:50Z

@a-frantz, it's really up to you. If you already know what will the new ngsderive output look like, perhaps it makes sense to update the code in this PR, and also update the MultiQC_TestData examples.

* master: Just run CI on the oldest + newest supported Python versions (MultiQC#2074) Picard: fix parsing mixed strings/numbers, account for trailing tab (MultiQC#2083) FastQC: add top overrepresented sequences table (MultiQC#2075) Add GitHub Actions bot workflow to fix code linting from a PR comment (MultiQC#2082) Use custom exception type instead of `UserWarning` when no samples are found. (MultiQC#2049) Lint modules for missing `self.add_software_version` (MultiQC#2081) Changelog bot: Update docs (MultiQC#2077) Changelog action: remove `.capitalize()`, add changelog entry (MultiQC#2080) Add action to populate the change log from PR titles triggered by `@multiqc-bot changelog` (MultiQC#2025) # Conflicts: # CHANGELOG.md # multiqc/modules/ngsderive/ngsderive.py

a-frantz · 2023-10-02T19:47:50Z

multiqc/modules/ngsderive/ngsderive.py

+            read_group = row.get("ReadGroup")
+            if read_group and read_group != "overall":
+                sample_name += "-" + read_group
+                row["is_read_group"] = True
+            else:
+                row["is_read_group"] = False


@vladsavelyev At this stage of development, the only subcommands which support read group level data are strandedness and endedness. That data is not guaranteed to be present, as a user must opt into that level of granularity.

I would like to support two different modes for the strandedness barplot. The default would be the current implementation (only one "overall" entry per-sample). The user could then toggle the "Read Group" mode which shows samples + their read group level entries. I'll need to play around with the sorting here to see what would be most useful. I'm not sure ATM.

I don't think we need any toggles for the endedness table. I think it's suitable to display all found information by default.

Is there an example elsewhere in the code base I could follow for what I'm describing for strandedness? I'm not quite sure the best way to approach the problem.

a-frantz · 2023-10-02T19:53:31Z

multiqc/modules/ngsderive/ngsderive.py

-                "reverse": round(float(strandedness.get("ReversePct")) * 100.0, 2),
-            }
+            if float(strandedness.get("ForwardPct")) == 0 and float(strandedness.get("ReversePct")) == 0:
+                # TODO how to present this case? No reads found for this Read Group.


@vladsavelyev This is a case I'm stumped over. In v4 of ngsderive, I added a feature where the read group definitions in the BAM header are validated against what appears in the sequences. This now means it's possible for an entry to be NULL, if the read group is defined in the header but lacks any sequences associated with it. There's no data to present here, so it can't go into the barplot. This is most likely an error in the BAM that should be communicated to the user. However I'm not sure how to go about that. I suppose we could log.warning("Read Group info messed up! Yada yada") and just ignore it for the output report... I would like to see it somewhere in the report, but I have no idea about an appropriate way to achieve that.

a-frantz added 3 commits August 18, 2023 10:23

feat: new 'endedness' submodule for ngsderive

22f19f0

fix: lint errors about duplicate HTML IDs

38689b3

docs: note new submodule in CHANGELOG

9a15f7c

a-frantz marked this pull request as ready for review August 18, 2023 14:44

vladsavelyev mentioned this pull request Aug 21, 2023

feat: new test file for new ngsderive endedness subcommand MultiQC/test-data#262

Merged

vladsavelyev self-requested a review August 21, 2023 15:13

vladsavelyev requested changes Aug 21, 2023

View reviewed changes

chore: apply suggestions from @vladsavelyev

87ab029

vladsavelyev self-requested a review August 22, 2023 11:10

vladsavelyev approved these changes Aug 22, 2023

View reviewed changes

vladsavelyev changed the title ~~feat: new 'endedness' submodule for ngsderive~~ ngsderive: new 'endedness' submodule Aug 23, 2023

vladsavelyev added the awaits-review Awaiting final review and merge. label Aug 23, 2023

vladsavelyev and others added 2 commits August 28, 2023 14:11

Merge branch 'master' into ngsderive-endedness

2f5280b

Remove commas to stop Black from splitting into separate lines

06bb8f7

ewels requested changes Aug 28, 2023

View reviewed changes

a-frantz and others added 3 commits August 28, 2023 12:55

feat: accept multiple expected_header_counts

d5b2ef7

Set literals

32631e1

Fix for missing Reads per template col

0cfa8f8

ewels added waiting: changes Issue / PR is on hold, waiting for requested changes and removed awaits-review Awaiting final review and merge. labels Aug 30, 2023

Merge branch 'master' into ngsderive-endedness

02dd9ed

Merge branch 'master' into ngsderive-endedness

7e56a5c

vladsavelyev self-requested a review September 27, 2023 16:45

a-frantz added 3 commits October 2, 2023 13:08

[WIP]feat(ngsderive): add partial support for v4 release

3284d22

fix: better handling of v4 changes

f53943c

a-frantz commented Oct 2, 2023

View reviewed changes

fix: handle '%' literal in v4 results

a8f8d1e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ngsderive: new 'endedness' submodule #1992

ngsderive: new 'endedness' submodule #1992

a-frantz commented Aug 18, 2023 •

edited

vladsavelyev left a comment

vladsavelyev commented Aug 22, 2023

a-frantz commented Aug 25, 2023

ewels commented Aug 28, 2023

ewels left a comment

ewels Aug 28, 2023

a-frantz Aug 28, 2023

vladsavelyev Aug 29, 2023 •

edited

ewels Aug 29, 2023

ewels Aug 28, 2023

a-frantz Aug 28, 2023

vladsavelyev Aug 29, 2023 •

edited

vladsavelyev Aug 29, 2023 •

edited

a-frantz Aug 29, 2023

a-frantz Aug 29, 2023

ewels Aug 29, 2023 •

edited

a-frantz commented Aug 28, 2023

ewels commented Aug 29, 2023

a-frantz commented Sep 8, 2023

vladsavelyev commented Sep 16, 2023

a-frantz Oct 2, 2023

a-frantz Oct 2, 2023

ngsderive: new 'endedness' submodule #1992

Are you sure you want to change the base?

ngsderive: new 'endedness' submodule #1992

Conversation

a-frantz commented Aug 18, 2023 • edited

vladsavelyev left a comment

Choose a reason for hiding this comment

vladsavelyev commented Aug 22, 2023

a-frantz commented Aug 25, 2023

ewels commented Aug 28, 2023

ewels left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vladsavelyev Aug 29, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vladsavelyev Aug 29, 2023 • edited

Choose a reason for hiding this comment

vladsavelyev Aug 29, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ewels Aug 29, 2023 • edited

Choose a reason for hiding this comment

a-frantz commented Aug 28, 2023

ewels commented Aug 29, 2023

a-frantz commented Sep 8, 2023

vladsavelyev commented Sep 16, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

a-frantz commented Aug 18, 2023 •

edited

vladsavelyev Aug 29, 2023 •

edited

vladsavelyev Aug 29, 2023 •

edited

vladsavelyev Aug 29, 2023 •

edited

ewels Aug 29, 2023 •

edited