ngsderive: new 'endedness' submodule #1992
base: main
Conversation
Thanks for adding another feature! Looks great, just some comments mostly relevant to formatting.
@ewels:
@a-frantz not really I'm afraid, at least - not yet. I've been wanting to add some core functionality to handle groups of samples / results for years. I actually had a stab at it back in 2017 (time flies 🤯) in #576 👀 It will happen at some point, but it's very complex and might be one of the things that ends up in the bigger v2 milestone that we're starting to build. Feel free to follow #542 for progress. In the meantime, could you just treat each as a separate sample, adding the read group as a suffix to the sample name? It's a little crude, but it would get the data into the report at least. I think Picard MarkDuplicates does this already. Also DRAGEN, VerifyBamId and some other modules do stuff with read groups too it seems, based on a quick project-wide search.
A couple of minor points that need fixing up:
```python
headers["FyLn"] = {
    "title": "f+l- count",
    "description": "Number of reads with the 'first in template' FLAG set and the 'last in template' FLAG unset",
    "format": "{:,d}",
```
Should use `"shared_key": "read_count"` and the appropriate number base modifiers. See docs.
Hmm, docs are a bit thin on this. Example here:
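For illustration, the header from this PR with a shared key added might look like the sketch below. This is only an assumption of the intended shape; whether extra modifier options are needed on top of `shared_key` depends on the MultiQC version.

```python
# Sketch: same header as in the diff, with the suggested shared key.
headers = {}
headers["FyLn"] = {
    "title": "f+l- count",
    "description": (
        "Number of reads with the 'first in template' FLAG set "
        "and the 'last in template' FLAG unset"
    ),
    "format": "{:,d}",
    "shared_key": "read_count",  # scale all read-count columns together
}
```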
Would this automatically scale the values? As I shared in another comment, these data are being checked for exact equality. Any scaling would mask problems users should know about. The value of the ones place is important, as off-by-one errors are the most common class of error.
I think that for readability, decimal separators do the job, so there is no need to use read count multipliers that hide the precision that's important here. Unless we add another derived metric, like the difference between f+l- and f-l+ (not sure how creative MultiQC should be here).
ok 👍🏻
Worth considering a custom shared key anyway, to preserve bar scaling across columns? I think that `read_count` might do some magic with modifiers (?) but a custom string shouldn't.
```python
anchor="ngsderive-endedness",
description="""Predicted library endedness provided by ngsderive. For more information, please see
[the documentation](https://stjudecloud.github.io/ngsderive/subcommands/endedness/).""",
plot=table.plot(table_data, headers, config),
```
Is it possible to represent these counts as a stacked bar plot, instead of a table? They're all subsets of a whole, right? Or could they sum to > 100%?
They do sum to 100%, so a stacked bar plot is possible. I'm just afraid it might be less informative than the raw numbers. Which is admittedly counter-intuitive. Let me explain how these data will be used.
These data are more or less meant as sanity checks. The patterns you expect to see are:
Paired-End sample: `f+l- == f-l+ && f+l+ == 0 && f-l- == 0`
Single-End sample: `f+l+ > 0` && everything else `== 0`
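As a toy sketch, the sanity checks above could be written as a small function. The names here are made up for illustration (`fp_ln` for the f+l- count and so on); this is not the actual ngsderive implementation.

```python
def classify_endedness(fp_ln: int, fn_lp: int, fp_lp: int, fn_ln: int) -> str:
    """Toy version of the exact-equality checks described above.

    fp_ln = f+l- count, fn_lp = f-l+ count,
    fp_lp = f+l+ count, fn_ln = f-l- count.
    """
    # Paired-End: f+l- == f-l+, and both f+l+ and f-l- are zero
    if fp_ln > 0 and fp_ln == fn_lp and fp_lp == 0 and fn_ln == 0:
        return "Paired-End"
    # Single-End: only f+l+ reads exist
    if fp_lp > 0 and fp_ln == 0 and fn_lp == 0 and fn_ln == 0:
        return "Single-End"
    return "Unknown"
```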
So in my mind, the easiest way to check those assumptions is with a table, where I can quickly glance and say "Great, these numbers are the same" or "Oh no, something is greater than zero". The degree by which a category is off from its assumed value doesn't really matter that much.
In practice, I'm running this on our data and seeing some off-by-one cases. I'm afraid that presenting this as a bar plot would mask an off-by-one case, because in a plot the difference between 3 million and 3 million + 1 would be impossible to see. But in a table, it's super easy to see a value is off by one.
I could see using the conditional formatting available to tables to color-code some of these cases. Would that be possible? First instinct is "Give each unique numerical value a distinct color" (there will be at most 4). Equal cells will have the same color. Unequal cells would have some contrast. That should facilitate the "at-a-glance" assumption checking.
So far I've ignored the `Reads Per Template` value. That is supposed to be exactly `2.0` (Paired-End) or exactly `1.0` (Single-End). Deviation from those exact values is what's concerning. Any ideas for how to present that more intuitively?
I agree that we should keep the raw numbers instead of rounding or plotting them.
But perhaps, indeed, these checks can be automated, so conditional formatting to highlight samples that break the condition is a good direction to look into.
Another approach would be to just add one more derived metric into the general stats, that would simply show if the sample passes those conditions or not. Similarly formatted to the existing "predicted instrument" and "predicted endedness" columns.
It would also be helpful to add a "bad" example into the test data, where a sample doesn't pass the endedness criteria.
Let me complicate things for us: there is a parameter to this command which enables processing only a subset of the BAM file. When this mode is active, there is obviously some amount of deviation from the exact conditions I've outlined previously, just due to the randomness inherent in only processing a portion of the available reads. When run in this mode, the user supplies a tolerance range for determining the conditions, e.g. we can still call the sample Paired-End so long as `0.45 < (f+l-)/((f+l-)+(f-l+)) < 0.55`. This is in contrast to the typical case, when every read is processed, which requires `(f+l-) == (f-l+)`. Similarly, RPT gets rounded to the nearest int during checks when run in "subset mode".
Generally we would discourage users from using this mode, as it will mask errors. But we felt the need to enable it for consistency with the rest of the ngsderive commands. As you can imagine, checking out every read in the file takes a long time. A user has the option to get a decent enough guess in a few seconds if that's what their use case calls for.
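To make the contrast between the two checking modes concrete, here is a hedged sketch. Function and parameter names are invented for illustration; the `0.45`/`0.55` bounds come from the example tolerance above.

```python
def looks_paired_end(fp_ln: int, fn_lp: int, subset_mode: bool = False,
                     tolerance: float = 0.05) -> bool:
    """Illustrative check: exact equality normally, a ratio window in subset mode."""
    total = fp_ln + fn_lp
    if total == 0:
        return False
    if subset_mode:
        # e.g. 0.45 < (f+l-) / ((f+l-) + (f-l+)) < 0.55 with the default tolerance
        ratio = fp_ln / total
        return 0.5 - tolerance < ratio < 0.5 + tolerance
    # Full-file mode: every read was inspected, so demand exact equality
    return fp_ln == fn_lp
```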
> Another approach would be to just add one more derived metric into the general stats, that would simply show if the sample passes those conditions or not. Similarly formatted to the existing "predicted instrument" and "predicted endedness" columns.
Because of the inconsistency outlined above, MultiQC can't get into the business of checking these conditions itself. I would not want to imply "conditions met" or "conditions unmet" with colors such as green and red. The condition checking is already done and presented by the value of `Endedness`. That will be `Unknown` if no conditions are met (or `Paired-End` or `Single-End` if those conditions were met). I wouldn't be opposed to coloring `Unknown` red, but I'm not sure it really adds a whole lot. Same for coloring `Paired-End` and `Single-End` green. I could take it or leave it.
I think my first instinct:
> First instinct is "Give each unique numerical value a distinct color" (there will be at most 4). Equal cells will have the same color. Unequal cells would have some contrast. That should facilitate the "at-a-glance" assumption checking.

is still our best bet. I haven't come up with anything better. It wouldn't be particularly useful for the niche case where only a subset of the BAM was processed, but I think that's fine. We can't make everyone happy, so let's focus on the predominant use case.
I wouldn't want to color the RPT cell because it is allowed to deviate some. I suppose `>=3` would be universally "a problem" today, but in the future we might find a valid use case for that (circular templates?). So to future-proof, I wouldn't want to call any RPT value "bad".
> It would also be helpful to add a "bad" example into the test data, where a sample doesn't pass the endedness criteria.

This represents the common "off by one" error case.
Wow, this is a fairly atypical module 😆 It would be great to write all of this in the description / help text above the table, so that users can read about how to interpret the results.
I can't think of any other modules doing this, but it should be possible to get what you want by setting `scale` to `False` and `bgcols` to a dynamically generated dict of discovered read counts + assigned colours. Then you can effectively colour code the table cells according to their numerical status.
If the actual number isn't very interesting, then I think this makes sense for me. As the normal cell background bars don't really tell you anything.
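A rough sketch of that idea, with assumed data shapes: the column names, counts, and palette below are invented for illustration, and `scale: False` / `bgcols` are used as described above (one colour per distinct count, so equal cells match at a glance).

```python
# At most 4 distinct values are expected, so a 4-colour palette suffices.
PALETTE = ["#d4edda", "#cce5ff", "#fff3cd", "#f8d7da"]

counts = {"f+l-": 3_000_000, "f-l+": 3_000_001, "f+l+": 0, "f-l-": 0}

# Map each distinct count to its own background colour.
colour_for = {
    value: PALETTE[i] for i, value in enumerate(sorted(set(counts.values())))
}

headers = {
    col: {
        "title": col,
        "scale": False,  # turn off the default background bars
        # bgcols maps cell values (as strings) to background colours
        "bgcols": {str(v): c for v, c in colour_for.items()},
    }
    for col in counts
}
```

With this, the two off-by-one columns above would get contrasting backgrounds while the two zero columns share one colour.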
This seems like a decent enough solution. Not ideal, but I think we can make it work. My big concern with taking this approach is that I'm not sure if there's a way to "hide" these suffixed "samples" from other sections, mainly the General Stats table. I think having 7 entries with the same sample prefix in one block of the table, and then 6 of them being blank for every other section of the table, would be pretty ugly. What I'm envisioning: Is this possible?
Yup 👍🏻 The module code dictates which samples + data go into the General Stats table / report section separately, so you should have complete control over what goes where. Plenty of precedent on this one. For example, FastQC appends suffixes to sample names for the adapter trimming plot.
@ewels I'm working on a v4 release of ngsderive. It involves some breaking changes that will need to be accommodated in MultiQC*. Would you rather I knock this current PR out first and open another PR to address the v4 changes, or do it all in one PR? What's easiest for your workflow?

*The changes relevant to MultiQC are:
@a-frantz, it's really up to you. If you already know what the new ngsderive output will look like, perhaps it makes sense to update the code in this PR, and also update the MultiQC_TestData examples.
```
* master:
  Just run CI on the oldest + newest supported Python versions (MultiQC#2074)
  Picard: fix parsing mixed strings/numbers, account for trailing tab (MultiQC#2083)
  FastQC: add top overrepresented sequences table (MultiQC#2075)
  Add GitHub Actions bot workflow to fix code linting from a PR comment (MultiQC#2082)
  Use custom exception type instead of `UserWarning` when no samples are found. (MultiQC#2049)
  Lint modules for missing `self.add_software_version` (MultiQC#2081)
  Changelog bot: Update docs (MultiQC#2077)
  Changelog action: remove `.capitalize()`, add changelog entry (MultiQC#2080)
  Add action to populate the change log from PR titles triggered by `@multiqc-bot changelog` (MultiQC#2025)

# Conflicts:
#   CHANGELOG.md
#   multiqc/modules/ngsderive/ngsderive.py
```
```python
read_group = row.get("ReadGroup")
if read_group and read_group != "overall":
    sample_name += "-" + read_group
    row["is_read_group"] = True
else:
    row["is_read_group"] = False
```
@vladsavelyev At this stage of development, the only subcommands which support read group level data are `strandedness` and `endedness`. That data is not guaranteed to be present, as a user must opt into that level of granularity.

I would like to support two different modes for the `strandedness` barplot. The default would be the current implementation (only one "overall" entry per sample). The user could then toggle the "Read Group" mode, which shows samples + their read group level entries. I'll need to play around with the sorting here to see what would be most useful. I'm not sure ATM.

I don't think we need any toggles for the `endedness` table. I think it's suitable to display all found information by default.

Is there an example elsewhere in the code base I could follow for what I'm describing for `strandedness`? I'm not quite sure the best way to approach the problem.
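If I understand MultiQC's bar plot API correctly, passing a list of datasets together with matching `data_labels` entries gives switch buttons above the plot, which sounds like the toggle described here. A sketch with invented sample names and percentages:

```python
# Two datasets a MultiQC bar plot could switch between (assumed shapes).
overall = {
    "sample1": {"forward": 49.9, "reverse": 50.1},
}
per_read_group = {
    "sample1-rg1": {"forward": 49.8, "reverse": 50.2},
    "sample1-rg2": {"forward": 50.0, "reverse": 50.0},
}
pconfig = {
    "id": "ngsderive_strandedness_plot",
    "title": "ngsderive: strandedness",
    # One label per dataset; MultiQC renders these as toggle buttons.
    "data_labels": [{"name": "Overall"}, {"name": "Read Groups"}],
}
# In a module this would become something like:
# plot = bargraph.plot([overall, per_read_group], pconfig=pconfig)
```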
```python
    "reverse": round(float(strandedness.get("ReversePct")) * 100.0, 2),
}
if float(strandedness.get("ForwardPct")) == 0 and float(strandedness.get("ReversePct")) == 0:
    # TODO how to present this case? No reads found for this Read Group.
```
@vladsavelyev This is a case I'm stumped over. In v4 of ngsderive, I added a feature where the read group definitions in the BAM header are validated against what appears in the sequences. This now means it's possible for an entry to be NULL, if the read group is defined in the header but lacks any sequences associated with it. There's no data to present here, so it can't go into the barplot. This is most likely an error in the BAM that should be communicated to the user. However, I'm not sure how to go about that. I suppose we could `log.warning("Read Group info messed up! Yada yada")` and just ignore it for the output report... I would like to see it somewhere in the report, but I have no idea about an appropriate way to achieve that.
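One possible shape for the warn-and-skip option: a sketch only, where the row fields mirror the snippet above, the function name is hypothetical, and the warning text is a placeholder.

```python
import logging

log = logging.getLogger(__name__)

def parse_strandedness_row(row):
    """Return plot data for a row, or None for an empty read group.

    Sketch: skips read groups that are declared in the BAM header
    but have no reads, emitting a warning instead of a plot entry.
    """
    forward = float(row.get("ForwardPct") or 0)
    reverse = float(row.get("ReversePct") or 0)
    if forward == 0 and reverse == 0:
        log.warning(
            "Read group '%s' is defined in the header but has no reads; skipping",
            row.get("ReadGroup"),
        )
        return None
    return {
        "forward": round(forward * 100.0, 2),
        "reverse": round(reverse * 100.0, 2),
    }
```

A module could additionally collect the skipped names and surface them, e.g. in the section description, so the problem is visible in the report and not just the log.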
The newest subcommand of ngsderive is `endedness`. At this moment, it only detects Paired-End vs Single-End libraries (as that's what we have access to), but support for more diverse library types is possible.

This PR adds one column to the general stats table (Predicted Endedness). In the ngsderive section of the report, it adds a new table with 6-7 columns (depending on what options were enabled during runtime of ngsderive). 5 of the columns are numerical, though they are best represented as a table. The algorithm is explained here, but the TL;DR is that a bargraph/linegraph/etc. would not be informative for these numeric values. A table suffices.
- [ ] `CHANGELOG.md` has been updated
- [ ] Code passes lint tests (`--lint` flag)
- [ ] `docs/README.md` is updated with link to below
- [ ] `docs/modulename.md` is created
- [ ] Report sections are added (with `self.add_section`)