Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Element Biosciences AVITI Bases2fastq support to multiqc #1990

Closed
wants to merge 48 commits into from

Conversation

blajoie
Copy link

@blajoie blajoie commented Aug 17, 2023

  • Adding support for Element Biosciences Bases2Fastq (Fastq generation & Demultiplexing) to MultiQC
  • CHANGELOG.md has been updated
  • adding elembio aviti bases2fastq test data test-data#263
  • Code is tested and works locally (including with --lint flag) - linting in progress.
  • docs/README.md is updated with link to below
  • docs/modulename.md is created
  • Everything that can be represented with a plot instead of a table is a plot
  • Report sections have a description and help text (with self.add_section)
  • There aren't any huge tables with > 6 columns (explain reasoning if so)
  • Each table column has a different colour scale to its neighbour, which relates to the data (eg. if high numbers are bad, they're red)
  • Module does not do any significant computational work

cc @YuheCheng62 @ajaltomare

@blajoie
Copy link
Author

blajoie commented Aug 17, 2023

Example MultiQC reports cc @YuheCheng62

Cloudbreak-DVT-ecoli-wgs-2x150
12 runs, 1152 samples total. Each run is a 96plex of Ecoli WGS libraries, sequenced on AVITI.
https://element-public-data.s3.us-west-2.amazonaws.com/multiqc/Cloudbreak-DVT-ecoli-wgs-2x150.html

Cloudbreak-DVT-human-wgs-2x150
10 runs, 20 samples total. Each run is a 2plex of Human WGS libraries, sequenced on AVITI.
https://element-public-data.s3.us-west-2.amazonaws.com/multiqc/Cloudbreak-DVT-human-wgs-2x150.html

Cloudbreak-DVT-human-rnaseq-2x75
10 runs, 160 samples total. Each run is a 16plex of Human UHHR Rna-Seq libraries, sequenced on AVITI.
https://element-public-data.s3.us-west-2.amazonaws.com/multiqc/Cloudbreak-DVT-human-rnaseq-2x75.html

@blajoie
Copy link
Author

blajoie commented Aug 17, 2023

PR into MultiQC_TestData for supprting test data. cc @YuheCheng62
MultiQC/test-data#263

Copy link
Member

@ewels ewels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very fast first-pass review to check some of the common gotchas that come up in PRs. A few tweaks to change here, I'll come back for a more thorough review soon. (Haven't even tried running it yet).

docs/modules/base2fastq.md Outdated Show resolved Hide resolved
docs/modules/base2fastq.md Outdated Show resolved Hide resolved
multiqc/utils/search_patterns.yaml Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
multiqc/modules/bases2fastq/bases2fastq.py Show resolved Hide resolved
multiqc/modules/bases2fastq/bases2fastq.py Show resolved Hide resolved
@ewels
Copy link
Member

ewels commented Aug 18, 2023

ok I still have 4 minutes before my next meeting so tried generating a report very quickly with the test data. Speed notes:

  • Got errors about making Matplotlib figures. Haven't looked into why yet.
  • Colours in tables are all defaults. Colour matters.
  • Please add a column or two to general stats for comparison against results from other tools. Yield maybe?
  • Run name - set Scale to None for text cells, it'll fix text wrapping.
  • Please write more help text. Explain the why of the plot, what to look for, implications, how to spot bad samples etc. Current help text is more suitable as the description field.

Comment on lines 159 to 160
self.groupDict = dict()
self.groupLookupDict = dict()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should stick to the lower_case_with_underscores naming convention for variables and functions.

(and also note to self - this is something that can be checked automatically with linters)

plotContent = dict()
for s_name in runData.keys():
runStats = dict()
runStats.update({"#Polonies": runData[s_name]["NumPolonies"]})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#Polonies should be rounded to read or bases counts in the general stats table. See how it's done for the bcl2fastq module as a reference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for other metrics. Please take a look if similar metrics that are already reported in other modules, and try to use the same naming style, value formatting, and color scheme. E.g. bcl2fastq has Q30, yield, mean base quality. FastQC can be a reference for other metrics and plots.

return html, plotName, anchor, description, helptext, plotContent


def plot_per_cycle_N_content(sampleData, groupLookupDict, colorDict):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the FastQC code for that plot be adapted here?

@vladsavelyev
Copy link
Member

vladsavelyev commented Aug 18, 2023

I tried to open the first example that you shared (
https://element-public-data.s3.us-west-2.amazonaws.com/multiqc/Cloudbreak-DVT-ecoli-wgs-2x150.html) and it froze my browser :( I think the culprits are the sample-level plots, but can't understand what exactly is the bottleneck in rendering. Would be good to look more because even the test sample data makes the rendering veeery slow. We might want to revert to static plots here. But something to get back to when we replace the plotting library.

Overall, the module is a great addition, but I would ask to adjust the code style (particularly, Python's standard is to use lower_case_underscore naming for variables and functions), and to look at similar modules like fastqc bclf2fastq to reuse and adapt the existing code for similar plots and metrics.

CHANGELOG.md Outdated Show resolved Hide resolved
docs/README.md Outdated Show resolved Hide resolved
@vladsavelyev vladsavelyev self-requested a review August 28, 2023 12:31
Copy link
Member

@vladsavelyev vladsavelyev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few change suggestions, and make sure to update the branch from master. Otherwise, good with me!

YuheCheng62 and others added 5 commits August 28, 2023 10:46
Co-authored-by: Vlad Savelyev <vladislav.sav@gmail.com>
Co-authored-by: Vlad Savelyev <vladislav.sav@gmail.com>
Co-authored-by: Vlad Savelyev <vladislav.sav@gmail.com>
@vladsavelyev vladsavelyev self-requested a review August 28, 2023 22:12
Copy link
Member

@vladsavelyev vladsavelyev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome stuff!

@vladsavelyev vladsavelyev added the awaits-review Awaiting final review and merge. label Aug 28, 2023
@ewels ewels removed the awaits-review Awaiting final review and merge. label Sep 1, 2023
@vladsavelyev vladsavelyev self-requested a review September 1, 2023 11:45
"format": "{d}",
"description": "The (total) number of polonies calculated for the run",
"min": 0,
"scale": "RdYlGn",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just repeating some code comments that might be lost now :)

#Polonies should be rounded to read or bases counts in the general stats table. See how it's done for the bcl2fastq module as a reference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for other metrics. Could you take a look if similar metrics that are already reported in other modules, and try to use the same naming style, value formatting, and color scheme. E.g. bcl2fastq has Q30, yield, mean base quality. FastQC can be a reference for other metrics and plots.

return html, plot_name, anchor, description, helptext, plot_content


def plot_per_cycle_N_content(sample_data, group_lookup_dict, color_dict):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great to adapt the FastQC code for that plot

@vladsavelyev vladsavelyev self-requested a review September 5, 2023 09:32
Copy link
Member

@vladsavelyev vladsavelyev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor changes requested - see the code comments above!

@ewels ewels added this to the MultiQC v1.16 milestone Sep 14, 2023
@ewels ewels mentioned this pull request Sep 14, 2023
@ewels
Copy link
Member

ewels commented Sep 14, 2023

ok, the remaining changes are very minor. I just tried to fix the merge conflict but couldn't push it back - because the fork is under Elembio (and not a user account), GitHub is a bit more strict about allowing us to push into the PR directly.

To keep the momentum, I've created a new branch that we can work in and an associated PR: #2044

Hopefully we can do these final minor changes and get this merged soon 😄

@MultiQC MultiQC locked as off-topic and limited conversation to collaborators Sep 22, 2023
@ewels
Copy link
Member

ewels commented Sep 22, 2023

Locking the conversation here so that we remember to move over to #2044

@ewels
Copy link
Member

ewels commented Sep 22, 2023

Actually I think I will just close this PR for clarity.

@ewels ewels closed this Sep 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants