New module: BBDuk #1742

Merged: 13 commits, Jan 8, 2023
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -27,6 +27,8 @@

- [**Anglerfish**](https://github.com/remiolsen/anglerfish)
  - A tool designed to assess pool balancing, contamination and insert sizes of Illumina library dry runs on Oxford Nanopore data.
- [**BBDuk**](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/)
  - Combines most common data-quality-related trimming, filtering, and masking operations via kmers into a single high-performance tool.
- [**Cell Ranger**](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger)
  - Works with data from 10X Genomics Chromium. Processes Chromium single cell data to align reads, generate feature-barcode matrices, perform clustering and other secondary analysis, and more.
  - New MultiQC module parses Cell Ranger quality reports from VDJ and count analysis
1 change: 1 addition & 0 deletions docs/README.md
@@ -13,6 +13,7 @@ MultiQC Modules:
Adapter Removal: modules/adapterRemoval.md
AfterQC: modules/afterqc.md
Anglerfish: modules/anglerfish.md
BBDuk: modules/bbduk.md
Bcl2fastq: modules/bcl2fastq.md
BclConvert: modules/bclconvert.md
BioBloom Tools: modules/biobloomtools.md
24 changes: 24 additions & 0 deletions docs/modules/bbduk.md
@@ -0,0 +1,24 @@
---
Name: BBDuk
URL: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/
Description: Tool for common data-quality-related trimming, filtering, and masking operations
---

The BBDuk module produces summary statistics from the stdout log of BBDuk,
a tool in the [BBTools](http://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/) suite.

"Duk" stands for Decontamination Using Kmers. BBDuk was developed to combine
most common data-quality-related trimming, filtering, and masking operations
into a single high-performance tool.

The module can summarise data from the following BBDuk functionality
(descriptions from command line help output):

- `entropy` - entropy filtering
- `ktrim` - kmer trimming
- `qtrim` - quality trimming
- `maq` - read quality filtering
- `ref` - contaminant filtering

Additional information on the BBMap tools is available on
[SeqAnswers](http://seqanswers.com/forums/showthread.php?t=41057).
3 changes: 3 additions & 0 deletions multiqc/modules/bbduk/__init__.py
@@ -0,0 +1,3 @@
from __future__ import absolute_import

from .bbduk import MultiqcModule
170 changes: 170 additions & 0 deletions multiqc/modules/bbduk/bbduk.py
@@ -0,0 +1,170 @@
""" Module to parse output from BBDuk """

import logging
import re
from collections import OrderedDict

from multiqc.modules.base_module import BaseMultiqcModule
from multiqc.plots import bargraph
from multiqc.utils import config

log = logging.getLogger(__name__)


class MultiqcModule(BaseMultiqcModule):
    """BBDuk Module"""

    def __init__(self):

        # Initialise the parent object
        super(MultiqcModule, self).__init__(
            name="BBDuk",
            anchor="bbduk",
            href="https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/",
            info="""is a tool performing common data-quality-related trimming,
            filtering, and masking operations with a kmer based approach""",
            ## One publication, but only for the merge tool:
            # doi="10.1371/journal.pone.0185056",
        )

        ## Define the main bbduk multiqc data object
        self.bbduk_data = dict()

        for f in self.find_log_files("bbduk", filehandles=True):
            self.parse_logs(f)

        self.bbduk_data = self.ignore_samples(self.bbduk_data)

        if len(self.bbduk_data) == 0:
            raise UserWarning

        log.info("Found {} reports".format(len(self.bbduk_data)))

        # Write data to file
        self.write_data_file(self.bbduk_data, "bbduk")

        self.bbduk_general_stats()
        self.bbduk_bargraph_plot()

    def parse_logs(self, f):
        """Parses a BBDuk stdout saved in a file"""

        s_name = f["s_name"]
        for l in f["f"]:
            # Prefer the sample name from the in1= argument of the command line
            if "jgi.BBDuk" in l and "in1=" in l:
                s_name = l.split("in1=")[1].split(" ")[0]
                s_name = self.clean_s_name(s_name, f)

            if "Input:" in l:
                matches = re.search(r"Input:\s+(\d+) reads\s+(\d+) bases", l)
                if matches:
                    self.add_data_source(f, s_name)
                    if s_name in self.bbduk_data:
                        log.debug("Duplicate sample name found! Overwriting: {}".format(s_name))
                    self.bbduk_data[s_name] = dict()

                    self.bbduk_data[s_name]["Input reads"] = int(matches.group(1))
                    self.bbduk_data[s_name]["Input bases"] = int(matches.group(2))
            # Don't start using regexes until we're in that block
            elif "Input reads" in self.bbduk_data.get(s_name, {}):
                cats = [
                    "QTrimmed",
                    "KTrimmed",
                    "Trimmed by overlap",
                    "Low quality discards",
                    "Low entropy discards",
                    "Total Removed",
                    "Result",
                ]
                for cat in cats:
                    # Raw f-string so that \s, \d and \( are not treated as string escapes
                    matches = re.search(rf"{cat}:\s+(\d+) reads \(([\d\.]+)%\)\s+(\d+) bases \(([\d\.]+)%\)", l)
                    if matches:
                        self.bbduk_data[s_name][cat + " reads"] = int(matches.group(1))
                        self.bbduk_data[s_name][cat + " reads percent"] = float(matches.group(2))
                        self.bbduk_data[s_name][cat + " bases"] = int(matches.group(3))
                        self.bbduk_data[s_name][cat + " bases percent"] = float(matches.group(4))
                        break
                    elif "Reads Processed:" in l:
                        # End of the summary block
                        return

    def bbduk_general_stats(self):
        """BBDuk read counts for general stats"""
        headers = OrderedDict()

        headers["Total Removed bases percent"] = {
            "title": "Bases Removed (%)",
            "description": "Percentage of bases removed after filtering",
            "scale": "YlOrBr",
            "max": 100,
        }
        headers["Total Removed bases"] = {
            "title": "Bases Removed ({})".format(config.base_count_prefix),
            "description": "Total Bases removed ({})".format(config.base_count_desc),
            "scale": "Reds",
            "shared_key": "base_count",
            "modify": lambda x: x * config.base_count_multiplier,
            "hidden": True,
        }
        headers["Total Removed reads percent"] = {
            "title": "Reads Removed (%)",
            "description": "Percentage of reads removed after filtering",
            "scale": "OrRd",
            "max": 100,
        }
        headers["Total Removed reads"] = {
            "title": "Reads Removed ({})".format(config.read_count_prefix),
            "description": "Total Reads removed ({})".format(config.read_count_desc),
            "scale": "Reds",
            "shared_key": "read_count",
            "modify": lambda x: x * config.read_count_multiplier,
            "hidden": True,
        }
        headers["Input reads"] = {
            "title": "Total Input Reads ({})".format(config.read_count_prefix),
            "description": "Total number of input reads to BBDuk ({})".format(config.read_count_desc),
            "scale": "Greens",
            "shared_key": "read_count",
            "modify": lambda x: x * config.read_count_multiplier,
            "hidden": True,
        }
        self.general_stats_addcols(self.bbduk_data, headers)

    def bbduk_bargraph_plot(self):
        """
        Bar graph of all filtering results reported by BBDuk.

        Note that the totals across the reported filter categories do not
        exactly match the total reads remaining (I assume there is
        additional default filtering carried out).
        """
        cats = [
            "Result",
            "QTrimmed",
            "KTrimmed",
            "Trimmed by overlap",
            "Low quality discards",
            "Low entropy discards",
        ]
        pconfig = {
            "id": "bbduk-filtered-barplot",
            "title": "BBDuk: Filtered reads",
            "ylab": "Number of Reads",
            "data_labels": [
                {"name": "Reads", "ylab": "Number of Reads"},
                {"name": "Bases", "ylab": "Number of Base Pairs"},
            ],
        }

        self.add_section(
            name="BBDuk: Filtered Reads",
            anchor="bbduk-filtered-reads",
            description="The number of reads removed by various BBDuk filters",
            plot=bargraph.plot(
                [self.bbduk_data, self.bbduk_data],
                [
                    [f"{cat} reads" for cat in cats],
                    [f"{cat} bases" for cat in cats],
                ],
                pconfig,
            ),
        )
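
Review note: to check the parsing logic without a full MultiQC run, the two regexes from `parse_logs` can be exercised in isolation. The following is a minimal sketch, not the module itself. The helper name `parse_bbduk_summary` is hypothetical, and the sample lines were written by hand to fit the regexes in this PR; they were not captured from a real BBDuk run.

```python
import re

# Categories copied from parse_logs() in this PR.
CATEGORIES = [
    "QTrimmed",
    "KTrimmed",
    "Trimmed by overlap",
    "Low quality discards",
    "Low entropy discards",
    "Total Removed",
    "Result",
]

# ASSUMPTION: hand-written lines shaped to fit the regexes, NOT real BBDuk output.
SAMPLE_LOG = (
    "Input:                  \t1000 reads \t\t150000 bases.\n"
    "QTrimmed:               \t100 reads (10.00%) \t2000 bases (1.33%)\n"
    "Total Removed:          \t50 reads (5.00%) \t9500 bases (6.33%)\n"
    "Result:                 \t950 reads (95.00%) \t140500 bases (93.67%)\n"
)


def parse_bbduk_summary(text):
    """Hypothetical helper: collect the counts for one sample into a flat dict."""
    data = {}
    for line in text.splitlines():
        m = re.search(r"Input:\s+(\d+) reads\s+(\d+) bases", line)
        if m:
            data["Input reads"] = int(m.group(1))
            data["Input bases"] = int(m.group(2))
            continue
        if "Input reads" not in data:
            # Mirror the module: ignore everything before the summary block
            continue
        for cat in CATEGORIES:
            m = re.search(rf"{cat}:\s+(\d+) reads \(([\d.]+)%\)\s+(\d+) bases \(([\d.]+)%\)", line)
            if m:
                data[cat + " reads"] = int(m.group(1))
                data[cat + " reads percent"] = float(m.group(2))
                data[cat + " bases"] = int(m.group(3))
                data[cat + " bases percent"] = float(m.group(4))
                break
    return data


summary = parse_bbduk_summary(SAMPLE_LOG)
print(summary["Result reads"], summary["Total Removed bases percent"])
```

Categories that do not appear in a log (here `KTrimmed`) are simply absent from the resulting dict, so downstream code should not assume every key exists.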
4 changes: 4 additions & 0 deletions multiqc/utils/config_defaults.yaml
@@ -670,6 +670,10 @@ module_order:
      module_tag:
        - RNA
        - DNA
  - bbduk:
      module_tag:
        - RNA
        - DNA
  - clipandmerge:
      module_tag:
        - DNA
3 changes: 3 additions & 0 deletions multiqc/utils/search_patterns.yaml
@@ -16,6 +16,9 @@ bamtools/stats:
  contents: "Stats for BAM file(s):"
  shared: true
  num_lines: 10
bbduk:
  contents: "Executing jgi.BBDuk"
  num_lines: 2
bbmap/stats:
  contents: "#Name Reads ReadsPct"
  num_lines: 4
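
Review note: the `bbduk` pattern combines `contents` with `num_lines: 2`. Assuming `num_lines` restricts the `contents` check to the first N lines of a file, a log is only picked up when the `Executing jgi.BBDuk` line appears within its first two lines. The sketch below is a simplified stand-in for that behaviour, not MultiQC's actual file search code. The function name `matches_search_pattern` and the placeholder log lines are hypothetical.

```python
import io
from itertools import islice


def matches_search_pattern(handle, contents, num_lines=None):
    """Simplified stand-in: is `contents` found within the first `num_lines` lines?"""
    lines = handle if num_lines is None else islice(handle, num_lines)
    return any(contents in line for line in lines)


# Placeholder logs (hypothetical, not real BBDuk output).
early = io.StringIO("Executing jgi.BBDuk (placeholder arguments)\nsecond line\n")
late = io.StringIO("wrapper line 1\nwrapper line 2\nExecuting jgi.BBDuk (placeholder arguments)\n")

print(matches_search_pattern(early, "Executing jgi.BBDuk", num_lines=2))
print(matches_search_pattern(late, "Executing jgi.BBDuk", num_lines=2))
```

Under that assumption, a wrapper script that prints more than one line before BBDuk starts could push the marker past line 2 and prevent detection.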
1 change: 1 addition & 0 deletions setup.py
@@ -74,6 +74,7 @@
"afterqc = multiqc.modules.afterqc:MultiqcModule",
"anglerfish = multiqc.modules.anglerfish:MultiqcModule",
"bamtools = multiqc.modules.bamtools:MultiqcModule",
"bbduk = multiqc.modules.bbduk:MultiqcModule",
"bbmap = multiqc.modules.bbmap:MultiqcModule",
"bcftools = multiqc.modules.bcftools:MultiqcModule",
"bcl2fastq = multiqc.modules.bcl2fastq:MultiqcModule",
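
Review note: the `setup.py` line uses the standard Python entry-point syntax `name = package.module:Attribute`. As a minimal illustration of how such a string resolves to an object, the sketch below loads a standard-library target so that it runs without MultiQC installed. The entry-point name `demo` and the group `demo.group` are hypothetical placeholders, not MultiQC's real values.

```python
from importlib.metadata import EntryPoint

# Same "module:attribute" shape as "multiqc.modules.bbduk:MultiqcModule",
# but pointing at a stdlib class (placeholder name and group).
ep = EntryPoint(name="demo", value="collections:OrderedDict", group="demo.group")

# load() imports the module part and returns the named attribute
loaded = ep.load()
print(loaded.__module__, loaded.__name__)
```

For the new module, this means `multiqc/modules/bbduk/__init__.py` must expose `MultiqcModule`, which the three-line `__init__.py` in this PR does.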