Large data mode #182

Closed
wants to merge 76 commits into from
Commits
1660b3c
:memo: Update bug report template
evanroyrees Mar 12, 2021
7909150
Merge branch 'dev' of https://github.com/KwanLab/Autometa into dev
evanroyrees Mar 16, 2021
a0c472f
Merge branch 'dev' of https://github.com/KwanLab/Autometa into dev
evanroyrees Mar 17, 2021
d41a775
Merge branch 'dev' of https://github.com/KwanLab/Autometa into dev
evanroyrees Apr 20, 2021
3bcf59d
Merge branch 'dev' of https://github.com/KwanLab/Autometa into dev
evanroyrees Apr 21, 2021
a6a1f22
:fire: Remove length table provided to bedtools genomecov
evanroyrees Apr 22, 2021
aa74276
--wip-- [skip ci]
evanroyrees Apr 27, 2021
00b895c
:art: Add large-data-mode feature to recursive_dbscan binning function
evanroyrees May 19, 2021
057c2a5
:art: Add large-data-mode feature of recursive rank kmer embedding
evanroyrees May 19, 2021
13c8381
:art: Add --cache argument for caching kmer embeddings.
evanroyrees May 19, 2021
307b3bd
Merge branch 'dev' of https://github.com/WiscEvan/Autometa into dev
evanroyrees May 19, 2021
2787312
Merge branch 'dev' of https://github.com/WiscEvan/Autometa into large…
evanroyrees May 19, 2021
b9f1613
:bug: Fix argparse parameters in recursive_dbscan to convert inputs t…
evanroyrees May 25, 2021
e34fe7f
:memo: Add logging statements between rank counts subsetting logic
evanroyrees May 25, 2021
df3ea72
:bug: Fix incorrect args called in parse(...)
evanroyrees Jun 15, 2021
5282a60
--WIP--
evanroyrees Jul 7, 2021
f2aef76
Merge branch 'large-data-mode' of https://github.com/WiscEvan/Automet…
evanroyrees Jul 7, 2021
791b43f
:bug: import NCBI to be used in writing canonical ranks in main binni…
evanroyrees Jul 7, 2021
ec3750e
:art: clean-up WIP for recursive_dbscan
evanroyrees Jul 12, 2021
e0276c0
Merge branch 'dev' of github.com:KwanLab/Autometa into large-data-mode
evanroyrees Jul 12, 2021
7e18355
Add more logging debug information to determine marker annotation times
evanroyrees Jul 19, 2021
11c02f6
:art: Add more information to logger message for marker gene cutoff f…
evanroyrees Jul 19, 2021
e6dd75c
:art: Add binning caching
evanroyrees Jul 29, 2021
ab1f416
:art: Add binning checkpointing logic
evanroyrees Jul 29, 2021
9d77c04
:fire: Replace print with logger.debug
evanroyrees Aug 2, 2021
0259f2d
:bug: Fix checkpoint restart logic when retrieving binned contigs
evanroyrees Aug 3, 2021
1de779a
:bug: Fix compressed read functionality in get_checkpoint_info(...) f…
evanroyrees Aug 3, 2021
7cfc496
:art: Add logger emit message when reading annotations for binning ut…
evanroyrees Aug 3, 2021
a5e65d8
:art::memo: Add binning-checkpoints parameter and update type hint fo…
evanroyrees Aug 3, 2021
c185633
:bug: Fix gzipped writing of binning checkpoints file
evanroyrees Aug 3, 2021
7f8e216
:bug: Reindex bins for binning checkpoints
evanroyrees Aug 3, 2021
d6f09a6
:art::memo: Add header lines 'Parameters' and 'Runtime Variables' to …
evanroyrees Aug 3, 2021
006b627
:art: Add timestamp info to runtime variables in binning checkpoints
evanroyrees Aug 3, 2021
a0fe1da
:memo: Add lines between parameter docstrings
evanroyrees Aug 4, 2021
92517c0
:art: Move large-data-mode code to large_data_mode.py.
evanroyrees Aug 4, 2021
6a8ea66
:bug: Fix function call for checkpoint(...)
evanroyrees Aug 4, 2021
3743469
Merge branch 'dev' of https://github.com/KwanLab/Autometa into large-…
evanroyrees Aug 9, 2021
98ab81d
:bug::art::fire: Add logger and remove print statements
evanroyrees Aug 11, 2021
5e702b6
:memo: Add description to reindex_bin_names(...) func return docstring
evanroyrees Aug 11, 2021
88398c3
:art: Add copyright, logging and encoding to header of script l.d.m. …
evanroyrees Aug 11, 2021
de1b9a2
:art::memo: Change choices statement for --rank-filter in recursive_d…
evanroyrees Aug 11, 2021
9ec5744
:art::memo::bug: Change choices statement for --rank-filter in large_…
evanroyrees Aug 11, 2021
8c13c19
:art::memo: Add line in contributing with link to good first issues
evanroyrees Aug 11, 2021
7ced8db
:art::memo: Add line in contributing with link to good first issues
evanroyrees Aug 11, 2021
378960c
:art: Add reference to dead_prot.accession2taxid in NCBI class
evanroyrees Aug 24, 2021
9b4ad1b
Merge branch 'large-data-mode' of github.com:WiscEvan/Autometa into l…
evanroyrees Aug 24, 2021
91e252b
:art: Change file extension of unclustered seqs written to unclustere…
evanroyrees Aug 25, 2021
e0507f9
:art::snake: merge upstream dev
evanroyrees Sep 30, 2021
da12095
:art::snake::green_apple: Re-change autometa-parse-bed entrypoint to …
evanroyrees Sep 30, 2021
2f08998
:art::bug: change list-comprehension for taxids search with RMQ in lc…
evanroyrees Oct 1, 2021
c54261d
:art::memo: add docstring to search_prot_accessions(...) in lca.py
evanroyrees Oct 1, 2021
cf9802c
:bug: incorrect type-hint for reduce_taxids_to_lcas(...)
evanroyrees Oct 3, 2021
cba2d8a
:art::bug::memo: Incorrect dict iteration in convert_sseqids_to_taxid…
evanroyrees Oct 4, 2021
41c0c57
:art::bug::fire: clean-up of lca.py
evanroyrees Oct 6, 2021
9fe4149
:art: change logger.warning(...) to logger.warn(...)
evanroyrees Oct 6, 2021
26daab4
:art: Add functionality to parse prot.accession2taxid.FULL.gz
evanroyrees Oct 6, 2021
1d4840a
:art::snake: Add method to read sseqid_to_taxid output table
evanroyrees Oct 11, 2021
05a4c9f
:art::bug: Add exception handling for autometa-length-filter
evanroyrees Oct 14, 2021
23e9c76
:art: Add core_dist_n_jobs param to run_hdbscan(...)
evanroyrees Oct 14, 2021
08983c2
:art: Add logic to handle when the user already has provided a length…
evanroyrees Oct 14, 2021
cc6bef0
:bug: Fix incorrect import in automeat.config.databases from hmmer to…
evanroyrees Oct 15, 2021
8528aa2
Merge branch 'large-data-mode' of https://github.com/WiscEvan/Automet…
evanroyrees Oct 15, 2021
d1e2198
:art::bug::fire: Remove domain kwarg in get_clusters(...)
evanroyrees Oct 19, 2021
9f951c7
:art::bug: Fix cluster metric addition/filter
evanroyrees Oct 21, 2021
366b2cf
:art::bug: Fix bug at canonical rank kmer embedding stage
evanroyrees Oct 26, 2021
461363d
:art::snake: Add numba thread-safe configuration in recursive_dbscan.py
evanroyrees Nov 11, 2021
46e5944
:bug: change sseqid_to_taxid_output check in lca.py
evanroyrees Nov 16, 2021
0f5205f
:art: Add logger emit to large-data-mode corresponding to markers ann…
evanroyrees Nov 16, 2021
b1a903e
:art::bug::snake: Add --cpus param to autometa-binning and autometa-l…
evanroyrees Nov 18, 2021
08a3e36
:art: Add trimap to choices of kmer embed methods (large-data-mode)
evanroyrees Nov 24, 2021
a0cfee4
:arrow_up: pin scikit-learn to 0.24 to prevent errors arising from hd…
evanroyrees Nov 30, 2021
263d220
Merge branch 'dev' of https://github.com/KwanLab/Autometa into large-…
evanroyrees Dec 2, 2021
62c7042
Merge branch 'large-data-mode' of github.com:WiscEvan/Autometa into l…
evanroyrees Dec 2, 2021
f61a92f
:art::snake: Add pd.DataFrame type input handling for get_metabin_sta…
evanroyrees Dec 9, 2021
e39e2cd
Merge branch 'large-data-mode' of https://github.com/WiscEvan/Automet…
evanroyrees Dec 9, 2021
3f28826
:art: refactor get_metabin_stats(...) using pandas groupby object
evanroyrees Dec 21, 2021
857 changes: 857 additions & 0 deletions autometa/binning/large_data_mode.py

Large diffs are not rendered by default.

544 changes: 544 additions & 0 deletions autometa/binning/large_data_mode_loginfo.py

Large diffs are not rendered by default.

493 changes: 193 additions & 300 deletions autometa/binning/recursive_dbscan.py

Large diffs are not rendered by default.

168 changes: 110 additions & 58 deletions autometa/binning/summary.py
@@ -26,6 +26,7 @@

import logging
import os
from typing import Union

import pandas as pd
import numpy as np
@@ -34,7 +35,7 @@

from autometa.taxonomy.ncbi import NCBI
from autometa.taxonomy import majority_vote
from autometa.common import markers
from autometa.common.markers import load as load_markers


logger = logging.getLogger(__name__)
@@ -68,7 +69,8 @@ def write_cluster_records(
):
contigs = set(dff.index)
records = [rec for rec in mgrecords if rec.id in contigs]
outfpath = os.path.join(outdir, f"{cluster}.fna")
file_extension = "fasta" if cluster == "unclustered" else "fna"
outfpath = os.path.join(outdir, f"{cluster}.{file_extension}")
SeqIO.write(records, outfpath, "fasta")
return
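Note on the hunk above: the "unclustered" pseudo-bin is now written with a .fasta extension, while genuine clusters keep .fna. A minimal sketch of the resulting path logic (the helper name below is hypothetical, not part of the PR):

```python
import os


def cluster_output_path(outdir: str, cluster: str) -> str:
    # Mirrors the diff: the "unclustered" pseudo-bin gets ".fasta",
    # every real cluster keeps the ".fna" extension.
    file_extension = "fasta" if cluster == "unclustered" else "fna"
    return os.path.join(outdir, f"{cluster}.{file_extension}")


print(cluster_output_path("metabins", "bin_0001"))     # metabins/bin_0001.fna
print(cluster_output_path("metabins", "unclustered"))  # metabins/unclustered.fasta
```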

@@ -104,17 +106,65 @@ def fragmentation_metric(df: pd.DataFrame, quality_measure: float = 0.50) -> int
return length


def get_agg_stats(
cluster_groups: pd.core.groupby.generic.DataFrameGroupBy, stat_col: str
) -> pd.DataFrame:
"""Compute min, max, (length weighted) mean and median from provided `stat_col`

Parameters
----------
cluster_groups : pd.core.groupby.generic.DataFrameGroupBy
pandas DataFrame grouped by cluster
stat_col : str
column on which to compute min, max, std, median and the length-weighted mean

Returns
-------
pd.DataFrame
index=cluster, columns=[min_{stat_col}, max_{stat_col}, std_{stat_col}, median_{stat_col}, length_weighted_mean_{stat_col}]
"""

def weighted_average(df, values, weights):
return np.average(a=df[values], weights=df[weights])

stats = (
cluster_groups[stat_col]
.agg(["min", "max", "std", "median"])
.rename(
columns={
"min": f"min_{stat_col}",
"max": f"max_{stat_col}",
"std": f"std_{stat_col}",
"median": f"median_{stat_col}",
}
)
)
# stats.assign(range=)
length_weighted_avg = cluster_groups.apply(
weighted_average, values=stat_col, weights="length"
)
length_weighted_avg.name = f"length_weighted_mean_{stat_col}"
return pd.concat([stats, length_weighted_avg], axis=1)
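
A minimal usage sketch of the new get_agg_stats helper, assuming it is importable from autometa.binning.summary as added in this diff; the toy contig names and values are made up for illustration:

```python
import pandas as pd

from autometa.binning.summary import get_agg_stats  # added in this PR

# Toy binning table: two contigs in bin_0001, one in bin_0002.
toy = pd.DataFrame(
    {
        "cluster": ["bin_0001", "bin_0001", "bin_0002"],
        "coverage": [10.0, 30.0, 5.0],
        "length": [1000, 3000, 2000],
    },
    index=pd.Index(["contig_1", "contig_2", "contig_3"], name="contig"),
)

cluster_groups = toy.groupby("cluster")
coverage_stats = get_agg_stats(cluster_groups, "coverage")
# Resulting columns: min_coverage, max_coverage, std_coverage, median_coverage,
# length_weighted_mean_coverage. For bin_0001 the length-weighted mean is
# (10 * 1000 + 30 * 3000) / 4000 = 25.0.
print(coverage_stats)
```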


def get_metabin_stats(
bin_df: pd.DataFrame, markers_fpath: str, cluster_col: str = "cluster"
bin_df: pd.DataFrame,
markers: Union[str, pd.DataFrame],
cluster_col: str = "cluster",
) -> pd.DataFrame:
"""Retrieve statistics for all clusters recovered from Autometa binning.

NOTE
----
If `bin_df` is provided without the 'cluster_col' present, an overview of
metagenome stats will be determined.

Parameters
----------
bin_df : pd.DataFrame
Autometa binning table. index=contig, cols=['cluster','length', 'GC', 'coverage', ...]
markers_fpath : str
Path to autometa annotated and filtered markers table respective to kingdom binned.
Autometa binning table. index=contig, cols=['cluster','length', 'gc_content', 'coverage', ...]
markers : str or pd.DataFrame
Path to or pd.DataFrame of markers table corresponding to contigs in `bin_df`
cluster_col : str, optional
Clustering column by which to group metabins

@@ -124,59 +174,61 @@ def get_metabin_stats(
dataframe consisting of various metagenome-assembled genome statistics indexed by cluster.
"""
logger.info(f"Retrieving metabins' stats for {cluster_col}")
stats = []
markers_df = markers.load(markers_fpath)
for cluster, dff in bin_df.fillna(value={cluster_col: "unclustered"}).groupby(
cluster_col
):
length_weighted_coverage = np.average(
a=dff.coverage, weights=dff.length / dff.length.sum()
if isinstance(markers, str):
markers_df = load_markers(markers)
elif isinstance(markers, pd.DataFrame):
markers_df = markers
else:
raise TypeError(
f"`markers` should be a path to or pd.DataFrame of a markers table corresponding to contigs in `bin_df`. Provided: {markers}"
)
length_weighted_gc = np.average(
a=dff.gc_content, weights=dff.length / dff.length.sum()
)
num_expected_markers = markers_df.shape[1]
cluster_pfams = markers_df[markers_df.index.isin(dff.index)]
if cluster_pfams.empty:
total_markers = 0
num_single_copy_markers = 0
num_markers_present = 0
completeness = pd.NA
purity = pd.NA
else:
pfam_counts = cluster_pfams.sum()
total_markers = pfam_counts.sum()
num_single_copy_markers = pfam_counts[pfam_counts == 1].count()
num_markers_present = pfam_counts[pfam_counts >= 1].count()
completeness = num_markers_present / num_expected_markers * 100
purity = num_single_copy_markers / num_markers_present * 100

stats.append(
{
cluster_col: cluster,
"nseqs": dff.shape[0],
"seqs_pct": dff.shape[0] / bin_df.shape[0] * 100,
"size (bp)": dff.length.sum(),
"size_pct": dff.length.sum() / bin_df.length.sum() * 100,
"N90": fragmentation_metric(dff, quality_measure=0.9),
"N50": fragmentation_metric(dff, quality_measure=0.5),
"N10": fragmentation_metric(dff, quality_measure=0.1),
"length_weighted_gc_content": length_weighted_gc,
"min_gc_content": dff.gc_content.min(),
"max_gc_content": dff.gc_content.max(),
"std_gc_content": dff.gc_content.std(),
"length_weighted_coverage": length_weighted_coverage,
"min_coverage": dff.coverage.min(),
"max_coverage": dff.coverage.max(),
"std_coverage": dff.coverage.std(),
"completeness": completeness,
"purity": purity,
"num_total_markers": total_markers,
f"num_unique_markers (expected {num_expected_markers})": num_markers_present,
"num_single_copy_markers": num_single_copy_markers,
}
)
return pd.DataFrame(stats).set_index(cluster_col).convert_dtypes()

metabin_stat_cols = [cluster_col, "coverage", "length", "gc_content"]
for col in metabin_stat_cols:
if col not in bin_df.columns:
raise ValueError(
f"Required column ({col}) not in bin_df columns: {bin_df.columns}"
)

df = bin_df[metabin_stat_cols].fillna(value={cluster_col: "unclustered"}).copy()

clusters = df.join(markers_df, how="outer").groupby("cluster")

percent_metagenome_size = clusters.length.sum() / df.length.sum() * 100
percent_metagenome_seqs = clusters.size() / df.shape[0] * 100
marker_counts = clusters[markers_df.columns].sum()
cluster_marker_sum = marker_counts.sum(axis=1)
redundant_marker_count = marker_counts.gt(1).sum(axis=1)
single_copy_marker_count = marker_counts.eq(1).sum(axis=1)
unique_marker_count = marker_counts.ge(1).sum(axis=1)
expected_unique_marker_count = markers_df.shape[1]
completeness = unique_marker_count / expected_unique_marker_count * 100
purity = single_copy_marker_count / unique_marker_count * 100
stats_df = pd.DataFrame(
{
"nseqs": clusters.size(),
"size (bp)": clusters.length.sum(),
"completeness": completeness,
"purity": purity,
"marker_sum": cluster_marker_sum,
"unique_marker_count": unique_marker_count,
"single_copy_marker_count": single_copy_marker_count,
"redundant_marker_count": redundant_marker_count,
"expected_unique_marker_count": expected_unique_marker_count,
"percent_of_metagenome_seqs": percent_metagenome_seqs,
"percent_of_metagenome_size": percent_metagenome_size,
"N90": clusters.apply(fragmentation_metric, quality_measure=0.9),
"N50": clusters.apply(fragmentation_metric, quality_measure=0.5),
"N10": clusters.apply(fragmentation_metric, quality_measure=0.1),
}
)
coverage_stats = get_agg_stats(clusters, "coverage")
gc_content_stats = get_agg_stats(clusters, "gc_content")
return (
pd.concat([stats_df, coverage_stats, gc_content_stats], axis=1)
.round(2)
.convert_dtypes()
)
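
A hedged usage sketch of the refactored get_metabin_stats, showing that `markers` may now be either a filepath or a pre-loaded wide-format markers DataFrame; the file names below are placeholders for your own Autometa outputs:

```python
import pandas as pd

from autometa.binning.summary import get_metabin_stats
from autometa.common.markers import load as load_markers

# Placeholder path; the binning table must contain the cluster, coverage,
# length and gc_content columns produced by Autometa binning.
bin_df = pd.read_csv("bacteria.binning.tsv", sep="\t", index_col="contig")

# Passing a filepath: the function calls load_markers(...) internally.
stats_from_path = get_metabin_stats(bin_df=bin_df, markers="bacteria.markers.tsv")

# Passing a DataFrame: it must already be in the wide contig-by-marker
# count format that load_markers produces.
markers_df = load_markers("bacteria.markers.tsv")
stats_from_df = get_metabin_stats(bin_df=bin_df, markers=markers_df)

print(stats_from_df[["completeness", "purity", "N50"]].head())
```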


def get_metabin_taxonomies(