Large data mode #182

Closed · wants to merge 76 commits

Conversation

@evanroyrees (Collaborator) commented Aug 3, 2021

Behavior

Iterates through taxa and selects contigs to cluster based on a provided upper bound (--max-partition-size). If a taxon's contig set is below the upper bound, the taxon's k-mer counts are normalized and embedded prior to recursive clustering. If the taxon's contig set is below the lower bound (min_contigs = max([pca_dimensions + 1, embed_dimensions + 1])), the contigs' k-mer embeddings are retrieved from the canonical rank embedding and then the same recursive clustering implementation begins.
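For illustration, a minimal, runnable sketch of this partition-size gating (function and variable names here are hypothetical stand-ins, not the actual Autometa implementation):

```python
from typing import Dict, List, Tuple

def select_partition_actions(
    taxon_sizes: Dict[str, int],  # taxon name -> number of contigs
    max_partition_size: int = 10000,
    pca_dimensions: int = 50,
    embed_dimensions: int = 2,
) -> List[Tuple[str, str]]:
    """Decide how each taxon partition should be handled before clustering."""
    min_contigs = max(pca_dimensions + 1, embed_dimensions + 1)
    actions = []
    for taxon, n_contigs in taxon_sizes.items():
        if n_contigs > max_partition_size:
            actions.append((taxon, "defer"))      # above upper bound; split at a lower rank
        elif n_contigs < min_contigs:
            actions.append((taxon, "canonical"))  # below lower bound; reuse canonical rank embedding
        else:
            actions.append((taxon, "embed"))      # normalize + embed, then recursively cluster
    return actions

# With the illustrative defaults above, min_contigs = max(51, 3) = 51
print(select_partition_actions({"Bacteroides": 120000, "Prevotella": 800, "rare_taxon": 12}))
```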

Features/Additions

  • New entrypoint: autometa-large-data-mode-binning
    • --max-partition-size
    • --cache
    • --binning-checkpoints
  • Add binning checkpoint behavior
  • Add embedding caching behavior
  • Add new entrypoint autometa-large-data-mode-binning-loginfo that parses the log file generated by autometa-large-data-mode-binning
    • Differentiate timing between normalization, embedding, misc. and clustering
  • Rename entrypoint autometa-parse-bed to autometa-bedtools-genomecov

This removes the warning from bedtools, 'WARNING: Genome (-g) files are ignored when BAM input is provided.'
Testing confirmed that `bedtools genomecov -ibam records.bam > output.tsv` creates the same file as `bedtools genomecov -ibam records.bam -g lengths.tsv > output.tsv`.
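For anyone wanting to reproduce that check, a quick sketch (file names are placeholders; requires bedtools on PATH):

```python
import filecmp
import subprocess

# Run bedtools genomecov with and without the -g genome file...
with open("with_genome.tsv", "w") as out:
    subprocess.run(
        ["bedtools", "genomecov", "-ibam", "records.bam", "-g", "lengths.tsv"],
        stdout=out, check=True,
    )
with open("without_genome.tsv", "w") as out:
    subprocess.run(
        ["bedtools", "genomecov", "-ibam", "records.bam"],
        stdout=out, check=True,
    )

# ...and confirm the outputs are byte-for-byte identical.
assert filecmp.cmp("with_genome.tsv", "without_genome.tsv", shallow=False)
```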
Binning now uses embeddings from the canonical rank, or from the specific rank name within the canonical rank, depending on the rank partition size
…ecified type

🎨 Add string instance check in kmers.embed(...) for pca_dimensions and attempt to convert to int if str given.
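Roughly this shape (a sketch; the actual check lives inside kmers.embed(...)):

```python
def coerce_pca_dimensions(pca_dimensions) -> int:
    # Accept "50" as well as 50; fail loudly on anything non-numeric.
    if isinstance(pca_dimensions, str):
        try:
            pca_dimensions = int(pca_dimensions)
        except ValueError:
            raise ValueError(f"pca_dimensions must be an int, got {pca_dimensions!r}")
    return pca_dimensions
```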
🐛 Add if statement to check whether user specified an output filepath to update logger message in parse(...)
🎨 Add script to extract log information from recursive_dbscan
🎨 Add autometa-binning-loginfo entrypoint for extracting binning log information
🐛 Fix checkpoint file writing logic where spaces were prepended to comment (#) lines
🐛 Fix merge logic when updating a checkpoint
🐛🎨 Add gzip functionality for writing of binning checkpoints file
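A sketch of the gzip-aware write path (the checkpoint content and file naming here are assumptions, not Autometa's exact layout):

```python
import gzip

def write_checkpoints(checkpoints: str, filepath: str) -> None:
    # Pick the opener from the extension so .tsv and .tsv.gz share one code path.
    opener = gzip.open if filepath.endswith(".gz") else open
    with opener(filepath, "wt") as fh:
        fh.write(checkpoints)

write_checkpoints("#canonical_rank\tcluster\n", "binning_checkpoints.tsv.gz")
```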
🎨 Clean logger emit message after retrieval of checkpoint info
@evanroyrees added the enhancement label Aug 3, 2021
🎨 Move search_prot_accessions(...) method to NCBI class
🎨 Add checks for prot.accession2taxid.FULL.gz in NCBI class
The LCA blast2lca method now tries to retrieve taxids from sseqid_to_taxid_output iff it has already been written. This allows skipping the expensive step of parsing the prot.accession2taxid databases and the blast table.
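A hedged sketch of that short-circuit (the output is assumed to be a two-column sseqid/taxid table):

```python
import os

def read_cached_taxids(sseqid_to_taxid_output: str) -> dict:
    """Return the cached sseqid -> taxid mapping iff it was already written."""
    if not (os.path.exists(sseqid_to_taxid_output) and os.path.getsize(sseqid_to_taxid_output)):
        # Caller falls back to parsing prot.accession2taxid and the blast table.
        return {}
    with open(sseqid_to_taxid_output) as fh:
        return dict(line.strip().split("\t") for line in fh)
```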
🎨🐛 Account for bug where output filepath corresponds to a non-existent directory. Now creates all output directories in outpath that do not exist.
🎨 default is -1, which uses all CPUs (for negative values, joblib uses n_cpus + 1 + core_dist_n_jobs workers)
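This follows the standard joblib/scikit-learn convention for negative n_jobs values, e.g.:

```python
import os

def effective_workers(n_jobs: int, n_cpus: int = os.cpu_count()) -> int:
    # joblib convention: positive values are taken as-is; for negative values,
    # (n_cpus + 1 + n_jobs) workers are used, so -1 == all CPUs, -2 == all but one.
    return n_jobs if n_jobs > 0 else max(n_cpus + 1 + n_jobs, 1)

print(effective_workers(-1))  # all CPUs
print(effective_workers(-2))  # all CPUs but one
```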
…tered metagenome and is trying to retrieve gc_content and other assembly stats.
gc_content_stddev : float, optional
    cluster GC content threshold to retain cluster (the default is 5.0).
taxonomy : bool, optional
@Sidduppal (Collaborator) commented Oct 19, 2021:

Great work on specifying whether the user has provided a taxonomy or not; however, there is no taxonomy variable present in the function declaration, and no taxonomy variable is used in the function.
How can this entrypoint be useful if the user hasn't specified any taxonomy file, since the cluster_by_taxon_partitioning function is not checking for that?

@evanroyrees (Collaborator, Author) replied:

see filter_taxonomy(..., rank=args.rank_name_filter) call in main().

Also see def filter_taxonomy(...) in autometa/binning/utilities.py L72.

Comment on lines +563 to +599
```python
parser.add_argument(
    "--kmers",
    help="Path to k-mer counts table",
    metavar="filepath",
    required=True,
)
parser.add_argument(
    "--coverages",
    help="Path to metagenome coverages table",
    metavar="filepath",
    required=True,
)
parser.add_argument(
    "--gc-content",
    help="Path to metagenome GC contents table",
    metavar="filepath",
    required=True,
)
parser.add_argument(
    "--markers",
    help="Path to Autometa annotated markers table",
    metavar="filepath",
    required=True,
)
parser.add_argument(
    "--taxonomy",
    metavar="filepath",
    help="Path to Autometa assigned taxonomies table",
    required=True,
)
parser.add_argument(
    "--output-binning",
    help="Path to write Autometa binning results",
    metavar="filepath",
    required=True,
)
parser.add_argument(
```
A Collaborator commented:

When asking for different input files, I feel it would be nice if you could mention the specific format of each file, i.e. what headers should be present.

Comment on lines +675 to +681
```python
parser.add_argument(
    "--max-partition-size",
    help="Maximum number of contigs to consider for a recursive binning batch.",
    default=10000,
    metavar="int",
    type=int,
)
```
A Collaborator commented:

I feel it would be nice to include what determines the selection of --max-partition-size. How did you arrive at the 10,000 value? It would help prospective users to know how to select this value; basically, when running this, how should a user know what partition size to use?

Comment on lines +600 to +602
"--output-main",
help="Path to write Autometa main table used during/after binning",
metavar="filepath",
A Collaborator commented:

The difference between the binning file and the main file is not clear from the help text.

🔥📝 Remove unused variables in docstring
🎨🐛 Drop metric columns when adding new metrics to avoid addition of suffixes
🔥🎨 Remove dropcols variable to isolate cluster metrics columns to add_metrics(...) and apply_binning_metrics_filter(...)
🔥🎨 Only drop cluster column in run_hdbscan(...)
🎨 Rename 'rank' variable to the more specific 'canonical_rank'.
🎨 Add logic to retrieve the previous canonical rank k-mer embedding if the current canonical rank embedding is not possible.
This first checks whether the variable has been given a value before the filepath and filesize checks.
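A hedged sketch of that fallback (the rank order and cache file naming are assumptions, not Autometa's exact layout):

```python
import os

CANONICAL_RANKS = ["species", "genus", "family", "order", "class", "phylum", "superkingdom"]

def find_rank_embedding(cache_dir: str, canonical_rank: str) -> str:
    """Walk from the requested rank toward broader ranks until a cached,
    non-empty embedding file is found; return '' to trigger re-embedding."""
    for rank in CANONICAL_RANKS[CANONICAL_RANKS.index(canonical_rank):]:
        fpath = os.path.join(cache_dir, f"{rank}.kmers.embed.tsv")
        # Value check first, then the filepath existence and filesize checks.
        if fpath and os.path.exists(fpath) and os.path.getsize(fpath):
            return fpath
    return ""
```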
…-binning entrypoints

Both clustering algorithms may be passed a parameter to allow them to use more cores for parallelization.

e.g. HDBSCAN(..., core_dist_n_jobs) and DBSCAN(..., n_jobs)

This parameter has been propagated through to these functions rather than the previously hardcoded -1, due to errors raised from the joblib.externals library while using HDBSCAN. The errors arising from n_jobs=-1 are infrequent, but frequent enough to merit giving the user more control.

Some example exceptions that were raised:

- `[11/17/2021 10:53:42 PM ERROR] concurrent.futures: exception calling callback for <Future at 0x7fd2083a9c10 state=finished raised BrokenProcessPool>`
- `joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.`
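A sketch of the propagation (core_dist_n_jobs and n_jobs are the real HDBSCAN/DBSCAN parameter names; the wiring around them is illustrative):

```python
from hdbscan import HDBSCAN
from sklearn.cluster import DBSCAN

def get_clusterers(n_jobs: int):
    # Pass the user-supplied cpus value through instead of hardcoding -1.
    hdb = HDBSCAN(min_cluster_size=2, core_dist_n_jobs=n_jobs)
    db = DBSCAN(eps=0.3, min_samples=2, n_jobs=n_jobs)
    return hdb, db

# e.g. from a binning entrypoint: get_clusterers(n_jobs=args.cpus)
```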
🐛 Add input type conversion for --cpus arg in binning entrypoints
…ternals using joblib.

🔥🐛 This is a somewhat known error, as similar messages have been discussed [here](scikit-learn/scikit-learn#21685)
and in the [hdbscan GH pull request #495](scikit-learn-contrib/hdbscan#495).

The error message is emitted from joblib.externals.loky.process_executor._RemoteTraceback as a ValueError:

'ValueError: buffer source array is read-only.'

So far this has not been encountered with scikit-learn version 0.24.
Labels: enhancement (New feature or request), python (Python related issues/code)

Successfully merging this pull request may close these issues:

Large data mode needs more granular checkpointing during clustering