Large data mode #182

Closed
wants to merge 76 commits into from

Commits on Mar 12, 2021

  1. 1660b3c

Commits on Mar 16, 2021

  1. 7909150

Commits on Mar 17, 2021

  1. a0c472f

Commits on Apr 20, 2021

  1. d41a775

Commits on Apr 21, 2021

  1. 3bcf59d

Commits on Apr 22, 2021

  1. 🔥 Remove length table provided to bedtools genomecov

    This removes the warning from bedtools: 'WARNING: Genome (-g) files are ignored when BAM input is provided.'
    In testing, `bedtools genomecov -ibam records.bam > output.tsv` produces the same file as `bedtools genomecov -ibam records.bam -g lengths.tsv > output.tsv`.
    evanroyrees committed Apr 22, 2021
    a6a1f22

Commits on Apr 27, 2021

  1. --wip-- [skip ci]

    evanroyrees committed Apr 27, 2021
    aa74276

Commits on May 19, 2021

  1. 🎨 Add large-data-mode feature to recursive_dbscan binning function

    Binning now uses embeddings from the canonical rank or from the specific rank name within the canonical rank, depending on the rank partition size.
    evanroyrees committed May 19, 2021
    00b895c
  2. 057c2a5
  3. 13c8381
  4. 307b3bd
  5. 2787312

Commits on May 25, 2021

  1. 🐛 Fix argparse parameters in recursive_dbscan to convert inputs to specified type
    
    🎨 Add string instance check in kmers.embed(...) for pca_dimensions and attempt to convert to int if str given.
    evanroyrees committed May 25, 2021
    b9f1613
  2. e34fe7f
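
The argparse fix and the string instance check described above can be illustrated with a minimal sketch. The option names, defaults, and the `coerce_pca_dimensions` helper are hypothetical stand-ins rather than Autometa's actual interface. The point is only that argparse hands back strings unless `type=` is given, and that a callee can defensively convert a string it receives.

```python
import argparse


def coerce_pca_dimensions(pca_dimensions):
    """Return pca_dimensions as an int, converting from str if needed.

    Hypothetical helper: it mirrors the idea of the string instance check,
    not the actual body of kmers.embed(...).
    """
    if isinstance(pca_dimensions, str):
        try:
            return int(pca_dimensions)
        except ValueError as err:
            raise TypeError(
                f"pca_dimensions must be an integer, got {pca_dimensions!r}"
            ) from err
    return pca_dimensions


def build_parser():
    # Hypothetical option names. Without type=..., argparse would pass
    # every command-line value to the caller as a str.
    parser = argparse.ArgumentParser(description="type conversion sketch")
    parser.add_argument("--pca-dimensions", type=int, default=50)
    parser.add_argument("--completeness", type=float, default=20.0)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--pca-dimensions", "30"])
    print(type(args.pca_dimensions).__name__, args.pca_dimensions)
```

Declaring the type on the parser keeps the conversion in one place, while the helper guards callers that bypass the command line and pass a string directly.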

Commits on Jun 15, 2021

  1. 🐛 Fix incorrect args called in parse(...)

    🐛 Add if statement to check whether user specified an output filepath to update logger message in parse(...)
    evanroyrees committed Jun 15, 2021
    df3ea72

Commits on Jul 7, 2021

  1. --WIP--

    evanroyrees committed Jul 7, 2021
    5282a60
  2. f2aef76
  3. 791b43f

Commits on Jul 12, 2021

  1. 🎨 clean-up WIP for recursive_dbscan

    🎨 Add script to extract log information from recursive_dbscan
    🎨 Add autometa-binning-loginfo entrypoint for extracting binning log information
    evanroyrees committed Jul 12, 2021
    ec3750e
  2. e0276c0

Commits on Jul 19, 2021

  1. 7e18355
  2. 11c02f6

Commits on Jul 29, 2021

  1. 🎨 Add binning caching

    evanroyrees committed Jul 29, 2021
    e6dd75c
  2. ab1f416

Commits on Aug 2, 2021

  1. 9d77c04

Commits on Aug 3, 2021

  1. 🐛 Fix checkpoint restart logic when retrieving binned contigs

    🐛 Fix checkpoint file writing logic where spaces were prepended to comment (#) lines
    🐛 Fix merge logic when updating a checkpoint
    🐛🎨 Add gzip functionality for writing of binning checkpoints file
    evanroyrees committed Aug 3, 2021
    0259f2d
  2. 1de779a
  3. 🎨 Add logger emit message when reading annotations for binning utilities

    🎨 Clean logger emit message after retrieval of checkpoint info
    evanroyrees committed Aug 3, 2021
    7cfc496
  4. a5e65d8
  5. c185633
  6. 🐛 Reindex bins for binning checkpoints

    🐛 Add newline characters for each commented param in binning checkpoints
    evanroyrees committed Aug 3, 2021
    7f8e216
  7. 🎨📝 Add header lines 'Parameters' and 'Runtime Variables' to binning checkpoints
    
    🎨 add checkpoint shape to binning checkpoints info
    evanroyrees committed Aug 3, 2021
    d6f09a6
  8. 006b627
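
The checkpoint fixes in this group (comment lines starting with `#` in the first column, one newline per commented parameter, 'Parameters' and 'Runtime Variables' header lines, gzip output) can be sketched with the standard library alone. The file layout, function names, and example keys below are assumptions made for illustration. The actual checkpoint format is not shown in these commit messages.

```python
import csv
import gzip


def write_checkpoint(path, params, runtime, header, rows):
    """Write a gzipped, tab-delimited table preceded by '#' comment lines."""
    with gzip.open(path, "wt", newline="") as fh:
        fh.write("# Parameters\n")
        for key, value in params.items():
            # One comment per line, with '#' in the first column
            fh.write(f"# {key}: {value}\n")
        fh.write("# Runtime Variables\n")
        for key, value in runtime.items():
            fh.write(f"# {key}: {value}\n")
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


def read_checkpoint(path):
    """Return (comment_lines, header, rows) from a checkpoint file."""
    comments, table = [], []
    with gzip.open(path, "rt", newline="") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith("#"):
                comments.append(line)
            elif line:
                table.append(line.split("\t"))
    # Assumes at least a header row follows the comment lines
    return comments, table[0], table[1:]
```

Keeping `#` in the first column matters because readers that skip comments usually test only the start of each line, so a prepended space would turn a comment into a malformed data row.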

Commits on Aug 4, 2021

  1. a0fe1da
  2. 🎨 Move large-data-mode code to large_data_mode.py.

    🎨 Update entrypoints so large data mode is separate from autometa-binning
    🎨 Rename loginfo.py to large_data_mode_loginfo.py to match extracting log info for large_data_mode.py
    🎨 Update loginfo entrypoint to correspond to large_data_mode
    🎨 black formatting on summary.py and unclustered_recruitment.py
    🎨 Refactor common binning functions to binning utilities.py
    evanroyrees committed Aug 4, 2021
    92517c0
  3. 6a8ea66

Commits on Aug 9, 2021

  1. 3743469

Commits on Aug 11, 2021

  1. 98ab81d
  2. 5e702b6
  3. 88398c3
  4. de1b9a2
  5. 9ec5744
  6. 8c13c19
  7. 7ced8db

Commits on Aug 24, 2021

  1. 🎨 Add reference to dead_prot.accession2taxid in NCBI class

    📝🎨 Update logger emitted messages when querying LCA
    evanroyrees committed Aug 24, 2021
    378960c
  2. 9b4ad1b

Commits on Aug 25, 2021

  1. 🎨 Change file extension of unclustered seqs written to unclustered.fasta while binned seqs are written to <cluster>.fna

    evanroyrees committed Aug 25, 2021
    91e252b

Commits on Sep 30, 2021

  1. 🎨🐍 merge upstream dev

    🎨🐍 refactor add_metrics(...) with speed-up code from KwanLab#181
    📝 update autometa_clr(...) docstring
    evanroyrees committed Sep 30, 2021
    e0507f9
  2. da12095

Commits on Oct 1, 2021

  1. 🎨🐛 change list-comprehension for taxids search with RMQ in lca.py to set comprehension
    
    🎨 Refactor sseqids search in prot.accession2taxid.gz and dead_prot.accession2taxid.gz
    🎨📝 Add docstrings and rename variables in large data mode module files
    📝🔥 remove commented mathjax path in conf.py for docs
    evanroyrees committed Oct 1, 2021
    2f08998
  2. c54261d
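
One effect of switching from a list comprehension to a set comprehension is that repeated taxids collapse to a single entry before any downstream query. The sketch below shows that effect only. The function name, the mapping, and the accession strings are hypothetical, and the commit message does not state that deduplication was the motivation.

```python
def unique_taxids(sseqids, sseqid_to_taxid):
    """Collect the distinct taxids for a query's hits, skipping unmapped sseqids."""
    # A set comprehension drops duplicates as it builds the result; the
    # equivalent list comprehension would keep one entry per hit.
    return {
        sseqid_to_taxid[sseqid]
        for sseqid in sseqids
        if sseqid in sseqid_to_taxid
    }
```

When many hits map to the same taxid, the resulting set can be much smaller than the corresponding list, which reduces the number of values any later lookup has to process.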

Commits on Oct 3, 2021

  1. cf9802c

Commits on Oct 4, 2021

  1. cba2d8a

Commits on Oct 6, 2021

  1. 🎨🐛🔥 clean-up of lca.py

    🔥🐛 Remove overwrite of input dict with empty dict
    🔥 Remove unused variable in majority_vote.py args
    🎨🍏 Update param arg for local lca nf module
    evanroyrees committed Oct 6, 2021
    41c0c57
  2. 9fe4149
  3. 🎨 Add functionality to parse prot.accession2taxid.FULL.gz

    🎨 Move search_prot_accessions(...) method to NCBI class
    🎨 Add checks for prot.accession2taxid.FULL.gz in NCBI class
    evanroyrees committed Oct 6, 2021
    26daab4

Commits on Oct 11, 2021

  1. 🎨🐍 Add method to read sseqid_to_taxid output table

    The LCA blast2lca method now tries to retrieve taxids from sseqid_to_taxid_output if and only if it has already been written.
    This skips the expensive step of parsing the prot.accession2taxid databases and the blast table.
    evanroyrees committed Oct 11, 2021
    1d4840a

Commits on Oct 14, 2021

  1. 🎨🐛 Add exception handling for autometa-length-filter

    🎨🐛 Account for bug where output filepath corresponds to a non-existent directory. Now creates all output directories in outpath that do not exist.
    evanroyrees committed Oct 14, 2021
    05a4c9f
  2. 🎨 Add core_dist_n_jobs param to run_hdbscan(...)

    🎨 default is -1, which uses (n_cpus + 1 + core_dist_n_jobs)
    evanroyrees committed Oct 14, 2021
    23e9c76
  3. 🎨 Add logic to handle when the user already has provided a length-filtered metagenome and is trying to retrieve gc_content and other assembly stats.

    evanroyrees committed Oct 14, 2021
    08983c2
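
The output-directory fix for autometa-length-filter in this group amounts to creating any missing parent directories before writing. A minimal sketch, assuming a hypothetical helper name (the actual function in Autometa is not shown in the commit message):

```python
import os


def ensure_parent_dirs(outpath):
    """Create every missing directory leading up to outpath, then return outpath."""
    parent = os.path.dirname(os.path.abspath(outpath))
    # exist_ok=True makes this a no-op when the directory already exists
    os.makedirs(parent, exist_ok=True)
    return outpath
```

Calling the helper right before opening the output file means a path such as `results/filtered/metagenome.fna` works even when neither `results` nor `filtered` exists yet.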

Commits on Oct 15, 2021

  1. cc6bef0
  2. 8528aa2

Commits on Oct 19, 2021

  1. 🎨🐛🔥 Remove domain kwarg in get_clusters(...)

    🔥📝 Remove unused variables in docstring
    evanroyrees committed Oct 19, 2021
    d1e2198

Commits on Oct 21, 2021

  1. 🎨🐛 Fix cluster metric addition/filter

    🎨🐛 Drop metric columns when adding new metrics to avoid addition of suffixes
    🔥🎨 Remove dropcols variable to isolate cluster metrics columns to add_metrics(...) and apply_binning_metrics_filter(...)
    🔥🎨 Only drop cluster column in run_hdbscan(...)
    evanroyrees committed Oct 21, 2021
    9f951c7

Commits on Oct 26, 2021

  1. 🎨🐛 Fix bug at canonical rank kmer embedding stage

    🎨 rename 'rank' variable to more specific 'canonical_rank' variable.
    🎨 Add logic to retrieve previous canonical rank kmer embedding if the current canonical rank embedding is not possible.
    evanroyrees committed Oct 26, 2021
    366b2cf
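
The fallback described above, reusing the previous canonical rank's k-mer embedding when the current rank cannot be embedded, might look roughly like the sketch below. The `embed` callable, the use of `ValueError` as the failure signal, and the function name are all assumptions for illustration, not the code in large_data_mode.py.

```python
def embed_with_fallback(canonical_ranks, embed):
    """Map each canonical rank to an embedding, reusing the previous rank's
    embedding whenever embed(rank) cannot produce one."""
    embeddings = {}
    previous = None
    for canonical_rank in canonical_ranks:
        try:
            current = embed(canonical_rank)
        except ValueError:
            # Hypothetical failure signal, e.g. too few contigs at this rank
            current = previous
        embeddings[canonical_rank] = current
        if current is not None:
            previous = current
    return embeddings
```

Ranks are expected in order from broadest to most specific, so the reused embedding always comes from a coarser rank that did succeed. If the very first rank fails, its entry is `None`.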

Commits on Nov 11, 2021

  1. 461363d

Commits on Nov 16, 2021

  1. 🐛 change sseqid_to_taxid_output check in lca.py

    This first checks whether the variable has been given a value before checking the filepath and file size.
    evanroyrees committed Nov 16, 2021
    46e5944
  2. 0f5205f
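
The ordering of checks described for sseqid_to_taxid_output can be sketched as a small guard. The function name is hypothetical. The relevant detail is that an unset (`None`) path must be ruled out before any filesystem call.

```python
import os


def is_reusable(filepath):
    """True only if filepath was provided, exists, and is non-empty."""
    # Check for a value first: os.path.getsize(None) raises TypeError
    if not filepath:
        return False
    return os.path.exists(filepath) and os.path.getsize(filepath) > 0
```

A caller can then read the cached table when `is_reusable(...)` is true and fall back to the full parsing step otherwise.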

Commits on Nov 18, 2021

  1. 🎨🐛🐍 Add --cpus param to autometa-binning and autometa-large-data-mode-binning entrypoints
    
    Both clustering algorithms may be passed a parameter to allow them to use more cores for parallelization.
    
    e.g. HDBSCAN(..., core_dist_n_jobs) and DBSCAN(..., n_jobs)
    
    This parameter is now propagated through to these functions, replacing the previously hardcoded -1, because of errors raised from the joblib.externals library while using HDBSCAN.
    The errors arising from n_jobs=-1 are infrequent, but frequent enough to merit giving the user more control.
    
    Some example exceptions that were raised:
    
    - `[11/17/2021 10:53:42 PM ERROR] concurrent.futures: exception calling callback for <Future at 0x7fd2083a9c10 state=finished raised BrokenProcessPool>`
    - `joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.`
    evanroyrees committed Nov 18, 2021
    b1a903e
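
To make the difference between a hardcoded -1 and a user-supplied `--cpus` value concrete, the sketch below resolves a joblib-style `n_jobs` value to a worker count using the `(n_cpus + 1 + n_jobs)` rule quoted earlier for `core_dist_n_jobs`. The function name is hypothetical, and the real resolution happens inside joblib rather than in Autometa.

```python
import os


def effective_workers(n_jobs, n_cpus=None):
    """Resolve a joblib-style n_jobs value to a concrete worker count."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # joblib convention: n_cpus + 1 + n_jobs, so -1 means all CPUs
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs
```

On an 8-CPU machine, -1 resolves to 8 workers and -2 to 7, whereas an explicit `--cpus 4` requests exactly 4. Passing the explicit value through to `HDBSCAN(..., core_dist_n_jobs)` and `DBSCAN(..., n_jobs)` is what gives the user the extra control mentioned in the commit message.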

Commits on Nov 24, 2021

  1. 🎨 Add trimap to choices of kmer embed methods (large-data-mode)

    🐛 Add input type conversion for --cpus arg in binning entrypoints
    evanroyrees committed Nov 24, 2021
    08a3e36

Commits on Nov 30, 2021

  1. ⬆️ pin scikit-learn to 0.24 to prevent errors arising from hdbscan internals using joblib.
    
    🔥🐛 This is a somewhat known error as similar messages have been discussed [here](scikit-learn/scikit-learn#21685)
    and on the [hdbscan GH pull-#495](scikit-learn-contrib/hdbscan#495).
    
    The error message is emitted from joblib.externals.loky.process_executor._RemoteTraceback as a ValueError:
    
    'ValueError: buffer source array is read-only.'
    
    So far this has not been encountered with scikit-learn version 0.24
    evanroyrees committed Nov 30, 2021
    a0cfee4

Commits on Dec 2, 2021

  1. 263d220
  2. 62c7042

Commits on Dec 9, 2021

  1. f61a92f
  2. e39e2cd

Commits on Dec 21, 2021

  1. 3f28826