Large data mode #182

Closed · wants to merge 76 commits

Conversation

@evanroyrees (Collaborator) commented Aug 3, 2021

Behavior

Iterates through taxa and selects contigs to cluster based on a provided upper bound (--max-partition-size). If a taxon's contig set is below the upper bound, the taxon's k-mer counts are normalized and embedded prior to recursive clustering. If the taxon's contig set is below the lower bound (min_contigs = max([pca_dimensions + 1, embed_dimensions + 1])), the contigs' k-mer embeddings are retrieved from the canonical rank embedding and then the same recursive clustering implementation begins.
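For illustration, a minimal, runnable sketch of this partition-size gating (function and variable names here are hypothetical stand-ins, not the actual Autometa implementation):

```python
from typing import Dict, List, Tuple

def select_partition_actions(
    taxon_sizes: Dict[str, int],  # taxon name -> number of contigs
    max_partition_size: int = 10000,
    pca_dimensions: int = 50,
    embed_dimensions: int = 2,
) -> List[Tuple[str, str]]:
    """Decide how each taxon partition should be handled before clustering."""
    min_contigs = max(pca_dimensions + 1, embed_dimensions + 1)
    actions = []
    for taxon, n_contigs in taxon_sizes.items():
        if n_contigs > max_partition_size:
            actions.append((taxon, "defer"))      # above upper bound; split at a lower rank
        elif n_contigs < min_contigs:
            actions.append((taxon, "canonical"))  # below lower bound; reuse canonical rank embedding
        else:
            actions.append((taxon, "embed"))      # normalize + embed, then recursively cluster
    return actions

# With the illustrative defaults above, min_contigs = max(51, 3) = 51
print(select_partition_actions({"Bacteroides": 120000, "Prevotella": 800, "rare_taxon": 12}))
```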

Features/Additions

  • New entrypoint: autometa-large-data-mode-binning
    • --max-partition-size
    • --cache
    • --binning-checkpoints
  • Add binning checkpoint behavior
  • Add embedding caching behavior
  • Add new entrypoint autometa-large-data-mode-binning-loginfo that parses the log file generated by autometa-large-data-mode-binning
    • Differentiate timing between normalization, embedding, misc. and clustering
  • Rename entrypoint autometa-parse-bed to autometa-bedtools-genomecov

This removes the warning from bedtools, 'WARNING: Genome (-g) files are ignored when BAM input is provided.'
Testing confirmed that `bedtools genomecov -ibam records.bam > output.tsv` creates the same file as `bedtools genomecov -ibam records.bam -g lengths.tsv > output.tsv`.
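For anyone wanting to reproduce that check, a quick sketch (file names are placeholders; requires bedtools on PATH):

```python
import filecmp
import subprocess

# Run bedtools genomecov with and without the -g genome file...
with open("with_genome.tsv", "w") as out:
    subprocess.run(
        ["bedtools", "genomecov", "-ibam", "records.bam", "-g", "lengths.tsv"],
        stdout=out, check=True,
    )
with open("without_genome.tsv", "w") as out:
    subprocess.run(
        ["bedtools", "genomecov", "-ibam", "records.bam"],
        stdout=out, check=True,
    )

# ...and confirm the outputs are byte-for-byte identical.
assert filecmp.cmp("with_genome.tsv", "without_genome.tsv", shallow=False)
```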
Binning now uses embeddings from the canonical rank, or from the specific rank name within the canonical rank, depending on the rank partition size
…ecified type

🎨 Add string instance check in kmers.embed(...) for pca_dimensions and attempt to convert to int if str given.
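Roughly this shape (a sketch; the actual check lives inside kmers.embed(...)):

```python
def coerce_pca_dimensions(pca_dimensions) -> int:
    # Accept "50" as well as 50; fail loudly on anything non-numeric.
    if isinstance(pca_dimensions, str):
        try:
            pca_dimensions = int(pca_dimensions)
        except ValueError:
            raise ValueError(f"pca_dimensions must be an int, got {pca_dimensions!r}")
    return pca_dimensions
```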
🐛 Add if statement to check whether user specified an output filepath to update logger message in parse(...)
🎨 Add script to extract log information from recursive_dbscan
🎨 Add autometa-binning-loginfo entrypoint for extracting binning log information
🐛 Fix checkpoint file writing logic where spaces were prepended to comment (#) lines
🐛 Fix merge logic when updating a checkpoint
🐛🎨 Add gzip functionality for writing of binning checkpoints file
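A sketch of the gzip-aware write path (the checkpoint content and file naming here are assumptions, not Autometa's exact layout):

```python
import gzip

def write_checkpoints(checkpoints: str, filepath: str) -> None:
    # Pick the opener from the extension so .tsv and .tsv.gz share one code path.
    opener = gzip.open if filepath.endswith(".gz") else open
    with opener(filepath, "wt") as fh:
        fh.write(checkpoints)

write_checkpoints("#canonical_rank\tcluster\n", "binning_checkpoints.tsv.gz")
```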
🎨 Clean logger emit message after retrieval of checkpoint info
@evanroyrees added the enhancement label Aug 3, 2021
🎨 Move search_prot_accessions(...) method to NCBI class
🎨 Add checks for prot.accession2taxid.FULL.gz in NCBI class
The LCA blast2lca method now tries to retrieve taxids from sseqid_to_taxid_output iff it has already been written. This allows skipping the expensive step of parsing the prot.accession2taxid databases and the blast table.
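A hedged sketch of that short-circuit (the output is assumed to be a two-column sseqid/taxid table):

```python
import os

def read_cached_taxids(sseqid_to_taxid_output: str) -> dict:
    """Return the cached sseqid -> taxid mapping iff it was already written."""
    if not (os.path.exists(sseqid_to_taxid_output) and os.path.getsize(sseqid_to_taxid_output)):
        # Caller falls back to parsing prot.accession2taxid and the blast table.
        return {}
    with open(sseqid_to_taxid_output) as fh:
        return dict(line.strip().split("\t") for line in fh)
```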
🎨🐛 Account for bug where output filepath corresponds to a non-existent directory. Now creates all output directories in outpath that do not exist.
🎨 default is -1, which uses all CPUs (for negative values, joblib uses n_cpus + 1 + core_dist_n_jobs workers)
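This follows the standard joblib/scikit-learn convention for negative n_jobs values, e.g.:

```python
import os

def effective_workers(n_jobs: int, n_cpus: int = os.cpu_count()) -> int:
    # joblib convention: positive values are taken as-is; for negative values,
    # (n_cpus + 1 + n_jobs) workers are used, so -1 == all CPUs, -2 == all but one.
    return n_jobs if n_jobs > 0 else max(n_cpus + 1 + n_jobs, 1)

print(effective_workers(-1))  # all CPUs
print(effective_workers(-2))  # all CPUs but one
```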
…tered metagenome and is trying to retrieve gc_content and other assembly stats.
gc_content_stddev : float, optional
    cluster GC content threshold to retain cluster (the default is 5.0).
taxonomy : bool, optional
@Sidduppal (Collaborator) commented Oct 19, 2021:

Great work on specifying whether the user has provided a taxonomy or not; however, there is no taxonomy variable present in the function declaration, and no taxonomy variable is used in the function.
How can this entrypoint be useful if the user hasn't specified any taxonomy file, since the cluster_by_taxon_partitioning function is not checking for that?

@evanroyrees (Collaborator, Author) replied:

see filter_taxonomy(..., rank=args.rank_name_filter) call in main().

Also see def filter_taxonomy(...) in autometa/binning/utilities.py L72.

Comment on lines +563 to +599
```python
parser.add_argument(
    "--kmers",
    help="Path to k-mer counts table",
    metavar="filepath",
    required=True,
)
parser.add_argument(
    "--coverages",
    help="Path to metagenome coverages table",
    metavar="filepath",
    required=True,
)
parser.add_argument(
    "--gc-content",
    help="Path to metagenome GC contents table",
    metavar="filepath",
    required=True,
)
parser.add_argument(
    "--markers",
    help="Path to Autometa annotated markers table",
    metavar="filepath",
    required=True,
)
parser.add_argument(
    "--taxonomy",
    metavar="filepath",
    help="Path to Autometa assigned taxonomies table",
    required=True,
)
parser.add_argument(
    "--output-binning",
    help="Path to write Autometa binning results",
    metavar="filepath",
    required=True,
)
parser.add_argument(
```
A Collaborator commented:

When asking for different input files, I feel it would be nice if you could mention the specific format of each file, i.e. what headers should be present.

Comment on lines +675 to +681
```python
parser.add_argument(
    "--max-partition-size",
    help="Maximum number of contigs to consider for a recursive binning batch.",
    default=10000,
    metavar="int",
    type=int,
)
```
A Collaborator commented:

I feel it would be nice to include what determines the selection of --max-partition-size. How did you arrive at the 10,000 value? It would help prospective users to know how to select this value; basically, when running this, how should a user know what partition size to use?

Comment on lines +600 to +602
"--output-main",
help="Path to write Autometa main table used during/after binning",
metavar="filepath",
A Collaborator commented:

The difference between the binning file and the main file is not clear from the help text.

🔥📝 Remove unused variables in docstring
🎨🐛 Drop metric columns when adding new metrics to avoid addition of suffixes
🔥🎨 Remove dropcols variable to isolate cluster metrics columns to add_metrics(...) and apply_binning_metrics_filter(...)
🔥🎨 Only drop cluster column in run_hdbscan(...)
🎨 Rename 'rank' variable to the more specific 'canonical_rank'.
🎨 Add logic to retrieve the previous canonical rank k-mer embedding if the current canonical rank embedding is not possible.
This first checks whether the variable has been given a value before the filepath and filesize checks.
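A hedged sketch of that fallback (the rank order and cache file naming are assumptions, not Autometa's exact layout):

```python
import os

CANONICAL_RANKS = ["species", "genus", "family", "order", "class", "phylum", "superkingdom"]

def find_rank_embedding(cache_dir: str, canonical_rank: str) -> str:
    """Walk from the requested rank toward broader ranks until a cached,
    non-empty embedding file is found; return '' to trigger re-embedding."""
    for rank in CANONICAL_RANKS[CANONICAL_RANKS.index(canonical_rank):]:
        fpath = os.path.join(cache_dir, f"{rank}.kmers.embed.tsv")
        # Value check first, then the filepath existence and filesize checks.
        if fpath and os.path.exists(fpath) and os.path.getsize(fpath):
            return fpath
    return ""
```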
…-binning entrypoints

Both clustering algorithms may be passed a parameter to allow them to use more cores for parallelization.

e.g. HDBSCAN(..., core_dist_n_jobs) and DBSCAN(..., n_jobs)

This parameter has been propagated through to these functions rather than the previously hardcoded -1, due to errors raised from the joblib.externals library while using HDBSCAN. The errors arising from n_jobs=-1 are infrequent, but frequent enough to merit giving the user more control.

Some example exceptions that were raised:

- `[11/17/2021 10:53:42 PM ERROR] concurrent.futures: exception calling callback for <Future at 0x7fd2083a9c10 state=finished raised BrokenProcessPool>`
- `joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.`
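A sketch of the propagation (core_dist_n_jobs and n_jobs are the real HDBSCAN/DBSCAN parameter names; the wiring around them is illustrative):

```python
from hdbscan import HDBSCAN
from sklearn.cluster import DBSCAN

def get_clusterers(n_jobs: int):
    # Pass the user-supplied cpus value through instead of hardcoding -1.
    hdb = HDBSCAN(min_cluster_size=2, core_dist_n_jobs=n_jobs)
    db = DBSCAN(eps=0.3, min_samples=2, n_jobs=n_jobs)
    return hdb, db

# e.g. from a binning entrypoint: get_clusterers(n_jobs=args.cpus)
```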
🐛 Add input type conversion for --cpus arg in binning entrypoints
…ternals using joblib.

🔥🐛 This is a somewhat known error, as similar messages have been discussed [here](scikit-learn/scikit-learn#21685)
and in the [hdbscan GH pull request #495](scikit-learn-contrib/hdbscan#495).

The error message is emitted from joblib.externals.loky.process_executor._RemoteTraceback as a ValueError:

'ValueError: buffer source array is read-only.'

So far this has not been encountered with scikit-learn version 0.24.
Labels: enhancement (New feature or request), python (Python related issues/code)

Successfully merging this pull request may close these issues:

Large data mode needs more granular checkpointing during clustering