-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dev branch] Issues with autometa/common/mag.py #53
Comments
How about |
|
Time consuming properties have been cached using python3.7 implementation. GC is now weighted by length. Example of caching is below with the GC implementation: import os
import numpy as np
from functools import lru_cache
from Bio import SeqIO
from Bio.SeqUtils import GC
class MiniMetaBin:
def __init__(self, assembly, contigs, outdir=None):
self.assembly = os.path.realpath(assembly)
self.contigs = contigs
@property
@lru_cache(maxsize=None)
def seqrecords(self):
"""Retrieve seqrecords from assembly contained in `self.contigs`.
Returns
-------
list
list of SeqIO [SeqRecords, ...]
"""
return [seq for seq in SeqIO.parse(self.assembly, 'fasta')
if seq.id in self.contigs]
@property
@lru_cache(maxsize=None)
def length_weighted_gc(self):
weights = [len(rec.seq)/self.size for rec in self.seqrecords]
gc_content = [SeqUtils.GC(rec.seq) for rec in self.seqrecords]
return np.average(a=gc_content, weights=weights) |
I've added a |
I have removed this to avoid confusion |
The coverage file path is now used in the method |
This file contains the definition of the MAG class, which is basically a subset of the contigs in a metagenome. The issues that I came across are outlined below:
The name of the script and the class is perhaps not completely logical. Most of the time a MAG class represents a kingdom bin and not a metagenome-assembled-genome (MAG). There is also a problem with the name "bin", because this is already a function in Python that converts an integer to a binary string. Perhaps something like
meta_bin
would be more appropriate?The MAG class itself does not have a docstring.
There are two attributes in the MAG class that are kind of similar -
self.assembly_name
, which is defined asself.basename.split('.')[0]
andself.root
which is defined asos.path.splitext(self.basename)[0]
. It looks likeself.root
is used forself.nucls_fname
andself.prots_fname
, but isself.assembly_name
used or needed?A number of the methods in the MAG class have the
@property
decorator, and so they are sort of like attributes but are computed when they are called upon. The problem here is that a number of them callself.get_seqs
, which returns the BioPython SeqIO records for the MAG. This means that whentotalsize
,size_pct
,nallseqs
,seqs_pct
,size
, orgc_content
are called, the fasta file the MAG is based on will be parsed each time. I would suggest the following - perhaps theget_seqs
method could populate all the other values when it is first run, and this other things liketotalsize
could be converted from a@property
to a@cached_property
as defined in the functools module. A@cached_property
will remember the value after it is first run to save on re-computing time.The
gc_content
property in the MAG class uses a simplenp.mean
statement. I think it would be more appropriate to calculate a length-weighted mean of GC, because otherwise the GC value will tend to be skewed by short contigs, especially if the bin is made up of a lot of short contigs. The thing I am worried about here is that the GC content of short contigs is less likely to be completely representative of their parent genomes.Many of the methods and properties of the MAG class do not have docstrings, which is important now that we are using docstrings directly for the documentation.
The
prepared
method seems to check whether the path given in the argumentfpath
exists and is non-zero size. I don't think this is a great way to determine if a step finished because the file could be there, be non-zero size, but still be corrupted.The
get_orfs
method retrieves ORFs corresponding to the MAG, and one of the arguments isorf_type='prot'
. Near the beginning of the method, the values ofself.prot_orfs_fpath
andself.nucl_orfs_fpath
are put into a dictionary calledorfs_fpaths
, keyed under'prot'
and'nucl'
, the two accepted values oforf_type
. Then the appropriate path is picked using adict.get
statement:Autometa/autometa/common/mag.py
Lines 134 to 135 in 3e8c771
The problem I see here is that if the
orf_type
argument is not'prot'
or'nucl'
, then the method silently just assumes'prot'
, and I just think it would be better for bug fixing to throw an error instead.The
split_nucleotides
method is not yet implemented, and there is no docstring describing what it is supposed to do, so I can't tell if it should be implemented right now or not.In the
get_binning
method, it might be useful to outline in the docstring what parameters are currently accepted/expected in the**kwargs
argument.In the
get_binning
method, right now the embedding method that is requested is'UMAP'
rather than'bhtsne'
. Also, I think that the method arguments forkmers.embed
are now lowercase and so this method will crash. My point here is that if we are trying to emulate Autometa 1.0 functionality, the default should be'bhtsne'
.The
get_binning
method appears to expect a pandas DataFrame for the'coverage'
argument in**kwargs
, but expects a path for the'taxonomy'
argument, which it then loads into a DataFrame. It is not entirely clear why there is this difference in data types expected.The text was updated successfully, but these errors were encountered: