[dev branch] Issues with autometa/common/mag.py #53

jason-c-kwan · 2020-04-28T21:26:03Z

This file contains the definition of the MAG class, which is basically a subset of the contigs in a metagenome. The issues that I came across are outlined below:

The name of the script and the class is perhaps not completely logical. Most of the time a MAG class represents a kingdom bin and not a metagenome-assembled-genome (MAG). There is also a problem with the name "bin", because this is already a function in Python that converts an integer to a binary string. Perhaps something like meta_bin would be more appropriate?
The MAG class itself does not have a docstring.
There are two attributes in the MAG class that are kind of similar - self.assembly_name, which is defined as self.basename.split('.')[0] and self.root which is defined as os.path.splitext(self.basename)[0]. It looks like self.root is used for self.nucls_fname and self.prots_fname, but is self.assembly_name used or needed?
A number of the methods in the MAG class have the @property decorator, and so they are sort of like attributes but are computed when they are called upon. The problem here is that a number of them call self.get_seqs, which returns the BioPython SeqIO records for the MAG. This means that when totalsize, size_pct, nallseqs, seqs_pct, size, or gc_content are called, the fasta file the MAG is based on will be parsed each time. I would suggest the following - perhaps the get_seqs method could populate all the other values when it is first run, and this other things like totalsize could be converted from a @property to a @cached_property as defined in the functools module. A @cached_property will remember the value after it is first run to save on re-computing time.
The gc_content property in the MAG class uses a simple np.mean statement. I think it would be more appropriate to calculate a length-weighted mean of GC, because otherwise the GC value will tend to be skewed by short contigs, especially if the bin is made up of a lot of short contigs. The thing I am worried about here is that the GC content of short contigs is less likely to be completely representative of their parent genomes.
Many of the methods and properties of the MAG class do not have docstrings, which is important now that we are using docstrings directly for the documentation.
The prepared method seems to check whether the path given in the argument fpath exists and is non-zero size. I don't think this is a great way to determine if a step finished because the file could be there, be non-zero size, but still be corrupted.
The get_orfs method retrieves ORFs corresponding to the MAG, and one of the arguments is orf_type='prot'. Near the beginning of the method, the values of self.prot_orfs_fpath and self.nucl_orfs_fpath are put into a dictionary called orfs_fpaths, keyed under 'prot' and 'nucl', the two accepted values of orf_type. Then the appropriate path is picked using a dict.get statement:

Autometa/autometa/common/mag.py

Lines 134 to 135 in 3e8c771

    
           orfs_fpaths = {'prot':self.prot_orfs_fpath, 'nucl':self.nucl_orfs_fpath} 
        
           orfs_fpath = orfs_fpaths.get(orf_type, 'prot')

The problem I see here is that if the orf_type argument is not 'prot' or 'nucl', then the method silently just assumes 'prot', and I just think it would be better for bug fixing to throw an error instead.

The split_nucleotides method is not yet implemented, and there is no docstring describing what it is supposed to do, so I can't tell if it should be implemented right now or not.
In the get_binning method, it might be useful to outline in the docstring what parameters are currently accepted/expected in the **kwargs argument.
In the get_binning method, right now the embedding method that is requested is 'UMAP' rather than 'bhtsne'. Also, I think that the method arguments for kmers.embed are now lowercase and so this method will crash. My point here is that if we are trying to emulate Autometa 1.0 functionality, the default should be 'bhtsne'.
The get_binning method appears to expect a pandas DataFrame for the 'coverage' argument in **kwargs, but expects a path for the 'taxonomy' argument, which it then loads into a DataFrame. It is not entirely clear why there is this difference in data types expected.

The text was updated successfully, but these errors were encountered:

evanroyrees · 2020-05-05T20:41:40Z

The name of the script and the class is perhaps not completely logical. Most of the time a MAG class represents a kingdom bin and not a metagenome-assembled-genome (MAG). There is also a problem with the name "bin", because this is already a function in Python that converts an integer to a binary string. Perhaps something like meta_bin would be more appropriate?

How about metabin.py and MetaBin for the class?

evanroyrees · 2020-05-05T20:44:03Z

There are two attributes in the MAG class that are kind of similar - self.assembly_name, which is defined as self.basename.split('.')[0] and self.root which is defined as os.path.splitext(self.basename)[0]. It looks like self.root is used for self.nucls_fname and self.prots_fname, but is self.assembly_name used or needed?

self.assembly_name has now been removed.

evanroyrees · 2020-05-05T20:58:31Z

A number of the methods in the MAG class have the @property decorator, and so they are sort of like attributes but are computed when they are called upon. The problem here is that a number of them call self.get_seqs, which returns the BioPython SeqIO records for the MAG. This means that when totalsize, size_pct, nallseqs, seqs_pct, size, or gc_content are called, the fasta file the MAG is based on will be parsed each time. I would suggest the following - perhaps the get_seqs method could populate all the other values when it is first run, and this other things like totalsize could be converted from a @property to a @cached_property as defined in the functools module. A @cached_property will remember the value after it is first run to save on re-computing time.

The gc_content property in the MAG class uses a simple np.mean statement. I think it would be more appropriate to calculate a length-weighted mean of GC, because otherwise the GC value will tend to be skewed by short contigs, especially if the bin is made up of a lot of short contigs. The thing I am worried about here is that the GC content of short contigs is less likely to be completely representative of their parent genomes.

Time consuming properties have been cached using python3.7 implementation. GC is now weighted by length. Example of caching is below with the GC implementation:

import os

import numpy as np

from functools import lru_cache
from Bio import SeqIO
from Bio.SeqUtils import GC


class MiniMetaBin:
    def __init__(self, assembly, contigs, outdir=None):
        self.assembly = os.path.realpath(assembly)
        self.contigs = contigs
    @property
    @lru_cache(maxsize=None)
    def seqrecords(self):
        """Retrieve seqrecords from assembly contained in `self.contigs`.

        Returns
        -------
        list
            list of SeqIO [SeqRecords, ...]

        """
        return [seq for seq in SeqIO.parse(self.assembly, 'fasta')
            if seq.id in self.contigs]

    @property
    @lru_cache(maxsize=None)
    def length_weighted_gc(self):
        weights = [len(rec.seq)/self.size for rec in self.seqrecords]
        gc_content = [SeqUtils.GC(rec.seq) for rec in self.seqrecords]
        return np.average(a=gc_content, weights=weights)

evanroyrees · 2020-05-05T21:04:16Z

The prepared method seems to check whether the path given in the argument fpath exists and is non-zero size. I don't think this is a great way to determine if a step finished because the file could be there, be non-zero size, but still be corrupted.

I've added a # COMBAK comment, as this functionality should be imported from utilities. This is something that is prevalent across the pipeline. To avoid redundancy we should implement a generalizable function or decorator and import.

evanroyrees · 2020-05-05T21:15:30Z

The split_nucleotides method is not yet implemented, and there is no docstring describing what it is supposed to do, so I can't tell if it should be implemented right now or not.

I have removed this to avoid confusion

evanroyrees · 2020-05-05T21:36:21Z

The get_binning method appears to expect a pandas DataFrame for the 'coverage' argument in **kwargs, but expects a path for the 'taxonomy' argument, which it then loads into a DataFrame. It is not entirely clear why there is this difference in data types expected.

The coverage file path is now used in the method

… into issue-KwanLab#53

jason-c-kwan added the code review label Apr 28, 2020

jason-c-kwan added this to the Aim 0: Release 2.0 milestone Apr 28, 2020

jason-c-kwan mentioned this issue Apr 28, 2020

Jason has not yet finished the code review #39

Closed

evanroyrees mentioned this issue May 5, 2020

fixes-#54 Metagenome #66

Merged

evanroyrees self-assigned this May 5, 2020

evanroyrees mentioned this issue May 5, 2020

Update MAG class to MetaBin #67

Merged

evanroyrees linked a pull request May 5, 2020 that will close this issue

Update MAG class to MetaBin #67

Merged

evanroyrees added a commit to evanroyrees/Autometa that referenced this issue May 13, 2020

Merge branch 'issue-KwanLab#54' of https://github.com/WiscEvan/Autometa…

fe84654

… into issue-KwanLab#53

jason-c-kwan added a commit that referenced this issue May 14, 2020

Resolved merge conflicts with WiscEvan-issue-#53

587c873

jason-c-kwan added a commit that referenced this issue May 14, 2020

Merge branch 'WiscEvan-issue-#53' into dev

d0cc81f

jason-c-kwan closed this as completed May 14, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[dev branch] Issues with autometa/common/mag.py #53

[dev branch] Issues with autometa/common/mag.py #53

jason-c-kwan commented Apr 28, 2020 •

edited

Loading

evanroyrees commented May 5, 2020

evanroyrees commented May 5, 2020

evanroyrees commented May 5, 2020 •

edited

Loading

evanroyrees commented May 5, 2020

evanroyrees commented May 5, 2020

evanroyrees commented May 5, 2020

[dev branch] Issues with autometa/common/mag.py #53

[dev branch] Issues with autometa/common/mag.py #53

Comments

jason-c-kwan commented Apr 28, 2020 • edited Loading

evanroyrees commented May 5, 2020

evanroyrees commented May 5, 2020

evanroyrees commented May 5, 2020 • edited Loading

evanroyrees commented May 5, 2020

evanroyrees commented May 5, 2020

evanroyrees commented May 5, 2020

jason-c-kwan commented Apr 28, 2020 •

edited

Loading

evanroyrees commented May 5, 2020 •

edited

Loading