Version 2.1 release

AviranLab · Oct 20, 2021 · 9008b29
1 parent 35c1f5c
commit 9008b29
Showing 9 changed files with 111 additions and 23 deletions.
diff --git a/README.md b/README.md
@@ -10,9 +10,9 @@ Rapid mining of RNA secondary structure motifs from structure profiling data.
 
 *patteRNA* is an unsupervised pattern recognition algorithm that rapidly mines RNA structure motifs from structure profiling (SP) data. 
 
-It features a discretized observation model, hidden Markov model (DOM-HMM) of reactivity that enables automated and calibrated processing of SP data without a dependence on reference structures. It is compatible with most current probing technique (e.g. SHAPE, DMS, PARS) and can help analyze datasets of any sizes, from small scale experiments to transcriptome-wide assays. It scales well to millions or billions of nucleotides.
+It features a discretized observation model, hidden Markov model (DOM-HMM) of reactivity that enables automated and calibrated processing of SP data without a dependence on reference structures. It is compatible with most current probing techniques (e.g., SHAPE, DMS, PARS) and can help analyze datasets of any size, from small scale experiments to transcriptome-wide assays. It scales well to millions or billions of nucleotides.
 
-The training and scoring implementations are parallelized, so the algorithm can benefit greatly when deployed in a high CPU count environment.
+The training and scoring implementations are parallelized, so the algorithm can benefit greatly when deployed in a high CPU-count environment.
 
 
 
@@ -284,19 +284,15 @@ HDSL is a measure of local structure that assists in converting patteRNA's predi
 If you used patteRNA in your research, please reference the following citations depending on which version of patteRNA you utilized.
 
 **Version 2.1**: \
-Radecki P., Uppuluri R., Deshpande K., and Aviran S. (2021) "Accurate Detection of RNA Stem-Loops in Structurome Data Reveals Widespread Association with Protein Binding Sites." *RNA Biology*. doi TBA.
+Radecki P., Uppuluri R., Deshpande K., and Aviran S. (2021) "Accurate Detection of RNA Stem-Loops in Structurome Data Reveals Widespread Association with Protein Binding Sites." *RNA Biology*. (in press) doi: [10.1080/15476286.2021.1971382](https://doi.org/10.1080/15476286.2021.1971382)
 
 **Version 2.0**: \
-Radecki P., Uppuluri R., and Aviran S. (2021) "Rapid Structure-Function Insights via Hairpin-Centric Analysis of Big RNA Structure Probing Datasets." *NAR Genomics and Bioinformatics* 3(3). doi: [10.1093/nargab/lqab073](https://doi.org/10.1093/nargab/lqab073).
-
-**Version 1.0–1.2**: \
-Ledda M. and Aviran S. (2018) “PATTERNA: Transcriptome-Wide Search for Functional RNA Elements via Structural Data Signatures.” *Genome Biology* 19(28). doi: [10.1186/s13059-018-1399-z](https://doi.org/10.1186/s13059-018-1399-z).
-
+Radecki P., Uppuluri R., and Aviran S. (2021) "Rapid Structure-Function Insights via Hairpin-Centric Analysis of Big RNA Structure Probing Datasets." *NAR Genomics and Bioinformatics* 3(3). doi: [10.1093/nargab/lqab073](https://doi.org/10.1093/nargab/lqab073)
 
 
 ## Issue Reporting
 
-patteRNA is actively supported and all changes are listed in the [CHANGELOG](CHANGES.md). To report a bug open a ticket in the [issues tracker](https://github.com/AviranLab/patteRNA/issues). Features can be requested by opening a ticket in the [pull request](https://github.com/AviranLab/patteRNA/pulls).
+patteRNA is actively supported and all changes are listed in the [CHANGELOG](CHANGES.md). To report a bug, open a ticket in the [issues tracker](https://github.com/AviranLab/patteRNA/issues). Features can be requested by opening a [pull request](https://github.com/AviranLab/patteRNA/pulls).
 
 
 

diff --git a/dist/patteRNA-2.1.tar.gz b/dist/patteRNA-2.1.tar.gz
diff --git a/dist/patteRNA-2.1b0.tar.gz b/dist/patteRNA-2.1b0.tar.gz
diff --git a/src/patteRNA/DOM.py b/src/patteRNA/DOM.py
@@ -2,6 +2,12 @@
 
 
 class DOM:
+    """
+    Object representing a discretized observation model. Composed primarily of the
+    DOM.edges and DOM.chi vectors, which represent the discrete mask and state-dependent
+    emission probabilities, respectively.
+    """
+
     def __init__(self):
         self.k = None
         self.n_bins = None
@@ -12,10 +18,25 @@ def __init__(self):
         self.n_params = None
 
     def set_params(self, config):
+        """
+        Set relevant parameters for DOM object.
+
+        Args:
+            config (dict): Parameters to set.
+
+        """
         params = {'n_bins', 'edges', 'classes', 'chi', 'n_params'}
         self.__dict__.update((param, np.array(value)) for param, value in config.items() if param in params)
 
     def initialize(self, k, stats):
+        """
+        Initialize DOM parameters according to dataset properties.
+
+        Args:
+            k (int): Number of components to use
+            stats (dict): Dictionary of dataset statistics, generated by Dataset.compute_stats()
+
+        """
 
         k = k + 5
 
@@ -60,16 +81,27 @@ def compute_emissions(self, transcript, reference=False):
         Args:
             transcript (src.patteRNA.Transcript.Transcript): Transcript to process
             reference (bool): Whether or not it's a reference transcript
+
         """
         if reference:
             pass
         transcript.B = self.chi[:, transcript.obs_dom-1]
 
     @staticmethod
     def post_process(transcript):
-        pass
+        pass  # No post-processing needed for DOM model
 
     def m_step(self, transcript):
+        """
+        Compute pseudo-counts en route to updating model parameters according to a maximum-likelihood approach.
+
+        Args:
+            transcript (Transcript): Transcript to process
+
+        Returns:
+            params (dict): Partial pseudo-counts
+
+        """
 
         chi_0 = np.fromiter((transcript.gamma[0, transcript.obs_dom == dom_class].sum()
                              for dom_class in self.classes), float)
@@ -82,24 +114,44 @@ def m_step(self, transcript):
         return params
 
     def update_from_pseudocounts(self, pseudocounts, nan=False):
+        """
+        Update model parameters from transcript-level pseudo-counts.
+
+        Args:
+            pseudocounts (dict): Dictionary of total pseudo-counts
+            nan (bool): Whether or not to treat NaNs as informative
+
+        """
         self.chi = pseudocounts['chi'] / pseudocounts['chi_norm'][:, None]
         self.scale_chi(nan=nan)
 
     def scale_chi(self, nan=False):
+        """
+        Scale chi vector to a probability distribution.
+
+        Args:
+            nan (bool): Whether or not to treat NaNs as informative
+
+        """
         if nan:
             self.chi[:, :] = self.chi[:, :] / np.sum(self.chi[:, :], axis=1)[:, np.newaxis]
         else:
             self.chi[:, :-1] = 0.9 * self.chi[:, :-1] / np.sum(self.chi[:, :-1], axis=1)[:, np.newaxis]
             self.chi[:, -1] = 0.1  # NaN observations
 
     def snapshot(self):
+        """
+        Returns a text summary of model parameters.
+
+        """
         text = ""
         text += "{}:\n{}\n".format('chi', np.array2string(self.chi))
         return text
 
     def serialize(self):
         """
         Return a dictionary containing all of the parameters needed to describe the emission model.
+
         """
         return {'type': self.type,
                 'n_bins': self.n_bins,
@@ -109,6 +161,10 @@ def serialize(self):
                 'n_params': self.n_params}
 
     def reset(self):
+        """
+        Reset DOM object to un-initialized state.
+
+        """
         self.edges = None
         self.chi = None
         self.k = None
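
The `scale_chi` normalization documented in this hunk can be illustrated with a small pure-Python sketch. This is not the patteRNA implementation, which operates in place on NumPy arrays; the function name `scale_rows` and the `nan_mass` parameter are hypothetical, and the default of 0.1 simply mirrors the constant visible in the diff. The last column is assumed to hold the weight for NaN observations.

```python
def scale_rows(chi, nan=False, nan_mass=0.1):
    """Return a row-normalized copy of chi (a list of lists of floats).

    Assumption: the last column of each row is the NaN-observation bin.
    """
    scaled = []
    for row in chi:
        if nan:
            # Treat NaNs as informative: normalize over every bin.
            total = sum(row)
            scaled.append([v / total for v in row])
        else:
            # Reserve a fixed mass for NaNs; spread the rest over real bins.
            total = sum(row[:-1])
            real_bins = [(1.0 - nan_mass) * v / total for v in row[:-1]]
            scaled.append(real_bins + [nan_mass])
    return scaled


rows = scale_rows([[2.0, 6.0, 5.0], [1.0, 1.0, 9.0]])
for row in rows:
    print(row, sum(row))
```

In both branches each row ends up summing to one, so every state's emission vector is a valid probability distribution.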

diff --git a/src/patteRNA/Dataset.py b/src/patteRNA/Dataset.py
@@ -9,11 +9,9 @@
 
 class Dataset:
     def __init__(self, fp_observations, fp_sequences=None, fp_references=None):
-
         self.fp_obs = fp_observations
         self.fp_fasta = fp_sequences
         self.fp_refs = fp_references
-
         self.rnas = dict()
         self.stats = dict()
 
@@ -127,9 +125,6 @@ def pre_process(self, model, scoring=False):
         if model.emission_model.type == 'DOM':
             for rna in self.rnas:
                 model.emission_model.discretize(self.rnas[rna])
-        # if model.emission_model.type == 'GMM':
-        #     for rna in self.rnas:
-        #         model.emission_model.generate_discrete_masks(self.rnas[rna])
 
         if scoring:
             for rna in self.rnas.values():
@@ -147,7 +142,7 @@ def spawn_set(self, rnas):
 
     def spawn_reference_set(self):
         spawned_set = Dataset(fp_observations=None, fp_references=None, fp_sequences=None)
-        references = [rna for rna in self.rnas if self.rnas[rna].ref is not -1]
+        references = [rna for rna in self.rnas if self.rnas[rna].ref is not None]
         spawned_set.rnas = {rna: self.rnas[rna] for rna in references}
         spawned_set.compute_stats()
         return spawned_set
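
The switch from `is not -1` to `is not None` in this hunk replaces an identity test against an integer literal, which only works by accident of CPython's small-integer caching and triggers a `SyntaxWarning` on recent Python versions, with the conventional `None` sentinel. A minimal sketch of the pattern follows; the `Record` class and `with_reference` helper are hypothetical stand-ins, not patteRNA's `Transcript` or `Dataset`.

```python
class Record:
    """Hypothetical stand-in for a transcript with an optional reference."""

    def __init__(self, name, ref=None):
        self.name = name
        self.ref = ref  # None means "no reference structure available"


def with_reference(records):
    """Return names of records that carry a reference, in insertion order."""
    return [name for name, rec in records.items() if rec.ref is not None]


records = {
    "rna1": Record("rna1", ref=[0, 1, 1, 0]),
    "rna2": Record("rna2"),
    "rna3": Record("rna3", ref=[]),  # empty but present: still counts
}
print(with_reference(records))
```

Testing `is not None` rather than truthiness keeps an empty but explicitly supplied reference distinct from a missing one.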

diff --git a/src/patteRNA/Transcript.py b/src/patteRNA/Transcript.py
@@ -6,10 +6,10 @@ class Transcript:
     def __init__(self, name, seq, obs):
         self.name = name
         self.seq = seq
-        self.obs = obs
+        self.obs = np.array(obs)
         self.T = len(obs)
         self.obs_dom = None
-        self.ref = -1
+        self.ref = None
 
         self.alpha = None
         self.beta = None

diff --git a/src/patteRNA/cli.py b/src/patteRNA/cli.py
@@ -78,7 +78,7 @@ def main(testcmd=None):
         if run_config['reference']:
 
             # Spawn training set of reference RNAs
-            logger.info("Using reference set.")
+            logger.info("Using reference set")
             clock.tick()
             reference_set = data.spawn_reference_set()
 

diff --git a/src/patteRNA/version.py b/src/patteRNA/version.py
@@ -1 +1 @@
-__version__ = "2.1-beta"
+__version__ = "2.1"
diff --git a/src/patteRNA/viennalib.py b/src/patteRNA/viennalib.py
@@ -1,29 +1,70 @@
+import logging
+
+logger = logging.getLogger(__name__)
+
 vienna_imported = False
+
 try:
+    # Following comment suppresses unresolved reference warnings related to the fact that RNA
+    # will not be declared if the import statement fails within this try block
+    # noinspection PyUnresolvedReferences
     import RNA
 
     fc = RNA.fold_compound('GCGCGCAAAGCGCGC')
     mfe, _ = fc.mfe()
     if mfe == '((((((...))))))':
         vienna_imported = True
     else:
-        raise RuntimeWarning('WARNING - ViennaRNA Python interface was imported, but did not behave as '
-                             'expected. Check results or use --no-vienna to run patteRNA without NNTM folding.')
+        logger.warning('WARNING - ViennaRNA Python interface was imported, but did not behave as '
+                       'expected. Check results or use --no-vienna to run patteRNA without NNTM folding.')
 except ModuleNotFoundError:
-    pass
+    logger.debug('ViennaRNA Python interface not detected.')  # Debug level log message
+except ImportError as e:
+    logger.warning('WARNING - ViennaRNA Python interface was found, but could not be imported successfully. '
+                   'Check that you are using the same version of Python that was configured with the interface. '
+                   'Check results or use --no-vienna to run patteRNA without NNTM folding. '
+                   'See error below:\n{}'.format(repr(e)))
 
 
 def fold(seq):
+    """
+    Compute the minimum free energy of a given sequence.
+
+    Args:
+        seq (str): RNA sequence to fold
+
+    Returns:
+        mfe (float): Minimum free energy of folded structure
+
+    """
     return RNA.fold(seq)[1]
 
 
 def hc_fold(seq, hcs):
+    """
+    Compute the minimum free energy of a given sequence subject to hard constraints (either
+    base-pairs or unpaired nucleotides).
+
+    Args:
+        seq (str): RNA sequence to fold
+        hcs (list): List of base pairing constraints
+
+    Returns:
+        mfe (float): Minimum free energy of folded structure
+
+    """
     rna = RNA.fold_compound(seq)
     add_hcs(rna, hcs)
     return rna.mfe()[1]
 
 
 def add_hcs(rna, hcs):
+    """
+    Add hard constraints to a ViennaRNA fold_compound object.
+    Args:
+        rna (RNA.fold_compound): Fold compound object from ViennaRNA
+        hcs (list): List of hard constraints
+    """
     for hc in hcs:
         if hc[1] >= 0:
             rna.hc_add_bp(hc[0] + 1, hc[1] + 1)
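
The import guard at the top of this file logs instead of raising when the optional dependency is missing or broken. The same pattern can be sketched in isolation with the standard library only; the helper name `optional_import` is hypothetical and is not part of patteRNA or ViennaRNA. Because `ModuleNotFoundError` is a subclass of `ImportError`, the more specific handler has to come first.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def optional_import(module_name):
    """Return the imported module, or None if it is unavailable."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        # Module is simply not installed: not worth a warning.
        logger.debug("%s not detected.", module_name)
    except ImportError as e:
        # Module exists but failed to load (e.g. built for another Python).
        logger.warning("%s was found but could not be imported: %r", module_name, e)
    return None


json_module = optional_import("json")
missing = optional_import("surely_missing_module_xyz123")
print(json_module is not None, missing is None)
```

Callers can then branch on whether the returned value is `None`, much as the diff branches on `vienna_imported`.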