Version 2.1 release
peradecki committed Oct 20, 2021
1 parent 35c1f5c commit 9008b29
Showing 9 changed files with 111 additions and 23 deletions.
14 changes: 5 additions & 9 deletions README.md
@@ -10,9 +10,9 @@ Rapid mining of RNA secondary structure motifs from structure profiling data.

*patteRNA* is an unsupervised pattern recognition algorithm that rapidly mines RNA structure motifs from structure profiling (SP) data.

It features a discretized observation model, hidden Markov model (DOM-HMM) of reactivity that enables automated and calibrated processing of SP data without a dependence on reference structures. It is compatible with most current probing technique (e.g. SHAPE, DMS, PARS) and can help analyze datasets of any sizes, from small scale experiments to transcriptome-wide assays. It scales well to millions or billions of nucleotides.
It features a discretized observation model, hidden Markov model (DOM-HMM) of reactivity that enables automated and calibrated processing of SP data without a dependence on reference structures. It is compatible with most current probing techniques (e.g., SHAPE, DMS, PARS) and can help analyze datasets of any size, from small scale experiments to transcriptome-wide assays. It scales well to millions or billions of nucleotides.

The training and scoring implementations are parallelized, so the algorithm can benefit greatly when deployed in a high CPU count environment.
The training and scoring implementations are parallelized, so the algorithm can benefit greatly when deployed in a high CPU-count environment.
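
The discretized observation model mentioned in the README can be illustrated with a minimal sketch. It assumes reactivities are binned by a vector of edges into 1-based classes, with one extra class reserved for missing values, and that emission probabilities are then read from a `chi` table by column lookup (as in `chi[:, obs_dom - 1]` in `DOM.py`). The edge values, the `chi` entries, and the helper name `discretize` below are hypothetical and chosen only for illustration; patteRNA learns its own parameters.

```python
import numpy as np

# Hypothetical bin edges and emission table (illustrative values only).
edges = np.array([0.2, 0.5, 1.0])                 # 3 edges -> 4 reactivity bins
chi = np.array([[0.10, 0.20, 0.30, 0.30, 0.10],   # state 0; last column = NaN class
                [0.40, 0.30, 0.15, 0.05, 0.10]])  # state 1; last column = NaN class


def discretize(obs, edges):
    """Map each reactivity to a 1-based class; NaN goes to an extra last class."""
    obs = np.asarray(obs, dtype=float)
    classes = np.digitize(obs, edges) + 1          # classes 1..len(edges)+1
    classes[np.isnan(obs)] = len(edges) + 2        # dedicated class for missing data
    return classes


obs = [0.05, 0.7, np.nan, 1.4]
obs_dom = discretize(obs, edges)
emissions = chi[:, obs_dom - 1]                    # shape: (n_states, len(obs))
```

Because the lookup is a single indexing operation per transcript, the emission step stays cheap even for very long inputs.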



@@ -284,19 +284,15 @@ HDSL is a measure of local structure that assists in converting patteRNA's predi
If you used patteRNA in your research, please reference the following citations depending on which version of patteRNA you utilized.

**Version 2.1**: \
Radecki P., Uppuluri R., Deshpande K., and Aviran S. (2021) "Accurate Detection of RNA Stem-Loops in Structurome Data Reveals Widespread Association with Protein Binding Sites." *RNA Biology*. doi TBA.
Radecki P., Uppuluri R., Deshpande K., and Aviran S. (2021) "Accurate Detection of RNA Stem-Loops in Structurome Data Reveals Widespread Association with Protein Binding Sites." *RNA Biology*. (in press) doi: [10.1080/15476286.2021.1971382](https://doi.org/10.1080/15476286.2021.1971382)

**Version 2.0**: \
Radecki P., Uppuluri R., and Aviran S. (2021) "Rapid Structure-Function Insights via Hairpin-Centric Analysis of Big RNA Structure Probing Datasets." *NAR Genomics and Bioinformatics* 3(3). doi: [10.1093/nargab/lqab073](https://doi.org/10.1093/nargab/lqab073).

**Version 1.0–1.2**: \
Ledda M. and Aviran S. (2018) “PATTERNA: Transcriptome-Wide Search for Functional RNA Elements via Structural Data Signatures.” *Genome Biology* 19(28). doi: [10.1186/s13059-018-1399-z](https://doi.org/10.1186/s13059-018-1399-z).

Radecki P., Uppuluri R., and Aviran S. (2021) "Rapid Structure-Function Insights via Hairpin-Centric Analysis of Big RNA Structure Probing Datasets." *NAR Genomics and Bioinformatics* 3(3). doi: [10.1093/nargab/lqab073](https://doi.org/10.1093/nargab/lqab073)


## Issue Reporting

patteRNA is actively supported and all changes are listed in the [CHANGELOG](CHANGES.md). To report a bug open a ticket in the [issues tracker](https://github.com/AviranLab/patteRNA/issues). Features can be requested by opening a ticket in the [pull request](https://github.com/AviranLab/patteRNA/pulls).
patteRNA is actively supported and all changes are listed in the [CHANGELOG](CHANGES.md). To report a bug open a ticket in the [issues tracker](https://github.com/AviranLab/patteRNA/issues). Features can be requested by opening a [pull request](https://github.com/AviranLab/patteRNA/pulls).



Binary file added dist/patteRNA-2.1.tar.gz
Binary file not shown.
Binary file removed dist/patteRNA-2.1b0.tar.gz
Binary file not shown.
58 changes: 57 additions & 1 deletion src/patteRNA/DOM.py
@@ -2,6 +2,12 @@


class DOM:
    """
    Object representing a discretized observation model. Composed primarily of the
    DOM.edges and DOM.chi vectors, which represent the discrete mask and state-dependent
    emission probabilities, respectively.
    """

    def __init__(self):
        self.k = None
        self.n_bins = None
@@ -12,10 +18,25 @@ def __init__(self):
        self.n_params = None

    def set_params(self, config):
        """
        Set relevant parameters for DOM object.
        Args:
            config (dict): Parameters to set.
        """
        params = {'n_bins', 'edges', 'classes', 'chi', 'n_params'}
        self.__dict__.update((param, np.array(value)) for param, value in config.items() if param in params)

    def initialize(self, k, stats):
        """
        Initialize DOM parameters according to dataset properties.
        Args:
            k (int): Number of components to use
            stats (dict): Dictionary of dataset statistics, generated by Dataset.compute_stats()
        """

        k = k + 5

@@ -60,16 +81,27 @@ def compute_emissions(self, transcript, reference=False):
        Args:
            transcript (src.patteRNA.Transcript.Transcript): Transcript to process
            reference (bool): Whether or not it's a reference transcript
        """
        if reference:
            pass
        transcript.B = self.chi[:, transcript.obs_dom-1]

    @staticmethod
    def post_process(transcript):
        pass
        pass  # No post-processing needed for DOM model

    def m_step(self, transcript):
        """
        Compute pseudo-counts en route to updating model parameters according to a maximum-likelihood approach.
        Args:
            transcript (Transcript): Transcript to process
        Returns:
            params (dict): Partial pseudo-counts
        """

        chi_0 = np.fromiter((transcript.gamma[0, transcript.obs_dom == dom_class].sum()
                             for dom_class in self.classes), float)
@@ -82,24 +114,44 @@ def m_step(self, transcript):
        return params

    def update_from_pseudocounts(self, pseudocounts, nan=False):
        """
        Update model parameters from transcript-level pseudo-counts.
        Args:
            pseudocounts (dict): Dictionary of total pseudo-counts
            nan (bool): Whether or not to treat NaNs as informative
        """
        self.chi = pseudocounts['chi'] / pseudocounts['chi_norm'][:, None]
        self.scale_chi(nan=nan)

    def scale_chi(self, nan=False):
        """
        Scale chi vector to a probability distribution.
        Args:
            nan (bool): Whether or not to treat NaNs as informative
        """
        if nan:
            self.chi[:, :] = self.chi[:, :] / np.sum(self.chi[:, :], axis=1)[:, np.newaxis]
        else:
            self.chi[:, :-1] = 0.9 * self.chi[:, :-1] / np.sum(self.chi[:, :-1], axis=1)[:, np.newaxis]
            self.chi[:, -1] = 0.1  # NaN observations

    def snapshot(self):
        """
        Returns a text summary of model parameters.
        """
        text = ""
        text += "{}:\n{}\n".format('chi', np.array2string(self.chi))
        return text

    def serialize(self):
        """
        Return a dictionary containing all of the parameters needed to describe the emission model.
        """
        return {'type': self.type,
                'n_bins': self.n_bins,
@@ -109,6 +161,10 @@ def serialize(self):
                'n_params': self.n_params}

    def reset(self):
        """
        Reset DOM object to un-initialized state.
        """
        self.edges = None
        self.chi = None
        self.k = None
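
The normalization performed by `scale_chi` in this file can be sketched as a standalone function. This is a minimal illustration, not the class method itself: it returns a copy rather than updating in place, and the `nan_mass` parameter is a hypothetical generalization of the fixed 0.9/0.1 split used in the diff. The last column is assumed to be the NaN class, as the `# NaN observations` comment indicates.

```python
import numpy as np


def scale_chi(chi, nan=False, nan_mass=0.1):
    """Return a row-normalized copy of chi (last column = NaN class)."""
    chi = np.array(chi, dtype=float)
    if nan:
        # NaNs treated as informative: normalize across every column.
        chi /= chi.sum(axis=1, keepdims=True)
    else:
        # Reserve a fixed probability mass for NaN (hypothetical parameter;
        # the diff hard-codes 0.1) and spread the rest over the observed bins.
        observed = chi[:, :-1]
        chi[:, :-1] = (1 - nan_mass) * observed / observed.sum(axis=1, keepdims=True)
        chi[:, -1] = nan_mass
    return chi


# Hypothetical pseudo-counts for two states and three bins plus the NaN class.
counts = np.array([[2.0, 6.0, 2.0, 5.0],
                   [9.0, 0.5, 0.5, 1.0]])
scaled = scale_chi(counts)
scaled_nan = scale_chi(counts, nan=True)
```

In both branches each row sums to one, so every row of `chi` remains a valid emission distribution for its state.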
7 changes: 1 addition & 6 deletions src/patteRNA/Dataset.py
@@ -9,11 +9,9 @@

class Dataset:
    def __init__(self, fp_observations, fp_sequences=None, fp_references=None):

        self.fp_obs = fp_observations
        self.fp_fasta = fp_sequences
        self.fp_refs = fp_references

        self.rnas = dict()
        self.stats = dict()

@@ -127,9 +125,6 @@ def pre_process(self, model, scoring=False):
        if model.emission_model.type == 'DOM':
            for rna in self.rnas:
                model.emission_model.discretize(self.rnas[rna])
        # if model.emission_model.type == 'GMM':
        # for rna in self.rnas:
        # model.emission_model.generate_discrete_masks(self.rnas[rna])

        if scoring:
            for rna in self.rnas.values():
@@ -147,7 +142,7 @@ def spawn_set(self, rnas):

    def spawn_reference_set(self):
        spawned_set = Dataset(fp_observations=None, fp_references=None, fp_sequences=None)
        references = [rna for rna in self.rnas if self.rnas[rna].ref is not -1]
        references = [rna for rna in self.rnas if self.rnas[rna].ref is not None]
        spawned_set.rnas = {rna: self.rnas[rna] for rna in references}
        spawned_set.compute_stats()
        return spawned_set
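
The switch from `is not -1` to `is not None` in `spawn_reference_set` deserves a short illustration. `is` compares object identity, so testing against an integer literal only happens to work in CPython for small integers, and Python 3.8+ emits a `SyntaxWarning` for it. `None` is a singleton, which makes an identity check the idiomatic way to test a "no value" sentinel. The sketch below uses a pared-down, hypothetical stand-in for `Transcript`, not the real class.

```python
class Transcript:
    """Pared-down stand-in for patteRNA's Transcript (illustrative only)."""

    def __init__(self, name, ref=None):
        self.name = name
        self.ref = ref  # None means "no reference structure available"


rnas = {
    "rna_a": Transcript("rna_a", ref=[0, 1, 1, 0]),
    "rna_b": Transcript("rna_b"),            # no reference
    "rna_c": Transcript("rna_c", ref=[]),    # empty, but still a reference
}

# Identity check against the None singleton. A truthiness test would wrongly
# drop rna_c, whose reference is present but empty.
references = [name for name in rnas if rnas[name].ref is not None]
```

Using `None` also keeps the sentinel out of the value space of valid references, which a magic number such as `-1` does not guarantee.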
4 changes: 2 additions & 2 deletions src/patteRNA/Transcript.py
@@ -6,10 +6,10 @@ class Transcript:
    def __init__(self, name, seq, obs):
        self.name = name
        self.seq = seq
        self.obs = obs
        self.obs = np.array(obs)
        self.T = len(obs)
        self.obs_dom = None
        self.ref = -1
        self.ref = None

        self.alpha = None
        self.beta = None
2 changes: 1 addition & 1 deletion src/patteRNA/cli.py
@@ -78,7 +78,7 @@ def main(testcmd=None):
    if run_config['reference']:

        # Spawn training set of reference RNAs
        logger.info("Using reference set.")
        logger.info("Using reference set")
        clock.tick()
        reference_set = data.spawn_reference_set()

2 changes: 1 addition & 1 deletion src/patteRNA/version.py
@@ -1 +1 @@
__version__ = "2.1-beta"
__version__ = "2.1"
47 changes: 44 additions & 3 deletions src/patteRNA/viennalib.py
@@ -1,29 +1,70 @@
import logging

logger = logging.getLogger(__name__)

vienna_imported = False

try:
    # Following comment suppresses unresolved reference warnings related to the fact that RNA
    # will not be declared if the import statement fails within this try block
    # noinspection PyUnresolvedReferences
    import RNA

    fc = RNA.fold_compound('GCGCGCAAAGCGCGC')
    mfe, _ = fc.mfe()
    if mfe == '((((((...))))))':
        vienna_imported = True
    else:
        raise RuntimeWarning('WARNING - ViennaRNA Python interface was imported, but did not behave as '
                             'expected. Check results or use --no-vienna to run patteRNA without NNTM folding.')
        logger.warning('WARNING - ViennaRNA Python interface was imported, but did not behave as '
                       'expected. Check results or use --no-vienna to run patteRNA without NNTM folding.')
except ModuleNotFoundError:
    pass
    logger.debug('ViennaRNA Python interface not detected.')  # Debug level log message
except ImportError as e:
    logger.warning('WARNING - ViennaRNA Python interface was found, but could not be imported successfully. '
                   'Check that you are using the same version of Python that was configured with the interface. '
                   'Check results or use --no-vienna to run patteRNA without NNTM folding. '
                   'See error below:\n{}'.format(repr(e)))


def fold(seq):
    """
    Compute the minimum free energy of a given sequence.
    Args:
        seq (str): RNA sequence to fold
    Returns:
        mfe (float): Minimum free energy of folded structure
    """
    return RNA.fold(seq)[1]


def hc_fold(seq, hcs):
    """
    Compute the minimum free energy of a given sequence subject to hard constraints (either
    base-pairs or unpaired nucleotides).
    Args:
        seq (str): RNA sequence to fold
        hcs (list): List of base pairing constraints
    Returns:
        mfe (float): Minimum free energy of folded structure
    """
    rna = RNA.fold_compound(seq)
    add_hcs(rna, hcs)
    return rna.mfe()[1]


def add_hcs(rna, hcs):
    """
    Add hard constraints to a ViennaRNA fold_compound object.
    Args:
        rna (RNA.fold_compound): Fold compound object from ViennaRNA
        hcs (list): List of hard constraints
    """
    for hc in hcs:
        if hc[1] >= 0:
            rna.hc_add_bp(hc[0] + 1, hc[1] + 1)
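
The guarded import at the top of `viennalib.py` follows a general pattern for optional dependencies, sketched below under stated assumptions. The helper name `try_import` is hypothetical and not part of patteRNA, and the standard-library `json` module stands in for an optional package so the example runs without ViennaRNA. Note that `ModuleNotFoundError` is a subclass of `ImportError`, so it must be caught first if the two cases are to be logged at different levels.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def try_import(module_name):
    """Return the imported module, or None if it is missing or fails to load."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        # Dependency simply absent: expected for optional features, log quietly.
        logger.debug("%s not detected.", module_name)
    except ImportError as e:
        # Dependency present but broken (e.g., built against another Python version).
        logger.warning("%s was found but could not be imported: %r", module_name, e)
    return None


json_mod = try_import("json")                          # standard library, always present
missing = try_import("surely_not_a_real_module_xyz")   # hypothetical name, never installed
```

Callers can then branch on whether the returned value is `None`, much as the diff uses the `vienna_imported` flag to decide whether NNTM folding is available.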