# Quantifying RNA secondary structure

RNA secondary structure from prediction software is usually provided in dot-bracket notation:

```
((((....))))
```

Paired parentheses are base-paired ribonucleotides and `.`s are unpaired bases.
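As a quick illustration of how the notation encodes pairing, here is a minimal sketch that matches each `(` to its `)` with a stack. The function name `pair_table` is my own, and it assumes a plain dot-bracket string containing only `(`, `)` and `.` (no pseudoknot brackets).

```python
def pair_table(dot_bracket_str):
    """Map every paired position to its partner (0-based indices).

    Assumes plain dot-bracket input: only '(', ')' and '.'.
    """
    stack = []     # indices of '(' still waiting for a partner
    partners = {}
    for i, char in enumerate(dot_bracket_str):
        if char == '(':
            stack.append(i)
        elif char == ')':
            if not stack:
                raise ValueError(f"unmatched ')' at index {i}")
            j = stack.pop()   # the most recent open '(' pairs with this ')'
            partners[i] = j
            partners[j] = i
        elif char != '.':
            raise ValueError(f"unexpected character {char!r} at index {i}")
    if stack:
        raise ValueError(f"unmatched '(' at index {stack[-1]}")
    return partners


print(pair_table('((....))'))
```

Unpaired positions simply do not appear in the returned dictionary.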

For plasmid variable regions (and, more generally, transcribed regions) I want to get some idea of the degree of structure that can form over those regions.

Programs like `RNAfold` produce statistics about the free energy of predicted structures, but not many statistics that relate directly to the degree of secondary structure in a given sequence, at least not ones that I can interpret easily. So I want to create a parser that will produce the following statistics for a given dot-bracket string:

- Paired ribonucleotides / total ribonucleotides
- Loop associated ribonucleotides / total ribonucleotides
    - This differs from the first statistic in that `.`s are counted, as long as they sit in between two regions of base pairing

## Degree of base pairing

This one is pretty basic.

In [3]:
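def degree_basepairing(dot_bracket_str):
    """Return the fraction of positions that are base-paired."""
    # Every '(' or ')' marks one paired ribonucleotide
    paired = dot_bracket_str.count('(') + dot_bracket_str.count(')')
    unpaired = dot_bracket_str.count('.')
    all_ribos = len(dot_bracket_str)
    # Sanity check: plain dot-bracket strings contain only '(', ')' and '.'
    assert paired + unpaired == all_ribos
    return paired / all_ribos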

In [4]:
s = '((((........))))'
assert degree_basepairing(s) == 0.5

## Degree of looping

In [7]:
s1 = '((((........))))...(.((((........)))).)...().()...'

import re  # use regex
loop_regex = re.compile(r'\(+\.*\)+')

def find_loops(dot_bracket_str):
    return loop_regex.findall(dot_bracket_str)

print(find_loops(s1))

['((((........))))', '((((........))))', '()', '()']


This fails to find the nested loop `(.((((........)))).)`. It should also ignore `()`, since that is not really a loop.
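One possible way around the regex limitation is to track nesting depth instead of matching patterns. The sketch below counts a `.` as loop-associated whenever it sits inside at least one open base pair. That is my own working definition of "between two regions of base pairing", so treat it as an assumption; the function names are also made up here.

```python
def count_enclosed_unpaired(dot_bracket_str):
    """Count '.' positions that are enclosed by at least one base pair."""
    depth = 0      # number of '(' currently open
    enclosed = 0
    for i, char in enumerate(dot_bracket_str):
        if char == '(':
            depth += 1
        elif char == ')':
            depth -= 1
            if depth < 0:
                raise ValueError(f"unmatched ')' at index {i}")
        elif char == '.':
            if depth > 0:   # unpaired base inside some stem
                enclosed += 1
        else:
            raise ValueError(f"unexpected character {char!r} at index {i}")
    if depth != 0:
        raise ValueError("unmatched '(' in input")
    return enclosed


def degree_looping(dot_bracket_str):
    """Loop-associated ribonucleotides / total ribonucleotides."""
    return count_enclosed_unpaired(dot_bracket_str) / len(dot_bracket_str)


s1 = '((((........))))...(.((((........)))).)...().()...'
print(count_enclosed_unpaired(s1), len(s1))
```

With this definition the nested loop is counted in full, and `()` contributes nothing because it encloses no unpaired bases. Unpaired bases outside every pair (depth zero) are not counted.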

Maybe just use a program like [bpRNA](https://github.com/hendrixlab/bpRNA)? It is described in [this paper](https://academic.oup.com/nar/article/46/11/5381/4994207).

There is not much documentation about the output format though, which looks like this:

```
#Name: bpRNA_PDB_650
#Length:  71 
#PageNumber: 5
#Warning: Structure contains linked PK-segments of same sizes 1 and 1. Using PK brackets for the more 5' segment
GGACACUGAUGAUCGCGUGGAUAUGGCACGCAUUGAAUUGUUGGACACCGUAAAUGUCCUAACACGUGUCC
((((((.([({((.(((((..<<.<<))))))(.A.).]}..(((((>>()>>.))))).))a).))))))
SSSSSSISISBSSBSSSSSHHHHHHHSSSSSSSHHHSMMMMMSSSSSIISSIIISSSSSMSSISISSSSSS
NNNNNNNNKNKNNNNNNNNNNKKNKKNNNNNNNNKNNNKKNNNNNNNKKNNKKNNNNNNNNNKNNNNNNNN
S1 1..6 "GGACAC" 66..71 "GUGUCC"
S2 8..8 "G" 64..64 "A"
S3 10..10 "U" 62..62 "A"
S4 12..12 "A" 61..61 "A"
S5 13..13 "U" 32..32 "A"
S6 15..19 "GCGUG" 27..31 "CACGC"
S7 33..33 "U" 37..37 "A"
S8 43..47 "GGACA" 55..59 "UGUCC"
S9 50..51 "GU" 50..51 "GU"
H1 20..26 "GAUAUGG" (19,27) G:C PK{3,4}
H2 34..36 "UGA" (33,37) U:A PK{5}
H3 51..50 "" (50,51) G:U 
B1 11..11 "G" (10,62) U:A (61,12) A:A PK{2}
B2 14..14 "C" (13,32) U:A (31,15) C:G 
I1.1 7..7 "U" (6,66) C:G 
I1.2 65..65 "C" (64,8) A:G 
I2.1 9..9 "A" (8,64) G:A PK{1}
I2.2 63..63 "C" (62,10) A:U PK{5}
```



The paper had a bit more info on this file type, but I am still mostly inferring from context. I will need to use it, as `RNAfold` produces more complex dot-bracket diagrams than it is worth writing a script for.

Letters designate the type of structure over a given region.
- S: Stem, paired ribos
- H: Hairpin; coordinates refer to the internal region of the hairpin
- I: Interior loop, a sort of bulge-type structure that occurs when ribos are unpaired between two stems
- B: Bulge, a displaced ribo that occurs within a stem
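Given the letter codes above, the per-position structure line (the one made of `S`, `H`, `I`, `B`, ...) can be summarized directly. This is a rough sketch: `structure_fractions` and `STRUCTURE_LABELS` are names I made up, and letters not covered in the list above (the example output also contains `M`, for instance) are passed through under their raw letter rather than guessed at.

```python
from collections import Counter

# Labels come from the list above; anything else is reported by its raw letter.
STRUCTURE_LABELS = {
    'S': 'stem',
    'H': 'hairpin',
    'I': 'interior loop',
    'B': 'bulge',
}


def structure_fractions(structure_array):
    """Return the fraction of positions assigned to each structure type."""
    if not structure_array:
        raise ValueError("structure array is empty")
    counts = Counter(structure_array)
    total = len(structure_array)
    return {
        STRUCTURE_LABELS.get(letter, letter): n / total
        for letter, n in counts.items()
    }


# Structure line copied from the example output above
example = 'SSSSSSISISBSSBSSSSSHHHHHHHSSSSSSSHHHSMMMMMSSSSSIISSIIISSSSSMSSISISSSSSS'
for label, fraction in structure_fractions(example).items():
    print(f"{label}: {fraction:.2f}")
```

The stem fraction here plays the same role as the paired / total statistic, and the remaining categories split the unpaired positions by context.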

These identities are pretty much inferred from this image from the paper's database.

![](http://bprna.cgrb.oregonstate.edu/images/bpRNA_structure.png)

So instead of a script to calculate these values, I now need a script to parse this new structure file format.
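A first pass at such a parser might look like the sketch below. The layout is inferred only from the example output above, not from any format specification: `#Key: value` header lines, then four body lines (sequence, dot-bracket, structure letters, and an `N`/`K` line that I am assuming flags pseudoknot positions), then one line per structural element. The function name, the dictionary keys, and the sample text are all made up for illustration, and only the first coordinate range of each element line is captured.

```python
import re

# Element lines look like: S1 1..6 "GGACAC" 66..71 "GUGUCC"
# Only the name and the first start..end "seq" group are captured here.
ELEMENT_RE = re.compile(r'^(\S+)\s+(\d+)\.\.(\d+)\s+"([^"]*)"')


def parse_structure_text(text):
    """Parse structure-file text into a dict (layout inferred from one example)."""
    header = {}
    body = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('#'):
            key, _, value = line[1:].partition(':')
            header[key.strip()] = value.strip()
        else:
            body.append(line)
    if len(body) < 4:
        raise ValueError("expected at least four non-header lines")
    elements = []
    for line in body[4:]:
        match = ELEMENT_RE.match(line)
        if match:
            name, start, end, seq = match.groups()
            elements.append((name, int(start), int(end), seq))
    return {
        'header': header,
        'sequence': body[0],
        'dot_bracket': body[1],
        'structure_array': body[2],
        'knot_flags': body[3],   # assumed meaning of the N/K line
        'elements': elements,
    }


# Made-up miniature record in the same layout as the example above
sample = '''#Name: demo
#Length:  12
GGGGAAAACCCC
((((....))))
SSSSHHHHSSSS
NNNNNNNNNNNN
S1 1..4 "GGGG" 9..12 "CCCC"
H1 5..8 "AAAA" (4,9) G:C
'''
record = parse_structure_text(sample)
print(record['header'], record['elements'])
```

Coordinates are kept exactly as written in the file, which appear to be 1-based and inclusive.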