# Working out parsing functions for bpRNA Perl script output

## Parsing structure description lines

Structure description lines come after the headers and sequences, and each one describes an individual RNA secondary structure. The authors provide no easily visible documentation, but each line is clearly associated with one of the codes in the dictionary below.

In [1]:
structure_codes = {
    'E': '',
    'S': 'stem',
    'M': 'multiloop',
    'I': 'internal loop',
    'B': 'bulge',
    'H': 'hairpin loop',
}

Example structure description lines:

In [15]:
lines = [
    'H1 98..107 "GUCGCUGCUA" (97,108) A:U ',
    'S1 7..7 "C" 195..195 "G"',
    'M1.1 8..86 "CUAAAUUUAAUCGUGGCAGUUUCCUUAUACAAACCGAAUAUUUACAAGUGACGACUCCGCAUUACUCUUGGAAUGAAUU" (7,195) C:G (87,121) A:U',
    'segment4 3bp 179..182 UAUG 186..189 CAUA',
]

These lines vary quite a bit in layout, but the ones I am interested in start with a code letter, followed by an id number that says which structure of that type this is, and then possibly a sub-id, separated from the main id by a period, when the line refers to a structure composed of multiple other structures.

For now I am mostly interested in avoiding hairpin loops in my plasmid design, so I am going to ignore some of the more subtle and complex structure types in favor of easier parsing.

In [12]:
def parse_struct_descrip_line(line):
    """Return (code, prim_id, sec_id), or None if the line is not a coded structure."""
    split_line = line.split(' ')  # space delimited
    if not split_line[0]:  # empty line or leading space: nothing to parse
        return None
    code = split_line[0][0]
    if code in structure_codes:  # ignore if not in code dict
        number = split_line[0][1:]  # everything after the code letter
        numbers = number.split('.')  # separate primary and secondary id
        assert len(numbers) <= 2
        prim_id = int(numbers[0])  # avoid shadowing the builtin `id`
        if len(numbers) == 2:
            sec_id = int(numbers[1])
        else:
            sec_id = None
        return code, prim_id, sec_id
    else:
        return None

In [16]:
for line in lines:
    p = parse_struct_descrip_line(line)
    print('='*20)
    print(line)
    print(p)

H1 98..107 "GUCGCUGCUA" (97,108) A:U 
('H', 1, None)
S1 7..7 "C" 195..195 "G"
('S', 1, None)
M1.1 8..86 "CUAAAUUUAAUCGUGGCAGUUUCCUUAUACAAACCGAAUAUUUACAAGUGACGACUCCGCAUUACUCUUGGAAUGAAUU" (7,195) C:G (87,121) A:U
('M', 1, 1)
segment4 3bp 179..182 UAUG 186..189 CAUA
None
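
The first-field layout described above (code letter, primary id, optional sub-id after a period) could also be written as a regular expression. This is only a minimal sketch, assuming the id token is always the first whitespace-delimited field; `ID_TOKEN` and `parse_id_token` are hypothetical names that are not part of the notebook or of bpRNA.

```python
import re

# Hypothetical pattern: one uppercase code letter, a primary id,
# and an optional ".sub_id" suffix.
ID_TOKEN = re.compile(r'^([A-Z])(\d+)(?:\.(\d+))?$')


def parse_id_token(line, codes='ESMIBH'):
    """Return (code, prim_id, sec_id), or None if the first field does not match."""
    fields = line.split()
    if not fields:  # empty or whitespace-only line
        return None
    match = ID_TOKEN.match(fields[0])
    if match is None or match.group(1) not in codes:
        return None
    code, prim, sec = match.groups()
    sec_id = int(sec) if sec is not None else None
    return code, int(prim), sec_id
```

Unlike the split-based function, this version returns `None` instead of raising when the first field starts with a known code letter but is not followed by a number.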


Turn it into an object

In [20]:
class StructDescrip:

    @classmethod
    def init_from_line(cls, line):
        """Build an instance from a description line, or return None if it has no known code."""
        split_line = line.split(' ')  # space delimited
        if not split_line[0]:  # empty line or leading space: nothing to parse
            return None
        code = split_line[0][0]
        if code in structure_codes:  # ignore if not in code dict
            number = split_line[0][1:]  # everything after the code letter
            numbers = number.split('.')  # separate primary and secondary id
            assert len(numbers) <= 2
            prim_id = int(numbers[0])
            if len(numbers) == 2:
                sec_id = int(numbers[1])
            else:
                sec_id = None
            return cls(code, prim_id, sec_id)
        else:
            return None

    def __init__(self, code, prim_id, sec_id=None):
        self.code = code
        self.prim_id = prim_id
        self.sec_id = sec_id

    def __repr__(self):
        return ' '.join([f'{key}: {val}' for key, val in self.__dict__.items()])

In [22]:
for line in lines:
    p = StructDescrip.init_from_line(line)
    print('='*20)
    print('INPUT LINE:', line)
    print('STRUCT INSTANCE:', p)

INPUT LINE: H1 98..107 "GUCGCUGCUA" (97,108) A:U 
STRUCT INSTANCE: code: H prim_id: 1 sec_id: None
INPUT LINE: S1 7..7 "C" 195..195 "G"
STRUCT INSTANCE: code: S prim_id: 1 sec_id: None
INPUT LINE: M1.1 8..86 "CUAAAUUUAAUCGUGGCAGUUUCCUUAUACAAACCGAAUAUUUACAAGUGACGACUCCGCAUUACUCUUGGAAUGAAUU" (7,195) C:G (87,121) A:U
STRUCT INSTANCE: code: M prim_id: 1 sec_id: 1
INPUT LINE: segment4 3bp 179..182 UAUG 186..189 CAUA
STRUCT INSTANCE: None
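
Since the goal is to avoid hairpin loops, a possible next step is to pull the position span out of each `H` line. The sketch below assumes the second field is a `start..end` span for the loop, which is what the example line suggests (`98..107` covers ten positions and the quoted sequence is ten bases long), but that reading is not confirmed by any documentation. `hairpin_spans` is a hypothetical helper name.

```python
def hairpin_spans(lines):
    """Collect (id_token, start, end) for lines whose first field starts with 'H'."""
    spans = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or not fields[0].startswith('H'):
            continue  # not a hairpin line
        # Assumed layout of the second field: "start..end"
        start, _, end = fields[1].partition('..')
        if start.isdigit() and end.isdigit():
            spans.append((fields[0], int(start), int(end)))
    return spans


example = [
    'H1 98..107 "GUCGCUGCUA" (97,108) A:U ',
    'S1 7..7 "C" 195..195 "G"',
]
print(hairpin_spans(example))  # [('H1', 98, 107)]
```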
