## RNA secondary structure

As we will discuss in the lecture, the 3D structure of biopolymers, whether proteins or RNAs, is crucial for their function. For instance, tRNAs fold into characteristic cloverleaf structures, while miRNAs, which are small RNAs (~22 nucleotides in their final functional form), are generated from precursors that fold into hairpins. These structures are maintained through evolution, despite changes in the sequence. Consequently, one approach to finding regions in the genome that encode particular types of RNAs is to look for regions that fold into the structure characteristic of the RNA of interest.

Hairpin structures like those of miRNA precursors occur quite frequently in genomes, so the presence of a hairpin alone is not sufficient for a reliable prediction of miRNA-encoding genes. However, one study (https://academic.oup.com/bioinformatics/article/20/17/2911/186725) went a step further in evaluating the significance of hairpin-forming sequences. Specifically, the authors reasoned that miRNA sequences have undergone evolutionary optimization to form their hairpins, whereas other sequences with the same nucleotide composition have not, so miRNA precursors should fold into more stable structures than their randomized counterparts. They went on to show this via randomization, which you can try to replicate.

The steps are:
1. Download miRNA precursor sequences from miRBase, the standard repository of the field (https://www.mirbase.org/download/).
The file contains sequences from many organisms; you can recognize the human sequences by the prefix of their names, `hsa`.
2. Using either the web server or the standalone version of RNAfold (http://rna.tbi.univie.ac.at/cgi-bin/RNAWebSuite/RNAfold.cgi), predict the secondary structure of the human miRNA precursors and record their minimum free energy (MFE) of folding.
3. For each human miRNA precursor, generate 1000 shuffled variants with the same nucleotide composition.
4. Predict the MFE of every shuffled variant and calculate the z-score of the original sequence's MFE with respect to the MFEs of its shuffled variants.
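
Steps 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited study: `toy_energy` is a hypothetical stand-in for a real folding energy (it simply counts Watson-Crick pairs between mirrored positions), and the example sequence is made up. In practice you would pass a function that returns the MFE predicted by RNAfold instead.

```python
import random
import statistics
from typing import Callable


def shuffled_variants(sequence: str, n: int, rng: random.Random) -> list[str]:
    """Return n random permutations of `sequence`.

    Permuting the letters preserves the mononucleotide composition.
    """
    variants = []
    for _ in range(n):
        letters = list(sequence)
        rng.shuffle(letters)
        variants.append("".join(letters))
    return variants


def mfe_z_score(
    sequence: str,
    energy_fn: Callable[[str], float],
    n_shuffles: int = 1000,
    seed: int = 0,
) -> float:
    """z-score of the native energy relative to its shuffled variants."""
    rng = random.Random(seed)
    background = [energy_fn(s) for s in shuffled_variants(sequence, n_shuffles, rng)]
    mean = statistics.mean(background)
    sd = statistics.stdev(background)  # sample standard deviation
    return (energy_fn(sequence) - mean) / sd


# HYPOTHETICAL energy function, for illustration only (not a real MFE):
# -1 for every Watson-Crick pair between position i and its mirror position.
_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def toy_energy(sequence: str) -> float:
    half = len(sequence) // 2
    paired = sum(
        (sequence[i], sequence[-1 - i]) in _PAIRS for i in range(half)
    )
    return -float(paired)


# Made-up hairpin-like sequence: a G stretch, a short loop, a C stretch.
example = "GGGGGGGGAAAACCCCCCCC"
z = mfe_z_score(example, toy_energy, n_shuffles=200)
print(f"z-score of the toy hairpin: {z:.2f}")
```

A strongly negative z-score means the native sequence is more stable than expected for its composition. Note that the z-score divides by the standard deviation of the shuffled energies, not by their variance.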

You could try the same procedure for tRNAs (from https://gtrnadb.ucsc.edu/genomes/eukaryota/Hsapi38/hg38-mature-tRNAs.fa).

What do you conclude?

**Note**: if you cannot install the standalone version of RNAfold, limit the number of miRNAs/tRNAs that you submit to the server.
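
If you only submit a limited number of sequences, one option is to select the human entries by their name prefix and draw a random subset. The sketch below uses a small hand-written FASTA parser so that it runs without extra libraries; the function names `read_fasta` and `pick_subset` are hypothetical, and the records are made-up toy entries, not real miRBase sequences.

```python
import random


def read_fasta(lines):
    """Parse FASTA lines into (id, sequence) tuples.

    The id is the first whitespace-separated word of the header line.
    """
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def pick_subset(records, prefix="hsa-", k=20, seed=0):
    """Keep records whose id starts with `prefix`, then sample at most k."""
    selected = [r for r in records if r[0].startswith(prefix)]
    rng = random.Random(seed)
    return rng.sample(selected, min(k, len(selected)))


# Toy input (made-up ids and sequences, for illustration only).
toy_fasta = """\
>hsa-toy-1 made-up human entry
GGGAAAUCCC
>mmu-toy-1 made-up mouse entry
GCGCAAAAGC
GC
>hsa-toy-2 made-up human entry
AUGCAUGCAU
"""

records = read_fasta(toy_fasta.splitlines())
subset = pick_subset(records, prefix="hsa-", k=1)
print(subset)
```

Fixing the seed makes the subset reproducible, so the native and shuffled MFEs can be recomputed for the same sequences later.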

In [None]:
import random
import numpy as np
import matplotlib.pyplot as plt

from Bio import SeqIO
import RNA  # ViennaRNA Python bindings; provides RNA.fold

sequences = []

def shuffle(sequence: str, amount: int = 1000) -> list[str]:
    shuffle_sequences = []
    for i in range(amount):
        l = list(sequence)
        random.shuffle(l)
        shuffle_sequences.append(''.join(l))

    return shuffle_sequences

def all_mfe(sequences: list[str]) -> list[float]:
    mfe = []
    for i, sequence in enumerate(sequences):
        if i % 100 == 0:
            print(f"I : {i}")
        fold, fold_mfe = RNA.fold(sequence)
        mfe.append(fold_mfe)

    return mfe

# Read the precursors and keep only the human entries (names start with "hsa").
with open('hairpin.fa') as handle:
    for fasta in SeqIO.parse(handle, 'fasta'):
        if fasta.id.startswith('hsa'):
            sequences.append(str(fasta.seq))

mfes = all_mfe(sequences)
plt.hist(mfes)
plt.show()

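z_scores = []
for i, sequence in enumerate(sequences):
    if i % 100 == 0:
        print(f"I : {i}")
    # MFE of the native sequence
    _, mfe_fold = RNA.fold(sequence)

    # MFEs of the shuffled variants (same nucleotide composition)
    shuffle_sequences = shuffle(sequence)
    mfe = []
    for shuffle_seq in shuffle_sequences:
        shuffle_result = RNA.fold(shuffle_seq)
        mfe.append(shuffle_result[1])

    # z-score: divide by the standard deviation, not the variance
    z = (mfe_fold - np.mean(mfe)) / np.std(mfe)
    z_scores.append(z)

# Plot the distribution of z-scores over all sequences
plt.hist(z_scores)
plt.show()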

I : 0
I : 100
I : 200
I : 300
I : 400
I : 500
I : 600
I : 700
I : 800
I : 900
