<a href="https://colab.research.google.com/github/Biocanter/fastbook/blob/master/RNAfold_sequence_parameters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Exploratory Data Analysis (EDA) of Combinatorial optimization of mRNA structure, stability, and translation for RNA-based therapeutics**

Kathrin Leppek et al., 2022

https://daslab.stanford.edu/site_data/pub_pdf/LeppekEtAl_NatureCommunications_2022.pdf

The goal of this notebook is to reproduce some of the mRNA sequence parameters reported in the paper, such as GC content and the minimum free energy (MFE) of the predicted secondary structure (computed with ViennaRNA).


You can find the data at this link: https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-022-28776-w/MediaObjects/41467_2022_28776_MOESM4_ESM.zip

Download and extract the archive, then load the Excel file: Supplementary Data 1 - Attributes for pooled 233 sequences
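
The extraction step can also be scripted. The following is a minimal sketch using only the standard library; the function name `extract_excel_files` and the output directory are illustrative assumptions. The archive's internal file names are not assumed, so the helper simply extracts every Excel workbook it finds.

```python
import zipfile
from pathlib import Path


def extract_excel_files(zip_path, out_dir="."):
    """Extract every Excel workbook found in a zip archive.

    Hypothetical helper (not part of the original notebook):
    returns the paths of the extracted files.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # keep only Excel workbooks, whatever folder they sit in
            if member.lower().endswith((".xlsx", ".xls")):
                extracted.append(Path(archive.extract(member, path=out_dir)))
    return extracted
```

Once the archive has been downloaded to a local file, the returned paths can be passed to `pd.read_excel`.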

Ángel Cantero-Camacho, PhD

https://www.linkedin.com/in/biocanter/

09-08-2022

In [None]:
### first install the conda environment (condacolab restarts the Colab kernel)
!pip install -q condacolab
import condacolab
condacolab.install()
### install ViennaRNA
!conda install -y -c bioconda viennarna
### install Biopython
!pip install biopython

✨🍰✨ Everything looks OK!
Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Retrieving notices: ...working... done
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

In [None]:
### import libraries
import pandas as pd
import numpy as np
### mount Google Drive to access the Excel file
from google.colab import drive
drive.mount('/content/drive')
### path to the Excel file (replace with your own path)
File="/content/drive/MyDrive/Colab Notebooks/Supp1_mRNA_DAS.xlsx"

MessageError: ignored

In [None]:
### ViennaRNA Python instructions: https://www.tbi.univie.ac.at/RNA/ViennaRNA/doc/html/helloworld_swig.html
import RNA
seq = "GAGUAGUGGAACCAGGCUAUGUUUGUGACUCGCAGACUAACA"
# compute the minimum free energy (MFE) and the corresponding structure
### ss: secondary structure in dot-bracket notation
### mfe: minimum free energy (kcal/mol)
(ss, mfe) = RNA.fold(seq)
# print output
print("{}\n{} [ {:6.2f} ]".format(seq, ss, mfe))

GAGUAGUGGAACCAGGCUAUGUUUGUGACUCGCAGACUAACA
..(((((........)))))(((((((...)))))))..... [  -8.80 ]
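
The structure line above is in dot-bracket notation: each `(` pairs with a matching `)`, and `.` marks an unpaired base. As a minimal sketch (the helper name `dot_bracket_pairs` is an assumption, not a ViennaRNA function), the base pairs can be recovered with a stack:

```python
def dot_bracket_pairs(structure):
    """Return the (i, j) base pairs (0-based) encoded in a dot-bracket string."""
    stack = []
    pairs = []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)  # remember the opening position
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.append((stack.pop(), j))  # closes the most recent '('
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return sorted(pairs)


# structure printed above for the 42-nt example sequence
example_ss = "..(((((........)))))(((((((...)))))))....."
example_pairs = dot_bracket_pairs(example_ss)
print(len(example_pairs), example_pairs[0])  # 12 (2, 19)
```

The example structure therefore contains two hairpins, with 5 and 7 base pairs respectively.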


In [None]:
### load the data into a pandas DataFrame
df = pd.read_excel(File)

In [None]:
## function to apply to a pandas column: folds one sequence and
## returns its secondary structure (ss) and MFE as a Series
def fold_RNA(seq):
  (ss, mfe) = RNA.fold(seq)
  return (pd.Series([ss, mfe]))

In [None]:
## use fold_RNA to compute the secondary structure (ss) and MFE of each sequence
temp_serie=df['RNA sequence'].apply(fold_RNA)
### create the new columns from the two returned values
df['ss_mRNA'], df['mfe_vienna']=temp_serie[0],temp_serie[1]

In [None]:
## compare both parameters
### values calculated in this notebook
print (df['mfe_vienna'])
### values reported in the paper
print (df['dG(MFE) Vienna'])
### plot one against the other
import plotly.express as px
fig = px.scatter(df, x="mfe_vienna", y="dG(MFE) Vienna")
fig.show()

0     -395.000000
1     -352.100006
2     -375.700012
3     -375.299988
4     -369.899994
          ...    
228   -289.440002
229   -340.600006
230   -272.200012
231   -284.600006
232   -316.799988
Name: mfe_vienna, Length: 233, dtype: float64
0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
        ...  
228   -289.44
229   -340.60
230   -272.20
231   -284.60
232   -316.80
Name: dG(MFE) Vienna, Length: 233, dtype: float64
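
The printed columns show that the first rows of the paper's `dG(MFE) Vienna` column are NaN, so any numerical comparison has to skip them. The following is a minimal plain-Python sketch of how the agreement could be quantified; the function name `compare_columns` and the tolerance `tol` are assumptions.

```python
import math


def compare_columns(computed, reference, tol=0.01):
    """Compare two equal-length numeric sequences, skipping NaN pairs.

    Returns (number of compared pairs, max absolute difference,
    number of pairs differing by more than tol).
    """
    diffs = [
        abs(c - r)
        for c, r in zip(computed, reference)
        if not (math.isnan(c) or math.isnan(r))
    ]
    if not diffs:
        return 0, float("nan"), 0
    return len(diffs), max(diffs), sum(d > tol for d in diffs)


# row 0 (NaN in the paper column) plus the last five rows printed above
computed = [-395.000000, -289.440002, -340.600006, -272.200012, -284.600006, -316.799988]
reference = [float("nan"), -289.44, -340.60, -272.20, -284.60, -316.80]
n, max_diff, n_off = compare_columns(computed, reference)
print(n, round(max_diff, 4), n_off)  # 5 0.0 0
```

With the DataFrame, the same check could be run as `compare_columns(df['mfe_vienna'].tolist(), df['dG(MFE) Vienna'].tolist())`.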


In [None]:
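### calculate GC content
## Biopython module (GC returns a percentage; newer Biopython releases replace it with gc_fraction)
from Bio.SeqUtils import GC
## apply GC to the 'Sequence CDS' column
df['GC']=df['Sequence CDS'].apply(GC)
## convert the percentage to a fraction (0-1)
df['GC']=df['GC']/100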
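
For reference, GC content is the number of G and C bases divided by the sequence length; the cell above divides by 100 because `Bio.SeqUtils.GC` returns a percentage. The following is a dependency-free sketch for cross-checking single values. The name `gc_fraction_plain` is an assumption, and unlike Biopython it does not handle ambiguous nucleotide codes.

```python
def gc_fraction_plain(seq):
    """Fraction of G and C bases in a nucleotide sequence (range 0-1)."""
    seq = seq.upper()
    if not seq:
        return 0.0  # avoid dividing by zero on empty input
    return (seq.count("G") + seq.count("C")) / len(seq)


print(gc_fraction_plain("GCAU"))  # 0.5
```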

In [None]:
## compare the calculated GC content with the values reported in the paper
fig = px.scatter(df, x="GC", y="GC content (CDS)")
fig.show()

In [None]:
##plot Group vs MFE 
fig = px.violin(df, y="mfe_vienna", x='Group',color='Group', box=True, points="all")
fig.show()

In [None]:
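## Normalize the MFE values. Note: preprocessing.normalize rescales the vector to
## unit L2 norm, so the negative MFE values stay negative rather than mapping to 0-1
from sklearn import preprocessing
mfe_vienna_norm = np.array(df['mfe_vienna'])
mfe_vienna_norm=preprocessing.normalize([mfe_vienna_norm])
## this overwrites the raw MFE values stored in df['mfe_vienna']
df['mfe_vienna']=mfe_vienna_norm.transpose()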
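
By default, `preprocessing.normalize` divides the vector by its L2 norm, so the MFE values keep their negative sign and do not land in the 0-1 range. If a 0-1 range is the aim, min-max scaling is one possible alternative. The following is a plain-Python sketch; `min_max_scale` is a hypothetical helper, not a scikit-learn function.

```python
def min_max_scale(values):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # constant input: no spread to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# a few MFE values taken from the output printed earlier
mfe = [-395.0, -352.1, -375.7, -289.44, -272.2]
scaled = min_max_scale(mfe)
print(min(scaled), max(scaled))  # 0.0 1.0
```

Here the most negative MFE maps to 0 and the least negative to 1. Writing the result to a new column rather than overwriting `df['mfe_vienna']` would keep the raw energies available for later cells.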

In [None]:
#plot MFE vs GC
fig = px.scatter(df, x='mfe_vienna', y="GC" ,color="Group")
fig.show()

In [None]:
###END
## To be continued with further RNA sequence parameters: SUP and EternaFold data
#Ángel Cantero-Camacho PhD

