# Predicting RNA secondary structure with LinearRNA

LinearRNA is a set of linear-time algorithms and software tools for RNA secondary structure analysis: **LinearFold** and **LinearPartition**.

# Part I: LinearFold

**LinearFold** is the first linear-time algorithm and software tool for predicting RNA secondary structures.
The LinearFold paper was accepted at ISMB, a leading conference on computational biology, and published in *Bioinformatics*: [LinearFold: linear-time approximate RNA folding by 5'-to-3' dynamic programming and beam search](https://academic.oup.com/bioinformatics/article/35/14/i295/5529205)

## RNA structure prediction

### Machine learning model

In [1]:
import pahelix.toolkit.linear_rna as linear_rna
input_sequence = "AACUCCGCCAGGCCUGGAAGGGAGCAACGGUAGUGACACUCUCUGUGUGCGUAGGUUGCCUAGCUACCAUUU"
linear_rna.linear_fold_c(input_sequence)

('..((((.(((....)))...))))....((((((............................))))))....',
 0.4548597317188978)

In [3]:
# with constraints
constraint = "??(???(??????)?(????????)???(??????(???????)?)???????????)??.???????????"
linear_rna.linear_fold_c(input_sequence, use_constraints = True, constraint = constraint)

('..(.(((......)((........))(((......(.......).))).....))..)..............',
 -27.328358240425587)

### Parameter setting
- rna_sequence: string, the input RNA sequence whose secondary structure is predicted.
- beam_size: int (default 100), set 0 to turn off the beam pruning.
- use_constraints: bool (default False), enable adding constraints when predicting structures.
- constraint: string (default ""), the constraint string. It takes effect only when use_constraints is True, and it must have the same length as the RNA sequence. Each character is one of "? . ( )": "?" marks a position whose pairing is unknown, "." marks an unpaired position, and "(" and ")" mark positions paired as a left or right parenthesis respectively. The parentheses must be well-balanced and non-crossing.
- no_sharp_turn: bool (default True), disable sharp turns in prediction.
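
The rules for the constraint string (same length as the sequence, only the characters "? . ( )", balanced parentheses) can be checked before calling the folding functions. The following is a minimal sketch of such a check; `validate_constraint` is a hypothetical helper written for this tutorial, not part of `pahelix`, and it only covers the rules listed above.

```python
def validate_constraint(sequence, constraint):
    """Return a list of problems found in `constraint`; an empty list means none.

    Hypothetical helper (not part of pahelix). It checks only the rules stated
    in the parameter list: equal length, allowed characters, balanced parentheses.
    """
    problems = []

    # Rule 1: the constraint must be as long as the RNA sequence.
    if len(constraint) != len(sequence):
        problems.append(
            f"length mismatch: sequence has {len(sequence)} characters, "
            f"constraint has {len(constraint)}"
        )

    # Rule 2: only '?', '.', '(' and ')' are allowed.
    unexpected = sorted(set(constraint) - set("?.()"))
    if unexpected:
        problems.append(f"unexpected characters: {unexpected}")

    # Rule 3: parentheses must be well-balanced.
    depth = 0
    for index, char in enumerate(constraint):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                problems.append(f"unmatched ')' at index {index}")
                break
    if depth > 0:
        problems.append(f"{depth} unmatched '('")

    return problems


if __name__ == "__main__":
    sequence = "AACUCCGCCAGGCCUGGAAGGGAGCAACGGUAGUGACACUCUCUGUGUGCGUAGGUUGCCUAGCUACCAUUU"
    constraint = "??(???(??????)?(????????)???(??????(???????)?)???????????)??.???????????"
    print(validate_constraint(sequence, constraint))
```

Returning a list of messages rather than raising on the first error makes it easier to report every problem in a hand-written constraint at once.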

### Thermodynamic model
The parameters are the same as the machine learning-based model.

In [4]:
linear_rna.linear_fold_v(input_sequence)

('..((((.(((....)))...))))....((((((.((((.....))))...((((...))))))))))....',
 -18.4)
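
The first element of the returned tuple is the predicted structure in dot-bracket notation. For downstream analysis it is often handy to turn that string into an explicit list of paired positions. The sketch below does this with a stack; `dot_bracket_to_pairs` is a hypothetical helper, not part of `pahelix`, and the use of 1-based positions is a choice made here for readability.

```python
def dot_bracket_to_pairs(structure):
    """Convert a dot-bracket string into a sorted list of (i, j) base pairs.

    Hypothetical helper (not part of pahelix). Positions are 1-based,
    which is an assumption of this sketch.
    """
    open_positions = []  # stack of positions of '(' not yet closed
    pairs = []
    for position, char in enumerate(structure, start=1):
        if char == "(":
            open_positions.append(position)
        elif char == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {position}")
            # The most recent unclosed '(' pairs with this ')'.
            pairs.append((open_positions.pop(), position))
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return sorted(pairs)


if __name__ == "__main__":
    structure = "..((((.(((....)))...))))....((((((.((((.....))))...((((...))))))))))...."
    for i, j in dot_bracket_to_pairs(structure):
        print(i, j)
```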

In [5]:
# with constraints
linear_rna.linear_fold_v(input_sequence, use_constraints = True, constraint = constraint)

('..(.(((......)((........))(((......(.......).))).....))..)..............',
 13.4)

# Part II: LinearPartition

**LinearPartition** is the first linear-time algorithm and software tool for calculating the partition function and base pair probabilities of RNA secondary structures. The LinearPartition paper was accepted at ISMB, a leading conference on computational biology, and published in *Bioinformatics*: [LinearPartition: linear-time approximation of RNA folding partition function and base-pairing probabilities](https://academic.oup.com/bioinformatics/article/36/Supplement_1/i258/5870487)

## Partition function and base pair probabilities calculation

### Machine learning model

In [7]:
input_sequence = "AACUCCGCCAGGCCUGGAAGGGAGCAACGGUAGUGACACUCUCUGUGUGCGUAGGUUGCCUAGCUACCAUUU"
linear_rna.linear_partition_c(input_sequence)

  ...
  (16, 31, 0.00016783550381660461),
  (16, 37, 0.00037275999784469604),
  (16, 39, 0.00014984235167503357),
  (16, 40, 0.00014948472380638123),
  (16, 41, 0.00021246075630187988),
  (16, 42, 0.0010166727006435394),
  (16, 43, 0.0015860721468925476),
  (16, 44, 0.009785622358322144),
  (16, 46, 0.0028750598430633545),
  (16, 48, 0.0014885924756526947),
  (16, 50, 0.0032700300216674805),
  (16, 52, 0.015047132968902588),
  (16, 56, 0.0004921853542327881),
  (16, 57, 0.0009927190840244293),
  (16, 59, 0.00016970932483673096),
  (16, 60, 0.0021891817450523376),
  (16, 61, 0.004012703895568848),
  (16, 64, 0.0009740591049194336),
  (16, 65, 0.00402340292930603),
  (16, 67, 0.001359991729259491),
  (16, 68, 0.05359542369842529),
  (16, 70, 0.00023096054792404175),
  (16, 71, 0.00039472803473472595),
  (16, 72, 0.004300475120544434),
  (17, 25, 0.0014351047575473785),
  (17, 28, 0.000629182904958725),
  (17, 31, 0.00013668090105056763),
  (17, 37, 7.22445547580719e-05),
  ...

### Parameter setting
- rna_sequence: string, the input RNA sequence for which the partition function and base pair probabilities are calculated.
- beam_size: int (default 100), set 0 to turn off the beam pruning.
- bp_cutoff: double (default 0.0), only output base pairs whose probabilities are larger than bp_cutoff (between 0 and 1).
- no_sharp_turn: bool (default True), disable sharp turns in prediction.

### Thermodynamic model
The parameters are the same as the machine learning model.

In [8]:
linear_rna.linear_partition_v(input_sequence, bp_cutoff = 0.5)

(-19.403078079223633,
 [(3, 24, 0.8039621710777283),
  (4, 23, 0.8239085078239441),
  (5, 22, 0.8219183087348938),
  (6, 21, 0.8141640424728394),
  (8, 17, 0.8630755543708801),
  (9, 16, 0.8678420186042786),
  (10, 15, 0.7041950225830078),
  (29, 68, 0.8213568925857544),
  (30, 67, 0.8230675458908081),
  (31, 66, 0.8223718404769897),
  (32, 65, 0.8192430138587952),
  (33, 64, 0.7854012250900269),
  (34, 63, 0.690216600894928),
  (36, 48, 0.8604442477226257),
  (37, 47, 0.91445392370224),
  (38, 46, 0.9144330024719238),
  (39, 45, 0.9141958951950073),
  (52, 62, 0.564305305480957),
  (53, 61, 0.6316171884536743),
  (54, 60, 0.6385029554367065),
  (55, 59, 0.6180907487869263)])
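
The list of `(i, j, probability)` triples can be turned back into a dot-bracket string by keeping only the pairs above a threshold. The sketch below assumes that positions are 1-based, which appears consistent with the structures shown earlier but is not confirmed by the source; `pairs_to_dot_bracket` is a hypothetical helper, not part of `pahelix`. A cutoff of at least 0.5 is used because two conflicting pairs cannot both have probability above 0.5, so the result stays well-formed.

```python
def pairs_to_dot_bracket(pairs, length, cutoff=0.5):
    """Build a dot-bracket string from (i, j, probability) triples.

    Hypothetical helper (not part of pahelix). Assumes 1-based positions
    with i < j, and keeps only pairs whose probability exceeds `cutoff`.
    """
    structure = ["."] * length
    for i, j, probability in pairs:
        if probability > cutoff:
            structure[i - 1] = "("  # 5' partner
            structure[j - 1] = ")"  # 3' partner
    return "".join(structure)


if __name__ == "__main__":
    # First seven pairs copied from the output above; they all lie
    # within the first 24 nucleotides of the sequence.
    example_pairs = [
        (3, 24, 0.8039621710777283),
        (4, 23, 0.8239085078239441),
        (5, 22, 0.8219183087348938),
        (6, 21, 0.8141640424728394),
        (8, 17, 0.8630755543708801),
        (9, 16, 0.8678420186042786),
        (10, 15, 0.7041950225830078),
    ]
    print(pairs_to_dot_bracket(example_pairs, length=24))
```

With these seven pairs the result matches the first 24 characters of the structure returned by `linear_fold_v` above.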