# Sequence Molecules (DNA, RNA & Protein Sequences)

Sequence molecules are created from the `DNA`, `RNA` and `Protein` classes in the `bioseq` module. Internally, their type hierarchies are:

* `Molecule -> SequenceMolecule -> Polynucleotide -> DNA,RNA`
* `Molecule -> SequenceMolecule -> Polypeptide -> Protein`

This allows generic sequence initialization and access methods to be shared between `(DNA, RNA, Protein)`, whereas nucleotide-specific methods are shared only between `(DNA, RNA)`.
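
The sharing pattern described above can be illustrated with a minimal sketch. The classes below are empty stand-ins that only mirror the names in the hierarchy; they are not the real `bioseq` classes, and `shares_nucleotide_methods` is a hypothetical helper introduced for illustration.

```python
# Stand-in classes mirroring the hierarchy described above.
# These are NOT the real bioseq classes; they only show the inheritance shape.
class Molecule:
    pass

class SequenceMolecule(Molecule):
    pass

class Polynucleotide(SequenceMolecule):
    pass

class Polypeptide(SequenceMolecule):
    pass

class DNA(Polynucleotide):
    pass

class RNA(Polynucleotide):
    pass

class Protein(Polypeptide):
    pass


def shares_nucleotide_methods(cls):
    """Hypothetical helper: True if cls would inherit nucleotide-specific methods."""
    return issubclass(cls, Polynucleotide)


# Generic sequence methods would live on SequenceMolecule (all three inherit it),
# nucleotide-specific ones on Polynucleotide (only DNA and RNA inherit it).
for cls in (DNA, RNA, Protein):
    print(cls.__name__, issubclass(cls, SequenceMolecule), shares_nucleotide_methods(cls))
```

Under this layout, a method defined on `SequenceMolecule` is available to all three classes, while one defined on `Polynucleotide` is not visible from `Protein`.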

To **create a single strand** of `(DNA, RNA, Protein)`, call the respective constructor. The optional kwarg `ambiguous` selects between a strict alphabet (e.g., `GATC`) and a more permissive alphabet (e.g., `GATCRYWSMKHBVDN`), which is used by default.
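
To make the difference between the two alphabets concrete, here is a minimal sketch in plain Python. It uses only the two example alphabets quoted above; `uses_allowed_letters` is a hypothetical helper, and how the library itself validates sequences is not shown in this document.

```python
# Example alphabets taken from the text above.
STRICT_DNA = set("GATC")
AMBIGUOUS_DNA = set("GATCRYWSMKHBVDN")


def uses_allowed_letters(sequence, ambiguous):
    """Hypothetical helper: check that every letter belongs to the chosen alphabet.

    This only illustrates the strict vs. permissive distinction; it is an
    assumption, not the library's own validation logic.
    """
    alphabet = AMBIGUOUS_DNA if ambiguous else STRICT_DNA
    return set(sequence) <= alphabet


print(uses_allowed_letters("GATTACA", ambiguous=False))
print(uses_allowed_letters("GATNACA", ambiguous=False))
print(uses_allowed_letters("GATNACA", ambiguous=True))
```

The second call fails the strict check because `N` appears only in the permissive alphabet.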

For double-stranded molecules, each strand is represented separately as its own molecule.

In [1]:
from wc_rules.bioseq import DNA, RNA, Protein
dna1 = DNA(ambiguous=False)

To **attach a sequence to a molecule**, use `set_sequence(inputstr)`.

In [2]:
inputstr = 'TTGTTATCGTTACCGGGAGTGAGGCGTCCGCGTCCCTTTCAGGTCAAGCGACTGAAAAACCTTGCAGTTGATTTTAAAGCGTATAGAAGACAATACAGA'
dna1.set_sequence(inputstr)

<wc_rules.bioseq.DNA at 0x1f732f8d400>

To **read the sequence of a molecule**, use `get_sequence()`.

In [3]:
print(dna1.get_sequence())

TTGTTATCGTTACCGGGAGTGAGGCGTCCGCGTCCCTTTCAGGTCAAGCGACTGAAAAACCTTGCAGTTGATTTTAAAGCGTATAGAAGACAATACAGA


To **get the sequence length**, use `get_sequence_length()`.

In [4]:
print(dna1.get_sequence_length())

99


To **read a subsequence**, use `get_sequence()` with either `(start, end)` or `(start, length)` kwargs. If both `end` and `length` are given, `end` takes priority. Sequences are indexed like Python strings, i.e., 0 to L-1 for a sequence of L bases. Position L refers to the position _after_ the last base, so `end` is exclusive.
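
The indexing rules above can be sketched with plain Python strings. `resolve_span` and `subsequence` are hypothetical helpers written for this illustration, not functions from the library, and the range check is an assumption of the sketch rather than documented behavior.

```python
def resolve_span(total_length, start=0, end=None, length=None):
    """Hypothetical helper: turn (start, end) or (start, length) into a half-open span.

    `end` takes priority over `length`; if neither is given, the span runs to
    the end of the sequence.
    """
    if end is None:
        end = start + length if length is not None else total_length
    # Assumption of this sketch: out-of-range spans are rejected.
    if not 0 <= start <= end <= total_length:
        raise ValueError("span must satisfy 0 <= start <= end <= total_length")
    return start, end


def subsequence(sequence, start=0, end=None, length=None):
    """Hypothetical helper: slice a plain string using the resolved span."""
    begin, stop = resolve_span(len(sequence), start, end, length)
    return sequence[begin:stop]


seq = "TTGTTATCGT"  # first 10 bases of the example sequence, L = 10

print(subsequence(seq, start=6, end=10))            # end = L is one past the last base
print(subsequence(seq, start=6, length=4))          # same span, given as a length
print(subsequence(seq, start=6, end=8, length=4))   # end wins over length
```

The first two calls select the same four bases, while the third returns only two because `end` overrides `length`.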

In [5]:
print(dna1.get_sequence(start=90,end=99))
print(dna1.get_sequence(90,99))

CAATACAGA
CAATACAGA


In [6]:
print(dna1.get_sequence(start=90,length=9))
print(dna1.get_sequence(90,None,9))

CAATACAGA
CAATACAGA


`get_sequence_length()` also works for a subsequence when `(start, end)` is specified.

In [7]:
print(dna1.get_sequence_length(start=90,end=99))

9


`get_sequence()` returns a `Bio.Seq.Seq` object (from Biopython) by default. To get a plain Python string instead, pass the kwarg `as_string=True`.

In [8]:
print(type(dna1.get_sequence()))
print(type(dna1.get_sequence(as_string=True)))

<class 'Bio.Seq.Seq'>
<class 'str'>
