`BpForms` is a toolkit for unambiguously describing the primary sequences of biopolymers such as DNA, RNA, and proteins, including their modified forms. `BpForms` represents biopolymers as monomers linked together via backbones. This tutorial illustrates how to use the `BpForms` Python API. Please see the [documentation](https://docs.karrlab.org/bpforms/) for more information about the `BpForms` notation and for instructions on using the `BpForms` website, JSON REST API, and command line interface.

### Import `BpForms`

In [1]:
import bpforms

### Create instances of `BpForm`
Use the [`BpForm` notation](https://docs.karrlab.org/bpforms) and the `BpForm.from_str` method to create an instance of `BpForm` to represent a form of a biopolymer.

`BpForms` includes six predefined alphabets and six corresponding predefined subclasses of `BpForm`:
* Canonical DNA nucleobases (`bpforms.canonical_dna_alphabet`, `bpforms.CanonicalDnaForm`): four canonical nucleobases
* Canonical RNA nucleosides (`bpforms.canonical_rna_alphabet`, `bpforms.CanonicalRnaForm`): four canonical nucleosides
* Canonical protein residues (`bpforms.canonical_protein_alphabet`, `bpforms.CanonicalProteinForm`): 20 canonical protein residues
* DNAmod DNA nucleobases (`bpforms.dna_alphabet`, `bpforms.DnaForm`): four canonical nucleobases plus the non-canonical nucleobases defined in [DNAmod](https://dnamod.hoffmanlab.org)
* MODOMICS RNA nucleosides (`bpforms.rna_alphabet`, `bpforms.RnaForm`): four canonical nucleosides plus the non-canonical nucleosides defined in [MODOMICS](http://modomics.genesilico.pl/modifications/)
* RESID protein residues (`bpforms.protein_alphabet`, `bpforms.ProteinForm`): 20 canonical protein residues plus the non-canonical protein residues defined in [RESID](https://pir.georgetown.edu/resid/)

#### Create `BpForm`s composed of canonical monomers
Monomers defined in the alphabets can be referenced by their single-character codes.

In [2]:
dna_form = bpforms.DnaForm().from_str('ACGT')

#### Create `BpForm`s that include non-canonical monomers
Some of the non-canonical monomers in the alphabets have multi-character codes. These codes must be enclosed in curly brackets.

In [3]:
dna_form = bpforms.DnaForm().from_str('A{m2C}GT')
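
To make the two kinds of codes concrete, the following is a minimal sketch of how a sequence could be split into single-character codes and curly-bracket-delimited codes. It is not the `BpForms` parser: the function name `split_monomer_codes` and the regular expression are assumptions made for illustration, and inline monomers in square brackets are deliberately not handled.

```python
import re

# Illustrative only: this is not the BpForms parser. It recognizes just two
# token types: a code enclosed in curly brackets, or a single character.
TOKEN_PATTERN = re.compile(r'\{([^{}]+)\}|([^{}\[\]\s])')


def split_monomer_codes(sequence):
    """Split a sequence such as 'A{m2C}GT' into a list of monomer codes."""
    codes = []
    position = 0
    while position < len(sequence):
        match = TOKEN_PATTERN.match(sequence, position)
        if match is None:
            raise ValueError(
                f'Unexpected character at position {position}: {sequence[position]!r}')
        # Group 1 holds a bracketed code, group 2 a single-character code
        codes.append(match.group(1) or match.group(2))
        position = match.end()
    return codes


print(split_monomer_codes('ACGT'))      # ['A', 'C', 'G', 'T']
print(split_monomer_codes('A{m2C}GT'))  # ['A', 'm2C', 'G', 'T']
```

Without the curly brackets, `m2C` would be read as three separate single-character codes, which is why the delimiters are required.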

#### Create `BpForm`s that include monomers that are not defined in the alphabet
Additional monomers can be described in square brackets using one or more attributes separated by vertical pipes ("|"):
* `id`
* `name`
* `synonym`
* `identifier`
* `structure`: SMILES-encoded string that represents the structure of the monomer
* `monomer-bond-atom`: list of atoms involved in bond(s) with the backbone
* `monomer-displaced-atom`: list of atoms displaced by the formation of bond(s) with the backbone
* `left-bond-atom`: list of atoms involved in bond(s) with monomers to the left
* `left-displaced-atom`: list of atoms displaced by the formation of bond(s) with monomers to the left
* `right-bond-atom`: list of atoms involved in bond(s) with monomers to the right
* `right-displaced-atom`: list of atoms displaced by the formation of bond(s) with monomers to the right
* `delta-mass`: additional mass in Daltons beyond that described by the `structure` attribute; used to represent uncertainty in the structure of the monomer
* `delta-charge`: additional charge beyond that described by the `structure` attribute; used to represent uncertainty in the structure of the monomer
* `position`: represents uncertainty in the location of a non-canonical monomer
* `base-monomer`: the parent monomer from which the monomer is derived (e.g. the parent of m2A is A)
* `comments`

In [4]:
dna_form = bpforms.DnaForm().from_str(
    '''A[
         id: "m2C" 
         | name: "2-O-methylcytosine"
         | synonym: "4-amino-2-methoxypyrimidine"
         | synonym: "o-2-methylcytosine"
         | identifier: "ChEBI" / "CHEBI:70854"
         | structure: "COC1=NC(N)=CC=N1"
         | monomer-bond-atom: Monomer / N / 9 / 0
         | comments: "Methylation of deoxycytidine"
         | base-monomer: "C"
         | position: 2-3
        ]GT'''.replace('\n', '').replace(' ', ''))
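
Note that `.replace(' ', '')` removes every space, including those inside quoted values such as the comment. Where quoted text should stay intact, whitespace can be stripped only outside double quotes. The sketch below shows one way to do that with the standard library; `strip_unquoted_whitespace` is a hypothetical helper, not part of `BpForms`, and it assumes quoted values contain no escaped double quotes.

```python
import re


def strip_unquoted_whitespace(text):
    """Remove whitespace outside double-quoted strings; keep quoted text intact."""
    # Split into alternating unquoted and quoted chunks. The capturing group
    # keeps the quoted chunks (including their quotes) in the result.
    chunks = re.split(r'("[^"]*")', text)
    return ''.join(
        chunk if chunk.startswith('"') else re.sub(r'\s+', '', chunk)
        for chunk in chunks
    )


raw = '''A[
     id: "m2C"
     | comments: "Methylation of deoxycytidine"
    ]GT'''

print(raw.replace('\n', '').replace(' ', ''))
# A[id:"m2C"|comments:"Methylationofdeoxycytidine"]GT
print(strip_unquoted_whitespace(raw))
# A[id:"m2C"|comments:"Methylation of deoxycytidine"]GT
```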

### Get and set monomers and slices of monomers of biopolymers
Individual monomers and slices of monomers can be retrieved and set using the same indexing syntax as Python lists.

In [5]:
dna_form[0]

<bpforms.core.Monomer at 0x7f2d4d094908>

In [6]:
dna_form[2] = bpforms.dna_alphabet.monomers.A

In [7]:
dna_form[2:4]

[<bpforms.core.Monomer at 0x7f2d4d094908>,
 <bpforms.core.Monomer at 0x7f2d4d03f198>]

In [8]:
dna_form[2:4] = bpforms.DnaForm().from_str('TA')

### Get and set the bases of monomers

In [9]:
dna_form[1].base_monomers

{<bpforms.core.Monomer at 0x7f2d4d086940>}

### Calculate the major protonation state of biopolymer forms at specific pHs

In [10]:
microspecies = dna_form.get_major_micro_species(8.)
microspecies.GetTotalCharge()

-5

In [11]:
microspecies = dna_form.get_major_micro_species(3.)
microspecies.GetTotalCharge()

-2

### Calculate physical properties of biopolymer forms
`BpForms` can calculate the length, formula, molecular weight, and charge of biopolymer forms.

In [12]:
len(dna_form)

4

In [13]:
str(dna_form.get_formula())

'C40H49N15O24P4'

In [14]:
dna_form.get_mol_wt()

1247.808047992
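
As a sanity check, the molecular weight reported above can be reproduced from the formula string. The sketch below is independent of `BpForms`: `mol_wt_from_formula` and the atomic weight table are assumptions made for illustration, and `BpForms` may use a different table internally.

```python
import re

# Standard atomic weights (g/mol) for the elements in the formula above.
# These values are an assumption of this sketch, not taken from BpForms.
ATOMIC_WEIGHTS = {
    'C': 12.011,
    'H': 1.008,
    'N': 14.007,
    'O': 15.999,
    'P': 30.973761998,
}


def mol_wt_from_formula(formula):
    """Sum atomic weights for a simple formula such as 'C40H49N15O24P4'."""
    total = 0.0
    # Each match is an element symbol followed by an optional count
    for element, count in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        total += ATOMIC_WEIGHTS[element] * (int(count) if count else 1)
    return total


print(round(mol_wt_from_formula('C40H49N15O24P4'), 6))  # 1247.808048
```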

In [15]:
dna_form.get_charge()

-5

### Get FASTA representations of biopolymer forms
`BpForms` can generate FASTA representations of forms. This is useful for analyzing forms with sequence-based tools such as BLAST. Where annotated, the `get_fasta` method uses the `base_monomers` attribute to represent modified monomers by the codes of their root monomers (e.g. m2A is represented as "A"). Monomers without annotated bases are represented as "N".

In [16]:
dna_form.get_fasta()

'ACTA'

In [17]:
dna_form_2 = bpforms.DnaForm().from_str(
    '''A[
         id: "m2C" 
         | name: "2-O-methylcytosine"
         | synonym: "4-amino-2-methoxypyrimidine"
         | synonym: "o-2-methylcytosine"
         | identifier: "ChEBI" / "CHEBI:70854"
         | structure: "COC1=NC(N)=CC=N1"
         | monomer-bond-atom: Monomer / N / 9 / 0
         | comments: "Methylation of deoxycytidine"
         | base-monomer: "C"
         | position: 2-3
        ]GT'''.replace('\n', '').replace(' ', ''))
dna_form_2.get_fasta()

'ACGT'

In [18]:
dna_form_2 = bpforms.DnaForm().from_str(
    '''A[
         id: "m2C" 
         | name: "2-O-methylcytosine"
         | synonym: "4-amino-2-methoxypyrimidine"
         | synonym: "o-2-methylcytosine"
         | identifier: "ChEBI" / "CHEBI:70854"
         | structure: "COC1=NC(N)=CC=N1"
         | monomer-bond-atom: Monomer / N / 9 / 0
         | comments: "Methylation of deoxycytidine"
         | position: 2-3
        ]GT'''.replace('\n', '').replace(' ', ''))
dna_form_2.get_fasta()

'ANGT'
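
The two results above follow a simple fallback rule, which can be sketched without `BpForms` as follows. The data layout (a list of `(code, base_code)` pairs) and the names `to_fasta` and `CANONICAL_CODES` are hypothetical, and the sketch looks only one level up, whereas the text above describes resolving a modified monomer to its root.

```python
# Illustrative only: canonical single-character DNA codes
CANONICAL_CODES = {'A', 'C', 'G', 'T'}


def to_fasta(monomers, unknown_code='N'):
    """Build a FASTA string from (code, base_code) pairs.

    base_code is the code of the annotated base monomer, or None if the
    monomer has no annotated base.
    """
    letters = []
    for code, base_code in monomers:
        if code in CANONICAL_CODES:
            letters.append(code)          # canonical monomer: use its own code
        elif base_code in CANONICAL_CODES:
            letters.append(base_code)     # modified monomer with an annotated base
        else:
            letters.append(unknown_code)  # no annotated base
    return ''.join(letters)


# With the base annotated, the modified monomer is reported as "C"
print(to_fasta([('A', None), ('m2C', 'C'), ('G', None), ('T', None)]))   # ACGT
# Without the annotation, it is reported as "N"
print(to_fasta([('A', None), ('m2C', None), ('G', None), ('T', None)]))  # ANGT
```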

### Determine if biopolymers are equal
Use the `is_equal` method to check if two biopolymers are equal.

In [19]:
dna_form_1 = bpforms.DnaForm().from_str('ACGT')
dna_form_2 = bpforms.DnaForm().from_str('ACGT')
dna_form_3 = bpforms.DnaForm().from_str('GCTC')

In [20]:
dna_form_1.is_equal(dna_form_2)

True

In [21]:
dna_form_1.is_equal(dna_form_3)

False