`BpForms` is a toolkit for unambiguously describing the primary sequences of biopolymers such as DNA, RNA, and proteins, including their modified forms. This tutorial illustrates how to use the `BpForms` Python API. Please see the [documentation](https://docs.karrlab.org/bpforms/) for more information about the `BpForms` notation and for instructions on using the `BpForms` website, JSON REST API, and command line interface.

### Import `BpForms`

In [1]:
import bpforms

### Create instances of `BpForm`
Use the [`BpForm` notation](https://docs.karrlab.org/bpforms) and the `BpForm.from_str` method to create an instance of `BpForm` to represent a form of a biopolymer.

`BpForms` includes six predefined alphabets and six predefined subclasses of `BpForm`:
* Canonical DNA nucleobases (`bpforms.canonical_dna_alphabet`, `bpforms.CanonicalDnaForm`): four canonical nucleobases
* Canonical RNA nucleosides (`bpforms.canonical_rna_alphabet`, `bpforms.CanonicalRnaForm`): four canonical nucleosides
* Canonical protein residues (`bpforms.canonical_protein_alphabet`, `bpforms.CanonicalProteinForm`): 20 canonical protein residues
* DNAmod DNA nucleobases (`bpforms.dna_alphabet`, `bpforms.DnaForm`): four canonical nucleobases plus the non-canonical nucleobases defined in [DNAmod](https://dnamod.hoffmanlab.org)
* MODOMICS RNA nucleosides (`bpforms.rna_alphabet`, `bpforms.RnaForm`): four canonical nucleosides plus the non-canonical nucleosides defined in [MODOMICS](http://modomics.genesilico.pl/modifications/)
* RESID protein residues (`bpforms.protein_alphabet`, `bpforms.ProteinForm`): 20 canonical protein residues plus the non-canonical protein residues defined in [RESID](https://pir.georgetown.edu/resid/)

#### Create `BpForm`s composed of canonical monomers
Monomers defined in the alphabets can be referenced by their single-character codes.

In [2]:
dna_form = bpforms.DnaForm().from_str('ACGT')

#### Create `BpForm`s that include non-canonical monomers
Some of the non-canonical monomers in the alphabets are represented by multiple characters. Their character codes must be delimited by curly brackets.

In [3]:
dna_form = bpforms.DnaForm().from_str('A{m2C}GT')
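
To make the delimiting rule concrete, here is a minimal sketch of how a sequence string could be split into monomer codes: a code in curly brackets is one token, and any other character is a single-character token. This is only an illustration and not the parser that `BpForms` uses; the `tokenize` helper is hypothetical, and it deliberately ignores the square-bracket syntax for inline monomers.

```python
import re

# One token is either a curly-bracketed multi-character code
# or a single character outside any bracket.
TOKEN = re.compile(r'\{([^{}]+)\}|([^{}\[\]])')


def tokenize(seq):
    """Split a sequence string into a list of monomer codes (illustrative only)."""
    tokens = []
    pos = 0
    while pos < len(seq):
        match = TOKEN.match(seq, pos)
        if match is None:
            raise ValueError('Unexpected character at position {}'.format(pos))
        # group(1) is a bracketed code, group(2) a single-character code
        tokens.append(match.group(1) or match.group(2))
        pos = match.end()
    return tokens


print(tokenize('A{m2C}GT'))
```

Without the brackets, `m2C` would be read as three separate single-character codes, which is why multi-character codes must be delimited.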

#### Create `BpForm`s that include monomers that are not defined in the alphabet
Additional monomers can be described in square brackets using one or more attributes separated by vertical pipes ("|"):
* `id`
* `name`
* `synonym`
* `identifier`
* `structure`: InChI-encoded string that represents the structure of the monomer
* `delta-mass`: additional mass in Daltons beyond that described by the `structure` attribute; used to represent uncertainty in the structure of the monomer.
* `delta-charge`: additional charge beyond that described by the `structure` attribute; used to represent uncertainty in the structure of the monomer.
* `position`: represents uncertainty in the location of a non-canonical monomer.
* `base-monomer`: represents the parent monomer of the monomer (e.g. the parent of m2A is A)
* `comments`

In [4]:
dna_form = bpforms.DnaForm().from_str(
    '''A[
         id: "m2C" 
         | name: "2-O-methylcytosine"
         | synonym: "4-amino-2-methoxypyrimidine"
         | synonym: "o-2-methylcytosine"
         | identifier: "ChEBI" / "CHEBI:70854"
         | structure: InChI=1S/C5H7N3O/c1-9-5-7-3-2-4(6)8-5/h2-3H,1H3,(H2,6,7,8)
         | comments: "Methylation of deoxycytidine"
         | base-monomer: "C"
         | position: 2-3
        ]GT'''.replace('\n', '').replace(' ', ''))
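
Note that `.replace(' ', '')` removes every space, including spaces inside quoted attribute values such as the comment above. If those spaces matter, one option is to strip whitespace only outside double-quoted strings. The sketch below shows one way to do this; `strip_unquoted_whitespace` is a hypothetical helper and not part of `BpForms`, and it assumes quoted values contain no escaped double quotes.

```python
import re


def strip_unquoted_whitespace(text):
    """Remove whitespace that lies outside double-quoted strings."""
    # Splitting on a capturing group keeps the quoted segments:
    # even indices are unquoted text, odd indices are quoted strings.
    parts = re.split(r'("[^"]*")', text)
    cleaned = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            cleaned.append(part)  # quoted: keep verbatim
        else:
            cleaned.append(re.sub(r'\s+', '', part))  # unquoted: drop whitespace
    return ''.join(cleaned)


multiline = '''A[
     id: "m2C"
     | comments: "Methylation of deoxycytidine"
    ]GT'''
print(strip_unquoted_whitespace(multiline))
```

The resulting string could then be passed to `from_str` in place of the double `replace` call.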

### Get and set monomers and slices of monomers of biopolymers
Individual monomers and slices of monomers can be retrieved and set using the same indexing syntax as Python lists.

In [5]:
dna_form[0]

<bpforms.core.Monomer at 0x7fb476cae8d0>

In [6]:
dna_form[2] = bpforms.dna_alphabet.monomers.A

In [7]:
dna_form[2:4]

[<bpforms.core.Monomer at 0x7fb476cae8d0>,
 <bpforms.core.Monomer at 0x7fb476bf5518>]

In [8]:
dna_form[2:4] = bpforms.DnaForm().from_str('TA')

### Get the base monomers of monomers

In [9]:
dna_form[1].base_monomers

{<bpforms.core.Monomer at 0x7fb476cae940>}

### Calculate the major protonation state of biopolymer forms at specific pHs

In [10]:
dna_form.protonate(8.)
str(dna_form)

'A[id: "m2C" | name: "2-O-methylcytosine" | synonym: "4-amino-2-methoxypyrimidine" | synonym: "o-2-methylcytosine" | identifier: "ChEBI" / "CHEBI:70854" | structure: InChI=1S/C5H6N3O/c1-9-5-7-3-2-4(6)8-5/h2-3H,1H3,(H-,6,7,8)/q-1/p+1 | position: 2-3 | base-monomer: "C" | comments: "Methylationofdeoxycytidine"]TA'

### Calculate physical properties of biopolymer forms
`BpForms` can calculate the length, formula, molecular weight, and charge of the biopolymer forms.

In [11]:
len(dna_form)

4

In [12]:
str(dna_form.get_formula())

'C40H48N15O27P4'

In [13]:
dna_form.get_mol_wt()

1294.797047992
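
As a sanity check on how the formula and the molecular weight relate, the weight can be approximated by summing standard atomic weights over the element counts in the formula. The sketch below is independent of `BpForms`; the `parse_formula` and `mol_wt` helpers and the atomic weight table are assumptions made for illustration, not the library's implementation.

```python
import re

# Standard atomic weights (g/mol) for the elements in this example
ATOMIC_WEIGHTS = {
    'C': 12.011,
    'H': 1.008,
    'N': 14.007,
    'O': 15.999,
    'P': 30.973761998,
}


def parse_formula(formula):
    """Parse a flat formula such as 'C40H48N15O27P4' into element counts."""
    counts = {}
    for element, count in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        # A missing count means a single atom
        counts[element] = counts.get(element, 0) + int(count or 1)
    return counts


def mol_wt(formula):
    """Sum the atomic weights over all atoms in the formula."""
    return sum(ATOMIC_WEIGHTS[element] * count
               for element, count in parse_formula(formula).items())


print(mol_wt('C40H48N15O27P4'))
```

For the formula above, this sum comes to approximately 1294.797, in line with the value returned by `get_mol_wt`.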

In [14]:
dna_form.get_charge()

-5

### Get FASTA representations of biopolymer forms
`BpForms` can generate FASTA representations of forms. This is useful for analyzing forms with sequence-based tools such as BLAST. Where annotated, this method uses the `base_monomers` attribute to represent modified monomers by the codes of their root monomers (e.g. m2A is represented as "A"). Monomers without annotated bases are represented as "N".

In [15]:
dna_form.to_fasta()

'ACTA'
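
The mapping rule described above can be sketched in plain Python. This is a simplified illustration rather than the `BpForms` implementation: monomers are represented by their codes, `base_of` is a hypothetical dictionary from a modified monomer's code to the code of its single base monomer, and `root_code` and `to_fasta_sketch` are hypothetical helpers.

```python
CANONICAL_CODES = {'A', 'C', 'G', 'T'}


def root_code(code, base_of):
    """Follow base-monomer annotations up to a canonical code, else return 'N'."""
    seen = set()
    while code not in CANONICAL_CODES:
        # No annotated base, or a cycle in the annotations: unknown monomer
        if code in seen or code not in base_of:
            return 'N'
        seen.add(code)
        code = base_of[code]
    return code


def to_fasta_sketch(codes, base_of):
    """Join the root codes of a sequence of monomer codes."""
    return ''.join(root_code(code, base_of) for code in codes)


# m2C is annotated with base monomer C; 'x' has no annotation
print(to_fasta_sketch(['A', 'm2C', 'T', 'A'], {'m2C': 'C'}))
print(to_fasta_sketch(['A', 'x', 'T'], {}))
```

The first call mirrors the form used in this tutorial, where the monomer annotated with `base-monomer: "C"` is reported as "C".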

### Determine if biopolymers are equal
Use the `is_equal` method to check if two biopolymers are equal.

In [16]:
dna_form_1 = bpforms.DnaForm().from_str('ACGT')
dna_form_2 = bpforms.DnaForm().from_str('ACGT')
dna_form_3 = bpforms.DnaForm().from_str('GCTC')

In [17]:
dna_form_1.is_equal(dna_form_2)

True

In [18]:
dna_form_1.is_equal(dna_form_3)

False