---
title: Biocollections
category: BioFSharp Core
categoryindex: 1
index: 3
---

In [5]:
(*** hide ***)

(*** condition: prepare ***)
#r "nuget: Plotly.NET, 4.2.0"
#r "nuget: FSharpAux, 2.0.0"
#r "nuget: FSharpAux.IO, 2.0.0"
#r "nuget: FSharp.Stats, 0.4.11"
#r "../../src/BioFSharp/bin/Release/netstandard2.0/BioFSharp.dll"
#r "../../src/BioFSharp.IO/bin/Release/netstandard2.0/BioFSharp.IO.dll"
#r "../../src/BioFSharp.BioContainers/bin/Release/netstandard2.0/BioFSharp.BioContainers.dll"
#r "../../src/BioFSharp.ML/bin/Release/netstandard2.0/BioFSharp.ML.dll"
#r "../../src/BioFSharp.Stats/bin/Release/netstandard2.0/BioFSharp.Stats.dll"

// in the documentation, we have to register formatters manually because we cannot load the extension as nuget package to trigger automatic registration
#r "../../src/BioFSharp.Interactive/bin/Release/net6.0/BioFSharp.Interactive.dll"
BioFSharp.Interactive.Formatters.registerAll()

# BioCollections

*Summary:* This example shows how to use collections of biological items in BioFSharp.

Analogous to the built-in collections, BioFSharp provides `BioSeq`, `BioList` and `BioArray`, each with operations optimized for its collection type. 
The easiest way to create them is with the `of...String` functions, such as `ofAminoAcidString` and `ofNucleotideString`:

In [6]:
open BioFSharp

let s1 = "PEPTIDE" |> BioSeq.ofAminoAcidString 
let s2 = "PEPTIDE" |> BioList.ofAminoAcidString 
let s3 = "TAGCAT"  |> BioArray.ofNucleotideString 

In [10]:
s1, s2, s3

| Field | Value |
|---|---|
| Item1 | 1 PEPTIDE |
| Item2 | 1 PEPTIDE |
| Item3 | 1 TAGCAT |


## Nucleotides

![Nucleotides1](../img/Nucleotides.svg)

**Figure 1: Selection of covered nucleotide operations** (A) Biological principle. (B) Workflow with `BioSeq`. (C) Other covered functionalities.

Let's imagine you have a given gene sequence and want to find out what the corresponding protein might look like.

In [11]:
let myGene = BioArray.ofNucleotideString "ATGGCTAGATCGATCGATCGGCTAACGTAA"

myGene

Yikes! Unfortunately we got the 5'-3' coding strand. For proper transcription we should get the complementary strand first:

In [12]:
let myProperGene = BioArray.complement myGene

myProperGene

Now let's transcribe and translate it:

In [16]:
let myTranslatedGene = 
    myProperGene
    |> BioArray.transcribeTemplateStrand
    |> BioArray.translate 0

myTranslatedGene

Of course, if your input sequence originates from the coding strand, you can directly transcribe it to mRNA, since the 
only difference between the coding strand and the mRNA is the replacement of 'T' by 'U' (Figure 1B).

In [14]:
let myTranslatedGeneFromCodingStrand = 
    myGene
    |> BioArray.transcribeCodingStrand
    |> BioArray.translate 0

myTranslatedGeneFromCodingStrand
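To make the relationship between the two routes explicit, here is a minimal sketch on plain strings. It does not use BioFSharp; `complementBase`, `transcribeCoding`, and `transcribeTemplate` are hypothetical helpers written only for this illustration, and the template strand is assumed to be stored in the same index order as the coding strand (that is, read 3'-5').

```fsharp
// Plain-string sketch; the helpers below are hypothetical and are
// not part of the BioFSharp API.

/// Watson-Crick complement of a single DNA base.
let complementBase (b: char) =
    match b with
    | 'A' -> 'T'
    | 'T' -> 'A'
    | 'G' -> 'C'
    | 'C' -> 'G'
    | other -> failwithf "Unexpected DNA base: %c" other

/// Coding strand -> mRNA: replace every T with U.
let transcribeCoding (dna: string) =
    dna |> String.map (fun b -> if b = 'T' then 'U' else b)

/// Template strand -> mRNA: complement each base, then replace T with U.
let transcribeTemplate (dna: string) =
    dna |> String.map complementBase |> transcribeCoding

let codingStrand = "ATGGCTAGATCGATCGATCGGCTAACGTAA"
let templateStrand = codingStrand |> String.map complementBase

// Both routes should produce the same mRNA.
let mRnaFromCoding = transcribeCoding codingStrand
let mRnaFromTemplate = transcribeTemplate templateStrand
```

Under these assumptions, `mRnaFromCoding` and `mRnaFromTemplate` are the same string, which is why either strand can serve as the starting point as long as the matching transcription function is used.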

Other nucleotide conversion operations are also covered:

In [17]:
let mySmallGene = BioSeq.ofNucleotideString  "ATGTTCCGAT"

mySmallGene

In [18]:
BioSeq.reverse mySmallGene 

In [19]:
BioSeq.complement mySmallGene

In [20]:
BioSeq.reverseComplement mySmallGene
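The three operations above are closely related: the reverse complement is the complement read in the opposite direction. The following sketch shows this on plain strings. It is independent of BioFSharp, and `complementBase`, `reverse`, `complement`, and `reverseComplement` here are hypothetical string helpers, not the library functions of the same name.

```fsharp
// Plain-string sketch; these helpers are hypothetical and only
// mirror the idea behind the BioSeq functions shown above.

/// Watson-Crick complement of a single DNA base.
let complementBase (b: char) =
    match b with
    | 'A' -> 'T'
    | 'T' -> 'A'
    | 'G' -> 'C'
    | 'C' -> 'G'
    | other -> failwithf "Unexpected DNA base: %c" other

/// Reverse the order of the bases.
let reverse (dna: string) =
    System.String(Array.rev (dna.ToCharArray()))

/// Complement every base, keeping the order.
let complement (dna: string) =
    dna |> String.map complementBase

/// Complement every base, then reverse the order.
let reverseComplement (dna: string) =
    dna |> complement |> reverse

let small = "ATGTTCCGAT"

let reversed = reverse small                    // "TAGCCTTGTA"
let complemented = complement small             // "TACAAGGCTA"
let reverseComplemented = reverseComplement small // "ATCGGAACAT"
```

Applying `reverseComplement` twice returns the original sequence, which is a handy sanity check for any implementation of this operation.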

## AminoAcids

### Basics
Some frequently needed functions work on both nucleotide and amino acid collections:


In [22]:
let myPeptide = "PEPTIDE" |> BioSeq.ofAminoAcidString 

myPeptide

In [25]:
myPeptide 
|> BioSeq.toFormula 
|> Formula.toString 

C34.00 H51.00 N7.00 O14.00

In [27]:
BioSeq.toAverageMass myPeptide 

### Digestion
BioFSharp also comes equipped with a set of tools for cutting amino acid sequences apart. To demonstrate the usage, we'll throw some `trypsin` at the small RuBisCO subunit of _Arabidopsis thaliana_:  
In the first step, we define our input sequence and the protease we want to use.


In [28]:
let RBCS = 
    """MASSMLSSATMVASPAQATMVAPFNGLKSSAAFPATRKANNDITSITSNGGRVNCMQVWP
    PIGKKKFETLSYLPDLTDSELAKEVDYLIRNKWIPCVEFELEHGFVYREHGNSPGYYDGR
    YWTMWKLPLFGCTDSAQVLKEVEECKKEYPNAFIRIIGFDNTRQVQCISFIAYKPPSFT""" 
    |> BioArray.ofAminoAcidString

RBCS

In [30]:
let trypsin = Digestion.Table.getProteaseBy "Trypsin"

let digestedRBCS = Digestion.BioArray.digest trypsin 0 RBCS 

digestedRBCS
|> Seq.head

| Field | Value |
|---|---|
| ProteinID | 0 |
| MissCleavages | 0 |
| CleavageStart | 0 |
| CleavageEnd | 27 |
| PepSequence | 1 MASSMLSSAT MVASPAQATM VAPFNGLK |


In reality, proteases do not always cut a protein completely. Some cleavage sites stay intact, and the resulting missed cleavages should be considered in in silico analyses. 
This can be done with the `concernMissCleavages` function. It takes the minimum and maximum number of missed cleavages to allow, along with the digested protein, and returns all peptides that can arise from these settings.


In [31]:
let digestedRBCS' = Digestion.BioArray.concernMissCleavages 0 2 digestedRBCS

digestedRBCS
|> Seq.item 1

| Field | Value |
|---|---|
| ProteinID | 0 |
| MissCleavages | 0 |
| CleavageStart | 28 |
| CleavageEnd | 36 |
| PepSequence | 1 SSAAFPATR |
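To illustrate the idea behind missed cleavages, here is a simplified sketch on plain strings. It is not how BioFSharp implements digestion: `cleave` and `withMissedCleavages` are hypothetical helpers, and the cleavage rule (cut after every K or R, with no exceptions) is an assumption made for brevity. A peptide with `n` missed cleavages is modelled as `n + 1` neighbouring fully cleaved peptides joined together.

```fsharp
// Simplified sketch; `cleave` and `withMissedCleavages` are hypothetical
// helpers and do not reflect the BioFSharp implementation.

/// Cut after every K or R (assumed rule, no exceptions considered).
let cleave (protein: string) : string list =
    let pieces, rest =
        protein
        |> Seq.fold (fun (acc: string list, current: string) (aa: char) ->
            let extended = current + string aa
            if aa = 'K' || aa = 'R' then (extended :: acc, "")
            else (acc, extended)) ([], "")
    // Keep a trailing fragment that does not end in K or R.
    let all = if rest = "" then pieces else rest :: pieces
    List.rev all

/// Join neighbouring peptides to model minMissed..maxMissed missed cleavages.
/// Returns pairs of (number of missed cleavages, peptide sequence).
let withMissedCleavages (minMissed: int) (maxMissed: int) (peptides: string list) =
    let arr = List.toArray peptides
    [ for missed in minMissed .. maxMissed do
        for start in 0 .. arr.Length - missed - 1 do
            yield missed, String.concat "" arr.[start .. start + missed] ]

// First 51 residues of the RBCS sequence used above.
let fragment = "MASSMLSSATMVASPAQATMVAPFNGLKSSAAFPATRKANNDITSITSNGGR"

let fullyCleaved = cleave fragment
let allPeptides = withMissedCleavages 0 2 fullyCleaved
```

For this fragment, `fullyCleaved` holds four peptides, and `allPeptides` additionally contains every run of two or three neighbouring peptides, for example `MASSMLSSATMVASPAQATMVAPFNGLKSSAAFPATR` with one missed cleavage.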
