# NB07a Working with Deedle

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/CSBiology/BIO-BTE-06-L-7/gh-pages?filepath=NB07a_Working_With_Deedle.ipynb)

[Download Notebook](https://github.com/CSBiology/BIO-BTE-06-L-7/releases/download/NB07a/NB07a_Working_With_Deedle.ipynb)

[Deedle](http://bluemountaincapital.github.io/Deedle/index.html) is an easy-to-use library for data and time series manipulation and for scientific 
programming. It supports structured data frames, ordered and unordered data, and time series.

The analysis of your data in the following notebooks will be mostly done in Deedle, so here are some explanations and examples to help you better understand 
the analysis notebooks.

We start by loading our usual NuGet packages and the Deedle package.



In [1]:
#r "nuget: Deedle, 2.3.0"
#r "nuget: BioFSharp, 2.0.0-beta4"
#r "nuget: BioFSharp.IO, 2.0.0-beta4"
#r "nuget: BioFSharp.Mz, 0.1.5-beta"
#r "nuget: BIO-BTE-06-L-7_Aux, 0.0.9"
#r "nuget: FSharp.Stats"

#r "nuget: Plotly.NET, 2.0.0-preview.16"
#r "nuget: Plotly.NET.Interactive, 2.0.0-preview.16"

open Plotly.NET
open BioFSharp
open BioFSharp.Mz
open BIO_BTE_06_L_7_Aux.FS3_Aux
open BIO_BTE_06_L_7_Aux.Deedle_Aux
open System.IO
open Deedle
open FSharp.Stats


## Deedle Basics
Familiarize yourself with Deedle! Create a series yourself and add it to the 'persons' frame.



In [2]:
let firstNames      = Series.ofValues ["Kevin";"Lukas";"Benedikt";"Michael"] 
let coffeesPerWeek  = Series.ofValues [15;12;10;11] 
let lastNames       = Series.ofValues ["Schneider";"Weil";"Venn";"Schroda"]  
let group           = Series.ofValues ["CSB";"CSB";"CSB";"MBS"] 
let persons = 
    Frame.ofColumns(List.zip ["fN";"lN";"g"] [firstNames;lastNames;group])
    |> Frame.addCol "cpw" coffeesPerWeek


In [None]:
persons
|> formatAsTable 0.
|> Chart.withSize (600.,700.)


Follow the scheme above and create another frame that has the same columns but represents different persons (the frame can be small, e.g. two persons).
Use the function `Frame.merge` to combine your frame and 'persons'.

Back to the frame 'persons'! Below you see a sequence of frame and series manipulations.



In [3]:
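// Extract the column "cpw" as a series keyed by the row index.
let coffeePerWeek' :Series<int,int> = Frame.getCol ("cpw") persons 
// Group the rows by the column "g"; the row key becomes a tuple of (group, original index).
let groupedByG :Frame<string*int,_> = persons |> Frame.groupRowsBy "g"
// Keep only the listed columns, which drops the now redundant column "g".
let withOutG :Frame<string*int,_> = groupedByG |> Frame.sliceCols ["fN";"lN";"cpw"]
// The same column as above, but carrying the two-level row key.
let coffeePerWeek'' :Series<string*int,int>= groupedByG |> Frame.getCol ("cpw")
// Sum the values per group, i.e. per first element of the key tuple (CSB: 15 + 12 + 10, MBS: 11).
let coffeePerWeekPerGroup = Series.applyLevel Pair.get1Of2 (Series.values >> Seq.sum) coffeePerWeek''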


Now that you have gotten to know the object `Frame`, which is a collection of `Series`, we move on to a real dataset. 
As our dataset we take the FASTA file with Chlamy proteins, take its first 50 proteins, and digest them.
The digested peptides are represented using a record type. Deedle frames can be constructed directly from
record types with `Frame.ofRecords`. Alternatively, a character-separated file could be used as the source for a frame.
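
The file-based alternative mentioned above can be sketched as follows. This is only an illustration: the temporary file name and its two rows are made up for the example, and the cell writes the file itself so that it can be run on its own. It assumes the `Frame.ReadCsv` overload that takes a path together with the optional `hasHeaders` and `separators` arguments.

```fsharp
#r "nuget: Deedle, 2.3.0"
open System.IO
open Deedle

// Hypothetical example file: two peptides, tab-separated, with a header row.
let tsvPath = Path.Combine(Path.GetTempPath(), "peptides_example.tsv")
let exampleLines =
    [| "PeptideSequence\tProtein"
       "MDATSK\tCre38.g759997.t1.1"
       "MYGAIR\tCre08.g363350.t1.1" |]
File.WriteAllLines(tsvPath, exampleLines)

// Read the file into a frame; the columns are named after the header row
// and the rows are indexed by a running number.
let peptidesFromFile : Frame<int,string> =
    Frame.ReadCsv(tsvPath, hasHeaders = true, separators = "\t")

// Optionally use one of the columns as row key instead of the running number.
let indexedBySequence : Frame<string,string> =
    peptidesFromFile |> Frame.indexRowsString "PeptideSequence"
```

Reading from a file is convenient when the table already exists on disk, while `Frame.ofRecords` is the natural choice when the data is computed inside the notebook, as it is here.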



In [4]:
let path = Path.Combine[|__SOURCE_DIRECTORY__;"downloads/Chlamy_JGI5_5(Cp_Mp).fasta"|]
downloadFile path "Chlamy_JGI5_5(Cp_Mp).fasta" "bio-bte-06-l-7"

let examplePeptides = 
    path
    |> IO.FastA.fromFile BioArray.ofAminoAcidString
    |> Seq.toArray
    |> Array.take 50
    |> Array.mapi (fun i fastAItem ->
        Digestion.BioArray.digest Digestion.Table.Trypsin i fastAItem.Sequence
        |> Digestion.BioArray.concernMissCleavages 0 0 
        |> Array.map (fun dp ->
            {|
                PeptideSequence = dp.PepSequence
                Protein = fastAItem.Header.Split ' ' |> Array.head
            |}
        )
    )
    |> Array.concat
    |> Array.filter (fun x -> x.PeptideSequence.Length > 5)

let peptidesFrame =
    examplePeptides
    |> Frame.ofRecords


In [None]:
peptidesFrame
|> Frame.take 10
|> formatAsTable 0.
|> Chart.withSize (700.,900.)


| | PeptideSequence | Protein |
|---|---|---|
| 0 | [Asp; Leu; His; ... ] | Cre38.g759997.t1.1 |
| 1 | [Val; Gln; Tyr; ... ] | Cre38.g759997.t1.1 |
| 2 | [Ala; Gln; Gln; ... ] | Cre38.g759997.t1.1 |
| 3 | [Asp; Thr; Ile; ... ] | Cre38.g759997.t1.1 |
| 4 | [His; Leu; His; ... ] | Cre38.g759997.t1.1 |
| 5 | [Leu; Pro; Pro; ... ] | Cre38.g759997.t1.1 |
| 6 | [Ala; Asp; Leu; ... ] | Cre38.g759997.t1.1 |
| 7 | [Met; Asp; Ala; ... ] | Cre38.g759997.t1.1 |
| 8 | [Tyr; Ser; Ala; ... ] | Cre08.g363350.t1.1 |
| 9 | [Met; Tyr; Gly; ... ] | Cre08.g363350.t1.1 |

As you can see, our columns are named after the fields of the record type, while our rows are indexed by numbers only. It is often helpful to use a more descriptive
row key. In this case, we can use the peptide sequence for that.  
**Note:** Row keys must be unique. By grouping by "PeptideSequence", we get the sequence tupled with the index as the key. 
The function `Frame.reduceLevel` then aggregates the rows based on the first part of the tuple, the peptide sequence, and ignores the second part, the index. 
The aggregator function given to `Frame.reduceLevel` aggregates each column separately. Here it joins the protein names of rows that share a peptide sequence into one comma-separated string.
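
To see what the grouping and the reduction do, it can help to try them on a toy frame first. The sketch below uses made-up peptide and protein names (plain strings instead of amino acid lists), in which the peptide "AAK" occurs in two proteins.

```fsharp
#r "nuget: Deedle, 2.3.0"
open Deedle

// Made-up toy data: the peptide "AAK" occurs in two different proteins.
let toyPeptides =
    [ {| Peptide = "AAK"; Protein = "P1" |}
      {| Peptide = "CCR"; Protein = "P1" |}
      {| Peptide = "AAK"; Protein = "P2" |} ]
    |> Frame.ofRecords

// Group by peptide (row key becomes peptide * index), drop the now redundant
// column, then merge all rows that share the first part of the key.
let toyByPeptide : Frame<string,string> =
    toyPeptides
    |> Frame.groupRowsBy "Peptide"
    |> Frame.dropCol "Peptide"
    |> Frame.reduceLevel fst (fun (a: string) b -> a + "," + b)

toyByPeptide.Print()
```

The three input rows collapse to two, one per distinct peptide, and the row for "AAK" carries both protein names in a single string.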



In [6]:
let pfIndexedSequenceList : Frame<list<AminoAcids.AminoAcid>,string> =
    peptidesFrame
    |> Frame.groupRowsBy "PeptideSequence"
    |> Frame.dropCol "PeptideSequence"
    |> Frame.reduceLevel fst (fun a b -> a + "," + b)


In [None]:
pfIndexedSequenceList
|> Frame.take 10
|> formatAsTable 0.
|> Chart.withSize (500.,900.)


| | Protein |
|---|---|
| [Asp; Leu; His; ... ] | Cre38.g759997.t1.1 |
| [Val; Gln; Tyr; ... ] | Cre38.g759997.t1.1 |
| [Ala; Gln; Gln; ... ] | Cre38.g759997.t1.1 |
| [Asp; Thr; Ile; ... ] | Cre38.g759997.t1.1 |
| [His; Leu; His; ... ] | Cre38.g759997.t1.1 |
| [Leu; Pro; Pro; ... ] | Cre38.g759997.t1.1 |
| [Ala; Asp; Leu; ... ] | Cre38.g759997.t1.1 |
| [Met; Asp; Ala; ... ] | Cre38.g759997.t1.1 |
| [Tyr; Ser; Ala; ... ] | Cre08.g363350.t1.1 |
| [Met; Tyr; Gly; ... ] | Cre08.g363350.t1.1 |

Our rows are now indexed by the peptide sequences. Each peptide sequence is still a list of amino acids. For better readability we can transform it to its string representation. 
To do so, we map over the row keys, much like mapping over an array, and call the function `BioList.toString` on each row key.



In [8]:
let pfIndexedStringSequence =
    pfIndexedSequenceList
    |> Frame.mapRowKeys (fun rc -> rc |> BioList.toString)


In [None]:
pfIndexedStringSequence
|> Frame.take 10
|> formatAsTable 0.
|> Chart.withSize (800.,900.)


| | Protein |
|---|---|
| DLHPLLTSLPTKPGSAATPYTTTQSPPSTTLS* | Cre38.g759997.t1.1 |
| VQYTPQSAISLGFAGTIR | Cre38.g759997.t1.1 |
| AQQQQALAADLR | Cre38.g759997.t1.1 |
| DTIHIIEAGYTADTNHAAK | Cre38.g759997.t1.1 |
| HLHYRPDILLIPSISLAAALNPDFVVLPSER | Cre38.g759997.t1.1 |
| LPPWLLPDQEGKPAGR | Cre38.g759997.t1.1 |
| ADLPDYAADNR | Cre38.g759997.t1.1 |
| MDATSK | Cre38.g759997.t1.1 |
| YSADGVTVCGR | Cre08.g363350.t1.1 |
| MYGAIR | Cre08.g363350.t1.1 |

We now have a frame containing information about our peptides. To add further information we can go back to the peptide array we started with and calculate, 
for example, the monoisotopic mass. Each mass is tupled with the peptide sequence as a string, which matches the row keys of our peptide frame. The resulting array
of tuples can then be transformed into a series with the function `series`.



In [10]:
let peptidesAndMasses =
    examplePeptides
    |> Array.distinctBy (fun x -> x.PeptideSequence)
    |> Array.map (fun peptide ->
        // calculate mass for each peptide
        peptide.PeptideSequence |> BioList.toString, BioSeq.toMonoisotopicMassWith (BioItem.monoisoMass ModificationInfo.Table.H2O) peptide.PeptideSequence
        )

let peptidesAndMassesSeries =
    peptidesAndMasses
    |> series


The columns in frames consist of series. Since we now have a series containing our monoisotopic masses, together with the peptide sequence, we can simply add 
it to our frame and give the column a name.
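
What makes this work is that `Frame.addCol` aligns the new series with the row keys of the frame. The following minimal sketch illustrates this with made-up keys and arbitrary numbers: the series contains one key that is not in the frame and lacks one key that is.

```fsharp
#r "nuget: Deedle, 2.3.0"
open Deedle

// A toy frame with two rows, keyed by made-up peptide strings.
let toyFrame : Frame<string,string> =
    Frame.ofColumns ["Protein", series ["AAK", "P1"; "CCR", "P2"]]

// `series` builds a series from key/value tuples (the values are arbitrary).
// "DDK" does not occur in the frame, and "CCR" has no entry here.
let toyMasses = series ["AAK", 100.0; "DDK", 200.0]

// The new column is aligned to the row keys of the frame.
let toyWithMass = toyFrame |> Frame.addCol "Mass" toyMasses

// Look up the entry for "CCR"; TryGet returns an optional value.
let massOfCCR = toyWithMass.GetColumn<float>("Mass").TryGet("CCR")
printfn "Rows: %d, CCR has a mass: %b" toyWithMass.RowCount massOfCCR.HasValue
```

The frame keeps its own two rows: the extra key "DDK" is not added, and "CCR" gets a missing value in the new column. This is why the keys of the mass series were built with `BioList.toString`, exactly like the row keys of the frame.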



In [11]:
let pfAddedMass =
    pfIndexedStringSequence
    |> Frame.addCol "Mass" peptidesAndMassesSeries


In [None]:
pfAddedMass
|> Frame.take 10
|> formatAsTable 0.
|> Chart.withSize (1000.,900.)


| | Protein | Mass |
|---|---|---|
| DLHPLLTSLPTKPGSAATPYTTTQSPPSTTLS* | Cre38.g759997.t1.1 | 3279.6874526271804 |
| VQYTPQSAISLGFAGTIR | Cre38.g759997.t1.1 | 1908.0105115979898 |
| AQQQQALAADLR | Cre38.g759997.t1.1 | 1311.6895118347097 |
| DTIHIIEAGYTADTNHAAK | Cre38.g759997.t1.1 | 2039.9912327025907 |
| HLHYRPDILLIPSISLAAALNPDFVVLPSER | Cre38.g759997.t1.1 | 3465.91364502852 |
| LPPWLLPDQEGKPAGR | Cre38.g759997.t1.1 | 1772.9573538285597 |
| ADLPDYAADNR | Cre38.g759997.t1.1 | 1219.54692992139 |
| MDATSK | Cre38.g759997.t1.1 | 651.2897762857499 |
| YSADGVTVCGR | Cre08.g363350.t1.1 | 1126.50770796338 |
| MYGAIR | Cre08.g363350.t1.1 | 709.3581306307699 |

Alternatively, we can take a column from our frame, apply a function to it, and create a new frame from the series.



In [13]:
let pfChargedMass =
    pfAddedMass
    |> Frame.getCol "Mass"
    |> Series.mapValues (fun mass -> Mass.toMZ mass 2.)
    |> fun s -> ["Mass Charge 2", s]
    |> Frame.ofColumns
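
The call `Mass.toMZ mass 2.` converts the neutral monoisotopic mass into the mass-to-charge ratio of the doubly protonated peptide. As a rough sketch of the arithmetic behind it, and not the BioFSharp implementation itself, the conversion can be written by hand. The function name `toMzSketch` and the proton mass constant are assumptions made for this illustration.

```fsharp
// Assumed proton mass in unified atomic mass units (not taken from BioFSharp).
let protonMass = 1.00727646677

// m/z of a peptide with neutral monoisotopic mass `mass` that carries `z` protons:
// every proton adds its mass, and the total is divided by the charge.
let toMzSketch (mass: float) (z: float) = (mass + z * protonMass) / z

// Monoisotopic mass of the peptide MDATSK, copied from the "Mass" column.
let mzMDATSK = toMzSketch 651.2897762857499 2.
printfn "%.4f" mzMDATSK   // prints 326.6522
```

The result agrees with the value listed for MDATSK in the "Mass Charge 2" column up to small differences in the proton mass constant.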


In [None]:
pfChargedMass
|> Frame.take 10
|> formatAsTable 0.
|> Chart.withSize (1000.,900.)


| | Mass Charge 2 |
|---|---|
| DLHPLLTSLPTKPGSAATPYTTTQSPPSTTLS* | 1640.8510027804002 |
| VQYTPQSAISLGFAGTIR | 955.0125322658049 |
| AQQQQALAADLR | 656.8520323841649 |
| DTIHIIEAGYTADTNHAAK | 1021.0028928181054 |
| HLHYRPDILLIPSISLAAALNPDFVVLPSER | 1733.96409898107 |
| LPPWLLPDQEGKPAGR | 887.4859533810899 |
| ADLPDYAADNR | 610.7807414275051 |
| MDATSK | 326.652164609685 |
| YSADGVTVCGR | 564.2611304485 |
| MYGAIR | 355.686341782195 |

The new frame has the same row keys as our previous frame. The information from our new frame can be joined with our old frame by using `Frame.join`.
`Frame.join` is similar to `Frame.addCol`, but can join whole frames at once instead of single columns.
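
In the notebook both frames share the same row keys, so the kind of join does not matter. The sketch below, with made-up keys and arbitrary values, shows how the `JoinKind` decides which rows survive when the keys only partly overlap. The two frames must not share column names.

```fsharp
#r "nuget: Deedle, 2.3.0"
open Deedle

// Two toy frames whose row keys overlap only in "c".
let toyMassFrame : Frame<string,string> =
    Frame.ofColumns ["Mass", series ["a", 1.0; "b", 2.0; "c", 3.0]]
let toyMzFrame : Frame<string,string> =
    Frame.ofColumns ["MZ", series ["c", 30.0; "d", 40.0]]

// Left: keep the row keys of the first frame.
let leftJoined  = Frame.join JoinKind.Left  toyMassFrame toyMzFrame
// Inner: keep only the row keys present in both frames.
let innerJoined = Frame.join JoinKind.Inner toyMassFrame toyMzFrame
// Outer: keep the row keys present in either frame.
let outerJoined = Frame.join JoinKind.Outer toyMassFrame toyMzFrame

printfn "%d %d %d" leftJoined.RowCount innerJoined.RowCount outerJoined.RowCount   // prints 3 1 4
```

Cells without a counterpart in the other frame are filled with missing values. Keep in mind that with the pipe operator the piped frame is passed as the last argument of `Frame.join`, which matters for `JoinKind.Left` and `JoinKind.Right` as soon as the row keys of the two frames differ.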



In [15]:
let joinedFrame =
    pfAddedMass
    |> Frame.join JoinKind.Left pfChargedMass


In [None]:
joinedFrame
|> Frame.take 10
|> formatAsTable 0.
|> Chart.withSize (1500.,900.)


| | Mass Charge 2 | Protein | Mass |
|---|---|---|---|
| DLHPLLTSLPTKPGSAATPYTTTQSPPSTTLS* | 1640.8510027804002 | Cre38.g759997.t1.1 | 3279.6874526271804 |
| VQYTPQSAISLGFAGTIR | 955.0125322658049 | Cre38.g759997.t1.1 | 1908.0105115979898 |
| AQQQQALAADLR | 656.8520323841649 | Cre38.g759997.t1.1 | 1311.6895118347097 |
| DTIHIIEAGYTADTNHAAK | 1021.0028928181054 | Cre38.g759997.t1.1 | 2039.9912327025907 |
| HLHYRPDILLIPSISLAAALNPDFVVLPSER | 1733.96409898107 | Cre38.g759997.t1.1 | 3465.91364502852 |
| LPPWLLPDQEGKPAGR | 887.4859533810899 | Cre38.g759997.t1.1 | 1772.9573538285597 |
| ADLPDYAADNR | 610.7807414275051 | Cre38.g759997.t1.1 | 1219.54692992139 |
| MDATSK | 326.652164609685 | Cre38.g759997.t1.1 | 651.2897762857499 |
| YSADGVTVCGR | 564.2611304485 | Cre08.g363350.t1.1 | 1126.50770796338 |
| MYGAIR | 355.686341782195 | Cre08.g363350.t1.1 | 709.3581306307699 |