# Example 1: Build a pipeline using Peak Performance's convenience function

In [26]:
import pandas
import numpy as np
import arviz as az
from pathlib import Path
from peak_performance import pipeline as pl

## User information

First, store the path to a folder that contains only the raw data you want to analyze in the `path_raw_data` variable.  
You can use a raw string (a string literal prefixed with `r`) so that the backslashes are interpreted literally, or the `Path` class from the `pathlib` module as an OS-independent alternative.

In [27]:
# specify the absolute path to the raw data files (as a str or a Path object), e.g. to the provided example files

# path_raw_data = r"C:\Users\niesser\Desktop\Local GitLab Repositories\peak-performance\example"
path_raw_data = Path(r"C:\Users\niesser\Desktop\Local GitLab Repositories\peak-performance\example")
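
The difference between these spellings is easy to check. The sketch below uses a made-up path (`C:\new_data\run1` is purely illustrative) and `PureWindowsPath`, so it should behave the same on any operating system:

```python
from pathlib import PureWindowsPath

# hypothetical Windows path, used only for illustration
raw = r"C:\new_data\run1"        # raw string: backslashes are kept literally
escaped = "C:\\new_data\\run1"   # regular string: every backslash is doubled
broken = "C:\new_data\run1"      # \n and \r are read as newline and carriage return

same = raw == escaped            # the first two spellings are identical
corrupted = "\n" in broken       # the third one silently contains a newline

path = PureWindowsPath(raw)
parts = path.parts               # drive/root plus the individual folder names
name = path.name                 # last path component
print(same, corrupted, parts, name)
```

On your own machine, `Path` picks the flavor that matches your operating system, which is why it is the more portable choice.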

Then, pass this path to the `detect_raw_data()` function, which returns a list of all files with the given file extension in that folder.  
If you don't specify a file extension, the default is `.npy`.

In [28]:
# obtain a list of all raw data file names (including their file extension)
raw_data_files = pl.detect_raw_data(path_raw_data)
print(raw_data_files)

['A1t1R1Part2_110_109.9_110.1.npy', 'A1t1R1Part2_111_109.9_110.1.npy', 'A1t1R1Part2_111_110.9_111.1.npy', 'A1t1R1Part2_112_110.9_111.1.npy', 'A1t1R1Part2_112_111.9_112.1.npy', 'A2t2R1Part1_132_85.9_86.1.npy', 'A4t4R1Part2_137_72.9_73.1.npy']
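
To make the behavior described above more concrete, here is a minimal sketch of such a file listing. The helper `list_files_with_suffix` is hypothetical and is not the implementation used by Peak Performance; the file names in the demo are made up.

```python
import tempfile
from pathlib import Path


def list_files_with_suffix(folder, suffix=".npy"):
    """Return the sorted names of all files in `folder` that end in `suffix`.

    Hypothetical helper for illustration; not part of Peak Performance.
    """
    return sorted(
        p.name for p in Path(folder).iterdir()
        if p.is_file() and p.suffix == suffix
    )


# demo on a temporary folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for file_name in ["B_2.npy", "A_1.npy", "notes.txt"]:
        (Path(tmp) / file_name).touch()
    found = list_files_with_suffix(tmp)

print(found)  # ['A_1.npy', 'B_2.npy']
```

Files with other extensions are ignored, which is one reason to keep only the raw data you want to analyze in that folder.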


Based on the list of files, provide the following information which is necessary for the pipeline:  
1. A dictionary `double_peak` containing the file names (including file extension) as keys and Boolean values indicating whether the file contains a __single peak (False)__ or a __double peak (True)__.  
    
2. A Boolean `pre_filtering`. Choose whether to filter out obvious false positive signals before sampling to save computation time (__True__) or not (__False__). If you choose __False__, skip items 3 to 5.
    
    3. If `pre_filtering` was set to __True__: Provide a dictionary `retention_time_estimate` containing the file names (including file extension) as keys and, as values, a rough retention time estimate of the compound in each raw data file. In the example below, the double peak file gets a tuple with one estimate per peak.  
  
    4. If `pre_filtering` was set to __True__: A float or integer `peak_width_estimate` containing a rough estimate of the average peak width of your LC-MS/MS method (in minutes).  
  
    5. If `pre_filtering` was set to __True__: A float or integer `minimum_sn` defining the lower threshold of the signal-to-noise ratio that a signal has to exceed to be accepted as a peak during pre-filtering. 
  
6) A Boolean `plotting`. Choose whether to create plots (__True__) or not (__False__). In the latter case, the pipeline will only yield the Excel data report sheet with all results and inference data objects for each sampled signal.

In [None]:
# using the previously acquired list raw_data_files, this is the easiest way to define the double_peak dictionary
# the list with the Booleans needs to be in the same order as raw_data_files
double_peak = dict(zip(raw_data_files, 5*[False] + [True] + [False]))       

pre_filtering = True                
retention_time_estimate = dict(zip(raw_data_files, 5*[26.2] + [(11.7, 12.5)] + [26.2]))    
peak_width_estimate = 1             
minimum_sn = 5                      

plotting = True
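
One pitfall of the `dict(zip(...))` pattern used above is that `zip` stops at the shorter input, so a list with too few entries silently drops files. The sketch below shows a simple check for that. The helper `check_user_input` is hypothetical and not part of Peak Performance, and the file names are made up.

```python
def check_user_input(files, double_peak, retention_time_estimate):
    """Return a list of problems found in the user input dictionaries.

    Hypothetical sanity check for illustration; not part of Peak Performance.
    """
    problems = []
    for file_name in files:
        if file_name not in double_peak:
            problems.append(f"{file_name}: missing in double_peak")
        if file_name not in retention_time_estimate:
            problems.append(f"{file_name}: missing in retention_time_estimate")
    return problems


files = ["a.npy", "b.npy", "c.npy"]        # made-up file names
flags = 2 * [False]                        # deliberately one entry too few
estimates = [26.2, 26.2, (11.7, 12.5)]

demo_double_peak = dict(zip(files, flags))  # zip silently drops "c.npy"
demo_retention_time = dict(zip(files, estimates))

problems = check_user_input(files, demo_double_peak, demo_retention_time)
print(problems)  # ['c.npy: missing in double_peak']
```

Running a check like this on `raw_data_files`, `double_peak`, and `retention_time_estimate` before starting the pipeline can catch a miscounted list early.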

## Pipeline

When you trigger the pipeline, a results folder named after the current date and time is created automatically in the directory with the raw data files.  
The path to this folder will be returned and stored in the `results` variable.  

In [None]:
results = pl.pipeline(
    path_raw_data=path_raw_data,
    raw_data_file_format=".npy",
    pre_filtering=pre_filtering,
    double_peak=double_peak,
    retention_time_estimate=retention_time_estimate,
    peak_width_estimate=peak_width_estimate,
    minimum_sn=minimum_sn,
    plotting=plotting,
)
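
Once the pipeline has finished, it can help to get an overview of what was written to the results folder. The following sketch groups the files in a folder by their extension. The helper `group_by_suffix` is hypothetical, and the file names in the demo are made up rather than the names the pipeline actually produces.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def group_by_suffix(folder):
    """Map each file extension in `folder` to the sorted file names using it.

    Hypothetical helper for illustration; not part of Peak Performance.
    """
    groups = defaultdict(list)
    for p in sorted(Path(folder).iterdir()):
        if p.is_file():
            groups[p.suffix].append(p.name)
    return dict(groups)


# demo on a temporary folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for file_name in ["report.xlsx", "signal_1.nc", "signal_1.png", "signal_2.nc"]:
        (Path(tmp) / file_name).touch()
    overview = group_by_suffix(tmp)

print(overview)
```

With real pipeline output you would call `group_by_suffix(results)` instead of using a temporary folder.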

## Data analysis

Since the __inference data objects__ for all signals were saved in the folder stored in `results`, you can open any of them with `az.from_netcdf()`.  
These objects contain not only the time series of the particular signal but also the samples from prior predictive, posterior, and posterior predictive sampling.  
This allows you to explore the data in detail and/or build your own plots different from the ones featured in Peak Performance.  
  
It is highly recommended to check the documentation of `pymc` and `arviz` for information and inspiration.

In [None]:
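# open an inference data object
# (the name below follows the first raw data file listed above; adjust it to a file that exists in the results folder)
idata = az.from_netcdf(Path(results) / "A1t1R1Part2_110_109.9_110.1")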
idata

In [None]:
# store the summary in the DataFrame az_summary
az_summary = az.summary(idata)
az_summary
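
Depending on your ArviZ version and settings, the summary table typically lists a mean, a standard deviation, and an interval for each variable, often a highest density interval (HDI). To illustrate what such an interval means, here is a simplified sketch that computes these three quantities from a list of samples using only the standard library. It is not ArviZ's implementation: the function `hdi`, the 0.94 probability, and the random samples are assumptions made for this example.

```python
import math
import random
import statistics


def hdi(samples, prob=0.94):
    """Narrowest interval that contains a fraction `prob` of the samples.

    Simplified sketch for unimodal samples; not ArviZ's implementation.
    """
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(prob * n))  # number of samples inside the interval
    # slide a window of k consecutive sorted samples and keep the narrowest one
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


# made-up samples standing in for the posterior of one variable
random.seed(1)
samples = [random.gauss(10.0, 2.0) for _ in range(4000)]

mean = statistics.fmean(samples)
sd = statistics.stdev(samples)
lower, upper = hdi(samples)
print(f"mean={mean:.2f} sd={sd:.2f} hdi=({lower:.2f}, {upper:.2f})")
```

For real analyses, rely on the values in `az_summary`; the sketch is only meant to show how an interval of this kind is derived from the samples.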