# Example 1: Build a pipeline using PeakPerformance's convenience functions

In [1]:
import pandas
import numpy as np
import arviz as az
from pathlib import Path
from peak_performance import pipeline as pl

## User information

First, store the path to a folder containing only the raw data you want to analyze in the `path_raw_data` variable.  
Then, store the path to the directory containing the Excel file `Template.xlsx` from PeakPerformance in the `path_template` variable. You can download the file directly from GitHub or clone the PeakPerformance repository locally.  
You can pass each path either as a raw string (prefixed with `r` so that backslashes are not interpreted as escape characters) or, for an OS-independent alternative, as a `Path` object from the `pathlib` module.  
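
As a minimal sketch of the two options, the snippet below writes a location once as a raw string and once as a `pathlib` object. The Windows-style path is a made-up placeholder, not a path from PeakPerformance. `PureWindowsPath` is used only so that the example behaves identically on every operating system.

```python
from pathlib import Path, PureWindowsPath

# Hypothetical Windows path written as a raw string: the r prefix keeps
# the backslashes literal instead of treating them as escape sequences.
raw_string_path = r"C:\Users\me\PeakPerformance\example"

# PureWindowsPath parses Windows-style separators on any operating system.
windows_path = PureWindowsPath(raw_string_path)
path_parts = windows_path.parts

# OS-independent alternative: join path components with the / operator.
relative_path = Path("..") / "example"
absolute_path = relative_path.resolve()  # absolute version of the same location

print(path_parts)
print(relative_path, "->", absolute_path)
```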

For this example, the relative paths within the PeakPerformance repository are already defined below (cloning the repository to your local machine is recommended).

In [2]:
# specify the paths to the raw data files and to the template (as str or Path objects), e.g. to the provided example files

path_raw_data = Path("..") / "example"
path_template = Path("..")

The first step of the process is always the `prepare_model_selection()` function. Its job is to prepare and partly fill out an Excel file called `Template.xlsx` which serves as the input for user data and is copied into the directory stored in the `path_raw_data` variable. Conveniently, the function only needs the two paths you defined above.    

In [3]:
pl.prepare_model_selection(path_raw_data, path_template)

Now, navigate to the directory stored in `path_raw_data` and open `Template.xlsx`. Read the explanations and fill out the sheets accordingly. Then, save and close the Excel file (leaving it open leads to a permission error when the next function is executed).
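
Whether the file is still locked can be checked before continuing. The helper below is a sketch and not part of PeakPerformance (`is_file_writable` is a hypothetical name). It tries to open the file in read/write mode without changing it, which typically fails with a `PermissionError` on Windows while Excel still has the file open. On other operating systems an open file may not be detected this way.

```python
from pathlib import Path


def is_file_writable(file_path):
    """Return True if the file exists and can be opened for writing."""
    try:
        # "r+b" opens an existing file for reading and writing
        # without truncating or otherwise modifying its content.
        with open(file_path, "r+b"):
            return True
    except (PermissionError, FileNotFoundError):
        return False


# Usage with the template in the raw data folder (path as in this example)
template_file = Path("..") / "example" / "Template.xlsx"
if not is_file_writable(template_file):
    print(f"{template_file} is missing or still open in another program.")
```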
   
The next step depends on what information you entered into `Template.xlsx`. If you specified a model type for peak fitting for every `unique_identifier`, then you can skip the automated model selection described in the next section. If you left the model type open for at least one `unique_identifier`, then go ahead with the automated model selection.

## Automated model selection (optional)

The intended standard workflow involves an automated selection of the model or distribution used for peak fitting. This is performed based on a representative peak for every target analyte (or `unique_identifier`, as they are referred to in `Template.xlsx`). For each of these peaks, an information criterion is calculated for every model; the models are ranked by this criterion and the best-ranked model is selected for the given target. Finally, `Template.xlsx` is updated with the selected models wherever the user did not specify a model.  
  
This step may take a while: every file in question is fitted with each of the models, and the number of tuning samples is higher than usual, which further increases the sampling time per model.  

The returned `model_dict` is a dictionary with the `unique_identifier` values as keys and the selected models as values.  
The returned `result` is a DataFrame with all rankings from the model selection process.  
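
As an illustration of how the returned dictionary might be used, the sketch below counts how often each model was selected. The `example_model_dict` is a made-up stand-in with placeholder identifiers and model names, not actual output of `pl.model_selection()`.

```python
from collections import Counter

# Made-up stand-in for the returned model_dict:
# keys are unique_identifiers, values are the selected models.
example_model_dict = {
    "identifier_1": "model_a",
    "identifier_2": "model_b",
    "identifier_3": "model_a",
}

# Count how many identifiers were assigned to each model.
model_counts = Counter(example_model_dict.values())

for identifier, model in example_model_dict.items():
    print(f"{identifier}: {model}")
print(model_counts.most_common())
```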

When using the example data, you can use the following settings:

![](../example/model_selection_example.png)

In [None]:
result, model_dict = pl.model_selection(path_raw_data)

In case you left `Template.xlsx` open and received a `UserWarning` to that effect, just close it now and execute the subsequent cell to update `Template.xlsx` with the results of the model selection. Otherwise, skip the next cell.

In [65]:
df_signals = pandas.read_excel(Path(path_raw_data) / "Template.xlsx", sheet_name="signals")
pl.selected_models_to_template(path_raw_data, df_signals, model_dict)

## Pipeline

When every `unique_identifier` has been matched with a model type, it is time to start the actual peak fitting pipeline. Once again, this takes a single function call, which needs the already defined `path_raw_data` variable. In addition, you have to supply the file format of the raw data files. The example files are ".npy" files, but other formats are accepted as long as the files follow PeakPerformance's standardized naming scheme and contain correctly formatted data.  
  
When triggering the pipeline, a folder for the results named after the current date and time will be created automatically in the directory with the raw data files. The path to this folder will be returned and stored in the `results` variable.  
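
If the `results` variable is no longer available, for example after restarting the kernel, the results folder can be located again on disk. The sketch below is not part of PeakPerformance (`find_latest_subfolder` is a hypothetical helper). Because the exact format of the folder name is not assumed here, it simply picks the most recently modified subdirectory of the raw data directory, which may be the wrong folder if that directory contains other subdirectories.

```python
from pathlib import Path


def find_latest_subfolder(parent_dir):
    """Return the most recently modified subdirectory, or None if there is none."""
    subfolders = [p for p in Path(parent_dir).iterdir() if p.is_dir()]
    if not subfolders:
        return None
    return max(subfolders, key=lambda p: p.stat().st_mtime)


# Usage with the raw data folder from this example
raw_data_dir = Path("..") / "example"
if raw_data_dir.is_dir():
    print(find_latest_subfolder(raw_data_dir))
```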

When using the example data, you can use the following settings:

![](../example/pipeline_example.png)

In [None]:
results = pl.pipeline(
    path_raw_data=path_raw_data,
    raw_data_file_format=".npy",
)

In [None]:
results

## Data analysis

Since the __inference data objects__ for all signals were saved in the folder stored in `results`, you can open any one you are interested in with `az.from_netcdf()`.  
These objects contain not only the time series of the particular signal but also samples from the prior predictive, posterior, and posterior predictive sampling.  
This allows you to explore the data in detail and/or build your own plots aside from the ones featured in PeakPerformance.  
  
It is highly recommended to check the documentation for [`PyMC`](https://docs.pymc.io/) and [`ArviZ`](https://python.arviz.org/en/latest/) to get information and inspiration for this purpose.
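
To see which inference data objects are available, the results folder can be searched for NetCDF files. This sketch assumes that the files carry the `.nc` extension and are located directly in the results folder, as is the case for the file opened in the next cell. `list_idata_files` is a hypothetical helper, not a PeakPerformance function.

```python
from pathlib import Path


def list_idata_files(results_dir):
    """Return all .nc files directly inside results_dir, sorted by name."""
    return sorted(Path(results_dir).glob("*.nc"))


# Usage: replace the placeholder with the path returned by pl.pipeline()
results_dir = Path(".")
for nc_file in list_idata_files(results_dir):
    print(nc_file.name)
```

Each returned path can then be passed to `az.from_netcdf()`.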

In [108]:
# open an inference data object
idata = az.from_netcdf(results / "A1t1R1Part2_110_109.9_110.1.nc")
idata

In [109]:
# store the summary in the DataFrame az_summary
az_summary = az.summary(idata)
az_summary

| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| baseline_intercept | -44.003 | 7.227 | -57.478 | -30.260 | 0.079 | 0.057 | 8290.0 | 6088.0 | 1.0 |
| baseline_slope | 6.659 | 0.514 | 5.672 | 7.616 | 0.007 | 0.005 | 5471.0 | 5696.0 | 1.0 |
| noise_log__ | 4.645 | 0.073 | 4.510 | 4.785 | 0.001 | 0.001 | 9311.0 | 5961.0 | 1.0 |
| mean | 25.949 | 0.013 | 25.926 | 25.975 | 0.000 | 0.000 | 2650.0 | 3786.0 | 1.0 |
| std_log__ | -0.644 | 0.042 | -0.727 | -0.568 | 0.001 | 0.001 | 2429.0 | 2921.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| y[94] | 147.639 | 13.291 | 123.255 | 172.644 | 0.168 | 0.119 | 6287.0 | 6354.0 | 1.0 |
| y[95] | 147.941 | 13.311 | 123.565 | 173.030 | 0.168 | 0.119 | 6280.0 | 6251.0 | 1.0 |
| y[96] | 148.243 | 13.332 | 123.833 | 173.360 | 0.168 | 0.119 | 6274.0 | 6251.0 | 1.0 |
| y[97] | 148.545 | 13.352 | 124.101 | 173.711 | 0.168 | 0.119 | 6268.0 | 6251.0 | 1.0 |
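
The summary can also be filtered programmatically. The sketch below flags parameters whose diagnostics look suspicious. The thresholds (`r_hat` above 1.01, `ess_bulk` below 400) are common rules of thumb chosen here for illustration, not values prescribed by PeakPerformance, and `flag_poor_convergence` is a hypothetical helper. The small DataFrame reuses two rows from the summary above so that the example runs on its own.

```python
import pandas


def flag_poor_convergence(summary, r_hat_max=1.01, ess_min=400):
    """Return the rows of an ArviZ summary that miss the given thresholds."""
    poor = (summary["r_hat"] > r_hat_max) | (summary["ess_bulk"] < ess_min)
    return summary[poor]


# Two rows copied from the summary shown above
example_summary = pandas.DataFrame(
    {"ess_bulk": [2650.0, 2429.0], "r_hat": [1.0, 1.0]},
    index=["mean", "std_log__"],
)

flagged = flag_poor_convergence(example_summary)
print(flagged if not flagged.empty else "No parameters flagged.")
```

With the actual data, `flag_poor_convergence(az_summary)` would apply the same check to every row of the summary.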
