# Using HDX_LIMIT Part 3: Factorization and Generating HDX Time Series 
***

The purpose of this notebook is to provide explanation and usage examples for the modules of the pipeline subpackage of the HDX_LIMIT package. 

**The pipeline subpackage is used to extract signal corresponding to library proteins from known positions in each HDX timepoint, and process those signals into a best-estimate HDX-mass-addition time series for each library protein.**

>The subpackage relies on .mzML.gz files and library_info.csv from preprocessing, but no other inputs are needed.

>The pipeline modules build on the HDX_LIMIT.datatypes and .processing core modules, and are designed to be context-flexible for use in Snakemake, command-line, and Python contexts.

**The pipeline subpackage performs three main tasks, each in its own module:**
1. Extracting protein signals from predetermined locations found in library_info.csv
2. Packaging extracted signals into custom classes that perform deconvolution
3. Selecting from pools of candidate signals to create a best-estimate HDX-mass-addition time series for each library protein identified.

**The two 'IdotP' modules, idotp_check and idotp_filter, are used in Snakemake to reduce the number of candidate signals to be processed.**
>The idotp_check Snakefile extracts signal from all undeuterated replicates for each entry in library_info.csv, uses the core classes to clean and deconvolute the data, and measures the quality of the observed signal against the theoretical isotopic distribution determined from the given protein's sequence. 

>The measure of quality used is the dot product between the observed and theoretical normalized integrated m/z distributions (range [0,1]), called the 'isotopic dot product' or 'idotp' for short. A lower threshold on idotp is given to idotp_filter, which returns the set of indices from library_info.csv with idotp >= threshold. Lines in library_info.csv with poor idotp are excluded from further processing, which substantially reduces computational load.
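As a rough illustration of the idotp measure, the sketch below normalizes two intensity vectors to unit length and takes their dot product. It assumes L2 normalization (which gives the [0,1] range for non-negative intensities); the function name `isotopic_dot_product` and the toy intensities are hypothetical and are not taken from the HDX_LIMIT code.

```python
import numpy as np


def isotopic_dot_product(observed, theoretical):
    """Dot product of two unit-normalized integrated m/z intensity vectors.

    Hypothetical helper for illustration; assumes L2 normalization.
    """
    observed = np.asarray(observed, dtype=float)
    theoretical = np.asarray(theoretical, dtype=float)
    if observed.shape != theoretical.shape:
        raise ValueError("distributions must have the same number of isotope bins")
    obs_norm = np.linalg.norm(observed)
    theo_norm = np.linalg.norm(theoretical)
    if obs_norm == 0 or theo_norm == 0:
        return 0.0  # no signal at all: treat as no agreement
    return float(np.dot(observed / obs_norm, theoretical / theo_norm))


# Toy intensities per isotope peak (illustrative values, not real data).
theoretical = [0.50, 0.30, 0.15, 0.05]
clean_obs = [100.0, 60.0, 30.0, 10.0]  # same shape as theory, different scale
noisy_obs = [20.0, 60.0, 10.0, 80.0]   # shape disagrees with theory

idotp_clean = isotopic_dot_product(clean_obs, theoretical)
idotp_noisy = isotopic_dot_product(noisy_obs, theoretical)
print(f"clean: {idotp_clean:.3f}, noisy: {idotp_noisy:.3f}")
```

Because both vectors are normalized first, the absolute signal intensity does not affect the score; only the shape of the isotopic envelope does.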

>Default values for the idotp threshold are ~0.99 for extremely-high-confidence, and ~0.95 for high-confidence. Thresholds should be as close to 1 as possible to avoid excessive load and garbage-in-garbage-out problems, while retaining the highest identification-rate possible. The idotp_filter module creates plots of the idotp distribution, which help in making informed choices about idotp threshold.
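The effect of the threshold choice can be sketched on a small table. The snippet below is a minimal illustration with made-up idotp values: the `idotp_table` frame, its column names, and the `filter_by_idotp` helper are assumptions for this example and do not reflect the actual idotp_filter interface.

```python
import pandas as pd

# Hypothetical idotp results, one row per line of library_info.csv.
idotp_table = pd.DataFrame(
    {
        "name": ["protA_5.2", "protA_9.8", "protB_7.1", "protC_3.3"],
        "idotp": [0.995, 0.62, 0.97, 0.91],
    }
)


def filter_by_idotp(table, threshold):
    """Return the row indices whose idotp is at or above the threshold."""
    return table.index[table["idotp"] >= threshold].tolist()


high_conf = filter_by_idotp(idotp_table, 0.95)
very_high_conf = filter_by_idotp(idotp_table, 0.99)

# Fraction of rows retained at each threshold, to weigh
# computational load against identification rate.
retained = {
    t: len(filter_by_idotp(idotp_table, t)) / len(idotp_table)
    for t in (0.95, 0.99)
}
print(high_conf, very_high_conf, retained)
```

Comparing the retained fraction across a few candidate thresholds gives the same kind of information as the idotp distribution plots, in tabular form.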

**The main output of the pipeline modules is the best-estimate HDX-mass-addition time series, stored in the PathOptimizer object as .winners, a list of IsotopeCluster objects. This is used in the estimation of per-residue exchange rates and $\Delta G_{unfolding}$ of library proteins.**

In [1]:
import os
import sys
import yaml
import glob
import numpy as np
import pandas as pd
import seaborn as sns

# Set matplotlib backend to work in jupyter.
import matplotlib
# matplotlib.use('nbAgg') # best for windows but works on Mac
matplotlib.use('MacOSX') # best for notebooks on Mac
import matplotlib.pyplot as plt
%matplotlib inline

# Make the Jupyter environment see workflow/scripts/.
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'workflow', 'scripts')) # default 'path/to/HDX_LIMIT-Pipeline/workflow/scripts/'
library_info = pd.read_csv('../resources/library_info/library_info.csv')
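# Use a context manager so the config file handle is closed after reading.
with open('../config/config.yaml', 'rb') as config_file:
    config = yaml.load(config_file, Loader=yaml.FullLoader)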

# Load and alias main functions of pipeline modules
from HDX_LIMIT.pipeline.idotp_check import main as idotp_check
from HDX_LIMIT.pipeline.idotp_filter import main as idotp_filter
from HDX_LIMIT.pipeline.extract_timepoint_tensors import main as extract_timepoint_tensors
from HDX_LIMIT.pipeline.generate_tensor_ics import main as generate_tensor_ics
from HDX_LIMIT.pipeline.optimize_paths import main as optimize_paths

from HDX_LIMIT.io import limit_read, limit_write, optimize_paths_inputs

**Here we consider groups of charged species together as an 'rt-group': a group of all the observed charge states of a library protein, clustered in LC retention time.**<br>
Clustering by retention time is necessary to limit the size of the LC extraction window, because our designed proteins have been observed to have bimodal elution profiles. It can produce several rt-groups for each library protein, but these can be filtered with the idotp_check module to reduce computational load. Rt-group names are assigned during the creation of library_info.csv; here we create a sub-dataframe from library_info for each rt-group.

In [2]:
# Divide library_info into rt-group level dataframes
lib_names = list(set(library_info['name'].values))
lib_names = sorted(lib_names, key=lambda i: float(i.split('_')[-1]))

rt_group_dfs = {}
for name in lib_names:
    rt_group_dfs[name] = library_info.loc[library_info['name']==name]
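The same split can be sketched on a toy table to show what the resulting dictionary looks like. Only the 'name' column and the numeric-suffix sort key come from the cell above; the 'charge' column, the example names, and all values are hypothetical, and `groupby` is used here as an alternative to the explicit loop.

```python
import pandas as pd

# Toy stand-in for library_info.csv; only 'name' is taken from the notebook,
# the 'charge' column and all values are hypothetical.
toy_library_info = pd.DataFrame(
    {
        "name": ["protA_12.5", "protB_3.1", "protA_12.5", "protA_4.0", "protB_3.1"],
        "charge": [3, 4, 4, 3, 5],
    }
)

# groupby yields one sub-dataframe per rt-group name.
toy_rt_group_dfs = {name: df for name, df in toy_library_info.groupby("name")}

# Order the names by the numeric suffix after the last underscore,
# mirroring the sort key used in the cell above.
toy_names = sorted(toy_rt_group_dfs, key=lambda n: float(n.split("_")[-1]))

for name in toy_names:
    charges = toy_rt_group_dfs[name]["charge"].tolist()
    print(name, charges)
```

Each value in the dictionary holds all charge-state rows of one rt-group, so two rt-groups of the same protein (here 'protA_4.0' and 'protA_12.5') stay separate.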

## Cleaning and deconvoluting extracted signals

**The generate_tensor_ics pipeline module is meant to fully abstract the signal processing of the extracted tensors.**

>Usually the only output is what will be used in generating the HDX-mass-addition time series: the list of IsotopeCluster objects identified from the input tensor. However, in Python contexts where the main function can be exposed (like this notebook), the return_flag argument can be used to return the TensorGenerator object from generate_tensor_ics. The TensorGenerator object contains a DataTensor object, which is the source of the IsotopeCluster objects and holds more information than the IsotopeCluster outputs alone.