# Tutorial 3: Data Processing for Time Resolved, Temperature-Jump SAXS Scattering Curves

**Package Information:**<br>
Currently the [tr_tjump_saxs](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/tree/main?ref_type=heads "tr_tjump_saxs") package only works through the Python 3 command line. The full dependencies can be found on the [Henderson GitLab Page](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/tree/main?ref_type=heads "tr_tjump_saxs") and the environment can be cloned from the [environment.yml file](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/blob/main/environment.yml?ref_type=heads "environment.yml file"). The data analysis can be executed from an interactive Python command line such as [IPython](https://www.python.org/) or [Jupyter](https://jupyter.org/), or the code can be written in a script to run in a non-interactive mode. The preferred usage is in Jupyter Lab, as this is the environment the package was developed in. A Jupyter notebook also keeps all code, code output, and notes in a single file, which serves as a record of the data analysis performed, the code used to conduct it, and the resulting output. 

**Tutorial Information:**<br>
This set of tutorial notebooks will cover how to use the `tr_tjump_saxs` package to analyze TR, T-Jump SAXS data and the <a href="https://www.science.org/doi/10.1126/sciadv.adj0396">workflow used to study HIV-1 Envelope glycoprotein dynamics. </a> This package contains multiple modules, each containing a set of functions to accomplish a specific subtask of the TR, T-Jump SAXS data analysis workflow. Many of the functions are modular and some can be helpful for analyzing static SAXS and other data sets as well. 

**Package Modules:**<br>
> 1. `file_handling`<br>
> 2. `saxs_processing`<br>
> 3. `saxs_qc`<br>
> 4. `saxs_kinetics`<br>
> 5. `saxs_modeling`<br>

**Developer:** [@ScientistAsh](https://github.com/ScientistAsh "ScientistAsh GitHub")

**Updated:** 2 February 2024

# Tutorial 3 Introduction
In the Tutorial 3 + 4 notebooks, I introduce the `saxs_processing` module from the `tr_tjump_saxs` package. In this Tutorial 3 Notebook I will cover the processing of T-Jump scattering curves. 

The `saxs_processing` module provides functions that will find and remove outliers, scale curves, and subtract curves. This tutorial assumes that you have finished the first two tutorials. If you find any issues with this tutorial or module, please create an issue on the repository GitLab page ([tr_tjump_saxs issues](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/issues "tr_tjump_saxs Issues")). 

# Module functions:
> 1. `svd_outliers()` uses an SVD method to determine outliers in scattering curves. <br>
> 2. `iterative_chi()` uses a $\chi^2$ method to determine outliers in T-Jump difference curves. <br>
> 3. `saxs_scale()` scales a given curve to a given reference curve. <br>
> 4. `saxs_sub()` subtracts a reference curve from a given curve. <br>
> 5. `auc_outliers()` will determine the outliers of a given data set using an area under the curve method.$^*$ <br>
> 6. `move_outliers()` will move the outlier files to a given directory for quarantine. This is not recommended because files can get lost easily.$^*$ <br>

$^*$Please note that the `move_outliers()` and `auc_outliers()` functions are no longer used, will not be maintained, and are not recommended for use.

## Tutorial Files:

### Data Files
The original data used in this analysis is deposited on the [SASBDB](https://www.sasbdb.org/) with accession numbers:
> **Static Data:** <br>
    - *CH505 Temperature Series*: SASDT29, SASDT39, SASDT49, SASDT59 <br>
    - *CH848 Temperature Series*: SASDTH9, SASDTJ9, SASDTK9, SASDTL9 <br>
<br>
> **T-Jump Data:** <br>
    - *CH505 T-Jump Data*: SASDT69, SASDT79, SASDT89, SASDT99, SASDTA9, SASDTB9, SASDTC9, SASDTD9, SASDTE9, SASDTF9, SASDTG9 <br>
     - *CH848 T-Jump Data*: SASDTM9, SASDTN9, SASDTP9, SASDTQ9, SASDTR9, SASDTS9, SASDTT9, SASDTU9, SASDTV9, SASDTW9 <br>
<br>
> **Static Env SOSIP Panel:** SASDTZ9, SASDU22, SASDU32, SASDU42, SASDTX9, SASDTY9 <br>

Additional MD data associated with the paper can be found on [Zenodo](https://zenodo.org/records/10451687).

### Output Files
Example output is included in the [OUTPUT](https://github.com/ScientistAsh/tr_tjump_saxs/tree/main/TUTORIALS/OUTPUT/) subdirectory in the [TUTORIALS](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/tree/main/TUTORIALS?ref_type=heads) directory.  

# How to Use Jupyter Notebooks
You can execute the code directly in this notebook or create your own notebook and copy the code there.

<div class="alert alert-block alert-info">
    
    <b><i class="fa fa-info-circle" aria-hidden="true"></i>&nbsp; Tips</b><br>
    
    <b>1.</b> To run the currently highlighted cell, hit the <code>shift</code> and <code>enter</code> keys at the same time.<br>
    <b>2.</b> To get help with a specific function, place the cursor inside the function's parentheses and hit the <code>shift</code> and <code>tab</code> keys at the same time.

</div>

<div class="alert alert-block alert-info" style="background-color: white; border: 2px solid; padding: 10px">
    <b><i class="fa fa-star" aria-hidden="true"></i>&nbsp; In the Literature</b><br>
    
    Our <a href="https://www.biorxiv.org/content/10.1101/2023.05.17.541130v1">recent paper</a> on bioRxiv provides an example of the type of data, the analysis procedure, and example output for this type of data analysis.  <br> 
    
</div>

# Import Modules

To use the `saxs_processing` module, the `tr_tjump_saxs` package needs to be imported. The dependencies will automatically be imported with the package import.

In [None]:
# import sys so the package location can be added to the module search path
import sys

# append the path to the tr_tjump_saxs package to sys.path
sys.path.append(r'../')

# import the tr_tjump_saxs modules used in this tutorial
from file_handling import *
from saxs_processing import *

<div class="alert alert-block alert-info">
    <b><i class="fa fa-info-circle" aria-hidden="true"></i>&nbsp; Tips</b><br>
    Be sure that the path for the <code>tr_tjump_saxs</code> package appended to the <code>PYTHONPATH</code> matches the path to the repository on your machine.
    </div>

<a id='Overview'></a>

# Overview: Finding SAXS Scattering Outlier Curves

The first step in analyzing TR, T-Jump SAXS data is to detect and remove outlier scattering and difference curves. During a TR, T-Jump collection, scattering curves are measured for both buffer and protein. Static SAXS scattering curves are collected in addition to "laser off" and "laser on" T-Jump scattering curves for both protein and buffer T-Jumps. The outliers for all of these sets of scattering curves need to be determined before further analysis. 

In this tutorial, we will only show this analysis on the T-Jump scattering curves. Determining outliers for T-Jump scattering curves is more complicated than for the static sets because the outliers have to be determined separately for "laser on" curves and "laser off" curves, requiring a multi-step procedure. Additionally, the "laser off" outlier curves must also be removed from the "laser on" curves (so if curve #25 is a "laser off" curve outlier then curve #25 must be removed from the "laser on" set even if it is not determined as an outlier from the "laser on" data set). 

# Example 1: Basic Usage

## Step 1: Load Data

### Step 1.1: Load Curves

Before we can process SAXS data, we need to load the data using the functions in the `file_handling` module covered in [Tutorial 1](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/blob/main/TUTORIALS/tutorial1_file-handling.ipynb?ref_type=heads "Tutorial 1"). We will use the `load_set()` function to load an entire set of curves and see how the outlier, scaling, and subtraction functions work with this example data set. If you have questions about loading data, slicing through the data arrays, or looping over multiple data sets, see [Tutorial 1](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/blob/main/TUTORIALS/tutorial1_file-handling.ipynb?ref_type=heads "Tutorial 1").

Because we need to analyze both the "laser off" and "laser on" T-Jump scattering curves, we need to load both sets of data. 
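
If you want to see what collecting a sorted file list by prefix and suffix involves, the following is a minimal sketch using only the standard library. It is not the implementation of `make_flist()`; the helper `list_curve_files` and the `demo_` file names are hypothetical and only illustrate how "laser on" and "laser off" files can be separated by their prefixes.

```python
import tempfile
from pathlib import Path


def list_curve_files(directory, prefix, suffix):
    """Hypothetical helper: sorted paths whose file names start with prefix and end with suffix."""
    return sorted(
        str(p) for p in Path(directory).iterdir()
        if p.name.startswith(prefix) and p.name.endswith(suffix)
    )


# Build a throwaway directory holding a few empty files with made-up names
with tempfile.TemporaryDirectory() as tmp:
    for name in ['demo_1ms_002_Q.chi', 'demo_1ms_001_Q.chi',
                 'demo_-5us_001_Q.chi', 'demo_-5us_002_Q.chi', 'notes.txt']:
        (Path(tmp) / name).touch()

    # Keep only the file names so the result does not depend on the temp path
    demo_on = [Path(f).name for f in list_curve_files(tmp, 'demo_1ms_', '_Q.chi')]
    demo_off = [Path(f).name for f in list_curve_files(tmp, 'demo_-5us_', '_Q.chi')]

print(demo_on)
print(demo_off)
```

Including the time delay in the prefix is what keeps the two sets apart, which is why the prefixes in the cell below end in `1ms_` and `-5us_`.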

In [None]:
%%time 

# Get all laser on (1 ms) scattering curves from 20 Hz set 1
on_files = make_flist(directory='../../../TR_T-jump_SAXS_July2022/protein_20hz_set01/processedb/', 
                      prefix='protein_20hz_set01_1ms_', suffix='_Q.chi')

# Get all laser off (-5 us) scattering curves from 20 Hz set 1
off_files = make_flist(directory='../../../TR_T-jump_SAXS_July2022/protein_20hz_set01/processedb/',
                       prefix='protein_20hz_set01_-5us_', suffix='_Q.chi')

# sort files lists
on_files.sort()
off_files.sort()
    

# load laser on scattering curves
on_data, on_array, q, on_err = load_set(flist=on_files, delim=' ', mask=10, err=False)

# load laser off scattering curves
off_data, off_array, q, off_err = load_set(flist=off_files, delim=' ', mask=10, err=False)

<div class="alert alert-block alert-warning">
    
    <i class="fa fa-exclamation-triangle"></i>&nbsp; <b>Check your PATH </b><br>
    Make sure that the input and output directories match the PATH on your machine. Also be sure that the file prefixes and suffixes match what is used for your files. Edit the PATH variables for your machine. 
    </div>


<div class="alert alert-block alert-warning">
    
    <i class="fa fa-exclamation-triangle"></i>&nbsp; <b>Check your data structure</b><br>
    Your data may be stored differently and it is important to make sure you understand your data structure before beginning any analysis. It is always a good idea to practice slicing and plotting your data set to be sure you understand the data structure once it is loaded.
    </div>


### Step 1.2: Check Data

In [None]:
# check length of array
print(len(on_array))
print(len(off_array))

In [None]:
# check shape of array
print(on_array.shape)
print(off_array.shape)

In [None]:
# show first 5 files
on_files[:5]

In [None]:
# show first 5 files
off_files[:5]

The array containing our data has length 250, which corresponds to the 250 different scattering curves collected for this time delay. Each of the 250 curves has 1908 points, as indicated by the array's `shape` attribute. The data has the correct structure to proceed with processing. 
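
These checks can also be automated. The sketch below assumes that each row of the array is one curve, as the printed shapes suggest; `check_paired_sets` and the synthetic arrays are hypothetical and only stand in for the arrays returned by `load_set()`.

```python
import numpy as np


def check_paired_sets(on_arr, off_arr, q_arr):
    """Hypothetical check: both sets have the same shape and one point per q value."""
    if on_arr.shape != off_arr.shape:
        raise ValueError(f'shape mismatch: {on_arr.shape} vs {off_arr.shape}')
    if on_arr.shape[1] != len(q_arr):
        raise ValueError('number of points per curve does not match q')
    return on_arr.shape


# Synthetic stand-ins with the dimensions reported above
demo_q = np.linspace(0.02, 2.5, 1908)
demo_on = np.ones((250, 1908))
demo_off = np.ones((250, 1908))

print(check_paired_sets(demo_on, demo_off, demo_q))
```

Raising an error early is cheaper than discovering a missing "laser off" curve after the difference curves have been written to disk.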

<div class="alert alert-block alert-warning">
    
    <i class="fa fa-exclamation-triangle"></i>&nbsp; <b>Overwriting Variables</b><br>
    Note that the data array is overwritten with each iteration of the loop. 
    </div>

<div class="alert alert-block alert-warning">
    
    <i class="fa fa-exclamation-triangle"></i>&nbsp; <b>Iterating</b><br>
    <code>for</code> loops are not the most efficient implementation in Python and other methods of iteration may be faster. 
    </div>


### Step 1.3: Check Loaded Curves

In [None]:
on_labs = []
off_labs = []

for i,j in zip(on_files, off_files):
    on_labs.append(i[-9:-6])
    off_labs.append(j[-9:-6])
    
plot_curve(data_arr=on_array, q_arr=q, labels=on_labs, qmin=0.02, qmax=0.15, 
           imin=None, imax=None, x='Scattering Vector Å $^{-1}$', y='Scattering Intensity',
           title='CH505 TR, T-Jump Scattering Curves for 1ms', save=True, save_dir='./OUTPUT/TUTORIAL3/PLOTS/', 
           save_name='tutorial3_ex1_step1.3a.png')

plot_curve(data_arr=off_array, q_arr=q, labels=off_labs, qmin=0.02, qmax=0.15, 
           imin=None, imax=None, x='Scattering Vector Å $^{-1}$', y='Scattering Intensity',
           title=r'CH505 TR, T-Jump Scattering Curves for -5$\mu$s', save=True, save_dir='./OUTPUT/TUTORIAL3/PLOTS/', 
           save_name='tutorial3_ex1_step1.3b.png')

Both the "laser on" and "laser off" curves plotted above look reasonable. We can now move on to outlier detection on the "laser on" and "laser off" scattering curves. 

## Step 2: Detecting Outliers in SAXS Scattering Curves

The `tr_tjump_saxs` package uses a singular value decomposition (SVD) method to determine outliers in scattering curves (in addition to the previously mentioned pre-print, see also [Thompson et al](https://pubmed.ncbi.nlm.nih.gov/31527847/ "Thompson et al")).

The `svd_outliers()` function will perform this analysis. `svd_outliers()` has 6 input parameters.

> 1. The `arr` parameter indicates the data array containing the curves to run the outlier analysis on and is required input.<br>
> 2. The `flist` parameter indicates the file list associated with the indicated data array and is required input.<br>
> 3. `q` indicates the array containing the scattering vectors and is a required input.<br>
> 4. The `cutoff` parameter indicates the threshold for outlier detection, expressed as a multiple of the standard deviation. It is optional, with a default value of 2.5 (that is, 2.5x the standard deviation).<br>
> 5. The `save_dir` and `save_name` parameters indicate the directory to save output in and the file name for output data, respectively, and are optional parameters. <br>

The function returns a list of outlier files and a list containing the list index associated with the outlier files.

There are no custom errors raised by the `svd_outliers()` function. For any errors raised, refer to the documentation for the function indicated in the traceback. 
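
To give a feel for how an SVD can expose outlier curves, here is a minimal sketch on synthetic data. It is an illustration under my own assumptions, not the code inside `svd_outliers()`: the function `svd_outlier_sketch`, the choice to look only at the first right singular vector, and the synthetic curves are all hypothetical.

```python
import numpy as np


def svd_outlier_sketch(curves, cutoff=2.5):
    """Flag curves whose weight on the dominant singular vector is far from the mean weight.

    curves: array of shape (n_curves, n_points). Returns a list of row indices.
    """
    # Decompose with one column per curve; vt[0] then holds each curve's
    # weight on the dominant basis vector
    _, _, vt = np.linalg.svd(curves.T, full_matrices=False)
    weights = vt[0]
    deviation = np.abs(weights - weights.mean())
    return np.where(deviation > cutoff * weights.std())[0].tolist()


# Synthetic data: 50 nearly identical curves, with curve 25 inflated by 50%
demo_q = np.linspace(0.02, 0.15, 200)
base = np.exp(-(demo_q ** 2) * 100.0)
amplitudes = 1.0 + 0.01 * np.sin(np.arange(50))
amplitudes[25] = 1.5
demo_curves = amplitudes[:, None] * base[None, :]

flagged = svd_outlier_sketch(demo_curves)
print(flagged)
```

Measuring the deviation from the mean weight makes the result independent of the arbitrary sign that the SVD assigns to each singular vector.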

### Step 2.1: Initial Outlier Detection
As mentioned in the [Overview](#Overview), outlier detection on T-Jump scattering curves is a multi-step process. First, an initial SVD outlier detection is run using all the collected curves. Outlier detection is run separately for "laser on" and "laser off" curves. 

In [None]:
#SVD outlier detection iteration 1
print('Running SVD 1...')
on_outlier_files1, on_outlier_index1 = svd_outliers(arr=on_array, flist=on_files, q=q, cutoff=2.5, 
                                                    save_dir='./OUTPUT/TUTORIAL3/OUTLIERS/', 
                                                    save_name='1ms_svd_outliers1')

off_outlier_files1, off_outlier_index1 = svd_outliers(arr=off_array, flist=off_files, q=q, cutoff=2.5, 
                                                        save_dir='./OUTPUT/TUTORIAL3/OUTLIERS/', 
                                                        save_name='-5us_svd_outliers1')


### Step 2.2: Initial Outlier Removal

The `svd_outliers()` function only detects outliers; it does not remove them from the file list. 

The `remove_outliers()` function removes previously determined outliers from the indicated file list. There are 3 input parameters for this function:
> 1. The `flist` and `olist` parameters indicate the file list to remove outliers from and the list containing the outlier files, respectively. <br>
> 2. The `fslice` parameter is optional and indicates how the input file names should be sliced to produce the output file name. 

The function returns the set of cleaned files as a list and the outlier list. 

There are no custom errors raised by the `remove_outliers()` function. For any raised errors refer to the docs for the function indicated in the traceback. 
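
The role of `fslice` is easiest to see with a toy example. The sketch below assumes that files are matched by the label obtained from slicing the file name, so that an outlier found in one set also removes its partner from the other set. `remove_by_label` and the `demo_` file names are hypothetical and are not the package's `remove_outliers()`.

```python
def remove_by_label(flist, olist, fslice=(-9, -6)):
    """Hypothetical helper: drop files whose sliced label matches any outlier's label."""
    start, stop = fslice
    outlier_labels = {f[start:stop] for f in olist}
    cleaned = [f for f in flist if f[start:stop] not in outlier_labels]
    removed = [f for f in flist if f[start:stop] in outlier_labels]
    return cleaned, removed


demo_on = [f'demo_1ms_{i:03d}_Q.chi' for i in range(1, 6)]
demo_off = [f'demo_-5us_{i:03d}_Q.chi' for i in range(1, 6)]

# Curve 003 was flagged in the laser-on set only
demo_outliers = ['demo_1ms_003_Q.chi']

on_kept, on_removed = remove_by_label(demo_on, demo_outliers)
off_kept, off_removed = remove_by_label(demo_off, demo_outliers)
print(on_removed, off_removed)
```

With file names ending in `NNN_Q.chi`, the slice `[-9:-6]` picks out the three-digit curve number, which is why the same slice is used throughout this tutorial.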

In [None]:
# remove SVD iteration 1 outliers from file lists
on_cleaned, on_outliers = remove_outliers(flist=on_files, olist=on_outlier_files1, fslice=[-9,-6])        
off_cleaned, off_outliers = remove_outliers(flist=off_files, olist=off_outlier_files1, fslice=[-9,-6])

In [None]:
on_cleaned[:5]

In [None]:
on_outliers

### Step 2.3: Load Cleaned Curves
Now that the first round of outliers has been removed, we need to load the cleaned file lists into new data arrays so that the outliers are also removed from the data arrays. 

In [None]:
# load clean curves
on_data, on_array, q, on_err = load_set(flist=on_cleaned, delim=' ', mask=10, err=False)
off_data, off_array, q, off_err = load_set(flist=off_cleaned, delim=' ', mask=10, err=False)

### Step 2.4: Second Round of Outlier Detection
Outliers are determined relative to the average, yet the presence of outliers can skew the average value. If outliers are found in the initial outlier detection step, then it is best practice to run a second outlier detection to ensure that no outliers were missed due to a skewed value for the average. The procedure is the same as for the initial round of outlier detection. This is not necessary for this data set, but we will practice it here anyway. 

In [None]:
#SVD outlier detection iteration 2
print('Running SVD 2...')
on_outlier_files2, on_outlier_index2 = svd_outliers(arr=on_array, flist=on_cleaned, q=q, cutoff=2.5, 
                                                    save_dir='./OUTPUT/TUTORIAL3/OUTLIERS/', 
                                                    save_name='1ms_svd_outliers2')

off_outlier_files2, off_outlier_index2 = svd_outliers(arr=off_array, flist=off_cleaned, q=q, cutoff=2.5, 
                                                        save_dir='./OUTPUT/TUTORIAL3/OUTLIERS/', 
                                                        save_name='-5us_svd_outliers2')
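
The reason a second pass can find something new is illustrated by the toy example below. It uses one made-up score per curve instead of an SVD, so it is only a simplified stand-in for the real analysis; `flag` and `scores` are hypothetical.

```python
from statistics import mean, pstdev


def flag(values, cutoff=2.5):
    """Return the values lying more than cutoff standard deviations from the mean."""
    centre, spread = mean(values), pstdev(values)
    return [v for v in values if abs(v - centre) > cutoff * spread]


# Made-up per-curve scores: 20 typical curves, one mild and one extreme outlier
scores = [0.99, 1.01] * 10 + [1.3, 3.0]

first_pass = flag(scores)
remaining = [v for v in scores if v not in first_pass]
second_pass = flag(remaining)
print(first_pass, second_pass)
```

In the first pass the extreme value inflates the standard deviation enough to hide the milder one, which is only flagged once the extreme value has been removed.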


<div class="alert alert-block alert-warning">
    
    <i class="fa fa-exclamation-triangle"></i>&nbsp; <b>Overwriting data arrays and outlier lists</b><br>
    Notice that when we loaded the cleaned files, we overwrote the initial data arrays, but when we re-ran the outlier detection we stored the output in new lists instead of overwriting the first-round lists. This is because the <code>svd_outliers</code> function appends to the list instead of overwriting it. To avoid issues caused by this behavior, it is best to store the second round (and any subsequent iterations) of outlier detection in separate lists. 
    </div>

### Step 2.5: Remove Outliers
Now that we have detected all the scattering curve outliers in both "laser on" and "laser off" scattering curves, we need to create a file list that has all of the outliers removed. This is itself a multi-step process. 

#### Step 2.5.1: Combining Outlier Lists
First, the outlier lists need to be combined to a common list. 

In [None]:
# combine all outliers into a single list
all_on_outliers = on_outlier_files1 + on_outlier_files2
all_off_outliers = off_outlier_files1 + off_outlier_files2

all_outliers = all_on_outliers + all_off_outliers

#### Step 2.5.2: Removing Duplicate Outliers from Outlier Lists
Because the same curve number can simultaneously be an outlier for both "laser off" and "laser on" data sets, the combined outlier lists could have duplicate curve numbers, which would cause issues with removing these files from the list. Therefore, to avoid issues further into the workflow, it is best practice to get the unique set of outlier curve numbers and sort the file list in order of collection using built-in Python functions.

In [None]:
# select only unique values for outliers
svd_set = set(all_outliers)
all_outliers = list(svd_set)

# sort list of outliers
all_outliers.sort()

#### Step 2.5.3: Creating a Cleaned File List
The final step of the outlier detection process is to remove all the outliers from the file list to create a cleaned file list that can be used as input for further processing and analysis. The outlier removal follows the same procedure as the initial outlier removal step. 

In [None]:
# remove all outliers (SVD iterations 1 and 2) from the original file lists
on_cleaned, on_outliers = remove_outliers(flist=on_files, olist=all_outliers, fslice=[-9,-6])        
off_cleaned, off_outliers = remove_outliers(flist=off_files, olist=all_outliers, fslice=[-9,-6])

<div class="alert alert-block alert-warning">
    
    <i class="fa fa-exclamation-triangle"></i>&nbsp; <b>Static SAXS Outlier Detection</b><br>
    This function can also be used to detect the outliers in static SAXS data sets. In the case of static SAXS data, the workflow is simplified because, instead of paired "laser on" and "laser off" curves, there is only one set of scattering curves, so the analysis only has to be done on the one set. It is still recommended to use at least 2 outlier detection steps, just as for the TR, T-Jump data.  
    </div>

## Step 3: Calculate TR, T-Jump Difference Curves

In static SAXS analysis, the scattering curves remaining after outlier removal are used in subsequent analysis. However, in TR, T-Jump SAXS analysis the difference curves are used in subsequent analysis. The difference curves are determined by subtracting each "laser off" curve from its paired "laser on" curve. Calculating the difference curves is also a multi-step process because the curves must be scaled to a common point prior to subtraction. 

### Step 3.1: Create Labels
The file names can be too long to use as labels so, to keep the output file names and plot labels tidy, it is recommended to create a list of labels that can be mapped to the data array.

In [None]:
# Create list of labels to use for saving diff curves
labs = []
for i in on_cleaned:
    labs.append(i[-9:-6])

<br>

The labels above were created from the `on_cleaned` list, but either list could be used, provided that `on_cleaned` and `off_cleaned` are both sorted and in the same order. 
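
Because everything that follows relies on this pairing, it can be worth verifying it before scaling and subtracting. The sketch below is one way to do so; `check_pairing` and the `demo_` file names are hypothetical and not part of the package.

```python
def check_pairing(on_list, off_list, fslice=(-9, -6)):
    """Hypothetical check: raise if the two lists do not pair up label for label."""
    start, stop = fslice
    if len(on_list) != len(off_list):
        raise ValueError('lists have different lengths')
    for on_file, off_file in zip(on_list, off_list):
        if on_file[start:stop] != off_file[start:stop]:
            raise ValueError(f'mismatched pair: {on_file} / {off_file}')
    # Labels are the sliced curve numbers, in list order
    return [f[start:stop] for f in on_list]


demo_on = ['demo_1ms_001_Q.chi', 'demo_1ms_002_Q.chi', 'demo_1ms_004_Q.chi']
demo_off = ['demo_-5us_001_Q.chi', 'demo_-5us_002_Q.chi', 'demo_-5us_004_Q.chi']

demo_labs = check_pairing(demo_on, demo_off)
print(demo_labs)
```

A check like this catches the case where an outlier was removed from only one of the two lists, which would otherwise shift every later pair by one curve.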

### Step 3.2 & 3.3: Scaling and Subtracting Curves
Because the difference curves have to be calculated for each set of paired "laser on" and "laser off" curves, these 2 steps have to be performed together within a loop. 

Prior to subtracting the "laser off" curve from its paired "laser on" curve the curves must be scaled to a common curve. Because the "laser on" data contains the scattering signal from which kinetics will later be extracted, the best practice is to scale the "laser off" curve to the "laser on" curve so that artifacts are not introduced to the "laser on" curves during data processing. This means that the "laser on" curve can be used as the reference curve for the scaling of its paired "laser off" curve. The range of scaling should be determined for each data set individually. 

The `saxs_scale` function will scale a given curve to a given reference curve. It has three required parameters:
> 1. `ref` is a string that indicates the file (including the full path) containing the reference curve, typically the "laser on" curve for this analysis. <br>
> 2. `scale` is a string that indicates the file containing the curve to be scaled, typically the paired "laser off" curve for this analysis. <br>
> 3. `dataset` is a string that labels the data set and is used for output files and plots. <br>

There are also several optional parameters:
> 1. `err` is a boolean indicating if the files for both the reference and scaled curves contain a column for measurement errors. If set to True, then the errors will be propagated for this step. The default value is False, which does nothing with the errors. <br>
> 2. `delim` indicates the type of delimiter used in the reference and scaled curve files; the default value is a comma.<br>
> 3. `mask` is an integer indicating the number of rows to skip when importing data, with the default value set to 0. This parameter can be useful when trying to load data files that have metadata or headers. <br>
> 4. `qmin` and `qmax` indicate the minimum and maximum values, respectively, of the scattering vector to use for the scaling range. <br>
> 5. `outfile` and `outdir` indicate the output file name and location of the output file, respectively. <br>

The `saxs_scale` function returns the scaled curve as a numpy array. It also displays before-and-after scaling plots for the scaled curve. 

It is recommended that these plots be examined closely for any issues with scaling as this is one of the trickier steps in the analysis workflow. 
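
As a rough picture of what scaling over a q range means, the sketch below matches the summed intensity of two synthetic curves inside a window. This is an assumption made for illustration; `saxs_scale` may compute its scale factor differently, and `scale_to_reference` is a hypothetical name.

```python
import numpy as np


def scale_to_reference(q_arr, ref_curve, curve, qmin, qmax):
    """Scale `curve` so its summed intensity in [qmin, qmax] matches `ref_curve`."""
    window = (q_arr >= qmin) & (q_arr <= qmax)
    factor = ref_curve[window].sum() / curve[window].sum()
    return curve * factor, factor


# Synthetic curves: the "laser off" curve has the same shape but 20% less intensity
demo_q = np.linspace(0.02, 2.5, 500)
demo_on = 10.0 * np.exp(-demo_q) + 1.0
demo_off = 0.8 * demo_on

demo_scaled, demo_factor = scale_to_reference(demo_q, demo_on, demo_off, qmin=1.4, qmax=1.6)
print(round(demo_factor, 3))
```

Only the points inside the window determine the factor, so a poorly chosen window can distort the whole curve. This is one reason to inspect the before and after plots.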

The `saxs_sub` function will subtract a given reference curve from the indicated data curve. This function has 2 required input parameters, `ref` and `data`, which are both strings containing the file name (with the full path) of the curve that will be subtracted and the curve that will be corrected, respectively. When calculating T-Jump difference curves, the reference curve is the "laser off" curve and the data curve is the paired "laser on" curve. Optional parameters include:
> 1. `delim_ref` and `delim_data` indicate the delimiters used in the reference curve file and the data curve file, respectively. <br>
> 2. Likewise, `ref_skip` and `data_skip` indicate the number of rows to skip when loading the reference curve and data curve, respectively.<br>
> 3. The `outdir` and `outfile` parameters indicate where the output file should be saved and the name of the output file, respectively. If the indicated output directory does not exist, the function will create it using the `make_dir()` function covered in [Tutorial 1](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/blob/main/TUTORIALS/tutorial1_file-handling.ipynb?ref_type=heads "Tutorial 1").<br>

The `saxs_sub` function returns a numpy array containing the corrected curve. 
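
For arrays that are already loaded, the subtraction itself is a single line. The sketch below also shows how independent measurement errors would combine in quadrature if error columns are available. `difference_curve` and the numbers are hypothetical, and this is not the code inside `saxs_sub`.

```python
import numpy as np


def difference_curve(on_i, off_i, on_err=None, off_err=None):
    """Return laser-on minus laser-off intensity, plus propagated errors if given."""
    diff = on_i - off_i
    if on_err is None or off_err is None:
        return diff, None
    # Independent errors add in quadrature under subtraction
    return diff, np.sqrt(on_err ** 2 + off_err ** 2)


demo_on = np.array([10.0, 8.0, 6.0])
demo_off = np.array([9.5, 7.0, 6.0])
demo_diff, demo_err = difference_curve(demo_on, demo_off,
                                       on_err=np.array([0.3, 0.3, 0.3]),
                                       off_err=np.array([0.4, 0.4, 0.4]))
print(demo_diff, demo_err)
```

Note that the propagated error is larger than either input error, so difference curves are always noisier than the scattering curves they come from.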

In [None]:
for n, f, l in zip(on_cleaned, off_cleaned, labs):
    # scale laser off to laser on curve at isosbestic point for water
    # note: n[75:] trims the leading part of the file path, so the offset
    # depends on the length of the path on your machine
    scaled = saxs_scale(ref=str(n), scale=str(f), dataset='SAXS_Tutorial3_' + str(l),
                        qmin=1.5, qmax=2.5, delim=' ', err=False, mask=10, outfile='scaled' + str(n[75:]),
                        outdir='./OUTPUT/TUTORIAL3/LASER_OFF_SCALED/')

    # calculate difference curve (laser on minus laser off)
    buf_sub = saxs_sub(ref=f, data=n, delim_ref=' ', delim_data=' ', err=False, ref_skip=10, 
                       data_skip=10, outdir='./OUTPUT/TUTORIAL3/DIFF_CURVES/', 
                       outfile='diff' + str(n[75:]))

# Example 2: Iterating Over Multiple Time Delays and Sets
You can automate the above analysis over multiple time delays to speed up processing for TR, T-Jump SAXS curves. 

## Step 1: Define Dataset Variables
To iterate over multiple time delays at once, the dataset(s), time delays, file directories, and file prefixes need to be assigned. 

In [None]:
%%time

# define set names
sets = ['20hz_set02', '20hz_set02', '20hz_set02',
        '20hz_set01', '20hz_set01', '20hz_set01', '20hz_set01', '20hz_set01',
        '20hz_set03', '20hz_set03', '20hz_set03',
        '5hz_set01', '5hz_set01', '5hz_set01']

# define time delays
time_delays = ['1.5us', '3us', '5us', 
               '10us', '50us', '100us', '500us', '1ms',
               '5us', '300us', '1ms',
               '1ms', '10ms', '100ms']

# define data directories
directories = ['../../../TR_T-jump_SAXS_July2022/protein_20hz_set02/processed/', '../../../TR_T-jump_SAXS_July2022/protein_20hz_set02/processed/',
               '../../../TR_T-jump_SAXS_July2022/protein_20hz_set02/processed/', '../../../TR_T-jump_SAXS_July2022/protein_20hz_set01/processedb/',
               '../../../TR_T-jump_SAXS_July2022/protein_20hz_set01/processedb/','../../../TR_T-jump_SAXS_July2022/protein_20hz_set01/processedb/',
               '../../../TR_T-jump_SAXS_July2022/protein_20hz_set01/processedb/', '../../../TR_T-jump_SAXS_July2022/protein_20hz_set01/processedb/',
               '../../../TR_T-jump_SAXS_July2022/protein_20hz_set03_redo/processed/', '../../../TR_T-jump_SAXS_July2022/protein_20hz_set03_redo/processed/',
               '../../../TR_T-jump_SAXS_July2022/protein_20hz_set03_redo/processed/', '../../../TR_T-jump_SAXS_July2022/protein_5hz_set01/processdb/',
               '../../../TR_T-jump_SAXS_July2022/protein_5hz_set01/processdb/', '../../../TR_T-jump_SAXS_July2022/protein_5hz_set01/processdb/'] 

prefixes = ['protein_20hs_44C_', 'protein_20hs_44C_', 'protein_20hs_44C_',
           'protein_20hz_set01_','protein_20hz_set01_','protein_20hz_set01_','protein_20hz_set01_','protein_20hz_set01_',
           'protein_20hz_44C_', 'protein_20hz_44C_', 'protein_20hz_44C_',
           'protein_5hz_set01_', 'protein_5hz_set01_', 'protein_5hz_set01_']
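
Parallel lists like these are easy to get out of step, and `zip()` silently stops at the shortest list. The sketch below is one way to guard against that before starting a long loop; `zip_strict` and the `demo_` values are hypothetical.

```python
def zip_strict(**named_lists):
    """Zip parallel lists after confirming they all have the same length."""
    lengths = {name: len(values) for name, values in named_lists.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f'parallel lists differ in length: {lengths}')
    return list(zip(*named_lists.values()))


demo_delays = ['10us', '1ms']
demo_dirs = ['./set01/', './set01/']
demo_prefixes = ['demo_set01_', 'demo_set01_']

demo_jobs = zip_strict(delays=demo_delays, dirs=demo_dirs, prefixes=demo_prefixes)
print(demo_jobs)
```

The error message reports the length of each list by name, which makes it quick to find the list with the missing entry.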
               

## Step 2: Iterating Processing Over All Time Delays

In [None]:
# Loop over each data set
for t,d,p,s in zip(time_delays, directories, prefixes, sets):
    
    print('Analyzing ' + str(t) + '..')
    
    # Get all laser on scattering curves for the current time delay
    on_files = make_flist(directory=str(d), 
                          prefix=str(p) + str(t) + '_', suffix='_Q.chi')

    # Get all laser off (-5 us) scattering curves
    off_files = make_flist(directory=str(d),
                           prefix=str(p) + '-5us_', suffix='_Q.chi')

    # sort files lists
    on_files.sort()
    off_files.sort()
    

    # load laser on scattering curves
    on_data, on_array, q, on_err = load_set(flist=on_files, delim=' ', mask=10, err=False)

    # load laser off scattering curves
    off_data, off_array, q, off_err = load_set(flist=off_files, delim=' ', mask=10, err=False)
    

    # plot curves for sanity check
    print('Plotting Scattering Curves...')

    # Create list of labels to use for plot legend
    labs = []
    for i in on_files:
        labs.append(i[-9:-6])

    # plot laser on curves
    plot_curve(data_arr=on_array, q_arr=q, labels=labs, qmin=0.025, qmax=0.15, imin=None, imax=None,
                x='scattering vector (Å$^{-1}$)', y='scattering intensity', title=str(s) + ' SAXS Scattering at ' + str(t), 
                save=False, save_dir=None, save_name=None)

    # plot laser off curves
    plot_curve(data_arr=off_array, q_arr=q, labels=labs, qmin=0.025, qmax=0.15, imin=None, imax=None, 
                x='scattering vector (Å$^{-1}$)', y='scattering intensity', title=str(s) + ' SAXS Scattering at -5us', 
                save=False, save_dir=None, save_name=None)

    #SVD outlier detection iteration 1
    print('Running SVD 1...')
    on_outlier_files1, on_outlier_index1 = svd_outliers(arr=on_array, flist=on_files, q=q, cutoff=2.5, 
                                                        save_dir='./OUTPUT/TUTORIAL3/' + str(s) + '/OUTLIERS/' + str(t) + '/', 
                                                        save_name=str(t) + '_svd_outliers1')

    off_outlier_files1, off_outlier_index1 = svd_outliers(arr=off_array, flist=off_files, q=q, cutoff=2.5, 
                                                            save_dir='./OUTPUT/TUTORIAL3/' + str(s) + '/OUTLIERS/' + str(t) + '/', 
                                                            save_name=str(t) + '_-5us_svd_outliers1')
    
    # combine all outliers into a single list
    all_svd_outliers = on_outlier_files1 + off_outlier_files1
    
    # select only unique values for outliers
    svd_outliers1 = unique_set(lst=all_svd_outliers)
    
    # sort list of outliers
    svd_outliers1.sort()

    # remove SVD iteration 1 outliers from file lists
    on_cleaned, on_outliers = remove_outliers(flist=on_files, olist=svd_outliers1, fslice=[-9,-6])        
    off_cleaned, off_outliers = remove_outliers(flist=off_files, olist=svd_outliers1, fslice=[-9,-6])
    
    # load clean curves
    on_data, on_array, q, on_err = load_set(flist=on_cleaned, delim=' ', mask=10, err=False)
    off_data, off_array, q, off_err = load_set(flist=off_cleaned, delim=' ', mask=10, err=False)
    
    #SVD outlier detection iteration 2
    print('Running SVD 2...')
    on_outlier_files2, on_outlier_index2 = svd_outliers(arr=on_array, flist=on_cleaned, q=q, cutoff=2.5, 
                                                        save_dir='./OUTPUT/TUTORIAL3/' + str(s) + '/OUTLIERS/' + str(t) + '/', 
                                                        save_name=str(t) + '_svd_outliers2')

    off_outlier_files2, off_outlier_index2 = svd_outliers(arr=off_array, flist=off_cleaned, q=q, cutoff=2.5, 
                                                            save_dir='./OUTPUT/TUTORIAL3/' + str(s) + '/OUTLIERS/' + str(t) + '/', 
                                                            save_name=str(t) + '_-5us_svd_outliers2')
    
    # combine all outliers into a single list
    all_on_outliers = on_outlier_files1 + on_outlier_files2
    all_off_outliers = off_outlier_files1 + off_outlier_files2
    all_outliers = all_on_outliers + all_off_outliers
    
    # select only unique values for outliers
    unique_outliers = unique_set(lst=all_outliers)
    
    # sort list of outliers
    unique_outliers.sort()
    
    # remove all outliers (SVD iterations 1 and 2) from the original file lists
    on_cleaned, on_outliers = remove_outliers(flist=on_files, olist=unique_outliers, fslice=[-9,-6])        
    off_cleaned, off_outliers = remove_outliers(flist=off_files, olist=unique_outliers, fslice=[-9,-6])
    
    # Create list of labels to use for saving diff curves
    labs = []
    for i in on_cleaned:
        labs.append(i[-9:-6])
    
    # loop over all laser on and laser off files
    for n, f, l in zip(on_cleaned, off_cleaned, labs):
        # scale laser off to laser on curve at isosbestic point for water
        scaled = saxs_scale(ref=str(n), scale=str(f), dataset=str(s) + '_' + str(t) + '_' + str(l) + '_-5us',
                            qmin=1.4, qmax=1.6, delim=' ', err=False, mask=10, 
                            outfile=str(s) + '_' + str(n[-9:-6]) + '_Q.chi',
                            outdir='./OUTPUT/TUTORIAL3/' + str(s) + '/LASER_OFF_SCALE/' + str(t) + '/')
        
        # calculate difference curve
        buf_sub = saxs_sub(ref=f, 
                           data='./OUTPUT/TUTORIAL3/' + str(s) + '/LASER_OFF_SCALE/' + str(t) + '/' + str(s) + '_' + str(n[-9:-6]) + '_Q.chi',
                           delim_ref=' ', delim_data=',', err=False, ref_skip=10, 
                            data_skip=10, outdir='./OUTPUT/TUTORIAL3/' + str(s) + '/-5us_DIFF/' + str(t) + '/', 
                            outfile= str(s) + '_' + str(n[-9:-6]) + '_Q.chi')


<div class="alert alert-block alert-success">
    
    <i class="fa fa-check-circle"></i>&nbsp; <b>Congratulations!</b><br>
    You completed the third tutorial! Continue with <a href="https://github.com/ScientistAsh/tr_tjump_saxs/blob/main/TUTORIALS/tutorial4_saxs_processing_difference.ipynb">Tutorial 4</a> to process the T-Jump difference curves.

</div>