# Tutorial 4: Data Processing for Time Resolved, Temperature-Jump SAXS Difference Curves

**Package Information:**<br>
Currently, the [tr_tjump_saxs](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/tree/main?ref_type=heads "tr_tjump_saxs") package only works through the Python 3 command line. The full dependencies can be found on the [Henderson GitLab Page](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/tree/main?ref_type=heads "tr_tjump_saxs"), and the environment can be cloned from the [environment.yml file](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/blob/main/environment.yml?ref_type=heads "environment.yml file"). The data analysis can be executed from an interactive Python command line such as [iPython](https://www.python.org/) or [Jupyter](https://jupyter.org/), or the code can be written in a script and run non-interactively. The preferred usage is in Jupyter Lab because this is the environment the package was developed in. A Jupyter notebook also keeps the code, its output, and your notes in a single file, which serves as a record of the data analysis performed, the code used to conduct it, and the output of the analysis.

**Tutorial Information:**<br>
This set of tutorial notebooks will cover how to use the `tr_tjump_saxs` package to analyze TR, T-Jump SAXS data and the <a href="https://www.science.org/doi/10.1126/sciadv.adj0396">workflow used to study HIV-1 Envelope glycoprotein dynamics. </a> This package contains multiple modules, each containing a set of functions to accomplish a specific subtask of the TR, T-Jump SAXS data analysis workflow. Many of the functions are modular and some can be helpful for analyzing static SAXS and other data sets as well.  

**Package Modules:**<br>
> 1. `file_handling`<br>
> 2. `saxs_processing`<br>
> 3. `saxs_qc`<br>
> 4. `saxs_kinetics`<br>
> 5. `saxs_modeling`<br>

**Developer:** [@ScientistAsh](https://github.com/ScientistAsh "ScientistAsh GitHub")

**Updated:** 6 February 2024

# Tutorial 4 Introduction
In the Tutorial 3 + 4 notebooks, I introduce the `saxs_processing` module from the `tr_tjump_saxs` package. This tutorial will cover processing SAXS difference curves from T-jump experiments. The `saxs_processing` module provides functions that will find and remove outliers, scale curves, and subtract curves. This tutorial assumes that you have finished the first two tutorials. If you need help with loading files or processing SAXS scattering curves, see the relevant [tutorials](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/tree/main/TUTORIALS?ref_type=heads). If you find any issues with this tutorial or module, please create an issue on the repository GitLab page ([tr_tjump_saxs issues](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/issues "tr_tjump_saxs Issues")). 

# Module functions:
> 1. `svd_outliers()` uses an SVD method to determine outliers in scattering curves. <br>
> 2. `iterative_chi()` uses a $\chi^2$ method to determine outliers in T-Jump difference curves. <br>
> 3. `saxs_scale()` scales a given curve to a given reference curve. <br>
> 4. `saxs_sub()` subtracts a given curve from a reference curve. <br>
> 5. `auc_outliers()` will determine the outliers of a given data set using an area under the curve method.$^*$ <br>
> 6. `move_outliers()` will move the outlier files to a given directory for quarantine. This is not recommended because files can get lost easily.$^*$ <br>

$^*$Please note that the `move_outliers` and `auc_outliers` functions are no longer used, will not be maintained, and are not recommended for use.

# Tutorial Data Files:

### Data Files
The original data used in this analysis is deposited on the [SASBDB](https://www.sasbdb.org/) with accession numbers:
> **Static Data:** <br>
    - *CH505 Temperature Series*: SASDT29, SASDT39, SASDT49, SASDT59 <br>
    - *CH848 Temperature Series*: SASDTH9, SASDTJ9, SASDTK9, SASDTL9 <br>
<br>
> **T-Jump Data:** <br>
    - *CH505 T-Jump Data*: SASDT69, SASDT79, SASDT89, SASDT99, SASDTA9, SASDTB9, SASDTC9, SASDTD9, SASDTE9, SASDTF9, SASDTG9 <br>
     - *CH848 T-Jump Data*: SASDTM9, SASDTN9, SASDTP9, SASDTQ9, SASDTR9, SASDTS9, SASDTT9, SASDTU9, SASDTV9, SASDTW9 <br>
<br>
> **Static Env SOSIP Panel:** SASDTZ9, SASDU22, SASDU32, SASDU42, SASDTX9, SASDTY9 <br>

Additional MD data associated with the paper can be found on [Zenodo](https://zenodo.org/records/10451687).

### Output Files
Example output is included in the [OUTPUT](https://github.com/ScientistAsh/tr_tjump_saxs/tree/main/TUTORIALS/OUTPUT/) subdirectory in the [TUTORIALS](https://github.com/ScientistAsh/tr_tjump_saxs/tree/main/TUTORIALS/) directory.

# How to Use Jupyter Notebooks
You can execute the code directly in this notebook or create your own notebook and copy the code there. 

<div class="alert alert-block alert-info">
    
    <b><i class="fa fa-info-circle" aria-hidden="true"></i>&nbsp; Tips</b><br>
    
    <b>1.</b> To run the currently highlighted cell, hit the <code>shift</code> and <code>enter</code> keys at the same time.<br>
    <b>2.</b> To get help with a specific function, place the cursor inside the function's parentheses and hit the <code>shift</code> and <code>tab</code> keys at the same time.

</div>

<div class="alert alert-block alert-info" style="background-color: white; border: 2px solid; padding: 10px">
    <b><i class="fa fa-star" aria-hidden="true"></i>&nbsp; In the Literature</b><br>
    
    Our <a href="https://www.science.org/doi/10.1126/sciadv.adj0396">recent paper </a> in Science Advances provides an example of the type of data, the analysis procedure, and example output for this type of data analysis.  <br> 
    
</div>

# Import Modules

In order to use the `saxs_processing` module, the `tr_tjump_saxs` package needs to be imported. The dependencies will automatically be imported with the package import.

In [None]:
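# import sys so the package location can be added to the module search path
import sys

# modules used directly in this notebook
# (redundant but harmless if the package imports below already provide them)
import csv
import math
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1.inset_locator import mark_inset

# append the path for the tr_tjump_saxs package to the module search path
sys.path.append(r'../')

# import the tr_tjump_saxs modules and their functions
from file_handling import *
from saxs_processing import *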

<div class="alert alert-block alert-info">
    <b><i class="fa fa-info-circle" aria-hidden="true"></i>&nbsp; Tips</b><br>
    Be sure that the path for the <code>tr_tjump_saxs</code> package appended to the <code>PYTHONPATH</code> matches the path to the repository on your machine.
    </div>
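
A minimal sketch of one way to check the path before importing, so that a wrong location fails with a clear message rather than an import error. The helper name `add_package_path` is hypothetical and is not part of the `tr_tjump_saxs` package.

```python
import sys
from pathlib import Path


def add_package_path(path_str):
    """Append a directory to sys.path if it exists and is not already listed."""
    path = Path(path_str).resolve()
    if not path.is_dir():
        raise FileNotFoundError(f'Package directory not found: {path}')
    if str(path) not in sys.path:
        sys.path.append(str(path))
    return path


# '../' mirrors the relative path used in the import cell above
package_dir = add_package_path('../')
print('Using package directory:', package_dir)
```

If the repository lives elsewhere on your machine, pass that location instead of `'../'`.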

<a id='Overview'></a>

# Overview: Finding SAXS Difference Curve Outliers

The first step in analyzing TR, T-Jump SAXS data is to detect and remove outlier scattering and difference curves. During a TR, T-Jump collection, scattering curves are measured for both buffer and protein. Static SAXS scattering curves are collected in addition to "laser off" and "laser on" T-Jump scattering curves for both protein and buffer T-Jumps. The outliers for all of these sets of scattering curves need to be determined before further analysis.

In this tutorial, we will only show this processing on the T-Jump difference curves. At the end, looping over multiple time delays and data sets is explained. 

# Example 1: Basic Usage
## Step 1: Load Difference Curves
The rest of the TR, T-Jump SAXS processing and analysis will use the difference curves. Now that the difference curves have been calculated, they can be loaded into an array using the `load_set()` function the same as was done for the SAXS scattering curves. It is always a good idea to plot the curves to make sure the data was calculated and loaded correctly. 

In [None]:
# load difference files for iterative chi test
diff_files = make_flist(directory='./OUTPUT/TUTORIAL3/20hz_set01/-5us_DIFF/1ms/', 
                        prefix=None, suffix='.chi')
# sort files lists
diff_files.sort()

# load difference curves
diff_data, diff_array, q, diff_err = load_set(flist=diff_files, delim=',', mask=0, err=False)

# Create list of labels to use for plot legend
print('Plotting difference curves')
labs = []
for i in diff_files:
    labs.append(i[-9:-6])

# plot difference curves as sanity check 
plot_curve(data_arr=diff_array, q_arr=q, labels=labs, qmin=0.03, qmax=0.15, imin=None, imax=None, x='scattering vector (Å$^{-1}$)',
            y='scattering intensity', 
            title='CH505 TR, T-Jump SAXS Difference Curves at 1ms', save=True, 
            save_dir='./OUTPUT/TUTORIAL4/PLOTS/', save_name='tutorial4_ex1_step1.png')  
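
The legend labels come from slicing each file name with `[-9:-6]`. The sketch below shows what that slice returns, assuming the file names end in a three-digit image number followed by `_Q.chi` (the suffix used in Example 2). The file names here are hypothetical.

```python
# Hypothetical file names that end in a three-digit image number plus '_Q.chi'
example_files = [
    './1ms/20hz_set01_1ms_001_Q.chi',
    './1ms/20hz_set01_1ms_002_Q.chi',
    './1ms/20hz_set01_1ms_010_Q.chi',
]

# '_Q.chi' is six characters long, so [-9:-6] selects the three characters before it
labels = [name[-9:-6] for name in example_files]
print(labels)
```

If your files follow a different naming pattern, adjust the slice so that it still picks out the image number.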


## Step 2: Find and Remove Outlier Difference Curves

### Step 2.1: Run Iterative $\chi^2$ Analysis

In [None]:
%%time 

# run iterative chi test
print('Running iterative chi test...')
iterative_chi(arr=diff_array, flist=diff_files, chi_cutoff=1.5, 
              outfile='./OUTPUT/TUTORIAL4/OUTLIERS/1ms_chi_outliers.csv', calls=1)
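
To give a feel for what an iterative $\chi^2$ rejection does, here is a minimal sketch on synthetic data. It is not the implementation used by `iterative_chi()`. The function name `iterative_chi_sketch`, the use of the per-q standard deviation as the error estimate, and the stopping rule are all assumptions made for illustration.

```python
import numpy as np


def iterative_chi_sketch(curves, chi_cutoff=1.5, max_iter=50):
    """Return a boolean mask of curves kept after iterative chi-square rejection.

    Illustrative only: each curve is compared with the mean of the currently
    kept curves, and the reduced chi-square uses the per-q standard deviation.
    """
    curves = np.asarray(curves, dtype=float)
    keep = np.ones(len(curves), dtype=bool)
    for _ in range(max_iter):
        mean = curves[keep].mean(axis=0)
        std = curves[keep].std(axis=0, ddof=1)
        std[std == 0] = 1.0  # avoid dividing by zero at constant q points
        chi2 = (((curves - mean) / std) ** 2).mean(axis=1)
        new_keep = keep & (chi2 <= chi_cutoff)
        if new_keep.sum() == keep.sum():
            break  # no further curves were rejected
        keep = new_keep
    return keep


# Synthetic data: 20 noisy curves plus one curve with a large offset
rng = np.random.default_rng(0)
curves = rng.normal(0.0, 1.0, size=(21, 200))
curves[-1] += 5.0
keep = iterative_chi_sketch(curves, chi_cutoff=1.5)
print('Rejected curve indices:', np.flatnonzero(~keep))
```

The mean is recomputed after each round because a strong outlier skews it, which can hide weaker outliers on the first pass.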

### Step 2.2: Remove Outliers

In [None]:
# load chi outliers as a list
chi_outliers_file = './OUTPUT/TUTORIAL4/OUTLIERS/1ms_chi_outliers.csv'

# read outliers into a list
with open(chi_outliers_file, 'r') as f:
    reader = csv.reader(f)
    outliers = list(reader)

# create list to store chi outlier files in
chi_outliers = []

# build outlier list
for i in outliers:
    chi_outliers.append(i[0])

# get unique set
unique_chi = unique_set(chi_outliers)

# remove outliers from difference curves
cleaned_files, chi_outliers = remove_outliers(flist=diff_files, olist=unique_chi, 
                             fslice=[-9,-6])


<div class="alert alert-block alert-warning">
    
    <i class="fa fa-exclamation-triangle"></i>&nbsp; <b>Outlier Warning</b><br>
    The <code>remove_outliers</code> function counts the files in the input list and in the outlier list and checks whether the number of files remaining matches what is expected from those counts. This works without issues when a saved outlier list and a full file list are loaded. However, if you run <code>remove_outliers</code> immediately after <code>iterative_chi</code>, the <code>diff_files</code> list passed to <code>remove_outliers</code> will already have most of the outliers removed, while the outlier list still includes all of the detected outliers. In that case you may see a warning that the resulting file list does not contain the expected number of files. This warning can generally be ignored if you are sure the number of remaining files is correct. To double-check, make sure that the number of files printed at the end of the <code>iterative_chi</code> output matches the number of files reported in the <code>remove_outliers</code> warning.
    </div>
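
The bookkeeping described in this warning can be sketched in plain Python. This is an illustration and not the code inside `remove_outliers` or `unique_set`. It assumes that the first column of the outlier CSV holds the three-digit image number and that the same number appears at `name[-9:-6]` in each file name. The CSV contents and file names are hypothetical.

```python
import csv
import io

# Hypothetical outlier CSV: the first column holds the three-digit image number,
# and the same image can be flagged more than once
outlier_csv = io.StringIO('002,3.1\n005,2.4\n002,1.9\n')
outlier_ids = [row[0] for row in csv.reader(outlier_csv)]

# Drop duplicates while preserving the order of first appearance
unique_ids = list(dict.fromkeys(outlier_ids))

# Hypothetical file list; name[-9:-6] is the image number, as in fslice=[-9,-6]
files = [f'./1ms/20hz_set01_1ms_{n:03d}_Q.chi' for n in range(1, 7)]
cleaned = [name for name in files if name[-9:-6] not in unique_ids]

# Compare the observed count with the count expected from the two input lists
expected = len(files) - len(unique_ids)
print(f'{len(cleaned)} files kept, {expected} expected')
```

If some outliers had already been dropped from `files`, `len(cleaned)` would be larger than `expected`, which is the situation the warning describes.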



Notice that the number of files remaining after outlier removal is 217, which matches the 217 files left after the `iterative_chi` analysis. Before moving on, we need to reload `diff_array`: it was last loaded before the outliers were removed, so it still contains the outlier curves.

### Step 2.3: Reload diff_array

In [None]:
# sort files lists
cleaned_files.sort()

# load difference curves
diff_data, diff_array, q, diff_err = load_set(flist=cleaned_files, delim=',', mask=0, err=False)

# Create list of labels to use for plot legend
print('Plotting difference curves')
labs = []
for i in cleaned_files:
    labs.append(i[-9:-6])

# plot difference curves as sanity check 
plot_curve(data_arr=diff_array, q_arr=q, labels=labs, qmin=0.03, qmax=0.15, imin=None, imax=None, x='scattering vector (Å$^{-1}$)',
            y='scattering intensity', 
            title='CH505 TR, T-Jump SAXS Difference Curves at 1ms', save=True, 
            save_dir='./OUTPUT/TUTORIAL4/PLOTS/', save_name='tutorial4_ex1_step2.png')  


## Step 3: Calculate the Average Difference Curve

The final step of the TR, T-Jump SAXS data processing workflow is to determine the average curve and its standard error. Here, we use the cleaned difference curves for the 1ms time delay to calculate the average 1ms difference curve and the standard error of the mean.

### Step 3.1: Get Average Curve

In [None]:
# get avg curve
avg_curve = diff_array.mean(axis=0)

### Step 3.2: Get Standard Error

In [None]:
# get standard deviation
curve_std = diff_array.std(axis=0)

# get standard error
curve_err = curve_std/math.sqrt(len(diff_array))
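
The standard error of the mean at each q value is $\mathrm{SEM} = \sigma / \sqrt{N}$, where $\sigma$ is the standard deviation across curves and $N$ is the number of curves. The small worked example below uses synthetic numbers to show the calculation. It also shows the effect of NumPy's `ddof` argument: the cell above uses the default `ddof=0`, and `ddof=1` is shown only for comparison.

```python
import numpy as np

# Two synthetic "curves" (rows) sampled at two q values (columns)
curves = np.array([[1.0, 2.0],
                   [3.0, 6.0]])
n_curves = curves.shape[0]

avg = curves.mean(axis=0)

# ddof=0 (the NumPy default) gives the population standard deviation;
# ddof=1 gives the sample standard deviation
sem_population = curves.std(axis=0, ddof=0) / np.sqrt(n_curves)
sem_sample = curves.std(axis=0, ddof=1) / np.sqrt(n_curves)

print(avg)             # [2. 4.]
print(sem_population)  # approximately [0.707 1.414]
print(sem_sample)      # approximately [1. 2.]
```

With hundreds of curves per time delay, the two choices differ very little.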

### Step 3.3: Save Average Curve and SEM

In [None]:
# concat avg curve and sem into single array
avg_sem = np.array([avg_curve, curve_err])

make_dir('./OUTPUT/TUTORIAL4/AVERAGE_DIFF/')

# save avg_curve and sem to file
np.savetxt('./OUTPUT/TUTORIAL4/AVERAGE_DIFF/1ms.csv', 
           np.c_[q, avg_curve, curve_err], delimiter=",")

<div class="alert alert-block alert-warning">
    
    <i class="fa fa-exclamation-triangle"></i>&nbsp; <b>ATSAS Software</b><br>
    The <a href="https://www.embl-hamburg.de/biosaxs/software.html">ATSAS Software </a> works better with space-delimited '.dat' files. With the current version of the <a href="https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/tree/main?ref_type=heads">tr_tjump_saxs </a> package, some of the analysis has to be performed in ATSAS, so it may be more convenient to save the file in the ATSAS-compatible format.
    </div>
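
A minimal sketch of writing a space-delimited, three-column file (q, intensity, error) with `np.savetxt`. The arrays are synthetic stand-ins for `q`, `avg_curve`, and `curve_err`. The file name `1ms.dat` and the number format are assumptions, so check them against what your ATSAS tools expect.

```python
import os
import tempfile
import numpy as np

# Synthetic stand-ins for q, avg_curve, and curve_err from the steps above
q = np.linspace(0.02, 0.15, 5)
avg_curve = np.sin(q)
curve_err = np.full_like(q, 0.01)

# Write three space-separated columns: q, intensity, error
out_dir = tempfile.mkdtemp()
out_file = os.path.join(out_dir, '1ms.dat')  # hypothetical file name
np.savetxt(out_file, np.c_[q, avg_curve, curve_err], delimiter=' ', fmt='%.6e')

# Read the file back to confirm the layout
reloaded = np.loadtxt(out_file)
print(reloaded.shape)
```

In the notebook itself, you would pass the real arrays and a path under `./OUTPUT/TUTORIAL4/AVERAGE_DIFF/` instead of a temporary directory.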



### Step 3.4: Plot Data

In [None]:
# plot all data
print('Plotting data...')

ax = plt.axes([0.125,0.125, 5, 5])

plt.plot(q, avg_sem[0], linewidth=2)
plt.fill_between(q, avg_sem[0]-avg_sem[1], avg_sem[0]+avg_sem[1], alpha=0.5)
plt.xlabel(r'q ($\AA^{-1}$)', fontsize=70, fontweight='bold')  # raw string avoids an invalid '\A' escape
plt.ylabel('Change in Scattering Intensity', fontsize=70, fontweight='bold')
plt.title('CH505TF Average SAXS Difference Curve at 1ms', fontsize=80, fontweight='bold')
plt.xticks(fontsize=60)
plt.yticks(fontsize=60)

for axis in ['top','bottom','left','right']:
    ax.spines[axis].set_linewidth(5)

zoom = plt.axes([-5, 0.5, 4, 4])
plt.plot(q, avg_sem[0], linewidth=2)
plt.fill_between(q, avg_sem[0]-avg_sem[1], avg_sem[0]+avg_sem[1], alpha=0.5)
# style plot
plt.ylabel('Change in Scattering Intensity', fontsize=70, fontweight='bold')
plt.xlabel(r'q ($\AA^{-1}$)', fontsize=70, fontweight='bold')  # raw string avoids an invalid '\A' escape
plt.xticks(fontsize=55)
plt.yticks(fontsize=55)
plt.xlim([0.02, 0.1])
plt.title('CH505TF Average SAXS Difference Curve at 1ms\nZoom', fontsize=80, fontweight='bold')

for axis in ['top','bottom','left','right']:
    zoom.spines[axis].set_linewidth(5)

# mark inset
mark_inset(ax, zoom, loc1=1, loc2=4, fc="none", ec="0.5", linewidth=4)


plt.savefig('./OUTPUT/TUTORIAL4/AVERAGE_DIFF/tutorial4_ex1_step3.png', bbox_inches='tight')

plt.show()


This is the entire workflow for processing TR, T-Jump SAXS difference curves at a single time delay. When processing TR, T-Jump SAXS data, it is usually more convenient to process all time delays at once. Next, we show how to combine the steps described above to iterate through all time delays.

# Example 2: Looping over Multiple Time Delays and Data Sets

## Step 1: Define Dataset Variables

In [None]:
%%time

# define set names (one entry per time delay)
sets = ['20hz_set02', '20hz_set02', '20hz_set02',
        '20hz_set01', '20hz_set01', '20hz_set01', '20hz_set01', '20hz_set01',
        '20hz_set03', '20hz_set03', '20hz_set03',
        '5hz_set01', '5hz_set01', '5hz_set01']

# define time delays
time_delays = ['1.5us', '3us', '5us', 
               '10us', '50us', '100us', '500us', '1ms',
               '5us', '300us', '1ms',
               '1ms', '10ms', '100ms']

# define data directories
directories = ['./OUTPUT/TUTORIAL3/20hz_set02/', './OUTPUT/TUTORIAL3/20hz_set02/',
               './OUTPUT/TUTORIAL3/20hz_set02/', './OUTPUT/TUTORIAL3/20hz_set01/',
               './OUTPUT/TUTORIAL3/20hz_set01/','./OUTPUT/TUTORIAL3/20hz_set01/',
               './OUTPUT/TUTORIAL3/20hz_set01/', './OUTPUT/TUTORIAL3/20hz_set01/',
               './OUTPUT/TUTORIAL3/20hz_set03/', './OUTPUT/TUTORIAL3/20hz_set03/',
               './OUTPUT/TUTORIAL3/20hz_set03/', './OUTPUT/TUTORIAL3/5hz_set01/',
               './OUTPUT/TUTORIAL3/5hz_set01/', './OUTPUT/TUTORIAL3/5hz_set01/'] 

prefixes = ['20hz_set02_', '20hz_set02_', '20hz_set02_',
           '20hz_set01_','20hz_set01_','20hz_set01_','20hz_set01_','20hz_set01_',
           '20hz_set03_', '20hz_set03_', '20hz_set03_',
           '5hz_set01_', '5hz_set01_', '5hz_set01_']
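
The four lists above must stay aligned entry by entry. As an alternative sketch, the same lists can be derived from one compact description of which time delays belong to which set, which avoids mismatches when a set is added or removed. The `datasets` structure is an illustrative assumption and is not something the package requires.

```python
# Each entry pairs a data set with the time delays collected in it
# (same values as the lists defined above)
datasets = [
    ('20hz_set02', ['1.5us', '3us', '5us']),
    ('20hz_set01', ['10us', '50us', '100us', '500us', '1ms']),
    ('20hz_set03', ['5us', '300us', '1ms']),
    ('5hz_set01', ['1ms', '10ms', '100ms']),
]

sets, time_delays, directories, prefixes = [], [], [], []
for set_name, delays in datasets:
    for delay in delays:
        sets.append(set_name)
        time_delays.append(delay)
        directories.append('./OUTPUT/TUTORIAL3/' + set_name + '/')
        prefixes.append(set_name + '_')

print(len(sets), 'set/time-delay pairs')
```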
               

## Step 2: Process Data

In [None]:
%%time

# loop over time delays, make a file list, and load the curve set
for t,d,p,s in zip(time_delays, directories, prefixes, sets):
    print('Loading ' + str(t) + ' curves')
    
    # load difference files for iterative chi test
    diff_files = make_flist(directory=str(d) + '-5us_DIFF/' + str(t) + '/', 
                            prefix=str(p), suffix='_Q.chi')

    # sort files lists
    diff_files.sort()

    # load difference curves
    diff_data, diff_array, q, diff_err = load_set(flist=diff_files, delim=',', mask=0, err=False)

    # Create list of labels to use for plot legend
    print('Plotting difference curves')
    
    labs = []
    for i in diff_files:
        labs.append(i[-9:-6])

    # plot difference curves as sanity check 
    plot_curve(data_arr=diff_array, q_arr=q, labels=labs, qmin=0.03, qmax=0.15, imin=None, imax=None, x='scattering vector (Å$^{-1}$)',
                y='scattering intensity', 
                title='CH505 TR, T-Jump SAXS Difference Curves at ' + str(t), save=True, 
                save_dir='./OUTPUT/TUTORIAL4/PLOTS/', 
                save_name='tutorial4_ex2_' + str(t) + '_before_processing.png')  # include the time delay so each plot gets its own file
    
    
    print('Done loading data!')
    
    # run iterative chi test
    print('Running iterative chi test...')
    iterative_chi(arr=diff_array, flist=diff_files, chi_cutoff=1.5, 
                  outfile='./OUTPUT/TUTORIAL4/OUTLIERS/' + str(t) + '_chi_outliers.csv', calls=1)
    
    # load chi outliers for the current time delay as a list
    chi_outliers_file = './OUTPUT/TUTORIAL4/OUTLIERS/' + str(t) + '_chi_outliers.csv'

    # read outliers into a list
    with open(chi_outliers_file, 'r') as f:
        reader = csv.reader(f)
        outliers = list(reader)

    # create list to store chi outlier files in
    chi_outliers = []

    # build outlier list
    for i in outliers:
        chi_outliers.append(i[0])

    # get unique set
    unique_chi = unique_set(chi_outliers)

    # remove outliers from difference curves
    cleaned_files, chi_outliers = remove_outliers(flist=diff_files, olist=unique_chi, 
                                 fslice=[-9,-6])

    
    # sort files lists
    cleaned_files.sort()

    # load difference curves
    diff_data, diff_array, q, diff_err = load_set(flist=cleaned_files, delim=',', mask=0, err=False)

    # Create list of labels to use for plot legend
    print('Plotting difference curves')
    labs = []
    for i in cleaned_files:
        labs.append(i[-9:-6])

    # plot difference curves as sanity check 
    plot_curve(data_arr=diff_array, q_arr=q, labels=labs, qmin=0.03, qmax=0.15, imin=None, imax=None, x='scattering vector (Å$^{-1}$)',
                y='scattering intensity', 
                title='CH505 TR, T-Jump SAXS Difference Curves at ' + str(t), save=True, 
                save_dir='./OUTPUT/TUTORIAL4/PLOTS/', 
                save_name='tutorial4_ex2_' + str(t) + '_cleaned_files.png')  # include the time delay so each plot gets its own file

    # get avg curve
    avg_curve = diff_array.mean(axis=0)
    
    # get standard deviation
    curve_std = diff_array.std(axis=0)

    # get standard error
    curve_err = curve_std/math.sqrt(len(diff_array))
    
    # concat avg curve and sem into single array
    avg_sem = np.array([avg_curve, curve_err])

    make_dir('./OUTPUT/TUTORIAL4/AVERAGE_DIFF/')

    # save avg_curve and sem to file
    np.savetxt('./OUTPUT/TUTORIAL4/AVERAGE_DIFF/' + str(t) + '.csv', 
               np.c_[q, avg_curve, curve_err], delimiter=",")
    
    # plot all data
    print('Plotting data...')

    ax = plt.axes([0.125,0.125, 5, 5])

    plt.plot(q, avg_sem[0], linewidth=2)
    plt.fill_between(q, avg_sem[0]-avg_sem[1], avg_sem[0]+avg_sem[1], alpha=0.5)
    plt.xlabel(r'q ($\AA^{-1}$)', fontsize=70, fontweight='bold')  # raw string avoids an invalid '\A' escape
    plt.ylabel('Change in Scattering Intensity', fontsize=70, fontweight='bold')
    plt.title('CH505TF Average SAXS Difference Curve at ' + str(t), fontsize=80, fontweight='bold')
    plt.xticks(fontsize=60)
    plt.yticks(fontsize=60)

    for axis in ['top','bottom','left','right']:
        ax.spines[axis].set_linewidth(5)

    zoom = plt.axes([-5, 0.5, 4, 4])
    plt.plot(q, avg_sem[0], linewidth=2)
    plt.fill_between(q, avg_sem[0]-avg_sem[1], avg_sem[0]+avg_sem[1], alpha=0.5)
    # style plot
    plt.ylabel('Change in Scattering Intensity', fontsize=70, fontweight='bold')
    plt.xlabel(r'q ($\AA^{-1}$)', fontsize=70, fontweight='bold')  # raw string avoids an invalid '\A' escape
    plt.xticks(fontsize=55)
    plt.yticks(fontsize=55)
    plt.xlim([0.02, 0.1])
    plt.title('CH505TF Average SAXS Difference Curve at ' + str(t) + '\nZoom', fontsize=80, fontweight='bold')

    for axis in ['top','bottom','left','right']:
        zoom.spines[axis].set_linewidth(5)

    # mark inset
    mark_inset(ax, zoom, loc1=1, loc2=4, fc="none", ec="0.5", linewidth=4)


    plt.savefig('./OUTPUT/TUTORIAL4/PLOTS/' + str(t) + '_avg_diff.png', bbox_inches='tight')

    plt.show()
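
In the loop above, the average-curve CSV and plot are named from the time delay alone, and the 5us and 1ms delays each occur in more than one data set. Because `np.savetxt` and `plt.savefig` overwrite existing files, output from a later set replaces output from an earlier set with the same time delay. If that is not what you want, one option is to build the file name from both the set name and the time delay. The tag format in this sketch is an assumption.

```python
from collections import Counter

sets = ['20hz_set02', '20hz_set01', '20hz_set03', '5hz_set01']
time_delays = ['5us', '1ms', '5us', '1ms']  # subset of the pairs used above

# Time delays that occur in more than one data set
repeated = sorted(t for t, n in Counter(time_delays).items() if n > 1)
print('Repeated time delays:', repeated)

# Combining set name and time delay gives one distinct tag per pair
tags = [s + '_' + t for s, t in zip(sets, time_delays)]
print(tags)
```

A tag such as `20hz_set03_5us` could then replace `str(t)` in the output file names inside the loop.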


<div class="alert alert-block alert-success">
    
    <i class="fa fa-check-circle"></i>&nbsp; <b>Congratulations!</b><br>
    You completed the fourth tutorial! Continue with <a href="https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/blob/main/TUTORIALS/tutorial5_saxs_modeling.ipynb?ref_type=heads">Tutorial 5</a> to assess your data for systematic errors.

</div>