# Tutorial 1: File Handling for Time Resolved, Temperature-Jump SAXS Data Analysis

**Package Information:**<br>
Currently, the [tr_tjump_saxs](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/tree/main?ref_type=heads "tr_tjump_saxs") package only works through the Python 3 command line. The full dependencies can be found on the [Henderson GitLab Page](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/tree/main?ref_type=heads "tr_tjump_saxs"), and the environment can be cloned from the [environment.yml file](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/blob/main/environment.yml?ref_type=heads "environment.yml file"). The data analysis can be executed from an interactive Python command line such as [IPython](https://www.python.org/) or [Jupyter](https://jupyter.org/), or the code can be written in a script and run in non-interactive mode. The preferred usage is in Jupyter Lab because this is the environment in which the package was developed. A Jupyter notebook also keeps all code, code output, and notes in a single file, which serves as a record of the data analysis performed, the code used to conduct it, and its output.

**Tutorial Information:**<br>
This set of tutorial notebooks will cover how to use the `tr_tjump_saxs` package to analyze TR, T-Jump SAXS data and the <a href="https://www.science.org/doi/10.1126/sciadv.adj0396">workflow used to study HIV-1 Envelope glycoprotein dynamics. </a> This package contains multiple modules, each containing a set of functions to accomplish a specific subtask of the TR, T-Jump SAXS data analysis workflow. Many of the functions are modular and some can be helpful for analyzing static SAXS and other data sets as well. 

**Package Modules:**<br>
> 1. `file_handling`<br>
> 2. `saxs_processing`<br>
> 3. `saxs_qc`<br>
> 4. `saxs_kinetics`<br>
> 5. `saxs_modeling`<br>

**Developer:** [@ScientistAsh](https://github.com/ScientistAsh "ScientistAsh GitHub")

**Updated:** 6 February 2024

# Tutorial 1 Introduction
In this Tutorial 1 notebook, I introduce the `file_handling` module from the `tr_tjump_saxs` package. The `file_handling` module provides functions that will load or plot a single SAXS curve or a full set of SAXS curves as well as other functions to create file lists and directories for storing output data. If you find any issues with this tutorial, please create an issue on the repository GitLab page ([tr_tjump_saxs issues](https://gitlab.oit.duke.edu/tr_t-jump_saxs/y22-23/-/issues "tr_tjump_saxs Issues")).

## Module functions:
> `make_dir()` makes a new directory to store output. <br>
> `make_flist()` makes a list of files. <br>
> `load_saxs()` loads a single SAXS scattering or difference curve. <br>
> `load_set()` loads a set of SAXS scattering or difference curves. <br>
> `plot_curve()` plots a single SAXS scattering or difference curve or a set of them. <br>

## Tutorial Files:

### Data Files
The original data used in this analysis is deposited on the [SASBDB](https://www.sasbdb.org/) with accession numbers:
> **Static Data:** <br>
    - *CH505 Temperature Series*: SASDT29, SASDT39, SASDT49, SASDT59 <br>
    - *CH848 Temperature Series*: SASDTH9, SASDTJ9, SASDTK9, SASDTL9 <br>
<br>
> **T-Jump Data:** <br>
    - *CH505 T-Jump Data*: SASDT69, SASDT79, SASDT89, SASDT99, SASDTA9, SASDTB9, SASDTC9, SASDTD9, SASDTE9, SASDTF9, SASDTG9 <br>
     - *CH848 T-Jump Data*: SASDTM9, SASDTN9, SASDTP9, SASDTQ9, SASDTR9, SASDTS9, SASDTT9, SASDTU9, SASDTV9, SASDTW9 <br>
<br>
> **Static Env SOSIP Panel:** SASDTZ9, SASDU22, SASDU32, SASDU42, SASDTX9, SASDTY9 <br>

Additional MD data associated with the paper can be found on [Zenodo](https://zenodo.org/records/10451687).

### Output Files
Example output is included in the [OUTPUT](https://github.com/ScientistAsh/tr_tjump_saxs/tree/main/TUTORIALS/OUTPUT/) subdirectory in the [TUTORIALS](https://github.com/ScientistAsh/tr_tjump_saxs/tree/main/TUTORIALS/) directory.  

# How to Use Jupyter Notebooks
You can execute the code directly in this notebook or create your own notebook and copy the code there.


<div class="alert alert-block alert-info">
    
    <b><i class="fa fa-info-circle" aria-hidden="true"></i>&nbsp; Tips</b><br>
    
    <b>1.</b> To run the currently highlighted cell, hit the <code>shift</code> and <code>enter</code> keys at the same time.<br>
    <b>2.</b> To get help with a specific function, place the cursor inside the function's parentheses and hit the <code>shift</code> and <code>tab</code> keys at the same time.

</div>

<div class="alert alert-block alert-info" style="background-color: white; border: 2px solid; padding: 10px">
    <b><i class="fa fa-star" aria-hidden="true"></i>&nbsp; In the Literature</b><br>
    
    Our <a href="https://www.science.org/doi/10.1126/sciadv.adj0396">recent paper </a> in Science Advances provides an example of the type of data, the analysis procedure, and example output for this type of data analysis.  <br> 
    
    
</div>

# Import Modules

The first step is to import the necessary Python packages. The dependencies will automatically be imported with the package import.

In [None]:
# import sys so the module search path can be modified
import sys

# append the path to the tr_tjump_saxs package to sys.path
sys.path.append(r'../')

# import the file_handling functions along with their dependencies
from file_handling import *

<div class="alert alert-block alert-info">
    <b><i class="fa fa-info-circle" aria-hidden="true"></i>&nbsp; Tips</b><br>
    Be sure that the path for the <code>tr_tjump_saxs</code> package appended to the <code>PYTHONPATH</code> matches the path to the repository on your machine.
    </div>
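
The path check described in the tip can be sketched as follows. This is a minimal sketch, not part of the `tr_tjump_saxs` package: the helper name `add_package_path` is hypothetical, and it only verifies that the directory exists before extending `sys.path`, so a wrong path fails immediately with a clear message.

```python
import sys
from pathlib import Path


def add_package_path(path):
    """Hypothetical helper: append an existing directory to sys.path once."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_dir():
        raise FileNotFoundError(f"No such directory: {resolved}")
    if str(resolved) not in sys.path:
        sys.path.append(str(resolved))
    return resolved


# Example: the current directory always exists, so this call succeeds.
repo_dir = add_package_path(".")
print(repo_dir)
```

In practice you would pass the location of the repository on your machine (for example `'../'`, as in the cell above) instead of `"."`.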

# Load Data

## `load_saxs()` Function 

### Overview
After importing the analysis modules, we then need to load the data files. The `tr_tjump_saxs` package has 2 different options for loading SAXS data. You can load only one curve at a time with the `load_saxs()` function or you can load a full set of curves at one time with the `load_set()` function. First, we will look at how to load one curve with the `load_saxs()` function. 

The `load_saxs()` function is best suited for loading a single SAXS scattering or difference curve, and it automatically loads all columns. The `load_set()` function is best suited for loading a set of SAXS scattering or difference curves.

### Input Parameters
There are three input parameters for this function:
> 1. `file` indicates the file, including the full path, containing the SAXS curve. <br>
> 2. `delim` indicates the delimiter used in the input file. This parameter is optional and has the default value ' ' (space-delimited). <br>
> 3. `mask` indicates the number of rows to skip when loading the file. Use it to skip rows where a mask was applied to the data (so the curve contains NaN values) or to avoid importing string-type headers. Because this function loads the data as a `np.array`, and a `np.array` can only contain one data type, the function will raise an error if string-type headers are imported together with the SAXS scattering/difference curve. This parameter is optional and the default value is 0 (all rows imported).  <br>

### Returned Values
> 1. A numpy array with a shape determined by input data. 

### Raised Errors
There are no custom errors raised by this function. If any errors are raised, follow the docs for the function indicated in the traceback. 
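
To illustrate what the `delim` and `mask` parameters control, here is a minimal sketch that loads a small stand-in file with `np.loadtxt`. Using `np.loadtxt` is an assumption made for illustration and is not necessarily how `load_saxs()` is implemented; the file contents and the header text are made up.

```python
import os
import tempfile

import numpy as np

# Write a small stand-in file: two header lines followed by q and i columns.
text = "q i\nhypothetical header line\n0.01 5.0\n0.02 4.0\n0.03 3.0\n"
with tempfile.NamedTemporaryFile("w", suffix=".chi", delete=False) as handle:
    handle.write(text)
    path = handle.name

# skiprows plays the role of `mask`; delimiter plays the role of `delim`.
example_curve = np.loadtxt(path, delimiter=" ", skiprows=2)
os.remove(path)

print(example_curve.shape)   # (3, 2): three points, columns q and i
```

Without skipping the two header rows, the text lines could not be converted to numbers and the load would fail, which is the situation the `mask` parameter is meant to avoid.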

### Example 1: Basic usage

In [None]:
load_saxs(file='../../../TR_T-jump_SAXS_July2022/protein_20hz_set01/processedb/diff_protein_20hz_set01_1ms_118_-10us_118_Q.chi', 
          delim=' ', mask=10)

### Example 2: Store Returned Values as Variables

In [None]:
curve = load_saxs(file='../../../TR_T-jump_SAXS_July2022/protein_20hz_set01/processedb/diff_protein_20hz_set01_1ms_118_-10us_118_Q.chi', 
                  delim=' ', mask=10)

In [None]:
curve

## Slicing Returned Values
Once a curve is loaded, you will want to work with it, and slicing it correctly is important for conducting the correct analysis. In Python, the start of a slice is inclusive while the end is exclusive. So, if you want the row with index 50 to be the last row in the slice, then the slice must end at 51. The scattering vector (here referred to as q) is typically stored as the first column and the scattering intensity (i) as the second column. In numpy arrays, rows and columns are 0-indexed, meaning that the first column has index 0, the second column has index 1, and so on.
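
The inclusive start and exclusive end can be checked on a small synthetic array. The array below is made up for illustration and only mimics the two-column q and i layout of a loaded curve.

```python
import numpy as np

# Synthetic stand-in for a loaded curve: 100 rows, columns q and i.
q_demo = np.linspace(0.01, 1.0, 100)
i_demo = np.exp(-q_demo)
demo_curve = np.column_stack((q_demo, i_demo))

# Rows with indices 0 through 50: the stop value 51 is excluded.
subset = demo_curve[:51, :]
print(subset.shape)                         # (51, 2)
print(subset[-1, 0] == demo_curve[50, 0])   # True
```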

### Returning q values
Since the q values are stored in the first column, you can access this column by passing 0 into the columns dimension for numpy array slicing. We would like to select all rows so we can have the full set of scattering vectors. The syntax for numpy slicing is `array[row,col]`. 

In [None]:
# To select all rows for the first column containing the scattering vectors q
curve[:, 0]

In [None]:
# To select the first 10 rows
curve[:10, 0]

In [None]:
# To select the last 10 rows
curve[-10:, 0]

### Returning i values
Since the i values are stored in the second column, you can access this column by passing 1 into the column dimension for numpy slicing.

In [None]:
# Select the scattering intensity values
curve[:, 1]

In [None]:
# To select the first 10 rows
curve[:10, 1]

In [None]:
# To select the last 10 rows
curve[-10:, 1]

### Selecting a Specific Point

In [None]:
# this gives an i value
curve[500,1]

In [None]:
# this gives a q value
curve[500, 0]

In [None]:
# this gives both q and i values
curve[500, :]

### Selecting a range of points

In [None]:
# both q and i values for the first 10 points
curve[:10, :]

In [None]:
# rows with indices 50 through 59 (index 60 is excluded)
curve[50:60, :]

In [None]:
# Both q and i values for every 20th point from index 50 up to (but not including) 500
curve[50:500:20, :]

<br>

## Checking Size of Array
It is a good idea to check the size of your data to help familiarize yourself with the data structure. There are several different ways to check the size of an array.

### Get the length of the array
This will give you the number of rows in the array. In the case of the `curve` array defined above, it will tell us how many points are in one curve. 

In [None]:
len(curve)

### Get the shape of the array
This will tell you how many dimensions the array has and how many rows and columns it contains. In the case of the 2D `curve` array, there are 1908 rows, each representing one point along the scattering vector, and 2 columns, one containing the scattering vector and one containing the scattering intensity.

In [None]:
curve.shape

### Get the size of the array
This will tell you how many entries are in the entire array. For arrays with two or more dimensions, the reported size is the number of entries in the flattened array (for a 2D array, rows multiplied by columns).

In [None]:
curve.size
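
The three size checks are related to each other, which the following sketch shows on a synthetic array of zeros with the same shape as the example curve (1908 rows and 2 columns). The array is a stand-in and contains no real data.

```python
import numpy as np

# Synthetic array with the same layout as the example curve.
demo = np.zeros((1908, 2))

n_rows, n_cols = demo.shape
print(len(demo))    # 1908, the number of rows
print(demo.ndim)    # 2, the number of dimensions
print(demo.size)    # 3816, rows multiplied by columns
```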

<div class="alert alert-block alert-warning">
    
    <i class="fa fa-exclamation-triangle"></i>&nbsp; <b>Check your data structure</b><br>
    Your data may be stored differently and it is important to make sure you understand your data structure before beginning any analysis. It is always a good idea to practice slicing on your data set to be sure you understand the data structure once it is loaded.
    </div>


## `load_set()` Function

### Overview

This function is best suited for loading a set of SAXS scattering or difference curves. 

### Input Parameters
There are four input parameters for this function:
> 1. `flist` indicates the file list containing the curves to be loaded. File names should include the full path of the file.<br>
> 2. `delim` indicates the delimiter used in the input file. This parameter is optional and has the default value ' ' (space-delimited).<br>
> 3. `mask` indicates the number of rows to skip when loading the file. Use it to skip rows where a mask was applied to the data (so the curve contains NaN values) or to avoid importing string-type headers. Because this function loads the data as a `np.array`, and a `np.array` can only contain one data type, the function will raise an error if string-type headers are imported together with the SAXS scattering/difference curve. This parameter is optional and the default value is 0 (all rows imported). <br>
> 4. `err` is a boolean indicating whether the files contain a column of measured errors. When set to `True`, errors will be loaded into the returned array. This parameter is optional and the default value is `False`. <br>

### Returned Values
> 1. `data`: A list containing the scattering intensity (i) vector for each loaded curve. The curves are loaded in the same order as they appear in `flist`. <br>
> 2. `data_arr`: A numpy array containing the scattering intensity (i) values of the loaded curves. The array has shape (n, r), where n is the number of scattering curves loaded and r is the number of points in each curve, so each row holds the intensities of one curve. The curves are loaded in the same order as they appear in `flist`. <br>
> 3. `q`: A numpy array containing scattering vector (q) values. <br>  
> 4. `error`: A numpy array containing the experimental error for scattering intensity (i). Will be an empty array if there is no error column in the imported files. <br>
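
As a sketch of how a list of curves relates to an array of shape (n, r), the example below stacks three synthetic intensity vectors that share one q grid. The values and variable names are made up, and this is not the package's own code.

```python
import numpy as np

# Three synthetic curves sharing one q grid of five points.
q_grid = np.linspace(0.02, 0.10, 5)
curve_list = [scale * np.exp(-q_grid) for scale in (1.0, 2.0, 3.0)]  # like `data`

# Stack the list into a 2D array, one curve per row (like `data_arr`).
curve_arr = np.vstack(curve_list)
print(curve_arr.shape)   # (3, 5): n = 3 curves, r = 5 points each
```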
        
### Raised Errors
In addition to errors raised by the dependencies (in that case, see the documentation for the function indicated in the traceback), this function also raises an `IndexError` when a column of error values is indicated but does not exist in the given files. When this happens, the function automatically changes the `err` parameter to `False` and loads the first 2 columns.
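
The fallback described above could look roughly like the sketch below. The helper `split_columns` is hypothetical and only illustrates the idea of catching an `IndexError` for a missing third column; it is not the implementation used by `load_set()`.

```python
import numpy as np


def split_columns(table, err=False):
    """Hypothetical helper: return (i, error) from a q/i[/error] table."""
    if err:
        try:
            return table[:, 1], table[:, 2]
        except IndexError:
            # No third column: fall back to loading without errors.
            print("No error column found; continuing with err=False")
    return table[:, 1], np.array([])


# A two-column table has no error column, so the fallback is used.
two_col = np.array([[0.01, 5.0], [0.02, 4.0]])
intensity, error = split_columns(two_col, err=True)
print(intensity, error.size)
```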

## `make_flist()` Function

### Overview
The first step in loading a set of SAXS curves is to create a file list. This can be done manually, which is usually easiest if your list only contains one or two files. Alternatively, there is the `make_flist()` function, which automatically generates a file list given an input directory, prefix, and suffix. This function is ideal for loading a set of curves, usually the reduced scattering curves, that are all in the same folder. It is not useful for creating lists of files located in different folders; such a file list would have to be created using alternative methods.

### Input Parameters
> 1. The `directory` parameter indicates the directory where the scattering curves are stored and is an optional parameter with the default value being the current working directory. <br>
> 2. The `prefix` parameter indicates the prefix of the files to be loaded. This parameter is useful when you only want to load a subset of files from the given directory. When set to None, the files will not be filtered by prefix. The `prefix` parameter is optional and has the default value of `None`.<br>
> 3. The `suffix` parameter indicates the file suffix to filter by. If set to None then no suffix filters are applied to the files. `suffix` is an optional parameter and the default value is `None`. If both `prefix` and `suffix` are `None` then all files in the given directory will be loaded. 

### Returned Values
This function returns a list of the files that match the given filters. 

### Raised Errors
This function raises no custom errors. For any errors raised, see the documentation for the function indicated in the traceback. 
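
The prefix and suffix filtering can be sketched with the standard library. The function `list_matching` below is a hypothetical stand-in written for illustration, not the code of `make_flist()`, and the file names are invented.

```python
import os
import tempfile


def list_matching(directory=".", prefix=None, suffix=None):
    """Hypothetical stand-in: full paths of files matching prefix and suffix."""
    matches = []
    for name in os.listdir(directory):
        if prefix is not None and not name.startswith(prefix):
            continue
        if suffix is not None and not name.endswith(suffix):
            continue
        full = os.path.join(directory, name)
        if os.path.isfile(full):
            matches.append(full)
    return matches


# Create three empty files in a temporary directory and filter them.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("diff_a_1ms_1_Q.chi", "diff_a_10us_1_Q.chi", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = list_matching(tmp, prefix="diff_a_1ms_", suffix="_Q.chi")
    found_names = [os.path.basename(path) for path in found]

print(found_names)   # ['diff_a_1ms_1_Q.chi']
```

Including the trailing underscore in the prefix keeps the filter from matching longer labels that merely start with the same characters.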

## Example 3: Basic Usage Loading Files
### Make File List

In [None]:
# first, make a file list
files = make_flist(directory='../../../TR_T-jump_SAXS_July2022/protein_20hz_set01/processedb/',
                  prefix='diff_protein_20hz_set01_1ms_', suffix='_Q.chi')

<br>
This function automatically reports the number of files loaded. It is good to double check and make sure that the correct files are loaded. For this set, all 1ms curves from the directory passed to the `make_flist` function should be loaded. 

In [None]:
# show first 5 files in file list
files[:5]

<br>

### Sorting Files
As you can see, 250 1ms files are loaded from the indicated directory (you can view all files by removing the slicing). Also note that the files are not necessarily in order after loading them. Functions that require sorted files for the analysis will conduct that sorting automatically. If you would like to sort the files yourself, you can use the list `sort()` method.

In [None]:
files.sort()
files[:5]
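
Note that `sort()` orders strings character by character, so a name containing `10` sorts before one containing `2`. If numeric order matters for your file names, one option is a natural-sort key, sketched below with invented file names. The helper `natural_key` is hypothetical and not part of the package.

```python
import re

names = ["set01_1ms_10_Q.chi", "set01_1ms_2_Q.chi", "set01_1ms_1_Q.chi"]


def natural_key(name):
    """Split a name into text and integer pieces so 2 sorts before 10."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


lexicographic = sorted(names)
natural = sorted(names, key=natural_key)
print(lexicographic)
print(natural)
```

This key assumes all names follow the same pattern of text and numbers, as files from one data set usually do.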

### Load File List
For the example data set, the data is space-delimited, a mask is applied to the first 10 points, and there are no errors stored in the files. 

In [None]:
data, data_arr, q, err = load_set(flist=files, delim=' ', mask=10, err=False)

#### `data` list
The `load_set` function returns one list and three arrays. `data` is returned as a list in which each element is an array of the scattering intensities from one curve. 

In [None]:
# show first 5 rows of data
data[:5]

#### `data_arr` array
`data_arr` is the array with all of the scattering intensity values from each curve loaded as a row. This is the primary array that analysis will be conducted on going forward. 

In [None]:
data_arr[:5]

#### `q` array
The `q` array contains all the q values. For scattering curves collected during the same experiment, the scattering vectors should be the same for every curve, which is why only one q array is loaded for all 250 curves in this example. If your curves have different q values, you would need to construct those q arrays separately and store them as different variables to access them. 

In [None]:
q[:5]
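
If you want to confirm that a set of curves really shares one q grid, a check along the following lines can help. This is a sketch on synthetic q grids; with real data you would compare the first column of each curve loaded with `load_saxs()`.

```python
import numpy as np

# Synthetic q grids standing in for the first column of three loaded curves.
q_ref = np.linspace(0.02, 2.5, 50)
q_grids = [q_ref.copy(), q_ref.copy(), q_ref.copy()]

# True only when every curve uses the same q grid as the first one.
same_grid = all(np.allclose(q_ref, grid) for grid in q_grids)
print(same_grid)   # True
```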

#### `err` array
The `err` array contains the errors for each curve loaded as a row. This array will be empty if `err=False` is passed to the `load_set()` function.

In [None]:
err

## Example 4: Looping over Multiple Time Delays
When analyzing time-resolved SAXS data, it is very helpful to be able to loop over multiple data sets that represent different time points at one time. Depending on memory limitations, you can loop over file lists or you can loop over the already loaded array elements. 

In [None]:
# define time delays as a list of strings
times = ['10us', '50us', '100us', '500us', '1ms']

# loop over time delays, make a file list, and load the curve set
for t in times:
    print('Loading ' + str(t) + ' curves')
    files = make_flist(directory='../../../TR_T-jump_SAXS_July2022/protein_20hz_set01/processedb/',
                  prefix='diff_protein_20hz_set01_' + str(t), suffix='_Q.chi')
    data, data_arr, q, err = load_set(flist=files, delim=' ', mask=10, err=False)
print('Done loading data!')
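
In the loop above, `data`, `data_arr`, `q`, and `err` are overwritten on every pass, so only the last time delay remains once the loop finishes. One way to keep every time delay is to store each array in a dictionary keyed by the time label. The sketch below uses a made-up `fake_load()` function in place of `load_set()` so that it runs without the data files.

```python
import numpy as np

times_demo = ['10us', '50us', '100us']
rng = np.random.default_rng(0)


def fake_load(n_curves=4, n_points=6):
    """Stand-in for load_set(): returns a random (n_curves, n_points) array."""
    return rng.normal(size=(n_curves, n_points))


# Keep each time delay's array instead of overwriting it on the next pass.
curves_by_time = {}
for t in times_demo:
    curves_by_time[t] = fake_load()

print(list(curves_by_time))
print(curves_by_time['50us'].shape)   # (4, 6)
```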

## Example 5: Looping Over Data Arrays
For some analyses such as bootstrapping (covered in later tutorials), loading files over and over again can cause issues with memory handling. An alternative to looping over file lists is to load all the data to analyze, and then loop over the arrays. 

In [None]:
# Create an empty list to store the loaded arrays, one per time delay
all_data = []

for t in times:
    
    # make a file list of the difference curves for this time delay
    files = make_flist(directory='../../../TR_T-jump_SAXS_July2022/protein_20hz_set01/processedb/',
                  prefix='diff_protein_20hz_set01_' + str(t), suffix='_Q.chi')
    
    # report the number of files found for this time delay
    print('Number of ' + str(t) + ' files loaded: ' + str(len(files)))
    
    # sort files
    files.sort()
        
    # load difference curves
    data, array, q, err = load_set(flist=files, delim=' ', mask=10, err=False)
        
    # append the array for this time delay to all_data
    all_data.append(array)

<br>

The result is a list of arrays, with each array representing a time delay and each row within each array representing a single curve in that time delay. The length of the `times` list should match the length of the `all_data` list.

In [None]:
# look at data structure
all_data

In [None]:
# check length of all_data
len(all_data)

In [None]:
#check length of times
len(times)

<br>

To access a specific time delay in the `all_data` list, use basic indexing. 

In [None]:
# get the first time delay
all_data[0]

<br>

To access a specific curve in the `all_data` list, add a second index. 

In [None]:
all_data[0][0]
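
The nested structure can be explored on a small synthetic example. The list below mimics `all_data` with two made-up time delays and is only meant to show how the indices chain together.

```python
import numpy as np

times_demo = ['10us', '1ms']
# Two synthetic time delays: 3 curves of 4 points, then 2 curves of 4 points.
all_demo = [np.arange(12.0).reshape(3, 4), np.arange(8.0).reshape(2, 4)]

first_delay = all_demo[0]        # all curves for '10us'
first_curve = all_demo[0][0]     # first curve of that delay
one_point = all_demo[0][0][2]    # third point of that curve

# Pair each label with its array to summarize the structure.
for label, arr in zip(times_demo, all_demo):
    print(label, arr.shape)
```

Pairing `times` with `all_data` through `zip()` is a convenient way to keep labels and arrays together when looping.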

<div class="alert alert-block alert-success">
    
    <i class="fa fa-check-circle"></i>&nbsp; <b>Congratulations!</b><br>
       You completed the first tutorial! Continue with <a href="https://github.com/ScientistAsh/tr_tjump_saxs/blob/main/TUTORIALS/tutorial2_plotting.ipynb">Tutorial 2</a> to learn about visualizing SAXS curves.
    </div>