# Running TB Profiler on SRA Data

## Exploring on Your Own Computer

Code: [TBProfiler](https://github.com/jodyphelan/TBProfiler) 

Data: [SRA](https://www.ncbi.nlm.nih.gov/sra)

Suppose I want to explore using the TB Profiler tool with sample SRA data. I download a few sample data files on my computer, follow the installation instructions, and run a few test computations. I know there are hundreds of data files that I may want to analyze and I don't want to tie up my computer. I'm going to move these commands from my computer to an HTC system, using an Access Point and HTCondor. 

## Identifying Job Components

Based on our experience running on our own computer, these are the components that we need to account for or recreate for our jobs: 
* A single job corresponds to a single sample; we will start with **48 samples** (for now).
* **Software environment:** installing tb-profiler using conda
* Command format is: `tb-profiler profile -1 SRA####_1.fastq -2 SRA####_2.fastq -t 1 -p SRA####`
* **Input:** Pre-stage the needed fastq files or download in the job; file name convention is `SRA####_1.fastq` and `SRA####_2.fastq`
* **Output:** Output is in a "results" folder and has the naming convention: `SRA####.results.json`
* **Compute Resources:** 1 core, 4GB of disk space, unknown memory
* **Time:** one job takes a few minutes to run
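
The naming conventions in this list can be captured in a few shell variables. This is a minimal sketch: `SRR0000001` is a hypothetical placeholder accession, not one of the tutorial's samples, and the command is only printed, not run.

```shell
# Hypothetical accession used only for illustration
sample="SRR0000001"

# File names implied by the conventions listed above
read1="${sample}_1.fastq"
read2="${sample}_2.fastq"
result="results/${sample}.results.json"

# Print (rather than run) the command for this sample
cmd="tb-profiler profile -1 ${read1} -2 ${read2} -t 1 -p ${sample}"
echo "$cmd"
echo "expected output: ${result}"
```

Keeping the accession in one variable is what later lets a single script and submit file serve every sample.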


## Prepare and Test Software Environment

> [These training materials](https://portal.osg-htc.org/documentation/support_and_training/training/osgusertraining/#using-containerized-software-on-the-open-science-pool) provide a nice introduction to containers in the OSPool. 

For this job, we will recreate our conda environment in a container. 

Containers can be created (or "built") from a definition file. The definition file below includes some standard configuration for setting up a conda environment. It has been customized in the `%post` section where the specific tb-profiler installation commands are inserted (directly from the [installation instructions](https://github.com/jodyphelan/TBProfiler#conda)). 

In [None]:
cat ../build/tb-profiler-sra.def
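
For readers without the repository at hand, here is a rough sketch of what a definition file of this kind might look like. It is not the tutorial's actual file: the base image, the channel list, and the inclusion of `sra-tools` are assumptions, and the file name `example-tb-profiler.def` is hypothetical.

```shell
# Write an illustrative definition file (NOT the tutorial's actual file)
cat > example-tb-profiler.def <<'EOF'
Bootstrap: docker
From: continuumio/miniconda3:latest

%post
    # Assumed install command; check the TBProfiler instructions for the current form
    conda install -y -c conda-forge -c bioconda tb-profiler sra-tools
EOF

# Show the header lines and section markers that were written
grep -E '^(Bootstrap|From|%)' example-tb-profiler.def
```

The `%post` section is where software installation happens during the build, which is why the tb-profiler commands live there.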

To build the container in the `build` folder, you would run this sequence of commands: 

```
cd ../build
apptainer build tb-profiler-sra.sif tb-profiler-sra.def
cd ../with-sra
```

You do not need to build the container to participate in the tutorial (it takes a long time to build!), so we have a copy pre-staged in a public location that can be used for submitting jobs. 

If you did not build the container yourself, the following command downloads the pre-staged copy into the `build` folder:


In [None]:
stashcp /osgconnect/public/osg/tutorial-tb-profiler/tb-profiler-sra.sif ../build/

To test the container, run these commands from the command line:
```
# to be run on the command line, after building or downloading the container
# start the container with a shell
apptainer shell ../build/tb-profiler-sra.sif
# run the tb-profiler command to see if it is in the container
tb-profiler
# see where the tb-profiler program is installed in the container
which tb-profiler
# exit the container
exit
```
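
The same check can also be done without opening an interactive shell. The sketch below assumes the container sits at `../build/tb-profiler-sra.sif` and skips the check if `apptainer` or the image is not available on the current machine.

```shell
sif="../build/tb-profiler-sra.sif"

if command -v apptainer >/dev/null 2>&1 && [ -f "$sif" ]; then
    # Run a single command inside the container and record whether it succeeded
    if apptainer exec "$sif" which tb-profiler; then
        check="ok"
    else
        check="failed"
    fi
else
    # apptainer or the image file is missing here
    check="skipped"
fi
echo "container check: $check"
```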

## Stage Inputs

First, how will we fetch the **input** data for our jobs? We have a few options: 
1) download the data directly from NCBI in the job, using `sra-toolkit`
2) upload inputs to our home folder on an Access Point, use HTCondor's default file transfer to fetch the inputs to jobs
3) upload the inputs to an OSDF folder, use OSDF URLs to fetch inputs to jobs

We generally recommend the last two options because the transfers are visible to, and managed by, HTCondor. In this example, however, we'll use the SRA toolkit to download the data inside each job. 

One thing we can do now to make job submission easier later is to generate a list of the input samples (here, SRA accession IDs). 

In [None]:
head -n 10 ../SraAccList.csv
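
Before using a list like this to drive job submission, a quick sanity check can catch stray header lines, blank lines, or duplicates. The sketch below works on a small stand-in file with hypothetical accessions; to check the real list, point `list` at `../SraAccList.csv`. It assumes one run accession per line, in the usual `SRR`/`ERR`/`DRR` plus digits form.

```shell
# Small stand-in list with hypothetical accessions
list="example-acc-list.csv"
printf '%s\n' SRR0000001 SRR0000002 SRR0000003 > "$list"

# Count entries, lines that do not look like run accessions, and repeated entries
total=$(wc -l < "$list" | tr -d ' ')
malformed=$(grep -Evc '^[SED]RR[0-9]+$' "$list" || true)
duplicates=$(sort "$list" | uniq -d | wc -l | tr -d ' ')

echo "entries=$total malformed=$malformed duplicates=$duplicates"
```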

## Organize Output Files

We've already sectioned off our software environment (container and definition files) into its own folder. Our input files will be downloaded inside each job. The remaining thing to think about is the outputs. Each sample produces a `.json` file, which we will put in an `outputs` folder, and each job produces log, error, and stdout files, which we will put in a `logs` folder. 

In [None]:
mkdir -p outputs

In [None]:
mkdir -p logs

## Writing an Executable

Because our job needs to do two things (fetch the data and then run the tb-profiler program), we'll write a shell script that does both steps and use it as our job's executable. To make it easier to scale up later, the SRA value is passed to the script as an argument. 

In [None]:
cat tb-profiler.sh
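
As a rough illustration of the two-step structure described above, a script of this kind might look like the sketch below. It is not the tutorial's `tb-profiler.sh`: the file name `example-tb-profiler.sh` is hypothetical, and the use of `fasterq-dump --split-files` from the SRA toolkit for the download step is an assumption. The snippet only writes the script and checks its syntax; it does not run the tools.

```shell
cat > example-tb-profiler.sh <<'EOF'
#!/bin/bash
# Illustrative only; the tutorial's tb-profiler.sh may differ.
set -euo pipefail

# The SRA run accession is the first argument
sra_id="$1"

# Step 1: fetch the paired-end reads (assumed to produce
# ${sra_id}_1.fastq and ${sra_id}_2.fastq in the working directory)
fasterq-dump --split-files "$sra_id"

# Step 2: run the profiler with the command format used locally
tb-profiler profile -1 "${sra_id}_1.fastq" -2 "${sra_id}_2.fastq" -t 1 -p "$sra_id"

# Clean up the downloaded reads once profiling is done
rm -f "${sra_id}_1.fastq" "${sra_id}_2.fastq"
EOF
chmod +x example-tb-profiler.sh

# Syntax-check the script without running it
bash -n example-tb-profiler.sh && echo "syntax ok"
```

With `set -euo pipefail`, a failed download stops the script before the profiler runs, so the job exits with a non-zero status instead of producing a misleading result.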

## Submit One Job

All this comes together in a submit file. The first half captures all the setup work we did in the previous steps: 
- invoking our container
- executing the tb-profiler script
- bringing the results files back from the job
- writing log and error information into a subfolder

In [None]:
cat tb-profiler-notebook.sub
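
For orientation, a submit file covering these pieces could look roughly like the sketch below. It is not the tutorial's `tb-profiler-notebook.sub`: the file name `example-tb-profiler.sub`, the container universe settings, the output remapping, the memory request, and the accession `SRR0000001` are all assumptions made for illustration. Only the variable name `sequence_read`, the 1 core, and the 4GB of disk come from this tutorial.

```shell
cat > example-tb-profiler.sub <<'EOF'
# Illustrative only; the tutorial's submit file may differ.

# Software environment: run the job inside the container
universe        = container
container_image = ../build/tb-profiler-sra.sif

# Executable and its argument (the SRA accession)
executable = tb-profiler.sh
arguments  = $(sequence_read)

# Bring the results file back and place it in the outputs folder
transfer_output_files  = results/$(sequence_read).results.json
transfer_output_remaps = "$(sequence_read).results.json = outputs/$(sequence_read).results.json"

# Log, stdout and stderr go into the logs folder
log    = logs/$(sequence_read).log
output = logs/$(sequence_read).out
error  = logs/$(sequence_read).err

# Resource requests; the memory value is a placeholder guess
request_cpus   = 1
request_memory = 4GB
request_disk   = 4GB

# Hypothetical accession for a single test job
sequence_read = SRR0000001
queue
EOF

# Show the lines that select the container and queue the job
grep -E '^(universe|container_image|queue)' example-tb-profiler.sub
```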

In [None]:
condor_submit tb-profiler-notebook.sub

In [None]:
condor_q

## Submit Multiple Jobs

To submit multiple jobs, we simply change the last lines of the submit file. Instead of setting `sequence_read` manually and submitting one job, the `queue ... from` syntax can be used to submit a job for each item in our list of SRA ID numbers: 
```
queue sequence_read from ../SraAccList.csv
```
Make this change in the submit file and then resubmit. 
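
To get a feel for what this does, the loop below roughly mimics the expansion in plain shell: each line of the list becomes the value of `sequence_read` for one job. It uses a stand-in list with hypothetical accessions and only prints what each job would run; HTCondor itself does the real work when the submit file is used.

```shell
# Stand-in list with hypothetical accessions
list="example-queue-list.csv"
printf '%s\n' SRR0000001 SRR0000002 SRR0000003 > "$list"

# One job per line: each line becomes the value of sequence_read
njobs=0
while IFS= read -r sequence_read; do
    [ -z "$sequence_read" ] && continue   # skip blank lines in this sketch
    echo "job ${njobs}: tb-profiler.sh ${sequence_read}"
    njobs=$((njobs + 1))
done < "$list"

echo "total jobs: ${njobs}"
```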

In [None]:
condor_submit tb-profiler-notebook.sub

In [None]:
condor_q
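
Once the jobs have left the queue, one way to confirm that every sample produced a results file is to compare the accession list against the outputs folder. The sketch below uses a stand-in list and a stand-in folder (`example-outputs`) with hypothetical accessions; for the real run, the list would be `../SraAccList.csv` and the folder `outputs`.

```shell
# Stand-in list and outputs folder with hypothetical accessions
list="example-check-list.csv"
printf '%s\n' SRR0000001 SRR0000002 SRR0000003 > "$list"
mkdir -p example-outputs

# Pretend two of the three jobs have returned results
touch example-outputs/SRR0000001.results.json example-outputs/SRR0000003.results.json

# Collect accessions whose results file is absent
missing=""
while IFS= read -r sra_id; do
    [ -f "example-outputs/${sra_id}.results.json" ] || missing="${missing}${sra_id} "
done < "$list"

echo "missing results: ${missing:-none}"
```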