# Running TB Profiler on SRA Data

## Exploring On Own Computer

[TBProfiler](https://github.com/jodyphelan/TBProfiler)

[SRA](https://www.ncbi.nlm.nih.gov/sra)

To explore, I download a few sample data files from SRA onto my computer, follow the TBProfiler installation instructions, and run a few test computations. There are hundreds of data files that I may want to analyze, and I don't want to tie up my computer, so I'm going to move these commands to an HTC system, using an Access Point and HTCondor.

## Identifying Job Components

Based on our experience running on our own computer, we can see the following components that we need to account for or recreate for our jobs: 
* **Jobs:** a single job corresponds to a single sample; we will start with **48 samples** (for now)
* **Input:** Pre-stage the needed fastq files or download in the job; file name convention is `SRA####_1.fastq` and `SRA####_2.fastq`
* **Software environment:** tb-profiler, installed using conda
* **Command:** `tb-profiler profile -1 SRA####_1.fastq -2 SRA####_2.fastq -t 1 -p SRA####`
* **Output:** Output is in a "results" folder and has the naming convention: `SRA####.results.json`
* **Compute Resources:** 1 core, 4GB of disk space, unknown memory
* **Time:** one job takes a few minutes to run
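
As a quick check of the naming convention above, the command for any one sample can be generated from its accession alone. The sketch below only prints the command and does not run tb-profiler; `build_profile_cmd` is a hypothetical helper name, not part of tb-profiler.

```shell
# Hypothetical helper: print the tb-profiler command for one sample,
# following the SRA####_1.fastq / SRA####_2.fastq naming convention.
build_profile_cmd() {
    sample="$1"
    printf 'tb-profiler profile -1 %s_1.fastq -2 %s_2.fastq -t 1 -p %s\n' \
        "$sample" "$sample" "$sample"
}

# Print the command for a single accession
build_profile_cmd SRR18715196
```

Because the accession is the only thing that changes between samples, it is the natural variable to hand to each job.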


## Stage Inputs

First, how will we fetch the **input** data to our jobs? We have a few options: 
1) download the data directly from NCBI in the job, using `sra-toolkit`
2) upload inputs to our home folder on an Access Point, use HTCondor's default file transfer to fetch the inputs to jobs
3) upload the inputs to an OSDF folder, use OSDF URLs to fetch inputs to jobs

We generally recommend the last two options because HTCondor manages the transfers, which makes them easier to track. In this example, the inputs are large (around 1GB or larger) and likely to be reused, so option 3 is the best choice. For this example, we've pre-staged the data in a public OSDF folder, accessible via the path `/osgconnect/public/osg/tutorial-tb-profiler/tb-sra-files/`.

In [17]:
stashcp --list-dir /osgconnect/public/osg/tutorial-tb-profiler/tb-sra-files .

To make job submission easier later, we can generate a list of the input samples now. Here the list is `SraAccList.csv`, with one SRA accession per line.

In [18]:
head -n 10 SraAccList.csv

SRR18715196
SRR18715198
SRR18715199
SRR18715200
SRR18715201
SRR18715202
SRR18715203
SRR18715204
SRR18715205
SRR18715206
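
If you need to build such a list yourself, one possible approach is to derive it from the fastq file names. This is a minimal sketch that assumes the `_1.fastq` files sit together in a local directory; `list_accessions` is a hypothetical helper name, and this is not necessarily how `SraAccList.csv` was produced.

```shell
# Hypothetical helper: print one accession per line for every
# *_1.fastq file in the given directory (default: current directory).
list_accessions() {
    dir="${1:-.}"
    for f in "$dir"/*_1.fastq; do
        [ -e "$f" ] || continue      # skip if the glob matched nothing
        basename "$f" _1.fastq       # strip the directory and the suffix
    done | sort -u
}

# Example use (the path is a placeholder):
# list_accessions /path/to/fastq/files > SraAccList.csv
```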


## Prepare and Test Software Environment

> [These training materials](https://portal.osg-htc.org/documentation/support_and_training/training/osgusertraining/#using-containerized-software-on-the-open-science-pool) provide a nice introduction to containers in the OSPool. 

For this job, we will recreate our conda environment in a container. 

Containers can be created (or "built") from a definition file. The definition file below includes some standard configuration for setting up a conda environment. It has been customized in the `%post` section where the specific tb-profiler installation commands are inserted (directly from the [installation instructions](https://github.com/jodyphelan/TBProfiler#conda)). 

In [19]:
cat build/tb-profiler.def

Bootstrap: docker
From: continuumio/miniconda3:23.3.1-0

%environment
  export PATH=/opt/conda/bin:$PATH
  . /opt/conda/etc/profile.d/conda.sh
  conda activate

%post
  conda config --add channels defaults
  conda config --add channels bioconda
  conda config --add channels conda-forge
  conda install -y -c bioconda tb-profiler


To build the container in the `build` folder, you would run this sequence of commands: 

```
cd build
apptainer build tb-profiler.sif tb-profiler.def
cd ..
```

You do not need to build the container to participate in the tutorial (it takes a long time to build!), so we have a copy pre-staged in a public location that can be used for submitting jobs. 

Similarly, once the container is built, the following sequence of commands can be used to explore/test the container: 

```
##  to be run in the command line: 
# download the container if you didn't build a local copy
stashcp /ospool/PROTECTED/christina.koch/singularity_imgs/tb-profiler-test.sif build/tb-profiler.sif
# start the container with a shell
apptainer shell build/tb-profiler.sif
# run the tb-profiler command to see if it is in the container
tb-profiler
# see where the tb-profiler program is installed in the container
which tb-profiler
# exit the container
exit
```

## Organize Files

We've already sectioned off our software environment (container and definition files) into its own folder. Our input files are staged in the OSDF. That leaves the outputs. Each sample produces a `.json` file, which we will put in an `outputs` folder, and the jobs produce log, error, and stdout files, which we will put in a `logs` folder.

In [20]:
mkdir -p outputs

In [21]:
mkdir -p logs

## Submit One Job

In [25]:
cat tb-profiler.sub

universe = container
container_image = osdf:///osgconnect/public/osg/tutorial-tb-profiler/tb-profiler-test.sif
transfer_executable = false

executable = /opt/conda/bin/tb-profiler
arguments = profile -1 $(sequence_read)_1.fastq -2 $(sequence_read)_2.fastq -t 1 -p $(sequence_read)

data_path = osdf:///osgconnect/public/osg/tutorial-tb-profiler/tb-sra-files/$(sequence_read)
transfer_input_files = $(data_path)/$(sequence_read)_1.fastq, $(data_path)/$(sequence_read)_2.fastq

output_file = $(sequence_read).results.json
transfer_output_files = results/$(output_file)
transfer_output_remaps = "$(output_file) = outputs/$(output_file)"

log = logs/tb-profiler.log
error = logs/$(sequence_read).$(Cluster).$(Process).err
output = logs/$(sequence_read).$(Cluster).$(Process).out

requirements = (Target.has_avx == true)

request_cpus = 1
request_memory = 4GB
request_disk = 4GB

sequence_read = SRR18714896
queue
#queue sequence_read from SraAccList.csv


In [14]:
condor_submit tb-profiler.sub

Submitting job(s).
1 job(s) submitted to cluster 192393.


In [28]:
condor_q



-- Schedd: ap40.uw.osg-htc.org : <128.105.68.92:9618?... @ 07/11/23 22:28:28
OWNER        BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
christina.ko ID: 192393   7/11 22:22      _      _      1      1 192393.0

Total for query: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended 
Total for christina.koch: 1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended 
Total for all users: 8240 jobs; 0 completed, 0 removed, 1043 idle, 6572 running, 625 held, 0 suspended



## Submit Multiple Jobs

To run every sample in the list, edit `tb-profiler.sub`: comment out the `sequence_read = SRR18714896` and `queue` lines, and uncomment `queue sequence_read from SraAccList.csv`. HTCondor then creates one job per line of the file, setting `$(sequence_read)` to that line's accession.

In [None]:
condor_submit tb-profiler.sub

In [None]:
condor_q
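
Once the jobs have left the queue, it can be useful to check that every sample produced a result. The sketch below assumes the `outputs/<accession>.results.json` layout set up earlier; `missing_outputs` is a hypothetical helper name, not an HTCondor command.

```shell
# Hypothetical helper: print every accession in the list file whose
# results file is missing or empty in the outputs folder.
missing_outputs() {
    list="${1:-SraAccList.csv}"
    outdir="${2:-outputs}"
    while IFS= read -r sample || [ -n "$sample" ]; do
        [ -n "$sample" ] || continue    # skip blank lines
        [ -s "$outdir/$sample.results.json" ] || echo "$sample"
    done < "$list"
}

# Example use:
# missing_outputs SraAccList.csv outputs > rerun.csv
```

A list written this way has the same one-accession-per-line shape as `SraAccList.csv`, so it could be used in a `queue sequence_read from` statement to resubmit only the missing samples.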