# Bioinformatics Tutorial: Quality Assessment of Data with FastQC

The first step of most bioinformatic analyses is to assess the quality of the data you have received. In this example, we are working with real DNA sequencing data from a research project studying E. coli. We will use a common tool, FastQC, to assess the quality of the data.

## Step 1: Download data

First, we need to download the sequencing data that we want to analyze for our research project.

We have a script called `download_data.sh` that will download our bioinformatic data. Let's go ahead and run this script to download our data. 

In [None]:
./download_data.sh

Our sequencing data files, all ending in `.fastq`, can now be found in a folder called `data/`.
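
Before moving on, it can be worth confirming that the download produced what we expect. The sketch below is not part of the tutorial's scripts: `count_reads` and `summarize_fastq_dir` are hypothetical helper names, and the read count assumes the usual FASTQ layout of exactly four lines per read.

```shell
# Count reads in a FASTQ file, assuming four lines per read
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Print "<file name><TAB><read count>" for every .fastq file in a directory
summarize_fastq_dir() {
    local dir="${1:-data}"
    local f
    for f in "$dir"/*.fastq; do
        [ -e "$f" ] || continue   # the glob matched nothing
        printf '%s\t%s\n' "$(basename "$f")" "$(count_reads "$f")"
    done
}

summarize_fastq_dir data
```

A file that reports zero reads, or a directory that prints nothing at all, suggests the download did not complete.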

## Step 2: Install software

Now that we have our data, we need to install the software we want to use to analyze it. 

There are different ways to install and use software, including installing from source, using pre-compiled binaries, and using containers. In biology, many software packages are already available as pre-built containers. We can fetch one of these containers and have HTCondor set it up for our job, which means we do not have to install the FastQC software or its dependencies.

The container we will use is available on Docker Hub: staphb/fastqc

It is possible to convert this Docker container to an Apptainer container by first creating an Apptainer definition file:

In [3]:
ls software/

fastqc.def  fastqc.sif


In [4]:
cat software/fastqc.def

Bootstrap: docker
From: staphb/fastqc


We would then build the Apptainer container from the definition file with the command below (we won't run it here, but it is included for future reference):
`$ apptainer build software/fastqc.sif software/fastqc.def`

### A quick aside: test on your local machine
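
One way to test locally, sketched under the assumption that Apptainer is installed on your machine and that `software/fastqc.sif` has already been built, is to run the executable inside the container on a single file. `test_locally` is a hypothetical helper name, and the sample file is the one used in the submit file later in this tutorial.

```shell
# Hypothetical helper: run fastqc.sh on one file inside the container
test_locally() {
    local image="$1" fastq="$2"
    # Skip quietly when Apptainer is not available on this machine
    if ! command -v apptainer >/dev/null 2>&1; then
        echo "apptainer not found; skipping local test"
        return 0
    fi
    # Skip when the container image or the input file is missing
    if [ ! -f "$image" ] || [ ! -f "$fastq" ]; then
        echo "missing $image or $fastq; skipping local test"
        return 0
    fi
    apptainer exec "$image" bash fastqc.sh "$fastq"
}

test_locally software/fastqc.sif data/SRR2584863_1.trim.sub.fastq
```

FastQC writes its report next to the input file unless `-o` is given, so a local run like this would leave the report in `data/`.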

## Step 3: Prepare an executable

We need to create an executable to pass to our HTCondor jobs, so that HTCondor knows what to run on our behalf. 

Let's take a look at our executable, `fastqc.sh`:

In [5]:
cat fastqc.sh

#!/bin/bash
# Executable name: fastqc.sh

# Run FastQC to determine the quality of our raw .fastq sequencing data
fastqc $1
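
The script above passes its first argument straight to FastQC. As a sketch only, a slightly more defensive variant could check the argument first, so that a mistake shows up as a clear message in the job's error file. The function name `run_fastqc` and the exit codes are illustrative choices, not part of the tutorial's script.

```shell
#!/bin/bash
# Sketch of a more defensive fastqc.sh (illustrative, not the tutorial's script)

run_fastqc() {
    # Expect exactly one argument: the FASTQ file to analyze
    if [ "$#" -ne 1 ]; then
        echo "usage: fastqc.sh <file.fastq>" >&2
        return 2
    fi
    # Fail early if the input file was not transferred or is unreadable
    if [ ! -r "$1" ]; then
        echo "error: cannot read input file: $1" >&2
        return 1
    fi
    # Quote the argument so unusual file names are passed through intact
    fastqc "$1"
}
```

In a real script the last line would be `run_fastqc "$@"`, so that the script exits with the function's status.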


## Step 4: Prepare HTCondor submit file

Now we create our HTCondor submit file, which tells HTCondor what to run and how many resources to make available to our job:

In [14]:
cat fastqc.submit

# HTCondor Submit File: fastqc.submit

# Provide our executable and arguments
executable = fastqc.sh
arguments = SRR2584863_1.trim.sub.fastq

# Provide the container for our software
universe    = container
container_image = software/fastqc.sif

# List files that need to be transferred to the job
transfer_input_files = data/SRR2584863_1.trim.sub.fastq
should_transfer_files = YES

# Tell HTCondor to transfer output to our results/ directory
transfer_output_files = SRR2584863_1.trim.sub_fastqc.html
transfer_output_remaps = "SRR2584863_1.trim.sub_fastqc.html = results/SRR2584863_1.trim.sub_fastqc.html"

# Track job information
log = logs/fastqc.log
output = logs/fastqc.out
error = logs/fastqc.err

# Resource Requests
request_cpus = 1
request_memory = 1GB
request_disk = 1GB

# Tell HTCondor to run our job once:
queue 1
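
FastQC names its HTML report after the input file: it drops the `.fastq` extension and appends `_fastqc.html`. The helper below (`expected_report` is a hypothetical name) sketches that rule for plain `.fastq` inputs only, which can help when filling in `transfer_output_files`.

```shell
# Derive the report name FastQC would write for a plain .fastq input
expected_report() {
    local name
    name=$(basename "$1")        # drop any leading directory
    name="${name%.fastq}"        # drop the .fastq extension
    echo "${name}_fastqc.html"
}

expected_report data/SRR2584863_1.trim.sub.fastq
# prints SRR2584863_1.trim.sub_fastqc.html
```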


## Submit a Job

We are ready to submit our first job!

In [23]:
condor_submit fastqc.submit

Submitting job(s).

1 job(s) submitted to cluster 290010.


We can check on the status of our job in the queue using:

In [None]:
condor_q
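
`condor_q` only lists jobs that are still in the queue; once a job finishes, it disappears from that list. The job log named in the submit file records what happened. As a sketch, assuming the log path `logs/fastqc.log` from the submit file, the hypothetical helper `job_finished` below looks for the termination line that HTCondor writes to that log.

```shell
# Return success if the job log contains a "Job terminated" event
job_finished() {
    local log="$1"
    [ -f "$log" ] || return 1
    grep -q "Job terminated" "$log"
}

if job_finished logs/fastqc.log; then
    # A normal exit is recorded with the job's return value
    grep "return value" logs/fastqc.log || echo "no exit code recorded"
else
    echo "job has not finished yet (or the log does not exist)"
fi
```

A return value of 0 means the executable exited successfully. HTCondor also provides `condor_wait`, which blocks until the jobs recorded in a given log file have completed.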

## Check results

In [None]:
ls results/

## Step 5: Scale out

## 5.1 Create a list of all files we want analyzed

To queue a job to analyze each of our sequencing data files, we will take advantage of HTCondor's `queue` statement. First, let's create a list of files we want analyzed:

In [None]:
ls data/ > list_of_samples.txt

Let us take a look: 

In [None]:
cat list_of_samples.txt
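
Each line of this list becomes one job, so it is worth a quick check before submitting. The sketch below uses a hypothetical helper, `check_sample_list`, to count the entries and flag any that are blank or do not match a file in the data directory.

```shell
# Count list entries and report any that are blank or have no matching file
check_sample_list() {
    local list="$1" dir="$2"
    local sample total=0 bad=0
    while IFS= read -r sample || [ -n "$sample" ]; do
        total=$(( total + 1 ))
        if [ -z "$sample" ] || [ ! -f "$dir/$sample" ]; then
            echo "problem on line $total: '$sample'" >&2
            bad=$(( bad + 1 ))
        fi
    done < "$list"
    echo "$total entries, $bad problems"
    [ "$bad" -eq 0 ]
}

if [ -f list_of_samples.txt ]; then
    check_sample_list list_of_samples.txt data || echo "fix the list before submitting"
fi
```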

## 5.2 Create a submit file to queue a job to analyze each biological sample

HTCondor has different `queue` syntaxes to help researchers automatically queue many jobs. We will use `queue <variable> from <list.txt>` to queue a job for each of our samples in `list_of_samples.txt`.

The main changes to make to the submit file are replacing each occurrence of the sample file name with the `$(samples)` variable, and then iterating through our list of samples as shown in the `queue` statement at the end.

Once we define `<variable>`, we can also use it elsewhere in the submit file. 

In [23]:
cat many-fastqc.submit

# HTCondor Submit File: many-fastqc.submit

# Provide our executable and arguments
executable = fastqc.sh
arguments = $(samples)

# Provide the container for our software
universe    = container
container_image = software/fastqc.sif

# List files that need to be transferred to the job
transfer_input_files = data/$(samples)
should_transfer_files = YES

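# Tell HTCondor to transfer output to our results/ directory
# $Fn(samples) is the file name without its extension, matching FastQC's <name>_fastqc.html report
transfer_output_files = $Fn(samples)_fastqc.html
transfer_output_remaps = "$Fn(samples)_fastqc.html = results/$Fn(samples)_fastqc.html"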

# Track job information
log = $(samples).fastqc.log
output = $(samples).fastqc.out
error = $(samples).fastqc.err

# Resource Requests
request_cpus = 1
request_memory = 1GB
request_disk = 1GB

# Tell HTCondor to run a job for each sample in our list
queue samples from list_of_samples.txt
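
To see what `queue samples from list_of_samples.txt` does, it can help to expand the variable by hand. The loop below is only an illustration of the substitution, not something HTCondor runs: for each line in the list it prints the arguments and input file that the corresponding job would receive. `preview_jobs` is a hypothetical helper name, and the job numbers start at 0 in the same way as HTCondor's process IDs.

```shell
# Print, for each listed sample, what $(samples) would expand to in the submit file
preview_jobs() {
    local list="$1" samples n=0
    while IFS= read -r samples || [ -n "$samples" ]; do
        [ -n "$samples" ] || continue          # this sketch skips blank lines
        echo "job $n: arguments=$samples input=data/$samples"
        n=$(( n + 1 ))
    done < "$list"
}

if [ -f list_of_samples.txt ]; then
    preview_jobs list_of_samples.txt
fi
```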

And then submit many jobs using this single submit file!

In [None]:
condor_submit many-fastqc.submit

## 5.3 Check on jobs in the queue

We can check on the status of our multiple jobs in HTCondor's queue by using:

In [None]:
condor_q

## 5.4 Check results

In [None]:
ls results/
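
With many jobs, checking the directory listing by eye gets tedious. As a sketch, the hypothetical helper `missing_reports` below compares the sample list against the results directory and prints any sample that has no report. It assumes FastQC's usual naming, where `<name>.fastq` produces `<name>_fastqc.html`, and that the reports were transferred into `results/`.

```shell
# Print every listed sample whose FastQC report is not in the results directory
missing_reports() {
    local list="$1" dir="$2" sample report
    while IFS= read -r sample || [ -n "$sample" ]; do
        [ -n "$sample" ] || continue
        report="${sample%.fastq}_fastqc.html"
        [ -f "$dir/$report" ] || echo "$sample"
    done < "$list"
}

if [ -f list_of_samples.txt ]; then
    missing_reports list_of_samples.txt results
fi
```

No output means every sample in the list has a report; any names printed point to jobs whose logs and error files deserve a closer look.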
