# QC of FASTQ runs

This Jupyter notebook performs basic quality control (QC) analysis of reads for one or more SRA run accessions. Quality checks are computed by [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), and its output is parsed by Python functions imported into this notebook from `qc_functions`.

## 0. Setting up

In [1]:
from qc_functions import *
from IPython.display import display, Markdown

## TODO: check whether FastQC and fastq-dump are available,
## and download the binaries if they are not present
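
The availability check noted in the comments above could be sketched roughly as follows. This is a minimal sketch, not part of `qc_functions`: the helper name `find_missing_tools` is hypothetical, and it assumes both executables are reachable through `PATH`, whereas the bash cell later in this notebook uses hard-coded paths.

```python
import shutil


def find_missing_tools(tools=("fastq-dump", "fastqc")):
    """Return the names of the tools that cannot be found on PATH.

    Assumption: the executables are installed under these names.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```

If a tool is reported as missing, either install it or point the `FASTQ_DUMP` and `FASTQC` variables in the bash cell at its full path.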

## 1. Input SRA accessions

In the cell below, enter the SRA run accession numbers for which you would like to obtain QC results. Separate multiple run accessions with spaces.

In [2]:
sra_accs = "DRR042075   DRR019508"

## 2. Fetch reads from SRA and run FastQC

In this step, we will parse the list of SRA run accessions, download the first 10,000 spots of each run in FASTQ format from SRA, and run FastQC on those reads.

In [3]:
sra_accs = parse_sra_accs(sra_accs)
print(sra_accs)

DRR042075 DRR019508
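
The implementation of `parse_sra_accs` is not shown in this notebook. Judging only from the output above, it collapses the input into a single space-separated string. A minimal sketch of that behaviour, under assumptions, might look like the following. The name `normalize_run_accessions` is hypothetical, and the validation pattern (an `SRR`, `ERR` or `DRR` prefix followed by at least six digits) is an assumption about the run accession format rather than something taken from `qc_functions`.

```python
import re

# Assumption: run accessions are SRR/ERR/DRR followed by six or more digits.
RUN_ACC_PATTERN = re.compile(r"^[SED]RR\d{6,}$")


def normalize_run_accessions(raw):
    """Split on any whitespace, validate each token, rejoin with single spaces."""
    tokens = raw.split()
    invalid = [token for token in tokens if not RUN_ACC_PATTERN.match(token)]
    if invalid:
        raise ValueError("Not a valid run accession: " + ", ".join(invalid))
    return " ".join(tokens)


print(normalize_run_accessions("DRR042075   DRR019508"))
```

Returning a plain string keeps the value easy to pass to a `%%bash -s` cell, where the shell splits it back into individual accessions.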


In [4]:
%%bash -s "$sra_accs"

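## Paths to the executables; adjust these for your system
FASTQ_DUMP="/usr/bin/fastq-dump"
FASTQC="/home/vkkodali/FastQC/fastqc"

## $1 is the space-separated accession string passed in via -s
for acc in $1; do
    echo "Fetching ${acc} reads in FASTQ format from SRA..."
    ## -N 1 -X 10000 limits the download to spots 1 through 10000
    "$FASTQ_DUMP" -A "${acc}" --gzip -N 1 -X 10000

    echo "Running FASTQC on ${acc} reads..."
    ## --extract unzips the results; -q suppresses progress messages
    "$FASTQC" --extract -q "${acc}.fastq.gz"
done

## TODO: delete the fastq.gz files?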

Fetching DRR042075 reads in FASTQ format from SRA...
Read 10000 spots for DRR042075
Written 10000 spots for DRR042075
Running FASTQC on DRR042075 reads...
Fetching DRR019508 reads in FASTQ format from SRA...
Read 10000 spots for DRR019508
Written 10000 spots for DRR019508
Running FASTQC on DRR019508 reads...


## 3. Inspect QC data

In this step, we will parse the FastQC output and generate a table with pertinent data to inspect. Click on the 'Report' link in the table to view the entire FastQC report. 
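
As an illustration of the parsing involved: when run with `--extract`, FastQC writes a `summary.txt` file into each `<name>_fastqc` directory, with one tab-separated line per module (status, module name, file name). A minimal sketch of reading that format is shown below. The function name `parse_fastqc_summary` is hypothetical and is not claimed to match what `qc_functions` does, and the example text is made up for illustration rather than copied from a real run.

```python
def parse_fastqc_summary(text):
    """Map each FastQC module name to its status (PASS, WARN or FAIL)."""
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Each line is: status <TAB> module name <TAB> file name
        status, module, _filename = line.split("\t", 2)
        statuses[module] = status
    return statuses


# Illustrative input only; not the output of a real FastQC run.
example = (
    "PASS\tBasic Statistics\tsample.fastq.gz\n"
    "FAIL\tPer base sequence quality\tsample.fastq.gz\n"
    "WARN\tSequence Length Distribution\tsample.fastq.gz\n"
)
print(parse_fastqc_summary(example))
```

In practice, the text would be read from a file such as `DRR042075_fastqc/summary.txt` after the bash cell above has finished.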

### 3a. [Optional] Change default parameters
There are two parameters that you can change. We have picked sensible default cut-offs, but feel free to adjust them to your liking.

1. `qc_level` can be either 'fail' or 'warn'. By default, it is 'fail', meaning that only the metrics that have failed are listed in the table. If you choose 'warn', both failed metrics and those with warnings are listed.
2. `threshold` is the minimum acceptable quality score for a read. By default, it is 27. The table shows the percentage of reads whose overall quality is greater than `threshold`.
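
The effect of the two parameters can be sketched as follows. This is a hedged illustration, not the code behind `generate_results_table`: the helper names `flagged_metrics` and `percent_over_threshold` are hypothetical, the input dictionaries are made-up examples, and the strict "greater than" comparison is an assumption based on the description above. The `quality_counts` mapping is assumed to correspond to the "Per sequence quality scores" module of FastQC's `fastqc_data.txt` (quality score mapped to number of reads).

```python
def flagged_metrics(statuses, qc_level="fail"):
    """List the modules to report for the chosen qc_level ('fail' or 'warn')."""
    level = qc_level.lower()
    if level not in ("fail", "warn"):
        raise ValueError("qc_level must be 'fail' or 'warn'")
    # 'fail' reports failures only; 'warn' reports failures and warnings.
    wanted = {"FAIL"} if level == "fail" else {"FAIL", "WARN"}
    return [module for module, status in statuses.items() if status in wanted]


def percent_over_threshold(quality_counts, threshold=27):
    """Percentage of reads whose quality score is strictly above threshold."""
    total = sum(quality_counts.values())
    if total == 0:
        return 0.0
    over = sum(count for quality, count in quality_counts.items() if quality > threshold)
    return round(100.0 * over / total, 2)


# Made-up inputs for illustration only.
statuses = {"Per base sequence quality": "FAIL", "Adapter Content": "WARN", "Basic Statistics": "PASS"}
quality_counts = {20: 10, 27: 40, 30: 50}

print(flagged_metrics(statuses, "fail"))
print(flagged_metrics(statuses, "warn"))
print(percent_over_threshold(quality_counts, threshold=27))
```

With these inputs, reads scoring exactly 27 are not counted as being over the threshold, so half of the reads qualify.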

In [5]:
## Edit the following values if you want to change the defaults

qc_level = 'fail'  ## can be 'fail' or 'warn'
threshold = 27 ## must be a number below 40 

In [6]:
tbl_str = generate_results_table(sra_accs, qc_level, threshold)
display(Markdown(tbl_str))

|SRA Acc.|No. of reads|Read length|Percent GC|Poor qual reads|Failed metrics|Percent reads over threshold quality|FastQC Report|
|----|----|----|----|----|----|----|----|
|DRR042075|10000|287-301|48|0|Per base sequence quality<br>Per base sequence content<br>Per sequence GC content|99.71|<a href="./DRR042075_fastqc.html" target="_blank">Report</a> |
|DRR019508|10000|86-502|45|0|Per base sequence quality<br>Per sequence GC content|99.94|<a href="./DRR019508_fastqc.html" target="_blank">Report</a> |
