# <u>RNA-Seq Analysis Phase I: Quality Control</u>
## <i>This Notebook Performs QC on Paired-End RNA-Seq Data</i>
#### Last Revision: July 2017
#### Authors: Charles David and Dan Jones
#### This analysis was performed by Dan Jones and Karmun Chooi

## The Raw Data files are located on PowerPlant in the following location:


```
/input/genomic/viral/metagenomic/170621_150PE_HS4K2A/Almeida
```

In [26]:
# SANITY CHECK
ls -s /input/genomic/viral/metagenomic/170621_150PE_HS4K2A/Almeida

total 57503130
4437818 RACP005_11_S11_L002_R1_001.fastq.gz
5087946 RACP005_11_S11_L002_R2_001.fastq.gz
3832426 RACP005_12_S12_L002_R1_001.fastq.gz
4360034 RACP005_12_S12_L002_R2_001.fastq.gz
5640426 RACP005_13_S13_L002_R1_001.fastq.gz
6387466 RACP005_13_S13_L002_R2_001.fastq.gz
4137362 RACP005_1_S8_L002_R1_001.fastq.gz
4970210 RACP005_1_S8_L002_R2_001.fastq.gz
5553850 RACP005_5_S9_L002_R1_001.fastq.gz
6399450 RACP005_5_S9_L002_R2_001.fastq.gz
3088722 RACP005_8_S10_L002_R1_001.fastq.gz
3607426 RACP005_8_S10_L002_R2_001.fastq.gz


In [1]:
# SANITY CHECK: What does the README.md file say?
cat /input/genomic/viral/metagenomic/170621_150PE_HS4K2A/README.md

# 170621_150PE_HS4K2A

scientist:Arnaud Blouin

downloaded:Ben Warren Thu Aug 31 16:40:30 NZST 2017

type:RNAseq

tech:150PE HiSeq4000 (minimum 2% PhiX control run on all libraries, up to 25% for diversity)

organism:Vitis vinifera

organism:Nicotiana benthamiana

organism:GLRaV-3 (Grapevine leafroll-associated virus 3)

description:RNA from hosts infected with virus

samples:Almeida, Christensen, Martin, Nagalingum, Taylor

notes:

 - Lane 1 Matthew Niemiller & Taylor Lab
 - Lane 2 Cecilia & Prator Lab
 - Lane 3 Alan Christensen Lab
 - Lane 4 Nathalie Nagalingum/James Clugston &  Cal Academy
 - Lane 5-8 Christopher Martin Lab


150PE HiSeq4000 (minimum 2% PhiX control run on all libraries, up to 25% for diversity)

 - 1 BPO, 3nM, INDEX (426M Reads, 9.4% PhiX Aligned)
 - 2 RACO005, 3nM, INDEX (396M Reads, 3.0% PhiX Aligned)
 - 3 ACK101, 3nM, INDEX (390M Reads, 4.7% PhiX Aligned)
 - 4 Cycas4, 3nM, INDEX (397M Reads, 5.2% PhiX Aligned)
 - 5 CHM-CustomPool, 3nM, INDEX (382M Reads, 3.3% Ph

## <u>The key steps are:</u>

#### I. Establish Data Management Structure on PowerPlant
1. Make the necessary directories for the data and the analysis
2. Name the directories using standard workflow naming conventions
3. Name files using standard workflow naming conventions
4. Make README.md files when needed

#### II. Perform the analyses
1. FastQC RAW Data
2. SortMeRNA
3. FastQC SortMeRNA Output
4. Trimmomatic
5. FastQC Trimmomatic Output
6. Clean Up Workspace:
     - Delete un-needed intermediate files
     - Compress files that are still required

## <u>Step I: Establish Data Management Structure on PowerPlant</u>

### Define Project Variables

* We define my Plant and Food Research (PFR) username as a variable: `USER`
* We define a unique project name as a variable: `PROJECTNAME`
* By combining these variables, we create a new variable that defines a directory within my workspace where all outputs will be placed: `PROJECT`
* We also create a number of variables that refer to specific subdirectories within `PROJECT`

In [2]:
# Define the user as a variable
USER="hradxj"
PROJECTNAME="karmun_awesome_experiment"
# Define the project directory and temp subdirectory as a variable
PROJECT="/workspace/$USER/$PROJECTNAME"

# Define the location of various subdirectories within PROJECT

RAW=$PROJECT/000.raw
FASTQC_RAW=$PROJECT/001.fastqc_raw
SORTMERNA=$PROJECT/002.SMRNA
FASTQC_SORTMERNA=$PROJECT/003.fastqc_smrna
TRIMMOMATIC=$PROJECT/004.trimmomatic
FASTQC_TRIMMOMATIC=$PROJECT/005.fastqc_trim
TEMP="$PROJECT/TEMP"

### Create analysis directories

At this point, we have not actually created any directories... only defined what the directory is going to be called __when__ we create it. We still need to actually create the directories.

We use the Unix shell command `mkdir` to create the directories. The `-p` switch creates any missing parent directories and suppresses the error that would otherwise occur if a directory already exists.

In [28]:
# Create the project directory

mkdir -p $PROJECT

# Create project subdirectories

mkdir -p $RAW
mkdir -p $FASTQC_RAW
mkdir -p $SORTMERNA
mkdir -p $FASTQC_SORTMERNA
mkdir -p $TRIMMOMATIC
mkdir -p $FASTQC_TRIMMOMATIC
mkdir -p $TEMP


#### SANITY CHECK
Have the project directory and subdirectories been created correctly?

In [3]:
ls $PROJECT

000.raw		003.fastqc_smrna  006.MIA   009.STAR	  Illumina.fa
001.fastqc_raw	004.trimmomatic   007.STAR  010.edgeR_Nb  log
002.SMRNA	005.fastqc_trim   008.MBA   011.edgeR_Vv  TEMP


We can see that the project directory exists and has the appropriate subdirectories.
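
The same check can be scripted so that a missing subdirectory is reported by name instead of being spotted by eye. This is a minimal sketch and not part of the original workflow: `check_dirs` is a hypothetical helper, and the demonstration runs in a throwaway temporary directory (`DEMO`) rather than in `$PROJECT`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report every expected subdirectory that is missing.
# Usage: check_dirs <project_dir> <subdir>...
check_dirs() {
    local project=$1
    shift
    local sub
    local missing=0
    for sub in "$@"; do
        if [ ! -d "${project}/${sub}" ]; then
            echo "MISSING: ${project}/${sub}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Demonstration in a temporary directory (names mirror the layout used above)
DEMO=$(mktemp -d)
mkdir -p "${DEMO}/000.raw" "${DEMO}/001.fastqc_raw"
check_dirs "$DEMO" 000.raw 001.fastqc_raw && echo "all directories present"
```

In the notebook itself the call would look like `check_dirs "$PROJECT" 000.raw 001.fastqc_raw 002.SMRNA`, extended with whichever subdirectories the analysis needs.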

In [30]:
# Create symlinks of all input fastq files and put them in $RAW

ln -s /input/genomic/viral/metagenomic/170621_150PE_HS4K2A/Almeida/*.fastq.gz $RAW

In [33]:
# Sanity check: does the raw directory exist, and what is in it?
echo "Sanity check: does the raw directory exist, and what is in it?"
ls -s $RAW


Sanity check: does the raw directory exist, and what is in it?
total 18
2 RACP005_11_S11_L002_R1_001.fastq.gz  2 RACP005_1_S8_L002_R1_001.fastq.gz
2 RACP005_11_S11_L002_R2_001.fastq.gz  2 RACP005_1_S8_L002_R2_001.fastq.gz
2 RACP005_12_S12_L002_R1_001.fastq.gz  2 RACP005_5_S9_L002_R1_001.fastq.gz
2 RACP005_12_S12_L002_R2_001.fastq.gz  2 RACP005_5_S9_L002_R2_001.fastq.gz
2 RACP005_13_S13_L002_R1_001.fastq.gz  2 RACP005_8_S10_L002_R1_001.fastq.gz
2 RACP005_13_S13_L002_R2_001.fastq.gz  2 RACP005_8_S10_L002_R2_001.fastq.gz


In [34]:
# Change filenames: strip the "_001" suffix from each filename
FILENAMES=$(ls $RAW)
#echo $FILENAMES
for FILE in $FILENAMES
do
    NEWFILENAME=$(echo "$FILE" | sed 's/_001//g')
    mv "$RAW/$FILE" "$RAW/$NEWFILENAME"
done

In [35]:
ls -s $RAW


total 18
2 RACP005_11_S11_L002_R1.fastq.gz  2 RACP005_1_S8_L002_R1.fastq.gz
2 RACP005_11_S11_L002_R2.fastq.gz  2 RACP005_1_S8_L002_R2.fastq.gz
2 RACP005_12_S12_L002_R1.fastq.gz  2 RACP005_5_S9_L002_R1.fastq.gz
2 RACP005_12_S12_L002_R2.fastq.gz  2 RACP005_5_S9_L002_R2.fastq.gz
2 RACP005_13_S13_L002_R1.fastq.gz  2 RACP005_8_S10_L002_R1.fastq.gz
2 RACP005_13_S13_L002_R2.fastq.gz  2 RACP005_8_S10_L002_R2.fastq.gz
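
The renaming loop above iterates over the output of `ls`, which breaks if a filename ever contains whitespace. A possible alternative, sketched here on dummy files in a temporary directory, uses a glob and shell parameter expansion instead. `strip_001` is a hypothetical function name. Unlike the `sed` call above, it only removes `_001` when it sits directly before `.fastq.gz`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: drop the "_001" that precedes ".fastq.gz" for every file in a directory.
# Usage: strip_001 <dir>
strip_001() {
    local dir=$1
    local path name new
    for path in "$dir"/*_001.fastq.gz; do
        # Skip the unexpanded pattern when nothing matches (symlinks count as matches)
        [ -e "$path" ] || [ -L "$path" ] || continue
        name=${path##*/}
        new="${name%_001.fastq.gz}.fastq.gz"
        mv -- "$path" "$dir/$new"
    done
}

# Demonstration on empty placeholder files
DEMO=$(mktemp -d)
touch "$DEMO/RACP005_1_S8_L002_R1_001.fastq.gz" "$DEMO/RACP005_1_S8_L002_R2_001.fastq.gz"
strip_001 "$DEMO"
ls "$DEMO"
```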


## <u>Step II Part 1: FastQC RAW Data</u>
- The input for this step is the raw data from the provider in FASTQ format
- The outputs of this step are the HTML FastQC reports

In [45]:
# Define the location for the QC reports:
OUT="${PROJECT}/001.fastqc_raw"
LOG="${OUT}/logs"

mkdir -p $LOG

# Define the list of files to process:
FILES=`ls ${RAW}/*.gz`

# Load the FastQC module:
module load FastQC

for file in $FILES
    do
        COMMAND="fastqc --nogroup -q -t 2 -o ${OUT} ${file}"
        bsub -o ${LOG}/FQC.out -e ${LOG}/FQC.err -J FASTQC -n 2 $COMMAND
    done


In [38]:
# Sanity check: are the FastQC jobs still running?
# Once bjobs no longer lists any FASTQC jobs, the output directory should
# contain a set of ".html" and ".zip" files.
# You can view the reports by opening a web browser and looking at:
# http://storage.powerplant.pfr.co.nz/workspace/hrhsxj/vpnhr/001.
bjobs

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
596445  hradxj  RUN   normal     aklppr31    wkoppb37    MERGE      Nov  8 09:21
596447  hradxj  RUN   normal     aklppr31    wkoppb37    MERGE      Nov  8 09:21
596448  hradxj  RUN   normal     aklppr31    wkoppb37    MERGE      Nov  8 09:21
596450  hradxj  RUN   normal     aklppr31    wkoppb37    MERGE      Nov  8 09:21
596451  hradxj  RUN   normal     aklppr31    wkoppb37    MERGE      Nov  8 09:21
596452  hradxj  RUN   normal     aklppr31    wkoppb37    MERGE      Nov  8 09:21
596490  hradxj  RUN   normal     aklppr31    wkoppb35    FASTQC     Nov  8 09:52
                                             wkoppb35
596491  hradxj  RUN   normal     aklppr31    wkoppb35    FASTQC     Nov  8 09:52
                                             wkoppb35
596492  hradxj  RUN   normal     aklppr31    wkoppb35    FASTQC     Nov  8 09:52
                                             wkoppb35
596493  hradxj  RUN   normal 
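
Once `bjobs` no longer lists any `FASTQC` jobs, the presence of the reports can be checked per input file. The sketch below assumes FastQC's usual output naming, where `sample.fastq.gz` yields `sample_fastqc.html` and `sample_fastqc.zip`. `check_fastqc_reports` is a hypothetical helper, shown here on empty dummy files in a temporary directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list inputs that lack a FastQC HTML or ZIP report.
# Usage: check_fastqc_reports <input_dir> <report_dir>
check_fastqc_reports() {
    local in_dir=$1
    local out_dir=$2
    local path stem ext
    local status=0
    for path in "$in_dir"/*.fastq.gz; do
        [ -e "$path" ] || [ -L "$path" ] || continue
        stem=$(basename "$path" .fastq.gz)
        for ext in html zip; do
            if [ ! -e "${out_dir}/${stem}_fastqc.${ext}" ]; then
                echo "No ${ext} report for ${stem}"
                status=1
            fi
        done
    done
    return "$status"
}

# Demonstration: two inputs, but reports exist for only one of them
DEMO=$(mktemp -d)
mkdir -p "$DEMO/raw" "$DEMO/qc"
touch "$DEMO/raw/a_R1.fastq.gz" "$DEMO/raw/a_R2.fastq.gz"
touch "$DEMO/qc/a_R1_fastqc.html" "$DEMO/qc/a_R1_fastqc.zip"
check_fastqc_reports "$DEMO/raw" "$DEMO/qc" || echo "some reports are missing"
```

In the notebook this would be called as `check_fastqc_reports "$RAW" "$OUT"`.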

In [48]:
# Create multiQC report of FastQC results
module load MultiQC;
multiqc $OUT -o $OUT

Conflict with existing python virtual environment. Please run `deactivate` to unload MultiQC-1.2
[INFO   ]         multiqc : This is MultiQC v1.2
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '/workspace/hradxj/karmun_awesome_experiment/001.fastqc_raw'
Searching 26 files..  [####################################]  100%
[INFO   ]          fastqc : Found 12 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : ../../../../workspace/hradxj/karmun_awesome_experiment/001.fastqc_raw/multiqc_report.html
[INFO   ]         multiqc : Data        : ../../../../workspace/hradxj/karmun_awesome_experiment/001.fastqc_raw/multiqc_data
[INFO   ]         multiqc : MultiQC complete


In [49]:
echo "http://storage.powerplant.pfr.co.nz/workspace/hradxj/karmun_awesome_experiment/001.fastqc_raw/multiqc_report.html"

http://storage.powerplant.pfr.co.nz/workspace/hradxj/karmun_awesome_experiment/001.fastqc_raw/multiqc_report.html


### NOTES:
* Put your own notes here about what the FASTQC results look like

## <u>Step II Part 2: SortMeRNA</u>
* In this step we will remove any rRNA contamination by comparing our reads to 8 databases of known rRNAs
* We will capture the rRNA reads in case further investigation is needed
* We will output the filtered reads to use for our workflow

### Merge (interleave) paired fastq files
* The input to this step is the raw data in FASTQ format

__NOTE THAT THIS WILL WORK ONLY IF PAIRED FILES END IN "_R1.fastq.gz" and "_R2.fastq.gz"__

* SortMeRNA requires that paired-end files are merged (interleaved) prior to execution
* A Bash shell script called `merge-paired-reads.sh` in the `/scripts` subdirectory does this
* NOTE! This script can NOT process zipped files!!!
  * Therefore, all files must be de-compressed prior to use...
  * This is best done with process substitution, using `<(zcat ...)`
* The outputs of this step are interleaved fastq files
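
To make "interleaved" concrete: in the merged file, each 4-line record from R1 is followed directly by its mate from R2. The sketch below illustrates that layout with `awk`. It is not the script shipped with SortMeRNA, `interleave_fastq` is a hypothetical name, and the two tiny FASTQ files are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Illustration only: interleave two FASTQ files record by record.
# Assumes strict 4-line records and the same read order in both files.
# Usage: interleave_fastq <R1.fastq> <R2.fastq>   (writes to stdout)
interleave_fastq() {
    awk -v r2="$2" '
        { print }                       # copy the current R1 line
        NR % 4 == 0 {                   # an R1 record is complete: emit its mate
            for (i = 0; i < 4; i++) {
                if ((getline line < r2) > 0) print line
            }
        }
    ' "$1"
}

# Demonstration on two made-up read pairs
DEMO=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTTT\n+\nIIII\n' > "$DEMO/s_R1.fastq"
printf '@r1/2\nGGCC\n+\nIIII\n@r2/2\nAAAA\n+\nIIII\n' > "$DEMO/s_R2.fastq"
interleave_fastq "$DEMO/s_R1.fastq" "$DEMO/s_R2.fastq" > "$DEMO/s_MERGED.fastq"
cat "$DEMO/s_MERGED.fastq"
```

Because the function only needs two readable paths, in Bash it could also be fed compressed input through process substitution, for example `interleave_fastq <(zcat a_R1.fastq.gz) <(zcat a_R2.fastq.gz)`.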

In [74]:
# Define the location of the various QC programs we will be using,
# and the location of SortMeRNA rRNA databases

SMRNA="/workspace/cflcyd/software/sortmerna-2.1b"
SCRIPTS="${SMRNA}/scripts"
DB="${SMRNA}/rRNA_databases"
INDEX="${SMRNA}/index"
SORTMERNADB="${DB}/silva-bac-16s-id90.fasta,\
${INDEX}/silva-bac-16s-db:\
${DB}/silva-bac-23s-id98.fasta,\
${INDEX}/silva-bac-23s-db:\
${DB}/silva-arc-16s-id95.fasta,\
${INDEX}/silva-arc-16s-db:\
${DB}/silva-arc-23s-id98.fasta,\
${INDEX}/silva-arc-23s-db:\
${DB}/silva-euk-18s-id95.fasta,\
${INDEX}/silva-euk-18s-db:\
${DB}/silva-euk-28s-id98.fasta,\
${INDEX}/silva-euk-28s:\
${DB}/rfam-5s-database-id98.fasta,\
${INDEX}/rfam-5s-db:\
${DB}/rfam-5.8s-database-id98.fasta,\
${INDEX}/rfam-5.8s-db"

/workspace/hradxj/karmun_awesome_experiment
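
The `SORTMERNADB` value is a colon-separated list of `fasta,index` pairs, one pair per database. The sketch below shows one way such a string could be assembled from a short list, which makes a missing comma or colon easier to spot. The installation path is a placeholder, only three of the eight databases are listed, and the variable names `DEMO_ROOT` and `REF` are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch: assemble a "fasta,index:fasta,index:..." string from name pairs.
# DEMO_ROOT is a placeholder path, not the real SortMeRNA location.
DEMO_ROOT="/path/to/sortmerna"

REF=""
while read -r fasta index; do
    # ${REF:+:} adds the ":" separator only when REF already holds an entry
    REF="${REF}${REF:+:}${DEMO_ROOT}/rRNA_databases/${fasta},${DEMO_ROOT}/index/${index}"
done <<EOF
silva-bac-16s-id90.fasta silva-bac-16s-db
silva-bac-23s-id98.fasta silva-bac-23s-db
rfam-5s-database-id98.fasta rfam-5s-db
EOF

echo "$REF"
# Count the databases: one comma-containing field per colon-separated entry
echo "$REF" | tr ':' '\n' | grep -c ','
```

Running the same count on the real variable (`echo "$SORTMERNADB" | tr ':' '\n' | grep -c ','`) is a quick way to confirm that all database entries are present.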


In [39]:
# Sanity check: Do the paired input files end in "_R1.fastq.gz" and "_R2.fastq.gz"?
# If the following command does not result in the full list
# of the name of your input files (without directories), STOP, since there is a problem.
basename --multiple $RAW/* | grep '_R1.fastq.gz\|_R2.fastq.gz'


RACP005_11_S11_L002_R1.fastq.gz
RACP005_11_S11_L002_R2.fastq.gz
RACP005_12_S12_L002_R1.fastq.gz
RACP005_12_S12_L002_R2.fastq.gz
RACP005_13_S13_L002_R1.fastq.gz
RACP005_13_S13_L002_R2.fastq.gz
RACP005_1_S8_L002_R1.fastq.gz
RACP005_1_S8_L002_R2.fastq.gz
RACP005_5_S9_L002_R1.fastq.gz
RACP005_5_S9_L002_R2.fastq.gz
RACP005_8_S10_L002_R1.fastq.gz
RACP005_8_S10_L002_R2.fastq.gz
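
The listing above confirms the suffixes, but not that every R1 file has an R2 partner. A sketch of a stricter check follows. `check_pairs` is a hypothetical helper, and the demonstration uses empty placeholder files in a temporary directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm that every *_R1.fastq.gz has a matching
# *_R2.fastq.gz in the same directory, and vice versa.
# Usage: check_pairs <dir>
check_pairs() {
    local dir=$1
    local path mate
    local status=0
    for path in "$dir"/*_R[12].fastq.gz; do
        [ -e "$path" ] || [ -L "$path" ] || continue
        case "$path" in
            *_R1.fastq.gz) mate="${path%_R1.fastq.gz}_R2.fastq.gz" ;;
            *)             mate="${path%_R2.fastq.gz}_R1.fastq.gz" ;;
        esac
        if [ ! -e "$mate" ] && [ ! -L "$mate" ]; then
            echo "Unpaired: $(basename "$path")"
            status=1
        fi
    done
    return "$status"
}

# Demonstration: sample A is complete, sample B lacks its R2 file
DEMO=$(mktemp -d)
touch "$DEMO/A_R1.fastq.gz" "$DEMO/A_R2.fastq.gz" "$DEMO/B_R1.fastq.gz"
check_pairs "$DEMO" || echo "fix the pairing before merging"
```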


In [52]:
# Define the location for the merged files:
OUT="${PROJECT}/002.SMRNA"
LOG="${OUT}/logs"

# Make sure the log directory exists before bsub tries to write to it
mkdir -p $LOG

# Define a set of unique names of the paired files,
# but excluding the _R1.fastq.gz and _R2.fastq.gz.
# This means that the variable "FILES" will consist of a unique name
# for each PAIR of paired fastq files. We can then append the 
# _R1.fastq.gz and _R2.fastq.gz suffix within the loop, to ensure
# that each iteration of the loop is working on two correctly paired files.

FILES=$(basename -a ${RAW}/*.gz | sed 's/_R[12]\.fastq\.gz$//' | sort -u)

for file in $FILES
     do

        file1=${file}_R1.fastq.gz
        file2=${file}_R2.fastq.gz
        COMMAND="${SCRIPTS}/Merge.sh \
                <(zcat $RAW/${file1}) \
                <(zcat $RAW/${file2}) \
                ${OUT}/${file}_MERGED.fastq"
        #echo "$COMMAND"
        bsub -o ${LOG}/MERGE.out -e ${LOG}/MERGE.err -J MERGE bash -c "${COMMAND}"
     done

### Note that "bash -c" is needed to run the command in a proper Bash shell
### (instead of a Bourne shell) so that process substitution works with OpenLava ###
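
Process substitution, `<( ... )`, is a Bash feature rather than part of the POSIX shell language, which is why the command string has to be handed to Bash explicitly. The toy example below shows the pattern without the scheduler: the string is stored unexpanded and only interpreted once `bash -c` receives it. The `printf` commands stand in for the `zcat` calls and are purely illustrative.

```shell
#!/usr/bin/env bash
# A command string containing process substitution, stored as plain text.
# Quoting keeps <( ... ) unexpanded until the string reaches the inner shell.
COMMAND='cat <(printf "read1\n") <(printf "read2\n")'

# Handing the string to "bash -c" makes Bash parse the <( ... ) syntax.
RESULT=$(bash -c "$COMMAND")
echo "$RESULT"
```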

In [53]:
# Sanity check: Have we actually produced the interleaved FASTQ files for use in SortMeRNA?
# You should see a file called <yourfile>_MERGED.fastq for every PAIR of reads in your 
# input data. If the process is still running, you should see files called 
# <yourfile>_MERGED.fastq.READS1 and <yourfile>_MERGED.fastq.READS2. 
# Do not proceed to the next step until you ONLY see files called <yourfile>_MERGED.fastq
# Note that this merging step can take several hours!
ls -s $OUT

total 301062105
49333114 RACP005_11_S11_L002_MERGED.fastq
39056434 RACP005_12_S12_L002_MERGED.fastq
57527458 RACP005_13_S13_L002_MERGED.fastq
54648034 RACP005_1_S8_L002_MERGED.fastq
55260978 RACP005_5_S9_L002_MERGED.fastq
45236090 RACP005_8_S10_L002_MERGED.fastq


### Run the Main SortMeRNA Program
* This is what actually does the sorting
* The inputs to this step are the merged fastq files
* The outputs are the rRNA matches and the filtered raw reads in interleaved fastq format

In [54]:
# Define the location for the input and output files:
IN="${PROJECT}/002.SMRNA"
OUT="${PROJECT}/002.SMRNA"
FILTERED="${OUT}/filtered/merged"
rRNA="${OUT}/rRNA"
LOG="${OUT}/logs"

mkdir -p $rRNA
mkdir -p $LOG
mkdir -p $FILTERED

### module load sortmerna ### Not until latest version is installed and configured...

FILES=`ls ${IN}/*_MERGED.fastq`

for file in $FILES
    do
        NAME=`basename $file`
        COMMAND="${SMRNA}/sortmerna --ref ${SORTMERNADB} --reads ${file} \
                --paired_in -a 4 -m 3911 -v --log --fastx \
                --aligned ${rRNA}/${NAME}_rRNA \
                --other ${FILTERED}/${NAME}_sortmerna"
        bsub -o ${LOG}/${NAME}.out -e ${LOG}/${NAME}.err -J SMRNA -n 4 $COMMAND
     done


Job <620803> is submitted to default queue <normal>.
Job <620804> is submitted to default queue <normal>.
Job <620805> is submitted to default queue <normal>.
Job <620806> is submitted to default queue <normal>.
Job <620807> is submitted to default queue <normal>.
Job <620808> is submitted to default queue <normal>.


### Un-merge (de-interleave) the filtered fastq files
* The other programs in our workflow use standard non-interleaved files, so we unmerge them.
* The inputs are the merged (interleaved) fastq files that SortMeRNA has filtered
* The outputs are the unmerged fastq files, one `_R1` and one `_R2` file per sample
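
As an illustration of what un-merging does, the sketch below splits an interleaved file back into its two mates by sending lines 1-4 of every 8-line block to R1 and lines 5-8 to R2. It is not the script shipped with SortMeRNA. `deinterleave_fastq` is a hypothetical name and the input reads are made up.

```shell
#!/usr/bin/env bash
# Illustration only: split an interleaved FASTQ file into R1 and R2.
# Assumes strict 8-line blocks: 4 lines of read 1 followed by 4 lines of read 2.
# Usage: deinterleave_fastq <merged.fastq> <out_R1.fastq> <out_R2.fastq>
deinterleave_fastq() {
    awk -v r1="$2" -v r2="$3" '
        ((NR - 1) % 8) < 4 { print > r1; next }   # first half of the block
                           { print > r2 }         # second half of the block
    ' "$1"
}

# Demonstration on two made-up read pairs
DEMO=$(mktemp -d)
printf '@a/1\nACGT\n+\nIIII\n@a/2\nGGCC\n+\nIIII\n@b/1\nTTTT\n+\nIIII\n@b/2\nAAAA\n+\nIIII\n' > "$DEMO/s_MERGED.fastq"
deinterleave_fastq "$DEMO/s_MERGED.fastq" "$DEMO/s_R1.fastq" "$DEMO/s_R2.fastq"
head -n 1 "$DEMO/s_R1.fastq" "$DEMO/s_R2.fastq"
```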

In [55]:
IN="${PROJECT}/002.SMRNA/filtered/merged"
OUT="${PROJECT}/002.SMRNA/filtered"

FILENAMES=`ls ${IN}/*sortmerna*`

for file in $FILENAMES
     do
        # echo $file
        NAME=`basename $file`
        PREFIX=`echo $NAME | awk -F'[. ]' '{print $1}'`
        #echo $PREFIX
        file1=${OUT}/${PREFIX}_R1.fastq
        file2=${OUT}/${PREFIX}_R2.fastq
        COMMAND="${SCRIPTS}/Unmerge.sh $file $file1 $file2"
       # echo $COMMAND
        bsub -J UNMERGE -n 3 ${COMMAND}
     done


Job <631684> is submitted to default queue <normal>.
Job <631685> is submitted to default queue <normal>.
Job <631686> is submitted to default queue <normal>.
Job <631687> is submitted to default queue <normal>.
Job <631688> is submitted to default queue <normal>.
Job <631689> is submitted to default queue <normal>.


In [56]:
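# Remove merged files
# Run this only after all UNMERGE jobs have finished (check with bjobs),
# otherwise the inputs are deleted while they are still being read.
# ${PROJECT:?} aborts the command if PROJECT is unset or empty.
rm -rf ${PROJECT:?}/002.SMRNA/filtered/merged/*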

## <u>Step II Part 3: FastQC SortMeRNA Filtered Output</u>
* We now verify that we did not break anything and re-check the quality of our reads after sorting
* The input for this step is the filtered data from SortMeRNA in FASTQ format
* The outputs of this step are the HTML FastQC reports

In [57]:
IN="${PROJECT}/002.SMRNA/filtered"
OUT="${PROJECT}/003.fastqc_smrna"
LOG="${OUT}/logs"

# Make sure the log directory exists before bsub tries to write to it
mkdir -p $LOG

# Get the files to check:
FILES=`ls ${IN}/*.fastq`
#echo $FILES

# Load the FastQC module:
module load FastQC

for file in $FILES
    do
        COMMAND="fastqc --nogroup -q -t 2 -o ${OUT} ${file}"
        bsub -o ${LOG}/FQC.out -e ${LOG}/FQC.err -J FASTQC -n 2 $COMMAND
    done


Job <634263> is submitted to default queue <normal>.
Job <634264> is submitted to default queue <normal>.
Job <634265> is submitted to default queue <normal>.
Job <634266> is submitted to default queue <normal>.
Job <634267> is submitted to default queue <normal>.
Job <634268> is submitted to default queue <normal>.
Job <634269> is submitted to default queue <normal>.
Job <634270> is submitted to default queue <normal>.
Job <634271> is submitted to default queue <normal>.
Job <634272> is submitted to default queue <normal>.
Job <634273> is submitted to default queue <normal>.
Job <634274> is submitted to default queue <normal>.


In [43]:
# Create multiQC report of FastQC results
module load MultiQC;
multiqc $OUT -o $OUT

[INFO   ]         multiqc : This is MultiQC v1.2
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '/workspace/hradxj/karmun_awesome_experiment/002.SMRNA'
Searching 18 files..  [####################################]  100%
[INFO   ]         multiqc : MultiQC complete


In [60]:
echo "http://storage.powerplant.pfr.co.nz/workspace/hradxj/karmun_awesome_experiment/003.fastqc_smrna/multiqc_report.html" 

http://storage.powerplant.pfr.co.nz/workspace/hradxj/karmun_awesome_experiment/003.fastqc_smrna/multiqc_report.html


### NOTES:
* Check for the level of contamination: < 2% is great. Up to 10% is OK, but should be noted.
* You will see that the total number of reads in each file is now slightly lower due to the filtering
* Many issues still remain: Over-represented sequences, Adapter removal, Quality trimming
  * These issues will be handled next by Trimmomatic and MarkIlluminaAdapters
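
The contamination thresholds in these notes can be applied mechanically once the read counts are known. The following is a small sketch: `rrna_level` is a hypothetical helper, the label "high" for values above 10% is our own wording, and the read counts in the example are made-up placeholders rather than results from this run.

```shell
#!/usr/bin/env bash
# Sketch: classify the rRNA fraction using the thresholds from the notes above.
# Usage: rrna_level <total_reads> <rrna_reads>
rrna_level() {
    awk -v total="$1" -v rrna="$2" 'BEGIN {
        pct = 100 * rrna / total
        if (pct < 2)        level = "great"
        else if (pct <= 10) level = "OK, but note it"
        else                level = "high"
        printf "%.2f%% rRNA: %s\n", pct, level
    }'
}

# Read counts can be derived from line counts, since a FASTQ record is 4 lines.
# The numbers below are placeholders for illustration only.
rrna_level 1000000 15000
rrna_level 1000000 60000
```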

## <u>Step II Part 4: TRIMMOMATIC</u>
* Now that the reads are filtered, we will remove adapters, over-represented sequences, and poor-quality bases from the reads
* The command scans each read with a 5-base sliding window and clips the read once the average quality within the window drops below 20 (`SLIDINGWINDOW:5:20`)
* Also, after clipping, the minimum length for a read will be 50 bp (`MINLEN:50`)
* The `Illumina.fa` file contains the TruSeq adapter sequences and homo-polymer sequences to clip
  * This file needs to be edited to contain the appropriate sequences.
* The inputs for this step are the SortMeRNA filtered reads
* The output are the trimmed reads

In [73]:
# Run the Trimmomatic program on the filtered data to remove Illumina adapters, homo-polymers, and low quality reads:
  # Note that to do this, it is necessary to edit the file containing the adapter sequences
  # to include all sequences that you wish to remove:
  # This file is called Illumina.fa and is in the project directory (see CLIP below).

IN="${PROJECT}/002.SMRNA/filtered"
OUT="${PROJECT}/004.trimmomatic"
UNPAIRED="${OUT}/unpaired"
LOG="${OUT}/logs"

mkdir -p $IN
mkdir -p $OUT
mkdir -p $UNPAIRED
mkdir -p $LOG

# Set the path to the adapter file:
CLIP="${PROJECT}/Illumina.fa"

# Get the files to trim:
# Use echo statements to be sure that the results from awk are what you really want...
FILES=$(basename -a ${IN}/*.fastq | sed 's/_R[12]\.fastq$//' | sort -u)
#FILES=`ls ${IN}/*.fastq | awk -F'[_ ]' '{print $1"_"$2"_"$3"_"$4"_"$5"_"$6}' | sort -u`

#echo $FILES
module load Trimmomatic

for FILE in $FILES
     do
        In_File1=${IN}/${FILE}_R1.fastq
        In_File2=${IN}/${FILE}_R2.fastq
  #      echo $In_File1
  #      echo $In_File2
        Out_PAIRED_1=${OUT}/${FILE}_trimmomatic_R1.fastq
        Out_UNPAIRED_1=${UNPAIRED}/${FILE}_trimmomatic_unpaired_1.fastq
        Out_PAIRED_2=${OUT}/${FILE}_trimmomatic_R2.fastq
        Out_UNPAIRED_2=${UNPAIRED}/${FILE}_trimmomatic_unpaired_2.fastq
  #      echo $Out_PAIRED_1
  #      echo $Out_UNPAIRED_1
  #      echo $Out_PAIRED_2
  #      echo $Out_UNPAIRED_2
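        # NOTE: ${TRIMMOMATIC} must resolve to the Trimmomatic .jar file here.
        # The same name was assigned to the 004.trimmomatic output directory earlier
        # in this notebook, so confirm its value after "module load Trimmomatic".
        COMMAND="java -Xms8G -Xmx8G -jar \
                 ${TRIMMOMATIC} PE -threads 3 \
                 ${In_File1} ${In_File2} \
                 ${Out_PAIRED_1} ${Out_UNPAIRED_1} ${Out_PAIRED_2} ${Out_UNPAIRED_2} \
                 ILLUMINACLIP:${CLIP}:2:30:10 SLIDINGWINDOW:5:20 MINLEN:50"
        #echo $COMMAND
        # Name the logs after the current sample (FILE), not the stale PREFIX variable
        bsub -o ${LOG}/${FILE}.out -e ${LOG}/${FILE}.err -J TRIM -n 3 $COMMAND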
     done

# It is critical to set the -X settings for Java for the program to run correctly
# Here, the VM is instantiated with 8GB of heap space, with a max of 8GB...


Job <635141> is submitted to default queue <normal>.
Job <635142> is submitted to default queue <normal>.
Job <635143> is submitted to default queue <normal>.
Job <635144> is submitted to default queue <normal>.
Job <635145> is submitted to default queue <normal>.
Job <635146> is submitted to default queue <normal>.


### Results Summary:
* ILLUMINACLIP: Using 1 prefix pairs, 8 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
* Quality encoding detected as phred33
* Input Read Pairs: 12,928,498 
* Both Surviving: 11,702,894 (90.52%) 
* Forward Only Surviving: 570,865 (4.42%) 
* Reverse Only Surviving: 181,229 (1.40%) 
* Dropped: 473,510 (3.66%)
* So we have over 90% of reads passing our criteria!
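
The percentages in this summary can be re-derived from the raw counts, which is a quick way to confirm that the four categories account for every input read pair. A small sketch follows. `pct` is a helper defined only for this example, and the counts are copied from the summary above.

```shell
#!/usr/bin/env bash
# Recompute the survival percentages from the counts reported above.
INPUT=12928498

# Print n as a percentage of the input read pairs, to two decimal places
pct() {
    awk -v n="$1" -v d="$INPUT" 'BEGIN { printf "%.2f", 100 * n / d }'
}

echo "Both surviving:         $(pct 11702894)%"
echo "Forward only surviving: $(pct 570865)%"
echo "Reverse only surviving: $(pct 181229)%"
echo "Dropped:                $(pct 473510)%"

# The four categories should add up to the number of input read pairs
echo "Sum of categories:      $((11702894 + 570865 + 181229 + 473510))"
```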


## <u>Step II Part 5: FASTQC of TRIMMED READS</u>
* We now verify that we did not break anything and re-check the quality of our reads after trimming
* The inputs for this step are the filtered, trimmed reads in FASTQ format
* The outputs of this step are the HTML reports

In [75]:
IN="${PROJECT}/004.trimmomatic"
OUT="${PROJECT}/005.fastqc_trim"
LOG="${OUT}/logs"

# Make sure the log directory exists before bsub tries to write to it
mkdir -p $LOG

# Get the files to check:
FILES=`ls ${IN}/*trimmomatic*`

# Load the FastQC module:
module load FastQC

for file in $FILES
    do
        COMMAND="fastqc --nogroup -q -t 2 -o ${OUT} ${file}"
        bsub -o ${LOG}/FQC.out -e ${LOG}/FQC.err -J FASTQC -n 2 $COMMAND
    done
    


Job <646929> is submitted to default queue <normal>.
Job <646930> is submitted to default queue <normal>.
Job <646931> is submitted to default queue <normal>.
Job <646932> is submitted to default queue <normal>.
Job <646933> is submitted to default queue <normal>.
Job <646934> is submitted to default queue <normal>.
Job <646935> is submitted to default queue <normal>.
Job <646936> is submitted to default queue <normal>.
Job <646937> is submitted to default queue <normal>.
Job <646938> is submitted to default queue <normal>.
Job <646939> is submitted to default queue <normal>.
Job <646940> is submitted to default queue <normal>.


In [76]:
# Create multiQC report of FastQC results
module load MultiQC;
multiqc $OUT -o $OUT;
module unload MultiQC;

[INFO   ]         multiqc : This is MultiQC v1.3
[INFO   ]         multiqc : Template    : default
[INFO   ]         multiqc : Searching '/workspace/hradxj/karmun_awesome_experiment/005.fastqc_trim'
Searching 24 files..  [####################################]  100%
[INFO   ]          fastqc : Found 12 reports
[INFO   ]         multiqc : Compressing plot data
[INFO   ]         multiqc : Report      : ../../../../workspace/hradxj/karmun_awesome_experiment/005.fastqc_trim/multiqc_report.html
[INFO   ]         multiqc : Data        : ../../../../workspace/hradxj/karmun_awesome_experiment/005.fastqc_trim/multiqc_data
[INFO   ]         multiqc : MultiQC complete


In [70]:
echo "http://storage.powerplant.pfr.co.nz/workspace/hradxj/karmun_awesome_experiment/005.fastqc_trim/multiqc_report.html"

http://storage.powerplant.pfr.co.nz/workspace/hradxj/karmun_awesome_experiment/005.fastqc_trim/multiqc_report.html


Adapter contamination was present in the R1 reads. It was successfully removed, as can be seen by comparing the pre- and post-Trimmomatic MultiQC reports.

## <u>Step III: Assess the transformed data for suitability in downstream analysis</u>
* Here we need to assess the condition of our transformed reads for downstream analysis
* Look at the FastQC Report for the trimmed data: Looks great.
* Check for any obvious problems: No red flags.
* Look at the metrics files from MIA to see what was found: A few more adapters were clipped...not bad.
* Different standards apply depending on the type of downstream analysis: differential expression analysis or variant calling

### This marks the end of the QC Pre-Processing of the RNA-Seq Data
* The next step typically is to align to an available reference set:
    * Genome
        * `RNA_Seq_Analysis_Phase_IIa_Alignment_to_Reference_Genome_Revised_July_2017.ipynb`
    * Transcriptome
        * `RNA_Seq_Analysis_Phase_IIb_Alignment_to_Transcriptome.ipynb`