# Introduction to RNA Sequencing
Welcome to the RNA-sequencing workshop for SSRP. This notebook will cover step-by-step how to perform a basic RNA-seq experiment. 

Because this notebook is built to run on our [Binderhub](https://binderhub.readthedocs.io/en/latest/index.html), the environment has been preconfigured with the appropriate software installed. Each tool will be called as needed. To see the full list, check the environment file within the binder directory. If you try to run this notebook on a local machine without properly installing each software package, it will not run correctly.

Since this Jupyter kernel is based on Python, each code block is Python unless stated otherwise. Explanatory text lives in markdown blocks. Otherwise, look for a line starting with %% at the top of a block, which sets the language for that block: a block starting with "%%bash" is executed as if it were typed into a standard command-line terminal, and a block starting with "%%R" is executed as R code.

## Primary analysis
Primary analysis is the first line of analysis. In this case the very first step, demultiplexing, will already have been performed by your sequencing service provider. This step converts the native Illumina base calls into the FASTQ format and separates the samples based on their respective indexes. Per sample you should have:
- SampleName_R1.fastq.gz
- SampleName_R2.fastq.gz

The .gz at the end indicates [gzip](https://www.gzip.org/) compression, which reduces the overall file size.

In this case, we will be using publicly available data. Traditionally, that means downloading the files from the [Sequence Read Archive (SRA)](https://www.ncbi.nlm.nih.gov/sra). This involves using their SRA toolkit, which has already been installed in this environment and used to download the files. The code would look like:
```bash
# fastq-dump expects a run accession (SRR...), see the sample breakdown table
fastq-dump --split-files --gzip --outdir raw_files/ SRR7261718
```
This would need to be repeated for each sample. The --split-files option writes R1 and R2 to separate files, --gzip compresses the output, and --outdir sets the output directory for the files.
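
Repeating the command per sample can be scripted. The following is a minimal sketch in Python that only builds and prints one `fastq-dump` command per run accession from the sample breakdown table in this notebook; it does not download anything. The helper name `fastq_dump_command` and the `raw_files/` output directory are assumptions made for illustration.

```python
import shlex

# Run accessions SRR7261718..SRR7261726, as listed in the sample breakdown table.
RUNS = [f"SRR{n}" for n in range(7261718, 7261727)]


def fastq_dump_command(run, outdir="raw_files/"):
    """Build the fastq-dump argument list for one run accession (hypothetical helper)."""
    return ["fastq-dump", "--split-files", "--gzip", "--outdir", outdir, run]


commands = [fastq_dump_command(run) for run in RUNS]
for cmd in commands:
    # Print the command as it would be typed in a terminal.
    print(shlex.join(cmd))
```

To actually run a command, each list could be passed to `subprocess.run(cmd, check=True)`.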

Unfortunately, the SRA toolkit can be rather finicky and does not always operate consistently. In some cases it can be easier to pull the files from the [European Nucleotide Archive (ENA)](https://www.ebi.ac.uk/ena). For this, you need to find the FTP path, but then you can just use the standard Unix tool wget to download the files.
```bash
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR726/008/SRR7261718/SRR7261718_1.fastq.gz
```
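
Finding the FTP path by hand for every file gets tedious. As a rough sketch, the path can be derived from the run accession, assuming the ENA directory layout implied by the example URL above: the first six characters of the accession, then a zero-padded subdirectory built from the digits beyond the sixth. This layout is an assumption here, and `ena_fastq_url` is a hypothetical helper; confirm the real paths on the ENA website before relying on it.

```python
def ena_fastq_url(run, mate):
    """Return the assumed ENA FTP URL for one mate (1 or 2) of a paired-end run."""
    prefix = run[:6]
    extra = len(run) - 9  # number of digits beyond the first six
    # Accessions with more than six digits get an extra zero-padded subdirectory.
    subdir = f"{run[-extra:].zfill(3)}/" if extra > 0 else ""
    return (
        "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/"
        f"{prefix}/{subdir}{run}/{run}_{mate}.fastq.gz"
    )


for mate in (1, 2):
    print(ena_fastq_url("SRR7261718", mate))
```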
For expediency, the download has been included in the notebook initialization.

For this workshop, we will be using data from [GSE115330](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115330). This study investigated the effects of various pollutants on the transcriptome of *Escherichia coli*. RNA from each sample was isolated via TRIzol extraction, followed by ribosomal RNA depletion with the RiboZero kit. Libraries were generated with the TruSeq RNA library kit (sonic fragmentation) and sequenced on an Illumina HiSeq 4000 in a 2x150bp configuration.

**Sample breakdown table**

|  SRA Sample Name | SRA Run Name | Sample Name | Sample Type |
| --- | --- | --- | --- |
| SAMN09354753 | SRR7261718 | WT1 | Wild type |
| SAMN09354752 | SRR7261719 | WT2 | Wild type |
| SAMN09354751 | SRR7261720 | WT3 | Wild type |
| SAMN09354750 | SRR7261721 | Urban1 | Treated with collected urban air |
| SAMN09354749 | SRR7261722 | Urban2 | Treated with collected urban air |
| SAMN09354748 | SRR7261723 | Urban3 | Treated with collected urban air |
| SAMN09354747 | SRR7261724 | Diesel1 | Treated with collected diesel exhaust |
| SAMN09354746 | SRR7261725 | Diesel2 | Treated with collected diesel exhaust |
| SAMN09354745 | SRR7261726 | Diesel3 | Treated with collected diesel exhaust |

After receiving these files from your sequencing service provider (or in this case, downloading them), the next step in primary analysis is checking the quality of the reads. This follows the simple principle of **Garbage in=Garbage out** and always drives the need for the highest quality input available.

### FASTQ quality check
To check your FASTQ files, we will use FastQC, published by the [Babraham Bioinformatics](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) team, as the first-line QC tool.

This tool will evaluate the overall health of your sample. In particular, you will want to look at:
- Total number of sequences
    - Make sure it aligns with your targeted expectations
- Per base sequencing quality
    - Quality typically drops toward the end of the read, especially in 2x300bp reads. Make sure it does not drop too far; usually you want more than 90% of bases at Q30 or above if possible
- GC content
    - Make sure it aligns with your genome of interest
- %N
    - N is an ambiguous base. You do not want many of these in your sample (if any)
- Sequence Length
    - Make sure the read length you requested from the sequencer is what is coming out
- Duplication levels
    - Keep these as low as possible, but expect some duplication in any library prepared with PCR
- Adapter content
    - This will be driven by your insert size and sequencing length. If you are sequencing through your entire insert and into the adapter on the other side, you would expect contamination here. It is usually best to avoid that, but if you see some you will need to trim it out.
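
To make a few of these metrics concrete, here is a minimal sketch in Python that computes the read count, GC content, %N, and the share of bases at Q30 or above from FASTQ lines. It is not how FastQC is implemented. It assumes four-line FASTQ records and Phred+33 quality encoding, and the two reads are made up purely for illustration.

```python
def fastq_records(lines):
    """Yield (sequence, quality) pairs from an iterable of FASTQ lines."""
    it = iter(lines)
    for _header in it:
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        yield seq, qual


def qc_summary(lines, phred_offset=33):
    """Summarise basic QC metrics; assumes at least one non-empty read."""
    n_reads = n_bases = gc = n_count = q30 = 0
    for seq, qual in fastq_records(lines):
        n_reads += 1
        n_bases += len(seq)
        gc += seq.count("G") + seq.count("C")
        n_count += seq.count("N")
        # A quality character encodes the Phred score as ASCII code minus the offset.
        q30 += sum(1 for c in qual if ord(c) - phred_offset >= 30)
    return {
        "reads": n_reads,
        "pct_gc": 100 * gc / n_bases,
        "pct_n": 100 * n_count / n_bases,
        "pct_q30": 100 * q30 / n_bases,
    }


# Two made-up reads, for illustration only ('I' is Q40, '5' is Q20, '#' is Q2).
example = [
    "@read1", "GATTACAN", "+", "IIIIIII#",
    "@read2", "GGCCAATT", "+", "IIII5555",
]
summary = qc_summary(example)
print(summary)
```

For a real file, the lines could come from `gzip.open(path, "rt")` instead of the in-memory list.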

In [None]:
%%bash
fastqc raw_files/.fastq.gz
fastqc raw_files/.fastq.gz

If you do not want to do that manually, you can set up a loop to iterate over all files that end in ".fastq.gz". As an example, here is a small code snippet that shows how to do this. Try it out if you would like!

```bash
for file in raw_files/*.fastq.gz
do 
    # $file already includes the raw_files/ directory from the glob above
    fastqc "$file"
done
```

To view your FastQC output, navigate into the raw_files directory (which contains the raw FASTQ files) and look for the .html output files. Download those and open them in your browser to explore the output.

### Read Trimming
After checking the input quality of the samples, you will often (or always) need to trim the reads. This can be critical because again, **Garbage in=Garbage out**.
This step removes ambiguous bases, low quality bases/reads, and any adapter content that may be in the read. For most trimming cases, [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) is a viable tool. Another commonly used tool is [CutAdapt](https://cutadapt.readthedocs.io/en/stable/), but here we will use Trimmomatic.

Parameters used:
- PE
    - Denotes the input as paired end reads. Following that, you need to put the two input files, then trimmed-1-paired, trimmed-1-unpaired, trimmed-2-paired, trimmed-2-unpaired for output files
- ILLUMINACLIP
    - Clips off any Illumina adapter sequences that are found. The first argument is the FASTA file of adapter sequences. The numbers that follow are the seed mismatches (maximum count of mismatches allowed when identifying an adapter), the palindrome clip threshold (how accurate the match between the two adapter-ligated reads must be for paired-end palindrome clipping), and the simple clip threshold (how accurate the match between an adapter sequence and a read must be)
- LEADING TRAILING
    - Removes bases from the start and end of the read if they are below thresholds. "2" is Illumina for "low quality"
- SLIDINGWINDOW
    - Scans the read with a sliding window and takes the average quality of the bases in it. The first number is the window size and the second is the required average quality score. Again "2" is Illumina for "low quality", but averaging prevents a single bad base from trashing an otherwise high quality read.
- MINLEN
    - The minimum length of a read after trimming to accept.  
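
The sliding window and minimum length steps can be illustrated with a small Python sketch. This is a simplified model under stated assumptions, not Trimmomatic's actual implementation: it assumes Phred+33 quality encoding, cuts the read at the start of the first window whose mean quality falls below the threshold, and then applies the minimum length filter. All function names are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(c) - offset for c in quality]


def sliding_window_cut(scores, window=4, min_avg=2):
    """Return how many leading bases to keep.

    The read is cut at the start of the first window whose mean
    quality falls below min_avg; otherwise the whole read is kept.
    """
    for start in range(len(scores) - window + 1):
        if sum(scores[start:start + window]) / window < min_avg:
            return start
    return len(scores)


def trim_read(seq, quality, window=4, min_avg=2, min_len=140):
    """Trim one read; return None if it ends up shorter than min_len."""
    keep = sliding_window_cut(phred_scores(quality), window, min_avg)
    trimmed = seq[:keep]
    return trimmed if len(trimmed) >= min_len else None


# Made-up 12-base read: eight Q40 bases ('I') followed by four Q2 bases ('#').
seq, qual = "ACGTACGTACGT", "IIIIIIII####"
print(trim_read(seq, qual, window=4, min_avg=15, min_len=5))
print(trim_read(seq, qual, window=4, min_avg=15, min_len=140))
```

With a stricter threshold of 15 the low quality tail is cut off, and with the notebook's MINLEN of 140 this short toy read would then be discarded entirely.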

In this case, I opted to use a quick bash loop. The `for` line lists the raw files directory and pulls out the sample name from each FASTQ file name. We can then use that name to build each part of the Trimmomatic command.

In [None]:
%%bash
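mkdir -p trimmed
# Strip the directory and the _1/_2.fastq.gz suffix so only the sample name remains
for prefix in $(ls raw_files/*.fastq.gz | sed -r 's|^raw_files/||; s/_[12][.]fastq[.]gz$//' | uniq)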
do
    trimmomatic PE raw_files/${prefix}_1.fastq.gz raw_files/${prefix}_2.fastq.gz \
    trimmed/${prefix}_trimmed_1_paired.fastq.gz trimmed/${prefix}_trimmed_1_unpaired.fastq.gz \
    trimmed/${prefix}_trimmed_2_paired.fastq.gz trimmed/${prefix}_trimmed_2_unpaired.fastq.gz \
    ILLUMINACLIP:raw_files/adapters.fasta:2:40:15 \
    LEADING:2 TRAILING:2 \
    SLIDINGWINDOW:4:2 \
    MINLEN:140
done
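
The sample name extraction in the loop above can also be sketched in Python, which makes the pattern easier to test on its own. This is an illustrative equivalent, assuming file names of the form `<sample>_1.fastq.gz` and `<sample>_2.fastq.gz`; `sample_prefixes` is a hypothetical helper.

```python
import re
from pathlib import Path


def sample_prefixes(filenames):
    """Return sorted unique sample names from paths like raw_files/SRR7261718_1.fastq.gz."""
    prefixes = set()
    for name in filenames:
        # Drop the directory, then match <sample>_1.fastq.gz or <sample>_2.fastq.gz.
        match = re.fullmatch(r"(.+)_[12]\.fastq\.gz", Path(name).name)
        if match:
            prefixes.add(match.group(1))
    return sorted(prefixes)


files = [
    "raw_files/SRR7261718_1.fastq.gz",
    "raw_files/SRR7261718_2.fastq.gz",
    "raw_files/SRR7261719_1.fastq.gz",
    "raw_files/SRR7261719_2.fastq.gz",
]
prefixes = sample_prefixes(files)
print(prefixes)
```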

And then do a QC check on the samples again to see how they are after trimming!

In [None]:
%%bash
for file in trimmed/*.fastq.gz
do 
    # $file already includes the trimmed/ directory from the glob above
    fastqc "$file"
done

And with that you have checked the quality of your samples, removed low quality bases and adapter contamination, and can show the (hopefully) high-quality input for your secondary analysis!
We also introduce a tool called [MultiQC](https://multiqc.info/), a great QC aggregation tool. Play around with it a bit: after running it you can download the MultiQC report, open it in a browser, and view the results aggregated instead of opening all the individual files.

In [None]:
%%bash
multiqc trimmed/.

## Secondary analysis
Secondary analysis is the data reduction step of an NGS workflow. For RNA-sequencing specifically, this is where we align the reads to the reference and assess the quality of alignment. There are many different aligners out there, but the gold standard for RNA-sequencing has been [STAR](https://github.com/alexdobin/STAR). This aligner is accurate, fast, and splice-aware (for applicable organisms).

### Build STAR reference index
The first step in performing STAR alignment is generating a reference index. The STAR algorithm uses a seed searching technique on an uncompressed suffix array, so the reference must be indexed first.

During the notebook initialization, the reference genome and transcript tracks should have been downloaded into the references directory as well. As the reference, we will be using the [BW25113 strain of *E. coli*](https://www.ncbi.nlm.nih.gov/nuccore/749300132).

In [None]:
%%bash
mkdir -p star
STAR --runMode genomeGenerate \
    --sjdbGTFfile references/ecoli_genes.gff \
    --sjdbGTFtagExonParentTranscript Parent \
    --genomeDir star/ \
    --genomeFastaFiles references/NZ_CP009273.1.fa
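
One tuning point for small genomes: the STAR manual recommends scaling `--genomeSAindexNbases` (default 14) down to min(14, log2(GenomeLength)/2 - 1). The command above does not set it, so treat the following as an optional sketch in Python of that formula. The genome length of roughly 4.6 Mb for *E. coli* is an approximation used for illustration, and `sa_index_nbases` is a hypothetical helper.

```python
import math


def sa_index_nbases(genome_length):
    """Suggested --genomeSAindexNbases value: min(14, log2(GenomeLength)/2 - 1), rounded down."""
    return min(14, int(math.log2(genome_length) / 2 - 1))


# Approximate E. coli genome size (assumption); check the FASTA for the exact length.
ecoli_length = 4_600_000
print(sa_index_nbases(ecoli_length))
```

If you choose to apply it, the resulting value would be passed to the genome generation step as `--genomeSAindexNbases <value>`.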

### Align reads to reference
Now that your reference is built appropriately for STAR, you can align your reads to it. This will be the longest step of the processing. Below is a standard command for calling STAR. There are a variety of additional parameters you can pass to tweak its operation. These are generally for more advanced applications and involve altering the seeding, extension, and scoring. Here, though, we are going to make a directory for the output, pull our sample names into a loop again, and align each sample. The last step is to index the output file as well, which makes it easier to import into visualization tools.

In [None]:
%%bash
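mkdir -p aligned
# Take sample names from the paired R1 files only, without directory or suffix
for prefix in $(ls trimmed/*_trimmed_1_paired.fastq.gz | sed -r 's|^trimmed/||; s/_trimmed_1_paired[.]fastq[.]gz$//')
do
    STAR --genomeDir star/ \
            --sjdbGTFfile references/ecoli_genes.gff \
            --readFilesIn trimmed/${prefix}_trimmed_1_paired.fastq.gz trimmed/${prefix}_trimmed_2_paired.fastq.gz \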
            --twopassMode Basic \
            --outWigType bedGraph \
            --outSAMtype BAM SortedByCoordinate \
            --readFilesCommand zcat \
            --outFileNamePrefix aligned/${prefix}
    samtools index aligned/${prefix}Aligned.sortedByCoord.out.bam
done

### Post alignment QC
As has been mentioned several times, Garbage in=Garbage out. This is where we can evaluate the output quality in more depth to ensure that the different samples were successful throughout the entire workflow. We will use several tools to generate different aspects of QC, and then pull them together using the MultiQC tool.

## Tertiary analysis
#DESeq2 steps

## Quaternary analysis
#Plotting?
#GO