## Getting set up

In [1]:
mkdir -p /mnt/storage/$USER/jupyternotebooks/RNA-seq
cd /mnt/storage/$USER/jupyternotebooks/RNA-seq

In [2]:
vdb-config -s /repository/user/cache-disabled=true

In [12]:
fastq-dump --split-files SRR9908384 SRR9908385 SRR9908386 SRR9908387 

Rejected 33468246 READS because READLEN < 1
Read 33468246 spots for SRR9908384
Written 33468246 spots for SRR9908384
Rejected 30176403 READS because READLEN < 1
Read 30176403 spots for SRR9908385
Written 30176403 spots for SRR9908385
2020-11-05T19:00:14 fastq-dump.2.9.6 sys: timeout exhausted while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed )
Rejected 28272372 READS because READLEN < 1
Read 28272372 spots for SRR9908386
Written 28272372 spots for SRR9908386
Rejected 34131697 READS because READLEN < 1
Read 34131697 spots for SRR9908387
Written 34131697 spots for SRR9908387
Read 126048718 spots total
Written 126048718 spots total
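
The log above includes a network timeout during the third download, so it may be worth confirming that each FASTQ file holds as many reads as `fastq-dump` reported. The following is a minimal sketch, assuming standard four-line FASTQ records; `count_fastq_reads` is a hypothetical helper name, not part of the SRA toolkit.

```shell
# Count reads in a FASTQ file, assuming each record spans exactly four lines.
# count_fastq_reads is a hypothetical helper, not part of the SRA toolkit.
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Print the read count for every FASTQ file in the current directory, if any.
for fq in *.fastq; do
    [ -f "$fq" ] || continue
    printf '%s\t%s\n' "$fq" "$(count_fastq_reads "$fq")"
done
```

A fractional count would suggest a truncated file, since the line count would then not be a multiple of four.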


# Quality control using FastQC
In this section I perform quality control for each of the runs and analyse the output. First I run FastQC: 

In [13]:
/usr/bin/fastqc -o . *.fastq

Started analysis of SRR9908384_1.fastq
Approx 5% complete for SRR9908384_1.fastq
Approx 10% complete for SRR9908384_1.fastq
Approx 15% complete for SRR9908384_1.fastq
Approx 20% complete for SRR9908384_1.fastq
Approx 25% complete for SRR9908384_1.fastq
Approx 30% complete for SRR9908384_1.fastq
Approx 35% complete for SRR9908384_1.fastq
Approx 40% complete for SRR9908384_1.fastq
Approx 45% complete for SRR9908384_1.fastq
Approx 50% complete for SRR9908384_1.fastq
Approx 55% complete for SRR9908384_1.fastq
Approx 60% complete for SRR9908384_1.fastq
Approx 65% complete for SRR9908384_1.fastq
Approx 70% complete for SRR9908384_1.fastq
Approx 75% complete for SRR9908384_1.fastq
Approx 80% complete for SRR9908384_1.fastq
Approx 85% complete for SRR9908384_1.fastq
Approx 90% complete for SRR9908384_1.fastq
Approx 95% complete for SRR9908384_1.fastq
Analysis complete for SRR9908384_1.fastq
Started analysis of SRR9908385_1.fastq
Approx 5% complete for SRR9908385_1.fastq
Approx 10% complete for

Each of the fastqc.html files is provided in the archive folder alongside the notebooks. I'll add relevant screenshots here and discuss the results. For the analysis of the FastQC results, I drew heavily on this [tutorial](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/#:~:text=The%20output%20from%20FastQC%2C%20after,or%20%E2%80%9CFail%E2%80%9D%20is%20assigned.).

The results were very similar for each run. The list of FastQC criteria and the pass or fail result is given below:

![QC-summary](RNA-seq/QC-plots/QC-summary.png "QC-summary")
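
The same pass/fail overview can be pulled out of the FastQC archives as text. This sketch assumes FastQC wrote `*_fastqc.zip` files to the current directory and that each archive contains a `summary.txt` with tab-separated status, module and file name; `list_fastqc_flags` is a hypothetical helper name.

```shell
# List every FastQC module that did not PASS, reading summary.txt text on stdin.
# Each input line is assumed to be: STATUS <tab> MODULE <tab> FILENAME.
list_fastqc_flags() {
    awk -F'\t' '$1 != "PASS" { print $3 "\t" $1 "\t" $2 }' | sort
}

# Stream the summary out of each FastQC archive without unpacking it.
for zip in *_fastqc.zip; do
    [ -f "$zip" ] || continue
    unzip -p "$zip" '*/summary.txt'
done | list_fastqc_flags
```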

## Per-base sequence quality
Quality scores across all bases are given in the image below. The quality of the reads is lower for the first six base pairs. It is normal with all Illumina sequencers for the median quality score to start out lower over the first 5-7 bases and then rise. The blue line is the mean quality score at each base position/window. The per-base quality graph for each of these runs is very good.

![PBSQ](RNA-seq/QC-plots/PBSQ.png "Per-base sequence quality") 
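
To make the blue line concrete, the mean quality at each position can be recomputed directly from the FASTQ file. This is a rough sketch, assuming Phred+33 quality encoding and four-line records; `mean_quality_per_base` is a hypothetical helper, and FastQC itself may bin positions differently.

```shell
# Mean Phred quality at each read position.
# Assumes Phred+33 encoding and four-line FASTQ records.
mean_quality_per_base() {
    awk '
        # Build a character -> ASCII code lookup table.
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        # Every fourth line is a quality string.
        NR % 4 == 0 {
            for (p = 1; p <= length($0); p++) {
                sum[p] += ord[substr($0, p, 1)] - 33
                n[p]++
            }
            if (length($0) > max) max = length($0)
        }
        END { for (p = 1; p <= max; p++) printf "%d\t%.1f\n", p, sum[p] / n[p] }
    ' "$1"
}

# Usage: mean_quality_per_base SRR9908384_1.fastq | head
```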

## Per-sequence quality scores
This plots the number of reads against the average quality score over the full length of each read.

In a high-quality run, the distribution of average read quality should be fairly tight in the upper range of the plot, as is the case here.

![PSQS](RNA-seq/QC-plots/PSQS.png "Per-sequence quality scores") 

## Per-base sequence content
This plot reports the percentage of bases called for each of the four nucleotides at each position across all reads in the file. With most RNA-Seq library preparation protocols there is a clear non-uniform distribution of bases over the first 10-15 nucleotides. This is normal and expected, depending on the type of library kit used. RNA-Seq data showing this non-uniform base composition will always be classified as failed by FastQC for this module even though the sequence is perfectly good, which is the case here.

![PBSC](RNA-seq/QC-plots/PBSC.png "Per-base sequence content")

## Per-sequence GC content
This gives a plot of the number of reads vs. GC% per read. The expectation is that the GC content of all reads should form a normal distribution with the peak of the curve at the mean GC content for the organism sequenced.
The plot given below shows that this run fits the expected distribution. 

![PSGC](RNA-seq/QC-plots/PSGC.png "Per-sequence GC content")

If the observed distribution deviates too far from the theoretical, FastQC will call a fail. However, this call can often be ignored. For example, in RNA sequencing there may be a greater or lesser distribution of mean GC content among transcripts causing the observed plot to be wider or narrower than an idealised normal distribution.
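
The observed GC distribution can be tabulated straight from the reads. The sketch below assumes four-line FASTQ records and rounds each read's GC% to the nearest integer; `gc_histogram` is a hypothetical helper name.

```shell
# Histogram of per-read GC content (rounded to whole percent).
# Output columns: GC%, number of reads.
gc_histogram() {
    awk '
        # Every second line of a four-line record is the sequence.
        NR % 4 == 2 && length($0) > 0 {
            seq = $0
            gc = gsub(/[GCgc]/, "", seq)             # number of G/C bases
            pct = int(100 * gc / length($0) + 0.5)   # rounded GC%
            hist[pct]++
        }
        END { for (p = 0; p <= 100; p++) if (p in hist) print p "\t" hist[p] }
    ' "$1"
}

# Usage: gc_histogram SRR9908384_1.fastq
```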

## Sequence length distribution
This graph shows the distribution of read lengths. The sequencer used generates uniform-length reads of 50 bp, as the plot below shows.

![SLD](RNA-seq/QC-plots/SLD.png "Sequence length distribution")
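
The uniform read length can also be checked from the command line. This is a small sketch, assuming four-line FASTQ records; `read_length_distribution` is a hypothetical helper name.

```shell
# Tabulate read lengths in a FASTQ file.
# Output columns: read length, number of reads.
read_length_distribution() {
    awk 'NR % 4 == 2 { len[length($0)]++ }
         END { for (l in len) print l "\t" len[l] }' "$1" | sort -n
}

# Usage: read_length_distribution SRR9908384_1.fastq
```

For uniform-length data, this should print a single row.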

## Sequence duplication levels
This plot shows the percentage of reads of a given sequence that are present a given number of times in the file. There are generally two sources of duplicate reads:

 - PCR duplication, in which library fragments have been over-represented due to biased PCR enrichment
 - Truly over-represented sequences, such as very abundant transcripts in an RNA-Seq library
 
PCR duplication is a concern because PCR duplicates misrepresent the true proportion of sequences in your starting material. The latter is an expected case and not of concern because it does faithfully represent your input.

When sequencing RNA there will be some very highly abundant transcripts and some lowly abundant. It is expected that duplicate reads will be observed for high abundance transcripts. The sequence duplication plot below was called as Failed by FastQC even though the duplication is expected in this case.

![SDL](RNA-seq/QC-plots/SDL.png "Sequence duplication levels")
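
A simplified version of this measurement is to count how many distinct sequences occur once, twice, three times and so on. The sketch below is not FastQC's exact algorithm (FastQC estimates duplication from a subset of the file, so the numbers will differ); `duplication_levels` is a hypothetical helper name.

```shell
# Count distinct sequences at each duplication level.
# Output columns: number of copies, number of distinct sequences.
duplication_levels() {
    awk 'NR % 4 == 2' "$1" | sort | uniq -c |
        awk '{ level[$1]++ } END { for (k in level) print k "\t" level[k] }' |
        sort -n
}

# Usage on a subsample, since sorting a whole run is slow:
#   head -n 400000 SRR9908384_1.fastq > sample.fastq
#   duplication_levels sample.fastq
```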

## Over-represented sequences
There were no over-represented sequences. 

## Adapter content
Ideally, Illumina sequence data should not contain any adapter sequence. However, when using long read lengths, some of the library inserts may be shorter than the read length, resulting in read-through to the adapter at the 3’ end of the read. This is more likely to occur with RNA-Seq libraries, where the distribution of library insert sizes is more varied and likely to include some short inserts. In the QC performed here, no adapters were identified, so no trimming is needed.

## K-mer content
This module measures the count of each short nucleotide sequence of length _k_ (default = 7) starting at each position along the read. Any given k-mer should be evenly represented across the length of the read. A list of k-mers that appear at specific positions with greater than expected frequency is reported.

As with the sequence duplication module described above, RNA-seq libraries may have highly represented k-mers that are derived from highly expressed sequences. The plot given below highlights this fact. The k-mer content being flagged as Failed is therefore not an issue.

![KMC](RNA-seq/QC-plots/KC.png "K-mer content")

# Mapping to the genome
We will use STAR to map the reads to the genome. STAR needs a prebuilt database of the human genome and its gene annotation; we will use hg19.

In [18]:
STAR --genomeDir /mnt/nfs/mfiers/STAR/hg19_star_db \
     --genomeLoad NoSharedMemory \
     --runThreadN 2 \
     --readFilesIn SRR9908384_1.fastq \
     --outFileNamePrefix YT1.

STAR --genomeDir /mnt/nfs/mfiers/STAR/hg19_star_db \
     --genomeLoad NoSharedMemory \
     --runThreadN 2 \
     --readFilesIn SRR9908385_1.fastq \
     --outFileNamePrefix YT2.
     
STAR --genomeDir /mnt/nfs/mfiers/STAR/hg19_star_db \
     --genomeLoad NoSharedMemory \
     --runThreadN 2 \
     --readFilesIn SRR9908386_1.fastq \
     --outFileNamePrefix YS1.
     
STAR --genomeDir /mnt/nfs/mfiers/STAR/hg19_star_db \
     --genomeLoad NoSharedMemory \
     --runThreadN 2 \
     --readFilesIn SRR9908387_1.fastq \
     --outFileNamePrefix YS2.

Nov 13 14:30:47 ..... started STAR run
Nov 13 14:30:47 ..... loading genome
Nov 13 14:31:14 ..... started mapping
Nov 13 14:37:47 ..... finished successfully
Nov 13 14:37:48 ..... started STAR run
Nov 13 14:37:48 ..... loading genome
Nov 13 14:38:06 ..... started mapping
Nov 13 14:43:51 ..... finished successfully
Nov 13 14:43:52 ..... started STAR run
Nov 13 14:43:52 ..... loading genome
Nov 13 14:44:09 ..... started mapping
Nov 13 14:49:32 ..... finished successfully
Nov 13 14:49:32 ..... started STAR run
Nov 13 14:49:32 ..... loading genome
Nov 13 14:49:50 ..... started mapping
Nov 13 14:56:33 ..... finished successfully
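
The four STAR calls above differ only in the input file and the output prefix, so they can be driven from a small table. This sketch reuses the flags and paths from the commands above; the `run_star` function and the `DRY_RUN` variable are hypothetical names introduced here, and by default the loop only prints the commands instead of running them.

```shell
# STAR index directory, as used in the commands above.
GENOME_DIR=/mnt/nfs/mfiers/STAR/hg19_star_db

# Print (DRY_RUN=1, the default here) or execute one STAR call per sample.
run_star() {
    run=$1
    prefix=$2
    set -- STAR --genomeDir "$GENOME_DIR" \
                --genomeLoad NoSharedMemory \
                --runThreadN 2 \
                --readFilesIn "${run}_1.fastq" \
                --outFileNamePrefix "${prefix}."
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$@"; else "$@"; fi
}

# Run accession and output prefix for each sample.
while read -r run prefix; do
    run_star "$run" "$prefix"
done <<'EOF'
SRR9908384 YT1
SRR9908385 YT2
SRR9908386 YS1
SRR9908387 YS2
EOF
```

Setting `DRY_RUN=0` before the loop would execute the commands rather than print them.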


To be able to quickly access the data we must sort the SAM file and convert it to a BAM file. The last step is to create an index for quick access.

We use **samtools** to do this. The first step sorts and outputs bam:

In [22]:
samtools sort -o YT1.bam YT1.Aligned.out.sam
samtools sort -o YT2.bam YT2.Aligned.out.sam
samtools sort -o YS1.bam YS1.Aligned.out.sam
samtools sort -o YS2.bam YS2.Aligned.out.sam

[bam_sort_core] merging from 9 files and 1 in-memory blocks...
[bam_sort_core] merging from 8 files and 1 in-memory blocks...
[bam_sort_core] merging from 7 files and 1 in-memory blocks...
[bam_sort_core] merging from 9 files and 1 in-memory blocks...


Use samtools to generate an index:

In [23]:
samtools index YT1.bam
samtools index YT2.bam
samtools index YS1.bam
samtools index YS2.bam

With `samtools idxstats` we can see how many reads map to each chromosome:

In [26]:
samtools idxstats YT1.bam

chrM	16571	931723	0
chr1	249250621	4147308	0
chr2	243199373	2927265	0
chr3	198022430	1847308	0
chr4	191154276	1214587	0
chr5	180915260	1987579	0
chr6	171115067	2258192	0
chr7	159138663	1950961	0
chr8	146364022	1123584	0
chr9	141213431	1636851	0
chr10	135534747	1172177	0
chr11	135006516	1925142	0
chr12	133851895	2601630	0
chr13	115169878	645595	0
chr14	107349540	1057458	0
chr15	102531392	1405627	0
chr16	90354753	1724400	0
chr17	81195210	2789356	0
chr18	78077248	1151075	0
chr19	59128983	2739025	0
chr20	63025520	878371	0
chr21	48129895	441500	0
chr22	51304566	834969	0
chrX	155270560	1336723	0
chrY	59373566	77887	0
*	0	0	0
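
Raw counts are easier to compare as percentages. In the table above, for example, chrM accounts for 931723 of 40806293 alignments, or about 2.3%. The sketch below assumes the four-column idxstats layout shown above (name, length, mapped, unmapped); `idxstats_percent` is a hypothetical helper name.

```shell
# Percentage of mapped alignments per reference sequence.
# Reads idxstats output on stdin; columns: name, length, mapped, unmapped.
idxstats_percent() {
    awk -F'\t' '
        # Skip the "*" row, which holds reads without a reference.
        $1 != "*" { name[++n] = $1; mapped[n] = $3; total += $3 }
        END {
            if (total == 0) exit
            for (i = 1; i <= n; i++)
                printf "%s\t%d\t%.2f\n", name[i], mapped[i], 100 * mapped[i] / total
        }'
}

# Usage (requires samtools and the indexed BAM):
#   samtools idxstats YT1.bam | idxstats_percent | sort -k3,3nr | head
```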


Running samtools flagstat tells us what the distribution of mapping flags (column 2 in the sam/bam file) is:

In [28]:
samtools flagstat YT1.bam

40806293 + 0 in total (QC-passed reads + QC-failed reads)
8213107 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
40806293 + 0 mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
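
The 100% mapped figure should be read with care: by default STAR leaves unmapped reads out of its SAM output, so every record in the file is an alignment. The total also includes secondary alignments of multi-mapping reads. Subtracting them gives 40806293 - 8213107 = 32593186 primary alignments, roughly 97% of the 33468246 reads in this run. The sketch below does that subtraction from flagstat text; `primary_alignments` is a hypothetical helper name.

```shell
# Number of primary alignments from samtools flagstat output on stdin:
# total minus secondary minus supplementary records.
primary_alignments() {
    awk '
        $4 == "in" && $5 == "total" { total = $1 }
        $4 == "secondary"           { secondary = $1 }
        $4 == "supplementary"       { supplementary = $1 }
        END { print total - secondary - supplementary }
    '
}

# Usage (requires samtools):
#   samtools flagstat YT1.bam | primary_alignments
```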


# Converting Reads to Gene Counts
First we make a symbolic link to the annotation file:

In [3]:
ln -sf /mnt/storage/data/RNA-seq/gencode.v19.nopseudo.plus.sort.gtf .

Then we run `featureCounts` on each of the BAM files.

In [30]:
featureCounts \
    -Q 10 \
    -g gene_name \
    -a /mnt/nfs/data/RNA-seq/gencode.v19.nopseudo.plus.sort.gtf \
    -o all.counts \
    YT1.bam YT2.bam YS1.bam YS2.bam


        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
	  v1.6.0

||                                                                            ||
||             Input files : 4 BAM files                                      ||
||                           S YT1.bam                                        ||
||                           S YT2.bam                                        ||
||                           S YS1.bam                                        ||
||                           S YS2.bam                                        ||
||

The first six columns of the featureCounts output describe the gene structure (Geneid, Chr, Start, End, Strand, Length), and the remaining columns hold the counts for each BAM file. We'll separate these for use later:

In [31]:
cut -f-6 all.counts > all.genedata.tsv

In [32]:
cut -f1,7- all.counts | grep -v '^#' > all.gene.counts
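
A quick sanity check on the resulting table is to total the counts per sample. This is a minimal sketch, assuming `all.gene.counts` is tab-separated with one header row and the gene identifier in the first column; `library_sizes` is a hypothetical helper name.

```shell
# Sum the counts in each sample column of a tab-separated count table
# whose first row is a header and whose first column is the gene identifier.
library_sizes() {
    awk -F'\t' '
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; ncol = NF; next }
        { for (i = 2; i <= NF; i++) sum[i] += $i }
        END { for (i = 2; i <= ncol; i++) printf "%s\t%d\n", name[i], sum[i] }
    ' "$1"
}

# Report totals for the table produced above, if it exists.
if [ -f all.gene.counts ]; then
    library_sizes all.gene.counts
fi
```

Large differences between samples here would be worth keeping in mind when normalising the counts later.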