# RNA-seq exercises
For this exercise, we will go through the main steps of differential expression analysis, including:
<br>
a) Retrieving sample information<br>
b) Alignment to a reference genome<br>
c) Read count<br>
d) Statistical analyses<br>

### General sample information
Use SRA (Sequence Read Archive; https://www.ncbi.nlm.nih.gov/sra) to retrieve information about
the two samples with the following accession numbers: SRR057656 and SRR057642. Each
accession number has its own page with further information and useful links (e.g. the GEO
Web Link).
Using the information you find, try to answer the following questions:

**Q1: For which study were they generated?** <br>



**Q2: How many bases were generated for each run?** <br>


**Q3: Which sample is the case and which sample is the control?**<br>


**Q4: Is it single-end or paired end sequencing?**<br>


**Q5: What can you tell about the extraction method and the data generation of both samples?**<br>



**Read alignment**

Now we know what kind of samples we are analyzing. Because computing resources and time are limited,
we will assume these files have already been filtered and trimmed according to what you
learnt in the previous lectures. Moreover, we will consider only chromosome 9 (chr9), which contains the
PCA3 gene. This gene produces a spliced, long non-coding RNA that is highly overexpressed in most
types of prostate cancer cells and is used as a specific biomarker for this type of cancer through the
analysis of peripheral blood and urine.

All the files you will need are in the following directory:

*/exercises/rnaseq2*

Because the samples come from paired-end sequencing, you will see two fastq files for each sample.
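A quick sanity check on paired-end input is to confirm that both files hold the same number of records and that the read names match record by record. The sketch below assumes standard four-line FASTQ records and read names that may carry a `/1` or `/2` suffix. The helper names and the toy records are made up for illustration and are not taken from the course data.

```python
import io
from itertools import zip_longest

def read_names(handle):
    """Yield the read name of every four-line FASTQ record in `handle`."""
    for line_number, line in enumerate(handle):
        # The first line of each four-line record is the header.
        if line_number % 4 == 0 and line.strip():
            name = line.split()[0].lstrip("@")
            # Drop a trailing /1 or /2 mate suffix, if present.
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name

def check_pairs(handle1, handle2):
    """Return the number of pairs; raise ValueError if the files are out of sync."""
    pairs = 0
    for name1, name2 in zip_longest(read_names(handle1), read_names(handle2)):
        if name1 != name2:
            raise ValueError(f"mate mismatch at pair {pairs + 1}: {name1} vs {name2}")
        pairs += 1
    return pairs

# Toy records (hypothetical, not from the course data).
mate1 = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nGGCA\n+\nIIII\n")
mate2 = io.StringIO("@read1/2\nTTGA\n+\nIIII\n@read2/2\nCCAT\n+\nIIII\n")
n_pairs = check_pairs(mate1, mate2)
print(n_pairs)
```

With the real files, open both fastq files with `open(path)` and pass the handles in place of the `StringIO` objects.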

[HISAT2](https://daehwankimlab.github.io/hisat2/) is the successor to TopHat2. To use it, we first need to index our reference (chr9).

1) Let us create a symbolic link to the reference sequence for chromosome 9:


In [None]:
!mkdir ex_rnaseq
%cd ex_rnaseq
!ln -s /exercises/rnaseq2/GRCh38_Chr9.fasta ./GRCh38_Chr9.fasta # command to make symbolic link
!pwd
!ls

2) Let's build the indexed reference:


In [None]:
!hisat2-build GRCh38_Chr9.fasta GRCh38_Chr9


Now we are ready to do the alignment. Choose one of the two samples, e.g. the affected one. You do not need to copy the fastq files into your directory; simply refer to them in the directory where they are stored.

In [None]:
%%bash
hisat2 -x GRCh38_Chr9 -1 /exercises/rnaseq2/SRR057642_chr9-1.fastq \
    -2 /exercises/rnaseq2/SRR057642_chr9-2.fastq > affected.sam

Look at the output printed on the screen after the alignment:

<b>Q6: How many reads were paired?</b>



<b>Q7: How many read pairs map to multiple sites?</b>


<b>Q8: How many read pairs were mapped exactly one time?</b>


<b>Q9: What is the overall alignment rate?</b>
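The summary that HISAT2 prints by default follows the Bowtie 2 layout, and the overall alignment rate can be recomputed from the other lines: every concordantly or discordantly aligned pair contributes two aligned mates, and mates from otherwise unaligned pairs that align on their own contribute one each. The sketch below assumes that layout. The numbers are illustrative and are NOT the output for this sample, and `leading_count` is a made-up helper name.

```python
import re

# Illustrative summary in the Bowtie 2 / HISAT2 style (not from this exercise).
summary = """\
10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
    ----
    650 pairs aligned concordantly 0 times; of these:
      34 (5.23%) aligned discordantly 1 time
    ----
    616 pairs aligned 0 times concordantly or discordantly; of these:
      1232 mates make up the pairs; of these:
        660 (53.57%) aligned 0 times
        571 (46.35%) aligned exactly 1 time
        1 (0.08%) aligned >1 times
96.70% overall alignment rate
"""

def leading_count(label, text):
    """Return the integer at the start of the first line containing `label`."""
    pattern = r"^[ \t]*(\d+)\b.*" + re.escape(label)
    match = re.search(pattern, text, flags=re.MULTILINE)
    if match is None:
        raise ValueError(f"line not found: {label}")
    return int(match.group(1))

pairs = leading_count("were paired", summary)
concordant_once = leading_count("aligned concordantly exactly 1 time", summary)
concordant_multi = leading_count("aligned concordantly >1 times", summary)
discordant = leading_count("aligned discordantly 1 time", summary)
# Mates of pairs that did not align as a pair but aligned individually.
single_once = leading_count(") aligned exactly 1 time", summary)
single_multi = leading_count(") aligned >1 times", summary)

aligned_mates = 2 * (concordant_once + concordant_multi + discordant)
aligned_mates += single_once + single_multi
rate = 100 * aligned_mates / (2 * pairs)
print(f"{rate:.2f}% overall alignment rate")
```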


We then convert the SAM file to a BAM file, which will be used for the next analyses. A BAM file takes up less storage than the equivalent SAM file. While converting, we will also sort the file. It should be sorted not by position but by read name, so that the two reads of each pair are consecutive (necessary for downstream analyses).


In [None]:
!samtools view -b affected.sam | samtools sort -n -o affected.bam -
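
Name sorting matters because the counting step expects the two mates of a pair on adjacent records. One way to verify this is to stream the alignments as SAM text (for example from `samtools view affected.bam`) and confirm that every read name appears in a single contiguous run. The sketch below runs on hand-written, truncated toy records; the function name and the records are made up.

```python
def is_name_grouped(sam_lines):
    """Return True if all records sharing a read name are adjacent."""
    seen = set()
    previous = None
    for line in sam_lines:
        # Skip header lines (they start with '@') and empty lines.
        if line.startswith("@") or not line.strip():
            continue
        name = line.split("\t", 1)[0]  # QNAME is the first SAM column
        if name != previous:
            if name in seen:
                return False  # this name already appeared in an earlier run
            seen.add(name)
            previous = name
    return True

# Truncated toy records: QNAME, FLAG, RNAME, POS only (hypothetical values).
name_sorted = [
    "@HD\tVN:1.0\tSO:queryname",
    "r1\t99\tchr9\t100",
    "r1\t147\tchr9\t300",
    "r2\t99\tchr9\t150",
    "r2\t147\tchr9\t350",
]
position_sorted = [
    "r1\t99\tchr9\t100",
    "r2\t99\tchr9\t150",
    "r1\t147\tchr9\t300",
    "r2\t147\tchr9\t350",
]
print(is_name_grouped(name_sorted))
print(is_name_grouped(position_sorted))
```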

**Read count**

Now we are ready to count how many reads map to each gene/transcript. For this, we will use the htseq-count tool from [HTSeq](https://htseq.readthedocs.io/en/release_0.10.0/). 

Type *htseq-count --help* to see what options you can use. 

HTSeq needs a BAM file with the RNA alignments as input (which we just made) and a gtf or gff file that contains 
information on where the different exons and transcripts are on chr9. We already have a gff file for chr9 in the rnaseq2 folder, so let's create a symbolic link to the gff file using the ln -s command as before. 

Try to look at it (using the command “head”). 

You will see that it contains information on the transcripts and the different exons.
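Each feature line in a GFF3 file has nine tab-separated columns, and the ninth holds `key=value` pairs separated by semicolons. The `--idattr=gene` option passed to htseq-count later in this exercise selects which of those keys names the feature a read is counted for. The sketch below splits one line into its parts. The example line is invented, including its coordinates, and only mimics the layout, so the real `chr9.gff` may use different keys.

```python
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]

def parse_gff_line(line):
    """Turn one GFF3 feature line into a dict, with attributes as a nested dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(GFF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9: semicolon-separated key=value pairs.
    record["attributes"] = dict(
        pair.split("=", 1) for pair in record["attributes"].split(";") if pair
    )
    return record

# Invented example line (not copied from chr9.gff).
example = "chr9\tExample\texon\t1000\t1200\t.\t+\t.\tID=exon-1;Parent=rna-1;gene=PCA3"
record = parse_gff_line(example)
print(record["type"], record["start"], record["end"], record["attributes"]["gene"])
```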


In [None]:
!ln -s /exercises/rnaseq2/chr9.gff . # creates symbolic link

!head chr9.gff


<br>HTSeq will output the counts in two columns with <i>column1=gene</i> and <i>column2=#reads</i>. We will redirect this output to a file called affected.counts.txt, or control.counts.txt if you are using the control sample. <br>

In [None]:
! htseq-count -f bam -s yes -m intersection-strict --idattr=gene affected.bam chr9.gff > affected.counts.txt

<br>The two-column output file contains the number of reads aligned to each genomic feature on chromosome 9. You can inspect this output with standard UNIX tools. <br>
E.g.:
    
<i>head affected.counts.txt</i>

<i>tail affected.counts.txt</i>


<b>Q10: How many total reads are counted in the genomic features? (hint: using for example grep and awk is one way to get this information)</b>


<b>Q11: What is the raw read count for the PCA3 gene in your sample?</b>
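As an alternative to grep and awk, the same two questions can be answered in Python. htseq-count appends summary rows whose names start with two underscores (such as `__no_feature` and `__ambiguous`); these are reads that were not assigned to any feature, so they are excluded from the total. The sketch below uses a small invented table in place of `affected.counts.txt`. All counts, and the gene names other than PCA3, are placeholders.

```python
import io

def load_counts(handle):
    """Split htseq-count output into per-gene counts and '__' summary rows."""
    genes, summary = {}, {}
    for line in handle:
        if not line.strip():
            continue
        name, value = line.rstrip("\n").split("\t")
        target = summary if name.startswith("__") else genes
        target[name] = int(value)
    return genes, summary

# Invented stand-in for affected.counts.txt.
toy = io.StringIO(
    "GENE_A\t12\n"
    "GENE_B\t0\n"
    "PCA3\t5\n"
    "__no_feature\t40\n"
    "__ambiguous\t3\n"
    "__too_low_aQual\t0\n"
    "__not_aligned\t7\n"
    "__alignment_not_unique\t2\n"
)
genes, summary = load_counts(toy)
total_assigned = sum(genes.values())  # reads counted in features
print(total_assigned)
print(genes.get("PCA3"))              # raw count for a single gene
```

For the real file, replace the `StringIO` object with `open("affected.counts.txt")`.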


***STATISTICAL ANALYSES***

We will now perform a differential expression analysis between all the affected and control samples of the project. In this case, read counting was performed using transcript accession numbers from [Ensembl](http://www.ensembl.org/index.html).

Copy the directory "counts" into your working directory, then change into it.



In [None]:
%cd ~/ex_rnaseq/
! cp -r /exercises/rnaseq2/counts .
%cd ~/ex_rnaseq/counts

<b>Since we are working in a Jupyter notebook with a Python environment, it is a bit cumbersome to run R directly. Instead of running the R commands in the notebook, we will run an R script from the bash command line.</b>
    
<b>In the cell underneath we show the contents of the R script that will be used, so that you understand how the differential expression analysis is performed in R.</b>

In the code, we have omitted the ENSG id for the gene PCA3. Identifiers of the form "ENSGxxxxxxxxx" are gene ids from Ensembl. 
To find the id for PCA3, go to the [Ensembl website](https://www.ensembl.org/index.html) and search for PCA3. 

<b> Q12: What is the ENSG identifier for PCA3?</b>

Now, let's run the R code and display the PCA3.jpeg graph.

In [None]:
!cp /home/jupyter-sdc_admin/exercise_day4.R .
! /usr/bin/Rscript exercise_day4.R

In [None]:
from IPython.display import Image

Image(filename='PCA3.jpeg') 

<b>Q13: Are the counts as expected from your knowledge about PCA3? Why?</b>



***Outlier detection:***
    
To check the distribution of the dataset and investigate possible outliers, we can do principal component analysis and hierarchical clustering on the data. The following R code for this has also been added to the R script that we ran through the command line.

We are using the function "rlog" to convert the data to the log2 scale while also stabilizing the variance. Look up the help page by typing "?rlog" in R and read the description of the function.


<i>
dds_rlog = rlog(dds)	
<br>dists = as.matrix(dist(t(assay(dds_rlog))))
    
rownames(dists) = colnames(dists) = colData(dds)$condition
<br>hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100)<br>


<br>z <- plotPCA(dds_rlog, intgroup=c("condition"))<br>
nudge <- position_nudge(y=2)
<br>z + geom_label(aes(label = sample_table$sampleName), position = nudge)<br>


heatmap.2(dists, trace='none', col=rev(hmcol))
</i>
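The first two lines of the R snippet build a sample-to-sample distance matrix: the counts are put on a log2 scale, and the Euclidean distance is taken between every pair of sample columns. The sketch below reproduces that idea in plain Python, using `log2(count + 1)` as a simple stand-in for `rlog` (it does not perform the variance stabilization that `rlog` does). Sample names and counts are made up.

```python
import math

# Toy count table: one list of gene counts per sample (hypothetical values).
counts = {
    "normal_1": [0, 3, 7],
    "normal_2": [0, 3, 7],
    "carcinoma_1": [15, 1, 0],
}

def log2_plus_one(values):
    """Log2-transform counts, adding 1 so that zero counts stay defined."""
    return [math.log2(v + 1) for v in values]

def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

samples = list(counts)
logged = {name: log2_plus_one(values) for name, values in counts.items()}
# Square matrix of pairwise distances, rows and columns in `samples` order.
dists = [[euclidean(logged[a], logged[b]) for b in samples] for a in samples]

for name, row in zip(samples, dists):
    print(name, [round(d, 2) for d in row])
```

Samples with similar expression profiles end up close together (here the two identical toy samples have distance 0), which is what the heatmap and the clustering visualize.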

Run the following cells to see the PCA plot and the heatmap generated by the R script.




In [None]:
from IPython.display import Image

Image(filename='PCAplot.jpeg') 

<b>Q14: Do you see any obvious trends in the PCA plot?</b>


In [None]:
from IPython.display import Image

Image(filename='HM.jpeg') 

<b> Q15: When looking at the hierarchical clustering do you see any clusters? Do you see any possible outliers that would need to be excluded? </b>


***Differential expression***

Let us try to identify differentially expressed genes in the data. To do this, the R script fits a negative binomial GLM that compares the groups specified by the condition factor (carcinoma vs. normal).
The fitting command returns a fitted model object. Below you can see a brief view of the differential expression test results obtained from this object.





In [None]:
import pandas as pd
df1 = pd.read_csv("res.csv")
df1

<b>Q16: Try to understand the content of the different columns of the results dataframe. What is a p-value and what is an adjusted p-value?</b>
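When thousands of genes are tested at once, some small p-values arise by chance alone, so the raw p-values are adjusted for multiple testing. A common adjustment is the Benjamini-Hochberg procedure, which controls the false discovery rate. The function names in the R snippets suggest the script uses DESeq2, where this is the default, but that is an assumption about `exercise_day4.R`. The sketch below implements the procedure on four made-up p-values and ignores the missing values a real results table may contain.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, scaling by n / rank and
    # keeping the adjusted values monotone (and capped at 1).
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]  # made-up p-values
adjusted = benjamini_hochberg(raw)
for p, q in zip(raw, adjusted):
    print(f"p = {p:.3f}  adjusted = {q:.3f}")
```

An adjusted p-value is never smaller than the raw one, which is why fewer genes pass a 5% cut-off on `padj` than on `pvalue`.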


<b>Q17: What is the log2FoldChange column?</b>
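A log2 fold change expresses the ratio between the two conditions on a log2 scale, so +1 means a doubling, -1 a halving, and 0 no change. The sketch below shows the arithmetic on made-up mean counts. The value reported by the R script comes from the fitted model rather than from a plain ratio of means, so it will not match this calculation exactly, and which condition sits in the numerator depends on how the comparison was set up in the script.

```python
import math

def log2_fold_change(mean_case, mean_control):
    """Log2 of the ratio between two (non-zero) mean expression values."""
    return math.log2(mean_case / mean_control)

higher = log2_fold_change(80, 10)  # 8-fold higher in the case group
lower = log2_fold_change(10, 40)   # 4-fold lower in the case group
print(higher, lower)

# Converting back: 2 ** log2FoldChange gives the plain ratio again.
print(2 ** higher, 2 ** lower)
```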



Let us order the results by adjusted p-value, smallest first.




In [None]:
df1.sort_values("padj")

Let us see how many genes are differentially expressed at a false discovery rate of 5%.


In [None]:
## using pandas - we will see more next 

df1[df1['padj']<.05]
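
Filtering on `padj` gives the significant genes, and the sign of `log2FoldChange` then splits them into higher and lower expression relative to the reference condition. The sketch below uses a small hand-made table with the same two column names as `res.csv`. The identifiers and values are placeholders, and rows with a missing `padj` drop out of the comparison automatically.

```python
import pandas as pd

# Hand-made stand-in for the results table (placeholder values).
toy = pd.DataFrame(
    {
        "gene": ["gene_a", "gene_b", "gene_c", "gene_d"],
        "log2FoldChange": [2.5, -1.8, 0.3, 4.0],
        "padj": [0.001, 0.02, 0.60, None],  # None becomes NaN
    }
)

# NaN < 0.05 evaluates to False, so gene_d is not selected.
significant = toy[toy["padj"] < 0.05]
n_up = int((significant["log2FoldChange"] > 0).sum())
n_down = int((significant["log2FoldChange"] < 0).sum())
print(len(significant), n_up, n_down)
```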





<b>Q18: How many genes are differentially expressed at a False Discovery Rate (FDR) of 5%?</b>


<b>Q19: Try to search for those genes in the Ensembl web-site. Which genes are they?</b>



<b>Q20: What are the direction and the magnitude of the fold change, and what is the adjusted p-value for PCA3?</b>

 
<b>Q21: Looking at the plot of PCA3 we made earlier (the plotCounts command in R and Image(filename='PCA3.jpeg') in this notebook), do these results make sense?</b>
