# Functional analysis of gene signatures

In this session we will analyze the differentially expressed genes (DEGs) found by DESeq2. Two replicates of control knockdown (YT) RNA-seq samples were compared with two replicates of UHRF1 knockdown (YS) RNA-seq samples.

The DESeq2 analysis resulted in a deseq.results.tsv file, which we will use here.

## Getting set up

In [19]:
mkdir -p /mnt/storage/$USER/jupyternotebooks/functional_analysis_gene_signatures/
cd /mnt/storage/$USER/jupyternotebooks/functional_analysis_gene_signatures/

In [20]:
ln -sf /mnt/storage/r0773125/jupyternotebooks/RNA-seq/deseq.results.tsv .
ln -sf /mnt/storage/r0773125/jupyternotebooks/RNA-seq/deseq.results.unshrunken.tsv .

## Using arbitrary thresholds to create lists of up- and down-regulated genes
Many people use arbitrary log2 fold change (LFC) and p-value thresholds to determine differentially expressed genes. A standard threshold is an absolute LFC of 1 (a 2-fold change) with a p-value of 0.05. The paper on which I'm basing this analysis uses this arbitrary threshold, and also does not use LFC shrinkage. Below I will first examine differentially expressed genes using this same arbitrary threshold, before using different LFC and p-value rankings to determine the best cut-off.

First, I make text files with up- and down-regulated genes.

In [22]:
# shrunken
awk '$3 != "NA" && $3 > 1 && $7 < 0.05 {print $1}' deseq.results.tsv > up-logFC1-p05.txt
awk '$3 != "NA" && $3 < -1 && $7 < 0.05 {print $1}' deseq.results.tsv > down-logFC1-p05.txt

# unshrunken
awk '$3 != "NA" && $3 > 1 && $7 < 0.05 {print $1}' deseq.results.unshrunken.tsv > up-logFC1-p05-unshrunken.txt
awk '$3 != "NA" && $3 < -1 && $7 < 0.05 {print $1}' deseq.results.unshrunken.tsv > down-logFC1-p05-unshrunken.txt
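
The four commands above differ only in input file, direction, and cut-off, so they can be wrapped in one function. The sketch below is a hypothetical helper (`filter_degs` is not part of the original analysis). It assumes the same column layout as the commands above (column 1 = gene, column 3 = LFC, column 7 = p-value) and additionally assumes a single header line and tab-separated fields.

```shell
# Hypothetical helper: print gene IDs that pass an LFC and p-value cut-off.
# Assumed layout: column 1 = gene, column 3 = LFC, column 7 = p-value,
# tab-separated, with one header line.
filter_degs() {
    # $1 = results file, $2 = "up" or "down", $3 = |LFC| cut-off, $4 = p-value cut-off
    awk -F'\t' -v dir="$2" -v lfc="$3" -v p="$4" '
        NR == 1 { next }                       # skip the header line
        $3 == "NA" || $7 == "NA" { next }      # skip genes without an estimate
        dir == "up"   && $3 + 0 >  lfc && $7 + 0 < p { print $1 }
        dir == "down" && $3 + 0 < -lfc && $7 + 0 < p { print $1 }
    ' "$1"
}

# Example calls, assuming deseq.results.tsv is in the working directory:
# filter_degs deseq.results.tsv up   1 0.05 > up-logFC1-p05.txt
# filter_degs deseq.results.tsv down 1 0.05 > down-logFC1-p05.txt
```

Skipping the header and `NA` rows explicitly means the filter does not depend on how awk compares strings with numbers.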

Check the number of up- and down-regulated genes with and without LFC shrinkage. Without LFC shrinkage there are 771 DEGs (559 up, 212 down). With LFC shrinkage there are 470 DEGs (349 up, 121 down). In the paper there were 829 DEGs in the UHRF1-knockdown cells. This means the paper reports 58 more DEGs than my unshrunken analysis here, when using the same arbitrary LFC threshold and significance cut-off. In the paper DEGs were detected using the edgeR package, so that is potentially the reason for the difference, though it seems unlikely to me. I didn't end up looking much further into this as I wanted to focus on the functional analysis.

In [24]:
# unshrunken
wc -l up-logFC1-p05-unshrunken.txt
wc -l down-logFC1-p05-unshrunken.txt

559 up-logFC1-p05-unshrunken.txt
212 down-logFC1-p05-unshrunken.txt


In [25]:
# shrunken
wc -l up-logFC1-p05.txt
wc -l down-logFC1-p05.txt

349 up-logFC1-p05.txt
121 down-logFC1-p05.txt
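
As a quick check of the totals, the per-file counts can be summed with shell arithmetic. This is only a sketch: the counts are copied by hand from the `wc -l` output in this notebook, the paper's total is the figure quoted in the text, and the variable names are my own.

```shell
# Line counts copied from the wc -l output in this notebook
up_unshrunken=559; down_unshrunken=212
up_shrunken=349;   down_shrunken=121
paper_total=829    # DEG count quoted from the paper

total_unshrunken=$((up_unshrunken + down_unshrunken))
total_shrunken=$((up_shrunken + down_shrunken))

echo "unshrunken DEGs: $total_unshrunken"                             # 771
echo "shrunken DEGs:   $total_shrunken"                               # 470
echo "paper minus unshrunken: $((paper_total - total_unshrunken))"    # 58
```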


Moving forward, I will use the data with LFC shrinkage. With only two replicates each for control and knockdown, I think it's best to be conservative: the shrunken estimates are more conservative and more likely to be representative of a larger sample size.

I understand that this might give me different results to the original publication.

## Humanmine.org
The lists of DEGs were input to HumanMine to determine gene ontology and pathway enrichments.
The outputs from HumanMine are given below.

First let's look at the output for significantly upregulated genes:

![MH-up-1](functional_analysis_gene_signatures/images/HM-up-1.png)

![MH-up-2](functional_analysis_gene_signatures/images/HM-up-2.png)

What can we see here? Firstly, the expected and observed distributions of genes across the chromosomes are similar, which is good. If we look at the enriched biological processes, we see that the most highly enriched processes are visual perception and sensory perception of light stimulus, which is expected as the samples were taken from retinoblastoma tissue. The pathway enrichment analysis also shows that phototransduction is highly enriched.


Now let's look at the output for significantly downregulated genes:

![MH-down-1](functional_analysis_gene_signatures/images/HM-down-1.png)

![MH-down-2](functional_analysis_gene_signatures/images/HM-down-2.png)

Let's set the p-value threshold to 1, just to see the most prevalent gene-ontology enrichments and their significance. 

![MH-down-3](functional_analysis_gene_signatures/images/HM-down-3.png)

## Use the entire ranking to determine the "leading edge"
### GOrilla
First we sort all genes by their logFC, once in descending and once in ascending order:

In [27]:
cat deseq.results.tsv | sort -k 3,3gr | awk '$3 != "NA" {print $1}' | grep -v Gene > deseq.results.sortFCdesc.txt
cat deseq.results.tsv | sort -k 3,3g | awk '$3 != "NA" {print $1}' | grep -v Gene > deseq.results.sortFCasc.txt

Given below is a screenshot of GO terms and their enrichment from GOrilla. The enrichment cut-offs could then be used for a more detailed selection of DEGs related to each given GO term.

![mmm](functional_analysis_gene_signatures/images/gorup-list.png)

Looking at the figure above, we see that the enrichment is highest for photoreceptor cell development (30.3), rhodopsin mediated signaling pathway (28.98), and phototransduction (15.92). 

The row for phototransduction gives (15854,39,383,15). This implies there are 15 genes known to be involved in phototransduction within the top 383 genes, while among all 15854 annotated genes there are 39 genes with this function.
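
The enrichment value in the table can be reproduced from these four numbers. Reading the tuple as (N, B, n, b), GOrilla defines enrichment as (b/n) / (B/N): the fraction of term genes in the top of the list divided by the fraction in the whole list. A minimal sketch follows; `go_enrichment` is a hypothetical helper name.

```shell
# Hypothetical helper: enrichment = (b/n) / (B/N) for a GOrilla row (N, B, n, b)
go_enrichment() {
    awk -v N="$1" -v B="$2" -v n="$3" -v b="$4" \
        'BEGIN { printf "%.2f\n", (b / n) / (B / N) }'
}

# Values copied from the phototransduction row
go_enrichment 15854 39 383 15   # prints 15.92
```

The result matches the 15.92 reported for phototransduction, which confirms that the last number in the tuple (15) is the count of term genes in the top 383.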

Therefore the leading edge (the position at which enrichment is highest for phototransduction) is given by 383. If we are interested in how UHRF1 regulates phototransduction, we can choose to use the top 383 genes as a data-driven cut-off instead of the arbitrary cut-off chosen previously.
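
If the leading-edge position is used as a cut-off, the corresponding gene list is simply the top of the ranked file. The sketch below uses a hypothetical helper (`top_genes`) and a hypothetical output file name. It assumes GOrilla's rank positions line up with the lines of `deseq.results.sortFCdesc.txt`, which may not hold exactly if GOrilla dropped unrecognised or duplicate gene symbols.

```shell
# Hypothetical helper: keep the first n genes of a ranked list (one gene per line)
top_genes() {
    # $1 = ranked gene list, $2 = number of genes to keep
    head -n "$2" "$1"
}

# Example call, assuming the ranked list from the sort step exists:
# top_genes deseq.results.sortFCdesc.txt 383 > up-top383.txt
```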

### Gene Set Enrichment Analysis


Generating a rnk file:

In [28]:
cat deseq.results.tsv | sort -k 3,3gr | awk '$3 != "NA" {print $1, $3}' | grep -v Gene | tr ' ' '\t' > deseq.logFC.rnk
head deseq.logFC.rnk

RCVRN	4.87355936460648
DBH-AS1	4.56772020966474
RNF152	4.14525881712738
DBH	4.02335481716765
TNC	3.74712049839423
IMPG1	3.69070300732618
TFF1	3.64799319678884
DKK3	3.54294325178647
MYCL	3.2096358536784
CTNNA2	3.19842349056195
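
Before loading the ranked list into GSEA, it can help to confirm that the file has the shape the command above is meant to produce: two tab-separated columns, sorted by descending score. The check below is a sketch of my own (`check_rnk` is a hypothetical name) and it only tests those two properties.

```shell
# Hypothetical helper: exit 0 if the file has two tab-separated fields per line
# and the second field never increases from one line to the next.
check_rnk() {
    awk -F'\t' '
        NF != 2 { printf "line %d: expected 2 fields, found %d\n", NR, NF; bad = 1 }
        NR > 1 && $2 + 0 > prev { printf "line %d: score increases\n", NR; bad = 1 }
        { prev = $2 + 0 }
        END { exit bad + 0 }
    ' "$1"
}

# Example call, assuming deseq.logFC.rnk exists:
# check_rnk deseq.logFC.rnk && echo "rnk file looks fine"
```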


Using this ranked set of genes, I ran GSEA with a camera-type eye development gene set from MSigDB. The gene set is given [here](https://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?geneSetName=GO_CAMERA_TYPE_EYE_DEVELOPMENT&keywords=photoreceptor). The outputs of the GSEA run are given below.

![summary](functional_analysis_gene_signatures/images/GSEA_summary.png)
![enrichment](functional_analysis_gene_signatures/images/GSEA-enrichment.png)
![table](functional_analysis_gene_signatures/images/GSEA_table.png)


I'm not really sure we can make any valuable inferences from these results that have not already been made from the previous analyses. Again we see enrichment of genes related to photoreceptors and eye development. I could not include all genes given in the list above, but there were enriched genes as far down in the ranking as 2000.

## Motif Discovery - TF identification
Lastly, I used iRegulon in Cytoscape to identify enriched motifs and predicted transcription factors. The motifs are given in the image below. The most enriched motif ID is 'transfac pro_M02101', which has a number of associated transcription factors: MYOG, TCF3, ASCL2, NEUROD1, NHLH1, and TCF12. I searched for these transcription factors on GeneCards and found that most of them are involved in the initiation of neuronal differentiation.

![motifs](functional_analysis_gene_signatures/images/tf-motifs.png)