# Functional analysis of gene signatures

In this session we will analyze the differentially expressed genes found by DESeq2 on the bulk RNA-seq data for the p53 case study. Two replicates of non-stimulated (NS) RNA-seq samples were compared with two replicates of stimulated (S) RNA-seq samples.

The DESeq2 analysis resulted in a deseq.results.tsv file, which we will use here.
The Seurat analysis of scRNA-seq data ...
The Scanpy analysis of scRNA-seq data ...

In [11]:
mkdir -p /mnt/storage/$USER/jupyternotebooks/functional_analysis_gene_signatures/
cd /mnt/storage/$USER/jupyternotebooks/functional_analysis_gene_signatures/

In [20]:
ln -sf /mnt/storage/r0773125/jupyternotebooks/RNA-seq/deseq.results.tsv .

In [21]:
head deseq.results.tsv

Gene	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj
RCVRN	13012.0630398435	-5.03622231268386	0.107239964794059	-46.9621779749305	0	0
SUSD2	3750.29753383858	-3.23452917559081	0.132304959690488	-24.4475277658345	5.34648162690658e-132	4.04247475810407e-128
UHRF1	2481.16663973242	3.38820129078875	0.154275487048341	21.9620197324484	6.64776705861637e-107	3.35091778201322e-103
MYO5B	1840.62993187403	-2.75934327535151	0.126076242204659	-21.8863064690037	3.50813676020489e-106	1.32032416816083e-102
DBH-AS1	707.606037844285	-5.30273538539319	0.242395978446196	-21.8763340026709	4.36557389287407e-106	1.32032416816083e-102
SEPT4	3776.76355499303	-2.21119871915032	0.103701322541852	-21.3227629595363	6.98126588949692e-101	1.75951171301621e-97
OLAH	1715.97990933185	-2.5512400155512	0.121965344125656	-20.9177453959606	3.69158554557816e-97	7.97487951717614e-94
C2orf71	7807.96542956033	-2.11854179247527	0.108697833742474	-19.4901933141971	1.32977788252841e-84	2.51361264244933e-81
NPTX1	4342.91468586552	1.

In [22]:
head -1 deseq.results.tsv
grep -n CDKN1A deseq.results.tsv
grep -n BBC3 deseq.results.tsv
grep -n GDF15 deseq.results.tsv

Gene	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj
15:CDKN1A	5581.59669285255	2.04608629655726	0.118416068563776	17.2787892840345	6.79596304358873e-67	7.3406109389392e-64
11164:BBC3	465.351001397489	-0.0920452974015471	0.174928764731525	-0.526187317122002	0.598758065203526	0.810746728331637
254:GDF15	1443.51187631608	1.13199254557894	0.155124492574137	7.29731666995099	2.93563317601618e-13	1.75464999556192e-11
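
Note that `grep -n CDKN1A` matches the pattern anywhere on a line, so it would also report any gene whose name merely contains that string. A minimal sketch of an exact-match lookup on column 1 follows. The toy file `toy.results.tsv`, its values, and the gene name `CDKN1A-AS1` are invented for illustration and are not part of the course data.

```shell
# Build a tiny, made-up table with the same column layout as deseq.results.tsv
printf 'Gene\tbaseMean\tlog2FoldChange\tlfcSE\tstat\tpvalue\tpadj\n' > toy.results.tsv
printf 'CDKN1A\t100\t2.0\t0.1\t20\t1e-10\t1e-8\n' >> toy.results.tsv
printf 'CDKN1A-AS1\t50\t0.1\t0.2\t0.5\t0.6\t0.8\n' >> toy.results.tsv
printf 'GDF15\t80\t1.1\t0.1\t7\t1e-5\t1e-4\n' >> toy.results.tsv

# Print the header, then every row whose first column equals a requested
# gene exactly; the row number is prefixed to mimic the output of grep -n
awk -F'\t' -v genes='CDKN1A,GDF15' '
    BEGIN { n = split(genes, g, ","); for (i = 1; i <= n; i++) want[g[i]] = 1 }
    NR == 1 { print; next }
    $1 in want { print NR ":" $0 }
' toy.results.tsv > toy.lookup.txt

cat toy.lookup.txt
```

On the real data, the same awk program could be pointed at `deseq.results.tsv` with the gene names of interest passed through `genes`.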


## Use arbitrary thresholds to create lists of up- and down-regulated genes
* Careful: many genes have no detected expression and have NA in the logFC column, so column 3 (`$3`) must not be "NA".
* We use awk to filter this file, keeping only rows where the logFC (column 3, `$3`) is above a threshold and the padj (column 7, `$7`) is below a threshold.
* `print` without arguments prints all columns of every row that meets these conditions.

In [23]:
awk '$3 != "NA" && $3 > 1 && $7 < 0.05 {print ;}' deseq.results.tsv | head

UHRF1	2481.16663973242	3.38820129078875	0.154275487048341	21.9620197324484	6.64776705861637e-107	3.35091778201322e-103
NPTX1	4342.91468586552	1.85232097824919	0.0973267533263673	19.0319815974727	9.26808362611439e-81	1.55724400660113e-77
ELAVL3	1190.77364964771	2.42132378972336	0.138470500546997	17.4862066661019	1.82508518938519e-68	2.29991151949023e-65
CDKN1A	5581.59669285255	2.04608629655726	0.118416068563776	17.2787892840345	6.79596304358873e-67	7.3406109389392e-64
DLG5	3143.06225104625	1.60274691900141	0.11837374570092	13.5397161719531	9.11403810906267e-42	4.17643891773472e-39
ITPR3	2688.66278362004	1.40989166212935	0.104744232015028	13.4603274567622	2.67727099071539e-41	1.19075564475289e-38
HMGB2	6945.75430320763	1.36545648236078	0.102837473391484	13.2778104841484	3.113659152081e-40	1.27256091075051e-37
DCAF16	1268.06843394424	1.66107846534364	0.126654775196081	13.1150085953888	2.70151604569504e-39	1.02130814107501e-36
CACNA2D1	1025.98754074204	1.74925290796934	0.135734144000752	12

In [24]:
awk '$3 != "NA" && $3 > 1 && $7 < 0.05 {print $1}' deseq.results.tsv > up-logFC1-p05.txt
awk '$3 != "NA" && $3 < -1 && $7 < 0.05 {print $1}' deseq.results.tsv > down-logFC1-p05.txt

wc -l up-logFC1-p05.txt
wc -l down-logFC1-p05.txt

212 up-logFC1-p05.txt
562 down-logFC1-p05.txt
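
The same filter can be sketched with the thresholds passed in as variables, so that they only need to be changed in one place. This is an illustrative variant, not the command used in this session: the variable names `lfc` and `alpha`, the toy file `toy.deseq.tsv`, and all values in it are assumptions made for the example. The sketch also sets the field separator to a tab explicitly, skips the header with `NR > 1`, and excludes rows whose padj is NA.

```shell
# Toy table (invented values) with the column layout of deseq.results.tsv
printf 'Gene\tbaseMean\tlog2FoldChange\tlfcSE\tstat\tpvalue\tpadj\n' > toy.deseq.tsv
printf 'UP1\t100\t2.5\t0.1\t25\t1e-20\t1e-18\n'     >> toy.deseq.tsv
printf 'DOWN1\t100\t-3.0\t0.1\t-30\t1e-30\t1e-28\n' >> toy.deseq.tsv
printf 'WEAK1\t100\t0.4\t0.1\t4\t1e-4\t1e-3\n'      >> toy.deseq.tsv
printf 'NOISY1\t5\t4.0\t2.0\t2\t0.04\t0.3\n'        >> toy.deseq.tsv
printf 'SILENT1\t0\tNA\tNA\tNA\tNA\tNA\n'           >> toy.deseq.tsv

lfc=1       # absolute log2 fold-change threshold (hypothetical name)
alpha=0.05  # adjusted p-value threshold (hypothetical name)

# Up-regulated: logFC above +lfc and padj below alpha
awk -F'\t' -v lfc="$lfc" -v alpha="$alpha" \
    'NR > 1 && $3 != "NA" && $7 != "NA" && $3 > lfc && $7 < alpha {print $1}' \
    toy.deseq.tsv > toy.up.txt

# Down-regulated: logFC below -lfc and padj below alpha
awk -F'\t' -v lfc="$lfc" -v alpha="$alpha" \
    'NR > 1 && $3 != "NA" && $7 != "NA" && $3 < -lfc && $7 < alpha {print $1}' \
    toy.deseq.tsv > toy.down.txt

cat toy.up.txt toy.down.txt
```

In the toy table only `UP1` and `DOWN1` pass both thresholds; `WEAK1` fails on fold change, `NOISY1` on padj, and `SILENT1` has no values at all.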


## Humanmine.org
Create a new list on humanmine.org, then click "Save List of N genes"
* For mouse genes, use mousemine.org
* For Drosophila genes, use flymine.org

## Use the entire ranking to determine the "leading edge"
### GOrilla
Sort all genes by their logFC (descending) or by their padj (ascending, so the most significant genes come first); here we use logFC

In [25]:
cat deseq.results.tsv | sort -k 3,3gr | awk '$3 != "NA" {print $1}' | grep -v Gene | head

CD300LG
TSPAN7
MUC5AC
SPRED3
AC090505.6
NLRX1
CTD-2140B24.6
FAM65B
EPN3
TNFRSF10D
grep: write error: Broken pipe
sort: write failed: 'standard output': Broken pipe
sort: write error


In [26]:
cat deseq.results.tsv | sort -k 3,3gr | awk '$3 != "NA" {print $1}' | grep -v Gene > deseq.results.sortFCdesc.txt
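
For comparison, ranking by padj instead of logFC could look roughly like the sketch below. It is not part of the session's workflow: the toy file `toy.rank.tsv`, its values, and the output name `toy.sortPadjAsc.txt` are invented for illustration. The header is removed with `tail -n +2` before sorting, and the sort uses an explicit tab separator and the general-numeric key (`g`, a GNU sort option that the commands above already rely on) so that values such as `1e-38` are ordered correctly.

```shell
# Toy table (invented values) with the column layout of deseq.results.tsv
printf 'Gene\tbaseMean\tlog2FoldChange\tlfcSE\tstat\tpvalue\tpadj\n' > toy.rank.tsv
printf 'GENE_B\t100\t1.5\t0.1\t15\t1e-10\t1e-8\n'    >> toy.rank.tsv
printf 'GENE_A\t100\t-2.0\t0.1\t-20\t1e-40\t1e-38\n' >> toy.rank.tsv
printf 'GENE_D\t0\tNA\tNA\tNA\tNA\tNA\n'             >> toy.rank.tsv
printf 'GENE_C\t100\t0.2\t0.1\t2\t0.04\t0.2\n'       >> toy.rank.tsv

# Drop the header, remove rows without an adjusted p-value,
# sort ascending on column 7, and keep only the gene names
tail -n +2 toy.rank.tsv \
    | awk -F'\t' '$7 != "NA"' \
    | LC_ALL=C sort -t "$(printf '\t')" -k 7,7g \
    | cut -f 1 > toy.sortPadjAsc.txt

cat toy.sortPadjAsc.txt
```

Keep in mind that a padj ranking mixes up- and down-regulated genes at the top of the list, whereas the logFC ranking used in this session puts the most strongly up-regulated genes first.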

Go to http://cbl-gorilla.cs.technion.ac.il/ and enter the text file with the ranked gene names (deseq.results.sortFCdesc.txt), then click on the button to search for enriched GO terms: