# Analysis Notebook - Fisher exact test

This notebook counts up differentially expressed and differentially spliced genes and calculates their overlap.

### 1  Read in all and significant alternative splicing and differential gene expression results

The summary data are captured in **all_gene_as.tsv** and **all_gene_dge.tsv**, and the significant results in **gene_as.tsv** and **gene_dge.tsv** (these files are generated by **countGenesAndEvents.R**, which must be run before this notebook).

In [None]:
results_dir  <- "../data/"
all_genes_as_data  <- read.table("../data/all_gene_as.tsv", header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
all_genes_dge_data <- read.table("../data/all_gene_dge.tsv", header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
sig_gene_as  <- read.table(file="../data/gene_as.tsv", header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
sig_gene_dge  <- read.table(file="../data/gene_dge.tsv", header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)

In [None]:
head(sig_gene_as,2)
head(sig_gene_dge,2)
head(all_genes_as_data,2)
head(all_genes_dge_data,2)
head(all_genes_as_data$GeneSymbol,2)
head(all_genes_dge_data$GeneSymbol,2)
head(sig_gene_as$GeneSymbol,2)
head(sig_gene_dge$GeneSymbol,2)

### 2  Count up genes
About 250 genes were found in the splicing data but not in the gene expression data, presumably because the two datasets were produced by different processing pipelines. We therefore cannot assess whether these genes were differentially expressed, so we remove them before further analysis.
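
The filtering step can be illustrated on a toy example. The gene symbols and variable names below are hypothetical placeholders, not taken from the actual data; the sketch only shows how `setdiff` separates orphan splicing genes from those that can be assessed.

```r
# Hypothetical gene symbols, for illustration only
toyExpressionGenes <- c("GENE1", "GENE2", "GENE3", "GENE4")
toySplicingGenes   <- c("GENE2", "GENE3", "GENE5")

# Splicing genes with no counterpart in the expression data
toyOrphans <- setdiff(toySplicingGenes, toyExpressionGenes)

# Splicing genes that remain after removing the orphans
toyCorrected <- setdiff(toySplicingGenes, toyOrphans)

message("orphans: ", paste(toyOrphans, collapse = ", "),
        "; retained: ", paste(toyCorrected, collapse = ", "))
```

Here only `GENE5` is an orphan, because it appears in the splicing set but not in the expression set.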


In [None]:
# all genes identified in the gene expression data
allExpressionGenes <- unique(sort(all_genes_dge_data$GeneSymbol))
# all genes identified in the splicing data
allSplicingGenes  <- unique(sort(all_genes_as_data$GeneSymbol))
# Genes found in splicing data but not in expression data
orphanSplicingGenes <- setdiff(allSplicingGenes,allExpressionGenes)
message("All expression genes n=", length(allExpressionGenes),"; all splicing genes n=", length(allSplicingGenes), "; splicing genes not represented in expression set n=", length(orphanSplicingGenes))
correctedSplicing <- setdiff(allSplicingGenes, orphanSplicingGenes)
message("Note that we expect to find genes in the expression set that are not in the splicing set")
message("After removing the orphan splicing genes, we are left with  ", length(correctedSplicing), " genes in the splicing dataset")
universe <- allExpressionGenes

### Create the sets of differentially expressed/spliced genes
Note that the set of differentially spliced genes must also be corrected by removing the orphan splicing genes, as above.

In [None]:
sigDGEGenes <- unique(sort(sig_gene_dge$GeneSymbol))
sigASGenes  <- unique(sort(sig_gene_as$GeneSymbol))
correctedSigASGenes <- setdiff(sigASGenes, orphanSplicingGenes)
message("total AS (uncorrected) n=", length(sigASGenes), "; corrected n=", length(correctedSigASGenes))
total <- length(universe)
n_dge <- length(sigDGEGenes)
n_das <- length(correctedSigASGenes)
message("significant differentially expressed genes: n=", n_dge, "/", total, ": ", 100*n_dge/total,"%")
message("significant differentially spliced genes: n=", n_das, "/", total, ": ", 100*n_das/total,"%")

In [None]:
dge_but_not_das <- setdiff(sigDGEGenes, correctedSigASGenes)
das_but_not_dge <- setdiff(correctedSigASGenes, sigDGEGenes)
dge_and_das <- intersect(sigDGEGenes, correctedSigASGenes)
neither_dge_nor_das <- setdiff(setdiff(universe,sigDGEGenes), correctedSigASGenes)
n_dge_but_not_das <- length(dge_but_not_das)
n_das_but_not_dge <- length(das_but_not_dge)
n_dge_and_das <- length(dge_and_das)
n_neither_dge_nor_das <- length(neither_dge_nor_das)
message("Differentially expressed but not differentially spliced: n=", n_dge_but_not_das,"/", n_dge, ": ", 100*n_dge_but_not_das/n_dge, "% of all DGE genes")
message("Differentially spliced but not differentially expressed: n=", n_das_but_not_dge,"/", n_das, ": ", 100*n_das_but_not_dge/n_das, "% of all DAS genes")
message("DGE and DAS: ", n_dge_and_das,"; ", 100*n_dge_and_das/total,"% of all genes")
expected_proportion <- (n_dge/total)*(n_das/total)
message("By chance we would expect ", expected_proportion*total,", or ", 100*expected_proportion, "%")
message("Number of genes with neither DGE nor DAS ", n_neither_dge_nor_das)
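
The chance expectation used above assumes that differential expression and differential splicing are independent, so the expected proportion of genes showing both is the product of the two marginal proportions. A minimal sketch with hypothetical counts (the `toy_` names and numbers are assumptions for illustration, not results from this analysis):

```r
# Hypothetical counts, for illustration only
toy_total <- 10000   # size of the gene universe
toy_n_dge <- 1000    # differentially expressed genes
toy_n_das <- 500     # differentially spliced genes

# Under independence, P(DGE and DAS) = P(DGE) * P(DAS)
toy_expected_proportion <- (toy_n_dge / toy_total) * (toy_n_das / toy_total)
toy_expected_count <- toy_expected_proportion * toy_total

message("Expected overlap by chance: ", toy_expected_count,
        " genes (", 100 * toy_expected_proportion, "%)")
```

With these toy numbers, 10% of genes are DGE and 5% are DAS, so about 0.5% of the universe (roughly 50 genes) would be expected in the overlap by chance.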

### 2.1 Check whether the increased proportion is statistically significant

Comparing differentially expressed genes (DGE) with differentially alternatively spliced genes (DAS):

|      | DGE+ | DGE- |
|------|------|------|
| DAS+ | a    | b    |
| DAS- | c    | d    |
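
For reference, Fisher's exact test conditions on the row and column totals of this table. Under the null hypothesis of independence, the count in the upper-left cell follows a hypergeometric distribution. Writing $n$ for the total number of genes (a symbol introduced here for convenience, not used elsewhere in the notebook), the probability of observing a particular table is:

```latex
P(a) = \frac{\binom{a+b}{a}\,\binom{c+d}{c}}{\binom{n}{a+c}},
\qquad n = a + b + c + d
```

The sample odds ratio for the table is $ad/(bc)$. R's `fisher.test` reports a conditional maximum likelihood estimate of the odds ratio, which can differ slightly from this sample value.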

In [None]:
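a <- n_dge_and_das          # DAS+ and DGE+
b <- n_das_but_not_dge      # DAS+ and DGE-
c <- n_dge_but_not_das      # DAS- and DGE+ (the variable c does not mask the function c() used below)
d <- n_neither_dge_nor_das  # DAS- and DGE-
# Fill the 2x2 contingency table row by row so that it matches the layout above
m <- matrix(c(a,b,c,d), nrow=2, byrow=TRUE)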
m

In [None]:
fisher.test(m)
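
The object returned by `fisher.test` can also be inspected field by field. The sketch below uses a hypothetical 2x2 table (the counts and the `toy_` names are assumptions for illustration, not results from this analysis). It also shows the one-sided alternative `"greater"`, which may be of interest when the question is specifically whether the overlap is larger than expected; the default call above is two-sided.

```r
# Hypothetical 2x2 counts, for illustration only
toy_m <- matrix(c( 80,  420,
                  920, 8580),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("DAS+", "DAS-"), c("DGE+", "DGE-")))

# One-sided test: is the odds ratio greater than 1,
# i.e. is the DGE/DAS overlap larger than expected by chance?
toy_res <- fisher.test(toy_m, alternative = "greater")

toy_res$p.value    # p-value of the one-sided test
toy_res$estimate   # conditional maximum likelihood estimate of the odds ratio
toy_res$conf.int   # confidence interval for the odds ratio
```

An odds ratio estimate above 1 together with a small p-value would indicate that genes tend to be both differentially expressed and differentially spliced more often than independence would predict.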