# Analysis Notebook - Fisher exact test

This notebook applies Fisher's exact test to specific genes and gene subcategories.

In [2]:
suppressWarnings({suppressMessages({
options(warn = -1) 
library(dplyr)
library(multtest)
library(R.utils)
})})

### 1  Read in all and significant alternative splicing and differential gene expression results

The full gene lists are read from **all_gene_as_gene_names.tsv** and **all_gene_dge_gene_names.tsv**; the significant results are read from **gene_as.tsv** and **gene_dge.tsv**.

In [3]:
results_dir  <- "../data/"
all_genes_as_data  <- read.table("../assets/all_gene_as_gene_names.tsv", header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
names(all_genes_as_data) <- c("GeneSymbol", "ensg")
all_genes_dge_data <- read.table("../assets/all_gene_dge_gene_names.tsv", header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
sig_gene_as  <- read.table(file="../data/gene_as.tsv", header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
sig_gene_dge  <- read.table(file="../data/gene_dge.tsv", header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
head(sig_gene_as,2)
head(sig_gene_dge,2)
head(all_genes_as_data,2)
head(all_genes_dge_data,2)

Unnamed: 0_level_0,GeneJunction,ASE,ASE_IDX,Tissue,counts,Display,GeneSymbol,GeneID,chr,logFC,AveExpr,t,PValue,AdjPVal,B
Unnamed: 0_level_1,<chr>,<chr>,<int>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,XIST-2253,A3SS,2253,adipose_subcutaneous,4,Adipose (sc),XIST,ENSG00000229807.11,chrX,-4.408605,3.196317,-36.48897,4.635568e-154,3.893877e-150,310.016
2,XIST-2252,A3SS,2252,adipose_subcutaneous,4,Adipose (sc),XIST,ENSG00000229807.11,chrX,-2.414713,3.64769,-21.92106,1.444102e-78,6.065229000000001e-75,160.0282


Unnamed: 0_level_0,Tissue,ENSG_ver,ENSG_no_ver,GeneSymbol,counts,Display,logFC,AveExpr,t,PValue,AdjPVal,B,chr
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
1,adipose_subcutaneous,ENSG00000176728.7,ENSG00000176728,TTTY14,765,Adipose (sc),-7.982166,-0.9288129,-139.823,0,0,1107.423,chrY
2,adipose_subcutaneous,ENSG00000231535.5,ENSG00000231535,LINC00278,765,Adipose (sc),-6.09542,-2.7765638,-126.9138,0,0,1050.366,chrY


Unnamed: 0_level_0,GeneSymbol,ensg
Unnamed: 0_level_1,<chr>,<chr>
1,A1BG,ENSG00000121410.11
2,A1CF,ENSG00000148584.15


Unnamed: 0_level_0,ENSG_ver,ENSG_no_ver,GeneSymbol,chr
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>
1,ENSG00000183878.15,ENSG00000183878,UTY,chrY
2,ENSG00000129824.15,ENSG00000129824,RPS4Y1,chrY


### 2  We then do a hypergeometric/Fisher test to look for overrepresentation
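
The heading names the hypergeometric and Fisher tests together because, for a 2x2 table, the one-sided Fisher p-value is the upper tail of a hypergeometric distribution. The sketch below checks this on a small hypothetical table; the counts are made up for illustration and are not taken from the data.

```r
# Hypothetical 2x2 table (illustrative counts only, not from the data)
#            DGE+  DGE-
#   DAS+       12     8
#   DAS-       30    50
tab <- matrix(c(12, 8, 30, 50), nrow = 2, byrow = TRUE)

# One-sided Fisher test for over-representation in the top-left cell
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Same probability from the hypergeometric upper tail: P(X >= a)
p_hyper <- phyper(tab[1, 1] - 1,
                  sum(tab[, 1]),   # column 1 total (DGE+)
                  sum(tab[, 2]),   # column 2 total (DGE-)
                  sum(tab[1, ]),   # row 1 total (DAS+), the number of draws
                  lower.tail = FALSE)

all.equal(p_fisher, p_hyper)
```

The two-sided test used in the cells below sums the probabilities of all tables that are no more likely than the observed one, so it does not reduce to a single `phyper` call.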


### 2.1 DGE vs DAS

Comparing differentially expressed genes with differentially alternatively spliced:

|      | DGE+ | DGE- |
|------|------|------|
| DAS+ | a    | b    |
| DAS- | c    | d    |
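
As a minimal sketch of how the four cells of this table can be filled with set operations, the example below uses a hypothetical universe of 20 gene symbols. All names (`universe`, `sig_as`, `sig_dge`, and the `G1`..`G20` symbols) are assumptions for illustration only.

```r
# Hypothetical gene universe and significant sets (not from the data)
universe <- paste0("G", 1:20)
sig_as   <- paste0("G", 1:6)    # DAS+
sig_dge  <- paste0("G", 4:12)   # DGE+

# Complements within the shared universe
not_as  <- setdiff(universe, sig_as)    # DAS-
not_dge <- setdiff(universe, sig_dge)   # DGE-

# Each cell is the intersection of one row set with one column set
cells <- c(a = length(intersect(sig_as, sig_dge)),
           b = length(intersect(sig_as, not_dge)),
           c = length(intersect(not_as, sig_dge)),
           d = length(intersect(not_as, not_dge)))

tab <- matrix(cells, nrow = 2, byrow = TRUE,
              dimnames = list(c("DAS+", "DAS-"), c("DGE+", "DGE-")))
tab

# The four cells are disjoint and together cover the universe
sum(tab) == length(universe)
```

Taking both complements from the same universe is what makes the four cells mutually exclusive, which is what Fisher's exact test assumes.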

In [4]:
sigASGenes  <- unique(sort(sig_gene_as$GeneSymbol))
sigDGEGenes <- unique(sort(sig_gene_dge$GeneSymbol))
allASGenes  <- unique(sort(all_genes_as_data$GeneSymbol))
allDGEGenes <- unique(sort(all_genes_dge_data$GeneSymbol))

In [18]:
totalASGEGenes <- union(allASGenes, allDGEGenes)
message("Union of both Alternatively Spliced Genes and Gene Expression genes\n", length(totalASGEGenes))
intersectASDEGenes <- intersect(allASGenes, allDGEGenes)
message("Shared Alternatively Spliced Genes and Gene Expression genes\n", length(intersectASDEGenes))
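# Genes in the AS gene list that are absent from the DGE gene list
notASGenes <- setdiff(allASGenes,  allDGEGenes)
# Genes in the DGE gene list that are absent from the AS gene list
notGEGenes <- setdiff(allDGEGenes, allASGenes)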
message("Genes alternatively spliced but not found in gene expression (not significantly)\n", length(notASGenes))
message("Genes gene expressed but not found in alternative splicing events\n", length(notGEGenes))
message("Number of genes that are alternatively spliced\n",length(allASGenes))
message("Number of genes that are expressed\n", length(allDGEGenes))

Union of both Alternatively Spliced Genes and Gene Expression genes
41961

Shared Alternatively Spliced Genes and Gene Expression genes
14438

Genes alternatively spliced but not found in gene expression (not significantly)
256

Genes gene expressed but not found in alternative splicing events
27267

Number of genes that are alternatively spliced
14694

Number of genes that are expressed
41705



In [5]:
message("Number of sigASGenes ", length(sigASGenes))
notSigAS <- setdiff(allASGenes,sigASGenes)
message("Number of genes that are NOT sigAS ", length(notSigAS))
message("Number of sigDGEgenes ", length(sigDGEGenes))
notDGE <- setdiff(allDGEGenes,sigDGEGenes)
message("Number of genes that are NOT DGE ", length(notDGE))
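# a: DAS+ and DGE+
a <- intersect(sigASGenes, sigDGEGenes)
# b: DAS+ and DGE-
b <- intersect(sigASGenes, notDGE)
# c: DAS- and present anywhere in the DGE gene list; this set contains d,
#    whereas the DAS-/DGE+ cell of the table would be intersect(notSigAS, sigDGEGenes)
c <- intersect(notSigAS, allDGEGenes)
# d: DAS- and DGE-
d <- intersect(notSigAS, notDGE)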
message("a: ", length(a), "; b: ",  length(b), "; c: ",  length(c), "; d: ",  length(d))

Number of sigASGenes 2874

Number of genes that are NOT sigAS 11820

Number of sigDGEgenes 7417

Number of genes that are NOT DGE 34587

a: 1141; b: 1697; c: 11600; d: 10004



In [21]:
m <- matrix(c(length(a),length(b),length(c),length(d)), nrow=2,byrow = TRUE)
fisher.test(m)


	Fisher's Exact Test for Count Data

data:  m
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.5358716 0.6292555
sample estimates:
odds ratio 
 0.5807765 
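
For reference, the `odds ratio` reported by `fisher.test` is a conditional maximum-likelihood estimate, which is usually close to but not identical to the cross-product ratio (a·d)/(b·c). The sketch below recomputes both from the four counts printed above. An odds ratio below 1 means that cell a holds fewer genes than expected under independence.

```r
# Counts as printed above: a, b, c, d
m <- matrix(c(1141, 1697, 11600, 10004), nrow = 2, byrow = TRUE)

# Sample (cross-product) odds ratio: (a * d) / (b * c)
cross_product_or <- (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])

# Conditional MLE and its 95% confidence interval from fisher.test
ft <- fisher.test(m)

c(cross_product = cross_product_or, conditional_mle = unname(ft$estimate))
ft$conf.int
```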


### 2.2 IGH vs DGE

It was noted that a number of immune genes show up disproportionately in DGE. Here, genes whose symbol contains `IGH` are compared between the DGE and DAS results.



|           | IGH+ DGE+ | IGH- DGE+ |
|-----------|-----------|-----------|
| IGH+ DAS+ | a         | b         |
| IGH- DAS+ | c         | d         |

In [22]:
IGHDGE      <- sig_gene_dge$GeneSymbol[grepl("IGH",sig_gene_dge$GeneSymbol)]
IGHDGE      <- unique(sort(IGHDGE))
IGHDAS      <- sig_gene_as$GeneSymbol[grepl("IGH",sig_gene_as$GeneSymbol)]
IGHDAS      <- unique(sort(IGHDAS))
sigDGEGenes <- unique(sort(sig_gene_dge$GeneSymbol))
sigDASGenes <- unique(sort(sig_gene_as$GeneSymbol))
length(IGHDGE)
length(IGHDAS)
length(sigDGEGenes)
length(sigASGenes)

In [23]:
message("Number of DGE genes that are IGH ", length(IGHDGE))
notIGHDGE <- setdiff(sigDGEGenes,IGHDGE)
message("Number of DGE genes that are NOT IGH ", length(notIGHDGE))
message("Number of DAS genes that are IGH ", length(IGHDAS))
notIGHDAS <- setdiff(sigASGenes,IGHDAS)
message("Number of DAS genes that are NOT IGH ", length(notIGHDAS))
a <- intersect(IGHDAS, IGHDGE)
b <- intersect(IGHDAS, notIGHDGE)
c <- intersect(notIGHDAS, IGHDGE)
d <- intersect(notIGHDAS, notIGHDGE) 
message("a: ", length(a), "; b: ",  length(b), "; c: ",  length(c), "; d: ",  length(d))

Number of DGE genes that are IGH 134

Number of DGE genes that are NOT IGH 7283

Number of DAS genes that are IGH 2

Number of DAS genes that are NOT IGH 2885

a: 1; b: 0; c: 0; d: 1146



In [24]:
# Add a pseudocount of 1 to every cell so that no cell is zero
m <- matrix(c(length(a) + 1,
              length(b) + 1,
              length(c) + 1,
              length(d) + 1), 
            nrow=2,byrow = TRUE)
fisher.test(m)


	Fisher's Exact Test for Count Data

data:  m
p-value = 1.359e-05
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 5.07942e+01 4.50360e+15
sample estimates:
odds ratio 
  1538.091 
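
The `+ 1` added to every cell acts as a pseudocount. The sketch below shows its effect using the counts printed above (a = 1, b = 0, c = 0, d = 1146): without it the estimated odds ratio is infinite, and with it the estimate is finite but strongly shaped by the added counts.

```r
# Counts as printed above, without the pseudocount
raw <- matrix(c(1, 0, 0, 1146), nrow = 2, byrow = TRUE)

ft_raw    <- fisher.test(raw)       # zero cells in the table
ft_pseudo <- fisher.test(raw + 1)   # the matrix used in the cell above

# Odds-ratio estimates with and without the pseudocount
c(raw = unname(ft_raw$estimate), pseudo = unname(ft_pseudo$estimate))

# The p-values also differ, so the pseudocount changes the test itself
c(raw = ft_raw$p.value, pseudo = ft_pseudo$p.value)
```

With so few IGH genes on the DAS side, either result rests on one or two genes and should be read with that in mind.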


### 2.3 TLR8 vs DGE

It was noted that a number of immune genes show up disproportionately in DGE; the same comparison is repeated here for TLR8.



|            | TLR8+ DGE+ | TLR8- DGE+ |
|------------|------------|------------|
| TLR8+ DAS+ | a          | b          |
| TLR8- DAS+ | c          | d          |

In [25]:
TLR8DGE      <- sig_gene_dge$GeneSymbol[grepl("TLR8",sig_gene_dge$GeneSymbol)]
TLR8DGE      <- unique(sort(TLR8DGE))
TLR8DAS      <- sig_gene_as$GeneSymbol[grepl("TLR8",sig_gene_as$GeneSymbol)]
TLR8DAS      <- unique(sort(TLR8DAS))
sigDGEGenes <- unique(sort(sig_gene_dge$GeneSymbol))
sigDASGenes <- unique(sort(sig_gene_as$GeneSymbol))
length(TLR8DGE)
length(TLR8DAS)
length(sigDGEGenes)
length(sigASGenes)
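
Note that `grepl("TLR8", ...)` and `grepl("IGH", ...)` match any symbol that contains the string, not only the intended gene or gene family. The sketch below shows the difference on a hypothetical vector of symbols; the vector is made up for illustration, and whether such symbols occur in the data is not shown here.

```r
# Hypothetical symbols used only to illustrate the matching behaviour
symbols <- c("TLR8", "TLR8-AS1", "TLR7", "IGHG1", "IGHMBP2")

# Substring match: every symbol containing "TLR8"
substring_hits <- symbols[grepl("TLR8", symbols)]

# Exact match: only the symbol "TLR8" itself
exact_hits <- symbols[symbols == "TLR8"]

# Anchored regular expression, equivalent to the exact match here
anchored_hits <- symbols[grepl("^TLR8$", symbols)]

# The same substring behaviour applies to the "IGH" pattern
igh_hits <- symbols[grepl("IGH", symbols)]

list(substring = substring_hits, exact = exact_hits,
     anchored = anchored_hits, igh = igh_hits)
```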

In [26]:
message("Number of DGE genes that are TLR8 ", length(TLR8DGE))
notTLR8DGE <- setdiff(sigDGEGenes,TLR8DGE)
message("Number of DGE genes that are NOT TLR8 ", length(notTLR8DGE))
message("Number of DAS genes that are TLR8 ", length(TLR8DAS))
notTLR8DAS <- setdiff(sigASGenes,TLR8DAS)
message("Number of DAS genes that are NOT TLR8 ", length(notTLR8DAS))
a <- intersect(TLR8DAS, TLR8DGE)
b <- intersect(TLR8DAS, notTLR8DGE)
c <- intersect(notTLR8DAS, TLR8DGE)
d <- intersect(notTLR8DAS, notTLR8DGE) 
message("a: ", length(a), "; b: ",  length(b), "; c: ",  length(c), "; d: ",  length(d))

Number of DGE genes that are TLR8 1

Number of DGE genes that are NOT TLR8 7416

Number of DAS genes that are TLR8 0

Number of DAS genes that are NOT TLR8 2887

a: 0; b: 0; c: 0; d: 1147



In [27]:
# Add a pseudocount of 1 to every cell so that no cell is zero
m <- matrix(c(length(a) + 1,
              length(b) + 1,
              length(c) + 1,
              length(d) + 1), 
            nrow=2,byrow = TRUE)
fisher.test(m)


	Fisher's Exact Test for Count Data

data:  m
p-value = 0.003474
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 7.336022e+00 4.503600e+15
sample estimates:
odds ratio 
  807.1171 


### 3.0 Extract TRAF genes

In [33]:
# Keep the significant DGE and DAS rows whose gene symbol contains "TRAF"
TRAFDGE     <- sig_gene_dge[grepl("TRAF",sig_gene_dge$GeneSymbol),]
TRAFDAS     <- sig_gene_as[grepl("TRAF",sig_gene_as$GeneSymbol),]
head(TRAFDGE,2)
head(TRAFDAS,2)
# Save both subsets as tab-separated files
write.table(TRAFDGE,"../data/trafdge.tsv", sep="\t")
write.table(TRAFDAS,"../data/trafdas.tsv", sep="\t")

Unnamed: 0_level_0,Tissue,ENSG_ver,ENSG_no_ver,GeneSymbol,counts,Display,logFC,AveExpr,t,PValue,AdjPVal,B,chr
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<int>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
3405,breast_mammary_tissue,ENSG00000009790.14,ENSG00000009790,TRAF3IP3,5433,Breast,0.9137457,0.5676736,11.88221,1.350093e-28,6.968226e-27,53.76519,chr1
4989,breast_mammary_tissue,ENSG00000076604.14,ENSG00000076604,TRAF4,5433,Breast,0.8732736,4.8298281,7.91071,1.923433e-14,2.926761e-13,20.92178,chr17


Unnamed: 0_level_0,GeneJunction,ASE,ASE_IDX,Tissue,counts,Display,GeneSymbol,GeneID,chr,logFC,AveExpr,t,PValue,AdjPVal,B
Unnamed: 0_level_1,<chr>,<chr>,<int>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
175,TRAF3IP3-2737,A3SS,2737,breast_mammary_tissue,462,Breast,TRAF3IP3,ENSG00000009790.15,chr1,-1.0894857,1.947695,-6.653225,1.058022e-10,1.110581e-08,13.925927
873,TRAF4-2634,A5SS,2634,breast_mammary_tissue,306,Breast,TRAF4,ENSG00000076604.15,chr17,-0.9610208,4.837136,-4.698929,3.724633e-06,0.0001316037,3.837966


### Appendix - Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **'artefacts'**, that is, files generated during the analysis and stored in the **`data`** directory
2. List of environment metadata, dependencies, versions of libraries using `utils::sessionInfo()` and [`devtools::session_info()`](https://devtools.r-lib.org/reference/session_info.html)

### Appendix 1. Checksums with the sha256 algorithm

In [None]:
notebookid   <- "FisherExactTests"
notebookid


### Appendix 2. Libraries metadata

In [None]:
dev_session_info   <- devtools::session_info()
utils_session_info <- utils::sessionInfo()

message("Saving `devtools::session_info()` objects in ../metadata/", notebookid, "_devtools_session_info.rds  ..")
saveRDS(dev_session_info, file = paste0("../metadata/", notebookid, "_devtools_session_info.rds"))
message("Done!\n")

message("Saving `utils::sessionInfo()` objects in ../metadata/", notebookid, "_utils_info.rds  ..")
saveRDS(utils_session_info, file = paste0("../metadata/", notebookid ,"_utils_info.rds"))
message("Done!\n")

dev_session_info$platform
dev_session_info$packages[dev_session_info$packages$attached==TRUE, ]