# Analysis Notebook - Fisher exact test

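This notebook uses Fisher's exact test to check whether specific genes and gene subcategories are over-represented among the differential gene expression (DGE) and differential alternative splicing (DAS) results.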

In [1]:
defaultW <- getOption("warn")  # remember the current warning level
options(warn = -1)             # suppress warnings while the packages load
library(dplyr)
library(multtest)
library(R.utils)

options(warn = defaultW)       # restore the previous warning level



### 1  Read in all and significant alternative splicing and differential gene expression results

The summary data for all genes are stored in **all_gene_as_gene_names.tsv** and **all_gene_dge_gene_names.tsv**; the significant results are stored in **gene_as.tsv** and **gene_dge.tsv**.

In [2]:
results_dir  <- "../data/"
all_genes_as_data  <- read.table("../assets/all_gene_as_gene_names.tsv", header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
names(all_genes_as_data) <- c("GeneSymbol", "ensg")
all_genes_dge_data <- read.table("../assets/all_gene_dge_gene_names.tsv", header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
sig_gene_as  <- read.table(file="../data/gene_as.tsv", header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
sig_gene_dge  <- read.table(file="../data/gene_dge.tsv", header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
head(sig_gene_as,2)
head(sig_gene_dge,2)
head(all_genes_as_data,2)
head(all_genes_dge_data,2)

| | GeneJunction | ASE | ASE_IDX | Tissue | counts | Display | GeneSymbol | GeneID | chr | logFC | AveExpr | t | PValue | AdjPVal | B |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
| | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | XIST-2253 | A3SS | 2253 | adipose_subcutaneous | 4 | Adipose (sc) | XIST | ENSG00000229807.11 | chrX | -4.408605 | 3.196317 | -36.48897 | 4.635568e-154 | 3.893877e-150 | 310.016 |
| 2 | XIST-2252 | A3SS | 2252 | adipose_subcutaneous | 4 | Adipose (sc) | XIST | ENSG00000229807.11 | chrX | -2.414713 | 3.64769 | -21.92106 | 1.444102e-78 | 6.065229e-75 | 160.0282 |


| | Tissue | ENSG_ver | ENSG_no_ver | GeneSymbol | counts | Display | logFC | AveExpr | t | PValue | AdjPVal | B | chr |
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| 1 | adipose_subcutaneous | ENSG00000176728.7 | ENSG00000176728 | TTTY14 | 765 | Adipose (sc) | -7.982166 | -0.9288129 | -139.823 | 0 | 0 | 1107.423 | chrY |
| 2 | adipose_subcutaneous | ENSG00000231535.5 | ENSG00000231535 | LINC00278 | 765 | Adipose (sc) | -6.09542 | -2.7765638 | -126.9138 | 0 | 0 | 1050.366 | chrY |


| | GeneSymbol | ensg |
|-|-|-|
| | `<chr>` | `<chr>` |
| 1 | A1BG | ENSG00000121410.11 |
| 2 | A1CF | ENSG00000148584.15 |


| | ENSG_ver | ENSG_no_ver | GeneSymbol | chr |
|-|-|-|-|-|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | ENSG00000183878.15 | ENSG00000183878 | UTY | chrY |
| 2 | ENSG00000129824.15 | ENSG00000129824 | RPS4Y1 | chrY |


### 2  We then do a hypergeometric/Fisher test to look for overrepresentation


### 2.1 DGE vs DAS

Comparing differentially expressed (DGE) genes with differentially alternatively spliced (DAS) genes:

|  	|  DGE+| DGE-|
|-	|-	|-	|
| DAS+|  a|  b|
| DAS-|  c| d|
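The heading mentions both a hypergeometric and a Fisher test. For a 2x2 table these are the same calculation when the alternative is one-sided. The sketch below illustrates this with toy counts that are purely hypothetical (they are not results from this analysis); `c_` is used instead of `c` only to avoid shadowing the base function in this example.

```r
# Hypothetical toy counts, laid out as in the table above
a <- 12; b <- 8; c_ <- 30; d <- 150
m_toy <- matrix(c(a, b, c_, d), nrow = 2, byrow = TRUE,
                dimnames = list(c("DAS+", "DAS-"), c("DGE+", "DGE-")))

# One-sided Fisher test for over-representation (odds ratio > 1)
p_fisher <- fisher.test(m_toy, alternative = "greater")$p.value

# Same probability from the hypergeometric upper tail:
# P(X >= a) when (a + b) DAS+ genes are drawn from a pool holding
# (a + c_) DGE+ genes and (b + d) DGE- genes
p_hyper <- phyper(a - 1, a + c_, b + d, a + b, lower.tail = FALSE)

all.equal(p_fisher, p_hyper)
```

The notebook cells use the default two-sided alternative, which tests for any departure of the odds ratio from 1 rather than for over-representation only.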

In [3]:
sigASGenes  <- unique(sort(sig_gene_as$GeneSymbol))
sigDGEGenes <- unique(sort(sig_gene_dge$GeneSymbol))
allASGenes  <- unique(sort(all_genes_as_data$GeneSymbol))
allDGEGenes <- unique(sort(all_genes_dge_data$GeneSymbol))

In [4]:
message("Number of sigASGenes ", length(sigASGenes))
notSigAS <- setdiff(allASGenes,sigASGenes)
message("Number of genes that are NOT sigAS ", length(notSigAS))
message("Number of sigDGEgenes ", length(sigDGEGenes))
notDGE <- setdiff(allDGEGenes,sigDGEGenes)
message("Number of genes that are NOT DGE ", length(notDGE))
a <- intersect(sigASGenes, sigDGEGenes)
b <- intersect(sigASGenes, notDGE)
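# DAS-/DGE+ cell: intersect with the significant DGE genes, not with all DGE genes.
# The counts and test output recorded below were produced with
# intersect(notSigAS, allDGEGenes), so their c also contains the genes counted in d.
c <- intersect(notSigAS, sigDGEGenes)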
d <- intersect(notSigAS, notDGE)
message("a: ", length(a), "; b: ",  length(b), "; c: ",  length(c), "; d: ",  length(d))

Number of sigASGenes 2887

Number of genes that are NOT sigAS 11807

Number of sigDGEgenes 7417

Number of genes that are NOT DGE 34587

a: 1147; b: 1704; c: 11587; d: 9997
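The four cells of a 2x2 table should partition one shared set of background genes, so that every gene is counted exactly once. In the output above, a + b = 2851 while there are 2887 sigAS genes, which suggests that the AS and DGE gene lists do not cover the same background. A minimal sketch of a helper that builds the table from two gene sets and an explicit background follows; `contingency_from_sets` and the toy gene symbols are hypothetical and not part of the original notebook.

```r
# Hypothetical helper: build a 2x2 table over one shared background
contingency_from_sets <- function(set_a, set_b, universe) {
  universe <- unique(universe)
  in_a <- universe %in% set_a   # membership of each background gene in set A
  in_b <- universe %in% set_b   # membership of each background gene in set B
  matrix(c(sum(in_a & in_b),  sum(in_a & !in_b),
           sum(!in_a & in_b), sum(!in_a & !in_b)),
         nrow = 2, byrow = TRUE,
         dimnames = list(c("A+", "A-"), c("B+", "B-")))
}

# Toy example with made-up gene symbols
universe <- c("G1", "G2", "G3", "G4", "G5", "G6")
m_sets <- contingency_from_sets(c("G1", "G2", "G3"), c("G2", "G3", "G4"), universe)
m_sets

# Every background gene lands in exactly one cell
sum(m_sets) == length(universe)
```

In this notebook, one possible call would be `contingency_from_sets(sigASGenes, sigDGEGenes, intersect(allASGenes, allDGEGenes))`; choosing the intersection of the two gene lists as the background is an assumption, not something the original analysis states.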



In [5]:
m <- matrix(c(length(a),length(b),length(c),length(d)), nrow=2,byrow = TRUE)
fisher.test(m)


	Fisher's Exact Test for Count Data

data:  m
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.5358716 0.6292555
sample estimates:
odds ratio 
 0.5807765 


### 2.2 IGH vs DGE

It was noted that a number of immune genes, in particular genes whose symbol contains IGH (immunoglobulin heavy chain), show up disproportionately often among the DGE results.



|  	| DGE+| DGE-|
|-	|-	|-	|
| IGH+ |  a|  b|
| IGH- |  c| d|

In [6]:
IGH         <- all_genes_dge_data[grepl("IGH",all_genes_dge_data$GeneSymbol ),]
sigDGEGenes <- unique(sort(sig_gene_dge$GeneSymbol))
IGHGenes    <- unique(sort(IGH$GeneSymbol))
AllGenes    <- unique(sort(all_genes_dge_data$GeneSymbol))
length(IGHGenes)
length(AllGenes)
length(sigDGEGenes)
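The `grepl("IGH", ...)` call above matches every symbol that contains the string `IGH`, wherever it occurs. The sketch below shows, on a small toy vector, how that can differ from a stricter selection; the exclusion list is an assumption for illustration and would need to be checked against the actual gene list.

```r
# Toy symbols (not taken from the notebook's data)
symbols <- c("IGHV3-23", "IGHG1", "IGHM", "IGHMBP2", "TP53")

# Pattern used in the notebook: matches anywhere in the symbol
loose <- grepl("IGH", symbols)

# Stricter variant: symbol must start with IGH, and symbols known not to be
# immunoglobulin heavy chain genes are dropped (assumed exclusion list)
not_igh <- c("IGHMBP2")
strict  <- grepl("^IGH", symbols) & !(symbols %in% not_igh)

data.frame(symbol = symbols, loose = loose, strict = strict)
```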

In [7]:
message("Number of IGH genes ", length(IGHGenes))
notIGH <- setdiff(AllGenes,IGHGenes)
message("Number of genes that are NOT IGH ", length(notIGH))
message("Number of sigDGEgenes ", length(sigDGEGenes))
notDGE <- setdiff(AllGenes,sigDGEGenes)
message("Number of genes that are NOT DGE ", length(notDGE))
a <- intersect(IGHGenes, sigDGEGenes)
b <- intersect(IGHGenes, notDGE)
c <- intersect(notIGH,   sigDGEGenes)
d <- intersect(notIGH,   notDGE)
message("a: ", length(a), "; b: ",  length(b), "; c: ",  length(c), "; d: ",  length(d))

Number of IGH genes 152

Number of genes that are NOT IGH 41553

Number of sigDGEgenes 7417

Number of genes that are NOT DGE 34587

a: 133; b: 19; c: 6985; d: 34568



In [8]:
m <- matrix(c(length(a),length(b),length(c),length(d)), nrow=2,byrow = TRUE)
fisher.test(m)


	Fisher's Exact Test for Count Data

data:  m
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 21.30914 59.42851
sample estimates:
odds ratio 
  34.65803 
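The test above can be reproduced directly from the four printed counts. The sketch below rebuilds the matrix with labelled rows and columns (the `dimnames` and the names `m_igh`, `res` and `or_sample` are additions for readability, not part of the original cell) and compares the cross-product odds ratio with the conditional maximum-likelihood estimate that `fisher.test` reports, which is why the two values differ slightly.

```r
# Counts copied from the output above, in the order a, b, c, d
m_igh <- matrix(c(133, 19, 6985, 34568), nrow = 2, byrow = TRUE,
                dimnames = list(c("IGH+", "IGH-"), c("DGE+", "DGE-")))

res <- fisher.test(m_igh)
res$p.value    # two-sided p-value
res$estimate   # conditional maximum-likelihood estimate of the odds ratio
res$conf.int   # 95 percent confidence interval

# Unconditional (cross-product) odds ratio: (a * d) / (b * c)
or_sample <- (m_igh[1, 1] * m_igh[2, 2]) / (m_igh[1, 2] * m_igh[2, 1])
or_sample
```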


### Appendix - Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **'artefacts'**, i.e. files generated during the analysis and stored in the **`data`** directory
2. List of environment metadata, dependencies, versions of libraries using `utils::sessionInfo()` and [`devtools::session_info()`](https://devtools.r-lib.org/reference/session_info.html)

### Appendix 1. Checksums with the sha256 algorithm

In [9]:
notebookid   = "FisherExactTests"
notebookid
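The cell above only sets the notebook identifier. One way the sha256 checksums could be computed is sketched below. It assumes the `digest` package is installed (it is not among the packages attached in this notebook), and `sha256_dir` is a hypothetical helper name.

```r
# Hypothetical helper: sha256 checksum of every regular file in a directory
# Assumption: the `digest` package is available
sha256_dir <- function(dir) {
  files <- list.files(dir, full.names = TRUE)
  files <- files[!dir.exists(files)]   # drop sub-directories, keep files
  sums  <- vapply(files,
                  function(f) digest::digest(f, algo = "sha256", file = TRUE),
                  character(1))
  data.frame(file = basename(files), sha256 = unname(sums),
             stringsAsFactors = FALSE)
}

# The artefacts directory named in the text above
checksums <- sha256_dir("../data")
checksums
```

If the directory is missing or empty, the helper returns a data frame with zero rows rather than failing.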


### Appendix 2. Libraries metadata

In [10]:
dev_session_info   <- devtools::session_info()
utils_session_info <- utils::sessionInfo()

message("Saving `devtools::session_info()` objects in ../metadata/", notebookid, "_devtools_session_info.rds  ..")
saveRDS(dev_session_info, file = paste0("../metadata/", notebookid, "_devtools_session_info.rds"))
message("Done!\n")

message("Saving `utils::sessionInfo()` objects in ../metadata/", notebookid, "_utils_info.rds  ..")
saveRDS(utils_session_info, file = paste0("../metadata/", notebookid ,"_utils_info.rds"))
message("Done!\n")

dev_session_info$platform
dev_session_info$packages[dev_session_info$packages$attached==TRUE, ]

Saving `devtools::session_info()` objects in ../metadata/FisherExactTests_devtools_session_info.rds  ..

Done!


Saving `utils::sessionInfo()` objects in ../metadata/FisherExactTests_utils_info.rds  ..

Done!




 setting  value                       
 version  R version 3.6.2 (2019-12-12)
 os       Ubuntu 18.04.3 LTS          
 system   x86_64, linux-gnu           
 ui       X11                         
 language en_US.UTF-8                 
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Etc/UTC                     
 date     2020-06-22                  

| | package | ondiskversion | loadedversion | path | loadedpath | attached | is_base | date | source | md5ok | library |
|-|-|-|-|-|-|-|-|-|-|-|-|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<lgl>` | `<lgl>` | `<chr>` | `<chr>` | `<lgl>` | `<fct>` |
| Biobase | Biobase | 2.46.0 | 2.46.0 | /opt/conda/lib/R/library/Biobase | /opt/conda/lib/R/library/Biobase | True | False | 2019-10-29 | Bioconductor | | /opt/conda/lib/R/library |
| BiocGenerics | BiocGenerics | 0.32.0 | 0.32.0 | /opt/conda/lib/R/library/BiocGenerics | /opt/conda/lib/R/library/BiocGenerics | True | False | 2019-10-29 | Bioconductor | | /opt/conda/lib/R/library |
| dplyr | dplyr | 0.8.4 | 0.8.4 | /opt/conda/lib/R/library/dplyr | /opt/conda/lib/R/library/dplyr | True | False | 2020-01-31 | CRAN (R 3.6.2) | | /opt/conda/lib/R/library |
| multtest | multtest | 2.42.0 | 2.42.0 | /opt/conda/lib/R/library/multtest | /opt/conda/lib/R/library/multtest | True | False | 2019-10-29 | Bioconductor | | /opt/conda/lib/R/library |
| R.methodsS3 | R.methodsS3 | 1.8.0 | 1.8.0 | /opt/conda/lib/R/library/R.methodsS3 | /opt/conda/lib/R/library/R.methodsS3 | True | False | 2020-02-14 | CRAN (R 3.6.3) | | /opt/conda/lib/R/library |
| R.oo | R.oo | 1.23.0 | 1.23.0 | /opt/conda/lib/R/library/R.oo | /opt/conda/lib/R/library/R.oo | True | False | 2019-11-03 | CRAN (R 3.6.3) | | /opt/conda/lib/R/library |
| R.utils | R.utils | 2.9.2 | 2.9.2 | /opt/conda/lib/R/library/R.utils | /opt/conda/lib/R/library/R.utils | True | False | 2019-12-08 | CRAN (R 3.6.3) | | /opt/conda/lib/R/library |
