# Notebook to prepare and run ontologizer for DGE results


This notebook uses **gencode.v30.annotation.gtf** to resolve **ENSG ids** to **geneSymbols** for the **differential gene expression** results produced by the **differentialGeneExpressionAnalysis.ipynb** notebook. The annotation is associated with the genome assembly **GRCh38.p12.genome.fa**.

The output of the **differentialGeneExpressionAnalysis.ipynb** notebook includes:

1. **{tissue}_ensg_map.csv** 
2. **{tissue}_DSG.csv**
3. **{tissue}_refined_DSG.csv**

Each **{tissue}_ensg_map.csv** file pairs every **ENSG id** with its **geneSymbol**, separated by a comma. The gene symbols serve as the **gene set** input that the **ontologizer** requires for each tissue.

All **{tissue}_DSG.csv** files contain the same set of **ENSG** ids in their first column, so any one of them can be used. These ids need to be translated to **geneSymbols** to create the **universe** file, which the **ontologizer** also needs.

We obtain the required **geneSymbols** from the **gencode.v30.annotation.gtf** file that has been used throughout this analysis.

Finally, the **Ontologizer.jar** file is downloaded from the link given in section 1.1, and the **ontologizer** is run for each of the **39** tissues analyzed throughout this digital experiment.

## 1. Library Dependencies

In [74]:
suppressWarnings({suppressMessages({
library(Biostrings)
library(rtracklayer)
})})

ERROR: Error in library(system): there is no package called ‘system’


## 1.1 Obtain ontologizer

We will use the ontologizer to compute the Gene Ontology (GO) term enrichments.

In [3]:
command <- 'wget http://ontologizer.de/cmdline/Ontologizer.jar'
message("running :", command)
system(command)

running :wget http://ontologizer.de/cmdline/Ontologizer.jar
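
Re-running the cell above fetches the jar again each time. A minimal sketch of a guarded download follows, assuming base R's `utils::download.file` is acceptable in place of `wget`; the helper name `download_if_missing` is hypothetical and not part of the original notebook.

```r
# Hypothetical helper (not in the original notebook): download a file only
# when the destination does not exist yet.
download_if_missing <- function(url, dest) {
    if (file.exists(dest)) {
        message("already present: ", dest)
        return(invisible(FALSE))
    }
    message("downloading: ", url)
    utils::download.file(url, destfile = dest, mode = "wb")
    invisible(TRUE)
}

# Example call, left commented out so the sketch has no network side effects:
# download_if_missing("http://ontologizer.de/cmdline/Ontologizer.jar", "Ontologizer.jar")
```

The function returns `FALSE` when nothing was downloaded, which makes it easy to log whether a cached copy was reused.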



## 1.2 Obtain the gencode.v30.gtf file

The gencode.v30.annotation.gtf file was used for the rMATS 3.2.5 experiment.

In [4]:
#
# add chr information for summary data later, use the annotation we used for rMATS
#
# test for the unzipped file: gunzip removes the .gz, so testing for the .gz would re-download on every run
if (!file.exists("../data/gencode.v30.annotation.gtf")) {
    message("downloading gencode v30 annotation\n")
    system("wget -O ../data/gencode.v30.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz")
    message("Done!\n")
    message("Unzipping compressed file gencode.v30.annotation.gtf.gz..")
    system("gunzip ../data/gencode.v30.annotation.gtf.gz", intern = TRUE)
    message("Done! gencode.v30.annotation.gtf can be found in ../data/")
}

## 1.3 Creating the internal data structure for the gencode file

Importing the GTF file with `rtracklayer::import` rearranges it, which causes issues with gffread and other applications. We therefore convert the imported data to a data.frame for ease of use.

In [5]:
gencode <- import("../data/gencode.v30.annotation.gtf")

In [6]:
gtf.df <- as.data.frame (gencode)
chr_genes <- unique(gtf.df[,c("gene_name","gene_id")])
colnames(chr_genes) <- c("GeneSymbol", "ENSG")
head(chr_genes)
write.table(chr_genes, "../data/geneSymbolEnsgUniverse.txt", 
            quote=FALSE,
            col.names=TRUE,
            row.names=FALSE,
            sep="\t")

| | GeneSymbol | ENSG |
|--|:---|:---|
| | `<chr>` | `<chr>` |
| 1 | DDX11L1 | ENSG00000223972.5 |
| 13 | WASH7P | ENSG00000227232.5 |
| 26 | MIR6859-1 | ENSG00000278267.1 |
| 29 | MIR1302-2HG | ENSG00000243485.5 |
| 37 | MIR1302-2 | ENSG00000284332.1 |
| 40 | FAM138A | ENSG00000237613.2 |
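
The `gene_name` and `gene_id` columns used above come from the attribute field of each GTF record. As an illustration of what that mapping involves, here is a minimal base-R sketch that extracts the two values from attribute strings directly. The helper `gtf_attribute` is hypothetical, and the attribute strings are abbreviated toy examples in GENCODE style, not full records.

```r
# Hypothetical helper: pull one quoted value (e.g. gene_id) out of GTF
# attribute strings; returns NA where the key is absent.
gtf_attribute <- function(attributes, key) {
    pattern <- paste0(key, " \"([^\"]*)\"")
    hit <- regmatches(attributes, regexec(pattern, attributes))
    vapply(hit,
           function(m) if (length(m) == 2) m[2] else NA_character_,
           character(1))
}

# Abbreviated toy attribute strings (assumption: real records carry more keys)
attrs <- c(
    'gene_id "ENSG00000223972.5"; gene_name "DDX11L1"; level 2;',
    'gene_id "ENSG00000227232.5"; gene_name "WASH7P"; level 2;'
)

# Same two-column layout as chr_genes above
toy_map <- unique(data.frame(
    GeneSymbol = gtf_attribute(attrs, "gene_name"),
    ENSG       = gtf_attribute(attrs, "gene_id"),
    stringsAsFactors = FALSE
))
```

This is only meant to show where the two columns originate; it is not a replacement for a full GTF parser.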


## 1.3 Read in the universe as ENSG

In [22]:
universe_as_ensg <- read.table("../data/DGE_universe_ENSG.txt", 
                               header=TRUE, 
                               stringsAsFactors=FALSE)
head(universe_as_ensg,2)

| | ENSG |
|--|:---|
| | `<chr>` |
| 1 | ENSG00000000003.14 |
| 2 | ENSG00000000005.5 |


In [23]:
# strip the version suffix (everything from the first '.') from every ENSG id
universe_as_ensg$ENSG <- gsub("\\..*", "", universe_as_ensg$ENSG)
head(universe_as_ensg,2)

| | ENSG |
|--|:---|
| | `<chr>` |
| 1 | ENSG00000000003 |
| 2 | ENSG00000000005 |


In [24]:
write.table(universe_as_ensg,"../data/DGE_universe_ENSG_no_ver.txt", 
            row.names=FALSE, 
            quote=FALSE,
            col.names=TRUE)


The gene symbol and ENSG table was sorted externally (outside this notebook); the sorted file is read in here.

In [25]:
sortedGeneSymbolENSG <- read.table("../data/sortedGeneSymbolENSG.txt", 
                                   stringsAsFactors=FALSE,
                                  header=TRUE)
head(sortedGeneSymbolENSG)

| | GeneSymbol | ENSG |
|--|:---|:---|
| | `<chr>` | `<chr>` |
| 1 | TSPAN6 | ENSG00000000003.14 |
| 2 | TNMD | ENSG00000000005.6 |
| 3 | DPM1 | ENSG00000000419.12 |
| 4 | SCYL3 | ENSG00000000457.14 |
| 5 | C1orf112 | ENSG00000000460.17 |
| 6 | FGR | ENSG00000000938.13 |


In [26]:
# strip the version suffix (everything from the first '.') from every ENSG id
sortedGeneSymbolENSG$ENSG <- gsub("\\..*", "", sortedGeneSymbolENSG$ENSG)
head(sortedGeneSymbolENSG,2)
write.table(sortedGeneSymbolENSG,"../data/sortedGeneSymbolENSG_no_ver.txt", 
            row.names=FALSE, 
            quote=FALSE,
            col.names=TRUE)

| | GeneSymbol | ENSG |
|--|:---|:---|
| | `<chr>` | `<chr>` |
| 1 | TSPAN6 | ENSG00000000003 |
| 2 | TNMD | ENSG00000000005 |
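
Removing everything after the first dot can make ids that were distinct collapse into one, for example when an annotation carries a suffixed variant of the same gene id. This may be why the matching step in section 1.4 has a branch for more than one match. A small sketch of how such collisions could be listed follows; the ids are illustrative toy values, not taken from the notebook's data.

```r
# Toy ids (illustrative only; version numbers and suffix are assumptions)
ids <- c("ENSG00000000003.14",
         "ENSG00000182378.14",
         "ENSG00000182378.14_PAR_Y")

# Same pattern as the cells above: drop everything from the first '.'
stripped <- sub("\\..*", "", ids)

# Ids that occur more than once after stripping
duplicated_ids <- unique(stripped[duplicated(stripped)])
```

Inspecting `duplicated_ids` before matching shows how many universe entries would fall into the multi-match branch.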


## 1.4 match the gencode v30 ENSGs to the universe

In [27]:
geneSymbol <- rep("NA",dim(universe_as_ensg)[1])

In [36]:
for (i in 1:dim(universe_as_ensg)[1]) {
    match  <- as.character(sortedGeneSymbolENSG$ENSG) %in% as.character((universe_as_ensg$ENSG[i]))
    if (sum(match==TRUE)== 1) {
        geneSymbol[i] <- sortedGeneSymbolENSG[match,]$GeneSymbol
    } else if (sum(match==TRUE)>1) {
        all <- as.vector(as.character(sortedGeneSymbolENSG[match,]$GeneSymbol))
        geneSymbol[i] <- as.character(all[1])
    }
}
head(geneSymbol,2)
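
The loop above scans the whole mapping table once per universe entry. A vectorized lookup with base R's `match()` should give the same result (first matching symbol, `"NA"` when there is no match). The sketch below uses toy data; the data frame `map`, the vector `universe`, and the symbol `TNMD_dup` are made up for illustration.

```r
# Toy mapping with one duplicated ENSG id (hypothetical values)
map <- data.frame(
    GeneSymbol = c("TSPAN6", "TNMD", "TNMD_dup"),
    ENSG       = c("ENSG00000000003", "ENSG00000000005", "ENSG00000000005"),
    stringsAsFactors = FALSE
)
universe <- c("ENSG00000000005", "ENSG00000000003", "ENSG99999999999")

# match() returns the index of the FIRST hit, or NA when there is none
idx     <- match(universe, map$ENSG)
symbols <- ifelse(is.na(idx), "NA", map$GeneSymbol[idx])
```

Because `match()` always returns the first hit, it mirrors the `all[1]` choice made in the loop for ids that match more than one row.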

In [38]:
write.table(geneSymbol,"../data/DGEuniverse.txt", 
            row.names=FALSE, 
            quote=FALSE,
            col.names=FALSE)

## 1.5 download needed files for ontologizer.

To run the ontologizer, 4 files are needed: a universe.txt file, a gene-set file, and 2 reference files. For differential gene expression, each tissue has an ensg_map file that provides the required gene symbols.

| | FILE | DESCRIPTION|
|--|:---|:---|
|1|**`universe.txt`** | created by writing all the GeneSymbol entries in the HBA-DEALS results table. To retrieve `GeneSymbol`, the Gene column was split on `'_'` into `Geneid` and `GeneSymbol`, e.g. `ENSG00000004059.11_ARF5` -> `ENSG00000004059.11`, `ARF5` |
|2|**`gene_set.txt`** | created by writing the `GeneSymbol` entries after applying a filtering criterion (for this test,  I used `P` < 0.05 and `ExpLogFc` > 1.2) |
|3|**`goa_human.gaf`** | downloaded from here: http://current.geneontology.org/annotations/goa_human.gaf.gz |
|4|**`go.obo`** | downloaded from here: http://purl.obolibrary.org/obo/go.obo |


In [40]:
command <- 'wget http://current.geneontology.org/annotations/goa_human.gaf.gz'
message("running :", command)
system(command)
command <- 'wget http://purl.obolibrary.org/obo/go.obo'
message("running :", command)
system(command)

running :wget http://current.geneontology.org/annotations/goa_human.gaf.gz

running :wget http://purl.obolibrary.org/obo/go.obo
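
Note that the run in section 2.0 reads `../data/go.obo`, `../data/goa_human.gaf`, and `../data/sortedDGEuniverse.txt`, whereas the cell above saves its downloads into the working directory and the annotation file is still gzipped. A quick pre-flight check, sketched below, would surface any input that is not where the run expects it. The helper name `missing_inputs` is hypothetical.

```r
# Hypothetical helper: return the subset of paths that do not exist
missing_inputs <- function(paths) {
    paths[!file.exists(paths)]
}

# Paths as they appear in the ontologizer command in section 2.0
required <- c("Ontologizer.jar",
              "../data/go.obo",
              "../data/goa_human.gaf",
              "../data/sortedDGEuniverse.txt")

missing_files <- missing_inputs(required)
if (length(missing_files) > 0) {
    message("missing inputs: ", paste(missing_files, collapse = ", "))
}
```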



## 1.6 make the geneset for each of the tissues

In [41]:
results_dir         <- "../data/"
DGE_ensg_pattern <- "_DGE_ensg_map.csv"
DGE_ensg_files   <- list.files(path = results_dir, pattern = DGE_ensg_pattern)
head(DGE_ensg_files,2)

In [61]:
for (i in 1:length(DGE_ensg_files)) {
    file <- paste0("../data/",DGE_ensg_files[i])
    ensg_map <- read.table(file, 
                           stringsAsFactors=FALSE,
                           sep=",",
                           header=TRUE)
    tissue  <- gsub(DGE_ensg_pattern,"", file, fixed = TRUE)
    genesetfile <- paste0(tissue,"_geneset.txt")
    write.table(ensg_map$ensg_genes,
                genesetfile,
                col.names=FALSE,
                quote=FALSE,
                row.names=FALSE)

}
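
In the loop above, `file` still carries the `../data/` prefix when the suffix is removed, so `tissue` and the gene set file name keep that prefix as well. The sketch below shows one way to derive the bare tissue name and build the output path explicitly. The helper `tissue_from_file` is hypothetical, and the example file name is assumed from the pattern and the tissue names printed in section 2.0.

```r
# Hypothetical helper: strip the directory and a literal suffix from a path
tissue_from_file <- function(path, suffix = "_DGE_ensg_map.csv") {
    name <- basename(path)
    stopifnot(endsWith(name, suffix))
    substr(name, 1, nchar(name) - nchar(suffix))
}

tissue      <- tissue_from_file("../data/adrenal_gland_DGE_ensg_map.csv")
genesetfile <- file.path("../data", paste0(tissue, "_geneset.txt"))
```

Using `endsWith` treats the suffix as literal text, whereas the `pattern` argument of `list.files` is a regular expression in which `.` matches any character.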

## 2.0 run the ontologizer for each tissue


In [62]:
results_dir         <- "../data/"
geneset_pattern <- "_geneset.txt"
genes_files   <- list.files(path = results_dir, pattern = geneset_pattern)
head(genes_files,2)

In [70]:
for (i in seq_along(genes_files)) {
    file    <- paste0("../data/", genes_files[i])
    # tissue keeps the "../data/" prefix, so it doubles as the output directory
    tissue  <- gsub(geneset_pattern, "", file, fixed = TRUE)
    command <- paste0("mkdir ", tissue)
    message("making directory ", command)
    system(command)
    # -s is the tissue gene set (study set), -p the shared universe (population)
    command <- paste0("java -jar Ontologizer.jar -g ../data/go.obo -a ../data/goa_human.gaf -s ",
                      file, " -o ", tissue,
                      " -p ../data/sortedDGEuniverse.txt -c Term-For-Term -m Benjamini-Hochberg -n")
    message("running ", command)
    system(command)
}

making directory mkdir ../data/adipose_subcutaneous

running java -jar Ontologizer.jar -g ../data/go.obo -a ../data/goa_human.gaf -s ../data/adipose_subcutaneous_geneset.txt -o ../data/adipose_subcutaneous -p ../data/sortedDGEuniverse.txt -c Term-For-Term -m Benjamini-Hochberg -n

making directory mkdir ../data/adipose_visceral_omentum

running java -jar Ontologizer.jar -g ../data/go.obo -a ../data/goa_human.gaf -s ../data/adipose_visceral_omentum_geneset.txt -o ../data/adipose_visceral_omentum -p ../data/sortedDGEuniverse.txt -c Term-For-Term -m Benjamini-Hochberg -n

making directory mkdir ../data/adrenal_gland

running java -jar Ontologizer.jar -g ../data/go.obo -a ../data/goa_human.gaf -s ../data/adrenal_gland_geneset.txt -o ../data/adrenal_gland -p ../data/sortedDGEuniverse.txt -c Term-For-Term -m Benjamini-Hochberg -n

making directory mkdir ../data/artery_aorta

running java -jar Ontologizer.jar -g ../data/go.obo -a ../data/goa_human.gaf -s ../data/artery_aorta_geneset.txt -o ..
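
The command string logged above can also be assembled from an argument vector, which keeps each flag next to its value. The sketch below reproduces the flags used in this notebook; the function name `ontologizer_args` is hypothetical, and the default paths are the ones that appear in the logged commands.

```r
# Hypothetical helper: argument vector for one ontologizer run
ontologizer_args <- function(geneset_file, out_dir,
                             go         = "../data/go.obo",
                             gaf        = "../data/goa_human.gaf",
                             population = "../data/sortedDGEuniverse.txt") {
    c("-jar", "Ontologizer.jar",
      "-g", go,
      "-a", gaf,
      "-s", geneset_file,
      "-o", out_dir,
      "-p", population,
      "-c", "Term-For-Term",
      "-m", "Benjamini-Hochberg",
      "-n")
}

args    <- ontologizer_args("../data/adipose_subcutaneous_geneset.txt",
                            "../data/adipose_subcutaneous")
command <- paste(c("java", args), collapse = " ")
```

The same vector could be passed to `system2("java", args)` instead of pasting a single string for `system()`.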

## Appendix - Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **'artefacts'**, files generated during the analysis and stored in the folder directory **`data`**
2. List of environment metadata, dependencies, versions of libraries using `conda list`
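
The cells below record the session information only. For the checksums mentioned in item 1, a minimal sketch using `tools::md5sum` could look like the following. The helper `checksum_table` is hypothetical, and the choice of MD5 is an assumption since the text does not name a hash.

```r
# Hypothetical helper: MD5 checksums for all regular files in a directory
checksum_table <- function(dir) {
    files <- list.files(dir, full.names = TRUE)
    files <- files[!file.info(files)$isdir]   # skip sub-directories
    data.frame(file = basename(files),
               md5  = unname(tools::md5sum(files)),
               stringsAsFactors = FALSE)
}

# Example call, commented out because it depends on the local ../data folder:
# checksum_table("../data")
```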

### Libraries metadata

In [76]:
notebookid <- "runOntologizerForDGE.ipynb"
dev_session_info   <- devtools::session_info()
utils_session_info <- utils::sessionInfo()

message("Saving `devtools::session_info()` objects in ../metadata/devtools_session_info.rds  ..")
saveRDS(dev_session_info, file = paste0("../metadata/", notebookid, "_devtools_session_info.rds"))
message("Done!\n")

message("Saving `utils::sessionInfo()` objects in ../metadata/utils_session_info.rds  ..")
saveRDS(utils_session_info, file = paste0("../metadata/", notebookid ,"_utils_info.rds"))
message("Done!\n")

dev_session_info$platform
dev_session_info$packages[dev_session_info$packages$attached==TRUE, ]

Saving `devtools::session_info()` objects in ../metadata/devtools_session_info.rds  ..

Done!


Saving `utils::sessionInfo()` objects in ../metadata/utils_session_info.rds  ..

Done!




 setting  value                       
 version  R version 3.6.2 (2019-12-12)
 os       Ubuntu 18.04.3 LTS          
 system   x86_64, linux-gnu           
 ui       X11                         
 language en_US.UTF-8                 
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Etc/UTC                     
 date     2020-06-30                  

| | package | ondiskversion | loadedversion | path | loadedpath | attached | is_base | date | source | md5ok | library |
|--|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<lgl>` | `<lgl>` | `<chr>` | `<chr>` | `<lgl>` | `<fct>` |
| BiocGenerics | BiocGenerics | 0.32.0 | 0.32.0 | /opt/conda/lib/R/library/BiocGenerics | /opt/conda/lib/R/library/BiocGenerics | True | False | 2019-10-29 | Bioconductor | | /opt/conda/lib/R/library |
| Biostrings | Biostrings | 2.54.0 | 2.54.0 | /opt/conda/lib/R/library/Biostrings | /opt/conda/lib/R/library/Biostrings | True | False | 2019-10-29 | Bioconductor | | /opt/conda/lib/R/library |
| GenomeInfoDb | GenomeInfoDb | 1.22.0 | 1.22.0 | /opt/conda/lib/R/library/GenomeInfoDb | /opt/conda/lib/R/library/GenomeInfoDb | True | False | 2019-10-29 | Bioconductor | | /opt/conda/lib/R/library |
| GenomicRanges | GenomicRanges | 1.38.0 | 1.38.0 | /opt/conda/lib/R/library/GenomicRanges | /opt/conda/lib/R/library/GenomicRanges | True | False | 2019-10-29 | Bioconductor | | /opt/conda/lib/R/library |
| IRanges | IRanges | 2.20.0 | 2.20.0 | /opt/conda/lib/R/library/IRanges | /opt/conda/lib/R/library/IRanges | True | False | 2019-10-29 | Bioconductor | | /opt/conda/lib/R/library |
| rtracklayer | rtracklayer | 1.46.0 | 1.46.0 | /opt/conda/lib/R/library/rtracklayer | /opt/conda/lib/R/library/rtracklayer | True | False | 2019-10-29 | Bioconductor | | /opt/conda/lib/R/library |
| S4Vectors | S4Vectors | 0.24.0 | 0.24.0 | /opt/conda/lib/R/library/S4Vectors | /opt/conda/lib/R/library/S4Vectors | True | False | 2019-10-29 | Bioconductor | | /opt/conda/lib/R/library |
| XVector | XVector | 0.26.0 | 0.26.0 | /opt/conda/lib/R/library/XVector | /opt/conda/lib/R/library/XVector | True | False | 2019-10-29 | Bioconductor | | /opt/conda/lib/R/library |
