# Analysis Notebook - create all DGE files

This notebook creates and saves two files:

 **1. chr_genes.tsv:** chromosome, ENSG id (without version number) and GeneSymbol for every gene in gencode.v30.annotation.gtf

 **2. all_gene_dge.tsv:** the ENSG ids used in the differential gene expression (DGE) analysis, annotated with GeneSymbol and chromosome from chr_genes

In [1]:
library(dplyr)
library(rtracklayer)





### 1  Add to the all_gene_dge.tsv structure

First gather the data and add the GeneSymbol, the ENSG id without version and the chromosome.

### 1.1 Create a file used for statistical analysis of DGE genes
All tissues used the same list of genes for the differential gene analysis, so reading any of the files allows these ENSG ids to be mapped to GeneSymbols and chromosomes using the gencode.v30.annotation file.

In [2]:
#
# add chr information for summary data later, use the annotation we used for rMATS
#
# test for the unzipped file: gunzip removes the .gz, so checking for the .gz
# would trigger a new download on every run
if (!file.exists("../data/gencode.v30.annotation.gtf")) {
    message("downloading gencode v30 annotation\n")
    system("wget -O ../data/gencode.v30.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz")
    message("Done!\n")
    message("Unzipping compressed file gencode.v30.annotation.gtf.gz..")
    system("gunzip ../data/gencode.v30.annotation.gtf.gz", intern = TRUE)
    message("Done! gencode.v30.annotation.gtf can be found in ../data/")
}
gencode <- import("../data/gencode.v30.annotation.gtf")
gtf.df <- as.data.frame(gencode)
chr_genes <- unique(gtf.df[, c("seqnames", "gene_name", "gene_id")])
colnames(chr_genes) <- c("chr", "GeneSymbol", "ENSG")
head(chr_genes)
# strip the trailing version suffix (e.g. ".5") from every ENSG id in one vectorized call
chr_genes$ENSG <- sub("\\.\\w+$", "", chr_genes$ENSG)
head(chr_genes)

First `head(chr_genes)` (ENSG ids with version):

| | chr | GeneSymbol | ENSG |
|---|---|---|---|
| | `<fct>` | `<chr>` | `<chr>` |
| 1 | chr1 | DDX11L1 | ENSG00000223972.5 |
| 13 | chr1 | WASH7P | ENSG00000227232.5 |
| 26 | chr1 | MIR6859-1 | ENSG00000278267.1 |
| 29 | chr1 | MIR1302-2HG | ENSG00000243485.5 |
| 37 | chr1 | MIR1302-2 | ENSG00000284332.1 |
| 40 | chr1 | FAM138A | ENSG00000237613.2 |

Second `head(chr_genes)` (ENSG ids without version):

| | chr | GeneSymbol | ENSG |
|---|---|---|---|
| | `<fct>` | `<chr>` | `<chr>` |
| 1 | chr1 | DDX11L1 | ENSG00000223972 |
| 13 | chr1 | WASH7P | ENSG00000227232 |
| 26 | chr1 | MIR6859-1 | ENSG00000278267 |
| 29 | chr1 | MIR1302-2HG | ENSG00000243485 |
| 37 | chr1 | MIR1302-2 | ENSG00000284332 |
| 40 | chr1 | FAM138A | ENSG00000237613 |
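
As a small illustration of the version stripping shown in the two tables, the sketch below applies the same regular expression to a few toy ids. The helper name `strip_version` and the `_PAR_Y` example id are assumptions made for illustration; they are not part of the notebook.

```r
# Toy ENSG ids in GENCODE style (illustrative values only)
ensg_with_version <- c("ENSG00000223972.5",
                       "ENSG00000227232.5",
                       "ENSG00000182378.14_PAR_Y")

# Hypothetical helper: drop the final ".<version>" suffix from every id.
# \w also matches "_", so a "_PAR_Y" tag after the version is removed too.
strip_version <- function(ids) sub("\\.\\w+$", "", ids)

ensg_no_version <- strip_version(ensg_with_version)
print(ensg_no_version)
```

Because a suffix such as `_PAR_Y` is removed together with the version, two distinct versioned ids can collapse onto one unversioned id. This is one plausible source of the multiple matches handled later in the notebook.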


In [3]:
write.table(chr_genes, "../data/chr_genes.tsv", quote=FALSE, sep="\t")

### 1.2 Create the all_gene_dge.tsv file for analysis

All of the **DGE.csv** tissue files have the same gene names
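
The statement above can be checked before relying on it. The following is a minimal sketch on toy data; `same_gene_ids` is a hypothetical helper and the id vectors stand in for the `rownames()` of each tissue table.

```r
# Hypothetical helper: TRUE when every id vector holds the same set of ids,
# regardless of order
same_gene_ids <- function(id_list) {
    reference <- sort(unique(id_list[[1]]))
    all(vapply(id_list,
               function(ids) identical(sort(unique(ids)), reference),
               logical(1)))
}

# Toy stand-ins for the row names of three tissue tables
toy_ids <- list(
    tissue_a = c("ENSG00000223972.5", "ENSG00000227232.5"),
    tissue_b = c("ENSG00000227232.5", "ENSG00000223972.5"),
    tissue_c = c("ENSG00000223972.5")
)

same_a_b <- same_gene_ids(toy_ids[c("tissue_a", "tissue_b")])  # same set, different order
same_all <- same_gene_ids(toy_ids)                             # tissue_c lacks one id
print(c(same_a_b, same_all))
```

If such a check returned `FALSE` for the real files, reading a single file would not be enough and all files would need to be combined, as the loop in this section does.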

In [4]:
results_dir     <- "../data/"
# list.files() expects a regular expression, not a shell glob
all_dge_pattern <- "_DGE\\.csv$"
all_dge_files   <- list.files(path = results_dir, pattern = all_dge_pattern)
message("number of DGE files ", length(all_dge_files))

number of DGE files 39



In [5]:
all_gene_dge = data.frame()

In [6]:
for (file in seq_along(all_dge_files)) {

    lines  <- read.table(file = paste0(results_dir, all_dge_files[file]),
                         header = TRUE, sep = ",", quote = "\"'", skipNul = FALSE)
    message("For   ", all_dge_files[file])
    # dim(lines) is c(rows, columns); message() pastes both values together
    # without a separator, so a 16122 x 6 table would print as 161226
    message("we find the number of genes to be ", dim(lines))

    if (nrow(lines) > 0) {
        ensg_ver    <- as.character(rownames(lines))
        ensg_no_ver <- sub("\\.\\w+$", "", ensg_ver)

        # match() returns the first matching row of chr_genes for each id, so
        # multiple matches keep only the first result; no match gives NA
        idx <- match(ensg_no_ver, as.character(chr_genes$ENSG))

        # name the columns with `=`; `<-` inside data.frame() would create
        # variables in the calling environment instead of naming columns
        res <- data.frame(ENSG_ver    = ensg_ver,
                          ENSG_no_ver = ensg_no_ver,
                          GeneSymbol  = ifelse(is.na(idx), "NA", as.character(chr_genes$GeneSymbol[idx])),
                          chr         = ifelse(is.na(idx), "NA", as.character(chr_genes$chr[idx])),
                          stringsAsFactors = FALSE)

        all_gene_dge <- rbind(all_gene_dge, res)

    } # if has events
} # for all files

For   adipose_subcutaneous_DGE.csv
we find the number of genes to be 161226

For   adipose_visceral_omentum_DGE.csv
we find the number of genes to be 163406

For   adrenal_gland_DGE.csv
we find the number of genes to be 160236

For   artery_aorta_DGE.csv
we find the number of genes to be 158156

For   artery_coronary_DGE.csv
we find the number of genes to be 161516

For   artery_tibial_DGE.csv
we find the number of genes to be 153466

For   brain_caudate_basal_ganglia_DGE.csv
we find the number of genes to be 167906

For   brain_cerebellar_hemisphere_DGE.csv
we find the number of genes to be 167826

For   brain_cerebellum_DGE.csv
we find the number of genes to be 170536

For   brain_cortex_DGE.csv
we find the number of genes to be 167496

For   brain_frontal_cortex_ba_9_DGE.csv
we find the number of genes to be 166806

For   brain_hippocampus_DGE.csv
we find the number of genes to be 165536

For   brain_hypothalamus_DGE.csv
we find the number of genes to be 170716

For   brain_nucleus_accumbens_basal_ganglia_DGE.csv
we find the number of genes to be 168296

For   brain_putamen_basal_ganglia_DGE.csv
we find the number of genes to be 163506

For   brain_spinal_cord_cervical_c_1_DGE.csv
we find the number of genes to be 164516

For   breast_mammary_tissue_DGE.csv
we find the number of genes to be 170786

For   cells_cultured_fibroblasts_DGE.csv
we find the number of genes to be 142836

For   cells_ebv_transformed_lymphocytes_DGE.csv
we find the number of genes to be 143106

For   colon_sigmoid_DGE.csv
we find the number of genes to be 162546

For   colon_transverse_DGE.csv
In [None]:
colnames(all_gene_dge) <- c("ENSG_ver","ENSG_no_ver","GeneSymbol","chr")
sorted_all_gene_dge <- all_gene_dge[order(all_gene_dge$ENSG_ver),]
unique_all_gene_dge <- unique(sorted_all_gene_dge)

In [None]:
message("The universe of all genes (without ChrY) is ", length(unique_all_gene_dge$GeneSymbol))

n_unique_genes <- length(unique(all_gene_dge$GeneSymbol))
message("We extracted a total of ",nrow(all_gene_dge)," differential gene events (all_gene_dge)")
message("This includes ", n_unique_genes, " total genes")

In [None]:
table(unique_all_gene_dge$chr)
write.table(unique_all_gene_dge, "../data/all_gene_dge.tsv", quote=FALSE, sep="\t")
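
Because `write.table()` is called here with its default `row.names = TRUE`, the saved file has a header line with one field fewer than the data lines. The sketch below, using a temporary file and a toy data frame (both assumptions for illustration), shows how such a file reads back with `read.table()`.

```r
# Toy table with non-consecutive row names, similar in shape to the saved data
toy <- data.frame(ENSG_ver = c("ENSG00000223972.5", "ENSG00000227232.5"),
                  chr      = c("chr1", "chr1"),
                  stringsAsFactors = FALSE)
rownames(toy) <- c("1", "13")

# Write with the same options as the notebook, but to a temporary file
tsv_path <- tempfile(fileext = ".tsv")
write.table(toy, tsv_path, quote = FALSE, sep = "\t")

# The header holds only the column names; the row names have no header field
header_fields <- strsplit(readLines(tsv_path, n = 1), "\t")[[1]]

# read.table() sees the shorter header and uses the first column as row names
round_trip <- read.table(tsv_path, header = TRUE, sep = "\t",
                         stringsAsFactors = FALSE)
print(header_fields)
print(round_trip)
```

Readers that do not apply this rule may shift the column names by one. Passing `row.names = FALSE` to `write.table()` is one way to avoid that if the row index is not needed downstream.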

### Appendix - Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **'artefacts'**, files generated during the analysis and stored in the folder directory **`data`**
2. List of environment metadata, dependencies, versions of libraries using `utils::sessionInfo()` and [`devtools::session_info()`](https://devtools.r-lib.org/reference/session_info.html)

### Appendix 1. Checksums with the sha256 algorithm

In [None]:
rm (notebookid)
notebookid   = "createAllgeneDGE"
notebookid

message("Generating sha256 checksum of the file `../data/all_gene_dge.tsv` .. ")
system(paste0("cd ../data && find . -name all_gene_dge.tsv -exec sha256sum {} \\;  >  ../data/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

message("Generating sha256 checksum of the file `../data/chr_genes.tsv` .. ")
# append with >> so the checksum written above is kept
system(paste0("cd ../data && find . -name chr_genes.tsv -exec sha256sum {} \\;  >>  ../data/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

### Appendix 2. Libraries metadata

In [None]:
dev_session_info   <- devtools::session_info()
utils_session_info <- utils::sessionInfo()

message("Saving `devtools::session_info()` objects in ../data/", notebookid, "_devtools_session_info.rds  ..")
saveRDS(dev_session_info, file = paste0("../data/", notebookid, "_devtools_session_info.rds"))
message("Done!\n")

message("Saving `utils::sessionInfo()` objects in ../data/", notebookid, "_utils_info.rds  ..")
saveRDS(utils_session_info, file = paste0("../data/", notebookid ,"_utils_info.rds"))
message("Done!\n")

dev_session_info$platform
dev_session_info$packages[dev_session_info$packages$attached==TRUE, ]