# Analysis Notebook - Count Genes and Events

This notebook processes the raw counts produced by rMATS 3.2.5, which was run with modifications so that it could be applied at scale to the entirety of GTEx. Please refer to the methods in the paper for details; in brief, each sample was run against another. Running rMATS 3.2.5 this way let us combine the read counting done by rMATS with its scan of the provided GTF. In this case, rMATS 3.2.5 scanned GENCODE's complete annotation GTF and catalogued the possible alternative splicing events in the `fromGTF` files, which record the genomic locations for each of the five alternative splicing types. The short-read RNA sequencing files collected by GTEx, annotated by sex and tissue, were then aligned and counted against the junctions described by these `fromGTF` files. These counts are the ones used in the analyses for this paper.

This notebook counts the genes and events and performs several descriptive statistical analyses. It produces the outputs needed for downstream visualizations, which are in separate notebooks for clarity.

## 1 Introduction

## 1.1 Data files needed by this notebook

Input files have been pulled from results. The data were primarily obtained by running the Nextflow workflow rmats-nf, found in the GitHub repository https://github.com/lifebit-ai/rmats-nf, or by running the notebooks named below. How each file was generated is noted in its entry.

1. **`*`.model_B_sex_as_events.csv** One for each tissue and splicing type, generated by running the notebook `differentialSplicingJunctionAnalysis.ipynb` as a Nextflow workflow.
2. **`*`.model_B_sex_as_events_refined.csv** One for each tissue and splicing type; these are the statistically significant results (FC > 1.5 and p-value < 0.05), generated by running the notebook `differentialSplicingJunctionAnalysis.ipynb` as a Nextflow workflow.
3. **fromGTF.`*`.txt** One for each splicing type, generated within rmats-nf, a Nextflow workflow found here `https://github.com/lifebit-ai/rmats-nf` (fromGTF.A3SS.txt, fromGTF.A5SS.txt, fromGTF.MXE.txt, fromGTF.RI.txt,fromGTF.SE.txt)
4. **`*`.DGE.csv** One for each tissue, generated by `differentialGeneExpressionAnalysis.ipynb`
5. **`*`.DGE_refined.csv** One for each tissue, generated by `differentialGeneExpressionAnalysis.ipynb`
6. **srr_pdata.csv** Sequence Run (SRR) data merged with phenotype data, generated by `differentialGeneExpressionAnalysis.ipynb`
7. **rmats_final.`*`.jc.ijc.txt.gz** One for each splicing type, this is a matrix of all included junction (ijc) counts for each junction and for each sample (SRR) generated by rmats-nf found here `https://github.com/lifebit-ai/rmats-nf`. 
8. **rmats_final.`*`.jc.inclen.txt.gz** One for each splicing type, this is a matrix of inclusion length (inclen) for each junction and for each sample (SRR) generated by rmats-nf found here `https://github.com/lifebit-ai/rmats-nf`.
9. **rmats_final.`*`.jc.inc.txt.gz** One for each splicing type, this is a matrix of percent spliced in (inc) for each junction and for each sample (SRR) generated by rmats-nf found here `https://github.com/lifebit-ai/rmats-nf`.
10. **rmats_final.`*`.jc.sjc.txt.gz** One for each splicing type, this is a matrix of skipped junction counts (sjc) for each junction and for each sample (SRR) generated by rmats-nf found here `https://github.com/lifebit-ai/rmats-nf`.
11. **rmats_final.`*`.jc.skiplen.txt.gz** One for each splicing type, this is a matrix of skipped length (skiplen) for each junction and for each sample (SRR) generated by rmats-nf found here `https://github.com/lifebit-ai/rmats-nf`.

Two files are needed from the `sbas/assets` directory:

12. **tissues.tsv** A curated list of tissues giving, for each tissue, the number of female and male samples and a display name for the graphics.
13. **all_gene_dge.tsv** The differentially expressed genes (all of them, not just the significant ones) for all the tissues considered in this study, generated by a custom script.
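
As a quick sanity check before running the cells below, the expected rMATS matrix file names can be enumerated and compared with what is on disk. This is only a sketch: the lower-case splicing-type labels in the file names and the `data_dir` location are assumptions made for illustration, not something the list above confirms.

```r
# Assumed (hypothetical) naming: rmats_final.<type>.jc.<matrix>.txt.gz with lower-case types
as_types <- c("a3ss", "a5ss", "mxe", "ri", "se")
matrices <- c("ijc", "inclen", "inc", "sjc", "skiplen")
data_dir <- "../data"   # assumed location of the input files

# every combination of splicing type and matrix kind
grid <- expand.grid(type = as_types, matrix = matrices, stringsAsFactors = FALSE)
expected_files <- sprintf("rmats_final.%s.jc.%s.txt.gz", grid$type, grid$matrix)

# report which of the expected files are not present
missing_files <- expected_files[!file.exists(file.path(data_dir, expected_files))]
message(length(expected_files), " files expected, ", length(missing_files), " missing")
```

If the real file names use a different case or prefix, only the `sprintf` template needs to change.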


## 1.2 Data files created by this notebook

Output text files are written to the ``sbas/data`` directory (this notebook is in the ``sbas/jupyter`` directory). 

1.  **gene_as.tsv**: Significant alternative splicing events per gene, adjusted p-value <= 0.05, fold change >= 1.5.
2.  **all_gene_as.tsv**: all alternative splicing events
3.  **gene_dge.tsv**: Significant differential gene expression, adjusted p-value <= 0.05, fold change >= 1.5
4.  **all_gene_dge.tsv**: all differential gene expression events. This file is NOT produced here; it is found in the assets directory at ../assets/all_gene_dge.tsv
5.  **genesWithCommonAS.tsv**: genes (as geneSymbol, the number of splicing events, and the number of tissues the event occurs in)
6.  **Total_AS_by_chr.tsv**: Total alternative splicing events per chromosome
7.  **Total_AS_by_geneSymbol.tsv**: Count the number of tissues in which specific genes show significant alternative splicing
8.  **DGE_by_geneSymbol.tsv**: Most highly expressed genes by tissue
9.  **Total_AS_by_tissue.tsv**: Count the number of significant splicing events per tissue
10. **Total_AS_by_splicingtype.tsv**: Count the number of significant splicing events for each of the five alternative splicing categories
11. **SplicingIndex_chr.tsv**: Splicing index by chr (number of significant AS events per 1000 exons)
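
The splicing index in item 11 is a simple rate: significant AS events on a chromosome divided by the number of exons on that chromosome, scaled to 1000 exons. A minimal sketch on toy data follows; the event rows and exon counts are made-up values and the variable names are hypothetical, not taken from the notebook.

```r
# Toy inputs (hypothetical values, for illustration only)
toy_events <- data.frame(chr = c("chr1", "chr1", "chrX", "chrX", "chrX"),
                         stringsAsFactors = FALSE)
toy_exons  <- c(chr1 = 2000, chrX = 500)   # assumed number of exons per chromosome

# count significant events per chromosome
events_per_chr <- table(toy_events$chr)

# splicing index = events per 1000 exons
splicing_index <- data.frame(
    chr   = names(events_per_chr),
    n_as  = as.integer(events_per_chr),
    exons = as.numeric(toy_exons[names(events_per_chr)]),
    stringsAsFactors = FALSE
)
splicing_index$index <- splicing_index$n_as / splicing_index$exons * 1000
print(splicing_index)
```

Normalizing by exon count keeps exon-rich chromosomes from dominating the comparison purely because they offer more opportunities for splicing.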

## 2 Run Analyses

## 2.1 Load Libraries


In [1]:
#conda install r-dplyr bioconductor-biobase r-tibble r-r.utils bioconductor-rtracklayer -y
start_time <- Sys.time()
suppressMessages({
    options(warn = -1) 
    library(readr)
    library(dplyr)
    library(Biobase)
    library(tibble)
    library(R.utils)
    library(rtracklayer)
})

## 2.2 Read in Curated Tissue List

Read in a curated list of tissues that records the number of female and male samples for each tissue, along with a curated tissue name for display.


In [2]:
library(readr)

In [3]:
tissue_reduction_filename <- "../assets/tissues.tsv"
#tissue_reduction <- read.table(tissue_reduction_filename, header=TRUE, sep="\t",
#                               skipNul=FALSE, stringsAsFactors = FALSE)
tissue_reduction <- data.table::fread(tissue_reduction_filename)
colnames(tissue_reduction)  <- c("SMTSD","female","male","include","display_name")
tissue_reduction <- tissue_reduction[tissue_reduction$display_name != "n/a",]
tissue_reduction$display_name <- factor(tissue_reduction$display_name)
levels(tissue_reduction$display_name)
message("We extracted ", length(levels(tissue_reduction$display_name))," different tissues with at least 50 samples in both M & f")

We extracted 39 different tissues with at least 50 samples in both M & f



## 2.3 Read in refined differential AS events


In [4]:
significant_results_dir = "../data/"
pattern = "model_B_sex_as_events_refined.csv"
files <- list.files(path = significant_results_dir, pattern = pattern)
as_types <- c("a3ss", "a5ss", "mxe", "ri", "se")
message("We extracted ", length(files), " model_B_sex_as_events_refined.csv files")

We extracted 195 model_B_sex_as_events_refined.csv files



## 2.4 Read in the AS Events annotations

In [5]:
a3ss_annot <- data.table::fread("../data/fromGTF.A3SS.txt")
a5ss_annot <- data.table::fread("../data/fromGTF.A5SS.txt")
mxe_annot  <- data.table::fread("../data/fromGTF.MXE.txt")
ri_annot   <- data.table::fread("../data/fromGTF.RI.txt")
se_annot   <- data.table::fread("../data/fromGTF.SE.txt")

In [6]:
head(se_annot)

ID,GeneID,geneSymbol,chr,strand,exonStart_0base,exonEnd,upstreamES,upstreamEE,downstreamES,downstreamEE
<int>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<int>,<int>,<int>,<int>
1,ENSG00000034152.18,MAP2K3,chr17,+,21287990,21288091,21284709,21284969,21295674,21295769
2,ENSG00000034152.18,MAP2K3,chr17,+,21303182,21303234,21302142,21302259,21304425,21304553
3,ENSG00000034152.18,MAP2K3,chr17,+,21295674,21295769,21287990,21288091,21296085,21296143
4,ENSG00000034152.18,MAP2K3,chr17,+,21295674,21295769,21287990,21288091,21298412,21298479
5,ENSG00000034152.18,MAP2K3,chr17,+,21295674,21295769,21284710,21284969,21296085,21296143
6,ENSG00000034152.18,MAP2K3,chr17,+,21295674,21295769,21284710,21284969,21298412,21298479


## 2.5 create_as_structure 

This function aggregates the alternative splicing events. It works for both the full set of events and the significant events.

In [7]:
create_as_structure <- function ( results_dir, files, all_or_das, pattern, tissue_reduction) {
    gene_as = data.frame()
    counts <- rep(NA, length(files))
    message("\nnumber of files:", length(files))
    for (i in seq_along(files)) {
       lines  <- read.table(file=paste0(results_dir, files[i]), 
                                     header = TRUE, sep = ",", quote = "\"'", skipNul = FALSE)
       if (dim(lines)[1] > 0) {
           event     <- as.vector(as.character(rownames(lines)))
           tissue1   <- gsub(pattern,"", files[i], fixed = TRUE)
           counts[i] <- dim(lines)[1]
           event_idx <- substring(event, regexpr("[0-9]+$", event))
           res       <- data.frame()
           if (grepl("^a3ss_", files[i])) {
               # remove the first 5 letters of the string 
               tissue2 <- substring(tissue1,6)
               idx <- match(event_idx, a3ss_annot$ID)
               # name the columns with `=` so no variables are overwritten as a side effect
               res <- data.frame(GeneJunction = event,
                              ASE          = "A3SS", 
                              ASE_IDX      = idx,
                              Tissue       = tissue2,
                              counts       = counts[i],
                              Display      = tissue_reduction[tissue_reduction$SMTSD == tissue2, "display_name"],
                              GeneSymbol   = a3ss_annot$geneSymbol[idx],
                              GeneID       = a3ss_annot$GeneID[idx],
                              chr          = a3ss_annot$chr[idx],
                              logFC        = lines$logFC,
                              AveExpr      = lines$AveExpr,
                              t            = lines$t,
                              PValue       = lines$P.Value,
                              AdjPVal      = lines$adj.P.Val,
                              B            = lines$B)
               colnames(res) <- c("GeneJunction","ASE","ASE_IDX","Tissue","counts","Display",
                                  "GeneSymbol","GeneID","chr","logFC","AveExpr","t","PValue","AdjPVal","B")
               gene_as <- rbind(gene_as,res)
            
           } else if (grepl("^a5ss_", files[i])) {
               # remove the first 5 letters of the string 
               tissue2 <- substring(tissue1,6)
               idx <- match(event_idx, a5ss_annot$ID)
               res <- data.frame(GeneJunction = event,
                              ASE          = "A5SS", 
                              ASE_IDX      = idx,
                              Tissue       = tissue2,
                              counts       = counts[i],
                              Display      = tissue_reduction[tissue_reduction$SMTSD == tissue2, "display_name"],
                              GeneSymbol   = a5ss_annot$geneSymbol[idx],
                              GeneID       = a5ss_annot$GeneID[idx],
                              chr          = a5ss_annot$chr[idx],
                              logFC        = lines$logFC,
                              AveExpr      = lines$AveExpr,
                              t            = lines$t,
                              PValue       = lines$P.Value,
                              AdjPVal      = lines$adj.P.Val,
                              B            = lines$B)
               colnames(res) <- c("GeneJunction","ASE","ASE_IDX","Tissue","counts","Display",
                               "GeneSymbol","GeneID","chr","logFC","AveExpr","t","PValue","AdjPVal","B")
               gene_as <- rbind(gene_as,res)
           } else if (grepl("^mxe_", files[i])) {
               # remove the first 4 letters of the string 
               tissue2 <- substring(tissue1,5)
               # MXE events must be looked up in the MXE annotation table
               idx <- match(event_idx, mxe_annot$ID)
               res <- data.frame(GeneJunction = event,
                              ASE          = "MXE", 
                              ASE_IDX      = idx,
                              Tissue       = tissue2,
                              counts       = counts[i],
                              Display      = tissue_reduction[tissue_reduction$SMTSD == tissue2, "display_name"],
                              GeneSymbol   = mxe_annot$geneSymbol[idx],
                              GeneID       = mxe_annot$GeneID[idx],
                              chr          = mxe_annot$chr[idx],
                              logFC        = lines$logFC,
                              AveExpr      = lines$AveExpr,
                              t            = lines$t,
                              PValue       = lines$P.Value,
                              AdjPVal      = lines$adj.P.Val,
                              B            = lines$B)
               colnames(res) <- c("GeneJunction","ASE","ASE_IDX","Tissue","counts","Display",
                                  "GeneSymbol","GeneID","chr","logFC","AveExpr","t","PValue","AdjPVal","B")
               gene_as <- rbind(gene_as,res)
           } else if (grepl("^se_", files[i])) {
               # remove the first 3 letters of the string 
               tissue2 <- substring(tissue1,4)
               idx <- match(event_idx, se_annot$ID)
               res <- data.frame(GeneJunction = event,
                              ASE          = "SE", 
                              ASE_IDX      = idx,
                              Tissue       = tissue2,
                              counts       = counts[i],
                              Display      = tissue_reduction[tissue_reduction$SMTSD == tissue2, "display_name"],
                              GeneSymbol   = se_annot$geneSymbol[idx],
                              GeneID       = se_annot$GeneID[idx],
                              chr          = se_annot$chr[idx],
                              logFC        = lines$logFC,
                              AveExpr      = lines$AveExpr,
                              t            = lines$t,
                              PValue       = lines$P.Value,
                              AdjPVal      = lines$adj.P.Val,
                              B            = lines$B)
               colnames(res) <- c("GeneJunction","ASE","ASE_IDX","Tissue","counts","Display",
                                  "GeneSymbol","GeneID","chr","logFC","AveExpr","t","PValue","AdjPVal","B")
               gene_as <- rbind(gene_as,res)
           } else if (grepl("^ri_", files[i])){
               # remove the first 3 letters of the string 
               tissue2 <- substring(tissue1,4)
               idx <- match(event_idx, ri_annot$ID)
               res <- data.frame(GeneJunction = event,
                              ASE          = "RI", 
                              ASE_IDX      = idx,
                              Tissue       = tissue2,
                              counts       = counts[i],
                              Display      = tissue_reduction[tissue_reduction$SMTSD == tissue2, "display_name"],
                              GeneSymbol   = ri_annot$geneSymbol[idx],
                              GeneID       = ri_annot$GeneID[idx],
                              chr          = ri_annot$chr[idx],
                              logFC        = lines$logFC,
                              AveExpr      = lines$AveExpr,
                              t            = lines$t,
                              PValue       = lines$P.Value,
                              AdjPVal      = lines$adj.P.Val,
                              B            = lines$B)
               colnames(res) <- c("GeneJunction","ASE","ASE_IDX","Tissue","counts","Display",
                                  "GeneSymbol","GeneID","chr","logFC","AveExpr","t","PValue","AdjPVal","B")
               gene_as <- rbind(gene_as,res)
           }
        
       } #if has sig. events
    
   } #for all files
   colnames(gene_as) <- c("GeneJunction","ASE","ASE_IDX","Tissue","counts","Display","GeneSymbol",
                           "GeneID","chr","logFC","AveExpr","t","PValue","AdjPVal","B")
   n_unique_genes <- length(summary(as.factor(gene_as$GeneSymbol),maxsum=50000))
   message("For the run for ", all_or_das, " run")
   message("We extracted a total of ",nrow(gene_as)," alternative splicing events (gene_as)")
   message("This includes ", n_unique_genes, " total genes")
   return (gene_as)
}
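
The five branches of `create_as_structure` differ only in the annotation table they consult. A minimal sketch of the same lookup, driven by a named list of annotation tables, is shown below on toy data. `annotate_events` and the toy tables are hypothetical names and values introduced for illustration; they are not part of the notebook.

```r
# Toy annotation tables standing in for the fromGTF tables (hypothetical values)
toy_annot <- list(
    se  = data.frame(ID = 1:3, geneSymbol = c("MAP2K3", "MAP2K3", "XIST"),
                     chr = c("chr17", "chr17", "chrX"), stringsAsFactors = FALSE),
    mxe = data.frame(ID = 1:2, geneSymbol = c("SCO1", "CLN5"),
                     chr = c("chr17", "chr13"), stringsAsFactors = FALSE)
)

annotate_events <- function(file_name, events, annot) {
    # splicing type is the file name prefix before the first underscore
    as_type   <- sub("_.*$", "", file_name)
    # junction ID is the run of digits at the end of the event name
    event_idx <- as.integer(substring(events, regexpr("[0-9]+$", events)))
    tab       <- annot[[as_type]]
    idx       <- match(event_idx, tab$ID)
    data.frame(GeneJunction = events,
               ASE          = toupper(as_type),
               ASE_IDX      = idx,
               GeneSymbol   = tab$geneSymbol[idx],
               chr          = tab$chr[idx],
               stringsAsFactors = FALSE)
}

res <- annotate_events("mxe_liver_AS_model_B_sex_as_events_refined.csv",
                       c("SCO1-1", "CLN5-2"), toy_annot)
print(res)
```

Selecting the table by name means each splicing type is looked up in its own annotation, which removes the risk of a copy-and-paste mismatch between branches.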

## 2.6 create_dge_structure 

This function does an aggregation of the differential gene expression events - good for all events and the significantly expressed events.

In [8]:
create_dge_structure <- function ( results_dir, files, all_or_dge, pattern, map_pattern, tissue_reduction) {
   gene_dge = data.frame()
   counts <- rep(NA, length(files))
   for (i in seq_along(files)) {
      lines  <- read.table(file=paste0(results_dir, files[i]), 
                                     header = TRUE, sep = ",", quote = "\"'", skipNul = FALSE)
      if (dim(lines)[1] > 0) {
         tissue1    <- gsub(pattern,"", files[i], fixed = TRUE)
         map_lines  <- read.table(file=paste0(paste0(results_dir, tissue1),map_pattern),
                                     header = TRUE, sep = ",", quote = "\"'", skipNul = FALSE)
         counts[i]  <- dim(lines)[1]    
         ensg_ver   <- as.vector(as.character(rownames(lines)))
         ensg_no_ver<- as.vector(as.character(map_lines$ensg_names))
         ensg_genes <- as.vector(as.character(map_lines$ensg_genes))
         res <- data.frame(Tissue       = tissue1,
                           ENSG_ver     = ensg_ver,
                           ENSG_no_ver  = ensg_no_ver,
                           GeneSymbol   = ensg_genes,
                           counts       = counts[i],
                           Display      = tissue_reduction[tissue_reduction$SMTSD == tissue1, "display_name"],
                           logFC        = lines$logFC,
                           AveExpr      = lines$AveExpr,
                           t            = lines$t,
                           PValue       = lines$P.Value,
                           AdjPVal      = lines$adj.P.Val,
                           B            = lines$B)
         colnames(res) <- c("Tissue","ENSG_ver","ENSG_no_ver","GeneSymbol","counts","Display",
                            "logFC","AveExpr","t","PValue","AdjPVal","B")
         gene_dge <- rbind(gene_dge, res)
       } #if has sig. events
    } #for all files
    colnames(gene_dge) <- c("Tissue","ENSG_ver","ENSG_no_ver","GeneSymbol","counts","Display",
                        "logFC","AveExpr","t","PValue","AdjPVal","B")
    n_unique_genes <- length(summary(as.factor(gene_dge$GeneSymbol),maxsum=50000))
    message("For the run for ", all_or_dge, "run")
    message("We extracted a total of ",nrow(gene_dge)," gene events (gene_dge)")
    message("This includes ", n_unique_genes, " total genes")
    return(gene_dge)
}

## 2.7 Read in the alternative splicing results

We create one aggregation of all the results and one of the significant results only.


In [9]:
results_dir         <- "../data/"
significant_pattern <- "_AS_model_B_sex_as_events_refined.csv"
significant_files   <- list.files(path = results_dir, pattern = significant_pattern)
all_pattern         <- "_AS_model_B_sex_as_events.csv"
all_files           <- list.files(path = results_dir, pattern = all_pattern)
as_types            <- c("a3ss", "a5ss", "mxe", "ri", "se")
message("Length of all_files: ", length(all_files))
message("Length of significant_files: ", length(significant_files))

gene_as     <- create_as_structure (results_dir      = results_dir, 
                                    files            = significant_files,
                                    all_or_das       = "differentially significant alternative splicing",
                                    pattern          = significant_pattern, 
                                    tissue_reduction = tissue_reduction)
all_gene_as <- create_as_structure (results_dir      = results_dir, 
                                    files            = all_files, 
                                    all_or_das       = "all alternatively spliced",
                                    pattern          = all_pattern, 
                                    tissue_reduction = tissue_reduction)
head(gene_as,2)
gene_as$Tissue <- factor(gene_as$Tissue)
write.table(gene_as, "../data/gene_as.tsv", quote=FALSE, sep="\t")
write.table(all_gene_as, "../data/all_gene_as.tsv", quote=FALSE, sep="\t")

Length of all_files: 195



Length of significant_files: 195




number of files:195



For the run for differentially significant alternative splicing run



We extracted a total of 743 alternative splicing events (gene_as)



This includes 581 total genes




number of files:195



For the run for all alternatively spliced run



We extracted a total of 1132598 alternative splicing events (gene_as)



This includes 12681 total genes



Unnamed: 0_level_0,GeneJunction,ASE,ASE_IDX,Tissue,counts,Display,GeneSymbol,GeneID,chr,logFC,AveExpr,t,PValue,AdjPVal,B
Unnamed: 0_level_1,<chr>,<chr>,<int>,<chr>,<int>,<fct>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,SCO1-8452,A3SS,8452,adrenal_gland,1,Adrenal gland,SCO1,ENSG00000133028.12,chr17,0.7715834,5.028635,4.906848,1.533202e-06,0.008170436,0.8561371
2,XIST-2252,A3SS,2252,artery_coronary,1,Coronary artery,XIST,ENSG00000229807.11,chrX,-2.1880965,4.327079,-10.329346,4.631685e-21,2.4640570000000004e-17,16.753512


## 2.8 Create a genes-id file capturing the unique gene-junction locations in a single file

rMATS 3.2.5 assigns a unique junction ID to each splicing event. Here these IDs are tied together with gene names, because the combined identifiers are useful for downstream analyses and investigations.
Only the significant SE events are written here.


In [10]:
results_dir         <- "../data/"
significant_pattern <- "^se_*AS_model_B_sex_as_events_refined.csv"
files   <- list.files(path =results_dir, pattern = glob2rx(significant_pattern))
message("The first file from ^se_*AS_model_B_sex_as_events_refined.csv: ", files[1])
pattern="_AS_model_B_sex_as_events_refined.csv"
geneids <- data.frame()
for (i in seq_along(files)) {
    lines  <- read.table(file=paste0(results_dir, files[i]), 
                                     header = TRUE, sep = ",", quote = "\"'", skipNul = FALSE)
    
    if (dim(lines)[1] > 0) {
           event     <- as.vector(as.character(rownames(lines)))
           tissue1   <- gsub(pattern,"", files[i], fixed = TRUE)
           event_idx <- substring(event, regexpr("[0-9]+$", event))
           res       <- data.frame()
           tissue2 <- substring(tissue1,4)
           idx <- match(event_idx, se_annot$ID)
           # use `=` so the written file gets clean column names
           res <- data.frame(geneIDs      = event,
                             ID           = event_idx,
                             GeneSymbol   = se_annot$geneSymbol[idx],
                             GeneID       = se_annot$GeneID[idx],
                             chr          = se_annot$chr[idx])
           outfilename <- paste0(paste0("../data/se_",tissue2),"_geneids.tsv")
           write.table(res, outfilename, quote=FALSE, sep="\t")
           
     }
}
message("Done writing ", length(files), " files.")

The first file from ^se_*AS_model_B_sex_as_events_refined.csv: se_adipose_subcutaneous_AS_model_B_sex_as_events_refined.csv



Done writing 39 files.



## 2.9 Read in the differential gene expression results

Here we create an aggregation of all the significant differential gene expression events.
Note that all_gene_dge.tsv may be found in the assets directory.

In [11]:
results_dir             <- "../data/"
significant_dge_pattern <- "_DGE_refined.csv"
significant_dge_files   <- list.files(path = results_dir, pattern = significant_dge_pattern)
map_pattern             <- "_DGE_ensg_map.csv"
length(significant_dge_files)

gene_dge     <- create_dge_structure (results_dir      = results_dir, 
                                      files            = significant_dge_files, 
                                      all_or_dge       = "differential gene expression",
                                      pattern          = significant_dge_pattern, 
                                      map_pattern      = map_pattern,
                                      tissue_reduction = tissue_reduction)

head(gene_dge,2)
gene_dge$Tissue <- factor(gene_dge$Tissue)
write.table(gene_dge,     "../data/gene_dge.tsv",     quote=FALSE, sep="\t") 

For the run for differential gene expressionrun



We extracted a total of 4417 gene events (gene_dge)



This includes 3221 total genes



Unnamed: 0_level_0,Tissue,ENSG_ver,ENSG_no_ver,GeneSymbol,counts,Display,logFC,AveExpr,t,PValue,AdjPVal,B
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<int>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,adipose_subcutaneous,ENSG00000147050.14,ENSG00000147050,KDM6A,272,Adipose (sc),0.6119682,5.030862,35.53259,9.295961000000001e-156,1.498695e-151,344.27452
2,adipose_subcutaneous,ENSG00000224525.2,ENSG00000224525,AL591686.1,272,Adipose (sc),2.2810767,-1.46412,14.44699,2.4507139999999997e-41,3.5918549999999995e-38,81.05687


## 2.10 Read in Gencode (v30) Complete Annotation file.

Load the gencode.v30.annotation.gtf file for additional annotation.
We need the chromosome (chr) information for the summary data later, so we use the same annotation that was used for rMATS.

In [12]:
message("downloading gencode v30 annotation\n")
system("wget -O ../data/gencode.v30.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz")
message("Done!\n")
message("Unzipping compressed file gencode.v30.annotation.gtf.gz..")
system("gunzip ../data/gencode.v30.annotation.gtf.gz", intern = TRUE)
message("Done! gencode.v30.annotation.gtf can be found in ../data/")

gencode <- import("../data/gencode.v30.annotation.gtf")
gtf.df <- as.data.frame (gencode)
chr_genes <- unique(gtf.df[,c("seqnames","gene_name","gene_id")])
colnames(chr_genes) <- c("chr","GeneSymbol", "ENSG")
head(chr_genes)

downloading gencode v30 annotation




Done!




Unzipping compressed file gencode.v30.annotation.gtf.gz..



Done! gencode.v30.annotation.gtf can be found in ../data/



Unnamed: 0_level_0,chr,GeneSymbol,ENSG
Unnamed: 0_level_1,<fct>,<chr>,<chr>
1,chr1,DDX11L1,ENSG00000223972.5
13,chr1,WASH7P,ENSG00000227232.5
26,chr1,MIR6859-1,ENSG00000278267.1
29,chr1,MIR1302-2HG,ENSG00000243485.5
37,chr1,MIR1302-2,ENSG00000284332.1
40,chr1,FAM138A,ENSG00000237613.2


In [13]:
# strip the version suffix (e.g. ".5") from every ENSG identifier in one vectorized call
chr_genes$ENSG <- sub("\\.\\w+$", "", chr_genes$ENSG)
head(chr_genes)

Unnamed: 0_level_0,chr,GeneSymbol,ENSG
Unnamed: 0_level_1,<fct>,<chr>,<chr>
1,chr1,DDX11L1,ENSG00000223972
13,chr1,WASH7P,ENSG00000227232
26,chr1,MIR6859-1,ENSG00000278267
29,chr1,MIR1302-2HG,ENSG00000243485
37,chr1,MIR1302-2,ENSG00000284332
40,chr1,FAM138A,ENSG00000237613


In [14]:
chr <- rep("NA", nrow(gene_dge))
gene_dge$chr <- chr
for (i in seq_len(nrow(gene_dge))) {
    # rows of chr_genes whose unversioned ENSG ID equals this gene's ID
    is_match <- as.character(chr_genes$ENSG) %in% as.character(gene_dge$ENSG_no_ver[i])
    if (sum(is_match) == 1) {
        chr[i] <- as.character(chr_genes[is_match,]$chr)
        gene_dge$chr[i] <- chr[i]
    } else if (sum(is_match) > 1) {
        # several annotation rows share the ID: keep the first chromosome
        all_chr <- as.character(chr_genes[is_match,]$chr)
        gene_dge$chr[i] <- all_chr[1]
    }
}
head(gene_dge)

write.table(gene_dge, "../data/gene_dge.tsv", quote=FALSE, sep="\t")

Unnamed: 0_level_0,Tissue,ENSG_ver,ENSG_no_ver,GeneSymbol,counts,Display,logFC,AveExpr,t,PValue,AdjPVal,B,chr
Unnamed: 0_level_1,<fct>,<chr>,<chr>,<chr>,<int>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
1,adipose_subcutaneous,ENSG00000147050.14,ENSG00000147050,KDM6A,272,Adipose (sc),0.6119682,5.0308617,35.53259,9.295961000000001e-156,1.498695e-151,344.27452,chrX
2,adipose_subcutaneous,ENSG00000224525.2,ENSG00000224525,AL591686.1,272,Adipose (sc),2.2810767,-1.4641204,14.44699,2.4507139999999997e-41,3.5918549999999995e-38,81.05687,chr1
3,adipose_subcutaneous,ENSG00000115041.12,ENSG00000115041,KCNIP3,272,Adipose (sc),-0.9052185,3.4532379,-13.7079,7.695914999999999e-38,8.862395e-35,75.10723,chr2
4,adipose_subcutaneous,ENSG00000258484.3,ENSG00000258484,SPESP1,272,Adipose (sc),-0.891813,0.9410518,-13.10456,4.615796e-35,4.650991e-32,68.23064,chr15
5,adipose_subcutaneous,ENSG00000134339.8,ENSG00000134339,SAA2,272,Adipose (sc),3.1082279,2.7856727,12.87586,5.017765e-34,4.257705e-31,66.45132,chr11
6,adipose_subcutaneous,ENSG00000103449.11,ENSG00000103449,SALL1,272,Adipose (sc),1.2692252,-0.242458,12.72143,2.470418e-33,1.9914039999999998e-30,64.14054,chr16


## 2.11 Summary AS 

Capture descriptive statistics, by tissue, for the significantly alternatively spliced events (gene_as) and the significantly differentially expressed genes (gene_dge).

In [15]:
XY <- gene_as %>% group_by(Tissue) %>% tally()
XY <- XY[order(XY$n), ]  # ascending: tissues with the fewest events come first
head(XY)
message("Minimum splicing events per tissue ", min(XY$n), " maximum splicing events per tissue ", max(XY$n))
message("Sum of significant splicing events per tissue less than 100 ", sum(XY$n<100))

XY <- gene_dge %>% group_by(Tissue) %>% tally()
XY <- XY[order(XY$n), ]  # ascending: tissues with the fewest events come first
head(XY)
message("Minimum gene expression events per tissue ", min(XY$n), " maximum gene expression events per tissue ", max(XY$n))
# table(gtf.df[,c("gene_type")])

Tissue,n
<fct>,<int>
esophagus_mucosa,1
liver,1
lung,1
whole_blood,1
adrenal_gland,2
artery_coronary,2


Minimum splicing events per tissue 1 maximum splicing events per tissue 521



Sum of significant splicing events per tissue less than 100 35



Tissue,n
<fct>,<int>
brain_spinal_cord_cervical_c_1,5
brain_cortex,7
brain_hippocampus,7
brain_nucleus_accumbens_basal_ganglia,8
small_intestine_terminal_ileum,10
brain_caudate_basal_ganglia,12


Minimum gene expression events per tissue 5 maximum gene expression events per tissue 2296
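
To list the tissues with the most events first instead, `decreasing = TRUE` has to be an argument of `order()` itself. A minimal base R sketch on toy data follows; the `toy_as` table and its values are hypothetical.

```r
# Toy table of significant events, one row per event (hypothetical values)
toy_as <- data.frame(Tissue = c("breast", "breast", "breast", "liver", "lung", "lung"),
                     stringsAsFactors = FALSE)

# count events per tissue
per_tissue <- as.data.frame(table(Tissue = toy_as$Tissue),
                            responseName = "n", stringsAsFactors = FALSE)

# sort so the tissue with the most events comes first
per_tissue <- per_tissue[order(per_tissue$n, decreasing = TRUE), ]
print(per_tissue)

# number of tissues with fewer than 100 significant events
n_small <- sum(per_tissue$n < 100)
```

Note that `sum(per_tissue$n < 100)` counts tissues, not events.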



## 3 Data Structures for Figures

## 3.1 gene_as.tsv

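This file contains the significant alternative splicing events, one row per event and tissue, annotated with the gene, the chromosome, and the test statistics.
Here are some typical lines: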
<pre>
A data.frame: 6 × 15
GeneJunction	ASE	ASE_IDX	Tissue	counts	Display	GeneSymbol	GeneID	chr	logFC	AveExpr	t	PValue	AdjPVal	B
<fct>	<fct>	<int>	<fct>	<int>	<fct>	<fct>	<fct>	<fct>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	XIST-2253	A3SS	2253	adipose_subcutaneous	4	Adipose (sc)	XIST	ENSG00000229807.11	chrX	-4.4086049	3.196317	-36.488970	4.635568e-154	3.893877e-150	310.016049
2	XIST-2252	A3SS	2252	adipose_subcutaneous	4	Adipose (sc)	XIST	ENSG00000229807.11	chrX	-2.4147126	3.647690	-21.921057	1.444102e-78	6.065229e-75	160.028167
3	GREB1L-4933	A3SS	4933	adipose_subcutaneous	4	Adipose (sc)	GREB1L	ENSG00000141449.14	chr18	1.2793173	2.115005	7.123138	3.052112e-12	8.545914e-09	16.692429
4	RHCG-1776	A3SS	1776	adipose_subcutaneous	4	Adipose (sc)	RHCG	ENSG00000140519.14	chr15	-0.6930009	1.636472	-3.922124	9.797866e-05	3.919146e-02	1.142232
5	XIST-2253	A3SS	2253	adipose_visceral_omentum	12	Adipose (v)	XIST	ENSG00000229807.11	chrX	-4.4403352	3.113532	-33.950800	2.654474e-123	2.209585e-119	241.826117
6	XIST-2252	A3SS	2252	adipose_visceral_omentum	12	Adipose (v)	XIST	ENSG00000229807.11	chrX	-2.4506832	3.650617	-18.890779	2.817671e-58	1.172715e-54	114.731682
</pre>
There are 743 significant events in the file.

In [16]:
glimpse(gene_as)
gene_as$Tissue <- factor(gene_as$Tissue)
length(levels(gene_as$Tissue))
table(is.na(gene_as$Display))
table(gene_as$Display)
colnames(gene_as)
head(gene_as)
tissue_reduction$display_name <- factor(tissue_reduction$display_name)

Rows: 743
Columns: 15
$ GeneJunction <chr> "SCO1-8452", "XIST-2252", "CPT1C-6427", "DEPDC5-1839", "C…
$ ASE          <chr> "A3SS", "A3SS", "A3SS", "A3SS", "A3SS", "A3SS", "A3SS", "…
$ ASE_IDX      <int> 8452, 2252, 6427, 1839, 8117, 453, 3523, 7852, 7853, 1653…
$ Tissue       <fct> adrenal_gland, artery_coronary, brain_cortex, brain_hypot…
$ counts       <int> 1, 1, 1, 2, 2, 2, 2, 104, 104, 104, 104, 104, 104, 104, 1…
$ Display      <fct> Adrenal gland, Coronary artery, Cortex, Hypothalamus, Hyp…
$ GeneSymbol   <chr> "SCO1", "XIST", "CPT1C", "DEPDC5", "CLN5", "ANKHD1-EIF4EB…
$ GeneID       <chr> "ENSG00000133028.12", "ENSG00000229807.11", "ENSG00000169…
$ chr          <chr> "chr17", "chrX", "chr19", "chr22", "chr13", "chr5", "chr2…
$ logFC        <dbl> 0.7715834, -2.1880965, -0.8246151, -1.2651387, 


FALSE 
  743 


         Adipose (sc)           Adipose (v)         Adrenal gland 
                    3                     3                     2 
                Aorta      Atrial appendage                Breast 
                   10                     4                   521 
              Caudate Cerebellar hemisphere            Cerebellum 
                    2                     2                    34 
      Coronary artery                Cortex       EBV-lymphocytes 
                    2                     2                     3 
      Esophagus (gej)         Esophagus (m)        Esophagus (mu) 
                    6                     1                    32 
          Fibroblasts        Frontal cortex           Hippocampus 
                    3                     2                    14 
         Hypothalamus        Left ventricle                 Liver 
                    3                     0                     1 
                 Lung     Nucleus accumbens              Panc

,GeneJunction,ASE,ASE_IDX,Tissue,counts,Display,GeneSymbol,GeneID,chr,logFC,AveExpr,t,PValue,AdjPVal,B
,<chr>,<chr>,<int>,<fct>,<int>,<fct>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,SCO1-8452,A3SS,8452,adrenal_gland,1,Adrenal gland,SCO1,ENSG00000133028.12,chr17,0.7715834,5.028635,4.906848,1.533202e-06,0.008170436,0.8561371
2,XIST-2252,A3SS,2252,artery_coronary,1,Coronary artery,XIST,ENSG00000229807.11,chrX,-2.1880965,4.327079,-10.329346,4.631685e-21,2.464057e-17,16.753512
3,CPT1C-6427,A3SS,6427,brain_cortex,1,Cortex,CPT1C,ENSG00000169169.14,chr19,-0.8246151,6.908922,-5.042202,9.757441e-07,0.005117778,5.1676834
4,DEPDC5-1839,A3SS,1839,brain_hypothalamus,2,Hypothalamus,DEPDC5,ENSG00000100150.18,chr22,-1.2651387,3.521844,-5.035619,1.09981e-06,0.004663317,4.2503727
5,CLN5-8117,A3SS,8117,brain_hypothalamus,2,Hypothalamus,CLN5,ENSG00000102805.15,chr13,-1.2407543,4.678101,-4.93063,1.777517e-06,0.004663317,4.1488842
6,ANKHD1-EIF4EBP3-453,A3SS,453,brain_spinal_cord_cervical_c_1,2,Spinal cord,ANKHD1-EIF4EBP3,ENSG00000254996.5,chr5,0.610582,6.459352,4.44025,1.825438e-05,0.04818212,2.5615289


In [17]:
x_as_events <- gene_as[gene_as$chr=="chrX",]
message("There were ",nrow(gene_as)," total significant alternative splicing events (gene_as)")
message("There were ",nrow(x_as_events)," total significant alternative splicing events on the X chromosome (gene_as)")
message("i.e., ", (100*nrow(x_as_events)/nrow(gene_as)), "% of all significant AS events were on the X chromosome")

numberOfUniqueTissues <- length(summary(as.factor(gene_as$Display),maxsum=500))
numberOfASEmechanisms <- length(summary(as.factor(gene_as$ASE),maxsum=500))

message("gene_as now has ",numberOfUniqueTissues, " tissues and ", numberOfASEmechanisms, " ASE categories")
message("ASE:")
summary(as.factor(gene_as$ASE),maxsum=500)

There were 743 total significant alternative splicing events (gene_as)



There were 68 total significant alternative splicing events on the X chromosome (gene_as)



i.e., 9.15208613728129% of all significant AS events were on the X chromosome



gene_as now has 39 tissues and 5 ASE categories



ASE:



### 3.2 gene_dge.tsv

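This file contains the significant differential gene expression events, one row per gene and tissue, together with the differential-expression statistics (`logFC`, `AveExpr`, `t`, `PValue`, `AdjPVal`, `B`).
Here are some typical lines: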
<pre>
Tissue  ENSG_ver        ENSG_no_ver     GeneSymbol      counts  Display logFC   AveExpr t       PValue  AdjPVal B
1       adipose_subcutaneous    ENSG00000176728.7       ENSG00000176728 TTTY14  765     Adipose (sc)    -7.98216577151896     -0.928812923511535       -139.823010017733       0       0       1107.42326360464
2       adipose_subcutaneous    ENSG00000231535.5       ENSG00000231535 LINC00278       765     Adipose (sc)    -6.09542040758638      -2.77656379347601       -126.913818678612       0       0       1050.36559888639
3       adipose_subcutaneous    ENSG00000129824.15      ENSG00000129824 RPS4Y1  765     Adipose (sc)    -9.6641901864726      4.63528767282141 -125.827094717734       0       0       1041.87660796556
4       adipose_subcutaneous    ENSG00000067646.11      ENSG00000067646 ZFY     765     Adipose (sc)    -9.50458982938477     0.672755457406984        -125.037143030325       0       0       1033.61131617113
5       adipose_subcutaneous    ENSG00000229807.10      ENSG00000229807 XIST    765     Adipose (sc)    9.89280986473167      1.23756039627052 121.69689757218 0       0       1030.17757492281
6       adipose_subcutaneous    ENSG00000229236.1       ENSG00000229236 TTTY10  765     Adipose (sc)    -6.20901295440725     -2.74524363170072        -122.540297482165       0       0       1029.41424065532
7       adipose_subcutaneous    ENSG00000233864.7       ENSG00000233864 TTTY15  765     Adipose (sc)    -8.19361688496523     -0.741097276206495       -122.47199454746        0       0       1027.58175958647
8       adipose_subcutaneous    ENSG00000260197.1       ENSG00000260197 AC010889.1      765     Adipose (sc)    -8.52835806068555      -0.686009457030557      -119.790486729538       0       0       1015.85291786821
9       adipose_subcutaneous    ENSG00000183878.15      ENSG00000183878 UTY     765     Adipose (sc)    -9.52139275438866     1.60375445153084 -110.599936868261       0       0       953.069478992754
</pre>
There are 4417 significant events in the file, as reported by the cell output below.

In [18]:
x_dge_events <- gene_dge[gene_dge$chr=="chrX",]
message("There were ",nrow(gene_dge)," total significant differential gene expression events (gene_dge)")
message("There were ",nrow(x_dge_events)," total significant differential gene expression events on the X chromosome (gene_dge)")
message("i.e., ", (100*nrow(x_dge_events)/nrow(gene_dge)), "% of all significant DGE events were on the X chromosome")

numberOfUniqueTissues <- length(summary(as.factor(gene_dge$Display),maxsum=500))

message("gene_dge now has ",numberOfUniqueTissues, " tissues")

There were 4417 total significant differential gene expression events (gene_dge)



There were 328 total significant differential gene expression events on the X chromosome (gene_dge)



i.e., 7.42585465247906% of all significant DGE events were on the X chromosome



gene_dge now has 39 tissues



### 3.3 Count events by chromosome

Count the number of significant alternative splicing events per chromosome and save to the file **Total_AS_by_chr.tsv**.

### 3.3.1 by alternative splicing events

In [19]:
total_as_by_chr <- gene_as          %>% 
                   group_by(chr)    %>% 
                   count(chr)       %>% 
                   arrange(desc(n)) %>% 
                   as.data.frame()
total_as_by_chr$chr <- factor(total_as_by_chr$chr, levels = total_as_by_chr$chr)
length(total_as_by_chr$chr)
total_as_by_chr
glimpse(total_as_by_chr)
write.table(total_as_by_chr, file= "../data/Total_AS_by_chr.tsv", sep="\t", quote = FALSE, row.names=F)

chr,n
<fct>,<int>
chr1,83
chrX,68
chr17,56
chr19,55
chr16,47
chr2,44
chr11,41
chr12,35
chr3,35
chr10,31


Rows: 23
Columns: 2
$ chr <fct> chr1, chrX, chr17, chr19, chr16, chr2, chr11, chr12, chr3, chr10, …
$ n   <int> 83, 68, 56, 55, 47, 44, 41, 35, 35, 31, 29, 26, 25, 24, 22, 22, 19…
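
The counting pattern used in this cell (count per chromosome, sort by descending count, then fix the factor levels to that order) can be sketched in base R. This is a minimal illustration on a hypothetical toy vector, not the real `gene_as$chr` column, and `toy_by_chr` is a name introduced here only for the example.

```r
# Toy stand-in for gene_as$chr (hypothetical values, not the real data)
chr <- c("chr1", "chrX", "chr1", "chr17", "chrX", "chr1")

# Count events per chromosome and sort by descending count
counts <- sort(table(chr), decreasing = TRUE)
toy_by_chr <- data.frame(chr = names(counts),
                         n   = as.integer(counts),
                         stringsAsFactors = FALSE)

# Fix the factor levels to the sorted order, so that plots and tables
# keep the "most events first" ordering instead of alphabetical order
toy_by_chr$chr <- factor(toy_by_chr$chr, levels = toy_by_chr$chr)
print(toy_by_chr)
```

Setting the levels explicitly matters because a factor otherwise sorts its levels alphabetically, which would place `chr17` before `chrX` regardless of the counts.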


### 3.3.2 by gene expression

In [20]:
total_dge_by_chr <- gene_dge          %>% 
                   group_by(chr)    %>% 
                   count(chr)       %>% 
                   arrange(desc(n)) %>% 
                   as.data.frame()
total_dge_by_chr$chr <- factor(total_dge_by_chr$chr, levels = total_dge_by_chr$chr)
length(total_dge_by_chr$chr)
total_dge_by_chr
glimpse(total_dge_by_chr)
write.table(total_dge_by_chr, file= "../data/Total_DGE_by_chr.tsv", sep="\t", quote = FALSE, row.names=F)

chr,n
<fct>,<int>
chr1,431
chr2,346
chrX,328
chr11,250
chr12,234
chr19,224
chr17,216
chr3,212
chr6,199
chr7,196


Rows: 26
Columns: 2
$ chr <fct> chr1, chr2, chrX, chr11, chr12, chr19, chr17, chr3, chr6, chr7, ch…
$ n   <int> 431, 346, 328, 250, 234, 224, 216, 212, 199, 196, 195, 187, 178, 1…


### 3.4 Count events by genes 

### 3.4.1 by alternative splicing

In [21]:
total_as_by_geneSymbol <- gene_as %>% 
                          group_by(GeneSymbol) %>% 
                          count(GeneSymbol)    %>% 
                          arrange(desc(n))     %>% 
                          as.data.frame()
total_as_by_geneSymbol$GeneSymbol <- factor(total_as_by_geneSymbol$GeneSymbol, 
                                            levels = total_as_by_geneSymbol$GeneSymbol)
length(total_as_by_geneSymbol$GeneSymbol)
head(total_as_by_geneSymbol,10)
write.table(total_as_by_geneSymbol, file = "../data/Total_AS_by_geneSymbol.tsv", sep = "\t", quote=FALSE, row.names = F)

,GeneSymbol,n
,<fct>,<int>
1,XIST,25
2,KDM5C,14
3,MUC1,6
4,SORBS2,6
5,DDX3X,5
6,CELSR2,4
7,ABCD4,3
8,BNIP2,3
9,DTNA,3
10,EPN3,3


### 3.5 Count most frequent splicing by tissue

### 3.5.1 by alternative splicing

In [22]:
total_as_by_tissue <- gene_as %>% 
                      group_by(Display) %>% 
                      count(Display)    %>% 
                      arrange(desc(n))  %>% 
                      as.data.frame()
total_as_by_tissue$Display <- factor(total_as_by_tissue$Display, 
                                     levels = total_as_by_tissue$Display)
head(total_as_by_tissue,10)
length(total_as_by_tissue$Display)
write.table(total_as_by_tissue, file = "../data/Total_AS_by_tissue.tsv", sep = "\t", row.names = F)

,Display,n
,<fct>,<int>
1,Breast,521
2,Cerebellum,34
3,Esophagus (mu),32
4,Spleen,28
5,Hippocampus,14
6,Sigmoid colon,14
7,Aorta,10
8,Skeletal muscle,7
9,Tibial artery,7
10,Esophagus (gej),6


### 3.5.2 by gene expression

In [23]:
#glimpse(gene_dge)
gene_dge$GeneSymbol <- factor(gene_dge$GeneSymbol)
total_dge_by_tissue <- gene_dge %>% 
                          select(c(GeneSymbol, Display, logFC)) %>%
                          group_by(Display) %>%
                          arrange(desc(logFC)) %>%
                          tally() %>%
                          arrange(desc(n)) %>%
                          as.data.frame()
head(total_dge_by_tissue,10)
length(total_dge_by_tissue$Display)
write.table(total_dge_by_tissue, file = "../data/Total_DGE_by_tissue.tsv", sep = "\t", quote=FALSE, row.names = F)

,Display,n
,<fct>,<int>
1,Breast,2296
2,Adipose (sc),272
3,Pituitary,256
4,Thyroid,164
5,Adipose (v),156
6,Skin (not exposed),129
7,Skin (exposed),116
8,Skeletal muscle,102
9,Left ventricle,85
10,Fibroblasts,84


### 3.6 Significant count by splicing type
We define **significant** as FC > 1.5 and pVal < 0.05.

The events in `gene_as` are our starting values; all of them already meet both criteria.


In [24]:
total_as_by_splicingtype <- gene_as %>% 
                            group_by(ASE)    %>% 
                            count(ASE)       %>% 
                            arrange(desc(n)) %>%
                            as.data.frame()
total_as_by_splicingtype$ASE <- factor(total_as_by_splicingtype$ASE, levels = total_as_by_splicingtype$ASE)
total_as_by_splicingtype
write.table(total_as_by_splicingtype, file= "../data/Total_AS_by_splicingtype.tsv", sep="\t", quote = FALSE, row.names=F)

ASE,n
<fct>,<int>
SE,251
RI,178
A3SS,130
A5SS,107
MXE,77


### 3.7 Significant events split by splicing type (significant == FC > 1.5 and pVal < 0.05)

In [25]:
A3SS_keep <- as.character(gene_as$ASE) %in% "A3SS"
table(A3SS_keep)
A3SS.gene_as <- data.frame(gene_as[A3SS_keep == TRUE,])

A5SS_keep <- as.character(gene_as$ASE) %in% "A5SS"
table(A5SS_keep)
A5SS.gene_as <- data.frame(gene_as[A5SS_keep == TRUE,])

MXE_keep  <- as.character(gene_as$ASE) %in% "MXE"
table(MXE_keep)
MXE.gene_as <- data.frame(gene_as[MXE_keep == TRUE,])

SE_keep   <- as.character(gene_as$ASE) %in% "SE"
table(SE_keep)
SE.gene_as <- data.frame(gene_as[SE_keep == TRUE,])

RI_keep   <- as.character(gene_as$ASE) %in% "RI"
table(RI_keep)
RI.gene_as <- data.frame(gene_as[RI_keep == TRUE,])

dim(A3SS.gene_as)
dim(A5SS.gene_as)
dim(MXE.gene_as)
dim(SE.gene_as)
dim(RI.gene_as)


A3SS_keep
FALSE  TRUE 
  613   130 

A5SS_keep
FALSE  TRUE 
  636   107 

MXE_keep
FALSE  TRUE 
  666    77 

SE_keep
FALSE  TRUE 
  492   251 

RI_keep
FALSE  TRUE 
  565   178 
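
The five subsets built in this cell can also be obtained in one step with base R's `split()`. The sketch below uses a hypothetical toy data frame (`toy_as`, with made-up junction names) rather than the real `gene_as`; it is an alternative formulation, not the notebook's own code.

```r
# Toy stand-in for gene_as (hypothetical rows, not the real data)
toy_as <- data.frame(
  GeneJunction = c("geneA-1", "geneA-2", "geneB-1", "geneC-1"),
  ASE          = c("A3SS", "A3SS", "SE", "RI"),
  stringsAsFactors = FALSE
)

# One data frame per splicing type, returned as a named list
by_type <- split(toy_as, toy_as$ASE)

# Number of events per splicing type, comparable to the table() calls above
events_per_type <- sapply(by_type, nrow)
print(events_per_type)
```

Each element, for example `by_type$A3SS`, plays the role of `A3SS.gene_as`, and the list can be looped over with `lapply()` instead of repeating the same statements for each type.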

### 3.8 Significantly spliced genes for each splicing type

In [26]:
# For each splicing type: count significant events per gene, most frequent first,
# then report the number of distinct genes with at least one significant event.
A3SS.res <- A3SS.gene_as %>% group_by(GeneSymbol) %>% count(GeneSymbol) %>% arrange(desc(n)) %>% as.data.frame()
A3SS.res$GeneSymbol <- factor(A3SS.res$GeneSymbol, levels = A3SS.res$GeneSymbol)
message("Significant spliced genes for A3SS\n", nrow(A3SS.res))
#head(A3SS.res)

A5SS.res <- A5SS.gene_as %>% group_by(GeneSymbol) %>% count(GeneSymbol) %>% arrange(desc(n)) %>% as.data.frame()
A5SS.res$GeneSymbol <- factor(A5SS.res$GeneSymbol, levels = A5SS.res$GeneSymbol)
message("Significant spliced genes for A5SS\n", nrow(A5SS.res))
#head(A5SS.res)

MXE.res <- MXE.gene_as %>% group_by(GeneSymbol) %>% count(GeneSymbol) %>% arrange(desc(n)) %>% as.data.frame()
MXE.res$GeneSymbol <- factor(MXE.res$GeneSymbol, levels = MXE.res$GeneSymbol)
message("Significant spliced genes for MXE\n", nrow(MXE.res))
#head(MXE.res)

RI.res <- RI.gene_as %>% group_by(GeneSymbol) %>% count(GeneSymbol) %>% arrange(desc(n)) %>% as.data.frame()
RI.res$GeneSymbol <- factor(RI.res$GeneSymbol, levels = RI.res$GeneSymbol)
message("Significant spliced genes for RI\n", nrow(RI.res))
#head(RI.res)

SE.res <- SE.gene_as %>% group_by(GeneSymbol) %>% count(GeneSymbol) %>% arrange(desc(n)) %>% as.data.frame()
SE.res$GeneSymbol <- factor(SE.res$GeneSymbol, levels = SE.res$GeneSymbol)
message("Significant spliced genes for SE\n", nrow(SE.res))
#head(SE.res)

Significant spliced genes for A3SS
116 



Significant spliced genes for A5SS
100 



Significant spliced genes for MXE
50 



Significant spliced genes for RI
159 



Significant spliced genes for SE
203 



### 3.9 Count most frequent spliced genes

In [27]:
genesMostFrequentlySpliced <- gene_as %>% 
                              group_by(GeneSymbol) %>% 
                              count(GeneSymbol)    %>% 
                              arrange(desc(n))     %>% 
                              as.data.frame()
genesMostFrequentlySpliced$GeneSymbol <- factor(genesMostFrequentlySpliced$GeneSymbol, 
                                                levels = genesMostFrequentlySpliced$GeneSymbol)
length(genesMostFrequentlySpliced$GeneSymbol)

#Add number of tissues (one entry per gene, i.e. per row of the data frame)
nTissues <- rep(NA_integer_, nrow(genesMostFrequentlySpliced))
for (i in seq_len(nrow(genesMostFrequentlySpliced))) {
    df_gene <- gene_as %>% 
               filter(GeneSymbol == genesMostFrequentlySpliced$GeneSymbol[i])
    nTissues[i] <- length(unique(df_gene$Tissue))
}
genesMostFrequentlySpliced$Tissues <- nTissues
head(genesMostFrequentlySpliced)
write.table(genesMostFrequentlySpliced, file = "../data/genesWithCommonAS.tsv", sep = "\t", quote = F, row.names = F)

,GeneSymbol,n,Tissues
,<fct>,<int>,<int>
1,XIST,25,23
2,KDM5C,14,14
3,MUC1,6,1
4,SORBS2,6,1
5,DDX3X,5,4
6,CELSR2,4,1
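
The per-gene loop above can also be written without an explicit loop. The following base R sketch computes the same two quantities (events per gene and distinct tissues per gene) on a hypothetical toy data frame; `toy_as` and `toy_summary` are names introduced only for this illustration.

```r
# Toy stand-in for gene_as (hypothetical genes and tissues)
toy_as <- data.frame(
  GeneSymbol = c("geneA", "geneA", "geneA", "geneB", "geneB"),
  Tissue     = c("t1", "t2", "t2", "t1", "t1"),
  stringsAsFactors = FALSE
)

# Number of significant events per gene
n_events <- table(toy_as$GeneSymbol)

# Number of distinct tissues in which each gene has a significant event
n_tissues <- tapply(toy_as$Tissue, toy_as$GeneSymbol,
                    function(x) length(unique(x)))

toy_summary <- data.frame(
  GeneSymbol = names(n_events),
  n          = as.integer(n_events),
  Tissues    = as.integer(n_tissues[names(n_events)])
)

# Most frequently spliced genes first
toy_summary <- toy_summary[order(-toy_summary$n), ]
print(toy_summary)
```

A gene whose `n` exceeds `Tissues` has more than one significant event in at least one tissue, which is the pattern seen for MUC1 and SORBS2 in the output above.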


### 3.10 Count most frequent spliced chromosomes
To get an indication of which chromosome has the most frequent splicing events (regardless of type), we create an index that normalizes the number of significant events on a chromosome by the number of exons on that chromosome: Index = (significant events / exons) × 1000.

The exon counts are taken from the annotation file, at this writing gencode.v30.annotation.gtf.

In [28]:
exons <- gencode[ gencode$type == "exon", ]
exons <- as.data.frame(exons)

#Obtain chromosomes we have splicing information for (recall we did not use chr Y in our analysis)
all_chr <- as.character(unique(gene_as$chr))
chr_counts <- rep(0, length(all_chr))


for (i in 1:length(all_chr)) {
  chr_counts[i] <- nrow(exons[exons$seqnames == all_chr[i], ])
}

exon_counts <- data.frame(chr = all_chr, counts = chr_counts)

# Count most frequent spliced chromosomes
res <- gene_as %>% group_by(chr) %>% count(chr) %>% arrange(desc(n)) %>% as.data.frame()
res$chr <- factor(res$chr, levels = res$chr)

idx <- match(res$chr, exon_counts$chr)

res$ExonCounts <- exon_counts$counts[idx]

res$Index <- (res$n / res$ExonCounts) * 1000

res_sorted <- res %>% arrange(desc(Index))
res_sorted$chr <- factor(res_sorted$chr, levels = res_sorted$chr)
glimpse(res_sorted)
write.table(res_sorted, file = "../data/SplicingIndex_chr.tsv", sep = "\t", quote = F, row.names = F)

Rows: 23
Columns: 4
$ chr        <fct> chrX, chr16, chr22, chr19, chr17, chr1, chr20, chr10, chr8,…
$ n          <int> 68, 47, 22, 55, 56, 83, 19, 31, 25, 8, 41, 24, 35, 29, 22, …
$ ExonCounts <dbl> 40029, 61199, 28655, 74466, 78291, 118996, 28506, 47124, 45…
$ Index      <dbl> 1.6987684, 0.7679864, 0.7677543, 0.7385921, 0.7152802, 0.69…
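
As a worked check of the index, the first two rows of the output above can be recomputed by hand. The sketch below takes the event and exon counts for chrX and chr16 directly from that output; `toy` is a name introduced only for this example.

```r
# Event and exon counts for two chromosomes, copied from the output above
toy <- data.frame(
  chr        = c("chrX", "chr16"),
  n          = c(68, 47),
  ExonCounts = c(40029, 61199)
)

# Significant splicing events per 1000 annotated exons
toy$Index <- (toy$n / toy$ExonCounts) * 1000
print(toy)
```

Although chr1 has the largest raw count (83 events), it also has far more exons, which is why chrX ranks first once the counts are normalized.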


### 3.11 Overlap between Differential Gene Expression and Differential Alternative Splicing

First gather the data

In [29]:
total_AS_Genes <- read.table(file="../data/Total_AS_by_geneSymbol.tsv", header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
sigAsGenes <- sort(total_AS_Genes$GeneSymbol)
dge <- read.table("../data/gene_dge.tsv", sep = "\t", header = FALSE, row.names=1, skip = 1)
#dge <- data.table::fread("../data/gene_dge.tsv")
dge_genes <- sort(dge$V5)
head(dge_genes)
all_genes_data <- read.table("../assets/all_gene_dge.tsv")
all_genes <- sort(all_genes_data$GeneSymbol)
head(all_genes)

In [30]:
head(dge_genes)

In [31]:
head(all_genes)

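### 3.12 Hypergeometric/Fisher test for overrepresentation
We then test whether genes with significant differential alternative splicing (DAS) are overrepresented among genes with significant differential gene expression (DGE).
The universe consists of all genes with at least one read (`all_genes_data`), and each gene in the universe falls into one cell of the following table:

|      | DGE+ | DGE- |
|------|------|------|
| DAS+ | a    | b    |
| DAS- | c    | d    |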

In [32]:
message("Number of sigAsGenes ", length(sigAsGenes))
notSigAs <- setdiff(all_genes,sigAsGenes)
message("Number of genes that are NOT sigAs ", length(notSigAs))
message("Number of DGE genes ", length(dge_genes))
notDGE <- setdiff(all_genes,dge_genes)
message("Number of genes that are NOT DGE ", length(notDGE))
a <- intersect(sigAsGenes, dge_genes)
b <- intersect(sigAsGenes, notDGE)
c <- intersect(notSigAs, dge_genes)
d <- intersect(notSigAs, notDGE)
message("a: ", length(a), "; b: ",  length(b), "; c: ",  length(c), "; d: ",  length(d))

Number of sigAsGenes 581



Number of genes that are NOT sigAs 23480



Number of DGE genes 4417



Number of genes that are NOT DGE 20896



a: 144; b: 429; c: 3013; d: 20467



In [33]:
m <- matrix(c(length(a),length(b),length(c),length(d)), nrow=2,byrow = TRUE)
fisher.test(m)


	Fisher's Exact Test for Count Data

data:  m
p-value = 4.649e-15
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.867195 2.771118
sample estimates:
odds ratio 
  2.280031 
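
To make the link between the Fisher test and the hypergeometric distribution explicit, the sketch below rebuilds the table from the counts a, b, c and d printed above. It compares the simple cross-product odds ratio with the estimate from `fisher.test()`, and shows that the one-sided Fisher p-value equals a hypergeometric upper tail. The variable names are introduced only for this illustration, and the one-sided test is an addition here; the notebook itself reports the two-sided test.

```r
# Counts a, b, c, d as printed by the notebook output above
m <- matrix(c(144, 429, 3013, 20467), nrow = 2, byrow = TRUE,
            dimnames = list(c("DAS+", "DAS-"), c("DGE+", "DGE-")))

# Sample (cross-product) odds ratio: (a * d) / (b * c)
cross_product_or <- (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])

# fisher.test() reports the conditional maximum likelihood estimate,
# which is close to, but not identical to, the cross-product ratio
fit <- fisher.test(m)
fisher_or <- unname(fit$estimate)

# One-sided overrepresentation: P(X >= a) when drawing the a + b DAS genes
# from a universe containing a + c DGE genes and b + d non-DGE genes
p_hyper <- phyper(m[1, 1] - 1, sum(m[, 1]), sum(m[, 2]), sum(m[1, ]),
                  lower.tail = FALSE)
p_fisher_greater <- fisher.test(m, alternative = "greater")$p.value

print(c(cross_product = cross_product_or, fisher = fisher_or))
print(c(hypergeometric = p_hyper, fisher_greater = p_fisher_greater))
```

The small difference between the two odds ratios is expected, because `fisher.test()` conditions on the table margins when estimating the odds ratio.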


### Appendix - Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **'artefacts'**, files generated during the analysis and stored in the **`data`** directory
2. List of environment metadata, dependencies, versions of libraries using `utils::sessionInfo()` and [`devtools::session_info()`](https://devtools.r-lib.org/reference/session_info.html)

### Appendix 1. Checksums with the sha256 algorithm

In [34]:
rm (notebookid)
notebookid   = "countGenesAndEvents"
notebookid

message("Generating sha256 checksums of the file `../data/gene_as.tsv` directory .. ")
system(paste0("cd ../data && find . -name gene_as.tsv -exec sha256sum {} \\;  >  ../data/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

# From here on, append (>>) to the checksum file; a plain redirect (>) would
# overwrite the checksums written by the previous calls.
message("Generating sha256 checksums of the file `../data/all_gene_as.tsv` directory .. ")
system(paste0("cd ../data && find . -name all_gene_as.tsv -exec sha256sum {} \\;  >>  ../data/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

message("Generating sha256 checksums of the file `../data/gene_dge.tsv` directory .. ")
system(paste0("cd ../data && find . -name gene_dge.tsv -exec sha256sum {} \\;  >>  ../data/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

message("Generating sha256 checksums of the file `../data/Total_AS_by_chr.tsv` directory .. ")
system(paste0("cd ../data && find . -name Total_AS_by_chr.tsv -exec sha256sum {} \\;  >>  ../data/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

message("Generating sha256 checksums of the file `../data/Total_AS_by_geneSymbol.tsv` directory .. ")
system(paste0("cd ../data && find . -name Total_AS_by_geneSymbol.tsv -exec sha256sum {} \\;  >>  ../data/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

message("Generating sha256 checksums of the file `../data/Total_AS_by_tissue.tsv` directory .. ")
system(paste0("cd ../data && find . -name Total_AS_by_tissue.tsv -exec sha256sum {} \\;  >>  ../data/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

message("Generating sha256 checksums of the file `../data/Total_AS_by_splicingtype.tsv` directory .. ")
system(paste0("cd ../data && find . -name Total_AS_by_splicingtype.tsv -exec sha256sum {} \\;  >>  ../data/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

message("Generating sha256 checksums of the file `../data/genesWithCommonAS.tsv` directory .. ")
system(paste0("cd ../data && find . -name genesWithCommonAS.tsv -exec sha256sum {} \\;  >>  ../data/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

message("Generating sha256 checksums of the file `../data/SplicingIndex_chr.tsv` directory .. ")
system(paste0("cd ../data && find . -name SplicingIndex_chr.tsv -exec sha256sum {} \\;  >>  ../data/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")


Generating sha256 checksums of the file `../data/gene_as.tsv` directory .. 



Done!




Generating sha256 checksums of the file `../data/all_gene_as.tsv` directory .. 



Done!




Generating sha256 checksums of the file `../data/gene_dge.tsv` directory .. 



Done!




Generating sha256 checksums of the file `../data/Total_AS_by_chr.tsv` directory .. 



Done!




Generating sha256 checksums of the file `../data/Total_AS_by_geneSymbol.tsv` directory .. 



Done!




Generating sha256 checksums of the file `../data/Total_AS_by_tissue.tsv` directory .. 



Done!




Generating sha256 checksums of the file `../data/Total_AS_by_splicingtype.tsv` directory .. 



Done!




Generating sha256 checksums of the file `../data/genesWithCommonAS.tsv` directory .. 



Done!




Generating sha256 checksums of the file `../data/SplicingIndex_chr.tsv` directory .. 



Done!




### Appendix 2. Libraries metadata

In [35]:
end_time <- Sys.time()
end_time - start_time

Time difference of 3.140546 mins