# Analysis Notebook - Count Genes and Events

This notebook processes the raw counts as provided by rMATS and performs some descriptive statistical analysis. It is used to produce the following outputs. 

## Data files created by this notebook
Output text files are written to the ``data/`` directory (at the same level as the ``jupyter`` directory). 

1. **gene_AS.tsv**: Alternative splicing events per gene
2. **genesWithCommonAS.tsv
3. **Total_AS_by_chr.tsv**: Total alternative splicing events per chromosome
4. **Total_AS_by_geneSymbol.tsv**: Count the number of tissues in which specific genes show significant alternative splicing
5. **Total_AS_by_tissue.tsv**: Count the number of significant splicing events per tissue
6. **Total_AS_by_splicingtype.tsv**: Count number of significant splicing events for each of the 5' alternative splicing categories
7. **Significant_AS_events.tsv**: ?? Counts of significant events per slicing type per tissue
8. **SplicingIndex_chr.tsv**: Splicing index by chr (number of sigificant AS events per 1000 exons)

In [1]:
defaultW <- getOption("warn")  # suppress warnings for this cell
options(warn = -1) 

library(dplyr)
library(ggplot2)
library(limma)
library(multtest)
library(Biobase)
library(edgeR)
library(tibble)
library(R.utils)
library(rtracklayer)

options(warn = defaultW)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: BiocGenerics

Loading required package: parallel


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB


The following object is masked from ‘package:limma’:

    plotMA


The following objects are masked from ‘package:dplyr’:

    combine, intersect, setdiff, union


The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
   

## 1. Download all the rMATS results

Each of the alternative splicing output files are downloaded here:

### 1.1 get released rMATS GTF annotations

For each splicing type, the junctions are defined, so we have 5 specific annotated splicing specific junction ID annotation files:

1. **fromGTF.A3SS.txt**: annotations for the alternative 3' splice site junctions
2. **fromGTF.A5SS.txt**: annotations for the alternative 5' splice site junctions
3. **fromGTF.MXE.txt**: annotations for the mutually exclusive exon junctions
4. **fromGTF.RI.txt**: annotations for the retained introns junctions
5. **fromGTF.SE.txt**: annotations for the skipped exon junctions

## 1.2 Unpack the data.tar file if necessary
To run this script, we need to import three compressed files and unpack them

|File 
1. DGE_splicing_data.tar.gz
2. data.tar.gz                      b336c423027b71f74732644bc04b9af8df47d403cdf8f7cb08a400f4f9c2b7aa
3. DGE_gene_csv.2020-06-17.tar.gz

In [2]:
dge_splicing_file_dir <- list.files("../../mounted-data", pattern='DGE_splicing_data.tar.gz')
dge_splicing_file <- paste("../../mounted-data", dge_splicing_file_dir, 'robinson-bucket/notebooks/DGE_splicing_data', sep='/')
dge_splicing_file_tar_gz <- paste(dge_splicing_file, '.tar.gz', sep='')
message("In order to unpack the necessary files, execute the following commands on the shell.")
message("1. DGE_splicing_data.tar.gz")
mycommand = paste("tar xvfz ",dge_splicing_file_tar_gz, "-C ../data", sep=" ")
message(mycommand)
message("2. data.tar.gz")
data_file_dir <- list.files("../../mounted-data", pattern='-data.tar.gz')
data_file_tar_gz <- paste("../../mounted-data", data_file_dir, 'robinson-bucket/notebooks/data.tar.gz', sep='/')
mycommand = paste("tar xvfz ",data_file_tar_gz, "-C ../data", sep=" ")
message(mycommand)
message("3. DGE_gene_csv.2020-06-17.tar.gz")
dge_file <- list.files("../../mounted-data", pattern='DGE_gene_csv')
dge_file_tar_gz <- paste("../../mounted-data", dge_file, 'robinson-bucket/results-download/DGE_gene_csv.2020-06-17.tar.gz', sep='/')
mycommand = paste("tar xvfz ",dge_file_tar_gz, "-C ../data", sep=" ")
message(mycommand)

In order to unpack the necessary files, execute the following commands on the shell.

1. DGE_splicing_data.tar.gz

tar xvfz  ../../mounted-data/5ee271c1143fa00113c7d138-DGE_splicing_data.tar.gz-5ee271c1143fa00113c7d138/robinson-bucket/notebooks/DGE_splicing_data.tar.gz -C ../data

2. data.tar.gz

tar xvfz  ../../mounted-data/5ee271e6143fa00113c7d1b2-data.tar.gz-5ee271e6143fa00113c7d1b2/robinson-bucket/notebooks/data.tar.gz -C ../datatar xvfz  ../../mounted-data/5eea35ce143fa00113f04750-data.tar.gz-5eea35ce143fa00113f04750/robinson-bucket/notebooks/data.tar.gz -C ../data

3. DGE_gene_csv.2020-06-17.tar.gz

tar xvfz  ../../mounted-data//robinson-bucket/results-download/DGE_gene_csv.2020-06-17.tar.gz -C ../data



In [3]:
## get the rmats 3.2.5 discovered/annotated junction information in GTF format
message("Decompressing fromGTF.tar.gz into ../data")
system("mkdir -p ../data && tar xvfz ../data/fromGTF.tar.gz -C ../data", intern = TRUE)
system("gunzip ../data/fromGTF.*txt.gz", intern = TRUE)
message("Done!\n")

Decompressing fromGTF.tar.gz into ../data



“running command 'gunzip ../data/fromGTF.*txt.gz' had status 2”


Done!




### 2  Refined results
We define **refined results* as (FC > 1.5 and pVal < 0.05) for the sex\*as_event coefficient result for the linear model

### 2.1 getTissueReduction

In [4]:
tissue_reduction_filename <- "../assets/tissues.tsv"
tissue_reduction <- read.table(tissue_reduction_filename, header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
colnames(tissue_reduction)  <- c("SMTSD","female","male","include","display_name")
tissue_reduction <- tissue_reduction[tissue_reduction$display_name != "n/a",]
tissue_reduction$display_name <- factor(tissue_reduction$display_name)
levels(tissue_reduction$display_name)
message("We extracted ", length(levels(tissue_reduction$display_name))," different tissues with at least 50 samples in both M & f")

We extracted 39 different tissues with at least 50 samples in both M & f



### 2.2 Read in refined results and annotations

In [5]:
significant_results_dir = "../data/"
pattern = "DGE_sex_as_events_refined.csv"
files <- list.files(path = significant_results_dir, pattern = pattern)
as_types <- c("a3ss", "a5ss", "mxe", "ri", "se")
length(files)

In [6]:
a3ss_annot <- read.table(file = "../data/fromGTF.A3SS.txt", sep = "\t", quote = "\"", header = T, stringsAsFactors = F)
a5ss_annot <- read.table(file = "../data/fromGTF.A5SS.txt", sep = "\t", quote = "\"", header = T, stringsAsFactors = F)
mxe_annot <- read.table(file = "../data/fromGTF.MXE.txt", sep = "\t", quote = "\"", header = T, stringsAsFactors = F)
ri_annot <- read.table(file = "../data/fromGTF.RI.txt", sep = "\t", quote = "\"", header = T, stringsAsFactors = F)
se_annot <- read.table(file = "../data/fromGTF.SE.txt", sep = "\t", quote = "\"", header = T, stringsAsFactors = F)

In [7]:
head(se_annot)

Unnamed: 0_level_0,ID,GeneID,geneSymbol,chr,strand,exonStart_0base,exonEnd,upstreamES,upstreamEE,downstreamES,downstreamEE
Unnamed: 0_level_1,<int>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<int>,<int>,<int>,<int>
1,1,ENSG00000034152.18,MAP2K3,chr17,+,21287990,21288091,21284709,21284969,21295674,21295769
2,2,ENSG00000034152.18,MAP2K3,chr17,+,21303182,21303234,21302142,21302259,21304425,21304553
3,3,ENSG00000034152.18,MAP2K3,chr17,+,21295674,21295769,21287990,21288091,21296085,21296143
4,4,ENSG00000034152.18,MAP2K3,chr17,+,21295674,21295769,21287990,21288091,21298412,21298479
5,5,ENSG00000034152.18,MAP2K3,chr17,+,21295674,21295769,21284710,21284969,21296085,21296143
6,6,ENSG00000034152.18,MAP2K3,chr17,+,21295674,21295769,21284710,21284969,21298412,21298479


In [8]:
gene_as = data.frame()
GeneJunction <- rep("NA", length(files))
counts <- rep(NA, length(files))
ASE <- rep("NA", length(files))
Tissue <- rep("NA", length(files))
Display <- rep("NA", length(files))
GeneSymbol <- rep("NA", length(files))
GeneID <- rep("NA", length(files))
chr <- rep("NA", length(files))
logFC <- rep(NA, length(files))
AveExpr <- rep("NA", length(files))
t <- rep("NA", length(files))
PValue <- rep("NA", length(files))
AdjPVal <- rep("NA", length(files))
length(files)

In [9]:
for (i in 1:length(files)) {
    lines  <- read.table(file=paste0(significant_results_dir, files[i]), 
                                     header = TRUE, sep = ",", quote = "\"'", skipNul = FALSE)
#    message(paste(dim(lines)[1] >0),collapse = "")
    if (dim(lines)[1] > 0) {
        event     <- as.vector(as.character(rownames(lines)))
        tissue1   <- gsub("_DGE_sex_as_events_refined.csv","", files[i], fixed = TRUE)
        counts[i] <- dim(lines)[1]
        event_idx <- substring(event, regexpr("[0-9]+$", event))
        res       <- data.frame()
        if (grepl("^a3ss_", files[i])) {
            # remove the first 5 letters of the string 
            tissue2 <- substring(tissue1,6)
            ASE[i] <- "A3SS"
            Tissue[i] <- tissue2
            idx <- match(event_idx, a3ss_annot$ID)
            res <- data.frame(GeneJunction <- event,
                              ASE          <- "A3SS", 
                              ASE_IDX      <- idx,
                              Tissue       <- tissue2,
                              counts       <- counts[i],
                              Display      <- tissue_reduction[tissue_reduction$SMTSD == tissue2, "display_name"],
                              GeneSymbol   <- a3ss_annot$geneSymbol[idx],
                              GeneID       <- a3ss_annot$GeneID[idx],
                              chr          <- a3ss_annot$chr[idx],
                              logFC        <- lines$logFC,
                              AveExpr      <- lines$AveExpr,
                              t            <- lines$t,
                              PValue       <- lines$P.Value,
                              AdjPVal      <- lines$adj.P.Val,
                              B            <- lines$B)
            colnames(res) <- c("GeneJunction","ASE","ASE_IDX","Tissue","counts","Display",
                               "GeneSymbol","GeneID","chr","logFC","AveExpr","t","PValue","AdjPVal","B")
            gene_as <- rbind(gene_as,res)
            
        } else if (grepl("^a5ss_", files[i])) {
            # remove the first 5 letters of the string 
            tissue2 <- substring(tissue1,6)
            ASE[i] <- "A5SS"
            Tissue[i] <- tissue2
            idx <- match(event_idx, a5ss_annot$ID)
            res <- data.frame(GeneJunction <- event,
                              ASE          <- "A5SS", 
                              ASE_IDX      <- idx,
                              Tissue       <- tissue2,
                              counts       <- counts[i],
                              Display      <- tissue_reduction[tissue_reduction$SMTSD == tissue2, "display_name"],
                              GeneSymbol   <- a5ss_annot$geneSymbol[idx],
                              GeneID       <- a5ss_annot$GeneID[idx],
                              chr          <- a5ss_annot$chr[idx],
                              logFC        <- lines$logFC,
                              AveExpr      <- lines$AveExpr,
                              t            <- lines$t,
                              PValue       <- lines$P.Value,
                              AdjPVal      <- lines$adj.P.Val,
                              B            <- lines$B)
            colnames(res) <- c("GeneJunction","ASE","ASE_IDX","Tissue","counts","Display",
                               "GeneSymbol","GeneID","chr","logFC","AveExpr","t","PValue","AdjPVal","B")
            gene_as <- rbind(gene_as,res)
        } else if (grepl("^mxe_", files[i])) {
            ASE[i] <- "MXE"
            # remove the first 4 letters of the string 
            tissue2 <- substring(tissue1,5)
            Tissue[i] <- tissue2
            idx <- match(event_idx, a3ss_annot$ID)
            res <- data.frame(GeneJunction <- event,
                              ASE          <- "MXE", 
                              ASE_IDX      <- idx,
                              Tissue       <- tissue2,
                              counts       <- counts[i],
                              Display      <- tissue_reduction[tissue_reduction$SMTSD == tissue2, "display_name"],
                              GeneSymbol   <- mxe_annot$geneSymbol[idx],
                              GeneID       <- mxe_annot$GeneID[idx],
                              chr          <- mxe_annot$chr[idx],
                              logFC        <- lines$logFC,
                              AveExpr      <- lines$AveExpr,
                              t            <- lines$t,
                              PValue       <- lines$P.Value,
                              AdjPVal      <- lines$adj.P.Val,
                              B            <- lines$B)
            colnames(res) <- c("GeneJunction","ASE","ASE_IDX","Tissue","counts","Display",
                               "GeneSymbol","GeneID","chr","logFC","AveExpr","t","PValue","AdjPVal","B")
            gene_as <- rbind(gene_as,res)
        } else if (grepl("^se_", files[i])) {
            ASE[i] <- "SE"
            # remove the first 3 letters of the string 
            tissue2 <- substring(tissue1,4)
            Tissue[i] <- tissue2
            idx <- match(event_idx, se_annot$ID)
            res <- data.frame(GeneJunction <- event,
                              ASE          <- "SE", 
                              ASE_IDX      <- idx,
                              Tissue       <- tissue2,
                              counts       <- counts[i],
                              Display      <- tissue_reduction[tissue_reduction$SMTSD == tissue2, "display_name"],
                              GeneSymbol   <- se_annot$geneSymbol[idx],
                              GeneID       <- se_annot$GeneID[idx],
                              chr          <- se_annot$chr[idx],
                              logFC        <- lines$logFC,
                              AveExpr      <- lines$AveExpr,
                              t            <- lines$t,
                              PValue       <- lines$P.Value,
                              AdjPVal      <- lines$adj.P.Val,
                              B            <- lines$B)
            colnames(res) <- c("GeneJunction","ASE","ASE_IDX","Tissue","counts","Display",
                               "GeneSymbol","GeneID","chr","logFC","AveExpr","t","PValue","AdjPVal","B")
            gene_as <- rbind(gene_as,res)
        } else if (grepl("^ri_", files[i])){
            ASE[i] <- "RI"
            # remove the first 3 letters of the string 
            tissue2 <- substring(tissue1,4)
            Tissue[i] <- tissue2
            idx <- match(event_idx, ri_annot$ID)
            res <- data.frame(GeneJunction <- event,
                              ASE          <- "RI", 
                              ASE_IDX      <- idx,
                              Tissue       <- tissue2,
                              counts       <- counts[i],
                              Display      <- tissue_reduction[tissue_reduction$SMTSD == tissue2, "display_name"],
                              GeneSymbol   <- ri_annot$geneSymbol[idx],
                              GeneID       <- ri_annot$GeneID[idx],
                              chr          <- ri_annot$chr[idx],
                              logFC        <- lines$logFC,
                              AveExpr      <- lines$AveExpr,
                              t            <- lines$t,
                              PValue       <- lines$P.Value,
                              AdjPVal      <- lines$adj.P.Val,
                              B            <- lines$B)
            colnames(res) <- c("GeneJunction","ASE","ASE_IDX","Tissue","counts","Display",
                               "GeneSymbol","GeneID","chr","logFC","AveExpr","t","PValue","AdjPVal","B")
            gene_as <- rbind(gene_as,res)
        }
        
    } #if has sig. events
    
} #for all files
colnames(gene_as) <- c("GeneJunction","ASE","ASE_IDX","Tissue","counts","Display","GeneSymbol","GeneID","chr","logFC","AveExpr","t","PValue","AdjPVal","B")
n_unique_genes <- length(summary(as.factor(gene_as$GeneSymbol),maxsum=50000))
message("We extracted a total of ",nrow(gene_as)," significant alternative splicing events (gene_as)")
message("This includes ", n_unique_genes, " total genes")

We extracted a total of 4494 significant alternative splicing events (gene_as)

This includes 2433 total genes



In [10]:
head(gene_as)

Unnamed: 0_level_0,GeneJunction,ASE,ASE_IDX,Tissue,counts,Display,GeneSymbol,GeneID,chr,logFC,AveExpr,t,PValue,AdjPVal,B
Unnamed: 0_level_1,<fct>,<fct>,<int>,<fct>,<int>,<fct>,<fct>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,MDM4-3553,A3SS,3553,adipose_visceral_omentum,3,Adipose (v),MDM4,ENSG00000198625.13,chr1,-0.6698892,2.983493,-4.202351,3.219366e-05,0.041776224,1.900563
2,WDR17-8668,A3SS,8668,adipose_visceral_omentum,3,Adipose (v),WDR17,ENSG00000150627.15,chr4,-0.6285876,1.095294,-4.203459,3.204279e-05,0.041776224,1.896543
3,IL17RC-5032,A3SS,5032,adipose_visceral_omentum,3,Adipose (v),IL17RC,ENSG00000163702.20,chr3,-0.7279304,3.207863,-4.220848,2.975981e-05,0.041776224,1.869097
4,DDX3X-5712,A3SS,5712,adrenal_gland,3,Adrenal gland,DDX3X,ENSG00000215301.10,chrX,-0.710831,4.242619,-4.866784,1.858002e-06,0.007615023,4.466763
5,SCO1-8452,A3SS,8452,adrenal_gland,3,Adrenal gland,SCO1,ENSG00000133028.12,chr17,0.7708987,4.880664,4.918296,1.458709e-06,0.007615023,4.418012
6,MYO7A-1493,A3SS,1493,adrenal_gland,3,Adrenal gland,MYO7A,ENSG00000137474.21,chr11,0.6264866,3.912029,4.513516,9.251297e-06,0.025277627,3.050941


### 3 Data Structures for Figures

### 3.1 gene_as.tsv

This file contains (description)
Here is a typical line
<pre>
        GeneJunction    ASE     ASE_IDX Tissue  GeneSymbol      chr
1       MDM4-3553       SE      3553    adipovisceral_omentum   RNPEP   chr1
2       WDR17-8668      SE      8668    adipovisceral_omentum   ANKMY1  chr2
3       IL17RC-5032     SE      5032    adipovisceral_omentum   SNCAIP  chr5
4       DDX3X-5712      A3SS    5712    adrenal_gland   DDX3X   chrX
</pre>
There are 2848 significant events in the file.

In [11]:
glimpse(gene_as)
gene_as$Tissue <- factor(gene_as$Tissue)
length(levels(gene_as$Tissue))
table(is.na(gene_as$Display))
table(gene_as$Display)
colnames(gene_as)
write.table(gene_as, "../data/gene_as.tsv", quote=FALSE, sep="\t")
head(gene_as)
tissue_reduction$display_name <- factor(tissue_reduction$display_name)

Observations: 4,494
Variables: 15
$ GeneJunction [3m[90m<fct>[39m[23m MDM4-3553, WDR17-8668, IL17RC-5032, DDX3X-5712, SCO1-845…
$ ASE          [3m[90m<fct>[39m[23m A3SS, A3SS, A3SS, A3SS, A3SS, A3SS, A3SS, A3SS, A3SS, A3…
$ ASE_IDX      [3m[90m<int>[39m[23m 3553, 8668, 5032, 5712, 8452, 1493, 5712, 5710, 4891, 45…
$ Tissue       [3m[90m<fct>[39m[23m adipose_visceral_omentum, adipose_visceral_omentum, adip…
$ counts       [3m[90m<int>[39m[23m 3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 8, 8, 8, 8, 1, 1, 3, 3, 3,…
$ Display      [3m[90m<fct>[39m[23m Adipose (v), Adipose (v), Adipose (v), Adrenal gland, Ad…
$ GeneSymbol   [3m[90m<fct>[39m[23m MDM4, WDR17, IL17RC, DDX3X, SCO1, MYO7A, DDX3X, DDX3X, H…
$ GeneID       [3m[90m<fct>[39m[23m ENSG00000198625.13, ENSG00000150627.15, ENSG00000163702.…
$ chr          [3m[90m<fct>[39m[23m chr1, chr4, chr3, chrX, chr17, chr11, chrX, chrX, chr1, …
$ logFC        [3m[90m<dbl>[39m[23m -0.6698892, -0.6285876, -0.7279304, -0.71083


FALSE 
 4494 


         Adipose (sc)           Adipose (v)         Adrenal gland 
                   26                    23                    20 
                Aorta      Atrial appendage                Breast 
                   92                    12                  3434 
              Caudate Cerebellar hemisphere            Cerebellum 
                    3                     1                    22 
      Coronary artery                Cortex       EBV-lymphocytes 
                    4                     6                     7 
      Esophagus (gej)         Esophagus (m)        Esophagus (mu) 
                   14                    17                   241 
          Fibroblasts        Frontal cortex           Hippocampus 
                   37                     0                     9 
         Hypothalamus        Left ventricle                 Liver 
                    2                     7                     6 
                 Lung     Nucleus accumbens              Panc

Unnamed: 0_level_0,GeneJunction,ASE,ASE_IDX,Tissue,counts,Display,GeneSymbol,GeneID,chr,logFC,AveExpr,t,PValue,AdjPVal,B
Unnamed: 0_level_1,<fct>,<fct>,<int>,<fct>,<int>,<fct>,<fct>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,MDM4-3553,A3SS,3553,adipose_visceral_omentum,3,Adipose (v),MDM4,ENSG00000198625.13,chr1,-0.6698892,2.983493,-4.202351,3.219366e-05,0.041776224,1.900563
2,WDR17-8668,A3SS,8668,adipose_visceral_omentum,3,Adipose (v),WDR17,ENSG00000150627.15,chr4,-0.6285876,1.095294,-4.203459,3.204279e-05,0.041776224,1.896543
3,IL17RC-5032,A3SS,5032,adipose_visceral_omentum,3,Adipose (v),IL17RC,ENSG00000163702.20,chr3,-0.7279304,3.207863,-4.220848,2.975981e-05,0.041776224,1.869097
4,DDX3X-5712,A3SS,5712,adrenal_gland,3,Adrenal gland,DDX3X,ENSG00000215301.10,chrX,-0.710831,4.242619,-4.866784,1.858002e-06,0.007615023,4.466763
5,SCO1-8452,A3SS,8452,adrenal_gland,3,Adrenal gland,SCO1,ENSG00000133028.12,chr17,0.7708987,4.880664,4.918296,1.458709e-06,0.007615023,4.418012
6,MYO7A-1493,A3SS,1493,adrenal_gland,3,Adrenal gland,MYO7A,ENSG00000137474.21,chr11,0.6264866,3.912029,4.513516,9.251297e-06,0.025277627,3.050941


In [12]:
head(gene_as[gene_as$chr=="chrX",])
x_as_events <- gene_as[gene_as$chr=="chrX",]
message("There were ",nrow(gene_as)," total significant alternative splicing events (gene_as)")
message("There were ",nrow(x_as_events)," total significant alternative splicing events on the X chromosome (gene_as)")
message("i.e., ", (100*nrow(x_as_events)/nrow(gene_as)), "% of all significant AS events were on the X chromosome")

Unnamed: 0_level_0,GeneJunction,ASE,ASE_IDX,Tissue,counts,Display,GeneSymbol,GeneID,chr,logFC,AveExpr,t,PValue,AdjPVal,B
Unnamed: 0_level_1,<fct>,<fct>,<int>,<fct>,<int>,<fct>,<fct>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
4,DDX3X-5712,A3SS,5712,adrenal_gland,3,Adrenal gland,DDX3X,ENSG00000215301.10,chrX,-0.710831,4.242619,-4.866784,1.858002e-06,0.007615023,4.4667629
7,DDX3X-5712,A3SS,5712,artery_aorta,8,Aorta,DDX3X,ENSG00000215301.10,chrX,-0.6475782,4.316019,-6.97689,1.131772e-11,9.2545e-08,15.6734701
8,DDX3X-5710,A3SS,5710,artery_aorta,8,Aorta,DDX3X,ENSG00000215301.10,chrX,-0.6530134,4.920454,-6.245399,1.007676e-09,4.119884e-06,11.5953025
39,DDX3X-5706,A3SS,5706,breast_mammary_tissue,146,Breast,DDX3X,ENSG00000215301.10,chrX,0.8617529,5.221914,6.1048,2.638024e-09,1.620406e-06,10.8830862
111,SPIN3-6181,A3SS,6181,breast_mammary_tissue,146,Breast,SPIN3,ENSG00000204271.13,chrX,-0.6946358,1.374988,-4.014661,7.230764e-05,0.003918968,1.3973387
146,IGSF1-8876,A3SS,8876,breast_mammary_tissue,146,Breast,IGSF1,ENSG00000147255.19,chrX,-0.7069764,1.90252,-3.406126,0.0007324239,0.01963162,-0.6942546


There were 4494 total significant alternative splicing events (gene_as)

There were 295 total significant alternative splicing events on the X chromosome (gene_as)

i.e., 6.56430796617713% of all significant AS events were on the X chromosome



### 3.2 Tissue specific data frame

In [13]:
data <- data.frame(Tissue=gene_as$Display, ASE=gene_as$ASE, Counts=gene_as$counts)

numberOfUniqueTissues <- length(summary(as.factor(data$Tissue),maxsum=500))
numberOfASEmechanisms <- length(summary(as.factor(data$ASE),maxsum=500))

message("data now has ",numberOfUniqueTissues, " tissues and ", numberOfASEmechanisms, " ASE categories")
message("ASE:")
summary(as.factor(data$ASE),maxsum=500)

data now has 39 tissues and 5 ASE categories

ASE:



### 3.3 Count splicing event by chromosome

Count the number of significant alternative splicing events per chromosome and write to the file **Total_AS_by_chr.tsv**.

In [14]:
res2 <- gene_as          %>% 
       group_by(chr)    %>% 
       count(chr)       %>% 
       arrange(desc(n)) %>% 
       as.data.frame()
res2$chr <- factor(res2$chr, levels = res2$chr)
length(res2$chr)
res2
glimpse(res2)
write.table(res2, file= "../data/Total_AS_by_chr.tsv", sep="\t", quote = FALSE, row.names=F)

chr,n
<fct>,<int>
chr1,449
chr2,299
chr19,298
chrX,295
chr11,291
chr17,272
chr3,267
chr12,256
chr4,227
chr16,218


Observations: 23
Variables: 2
$ chr [3m[90m<fct>[39m[23m chr1, chr2, chr19, chrX, chr11, chr17, chr3, chr12, chr4, chr16, …
$ n   [3m[90m<int>[39m[23m 449, 299, 298, 295, 291, 272, 267, 256, 227, 218, 191, 190, 168, …


### 3.4 Count most frequent spliced genes 

In [15]:
res3 <- gene_as %>% 
       group_by(GeneSymbol) %>% count(GeneSymbol) %>% arrange(desc(n)) %>% as.data.frame()
res3$GeneSymbol <- factor(res3$GeneSymbol, levels = res3$GeneSymbol)
length(res3$GeneSymbol)
head(res3)
write.table(res3, file = "../data/Total_AS_by_geneSymbol.tsv", sep = "\t", quote=FALSE, row.names = F)

Unnamed: 0_level_0,GeneSymbol,n
Unnamed: 0_level_1,<fct>,<int>
1,DDX3X,85
2,KDM5C,40
3,ZFX,21
4,SORBS2,18
5,CD44,15
6,CRHR1,14


### 3.5 Count most frequent splicing by tissue

In [16]:
res4 <- gene_as %>% 
       group_by(Display) %>% 
       count(Display) %>% 
       arrange(desc(n)) %>% 
       as.data.frame()
res4$Display <- factor(res4$Display, levels = res4$Display)
length(res4$Display)
res4
write.table(res4, file = "../data/Total_AS_by_tissue.tsv", sep = "\t", row.names = F)

Display,n
<fct>,<int>
Breast,3434
Nucleus accumbens,271
Esophagus (mu),241
Aorta,92
Thyroid,52
Spleen,40
Fibroblasts,37
Skeletal muscle,31
Adipose (sc),26
Adipose (v),23


###  3.6 Significant Count by splicing type 
We define **significant** to be FC > 1.5 and pVal < 0.05

Our starting values were the significant events, all meeting the criteria FC > 1.5 and pVal < 0.05


In [17]:
res5 <- gene_as %>% group_by(ASE) %>% count(ASE) %>% arrange(desc(n)) %>% as.data.frame()
res5$ASE <- factor(res5$ASE, levels = res5$ASE)
head(res5)
write.table(res5, file= "../data/Total_AS_by_splicingtype.tsv")

Unnamed: 0_level_0,ASE,n
Unnamed: 0_level_1,<fct>,<int>
1,SE,3742
2,A3SS,232
3,A5SS,207
4,RI,192
5,MXE,121


###  3.7 Significant Count by splicing type (significant == FC > 1.5 and pVal < 0.05)

In [18]:
A3SS_keep <- as.character(gene_as$ASE) %in% "A3SS"
table(A3SS_keep)
A3SS.gene_as <- data.frame(gene_as[A3SS_keep == TRUE,])

A5SS_keep <- as.character(gene_as$ASE) %in% "A5SS"
table(A5SS_keep)
A5SS.gene_as <- data.frame(gene_as[A5SS_keep == TRUE,])

MXE_keep  <- as.character(gene_as$ASE) %in% "MXE"
table(MXE_keep)
MXE.gene_as <- data.frame(gene_as[MXE_keep == TRUE,])

SE_keep   <- as.character(gene_as$ASE) %in% "SE"
table(SE_keep)
SE.gene_as <- data.frame(gene_as[SE_keep == TRUE,])

RI_keep   <- as.character(gene_as$ASE) %in% "RI"
table(RI_keep)
RI.gene_as <- data.frame(gene_as[RI_keep == TRUE,])

dim(A3SS.gene_as)
dim(A5SS.gene_as)
dim(MXE.gene_as)
dim(SE.gene_as)
dim(RI.gene_as)


A3SS_keep
FALSE  TRUE 
 4262   232 

A5SS_keep
FALSE  TRUE 
 4287   207 

MXE_keep
FALSE  TRUE 
 4373   121 

SE_keep
FALSE  TRUE 
  752  3742 

RI_keep
FALSE  TRUE 
 4302   192 

### 3.8 Siginficant spliced by Gene for each splicing factor

In [19]:
A3SS.res <- A3SS.gene_as %>% group_by(GeneSymbol) %>% count(GeneSymbol) %>% arrange(desc(n)) %>% as.data.frame()
A3SS.res$GeneSymbol <- factor(A3SS.res$GeneSymbol, levels = A3SS.res$GeneSymbol)
message("Significant spliced genes for A3SS\n",
        paste(length(A3SS.res$GeneSymbol)), collapse=" ")
head(A3SS.res)

A5SS.res <- A5SS.gene_as %>% group_by(GeneSymbol) %>% count(GeneSymbol) %>% arrange(desc(n)) %>% as.data.frame()
A5SS.res$GeneSymbol <- factor(A5SS.res$GeneSymbol, levels = A5SS.res$GeneSymbol)
message("Significant spliced genes for A5SS\n",
        paste(length(A5SS.res$GeneSymbol)), collapse=" ")
head(A5SS.res)

MXE.res <- MXE.gene_as %>% group_by(GeneSymbol) %>% count(GeneSymbol) %>% arrange(desc(n)) %>% as.data.frame()
MXE.res$GeneSymbol <- factor(MXE.res$GeneSymbol, levels = MXE.res$GeneSymbol)
message("Significant spliced genes for MXE\n",
        paste(length(MXE.res$GeneSymbol)), collapse=" ")
head(MXE.res)

RI.res <- RI.gene_as %>% group_by(GeneSymbol) %>% count(GeneSymbol) %>% arrange(desc(n)) %>% as.data.frame()
RI.res$GeneSymbol <- factor(RI.res$GeneSymbol, levels = RI.res$GeneSymbol)
message("Significant spliced genes for RI\n",
        paste(length(RI.res$GeneSymbol)), collapse=" ")
head(RI.res)

SE.res <- SE.gene_as %>% group_by(GeneSymbol) %>% count(GeneSymbol) %>% arrange(desc(n)) %>% as.data.frame()
SE.res$GeneSymbol <- factor(SE.res$GeneSymbol, levels = SE.res$GeneSymbol)
message("Significant spliced genes for SE\n",
        paste(length(SE.res$GeneSymbol)), collapse=" ")
head(SE.res)

Significant spliced genes for A3SS
187 



Unnamed: 0_level_0,GeneSymbol,n
Unnamed: 0_level_1,<fct>,<int>
1,DDX3X,13
2,HAND2-AS1,7
3,WDR17,3
4,ADGRG1,3
5,PPM1J,3
6,FAIM2,2


Significant spliced genes for A5SS
170 



Unnamed: 0_level_0,GeneSymbol,n
Unnamed: 0_level_1,<fct>,<int>
1,DDX3X,23
2,WDR31,4
3,ITGB7,3
4,DEPDC5,2
5,SGCE,2
6,CD38,2


Significant spliced genes for MXE
88 



Unnamed: 0_level_0,GeneSymbol,n
Unnamed: 0_level_1,<fct>,<int>
1,DDX3X,7
2,SORBS2,5
3,ACSL6,4
4,SGCE,3
5,NRG2,3
6,FAM49B,3


Significant spliced genes for RI
160 



Unnamed: 0_level_0,GeneSymbol,n
Unnamed: 0_level_1,<fct>,<int>
1,DDX3X,13
2,HAND2-AS1,3
3,SGCE,3
4,SPG7,3
5,TMEM79,3
6,KDM5B,2


Significant spliced genes for SE
2093 



Unnamed: 0_level_0,GeneSymbol,n
Unnamed: 0_level_1,<fct>,<int>
1,KDM5C,40
2,DDX3X,29
3,ZFX,21
4,CRHR1,14
5,CD44,13
6,DDR1,12


### 3.9 Count most frequent spliced genes

In [20]:
res <- gene_as %>% group_by(GeneSymbol) %>% count(GeneSymbol) %>% arrange(desc(n)) %>% as.data.frame()
res$GeneSymbol <- factor(res$GeneSymbol, levels = res$GeneSymbol)
length(res$GeneSymbol)
res2 <- data %>% group_by(Tissue) %>% 
    summarise(Total = sum(Counts)) %>%
    arrange(desc(Total)) %>%
    as.data.frame()

#Add number of tissues
nTissues <- rep(NA, length(res))
for (i in 1:nrow(res)) {
  df_gene <- gene_as %>% filter(GeneSymbol == res$GeneSymbol[i])
  nTissues[i] <- length(unique(df_gene$Tissue))
}
res$Tissues <- nTissues
head(res)
write.table(res, file = "../data/genesWithCommonAS.tsv", sep = "\t", quote = F, row.names = F)

Unnamed: 0_level_0,GeneSymbol,n,Tissues
Unnamed: 0_level_1,<fct>,<int>,<int>
1,DDX3X,85,19
2,KDM5C,40,29
3,ZFX,21,13
4,SORBS2,18,2
5,CD44,15,1
6,CRHR1,14,2


### 3.10 Count most frequent spliced chromosomes
To get an indication of which chromosome has the most frequent slicing event (regardless of type)
We create an index based upon the number of exons per chromosome.

get the annotation file, at this writing, gencode.v30.annotation.gtf
The information as to the number of exons within the chromosome may be found there

In [21]:
if (!("gencode.v30.annotation.gtf.gz" %in% list.files("../data/"))) {
    message("downloading gencode v30 annotation\n")
    system("wget -O ../data/gencode.v30.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.annotation.gtf.gz")
    message("Done!\n")
    message("Unzipping compressed file gencode.v30.annotation.gtf.gz..")
    system("gunzip ../data/gencode.v30.annotation.gtf.gz", intern = TRUE)
    message("Done! gencode.v30.annotation.gtf can be found in ../data/")
}
gencode <- import("../data/gencode.v30.annotation.gtf")

In [22]:
exons <- gencode[ gencode$type == "exon", ]
exons <- as.data.frame(exons)

#Obtain chromosomes we have splicing information for (recall we did not use chr Y in our analysis)
all_chr <- as.character(unique(gene_as$chr))
chr_counts <- rep(0, length(all_chr))


for (i in 1:length(all_chr)) {
  chr_counts[i] <- nrow(exons[exons$seqnames == all_chr[i], ])
}

exon_counts <- data.frame(chr = all_chr, counts = chr_counts)

# Count most frequent spliced chromosomes
res <- gene_as %>% group_by(chr) %>% count(chr) %>% arrange(desc(n)) %>% as.data.frame()
res$chr <- factor(res$chr, levels = res$chr)

idx <- match(res$chr, exon_counts$chr)

res$ExonCounts <- exon_counts$counts[idx]

res$Index <- (res$n / res$ExonCounts) * 1000

res_sorted <- res %>% arrange(desc(Index))
res_sorted$chr <- factor(res_sorted$chr, levels = res_sorted$chr)
glimpse(res_sorted)

Observations: 23
Variables: 4
$ chr        [3m[90m<fct>[39m[23m chrX, chr22, chr4, chr19, chr11, chr1, chr16, chr15, chr17…
$ n          [3m[90m<int>[39m[23m 295, 133, 227, 298, 291, 449, 218, 165, 272, 256, 190, 74,…
$ ExonCounts [3m[90m<dbl>[39m[23m 40029, 28655, 50420, 74466, 75976, 118996, 61199, 47343, 7…
$ Index      [3m[90m<dbl>[39m[23m 7.369657, 4.641424, 4.502182, 4.001826, 3.830157, 3.773236…


In [23]:
write.table(data,       file = "../data/Significant_AS_events.tsv", sep = "\t", row.names = F, quote = F)
write.table(res_sorted, file = "../data/SplicingIndex_chr.tsv", sep = "\t", quote = F, row.names = F)

# Count gene expression analysis results
The files called (tissue)\_DGE\_refined.csv contain lists of genes found to have statistically significant differential expression.
The mapping files contain the ENSG id to gene symbol maps.

In [24]:
significant_results_dir = "../data/"
pattern = "_DGE_refined.csv"
files <- list.files(path = significant_results_dir, pattern = pattern)
map_pattern <- "_DGE_ensg_map.csv"
map_files <- list.files(path = significant_results_dir, pattern = map_pattern)
message("We got ", length(files), " files with significant DGEs and ", length(map_files), " mapping files")
head(files)
head(map_files)

We got 39 files with significant DGEs and 39 mapping files



In [37]:
gene_dge = data.frame()
GeneJunction <- rep("NA", length(files))
counts <- rep(NA, length(files))
ASE <- rep("NA", length(files))
Tissue <- rep("NA", length(files))
Display <- rep("NA", length(files))
GeneSymbol <- rep("NA", length(files))
GeneID <- rep("NA", length(files))
chr <- rep("NA", length(files))
logFC <- rep(NA, length(files))
AveExpr <- rep("NA", length(files))
t <- rep("NA", length(files))
PValue <- rep("NA", length(files))
AdjPVal <- rep("NA", length(files))

In [38]:
for (i in 1:length(files)) {
    lines  <- read.table(file=paste0(significant_results_dir, files[i]), 
                                     header = TRUE, sep = ",", quote = "\"'", skipNul = FALSE)
#    message(paste(dim(lines)[1] >0),collapse = "")
    if (dim(lines)[1] > 0) {
         ensgid     <- as.vector(as.character(rownames(lines)))
         tissue1 <- gsub("_DGE_refined.csv","", files[i], fixed = TRUE)
         Tissue[i] <- tissue1
         Display[i] <- as.character(tissue_reduction[tissue_reduction$SMTSD == tissue1,"display_name"])
         counts[i] <- dim(lines)[1]    
         res <- data.frame(Tissue       <- tissue1,
                           ENSG         <- ensgid,
                           counts       <- counts[i],
                           Display      <- tissue_reduction[tissue_reduction$SMTSD == tissue1, "display_name"],
                           logFC        <- lines$logFC,
                           AveExpr      <- lines$AveExpr,
                           t            <- lines$t,
                           PValue       <- lines$P.Value,
                           AdjPVal      <- lines$adj.P.Val,
                           B            <- lines$B)
            colnames(res) <- c("Tissue","ENSG","counts","Display",
                               "logFC","AveExpr","t","PValue","AdjPVal","B")
            gene_dge <- rbind(gene_dge,res)
    }
    
} #for all files
colnames(res) <- c("Tissue","counts","Display",
                  "logFC","AveExpr","t","PValue","AdjPVal","B")

n_unique_genes <- length(summary(as.factor(gene_dge$ENSG),maxsum=50000))
message("We extracted a total of ",nrow(gene_dge)," significant alternative splicing events (gene)")
message("This includes ", n_unique_genes, " total genes")

We extracted a total of 12633 significant alternative splicing events (gene)

This includes 7417 total genes



In [39]:
write.table(gene_dge, "../data/gene_dge.tsv", quote=FALSE, sep="\t")

### Appendix - Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **'artefacts'**, files generated during the analysis and stored in the folder directory **`data`**
2. List of environment metadata, dependencies, versions of libraries using `utils::sessionInfo()` and [`devtools::session_info()`](https://devtools.r-lib.org/reference/session_info.html)

### Appendix 1. Checksums with the sha256 algorithm

In [None]:
rm (notebookid)
notebookid   = "countGenesAndEvents"
notebookid

message("Generating sha256 checksums of the file `../data/Total_AS_by_tissue.tsv` directory .. ")
system(paste0("cd ../data && find . -name SplicingIndex_chr.tsv -exec sha256sum {} \\;  >  ../metadata/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

message("Generating sha256 checksums of the file `../data/Significant_events.tsv` directory .. ")
system(paste0("cd ../data && find . -name SplicingIndex_chr.tsv -exec sha256sum {} \\;  >  ../metadata/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

message("Generating sha256 checksums of the file `../data/Significant_events.tsv` directory .. ")
system(paste0("cd ../data && find . -name SplicingIndex_chr.tsv -exec sha256sum {} \\;  >  ../metadata/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")


paste0("../metadata/", notebookid, "_sha256sums.txt")

data.table::fread(paste0("../metadata/", notebookid, "_sha256sums.txt"), header = FALSE, col.names = c("sha256sum", "file"))

### Appendix 2. Libraries metadata

In [None]:
dev_session_info   <- devtools::session_info()
utils_session_info <- utils::sessionInfo()

message("Saving `devtools::session_info()` objects in ../metadata/devtools_session_info.rds  ..")
saveRDS(dev_session_info, file = paste0("../metadata/", notebookid, "_devtools_session_info.rds"))
message("Done!\n")

message("Saving `utils::sessionInfo()` objects in ../metadata/utils_session_info.rds  ..")
saveRDS(utils_session_info, file = paste0("../metadata/", notebookid ,"_utils_info.rds"))
message("Done!\n")

dev_session_info$platform
dev_session_info$packages[dev_session_info$packages$attached==TRUE, ]