# differenticalSplicingJunctionExpressionAnalysis as a Notebook 

rMATS 3.2.5 was run on controlled access RNASeq files retrieved experiments stored in the Sequence Read Archive with controlled access managed by dbGaP.   This experiment was run with the fastq files from GTEx v8.

The output (read in by section 1.2.1) are matrices which are the result of executing the rmats-nf nextflow workflow on all the samples from GTEx V8 https://github.com/lifebit-ai/rmats-nf.   The workflow begins with the accessions file and continues until a matrix.  Run without statistics, for the purposes of rMATS creating an annotated junction file for each of the five (5) splicing types.  The matrix is possible with this version of rMATS, as the junction ID is unique per annotation GTF.  In this running, we used gencode.v30.annotation.gtf (complete annotation).   The result is 5 matrices per splicing type.

rMATS RNASeq-MATS.py produces 10 different output types which get assembled into as type junction ID by sample ID matrices

## Alternative Splice Site Types are: (se, a3ss, a5ss, mxe, ri)

  * Skipped Exon events (se),
  * Alternative 3' splice site (a3ss),
  * Alternative 5' splice site (a5ss),
  * Mutually exclusive exon (mxe),
  * and retention intron (ri)

## There are two different kinds of junction counts

For our analysis here, we used just the jc count matrices.
  * jc = junction counts - reads that cross the junction
  * jcec = junction counts plus reads on the target (such as included exon)

## And the count type -- there are 5 types

  * inclusion levels (percent spliced in)
  * included junction counts (ijc)
  * skipped junction counts (sjc)
  * inclusion length (inclen)
  * skipped length (skiplen)

# 1.0 Loading dependencies

In [1]:
devtools::install_github("TheJacksonLaboratory/yarn@3ff72c0")
install.packages("gprofiler2")
library(gprofiler2)
library(downloader)
library(readr)
library(edgeR)
library(biomaRt)
library(DBI) # v >= 1.1.0 required for biomaRt
library(devtools)
library(limma)
library(piggyback)
library(multtest)
library(Biobase)
library(yarn)
library(edgeR)
library(tibble)
#install.packages('R.utils')
library(R.utils)
install.packages("snakecase")
library(snakecase)


Skipping install of 'yarn' from a github remote, the SHA1 (3ff72c0b) has not changed since last install.
  Use `force = TRUE` to force installation

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done

Loading required package: limma

“package ‘DBI’ was built under R version 3.6.2”
Loading required package: usethis


Attaching package: ‘devtools’


The following object is masked from ‘package:downloader’:

    source_url


Loading required package: BiocGenerics

Loading required package: parallel


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB


The following object is masked from ‘package:limma’:

    plotMA


The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


The following objects are masked from ‘pac

## 1.1 Nextflow execution parameter execution

Using the papermill library, we can parallelize execution of this notebook.  To do this, the loop at the bottom of this notebook should be commented out -- and papermill will run across all tissues fed into it.

In [2]:
# parameters for nextflow execution of notebook
tissue_index = 21

## 1.2 Retrieve Released data

Be sure to set your GITHUB_TOKEN, prior to downloading files

One suggestion is change it to your token and then run it then immediately change it back to this:

Sys.setenv(GITHUB_TOKEN = "your-very-own-github-token")

In [29]:
Sys.setenv(GITHUB_TOKEN = "your-very-own-github-token")

### Did you remember?

Did you remember to delete your private github token?  Now is a good time to do so, before you save your work and checkit in inadvertantly....

### 1.2.1 rmats_final matrices

In [3]:
getReleasedRMATSData <- function ( destDir ) {

  if (!("rmats_final.se.jc.ijc.txt.gz" %in% list.files(destDir))) {    
    # SE
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.se.jc.ijc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.se.jc.sjc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.se.jc.inc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.se.jc.inclen.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.se.jc.skiplen.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    # RI
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.ri.jc.ijc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.ri.jc.sjc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.ri.jc.inc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.ri.jc.inclen.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.ri.jc.skiplen.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    # MXE
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.mxe.jc.ijc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.mxe.jc.sjc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.mxe.jc.inc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.mxe.jc.inclen.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.mxe.jc.skiplen.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    # A3SS
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.a3ss.jc.ijc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.a3ss.jc.sjc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.a3ss.jc.inc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.a3ss.jc.inclen.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.a3ss.jc.skiplen.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
     # A5SS
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.a5ss.jc.ijc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.a5ss.jc.sjc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.a5ss.jc.inc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.a5ss.jc.inclen.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.a5ss.jc.skiplen.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = destDir)
   }
   return (0)
}

### 1.2.2 get released rMATS GTF annotations

For each splicing type, the junctions are defined, so we have 5 specific annotated splicing specific junction ID annotation files:
fromGTF.A3SS.txt <- annotations for the alternative 3' splice site junctions
fromGTF.A5SS.txt <- annotations for the alternative 5' splice site junctions
fromGTF.MXE.txt <- annotations for the mutually exclusive exon junctions
fromGTF.RI.txt <- annotations for the retained introns junctions
fromGTF.SE.txt <- annotations for the skipped exon junctions

In [4]:
getReleasedGTFAnnotations <- function ( destDir ) {

   if (! (file.exists("../data/fromGTF.tar.gz"))) {
       system("mkdir -p ../data", intern = TRUE)
       message("Fetching fromGTF.tar.gz from GitHub ..")
       # Download archive from GitHub release with tag "dge"
       piggyback::pb_download(file = "fromGTF.tar.gz",
                           dest = "../data",
                           repo = "adeslatt/sbas_gtf",
                           tag  = "rMATS.3.2.5.gencode.v30",
                           show_progress = TRUE)
       message("Done!\n")
       message("Decompressing fromGTF.tar.gz into ../data")
       system("mkdir -p ../data && tar xvfz ../data/fromGTF.tar.gz -C ../data", intern = TRUE)
       message("Done!\n")
       message("Decompressing fromGTF.*.txt.gz into ../data")
       system("gunzip  ../data/fromGTF*.txt.gz ", intern = TRUE)
       message("Done!\n")
   }
   return (0)
}


## 1.2.3 Read in SraRunData metadata 

- `Sequence Read Archive (SRA)` Accession Data, `SRR` numbers, this is used to map the SRR accession numbers to the sample information (SAMPID) which will be used to obtain the phenotype information.

In [5]:
getSraRunData <- function ( destDir ) {

  if (!("SraRunTable.txt.gz" %in% list.files(destDir))) {
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "TheJacksonLaboratory/sbas", 
        file = "SraRunTable.txt.gz",
        tag  = "GTExV8.v1.0", 
        dest = destDir)
  }
  message("Loading metadata from SraRunTable.txt.gz ../data/gtex.rds ..\n")
  metadata <- data.table::fread("../data/SraRunTable.txt.gz")
  message("done!\n")
  
  return(metadata)

}

## 1.2.4 Read in GTEx expression object

- `Genome Tissue Expression (GTEx)` Clinical Annotation - this is the expressionSet object that has the phenotype information
In this analysis there will be 3 expressionSet objects.  This one contains the gene Expression Count Data and phenotypes

In [30]:
#getGTExExpressionSet <- function ( destDir ) {

  if (!("gtex.rds" %in% list.files(destDir))) {
    message("Downloading and loading obj with GTEx v8 with 'yarn::downloadGTExV8()'\n")
    es <- yarn::downloadGTExV8(type='genes',file='../data/gtex.rds')
    message("Done!\n")

  } else {
    # Load with readRDS() if gtex.rds available in data/
      message("Loading obj GTEx v8 rds object with readRDS from ../data/gtex.rds ..\n")   
      es <- readRDS(file = "../data/gtex.rds")
      message("Done!\n")
      message("Generating sha256sum for gtex.rds ..\n")    
      message(system("sha256sum ../data/gtex.rds", intern = TRUE))
      message("Done!\n")
  }
#  return (es)

#}

Loading obj GTEx v8 rds object with readRDS from ../data/gtex.rds ..


Done!


Generating sha256sum for gtex.rds ..




ERROR: Error in system("sha256sum ../data/gtex.rds", intern = TRUE): cannot popen 'sha256sum ../data/gtex.rds', probable reason 'Cannot allocate memory'


### 1.2.5 get reduced Tissue Data

Stored in the assets subdirectory, reduced by inspection and selection focusing on those tissues with sufficient samples.

In [7]:
getTissueReduction <- function ( filename ) {

    tissue_reduction <- read.table(filename, header=TRUE, sep="\t",
                               skipNul=FALSE, stringsAsFactors = FALSE)
    colnames(tissue_reduction)  <- c("SMTSD","female","male","include","display_name")

    return(tissue_reduction)
}

### 1.2.6 read in and make Splicing Expression Set Object

Using the expressionSet object, we will make an expression set object for each of the count matrices for each of the alternative splicing matrices.

In [None]:
gtexPhenoDataObj <- obj
srr_metadata <- srr_metadata
filename_gz <- "../data/rmats_final.se.jc.ijc.txt.gz"


In [None]:
    message("\nloading ", paste(filename_gz), collapse=" ")
    counts <- data.table::fread(filename_gz)
    message("done!")
    rownames(counts) <- counts$ID
    counts <- counts[,-1]
    counts <- data.matrix(counts)



In [None]:
dim(counts)
    metadata_for_counts = colnames(counts)%in% srr_metadata$Run
    counts_for_metadata = srr_metadata$Run %in% colnames(counts)
    message("\nmetadata runs matching counts\n",
       paste(table(metadata_for_counts)), collapse = " ")
    message("\ncounts matching metadata\n",
       paste(table(counts_for_metadata)), collapse = " ")


In [None]:
    count_srrs <- colnames(counts)
    srr_metadata_match = as.character(srr_metadata$Run) %in% as.character(count_srrs[1])
    pd     <- pData(gtexPhenoDataObj[gtexPhenoDataObj$SAMPID== srr_metadata[match,"SAMPID"],])


In [None]:
SAMPID <- srr_metadata[match,"SAMPID"]
SAMPID
head(pData(gtexPhenoDataObj))
pdata_match = pData(gtexPhenoDataObj)$SAMPID %in% SAMPID
table(pdata_match)

In [None]:
pdata_sampids <- pData(gtexPhenoDataObj)$SAMPID
srr_metadata_sampids <- srr_metadata$SAMPID
#metadata_in_pdata <- pdata_sampids %in% srr_metadata_sampids
#table(metadata_in_pdata)
#pdata_in_metadata <- srr_metadata_sampids %in% pdata_sampids
#table(pdata_in_metadata)
#colnames(pData(gtexPhenoDataObj))
#table(pData(gtexPhenoDataObj)$SMTSD)
dim(gtexPhenoDataObj)
dim( pData(gtexPhenoDataObj))
keep = pData(gtexPhenoDataObj)$SMTSD=="adipose_subcutaneous"
table(keep)

obj <- gtexPhenoDataObj
obj       <-        obj[          ,pData(obj)$SMTSD=="adipose_subcutaneous"]
pData(obj)<- pData(obj)[pData(obj)$SMTSD=="adipose_subcutaneous",          ]

dim(obj)
dim( pData(obj))

keep = pData(obj)$SMTSD=="adipose_subcutaneous"
table(keep)

fe <- pData(obj)$SMTSD
length(fe)
fe

#pdata_sampids <- pData(gtexPhenoDataObj[keep,])$SAMPID
#length(pdata_sampids)
#metadata_in_pdata <- pdata_sampids %in% srr_metadata_sampids
#table(metadata_in_pdata)
#table(keep)
#table(pData(gtexPhenoDataObj[,keep])$SEX)

In [None]:

    # metadata from the SraRunTable is just used to match SRR accessions to SAMPID
    # GTEx sample phenotype data will be repeated - for samples that are shared
    # reflecting multiple sequencing runs per sample.
    #
    # because of that we have to build the matrix
    # build the matrix from the first one, rbind within the loop
#        ijc      <- makeSplicingExpressionSetObject(obj, srr_metadata, "../data/rmats_final.se.jc.ijc.txt.gz")

gtexPhenoDataObj <- obj
srr_metadata <- srr_metadata
filename_gz <- rmats_final.se.jc.ijc.txt.gz
#makeSplicingExpressionSetObject <- function (gtexPhenoDataObj, srr_metadata, filename_gz) {

    message("\nloading ", paste(filename_gz), collapse=" ")
    counts <- data.table::fread(filename_gz)
    message("done!")
    rownames(counts) <- counts$ID
    counts <- counts[,-1]
    counts <- data.matrix(counts)

    # getting the SRR accession names for counts
    count_srrs <- colnames(counts)
    match = as.character(srr_metadata$Run) %in% as.character(count_srrs[1])
    pd     <- pData(gtexPhenoDataObj[gtexPhenoDataObj$SAMPID== srr_metadata[match,"SAMPID"],])
                    
    pd$SRR <- srr_metadata$Run
    pdfinal <- pd
    for (i in 2:dim(counts)[2]) {
        sample_match = srr_metadata[colnames(counts[,i]) == srr_metadata$Run,]
        pd     <- pData(gtexPhenoDataObj[gtexPhenoDataObj$SAMPID== srr_metadata$SAMPID,])
        pd$SRR <- srr_metadata$Run
        pdfinal <- rbind(pdfinal, pd)
    }

    # because we have multiple SRR per Sample, 
    es  <- ExpressionSet(as.matrix(counts))
    pData(es) <- pdfinal

    return(es)
}

## 1.3 Quality control, preprocessing of gene ExpressionSet Object

### 1.3.1 Inspect and correct ExpressionSet Object

Ensure the expression set object is in sync with its phenotype information (as before in the differentialGeneExpression notebook.

In [28]:
#es <- obj
#rm (obj)
   message("\nBEFORE: dimension of expressionSet\n",
        paste(dim(exprs(es))), collapse=" ")
   message("\nBEFORE: dimension of expressionSet\n",
        paste(dim(es)), collapse=" ")
   sample_names=as.vector(as.character(colnames(exprs(es))))
   pheno_sample_names=as.vector(as.character(rownames(pData(es))))
   
   if (length(pheno_sample_names) > length(sample_names)) {
      superset <- pheno_sample_names
      subset   <- sample_names    
   }
   if (length(pheno_sample_names) < length(sample_names)) {
      superset <- sample_names
      subset   <- pheno_sample_names   
   } 
   non_overlaps <- setdiff( superset, subset)
   message("\nThe non-overlapping IDs between pheno and count data are:\n", 
        paste(length(non_overlaps)), collapse = " ")

   logical_match_names=superset %in% subset
   message("\nLogical diff pheno_sample_names, expression_sample_names\n",
        paste(table(logical_match_names)), collapse = " ")

   es        <- es       [logical_match_names==TRUE,]
   pData(es) <- pData(es)[logical_match_names==TRUE,]

   message("\nAFTER: dimension of expressionSet\n",
        paste(dim(exprs(es))), collapse=" ")
   message("\nAFTER: dimension of expressionSet\n",
        paste(dim(es)), collapse=" ")



BEFORE: dimension of expressionSet
5587817382 


BEFORE: dimension of expressionSet
5587817382 


The non-overlapping IDs between pheno and count data are:
2 


Logical diff pheno_sample_names, expression_sample_names
217382 


AFTER: dimension of expressionSet
5587217382 


AFTER: dimension of expressionSet
5587217382 



In [18]:
inspectAndCorrectExpressionSetObject <- function ( es ) {

   message("\nBEFORE: dimension of expressionSet\n",
        paste(dim(exprs(es))), collapse=" ")
   message("\nBEFORE: dimension of expressionSet\n",
        paste(dim(es)), collapse=" ")
   sample_names=as.vector(as.character(colnames(exprs(es))))
   pheno_sample_names=as.vector(as.character(rownames(pData(es))))
   
   if (length(pheno_sample_names) > length(sample_names)) {
      superset <- pheno_sample_names
      subset   <- sample_names    
   }
   if (length(pheno_sample_names) < length(sample_names)) {
      superset <- sample_names
      subset   <- pheno_sample_names   
   } 
   non_overlaps <- setdiff( superset, subset)
   message("\nThe non-overlapping IDs between pheno and count data are:\n", 
        paste(length(non_overlaps)), collapse = " ")

   logical_match_names=superset %in% subset
   message("\nLogical diff pheno_sample_names, expression_sample_names\n",
        paste(table(logical_match_names)), collapse = " ")

   es        <- es       [logical_match_names==TRUE,]
   pData(es) <- pData(es)[logical_match_names==TRUE,]

   message("\nAFTER: dimension of expressionSet\n",
        paste(dim(exprs(es))), collapse=" ")
   message("\nAFTER: dimension of expressionSet\n",
        paste(dim(es)), collapse=" ")

   return(es)
}


### 1.3.2  Reduce Sample Set 
Read in all requirements so that the stage is properly set -- tissues.tsv contains the subset of files desired for analysis.
It is found in the `assets` subdirectory

In [9]:
reduceSampleSet <- function (tissue_reduction, es) {

   message("\nsize tissue_reduction\n",
        paste(dim(tissue_reduction), collapse=" "))
   message("\nsize es\n",
        paste(dim(es)), collapse=" ")
   message("\nsize pData(es)\n",
        paste(dim(pData(es)), collapse=" "))
   # only include those tissues we wish to continue with
   message("\n number of tissue types to keep\n",
        paste(table(tissue_reduction$include)), collapse = " ")
	
   tissue_reduction <- tissue_reduction[tissue_reduction$include==1,]

   # test to make sure we don't have nonsense
   keep <- (pData(es)$SMTSD== "breast_mammary_tissue")
   message("\nTEST: how many to keep of es in breast_mammary_tissue\n",
        paste(table(keep), collapse = " "))
   tes        = es      [,keep]
   pData(tes) = pData(es[,keep])
   message("\nTEST: size breast_mammary_tissue es:tes\n",
        paste(dim(tes), collapse=" "))
   message("\nTEST: size phenotype esect pData(tes)\n",
        paste(dim(pData(tes)), collapse=" "))
   pData(tes)[1,]
   rm(keep)
   # end test

   # create a matching tissue name to go with the expressionSet phenotype esect
   pData(es)$SMTSD       <- factor(snakecase::to_snake_case(as.character(pData(es)$SMTSD)))
   tissue_reduction$SMTSD <- factor(snakecase::to_snake_case(as.character(tissue_reduction$SMTSD)))

   message("\nlength tissues in phenotype data\n",
        paste(length(levels(pData(es)$SMTSD)), collapse = " "))
   message("\nlength tissues in tissue_reduction data\n",
        paste(length(tissue_reduction$SMTSD), collapse = " "))

   keep <- pData(es)$SMTSD %in% tissue_reduction$SMTSD
   message("\nlength tissue in samples phenotype data\n",
        paste(length(pData(es)$SMTSD), collapse = " "))
   message("\nlength keep es \n",
        paste(length(keep), collapse = " "))
   message("\nhow many to keep in phenotype data\n",
        paste(table(keep), collapse = " "))

   es        <- es       [          ,keep==TRUE]
   pData(es) <- pData(es)[keep==TRUE,          ]
   rm(keep) 
   message("\nsize reduced es\n",
        paste(dim(es)), collapse=" ")
   message("\nsize pData(es)\n",
        paste(dim(pData(es)), collapse=" "))
   message("\nlength tissues in phenotype data\n",
        paste(length(levels(pData(es)$SMTSD)), collapse = " "))

   # test to make sure we don't have nonsense
   keep = pData(es)$SMTSD== "breast_mammary_tissue"
   message("\nTEST: how many to keep in to have only breast_mammary_tissue\n",
        paste(table(keep), collapse = " "))
   tes        = es       [          ,keep==TRUE]
   pData(tes) = pData(es)[keep==TRUE,          ]
   message("\nTEST: size breast_mammary_tissue tes\n",
        paste(dim(tes), collapse=" "))
   message("\nTEST: size phenotype object pData(tes)\n",
        paste(dim(pData(tes)), collapse=" "))
   pData(tes)[1,]
   rm(keep)
   # end test
   return (es)
}

### 1.3.3 Eliminate ChrY fromGTF

We are studying the sex-biased differences, to do this we need to eliminate chromosome Y, this is not shared between the sexes

In [10]:
eliminateChrYfromGTF <- function ( fromGTF ) {

   fromGTF.keepAllButChrY <- (fromGTF$chr != "Y")
   fromGTF           <- fromGTF[fromGTF.keepAllButChrY,]
   rownames(fromGTF) <- fromGTF$ID
   return(fromGTF)
}

### 1.3.4 Eliminate ChrY from expressionSet Object

We are studying the sex-biased differences, to do this we need to eliminate chromosome Y, this is not shared between the sexes
This time from the expressionSet Object

In [11]:
eliminateChrYfromExpressionSet <- function ( fromGTF, es ) {

   fromGTF.keepAllButChrY <- (fromGTF$chr != "Y")
   fromGTF           <- fromGTF[fromGTF.keepAllButChrY,]
   rownames(fromGTF) <- fromGTF$ID

   es_ids <- rownames(es)
   gtf_ids <- fromGTF$ID
   keep <- es_ids %in% gtf_ids

   es <- es[keep==TRUE,]
   return(fromGTF)
}

## Exploratory and Differential analysis as_event:ijc, sjc 

Differential Analysis (DE) was performed using voom (Law et.al., 2014) to transform junction counts (reads that were aligned to junctions when an exon is included - ijc, and reads that were aligned to junctions when the exon is excluded - sjc) with associated precision weights, followed by linear modeling and empirical Bayes procedure using limma.    In each tissue, the following linear regression model was used to detec secually dimorphic alternative splicing event expression: 

           y = B0 + B1 sex + epsilon (error)
           

where y is the included exon junction count expression; sex denotes the reported sex of the subject

## Differential analysis as_event (combined ijc and sjc)

Differential Analysis (DE) was performed using voom (Law et.al., 2014) to transform junction counts (reads that were aligned to junctions when an exon is included - ijc, and reads that were aligned to junctions when the exon is excluded - sjc) with associated precision weights, followed by linear modeling and empirical Bayes procedure using limma.    In each tissue, the following linear regression model was used to detec secually dimorphic alternative splicing event expression: 

           y = B0 + B1 sex + B2 as_event + B3 sex*as_event + epsilon (error)
           

where y is the alternative splicing event expression; sex denotes the reported sex of the subject, as_event represents the specific alternative splicing event - either included exon junction counts or skipped exon junction counts and their interaction terms.   Donor is added to our model as a blocking variable used in both the calculation of duplicate correlation as well as in the linear fit.

### Voom, limma's lmFit and eBayes

Using sample as a blocking variable, we are able to model the effects of the donor on the results, which improves the power.  This topic is discussed in biostars https://www.biostars.org/p/54565/.  And Gordon Smyth answers the question here https://mailman.stat.ethz.ch/pipermail/bioconductor/2014-February/057887.html.  The method of modeling is a random effects approach in which the intra-donor correlation is incorporated into the covariance matrix instead of the linear predictor.   And though as Gordon Smyth states both are good method and the twoway anova approach makes fewer assumptions, the random effects approach is statistically more powerful.  

We have a balanced design in which all donors receive all stimuli (which is really in healthy human donors, life and all of its factors!) Our measurement has so many points -- we are measuring in the skipped exon approach, 42,611 junctions!   It is not possible to encorporate those measurements into the linear predictor.  A two-way ANOVA approach is virtually as powerful as the random effects approach 
and hence is preferable as it makes fewer assumptions.

For an unbalanced design in which each donor receives only a subset of the stimula, the random effects approach is more powerful.

Random effects approach is equivalent to The first method is twoway anova, a generalization of a paired analysis.


In [12]:
print_exploratory_plots <- function (plot, dup, tissue_of_interest, splice_type, fromGTF, tissue_list, ijc, sjc, obj, metadata ) {

    fromGTF           <- fromGTF
    tissue_true       <- tissue_list == tissue_of_interest

    # test to make sure we don't have nonsense
    # like proteins - this process moves in one direction
    # tissue will get us the phenotype selection 
    # from the phenotype SAMPID we get the metadata SAMPID
    # which leads us to counts - it goes in this direction.
    # first phenotype obj subsetting for the tissue_of_interest

    message("\nLimiting phenotype data to tissue of interest\n",
           paste(tissue_of_interest),collapse=" ")    
    keep = pData(obj)$SMTSD== tissue_of_interest
    message("\nkeep\n",
           paste(table(keep)),collapse=" ")    
    tissue_obj<- obj[,keep]
    message("\ntissue_obj now reduced to the tissue of interest\n",
            paste(dim(pData(tissue_obj)), collapse=" "))
    message("\nsample value\n",
            paste(   pData(tissue_obj)[1,], collapse=" "))

    # now we get the runs via the metadata matrix via SAMPID
    # limit the phenotype info to just the samples we care about
    metadata_samples          <- as.character(metadata$SAMPID)
    tissue_phenotype_samples  <- as.character(pData(tissue_obj)$SAMPID)

    rm(keep)
    keep = tissue_phenotype_samples %in% metadata_samples
    message("\nLimit the phenotypes to those we have samples for\n", 
        paste(table(keep), collapse = " ") )
    message("\ntissue_count_obj now has these count specific phenotypes")    
    tissue_count_obj        <- tissue_obj[keep,]
    pData(tissue_count_obj) <- pData(tissue_obj)[keep,]
    message("\nDimensions of tissue_count_obj\n", 
        paste(dim(pData(tissue_count_obj)), collapse = " ") )
    message("\nDimensions of tissue_count_obj \n", 
        paste( dim(tissue_count_obj), collapse = " ") )

    # and vice-versa, limit our samples to the ones we have phenotype for
    rm(keep)
    metadata_samples       <- as.character(metadata$SAMPID)
    phenotype_samples      <- as.character(pData(tissue_count_obj)$SAMPID)
    keep = metadata_samples %in% phenotype_samples
    message("\nLimit the counts now has these count specific phenotypes \n", 
        paste( table(keep), collapse = " ") )
    table(keep)

    ijc_tissue        <- ijc     [          ,keep==TRUE]
    sjc_tissue        <- sjc     [          ,keep==TRUE]
    metadata_tissue   <- metadata[keep==TRUE,          ]

    message("\ndimensions of the ijc_tissue \n", 
        paste(  dim(ijc_tissue), collapse = " ") )
    message("\ndimensions of the ijc_tissue \n", 
        paste(  dim(sjc_tissue), collapse = " ") )
    message("\ndimensions of the metadata_tissue \n", 
        paste(  dim(metadata_tissue), collapse = " ") )
   
    metadata_samples       <- as.character(metadata_tissue$SAMPID)
    phenotype_samples      <- as.character(pData(tissue_count_obj)$SAMPID)

    length(metadata_samples)
    length(phenotype_samples)
    non_overlaps <- setdiff( metadata_samples, phenotype_samples)
    message("\nThe non-overlapping IDs between pheno and count data are:\n", 
        paste(length(non_overlaps), collapse = " ") )

    ijc.dm            <- data.matrix(ijc_tissue)
    sjc.dm            <- data.matrix(sjc_tissue)    

    ## remove features (junctions) that have zero sums
    ijc.rs <- rowSums(ijc.dm) 
    keep.ijc.rs <- (ijc.rs > 10)
    sjc.rs <- rowSums(sjc.dm) 
    keep.sjc.rs <- (sjc.rs > 10)
    keep <- keep.ijc.rs | keep.sjc.rs
    message("\nijc.rs > 10 \n", 
        paste(table(keep.ijc.rs), collapse = " ") )                    
    message("\nsjc.rs > 10 \n", 
        paste(table(keep.sjc.rs), collapse = " ") )
    message("\nkeep combined ijc | sjc \n", 
        paste(table(keep), collapse = " ") )                    
    
    # ensure the matrices do not have too many zeros
    ijc.dm <- ijc.dm[keep,]
    sjc.dm <- sjc.dm[keep,]
    message("\ndim(ijc.dm)\n", 
        paste(dim(ijc.dm), collapse = " ") )                    
    message("\ndim(sjc.dm)\n", 
        paste(dim(sjc.dm), collapse = " ") )
    message("\n")
    
        
    
    
    for (i in (1:length(metadata_samples))) {
        sample = metadata_tissue[i,]$SAMPID
        sample_sex = pData(tissue_count_obj)$SEX[pData(tissue_count_obj)$SAMPID == sample]
        if (i==1) {
            sex = sample_sex
        } else {
            sex = c(sex,sample_sex)
        }
    }
    message("\nsex samples:\n",
        paste0(table(sex), collapse="\n"))
    sex      <- ifelse(sex == 1,"male","female")
    sex      <- factor(sex,levels=c("male","female"))
    message("\nsex samples:\n",
        paste0(table(sex), collapse=" "))

    design    <- model.matrix ( ~ sex)
    message("\ndesign matrix ijc, alone:\n",
        paste0(head(design), collapse="\n"))

    colnames(design) <- c("intercept","sex")

    y_ijc <- DGEList(counts=ijc.dm, group = sex)
    y_ijc <- calcNormFactors(y_ijc, method="RLE")
    y_ijc_voom <- voom (y_ijc, design=design, plot=plot)

    Gender <- substring(sex,1,1)
    pdf_sub_directory = '../pdf/'
    csv_sub_directory = '../data/'
    
    filename <- paste0(paste0(paste0(pdf_sub_directory, splice_type),
                                     snakecase::to_snake_case(tissue_of_interest)),"-ijc-MDSplot-100.pdf")
    pdf (filename)
        plotMDS(y_ijc, labels=Gender, top=100, col=ifelse(Gender=="m","blue","red"), 
                gene.selection="common")
    dev.off()
    filename <- paste0(paste0(paste0(pdf_sub_directory, splice_type),
                                     snakecase::to_snake_case(tissue_of_interest)),"-ijc-voom-MDSplot-100.pdf")
    pdf (filename)    
        plotMDS(y_ijc_voom, labels=Gender, top=100, col=ifelse(Gender=="m","blue","red"), 
                gene.selection="common")
    dev.off()
 
    fit_ijc <- lmFit(y_ijc_voom, design)
    fit_ijc <- eBayes(fit_ijc)

    ijc_sex_results                    <- topTable(fit_ijc, coef='sex', number=nrow(y_ijc_voom))
    ijc_sex_results_refined            <- ijc_sex_results$adj.P.Val <= 0.05 & abs(ijc_sex_results$logFC) >= abs(log2(1.5))
    ijc_sex_rnResults                  <- rownames(ijc_sex_results)
    ijc_sex_resultsAnnotations         <- fromGTF[ijc_sex_rnResults,]

    ijc_sex_results_refinedAnnotations <- ijc_sex_resultsAnnotations[ijc_sex_results_refined      ==TRUE,]
    message("\ndimensions of the ijc_sex_results_refined_annotations \n", 
        paste(dim (ijc_sex_results_refinedAnnotations), collapse = " ") )

    # geneSymbols are in the annotations 
    ijc_sex_geneSymbols               <- ijc_sex_resultsAnnotations$geneSymbol
    ijc_sex_refined_geneSymbols       <- ijc_sex_results_refinedAnnotations$geneSymbol
    message("\nlength ijc_sex_results_refined_geneSymbols\n", 
        paste(length(ijc_sex_refined_geneSymbols), collapse = " ") )

    # adjust the rownames to be the geneSymbols rather than junction IDs
    ijc_sex_results_rn         <- paste(ijc_sex_geneSymbols,       ijc_sex_rnResults, sep="-")
    rownames(ijc_sex_results)  <- ijc_sex_results_rn    

    ijc_sex_filename               = paste0(paste0(paste0(csv_sub_directory, splice_type),
                                                   snakecase::to_snake_case(tissue_of_interest)),'_DGE_ijc_sex.csv',sep='')
    ijc_sex_refined_filename       = paste0(paste0(paste0(csv_sub_directory, splice_type),
                                                   snakecase::to_snake_case(tissue_of_interest)),'_DGE_ijc_sex_refined.csv',sep='')
    ijc_sex_genesFilename          = paste0(paste0(paste0(csv_sub_directory, splice_type),
                                                   snakecase::to_snake_case(tissue_of_interest)),'_ijc_sex_universe.txt',sep='')
    ijc_sex_refined_genesFilename  = paste0(paste0(paste0(csv_sub_directory, splice_type),
                                                   snakecase::to_snake_case(tissue_of_interest)),'_ijc_sex_gene_set.txt',sep='')
    write.table(ijc_sex_results, 
                file = ijc_sex_filename        , row.names = T, col.names = T, quote = F, sep = ",")
    write.table(ijc_sex_results [ijc_sex_results_refined      ,], 
                file = ijc_sex_refined_filename, row.names = T, col.names = T, quote = F, sep = ",")
    write.table(ijc_sex_geneSymbols,        
                file = ijc_sex_genesFilename        , row.names = F, col.names = F, quote = F, sep = ",")
    write.table(ijc_sex_refined_geneSymbols,
                file = ijc_sex_refined_genesFilename, row.names = F, col.names = F, quote = F, sep = ",")
   
    message("\nstarting sjc\n")

    y_sjc <- DGEList(counts=sjc.dm, group = sex)
    y_sjc <- calcNormFactors(y_sjc, method="RLE")
    y_sjc_voom <- voom (y_sjc, design=design, plot=plot)

    Gender <- substring(sex,1,1)
    filename <- paste0(paste0(paste0(pdf_sub_directory, splice_type),
                                     snakecase::to_snake_case(tissue_of_interest)),"-sjc-MDSplot-100.pdf")
    pdf (filename)
        plotMDS(y_sjc, labels=Gender, top=100, col=ifelse(Gender=="m","blue","red"), 
                gene.selection="common")
    dev.off()
    filename <- paste0(paste0(paste0(pdf_sub_directory, splice_type),
                                     snakecase::to_snake_case(tissue_of_interest)),"-sjc-voom-MDSplot-100.pdf")
    pdf (filename)    
        plotMDS(y_sjc_voom, labels=Gender, top=100, col=ifelse(Gender=="m","blue","red"), 
                gene.selection="common")
    dev.off()
 
    fit_sjc <- lmFit(y_sjc_voom, design)
    fit_sjc <- eBayes(fit_sjc)

    sjc_sex_results                    <- topTable(fit_sjc, coef='sex', number=nrow(y_sjc_voom))
    sjc_sex_results_refined            <- sjc_sex_results$adj.P.Val <= 0.05 & abs(sjc_sex_results$logFC) >= abs(log2(1.5))
    sjc_sex_rnResults                  <- rownames(sjc_sex_results)
    sjc_sex_resultsAnnotations         <- fromGTF[sjc_sex_rnResults,]

    sjc_sex_results_refinedAnnotations <- sjc_sex_resultsAnnotations[sjc_sex_results_refined      ==TRUE,]
    message("\ndimensions of the sjc_sex_results_refined_annotations \n", 
        paste(dim (sjc_sex_results_refinedAnnotations), collapse = " ") )

    # geneSymbols are in the annotations 
    sjc_sex_geneSymbols               <- sjc_sex_resultsAnnotations$geneSymbol
    sjc_sex_refined_geneSymbols       <- sjc_sex_results_refinedAnnotations$geneSymbol
    message("\nlength sjc_sex_results_refined_geneSymbols\n", 
        paste(length(sjc_sex_refined_geneSymbols), collapse = " ") )

    # adjust the rownames to be the geneSymbols rather than junction IDs
    sjc_sex_results_rn         <- paste(sjc_sex_geneSymbols, sjc_sex_rnResults, sep="-")
    rownames(sjc_sex_results)  <- sjc_sex_results_rn    

    sjc_sex_filename               = paste0(paste0(paste0(csv_sub_directory, splice_type),
                                                   snakecase::to_snake_case(tissue_of_interest)),'_DGE_sjc_sex.csv',sep='')
    sjc_sex_refined_filename       = paste0(paste0(paste0(csv_sub_directory, splice_type),
                                                   snakecase::to_snake_case(tissue_of_interest)),'_DGE_sjc_sex_refined.csv',sep='')
    sjc_sex_genesFilename          = paste0(paste0(paste0(csv_sub_directory, splice_type),
                                                   snakecase::to_snake_case(tissue_of_interest)),'_sjc_sex_universe.txt',sep='')
    sjc_sex_refined_genesFilename  = paste0(paste0(paste0(csv_sub_directory, splice_type),
                                                   snakecase::to_snake_case(tissue_of_interest)),'_sjc_sex_gene_set.txt',sep='')
    write.table(sjc_sex_results, 
                file = sjc_sex_filename        , row.names = T, col.names = T, quote = F, sep = ",")
    write.table(sjc_sex_results [sjc_sex_results_refined      ,], 
                file = sjc_sex_refined_filename, row.names = T, col.names = T, quote = F, sep = ",")
    write.table(sjc_sex_geneSymbols,        
                file = sjc_sex_genesFilename        , row.names = F, col.names = F, quote = F, sep = ",")
    write.table(sjc_sex_refined_geneSymbols,
                file = sjc_sex_refined_genesFilename, row.names = F, col.names = F, quote = F, sep = ",")
    
   
    message("starting the y prediction\n")
    sample_names <- as.character(colnames(ijc.dm))
    # we will add donor as a blocking parameter
    # rather than sample name -- we should use donor for real
    sample     <- factor(sample_names)
    
    donor    <- rep(sample, 2)
    message("\ndonor size", 
        paste(length(donor), collapse = " ") )
    
    ijc_names <- as.character(colnames(ijc.dm))
    sjc_names <- as.character(colnames(sjc.dm))
    sjc_names <- paste0(sjc_names,"-sjc")
    ijc_names <- paste0(ijc_names,"-ijc")

    colnames(ijc.dm) <- ijc_names
    colnames(sjc.dm) <- sjc_names

    as_matrix <- cbind(ijc.dm,sjc.dm)
    message("\ndim as_matrix", 
        paste(dim(as_matrix), collapse = " ")) 
            
    for (i in (1:length(metadata_samples))) {
        sample = metadata_tissue[i,]$SAMPID
        sample_sex = pData(tissue_count_obj)$SEX[pData(tissue_count_obj)$SAMPID == sample]
        if (i==1) {
            sex = sample_sex
        } else {
            sex = c(sex,sample_sex)
        }
    }
    message("sex samples:\n",
        paste0(table(sex), collapse="\n"))
    sex      <- ifelse(sex == 1,"male","female")
    sex      <- factor(sex,levels=c("male","female"))
    message("\nsex samples:\n",
        paste0(table(sex), collapse=" "))
    sex2      <- c(rep(sex,2))
    table(sex2)
    message("\nlength sex2\n", 
        paste(length(sex2), collapse = " ") )
    message("table sex2\n", 
        paste(table(sex2), collapse = "\n") )
    as_event  <- c(rep("ijc",dim(ijc.dm)[2]), rep("sjc", dim(sjc.dm)[2]))
    as_event  <- factor(as_event, levels=c("ijc", "sjc"))
    message("\nlength as_event\n", 
        paste(length(as_event), collapse = " ") )

    design    <- model.matrix( ~ sex2 + as_event + sex2*as_event)
    dim(design)
    colnames(design) <- c("intercept","sex", "as_event","sex*as_event")
    message("\ndim design <- model.matrix( ~sex + as_event + sex*as_event)\n", 
        paste(head(design), collapse = "\n") )

    y <- DGEList(counts=as_matrix, group = sex2)
    y <- calcNormFactors(y, method="RLE")
    y_voom <- voom (y, design=design, plot = plot)

    if (dup==TRUE) {
        dup_cor <- duplicateCorrelation(y_voom$E, design=design, ndups=2, block=donor, weights=y$samples$norm.factors)
        dup_cor$consensus.correlation 
        y_dup_voom <- voom (y, design=design, plot = plot, block = donor, correlation = dup_cor$consensus.correlation) 
    }
    
    Gender <- substring(sex[1:dim(ijc.dm)[2]],1,1)
    message("\nGenders new size\n", 
        paste(length(Gender), collapse = " ") )
    message("\nplotting y for ijc portion of design <- model.matrix( ~sex + as_event + sex*as_event\n")
     # print the combined exploratory plot
    filename <- paste0(paste0(paste0(pdf_sub_directory, splice_type),
                              snakecase::to_snake_case(tissue_of_interest)),"-y-ijc-MDSplot-100.pdf")
    pdf (filename)
        plotMDS(y[,c(1:dim(ijc.dm)[2])], labels=Gender, top=100, col=ifelse(Gender=="m","blue","red"), 
            gene.selection="common")
    dev.off()
    message("\nplotting y_voom for ijc portion of design <- model.matrix( ~sex + as_event + sex*as_event\n")
    filename <- paste0(paste0(paste0(pdf_sub_directory, splice_type),
                              snakecase::to_snake_case(tissue_of_interest)),"-y-voom-ijc-MDSplot-100.pdf")
    pdf (filename)
        plotMDS(y_voom[,c(1:dim(ijc.dm)[2])], labels=Gender, top=100, col=ifelse(Gender=="m","blue","red"), 
            gene.selection="common")
    dev.off()
    if (dup == TRUE) {
        filename <- paste0(paste0(paste0(pdf_sub_directory, splice_type),
                           snakecase::to_snake_case(tissue_of_interest)),"-y-dup-voom-ijc-MDSplot-100.pdf")
        pdf (filename)
            plotMDS(y_dup_voom[,c(1:dim(ijc.dm)[2])], labels=Gender, top=100, col=ifelse(Gender=="m","blue","red"), 
                gene.selection="common")
        dev.off()
    }
    message("\nplotting y for sjc portion of design <- model.matrix( ~sex + as_event + sex*as_event\n")
    filename <- paste0(paste0(paste0(pdf_sub_directory, splice_type),
                              snakecase::to_snake_case(tissue_of_interest)),"-y-sjc-MDSplot-100.pdf")
    pdf (filename)
        plotMDS(y[,c((dim(ijc.dm)[2]+1)):(dim(ijc.dm)[2]+dim(sjc.dm)[2])], labels=Gender, top=100, col=ifelse(Gender=="m","blue","red"), 
            gene.selection="common")
    dev.off()
    
    if (dup == TRUE) {
        message("\nplotting y_voom for sjc portion of design <- model.matrix( ~sex + as_event + sex*as_event\n")    
        filename <- paste0(paste0(paste0(pdf_sub_directory, splice_type),
                              snakecase::to_snake_case(tissue_of_interest)),"-y-voom-sjc-MDSplot-100.pdf")
        pdf (filename)
            plotMDS(y_voom[,c((dim(ijc.dm)[2]+1)):(dim(ijc.dm)[2]+dim(sjc.dm)[2])], labels=Gender, top=100, col=ifelse(Gender=="m","blue","red"), 
                gene.selection="common")
        dev.off()
        filename <- paste0(paste0(paste0(pdf_sub_directory, splice_type),
                              snakecase::to_snake_case(tissue_of_interest)),"-y-dup-voom-sjc-MDSplot-100.pdf")
        pdf (filename)
            plotMDS(y_dup_voom[,c((dim(ijc.dm)[2]+1)):(dim(ijc.dm)[2]+dim(sjc.dm)[2])], labels=Gender, top=100, col=ifelse(Gender=="m","blue","red"), 
                gene.selection="common")
        dev.off()
        
        fit <- lmFit(y_dup_voom, design=design, block=donor, correlation = dup_cor$consensus.correlation)
    } else {
        fit <- lmFit(y_voom, design=design)
    }
        
    fit <- eBayes(fit, robust=TRUE)
    
    sex_as_events_results         <- topTable(fit, coef="sex*as_event", number=nrow(y_voom))
    sex_as_events_results_refined <- sex_as_events_results$adj.P.Val <= 0.05 & abs(sex_as_events_results$logFC) >= abs(log2(1.5))

    sex_results                   <- topTable(fit, coef="sex", number=nrow(y_voom))
    sex_results_refined           <- sex_results$adj.P.Val <= 0.05 & abs(sex_results$logFC) >= abs(log2(1.5))

    sex_as_events_rnResults <- rownames(sex_as_events_results)
    sex_rnResults           <- rownames(sex_results)
    head(sex_as_events_rnResults)
    head(ijc_sex_rnResults)
    head(sex_rnResults)
    head(fromGTF[sex_as_events_rnResults,])

    # use the junctionIDs to get the annotations
    sex_as_events_resultsAnnotations      <- fromGTF[sex_as_events_rnResults,]
    sex_resultsAnnotations                <- fromGTF[sex_rnResults,]
    ijc_sex_resultsAnnotations            <- fromGTF[ijc_sex_rnResults,]
    head(sex_as_events_resultsAnnotations)
    head(sex_resultsAnnotations)
    head(ijc_sex_resultsAnnotations)
    
    sex_as_events_results_refinedAnnotations<- sex_as_events_resultsAnnotations[sex_as_events_results_refined==TRUE,]
    sex_results_refinedAnnotations          <- sex_resultsAnnotations          [sex_results_refined          ==TRUE,]
    sjc_sex_results_refinedAnnotations       <- sjc_sex_resultsAnnotations      [sjc_sex_results_refined      ==TRUE,]
    head(sex_as_events_results_refinedAnnotations)
    head(sex_results_refinedAnnotations)
    head(sjc_sex_results_refinedAnnotations)

    # geneSymbols are in the annotations 
    sex_as_events_geneSymbols         <- sex_as_events_resultsAnnotations$geneSymbol
    sex_as_events_refined_geneSymbols <- sex_as_events_results_refinedAnnotations$geneSymbol
    sex_geneSymbols                   <- sex_resultsAnnotations$geneSymbol
    sex_refined_geneSymbols           <- sex_results_refinedAnnotations$geneSymbol
    ijc_sex_geneSymbols               <- ijc_sex_resultsAnnotations$geneSymbol
    ijc_sex_refined_geneSymbols       <- ijc_sex_results_refinedAnnotations$geneSymbol
    sjc_sex_geneSymbols               <- sjc_sex_resultsAnnotations$geneSymbol
    sjc_sex_refined_geneSymbols       <- sjc_sex_results_refinedAnnotations$geneSymbol


    # adjust the rownames to be the geneSymbols rather than junction IDs
    sex_as_events_results_rn   <- paste(sex_as_events_geneSymbols, sex_as_events_rnResults, sep="-")
    sex_results_rn             <- paste(sex_geneSymbols,           sex_rnResults, sep="-")
    ijc_sex_results_rn         <- paste(ijc_sex_geneSymbols,       ijc_sex_rnResults, sep="-")
    sjc_sex_results_rn         <- paste(sjc_sex_geneSymbols,       sjc_sex_rnResults, sep="-")
    message("\n sex_as_events\n", 
        paste(head(sex_as_events_results_rn), collapse = " ") )
    message("\n ijc_sex_results\n", 
        paste(head(ijc_sex_results_rn), collapse = " ") )
    message("\n sjc_sex_results\n", 
        paste(head(sjc_sex_results_rn), collapse = " ") )
    rownames(sex_as_events_results) <- sex_as_events_results_rn
    rownames(sex_results)           <- sex_results_rn
    rownames(ijc_sex_results)       <- ijc_sex_results_rn
    rownames(sjc_sex_results)       <- sjc_sex_results_rn

    sex_as_events_filename         = paste0(paste0(paste0(csv_sub_directory, splice_type),
                                                   snakecase::to_snake_case(tissue_of_interest)),'_DGE_sex_as_events.csv')
    sex_as_events_refined_filename = paste0(paste0(paste0(csv_sub_directory, splice_type),
                                                   snakecase::to_snake_case(tissue_of_interest)),'_DGE_sex_as_events_refined.csv',sep='')
    sex_filename                   = paste0(paste0(paste0(csv_sub_directory, splice_type),
                                                   snakecase::to_snake_case(tissue_of_interest)),'_DGE_sex.csv',sep='')
    sex_refined_filename           = paste0(paste0(paste0(csv_sub_directory, splice_type),
                                                   snakecase::to_snake_case(tissue_of_interest)),'_DGE_sex_refined.csv',sep='')
    sex_as_events_genesFilename    = paste0(paste0(paste0(csv_sub_directory, splice_type),
                                                   snakecase::to_snake_case(tissue_of_interest)),'_sex_as_events_universe.txt',sep='')
    sex_as_events_refined_genesFilename = paste0(paste0(paste0(csv_sub_directory, splice_type),
                                                   snakecase::to_snake_case(tissue_of_interest)),'_sex_as_events_gene_set.txt',sep='')
    sex_genesFilename              = paste0(paste0(paste0(csv_sub_directory, splice_type),
                                                   snakecase::to_snake_case(tissue_of_interest)),'_sex_universe.txt',sep='')
    sex_refined_genesFilename      = paste0(paste0(paste0(csv_sub_directory, splice_type),
                                                   snakecase::to_snake_case(tissue_of_interest)),'_sex_gene_set.txt',sep='')

    write.table(sex_as_events_results, file = sex_as_events_filename, 
                row.names = T, col.names = T, quote = F, sep = ",")
    write.table(sex_as_events_results[sex_as_events_results_refined,], 
                file = sex_as_events_refined_filename, row.names = T, col.names = T, quote = F, sep = ",")
    write.table(sex_results,           file = sex_filename          , 
                row.names = T, col.names = T, quote = F, sep = ",")
    write.table(sex_results [sex_results_refined          ,], file = sex_refined_filename, 
                row.names = T, col.names = T, quote = F, sep = ",")
    write.table(sex_as_events_geneSymbols, file = sex_as_events_genesFilename, 
                row.names = F, col.names = F, quote = F, sep = ",")
    write.table(sex_as_events_refined_geneSymbols,file = sex_as_events_refined_genesFilename, 
                row.names = F, col.names = F, quote = F, sep = ",")
    write.table(sex_geneSymbols,           file = sex_genesFilename          , 
                row.names = F, col.names = F, quote = F, sep = ",")
    write.table(sex_refined_geneSymbols,          file = sex_refined_genesFilename          , 
                row.names = F, col.names = F, quote = F, sep = ",")

    return(0)
}

## Execution of All Tissues and All Splicing Variants

Additional values set to enable this notebook to be executed as a nextflow workflow or to run in place with appropriate settings.

### parameters Setting

1. Setting `dup=TRUE` causes lengthy execution times.

2. Setting `plot=TRUE` can overwhelm the saving capacity within a jupyter-lab notebook - 
   this sets to print all the voom plots.
   
3. Adjusting `splice_type` will allow you to play with a variety of results

   a. all splice types desired to be run:
    
    `splice_list       = c("a3ss_","a5ss_","mxe_","ri_","se_")`
    
   b. a subset (leaving out say `splice_type = "se_"` since it is the largest, for example)
    
    `splice_list       = c("a3ss_","a5ss_","mxe_","ri_")`
    


In [13]:
#
# Main routine
#
# 1.2.1 get the released rmats matrices (rows junction ids, columns accessions (SRR)
# 
getReleasedRMATSData (destDir <- "../data/")

In [14]:
#
# 1.2.2 get the rmats 3.2.5 discovered/annoated junction information in GTF format
#
getReleasedGTFAnnotations (destDir <- "../data")

In [15]:
#
# 1.2.3 get SRR Accession Metadata (available through dbGaP)
#
srr_metadata <- getSraRunData (destDir <- "../data/")

Loading metadata from SraRunTable.txt.gz ../data/gtex.rds ..


done!




In [23]:
#
# 1.2.4 get GTEx expression Set Data
#
rm(obj)
obj <- getGTExExpressionSet (destDir <- "../data/")

Loading obj GTEx v8 rds object with readRDS from ../data/gtex.rds ..


Done!


Generating sha256sum for gtex.rds ..


c3c81a2b5b1f17811d2ab828edf1d4c65e8e4a6632964db73555c4b5737fadf0  ../data/gtex.rds

Done!




In [24]:
#
# 1.2.5 get reduced Tissue data
#
tissue_reduction <- getTissueReduction ( "../assets/tissues.tsv" )

In [19]:
#
# 1.3.1 inspect and correct the expressionSet object
#
obj <- inspectAndCorrectExpressionSetObject ( obj )


BEFORE: dimension of expressionSet
5587817382 


BEFORE: dimension of expressionSet
5587817382 


The non-overlapping IDs between pheno and count data are:
2 


Logical diff pheno_sample_names, expression_sample_names
217382 


AFTER: dimension of expressionSet
5587217382 


AFTER: dimension of expressionSet
5587217382 



In [22]:
sum(is.na(pData(obj)))

In [None]:
  #
# 1.3.2 Reduce Sample Set
#
obj <- reduceSampleSet ( tissue_reduction, obj )

In [None]:
#
# Replace `.` with `_` for ease and display. 
# snakecase to ensure comparisons are like to like
#
srr_metadata$SAMPID <- gsub('-','\\.',srr_metadata$'Sample Name')
tissue_list         <- factor(snakecase::to_snake_case(as.character(tissue_reduction$SMTSD)))

In [None]:
#
# Useful for debugging to play with these parameters
# and to reduce to a subset of entire splice_list
# splice_list <- c("a3ss_","a5ss_","mxe_","ri_","se_")
#
plot         <- TRUE
dup          <- FALSE
splice_list  <- c( "se_")


In [None]:
#
# parameters that change with each splice type (3)
# 1. fromGTF
# 2. ijc
# 3. sjc

# Could run this as a loop - or rather, using a package [package name]
# run this notebook as a nextflow workflow
# Requirements are that all required input are in a bucket data.tar.gz
# and assets 
# for (tissue_index in 1:length(tissue_list)) {

In [None]:
    # a3ss
    splice_type = "a3ss_"
    res = splice_list %in% splice_type
    tissue_of_interest  = as.vector(as.character(tissue_list[tissue_index]))
    if (sum(res) == 1) {
        fromGTF  <- read.table("../data/fromGTF.A3SS.txt", header=TRUE)
        fromGTF  <- eliminateChrYfromGTF (fromGTF)
        ijc      <- makeSplicingExpressionSetObject(obj, srr_metadata, "../data/rmats_final.a3ss.jc.ijc.txt.gz")
        ijc      <- eliminateChrYfromExpressionSet (fromGTF, obj)
        sjc      <- makeSplicingExpressionSetObject(obj, srr_metadata, "../data/rmats_final.a3ss.jc.sjc.txt.gz")
        sjc      <- eliminateChrYfromExpressionSet (fromGTF, obj)
        print_exploratory_plots (plot, 
                             dup,
                             tissue_of_interest, 
                             splice_type, 
                             fromGTF, 
                             tissue_list, 
                             ijc, 
                             sjc, 
                             obj, 
                             metadata )
    }

In [None]:
    # a5ss
    splice_type = "a5ss_"
    res = splice_list %in% splice_type
    if (sum(res) == 1) {
        message ("splice_list does contain\n",
             paste(splice_type), " continuing with processing\n")
        fromGTF  <- read.table("../data/fromGTF.A5SS.txt", header=TRUE)
        fromGTF  <- eliminateChrYfromGTF (fromGTF)
        ijc      <- makeSplicingExpressionSetObject(obj, srr_metadata, "../data/rmats_final.a5ss.jc.ijc.txt.gz")
        ijc      <- eliminateChrYfromExpressionSet (fromGTF, obj)
        sjc      <- makeSplicingExpressionSetObject(obj, srr_metadata, "../data/rmats_final.a5ss.jc.sjc.txt.gz")
        sjc      <- eliminateChrYfromExpressionSet (fromGTF, obj)
        print_exploratory_plots (plot, 
                             dup,
                             tissue_of_interest, 
                             splice_type, 
                             fromGTF, 
                             tissue_list, 
                             ijc, 
                             sjc, 
                             obj, 
                             metadata )
    }

In [None]:
    # mxe
    splice_type = "mxe_"
    res = splice_list %in% splice_type
    if (sum(res) == 1) {
        message ("splice_list does contain\n",
             paste(splice_type), " continuing with processing\n")
        fromGTF  <- read.table("../data/fromGTF.MXE.txt", header=TRUE)
        fromGTF  <- eliminateChrYfromGTF (fromGTF)
        ijc      <- makeSplicingExpressionSetObject(obj, srr_metadata, "../data/rmats_final.mxe.jc.ijc.txt.gz")
        ijc      <- eliminateChrYfromExpressionSet (fromGTF, obj)
        sjc      <- makeSplicingExpressionSetObject(obj, srr_metadata, "../data/rmats_final.mxe.jc.sjc.txt.gz")
        sjc      <- eliminateChrYfromExpressionSet (fromGTF, obj)
        print_exploratory_plots (plot, 
                             dup,
                             tissue_of_interest, 
                             splice_type, 
                             fromGTF, 
                             tissue_list, 
                             ijc, 
                             sjc, 
                             obj, 
                             metadata )
    }

In [None]:
    # ri
    splice_type = "ri_"
    res = splice_list %in% splice_type
    if (sum(res) == 1) {
        message ("splice_list does contain\n",
             paste(splice_type), " continuing with processing\n")
        fromGTF  <- read.table("../data/fromGTF.RI.txt", header=TRUE)
        fromGTF  <- eliminateChrYfromGTF (fromGTF)
        ijc      <- makeSplicingExpressionSetObject(obj, srr_metadata, "../data/rmats_final.ri.jc.ijc.txt.gz")
        ijc      <- eliminateChrYfromExpressionSet (fromGTF, obj)
        sjc      <- makeSplicingExpressionSetObject(obj, srr_metadata, "../data/rmats_final.ri.jc.sjc.txt.gz")
        sjc      <- eliminateChrYfromExpressionSet (fromGTF, obj)
        print_exploratory_plots (plot, 
                             dup,
                             tissue_of_interest, 
                             splice_type, 
                             fromGTF, 
                             tissue_list, 
                             ijc, 
                             sjc, 
                             obj, 
                             metadata )
    }

In [None]:
    # se
    splice_type = "se_"
    res = splice_list %in% splice_type
    if (sum(res) == 1) {
        message ("splice_list does contain\n",
             paste(splice_type), " continuing with processing\n")
        fromGTF  <- read.table("../data/fromGTF.SE.txt", header=TRUE)
        fromGTF  <- eliminateChrYfromGTF (fromGTF)
        ijc      <- makeSplicingExpressionSetObject(obj, srr_metadata, "../data/rmats_final.se.jc.ijc.txt.gz")
        ijc      <- eliminateChrYfromExpressionSet (fromGTF, obj)
        sjc      <- makeSplicingExpressionSetObject(obj, srr_metadata, "../data/rmats_final.se.jc.sjc.txt.gz")
        sjc      <- eliminateChrYfromExpressionSet (fromGTF, obj)
#        print_exploratory_plots (plot, 
#                              dup,
#                              tissue_of_interest, 
#                              splice_type, 
#                              fromGTF, 
#                              tissue_list, 
#                              ijc, 
#                              sjc, 
#                              obj, 
#                              metadata )
    }
#}

## Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **'artefacts'**, files generated during the analysis and stored in the folder directory **`data`**
2. List of environment metadata, dependencies, versions of libraries using `utils::sessionInfo()` and [`devtools::session_info()`](https://devtools.r-lib.org/reference/session_info.html)

### 1. Checksums with the sha256 algorithm

In [None]:
rm (notebookid)
notebookid   = "differentialSplicingJunctionExpressionAnalysis"
notebookid

message("Generating sha256 checksums of the artefacts in the `..data/` directory .. ")
system(paste0("cd ../data && find . -type f -exec sha256sum {} \\;  >  ../metadata/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

paste0("../metadata/", notebookid, "_sha256sums.txt")

data.table::fread(paste0("../metadata/", notebookid, "_sha256sums.txt"), header = FALSE, col.names = c("sha256sum", "file"))

### 2. Libraries metadata

In [None]:
dev_session_info   <- devtools::session_info()
utils_session_info <- utils::sessionInfo()

message("Saving `devtools::session_info()` objects in ../metadata/devtools_session_info.rds  ..")
saveRDS(dev_session_info, file = paste0("../metadata/", notebookid, "_devtools_session_info.rds"))
message("Done!\n")

message("Saving `utils::sessionInfo()` objects in ../metadata/utils_session_info.rds  ..")
saveRDS(utils_session_info, file = paste0("../metadata/", notebookid ,"_utils_info.rds"))
message("Done!\n")

dev_session_info$platform
dev_session_info$packages[dev_session_info$packages$attached==TRUE, ]