# AllTissueJunctionAnalysis as a Notebook 

rMATS 3.2.5 was run on controlled access RNASeq files retrieved experiments stored in the Sequence Read Archive with controlled access managed by dbGaP.   The data were generated under the Gene Tissue Expression.

## rMATS RNASeq-MATS.py produces 10 different output types which get assembled into as type junction ID by sample ID matrices

### Alternative Splice Site Types are: (se, a3ss, a5ss, mxe, ri)

 This is input as ARGV1 into variable 'astype'

  * Skipped Exon events (se),
  * Alternative 3' splice site (a3ss),
  * Alternative 5' splice site (a5ss),
  * Mutually exclusive exon (mxe),
  * and retention intron (ri)

### There are two different kinds of junction counts

  * jc = junction counts - reads that cross the junction
  * jcec = junction counts plus reads on the target (such as included exon

### And the count type -- there are 5 types

  * inclusion levels (percent spliced in)
  * included junction counts (ijc)
  * skipped junction counts (sjc)
  * inclusion length (inclen)
  * skipped length (skiplen)

### function: fit_iso_tissue 

fit_iso_tissue expects the following input:

  * the tissue of interest (SMSTD) 
  * an ordered_merged_rmats -- which will be ordered to fit the count matrix
  * count matrix (inc or ijc & sjc merged)
  * splice type (a3ss, a5ss, mxe, ri or se)
  * junction_count type (jc or jcec)
  * count type (inc or the merged ijc,sjc)
  
### reordering to match annotations between count matrix and annotation matrix

Common problem is to match specifically the rows of an annotation matrix with the columns of a count matrix
`match` is the function that gives the re-ordering index required to accomplish this


## **NOTE**:

We assume that you have cloned the analysis repository and have `cd` into the parent directory. Before starting with the analysis make sure you have first completed the dependencies set up by following the instructions described in the **`dependencies/README.md`** document. All paths defined in this Notebook are relative to the parent directory (repository). Please close this Notebook and start again by following the above guidelines if you have not completed the aforementioned steps.

## rMATS-final-merged
the rmats-nf NextFlow was executed and the results released here:

## Loading dependencies

In [1]:
# temporary hack remove me when the dependencies are fixed
#
#install.packages("BiocManager")
#Sys.setenv(TAR = "/bin/tar")
#BiocManager::install(c('limma','edgeR', 'statmod'))
#install.packages(c('doParallel', 'doRNG', 'foreach', 'stringi', 'pheatmap'), repo = 'https://cran.r-project.org')
#devtools::install_github("ropensci/piggyback@87f71e8", upgrade="never")
#install.packages("runjags", repos = "https://cran.r-project.org")
#devtools::install_github("easystats/report")


In [2]:
library(limma)
library(piggyback)
library(multtest)
library(Biobase)
library(edgeR)
library(tibble)
#install.packages('R.utils')
library(R.utils)

Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following object is masked from ‘package:limma’:

    plotMA

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which, which.max, which.min

Loading required package: Biobase
Welcome to Bio

## Modeling

This analysis uses edgeR.  From the documentation, it is important to note that normalization takes the form of correction factors that enter into the statistical model. Such correction factors are usually computed internally by edgeR functions, but it is also possible for a user to supply them. The correction factors may take the form of scaling factors for the library sizes, such as computed by calcNormFactors, which are then used to compute the effective library sizes. 

Alternatively, gene-specific correction factors can be entered into the glm functions of edgeR as offsets. In the latter case, the offset matrix will be assumed to account for all normalization issues, including sequencing depth and RNA composition.

Note that normalization in edgeR is model-based, and the original read counts are not themselves transformed. This means that users should not transform the read counts in any way before inputing them to edgeR. For example, users should not enter RPKM or FPKM val- ues to edgeR in place of read counts. Such quantities will prevent edgeR from correctly estimating the mean-variance relationship in the data, which is a crucial to the statistical strategies underlying edgeR. Similarly, users should not add artificial values to the counts before inputing them to edgeR.

edgeR is not designed to work with estimated expression levels, for example as might be output by Cufflinks. 
edgeR can work with expected counts as output by RSEM, but raw counts are still preferred. 

As instructed by the software, we are using the raw counts as provided by rMATS.  The raw counts we are using in the model are `ijc` and `sjc`, the sample specific raw read counts as they align to the junctions of the `included exon (ijc)` and the junctions of the `excluded or skipped exon (sjc)` respectively.


Be sure to set your GITHUB_TOKEN, prior to downloading files

One suggestion is change it to your token and then run it then immediately change it back to this:

Sys.setenv(GITHUB_TOKEN = "your-very-own-github-token")

In [3]:
# devtools::install_github("ropensci/piggyback@87f71e8", upgrade="never")


### Did you remember?
Did you remember to delete your private github token?  Now is a good time to do so, before you save your work and checkit in inadvertantly....

In [28]:
if (!("SraRunTable.noCram.noExome.noWGS.totalRNA.txt.gz" %in% list.files("../data/"))) {
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "TheJacksonLaboratory/sbas", 
        file = "SraRunTable.noCram.noExome.noWGS.totalRNA.txt.gz",
        tag  = "GTExV8.v1.0", 
        dest = "../data/")
    
    message("Loading metadata from SraRunTable.noCram.noExome.noWGS.totalRNA.txt.gz ../data/gtex.rds ..\n")   
    metadata <- data.table::fread("../data/SraRunTable.noCram.noExome.noWGS.totalRNA.txt.gz")
    message("done!")
} else {
    message("Loading metadata from SraRunTable.noCram.noExome.noWGS.totalRNA.txt.gz ../data/gtex.rds ..\n")   
    metadata <- data.table::fread("../data/SraRunTable.noCram.noExome.noWGS.totalRNA.txt.gz")
    message("done!\n")
}

if (!("rmats_final.se.jc.ijc.txt.gz" %in% list.files("../data/"))) {    
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.se.jc.ijc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = "../data/")
    message("Loading ijc counts from rmats_final.se.jc.ijc.txt.gz ../data/gtex.rds ..\n")   
    ijc.iso.counts.mem <- data.table::fread("../data/rmats_final.se.jc.ijc.txt.gz")
    message("done!\n")
} else {
    message("Loading ijc counts from rmats_final.se.jc.ijc.txt.gz ../data/gtex.rds ..\n")   
    ijc.iso.counts.mem <- data.table::fread("../data/rmats_final.se.jc.ijc.txt.gz")
    message("done!\n")    
}

if (!("rmats_final.se.jc.sjc.txt.gz" %in% list.files("../data/"))) {
    message("Downloading rmats_final.se.jc.sjc.txt.gz")
    piggyback::pb_download(
        show_progress = TRUE,
        repo = "adeslatt/sbas_test", 
        file = "rmats_final.se.jc.sjc.txt.gz",
        tag  = "rMATS.3.2.5.GTEx.V8.final_matrices", 
        dest = "../data/")
    message("Loading sjc counts from rmats_final.se.jc.sjc.txt.gz ../data/gtex.rds ..\n")   
    sjc.iso.counts.mem <- data.table::fread("../data/rmats_final.se.jc.sjc.txt.gz")
    message("done!\n")    

} else {
    message("Loading sjc counts from rmats_final.se.jc.sjc.txt.gz ../data/gtex.rds ..\n")   
    sjc.iso.counts.mem <- data.table::fread("../data/rmats_final.se.jc.sjc.txt.gz")
    message("done!\n")        
}

if (!("gtex.rds" %in% list.files("../data/"))) {
    message("Downloading and loading obj with GTEx v8 with 'yarn::downloadGTExV8()'\n")
    obj <- yarn::downloadGTExV8(type='genes',file='../data/gtex.rds')
    message("Done!\n")

} else {
# Load with readRDS() if gtex.rds available in data/
    message("Loading obj GTEx v8 rds object with readRDS from ../data/gtex.rds ..\n")   
    obj <- readRDS(file = "../data/gtex.rds")
    message("Done!\n")
    message("Generating sha256sum for gtex.rds ..\n")    
    message(system("sha256sum ../data/gtex.rds", intern = TRUE))
    message("Done!\n")
} 

Loading metadata from SraRunTable.noCram.noExome.noWGS.totalRNA.txt.gz ../data/gtex.rds ..

done!

Loading ijc counts from rmats_final.se.jc.ijc.txt.gz ../data/gtex.rds ..

done!

Loading sjc counts from rmats_final.se.jc.sjc.txt.gz ../data/gtex.rds ..

done!

Loading obj GTEx v8 rds object with readRDS from ../data/gtex.rds ..

Done!

Generating sha256sum for gtex.rds ..

18e2c7a83c98dcf59ddab53e1281923979d49da6ea3acb68114c5a44057c57bc  ../data/gtex.rds
Done!



In [29]:
metadata$SAMPID   <- gsub('-','\\.',metadata$'Sample Name')
pData(obj)$SAMPID <- gsub('-','\\.',pData(obj)$SAMPID)
tail(metadata$SAMPID,4)
tail(pData(obj)$SAMPID,4)

exprs_sample_names=as.vector(as.character(colnames(exprs(obj))))
length(exprs_sample_names)

pheno_sample_names=as.vector(as.character(rownames(pData(obj))))
length(pheno_sample_names)

if (length(pheno_sample_names) > length(exprs_sample_names)) {
    superset <- pheno_sample_names
    subset   <- exprs_sample_names    
} 

if (length(pheno_sample_names) < length(exprs_sample_names)) {
    superset <- exprs_sample_names
    subset   <- pheno_sample_names   
} 

non_overlaps <- setdiff( superset, subset)

message("The non-overlapping IDs between pheno and exprs data are:\n\n", 
        paste(length(non_overlaps), collapse = "\n") )

logical_match_names=superset %in% subset
length(logical_match_names)

# in this case, this is correcting for an erroneous disconnect between pData(obj) and exprs(obj)
# normally all reduction in dimensionality will occur on the obj itself taking care of the phenotype and count
# data simultaneously
pData(obj) <- (pData(obj)[logical_match_names==TRUE,])
dim(pData(obj))
dim(exprs(obj))
dim(obj)

The non-overlapping IDs between pheno and exprs data are:

2


In [9]:
# now we want to coordinate our metadata with the sequence read runs (SRR)
# with our phenotype data -- this is done through the annotation file obtained from dbGaP
# other than that, the yarn pData obj should be used as it has been corrected
# we will use the SampleName 
run_sample_names=as.vector(as.character(metadata$SAMPID))
pheno_sample_names=as.vector(as.character(pData(obj)$SAMPID))
length(run_sample_names)
tail(run_sample_names,2)
length(pheno_sample_names)
tail(pheno_sample_names,2)

if (length(pheno_sample_names) > length(run_sample_names)) {
    superset <- pheno_sample_names
    subset   <- run_sample_names    
} 

if (length(pheno_sample_names) < length(run_sample_names)) {
    superset <- run_sample_names
    subset   <- pheno_sample_names   
} 

non_overlaps <- setdiff( superset, subset)

message("The non-overlapping IDs between pheno and run data are:\n\n", 
        paste(length(non_overlaps), collapse = "\n") )

logical_match_names=superset %in% subset
length(logical_match_names)
table(logical_match_names)

reduced_obj      <- obj[,logical_match_names==TRUE]
dim(pData(reduced_obj))
dim(exprs(reduced_obj))
dim(reduced_obj)


The non-overlapping IDs between pheno and run data are:

8729


logical_match_names
FALSE  TRUE 
 8729  8653 

In [30]:
head(ijc.iso.counts.mem)
head(sjc.iso.counts.mem)
head(metadata,2)
head(pData(reduced_obj))
#dimensions before we make the changes.
dim(ijc.iso.counts.mem)
dim(sjc.iso.counts.mem)
dim(metadata)
dim(pData(reduced_obj))

ID,SRR1068788,SRR1068808,SRR1068832,SRR1068855,SRR1068880,SRR1068929,SRR1068953,SRR1068977,SRR1068999,⋯,SRR821573,SRR821581,SRR821602,SRR821626,SRR821653,SRR821690,SRR821715,SRR823967,SRR823991,SRR824015
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,0,0,0,0,1,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
2,26,247,103,620,494,145,145,139,697,⋯,151,32,62,48,963,25,196,76,72,61
3,1,0,1,0,0,0,1,1,2,⋯,2,1,0,1,3,0,1,0,0,0
4,0,1,1,2,0,0,1,0,2,⋯,0,0,0,0,1,0,0,0,0,0
5,3,0,2,3,6,1,1,1,5,⋯,3,2,0,1,6,0,2,0,0,0
6,2,1,2,5,6,1,1,0,5,⋯,1,1,0,0,4,0,1,0,0,0


ID,SRR1068788,SRR1068808,SRR1068832,SRR1068855,SRR1068880,SRR1068929,SRR1068953,SRR1068977,SRR1068999,⋯,SRR821573,SRR821581,SRR821602,SRR821626,SRR821653,SRR821690,SRR821715,SRR823967,SRR823991,SRR824015
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,2,0,1,3,6,1,0,0,3,⋯,1,1,0,0,3,0,1,0,0,0
2,0,0,0,1,0,0,1,0,0,⋯,0,0,0,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0,⋯,0,0,0,0,0,0,0,0,0,0
4,0,0,0,1,0,0,0,0,1,⋯,0,0,0,0,2,0,1,0,0,0
5,0,5,3,8,4,0,3,0,3,⋯,9,3,1,2,3,0,1,0,1,0
6,11,119,36,284,207,60,63,43,295,⋯,52,13,14,9,338,8,63,25,20,18


Run,analyte_type,Assay Type,AvgSpotLen,Bases,BioProject,BioSample,biospecimen_repository,biospecimen_repository_sample_id,body_site,⋯,product_part_number (exp),product_part_number (run),sample_barcode (exp),sample_barcode (run),is_technical_control,target_set (exp),primary_disease (exp),secondary_accessions (run),Alignment_Provider (run),SAMPID
<chr>,<chr>,<chr>,<int>,<int64>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<chr>,<chr>,<dbl>,<dbl>,<chr>,<lgl>,<chr>,<chr>,<lgl>,<chr>
SRR2911715,RNA,RNA-Seq,150,3852895500,PRJNA244100,SAMN04216864,Cloud Testing,HG00103,Lymphoblastoid cell line,⋯,,,,,,,,,,HG00103
SRR2911716,RNA,RNA-Seq,150,4885577400,PRJNA244100,SAMN04216866,Cloud Testing,HG00154,Lymphoblastoid cell line,⋯,,,,,,,,,,HG00154


Unnamed: 0_level_0,SAMPID,SMATSSCR,SMCENTER,SMPTHNTS,SMRIN,SMTS,SMTSD,SMUBRID,SMTSISCH,SMTSPAX,⋯,SME1PCTS,SMRRNART,SME1MPRT,SMNUM5CD,SMDPMPRT,SME2PCTS,SUBJID,SEX,AGE,DTHHRDY
Unnamed: 0_level_1,<chr>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,⋯,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>
GTEX-1117F-0226-SM-5GZZ7,GTEX.1117F.0226.SM.5GZZ7,0,B1,"2 pieces, ~15% vessel stroma, rep delineated",6.8,Adipose Tissue,Adipose - Subcutaneous,2190,1214,1125,⋯,50.0354,0.00310538,0.99474,,0,50.1944,GTEX-1117F,2,60-69,4
GTEX-1117F-0426-SM-5EGHI,GTEX.1117F.0426.SM.5EGHI,0,B1,"2 pieces, !5% fibrous connective tissue, delineated (rep)",7.1,Muscle,Muscle - Skeletal,11907,1220,1119,⋯,50.2809,0.00699464,0.995041,,0,49.9455,GTEX-1117F,2,60-69,4
GTEX-1117F-0526-SM-5EGHJ,GTEX.1117F.0526.SM.5EGHJ,0,B1,"2 pieces, clean, Monckebeg medial sclerosis, rep delineated",8.0,Blood Vessel,Artery - Tibial,7610,1221,1120,⋯,49.9535,0.00286826,0.994001,,0,50.2667,GTEX-1117F,2,60-69,4
GTEX-1117F-0626-SM-5N9CS,GTEX.1117F.0626.SM.5N9CS,1,B1,"2 pieces, up to 4mm aderent fat/nerve/vessel, delineated",6.9,Blood Vessel,Artery - Coronary,1621,1243,1098,⋯,50.2096,0.00533653,0.992257,,0,50.0865,GTEX-1117F,2,60-69,4
GTEX-1117F-0726-SM-5GIEN,GTEX.1117F.0726.SM.5GIEN,1,B1,"2 pieces, no abnormalities",6.3,Heart,Heart - Atrial Appendage,6631,1244,1097,⋯,50.2367,0.0305841,0.995711,,0,49.9563,GTEX-1117F,2,60-69,4
GTEX-1117F-1326-SM-5EGHH,GTEX.1117F.1326.SM.5EGHH,1,B1,"2 pieces, diffuse mesothelial hyperplasia; ~10% vessel/fibrous tissue (delineated)",5.9,Adipose Tissue,Adipose - Visceral (Omentum),10414,1277,1066,⋯,50.0547,0.010331,0.990378,,0,50.1311,GTEX-1117F,2,60-69,4


## Synchronize metadata samples with ijc sjc samples

Keep only the runs that are in the ijc count list (assuming ijc and sjc are the same).  As well, name the rows with the junction id column and then make the matrix just about the counts.

In [81]:
# preserve junction id as rowname
rownames(ijc.iso.counts.mem) <- ijc.iso.counts.mem$ID
rownames(sjc.iso.counts.mem) <- sjc.iso.counts.mem$ID

# and remove the id to have a data matrix
ijc.iso.counts.mem  <- ijc.iso.counts.mem[,-1]
sjc.iso.counts.mem  <- sjc.iso.counts.mem[,-1]

dim(ijc.iso.counts.mem)
dim(sjc.iso.counts.mem)

In [82]:
# the sample names are in the columns of both the ijc and the sjc matrices, these matrices have the identical column order)
metadata <- data.table::fread("../data/SraRunTable.txt.gz")
ijc_run_names <- as.vector(as.character(colnames(ijc.iso.counts.mem)))
run_names     <- as.vector(as.character(metadata$Run))

if (length(run_names) > length(ijc_run_names)) {
    superset <- run_names
    subset   <- ijc_run_names    
} 

if (length(run_names) < length(ijc_run_names)) {
    superset <- ijc_run_names
    subset   <- run_names   
} 

length(superset)
length(subset)
tail(superset)
tail(subset)
non_overlaps <- setdiff( superset, subset)

message("The non-overlapping IDs between pheno and count data are:\n\n", 
        paste(length(non_overlaps), collapse = "\n") )

logical_match_names=superset %in% subset

length(logical_match_names)
table(logical_match_names)

reduced_metadata <- metadata[logical_match_names==TRUE,]
dim(reduced_metadata)

dim(ijc.iso.counts.mem)

The non-overlapping IDs between pheno and count data are:

15995


logical_match_names
FALSE  TRUE 
15995  8672 

### keep only those values for which we have phenotype data 
Samples were resequenced, which is shown since we have the previous step saw no reduction in information - but we are using the yarn function to correct for errors in the GTEx data set.

To deal with this, we make the metadata unique per sample, rather than unique per run -- there are a little over 100 runs that are more than one sequencing run for a sample.

There are in fact 64 samples which have more than 1 run, 17 samples that have 2 sequencing runs and 47 samples that have 3 sequencing runs.


In [83]:
length(unique(reduced_metadata$SAMPID))
length(reduced_metadata$SAMPID)
length(unique(reduced_metadata$Run))
length(reduced_metadata$Run)

# 64 samples which have more than 1 run, 
# 17 samples that have 2 sequencing runs and 
# 47 samples that have 3 sequencing runs.
# as the following three commands illustrate
# we use t (which is the number of runs per sample) as the index
# to reduce the table
unique_index <- unique(reduced_metadata$SAMPID)
t <- table(reduced_metadata$SAMPID)
table((t[t>1]))
names_gt_1 <- names(table(t[t>1]))
length(unique_index)


< table of extent 0 >

### Now we need to reduce the metadata to the unique samples
We will use the first occurance of the sample, got the idea from here https://stackoverflow.com/questions/19944334/extract-rows-for-the-first-occurrence-of-a-variable-in-a-data-frame

In [74]:
reduced_metadata_first <- reduced_metadata[match(unique(reduced_metadata$SAMPID), reduced_metadata$SAMPID),]
dim(reduced_metadata_first)

### Now adjust the count matrices
Now that we see that we have multiple runs per sample (and an improvement here would be to use the sample with the best results or RIN number which isn't in the annotation data....

In [79]:
run_names <- as.vector(as.character(reduced_metadata_first$Run))
ijc_run_names <- as.vector(as.character(colnames(ijc.iso.counts.mem)))

if (length(run_names) > length(ijc_run_names)) {
    superset <- run_names
    subset   <- ijc_run_names    
} 

if (length(run_names) < length(ijc_run_names)) {
    superset <- ijc_run_names
    subset   <- run_names   
} 

length(superset)
length(subset)
tail(superset)
tail(subset)
non_overlaps <- setdiff( superset, subset)

message("The non-overlapping IDs between pheno and count data are:\n\n", 
        paste(length(non_overlaps), collapse = "\n") )

logical_match_names= ijc_run_names %in% run_names

length(logical_match_names)
table(logical_match_names)

ijc.iso.counts.mem2 <- ijc.iso.counts.mem[,logical_match_names==TRUE]
sjc.iso.counts.mem2 <- sjc.iso.counts.mem[,logical_match_names==TRUE]
dim(ijc.iso.counts.mem)
dim(sjc.iso.counts.mem)
dim(ijc.iso.counts.mem2)
dim(sjc.iso.counts.mem2)

The non-overlapping IDs between pheno and count data are:

112


logical_match_names
FALSE  TRUE 
  112  8562 

NULL

NULL

In [37]:
reduced_metadata$SAMPID   <- gsub('-','\\.',reduced_metadata$'Sample Name')
pData(obj)$SAMPID <- gsub('-','\\.',pData(obj)$SAMPID)

sample_names          = as.vector(as.character(pData(obj)$SAMPID))
metadata_sample_names = as.vector(as.character(reduced_metadata$SAMPID))

tail(sample_names)
tail(metadata_sample_names)

if (length(metadata_sample_names) > length(sample_names)) {
    superset <- metadata_sample_names
    subset   <- sample_names    
} 

if (length(metadata_sample_names) < length(sample_names)) {
    superset <- sample_names
    subset   <- metadata_sample_names   
} 

non_overlaps <- setdiff( superset, subset)

message("The number of non-overlapping IDs between pheno and count data are:\n\n", 
        paste(length(non_overlaps), collapse = "\n") )

logical_match_names=superset %in% subset
length(logical_match_names)
table(logical_match_names)

reduced_obj2 <- obj[,logical_match_names==TRUE]

# now we have the reduced_obj2 which will contain all the yarn phenotypes that match our samples 
# there is data loss here (unfortunately) because the annotation information does not 100% match
# our samples which were retrieved from the SRA dbGaP GTEx data
# 
dim(reduced_obj2)
dim(pData(reduced_obj2))
dim(exprs(reduced_obj2))

The number of non-overlapping IDs between pheno and count data are:

9615


logical_match_names
FALSE  TRUE 
 9615  7767 

In [59]:
length(unique(as.character(pData(reduced_obj)$SAMPID)))

In [55]:
# Now unfortunately, without adequate annotations we will reduce our matrix size
# 
# So now lets do the match in reverse with the yarn annoted reduced_obj
#
# ijc colnames matched to SraRunTable run names 100%
# metadata$SAMPID names matched to 7767 pData(obj) data <- which we then created reduced_obj
#
# Now we need to update the ijc colnames and the metadata names
# 
is.unique(as.character(pData(reduced_obj)$SAMPID)))
sample_names          = as.vector(as.character(pData(reduced_obj)$SAMPID))
metadata_sample_names = as.vector(as.character(metadata$SAMPID))

length(sample_names)
length(metadata_sample_names)

if (length(metadata_sample_names) > length(sample_names)) {
    superset <- metadata_sample_names
    subset   <- sample_names    
} 

if (length(metadata_sample_names) < length(sample_names)) {
    superset <- sample_names
    subset   <- metadata_sample_names   
} 

non_overlaps <- setdiff( superset, subset)

message("The number of non-overlapping IDs between pheno and count data are:\n\n", 
        paste(length(non_overlaps), collapse = "\n") )

#logical_match_names=superset %in% subset
#length(logical_match_names)
table(logical_match_names)

#reduced_metadata2 <- reduced_metadata[logical_match_names==TRUE,]
#dim(reduced_metadata2)

The number of non-overlapping IDs between pheno and count data are:

5571


logical_match_names
FALSE  TRUE 
  801  7872 

In [53]:
table(pData(reduced_obj2)$AGE)
table(pData(reduced_obj2)$DTHHRDY)
table(pData(reduced_obj2)$SMCENTER)


20-29 30-39 40-49 50-59 60-69 70-79 
  627   560  1363  2626  2468   123 


   0    1    2    3    4 
4172  397 1942  411  762 


        B1     B1, A1         C1     C1, A1 C1, B1, A1         D1     D1, A1 
      4672        369       2117        492          0         60         57 

## Order ijc and sjc columns in the same order as the metadata Run order

Using tibble library, we can rearrange the columns as the column name.  

In [None]:
metadata_runnames    <- as.character(reduced_metadata$Run)
pheno_data           <- as.tibble()
ijc.iso.counts.mem2  <- as_tibble(ijc.iso.counts.mem)
sjc.iso.counts.mem2  <- as_tibble(sjc.iso.counts.mem)

ijc.iso.counts.mem2  <- ijc.iso.counts.mem2[,c(metadata_runnames)]
sjc.iso.counts.mem2  <- sjc.iso.counts.mem2[,c(metadata_runnames)]

dim(ijc.iso.counts.mem2)
dim(sjc.iso.counts.mem2)
dim(reduced_metadata)

Remove samples that match '11IL0' from the ijc, sjc and metadata files using the logical grep, grepl

In [None]:
keep_metadata <- (!grepl('11ILO',reduced_metadata$SAMPID))
table(keep_metadata)
ijc.iso.counts.mem2 <-ijc.iso.counts.mem2 [                    ,keep_metadata==TRUE]
sjc.iso.counts.mem2 <-sjc.iso.counts.mem2 [                    ,keep_metadata==TRUE]

reduced_metadata   <-reduced_metadata   [keep_metadata==TRUE,                    ]
dim(ijc.iso.counts.mem2)
dim(sjc.iso.counts.mem2)

### and now for all tissues


### exploration of the details

For each sample, we have ijc and sjc count data and demo
For exon skipping events (SE), we have 42,611 non-zero junction IDs the (first dimension of the ijc and sjc cout table) for the skipped exon event for breast-Mammary Tissue, 191 individuals.  These are healthy individuals, and we are studying the impact of sex on the occurrence or non-occurance of specific alternative splicing events.   We explore the information we ahve about these junctions and create a construct, as_event, which accounts for the junction under exploration.

#### Exploring the ijc and sjc Count data 

We have two counts that are in many ways two sides of the same coin.  Both our the observational output and we wish to see how robust each are in their ability to separate out the samples to provide for us differentially expressed isoform events as measured by their counts.   Each junction is in a manner a specific marker to specific isoform events that may or may not be shared between the genders.   If there is significant results, then this is indicative of the separation achieved by isoform specific differentiation.   In our model we will use these in combination, it is important to see if they will yield the results we are looking for.

## Preparing the data further

### Keeping only tissues shared male female

We need to remove the tissues that are not shared by males and females, we do this by finding the intersection of the tissue lists.

In [None]:
# SEX is coded 1 == Male
#              2 == Female
sex = factor(pData(reduced_obj)$SEX)
sex <- ifelse(sex == 1,'male','female')

tissue_groups      <- factor(pData(reduced_obj)$SMTSD)
tissue_male_female <- tissue_groups_male %in% tissue_groups_female
table(tissue_male_female)

tissue_shared_male_female <- factor(tissue_groups_male[tissue_male_female])
table(tissue_shared_male_female)

male_tissues_true   <- sex == "male"
female_tissues_true <- sex == "female"

sum(table(male_tissues_true))
sum(table(female_tissues_true))

male_tissue_list   <- factor(tissue_list[male_tissues_true   == TRUE])
female_tissue_list <- factor(tissue_list[female_tissues_true == TRUE])

male_female_tissue_list <- intersect(levels(male_tissue_list),levels(female_tissue_list))

keep = tissue_list %in% male_female_tissue_list

table(keep)

ijc_m_f         = ijc.iso.counts.mem2[          ,keep==TRUE]
sjc_m_f         = sjc.iso.counts.mem2[          ,keep==TRUE]
metadata_m_f    = reduced_metadata   [keep==TRUE,          ]
tissue_list_m_f = tissue_list        [keep==TRUE]

dim(ijc_m_f)
dim(sjc_m_f)
dim(metadata_m_f)
length(tissue_list_m_f)
tissue_list_m_f <- factor(tissue_list_m_f)
levels(tissue_list_m_f)
length(levels(tissue_list_m_f))

### Keeping only chromosomes shared male female

The Y chromosome spans more than 59 million base pairs of DNA and represents almost 2 percent of the total DNA in cells. Each person normally has one pair of sex chromosomes in each cell. The Y chromosome is present in males, who have one X and one Y chromosome, while females have two X chromosomes. Since our analysis is on the comparative differences, we must eliminate chrY from our analyses.

To do so, we grab the annotation from the GTF file and remove those junctions that correspond to the genes on this chromosome

In [None]:
if (! (file.exists("../data/fromGTF.tar.gz"))) {
        system("mkdir -p ../data", intern = TRUE)
        message("Fetching fromGTF.tar.gz from GitHub ..")
        # Download archive from GitHub release with tag "dge"
        piggyback::pb_download(file = "fromGTF.tar.gz",
                           dest = "../data",
                           repo = "adeslatt/sbas_gtf",
                           tag  = "rMATS.3.2.5.gencode.v30",
                           show_progress = TRUE)
        message("Done!\n")
        message("Decompressing fromGTF.tar.gz into ../data")
        system("mkdir -p ../data && tar xvfz ../data/fromGTF.tar.gz -C ../data", intern = TRUE)
        message("Done!\n")
        message("Decompressing fromGTF.*.txt.gz into ../data")
        system("gunzip  ../data/fromGTF*.txt.gz ", intern = TRUE)
        message("Done!\n")
}
fromGTF.SE <- read.table("../data/fromGTF.SE.txt", header=TRUE)
head(fromGTF.SE)
genes <- factor(fromGTF.SE$geneSymbol)
length(levels(genes))    

table(fromGTF.SE$chr)

keepAllJunctionsButChrY <- (fromGTF.SE$chr != "chrY")

table(keepAllJunctionsButChrY)
sum(table(keepAllJunctionsButChrY))

fromGTF_no_chrY <- fromGTF.SE[keepAllJunctionsButChrY,]
ijc_m_f_no_chrY <- ijc_m_f   [keepAllJunctionsButChrY,]
sjc_m_f_no_chrY <- sjc_m_f   [keepAllJunctionsButChrY,]

dim(ijc_m_f_no_chrY)
dim(sjc_m_f_no_chrY)
dim(fromGTF_no_chrY)


ijc_m_f_no_chrY <- data.matrix(ijc_m_f_no_chrY)
sjc_m_f_no_chrY <- data.matrix(sjc_m_f_no_chrY)

rownames(ijc_m_f_no_chrY) <- rownames(fromGTF_no_chrY)
rownames(sjc_m_f_no_chrY) <- rownames(fromGTF_no_chrY)

head(ijc_m_f_no_chrY)
head(fromGTF_no_chrY)
head(sjc_m_f_no_chrY)

## Exploratory and Differential analysis as_event:ijc, sjc 

Differential Analysis (DE) was performed using voom (Law et.al., 2014) to transform junction counts (reads that were aligned to junctions when an exon is included - ijc, and reads that were aligned to junctions when the exon is excluded - sjc) with associated precision weights, followed by linear modeling and empirical Bayes procedure using limma.    In each tissue, the following linear regression model was used to detec secually dimorphic alternative splicing event expression: 

           y = B0 + B1 sex + epsilon (error)
           

where y is the included exon junction count expression; sex denotes the reported sex of the subject

## Differential analysis as_event (combined ijc and sjc)

Differential Analysis (DE) was performed using voom (Law et.al., 2014) to transform junction counts (reads that were aligned to junctions when an exon is included - ijc, and reads that were aligned to junctions when the exon is excluded - sjc) with associated precision weights, followed by linear modeling and empirical Bayes procedure using limma.    In each tissue, the following linear regression model was used to detec secually dimorphic alternative splicing event expression: 

           y = B0 + B1 sex + B2 as_event + B3 sex*as_event + epsilon (error)
           

where y is the alternative splicing event expression; sex denotes the reported sex of the subject, as_event represents the specific alternative splicing event - either included exon junction counts or skipped exon junction counts and their interaction terms.   Donor is added to our model as a blocking variable used in both the calculation of duplicate correlation as well as in the linear fit.

### Voom, limma's lmFit and eBayes

Using sample as a blocking variable, we are able to model the effects of the donor on the results, which improves the power.  This topic is discussed in biostars https://www.biostars.org/p/54565/.  And Gordon Smyth answers the question here https://mailman.stat.ethz.ch/pipermail/bioconductor/2014-February/057887.html.  The method of modeling is a random effects approach in which the intra-donor correlation is incorporated into the covariance matrix instead of the linear predictor.   And though as Gordon Smyth states both are good method and the twoway anova approach makes fewer assumptions, the random effects approach is statistically more powerful.  

We have a balanced design in which all donors receive all stimuli (which is really in healthy human donors, life and all of its factors!) Our measurement has so many points -- we are measuring in the skipped exon approach, 42,611 junctions!   It is not possible to encorporate those measurements into the linear predictor.  A two-way ANOVA approach is virtually as powerful as the random effects approach 
and hence is preferable as it makes fewer assumptions.

For an unbalanced design in which each donor receives only a subset of the stimula, the random effects approach is more powerful.

Random effects approach is equivalent to The first method is twoway anova, a generalization of a paired analysis.


In [None]:

cat(levels(tissue_list_m_f),sep="\n")

actual_tissue_list_m_f = levels(tissue_list_m_f)
tissue_of_interest = actual_tissue_list_m_f[21]
tissue_of_interest
length(actual_tissue_list_m_f)
length(tissue_list_m_f)

dim(ijc_m_f_no_chrY)
dim(sjc_m_f_no_chrY)
dim(metadata_m_f)
dim(fromGTF_no_chrY)

tissue_of_interest = tissue_of_interest
fromGTF = fromGTF_no_chrY
tissue_list = tissue_list_m_f
ijc = ijc_m_f_no_chrY
sjc = sjc_m_f_no_chrY
metadata =  metadata_m_f


In [None]:

cat(levels(tissue_list_m_f),sep="\n")

actual_tissue_list_m_f = levels(tissue_list_m_f)
tissue_of_interest = actual_tissue_list_m_f[21]
tissue_of_interest
length(actual_tissue_list_m_f)
length(tissue_list_m_f)

dim(ijc_m_f_no_chrY)
dim(sjc_m_f_no_chrY)
dim(metadata_m_f)
dim(fromGTF_no_chrY)

print_exploratory_plots (tissue_of_interest, 
                         fromGTF_no_chrY, 
                         tissue_list_m_f, 
                         ijc_m_f_no_chrY, 
                         sjc_m_f_no_chrY, 
                         metadata_m_f )


## Metadata

For replicability and reproducibility purposes, we also print the following metadata:

1. Checksums of **'artefacts'**, files generated during the analysis and stored in the folder directory **`data`**
2. List of environment metadata, dependencies, versions of libraries using `utils::sessionInfo()` and [`devtools::session_info()`](https://devtools.r-lib.org/reference/session_info.html)

### 1. Checksums with the sha256 algorithm

In [None]:
rm (notebookid)
notebookid   = "AllTissueJunctionAnalysis"
notebookid

message("Generating sha256 checksums of the artefacts in the `..data/` directory .. ")
system(paste0("cd ../data && find . -type f -exec sha256sum {} \\;  >  ../metadata/", notebookid, "_sha256sums.txt"), intern = TRUE)
message("Done!\n")

paste0("../metadata/", notebookid, "_sha256sums.txt")

data.table::fread(paste0("../metadata/", notebookid, "_sha256sums.txt"), header = FALSE, col.names = c("sha256sum", "file"))

### 2. Libraries metadata

In [None]:
dev_session_info   <- devtools::session_info()
utils_session_info <- utils::sessionInfo()

message("Saving `devtools::session_info()` objects in ../metadata/devtools_session_info.rds  ..")
saveRDS(dev_session_info, file = paste0("../metadata/", notebookid, "_devtools_session_info.rds"))
message("Done!\n")

message("Saving `utils::sessionInfo()` objects in ../metadata/utils_session_info.rds  ..")
saveRDS(utils_session_info, file = paste0("../metadata/", notebookid ,"_utils_info.rds"))
message("Done!\n")

dev_session_info$platform
dev_session_info$packages[dev_session_info$packages$attached==TRUE, ]