# Figure 1C - YARN Normalization Version

Heatmap representing the similarity in fold changes between male and female samples; each value in the heatmap is the correlation between the fold-change vectors of two tissues.

In [34]:
rm(list = ls())

We downloaded the GTEx version 8.0 RNA-seq and genotype data (phs000424.v8.v2), released 2019-08-26.
We used YARN (https://bioconductor.org/packages/release/bioc/html/yarn.html), updating its downloadGTEx function
to download this release, and used the package to perform quality control, gene filtering and normalization pre-processing on the
GTEx RNA-seq data, as described in (Paulson et al, 2017). This pipeline tests for sample sex-misidentification,
merges related sub-tissues, and performs tissue-aware normalization using qsmooth (Hicks et al, 2017).

In [35]:
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
#BiocManager::install("yarn")

In [20]:
#BiocManager::install("downloader")

Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.1 (2019-07-05)
Installing package(s) 'downloader'
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Old packages: 'backports', 'BH', 'bit', 'blob', 'broom', 'callr', 'caret',
  'cli', 'curl', 'data.table', 'DBI', 'devtools', 'digest', 'dplyr',
  'ellipsis', 'fansi', 'foreach', 'forecast', 'fracdiff', 'ggplot2', 'gh',
  'haven', 'hexbin', 'hms', 'htmltools', 'htmlwidgets', 'httpuv', 'IRkernel',
  'jsonlite', 'KernSmooth', 'knitr', 'later', 'lava', 'MASS', 'Matrix', 'mgcv',
  'mime', 'ModelMetrics', 'nlme', 'nycflights13', 'pillar', 'pkgbuild',
  'pkgconfig', 'plyr', 'prettyunits', 'processx', 'prodlim', 'promises', 'ps',
  'purrr', 'quadprog', 'R6', 'Rcpp', 'RcppArmadillo', 'RCurl', 'recipes',
  'repr', 'rlang', 'rmarkdown', 'roxygen2', 'RSQLite', 'rstudioapi', 'rvest',
  'scales', 'selectr', 'shiny', 'sparklyr', 'SQUAREM', 'stringi', 'survival',
  'sys', 'testthat', 'tidyr', 'tidyselect', 'tidyverse', 't

In [21]:
#BiocManager::install("readr")

Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.1 (2019-07-05)
Installing package(s) 'readr'

In [22]:
#BiocManager::install("biomaRt")

Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.1 (2019-07-05)
Installing package(s) 'biomaRt'

Define a GTEx v8 version of the downloadGTEx function from YARN. I have written to the package author asking whether the function could support this release; the alternative would be to update the package itself.
Three lines had to change, one for each of the source file URLs.

In [23]:
# GTEx v8 variant of yarn::downloadGTEx.
# Requires downloader, readr, Biobase and yarn.
downloadGTExV8 <- function(type = "genes", file = NULL, ...) 
{
    phenoFile <- "https://storage.googleapis.com/gtex_analysis_v8/annotations/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt"
    pheno2File <- "https://storage.googleapis.com/gtex_analysis_v8/annotations/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt"
    geneFile <- "https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct.gz"
    message("Downloading and reading files")
    pdFile <- tempfile("phenodat", fileext = ".txt")
    download(phenoFile, destfile = pdFile)
    pd <- read_tsv(pdFile)
    pd <- as.matrix(pd)
    rownames(pd) <- pd[, "SAMPID"]
    ids <- sapply(strsplit(pd[, "SAMPID"], "-"), function(i) paste(i[1:2], 
        collapse = "-"))
    pd2File <- tempfile("phenodat2", fileext = ".txt")
    download(pheno2File, destfile = pd2File)
    pd2 <- read_tsv(pd2File)
    pd2 <- as.matrix(pd2)
    rownames(pd2) <- pd2[, "SUBJID"]
    pd2 <- pd2[which(rownames(pd2) %in% unique(ids)), ]
    pd2 <- pd2[match(ids, rownames(pd2)), ]
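    # pd2 now has one row per sample, in the same order as pd, so reuse pd's
    # sample IDs ('counts' is not defined yet at this point).
    rownames(pd2) <- rownames(pd)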
    pdfinal <- AnnotatedDataFrame(data.frame(cbind(pd, pd2)))
    if (type == "genes") {
        countsFile <- tempfile("counts", fileext = ".gz")
        download(geneFile, destfile = countsFile)
        # Read the local copy rather than fetching the URL a second time
        cnts <- suppressWarnings(read_tsv(countsFile, skip = 2))
        genes <- unlist(cnts[, 1])
        geneNames <- unlist(cnts[, 2])
        counts <- cnts[, -c(1:2)]
        counts <- as.matrix(counts)
        rownames(counts) <- genes
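        # seq_len() avoids iterating over 1:0 when there are no parsing problems
        for (i in seq_len(nrow(problems(cnts)))) {
            counts[problems(cnts)$row[i], problems(cnts)$col[i]] <- 1e+05
        }
        # Drop all-zero genes; guard against an empty index, which would drop every row
        throwAway <- which(rowSums(counts) == 0)
        if (length(throwAway) > 0) counts <- counts[-throwAway, ]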
        genes <- sub("\\..*", "", rownames(counts))
        host <- "www.ensembl.org"
        biomart <- "ENSEMBL_MART_ENSEMBL"
        dataset <- "hsapiens_gene_ensembl"
        attributes <- c("ensembl_gene_id", "hgnc_symbol", "chromosome_name", 
            "start_position", "end_position", "gene_biotype")
    }
    message("Creating ExpressionSet")
    pdfinal <- pdfinal[match(colnames(counts), rownames(pdfinal)), 
        ]
    es <- ExpressionSet(as.matrix(counts))
    phenoData(es) <- pdfinal
    pData(es)["GTEX-YF7O-2326-101833-SM-5CVN9", "SMTS"] <- "Skin"
    pData(es)["GTEX-YEC3-1426-101806-SM-5PNXX", "SMTS"] <- "Stomach"
    message("Annotating from biomaRt")
    es <- yarn::annotateFromBiomart(obj = es, genes = genes, host = host, 
        biomart = biomart, dataset = dataset, attributes = attributes)
    message("Cleaning up files")
    unlink(pdFile)
    unlink(pd2File)
    unlink(countsFile)
    if (!is.null(file)) 
        saveRDS(es, file = file)
    return(es)
}
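
The step that links sample-level and subject-level annotations can be checked in isolation. The sketch below uses the two sample IDs that appear in the function plus one invented ID, and a toy subject table with made-up SEX values; it is an illustration of the lookup logic, not the GTEx data.

```r
# Two sample IDs taken from the function above, plus one hypothetical ID.
sampid <- c("GTEX-YF7O-2326-101833-SM-5CVN9",
            "GTEX-YEC3-1426-101806-SM-5PNXX",
            "GTEX-YF7O-0001-SM-AAAAA")          # hypothetical

# Subject ID = first two dash-separated fields, as in downloadGTExV8.
ids <- vapply(strsplit(sampid, "-"),
              function(parts) paste(parts[1:2], collapse = "-"),
              character(1))

# Toy subject-level table keyed by SUBJID (values are invented).
pd2 <- matrix(c("1", "2"), ncol = 1,
              dimnames = list(c("GTEX-YEC3", "GTEX-YF7O"), "SEX"))

# Expand the subject table to one row per sample, in sample order.
pd2_by_sample <- pd2[match(ids, rownames(pd2)), , drop = FALSE]
rownames(pd2_by_sample) <- sampid

ids
pd2_by_sample
```

Because `match()` is driven by the sample order, a subject with several samples is simply repeated, which is why the expanded table can share row names with the sample table.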


Begin here if you have already run this and created the data/gtex.rds file

In [36]:
library(downloader)
library(readr)
library(biomaRt)
library(Biobase)  # ExpressionSet, pData(), exprs()
#library(yarn)

In [25]:
getwd()

In [26]:
setwd('/mnt/shared/ec2-user/session_data/lifebitCloudOSDRE')
getwd()

In [29]:
BiocManager::install("yarn")


You may need to adjust your working directory -- the data subdirectory is relative to the lifebitCloudOSDRE working directory

In [27]:
#obj <- downloadGTExV8(type='genes',file='data/gtex.rds')

Downloading and reading files
Parsed with column specification:
cols(
  .default = col_double(),
  SAMPID = col_character(),
  SMCENTER = col_character(),
  SMPTHNTS = col_character(),
  SMTS = col_character(),
  SMTSD = col_character(),
  SMUBRID = col_character(),
  SMNABTCH = col_character(),
  SMNABTCHT = col_character(),
  SMNABTCHD = col_character(),
  SMGEBTCH = col_character(),
  SMGEBTCHD = col_character(),
  SMGEBTCHT = col_character(),
  SMAFRZE = col_character(),
  SMGTC = col_logical(),
  SMNUMGPS = col_logical(),
  SM550NRM = col_logical(),
  SM350NRM = col_logical(),
  SMMNCPB = col_logical(),
  SMMNCV = col_logical(),
  SMCGLGTH = col_logical()
  # ... with 2 more columns
)
See spec(...) for full column specifications.
379 parsing failures.
  row   col           expected  

ERROR: Error in colnames(counts): object 'counts' not found


In [28]:
obj

ERROR: Error in eval(expr, envir, enclos): object 'obj' not found


In [32]:
getwd()

In [33]:
obj<-readRDS('data/gtex.rds')

“cannot open compressed file 'data/gtex.rds', probable reason 'No such file or directory'”

ERROR: Error in gzfile(file, "rb"): cannot open the connection


In [None]:
obj

In [None]:
tissues <- pData(obj)$SMTS

In [None]:
dim(pData(obj))

In [None]:
dim(obj)

In [None]:
sample_names=as.vector(as.character(colnames(exprs(obj))))
head(sample_names)
length(sample_names)

In [None]:
pheno_sample_names=as.vector(as.character(rownames(pData(obj))))
head(pheno_sample_names)
length(pheno_sample_names)

For some reason our phenotype data has more samples than our expression data. I have written to Joe Paulson about that.
In the meantime, make sure that the two sets are aligned.

In [None]:
logical_match_names=pheno_sample_names %in% sample_names
length(logical_match_names)

In [None]:
table(logical_match_names)


In [None]:
pData(obj) <- pData(obj)[logical_match_names, ]
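
Filtering with `%in%` keeps the right rows but leaves them in the phenotype table's order. If the two tables might also be ordered differently, `match()` filters and reorders in one step. The following is a minimal sketch on invented toy objects (`expr` and `pheno` are hypothetical stand-ins for `exprs(obj)` and `pData(obj)`).

```r
# Toy stand-ins: an expression matrix (genes x samples) and a larger,
# differently ordered phenotype table. All names and values are invented.
expr <- matrix(1:6, nrow = 2,
               dimnames = list(c("g1", "g2"), c("S3", "S1", "S2")))
pheno <- data.frame(SEX = c(1, 2, 2, 1),
                    row.names = c("S1", "S2", "S3", "S4"))

# match() returns, for each expression column, its row in pheno, so the
# result is both filtered (S4 dropped) and reordered to the column order.
idx <- match(colnames(expr), rownames(pheno))
stopifnot(!anyNA(idx))            # every sample must have phenotype data
pheno_aligned <- pheno[idx, , drop = FALSE]

identical(rownames(pheno_aligned), colnames(expr))
```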

Now we want to replace all *dashes* with *periods*

In [None]:
newSampID <- gsub('-', '.', pData(obj)$SAMPID, fixed = TRUE)

In [None]:
head(newSampID)

In [None]:
pData(obj)$SAMPID <- newSampID
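
The `gsub` call above turns each dash in a sample ID into a period. One plausible reason for doing so (an assumption; the notebook does not say) is to match the names R produces when sample IDs are used as data frame column names. The sketch below checks that on the two sample IDs that appear earlier in this notebook.

```r
sampid <- c("GTEX-YF7O-2326-101833-SM-5CVN9",
            "GTEX-YEC3-1426-101806-SM-5PNXX")

# fixed = TRUE treats "-" and "." as literal strings, not regex syntax.
dotted <- gsub("-", ".", sampid, fixed = TRUE)
dotted

# make.names() performs the substitution that data.frame() applies when
# it sanitises column names, so the two results should agree here.
identical(dotted, make.names(sampid))
```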

In [None]:
# Recompute from pData(obj) so the tissues reflect the filtered samples
tissueFactors <- factor(pData(obj)$SMTS)

In [None]:
table(tissueFactors)

In [None]:
# SEX is coded 1 == Male
#              2 == Female
sex <- pData(obj)$SEX
age <- pData(obj)$AGE
# cod: death classification on the Hardy scale (DTHHRDY)
cod <- pData(obj)$DTHHRDY
    

In [None]:
table(sex)
table(age)
table(cod)

Now let us do the differential analysis using edgeR

In [15]:
BiocManager::install("edgeR")
library(edgeR)

Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.1 (2019-07-05)
Installing package(s) 'edgeR'
also installing the dependency ‘locfit’


In [None]:
x <- exprs(obj)

In [None]:
dim(x)

The DGEList function from edgeR expects genes in rows and samples in columns, so the length of group must equal
the number of columns in our counts (x). exprs(obj) already has this orientation; a transpose is needed only if the samples are in rows.

DGEList(counts = x, group = group) raises an error if the length of group is not equal to the number of columns in counts
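
A quick orientation check before building the DGEList can catch this mismatch early. The helper below, `orient_counts`, is hypothetical (it is not part of edgeR or yarn) and the toy matrix is invented; it only illustrates the rule that `group` needs one entry per column of the counts.

```r
# Toy count matrix: 3 genes (rows) x 4 samples (columns); values invented.
x <- matrix(c(10, 0, 5, 3, 8, 1, 0, 2, 7, 4, 6, 9), nrow = 3,
            dimnames = list(paste0("gene", 1:3), paste0("S", 1:4)))
group <- factor(c(1, 2, 1, 2))   # one label per sample (1 = male, 2 = female)

# Hypothetical helper: return counts with samples in columns, transposing
# only when the group length matches the rows instead of the columns.
# (For a square matrix the orientation cannot be inferred this way.)
orient_counts <- function(counts, group) {
  if (length(group) == ncol(counts)) return(counts)
  if (length(group) == nrow(counts)) return(t(counts))
  stop("length(group) matches neither dimension of counts")
}

dim(orient_counts(x, group))      # already genes x samples
dim(orient_counts(t(x), group))   # samples x genes input is transposed back
```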

In [None]:
group <- factor(pData(obj)$SEX)

In [None]:
y <- DGEList(counts=x, group=group)

I kept running out of memory at this step, so after calculating the DGEList on my laptop
I saved it and uploaded it to this larger-memory machine

In [10]:
setwd("../../mounted-data/lifebit-user-data-e0354335-813e-4085-9693-9457f8507ff1/dataset/5e35bf40e3474100f467262b/")

In [11]:
getwd()

In [12]:
y <- readRDS("DGEy.rds")

In [13]:
attributes(y)

In [16]:
y <- calcNormFactors(y)

In [17]:
saveRDS(y, file = "DGENormFactorsy.rds")

We keep only those events (genes) with more than 1 count per million (CPM) in enough samples of each sex.
For each sex, the number of samples with CPM > 1 must be at least 0.25 * min(table(pData(obj)$SEX)), that is, a quarter of the size of the smaller group.

Recall SEX is coded 1 for male, 2 for female

In [None]:
groups <- pData(obj)$SEX
keep.events <- rep(TRUE, nrow(y))
for (group in c(1,2)) {
    keep.events <- keep.events & 
                   rowSums(cpm(y[,groups %in% group]) > 1) >= 0.25*min(table(groups))
}
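
The same filter can be traced by hand on a small example. This sketch uses invented counts and a plain counts-per-million calculation in base R; unlike `cpm()` applied to the normalized DGEList above, it ignores the normalization factors, so it shows the logic of the rule rather than reproducing edgeR's numbers.

```r
# Toy counts: 4 genes x 6 samples; values and sex labels are invented.
counts <- matrix(c(500, 400, 600, 450, 550, 500,    # gene1
                     0,   0,   0, 300, 350, 400,    # gene2
                     0,   0,   0,   0,   0,   0,    # gene3
                   500, 600, 400, 250, 100, 100),   # gene4
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("S", 1:6)))
groups <- c(1, 1, 1, 2, 2, 2)

# Plain counts per million: scale each sample (column) by its library size.
cpm_simple <- t(t(counts) / colSums(counts)) * 1e6

# A gene is kept only if, within every sex, at least this many samples
# have CPM > 1.
min_samples <- 0.25 * min(table(groups))
keep <- rep(TRUE, nrow(counts))
for (g in unique(groups)) {
  in_group <- groups == g
  keep <- keep & rowSums(cpm_simple[, in_group, drop = FALSE] > 1) >= min_samples
}
keep
```

Here gene2 is dropped because it is not expressed in any sample of group 1, even though it is expressed in every sample of group 2: the condition has to hold in both sexes.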
