# Figure 1C - YARN Normalization Version

Heatmap showing the similarity in fold changes between male and female samples across tissues. Each cell of the heatmap is the correlation between the fold-change vectors of a pair of tissues.

We downloaded the GTEx version 8 RNA-seq and genotype data (phs000424.v8.p2), released 2019-08-26.
We used YARN (https://bioconductor.org/packages/release/bioc/html/yarn.html), updating its downloadGTEx function
to download this release, and used it to perform quality control, gene filtering, and normalization pre-processing on the
GTEx RNA-seq data, as described in (Paulson et al, 2017). This pipeline tests for sample sex misidentification,
merges related sub-tissues, and performs tissue-aware normalization using qsmooth (Hicks et al, 2017).

In [5]:
library(downloader)
library(readr)
library(biomaRt)
library(DBI) # v >= 1.1.0 required for biomaRt
library(devtools)
library(yarn)
library(edgeR) # provides DGEList, calcNormFactors and cpm used below

In [2]:
getwd()

In [None]:
setwd("../..")
getwd()

Begin here if you have already run this and created the data/gtex.rds file

You may need to adjust your working directory -- the data subdirectory is relative to the lifebitCloudOSDRE working directory

In [4]:
obj <- downloadGTExV8(type='genes',file='data/gtex.rds')

ERROR: Error in downloadGTExV8(type = "genes", file = "data/gtex.rds"): could not find function "downloadGTExV8"


In [None]:
setwd('/mnt/shared/ec2-user/session_data')

In [None]:
getwd()

The uploaded object is only reachable through a very long path, which would be nice to simplify.

In [None]:
obj<-readRDS('/mnt/shared/ec2-user/session_data/mounted-data/lifebit-user-data-84f09c4a-623d-47db-b432-2880ba594b3b/deploit/teams/5db1ce1274081500dffba7b1/users/5c4cced085814700a85f8c7f/dataset/5e3ab72ee3474100f4708b3b/gtex.rds')

In [None]:
obj

In [None]:
dim(phenoData(obj))

In [None]:
dim(obj)

In [None]:
sample_names <- colnames(exprs(obj)) # already a character vector
head(sample_names)
length(sample_names)

In [None]:
pheno_sample_names <- rownames(pData(obj)) # already a character vector
head(pheno_sample_names)
length(pheno_sample_names)

For some reason the phenotype data contains more samples than the expression data; I have written to Joe Paulson about that.
In the meantime, restrict the phenotype data to the samples that are present in the expression data.

In [None]:
logical_match_names=pheno_sample_names %in% sample_names
length(logical_match_names)

In [None]:
table(logical_match_names)


In [None]:
# Keep only phenotype rows whose sample has an expression column (row order is unchanged)
pData(obj) <- pData(obj)[logical_match_names, ]
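
The `%in%` filter drops phenotype rows that have no expression column, but it keeps the phenotype table's own row order. As a minimal sketch on made-up toy data (the sample IDs and `SEX` values below are hypothetical, not taken from GTEx), base R `match()` can both subset and reorder, so that row i of the phenotype table describes column i of the expression matrix:

```r
# Toy expression matrix: 3 genes x 3 samples (hypothetical sample IDs).
expr <- matrix(1:9, nrow = 3,
               dimnames = list(paste0("gene", 1:3), c("S3", "S1", "S2")))

# Toy phenotype table with one extra sample (S4) and a different row order.
pheno <- data.frame(SEX = c(1, 2, 1, 2),
                    row.names = c("S1", "S2", "S3", "S4"))

# For each expression column, find the matching phenotype row index.
idx <- match(colnames(expr), rownames(pheno))
stopifnot(!anyNA(idx))  # every expression sample needs a phenotype row

pheno_aligned <- pheno[idx, , drop = FALSE]

# Phenotype rows now line up one-to-one with the expression columns.
stopifnot(identical(rownames(pheno_aligned), colnames(expr)))
```

Whether reordering is needed for the real object depends on how the two tables were built; the check in the last line is cheap either way.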

Now we want to replace all *dashes* in the sample IDs with *periods*

In [None]:
newSampID <- gsub('-','\\.',pData(obj)$SAMPID)

In [None]:
head(newSampID)

In [None]:
pData(obj)$SAMPID <- newSampID
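
The replacement string `'\\.'` passed to `gsub` above resolves to a plain period. A small sketch, using made-up IDs that only imitate the GTEx naming pattern, shows an equivalent spelling with `fixed = TRUE`, which avoids regex escaping altogether:

```r
# Hypothetical sample IDs in the dash-separated style.
ids <- c("GTEX-AAAA-0001-SM-XXXX", "GTEX-BBBB-0002-SM-YYYY")

# fixed = TRUE treats the pattern and the replacement as literal strings.
ids_dot <- gsub("-", ".", ids, fixed = TRUE)

# The escaped regex form gives the same result.
stopifnot(identical(ids_dot, gsub("-", "\\.", ids)))
print(ids_dot)
```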

In [None]:
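tissues <- pData(obj)$SMTSD # assumption: `tissues` is never defined above; SMTSD is the GTEx detailed tissue label
tissueFactors <- factor(tissues)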

In [None]:
table(tissueFactors)

In [None]:
# SEX is coded 1 == Male
#              2 == Female
sex <- pData(obj)$SEX
age <- pData(obj)$AGE
# cod: death classification on the Hardy scale (DTHHRDY)
cod <- pData(obj)$DTHHRDY

In [None]:
table(sex)
table(age)
table(cod)

Now let us do the differential analysis using edgeR

In [None]:
x <- exprs(obj)

In [None]:
dim(x)

The DGEList function from edgeR expects counts with genes in rows and samples in columns, so the length of group
must equal the number of columns in our counts (x). Transpose x only if its samples are in rows.

DGEList(counts = x, group = group) raises an error if the length of group is not equal to the number of columns in counts.
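
As a rough sketch of that orientation check in base R, the helper below (`orient_counts` is a hypothetical name, and the counts are simulated toy values) returns a matrix with samples in columns, transposing only when the samples arrived in rows:

```r
# Toy counts: 4 genes (rows) x 6 samples (columns); values are simulated.
set.seed(1)
counts <- matrix(rpois(24, lambda = 10), nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("S", 1:6)))
group <- factor(c(1, 2, 1, 2, 2, 1))  # one label per sample

# Hypothetical helper: make sure samples are in columns.
orient_counts <- function(counts, group) {
  if (ncol(counts) == length(group)) return(counts)
  if (nrow(counts) == length(group)) return(t(counts))
  stop("length(group) matches neither dimension of counts")
}

# Feed it a samples-in-rows matrix and confirm it comes back oriented.
oriented <- orient_counts(t(counts), group)
stopifnot(ncol(oriented) == length(group))
```

A square matrix is returned unchanged, so this check cannot detect a wrong orientation when the number of genes equals the number of samples.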

In [None]:
group <- factor(pData(obj)$SEX)

In [None]:
y <- DGEList(counts=x, group=group)

I keep running out of memory on this step, so I calculated the DGEList on my laptop,
saved it, and uploaded it to this larger-memory machine

In [None]:
setwd("../../mounted-data/lifebit-user-data-e0354335-813e-4085-9693-9457f8507ff1/dataset/5e35bf40e3474100f467262b/")

In [None]:
getwd()

In [None]:
y <- readRDS("DGEy.rds")
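
The save-then-reload pattern used here can be wrapped in a small helper. This is only a sketch: `cached_rds` is a hypothetical function, and the toy computation stands in for the expensive `DGEList` call.

```r
# Hypothetical helper: return the cached object if the file exists,
# otherwise compute it, save it to disk, and return it.
cached_rds <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, file = path)
  result
}

# Toy usage with a temporary file standing in for "DGEy.rds".
path <- tempfile(fileext = ".rds")
first  <- cached_rds(path, function() sum(1:10))            # computed and saved
second <- cached_rds(path, function() stop("not reached"))  # read from disk
stopifnot(identical(first, second))
```

In this notebook the second argument would be something like `function() DGEList(counts = x, group = group)`, so the heavy step runs only when no saved file is present.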

In [None]:
attributes(y)

In [None]:
y <- calcNormFactors(y)

In [None]:
saveRDS(y, file = "DGENormFactorsy.rds")

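We only want to keep those events that are expressed in enough samples of each sex: an event must have more than
1 count per million (CPM) in at least 0.25 * min(table(pData(obj)$SEX)) samples of each sex, that is, a quarter of the size of the smaller sex group.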

Recall SEX is coded 1 for male, 2 for female

In [None]:
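groups <- pData(obj)$SEX
keep.events <- rep(TRUE, nrow(y))
# Require CPM > 1 in enough samples of each sex (1 = male, 2 = female).
# The loop variable is named sex_code so it does not overwrite the `group` factor.
for (sex_code in c(1,2)) {
    keep.events <- keep.events &
                   rowSums(cpm(y[,groups %in% sex_code]) > 1) >= 0.25*min(table(groups))
}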
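
To make the filtering rule concrete, here is a base R sketch on made-up counts. It computes plain CPM from raw column totals, which is an assumption of this sketch: edgeR's `cpm` on a normalized DGEList also applies the normalization factors, so the numbers would differ slightly.

```r
# Toy counts: 3 genes x 8 samples; values are made up for illustration.
counts <- rbind(
  geneA = c(50, 60, 55, 70, 65, 80, 75, 90),  # expressed in every sample
  geneB = c(40, 45,  0,  0,  0,  0,  0,  0),  # expressed in two males only
  geneC = c( 0,  0,  0,  0,  0,  0,  0,  0)   # never expressed
)
groups <- c(1, 1, 1, 2, 2, 2, 2, 2)  # 1 = male, 2 = female

# Plain counts per million from raw column totals (no normalization factors).
cpm_simple <- t(t(counts) / colSums(counts)) * 1e6

# Threshold: a quarter of the smaller sex group (here 0.25 * 3).
min_samples <- 0.25 * min(table(groups))

# For each sex, count the samples with CPM > 1 per gene (one column per sex).
n_expressed <- sapply(c(1, 2), function(g) {
  rowSums(cpm_simple[, groups == g, drop = FALSE] > 1)
})

# Keep a gene only if it passes the threshold in both sexes.
keep <- rowSums(n_expressed >= min_samples) == ncol(n_expressed)
print(keep)
```

Here `geneB` is dropped because it passes the threshold in males but not in females, which is the behavior the `&` accumulation in the loop above produces.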
