# Figure 1C - YARN Normalization Version

A `heatplot` showing how similar the male-versus-female fold changes are across tissues: each value in the heatmap is the correlation between the fold-change vectors of two tissues.
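As a rough illustration of how such a correlation matrix can be built, the sketch below correlates per-tissue fold-change vectors. The matrix `fc`, its gene names and its values are invented toy data, not GTEx results, and the choice of Pearson correlation is an assumption.

```r
# Toy log fold changes (rows = genes, columns = tissues); values are invented.
fc <- cbind(
  Liver  = c(g1 = 1.0,  g2 = -0.5, g3 = 0.2, g4 = 0.8),
  Lung   = c(g1 = 0.9,  g2 = -0.4, g3 = 0.1, g4 = 0.7),
  Muscle = c(g1 = -0.3, g2 = 0.6,  g3 = 0.0, g4 = -0.2)
)

# cor() correlates the columns pairwise (Pearson by default),
# giving one tissue-by-tissue matrix that a heatmap can display.
tissue_cor <- cor(fc)
print(round(tissue_cor, 2))
```

Each cell of `tissue_cor` compares two tissues, so the diagonal is 1 and the matrix is symmetric.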

We downloaded the GTEx version 8.0 RNA-seq and genotype data (phs000424.v8.v2), released 2019-08-26.
We used YARN (https://bioconductor.org/packages/release/bioc/html/yarn.html), adapting its `downloadGTEx` function
to download this release, and used it to perform quality control, gene filtering and normalization pre-processing on the
GTEx RNA-seq data, as described in (Paulson et al, 2017). This pipeline tested for sample sex-misidentification,
merged related sub-tissues, and performed tissue-aware normalization using qsmooth (Hicks et al, 2017).

## Loading dependencies

In [1]:
library(downloader)
library(readr)
library(edgeR)
library(biomaRt)
library(DBI) # v >= 1.1.0 required for biomaRt
library(devtools)
library(yarn)
Sys.setenv(TAR = "/bin/tar") # for gzfile


Begin here if you have already run this and created the `data/gtex.rds` file

Please `git clone` the repository and change into it (`cd lifebitCloudOSDRE`) before you start working. The `data` subdirectory and all other paths used in this notebook are relative to the `lifebitCloudOSDRE` repository.

In [5]:
# CAUTION: this takes several minutes and needs ample memory and disk space
obj <- yarn::downloadGTExV8(type='genes',file='../data/gtex.rds')

Downloading and reading files
Parsed with column specification:
cols(
  .default = col_double(),
  SAMPID = col_character(),
  SMCENTER = col_character(),
  SMPTHNTS = col_character(),
  SMTS = col_character(),
  SMTSD = col_character(),
  SMUBRID = col_character(),
  SMNABTCH = col_character(),
  SMNABTCHT = col_character(),
  SMNABTCHD = col_character(),
  SMGEBTCH = col_character(),
  SMGEBTCHD = col_character(),
  SMGEBTCHT = col_character(),
  SMAFRZE = col_character(),
  SMGTC = col_logical(),
  SMNUMGPS = col_logical(),
  SM550NRM = col_logical(),
  SM350NRM = col_logical(),
  SMMNCPB = col_logical(),
  SMMNCV = col_logical(),
  SMCGLGTH = col_logical()
  # ... with 2 more columns
)
See spec(...) for full column specifications.
“379 parsing failures.
  row   col           expected  

The loaded object is available by long names, which might be nice to simplify.

In [6]:
class(obj)

In [7]:
dim(phenoData(obj))

In [8]:
dim(obj)

In [9]:
sample_names=as.vector(as.character(colnames(exprs(obj))))
head(sample_names)
length(sample_names)

In [10]:
pheno_sample_names=as.vector(as.character(rownames(pData(obj))))
head(pheno_sample_names)
length(pheno_sample_names)

For some reason our phenotype data contains more samples than our expression data (I have written to Joe Paulson about this).
In the meantime, make sure that the two sets are aligned.

In [11]:
logical_match_names=pheno_sample_names %in% sample_names
length(logical_match_names)

In [12]:
table(logical_match_names)


logical_match_names
FALSE  TRUE 
    2 17382 

In [13]:
# keep only the phenotype rows whose sample also has expression data
pData(obj) <- pData(obj)[logical_match_names, ]
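The alignment step can be sketched on toy data as follows. The sample names and the `pheno` data frame are hypothetical stand-ins for `colnames(exprs(obj))` and `pData(obj)`. The extra `match()` reordering is an addition for illustration and is not a step taken in this notebook.

```r
# Toy sample identifiers; the phenotype table has two extra samples.
expr_samples <- c("S1", "S2", "S3")
pheno <- data.frame(SEX = c(1, 2, 1, 2, 1),
                    row.names = c("S2", "X9", "S1", "S3", "X7"))

# Keep only phenotype rows that have expression data ...
in_expr <- rownames(pheno) %in% expr_samples
print(table(in_expr))
pheno <- pheno[in_expr, , drop = FALSE]

# ... and put them in the same order as the expression columns.
pheno <- pheno[match(expr_samples, rownames(pheno)), , drop = FALSE]

# Fail loudly if the two sets are still not aligned.
stopifnot(identical(rownames(pheno), expr_samples))
```

Filtering with `%in%` only removes unmatched rows. The `match()` step additionally guards against the two sets being in different orders.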

Now we want to replace all *dashes* in the sample IDs with _periods_ (`.`)

In [14]:
newSampID <- gsub('-','\\.',pData(obj)$SAMPID)

In [15]:
head(newSampID)

In [16]:
pData(obj)$SAMPID <- newSampID
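One plausible motivation for this renaming, which is an assumption because the notebook does not state it, is that R itself turns dashes into periods when it builds syntactic names, for example through `make.names()`. The sketch below shows that a literal dash-to-period substitution gives the same result on two illustrative GTEx-style IDs.

```r
# Illustrative GTEx-style sample IDs (not read from the object above).
ids <- c("GTEX-1117F-0226-SM-5GZZ7", "GTEX-111CU-1826-SM-5GZYN")

# fixed = TRUE treats "-" and "." as literal strings, so no escaping is needed.
dotted <- gsub("-", ".", ids, fixed = TRUE)
print(dotted)

# The result matches the syntactic names R would generate for these IDs.
stopifnot(identical(dotted, make.names(ids)))
```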

In [18]:
colnames(pData(obj))

In [19]:
tissueFactors <- factor(pData(obj)$SMTS)

In [20]:
table(tissueFactors)

tissueFactors
 Adipose Tissue   Adrenal Gland         Bladder           Blood    Blood Vessel 
           1204             258              21             929            1335 
          Brain          Breast    Cervix Uteri           Colon       Esophagus 
           2642             459              19             779            1445 
 Fallopian Tube           Heart          Kidney           Liver            Lung 
              9             861              89             226             578 
         Muscle           Nerve           Ovary        Pancreas       Pituitary 
            803             619             180             328             283 
       Prostate  Salivary Gland            Skin Small Intestine          Spleen 
            245             162            1809             187             241 
        Stomach          Testis         Thyroid          Uterus          Vagina 
            359             361             653             142             156 

In [21]:
# SEX is coded 1 == Male
#              2 == Female
sex <- pData(obj)$SEX
age <- pData(obj)$AGE
# cod: death classification (DTHHRDY, Hardy scale)
cod <- pData(obj)$DTHHRDY

In [22]:
table(sex)
table(age)
table(cod)

sex
    1     2 
11584  5798 

age
20-29 30-39 40-49 50-59 60-69 70-79 
 1320  1323  2702  5615  5821   601 

cod
   0    1    2    3    4 
8814  711 4839  868 2039 

Now let us do the differential expression analysis using edgeR

In [23]:
x <- exprs(obj)

In [24]:
dim(x)

To use the `DGEList` function from edgeR, the length of `group` must equal the number of columns in our counts (`x`);
otherwise `DGEList(counts = x, group = group)` raises an error.

`exprs(obj)` returns genes in rows and samples in columns, so `x` already has one column per sample and no transposition is applied before the call.

In [25]:
group <- factor(pData(obj)$SEX)

In [28]:
y <- DGEList(counts=x, group=group)

I keep running out of memory on this step, so after calculating the DGEList on my laptop
I saved it and uploaded it to this larger-memory machine

In [30]:
attributes(y)

In [None]:
#caution this step takes a lot of memory and time

In [31]:
y <- calcNormFactors(y)

In [32]:
saveRDS(y, file = "../data/DGENormFactorsy.rds")

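We only want to keep events (genes) that are expressed in enough samples of both sexes:
an event is kept if its CPM is greater than 1 in at least 0.25 * min(table(pData(obj)$SEX)) samples within each sex,
that is, in at least a quarter as many samples as the smaller sex group contains.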

Recall SEX is coded 1 for male, 2 for female

In [33]:
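groups <- pData(obj)$SEX
# minimum number of samples (per sex) in which an event must have CPM > 1
min.samples <- 0.25 * min(table(groups))
keep.events <- rep(TRUE, nrow(y))
# loop over the sex codes; sex.code avoids overwriting the `group` factor
for (sex.code in c(1, 2)) {
    keep.events <- keep.events &
                   rowSums(cpm(y[, groups %in% sex.code]) > 1) >= min.samples
}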
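The filtering rule can be sketched in base R on a toy count matrix. The counts and sex labels are invented. The CPM here uses raw column sums as library sizes, a simplification of `edgeR::cpm()`, which on a `DGEList` also applies the normalization factors.

```r
# Toy counts: 4 genes x 6 samples (values invented).
counts <- matrix(c(50, 60, 55, 40, 45, 50,   # expressed in both sexes
                    0,  0,  0, 30, 35, 40,   # expressed in one sex only
                    0,  0,  0,  0,  0,  0,   # never expressed
                   20, 25, 30, 20, 25, 30),  # expressed in both sexes
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:6)))
sex <- c(1, 1, 1, 2, 2, 2)

# Counts per million from raw library sizes (simplified CPM).
cpm_toy <- t(t(counts) / colSums(counts)) * 1e6

# A gene must have CPM > 1 in at least this many samples of EACH sex.
min_samples <- 0.25 * min(table(sex))
keep <- rep(TRUE, nrow(counts))
for (s in unique(sex)) {
  keep <- keep & rowSums(cpm_toy[, sex == s, drop = FALSE] > 1) >= min_samples
}
print(keep)
```

Because the conditions are combined with `&`, a gene that is expressed in only one sex is dropped, which is worth keeping in mind when looking for sex-specific genes.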


In [37]:
reduced_y <- y[keep.events, ]

In [42]:
dim(reduced_y)

In [43]:
dim(y)

In [44]:
saveRDS(reduced_y, file = "../data/DGENormFactorsReducedy.rds")

In [45]:
# make the design based upon sex 
design <- model.matrix(~factor(pData(obj)$SEX))
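To see what this design matrix looks like, here is a small sketch using an invented `sex` vector. With R's default treatment contrasts, the first factor level (1, male) becomes the baseline and the second column is an indicator for level 2 (female). The column name is built from the formula term plus the level, which is why the coefficient for the real data is called `factor(pData(obj)$SEX)2`.

```r
# Invented sex codes for five samples (1 = male, 2 = female).
sex <- c(1, 2, 2, 1, 2)

design_toy <- model.matrix(~ factor(sex))
print(design_toy)
print(colnames(design_toy))
```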

In [None]:
# From voom function description, we see ‘voom’ is an acronym for mean-variance modelling at the
#     observational level. The idea is to estimate the mean-variance
#     relationship in the data, then use this to compute an appropriate
#     precision weight for each observation. 
#     ‘voom’ performs the following specific calculations. First, the
#     counts are converted to logCPM values, adding 0.5 to all the
#     counts to avoid taking the logarithm of zero. 
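The logCPM conversion described above can be sketched in base R. The toy counts are invented, and library sizes are taken as plain column sums, which is an assumption: for a `DGEList`, `voom` uses library sizes scaled by the normalization factors. The formula adds 0.5 to each count and 1 to each library size before taking log2.

```r
# Toy counts: 3 genes x 2 samples (values invented); note the zero count.
counts <- matrix(c(0, 10, 100,
                   5, 20, 200),
                 nrow = 3,
                 dimnames = list(paste0("gene", 1:3), c("s1", "s2")))

# Library size per sample (assumed here to be the plain column sum).
lib_size <- colSums(counts)

# log2 counts per million with the 0.5 offset, so zeros stay finite.
log_cpm <- t(log2((t(counts) + 0.5) / (lib_size + 1) * 1e6))
print(round(log_cpm, 3))
```

This covers only the first calculation in `voom`; the mean-variance trend and the precision weights are not reproduced here.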

In [47]:
v <- voom(reduced_y, design)

In [None]:
# make a linear fit model based upon the model.matrix

In [49]:
fit <- lmFit(v, design)

In [53]:
# statmod is needed by eBayes(robust = TRUE)
install.packages("statmod")
library(statmod)

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done


In [54]:
fit <- eBayes(fit, robust=TRUE)

In [56]:
# Extract a table of all genes, ranked, from the linear model fit.
# The coefficient is SEX level 2 (female) relative to level 1 (male).
results = topTable(fit, coef='factor(pData(obj)$SEX)2',number=nrow(reduced_y))


In [57]:
head(results)

| | logFC | AveExpr | t | P.Value | adj.P.Val | B |
|---|---|---|---|---|---|---|
| | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; |
| ENSG00000005889.15 | 0.6947531 | 4.8455923 | 83.36478 | 0 | 0 | 2911.746 |
| ENSG00000147050.14 | 0.7026541 | 4.7663662 | 74.97091 | 0 | 0 | 2424.186 |
| ENSG00000126012.11 | 0.5221968 | 6.7604268 | 64.26 | 0 | 0 | 1842.496 |
| ENSG00000225470.7 | 0.5191323 | 0.1480988 | 54.72147 | 0 | 0 | 1366.746 |
| ENSG00000169249.12 | 0.4663712 | 4.0206209 | 53.85897 | 0 | 0 | 1331.656 |
| ENSG00000215301.9 | 0.5117949 | 7.9151232 | 52.34426 | 0 | 0 | 1262.748 |


In [58]:
write.table(results, '../data/all_FC_results.txt', sep = '\t', quote = FALSE)