# Figure 1C - YARN Normalization Version

A `heatplot` representing the similarity between tissues in their male-versus-female fold changes. Each value in the heatmap is the correlation between the fold-change vectors of a pair of tissues.
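As a minimal sketch of the idea behind this figure, the snippet below builds a gene-by-tissue matrix of simulated log fold-changes, correlates the tissue columns with each other, and draws the result with base R's `heatmap()`. The matrix `logfc`, the tissue names and all values are made-up placeholders, not GTEx results.

```r
set.seed(1)
# Hypothetical log fold-changes (female vs male) for 200 genes in 4 tissues
n_genes <- 200
shared  <- rnorm(n_genes)                       # signal common to several tissues
logfc <- cbind(
  Heart   = shared + rnorm(n_genes, sd = 0.5),
  Liver   = shared + rnorm(n_genes, sd = 0.5),
  Thyroid = shared + rnorm(n_genes, sd = 1.0),
  Brain   = rnorm(n_genes)                      # unrelated to the shared signal
)

# Correlation between the fold-change vectors of every pair of tissues
fc_cor <- cor(logfc, method = "pearson")
print(round(fc_cor, 2))

# Draw the tissue-by-tissue similarity matrix as a heatmap
heatmap(fc_cor, symm = TRUE, scale = "none", margins = c(8, 8))
```

Tissues whose sex-related fold changes move together end up with a correlation close to 1 and cluster next to each other in the plot.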

We downloaded the GTEx version 8.0 RNA-seq and genotype data (phs000424.v8.p2), released 2019-08-26.
We used YARN (https://bioconductor.org/packages/release/bioc/html/yarn.html), updating the `downloadGTEx` function
to download this release, and used it to perform quality control, gene filtering and normalization pre-processing on the
GTEx RNA-seq data, as described in (Paulson et al, 2017). This pipeline tested for sample sex-misidentification,
merged related sub-tissues, and performed tissue-aware normalization using qsmooth (Hicks et al, 2017).
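To give a feel for what quantile-based normalization does, here is a small base R sketch of plain quantile normalization on a toy matrix. This is not qsmooth itself and not the YARN implementation: qsmooth relaxes the assumption that all samples share one distribution by taking tissue groups into account. The matrix `x` is invented for illustration.

```r
# Toy expression matrix: 4 genes (rows) x 3 samples (columns)
x <- matrix(c(5, 4, 3,
              2, 1, 4,
              3, 4, 6,
              4, 2, 8),
            nrow = 4, byrow = TRUE)

# Sort each column, then average across samples rank by rank
ref <- rowMeans(apply(x, 2, sort))

# Give every value the reference quantile that matches its rank in its column
x_qn <- apply(x, 2, function(col) ref[rank(col, ties.method = "first")])
x_qn
```

After this step every column contains the same set of values, only in a different order, which is the property a tissue-aware method deliberately loosens between tissue groups.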

## Loading dependencies

In [1]:
library(downloader)
library(readr)
library(edgeR)
library(biomaRt)
library(DBI) # v >= 1.1.0 required for biomaRt
library(devtools)
library(yarn)
Sys.setenv(TAR = "/bin/tar") # for gzfile

“package ‘DBI’ was built under R version 3.6.2”
Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which, which.max, which.min

Welcome to Bioconductor

    Vign

Begin here if you have already run this and created the `data/gtex.rds` file

Please `git clone` the repository and change into it (`cd lifebitCloudOSDRE`) before you start working. The `data` subdirectory and all other paths used in this notebook are relative to the `lifebitCloudOSDRE` repository.

In [5]:
# CAUTION! This step takes several minutes and needs plenty of memory and disk space
obj <- yarn::downloadGTExV8(type='genes',file='../data/gtex.rds')

Downloading and reading files
Parsed with column specification:
cols(
  .default = col_double(),
  SAMPID = col_character(),
  SMCENTER = col_character(),
  SMPTHNTS = col_character(),
  SMTS = col_character(),
  SMTSD = col_character(),
  SMUBRID = col_character(),
  SMNABTCH = col_character(),
  SMNABTCHT = col_character(),
  SMNABTCHD = col_character(),
  SMGEBTCH = col_character(),
  SMGEBTCHD = col_character(),
  SMGEBTCHT = col_character(),
  SMAFRZE = col_character(),
  SMGTC = col_logical(),
  SMNUMGPS = col_logical(),
  SM550NRM = col_logical(),
  SM350NRM = col_logical(),
  SMMNCPB = col_logical(),
  SMMNCV = col_logical(),
  SMCGLGTH = col_logical()
  # ... with 2 more columns
)
See spec(...) for full column specifications.
“379 parsing failures.
  row   col           expected  

The object returned above uses long names, which might be nice to simplify.

In [6]:
class(obj)

In [None]:
saveRDS(obj, file = "../data/ExpressionSetobj.rds")

In [7]:
dim(phenoData(obj))

In [61]:
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")

In [62]:
attributes = listAttributes(ensembl)

Arguments of `getBM` {biomaRt}, from the R documentation:

| Argument | Description |
|---|---|
| attributes | Attributes you want to retrieve. A possible list of attributes can be retrieved using the function listAttributes. |
| filters | Filters (one or more) that should be used in the query. A possible list of filters can be retrieved using the function listFilters. |
| values | Values of the filter, e.g. vector of affy IDs. If multiple filters are specified then the argument should be a list of vectors of which the position of each vector corresponds to the position of the filters in the filters argument. |
| mart | object of class Mart, created with the useMart function. |
| curl | An optional 'CURLHandle' object, that can be used to speed up getBM when used in a loop. |
| checkFilters | Sometimes attributes where a value needs to be specified, for example `upstream_flank` with value 20 for obtaining upstream sequence flank regions of length 20bp, are treated as filters in BioMarts. To enable such a query to work, one must specify the attribute as a filter and set checkFilters = FALSE for the query to work. |
| verbose | When using biomaRt in webservice mode and setting verbose to TRUE, the XML query to the webservice will be printed. |
| uniqueRows | If the result of a query contains multiple identical rows, setting this argument to TRUE (default) will result in deleting the duplicated rows in the query result at the server side. |
| bmHeader | Boolean to indicate if the result retrieved from the BioMart server should include the data headers or not, defaults to FALSE. This should only be switched on if the default behavior results in errors, setting to on might still be able to retrieve your data in that case |
| quote | Sometimes parsing of the results fails due to errors in the Ensembl data fields such as containing a quote, in such cases you can try to change the value of quote to try to still parse the results. |


In [8]:
dim(obj)

In [9]:
sample_names=as.vector(as.character(colnames(exprs(obj))))
head(sample_names)
length(sample_names)

In [10]:
pheno_sample_names=as.vector(as.character(rownames(pData(obj))))
head(pheno_sample_names)
length(pheno_sample_names)

Okay, for some reason our phenotype data has more samples than our expression data. I've written to Joe Paulson about that.
In the meantime, make sure that the two sets are aligned.
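A minimal sketch of such an alignment on toy data: it keeps only the phenotype rows that have expression data and also puts them in the same order as the expression columns. The names `expr_samples` and `pheno` and the sample IDs are placeholders for `colnames(exprs(obj))` and `pData(obj)`; the reordering with `match` is an extra safeguard assumed here, not a step taken from the cells in this notebook.

```r
# Toy stand-ins for the expression sample names and the phenotype table
expr_samples <- c("S1", "S2", "S3")
pheno <- data.frame(SAMPID = c("S3", "S1", "SX", "S2"),
                    SEX    = c(2, 1, 1, 2),
                    row.names = c("S3", "S1", "SX", "S2"))

# Keep only phenotype rows that have expression data ...
pheno <- pheno[rownames(pheno) %in% expr_samples, , drop = FALSE]
# ... and put them in the same order as the expression columns
pheno <- pheno[match(expr_samples, rownames(pheno)), , drop = FALSE]

# Fails loudly if the two sets are still not aligned
stopifnot(identical(rownames(pheno), expr_samples))
pheno
```

Checking the order matters because downstream calls pair the columns of the count matrix with phenotype rows by position, not by name.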

In [11]:
logical_match_names=pheno_sample_names %in% sample_names
length(logical_match_names)

In [12]:
table(logical_match_names)


logical_match_names
FALSE  TRUE 
    2 17382 

In [13]:
pData(obj) <- (pData(obj)[logical_match_names==TRUE,])

Now we want to replace all *dashes* with *periods*

In [14]:
newSampID <- gsub('-','\\.',pData(obj)$SAMPID)

In [15]:
head (newSampID)

In [16]:
pData(obj)$SAMPID <- newSampID
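A small sketch of what the substitution does, using one sample ID from this dataset. The second call with `fixed = TRUE` is an alternative spelling, shown only for comparison, that treats the pattern and the replacement as plain text.

```r
id <- "GTEX-1117F-0226-SM-5GZZ7"

# Same form as the call above: in the replacement string "\\." yields a literal "."
dotted <- gsub("-", "\\.", id)
dotted

# Alternative without regular-expression escaping
dotted_fixed <- gsub("-", ".", id, fixed = TRUE)
identical(dotted, dotted_fixed)
```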

In [18]:
colnames(pData(obj))

In [60]:
head(exprs(obj))

| | GTEX-1117F-0226-SM-5GZZ7 | GTEX-1117F-0426-SM-5EGHI | GTEX-1117F-0526-SM-5EGHJ | GTEX-1117F-0626-SM-5N9CS | GTEX-1117F-0726-SM-5GIEN | GTEX-1117F-1326-SM-5EGHH | GTEX-1117F-2426-SM-5EGGH | GTEX-1117F-2526-SM-5GZY6 | GTEX-1117F-2826-SM-5GZXL | GTEX-1117F-2926-SM-5GZYI | ⋯ | GTEX-ZZPU-1126-SM-5N9CW | GTEX-ZZPU-1226-SM-5N9CK | GTEX-ZZPU-1326-SM-5GZWS | GTEX-ZZPU-1426-SM-5GZZ6 | GTEX-ZZPU-1826-SM-5E43L | GTEX-ZZPU-2126-SM-5EGIU | GTEX-ZZPU-2226-SM-5EGIV | GTEX-ZZPU-2426-SM-5E44I | GTEX-ZZPU-2626-SM-5E45Y | GTEX-ZZPU-2726-SM-5NQ8O |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000223972.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| ENSG00000227232.5 | 187 | 109 | 143 | 251 | 113 | 139 | 199 | 473 | 286 | 306 | ⋯ | 72 | 96 | 136 | 79 | 89 | 86 | 49 | 84 | 34 | 66 |
| ENSG00000278267.1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000243485.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ⋯ | 0 | 0 | 1 | 0 | 2 | 2 | 0 | 1 | 0 | 0 |
| ENSG00000237613.2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ⋯ | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000268020.3 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 2 | ⋯ | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 1 | 1 |


In [19]:
tissueFactors <- factor(pData(obj)$SMTS)

In [20]:
table(tissueFactors)

tissueFactors
 Adipose Tissue   Adrenal Gland         Bladder           Blood    Blood Vessel 
           1204             258              21             929            1335 
          Brain          Breast    Cervix Uteri           Colon       Esophagus 
           2642             459              19             779            1445 
 Fallopian Tube           Heart          Kidney           Liver            Lung 
              9             861              89             226             578 
         Muscle           Nerve           Ovary        Pancreas       Pituitary 
            803             619             180             328             283 
       Prostate  Salivary Gland            Skin Small Intestine          Spleen 
            245             162            1809             187             241 
        Stomach          Testis         Thyroid          Uterus          Vagina 
            359             361             653             142             156 

In [21]:
# SEX is coded 1 == Male
#              2 == Female
sex <- pData(obj)$SEX
age <- pData(obj)$AGE
# cod <- circumstances of death (DTHHRDY, the Hardy scale death classification)
cod <- pData(obj)$DTHHRDY


In [22]:
table(sex)
table(age)
table(cod)

sex
    1     2 
11584  5798 

age
20-29 30-39 40-49 50-59 60-69 70-79 
 1320  1323  2702  5615  5821   601 

cod
   0    1    2    3    4 
8814  711 4839  868 2039 

Now let us do the differential analysis using edgeR

In [23]:
x <- exprs(obj)

In [24]:
dim(x)

To use the `DGEList` function from edgeR, the length of `group` must equal the number of columns in our counts (`x`),
so samples have to be in columns. `exprs(obj)` already returns genes in rows and samples in columns.

`DGEList(counts = x, group = group)` raises an error if the length of `group` is not equal to the number of columns in `counts`.
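The requirement can be checked up front, before the expensive call. The sketch below does this on a toy count matrix with genes in rows and samples in columns; the matrix and the sex codes are invented, and only base R is used so that it runs without edgeR.

```r
# Toy count matrix: 3 genes (rows) x 4 samples (columns)
x_toy <- matrix(c(10, 0, 5, 3,
                   7, 2, 0, 1,
                   4, 4, 9, 6),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("gene", 1:3), paste0("S", 1:4)))
group_toy <- factor(c(1, 2, 2, 1))   # one sex code per sample

# One group label per column of the count matrix
n_labels  <- length(group_toy)
n_samples <- ncol(x_toy)
stopifnot(n_labels == n_samples)
```

Running the same `stopifnot` on the real `x` and `group` gives a clear message early instead of a failure inside `DGEList`.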

In [25]:
group <- factor(pData(obj)$SEX)

In [28]:
y <- DGEList(counts=x, group=group)

I keep running out of memory on this step, so I calculated the DGEList on my laptop,
saved it, and uploaded it to this larger-memory machine.

In [30]:
attributes(y)

In [None]:
#caution this step takes a lot of memory and time

In [31]:
y <- calcNormFactors(y)

In [32]:
saveRDS(y, file = "../data/DGENormFactorsy.rds")

In [None]:
# For Guy -- does this do what you are expecting -- I am confused because what you get when you
#        ask for the min (table(groups)) is the smaller sized group -- which in this case is 
#        female -- it will help the reader to know what you are doing here with the statement.
#        one can read what it is doing but not understand your objective.

In [33]:
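# Keep genes with CPM > 1 in at least 0.25 * (size of the smaller sex group) samples,
# and require this separately within males (1) and within females (2)
groups <- pData(obj)$SEX
keep.events <- rep(TRUE, nrow(y))
for (sex_code in c(1,2)) {   # loop variable renamed so the `group` factor above is not overwritten
    keep.events <- keep.events & 
                   rowSums(cpm(y[,groups %in% sex_code]) > 1) >= 0.25*min(table(groups))
}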
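To make the intent of this filter easier to see, here is a base R sketch of the same rule on a toy count matrix. It is a simplification under stated assumptions: CPM is computed from raw library sizes, whereas edgeR's `cpm()` on a `DGEList` also scales library sizes by the normalization factors. The counts and group codes are invented.

```r
# Toy counts: 4 genes x 6 samples; sex coded 1 = male, 2 = female
counts <- matrix(c(50, 60, 55, 40, 45, 50,    # gene1: detected in every sample
                    0,  0,  1,  0,  0,  0,    # gene2: detected in one male only
                   30, 25, 35,  0,  0,  0,    # gene3: detected in males only
                   20,  0,  0, 20,  0,  0),   # gene4: detected in one sample per sex
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), NULL))
groups_toy <- c(1, 1, 1, 2, 2, 2)

# Counts per million from raw library sizes (simplified version of cpm())
cpm_mat <- t(t(counts) / colSums(counts)) * 1e6

# Minimum number of samples with CPM > 1, set by the smaller group
min_samples <- 0.25 * min(table(groups_toy))

keep <- rep(TRUE, nrow(counts))
for (g in unique(groups_toy)) {
  in_group <- groups_toy == g
  keep <- keep & rowSums(cpm_mat[, in_group, drop = FALSE] > 1) >= min_samples
}
keep
```

In this toy example only gene1 and gene4 pass, because the rule requires a gene to be detected in both sexes. Using the size of the smaller group as the yardstick means the same sample-count threshold applies to males and females even though the groups differ in size.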


In [37]:
reduced_y<- y[keep.events,]

In [100]:
reduced_obj <- obj[keep.events==TRUE,]

In [None]:
saveRDS(reduced_obj, file = "../data/ExpressionObjectReducedObj.rds")

In [101]:
dim(reduced_obj)

In [44]:
saveRDS(reduced_y, file = "../data/DGENormFactorsReducedy.rds")
saveRDS(reduced_obj, file = "../data/reduced_obj.rds")

In [45]:
# make the design based upon sex 
design <- model.matrix(~factor(pData(obj)$SEX))

In [None]:
# From voom function description, we see ‘voom’ is an acronym for mean-variance modelling at the
#     observational level. The idea is to estimate the mean-variance
#     relationship in the data, then use this to compute an appropriate
#     precision weight for each observation. 
#     ‘voom’ performs the following specific calculations. First, the
#     counts are converted to logCPM values, adding 0.5 to all the
#     counts to avoid taking the logarithm of zero. 
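The logCPM conversion described in the comment can be written out in a few lines of base R. This sketch follows the transformation as it appears in limma's `voom` source, where 0.5 is added to each count and 1 to each library size. The toy counts are invented, and raw column sums stand in for the effective library sizes that `voom` takes from a `DGEList`.

```r
# Toy counts: 3 genes x 2 samples
counts <- matrix(c( 0, 10,
                    5,  0,
                   95, 90),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3), c("S1", "S2")))
lib_size <- colSums(counts)   # raw library size per sample

# log2 counts per million with offsets: +0.5 on counts, +1 on library sizes
log_cpm <- t(log2((t(counts) + 0.5) / (lib_size + 1) * 1e6))
log_cpm
```

Because of the 0.5 offset, zero counts map to a finite value rather than `-Inf`, which is what lets the later linear model use every observation.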

In [47]:
v <- voom (reduced_y, design)

In [None]:
# make a linear fit model based upon the model.matrix

In [49]:
fit <- lmFit(v, design)

In [53]:
install.packages("statmod")
library(statmod)

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done


In [54]:
fit <- eBayes(fit, robust=TRUE)

In [56]:
# extract a table of the top-ranked genes from a linear fit model.
results = topTable(fit, coef='factor(pData(obj)$SEX)2',number=nrow(reduced_y))
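The `coef` string passed to `topTable` is the column name that `model.matrix` generated for the second level of the sex factor. A toy sketch, with an invented `sex_toy` vector, shows how such names are formed:

```r
sex_toy <- c(1, 2, 2, 1)                    # toy sex codes: 1 = male, 2 = female
design_toy <- model.matrix(~ factor(sex_toy))

# Column names are the term text followed by the factor level
coef_names <- colnames(design_toy)
coef_names
```

Here the second column is named `factor(sex_toy)2`, in the same way that `~factor(pData(obj)$SEX)` produces `factor(pData(obj)$SEX)2`. Reading the name from `colnames(design)` avoids typing it by hand.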


In [58]:
write.table(results,'../data/all_FC_results.txt',sep='\t',quote = F)

In [105]:
# separate the analysis by male and by female
# then do the differential analysis regarding tissue
# then do the differential analysis male tissue vs female tissue

In [106]:
reduced_male <- pData(reduced_obj)$SEX==1

In [118]:
reduced_female <- pData(reduced_obj)$SEX==2

In [111]:
sum(reduced_male==TRUE)

In [114]:
length(reduced_male)

In [113]:
dim(reduced_obj)

In [115]:
reduced_obj_male <- reduced_obj[,reduced_male==TRUE]

In [116]:
dim(reduced_obj_male)

In [119]:
reduced_obj_female <- reduced_obj[,reduced_female==TRUE]

In [120]:
dim(reduced_obj_female)

In [121]:
tissue_groups_male <- factor(pData(reduced_obj_male)$SMTS)

In [122]:
tissue_groups_female <- factor(pData(reduced_obj_female)$SMTS)

In [129]:
# good sanity check: the male set has no Cervix Uteri, Fallopian Tube, Ovary, Uterus or Vagina samples

In [130]:
table (tissue_groups_male)

tissue_groups_male
 Adipose Tissue   Adrenal Gland         Bladder           Blood    Blood Vessel 
            816             157              14             613             879 
          Brain          Breast           Colon       Esophagus           Heart 
           1914             291             499             952             587 
         Kidney           Liver            Lung          Muscle           Nerve 
             69             161             395             543             419 
       Pancreas       Pituitary        Prostate  Salivary Gland            Skin 
            207             204             245             115            1208 
Small Intestine          Spleen         Stomach          Testis         Thyroid 
            120             154             227             361             434 

In [131]:
# and the females have no prostate or testis

In [132]:
table(tissue_groups_female)

tissue_groups_female
 Adipose Tissue   Adrenal Gland         Bladder           Blood    Blood Vessel 
            388             101               7             316             456 
          Brain          Breast    Cervix Uteri           Colon       Esophagus 
            728             168              19             280             493 
 Fallopian Tube           Heart          Kidney           Liver            Lung 
              9             274              20              65             183 
         Muscle           Nerve           Ovary        Pancreas       Pituitary 
            260             200             180             121              79 
 Salivary Gland            Skin Small Intestine          Spleen         Stomach 
             47             601              67              87             132 
        Thyroid          Uterus          Vagina 
            219             142             156 

In [133]:
y_tissue_male <- DGEList(counts=exprs(reduced_obj_male), group=tissue_groups_male)

In [134]:
y_tissue_female <- DGEList(counts=exprs(reduced_obj_female), group=tissue_groups_female)

In [136]:
# make the design based upon sex specific tissues 
design_male <- model.matrix(~tissue_groups_male)
design_female <- model.matrix(~tissue_groups_female)

In [137]:
y_tissue_male <- calcNormFactors(y_tissue_male)
y_tissue_female <- calcNormFactors(y_tissue_female)

In [138]:
y_voom_male <- voom (y_tissue_male, design_male)
y_voom_female <- voom (y_tissue_female, design_female)

In [139]:
y_fit_male <- lmFit(y_voom_male, design_male)
y_fit_female <- lmFit(y_voom_female, design_female)

In [140]:
y_fit_male <- eBayes(y_fit_male, robust=TRUE)
y_fit_female <- eBayes(y_fit_female, robust=TRUE)

In [147]:
y_fit_male_tissues = colnames(y_fit_male)
y_fit_male_tissues = y_fit_male_tissues[-1]
y_fit_male_tissues

In [148]:
y_fit_female_tissues = colnames(y_fit_female)
y_fit_female_tissues = y_fit_female_tissues[-1]
y_fit_female_tissues

In [151]:
for (tissue_group in y_fit_male_tissues) {
    results = topTable(y_fit_male, coef=tissue_group,number=nrow(y_tissue_male))
    assign(paste("results",tissue_group, sep="_"),results)
    filename = paste("../data", tissue_group,sep="/")
    write.table(results,filename,sep='\t',quote = F)
}

In [152]:
for (tissue_group in y_fit_female_tissues) {
    
    results = topTable(y_fit_female, coef=tissue_group,number=nrow(y_tissue_female))
    assign(paste("results",tissue_group, sep="_"),results)
    filename = paste("../data", tissue_group,sep="/")
    write.table(results,filename,sep='\t',quote = F)
}

In [None]:
# Reproducing Guy's results using the yarn expression object:
# loop through the tissues and, for those tissues that are shared between the two sexes,
# perform a differential gene analysis on a per-tissue basis


In [155]:
tissue_groups <- factor(pData(reduced_obj)$SMTS)

In [163]:
tissue_male_female <- tissue_groups_male %in% tissue_groups_female
table(tissue_male_female)

tissue_male_female
FALSE  TRUE 
  606 10978 

In [165]:
tissue_shared_male_female <- factor(tissue_groups_male[tissue_male_female])
table(tissue_shared_male_female)

tissue_shared_male_female
 Adipose Tissue   Adrenal Gland         Bladder           Blood    Blood Vessel 
            816             157              14             613             879 
          Brain          Breast           Colon       Esophagus           Heart 
           1914             291             499             952             587 
         Kidney           Liver            Lung          Muscle           Nerve 
             69             161             395             543             419 
       Pancreas       Pituitary  Salivary Gland            Skin Small Intestine 
            207             204             115            1208             120 
         Spleen         Stomach         Thyroid 
            154             227             434 
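When only the names of the shared tissues are needed, rather than one label per sample, the factor levels can be intersected directly. A sketch with invented per-sample labels (`tissue_male_toy` and `tissue_female_toy` are placeholders for the two tissue factors above):

```r
# Toy per-sample tissue labels for each sex
tissue_male_toy   <- factor(c("Heart", "Heart", "Testis", "Liver", "Prostate"))
tissue_female_toy <- factor(c("Heart", "Liver", "Liver", "Uterus"))

# Tissues sampled in both sexes: one entry per tissue, not per sample
shared_tissues <- intersect(levels(tissue_male_toy), levels(tissue_female_toy))
shared_tissues
```

A vector like this has one element per tissue, which makes it a convenient thing to loop over in a per-tissue analysis.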

In [1]:
sex = factor(pData(reduced_obj)$SEX)

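ERROR: Error in pData(reduced_obj): could not find function "pData"

This cell was run in a fresh R session (note the `In [1]` counter), so the packages from "Loading dependencies" have to be loaded again before `pData` is available.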


In [None]:
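# Iterate over each shared tissue once: looping over the factor itself would
# repeat the whole analysis once per sample (10978 iterations)
for (tissue in levels(tissue_shared_male_female)) {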
    tissue_true   <- pData(reduced_obj)$SMTS==tissue
    tissue_obj    <- reduced_obj[,tissue_true==TRUE]
    tissue_sex    <- factor(pData(tissue_obj)$SEX)
    tissue_design <- model.matrix(~tissue_sex)
    dim(tissue_obj)
    head(tissue_design)
    y_tissue      <- DGEList(counts=exprs(tissue_obj), group=tissue_sex)
    y_tissue      <- calcNormFactors(y_tissue)
    y_tissue_voom <- voom (y_tissue, tissue_design)
    fit_tissue    <- lmFit(y_tissue_voom, tissue_design)
    fit_tissue    <- eBayes(fit_tissue, robust=TRUE)
    results_tissue<- topTable(fit_tissue, coef='tissue_sex2', number=nrow(y_tissue))
    assign(paste("results",tissue, sep="_"),results_tissue)
    filename = paste(paste("../data", tissue,sep="/"),"DGE.txt",sep="_")
    write.table(results_tissue,filename,sep='\t',quote = F)
}

| | (Intercept) | tissue_sex2 |
|---|---|---|
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 1 |
| 4 | 1 | 0 |
| 5 | 1 | 1 |
| 6 | 1 | 0 |


In [178]:
head(results_tissue)

| | logFC | AveExpr | t | P.Value | adj.P.Val | B |
|---|---|---|---|---|---|---|
| ENSG00000147050.14 | 0.7512061 | 5.409792 | 23.74446 | 3.147853e-65 | 5.286505e-61 | 135.21196 |
| ENSG00000005889.15 | 0.6461228 | 5.41013 | 20.96543 | 2.15225e-56 | 1.807244e-52 | 115.79538 |
| ENSG00000173674.10 | 0.6971714 | 5.42208 | 20.54005 | 5.222193e-55 | 2.192538e-51 | 112.73974 |
| ENSG00000126012.11 | 0.5099286 | 7.36082 | 20.13642 | 1.094636e-53 | 3.676663e-50 | 109.54874 |
| ENSG00000184368.15 | 2.2040217 | -1.606367 | 20.89722 | 3.584495e-56 | 2.0066e-52 | 95.32512 |
| ENSG00000130021.13 | 0.5945241 | 3.840276 | 15.92866 | 1.379736e-39 | 3.86188e-36 | 77.91256 |
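A sketch of how per-tissue result tables like this one could be combined into the tissue-by-tissue correlation matrix. It writes toy tables to a temporary folder and reads them back; the folder `dge_demo`, the gene IDs, the tissue list and the values are all hypothetical. The key step is aligning rows by gene ID, since `topTable` output is sorted by significance and the row order differs between tissues.

```r
set.seed(42)
out_dir <- file.path(tempdir(), "dge_demo")   # hypothetical output folder
dir.create(out_dir, showWarnings = FALSE)
genes   <- paste0("gene", 1:50)
tissues <- c("Heart", "Liver", "Lung")

# Write one toy result table per tissue, rows in a different order each time
for (tissue in tissues) {
  res <- data.frame(logFC = rnorm(length(genes)), row.names = sample(genes))
  write.table(res, file.path(out_dir, paste0(tissue, "_DGE.txt")),
              sep = "\t", quote = FALSE)
}

# Read the tables back and align them by gene ID before combining
logfc <- sapply(tissues, function(tissue) {
  res <- read.table(file.path(out_dir, paste0(tissue, "_DGE.txt")),
                    sep = "\t", header = TRUE, row.names = 1)
  res[genes, "logFC"]
})
rownames(logfc) <- genes

# Tissue-by-tissue correlation of the fold-change vectors
fc_cor <- cor(logfc)
round(fc_cor, 2)
```

Indexing each table with the same `genes` vector guarantees that row i refers to the same gene in every column before `cor()` is applied.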
