### Study: "Building an optimal predictive model for imputing tissue-specific gene expression by combining genotype and whole blood transcriptome data"
### Method: MERGED_fixed_100
### Author/Maintainer: Sunwoo Jung < s17171717s@gmail.com >

### This code is for:
### instructions on how to access and explore the imputation models generated by the method named MERGED_fixed_100.

In [1]:
# You can SKIP this cell if the Rdata file (sample_models_m.Rdata) and this Jupyter notebook are in the same directory.
# Otherwise, set your working directory to the folder that contains the Rdata file.
# The path below is the one used by the author.
setwd('/Users/sunwoosimac/downloads/models')

In [2]:
# Load the data.
load('sample_models_m.Rdata')

In [3]:
# The imputation models are stored in a list.
# Loading the file creates a list named 'models'.
ls()

In [4]:
print(class(models))

[1] "list"


In [5]:
# The length of the list 'models' is the number of genes available for the target tissue.
# This sample file contains only 2000 genes randomly selected from the 'GTEx v7 Adipose Subcutaneous' tissue.
N_gene <- length(models)
print(paste0('The imputation weights for < ', N_gene, ' genes > are available.'))

[1] "The imputation weights for < 2000 genes > are available."


In [6]:
# Each element of the List 'models' contains the imputation model for each gene of a target tissue.
# Each element of the List 'models' is named after the corresponding gene ID.
# For example, the first element of the List contains the imputation model (weights) for the gene ENSG00000147642.
print(names(models)[1])

[1] "ENSG00000147642"


In [7]:
# Here are some other examples.
print(names(models)[c(1:10,1991:2000)])

 [1] "ENSG00000147642" "ENSG00000124225" "ENSG00000157470" "ENSG00000149294"
 [5] "ENSG00000228290" "ENSG00000240747" "ENSG00000196712" "ENSG00000184182"
 [9] "ENSG00000237298" "ENSG00000171824" "ENSG00000104133" "ENSG00000116649"
[13] "ENSG00000102931" "ENSG00000196367" "ENSG00000186088" "ENSG00000059377"
[17] "ENSG00000011600" "ENSG00000204632" "ENSG00000112796" "ENSG00000131408"


In [8]:
# Each element of 'models' is itself a list whose item (the imputation model) is named after the corresponding method, in this case 'MERGED_fixed'.
print(names(models[["ENSG00000147642"]]))

[1] "MERGED_fixed"


In [9]:
# The stored item is the matrix that contains the weights needed for
#     imputing the expression level of the corresponding gene. 
# This matrix of the imputation weights is our imputation model. 
print(class(models[["ENSG00000147642"]]$'MERGED_fixed'))

[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
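
Because the weights are stored as a sparse `dgCMatrix`, it can help to attach the Matrix package (a recommended package distributed with R) before working with them, and to flatten the one-column matrix into a named numeric vector when that is more convenient. The sketch below is only an illustration: `toy_model` and its values are made up and stand in for a real entry of `models`.

```r
library(Matrix)  # provides the dgCMatrix class and its methods

# Toy stand-in for models[["ENSG00000147642"]]$MERGED_fixed (made-up values).
toy_model <- Matrix(
  c(0, -0.018, -0.028, 6.7e-05),
  ncol = 1, sparse = TRUE,
  dimnames = list(c("(Intercept)", "SEX", "AGE", "8_110086771_G_A_b37_A"), "s1")
)
print(class(toy_model))

# Flatten to a named numeric vector: names are predictors, values are weights.
weights <- setNames(as.numeric(toy_model[, 1]), rownames(toy_model))
print(weights[c("SEX", "AGE")])

# Count how many weights are exactly zero versus non-zero.
print(table(nonzero = weights != 0))
```

With the real data, the same two lines that build `weights` should work on any one-column model matrix taken from `models`.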


### Each item in the List 'models' is a matrix of the weights needed for imputing the expression level of the corresponding gene.
### The imputation model for the gene ENSG00000147642 can be displayed as shown below.

In [15]:
imputation_model <- models[["ENSG00000147642"]]$'MERGED_fixed'
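
The same two-level access pattern (gene ID first, then the method name) can be repeated for every gene in the list, for example to count the weights stored per gene. This is a minimal sketch, assuming every element carries a `MERGED_fixed` item as in the example above; `toy_models` and `make_toy` are hypothetical stand-ins so the block runs without the Rdata file.

```r
library(Matrix)

# Hypothetical helper: build a one-column sparse weight matrix with made-up values.
make_toy <- function(predictors) {
  Matrix(seq_along(predictors) / 100, ncol = 1, sparse = TRUE,
         dimnames = list(predictors, "s1"))
}

# Two-gene stand-in for the real 'models' list.
toy_models <- list(
  ENSG00000147642 = list(MERGED_fixed = make_toy(c("(Intercept)", "SEX", "AGE"))),
  ENSG00000124225 = list(MERGED_fixed = make_toy(c("(Intercept)", "SEX", "AGE", "PC1")))
)

# Number of rows (weights) stored for each gene.
n_weights <- sapply(toy_models, function(m) nrow(m[["MERGED_fixed"]]))
print(n_weights)

# Check whether a gene of interest has a model before indexing into the list.
has_gene <- "ENSG00000147642" %in% names(toy_models)
print(has_gene)
```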

In [16]:
head(imputation_model, 10)
tail(imputation_model, 10)

10 x 1 sparse Matrix of class "dgCMatrix"
                                 s1
(Intercept)            6.555347e-16
SEX                   -1.766743e-02
AGE                   -2.763743e-02
PC1                   -9.612259e-02
PC2                   -5.006442e-02
PC3                   -6.251501e-02
PC4                    2.266458e-02
PC5                   -1.103225e-01
8_110086771_G_A_b37_A  6.677403e-05
8_110087871_T_A_b37_A  6.811716e-05

10 x 1 sparse Matrix of class "dgCMatrix"
                           s1
ENSG00000128159 -0.0004739657
ENSG00000100429 -0.0028413186
ENSG00000188130 -0.0008489317
ENSG00000185386 -0.0023584904
ENSG00000196576 -0.0025749849
ENSG00000100258  0.0006312531
ENSG00000025770  0.0005358800
ENSG00000177989 -0.0038872774
ENSG00000130487 -0.0018919609
ENSG00000079974  0.0001536264
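
The SNP rows in the output above carry IDs such as `8_110086771_G_A_b37_A`. A minimal sketch for splitting such IDs into fields follows. It assumes the first five fields follow the GTEx v7 variant ID layout (chromosome, position, reference allele, alternate allele, genome build) and that the trailing field is the allele whose dosage the weight applies to. That last interpretation is an assumption that this notebook does not state, and the column names are chosen here for illustration.

```r
# Two SNP IDs copied from the rownames of the example model.
snp_ids <- c("8_110086771_G_A_b37_A", "8_110088888_A_G_b37_G")

# Split on "_" and stack the pieces into a character matrix (one row per SNP).
fields <- do.call(rbind, strsplit(snp_ids, "_", fixed = TRUE))

snp_table <- data.frame(
  id = snp_ids,
  chr = fields[, 1],
  pos = as.integer(fields[, 2]),
  ref = fields[, 3],
  alt = fields[, 4],
  build = fields[, 5],
  counted_allele = fields[, 6],  # assumed meaning of the trailing field
  stringsAsFactors = FALSE
)
print(snp_table)
```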

In [17]:
# The matrix has one row per weight, so the number of rows gives the number of predictors.
N_predictors <- nrow(imputation_model)
print(paste0("The number of predictors: ", N_predictors))
# For the gene ENSG00000147642, the matrix stores 5087 imputation weights (this count includes the intercept row).
# Predictors are either the covariates (sex, age, PCs), the genotypes (SNPs), or the whole blood expression.

[1] "The number of predictors: 5087"
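
The three predictor types can be told apart from the row names alone. The sketch below is one possible way to do it, assuming that whole blood expression predictors are the rows whose names start with `ENSG`, that SNP predictors contain `_b37_`, and that everything else (including the intercept) is a covariate. These naming rules are inferred from the rows displayed in this notebook, not documented by the authors, and `predictor_names` is a small hand-picked sample.

```r
# A handful of row names taken from the example model.
predictor_names <- c(
  "(Intercept)", "SEX", "AGE", "PC1", "PC2", "PC3", "PC4", "PC5",
  "8_110086771_G_A_b37_A", "8_110087871_T_A_b37_A",
  "ENSG00000238160", "ENSG00000131437"
)

# Classify each row name by pattern (assumed naming rules, see the text above).
predictor_type <- ifelse(
  grepl("^ENSG", predictor_names), "blood_expression",
  ifelse(grepl("_b37_", predictor_names, fixed = TRUE), "snp", "covariate")
)

# Tabulate how many predictors fall into each group.
print(table(predictor_type))
```

With the real data, `predictor_names` would be replaced by `rownames(imputation_model)`.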


In [18]:
# The rownames of the matrix are the names of the corresponding predictors.
print(rownames(models[["ENSG00000147642"]]$'MERGED_fixed')[c(1:20,2000:2010)])

 [1] "(Intercept)"           "SEX"                   "AGE"                  
 [4] "PC1"                   "PC2"                   "PC3"                  
 [7] "PC4"                   "PC5"                   "8_110086771_G_A_b37_A"
[10] "8_110087871_T_A_b37_A" "8_110088888_A_G_b37_G" "8_110089913_T_A_b37_A"
[13] "8_110090736_G_T_b37_T" "8_110093634_T_C_b37_C" "8_110093871_A_G_b37_G"
[16] "8_110099209_C_T_b37_T" "8_110102250_A_T_b37_T" "8_110103774_T_C_b37_C"
[19] "8_110105081_A_G_b37_G" "8_110105147_G_A_b37_A" "ENSG00000238160"      
[22] "ENSG00000131437"       "ENSG00000164402"       "ENSG00000213585"      
[25] "ENSG00000081059"       "ENSG00000113575"       "ENSG00000119048"      
[28] "ENSG00000152700"       "ENSG00000113648"       "ENSG00000146021"      
[31] "ENSG00000177733"      
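
Since the row names identify the predictors, one way the weights might be applied to new samples is as a linear combination: the intercept plus the weighted sum of the predictor values. The sketch below rests on assumptions that this notebook does not confirm: that the model is linear, and that the new data have been preprocessed (scaled, coded) in the same way as the training data. `toy_model` and `new_data` are hypothetical objects with made-up values.

```r
library(Matrix)

# Toy weight matrix in the same shape as a real model (made-up values).
toy_model <- Matrix(
  c(0.5, -0.02, 0.1, 0.3),
  ncol = 1, sparse = TRUE,
  dimnames = list(
    c("(Intercept)", "SEX", "8_110086771_G_A_b37_A", "ENSG00000238160"), "s1"
  )
)

# Hypothetical predictor values for two samples (columns named like the model rows).
new_data <- matrix(
  c(1, 2,        # SEX
    0, 2,        # SNP dosage
    1.5, -0.5),  # whole blood expression
  nrow = 2,
  dimnames = list(
    c("sample1", "sample2"),
    c("SEX", "8_110086771_G_A_b37_A", "ENSG00000238160")
  )
)

# Named weight vector; the intercept is handled separately.
weights <- setNames(as.numeric(toy_model[, 1]), rownames(toy_model))
predictors <- setdiff(names(weights), "(Intercept)")

# Every predictor in the model must be present in the new data.
stopifnot(all(predictors %in% colnames(new_data)))

# Align columns to the weight order, then compute intercept + X %*% w.
imputed <- weights[["(Intercept)"]] +
  drop(new_data[, predictors, drop = FALSE] %*% weights[predictors])
print(imputed)
```

Selecting columns by name, rather than by position, keeps the weights and predictor values aligned even if the new data are stored in a different column order.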


### Note: Anyone can freely use the imputation weights stored in the file for any research purpose.