### Study: "Building an optimal predictive model for imputing tissue-specific gene expression by combining genotype and whole blood transcriptome data"
### Method: MERGED_flexible
### Writer/Maintainer: Sunwoo Jung < s17171717s@gmail.com >

### This code is for:
### Accessing the imputation models generated by the MERGED_flexible model.

In [1]:
# You can skip this cell if the Rdata file (imputation models) and this Jupyter notebook are in the same directory.
# Otherwise, set your working directory to the folder that contains the Rdata file.
# The path below is the one used by the writer.
setwd('/Users/sunwoosimac/downloads/models')

In [2]:
# Load the data.
load('Adipose_Subcutaneous_m.Rdata')

In [3]:
# The imputation models are stored in a list.
# Loading the file creates a list named 'models'.
ls()

In [4]:
print(class(models))

[1] "list"
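
Because `load()` places the stored objects directly into the current workspace, an existing object named `models` would be overwritten. The sketch below shows one way to avoid this by loading into a separate environment. It uses a small temporary file as a stand-in for the real Rdata file, and the names `tmp_file` and `model_env` are illustrative assumptions, not part of the released data.

```r
# Toy stand-in for the real file: a list called 'models' saved to a temporary .Rdata file.
models <- list(GENE_A = list("1" = 1:3))
tmp_file <- tempfile(fileext = ".Rdata")
save(models, file = tmp_file)
rm(models)

# Load into a dedicated environment so nothing in the workspace is overwritten.
model_env <- new.env()
loaded_names <- load(tmp_file, envir = model_env)  # load() returns the names of the restored objects
print(loaded_names)

# Pull the list out of the environment under whatever name is convenient.
models <- get("models", envir = model_env)
print(class(models))
```

With the real data, `tmp_file` would simply be replaced by the path to the tissue file, for example 'Adipose_Subcutaneous_m.Rdata'.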


In [7]:
# The length of the list 'models' equals the number of genes available for the target tissue (here: Adipose Subcutaneous).
N_gene = length(models)
print(paste0('The imputation weights for < ', N_gene, ' genes > are available.'))
# Imputation models for 15216 genes are available for Adipose Subcutaneous.

[1] "The imputation weights for < 15216 genes > are available."


In [8]:
# Each element of the list holds the imputation model for one gene of the target tissue (here: Adipose Subcutaneous).
# Each element is named after the corresponding gene ID.
# For example, the first element holds the imputation model (weights) for the gene ENSG00000237683.
print(names(models)[1])

[1] "ENSG00000237683"


In [10]:
# Here are some other examples.
print(names(models)[c(1:10,15207:15216)])

 [1] "ENSG00000237683" "ENSG00000241860" "ENSG00000228463" "ENSG00000237094"
 [5] "ENSG00000235373" "ENSG00000228327" "ENSG00000237491" "ENSG00000177757"
 [9] "ENSG00000225880" "ENSG00000228794" "ENSG00000272821" "ENSG00000025708"
[13] "ENSG00000177989" "ENSG00000205560" "ENSG00000100288" "ENSG00000008735"
[17] "ENSG00000100299" "ENSG00000251322" "ENSG00000100312" "ENSG00000079974"
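
Since the list is indexed by gene ID, a common first step is to check which genes of interest have a model. The sketch below uses a toy `models` list with two IDs taken from the output above and one made-up ID (`ENSG99999999999`). The version-stripping step is an assumption about the user's own input: it only matters if your gene IDs carry a suffix such as ".5", which the names shown here do not.

```r
# Toy stand-in for the loaded list; the real list has one element per gene.
models <- list(ENSG00000237683 = list(), ENSG00000241860 = list())

# Hypothetical query: one ID with a version suffix, one plain ID, one ID absent from the list.
genes_of_interest <- c("ENSG00000237683.5", "ENSG00000241860", "ENSG99999999999")

# Drop anything after the first dot so the IDs match the unversioned names in 'models'.
query_ids <- sub("\\..*$", "", genes_of_interest)

available   <- intersect(query_ids, names(models))
unavailable <- setdiff(query_ids, names(models))

print(available)
print(unavailable)
```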


In [12]:
# Within each element of the list, the item (imputation model) is named after the value
#     of the regularization ratio parameter (phi) that the MERGED_flexible model selected for the corresponding gene.
# For a detailed explanation, please refer to our study.

# For example,
#     the MERGED_flexible model selected phi = '1'
#     for imputing the expression level of the gene "ENSG00000237683".

print(names(models[["ENSG00000237683"]]))

[1] "1"


In [14]:
# Here are some other examples.
print(names(models[["ENSG00000114993"]]))
print(names(models[["ENSG00000134250"]]))
print(names(models[["ENSG00000162426"]]))
print(names(models[["ENSG00000237438"]]))

[1] "10"
[1] "100"
[1] "100"
[1] "0.1"
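
To see how the selected phi values are distributed across genes, the item names can be collected in one pass. This is a minimal sketch on a toy list with made-up gene names. It assumes, as in the examples above, that each gene stores exactly one item, which also means the weights can be retrieved by position with `[[1]]` without knowing phi in advance.

```r
# Toy stand-in: each gene holds a single item named after its selected phi.
models <- list(
  GENE_A = list("1"   = matrix(0, 2, 1)),
  GENE_B = list("10"  = matrix(0, 2, 1)),
  GENE_C = list("100" = matrix(0, 2, 1)),
  GENE_D = list("100" = matrix(0, 2, 1))
)

# Selected phi per gene (assumes one item per gene, so the first name is the selected value).
selected_phi <- vapply(models, function(m) names(m)[1], character(1))

# How often each phi value was selected.
phi_counts <- table(selected_phi)
print(phi_counts)

# Retrieve a gene's weights by position instead of by phi name.
weights_gene_b <- models[["GENE_B"]][[1]]
print(dim(weights_gene_b))
```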


In [15]:
# In each element of the List, the stored item is the matrix that contains the weights needed for
#     imputing the expression level of the corresponding gene. 
# This matrix of the imputation weights is our imputation model. 
print(class(models[["ENSG00000100299"]]$'1'))

[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"


### Each item of the List 'models' is the matrix of the weights needed for imputing the expression level of the corresponding gene.
### The imputation model for the gene ENSG00000100299 is displayed below.

In [17]:
imputation_model <- models[["ENSG00000100299"]]$'1'

In [18]:
head(imputation_model)
tail(imputation_model)

Loading required package: Matrix

6 x 1 sparse Matrix of class "dgCMatrix"
                       s1
(Intercept)  2.766291e-18
SEX          6.683749e-02
AGE          9.659364e-02
PC1         -8.409612e-02
PC2         -8.724955e-03
PC3          1.802840e-02

6 x 1 sparse Matrix of class "dgCMatrix"
                           s1
ENSG00000100304 -0.0011983941
ENSG00000100364 -0.0004302000
ENSG00000197182 -0.0034459036
ENSG00000205643 -0.0013794376
ENSG00000260708 -0.0041834980
ENSG00000188511  0.0009948689
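
Beyond `head()` and `tail()`, it can help to convert the one-column sparse matrix into a named numeric vector and rank the weights by magnitude. The sketch below does this on a small toy matrix built in the same layout (row names as predictors, a single column `s1`). The predictor names and values are invented for illustration, and whether the real matrices store any exact zeros is not something this notebook shows.

```r
library(Matrix)

# Toy weights in the same layout as the stored models: one sparse column named 's1'.
toy_weights <- Matrix(
  matrix(c(0.5, 0.1, 0, -0.2, 0),
         ncol = 1,
         dimnames = list(c("(Intercept)", "SEX", "SNP_1", "SNP_2", "GENE_X"), "s1")),
  sparse = TRUE
)

# Convert to a plain named vector for easy filtering and sorting.
w <- setNames(as.numeric(toy_weights[, 1]), rownames(toy_weights))

n_predictors <- nrow(toy_weights)   # number of rows, intercept included
nonzero <- w[w != 0]                # keep weights that are not exactly zero
top_weights <- nonzero[order(abs(nonzero), decreasing = TRUE)]

print(n_predictors)
print(top_weights)
```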

In [19]:
N_predictors = nrow(imputation_model)
print(paste0("The number of predictors: ", N_predictors))
# For the gene ENSG00000100299, the imputation weights of 3518 predictors are stored in the matrix (the count includes the intercept row).
# Predictors are genotypes (SNPs) and whole blood expression levels, along with the covariates (sex, age, PCs).

[1] "The number of predictors: 3518"
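
The notebook does not spell out how the weights are applied, so the following is only a sketch under a stated assumption: that the imputed expression is the usual linear combination, intercept plus the sum of weight times predictor value. The function `impute_expression`, the toy predictor names, and the values in `new_data` are all hypothetical. With real data, the predictors would also need the same coding and preprocessing as in the study, which should be checked against the paper.

```r
library(Matrix)

# Hypothetical weights in the same layout as a stored model (one sparse column 's1').
weights <- Matrix(
  matrix(c(0.2, 0.05, 0.3, -0.1),
         ncol = 1,
         dimnames = list(c("(Intercept)", "SEX", "22_100_A_G_b37_G", "ENSG_TOY"), "s1")),
  sparse = TRUE
)

# Hypothetical predictor values: rows are individuals, columns are predictors.
new_data <- matrix(
  c(1, 2,  0.5,
    0, 1, -1.0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("ind1", "ind2"), c("SEX", "22_100_A_G_b37_G", "ENSG_TOY"))
)

# Assumed linear prediction: intercept + sum(weight * predictor value).
impute_expression <- function(weights, new_data) {
  w <- setNames(as.numeric(weights[, 1]), rownames(weights))
  intercept <- w[["(Intercept)"]]
  w <- w[names(w) != "(Intercept)"]
  missing_cols <- setdiff(names(w), colnames(new_data))
  if (length(missing_cols) > 0) {
    stop("Predictors missing from new_data: ", paste(missing_cols, collapse = ", "))
  }
  # Align the columns to the order of the weights before multiplying.
  drop(new_data[, names(w), drop = FALSE] %*% w) + intercept
}

imputed <- impute_expression(weights, new_data)
print(imputed)
```

Matching columns by name rather than by position guards against silently misaligned predictors, which is easy to get wrong when a model has thousands of rows.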


In [20]:
# The rownames of the matrix are the names of the corresponding predictors.
print(rownames(models[["ENSG00000100299"]]$'1')[c(1:20,2000:2010)])

 [1] "(Intercept)"             "SEX"                    
 [3] "AGE"                     "PC1"                    
 [5] "PC2"                     "PC3"                    
 [7] "PC4"                     "PC5"                    
 [9] "22_50561364_G_A_b37_A"   "22_50561417_A_G_b37_G"  
[11] "22_50561590_G_A_b37_A"   "22_50562311_G_C_b37_C"  
[13] "22_50564145_T_A_b37_A"   "22_50564238_G_A_b37_A"  
[15] "22_50564380_C_T_b37_T"   "22_50564419_C_T_b37_T"  
[17] "22_50564442_C_T_b37_T"   "22_50564510_T_C_b37_C"  
[19] "22_50564618_A_AT_b37_AT" "22_50565583_A_G_b37_G"  
[21] "ENSG00000130347"         "ENSG00000130349"        
[23] "ENSG00000081087"         "ENSG00000112335"        
[25] "ENSG00000135535"         "ENSG00000112367"        
[27] "ENSG00000112290"         "ENSG00000168438"        
[29] "ENSG00000004809"         "ENSG00000155115"        
[31] "ENSG00000196591"        
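
The row names above follow three visible patterns: covariates and the intercept, variant IDs containing `_b37_`, and Ensembl gene IDs starting with `ENSG`. The sketch below groups predictors by these patterns. The helper `classify_predictor` and its regular expressions are assumptions based only on the names printed in this notebook, so they may need adjusting for other genes or tissues.

```r
# A handful of row names copied from the output above.
predictor_names <- c("(Intercept)", "SEX", "AGE", "PC1",
                     "22_50561364_G_A_b37_A", "22_50564618_A_AT_b37_AT",
                     "ENSG00000130347")

# Hypothetical helper: assign each predictor to a group based on its name pattern.
classify_predictor <- function(x) {
  ifelse(x == "(Intercept)", "intercept",
  ifelse(grepl("^ENSG[0-9]+$", x), "blood_expression",
  ifelse(grepl("^[0-9XYMT]+_[0-9]+_[ACGT]+_[ACGT]+_b37_[ACGT]+$", x), "genotype",
         "covariate")))
}

predictor_type <- classify_predictor(predictor_names)

# Predictor names grouped by type, and a count per type.
predictors_by_type <- split(predictor_names, predictor_type)
print(table(predictor_type))
```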


### Anyone can freely use the imputation weights stored in the file for any research purpose.