# Prediction based on CaDRReS-Sc pre-trained model
This notebook shows an example of how to load a pre-trained CaDRReS-Sc model and predict drug response for new data.

In [1]:
import sys, os, pickle
from collections import Counter
import importlib
from ipywidgets import widgets
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np

scriptpath = '..'
sys.path.append(os.path.abspath(scriptpath))

from cadrres_sc import pp, model, evaluation, utility
importlib.reload(model)
importlib.reload(evaluation)

<module 'cadrres_sc.evaluation' from 'c:\\Users\\carey\\Desktop\\CaDRReS-Sc\\cadrres_sc\\evaluation.py'>

# Read pre-trained model

In [2]:
model_dir = '../my_models/'

In [3]:
obj_function = widgets.Dropdown(options=['cadrres-wo-sample-bias', 'cadrres-wo-sample-bias-weight'], description='Objective function')

In [4]:
# Choose the model you trained previously
display(obj_function)

Dropdown(description='Objective function', options=('cadrres-wo-sample-bias', 'cadrres-wo-sample-bias-weight'),…

## Load the pre-trained model based on your selection


In [5]:
model_spec_name = obj_function.value
model_file = model_dir + '{}_param_dict.pickle'.format(model_spec_name)

cadrres_model = model.load_model(model_file)

# Read test data
Again, for this example we load the GDSC dataset.
@TODO: GDSC dataset using only essential gene list?

Note: GDSC_exp.tsv can be downloaded from https://www.dropbox.com/s/3v576mspw5yewbm/GDSC_exp.tsv?dl=0

## Notes for other test data

You can apply the model to other gene expression datasets. The input gene expression matrix should already be normalized, i.e. **for each sample, expression values are comparable across genes**.

In this example the gene expression matrix provided by GDSC is already normalized using RMA.

For RNA-seq data, read counts should be normalized by gene length, using a normalization method such as TPM.
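As a rough illustration of the RNA-seq note above, the sketch below converts a genes x samples read-count matrix to log2(TPM + 1). The function name `counts_to_log2_tpm`, the pseudocount, and the toy inputs are hypothetical and not part of CaDRReS-Sc. The log2 step is an assumption, made because the next step of this notebook appears to expect log2-scale values.

```python
import numpy as np

def counts_to_log2_tpm(counts, gene_lengths_bp, pseudocount=1.0):
    """Convert a genes x samples read-count matrix to log2(TPM + pseudocount).

    Hypothetical helper for illustration; not part of cadrres_sc.
    """
    counts = np.asarray(counts, dtype=float)
    # Gene lengths in kilobases, shaped as a column for broadcasting
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float)[:, None] / 1000.0
    # Reads per kilobase: removes the gene-length bias within each sample
    rate = counts / lengths_kb
    # Scale each sample (column) so that its TPM values sum to one million
    tpm = rate / rate.sum(axis=0, keepdims=True) * 1e6
    return np.log2(tpm + pseudocount)

# Toy data: 3 genes x 2 samples
toy_counts = np.array([[100, 200], [300, 600], [50, 0]])
toy_lengths = np.array([1000, 3000, 500])
toy_log2_tpm = counts_to_log2_tpm(toy_counts, toy_lengths)
```

Because TPM divides by gene length before scaling each sample, values become comparable across genes within a sample, which is the property the model input requires.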

In [6]:
gene_exp_df = pd.read_csv('../data/GDSC/GDSC_exp.tsv', sep='\t', index_col=0)
gene_exp_df = gene_exp_df.groupby(gene_exp_df.index).mean()
print("Dataframe shape:", gene_exp_df.shape, "\n")
gene_exp_df.head(2)

Dataframe shape: (17419, 1018) 



GENE,906826,687983,910927,1240138,1240139,906792,910688,1240135,1290812,907045,...,753584,907044,998184,908145,1659787,1298157,1480372,1298533,930299,905954.1
A1BG,6.208447,5.02581,5.506955,4.208349,3.399366,4.917872,3.828088,5.146903,3.107543,5.062066,...,4.272172,3.435025,4.930052,2.900213,4.523712,5.074951,2.957153,3.089628,4.047364,5.329524
A1CF,2.981775,2.947547,2.872071,3.075478,2.853231,3.221491,2.996355,2.893977,2.755668,2.98565,...,2.941659,3.155536,2.983619,3.118312,2.975409,2.905804,2.944488,2.780003,2.870819,2.926353
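The `groupby(...).mean()` call in the cell above collapses rows that share the same gene symbol into a single averaged row. A minimal sketch on toy data (the values and sample names here are made up for illustration):

```python
import pandas as pd

# Toy expression matrix in which the gene A1BG appears twice
toy_exp_df = pd.DataFrame(
    {"s1": [1.0, 3.0, 5.0], "s2": [2.0, 4.0, 6.0]},
    index=["A1BG", "A1BG", "A1CF"],
)

# Average duplicated gene rows so that every gene symbol is unique
dedup_df = toy_exp_df.groupby(toy_exp_df.index).mean()
```

After this step the index is unique, so label-based lookups such as `dedup_df.loc["A1BG"]` return a single row.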


## Calculate fold-change
We normalize the baseline expression of each gene by computing its log2 fold-change relative to the gene's average expression across cell lines. As the name `normalize_log2_mean_fc` suggests, the mean is used as the reference.

In [7]:
cell_line_log2_mean_fc_exp_df, cell_line_mean_exp_df = pp.gexp.normalize_log2_mean_fc(gene_exp_df)
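The following is a minimal sketch of what a per-gene fold-change normalization could look like. It assumes, based only on the function name, that the reference is the per-gene mean and that the input is already on a log2 scale, so the fold-change becomes a subtraction. `log2_mean_fc_sketch` is a hypothetical name, and the actual `pp.gexp.normalize_log2_mean_fc` may differ.

```python
import pandas as pd

def log2_mean_fc_sketch(log2_exp_df):
    """Return (fold-change matrix, per-gene mean) for a genes x samples frame."""
    # Mean log2 expression of each gene across all samples
    gene_mean = log2_exp_df.mean(axis=1)
    # On a log2 scale, a fold-change is a difference from the reference
    fc_df = log2_exp_df.sub(gene_mean, axis=0)
    return fc_df, gene_mean.to_frame("mean")

toy_log2_df = pd.DataFrame(
    {"c1": [4.0, 2.0], "c2": [6.0, 2.0], "c3": [8.0, 5.0]},
    index=["G1", "G2"],
)
toy_fc_df, toy_mean_df = log2_mean_fc_sketch(toy_log2_df)
```

Each row of `toy_fc_df` is centered at zero, so positive values mark samples in which the gene is expressed above its cross-sample average.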

## Read essential genes list

Alternatively, load a different gene list if your model was trained on a specific set of genes.

In [8]:
ess_gene_list = utility.get_gene_list('../data/essential_genes.txt')

## Calculate kernel feature 

Kernel features are computed from all given cell-line samples with gene expression profiles, restricted to a list of genes (e.g. the essential gene list). This step can be slow for large datasets; the log below reports progress every 100 samples.

In [9]:
test_kernel_df = pp.gexp.calculate_kernel_feature(cell_line_log2_mean_fc_exp_df, cell_line_log2_mean_fc_exp_df, ess_gene_list)

Calculating kernel features based on 1610 common genes
(17419, 1018) (17419, 1018)
100 of 1018 (78.83)s
200 of 1018 (84.83)s
300 of 1018 (84.38)s
400 of 1018 (83.44)s
500 of 1018 (82.86)s
600 of 1018 (81.79)s
700 of 1018 (84.61)s
800 of 1018 (82.88)s
900 of 1018 (82.35)s
1000 of 1018 (89.66)s
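The kernel matrix printed further below has ones on the diagonal and values between -1 and 1, which is consistent with a sample-to-sample correlation. Under that assumption, a vectorized sketch might look like the following. `correlation_kernel_sketch` is a hypothetical name, and the actual `pp.gexp.calculate_kernel_feature` may use a different similarity measure or gene filtering.

```python
import numpy as np

def correlation_kernel_sketch(test_fc, train_fc):
    """Pearson correlation between every test and every training sample.

    Both inputs are genes x samples arrays restricted to the same genes.
    """
    n_test = test_fc.shape[1]
    # np.corrcoef treats rows as variables, so pass samples as rows
    corr = np.corrcoef(test_fc.T, train_fc.T)
    # Keep only the test-versus-train block of the joint matrix
    return corr[:n_test, n_test:]

rng = np.random.default_rng(0)
toy_fc = rng.normal(size=(50, 4))  # 50 genes x 4 samples
toy_kernel = correlation_kernel_sketch(toy_fc, toy_fc)
```

When the same matrix is passed for both arguments, as in this notebook, the result is square and symmetric with a diagonal of ones.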


In [10]:
print("Dataframe shape:", test_kernel_df.shape, "\n")
test_kernel_df.head(2)

Dataframe shape: (1018, 1018) 



Unnamed: 0,906826,687983,910927,1240138,1240139,906792,910688,1240135,1290812,907045,...,753584,907044,998184,908145,1659787,1298157,1480372,1298533,930299,905954.1
906826,1.0,0.054507,0.026621,0.000195,0.181043,-0.010206,-0.091207,0.255585,0.256516,-0.043044,...,0.178078,-0.033405,-0.128262,-0.02086,0.226647,0.225082,0.146886,0.041669,-0.099332,0.044356
687983,0.054507,1.0,0.1515,-0.017105,0.047332,0.061474,-0.11547,0.040432,-0.113185,-0.073907,...,-0.024037,0.027242,0.12131,-0.018611,0.009571,0.044496,0.087031,-0.149296,0.118897,-0.056471


# Drug response prediction

In [11]:
print('Predicting drug response using CaDRReS: {}'.format(model_spec_name))
pred_df, P_test_df = model.predict_from_model(cadrres_model, test_kernel_df, model_spec_name)
print('done!')

Predicting drug response using CaDRReS: cadrres-wo-sample-bias
done!
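For intuition, a latent-factor prediction of this kind can be sketched as a projection of the kernel features into a low-dimensional space, followed by a dot product with drug vectors. All names below (`X`, `W_P`, `Q`, `b_q`) and the exact formula are assumptions for illustration; they are not read from the pickled model and may not match what `model.predict_from_model` computes.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_train, n_drugs, n_dim = 3, 5, 4, 2

X = rng.normal(size=(n_samples, n_train))  # kernel features of new samples
W_P = rng.normal(size=(n_train, n_dim))    # hypothetical projection weights
Q = rng.normal(size=(n_drugs, n_dim))      # hypothetical drug latent vectors
b_q = rng.normal(size=n_drugs)             # hypothetical per-drug bias

# Project each sample into the latent space (one row per sample)
P = X @ W_P
# Predicted response: drug bias plus sample-drug dot product
pred = b_q + P @ Q.T
```

In this sketch `P` has one row per sample and one column per latent dimension, and `pred` has one row per sample and one column per drug, matching the shapes of `P_test_df` and `pred_df` shown below.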


Inspect the model predictions and save them.

In [12]:
print(pred_df.shape)
pred_df.head()

(1018, 226)


Drug ID,1,1001,1003,1004,1005,1006,1007,1008,1009,1010,...,64,71,83,86,87,88,89,9,91,94
906826,3.598188,10.837875,-5.694187,-5.174909,4.153878,0.592987,-6.464035,4.468215,8.026369,5.213609,...,0.804889,3.696694,-2.686636,-1.33279,-2.85316,0.61674,2.130489,-0.962104,4.65308,2.927343
687983,6.998276,11.254932,-4.622645,-4.770331,4.374687,2.828551,-4.713568,3.099338,7.995354,6.175023,...,3.416636,6.156876,-1.37646,-0.309761,-1.344785,1.29353,5.168526,0.613289,7.704615,6.694684
910927,1.578585,10.599277,-7.540745,-7.54908,2.389969,-2.206786,-10.304592,1.71332,6.298277,2.117459,...,1.514638,2.597214,-2.154976,-1.219061,-3.885245,1.169545,2.313805,-0.848955,4.100259,3.549469
1240138,3.24297,11.115442,-4.70432,-4.986453,5.027159,1.255595,-5.958886,4.157858,7.879065,3.316346,...,2.273305,5.629552,-0.818494,-0.354655,-1.280859,1.826223,3.781634,0.368827,5.651509,3.97687
1240139,1.982039,10.493722,-8.400546,-6.967617,2.335411,-2.284234,-8.768306,1.281026,6.405093,3.079599,...,0.398804,2.073469,-2.509752,-1.144668,-3.551622,0.36579,1.13797,-1.088353,3.734904,3.214346


In [13]:
# P_test_df holds the latent vector of each test sample (rows: cell lines, columns: latent dimensions)
print(P_test_df.shape)
P_test_df.head()

(1018, 10)


Unnamed: 0,1,2,3,4,5,6,7,8,9,10
906826,-0.838617,-2.606379,0.255473,0.20004,-0.503588,1.316628,-0.549998,0.472825,-0.919502,0.514386
687983,-0.236683,0.583207,0.171062,-0.118878,1.31035,0.203863,-0.562031,0.797582,-1.332473,1.680585
910927,0.66608,-0.738941,1.475736,0.620604,-1.432304,1.331909,0.762452,-0.550815,-0.612409,-1.398217
1240138,-0.46492,-0.560788,0.298326,-1.31237,-0.303701,1.44901,-0.41871,0.833223,0.618528,-0.615271
1240139,-0.050252,-1.94906,0.428729,1.240224,-0.976912,0.602875,0.493952,0.583569,-1.577828,-1.254461


In [14]:
print('Saving ' + model_dir + '{}_test_pred.csv'.format(model_spec_name))
pred_df.to_csv(model_dir + '{}_test_pred.csv'.format(model_spec_name))

Saving ../my_models/cadrres-wo-sample-bias_test_pred.csv
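One caveat when reloading a saved prediction file: `pd.read_csv` parses numeric-looking sample IDs in the index as integers, while column labels stay strings. The sketch below uses an in-memory buffer and made-up values to show the round trip and the string conversion that keeps label-based filtering consistent.

```python
import io
import pandas as pd

# Toy predictions indexed by string sample IDs, with string drug IDs as columns
toy_pred_df = pd.DataFrame(
    [[3.6, 10.8], [7.0, 11.3]],
    index=["906826", "687983"],
    columns=["1", "1001"],
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_pred_df.to_csv(buffer)
buffer.seek(0)

reloaded_df = pd.read_csv(buffer, index_col=0)
# The index is inferred as integers on reload
parsed_kind = reloaded_df.index.dtype.kind
# Convert back to strings so lookups by string ID work again
reloaded_df.index = reloaded_df.index.astype(str)
```

Without the conversion, `reloaded_df.loc["906826"]` would raise a `KeyError`, because the reloaded index holds the integer 906826 rather than the string.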


---

**Authors:** Chayaporn Suphavilai, Rafael Peres da Silva, Genome Institute of Singapore, Nagarajan Lab, November 2020

---

Reproducibility tips from https://github.com/jupyter-guide/ten-rules-jupyter

# Prediction Evaluation
Using Spearman correlation and NDCG to evaluate performance
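As a reference for the NDCG metric, here is a generic sketch for one sample. It assumes that lower response values mean higher sensitivity and uses a gain of 2^(-observed); both choices, and the name `ndcg_sketch`, are assumptions for illustration and may differ from `evaluation.calculate_ndcg`.

```python
import numpy as np

def ndcg_sketch(obs, pred):
    """NDCG of a predicted drug ranking for a single sample.

    Drugs are ranked from lowest to highest response value.
    """
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    # Assumed gain: more sensitive drugs (lower values) earn a larger gain
    gains = 2.0 ** (-obs)
    # Standard logarithmic position discount: 1 / log2(rank + 1)
    discounts = 1.0 / np.log2(np.arange(2, len(obs) + 2))
    dcg = np.sum(gains[np.argsort(pred)] * discounts)   # predicted order
    idcg = np.sum(gains[np.argsort(obs)] * discounts)   # ideal order
    return dcg / idcg

toy_obs = np.array([-2.0, 0.0, 1.0, 3.0])
ndcg_perfect = ndcg_sketch(toy_obs, toy_obs)
ndcg_reversed = ndcg_sketch(toy_obs, -toy_obs)
```

A perfect ranking scores 1, and any other ordering scores lower, with mistakes near the top of the ranking penalized the most.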

In [15]:
cell_line_obs_df = pd.read_csv('../data/GDSC/gdsc_all_abs_ic50_bayesian_sigmoid_only9dosages.csv', index_col=0)
# Convert sample IDs to strings so they match the gene expression column names
cell_line_obs_df.index = cell_line_obs_df.index.astype(str)
# Keep only the samples that also have gene expression data
cell_line_sample_list = np.array([s for s in cell_line_obs_df.index if s in gene_exp_df.columns])

cell_line_obs_df = cell_line_obs_df.loc[cell_line_sample_list, cadrres_model['drug_list']]

pred_df = pred_df.loc[cell_line_sample_list, cadrres_model['drug_list']]

per_sample_df, per_drug_df = evaluation.calculate_spearman(cell_line_obs_df, pred_df, cell_line_sample_list, cadrres_model['drug_list'])
print(f"Average of sample spearman correlation: {np.nanmean(per_sample_df.values)}")
print(f"Average of drug spearman correlation: {np.nanmean(per_drug_df.values)}")

ndcg = evaluation.calculate_ndcg(cell_line_obs_df, pred_df)
print(f"Average of samples NDCG value: {np.nanmean(ndcg.values)}")

Average of sample spearman correlation: 0.3880890774382629
Average of drug spearman correlation: nan
Average of samples NDCG value: 0.6557105814908604
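Note that `np.nanmean` returns `nan` only when every entry is NaN, so the per-drug result above means that no per-drug correlation was computed successfully. The cause is not visible from this output. The sketch below shows one way to compute a Spearman correlation that ignores missing observations, using ranks followed by a Pearson correlation; `spearman_sketch` and its minimum-pair threshold are hypothetical and not the implementation in `evaluation.calculate_spearman`.

```python
import numpy as np
import pandas as pd

def spearman_sketch(obs, pred, min_pairs=3):
    """Spearman correlation over the pairs where both values are present."""
    paired = pd.concat([obs, pred], axis=1, keys=["obs", "pred"]).dropna()
    # Too few paired values give no meaningful correlation
    if len(paired) < min_pairs:
        return np.nan
    # Spearman is the Pearson correlation of the ranks
    ranks = paired.rank()
    return ranks["obs"].corr(ranks["pred"])

toy_obs = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
toy_pred = pd.Series([0.5, 1.5, 9.0, 3.0, 2.0])
rho = spearman_sketch(toy_obs, toy_pred)

# A drug with no observed values yields NaN rather than an error
rho_missing = spearman_sketch(pd.Series([np.nan] * 5), toy_pred)
```

Checking how many non-missing observations each drug column has (for example with `cell_line_obs_df.notna().sum()`) is a reasonable first step when every per-drug correlation comes back as NaN.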


# Latent Vector Similarity Evaluation


In [None]:
dataset_drug_df = pd.read_csv('../preprocessed_data/GDSC/drug_stat.csv', index_col=0) # hn_drug_stat | drug_stat
list(dataset_drug_df.iloc[:,6])