# Prediction based on a pre-trained CaDRReS-Sc model
This notebook shows an example of how to load a pre-trained CaDRReS-Sc model and predict drug response for new data.

In [1]:
import sys, os, pickle
from collections import Counter
import importlib
from ipywidgets import widgets
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np

scriptpath = '..'
sys.path.append(os.path.abspath(scriptpath))

from cadrres_sc import pp, model, evaluation, utility

# Read pre-trained model

In [2]:
model_dir = '../example_result/'

In [3]:
obj_function = widgets.Dropdown(options=['cadrres-wo-sample-bias', 'cadrres-wo-sample-bias-weight', 'cadrres'], description='Objective function')

In [4]:
# Choose the objective function of the model you trained previously
display(obj_function)

Dropdown(description='Objective function', options=('cadrres-wo-sample-bias', 'cadrres-wo-sample-bias-weight', …

## Load the pre-trained model based on your selection


In [5]:
model_spec_name = obj_function.value
model_file = model_dir + '{}_param_dict.pickle'.format(model_spec_name)

cadrres_model = model.load_model(model_file)
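
Before predicting, it can help to check what a saved parameter file contains. The sketch below assumes the `*_param_dict.pickle` file is a plain pickled dictionary, which the file name suggests but this notebook does not confirm. `summarize_param_dict` is a hypothetical helper, not part of `cadrres_sc`, and the toy keys are made up for the demo. Only unpickle files from a source you trust, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

def summarize_param_dict(path):
    """Return {key: type name} for a pickled dictionary stored at `path`."""
    with open(path, 'rb') as handle:
        param_dict = pickle.load(handle)
    if not isinstance(param_dict, dict):
        raise TypeError('expected a pickled dict, got {}'.format(type(param_dict).__name__))
    return {key: type(value).__name__ for key, value in param_dict.items()}

# Demo on a toy dictionary (hypothetical keys, not the real model contents).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_param_dict.pickle')
with open(demo_path, 'wb') as handle:
    pickle.dump({'weights': [0.1, 0.2], 'n_dim': 10}, handle)

summary = summarize_param_dict(demo_path)
print(summary)
```

Pointing the helper at `model_file` instead of `demo_path` would list the entries of the selected model, provided the assumption about the file format holds.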

# Read test data
Again, for this example we load the GDSC dataset.
@TODO: GDSC dataset using only essential gene list?

Note: GDSC_exp.tsv can be downloaded from https://www.dropbox.com/s/3v576mspw5yewbm/GDSC_exp.tsv?dl=0

## Notes for other test data

You can apply the model to other gene expression datasets. The input gene expression matrix should already be normalized, i.e. **for each sample, expression values are comparable across genes**.

In this example the gene expression matrix provided by GDSC is already normalized using RMA.

For RNA-seq data, read counts should be normalized by gene length, using a normalization method such as TPM.
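
As an illustration of the RNA-seq case, here is a minimal sketch of TPM normalization with pandas. The function name `counts_to_log2_tpm`, the toy counts, and the gene lengths are all hypothetical. The final `log2(TPM + 1)` step is an assumption based on the log2-scale function used later in this notebook, not something the notebook states.

```python
import numpy as np
import pandas as pd

def counts_to_log2_tpm(count_df, gene_length_bp):
    """Convert a genes x samples read-count matrix to log2(TPM + 1)."""
    # Align gene lengths to the count matrix and convert to kilobases.
    length_kb = gene_length_bp.reindex(count_df.index) / 1000.0
    # Reads per kilobase: corrects for gene length within each sample.
    rpk = count_df.div(length_kb, axis=0)
    # Scale each sample (column) so that its values sum to one million.
    tpm = rpk.div(rpk.sum(axis=0), axis=1) * 1e6
    return np.log2(tpm + 1), tpm

# Toy data: three genes, two samples.
counts = pd.DataFrame({'s1': [100, 200, 300], 's2': [10, 10, 10]},
                      index=['g1', 'g2', 'g3'])
lengths = pd.Series([1000, 2000, 3000], index=['g1', 'g2', 'g3'])

log2_tpm, tpm = counts_to_log2_tpm(counts, lengths)
print(tpm.sum(axis=0))
```

In sample `s1` the counts are proportional to gene length, so all three genes receive the same TPM value.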

In [6]:
gene_exp_df = pd.read_csv('../data/GDSC/GDSC_exp.tsv', sep='\t', index_col=0)
gene_exp_df = gene_exp_df.groupby(gene_exp_df.index).mean()
print("Dataframe shape:", gene_exp_df.shape, "\n")
gene_exp_df.head(2)

Dataframe shape: (17419, 1018) 



GENE,906826,687983,910927,1240138,1240139,906792,910688,1240135,1290812,907045,...,753584,907044,998184,908145,1659787,1298157,1480372,1298533,930299,905954.1
A1BG,6.208447,5.02581,5.506955,4.208349,3.399366,4.917872,3.828088,5.146903,3.107543,5.062066,...,4.272172,3.435025,4.930052,2.900213,4.523712,5.074951,2.957153,3.089628,4.047364,5.329524
A1CF,2.981775,2.947547,2.872071,3.075478,2.853231,3.221491,2.996355,2.893977,2.755668,2.98565,...,2.941659,3.155536,2.983619,3.118312,2.975409,2.905804,2.944488,2.780003,2.870819,2.926353


## Calculate fold-change
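We normalize the baseline expression of each gene by computing its log2 fold-change relative to that gene's mean across cell lines, as the function name `normalize_log2_mean_fc` indicates. The function also returns the per-gene reference values.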

In [7]:
cell_line_log2_mean_fc_exp_df, cell_line_mean_exp_df = pp.gexp.normalize_log2_mean_fc(gene_exp_df)
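
On a log2 scale, a fold-change relative to a reference value is a subtraction. The sketch below shows that idea on toy data, assuming the reference is the per-gene mean across samples. `log2_mean_fold_change` is a hypothetical stand-in and may differ from the actual `pp.gexp.normalize_log2_mean_fc` implementation.

```python
import pandas as pd

def log2_mean_fold_change(log2_exp_df):
    """Subtract each gene's mean (rows = genes, columns = samples)."""
    gene_mean = log2_exp_df.mean(axis=1)
    # Subtracting on the log2 scale equals dividing on the linear scale.
    fc_df = log2_exp_df.sub(gene_mean, axis=0)
    return fc_df, gene_mean

# Toy log2 expression values for two genes in three cell lines.
toy = pd.DataFrame({'c1': [2.0, 5.0], 'c2': [4.0, 5.0], 'c3': [6.0, 8.0]},
                   index=['A1BG', 'A1CF'])

fc, gene_mean = log2_mean_fold_change(toy)
print(fc)
```

After this step every gene has a mean of zero across cell lines, so positive values mean expression above the cohort average.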

## Read essential genes list

Alternatively, supply your own gene list if the model was trained on a specific set of genes.

In [8]:
ess_gene_list = utility.get_gene_list('../data/essential_genes.txt')
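
Only genes present in both the gene list and the expression matrix can contribute to the next step. The sketch below reads a list and intersects it with an expression index, assuming one gene symbol per line. That file format is an assumption, and `read_gene_list` is a hypothetical helper rather than `utility.get_gene_list`.

```python
import io
import pandas as pd

def read_gene_list(handle):
    """Read one gene symbol per line, skipping blanks and duplicates."""
    genes = (line.strip() for line in handle)
    # dict.fromkeys keeps the first occurrence and preserves order.
    return list(dict.fromkeys(g for g in genes if g))

# Toy input standing in for a gene list file.
toy_file = io.StringIO('A1BG\nTP53\n\nMYC\nTP53\n')
genes = read_gene_list(toy_file)

# Toy expression matrix that lacks MYC.
exp_df = pd.DataFrame({'c1': [1.0, 2.0, 3.0]}, index=['A1BG', 'A1CF', 'TP53'])
common = [g for g in genes if g in exp_df.index]
print(len(genes), 'listed,', len(common), 'found in the expression matrix')
```

Comparing the two counts on your own data shows how many listed genes are missing from the expression matrix.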

## Calculate kernel feature 

Kernel features are computed from all given cell line samples with gene expression profiles and a list of genes (e.g. the essential gene list). This step can take a few minutes for about 1,000 samples.

In [9]:
test_kernel_df = pp.gexp.calculate_kernel_feature(cell_line_log2_mean_fc_exp_df, cell_line_log2_mean_fc_exp_df, ess_gene_list)

Calculating kernel features based on 1610 common genes
(17419, 1018) (17419, 1018)
100 of 1018 (16.12)s
200 of 1018 (16.22)s
300 of 1018 (16.10)s
400 of 1018 (16.22)s
500 of 1018 (16.33)s
600 of 1018 (16.28)s
700 of 1018 (15.88)s
800 of 1018 (10.46)s
900 of 1018 (10.49)s
1000 of 1018 (10.34)s


In [10]:
print("Dataframe shape:", test_kernel_df.shape, "\n")
test_kernel_df.head(2)

Dataframe shape: (1018, 1018) 



Unnamed: 0,906826,687983,910927,1240138,1240139,906792,910688,1240135,1290812,907045,...,753584,907044,998184,908145,1659787,1298157,1480372,1298533,930299,905954.1
906826,1.0,0.054507,0.026621,0.000195,0.181043,-0.010206,-0.091207,0.255585,0.256516,-0.043044,...,0.178078,-0.033405,-0.128262,-0.02086,0.226647,0.225082,0.146886,0.041669,-0.099332,0.044356
687983,0.054507,1.0,0.1515,-0.017105,0.047332,0.061474,-0.11547,0.040432,-0.113185,-0.073907,...,-0.024037,0.027242,0.12131,-0.018611,0.009571,0.044496,0.087031,-0.149296,0.118897,-0.056471
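
The matrix above has ones on the diagonal and is symmetric, which is what a sample-by-sample correlation matrix looks like. The sketch below computes Pearson correlations between samples over the common genes. Treating the kernel as a Pearson correlation is an assumption, and `correlation_kernel` is a hypothetical function, not the `pp.gexp.calculate_kernel_feature` implementation.

```python
import numpy as np
import pandas as pd

def correlation_kernel(test_fc_df, ref_fc_df, gene_list):
    """Pearson correlation between test and reference samples (columns)."""
    common = [g for g in gene_list
              if g in test_fc_df.index and g in ref_fc_df.index]
    test = test_fc_df.loc[common]
    ref = ref_fc_df.loc[common]
    # Standardize each sample over the common genes (population std).
    test_z = (test - test.mean(axis=0)) / test.std(axis=0, ddof=0)
    ref_z = (ref - ref.mean(axis=0)) / ref.std(axis=0, ddof=0)
    # Mean product of z-scores equals the Pearson correlation.
    return test_z.T.dot(ref_z) / len(common)

# Toy fold-change matrix: 6 genes x 4 samples.
rng = np.random.default_rng(0)
gene_names = ['g{}'.format(i) for i in range(6)]
toy_fc = pd.DataFrame(rng.normal(size=(6, 4)), index=gene_names,
                      columns=['s1', 's2', 's3', 's4'])

# 'missing' is not in the matrix, so it is dropped from the common genes.
gene_subset = ['g0', 'g1', 'g2', 'g3', 'g4', 'missing']
kernel = correlation_kernel(toy_fc, toy_fc, gene_subset)
print(kernel.round(3))
```

Passing the same matrix as both arguments, as in the cell above, gives a square matrix with each sample perfectly correlated with itself.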


# Drug response prediction

In [11]:
print('Predicting drug response using CaDRReS: {}'.format(model_spec_name))
pred_df, P_test_df = model.predict_from_model(cadrres_model, test_kernel_df, model_spec_name)
print('done!')

Predicting drug response using CaDRReS: cadrres-wo-sample-bias
done!


Inspect the model predictions, then save them.

In [12]:
pred_df.head()

Drug ID,1,1001,1003,1004,1005,1006,1007,1008,1009,1010,...,64,71,83,86,87,88,89,9,91,94
906826,3.856865,11.115356,-5.714982,-5.561752,4.129691,0.238772,-6.829151,4.397136,8.115744,5.342511,...,1.401109,3.573971,-2.102823,-0.961198,-2.844079,0.929385,2.908931,-0.799364,4.880714,3.46348
687983,6.997003,11.521834,-4.12401,-4.482363,4.611786,1.867274,-4.146659,3.453969,8.049282,6.977233,...,3.112575,6.961551,-0.660826,-0.461482,-1.56092,2.261902,5.131415,0.747301,7.120768,6.596741
910927,1.744297,10.845635,-7.329617,-6.772134,2.870419,-2.584782,-9.059966,2.719736,6.778657,3.453523,...,1.004838,3.189792,-2.231047,-1.457975,-4.227088,1.318033,2.129338,-0.941594,3.979322,3.618429
1240138,3.554745,11.422337,-4.78966,-4.818663,4.886816,0.998196,-6.06109,4.280636,7.975548,3.718647,...,2.550148,5.412402,-1.177267,-0.423846,-1.763688,1.567855,3.344768,0.003893,5.520441,3.834303
1240139,2.768286,10.70298,-7.889551,-7.422914,2.489097,-2.378537,-9.622192,1.767213,6.404831,3.030787,...,1.142023,2.761482,-2.37242,-0.878104,-3.828454,0.801083,2.171508,-1.591526,4.433969,3.871057


In [15]:
P_test_df.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
906826,0.381871,-1.392874,-1.258435,-0.149721,-0.367136,-1.354308,1.090426,0.041576,-0.778599,0.374818
687983,0.277867,-0.675402,0.467139,1.312239,0.606508,0.93387,-0.090949,-0.768077,-2.20368,-0.41665
910927,-0.51872,0.468541,0.124029,-0.096551,-1.556721,-2.986613,1.15128,-0.06244,-0.584276,-1.211875
1240138,0.717129,0.51249,-1.164903,-0.343803,1.564474,-1.212826,1.060513,-0.592996,0.081074,-0.533514
1240139,-0.080744,-0.45495,0.451206,-0.185165,-0.754197,-2.929592,0.502527,0.665691,-0.111447,0.267623
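
`P_test_df` has one row per cell line and 10 columns, which is the shape of a low-dimensional sample representation. The sketch below shows how a generic latent-factor model could turn kernel features into predictions. It assumes a linear projection into the latent space and a per-drug bias. The names `W_P`, `Q`, and `drug_bias` are hypothetical and the values are random, so this illustrates the general technique rather than the internals of `model.predict_from_model`.

```python
import numpy as np

def predict_latent_factor(kernel, W_P, Q, drug_bias):
    """Project samples into latent space, then score each drug.

    kernel:    (n_samples, n_features) kernel features
    W_P:       (n_features, n_dim) projection matrix
    Q:         (n_drugs, n_dim) drug latent vectors
    drug_bias: (n_drugs,) per-drug offset
    """
    P = kernel @ W_P            # sample latent vectors, (n_samples, n_dim)
    pred = P @ Q.T + drug_bias  # dot product per sample and drug, plus bias
    return pred, P

# Random toy parameters: 5 samples, 8 kernel features, 10 dimensions, 3 drugs.
rng = np.random.default_rng(0)
kernel = rng.normal(size=(5, 8))
W_P = rng.normal(size=(8, 10))
Q = rng.normal(size=(3, 10))
drug_bias = rng.normal(size=3)

pred, P = predict_latent_factor(kernel, W_P, Q, drug_bias)
print(pred.shape, P.shape)
```

Under these assumptions, two cell lines with similar latent vectors receive similar predicted responses across all drugs.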


In [13]:
print('Saving ' + model_dir + '{}_test_pred.csv'.format(model_spec_name))
pred_df.to_csv(model_dir + '{}_test_pred.csv'.format(model_spec_name))

Saving ../example_result/cadrres-wo-sample-bias_test_pred.csv
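
Once saved, the predictions can be read back and ranked per cell line. The sketch below does this on a small CSV string whose values are rounded from the first rows of the table above. `top_drugs` is a hypothetical helper. Whether lower or higher values indicate greater sensitivity depends on the response measure the model was trained on, so the sort direction is left as a parameter.

```python
import io
import pandas as pd

def top_drugs(pred_df, cell_line, k=3, ascending=True):
    """Return the k drugs with the smallest (or largest) predicted values."""
    return pred_df.loc[cell_line].sort_values(ascending=ascending).head(k)

# Toy CSV: rows are cell lines, columns are drug IDs.
csv_text = ',1,1001,1003\n906826,3.86,11.12,-5.71\n687983,7.00,11.52,-4.12\n'
pred = pd.read_csv(io.StringIO(csv_text), index_col=0)

ranked = top_drugs(pred, 906826, k=2)
print(ranked)
```

Note that `read_csv` parses the cell line IDs in the index as integers but keeps the drug IDs in the header as strings, so look up drugs with `'1003'` rather than `1003`.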


---

**Authors:** Chayaporn Suphavilai, Rafael Peres da Silva, Genome Institute of Singapore, Nagarajan Lab, January 14, 2020

---

Reproducibility tips from https://github.com/jupyter-guide/ten-rules-jupyter