# CaDRReS-SC pre-trained model prediction
This notebook shows an example of how to load a pre-trained CaDRReS-SC model and predict drug response based on new data.

In [2]:
import sys, os, pickle
from collections import Counter
import importlib
from ipywidgets import widgets
import pandas as pd
import numpy as np

scriptpath = '..'
sys.path.append(os.path.abspath(scriptpath))

from cadrres import pp, model, evaluation, utility

# Read pre-trained model

In [3]:
model_dir = '../example_result/'

In [4]:
obj_function = widgets.Dropdown(options=['cadrres-wo-sample-bias', 'cadrres-wo-sample-bias-weight', 'cadrres'], description='Objective function')

In [5]:
# Choose the objective function of the model you trained previously
display(obj_function)

Dropdown(description='Objective function', options=('cadrres-wo-sample-bias', 'cadrres-wo-sample-bias-weight', …

## Load the pre-trained model based on your selection


In [9]:
model_spec_name = obj_function.value
model_file = model_dir + '{}_param_dict.pickle'.format(model_spec_name)

cadrres_model = model.load_model(model_file)
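
Before loading, it can help to confirm that a parameter file exists for the selected objective function. The following is a minimal sketch that assumes the same `<name>_param_dict.pickle` naming pattern as the cell above; `find_model_file` is a hypothetical helper and not part of `cadrres`.

```python
import os


def find_model_file(model_dir, model_spec_name):
    """Return the expected parameter-file path, or raise if it is missing.

    Hypothetical helper: it only mirrors the file naming used in this notebook.
    """
    model_file = os.path.join(model_dir, '{}_param_dict.pickle'.format(model_spec_name))
    if not os.path.isfile(model_file):
        raise FileNotFoundError(
            'No pre-trained model found at {}'.format(model_file))
    return model_file
```

With this helper, `model.load_model(find_model_file(model_dir, obj_function.value))` would fail early with a readable message if the selected model has not been trained yet.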

# Read test data
Again, for this example we load the GDSC dataset.
@TODO: GDSC dataset using only essential gene list?

Note: GDSC_exp.tsv can be downloaded from https://www.dropbox.com/s/3v576mspw5yewbm/GDSC_exp.tsv?dl=0

## Notes for other test data

You can apply the model to other gene expression datasets. The input gene expression matrix should be normalized, i.e. **for each sample, expression values are comparable across genes**.

In this example, the gene expression matrix provided by GDSC is already normalized using RMA.

For RNA-seq data, read counts should be normalized by gene length, using a normalization method such as TPM.
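
As an illustration of the gene-length normalization mentioned above, here is a minimal sketch of a TPM calculation. The function name `counts_to_tpm`, the toy counts and the gene lengths are all made up for the example; real pipelines usually take lengths from the annotation used for quantification.

```python
import pandas as pd


def counts_to_tpm(count_df, gene_length_bp):
    """Convert a genes x samples read-count matrix to TPM.

    count_df: DataFrame with genes as rows and samples as columns.
    gene_length_bp: Series of gene lengths in base pairs, indexed by gene.
    """
    lengths_kb = gene_length_bp.reindex(count_df.index) / 1000.0
    # Reads per kilobase: divide each gene's counts by its length in kb.
    rpk = count_df.div(lengths_kb, axis=0)
    # Rescale each sample so that its values sum to one million.
    return rpk.div(rpk.sum(axis=0), axis=1) * 1e6


# Toy input (hypothetical values)
counts = pd.DataFrame({'s1': [100, 200], 's2': [50, 50]}, index=['geneA', 'geneB'])
lengths = pd.Series({'geneA': 1000, 'geneB': 2000})
tpm = counts_to_tpm(counts, lengths)
```

In `s1`, both genes have the same number of reads per kilobase, so they receive equal TPM values even though `geneB` has twice as many raw reads.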

In [6]:
gene_exp_df = pd.read_csv('../data/GDSC/GDSC_exp.tsv', sep='\t', index_col=0)
gene_exp_df = gene_exp_df.groupby(gene_exp_df.index).mean()
print("Dataframe shape:", gene_exp_df.shape, "\n")
gene_exp_df.head(2)

Dataframe shape: (17419, 1018) 



| GENE | 906826 | 687983 | 910927 | 1240138 | 1240139 | 906792 | 910688 | 1240135 | 1290812 | 907045 | ... | 753584 | 907044 | 998184 | 908145 | 1659787 | 1298157 | 1480372 | 1298533 | 930299 | 905954.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1BG | 6.208447 | 5.02581 | 5.506955 | 4.208349 | 3.399366 | 4.917872 | 3.828088 | 5.146903 | 3.107543 | 5.062066 | ... | 4.272172 | 3.435025 | 4.930052 | 2.900213 | 4.523712 | 5.074951 | 2.957153 | 3.089628 | 4.047364 | 5.329524 |
| A1CF | 2.981775 | 2.947547 | 2.872071 | 3.075478 | 2.853231 | 3.221491 | 2.996355 | 2.893977 | 2.755668 | 2.98565 | ... | 2.941659 | 3.155536 | 2.983619 | 3.118312 | 2.975409 | 2.905804 | 2.944488 | 2.780003 | 2.870819 | 2.926353 |
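
The `groupby(gene_exp_df.index).mean()` call in the cell above collapses rows that share a gene symbol into a single row by averaging them. A toy illustration with made-up values:

```python
import pandas as pd

# Toy matrix in which GENE_A appears twice (hypothetical values)
toy_df = pd.DataFrame(
    {'cell_1': [2.0, 4.0, 5.0], 'cell_2': [1.0, 3.0, 7.0]},
    index=['GENE_A', 'GENE_A', 'GENE_B'])
toy_df.index.name = 'GENE'

# Average the duplicated rows so that each gene symbol occurs once
collapsed_df = toy_df.groupby(toy_df.index).mean()
```

After this step the index is unique, which later label-based lookups by gene symbol rely on.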


## Calculate fold-change
We normalize the baseline expression values of each gene by computing fold-changes relative to its median value across cell lines.

In [11]:
cell_line_log2_mean_fc_exp_df, cell_line_mean_exp_df = pp.gexp.normalize_log2_mean_fc(gene_exp_df)
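
To make the fold-change step concrete, here is a rough sketch of per-gene centring on log-scale data. It is not the `cadrres` implementation. It assumes the input is already on a log2 scale (as with RMA values) and, following the function name `normalize_log2_mean_fc`, centres each gene on its mean; replace `.mean` with `.median` to centre on the median instead. `log2_fold_change` is a hypothetical name.

```python
import pandas as pd


def log2_fold_change(log2_exp_df):
    """Centre each gene (row) on its average across samples (columns).

    On a log2 scale, subtracting the per-gene reference value gives the
    log2 fold-change relative to that reference.
    """
    gene_ref = log2_exp_df.mean(axis=1)  # assumption: mean, per the function name
    fc_df = log2_exp_df.sub(gene_ref, axis=0)
    return fc_df, gene_ref


# Toy log2 expression values (hypothetical)
toy_log2_df = pd.DataFrame(
    {'cell_1': [4.0, 2.0], 'cell_2': [6.0, 2.0], 'cell_3': [8.0, 5.0]},
    index=['GENE_A', 'GENE_B'])
toy_fc_df, toy_ref = log2_fold_change(toy_log2_df)
```

A value of 2.0 in `toy_fc_df` means the gene is expressed four times higher in that sample than the reference level.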

## Read essential genes list

You can substitute another gene list here, for example if your model was trained on a specific set of genes.

In [12]:
ess_gene_list = utility.get_gene_list('../data/essential_genes.txt')
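
Not every gene in the list has to be present in the expression matrix, so it can be useful to check the overlap before computing features. The following is a small sketch; `common_genes` is a hypothetical helper, not a `cadrres` function.

```python
def common_genes(gene_list, exp_index):
    """Return genes from gene_list that occur in exp_index.

    Keeps the order of gene_list and drops repeated entries.
    """
    available = set(exp_index)
    seen = set()
    kept = []
    for gene in gene_list:
        if gene in available and gene not in seen:
            seen.add(gene)
            kept.append(gene)
    return kept
```

For example, `len(common_genes(ess_gene_list, gene_exp_df.index))` would give the number of listed genes that are actually available in the expression data.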

## Calculate kernel feature

Kernel features are calculated from all given cell-line samples with gene expression profiles and a list of genes (e.g. the essential gene list). This step is slower than the others: in the run below, each batch of 100 samples took roughly 13 to 20 seconds.

In [13]:
test_kernel_df = pp.gexp.calculate_kernel_feature(cell_line_log2_mean_fc_exp_df, cell_line_log2_mean_fc_exp_df, ess_gene_list)

Calculating kernel features based on 1610 common genes
(17419, 1018) (17419, 1018)
100 of 1018 (19.83)s
200 of 1018 (13.19)s
300 of 1018 (12.65)s
400 of 1018 (12.63)s
500 of 1018 (12.59)s
600 of 1018 (12.59)s
700 of 1018 (12.62)s
800 of 1018 (12.65)s
900 of 1018 (12.71)s
1000 of 1018 (12.76)s
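
The kernel matrix printed in the next cell is square, has ones on its diagonal, and holds values between -1 and 1, which is what a sample-by-sample Pearson correlation over the common genes would produce. The sketch below assumes that interpretation; `pearson_kernel` is a hypothetical stand-in and not the `cadrres` implementation.

```python
import numpy as np
import pandas as pd


def pearson_kernel(test_fc_df, ref_fc_df, gene_list):
    """Pearson correlation between test and reference samples.

    Both inputs are genes x samples; only genes present in both are used.
    Returns a test-samples x reference-samples DataFrame.
    """
    genes = [g for g in gene_list
             if g in test_fc_df.index and g in ref_fc_df.index]
    test = test_fc_df.loc[genes].to_numpy(dtype=float)
    ref = ref_fc_df.loc[genes].to_numpy(dtype=float)
    # Standardize each sample (column); std uses ddof=0 to match the 1/n below.
    test_z = (test - test.mean(axis=0)) / test.std(axis=0)
    ref_z = (ref - ref.mean(axis=0)) / ref.std(axis=0)
    corr = test_z.T @ ref_z / len(genes)
    return pd.DataFrame(corr, index=test_fc_df.columns, columns=ref_fc_df.columns)
```

Computing the matrix product on standardized columns avoids a Python loop over sample pairs, which matters when there are around a thousand samples on each side.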


In [15]:
print("Dataframe shape:", test_kernel_df.shape, "\n")
test_kernel_df.head(2)

Dataframe shape: (1018, 1018) 



|  | 906826 | 687983 | 910927 | 1240138 | 1240139 | 906792 | 910688 | 1240135 | 1290812 | 907045 | ... | 753584 | 907044 | 998184 | 908145 | 1659787 | 1298157 | 1480372 | 1298533 | 930299 | 905954.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 906826 | 1.0 | 0.054507 | 0.026621 | 0.000195 | 0.181043 | -0.010206 | -0.091207 | 0.255585 | 0.256516 | -0.043044 | ... | 0.178078 | -0.033405 | -0.128262 | -0.02086 | 0.226647 | 0.225082 | 0.146886 | 0.041669 | -0.099332 | 0.044356 |
| 687983 | 0.054507 | 1.0 | 0.1515 | -0.017105 | 0.047332 | 0.061474 | -0.11547 | 0.040432 | -0.113185 | -0.073907 | ... | -0.024037 | 0.027242 | 0.12131 | -0.018611 | 0.009571 | 0.044496 | 0.087031 | -0.149296 | 0.118897 | -0.056471 |


# Drug response prediction

In [16]:
print('Predicting drug response using CaDRReS: {}'.format(model_spec_name))
pred_df, P_test_df = model.predict_from_model(cadrres_model, test_kernel_df, model_spec_name)
print('done!')

Predicting drug response using CaDRReS: cadrres-wo-sample-bias
done!


Inspect the model predictions and save them.

In [17]:
pred_df.head()

| Drug ID | 1001 | 1003 | 1004 | 1006 | 1007 | 1010 | 1012 | 1014 | 1015 | 1016 | ... | 299 | 301 | 302 | 303 | 305 | 306 | 308 | 328 | 331 | 346 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 906826 | 11.852327 | -5.518843 | -5.046308 | 0.971064 | -6.078666 | 6.661717 | 2.089982 | 2.835175 | 3.907548 | -1.134554 | ... | 2.771623 | 5.188969 | 2.100502 | 5.034929 | 5.840061 | 3.961555 | 0.734314 | 1.298839 | 2.475582 | -0.634193 |
| 687983 | 11.837371 | -3.967708 | -4.643764 | 3.160419 | -4.802473 | 7.113175 | 1.385263 | 7.396879 | 6.441652 | -1.73027 | ... | 1.214771 | 5.116991 | 0.533735 | 4.241065 | 5.423465 | 3.911427 | 0.979906 | 1.581112 | 1.779684 | -2.049694 |
| 910927 | 11.380081 | -7.310671 | -7.277131 | -2.439548 | -10.244858 | 4.523432 | 1.694288 | 1.301639 | 3.086393 | -2.511762 | ... | 4.656056 | 5.991129 | 4.405546 | 5.615788 | 5.671027 | 3.535789 | 0.084465 | 0.522739 | 3.335967 | 0.117861 |
| 1240138 | 11.289334 | -4.691136 | -4.508313 | 0.95603 | -5.665845 | 4.058 | 1.520491 | 2.352573 | 4.211185 | -0.248094 | ... | 2.017487 | 4.552627 | 1.927344 | 4.195925 | 4.615114 | 2.85985 | -0.490752 | 0.211791 | 2.2339 | -1.636342 |
| 1240139 | 11.200646 | -8.781634 | -6.432948 | -2.268403 | -7.488023 | 3.225305 | 0.793092 | 0.736975 | 1.903656 | -1.915484 | ... | 3.58445 | 5.493712 | 3.706355 | 4.977149 | 4.695365 | 4.152595 | 0.154278 | 0.784608 | 3.305857 | -0.518832 |
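
One way to inspect a prediction matrix like this is to list, for each sample, the drugs with the most extreme predicted values. The sketch below assumes that a lower predicted value means a stronger response (as it would for a log-scale IC50); that interpretation is an assumption here, so pass `lowest=False` if the scale runs the other way. `top_k_drugs` is a hypothetical helper, and the toy frame reuses the first five drug columns of the first two rows shown above.

```python
import pandas as pd


def top_k_drugs(pred_df, k=3, lowest=True):
    """Map each sample (row) to the k drug IDs with the most extreme values.

    lowest=True picks the smallest predicted values (assumed most sensitive).
    """
    ranked = {}
    for sample, row in pred_df.iterrows():
        picked = row.nsmallest(k) if lowest else row.nlargest(k)
        ranked[sample] = picked.index.tolist()
    return ranked


# First five drug columns of the first two samples in the output above
toy_pred_df = pd.DataFrame(
    [[11.852327, -5.518843, -5.046308, 0.971064, -6.078666],
     [11.837371, -3.967708, -4.643764, 3.160419, -4.802473]],
    index=['906826', '687983'],
    columns=['1001', '1003', '1004', '1006', '1007'])
top2 = top_k_drugs(toy_pred_df, k=2)
```

The same call on the full `pred_df` would return one short list of drug IDs per cell line.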


In [18]:
print('Saving ' + model_dir + '{}_test_pred.csv'.format(model_spec_name))
pred_df.to_csv(model_dir + '{}_test_pred.csv'.format(model_spec_name))

Saving ../example_result/cadrres-wo-sample-bias_test_pred.csv
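
When the saved CSV is read back later, the sample IDs in the first column need to be restored as the index, and numeric-looking IDs change type. The following is a minimal round-trip sketch with a toy frame and a temporary directory; the file name `toy_test_pred.csv` and the rounded values are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Toy predictions (illustrative values), indexed by sample ID
toy_pred_df = pd.DataFrame(
    [[11.85, -5.52], [11.84, -3.97]],
    index=['906826', '687983'],
    columns=['1001', '1003'])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = os.path.join(tmp_dir, 'toy_test_pred.csv')
    toy_pred_df.to_csv(out_file)
    # index_col=0 restores the sample IDs as the row index
    reloaded_df = pd.read_csv(out_file, index_col=0)

# Row labels that look numeric are parsed as integers on reload, while
# header labels come back as strings; cast both to strings for consistency.
reloaded_df.index = reloaded_df.index.astype(str)
reloaded_df.columns = reloaded_df.columns.astype(str)
```

Casting the IDs to one consistent type avoids silent lookup failures when the reloaded predictions are joined with other tables.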


## Next Step
You can stop here, but CaDRReS is also suitable for drug combination prediction. See [notebook_03_drug_combination.ipynb](./notebook_02_drug_combination.ipynb).

---

**Authors:** Chayaporn Suphavilai, Rafael Peres da Silva, Genome Institute of Singapore, Nagarajan Lab, January 14, 2020

---

Reproducibility tips from https://github.com/jupyter-guide/ten-rules-jupyter