# CaDRReS-Sc for predicting combinatorial drug response
This notebook explains how to load a pre-trained CaDRReS-Sc model and predict the response to drug combinations for your own data. In this example, we use scRNA-seq data from patient-derived cell lines obtained from head and neck cancer patients. Cell clustering was performed with the Scanpy package. For details of data preprocessing, cell clustering, and drug response prediction, please refer to our manuscript.

In [1]:
import sys, os, pickle
from collections import Counter
import importlib
from ipywidgets import widgets
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np

scriptpath = '..'
sys.path.append(os.path.abspath(scriptpath))
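# Use the fully qualified option name; the short form 'precision' is ambiguous in recent pandas versions
pd.set_option('display.precision', 2)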

from cadrres_sc import pp, model, evaluation, utility

# Load pre-trained model
We load the pre-trained HNSC model and use it to make predictions. Alternatively, you can use a model trained previously in [notebook_02_model_prediction.ipynb](./notebook_02_model_prediction.ipynb).

### Load the pre-trained model based on your selection


In [2]:
model_dir = '../data/pretrained_model/'
model_name = 'hn_drug_cw_dw10_100000'
model_file = model_dir + '{}_param_dict.pickle'.format(model_name)

cadrres_model = model.load_model(model_file)

# Read test data
Again, for this example we load the GDSC dataset.
Note: GDSC_exp.tsv can be downloaded from https://www.dropbox.com/s/3v576mspw5yewbm/GDSC_exp.tsv?dl=0

## Notes for other test data

You can apply the model to other gene expression datasets. The input gene expression matrix should already be normalized, i.e. **for each sample, expression values are comparable across genes**.

In this example, the gene expression matrix provided by GDSC is already normalized using RMA.

For RNA-seq data, read counts should be normalized by gene length, using a normalization method such as TPM.
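
As an illustration of the gene-length normalization mentioned above, here is a minimal sketch of a TPM calculation. The helper `counts_to_tpm` and the toy counts and gene lengths are hypothetical and not part of `cadrres_sc`; the sketch assumes a genes x samples count matrix and gene lengths in base pairs.

```python
import numpy as np

def counts_to_tpm(counts, gene_lengths_bp):
    """Convert a genes x samples read-count matrix to TPM (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(gene_lengths_bp, dtype=float) / 1000.0
    # Reads per kilobase: correct each gene for its length.
    rpk = counts / lengths_kb[:, None]
    # Scale every sample (column) so that its values sum to one million.
    return rpk / rpk.sum(axis=0, keepdims=True) * 1e6

# Toy data (made up for illustration): 3 genes x 2 samples.
toy_counts = [[100, 200], [300, 600], [600, 200]]
toy_lengths = [1000, 3000, 2000]
toy_tpm = counts_to_tpm(toy_counts, toy_lengths)
```

After this step every sample sums to one million, so values are comparable across genes within a sample. A log transform (for example `np.log2(tpm + 1)`) is a common follow-up, but whether it is needed depends on how the model was trained.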

In [3]:
gene_exp_df = pd.read_csv('../data/GDSC/GDSC_exp.tsv', sep='\t', index_col=0)
gene_exp_df = gene_exp_df.groupby(gene_exp_df.index).mean()
print("Dataframe shape:", gene_exp_df.shape, "\n")
gene_exp_df.head(2)

Dataframe shape: (17419, 1018) 



GENE,906826,687983,910927,1240138,1240139,906792,910688,1240135,1290812,907045,...,753584,907044,998184,908145,1659787,1298157,1480372,1298533,930299,905954.1
A1BG,6.21,5.03,5.51,4.21,3.4,4.92,3.83,5.15,3.11,5.06,...,4.27,3.44,4.93,2.9,4.52,5.07,2.96,3.09,4.05,5.33
A1CF,2.98,2.95,2.87,3.08,2.85,3.22,3.0,2.89,2.76,2.99,...,2.94,3.16,2.98,3.12,2.98,2.91,2.94,2.78,2.87,2.93
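
The `groupby(...).mean()` call in the cell above collapses rows that share a gene symbol into a single row. A small sketch with made-up values (the frame `toy_exp_df` is hypothetical) shows the effect:

```python
import pandas as pd

# Toy expression table in which GENE_A appears twice (values are made up).
toy_exp_df = pd.DataFrame(
    {"s1": [1.0, 3.0, 5.0], "s2": [2.0, 4.0, 6.0]},
    index=["GENE_A", "GENE_A", "GENE_B"],
)

# Average duplicated gene rows so that every gene symbol is unique.
collapsed_df = toy_exp_df.groupby(toy_exp_df.index).mean()
```

Here the two `GENE_A` rows are replaced by their per-sample average, which keeps later index-based lookups unambiguous.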


## Calculate fold-change
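We normalize the baseline expression of each gene by computing its log2 fold-change relative to the gene's mean across cell lines, as implied by the name `normalize_log2_mean_fc`. The function also returns that per-gene reference value as its second output.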

In [4]:
cell_line_log2_mean_fc_exp_df, cell_line_mean_exp_df = pp.gexp.normalize_log2_mean_fc(gene_exp_df)
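
To make the fold-change step concrete, here is a minimal sketch of what such a normalization could look like. It assumes the input is already on a log2 scale (as RMA output is) and that the reference is the per-gene mean, as the function name suggests; `log2_fc_vs_mean` is a hypothetical stand-in, not the `cadrres_sc` implementation.

```python
import numpy as np

def log2_fc_vs_mean(log2_exp):
    """Return (fold-change matrix, per-gene mean) for a genes x samples log2 matrix."""
    log2_exp = np.asarray(log2_exp, dtype=float)
    # On a log2 scale, a fold-change against the mean is a simple subtraction.
    gene_mean = log2_exp.mean(axis=1, keepdims=True)
    return log2_exp - gene_mean, gene_mean.ravel()

# Toy matrix (made up): 2 genes x 3 cell lines.
toy_fc, toy_mean = log2_fc_vs_mean([[6.0, 4.0, 5.0], [3.0, 3.0, 3.0]])
```

Each row of the result is centered at zero, so positive values mark cell lines that express the gene above its average.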

## Load cluster-specific gene expression profile
Dimensions: genes x clusters.

In [5]:
cluster_norm_exp_fname = '../data/patient/log2_fc_cluster_tpm.csv'
output_dir = '../example_result/'

In [6]:
cluster_norm_exp_df = pd.read_csv(cluster_norm_exp_fname, index_col=0).T

In [7]:
cluster_norm_exp_df.head(2)

cluster,A1,A2,B1,B2,C1,C2,D1,D2,E1,E2,...,G1,G2,H1,I1,I2,J1,J2,K1,L,M
AAAS,0.34,0.73,-0.1,0.04,-0.75,0.44,0.97,0.65,0.24,-0.11,...,0.02,0.14,0.93,-0.5,-0.56,-0.23,-2.15,-0.67,-1.38,0.46
AAMP,0.47,0.81,-0.54,-0.89,-1.26,0.21,0.2,-0.08,0.64,-0.55,...,-0.3,-0.57,0.71,-0.94,-0.96,-0.85,-1.17,-1.75,-1.93,0.72


## Read essential genes list

Alternatively, provide your own gene list if you want to restrict the analysis to a specific set of genes.

In [8]:
ess_gene_list = utility.get_gene_list('../data/essential_genes.txt')

In [9]:
selected_gene_list = [g for g in ess_gene_list if g in cluster_norm_exp_df.index]
len(selected_gene_list)

1724

## Calculate kernel feature 

Next, we compute the kernel features between the loaded dataset (e.g., clustered patient data) and the cell lines (or the samples behind any other model you trained previously, e.g. CaDRReS-Sc trained on PDX data).

In [10]:
test_kernel_df = pp.gexp.calculate_kernel_feature(cluster_norm_exp_df, cell_line_log2_mean_fc_exp_df, selected_gene_list)

Calculating kernel features based on 1543 common genes
(1724, 24) (17419, 1018)


In [11]:
print("Dataframe shape:", test_kernel_df.shape, "\n")
test_kernel_df.head(2)

Dataframe shape: (24, 1018) 



Unnamed: 0,906826,687983,910927,1240138,1240139,906792,910688,1240135,1290812,907045,...,753584,907044,998184,908145,1659787,1298157,1480372,1298533,930299,905954.1
A1,-0.09,0.00608,0.01,0.05,-0.02,-0.03,-0.02,-0.03,-0.01,0.01,...,0.05,0.04,0.09,-0.06,0.0344,0.03,0.06,0.05,-0.1,0.02
A2,-0.09,0.0255,0.02,0.02,-0.03,-0.02,-0.11,-0.04,-0.06,-0.06,...,0.07,0.09,0.11,-0.15,0.00579,0.06,0.09,0.1,-0.09,-0.03
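
The kernel features can be read as a similarity between each test sample and each reference cell line. The sketch below assumes that similarity is a Pearson correlation over the shared genes; this is an assumption for illustration, so check `pp.gexp.calculate_kernel_feature` for the actual definition. `pearson_kernel` and the toy matrices are hypothetical.

```python
import numpy as np

def pearson_kernel(test_fc, ref_fc):
    """Correlate every test column with every reference column (rows are shared genes)."""
    test_fc = np.asarray(test_fc, dtype=float)
    ref_fc = np.asarray(ref_fc, dtype=float)
    # Standardize each sample (column) across genes.
    test_z = (test_fc - test_fc.mean(axis=0)) / test_fc.std(axis=0)
    ref_z = (ref_fc - ref_fc.mean(axis=0)) / ref_fc.std(axis=0)
    # The mean product of z-scores is the Pearson correlation.
    return test_z.T @ ref_z / test_fc.shape[0]

# Toy data (made up): 4 shared genes, 2 test clusters, 3 reference cell lines.
toy_test = [[1, 4], [2, 1], [3, 3], [4, 2]]
toy_ref = [[2, 4, 1], [4, 3, 0], [6, 2, 1], [8, 1, 0]]
toy_kernel = pearson_kernel(toy_test, toy_ref)
```

The result has one row per test sample and one column per reference sample, which matches the (24, 1018) shape reported above.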


# Drug response prediction
Using the pre-trained model, we now predict drug response for the clustered data.

In [12]:
print('Predicting drug response using CaDRReS: {}'.format(model_name))
pred_df, P_test_df = model.predict_from_model(cadrres_model, test_kernel_df)
print('done!')

Predicting drug response using CaDRReS: hn_drug_cw_dw10_100000
done!


Inspect the model predictions and save them.

In [13]:
#cluster vs drugs
pred_df.head(2)

Drug ID,1001,1003,1004,1006,1007,1010,1012,1014,1015,1016,...,299,301,302,303,305,306,308,328,331,346
A1,10.4,-4.98,-6.04,0.98,-6.94,-0.01,0.738,2.65,3.53,-1.85,...,0.57,2.48,0.21,3.19,3.69,2.49,-0.7,-1.43,1.61,-3.96
A2,9.96,-6.81,-7.07,-0.29,-7.48,0.02,0.00301,2.13,2.97,-2.29,...,0.53,2.16,0.41,2.87,3.22,2.27,-1.16,-2.39,1.4,-4.56


In [14]:
# Report the path that is actually written (output_dir, not model_dir)
print('Saving ' + output_dir + '{}_test_pred.csv'.format(model_name))
pred_df.to_csv(output_dir + '{}_test_pred.csv'.format(model_name))

Saving ../example_result/hn_drug_cw_dw10_100000_test_pred.csv


# Predicting overall drug response and cell death percentage

In [15]:
# For each patient, ignore any cluster that makes up less than 5% of the cells
freq_cutoff = 0.05
# Estimate cell death percentage at a reference dosage: log2 of the median IC50 observed in HNSC cell lines (GDSC)
ref_type = 'log2_median_ic50_hn'

## Read drug statistics

In [16]:
drug_info_df = pd.read_csv('../preprocessed_data/GDSC/hn_drug_stat.csv', index_col=0)
drug_info_df.index = drug_info_df.index.astype(str)

drug_id_name_dict = dict(zip(drug_info_df.index, drug_info_df['Drug Name']))
print (drug_info_df.shape)

(81, 27)


In [17]:
drug_info_df.head(2)

Drug ID,Drug Name,Synonyms,Target,Target Pathway,Selleckchem Cat#,CAS number,PubCHEM,Others,entropy,max_conc,...,median_ic50_9f,log2_median_ic50_9f,log2_median_ic50_hn,median_ic50_hn,median_ic50_3f_hn,log2_median_ic50_3f_hn,median_ic50_9f_hn,log2_median_ic50_9f_hn,num_sensitive,num_sensitive_hn
1001,AICA Ribonucleotide,"AICAR, N1-(b-D-Ribofuranosyl)-5-aminoimidazole...",AMPK agonist,Metabolism,S1802,2627-69-2,65110,,6.03,2000.0,...,207.0,7.69,9.94,982.0,327.0,8.35,109.0,6.77,476,27
1003,Camptothecin,"7-Ethyl-10-Hydroxy-Camptothecin, SN-38, Irinot...",TOP1,DNA replication,S1288,7689-03-4,104842,"(SN-38, S4908, 86639-52-3) (Irinotecan, S1198,...",4.61,0.1,...,0.002,-8.96,-7.59,0.0052,0.00173,-9.17,0.000578,-10.76,688,30


## Load cluster-specific drug response prediction

In [18]:
cadrres_cluster_df = pd.read_csv(output_dir + '{}_test_pred.csv'.format(model_name), index_col=0)

In [19]:
#load prediction for a certain set of drugs
drug_list = drug_info_df.index
cluster_list = cadrres_cluster_df.index
print(len(drug_list), len(cluster_list))

drug_info_df = drug_info_df.loc[drug_list]
cadrres_cluster_df = cadrres_cluster_df[drug_list]

81 24


## Load cluster proportion information

In [20]:
freq_df = pd.read_excel('../data/patient/percent_patient_tpm_cluster.xlsx', index_col=[0, 1]).reset_index()
freq_df = freq_df.pivot(index='patient_id', columns='cluster', values='percent').fillna(0) / 100

patient_list = freq_df.index

# Show the patient-by-cluster proportions
freq_df.head(2)

patient_id,A1,A2,B1,B2,C1,C2,D1,D2,E1,E2,...,F3,G1,G2,H1,I1,I2,J1,J2,K1,L
HN120,0.01,0.00549,0.0,0.0,0.0,0.0,0.313,0.18,0.0,0.0,...,0.0,0.34,0.12,0.0,0.0,0.0,0.0,0.0,0.0,0.03
HN137,0.0,0.0,0.0,0.0,0.0,0.0,0.00568,0.0,0.34,0.09,...,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01


## Predict cell death percentage at the `ref_type` dosage

In [21]:
pred_delta_df = pd.DataFrame(cadrres_cluster_df.values - drug_info_df[ref_type].values, columns=drug_list, index=cluster_list)
pred_cv_df = 100 / (1 + (np.power(2, -pred_delta_df)))
pred_kill_df = 100 - pred_cv_df
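
The three lines above map the gap between the predicted log2 IC50 and the reference log2 dosage onto a logistic curve: a gap of zero gives 50% cell death, and a more resistant cluster (a larger gap) gives less. A minimal sketch of the same arithmetic follows; `predicted_kill_percent` is a hypothetical helper name, and the check value is taken from the HN120, drug 1001, cluster D1 row shown later in this notebook.

```python
import numpy as np

def predicted_kill_percent(pred_log2_ic50, ref_log2_dose):
    """Percentage of cells killed at the reference dose under a logistic response."""
    delta = np.asarray(pred_log2_ic50, dtype=float) - ref_log2_dose
    # Viability rises towards 100% as the predicted IC50 exceeds the dose.
    viability = 100.0 / (1.0 + np.power(2.0, -delta))
    return 100.0 - viability

kill_at_ic50 = float(predicted_kill_percent(3.0, 3.0))            # dose equals IC50
kill_example = float(predicted_kill_percent(0.89891787646367, 0.0))  # delta from the notebook output
```

Passing the delta directly (with a reference of 0) reproduces the cluster-level value of about 34.9% reported for that row.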

In [22]:
rows = []
print('List of cluster in each patient')
for p in patient_list:
    c_list = freq_df.loc[p][freq_df.loc[p] >= freq_cutoff].index.values
    freqs = freq_df.loc[p][freq_df.loc[p] >= freq_cutoff].values

    print(p, c_list, freqs)

    p_pred_delta_weighted = np.matmul(pred_delta_df.loc[c_list].values.T, freqs)
    p_pred_delta_mat = pred_delta_df.loc[c_list].values
    
    p_pred_kill_weighted = np.matmul(pred_kill_df.loc[c_list].values.T, freqs)
    p_pred_kill_mat = pred_kill_df.loc[c_list].values

    for d_i, d_id in enumerate(drug_list):
        rows += [[p, d_id] + ['|'.join(c_list)] + ['|'.join(["{:.14}".format(f) for f in freqs])] + 
                 ['|'.join(["{:.14}".format(f) for f in p_pred_delta_mat[:, d_i]])] + 
                 ["{:.14}".format(p_pred_delta_weighted[d_i])] +
                 ['|'.join(["{:.14}".format(f) for f in p_pred_kill_mat[:, d_i]])] + 
                 ["{:.14}".format(p_pred_kill_weighted[d_i])]
                ]

List of cluster in each patient
HN120 ['D1' 'D2' 'G1' 'G2'] [0.31318681 0.17582418 0.34065934 0.12087912]
HN137 ['E1' 'E2' 'E3' 'F1' 'F2' 'F3'] [0.34090909 0.08522727 0.07386364 0.26704545 0.11931818 0.09659091]
HN148 ['C1' 'C2' 'H1'] [0.31351351 0.20540541 0.45945946]
HN159 ['I1' 'I2' 'K1'] [0.31736527 0.18562874 0.48502994]
HN160 ['B1' 'B2' 'L'] [0.42222222 0.41481481 0.16296296]
HN182 ['J1' 'J2' 'L'] [0.71910112 0.20224719 0.07865169]
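
The patient-level value is a frequency-weighted sum over the clusters that pass `freq_cutoff`. Note that the frequencies are used as they are and are not rescaled after filtering, so the weights for a patient can sum to slightly less than 1 (about 0.95 for HN120 above). A minimal sketch with made-up numbers, where `patient_level_kill` is a hypothetical helper:

```python
import numpy as np

def patient_level_kill(cluster_kill, cluster_freq, freq_cutoff=0.05):
    """Weight cluster-level cell death by cluster frequency, skipping rare clusters."""
    cluster_kill = np.asarray(cluster_kill, dtype=float)
    cluster_freq = np.asarray(cluster_freq, dtype=float)
    keep = cluster_freq >= freq_cutoff
    # Raw frequencies are the weights; they are not renormalized after filtering.
    return float(np.dot(cluster_kill[keep], cluster_freq[keep]))

# Toy patient (made up): the third cluster is below the 5% cutoff and is ignored.
toy_kill = [40.0, 20.0, 90.0]
toy_freq = [0.60, 0.37, 0.03]
toy_patient_kill = patient_level_kill(toy_kill, toy_freq)
```

If you prefer weights that sum to 1, dividing by `cluster_freq[keep].sum()` would be one option, but that changes the numbers relative to this notebook.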


In [23]:
single_drug_pred_df = pd.DataFrame(rows, columns=['patient', 'drug_id', 'cluster', 'cluster_p', 'cluster_delta', 'delta', 'cluster_cell_death', 'cell_death'])
single_drug_pred_df[['patient', 'drug_id', 'cluster', 'cluster_cell_death', 'cell_death']].head()

Unnamed: 0,patient,drug_id,cluster,cluster_cell_death,cell_death
0,HN120,1001,D1|D2|G1|G2,34.908073458363|36.560577887311|37.43884047451...,34.593677608186
1,HN120,1003,D1|D2|G1|G2,33.535917537247|32.254984202963|25.96632734701...,28.505381464623
2,HN120,1004,D1|D2|G1|G2,29.917363245775|33.263999510979|25.55127940412...,27.086112378265
3,HN120,1006,D1|D2|G1|G2,36.683397262759|34.738211834014|21.31910722766...,28.084570239709
4,HN120,1007,D1|D2|G1|G2,10.768034058802|13.470080993016|5.818301286009...,8.5209921135827


In [24]:
single_drug_pred_df.to_csv(output_dir + 'pred_drug_kill_{}.csv'.format(model_name), index=False)

## Predict patient response to combinatorial drugs

In [25]:
tested_drug_list = [1007, 133, 201, 1010, 182, 301, 302, 1012]
[drug_id_name_dict[str(d)] for d in tested_drug_list]

['Docetaxel',
 'Doxorubicin',
 'Epothilone B',
 'Gefitinib',
 'Obatoclax Mesylate',
 'PHA-793887',
 'PI-103',
 'Vorinostat']

In [26]:
single_drug_pred_df.loc[:, 'drug_id'] = single_drug_pred_df.loc[:, 'drug_id'].values.astype(str)
single_drug_pred_df.loc[:, 'drug_name'] = [drug_id_name_dict[d] for d in single_drug_pred_df.loc[:, 'drug_id'].values]
patient_list = sorted(list(set(single_drug_pred_df['patient'])))

single_drug_pred_df.head()

Unnamed: 0,patient,drug_id,cluster,cluster_p,cluster_delta,delta,cluster_cell_death,cell_death,drug_name
0,HN120,1001,D1|D2|G1|G2,0.31318681318681|0.17582417582418|0.3406593406...,0.89891787646367|0.7950907549512|0.74073149856...,0.76608828181776,34.908073458363|36.560577887311|37.43884047451...,34.593677608186,AICA Ribonucleotide
1,HN120,1003,D1|D2|G1|G2,0.31318681318681|0.17582417582418|0.3406593406...,0.98686783677045|1.0705926967102|1.51153962438...,1.1697813179586,33.535917537247|32.254984202963|25.96632734701...,28.505381464623,Camptothecin
2,HN120,1004,D1|D2|G1|G2,0.31318681318681|0.17582417582418|0.3406593406...,1.2280740256921|1.0045035858118|1.542851524626...,1.2676836647428,29.917363245775|33.263999510979|25.55127940412...,27.086112378265,Vinblastine
3,HN120,1006,D1|D2|G1|G2,0.31318681318681|0.17582417582418|0.3406593406...,0.78745659587898|0.90971502511471|1.8838663034...,1.2245948422257,36.683397262759|34.738211834014|21.31910722766...,28.084570239709,Cytarabine
4,HN120,1007,D1|D2|G1|G2,0.31318681318681|0.17582417582418|0.3406593406...,3.0508057471845|2.68344052681|4.0167768343325|...,3.2576613093829,10.768034058802|13.470080993016|5.818301286009...,8.5209921135827,Docetaxel


### Setup all drug combinations by patient

In [27]:
drug_combi_list = []
n_drugs = len(tested_drug_list)

for p in patient_list:
    for x in range(0, n_drugs-1):
        for y in range(x+1, n_drugs):
            drug_x = str(tested_drug_list[x])
            drug_y = str(tested_drug_list[y])

            drug_combi_list += [[p, drug_x, drug_y]]

drug_combi_df = pd.DataFrame(drug_combi_list, columns=['patient', 'A', 'B'])

print (drug_combi_df.shape)
drug_combi_df.head()

(168, 3)


Unnamed: 0,patient,A,B
0,HN120,1007,133
1,HN120,1007,201
2,HN120,1007,1010
3,HN120,1007,182
4,HN120,1007,301
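
The nested loops above enumerate every unordered pair of tested drugs for each patient. As a cross-check on the row count, the same enumeration can be sketched with `itertools.combinations`; the patient IDs are copied from the output earlier in this notebook.

```python
from itertools import combinations

tested = [1007, 133, 201, 1010, 182, 301, 302, 1012]
patients = ["HN120", "HN137", "HN148", "HN159", "HN160", "HN182"]

# One row per (patient, drug A, drug B), with each unordered pair listed once.
pairs = [(p, str(a), str(b)) for p in patients for a, b in combinations(tested, 2)]
```

Eight drugs give 28 pairs, and six patients give 168 rows in total, which agrees with the shape printed above.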


In [28]:
merge_df = pd.merge(drug_combi_df, single_drug_pred_df, how='left', left_on=['patient', 'A'], right_on=['patient', 'drug_id'])
drug_combi_pred_df = pd.merge(merge_df, single_drug_pred_df[['patient', 'drug_id', 'drug_name', 'cluster_delta', 'delta', 'cluster_cell_death', 'cell_death']], how='left', left_on=['patient', 'B'], right_on=['patient', 'drug_id'], suffixes=['_A', '_B'])

In [29]:
rows = []
for _, data in drug_combi_pred_df.iterrows():
    
    cluster_p = np.array([float(p) for p in data['cluster_p'].split('|')])
    
    cluster_kill_A = np.array([float(k) for k in data['cluster_cell_death_A'].split('|')])
    cluster_kill_B = np.array([float(k) for k in data['cluster_cell_death_B'].split('|')])
    
    kill_A = float(data['cell_death_A'])
    kill_B = float(data['cell_death_B'])
    
    cluster_kill_C = cluster_kill_A + cluster_kill_B - np.multiply(cluster_kill_A/100, cluster_kill_B/100)*100
    kill_C = np.sum(cluster_p * cluster_kill_C)
    
    best_kill = np.max([kill_A, kill_B])
    improve = kill_C - best_kill
    improve_p = (kill_C - best_kill) / best_kill
    
    ##### specificity (entropy) #####
    
    temp_A = np.sum(cluster_p[cluster_kill_A > cluster_kill_B])
    temp_B = np.sum(cluster_p[cluster_kill_A <= cluster_kill_B])
    if temp_A == 0 or temp_B == 0:
        entropy = 0
    else:
        entropy = -(temp_A * np.log2(temp_A) + temp_B * np.log2(temp_B))
    
    sum_kill_dif = np.sum(np.abs(cluster_kill_A - cluster_kill_B))
    
    ##### save output #####
    
    rows += [['|'.join(["{:.14}".format(k) for k in cluster_kill_C])] + [kill_C, improve, improve_p, entropy, sum_kill_dif]]
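
The expression for `cluster_kill_C` has the same form as the Bliss independence model: if two drugs act independently, the fraction of cells killed by the combination is the sum of the individual fractions minus their product. A minimal sketch with made-up values, where `bliss_combined_kill` is a hypothetical helper name:

```python
import numpy as np

def bliss_combined_kill(kill_a, kill_b):
    """Combined cell death (in percent) for two independently acting drugs."""
    kill_a = np.asarray(kill_a, dtype=float)
    kill_b = np.asarray(kill_b, dtype=float)
    return kill_a + kill_b - kill_a * kill_b / 100.0

# Toy patient (made up) with two clusters and their frequencies.
toy_cluster_p = np.array([0.6, 0.4])
toy_combi = bliss_combined_kill([50.0, 10.0], [50.0, 0.0])
toy_patient_combi = float(np.sum(toy_cluster_p * toy_combi))
```

Because the combination is evaluated per cluster before weighting, two drugs that target different clusters can score higher than the patient-level averages alone would suggest.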

In [30]:
drug_combi_pred_df = pd.concat([drug_combi_pred_df, pd.DataFrame(rows, columns=['cluster_cell_death_combi', 'cell_death_combi', 'improve', 'improve_p', 'kill_entropy', 'sum_kill_dif'])], axis=1)
drug_combi_pred_df.shape

(168, 23)
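
The `kill_entropy` column summarizes how evenly the two drugs split the patient's cells between them: it is the binary entropy of the cell fraction in clusters where drug A kills more versus the fraction where drug B kills at least as much. A minimal sketch of that calculation with made-up inputs (the function below is a hypothetical restatement of the loop body, not library code):

```python
import math

def kill_entropy(cluster_p, kill_a, kill_b):
    """Binary entropy (bits) of the cell fractions best targeted by drug A vs drug B."""
    share_a = sum(p for p, a, b in zip(cluster_p, kill_a, kill_b) if a > b)
    share_b = sum(p for p, a, b in zip(cluster_p, kill_a, kill_b) if a <= b)
    if share_a == 0 or share_b == 0:
        # One drug is at least as good everywhere: no complementarity.
        return 0.0
    return -(share_a * math.log2(share_a) + share_b * math.log2(share_b))

complementary = kill_entropy([0.5, 0.5], [80.0, 10.0], [20.0, 60.0])
one_sided = kill_entropy([0.5, 0.5], [80.0, 70.0], [20.0, 60.0])
```

Values near 1 indicate that each drug is the better option for roughly half of the cells, while 0 means one drug dominates in every cluster. When the cluster frequencies sum to less than 1 after the cutoff, the value is not a strict entropy, only a closely related score.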

### Final drug combination predictions for patients

In [35]:
drug_combi_pred_df = drug_combi_pred_df[['patient', 'drug_id_A', 'drug_name_A', 'drug_id_B', 'drug_name_B', 'cluster', 'cluster_p', 'cluster_cell_death_A', 'cluster_cell_death_B', 'cluster_cell_death_combi', 'cell_death_A', 'cell_death_B', 'cell_death_combi', 'improve', 'improve_p', 'kill_entropy', 'sum_kill_dif']]

drug_combi_pred_df[['patient', 'drug_id_A', 'drug_id_B', 'cluster', 'cell_death_A', 'cell_death_B', 'cell_death_combi', 'improve']].head()

Unnamed: 0,patient,drug_id_A,drug_id_B,cluster,cell_death_A,cell_death_B,cell_death_combi,improve
0,HN120,1007,133,D1|D2|G1|G2,8.5209921135827,75.63044860423,77.19,1.56
1,HN120,1007,201,D1|D2|G1|G2,8.5209921135827,60.390313965552,63.04,2.65
2,HN120,1007,1010,D1|D2|G1|G2,8.5209921135827,15.312538173024,22.39,7.08
3,HN120,1007,182,D1|D2|G1|G2,8.5209921135827,64.658542403814,67.22,2.56
4,HN120,1007,301,D1|D2|G1|G2,8.5209921135827,63.393584136265,66.39,3.0


In [32]:
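# The format string needs one placeholder per argument; with a single '{}' the model name was silently dropped
drug_combi_pred_df.to_csv(output_dir + 'pred_combi_kill_{}_{}.csv'.format(ref_type, model_name), index=False)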

---

**Authors:** Chayaporn Suphavilai, Rafael Peres da Silva, Genome Institute of Singapore, Nagarajan Lab, November 2020

---

Reproducibility tips from https://github.com/jupyter-guide/ten-rules-jupyter