# CaDRReS-Sc Training
This notebook shows how to fit a CaDRReS-Sc model on a training set (e.g., GDSC or any pharmacogenomic experiment you might have) and save the final model to make predictions on new data.

In [1]:
import sys, os, pickle
import pandas as pd
import numpy as np
np.set_printoptions(precision=2)
from collections import Counter
import importlib
from ipywidgets import widgets

scriptpath = '..'
sys.path.append(os.path.abspath(scriptpath))

from cadrres import pp, model, evaluation, utility


# Read data set
This step expects a dataset with gene expression profiles and screened drug responses. The example below loads both from the GDSC dataset.

## Read cell line info

- Cell line tissue info
- Observed drug response IC50 to be used as ground truth

In [2]:
tissue_sample_df = pd.read_csv('../data/GDSC/GDSC_tissue_info.csv', index_col=0)
tissue_sample_df.index = tissue_sample_df.index.astype(str)

cell_line_obs_df = pd.read_csv('../data/GDSC/gdsc_all_abs_ic50_bayesian_sigmoid_only9dosages.csv', index_col=0)
cell_line_obs_df.index = cell_line_obs_df.index.astype(str)

# cell lines list which your model will be trained on
cell_line_sample_list = cell_line_obs_df.index.astype(str)

Optionally, select the samples of one cancer type so that they can be weighted during training. In this example we pick head and neck (HNSC) cell lines.

In [3]:
cell_line_hn_sample_list = tissue_sample_df[tissue_sample_df['TCGA_CLASS']=='HNSC'].index
cell_line_hn_obs_df = cell_line_obs_df.loc[cell_line_hn_sample_list]

## Read drug info

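In this example, we focus on 81 drugs that are sensitive in head and neck cell lines. The inline comment in the cell below names the two available files, `hn_drug_stat` and `drug_stat`; the output shown here (226 drugs) comes from `drug_stat`.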

In [5]:
dataset_drug_df = pd.read_csv('../preprocessed_data/GDSC/drug_stat.csv', index_col=0) # hn_drug_stat | drug_stat
dataset_drug_df.index = dataset_drug_df.index.astype(str)

dataset_drug_list = dataset_drug_df.index
dataset_drug_df.shape
print("Dataframe shape:", dataset_drug_df.shape, "\n")
dataset_drug_df.head(2)

Dataframe shape: (226, 27) 



| Drug ID | Drug Name | Synonyms | Target | Target Pathway | Selleckchem Cat# | CAS number | PubCHEM | Others | entropy | max_conc | ... | median_ic50_9f | log2_median_ic50_9f | log2_median_ic50_hn | median_ic50_hn | median_ic50_3f_hn | log2_median_ic50_3f_hn | median_ic50_9f_hn | log2_median_ic50_9f_hn | num_sensitive | num_sensitive_hn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Erlotinib | Tarceva, RG-1415, CP-358774, OSI-774, Ro-50823... | EGFR | EGFR signaling | S7786 | 183321-74-6 | 176870 | (S1023, 183319-69-9, HCl) | 7.045609 | 2.0 | ... | 8.44889 | 3.078762 | 7.76464 | 217.465095 | 72.488365 | 6.179678 | 24.162788 | 4.594715 | 17 | 1 |
| 1001 | AICA Ribonucleotide | AICAR, N1-(b-D-Ribofuranosyl)-5-aminoimidazole... | AMPK agonist | Metabolism | S1802 | 2627-69-2 | 65110 | NaN | 6.034272 | 2000.0 | ... | 206.74838 | 7.691732 | 9.939784 | 982.139588 | 327.379863 | 8.354822 | 109.126621 | 6.769859 | 476 | 27 |


## Read gene expression

The file can be downloaded from https://www.dropbox.com/s/3v576mspw5yewbm/GDSC_exp.tsv?dl=0

@TODO: maybe load gene expression using only essential gene list? Smaller dataset in the end.

In [6]:
gene_exp_df = pd.read_csv('../data/GDSC/GDSC_exp.tsv', sep='\t', index_col=0)
print("Dataframe shape:", gene_exp_df.shape, "\n")
gene_exp_df.head(2)

Dataframe shape: (17737, 1018) 



| GENE | 906826 | 687983 | 910927 | 1240138 | 1240139 | 906792 | 910688 | 1240135 | 1290812 | 907045 | ... | 753584 | 907044 | 998184 | 908145 | 1659787 | 1298157 | 1480372 | 1298533 | 930299 | 905954.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TSPAN6 | 7.632023 | 7.548671 | 8.712338 | 7.797142 | 7.729268 | 7.074533 | 3.285198 | 6.961606 | 5.943046 | 3.455951 | ... | 7.105637 | 3.236503 | 3.038892 | 8.373223 | 6.932178 | 8.441628 | 8.422922 | 8.089255 | 3.112333 | 7.153127 |
| TNMD | 2.964585 | 2.777716 | 2.643508 | 2.817923 | 2.957739 | 2.889677 | 2.828203 | 2.874751 | 2.686874 | 3.290184 | ... | 2.798847 | 2.745137 | 2.976406 | 2.852552 | 2.62263 | 2.639276 | 2.87989 | 2.521169 | 2.870468 | 2.834285 |


If a gene is measured by multiple probes, average their expression values so that each gene appears once.

In [7]:
gene_exp_df = gene_exp_df.groupby(gene_exp_df.index).mean()
print("Dataframe shape:", gene_exp_df.shape, "\n")
gene_exp_df.head(2)

Dataframe shape: (17419, 1018) 



| GENE | 906826 | 687983 | 910927 | 1240138 | 1240139 | 906792 | 910688 | 1240135 | 1290812 | 907045 | ... | 753584 | 907044 | 998184 | 908145 | 1659787 | 1298157 | 1480372 | 1298533 | 930299 | 905954.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1BG | 6.208447 | 5.02581 | 5.506955 | 4.208349 | 3.399366 | 4.917872 | 3.828088 | 5.146903 | 3.107543 | 5.062066 | ... | 4.272172 | 3.435025 | 4.930052 | 2.900213 | 4.523712 | 5.074951 | 2.957153 | 3.089628 | 4.047364 | 5.329524 |
| A1CF | 2.981775 | 2.947547 | 2.872071 | 3.075478 | 2.853231 | 3.221491 | 2.996355 | 2.893977 | 2.755668 | 2.98565 | ... | 2.941659 | 3.155536 | 2.983619 | 3.118312 | 2.975409 | 2.905804 | 2.944488 | 2.780003 | 2.870819 | 2.926353 |
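To make the averaging step concrete, here is a minimal sketch on a toy matrix (the gene and sample names are made up for illustration). It uses the same `groupby(...).mean()` call as the cell above.

```python
import pandas as pd

# Toy expression matrix: gene "A" is measured by two probes (two rows).
toy_exp_df = pd.DataFrame(
    {"s1": [2.0, 4.0, 5.0], "s2": [1.0, 3.0, 6.0]},
    index=pd.Index(["A", "A", "B"], name="GENE"),
)

# Group rows by gene symbol and average them, leaving one row per gene.
toy_mean_df = toy_exp_df.groupby(toy_exp_df.index).mean()
print(toy_mean_df)
```

The row count drops by the number of duplicated probes, which matches the change in shape reported above (17737 rows before, 17419 after).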


### Normalize gene expression
We normalize the baseline expression of each gene by computing its fold-change relative to the average value across cell lines.

In [8]:
cell_line_log2_mean_fc_exp_df, cell_line_mean_exp_df = pp.gexp.normalize_log2_mean_fc(gene_exp_df)
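The following sketch illustrates the idea behind this normalization. It assumes the input is already on a log2 scale, so a fold-change relative to the average becomes a simple difference. The function name `log2_mean_fc_sketch` is hypothetical, and the actual `pp.gexp.normalize_log2_mean_fc` may differ in its details.

```python
import pandas as pd

def log2_mean_fc_sketch(log2_exp_df):
    """Hypothetical stand-in for the normalization step (assumes log2 input)."""
    # Per-gene average across all samples (columns).
    mean_exp = log2_exp_df.mean(axis=1)
    # On a log2 scale, a fold-change is a difference from that average.
    log2_fc_df = log2_exp_df.sub(mean_exp, axis=0)
    return log2_fc_df, mean_exp

# Toy data: genes as rows, samples as columns.
toy = pd.DataFrame({"s1": [7.0, 3.0], "s2": [9.0, 3.0]}, index=["TSPAN6", "TNMD"])
fc_df, mean_exp = log2_mean_fc_sketch(toy)
print(fc_df)
```

After this step every gene is centered at zero, so positive values mean above-average expression in that cell line.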

### Read essential genes list

Load the essential gene list, or substitute any other gene set you want the model to be trained on.

In [9]:
ess_gene_list = utility.get_gene_list('../data/essential_genes.txt')

### Sample with both expression and response data

In [10]:
cell_line_sample_list = np.array([s for s in cell_line_sample_list if s in gene_exp_df.columns])
len(cell_line_sample_list)

985

Align the gene expression and drug response matrices to the shared samples and drugs.

In [11]:
cell_line_log2_mean_fc_exp_df = cell_line_log2_mean_fc_exp_df[cell_line_sample_list]
cell_line_obs_df = cell_line_obs_df.loc[cell_line_sample_list, dataset_drug_list]
dataset_drug_df = dataset_drug_df.loc[dataset_drug_list]

cell_line_log2_mean_fc_exp_df.shape, cell_line_obs_df.shape, dataset_drug_df.shape

((17419, 985), (985, 226), (226, 27))

# Calculate kernel feature 

Kernel features are computed from all cell line samples with gene expression profiles, restricted to a list of genes (e.g. the essential gene list). This step can take a few minutes.

In [12]:
kernel_feature_df = pp.gexp.calculate_kernel_feature(cell_line_log2_mean_fc_exp_df, cell_line_log2_mean_fc_exp_df, ess_gene_list).loc[cell_line_sample_list]

Calculating kernel features based on 1610 common genes
(17419, 985) (17419, 985)
100 of 985 (20.27)s
200 of 985 (20.46)s
300 of 985 (15.84)s
400 of 985 (12.80)s
500 of 985 (12.70)s
600 of 985 (12.78)s
700 of 985 (12.77)s
800 of 985 (12.95)s
900 of 985 (13.11)s
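As an illustration of what a kernel feature can look like, the sketch below computes the Pearson correlation between every pair of samples over the genes they share. That the library uses exactly this similarity is an assumption on our part (the training matrix shown later is symmetric with ones on the diagonal, which is consistent with it); `kernel_feature_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def kernel_feature_sketch(new_exp_df, ref_exp_df, gene_list):
    """Hypothetical sketch: Pearson correlation of each new sample with each reference sample."""
    # Keep only genes present in both matrices and in the gene list.
    common = [g for g in gene_list if g in new_exp_df.index and g in ref_exp_df.index]
    a = new_exp_df.loc[common].to_numpy(dtype=float)
    b = ref_exp_df.loc[common].to_numpy(dtype=float)
    # Standardize each sample (column) so that a scaled dot product equals Pearson r.
    a = (a - a.mean(axis=0)) / a.std(axis=0)
    b = (b - b.mean(axis=0)) / b.std(axis=0)
    corr = a.T @ b / len(common)
    return pd.DataFrame(corr, index=new_exp_df.columns, columns=ref_exp_df.columns)

# Toy data: 50 genes, 4 samples.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(50, 4)),
    index=[f"g{i}" for i in range(50)],
    columns=["c1", "c2", "c3", "c4"],
)
toy_kernel_df = kernel_feature_sketch(toy, toy, list(toy.index))
print(toy_kernel_df.round(2))
```

Each row of the result describes one sample by its similarity to every reference sample, which is why the training matrix is square (samples by samples).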


# Model training

In [13]:
# kernel feature based only on training samples
X_train = kernel_feature_df.loc[cell_line_sample_list, cell_line_sample_list]
# observed drug response
Y_train = cell_line_obs_df.loc[cell_line_sample_list]

In [14]:
print("Dataframe shape:", X_train.shape, "\n")
X_train.head(2)

Dataframe shape: (985, 985) 



|  | 1240121 | 1240122 | 1240123 | 1240124 | 1240125 | 1240127 | 1240128 | 1240129 | 1240130 | 1240131 | ... | 949175 | 949176 | 949177 | 949178 | 949179 | 971773 | 971774 | 971777 | 998184 | 998189 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1240121 | 1.0 | 0.200762 | -0.097257 | 0.079455 | -0.080807 | -0.107964 | -0.058302 | 0.079915 | 0.063199 | 0.035671 | ... | -0.129215 | -0.179337 | -0.0953 | -0.112817 | -0.186527 | -0.088457 | -0.143004 | -0.189747 | -0.25959 | -0.054617 |
| 1240122 | 0.200762 | 1.0 | 0.193214 | -0.049567 | -0.180749 | 0.187601 | 0.042315 | 0.17116 | -0.049354 | -0.061332 | ... | -0.008915 | 0.042224 | 0.080204 | 0.052032 | -0.091817 | 0.007112 | 0.046598 | 0.099549 | -0.010853 | -0.037156 |


In [15]:
print("Dataframe shape:", Y_train.shape, "\n")
Y_train.head(2)

Dataframe shape: (985, 226) 



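| Drug ID | 1 | 1001 | 1003 | 1004 | 1005 | 1006 | 1007 | 1008 | 1009 | 1010 | ... | 64 | 71 | 83 | 86 | 87 | 88 | 89 | 9 | 91 | 94 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1240121 | NaN | 8.92234 | -9.065271 | -8.301324 | 1.998429 | -2.455685 | -11.688212 | 4.329096 | 9.033478 | -0.035476 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1240122 | NaN | 9.594524 | -8.343748 | -7.554691 | 6.033703 | -2.145146 | -10.332062 | 4.108168 | 7.775115 | -4.102308 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |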


## Select CaDRReS training for different objective functions

1. `cadrres-wo-sample-bias`: CaDRReS + no bp (bp = sample bias)
2. `cadrres-wo-sample-bias-weight`: CaDRReS + no bp + ciu + du (ciu = drug-sample weight w.r.t. maximum dosage, du = indication-specific weight). This is the **CaDRReS-Sc** model.

In [16]:
obj_function = widgets.Dropdown(options=['cadrres-wo-sample-bias', 'cadrres-wo-sample-bias-weight'], description='Objective function')

In [17]:
display(obj_function)

Dropdown(description='Objective function', options=('cadrres-wo-sample-bias', 'cadrres-wo-sample-bias-weight'),…

In [18]:
model_spec_name = obj_function.value # cadrres-wo-sample-bias | cadrres-wo-sample-bias-weight

if model_spec_name in ['cadrres-wo-sample-bias']:
    indication_specific_degree = 1 # multiply by 1 = disabled
else:
    indication_specific_degree = 10

indication_specific_degree

1

## Specify output directory

In [19]:
output_dir = '../example_result/'

In [20]:
os.makedirs(output_dir, exist_ok=True)

print ('Results will be saved in ', output_dir)

Results will be saved in  ../example_result/


## Train CaDRReS Model 

Prepare x0 for calculating the logistic sample weight (o_i) based on the maximum drug dosage.

In [21]:
sample_weights_logistic_x0_df = model.get_sample_weights_logistic_x0(dataset_drug_df, 'log2_max_conc', X_train.index)
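The sketch below shows one way such a weight could be built. It is an assumption, not the library's implementation: x0 is taken to be each drug's log2 maximum concentration repeated for every sample, and the weight is a logistic function that stays near 1 for responses below the maximum dosage and decays towards 0 above it. Both function names and the `slope` parameter are hypothetical.

```python
import numpy as np
import pandas as pd

def logistic_x0_sketch(drug_df, col, sample_index):
    """Hypothetical: repeat each drug's log2 max concentration for every sample."""
    x0 = np.tile(drug_df[col].to_numpy(dtype=float), (len(sample_index), 1))
    return pd.DataFrame(x0, index=sample_index, columns=drug_df.index)

def logistic_weight_sketch(log2_ic50_df, x0_df, slope=1.0):
    """Hypothetical weight: near 1 below the max dosage, decaying towards 0 above it."""
    return 1.0 / (1.0 + np.exp(slope * (log2_ic50_df - x0_df)))

# Toy data: two drugs with different maximum concentrations, two samples.
toy_drug_df = pd.DataFrame({"log2_max_conc": [1.0, 3.0]}, index=["d1", "d2"])
toy_samples = pd.Index(["s1", "s2"])
toy_x0_df = logistic_x0_sketch(toy_drug_df, "log2_max_conc", toy_samples)

toy_ic50_df = pd.DataFrame([[1.0, 0.0], [4.0, 6.0]], index=toy_samples, columns=toy_drug_df.index)
toy_weight_df = logistic_weight_sketch(toy_ic50_df, toy_x0_df)
print(toy_weight_df.round(3))
```

Under this assumption, an IC50 extrapolated far beyond the tested dosage range contributes less to the training error than one measured within it.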

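Prepare the indication weights. With `cadrres-wo-sample-bias` the degree is 1, so every weight stays at 1; otherwise the head and neck samples are up-weighted by `indication_specific_degree`.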

In [22]:
indication_weight_df = pd.DataFrame(np.ones(Y_train.shape), index=Y_train.index, columns=Y_train.columns)
cv_cell_line_hn_sample_list = [cl for cl in cell_line_hn_sample_list if cl in X_train.index]
indication_weight_df.loc[cv_cell_line_hn_sample_list, :] = indication_weight_df.loc[cv_cell_line_hn_sample_list, :] * indication_specific_degree

Start model training

In [33]:
# The training set is passed as both the training and the test set. The positional arguments 10, 0.0, 100000, 0.01
# appear to be the latent dimension, regularization strength, maximum iterations and learning rate; check model.train_model to confirm.
if model_spec_name in ['cadrres', 'cadrres-wo-sample-bias']:
    cadrres_model_dict, cadrres_output_dict = model.train_model(Y_train, X_train, Y_train, X_train, 10, 0.0, 100000, 0.01, model_spec_name=model_spec_name, save_interval=5000, output_dir=output_dir)
elif model_spec_name in ['cadrres-wo-sample-bias-weight']:
    cadrres_model_dict, cadrres_output_dict = model.train_model_logistic_weight(Y_train, X_train, Y_train, X_train, sample_weights_logistic_x0_df, indication_weight_df, 10, 0.0, 100000, 0.01, model_spec_name=model_spec_name, save_interval=5000, output_dir=output_dir)

Initializing the model ...



Train: 71554 out of 79785
Starting model training ...




MSE train at step 0: 12.647 (0.00m)
MSE train at step 5000: 6.649 (0.70m)
MSE train at step 10000: 5.037 (1.38m)
MSE train at step 15000: 4.499 (2.09m)
MSE train at step 20000: 4.286 (2.78m)
MSE train at step 25000: 4.178 (3.48m)
MSE train at step 30000: 4.108 (4.17m)
MSE train at step 35000: 4.054 (4.85m)
MSE train at step 40000: 4.008 (5.53m)
MSE train at step 45000: 3.968 (6.21m)
MSE train at step 50000: 3.930 (6.89m)
MSE train at step 55000: 3.893 (7.58m)
MSE train at step 60000: 3.858 (8.28m)
MSE train at step 65000: 3.824 (8.96m)
MSE train at step 70000: 3.791 (9.64m)
MSE train at step 75000: 3.759 (10.34m)
MSE train at step 80000: 3.727 (11.04m)
MSE train at step 85000: 3.697 (11.74m)
MSE train at step 90000: 3.668 (12.48m)
MSE train at step 95000: 3.639 (13.21m)
Saving model parameters and predictions ...
DONE
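The log above reports "Train: 71554 out of 79785" and a training MSE. One plausible reading, which is our assumption rather than something the notebook states, is that only observed (non-missing) drug responses enter the error. A minimal sketch of such a masked MSE, with a hypothetical function name:

```python
import numpy as np

def masked_mse(y_obs, y_pred):
    """MSE over observed entries only; NaN marks an unscreened drug-cell line pair."""
    mask = ~np.isnan(y_obs)
    n_observed = int(mask.sum())
    mse = float(np.mean((y_obs[mask] - y_pred[mask]) ** 2))
    return mse, n_observed

# Toy data: one of four responses is missing.
toy_obs = np.array([[1.0, np.nan], [3.0, 4.0]])
toy_pred = np.array([[1.0, 0.0], [2.0, 6.0]])
toy_mse, toy_n = masked_mse(toy_obs, toy_pred)
print(f"Train: {toy_n} out of {toy_obs.size}, MSE = {toy_mse:.3f}")
```

Missing entries are skipped rather than imputed, so drugs screened on few cell lines simply contribute fewer terms.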


## Save the final CaDRReS Model trained on your dataset

- cadrres_model_dict contains model hyperparameters and trained parameters
- cadrres_output_dict contains data and prediction on training dataset

In [35]:
param_path = output_dir + '{}_param_dict.pickle'.format(model_spec_name)
output_path = output_dir + '{}_output_dict.pickle'.format(model_spec_name)
print('Saving ' + param_path)
with open(param_path, 'wb') as f:  # the context manager closes the file after writing
    pickle.dump(cadrres_model_dict, f)
print('Saving ' + output_path)
with open(output_path, 'wb') as f:
    pickle.dump(cadrres_output_dict, f)

Saving ../example_result/cadrres-wo-sample-bias_param_dict.pickle
Saving ../example_result/cadrres-wo-sample-bias_output_dict.pickle
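To check that a saved file can be read back, a pickle round trip like the one below is usually enough. The dictionary content here is a made-up placeholder, not the real structure of `cadrres_model_dict`, and the file is written to a temporary directory so the sketch does not touch your results.

```python
import os
import pickle
import tempfile

# Hypothetical placeholder for the trained parameter dictionary.
toy_param_dict = {"example_key": [1.0, 2.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cadrres-wo-sample-bias_param_dict.pickle")
    # Write the dictionary, then read it back from disk.
    with open(path, "wb") as f:
        pickle.dump(toy_param_dict, f)
    with open(path, "rb") as f:
        loaded_param_dict = pickle.load(f)

print(loaded_param_dict == toy_param_dict)
```

Only unpickle files you created yourself or obtained from a trusted source, since loading a pickle can execute arbitrary code.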


## Next Step
After saving the CaDRReS model here, continue with the next step in the workflow: [notebook_02_prediction.ipynb](./notebook_02_prediction.ipynb).

---

**Authors:** Chayaporn Suphavilai, Rafael Peres da Silva, Genome Institute of Singapore, Nagarajan Lab, April 30, 2020

---

Reproducibility tips from https://github.com/jupyter-guide/ten-rules-jupyter