<font size=5>**Download and Preprocessing of DD-PRiSM**</font>

<font size=4>Monotherapy response dataset</font>
-NCI60

<font size=4>Combination therapy response dataset</font>
-NCI-ALMANAC

<font size=4>Cell line expression data</font>
-CCLE

<font size=4>Drug information</font>
-NCI Chemical data

<font size=4>Additional Drug information</font>
-PubChem

<font size=4>Pathway information</font>
-MSigDB


**The link for NCI60 response dataset**

NCI60[ https://wiki.nci.nih.gov/display/NCIDTPdata/NCI-60+Growth+Inhibition+Data ] 

-file: DOSERESP.csv (CONCENTRATION/RESPONSE DATA)[ https://wiki.nci.nih.gov/download/attachments/147193864/DOSERESP.zip?version=10&modificationDate=1704733010000&api=v2 ]

**The link for NCI-ALMANAC response dataset**

NCI-ALMANAC[ https://wiki.nci.nih.gov/display/NCIDTPdata/NCI-ALMANAC ] 

-file: ComboDrugGrowth_Nov2017.csv[ https://wiki.nci.nih.gov/display/NCIDTPdata/NCI-ALMANAC?preview=/338237347/357699398/ComboDrugGrowth_Nov2017.zip ]

**The link for CCLE expression data**

DepMap[ https://depmap.org/portal/download/all/ ]

-file: OmicsExpressionProteinCodingGenesTPMLogp1.csv [ https://depmap.org/portal/download/all/?releasename=DepMap+Public+23Q4&filename=OmicsExpressionProteinCodingGenesTPMLogp1.csv ]

**The link for Cell line annotation for converting DepMap Model ID into Cell line name**

-file: DepMap-2018q3-celllines.csv [ https://depmap.org/portal/download/all/?release=DepMap+Public+18Q3&file=DepMap-2018q3-celllines.csv ]

**The link for Drug information**

-file: Chem2D_Jun2016.sdf [ https://wiki.nci.nih.gov/display/NCIDTPdata/Chemical+Data?preview=/155844992/339380766/Chem2D_Jun2016.zip ]

**The link for Pathway information**

MSigDB[ http://www.gsea-msigdb.org/gsea/index.jsp ]

-file: c2.cp.kegg_legacy.v2023.2.Hs.symbols.gmt (KEGG_LEGACY) [ https://www.gsea-msigdb.org/gsea/msigdb/download_file.jsp?filePath=/msigdb/release/2023.2.Hs/c2.cp.kegg_legacy.v2023.2.Hs.symbols.gmt ]
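
The cells below read these files from a `Raw/` folder under `base_directory` (for example `base_directory+'Raw/DOSERESP.csv'`). A minimal sketch for checking that the downloads are in place before running the pipeline; the helper name `missing_raw_files` is hypothetical, and the list assumes the zip archives have already been extracted.

```python
import os

# File names taken from the download list above; the cells below read them from Raw/.
RAW_FILES = [
    'DOSERESP.csv',
    'ComboDrugGrowth_Nov2017.csv',
    'OmicsExpressionProteinCodingGenesTPMLogp1.csv',
    'DepMap-2018q3-celllines.csv',
    'Chem2D_Jun2016.sdf',
    'c2.cp.kegg_legacy.v2023.2.Hs.symbols.gmt',
]

def missing_raw_files(base_directory, filenames=RAW_FILES):
    """Return the expected raw files that are absent from <base_directory>/Raw/."""
    raw_directory = os.path.join(base_directory, 'Raw')
    return [name for name in filenames
            if not os.path.isfile(os.path.join(raw_directory, name))]
```

An empty list from `missing_raw_files(base_directory)` would indicate that every expected file was found.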

In [None]:
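base_directory='Base directory where DD-PRiSM is located' #Placeholder: set to your own path, ending with '/' because paths are built as base_directory+'Raw/...'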

# Loading Packages

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

%config Completer.use_jedi=False
import os

import rdkit
import rdkit.Chem as Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Fingerprints import FingerprintMols

import pickle
import random
import urllib.request
import re

from IPython.display import clear_output

import tqdm

from scipy.stats import zscore

import difflib

In [None]:
from itertools import chain

def flatten_list(list_of_list):
    return list(chain.from_iterable(list_of_list))

def concat_str(str_list):
    str_tmp=""
    for str_element in str_list:
        str_tmp+=str_element
    return str_tmp

def find_common(list1,list2):
    return [x for x in list1 if x in list2]
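
`find_common` tests membership against a list, so every lookup scans `list2` from the start. For long lists of hashable items such as DepMap IDs, a set-based variant should return the same result faster. This is a sketch of an alternative, not part of the original pipeline; the name `find_common_fast` and the IDs in the example are illustrative.

```python
def find_common_fast(list1, list2):
    """Order-preserving intersection: items of list1 that also occur in list2."""
    lookup = set(list2)  # constant-time membership tests; requires hashable items
    return [x for x in list1 if x in lookup]

print(find_common_fast(['ACH-000001', 'ACH-000002', 'ACH-000003'],
                       ['ACH-000003', 'ACH-000001']))
# prints ['ACH-000001', 'ACH-000003']
```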

<font size=6>Single-Drug response data (NCI60)</font>

# Preprocessing of Drugs for NCI60

In [None]:
suppl = Chem.SDMolSupplier(base_directory+'Raw/Chem2D_Jun2016.sdf')
mols=[x for x in suppl]
#'Chem2D_Jun2016.sdf' is the SDF file with the NCI compound information

mols_without_None=[x for x in mols if x is not None]
#Filter out 'None' entries (molecules that the RDKit library could not process)

Mol_df=pd.DataFrame({'Mol':mols_without_None,'NSC':[x.GetProp('_Name') for x in mols_without_None]})
#Dataframe with Molecules

Mol_df['fingerprint']=[AllChem.GetMorganFingerprintAsBitVect(mol,2,nBits=512) for mol in Mol_df.Mol]
#Morgan fingerprint (radius 2, 512 bits), used as the drug feature

compound_morgan512=pd.DataFrame([list(x) for x in Mol_df.fingerprint.values])
#Dataframe that consists of Morgan Fingerprints

compound_morgan512.index=Mol_df['NSC']
compound_morgan512.index=compound_morgan512.index.astype(int)
#Index mapping

NSC_list=compound_morgan512.index
#NSC_list=Drug list

In [None]:
compound_morgan512.to_csv(base_directory+'Input/Fingerprint_Morgan512.csv')

# Preprocessing Responses (NCI60)

In [None]:
compound_morgan512=pd.read_csv(base_directory+'Input/Fingerprint_Morgan512.csv',index_col=0)

In [None]:
nci60_raw=pd.read_csv(base_directory+'Raw/DOSERESP.csv')
nci60_raw=nci60_raw[nci60_raw.CONCENTRATION_UNIT=='M'] #Other units cannot be converted into M (molar)

nci60_compact=nci60_raw[['NSC','CONCENTRATION','CELL_NAME','AVERAGE_GIPRCNT']].copy() #Copy, so the in-place edits below do not act on a view of nci60_raw
nci60_compact['CONCENTRATION']+=6 #Convert log10 concentration in M (molar) to log10 concentration in uM (micromolar)

cellline_annotation=pd.read_csv(base_directory+'Raw/DepMap-2018q3-celllines.csv')

cellline_annotation.CCLE_Name=[x.split('_')[0] for x in cellline_annotation.CCLE_Name]
nci60_cellline=nci60_compact.drop_duplicates(subset='CELL_NAME')[['CELL_NAME']]
cellline_annotation_ccle=cellline_annotation[['Broad_ID','CCLE_Name']]
cellline_annotation_ccle.columns=['Broad_ID','CELL_NAME']
cellline_annotation_aliases=cellline_annotation[['Broad_ID','Aliases']]
cellline_annotation_aliases.columns=['Broad_ID','CELL_NAME']

nci60_cellline=pd.merge(nci60_cellline,cellline_annotation_ccle,how='left',on='CELL_NAME')
nci60_cellline=pd.merge(nci60_cellline,cellline_annotation_aliases,how='left',on='CELL_NAME')

nci60_cellline_valid=nci60_cellline[(~nci60_cellline.Broad_ID_x.isna())|(~nci60_cellline.Broad_ID_y.isna())]
nci60_cellline_nan=nci60_cellline[(nci60_cellline.Broad_ID_x.isna())&(nci60_cellline.Broad_ID_y.isna())]

nci60_cellline_valid['Broad_ID'] = np.where(~nci60_cellline_valid['Broad_ID_x'].isnull(),nci60_cellline_valid['Broad_ID_x'],nci60_cellline_valid['Broad_ID_y'])
nci60_cellline_valid.index = nci60_cellline_valid.CELL_NAME
nci60_cellline_valid = nci60_cellline_valid[['Broad_ID']]

nci60_cellline_nan=nci60_cellline_nan[['CELL_NAME']]
matched_cellline_name=[]
for idx,x in nci60_cellline_nan.iterrows():
    try:
        matched_cellline_name.append(difflib.get_close_matches(x.CELL_NAME,cellline_annotation.CCLE_Name)[0])
    except IndexError: #No sufficiently close match was found
        matched_cellline_name.append(None)
nci60_cellline_nan=nci60_cellline_nan.reset_index(drop=True)
nci60_cellline_nan['CELL_NAME_matched']=matched_cellline_name

#The fuzzy matches above are candidates only; the mapping below was curated manually
#based on the most similar cell line name (using Cellosaurus -> https://www.cellosaurus.org/) and replaces them
cellline_nan_dict={'CAKI-1':'CAKI1','RXF 393':'RXF393','786-0':'786O','A549/ATCC':'A549'
                   ,'SF-268':'SF268','HCT-116':'HCT116','OVCAR-5':'OVCAR5','UO-31':'UO31','HOP-62':'HOP62'
                    ,'MALME-3M':'MALME3M','UACC-257':'UACC257','SF-539':'SF539','TK-10':'TK10','NCI-H322M':'NCIH322M'
                    ,'MDA-MB-231/ATCC':'MDAMB231','HCC-2998':'HCC2998','RPMI-8226':'RPMI8226','SNB-75':'SNB75','HS 578T':'HS578T'
                   ,'U251':'U251MG','SW-620':'SW620','SK-MEL-2':'SKMEL2','769-P':'769P','SW-156':'SW156'
                    ,'SW-1573':'SW1573','SW 1088':'SW1088','RPMI-7951':'RPMI7951','SF-767':'SF767'
                   ,'MCF7/ATCC':'MCF7','CALU-1':'CALU1','CACO-2':'CACO2'}
nci60_cellline_nan=pd.DataFrame.from_dict(cellline_nan_dict,orient='index').reset_index()
nci60_cellline_nan.columns=['CELL_NAME','CELL_NAME_matched']

cellline_annotation_ccle.index=cellline_annotation_ccle.CELL_NAME
cellline_annotation_ccle=cellline_annotation_ccle[['Broad_ID']]
nci60_cellline_nan['Broad_ID']=cellline_annotation_ccle.loc[nci60_cellline_nan.CELL_NAME_matched].values
nci60_cellline_nan.index=nci60_cellline_nan.CELL_NAME
nci60_cellline_nan=nci60_cellline_nan[['Broad_ID']]

nci60_cellline_df=pd.concat([nci60_cellline_valid,nci60_cellline_nan],axis=0)


nci60_compact.AVERAGE_GIPRCNT+=100
nci60_compact.AVERAGE_GIPRCNT/=200

nci60_compact=nci60_compact[nci60_compact.NSC.isin(compound_morgan512.index)]
nci60_compact=nci60_compact[nci60_compact.CELL_NAME.isin(nci60_cellline_df.index)]

nci60_compact['depmap_id'] = nci60_cellline_df.loc[nci60_compact.CELL_NAME].values

#Load and z-normalize the expression data first, because the cell line filter below needs expression_df_zscore
expression_df=pd.read_csv(base_directory+'Raw/OmicsExpressionProteinCodingGenesTPMLogp1.csv',index_col=0)
expression_df.columns=[gene.split(' (')[0] for gene in expression_df.columns]

expression_df_zscore=expression_df.apply(zscore,axis=1)
expression_df_zscore.to_csv(base_directory+'Processed/Expression_ZNormalized.csv')

#Keep only the cell lines that also have expression data
cellline_expression_valid=list(expression_df_zscore.index)
cellline_nci60=list(set(nci60_compact.depmap_id))
cellline_common=find_common(cellline_expression_valid,cellline_nci60)
nci60_compact=nci60_compact[nci60_compact.depmap_id.isin(cellline_common)]

nci60_compact.columns=['NSC','CONCENTRATION','CELLNAME','VIABILITY','depmap_id']
nci60_compact.to_csv(base_directory+'Processed/NCI60_matched.csv')
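
Most entries in the manual mapping above differ from the NCI60 name only by punctuation, spaces, or a source suffix such as `/ATCC`. The following sketch reproduces that pattern; the function `normalize_cell_name` is an illustration and not part of the original notebook. A few dictionary entries, such as `'786-0':'786O'` and `'U251':'U251MG'`, do not follow the pattern and would still need manual curation.

```python
import re

def normalize_cell_name(name):
    """Drop a '/SOURCE' suffix, uppercase, and remove every non-alphanumeric character."""
    core = name.split('/')[0]
    return re.sub(r'[^A-Z0-9]', '', core.upper())

for raw in ['CAKI-1', 'RXF 393', 'A549/ATCC', 'MDA-MB-231/ATCC', 'HS 578T']:
    print(raw, '->', normalize_cell_name(raw))
```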

In [None]:
ccle2depmap_df=nci60_compact[['CELLNAME','depmap_id']].drop_duplicates(subset=['depmap_id'])
ccle2depmap_df.to_csv(base_directory+'Processed/CCLE2DepMap.csv')

In [None]:
nci60_compact=nci60_compact[(nci60_compact.VIABILITY<1.5)].groupby(by=['depmap_id','NSC','CONCENTRATION']).median().reset_index()
nci60_compact=pd.merge(nci60_compact,ccle2depmap_df,how='left',on='depmap_id')
nci60_compact=nci60_compact[['CELLNAME','NSC','CONCENTRATION','VIABILITY','depmap_id']]
nci60_compact.to_csv(base_directory+'Processed/NCI60_semifiltered.csv')
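
The preprocessing above rescales the growth percentage with `(AVERAGE_GIPRCNT+100)/200` and shifts the log10 molar concentration by 6. A small worked example of both transforms, as a sketch; the function names are illustrative and not part of the notebook.

```python
def growth_percent_to_viability(growth_percent):
    """Map growth percent from [-100, 100] onto [0, 1]: -100 -> 0, 0 -> 0.5, 100 -> 1."""
    return (growth_percent + 100) / 200

def log10_molar_to_log10_micromolar(log10_molar):
    """1 M = 1e6 uM, so the log10 value shifts by +6."""
    return log10_molar + 6

print(growth_percent_to_viability(100))     # 1.0
print(growth_percent_to_viability(-100))    # 0.0
print(log10_molar_to_log10_micromolar(-4))  # 1e-4 M = 100 uM, so 2
```

Under this scaling, the `VIABILITY<1.5` filter keeps measurements whose growth percentage is below 200.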

<font size=6>Gene expression grouping (NCI60)</font>

# Grouping gene expression values by Pathway gene set

In [None]:
expression_df_zscore=pd.read_csv(base_directory+'Processed/Expression_ZNormalized.csv',index_col=0) #Same path the file was written to above

In [None]:
ccle2depmap=pd.read_csv(base_directory+'Processed/CCLE2DepMap.csv',index_col=0) #Same path the file was written to above

In [None]:
ccle2depmap.index=ccle2depmap.depmap_id
ccle2depmap=ccle2depmap[['CELLNAME']]

In [None]:
valid_gene_list=expression_df_zscore.columns

#Loading Gene Set
#Gene Set File (gmt) is from MSigDB (http://www.gsea-msigdb.org/gsea/msigdb/collections.jsp)
KEGG_legacy_file='c2.cp.kegg_legacy.v2023.2.Hs.symbols.gmt' #186 gene sets

GeneSet_List=[]
GeneSetFile=base_directory+'Raw/'+KEGG_legacy_file
with open(GeneSetFile) as f:
    #GMT is tab-delimited: gene set name, description, then the member genes
    for row in csv.reader(f, delimiter='\t'):
        GeneSet_List.append(row)

GeneSet_Dic={}
for GeneSet in GeneSet_List:
    GeneSet_Dic[GeneSet[0]]=GeneSet[2:]

GeneSet_Dic_valid={}
for GeneSet in GeneSet_Dic:
    GeneSet_tmp=pd.Series(GeneSet_Dic[GeneSet])
    GeneSet_tmp=GeneSet_tmp[GeneSet_tmp.isin(valid_gene_list)]
    GeneSet_Dic_valid[GeneSet]=GeneSet_tmp
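
Each line of a GMT file is tab-delimited: the gene set name, a description field, and then the member genes, which is why the code above keeps `GeneSet[2:]`. A self-contained sketch of the same parsing and filtering on an in-memory example; the helper names, gene set names, and gene lists below are made up for illustration.

```python
import csv
import io

# Two made-up gene sets in GMT layout: name <TAB> description <TAB> gene1 <TAB> gene2 ...
gmt_text = "SET_A\thttp://example.org/a\tTP53\tEGFR\tKRAS\nSET_B\thttp://example.org/b\tBRCA1\tTP53\n"

def parse_gmt(handle):
    """Return {gene set name: list of member genes} from a GMT file handle."""
    return {row[0]: row[2:] for row in csv.reader(handle, delimiter='\t') if row}

def keep_valid_genes(gene_sets, valid_genes):
    """Drop genes that are absent from the expression matrix columns."""
    valid = set(valid_genes)
    return {name: [g for g in genes if g in valid] for name, genes in gene_sets.items()}

gene_sets = parse_gmt(io.StringIO(gmt_text))
print(keep_valid_genes(gene_sets, ['TP53', 'KRAS', 'BRCA1']))
# prints {'SET_A': ['TP53', 'KRAS'], 'SET_B': ['BRCA1', 'TP53']}
```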

In [None]:
def CelllineFeatureExtract(ExpressionMatrix, CellLine):
    """Cell line feature extraction: return one single-row DataFrame per gene set,
    holding the expression of that gene set's genes for CellLine."""
    X_Feature=[]
    for GeneSet in GeneSet_Dic_valid.keys():
        Gene_in_GeneSet=list(GeneSet_Dic_valid[GeneSet])
        X_Feature.append(ExpressionMatrix[Gene_in_GeneSet].loc[[CellLine]])
    return X_Feature

In [None]:
cellline_input=[]
for i in range(len(GeneSet_Dic_valid)):
    cellline_input.append(pd.DataFrame())
for cellline in tqdm.tqdm(expression_df_zscore.index):
    x=CelllineFeatureExtract(expression_df_zscore,cellline)
    for j in range(len(GeneSet_Dic_valid)):
        cellline_input[j]=pd.concat([cellline_input[j],x[j]],axis=0)


In [None]:
for idx,key in enumerate(GeneSet_Dic_valid.keys()):
    cellline_input[idx].to_csv(base_directory+'CellLine_Overall/'+key+'.csv')

In [None]:
nci60_cellline_expression_list=[]
for df in tqdm.tqdm(cellline_input):
    df=df.loc[ccle2depmap.index]
    df.index=ccle2depmap.CELLNAME
    nci60_cellline_expression_list.append(df)

In [None]:
for idx,key in enumerate(GeneSet_Dic_valid.keys()):
    nci60_cellline_expression_list[idx].to_csv(base_directory+'Input/'+key+'.csv')
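
The loop above builds each gene set table by concatenating one row per cell line. Assuming the expression index contains no duplicate cell lines, selecting the gene set's columns once should give the same table in a single step. A sketch of that alternative with toy data; `group_expression_by_gene_set` is a hypothetical helper, and the IDs and values are made up.

```python
import pandas as pd

def group_expression_by_gene_set(expression, gene_sets):
    """Return {gene set name: DataFrame of cell lines x genes in that set}."""
    return {name: expression[list(genes)] for name, genes in gene_sets.items()}

expression = pd.DataFrame(
    {'TP53': [0.1, -0.3], 'KRAS': [1.2, 0.4], 'BRCA1': [-0.7, 0.9]},
    index=['ACH-A', 'ACH-B'],  # made-up cell line IDs
)
grouped = group_expression_by_gene_set(
    expression, {'SET_A': ['TP53', 'KRAS'], 'SET_B': ['BRCA1']})
print(grouped['SET_A'])
```

Column selection avoids the repeated `pd.concat` calls, whose cost grows with the number of cell lines already accumulated.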

# Filtering pairs by Parameter&Concentration&Standard deviation

In [None]:
nci60_compact=pd.read_csv(base_directory+'Processed/NCI60_semifiltered.csv',index_col=0)

In [None]:
nci60_sorted=nci60_compact.sort_values(by=['CELLNAME','NSC','CONCENTRATION'])
nci60_sorted['CONCENTRATION']=[np.around(conc,5) for conc in nci60_sorted.CONCENTRATION]
nci60_sorted['conc_delta']=nci60_sorted.CONCENTRATION.shift(1)-nci60_sorted.CONCENTRATION
nci60_sorted['identity']=(nci60_sorted.NSC.shift(1)==nci60_sorted.NSC)&(nci60_sorted.CELLNAME.shift(1)==nci60_sorted.CELLNAME)
nci60_sorted['conc_delta_around']=np.around(nci60_sorted.conc_delta,5)

In [None]:
#Filter out pairs whose dilution step is not 10-fold (1 in log10 units) -> these result from aggregating multiple experiments with batch effects
filtered_pair_df=nci60_sorted[(nci60_sorted.identity==True)&(nci60_sorted.conc_delta_around!=-1.0)].drop_duplicates(subset=['CELLNAME','NSC'])[['CELLNAME','NSC']]
filtered_combination_df=pd.merge(nci60_sorted.reset_index(),filtered_pair_df,how='right',on=['CELLNAME','NSC'])
column_list=['idx']
column_list.extend(filtered_combination_df.columns[1:])
filtered_combination_df.columns=column_list
nci60_filtered=nci60_sorted[~nci60_sorted.index.isin(filtered_combination_df.idx)]
nci60_filtered=nci60_filtered[['CELLNAME','NSC','CONCENTRATION','VIABILITY','depmap_id']]

#Filtering pairs with zero std (no change of viability among all concentrations)
nci60_std=nci60_filtered.groupby(['CELLNAME','NSC']).std()[['VIABILITY']].reset_index()
filtered_pair_df=nci60_std[nci60_std.VIABILITY==0]
filtered_combination_df=pd.merge(nci60_sorted.reset_index(),filtered_pair_df,how='right',on=['CELLNAME','NSC'])
column_list=['idx']
column_list.extend(filtered_combination_df.columns[1:])
filtered_combination_df.columns=column_list
nci60_filtered=nci60_filtered[~nci60_filtered.index.isin(filtered_combination_df.idx)]

In [None]:
nci60_filtered.to_csv(base_directory+'Processed/NCI60_filtered.csv')
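
The two filters above can be read as per-pair checks on the sorted dose series: consecutive log10 concentrations must be exactly one unit apart, and the viability must vary across concentrations. A pure-Python sketch of both checks for a single cell line-drug pair; the function names are illustrative, and the rounding to 5 decimals mirrors the cell above.

```python
import statistics

def has_tenfold_steps(log10_concentrations):
    """True if every consecutive pair of sorted log10 concentrations differs by exactly 1."""
    concs = sorted(round(c, 5) for c in log10_concentrations)
    return all(round(b - a, 5) == 1.0 for a, b in zip(concs, concs[1:]))

def has_varying_response(viabilities):
    """True unless the sample standard deviation of the viabilities is exactly zero."""
    if len(viabilities) < 2:
        return True  # pandas' std is NaN for a single value, so the filter above keeps such pairs
    return statistics.stdev(viabilities) != 0

print(has_tenfold_steps([-2.0, -1.0, 0.0, 1.0, 2.0]))  # True
print(has_tenfold_steps([-2.0, -1.5, -1.0, 0.0]))      # False: half-log steps are mixed in
print(has_varying_response([0.5, 0.5, 0.5]))           # False
```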

# Split NCI60

In [None]:
nci60=pd.read_csv(base_directory+'Processed/NCI60_filtered.csv',index_col=0)

In [None]:
cellline_list=list(set(nci60.CELLNAME))
drug_list=list(set(nci60.NSC))

In [None]:
ratio_train_val=0.9
ratio_test=1-ratio_train_val
factor_train_val=(1-ratio_test/2)**0.5
factor_test=1-factor_train_val


#About the cell line
num_cellline=len(cellline_list)
num_seen_cellline_for_cycle=int(num_cellline/(factor_test*num_cellline))
cellline_test_idx=np.arange(0,num_cellline,num_seen_cellline_for_cycle)
cellline_training_idx=[x for x in np.arange(0,num_cellline,1) if x not in cellline_test_idx]
cellline_count_df=nci60.groupby(by='CELLNAME').VIABILITY.count().sort_values(ascending=False)
unseen_cellline_list=cellline_count_df.iloc[cellline_test_idx].index.values
seen_cellline_list=cellline_count_df.iloc[cellline_training_idx].index.values

#About the drug
num_drug=len(drug_list)
num_seen_drug_for_cycle=int(num_drug/(factor_test*num_drug))
drug_test_idx=np.arange(0,num_drug,num_seen_drug_for_cycle)
drug_training_idx=[x for x in np.arange(0,num_drug,1) if x not in drug_test_idx]
drug_count_df=nci60.groupby(by='NSC').VIABILITY.count().sort_values(ascending=False)
unseen_drug_list=drug_count_df.iloc[drug_test_idx].index.values
seen_drug_list=drug_count_df.iloc[drug_training_idx].index.values

In [None]:
nci60_unseen_both_df=nci60[(nci60.NSC.isin(unseen_drug_list))&(nci60.CELLNAME.isin(unseen_cellline_list))]
nci60_unseen_drug_df=nci60[(nci60.NSC.isin(unseen_drug_list))&(~nci60.CELLNAME.isin(unseen_cellline_list))]
nci60_unseen_cellline_df=nci60[(~nci60.NSC.isin(unseen_drug_list))&(nci60.CELLNAME.isin(unseen_cellline_list))]
nci60_seen_both_df=nci60[(~nci60.NSC.isin(unseen_drug_list))&(~nci60.CELLNAME.isin(unseen_cellline_list))]

#About 90% of the whole data is currently in nci60_seen_both, so sampling 1/18 of its pairs gives roughly 5% of the whole data, comparable to the unseen cell line and unseen drug sets
nci60_pair_df=nci60_seen_both_df[['CELLNAME','NSC']].drop_duplicates()
nci60_unseen_pair_df=nci60_pair_df.sample(frac=1/18)
nci60_seen_pair_df=nci60_pair_df[~nci60_pair_df.index.isin(nci60_unseen_pair_df.index)]
nci60_unseen_pair_df=pd.merge(nci60_seen_both_df,nci60_unseen_pair_df,how='inner',on=['CELLNAME','NSC'])
nci60_seen_pair_df=pd.merge(nci60_seen_both_df,nci60_seen_pair_df,how='inner',on=['CELLNAME','NSC'])

print('Total: '+str(len(nci60)))
print('Training&Validation: '+str(len(nci60_seen_pair_df))+'('+str(np.around(len(nci60_seen_pair_df)/len(nci60)*100,2))+'%)')
print('Unseen Pair: '+str(len(nci60_unseen_pair_df))+'('+str(np.around(len(nci60_unseen_pair_df)/len(nci60)*100,2))+'%)')
print('Unseen CellLine: '+str(len(nci60_unseen_cellline_df))+'('+str(np.around(len(nci60_unseen_cellline_df)/len(nci60)*100,2))+'%)')
print('Unseen Drug: '+str(len(nci60_unseen_drug_df))+'('+str(np.around(len(nci60_unseen_drug_df)/len(nci60)*100,2))+'%)')
print('Both unseen: '+str(len(nci60_unseen_both_df))+'('+str(np.around(len(nci60_unseen_both_df)/len(nci60)*100,2))+'%)')

In [None]:
nci60_seen_pair_df.to_csv(base_directory+'Training/TrainVal.csv')
nci60_unseen_pair_df.to_csv(base_directory+'Training/UnseenPair.csv')
nci60_unseen_cellline_df.to_csv(base_directory+'Training/UnseenCellLine.csv')
nci60_unseen_drug_df.to_csv(base_directory+'Training/UnseenDrug.csv')
nci60_unseen_both_df.to_csv(base_directory+'Training/UnseenBoth.csv')
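
The stride used to pick unseen cell lines and drugs follows from the ratios defined above, since `num/(factor_test*num)` reduces to `1/factor_test`. A worked version of that arithmetic as a sketch; `n_items=100` is an arbitrary illustrative count, and the 0.9 share is the approximate figure quoted in the split cell's comment rather than a recomputed value.

```python
ratio_train_val = 0.9
ratio_test = 1 - ratio_train_val
factor_train_val = (1 - ratio_test / 2) ** 0.5  # square root, so factor_train_val**2 is about 1 - ratio_test/2
factor_test = 1 - factor_train_val

# num/(factor_test*num) in the cell above simplifies to 1/factor_test
stride = int(1 / factor_test)
print(stride)  # 39

n_items = 100  # illustrative count of cell lines or drugs
test_idx = list(range(0, n_items, stride))  # positions in the count-sorted ranking that become "unseen"
print(test_idx)  # [0, 39, 78]

observed_seen_both_share = 0.9  # approximate figure from the comment above, not recomputed here
print(round(observed_seen_both_share / 18, 3))  # 0.05
```

Because the ranking is sorted by measurement count and position 0 is always held out, the most frequently measured cell line and drug end up in the unseen sets, so the realized data shares can differ from what the per-item ratios alone suggest.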

# NCI-ALMANAC

In [None]:
nci60=pd.read_csv(base_directory+'Processed/NCI60_filtered.csv',index_col=0) #Same path the file was written to above

In [None]:
celline_mapping=nci60[['CELLNAME','depmap_id']].drop_duplicates()

In [None]:
NCI_ALMANAC=pd.read_csv(base_directory+'Raw/ComboDrugGrowth_Nov2017.csv',index_col=0)
NCI_ALMANAC=NCI_ALMANAC[['NSC1','CONC1','NSC2','CONC2','CELLNAME','PERCENTGROWTH']]

In [None]:
NCI_ALMANAC.PERCENTGROWTH=NCI_ALMANAC.PERCENTGROWTH+100
NCI_ALMANAC.PERCENTGROWTH=NCI_ALMANAC.PERCENTGROWTH/200

In [None]:
NCI_ALMANAC=NCI_ALMANAC[NCI_ALMANAC.PERCENTGROWTH<1.5]

In [None]:
NCI_ALMANAC_mono=NCI_ALMANAC[NCI_ALMANAC.NSC2.isna()]
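NCI_ALMANAC_mono=NCI_ALMANAC_mono[['NSC1','CONC1','CELLNAME','PERCENTGROWTH']].copy() #Copy, so the column edits below do not act on a view
NCI_ALMANAC_mono.CONC1=np.log10(NCI_ALMANAC_mono.CONC1) #Concentration -> log10 scale
NCI_ALMANAC_mono.CONC1=NCI_ALMANAC_mono.CONC1+6 #Same +6 shift (M -> uM on the log10 scale) as in the NCI60 section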
NCI_ALMANAC_mono.columns=['NSC','CONC','CELLNAME','PERCENTGROWTH']
NCI_ALMANAC_mono_median=NCI_ALMANAC_mono.groupby(by=['NSC','CONC','CELLNAME']).median().reset_index()

compound_morgan512=pd.read_csv(base_directory+'Input/Fingerprint_Morgan512.csv',index_col=0)
NSC_list=compound_morgan512.index

expression_df_zscore=pd.read_csv(base_directory+'Processed/Expression_ZNormalized.csv',index_col=0)
cellline_NCI60_valid=list(expression_df_zscore.index)

In [None]:
NCI_ALMANAC_mono_valid=NCI_ALMANAC_mono_median[(NCI_ALMANAC_mono_median.CELLNAME.isin(celline_mapping.CELLNAME))&(NCI_ALMANAC_mono_median.NSC.isin(NSC_list))]

In [None]:
NCI_ALMANAC_mono_valid.columns=['NSC','CONCENTRATION','CELLNAME','VIABILITY']

In [None]:
NCI_ALMANAC_mono_valid.to_csv(base_directory+'NCI_ALMANAC_mono/Processed/NCI_ALMANAC_mono.csv')

In [None]:
NCI_ALMANAC_comb=NCI_ALMANAC[~NCI_ALMANAC.NSC2.isna()]
NCI_ALMANAC_comb=NCI_ALMANAC_comb[['NSC1','CONC1','NSC2','CONC2','CELLNAME','PERCENTGROWTH']]
NCI_ALMANAC_comb.CONC1=[np.log10(x) for x in NCI_ALMANAC_comb.CONC1]
NCI_ALMANAC_comb.CONC1=NCI_ALMANAC_comb.CONC1+6
NCI_ALMANAC_comb.CONC2=[np.log10(x) for x in NCI_ALMANAC_comb.CONC2]
NCI_ALMANAC_comb.CONC2=NCI_ALMANAC_comb.CONC2+6
NCI_ALMANAC_comb_median=NCI_ALMANAC_comb.groupby(by=['NSC1','CONC1','NSC2','CONC2','CELLNAME']).median().reset_index()

NCI_ALMANAC_comb_valid=NCI_ALMANAC_comb_median[(NCI_ALMANAC_comb_median.CELLNAME.isin(celline_mapping.CELLNAME))&(NCI_ALMANAC_comb_median.NSC1.isin(NSC_list))&(NCI_ALMANAC_comb_median.NSC2.isin(NSC_list))]
NCI_ALMANAC_comb_valid.NSC2=NCI_ALMANAC_comb_valid.NSC2.astype(int)

In [None]:
NCI_ALMANAC_comb_valid.columns=['NSC1','CONCENTRATION1','NSC2','CONCENTRATION2','CELLNAME','VIABILITY']

In [None]:
NCI_ALMANAC_comb_valid.to_csv(base_directory+'NCI_ALMANAC_combination/Processed/NCI_ALMANAC_combination.csv') #Same folder the split section reads from

# Split NCI-ALMANAC (mono)

In [None]:
nci_almanac_mono=pd.read_csv(base_directory+'NCI_ALMANAC_mono/Processed/NCI_ALMANAC_mono.csv',index_col=0)

In [None]:
cellline_list=list(set(nci_almanac_mono.CELLNAME))
drug_list=list(set(nci_almanac_mono.NSC))

In [None]:
ratio_train_val=0.9
ratio_test=1-ratio_train_val
factor_train_val=(1-ratio_test/2)**0.5
factor_test=1-factor_train_val


#About the cell line
num_cellline=len(cellline_list)
num_seen_cellline_for_cycle=int(num_cellline/(factor_test*num_cellline))
cellline_test_idx=np.arange(0,num_cellline,num_seen_cellline_for_cycle)
cellline_training_idx=[x for x in np.arange(0,num_cellline,1) if x not in cellline_test_idx]
cellline_count_df=nci_almanac_mono.groupby(by='CELLNAME').VIABILITY.count().sort_values(ascending=False)
unseen_cellline_list=cellline_count_df.iloc[cellline_test_idx].index.values
seen_cellline_list=cellline_count_df.iloc[cellline_training_idx].index.values

#About the drug
num_drug=len(drug_list)
num_seen_drug_for_cycle=int(num_drug/(factor_test*num_drug))
drug_test_idx=np.arange(0,num_drug,num_seen_drug_for_cycle)
drug_training_idx=[x for x in np.arange(0,num_drug,1) if x not in drug_test_idx]
drug_count_df=nci_almanac_mono.groupby(by='NSC').VIABILITY.count().sort_values(ascending=False)
unseen_drug_list=drug_count_df.iloc[drug_test_idx].index.values
seen_drug_list=drug_count_df.iloc[drug_training_idx].index.values

In [None]:
almanac_unseen_both_df=nci_almanac_mono[(nci_almanac_mono.NSC.isin(unseen_drug_list))&(nci_almanac_mono.CELLNAME.isin(unseen_cellline_list))]
almanac_unseen_drug_df=nci_almanac_mono[(nci_almanac_mono.NSC.isin(unseen_drug_list))&(~nci_almanac_mono.CELLNAME.isin(unseen_cellline_list))]
almanac_unseen_cellline_df=nci_almanac_mono[(~nci_almanac_mono.NSC.isin(unseen_drug_list))&(nci_almanac_mono.CELLNAME.isin(unseen_cellline_list))]
almanac_seen_both_df=nci_almanac_mono[(~nci_almanac_mono.NSC.isin(unseen_drug_list))&(~nci_almanac_mono.CELLNAME.isin(unseen_cellline_list))]

#About 90% of the whole data is currently in almanac_seen_both, so sampling 1/18 of its pairs gives roughly 5% of the whole data, comparable to the unseen cell line and unseen drug sets
almanac_pair_df=almanac_seen_both_df[['CELLNAME','NSC']].drop_duplicates()
almanac_unseen_pair_df=almanac_pair_df.sample(frac=1/18)
almanac_seen_pair_df=almanac_pair_df[~almanac_pair_df.index.isin(almanac_unseen_pair_df.index)]
almanac_unseen_pair_df=pd.merge(almanac_seen_both_df,almanac_unseen_pair_df,how='inner',on=['CELLNAME','NSC'])
almanac_seen_pair_df=pd.merge(almanac_seen_both_df,almanac_seen_pair_df,how='inner',on=['CELLNAME','NSC'])

print('Total: '+str(len(nci_almanac_mono)))
print('Training&Validation: '+str(len(almanac_seen_pair_df))+'('+str(np.around(len(almanac_seen_pair_df)/len(nci_almanac_mono)*100,2))+'%)')
print('Unseen Pair: '+str(len(almanac_unseen_pair_df))+'('+str(np.around(len(almanac_unseen_pair_df)/len(nci_almanac_mono)*100,2))+'%)')
print('Unseen CellLine: '+str(len(almanac_unseen_cellline_df))+'('+str(np.around(len(almanac_unseen_cellline_df)/len(nci_almanac_mono)*100,2))+'%)')
print('Unseen Drug: '+str(len(almanac_unseen_drug_df))+'('+str(np.around(len(almanac_unseen_drug_df)/len(nci_almanac_mono)*100,2))+'%)')
print('Both unseen: '+str(len(almanac_unseen_both_df))+'('+str(np.around(len(almanac_unseen_both_df)/len(nci_almanac_mono)*100,2))+'%)')

In [None]:
almanac_seen_pair_df.to_csv(base_directory+'NCI_ALMANAC_mono/Training/TrainVal.csv')
almanac_unseen_pair_df.to_csv(base_directory+'NCI_ALMANAC_mono/Training/UnseenPair.csv')
almanac_unseen_cellline_df.to_csv(base_directory+'NCI_ALMANAC_mono/Training/UnseenCellLine.csv')
almanac_unseen_drug_df.to_csv(base_directory+'NCI_ALMANAC_mono/Training/UnseenDrug.csv')
almanac_unseen_both_df.to_csv(base_directory+'NCI_ALMANAC_mono/Training/UnseenBoth.csv')

# Split NCI-ALMANAC (combination) into TrainVal&Test (9:1) For unseen setting

In [None]:
nci_almanac_combination=pd.read_csv(base_directory+'NCI_ALMANAC_combination/Processed/NCI_ALMANAC_combination.csv',index_col=0)

In [None]:
monotherapy_unseen_cellline_df=pd.read_csv(base_directory+'NCI_ALMANAC_mono/Training/UnseenCellLine.csv',index_col=0)
monotherapy_unseen_drug_df=pd.read_csv(base_directory+'NCI_ALMANAC_mono/Training/UnseenDrug.csv',index_col=0)
unseen_cellline_list=list(set(monotherapy_unseen_cellline_df.CELLNAME))
unseen_drug_list=list(set(monotherapy_unseen_drug_df.NSC))

In [None]:
nci_almanac_unseen_cellline_df=nci_almanac_combination[(~nci_almanac_combination.NSC1.isin(unseen_drug_list))&(~nci_almanac_combination.NSC2.isin(unseen_drug_list))&(nci_almanac_combination.CELLNAME.isin(unseen_cellline_list))]
#Unseen drug 1: exactly one of the two drugs is unseen; the cell line may be either seen or unseen
nci_almanac_unseen_drug1=nci_almanac_combination[((nci_almanac_combination.NSC1.isin(unseen_drug_list))&(~nci_almanac_combination.NSC2.isin(unseen_drug_list)))|((~nci_almanac_combination.NSC1.isin(unseen_drug_list))&(nci_almanac_combination.NSC2.isin(unseen_drug_list)))]
nci_almanac_unseen_drug2=nci_almanac_combination[(nci_almanac_combination.NSC1.isin(unseen_drug_list))&(nci_almanac_combination.NSC2.isin(unseen_drug_list))&(~nci_almanac_combination.CELLNAME.isin(unseen_cellline_list))]
nci_almanac_unseen_both=nci_almanac_combination[(nci_almanac_combination.NSC1.isin(unseen_drug_list))&(nci_almanac_combination.NSC2.isin(unseen_drug_list))&(nci_almanac_combination.CELLNAME.isin(unseen_cellline_list))]
nci_almanac_seen=nci_almanac_combination[(~nci_almanac_combination.NSC1.isin(unseen_drug_list))&(~nci_almanac_combination.NSC2.isin(unseen_drug_list))&(~nci_almanac_combination.CELLNAME.isin(unseen_cellline_list))].copy() #Copy, since columns are added to it below
#At this point, around 10% of the NCI-ALMANAC combination dataset is in the test sets, and 90% is left for training&validation

In [None]:
#An unseen-pair test set (Cell line-Drug1-Drug2 triplets) is also needed, so hold out 10% of the total dataset for it
nci_almanac_seen['min_NSC']=np.minimum(nci_almanac_seen.NSC1,nci_almanac_seen.NSC2)
nci_almanac_seen['max_NSC']=np.maximum(nci_almanac_seen.NSC1,nci_almanac_seen.NSC2)

nci_almanac_seen_pair=nci_almanac_seen[['CELLNAME','min_NSC','max_NSC']].drop_duplicates()

ratio_test=1/9 #10% of total dataset=1/9 of train&val dataset
nci_almanac_unseen_pair_df=nci_almanac_seen_pair.sample(frac=ratio_test)
nci_almanac_seen_pair_df=nci_almanac_seen_pair[~nci_almanac_seen_pair.index.isin(nci_almanac_unseen_pair_df.index)]

nci_almanac_unseen_pair_df=pd.merge(nci_almanac_seen,nci_almanac_unseen_pair_df,how='inner',on=['CELLNAME','min_NSC','max_NSC'])
nci_almanac_seen_pair_df=pd.merge(nci_almanac_seen,nci_almanac_seen_pair_df,how='inner',on=['CELLNAME','min_NSC','max_NSC'])

nci_almanac_seen=nci_almanac_seen_pair_df.sample(frac=1)
nci_almanac_unseen_pair=nci_almanac_unseen_pair_df.sample(frac=1)
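
`min_NSC` and `max_NSC` make the held-out unit independent of drug order, so that (drug A, drug B) and (drug B, drug A) on the same cell line fall on the same side of the split. A pure-Python sketch of that idea; the records and the `triplet_key` helper are made up for illustration, and the fixed `random.Random(0)` seed is an addition for reproducibility that the original cell does not use.

```python
import random

def triplet_key(cellname, nsc1, nsc2):
    """Order-independent key for a cell line-drug-drug triplet."""
    return (cellname, min(nsc1, nsc2), max(nsc1, nsc2))

records = [  # (CELLNAME, NSC1, NSC2), made-up values
    ('CELL_A', 10, 20), ('CELL_A', 20, 10), ('CELL_A', 10, 30),
    ('CELL_B', 10, 20), ('CELL_B', 30, 20),
]

keys = sorted({triplet_key(*r) for r in records})
rng = random.Random(0)
# Hold out about 1/9 of the unique triplets (at least one in this tiny example)
held_out = set(rng.sample(keys, k=max(1, round(len(keys) / 9))))

unseen_pair = [r for r in records if triplet_key(*r) in held_out]
train_val = [r for r in records if triplet_key(*r) not in held_out]
```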

In [None]:
nci_almanac_seen.to_csv(base_directory+'NCI_ALMANAC_combination/Training/UnseenSetting/TrainVal.csv')
nci_almanac_unseen_pair.to_csv(base_directory+'NCI_ALMANAC_combination/Training/UnseenSetting/UnseenPair.csv')
nci_almanac_unseen_cellline_df.to_csv(base_directory+'NCI_ALMANAC_combination/Training/UnseenSetting/UnseenCellLine.csv')
nci_almanac_unseen_drug1.to_csv(base_directory+'NCI_ALMANAC_combination/Training/UnseenSetting/UnseenDrug1.csv')
nci_almanac_unseen_drug2.to_csv(base_directory+'NCI_ALMANAC_combination/Training/UnseenSetting/UnseenDrug2.csv')
nci_almanac_unseen_both.to_csv(base_directory+'NCI_ALMANAC_combination/Training/UnseenSetting/UnseenBoth.csv')

In [None]:
print('Total: '+str(len(nci_almanac_combination)))
print('Training&Validation: '+str(len(nci_almanac_seen))+'('+str(np.around(len(nci_almanac_seen)/len(nci_almanac_combination)*100,2))+'%)')
print('Unseen Pair: '+str(len(nci_almanac_unseen_pair))+'('+str(np.around(len(nci_almanac_unseen_pair)/len(nci_almanac_combination)*100,2))+'%)')
print('Unseen CellLine: '+str(len(nci_almanac_unseen_cellline_df))+'('+str(np.around(len(nci_almanac_unseen_cellline_df)/len(nci_almanac_combination)*100,2))+'%)')
print('Unseen Drug 1: '+str(len(nci_almanac_unseen_drug1))+'('+str(np.around(len(nci_almanac_unseen_drug1)/len(nci_almanac_combination)*100,2))+'%)')
print('Unseen Drug 2: '+str(len(nci_almanac_unseen_drug2))+'('+str(np.around(len(nci_almanac_unseen_drug2)/len(nci_almanac_combination)*100,2))+'%)')
print('Both unseen: '+str(len(nci_almanac_unseen_both))+'('+str(np.around(len(nci_almanac_unseen_both)/len(nci_almanac_combination)*100,2))+'%)')
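
The boolean masks in the split cell partition the combination data into five groups, depending on how many of the two drugs are unseen and whether the cell line is unseen; the seen group is then divided into TrainVal and UnseenPair. A sketch of the same rules as a single classifier; the function name, the returned labels, and the example identifiers are illustrative.

```python
def split_category(nsc1, nsc2, cellname, unseen_drugs, unseen_celllines):
    """Assign a combination record to one of the five groups defined by the masks."""
    n_unseen_drugs = (nsc1 in unseen_drugs) + (nsc2 in unseen_drugs)
    cell_unseen = cellname in unseen_celllines
    if n_unseen_drugs == 1:
        return 'UnseenDrug1'  # exactly one unseen drug, any cell line
    if n_unseen_drugs == 2:
        return 'UnseenBoth' if cell_unseen else 'UnseenDrug2'
    return 'UnseenCellLine' if cell_unseen else 'Seen'

print(split_category(1, 9, 'CELL_X', unseen_drugs={9}, unseen_celllines={'CELL_X'}))
# prints UnseenDrug1
```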