# Explore NCI data

In [12]:
import pandas as pd

## Drug growth

In [52]:
drug_growth = pd.read_csv("../Data/NCI/ComboDrugGrowth_Nov2017.csv", low_memory=False)

In [53]:
drug_growth

Unnamed: 0,COMBODRUGSEQ,SCREENER,STUDY,TESTDATE,PLATE,PANELNBR,CELLNBR,PREFIX1,NSC1,SAMPLE1,...,PERCENTGROWTH,PERCENTGROWTHNOTZ,TESTVALUE,CONTROLVALUE,TZVALUE,EXPECTEDGROWTH,SCORE,VALID,PANEL,CELLNAME
0,260496,FG,PZUT00156_86_01_T72,06/23/2011,786-0_1_T72,9,18,S,752,37,...,85.979,88.159,332168.000,376781.548,58592.854,90.342,4.0,Y,Renal Cancer,786-0
1,260497,FG,PZUT00156_86_01_T72,06/23/2011,786-0_1_T72,9,18,S,752,37,...,100.903,100.763,379656.000,376781.548,58592.854,87.130,-14.0,Y,Renal Cancer,786-0
2,260498,FG,PZUT00156_86_01_T72,06/23/2011,786-0_1_T72,9,18,S,752,37,...,14.147,27.498,103608.000,376781.548,58592.854,12.739,-1.0,Y,Renal Cancer,786-0
3,260499,FG,PZUT00156_86_01_T72,06/23/2011,786-0_1_T72,9,18,S,752,37,...,71.268,75.736,285360.000,376781.548,58592.854,76.397,5.0,Y,Renal Cancer,786-0
4,260500,FG,PZUT00156_86_01_T72,06/23/2011,786-0_1_T72,9,18,S,752,37,...,89.278,90.945,342664.000,376781.548,58592.854,73.681,-16.0,Y,Renal Cancer,786-0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3686470,5732421,1A,0903CB75,03/02/2009,322,12,16,S,681239,1,...,-51.889,,0.242,1.636,0.503,,,Y,CNS Cancer,SF-539
3686471,5732422,1A,0903CB75,03/02/2009,322,12,16,S,681239,1,...,-61.133,,0.196,1.636,0.503,,,Y,CNS Cancer,SF-539
3686472,5732423,1A,0903CB75,03/02/2009,322,12,16,S,681239,1,...,-62.724,,0.188,1.636,0.503,,,Y,CNS Cancer,SF-539
3686473,5732424,1A,0903CB75,03/02/2009,322,12,16,S,681239,1,...,-66.799,,0.167,1.636,0.503,,,Y,CNS Cancer,SF-539


The `NSC1` and `NSC2` columns contain the IDs of the drugs used in each experiment. Information about the drugs (name and SMILES) can be found in the ```ComboCompoundNames_all_with_smiles.csv``` file. The target should be the ```PERCENTGROWTH``` column.

Drug concentrations (conc1 and conc2) should be used as input features, in addition to the drug fingerprints.

For more information about the columns, visit: https://wiki.nci.nih.gov/display/NCIDTPdata/NCI-ALMANAC?preview=/338237347/360284891/ALMANAC%20Data%20Fields.docx
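
As a rough sketch of how the modelling table described above could be assembled, the snippet below keeps the two drug IDs, the two concentrations and the target from a toy frame. The column names `CONC1` and `CONC2` are assumptions (those columns are hidden in the truncated display above), the helper `build_feature_table` is hypothetical, and fingerprints are left out because they depend on the SMILES lookup.

```python
import pandas as pd

# Assumed column names: CONC1 and CONC2 are not visible in the truncated display.
FEATURE_COLUMNS = ["NSC1", "NSC2", "CONC1", "CONC2"]
TARGET_COLUMN = "PERCENTGROWTH"


def build_feature_table(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep the ID, concentration and target columns."""
    table = df[FEATURE_COLUMNS + [TARGET_COLUMN]]
    # Drop rows without a measured target; rows with a missing NSC2 are kept.
    return table.dropna(subset=[TARGET_COLUMN]).reset_index(drop=True)


# Toy data standing in for drug_growth (values are made up for illustration).
toy = pd.DataFrame({
    "NSC1": [752, 752, 681239],
    "NSC2": [3088.0, 3088.0, None],
    "CONC1": [1e-6, 1e-7, 1e-5],
    "CONC2": [1e-6, 1e-6, None],
    "PERCENTGROWTH": [85.979, None, -51.889],
})
features = build_feature_table(toy)
print(features.shape)
```

Whether rows with a missing `NSC2` belong in the same table as the combination rows is a separate decision that this sketch does not make.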

In [54]:
drug_growth[['NSC1', 'NSC2']]

Unnamed: 0,NSC1,NSC2
0,752,3088.0
1,752,3088.0
2,752,3088.0
3,752,3088.0
4,752,3088.0
...,...,...
3686470,681239,
3686471,681239,
3686472,681239,
3686473,681239,


In [57]:
(~drug_growth[['NSC1', 'NSC2']].duplicated()).sum()

5460

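We have **5460 unique (NSC1, NSC2) combinations**. This count is order-sensitive and treats a missing `NSC2` as a value of its own, so it is an upper bound on the number of distinct drug pairs. For some pairs, the experiment has been performed several times.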

In [60]:
drug_growth['NSC1'].isna().sum()

0

In [61]:
drug_growth['NSC2'].isna().sum()

812961
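
The 812961 rows without `NSC2` presumably correspond to single-agent measurements. If so, they should be separated from the combination rows before counting pairs, since `duplicated` treats missing values as equal and also distinguishes (A, B) from (B, A). A minimal sketch on toy data, under that assumption:

```python
import numpy as np
import pandas as pd

# Toy stand-in for drug_growth[['NSC1', 'NSC2']] (IDs are illustrative only).
toy = pd.DataFrame({
    "NSC1": [752, 3088, 752, 681239, 681239],
    "NSC2": [3088.0, 752.0, 3088.0, np.nan, np.nan],
})

# Rows with a missing NSC2 are treated here as single-agent experiments (assumption).
is_combo = toy["NSC2"].notna()
combos = toy[is_combo]
singles = toy[~is_combo]

# Sort each pair so that (A, B) and (B, A) map to the same key.
ids = combos[["NSC1", "NSC2"]].astype("int64").to_numpy()
unordered = pd.DataFrame(np.sort(ids, axis=1), columns=["drug_a", "drug_b"])

n_ordered = len(toy.drop_duplicates())          # order-sensitive, NaN counted as a value
n_unordered = len(unordered.drop_duplicates())  # combination rows only, order ignored
print(n_ordered, n_unordered, len(singles))
```

Applied to the full table, the same steps would show how many of the unique (NSC1, NSC2) combinations are true two-drug pairs.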

## Compound information

In [24]:
compound_names = pd.read_csv("../Data/NCI/ComboCompoundNames_all_with_smiles.csv", index_col=["0"])

In [25]:
compound_names

0,1,smiles
740,Methotrexate,CN(Cc1cnc2nc(N)nc(N)c2n1)c3ccc(cc3)C(=O)N[C@@H...
740,Trexall,[Na+].CN(Cc1cnc2nc(N)nc(N)c2n1)c3ccc(cc3)C(=O)...
740,Abitrexate,[Na+].[Na+].CN(Cc1cnc2nc(N)nc(N)c2n1)c3ccc(cc3...
740,Mexate,CN(Cc1cnc2nc(N)nc(N)c2n1)c3ccc(cc3)C(=O)N[C@@H...
740,Folex,CCCCSP(SCCCC)SCCCC
...,...,...
761432,XRP6258,-1
707389,Eribulin mesylate,-1
737754,Pazopanib hydrochloride,-1
753082,vemurafenib,-1


In [23]:
len(compound_names.index.unique())

105

In [22]:
len(compound_names['smiles'].unique())

200

In [44]:
# A drug counts as having a SMILES if at least one of its rows is not the '-1' placeholder.
has_smiles = (compound_names['smiles'] != '-1').groupby(level=0).any()
cpt = int(has_smiles.sum())
print(cpt)

93


Out of 105 drugs, 93 have a SMILES representation, and many have more than one (for example, NSC 740 is listed under several names, each with its own SMILES string). We still need to decide how to handle such cases.
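
One possible way to handle drugs with several SMILES, sketched under assumptions rather than as a settled choice: discard the `-1` placeholders, prefer entries without a `.` (which in SMILES separates disconnected components such as counter-ions), and keep the first remaining entry per NSC. The helper name `pick_one_smiles` is hypothetical, and the toy strings below are illustrative rather than taken from the file.

```python
import pandas as pd


def pick_one_smiles(names: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: return one SMILES per NSC index, skipping '-1' placeholders."""
    known = names[names["smiles"] != "-1"].copy()
    # A '.' in SMILES separates disconnected fragments, e.g. a salt and its counter-ion.
    known["n_fragments"] = known["smiles"].str.count(r"\.") + 1
    # Stable sort keeps the file order among entries with the same fragment count.
    known = known.sort_values("n_fragments", kind="stable")
    return known.groupby(level=0)["smiles"].first()


# Toy table indexed by NSC, mimicking compound_names (strings are illustrative).
toy = pd.DataFrame(
    {"1": ["name A salt", "name A", "name B", "name C"],
     "smiles": ["[Na+].CC(=O)[O-]", "CC(=O)O", "-1", "c1ccccc1"]},
    index=pd.Index([740, 740, 761432, 3088], name="0"),
)
chosen = pick_one_smiles(toy)
print(chosen)
```

This heuristic cannot detect entries that describe a different molecule altogether, so canonicalising the strings with a cheminformatics toolkit and comparing them would be a more robust check.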