# Reading data sets
- IC50
- Genomic Features
- Drug Decoder

In [1]:
%pylab inline
matplotlib.rcParams['figure.figsize'] = (10,6)

Populating the interactive namespace from numpy and matplotlib


In [1]:
# let us import some functions
from gdsctools import IC50, DrugDecoder, GenomicFeatures
# and data sets
from gdsctools import ic50_test, genomic_features
from gdsctools.datasets import testing

## IC50

The first type of data set used in the analysis is the IC50 matrix. A test data set called **ic50_test** gives the location of such a file.


In [2]:
ic50 = IC50(ic50_test)

In [3]:
print(ic50)

Number of drugs: 11
Number of cell lines: 988
Percentage of NA 0.206569746043
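To make the summary above more concrete, here is a minimal sketch in plain Python, independent of gdsctools, of how a count of valid values per drug and an overall share of missing values could be derived from an IC50-like matrix. The toy values and drug names are made up, and it is an assumption that the "Percentage of NA" figure is the fraction of missing cells (a value between 0 and 1).

```python
import math

nan = math.nan

# Toy IC50-like matrix (made-up values): one list of cell-line measurements per drug.
toy_ic50 = {
    "Drug_1_IC50": [1.2, nan, 0.7, 2.1],
    "Drug_2_IC50": [nan, nan, 3.3, 0.4],
    "Drug_3_IC50": [0.9, 1.5, 1.1, 2.8],
}


def count_valid(values):
    """Return the number of entries that are not NaN."""
    return sum(1 for v in values if not math.isnan(v))


# Valid (non-missing) measurements per drug.
valid_per_drug = {drug: count_valid(vals) for drug, vals in toy_ic50.items()}

# Assumed definition: fraction of missing cells over the whole matrix.
total_cells = sum(len(vals) for vals in toy_ic50.values())
na_fraction = 1 - sum(valid_per_drug.values()) / total_cells

print(valid_per_drug)
print(round(na_fraction, 3))
```

Under this assumed definition, a value such as 0.2066 would mean that roughly one cell in five has no IC50 measurement.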



In [4]:
import matplotlib.pyplot as plt
data = ic50.plot_ic50_count(marker='o')
plt.title("Count of valid IC50 values per drug")

In [None]:
data = ic50.hist()

In [5]:
drug_to_drop  = ['Drug_999_IC50', 'Drug_1047_IC50', 'Drug_1049_IC50',
                'Drug_1050_IC50', 'Drug_1052_IC50', 'Drug_1053_IC50']
dummy = ic50.drop_drugs(drug_to_drop)
data = ic50.hist()

## Genomic Features

In [6]:
f = GenomicFeatures() # default from the package

This is equivalent to:

In [7]:
f = GenomicFeatures(genomic_features)

In [8]:
print(f)

Genomic features distribution
Number of unique tissues 27
Here are first 10 tissues: lung_NSCLC, prostate, stomach, nervous_system, skin, Bladder, leukemia, kidney, thyroid, soft_tissue

There are 677 unique features distributed as
- Mutation: 270
- CNA (gain): 116
- CNA (loss): 291


Note that the GenomicFeatures matrix must contain three special columns
that provide the sample name, the tissue factor value and the MSI
(microsatellite instability) factor value. All remaining columns are features.
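As a rough illustration of this requirement, the sketch below checks a header for the three special columns. It is plain Python and not part of gdsctools. The column names are the ones that appear in the table preview of this notebook, and the helper `missing_special_columns` is a hypothetical name introduced only for this example.

```python
# Column names as they appear in this notebook's table preview.
REQUIRED_COLUMNS = (
    "Sample Name",
    "Tissue Factor Value",
    "MS-instability Factor Value",
)


def missing_special_columns(header):
    """Return the required column names that are absent from a header (hypothetical helper)."""
    return [name for name in REQUIRED_COLUMNS if name not in header]


# A header with all special columns followed by one feature column.
good_header = [
    "COSMIC ID",
    "Sample Name",
    "Tissue Factor Value",
    "MS-instability Factor Value",
    "ABCB1_mut",
]
# A header lacking the tissue and MSI columns.
bad_header = ["COSMIC ID", "Sample Name", "ABCB1_mut"]

print(missing_special_columns(good_header))
print(missing_special_columns(bad_header))
```

A check of this kind could be run on a custom file before handing it to `GenomicFeatures`, so that a missing column is reported early.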

In [9]:
f.df.iloc[0:3]

| COSMIC ID | Sample Name | Tissue Factor Value | MS-instability Factor Value | ABCB1_mut | ABL2_mut | ACACA_mut | ACVR1B_mut | ACVR2A_mut | AFF4_mut | AHCTF1_mut | ... | loss_cnaPANCAN415_(B2M,BUB1B,MGA,TP53BP1) | loss_cnaPANCAN416 | loss_cnaPANCAN417 | loss_cnaPANCAN418 | loss_cnaPANCAN419 | loss_cnaPANCAN420 | loss_cnaPANCAN421 | loss_cnaPANCAN422, loss_cnaPANCAN423 | loss_cnaPANCAN424 | loss_cnaPANCAN425 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1287381 | 201T | lung_NSCLC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 924100 | 22RV1 | prostate | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 910924 | 23132-87 | stomach | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [10]:
df = f.plot()

In [11]:
groups = f.df.groupby('Tissue Factor Value').groups
to_remove = []
for tissue in groups.keys():
    if len(groups[tissue])<40:
        to_remove.append(tissue)
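The selection logic in this cell (collect every tissue with fewer than 40 cell lines) can be sketched without pandas or gdsctools. The example below uses `collections.Counter` on a made-up list of tissue labels. The threshold is lowered to 3 only so that the toy data stays small, and the name `MIN_SAMPLES` is introduced for this illustration.

```python
from collections import Counter

# Made-up tissue labels, one per cell line.
tissues = ["lung_NSCLC"] * 5 + ["prostate"] * 2 + ["skin"] * 4 + ["thyroid"] * 1

# The notebook uses 40; a smaller threshold keeps the toy example readable.
MIN_SAMPLES = 3

# Count cell lines per tissue, then collect the under-represented tissues.
counts = Counter(tissues)
to_remove = sorted(t for t, n in counts.items() if n < MIN_SAMPLES)

# Cell lines that would remain after dropping those tissues.
kept = [t for t in tissues if t not in to_remove]

print(to_remove)
print(Counter(kept))
```

Filtering on a minimum group size like this keeps only tissues with enough cell lines for per-tissue comparisons to be meaningful.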


In [12]:
info = f.drop_tissue_in(to_remove)
f.plot()

aero dig tract      79
breast              52
large intestine     50
leukemia            82
lung NSCLC         111
lung SCLC           64
lymphoma            69
nervous system      56
ovary               43
skin                58
dtype: float64

## Drug Decoder

GDSCTools provides an IC50 test file (ic50_test). Drugs are usually encoded
with unique identifiers that carry no meaning on their own, so a decoder
file may be provided. For example, we provide the drug_test data set:

In [13]:
print(testing.drug_test_csv)

location: /home/cokelaer/Work/github/gdsctools/share/data/test_drug_decode.csv
description: drug_decode in CSV format
authors: GDSC consortium



In [14]:
dd = DrugDecoder(testing.drug_test_csv)
print(dd)

Number of drugs: 11



It can be used to retrieve the name and target of a drug:

In [15]:
dd.get_name('Drug_1047_IC50')

In [16]:
dd.get_target('Drug_1047_IC50')
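Conceptually, a decoder is a lookup table from drug identifier to name and target. The sketch below mimics that idea with the standard `csv` module. It is not the gdsctools implementation. The column names (`DRUG_ID`, `DRUG_NAME`, `DRUG_TARGET`), the drug names and the targets are all assumptions made up for this example.

```python
import csv
import io

# Made-up decoder table; column names and contents are assumptions for illustration.
DECODER_CSV = """DRUG_ID,DRUG_NAME,DRUG_TARGET
Drug_1_IC50,ExampleDrugA,TARGET_A
Drug_2_IC50,ExampleDrugB,TARGET_B
"""

# Index the rows by drug identifier for direct lookup.
decoder = {row["DRUG_ID"]: row for row in csv.DictReader(io.StringIO(DECODER_CSV))}


def get_name(drug_id):
    """Return the readable name for a drug identifier (toy version)."""
    return decoder[drug_id]["DRUG_NAME"]


def get_target(drug_id):
    """Return the target for a drug identifier (toy version)."""
    return decoder[drug_id]["DRUG_TARGET"]


print(get_name("Drug_1_IC50"))
print(get_target("Drug_2_IC50"))
```

Keeping the decoder separate from the IC50 matrix lets the analysis work on stable identifiers while reports show readable names.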

<hr>
**Author: Thomas Cokelaer, Nov 2015**