- Date: 20 June 2019
- Author: Hosein Fooladi
- Topic: Data Exploration

## Data Exploration

In this notebook, I explore the toy dataset that is used when running the Decagon code. I want to understand its structure so that I can replace it with the actual dataset.

In [5]:
import os
from itertools import combinations
from collections import Counter
from itertools import chain

import numpy as np
import networkx as nx
import scipy.sparse as sp
import pandas as pd

from decagon.utility import rank_metrics, preprocessing
from polypharmacy.utility import load_combo_se, load_ppi, load_mono_se, load_targets, load_categories, load_se_combo

## Loading Data

In [6]:
combo2stitch, combo2se, se2name = load_combo_se(os.path.join('data', 'bio-decagon-combo.csv'))
net, node2idx = load_ppi(os.path.join('data', 'bio-decagon-ppi.csv'))
stitch2se, se2name_mono = load_mono_se(os.path.join('data', 'bio-decagon-mono.csv'))
stitch2proteins = load_targets(os.path.join('data', 'bio-decagon-targets-all.csv'))
se2class, se2name_class = load_categories(os.path.join('data', 'bio-decagon-effectcategories.csv'))
se2combo = load_se_combo(os.path.join('data', 'bio-decagon-combo.csv'))

Reading: data\bio-decagon-combo.csv
Drug combinations: 63473 Side effects: 1317
Drug-drug interactions: 4649441
Reading: data\bio-decagon-ppi.csv
Edges: 715612
Nodes: 19081
Reading: data\bio-decagon-mono.csv
Reading: data\bio-decagon-targets-all.csv
Reading: data\bio-decagon-effectcategories.csv
Reading: data\bio-decagon-combo.csv


#### Building dictionary for stitches

I am going to build a dictionary from STITCH ID to index and vice versa. I will use these dictionaries when creating the sparse matrices.

In [14]:
# Creating dictionary from drug to index.

stitchs = set([drug for drug_comb in combo2stitch.values() for drug in drug_comb])

stitch2idx = {node: i for i, node in enumerate(stitchs)}
idx2stitch = {i: node for i, node in enumerate(stitchs)}

In [15]:
# Creating dictionary from side effects to index.

se2idx = {node: i for i, node in enumerate(list(se2name.keys()))}
idx2se = {i: node for i, node in enumerate(list(se2name.keys()))}
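The two set-based comprehensions above agree with each other because the same set is iterated twice within one process, but the iteration order of a set of strings can change from one Python run to the next. A minimal sketch of a reproducible alternative follows, assuming a sorted order is acceptable. The helper `build_index` and the toy drug names are hypothetical and are not part of the Decagon code.

```python
def build_index(items):
    """Return (item2idx, idx2item) built from a sorted, reproducible order."""
    ordered = sorted(set(items))
    item2idx = {item: i for i, item in enumerate(ordered)}
    # Invert the first mapping instead of enumerating a second time.
    idx2item = {i: item for item, i in item2idx.items()}
    return item2idx, idx2item


# Toy stand-in for combo2stitch (made-up identifiers).
toy_combos = {
    "combo_1": ("DRUG_B", "DRUG_A"),
    "combo_2": ("DRUG_C", "DRUG_A"),
}
toy_drugs = [drug for pair in toy_combos.values() for drug in pair]
toy_stitch2idx, toy_idx2stitch = build_index(toy_drugs)
```

With this, the same drug always gets the same row index, which matters if matrices are saved in one session and indexed in another.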

#### Creating sparse matrix for drug gene interaction

In [35]:
drug_gene_adj = preprocessing.get_sparse_mat(stitch2proteins, stitch2idx, node2idx)
drug_gene_adj

<645x19081 sparse matrix of type '<class 'numpy.float64'>'
	with 18596 stored elements in COOrdinate format>
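I have not inspected how `preprocessing.get_sparse_mat` is implemented. The sketch below only shows one way such a drug-by-gene matrix could be assembled from a dictionary and two index maps, assuming that drugs or genes missing from the index maps are skipped. The function `dict_to_sparse` and the toy data are hypothetical.

```python
import numpy as np
import scipy.sparse as sp


def dict_to_sparse(row2cols, row2idx, col2idx):
    """Build a binary COO matrix from a {row_key: iterable of col_keys} mapping."""
    rows, cols = [], []
    for row_key, col_keys in row2cols.items():
        if row_key not in row2idx:
            continue  # assumption: rows without an index are ignored
        for col_key in set(col_keys):
            if col_key in col2idx:  # assumption: unknown columns are ignored
                rows.append(row2idx[row_key])
                cols.append(col2idx[col_key])
    data = np.ones(len(rows), dtype=np.float64)
    shape = (len(row2idx), len(col2idx))
    return sp.coo_matrix((data, (rows, cols)), shape=shape)


# Toy example: "drug_c" and "g3" have no index and are dropped.
toy_targets = {"drug_a": {"g1", "g2"}, "drug_b": {"g2"}, "drug_c": {"g3"}}
toy_adj = dict_to_sparse(toy_targets, {"drug_a": 0, "drug_b": 1}, {"g1": 0, "g2": 1})
```

Skipping unknown keys would be consistent with the shape above: the matrix has 645 rows, one per drug in `stitch2idx`, and 19081 columns, one per protein in `node2idx`.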

### Protein–protein and drug–protein interactions

First, let's consider the PPI network. The numbers found here do not match the numbers reported in the paper. For example, the PPI dataset has 19081 nodes, while the paper reports 19085.

The paper states that "The network is unweighted and undirected with 19 085 proteins and 719 402 physical interactions". Exploring the PPI data gives slightly different numbers:

- Number of proteins: 19081

- Number of edges (physical interactions): 715612


Also:
- Drug-drug interactions: 4 649 441 here, which is different from 4 651 131

### Drug–drug interaction and side effect data

#### Number of Unique Drugs in drug combination side effect dataset

Here I explore the drug combination side effect dataset a little more. In particular, I want to know how many unique drugs are available in this dataset. Fortunately, I arrive at the **same number** that is mentioned in the paper.

In [56]:
unique_drugs = set(list(chain.from_iterable([combo2stitch[stitch] for stitch in combo2stitch])))

print("Number of unique drugs in drug combination side effect dataset is %d" % (len(unique_drugs)))

Number of unique drugs in drug combination side effect dataset is 645


But the number of side effects is different from the number mentioned in the paper. I have found 1317 side effects, which is different from 964.

There is this quote in the paper: "In this study, we focus on predicting the 964 commonly occurring types of polypharmacy side effects that each occurred in at least 500 drug combinations."

So, I am going to explore the side effects that occur in at least 500 drug combinations.

In [50]:
side_effects = Counter()
count = 0

for stitch in combo2se:
    side_effects.update(combo2se[stitch])
    
for i in side_effects:
    if side_effects[i] >= 500:
        count+=1

print("Number of side effects that occur in at least 500 drug combinations:", count)

Number of side effects that occur in at least 500 drug combinations: 963


So, I found 963 side effects that occur in at least 500 drug combinations, one fewer than the 964 mentioned in the paper.
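The counting step can be wrapped in a small function so that the threshold is easy to vary, for example to see how sensitive the count is around 500. This is a minimal sketch on toy data. The function `frequent_side_effects` and the toy dictionary are hypothetical, and the sketch assumes that each combination should count a side effect at most once.

```python
from collections import Counter


def frequent_side_effects(combo2se, min_combos):
    """Return the side effects that occur in at least `min_combos` combinations."""
    counts = Counter()
    for effects in combo2se.values():
        counts.update(set(effects))  # count each side effect once per combination
    return {se for se, n in counts.items() if n >= min_combos}


# Toy stand-in for combo2se.
toy_combo2se = {
    "a_b": {"se1", "se2"},
    "a_c": {"se1"},
    "b_c": {"se1", "se3"},
}
toy_frequent = frequent_side_effects(toy_combo2se, min_combos=2)
```

On the real data, `len(frequent_side_effects(combo2se, 500))` should reproduce the count printed above if `combo2se` maps each combination to a set of side effects.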

### Number of unique drugs and proteins in drug-target dataset

In [68]:
print('There are %d unique drugs in the drug-target database' % len(stitch2proteins))

def get_gene_counter(gene_map):
    genes = []
    for drug in gene_map:
        genes += list(set(gene_map[drug]))
    return Counter(genes)

combo_counter = get_gene_counter(stitch2proteins)
print('There are %d unique genes in the drug-target database' % len(combo_counter))

Reading: data\bio-decagon-targets-all.csv
There are 1774 unique drugs in the drug-target database
There are 7795 unique genes in the drug-target database


In [5]:
## Building a dictionary from drugs to id and from proteins to id

stitch2idx = {node: i for i, node in enumerate(list(stitch2proteins.keys()))}
print(len(stitch2idx))
print(len(node2idx))

1774
19081


In [93]:
list(list(stitch2proteins.values())[0])
np.zeros(19081*1774)

array([0., 0., 0., ..., 0., 0., 0.])

#### PPI Network

First, I am going to construct the PPI network for the given dataset. This is an easy task with the networkx library: we just provide the list of nodes and the list of edges, and the library constructs the network. The PPI network has 19081 nodes.

Note: The adjacency matrix is stored as a sparse matrix.

In [54]:
gene_net = net
gene_adj = nx.adjacency_matrix(gene_net)
gene_degrees = np.array(gene_adj.sum(axis=0)).squeeze()

#gene_adj

In [10]:
val_test_size = 0.05
n_genes = len(node2idx)
n_drugs = 400
n_drugdrug_rel_types = 3

In [53]:
gene_net = net
gene_adj = nx.adjacency_matrix(gene_net)
gene_degrees = np.array(gene_adj.sum(axis=0)).squeeze()

gene_adj

<19081x19081 sparse matrix of type '<class 'numpy.int32'>'
	with 1431224 stored elements in Compressed Sparse Row format>
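The 1431224 stored elements are exactly twice the 715612 edges reported when loading the PPI file. That is what a symmetric adjacency matrix of an undirected graph without self-loops gives, since every edge is stored once as (u, v) and once as (v, u). A minimal sketch on a toy graph (the node names are made up):

```python
import networkx as nx

# Toy undirected graph with 4 edges and no self-loops.
toy_net = nx.Graph()
toy_net.add_edges_from([("p1", "p2"), ("p2", "p3"), ("p1", "p3"), ("p3", "p4")])

toy_adj = nx.adjacency_matrix(toy_net)
n_edges = toy_net.number_of_edges()
n_stored = toy_adj.nnz  # number of explicitly stored entries

# The matrix is symmetric, so it has no entries that differ from its transpose.
n_asymmetric = (toy_adj != toy_adj.T).nnz
```

Because the matrix is symmetric, summing over `axis=0` or `axis=1` gives the same degree vector.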

In [26]:
gene_degrees[0]

18

In [36]:
gene_drug_adj = sp.csr_matrix((10 * np.random.randn(n_genes, n_drugs) > 15).astype(int))
drug_gene_adj = gene_drug_adj.transpose(copy=True)

gene_drug_adj[0]

<1x400 sparse matrix of type '<class 'numpy.int32'>'
	with 28 stored elements in Compressed Sparse Row format>
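The expression `10 * np.random.randn(...) > 15` is true whenever a standard normal draw exceeds 1.5, so every gene-drug pair becomes an edge independently with the same small probability. The sketch below computes that probability and the expected number of drugs per gene for `n_drugs = 400`, using only the standard library. The variable names are mine.

```python
import math

n_drugs = 400
threshold = 15 / 10  # 10 * z > 15 is equivalent to z > 1.5

# Upper tail of the standard normal: P(Z > t) = 0.5 * erfc(t / sqrt(2)).
p_edge = 0.5 * math.erfc(threshold / math.sqrt(2))
expected_drugs_per_gene = n_drugs * p_edge

print("P(edge) = %.4f" % p_edge)
print("Expected drugs per gene = %.1f" % expected_drugs_per_gene)
```

The expectation is a little under 27, so the 28 stored elements in the first row are in line with it. This also confirms that the gene-drug matrix here is synthetic rather than derived from the target file.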

In [37]:
drug_drug_adj_list = []
tmp = np.dot(drug_gene_adj, gene_drug_adj)
for i in range(n_drugdrug_rel_types):
    mat = np.zeros((n_drugs, n_drugs))
    for d1, d2 in combinations(list(range(n_drugs)), 2):
        if tmp[d1, d2] == i + 4:
            mat[d1, d2] = mat[d2, d1] = 1.
    drug_drug_adj_list.append(sp.csr_matrix(mat))
drug_degrees_list = [np.array(drug_adj.sum(axis=0)).squeeze() for drug_adj in drug_drug_adj_list]

drug_drug_adj_list

[<400x400 sparse matrix of type '<class 'numpy.float64'>'
 	with 17724 stored elements in Compressed Sparse Row format>,
 <400x400 sparse matrix of type '<class 'numpy.float64'>'
 	with 7916 stored elements in Compressed Sparse Row format>,
 <400x400 sparse matrix of type '<class 'numpy.float64'>'
 	with 3018 stored elements in Compressed Sparse Row format>]
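In the cell above, `tmp[d1, d2]` is the number of genes shared by drugs `d1` and `d2`, and relation type `i` links two drugs when they share exactly `i + 4` genes. The double loop can be replaced by boolean masks. This is a minimal sketch on a tiny example, not the Decagon code. The function `shared_gene_relations` is hypothetical, and the toy uses `offset=1` instead of 4 so that a 5-gene example produces edges.

```python
import numpy as np
import scipy.sparse as sp


def shared_gene_relations(gene_drug_adj, n_rel_types, offset=4):
    """One symmetric drug-drug matrix per relation type.

    Relation k links two distinct drugs that share exactly offset + k genes.
    """
    shared = (gene_drug_adj.T @ gene_drug_adj).toarray()
    off_diag = ~np.eye(shared.shape[0], dtype=bool)  # exclude drug-with-itself
    return [
        sp.csr_matrix(((shared == offset + k) & off_diag).astype(float))
        for k in range(n_rel_types)
    ]


# Toy gene-by-drug matrix: 5 genes (rows), 3 drugs (columns).
toy_gene_drug = sp.csr_matrix(np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
]))
toy_relations = shared_gene_relations(toy_gene_drug, n_rel_types=2, offset=1)
```

Drugs 1 and 2 share one gene and drugs 0 and 1 share two, so the first toy relation links (1, 2) and the second links (0, 1).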

In [8]:


# data representation
adj_mats_orig = {
    (0, 0): [gene_adj, gene_adj.transpose(copy=True)],
    (0, 1): [gene_drug_adj],
    (1, 0): [drug_gene_adj],
    (1, 1): drug_drug_adj_list + [x.transpose(copy=True) for x in drug_drug_adj_list],
}
degrees = {
    0: [gene_degrees, gene_degrees],
    1: drug_degrees_list + drug_degrees_list,
}

# featureless (genes)
gene_feat = sp.identity(n_genes)
gene_nonzero_feat, gene_num_feat = gene_feat.shape
gene_feat = preprocessing.sparse_to_tuple(gene_feat.tocoo())

# features (drugs)
drug_feat = sp.identity(n_drugs)
drug_nonzero_feat, drug_num_feat = drug_feat.shape
drug_feat = preprocessing.sparse_to_tuple(drug_feat.tocoo())

# data representation
num_feat = {
    0: gene_num_feat,
    1: drug_num_feat,
}
nonzero_feat = {
    0: gene_nonzero_feat,
    1: drug_nonzero_feat,
}
feat = {
    0: gene_feat,
    1: drug_feat,
}

edge_type2dim = {k: [adj.shape for adj in adjs] for k, adjs in adj_mats_orig.items()}
edge_type2decoder = {
    (0, 0): 'bilinear',
    (0, 1): 'bilinear',
    (1, 0): 'bilinear',
    (1, 1): 'dedicom',
}

edge_types = {k: len(v) for k, v in adj_mats_orig.items()}
num_edge_types = sum(edge_types.values())
print("Edge types:", "%d" % num_edge_types)

Edge types: 10
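The total of 10 follows directly from how `adj_mats_orig` is built: every matrix is paired with its transpose, so each relation is counted in both directions. A small sketch of the arithmetic, assuming `n_drugdrug_rel_types = 3` as set earlier (the dictionary name is mine):

```python
n_drugdrug_rel_types = 3

edge_type_counts = {
    (0, 0): 2,                         # gene-gene and its transpose
    (0, 1): 1,                         # gene-drug
    (1, 0): 1,                         # drug-gene
    (1, 1): 2 * n_drugdrug_rel_types,  # each drug-drug relation plus its transpose
}
total_edge_types = sum(edge_type_counts.values())
print("Edge types:", total_edge_types)
```

With the real data, each side effect kept for training would add two entries to the `(1, 1)` list in the same way.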


In [7]:
from collections import defaultdict

a = defaultdict(int)
a

defaultdict(int, {})

In [61]:
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
sp.csc_matrix((data, indices, indptr), shape=(3, 3)).toarray()
'''array([[1, 0, 4],
       [0, 0, 5],
       [2, 3, 6]])'''

'array([[1, 0, 4],\n       [0, 0, 5],\n       [2, 3, 6]])'
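To read the CSC example above: column `j` owns the slice `indptr[j]:indptr[j + 1]` of both `indices` (row numbers) and `data` (values). The sketch below decodes the same three arrays by hand, without scipy, to show where each entry of the dense array comes from. The variable names `columns` and `dense` are mine.

```python
import numpy as np

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])

n_cols = len(indptr) - 1
dense = np.zeros((3, n_cols), dtype=int)
columns = {}
for j in range(n_cols):
    start, stop = indptr[j], indptr[j + 1]
    rows = indices[start:stop].tolist()
    values = data[start:stop].tolist()
    columns[j] = list(zip(rows, values))  # (row, value) pairs of column j
    dense[rows, j] = values
```

Column 0 holds values 1 and 2 in rows 0 and 2, column 1 holds value 3 in row 2, and column 2 holds 4, 5, 6 in rows 0, 1, 2, which matches the array printed above.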

In [47]:
df = pd.read_csv(os.path.join('data', 'bio-decagon-targets-all.csv'))
df
df['STITCH'].map(stitch2idx)
df['Gene'].astype('str').map(node2idx)


0          1329.0
1          4654.0
2          3589.0
3          8283.0
4          8212.0
5         17125.0
6         15143.0
7          8214.0
8          9137.0
9         14864.0
10            NaN
11        13343.0
12         8831.0
13         8833.0
14         8218.0
15        15315.0
16         8851.0
17        18078.0
18         8342.0
19         8229.0
20         8363.0
21         6949.0
22        17516.0
23         8213.0
24        15144.0
25         1761.0
26         8241.0
27         8242.0
28         8411.0
29        14152.0
           ...   
131004    15899.0
131005     8355.0
131006    12132.0
131007    14126.0
131008    14166.0
131009     3016.0
131010    15141.0
131011     3242.0
131012    10728.0
131013     8165.0
131014    18491.0
131015    14280.0
131016     7408.0
131017    18816.0
131018    18386.0
131019    12508.0
131020    18786.0
131021    14558.0
131022     9837.0
131023    12215.0
131024     6903.0
131025     5061.0
131026    18775.0
131027    10369.0
131028    

In [52]:
df2 = pd.read_csv(os.path.join('data', 'bio-decagon-combo.csv'))


str

In [63]:
df2['STITCH 1'].map(stitch2idx).isna().sum()
#stitch2idx[CID000003461]

2132055
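`Series.map` with a dictionary returns NaN for every value that is not a key of the dictionary, so the count above is the number of rows whose `STITCH 1` drug is missing from whichever `stitch2idx` was in memory at that point. That is also why the mapped columns show up as floats. A minimal sketch on toy data that counts the unmapped rows and lists the identifiers responsible (all names here are made up):

```python
import pandas as pd

toy_stitch2idx = {"drug_a": 0, "drug_b": 1}
toy_column = pd.Series(["drug_a", "drug_c", "drug_b", "drug_c"])

mapped = toy_column.map(toy_stitch2idx)          # NaN where the key is absent
n_missing = int(mapped.isna().sum())             # rows without an index
missing_ids = sorted(set(toy_column[mapped.isna()]))  # which drugs cause them
```

Running the same two lines on `df2['STITCH 1']` would show which drugs are absent from the index, which should help decide whether the index has to be built from the combination file or from the target file.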

In [64]:
df2['STITCH 1']

0          CID000002173
1          CID000002173
2          CID000002173
3          CID000002173
4          CID000002173
5          CID000002173
6          CID000002173
7          CID000002173
8          CID000002173
9          CID000002173
10         CID000002173
11         CID000002173
12         CID000002173
13         CID000002173
14         CID000002173
15         CID000002173
16         CID000002173
17         CID000002173
18         CID000002173
19         CID000002173
20         CID000002173
21         CID000002173
22         CID000002173
23         CID000002173
24         CID000002173
25         CID000002173
26         CID000002173
27         CID000002173
28         CID000002173
29         CID000002173
               ...     
4649411    CID000003461
4649412    CID000003461
4649413    CID000003461
4649414    CID000003461
4649415    CID000003461
4649416    CID000003461
4649417    CID000003461
4649418    CID000003461
4649419    CID000003461
4649420    CID000003461
4649421    CID00