- Date: 20 June 2019
- Author: Hosein Fooladi

### Data Exploration

In this notebook, I explore the toy dataset used when running the Decagon code. The goal is to understand its structure so that I can replace it with the actual dataset.

In [45]:
import os
from collections import Counter
from itertools import chain, combinations

import numpy as np
import networkx as nx
import scipy.sparse as sp

from decagon.utility import rank_metrics, preprocessing
from polypharmacy.utility import load_combo_se, load_ppi, load_mono_se, load_targets, load_categories

In [40]:
combo2stitch, combo2se, se2name = load_combo_se(os.path.join('data', 'bio-decagon-combo.csv'))
net, node2idx = load_ppi(os.path.join('data', 'bio-decagon-ppi.csv'))
stitch2se, se2name_mono = load_mono_se(os.path.join('data', 'bio-decagon-mono.csv'))
stitch2proteins = load_targets(os.path.join('data', 'bio-decagon-targets-all.csv'))
se2class, se2name_class = load_categories(os.path.join('data', 'bio-decagon-effectcategories.csv'))

Reading: data\bio-decagon-combo.csv
Drug combinations: 63473 Side effects: 1317
Drug-drug interactions: 4649441
Reading: data\bio-decagon-ppi.csv
Edges: 715612
Nodes: 19081
Reading: data\bio-decagon-mono.csv
Reading: data\bio-decagon-targets-all.csv
Reading: data\bio-decagon-effectcategories.csv


These numbers do not match those reported in the paper. For example, the PPI dataset loaded here has 19081 nodes, while the paper reports 19085.

Also:
- The number of drug-drug interactions is 4 649 441, whereas the paper reports 4 651 131.
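The size of each gap can be made explicit with a short check. The sketch below simply hard-codes the counts printed by the loaders and the figures quoted from the paper; the dictionary names are illustrative and are not part of the Decagon code.

```python
# Counts printed by the loaders versus the figures quoted from the paper.
loaded = {"ppi_nodes": 19081, "drug_drug_interactions": 4649441}
paper = {"ppi_nodes": 19085, "drug_drug_interactions": 4651131}

# How many items the loaded data is short of the paper's figure, per quantity.
gaps = {name: paper[name] - loaded[name] for name in loaded}

for name, gap in gaps.items():
    print(f"{name}: loaded={loaded[name]}, paper={paper[name]}, gap={gap}")
```

Both gaps are small relative to the totals, so the structure of the data should be unaffected even if the exact counts differ.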

### Number of Unique Drugs in the Drug Combination Side Effect Dataset

Here I look more closely at the drug combination side effect dataset. In particular, I want to know how many unique drugs are available in it. The count I obtain matches the number mentioned in the paper.

In [56]:
unique_drugs = set(chain.from_iterable(combo2stitch.values()))

print("Number of unique drugs in drug combination side effect dataset is %d" % (len(unique_drugs)))

Number of unique drugs in drug combination side effect dataset is 645
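To make the counting step concrete, here is a minimal sketch on a hand-made mapping. It assumes that `combo2stitch` maps each combination ID to the pair of drug IDs it contains, which is what the flattening above relies on; the toy IDs are hypothetical.

```python
from itertools import chain

# Hypothetical stand-in for combo2stitch: combination ID -> pair of drug IDs.
toy_combo2stitch = {
    "A_B": ["A", "B"],
    "A_C": ["A", "C"],
    "B_C": ["B", "C"],
}

# Flatten all pairs into one stream of drug IDs, then deduplicate with a set.
toy_unique_drugs = set(chain.from_iterable(toy_combo2stitch.values()))

print(len(toy_unique_drugs))
```

Three combinations mention six drug IDs in total, but only three distinct drugs remain after deduplication.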


In [38]:
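# Number of drugs in the drug-target dataset
print(len(stitch2proteins))

def get_gene_counter(gene_map):
    """Count, for each gene, how many drugs target it."""
    genes = []
    for drug in gene_map:
        # Deduplicate so each drug contributes at most once per gene
        genes += list(set(gene_map[drug]))
    return Counter(genes)

combo_counter = get_gene_counter(stitch2proteins)
# Number of distinct genes targeted by at least one drug
print(len(combo_counter))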

1774
7795


In [10]:
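# Toy setup: the gene count comes from the PPI network,
# while the drug count and the number of relation types are synthetic.
val_test_size = 0.05
n_genes = len(node2idx)
n_drugs = 400
n_drugdrug_rel_types = 3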

In [20]:
gene_net = net

gene_adj = nx.adjacency_matrix(gene_net)
gene_degrees = np.array(gene_adj.sum(axis=0)).squeeze()

In [26]:
gene_degrees[0]

18

In [36]:
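# Random binary gene-drug adjacency: an entry is 1 when 10 * N(0, 1) > 15,
# i.e. when a standard normal draw exceeds 1.5 (about 6.7% of entries).
gene_drug_adj = sp.csr_matrix((10 * np.random.randn(n_genes, n_drugs) > 15).astype(int))
drug_gene_adj = gene_drug_adj.transpose(copy=True)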

gene_drug_adj[0]

<1x400 sparse matrix of type '<class 'numpy.int32'>'
	with 28 stored elements in Compressed Sparse Row format>

In [37]:
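drug_drug_adj_list = []
# Entry (d1, d2) counts the genes targeted by both drugs.
# Use the sparse .dot method rather than np.dot, which is not sparse-aware.
tmp = drug_gene_adj.dot(gene_drug_adj)
for i in range(n_drugdrug_rel_types):
    mat = np.zeros((n_drugs, n_drugs))
    for d1, d2 in combinations(range(n_drugs), 2):
        # Relation type i links drug pairs that share exactly i + 4 genes
        if tmp[d1, d2] == i + 4:
            mat[d1, d2] = mat[d2, d1] = 1.
    drug_drug_adj_list.append(sp.csr_matrix(mat))
# Degree of every drug under each relation type
drug_degrees_list = [np.array(drug_adj.sum(axis=0)).squeeze() for drug_adj in drug_drug_adj_list]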

drug_drug_adj_list

[<400x400 sparse matrix of type '<class 'numpy.float64'>'
 	with 17724 stored elements in Compressed Sparse Row format>,
 <400x400 sparse matrix of type '<class 'numpy.float64'>'
 	with 7916 stored elements in Compressed Sparse Row format>,
 <400x400 sparse matrix of type '<class 'numpy.float64'>'
 	with 3018 stored elements in Compressed Sparse Row format>]
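The construction above is easier to follow on a tiny example. The sketch below repeats the same rule (relation type `i` connects two drugs that share exactly `i + 4` target genes) on a hand-written 6-gene, 3-drug matrix. It uses dense NumPy arrays for brevity, and the matrix values and variable names are made up for illustration.

```python
from itertools import combinations

import numpy as np

# Hypothetical gene-drug adjacency: rows are genes, columns are drugs.
toy_gene_drug = np.array([
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
])

# Entry (d1, d2) is the number of genes targeted by both drugs.
shared = toy_gene_drug.T @ toy_gene_drug

n_toy_drugs = toy_gene_drug.shape[1]
toy_rel_mats = []
for i in range(2):  # two relation types are enough for this toy case
    mat = np.zeros((n_toy_drugs, n_toy_drugs))
    for d1, d2 in combinations(range(n_toy_drugs), 2):
        # Relation type i links drug pairs sharing exactly i + 4 genes.
        if shared[d1, d2] == i + 4:
            mat[d1, d2] = mat[d2, d1] = 1.0
    toy_rel_mats.append(mat)

print(shared)
print(toy_rel_mats[0])
print(toy_rel_mats[1])
```

Here drugs 0 and 1 share four genes, as do drugs 1 and 2, so both pairs fall under relation type 0; drugs 0 and 2 share five genes and fall under relation type 1. Each matrix is symmetric because every edge is written in both directions.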

In [8]:


# data representation
adj_mats_orig = {
    (0, 0): [gene_adj, gene_adj.transpose(copy=True)],
    (0, 1): [gene_drug_adj],
    (1, 0): [drug_gene_adj],
    (1, 1): drug_drug_adj_list + [x.transpose(copy=True) for x in drug_drug_adj_list],
}
degrees = {
    0: [gene_degrees, gene_degrees],
    1: drug_degrees_list + drug_degrees_list,
}

# featureless (genes)
gene_feat = sp.identity(n_genes)
gene_nonzero_feat, gene_num_feat = gene_feat.shape
gene_feat = preprocessing.sparse_to_tuple(gene_feat.tocoo())

# features (drugs)
drug_feat = sp.identity(n_drugs)
drug_nonzero_feat, drug_num_feat = drug_feat.shape
drug_feat = preprocessing.sparse_to_tuple(drug_feat.tocoo())

# data representation
num_feat = {
    0: gene_num_feat,
    1: drug_num_feat,
}
nonzero_feat = {
    0: gene_nonzero_feat,
    1: drug_nonzero_feat,
}
feat = {
    0: gene_feat,
    1: drug_feat,
}

edge_type2dim = {k: [adj.shape for adj in adjs] for k, adjs in adj_mats_orig.items()}
edge_type2decoder = {
    (0, 0): 'bilinear',
    (0, 1): 'bilinear',
    (1, 0): 'bilinear',
    (1, 1): 'dedicom',
}

edge_types = {k: len(v) for k, v in adj_mats_orig.items()}
num_edge_types = sum(edge_types.values())
print("Edge types:", "%d" % num_edge_types)

Edge types: 10
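The total of 10 follows directly from how `adj_mats_orig` is assembled: each matrix and each transposed copy counts as its own edge type. The sketch below redoes that arithmetic with plain integers instead of the actual matrices; the variable `toy_edge_types` is illustrative.

```python
n_drugdrug_rel_types = 3

# Number of adjacency matrices stored under each (source, target) node-type pair.
toy_edge_types = {
    (0, 0): 2,                         # gene_adj and its transpose
    (0, 1): 1,                         # gene_drug_adj
    (1, 0): 1,                         # drug_gene_adj
    (1, 1): 2 * n_drugdrug_rel_types,  # each drug-drug relation plus its transpose
}

toy_num_edge_types = sum(toy_edge_types.values())
print("Edge types:", toy_num_edge_types)
```

With three drug-drug relation types this gives 2 + 1 + 1 + 6 = 10, and every additional side effect type in the real dataset would add two more.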


In [14]:
from polypharmacy.utility import load_combo_se, load_ppi, load_mono_se, load_targets, load_categories 

In [17]:
import os
combo2stitch, combo2se, se2name = load_combo_se(os.path.join('data', 'bio-decagon-combo.csv'))

Reading: data\bio-decagon-combo.csv
Drug combinations: 63473 Side effects: 1317
Drug-drug interactions: 4649441
