# Exploratory Data Analysis

This notebook:
* Downloads [DRKG](https://github.com/gnn4dr/DRKG)
* Runs an exploratory data analysis
* Visualizes subgraphs


## Download `DRKG`

In [1]:
from grapharm._utils import download_and_extract
import os

os.makedirs("../outputs", exist_ok=True)
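
The cell above imports `download_and_extract` but does not show how it is called, so its signature is not documented here. As a rough stand-in, the sketch below downloads and unpacks a `.tar.gz` archive using only the standard library. The helper name `fetch_and_unpack` is hypothetical, and `DRKG_URL` is an assumption based on the DRKG README; verify it before relying on it.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumption: archive location as listed in the DRKG README; verify before use.
DRKG_URL = "https://dgl-data.s3-us-west-2.amazonaws.com/dataset/DRKG/drkg.tar.gz"


def fetch_and_unpack(url: str, dest: str) -> Path:
    """Download a .tar.gz archive into `dest` and extract it there.

    Hypothetical helper, not the grapharm implementation.
    """
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / Path(url).name
    if not archive.exists():  # reuse a previously downloaded archive
        urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest_dir)
    return dest_dir


# Example call (commented out because the archive is large):
# fetch_and_unpack(DRKG_URL, "../data/drkg")
```

Keeping the archive next to the extracted files makes repeated runs cheap, at the cost of some disk space.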



In [2]:
# Show the contents of the data directory
!tree ../data

../data
├── drkg
│   ├── drkg.tsv
│   ├── embed
│   │   ├── DRKG_TransE_l2_entity.npy
│   │   ├── DRKG_TransE_l2_relation.npy
│   │   ├── entities.tsv
│   │   ├── mol_contextpred.npy
│   │   ├── mol_edgepred.npy
│   │   ├── mol_infomax.npy
│   │   ├── mol_masking.npy
│   │   ├── Readme.md
│   │   └── relations.tsv
│   ├── entity2src.tsv
│   ├── map
│   │   └── compounds
│   │       ├── BindingDB.tsv
│   │       ├── Brenda.tsv
│   │       ├── ChEBI.tsv
│   │       ├── DrugBank.tsv
│   │       ├── DrugCentral.tsv
│   │       ├── head
│   │       ├── HMDB.tsv
│   │       ├── MolPort.tsv
│   │       └── ZINC.tsv
│   └── relation_glossary.tsv
└── hetionet
    ├── hetionet_test.txt
    ├── hetionet_train.txt
    ├── hetionet-v1.0-edges.sif
    ├── hetionet-v1.0-nodes.tsv
    └── hetionet_valid.txt

5 directories, 26 files


* `drkg.tsv`: the original `drkg` as $(h, r, t)$ triplets, where $h$ and $t$ are entities and $r$ is the relation between them
* `entity2src.tsv`: maps each entity in `drkg` to its original sources
* `relation_glossary.tsv`: glossary of the relations in `drkg`, with associated source information where available
* `embed/entities.tsv`: entity names and their corresponding ids
* `embed/relations.tsv`: relation names and their corresponding ids
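
To make the triplet format concrete, here is a minimal sketch that reads $(h, r, t)$ rows from a headerless TSV stream and extracts the entity type from the `Type::ID` naming scheme. The two sample rows are copied from outputs later in this notebook; the helper names `read_triplets` and `entity_type` are illustrative, not part of `grapharm`.

```python
import csv
import io

# Toy rows standing in for drkg.tsv (tab-separated, no header).
SAMPLE_TSV = (
    "Compound::DB01590\tbioarx::DrugHumGen:Compound:Gene\tGene::2475\n"
    "Compound::DB01590\tDRUGBANK::x-atc::Compound:Atc\tAtc::L01XE10\n"
)


def read_triplets(handle):
    """Yield (head, relation, tail) tuples, skipping malformed rows."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) == 3:
            yield tuple(row)


def entity_type(entity: str) -> str:
    """Return the type prefix of an entity name such as 'Gene::2475'."""
    return entity.split("::", 1)[0]


triplets = list(read_triplets(io.StringIO(SAMPLE_TSV)))
for head, relation, tail in triplets:
    print(entity_type(head), "->", entity_type(tail), "via", relation)
```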

## Data observation

![drkg-structure](https://raw.githubusercontent.com/gnn4dr/DRKG/master/connectivity.png)
> Image credit: https://github.com/gnn4dr/DRKG

### Compare the downloaded data with the reported statistics

In [3]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import pyvis

# Directory holding the extracted DRKG files (see the tree output above)
datadir = "../data/drkg"

drkg = pd.read_csv(f"{datadir}/drkg.tsv", sep="\t", header=None, names=["h", "r", "t"])
re_gloss = pd.read_csv(f"{datadir}/relation_glossary.tsv", sep="\t")
entities = pd.read_csv(f"{datadir}/embed/entities.tsv", sep="\t", header=None, names=["entity_name", "entity_id"])
relations = pd.read_csv(f"{datadir}/embed/relations.tsv", sep="\t", header=None, names=["relation_name", "relation_id"])

In [4]:
print(f"Number of triplets in DRKG: {len(drkg)}")
print("Number of unique entities: {}".format(len(set(drkg.h.tolist() + drkg.t.tolist()))))
print("Number of relation types: {}".format(len(drkg.r.unique())))
# Count how often each entity appears as a head or as a tail, then add the two counts
entity_stats = pd.concat([
    drkg["h"].value_counts().reset_index().rename(columns={"h": "entity"}),
    drkg["t"].value_counts().reset_index().rename(columns={"t": "entity"})
    ], ignore_index=True).groupby("entity").agg("sum").reset_index()

# The entity type is the prefix before "::", e.g. "Gene" in "Gene::2157"
entity_stats["entity_type"] = entity_stats["entity"].str.split("::", expand=True)[0]
entity_stats = entity_stats.groupby("entity_type").agg("sum").reset_index()[
    ["entity_type", "count"]
    ]
print("Number of entity types: {}".format(len(entity_stats)))
entity_stats

Number of triplets in DRKG: 5874261
Number of unique entities: 97238
Number of relation types: 107
Number of entity types: 13


| | entity_type | count |
|---|---|---|
| 0 | Anatomy | 730097 |
| 1 | Atc | 15750 |
| 2 | Biological Process | 559504 |
| 3 | Cellular Component | 73566 |
| 4 | Compound | 3221926 |
| 5 | Disease | 215777 |
| 6 | Gene | 6592315 |
| 7 | Molecular Function | 97222 |
| 8 | Pathway | 84372 |
| 9 | Pharmacologic Class | 1029 |


Some terms:
* `ATC`: Anatomical Therapeutic Chemical ([ATC](https://www.who.int/tools/atc-ddd-toolkit/atc-classification#:~:text=In%20the%20Anatomical%20Therapeutic%20Chemical,therapeutic%2C%20pharmacological%20and%20chemical%20properties.))
* `Tax`: Taxonomy (a scheme of classification, especially a hierarchical classification)
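
Note that the `count` column in the entity-type table counts triplet endpoints (every appearance of an entity as head or tail), not unique entities. The following standard-library sketch reproduces that logic on three toy triplets taken from this notebook's outputs; `endpoint_counts_by_type` is an illustrative name, not a `grapharm` function.

```python
from collections import Counter

# Toy triplets standing in for the full drkg DataFrame.
triplets = [
    ("Compound::DB01590", "bioarx::DrugHumGen:Compound:Gene", "Gene::2475"),
    ("Compound::DB01590", "bioarx::DrugHumGen:Compound:Gene", "Gene::1576"),
    ("Compound::DB01590", "DRUGBANK::x-atc::Compound:Atc", "Atc::L01XE10"),
]


def endpoint_counts_by_type(rows):
    """Count head and tail appearances, grouped by entity type."""
    counts = Counter()
    for head, _relation, tail in rows:
        for entity in (head, tail):
            counts[entity.split("::", 1)[0]] += 1
    return counts


type_counts = endpoint_counts_by_type(triplets)
unique_entities = {e for h, _r, t in triplets for e in (h, t)}

print(dict(type_counts))     # per-type endpoint counts
print(len(unique_entities))  # number of distinct entities
```

A useful sanity check follows from this definition: the per-type counts must sum to twice the number of triplets, since each triplet contributes exactly one head and one tail.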

In [5]:
print(f"Number of embedded entities: {len(entities)}")
entities.head()

Number of embedded entities: 97238


| | entity_name | entity_id |
|---|---|---|
| 0 | Gene::2157 | 0 |
| 1 | Gene::5264 | 1 |
| 2 | Gene::2158 | 2 |
| 3 | Gene::3309 | 3 |
| 4 | Gene::28912 | 4 |


In [6]:
print(f"Number of embedded relation types: {len(relations)}")
relations.head()

Number of embedded relation types: 107


| | relation_name | relation_id |
|---|---|---|
| 0 | bioarx::HumGenHumGen:Gene:Gene | 0 |
| 1 | bioarx::VirGenHumGen:Gene:Gene | 1 |
| 2 | bioarx::DrugVirGen:Compound:Gene | 2 |
| 3 | bioarx::DrugHumGen:Compound:Gene | 3 |
| 4 | bioarx::Covid2_acc_host_gene::Disease:Gene | 4 |
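
The relation names shown here pack several fields into one string: a source, a relation name, and the head and tail entity types. The separators are not uniform (some names use `:` and others `::` before the entity types), so the sketch below splits from the right. It is inferred only from the names visible in this notebook and may not cover every relation in DRKG; `parse_relation` and `RelationParts` are illustrative names.

```python
from typing import NamedTuple


class RelationParts(NamedTuple):
    source: str
    name: str
    head_type: str
    tail_type: str


def parse_relation(relation: str) -> RelationParts:
    """Split 'source::name[:]:Head:Tail' into its four parts."""
    # The last two colon-separated fields are the head and tail entity types.
    prefix, head_type, tail_type = relation.rsplit(":", 2)
    # Names that use '::' before the types leave a trailing ':' on the prefix.
    prefix = prefix.rstrip(":")
    source, _, name = prefix.partition("::")
    return RelationParts(source, name, head_type, tail_type)


for rel in (
    "bioarx::HumGenHumGen:Gene:Gene",
    "bioarx::Covid2_acc_host_gene::Disease:Gene",
    "Hetionet::CcSE::Compound:Side Effect",
):
    print(parse_relation(rel))
```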


### Link types

In [7]:
print("There are {} connected entity-types".format(len(re_gloss["Connected entity-types"].unique())))
print("-"*50)
for x in re_gloss["Connected entity-types"].unique():
    print(x)

There are 17 connected entity-types
--------------------------------------------------
Compound:Gene
Compound:Compound
ATC:Compound
Compound:Disease
Gene:Gene
Disease:Gene
Gene:Tax
Anatomy:Gene
Compound:Side Effect
Anatomy:Disease
Disease:Symptom
Diisease:Disease
Biological Process:Gene
Cellular Component:Gene
Gene:Molecular Function
Gene:Pathway
Compound:Pharmacologic Class


In [8]:
print("There are {} interaction types:".format(len(re_gloss["Interaction-type"].unique())))
print("-"*50)
for x in re_gloss["Interaction-type"].unique():
    print(x)

There are 80 interaction types:
--------------------------------------------------
activation
agonism
allosteric modulation
antagonism
antibody
binding
blocking
channel blocking
inhibition
modulation
other
partial agonism
positive allosteric modulation
carrier
drug-drug interaction
enzyme
target
Compound belongs to Anatomical Therapeutic Chemical (ATC) code.
Compound treats the disease
agonism, activation
antagonism, blocking
binding, ligand (esp. receptors)
inhibits cell growth (esp. cancers)
drug targets
increases expression/production
decreases expression/production
affects expression/production (neutral)
promotes progression
same protein or complex
signaling pathway
role in disease pathogenesis
role in pathogenesis
metabolism, pharmacokinetics
improper regulation linked to disease
biomarkers (diagnostic)
biomarkers (of disease progression)
inhibits
transport, channels
alleviates, reduces
prevents, suppresses
production by cell population
regulation
side effect/adverse event
treatme

### Load `drkg` into NetworkX

In [9]:
from grapharm._utils import tsv2networkx, print_graph_stats
from grapharm.viz import draw_connected_components
import networkx as nx

G = tsv2networkx(drkg, re_gloss)
print_graph_stats(G, drkg)
# Materialize each connected component as an independent subgraph copy
connected_components = [G.subgraph(c).copy() for c in nx.connected_components(G)]
draw_connected_components(G, connected_components, "../outputs")

Number of nodes: 97238
Number of node types: 97238
Number of edges: 4400766
Number of edge types: 107
Average edges per node: 45.257677039840395
Number of subgraphs: 303
The largest subgraph has 96420 nodes (13 types) and 4400229 edges.
The second largest subgraph has 28 nodes and 29 edges.
Number of subgraphs with less than 10 nodes: 298
Number of nodes to display: 74
../outputs/subgraph.html
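
The "subgraphs" in these statistics are connected components. The sketch below shows the idea on toy data with a breadth-first search over an undirected adjacency map. It also illustrates one possible reason the edge count is lower than the triplet count: if `tsv2networkx` builds a simple undirected graph (an assumption, not confirmed here), several relations between the same pair of entities collapse into a single edge. All names in the block are illustrative.

```python
from collections import defaultdict, deque

# Toy triplets: two relations between A and B, plus a separate D-E pair.
triplets = [
    ("Gene::A", "rel1", "Gene::B"),
    ("Gene::A", "rel2", "Gene::B"),
    ("Gene::B", "rel1", "Gene::C"),
    ("Gene::D", "rel1", "Gene::E"),
]


def build_adjacency(rows):
    """Undirected adjacency map; parallel relations collapse into one edge."""
    adjacency = defaultdict(set)
    for head, _relation, tail in rows:
        adjacency[head].add(tail)
        adjacency[tail].add(head)
    return adjacency


def connected_components(adjacency):
    """Return components as sets of nodes, largest first."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


adjacency = build_adjacency(triplets)
unique_edges = {frozenset((h, t)) for h, _r, t in triplets}
components = connected_components(adjacency)

print(len(triplets), "triplets,", len(unique_edges), "unique undirected edges")
print([sorted(c) for c in components])
```

On the real graph, `nx.connected_components(G)` does this work; the hand-written search is only meant to show what is being counted.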


## Visualize subgraphs

In [10]:
from grapharm.viz import (select_entities_for_display,
                          draw_subgraph_widget)

print("This step may take more than 1 minute to run! Be patient!")
selected_entities = select_entities_for_display(drkg)

This step may take more than 1 minute to run! Be patient!
There are 4500 entities with [15, 25] connections, belonging to 13 entity types
--------------------------------------------------
Entity type statistics:
---
entity_type
Gene                   1287
Biological Process     1028
Compound                645
Disease                 388
Side Effect             320
Pathway                 275
Molecular Function      252
Cellular Component      132
Atc                      68
Symptom                  48
Anatomy                  45
Pharmacologic Class       7
Tax                       5
Name: count, dtype: int64
--------------------------------------------------
For every entity type, choose 5 entities
---


In [11]:
drkg[drkg["h"].str.contains("DB01590")]

| | h | r | t |
|---|---|---|---|
| 83823 | Compound::DB01590 | bioarx::DrugHumGen:Compound:Gene | Gene::2475 |
| 83824 | Compound::DB01590 | bioarx::DrugHumGen:Compound:Gene | Gene::1576 |
| 1087636 | Compound::DB01590 | DRUGBANK::x-atc::Compound:Atc | Atc::L01XE10 |
| 1087637 | Compound::DB01590 | DRUGBANK::ddi-interactor-in::Compound:Compound | Compound::DB06643 |
| 1087638 | Compound::DB01590 | DRUGBANK::ddi-interactor-in::Compound:Compound | Compound::DB01656 |
| ... | ... | ... | ... |
| 3593463 | Compound::DB01590 | Hetionet::CcSE::Compound:Side Effect | Side Effect::C0549463 |
| 3593640 | Compound::DB01590 | Hetionet::CcSE::Compound:Side Effect | Side Effect::C0020555 |
| 3594208 | Compound::DB01590 | Hetionet::CcSE::Compound:Side Effect | Side Effect::C0035455 |
| 3594461 | Compound::DB01590 | Hetionet::CcSE::Compound:Side Effect | Side Effect::C0017675 |


In [12]:
draw_subgraph_widget(G, 
                     selected_entities=selected_entities, 
                     drkg=drkg,
                     outdir="../outputs",
                     logo_path="../assets/GraPharm-icon.png")
