<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Missing-values---FILTER" data-toc-modified-id="Missing-values---FILTER-1">Missing values - FILTER</a></span></li><li><span><a href="#interaction_types" data-toc-modified-id="interaction_types-2">interaction_types</a></span><ul class="toc-item"><li><span><a href="#Multiple-values" data-toc-modified-id="Multiple-values-2.1">Multiple values</a></span></li></ul></li><li><span><a href="#interaction_source_db_name" data-toc-modified-id="interaction_source_db_name-3">interaction_source_db_name</a></span></li><li><span><a href="#Namespaces---FILTER" data-toc-modified-id="Namespaces---FILTER-4">Namespaces - FILTER</a></span></li><li><span><a href="#Merging-by-gene-drug-pairs" data-toc-modified-id="Merging-by-gene-drug-pairs-5">Merging by gene-drug pairs</a></span><ul class="toc-item"><li><span><a href="#interaction_types" data-toc-modified-id="interaction_types-5.1">interaction_types</a></span></li><li><span><a href="#interaction_source_db_name" data-toc-modified-id="interaction_source_db_name-5.2">interaction_source_db_name</a></span></li></ul></li></ul></div>

# DGIdb notebook: EDA

In [1]:
## for notebook only 

## allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## for printing
from pprint import pprint

## for loading locally-stored files
import pathlib

In [2]:
import pandas as pd

## NOT for parser: for viewing df only
pd.options.display.max_columns = None
pd.set_option('display.max_colwidth', None)

<div class="alert alert-block alert-danger">

This notebook was originally written using the **2024-Dec** interactions.tsv from https://dgidb.org/downloads. Its "last modified" date is Fri, **06 Dec 2024** 15:20:44 GMT, according to the headers returned from a HEAD request to [its download link](https://dgidb.org/data/2024-Dec/interactions.tsv).
    
<br>
    
I didn't use the "latest" interactions.tsv because it is older: its "last modified" date is Mon, **10 Jun 2024** 16:04:52 GMT, according to the headers returned from a HEAD request to [its download link](https://dgidb.org/data/latest/interactions.tsv).

The 2024-Dec file has two header lines showing the "date" version and the DGIdb semantic version **5.0.7**:

```
# Data version: Dec-2024
# DGIdb version: v.5.0.7
```
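
The date comparison above can be reproduced with a short sketch. It assumes the server returns a standard `Last-Modified` header. The helper names `fetch_last_modified` and `is_newer` are illustrative, and the two header strings are the values quoted in this notebook, not fetched here.

```python
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def fetch_last_modified(url, timeout=30):
    """Send a HEAD request and return the raw Last-Modified header (or None)."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")


def is_newer(header_a, header_b):
    """Return True if HTTP date header_a is later than header_b."""
    return parsedate_to_datetime(header_a) > parsedate_to_datetime(header_b)


# Header values quoted in this notebook (not fetched here).
dec_2024 = "Fri, 06 Dec 2024 15:20:44 GMT"
latest = "Mon, 10 Jun 2024 16:04:52 GMT"
print(is_newer(dec_2024, latest))
```

Calling `fetch_last_modified` on each download link and passing the results to `is_newer` would repeat the check against the live files.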

In [3]:
## path to raw resource file
interactions_path = pathlib.Path.home().joinpath("Desktop", 
                                                 "DGIdb_files",
                                                 "2024-Dec-interactions.tsv")

In [4]:
## load file in pandas directly

## should automatically treat first line after comments as header
df = pd.read_table(interactions_path, comment="#")

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98920 entries, 0 to 98919
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   gene_claim_name                98915 non-null  object 
 1   gene_concept_id                90442 non-null  object 
 2   gene_name                      90442 non-null  object 
 3   drug_claim_name                98918 non-null  object 
 4   drug_concept_id                88396 non-null  object 
 5   drug_name                      88396 non-null  object 
 6   drug_is_approved               88395 non-null  object 
 7   drug_is_immunotherapy          88395 non-null  object 
 8   drug_is_antineoplastic         88395 non-null  object 
 9   interaction_source_db_name     98917 non-null  object 
 10  interaction_source_db_version  98917 non-null  object 
 11  interaction_types              35632 non-null  object 
 12  interaction_score              81740 non-null 

## Missing values - FILTER

Review:
* **All of the columns have some missing values**
* the entity ID columns `gene_concept_id` and `drug_concept_id` are missing thousands of values - even though DGIdb did an [entity-resolving/common-ID-assignment step](https://dgidb.org/about/overview/grouping). 
* `interaction_source_db_name` is missing a few values, which is unexpected
* BUT it is expected that `interaction_types` is missing values, since not all sources will assign specific relationship types (ex: inhibitor). 

In [6]:
## EDA - going through columns looking at missing values

# df[df["gene_concept_id"].isna()]

# df[df["drug_concept_id"].isna()]

## these rows look weird - unexpected drug namespace or no drug ID, lots of NAs
df[df["interaction_source_db_name"].isna()]

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
8301,NCBIGENE:362,hgnc:638,AQP5,IUPHAR.LIGAND:2129,iuphar.ligand:2129,N&,,,,,,,,,,
12204,NCBIGENE:2721,ncbigene:2721,GLC,,,,,,,,,,,,,
12205,NCBIGENE:2721,ncbigene:2721,GLC,,,,,,,,,,,,,
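
One possible cause of rows like these (an assumption, not checked against the raw file): with `comment="#"`, pandas drops everything after a `#` anywhere in a line, not only at the start of a line. A value containing `#`, such as an HTML entity, would cut the row short and leave the remaining columns as NA. The sketch below uses made-up data to show the effect, plus one alternative, `skiprows=2`, which relies on the file having exactly two comment lines at the top.

```python
import io

import pandas as pd

# Made-up miniature file: two comment lines, a header, and one row whose
# drug_name contains "#" (hypothetical value, not taken from DGIdb).
raw = (
    "# Data version: Dec-2024\n"
    "# DGIdb version: v.5.0.7\n"
    "gene_name\tdrug_name\tinteraction_source_db_name\n"
    "AQP5\tN&#945;-EXAMPLE\tGuideToPharmacology\n"
)

# comment="#" skips the two leading lines but also truncates the data row at "#".
truncated = pd.read_table(io.StringIO(raw), comment="#")

# Skipping the two leading lines by count leaves "#" inside values untouched.
intact = pd.read_table(io.StringIO(raw), skiprows=2)

print(truncated.iloc[0].tolist())
print(intact.iloc[0].tolist())
```

If this is what happened, the affected rows could be recovered by re-reading the file without `comment`. The row counts in the rest of this notebook come from the `comment="#"` load.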


<div class="alert alert-block alert-success">

DECISION: drop rows with NA in `gene_concept_id` OR `drug_concept_id` OR `interaction_source_db_name`. 

In [7]:
## logs

## number of rows left after dropping NAs
have_values = df.dropna(subset=["gene_concept_id", 
                                "drug_concept_id",
                                "interaction_source_db_name"]).shape[0]
print(f"{have_values} rows have both entity IDs and an underlying source" + 
      f": {have_values / df.shape[0]:.1%}\n")


## gene IDs
have_gene_id = df["gene_concept_id"].notna().sum()
print(f"{have_gene_id} rows have gene IDs: {have_gene_id / df.shape[0]:.1%}")

## drug IDs
have_drug_id = df["drug_concept_id"].notna().sum()
print(f"{have_drug_id} rows have drug IDs: {have_drug_id / df.shape[0]:.1%}")

## sources
have_source = df["interaction_source_db_name"].notna().sum()
print(f"{have_source} rows have an underlying source: {have_source / df.shape[0]:.3%}")

81740 rows have both entity IDs and an underlying source: 82.6%

90442 rows have gene IDs: 91.4%
88396 rows have drug IDs: 89.4%
98917 rows have an underlying source: 99.997%


In [8]:
## drop rows, check

## default (how="any"): drop a row if any column in `subset` is NA
df_filtered = df.dropna(subset=["gene_concept_id", 
                                "drug_concept_id",
                                "interaction_source_db_name"], ignore_index=True).copy()

df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81740 entries, 0 to 81739
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   gene_claim_name                81735 non-null  object 
 1   gene_concept_id                81740 non-null  object 
 2   gene_name                      81740 non-null  object 
 3   drug_claim_name                81740 non-null  object 
 4   drug_concept_id                81740 non-null  object 
 5   drug_name                      81740 non-null  object 
 6   drug_is_approved               81740 non-null  object 
 7   drug_is_immunotherapy          81740 non-null  object 
 8   drug_is_antineoplastic         81740 non-null  object 
 9   interaction_source_db_name     81740 non-null  object 
 10  interaction_source_db_version  81740 non-null  object 
 11  interaction_types              30405 non-null  object 
 12  interaction_score              81740 non-null 

## interaction_types

[2025-11-05 with 2024-Dec data]

The data before filtering has the same unique interaction_types values. 

In [9]:
df_filtered["interaction_types"].nunique()

df_filtered["interaction_types"].value_counts(dropna=False).sort_index()

30

interaction_types
activator                      584
activator|blocker                4
activator|inhibitor              2
agonist                       5882
agonist|inhibitor               22
agonist|modulator                2
antibody                       298
antibody|immunotherapy           4
antisense oligonucleotide        4
binder                         258
blocker                       1807
blocker|activator                2
blocker|inhibitor                2
cleavage                        83
immunotherapy                    3
immunotherapy|antibody           4
inhibitor                    18692
inhibitor|activator              2
inhibitor|agonist               14
inhibitor|blocker                3
inhibitor|modulator              5
inverse agonist                 36
modulator                     1241
modulator|agonist                1
modulator|inhibitor              3
negative modulator             133
other/unknown                  219
positive modulator            1013
po

### Multiple values

Some values are "|"-delimited. `|` is a regex metacharacter, so it needs escaping in `str.contains`.

These rows are only a small proportion of the dataset (70 rows, 0.086%).
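
A minimal sketch of why the escaping matters, using a toy Series rather than the real column. An unescaped `|` is regex alternation, so it matches every non-null string. Escaping it, or passing `regex=False`, matches only the literal pipe.

```python
import pandas as pd

types = pd.Series(["inhibitor", "inhibitor|blocker", None])

# Unescaped "|" is a regex alternation of two empty patterns,
# so it matches every non-null string.
unescaped = types.str.contains("|", na=False)

# Escaped pipe: matches only values that contain a literal "|".
escaped = types.str.contains(r"\|", na=False)

# Same result without regex handling at all.
literal = types.str.contains("|", regex=False, na=False)

print(unescaped.tolist(), escaped.tolist(), literal.tolist())
```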

In [10]:
df_multi_types = df_filtered[df_filtered["interaction_types"].str.contains("\\|", na=False)].copy()

print(f"{df_multi_types.shape[0]} rows with |-delimited interaction_types" + 
      f": {df_multi_types.shape[0] / df_filtered.shape[0]:.3%}")

df_multi_types.head()

70 rows with |-delimited interaction_types: 0.086%


Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
649,NCBIGENE:496,hgnc:820,ATP4B,IUPHAR.LIGAND:10038,iuphar.ligand:10038,COMPOUND 5 [PMID: 29579323],False,False,False,GuideToPharmacology,2024.3,inhibitor|blocker,1.864421,3.627763,0.513931,1.0
654,NCBIGENE:496,hgnc:820,ATP4B,IUPHAR.LIGAND:10039,iuphar.ligand:10039,COMPOUND 6 [PMID: 29579323],False,False,False,GuideToPharmacology,2024.3,inhibitor|blocker,1.864421,3.627763,0.513931,1.0
657,NCBIGENE:496,hgnc:820,ATP4B,IUPHAR.LIGAND:1713,rxcui:318,ADENOSINE TRIPHOSPHATE,True,False,False,GuideToPharmacology,2024.3,inhibitor|blocker,0.109672,0.213398,0.513931,1.0
6029,NCBIGENE:21,hgnc:33,ABCA3,IUPHAR.LIGAND:459,iuphar.ligand:459,MRE 3008F20,False,False,False,GuideToPharmacology,2024.3,agonist|inhibitor,0.134546,0.906941,0.148351,1.0
6205,NCBIGENE:277,hgnc:475,AMY1B,IUPHAR.LIGAND:9494,iuphar.ligand:9494,CYM-5541,False,False,False,GuideToPharmacology,2024.3,modulator|agonist,3.070812,3.627763,0.846475,1.0


In [11]:
df_multi_types["interaction_types"].nunique()

df_multi_types["interaction_types"].value_counts().sort_index()

14

interaction_types
activator|blocker          4
activator|inhibitor        2
agonist|inhibitor         22
agonist|modulator          2
antibody|immunotherapy     4
blocker|activator          2
blocker|inhibitor          2
immunotherapy|antibody     4
inhibitor|activator        2
inhibitor|agonist         14
inhibitor|blocker          3
inhibitor|modulator        5
modulator|agonist          1
modulator|inhibitor        3
Name: count, dtype: int64

**Idea**: split into two separate edges (split into list -> explode). Then merge those with same mappings later.

**Opposing:**
* activator|blocker
* blocker|activator
* activator|inhibitor
* inhibitor|activator
* agonist|inhibitor
* inhibitor|agonist

**Close:**
* blocker|inhibitor
* inhibitor|blocker

**One is kinda a subclass of the other?**
* agonist|modulator
* modulator|agonist
* inhibitor|modulator
* modulator|inhibitor

**Identical?**
* antibody|immunotherapy
* immunotherapy|antibody
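
The split-then-explode idea above can be sketched on toy rows. The type strings are ones seen in this notebook; the IDs are placeholders. After exploding, the two "flipped order" values yield the same individual types, and NA values stay NA.

```python
import pandas as pd

# Toy rows: type strings seen in this notebook, placeholder IDs.
toy = pd.DataFrame(
    {
        "gene_concept_id": ["gene:1", "gene:2", "gene:3"],
        "drug_concept_id": ["drug:1", "drug:2", "drug:3"],
        "interaction_types": ["inhibitor|blocker", "blocker|inhibitor", None],
    }
)

# A single-character pattern is treated as a literal by str.split,
# so "|" needs no escaping here. Each list element becomes its own row.
exploded = toy.assign(
    interaction_types=toy["interaction_types"].str.split("|")
).explode("interaction_types", ignore_index=True)

print(exploded)
```

Each original row keeps its gene and drug IDs, so the exploded rows can later be merged by gene-drug pair like any other row.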

In [12]:
## are the "flipped order" types the same data? esp when same row count?
## NO - based on the few pairs I reviewed

df_multi_types[df_multi_types["interaction_types"] == "activator|inhibitor"]
df_multi_types[df_multi_types["interaction_types"] == "inhibitor|activator"]

# df_multi_types[df_multi_types["interaction_types"] == "agonist|modulator"]
# df_multi_types[df_multi_types["interaction_types"] == "modulator|agonist"]

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
9727,NCBIGENE:749,hgnc:10485,RYR3,IUPHAR.LIGAND:707,iuphar.ligand:707,CA2+,False,False,False,GuideToPharmacology,2024.3,activator|inhibitor,0.163137,0.090694,1.79876,1.0
79392,NCBIGENE:748,hgnc:10484,RYR2,IUPHAR.LIGAND:707,iuphar.ligand:707,CA2+,False,False,False,GuideToPharmacology,2024.3,activator|inhibitor,0.118645,0.090694,1.308189,1.0


Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
24189,NCBIGENE:747,hgnc:1165,DAGLA,IUPHAR.LIGAND:707,iuphar.ligand:707,CA2+,False,False,False,GuideToPharmacology,2024.3,inhibitor|activator,0.130509,0.090694,1.439008,1.0
70567,BRAF,hgnc:1097,BRAF,VEMURAFENIB,rxcui:1147220,VEMURAFENIB,True,False,True,MyCancerGenomeClinicalTrial,30-Feburary-2014,inhibitor|activator,1.587278,0.098048,0.07195,225.0


In [13]:
## from only 4 resources, mostly GuideToPharmacology

df_multi_types["interaction_source_db_name"].value_counts()

interaction_source_db_name
GuideToPharmacology            60
MyCancerGenome                  8
MyCancerGenomeClinicalTrial     1
ChEMBL                          1
Name: count, dtype: int64

## interaction_source_db_name

aka "underlying sources". Each row has a single value.

Almost all counts in the data match the [website](https://dgidb.org/browse/sources)'s "interaction claims in groups". There is a small discrepancy for GuideToPharmacology, likely from the row filtering above. The website also lists DrugBank, but it doesn't actually have any interaction claims.

The data includes all sources from the website **plus NCI. It's not clear to me what NCI is.**

In [14]:
df_filtered["interaction_source_db_name"].nunique()

df_filtered["interaction_source_db_name"].value_counts().sort_index()

21

interaction_source_db_name
CGI                                  345
CIViC                               1013
CKB-CORE                            1777
COSMIC                                34
CancerCommons                        106
ChEMBL                             12291
ClearityFoundationBiomarkers         160
ClearityFoundationClinicalTrial      240
DTC                                23876
DoCM                                  72
FDA                                  402
GuideToPharmacology                16362
MyCancerGenome                       803
MyCancerGenomeClinicalTrial          314
NCI                                 6076
OncoKB                               146
PharmGKB                            5248
TALC                                 564
TEND                                2242
TTD                                 5110
TdgClinicalTrial                    4559
Name: count, dtype: int64

In [15]:
df_filtered[df_filtered["interaction_source_db_name"] == "NCI"]

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
8,ICAM3,hgnc:5346,ICAM3,GRANULOCYTE MACROPHAGE COLONY-STIMULATING FACTOR,ncit:C1288,RECOMBINANT GRANULOCYTE-MACROPHAGE COLONY-STIMULATING FACTOR,False,False,True,NCI,14-September-2017,,13.050950,3.627763,1.798760,2.0
9,ICAM3,hgnc:5346,ICAM3,PMA,ncit:C866,TETRADECANOYLPHORBOL ACETATE,False,False,True,NCI,14-September-2017,,0.283716,0.078864,1.798760,2.0
10,ICAM3,hgnc:5346,ICAM3,GM-CSF,iuphar.ligand:4942,GM-CSF,False,False,False,NCI,14-September-2017,,0.815684,0.226735,1.798760,2.0
11,ICAM3,hgnc:5346,ICAM3,VITAMIN D,rxcui:11253,VITAMIN D,True,False,False,NCI,14-September-2017,,1.003919,0.279059,1.798760,2.0
12,ICAM3,hgnc:5346,ICAM3,INTERFERONS,ncit:C584,RECOMBINANT INTERFERON,False,False,True,NCI,14-September-2017,,0.334640,0.093020,1.798760,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81735,PTPRC,hgnc:9666,PTPRC,PROTEIN KINASE INHIBITOR,ncit:C1404,PROTEIN KINASE INHIBITOR,False,False,False,NCI,14-September-2017,,1.160084,0.725553,0.799449,2.0
81736,PTPRC,hgnc:9666,PTPRC,OESTRADIOL,rxcui:24395,ESTRADIOL VALERATE,True,False,True,NCI,14-September-2017,,0.123413,0.077186,0.799449,2.0
81737,PTPRC,hgnc:9666,PTPRC,PREDNISONE,rxcui:8640,PREDNISONE,True,False,True,NCI,14-September-2017,,0.128898,0.080617,0.799449,2.0
81738,PTPRC,hgnc:9666,PTPRC,HEPARAN SULFATE,rxcui:2603494,HEPARAN SULFATE,False,False,True,NCI,14-September-2017,,0.483369,0.302314,0.799449,2.0


## Namespaces - FILTER

In [16]:
## genes

df_filtered["gene_prefix"] = [i.split(":")[0] for i in df_filtered["gene_concept_id"]]

df_filtered["gene_prefix"].value_counts()

df_filtered[df_filtered["gene_prefix"] == "ensembl"]

gene_prefix
hgnc        80643
ncbigene     1096
ensembl         1
Name: count, dtype: int64

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score,gene_prefix
30322,TARP,ensembl:ENSG00000289746,TARP,TESTOSTERONE,rxcui:10379,TESTOSTERONE,True,False,True,NCI,14-September-2017,,4.015677,0.09302,14.390078,3.0,ensembl


In [17]:
## drugs

df_filtered["drug_prefix"] = [i.split(":")[0] for i in df_filtered["drug_concept_id"]]

df_filtered["drug_prefix"].value_counts()

df_filtered[df_filtered["drug_prefix"] == "chemidplus"]

drug_prefix
rxcui             34765
ncit              15100
iuphar.ligand     15010
chembl            12985
drugbank           3723
wikidata            112
hemonc               30
drugsatfda.nda       12
chemidplus            3
Name: count, dtype: int64

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score,gene_prefix,drug_prefix
5550,CYP7B1,hgnc:2652,CYP7B1,DDD,chemidplus:72-54-8,TDE,False,False,False,NCI,14-September-2017,,6.960506,3.627763,0.959339,2.0,hgnc,chemidplus
31642,NOREPINEPHRINE TRANSPORTER,hgnc:11048,SLC6A2,Hypericum,chemidplus:68917-49-7,ST. JOHN'S WORT,False,False,False,TTD,2020.06.01,,0.483369,3.627763,0.133241,1.0,hgnc,chemidplus
71843,ESTROGEN-RELATED RECEPTOR-ALPHA,hgnc:3471,ESRRA,Dexamethasone palmitate,chemidplus:14899-36-6,DEXAMETHASONE PALMITATE,False,False,False,TTD,2020.06.01,,8.700633,3.627763,2.398346,1.0,hgnc,chemidplus


**NodeNorm can't handle:**
* iuphar.ligand
* wikidata
* hemonc
* drugsatfda.nda
* chemidplus

(Assuming chembl = CHEMBL.COMPOUND. Based on 2 spot-checks, NodeNorm seems to handle these only some of the time?)

Noticed some names also can't be resolved by NameRes, ex: `COMPOUND 5 [PMID: 29579323]`

In [18]:
prefixes_cant_nodenorm = [
    ## "." is a regex wildcard (matches any character) unless escaped
    "iuphar\\.ligand",
    "wikidata",
    "hemonc",
    "drugsatfda\\.nda",
    "chemidplus",
]

In [19]:
## set case=False so it isn't case-sensitive on matches!

n_before = df_filtered.shape[0]
df_filtered = df_filtered[~df_filtered.drug_prefix.str.contains('|'.join(prefixes_cant_nodenorm), case=False)].copy()

## -> log
print(f"{df_filtered.shape[0]} rows ({df_filtered.shape[0] / n_before:.1%}) after filtering out drug prefixes that can't be NodeNormed")

66573 rows (81.4%) after filtering out drug prefixes that can't be NodeNormed
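
The regex filter above matches substrings. A hedged alternative is an exact match on the prefix with `isin`, which needs no escaping. The sketch below uses a toy frame built from IDs that appear in this notebook. The name `prefixes_cant_nodenorm_exact` is illustrative.

```python
import pandas as pd

# Prefixes listed above as not handled by NodeNorm (plain strings, no regex).
prefixes_cant_nodenorm_exact = {
    "iuphar.ligand",
    "wikidata",
    "hemonc",
    "drugsatfda.nda",
    "chemidplus",
}

toy = pd.DataFrame(
    {
        "drug_concept_id": [
            "rxcui:318",
            "iuphar.ligand:2129",
            "chembl:CHEMBL107131",
            "chemidplus:72-54-8",
        ]
    }
)

# Text before the first ":" is the prefix; lower-case it to mirror case=False.
toy["drug_prefix"] = toy["drug_concept_id"].str.split(":", n=1).str[0].str.lower()

# Keep only rows whose prefix is not in the exclusion set.
kept = toy[~toy["drug_prefix"].isin(prefixes_cant_nodenorm_exact)].copy()

print(kept["drug_concept_id"].tolist())
```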


In [20]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 66573 entries, 0 to 81739
Data columns (total 18 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   gene_claim_name                66568 non-null  object 
 1   gene_concept_id                66573 non-null  object 
 2   gene_name                      66573 non-null  object 
 3   drug_claim_name                66573 non-null  object 
 4   drug_concept_id                66573 non-null  object 
 5   drug_name                      66573 non-null  object 
 6   drug_is_approved               66573 non-null  object 
 7   drug_is_immunotherapy          66573 non-null  object 
 8   drug_is_antineoplastic         66573 non-null  object 
 9   interaction_source_db_name     66573 non-null  object 
 10  interaction_source_db_version  66573 non-null  object 
 11  interaction_types              18903 non-null  object 
 12  interaction_score              66573 non-null  floa

## Merging by gene-drug pairs

In [21]:
## first drop some columns - not needed OR values won't make sense after merge
## makes merge faster

cols_not_needed = [
    "gene_claim_name", 
    "gene_name",  ## is unique to each gene_concept_id
    "drug_claim_name",
    "drug_name",  ## is unique to each drug_concept_id
    "interaction_source_db_version", 
    "gene_prefix",
    "drug_prefix",
]

df_merge = df_filtered.drop(columns=cols_not_needed).copy()

## fill NAs with a placeholder string so type combos can be sorted/joined later without type issues
df_merge["interaction_types"] = df_merge["interaction_types"].fillna(value="zNULL")
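
A small illustration of the "type issues" mentioned above, using plain Python sets rather than the real data. A missing value is a float NaN, and Python cannot order a float against a string, so sorting a mixed set fails. A string placeholder avoids that.

```python
import math

# A merged set that still holds a missing value (NaN is a float).
types_with_nan = {"inhibitor", math.nan}

try:
    sorted(types_with_nan)
    sort_failed = False
except TypeError:
    # str and float cannot be ordered against each other
    sort_failed = True

# With a string placeholder, the set sorts and joins cleanly.
types_with_placeholder = {"inhibitor", "zNULL"}
label = ",".join(sorted(types_with_placeholder))

print(sort_failed, label)
```

The leading "z" also makes the placeholder sort after the lower-case type names, so it ends up last in the joined label.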

In [22]:
## merge: takes ~10s to run

cols_define_edge = ["gene_concept_id", "drug_concept_id"]

df_merge = df_merge.groupby(by=cols_define_edge).agg(set).reset_index().copy()
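
What the merge does can be seen on a toy frame. The IDs are placeholders and the values are made up. Grouping by the gene-drug pair and aggregating with `set` collapses duplicate values and keeps distinct ones side by side.

```python
import pandas as pd

# Toy rows: two sources report the same gene-drug pair, one reports another.
toy = pd.DataFrame(
    {
        "gene_concept_id": ["hgnc:1", "hgnc:1", "hgnc:2"],
        "drug_concept_id": ["rxcui:1", "rxcui:1", "rxcui:2"],
        "interaction_source_db_name": ["ChEMBL", "TTD", "DTC"],
        "interaction_types": ["inhibitor", "zNULL", "zNULL"],
    }
)

# One row per gene-drug pair; every other column becomes a set of its values.
merged = (
    toy.groupby(by=["gene_concept_id", "drug_concept_id"]).agg(set).reset_index()
)

# Pairs where the merge produced more than one distinct source.
multi_source = merged[merged["interaction_source_db_name"].map(len) > 1]

print(merged)
```

A set of length 1 means every row for that pair agreed on the value, which is the property the following cells check column by column.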

In [23]:
df_merge.shape[0]

df_merge

55077

Unnamed: 0,gene_concept_id,drug_concept_id,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
0,ensembl:ENSG00000289746,rxcui:10379,{True},{False},{True},{NCI},{zNULL},{4.015676812221963},{0.09301957340359},{14.390078221490326},{3.0}
1,hgnc:100,rxcui:1148138,{True},{False},{True},{GuideToPharmacology},{inhibitor},{0.6868920863011254},{1.8138816813700047},{0.3786862689865875},{1.0}
2,hgnc:100,rxcui:644,{True},{False},{True},{TTD},{zNULL},{0.0808108336824853},{0.2133978448670594},{0.3786862689865875},{1.0}
3,hgnc:10000,chembl:CHEMBL107131,{False},{False},{False},{DTC},{zNULL},{0.1706006488852468},{1.20925445424667},{0.1410791982499051},{1.0}
4,hgnc:10000,chembl:CHEMBL1255835,{False},{False},{False},{DTC},{zNULL},{0.0853003244426234},{0.604627227123335},{0.1410791982499051},{1.0}
...,...,...,...,...,...,...,...,...,...,...,...
55072,ncbigene:7,rxcui:8332,{True},{False},{True},{GuideToPharmacology},{inhibitor|agonist},{0.0610570743378778},{0.4030848180822233},{0.151474507594635},{1.0}
55073,ncbigene:7,rxcui:8353,{False},{False},{True},{GuideToPharmacology},{inhibitor},{0.0457928057534083},{0.3023136135616675},{0.151474507594635},{1.0}
55074,ncbigene:849,ncit:C152542,{False},{False},{True},{GuideToPharmacology},{inhibitor},{5.800422062098392},{1.20925445424667},{4.796692740496775},{1.0}
55075,ncbigene:849,rxcui:1294569,{True},{False},{True},{GuideToPharmacology},{inhibitor},{2.900211031049196},{0.604627227123335},{4.796692740496775},{1.0}


In [24]:
## all single values - so same for a gene-drug pair

## tied to drug, basically a node attribute/annotation
df_merge[df_merge["drug_is_approved"].map(len) > 1].shape[0]
df_merge[df_merge["drug_is_immunotherapy"].map(len) > 1].shape[0]
df_merge[df_merge["drug_is_antineoplastic"].map(len) > 1].shape[0]

0

0

0

In [25]:
## all single values - so same for a gene-drug pair

## scores for gene-drug pair
df_merge[df_merge["interaction_score"].map(len) > 1].shape[0]
df_merge[df_merge["evidence_score"].map(len) > 1].shape[0]

## scores for only drug or only gene
df_merge[df_merge["drug_specificity_score"].map(len) > 1].shape[0]
df_merge[df_merge["gene_specificity_score"].map(len) > 1].shape[0]

0

0

0

0

In [26]:
## end up with multiple values 

df_merge[df_merge["interaction_types"].map(len) > 1].shape[0]

df_merge[df_merge["interaction_source_db_name"].map(len) > 1].shape[0]

3674

5982

### interaction_types

~6.7% of the merged dataset (3674 gene-drug pairs) has multiple interaction_types after the merge.

Are there sets that don't make sense?

In [30]:
df_multi_types_2 = df_merge[df_merge["interaction_types"].map(len) > 1].copy()

print(f"{df_multi_types_2.shape[0] / df_merge.shape[0]:.2%}")

df_multi_types_2

6.67%


Unnamed: 0,gene_concept_id,drug_concept_id,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
87,hgnc:10009,ncit:C152241,{False},{False},{True},"{ChEMBL, TTD}","{inhibitor, zNULL}",{26.10189927944276},{3.6277633627400094},{3.5975195553725814},{2.0}
140,hgnc:1014,rxcui:1364347,{True},{False},{True},"{FDA, PharmGKB, ChEMBL, TALC}","{inhibitor, zNULL}",{0.3650615283838149},{0.139529360105385},{0.6540944646131966},{4.0}
141,hgnc:1014,rxcui:1546019,{True},{False},{True},"{FDA, PharmGKB, ChEMBL, TALC}","{inhibitor, zNULL}",{0.1248894702365682},{0.0477337284571054},{0.6540944646131966},{4.0}
149,hgnc:1014,rxcui:282388,{True},{False},{True},"{ChEMBL, DTC, PharmGKB, TALC, FDA}","{inhibitor, zNULL}",{0.2306986047425497},{0.0503856022602779},{0.6540944646131966},{7.0}
189,hgnc:10251,ncit:C152866,{False},{False},{True},"{ChEMBL, TdgClinicalTrial}","{inhibitor, zNULL}",{0.7677029199836107},{0.9069408406850026},{0.4232375947497154},{2.0}
...,...,...,...,...,...,...,...,...,...,...,...
54642,hgnc:9967,rxcui:1603296,{True},{False},{True},"{CKB-CORE, CIViC, TALC}","{inhibitor, zNULL}",{0.1340277241563171},{0.1909349138284216},{0.1754887587986625},{4.0}
54646,hgnc:9967,rxcui:2370147,{True},{False},{True},"{CIViC, ChEMBL, PharmGKB, OncoKB, CKB-CORE, TTD}","{inhibitor, zNULL}",{5.411369362811306},{0.9069408406850026},{0.1754887587986625},{34.0}
54647,hgnc:9967,rxcui:2394936,{True},{False},{True},"{CIViC, ChEMBL, PharmGKB, OncoKB, CKB-CORE, TTD}","{inhibitor, zNULL}",{2.000842453476449},{0.5182519089628587},{0.1754887587986625},{22.0}
54654,hgnc:9967,rxcui:357977,{True},{False},{True},"{CGI, CIViC, ChEMBL, MyCancerGenome, TdgClinicalTrial, NCI, DoCM, TALC, CKB-CORE, TEND}","{inhibitor, zNULL}",{0.3978948060890666},{0.0755784033904168},{0.1754887587986625},{30.0}


In [31]:
# SORT values first - so each distinct set maps to exactly one string

df_multi_types_2["interaction_types"] = [",".join(sorted(i)) for i in df_multi_types_2["interaction_types"]]

In [32]:
df_multi_types_2["interaction_types"].nunique()

47

In [33]:
df_multi_types_2["interaction_types"].value_counts(normalize=True)

interaction_types
inhibitor,zNULL                                            0.606968
agonist,zNULL                                              0.148340
positive modulator,zNULL                                   0.090093
blocker,zNULL                                              0.029396
binder,zNULL                                               0.023408
inhibitor,other/unknown,zNULL                              0.022863
modulator,zNULL                                            0.019869
antibody,inhibitor,zNULL                                   0.008982
other/unknown,zNULL                                        0.007349
activator,zNULL                                            0.006532
negative modulator,zNULL                                   0.003811
blocker,inhibitor                                          0.002994
potentiator,zNULL                                          0.002722
cleavage,zNULL                                             0.002450
vaccine,zNULL                 

Top are (cover ~85%):
* `inhibitor,zNULL` (61%)
* `agonist,zNULL` (15%)
* `positive modulator,zNULL` (9%)

In [34]:
df_multi_types_2["interaction_types"].value_counts().sort_index()

interaction_types
activator,positive modulator,zNULL                            1
activator,potentiator,zNULL                                   3
activator,zNULL                                              24
agonist,antibody                                              4
agonist,antibody,zNULL                                        4
agonist,immunotherapy,zNULL                                   1
agonist,modulator                                             2
agonist,zNULL                                               545
agonist|inhibitor,zNULL                                       1
antibody,antibody|immunotherapy,inhibitor,zNULL               1
antibody,binder,zNULL                                         3
antibody,blocker,immunotherapy|antibody,inhibitor,zNULL       1
antibody,immunotherapy|antibody,inhibitor,zNULL               2
antibody,inhibitor                                            7
antibody,inhibitor,zNULL                                     33
antibody,modulator,zNU

**MERGED DATA ISSUES**

(fully reviewed)

**Opposing**
* **agonist,inhibitor**
* inhibitor,inhibitor|activator,zNULL

**Makes sense, but tricky to merge? (1 example of each kind)**
* activator,potentiator,zNULL 
* agonist,antibody
* antibody,binder,zNULL
* antibody,blocker,immunotherapy|antibody,inhibitor,zNULL
* antibody,inhibitor
* blocker,inhibitor
* **cleavage,inhibitor,zNULL**
* inhibitor,inverse agonist,zNULL
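
A hedged sketch of how the "opposing" cases could be flagged automatically instead of by eye. The opposing pairs are the ones listed in the "|"-delimited review earlier in this notebook. The helper `has_opposing_types` is hypothetical, and whether these pairs are the right definition of "opposing" is an open modelling choice.

```python
from itertools import combinations

# Opposing pairs taken from the "|"-delimited review earlier in this notebook.
OPPOSING_PAIRS = {
    frozenset({"activator", "blocker"}),
    frozenset({"activator", "inhibitor"}),
    frozenset({"agonist", "inhibitor"}),
}


def has_opposing_types(types):
    """Return True if any two types in the set form a known opposing pair."""
    # Split "|"-delimited members so "inhibitor|activator" counts as both types.
    flat = {part for value in types for part in value.split("|")}
    return any(
        frozenset(pair) in OPPOSING_PAIRS
        for pair in combinations(sorted(flat), 2)
    )


print(has_opposing_types({"agonist", "inhibitor"}))
print(has_opposing_types({"inhibitor", "zNULL"}))
```

Applied to the set-valued `interaction_types` column with `.map(has_opposing_types)`, this would give a boolean mask of gene-drug pairs to review.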

### interaction_source_db_name

~10.9% of the merged data (5982 gene-drug pairs) has multiple sources after the merge.

EDA only

In [35]:
df_multi_sources = df_merge[df_merge["interaction_source_db_name"].map(len) > 1].copy()

print(f"{df_multi_sources.shape[0] / df_merge.shape[0]:.2%}")

df_multi_sources

10.86%


Unnamed: 0,gene_concept_id,drug_concept_id,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
87,hgnc:10009,ncit:C152241,{False},{False},{True},"{ChEMBL, TTD}","{inhibitor, zNULL}",{26.10189927944276},{3.6277633627400094},{3.5975195553725814},{2.0}
137,hgnc:1014,rxcui:11202,{True},{False},{True},"{FDA, PharmGKB}",{zNULL},{0.0585901218393777},{0.0447872020091359},{0.6540944646131966},{2.0}
138,hgnc:1014,rxcui:1307619,{True},{False},{True},"{FDA, PharmGKB}",{zNULL},{0.206339124738678},{0.1577288418582613},{0.6540944646131966},{2.0}
140,hgnc:1014,rxcui:1364347,{True},{False},{True},"{FDA, PharmGKB, ChEMBL, TALC}","{inhibitor, zNULL}",{0.3650615283838149},{0.139529360105385},{0.6540944646131966},{4.0}
141,hgnc:1014,rxcui:1546019,{True},{False},{True},"{FDA, PharmGKB, ChEMBL, TALC}","{inhibitor, zNULL}",{0.1248894702365682},{0.0477337284571054},{0.6540944646131966},{4.0}
...,...,...,...,...,...,...,...,...,...,...,...
54646,hgnc:9967,rxcui:2370147,{True},{False},{True},"{CIViC, ChEMBL, PharmGKB, OncoKB, CKB-CORE, TTD}","{inhibitor, zNULL}",{5.411369362811306},{0.9069408406850026},{0.1754887587986625},{34.0}
54647,hgnc:9967,rxcui:2394936,{True},{False},{True},"{CIViC, ChEMBL, PharmGKB, OncoKB, CKB-CORE, TTD}","{inhibitor, zNULL}",{2.000842453476449},{0.5182519089628587},{0.1754887587986625},{22.0}
54651,hgnc:9967,rxcui:282388,{True},{False},{True},"{NCI, TEND, TdgClinicalTrial}",{zNULL},{0.035368427207917},{0.0503856022602779},{0.1754887587986625},{4.0}
54654,hgnc:9967,rxcui:357977,{True},{False},{True},"{CGI, CIViC, ChEMBL, MyCancerGenome, TdgClinicalTrial, NCI, DoCM, TALC, CKB-CORE, TEND}","{inhibitor, zNULL}",{0.3978948060890666},{0.0755784033904168},{0.1754887587986625},{30.0}


In [36]:
# SORT values first - so each distinct set maps to exactly one string

df_multi_sources["interaction_source_db_name"] = [",".join(sorted(i)) for i in df_multi_sources["interaction_source_db_name"]]

In [37]:
df_multi_sources["interaction_source_db_name"].nunique()

640

In [38]:
df_multi_sources["interaction_source_db_name"].value_counts()[0:10]

interaction_source_db_name
ChEMBL,TTD                          782
TEND,TdgClinicalTrial               767
ChEMBL,TEND,TdgClinicalTrial        515
ChEMBL,TdgClinicalTrial             402
ChEMBL,TEND,TTD,TdgClinicalTrial    249
ChEMBL,TTD,TdgClinicalTrial         220
TEND,TTD,TdgClinicalTrial           180
FDA,PharmGKB                        180
ChEMBL,DTC                          155
TTD,TdgClinicalTrial                155
Name: count, dtype: int64