<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#FILTER-Missing-values" data-toc-modified-id="FILTER-Missing-values-1">FILTER Missing values</a></span></li><li><span><a href="#interaction_types" data-toc-modified-id="interaction_types-2">interaction_types</a></span><ul class="toc-item"><li><span><a href="#EDA-&quot;|-delimited&quot;-values" data-toc-modified-id="EDA-&quot;|-delimited&quot;-values-2.1">EDA "|-delimited" values</a></span></li><li><span><a href="#Split-&quot;|-delimited&quot;" data-toc-modified-id="Split-&quot;|-delimited&quot;-2.2">Split "|-delimited"</a></span></li></ul></li><li><span><a href="#EDA-interaction_source_db_name" data-toc-modified-id="EDA-interaction_source_db_name-3">EDA interaction_source_db_name</a></span></li><li><span><a href="#FILTER-Namespaces" data-toc-modified-id="FILTER-Namespaces-4">FILTER Namespaces</a></span></li><li><span><a href="#EDA-drug_is_immunotherapy" data-toc-modified-id="EDA-drug_is_immunotherapy-5">EDA drug_is_immunotherapy</a></span></li><li><span><a href="#EDA-Merging-by-gene-drug-pairs" data-toc-modified-id="EDA-Merging-by-gene-drug-pairs-6">EDA Merging by gene-drug pairs</a></span><ul class="toc-item"><li><span><a href="#interaction_types" data-toc-modified-id="interaction_types-6.1">interaction_types</a></span></li><li><span><a href="#interaction_source_db_name" data-toc-modified-id="interaction_source_db_name-6.2">interaction_source_db_name</a></span></li></ul></li><li><span><a href="#EDA-Merging-by-gene-drug-interaction_type-sets" data-toc-modified-id="EDA-Merging-by-gene-drug-interaction_type-sets-7">EDA Merging by gene-drug-interaction_type sets</a></span></li><li><span><a href="#Experimental" data-toc-modified-id="Experimental-8">Experimental</a></span><ul class="toc-item"><li><span><a href="#sources-logic" data-toc-modified-id="sources-logic-8.1">sources logic</a></span></li><li><span><a href="#scores-logic" data-toc-modified-id="scores-logic-8.2">scores 
logic</a></span></li></ul></li><li><span><a href="#For-comparing-to-pipeline-output" data-toc-modified-id="For-comparing-to-pipeline-output-9">For comparing to pipeline output</a></span></li></ul></div>

# DGIdb notebook: EDA + code dev

In [1]:
## for notebook only 

## display the value of every top-level expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## for printing
from pprint import pprint

## for loading locally-stored files
import pathlib

In [2]:
import pandas as pd

## NOT for parser: for viewing df only
pd.options.display.max_columns = None
pd.set_option('display.max_colwidth', None)

<div class="alert alert-block alert-danger">

This notebook was originally written using the **2024-Dec** interactions.tsv from https://dgidb.org/downloads. Its "last modified" date is Fri, **06 Dec 2024** 15:20:44 GMT, according to the headers returned from a HEAD request to [its download link](https://dgidb.org/data/2024-Dec/interactions.tsv).
    
<br>
    
I didn't use the "latest" interactions.tsv because it is older: its "last modified" date is Mon, **10 Jun 2024** 16:04:52 GMT, according to the headers returned from a HEAD request to [its download link](https://dgidb.org/data/latest/interactions.tsv).

The 2024-Dec file has two header lines showing the DGIdb semantic version **5.0.7** and "data" version (month-year date). 

```
# Data version: Dec-2024
# DGIdb version: v.5.0.7
```
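
A minimal sketch of how the two "last modified" dates could be fetched and compared, using only the standard library. The helper names `fetch_last_modified` and `is_newer` are hypothetical (not part of any pipeline code), and the fetch needs network access, so the comparison below runs on the two header values quoted above.

```python
import urllib.request
from email.utils import parsedate_to_datetime
from typing import Optional


def fetch_last_modified(url: str) -> Optional[str]:
    """Send a HEAD request and return the raw Last-Modified header, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.headers.get("Last-Modified")


def is_newer(candidate: str, reference: str) -> bool:
    """Compare two HTTP date strings (hypothetical helper)."""
    return parsedate_to_datetime(candidate) > parsedate_to_datetime(reference)


## header values quoted above, so this part runs without network access
dec_2024_modified = "Fri, 06 Dec 2024 15:20:44 GMT"
latest_modified = "Mon, 10 Jun 2024 16:04:52 GMT"
dec_is_newer = is_newer(dec_2024_modified, latest_modified)
print(dec_is_newer)
```

With network access, `fetch_last_modified("https://dgidb.org/data/latest/interactions.tsv")` would be one way to re-check whether "latest" has caught up.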

In [3]:
## path to raw resource file
interactions_path = pathlib.Path.home().joinpath("Desktop", 
                                                 "DGIdb_files",
                                                 "2024-Dec-interactions.tsv")

In [4]:
## load file in pandas directly

## skip the first two lines (version comments): header=2 uses the third line as column names
## don't set comment="#" here: some names contain "#", and pandas then drops
##   the rest of that line, so the remaining columns come out as NA
df = pd.read_table(interactions_path, header=2)
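
The `comment="#"` problem can be reproduced on a tiny in-memory file. This is a sketch with made-up rows (the drug name `COMPOUND #5` and the scores are invented for illustration), not data from the real file.

```python
import io

import pandas as pd

## toy file: two version-comment lines, a header, and one name containing "#"
raw = (
    "# Data version: Dec-2024\n"
    "# DGIdb version: v.5.0.7\n"
    "gene_name\tdrug_name\tinteraction_score\n"
    "GENE1\tCOMPOUND #5\t1.5\n"
    "GENE2\tDRUG2\t0.5\n"
)

## comment="#" cuts the GENE1 line at the "#", so its score is lost
with_comment = pd.read_table(io.StringIO(raw), comment="#")

## header=2 skips the two comment lines and keeps the names intact
with_header = pd.read_table(io.StringIO(raw), header=2)

print(with_comment)
print(with_header)
```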

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98920 entries, 0 to 98919
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   gene_claim_name                98915 non-null  object 
 1   gene_concept_id                90442 non-null  object 
 2   gene_name                      90442 non-null  object 
 3   drug_claim_name                98920 non-null  object 
 4   drug_concept_id                88398 non-null  object 
 5   drug_name                      88398 non-null  object 
 6   drug_is_approved               88398 non-null  object 
 7   drug_is_immunotherapy          88398 non-null  object 
 8   drug_is_antineoplastic         88398 non-null  object 
 9   interaction_source_db_name     98920 non-null  object 
 10  interaction_source_db_version  98920 non-null  object 
 11  interaction_types              35635 non-null  object 
 12  interaction_score              81743 non-null 

## FILTER Missing values

Review:
* **Most columns have some missing values.** In the `df.info()` output above, only `drug_claim_name`, `interaction_source_db_name`, and `interaction_source_db_version` are fully populated.
* The entity ID columns `gene_concept_id` and `drug_concept_id` are missing thousands of values (8,478 and 10,522 rows), even though DGIdb did an [entity-resolving/common-ID-assignment step](https://dgidb.org/about/overview/grouping). 
* Missing values in `interaction_types` are expected, since not every source assigns a specific relationship type (ex: inhibitor). 

In [6]:
## EDA - going through columns looking at missing values

# df[df["gene_concept_id"].isna()]

# df[df["drug_concept_id"].isna()]

df[df["interaction_source_db_name"].isna()]

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score


<div class="alert alert-block alert-success">

**DECISION**: drop rows with NA in `gene_concept_id` OR `drug_concept_id`. 

In [7]:
## logs

## number of rows left after dropping NAs
## with subset=..., a row is dropped if any of the listed columns is NA (default how="any")
have_values = df.dropna(subset=["gene_concept_id", 
                                "drug_concept_id"]).shape[0]
print(f"{have_values} rows ({have_values / df.shape[0]:.1%}) kept (have both entity IDs)")
print("\n")

## gene IDs
have_gene_id = df["gene_concept_id"].notna().sum()
print(f"{have_gene_id} rows have gene IDs: {have_gene_id / df.shape[0]:.1%}")

## drug IDs
have_drug_id = df["drug_concept_id"].notna().sum()
print(f"{have_drug_id} rows have drug IDs: {have_drug_id / df.shape[0]:.1%}")

81743 rows (82.6%) kept (have both entity IDs)


90442 rows have gene IDs: 91.4%
88398 rows have drug IDs: 89.4%


In [8]:
## save set of interaction_types before filtering, to compare to after

starting_interact_types = set(df["interaction_types"].unique())
len(starting_interact_types)

31

In [9]:
## drop rows, check
## a row is dropped if any of the subset columns is NA (default how="any")
# df = df.dropna(subset=["gene_concept_id", 
#                        "drug_concept_id",
#                        "interaction_source_db_name"], ignore_index=True).copy()
df.dropna(subset=["gene_concept_id", 
                  "drug_concept_id",
                  "interaction_source_db_name"], 
          ignore_index=True, 
          inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81743 entries, 0 to 81742
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   gene_claim_name                81738 non-null  object 
 1   gene_concept_id                81743 non-null  object 
 2   gene_name                      81743 non-null  object 
 3   drug_claim_name                81743 non-null  object 
 4   drug_concept_id                81743 non-null  object 
 5   drug_name                      81743 non-null  object 
 6   drug_is_approved               81743 non-null  object 
 7   drug_is_immunotherapy          81743 non-null  object 
 8   drug_is_antineoplastic         81743 non-null  object 
 9   interaction_source_db_name     81743 non-null  object 
 10  interaction_source_db_version  81743 non-null  object 
 11  interaction_types              30408 non-null  object 
 12  interaction_score              81743 non-null 

## interaction_types

[2025-11-05 with 2024-Dec data]

Filtering did not remove any interaction_types values: the set of unique values is the same before and after. 

In [10]:
## compare before-after

starting_interact_types == set(df["interaction_types"].unique())

True
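
Note on the comparison above: both sets include NaN, and NaN is not equal to itself, so the `==` only works when both sets hold the same NaN object. A sketch of a comparison that doesn't depend on that, using toy Series (the values are examples, not the real column):

```python
import numpy as np
import pandas as pd

## toy stand-ins for the column before and after filtering
before = pd.Series(["inhibitor", np.nan, "agonist", np.nan])
after = pd.Series(["agonist", "inhibitor", float("nan")])


def type_set(series: pd.Series) -> set:
    """Unique non-missing values as a set."""
    return set(series.dropna().unique())


## compare the real values and the presence of missing values separately
same_values = type_set(before) == type_set(after)
same_missing_flag = bool(before.isna().any()) == bool(after.isna().any())

print(same_values, same_missing_flag)
```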

In [11]:
df["interaction_types"].nunique(dropna=False)

df["interaction_types"].value_counts(dropna=False).sort_index()

31

interaction_types
activator                      584
activator|blocker                4
activator|inhibitor              2
agonist                       5882
agonist|inhibitor               22
agonist|modulator                2
antibody                       298
antibody|immunotherapy           4
antisense oligonucleotide        4
binder                         258
blocker                       1807
blocker|activator                2
blocker|inhibitor                2
cleavage                        83
immunotherapy                    3
immunotherapy|antibody           4
inhibitor                    18695
inhibitor|activator              2
inhibitor|agonist               14
inhibitor|blocker                3
inhibitor|modulator              5
inverse agonist                 36
modulator                     1241
modulator|agonist                1
modulator|inhibitor              3
negative modulator             133
other/unknown                  219
positive modulator            1013
po

In [12]:
## replace NA with "~NULL"
## a string placeholder is easier to work with in later steps, and "~" sorts after letters, so it shows up last in sorted output

df["interaction_types"] = df["interaction_types"].fillna("~NULL")

In [13]:
df["interaction_types"].value_counts().sort_index()

interaction_types
activator                      584
activator|blocker                4
activator|inhibitor              2
agonist                       5882
agonist|inhibitor               22
agonist|modulator                2
antibody                       298
antibody|immunotherapy           4
antisense oligonucleotide        4
binder                         258
blocker                       1807
blocker|activator                2
blocker|inhibitor                2
cleavage                        83
immunotherapy                    3
immunotherapy|antibody           4
inhibitor                    18695
inhibitor|activator              2
inhibitor|agonist               14
inhibitor|blocker                3
inhibitor|modulator              5
inverse agonist                 36
modulator                     1241
modulator|agonist                1
modulator|inhibitor              3
negative modulator             133
other/unknown                  219
positive modulator            1013
po

### EDA "|-delimited" values

Multiple values, "|"-delimited. `|` is a special character in regular expressions, so it needs escaping in `str.contains`.

Only a small proportion of the dataset has them.
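
A small sketch of why the escaping matters, on a toy Series. `regex=False` is an alternative to escaping that treats the pattern as a literal string.

```python
import pandas as pd

types = pd.Series(["inhibitor", "inhibitor|blocker", "agonist"])

## unescaped "|" is regex alternation between two empty patterns: matches every row
unescaped = types.str.contains("|")

## escaped pipe: matches only values that contain a literal "|"
escaped = types.str.contains("\\|")

## same result without regex handling at all
literal = types.str.contains("|", regex=False)

print(unescaped.tolist(), escaped.tolist(), literal.tolist())
```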

In [14]:
df_piped_types = df[df["interaction_types"].str.contains("\\|")].copy()

print(f"{df_piped_types.shape[0]} rows with |-delimited interaction_types" + 
      f": {df_piped_types.shape[0] / df.shape[0]:.3%}")

df_piped_types.head()

70 rows with |-delimited interaction_types: 0.086%


Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
649,NCBIGENE:496,hgnc:820,ATP4B,IUPHAR.LIGAND:10038,iuphar.ligand:10038,COMPOUND 5 [PMID: 29579323],False,False,False,GuideToPharmacology,2024.3,inhibitor|blocker,1.864421,3.627763,0.513931,1.0
654,NCBIGENE:496,hgnc:820,ATP4B,IUPHAR.LIGAND:10039,iuphar.ligand:10039,COMPOUND 6 [PMID: 29579323],False,False,False,GuideToPharmacology,2024.3,inhibitor|blocker,1.864421,3.627763,0.513931,1.0
657,NCBIGENE:496,hgnc:820,ATP4B,IUPHAR.LIGAND:1713,rxcui:318,ADENOSINE TRIPHOSPHATE,True,False,False,GuideToPharmacology,2024.3,inhibitor|blocker,0.109672,0.213398,0.513931,1.0
6029,NCBIGENE:21,hgnc:33,ABCA3,IUPHAR.LIGAND:459,iuphar.ligand:459,MRE 3008F20,False,False,False,GuideToPharmacology,2024.3,agonist|inhibitor,0.134546,0.906941,0.148351,1.0
6205,NCBIGENE:277,hgnc:475,AMY1B,IUPHAR.LIGAND:9494,iuphar.ligand:9494,CYM-5541,False,False,False,GuideToPharmacology,2024.3,modulator|agonist,3.070812,3.627763,0.846475,1.0


In [15]:
df_piped_types["interaction_types"].nunique()

df_piped_types["interaction_types"].value_counts().sort_index()

14

interaction_types
activator|blocker          4
activator|inhibitor        2
agonist|inhibitor         22
agonist|modulator          2
antibody|immunotherapy     4
blocker|activator          2
blocker|inhibitor          2
immunotherapy|antibody     4
inhibitor|activator        2
inhibitor|agonist         14
inhibitor|blocker          3
inhibitor|modulator        5
modulator|agonist          1
modulator|inhibitor        3
Name: count, dtype: int64

**REVIEW**

**Opposing:**
* activator|blocker
* blocker|activator
* activator|inhibitor
* inhibitor|activator
* agonist|inhibitor
* inhibitor|agonist

**Close:**
* blocker|inhibitor
* inhibitor|blocker

**One is kinda a subclass of the other?**
* agonist|modulator
* modulator|agonist
* inhibitor|modulator
* modulator|inhibitor

**Identical?**
* antibody|immunotherapy
* immunotherapy|antibody

In [16]:
## are the "flipped order" types the same data? esp when same row count?
## NO - based on the few pairs I reviewed

df_piped_types[df_piped_types["interaction_types"] == "activator|inhibitor"]
df_piped_types[df_piped_types["interaction_types"] == "inhibitor|activator"]

# df_piped_types[df_piped_types["interaction_types"] == "agonist|modulator"]
# df_piped_types[df_piped_types["interaction_types"] == "modulator|agonist"]

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
9728,NCBIGENE:749,hgnc:10485,RYR3,IUPHAR.LIGAND:707,iuphar.ligand:707,CA2+,False,False,False,GuideToPharmacology,2024.3,activator|inhibitor,0.163137,0.090694,1.79876,1.0
79395,NCBIGENE:748,hgnc:10484,RYR2,IUPHAR.LIGAND:707,iuphar.ligand:707,CA2+,False,False,False,GuideToPharmacology,2024.3,activator|inhibitor,0.118645,0.090694,1.308189,1.0


Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
24192,NCBIGENE:747,hgnc:1165,DAGLA,IUPHAR.LIGAND:707,iuphar.ligand:707,CA2+,False,False,False,GuideToPharmacology,2024.3,inhibitor|activator,0.130509,0.090694,1.439008,1.0
70570,BRAF,hgnc:1097,BRAF,VEMURAFENIB,rxcui:1147220,VEMURAFENIB,True,False,True,MyCancerGenomeClinicalTrial,30-Feburary-2014,inhibitor|activator,1.587278,0.098048,0.07195,225.0


In [17]:
## from only 4 resources, mostly GuideToPharmacology

df_piped_types["interaction_source_db_name"].value_counts()

interaction_source_db_name
GuideToPharmacology            60
MyCancerGenome                  8
MyCancerGenomeClinicalTrial     1
ChEMBL                          1
Name: count, dtype: int64

### Split "|-delimited"

In [18]:
df["interaction_types"] = df["interaction_types"].str.split("|")

In [19]:
## this is correct - the row count in df_piped_types was the same
df[df["interaction_types"].map(len) > 1].shape[0]

df[df["interaction_types"].map(len) > 1]

70

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
649,NCBIGENE:496,hgnc:820,ATP4B,IUPHAR.LIGAND:10038,iuphar.ligand:10038,COMPOUND 5 [PMID: 29579323],False,False,False,GuideToPharmacology,2024.3,"[inhibitor, blocker]",1.864421,3.627763,0.513931,1.0
654,NCBIGENE:496,hgnc:820,ATP4B,IUPHAR.LIGAND:10039,iuphar.ligand:10039,COMPOUND 6 [PMID: 29579323],False,False,False,GuideToPharmacology,2024.3,"[inhibitor, blocker]",1.864421,3.627763,0.513931,1.0
657,NCBIGENE:496,hgnc:820,ATP4B,IUPHAR.LIGAND:1713,rxcui:318,ADENOSINE TRIPHOSPHATE,True,False,False,GuideToPharmacology,2024.3,"[inhibitor, blocker]",0.109672,0.213398,0.513931,1.0
6029,NCBIGENE:21,hgnc:33,ABCA3,IUPHAR.LIGAND:459,iuphar.ligand:459,MRE 3008F20,False,False,False,GuideToPharmacology,2024.3,"[agonist, inhibitor]",0.134546,0.906941,0.148351,1.0
6205,NCBIGENE:277,hgnc:475,AMY1B,IUPHAR.LIGAND:9494,iuphar.ligand:9494,CYM-5541,False,False,False,GuideToPharmacology,2024.3,"[modulator, agonist]",3.070812,3.627763,0.846475,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75512,NCBIGENE:13,hgnc:17,AADAC,IUPHAR.LIGAND:289,iuphar.ligand:289,AC-42,False,False,False,GuideToPharmacology,2024.3,"[agonist, modulator]",0.501960,3.627763,0.138366,1.0
79395,NCBIGENE:748,hgnc:10484,RYR2,IUPHAR.LIGAND:707,iuphar.ligand:707,CA2+,False,False,False,GuideToPharmacology,2024.3,"[activator, inhibitor]",0.118645,0.090694,1.308189,1.0
79397,NCBIGENE:748,hgnc:10484,RYR2,IUPHAR.LIGAND:4303,iuphar.ligand:4303,RYANODINE,False,False,False,GuideToPharmacology,2024.3,"[activator, blocker]",1.186450,0.906941,1.308189,1.0
80315,NCBIGENE:359,hgnc:634,AQP2,IUPHAR.LIGAND:2036,iuphar.ligand:2036,BIM 23056,False,False,False,GuideToPharmacology,2024.3,"[agonist, inhibitor]",0.362526,1.209254,0.299793,1.0


Then expand each list into multiple rows using pandas `explode`.

In [20]:
df = df.explode("interaction_types", ignore_index=True)

## -> log
print(f"{df.shape[0]} rows after expanding rows with multiple interaction_type values")

81813 rows after expanding rows with multiple interaction_type values
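
The split-then-explode step on a toy frame, as a sketch (two made-up rows, columns trimmed). Each list element gets its own row, and the other columns are repeated. That matches the 70 extra rows above, since each piped value in the counts has exactly two parts (81743 + 70 = 81813).

```python
import pandas as pd

toy = pd.DataFrame({
    "gene_name": ["RYR2", "BRAF"],
    "interaction_types": ["activator|inhibitor", "inhibitor"],
})

## a single-character pattern is treated as a literal string by str.split
toy["interaction_types"] = toy["interaction_types"].str.split("|")

## one row per list element; ignore_index renumbers the rows 0..n-1
exploded = toy.explode("interaction_types", ignore_index=True)

print(exploded)
```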


## EDA interaction_source_db_name

aka "underlying sources". 
Each row has a single value. 

Compared to the [website's "Interaction Sources"](https://dgidb.org/browse/sources), the data has all of the listed sources except DrugBank, plus NCI. 
* For DrugBank, the website doesn't actually show any "interaction claim" counts
* Not clear to me what "NCI" is. National Cancer Institute?

In [21]:
df["interaction_source_db_name"].nunique()

df["interaction_source_db_name"].value_counts().sort_index()

21

interaction_source_db_name
CGI                                  345
CIViC                               1013
CKB-CORE                            1777
COSMIC                                34
CancerCommons                        106
ChEMBL                             12292
ClearityFoundationBiomarkers         160
ClearityFoundationClinicalTrial      240
DTC                                23876
DoCM                                  72
FDA                                  402
GuideToPharmacology                16425
MyCancerGenome                       811
MyCancerGenomeClinicalTrial          315
NCI                                 6076
OncoKB                               146
PharmGKB                            5248
TALC                                 564
TEND                                2242
TTD                                 5110
TdgClinicalTrial                    4559
Name: count, dtype: int64

In [22]:
df[df["interaction_source_db_name"] == "NCI"]

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
8,ICAM3,hgnc:5346,ICAM3,GRANULOCYTE MACROPHAGE COLONY-STIMULATING FACTOR,ncit:C1288,RECOMBINANT GRANULOCYTE-MACROPHAGE COLONY-STIMULATING FACTOR,False,False,True,NCI,14-September-2017,~NULL,13.050950,3.627763,1.798760,2.0
9,ICAM3,hgnc:5346,ICAM3,PMA,ncit:C866,TETRADECANOYLPHORBOL ACETATE,False,False,True,NCI,14-September-2017,~NULL,0.283716,0.078864,1.798760,2.0
10,ICAM3,hgnc:5346,ICAM3,GM-CSF,iuphar.ligand:4942,GM-CSF,False,False,False,NCI,14-September-2017,~NULL,0.815684,0.226735,1.798760,2.0
11,ICAM3,hgnc:5346,ICAM3,VITAMIN D,rxcui:11253,VITAMIN D,True,False,False,NCI,14-September-2017,~NULL,1.003919,0.279059,1.798760,2.0
12,ICAM3,hgnc:5346,ICAM3,INTERFERONS,ncit:C584,RECOMBINANT INTERFERON,False,False,True,NCI,14-September-2017,~NULL,0.334640,0.093020,1.798760,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81808,PTPRC,hgnc:9666,PTPRC,PROTEIN KINASE INHIBITOR,ncit:C1404,PROTEIN KINASE INHIBITOR,False,False,False,NCI,14-September-2017,~NULL,1.160084,0.725553,0.799449,2.0
81809,PTPRC,hgnc:9666,PTPRC,OESTRADIOL,rxcui:24395,ESTRADIOL VALERATE,True,False,True,NCI,14-September-2017,~NULL,0.123413,0.077186,0.799449,2.0
81810,PTPRC,hgnc:9666,PTPRC,PREDNISONE,rxcui:8640,PREDNISONE,True,False,True,NCI,14-September-2017,~NULL,0.128898,0.080617,0.799449,2.0
81811,PTPRC,hgnc:9666,PTPRC,HEPARAN SULFATE,rxcui:2603494,HEPARAN SULFATE,False,False,True,NCI,14-September-2017,~NULL,0.483369,0.302314,0.799449,2.0


## FILTER Namespaces

In [23]:
## genes

df["gene_prefix"] = [i.split(":")[0] for i in df["gene_concept_id"]]

df["gene_prefix"].value_counts()

## the one ensembl-prefixed row: an ENSG (Ensembl gene) ID
df[df["gene_prefix"] == "ensembl"]

gene_prefix
hgnc        80711
ncbigene     1101
ensembl         1
Name: count, dtype: int64

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score,gene_prefix
30355,TARP,ensembl:ENSG00000289746,TARP,TESTOSTERONE,rxcui:10379,TESTOSTERONE,True,False,True,NCI,14-September-2017,~NULL,4.015677,0.09302,14.390078,3.0,ensembl


In [24]:
## drugs

df["drug_prefix"] = [i.split(":")[0] for i in df["drug_concept_id"]]

df["drug_prefix"].value_counts()

df[df["drug_prefix"] == "chemidplus"]

drug_prefix
rxcui             34787
ncit              15105
iuphar.ligand     15055
chembl            12985
drugbank           3724
wikidata            112
hemonc               30
drugsatfda.nda       12
chemidplus            3
Name: count, dtype: int64

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score,gene_prefix,drug_prefix
5553,CYP7B1,hgnc:2652,CYP7B1,DDD,chemidplus:72-54-8,TDE,False,False,False,NCI,14-September-2017,~NULL,6.960506,3.627763,0.959339,2.0,hgnc,chemidplus
31677,NOREPINEPHRINE TRANSPORTER,hgnc:11048,SLC6A2,Hypericum,chemidplus:68917-49-7,ST. JOHN'S WORT,False,False,False,TTD,2020.06.01,~NULL,0.483369,3.627763,0.133241,1.0,hgnc,chemidplus
71911,ESTROGEN-RELATED RECEPTOR-ALPHA,hgnc:3471,ESRRA,Dexamethasone palmitate,chemidplus:14899-36-6,DEXAMETHASONE PALMITATE,False,False,False,TTD,2020.06.01,~NULL,8.700633,3.627763,2.398346,1.0,hgnc,chemidplus


**[NodeNorm](https://nodenormalization-sri.renci.org/1.5/get_curie_prefixes?semantic_type=biolink%3ANamedThing) can't handle:**
* ncit (for chemicals)
* iuphar.ligand
* wikidata
* hemonc
* drugsatfda.nda
* chemidplus

Notes:
* Assuming chembl = CHEMBL.COMPOUND. Based on some spot-checks, it seems to work some of the time?
* Noticed some names are also not NameRes-able, ex: `COMPOUND 5 [PMID: 29579323]`

In [25]:
## NodeNorm doesn't recognize

PREFIXES_TO_DROP = [
    ## an unescaped "." in a regex matches any character, so escape it
    "ncit",
    "iuphar\\.ligand",
    "wikidata",
    "hemonc",
    "drugsatfda\\.nda",
    "chemidplus",
]

In [26]:
## set case=False so it isn't case-sensitive on matches!

n_before = df.shape[0]
df = df[~df.drug_prefix.str.contains('|'.join(PREFIXES_TO_DROP), case=False)].copy()

## -> log
print(f"{df.shape[0]} rows ({df.shape[0] / n_before:.1%}) after filtering out drug namespaces that can't be NodeNormed")

51496 rows (62.9%) after filtering out drug namespaces that can't be NodeNormed
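
An alternative sketch for the same filter: since `drug_prefix` already holds the whole prefix, an exact match with `isin` avoids regex escaping and substring matches. The toy IDs below are only for illustration, and `prefixes_to_drop` is a hypothetical plain-string version of `PREFIXES_TO_DROP`.

```python
import pandas as pd

toy = pd.DataFrame({
    "drug_concept_id": [
        "rxcui:318",
        "iuphar.ligand:707",
        "ncit:C1288",
        "chembl:CHEMBL25",
        "IUPHAR.LIGAND:459",
    ],
})
toy["drug_prefix"] = [i.split(":")[0] for i in toy["drug_concept_id"]]

## plain strings, no escaping needed for exact matching
prefixes_to_drop = {
    "ncit",
    "iuphar.ligand",
    "wikidata",
    "hemonc",
    "drugsatfda.nda",
    "chemidplus",
}

## lowercase first so the match isn't case-sensitive
keep_mask = ~toy["drug_prefix"].str.lower().isin(prefixes_to_drop)
kept = toy[keep_mask].copy()

print(kept)
```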


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 51496 entries, 1 to 81811
Data columns (total 18 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   gene_claim_name                51491 non-null  object 
 1   gene_concept_id                51496 non-null  object 
 2   gene_name                      51496 non-null  object 
 3   drug_claim_name                51496 non-null  object 
 4   drug_concept_id                51496 non-null  object 
 5   drug_name                      51496 non-null  object 
 6   drug_is_approved               51496 non-null  object 
 7   drug_is_immunotherapy          51496 non-null  object 
 8   drug_is_antineoplastic         51496 non-null  object 
 9   interaction_source_db_name     51496 non-null  object 
 10  interaction_source_db_version  51496 non-null  object 
 11  interaction_types              51496 non-null  object 
 12  interaction_score              51496 non-null  floa

## EDA drug_is_immunotherapy

Matt wanted to know if there was a relationship between this flag and the interaction_types

In [28]:
df["drug_is_immunotherapy"].value_counts()

## this looks like more than the "immune"-related interaction_types

drug_is_immunotherapy
False    50311
True      1185
Name: count, dtype: int64

In [29]:
df[df["drug_is_immunotherapy"] == True]["interaction_types"].value_counts(dropna=False).sort_index()

interaction_types
activator          1
agonist           25
antibody          55
binder             3
blocker            1
immunotherapy      8
inhibitor         97
modulator          2
~NULL            993
Name: count, dtype: int64

In [30]:
df[df["drug_is_immunotherapy"] == False]["interaction_types"].value_counts(dropna=False).sort_index()

interaction_types
activator                      190
agonist                       1936
antibody                        35
antisense oligonucleotide        2
binder                         158
blocker                       1121
cleavage                        75
inhibitor                     7269
inverse agonist                 32
modulator                      309
negative modulator              86
other/unknown                  166
positive modulator             923
potentiator                     42
vaccine                          3
~NULL                        37964
Name: count, dtype: int64

**Summary**

`drug_is_immunotherapy` is mostly unrelated to `interaction_types`:
* `True`: only a small subset (63 of 1185 rows) have an immune-related interaction_type (antibody, immunotherapy)
* `False`: some immune-related interaction_type rows are here! (antibody, vaccine)
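
The two `value_counts` cells above could also be viewed side by side with `pd.crosstab`. A sketch on made-up rows (not the real counts):

```python
import pandas as pd

toy = pd.DataFrame({
    "drug_is_immunotherapy": [True, True, True, False, False, False],
    "interaction_types": ["antibody", "inhibitor", "~NULL",
                          "antibody", "vaccine", "~NULL"],
})

## rows = interaction type, columns = flag value, cells = row counts
table = pd.crosstab(toy["interaction_types"], toy["drug_is_immunotherapy"])

print(table)
```

On the real `df`, the same call would put the `True` and `False` counts for each interaction_type in one table.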

## EDA Merging by gene-drug pairs

In [None]:
## first drop some columns - not needed OR values won't make sense after merge
## makes merge faster

# cols_not_needed = [
#     "gene_claim_name", 
#     "drug_claim_name",
#     "interaction_source_db_version", 
#     "gene_prefix",
#     "drug_prefix",
# ]

# merge1 = df.drop(columns=cols_not_needed).copy()

In [None]:
## merge: takes ~10s to run

# cols_define_1 = ["gene_concept_id", "drug_concept_id"]

# merge1 = merge1.groupby(by=cols_define_1).agg(set).reset_index().copy()

In [None]:
# merge1.shape[0]

# merge1

In [None]:
## all single values - so same for a gene-drug pair

# ## tied to gene
# merge1[merge1["gene_name"].map(len) > 1].shape[0]

# ## tied to drug, basically a node attribute/annotation
# merge1[merge1["drug_name"].map(len) > 1].shape[0]
# merge1[merge1["drug_is_approved"].map(len) > 1].shape[0]
# merge1[merge1["drug_is_immunotherapy"].map(len) > 1].shape[0]
# merge1[merge1["drug_is_antineoplastic"].map(len) > 1].shape[0]

In [None]:
## all single values - so same for a gene-drug pair

# ## scores for gene-drug pair
# merge1[merge1["interaction_score"].map(len) > 1].shape[0]
# merge1[merge1["evidence_score"].map(len) > 1].shape[0]

# ## scores for only drug or only gene
# merge1[merge1["drug_specificity_score"].map(len) > 1].shape[0]
# merge1[merge1["gene_specificity_score"].map(len) > 1].shape[0]

In [None]:
## end up with multiple values 

# merge1[merge1["interaction_types"].map(len) > 1].shape[0]

# merge1[merge1["interaction_source_db_name"].map(len) > 1].shape[0]
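
What the `groupby(...).agg(set)` merge in the cells above does, sketched on a toy frame. The rows are made up for illustration (IDs borrowed from earlier outputs, sources assigned arbitrarily).

```python
import pandas as pd

toy = pd.DataFrame({
    "gene_concept_id": ["hgnc:1097", "hgnc:1097", "hgnc:33"],
    "drug_concept_id": ["rxcui:1147220", "rxcui:1147220", "rxcui:318"],
    "interaction_types": ["inhibitor", "~NULL", "agonist"],
    "interaction_source_db_name": ["ChEMBL", "NCI", "TTD"],
})

## one row per gene-drug pair; every other column becomes a set of its values
merged = (
    toy.groupby(["gene_concept_id", "drug_concept_id"])
       .agg(set)
       .reset_index()
)

## pairs that ended up with more than one interaction type
multi_type_rows = merged[merged["interaction_types"].map(len) > 1]

print(merged)
```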

### interaction_types

**6.7%** of the dataset has **multiple interaction_types from the merge**. 

Are there sets that don't make sense?

In [None]:
# merge1_multi_types = merge1[merge1["interaction_types"].map(len) > 1].copy()

# print(f"{merge1_multi_types.shape[0] / merge1.shape[0]:.2%}")

# merge1_multi_types

In [None]:
# SORT values first - so the sets are ideally unique beforehand

# merge1_multi_types["interaction_types"] = [",".join(sorted(i)) for i in merge1_multi_types["interaction_types"]]

In [None]:
# merge1_multi_types["interaction_types"].nunique()

In [None]:
# merge1_multi_types["interaction_types"].value_counts(normalize=True)

Top are (cover >82%):
* `inhibitor,~NULL` (60%)
* `agonist,~NULL` (>14%)
* `positive modulator,~NULL` (>8%)

In [None]:
# merge1_multi_types["interaction_types"].value_counts().sort_index()

**MERGED DATA ISSUES**

(fully reviewed)

**Opposing**
* **agonist,inhibitor**

**Makes sense, but tricky to merge? (1 example of each kind)**
* activator,potentiator,~NULL
* **agonist,antibody**
* antibody,binder,~NULL
* antibody,blocker,immunotherapy,inhibitor,~NULL
* antibody,inhibitor
* blocker,inhibitor
* **cleavage,inhibitor,~NULL**
* inhibitor,inverse agonist,~NULL

In [None]:
# merge1_multi_types[merge1_multi_types["interaction_types"] == "agonist,antibody"]

In [None]:
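## share of multi-type sets that include a "plain" type (~NULL or other/unknown)
## needs merge1_multi_types from the commented-out cells above
# rows_with_plain = merge1_multi_types[merge1_multi_types["interaction_types"].str.contains("NULL|other")].copy()

# print(f"{rows_with_plain.shape[0] / merge1_multi_types.shape[0]:.2%}")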

### interaction_source_db_name

**10.8%** of data has multiple sources after merge

EDA only

In [None]:
# merge1_multi_sources = merge1[merge1["interaction_source_db_name"].map(len) > 1].copy()

# print(f"{merge1_multi_sources.shape[0] / merge1.shape[0]:.2%}")

# merge1_multi_sources

In [None]:
# SORT values first - so the sets are ideally unique beforehand

# merge1_multi_sources["interaction_source_db_name"] = [",".join(sorted(i)) for i in merge1_multi_sources["interaction_source_db_name"]]

In [None]:
# merge1_multi_sources["interaction_source_db_name"].nunique()

In [None]:
# merge1_multi_sources["interaction_source_db_name"].value_counts()[0:10]

## EDA Merging by gene-drug-interaction_type sets

In [None]:
# ## first drop some columns - not needed OR values won't make sense after merge
# ## makes merge faster

# ## and order for readability
# merge2 = df[["drug_concept_id", "drug_name", 
#              "gene_concept_id", "gene_name",
#              "interaction_types", "interaction_source_db_name",
#              "interaction_score", "evidence_score"
#             ]].copy()

In [None]:
# ## merge: takes ~5s to run

# cols_define_edge_2 = ["gene_concept_id", "drug_concept_id", "interaction_types"]

# merge2 = merge2.groupby(by=cols_define_edge_2).agg(set).reset_index().copy()

In [None]:
# merge2.shape[0]

# merge2

In [None]:
# ## how many rows with "duplicate" drug-gene pairs

# merge2_samepair = merge2[merge2.duplicated(subset=["gene_concept_id", "drug_concept_id"], keep=False)].copy()

# merge2_samepair.shape[0]
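
How `duplicated(..., keep=False)` picks out the "same pair" rows, sketched on a toy frame with invented IDs. `keep=False` flags every member of a duplicate group, not only the repeats.

```python
import pandas as pd

toy = pd.DataFrame({
    "gene_concept_id": ["hgnc:1", "hgnc:1", "hgnc:2"],
    "drug_concept_id": ["rxcui:10", "rxcui:10", "rxcui:20"],
    "interaction_types": ["inhibitor", "~NULL", "agonist"],
})

## rows whose gene-drug pair appears more than once
same_pair = toy[toy.duplicated(subset=["gene_concept_id", "drug_concept_id"],
                               keep=False)]

print(same_pair)
```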

In [None]:
# merge2_samepair[40:61]

Based on spot-check, don't see cases where the `~NULL` has the same sources as its "dup" row...

`[44:47]` is a set of 3

`.loc[56285:56287]` is a set of 3 with other/unknown and NULL

In [None]:
# ## looking for "other/unknown stuff"
# merge2_samepair[merge2_samepair["interaction_types"].str.contains("other")]

In [None]:
# merge2_samepair.loc[56285:56287]

## Experimental

In [31]:
df.drop("drug_prefix", axis=1, inplace=True)

df.columns.to_list()

['gene_claim_name',
 'gene_concept_id',
 'gene_name',
 'drug_claim_name',
 'drug_concept_id',
 'drug_name',
 'drug_is_approved',
 'drug_is_immunotherapy',
 'drug_is_antineoplastic',
 'interaction_source_db_name',
 'interaction_source_db_version',
 'interaction_types',
 'interaction_score',
 'drug_specificity_score',
 'gene_specificity_score',
 'evidence_score',
 'gene_prefix']

In [32]:
## only including necessary columns 
##   + drug/gene names for readability/comparison with website
## dropping columns makes merge faster

## order for readability
experiment_setup = df[["drug_concept_id", "drug_name", 
                       "gene_concept_id", "gene_name",
                       "interaction_types", "interaction_source_db_name",
                       "interaction_score", "evidence_score"
                      ]].copy()

**FIRST: map the interaction_types that mean "plain interacts_with" to one term -> "~PLAIN_INTERACTS"**

**Make a new column, mod_type, and use it going forward** (but in the notebook, keep the original column so it can be checked):
* If "other/unknown" or `~NULL`, put `~PLAIN_INTERACTS`
* Else keep the original value

Reasoning:
* When both values occur for the same drug-gene pair, their rows can then be merged
  * All other interaction_types map to different predicate/qualifier-sets, so they don't need merging later (confirmed 11/18 night)
* Makes the following logic easier: detecting "only one edge" or "plain interacts_with edge"
* The other option is to map everything to predicate/qualifier-sets, then do the logic on those. But I think that's more complicated to check…
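A minimal sketch of the mapping-then-merge idea on made-up rows (the `d1`/`g1` IDs and the `toy` frame are invented for illustration, not taken from the data). It uses an exact-match `isin` test in place of the list comprehension, which should give the same result under the assumption that each `interaction_types` value is a single term at this point.

```python
import pandas as pd

# Toy rows (made up): one drug-gene pair reported three different ways.
toy = pd.DataFrame(
    {
        "drug_concept_id": ["d1", "d1", "d1"],
        "gene_concept_id": ["g1", "g1", "g1"],
        "interaction_types": ["inhibitor", "~NULL", "other/unknown"],
        "interaction_source_db_name": ["ChEMBL", "DTC", "MyCancerGenome"],
    }
)

plain_interact_types = {"other/unknown", "~NULL"}

# Exact-match membership test: keep the original value unless it is a "plain" type.
is_plain = toy["interaction_types"].isin(plain_interact_types)
toy["mod_type"] = toy["interaction_types"].where(~is_plain, "~PLAIN_INTERACTS")

# Rows sharing drug, gene and mod_type collapse into one row; values become sets.
merged = (
    toy.groupby(["drug_concept_id", "gene_concept_id", "mod_type"])
    .agg({"interaction_types": set, "interaction_source_db_name": set})
    .reset_index()
)
print(merged)
```

The `~NULL` and `other/unknown` rows end up as one `~PLAIN_INTERACTS` row carrying both sources, while the `inhibitor` row stays separate.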

In [33]:
plain_interact_types = {"other/unknown", "~NULL"}

experiment_setup["mod_type"] = ["~PLAIN_INTERACTS" if i in plain_interact_types else i for i in experiment_setup["interaction_types"]]

In [34]:
## EDA: check how it looks

## regular: don't use "other" because "immunOTHERapy"
experiment_setup[experiment_setup["interaction_types"].str.contains("unknown|NULL")].shape[0]

## modded
experiment_setup[experiment_setup["mod_type"] == "~PLAIN_INTERACTS"].shape[0]

39123

39123
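The "immunOTHERapy" caveat can be reproduced on a small made-up Series. This is only an illustration of why a substring pattern containing "other" over-matches; the values are a hand-picked subset of the interaction types seen in this notebook.

```python
import pandas as pd

types = pd.Series(["immunotherapy", "other/unknown", "~NULL", "inhibitor"])

# Substring/regex match: "other" also occurs inside "immunotherapy".
loose = types.str.contains("NULL|other")

# Exact membership: only the two "plain" values match.
strict = types.isin({"other/unknown", "~NULL"})

print(loose.tolist())   # [True, True, True, False]
print(strict.tolist())  # [False, True, True, False]
```

Exact matching with `isin` only works while each cell holds a single term; for comma-joined strings a substring pattern such as "NULL|unknown" is still needed.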

In [35]:
## more checks how it looks
experiment_setup[experiment_setup["mod_type"] == "~PLAIN_INTERACTS"]["interaction_types"].unique()

experiment_setup[~ (experiment_setup["mod_type"] == "~PLAIN_INTERACTS")]["interaction_types"].unique()

array(['~NULL', 'other/unknown'], dtype=object)

array(['inhibitor', 'cleavage', 'blocker', 'modulator', 'binder',
       'agonist', 'activator', 'positive modulator', 'antibody',
       'negative modulator', 'potentiator', 'inverse agonist',
       'immunotherapy', 'antisense oligonucleotide', 'vaccine'],
      dtype=object)

In [36]:
## group by drug ID, gene ID, mod_type combo

## saving in diff variable in case I want to still work with no-grouping-done later
# experiment_mergeMain = experiment_setup.copy()

## merge: takes ~3s to run
cols_define_edge_3 = ["drug_concept_id", "gene_concept_id", "mod_type"]

## columns that can have multiple values within a group: collect into a set
## otherwise, the value is always the same within a group (from EDA), so just keep the first
experiment_setup = experiment_setup.groupby(by=cols_define_edge_3).agg(
    {
        "interaction_types": set, 
        "interaction_source_db_name": set,
        "interaction_score": 'first',
        "evidence_score": 'first',
        "drug_name": 'first',
        "gene_name": 'first',

    }
).reset_index().copy()

In [37]:
experiment_setup.shape[0]

45015

In [38]:
## check what rows look like with both original ~PLAIN_INTERACTS values

experiment_setup[experiment_setup["gene_concept_id"] == "hgnc:9565"]

Unnamed: 0,drug_concept_id,gene_concept_id,mod_type,interaction_types,interaction_source_db_name,interaction_score,evidence_score,drug_name,gene_name
15852,drugbank:DB16741,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},0.196255,1.0,BORTEZOMIB D-MANNITOL,PSMD7
20276,rxcui:1302966,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},0.497179,4.0,CARFILZOMIB,PSMD7
20277,rxcui:1302966,hgnc:9565,~PLAIN_INTERACTS,"{~NULL, other/unknown}","{DTC, MyCancerGenome}",0.497179,4.0,CARFILZOMIB,PSMD7
24392,rxcui:1723734,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},0.191223,1.0,IXAZOMIB CITRATE,PSMD7
24434,rxcui:1723735,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},0.177564,1.0,IXAZOMIB,PSMD7
35602,rxcui:358258,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},0.35095,4.0,BORTEZOMIB,PSMD7
35603,rxcui:358258,hgnc:9565,~PLAIN_INTERACTS,"{~NULL, other/unknown}","{DTC, MyCancerGenome}",0.35095,4.0,BORTEZOMIB,PSMD7


In [39]:
experiment_setup[experiment_setup["mod_type"] == "~PLAIN_INTERACTS"].shape[0]

33563

### sources logic

Create a mapping df: group-by "drug-gene" pair -> get a set of all sources. 

Then make a new sources column **mod_sources** (but in notebook, keep original so can check):
* If "~PLAIN_INTERACTS": use the mapping df's value for drug-gene pair (all sources)
* Else: keep original value
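A rough sketch of the same logic on made-up rows, assuming the source column already holds sets (as it does after the merge above). It swaps the multi-index mapping df for a plain dict keyed by the (drug, gene) tuple, which is a substitution for illustration and not what the cells below do; the IDs are invented.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "drug_concept_id": ["d1", "d1", "d2"],
        "gene_concept_id": ["g1", "g1", "g1"],
        "mod_type": ["inhibitor", "~PLAIN_INTERACTS", "inhibitor"],
        "interaction_source_db_name": [
            {"ChEMBL"},
            {"DTC", "MyCancerGenome"},
            {"ChEMBL"},
        ],
    }
)

pair = ["drug_concept_id", "gene_concept_id"]

# Mapping: (drug, gene) -> union of every source set reported for that pair.
all_sources = {}
for key, grp in toy.groupby(pair):
    all_sources[key] = set().union(*grp["interaction_source_db_name"])

# Plain edges get all sources for the pair; other edges keep their own sources.
toy["mod_sources"] = [
    all_sources[(row.drug_concept_id, row.gene_concept_id)]
    if row.mod_type == "~PLAIN_INTERACTS"
    else row.interaction_source_db_name
    for row in toy.itertuples()
]
print(toy[["mod_type", "mod_sources"]])
```

A dict lookup avoids the `.loc[...]` call per row, which may matter on the full table, though that has not been timed here.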

In [40]:
experiment_sources = experiment_setup.copy()

drug_gene_pair = ["drug_concept_id", "gene_concept_id"]

In [41]:
## need to merge sets!
## and keep multi-index, annoying but seems easier to retrieve values later
experiment_sources = experiment_sources.groupby(by=drug_gene_pair).agg(
    {"interaction_source_db_name":lambda x: set.union(*x)})

In [42]:
experiment_sources

Unnamed: 0_level_0,Unnamed: 1_level_0,interaction_source_db_name
drug_concept_id,gene_concept_id,Unnamed: 2_level_1
chembl:CHEMBL101168,hgnc:2596,{DTC}
chembl:CHEMBL101168,hgnc:2637,{DTC}
chembl:CHEMBL10118,hgnc:30863,{DTC}
chembl:CHEMBL101510,hgnc:2625,{DTC}
chembl:CHEMBL101804,hgnc:12841,{DTC}
...,...,...
rxcui:9997,hgnc:644,{DTC}
rxcui:9997,hgnc:7876,{PharmGKB}
rxcui:9997,hgnc:7968,{PharmGKB}
rxcui:9997,hgnc:7979,"{GuideToPharmacology, TEND, ChEMBL, TTD, TdgClinicalTrial}"


In [43]:
## trying out how to get specific value in this df using multi-index
## loc returns a series, turn to list and get first/only item

experiment_sources.loc["rxcui:1302966", "hgnc:9565"]

experiment_sources.loc["rxcui:1302966", "hgnc:9565"].to_list()[0]

interaction_source_db_name    {DTC, ChEMBL, MyCancerGenome}
Name: (rxcui:1302966, hgnc:9565), dtype: object

{'ChEMBL', 'DTC', 'MyCancerGenome'}

In [44]:
## add new column to experiment df
## in specific spot

mod_sources = [
    experiment_sources.loc[x.drug_concept_id, x.gene_concept_id].to_list()[0] if x.mod_type == "~PLAIN_INTERACTS"
    else x.interaction_source_db_name
    for x in experiment_setup[["drug_concept_id", "gene_concept_id", "mod_type", "interaction_source_db_name"]].itertuples()
]

experiment_setup.insert(5, "mod_sources", mod_sources)

In [45]:
## check what rows look like with original ~PLAIN_INTERACTS values

experiment_setup[experiment_setup["gene_concept_id"] == "hgnc:9565"]

experiment_setup[experiment_setup.duplicated(subset=["drug_concept_id", "gene_concept_id"], keep=False)][60:70]

Unnamed: 0,drug_concept_id,gene_concept_id,mod_type,interaction_types,interaction_source_db_name,mod_sources,interaction_score,evidence_score,drug_name,gene_name
15852,drugbank:DB16741,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.196255,1.0,BORTEZOMIB D-MANNITOL,PSMD7
20276,rxcui:1302966,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.497179,4.0,CARFILZOMIB,PSMD7
20277,rxcui:1302966,hgnc:9565,~PLAIN_INTERACTS,"{~NULL, other/unknown}","{DTC, MyCancerGenome}","{DTC, ChEMBL, MyCancerGenome}",0.497179,4.0,CARFILZOMIB,PSMD7
24392,rxcui:1723734,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.191223,1.0,IXAZOMIB CITRATE,PSMD7
24434,rxcui:1723735,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.177564,1.0,IXAZOMIB,PSMD7
35602,rxcui:358258,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.35095,4.0,BORTEZOMIB,PSMD7
35603,rxcui:358258,hgnc:9565,~PLAIN_INTERACTS,"{~NULL, other/unknown}","{DTC, MyCancerGenome}","{DTC, ChEMBL, MyCancerGenome}",0.35095,4.0,BORTEZOMIB,PSMD7


Unnamed: 0,drug_concept_id,gene_concept_id,mod_type,interaction_types,interaction_source_db_name,mod_sources,interaction_score,evidence_score,drug_name,gene_name
4576,chembl:CHEMBL2109391,hgnc:3613,modulator,{modulator},{ChEMBL},{ChEMBL},15.66114,3.0,MDX-447,FCGR1A
4577,chembl:CHEMBL2109391,hgnc:3613,~PLAIN_INTERACTS,{~NULL},{NCI},"{ChEMBL, NCI}",15.66114,3.0,MDX-447,FCGR1A
4580,chembl:CHEMBL2109398,hgnc:3387,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},10.44076,2.0,KB-004,EPHA3
4581,chembl:CHEMBL2109398,hgnc:3387,~PLAIN_INTERACTS,{~NULL},{TTD},"{ChEMBL, TTD}",10.44076,2.0,KB-004,EPHA3
4583,chembl:CHEMBL2109401,hgnc:3431,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},2.007838,2.0,AV-203,ERBB3
4584,chembl:CHEMBL2109401,hgnc:3431,~PLAIN_INTERACTS,{~NULL},{TTD},"{ChEMBL, TTD}",2.007838,2.0,AV-203,ERBB3
4585,chembl:CHEMBL2109402,hgnc:3431,antibody,{antibody},{MyCancerGenome},{MyCancerGenome},4.015677,4.0,MM-121,ERBB3
4586,chembl:CHEMBL2109402,hgnc:3431,inhibitor,{inhibitor},"{ChEMBL, ClearityFoundationClinicalTrial}","{ChEMBL, ClearityFoundationClinicalTrial}",4.015677,4.0,MM-121,ERBB3
4587,chembl:CHEMBL2109402,hgnc:3431,~PLAIN_INTERACTS,{~NULL},{TTD},"{ChEMBL, TTD, ClearityFoundationClinicalTrial, MyCancerGenome}",4.015677,4.0,MM-121,ERBB3
4596,chembl:CHEMBL2109435,hgnc:4445,binder,{binder},{ChEMBL},{ChEMBL},26.101899,2.0,HUA33,GPA33


### scores logic

~In notebook, creating new df rather than working on old to make easier to adjust~

Only care about logic for removing scores (setting NA)

Group by "drug-gene" pair:
* If the pair has > 1 row (i.e. > 1 mod_type): 
  * If the row's mod_type isn't "~PLAIN_INTERACTS": remove its scores
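One possible vectorized version of this rule, shown on made-up rows. It is a sketch, not the approach used in the cell below: it computes the group size with `transform("size")` and blanks scores with a boolean mask. As with the rule as written, a pair with several rows and no `~PLAIN_INTERACTS` row would lose the scores on all of its rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "drug_concept_id": ["d1", "d1", "d2"],
        "gene_concept_id": ["g1", "g1", "g1"],
        "mod_type": ["inhibitor", "~PLAIN_INTERACTS", "inhibitor"],
        "interaction_score": [0.5, 0.5, 0.2],
        "evidence_score": [4.0, 4.0, 1.0],
    }
)

pair = ["drug_concept_id", "gene_concept_id"]
score_cols = ["interaction_score", "evidence_score"]

# Number of rows sharing each row's drug-gene pair, aligned to the original index.
rows_in_pair = toy.groupby(pair)["mod_type"].transform("size")

# Remove scores only where the pair has several rows and this row is not the plain edge.
drop_scores = (rows_in_pair > 1) & (toy["mod_type"] != "~PLAIN_INTERACTS")
toy.loc[drop_scores, score_cols] = np.nan

print(toy)
```

Only the `inhibitor` row of the `d1`/`g1` pair loses its scores; the single-row `d2`/`g1` pair is untouched.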

In [46]:
# experiment_scores = experiment_mergeMain.copy()

# drug_gene_pair = ["drug_concept_id", "gene_concept_id"]

In [47]:
## iterate through a group-by, takes ~5s

grp = experiment_setup.groupby(by=drug_gene_pair)

for name, group in grp:
    if group.shape[0] > 1:
        for idx,row in group.iterrows():
            if row.mod_type != "~PLAIN_INTERACTS":
                experiment_setup.at[idx, "interaction_score"] = pd.NA
                experiment_setup.at[idx, "evidence_score"] = pd.NA

In [48]:
## check what rows look like
## some scores are now NA! 2,589 of 45,015 rows (42,426 still have scores)

experiment_setup.info()

print("\n")
print(experiment_setup["interaction_score"].isna().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45015 entries, 0 to 45014
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   drug_concept_id             45015 non-null  object 
 1   gene_concept_id             45015 non-null  object 
 2   mod_type                    45015 non-null  object 
 3   interaction_types           45015 non-null  object 
 4   interaction_source_db_name  45015 non-null  object 
 5   mod_sources                 45015 non-null  object 
 6   interaction_score           42426 non-null  float64
 7   evidence_score              42426 non-null  float64
 8   drug_name                   45015 non-null  object 
 9   gene_name                   45015 non-null  object 
dtypes: float64(2), object(8)
memory usage: 3.4+ MB


2589


In [49]:
## check what rows look like

experiment_setup[experiment_setup["gene_concept_id"] == "hgnc:9565"]

experiment_setup[experiment_setup.duplicated(subset=drug_gene_pair, keep=False)][60:70]

Unnamed: 0,drug_concept_id,gene_concept_id,mod_type,interaction_types,interaction_source_db_name,mod_sources,interaction_score,evidence_score,drug_name,gene_name
15852,drugbank:DB16741,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.196255,1.0,BORTEZOMIB D-MANNITOL,PSMD7
20276,rxcui:1302966,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},,,CARFILZOMIB,PSMD7
20277,rxcui:1302966,hgnc:9565,~PLAIN_INTERACTS,"{~NULL, other/unknown}","{DTC, MyCancerGenome}","{DTC, ChEMBL, MyCancerGenome}",0.497179,4.0,CARFILZOMIB,PSMD7
24392,rxcui:1723734,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.191223,1.0,IXAZOMIB CITRATE,PSMD7
24434,rxcui:1723735,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.177564,1.0,IXAZOMIB,PSMD7
35602,rxcui:358258,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},,,BORTEZOMIB,PSMD7
35603,rxcui:358258,hgnc:9565,~PLAIN_INTERACTS,"{~NULL, other/unknown}","{DTC, MyCancerGenome}","{DTC, ChEMBL, MyCancerGenome}",0.35095,4.0,BORTEZOMIB,PSMD7


Unnamed: 0,drug_concept_id,gene_concept_id,mod_type,interaction_types,interaction_source_db_name,mod_sources,interaction_score,evidence_score,drug_name,gene_name
4576,chembl:CHEMBL2109391,hgnc:3613,modulator,{modulator},{ChEMBL},{ChEMBL},,,MDX-447,FCGR1A
4577,chembl:CHEMBL2109391,hgnc:3613,~PLAIN_INTERACTS,{~NULL},{NCI},"{ChEMBL, NCI}",15.66114,3.0,MDX-447,FCGR1A
4580,chembl:CHEMBL2109398,hgnc:3387,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},,,KB-004,EPHA3
4581,chembl:CHEMBL2109398,hgnc:3387,~PLAIN_INTERACTS,{~NULL},{TTD},"{ChEMBL, TTD}",10.44076,2.0,KB-004,EPHA3
4583,chembl:CHEMBL2109401,hgnc:3431,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},,,AV-203,ERBB3
4584,chembl:CHEMBL2109401,hgnc:3431,~PLAIN_INTERACTS,{~NULL},{TTD},"{ChEMBL, TTD}",2.007838,2.0,AV-203,ERBB3
4585,chembl:CHEMBL2109402,hgnc:3431,antibody,{antibody},{MyCancerGenome},{MyCancerGenome},,,MM-121,ERBB3
4586,chembl:CHEMBL2109402,hgnc:3431,inhibitor,{inhibitor},"{ChEMBL, ClearityFoundationClinicalTrial}","{ChEMBL, ClearityFoundationClinicalTrial}",,,MM-121,ERBB3
4587,chembl:CHEMBL2109402,hgnc:3431,~PLAIN_INTERACTS,{~NULL},{TTD},"{ChEMBL, TTD, ClearityFoundationClinicalTrial, MyCancerGenome}",4.015677,4.0,MM-121,ERBB3
4596,chembl:CHEMBL2109435,hgnc:4445,binder,{binder},{ChEMBL},{ChEMBL},,,HUA33,GPA33


## For comparing to pipeline output

In [50]:
## count after adding extra edges: each row with an augmented mod_type yields one extra edge

augmented = ["agonist", "antibody", "blocker", "inhibitor", "inverse agonist"]

experiment_setup.shape[0] + \
experiment_setup[experiment_setup.mod_type.str.contains('|'.join(augmented), case=False)].shape[0]

54719

In [51]:
experiment_setup[experiment_setup["drug_concept_id"] == "chembl:CHEMBL101168"]

Unnamed: 0,drug_concept_id,gene_concept_id,mod_type,interaction_types,interaction_source_db_name,mod_sources,interaction_score,evidence_score,drug_name,gene_name
0,chembl:CHEMBL101168,hgnc:2596,~PLAIN_INTERACTS,{~NULL},{DTC},{DTC},0.044017,1.0,CHEMBL:CHEMBL101168,CYP1A2
1,chembl:CHEMBL101168,hgnc:2637,~PLAIN_INTERACTS,{~NULL},{DTC},{DTC},0.029197,1.0,CHEMBL:CHEMBL101168,CYP3A4


In [52]:
experiment_setup.head()

Unnamed: 0,drug_concept_id,gene_concept_id,mod_type,interaction_types,interaction_source_db_name,mod_sources,interaction_score,evidence_score,drug_name,gene_name
0,chembl:CHEMBL101168,hgnc:2596,~PLAIN_INTERACTS,{~NULL},{DTC},{DTC},0.044017,1.0,CHEMBL:CHEMBL101168,CYP1A2
1,chembl:CHEMBL101168,hgnc:2637,~PLAIN_INTERACTS,{~NULL},{DTC},{DTC},0.029197,1.0,CHEMBL:CHEMBL101168,CYP3A4
2,chembl:CHEMBL10118,hgnc:30863,~PLAIN_INTERACTS,{~NULL},{DTC},{DTC},104.407597,2.0,FAMOXADONE,UQCR10
3,chembl:CHEMBL101510,hgnc:2625,~PLAIN_INTERACTS,{~NULL},{DTC},{DTC},0.087885,1.0,CHEMBL:CHEMBL101510,CYP2D6
4,chembl:CHEMBL101804,hgnc:12841,~PLAIN_INTERACTS,{~NULL},{DTC},{DTC},0.25098,1.0,CHEMBL:CHEMBL101804,YES1


In [53]:
experiment_setup[(experiment_setup["mod_sources"].map(len) > 1)].shape[0]

4743

In [54]:
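## can also look at how many rows have scores (matches count in info)

## number of edges with more than one final supporting source
## (TALC, TEND, TdgClinicalTrial are not counted as sources here;
##  rows with an augmented mod_type count twice, for the extra edge)
## same as pipeline before normalization! (grep)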

n_final_multi = 0

for row in experiment_setup[["mod_sources", "mod_type"]].itertuples():
    temp = 0
    for i in row.mod_sources:
        if i not in ["TALC", "TEND", "TdgClinicalTrial"]:
            temp += 1
    if temp > 1:
        n_final_multi += 1
        if row.mod_type in augmented:
            n_final_multi += 1
        
print(n_final_multi)

3091


In [55]:
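## number of edges with publications (any source in TALC, TEND, TdgClinicalTrial)
## rows with an augmented mod_type count twice, for the extra edge
## same as pipeline before normalization! (grep)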

n_final_pubs = 0

for row in experiment_setup[["mod_sources", "mod_type"]].itertuples():
    has_pub = False
    
    for i in row.mod_sources:
        if i in ["TALC", "TEND", "TdgClinicalTrial"]:
            has_pub = True
    if has_pub:
        n_final_pubs += 1
        if row.mod_type in augmented:
            n_final_pubs += 1

print(n_final_pubs)

4150
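The two counting loops above share one shape: test a row's sources, then count it once, or twice if its mod_type is augmented. A hedged sketch of that shared shape follows. The helper names (`count_edges`, `has_multiple_non_pub_sources`, `has_publication_source`) and the toy rows are hypothetical, and treating TALC, TEND and TdgClinicalTrial as the "publication" sources is taken from how the loops use them.

```python
PUBLICATION_SOURCES = {"TALC", "TEND", "TdgClinicalTrial"}
AUGMENTED = {"agonist", "antibody", "blocker", "inhibitor", "inverse agonist"}


def count_edges(rows, keep):
    """Count rows where keep(sources) is true; augmented mod_types count twice."""
    total = 0
    for sources, mod_type in rows:
        if keep(sources):
            total += 2 if mod_type in AUGMENTED else 1
    return total


def has_multiple_non_pub_sources(sources):
    # More than one source remains once the publication sources are set aside.
    return len(set(sources) - PUBLICATION_SOURCES) > 1


def has_publication_source(sources):
    # At least one of the publication sources is present.
    return bool(set(sources) & PUBLICATION_SOURCES)


# Made-up (mod_sources, mod_type) pairs for illustration.
rows = [
    ({"ChEMBL", "DTC"}, "inhibitor"),
    ({"TEND", "ChEMBL"}, "~PLAIN_INTERACTS"),
    ({"TTD"}, "binder"),
]

n_multi = count_edges(rows, has_multiple_non_pub_sources)
n_pubs = count_edges(rows, has_publication_source)
print(n_multi, n_pubs)  # 2 1
```

On the real table the rows could be supplied as `zip(experiment_setup["mod_sources"], experiment_setup["mod_type"])`.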


In [56]:
## drugbank: 78 fail NodeNorm -> small proportion

experiment_setup[experiment_setup["drug_concept_id"].str.contains("drugbank")]["drug_concept_id"].nunique()

78 / experiment_setup[experiment_setup["drug_concept_id"].str.contains("drugbank")]["drug_concept_id"].nunique()

980

0.07959183673469387

In [57]:
## ncbigene: all fail NodeNorm...odd 

experiment_setup[experiment_setup["gene_concept_id"].str.contains("ncbigene")]["gene_concept_id"].nunique()

49

In [58]:
## chembl: 22 fail NodeNorm -> very small proportion

experiment_setup[experiment_setup["drug_concept_id"].str.contains("chembl:")]["drug_concept_id"].nunique()

22 / experiment_setup[experiment_setup["drug_concept_id"].str.contains("chembl:")]["drug_concept_id"].nunique()

4406

0.004993191103041307