<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#FILTER-Missing-values" data-toc-modified-id="FILTER-Missing-values-1">FILTER Missing values</a></span></li><li><span><a href="#interaction_types" data-toc-modified-id="interaction_types-2">interaction_types</a></span><ul class="toc-item"><li><span><a href="#EDA-&quot;|-delimited&quot;-values" data-toc-modified-id="EDA-&quot;|-delimited&quot;-values-2.1">EDA "|-delimited" values</a></span></li><li><span><a href="#Split-&quot;|-delimited&quot;" data-toc-modified-id="Split-&quot;|-delimited&quot;-2.2">Split "|-delimited"</a></span></li></ul></li><li><span><a href="#EDA-interaction_source_db_name" data-toc-modified-id="EDA-interaction_source_db_name-3">EDA interaction_source_db_name</a></span></li><li><span><a href="#FILTER-Namespaces" data-toc-modified-id="FILTER-Namespaces-4">FILTER Namespaces</a></span></li><li><span><a href="#EDA-drug_is_immunotherapy" data-toc-modified-id="EDA-drug_is_immunotherapy-5">EDA drug_is_immunotherapy</a></span></li><li><span><a href="#EDA-Merging-by-gene-drug-pairs" data-toc-modified-id="EDA-Merging-by-gene-drug-pairs-6">EDA Merging by gene-drug pairs</a></span><ul class="toc-item"><li><span><a href="#interaction_types" data-toc-modified-id="interaction_types-6.1">interaction_types</a></span></li><li><span><a href="#interaction_source_db_name" data-toc-modified-id="interaction_source_db_name-6.2">interaction_source_db_name</a></span></li></ul></li><li><span><a href="#EDA-Merging-by-gene-drug-interaction_type-sets" data-toc-modified-id="EDA-Merging-by-gene-drug-interaction_type-sets-7">EDA Merging by gene-drug-interaction_type sets</a></span></li><li><span><a href="#Experimental" data-toc-modified-id="Experimental-8">Experimental</a></span><ul class="toc-item"><li><span><a href="#sources-logic" data-toc-modified-id="sources-logic-8.1">sources logic</a></span></li><li><span><a href="#scores-logic" data-toc-modified-id="scores-logic-8.2">scores 
logic</a></span></li></ul></li></ul></div>

# DGIdb notebook: EDA

In [1]:
## for notebook only 

## allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## for printing
from pprint import pprint

## for loading locally-stored files
import pathlib

In [2]:
import pandas as pd

## NOT for parser: for viewing df only
pd.options.display.max_columns = None
pd.set_option('display.max_colwidth', None)

<div class="alert alert-block alert-danger">

This notebook was originally written using the **2024-Dec** interactions.tsv from https://dgidb.org/downloads. Its "last modified" date is Fri, **06 Dec 2024** 15:20:44 GMT, according to the headers returned from a HEAD request to [its download link](https://dgidb.org/data/2024-Dec/interactions.tsv).
    
<br>
    
I didn't use the "latest" interactions.tsv because it is older: its "last modified" date is Mon, **10 Jun 2024** 16:04:52 GMT, according to the headers returned from a HEAD request to [its download link](https://dgidb.org/data/latest/interactions.tsv).

The 2024-Dec file has two header lines showing the DGIdb semantic version **5.0.7** and "data" version (month-year date). 

```
# Data version: Dec-2024
# DGIdb version: v.5.0.7
```
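
As a minimal sketch, the two header lines could be read programmatically before loading the table, for example to log which release was parsed. The helper name `read_header_comments` and the toy input below are assumptions for illustration, not part of the original notebook.

```python
import io


def read_header_comments(handle, prefix="#"):
    """Collect 'key: value' pairs from the leading comment lines of a text stream."""
    meta = {}
    for line in handle:
        if not line.startswith(prefix):
            break  # first non-comment line is the column header row
        key, sep, value = line.lstrip(prefix).strip().partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


# Toy stand-in for the top of the 2024-Dec file (hypothetical, shortened columns)
sample = io.StringIO(
    "# Data version: Dec-2024\n"
    "# DGIdb version: v.5.0.7\n"
    "gene_claim_name\tgene_concept_id\n"
)
header_meta = read_header_comments(sample)
print(header_meta)
```

With the real file, the same function could be called on `open(interactions_path)` instead of the `StringIO` object.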

In [3]:
## path to raw resource file
interactions_path = pathlib.Path.home().joinpath("Desktop", 
                                                 "DGIdb_files",
                                                 "2024-Dec-interactions.tsv")

In [4]:
## load file in pandas directly

## should automatically treat first line after comments as header
df = pd.read_table(interactions_path, comment="#")

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98920 entries, 0 to 98919
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   gene_claim_name                98915 non-null  object 
 1   gene_concept_id                90442 non-null  object 
 2   gene_name                      90442 non-null  object 
 3   drug_claim_name                98918 non-null  object 
 4   drug_concept_id                88396 non-null  object 
 5   drug_name                      88396 non-null  object 
 6   drug_is_approved               88395 non-null  object 
 7   drug_is_immunotherapy          88395 non-null  object 
 8   drug_is_antineoplastic         88395 non-null  object 
 9   interaction_source_db_name     98917 non-null  object 
 10  interaction_source_db_version  98917 non-null  object 
 11  interaction_types              35632 non-null  object 
 12  interaction_score              81740 non-null 

## FILTER Missing values

Review:
* **All of the columns have some missing values**
* the entity ID columns `gene_concept_id` and `drug_concept_id` are missing thousands of values - even though DGIdb did an [entity-resolving/common-ID-assignment step](https://dgidb.org/about/overview/grouping). 
* `interaction_source_db_name` is missing a few values (3 rows), which is unexpected
* BUT it is expected that `interaction_types` is missing values, since not all sources assign specific relationship types (e.g., inhibitor). 

In [6]:
## EDA - going through columns looking at missing values

# df[df["gene_concept_id"].isna()]

# df[df["drug_concept_id"].isna()]

## these rows look weird - unexpected drug namespace or no drug ID, lots of NAs
df[df["interaction_source_db_name"].isna()]

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
8301,NCBIGENE:362,hgnc:638,AQP5,IUPHAR.LIGAND:2129,iuphar.ligand:2129,N&,,,,,,,,,,
12204,NCBIGENE:2721,ncbigene:2721,GLC,,,,,,,,,,,,,
12205,NCBIGENE:2721,ncbigene:2721,GLC,,,,,,,,,,,,,
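
One possible explanation for rows like these (an assumption, not verified against the raw file): `comment="#"` tells pandas to ignore everything after a `#` anywhere in a line, not only lines that start with `#`. A field containing `#` would therefore cut its row short and leave the remaining columns as NA. The sketch below reproduces this on toy data; the column names and the value `N&#945;-example` are made up for illustration.

```python
import io

import pandas as pd

# Toy file: two comment lines, a header row, and one row with "#" inside a field
raw = (
    "# Data version: toy\n"
    "# DGIdb version: toy\n"
    "gene_name\tdrug_name\tsource\n"
    "AQP5\tN&#945;-example\tSomeSource\n"
)

# comment="#" drops the comment lines but also cuts the data row at its "#"
truncated = pd.read_table(io.StringIO(raw), comment="#")

# Alternative: count the leading comment lines and skip exactly those
n_comment_lines = 0
for line in io.StringIO(raw):
    if not line.startswith("#"):
        break
    n_comment_lines += 1
intact = pd.read_table(io.StringIO(raw), skiprows=n_comment_lines)

print(truncated)
print(intact)
```

If this is what happened, loading with `skiprows` would keep those rows whole. Checking the raw lines for these gene-drug pairs would confirm or rule it out.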


<div class="alert alert-block alert-success">

**DECISION**: drop rows with NA in `gene_concept_id` OR `drug_concept_id` OR `interaction_source_db_name`. 

In [7]:
## logs

## number of rows left after dropping NAs
## dropna default (how="any"): drop a row if any column in `subset` is NA
have_values = df.dropna(subset=["gene_concept_id", 
                                "drug_concept_id",
                                "interaction_source_db_name"]).shape[0]
print(f"{have_values} rows have both entity IDs and an underlying source" + 
      f": {have_values / df.shape[0]:.1%}\n")


## gene IDs
have_gene_id = df["gene_concept_id"].notna().sum()
print(f"{have_gene_id} rows have gene IDs: {have_gene_id / df.shape[0]:.1%}")

## drug IDs
have_drug_id = df["drug_concept_id"].notna().sum()
print(f"{have_drug_id} rows have drug IDs: {have_drug_id / df.shape[0]:.1%}")

## sources
have_source = df["interaction_source_db_name"].notna().sum()
print(f"{have_source} rows have an underlying source: {have_source / df.shape[0]:.3%}")

81740 rows have both entity IDs and an underlying source: 82.6%

90442 rows have gene IDs: 91.4%
88396 rows have drug IDs: 89.4%
98917 rows have an underlying source: 99.997%


In [8]:
## save set of interaction_types before filtering, to compare to after

starting_interact_types = set(df["interaction_types"].unique())
len(starting_interact_types)

31

In [9]:
## drop rows, check
## dropna default (how="any"): drop a row if any column in `subset` is NA
df = df.dropna(subset=["gene_concept_id", 
                       "drug_concept_id",
                       "interaction_source_db_name"], ignore_index=True).copy()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81740 entries, 0 to 81739
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   gene_claim_name                81735 non-null  object 
 1   gene_concept_id                81740 non-null  object 
 2   gene_name                      81740 non-null  object 
 3   drug_claim_name                81740 non-null  object 
 4   drug_concept_id                81740 non-null  object 
 5   drug_name                      81740 non-null  object 
 6   drug_is_approved               81740 non-null  object 
 7   drug_is_immunotherapy          81740 non-null  object 
 8   drug_is_antineoplastic         81740 non-null  object 
 9   interaction_source_db_name     81740 non-null  object 
 10  interaction_source_db_version  81740 non-null  object 
 11  interaction_types              30405 non-null  object 
 12  interaction_score              81740 non-null 

## interaction_types

[2025-11-05 with 2024-Dec data]

Filtering didn't remove any interaction_types values: the set of unique values is the same before and after. 

In [10]:
## compare before-after

starting_interact_types == set(df["interaction_types"].unique())

True
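
A caveat on this comparison: both sets contain NaN, and set equality only treats NaN values as equal when they are the same object. It worked here, but a sketch of a comparison that doesn't rely on that is below. The helper name `value_set` and the toy series are assumptions for illustration.

```python
import pandas as pd


def value_set(series):
    """Return (set of non-null unique values, whether any NA is present)."""
    return set(series.dropna().unique()), bool(series.isna().any())


# Toy data: same values, but the missing entries are different objects
before = pd.Series(["inhibitor", "agonist", None, "inhibitor"])
after = pd.Series(["agonist", "inhibitor", float("nan")])

same_values = value_set(before) == value_set(after)
print(same_values)

# Why the plain set comparison can be fragile: distinct NaN objects never compare equal
nan_sets_equal = {float("nan")} == {float("nan")}
print(nan_sets_equal)
```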

In [11]:
df["interaction_types"].nunique(dropna=False)

df["interaction_types"].value_counts(dropna=False).sort_index()

31

interaction_types
activator                      584
activator|blocker                4
activator|inhibitor              2
agonist                       5882
agonist|inhibitor               22
agonist|modulator                2
antibody                       298
antibody|immunotherapy           4
antisense oligonucleotide        4
binder                         258
blocker                       1807
blocker|activator                2
blocker|inhibitor                2
cleavage                        83
immunotherapy                    3
immunotherapy|antibody           4
inhibitor                    18692
inhibitor|activator              2
inhibitor|agonist               14
inhibitor|blocker                3
inhibitor|modulator              5
inverse agonist                 36
modulator                     1241
modulator|agonist                1
modulator|inhibitor              3
negative modulator             133
other/unknown                  219
positive modulator            1013
po

In [12]:
## replace NA with "~NULL"
## makes the next steps with this column easier; "~" sorts after letters, so "~NULL" lands at the end of sorted output

df["interaction_types"] = df["interaction_types"].fillna("~NULL")

In [13]:
df["interaction_types"].value_counts().sort_index()

interaction_types
activator                      584
activator|blocker                4
activator|inhibitor              2
agonist                       5882
agonist|inhibitor               22
agonist|modulator                2
antibody                       298
antibody|immunotherapy           4
antisense oligonucleotide        4
binder                         258
blocker                       1807
blocker|activator                2
blocker|inhibitor                2
cleavage                        83
immunotherapy                    3
immunotherapy|antibody           4
inhibitor                    18692
inhibitor|activator              2
inhibitor|agonist               14
inhibitor|blocker                3
inhibitor|modulator              5
inverse agonist                 36
modulator                     1241
modulator|agonist                1
modulator|inhibitor              3
negative modulator             133
other/unknown                  219
positive modulator            1013
po

### EDA "|-delimited" values

Some rows hold multiple values, "|"-delimited. "|" is a special regex character (alternation), so it needs escaping when used as a pattern.

These rows are only a small proportion of the dataset.
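
A small sketch of why the escaping matters, on toy data (the series below is made up for illustration). An unescaped `"|"` is a regex alternation between two empty patterns, so it matches every string.

```python
import pandas as pd

types = pd.Series(["inhibitor|blocker", "inhibitor", "~NULL"])

# Unescaped: alternation of two empty patterns, matches everything
unescaped = types.str.contains("|")

# Escaped: matches a literal pipe character
escaped = types.str.contains(r"\|")

# Same result without regex at all
literal = types.str.contains("|", regex=False)

print(unescaped.tolist())
print(escaped.tolist())
print(literal.tolist())
```

Either the escaped pattern or `regex=False` selects only the piped rows.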

In [14]:
df_piped_types = df[df["interaction_types"].str.contains("\\|")].copy()

print(f"{df_piped_types.shape[0]} rows with |-delimited interaction_types" + 
      f": {df_piped_types.shape[0] / df.shape[0]:.3%}")

df_piped_types.head()

70 rows with |-delimited interaction_types: 0.086%


Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
649,NCBIGENE:496,hgnc:820,ATP4B,IUPHAR.LIGAND:10038,iuphar.ligand:10038,COMPOUND 5 [PMID: 29579323],False,False,False,GuideToPharmacology,2024.3,inhibitor|blocker,1.864421,3.627763,0.513931,1.0
654,NCBIGENE:496,hgnc:820,ATP4B,IUPHAR.LIGAND:10039,iuphar.ligand:10039,COMPOUND 6 [PMID: 29579323],False,False,False,GuideToPharmacology,2024.3,inhibitor|blocker,1.864421,3.627763,0.513931,1.0
657,NCBIGENE:496,hgnc:820,ATP4B,IUPHAR.LIGAND:1713,rxcui:318,ADENOSINE TRIPHOSPHATE,True,False,False,GuideToPharmacology,2024.3,inhibitor|blocker,0.109672,0.213398,0.513931,1.0
6029,NCBIGENE:21,hgnc:33,ABCA3,IUPHAR.LIGAND:459,iuphar.ligand:459,MRE 3008F20,False,False,False,GuideToPharmacology,2024.3,agonist|inhibitor,0.134546,0.906941,0.148351,1.0
6205,NCBIGENE:277,hgnc:475,AMY1B,IUPHAR.LIGAND:9494,iuphar.ligand:9494,CYM-5541,False,False,False,GuideToPharmacology,2024.3,modulator|agonist,3.070812,3.627763,0.846475,1.0


In [15]:
df_piped_types["interaction_types"].nunique()

df_piped_types["interaction_types"].value_counts().sort_index()

14

interaction_types
activator|blocker          4
activator|inhibitor        2
agonist|inhibitor         22
agonist|modulator          2
antibody|immunotherapy     4
blocker|activator          2
blocker|inhibitor          2
immunotherapy|antibody     4
inhibitor|activator        2
inhibitor|agonist         14
inhibitor|blocker          3
inhibitor|modulator        5
modulator|agonist          1
modulator|inhibitor        3
Name: count, dtype: int64

**REVIEW**

**Opposing:**
* activator|blocker
* blocker|activator
* activator|inhibitor
* inhibitor|activator
* agonist|inhibitor
* inhibitor|agonist

**Close:**
* blocker|inhibitor
* inhibitor|blocker

**One is kinda a subclass of the other?**
* agonist|modulator
* modulator|agonist
* inhibitor|modulator
* modulator|inhibitor

**Identical?**
* antibody|immunotherapy
* immunotherapy|antibody
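
If the order within a piped value carries no meaning (an assumption, since DGIdb's ordering rule isn't documented here), the flipped pairs above could be collapsed by sorting the parts. A minimal sketch, with `canonical_types` as a hypothetical helper:

```python
def canonical_types(value, sep="|"):
    """Sort and de-duplicate the parts of a delimited value."""
    return sep.join(sorted(set(value.split(sep))))


pairs = [
    "activator|blocker",
    "blocker|activator",
    "immunotherapy|antibody",
    "inhibitor",
]
canonical = [canonical_types(p) for p in pairs]
print(canonical)
```

This only normalizes the labels. It says nothing about whether the underlying rows describe the same interaction.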

In [16]:
## are the "flipped order" types the same data? esp when same row count?
## NO - based on the few pairs I reviewed

df_piped_types[df_piped_types["interaction_types"] == "activator|inhibitor"]
df_piped_types[df_piped_types["interaction_types"] == "inhibitor|activator"]

# df_piped_types[df_piped_types["interaction_types"] == "agonist|modulator"]
# df_piped_types[df_piped_types["interaction_types"] == "modulator|agonist"]

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
9727,NCBIGENE:749,hgnc:10485,RYR3,IUPHAR.LIGAND:707,iuphar.ligand:707,CA2+,False,False,False,GuideToPharmacology,2024.3,activator|inhibitor,0.163137,0.090694,1.79876,1.0
79392,NCBIGENE:748,hgnc:10484,RYR2,IUPHAR.LIGAND:707,iuphar.ligand:707,CA2+,False,False,False,GuideToPharmacology,2024.3,activator|inhibitor,0.118645,0.090694,1.308189,1.0


Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
24189,NCBIGENE:747,hgnc:1165,DAGLA,IUPHAR.LIGAND:707,iuphar.ligand:707,CA2+,False,False,False,GuideToPharmacology,2024.3,inhibitor|activator,0.130509,0.090694,1.439008,1.0
70567,BRAF,hgnc:1097,BRAF,VEMURAFENIB,rxcui:1147220,VEMURAFENIB,True,False,True,MyCancerGenomeClinicalTrial,30-Feburary-2014,inhibitor|activator,1.587278,0.098048,0.07195,225.0


In [17]:
## from only 4 resources, mostly GuideToPharmacology

df_piped_types["interaction_source_db_name"].value_counts()

interaction_source_db_name
GuideToPharmacology            60
MyCancerGenome                  8
MyCancerGenomeClinicalTrial     1
ChEMBL                          1
Name: count, dtype: int64

### Split "|-delimited"

In [18]:
df["interaction_types"] = df["interaction_types"].str.split("|")

In [19]:
## this is correct - the row count in df_piped_types was the same
df[df["interaction_types"].map(len) > 1].shape[0]

df[df["interaction_types"].map(len) > 1]

70

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
649,NCBIGENE:496,hgnc:820,ATP4B,IUPHAR.LIGAND:10038,iuphar.ligand:10038,COMPOUND 5 [PMID: 29579323],False,False,False,GuideToPharmacology,2024.3,"[inhibitor, blocker]",1.864421,3.627763,0.513931,1.0
654,NCBIGENE:496,hgnc:820,ATP4B,IUPHAR.LIGAND:10039,iuphar.ligand:10039,COMPOUND 6 [PMID: 29579323],False,False,False,GuideToPharmacology,2024.3,"[inhibitor, blocker]",1.864421,3.627763,0.513931,1.0
657,NCBIGENE:496,hgnc:820,ATP4B,IUPHAR.LIGAND:1713,rxcui:318,ADENOSINE TRIPHOSPHATE,True,False,False,GuideToPharmacology,2024.3,"[inhibitor, blocker]",0.109672,0.213398,0.513931,1.0
6029,NCBIGENE:21,hgnc:33,ABCA3,IUPHAR.LIGAND:459,iuphar.ligand:459,MRE 3008F20,False,False,False,GuideToPharmacology,2024.3,"[agonist, inhibitor]",0.134546,0.906941,0.148351,1.0
6205,NCBIGENE:277,hgnc:475,AMY1B,IUPHAR.LIGAND:9494,iuphar.ligand:9494,CYM-5541,False,False,False,GuideToPharmacology,2024.3,"[modulator, agonist]",3.070812,3.627763,0.846475,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75509,NCBIGENE:13,hgnc:17,AADAC,IUPHAR.LIGAND:289,iuphar.ligand:289,AC-42,False,False,False,GuideToPharmacology,2024.3,"[agonist, modulator]",0.501960,3.627763,0.138366,1.0
79392,NCBIGENE:748,hgnc:10484,RYR2,IUPHAR.LIGAND:707,iuphar.ligand:707,CA2+,False,False,False,GuideToPharmacology,2024.3,"[activator, inhibitor]",0.118645,0.090694,1.308189,1.0
79394,NCBIGENE:748,hgnc:10484,RYR2,IUPHAR.LIGAND:4303,iuphar.ligand:4303,RYANODINE,False,False,False,GuideToPharmacology,2024.3,"[activator, blocker]",1.186450,0.906941,1.308189,1.0
80312,NCBIGENE:359,hgnc:634,AQP2,IUPHAR.LIGAND:2036,iuphar.ligand:2036,BIM 23056,False,False,False,GuideToPharmacology,2024.3,"[agonist, inhibitor]",0.362526,1.209254,0.299793,1.0


Then expand each list into one row per value using pandas `explode`.

In [20]:
df = df.explode("interaction_types", ignore_index=True)

## -> log
print(f"{df.shape[0]} rows after expanding rows with multiple interaction_type values")

81810 rows after expanding rows with multiple interaction_type values
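
The row count is consistent with the earlier numbers: the 70 piped rows each held two values, so 81740 + 70 = 81810. The split-then-explode step and that sanity check can be sketched on toy data (the frame below is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "gene_concept_id": ["hgnc:1", "hgnc:2", "hgnc:3"],
    "interaction_types": ["inhibitor|blocker", "agonist", "~NULL"],
})
n_rows_before = len(toy)

# Each "|" adds exactly one extra row after exploding
n_extra = int(toy["interaction_types"].str.count(r"\|").sum())

toy["interaction_types"] = toy["interaction_types"].str.split("|")
exploded = toy.explode("interaction_types", ignore_index=True)

print(exploded)
print(len(exploded) == n_rows_before + n_extra)
```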


## EDA interaction_source_db_name

Also known as the "underlying sources". 
Every row has a single value. 

Compared to the [website's "Interaction Sources"](https://dgidb.org/browse/sources), the data has every source listed there except DrugBank, plus NCI, which the website doesn't list. 
* For DrugBank, the website doesn't show any "interaction claim" counts
* It's not clear to me what "NCI" is. National Cancer Institute?

In [21]:
df["interaction_source_db_name"].nunique()

df["interaction_source_db_name"].value_counts().sort_index()

21

interaction_source_db_name
CGI                                  345
CIViC                               1013
CKB-CORE                            1777
COSMIC                                34
CancerCommons                        106
ChEMBL                             12292
ClearityFoundationBiomarkers         160
ClearityFoundationClinicalTrial      240
DTC                                23876
DoCM                                  72
FDA                                  402
GuideToPharmacology                16422
MyCancerGenome                       811
MyCancerGenomeClinicalTrial          315
NCI                                 6076
OncoKB                               146
PharmGKB                            5248
TALC                                 564
TEND                                2242
TTD                                 5110
TdgClinicalTrial                    4559
Name: count, dtype: int64

In [22]:
df[df["interaction_source_db_name"] == "NCI"]

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
8,ICAM3,hgnc:5346,ICAM3,GRANULOCYTE MACROPHAGE COLONY-STIMULATING FACTOR,ncit:C1288,RECOMBINANT GRANULOCYTE-MACROPHAGE COLONY-STIMULATING FACTOR,False,False,True,NCI,14-September-2017,~NULL,13.050950,3.627763,1.798760,2.0
9,ICAM3,hgnc:5346,ICAM3,PMA,ncit:C866,TETRADECANOYLPHORBOL ACETATE,False,False,True,NCI,14-September-2017,~NULL,0.283716,0.078864,1.798760,2.0
10,ICAM3,hgnc:5346,ICAM3,GM-CSF,iuphar.ligand:4942,GM-CSF,False,False,False,NCI,14-September-2017,~NULL,0.815684,0.226735,1.798760,2.0
11,ICAM3,hgnc:5346,ICAM3,VITAMIN D,rxcui:11253,VITAMIN D,True,False,False,NCI,14-September-2017,~NULL,1.003919,0.279059,1.798760,2.0
12,ICAM3,hgnc:5346,ICAM3,INTERFERONS,ncit:C584,RECOMBINANT INTERFERON,False,False,True,NCI,14-September-2017,~NULL,0.334640,0.093020,1.798760,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81805,PTPRC,hgnc:9666,PTPRC,PROTEIN KINASE INHIBITOR,ncit:C1404,PROTEIN KINASE INHIBITOR,False,False,False,NCI,14-September-2017,~NULL,1.160084,0.725553,0.799449,2.0
81806,PTPRC,hgnc:9666,PTPRC,OESTRADIOL,rxcui:24395,ESTRADIOL VALERATE,True,False,True,NCI,14-September-2017,~NULL,0.123413,0.077186,0.799449,2.0
81807,PTPRC,hgnc:9666,PTPRC,PREDNISONE,rxcui:8640,PREDNISONE,True,False,True,NCI,14-September-2017,~NULL,0.128898,0.080617,0.799449,2.0
81808,PTPRC,hgnc:9666,PTPRC,HEPARAN SULFATE,rxcui:2603494,HEPARAN SULFATE,False,False,True,NCI,14-September-2017,~NULL,0.483369,0.302314,0.799449,2.0


## FILTER Namespaces

In [23]:
## genes

df["gene_prefix"] = [i.split(":")[0] for i in df["gene_concept_id"]]

df["gene_prefix"].value_counts()

## the one ensembl-prefixed ID is an ENSG (Ensembl gene) ID
df[df["gene_prefix"] == "ensembl"]

gene_prefix
hgnc        80710
ncbigene     1099
ensembl         1
Name: count, dtype: int64

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score,gene_prefix
30352,TARP,ensembl:ENSG00000289746,TARP,TESTOSTERONE,rxcui:10379,TESTOSTERONE,True,False,True,NCI,14-September-2017,~NULL,4.015677,0.09302,14.390078,3.0,ensembl


In [24]:
## drugs

df["drug_prefix"] = [i.split(":")[0] for i in df["drug_concept_id"]]

df["drug_prefix"].value_counts()

df[df["drug_prefix"] == "chemidplus"]

drug_prefix
rxcui             34787
ncit              15105
iuphar.ligand     15052
chembl            12985
drugbank           3724
wikidata            112
hemonc               30
drugsatfda.nda       12
chemidplus            3
Name: count, dtype: int64

Unnamed: 0,gene_claim_name,gene_concept_id,gene_name,drug_claim_name,drug_concept_id,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_source_db_version,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score,gene_prefix,drug_prefix
5553,CYP7B1,hgnc:2652,CYP7B1,DDD,chemidplus:72-54-8,TDE,False,False,False,NCI,14-September-2017,~NULL,6.960506,3.627763,0.959339,2.0,hgnc,chemidplus
31674,NOREPINEPHRINE TRANSPORTER,hgnc:11048,SLC6A2,Hypericum,chemidplus:68917-49-7,ST. JOHN'S WORT,False,False,False,TTD,2020.06.01,~NULL,0.483369,3.627763,0.133241,1.0,hgnc,chemidplus
71908,ESTROGEN-RELATED RECEPTOR-ALPHA,hgnc:3471,ESRRA,Dexamethasone palmitate,chemidplus:14899-36-6,DEXAMETHASONE PALMITATE,False,False,False,TTD,2020.06.01,~NULL,8.700633,3.627763,2.398346,1.0,hgnc,chemidplus


**[NodeNorm](https://nodenormalization-sri.renci.org/1.5/get_curie_prefixes?semantic_type=biolink%3ANamedThing) can't handle:**
* iuphar.ligand
* wikidata
* hemonc
* drugsatfda.nda
* chemidplus

Notes:
* Assuming chembl = CHEMBL.COMPOUND. Based on some spot-checks, it seems to work some of the time?
* Noticed some names are also not NameRes-able, e.g. `COMPOUND 5 [PMID: 29579323]`

In [25]:
## NodeNorm doesn't recognize

PREFIXES_TO_DROP = [
    ## "." is a regex wildcard (matches any character) unless escaped
    "iuphar\\.ligand",
    "wikidata",
    "hemonc",
    "drugsatfda\\.nda",
    "chemidplus",
]

In [26]:
## set case=False so it isn't case-sensitive on matches!

n_before = df.shape[0]
df = df[~df.drug_prefix.str.contains('|'.join(PREFIXES_TO_DROP), case=False)].copy()

## -> log
print(f"{df.shape[0]} rows ({df.shape[0] / n_before:.1%}) after filtering out drug prefixes that can't be NodeNormed")

66601 rows (81.4%) after filtering out drug prefixes that can't be NodeNormed
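
An alternative sketch that avoids regex escaping altogether: compare lowercased prefixes against a set with `isin`, which matches whole values only. The set name `UNSUPPORTED` and the toy frame are assumptions for illustration; the prefixes are the ones listed above.

```python
import pandas as pd

# Prefixes NodeNorm doesn't recognize (same list as PREFIXES_TO_DROP, unescaped)
UNSUPPORTED = {"iuphar.ligand", "wikidata", "hemonc", "drugsatfda.nda", "chemidplus"}

toy = pd.DataFrame({
    "drug_concept_id": [
        "rxcui:318",
        "iuphar.ligand:707",
        "chembl:CHEMBL107131",
        "chemidplus:72-54-8",
    ],
})

# Everything before the first ":" is the prefix
toy["drug_prefix"] = toy["drug_concept_id"].str.split(":", n=1).str[0]

# Exact, case-insensitive match on the whole prefix
kept = toy[~toy["drug_prefix"].str.lower().isin(UNSUPPORTED)].copy()
print(kept)
```

Unlike `str.contains`, this would not drop a prefix that merely contains one of the listed strings as a substring.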


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 66601 entries, 0 to 81809
Data columns (total 18 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   gene_claim_name                66596 non-null  object 
 1   gene_concept_id                66601 non-null  object 
 2   gene_name                      66601 non-null  object 
 3   drug_claim_name                66601 non-null  object 
 4   drug_concept_id                66601 non-null  object 
 5   drug_name                      66601 non-null  object 
 6   drug_is_approved               66601 non-null  object 
 7   drug_is_immunotherapy          66601 non-null  object 
 8   drug_is_antineoplastic         66601 non-null  object 
 9   interaction_source_db_name     66601 non-null  object 
 10  interaction_source_db_version  66601 non-null  object 
 11  interaction_types              66601 non-null  object 
 12  interaction_score              66601 non-null  floa

## EDA drug_is_immunotherapy

Matt wanted to know whether this flag is related to the interaction_types values.

In [28]:
df["drug_is_immunotherapy"].value_counts()

## more True rows than there are rows with "immune"-related interaction_types

drug_is_immunotherapy
False    65287
True      1314
Name: count, dtype: int64

In [29]:
df[df["drug_is_immunotherapy"] == True]["interaction_types"].value_counts(dropna=False).sort_index()

interaction_types
activator           1
agonist            36
antibody           83
binder              3
blocker             1
immunotherapy      10
inhibitor         128
modulator           2
~NULL            1050
Name: count, dtype: int64

In [30]:
df[df["drug_is_immunotherapy"] == False]["interaction_types"].value_counts(dropna=False).sort_index()

interaction_types
activator                      280
agonist                       2645
antibody                        49
antisense oligonucleotide        4
binder                         254
blocker                       1372
cleavage                        83
immunotherapy                    1
inhibitor                    11759
inverse agonist                 36
modulator                      745
negative modulator             132
other/unknown                  215
positive modulator            1010
potentiator                     51
vaccine                         31
~NULL                        46620
Name: count, dtype: int64

**Summary**

`drug_is_immunotherapy` doesn't track the immune-related `interaction_types`:
* `True`: only 93 of 1314 rows (7.1%) have an immune-related interaction_type (antibody: 83, immunotherapy: 10)
* `False`: 81 rows still have an immune-related interaction_type (antibody: 49, immunotherapy: 1, vaccine: 31)
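
The two `value_counts` calls can also be condensed into one cross-tabulation. A minimal sketch on toy data, where the grouping of antibody, immunotherapy and vaccine as "immune-related" is this notebook's informal judgment and the column name `type_group` is hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "drug_is_immunotherapy": [True, True, True, False, False, False],
    "interaction_types": ["antibody", "inhibitor", "~NULL", "antibody", "vaccine", "inhibitor"],
})

# Informal grouping used in the summary above
immune_types = {"antibody", "immunotherapy", "vaccine"}
toy["type_group"] = toy["interaction_types"].isin(immune_types).map(
    {True: "immune-related", False: "other"}
)

# Rows: flag value; columns: type group
table = pd.crosstab(toy["drug_is_immunotherapy"], toy["type_group"])
print(table)
```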

## EDA Merging by gene-drug pairs

In [31]:
## first drop some columns - not needed OR values won't make sense after merge
## makes merge faster

# cols_not_needed = [
#     "gene_claim_name", 
#     "drug_claim_name",
#     "interaction_source_db_version", 
#     "gene_prefix",
#     "drug_prefix",
# ]

# merge1 = df.drop(columns=cols_not_needed).copy()

In [32]:
## merge: takes ~10s to run

# cols_define_1 = ["gene_concept_id", "drug_concept_id"]

# merge1 = merge1.groupby(by=cols_define_1).agg(set).reset_index().copy()

In [33]:
# merge1.shape[0]

# merge1

55077

Unnamed: 0,gene_concept_id,drug_concept_id,gene_name,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
0,ensembl:ENSG00000289746,rxcui:10379,{TARP},{TESTOSTERONE},{True},{False},{True},{NCI},{~NULL},{4.015676812221963},{0.09301957340359},{14.390078221490326},{3.0}
1,hgnc:100,rxcui:1148138,{ASIC1},{ICATIBANT},{True},{False},{True},{GuideToPharmacology},{inhibitor},{0.6868920863011254},{1.8138816813700047},{0.3786862689865875},{1.0}
2,hgnc:100,rxcui:644,{ASIC1},{AMILORIDE},{True},{False},{True},{TTD},{~NULL},{0.0808108336824853},{0.2133978448670594},{0.3786862689865875},{1.0}
3,hgnc:10000,chembl:CHEMBL107131,{RGS4},{DIMETHYLPINOCEMBRIN},{False},{False},{False},{DTC},{~NULL},{0.1706006488852468},{1.20925445424667},{0.1410791982499051},{1.0}
4,hgnc:10000,chembl:CHEMBL1255835,{RGS4},{DIAMIDE},{False},{False},{False},{DTC},{~NULL},{0.0853003244426234},{0.604627227123335},{0.1410791982499051},{1.0}
...,...,...,...,...,...,...,...,...,...,...,...,...,...
55072,ncbigene:7,rxcui:8332,{A12M4},{PINDOLOL},{True},{False},{True},{GuideToPharmacology},"{inhibitor, agonist}",{0.0610570743378778},{0.4030848180822233},{0.151474507594635},{1.0}
55073,ncbigene:7,rxcui:8353,{A12M4},{PIRIBEDIL},{False},{False},{True},{GuideToPharmacology},{inhibitor},{0.0457928057534083},{0.3023136135616675},{0.151474507594635},{1.0}
55074,ncbigene:849,ncit:C152542,{CATHL1L},{TEGOPRAZAN},{False},{False},{True},{GuideToPharmacology},{inhibitor},{5.800422062098392},{1.20925445424667},{4.796692740496775},{1.0}
55075,ncbigene:849,rxcui:1294569,{CATHL1L},{ESOMEPRAZOLE SODIUM},{True},{False},{True},{GuideToPharmacology},{inhibitor},{2.900211031049196},{0.604627227123335},{4.796692740496775},{1.0}
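
For reference, a small sketch of what the `groupby(...).agg(set)` merge does and how the single-value checks in the next cells work. The toy rows reuse IDs seen above, but the pairing of sources and types per row is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "gene_concept_id": ["hgnc:1014", "hgnc:1014", "hgnc:100"],
    "drug_concept_id": ["rxcui:282388", "rxcui:282388", "rxcui:644"],
    "gene_name": ["BCR", "BCR", "ASIC1"],
    "interaction_source_db_name": ["ChEMBL", "TALC", "TTD"],
    "interaction_types": ["inhibitor", "~NULL", "~NULL"],
})

pair_cols = ["gene_concept_id", "drug_concept_id"]

# One row per gene-drug pair; every other column becomes a set of its values
merged = toy.groupby(by=pair_cols).agg(set).reset_index()

# Count, per column, how many pairs ended up with more than one value
multi_valued = {
    col: int((merged[col].map(len) > 1).sum())
    for col in merged.columns.difference(pair_cols)
}
print(merged)
print(multi_valued)
```

A count of 0 for a column means its value is determined by the gene-drug pair, which is what the checks below find for the name, flag and score columns.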


In [34]:
## all single values - so same for a gene-drug pair

# ## tied to gene
# merge1[merge1["gene_name"].map(len) > 1].shape[0]

# ## tied to drug, basically a node attribute/annotation
# merge1[merge1["drug_name"].map(len) > 1].shape[0]
# merge1[merge1["drug_is_approved"].map(len) > 1].shape[0]
# merge1[merge1["drug_is_immunotherapy"].map(len) > 1].shape[0]
# merge1[merge1["drug_is_antineoplastic"].map(len) > 1].shape[0]

0

0

0

0

0

In [35]:
## all single values - so same for a gene-drug pair

# ## scores for gene-drug pair
# merge1[merge1["interaction_score"].map(len) > 1].shape[0]
# merge1[merge1["evidence_score"].map(len) > 1].shape[0]

# ## scores for only drug or only gene
# merge1[merge1["drug_specificity_score"].map(len) > 1].shape[0]
# merge1[merge1["gene_specificity_score"].map(len) > 1].shape[0]

0

0

0

0

In [36]:
## end up with multiple values 

# merge1[merge1["interaction_types"].map(len) > 1].shape[0]

# merge1[merge1["interaction_source_db_name"].map(len) > 1].shape[0]

3692

5982

### interaction_types

**6.7%** of the merged gene-drug pairs (3692 of 55077) have **multiple interaction_types after the merge**. 

Are there sets that don't make sense?

In [37]:
# merge1_multi_types = merge1[merge1["interaction_types"].map(len) > 1].copy()

# print(f"{merge1_multi_types.shape[0] / merge1.shape[0]:.2%}")

# merge1_multi_types

6.70%


Unnamed: 0,gene_concept_id,drug_concept_id,gene_name,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
87,hgnc:10009,ncit:C152241,{RHD},{ROLEDUMAB},{False},{False},{True},"{ChEMBL, TTD}","{inhibitor, ~NULL}",{26.10189927944276},{3.6277633627400094},{3.5975195553725814},{2.0}
140,hgnc:1014,rxcui:1364347,{BCR},{PONATINIB},{True},{False},{True},"{FDA, ChEMBL, TALC, PharmGKB}","{inhibitor, ~NULL}",{0.3650615283838149},{0.139529360105385},{0.6540944646131966},{4.0}
141,hgnc:1014,rxcui:1546019,{BCR},{DASATINIB ANHYDROUS},{True},{False},{True},"{FDA, ChEMBL, TALC, PharmGKB}","{inhibitor, ~NULL}",{0.1248894702365682},{0.0477337284571054},{0.6540944646131966},{4.0}
149,hgnc:1014,rxcui:282388,{BCR},{IMATINIB},{True},{False},{True},"{ChEMBL, DTC, TALC, FDA, PharmGKB}","{inhibitor, ~NULL}",{0.2306986047425497},{0.0503856022602779},{0.6540944646131966},{7.0}
158,hgnc:1024,rxcui:10324,{BCYRN1P2},{TAMOXIFEN},{True},{False},{True},{GuideToPharmacology},"{inhibitor, agonist}",{0.0243374352255876},{0.055811744042154},{0.4360629764087977},{1.0}
...,...,...,...,...,...,...,...,...,...,...,...,...,...
54654,hgnc:9967,rxcui:357977,{RET},{SUNITINIB},{True},{False},{True},"{ChEMBL, TdgClinicalTrial, CGI, CKB-CORE, TEND, DoCM, MyCancerGenome, TALC, NCI, CIViC}","{inhibitor, ~NULL}",{0.3978948060890666},{0.0755784033904168},{0.1754887587986625},{30.0}
54656,hgnc:9967,rxcui:495881,{RET},{SORAFENIB},{True},{False},{True},"{ChEMBL, DTC, CKB-CORE, MyCancerGenome, NCI, PharmGKB, CIViC}","{inhibitor, ~NULL}",{0.0719670605795877},{0.0315457683716522},{0.1754887587986625},{13.0}
54943,ncbigene:6,rxcui:221153,{A12M3},{QUETIAPINE FUMARATE},{True},{False},{True},{GuideToPharmacology},"{inhibitor, agonist}",{0.0111546578117276},{0.1007712045205558},{0.1106929093960794},{1.0}
54972,ncbigene:6,rxcui:8332,{A12M3},{PINDOLOL},{True},{False},{True},{GuideToPharmacology},"{inhibitor, agonist}",{0.0446186312469107},{0.4030848180822233},{0.1106929093960794},{1.0}


In [38]:
# SORT values first - so the sets are ideally unique beforehand

# merge1_multi_types["interaction_types"] = [",".join(sorted(i)) for i in merge1_multi_types["interaction_types"]]

In [39]:
# merge1_multi_types["interaction_types"].nunique()

47

In [40]:
# merge1_multi_types["interaction_types"].value_counts(normalize=True)

interaction_types
inhibitor,~NULL                                   0.604009
agonist,~NULL                                     0.147616
positive modulator,~NULL                          0.089653
blocker,~NULL                                     0.029252
binder,~NULL                                      0.023294
inhibitor,other/unknown,~NULL                     0.022752
modulator,~NULL                                   0.019772
antibody,inhibitor,~NULL                          0.008938
other/unknown,~NULL                               0.007313
activator,~NULL                                   0.006501
agonist,inhibitor                                 0.004063
negative modulator,~NULL                          0.003792
blocker,inhibitor                                 0.003250
potentiator,~NULL                                 0.002709
cleavage,~NULL                                    0.002438
modulator,positive modulator,~NULL                0.002167
vaccine,~NULL                         

The top three combinations cover about 84% of these rows:
* `inhibitor,~NULL` (~60%)
* `agonist,~NULL` (~15%)
* `positive modulator,~NULL` (~9%)

In [41]:
# merge1_multi_types["interaction_types"].value_counts().sort_index()

interaction_types
activator,blocker                                    1
activator,inhibitor,~NULL                            1
activator,positive modulator,~NULL                   1
activator,potentiator,~NULL                          3
activator,~NULL                                     24
agonist,antibody                                     4
agonist,antibody,~NULL                               4
agonist,immunotherapy,~NULL                          1
agonist,inhibitor                                   15
agonist,inhibitor,~NULL                              2
agonist,modulator                                    2
agonist,~NULL                                      545
antibody,binder,~NULL                                3
antibody,blocker,immunotherapy,inhibitor,~NULL       1
antibody,immunotherapy                               1
antibody,immunotherapy,inhibitor                     1
antibody,immunotherapy,inhibitor,~NULL               4
antibody,immunotherapy,~NULL                   

**MERGED DATA ISSUES**

(fully reviewed)

**Opposing**
* **agonist,inhibitor**

**Makes sense, but tricky to merge? (1 example of each kind)**
* activator,potentiator,~NULL
* **agonist,antibody**
* antibody,binder,~NULL
* antibody,blocker,immunotherapy,inhibitor,~NULL
* antibody,inhibitor
* blocker,inhibitor
* **cleavage,inhibitor,~NULL**
* inhibitor,inverse agonist,~NULL
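
One way to flag the opposing combinations automatically is sketched below. `OPPOSING_PAIRS` and `has_opposing` are hypothetical names introduced here; the list contains only the one pair flagged in the review above and would need to be extended by hand. Splitting on commas (rather than substring matching) keeps `inverse agonist` from being mistaken for `agonist`.

```python
# Hypothetical list of contradictory type pairs; only the pair flagged
# in the manual review is included here (an assumption, extend as needed)
OPPOSING_PAIRS = [("agonist", "inhibitor")]


def has_opposing(type_string, opposing_pairs=OPPOSING_PAIRS):
    """Return True if the comma-joined type string contains an opposing pair."""
    types = set(type_string.split(","))
    return any(a in types and b in types for a, b in opposing_pairs)


combos = [
    "agonist,inhibitor",
    "agonist,inhibitor,~NULL",
    "inhibitor,inverse agonist,~NULL",
    "agonist,antibody",
]
flagged = [c for c in combos if has_opposing(c)]
print(flagged)
```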

In [42]:
# merge1_multi_types[merge1_multi_types["interaction_types"] == "agonist,antibody"]

Unnamed: 0,gene_concept_id,drug_concept_id,gene_name,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
5678,hgnc:11904,ncit:C1685,{TNFRSF10A},{DULANERMIN},{False},{True},{True},"{ChEMBL, TALC, MyCancerGenome}","agonist,antibody",{15.66113956766565},{1.8138816813700047},{2.878015644298065},{3.0}
5681,hgnc:11905,chembl:CHEMBL2109247,{TNFRSF10B},{LBY-135},{False},{True},{True},"{ChEMBL, TALC, MyCancerGenome}","agonist,antibody",{17.40126618629517},{3.6277633627400094},{1.5988975801655918},{3.0}
5684,hgnc:11905,ncit:C1685,{TNFRSF10B},{DULANERMIN},{False},{True},{True},"{ChEMBL, TALC, MyCancerGenome}","agonist,antibody",{8.700633093147589},{1.8138816813700047},{1.5988975801655918},{3.0}
5688,hgnc:11905,ncit:C71693,{TNFRSF10B},{APOMAB},{False},{True},{True},"{ChEMBL, MyCancerGenome}","agonist,antibody",{11.60084412419678},{3.6277633627400094},{1.5988975801655918},{2.0}


In [43]:
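## caveat: "other" also matches "immunOTHERapy", so this can slightly overcount; "unknown|NULL" is stricter
rows_with_plain = merge1_multi_types[merge1_multi_types["interaction_types"].str.contains("NULL|other")].copy()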


print(f"{rows_with_plain.shape[0] / merge1_multi_types.shape[0]:.2%}")

98.78%
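
A stricter version of this check is sketched below on toy strings. It assumes the combinations are comma-joined as in the sorting step above, and compares exact type names instead of using a regex. `PLAIN_TYPES` is a name introduced here for illustration.

```python
import pandas as pd

# Types treated as a "plain" interaction (same two values as in the regex)
PLAIN_TYPES = {"~NULL", "other/unknown"}

combos = pd.Series(
    [
        "inhibitor,~NULL",
        "antibody,immunotherapy",
        "inhibitor,other/unknown,~NULL",
        "agonist,inhibitor",
    ]
)

# Substring match: "other" is also found inside "immunotherapy"
regex_hits = combos.str.contains("NULL|other")

# Exact match: split into type names and intersect with the plain set
exact_hits = combos.map(lambda s: bool(PLAIN_TYPES & set(s.split(","))))

print(regex_hits.sum(), exact_hits.sum())
```

On these toy values the regex counts `antibody,immunotherapy` as a hit while the exact comparison does not.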


### interaction_source_db_name

**10.9%** of the merged rows have multiple sources

EDA only

In [44]:
# merge1_multi_sources = merge1[merge1["interaction_source_db_name"].map(len) > 1].copy()

# print(f"{merge1_multi_sources.shape[0] / merge1.shape[0]:.2%}")

# merge1_multi_sources

10.86%


Unnamed: 0,gene_concept_id,drug_concept_id,gene_name,drug_name,drug_is_approved,drug_is_immunotherapy,drug_is_antineoplastic,interaction_source_db_name,interaction_types,interaction_score,drug_specificity_score,gene_specificity_score,evidence_score
87,hgnc:10009,ncit:C152241,{RHD},{ROLEDUMAB},{False},{False},{True},"{ChEMBL, TTD}","{inhibitor, ~NULL}",{26.10189927944276},{3.6277633627400094},{3.5975195553725814},{2.0}
137,hgnc:1014,rxcui:11202,{BCR},{VINCRISTINE},{True},{False},{True},"{FDA, PharmGKB}",{~NULL},{0.0585901218393777},{0.0447872020091359},{0.6540944646131966},{2.0}
138,hgnc:1014,rxcui:1307619,{BCR},{BOSUTINIB},{True},{False},{True},"{FDA, PharmGKB}",{~NULL},{0.206339124738678},{0.1577288418582613},{0.6540944646131966},{2.0}
140,hgnc:1014,rxcui:1364347,{BCR},{PONATINIB},{True},{False},{True},"{FDA, ChEMBL, TALC, PharmGKB}","{inhibitor, ~NULL}",{0.3650615283838149},{0.139529360105385},{0.6540944646131966},{4.0}
141,hgnc:1014,rxcui:1546019,{BCR},{DASATINIB ANHYDROUS},{True},{False},{True},"{FDA, ChEMBL, TALC, PharmGKB}","{inhibitor, ~NULL}",{0.1248894702365682},{0.0477337284571054},{0.6540944646131966},{4.0}
...,...,...,...,...,...,...,...,...,...,...,...,...,...
54646,hgnc:9967,rxcui:2370147,{RET},{SELPERCATINIB},{True},{False},{True},"{ChEMBL, CKB-CORE, OncoKB, PharmGKB, TTD, CIViC}","{inhibitor, ~NULL}",{5.411369362811306},{0.9069408406850026},{0.1754887587986625},{34.0}
54647,hgnc:9967,rxcui:2394936,{RET},{PRALSETINIB},{True},{False},{True},"{ChEMBL, CKB-CORE, OncoKB, PharmGKB, TTD, CIViC}","{inhibitor, ~NULL}",{2.000842453476449},{0.5182519089628587},{0.1754887587986625},{22.0}
54651,hgnc:9967,rxcui:282388,{RET},{IMATINIB},{True},{False},{True},"{TdgClinicalTrial, TEND, NCI}",{~NULL},{0.035368427207917},{0.0503856022602779},{0.1754887587986625},{4.0}
54654,hgnc:9967,rxcui:357977,{RET},{SUNITINIB},{True},{False},{True},"{ChEMBL, TdgClinicalTrial, CGI, CKB-CORE, TEND, DoCM, MyCancerGenome, TALC, NCI, CIViC}","{inhibitor, ~NULL}",{0.3978948060890666},{0.0755784033904168},{0.1754887587986625},{30.0}


In [45]:
# SORT values first - so the sets are ideally unique beforehand

# merge1_multi_sources["interaction_source_db_name"] = [",".join(sorted(i)) for i in merge1_multi_sources["interaction_source_db_name"]]

In [46]:
# merge1_multi_sources["interaction_source_db_name"].nunique()

640

In [47]:
# merge1_multi_sources["interaction_source_db_name"].value_counts()[0:10]

interaction_source_db_name
ChEMBL,TTD                          782
TEND,TdgClinicalTrial               767
ChEMBL,TEND,TdgClinicalTrial        515
ChEMBL,TdgClinicalTrial             402
ChEMBL,TEND,TTD,TdgClinicalTrial    249
ChEMBL,TTD,TdgClinicalTrial         220
TEND,TTD,TdgClinicalTrial           180
FDA,PharmGKB                        180
ChEMBL,DTC                          155
TTD,TdgClinicalTrial                155
Name: count, dtype: int64

## EDA Merging by gene-drug-interaction_type sets

In [31]:
# ## first drop some columns - not needed OR values won't make sense after merge
# ## makes merge faster

# ## and order for readability
# merge2 = df[["drug_concept_id", "drug_name", 
#              "gene_concept_id", "gene_name",
#              "interaction_types", "interaction_source_db_name",
#              "interaction_score", "evidence_score"
#             ]].copy()

In [64]:
# ## merge: takes ~5s to run

# cols_define_edge_2 = ["gene_concept_id", "drug_concept_id", "interaction_types"]

# merge2 = merge2.groupby(by=cols_define_edge_2).agg(set).reset_index().copy()

In [65]:
# merge2.shape[0]

# merge2

58941

Unnamed: 0,gene_concept_id,drug_concept_id,interaction_types,drug_name,gene_name,interaction_source_db_name,interaction_score
0,ensembl:ENSG00000289746,rxcui:10379,~NULL,{TESTOSTERONE},{TARP},{NCI},{4.015676812221963}
1,hgnc:100,rxcui:1148138,inhibitor,{ICATIBANT},{ASIC1},{GuideToPharmacology},{0.6868920863011254}
2,hgnc:100,rxcui:644,~NULL,{AMILORIDE},{ASIC1},{TTD},{0.0808108336824853}
3,hgnc:10000,chembl:CHEMBL107131,~NULL,{DIMETHYLPINOCEMBRIN},{RGS4},{DTC},{0.1706006488852468}
4,hgnc:10000,chembl:CHEMBL1255835,~NULL,{DIAMIDE},{RGS4},{DTC},{0.0853003244426234}
...,...,...,...,...,...,...,...
58936,ncbigene:7,rxcui:8332,inhibitor,{PINDOLOL},{A12M4},{GuideToPharmacology},{0.0610570743378778}
58937,ncbigene:7,rxcui:8353,inhibitor,{PIRIBEDIL},{A12M4},{GuideToPharmacology},{0.0457928057534083}
58938,ncbigene:849,ncit:C152542,inhibitor,{TEGOPRAZAN},{CATHL1L},{GuideToPharmacology},{5.800422062098392}
58939,ncbigene:849,rxcui:1294569,inhibitor,{ESOMEPRAZOLE SODIUM},{CATHL1L},{GuideToPharmacology},{2.900211031049196}


In [69]:
# ## how many rows with "duplicate" drug-gene pairs

# merge2_samepair = merge2[merge2.duplicated(subset=["gene_concept_id", "drug_concept_id"], keep=False)].copy()

# merge2_samepair.shape[0]

7556

In [84]:
# merge2_samepair[40:61]

Unnamed: 0,gene_concept_id,drug_concept_id,interaction_types,drug_name,gene_name,interaction_source_db_name,interaction_score
412,hgnc:1030,rxcui:1148138,inhibitor,{ICATIBANT},{BDKRB2},{GuideToPharmacology},{2.175158273286897}
413,hgnc:1030,rxcui:1148138,~NULL,{ICATIBANT},{BDKRB2},"{TdgClinicalTrial, TEND, TTD}",{2.175158273286897}
553,hgnc:1033,rxcui:1005920,agonist,{ULIPRISTAL ACETATE},{BDNF},{GuideToPharmacology},{0.2122105632475021}
554,hgnc:1033,rxcui:1005920,inhibitor,{ULIPRISTAL ACETATE},{BDNF},{GuideToPharmacology},{0.2122105632475021}
617,hgnc:1033,rxcui:6964,agonist,{MIFEPRISTONE},{BDNF},{GuideToPharmacology},{0.0454736921244647}
618,hgnc:1033,rxcui:6964,inhibitor,{MIFEPRISTONE},{BDNF},{GuideToPharmacology},{0.0454736921244647}
619,hgnc:1033,rxcui:6964,~NULL,{MIFEPRISTONE},{BDNF},{NCI},{0.0454736921244647}
996,hgnc:10436,drugbank:DB12690,inhibitor,{LY-2584702},{RPS6KB1},{ChEMBL},{2.821826949128947}
997,hgnc:10436,drugbank:DB12690,~NULL,{LY-2584702},{RPS6KB1},{TTD},{2.821826949128947}
998,hgnc:10436,drugbank:DB15431,inhibitor,{M-2698},{RPS6KB1},{ChEMBL},{0.3135363276809941}


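Based on a spot check, I don't see cases where a `~NULL` row has the same sources as its "duplicate" rows for the same gene-drug pair.

`merge2_samepair[44:47]` (MIFEPRISTONE / BDNF) is a set of 3 rows.

`merge2_samepair.loc[56285:56287]` is a set of 3 with `other/unknown` and `~NULL`.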
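
The spot check could be turned into an exhaustive one along these lines. This is a sketch on made-up rows, assuming the same column layout as `merge2` (sources stored as sets, one row per gene-drug-type combination). It collects every pair where a `~NULL` row shares at least one source with a typed row.

```python
import pandas as pd

# Toy stand-in for merge2_samepair (made-up IDs and sources)
toy = pd.DataFrame(
    {
        "gene_concept_id": ["g1", "g1", "g2", "g2"],
        "drug_concept_id": ["d1", "d1", "d2", "d2"],
        "interaction_types": ["inhibitor", "~NULL", "agonist", "~NULL"],
        "interaction_source_db_name": [
            {"ChEMBL"},
            {"TTD"},
            {"ChEMBL", "TTD"},
            {"TTD"},
        ],
    }
)

pair = ["gene_concept_id", "drug_concept_id"]
overlaps = []

for key, group in toy.groupby(by=pair):
    is_null = group["interaction_types"] == "~NULL"
    # Union of the source sets on the ~NULL rows vs. on the typed rows
    null_sources = set().union(*group.loc[is_null, "interaction_source_db_name"])
    typed_sources = set().union(*group.loc[~is_null, "interaction_source_db_name"])
    shared = null_sources & typed_sources
    if shared:
        overlaps.append((key, shared))

print(overlaps)
```

An empty `overlaps` list on the real data would confirm the spot check.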

In [91]:
# ## looking for "other/unknown stuff"
# merge2_samepair[merge2_samepair["interaction_types"].str.contains("other")]

Unnamed: 0,gene_concept_id,drug_concept_id,interaction_types,drug_name,gene_name,interaction_source_db_name,interaction_score
3691,hgnc:11048,rxcui:221138,other/unknown,{PHENTERMINE RESIN},{SLC6A2},{ChEMBL},{0.6444913402331547}
3705,hgnc:11048,rxcui:3287,other/unknown,{DEXTROAMPHETAMINE SULFATE},{SLC6A2},{ChEMBL},{0.0508808952815648}
3746,hgnc:11048,rxcui:8896,other/unknown,{PSEUDOEPHEDRINE},{SLC6A2},{ChEMBL},{0.0568668829617489}
3750,hgnc:11048,rxcui:91239,other/unknown,{GUANADREL SULFATE},{SLC6A2},{ChEMBL},{0.7250527577622989}
3799,hgnc:11049,rxcui:3287,other/unknown,{DEXTROAMPHETAMINE SULFATE},{SLC6A3},{ChEMBL},{0.3496905166623911}
...,...,...,...,...,...,...,...
56286,hgnc:9565,rxcui:1302966,other/unknown,{CARFILZOMIB},{PSMD7},{MyCancerGenome},{0.4971790338941479}
56291,hgnc:9565,rxcui:358258,other/unknown,{BORTEZOMIB},{PSMD7},{MyCancerGenome},{0.350949906278222}
56296,hgnc:9566,rxcui:1302966,other/unknown,{CARFILZOMIB},{PSMD8},{MyCancerGenome},{0.5800422062098393}
56301,hgnc:9566,rxcui:358258,other/unknown,{BORTEZOMIB},{PSMD8},{MyCancerGenome},{0.4094415573245924}


In [101]:
# merge2_samepair.loc[56285:56287]

Unnamed: 0,gene_concept_id,drug_concept_id,interaction_types,drug_name,gene_name,interaction_source_db_name,interaction_score
56285,hgnc:9565,rxcui:1302966,inhibitor,{CARFILZOMIB},{PSMD7},{ChEMBL},{0.4971790338941479}
56286,hgnc:9565,rxcui:1302966,other/unknown,{CARFILZOMIB},{PSMD7},{MyCancerGenome},{0.4971790338941479}
56287,hgnc:9565,rxcui:1302966,~NULL,{CARFILZOMIB},{PSMD7},{DTC},{0.4971790338941479}


## Experimental

In [31]:
## only including necessary columns 
##   + drug/gene names for readability/comparison with website
## dropping columns makes merge faster

## order for readability
experiment_setup = df[["drug_concept_id", "drug_name", 
                       "gene_concept_id", "gene_name",
                       "interaction_types", "interaction_source_db_name",
                       "interaction_score", "evidence_score"
                      ]].copy()

**FIRST map the interaction_types that mean "plain interacts_with" to one term -> "~PLAIN"**

**Make a new column, mod_type, and use it going forward** (in the notebook, keep the original column so it can be checked):
* If "other/unknown" or "~NULL", put "~PLAIN"
* Else keep the original value

Reasoning:
* When both occur for the same drug-gene pair, their rows can then be merged
  * All other interaction_types map to different predicate/qualifier sets, so they don't need merging later (confirmed 11/18 night)
* Makes the following logic easier: detecting "only one edge" or a "plain interacts_with edge"
* The other option is to map everything to predicate/qualifier sets first, then do the logic on those. But I think that's more complicated to check…
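
A minimal sketch of the mapping rule on toy values, written with `Series.isin` and `Series.where` instead of a Python loop. The toy series is made up; the two plain values are the ones named in the rule above.

```python
import pandas as pd

plain_types = {"other/unknown", "~NULL"}

# Toy interaction_types column (made-up order of real type names)
types = pd.Series(["inhibitor", "~NULL", "other/unknown", "immunotherapy"])

# Keep the value where it is NOT a plain type, otherwise write "~PLAIN"
mod_type = types.where(~types.isin(plain_types), "~PLAIN")

print(mod_type.tolist())
```

Because the comparison is on whole values, `immunotherapy` is left untouched even though it contains the substring "other".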

In [32]:
plain_types = {"other/unknown", "~NULL"}

experiment_setup["mod_type"] = ["~PLAIN" if i in plain_types else i for i in experiment_setup["interaction_types"]]

In [33]:
## check how it looks

## original column: don't match on "other" because it also matches "immunOTHERapy"
experiment_setup[experiment_setup["interaction_types"].str.contains("unknown|NULL")].shape[0]

## modded
experiment_setup[experiment_setup["mod_type"] == "~PLAIN"].shape[0]

47885

47885

In [34]:
## more checks how it looks
experiment_setup[experiment_setup["mod_type"] == "~PLAIN"]["interaction_types"].unique()

experiment_setup[~ (experiment_setup["mod_type"] == "~PLAIN")]["interaction_types"].unique()

array(['~NULL', 'other/unknown'], dtype=object)

array(['inhibitor', 'modulator', 'cleavage', 'blocker', 'binder',
       'agonist', 'activator', 'positive modulator', 'vaccine',
       'antibody', 'immunotherapy', 'negative modulator', 'potentiator',
       'inverse agonist', 'antisense oligonucleotide'], dtype=object)

In [35]:
## group by drug ID, gene ID, mod_type combo

## save in a different variable in case I still want to work with the ungrouped data later
experiment_mergeMain = experiment_setup.copy()

## merge: takes ~8s to run
cols_define_edge_3 = ["drug_concept_id", "gene_concept_id", "mod_type"]

## when it can be multiple values, run set
## otherwise, the value is always the same (from EDA), so just keep the first
experiment_mergeMain = experiment_mergeMain.groupby(by=cols_define_edge_3).agg(
    {
        "interaction_types": set, 
        "interaction_source_db_name": set,
        "interaction_score": 'first',
        "evidence_score": 'first',
        "drug_name": 'first',
        "gene_name": 'first',

    }
).reset_index().copy()

In [36]:
experiment_mergeMain.shape[0]

58830

In [37]:
## check what rows look like with both original ~PLAIN values

experiment_mergeMain[experiment_mergeMain["gene_concept_id"] == "hgnc:9565"]

Unnamed: 0,drug_concept_id,gene_concept_id,mod_type,interaction_types,interaction_source_db_name,interaction_score,evidence_score,drug_name,gene_name
15852,drugbank:DB16741,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},0.196255,1.0,BORTEZOMIB D-MANNITOL,PSMD7
22106,ncit:C2160,hgnc:9565,~PLAIN,{~NULL},{NCI},0.745769,2.0,PROTEASOME INHIBITOR,PSMD7
29335,ncit:C91388,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},0.186442,1.0,OPROZOMIB,PSMD7
34091,rxcui:1302966,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},0.497179,4.0,CARFILZOMIB,PSMD7
34092,rxcui:1302966,hgnc:9565,~PLAIN,"{other/unknown, ~NULL}","{MyCancerGenome, DTC}",0.497179,4.0,CARFILZOMIB,PSMD7
38207,rxcui:1723734,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},0.191223,1.0,IXAZOMIB CITRATE,PSMD7
38249,rxcui:1723735,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},0.177564,1.0,IXAZOMIB,PSMD7
49417,rxcui:358258,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},0.35095,4.0,BORTEZOMIB,PSMD7
49418,rxcui:358258,hgnc:9565,~PLAIN,"{other/unknown, ~NULL}","{MyCancerGenome, DTC}",0.35095,4.0,BORTEZOMIB,PSMD7


### sources logic

Create a mapping df: group-by "drug-gene" pair -> get a set of all sources. 

Then make a new sources column **mod_sources** (but in notebook, keep original so can check):
* If "~PLAIN": use the mapping df's value for drug-gene pair (all sources)
* Else: keep original value
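
The two steps above can be sketched on toy rows as follows. The IDs are made up, and the lookup is done through a plain dict keyed by `(drug, gene)` tuples, which is one possible way to retrieve the values and not necessarily the most efficient one.

```python
import pandas as pd

# Toy stand-in for the merged df (made-up IDs; sources stored as sets)
toy = pd.DataFrame(
    {
        "drug_concept_id": ["d1", "d1", "d2"],
        "gene_concept_id": ["g1", "g1", "g1"],
        "mod_type": ["inhibitor", "~PLAIN", "~PLAIN"],
        "interaction_source_db_name": [
            {"ChEMBL"},
            {"DTC", "MyCancerGenome"},
            {"NCI"},
        ],
    }
)

pair = ["drug_concept_id", "gene_concept_id"]

# Mapping: (drug, gene) -> union of all source sets for that pair
all_sources = (
    toy.groupby(by=pair)
    .agg({"interaction_source_db_name": lambda x: set.union(*x)})[
        "interaction_source_db_name"
    ]
    .to_dict()
)

# ~PLAIN rows get every source for the pair; other rows keep their own
toy["mod_sources"] = [
    all_sources[(drug, gene)] if mod == "~PLAIN" else own
    for drug, gene, mod, own in zip(
        toy["drug_concept_id"],
        toy["gene_concept_id"],
        toy["mod_type"],
        toy["interaction_source_db_name"],
    )
]

print(toy[["mod_type", "mod_sources"]])
```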

In [38]:
experiment_sources = experiment_mergeMain.copy()

drug_gene_pair = ["drug_concept_id", "gene_concept_id"]

In [39]:
## need to merge sets!
## and keep multi-index, annoying but seems easier to retrieve values later
experiment_sources = experiment_sources.groupby(by=drug_gene_pair).agg(
    {"interaction_source_db_name":lambda x: set.union(*x)})

In [40]:
experiment_sources

Unnamed: 0_level_0,Unnamed: 1_level_0,interaction_source_db_name
drug_concept_id,gene_concept_id,Unnamed: 2_level_1
chembl:CHEMBL101168,hgnc:2596,{DTC}
chembl:CHEMBL101168,hgnc:2637,{DTC}
chembl:CHEMBL10118,hgnc:30863,{DTC}
chembl:CHEMBL101510,hgnc:2625,{DTC}
chembl:CHEMBL101804,hgnc:12841,{DTC}
...,...,...
rxcui:9997,hgnc:644,{DTC}
rxcui:9997,hgnc:7876,{PharmGKB}
rxcui:9997,hgnc:7968,{PharmGKB}
rxcui:9997,hgnc:7979,"{TTD, GuideToPharmacology, ChEMBL, TdgClinicalTrial, TEND}"


In [41]:
## trying out how to get specific value in this df using multi-index
## loc returns a series, turn to list and get first/only item

experiment_sources.loc["rxcui:1302966", "hgnc:9565"]

experiment_sources.loc["rxcui:1302966", "hgnc:9565"].to_list()[0]

interaction_source_db_name    {DTC, MyCancerGenome, ChEMBL}
Name: (rxcui:1302966, hgnc:9565), dtype: object

{'ChEMBL', 'DTC', 'MyCancerGenome'}

In [42]:
## add new column to experiment df
## in specific spot

mod_sources = [
    experiment_sources.loc[x.drug_concept_id, x.gene_concept_id].to_list()[0] if x.mod_type == "~PLAIN"
    else x.interaction_source_db_name
    for x in experiment_mergeMain[["drug_concept_id", "gene_concept_id", "mod_type", "interaction_source_db_name"]].itertuples()
]

experiment_mergeMain.insert(5, "mod_sources", mod_sources)

In [43]:
## check what rows look like with original ~PLAIN values

experiment_mergeMain[experiment_mergeMain["gene_concept_id"] == "hgnc:9565"]

experiment_mergeMain[experiment_mergeMain.duplicated(subset=["drug_concept_id", "gene_concept_id"], keep=False)][60:70]

Unnamed: 0,drug_concept_id,gene_concept_id,mod_type,interaction_types,interaction_source_db_name,mod_sources,interaction_score,evidence_score,drug_name,gene_name
15852,drugbank:DB16741,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.196255,1.0,BORTEZOMIB D-MANNITOL,PSMD7
22106,ncit:C2160,hgnc:9565,~PLAIN,{~NULL},{NCI},{NCI},0.745769,2.0,PROTEASOME INHIBITOR,PSMD7
29335,ncit:C91388,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.186442,1.0,OPROZOMIB,PSMD7
34091,rxcui:1302966,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.497179,4.0,CARFILZOMIB,PSMD7
34092,rxcui:1302966,hgnc:9565,~PLAIN,"{other/unknown, ~NULL}","{MyCancerGenome, DTC}","{DTC, MyCancerGenome, ChEMBL}",0.497179,4.0,CARFILZOMIB,PSMD7
38207,rxcui:1723734,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.191223,1.0,IXAZOMIB CITRATE,PSMD7
38249,rxcui:1723735,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.177564,1.0,IXAZOMIB,PSMD7
49417,rxcui:358258,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.35095,4.0,BORTEZOMIB,PSMD7
49418,rxcui:358258,hgnc:9565,~PLAIN,"{other/unknown, ~NULL}","{MyCancerGenome, DTC}","{DTC, MyCancerGenome, ChEMBL}",0.35095,4.0,BORTEZOMIB,PSMD7


Unnamed: 0,drug_concept_id,gene_concept_id,mod_type,interaction_types,interaction_source_db_name,mod_sources,interaction_score,evidence_score,drug_name,gene_name
4576,chembl:CHEMBL2109391,hgnc:3613,modulator,{modulator},{ChEMBL},{ChEMBL},15.66114,3.0,MDX-447,FCGR1A
4577,chembl:CHEMBL2109391,hgnc:3613,~PLAIN,{~NULL},{NCI},"{NCI, ChEMBL}",15.66114,3.0,MDX-447,FCGR1A
4580,chembl:CHEMBL2109398,hgnc:3387,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},10.44076,2.0,KB-004,EPHA3
4581,chembl:CHEMBL2109398,hgnc:3387,~PLAIN,{~NULL},{TTD},"{TTD, ChEMBL}",10.44076,2.0,KB-004,EPHA3
4583,chembl:CHEMBL2109401,hgnc:3431,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},2.007838,2.0,AV-203,ERBB3
4584,chembl:CHEMBL2109401,hgnc:3431,~PLAIN,{~NULL},{TTD},"{TTD, ChEMBL}",2.007838,2.0,AV-203,ERBB3
4585,chembl:CHEMBL2109402,hgnc:3431,antibody,{antibody},{MyCancerGenome},{MyCancerGenome},4.015677,4.0,MM-121,ERBB3
4586,chembl:CHEMBL2109402,hgnc:3431,inhibitor,{inhibitor},"{ClearityFoundationClinicalTrial, ChEMBL}","{ClearityFoundationClinicalTrial, ChEMBL}",4.015677,4.0,MM-121,ERBB3
4587,chembl:CHEMBL2109402,hgnc:3431,~PLAIN,{~NULL},{TTD},"{ClearityFoundationClinicalTrial, TTD, MyCancerGenome, ChEMBL}",4.015677,4.0,MM-121,ERBB3
4596,chembl:CHEMBL2109435,hgnc:4445,binder,{binder},{ChEMBL},{ChEMBL},26.101899,2.0,HUA33,GPA33


### scores logic

In the notebook, create a new df rather than modifying the old one, so the logic is easier to adjust.

This step only handles the logic for removing scores (setting them to NA).

Group by "drug-gene" pair:
* If the pair has > 1 row (i.e. more than one mod_type):
  * If the row's mod_type isn't "~PLAIN": remove its scores
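
The rule can also be expressed without an explicit loop. The sketch below uses made-up rows and assumes the scores are float columns, so `np.nan` is used as the NA value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the merged df (made-up IDs and scores)
toy = pd.DataFrame(
    {
        "drug_concept_id": ["d1", "d1", "d2"],
        "gene_concept_id": ["g1", "g1", "g1"],
        "mod_type": ["inhibitor", "~PLAIN", "inhibitor"],
        "interaction_score": [0.5, 0.5, 0.2],
        "evidence_score": [4.0, 4.0, 1.0],
    }
)

pair = ["drug_concept_id", "gene_concept_id"]

# Number of rows (mod_types) in each drug-gene pair, aligned to the rows
rows_per_pair = toy.groupby(by=pair)["mod_type"].transform("size")

# Remove scores only on non-plain rows of pairs that have several rows
drop_scores = (rows_per_pair > 1) & (toy["mod_type"] != "~PLAIN")
toy.loc[drop_scores, ["interaction_score", "evidence_score"]] = np.nan

print(toy)
```

One consequence of the rule as stated: a pair with several typed rows and no "~PLAIN" row would end up with no scored row at all, which may be worth checking on the real data.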

In [45]:
experiment_scores = experiment_mergeMain.copy()

drug_gene_pair = ["drug_concept_id", "gene_concept_id"]

In [47]:
## iterate through a group-by

grp = experiment_scores.groupby(by=drug_gene_pair)

for name, group in grp:
    if group.shape[0] > 1:
        for idx,row in group.iterrows():
            if row.mod_type != "~PLAIN":
                experiment_scores.at[idx, "interaction_score"] = pd.NA
                experiment_scores.at[idx, "evidence_score"] = pd.NA

In [50]:
## check what rows look like
## some scores are now NA: 58,830 rows total vs. 55,030 with non-null scores (3,800 set to NA)

experiment_scores.info()

experiment_scores[experiment_scores["interaction_score"].isna()].shape[0]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58830 entries, 0 to 58829
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   drug_concept_id             58830 non-null  object 
 1   gene_concept_id             58830 non-null  object 
 2   mod_type                    58830 non-null  object 
 3   interaction_types           58830 non-null  object 
 4   interaction_source_db_name  58830 non-null  object 
 5   mod_sources                 58830 non-null  object 
 6   interaction_score           55030 non-null  float64
 7   evidence_score              55030 non-null  float64
 8   drug_name                   58830 non-null  object 
 9   gene_name                   58830 non-null  object 
dtypes: float64(2), object(8)
memory usage: 4.5+ MB


3800

In [51]:
## check what rows look like

experiment_scores[experiment_scores["gene_concept_id"] == "hgnc:9565"]

experiment_scores[experiment_scores.duplicated(subset=["drug_concept_id", "gene_concept_id"], keep=False)][60:70]

Unnamed: 0,drug_concept_id,gene_concept_id,mod_type,interaction_types,interaction_source_db_name,mod_sources,interaction_score,evidence_score,drug_name,gene_name
15852,drugbank:DB16741,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.196255,1.0,BORTEZOMIB D-MANNITOL,PSMD7
22106,ncit:C2160,hgnc:9565,~PLAIN,{~NULL},{NCI},{NCI},0.745769,2.0,PROTEASOME INHIBITOR,PSMD7
29335,ncit:C91388,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.186442,1.0,OPROZOMIB,PSMD7
34091,rxcui:1302966,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},,,CARFILZOMIB,PSMD7
34092,rxcui:1302966,hgnc:9565,~PLAIN,"{other/unknown, ~NULL}","{MyCancerGenome, DTC}","{DTC, MyCancerGenome, ChEMBL}",0.497179,4.0,CARFILZOMIB,PSMD7
38207,rxcui:1723734,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.191223,1.0,IXAZOMIB CITRATE,PSMD7
38249,rxcui:1723735,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},0.177564,1.0,IXAZOMIB,PSMD7
49417,rxcui:358258,hgnc:9565,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},,,BORTEZOMIB,PSMD7
49418,rxcui:358258,hgnc:9565,~PLAIN,"{other/unknown, ~NULL}","{MyCancerGenome, DTC}","{DTC, MyCancerGenome, ChEMBL}",0.35095,4.0,BORTEZOMIB,PSMD7


Unnamed: 0,drug_concept_id,gene_concept_id,mod_type,interaction_types,interaction_source_db_name,mod_sources,interaction_score,evidence_score,drug_name,gene_name
4576,chembl:CHEMBL2109391,hgnc:3613,modulator,{modulator},{ChEMBL},{ChEMBL},,,MDX-447,FCGR1A
4577,chembl:CHEMBL2109391,hgnc:3613,~PLAIN,{~NULL},{NCI},"{NCI, ChEMBL}",15.66114,3.0,MDX-447,FCGR1A
4580,chembl:CHEMBL2109398,hgnc:3387,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},,,KB-004,EPHA3
4581,chembl:CHEMBL2109398,hgnc:3387,~PLAIN,{~NULL},{TTD},"{TTD, ChEMBL}",10.44076,2.0,KB-004,EPHA3
4583,chembl:CHEMBL2109401,hgnc:3431,inhibitor,{inhibitor},{ChEMBL},{ChEMBL},,,AV-203,ERBB3
4584,chembl:CHEMBL2109401,hgnc:3431,~PLAIN,{~NULL},{TTD},"{TTD, ChEMBL}",2.007838,2.0,AV-203,ERBB3
4585,chembl:CHEMBL2109402,hgnc:3431,antibody,{antibody},{MyCancerGenome},{MyCancerGenome},,,MM-121,ERBB3
4586,chembl:CHEMBL2109402,hgnc:3431,inhibitor,{inhibitor},"{ClearityFoundationClinicalTrial, ChEMBL}","{ClearityFoundationClinicalTrial, ChEMBL}",,,MM-121,ERBB3
4587,chembl:CHEMBL2109402,hgnc:3431,~PLAIN,{~NULL},{TTD},"{ClearityFoundationClinicalTrial, TTD, MyCancerGenome, ChEMBL}",4.015677,4.0,MM-121,ERBB3
4596,chembl:CHEMBL2109435,hgnc:4445,binder,{binder},{ChEMBL},{ChEMBL},,,HUA33,GPA33
