<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#DrugCentral-notebook:-querying,-EDA" data-toc-modified-id="DrugCentral-notebook:-querying,-EDA-1">DrugCentral notebook: querying, EDA</a></span><ul class="toc-item"><li><span><a href="#Access-database" data-toc-modified-id="Access-database-1.1">Access database</a></span></li><li><span><a href="#Get-version" data-toc-modified-id="Get-version-1.2">Get version</a></span></li><li><span><a href="#EDA:-List-of-tables-(SQL)" data-toc-modified-id="EDA:-List-of-tables-(SQL)-1.3">EDA: List of tables (SQL)</a></span></li><li><span><a href="#act_table_full-(SQL-EDA)" data-toc-modified-id="act_table_full-(SQL-EDA)-1.4">act_table_full (SQL EDA)</a></span></li><li><span><a href="#act_table_full-(pandas-EDA)" data-toc-modified-id="act_table_full-(pandas-EDA)-1.5">act_table_full (pandas EDA)</a></span><ul class="toc-item"><li><span><a href="#Notes-on-columns" data-toc-modified-id="Notes-on-columns-1.5.1">Notes on columns</a></span><ul class="toc-item"><li><span><a href="#CHEMICAL-SUBJECT" data-toc-modified-id="CHEMICAL-SUBJECT-1.5.1.1">CHEMICAL SUBJECT</a></span></li><li><span><a href="#PROTEIN-OBJECT" data-toc-modified-id="PROTEIN-OBJECT-1.5.1.2">PROTEIN OBJECT</a></span></li><li><span><a href="#ASSOCIATION" data-toc-modified-id="ASSOCIATION-1.5.1.3">ASSOCIATION</a></span></li></ul></li><li><span><a href="#Filtered-data,-duplicates?" data-toc-modified-id="Filtered-data,-duplicates?-1.5.2">Filtered data, duplicates?</a></span></li></ul></li><li><span><a href="#omop_relationship_doid_view-(SQL/pandas-EDA)" data-toc-modified-id="omop_relationship_doid_view-(SQL/pandas-EDA)-1.6">omop_relationship_doid_view (SQL/pandas EDA)</a></span><ul class="toc-item"><li><span><a href="#DiseaseOrPheno-ID-fields" data-toc-modified-id="DiseaseOrPheno-ID-fields-1.6.1">DiseaseOrPheno ID fields</a></span></li><li><span><a href="#Filtered-data,-duplicates?" 
data-toc-modified-id="Filtered-data,-duplicates?-1.6.2">Filtered data, duplicates?</a></span></li><li><span><a href="#Filtering-objects" data-toc-modified-id="Filtering-objects-1.6.3">Filtering objects</a></span></li><li><span><a href="#Other-parser-dev" data-toc-modified-id="Other-parser-dev-1.6.4">Other parser dev</a></span></li></ul></li><li><span><a href="#Close-connection" data-toc-modified-id="Close-connection-1.7">Close connection</a></span></li><li><span><a href="#Other-tables-(SQL/pandas-EDA)" data-toc-modified-id="Other-tables-(SQL/pandas-EDA)-1.8">Other tables (SQL/pandas EDA)</a></span><ul class="toc-item"><li><span><a href="#action_type" data-toc-modified-id="action_type-1.8.1">action_type</a></span></li><li><span><a href="#data_source" data-toc-modified-id="data_source-1.8.2">data_source</a></span></li><li><span><a href="#vetomop" data-toc-modified-id="vetomop-1.8.3">vetomop</a></span></li></ul></li></ul></li></ul></div>

# DrugCentral notebook: querying, EDA

In [1]:
## for notebook only 

## allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## for printing: pprint sorts dict keys by default
from pprint import pprint,pp

<div class="alert alert-block alert-danger">

For SQL querying, using:
    
* `pandas`: has [methods to read SQL queries/tables into dataframes](https://pandas.pydata.org/pandas-docs/stable/reference/io.html#sql)
  * Depends on `SQLAlchemy`
  * People have successfully plugged psycopg2 directly into pandas, but pandas will throw a UserWarning and it isn't recommended/guaranteed to continue working ([Stack Overflow](https://stackoverflow.com/questions/70892143/psycopg2-connection-sql-database-to-pandas-dataframe#comment134809897_70892202), [Github gist](https://gist.github.com/jakebrinkmann/de7fd185efe9a1f459946cf72def057e?permalink_comment_id=4489702#gistcomment-4489702))
* `SQLAlchemy`: uses other packages such as [psycopg](https://docs.sqlalchemy.org/en/20/dialects/postgresql.html#module-sqlalchemy.dialects.postgresql.psycopg) as database drivers; slightly slower than using the driver directly ([ref](https://www.geeksforgeeks.org/python/difference-between-psycopg2-and-sqlalchemy-in-python/)). [Installed](https://docs.sqlalchemy.org/en/20/intro.html#installation) using pip
* `psycopg` (aka psycopg3): successor of psycopg2. [The developer of both suggests](https://github.com/psycopg/psycopg2) that new projects use this instead. [Installed](https://www.psycopg.org/psycopg3/docs/basic/install.html#binary-installation) using pip (`pip install "psycopg[binary]"`)

Later sections include docs/references I used.

</div>

In [2]:
## PUT required imports here

## psycopg is underlying dependency, not directly used 
# import psycopg
from sqlalchemy import create_engine, URL, text

import re
import pandas as pd
## in parser, using util.biolink function instead
import bmt  ## for omop_relationship filtering on semantic type



## for EDA only
## in standard python
from urllib.parse import urlparse   ## for EDA on source_url columns

## NOT for parser: for viewing df only
pd.options.display.max_columns = None
pd.set_option('display.max_colwidth', 60)

In [3]:
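## useful function
def check_if_contains(df, column_name, patterns):
    """Print the row count for each regex pattern that matches at least one value in df[column_name]."""
    for i in patterns:
        ## na=False: missing values count as "no match"
        temp = df[df[column_name].str.contains(pat=i, na=False)]
        if temp.size > 0:
            print(f'{i} count: {temp.shape[0]}')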
            
## use raw string, escape char for special char
delimiters = [",", ";", ":", "-", "_", r"\|", " "]

<div class="alert alert-block alert-danger">    
    
This notebook was originally written using data accessed Dec 2025, which was dbversion number 54 and date 2023-11-01.

</div>

## Access database

Using DrugCentral's public Postgres database - details in https://drugcentral.org/download. 

I'm not sure whether to hide password. It is fully public...

References:
* [MyChem](https://github.com/biothings/mychem.info/blob/master/src/hub/dataload/sources/drugcentral/drugcentral_dump.py) parser
* [Github gist](https://gist.github.com/jakebrinkmann/de7fd185efe9a1f459946cf72def057e?permalink_comment_id=4489702#gistcomment-4489702)
* [SQLAlchemy and psycopg driver](https://docs.sqlalchemy.org/en/20/dialects/postgresql.html#module-sqlalchemy.dialects.postgresql.psycopg)
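
On the question of hiding the password: the credentials are published on the DrugCentral download page, so hard-coding them is probably fine. One option, if I change my mind, is to let environment variables override the public defaults. A minimal sketch; the `DRUGCENTRAL_*` variable names and the `get_setting` helper are hypothetical (my own naming, not something DrugCentral defines).

```python
import os

## public defaults, copied from the DrugCentral downloads page
PUBLIC_DEFAULTS = {
    "DRUGCENTRAL_USER": "drugman",
    "DRUGCENTRAL_PASSWORD": "dosage",
}

def get_setting(name, env=None):
    """Return env[name] if it is set, else the public default (hypothetical helper)."""
    env = os.environ if env is None else env
    return env.get(name, PUBLIC_DEFAULTS[name])

## env is passed explicitly here so the example doesn't depend on the real environment
print(get_setting("DRUGCENTRAL_PASSWORD", env={}))
print(get_setting("DRUGCENTRAL_PASSWORD", env={"DRUGCENTRAL_PASSWORD": "override"}))
```

In the notebook, `get_setting("DRUGCENTRAL_PASSWORD")` (no `env` argument) would read from `os.environ` and fall back to the public value.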

In [4]:
## connection info for public Postgres database 

DIALECT = "postgresql"
DRIVER = "psycopg"  ## package dependency
## from DrugCentral downloads page
USER = "drugman"
PASSWORD = "dosage"
HOST = "unmtid-dbs.net"
PORT = 5433
DBNAME = "drugcentral"

In [5]:
## parser

def get_server_url(dialect, driver, user, password, host, port: int, dbname):
    """
    Uses SQLAlchemy method to compose server url for SQLAlchemy engine, 
    rather than using hard-coded string formatting. 
    
    Returns: sqlalchemy.engine.url.URL
    """
    return URL.create(
        drivername = f"{dialect}+{driver}",
        username=user,
        password=password,
        host=host,
        port=port,
        database=dbname,
    )

In [6]:
## parser

server_url = get_server_url(
    dialect=DIALECT, 
    driver=DRIVER, 
    user=USER, 
    password=PASSWORD, 
    host=HOST, 
    port=PORT, 
    dbname=DBNAME
)
type(server_url)

print(server_url)

sqlalchemy.engine.url.URL

postgresql+psycopg://drugman:***@unmtid-dbs.net:5433/drugcentral


In [7]:
## notebook only: for easy querying

print("Connecting to the PostgreSQL database...")

engine_dev = create_engine(server_url)
conn = engine_dev.connect()

print("Connection successful")

Connecting to the PostgreSQL database...
Connection successful


In [8]:
conn

<sqlalchemy.engine.base.Connection at 0x10863d5b0>

## Get version

According to MyChem parser, the `dbversion` table contains version number and date. 

**Using date as source version** because it's less confusing (version number differs between live database and dump file's header)

In [9]:
## EDA

result = conn.execute(text("SELECT * FROM dbversion"))

## only 1 line in table, can use fetchone to retrieve just this line
result.fetchall()

[(54, datetime.datetime(2023, 11, 1, 12, 10, 57, 835000))]

In [10]:
## parser

engine = create_engine(server_url)

## closes connection automatically afterwards
with engine.connect() as db_conn: 
    result = db_conn.execute(text("SELECT * FROM dbversion"))
    version_date = result.fetchone()[1]

version_date = version_date.strftime("%Y_%m_%d")

version_date

'2023_11_01'

## EDA: List of tables (SQL)

References:
* Raw SQL query: 
  * https://stackoverflow.com/a/75752699
  * https://www.pythontutorials.net/blog/sqlalchemy-getting-a-list-of-tables/#postgresql
* Background:
  * https://www.postgresql.org/docs/current/infoschema-tables.html
  * https://www.geeksforgeeks.org/dbms/difference-between-view-and-table/

In [11]:
Q_TABLE_NAMES = """
    SELECT table_name,table_type
    FROM information_schema.tables
    WHERE (table_schema = 'public')
"""

result = conn.execute(text(Q_TABLE_NAMES))

response_table_names = result.fetchall()

In [12]:
len(response_table_names)

result.keys()

response_table_names[0]

260

RMKeyView(['table_name', 'table_type'])

('SOL_PHARMACOVIGILANCE_qzJrh0Bu_FDA_products_input', 'BASE TABLE')

**Many more tables than expected!!** The names are odd too.


Also **NOTE**: each row in the response acts like a [NamedTuple](https://www.geeksforgeeks.org/python/namedtuple-in-python/): values can be accessed by **name** (e.g. `row.table_name`) or by index (e.g. `row[0]`)

In [13]:
## look at tables that aren't "BASE TABLE" (normal type):

for i in response_table_names:
    if i.table_type != "BASE TABLE":
        print(i)

('faers_top', 'VIEW')
('ob_exclusivity_view', 'VIEW')
('ob_patent_view', 'VIEW')
('omop_relationship_doid_view', 'VIEW')
('my_first_dbt_model', 'VIEW')
('my_second_dbt_model', 'VIEW')


`omop_relationship_doid_view` sounds interesting, so I don't want to restrict to only "BASE TABLE" in my original query. 

Now to review the full list of tables...

In [14]:
table_names = [i.table_name for i in response_table_names]
table_names = sorted(table_names)

In [15]:
## going through table names
table_names[250:275]

# for i in table_names:
#     if "test" in i:
#         print(i)

['test_2_1',
 'test_2_2',
 'test_3',
 'test_coffee_recipe',
 'test_snap',
 'vetomop',
 'vetprod',
 'vetprod2struct',
 'vetprod_type',
 'vettype']

**Observations and thoughts:**
- some sound like Gen AI / LLM / ML stuff. Substrings:
  - `AGENT` / `AGENTIC`
  - `DEMO`
  - `GENAI`
  - `PREDICTIONLLM`
  - `model` / `MODEL`
  - `PRED` - prediction?
  - `machine_learning`
- some don't sound official/like normal use. Substrings:
  - `test`
  - `snapshot`
  - `recipe`, including `test_coffee_recipe`
- makes me wonder if users have write-access, or what happened

**Characteristics of "legit" tables**

Based on the tables previously used in parsers and other reliable-looking names, I think "legit" table names tend to only have lowercase letters, underscores, and rarely the number 2 (for "to").

Code references:
* "string shouldn't contain substrings from this list": https://stackoverflow.com/questions/58641898/check-if-string-does-not-contain-strings-from-the-list

In [16]:
name_pat = re.compile("[a-z2_]+")

possibly_good_tables = [i for i in table_names if name_pat.fullmatch(i)]

## then filter out keywords of some names that don't seem legit
odd_name_strings = ["test", "model", "snapshot"]
possibly_good_tables = [i for i in possibly_good_tables if not any(x in i for x in odd_name_strings)]

In [17]:
len(possibly_good_tables)

71
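
To sanity-check the two filters from the cells above, here they are applied to a handful of table names that show up earlier in this notebook. This is only a sketch: `looks_legit` is a hypothetical helper name, and the sample list is hand-picked, not the full table list.

```python
import re

## same pattern and keyword list as the notebook cell
name_pat = re.compile("[a-z2_]+")
odd_name_strings = ["test", "model", "snapshot"]

## names taken from outputs earlier in this notebook
sample_names = [
    "SOL_PHARMACOVIGILANCE_qzJrh0Bu_FDA_products_input",  ## uppercase letters fail the pattern
    "test_2_1",            ## the digit 1 fails the pattern
    "test_coffee_recipe",  ## passes the pattern, caught by "test"
    "my_first_dbt_model",  ## passes the pattern, caught by "model"
    "vetprod2struct",      ## kept
    "act_table_full",      ## kept
]

def looks_legit(name):
    """True if name is only lowercase letters/underscores/2 and has no odd keyword."""
    return bool(name_pat.fullmatch(name)) and not any(x in name for x in odd_name_strings)

kept = [n for n in sample_names if looks_legit(n)]
print(kept)
```

Note that `recipe` isn't in the keyword list, so a name like `test_coffee_recipe` is only dropped because it also contains `test`.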

In [18]:
## reviewing these closer

possibly_good_tables[60:]

['target_dictionary',
 'target_go',
 'target_keyword',
 'td2tc',
 'tdgo2tc',
 'tdkey2tc',
 'vetomop',
 'vetprod',
 'vetprod2struct',
 'vetprod_type',
 'vettype']

## act_table_full (SQL EDA)

Link for how to get list of column names: https://www.geeksforgeeks.org/python/how-to-get-column-names-from-sqlalchemy/

In [19]:
## total number of rows

q_count_rows = """
    SELECT COUNT(*) 
    FROM act_table_full
"""

result = conn.execute(text(q_count_rows))

result.fetchone()

(20978,)

In [20]:
## 2 ways: number of rows with action_type filled out (not None)
Q_NOTNULL_1 = """
    SELECT COUNT(action_type)
    FROM act_table_full
"""

result = conn.execute(text(Q_NOTNULL_1))
result.fetchone()

Q_NOTNULL_2 = """
    SELECT COUNT(*) 
    FROM act_table_full
    WHERE action_type IS NOT NULL
"""

result = conn.execute(text(Q_NOTNULL_2))
result.fetchone()

(4360,)

(4360,)

In [21]:
# ## looking at number of rows filled out for other columns

# Q_NOTNULL_1 = """
#     SELECT COUNT(act_value)
#     FROM act_table_full
# """

# result = conn.execute(text(Q_NOTNULL_1))
# result.fetchone()

In [21]:
# ## grab a row from table, see columns and example values

Q_EXAMPLE = """
    SELECT * 
    FROM act_table_full 
    WHERE act_ref_id IS NOT NULL
    LIMIT 1
"""
result = conn.execute(text(Q_EXAMPLE))
row = result.fetchall()
{k:v for k,v in zip(result.keys(), row[0])}

{'act_id': 215525,
 'struct_id': 296,
 'target_id': 596,
 'target_name': 'DNA topoisomerase 1',
 'target_class': 'Enzyme',
 'accession': 'P11387',
 'gene': 'TOP1',
 'swissprot': 'TOP1_HUMAN',
 'act_value': 6.561,
 'act_unit': None,
 'act_type': 'IC50',
 'act_comment': None,
 'act_source': 'SCIENTIFIC LITERATURE',
 'relation': '=',
 'moa': 1,
 'moa_source': 'SCIENTIFIC LITERATURE',
 'act_source_url': 'https://pubmed.ncbi.nlm.nih.gov/9875499',
 'moa_source_url': 'https://pubmed.ncbi.nlm.nih.gov/9875499',
 'action_type': 'INHIBITOR',
 'first_in_class': None,
 'tdl': 'Tclin',
 'act_ref_id': 612,
 'moa_ref_id': 612,
 'organism': 'Homo sapiens'}

In [22]:
list(result.keys())

['act_id',
 'struct_id',
 'target_id',
 'target_name',
 'target_class',
 'accession',
 'gene',
 'swissprot',
 'act_value',
 'act_unit',
 'act_type',
 'act_comment',
 'act_source',
 'relation',
 'moa',
 'moa_source',
 'act_source_url',
 'moa_source_url',
 'action_type',
 'first_in_class',
 'tdl',
 'act_ref_id',
 'moa_ref_id',
 'organism']

In [23]:
# ## see value enum

Q_DISTINCT = """
    SELECT DISTINCT organism
    FROM act_table_full
"""
result = conn.execute(text(Q_DISTINCT))
row = result.fetchall()

len(row)

276

In [24]:
Q_DISTINCT = """
    SELECT DISTINCT action_type
    FROM act_table_full
"""
result = conn.execute(text(Q_DISTINCT))
row = result.fetchall()

[i[0] for i in row]

['PHARMACOLOGICAL CHAPERONE',
 'INVERSE AGONIST',
 'ALLOSTERIC MODULATOR',
 'RELEASING AGENT',
 'ANTAGONIST',
 'PARTIAL AGONIST',
 'NEGATIVE ALLOSTERIC MODULATOR',
 'AGONIST',
 None,
 'POSITIVE MODULATOR',
 'GATING INHIBITOR',
 'BLOCKER',
 'NEGATIVE MODULATOR',
 'ACTIVATOR',
 'OTHER',
 'BINDING AGENT',
 'ANTIBODY BINDING',
 'ANTISENSE INHIBITOR',
 'ALLOSTERIC ANTAGONIST',
 'INHIBITOR',
 'OPENER',
 'POSITIVE ALLOSTERIC MODULATOR',
 'SUBSTRATE',
 'MODULATOR']

## act_table_full (pandas EDA)

In [23]:
## takes <10 s

df1 = pd.read_sql_table(table_name="act_table_full", con=conn)

In [24]:
df1.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20978 entries, 0 to 20977
Data columns (total 24 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   act_id          20978 non-null  int64  
 1   struct_id       20978 non-null  int64  
 2   target_id       20978 non-null  int64  
 3   target_name     20978 non-null  object 
 4   target_class    20978 non-null  object 
 5   accession       20978 non-null  object 
 6   gene            20580 non-null  object 
 7   swissprot       20978 non-null  object 
 8   act_value       19375 non-null  float64
 9   act_unit        0 non-null      object 
 10  act_type        19371 non-null  object 
 11  act_comment     15104 non-null  object 
 12  act_source      20978 non-null  object 
 13  relation        19123 non-null  object 
 14  moa             2866 non-null   float64
 15  moa_source      2865 non-null   object 
 16  act_source_url  1047 non-null   object 
 17  moa_source_url  2866 non-null  

In [25]:
## looking into accession

df1["accession"].nunique()

list(df1["accession"][0:5])

## look for delimiters: yes, there's pipe
check_if_contains(df1, "accession", delimiters)

## find the largest number of pipe-delimited IDs in a single row
n_longest = 0
for i in df1["accession"]:
    temp = i.split("|")
    if len(temp) > n_longest:
        n_longest = len(temp)
n_longest

## looking at accession when target isn't a protein (target_class, gene column NA)
# df1[df1["target_class"] == "RNA"]
# df1[df1["target_class"] == "Cytosolic other"]
# df1[df1["gene"].isna()]

3192

['Q92887',
 'O43193',
 'Q9NYW5',
 'Q07806|Q51504|Q51505|Q9X6V3|Q9X6V7|Q9X6W0',
 'P0AD68']

\| count: 1206


50

In [26]:
## looking into act_source

df1["act_source"].value_counts()

act_source
CHEMBL                   12581
WOMBAT-PK                 2845
DRUG MATRIX               2255
IUPHAR                    1294
SCIENTIFIC LITERATURE      773
PDSP                       734
DRUG LABEL                 353
DRUGBANK                    64
UNKNOWN                     59
KEGG DRUG                   20
Name: count, dtype: int64

In [27]:
## effect of filtering part 1

df1_filtered = df1[df1["action_type"].notna()].copy()

## sources with action_type value
df1_filtered.shape[0]
df1_filtered["act_source"].value_counts()

4360

act_source
IUPHAR                   1286
CHEMBL                   1011
WOMBAT-PK                 810
SCIENTIFIC LITERATURE     698
DRUG LABEL                337
DRUGBANK                   64
UNKNOWN                    54
DRUG MATRIX                50
PDSP                       30
KEGG DRUG                  20
Name: count, dtype: int64

In [28]:
## effect of filtering part 2

## also remove specific sources
removed_sources = "CHEMBL|IUPHAR|DRUGBANK|KEGG"

df1_filtered = df1_filtered[(~ df1_filtered["act_source"].str.contains(removed_sources, na=False))].copy()


df1_filtered.shape[0]

df1_filtered["action_type"].nunique()

df1_filtered["action_type"].value_counts().sort_index()

1979

22

action_type
ACTIVATOR                         18
AGONIST                          348
ALLOSTERIC ANTAGONIST              5
ALLOSTERIC MODULATOR               2
ANTAGONIST                       358
ANTIBODY BINDING                  80
ANTISENSE INHIBITOR                9
BINDING AGENT                     51
BLOCKER                          125
INHIBITOR                        798
INVERSE AGONIST                    7
MODULATOR                         35
NEGATIVE ALLOSTERIC MODULATOR      3
NEGATIVE MODULATOR                 3
OPENER                             5
OTHER                              1
PARTIAL AGONIST                   16
PHARMACOLOGICAL CHAPERONE          1
POSITIVE ALLOSTERIC MODULATOR     94
POSITIVE MODULATOR                14
RELEASING AGENT                    2
SUBSTRATE                          4
Name: count, dtype: int64

In [44]:
# df1[(df1["act_source"] == "PDSP") &
#      (df1["action_type"].notna())]

In [29]:
## without filtering

df1[(df1["action_type"].isna()) & (df1["act_value"].isna())].shape[0]

df1["action_type"].nunique()

df1["action_type"].value_counts().sort_index()

745

23

action_type
ACTIVATOR                          89
AGONIST                           957
ALLOSTERIC ANTAGONIST               6
ALLOSTERIC MODULATOR               41
ANTAGONIST                        868
ANTIBODY BINDING                  110
ANTISENSE INHIBITOR                 9
BINDING AGENT                      53
BLOCKER                           334
GATING INHIBITOR                   34
INHIBITOR                        1581
INVERSE AGONIST                    13
MODULATOR                          44
NEGATIVE ALLOSTERIC MODULATOR       9
NEGATIVE MODULATOR                  3
OPENER                             35
OTHER                               1
PARTIAL AGONIST                    23
PHARMACOLOGICAL CHAPERONE           3
POSITIVE ALLOSTERIC MODULATOR     111
POSITIVE MODULATOR                 22
RELEASING AGENT                     7
SUBSTRATE                           7
Name: count, dtype: int64

In [30]:
## Looking into act_source_url

## no delimiter
check_if_contains(df1, "act_source_url", [r"\|"])

In [31]:
## Looking into act_source_url

websites = set()
for i in df1["act_source_url"]:
    if i:
        temp = urlparse(i)
        base_url = f"{temp.scheme}://{temp.netloc}"
        websites.add(base_url)

website_counts = dict()
total_count = 0
for i in websites:
    temp = df1[df1["act_source_url"].str.startswith(i, na=False)].shape[0]
    website_counts[i] = temp
    total_count += temp

## sort
website_counts = {key: value for key, value in sorted(website_counts.items(), 
                               key=lambda item: item[1], reverse=True)}

## total number of rows with act_source_url value
total_count

website_counts

# df1[df1["act_source_url"].str.contains("https://doi.org", na=False)]

df1[df1["act_source_url"].notna()]["act_source"].value_counts()
df1["act_source"].value_counts()

# df1[df1["act_source_url"] == df1["moa_source_url"]].shape[0]

1047

{'https://pubmed.ncbi.nlm.nih.gov': 706,
 'https://www.accessdata.fda.gov': 170,
 'http://www.accessdata.fda.gov': 107,
 'https://www.ema.europa.eu': 18,
 'http://dx.doi.org': 14,
 'http://www.ema.europa.eu': 12,
 'https://www.pmda.go.jp': 10,
 'https://www.fda.gov': 4,
 'http://professional.diabetes.org': 2,
 'https://doi.org': 2,
 'http://eisai.jp': 1,
 'https://clinicaltrials.gov': 1}

act_source
SCIENTIFIC LITERATURE    731
DRUG LABEL               297
IUPHAR                    14
CHEMBL                     5
Name: count, dtype: int64

act_source
CHEMBL                   12581
WOMBAT-PK                 2845
DRUG MATRIX               2255
IUPHAR                    1294
SCIENTIFIC LITERATURE      773
PDSP                       734
DRUG LABEL                 353
DRUGBANK                    64
UNKNOWN                     59
KEGG DRUG                   20
Name: count, dtype: int64
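
A more compact way to get the per-site counts is to parse each url once and tally with `collections.Counter`. A minimal sketch on a small hand-made list: the two urls are copied from rows shown in this notebook, but the list itself is illustrative and `base_url` is a hypothetical helper.

```python
from collections import Counter
from urllib.parse import urlparse

def base_url(url):
    """Return scheme://netloc for a url string."""
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc}"

## illustrative stand-in for df1["act_source_url"]; None mimics a NULL value
urls = [
    "https://pubmed.ncbi.nlm.nih.gov/9875499",
    "http://eisai.jp/medical/products/di/EPI/CSP_C_EPI.pdf",
    None,
    "https://pubmed.ncbi.nlm.nih.gov/9875499",
]

## skip missing values, count each base url in one pass
site_counts = Counter(base_url(u) for u in urls if isinstance(u, str))

## most_common() sorts by count, descending
print(site_counts.most_common())
```

On the dataframe, something like `Counter(base_url(u) for u in df1["act_source_url"].dropna())` should give the same numbers as the loop above, though I haven't swapped it in.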

In [32]:
df1[df1["act_source_url"].str.contains("eisai", na=False)]

Unnamed: 0,act_id,struct_id,target_id,target_name,target_class,accession,gene,swissprot,act_value,act_unit,act_type,act_comment,act_source,relation,moa,moa_source,act_source_url,moa_source_url,action_type,first_in_class,tdl,act_ref_id,moa_ref_id,organism
5408,134517,1180,632,Catechol O-methyltransferase,Enzyme,P21964,COMT,COMT_HUMAN,,,,Mechanism of Action,DRUG LABEL,,1.0,DRUG LABEL,http://eisai.jp/medical/products/di/EPI/CSP_C_EPI.pdf,http://eisai.jp/medical/products/di/EPI/CSP_C_EPI.pdf,INHIBITOR,,Tclin,309.0,309.0,Homo sapiens


In [33]:
## bucket for other columns: queries with interesting results

# df1["target_id"].nunique()
# df1["target_name"].nunique()

# df1["target_class"].nunique()
# df1["target_class"].value_counts()

# df1["gene"].nunique()
# ## use raw string, escape char for special char
# delimiters = [",", ";", ":", "-", "_", r"\|"]
# check_if_contains(df1, "gene", delimiters)
# df1[df1["gene"].isna()]
# df1[(df1["gene"].isna()) & 
#     (df1["organism"] == "Homo sapiens")]

## looking at fields related to activity value
# df1[(df1["act_type"].notna()) & (df1["act_value"].notna())].shape[0]
# df1[(df1["act_type"].isna()) & (df1["act_value"].notna())]
# df1[(df1["act_type"].notna()) & (df1["act_value"].isna())]
# df1[(df1["relation"].notna()) & (df1["act_value"].isna())]

# df1[(df1["act_type"].isna()) & (df1["act_comment"].notna())].shape[0]

## looking at moa fields
# df1[(df1["moa_source"].isna()) & 
#     (df1["moa"].notna())]

# df1["act_ref_id"].nunique()
# df1["moa_ref_id"].nunique()
# df1[df1["act_ref_id"].notna()]
# df1[df1["act_ref_id"] == df1["moa_ref_id"]].shape[0]

# df1["organism"].nunique()

### Notes on columns

<mark>Highlight</mark> means this column is important for ingest

####  CHEMICAL SUBJECT

<mark>**struct_id**</mark>: loads as int! single DrugCentral drug ID (NodeNorm-supported!)
* NodeNorm currently has [4995 DrugCentral IDs in ChemicalEntity](https://nodenorm.ci.transltr.io/1.5/get_curie_prefixes?semantic_type=biolink%3AChemicalEntity). This matches drug count above the word "Drugs" on [DrugCentral's front page](https://drugcentral.org/).
* All previous parsers used this ID (rather than doing a join-type SQL query with other tables that may have external chem IDs) 

**first_in_class**: seems like a drug property. Only one value: 1 = yes? (loads as float, likely due to the NA) Don't see on drug bioactivity page.

#### PROTEIN OBJECT

**Key columns**:

<mark>**accession**</mark>: UniProtKB ID (protein), assuming for target. Pipe-delimited (1206 of 20978 rows contain a pipe); the longest has 50 IDs o_0
* Didn't find any placeholder for "no value": checked rows with the odd target_class values "RNA" and "Cytosolic other", and rows where gene is NA.
* some IDs have been removed or are scheduled for removal
* these are from LOTS of different species, which seems to correspond to the organism column
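
Since one row can carry several UniProtKB IDs, the parser will probably need to split `accession` on the pipe. A minimal sketch using sample values printed earlier in this notebook; the `UniProtKB:` prefix and the `split_accession` helper are my assumptions, not something taken from the database.

```python
def split_accession(accession, prefix="UniProtKB:"):
    """Split a pipe-delimited accession string into a list of prefixed IDs."""
    return [f"{prefix}{acc.strip()}" for acc in accession.split("|") if acc.strip()]

## values from the first rows of act_table_full
single = split_accession("Q92887")
multi = split_accession("Q07806|Q51504|Q51505|Q9X6V3|Q9X6V7|Q9X6W0")

print(single)
print(len(multi), multi[0], multi[-1])
```

Whether each ID then becomes its own edge, or the row is kept as one edge with multiple objects, is still an open design choice.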

---

**Other columns:**

**target_id**: loads as int. Seems like an internal ID

**target_name**: human-readable label. Oddly, it has fewer unique values than target_id, so some target_id values probably share the same name.

**target_class**: mix of structural/functional/location classes. Enum with >20 values. "RNA" all have UniProtKB IDs (for related protein/gene) and drug action_type is antisense inhibitors/binders (makes sense).  

**gene**: gene name, assuming for target protein. Pipe-delimited, has other special char. A few NA, but based on quick look, these are still proteins (just don't have a gene name?) and are non-human (organism not Homo sapiens).  

**swissprot**: UniProt name (should correspond to accession). Pipe-delimited. 

**tdl**: target property. Pipe-delimited (1 per accession in row?). Don't see on drug bioactivity section. 

#### ASSOCIATION

**Key columns**:

<mark>**act_source**</mark>: underlying source, assuming of the row/relationship. Corresponds to "Bioact source" column in webpage Bioactivity table. Enum

<mark>**act_source_url**</mark>: url for underlying source. Don't see it on webpages. Present on most (not all) rows with act_source "SCIENTIFIC LITERATURE" or "DRUG LABEL". And present on a few rows with other act_source values (but these will be filtered out). 
* Plan: process these and put them into publications (not source_record_urls, which doesn't make sense when DrugCentral is the primary source)
* find/replace these prefixes and leave the rest as urls:
  * https://pubmed.ncbi.nlm.nih.gov/ -> "PMID:"
  * http://dx.doi.org/ -> "DOI:"
  * https://doi.org/ -> "DOI:"
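
The find/replace rules above could be sketched like this. `url_to_publication` is a hypothetical helper name, and the sketch assumes the urls have no trailing slash or query string after the ID (I haven't checked every row for that).

```python
## prefix replacements from the notes above; anything else stays a plain url
URL_PREFIX_TO_CURIE = {
    "https://pubmed.ncbi.nlm.nih.gov/": "PMID:",
    "http://dx.doi.org/": "DOI:",
    "https://doi.org/": "DOI:",
}

def url_to_publication(url):
    """Swap a known url prefix for its CURIE prefix; return other urls unchanged."""
    for url_prefix, curie_prefix in URL_PREFIX_TO_CURIE.items():
        if url.startswith(url_prefix):
            return curie_prefix + url[len(url_prefix):]
    return url

## both urls appear in act_table_full rows shown earlier
print(url_to_publication("https://pubmed.ncbi.nlm.nih.gov/9875499"))
print(url_to_publication("http://eisai.jp/medical/products/di/EPI/CSP_C_EPI.pdf"))
```

Using `startswith` plus slicing (rather than `str.replace`) keeps the swap anchored to the start of the url.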

<mark>**action_type**</mark>: relationship (chemical effect on target). About 21% of rows (4360 of 20978) have a non-NULL value. Enum, has values that aren't in CHEMBL MOA. 

---

**Other columns (reordered for easier understanding):**

**act_id**: loads as int, unique for each row (nunique == n_rows). Seems to be a record/row ID. Don't see it on webpages

**act_comment**: looks like free text. Can apply to aspects besides act_type/act_value, because some are found on rows where those are NA. Don't see on webpage. 

**act_type**: type of activity value (ex: Ki, IC50, Kd, EC50 are major, but there's more). Enum-ish.
* Oddly, 1 row has this but NA act_value. And 5 rows are NA for this but have act_value. 

**relation**: assuming this is for act_type -> act_value. Don't see on webpage. Enum
* There are some rows with act_type/act_value but are NA for this. 
* There are some rows that don't have act_type but have this.
* Oddly, there are a few rows that don't have act_type + act_value, but have this field 

**act_value**: activity value (float). Some NA.
* Webpages say the "units" are `-log[M]` (log10, M=concentration in Molar of drug? For assay datapoint described by act_type) 
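
If the `-log[M]` reading is right (my assumption: act_value = -log10 of a molar concentration), converting back to a concentration is a one-liner. A minimal sketch with a hypothetical helper name, using the act_value from the example row earlier (IC50 for TOP1):

```python
def neg_log_molar_to_nanomolar(act_value):
    """Convert a -log10(molar) activity value to a concentration in nM."""
    molar = 10 ** (-act_value)
    return molar * 1e9

## act_value from the example row queried earlier in this notebook
ic50_nm = neg_log_molar_to_nanomolar(6.561)
print(round(ic50_nm, 1))
```

For 6.561 this comes out to roughly 275 nM. A higher act_value means a lower concentration, i.e. a more potent interaction.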

**act_unit**: All NA.

**moa**: Only one value: 1 = yes (loads as float, likely due to the NA). Corresponds to green-checkmark in "Mechanism action" column of webpage
* marks when this bioactivity is considered part of this drug's mechanism of action for intended therapeutic effect  

**moa_source**: different from act_source. Enum: slightly different and fewer items. See my internal GDoc notes (my digest of the paper) for how mechanisms of action are assigned.
* Oddly, there's 1 row with moa and NA moa_source (it does have moa_source_url and it's from CHEMBL). 

**moa_source_url**: url for moa_source. Usually different from act_source_url, with more/different base urls. See my internal GDoc notes (my digest of the paper) for how mechanisms of action are assigned.
* sometimes equal to act_source_url 

**act_ref_id**: seems like internal ID for references used in act_source_url, moa_source_url. (loads as float, likely due to the NA.) Don't see on drug bioactivity page.
* Figured out because when this is same as moa_ref_id, the url fields are the same too

**moa_ref_id**: see notes for act_ref_id. Don't see on drug bioactivity page

**organism**: organism name. Seems to correspond to protein (accession). Enum-ish (>200 unique values). Don't see on drug bioactivity page. 

### Filtered data, duplicates?

In [34]:
## action_type has value, filtered sources

main_columns = ["struct_id", "accession", "action_type", "act_source", "act_source_url"]
df1_filtered = df1_filtered[main_columns].copy()

df1_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1979 entries, 4 to 20977
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   struct_id       1979 non-null   int64 
 1   accession       1979 non-null   object
 2   action_type     1979 non-null   object
 3   act_source      1979 non-null   object
 4   act_source_url  940 non-null    object
dtypes: int64(1), object(4)
memory usage: 92.8+ KB


In [35]:
## no duplicates

main_triple = ["struct_id", "accession", "action_type"]

df1_filtered[df1_filtered.duplicated(subset=main_triple, keep=False)]

# df1_filtered.drop_duplicates(subset=main_triple)

Unnamed: 0,struct_id,accession,action_type,act_source,act_source_url


In [36]:
## shows there's pipe-delimited values
check_if_contains(df1_filtered, "accession", delimiters)

## find the largest number of pipe-delimited IDs in a single row
n_longest = 0
for i in df1_filtered["accession"]:
    temp = i.split("|")
    if len(temp) > n_longest:
        n_longest = len(temp)
n_longest

\| count: 249


40

In [37]:
df1_filtered["action_type"].nunique()

df1_filtered["action_type"].value_counts().sort_index()

22

action_type
ACTIVATOR                         18
AGONIST                          348
ALLOSTERIC ANTAGONIST              5
ALLOSTERIC MODULATOR               2
ANTAGONIST                       358
ANTIBODY BINDING                  80
ANTISENSE INHIBITOR                9
BINDING AGENT                     51
BLOCKER                          125
INHIBITOR                        798
INVERSE AGONIST                    7
MODULATOR                         35
NEGATIVE ALLOSTERIC MODULATOR      3
NEGATIVE MODULATOR                 3
OPENER                             5
OTHER                              1
PARTIAL AGONIST                   16
PHARMACOLOGICAL CHAPERONE          1
POSITIVE ALLOSTERIC MODULATOR     94
POSITIVE MODULATOR                14
RELEASING AGENT                    2
SUBSTRATE                          4
Name: count, dtype: int64

In [38]:
df1_filtered["act_source"].nunique()

df1_filtered["act_source"].value_counts().sort_index()

6

act_source
DRUG LABEL               337
DRUG MATRIX               50
PDSP                      30
SCIENTIFIC LITERATURE    698
UNKNOWN                   54
WOMBAT-PK                810
Name: count, dtype: int64

In [39]:
## Looking into act_source_url

websites = set()
for i in df1_filtered["act_source_url"]:
    if i:
        temp = urlparse(i)
        base_url = f"{temp.scheme}://{temp.netloc}"
        websites.add(base_url)

website_counts = dict()
total_count = 0
for i in websites:
    temp = df1_filtered[df1_filtered["act_source_url"].str.startswith(i, na=False)].shape[0]
    website_counts[i] = temp
    total_count += temp

## sort
website_counts = {key: value for key, value in sorted(website_counts.items(), 
                               key=lambda item: item[1], reverse=True)}

## total number of rows with act_source_url value
total_count

website_counts

940

{'https://pubmed.ncbi.nlm.nih.gov': 626,
 'https://www.accessdata.fda.gov': 166,
 'http://www.accessdata.fda.gov': 90,
 'https://www.ema.europa.eu': 18,
 'http://dx.doi.org': 14,
 'http://www.ema.europa.eu': 12,
 'https://www.pmda.go.jp': 7,
 'https://www.fda.gov': 2,
 'https://doi.org': 2,
 'http://eisai.jp': 1,
 'http://professional.diabetes.org': 1,
 'https://clinicaltrials.gov': 1}
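
Some hosts above are counted twice because only the scheme differs (for example `http://www.accessdata.fda.gov` and `https://www.accessdata.fda.gov`). A minimal sketch of counting by host only, ignoring the scheme. `count_hosts` and the example URLs are made up for illustration and are not part of the parser:

```python
from collections import Counter
from urllib.parse import urlparse


def count_hosts(urls):
    """Count URLs per host (netloc), ignoring http vs https."""
    counts = Counter()
    for url in urls:
        # skip missing values (None / NaN) and empty strings
        if not isinstance(url, str) or not url:
            continue
        counts[urlparse(url).netloc.lower()] += 1
    return counts


# illustrative URLs only, not real rows from act_source_url
example_urls = [
    "https://www.accessdata.fda.gov/a",
    "http://www.accessdata.fda.gov/b",
    "https://pubmed.ncbi.nlm.nih.gov/123",
    None,
]
host_counts = count_hosts(example_urls)
```

With the real column, something like `count_hosts(df1_filtered["act_source_url"])` should merge the http/https pairs into one entry per host.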

## omop_relationship_doid_view (SQL/pandas EDA)

Comparing the two options: the table omop_relationship and the view omop_relationship_doid_view.

In [11]:
## same row count

q_count_rows = """
    SELECT COUNT(*) 
    FROM omop_relationship
"""
result = conn.execute(text(q_count_rows))
result.fetchone()

q_count_rows = """
    SELECT COUNT(*) 
    FROM omop_relationship_doid_view
"""
result = conn.execute(text(q_count_rows))
result.fetchone()

(42307,)

(42307,)

In [12]:
## rows look similar, just with doid field added

Q_EXAMPLE = """
    SELECT * 
    FROM omop_relationship 
    LIMIT 1
"""
result = conn.execute(text(Q_EXAMPLE))
row = result.fetchall()
{k:v for k,v in zip(result.keys(), row[0])}

Q_EXAMPLE = """
    SELECT * 
    FROM omop_relationship_doid_view 
    LIMIT 1
"""
result = conn.execute(text(Q_EXAMPLE))
row = result.fetchall()
{k:v for k,v in zip(result.keys(), row[0])}

{'id': 174026,
 'struct_id': 5391,
 'concept_id': 40249429,
 'relationship_name': 'indication',
 'concept_name': 'Triple negative breast neoplasms',
 'umls_cui': 'C3539878',
 'snomed_full_name': 'Triple negative breast neoplasms',
 'cui_semantic_type': 'T191',
 'snomed_conceptid': 706970001}

{'id': 174026,
 'struct_id': 5391,
 'concept_id': 40249429,
 'relationship_name': 'indication',
 'concept_name': 'Triple negative breast neoplasms',
 'umls_cui': 'C3539878',
 'snomed_full_name': 'Triple negative breast neoplasms',
 'cui_semantic_type': 'T191',
 'snomed_conceptid': 706970001,
 'doid': None}
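
Rather than comparing the two printed rows by eye, the column difference can be computed. A small sketch using the column names copied from the two example rows above (the variable names are just for illustration):

```python
# column names copied from the example row of the table
table_columns = [
    "id", "struct_id", "concept_id", "relationship_name", "concept_name",
    "umls_cui", "snomed_full_name", "cui_semantic_type", "snomed_conceptid",
]
# column names copied from the example row of the view
view_columns = [
    "id", "struct_id", "concept_id", "relationship_name", "concept_name",
    "umls_cui", "snomed_full_name", "cui_semantic_type", "snomed_conceptid",
    "doid",
]

only_in_view = [c for c in view_columns if c not in table_columns]
only_in_table = [c for c in table_columns if c not in view_columns]
```

In the notebook, the same check could be run on `result.keys()` from each query instead of hard-coded lists.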

In [13]:
## parser
## takes ~3 sec

engine = create_engine(server_url)

with engine.connect() as db_conn: 
    df2 = pd.read_sql_table(table_name="omop_relationship_doid_view", con=db_conn)

In [14]:
df2.head()

Unnamed: 0,id,struct_id,concept_id,relationship_name,concept_name,umls_cui,snomed_full_name,cui_semantic_type,snomed_conceptid,doid
0,144492,564,21000286,indication,Gonococcal meningitis,C0153225,Gonococcal meningitis,T047,151004.0,
1,169224,559,21000286,indication,Gonococcal meningitis,C0153225,Gonococcal meningitis,T047,151004.0,
2,164652,1572,21001507,contraindication,Heart valve disorder,C0018824,Heart valve disorder,T047,368009.0,DOID:4079
3,161984,2770,21001507,contraindication,Heart valve disorder,C0018824,Heart valve disorder,T047,368009.0,DOID:4079
4,170085,1968,21001507,contraindication,Heart valve disorder,C0018824,Heart valve disorder,T047,368009.0,DOID:4079


In [15]:
df2.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42307 entries, 0 to 42306
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 42307 non-null  int64  
 1   struct_id          42307 non-null  int64  
 2   concept_id         42307 non-null  int64  
 3   relationship_name  42307 non-null  object 
 4   concept_name       42307 non-null  object 
 5   umls_cui           38432 non-null  object 
 6   snomed_full_name   38430 non-null  object 
 7   cui_semantic_type  38431 non-null  object 
 8   snomed_conceptid   38403 non-null  float64
 9   doid               14715 non-null  object 
dtypes: float64(1), int64(3), object(6)
memory usage: 15.0 MB


In [16]:
## has duplicated main triples

main_triple = ["struct_id", "relationship_name", "umls_cui"]

df2[df2.duplicated(subset=main_triple, keep=False)].shape[0]

3282

In [17]:
## EDA on other columns

df2["id"].nunique()

df2["concept_id"].nunique()
df2["concept_name"].nunique()

42307

3917

3998

In [18]:
df2["struct_id"].nunique()

2795

In [19]:
## relationship_name

df2["relationship_name"].value_counts()

df2[df2["relationship_name"] == "diagnosis"]

relationship_name
contraindication         27731
indication               12047
off-label use             2525
symptomatic treatment        2
reduce risk                  1
diagnosis                    1
Name: count, dtype: int64

Unnamed: 0,id,struct_id,concept_id,relationship_name,concept_name,umls_cui,snomed_full_name,cui_semantic_type,snomed_conceptid,doid
36618,173422,5271,40249335,diagnosis,Adult growth hormone deficiency,C1720505,Adult growth hormone deficiency,T047,421684006.0,


### DiseaseOrPheno ID fields

There are 3 fields with DiseaseOrPheno IDs

In [20]:
print(f"UMLS: {df2['umls_cui'].count()} rows, {df2['umls_cui'].nunique()} unique IDs\n")

print(f"SNOMEDCT: {df2['snomed_conceptid'].count()} rows, {df2['snomed_conceptid'].nunique()} unique IDs")
print(f"snomed unique names: {df2['snomed_full_name'].nunique()}\n")

print(f"DOID: {df2['doid'].count()} rows, {df2['doid'].nunique()} unique IDs")

UMLS: 38432 rows, 2612 unique IDs

SNOMEDCT: 38403 rows, 2642 unique IDs
snomed unique names: 2668

DOID: 14715 rows, 808 unique IDs


In [21]:
## UMLS - look for delimiters

## use raw string, escape char for special char
delimiters = [",", ";", ":", "-", "_", r"\|", " "]

## UMLS: 1 ID has a space - stripping whitespace should fix that  
print("UMLS:")
check_if_contains(df2, "umls_cui", delimiters)

df2[df2["umls_cui"].str.contains(" ", na=False)]

UMLS:
  count: 1


Unnamed: 0,id,struct_id,concept_id,relationship_name,concept_name,umls_cui,snomed_full_name,cui_semantic_type,snomed_conceptid,doid
17614,174193,760,40249439,indication,Kawasaki's disease,C002669,Acute febrile mucocutaneous lymph node syndrome,T047,75053002.0,DOID:13378
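
Stripping whitespace removes the space, but it may be worth also checking the ID format afterwards. A hedged sketch, assuming UMLS CUIs are "C" followed by seven digits; `clean_cui` is a hypothetical helper, not part of the parser. Note that the value as displayed above (`C002669`) has only six digits, so it would still be flagged after stripping:

```python
import re

# assumption: a UMLS CUI is "C" followed by exactly seven digits
CUI_PATTERN = re.compile(r"C\d{7}")


def clean_cui(raw):
    """Strip whitespace and report whether the result looks like a CUI."""
    value = raw.strip()
    return value, CUI_PATTERN.fullmatch(value) is not None


ok_example = clean_cui(" C0153225 ")   # ID from df2.head(), padded for the demo
bad_example = clean_cui("C002669 ")    # value as displayed in the row above
```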


In [22]:
## DOID - look for delimiters

## SNOMEDCT doesn't have delimiters - can tell because it's float type

## take out colon since it's in ID
## use raw string, escape char for special char
delimiters = [",", ";", "-", "_", r"\|", " "]

## DOID: some values hold several comma-delimited IDs, so they would need splitting
print("DOID:")
check_if_contains(df2, "doid", delimiters)

df2[df2["doid"].str.contains(",", na=False)][["doid", "concept_name"]].drop_duplicates()

DOID:
, count: 35


Unnamed: 0,doid,concept_name
4098,"DOID:13250,DOID:12384",Infectious diarrheal disease
8603,"DOID:1936,DOID:2348",Atherosclerosis
22110,"DOID:5603,DOID:0050523",Relapsed or refractory T-cell leukemia-lymphoma
24695,"DOID:0050747,DOID:0060901,DOID:9080",Waldenström macroglobulinemia
25050,"DOID:450,DOID:0050759",Myotonic disorder
36669,"DOID:3240,DOID:0050152",Aspiration pneumonia
36703,"DOID:146,DOID:10175",Optic disc edema
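
If the doid column were used, the comma-delimited values would need splitting into single IDs. A minimal sketch; `split_doids` is a hypothetical helper, and the `DOID:` plus digits pattern is an assumption based on the values shown above:

```python
import re

# assumption: every ID looks like "DOID:" followed by digits
DOID_PATTERN = re.compile(r"DOID:\d+")


def split_doids(value):
    """Split a ','-delimited doid value into a list of single IDs."""
    if not isinstance(value, str):
        return []  # missing value (None / NaN)
    ids = [part.strip() for part in value.split(",") if part.strip()]
    bad = [i for i in ids if not DOID_PATTERN.fullmatch(i)]
    if bad:
        raise ValueError(f"unexpected DOID format: {bad}")
    return ids


multi = split_doids("DOID:0050747,DOID:0060901,DOID:9080")
single = split_doids("DOID:4079")
missing = split_doids(None)
```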


**Comparing coverage**

In [23]:
## SNOMEDCT covers all DOID

df2[(df2["snomed_conceptid"].isna()) &
    (df2["doid"].notna())].shape[0]

0

In [24]:
## UMLS covers all SNOMEDCT, then a little more

## SNOMEDCT, but no UMLS
df2[(df2["umls_cui"].isna()) &
    (df2["snomed_conceptid"].notna())].shape[0]

## UMLS, but no SNOMEDCT
df2[(df2["umls_cui"].notna()) &
    (df2["snomed_conceptid"].isna())].shape[0]

df2[(df2["umls_cui"].notna()) &
    (df2["snomed_conceptid"].isna())][["umls_cui", "concept_name"]].drop_duplicates()

0

29

Unnamed: 0,umls_cui,concept_name
38767,C3888500,Ulcerated haemangioma
39877,C0855112,Diffuse large B-cell lymphoma refractory
39895,C2349261,Relapse multiple myeloma
39898,C4288754,Metastatic urothelial carcinoma
39902,C4076057,Urinary tract infection caused by Klebsiella
39916,C0278987,Metastatic non-small cell lung cancer
39925,C4303663,Extensive stage primary small cell carcinoma of lung
39928,C4721209,Metastatic human epidermal growth factor 2 positive carc...
40023,C5244027,Pneumonia caused by Severe acute respiratory syndrome co...
40086,C5419539,Unresectable uveal melanoma


In [25]:
## 1 weird row with UMLS ID but no semantic type

df2["cui_semantic_type"].nunique()

## UMLS, but no semantic type
df2[(df2["umls_cui"].notna()) &
    (df2["cui_semantic_type"].isna())].shape[0]

df2[(df2["umls_cui"].notna()) &
    (df2["cui_semantic_type"].isna())]

# ## semantic type but no umls: 0
# df2[(df2["umls_cui"].isna()) &
#     (df2["cui_semantic_type"].notna())].shape[0]

35

1

Unnamed: 0,id,struct_id,concept_id,relationship_name,concept_name,umls_cui,snomed_full_name,cui_semantic_type,snomed_conceptid,doid
40218,175843,4806,40249866,contraindication,Monoamine oxidase inhibitors,C0026457,,,,


**NOTES ON COLUMNS (reordered for easier understanding)**

<mark>**struct_id**</mark>: loads as int! DrugCentral drug ID (NodeNorm-supported!)

<mark>**relationship_name**</mark>: Enum

<mark>**umls_cui**</mark>: DiseaseOrPheno ID. Single values (no delimiters), but whitespace needs stripping. These IDs aren't included on webpages (although the concepts are). 

<mark>**cui_semantic_type**</mark>: semantic type for the UMLS ID; use for filtering out non-DoP objects. 35 unique values. Don't see it on webpages.

Other columns:
- **id**: loads as int, unique for each row (nunique == n_rows). Seems to be a record/row ID. Don't see it on webpages.
- **concept_id**: loads as int. Seems to be an internal ID for the disease/pheno. Don't see it on webpages.
- **concept_name**: human-readable label; could use for filtering out non-DoP objects. Oddly, its nunique (3998) is higher than the nunique for concept_id (3917), so some IDs must have more than one name.
- **snomed_conceptid**: DiseaseOrPheno ID. Loads as float, likely due to the NA values. [ID format is actually int](https://bioregistry.io/registry/snomedct). Should be single values since it's numeric. (NodeNorm-supported!) Every row with a SNOMEDCT ID also has a UMLS ID.
- **snomed_full_name**: human-readable label. Oddly, its nunique (2668) is higher than the nunique for snomed_conceptid (2642), so maybe some IDs have more than one name?
- **doid**: DiseaseOrPheno ID. ","-delimited. Has Translator-standard prefix already. (NodeNorm-supported!) I trust this ID the most to be the correct type. Every row with a DOID also has SNOMEDCT and UMLS IDs.
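
The float loading of snomed_conceptid can be undone with pandas' nullable integer dtype. A sketch only, since the parser here uses umls_cui rather than SNOMEDCT. The example values come from `df2.head()`, and the `SNOMEDCT:` prefix is an assumption:

```python
import pandas as pd

# floats with a missing value, like the column after read_sql_table
snomed = pd.Series([151004.0, 368009.0, None], name="snomed_conceptid")

# nullable integer dtype keeps the NA while dropping the ".0"
snomed_int = snomed.astype("Int64")

# build prefixed IDs, leaving missing values as None
snomed_curies = [
    None if pd.isna(v) else f"SNOMEDCT:{v}"
    for v in snomed_int
]
```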

### Filtered data, duplicates?

In [91]:
## parser but adjusted there

## only keep main columns
main_columns = ["struct_id", "relationship_name", "umls_cui",
                "concept_name", "cui_semantic_type"]
df2_filtered = df2[main_columns].copy()

In [92]:
## for parser log

", ".join(main_columns)

'struct_id, relationship_name, umls_cui, concept_name, cui_semantic_type'

In [93]:
## parser but adjusted there to drop based on entire df, any NA

## drop if don't have umls ID
df2_filtered.dropna(subset="umls_cui", ignore_index=True, inplace=True)

In [94]:
df2_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38432 entries, 0 to 38431
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   struct_id          38432 non-null  int64 
 1   relationship_name  38432 non-null  object
 2   umls_cui           38432 non-null  object
 3   concept_name       38432 non-null  object
 4   cui_semantic_type  38431 non-null  object
dtypes: int64(1), object(4)
memory usage: 1.5+ MB


There are duplicate rows in this dataset: same main triple (same ID) but different concept_name.

In [95]:
df2_filtered["concept_name"].nunique()

df2_filtered["umls_cui"].nunique()

2837

2612

In [96]:
## duplicate rows

main_triple = ["struct_id", "relationship_name", "umls_cui"]

df2_filtered[df2_filtered.duplicated(subset=main_triple, keep=False)].shape[0]

df2_filtered[df2_filtered.duplicated(subset=main_triple, keep=False)].sort_values(by=main_triple)

216

Unnamed: 0,struct_id,relationship_name,umls_cui,concept_name,cui_semantic_type
14408,24,indication,C0751753,"Congenital hyperammonemia, type I",T047
38225,24,indication,C0751753,Deficiency of carbamylphosphate synthetase (CPS),T047
15874,56,off-label use,C0033845,Benign intracranial hypertension,T047
15877,56,off-label use,C0033845,Idiopathic intracranial hypertension,T047
23261,144,off-label use,C0876926,Cognitive impairment following traumatic brain injury,T037
...,...,...,...,...,...
37454,5514,indication,C0860594,Unresectable or metastatic melanoma,T191
37459,5514,indication,C0860594,Advanced melanoma with tumour cell PD-L1 expression below 1%,T191
28678,5710,indication,C0006142,HER2-negative advanced or metastatic breast cancer,T191
28680,5710,indication,C0006142,ESR1-mutated advanced or metastatic breast cancer,T191
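
The parser later drops concept_name and de-duplicates. If the names were worth keeping instead, one option is to collapse each duplicated triple into a single row with its names joined. A sketch using the first rows from the output above; the `" | "` separator is an arbitrary choice:

```python
import pandas as pd

# first three rows from the duplicate-rows output above
rows = pd.DataFrame({
    "struct_id": [24, 24, 56],
    "relationship_name": ["indication", "indication", "off-label use"],
    "umls_cui": ["C0751753", "C0751753", "C0033845"],
    "concept_name": [
        "Congenital hyperammonemia, type I",
        "Deficiency of carbamylphosphate synthetase (CPS)",
        "Benign intracranial hypertension",
    ],
})

main_triple = ["struct_id", "relationship_name", "umls_cui"]

# one row per triple, with the distinct names joined into one string
merged = (
    rows.groupby(main_triple, as_index=False, sort=False)["concept_name"]
    .agg(lambda names: " | ".join(sorted(set(names))))
)
```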


### Filtering objects

[currently tied to the biolink-model master branch]

Want rows where the UMLS semantic type maps to biolink-model DoP ("disease or phenotypic feature") or one of its descendants

In [97]:
## parser but adjusted

## setup biolink-model toolkit

bl_schema = "https://raw.githubusercontent.com/biolink/biolink-model/refs/heads/master/biolink-model.yaml"
tk = bmt.Toolkit(schema=bl_schema)

In [98]:
## EDA only

full_semantic_mapping = dict()
for i in df2_filtered["cui_semantic_type"].dropna().unique():
    full_semantic_mapping[i] = tk.get_element_by_mapping("STY:" + i)
    
full_semantic_mapping

{'T047': 'disease',
 'T048': 'disease',
 'T046': 'pathological process',
 'T184': 'phenotypic feature',
 'T037': 'pathological process',
 'T130': 'diagnostic aid',
 'T060': 'procedure',
 'T033': 'disease or phenotypic feature',
 'T061': 'procedure',
 'T034': 'phenomenon',
 'T042': 'physiological process',
 'T191': 'disease',
 'T019': 'disease',
 'T074': 'device',
 'T125': 'small molecule',
 'T084': None,
 'T020': 'disease',
 'T190': 'phenotypic feature',
 'T040': 'physiological process',
 'T079': 'information content entity',
 'T059': 'procedure',
 'T041': 'behavior',
 'T167': 'chemical entity',
 'T002': 'plant',
 'T131': 'chemical entity',
 'T204': None,
 'T007': 'bacterium',
 'T058': 'activity',
 'T121': 'drug',
 'T109': 'small molecule',
 'T091': None,
 'T201': 'clinical attribute',
 'T080': 'information content entity',
 'T116': 'polypeptide',
 'T129': 'biological entity'}

In [99]:
## parser but adjusted

## get categories that we want for object (DoP and descendants)

dop_descendants = tk.get_descendants("disease or phenotypic feature")
dop_descendants

['disease or phenotypic feature',
 'disease',
 'phenotypic feature',
 'behavioral feature',
 'clinical finding']

In [100]:
## parser but adjusted

## get umls semantic types that map to these categories

dop_semantic_types = list()
dop_semantic_mapping = dict()

## go through semantic types in data (without NA)
## easier to read this way rather than list-comprehension
for i in df2_filtered["cui_semantic_type"].dropna().unique():
    temp = tk.get_element_by_mapping("STY:" + i)
    if temp in dop_descendants:
        dop_semantic_types.append(i)
        dop_semantic_mapping[i] = temp

dop_semantic_types

dop_semantic_mapping

['T047', 'T048', 'T184', 'T033', 'T191', 'T019', 'T020', 'T190']

{'T047': 'disease',
 'T048': 'disease',
 'T184': 'phenotypic feature',
 'T033': 'disease or phenotypic feature',
 'T191': 'disease',
 'T019': 'disease',
 'T020': 'disease',
 'T190': 'phenotypic feature'}
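
The filtering logic can be checked without loading the toolkit by hard-coding a few entries from the mappings printed above. A sketch only: the dict is a partial copy of that output, and the example records are illustrative (the second one is made up):

```python
# partial copy of the semantic type -> biolink category output above
semantic_to_biolink = {
    "T047": "disease",
    "T184": "phenotypic feature",
    "T033": "disease or phenotypic feature",
    "T121": "drug",
    "T061": "procedure",
    "T084": None,  # no mapping found
}

# copy of the get_descendants output above
dop_categories = {
    "disease or phenotypic feature",
    "disease",
    "phenotypic feature",
    "behavioral feature",
    "clinical finding",
}

# semantic types whose category is DoP or a descendant
dop_types = [sty for sty, cat in semantic_to_biolink.items()
             if cat in dop_categories]

# (struct_id, umls_cui, cui_semantic_type); second record is hypothetical
records = [
    (564, "C0153225", "T047"),
    (1, "C0000000", "T121"),
]
kept = [r for r in records if r[2] in dop_types]
```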

In [101]:
## parser but adjusted

## filter - only rows with these semantic types 
df2_filtered = df2_filtered[df2_filtered["cui_semantic_type"].isin(dop_semantic_types)].copy()

In [102]:
df2_filtered.shape[0]

df2_filtered

33901

Unnamed: 0,struct_id,relationship_name,umls_cui,concept_name,cui_semantic_type
0,564,indication,C0153225,Gonococcal meningitis,T047
1,559,indication,C0153225,Gonococcal meningitis,T047
2,1572,contraindication,C0018824,Heart valve disorder,T047
3,2770,contraindication,C0018824,Heart valve disorder,T047
4,1968,contraindication,C0018824,Heart valve disorder,T047
...,...,...,...,...,...
38427,5393,indication,C0278987,Metastatic non-small cell lung cancer,T191
38428,5404,indication,C0855112,Diffuse large B-cell lymphoma refractory,T191
38429,5253,indication,C4076057,Urinary tract infection caused by Klebsiella,T047
38430,5392,indication,C0278987,Metastatic non-small cell lung cancer,T191


In [103]:
## went through TTD list, didn't find any problematic concept names

df2_filtered[df2_filtered["concept_name"].str.contains("canine")]

Unnamed: 0,struct_id,relationship_name,umls_cui,concept_name,cui_semantic_type


<div class="alert alert-block alert-danger">

**Problematic objects after running pipeline**:
    
<mark>**DECIDED TO FILTER OUT**</mark>

<br>    
    
**C0085228** (Fluvoxamine, NodeNormed to CHEBI:5138): doesn't match predicate or Association range
* record is actually true - fluvoxamine is contraindicated for patients who need to take 4224 (Pirfenidone)
* hard to catch beforehand. Had to see it after NodeNorming/validation
* ?? in future, could catch and make a diff association based on relationship_name? Larger questions on how to model these drug-drug contraindications/problematic interactions
  
**C0022650** (Kidney Calculi aka kidney stones, NodeNorm maps to AnatomicalEntity): doesn't match Association range
* maybe a NodeNorm issue, should be a phenotypic feature instead? It is a medical issue and not normal/desired anatomy...
* these relationships seem odd and not well-known when I review them. And they're all contraindications
  * 2728 (triamterene) is a true contraindication (can cause kidney stones)

In [104]:
df2_filtered[df2_filtered["umls_cui"] == "C0085228"]

df2_filtered[df2_filtered["umls_cui"] == "C0022650"].relationship_name.unique()
df2_filtered[df2_filtered["umls_cui"] == "C0022650"]

Unnamed: 0,struct_id,relationship_name,umls_cui,concept_name,cui_semantic_type
32972,4224,contraindication,C0085228,Fluvoxamine,T047


array(['contraindication'], dtype=object)

Unnamed: 0,struct_id,relationship_name,umls_cui,concept_name,cui_semantic_type
21702,4129,contraindication,C0022650,Kidney stone,T047
21703,323,contraindication,C0022650,Kidney stone,T047
21704,4348,contraindication,C0022650,Kidney stone,T047
21705,2166,contraindication,C0022650,Kidney stone,T047
21706,2728,contraindication,C0022650,Kidney stone,T047
...,...,...,...,...,...
21766,4243,contraindication,C0022650,Kidney stone,T047
21767,4288,contraindication,C0022650,Kidney stone,T047
21768,2144,contraindication,C0022650,Kidney stone,T047
21769,4511,contraindication,C0022650,Kidney stone,T047


In [107]:
## parser but adjusted

problematic_objects = [
    "C0085228",   ## Fluvoxamine (drug): doesn't match predicate (contraindicated) or Association range
    "C0022650",   ## Kidney Calculi (aka stones, NodeNorm maps to AnatomicalEntity): doesn't match Association range
]

## how many rows affected
df2_filtered[df2_filtered["umls_cui"].isin(problematic_objects)].shape[0]

70

In [108]:
## remove rows

df2_filtered = df2_filtered[~ df2_filtered["umls_cui"].isin(problematic_objects)].copy()

In [109]:
df2_filtered.shape[0]

33831

### Other parser dev

In [82]:
## duplicate rows check

main_triple = ["struct_id", "relationship_name", "umls_cui"]

df2_filtered[df2_filtered.duplicated(subset=main_triple, keep=False)].shape[0]

# df2_filtered[df2_filtered.duplicated(subset=main_triple, keep=False)].sort_values(by=main_triple)

201

In [83]:
## parser but adjusted

## remove whitespace: good to do before removing duplicates, adding prefix
df2_filtered["umls_cui"] = df2_filtered["umls_cui"].str.strip()

In [84]:
## fixes whitespace that was there before

## use raw string, escape char for special char
delimiters = [",", ";", ":", "-", "_", r"\|", " "]

## UMLS: 1 ID has a space - stripping whitespace should fix that  
print("UMLS:")
check_if_contains(df2_filtered, "umls_cui", delimiters)

UMLS:


In [85]:
## parser but adjusted

## drop columns other than main triple
df2_filtered.drop(columns=["cui_semantic_type", "concept_name"], inplace=True)

In [86]:
## parser but adjusted

df2_filtered.drop_duplicates(inplace=True, ignore_index=True)

In [87]:
## check after

df2_filtered.info()

df2_filtered.head()

df2_filtered["relationship_name"].value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33724 entries, 0 to 33723
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   struct_id          33724 non-null  int64 
 1   relationship_name  33724 non-null  object
 2   umls_cui           33724 non-null  object
dtypes: int64(1), object(2)
memory usage: 790.5+ KB


Unnamed: 0,struct_id,relationship_name,umls_cui
0,564,indication,C0153225
1,559,indication,C0153225
2,1572,contraindication,C0018824
3,2770,contraindication,C0018824
4,1968,contraindication,C0018824


relationship_name
contraindication         23135
indication                8749
off-label use             1837
symptomatic treatment        1
reduce risk                  1
diagnosis                    1
Name: count, dtype: int64

In [89]:
## parser but adjusted

## generating source_record_url
f"https://drugcentral.org/drugcard/{df2_filtered['struct_id'][0]}#druguse"

## generating final ID
f"DRUGCENTRAL:{df2_filtered['struct_id'][0]}"

'https://drugcentral.org/drugcard/564#druguse'

'DRUGCENTRAL:564'
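
The same string patterns could be wrapped in small helpers so that every row is formatted the same way. A sketch: the function names are hypothetical, and the `UMLS:` prefix for the object ID is an assumption (the notebook only mentions "adding prefix"):

```python
def drugcentral_curie(struct_id):
    """Subject ID, same pattern as the cell above."""
    return f"DRUGCENTRAL:{struct_id}"


def drug_use_url(struct_id):
    """source_record_url, same pattern as the cell above."""
    return f"https://drugcentral.org/drugcard/{struct_id}#druguse"


def umls_curie(cui):
    """Object ID; the 'UMLS:' prefix is an assumption."""
    return f"UMLS:{cui.strip()}"


subject_id = drugcentral_curie(564)
record_url = drug_use_url(564)
object_id = umls_curie("C0153225")
```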

In [90]:
df2_filtered["struct_id"].nunique()

df2_filtered["umls_cui"].nunique()

df2_filtered["struct_id"].nunique() + df2_filtered["umls_cui"].nunique()

2628

2245

4873

## Close connection

In [23]:
conn.rollback()

In [None]:
conn.close()

## Other tables (SQL/pandas EDA)

### action_type

In [47]:
q_count_rows = """
    SELECT COUNT(*) 
    FROM action_type
"""
result = conn.execute(text(q_count_rows))
result.fetchone()

(33,)

In [48]:
df_at = pd.read_sql_table(table_name="action_type", con=conn)

<div class="alert alert-block alert-danger">

CLEAR OR LIMIT OUTPUT of next code block. The text has typos that codespell will complain about: 
* `targeting` with an extra "t" before the suffix
* `positively` missing the first "i"

In [50]:
pd.set_option('display.max_colwidth', None)

df_at.sort_values("action_type")[0:5]

Unnamed: 0,id,action_type,description,parent_type
9,1,ACTIVATOR,"Positively effects the normal functioning of the protein e.g., activation of an enzyme or cleaving a clotting protein precursor",POSITIVE MODULATOR
10,2,AGONIST,"Binds to and activates a receptor, often mimicking the effect of the endogenous ligand",POSITIVE MODULATOR
6,28,ALKYLATING AGENT,Introduce alkyl radicals into biologically active molecules and thereby prevent their proper functioning,OTHER
11,3,ALLOSTERIC ANTAGONIST,Binds to a receptor at an allosteric site and prevents activation by a positive allosteric modulator at that site,NEGATIVE MODULATOR
0,33,ALLOSTERIC MODULATOR,Allosteric modulator is a substance which indirectly modulates the effects of a receptor agonist or inverse agonist at its receptor protein target.,OTHER


### data_source

Lists data sources

In [64]:
q_count_rows = """
    SELECT COUNT(*) 
    FROM data_source
"""
result = conn.execute(text(q_count_rows))
result.fetchone()

(13,)

In [65]:
df_ds = pd.read_sql_table(table_name="data_source", con=conn)

In [66]:
df_ds.sort_values("source_name")

Unnamed: 0,src_id,source_name
1,11,BINDINGDB
9,2,CHEMBL
4,8,DRUG LABEL
6,6,DRUG MATRIX
12,5,DRUGBANK
3,9,EXPERT CURATOR
10,3,IUPHAR
0,12,KEGG DRUG
2,10,NDFRT
11,4,PDSP


### vetomop

In [67]:
q_count_rows = """
    SELECT COUNT(*) 
    FROM vetomop
"""
result = conn.execute(text(q_count_rows))
result.fetchone()

(2581,)

In [68]:
df_veto = pd.read_sql_table(table_name="vetomop", con=conn)

**NOTES**

Drug indications for non-human animals (vet)

struct_id is DrugCentral drug ID

relationship_type: only value is Indication

**Problems**
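* omopid is likely a disease/pheno ID -> needs mapping to another ID namespace (maybe via another table?). But it is unique for each row (nunique == n_rows), so it may be a record ID instead. And some concepts aren't issues in humans. 
* species need mapping: they are colloquial names, sometimes very general like "fish"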
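
The species value counts in this notebook also contain singular/plural variants of the same label (`Chicken` and `Chickens`, `Horse` and `Horses`). A minimal sketch of merging them before any species mapping. `SPECIES_ALIASES` and `normalize_species` are hypothetical, and treating these pairs as the same species is an assumption:

```python
from collections import Counter

# assumption: these singular labels mean the same as the plural ones
SPECIES_ALIASES = {
    "Chicken": "Chickens",
    "Horse": "Horses",
}


def normalize_species(label):
    """Strip whitespace and map known variants to one label."""
    cleaned = label.strip()
    return SPECIES_ALIASES.get(cleaned, cleaned)


# counts copied from the species value_counts output in this notebook
raw_counts = {"Chickens": 175, "Chicken": 8, "Horses": 334, "Horse": 4, "Fish": 25}

merged_counts = Counter()
for label, n in raw_counts.items():
    merged_counts[normalize_species(label)] += n
```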

In [69]:
df_veto.info()

df_veto.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2581 entries, 0 to 2580
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   omopid             2581 non-null   int64 
 1   struct_id          2581 non-null   int64 
 2   species            2581 non-null   object
 3   relationship_type  2581 non-null   object
 4   concept_name       2581 non-null   object
dtypes: int64(2), object(3)
memory usage: 100.9+ KB


Unnamed: 0,omopid,struct_id,species,relationship_type,concept_name
0,1737,5522,Dogs,Indication,Moist Dermatitis
1,1738,73,Cats,Indication,"Control intractable animals during examination, treatment, grooming, x-ray and minor surgical procedures"
2,1739,73,Cats,Indication,Itching caused by skin irritation
3,1740,73,Cats,Indication,Control vomiting associated with motion sickness
4,1741,73,Dogs,Indication,"Control intractable animals during examination, treatment, grooming, x-ray and minor surgical procedures"


In [70]:
df_veto["struct_id"].nunique()

df_veto["omopid"].nunique()
df_veto["concept_name"].nunique()

377

2581

1459

In [71]:
df_veto["relationship_type"].value_counts()

df_veto["species"].value_counts()

relationship_type
Indication    2581
Name: count, dtype: int64

species
Dogs                                            656
Cattle                                          490
Cats                                            430
Horses                                          334
Chickens                                        175
Swine                                           167
Sheep                                            98
Turkeys                                          80
Fish                                             25
Goats                                            20
Multiple Species                                 16
Ducks                                            10
Chicken                                           8
Cervidae                                          8
Pheasants                                         6
Quail                                             6
Fish - Salmonids                                  4
Horse                                             4
Mustelids                                         4
Bees