<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#TTD-notebook:-testing-code,-EDA" data-toc-modified-id="TTD-notebook:-testing-code,-EDA-1">TTD notebook: testing code, EDA</a></span></li><li><span><a href="#ID-mapping" data-toc-modified-id="ID-mapping-2">ID-mapping</a></span><ul class="toc-item"><li><span><a href="#P1-03:-TTD-Drug-IDs--&gt;-PUBCHEM.COMPOUND,-CAS,-CHEBI" data-toc-modified-id="P1-03:-TTD-Drug-IDs--&gt;-PUBCHEM.COMPOUND,-CAS,-CHEBI-2.1">P1-03: TTD Drug IDs -&gt; PUBCHEM.COMPOUND, CAS, CHEBI</a></span><ul class="toc-item"><li><span><a href="#Parse-header---define-function!" data-toc-modified-id="Parse-header---define-function!-2.1.1">Parse header - define function!</a></span></li><li><span><a href="#EDA-parse-mappings" data-toc-modified-id="EDA-parse-mappings-2.1.2">EDA parse mappings</a></span></li><li><span><a href="#Parser-code" data-toc-modified-id="Parser-code-2.1.3">Parser code</a></span></li></ul></li></ul></li><li><span><a href="#Edges" data-toc-modified-id="Edges-3">Edges</a></span><ul class="toc-item"><li><span><a href="#P1-05:-&quot;drug-treats-indication&quot;" data-toc-modified-id="P1-05:-&quot;drug-treats-indication&quot;-3.1">P1-05: "drug treats indication"</a></span><ul class="toc-item"><li><span><a href="#Parse-header" data-toc-modified-id="Parse-header-3.1.1">Parse header</a></span></li><li><span><a href="#EDA-grab-content" data-toc-modified-id="EDA-grab-content-3.1.2">EDA grab content</a></span></li><li><span><a href="#Parser---grab-content" data-toc-modified-id="Parser---grab-content-3.1.3">Parser - grab content</a></span></li><li><span><a href="#Map-clinical_status" data-toc-modified-id="Map-clinical_status-3.1.4">Map clinical_status</a></span></li><li><span><a href="#Map-TTD-drug-IDs" data-toc-modified-id="Map-TTD-drug-IDs-3.1.5">Map TTD drug IDs</a></span></li><li><span><a href="#Filter-out-some-indication-names" data-toc-modified-id="Filter-out-some-indication-names-3.1.6">Filter out some indication 
names</a></span></li><li><span><a href="#EDA-map-indication-names" data-toc-modified-id="EDA-map-indication-names-3.1.7">EDA map indication names</a></span></li><li><span><a href="#Parser-map-indication-names" data-toc-modified-id="Parser-map-indication-names-3.1.8">Parser map indication names</a></span></li><li><span><a href="#Merge-&quot;duplicates&quot;-after-mapping" data-toc-modified-id="Merge-&quot;duplicates&quot;-after-mapping-3.1.9">Merge "duplicates" after mapping</a></span></li><li><span><a href="#EDA-generating-TTD-links" data-toc-modified-id="EDA-generating-TTD-links-3.1.10">EDA generating TTD links</a></span></li></ul></li></ul></li></ul></div>

# TTD notebook: testing code, EDA

In [1]:
## for notebook only 

## allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## for printing
from pprint import pprint

## for loading locally-stored files
import pathlib

In [2]:
## INCLUDE in parser

import re
from typing import Dict, Union, Iterable
import itertools
import pandas as pd
import requests

## from BioThings annotator code: for interoperability between diff Python versions
try:
    from itertools import batched  # new in Python 3.12
except ImportError:
    from itertools import islice

    def batched(iterable, n):
        # batched('ABCDEFG', 3) → ABC DEF G
        if n < 1:
            raise ValueError("n must be at least one")
        iterator = iter(iterable)
        while batch := tuple(islice(iterator, n)):
            yield batch


## NameRes url
NAMERES_URL = "https://name-resolution-sri.renci.org/bulk-lookup"
            
            
## NOT for parser: for viewing df only
pd.options.display.max_columns = None
pd.set_option('display.max_colwidth', None)

<div class="alert alert-block alert-danger">

This notebook was originally written using resource files downloaded 2025-10-30 from https://ttd.idrblab.net/full-data-download. 

# ID-mapping 

## P1-03: TTD Drug IDs -> PUBCHEM.COMPOUND, CAS, CHEBI

### Parse header - define function!

First step is parsing the header ONLY to get:
* its line count - then we know how many lines to skip to get to the actual data
* version/date of this specific file, included in the header

The header ends with a 2nd dash "divider" line: either "----" or "____". 

We want this to be a separate function so it's reusable for all "custom txt" files with this kind of header. 

In [3]:
## paths to raw resource files
base_file_path = pathlib.Path.home().joinpath("Desktop", "TTD_files")

p1_03_path = base_file_path.joinpath("P1-03-TTD_crossmatching.txt")

In [4]:
## INCLUDE in parser

def parse_header(file_path) -> Dict[str, Union[str, int, None]]:
    ## use \\ to escape special characters like ".", "()"
    version_pattern = "^Version ([0-9\\.]+) \\(([0-9\\.]+)\\)"
    
    line_counter = 0  ## count lines read so far
    dash_counter = 0  ## count dash "divider" lines - header ends after 2nd one
    ## stay None if the header has no parsable "Version" line
    version = None
    date = None

    with open(file_path, 'r') as file:
        for line in file:
            if dash_counter == 2:  ## already read the 2nd "divider" line
                break
            else:
                line_counter += 1
                ## if line is a dash "divider" line
                if line.startswith("---") or line.startswith("___"):
                    dash_counter += 1
                ## assuming there's only 1 line in the header that matches this condition
                elif line.startswith("Version"):
                    capture = re.search(version_pattern, line)
                    ## guard: re.search returns None when the line doesn't fit the pattern
                    if capture:
                        version = capture.group(1)
                        date = capture.group(2).replace(".", "-")
    
    return {
        "len_header": line_counter,
        "version": version,
        "date": date
    }

In [5]:
## INCLUDE in parser

p1_03_header_info = parse_header(p1_03_path)
p1_03_header_info

{'len_header': 27, 'version': '10.1.01', 'date': '2024-01-10'}

In [6]:
## regex testing

ex_string = "Version 10.1.01 (2024.01.10)"

pattern = "^Version ([0-9\\.]+) \\(([0-9\\.]+)\\)"

a = re.search(pattern, ex_string)
a.group(1)
a.group(2)

'10.1.01'

'2024.01.10'
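
The header logic can also be exercised without a file on disk. Below is a minimal sketch, assuming a variant that accepts any iterable of lines; `parse_header_lines` and the toy header are made up for illustration and only mimic the divider/Version layout described above (they are not real TTD content).

```python
import re
from typing import Dict, Iterable, Optional, Union

VERSION_PATTERN = re.compile(r"^Version ([0-9.]+) \(([0-9.]+)\)")


def parse_header_lines(lines: Iterable[str]) -> Dict[str, Union[int, Optional[str]]]:
    """Hypothetical variant of parse_header that takes lines instead of a path."""
    len_header = 0    # lines read so far, up to and including the 2nd divider
    dash_counter = 0  # divider lines seen so far
    version = None    # stays None if no "Version" line is found
    date = None

    for line in lines:
        if dash_counter == 2:  # header already ended
            break
        len_header += 1
        if line.startswith(("---", "___")):
            dash_counter += 1
        else:
            match = VERSION_PATTERN.search(line)
            if match:
                version = match.group(1)
                date = match.group(2).replace(".", "-")

    return {"len_header": len_header, "version": version, "date": date}


## made-up header: divider, title, version line, divider, then 1 data line
toy_header = [
    "----------\n",
    "Toy title line\n",
    "Version 10.1.01 (2024.01.10)\n",
    "__________\n",
    "D_TOY1\tPUBCHCID\t111\n",
]

info = parse_header_lines(toy_header)
print(info)
```

Taking lines rather than a path keeps the same counting logic, but makes it easy to check edge cases such as a header with no "Version" line.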

### EDA parse mappings

* Uses [itertools.islice](https://stackoverflow.com/a/53942825) ([Python](https://docs.python.org/3/library/itertools.html#itertools.islice) [docs](https://coderivers.org/blog/python-islice/)) and string method removeprefix (from [python 3.9+](https://www.geeksforgeeks.org/python/python-string-removeprefix-function/))
* the namespace lines should appear in the order PUBCHCID, CASNUMBE, CHEBI_ID (aka PUBCHCID is always first and could create the TTD ID entry). But I'm not assuming this, so even for PUBCHCID, the code always checks whether the TTD ID entry already exists. 
* assuming that there is at most 1 line per namespace in 1 TTD ID entry. If this assumption is false, this code will only keep the data from the last line encountered. 
* the "adding to mapping dict" logic is the same for each namespace, BUT I'm keeping it inline rather than making it a function. (A function wouldn't actually need to pass the dict back out or rewrite it: Python dicts are mutable, so a function can update the caller's dict in place.)
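
To make the bullets above concrete, here is a minimal sketch on toy data: `itertools.islice` skips a fixed number of header lines, `str.removeprefix` only strips `"CAS "` when it is present, and a small helper built on `dict.setdefault` replaces the repeated if/else. The helper `add_mapping`, the toy IDs, and the 2-line header are all made up for illustration.

```python
import io
import itertools

## toy stand-in for a TTD file: 2 made-up header lines, then tab-delimited data
toy_file = io.StringIO(
    "header line 1\n"
    "header line 2\n"
    "D_TOY1\tPUBCHCID\t111; 222\n"
    "D_TOY1\tCASNUMBE\tCAS 1-23-4\n"
    "D_TOY2\tCASNUMBE\t5-67-8\n"
)

toy_mappings = {}


def add_mapping(mappings, ttd_id, namespace, ids):
    ## hypothetical helper: dicts are mutable, so this updates the caller's
    ## dict in place - nothing is copied or returned
    mappings.setdefault(ttd_id, {})[namespace] = ids


## islice(file, 2, None) skips the first 2 lines and yields the rest
for line in itertools.islice(toy_file, 2, None):
    ttd_id, column, value = line.strip().split("\t")
    if column == "PUBCHCID":
        ids = ["PUBCHEM.COMPOUND:" + j.strip() for j in value.split(";") if j.strip()]
        add_mapping(toy_mappings, ttd_id, "pubchem_compound", ids)
    elif column == "CASNUMBE":
        ## removeprefix (Python 3.9+) leaves the value alone if "CAS " is absent
        add_mapping(toy_mappings, ttd_id, "cas", ["CAS:" + value.removeprefix("CAS ")])

print(toy_mappings)
```

Like the inline version, this keeps only the last line seen if a namespace repeats for the same TTD ID.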

In [7]:
## format {TTD: {namespace: [list of IDs with correct prefixes]}}
temp_drug_mappings = dict()

## dev: stopping loop
# line_counter = 0

with open(p1_03_path, 'r') as file:
    ## iterate from beginning of data (after 2nd dash divider line) to end of file
    for line in itertools.islice(file, p1_03_header_info["len_header"], None):
        
        ## skip "blank" lines that only contain whitespace (seem to be "\n")
        if line.isspace():
            continue
        else:
            ## tab-delimited, line ends in "\n" so remove whitespace
            ## [0] == TTD ID, [1] == column name, [2] == value
            data = [i for i in line.strip().split("\t")]
            
            ## grab namespace lines
            if data[1] == "PUBCHCID":
                ## "value" can be "; "-delimited.
                ## And its ids don't have prefixes, so add desired prefix
                pubchem_ids = ["PUBCHEM.COMPOUND:" + j.strip() for j in data[2].split(";") if j.strip()]
                ## add to temp_drug_mappings
                ## if TTD ID isn't in mapping dict yet
                if data[0] not in temp_drug_mappings.keys():
                    temp_drug_mappings[data[0]] = {"pubchem_compound": pubchem_ids}
                ## already in mapping dict
                else:
                    temp_drug_mappings[data[0]]["pubchem_compound"] = pubchem_ids
            elif data[1] == "CASNUMBE":
                ## for EDA only
                if not data[2].startswith("CAS "):   
                    print(data[0], data[2])
                    
                ## based on grep EDA, this value is always single (no delimiter)
                ## remove prefix if it exists, then add desired prefix
                ## not doing a replace because some lines don't have "CAS "
                cas_id = "CAS:" + data[2].removeprefix("CAS ")
                ## add to temp_drug_mappings
                ## if TTD ID isn't in mapping dict yet
                if data[0] not in temp_drug_mappings.keys():
                    temp_drug_mappings[data[0]] = {"cas": [cas_id]}
                ## already in mapping dict
                else:
                    temp_drug_mappings[data[0]]["cas"] = [cas_id]
            elif data[1] == "CHEBI_ID":
                ## based on grep EDA, this value is always single (no delimiter)
                ## and the prefix is already desired format
                ## add to temp_drug_mappings
                ## if TTD ID isn't in mapping dict yet
                if data[0] not in temp_drug_mappings.keys():
                    temp_drug_mappings[data[0]] = {"chebi": [data[2]]}
                ## already in mapping dict
                else:
                    temp_drug_mappings[data[0]]["chebi"] = [data[2]]
                    
#         line_counter += 1
#         if line_counter > 20:
#             pprint(temp_drug_mappings)
#             break

D0A9ZX 6834-98-6
D0LR4B 68767-14-6
D0QO7H 1143-38-0
D0S0ES 89371-37-9
D0W9GA 78994-23-7
D02HUB 7759-35-5
D0W2FZ 760981-83-7
D00SBD 203191-10-0
D0E9BF 67-47-0
D0K4MU 221019-25-6
D0G6HI 239802-15-4
D0PK3W 169148-84-9
D0Q3QS 133-10-8
D0S1BN 642407-44-1
D0V9LJ 156953-42-3
D0E4FG 174391-92-5
D0LK8O 119514-66-8
D05MEY 1203902-67-3
D08RUN 18263-25-7
D09MXM 88939-40-6
D0NZ3Q 339615-76-8
D0Y0ID 3131-60-0


In [8]:
## check mappings

## not expected CAS format - handled correctly
# temp_drug_mappings["D0K4MU"]

## multiple pubchem ids
# multi_counter = 0
# for k,v in temp_drug_mappings.items():
#     if v.get("pubchem_compound"):
#         if len(v["pubchem_compound"]) > 1:
#             multi_counter += 1
# #             print(k,v)
# #             break
# multi_counter

# ## has chebi id
# for k,v in temp_drug_mappings.items():
#     if v.get("chebi"):
#         print(k,v)
#         break

## check that this doesn't have empty string ID - "PUBCHEM.COMPOUND:"
# temp_drug_mappings["D0B0HK"]
    
    
## look for entries without namespace: 
## actually, all have pubchem_compound IDs o_0 - nothing prints
for k,v in temp_drug_mappings.items():
    if not v.get("pubchem_compound"):
        print(k,v)

In [9]:
## how many mappings total / for each namespace

print(f"Total TTD drug IDs mapped: {len(temp_drug_mappings)}")

pubchem_counter = 0
cas_counter = 0
chebi_counter = 0

for k,v in temp_drug_mappings.items():
    if v.get("pubchem_compound"):
        pubchem_counter += 1
    if v.get("cas"):
        cas_counter += 1
    if v.get("chebi"):
        chebi_counter += 1
        
print(f"pubchem mappings: {pubchem_counter}")
print(f"cas mappings: {cas_counter}")
print(f"chebi mappings: {chebi_counter}")

Total TTD drug IDs mapped: 25423
pubchem mappings: 25423
cas mappings: 12163
chebi mappings: 5231


**So...we could use only the pubchem mappings!**

Now double-checking whether any TTD ID has multiple PUBCHCID lines (there shouldn't be any). If we don't have to worry about this, the final parsing is simpler.

In [10]:
## diff format than EDA! {TTD: [list of pubchem compound IDs with correct prefixes]}
temp_drug_mappings = dict()

## dev: stopping loop
# line_counter = 0

with open(p1_03_path, 'r') as file:
    ## iterate from beginning of data (after 2nd dash divider line) to end of file
    for line in itertools.islice(file, p1_03_header_info["len_header"], None):
        
        ## skip "blank" lines that only contain whitespace (seem to be "\n")
        if line.isspace():
            continue
        else:
            ## tab-delimited, line ends in "\n" so remove whitespace
            ## [0] == TTD ID, [1] == column name, [2] == value
            data = [i for i in line.strip().split("\t")]
            
            ## grab pubchem-compound lines
            if data[1] == "PUBCHCID":
                ## "value" can be "; "-delimited.
                ## And its ids don't have prefixes, so add desired prefix
                pubchem_ids = ["PUBCHEM.COMPOUND:" + j.strip() for j in data[2].split(";") if j.strip()]
                
                ## add to mappings
                ## if TTD ID isn't in mapping dict yet
                if data[0] not in temp_drug_mappings.keys():
                    temp_drug_mappings[data[0]] = pubchem_ids
                ## already in mapping dict - catch this!!
                else:
                    ## debug
                    print(data[0])
#                     ## logic to add info/get unique values only
#                     temp_drug_mappings[data[0]].extend(pubchem_ids) 
#                     temp_drug_mappings[data[0]] = list(set(temp_drug_mappings[data[0]]))
                    
#         line_counter += 1
#         if line_counter > 20:
#             pprint(temp_drug_mappings)
#             break

Nothing prints, so **there are no cases of multiple PUBCHCID lines for 1 TTD ID**!

<div class="alert alert-block alert-success">

**FOR PIPELINE CODE**: 
* only grab **pubchem_compound** mapping (Other namespaces don't increase the number of mapped TTD IDs)
* assume there's 1 PUBCHCID line per TTD ID

### Parser code

In [11]:
## INCLUDE in parser

## format {TTD: [list of pubchem compound IDs with correct prefixes]}
ttd_drug_mappings = dict()

## dev: stopping loop
# line_counter = 0

with open(p1_03_path, 'r') as file:
    ## iterate from beginning of data (after 2nd dash divider line) to end of file
    for line in itertools.islice(file, p1_03_header_info["len_header"], None):
        
        ## skip "blank" lines that only contain whitespace (seem to be "\n")
        if line.isspace():
            continue
        else:
            ## tab-delimited, line ends in "\n" so remove whitespace
            ## [0] == TTD ID, [1] == column name, [2] == value
            data = [i for i in line.strip().split("\t")]
            
            ## grab pubchem-compound lines
            if data[1] == "PUBCHCID":
                ## "value" can be "; "-delimited.
                ## And its ids don't have prefixes, so add desired prefix
                pubchem_ids = ["PUBCHEM.COMPOUND:" + j.strip() for j in data[2].split(";") if j.strip()]
                
                ## add to mappings
                ttd_drug_mappings[data[0]] = pubchem_ids
                
#         line_counter += 1
#         if line_counter > 20:
#             pprint(ttd_drug_mappings)
#             break

In [12]:
len(ttd_drug_mappings)

ttd_drug_mappings["D0B0HK"]

25423

['PUBCHEM.COMPOUND:91865905']

# Edges

## P1-05: "drug treats indication"

### Parse header

In [13]:
p1_05_path = base_file_path.joinpath("P1-05-Drug_disease.txt")

In [14]:
## INCLUDE in parser

p1_05_header_info = parse_header(p1_05_path)
p1_05_header_info

{'len_header': 21, 'version': '10.1.01', 'date': '2024-03-30'}

### EDA grab content

* inspired by Lucy's parser logic in [load_drug_dis_data](https://github.com/lucyzhang95/BioThings_TTD_Dataplugin/blob/3b72b79b09a2f24fec6c29273df66c20c89590e3/TTD_parser.py#L350)
* putting rough edge objects into list, then load into pandas once - which is [recommended](https://sentry.io/answers/create-an-empty-python-pandas-dataframe-and-add-rows/)

**Using the indication name "column" because of issues with the ICD-11 "column" values:**
* the value is a code, not the "actual" ICD-11 ID aka "foundation" URI. Ex: TTD uses `5C58.03` for "Progressive familial intrahepatic cholestasis", but the [ICD-11 foundation URI is 1457142642](https://icd.who.int/browse/2025-01/mms/en#1457142642)
  * [MyDisease's mondo.xrefs.icd11](https://mydisease.info/v1/query?q=_exists_:mondo.xrefs.icd11&fields=mondo) uses the "foundation" IDs, based on the ID format (numeric)
  * NodeNorm currently has very little support for ICD-11 IDs in DiseaseOrPheno (5 total in database?), and they are the "foundation" URIs
* the value is sometimes a "range" of codes, not individually listed codes. Ex: for "solid tumour/cancer", TTD uses "2A00-2F9Z" aka code "2A00" to "2F9Z". 

In [15]:
## format [{edge objects}]
eda_edges = list()


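## temp variable for the current TTD drug ID
## initialized so the "if not ttd_drug" check can't hit a NameError
ttd_drug = ""

## dev: stopping loop
line_counter = 0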

with open(p1_05_path, 'r') as file:
    ## iterate from beginning of data (after 2nd dash divider line) to end of file
    for line in itertools.islice(file, p1_05_header_info["len_header"], None):
        ## skip "blank" lines that only contain whitespace (seem to be "\n")
        if line.isspace():
            ttd_drug = ""  ## reset temp variable
            continue
        else:
            ## tab-delimited, has extra whitespace at end of line
            data = [i for i in line.strip().split("\t")]
            
            ## TTD drug ID line
            if data[0] == "TTDDRUID":
                ## save in temp variable, always seems to be single value
                ttd_drug = data[1]
            elif data[0] == "INDICATI":
                ## EDA: if ttd_drug is empty, "TTDDRUID" didn't come first...needs review
                if not ttd_drug:
                    print(line_counter)
                    print(data)
                    continue
                else:
                    ## don't make edge if indication name seems fake
                    if data[1] == "#N/A":
                        continue
                    else:
                        eda_edges.append({
                            "subject_ttd_drug": ttd_drug,
                            "object_indication_name": data[1],
                            "clinical_status": data[3],
                        })
                    
        line_counter += 1
#         if line_counter > 20:
#             pprint(eda_edges)
#             break

Nothing printed - meaning we never had an empty ttd_drug variable when encountering an "indication" edge. 
This means we can simplify the code and assume that ttd_drug will always be filled properly before the "indication" rows. 

In [16]:
## basic EDA

## expected based on length of file, number of fake names
len(eda_edges)

eda_edges[100]

30312

{'subject_ttd_drug': 'D8LAE2',
 'object_indication_name': 'Non-small-cell lung cancer',
 'clinical_status': 'Phase 3'}

In [17]:
## EDA using dataframe

temp_df = pd.DataFrame(eda_edges)

## expected based on grep EDA, then subtract based on fake names
temp_df["subject_ttd_drug"].nunique()

temp_df["object_indication_name"].nunique()

temp_df["clinical_status"].nunique()

23717

1944

40

### Parser - grab content

In [18]:
## INCLUDE in parser

## format [{edge objects}]
edges = list()

with open(p1_05_path, 'r') as file:
    ## iterate from beginning of data (after 2nd dash divider line) to end of file
    for line in itertools.islice(file, p1_05_header_info["len_header"], None):
        ## skip "blank" lines that only contain whitespace (seem to be "\n")
        if line.isspace():
            continue
        else:
            ## tab-delimited, has extra whitespace at end of line
            data = [i for i in line.strip().split("\t")]
            
            ## TTD drug ID line
            if data[0] == "TTDDRUID":
                ## save in temp variable, always seems to be single value
                ttd_drug = data[1]
            elif data[0] == "INDICATI":
                ## don't make edge if indication name seems fake
                if data[1] == "#N/A":
                    continue
                else:
                    edges.append({
                        "subject_ttd_drug": ttd_drug,
                        "object_indication_name": data[1],
                        "clinical_status": data[3],
                    })

df = pd.DataFrame(edges)

In [19]:
## INCLUDE in parser -> log

print(f"{df.shape[0]} rows after parsing file\n")

## single quotes inside the f-strings: reusing double quotes is a SyntaxError before Python 3.12
print(f"{df['subject_ttd_drug'].nunique()} unique TTD drug IDs")

print(f"{df['object_indication_name'].nunique()} unique indication names")

print(f"{df['clinical_status'].nunique()} unique clinical status values")

30312 rows after parsing file

23717 unique TTD drug IDs
1944 unique indication names
40 unique clinical status values


In [20]:
df

Unnamed: 0,subject_ttd_drug,object_indication_name,clinical_status
0,DZB84T,Pruritus,Approved
1,DZB84T,Progressive familial intrahepatic cholestasis,Phase 3
2,DZB84T,Alagille syndrome,Phase 2
3,DZA90G,Coronavirus Disease 2019 (COVID-19),Approved
4,DZ8DF0,Primary hyperoxaluria type 1,Approved
...,...,...,...
30307,D00AJS,Non-insulin dependent diabetes,Investigative
30308,D00AIZ,Mycobacterium infection,Investigative
30309,D00AIS,Solid tumour/cancer,Investigative
30310,D00AHV,Central nervous system disease,Investigative


### Map clinical_status

"clinical status" values -> biolink predicate ("treats" hierarchy)

Doing this first so we can review all possible values.

<div class="alert alert-block alert-danger">

**Values not included in `CLINICAL_STATUS_MAP` will be filtered out. This means new values need to be manually reviewed, then added to `CLINICAL_STATUS_MAP` if we want to keep their data.**

In [21]:
## EDA

df["clinical_status"].nunique()

df["clinical_status"].value_counts().sort_index()

## to get without counts
# sorted(df["clinical_status"].unique())

40

clinical_status
Application submitted                72
Approval submitted                    1
Approved                           3742
Approved (orphan drug)                5
Approved in China                     5
Approved in EU                        8
BLA submitted                         1
Clinical Trial                        7
Clinical trial                      108
Discontinued in Phase 1             745
Discontinued in Phase 1/2            57
Discontinued in Phase 2            1091
Discontinued in Phase 2/3             6
Discontinued in Phase 2a              1
Discontinued in Phase 2b              2
Discontinued in Phase 3             268
Discontinued in Phase 4               2
Discontinued in Preregistration      41
IND submitted                         7
Investigative                      4619
NDA filed                             6
Patented                           3155
Phase 0                              11
Phase 1                            4956
Phase 1/2               

In [22]:
## INCLUDE in parser

## biolink predicates
BIOLINK_TREATS = "biolink:treats"
BIOLINK_STUDIED_TREAT = "biolink:studied_to_treat"
BIOLINK_PRECLINICAL = "biolink:in_preclinical_trials_for"
BIOLINK_CLINICAL_TRIALS = "biolink:in_clinical_trials_for"

## hard-coded mapping of "clinical status" values to biolink "treats" predicates
## purposely doesn't include all values - rest are filtered out
CLINICAL_STATUS_MAP = {
    ## treats
    "Approved": BIOLINK_TREATS,
    "Approved (orphan drug)": BIOLINK_TREATS,
    "Approved in China": BIOLINK_TREATS,
    "Approved in EU": BIOLINK_TREATS,
    "Phase 4": BIOLINK_TREATS,
    ## studied to treat
    "Investigative": BIOLINK_STUDIED_TREAT,
    "Patented": BIOLINK_STUDIED_TREAT,
    ## in preclinical trials for
    "Preclinical": BIOLINK_PRECLINICAL,
    "IND submitted": BIOLINK_PRECLINICAL,
    ## in clinical trials for
    "Clinical Trial": BIOLINK_CLINICAL_TRIALS,
    "Clinical trial": BIOLINK_CLINICAL_TRIALS,
    "Preregistration": BIOLINK_CLINICAL_TRIALS,
    "Registered": BIOLINK_CLINICAL_TRIALS,
    "Phase 0": BIOLINK_CLINICAL_TRIALS,
    "Phase 1": BIOLINK_CLINICAL_TRIALS,
    "Phase 1b": BIOLINK_CLINICAL_TRIALS,
    "Phase 1/2": BIOLINK_CLINICAL_TRIALS,
    "Phase 1/2a": BIOLINK_CLINICAL_TRIALS,
    "Phase 1b/2a": BIOLINK_CLINICAL_TRIALS,
    "Phase 2": BIOLINK_CLINICAL_TRIALS,
    "Phase 2a": BIOLINK_CLINICAL_TRIALS,
    "Phase 2b": BIOLINK_CLINICAL_TRIALS,
    "Phase 2/3": BIOLINK_CLINICAL_TRIALS,
    "Phase 3": BIOLINK_CLINICAL_TRIALS,
    "phase 3": BIOLINK_CLINICAL_TRIALS,
    "NDA filed": BIOLINK_CLINICAL_TRIALS,
    "BLA submitted": BIOLINK_CLINICAL_TRIALS,
    "Approval submitted": BIOLINK_CLINICAL_TRIALS,
}

**[last edited 2025-11-03, based on v10.1.01 '2024-03-30' data]**

**NOTES ON MAPPING**
* treats:
  * `Phase 4` is after the drug is approved by FDA - [ref](https://www.allclinicaltrials.com/blog/phases-of-clinical-trials)
* studied_to_treat:
  * `Investigative`: not sure exactly what this means. By contrast, [investigational](https://www.cancer.gov/publications/dictionaries/cancer-terms/def/investigational-drug) is a formal term meaning "in clinical trials"
  * `Patented`: according to [TTD 2020 paper](https://academic.oup.com/nar/article/48/D1/D1031/5613683#190994680), these are in "early drug discovery" stage and include "potential therapeutic indications"
* in_preclinical_trials_for:
  * `IND submitted`: application to start clinical trials - ref [gov](https://www.cc.nih.gov/orcs/ind), [fdareview](https://www.fdareview.org/issues/the-drug-development-and-approval-process/), [allucent](https://www.allucent.com/resources/blog/5-common-fda-applications-drugs-and-biologics)
* in_clinical_trials_for - [ref](https://www.allclinicaltrials.com/blog/phases-of-clinical-trials)
  * `Preregistration`: registering clinical trial before/at start of clinical trial - [ref](https://www.atsjournals.org/doi/10.1513/AnnalsATS.202508-835ED)
  * `Registered`: not sure exactly what this means. Assuming it's "clinical trial registered"
  * `NDA filed`: applied for FDA approval of drug after "finishing" clinical trials - ref [fdareview](https://www.fdareview.org/issues/the-drug-development-and-approval-process/)
  * `BLA submitted`: applied for FDA approval of biologic after "finishing" clinical trials - ref [allucent](https://www.allucent.com/resources/blog/what-are-regulatory-differences-between-nda-and-bla)
  * `Approval submitted`: not sure exactly what this means. Assuming it's the "drug approval" application, submitted after clinical trials are done and successful
  

**Filter out everything else**

I was unsure about these values, or considered ingesting them "controversial" (people would have different opinions):
* `Application submitted`: not sure what this exactly means - "application" is too vague
* All the "discontinued" values: this can happen for reasons not related to safety/efficacy - [ref](https://doi.org/10.1186/s13104-020-05391-w)
  * `Discontinued in Preregistration`
  * `Discontinued in Phase 1`
  * `Discontinued in Phase 1/2`
  * `Discontinued in Phase 2`
  * `Discontinued in Phase 2a`
  * `Discontinued in Phase 2b`
  * `Discontinued in Phase 2/3`
  * `Discontinued in Phase 3`
  * `Discontinued in Phase 4`
* `Terminated`: can happen for reasons not related to safety/efficacy - [ref](https://pmc.ncbi.nlm.nih.gov/articles/PMC4444136/)
* `Withdrawn from market`
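
Since values missing from `CLINICAL_STATUS_MAP` are dropped silently, it may help to separate "reviewed and purposely excluded" values from values nobody has reviewed yet. A minimal sketch, assuming a hand-maintained set of excluded values; `find_unreviewed_statuses` and the toy inputs are hypothetical and not part of the parser (`"Phase 5"` is a made-up value standing in for something new in a future release).

```python
def find_unreviewed_statuses(observed, status_map, reviewed_excluded):
    """Return observed values that are neither mapped nor knowingly excluded."""
    return sorted(set(observed) - set(status_map) - set(reviewed_excluded))


## toy stand-ins: the real inputs would be df["clinical_status"],
## CLINICAL_STATUS_MAP, and a set built from the filtered-out list above
toy_status_map = {
    "Approved": "biolink:treats",
    "Phase 3": "biolink:in_clinical_trials_for",
}
toy_reviewed_excluded = {"Terminated", "Withdrawn from market"}
toy_observed = ["Approved", "Phase 3", "Terminated", "Phase 5"]

unreviewed = find_unreviewed_statuses(toy_observed, toy_status_map, toy_reviewed_excluded)
print(unreviewed)  # → ['Phase 5']
```

In a pipeline, a non-empty result could be logged as a warning so that new values get reviewed rather than disappearing unnoticed.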

**FUTURE:**
* review the filtered-out terms and figure out how to ingest them (which predicate to use)? 
* add mappings to edge attribute value(s) that capture the original "clinical status" meaning more precisely


**CONCERNS:**

**Some P1-05 data doesn't actually fit "treats".** Ex: indication is "contraception" or drug is for imaging (SomaKit-TOC; gallium (68Ga) edotreotide). 

**TTD actually uses "indication", which seems to be a wider term than "treats…".** <br>
Seems more like "reasonable to use for condition X" or "has therapeutic use for condition X"

* Includes "diagnosis" and "used to achieve" (ex: contraception)
* Includes things that aren't diseases - sign, symptom, condition
* [Wiki](https://en.wikipedia.org/wiki/Indication_(medicine)): "**a valid reason to use** a certain **test**, medication, procedure, or surgery." 
  * Opposite is contraindication ("a reason to withhold"). 
  * Drug can be **indicated for** "treatment, **prevention, mitigation, cure, relief, or diagnosis** of that disease or **condition**"
  * There's also the official "drug indications" approved by licensing body 
* [NCI](https://www.cancer.gov/publications/dictionaries/cancer-terms/def/indication): "a **sign, symptom, or medical condition** that leads to the recommendation of a treatment, **test, or procedure**."
* [Oprea/Dumontier/etc paper](https://pmc.ncbi.nlm.nih.gov/articles/PMC6259666/): "Indications" and actually describing the therapeutic intent are tricky! 


In [23]:
## INCLUDE in parser

## map "clinical status" to biolink predicate
## get method returns None if key (clinical status) not found in mapping 
df["biolink_predicate"] = [CLINICAL_STATUS_MAP.get(i) for i in df["clinical_status"]]

In [24]:
## INCLUDE in parser -> log

n_mapped = df["biolink_predicate"].notna().sum()
print(f"{n_mapped} rows with mapped clinical status: {n_mapped / df.shape[0]:.1%}")

26474 rows with mapped clinical status: 87.3%


**~87% of the data could be mapped.**

In [25]:
## INCLUDE in parser -> log/save

## double-checking what clinical status values aren't mapped
sorted(df[df["biolink_predicate"].isna()].clinical_status.unique())

['Application submitted',
 'Discontinued in Phase 1',
 'Discontinued in Phase 1/2',
 'Discontinued in Phase 2',
 'Discontinued in Phase 2/3',
 'Discontinued in Phase 2a',
 'Discontinued in Phase 2b',
 'Discontinued in Phase 3',
 'Discontinued in Phase 4',
 'Discontinued in Preregistration',
 'Terminated',
 'Withdrawn from market']

In [26]:
## INCLUDE in parser

## drop rows without predicate mapping
df.dropna(subset="biolink_predicate", inplace=True, ignore_index=True)

In [27]:
## EDA

df.shape

(26474, 4)

### Map TTD drug IDs

Uses `ttd_drug_mappings` generated earlier when parsing mapping files (P1-03). 

If the mappings had more namespaces (so logic was needed to pick one), I would write the logic out rather than use a list comprehension:
* If getting the TTD ID is successful...
  * If getting pubchem is successful...use it
  * Else, use inchikey
* Else: use None
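
The written-out logic could look like the sketch below, assuming a nested mapping format `{TTD: {namespace: [IDs]}}` like the one used in the P1-03 EDA. The function `pick_drug_ids`, the priority tuple, the `inchikey` namespace, and the toy IDs are all hypothetical; the current mappings only contain PubChem IDs.

```python
from typing import Dict, List, Optional

## hypothetical priority order: first namespace with IDs wins
NAMESPACE_PRIORITY = ("pubchem_compound", "inchikey")


def pick_drug_ids(
    ttd_id: str, mappings: Dict[str, Dict[str, List[str]]]
) -> Optional[List[str]]:
    entry = mappings.get(ttd_id)
    if entry is None:  ## TTD ID has no mappings at all
        return None
    for namespace in NAMESPACE_PRIORITY:
        ids = entry.get(namespace)
        if ids:  ## skips missing namespaces and empty lists
            return ids
    return None


toy_mappings = {
    "D_TOY1": {"pubchem_compound": ["PUBCHEM.COMPOUND:111"], "inchikey": ["INCHIKEY:AAA"]},
    "D_TOY2": {"inchikey": ["INCHIKEY:BBB"]},
}

print(pick_drug_ids("D_TOY1", toy_mappings))
print(pick_drug_ids("D_TOY2", toy_mappings))
print(pick_drug_ids("D_TOY3", toy_mappings))
```

Returning None for unmapped IDs keeps this compatible with the `dropna` step used for filtering.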

In [28]:
## INCLUDE in parser

## get method returns None if key (TTD ID) not found in mapping 
df["subject_pubchem"] = [ttd_drug_mappings.get(i) for i in df["subject_ttd_drug"]]

In [29]:
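## EDA - check that no ID list contains the bare prefix "PUBCHEM.COMPOUND:"
## (subject_pubchem holds lists or None, so test membership instead of using ==)

df[df["subject_pubchem"].map(lambda ids: isinstance(ids, list) and "PUBCHEM.COMPOUND:" in ids)]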

Unnamed: 0,subject_ttd_drug,object_indication_name,clinical_status,biolink_predicate,subject_pubchem


In [30]:
## INCLUDE in parser -> log

n_mapped = df["subject_pubchem"].notna().sum()
print(f"{n_mapped} rows with mapped TTD drug IDs: {n_mapped / df.shape[0]:.1%}")

10148 rows with mapped TTD drug IDs: 38.3%


**Only ~38% of the data could be mapped.**

In [31]:
## INCLUDE in parser

## remove rows without drug mapping
df.dropna(subset="subject_pubchem", inplace=True, ignore_index=True)

In [32]:
df.shape

(10148, 5)

In [33]:
## EDA: multiple pubchem values? 

df[df["subject_pubchem"].map(len) > 1].shape

df[df["subject_pubchem"].map(len) > 1]

(37, 5)

Unnamed: 0,subject_ttd_drug,object_indication_name,clinical_status,biolink_predicate,subject_pubchem
185,D0XW2H,Asthma,Approved,biolink:treats,"[PUBCHEM.COMPOUND:441336, PUBCHEM.COMPOUND:3083544]"
1690,D0C4DH,Hepatitis C virus infection,Approved,biolink:treats,"[PUBCHEM.COMPOUND:66828839, PUBCHEM.COMPOUND:58031952]"
2254,D07DPI,Cardiovascular disease,Approved,biolink:treats,"[PUBCHEM.COMPOUND:2244, PUBCHEM.COMPOUND:9568614]"
2429,D06AGH,Melanoma,Approved,biolink:treats,"[PUBCHEM.COMPOUND:11707110, PUBCHEM.COMPOUND:44462760]"
2508,D05JFY,Heart failure,Approved,biolink:treats,"[PUBCHEM.COMPOUND:9351, PUBCHEM.COMPOUND:6883]"
3424,D0Z1SL,Postmenopausal osteoporosis,Phase 3,biolink:in_clinical_trials_for,"[PUBCHEM.COMPOUND:154257, PUBCHEM.COMPOUND:45357473]"
3426,D0Z0HL,Type-2 diabetes,Phase 3,biolink:in_clinical_trials_for,"[PUBCHEM.COMPOUND:90472060, PUBCHEM.COMPOUND:44146714]"
3692,D0L4LX,Hypercholesterolaemia,Phase 3,biolink:in_clinical_trials_for,"[PUBCHEM.COMPOUND:60823, PUBCHEM.COMPOUND:150311]"
3794,D0H0UB,Chronic obstructive pulmonary disease,Phase 3,biolink:in_clinical_trials_for,"[PUBCHEM.COMPOUND:25195533, PUBCHEM.COMPOUND:642444]"
3949,D0C0MN,Hypertension,Phase 3,biolink:in_clinical_trials_for,"[PUBCHEM.COMPOUND:3749, PUBCHEM.COMPOUND:5560]"


In [34]:
## INCLUDE in parser

## expand to multiple rows when subject_pubchem list length > 1
## also pops every subject_pubchem value out into a string
df = df.explode("subject_pubchem", ignore_index=True)

## -> log
print(f"{df.shape[0]} rows after expanding mappings with multiple ID values")

## EDA only
df

10190 rows after expanding mappings with multiple ID values


Unnamed: 0,subject_ttd_drug,object_indication_name,clinical_status,biolink_predicate,subject_pubchem
0,DZB84T,Pruritus,Approved,biolink:treats,PUBCHEM.COMPOUND:9831643
1,DZB84T,Progressive familial intrahepatic cholestasis,Phase 3,biolink:in_clinical_trials_for,PUBCHEM.COMPOUND:9831643
2,DZB84T,Alagille syndrome,Phase 2,biolink:in_clinical_trials_for,PUBCHEM.COMPOUND:9831643
3,DXP04H,Non-hodgkin lymphoma,Approved,biolink:treats,PUBCHEM.COMPOUND:129269915
4,DXP04H,Chronic lymphocytic leukaemia,Phase 3,biolink:in_clinical_trials_for,PUBCHEM.COMPOUND:129269915
...,...,...,...,...,...
10185,D00EIL,Solid tumour/cancer,Investigative,biolink:studied_to_treat,PUBCHEM.COMPOUND:6419747
10186,D00DJC,Solid tumour/cancer,Investigative,biolink:studied_to_treat,PUBCHEM.COMPOUND:11488447
10187,D00DHM,Psychotic disorder,Investigative,biolink:studied_to_treat,PUBCHEM.COMPOUND:14277536
10188,D00DDR,Multiple myeloma,Investigative,biolink:studied_to_treat,PUBCHEM.COMPOUND:91865905
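
For reference, a toy illustration of what `explode` does to a list-valued column. The frame and its IDs are made up; only the column names come from the notebook.

```python
import pandas as pd

# made-up rows: one drug with two PubChem IDs, one drug with a single ID
toy = pd.DataFrame({
    "subject_ttd_drug": ["D0AAAA", "D0BBBB"],
    "subject_pubchem": [
        ["PUBCHEM.COMPOUND:1", "PUBCHEM.COMPOUND:2"],
        ["PUBCHEM.COMPOUND:3"],
    ],
})

# each list element gets its own row; the other columns are repeated
exploded = toy.explode("subject_pubchem", ignore_index=True)
print(exploded)
```

Single-element lists are unpacked too, so every value in the column ends up as a plain string.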


### Filter out some indication names

First, filter out indication names that are known to be problematic: they either aren't "conditions that are treated", or I'm worried about how the resulting statement would read.

In [38]:
## EDA - used to help find problematic indications

test = "feline"

## set case=False so it isn't case-sensitive on matches!
df[df["object_indication_name"].str.contains(test, case=False)].shape

sorted(df[df["object_indication_name"].str.contains(test, case=False)]["object_indication_name"].unique())

(1, 5)

['Canine and feline spontaneous neoplasm']

In [39]:
## INCLUDE in parser

STRINGS_TO_FILTER = [
    "imaging",
    "radio",         ## related to imaging
    "esthesia",      ## for multiple spellings of an(a)esthesia
    "abortion",      ## problematic? but "spontaneous abortion" aka miscarriage can be treated...
    "sedation",
    "Discover", 
    "icide",         ## catches Herbicide, Insecticide, etc. But catches "poisoning" due to these things too
    "procedure",
    "barrier",       ## catches Blood brain barrier
    "astringent",
    "stimul",        ## catches "Caerulein stimulated..." and ovarian stimulation
    "suppress",      ## catches Appetite suppressant
    "contrast",      ## related to imaging
    "Diagnostic",    ## diagnostic
    "vasodilator",
    "Dutch elm disease",   ## this is a plant disease
    "Exam",
    "lubricant",
    "Localisation",
    "Measur",        ## catches Measure kidney function
    "Pest attack",   ## plant disease?
    "Plant grey",    ## catches Plant grey mould disease
    "Stabil",        ## catches Stabilize muscle contraction
    "canine",        ## Canine and feline spontaneous neoplasm
]

In [40]:
## INCLUDE in parser

## set case=False so it isn't case-sensitive on matches!
df = df[~df.object_indication_name.str.contains('|'.join(STRINGS_TO_FILTER), case=False)].copy()

## -> log
print(f"{df.shape[0]} rows after filtering out problematic indication names")

10054 rows after filtering out problematic indication names
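
The joined pattern is treated as a regular expression, which works here because none of the filter strings contain regex metacharacters. A hedged sketch of a safer variant that escapes each term first. The term list is a small subset plus one hypothetical entry, `"(diagnostic)"`, added only to show the escaping, and the names are invented.

```python
import re

# subset of the filter strings, plus a hypothetical term with parentheses
terms = ["imaging", "esthesia", "canine", "(diagnostic)"]

# re.escape makes every term match literally, even with characters like ( ) . +
pattern_str = "|".join(re.escape(t) for t in terms)
pattern = re.compile(pattern_str, flags=re.IGNORECASE)

# invented indication names for the demo
names = ["Brain imaging", "Asthma", "General anaesthesia", "Test (Diagnostic) use"]
kept = [n for n in names if not pattern.search(n)]
print(kept)
```

The same `pattern_str` could be passed to `str.contains(pattern_str, case=False)` in the pandas version of the filter.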


In [41]:
## check that filtering worked

for i in STRINGS_TO_FILTER:
    if df[df["object_indication_name"].str.contains(i, case=False)].shape[0] > 0:
        print(i)

### EDA map indication names

Uses NameRes.

A score threshold is used to remove "low-quality" mappings, but there's a tradeoff:
* some "incorrect" mappings slip through because their scores are high enough (300 < x < 500)
* some correct mappings get filtered out because their scores are low (< 300)

Picked 300 as "good enough", knowing that both problems remain.
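
One way to make the tradeoff concrete is to tally both kinds of mistakes at a few candidate thresholds. This is only a sketch: it assumes a manually reviewed sample of top hits exists, and the `reviewed_hits` records (names, scores, and `correct` flags) are entirely made up.

```python
# hypothetical reviewed sample: top-hit score plus a manual "is it correct?" flag
reviewed_hits = [
    {"name": "name A", "score": 120.0, "correct": True},
    {"name": "name B", "score": 250.0, "correct": False},
    {"name": "name C", "score": 350.0, "correct": False},
    {"name": "name D", "score": 450.0, "correct": True},
    {"name": "name E", "score": 600.0, "correct": True},
]


def tally(hits, threshold):
    """Count wrong hits that pass and correct hits that get dropped."""
    accepted = [h for h in hits if h["score"] > threshold]
    rejected = [h for h in hits if h["score"] <= threshold]
    return {
        "threshold": threshold,
        "wrong_accepted": sum(not h["correct"] for h in accepted),
        "correct_rejected": sum(h["correct"] for h in rejected),
    }


for t in (300, 500):
    print(tally(reviewed_hits, t))
```

The comparison uses the same strict `>` as the parser, so a score equal to the threshold counts as rejected.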

In [None]:
# ## get set of unique names to put into NameRes
# indication_names = df["object_indication_name"].unique()

# len(indication_names)

In [None]:
# ## EDA version

# indication_mapping = {}
# threshold = 300
# batch_size = 100

# ## set up variables to catch mapping failures
# stats_indication_mapping_failures = {
#     "unexpected_error": {},
#     "returned_empty": [],
#     "score_under_threshold": {},
# }
# ## for debug: stopping early
# # counter = 0

# for batch in batched(indication_names, batch_size):
#     ## returns tuples -> cast to list
#     req_body = {
#         "strings": list(batch),
#         "autocomplete": False,
#         "limit": 1,  ## only analyzing top hit
#         "biolink_types": ["DiseaseOrPhenotypicFeature"],
#         "exclude_prefixes": "UMLS|MESH",  ## try finding high-quality hits
#     }
#     r = requests.post(NAMERES_URL, json=req_body)
#     response = r.json()
        
#     ## not doing dict comprehension. allows easier review, logic writing
#     for k,v in response.items():
#         ## catch unexpected errors
#         try:
#             ## will catch if v is an empty list (aka NameRes didn't have info)
#             if v:
#                 ## v is a 1-element list, work with it directly
#                 temp = v[0]
#                 ## also throw out mapping if score <= threshold: want better-matching hits
#                 if temp["score"] > threshold:
#                     indication_mapping.update({
#                         k: {"id": temp["curie"],
#                             "label": temp["label"],
#                            }
#                     })
#                 else:
#                     temp_data = {
#                         "curie": temp["curie"],
#                         "label": temp["label"],
#                         "types": temp["types"][0],
#                         "score": temp["score"],
#                     }
#                     stats_indication_mapping_failures["score_under_threshold"].update({k: temp_data})
#             else:
#                 stats_indication_mapping_failures["returned_empty"].append(k)
#         except Exception as e:
#             stats_indication_mapping_failures["unexpected_error"].update({k: e})
#             print(f"Encountered an unexpected error: {e}.")
#             print(f"NameRes response key: {k}")
#             print(f"NameRes response value: {v}")
    
# #     counter += batch_size
# #     if counter == 500:
# #         break

In [None]:
# len(indication_mapping)

# stats_indication_mapping_failures.keys()

In [None]:
# stats_indication_mapping_failures["unexpected_error"]

# stats_indication_mapping_failures["returned_empty"]

In [None]:
# len(stats_indication_mapping_failures["score_under_threshold"])

# ## to review 
# # sorted(stats_indication_mapping_failures["score_under_threshold"].keys())

### Parser map indication names

In [42]:
## INCLUDE in parser

## get set of unique names to put into NameRes
indication_names = df["object_indication_name"].unique()

## EDA only
len(indication_names)

1222

In [43]:
## INCLUDE in parser

def run_nameres(names: Iterable[str], batch_size: int, types: list, url: str, score_threshold: int, exclude_namespace: str):
    """
    Parameters:
    - names: iterable of name strings to send to NameRes
    - batch_size: number of strings to include in 1 query to NameRes
    - types: list of biolink categories that NameRes hits should have
    - url: NameRes url/endpoint to use
    - score_threshold: only accept a hit if its score is strictly greater than this, for quality
    - exclude_namespace: |-delimited string of ID namespaces to exclude, for quality

    Returns: tuple of mapping dict and failure stats dict
    """
    ## set up variables to collect output
    mapping = {}
    stats_failures = {
        "unexpected_error": {},
        "returned_empty": [],
        "score_under_threshold": [],
    }
    ## for debug: stopping early
#     counter = 0

    for batch in batched(names, batch_size):
        req_body = {
            "strings": list(batch),  ## returns tuples -> cast to list
            "autocomplete": False,   ## names are complete search term
            "limit": 1,              ## only want to review top hit
            "biolink_types": types,
            "exclude_prefixes": exclude_namespace,    ## try to increase quality of hits
        }
        r = requests.post(url, json=req_body)
        r.raise_for_status()  ## fail loudly on HTTP errors instead of parsing an error payload
        response = r.json()

        ## not doing dict comprehension. allows easier review, logic writing
        for k,v in response.items():
            ## catch unexpected errors
            try:
                ## will catch if v is an empty list (aka NameRes didn't have info)
                if v:
                    ## v is a 1-element list, work with it directly
                    temp = v[0]
                    ## also throw out mapping if score <= score_threshold: want better-matching hits
                    if temp["score"] > score_threshold:
                        mapping.update({
                            k: temp["curie"]
                        })
                    else:
                        stats_failures["score_under_threshold"].append(k)
                else:
                    stats_failures["returned_empty"].append(k)
            except Exception as e:
                stats_failures["unexpected_error"].update({k: e})

#         counter += batch_size
#         if counter == 500:
#             break

    return mapping, stats_failures
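
`run_nameres` relies on a `batched` helper. This section doesn't show where it comes from; if it is `itertools.batched`, that only exists on Python 3.12+. A minimal stand-in for older interpreters, assuming the same "tuples of at most n items" behavior. The name `batched_compat` is made up here.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched_compat(iterable: Iterable, n: int) -> Iterator[tuple]:
    """Yield successive tuples of up to n items; the last one may be shorter."""
    if n < 1:
        raise ValueError("n must be at least one")
    it = iter(iterable)
    # islice pulls the next n items; an empty tuple means the input is exhausted
    while batch := tuple(islice(it, n)):
        yield batch


batches = list(batched_compat(["a", "b", "c", "d", "e"], 2))
print(batches)
```

Because it yields tuples, the `list(batch)` cast in the request body is still needed.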

In [44]:
## INCLUDE in parser

indication_score_threshold = 300
indication_batch_size = 100
indication_types = ["DiseaseOrPhenotypicFeature"]
indication_exclude_prefixes = "UMLS|MESH"

## use NAMERES_URL initialized earlier

indication_mapping, stats_indication_mapping_failures = \
    run_nameres(url=NAMERES_URL, 
                names=indication_names,
                types=indication_types,
                score_threshold=indication_score_threshold,
                batch_size=indication_batch_size,
                exclude_namespace=indication_exclude_prefixes
    )

In [45]:
## EDA only

len(indication_mapping)

stats_indication_mapping_failures.keys()

stats_indication_mapping_failures["unexpected_error"]
stats_indication_mapping_failures["returned_empty"]

len(stats_indication_mapping_failures["score_under_threshold"])

1051

dict_keys(['unexpected_error', 'returned_empty', 'score_under_threshold'])

{}

['Sterilant',
 'Contraception',
 'Kinetoplastids',
 'Vulnerary',
 'Chemoprotection',
 'Sweetener',
 'Expectorant']

164

In [46]:
## INCLUDE in parser

## get method returns None if key (indication name) not found in mapping 
df["object_nameres_id"] = [indication_mapping.get(i) for i in df["object_indication_name"]]

In [47]:
## INCLUDE in parser -> log

n_mapped = df["object_nameres_id"].notna().sum()
print(f"{n_mapped} rows with mapped indication names: {n_mapped / df.shape[0]:.1%}")

8350 rows with mapped indication names: 83.1%


**~83% of the data could be mapped.**

In [48]:
## INCLUDE in parser

## drop rows without indication mapping
df.dropna(subset="object_nameres_id", inplace=True, ignore_index=True)

In [49]:
df.shape

(8350, 6)

### Merge "duplicates" after mapping

<div class="alert alert-block alert-danger">

[2025-11-03]

With the current pipeline and data-modeling, **only the mapped columns uniquely define an edge**:
* subject_pubchem
* biolink_predicate
* object_nameres_id

**We still care about...**
* `subject_ttd_drug`: used to generate links to TTD in `source_record_urls`. That property can be multi-valued, which handles an edge that maps back to multiple original TTD IDs.

The rest won't be included in the output edges:
* `object_indication_name`: would be nice to include as an "original_object" edge property, but the pipeline currently overrides it with the provided object ID (`object_nameres_id`). It also shouldn't be multi-valued right now.
* `clinical_status`: not included in the output right now. It needs to be mapped to edge attribute(s) and values; then those properties could be included in the "unique definition of an edge".

In [50]:
## INCLUDE in parser

cols_define_edge = ["subject_pubchem", "biolink_predicate", "object_nameres_id"]
df = df.groupby(by=cols_define_edge).agg(set).reset_index().copy()
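
A toy version of the same merge, to show what ends up in the non-key columns. All values are made up; only the column names and the `groupby(...).agg(set)` call mirror the parser code.

```python
import pandas as pd

# made-up rows: the first two describe the same edge via different TTD drug IDs
toy = pd.DataFrame({
    "subject_pubchem": ["PUBCHEM.COMPOUND:1", "PUBCHEM.COMPOUND:1", "PUBCHEM.COMPOUND:2"],
    "biolink_predicate": ["biolink:treats", "biolink:treats", "biolink:treats"],
    "object_nameres_id": ["EX:disease1", "EX:disease1", "EX:disease1"],
    "subject_ttd_drug": ["D0AAAA", "D0BBBB", "D0CCCC"],
    "clinical_status": ["Approved", "Phase 4", "Approved"],
})

edge_cols = ["subject_pubchem", "biolink_predicate", "object_nameres_id"]

# every non-key column is collapsed into a set of the values seen for that edge
merged = toy.groupby(by=edge_cols).agg(set).reset_index()
print(merged)
```

After the merge, every non-key cell is a set, even when it holds a single value, which is why the EDA checks use `.map(len) > 1`.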

In [51]:
## EDA only

df[df["subject_ttd_drug"].map(len) > 1].shape

df[df["object_indication_name"].map(len) > 1].shape

df[df["clinical_status"].map(len) > 1].shape

(51, 6)

(25, 6)

(7, 6)

In [52]:
## EDA only

df[df["clinical_status"].map(len) > 1]

Unnamed: 0,subject_pubchem,biolink_predicate,object_nameres_id,subject_ttd_drug,object_indication_name,clinical_status
1624,PUBCHEM.COMPOUND:136033680,biolink:in_clinical_trials_for,MONDO:0018177,{D0V8AG},"{Recurrent glioblastoma, Glioblastoma multiforme}","{Phase 1/2, Phase 2}"
1925,PUBCHEM.COMPOUND:15950826,biolink:in_clinical_trials_for,MONDO:0056806,{D0X8TS},"{Non-small-cell lung cancer, Small-cell lung cancer}","{Phase 2, Phase 3}"
3111,PUBCHEM.COMPOUND:25195461,biolink:in_clinical_trials_for,MONDO:0005027,"{D2JF0S, D0TS8E}",{Epilepsy},"{Phase 1, Phase 2}"
3453,PUBCHEM.COMPOUND:3083544,biolink:treats,MONDO:0004979,"{D0XW2H, D0D1DI}",{Asthma},"{Approved, Phase 4}"
6101,PUBCHEM.COMPOUND:60825,biolink:in_clinical_trials_for,MONDO:0022839,"{D01SKR, D02UVT}",{Human immunodeficiency virus infection},"{Phase 2, Phase 3}"
7412,PUBCHEM.COMPOUND:85364156,biolink:in_clinical_trials_for,MONDO:0006769,{D0U2CE},"{Gastroparesis, Diabetic gastroparesis}","{Phase 2, Phase 3}"
7552,PUBCHEM.COMPOUND:89683805,biolink:in_clinical_trials_for,MONDO:0018874,{D0K7FT},"{Acute myeloid leukaemia, Acute myelogenous leukaemia}","{Phase 1/2, Phase 3}"


In [53]:
## INCLUDE in parser -> log

print(f"{df.shape[0]} rows after handling 'edge-level' duplicates\n")

## single quotes inside the f-strings keep these valid on Python < 3.12
print(f"{df['subject_pubchem'].nunique()} unique mapped drug IDs")

print(f"{df['object_nameres_id'].nunique()} unique mapped DiseaseOrPheno IDs")

# print(f"{df['clinical_status'].nunique()} unique clinical status values")

8276 rows after handling 'edge-level' duplicates

6048 unique mapped drug IDs
907 unique mapped DiseaseOrPheno IDs


In [None]:
## pipeline output would use to_dict
# df.to_dict(orient="records")

### EDA generating TTD links

In [54]:
## written out as explicit loops here because it takes two nested loops

for i in df["subject_ttd_drug"]:
    if len(i) > 1:
        ## INCLUDE IN PARSER
        temp = ["https://ttd.idrblab.cn/data/drug/details/" + j.lower() for j in i]
        print(temp)
        break

['https://ttd.idrblab.cn/data/drug/details/d02eyg', 'https://ttd.idrblab.cn/data/drug/details/d05jac']
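
A small sketch of how the link generation could be wrapped for the parser. The helper name `ttd_drug_urls` and the constant name are made up; the URL prefix and the lowercasing come from the cell above. Sorting is an addition so that the output order doesn't depend on set iteration order.

```python
TTD_DRUG_URL_PREFIX = "https://ttd.idrblab.cn/data/drug/details/"


def ttd_drug_urls(ttd_ids):
    """Build one TTD drug-details URL per ID, in a stable (sorted) order."""
    return [TTD_DRUG_URL_PREFIX + i.lower() for i in sorted(ttd_ids)]


# the two IDs from the output above, uppercased as they appear in the dataframe
urls = ttd_drug_urls({"D05JAC", "D02EYG"})
print(urls)
```

Applied to the whole column, this would look like `df["subject_ttd_drug"].map(ttd_drug_urls)`, with the result feeding `source_record_urls`.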
