# MeSH ID Parsing - UMLS

**UMLS page:**
- Create an account at https://www.nlm.nih.gov/research/umls/index.html. Signing in with a Gmail account works, but you need to sign up first.
- Approval usually takes 1-2 days, and an API key comes with the account by default.

**API instructions:**
- https://documentation.uts.nlm.nih.gov/rest/home.html


**API Terms of Use:**
- See: https://documentation.uts.nlm.nih.gov/terms-of-service.html
- In order to avoid overloading our servers, NLM requires that users **send no more than 20 requests per second per IP address**.
- Requests that exceed this limit may not be serviced, and **service will not be restored until the request rate falls beneath the limit**.
- To limit the number of requests that you send to the APIs, NLM **recommends caching results for a 12-24 hour period**. 
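
The two requirements above (a request-rate cap and result caching) can be sketched roughly as follows. This is a minimal illustration and not part of the notebook's pipeline: `Throttle` and `TTLCache` are hypothetical helper names, the 0.05 s interval is simply 1/20 of a second, and the 12-hour lifetime is the lower end of the recommended caching period.

```python
import time


class Throttle:
    """Blocks so that successive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=0.05, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # 0.05 s corresponds to 20 requests per second
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


class TTLCache:
    """Keeps values for `ttl` seconds, then treats them as missing."""

    def __init__(self, ttl=12 * 3600, clock=time.time):
        self.ttl = ttl
        self._clock = clock
        self._store = {}

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self._clock() - stored_at > self.ttl:
            del self._store[key]  # expired entry
            return None
        return value

    def put(self, key, value):
        self._store[key] = (self._clock(), value)
```

In use, one would call `throttle.wait()` right before each `requests.get(...)` and consult the cache (keyed by the search term or the CUID) before making a request at all.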

In [75]:
import time
import requests
import pandas as pd

## CUID Query
- CUID (called a CUI, Concept Unique Identifier, in the UMLS documentation) is the identifier UMLS uses to index concepts. Every lookup needs to start by finding the correct CUID.
- Before running this part, **remember to first load the data using the code blocks towards the end of the notebook.**

In [35]:
## Looking at the list of chemicals from the training set
df_train['chemicals']

0      [naloxone, clonidine, clonidine, nalozone, alp...
1                      [lidocaine, lidocaine, lidocaine]
2      [suxamethonium, suxamethonium chloride, sch, sch]
3      [galanthamine hydrobromide, scopolamine, hyosc...
4      [lithium, lithium, lithium, lithium, lithium, ...
                             ...                        
495    [zonisamide, zonisamide, zonisamide, zonisamid...
496    [tyrosine, pan, tyrosine, puromycin aminonucle...
497    [ticlopidine, ticlopidine, ticlopidine, ticlop...
498    [morphine, scopolamine, cycloheximide, morphin...
499    [apomorphine, dopamine agonist, dopamine, apom...
Name: chemicals, Length: 500, dtype: object

In [47]:
## Base url definitions:
base_template = 'https://uts-ws.nlm.nih.gov/rest'
CUID_template = '/search/current?string='
CUID_page = '/content/current/CUI/'
API_key = 'ENTER YOUR KEY HERE'
API_template = '&apiKey=' + API_key
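
The templates above are joined by plain string concatenation, which can produce a malformed URL for names that contain spaces or special characters (for example 'suxamethonium chloride' from row 2). A minimal sketch of building the same search URL with percent-encoding is shown here; `build_search_url` and `BASE_URL` are hypothetical names used only for illustration. Passing a dict through the `params` argument of `requests.get` should have the same effect.

```python
from urllib.parse import urlencode

BASE_URL = 'https://uts-ws.nlm.nih.gov/rest'


def build_search_url(term, api_key):
    """Return the search URL for `term`, with the query parameters encoded."""
    query = urlencode({'string': term, 'apiKey': api_key})
    return f'{BASE_URL}/search/current?{query}'


# Spaces in the search term are encoded instead of being sent raw.
print(build_search_url('suxamethonium chloride', 'ENTER YOUR KEY HERE'))
```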

## Handpicked Example: Naloxone
- The following shows one way of getting to the MeSH ID of naloxone, the first chemical in example 0 of the training set.
- In essence, we first do a word search on "naloxone" through the API to retrieve its CUID.
- Then we query the "atoms" section of the CUID page for naloxone, loop through the returned atoms, and look for the first one with term type "MH" (the MeSH main heading).
- We also do a simple check that the descriptor URL contains something like "/DXXX...", which should correspond to the MeSH ID, and then extract it.
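
The last two steps can be sketched as a pure function that works on the already-parsed atoms list, which makes the logic easy to test without calling the API. This is only an illustration: `first_mesh_id` is a hypothetical helper, and the sample atoms are trimmed to the two fields the check relies on (the descriptor URL is the one printed for naloxone later in this section).

```python
def first_mesh_id(atoms):
    """Return the MeSH descriptor ID of the first 'MH' atom, or None if there is none."""
    for atom in atoms:
        if atom.get('termType') != 'MH':
            continue
        url = atom.get('sourceDescriptor') or ''
        source_path, _, code = url.rpartition('/')
        source_name = source_path.rsplit('/', 1)[-1]
        # The last path segment should look like 'D009270'; the segment before it names the source.
        if 'MSH' in source_name and code.startswith('D') and code[1:].isdigit():
            return code
    return None


sample_atoms = [
    {'termType': 'PT', 'sourceDescriptor': 'NONE'},
    {'termType': 'MH',
     'sourceDescriptor': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/source/MSHFIN/D009270'},
]
print(first_mesh_id(sample_atoms))
```

Checking that the last path segment is "D" followed by digits is slightly stricter than searching for "/D" anywhere in the URL.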

In [96]:
##picking naloxone as an example
example = df_train['chemicals'][0]
example[0]

'naloxone'

In [97]:
## CUID query
CUID_query = base_template + CUID_template + example[0] + API_template

CUID_response = requests.get(CUID_query)
CUID_pages = CUID_response.json()

## Printing the first 2 results.  We can see an exact match here.
CUID_pages['result']['results'][0:2]

[{'ui': 'C0027358',
  'rootSource': 'MTH',
  'uri': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C0027358',
  'name': 'naloxone'},
 {'ui': 'C0700549',
  'rootSource': 'MTH',
  'uri': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C0700549',
  'name': 'naloxone hydrochloride'}]

In [98]:
CUID = 'C0027358'
CUID_info_query = base_template + CUID_page + CUID + '?' + API_template[1:]
CUID_info = requests.get(CUID_info_query)

CUID_info.json()

{'pageSize': 25,
 'pageNumber': 1,
 'pageCount': 1,
 'result': {'ui': 'C0027358',
  'name': 'naloxone',
  'dateAdded': '09-30-1990',
  'majorRevisionDate': '04-29-2021',
  'classType': 'Concept',
  'suppressible': False,
  'status': 'R',
  'semanticTypes': [{'name': 'Organic Chemical',
    'uri': 'https://uts-ws.nlm.nih.gov/rest/semantic-network/2024AA/TUI/T109'}],
  'atoms': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C0027358/atoms',
  'definitions': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C0027358/definitions',
  'relations': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C0027358/relations',
  'defaultPreferredAtom': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C0027358/atoms/preferred',
  'atomCount': 69,
  'cvMemberCount': 0,
  'attributeCount': 0,
  'relationCount': 19}}

In [105]:
CUID_info.json()['result']['semanticTypes'][0]['name'].lower()

'organic chemical'

In [107]:
## Retrieving the CUID and then compiling a query on the atoms part of the CUID page
CUID = CUID_pages['result']['results'][0]['uri'][(CUID_pages['result']['results'][0]['uri'].find('CUI') + 4):]
MeSHID_query = base_template + CUID_page + CUID + '/atoms?' + API_template[1:]
MeSHID_response = requests.get(MeSHID_query)

## Looping through the results to find the first entry with 'MH'
results = MeSHID_response.json()['result']

for result in results:
    if result['termType'] == 'MH':
        concept_url = result['sourceDescriptor']
        if '/D' in concept_url and 'MSH' in concept_url:
            MeSHID = concept_url[concept_url.find('/D') + 1:]
            break
MeSHID

'D009270'

In [106]:
concept_url

'https://uts-ws.nlm.nih.gov/rest/content/2024AA/source/MSHFIN/D009270'

In [94]:
## Additional example: nalozone has '-1' as its MeSH ID in the dataset, but how do we want to deal with this
## when it's very likely the same as naloxone...?

## Note that a CUID is returned, but the results are about diseases!!!

test = 'nalozone'
## CUID query
CUID_query = base_template + CUID_template + test + API_template

CUID_response = requests.get(CUID_query)
CUID_pages = CUID_response.json()

## Printing the first 2 results.  Neither one is a chemical, let alone naloxone.
CUID_pages['result']['results'][0:2]

[{'ui': 'C0038826',
  'rootSource': 'MSH',
  'uri': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C0038826',
  'name': 'Superinfection'},
 {'ui': 'C0262417',
  'rootSource': 'SNOMEDCT_US',
  'uri': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C0262417',
  'name': 'Acute on chronic pancreatitis'}]
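
One possible way to catch a misspelling like this before it reaches the API is to compare each term against the other names seen in the same article. The sketch below uses `difflib` from the standard library; `closest_known_name` and the 0.8 cutoff are assumptions made for illustration. Whether such a correction is desirable is a separate question, since the dataset itself labels nalozone as '-1'.

```python
import difflib


def closest_known_name(term, known_names, cutoff=0.8):
    """Return the most similar name from `known_names`, or None if nothing is close enough."""
    matches = difflib.get_close_matches(term, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Other chemical names from example 0 of the training set.
known = ['naloxone', 'clonidine', 'alpha-methyldopa']
print(closest_known_name('nalozone', known))
```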

## Defining Functions for Calling UMLS

In [92]:
def UMLS_retrieve_CUID(entries, API_key, wait_time = 0.1, verbose = False, partial = False):
    '''
    Function to retrieve CUID from UMLS.  Extracts simply the first returned entry at the moment.
    
    Inputs:
    entries: string (single entity) or list of chemical/disease names (single or multiple entities)
    API_key: UMLS api key, string format
    wait_time: wait_time between each call to the API, defaulted to 0.1 which leaves some buffer for the 
               20 requests per second per IP address cap
    verbose: whether to provide additional info for returned entries.
    partial: whether to consider partial matches from UMLS. #### NOT YET IMPLEMENTED 
    
    Returns: 
    CUIDs in the original format (list or string)    
    '''
    base_template = 'https://uts-ws.nlm.nih.gov/rest'
    CUID_template = '/search/current?string='
    CUID_page = '/content/current/CUI/'
    API_template = '&apiKey=' + API_key
    
    assert type(entries) == str or type(entries) == list, f"Search term(s) for entry {entries} should be string or list."
    
    string_entered = False
    
    if type(entries) == str:
        entries = [entries]
        string_entered = True
        
    CUIDs = []
    
    for entry in entries:
        CUID_query = base_template + CUID_template + entry + API_template
        CUID_response = requests.get(CUID_query)
        
        assert CUID_response.status_code == 200, f"Error in calling API for entry {entry}, please check connection or API key."
        
        CUID_pages = CUID_response.json()
        
        ## Default to '-1' so that CUID is always defined, including when partial = True
        ## (not yet implemented) or when the returned JSON cannot be read
        CUID = '-1'
        
        if CUID_pages['result']['recCount'] == 0:
            if verbose:
                print(f"No complete match found for entry {entry}")
        else:
            try:
                uri = CUID_pages['result']['results'][0]['uri']
                CUID = uri[(uri.find('CUI') + 4):]
            except (KeyError, IndexError, TypeError):
                print(f"Error in reading the returned JSON on the CUID query for entry {entry}.  Call format may have changed?")
            
        CUIDs.append(CUID)
        
        time.sleep(wait_time)
        
    assert len(CUIDs) == len(entries), f"The returned number of CUIDs is {len(CUIDs)}, less than that of the inputs ({len(entries)})"
        
    if string_entered:
        CUIDs = str(CUIDs[0])
        
    return CUIDs    
    

In [186]:
def UMLS_retrieve_MeSHID(entries, entity_type, API_key, wait_time = 0.1001, verbose = False, partial = False):
    '''
    Function to retrieve MeSH ID from UMLS.  Extracts simply the first returned "suitable" entry at the moment.
    
    Specifically, suitable means:
    (i)   the termType is "MH"
    (ii)  "MSH" is found in the descriptor link
    (iii) phrase similar to "/D" is found in the descriptor link
    (iv)  semantic type of the CUID page matches that of the declared entity type
    
    Inputs:
    entries: string (single entity) or list of CUIDs (single or multiple entities),
             presumably coming from the "UMLS_retrieve_CUID" function
    entity_type: 'chemical' or 'disease'
    API_key: UMLS api key, string format
    wait_time: wait time after each entry (each entry makes up to two API calls), defaulted to 0.1001 which 
               leaves some buffer for the 20 requests per second per IP address cap
    verbose: whether to provide additional info for returned entries.
    partial: whether to consider partial matches from UMLS. #### NOT YET IMPLEMENTED 
    
    Returns: 
    MeSH IDs in the same format as the input (list or string)
    '''
    base_template = 'https://uts-ws.nlm.nih.gov/rest'
    CUID_template = '/search/current?string='
    CUID_page = '/content/current/CUI/'
    API_template = '&apiKey=' + API_key
    ## manual selection from https://lhncbc.nlm.nih.gov/ii/tools/MetaMap/Docs/SemanticTypes_2018AB.txt
    chemical_typings = ['amino acid, peptide, or protein',
                        'amino acid sequence',
                        'antibiotic',
                        'biologically active substance',
                        'body substance',
                        'chemical',
                        'chemical viewed functionally',
                        'chemical viewed structurally',
                        'clinical drug',
                        'carbohydrate sequence',
                        'element, ion, or isotope',
                        'enzyme',
                        'hazardous or poisonous substance',
                        'hormone',
                        'inorganic chemical',
                        'indicator, reagent, or diagnostic aid',
                        'molecular sequence',
                        'nucleic acid, nucleoside, or nucleotide',
                        'nucleotide sequence',
                        'organic chemical',
                        'pharmacologic substance',
                        'plant',
                        'substance',
                        'vitamin']
    ## manual selection from https://lhncbc.nlm.nih.gov/ii/tools/MetaMap/Docs/SemanticTypes_2018AB.txt
    disease_typings = ['acquired abnormality',
                       'anatomical abnormality',
                       'bacterium',
                       'congenital abnormality',
                       'clinical attribute',
                       'cell or molecular dysfunction',
                       'disease or syndrome',
                       'experimental model of disease',
                       'event', ## "Fungus",
                       'injury or poisoning',
                       'mental or behavioral dysfunction',
                       'pathologic function', ## Patient or Disabled Group,
                       'sign or symptom',
                       'virus']
    
    assert type(entries) == str or type(entries) == list, f"Search CUIDs for entry {entries} should be string or list."
    assert entity_type in ['chemical', 'disease'], f"Entity type must be 'chemical' or 'disease' for entry {entries}"
    
    string_entered = False
    
    if type(entries) == str:
        entries = [entries]
        string_entered = True
        
    MeSH_IDs = []
    
    for entry in entries:
        ## entry of -1 gives MeSH ID of -1
        if entry == "-1":
            MeSH_ID = "-1"
            MeSH_IDs.append(MeSH_ID)
            continue
            
        ## Checking entity type
        CUID_info_query = base_template + CUID_page + entry + '?' + API_template[1:]
        CUID_info = requests.get(CUID_info_query)
        
        assert CUID_info.status_code == 200, f"Error in calling API for entry {entry}, please check connection or API key."
        semantic_type = CUID_info.json()['result']['semanticTypes'][0]['name']
        
        if entity_type == 'chemical' and semantic_type.lower() not in chemical_typings:
            MeSH_ID = "-1"
            MeSH_IDs.append(MeSH_ID)
            continue
        ## for disease, wonder what are the possible types?  
        ## There are terms like "abnormality", "dysfunction" and "injury or poisoning"
        ## See:https://lhncbc.nlm.nih.gov/ii/tools/MetaMap/Docs/SemanticTypes_2018AB.txt
        elif entity_type == 'disease' and semantic_type.lower() not in disease_typings:
            MeSH_ID = "-1"
            MeSH_IDs.append(MeSH_ID)
            continue
            
        else:
            MeSHID_query = base_template + CUID_page + entry + '/atoms?' + API_template[1:]
            MeSHID_response = requests.get(MeSHID_query)
            
            found = False
            
            assert MeSHID_response.status_code == 200, f"Error in calling API for entry {entry}, please check connection or API key."

            ## Looping through the results to find the first entry with 'MH'
            MeSH_results = MeSHID_response.json()['result']

            for result in MeSH_results:
                if result['termType'] == 'MH':
                    concept_url = result['sourceDescriptor']
                    if '/D' in concept_url and 'MSH' in concept_url:
                        MeSH_ID = concept_url[concept_url.find('/D') + 1:]
                        MeSH_IDs.append(MeSH_ID)
                        found =  True
                        break
                        
            if not found:
                MeSH_ID = "Not found"
                MeSH_IDs.append(MeSH_ID)
        
        time.sleep(wait_time)
    
    assert len(MeSH_IDs) == len(entries), f"The returned number of MeSH IDs is {len(MeSH_IDs)}, less than that of the inputs ({len(entries)})"
        
    if string_entered:
        MeSH_IDs = str(MeSH_IDs[0])
        
    return MeSH_IDs

## Test Run on the Full Example 0 of the Training Set

In [138]:
entries

['naloxone',
 'clonidine',
 'clonidine',
 'nalozone',
 'alpha-methyldopa',
 'naloxone',
 'naloxone',
 'clonidine',
 '3h-naloxone',
 'naloxone',
 'clonidine',
 '3h-dihydroergocryptine',
 'naloxone',
 'clonidine',
 'clonidine',
 'alpha-methyldopa']

In [136]:
entries = df_train['chemicals'][0]
CUID_testcases = UMLS_retrieve_CUID(entries, API_key)

CUID_testcases

['C0027358',
 'C0009014',
 'C0009014',
 'C0038826',
 'C0025741',
 'C0027358',
 'C0027358',
 'C0009014',
 '-1',
 'C0027358',
 'C0009014',
 '-1',
 'C0027358',
 'C0009014',
 'C0009014',
 'C0025741']

In [140]:
## Correct answers: ['D009270', 'D003000', 'D003000', '-1', 'D008750',
##                   'D009270', 'D009270', 'D003000', '-1', 'D009270',
##                   'D003000', '-1', 'D009270', 'D003000', 'D003000', 'D008750']
MeSH_testcases = UMLS_retrieve_MeSHID(CUID_testcases, 'chemical', API_key)

MeSH_testcases

['D009270',
 'D003000',
 'D003000',
 '-1',
 'D008750',
 'D009270',
 'D009270',
 'D003000',
 '-1',
 'D009270',
 'D003000',
 '-1',
 'D009270',
 'D003000',
 'D003000',
 'D008750']

## Testing on 10 Rows of the Chemical Columns of the Training Set
- Testing on all rows (using .apply) took over an hour and ended in connection errors.
- The test is therefore truncated to the 2nd to 11th rows only.
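
Repeated mentions may be one contributor to the long runtime: row 1 alone lists lidocaine three times, and each mention triggers its own API call plus a wait. A minimal sketch of looking up each distinct name only once is shown below. `lookup_unique` is a hypothetical helper, and `lookup` stands for any single-name function such as `lambda name: UMLS_retrieve_CUID(name, API_key)`.

```python
def lookup_unique(names, lookup, cache=None):
    """Call `lookup` once per distinct name and return results aligned with `names`."""
    cache = {} if cache is None else cache
    for name in names:
        if name not in cache:
            cache[name] = lookup(name)
    return [cache[name] for name in names]


# Demonstration with a stand-in lookup that records how often it is called.
calls = []
shared_cache = {}
fake_lookup = lambda name: (calls.append(name), name.upper())[1]

print(lookup_unique(['lidocaine', 'lidocaine', 'lidocaine'], fake_lookup, shared_cache))
print(calls)
```

Passing the same `cache` dict for every row extends the saving across the whole dataset, since common chemicals recur in many articles.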

In [148]:
df_train_truncated = df_train[1:11]

In [163]:
## using .apply somehow took forever
mapped_CUID = []

for chemical in df_train_truncated['chemicals']:
    
    chemical_CUID = UMLS_retrieve_CUID(chemical, API_key)
    mapped_CUID.append(chemical_CUID)
    
    time.sleep(2)
    
df_train_truncated = df_train_truncated.assign(mapped_CUID = mapped_CUID)

In [189]:
mapped_MeSH = []

for chemical_CUID in df_train_truncated['mapped_CUID']:
    
    chemical_MeSH = UMLS_retrieve_MeSHID(chemical_CUID, 'chemical', API_key)
    mapped_MeSH.append(chemical_MeSH)
    time.sleep(2)
    
df_train_truncated = df_train_truncated.assign(mapped_MeSH = mapped_MeSH)

In [190]:
df_train_truncated['chemical_ids_match'] = df_train_truncated['mapped_MeSH'] == df_train_truncated['chemical_ids']

df_train_truncated

Unnamed: 0,article_code,title,abstract,chemicals,diseases,chemical_start_indices,chemical_end_indices,disease_start_indices,disease_end_indices,chemical_ids,disease_ids,CID_chemical,CID_disease,CID_chemical_name,CID_disease_name,mapped_CUID,mapped_MeSH,chemical_ids_match
1,354896,Lidocaine-induced cardiac asystole.,Intravenous administration of a single 50-mg b...,"[lidocaine, lidocaine, lidocaine]","[cardiac asystole, depression, bradyarrhythmias]","['0', '90', '409']","['9', '99', '418']","['18', '142', '331']","['34', '152', '347']","[D008012, D008012, D008012]","[D006323, D003866, D001919]",[D008012],[D006323],[lidocaine],[cardiac asystole],"[C0023660, C0023660, C0023660]","[D008012, D008012, D008012]",True
2,435349,Suxamethonium infusion rate and observed fasci...,Suxamethonium chloride (Sch) was administered ...,"[suxamethonium, suxamethonium chloride, sch, sch]","[fasciculations, tetanic, fasciculations, fasc...","['0', '80', '104', '312']","['13', '102', '107', '315']","['41', '265', '395', '483', '523', '538', '561...","['55', '272', '409', '496', '536', '544', '568...","[D013390, D013390, D013390, D013390]","[D005207, D013746, D005207, D005207, D005207, ...",[D013390],[D005207],[suxamethonium],[fasciculations],"[C0038627, C0012792, C5771327, C5771327]","[D013390, D013390, -1, -1]",False
3,603022,"Galanthamine hydrobromide, a longer acting ant...","Galanthamine hydrobromide, an anticholinestera...","[galanthamine hydrobromide, scopolamine, hyosc...",[overdosage],"['0', '111', '124', '135', '292', '305', '352'...","['25', '122', '132', '160', '303', '313', '365...",['315'],['325'],"[D005702, D012601, D012601, D005702, D012601, ...",[D062787],[D012601],[D062787],[scopolamine],[overdosage],"[C0949312, C0036442, C0036442, C0949312, C0036...","[D005702, D012601, D012601, D005702, D012601, ...",True
4,1378968,Effects of uninephrectomy and high protein fee...,Rats with lithium-induced nephropathy were sub...,"[lithium, lithium, lithium, lithium, lithium, ...","[chronic renal failure, nephropathy, renal fai...","['54', '111', '362', '520', '581', '608', '632...","['61', '118', '369', '527', '588', '615', '639...","['70', '127', '309', '975', '1000', '1027', '1...","['91', '138', '322', '986', '1012', '1045', '1...","[D008094, D008094, D008094, D008094, D008094, ...","[D007676, D007674, D051437, D011507, D006973, ...","[D008094, D008094, D008094]","[D006973, D011507, D007676]","[lithium, lithium, lithium]","[hypertension, proteinuria, chronic renal fail...","[C0023870, C0023870, C0023870, C0023870, C0023...","[D008094, D008094, D008094, D008094, D008094, ...",False
5,1420741,Treatment of Crohn's disease with fusidic acid...,Fusidic acid is an antibiotic with T-cell spec...,"[fusidic acid, cyclosporin, cyclosporin, fusid...","[crohn's disease, crohn's disease, crohn's dis...","['34', '107', '217', '391', '507', '743', '120...","['46', '118', '228', '403', '519', '755', '121...","['13', '292', '467', '910', '1263', '1440']","['28', '307', '482', '916', '1278', '1466']","[D005672, D016572, D016572, D005672, D005672, ...","[D003424, D003424, D003424, D009325, D003424, ...",[D005672],[D009325],[fusidic acid],[nausea],"[C0016875, C0010592, C0010592, C0016875, C0016...","[D005672, D016572, D016572, D005672, D005672, ...",True
6,1601297,Electrocardiographic evidence of myocardial in...,The electrocardiograms (ECG) of 99 cocaine-abu...,"[cocaine, cocaine, cocaine]","[myocardial injury, schizophrenic, myocardial ...","['83', '135', '232']","['90', '142', '239']","['33', '194', '305', '334', '357', '371']","['50', '207', '322', '355', '365', '390']","[D003042, D003042, D003042]","[D009202, D012559, D009202, D009203, D007511, ...","[D003042, D003042]","[D009203, D002037]","[cocaine, cocaine]","[myocardial infarction, bundle branch block]","[C0009170, C0009170, C0009170]","[D003042, D003042, D003042]",True
7,1967484,Sulpiride-induced tardive dystonia.,Sulpiride is a selective D2-receptor antagonis...,"[sulpiride, sulpiride, antidepressant, sulpiri...","[tardive dystonia, tardive dyskinesia, parkins...","['0', '36', '107', '204', '395', '456']","['9', '45', '121', '213', '404', '465']","['18', '222', '245', '355', '474']","['34', '240', '257', '363', '490']","[D013469, D013469, D000928, D013469, D013469, ...","[D004421, D004409, D010302, D004421, D004421]",[D013469],[D004421],[sulpiride],[tardive dystonia],"[C0038803, C0038803, C0003289, C0038803, C0038...","[D013469, D013469, D000928, D013469, D013469, ...",True
8,2234245,Ocular and auditory toxicity in hemodialyzed p...,During an 18-month period of study 41 hemodial...,"[desferrioxamine, desferrioxamine, desferrioxa...","[ocular and auditory toxicity, audiovisual tox...","['64', '151', '766', '1030', '1097', '1234']","['79', '166', '781', '1045', '1106', '1249']","['0', '250', '314', '457', '534', '576', '604'...","['28', '270', '341', '472', '548', '599', '631...","[D003676, D003676, D003676, D003676, -1, D003676]","[D014786|D006311, D014786|D006311, D014786|D00...","[D003676, D003676, D003676]","[D012164, D014786, D006319]","[desferrioxamine, desferrioxamine, desferrioxa...","[pigmentary retinal deposits, visual toxicity,...","[C0011145, C0011145, C0011145, C0011145, C0002...","[D003676, D003676, D003676, D003676, D000535, ...",False
9,2385256,Myasthenia gravis presenting as weakness after...,We studied a patient with no prior history of ...,"[magnesium, magnesium, magnesium, magnesium, a...","[myasthenia gravis, neuromuscular disease, qua...","['47', '192', '245', '321', '691', '777', '1024']","['56', '201', '254', '330', '704', '786', '1033']","['0', '119', '162', '221', '525', '761', '844'...","['17', '140', '174', '233', '560', '770', '861...","[D008274, D008274, D008274, D008274, D000109, ...","[D009157, D009468, D011782, D011225, D009468, ...",[D008274],[D009157],[magnesium],[myasthenia gravis],"[C0024467, C0024467, C0024467, C0024467, C0001...","[D008274, D008274, D008274, D008274, D000109, ...",True
10,2505783,Chloroacetaldehyde and its contribution to uro...,"Based on clinical data, indicating that chloro...","[chloroacetaldehyde, cyclophosphamide, ifosfam...","[hemorrhagic cystitis, bladder damage]","['0', '77', '97', '192', '212', '349', '423', ...","['18', '93', '107', '210', '215', '352', '426'...","['375', '476']","['395', '490']","[C004656, D003520, D007069, C004656, C004656, ...","[D006470|D003556, D001745]","[C004656, C004656]","[D003556, D006470]","[chloroacetaldehyde, chloroacetaldehyde]","[cystitis, bleeding]","[C0055382, C0010583, C0020823, C0055382, C1842...","[Not found, D003520, D007069, Not found, -1, -...",False


## Some error analysis
- Some simple terms don't work in the API.
- One example is "li", which is supposed to be shorthand for lithium.

In [196]:
df_train_truncated['chemicals'][4]

['lithium',
 'lithium',
 'lithium',
 'lithium',
 'lithium',
 'lithium',
 'lithium',
 'lithium',
 'lithium',
 'li',
 'lithium',
 'creatinine',
 'lithium',
 'li',
 'li']

In [197]:
df_train_truncated['chemical_ids'][4]

['D008094',
 'D008094',
 'D008094',
 'D008094',
 'D008094',
 'D008094',
 'D008094',
 'D008094',
 'D008094',
 'D008094',
 'D008094',
 'D003404',
 'D008094',
 'D008094',
 'D008094']

In [198]:
df_train_truncated['mapped_MeSH'][4]

['D008094',
 'D008094',
 'D008094',
 'D008094',
 'D008094',
 'D008094',
 'D008094',
 'D008094',
 'D008094',
 '-1',
 'D008094',
 'D003404',
 'D008094',
 '-1',
 '-1']
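
The row-level `chemical_ids_match` flag only says that something in the row differs. A small sketch for listing exactly which mentions disagree is given below; `mismatched_mentions` is a hypothetical helper, and the three lists reproduce the row 4 outputs shown above.

```python
def mismatched_mentions(names, gold_ids, predicted_ids):
    """Return (position, name, gold, predicted) for every mention whose IDs differ."""
    return [(i, name, gold, pred)
            for i, (name, gold, pred) in enumerate(zip(names, gold_ids, predicted_ids))
            if gold != pred]


# Row 4 of the truncated training set, as printed above.
names = ['lithium'] * 9 + ['li', 'lithium', 'creatinine', 'lithium', 'li', 'li']
gold = ['D008094'] * 11 + ['D003404'] + ['D008094'] * 3
pred = ['D008094'] * 9 + ['-1', 'D008094', 'D003404', 'D008094', '-1', '-1']

for item in mismatched_mentions(names, gold, pred):
    print(item)
```

For this row, every mismatch is a "li" mention that was mapped to '-1'.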

In [224]:
CUID_query

'https://uts-ws.nlm.nih.gov/rest/search/current?string=li&apiKey=ENTER YOUR KEY HERE'

In [231]:
entry = 'li'

CUID_query = base_template + CUID_template + entry + API_template + '&sabs=MSH'
CUID_response = requests.get(CUID_query)

In [232]:
CUID_response.json()

{'pageSize': 25,
 'pageNumber': 1,
 'result': {'classType': 'searchResults',
  'results': [{'ui': 'C0966327',
    'rootSource': 'MSH',
    'uri': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C0966327',
    'name': 'LI 160'},
   {'ui': 'C5772942',
    'rootSource': 'MSH',
    'uri': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C5772942',
    'name': 'Li people'},
   {'ui': 'C0674977',
    'rootSource': 'MSH',
    'uri': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C0674977',
    'name': 'LI 1370'},
   {'ui': 'C1310770',
    'rootSource': 'MSH',
    'uri': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C1310770',
    'name': 'LI 150'},
   {'ui': 'C3529829',
    'rootSource': 'MSH',
    'uri': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C3529829',
    'name': 'polygalasaponin LI'},
   {'ui': 'C5565544',
    'rootSource': 'MSH',
    'uri': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C5565544',
    'name': 'gypenoside LI'},
   {'ui': 'C0085390',
   

## Import Dataset

In [2]:
import ast

In [3]:
# Path for datasets
datapath = '../data/'

In [7]:
# Load datasets

df_train = pd.read_csv(f'{datapath}' + 'OfficialTrainingSet1.csv')
df_val = pd.read_csv(f'{datapath}' + 'OfficialValidationSet1.csv')
df_test = pd.read_csv(f'{datapath}' + 'OfficialTestSet1.csv')

print("Shape of train dataset:", df_train.shape)
print("Shape of validation dataset:", df_val.shape)
print("Shape of test dataset:", df_test.shape)

df_train.head(3)

Shape of train dataset: (500, 13)
Shape of validation dataset: (500, 13)
Shape of test dataset: (500, 13)


Unnamed: 0,article_code,title,abstract,chemicals,diseases,chemical_start_indices,chemical_end_indices,disease_start_indices,disease_end_indices,chemical_ids,disease_ids,CID_chemical,CID_disease
0,227508,Naloxone reverses the antihypertensive effect ...,"In unanesthetized, spontaneously hypertensive ...","['Naloxone', 'clonidine', 'clonidine', 'nalozo...","['hypertensive', 'hypotensive', 'hypertensive'...","['0', '49', '181', '244', '306', '354', '364',...","['8', '58', '190', '252', '322', '362', '372',...","['93', '274', '469', '750']","['105', '285', '481', '762']","['D009270', 'D003000', 'D003000', '-1', 'D0087...","['D006973', 'D007022', 'D006973', 'D006973']",['D008750'],['D007022']
1,354896,Lidocaine-induced cardiac asystole.,Intravenous administration of a single 50-mg b...,"['Lidocaine', 'lidocaine', 'lidocaine']","['cardiac asystole', 'depression', 'bradyarrhy...","['0', '90', '409']","['9', '99', '418']","['18', '142', '331']","['34', '152', '347']","['D008012', 'D008012', 'D008012']","['D006323', 'D003866', 'D001919']",['D008012'],['D006323']
2,435349,Suxamethonium infusion rate and observed fasci...,Suxamethonium chloride (Sch) was administered ...,"['Suxamethonium', 'Suxamethonium chloride', 'S...","['fasciculations', 'tetanic', 'Fasciculations'...","['0', '80', '104', '312']","['13', '102', '107', '315']","['41', '265', '395', '483', '523', '538', '561...","['55', '272', '409', '496', '536', '544', '568...","['D013390', 'D013390', 'D013390', 'D013390']","['D005207', 'D013746', 'D005207', 'D005207', '...",['D013390'],['D005207']


In [8]:
# Data transformation functions

def convert_col_to_list(string):
    """
    Converts a string that looks like a list (as stored in the list-like columns of the CSV) into an actual list 
    """
    return ast.literal_eval(string)


def lowercase_cols(lst):
    """
    Converts chemicals and diseases column to lowercase
    """
    return [item.lower() for item in lst]


def map_cid_to_chemical_name(row):
    """
    Maps CID of chemical in the CID_chemical column into the actual name of the chemical
    """
    cid_chemicals = row['CID_chemical']
    chemical_ids = row['chemical_ids']
    chemicals = row['chemicals']
    
    chemical_names = []
    
    for cid in cid_chemicals:
        if cid in chemical_ids:
            idx = chemical_ids.index(cid)
            chemical_names.append(chemicals[idx])
        else:
            chemical_names.append('unknown')
    
    return chemical_names


def map_cid_to_disease_name(row):
    """
    Maps CID of disease in the CID_disease column into the actual name of the disease
    """
    cid_diseases = row['CID_disease']
    disease_ids = row['disease_ids']
    diseases = row['diseases']
    
    disease_names = []
    
    for cid in cid_diseases:
        if cid in disease_ids:
            idx = disease_ids.index(cid) 
            disease_names.append(diseases[idx]) 
        else:
            disease_names.append('unknown')
    
    return disease_names


# Function to handle "unknown" for chemical names
def map_cid_to_chemical_name_unknown(data):
    '''
    Addresses 'unknown' instances of CID_chemical_names caused by chemicals with pipe (|) notation
    '''
    chemical_id_map = {}
    for i, row in data.iterrows():
        for cid, chemical in zip(row['chemical_ids'], row['chemicals']):
            chemical_id_map[cid] = chemical
    
    # Function to map "unknown" to the correct chemical name if possible
    def resolve_unknown_chemical_name(cids):
        names = []
        for cid in cids:
            # Split combined IDs (separated by '|') and check for matches in the map
            split_ids = cid.split('|')
            name = ' | '.join([chemical_id_map.get(split_id, 'unknown') for split_id in split_ids])
            names.append(name)
        return names

    # Apply the function only to rows where CID_chemical_name has "unknown"
    data['CID_chemical_name'] = data.apply(lambda row: resolve_unknown_chemical_name(row['CID_chemical']) 
                                       if 'unknown' in row['CID_chemical_name'] else row['CID_chemical_name'], axis=1)
    return data

# Function to handle "unknown" for disease names
def map_cid_to_disease_name_unknown(data):
    '''
    Addresses 'unknown' instances of CID_disease_names caused by diseases with pipe (|) notation
    '''
    disease_id_map = {}
    for i, row in data.iterrows():
        for cid, disease in zip(row['disease_ids'], row['diseases']):
            disease_id_map[cid] = disease
    
    # Function to map "unknown" to the correct disease name if possible
    def resolve_unknown_disease_name(cids):
        names = []
        for cid in cids:
            # Split combined IDs (separated by '|') and check for matches in the map
            split_ids = cid.split('|')
            name = ' | '.join([disease_id_map.get(split_id, 'unknown') for split_id in split_ids])
            names.append(name)
        return names

    # Apply the function only to rows where CID_disease_name has "unknown"
    data['CID_disease_name'] = data.apply(lambda row: resolve_unknown_disease_name(row['CID_disease']) 
                                      if 'unknown' in row['CID_disease_name'] else row['CID_disease_name'], axis=1)
    return data
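
To make the pipe (|) handling more concrete, here is a minimal sketch of what the two `resolve_unknown_...` helpers do for a single combined ID. `resolve_piped_ids` is a hypothetical stand-alone version, and the one-entry lookup table is made up for illustration (in the notebook the table is built from every row of the dataset).

```python
def resolve_piped_ids(cid, id_to_name):
    """Split a combined ID such as 'A|B' and look up each part separately."""
    return ' | '.join(id_to_name.get(part, 'unknown') for part in cid.split('|'))


# Hypothetical lookup table with a single known ID.
id_to_name = {'D014786': 'visual toxicity'}

print(resolve_piped_ids('D014786|D006311', id_to_name))
```

A part that never appears on its own anywhere in the dataset still resolves to 'unknown', so this approach reduces the number of unknown names without necessarily eliminating them.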

In [9]:
# Apply the data transformations functions to all three datasets

list_columns = ['chemicals', 'diseases', 'chemical_ids', 'disease_ids', 'CID_chemical', 'CID_disease']
for df in (df_train, df_val, df_test):
    # Convert each list-like column to Python lists
    for col in list_columns:
        df[col] = df[col].apply(convert_col_to_list)

    # Normalize entity mentions to lowercase
    for col in ('chemicals', 'diseases'):
        df[col] = df[col].apply(lowercase_cols)

    # Map each CID to its entity name
    df['CID_chemical_name'] = df.apply(map_cid_to_chemical_name, axis=1)
    df['CID_disease_name'] = df.apply(map_cid_to_disease_name, axis=1)

df_train = map_cid_to_chemical_name_unknown(df_train)
df_train = map_cid_to_disease_name_unknown(df_train)
df_val = map_cid_to_chemical_name_unknown(df_val)
df_val = map_cid_to_disease_name_unknown(df_val)
df_test = map_cid_to_chemical_name_unknown(df_test)
df_test = map_cid_to_disease_name_unknown(df_test)

df_train.head(3)

Unnamed: 0,article_code,title,abstract,chemicals,diseases,chemical_start_indices,chemical_end_indices,disease_start_indices,disease_end_indices,chemical_ids,disease_ids,CID_chemical,CID_disease,CID_chemical_name,CID_disease_name
0,227508,Naloxone reverses the antihypertensive effect ...,"In unanesthetized, spontaneously hypertensive ...","[naloxone, clonidine, clonidine, nalozone, alp...","[hypertensive, hypotensive, hypertensive, hype...","['0', '49', '181', '244', '306', '354', '364',...","['8', '58', '190', '252', '322', '362', '372',...","['93', '274', '469', '750']","['105', '285', '481', '762']","[D009270, D003000, D003000, -1, D008750, D0092...","[D006973, D007022, D006973, D006973]",[D008750],[D007022],[alpha-methyldopa],[hypotensive]
1,354896,Lidocaine-induced cardiac asystole.,Intravenous administration of a single 50-mg b...,"[lidocaine, lidocaine, lidocaine]","[cardiac asystole, depression, bradyarrhythmias]","['0', '90', '409']","['9', '99', '418']","['18', '142', '331']","['34', '152', '347']","[D008012, D008012, D008012]","[D006323, D003866, D001919]",[D008012],[D006323],[lidocaine],[cardiac asystole]
2,435349,Suxamethonium infusion rate and observed fasci...,Suxamethonium chloride (Sch) was administered ...,"[suxamethonium, suxamethonium chloride, sch, sch]","[fasciculations, tetanic, fasciculations, fasc...","['0', '80', '104', '312']","['13', '102', '107', '315']","['41', '265', '395', '483', '523', '538', '561...","['55', '272', '409', '496', '536', '544', '568...","[D013390, D013390, D013390, D013390]","[D005207, D013746, D005207, D005207, D005207, ...",[D013390],[D005207],[suxamethonium],[fasciculations]
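
After the transformations, it may be worth checking how many rows still carry an `'unknown'` placeholder. The sketch below is one way to do that; `count_unresolved` is a hypothetical helper, and the sample values are illustrative rather than taken from the dataset. It uses a substring test so that partially resolved entries like `'x | unknown'` are also counted.

```python
def count_unresolved(name_lists):
    """Count rows whose name list still contains an 'unknown' placeholder."""
    return sum(
        any('unknown' in name for name in names)
        for names in name_lists
    )


# Illustrative values only; the last two rows are made up to show both cases
chemical_names = [
    ['lidocaine'],
    ['unknown'],
    ['suxamethonium | unknown'],
]
print(count_unresolved(chemical_names))  # 2
```

Because iterating over a pandas Series yields its values, a column such as `df_train['CID_chemical_name']` could be passed in directly, assuming each cell holds a list of strings.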


In [235]:
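# Sanity check: query the UMLS search endpoint for a single disease term
test = 'nephropathy'
CUID_query = base_template + CUID_template + test + API_template

CUID_response = requests.get(CUID_query)
CUID_response.raise_for_status()  # Fail early on HTTP errors (e.g. an invalid API key)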
CUID_pages = CUID_response.json()
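
The search response nests its matches under `result` → `results`, where each match has a `ui` (the CUI) and a `name`. Below is a minimal sketch of building the query with URL-encoded parameters, which matters for multi-word terms such as "suxamethonium chloride", and of extracting (CUI, name) pairs from the payload. The helper names `build_search_url` and `extract_cuis` are hypothetical, the `apiKey` parameter name is an assumption about what `API_template` appends, and the sample payload is trimmed from the 'nephropathy' response rather than fetched live.

```python
from urllib.parse import urlencode

BASE_URL = 'https://uts-ws.nlm.nih.gov/rest'
SEARCH_PATH = '/search/current'


def build_search_url(term, api_key):
    """Build a search URL with URL-encoded query parameters."""
    # Assumption: the API key is passed as an `apiKey` query parameter
    query = urlencode({'string': term, 'apiKey': api_key})
    return f'{BASE_URL}{SEARCH_PATH}?{query}'


def extract_cuis(payload):
    """Return (CUI, name) pairs from a search response, or [] if there are none."""
    results = payload.get('result', {}).get('results', [])
    return [(hit['ui'], hit['name']) for hit in results]


# Trimmed copy of the 'nephropathy' response, kept offline for illustration
sample_payload = {
    'pageSize': 25,
    'pageNumber': 1,
    'result': {
        'classType': 'searchResults',
        'results': [
            {'ui': 'C0022658', 'rootSource': 'MTH', 'name': 'Kidney Diseases'},
            {'ui': 'C0011881', 'rootSource': 'MTH', 'name': 'Diabetic Nephropathy'},
        ],
    },
}

url = build_search_url('suxamethonium chloride', 'YOUR_API_KEY')
pairs = extract_cuis(sample_payload)
print(url)
print(pairs[0])  # ('C0022658', 'Kidney Diseases')
```

The first hit is not always the best match: here the top result for 'nephropathy' is the broader concept 'Kidney Diseases', so picking a CUI may need a check on the returned `name`.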

In [236]:
CUID_pages

{'pageSize': 25,
 'pageNumber': 1,
 'result': {'classType': 'searchResults',
  'results': [{'ui': 'C0022658',
    'rootSource': 'MTH',
    'uri': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C0022658',
    'name': 'Kidney Diseases'},
   {'ui': 'C0011881',
    'rootSource': 'MTH',
    'uri': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C0011881',
    'name': 'Diabetic Nephropathy'},
   {'ui': 'C0848548',
    'rootSource': 'MTH',
    'uri': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C0848548',
    'name': 'hypertensive nephropathy'},
   {'ui': 'C0004698',
    'rootSource': 'MTH',
    'uri': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C0004698',
    'name': 'Balkan Nephropathy'},
   {'ui': 'C0595916',
    'rootSource': 'MTH',
    'uri': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C0595916',
    'name': 'Toxic nephropathy'},
   {'ui': 'C0149938',
    'rootSource': 'SNOMEDCT_US',
    'uri': 'https://uts-ws.nlm.nih.gov/rest/content/2024AA/CUI/C0149938',