### Using Selenium to scrape ATC classifications from DrugBank and produce a drug tree

In this notebook, I use the selenium package to scrape information from https://go.drugbank.com/atc, a dynamic web page. The goal is to produce a tree figure that looks like this: https://caseolap.github.io/covid-cvd-knowledgegraph/drugtree/index.html. The root node represents ATC level 1 (e.g. "C"), while the child nodes represent lower ATC levels (refer to https://www.who.int/tools/atc-ddd-toolkit/atc-classification). Because the DrugBank ATC page is dynamic, I must interact with it (clicking, searching, etc.) to reveal the HTML elements that hold the information I want to extract. I used two different algorithms to do this. In summary, I did the following:
1. Use both algorithms to extract the name (id) and the code itself (value) of every ATC code in our data that starts with "C".
2. Save the information in a data frame, which is then written to a csv file (refer to cvd_drug_tree.csv in repo).
3. Create the tree (html) from the csv file.
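
Both algorithms below rely on the layout of an ATC code: a full seven-character code such as C01CA02 encodes five nested levels, given by its prefixes of length 1, 3, 4, 5, and 7. The following is a minimal sketch of that decomposition; `atc_prefixes` and `ATC_LEVEL_LENGTHS` are hypothetical names used for illustration and do not appear elsewhere in the notebook.

```python
# Prefix lengths for ATC levels 1-5 (e.g. C, C01, C01C, C01CA, C01CA02).
ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)

def atc_prefixes(code):
    """Return the ancestor codes of `code`, from level 1 down to the code itself."""
    return [code[:n] for n in ATC_LEVEL_LENGTHS if n <= len(code)]

print(atc_prefixes("C01CA02"))
print(atc_prefixes("C01C"))
```

This is why the loops in the notebook skip prefix lengths 2 and 6: those prefixes do not correspond to any ATC level.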

In [19]:
import json
import numpy as np
import pandas as pd
import time
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [10]:
with open('C:\\Users\\ttran\\OneDrive\\Desktop\\COVID-CDV-DATA\\covidii_KG\\covidii_import\\cvdrug_ent_drugpw.json') as f:
    drug_data = json.load(f)

In [11]:
data_atc_codes = set()
for drug in drug_data:
    for code in drug['ATC code']:
        data_atc_codes.add(code)

In [15]:
data_cvd_codes = {code for code in data_atc_codes if code[0]=="C"}

First algorithm (search): for each ATC code in data_cvd_codes (the set of all ATC codes that start with "C"), search for the code on https://go.drugbank.com/atc, read the names of the code and of its parent levels from the results, then clear the search box.

Took around 400 seconds to run
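
To make the output format concrete, here is a sketch of how each row's id ends up as the ";"-joined names of a code and all of its ancestors. It assumes the browser lookup is replaced by a plain dictionary; `build_rows`, `names`, and the label "Example group (C01C)" are placeholders for illustration, not scraped data.

```python
def build_rows(codes, lookup_name):
    """Build parallel id/value lists; each id is the ';'-joined path of ancestor names."""
    level_lengths = (1, 3, 4, 5, 7)  # prefix lengths of ATC levels 1-5
    ids, values = [], []
    seen = set()  # prefixes already recorded, so each code appears once
    for code in sorted(codes):
        path = []
        for n in level_lengths:
            if n > len(code):
                break
            prefix = code[:n]
            path.append(lookup_name(prefix))
            if prefix not in seen:
                seen.add(prefix)
                ids.append(";".join(path))
                values.append(prefix)
    return {"id": ids, "value": values}

# Placeholder labels standing in for the text read from the page.
names = {
    "C": "Cardiovascular system (C)",
    "C01": "Cardiac therapy (C01)",
    "C01C": "Example group (C01C)",
}
rows = build_rows({"C01C", "C01"}, names.__getitem__)
print(rows["value"])
print(rows["id"][-1])
```

Tracking seen prefixes in a set keeps the membership check constant-time, whereas checking against a list grows slower as more codes are collected.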

In [22]:
start = time.time()

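PATH = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so the backslashes are not read as escape sequences
# Note: Selenium 4.10+ no longer accepts the driver path positionally; there, pass it through a Service object (service=...) instead.
driver = webdriver.Chrome(PATH)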
driver.get("https://go.drugbank.com/atc")
driver.implicitly_wait(5)

atc_codes = []
atc_names = []

for search_key in data_cvd_codes:
    search = driver.find_element(By.CSS_SELECTOR, "input[name='query'][type='search']")
    search.send_keys(search_key)
    search.send_keys(Keys.RETURN)
    atc_name_str = ""
    for i in range(1,len(search_key)+1):
        if i==2 or i==6: # prefixes of length 2 and 6 are not ATC levels (levels have length 1, 3, 4, 5, 7), so skip them.
            continue
        else:
            search_key_id = search_key[:i] + "_anchor"
            search_key_xpath = "//*[@id='" + search_key_id + "']" 
            atc_name_found = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, search_key_xpath))
            )
            if i==1:
                atc_name_str = atc_name_str + atc_name_found.text
            else:
                atc_name_str =  atc_name_str + ";" + atc_name_found.text
            if search_key[:i] not in atc_codes:
                atc_codes.append(search_key[:i])
                atc_names.append(atc_name_str)
    driver.find_element(By.CSS_SELECTOR, "input[name='query'][type='search']").clear() # clear search box
atc_dict = {'id': atc_names,'value': atc_codes}

end = time.time()
print(end - start)

399.5164952278137


Second algorithm (recursively click, then extract): recursively click every tab under the "Cardiovascular system (C)" tab until all of them are expanded, then extract the names directly from the fully expanded page.

Took around 93 seconds to run, which is roughly a quarter of the time of the first algorithm.
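
The expansion has to proceed level by level because a node's children only become reachable once the node itself has been expanded. The sketch below simulates that on a toy in-memory tree with no browser involved; `Node`, `visible_at_level`, and `expand_all` are hypothetical names, and setting `expanded = True` stands in for the click performed in the real page.

```python
class Node:
    """Toy stand-in for one tree item on the page."""
    def __init__(self, code, children=()):
        self.code = code
        self.children = list(children)
        self.expanded = False

def visible_at_level(roots, level):
    """Return nodes at `level` (1-based) whose ancestors are all expanded."""
    current = list(roots)
    for _ in range(level - 1):
        current = [child for node in current if node.expanded for child in node.children]
    return current

def expand_all(roots, level=1, max_level=4):
    """Expand every visible node at `level`, then recurse one level deeper."""
    if level > max_level:
        return
    for node in visible_at_level(roots, level):
        node.expanded = True  # stands in for the click on the expand icon
    expand_all(roots, level + 1, max_level)

# One branch of the hierarchy: C > C01 > C01C > C01CA > C01CA02
leaf = Node("C01CA02")
root = Node("C", [Node("C01", [Node("C01C", [Node("C01CA", [leaf])])])])

hidden_before = visible_at_level([root], 5)  # nothing reachable yet
expand_all([root])
visible_after = visible_at_level([root], 5)
print([n.code for n in hidden_before], [n.code for n in visible_after])
```

Only levels 1 through 4 need expanding, since level-5 codes are leaves; that matches the recursion in the notebook stopping at n == 5.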

In [20]:
start = time.time()

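PATH = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so the backslashes are not read as escape sequences
# Note: Selenium 4.10+ no longer accepts the driver path positionally; there, pass it through a Service object (service=...) instead.
driver = webdriver.Chrome(PATH)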
driver.get("https://go.drugbank.com/atc")
driver.implicitly_wait(5)

def recursive_click(n=1):
    if n == 5:
        return
    else:
        if n == 1:
            ATC_lvl = driver.find_elements(By.CSS_SELECTOR, f"li[role='treeitem'][aria-level='{n}'][id='C']")
        else:
            ATC_lvl = driver.find_elements(By.CSS_SELECTOR, f"li[role='treeitem'][aria-level='{n}'][aria-expanded='false']")
        for ele in ATC_lvl:
            expand_icon = ele.find_element(By.CSS_SELECTOR, "i")
            driver.execute_script("arguments[0].click();", expand_icon)
            sleep(0.25) # wait 0.25 seconds before clicking on next tab; ensures that clicks are not intercepting one another.
        recursive_click(n + 1)

recursive_click()

cvd_codes = []
cvd_names = []

for cvd_code in data_cvd_codes:
    cvd_name_str = ""
    for i in range(1,len(cvd_code)+1):
        if i==2 or i==6: # prefixes of length 2 and 6 are not ATC levels (levels have length 1, 3, 4, 5, 7), so skip them.
            continue
        else:
            cvd_code_id = cvd_code[:i] + "_anchor"
            cvd_code_xpath = "//*[@id='" + cvd_code_id + "']" 
            cvd_name_found = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, cvd_code_xpath))
            )
            if i==1:
                cvd_name_str = cvd_name_str + cvd_name_found.text
            else:
                cvd_name_str =  cvd_name_str + ";" + cvd_name_found.text
            if cvd_code[:i] not in cvd_codes:
                cvd_codes.append(cvd_code[:i])
                cvd_names.append(cvd_name_str)
atc_dict = {'id': cvd_names,'value': cvd_codes}

end = time.time()
print(end - start)

93.40204572677612


In [23]:
atc_df = pd.DataFrame.from_dict(atc_dict)

In [24]:
atc_df[atc_df['value'].str.len() == 7]  # show only the level-5 (seven-character) codes

     id                                                 value
4    Cardiovascular system (C);Peripheral vasodilat...  C04AE02
8    Cardiovascular system (C);Cardiac therapy (C01...  C01CA02
11   Cardiovascular system (C);Cardiac therapy (C01...  C01BD05
15   Cardiovascular system (C);Antihypertensives (C...  C02LE01
19   Cardiovascular system (C);Agents acting on the...  C09DX05
...  ...                                                ...
582  Cardiovascular system (C);Calcium channel bloc...  C08DA02
583  Cardiovascular system (C);Cardiac therapy (C01...  C01BG11
584  Cardiovascular system (C);Beta blocking agents...  C07AA14
585  Cardiovascular system (C);Antihypertensives (C...  C02AC02


In [25]:
len(data_cvd_codes) 

440

In [26]:
atc_df = atc_df.sort_values('value')

In [27]:
atc_df.to_csv('C:\\Users\\ttran\\OneDrive\\Desktop\\COVID-CDV-DATA\\cvd_drug_tree.csv', index=False)
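
For the final step (building the tree from the csv file), one possible approach is to nest the rows by their ";"-separated id path. This is only a sketch of that idea, not the method used to produce the html figure; `rows_to_tree` is a hypothetical helper, and the sample rows are illustrative stand-ins for the contents of cvd_drug_tree.csv.

```python
import csv
import io

def rows_to_tree(rows):
    """Nest rows by their ';'-separated id path; each node stores its ATC code."""
    root = {}
    for row in rows:
        children = root
        for name in row["id"].split(";"):
            # Create the node on first sight, then descend into its children.
            node = children.setdefault(name, {"code": None, "children": {}})
            children = node["children"]
        node["code"] = row["value"]  # the last path element is the row's own node
    return root

# Illustrative rows in the same id,value layout as the csv written above.
sample_csv = io.StringIO(
    "id,value\n"
    "Cardiovascular system (C),C\n"
    "Cardiovascular system (C);Cardiac therapy (C01),C01\n"
    "Cardiovascular system (C);Example group (C02),C02\n"
)
tree = rows_to_tree(csv.DictReader(sample_csv))
top = tree["Cardiovascular system (C)"]
print(top["code"], list(top["children"]))
```

A nested dictionary like this can be serialized with json.dumps if the tree figure is drawn by a JavaScript library that expects hierarchical input.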