# ARTG Scraper Rewrite

Recently, I wrote a scraper for the "Breaking Good" project (hereafter BG), which aims to collect information about the accessibility and availability of medicines classified as "essential" by the W.H.O.

I originally wrote the tool using Selenium, largely because I had little experience with BeautifulSoup and found it relatively easy to implement automated file downloads with Selenium. Having now spent a bit of time getting acquainted with BeautifulSoup, and because I am currently on holiday and the scraper's requirements are clearer, I figured it was time to revisit the project, for fun, and to see whether I can glean any new insights from the data.

## Loading Data and Imports

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import pdfplumber
import re
import time
import urllib3
import io

In [506]:
## Load the original CSV of essential medicine names
essential_medicine_names = pd.read_csv("WHO_essential_anti_infective_medicines.csv", header=None)

## how many ingredients does the longest search string contain? (We will need it later)
essential_medicine_names_split = essential_medicine_names[0].str.split(', ', expand=True)
num_cols = essential_medicine_names_split.shape[1] ## in this case, 4

## get rows where the last column (index num_cols-1, here 3) is not None (just to see)
non_none = essential_medicine_names_split.loc[essential_medicine_names_split[num_cols-1].notnull(), list(range(num_cols))]
#non_none.head()
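
To make the splitting step concrete, here is a minimal sketch on made-up data. The three search strings below are stand-ins rather than the contents of the real CSV, so the resulting width (2) differs from the 4 reported above.

```python
import pandas as pd

# Hypothetical stand-in for the single-column CSV of search strings.
names = pd.DataFrame({0: ["abacavir", "abacavir, lamivudine", "lopinavir, ritonavir"]})

# expand=True spreads each comma-separated part into its own column,
# padding the shorter rows with missing values.
parts = names[0].str.split(", ", expand=True)
num_cols = parts.shape[1]  # number of ingredients in the longest combination

# Rows whose last column is populated are the longest combinations.
longest = parts[parts[num_cols - 1].notnull()]

print(num_cols)
print(longest)
```

The number of columns produced by `expand=True` is what later determines how many `Ingredient N` columns a combination product can need.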

## Boilerplate Functions

In [549]:
def formatted_active_ingredient_string(input_string):
    '''
    Receives a comma-separated string of non-zero length containing medicine names:
    "abacavir" OR "abacavir, lamivudine" etc.
    Returns a formatted string for the meta_A tag on the TGA search client url:
    "abacavir" OR "abacavir%2C+lamivudine"
    
    If an empty string is given as input, an empty string is returned.
    '''

    ## convert string to list
    input_string = input_string.split(',')
    
    ## first active ingredient is first element
    actives = str(input_string[0])
    
    ## if there is more than one active ingredient, append it to formatted string
    if len(input_string) > 1:
        for ing in [str(ing.lstrip()) for ing in input_string[1:]]:
            actives+="%2C+"+str(ing)
    
    return actives


def artg_search(*args, **kwargs):
    '''
    Performs a search of the ARTG, in two modes. First, if a formatted ingredient string is passed
    from formatted_active_ingredient_string and a start_rank integer is also passed, we are 
    running a search of the ARTG on the main database page. Otherwise, if we receive only
    a valid ARTG identifier, we are running a sub-search of the database for a particular medicine.
    These two modes have different URLs, and will return Soup objects that need to be parsed differently.
    
    Returns a BeautifulSoup object.
    '''
    
    if "formatted_string" in kwargs and "start_rank" in kwargs:
        url = f"https://tga-search.clients.funnelback.com/s/search.html?from-advanced=true&collection=tga-artg&meta_A={kwargs['formatted_string']}&fmo=on&start_rank={kwargs['start_rank']}"
    elif "artg_id" in kwargs:
        url = f"https://tga-search.clients.funnelback.com/s/search.html?collection=tga-artg&profile=record&meta_i={kwargs['artg_id']}"
    else: return 0
    
    ## check status code is 200
    req = requests.get(url)
    if req.status_code != 200:
        return 0
    
    return BeautifulSoup(req.text, 'lxml')


def read_pdf_from_url(url):
    '''
    Downloads the PDF at url into memory and returns its text, one page after another.
    Code adapted from Demian Wolf @ 
    https://stackoverflow.com/questions/62075033/read-pdf-from-url-to-memory-omitting-saving-file-to-local-file
    '''
    http = urllib3.PoolManager()
    temp = io.BytesIO(http.request("GET", url).data)

    all_text = ''
    with pdfplumber.open(temp) as pdf:
        
        for pdf_page in pdf.pages:
            
            # guard against pages with no extractable text (extract_text may return None)
            single_page_text = pdf_page.extract_text() or ''
            # separate each page's text with newline
            all_text = all_text + '\n' + single_page_text
            
    return all_text


def get_ARTG_data_from_string(query_string, category):
    '''
    Does what it says on the box. Receives a query_string (ideally, the PDF contents from read_pdf_from_url)
    and returns a match for "category XXX", where XXX is the value of category, if it exists.
    '''
    
    date_matches = ["Start Date", "Effective Date"]
    pair_matches = ["Dosage Form", "Route of Administration", "Status"]
    
    match_string = ""
    if category in date_matches:
        match_string = rf"({category}) (\d+/\d+/\d+)"
    elif category in pair_matches:
        match_string = f"({category}) (.+)"
    else:
        return np.nan
    
    matches = re.findall(match_string, query_string)
    
    ## keep only the captured value (second group) of each non-empty match
    value = [match[1] for match in matches if match[1] != '']

    return value


def get_number_of_subsearches(soup):
    '''
    Receives a beautifulsoup object. Finds the "description" class, and returns substrings that
    match the regex string.
    
    Check that search returned zero partial matches, and use the first group to define 
    the number of sub-searches we need to make (due to pagination).
    '''
    
    description_string = soup.find("p", {"class": "description"}).text
    match = re.search(r'Documents: (\d+) fully matching plus (\d+) partially matching', description_string)
    
    return int(match.group(1)), int(match.group(2))


def convert_profile_table(medicine, ID_string, profile_soup):
    '''
    Extract information from the profile information page. Namely, if there is a Public Summary document,
    extract additional information from parsing it.
    '''
    
    ## get the profile table into a dict; initialise the table with known ID; init variables
    public_summary_data_OI = ["Start Date", "Dosage Form", "Route of Administration", "Effective Date", "Status"]
    keys_not_of_interest = ["Product Information", "Public ARTG summary", "Consumer Medicines Information", "ARTG entry for"]
    table_dict = {"ARTG ID": int(ID_string),
                  "Search String": medicine}
    table = profile_soup.find('table')
    
    for row in table.find_all('tr'):
        if row.find('th').text == "Public ARTG summary":
            table_dict[row.find('th').text] = row.find('a')['href']
        else:
            table_dict[row.find('th').text] = [row.find('td').text]
            
    ## .get avoids a KeyError when the profile page has no Public Summary link
    if table_dict.get("Public ARTG summary"):
        public_summary_text = read_pdf_from_url(table_dict["Public ARTG summary"])
        
        for datum in public_summary_data_OI:
            table_dict[datum] = get_ARTG_data_from_string(public_summary_text, datum)
            
        if table_dict.get("Active ingredients"):
            for ingredient_list in table_dict["Active ingredients"]:
                if len(medicine.split(',')) != len(ingredient_list.split(',')):
                    return None
                for index, ingredient in enumerate(ingredient_list.split(',')):
                    table_dict[f"Ingredient {index+1}"], table_dict[f"Ingredient {index+1} strength"] = get_ingredients_and_strengths_from_PDF(public_summary_text, ingredient)

    for key in keys_not_of_interest:
        table_dict.pop(key, None)

    ## some elements are lists, so we need to return by index and then transpose
    return pd.DataFrame.from_dict(table_dict, orient='index').transpose()


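def get_ingredients_and_strengths_from_PDF(pdf_text, ingredient):
    '''
    Searches pdf_text (the text extracted from a Public Summary PDF) for "ingredient" followed by
    a strength, e.g. "abacavir sulfate 351mg". Returns two lists: the unique ingredient names and
    the unique strengths that were found.
    '''
    
    ## we assume strength units will only contain the characters m, g, and l, for milli, grams, and litres.
    ## re.escape stops characters such as "(" in an ingredient name from being read as regex syntax
    search_string = rf"({re.escape(ingredient)}\ *)([0-9]*\.?[0-9]*\ *[mgl]*\ *\/*[0-9]*[mgl]*(?<!/))"
    matches = re.findall(search_string, pdf_text, flags = re.I)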

    copy = []
    for match in matches:
        if not match[1] == '':
            copy.append(match)

    x = [tuple([k[0].lower().strip(), "".join(k[1].split())]) for k in copy]
    
    ingredient = list(set([i[0] for i in list(set(x))]))
    strength = list(set([i[1] for i in list(set(x))]))
    
    return ingredient, strength
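
As an aside on the `meta_A` encoding: the `%2C+` separator built by hand in `formatted_active_ingredient_string` is what the standard library's `urllib.parse.quote_plus` produces for a comma followed by a space. The sketch below is a hypothetical alternative, not the function used in this notebook, and `encode_ingredients` is a name made up for illustration.

```python
from urllib.parse import quote_plus


def encode_ingredients(names: str) -> str:
    """Hypothetical alternative to formatted_active_ingredient_string."""
    # Normalise the spacing around commas, then let the standard library
    # percent-encode the comma (%2C) and turn the space into '+'.
    parts = [part.strip() for part in names.split(",") if part.strip()]
    return quote_plus(", ".join(parts))


print(encode_ingredients("abacavir"))
print(encode_ingredients("abacavir,  lamivudine"))
```

One possible advantage of going through `quote_plus` is that any other character needing escaping in an ingredient name would also be handled, although whether the TGA search endpoint treats those characters as expected is untested here.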

## Data Extraction

In [354]:
out_df = pd.DataFrame()

for i in range(len(essential_medicine_names)):
    rank = 1
    medicine = essential_medicine_names.iloc[i, 0]
    f_string = formatted_active_ingredient_string(medicine)
    ini_soup = artg_search(formatted_string=f_string, start_rank=rank)
    
    full_matches, partial_matches = get_number_of_subsearches(ini_soup)
    if partial_matches != 0 or full_matches == 0:
        continue
    
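    ## <= so that a last page holding a single result (e.g. result 11 of 11, or 1 of 1) is still fetched
    while rank <= full_matches: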

        ## get the results page
        soup = artg_search(formatted_string=f_string, start_rank=rank)
        
        ## find all identifiers on the page
        ARTG_ID_list = [x.text for x in soup.find_all('li') if 'ARTG ID:' in str(x.text)]
        
        ## for each ID, run a profile search
        for ID_fullstring in ARTG_ID_list:
            
            ID_string = re.search(r'\d+', ID_fullstring).group(0)
            profile_soup = artg_search(artg_id=ID_string)
            current_df = convert_profile_table(medicine, ID_string, profile_soup)
            if current_df is not None:
                ## DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
                out_df = pd.concat([out_df, current_df], ignore_index=True)
                out_df.to_csv('out.csv')
        
        rank += 10 ## pagination increment by 10
        print(f"Getting soup for medicine: {medicine}; nmeds: {full_matches}, rank: {rank}")


Getting soup for medicine: abacavir; nmeds: 25, rank: 11
Getting soup for medicine: abacavir; nmeds: 25, rank: 21
Getting soup for medicine: abacavir; nmeds: 25, rank: 31
Getting soup for medicine: abacavir, lamivudine; nmeds: 23, rank: 11
Getting soup for medicine: abacavir, lamivudine; nmeds: 23, rank: 21
Getting soup for medicine: abacavir, lamivudine; nmeds: 23, rank: 31
Getting soup for medicine: aciclovir; nmeds: 90, rank: 11
Getting soup for medicine: aciclovir; nmeds: 90, rank: 21
Getting soup for medicine: aciclovir; nmeds: 90, rank: 31
Getting soup for medicine: aciclovir; nmeds: 90, rank: 41
Getting soup for medicine: aciclovir; nmeds: 90, rank: 51
Getting soup for medicine: aciclovir; nmeds: 90, rank: 61
Getting soup for medicine: aciclovir; nmeds: 90, rank: 71
Getting soup for medicine: aciclovir; nmeds: 90, rank: 81
Getting soup for medicine: aciclovir; nmeds: 90, rank: 91
Getting soup for medicine: albendazole; nmeds: 4, rank: 11
Getting soup for medicine: amikacin; nmed

Getting soup for medicine: linezolid; nmeds: 53, rank: 31
Getting soup for medicine: linezolid; nmeds: 53, rank: 41
Getting soup for medicine: linezolid; nmeds: 53, rank: 51
Getting soup for medicine: linezolid; nmeds: 53, rank: 61
Getting soup for medicine: lopinavir, ritonavir; nmeds: 3, rank: 11
Getting soup for medicine: mebendazole; nmeds: 18, rank: 11
Getting soup for medicine: mebendazole; nmeds: 18, rank: 21
Getting soup for medicine: meropenem; nmeds: 24, rank: 11
Getting soup for medicine: meropenem; nmeds: 24, rank: 21
Getting soup for medicine: meropenem; nmeds: 24, rank: 31
Getting soup for medicine: metronidazole; nmeds: 23, rank: 11
Getting soup for medicine: metronidazole; nmeds: 23, rank: 21
Getting soup for medicine: metronidazole; nmeds: 23, rank: 31
Getting soup for medicine: moxifloxacin; nmeds: 9, rank: 11
Getting soup for medicine: nevirapine; nmeds: 13, rank: 11
Getting soup for medicine: nevirapine; nmeds: 13, rank: 21
Getting soup for medicine: nitrofurantoin;
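
The pagination logic can be sketched in isolation. The snippet below assumes 10 results per page (the increment used in the loop) and a description string in the format matched by `get_number_of_subsearches`. The sample strings are made up, and `start_ranks` is a hypothetical helper rather than part of the scraper.

```python
import re

PAGE_SIZE = 10  # results per page, as assumed by the extraction loop


def start_ranks(description):
    """Return the start_rank value of every results page that needs fetching."""
    match = re.search(
        r"Documents: (\d+) fully matching plus (\d+) partially matching",
        description,
    )
    if match is None:
        return []
    full_matches = int(match.group(1))
    # Ranks are 1-based, so the pages start at 1, 11, 21, ...
    return list(range(1, full_matches + 1, PAGE_SIZE))


print(start_ranks("Documents: 25 fully matching plus 0 partially matching"))
print(start_ranks("Documents: 11 fully matching plus 0 partially matching"))
```

Using `full_matches + 1` as the upper bound means a final page that holds a single result (result 11 of 11, say) is still included.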

## Data Cleaning

In [550]:
## If out_df isn't in kernel cache, re-load it from output CSV. We shouldn't (ideally)
## lose any accuracy with the save/re-load because we're not storing numerical data
out_df = pd.read_csv("out.csv", index_col=0)

In [551]:
## Operate on a safe copy
out_df_copy = out_df.copy(deep=True)

## Remove annoying characters
for column in out_df_copy.columns[1:]:
    out_df_copy[column] = out_df_copy[column].str.strip("[]").str.replace("'", "")

out_df_copy.head()

| | ARTG ID | Search String | Product name | Active ingredients | Sponsor name | Start Date | Dosage Form | Route of Administration | Effective Date | Status | Ingredient 1 | Ingredient 1 strength | Ingredient 2 | Ingredient 2 strength | Ingredient 3 | Ingredient 3 strength |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 66879 | abacavir | ZIAGEN abacavir (as sulfate) 20mg/mL oral solu... | abacavir sulfate | ViiV Healthcare Pty Ltd | 9/06/1999 | Oral Liquid, solution | Oral | 25/01/2019 | Active | abacavir sulfate | 23.4mg/mL | | | | |
| 1 | 66878 | abacavir | ZIAGEN abacavir (as sulfate) 300mg tablet blis... | abacavir sulfate | ViiV Healthcare Pty Ltd | 9/06/1999 | Tablet, film coated | Oral | 25/01/2019 | Active | abacavir sulfate | 351mg | | | | |
| 2 | 99090 | abacavir, lamivudine | KIVEXA abacavir 600 mg (as sulfate) and lamivu... | abacavir sulfate,lamivudine | ViiV Healthcare Pty Ltd | 24/03/2005 | Tablet, film coated | Oral | 6/06/2019 | Active | abacavir sulfate | 702mg | lamivudine | 300mg | | |
| 3 | 296702 | abacavir, lamivudine | BEZORT abacavir 600mg (as sulfate) and lamivud... | abacavir sulfate,lamivudine | ViiV Healthcare Pty Ltd | 11/12/2017 | Tablet, film coated | Oral | 11/12/2017 | Active | abacavir sulfate | 702mg | lamivudine | 300mg | | |
| 4 | 296381 | abacavir, lamivudine | ABACAVIR/LAMIVUDINE 600/300 SUN abacavir 600 m... | abacavir sulfate,lamivudine | Sun Pharma ANZ Pty Ltd | 5/09/2018 | Tablet, film coated | Oral | 5/09/2018 | Active | abacavir sulfate | 702.78mg | lamivudine | 600/300, 300mg | | |
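
The brackets and quotes being stripped here are what list-valued cells look like after a round trip through CSV. As an alternative sketch, and assuming the cells really are stringified Python lists such as `"['351mg']"`, they could be parsed with `ast.literal_eval` instead of character stripping. The frame below is made up for illustration, and `parse_list_cell` is a hypothetical helper.

```python
import ast

import pandas as pd

# Hypothetical frame mimicking list-valued cells after a CSV round trip.
df = pd.DataFrame(
    {"Ingredient 1 strength": ["['351mg']", "['600/300', '300mg']", None]}
)


def parse_list_cell(cell):
    """Turn "['a', 'b']" into "a, b"; leave missing values untouched."""
    if pd.isna(cell):
        return cell
    return ", ".join(ast.literal_eval(cell))


df["Ingredient 1 strength"] = df["Ingredient 1 strength"].map(parse_list_cell)
print(df)
```

Parsing rather than stripping leaves apostrophes inside a value intact, at the cost of failing loudly on any cell that is not a valid list literal.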


In [596]:
## Find the rows in which we weren't able to find a 
## unique strength or ingredient identifier from the PDF

## a row is flagged if any ingredient or strength column holds more than one comma-separated value
ingredient_columns = [f"Ingredient {n}{suffix}" for n in (1, 2, 3) for suffix in ("", " strength")]
mask = pd.concat([out_df_copy[column].str.split(",").str.len() > 1 for column in ingredient_columns], axis=1).any(axis=1)

out_df_for_correction = out_df_copy.loc[mask] ## there are 325 with issues.
out_df_not_correction = out_df_copy.loc[~mask]

out_df_for_correction.to_csv("out_for_correction.csv")

In [603]:
out_df_corrected = pd.read_csv("out_for_correction.csv", index_col=0)
## DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
out_df_corrected = pd.concat([out_df_not_correction, out_df_corrected])

In [605]:
out_df_corrected.head()

| | ARTG ID | Search String | Product name | Active ingredients | Sponsor name | Start Date | Dosage Form | Route of Administration | Effective Date | Status | Ingredient 1 | Ingredient 1 strength | Ingredient 2 | Ingredient 2 strength | Ingredient 3 | Ingredient 3 strength |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 66879 | abacavir | ZIAGEN abacavir (as sulfate) 20mg/mL oral solu... | abacavir sulfate | ViiV Healthcare Pty Ltd | 9/06/1999 | Oral Liquid, solution | Oral | 25/01/2019 | Active | abacavir sulfate | 23.4mg/mL | | | | |
| 1 | 66878 | abacavir | ZIAGEN abacavir (as sulfate) 300mg tablet blis... | abacavir sulfate | ViiV Healthcare Pty Ltd | 9/06/1999 | Tablet, film coated | Oral | 25/01/2019 | Active | abacavir sulfate | 351mg | | | | |
| 2 | 99090 | abacavir, lamivudine | KIVEXA abacavir 600 mg (as sulfate) and lamivu... | abacavir sulfate,lamivudine | ViiV Healthcare Pty Ltd | 24/03/2005 | Tablet, film coated | Oral | 6/06/2019 | Active | abacavir sulfate | 702mg | lamivudine | 300mg | | |
| 3 | 296702 | abacavir, lamivudine | BEZORT abacavir 600mg (as sulfate) and lamivud... | abacavir sulfate,lamivudine | ViiV Healthcare Pty Ltd | 11/12/2017 | Tablet, film coated | Oral | 11/12/2017 | Active | abacavir sulfate | 702mg | lamivudine | 300mg | | |
| 21 | 99421 | aciclovir | ACICLOVIR SANDOZ aciclovir 800mg tablet bliste... | aciclovir | Sandoz Pty Ltd | 3/02/2004 | Tablet, uncoated | Oral | 7/06/2019 | Active | aciclovir | 800mg | | | | |


In [587]:
out_df_test["Ingredient 1 strength"].str.split(",")

31        [500mg,  500mg/20mL]
32        [250mg,  250mg/10mL]
33                [50mg/g,  5]
34                [50mg/g,  5]
40      [25mg/mL,  500mg/20mL]
                 ...          
1283               [200mg,  G]
1284               [200mg,  G]
1285                [G,  50mg]
1286                [G,  50mg]
1310      [50mg/5mL,  10mg/mL]
Name: Ingredient 1 strength, Length: 304, dtype: object
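
To see how one ingredient can end up with several strength candidates, including stray values such as `G`, the strength regex can be run on small made-up snippets. The texts below are hypothetical and are not taken from any real Public Summary PDF; `candidate_strengths` is an illustrative helper that uses the same pattern as `get_ingredients_and_strengths_from_PDF`.

```python
import re


def candidate_strengths(text, ingredient):
    """Return the sorted, de-duplicated strength candidates for one ingredient."""
    # Same pattern as the scraper's strength regex, written as a raw string.
    pattern = rf"({re.escape(ingredient)}\ *)([0-9]*\.?[0-9]*\ *[mgl]*\ *\/*[0-9]*[mgl]*(?<!/))"
    found = re.findall(pattern, text, flags=re.I)
    # Drop empty captures and remove internal whitespace.
    return sorted({"".join(strength.split()) for _, strength in found if strength})


# Hypothetical snippets standing in for extracted PDF text.
vial = "aciclovir 500mg\nACICLOVIR 500mg/20mL injection vial"
tablet = "EXAMPLE aciclovir GH tablets\naciclovir 200mg"

print(candidate_strengths(vial, "aciclovir"))    # both the amount and the concentration
print(candidate_strengths(tablet, "aciclovir"))  # "G" is picked up from the word "GH"
```

Because the pattern is case-insensitive and every digit group is optional, a capital `G`, `M`, or `L` directly after the ingredient name is enough to count as a "strength". That is one plausible source of the stray entries in the output above, though the real PDFs would need checking to confirm it.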