# ARTG Scraper (Rewrite)

## Introduction

Recently, I wrote a scraper for the "Breaking Good" project (hereafter BG), which aims to collect information about the accessibility and availability of medicines classified as "essential" by the World Health Organization (WHO).

At the time, I wrote the tool using Selenium, largely because I had little experience with BeautifulSoup and found it relatively easy to implement automated file downloads with Selenium. Having now spent a bit of time getting acquainted with BeautifulSoup, and because I am on holiday and the scraper's requirements are clearer, I figured it might be time to revisit the project, for fun, and to see if I can glean any new insights from the data.

#### Note for future self: 

<b>Every ARTG entry should have a "Public ARTG summary" PDF - pdfplumber looks like it does a very good job of parsing these documents, and the ones I've tested have all had pretty consistent formats, from 2001 to 2021. These documents contain lots of pertinent data!</b>

## Datasets

A list of essential medicines of particular interest to the BG team was provided as a single-column CSV. We can plug the medicines on that list into the search for the Australian Government's Australian Register of Therapeutic Goods (ARTG), hosted by the EBS. From that, we should be able to get a list of relevant ARTG IDs, which are unique identifiers for each medicine and its formulations.

In [195]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import pdfplumber
import re
import time

In [93]:
## Load the original CSV of essential medicine names
essential_medicine_names = pd.read_csv("WHO_essential_anti_infective_medicines.csv", header=None)
essential_medicine_names.head()

                      0
0              abacavir
1  abacavir, lamivudine
2             aciclovir
3             acyclovir
4           albendazole


In [187]:
def formatted_active_ingredient_string(input_string):
    '''
    Receives a comma-separated string of non-zero length containing medicine names:
        "abacavir" OR "abacavir, lamivudine" etc.
    
    Returns a formatted string for the meta_A tag on the TGA search client url:
        "abacavir" OR "abacavir%2C+lamivudine"
        
        If an empty string is given as input, function returns 0.
    '''
    
    ## empty or whitespace-only input: nothing to search for
    if input_string.strip() == "":
        return 0
    
    ## split on commas and strip surrounding whitespace from each ingredient
    ingredients = [ing.strip() for ing in input_string.split(',')]
    
    ## join the ingredients with "%2C+", the URL-encoded form of ", "
    ## (a single ingredient is returned unchanged)
    actives = "%2C+".join(ingredients)
    
    return actives


def artg_search(formatted_string=None, start_rank=None, artg_id=None):
    '''
    Performs a search of the ARTG database, by URL, and returns the response
    as a BeautifulSoup object.
    
    Two modes, selected by keyword arguments:
    
      formatted_string + start_rank: we're on the main results page. Takes the string
        from formatted_active_ingredient_string and the rank of the first result to show.
        (ARTG ID, Product name, Active ingredients, Sponsor, has_CMI, has_PI for each
        result returned from the search query.)
        
      artg_id: we're following a link to a single record, to get a public
        summary document.
    '''
    
    base_url = "https://tga-search.clients.funnelback.com/s/search.html"
    
    if formatted_string is not None and start_rank is not None:
        url = f"{base_url}?from-advanced=true&collection=tga-artg&meta_A={formatted_string}&fmo=on&start_rank={start_rank}"

    elif artg_id is not None:
        url = f"{base_url}?collection=tga-artg&profile=record&meta_i={artg_id}"

    else:
        ## neither mode was requested, so there is no URL to fetch
        raise ValueError("Provide formatted_string and start_rank, or artg_id")

    req = requests.get(url, timeout=30)
    req.raise_for_status() ## fail loudly on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(req.text, 'lxml')
    
    return soup
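
The hand-built `%2C+` separator in `formatted_active_ingredient_string` is ordinary form encoding of `", "`, so the standard library can probably do the same job. The following is a minimal sketch, assuming the search endpoint accepts a normally encoded query string (the URL above suggests it does, but this has not been checked against the live service). The names `encode_ingredients` and `build_search_url` are hypothetical and are not part of the original scraper.

```python
from urllib.parse import quote_plus, urlencode

# Same endpoint as the URL used in artg_search
SEARCH_URL = "https://tga-search.clients.funnelback.com/s/search.html"


def _clean(names):
    """Split a comma-separated string and rejoin it as 'a, b, c'."""
    parts = [part.strip() for part in names.split(",") if part.strip()]
    return ", ".join(parts)


def encode_ingredients(names):
    """Form-encode the ingredient list: ',' becomes %2C and ' ' becomes +."""
    return quote_plus(_clean(names))


def build_search_url(names, start_rank=1):
    """Build the results-page URL; urlencode applies quote_plus to each value."""
    query = urlencode({
        "from-advanced": "true",
        "collection": "tga-artg",
        "meta_A": _clean(names),
        "fmo": "on",
        "start_rank": start_rank,
    })
    return f"{SEARCH_URL}?{query}"


print(encode_ingredients("abacavir, lamivudine"))
print(build_search_url("abacavir, lamivudine", start_rank=11))
```

One advantage of this route is that ingredient names containing spaces or other reserved characters get encoded as well, which the manual join does not handle.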

In [200]:
for medicine in essential_medicine_names.iloc[:, 0]:
    
    rank = 1
    f_string = formatted_active_ingredient_string(medicine)
    if not f_string:
        continue ## empty name, nothing to search for
    
    ini_soup = artg_search(formatted_string=f_string, start_rank=rank)
    time.sleep(0.5) ## to be polite
    
    ## This should return html of the form "Documents: \d+ fully matching plus \d+ partially matching"
    ## we can use re groups to get those numbers - check that we have zero partial matches, and use the
    ## first group to define the number of sub-searches we need to make (due to pagination).
    description = ini_soup.find("p", {"class": "description"})
    match = None
    if description is not None:
        match = re.search(r'Documents: (\d+) fully matching plus (\d+) partially matching', description.text)
    
    ## if the summary line is missing, or we get 0 exact matches or any partial matches,
    ## skip this medicine and move on to the next one
    if match is None or int(match.group(2)) != 0 or int(match.group(1)) == 0:
        continue
    
    n_results = int(match.group(1))
    print(f"Running search for medicine: {medicine}; nmeds: {n_results}, rank: {rank}")
    
    ## <= (not <) so that a last result sitting on a page start (rank 1, 11, 21, ...) is still fetched
    while rank <= n_results:
        if rank == 1:
            soup = ini_soup ## the first results page was already fetched above
        else:
            soup = artg_search(formatted_string=f_string, start_rank=rank)
            time.sleep(0.5) ## to be polite on every request
        
        ## analysis loop here
        ## for each soup.find_all etc., 
        
        rank += 10 ## pagination increment by 10
        print(f"Getting soup for medicine: {medicine}; nmeds: {n_results}, rank: {rank}")


Running search for medicine: abacavir; nmeds: 25, rank: 1
Getting soup for medicine: abacavir; nmeds: 25, rank: 11
Getting soup for medicine: abacavir; nmeds: 25, rank: 21
Getting soup for medicine: abacavir; nmeds: 25, rank: 31
Running search for medicine: abacavir, lamivudine; nmeds: 23, rank: 1
Getting soup for medicine: abacavir, lamivudine; nmeds: 23, rank: 11
Getting soup for medicine: abacavir, lamivudine; nmeds: 23, rank: 21
Getting soup for medicine: abacavir, lamivudine; nmeds: 23, rank: 31
Running search for medicine: aciclovir; nmeds: 90, rank: 1
Getting soup for medicine: aciclovir; nmeds: 90, rank: 11
Getting soup for medicine: aciclovir; nmeds: 90, rank: 21
Getting soup for medicine: aciclovir; nmeds: 90, rank: 31
Getting soup for medicine: aciclovir; nmeds: 90, rank: 41
Getting soup for medicine: aciclovir; nmeds: 90, rank: 51
Getting soup for medicine: aciclovir; nmeds: 90, rank: 61
Getting soup for medicine: aciclovir; nmeds: 90, rank: 71
Getting soup for medicine: ac

KeyboardInterrupt: 
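
The pagination arithmetic can be checked without touching the network. This is a minimal sketch, assuming 10 results per page (taken from the `rank += 10` step) and the "Documents: ... fully matching plus ... partially matching" wording described in the comments above. `parse_match_counts` and `page_start_ranks` are hypothetical helper names, not part of the original scraper.

```python
import re

PAGE_SIZE = 10  # assumption: results per page, from the `rank += 10` step

COUNT_PATTERN = re.compile(
    r"Documents: (\d+) fully matching plus (\d+) partially matching"
)


def parse_match_counts(description):
    """Return (full_matches, partial_matches), or None if the text has no count line."""
    m = COUNT_PATTERN.search(description)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def page_start_ranks(n_results, page_size=PAGE_SIZE):
    """Return the start_rank of every results page needed to cover n_results."""
    return list(range(1, n_results + 1, page_size))


counts = parse_match_counts("Documents: 25 fully matching plus 0 partially matching")
print(counts)
print(page_start_ranks(counts[0]))
```

With 25 results the start ranks are 1, 11 and 21. With 21 results they are also 1, 11 and 21, which is the page-boundary case a strict `rank < n` comparison would miss.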

In [None]:
for i in range(len(essential_medicine_names)):
    print(i)

In [89]:

with pdfplumber.open("pdf_1.pdf") as pdf:
    ## the summary fields of interest are on the first page
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)

Public Summary 
Summary for ARTG Entry: 99090 KIVEXA abacavir 600mg (as sulfate) and lamivudine 300mg tablet blister pack
ARTG entry for Medicine Registered 
Sponsor ViiV Healthcare Pty Ltd
Postal Address PO Box 18095, MELBOURNE CITY MC, VIC, 8001 
Australia
ARTG Start Date 24/03/2005
Product Category Medicine 
Status Active
Approval Area Drug Safety Evaluation Branch
Conditions
Conditions applicable to all therapeutic goods as specified in the document "Standard Conditions Applying to Registered or Listed Therapeutic Goods Under 
Section 28 of the Therapeutic Goods Act 1989" effective 1 July 1995.
Conditions applicable to the relevant category and class of therapeutic goods as specified in the document "Standard Conditions Applying to Registered or 
Listed Therapeutic Goods Under Section 28 of the Therapeutic Goods Act 1989" effective 1 July 1995.
Products
1 . KIVEXA abacavir 600 mg (as sulfate) and lamivudine 300 mg tablet blister pack
P
Product Type Single Medicine Product  Effectiv
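
Since the extracted text puts each label and its value on one line, a few anchored regular expressions may be enough to pull out the header fields. This is a minimal sketch, assuming the labels seen in the single page printed above ("Sponsor", "ARTG Start Date", "Product Category", "Status") appear the same way in other public summaries, which has not been verified here. `FIELD_PATTERNS` and `parse_public_summary` are hypothetical names, and `sample_text` is copied from the output above.

```python
import re

# First lines of the extracted page shown above
sample_text = """Public Summary
Summary for ARTG Entry: 99090 KIVEXA abacavir 600mg (as sulfate) and lamivudine 300mg tablet blister pack
ARTG entry for Medicine Registered
Sponsor ViiV Healthcare Pty Ltd
Postal Address PO Box 18095, MELBOURNE CITY MC, VIC, 8001
Australia
ARTG Start Date 24/03/2005
Product Category Medicine
Status Active
"""

# Assumption: one label per line, value on the same line as the label
FIELD_PATTERNS = {
    "artg_id": r"^Summary for ARTG Entry: (\d+)",
    "sponsor": r"^Sponsor (.+?)\s*$",
    "start_date": r"^ARTG Start Date (\d{2}/\d{2}/\d{4})",
    "product_category": r"^Product Category (.+?)\s*$",
    "status": r"^Status (.+?)\s*$",
}


def parse_public_summary(text):
    """Return a dict of header fields; a field is None if its label is not found."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        m = re.search(pattern, text, flags=re.MULTILINE)
        fields[name] = m.group(1) if m else None
    return fields


print(parse_public_summary(sample_text))
```

Returning `None` for a missing label, rather than raising, makes it easy to spot summaries whose layout differs from this sample when parsing many PDFs.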