# Clinical Trial Data Retrieval (Scrape)

This notebook contains Python code I wrote to examine clinical trials more efficiently and to identify trials that report high rates of specific adverse events (neuropathy, in this example).  The main data set consists of .xml files downloaded from clinicaltrials.gov after searching for completed trials with results involving [Multiple Myeloma](https://clinicaltrials.gov/ct2/results?cond=Multiple+Myeloma&term=&cntry=&state=&city=&dist=&Search=Search&recrs=e&rslt=With).  The search returned 311 separate trials, which were downloaded together.



To start, I imported some libraries (os, numpy, pandas, and matplotlib) to work with the data, and bs4 (BeautifulSoup) to parse the HTML-like trial information in the .xml files.

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import bs4 as bs

These first four functions were created to read and parse the basic clinical trial information.

In [2]:
def clinical_trial_xml_reader(file):
    """Uses BeautifulSoup to open and parse an xml file from a clinical trial.
    Returns the html/xml text.  
    
    The path and file name together are the only argument.
    The xml_soup is returned.
    """
    # The context manager guarantees that the file is closed once parsing is done.
    with open(file, "r") as xml_file:
        xml_soup = bs.BeautifulSoup(xml_file, "html.parser")
    return xml_soup


def get_tag_text(soup, tag="title"):
    """A function that returns the text of the first specified tag if present, otherwise returns nan.

    Takes a soup of choice and the tag of choice as arguments.  Remember to put the tag in quotes.
    Returns either the text from the tag or, if the tag isn't present, NaN.
    """
    try:
        return soup.find(tag).get_text()
    except AttributeError:
        return np.nan


def parse_clinical_trial_xml(soup, trial_data_categories):
    """A function to parse multiple myeloma clinical trials from xml files.
    Scrapes multiple fields of interest to describe the study generally.  Uses the get_tag_text()
    function to find text corresponding to tags in the list trial_data_categories.
    
    Takes as arguments the html/xml text from the clinical_trial_xml_reader() function and a
    list of categories that become the labels of the Series.
    Returns a Series called 'clinical_trial_row' that can be used as a row of a DataFrame.
    """
    category_dict = {}
    for category in trial_data_categories:
        category_dict[category] = get_tag_text(soup, category.lower())
    clinical_trial_row = pd.Series(data=category_dict, dtype=None)
    return clinical_trial_row

def clinical_trial_scrape(folder_path):
    """Uses the clinical_trial_xml_reader() and parse_clinical_trial_xml() functions to
    scrape basic information about all clinical trials present in the folder_path.
    If a specific field contains "nan", that means the trial did not report that
    information, which could be either improper reporting or just absence of information.
    
    Takes a folder's path as an argument.  The folder should contain .xml files from
    ClinicalTrials.gov to scrape.
    Returns a DataFrame called 'clinical_trial_df' containing basic information about
    each clinical trial as a row.
    """
    # Create a list of the categories to be scraped, and use this list as column names for a DataFrame.
    trial_data_categories = ["NCT_ID",
                             "Acronym",
                             "Brief_Title",
                             "Phase",
                             "Agency",
                             "URL",
                             "Overall_Status",
                             "Start_Date",
                             "Completion_Date",
                             "Enrollment",
                             "Number_of_Arms"]
    
    # Generate a sorted list of all .xml files in the folder, then parse each file into one row.
    # Rows are collected in a list because DataFrame.append() was removed in pandas 2.0.
    files = sorted(file for file in os.listdir(folder_path) if file.endswith(".xml"))
    rows = []
    for file in files:
        soup = clinical_trial_xml_reader(os.path.join(folder_path, file))
        rows.append(parse_clinical_trial_xml(soup, trial_data_categories))
    clinical_trial_df = pd.DataFrame(rows, columns=trial_data_categories)
    
    # Convert the date and enrollment columns from text to proper dtypes.
    clinical_trial_df.Start_Date = pd.to_datetime(clinical_trial_df.Start_Date)
    clinical_trial_df.Completion_Date = pd.to_datetime(clinical_trial_df.Completion_Date)
    clinical_trial_df.Enrollment = clinical_trial_df.Enrollment.astype('int64')
    
    return clinical_trial_df
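
As a minimal sketch of the "first matching tag, otherwise NaN" lookup that get_tag_text() and parse_clinical_trial_xml() rely on, the example below uses the standard library's xml.etree.ElementTree in place of BeautifulSoup so that it runs without extra packages.  The XML snippet, the NCT ID, and the helper names first_tag_text() and parse_fields() are made up for illustration; real ClinicalTrials.gov records are much larger.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a ClinicalTrials.gov record.
SAMPLE_XML = """
<clinical_study>
  <nct_id>NCT00000000</nct_id>
  <brief_title>Example Trial</brief_title>
  <phase>Phase 2</phase>
</clinical_study>
"""


def first_tag_text(root, tag):
    """Return the text of the first <tag> element, or NaN if it is absent."""
    element = root.find(f".//{tag}")
    return element.text if element is not None else math.nan


def parse_fields(xml_text, categories):
    """Map each category label to the text of its lower-cased tag."""
    root = ET.fromstring(xml_text)
    return {category: first_tag_text(root, category.lower()) for category in categories}


row = parse_fields(SAMPLE_XML, ["NCT_ID", "Brief_Title", "Phase", "Acronym"])
print(row)
```

The missing <acronym> tag comes back as NaN, which mirrors how unreported fields show up as empty cells in the scraped DataFrame.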

The 311 .xml files are stored in the following path.

In [3]:
path = "/Users/blixt007/HTML/xml/MM_Trials"

Next, the functions defined above are used to build a DataFrame of the clinical trial information.

In [4]:
MM_trials = clinical_trial_scrape(path)

I chose to save the clinical trial information as a .csv file within the same path as the trial .xml files.

In [5]:
MM_trials.to_csv(os.path.join(path, "MM_Trials.csv"), index=False)

Here is the scraped information from the first five trials.

In [6]:
MM_trials.head()

| | NCT_ID | Acronym | Brief_Title | Phase | Agency | URL | Overall_Status | Start_Date | Completion_Date | Enrollment | Number_of_Arms |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NCT00002850 | | Antibiotic Therapy in Preventing Early Infecti... | Phase 3 | Gary Morrow | https://clinicaltrials.gov/show/NCT00002850 | Completed | 1997-03-01 | 2012-01-01 | 212 | 3 |
| 1 | NCT00006184 | | Chemotherapy, Stem Cell Transplantation and Do... | Phase 2 | National Cancer Institute (NCI) | https://clinicaltrials.gov/show/NCT00006184 | Completed | 2001-02-08 | 2008-01-12 | 20 | 2 |
| 2 | NCT00006244 | | Melphalan, Peripheral Stem Cell Transplantatio... | Phase 2 | Fred Hutchinson Cancer Research Center | https://clinicaltrials.gov/show/NCT00006244 | Completed | 2000-02-01 | 2016-04-01 | 36 | 1 |
| 3 | NCT00027560 | | Melphalan, Fludarabine, and Alemtuzumab Follow... | Phase 2 | Memorial Sloan Kettering Cancer Center | https://clinicaltrials.gov/show/NCT00027560 | Completed | 2001-07-01 | 2009-04-01 | 51 | 1 |
| 4 | NCT00040937 | | S0204 Thalidomide, Chemotherapy, and Periphera... | Phase 2 | Southwest Oncology Group | https://clinicaltrials.gov/show/NCT00040937 | Completed | 2002-06-01 | 2015-10-01 | 147 | 1 |


Here are the last five trials.

In [7]:
MM_trials.tail()

| | NCT_ID | Acronym | Brief_Title | Phase | Agency | URL | Overall_Status | Start_Date | Completion_Date | Enrollment | Number_of_Arms |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 306 | NCT02481934 | NK-VS-MM | Clinical Trial of Expanded and Activated Autol... | Phase 1 | Joaquín Martínez López, MD, PhD | https://clinicaltrials.gov/show/NCT02481934 | Completed | 2013-03-01 | 2016-10-01 | 5 | 1 |
| 307 | NCT02566265 | SHIVERING 2 | Study of High-dose Influenza Vaccine Efficacy ... | Phase 2 | Yale University | https://clinicaltrials.gov/show/NCT02566265 | Completed | 2015-09-01 | 2018-06-01 | 122 | 2 |
| 308 | NCT02632786 | PRONTO | The PRONTO Study, a Global Phase 2b Study of N... | Phase 2 | Prothena Therapeutics Ltd. | https://clinicaltrials.gov/show/NCT02632786 | Completed | 2016-03-01 | 2018-03-01 | 129 | 2 |
| 309 | NCT02669615 | | Pharmacokinetic Study of Propylene Glycol-Free... | Phase 2 | Medical College of Wisconsin | https://clinicaltrials.gov/show/NCT02669615 | Completed | 2016-11-01 | 2017-07-19 | 24 | 1 |
| 310 | NCT03000452 | FUSION-MM-005 | A Study to Determine the Efficacy of the Combi... | Phase 2 | Celgene | https://clinicaltrials.gov/show/NCT03000452 | Completed | 2017-03-14 | 2017-12-04 | 18 | 1 |


My main goal, however, was to investigate the occurrence of adverse events caused by treatments/drugs targeting multiple myeloma.
The following two functions parse adverse events from the clinical trials.

In [8]:
def min_max_adverse_event(path, event):
    """Determine the maximum and minimum percentage of participants in any treatment
    arm that experience the specified adverse event.
    
    Takes a path and the event as a string as arguments.
    Returns a DataFrame of float values with the trial's NCT ID as the index and
    one column each for the minimum and maximum percentage.
    If the study does not report the specified adverse event, np.nan will be returned.  
    
    Note: many studies report similar adverse events with slightly different names.
    For this reason it is best to search for the essential portion of the adverse event's 
    name instead of a very specific format.  For instance, some studies report only
    "neuropathy," while others report "neuropathy peripheral" or even "peripheral neuropathy."
    """
    
    min_adverse_event_dict = {}
    max_adverse_event_dict = {}
    files = sorted([file for file in os.listdir(path) if file.endswith(".xml")])
    for file in files:
        soup = clinical_trial_xml_reader(os.path.join(path, file))

        adverse_events = [sub_title for sub_title in soup.find_all("sub_title") if 
                          event.lower() in sub_title.get_text().lower()]

        # Iterate over each adverse event type to find all <counts> and determine the percentage 
        # of each group with said event.
        all_adverse_event_dict = {}
        for adverse_event in adverse_events:
            counts = adverse_event.parent.find_all("counts")
            for count in counts:
                try:
                    all_adverse_event_dict[(count.get("group_id") + "_" + adverse_event.get_text())] = (
                        round(int(count["subjects_affected"])/int(count["subjects_at_risk"])*100, 2))
                except ZeroDivisionError:
                    all_adverse_event_dict[(count.get("group_id") + "_" + adverse_event.get_text())] = np.nan

        # Record the extremes for this trial.  A minimum is stored only when more than one
        # percentage was found, so a trial with a single value shows NaN for its minimum.
        nct_id = soup.nct_id.get_text()
        percentages = list(all_adverse_event_dict.values())
        try:
            max_adverse_event_dict[nct_id] = max(percentages)
            if len(percentages) > 1:
                min_adverse_event_dict[nct_id] = min(percentages)
        except ValueError:
            # max() raises ValueError when the trial reported no matching events.
            min_adverse_event_dict[nct_id] = np.nan
            max_adverse_event_dict[nct_id] = np.nan


    return pd.DataFrame([min_adverse_event_dict, max_adverse_event_dict], index=[
            "Min % " + event.title(), "Max % " + event.title()]).transpose()


def percent_adverse_events(path, event_list=("neuropathy", "paraesthesia")):
    """Use the min_max_adverse_event function to parse clinical trials for 
    multiple adverse events supplied as a list.
    Returns a DataFrame with the reported minimum and maximum percentage of 
    participants who experienced each specified adverse event.
    """
    # pd.concat() stacks the per-event frames row-wise by default, so each trial appears
    # once per event, with NaN in the columns that belong to the other events.
    event_frames = [min_max_adverse_event(path, event) for event in event_list]
    adverse_events_dataFrame = pd.concat(event_frames, sort=False)
    return adverse_events_dataFrame
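
To make the per-arm percentage calculation concrete, here is a small sketch of the same idea using only the standard library (xml.etree.ElementTree rather than BeautifulSoup).  The XML fragment, its nesting, and the helper name arm_percentages() are assumptions for illustration; unlike min_max_adverse_event(), this sketch skips arms with nobody at risk instead of recording NaN.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the <counts> entries described above.
SAMPLE_EVENTS = """
<reported_events>
  <event>
    <sub_title>Peripheral neuropathy</sub_title>
    <counts group_id="E1" subjects_affected="3" subjects_at_risk="12"/>
    <counts group_id="E2" subjects_affected="9" subjects_at_risk="12"/>
  </event>
  <event>
    <sub_title>Nausea</sub_title>
    <counts group_id="E1" subjects_affected="6" subjects_at_risk="12"/>
    <counts group_id="E2" subjects_affected="0" subjects_at_risk="0"/>
  </event>
</reported_events>
"""


def arm_percentages(xml_text, event):
    """Percentage of affected participants per arm for events whose name contains `event`."""
    root = ET.fromstring(xml_text)
    percentages = {}
    for node in root.iter("event"):
        name = node.findtext("sub_title", default="")
        if event.lower() not in name.lower():
            continue
        for counts in node.findall("counts"):
            at_risk = int(counts.get("subjects_at_risk"))
            if at_risk == 0:
                continue  # no participants at risk, so no meaningful percentage
            affected = int(counts.get("subjects_affected"))
            key = f'{counts.get("group_id")}_{name}'
            percentages[key] = round(affected / at_risk * 100, 2)
    return percentages


neuropathy = arm_percentages(SAMPLE_EVENTS, "neuropathy")
print(min(neuropathy.values()), max(neuropathy.values()))
```

Matching on the substring "neuropathy" picks up "Peripheral neuropathy" regardless of word order, which is the reason the docstring recommends searching for the essential part of the event name.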

I used the same set of trial .xml files and parsed each one for adverse events involving neuropathy and paraesthesia.

In [9]:
path = "/Users/blixt007/HTML/xml/MM_Trials"
event_list = ["neuropathy", "paraesthesia"]
adverse_events_df = percent_adverse_events(path, event_list)

Sort the DataFrame by the 'Max % Neuropathy' column in descending order and display the first five trials.

In [10]:
adverse_events_df.sort_values(by="Max % Neuropathy", inplace=True, ascending=False)
adverse_events_df.head()

| NCT_ID | Min % Neuropathy | Max % Neuropathy | Min % Paraesthesia | Max % Paraesthesia |
|---|---|---|---|---|
| NCT00903968 | 0.0 | 100.0 | | |
| NCT01246063 | 0.0 | 100.0 | | |
| NCT01706666 | 0.0 | 100.0 | | |
| NCT01344876 | 0.0 | 100.0 | | |
| NCT01794039 | 75.0 | 100.0 | | |


Next I want to select the trials that have higher levels of reported neuropathy events and exclude the remaining trials.

To do this, I selected the NCT_ID for every trial in which the maximum percentage of participants who experienced some form of neuropathy was greater than 60%.

In [11]:
trial_NCT_IDs = adverse_events_df.loc[adverse_events_df["Max % Neuropathy"] > 60].index

len(trial_NCT_IDs)

28

This shows us that there are 28 out of the original 311 clinical trials in which more than 60% of the participants in at least one treatment arm experienced a form of neuropathy.
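
The selection uses a strict comparison, so a trial sitting exactly at the threshold, or one with no reported value (NaN), is left out.  The sketch below shows that behaviour with plain Python and made-up trial labels and percentages; it is an illustration of the comparison only, not the pandas call used in the cell.

```python
import math

# Hypothetical maximum neuropathy percentages keyed by made-up trial labels.
max_neuropathy = {
    "TRIAL_A": 100.0,
    "TRIAL_B": 60.0,      # exactly at the threshold
    "TRIAL_C": 12.5,
    "TRIAL_D": math.nan,  # event not reported
}
THRESHOLD = 60

# NaN compares as False against any number, so unreported trials drop out as well.
selected = [trial for trial, pct in max_neuropathy.items() if pct > THRESHOLD]
print(selected)
```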

Now I want to examine which adverse events in the above trials were reported in each treatment arm, and what treatments were used in each arm.  The next two functions accomplish this goal.

In [20]:
def parse_nervous_system_events(path, NCT_IDs):
    """Parse all reported adverse events related to the nervous system for each
    treatment arm and report the percentage of participants who experienced said
    events for each trial.
    
    Treatment arms are referred to as simple numbers, not "E#".

    Takes the path containing the .xml files and a list of the NCT_ID numbers
    for each trial to be examined.
    Returns a DataFrame in which adverse events are columns and each treatment
    arm is an index value grouped by the NCT_ID number. Values are represented
    as percent of affected out of total per treatment arm.
    
    Note: Adverse event names can vary slightly from trial to trial, and some
    trials report many more types of adverse events than others.  Not every
    trial will have reported values for each column (adverse event) present
    in the DataFrame.  In this case, nan is reported.
    """
    adverse_event_collections = pd.DataFrame()
    for file in NCT_IDs:
        file = file + ".xml"
        soup = clinical_trial_xml_reader(os.path.join(path, file))

        # Obtain all non-serious nervous system disorders reported.
        nervous_system_disorders = [event.find_next() for event in soup.find_all(
                "title") if "Nervous" in event.get_text()]

        # If two separate fields for nervous system disorders are present,
        # the second holds the non-serious events, which is what I am investigating.
        # So drop the first instance if there are two.
        if len(nervous_system_disorders) > 1:
            del nervous_system_disorders[0]

        # Create a nested dictionary that contains the treatment arm group and
        # percentage of participants who reported an adverse event per treatment
        # arm for each adverse event.  Use this dictionary to create a DataFrame.
        adverse_event_dict = {}
        for event in nervous_system_disorders[0].find_all("event"):
            counts_dict = {}
            for count in event.find_all("counts"):
                counts_dict[soup.find("nct_id").get_text(), count.get("group_id")[1:]] = (
                    round((int(count.get("subjects_affected"))/
                           int(count.get("subjects_at_risk"))*100), 2))
            
            # Rename neuropathy sub_titles to reduce redundancy.
            # Names without "neuropathy", or without a recognized qualifier, are kept as reported.
            word = event.find("sub_title").get_text().lower()
            if "neuropathy" in word:
                if "peripheral" in word:
                    if "sensory" in word:
                        word = "peripheral sensory neuropathy"
                    else:
                        word = "peripheral neuropathy"
                elif "sensory" in word:
                    word = "sensory neuropathy"
            
            adverse_event_dict[word.title()] = counts_dict

        adverse_event_df = pd.DataFrame(adverse_event_dict)
        adverse_event_collections = pd.concat([adverse_event_collections, adverse_event_df], sort=False)

    # Neuropathies are the main focus, so I want to reorder the columns of adverse_event_collections
    # so that all columns corresponding to neuropathies are grouped together and others are removed.
    cols = list(adverse_event_collections.columns)
    neuro_cols = []
    for col in cols:
        if "neuro" in col.lower():
            neuro_cols.append(col)
        elif "paraesthesia" in col.lower():
            neuro_cols.append(col)

    neuro_cols.sort()
    adverse_event_collections = adverse_event_collections[neuro_cols]

    return adverse_event_collections


def get_treatments(path, NCT_IDs):
    """Creates a DataFrame relating the treatment arm number to the treatment.
    Takes the path containing clinical trial .xml files and an iterable of the
    NCT_ID numbers for each trial to be examined.
    Returns a DataFrame with NCT_ID numbers as the index and treatment arm number as columns.
    
    Note: If a treatment arm does not exist or is not reported, the description will be replaced
    by nan.
    """
    trial_dict = {}
    for nct_id in NCT_IDs:
        soup = clinical_trial_xml_reader(os.path.join(path, nct_id + ".xml"))

        # Groupings for treatment arms and adverse effects are nested under <reported_events>.
        # Strip the leading "E" from each group_id so that arms are keyed by plain integers.
        group_id_dict = {}
        for group in soup.reported_events.find_all("group"):
            group_id_dict[int(group.get("group_id")[1:])] = group.title.get_text()

        trial_dict[nct_id] = group_id_dict

    return pd.DataFrame(trial_dict).transpose()
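
The renaming rules inside parse_nervous_system_events() are easier to check in isolation.  The sketch below restates them as a standalone function; the name normalize_neuropathy_name() is hypothetical and is not part of the notebook.

```python
def normalize_neuropathy_name(sub_title):
    """Collapse common neuropathy spellings into one label; leave other names as reported."""
    word = sub_title.lower()
    if "neuropathy" in word:
        peripheral = "peripheral" in word
        sensory = "sensory" in word
        if peripheral and sensory:
            word = "peripheral sensory neuropathy"
        elif peripheral:
            word = "peripheral neuropathy"
        elif sensory:
            word = "sensory neuropathy"
    # Title-case the result so it matches the DataFrame's column labels.
    return word.title()


for name in ["Neuropathy peripheral", "Peripheral sensory neuropathy",
             "Neuropathy: motor", "Headache"]:
    print(name, "->", normalize_neuropathy_name(name))
```

Only the "peripheral" and "sensory" qualifiers are merged, which is why variants such as "Neuropathy: Motor" and "Neuropathy (Motor)" still end up as separate columns in the summary.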

Now parse the short list of clinical trials to obtain all neuro-related adverse events.

In [21]:
adverse_event_collections = parse_nervous_system_events(path, trial_NCT_IDs)

Summarizing the DataFrame and sorting by the highest count of each adverse event, it is clear that peripheral neuropathy and similar adverse events occur much more frequently than others.

In [22]:
adverse_event_collections.describe().sort_values(by="count", axis=1, ascending=False)

| | Peripheral Sensory Neuropathy | Peripheral Neuropathy | Sensory Neuropathy | Paraesthesia | Neurologic-Other | Neuropathy-Motor | Neuropathic- Pain | Neuropathy Cn Iv Down/In Eye Move | Neuropathy Cn Xii Tongue | Neurological Disorder Nos | Neuropathic Pain | Neuropathy: Motor | Weakness (Motor Neuropathy) | Neuropathy | Neuro-Other | Neuro-Cranial | Neurological (Other) | Neuropathy (Motor) | Neuropathic, Pain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 52.0 | 41.0 | 19.0 | 13.0 | 10.0 | 10.0 | 9.0 | 9.0 | 9.0 | 7.0 | 4.0 | 4.0 | 3.0 | 3.0 | 3.0 | 2.0 | 2.0 | 2.0 | 1.0 |
| mean | 47.642115 | 24.81561 | 46.721579 | 14.678462 | 1.075 | 11.547 | 3.222222 | 0.336667 | 0.336667 | 1.924286 | 7.115 | 3.06 | 11.063333 | 29.23 | 4.533333 | 1.43 | 7.575 | 4.545 | 7.81 |
| std | 33.886619 | 27.728456 | 29.810734 | 17.953478 | 2.289188 | 14.261221 | 8.273116 | 1.01 | 1.01 | 1.569914 | 10.320496 | 6.12 | 10.566221 | 35.113913 | 4.074486 | 2.022325 | 10.712668 | 6.427601 | |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.81 |
| 25% | 17.6125 | 0.0 | 20.835 | 7.55 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.42 | 0.0 | 0.0 | 6.07 | 9.755 | 2.855 | 0.715 | 3.7875 | 2.2725 | 7.81 |
| 50% | 58.145 | 14.0 | 57.58 | 10.0 | 0.0 | 3.905 | 0.0 | 0.0 | 0.0 | 2.94 | 3.29 | 0.0 | 12.14 | 19.51 | 5.71 | 1.43 | 7.575 | 4.545 | 7.81 |
| 75% | 75.0 | 36.0 | 66.67 | 14.29 | 0.0 | 22.75 | 0.0 | 0.0 | 0.0 | 3.18 | 10.405 | 3.06 | 16.595 | 43.845 | 6.8 | 2.145 | 11.3625 | 6.8175 | 7.81 |
| max | 100.0 | 83.33 | 100.0 | 66.67 | 6.06 | 33.33 | 25.0 | 3.03 | 3.03 | 3.33 | 21.88 | 12.24 | 21.05 | 68.18 | 7.89 | 2.86 | 15.15 | 9.09 | 7.81 |


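Now I want to obtain a list of clinical trials that reported the highest levels of neuropathies in at least one treatment arm.  To do this, I first filter the DataFrame to keep only the adverse-event columns whose maximum percentage in any treatment arm exceeds 60%.  Note that this selects columns rather than rows, so every trial and treatment arm remains in the result.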

In [15]:
neuropathy_trials = adverse_event_collections.loc[:, adverse_event_collections.max() > 60]
neuropathy_trials.head(20)

| NCT_ID | Arm | Neuropathy | Paraesthesia | Peripheral Neuropathy | Peripheral Sensory Neuropathy | Sensory Neuropathy |
|---|---|---|---|---|---|---|
| NCT00903968 | 1 | | | | | 100.0 |
| NCT00903968 | 2 | | | | | 66.67 |
| NCT00903968 | 3 | | | | | 66.67 |
| NCT00903968 | 4 | | | | | 33.33 |
| NCT00903968 | 5 | | | | | 25.0 |
| NCT00903968 | 6 | | | | | 16.67 |
| NCT00903968 | 7 | | | | | 66.67 |
| NCT00903968 | 8 | | | | | 48.0 |
| NCT00903968 | 9 | | | | | 57.58 |
| NCT01246063 | 1 | | | | 100.0 | |


Since this DataFrame has a multi-index, I first reset the index so that only the NCT_ID level remains, then collect the unique NCT_IDs in a set.

In [16]:
high_neuropathy_NCT_IDs = set(list(neuropathy_trials.reset_index(level=1).index))
high_neuropathy_NCT_IDs

{'NCT00040937',
 'NCT00148317',
 'NCT00153920',
 'NCT00287872',
 'NCT00432458',
 'NCT00478218',
 'NCT00558896',
 'NCT00566098',
 'NCT00581919',
 'NCT00609167',
 'NCT00750815',
 'NCT00772915',
 'NCT00903968',
 'NCT00911859',
 'NCT00985959',
 'NCT01001442',
 'NCT01034553',
 'NCT01056276',
 'NCT01063907',
 'NCT01215344',
 'NCT01246063',
 'NCT01344876',
 'NCT01383928',
 'NCT01447914',
 'NCT01706666',
 'NCT01782963',
 'NCT01794039',
 'NCT01955434'}

Additionally, I want to determine whether there are any common drugs used in studies in which the rates of neuropathy were above 60%.  Using the get_treatments() function lets me list each treatment for all clinical trials examined.

Unfortunately, due to inconsistent reporting, some descriptions provide the drug and dose, while others mention vague generic words.

In [17]:
treatment_arms = get_treatments(path, high_neuropathy_NCT_IDs).sort_index()
treatment_arms

| NCT_ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| NCT00040937 | Induction/PBSC Mobilization | Autologous PBSCT | Prednisone + Thalidomide | | | | | | |
| NCT00148317 | Treatment Arm (All Patients) | | | | | | | | |
| NCT00153920 | Bortezomib | | | | | | | | |
| NCT00287872 | Bortezomib and Thalidomide | | | | | | | | |
| NCT00432458 | Arm I: Thal/ZLD | Arm II: ZLD | | | | | | | |
| NCT00478218 | LCD (Cyclophosphamide 300 mg/m^2) | LCD (Cyclophosphamide 300 mg) | | | | | | | |
| NCT00558896 | Relapsed Myeloma (<4 Prior Regimens): Low Dose | Lenalidomide Refractory Myeloma: Low Dose | Bortezomib/Lenalidomide Refractory/Relapsed My... | Bortezomib/Lenalidomide Relapsed/Refractory My... | Relapsed Myeloma (< 4 Prior Regimens): High Dose | Relapsed/Refractory Myeloma: High Dose | Relapsed Amyloidosis: Low Dose | | |
| NCT00566098 | ASCT+MILs | | | | | | | | |
| NCT00581919 | Bort, Dex, and Dox With ALCAR | | | | | | | | |
| NCT00609167 | CyBorD (Bortezomib 1.3mg/m^2) | CyBorD (Bortezomib 1.5mg/m^2) | | | | | | | |
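
One possible way to look for common drugs despite the inconsistent arm descriptions is a simple keyword scan over the arm titles.  The sketch below is only an illustration: the arm titles are copied from four rows of the table above, while the DRUG_NAMES list and the helper drugs_per_trial() are assumptions of mine, not part of the notebook.

```python
import re
from collections import Counter

# Arm titles copied from a few rows of the treatment table above.
ARM_TITLES = {
    "NCT00040937": ["Induction/PBSC Mobilization", "Autologous PBSCT",
                    "Prednisone + Thalidomide"],
    "NCT00153920": ["Bortezomib"],
    "NCT00287872": ["Bortezomib and Thalidomide"],
    "NCT00609167": ["CyBorD (Bortezomib 1.3mg/m^2)", "CyBorD (Bortezomib 1.5mg/m^2)"],
}

# Assumed keyword list; extend it with whichever drugs are of interest.
DRUG_NAMES = ["bortezomib", "thalidomide", "lenalidomide", "prednisone"]


def drugs_per_trial(arm_titles, drug_names):
    """Return, for each trial, the sorted drug names mentioned in any of its arm titles."""
    found = {}
    for nct_id, titles in arm_titles.items():
        text = " ".join(titles).lower()
        found[nct_id] = sorted(
            drug for drug in drug_names
            if re.search(rf"\b{re.escape(drug)}\b", text)
        )
    return found


trial_drugs = drugs_per_trial(ARM_TITLES, DRUG_NAMES)
drug_counts = Counter(drug for drugs in trial_drugs.values() for drug in drugs)
print(drug_counts.most_common())
```

A whole-word match like this misses abbreviations such as "Thal" or "Bort" and regimen names such as "CyBorD", so the counts are a lower bound unless those aliases are added to the keyword list.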
