# Clinical Trial Data Retrieval (Scrape)

This notebook contains Python script I have written in order to more efficiently examine clinical trials and identify trials that report high occurrences of specific adverse events, neuropathy in this example.  The main set of clinical trials I will use are .xml files downloaded from clinicaltrials.gov when searching for completed clinical trials with results invovling [cancer](https://clinicaltrials.gov/ct2/results?cond=cancer&term=&cntry=&state=&city=&dist=&Search=Search&recrs=e&rslt=With).  This search resulted in 5896 separate trials that were bulk downloaded.



To start, I imported the os, numpy, and pandas libraries to work with the data, bs4 (BeautifulSoup) to scrape the HTML-based trial information from the .xml files, and sqlite3 to save the results to a database and perform queries.

In [1]:
import os
import numpy as np
import pandas as pd

import bs4 as bs
import sqlite3

from shutil import copyfile

The 5896 .xml files were stored in the following path.

In [2]:
path = "/Users/blixt007/HTML/xml/cancer_trials"

# Initial Parsing Functions

These first four functions were created to read and parse the basic clinical trial information.

In [3]:
def clinical_trial_xml_reader(file):
    """Uses BeautifulSoup to open and parse an xml file from a clinical trial.
    Returns the html/xml text.  
    
    The path and file name together are the ony argument.
    The xml_soup is returned
    """
    xml_soup = bs.BeautifulSoup(open(file,"r"), "html.parser")
    # This should automatically close the file.
    return xml_soup


def get_tag_text(soup, tag="title"):
    """A function that returns the text of the first specified tag if present,
    otherwise returns nan.

    Takes a soup of choice and the tag of choice as arguments.  Remember to put the tag in quotes.
    Returns either the text from the tag or, if the tag isn't present, NaN.
    """
    try:
        return soup.find(tag).get_text()
    except AttributeError:
        return np.NaN


def parse_clinical_trial_xml(soup, trial_data_categories):
    """A function to parse multiple myeloma clinical trials from xml files.
    Scrapes multiple fields of interest to describe the study generally.  Uses the get_tag_text()
    function to find text corresponding to tags in the list trial_data_categories.
    
    Takes as arguments the html/xml text from xml_reader() function and a list
    that acts as labels for the columns of the Series'.
    Returns a Series called 'clinical_trial_row' that can be appended as a row to a DataFrame.
    """
    category_dict = {}
    for category in trial_data_categories:
        category_dict[category] = get_tag_text(soup, category.lower())
    clinical_trial_row = pd.Series(data=category_dict, dtype=None)
    return clinical_trial_row

def clinical_trial_scrape(folder_path):
    """Uses clinical_trial_xml_reader() and parse_clinical_trial_xml functions to
    scrape basic information about all clinical trials present in the folder_path.
    If a specific field contains "nan", that means the trial did not report that
    information, which could be either improper reporting or just absence of information.
    
    Takes a folder's path as an argument.  The folder should contain .xml files from
    ClinicalTrials.gov to scrape.
    Returns a DataFrame called 'clinical_trial_df' containing basic information about
    each clinical trial as a row.
    """
    # Create a list of the categories to be scraped, 
    # and use this list as column names for a DataFrame.
    trial_data_categories = ["NCT_ID",
                             "Acronym",
                             "Brief_Title",
                             "Phase",
                             "Agency",
                             "URL",
                             "Overall_Status",
                             "Start_Date",
                             "Completion_Date",
                             "Enrollment",
                             "Number_of_Arms"]
    clinical_trial_df = pd.DataFrame(columns=trial_data_categories)
    
    # Generate a list of all .xml files in the folder,
    # then iterate over the list to parse each file.
    files = sorted([file for file in os.listdir(path) if file.endswith(".xml")])
    for file in files:
        soup = clinical_trial_xml_reader(os.path.join(path, file))
        clinical_trial_row = parse_clinical_trial_xml(soup, trial_data_categories)
        clinical_trial_df = clinical_trial_df.append(clinical_trial_row, ignore_index=True)
    
    clinical_trial_df.Start_Date = pd.to_datetime(clinical_trial_df.Start_Date)
    clinical_trial_df.Completion_Date = pd.to_datetime(clinical_trial_df.Completion_Date)
    clinical_trial_df.Enrollment = clinical_trial_df.Enrollment.astype('int64')
    
    return clinical_trial_df

# Basic Parse of All Trials

Next, the created functions from above were used to obtain a DataFrame of the clinical trial information.

In [4]:
clinical_trial_df = clinical_trial_scrape(path)

In [5]:
clinical_trial_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5896 entries, 0 to 5895
Data columns (total 11 columns):
NCT_ID             5896 non-null object
Acronym            720 non-null object
Brief_Title        5896 non-null object
Phase              5684 non-null object
Agency             5896 non-null object
URL                5896 non-null object
Overall_Status     5896 non-null object
Start_Date         5895 non-null datetime64[ns]
Completion_Date    5785 non-null datetime64[ns]
Enrollment         5896 non-null int64
Number_of_Arms     5611 non-null object
dtypes: datetime64[ns](2), int64(1), object(8)
memory usage: 506.8+ KB


I chose to save the clinical trial information in an SQLite database to use later.

In [6]:
conn = sqlite3.connect("cancer_trials.db")
cur = conn.cursor()

clinical_trial_df.to_sql("cancer_trials", con=conn, index=False, index_label="NCT_ID")
conn.commit()

Here is the scraped information from the first five trials.

In [7]:
clinical_trial_df.head()

Unnamed: 0,NCT_ID,Acronym,Brief_Title,Phase,Agency,URL,Overall_Status,Start_Date,Completion_Date,Enrollment,Number_of_Arms
0,NCT00000479,WHS,Women's Health Study (WHS): A Randomized Trial...,Phase 3,Brigham and Women's Hospital,https://clinicaltrials.gov/show/NCT00000479,Completed,1992-09-01,2005-02-01,39876,4
1,NCT00001566,,A Pilot Study of Autologous T-Cell Transplanta...,Phase 2,National Cancer Institute (NCI),https://clinicaltrials.gov/show/NCT00001566,Completed,1996-12-01,2008-09-01,42,1
2,NCT00001575,,"Anti-Tac(90 Y-HAT) to Treat Hodgkin's Disease,...",Phase 1/Phase 2,National Cancer Institute (NCI),https://clinicaltrials.gov/show/NCT00001575,Completed,1997-04-01,2013-11-01,87,1
3,NCT00001586,,Treatment of Chronic Lymphocytic Leukemia/Smal...,Phase 2,National Cancer Institute (NCI),https://clinicaltrials.gov/show/NCT00001586,Completed,1997-09-01,2011-11-01,105,2
4,NCT00001832,,Lymphocyte Re-infusion During Immune Suppressi...,Phase 2,National Cancer Institute (NCI),https://clinicaltrials.gov/show/NCT00001832,Completed,1999-08-01,2010-05-01,170,15


Here are the last five trials.

In [8]:
clinical_trial_df.tail()

Unnamed: 0,NCT_ID,Acronym,Brief_Title,Phase,Agency,URL,Overall_Status,Start_Date,Completion_Date,Enrollment,Number_of_Arms
5891,NCT03444090,,Impacts of Inspection During Instrument Insert...,,"Evergreen General Hospital, Taiwan",https://clinicaltrials.gov/show/NCT03444090,Completed,2017-10-02,2018-06-30,428,2.0
5892,NCT03456427,3D PAC,Patient-Assisted Compression in 3D - Impact on...,,GE Healthcare,https://clinicaltrials.gov/show/NCT03456427,Completed,2018-01-04,2018-01-09,36,1.0
5893,NCT03489551,PHDC,Feasibility of Prophylactic Haldol to Prevent ...,Phase 4,Michelle Weckmann,https://clinicaltrials.gov/show/NCT03489551,Completed,2011-11-01,2013-10-01,17,1.0
5894,NCT03628885,,An Evaluation of Tailored Messages to Address ...,,Indiana University,https://clinicaltrials.gov/show/NCT03628885,Completed,2018-09-11,2018-09-27,908,3.0
5895,NCT03832322,,Adenoma Miss Rate With Water Exchange vs Carbo...,,"Evergreen General Hospital, Taiwan",https://clinicaltrials.gov/show/NCT03832322,Completed,2018-07-09,2018-11-28,176,


# Parsing Adverse Events

These following two functions were used to parse adverse events from clinical trials to help identify trials in which rates of neuropathy-related adverse events were high.

In [9]:
def min_max_adverse_event(path, event):
    """Determine the maximum and minimum percentage of participants in any treatment
    arm that experience the specified adverse event.
    
    Takes a path and the event as a string as agruments.
    Returns a Series of float values with the trial's NCT ID as the index.
    If the study does not report the specified adverse event, np.NaN will be returned.  
    
    Note: many studies report similar adverse events with slightly different names.
    For this reason it is best to search for the essential portion of the adverse event's 
    name instead of a very specific format.  For instance, some studies report only
    "neuropathy," while others report "neuropathy peripheral" or even "peripheral neuropathy."
    """
    
    max_adverse_event_dict = {}
    files = sorted([file for file in os.listdir(path) if file.endswith(".xml")])
    for file in files:
        soup = clinical_trial_xml_reader(os.path.join(path, file))

        adverse_events = [sub_title for sub_title in soup.find_all("sub_title") if 
                          event.lower() in sub_title.get_text().lower()]

        # Iterate over each adverse event type to find all <counts> and determine the percentage 
        # of each group with said event.
        all_adverse_event_dict = {}
        for adverse_event in adverse_events:
            counts = adverse_event.parent.find_all("counts")
            for count in counts:
                try:
                    all_adverse_event_dict[
                        (count.get("group_id") + "_" + adverse_event.get_text())] = (
                        round(int(count["subjects_affected"])/int(count["subjects_at_risk"])*100, 2))
                # If there is a key error when reading subjects affects or at risk, the
                # values are probably not reported correctly, so pass this count.
                except KeyError:
                    continue
                except ZeroDivisionError:
                    all_adverse_event_dict[
                        (count.get("group_id") + "_" + adverse_event.get_text())] = np.nan

        try:
            max_adverse_event_dict[soup.nct_id.get_text()] = max(all_adverse_event_dict.values())
        except ValueError:
            max_adverse_event_dict[soup.nct_id.get_text()] = np.nan


    return pd.DataFrame([max_adverse_event_dict], index=[
            "Max % " + event.title()]).transpose()


def percent_adverse_events(path, event_list=["neuropathy", "paraesthesia"]):
    """Use the min_max_adverse_event function to parse clinical trials for 
    multiple adverse events supplied as a list.
    Returns a DataFrame with the reported minimum and maximum percentage of 
    participants who experienced each specified adverse event.
    """
    adverse_events_dataFrame = pd.DataFrame()
    for event in event_list:
        percent_event = min_max_adverse_event(path, event)
        adverse_events_dataFrame = pd.concat([adverse_events_dataFrame, percent_event], sort=False)
    return adverse_events_dataFrame

I used the same set of trial .xml files from above and parsed each one for adverse events involving any kind of neuropathy and paraesthesia.

In [10]:
event_list = ["neuropathy", "paraesthesia"]
adverse_events_df = percent_adverse_events(path, event_list)

Sort the DataFrame by descending percentage of the 'Max % Neuropathy' column and display the first five trials.

In [11]:
adverse_events_df.sort_values(by="Max % Neuropathy", inplace=True, ascending=False)
adverse_events_df.head()

Unnamed: 0,Max % Neuropathy,Max % Paraesthesia
NCT00568022,100.0,
NCT01706666,100.0,
NCT01106352,100.0,
NCT01094288,100.0,
NCT01246063,100.0,


# Filtering Trials with Lower Neuropathy Reports

Next I wanted to select the trials that have higher levels of any reported neuropathy event and exclude the remaining trials.

To do this, I selected the NCT_ID for every trial in which the maximum percentage of participants who experienced some form of neuropathy or paraesthesia was greater than 50 percent.

In [12]:
trial_NCT_IDs = list(adverse_events_df.loc[adverse_events_df["Max % Neuropathy"] > 50].index) + list(
        adverse_events_df.loc[adverse_events_df["Max % Paraesthesia"] > 50].index)

len(trial_NCT_IDs)

277

There were 277 out of the original 5896 clinical trials in which 50 percent or more of the participants in at least one treatment arm experienced a form of neuropathy or paraesthesia.


# Detailed Neuropathy-Related Parsing

Next I examined which adverse events in the above trials were reported in each treatment arm, and what treatments were used in each arm.

In [13]:
def parse_nervous_system_events(path, NCT_IDs):
    """Parse all reported adverse events related to the nervous system for each
    treatment arm and report the percentage of participants who experienced said
    events for each trial.
    
    Treatment arms are referred to as simple numbers, not "E#" as in the actual trial data.

    Takes the path containing the .xml files and a list of the NCT_ID numbers
    for each trial to be examined.
    Returns a DataFrame in which adverse events are columns and each treatment
    arm is an index value grouped by the NCT_ID number. Values are represented
    as percent of affected out of total per treatment arm.
    
    Note: Adverse event names can vary slightly from trial to trial, and some
    trials report many more types of adverse events than others.  Not every
    trial will have reported values for each column (adverse event) present
    in the DataFrame.  In this case, nan is reported.
    """
    adverse_event_collections = pd.DataFrame()
    for file in NCT_IDs:
        file = file + ".xml"
        soup = clinical_trial_xml_reader(os.path.join(path, file))

        # Obtain all non-serious nervous system disorders reported.
        nervous_system_disorders = [event.find_next() for event in soup.find_all(
                "title") if "Nervous" in event.get_text()]

        # If two separate fields for nervous system disorders are present, 
        # the second is non-serious events, which is what I am investigating.  
        # So drop the first instance if there are two.  If there are no fields
        # present, then no nervous system adverse events were reported, and 
        # the current trial should be skipped.
        if len(nervous_system_disorders) > 1:
            del nervous_system_disorders[0]
        elif len(nervous_system_disorders) == 0:
            continue

        # Create a nested dictionary that contains the treatment arm group and
        # percentage of participants who reported an adverse event per treatment
        # arm for each adverse event.  Use this dictionary to create a DataFrame.
        # If the subjects at risk is reported as 0, the value is changed to nan.
        adverse_event_dict = {}
        for event in nervous_system_disorders[0].find_all("event"):
            counts_dict = {}
            for count in event.find_all("counts"):
                try:
                    counts_dict[soup.find("nct_id").get_text(), count.get("group_id")[1:]
                               ] = (round((int(count.get("subjects_affected"))/
                               int(count.get("subjects_at_risk"))*100), 2))
                
                except ZeroDivisionError:
                    counts_dict[soup.find("nct_id").get_text(), count.get("group_id")[1:]] = np.nan

            # Rename neuropathy sub_titles to reduce redundancy.
            word = event.find("sub_title").get_text().lower()
            if "neuropathy" in word:
                if "peripheral" in word:
                    if "sensory" in word:
                        word = "peripheral sensory neuropathy"
                    else:
                        word = "peripheral neuropathy"
                elif "sensory" in word:
                    word = "sensory neuropathy"
                elif "motor" in word:
                    word = "motor neuropathy"
            elif "neuropath" in word:
                if "pain" in word:
                    word = "neuropathic pain"
            elif "neuro" in word:
                if "other" in word:
                    word = "other neuro"
                elif "cranial" in word:
                    word = "other neuro"
            # Remove spaces from sub_titles
            if " " in word:
                word = word.replace(" ", "_")
            
            adverse_event_dict[word.title()] = counts_dict

        adverse_event_df = pd.DataFrame(adverse_event_dict)
        adverse_event_collections = adverse_event_collections.append(adverse_event_df, sort=False)

    # Neuropathies are the main focus. Columns of adverse_event_collections are reordered
    # so all columns corresponding to neuropathies are grouped together and others are removed.
    cols = list(adverse_event_collections.columns)
    neuro_cols = []
    for col in cols:
        if "neuro" in col.lower():
            neuro_cols.append(col)
        elif "paraesthesia" in col.lower():
            neuro_cols.append(col)

    neuro_cols.sort()
    adverse_event_collections = adverse_event_collections[neuro_cols]

    return adverse_event_collections

Then I used the shortened list of clinical trials to obtain all neuro-related adverse events for these trials.  The data was stored in the same SQLite database. I also verified that my two tables are present in the SQLite database.

In [14]:
adverse_event_collections = parse_nervous_system_events(path, trial_NCT_IDs)

adverse_event_collections.to_sql(name="ae_coll", con=conn, index_label=["nct_id", "arm"])
conn.commit()
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
cur.fetchall()

[('cancer_trials',), ('ae_coll',)]

Summarizing the DataFrame and sorting by the highest count of each adverse event, it is clear that peripheral neuropathy and similar adverse events were reported much more frequently than others.

In [15]:
adverse_event_collections.describe().sort_values(by="count", axis=1, ascending=False)

Unnamed: 0,Peripheral_Sensory_Neuropathy,Peripheral_Neuropathy,Paraesthesia,Sensory_Neuropathy,Motor_Neuropathy,Other_Neuro,Neuropathy,Neurotoxicity,Polyneuropathy,Neurological_Disorder_Nos,...,Paraesthesia_-_Grade_1,Neuropathy_Sensor,Neuropathy:_Cranial_(Cn_Viii_Hearing_And_Balance),Neuropathy_Cranial,Neuropathy:_Cranial,Neuropathy_Cn_I_Smell,Neuropathy_-_Grade_2,Neuropathy:_Induction,Neuropathy:_Cranial_Optic,Neuropathy_(Hearing)
count,491.0,476.0,320.0,136.0,105.0,57.0,56.0,53.0,51.0,29.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
mean,32.475825,15.586492,10.9365,55.123162,12.458762,7.971404,44.225536,4.74,6.802745,3.571379,...,16.67,59.18,2.13,2.04,0.83,2.78,2.78,52.78,1.22,13.04
std,30.867438,23.842819,18.751186,23.337703,15.845214,9.965248,24.984968,10.65805,14.302684,4.262429,...,,,,,,,,,,
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,16.67,59.18,2.13,2.04,0.83,2.78,2.78,52.78,1.22,13.04
25%,0.0,0.0,0.0,50.5925,3.74,0.0,20.08,0.0,0.0,0.0,...,16.67,59.18,2.13,2.04,0.83,2.78,2.78,52.78,1.22,13.04
50%,23.33,2.8,0.0,59.055,7.69,5.56,51.905,0.0,0.0,2.74,...,16.67,59.18,2.13,2.04,0.83,2.78,2.78,52.78,1.22,13.04
75%,57.14,25.0,14.29,69.035,14.29,11.63,61.84,0.61,2.01,4.23,...,16.67,59.18,2.13,2.04,0.83,2.78,2.78,52.78,1.22,13.04
max,100.0,100.0,100.0,100.0,100.0,40.0,87.23,33.33,66.67,14.29,...,16.67,59.18,2.13,2.04,0.83,2.78,2.78,52.78,1.22,13.04


# Parsing Treatments for Each Trial

Additionally, I wanted to determine whether there were any common drugs used in studies in which the rates of neuropathy were above 50 percent.  Using the get_treatments() function lists each treatment for all clinical trials examined.

In [16]:
def get_treatments(path, NCT_IDs):
    """Creates a DataFrame relating the treatment arm number to the treatment.
    Takes a path containing clinical trial .xml files as the only argument.
    Returns a DataFrame with NCT_ID numbers as the index and treatment
    arm number as columns.
    
    Note: If a treatment arm does not exist or is not reported, the description will be replaced
    by nan.
    """
    trial_dict = {}
    for file in NCT_IDs:
        file = file + ".xml"
        soup = clinical_trial_xml_reader(os.path.join(path, file))

        # Classification of groupings for treatment arms and adverse effects
        # are nested under <reported_events>.
        group_id_dict = {}
        reported_events = soup.reported_events.find_all("group")
        n = 0
        for n in range(n, len(reported_events)):
            group_id_dict[
                int(reported_events[n].get("group_id")[1:])] = (
                reported_events[n].title.get_text())
            n+=1

        trial_dict[file[:-4]] = group_id_dict


    return pd.DataFrame(trial_dict).transpose()

In [17]:
treatment_arms = get_treatments(path, trial_NCT_IDs).sort_index()

treatment_arms.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,19,20,21,22,23,24,25,26,27,28
NCT00002931,HD Chemo and Auto Stem Cells,,,,,,,,,,...,,,,,,,,,,
NCT00003389,Arm A (ABVD),Arm B (Stanford V),,,,,,,,,...,,,,,,,,,,
NCT00004092,Arm I (ACT),Arm II (STAMP V),,,,,,,,,...,,,,,,,,,,
NCT00006110,Chemotherapy + Herceptin,Control: Non-Herceptin,,,,,,,,,...,,,,,,,,,,
NCT00010257,Thymoma,Thymic Carcinoma,,,,,,,,,...,,,,,,,,,,


Though the arm treatments were not well organized for an SQL table, they I still stored them in the SQLite database. To verify that the data was stored correctly, I created a new DataFrame by reading the arm treatments from the database.

In [18]:
treatment_arms.to_sql(name="treat_arms", con=conn, index_label="nct_id")
conn.commit()

treat_arms = pd.read_sql_query("SELECT * FROM treat_arms", con=conn, index_col="nct_id")

# Querying for Trial Number from Database

Ideally, larger trials with more participants should provide better data to understand why neuropathies occur.  Using an SQLite query, I selected a subset of nct id values and treatment arms of trials that have higher than 50 percent reported neuropathic adverse events and also have at least 250 participants enrolled.

In [19]:
cur.execute(
    """SELECT nct_id, arm FROM ae_coll
    WHERE Peripheral_Neuropathy OR
    Peripheral_Sensory_Neuropathy OR
    Sensory_Neuropathy > 50 AND
    nct_id IN (
    SELECT NCT_ID from cancer_trials WHERE Enrollment > 250)
    """)
nct_id_arm = cur.fetchall()

# Saving Treatments to .csv

Using the SQLite query results, I created a Series using the query as the index and the corresponding treatment from treat_arms as the values.  This was saved in the SQLite database and as a .csv file to investigate each study more closely outside of Python.  Part of the reason for this is because there was too much variation in the reported treatment arms. So graphing or analyzing the types of treatments cannot be done without extensive manipulation of reported treatments, as can be seen by the printed dictionary values below.  

In [20]:
trial_query = {pair:treat_arms.loc[pair[0], pair[1]] for pair in nct_id_arm}

pd.Series(data=trial_query).to_csv(os.path.join(path, "trials_to_investigate.csv"))

# The full list of treatments is printed below.
for value in trial_query.values():
    print(value)
    
trial_query.to_sql(name="trials_to_investigate", con=conn)
conn.commit()
conn.close()

Ixabepilone 32 mg/m^2 + Capecitabine 1650 mg/m^2/Day
Ixabepilone 40 mg/m^2 + Capecitabine 1650 mg/m^2/Day
Ixabepilone 40 mg/m^2 + Capecitabine 2000 mg/m^2/Day
Arm A
Arm C
Alpharadin 25 kBq/kg + Docetaxel 75 mg/m^2 - Dose Escalation
Alpharadin 25 kBq/kg + Docetaxel 60 mg/m^2 - Dose Escalation
Alpharadin 50 kBq/kg + Docetaxel 60 mg/m^2 - Dose Escalation
Alpharadin 50 kBq/kg + Docetaxel 60 mg/m^2 - Safety Cohort
Docetaxel 75 mg/m^2 - Safety Cohort
Alisertib 10 mg (7D) + Docetaxel 75 mg/m^2
Alisertib 20 mg (7D) + Docetaxel 75 mg/m^2
Alisertib 30 mg (7D) + Docetaxel 60 mg/m^2
Phase I - Part 1 Dose Level 0 (Carfilzomib 20/27 mg/m^2)
Phase I - Part 1 Dose Level 1 (Carfilzomib 20/36 mg/m^2)
Phase I -Part 2 Cohort 0 (Carfilzomib 56 mg/m^2+Dexamethasone)
Phase 2 (Carfilzomib 56 mg/m^2+ Dexamethasone)
Rituximab Plus Bortezomib
LY 10/Carb 5/Pem 500 (Cohort 1)
LY 10/Carb 6/Pem 500 (Cohort 2)
LY 40/Carb 6/Pem 500 (Cohort 4)
LY 80/Carb 6/Pem 500 + R50 (Cohort 7)
LY 40/Carb 6/Pem 500 + R50 (Cohort 9)


AttributeError: 'dict' object has no attribute 'to_sql'

Lastly, I created a list of the .xml files from trial_query, copied each .xml file, and moved them to a new folder called "to_investigate" for further work.  

In [21]:
trials_to_investigate = set([pair[0] for pair in nct_id_arm])

os.mkdir(os.path.join(path, "to_investigate"))
for file in trials_to_investigate:
    file = file + ".xml"
    full_path = os.path.join(path, file)
    destination = os.path.join(path, "to_investigate", file)
    copyfile(full_path, destination)