<h1> Data Extraction from ClinicalTrials.gov </h1>

---

In [1]:
import pandas as pd
import json

In [None]:
with open('Clinical Trial raw data.json') as file:
    data = json.load(file)

<b> The 'Clinical Trial raw data.json' file is 272 MB in size and too large to upload to GitHub. </b>

<h2>1. Conversion of JSON data from ClinicalTrials.gov into Pandas DataFrame </h2>

The data from ClinicalTrials.gov was downloaded as a JSON file and loaded into Python.

In [None]:
nctid = []
trial_status = []
last_known_trial_status = []
trial_locations = []
trial_start_date = []
trial_completion_date = []
sponsor_name = []
sponsor_type = []
FDA_regulated_drug = []
conditions = []
phases = []
num_of_participants = []
interventions = []
healthy_participants = []
gender = []
min_age = []
max_age = []

# Append the value of each specified key to its list. If a key does not exist, append None instead.
for row in data:
    protocolSection = row['protocolSection']
    try:
        nctid.append(protocolSection['identificationModule']['nctId'])
    except KeyError:
        nctid.append(None)
    try:
        trial_status.append(protocolSection['statusModule']['overallStatus'])
    except KeyError:
        trial_status.append(None)
    try:
        last_known_trial_status.append(protocolSection['statusModule']['lastKnownStatus'])
    except KeyError:
        last_known_trial_status.append(None)
    try:
        trial_locations.append(protocolSection['contactsLocationsModule']['locations'])
    except KeyError:
        trial_locations.append(None)
    try:
        trial_start_date.append(protocolSection['statusModule']['startDateStruct']['date'])
    except KeyError:
        trial_start_date.append(None)
    try:
        trial_completion_date.append(protocolSection['statusModule']['completionDateStruct']['date'])
    except KeyError:
        trial_completion_date.append(None)
    try:
        sponsor_name.append(protocolSection['sponsorCollaboratorsModule']['leadSponsor']['name'])
    except KeyError:
        sponsor_name.append(None)
    try:
        sponsor_type.append(protocolSection['sponsorCollaboratorsModule']['leadSponsor']['class'])
    except KeyError:
        sponsor_type.append(None)
    try:
        FDA_regulated_drug.append(protocolSection['oversightModule']['isFdaRegulatedDrug'])
    except KeyError:
        FDA_regulated_drug.append(None)
    try:
        conditions.append(protocolSection['conditionsModule']['conditions'])
    except KeyError:
        conditions.append(None)
    try:
        phases.append(protocolSection['designModule']['phases'])
    except KeyError:
        phases.append(None)
    try:
        num_of_participants.append(protocolSection['designModule']['enrollmentInfo']['count'])
    except KeyError:
        num_of_participants.append(None)
    try:
        interventions.append(protocolSection['armsInterventionsModule']['interventions'])
    except KeyError:
        interventions.append(None)
    try:
        healthy_participants.append(protocolSection['eligibilityModule']['healthyVolunteers'])
    except KeyError:
        healthy_participants.append(None)
    try:
        gender.append(protocolSection['eligibilityModule']['sex'])
    except KeyError:
        gender.append(None)
    try:
        min_age.append(protocolSection['eligibilityModule']['minimumAge'])
    except KeyError:
        min_age.append(None)
    try:
        max_age.append(protocolSection['eligibilityModule']['maximumAge'])
    except KeyError:
        max_age.append(None)

The JSON data is a list of nested dictionaries, one per clinical trial. <br>

The clinical trial fields of interest were extracted from the JSON data into separate lists, with None recorded wherever a field is missing.
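The repeated try/except blocks in the cell above can be condensed with a small lookup helper. The following is a minimal sketch rather than the notebook's own code: `get_nested` and the sample `record` are hypothetical names introduced for illustration, and the record carries only a few of the fields used above.

```python
def get_nested(record, path, default=None):
    """Follow a sequence of keys through nested dictionaries.

    Returns `default` as soon as a key is missing or a non-dict is reached.
    """
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical record shaped like one entry of the downloaded JSON.
record = {
    "protocolSection": {
        "identificationModule": {"nctId": "NCT00987766"},
        "statusModule": {"overallStatus": "COMPLETED"},
    }
}

nct_id = get_nested(record, ["protocolSection", "identificationModule", "nctId"])
last_known = get_nested(record, ["protocolSection", "statusModule", "lastKnownStatus"])
print(nct_id, last_known)  # NCT00987766 None
```

With such a helper, each field needs one line in the loop instead of four, and a missing key anywhere along the path yields None.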

In [4]:
clinicaltrial = {'NCT ID' : nctid, 'Trial Status' : trial_status, 'Last Known Trial Status' : last_known_trial_status, 'Trial Locations' : trial_locations, 'Trial Start Date' : trial_start_date, 
                 'Trial Completion Date' : trial_completion_date, 'Sponsor Name' : sponsor_name, 'Sponsor Type' : sponsor_type, 'FDA Regulated Drug' : FDA_regulated_drug, 'Phases' : phases, 
                 'No of Participants' : num_of_participants, 'Interventions' : interventions, 'Conditions' : conditions, 'Healthy Participants' : healthy_participants, 'Gender' : gender, 'Min Age' : min_age, 'Max Age' : max_age}
clinicaltrial_df = pd.DataFrame(clinicaltrial)
clinicaltrial_df

Unnamed: 0,NCT ID,Trial Status,Last Known Trial Status,Trial Locations,Trial Start Date,Trial Completion Date,Sponsor Name,Sponsor Type,FDA Regulated Drug,Phases,No of Participants,Interventions,Conditions,Healthy Participants,Gender,Min Age,Max Age
0,NCT00987766,COMPLETED,,[{'country': 'United States'}],2009-11,2016-10,Vanderbilt-Ingram Cancer Center,OTHER,,[PHASE1],28.0,"[{'type': 'DRUG', 'name': 'erlotinib hydrochlo...","[Extrahepatic Bile Duct Cancer, Gallbladder Ca...",False,ALL,18 Years,
1,NCT02734966,COMPLETED,,"[{'country': 'China'}, {'country': 'China'}, {...",2007-12,2011-12,"Chia Tai Tianqing Pharmaceutical Group Co., Ltd.",INDUSTRY,,[PHASE2],174.0,"[{'type': 'DRUG', 'name': 'Magnesium Isoglycyr...",[Drug-Induced Liver Injury],False,ALL,18 Years,70 Years
2,NCT02013466,COMPLETED,,[{'country': 'Netherlands'}],2008-10,2009-02,Nutricia Research,INDUSTRY,,[PHASE1],12.0,"[{'type': 'DIETARY_SUPPLEMENT', 'name': 'Bolus...",[Sarcopenia],True,ALL,65 Years,
3,NCT02922166,UNKNOWN,ACTIVE_NOT_RECRUITING,[{'country': 'United States'}],2017-02-03,2019-12,Azevan Pharmaceuticals,INDUSTRY,True,[PHASE1],36.0,"[{'type': 'DRUG', 'name': 'SRX246'}, {'type': ...","[Fear, Anxiety]",True,ALL,21 Years,50 Years
4,NCT06530966,RECRUITING,,[{'country': 'United States'}],2024-07-23,2024-12-25,InnoCare Pharma Inc.,INDUSTRY,True,[PHASE1],24.0,"[{'type': 'DRUG', 'name': 'ICP-332 Tablets'}, ...",[Healthy Subjects],True,ALL,18 Years,55 Years
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
135900,NCT01610687,COMPLETED,,[{'country': 'United Kingdom'}],2001-07,2005-02,Jazz Pharmaceuticals,INDUSTRY,,[PHASE3],137.0,"[{'type': 'DRUG', 'name': 'GW-1000-02'}]","[Multiple Sclerosis, Spasticity]",False,ALL,18 Years,
135901,NCT05813587,COMPLETED,,[{'country': 'China'}],2023-03-23,2023-10-26,Jiangsu Pacific Meinuoke Bio Pharmaceutical Co...,INDUSTRY,False,[PHASE3],121.0,"[{'type': 'BIOLOGICAL', 'name': 'Meplazumab fo...",[Post-COVID-19],False,ALL,18 Years,84 Years
135902,NCT05476887,ACTIVE_NOT_RECRUITING,,"[{'country': 'China'}, {'country': 'China'}, {...",2022-11-25,2025-02,"Kira Pharmacenticals (US), LLC.",INDUSTRY,False,[PHASE2],35.0,"[{'type': 'DRUG', 'name': 'KP104'}]",[Paroxysmal Nocturnal Hemoglobinuria],False,ALL,18 Years,
135903,NCT01887587,TERMINATED,,[{'country': 'United States'}],2013-06,2016-02-29,Ehab L Atallah,OTHER,,[PHASE1],5.0,"[{'type': 'DRUG', 'name': 'MLN9708'}, {'type':...",[Relapsed or Refractory Acute Lymphoblastic Le...,False,ALL,18 Years,


Combined all the lists into a single dictionary in which the keys are the column names and the values are the lists holding the corresponding clinical trial data. <br>

Converted the dictionary into a pandas DataFrame.

In [None]:
clinicaltrial_df['Trial Location'] = clinicaltrial_df['Trial Locations']
countries = clinicaltrial_df['Trial Locations']

# For every row, extract the country name from the first dictionary in the list.
for index in range(len(countries)):
    if countries[index] is not None:
        clinicaltrial_df.loc[index, 'Trial Location'] = countries[index][0].get('country')

        # Compare each country name with the one immediately before it and append it only if it differs.
        for index2 in range(len(countries[index]) - 1):
            try:
                if countries[index][index2 + 1].get('country') != countries[index][index2].get('country'):
                    clinicaltrial_df.loc[index, 'Trial Location'] = clinicaltrial_df.loc[index, 'Trial Location'] + ', ' + countries[index][index2 + 1].get('country')
            except TypeError:
                # A missing country name (None) cannot be concatenated, so skip it.
                continue
    else:
        clinicaltrial_df.loc[index, 'Trial Location'] =  None

clinicaltrial_df['Trial Location']

0                                             United States
1                                                     China
2                                               Netherlands
3                                             United States
4                                             United States
                                ...                        
135900                                       United Kingdom
135901                                                China
135902                                                China
135903                                        United States
135904    United States, Australia, Austria, Belgium, Br...
Name: Trial Location, Length: 135905, dtype: object

Each row in the 'Trial Locations' column contains a list of dictionaries, which was converted into a character string so that the data can be easily accessed during analysis. <br>

For each row, the country names were combined into a single string with each country separated by a comma. A name is skipped only when it matches the entry immediately before it, so a country that reappears later in the list is included again.
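If each country should appear exactly once regardless of where it sits in the list, an order-preserving de-duplication is one option. This is a sketch under assumptions, not the cell's behaviour: `unique_countries` and the sample `locations` list are hypothetical, and entries without a 'country' key are simply ignored.

```python
def unique_countries(locations):
    """Join the distinct country names of a trial, keeping first-seen order."""
    if not locations:
        return None
    names = [loc.get("country") for loc in locations if loc.get("country")]
    # dict.fromkeys keeps the first occurrence of each key, in insertion order.
    distinct = list(dict.fromkeys(names))
    return ", ".join(distinct) if distinct else None


# Hypothetical list of location dictionaries for a single trial.
locations = [
    {"country": "United States"},
    {"country": "Canada"},
    {"country": "United States"},
    {"city": "Unknown"},  # entry without a 'country' key
]
print(unique_countries(locations))  # United States, Canada
```

The function could be applied to the whole column with `clinicaltrial_df['Trial Locations'].apply(unique_countries)`.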

In [None]:
clinicaltrial_df['Phase'] = clinicaltrial_df['Phases']
phases = clinicaltrial_df['Phases']

# For every row, extract the first phase from the list.
for index in range(len(phases)):
    if phases[index] is not None:
        clinicaltrial_df.loc[index, 'Phase'] = phases[index][0]

        # Append all subsequent phases to the first phase.
        for index2 in range(len(phases[index]) - 1):
            try:
                clinicaltrial_df.loc[index, 'Phase'] = clinicaltrial_df.loc[index, 'Phase'] + ', ' + phases[index][index2 + 1]
            except TypeError:
                # Skip any value that cannot be concatenated as a string.
                continue
    else:
        clinicaltrial_df.loc[index, 'Phase'] =  None

clinicaltrial_df['Phase']

0         PHASE1
1         PHASE2
2         PHASE1
3         PHASE1
4         PHASE1
           ...  
135900    PHASE3
135901    PHASE3
135902    PHASE2
135903    PHASE1
135904    PHASE3
Name: Phase, Length: 135905, dtype: object

Each row in the 'Phases' column contains a list, which was converted into a character string so that the data can be easily accessed during analysis. <br>

For each row, all phases were combined into a single string with each phase separated by a comma.
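The same list-to-string conversion can be written without an explicit index loop. The snippet below is a hedged sketch on a toy Series (the values are illustrative, not taken from the dataset). It uses `Series.apply` with `str.join` and leaves missing entries as missing.

```python
import pandas as pd

# Toy stand-in for the 'Phases' column: lists of phase labels, or None.
phases = pd.Series([["PHASE1"], ["PHASE1", "PHASE2"], None])

# Join each list into one comma-separated string; anything that is not a list stays missing.
joined = phases.apply(lambda v: ", ".join(v) if isinstance(v, list) else None)
print(joined.iloc[0])  # PHASE1
print(joined.iloc[1])  # PHASE1, PHASE2
```

The same expression would work for any column of string lists, such as 'Conditions'.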

In [None]:
clinicaltrial_df['Treated Condition'] = clinicaltrial_df['Conditions']
conditions = clinicaltrial_df['Conditions']

# For every row, extract the first condition from the list.
for index in range(len(conditions)):
    if conditions[index] is not None:
        clinicaltrial_df.loc[index, 'Treated Condition'] = conditions[index][0]

        # Append all subsequent conditions to the first condition.
        for index2 in range(len(conditions[index]) - 1):
            try:
                clinicaltrial_df.loc[index, 'Treated Condition'] = clinicaltrial_df.loc[index, 'Treated Condition'] + ', ' + conditions[index][index2 + 1]
            except TypeError:
                # Skip any value that cannot be concatenated as a string.
                continue
    else:
        clinicaltrial_df.loc[index, 'Treated Condition'] =  None

clinicaltrial_df['Treated Condition']

0         Extrahepatic Bile Duct Cancer, Gallbladder Can...
1                                 Drug-Induced Liver Injury
2                                                Sarcopenia
3                                             Fear, Anxiety
4                                          Healthy Subjects
                                ...                        
135900                       Multiple Sclerosis, Spasticity
135901                                        Post-COVID-19
135902                  Paroxysmal Nocturnal Hemoglobinuria
135903    Relapsed or Refractory Acute Lymphoblastic Leu...
135904                             Diabetes Mellitus Type 2
Name: Treated Condition, Length: 135905, dtype: object

Each row in the 'Conditions' column contains a list, which was converted into a character string so that the data can be easily accessed during analysis. <br>

For each row, all conditions were combined into a single string with each condition separated by a comma.

In [None]:
clinicaltrial_df['Intervention Type'] = None
clinicaltrial_df['Intervention Name'] = None
interventions = clinicaltrial_df['Interventions']

# For every row, read the list of intervention dictionaries.
for index, row in clinicaltrial_df.iterrows():
    interventions_list = row['Interventions']

    # Only process the row if it holds a non-empty list.
    if interventions_list and isinstance(interventions_list, list):
        types = []
        names = []

        # Collect the type and name of each DRUG or BIOLOGICAL intervention whose name does not contain 'placebo'.
        for item in interventions_list:
            name = item.get('name', '')
            itype = item.get('type')
            if itype in ['DRUG', 'BIOLOGICAL'] and 'placebo' not in name.lower():
                types.append(itype)
                names.append(name)
        if types and names:
            clinicaltrial_df.at[index, 'Intervention Type'] = ', '.join(types)
            clinicaltrial_df.at[index, 'Intervention Name'] = ', '.join(names)

clinicaltrial_df['Intervention Type']

0               DRUG, DRUG, DRUG
1               DRUG, DRUG, DRUG
2                           None
3                           DRUG
4                           DRUG
                   ...          
135900                      DRUG
135901                BIOLOGICAL
135902                      DRUG
135903    DRUG, DRUG, DRUG, DRUG
135904    DRUG, DRUG, DRUG, DRUG
Name: Intervention Type, Length: 135905, dtype: object

Each row in the 'Interventions' column contains a list of dictionaries with the keys 'type' and 'name' and their corresponding values. The 'type' values were written to a new column called 'Intervention Type'. <br>

For each row in the 'Intervention Type' column, only the 'DRUG' and 'BIOLOGICAL' types of non-placebo interventions were kept and combined into a single string with each intervention type separated by a comma.

In [9]:
clinicaltrial_df['Intervention Name']

0         erlotinib hydrochloride, gemcitabine hydrochlo...
1         Magnesium Isoglycyrrhizinate Injection 100mg O...
2                                                      None
3                                                    SRX246
4                                           ICP-332 Tablets
                                ...                        
135900                                           GW-1000-02
135901                             Meplazumab for injection
135902                                                KP104
135903     MLN9708, Vincristine, Doxorubicin, Dexamethasone
135904    insulin glargine, metformin, taspoglutide, tas...
Name: Intervention Name, Length: 135905, dtype: object

The 'name' values were likewise written to a new column called 'Intervention Name'. <br>

For each row in the 'Intervention Name' column, the names of all 'DRUG' and 'BIOLOGICAL' interventions, except those containing the word 'placebo', were combined into a single string with each intervention name separated by a comma.
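The filtering rule for a single trial can be isolated in one function, which makes it easy to check against a hand-made example. This is a sketch only: `summarise_interventions`, `KEPT_TYPES` and the `sample` list are hypothetical names, and the sample mixes two names seen in the output above with made-up entries.

```python
KEPT_TYPES = {"DRUG", "BIOLOGICAL"}


def summarise_interventions(interventions):
    """Return (types, names) as comma-separated strings, or (None, None)."""
    if not isinstance(interventions, list):
        return None, None
    kept = [
        (item.get("type"), item.get("name", ""))
        for item in interventions
        if item.get("type") in KEPT_TYPES
        and "placebo" not in item.get("name", "").lower()
    ]
    if not kept:
        return None, None
    types, names = zip(*kept)
    return ", ".join(types), ", ".join(names)


# Hypothetical interventions for one trial.
sample = [
    {"type": "DRUG", "name": "SRX246"},
    {"type": "DRUG", "name": "Placebo"},
    {"type": "BEHAVIORAL", "name": "Counselling"},
    {"type": "BIOLOGICAL", "name": "Meplazumab for injection"},
]
print(summarise_interventions(sample))
# ('DRUG, BIOLOGICAL', 'SRX246, Meplazumab for injection')
```

A trial whose interventions are all placebos or non-drug types returns `(None, None)`, which corresponds to the rows shown as None in both columns above.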

<h2>2. Cleaning of Clinical Trial Data </h2>

In [10]:
clinicaltrial_df.shape

(135905, 22)

In [11]:
clinicaltrial_df.isnull().sum()

NCT ID                          0
Trial Status                    0
Last Known Trial Status    119048
Trial Locations             10706
Trial Start Date              301
Trial Completion Date           0
Sponsor Name                    0
Sponsor Type                    0
FDA Regulated Drug          80948
Phases                          0
No of Participants           1172
Interventions                   0
Conditions                      1
Healthy Participants          357
Gender                         55
Min Age                      4232
Max Age                     60118
Trial Location              10714
Phase                           0
Treated Condition               1
Intervention Type           16631
Intervention Name           16631
dtype: int64

In [12]:
clinicaltrial_df = clinicaltrial_df.dropna(subset = ['Trial Start Date'])
clinicaltrial_df = clinicaltrial_df.dropna(subset = ['Intervention Type'])
clinicaltrial_df = clinicaltrial_df.dropna(subset = ['Trial Locations'])
clinicaltrial_df.isnull().sum()

NCT ID                         0
Trial Status                   0
Last Known Trial Status    97872
Trial Locations                0
Trial Start Date               0
Trial Completion Date          0
Sponsor Name                   0
Sponsor Type                   0
FDA Regulated Drug         62042
Phases                         0
No of Participants           873
Interventions                  0
Conditions                     0
Healthy Participants         241
Gender                        24
Min Age                     3098
Max Age                    49172
Trial Location                 8
Phase                          0
Treated Condition              0
Intervention Type              0
Intervention Name              0
dtype: int64

Dropped all rows with missing values in the 'Trial Locations', 'Trial Start Date', and 'Intervention Type' columns.
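The three `dropna` calls can also be expressed as one call with a list passed to `subset`. The toy frame below is illustrative only (its values are placeholders); it shows that a row is removed when any of the listed columns is missing.

```python
import pandas as pd

# Toy frame: row 0 is complete, rows 1-3 each miss one required field.
toy = pd.DataFrame({
    "Trial Start Date": ["2009-11", None, "2017-02-03", "2013-06"],
    "Intervention Type": ["DRUG", "DRUG", None, "BIOLOGICAL"],
    "Trial Locations": [
        [{"country": "China"}],
        [{"country": "China"}],
        [{"country": "China"}],
        None,
    ],
})

required = ["Trial Start Date", "Intervention Type", "Trial Locations"]
cleaned = toy.dropna(subset=required)
print(len(toy), "->", len(cleaned))  # 4 -> 1
```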

In [13]:
clinicaltrial_df['Start Date'] = pd.to_datetime(clinicaltrial_df['Trial Start Date'], format = 'mixed')
clinicaltrial_df['Start Date'] = clinicaltrial_df['Start Date'].dt.strftime('%Y-%m')

clinicaltrial_df['Completion Date'] = pd.to_datetime(clinicaltrial_df['Trial Completion Date'], format = 'mixed')
clinicaltrial_df['Completion Date'] = clinicaltrial_df['Completion Date'].dt.strftime('%Y-%m')

clinicaltrial_df[['Start Date', 'Completion Date']]

Unnamed: 0,Start Date,Completion Date
0,2009-11,2016-10
1,2007-12,2011-12
3,2017-02,2019-12
4,2024-07,2024-12
5,2010-06,2012-09
...,...,...
135900,2001-07,2005-02
135901,2023-03,2023-10
135902,2022-11,2025-02
135903,2013-06,2016-02


To keep the date format consistent, the 'Trial Start Date' and 'Trial Completion Date' values were converted into YYYY-MM strings and stored in the new 'Start Date' and 'Completion Date' columns.
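The raw dates arrive either as 'YYYY-MM' or as 'YYYY-MM-DD', as the first table shows. A plain-Python sketch of the same normalisation is given below; `to_year_month` is a hypothetical helper, and it assumes those two formats are the only ones present.

```python
from datetime import datetime


def to_year_month(value):
    """Normalise 'YYYY-MM-DD' or 'YYYY-MM' text to a 'YYYY-MM' string."""
    if value is None:
        return None
    for fmt in ("%Y-%m-%d", "%Y-%m"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m")
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {value!r}")


print(to_year_month("2017-02-03"))  # 2017-02
print(to_year_month("2019-12"))     # 2019-12
```

Raising on an unrecognised format, rather than returning None, makes unexpected date shapes visible instead of silently turning them into missing values.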

In [14]:
clinicaltrial_df['Trial Duration (Days)'] = (pd.to_datetime(clinicaltrial_df['Completion Date']) - pd.to_datetime(clinicaltrial_df['Start Date'])).dt.days
clinicaltrial_df['Trial Duration (Days)']

0         2526
1         1461
3         1033
4          153
5          823
          ... 
135900    1311
135901     214
135902     823
135903     975
135904     760
Name: Trial Duration (Days), Length: 109775, dtype: int64

Created a new column for trial duration in days, by subtracting 'Start Date' from 'Completion Date'.
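Because both dates were reduced to year and month beforehand, the subtraction effectively runs between the first day of each month. The short check below reproduces two of the values in the output above using only the standard library; `month_start` and `duration_days` are hypothetical helper names.

```python
from datetime import date


def month_start(year_month):
    """Turn a 'YYYY-MM' string into the date of the first day of that month."""
    year, month = map(int, year_month.split("-"))
    return date(year, month, 1)


def duration_days(start, completion):
    """Days between the first day of the start month and of the completion month."""
    return (month_start(completion) - month_start(start)).days


print(duration_days("2009-11", "2016-10"))  # 2526
print(duration_days("2024-07", "2024-12"))  # 153
```

Durations computed this way can differ from the exact figure by up to about a month at each end for trials that report full dates.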

In [15]:
clinicaltrial_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 109775 entries, 0 to 135904
Data columns (total 25 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   NCT ID                   109775 non-null  object 
 1   Trial Status             109775 non-null  object 
 2   Last Known Trial Status  11903 non-null   object 
 3   Trial Locations          109775 non-null  object 
 4   Trial Start Date         109775 non-null  object 
 5   Trial Completion Date    109775 non-null  object 
 6   Sponsor Name             109775 non-null  object 
 7   Sponsor Type             109775 non-null  object 
 8   FDA Regulated Drug       47733 non-null   object 
 9   Phases                   109775 non-null  object 
 10  No of Participants       108902 non-null  float64
 11  Interventions            109775 non-null  object 
 12  Conditions               109775 non-null  object 
 13  Healthy Participants     109534 non-null  object 
 14  Gender   

In [16]:
clinicaltrial_df = clinicaltrial_df[['NCT ID', 'Trial Status', 'Last Known Trial Status', 'Phase', 'Start Date', 'Completion Date', 'Trial Duration (Days)', 'Trial Location', 'Sponsor Name', 'Sponsor Type',
                                     'FDA Regulated Drug', 'Intervention Type', 'Intervention Name', 'Treated Condition', 'No of Participants', 'Healthy Participants', 'Gender', 'Min Age', 'Max Age']]
clinicaltrial_df

Unnamed: 0,NCT ID,Trial Status,Last Known Trial Status,Phase,Start Date,Completion Date,Trial Duration (Days),Trial Location,Sponsor Name,Sponsor Type,FDA Regulated Drug,Intervention Type,Intervention Name,Treated Condition,No of Participants,Healthy Participants,Gender,Min Age,Max Age
0,NCT00987766,COMPLETED,,PHASE1,2009-11,2016-10,2526,United States,Vanderbilt-Ingram Cancer Center,OTHER,,"DRUG, DRUG, DRUG","erlotinib hydrochloride, gemcitabine hydrochlo...","Extrahepatic Bile Duct Cancer, Gallbladder Can...",28.0,False,ALL,18 Years,
1,NCT02734966,COMPLETED,,PHASE2,2007-12,2011-12,1461,China,"Chia Tai Tianqing Pharmaceutical Group Co., Ltd.",INDUSTRY,,"DRUG, DRUG, DRUG",Magnesium Isoglycyrrhizinate Injection 100mg O...,Drug-Induced Liver Injury,174.0,False,ALL,18 Years,70 Years
3,NCT02922166,UNKNOWN,ACTIVE_NOT_RECRUITING,PHASE1,2017-02,2019-12,1033,United States,Azevan Pharmaceuticals,INDUSTRY,True,DRUG,SRX246,"Fear, Anxiety",36.0,True,ALL,21 Years,50 Years
4,NCT06530966,RECRUITING,,PHASE1,2024-07,2024-12,153,United States,InnoCare Pharma Inc.,INDUSTRY,True,DRUG,ICP-332 Tablets,Healthy Subjects,24.0,True,ALL,18 Years,55 Years
5,NCT00651066,COMPLETED,,PHASE2,2010-06,2012-09,823,Vietnam,French National Agency for Research on AIDS an...,OTHER_GOV,,"DRUG, DRUG",rifabutin in combination with lopinavir booste...,"HIV Infections, Tuberculosis",47.0,False,ALL,18 Years,65 Years
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
135900,NCT01610687,COMPLETED,,PHASE3,2001-07,2005-02,1311,United Kingdom,Jazz Pharmaceuticals,INDUSTRY,,DRUG,GW-1000-02,"Multiple Sclerosis, Spasticity",137.0,False,ALL,18 Years,
135901,NCT05813587,COMPLETED,,PHASE3,2023-03,2023-10,214,China,Jiangsu Pacific Meinuoke Bio Pharmaceutical Co...,INDUSTRY,False,BIOLOGICAL,Meplazumab for injection,Post-COVID-19,121.0,False,ALL,18 Years,84 Years
135902,NCT05476887,ACTIVE_NOT_RECRUITING,,PHASE2,2022-11,2025-02,823,China,"Kira Pharmacenticals (US), LLC.",INDUSTRY,False,DRUG,KP104,Paroxysmal Nocturnal Hemoglobinuria,35.0,False,ALL,18 Years,
135903,NCT01887587,TERMINATED,,PHASE1,2013-06,2016-02,975,United States,Ehab L Atallah,OTHER,,"DRUG, DRUG, DRUG, DRUG","MLN9708, Vincristine, Doxorubicin, Dexamethasone",Relapsed or Refractory Acute Lymphoblastic Leu...,5.0,False,ALL,18 Years,


Dropped the columns that are no longer needed: Phases, Interventions, Conditions, Trial Locations, Trial Start Date, and Trial Completion Date.

Re-ordered the remaining columns for easier analysis.

In [17]:
mask = (clinicaltrial_df['Trial Location'].str.contains('United States', case = False, na = False)) & (clinicaltrial_df['FDA Regulated Drug'] != False)
clinicaltrial_df = clinicaltrial_df[mask]
clinicaltrial_df.shape

(56043, 19)

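Kept only the clinical trials with at least one location in the United States whose 'FDA Regulated Drug' value is not False. Trials conducted entirely outside the United States, or explicitly marked as not FDA regulated, were dropped, while rows with a missing 'FDA Regulated Drug' value were retained.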
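How the filter treats True, False and missing flags can be checked on a toy frame. The data below is made up for illustration (placeholder IDs, not records from the dataset); the two conditions mirror the mask in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "NCT ID": ["A", "B", "C", "D"],  # placeholder identifiers
    "Trial Location": [
        "United States",
        "United States, Canada",
        "United States",
        "China",
    ],
    "FDA Regulated Drug": [True, None, False, True],
})

# Condition 1: the location string mentions the United States.
in_us = toy["Trial Location"].str.contains("United States", case=False, na=False)
# Condition 2: the flag is anything other than False, so missing values pass.
not_flagged_false = toy["FDA Regulated Drug"] != False

kept = toy[in_us & not_flagged_false]
print(kept["NCT ID"].tolist())  # ['A', 'B']
```

Row B survives because its flag is missing rather than False; using `== True` instead would be the stricter choice if only confirmed FDA-regulated trials were wanted.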

In [18]:
clinicaltrial_df.to_csv('Clinical Trial Data.csv')

Exported extracted data for next steps.