<h1> Data Extraction from DrugBank.com and ChEMBL</h1>

---

In [1]:
import numpy as np
import pandas as pd
import requests
import re
import xml.etree.ElementTree as ET
from rapidfuzz import process, fuzz

In [2]:
clinicaltrial_df = pd.read_csv('Clinical Trial Data.csv')

<h2>1. Further Cleaning of Clinical Trial Data </h2>

Before extracting data from DrugBank.com, the drug names in the clinical trial data need further cleaning so that they can be matched against the drug names from DrugBank.com.

In [3]:
drug_names_df = pd.DataFrame(clinicaltrial_df['Intervention Name'].str.lower().str.split(', ', expand=True))
drug_names_df.insert(0, 'NCT ID', clinicaltrial_df['NCT ID'])
drug_names_df


Unnamed: 0,NCT ID,0,1,2,3,4,5,6,7,8,...,50,51,52,53,54,55,56,57,58,59
0,NCT00987766,erlotinib hydrochloride,gemcitabine hydrochloride,oxaliplatin,,,,,,,...,,,,,,,,,,
1,NCT02922166,srx246,,,,,,,,,...,,,,,,,,,,
2,NCT06530966,icp-332 tablets,,,,,,,,,...,,,,,,,,,,
3,NCT02367066,ar-c165395xx,,,,,,,,,...,,,,,,,,,,
4,NCT00033566,s-3304,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56038,NCT01682187,ly2157299,lomustine,,,,,,,,...,,,,,,,,,,
56039,NCT04191187,fludarabine,melphalan,,,,,,,,...,,,,,,,,,,
56040,NCT01585987,ipilimumab,,,,,,,,,...,,,,,,,,,,
56041,NCT01887587,mln9708,vincristine,doxorubicin,dexamethasone,,,,,,...,,,,,,,,,,


The intervention name column, which contains a string of all drug names used in the clinical trial, is split into separate columns and stored in a new dataframe for further cleaning. <br>

The NCT ID column is also inserted into the dataframe to be used as a joining key when merging datasets.
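As a minimal sketch of the split step above, the toy dataframe below stands in for the clinical trial data. The NCT IDs and drug names are made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy stand-in for the clinical trial table (values are illustrative only)
toy_trials = pd.DataFrame({
    'NCT ID': ['NCT00000001', 'NCT00000002'],
    'Intervention Name': ['Drug A, Drug B', 'Drug C'],
})

# Lower-case the names, then split on ', ' so each drug gets its own column
toy_split = toy_trials['Intervention Name'].str.lower().str.split(', ', expand=True)

# Keep the trial ID next to the drug columns so it can serve as a joining key later
toy_split.insert(0, 'NCT ID', toy_trials['NCT ID'])
print(toy_split)
```

Trials with fewer drugs than the widest row are padded with missing values, which is why the real output above is mostly empty on the right-hand side.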

In [None]:
drug_names_df = drug_names_df.map(lambda x: x.strip() if isinstance(x, str) else x)

# Replace strings that describe drug dosages with None
drug_names_df = drug_names_df.map(
    lambda x: None if isinstance(x, str) and re.fullmatch(
        r'\d+\s*(mg/ml|mg|mcg|μg/ml|µg/ml|µg|ug|μg|microg|microgram|micrograms|microgrammes|micrograms/kg|-mg|mg/kg|milligram|%)',
        x,
        flags=re.IGNORECASE
    ) else x
)

# Remove dosage patterns embedded in longer strings (e.g. "metformin 500 mg" becomes "metformin")
drug_names_df = drug_names_df.map(
    lambda x: re.sub(
        r'\b\d+\s*(mg/ml|mg|mcg|μg/ml|µg/ml|µg|ug|μg|microg|microgram|micrograms|microgrammes|micrograms/kg|-mg|mg/kg|milligram|%)\b',
        '',
        x,
        flags=re.IGNORECASE
    ).strip() if isinstance(x, str) else x
)

drug_names_df = drug_names_df.map(lambda x: re.sub(r'\d+\.?\d*\s*%', '', x).strip() if isinstance(x, str) else x)

I am only using unique drug names for analysis, so the names were cleaned to remove all descriptors for dosages (e.g. 500 mg).
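The two-stage dosage cleaning can be tried on single strings as sketched below. This is a simplified version under stated assumptions: the unit list is shortened for readability, and `clean_name` is a hypothetical helper that does not appear in the notebook.

```python
import re

# Shortened unit list for illustration; the notebook's pattern covers more units.
# Longer units are listed before their prefixes so that "mg/kg" wins over "mg".
UNITS = r'(mg/ml|mg/kg|mg|mcg|%)'
dose_only = re.compile(rf'\d+\s*{UNITS}', flags=re.IGNORECASE)
dose_inside = re.compile(rf'\b\d+\s*{UNITS}\b', flags=re.IGNORECASE)

def clean_name(name):
    """Return None for a pure dosage string, else the name with dosages removed."""
    if dose_only.fullmatch(name):
        return None
    return dose_inside.sub('', name).strip()

for raw in ['500 mg', 'metformin 500 mg', 'drug 5 mg/kg', 'aspirin']:
    print(repr(raw), '->', repr(clean_name(raw)))
```

The order of the alternatives matters when a trailing `\b` is used: if `mg` is tried before `mg/kg`, the match stops at the slash and a stray `/kg` is left behind in the name.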

In [5]:
drug_names_df = drug_names_df.apply(lambda row: row.where(~row.duplicated(keep = 'first')), axis=1)
drug_names_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56043 entries, 0 to 56042
Data columns (total 61 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   NCT ID  56043 non-null  object
 1   0       56038 non-null  object
 2   1       23164 non-null  object
 3   2       9736 non-null   object
 4   3       4530 non-null   object
 5   4       2369 non-null   object
 6   5       1365 non-null   object
 7   6       813 non-null    object
 8   7       487 non-null    object
 9   8       355 non-null    object
 10  9       238 non-null    object
 11  10      163 non-null    object
 12  11      134 non-null    object
 13  12      98 non-null     object
 14  13      66 non-null     object
 15  14      54 non-null     object
 16  15      35 non-null     object
 17  16      28 non-null     object
 18  17      19 non-null     object
 19  18      14 non-null     object
 20  19      8 non-null      object
 21  20      4 non-null      object
 22  21      1 non-null    

After cleaning, duplicate drug names within each trial (row) were removed, keeping only the first instance.
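The row-wise de-duplication can be checked on a small example. The sketch below uses made-up drug names and applies the same `where` / `duplicated` combination to each row.

```python
import pandas as pd

# Toy rows: the first trial lists 'aspirin' twice, the second lists 'metformin' twice
toy_row_df = pd.DataFrame([
    ['aspirin', 'placebo', 'aspirin'],
    ['metformin', 'metformin', None],
])

# Within each row, blank out any value that already appeared earlier in that row
deduped = toy_row_df.apply(
    lambda row: row.where(~row.duplicated(keep='first')), axis=1
)
print(deduped)
```

Repeated names become missing values rather than being shifted left, so the columns still contain gaps at this point.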

In [6]:
drug_names_df = drug_names_df.apply(lambda row: ', '.join(row.dropna().astype(str)), axis = 1)
drug_names_df

0        NCT00987766, erlotinib hydrochloride, gemcitab...
1                                      NCT02922166, srx246
2                             NCT06530966, icp-332 tablets
3                                NCT02367066, ar-c165395xx
4                                      NCT00033566, s-3304
                               ...                        
56038                    NCT01682187, ly2157299, lomustine
56039                  NCT04191187, fludarabine, melphalan
56040                              NCT01585987, ipilimumab
56041    NCT01887587, mln9708, vincristine, doxorubicin...
56042    NCT00755287, insulin glargine, metformin, tasp...
Length: 56043, dtype: object

The null values in each row were dropped and the remaining values were joined back into a single string per row, so that the gaps and empty columns left by the previous steps are removed.

In [7]:
drug_names_df = drug_names_df.str.split(', ', expand=True)
drug_names_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,45,46,47,48,49,50,51,52,53,54
0,NCT00987766,erlotinib hydrochloride,gemcitabine hydrochloride,oxaliplatin,,,,,,,...,,,,,,,,,,
1,NCT02922166,srx246,,,,,,,,,...,,,,,,,,,,
2,NCT06530966,icp-332 tablets,,,,,,,,,...,,,,,,,,,,
3,NCT02367066,ar-c165395xx,,,,,,,,,...,,,,,,,,,,
4,NCT00033566,s-3304,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56038,NCT01682187,ly2157299,lomustine,,,,,,,,...,,,,,,,,,,
56039,NCT04191187,fludarabine,melphalan,,,,,,,,...,,,,,,,,,,
56040,NCT01585987,ipilimumab,,,,,,,,,...,,,,,,,,,,
56041,NCT01887587,mln9708,vincristine,doxorubicin,dexamethasone,,,,,,...,,,,,,,,,,


The combined drug names were split back into separate columns again for further analysis. 
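The join-then-split round trip is easiest to see on a small example. The sketch below uses invented values; its only purpose is to show how the gaps left by removed names disappear.

```python
import pandas as pd

# Toy wide table with a gap in column 1 (values are illustrative only)
toy_wide = pd.DataFrame({
    'NCT ID': ['NCT00000001', 'NCT00000002'],
    0: ['aspirin', 'metformin'],
    1: [None, None],
    2: ['placebo', None],
})

# Step 1: drop missing values in each row and join what is left into one string
joined = toy_wide.apply(lambda row: ', '.join(row.dropna().astype(str)), axis=1)

# Step 2: split again, which packs the remaining names into the leftmost columns
compact = joined.str.split(', ', expand=True)
print(compact)
```

The result has three columns instead of four, and 'placebo' has moved next to 'aspirin'.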

In [8]:
drug_names_df.rename(columns = {0: 'NCT ID'}, inplace = True)
drug_names_df = drug_names_df.set_index('NCT ID').stack().reset_index()
drug_names_df = drug_names_df[['NCT ID', 0]]
drug_names_df.columns = ['NCT ID', 'Clinical Trial Drug Name']
drug_names_df

Unnamed: 0,NCT ID,Clinical Trial Drug Name
0,NCT00987766,erlotinib hydrochloride
1,NCT00987766,gemcitabine hydrochloride
2,NCT00987766,oxaliplatin
3,NCT02922166,srx246
4,NCT06530966,icp-332 tablets
...,...,...
99767,NCT01887587,doxorubicin
99768,NCT01887587,dexamethasone
99769,NCT00755287,insulin glargine
99770,NCT00755287,metformin


The entire dataframe was flattened into 2 columns, retaining the unique NCT ID for each drug name. <br>

Some clinical trials test multiple drugs, which is why the same NCT ID can appear in several rows.
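The same wide-to-long reshaping can also be written with `melt`. This is an alternative to the `stack` call used in the notebook, not the notebook's own method, and it is shown here on made-up data because it drops the missing values explicitly.

```python
import pandas as pd

# Toy wide table: one row per trial, one column per drug position
toy_compact = pd.DataFrame({
    'NCT ID': ['NCT00000001', 'NCT00000002'],
    1: ['aspirin', 'metformin'],
    2: ['placebo', None],
})

# Melt to one row per (trial, drug), drop empty slots, and restore trial order
long_df = (
    toy_compact
    .melt(id_vars='NCT ID', value_name='Clinical Trial Drug Name')
    .dropna(subset=['Clinical Trial Drug Name'])
    .drop(columns='variable')
    .sort_values('NCT ID', kind='stable')
    .reset_index(drop=True)
)
print(long_df)
```

The first trial contributes two rows and the second contributes one, mirroring the repeated NCT IDs in the output above.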

In [9]:
unique_drug_list = drug_names_df['Clinical Trial Drug Name'].drop_duplicates(keep = 'first').dropna().reset_index(drop = True)
unique_drug_list_df = pd.DataFrame(unique_drug_list[unique_drug_list != ''])
unique_drug_list_df

Unnamed: 0,Clinical Trial Drug Name
0,erlotinib hydrochloride
1,gemcitabine hydrochloride
2,oxaliplatin
3,srx246
4,icp-332 tablets
...,...
36571,bms-686117
36572,byetta
36573,rituximab-chop
36574,rituximab-cvp


Again, all duplicate drug names are removed, keeping only the first one. All rows with null values are also dropped.<br>

This drug name column is assigned to a new dataframe to be used as a unique drug list for matching with the drug names from DrugBank.com.

<h2>2. Conversion of XML data from DrugBank.com into Pandas Dataframe </h2>

The DrugBank data was downloaded as an XML file.<br>

 <b>This file will not be provided due to copyright reasons. </b>To access this file please request an academic license directly from DrugBank.com.

In [None]:
drugbank_ids = []
drugbank_names = []
drugbank_approval_status = []
drugbank_atc_name = []
drugbank_atc = []

tree = ET.parse('./DrugBank raw data/full database.xml')
root = tree.getroot()

ns = {'db': 'http://www.drugbank.ca'}

# Append the text of each nested element to its list; append None when an element is missing.
for drug in root.findall('db:drug', ns):
    drugbank_ids.append(drug.find('db:drugbank-id', ns).text)
    drugbank_names.append(drug.find('db:name', ns).text)

    # Only the first listed group is kept as the approval status
    group = drug.find('db:groups/db:group', ns)
    drugbank_approval_status.append(group.text if group is not None else None)

    # Only the first ATC code is used; its last <level> holds the top-level ATC class
    atc_code = drug.find('db:atc-codes/db:atc-code', ns)
    levels = atc_code.findall('db:level', ns) if atc_code is not None else []
    if levels:
        drugbank_atc_name.append(levels[-1].text)
        drugbank_atc.append(levels[-1].get('code'))
    else:
        drugbank_atc_name.append(None)
        drugbank_atc.append(None)


The XML data has a tree structure with a root element. Within the root element are multiple nested elements which hold text information pertaining to each drug. <br>

The drug names, ids, approval status, ATC names, and ATC codes were assigned to separate lists.
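Since the DrugBank file itself cannot be shared, the sketch below runs the same kind of namespace-aware lookups on a tiny hand-written XML string. The two records, their IDs, and their ATC values are invented; only the element names and the namespace URI are taken from the code above.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-record document that mimics the element names used in the notebook
toy_xml = """<drugbank xmlns="http://www.drugbank.ca">
  <drug>
    <drugbank-id>DB00000</drugbank-id>
    <name>Exampledrug</name>
    <groups><group>approved</group></groups>
    <atc-codes>
      <atc-code code="X01AA01">
        <level code="X01AA">Example subgroup</level>
        <level code="X">EXAMPLE SYSTEM</level>
      </atc-code>
    </atc-codes>
  </drug>
  <drug>
    <drugbank-id>DB99999</drugbank-id>
    <name>Otherdrug</name>
    <groups><group>investigational</group></groups>
    <atc-codes/>
  </drug>
</drugbank>"""

ns = {'db': 'http://www.drugbank.ca'}
toy_root = ET.fromstring(toy_xml)

records = []
for drug in toy_root.findall('db:drug', ns):
    # A slash-separated path finds a nested element in one call (None if absent)
    atc_code = drug.find('db:atc-codes/db:atc-code', ns)
    levels = atc_code.findall('db:level', ns) if atc_code is not None else []
    records.append({
        'Drugbank ID': drug.find('db:drugbank-id', ns).text,
        'Drug Name': drug.find('db:name', ns).text,
        'Approval Status': drug.find('db:groups/db:group', ns).text,
        'ATC Name': levels[-1].text if levels else None,
        'ATC Class': levels[-1].get('code') if levels else None,
    })

print(records)
```

Every lookup needs the `db:` prefix because the elements live in the default namespace declared on the root; without it `find` returns None.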

In [12]:
drugbank = {'Drugbank ID' : drugbank_ids, 'Drug Name' : drugbank_names, 'Approval Status' : drugbank_approval_status, 'ATC Name' : drugbank_atc_name, 'ATC Class' : drugbank_atc}
drugbank_df = pd.DataFrame(drugbank)
drugbank_df['Drug Name'] = drugbank_df['Drug Name'].str.lower()
drugbank_df

Unnamed: 0,Drugbank ID,Drug Name,Approval Status,ATC Name,ATC Class
0,DB00001,lepirudin,approved,BLOOD AND BLOOD FORMING ORGANS,B
1,DB00002,cetuximab,approved,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L
2,DB00003,dornase alfa,approved,RESPIRATORY SYSTEM,R
3,DB00004,denileukin diftitox,approved,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L
4,DB00005,etanercept,approved,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L
...,...,...,...,...,...
17425,DB19452,exidavnemab,investigational,,
17426,DB19453,imciromab pentetate,investigational,,
17427,DB19454,cetyl oleate,investigational,,
17428,DB19455,cetyl myristoleate,investigational,,


Combined all the lists into a single dictionary where the Keys are the column names, and the Values are the lists containing different types of drug data. <br>

Converted the dictionary into a pandas dataframe.

In [None]:
# Prepare list of drug names
drugbank_drug_name = drugbank_df['Drug Name'].tolist()

# Function to get best match from drugbank for a single drug from clinical trial drugs
def get_best_match(trial_drug_name, drugbank_drug_name, scorer = fuzz.token_sort_ratio, cutoff = 90):
  
    trial_drug_name_split = trial_drug_name.split()

    best_match = None
    best_score = 0

    for word in trial_drug_name_split:
        if word is None or not word.strip():
            continue
        
        match = process.extractOne(word, drugbank_drug_name, scorer = scorer)
        if match and match[1] > best_score and match[1] >= cutoff:
            best_match = match[0]
            best_score = match[1]

    if best_match:
        return best_match, best_score
    else:
        return None, None
    
# Apply fuzzy matching
unique_drug_list_df[['DrugBank Drug Name', 'Match Score']] = unique_drug_list_df['Clinical Trial Drug Name'].apply(lambda x: pd.Series(get_best_match(x, drugbank_drug_name)))

unique_drug_list_df

Unnamed: 0,Clinical Trial Drug Name,DrugBank Drug Name,Match Score
0,erlotinib hydrochloride,erlotinib,100.000000
1,gemcitabine hydrochloride,gemcitabine,100.000000
2,oxaliplatin,oxaliplatin,100.000000
3,srx246,srx-246,92.307692
4,icp-332 tablets,,
...,...,...,...
36571,bms-686117,,
36572,byetta,,
36573,rituximab-chop,,
36574,rituximab-cvp,,


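Fuzzy matching was used to find the best DrugBank match for each clinical trial drug name. Each word of the trial drug name is compared against the full list of DrugBank drug names, and the highest-scoring match is kept only if its score is 90 or above.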
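To make the match score less of a black box, the sketch below recomputes the score for `srx246` against `srx-246`. It assumes that the rapidfuzz ratio scorers are based on a normalized insertion/deletion (Indel) similarity, which for single-word inputs is 2 × LCS length divided by the combined length of both strings. The functions `lcs_length` and `indel_ratio` are pure-Python illustrations written for this example and are not part of rapidfuzz.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def indel_ratio(a, b):
    """Similarity on a 0-100 scale: 100 * 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * 2 * lcs_length(a, b) / total

# 'srx246' (6 chars) and 'srx-246' (7 chars) share a common subsequence of length 6
score = indel_ratio('srx246', 'srx-246')
print(round(score, 6))
```

Under this assumption the score is 100 × 12 / 13, which agrees with the 92.307692 reported for `srx246` in the table above. A single missing hyphen therefore still clears the cutoff of 90 for a name of this length, while shorter names are penalised more heavily by the same one-character difference.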

In [None]:
unique_drug_list_df2 = unique_drug_list_df.drop_duplicates(subset = 'DrugBank Drug Name', keep = 'first')
unique_drug_list_df2 = unique_drug_list_df2.dropna(subset = 'DrugBank Drug Name')

The list of unique clinical trial drug names was refined further after matching with the drug names from DrugBank. Duplicate DrugBank drug names were removed (keeping the first occurrence), and rows without a match were dropped.

<h2>3. Extraction of Data from ChEMBL by API Requests </h2>

In [None]:
# Code takes about 87 mins to run due to the API rate limit of roughly 1 s per request
# first_approval_date = []
# dev_phase = []

# for drug_name in unique_drug_list_df2['DrugBank Drug Name']:
#     response = requests.get(f"https://www.ebi.ac.uk/chembl/api/data/molecule/search?q={drug_name}&limit=1&format=json")
#     if response.json()['molecules']:
#         data = response.json()['molecules'][0]
#         first_approval_date.append(data['first_approval'])
#         dev_phase.append(data['max_phase'])
#     else:
#         first_approval_date.append(None)
#         dev_phase.append(None)

The data for first drug approval date and highest development phase were assigned to separate lists. <br>

The API endpoint at ChEMBL has a built-in delay of about 1 s per API request. As a result, the data extraction process takes approximately 87 mins to run for a list of 3829 drugs. <br>

To save time, the extracted data will be saved and exported into a csv file instead of repeating the API request process every time.
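The request loop can be split into two small helpers so that the parsing logic can be checked without calling the API. This is only a sketch: `build_search_url` and `parse_first_molecule` are hypothetical helper names, and the payload below is a made-up stand-in that contains only the two fields read by the notebook (`first_approval` and `max_phase`). A real response contains many more fields.

```python
import json
from urllib.parse import quote

# Endpoint taken from the commented-out loop above
CHEMBL_SEARCH = 'https://www.ebi.ac.uk/chembl/api/data/molecule/search'

def build_search_url(drug_name):
    """Build the search URL, percent-encoding spaces and special characters."""
    return f'{CHEMBL_SEARCH}?q={quote(drug_name)}&limit=1&format=json'

def parse_first_molecule(payload):
    """Return (first_approval, max_phase) of the first hit, or (None, None)."""
    molecules = payload.get('molecules') or []
    if not molecules:
        return None, None
    first = molecules[0]
    return first.get('first_approval'), first.get('max_phase')

# Made-up payload for illustration; no network request is made here
fake_payload = json.loads('{"molecules": [{"first_approval": 2004, "max_phase": 4}]}')

print(build_search_url('insulin glargine'))
print(parse_first_molecule(fake_payload))
print(parse_first_molecule({'molecules': []}))
```

Using `.get` instead of direct indexing means that a response without the expected keys yields None values rather than stopping a long-running loop part-way through.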

In [None]:
# unique_drug_list_df2['First Approval Date'] = first_approval_date
# unique_drug_list_df2['Highest Development Phase'] = dev_phase
# unique_drug_list_df2

Unnamed: 0,Clinical Trial Drug Name,DrugBank Drug Name,Match Score,First Approval Date,Highest Development Phase
0,erlotinib hydrochloride,erlotinib,100.000000,2004.0,4.0
1,gemcitabine hydrochloride,gemcitabine,100.000000,,2.0
2,oxaliplatin,oxaliplatin,100.000000,2002.0,4.0
3,srx246,srx-246,92.307692,,
6,s-3304,s-3304,100.000000,,
...,...,...,...,...,...
36466,ea-2353,ea-2353,100.000000,,
36468,prototype (an0128 toothpaste),an0128,92.307692,,
36481,ino-3107,ino-3107,100.000000,,
36546,benznidazole,benznidazole,100.000000,2017.0,4.0


In [None]:
# unique_drug_list_df2.to_csv('Clinical Trial Drug List.csv')

The data extracted from ChEMBL was combined with the list of unique clinical trial drug names and exported.

In [23]:
unique_drug_list_df2 = pd.read_csv('Clinical Trial Drug List.csv')

<h2>4. Combining of Data from ClinicalTrials.gov, DrugBank.com, and ChEMBL </h2>

In [24]:
merged_df = pd.merge(drug_names_df, unique_drug_list_df2, on = 'Clinical Trial Drug Name', how = 'left')
merged_df

Unnamed: 0.1,NCT ID,Clinical Trial Drug Name,Unnamed: 0,DrugBank Drug Name,Match Score,First Approval Date,Highest Development Phase
0,NCT00987766,erlotinib hydrochloride,0.0,erlotinib,100.000000,2004.0,4.0
1,NCT00987766,gemcitabine hydrochloride,1.0,gemcitabine,100.000000,,2.0
2,NCT00987766,oxaliplatin,2.0,oxaliplatin,100.000000,2002.0,4.0
3,NCT02922166,srx246,3.0,srx-246,92.307692,,
4,NCT06530966,icp-332 tablets,,,,,
...,...,...,...,...,...,...,...
99767,NCT01887587,doxorubicin,61.0,doxorubicin,100.000000,,4.0
99768,NCT01887587,dexamethasone,97.0,dexamethasone,100.000000,,2.0
99769,NCT00755287,insulin glargine,,,,,
99770,NCT00755287,metformin,398.0,metformin,100.000000,1995.0,4.0


The data from ChEMBL was merged with the data from ClinicalTrials.gov to combine drug data with the unique trial ID.
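A left merge keeps every trial-drug row even when no match exists, which is why some rows above are empty on the right. As a rough sketch on made-up data, the `indicator` and `validate` options of `merge` can be used to count those unmatched rows and to confirm that the lookup table has one row per drug name.

```python
import pandas as pd

# Toy long table (one row per trial-drug pair) and toy lookup table
toy_long = pd.DataFrame({
    'NCT ID': ['NCT00000001', 'NCT00000001', 'NCT00000002'],
    'Clinical Trial Drug Name': ['aspirin', 'unknown-123', 'aspirin'],
})
toy_lookup = pd.DataFrame({
    'Clinical Trial Drug Name': ['aspirin'],
    'DrugBank Drug Name': ['acetylsalicylic acid'],
})

# indicator adds a '_merge' column; validate raises if the lookup has duplicate keys
toy_merged = toy_long.merge(
    toy_lookup,
    on='Clinical Trial Drug Name',
    how='left',
    indicator=True,
    validate='many_to_one',
)

unmatched = int((toy_merged['_merge'] == 'left_only').sum())
print(toy_merged)
print('rows without a match:', unmatched)
```

The row count stays at three because the merge is many-to-one, and the single `left_only` row is the drug name that has no entry in the lookup table.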

In [25]:
merged_df2 = pd.merge(merged_df, drugbank_df, left_on = 'DrugBank Drug Name', right_on = 'Drug Name', how = 'left')
merged_df2

Unnamed: 0.1,NCT ID,Clinical Trial Drug Name,Unnamed: 0,DrugBank Drug Name,Match Score,First Approval Date,Highest Development Phase,Drugbank ID,Drug Name,Approval Status,ATC Name,ATC Class
0,NCT00987766,erlotinib hydrochloride,0.0,erlotinib,100.000000,2004.0,4.0,DB00530,erlotinib,approved,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L
1,NCT00987766,gemcitabine hydrochloride,1.0,gemcitabine,100.000000,,2.0,DB00441,gemcitabine,approved,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L
2,NCT00987766,oxaliplatin,2.0,oxaliplatin,100.000000,2002.0,4.0,DB00526,oxaliplatin,approved,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L
3,NCT02922166,srx246,3.0,srx-246,92.307692,,,DB16968,srx-246,investigational,,
4,NCT06530966,icp-332 tablets,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
99767,NCT01887587,doxorubicin,61.0,doxorubicin,100.000000,,4.0,DB00997,doxorubicin,approved,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L
99768,NCT01887587,dexamethasone,97.0,dexamethasone,100.000000,,2.0,DB01234,dexamethasone,approved,RESPIRATORY SYSTEM,R
99769,NCT00755287,insulin glargine,,,,,,,,,,
99770,NCT00755287,metformin,398.0,metformin,100.000000,1995.0,4.0,DB00331,metformin,approved,ALIMENTARY TRACT AND METABOLISM,A


The new dataframe was further merged with the data from DrugBank.com.

In [31]:
trial_drugbank_df = pd.merge(clinicaltrial_df, merged_df2, on = 'NCT ID', how = 'left')
trial_drugbank_df = trial_drugbank_df.drop(['Unnamed: 0_x', 'Unnamed: 0_y', 'Intervention Type', 'Drug Name', 'Match Score', 'Drugbank ID'], axis = 1)
trial_drugbank_df

Unnamed: 0,NCT ID,Trial Status,Last Known Trial Status,Phase,Start Date,Completion Date,Trial Duration (Days),Trial Location,Sponsor Name,Sponsor Type,...,Gender,Min Age,Max Age,Clinical Trial Drug Name,DrugBank Drug Name,First Approval Date,Highest Development Phase,Approval Status,ATC Name,ATC Class
0,NCT00987766,COMPLETED,,PHASE1,2009-11,2016-10,2526,United States,Vanderbilt-Ingram Cancer Center,OTHER,...,ALL,18 Years,,erlotinib hydrochloride,erlotinib,2004.0,4.0,approved,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L
1,NCT00987766,COMPLETED,,PHASE1,2009-11,2016-10,2526,United States,Vanderbilt-Ingram Cancer Center,OTHER,...,ALL,18 Years,,gemcitabine hydrochloride,gemcitabine,,2.0,approved,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L
2,NCT00987766,COMPLETED,,PHASE1,2009-11,2016-10,2526,United States,Vanderbilt-Ingram Cancer Center,OTHER,...,ALL,18 Years,,oxaliplatin,oxaliplatin,2002.0,4.0,approved,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L
3,NCT02922166,UNKNOWN,ACTIVE_NOT_RECRUITING,PHASE1,2017-02,2019-12,1033,United States,Azevan Pharmaceuticals,INDUSTRY,...,ALL,21 Years,50 Years,srx246,srx-246,,,investigational,,
4,NCT06530966,RECRUITING,,PHASE1,2024-07,2024-12,153,United States,InnoCare Pharma Inc.,INDUSTRY,...,ALL,18 Years,55 Years,icp-332 tablets,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99769,NCT01887587,TERMINATED,,PHASE1,2013-06,2016-02,975,United States,Ehab L Atallah,OTHER,...,ALL,18 Years,,doxorubicin,doxorubicin,,4.0,approved,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L
99770,NCT01887587,TERMINATED,,PHASE1,2013-06,2016-02,975,United States,Ehab L Atallah,OTHER,...,ALL,18 Years,,dexamethasone,dexamethasone,,2.0,approved,RESPIRATORY SYSTEM,R
99771,NCT00755287,COMPLETED,,PHASE3,2008-11,2010-12,760,"United States, Australia, Austria, Belgium, Br...",Hoffmann-La Roche,INDUSTRY,...,ALL,18 Years,75 Years,insulin glargine,,,,,,
99772,NCT00755287,COMPLETED,,PHASE3,2008-11,2010-12,760,"United States, Australia, Austria, Belgium, Br...",Hoffmann-La Roche,INDUSTRY,...,ALL,18 Years,75 Years,metformin,metformin,1995.0,4.0,approved,ALIMENTARY TRACT AND METABOLISM,A


The combined dataframe from ChEMBL and DrugBank.com was merged with the rest of the data from ClinicalTrials.gov using trial ID as the joining key.

In [32]:
trial_drugbank_df['Clinical Trial Drug Name'] = trial_drugbank_df['DrugBank Drug Name'].where(trial_drugbank_df['DrugBank Drug Name'].notna() 
                                                 & (trial_drugbank_df['DrugBank Drug Name'].str.strip() != ''), trial_drugbank_df['Clinical Trial Drug Name'])
trial_drugbank_df = trial_drugbank_df.drop(['DrugBank Drug Name'], axis = 1)
trial_drugbank_df

Unnamed: 0,NCT ID,Trial Status,Last Known Trial Status,Phase,Start Date,Completion Date,Trial Duration (Days),Trial Location,Sponsor Name,Sponsor Type,...,Healthy Participants,Gender,Min Age,Max Age,Clinical Trial Drug Name,First Approval Date,Highest Development Phase,Approval Status,ATC Name,ATC Class
0,NCT00987766,COMPLETED,,PHASE1,2009-11,2016-10,2526,United States,Vanderbilt-Ingram Cancer Center,OTHER,...,False,ALL,18 Years,,erlotinib,2004.0,4.0,approved,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L
1,NCT00987766,COMPLETED,,PHASE1,2009-11,2016-10,2526,United States,Vanderbilt-Ingram Cancer Center,OTHER,...,False,ALL,18 Years,,gemcitabine,,2.0,approved,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L
2,NCT00987766,COMPLETED,,PHASE1,2009-11,2016-10,2526,United States,Vanderbilt-Ingram Cancer Center,OTHER,...,False,ALL,18 Years,,oxaliplatin,2002.0,4.0,approved,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L
3,NCT02922166,UNKNOWN,ACTIVE_NOT_RECRUITING,PHASE1,2017-02,2019-12,1033,United States,Azevan Pharmaceuticals,INDUSTRY,...,True,ALL,21 Years,50 Years,srx-246,,,investigational,,
4,NCT06530966,RECRUITING,,PHASE1,2024-07,2024-12,153,United States,InnoCare Pharma Inc.,INDUSTRY,...,True,ALL,18 Years,55 Years,icp-332 tablets,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99769,NCT01887587,TERMINATED,,PHASE1,2013-06,2016-02,975,United States,Ehab L Atallah,OTHER,...,False,ALL,18 Years,,doxorubicin,,4.0,approved,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,L
99770,NCT01887587,TERMINATED,,PHASE1,2013-06,2016-02,975,United States,Ehab L Atallah,OTHER,...,False,ALL,18 Years,,dexamethasone,,2.0,approved,RESPIRATORY SYSTEM,R
99771,NCT00755287,COMPLETED,,PHASE3,2008-11,2010-12,760,"United States, Australia, Austria, Belgium, Br...",Hoffmann-La Roche,INDUSTRY,...,False,ALL,18 Years,75 Years,insulin glargine,,,,,
99772,NCT00755287,COMPLETED,,PHASE3,2008-11,2010-12,760,"United States, Australia, Austria, Belgium, Br...",Hoffmann-La Roche,INDUSTRY,...,False,ALL,18 Years,75 Years,metformin,1995.0,4.0,approved,ALIMENTARY TRACT AND METABOLISM,A


The clinical trial drug names were assigned the same name as the drugbank drug name, except when a null value is present in the drugbank drug name column. <br>

After doing so, the drugbank drug name column was dropped.
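The fallback logic can be checked on three toy rows: one with a DrugBank match, one with a missing match, and one where the match is only whitespace. The names below are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'Clinical Trial Drug Name': ['erlotinib hydrochloride', 'icp-332 tablets', 'drug x'],
    'DrugBank Drug Name': ['erlotinib', None, '  '],
})

# True only where a non-empty DrugBank name exists
has_match = toy['DrugBank Drug Name'].notna() & (toy['DrugBank Drug Name'].str.strip() != '')

# Keep the DrugBank name where available, otherwise fall back to the trial name
toy['Clinical Trial Drug Name'] = toy['DrugBank Drug Name'].where(
    has_match, toy['Clinical Trial Drug Name']
)
toy = toy.drop(columns='DrugBank Drug Name')
print(toy)
```

Only the first row is renamed; the other two keep their original clinical trial names.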

In [33]:
trial_drugbank_df.to_csv('Trial and Drugbank Data.csv', index = False)

The combined data from ClinicalTrials.gov, DrugBank.com and ChEMBL is exported for further processing.