Hello everyone,

This project is called "Machine learning for disambiguation of clinical trial scientist names".

We will use the concept of a "namespace": a namespace comprises all the scientists who share the same last name and the same first-name initial.
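As a minimal sketch of this idea, the snippet below builds a namespace key from a name. The helper `namespace_key` and the lowercase "last name, underscore, first initial" format are assumptions made for illustration; they are not part of this notebook.

```python
# Hypothetical helper (not from the notebook): builds a namespace key from a name
def namespace_key(first_name, last_name):
    """Return '<last name>_<first initial>' in lowercase, e.g. 'smith_j'."""
    first = first_name.strip()
    last = last_name.strip()
    if not first or not last:
        raise ValueError("both first and last name are required")
    return f"{last.lower()}_{first[0].lower()}"

# Different first names can fall into the same namespace
print(namespace_key("John", "Smith"))    # smith_j
print(namespace_key("Jane", "Smith"))    # smith_j
print(namespace_key("Anna", "Smith"))    # smith_a
```

John Smith and Jane Smith share a namespace even though they are different people, which is exactly the ambiguity this project tries to resolve.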

We will pair clinical trials run by scientists whose names belong to a given namespace with research done by scientists whose names belong to the same namespace.

After pairing them, we will assess whether the two records refer to the same scientist or not.
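The pairing step could look roughly like the sketch below. The record layout (dictionaries with `id`, `first_name` and `last_name`), the function names and the sample records are all made up for illustration; deciding whether a pair is really the same scientist is the machine learning part and is not covered here.

```python
from collections import defaultdict
from itertools import product

# Hypothetical record layout: {"id": ..., "first_name": ..., "last_name": ...}
def namespace_of(record):
    """Namespace = (last name, first-name initial), both lowercase."""
    return (record["last_name"].strip().lower(),
            record["first_name"].strip()[0].lower())

def group_by_namespace(records):
    groups = defaultdict(list)
    for record in records:
        groups[namespace_of(record)].append(record)
    return groups

def candidate_pairs(trials, papers):
    """Pair every trial with every paper that falls in the same namespace."""
    trial_groups = group_by_namespace(trials)
    paper_groups = group_by_namespace(papers)
    pairs = []
    for namespace, namespace_trials in trial_groups.items():
        for trial, paper in product(namespace_trials, paper_groups.get(namespace, [])):
            pairs.append((trial["id"], paper["id"]))
    return pairs

# Made-up sample records
trials = [{"id": "T1", "first_name": "John", "last_name": "Smith"},
          {"id": "T2", "first_name": "Anna", "last_name": "Rossi"}]
papers = [{"id": "P1", "first_name": "Jane", "last_name": "Smith"},
          {"id": "P2", "first_name": "J.", "last_name": "Smith"},
          {"id": "P3", "first_name": "Marco", "last_name": "Rossi"}]

print(candidate_pairs(trials, papers))    # [('T1', 'P1'), ('T1', 'P2')]
```

Only records in the same namespace are paired, so the number of candidate pairs stays much smaller than pairing every trial with every paper.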

First things first: to get this Jupyter notebook working, you need to download all the clinical trials from this link: https://clinicaltrials.gov/AllPublicXML.zip

After that, extract all the files into a folder called "AllPublicXML".
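If you prefer to extract the archive from Python, a sketch like the following could work. It assumes the downloaded file is saved as "AllPublicXML.zip" next to this notebook and that the archive contains the trial folders at its top level; the function name `extract_trials` is hypothetical.

```python
import os
import zipfile

# Hypothetical helper: extracts the downloaded archive into the target folder
def extract_trials(zip_path="AllPublicXML.zip", target_folder="AllPublicXML"):
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_folder)    # creates target_folder if it does not exist
    return target_folder

# Only extract if the archive has actually been downloaded
if os.path.exists("AllPublicXML.zip"):
    extract_trials()
```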

Check that the "AllPublicXML" folder contains many subfolders named like "NCT0000xxxx" (or with other digits). If so, fantastic, you will be able to run this notebook.
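This check can also be done in code. The sketch below counts the subfolders whose names look like "NCT" followed by four digits and a literal "xxxx"; the function name `count_trial_folders` and the exact pattern are assumptions based on the folder names described above.

```python
import os
import re

# Assumed folder name format: "NCT", four digits, then a literal "xxxx" (e.g. NCT0027xxxx)
FOLDER_PATTERN = re.compile(r"^NCT\d{4}xxxx$")

# Hypothetical helper: counts the trial folders found under base_folder
def count_trial_folders(base_folder="AllPublicXML"):
    if not os.path.isdir(base_folder):
        return 0    # the base folder itself is missing
    return sum(
        1 for entry in os.listdir(base_folder)
        if FOLDER_PATTERN.match(entry)
        and os.path.isdir(os.path.join(base_folder, entry))
    )

print("Trial folders found:", count_trial_folders())
```

A result of 0 means the archive was extracted somewhere else or the notebook is running from a different working directory.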

The next cell contains a small test. If you have configured everything correctly, you will see the content of an XML file. We used 'NCT00270075' as an example; feel free to try others if you want (remember, not all numbers are valid IDs).

In [1]:
import xml.dom.minidom as xml_dom    # we need this library in order to extract the dom from an xml document

import os    # used to build the file path in a way that works on every operating system

# This function returns the location of a file, given the clinical trial ID or the file name
def get_file_location(name):
    inner_folder = name[:7] + 'xxxx'   # name of the inner folder (e.g. NCT0027xxxx for NCT00270075)
    # If the input was the ID, we add .xml because the file name is identical to the ID
    file_name = name if name.endswith('.xml') else name + '.xml'
    return os.path.join('AllPublicXML', inner_folder, file_name)
    
# This function returns the DOM of a file as a string, given the file name or the ID of the clinical trial
def xml_doc_string(name):
    location = get_file_location(name)    # here we set the location of the file
    try:
        dom = xml_dom.parse(location)    # we parse the file using xml_dom
    except OSError:
        return 'file not found'   # the file is missing or cannot be read, so we return this string instead
    return dom.toprettyxml()    # return the DOM as a string

# Let's test it: if it works, you have set up the folders correctly
print(xml_doc_string('NCT00270075'))

<?xml version="1.0" ?>
<clinical_study>
	
  
	<!-- This xml conforms to an XML Schema at:
    https://clinicaltrials.gov/ct2/html/images/info/public.xsd -->
	
  
	<required_header>
		
    
		<download_date>ClinicalTrials.gov processed this data on May 29, 2019</download_date>
		
    
		<link_text>Link to the current ClinicalTrials.gov record.</link_text>
		
    
		<url>https://clinicaltrials.gov/show/NCT00270075</url>
		
  
	</required_header>
	
  
	<id_info>
		
    
		<org_study_id>CR005896</org_study_id>
		
    
		<nct_id>NCT00270075</nct_id>
		
  
	</id_info>
	
  
	<brief_title>A Study to Determine the Safety and Effectiveness of Epoetin Alfa in Facilitating Self-donation of Blood Before Surgery in Patients Who Are Not Anemic and Who Will be Undergoing Orthopedic or Heart and Blood Vessel Surgery</brief_title>
	
  
	<official_title>Recombinant Human Erythropoietin (R-HuEPO) in Non-Anemic Patients Scheduled for Orthopedic or Cardiovascular Surgery, to Facilitate Presurgical Autologou

Now that we have the folders in the right place, we proceed to create functions that will be able to find a specific field in the XML file.

In [2]:
# This function takes the ID/file name of the XML file and the name of the node to search and returns
# all the nodes with the specified name
def get_xml_node(ID, node_name):
    location = get_file_location(ID)
    try:
        dom = xml_dom.parse(location)
    except OSError:    # the file is missing or cannot be read
        return 'file not found'
    nodes = dom.getElementsByTagName(node_name)
    return nodes
    
node = get_xml_node('NCT00270075', 'textblock')
node[0].firstChild.nodeValue    # display the content of the first node found

'\n      The purpose of this study is to determine whether epoetin alfa will enable self-donation of\n      at least 4 units of blood during the 2-week period before surgery (which is a shorter period\n      of time than the conventional 3-week blood donation period before surgery) in patients who\n      are not anemic and who will be undergoing orthopedic or heart and blood vessel surgery.\n      Epoetin alfa is a genetically engineered protein that stimulates red blood cell production.\n    '

In [5]:
import xml.etree.ElementTree as ET

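# This recursive function walks down the hierarchy and returns the matching node
# (the hierarchy must not include the root 'clinical_study' tag)
def recursive_node_search(root, hierarchy):
    for child in root:
        if child.tag == hierarchy[0]:
            if len(hierarchy) == 1:
                return child
            found = recursive_node_search(child, hierarchy[1:])
            if found is not None:
                return found    # otherwise keep looking in the remaining siblings with the same tag
    return None    # no node matches the requested hierarchy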

# This function takes the ID/file name of the XML file and the hierarchy of the node to search and returns
# the node with the specified name
def get_xml_node_hierarchy(ID, node_hierarchy):
    location = get_file_location(ID)
    try:
        dom = ET.parse(location)
    except OSError:    # the file is missing or cannot be read
        return 'file not found'
    node = recursive_node_search(dom.getroot(),node_hierarchy)
    return node
    
node = get_xml_node_hierarchy('NCT00270075', ['sponsors','lead_sponsor','agency'])
print(node.text)
# If the node doesn't exist, None is returned
node = get_xml_node_hierarchy('NCT00270075', ['sponsors','lead_sponsor','non_existent_node'])
print("Value of node: ", node)

Johnson & Johnson Pharmaceutical Research & Development, L.L.C.
Value of node:  None


In [6]:
# This function returns the parsed XML tree, so that we can do multiple things with it
def get_xml_dom(ID):
    location = get_file_location(ID)
    try:
        return ET.parse(location)
    except OSError:    # the file is missing or cannot be read
        return 'file not found'
    
root = get_xml_dom('NCT00270075').getroot()
print(recursive_node_search(root,['sponsors','lead_sponsor','agency']).text)
print(recursive_node_search(root,['official_title']).text)

Johnson & Johnson Pharmaceutical Research & Development, L.L.C.
Recombinant Human Erythropoietin (R-HuEPO) in Non-Anemic Patients Scheduled for Orthopedic or Cardiovascular Surgery, to Facilitate Presurgical Autologous Blood Donation (A Double-blind, Randomized, Dose Finding Study)
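As an alternative to the recursive search, ElementTree can follow a slash-separated path by itself through `Element.find`. The sketch below shows this on a small made-up XML string so that it runs without the downloaded files; the helper name `find_by_hierarchy` and the sample content are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Made-up sample with the same layout as the fields used above
sample = """<clinical_study>
  <official_title>Example title</official_title>
  <sponsors>
    <lead_sponsor>
      <agency>Example Agency</agency>
    </lead_sponsor>
  </sponsors>
</clinical_study>"""

root = ET.fromstring(sample)

# Hypothetical helper: joins the hierarchy into a path and lets ElementTree search it
def find_by_hierarchy(root, hierarchy):
    return root.find('/'.join(hierarchy))    # returns the first match, or None if there is none

agency = find_by_hierarchy(root, ['sponsors', 'lead_sponsor', 'agency'])
print(agency.text)    # Example Agency

missing = find_by_hierarchy(root, ['sponsors', 'lead_sponsor', 'non_existent_node'])
print(missing)        # None
```

Like `recursive_node_search`, the path is relative to the root, so the 'clinical_study' tag is not part of it.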
