Hello everyone,

This project is called "Machine learning for disambiguation of clinical trial scientist names".

We will use the concept of a "namespace": a namespace comprises all the scientists who share the same last name and the same first-name initial.

We will pair clinical trials run by scientists whose names belong to a given namespace with research articles written by scientists whose names belong to the same namespace.

After pairing them, we will assess whether the scientist behind the trial and the scientist behind the article are the same person.
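
To make the namespace idea concrete, here is a minimal sketch of how records could be grouped and paired. The helper names `namespace_key` and `pair_by_namespace`, the tuple layout, and the sample records are all made up for illustration; the real data is loaded later in the notebook.

```python
from collections import defaultdict

def namespace_key(last_name, first_name):
    """Return the namespace: lower-cased last name plus first-name initial."""
    return (last_name.strip().lower(), first_name.strip()[:1].lower())

def pair_by_namespace(trial_people, article_authors):
    """Pair every trial with every article whose scientist shares a namespace.

    Both arguments are lists of (record_id, last_name, first_name) tuples.
    """
    articles_by_ns = defaultdict(list)
    for pmid, last, first in article_authors:
        articles_by_ns[namespace_key(last, first)].append(pmid)

    pairs = []
    for nct_id, last, first in trial_people:
        for pmid in articles_by_ns.get(namespace_key(last, first), []):
            pairs.append((nct_id, pmid))
    return pairs

# Made-up records, for illustration only
trials = [('NCT-A', 'Smith', 'John'), ('NCT-B', 'Rossi', 'Maria')]
articles = [('PMID-1', 'smith', 'J'), ('PMID-2', 'Smith', 'Anna'), ('PMID-3', 'Rossi', 'M')]
print(pair_by_namespace(trials, articles))
```

Each resulting pair is only a candidate: sharing a namespace does not mean the two records belong to the same person, which is exactly what the classifier has to decide.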

First things first: to get this Jupyter notebook working, you will need to download all the clinical trials from this link: https://clinicaltrials.gov/AllPublicXML.zip

After that, extract all the files into a folder called "AllPublicXML".

Check that the "AllPublicXML" folder contains many subfolders named like "NCT0000xxxx" (the digits vary). If so, fantastic: you will be able to run this notebook.
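
If you prefer to check the layout programmatically, the sketch below counts the subfolders that look like "NCT0000xxxx". The function name `count_trial_folders` is hypothetical, and the pattern (the letters NCT, four digits, then a literal "xxxx") is an assumption based on the folder names described above.

```python
import re
from pathlib import Path

# Folder names such as NCT0000xxxx: "NCT", four digits, then a literal "xxxx"
FOLDER_PATTERN = re.compile(r'^NCT\d{4}xxxx$')

def count_trial_folders(base_dir='AllPublicXML'):
    """Return how many subfolders of base_dir look like NCT0000xxxx."""
    base = Path(base_dir)
    if not base.is_dir():
        return 0
    return sum(1 for p in base.iterdir() if p.is_dir() and FOLDER_PATTERN.match(p.name))

print(count_trial_folders())
```

A result of 0 suggests that the archive was extracted somewhere else or that the notebook is running from a different working directory.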

The next cell contains a small test. If you have configured everything correctly, you will see the content of an XML file. We used 'NCT00270075' as an example; feel free to try others if you want (remember, not every number is a valid ID).

In [1]:
import os
import xml.dom.minidom as xml_dom    # we need this library to extract the DOM from an XML document

# This function returns the location of a file, given the clinical trial ID or the file name
def get_file_location(name):
    inner_folder = name[:7] + 'xxxx'   # name of the inner folder (e.g. NCT0000xxxx)
    if not name.endswith('.xml'):
        name += '.xml'    # if the input was the ID, add .xml because the file name is identical to the ID
    # os.path.join picks the right separator, so this also works outside Windows
    return os.path.join('AllPublicXML', inner_folder, name)

# This function returns the DOM of a file as a string, given the file name or the ID of the clinical trial
def xml_doc_string(name):
    location = get_file_location(name)    # here we set the location of the file
    try:
        dom = xml_dom.parse(location)    # parse the file using xml_dom
    except OSError:
        return 'file not found'   # if the file cannot be opened, return this string instead
    return dom.toprettyxml()    # return the DOM as a pretty-printed string

# Let's test it: if it works, you have set up the folders correctly
print(xml_doc_string('NCT00270075'))

<?xml version="1.0" ?>
<clinical_study>
	
  
	<!-- This xml conforms to an XML Schema at:
    https://clinicaltrials.gov/ct2/html/images/info/public.xsd -->
	
  
	<required_header>
		
    
		<download_date>ClinicalTrials.gov processed this data on May 29, 2019</download_date>
		
    
		<link_text>Link to the current ClinicalTrials.gov record.</link_text>
		
    
		<url>https://clinicaltrials.gov/show/NCT00270075</url>
		
  
	</required_header>
	
  
	<id_info>
		
    
		<org_study_id>CR005896</org_study_id>
		
    
		<nct_id>NCT00270075</nct_id>
		
  
	</id_info>
	
  
	<brief_title>A Study to Determine the Safety and Effectiveness of Epoetin Alfa in Facilitating Self-donation of Blood Before Surgery in Patients Who Are Not Anemic and Who Will be Undergoing Orthopedic or Heart and Blood Vessel Surgery</brief_title>
	
  
	<official_title>Recombinant Human Erythropoietin (R-HuEPO) in Non-Anemic Patients Scheduled for Orthopedic or Cardiovascular Surgery, to Facilitate Presurgical Autologou

Now that we have the folders in the right place, we proceed to create functions that will be able to find a specific field in the XML file.

In [2]:
# This function takes the ID/file name of the XML file and the name of the node to search and returns
# all the nodes with the specified name
def get_xml_node(ID, node_name):
    location = get_file_location(ID)
    try:
        dom = xml_dom.parse(location)
    except OSError:
        return 'file not found'    # the file could not be opened
    nodes = dom.getElementsByTagName(node_name)
    return nodes

node = get_xml_node('NCT00270075', 'textblock')
node[0].firstChild.nodeValue    # show the content of the first node found

'\n      The purpose of this study is to determine whether epoetin alfa will enable self-donation of\n      at least 4 units of blood during the 2-week period before surgery (which is a shorter period\n      of time than the conventional 3-week blood donation period before surgery) in patients who\n      are not anemic and who will be undergoing orthopedic or heart and blood vessel surgery.\n      Epoetin alfa is a genetically engineered protein that stimulates red blood cell production.\n    '

In [3]:
import xml.etree.ElementTree as ET

# This recursive function walks down the hierarchy and returns the node
# (the root tag, e.g. 'clinical_study', is not part of the hierarchy)
def recursive_node_search(root, hierarchy):
    for child in root:
        if child.tag == hierarchy[0]:
            if len(hierarchy) == 1:
                return child
            # keep looking in later siblings if this branch does not contain the rest of the hierarchy
            found = recursive_node_search(child, hierarchy[1:])
            if found is not None:
                return found
    return None    # no node matches the hierarchy

# This function takes the ID/file name of the XML file and the hierarchy of the node to search and returns
# the node with the specified name
def get_xml_node_hierarchy(ID, node_hierarchy):
    location = get_file_location(ID)
    try:
        dom = ET.parse(location)
    except OSError:
        return 'file not found'
    node = recursive_node_search(dom.getroot(), node_hierarchy)
    return node
    
node = get_xml_node_hierarchy('NCT00270075', ['sponsors','lead_sponsor','agency'])
print(node.text)
# If the node doesn't exist, a None is returned
node = get_xml_node_hierarchy('NCT00270075', ['sponsors','lead_sponsor','non_existent_node'])
print("Value of node: ", node)

Johnson & Johnson Pharmaceutical Research & Development, L.L.C.
Value of node:  None


In [4]:
# This function returns the parsed document, so we can run several searches on it without parsing the file again
def get_xml_dom(ID):
    location = get_file_location(ID)
    try:
        return ET.parse(location)
    except OSError:
        return 'file not found'    # the file could not be opened
    
root = get_xml_dom('NCT00270075').getroot()
print(recursive_node_search(root,['sponsors','lead_sponsor','agency']).text)
print(recursive_node_search(root,['official_title']).text)

Johnson & Johnson Pharmaceutical Research & Development, L.L.C.
Recombinant Human Erythropoietin (R-HuEPO) in Non-Anemic Patients Scheduled for Orthopedic or Cardiovascular Surgery, to Facilitate Presurgical Autologous Blood Donation (A Double-blind, Randomized, Dose Finding Study)
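
As a side note, ElementTree also accepts a slash-separated path in `find`, which gives the same kind of hierarchy lookup without a hand-written recursion. The sketch below runs on a tiny made-up document so that it works without the downloaded files; `find_by_hierarchy` is a hypothetical helper name, not part of the notebook's code.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in document that mimics the structure of a clinical trial file
sample = """
<clinical_study>
  <sponsors>
    <lead_sponsor>
      <agency>Example Agency</agency>
    </lead_sponsor>
  </sponsors>
</clinical_study>
"""

root = ET.fromstring(sample)

def find_by_hierarchy(root, hierarchy):
    """Look up a node by joining the hierarchy into an ElementTree path."""
    return root.find('/'.join(hierarchy))

node = find_by_hierarchy(root, ['sponsors', 'lead_sponsor', 'agency'])
print(node.text)
# As with recursive_node_search, a missing node gives None
print(find_by_hierarchy(root, ['sponsors', 'lead_sponsor', 'non_existent_node']))
```

Both approaches return the first matching node, so which one to use is mostly a matter of taste.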


Now that we can extract information from clinical trials, we will need to do the same with PubMed.

We will use code from OntoGene that has already been created for this project.

The repository is here: https://github.com/OntoGene/OGER

To install it with pip, run this command: pip install git+https://github.com/OntoGene/OGER.git


In [5]:
# Get config and start a pipeline server
from oger.ctrl.router import Router, PipelineServer
conf = Router(termlist_path='oger/test/testfiles/test_terms.tsv')
pl = PipelineServer(conf)

In [6]:
# Download an article from PubMed (OntoGene version, which contains only the title and abstract)
art = pl.load_one(['21436587'], fmt='pubmed')[0]
print(art[0].text)
print(art[1]._text)

Human prostate cancer metastases target the hematopoietic stem cell niche to establish footholds in mouse bone marrow.

HSC homing, quiescence, and self-renewal depend on the bone marrow HSC niche. A large proportion of solid tumor metastases are bone metastases, known to usurp HSC homing pathways to establish footholds in the bone marrow. However, it is not clear whether tumors target the HSC niche during metastasis. Here we have shown in a mouse model of metastasis that human prostate cancer (PCa) cells directly compete with HSCs for occupancy of the mouse HSC niche. Importantly, increasing the niche size promoted metastasis, whereas decreasing the niche size compromised dissemination. Furthermore, disseminated PCa cells could be mobilized out of the niche and back into the circulation using HSC mobilization protocols. Finally, once in the niche, tumor cells reduced HSC numbers by driving their terminal differentiation. These data provide what we believe to be the first evidence that

OntoGene is easy to use, but it only exposes the title and abstract; we will use it for other purposes later.

We need to send a request to the database that holds the PubMed articles and retrieve all the information in them.

To do so, we will create the function 'fetch_articles', which fetches a list of articles from PubMed.

In [7]:
# Download a list of full article records from PubMed
import requests

def fetch_articles(id_list):
    # Join the IDs into a comma-separated string (no trailing comma) so it can be used in the HTTP request
    id_str_list = ','.join(id_list)

    # Set up the HTTP request
    url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=' + id_str_list + '&rettype=xml'

    # Return the result of the request
    return requests.get(url)

In [8]:
fetched_articles = fetch_articles(['21436587','21436588']).content
print(fetched_articles)

b'<?xml version="1.0" ?>\n<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">\n<PubmedArticleSet>\n<PubmedArticle>\n    <MedlineCitation Status="MEDLINE" Owner="NLM">\n        <PMID Version="1">21436587</PMID>\n        <DateCompleted>\n            <Year>2011</Year>\n            <Month>06</Month>\n            <Day>21</Day>\n        </DateCompleted>\n        <DateRevised>\n            <Year>2018</Year>\n            <Month>11</Month>\n            <Day>13</Day>\n        </DateRevised>\n        <Article PubModel="Print-Electronic">\n            <Journal>\n                <ISSN IssnType="Electronic">1558-8238</ISSN>\n                <JournalIssue CitedMedium="Internet">\n                    <Volume>121</Volume>\n                    <Issue>4</Issue>\n                    <PubDate>\n                        <Year>2011</Year>\n                        <Month>Apr</Month>\n                    </PubDate>\n   
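
When many PMIDs are needed, a single request with every ID in the URL can become very long. One option is to split the list into batches and call 'fetch_articles' once per batch. The sketch below only builds the comma-separated ID strings, so it runs without network access; `batch_ids` is a hypothetical helper and the default batch size of 200 is an arbitrary assumption.

```python
def batch_ids(id_list, batch_size=200):
    """Yield comma-separated ID strings with at most batch_size IDs each."""
    for start in range(0, len(id_list), batch_size):
        yield ','.join(id_list[start:start + batch_size])

ids = ['21436587', '21436588', '18425979', '11754709', '21734560']
for id_string in batch_ids(ids, batch_size=2):
    print(id_string)
```

Each yielded string has the same shape as the one 'fetch_articles' puts in the URL.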

We now have a way to download the articles. We can reuse some of the functions we wrote for the clinical trials, such as 'recursive_node_search', but first we need to obtain the XML DOM of each article in order to use that function.

We will now write a function that extracts the XML DOM of each article.

In [24]:
def article_doms(fetch_result):
    root = ET.fromstring(fetch_result)
    doms_list = []
    
    for child in root:
        if(child.tag == "PubmedArticle"):
            doms_list.append(child)
    
    return doms_list
    
fetched_article_doms = article_doms(fetched_articles)
print(fetched_article_doms[0].tag)
print(recursive_node_search(fetched_article_doms[1],['MedlineCitation','Article','ArticleTitle']).text)

PubmedArticle
FOXO3 programs tumor-associated DCs to become tolerogenic in human and murine prostate cancer.
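
Since the project is about scientist names, the author list of each article is the part we will care about most. Here is a minimal sketch of how the authors could be read from one PubmedArticle element. It uses a small made-up record so that it runs offline; the function name `article_authors` is hypothetical, and the sample only imitates the AuthorList/Author/LastName/ForeName layout of PubMed XML.

```python
import xml.etree.ElementTree as ET

# Made-up record that imitates the layout of a PubmedArticle element
sample = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>0</PMID>
    <Article>
      <ArticleTitle>Made-up title</ArticleTitle>
      <AuthorList>
        <Author><LastName>Smith</LastName><ForeName>John A</ForeName><Initials>JA</Initials></Author>
        <Author><LastName>Rossi</LastName><ForeName>Maria</ForeName><Initials>M</Initials></Author>
        <Author><CollectiveName>Example Study Group</CollectiveName></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""

def article_authors(article):
    """Return (last_name, fore_name) for each personal author of one PubmedArticle."""
    authors = []
    for author in article.findall('MedlineCitation/Article/AuthorList/Author'):
        last = author.findtext('LastName')
        if last is None:
            continue  # e.g. a collective (group) author without a personal name
        authors.append((last, author.findtext('ForeName', default='')))
    return authors

print(article_authors(ET.fromstring(sample)))
```

The same function should work on the elements returned by 'article_doms', since each of them is a PubmedArticle node.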


Now that we have functions to download and extract data from PubMed articles and functions to extract data from clinical trials, we can look at the CSV of the gold standard and start building our first neural network for a baseline measurement.

Here we will import the CSV using pandas.

Then we will print data from it.

In [22]:
import pandas as pd
data = pd.read_csv('ClinicalPmidsALL.csv', encoding = 'ISO-8859-1', sep = ';')

In [23]:
print(data)

          PMID         LastName   FirstName           CT  \
0     18425979       abdulkarim           B  NCT00703859   
1     11754709            gawin       Frank  NCT00000321   
2     21734560          deutsch      Steven  NCT00000333   
3     16357350          maisiak   Richard S  NCT00000407   
4     23047930           gorden     Phillip  NCT00001276   
5     18191187         hochster      Howard  NCT00003204   
6     17507634           fabian       Carol  NCT00005879   
7     26834067         dematteo      Ronald  NCT00025246   
8     24642382            klein      Julius  NCT00032656   
9     11480571      dispenzieri           A  NCT00047203   
10    20150372             fine    Howard A  NCT00047879   
11    19109586            orban     Tihamer  NCT00057499   
12    12143843          griffin           T  NCT00058461   
13    22688329           palmer     Jerry P  NCT00058981   
14    16018755           grossi      Sara G  NCT00066053   
15    25223501  goldbach-mansky  Raphael

In [31]:
print(data['PMID'])

0       18425979
1       11754709
2       21734560
3       16357350
4       23047930
5       18191187
6       17507634
7       26834067
8       24642382
9       11480571
10      20150372
11      19109586
12      12143843
13      22688329
14      16018755
15      25223501
16      22162591
17       1851986
18      23636291
19      27953647
20      23005614
21      18645517
22      22136405
23      25124687
24      16457713
25      25161267
26      17712402
27      21555753
28      21801082
29      26298795
          ...   
1132    21382109
1133    23861366
1134    24576563
1135    24237940
1136    27721202
1137    21769288
1138    20710042
1139    26195310
1140    23589384
1141    25218848
1142    12642152
1143    14582490
1144    21633317
1145    27307782
1146    17140849
1147    24029874
1148    23618779
1149    23722975
1150    19411310
1151    26447629
1152    23005614
1153    27778171
1154    26757787
1155    27390533
1156    24247981
1157    15494916
1158    22382359
1159    252948

In [32]:
print(data.loc[0])

PMID                                18425979
LastName                          abdulkarim
FirstName                                  B
CT                               NCT00703859
Name            Bassam Abdulkarim, MD, FRCPC
CommonAnswer                             YES
Remark                                     A
1                                        yes
2                                        yes
3                                        yes
Name: 0, dtype: object


In [35]:
print(data.loc[0, 'PMID'])

18425979
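
Before training anything, it can help to derive a namespace column and a numeric label from the table. The sketch below uses a few made-up rows that reuse the column names printed above. Reading 'CommonAnswer' == 'YES' as "same scientist" is an assumption about the gold standard, and the 'Namespace' and 'Label' columns are hypothetical additions.

```python
import pandas as pd

# Made-up rows that reuse the column names printed above
sample = pd.DataFrame({
    'PMID': [1, 2, 3],
    'LastName': ['smith', 'smith', 'rossi'],
    'FirstName': ['John', 'J', 'Maria'],
    'CT': ['NCT-A', 'NCT-B', 'NCT-C'],
    'CommonAnswer': ['YES', 'NO', 'YES'],
})

# Namespace = last name plus first-name initial
sample['Namespace'] = sample['LastName'].str.lower() + '_' + sample['FirstName'].str[0].str.lower()

# Binary label: 1 when the answer is YES (assumed to mean "same scientist")
sample['Label'] = (sample['CommonAnswer'].str.upper() == 'YES').astype(int)

print(sample[['Namespace', 'Label']])
# Number of pairs and number of positive pairs per namespace
print(sample.groupby('Namespace')['Label'].agg(['count', 'sum']))
```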


Now that we have loaded the CSV correctly, we can start creating the baseline against which we will compare our future results.

We will use scikit-learn first, and TensorFlow later.
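
As a first reference point, here is a minimal sketch of the simplest baseline scikit-learn offers: a classifier that always predicts the most frequent label. It is not necessarily the baseline this notebook goes on to build; the feature values and labels below are made up, and any real model should at least beat this score.

```python
from sklearn.dummy import DummyClassifier

# Made-up data: one placeholder feature per pair, label 1 = same scientist
X = [[0.0], [0.0], [0.0], [0.0]]
y = [1, 1, 1, 0]

# Always predicts the majority class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)

accuracy = baseline.score(X, y)
print(accuracy)
```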