Hello everyone,

This project is called "machine learning for disambiguation of clinical trial scientist names".

We will use the concept of a "namespace": a namespace comprises all the scientists who share the same last name and the same first-name initial.
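
As a minimal sketch of this definition, a namespace can be represented by the pair (lower-cased last name, first-name initial). The helper names `namespace_key` and `group_by_namespace` are illustrative only and are not used elsewhere in this notebook.

```python
from collections import defaultdict

def namespace_key(last_name, first_name):
    """Return the namespace of a scientist: (last name, first initial), lower-cased."""
    return (last_name.strip().lower(), first_name.strip().lower()[:1])

def group_by_namespace(scientists):
    """Group (last_name, first_name) tuples by their namespace."""
    groups = defaultdict(list)
    for last_name, first_name in scientists:
        groups[namespace_key(last_name, first_name)].append((last_name, first_name))
    return dict(groups)

# 'Frank Gawin' and 'F Gawin' fall in the same namespace, 'Julius Klein' in another
scientists = [('Gawin', 'Frank'), ('Gawin', 'F'), ('Klein', 'Julius')]
print(group_by_namespace(scientists))
```

Two records in the same namespace are only candidates for being the same person; deciding whether they really are is the task of the classifier.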

We will pair clinical trials run by scientists whose names belong to a given namespace with research articles written by scientists whose names belong to the same namespace.

After pairing them, we will assess whether the scientist behind the trial and the scientist behind the article are the same person.

First things first: to get this Jupyter notebook working, you will need to download all the clinical trials from this link: https://clinicaltrials.gov/AllPublicXML.zip

After doing that, extract all the files into a folder called "AllPublicXML".

Check that the "AllPublicXML" folder contains many subfolders named like "NCT0000xxxx" (or with other digits). If so, you will be able to run this notebook.
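
This check can also be done programmatically. The following is a minimal sketch, assuming the archive was extracted next to the notebook; the function name `count_trial_folders` is hypothetical.

```python
import re
from pathlib import Path

def count_trial_folders(root='AllPublicXML'):
    """Count the sub-folders of `root` whose name looks like 'NCT0000xxxx'."""
    root_path = Path(root)
    if not root_path.is_dir():
        return 0   # the folder is missing or was extracted somewhere else
    pattern = re.compile(r'^NCT\d{4}xxxx$')
    return sum(1 for p in root_path.iterdir() if p.is_dir() and pattern.match(p.name))

print('trial folders found:', count_trial_folders())
```

A result of 0 suggests that the archive was extracted to a different location or with an extra nesting level.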

The next cell contains a small test. If everything is configured correctly, you will see the content of an XML file. We used 'NCT02990273' as an example; feel free to try others (remember that not every number is a valid ID).

In [1]:
import os    # used to build file paths that work on any operating system
import numpy as np # useful later
import xml.dom.minidom as xml_dom    # we need this library in order to extract the dom from an xml document

# This function returns the location of a file, given the clinical trial ID or the file name
def get_file_location(name):
    inner_folder = name[:7] + 'xxxx'   # name of the inner folder (e.g. NCT0000xxxx)
    location = os.path.join('AllPublicXML', inner_folder, name)   # path of the file inside that folder
    if(name[-4:] != '.xml'):
        location += '.xml'    # If the input was the ID, we add .xml because the file name is identical to the ID
    return location    # I return the file location
    
# This function allows us to get the dom of a file, specifying the file name or the ID of the clinical trial
def xml_doc_string(name):
    location = get_file_location(name)    # here we set the location of the file
    try:
        dom = xml_dom.parse(location)    # we parse the file using xml_dom
    except:
        return 'file not found'   # if I cannot find the file, I return this string instead
    xml_dom_as_string = dom.toprettyxml()    # get the dom as a string
    return xml_dom_as_string    # return the string

# Let's test it: if it works, you have set up the folders correctly
print(xml_doc_string('NCT02990273'))

<?xml version="1.0" ?>
<clinical_study>
	
  
	<!-- This xml conforms to an XML Schema at:
    https://clinicaltrials.gov/ct2/html/images/info/public.xsd -->
	
  
	<required_header>
		
    
		<download_date>ClinicalTrials.gov processed this data on May 29, 2019</download_date>
		
    
		<link_text>Link to the current ClinicalTrials.gov record.</link_text>
		
    
		<url>https://clinicaltrials.gov/show/NCT02990273</url>
		
  
	</required_header>
	
  
	<id_info>
		
    
		<org_study_id>IRB00113092</org_study_id>
		
    
		<nct_id>NCT02990273</nct_id>
		
  
	</id_info>
	
  
	<brief_title>The Utility of Thromboelastography in Cirrhotic Patients Undergoing Endoscopic Procedures</brief_title>
	
  
	<official_title>The Utility of Thromboelastography in Cirrhotic Patients Undergoing Endoscopic Procedures</official_title>
	
  
	<sponsors>
		
    
		<lead_sponsor>
			
      
			<agency>Johns Hopkins University</agency>
			
      
			<agency_class>Other</agency_class>
			
    
		</lead_sponsor>
		

Now that we have the folders in the right place, we proceed to create functions that will be able to find a specific field in the XML file.

In [2]:
# This function takes the ID/file name of the XML file and the name of the node to search and returns
# all the nodes with the specified name
def get_xml_node(ID, node_name):
    location = get_file_location(ID)
    try:
        dom = xml_dom.parse(location)
    except:
        return 'file not found'
    nodes = dom.getElementsByTagName(node_name)
    return nodes
    
node = get_xml_node('NCT02644928', 'textblock')
node[0].firstChild.nodeValue    #print the content of the first node found

'\n      This study is designed as a prospective, single-center, longitudinal and analytical study on\n      the effect of bariatric surgery in obese patients with chronic kidney disease (CKD).\n    '

In [3]:
import xml.etree.ElementTree as ET

# This recursive function walks down the hierarchy and returns the node (the root 'clinical_study' tag must not be included in the hierarchy)
def recursive_node_search(root,hierarchy):
    for child in root:
        if(child.tag == hierarchy[0]):
            if(len(hierarchy) == 1):
                return child
            return recursive_node_search(child, hierarchy[1:])

# This function takes the ID/file name of the XML file and the hierarchy of the node to search and returns
# the node with the specified name
def get_xml_node_hierarchy(ID, node_hierarchy):
    location = get_file_location(ID)
    try:
        dom = ET.parse(location)
    except:
        return 'file not found'
    node = recursive_node_search(dom.getroot(),node_hierarchy)
    return node
    
node = get_xml_node_hierarchy('NCT02644928', ['overall_official','last_name'])
print(node.text)
# If the node doesn't exist, a None is returned
node = get_xml_node_hierarchy('NCT00270075', ['sponsors','lead_sponsor','non_existent_node'])
print("Value of node: ", node)

Enrique Morales Ruiz, MD, PhD
Value of node:  None
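
The same kind of lookup can be written with ElementTree's own path syntax, since `Element.find` accepts a '/'-separated list of tags. The sketch below uses a hand-written XML sample (not a real ClinicalTrials.gov record) and a hypothetical helper name, `find_by_hierarchy`. One difference is worth noting: `recursive_node_search` only descends into the first child whose tag matches, so it returns None if that child lacks the rest of the hierarchy, whereas `find` also considers the following siblings.

```python
import xml.etree.ElementTree as ET

# Hand-written sample: the first 'overall_official' has no 'last_name'
SAMPLE = """
<clinical_study>
  <overall_official>
    <role>Study Chair</role>
  </overall_official>
  <overall_official>
    <last_name>Jane Doe, MD</last_name>
  </overall_official>
</clinical_study>
"""

def find_by_hierarchy(root, hierarchy):
    """Return the first node matching the tag hierarchy, or None if there is none."""
    return root.find('/'.join(hierarchy))

root = ET.fromstring(SAMPLE.strip())
node = find_by_hierarchy(root, ['overall_official', 'last_name'])
print(node.text)                                        # Jane Doe, MD
print(find_by_hierarchy(root, ['sponsors', 'agency']))  # None
```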


In [4]:
# This function can be used to get the dom and to do multiple things with it
def get_xml_dom(ID):
    location = get_file_location(ID)
    try:
        return ET.parse(location)
    except:
        return 'file not found'

def get_xml_doms(id_list):  # use this for a list of IDs
    result = []
    for element in id_list:
        el = get_xml_dom(element)
        if(el != 'file not found'):
            result.append(el.getroot())
    return result
    
root = get_xml_dom('NCT00270075').getroot()
print(recursive_node_search(root,['sponsors','lead_sponsor','agency']).text)
print(recursive_node_search(root,['official_title']).text)

Johnson & Johnson Pharmaceutical Research & Development, L.L.C.
Recombinant Human Erythropoietin (R-HuEPO) in Non-Anemic Patients Scheduled for Orthopedic or Cardiovascular Surgery, to Facilitate Presurgical Autologous Blood Donation (A Double-blind, Randomized, Dose Finding Study)


Now that we can extract information from clinical trials, we will need to do the same with PubMed.

We will use code from OntoGene that has already been created for this project.

The repository is here: https://github.com/OntoGene/OGER

To install it in Python using pip you can type this command: pip install git+https://github.com/OntoGene/OGER.git


In [5]:
# Get config and start a pipeline server
from oger.ctrl.router import Router, PipelineServer
conf = Router(termlist_path='oger/test/testfiles/test_terms.tsv')
pl = PipelineServer(conf)

In [6]:
# Download an article from PubMed (OntoGene version, which contains only title and abstract)
art = pl.load_one(['21436587'], fmt='pubmed')[0]
print(art[0].text)
print(art[1]._text)

Human prostate cancer metastases target the hematopoietic stem cell niche to establish footholds in mouse bone marrow.

HSC homing, quiescence, and self-renewal depend on the bone marrow HSC niche. A large proportion of solid tumor metastases are bone metastases, known to usurp HSC homing pathways to establish footholds in the bone marrow. However, it is not clear whether tumors target the HSC niche during metastasis. Here we have shown in a mouse model of metastasis that human prostate cancer (PCa) cells directly compete with HSCs for occupancy of the mouse HSC niche. Importantly, increasing the niche size promoted metastasis, whereas decreasing the niche size compromised dissemination. Furthermore, disseminated PCa cells could be mobilized out of the niche and back into the circulation using HSC mobilization protocols. Finally, once in the niche, tumor cells reduced HSC numbers by driving their terminal differentiation. These data provide what we believe to be the first evidence that

OntoGene is easy to use, but it only exposes the title and the abstract; we will use it for other purposes later.

We need to send a request to the database that contains the PubMed articles and retrieve all the information in them.

To do so, we will create the function 'fetch_articles' that can fetch a list of articles from PubMed.

In [7]:
# Download a list of full articles from PubMed
import requests

def fetch_articles(id_list):
    
    # I join the IDs into a comma-separated string, so that it can be used in the HTTP request
    id_str_list = ','.join(id_list)
        
    # I set up the HTTP request
    url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id='+id_str_list+'&rettype=xml'
    
    # I return the result of the request
    return requests.get(url)

def fetch_many_articles(id_list):  # To use only for >300 articles
    begin = 0
    end = len(id_list)
    responses = []
    result = []
    
    # I make requests of 300 articles each
    while(begin < end):
        responses.append(fetch_articles(id_list[begin:(begin+300)]))
        begin += 300
    # I check that all the requests returned the expected status code
    for response in responses:
        if(response.status_code != 200):
            raise Exception('A response doesn\'t have 200 as status code, so something has gone wrong')
    
    # I add all the articles into a single array
    for response in responses:
        for article in article_doms(response.content): # I get all the articles for every response
            result.append(article)
    return result
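
The batching logic above can be isolated from the network call, which makes it easy to check on its own. This is a minimal sketch with hypothetical helper names (`chunks`, `build_efetch_params`); it sends no request and only prepares the same query parameters that `fetch_articles` places in the URL.

```python
def chunks(items, size=300):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_efetch_params(id_list):
    """Build the query parameters of one efetch request (nothing is sent here)."""
    return {'db': 'pubmed', 'id': ','.join(id_list), 'rettype': 'xml'}

ids = [str(n) for n in range(1, 701)]   # 700 made-up IDs, for illustration only
batches = list(chunks(ids))
print([len(batch) for batch in batches])   # [300, 300, 100]
print(build_efetch_params(['21436587', '21436588']))
```

A dictionary like this can be handed to `requests.get(url, params=...)`, which takes care of the URL encoding.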

In [8]:
fetched_articles = fetch_articles(['21436587','21436588']).content
print(fetched_articles)

b'<?xml version="1.0" ?>\n<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">\n<PubmedArticleSet>\n<PubmedArticle>\n    <MedlineCitation Status="MEDLINE" Owner="NLM">\n        <PMID Version="1">21436587</PMID>\n        <DateCompleted>\n            <Year>2011</Year>\n            <Month>06</Month>\n            <Day>21</Day>\n        </DateCompleted>\n        <DateRevised>\n            <Year>2018</Year>\n            <Month>11</Month>\n            <Day>13</Day>\n        </DateRevised>\n        <Article PubModel="Print-Electronic">\n            <Journal>\n                <ISSN IssnType="Electronic">1558-8238</ISSN>\n                <JournalIssue CitedMedium="Internet">\n                    <Volume>121</Volume>\n                    <Issue>4</Issue>\n                    <PubDate>\n                        <Year>2011</Year>\n                        <Month>Apr</Month>\n                    </PubDate>\n   

We now have a way to download the articles. We can reuse some of the functions we wrote for the clinical trials, such as 'recursive_node_search', but we first need to obtain the XML DOM of each article.

The next cell extracts the XML DOM of every article from the downloaded response.

In [9]:
def article_doms(fetch_result):
    root = ET.fromstring(fetch_result)
    doms_list = []
    
    for child in root:
        if(child.tag == "PubmedArticle"):
            doms_list.append(child)
    
    return doms_list
    
fetched_article_doms = article_doms(fetched_articles)
print(fetched_article_doms[0].tag)
print(recursive_node_search(fetched_article_doms[1],['MedlineCitation','Article','ArticleTitle']).text)

PubmedArticle
FOXO3 programs tumor-associated DCs to become tolerogenic in human and murine prostate cancer.


Now that we have the functions to download and extract data from PubMed articles and the functions to extract data from clinical trials, we can look at the csv of the gold standard and start building our first classifier for a baseline measurement.

Here we will import the csv using pandas and print some data from it.

In [10]:
import pandas as pd
df = pd.read_csv('ClinicalPmidsALL.csv', encoding = 'ISO-8859-1', sep = ';')

In [11]:
print(df)

          PMID         LastName   FirstName           CT  \
0     18425979       abdulkarim           B  NCT00703859   
1     11754709            gawin       Frank  NCT00000321   
2     21734560          deutsch      Steven  NCT00000333   
3     16357350          maisiak   Richard S  NCT00000407   
4     23047930           gorden     Phillip  NCT00001276   
5     18191187         hochster      Howard  NCT00003204   
6     17507634           fabian       Carol  NCT00005879   
7     26834067         dematteo      Ronald  NCT00025246   
8     24642382            klein      Julius  NCT00032656   
9     11480571      dispenzieri           A  NCT00047203   
10    20150372             fine    Howard A  NCT00047879   
11    19109586            orban     Tihamer  NCT00057499   
12    12143843          griffin           T  NCT00058461   
13    22688329           palmer     Jerry P  NCT00058981   
14    16018755           grossi      Sara G  NCT00066053   
15    25223501  goldbach-mansky  Raphael

In [12]:
df['PMID']

0       18425979
1       11754709
2       21734560
3       16357350
4       23047930
5       18191187
6       17507634
7       26834067
8       24642382
9       11480571
10      20150372
11      19109586
12      12143843
13      22688329
14      16018755
15      25223501
16      22162591
17       1851986
18      23636291
19      27953647
20      23005614
21      18645517
22      22136405
23      25124687
24      16457713
25      25161267
26      17712402
27      21555753
28      21801082
29      26298795
          ...   
1132    21382109
1133    23861366
1134    24576563
1135    24237940
1136    27721202
1137    21769288
1138    20710042
1139    26195310
1140    23589384
1141    25218848
1142    12642152
1143    14582490
1144    21633317
1145    27307782
1146    17140849
1147    24029874
1148    23618779
1149    23722975
1150    19411310
1151    26447629
1152    23005614
1153    27778171
1154    26757787
1155    27390533
1156    24247981
1157    15494916
1158    22382359
1159    252948

In [13]:
print(df.loc[0])

PMID                                18425979
LastName                          abdulkarim
FirstName                                  B
CT                               NCT00703859
Name            Bassam Abdulkarim, MD, FRCPC
CommonAnswer                             YES
Remark                                     A
1                                        yes
2                                        yes
3                                        yes
Name: 0, dtype: object


In [14]:
print(df.loc[0, 'PMID'])

18425979


Let's replace 'yes' with 1 and 'no' with 0.

After that, the rows will be easier to classify.

In [15]:
print(df.iloc[39])
df.replace({'CommonAnswer' : {'yes':1, 'YES':1, 'Yes':1, 'no':0, 'NO':0, 'NO ':0 }}, inplace = True)
df.replace({'1' : {'yes':1, 'no':0}}, inplace = True)
df.replace({'2' : {'yes':1, 'no':0}}, inplace = True)
df.replace({'3' : {'yes':1,'ynes':1, 'no':0}}, inplace = True)
# this column contains one 'ynes', which I treat as a 'yes'

print(df['CommonAnswer'].unique())
print(df['1'].unique())
print(df['2'].unique())
print(df['3'].unique())

PMID                          11250652
LastName                       alvarez
FirstName                            C
CT                         NCT00152347
Name            Christine Alvarez, PhD
CommonAnswer                        NO
Remark                               G
1                                  yes
2                                  yes
3                                  yes
Name: 39, dtype: object
[1 0]
[1 0]
[1 0]
[1 0]
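
The replacements above list every spelling found in the data explicitly. An alternative, shown here only as a sketch, is to normalize the text first (strip spaces, lower-case) and then map it. The function name `normalize_answer` is hypothetical, and returning None for unknown values is an assumption made so that unexpected annotations stay visible.

```python
def normalize_answer(value):
    """Map a yes/no annotation to 1/0; return None for anything unrecognised."""
    text = str(value).strip().lower()
    if text in ('yes', 'ynes'):   # 'ynes' is treated as a typo for 'yes', as in the cell above
        return 1
    if text == 'no':
        return 0
    return None

print([normalize_answer(v) for v in ['YES', 'NO ', 'ynes', 'maybe']])   # [1, 0, 1, None]
```

With pandas, such a function could be applied to a column with `df['CommonAnswer'].map(normalize_answer)`.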


A small check for inconsistencies between CommonAnswer and the three individual answers.

In [16]:
def checkData(data):
    error = False
    for i in range(len(data)):
        checkSum = data.loc[i, '1'] + data.loc[i, '2'] + data.loc[i, '3']
        if(data.loc[i, 'CommonAnswer'] != 0 and data.loc[i, 'CommonAnswer'] != 1):
            print('not a valid value at', i, ':', data.loc[i, 'CommonAnswer'])
            error = True
        if(checkSum >= 2 and data.loc[i, 'CommonAnswer'] == 0):
            print('error at', i, ': common answer should be yes')
            error = True
        if(checkSum <= 1 and data.loc[i, 'CommonAnswer'] == 1):
            print('error at', i, ': common answer should be no')
            error = True
    return error

def correctData(data):
    for i in range(len(data)):
        checkSum = data.loc[i, '1'] + data.loc[i, '2'] + data.loc[i, '3']
        if(checkSum >= 2 and data.loc[i, 'CommonAnswer'] == 0):
            data.loc[i, 'CommonAnswer'] = 1
        if(checkSum <= 1 and data.loc[i, 'CommonAnswer'] == 1):
            data.loc[i, 'CommonAnswer'] = 0
        
print('presence of error :',checkData(df))
correctData(df)
print('correcting data')
print('presence of error :',checkData(df))

#useful functions for csv

#df.info()
#df.isnull().sum()
#df['3'].unique()
#df['CommonAnswer'].unique()
#df.head()

error at 39 : common answer should be yes
error at 129 : common answer should be yes
error at 140 : common answer should be yes
error at 270 : common answer should be yes
error at 308 : common answer should be yes
error at 337 : common answer should be yes
error at 350 : common answer should be yes
error at 373 : common answer should be yes
error at 411 : common answer should be yes
error at 531 : common answer should be yes
error at 543 : common answer should be yes
error at 591 : common answer should be yes
error at 628 : common answer should be yes
error at 650 : common answer should be yes
error at 651 : common answer should be yes
error at 676 : common answer should be yes
error at 814 : common answer should be yes
error at 887 : common answer should be yes
error at 893 : common answer should be yes
presence of error : True
correcting data
presence of error : False
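
The rule enforced by `checkData` and `correctData` is a majority vote over the three individual answers. As a compact sketch on plain Python data (the names `majority_vote` and `find_inconsistent` are illustrative, and the rows are made up):

```python
def majority_vote(votes):
    """Return 1 if at least two of the three 0/1 votes are 1, otherwise 0."""
    return 1 if sum(votes) >= 2 else 0

def find_inconsistent(rows):
    """Return the indices of rows whose stored answer differs from the majority vote."""
    return [i for i, (answer, votes) in enumerate(rows) if answer != majority_vote(votes)]

# Each row is (common answer, (vote 1, vote 2, vote 3))
rows = [(1, (1, 1, 1)), (0, (1, 1, 1)), (0, (0, 1, 0))]
print(find_inconsistent(rows))   # [1]
```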


Now that we have loaded the csv correctly, we can start creating the baseline classifier against which we will compare our future results.

We will be using scikit-learn now, and later we will be using tensorflow.

Let's import the libraries we will need for sklearn

In [17]:
#imports
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split



We are building the baseline classifier.

To build it, we will use a random forest.

We have to define the attributes on which the classifier will be based.

I have chosen the following attributes: 'same first name', 'same email', 'same organization'.

Let's define the function to check array equality

In [18]:
# I create a new array with the same number of elements as the two input arrays.
# It contains 1 at a position if arr1[position] == arr2[position] and 0 if they are different.
def equalArray(arr1, arr2):
    if(len(arr1) != len(arr2)):
        return None
    equal = []
    for i in range(len(arr1)):
        if(arr1[i] == arr2[i]):
            equal.append(1)
        else:
            equal.append(0)
    return equal

print(equalArray(['hello','hello2'],['hello','hello3']))
print(equalArray(['hello','hello2'],['hello']))

[1, 0]
None


Now let's extract the information we need into arrays that we will join later.

In this section we will retrieve all the articles and all the clinical trials

Here we will map every PubMed article in XML and every clinical trial in XML together.

In [19]:
# Here we get all the PubMed articles
PubMed_id_string = list(map(str, df['PMID'].tolist()))
PubMed_articles = fetch_many_articles(PubMed_id_string)

In [20]:
# Here we get all the clinical trials
ClinicalTrials = get_xml_doms(df['CT'].tolist())

In [21]:
#for i in range(len(PubMed_articles)):
#    print(recursive_node_search(PubMed_articles[i],['MedlineCitation','PMID']).text)
#for i in range(len(ClinicalTrials)):
#    print(type(recursive_node_search(ClinicalTrials[i],['id_info','nct_id']).text))
def get_article_by_ID(article_list, PMID):
    for i in range(len(article_list)):
        if(recursive_node_search(article_list[i],['MedlineCitation','PMID']).text == PMID):
            return article_list[i]
    return None
    
def get_clinical_trial_by_ID(clinical_trials_list, CT):
    for i in range(len(clinical_trials_list)):
        if(recursive_node_search(clinical_trials_list[i],['id_info','nct_id']).text == CT):
            return clinical_trials_list[i]
    return None

#Now let's map them
CT_PMID = []
for i in range(len(df)):
    row = df.iloc[i]
    article = get_article_by_ID(PubMed_articles, str(row['PMID']))
    clinical_trial = get_clinical_trial_by_ID(ClinicalTrials, row['CT'])
    if(article != None and clinical_trial != None):
        CT_PMID.append([clinical_trial, article])

In [22]:
# We need to get the last name, first name initial, first name, e-mail and organization
# I start to get them for the clinical trials

# Here I look up, in the gold standard, the last name associated with the NCT ID of a pair in CT_PMID
def get_correct_last_name(CT_ID, df):
    index = df.index[df['CT'] == CT_ID].tolist()[0]
    last_name = df.iloc[index]['LastName']
    return last_name.lower()

def extrapolate_name(name):
    # I use lower case and remove the dots to make the later comparison uniform
    name = name.split(',')[0].lower().replace('.','').strip()  # I get rid of titles such as M.D. and PhD, I only need the name now
    name = name.split(' ')
    first_name = ' '.join(name[:-1])  # everything except the last word
    first_name_initial = first_name[:1]  # empty string if the name is a single word
    last_name = name[-1]  # the last word is always the last name (double surnames are united by '-')
    return last_name, first_name_initial, first_name

def extrapolate_last_name(name):
    name = name.split(',')[0].lower().replace('.','').strip()
    name = name.split(' ')
    last_name = name[-1]
    return last_name.lower()

def get_organization_name(CT_PMID, i):
    global orgs  # counter of trials without an organization, defined at module level below
    org_name = recursive_node_search(CT_PMID[i][0],['responsible_party','investigator_affiliation'])
    if(org_name != None and org_name.text.strip() != ""):
        #print(org_name.text)
        return org_name.text.strip().lower()
    org_name = recursive_node_search(CT_PMID[i][0],['responsible_party','organization'])
    if(org_name != None and org_name.text.strip() != ""):
        #print(org_name.text)
        return org_name.text.strip().lower()
    org_name = recursive_node_search(CT_PMID[i][0],['overall_official','affiliation'])
    if(org_name != None and org_name.text.strip() != ""):
        #print(org_name.text)
        return org_name.text.strip().lower()
    org_name = recursive_node_search(CT_PMID[i][0],['source'])
    if(org_name != None and org_name.text.strip() != ""):
        #print(org_name.text)
        return org_name.text.strip().lower()
    print(recursive_node_search(CT_PMID[i][0],['id_info','nct_id']).text, 'doesn\'t have an organization')
    orgs+=1
    return ""

def get_mail_CT(CT_PMID, i):
    mail = recursive_node_search(CT_PMID[i][0],['clinical_results','point_of_contact','email'])
    if(mail!= None):
        mail = mail.text.split(";")
        mails_found.append(mail[0])
        return mail[0].strip()
    return ""

element_to_delete = []
nones = 0
completely_not_found = 0
mails_found = []
orgs = 0

for i in range(len(CT_PMID)):
    # Let's get the correct last name
    correct_last_name = get_correct_last_name(recursive_node_search(CT_PMID[i][0],['id_info','nct_id']).text, df)
    
    # I get the name from the 3 possible locations
    CT_name_1 = recursive_node_search(CT_PMID[i][0],['overall_official','last_name'])
    CT_name_2 = recursive_node_search(CT_PMID[i][0],['responsible_party','investigator_full_name'])
    CT_name_3 = recursive_node_search(CT_PMID[i][0],['overall_contact','last_name'])
    
    # I get the right name
    CT_name = None
    if(CT_name_1 != None and extrapolate_last_name(CT_name_1.text) == correct_last_name):
        CT_name = CT_name_1
    if(CT_name_2 != None and extrapolate_last_name(CT_name_2.text) == correct_last_name):
        CT_name = CT_name_2
    if(CT_name_3 != None and extrapolate_last_name(CT_name_3.text) == correct_last_name):
        CT_name = CT_name_3
    
    # Here I get the organization name
    organization_name = get_organization_name(CT_PMID, i)
    
    # Here I get the mail
    mail = get_mail_CT(CT_PMID, i)

    # If I can't get the right name, I save the pair and delete it later (fortunately this is rare)
    if(CT_name == None):
        ID = recursive_node_search(CT_PMID[i][0],['id_info','nct_id']).text
        if(CT_name_1 != None):
            print(ID, 'last name in the first field:',extrapolate_last_name(CT_name_1.text), 'correct last name:',correct_last_name)
        if(CT_name_2 != None):
            print(ID, 'last name in the second field:',extrapolate_last_name(CT_name_2.text), 'correct last name:',correct_last_name)
        if(CT_name_3 != None):
            print(ID, 'last name in the third field:',extrapolate_last_name(CT_name_3.text), 'correct last name:',correct_last_name)
        if(CT_name_3 == None and CT_name_2 == None and CT_name_1 == None):
            print(ID, 'name of the principal investigator not found, correct name: ',correct_last_name)
            completely_not_found+=1
            nones-=1 # I exclude him from the nones
        nones+=1
        element_to_delete.append(CT_PMID[i])
        continue
    
    # I get the last name, first name initial and the first name
    last_name_CT, first_name_initials_CT, first_name_CT = extrapolate_name(CT_name.text)
    
    # I add what I found to the matrix
    CT_PMID[i].extend((last_name_CT, first_name_initials_CT, first_name_CT, organization_name, mail))
    
print('wrong name found:',nones,'No name found:',completely_not_found)
print('previous pairs:',len(CT_PMID))

# Now I remove the elements for which I haven't found the proper name
for el in element_to_delete:
    CT_PMID.remove(el)
print('usable pairs found:',len(CT_PMID))
print('organization found:',(len(CT_PMID)-orgs))
print('mail found:',len(mails_found))

# I store this to be able to shift according to the number of elements I have added
elements_in_clinical_trial = len(CT_PMID[0]) - 2 # All the elements in a row - the article - the clinical trial

NCT01160471 last name in the first field: jones correct last name: bluemke
NCT01306994 last name in the third field: poling correct last name: mccormick
NCT01356251 last name in the first field: nelson correct last name: holland
NCT01860404 last name in the first field: myers correct last name: kirschen
NCT01860404 last name in the second field: myers correct last name: kirschen
NCT01860404 last name in the third field: swann correct last name: kirschen
NCT02224729 last name in the first field: filicko-o'hara correct last name: sharma
NCT02238327 last name in the second field: morris correct last name: kessinger
NCT02325401 last name in the first field: wise-draper correct last name: hashemi-sadraei
NCT02325401 last name in the second field: wise-draper correct last name: hashemi-sadraei
NCT02689427 last name in the first field: ueno correct last name: lim
NCT02689427 last name in the third field: ueno correct last name: lim
NCT02726880 last name in the first field: zajac correct last 
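
To make the name parsing used in this cell easier to follow, here is a standalone sketch of the same idea with a few worked inputs. The function name `split_name` is hypothetical; unlike `extrapolate_name`, it splits on any whitespace and returns empty strings instead of failing on an empty input.

```python
def split_name(full_name):
    """Split 'First Middle Last, Title' into (last name, first initial, first names)."""
    # Drop the titles after the first comma, lower-case, remove dots
    cleaned = full_name.split(',')[0].lower().replace('.', '').strip()
    parts = cleaned.split()
    if not parts:
        return '', '', ''
    last_name = parts[-1]
    first_name = ' '.join(parts[:-1])
    first_initial = first_name[:1]   # '' when the name is a single word
    return last_name, first_initial, first_name

print(split_name('Bassam Abdulkarim, MD, FRCPC'))    # ('abdulkarim', 'b', 'bassam')
print(split_name('Enrique Morales Ruiz, MD, PhD'))   # ('ruiz', 'e', 'enrique morales')
print(split_name('Smith'))                           # ('smith', '', '')
```

The second example shows a limitation of taking the last word as the last name: a double surname written with a space rather than a hyphen is split, which may explain some of the mismatches with the gold standard.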

In [23]:
# We need to get the last name, first name initial, first name, e-mail and organization
# Now let's do that for the PubMed articles

# I need to find the right author because in PubMed he/she is not always the first on the list
def get_correct_author(CT_PMID, i, last_name, initial):
    authors = recursive_node_search(CT_PMID[i][1],['MedlineCitation','Article','AuthorList'])
    backup = None
    for author in authors:
        flag = False
        for child in author:
            if(child.tag == 'LastName' and child.text.lower() == last_name):
                flag = True # I need the flag to check both last name and initials
                backup = author
            if(child.tag == 'Initials' and child.text.lower()[0] == initial and flag):
                return author
    return None

def get_default_organization(CT_PMID, i):
    authors = recursive_node_search(CT_PMID[i][1],['MedlineCitation','Article','AuthorList'])
    for author in authors:
        for child in author:
            if(child.tag == 'AffiliationInfo'):
                for aff in child:
                    if(aff.tag == 'Affiliation'):
                        orgs.append(aff)
                        return aff.text.lower().strip()
    return ""

def get_organization(CT_PMID, i, last_name, initial):
    authors = recursive_node_search(CT_PMID[i][1],['MedlineCitation','Article','AuthorList'])
    for author in authors:
        doubleFlag = False
        flag = False
        for child in author:
            if(child.tag == 'LastName' and child.text.lower() == last_name):
                flag = True # I need the flag to check both last name and initials
            if(child.tag == 'Initials' and child.text.lower()[0] == initial and flag):
                doubleFlag = True
            if(child.tag == 'AffiliationInfo' and doubleFlag):
                for aff in child:
                    if(aff.tag == 'Affiliation'):
                        orgs.append(aff)
                        return aff.text.lower().strip()
    return get_default_organization(CT_PMID, i)

def get_default_mail(CT_PMID, i):
    authors = recursive_node_search(CT_PMID[i][1],['MedlineCitation','Article','AuthorList'])
    for author in authors:
        for child in author:
            if(child.tag == 'AffiliationInfo'):
                for aff in child:
                    if(aff.tag == 'Affiliation'):
                        strings = aff.text.split(' ')
                        for string in strings:
                            if("@" in string):
                                mails.append(string)
                                if(string[-1] == '.'):
                                    string = string[:-1]
                                return string.strip()
    return ""

def get_mail(CT_PMID, i, last_name, initial):
    authors = recursive_node_search(CT_PMID[i][1],['MedlineCitation','Article','AuthorList'])
    backup = None
    for author in authors:
        doubleFlag = False
        flag = False
        for child in author:
            if(child.tag == 'LastName' and child.text.lower() == last_name):
                flag = True # I need the flag to check both last name and initials
                backup = author
            if(child.tag == 'Initials' and child.text.lower()[0] == initial and flag):
                doubleFlag = True
            if(child.tag == 'AffiliationInfo' and doubleFlag):
                for aff in child:
                    if(aff.tag == 'Affiliation'):
                        strings = aff.text.split(' ')
                        for string in strings:
                            if("@" in string):
                                mails.append(string)
                                if(string[-1] == '.'):
                                    string = string[:-1]
                                return string.strip()
    return get_default_mail(CT_PMID, i)

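def extrapolate_name_xml_node(author):
    if(author is None):
        return None, None, None
    # Default values, in case the author node lacks one of the tags (e.g. no ForeName)
    last_name = first_name_initial = first_name = None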
    for child in author:
        if(child.tag == 'LastName'):
            last_name = child.text.lower()
        if(child.tag == 'ForeName'):
            first_name = child.text.lower()
        if(child.tag == 'Initials'):
            first_name_initial = child.text.lower()[0]
    return last_name, first_name_initial, first_name

orgs = []
mails = []

for i in range(len(CT_PMID)):
    last_name = CT_PMID[i][2]
    first_initial = CT_PMID[i][3]
    author = get_correct_author(CT_PMID, i, last_name, first_initial)
    
    # Here we will extract the organization
    organization_name = get_organization(CT_PMID, i, last_name, first_initial)
    #if(organization_name != ""):
        #print(organization_name,'\n')
        
    mail = get_mail(CT_PMID, i, last_name, first_initial)
    #if(mail != None):
        #print(mail)
    
    if(author == None):
        print('not found:',i)
    last_name_PM, first_name_initials_PM, first_name_PM = extrapolate_name_xml_node(author)
    if(last_name_PM != last_name):
        print(i,last_name_PM,last_name)
    if(first_name_initials_PM != first_initial):
        print(i,first_name_initials_PM,first_initial)
    CT_PMID[i].extend((last_name_PM, first_name_initials_PM, first_name_PM, organization_name, mail))
    
print(len(orgs), 1139)
print(len(mails), 1139)

1034 1139
490 1139
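
The two counters above show that an e-mail address was found for only 490 of the 1139 pairs. get_mail looks for a space-separated token containing "@" and trims one trailing full stop. As a hedged alternative, the sketch below pulls the address out with a regular expression. The names EMAIL_PATTERN and extract_email are assumptions made for illustration and are not part of the notebook, and the pattern is deliberately simple rather than a full address validator.

```python
import re

# Hypothetical pattern (an assumption, not the notebook's method):
# a local part, "@", then a domain made of at least two dot-separated labels.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def extract_email(affiliation):
    """Return the first e-mail address found in an affiliation string, or ''."""
    if not affiliation:
        return ""
    match = EMAIL_PATTERN.search(affiliation)
    return match.group(0) if match else ""

# The first example string is an affiliation shown in the CT_PMID output
print(extract_email("yale university school of medicine, substance abuse center, "
                    "34 park st, new haven, ct 06519, usa. arthur.margolin@yale.edu"))
# A trailing full stop is left out of the match
print(extract_email("electronic address: someone@example.org."))
```

Because the pattern stops at the last domain label, no separate step is needed to remove a full stop that ends the affiliation sentence.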


In [24]:
CT_PMID

[[<Element 'clinical_study' at 0x000002785A2BD3B8>,
  <Element 'PubmedArticle' at 0x000002785A2A0F98>,
  'abdulkarim',
  'b',
  'bassam',
  'ahs cancer control alberta',
  '',
  'abdulkarim',
  'b',
  'b',
  '',
  ''],
 [<Element 'clinical_study' at 0x000002786737C6D8>,
  <Element 'PubmedArticle' at 0x000002785CA11908>,
  'gawin',
  'f',
  'frank',
  'friends research institute, inc.',
  '',
  'gawin',
  'f',
  'frank',
  'yale university school of medicine, substance abuse center, 34 park st, new haven, ct 06519, usa. arthur.margolin@yale.edu',
  'arthur.margolin@yale.edu'],
 [<Element 'clinical_study' at 0x00000278692BB4F8>,
  <Element 'PubmedArticle' at 0x000002785CA25688>,
  'deutsch',
  's',
  'steven',
  'washington d.c. veterans affairs medical center',
  '',
  'deutsch',
  's',
  'steven',
  'department of bioengineering, the pennsylvania state university, university park, pennsylvania 16802, usa.',
  ''],
 [<Element 'clinical_study' at 0x00000278692C0D18>,
  <Element 'PubmedAr

Now we will add the comparison features to the matrix (first-name equality, organization similarity, e-mail equality), together with the common answer from the gold standard.
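
To make the organization feature concrete before the next cell: the score is the fraction of the words in the first string that also appear in the second. The sketch below reproduces that idea with a set lookup. The name token_overlap is a hypothetical stand-in for the notebook's get_similarity, and the two strings are the organizations of one pair from the data (already stripped of punctuation).

```python
def token_overlap(phrase1, phrase2):
    """Fraction of the words in phrase1 that also occur in phrase2."""
    words1 = phrase1.split(" ")
    words2 = set(phrase2.split(" "))
    return sum(word in words2 for word in words1) / len(words1)

ct_org = "university of rochester"
pm_org = "university of california irvine ca usa"

print(token_overlap(ct_org, pm_org))  # "university" and "of" are shared: 2 of 3 words
print(token_overlap(pm_org, ct_org))  # the same 2 shared words, but out of 6
```

Because only the word count of the first argument is in the denominator, swapping the arguments changes the score. Common words such as "of" also count as matches.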

In [25]:
def get_common_answer(CT_ID, df):
    index = df.index[df['CT'] == CT_ID].tolist()[0]
    common_answer = df.iloc[index]['CommonAnswer']
    return common_answer

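def get_similarity(phrase1, phrase2):
    # Fraction of the words in phrase1 that also appear in phrase2 (from 0.0 to 1.0).
    # The measure is not symmetric: only the length of phrase1 is in the denominator.
    words1 = phrase1.split(" ")
    words2 = phrase2.split(" ")
    counter = 0
    for word1 in words1:
        for word2 in words2:
            if(word1 == word2):
                counter += 1
                break
    return counter/len(words1)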

for i in range(len(CT_PMID)):
    
    # First name check
    if(CT_PMID[i][4] == CT_PMID[i][4 + elements_in_clinical_trial]):
        first_name_equality = 1
    else:
        first_name_equality = 0
    
    # Organization similarity (from 0.0 to 1.0)
    # Let's remove punctuation to make the comparison easier
    CT_PMID[i][5] = CT_PMID[i][5].replace('.', '').replace(',', '').replace(';', '').replace('-', ' ')
    CT_PMID[i][5 + elements_in_clinical_trial] = CT_PMID[i][5 + elements_in_clinical_trial].replace('.', '').replace(',', '').replace(';', '').replace('-', ' ')
    
    organization_similarity = get_similarity(CT_PMID[i][5], CT_PMID[i][5 + elements_in_clinical_trial])
    #print(organization_similarity, CT_PMID[i][5], '2222', CT_PMID[i][10])
    print(i,CT_PMID[i][10])
    
    # e-mail check (two empty strings also compare equal, so a pair with no e-mail on either side gets 1)
    if(CT_PMID[i][6] == CT_PMID[i][6 + elements_in_clinical_trial]):
        email_equality = 1
    else:
        email_equality = 0
        
    # I get the common answer from the gold standard
    common_answer = get_common_answer(recursive_node_search(CT_PMID[i][0],['id_info','nct_id']).text, df)
    
    # I add everything to the matrix
    CT_PMID[i].extend((first_name_equality, organization_similarity, email_equality, common_answer))

0 
1 yale university school of medicine substance abuse center 34 park st new haven ct 06519 usa arthurmargolin@yaleedu
2 department of bioengineering the pennsylvania state university university park pennsylvania 16802 usa
3 department of health services administration university of alabama at birmingham 1675 university boulevard room 544 birmingham al 35294 3361 usa eberner@uabedu
4 diabetes endocrinology and obesity branch national institute of diabetes and digestive and kidney diseases national institutes of health bethesda maryland 20892 usa phillipg@intraniddknihgov
5 division of medical oncology new york university school of medicine nyu cancer institute ny 10016 usa howardhochster@mednyuedu
6 duke university medical center durham north carolina usa
7 karen t brown richard k do mithat gonen anne m covey george i getrajdman constantinos t sofocleous william r jarnagin michael i d'angelica peter j allen joseph p erinjeri lynn a brody gerald p o'neill kristian n johnson alessandra 

269 service de chirurgie vasculaire hôpital de la timone marseille france smalikov@yahoocom
270 shoklo malaria research unit mae sot tak thailand faculty of tropical medicine mahidol university bangkok thailand nuffield department of clinical medicine centre for clinical vaccinology and tropical medicine university of oxford oxford united kingdom electronic address: francois@tropmedresac
271 anaestesiologisk intensiv afdeling v odense universitetshospital dk 5000 odense c denmark palletoft@ouhregionsyddanmarkdk
272 division of hematology & oncology department of internal medicine uc davis cancer center university of california davis sacramento california 95817 usa
273 department of symptom research m d anderson cancer center houston texas usa
274 tcm center for aids prevention and treatment china academy of chinese medical sciences beijing 100700 china
275 department of cystic fibrosis royal brompton hospital london sw3 6np uk
276 department of obstetrics and gynecology hadassah hebrew

558 wilmer eye institute johns hopkins university school of medicine baltimore maryland usa
559 department of anesthesiology university of pittsburgh medical center pittsburgh pennsylvania usa
560 centre for research in neurodegenerative diseases and division of neurology department of medicine toronto western hospital university of toronto ontario canada
561 from the wellcome trust medical research council institute of metabolic science university of cambridge (zas mew rh hrm) and wolfson diabetes and endocrine clinic cambridge university hospitals nhs foundation trust (sh ds hrm) cambridge the elsie bertram diabetes centre (rct hrm) and the department of obstetrics and gynaecology (kps) norfolk and norwich university hospitals nhs foundation trust and the norwich medical school university of east anglia (hrm) norwich the ipswich diabetes centre ipswich hospital nhs trust ipswich (gr) and the division of epidemiology and biostatistics leeds institute of cardiovascular and metabolic me

808 department of rehabilitation medicine faculty of medicine siriraj hospital mahidol university bangkok 10700 thailand
809 cardiology department "bagdasar arseni" emergency hospital bucharest romania
810 1 division of endocrinology diabetes metabolism and nutrition mayo clinic rochester minnesota usa
811 division of orthopaedic surgery london health sciences centre university hospital 339 windermere road london on n6a 5a5 canada
812 department of neurology oregon health & science university portland or usa parkinson's disease research education and clinical center portland veterans affairs medical center portland or usa
813 1 health literacy and learning program division of general internal medicine northwestern university chicago il 2 division of transplantation department of surgery emory university atlanta ga 3 division of gastroenterology and hepatology university of pennsylvania philadelphia pa
814 department of surgical gastroenterology university of copenhagen herlev hospital 

1090 
1091 deansley centre royal wolverhampton hospital wednesfield road wolverhampton wv10 0qp uk
1092 nsabp headquarters pittsburgh pa 15261
1093 temple university school of pharmacy philadelphia pa usa
1094 center for hematologic malignancies knight cancer institute oregon health & science university portland or usa chenan@ohsuedu
1095 laboratoire de physique des particules in2p3/cnrs et université de savoie annecy le vieux france
1096 division of neonatology department of pediatrics s orsola malpighi university hospital bologna italy
1097 national institute for medical research university of pierre and marie curie paris france
1098 division of rheumatology department of medicine the university of western ontario london canada janetpope@sjhclondononca
1099 klinik für unfall  und wiederherstellungschirurgie bg unfallklinik tübingen
1100 department of dermatology university at buffalo buffalo ny usa
1101 china military institute of chinese materia medica 302 military hospital of china

Let's now convert CT_PMID to a pandas DataFrame, which is easier to work with, and give its columns descriptive names.

In [26]:
data_frame = pd.DataFrame(CT_PMID)
data_frame = data_frame.drop([0,1], axis = 1)
data_frame.columns = ['last_name_CT', 'first_name_initial_CT', 'first_name_CT', 'organization_CT', 'mail_CT', 'last_name_PM', 'first_name_initial_PM', 'first_name_PM', 'organization_PM', 'mail_PM', 'first_name_equality', 'organization_equality','email_equality', 'common_answer']
data_frame

Unnamed: 0,last_name_CT,first_name_initial_CT,first_name_CT,organization_CT,mail_CT,last_name_PM,first_name_initial_PM,first_name_PM,organization_PM,mail_PM,first_name_equality,organization_equality,email_equality,common_answer
0,abdulkarim,b,bassam,ahs cancer control alberta,,abdulkarim,b,b,,,0,0.000000,1,1
1,gawin,f,frank,friends research institute inc,,gawin,f,frank,yale university school of medicine substance a...,arthur.margolin@yale.edu,1,0.000000,0,1
2,deutsch,s,steven,washington dc veterans affairs medical center,,deutsch,s,steven,department of bioengineering the pennsylvania ...,,1,0.000000,1,1
3,maisiak,r,richard s,university of alabama at birmingham,,maisiak,r,richard s,department of health services administration u...,eberner@uab.edu,1,1.000000,0,1
4,gorden,p,phillip,national institute of diabetes and digestive a...,,gorden,p,phillip,diabetes endocrinology and obesity branch nati...,phillipg@intra.niddk.nih.gov,1,0.900000,0,1
5,hochster,h,howard,eastern cooperative oncology group,,hochster,h,howard,division of medical oncology new york universi...,howard.hochster@med.nyu.edu,1,0.250000,0,1
6,fabian,c,carol,university of kansas medical center,bkimler@kumc.edu,fabian,c,carol,duke university medical center durham north ca...,,1,0.600000,0,1
7,dematteo,r,ronald,american college of surgeons,,dematteo,r,ronald,karen t brown richard k do mithat gonen anne m...,brown6@mskcc.org,1,0.750000,0,1
8,klein,j,jonathan d,university of rochester,,klein,j,julius,university of california irvine ca usa,marie-helene.milot@usherbrooke.ca,0,0.666667,0,0
9,dispenzieri,a,angela,mayo clinic,,dispenzieri,a,a,division of hematology mayo clinic and mayo fo...,,0,1.000000,1,1


Let's visualize how many pairs were labelled yes and how many were labelled no.

In [27]:
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected = True)

total = len(data_frame)
yes_percentage = data_frame['common_answer'].sum() / total
no_percentage = (len(data_frame) - data_frame['common_answer'].sum()) / total

data = [go.Bar(
            x = ['Yes', 'No'],
            y = [data_frame['common_answer'].sum(), len(data_frame) - data_frame['common_answer'].sum()]
    )]

layout = go.Layout(
    title = 'For this pair, is this the same scientist?',
    xaxis = dict(
        title = 'common answer'
    ),
    yaxis = dict(
        title = 'number of answers'
    )
)

fig = go.Figure(data = data, layout = layout)
iplot(fig)

print('fraction of yes:',yes_percentage)
print('fraction of no:',no_percentage)

fraction of yes: 0.7506584723441615
fraction of no: 0.24934152765583845
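
The cell above reports the two shares as fractions of 1. A minimal sketch of printing them as percentages follows. It uses made-up toy labels rather than the real common_answer column, and the helper name class_shares is an assumption; on a pandas column, value_counts(normalize = True) gives the same fractions.

```python
from collections import Counter

def class_shares(labels):
    """Map each label to its share of the total (a fraction between 0 and 1)."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

toy_labels = [1, 1, 1, 0]   # made-up labels: 1 = yes, 0 = no
shares = class_shares(toy_labels)

# The '%' format type multiplies by 100 and appends the percent sign
print('percentage of yes: {:.1%}'.format(shares[1]))
print('percentage of no: {:.1%}'.format(shares[0]))
```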


Now let's split the dataframe (data_frame) into X and Y.

X will hold the attributes, and Y will record whether the scientist in that pair is the same person (1) or not (0).

We will use X to predict Y.

In [28]:
X = data_frame.drop(['last_name_CT', 'first_name_initial_CT', 'first_name_CT', 'organization_CT', 'mail_CT', 'last_name_PM', 'first_name_initial_PM', 'first_name_PM', 'organization_PM', 'mail_PM', 'common_answer'], axis = 1)
Y = data_frame['common_answer']
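# Show the pairs where both e-mail addresses are present:
# clinical trial e-mail, PubMed e-mail, e-mail equality flag
for i in range(len(CT_PMID)):
    if(CT_PMID[i][6] != "" and CT_PMID[i][11] != ""):
        print(CT_PMID[i][6],CT_PMID[i][11],CT_PMID[i][14])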

hfine@mail.nih.gov jraizer@nmff.org 0
ubraun@bcm.edu ubraun@bcm.edu 1
jennifer.hagman@childrenscolorado.org Guido.Frank@ucdenver.edu 0
jalid.sehouli@charite.de klapdor.ruediger@mh-hannover.de 0
jleventh@nmh.org a-tambur@northwestern.edu 0
leve@uoregon.edu misaki.natsuaki@ucr.edu 0
scott.powers@cchmc.org spatton2@kumc.edu 0
Said.Ibrahim2@va.gov ola.rolfson@vgregion.se 0
andrea.l.harzstark@kp.org slovins@mskcc.org 0
bmgulluoglu@marmara.edu.tr makkiprik@marmara.edu.tr 0
John.Wagner@jefferson.edu jacob117@umn.edu 0
irene_ghobrial@dfci.harvard.edu cnathan@med.cornell.edu 0
Jonathan_Friedberg@urmc.rochester.edu Carla_Casulo@URMC.Rochester.edu 0
seiferheldw@nrgoncology.org phbrown@mdanderson.org 0
fuad.shihab@hsc.utah.edu anthony.langone@vanderbilt.edu 0
chapmanp@mskcc.org m.sime@beatson.gla.ac.uk 0
alberto.chiappori@moffitt.org gerold.bepler@moffitt.org 0
hherf@med.unc.edu g.dhaens@amc.uva.nl 0
john.sampson@duke.edu vrede001@mc.duke.edu 0
kyna@snubh.org miwang@mdanderson.org 0
randerson@nono

We will split X and Y into a training set and a test set.

In [29]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 5)
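
Since roughly three quarters of the pairs are labelled yes, one option worth considering is a stratified split, which keeps that ratio the same in both subsets. This is a sketch on made-up toy data, not the split used in this notebook; the stratify argument of train_test_split is the only change compared with the call above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 100 pairs, 75 labelled 1 (yes) and 25 labelled 0 (no)
X_toy = np.arange(100).reshape(-1, 1)
Y_toy = np.array([1] * 75 + [0] * 25)

# stratify = Y_toy asks for the same yes/no ratio in the training and test sets
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_toy, Y_toy, test_size = 0.2, random_state = 5, stratify = Y_toy
)

print(len(Y_te), int(Y_te.sum()))   # size of the test set and number of yes labels in it
```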

In [30]:
# normalizing

#sc = StandardScaler()

#X_train = sc.fit_transform(X_train)
#X_test = sc.transform(X_test)

In [31]:
# Random Forest

rfc = RandomForestClassifier(n_estimators = 500)
rfc.fit(X_train, Y_train)
pred_rfc = rfc.predict(X_test)

# Let's see how well it has done
print(classification_report(Y_test, pred_rfc))
print(confusion_matrix(Y_test, pred_rfc))

# Let's look at the pairs that were misclassified
res = Y_test != pred_rfc

for index in np.where(res):
    print(X_test.iloc[index])
    print('predicted :',pred_rfc[index])
    print('real result :',Y_test.iloc[index])

             precision    recall  f1-score   support

          0       0.67      0.73      0.70        51
          1       0.92      0.90      0.91       177

avg / total       0.86      0.86      0.86       228

[[ 37  14]
 [ 18 159]]
      first_name_equality  organization_equality  email_equality
117                     0               0.125000               0
956                     1               0.333333               1
540                     0               0.666667               0
267                     1               0.000000               1
634                     1               1.000000               1
623                     1               0.000000               1
747                     0               0.400000               0
475                     1               0.000000               1
984                     0               0.200000               1
1052                    0               0.000000               1
1066                    1               0.50000

In [32]:
#SVM Classifier

clf = svm.SVC()
clf.fit(X_train, Y_train)
pred_clf = clf.predict(X_test)

# Let's see how well it has done
print(classification_report(Y_test, pred_clf))
print(confusion_matrix(Y_test, pred_clf))

# Let's look at the pairs that were misclassified
res = Y_test != pred_clf

for index in np.where(res):
    print(X_test.iloc[index])
    print('predicted :',pred_clf[index])
    print('real result :',Y_test.iloc[index])

             precision    recall  f1-score   support

          0       0.66      0.76      0.71        51
          1       0.93      0.89      0.91       177

avg / total       0.87      0.86      0.86       228

[[ 39  12]
 [ 20 157]]
      first_name_equality  organization_equality  email_equality
322                     0               0.666667               0
117                     0               0.125000               0
956                     1               0.333333               1
267                     1               0.000000               1
634                     1               1.000000               1
623                     1               0.000000               1
475                     1               0.000000               1
501                     0               0.500000               0
984                     0               0.200000               1
1052                    0               0.000000               1
580                     0               0.50000

In [33]:
# Neural Network

mlpc = MLPClassifier(hidden_layer_sizes = (11,11,11), max_iter = 5000)
mlpc.fit(X_train, Y_train)
pred_mlpc = mlpc.predict(X_test)

# Let's see how well it has done
print(classification_report(Y_test, pred_mlpc))
print(confusion_matrix(Y_test, pred_mlpc))

# Let's look at the pairs that were misclassified
res = Y_test != pred_mlpc

for index in np.where(res):
    print(X_test.iloc[index])
    print('predicted :',pred_mlpc[index])
    print('real result :',Y_test.iloc[index])

             precision    recall  f1-score   support

          0       0.67      0.65      0.66        51
          1       0.90      0.91      0.90       177

avg / total       0.85      0.85      0.85       228

[[ 33  18]
 [ 16 161]]
      first_name_equality  organization_equality  email_equality
234                     0               0.500000               1
322                     0               0.666667               0
117                     0               0.125000               0
956                     1               0.333333               1
267                     1               0.000000               1
634                     1               1.000000               1
623                     1               0.000000               1
747                     0               0.400000               0
475                     1               0.000000               1
445                     0               0.500000               1
984                     0               0.20000
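
To put the three reports side by side, the sketch below recomputes accuracy and the recall on the "no" class directly from the confusion matrices printed above, and compares them with the accuracy of always answering yes on the same test set. The matrices are copied from the outputs; the helper names are assumptions made for illustration.

```python
# Confusion matrices copied from the three cells above
# (rows: true class 0/1, columns: predicted class 0/1)
matrices = {
    'random forest': [[37, 14], [18, 159]],
    'svm': [[39, 12], [20, 157]],
    'neural network': [[33, 18], [16, 161]],
}

def accuracy(cm):
    # Correct predictions sit on the diagonal
    total = sum(sum(row) for row in cm)
    return (cm[0][0] + cm[1][1]) / total

def recall_of_no(cm):
    # Share of the true "no" pairs (first row) that were predicted as "no"
    return cm[0][0] / sum(cm[0])

def always_yes_accuracy(cm):
    # Accuracy of a trivial model that answers 1 for every pair
    total = sum(sum(row) for row in cm)
    return sum(cm[1]) / total

for name, cm in matrices.items():
    print('{}: accuracy {:.3f}, recall on no {:.3f}, always-yes accuracy {:.3f}'.format(
        name, accuracy(cm), recall_of_no(cm), always_yes_accuracy(cm)))
```

The always-yes figure is the same for all three models because it depends only on the test labels (177 yes out of 228).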

In [34]:
np.where(df['1']!=df['CommonAnswer'])

(array([ 253,  299,  337,  350,  356,  411,  438,  494,  517,  618,  632,
         882,  884,  898,  915, 1134], dtype=int64),)

In [35]:
np.where(df['2']!=df['CommonAnswer'])

(array([   2,    7,   17,   23,   31,   54,   73,   74,   82,   94,  136,
         140,  225,  243,  246,  266,  291,  297,  357,  360,  363,  372,
         395,  417,  477,  478,  510,  543,  544,  559,  568,  587,  596,
         599,  602,  615,  626,  644,  658,  692,  699,  716,  739,  749,
         750,  753,  785,  795,  796,  810,  826,  839,  858,  872,  876,
         878,  892,  942,  943,  949,  965,  980,  983,  994, 1017, 1047,
        1060, 1073, 1083, 1087, 1091, 1122, 1136, 1156], dtype=int64),)

In [36]:
np.where(df['3']!=df['CommonAnswer'])

(array([ 83,  89, 172, 230, 236, 298, 352, 353, 373, 377, 408, 409, 530,
        591, 606, 650, 651, 708, 859, 894, 902, 997], dtype=int64),)

In [37]:
CT_PMID

[[<Element 'clinical_study' at 0x000002785A2BD3B8>,
  <Element 'PubmedArticle' at 0x000002785A2A0F98>,
  'abdulkarim',
  'b',
  'bassam',
  'ahs cancer control alberta',
  '',
  'abdulkarim',
  'b',
  'b',
  '',
  '',
  0,
  0.0,
  1,
  1],
 [<Element 'clinical_study' at 0x000002786737C6D8>,
  <Element 'PubmedArticle' at 0x000002785CA11908>,
  'gawin',
  'f',
  'frank',
  'friends research institute inc',
  '',
  'gawin',
  'f',
  'frank',
  'yale university school of medicine substance abuse center 34 park st new haven ct 06519 usa arthurmargolin@yaleedu',
  'arthur.margolin@yale.edu',
  1,
  0.0,
  0,
  1],
 [<Element 'clinical_study' at 0x00000278692BB4F8>,
  <Element 'PubmedArticle' at 0x000002785CA25688>,
  'deutsch',
  's',
  'steven',
  'washington dc veterans affairs medical center',
  '',
  'deutsch',
  's',
  'steven',
  'department of bioengineering the pennsylvania state university university park pennsylvania 16802 usa',
  '',
  1,
  0.0,
  1,
  1],
 [<Element 'clinical_st

OK, now that we have a baseline, let's build a classifier that can get better results.