# Open and Load the MeSH XML file

The latest file is desc2023.xml. Download it to your working directory.

We want to extract the MeSH heading (name), the unique ID (descriptor UI), the tree numbers, and any associated entry terms.

For example: https://www.ncbi.nlm.nih.gov/mesh/?term=D065626

name:     Non-alcoholic Fatty Liver Disease\
descriptorui:   D065626\
treenumber: C06.552.241.519\
entry terms: Non alcoholic Fatty Liver Disease\
            NAFLD\
            Nonalcoholic Fatty Liver Disease\
            Fatty Liver, Nonalcoholic\
            Fatty Livers, Nonalcoholic\
            Liver, Nonalcoholic Fatty\
            Livers, Nonalcoholic Fatty\
            Nonalcoholic Fatty Liver\
            Nonalcoholic Fatty Livers\
            Nonalcoholic Steatohepatitis\
            Nonalcoholic Steatohepatitides\
            Steatohepatitides, Nonalcoholic\
            Steatohepatitis, Nonalcoholic

We will make use of the entry terms to find the descriptorui, name, treenumber, and terms.

Since this task is to identify disease indications (as opposed to parts of the body, etc.), we'll limit the extracted data to descriptors with at least one tree number under the treetops ['C', 'F'], where C = Diseases and F = Psychiatry and Psychology.
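
A minimal sketch of that filter, assuming every tree number begins with a single category letter (as in C06.552.241.519). The helper name `in_treetops` is hypothetical and not part of the notebook code, and the second call uses made-up tree numbers.

```python
TREETOPS = ('C', 'F')  # C = Diseases, F = Psychiatry and Psychology

def in_treetops(tree_numbers, treetops=TREETOPS):
    """Return the tree numbers whose leading category letter is in `treetops`.

    Hypothetical helper for illustration; `tree_numbers` may be None or empty.
    """
    return [tn for tn in (tree_numbers or []) if tn.startswith(tuple(treetops))]

print(in_treetops(['C06.552.241.519']))     # tree number from the NAFLD example
print(in_treetops(['A01.111', 'F03.222']))  # made-up values: only the F branch is kept
print(in_treetops(None))                    # a descriptor without tree numbers
```

A descriptor would then be kept whenever this list is non-empty.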



In [5]:
# cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically.
from xml.etree import ElementTree as elemtree
from datetime import date

"""
Use this to parse XML from MeSH (Medical Subject Headings). More information 
on the format at: http://www.ncbi.nlm.nih.gov/mesh
End users will primarily want to call the `parse_mesh` function and do something
with the output.
"""

def parse_mesh(filename):
    """Parse a mesh file, successively generating
    `DescriptorRecord` instance for subsequent processing."""
    for _evt, elem in elemtree.iterparse(filename):
        if elem.tag == 'DescriptorRecord':
            yield DescriptorRecord.from_xml_elem(elem)

def date_from_mesh_xml(xml_elem):
    year = xml_elem.find('./Year').text
    month = xml_elem.find('./Month').text
    day = xml_elem.find('./Day').text
    return date(int(year), int(month), int(day))

class PharmacologicalAction(object):
    """A pharmacological action, denoting the effects of a MeSH descriptor."""
    
    def __init__(self, descriptor_ui):
        self.descriptor_ui = descriptor_ui
    
    @classmethod
    def from_xml_elem(cls, elem):
        # Store the UI string itself rather than the XML element (None if absent).
        descriptor_ui = elem.findtext('./DescriptorReferredTo/DescriptorUI')
        return cls(descriptor_ui)

class SlotsToNoneMixin(object):
    def __init__(self, **kwargs):
        for attr in self.__slots__:
            setattr(self, attr, kwargs.get(attr, None))
    
    def __repr__(self):
        attrib_repr = ', '.join(u'%s=%r' % (attr, getattr(self, attr)) for attr in self.__slots__)
        return self.__class__.__name__ + '(' + attrib_repr + ')'

class Term(SlotsToNoneMixin):
    """A term from within a MeSH concept."""

    __slots__ = ('term_ui', 'string', 'is_concept_preferred', 'is_record_preferred',
      'is_permuted', 'lexical_tag', 'date_created', 'thesaurus_list')
    
    @classmethod
    def from_xml_elem(cls, elem):
        term = cls()
        term.is_concept_preferred = elem.get('ConceptPreferredTermYN', None) == 'Y'
        term.is_record_preferred = elem.get('RecordPreferredTermYN', None) == 'Y'
        term.is_permuted = elem.get('IsPermutedTermYN', None) == 'Y'
        term.lexical_tag = elem.get('LexicalTag')
        for child_elem in elem:
            if child_elem.tag == 'TermUI':
                term.term_ui = child_elem.text
            elif child_elem.tag == 'String':
                term.string = child_elem.text
                #term.name = [th_elem.text for th_elem in child_elem]
            elif child_elem.tag == 'DateCreated':
                term.date_created = date_from_mesh_xml(child_elem)
            elif child_elem.tag == 'ThesaurusIDlist':
                term.thesaurus_list = [th_elem.text for th_elem in child_elem]
        return term

class SemanticType(SlotsToNoneMixin):
    __slots__ = ('ui', 'name')
    
    @classmethod
    def from_xml_elem(cls, elem):
        sem_type = cls()
        for child_elem in elem:
            if child_elem.tag == 'SemanticTypeUI':
                sem_type.ui = child_elem.text
            elif child_elem.tag == 'SemanticTypeName':
                sem_type.name = child_elem.text
        return sem_type

class Concept(SlotsToNoneMixin):
    """A concept within a MeSH Descriptor."""
    __slots__ = ( 'ui', 'name', 'is_preferred', 'umls_ui', 'casn1_name', 'registry_num', 
      'scope_note', 'sem_types', 'terms')
    
    @classmethod
    def from_xml_elem(cls, elem):
        concept = cls()
        concept.is_preferred = elem.get('PreferredConceptYN', None) == 'Y'
        for child_elem in elem:
            if child_elem.tag == 'ConceptUI':
                concept.ui = child_elem.text
            elif child_elem.tag == 'ConceptName':
                concept.name = child_elem.find('./String').text
            elif child_elem.tag == 'ConceptUMLSUI':
                concept.umls_ui = child_elem.text
            elif child_elem.tag == 'CASN1Name':
                concept.casn1_name = child_elem.text
            elif child_elem.tag == 'RegistryNumber':
                concept.registry_num = child_elem.text
            elif child_elem.tag == 'ScopeNote':
                concept.scope_note = child_elem.text
            elif child_elem.tag == 'SemanticTypeList':
                concept.sem_types = [SemanticType.from_xml_elem(st_elem)
                  for st_elem in child_elem.findall('SemanticType')]
            elif child_elem.tag == 'TermList':
                concept.terms = [Term.from_xml_elem(term_elem)
                  for term_elem in child_elem.findall('Term')]
        return concept

class DescriptorRecord(SlotsToNoneMixin):
    """A MeSH Descriptor Record."""
    
    __slots__ = ('ui', 'name', 'date_created', 'date_revised', 'pharm_actions', 
      'tree_numbers', 'concepts')
    
    @classmethod
    def from_xml_elem(cls, elem):
        rec = cls()
        for child_elem in elem:
            if child_elem.tag == 'DescriptorUI':
                rec.ui = child_elem.text
            elif child_elem.tag == 'DescriptorName':
                rec.name = child_elem.find('./String').text
            elif child_elem.tag == 'DateCreated':
                rec.date_created = date_from_mesh_xml(child_elem)
            elif child_elem.tag == 'DateRevised':
                rec.date_revised = date_from_mesh_xml(child_elem)
            elif child_elem.tag == 'TreeNumberList':
                rec.tree_numbers = [tn_elem.text
                  for tn_elem in child_elem.findall('TreeNumber')]
            elif child_elem.tag == 'ConceptList':
                rec.concepts = [Concept.from_xml_elem(c_elem) 
                  for c_elem in child_elem.findall('Concept')]
            elif child_elem.tag == 'PharmacologicalActionList':
                rec.pharm_actions = [PharmacologicalAction.from_xml_elem(pa_elem) 
                  for pa_elem in child_elem.findall('PharmacologicalAction')]
        return rec

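`data` is a generator that yields one `DescriptorRecord` at a time as the XML is parsed, so it can only be consumed once.

`xml_to_list` pulls out the desired elements of the XML for each descriptor (e.g. D065626) and appends one row per term to `res`.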
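
To illustrate the lazy behaviour, the sketch below applies the same `iterparse` pattern to a tiny in-memory document. The sample XML is hand-written and heavily abbreviated (the real file has many more elements per record, and the IDs here are placeholders), and `iter_descriptor_uis` is a hypothetical stand-in for `parse_mesh`.

```python
import io
from itertools import islice
from xml.etree import ElementTree

# Hand-written, abbreviated sample; not real MeSH content.
SAMPLE = b"""<DescriptorRecordSet>
  <DescriptorRecord><DescriptorUI>D000001</DescriptorUI></DescriptorRecord>
  <DescriptorRecord><DescriptorUI>D000002</DescriptorUI></DescriptorRecord>
  <DescriptorRecord><DescriptorUI>D000003</DescriptorUI></DescriptorRecord>
</DescriptorRecordSet>"""

def iter_descriptor_uis(source):
    """Yield one DescriptorUI each time iterparse finishes a record."""
    for _event, elem in ElementTree.iterparse(source):  # default: 'end' events
        if elem.tag == 'DescriptorRecord':
            yield elem.find('./DescriptorUI').text
            elem.clear()  # drop the record's children once it has been used

gen = iter_descriptor_uis(io.BytesIO(SAMPLE))
first_two = list(islice(gen, 2))  # pull only two records from the generator
print(first_two)
print(next(gen))                  # the generator resumes where it stopped
```

Clearing each element after use is an optional memory-saving step in this sketch; it is only safe once everything needed has been copied out of the element.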

In [8]:
import csv
import os

treetops = ['F','C']
counter=0

data_dir = 'C:\\Users\\Richard.Geoghegan\\Documents\\NLP\\MeSH'
data_file = 'desc2023.xml'
data_file = os.path.join(data_dir, data_file)

print('data file: {}'.format(data_file))
data=parse_mesh(data_file)

res=[]
#curr=[next(data)]

def xml_to_list(counter):
    """Append one [tree, ui, name, term_ui, term_string] row to `res` for every
    term of each descriptor that has a tree number under one of `treetops`."""
    for item in data:
        counter += 1
        # Keep only the tree numbers whose leading category letter is in treetops.
        tree = [tn for tn in (item.tree_numbers or []) if tn.startswith(tuple(treetops))]
        if not tree:
            continue  # descriptor is outside the C and F branches
        for j, concept in enumerate(item.concepts or []):
            for k, term in enumerate(concept.terms or []):
                res.append([tree, item.ui, item.name, term.term_ui, term.string])
                # Printed once per term when a kept record falls on a multiple
                # of 1000; the 0 is the (always zero) index of the record.
                if counter % 1000 == 0:
                    print('Progress report...', counter, 0, j, k)
    return res

res=xml_to_list(0)

data_file='mesh_terms_treetop_test'
data_file = os.path.join(data_dir, data_file+'.csv')
print('output file: {}'.format(data_file))

with open(data_file, 'w', newline='') as f:
    wr = csv.writer(f, delimiter ='|')
    wr.writerows(res)

data file: C:\Users\Richard.Geoghegan\Documents\NLP\MeSH\desc2023.xml
Progress report... 5000 0 0 0
Progress report... 5000 0 0 1
Progress report... 5000 0 0 2
Progress report... 5000 0 0 3
Progress report... 5000 0 0 4
Progress report... 5000 0 0 5
Progress report... 5000 0 0 6
Progress report... 5000 0 0 7
Progress report... 5000 0 0 8
Progress report... 5000 0 0 9
Progress report... 5000 0 1 0
Progress report... 5000 0 1 1
Progress report... 5000 0 1 2
Progress report... 5000 0 1 3
Progress report... 5000 0 1 4
Progress report... 5000 0 1 5
Progress report... 5000 0 1 6
Progress report... 5000 0 2 0
Progress report... 5000 0 2 1
Progress report... 5000 0 2 2
Progress report... 5000 0 2 3
Progress report... 5000 0 2 4
Progress report... 5000 0 2 5
Progress report... 5000 0 2 6
Progress report... 5000 0 3 0
Progress report... 5000 0 3 1
Progress report... 5000 0 3 2
Progress report... 5000 0 3 3
Progress report... 5000 0 3 4
Progress report... 5000 0 3 5
Progress report... 5000 0 4 0
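
Because the first field of each row is a Python list, `csv.writer` stores it in its string form (for example `['C06.552.241.519']`). A minimal sketch of reading such rows back, assuming the same `|` delimiter; the in-memory buffer stands in for the CSV file, and the single row is copied from the NAFLD example.

```python
import ast
import csv
import io

# One row in the shape built by xml_to_list: [tree, ui, name, term_ui, term_string].
rows = [[['C06.552.241.519'], 'D065626', 'Non-alcoholic Fatty Liver Disease',
         'T747417', 'NAFLD']]

buffer = io.StringIO()  # stands in for the CSV file on disk
csv.writer(buffer, delimiter='|').writerows(rows)

buffer.seek(0)
restored = []
for tree, ui, name, term_ui, term in csv.reader(buffer, delimiter='|'):
    # The tree column comes back as text; literal_eval rebuilds the list.
    restored.append([ast.literal_eval(tree), ui, name, term_ui, term])

print(restored == rows)
```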


Look at a few examples:

NASH = D065626\
Glioblastoma = D005909\
AIDS = D000163

In [60]:
uiInspect = 'D065626'
listInspect = [item for item in res if item[1]==uiInspect]
print('example: {}'.format(uiInspect))
print('treenumber: {}'.format(list(set([item[0] for item in [item[0] for item in listInspect]]))))
print('name: {}'.format(list(set([item for item in [item[2] for item in listInspect]]))))
for item in listInspect:
    print(' ... term: {} : {}' .format(item[-2], item[-1]))


example: D065626
treenumber: ['C06.552.241.519']
name: ['Non-alcoholic Fatty Liver Disease']
 ... term: T747320 : Non-alcoholic Fatty Liver Disease
 ... term: T747320 : Non alcoholic Fatty Liver Disease
 ... term: T747417 : NAFLD
 ... term: T747319 : Nonalcoholic Fatty Liver Disease
 ... term: T747321 : Fatty Liver, Nonalcoholic
 ... term: T747321 : Fatty Livers, Nonalcoholic
 ... term: T747321 : Liver, Nonalcoholic Fatty
 ... term: T747321 : Livers, Nonalcoholic Fatty
 ... term: T747321 : Nonalcoholic Fatty Liver
 ... term: T747321 : Nonalcoholic Fatty Livers
 ... term: T853030 : Nonalcoholic Steatohepatitis
 ... term: T853030 : Nonalcoholic Steatohepatitides
 ... term: T853030 : Steatohepatitides, Nonalcoholic
 ... term: T853030 : Steatohepatitis, Nonalcoholic
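
Going the other way, from an entry term to its descriptor, can be sketched as a dictionary built from rows shaped like those in `res`. The function name `build_term_index` and the lowercase normalisation are assumptions for illustration; the sample rows are copied from the output above.

```python
def build_term_index(rows):
    """Map each lowercased term string to (ui, name, tree numbers).

    Hypothetical helper; `rows` are [tree, ui, name, term_ui, term_string] lists.
    If the same string occurs under several descriptors, the last one wins.
    """
    index = {}
    for tree, ui, name, _term_ui, term in rows:
        index[term.lower()] = (ui, name, tree)
    return index

sample_rows = [
    [['C06.552.241.519'], 'D065626', 'Non-alcoholic Fatty Liver Disease',
     'T747417', 'NAFLD'],
    [['C06.552.241.519'], 'D065626', 'Non-alcoholic Fatty Liver Disease',
     'T853030', 'Nonalcoholic Steatohepatitis'],
]

index = build_term_index(sample_rows)
print(index['nafld'])
print(index.get('nonalcoholic steatohepatitis'))
print(index.get('not a mesh term'))  # unknown strings give None
```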
