# MeSH ID Parsing Code

MeSH database download: https://www.nlm.nih.gov/databases/download/mesh.html

2024 Descriptor MeSH XML: https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/desc2024.xml

Some observations about the XML structure:
1. Each record starts with its MeSH ID (`DescriptorUI`), followed by the entity name (`DescriptorName`).
2. Synonyms are called "Entry Terms" and appear near the end of each record, under `TermList`.
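To make the two observations concrete, here is a minimal sketch that parses one record. The `SAMPLE_XML` string is hand-written and heavily simplified (real records carry many more elements, and only two of the Temefos terms are included); it is an illustration, not a copy of the NLM file.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for one record of the descriptor file.
SAMPLE_XML = """
<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>D000002</DescriptorUI>
    <DescriptorName><String>Temefos</String></DescriptorName>
    <ConceptList>
      <Concept>
        <TermList>
          <Term><String>Temefos</String></Term>
          <Term><String>Temephos</String></Term>
        </TermList>
      </Concept>
    </ConceptList>
  </DescriptorRecord>
</DescriptorRecordSet>
"""

sample_root = ET.fromstring(SAMPLE_XML)
sample_record = sample_root.find('DescriptorRecord')

# Order of the direct children: ID first, then the name, terms last.
child_tags = [child.tag for child in sample_record]

sample_id = sample_record.find('DescriptorUI').text
sample_name = sample_record.find('DescriptorName/String').text
sample_terms = [s.text for s in sample_record.findall('.//TermList/Term/String')]

print(child_tags)
print(sample_id, sample_name, sample_terms)
```

In this sample the preferred name also shows up among the term strings, which is worth keeping in mind when collecting synonyms.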

In [12]:
import requests
import xml.etree.ElementTree as ET
import pandas as pd
import csv

## Parse MeSH XML into Lookup Table

In [3]:
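url = 'https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/desc2024.xml'

# Download the full descriptor file; the timeout keeps a stalled connection from hanging the cell
response = requests.get(url, timeout=60)
# Fail early on an HTTP error instead of parsing an error page
response.raise_for_status()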

In [4]:
root = ET.fromstring(response.content)
root

<Element 'DescriptorRecordSet' at 0x7f549ad88540>
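`ET.fromstring` builds the whole tree in memory at once. If memory becomes a concern, one alternative is to stream the records with `ET.iterparse`. The sketch below is an assumption about how that could look, not what this notebook does; `iter_descriptor_records` is a hypothetical helper, and the inline XML is a tiny hand-written stand-in for the real file.

```python
import io
import xml.etree.ElementTree as ET

def iter_descriptor_records(source):
    """Yield (mesh_id, name) per DescriptorRecord from a file path or binary file object."""
    for _event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'DescriptorRecord':
            mesh_id = elem.find('DescriptorUI').text
            name = elem.find('DescriptorName/String').text
            yield mesh_id, name
            # Drop the record's children once it has been processed
            elem.clear()

# Tiny hand-written stand-in for the descriptor file
demo_xml = (
    b"<DescriptorRecordSet>"
    b"<DescriptorRecord><DescriptorUI>D000001</DescriptorUI>"
    b"<DescriptorName><String>Calcimycin</String></DescriptorName></DescriptorRecord>"
    b"<DescriptorRecord><DescriptorUI>D000002</DescriptorUI>"
    b"<DescriptorName><String>Temefos</String></DescriptorName></DescriptorRecord>"
    b"</DescriptorRecordSet>"
)

streamed = list(iter_descriptor_records(io.BytesIO(demo_xml)))
print(streamed)
```

The same generator should work on a locally saved copy of the XML file by passing its path as `source`.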

In [5]:
# Parsing function

def parse_mesh_xml(root):
    """Map each MeSH descriptor ID to its preferred name plus all entry terms."""
    # Build the table locally instead of relying on a global defined in a later cell
    mesh_lookup = {}

    for record in root.findall('DescriptorRecord'):
        # MeSH descriptor ID (e.g. 'D000001')
        mesh_id = record.find('DescriptorUI').text

        # Preferred name of the descriptor
        name = record.find('DescriptorName/String').text

        # Start the list with the preferred name
        entry_terms = [name]

        # Append every term string (synonyms); the preferred name is also
        # listed under TermList, so it appears a second time
        for entry_term in record.findall('.//TermList/Term/String'):
            entry_terms.append(entry_term.text)

        # Store the MeSH ID along with all names and entry terms
        mesh_lookup[mesh_id] = entry_terms

    return mesh_lookup
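The printed lookup table lists each preferred name twice (for example `Temefos | Temefos | Temephos | ...`), because the name is added explicitly and then found again among the term strings. If the repeat is unwanted, a minimal sketch of an order-preserving de-duplication follows; `dedupe_terms` is a hypothetical helper, not part of the notebook.

```python
def dedupe_terms(terms):
    """Remove repeated terms while keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(terms))

raw_terms = ['Temefos', 'Temefos', 'Temephos', 'Difos', 'Abate']
clean_terms = dedupe_terms(raw_terms)
print(clean_terms)  # ['Temefos', 'Temephos', 'Difos', 'Abate']
```

Applying this to `entry_terms` before storing it would change the table contents, so the outputs shown in this notebook would no longer match exactly.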

In [10]:
mesh_lookup = {}

mesh_lookup_table = parse_mesh_xml(root)

for mesh_id, terms in list(mesh_lookup_table.items())[:25]:
    print(f"MeSH ID: {mesh_id}, Names: {' | '.join(terms)}")

MeSH ID: D000001, Names: Calcimycin | Calcimycin | 4-Benzoxazolecarboxylic acid, 5-(methylamino)-2-((3,9,11-trimethyl-8-(1-methyl-2-oxo-2-(1H-pyrrol-2-yl)ethyl)-1,7-dioxaspiro(5.5)undec-2-yl)methyl)-, (6S-(6alpha(2S*,3S*),8beta(R*),9beta,11alpha))- | A-23187 | A 23187 | Antibiotic A23187 | A23187, Antibiotic | A23187
MeSH ID: D000002, Names: Temefos | Temefos | Temephos | Difos | Abate
MeSH ID: D000003, Names: Abattoirs | Abattoirs | Abattoir | Slaughterhouses | Slaughterhouse | Slaughter Houses | House, Slaughter | Houses, Slaughter | Slaughter House
MeSH ID: D000004, Names: Abbreviations as Topic | Abbreviations as Topic | Acronyms as Topic
MeSH ID: D000005, Names: Abdomen | Abdomen | Abdomens
MeSH ID: D000006, Names: Abdomen, Acute | Abdomen, Acute | Abdomens, Acute | Acute Abdomen | Acute Abdomens
MeSH ID: D000007, Names: Abdominal Injuries | Abdominal Injuries | Injuries, Abdominal | Abdominal Injury | Injury, Abdominal
MeSH ID: D000008, Names: Abdominal Neoplasms | Abdominal Neop

In [13]:
# Save as csv

csv_file = 'mesh_lookup_table.csv'

with open(csv_file, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['MeSH ID', 'Names/Entry Terms'])
    
    for mesh_id, terms in mesh_lookup_table.items():
        writer.writerow([mesh_id, ' | '.join(terms)])

print(f"CSV file '{csv_file}' has been created.")

CSV file 'mesh_lookup_table.csv' has been created.
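Because the terms are joined with `' | '` into a single column, reading the file back requires splitting on the same delimiter. The sketch below shows the round trip with the standard `csv` module on an in-memory buffer; the two-entry `demo_lookup` is sample data taken from the output above, and the approach assumes no term contains the delimiter itself.

```python
import csv
import io

DELIMITER = ' | '

# Small sample in the same shape as mesh_lookup_table
demo_lookup = {
    'D000005': ['Abdomen', 'Abdomens'],
    'D000006': ['Abdomen, Acute', 'Acute Abdomen'],
}

# Write: one row per MeSH ID, terms joined into one cell
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['MeSH ID', 'Names/Entry Terms'])
for mesh_id, terms in demo_lookup.items():
    writer.writerow([mesh_id, DELIMITER.join(terms)])

# Read back: split the joined cell into a list again
buffer.seek(0)
reader = csv.DictReader(buffer)
restored = {
    row['MeSH ID']: row['Names/Entry Terms'].split(DELIMITER)
    for row in reader
}

print(restored == demo_lookup)
```

Commas inside a term (as in `Abdomen, Acute`) are handled by the CSV quoting, so only the pipe delimiter needs care.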


In [14]:
mesh_lookup_df = pd.read_csv('mesh_lookup_table.csv')
mesh_lookup_df.head(25)

,MeSH ID,Names/Entry Terms
0,D000001,Calcimycin | Calcimycin | 4-Benzoxazolecarboxy...
1,D000002,Temefos | Temefos | Temephos | Difos | Abate
2,D000003,Abattoirs | Abattoirs | Abattoir | Slaughterho...
3,D000004,Abbreviations as Topic | Abbreviations as Topi...
4,D000005,Abdomen | Abdomen | Abdomens
5,D000006,"Abdomen, Acute | Abdomen, Acute | Abdomens, Ac..."
6,D000007,Abdominal Injuries | Abdominal Injuries | Inju...
7,D000008,Abdominal Neoplasms | Abdominal Neoplasms | Ab...
8,D000009,Abdominal Muscles | Abdominal Muscles | Abdomi...
9,D000010,"Abducens Nerve | Abducens Nerve | Nerve, Abduc..."


In [16]:
mesh_lookup_df.shape

(30764, 2)
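The table maps an ID to its names. To go the other way, from a name or synonym to a MeSH ID, one option is an inverted index. This is a sketch under assumptions: `build_term_index` is a hypothetical helper, matching is case-insensitive by choice, and each term maps to a set of IDs in case a term is shared by more than one descriptor.

```python
from collections import defaultdict

def build_term_index(lookup):
    """Invert {mesh_id: [terms]} into {lowercased term: {mesh_ids}}."""
    index = defaultdict(set)
    for mesh_id, terms in lookup.items():
        for term in terms:
            index[term.casefold()].add(mesh_id)
    return index

# Small sample in the same shape as mesh_lookup_table
sample_lookup = {
    'D000005': ['Abdomen', 'Abdomen', 'Abdomens'],
    'D000006': ['Abdomen, Acute', 'Acute Abdomen'],
}

term_index = build_term_index(sample_lookup)
print(term_index['acute abdomen'])  # {'D000006'}
```

With the real table, `build_term_index(mesh_lookup_table)` would be the equivalent call.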