# MeSH-Based Semantic Normalization and Categorization

Notebook: MeSH_Keyword_and_Synonym_Matching.ipynb
Authors: Elizaveta Popova, Negin Babaiha
Institution: University of Bonn, Fraunhofer SCAI
Date: 09/04/2025

Description:
    This notebook processes the MeSH descriptor XML file (desc2025.xml) to support downstream tasks in
    semantic triple extraction and evaluation. It includes two major functionalities:

    1. Triple Entity Synonym Matching:
        - Extracts unique entities from GPT and CBM triples.
        - Normalizes and aligns them to MeSH descriptors using synonym lookup.
        - Output: 
            - `mesh_triples_synonyms.json`: entity → preferred MeSH term and synonyms
            - `unmatched_triples.json`: entities with no MeSH match

    2. Category Keyword Extraction:
        - Parses MeSH descriptors to identify terms related to six key pathophysiological categories.
        - Matches by case-insensitive substring search of seed keywords in each descriptor's name and entry terms.
        - Output: `mesh_category_terms.json` (category → term list).

    These outputs are used for normalization, comparison, and evaluation in BioBERT-based triple matching.
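
As a rough illustration of how the synonym file might be consumed downstream, the sketch below maps a triple entity to its preferred MeSH label before comparison. The record layout (`normalized`, `uid`, `preferred`, `synonyms`) mirrors what the first cell writes; the sample record and the helper name `to_preferred` are hypothetical and only stand in for the real file contents.

```python
# Hypothetical in-memory stand-in for the contents of mesh_triples_synonyms.json.
# Keys are lowercased, stripped entity strings, as produced by the first cell.
sample_mapping = {
    "sars-cov-2 infection": {
        "normalized": "covid-19",
        "uid": "COVID",
        "preferred": "COVID-19",
        "synonyms": ["covid", "covid-19", "sars-cov-2"],
    },
}


def to_preferred(entity, mapping):
    """Return the preferred MeSH label, or the cleaned entity if it is unmapped."""
    key = entity.lower().strip()
    record = mapping.get(key)
    return record["preferred"] if record else key


print(to_preferred("SARS-CoV-2 infection", sample_mapping))
print(to_preferred(" Brain Fog ", sample_mapping))
```

Falling back to the cleaned entity keeps unmatched terms comparable as plain strings; whether that is the right fallback depends on the downstream evaluation.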



## MeSH for Triples

In [4]:
# === Map Triple Entities to MeSH Synonyms with COVID Normalization (Exact Match) ===
"""
Loads subject/object terms from GPT and CBM triples, normalizes them, and maps them to MeSH descriptors.
All COVID-related entities are explicitly normalized to 'covid-19' based on exact string matches only.

Outputs:
    - mesh_triples_synonyms.json
"""

import pandas as pd
import xml.etree.ElementTree as ET
import json
import re

# === Step 1: Load triples from Excel files ===
file_paths = [
    "../data/gold_standard_comparison/Triples_CBM_Gold_Standard.xlsx",
    "../data/gold_standard_comparison/Triples_GPT_for_comparison.xlsx"
]

all_subjects = set()
all_objects = set()

for path in file_paths:
    df = pd.read_excel(path)
    if 'Subject' in df.columns and 'Object' in df.columns:
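        # Cast to str so that numeric cells do not turn into NaN under the .str accessor
        subjects = df['Subject'].dropna().astype(str).str.lower().str.strip()
        objects = df['Object'].dropna().astype(str).str.lower().str.strip()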
        all_subjects.update(subjects)
        all_objects.update(objects)

raw_entities = sorted(all_subjects.union(all_objects))
print(f"Collected {len(raw_entities)} unique entities from triples.")

# === Normalize entities ===
def normalize_entity(text):
    """Replace underscores with spaces, collapse whitespace, strip, and lowercase."""
    text = text.replace("_", " ")
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

normalized_mapping = {entity: normalize_entity(entity) for entity in raw_entities}
normalized_entities = sorted(set(normalized_mapping.values()))

# === Step 2: Parse MeSH XML descriptors ===
mesh_path = "../data/MeSh_data/desc2025.xml"
tree = ET.parse(mesh_path)
root = tree.getroot()

descriptor_synonyms = {}

for descriptor in root.findall(".//DescriptorRecord"):
    descriptor_ui = descriptor.findtext("./DescriptorUI")
    descriptor_name = descriptor.findtext("./DescriptorName/String")
    if not descriptor_name:
        continue
    descriptor_name_norm = normalize_entity(descriptor_name)

    synonyms = set()
    synonyms.add(descriptor_name_norm)

    for term in descriptor.findall(".//TermList/Term/String"):
        if term.text:
            synonyms.add(normalize_entity(term.text))

    descriptor_synonyms[descriptor_name_norm] = {
        "uid": descriptor_ui,
        "preferred": descriptor_name,
        "synonyms": sorted(synonyms)
    }

print(f"Loaded {len(descriptor_synonyms)} descriptors from MeSH.")

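# === Step 3: Build reverse synonym lookup ===
# Note: if two descriptors share a normalized synonym, the descriptor that
# appears later in the XML file overwrites the earlier one.
synonym_lookup = {}
for descriptor_data in descriptor_synonyms.values():
    for synonym in descriptor_data["synonyms"]:
        synonym_lookup[synonym] = descriptor_data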

# === Step 4: Match entities to MeSH and normalize COVID terms by exact match ===
entity_to_mesh = {}
unmatched = []

# Define COVID override terms (exact match only, all lowercased and normalized)
covid_keywords = [
    "covid", "covid-19", "covid 19", "sars cov 2", "covid19",
    "sars-cov-2", "sars-cov-2 infection", "covid-19 infection", "covid-19_infection", "neurocovid",
    "replicated severe acute respiratory syndrome coronavirus 2", "sars-cov-2 virus",
    "severe acute respiratory syndrome coronavirus 2", "severe acute respiratory syndrome coronavirus"
]
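# Normalize and deduplicate: "covid-19_infection" and "covid-19 infection"
# collapse to the same string after normalization.
covid_keywords = sorted({normalize_entity(term) for term in covid_keywords})
covid_keyword_set = set(covid_keywords)

def is_exact_covid_match(text):
    """Return True if the normalized text is exactly one of the COVID override terms."""
    return text in covid_keyword_set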

for original, normalized in normalized_mapping.items():
    if is_exact_covid_match(normalized):
        entity_to_mesh[original] = {
            "normalized": "covid-19",
            "uid": "COVID",  # placeholder label, not a MeSH DescriptorUI
            "preferred": "COVID-19",
            "synonyms": covid_keywords
        }
    elif normalized in synonym_lookup:
        match = synonym_lookup[normalized]
        entity_to_mesh[original] = {
            "normalized": normalized,
            "uid": match["uid"],
            "preferred": match["preferred"],
            "synonyms": match["synonyms"]
        }
    else:
        unmatched.append(original)

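# === Step 5: Save results ===
with open("../data/MeSh_data/mesh_triples_synonyms.json", "w", encoding="utf-8") as f:
    json.dump(entity_to_mesh, f, indent=2, ensure_ascii=False)

# The notebook description also lists unmatched_triples.json as an output.
# Assumption: it is written next to the synonym file.
with open("../data/MeSh_data/unmatched_triples.json", "w", encoding="utf-8") as f:
    json.dump(sorted(unmatched), f, indent=2, ensure_ascii=False)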

print(f"Found MeSH matches (incl. exact COVID overrides) for {len(entity_to_mesh)} entities.")
print(f"Still unmatched: {len(unmatched)} entities.")


Collected 1574 unique entities from triples.
Loaded 30956 descriptors from MeSH.
Found MeSH matches (incl. exact COVID overrides) for 437 entities.
Still unmatched: 1137 entities.
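
With 1137 of 1574 entities unmatched, exact lookup leaves roughly 72% of the entities without a MeSH descriptor. One possible follow-up, which is not part of this notebook, is to propose near matches for manual review. The sketch below uses `difflib.get_close_matches` from the standard library; the helper name `suggest_mesh_candidates`, the cutoff of 0.85, and the toy synonym list are assumptions for illustration only.

```python
from difflib import get_close_matches


def suggest_mesh_candidates(entity, synonym_keys, n=3, cutoff=0.85):
    """Return up to n synonym strings whose similarity to entity is at least cutoff."""
    return get_close_matches(entity, synonym_keys, n=n, cutoff=cutoff)


# Toy stand-in for the keys of synonym_lookup (hypothetical values).
toy_synonyms = ["anosmia", "encephalitis", "neuroinflammatory diseases"]

print(suggest_mesh_candidates("anosmias", toy_synonyms))
print(suggest_mesh_candidates("brain fog", toy_synonyms))
```

Suggestions of this kind would need manual review before being accepted, and comparing every unmatched entity against all MeSH synonyms this way is slow, so it is better suited to a one-off audit than to the main pipeline.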


## MeSH for Categories

In [1]:
# === MeSH-Based Category Keyword Extraction ===
"""
Extracts MeSH terms grouped by conceptual categories relevant to COVID-19 and neurodegeneration.
Uses predefined seed keywords to identify relevant descriptors.
"""

import xml.etree.ElementTree as ET
import json
from collections import defaultdict

# === Load MeSH descriptor file ===
tree = ET.parse('../data/MeSh_data/desc2025.xml')  # Update path if necessary
root = tree.getroot()

# === Define core category-matching keywords (seeds) ===
category_keywords = {
    "Viral Entry and Neuroinvasion": [
        "neuroinvasion", "receptor", "ACE2", "blood-brain barrier", "BBB", "virus entry", "olfactory", 
        "retrograde transport", "endocytosis", "direct invasion", "cranial nerve", "neural pathway", 
        "transcribrial", "neurotropic", "trans-synaptic", "neuronal route", "olfactory nerve", 
        "hematogenous", "choroid plexus", "neuronal transmission", "entry into CNS"
    ],
    "Immune and Inflammatory Response": [
        "immune", "cytokine", "inflammation", "interferon", "TNF", "IL-6", "IL6", "cytokine storm", 
        "immune response", "inflammatory mediators", "macrophage", "microglia", "neutrophil", 
        "lymphocyte", "innate immunity", "immune dysregulation", "chemokine", "T cell", "NLRP3", 
        "antibody", "immune activation", "immune imbalance", "immune-mediated", "complement"
    ],
    "Neurodegenerative Mechanisms": [
        "neurodegeneration", "protein aggregation", "apoptosis", "cell death", "synaptic loss", 
        "neurotoxicity", "oxidative stress", "mitochondrial dysfunction", "tau", "amyloid", 
        "α-synuclein", "prion", "demyelination", "neuron loss", "misfolded proteins", 
        "chronic neuronal damage", "neurodegenerative", "neuroinflammation"
    ],
    "Vascular Effects": [
        "stroke", "thrombosis", "vascular", "ischemia", "coagulation", "blood clot", "microthrombi", 
        "endothelial", "vasculitis", "hemorrhage", "blood vessel", "vascular damage", "capillary", 
        "clotting", "hypoperfusion", "angiopathy", "vasculopathy"
    ],
    "Psychological and Neurological Symptoms": [
        "cognitive", "memory", "fatigue", "depression", "anxiety", "brain fog", "psychiatric", 
        "mood", "confusion", "neuropsychiatric", "emotional", "behavioral", "neurocognitive", 
        "insomnia", "psychosocial", "attention", "motivation", "executive function", "suicidality"
    ],
    "Systemic Cross-Organ Effects": [
        "lungs", "liver", "kidney", "systemic", "multi-organ", "gastrointestinal", "heart", 
        "cardiovascular", "endocrine", "renal", "pancreas", "organ failure", "liver damage", 
        "pulmonary", "myocardial", "respiratory", "hypoxia", "oxygen deprivation", "fibrosis"
    ]
}

# === Parse MeSH XML and extract matching terms per category ===
category_terms = defaultdict(set)

# Lowercase the seed keywords once instead of on every comparison
category_keywords_lower = {
    category: [keyword.lower() for keyword in keywords]
    for category, keywords in category_keywords.items()
}

for descriptor in root.findall('DescriptorRecord'):
    descriptor_name = descriptor.findtext('DescriptorName/String')
    if not descriptor_name:
        continue

    term_elements = descriptor.findall('ConceptList/Concept/TermList/Term/String')
    # findall never yields None, so filter on the element text instead
    synonyms = [term_el.text for term_el in term_elements if term_el.text]
    all_text = ' '.join([descriptor_name] + synonyms).lower()

    # Case-insensitive substring match: short seeds also match inside longer
    # tokens (e.g. "t cell" is found in "293t cell")
    for category, keywords in category_keywords_lower.items():
        if any(keyword in all_text for keyword in keywords):
            category_terms[category].update([descriptor_name] + synonyms)

# === Convert sets to sorted lists (sets are not JSON-serializable) ===
for category in category_terms:
    category_terms[category] = sorted(category_terms[category])

# === Preview sample output ===
category_name = "Immune and Inflammatory Response"
print(f"=== Preview: {category_name} ===")
for term in category_terms[category_name][:25]:  # Show first 25 terms
    print("-", term)

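# === Export to JSON ===
output_path = "../data/MeSh_data/mesh_category_terms.json"
# Write UTF-8 explicitly and keep non-ASCII terms readable, as in the triples cell
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(category_terms, f, indent=2, ensure_ascii=False)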

print(f"\nExtraction complete! Terms saved to: {output_path}")


=== Preview: Immune and Inflammatory Response ===
- 1, ADP-ribosyl Cyclase
- 1, IFN-gamma Receptor
- 120a Antigen, CD
- 120b Antigen, CD
- 12E7 Antigen
- 12E7 Protein
- 19S Gamma Globulin
- 2, C-EBP-Related Protein
- 23-C-EBP Protein
- 28 kDa Protein, Adipocyte
- 293 Cell, HEK
- 293 Cells, HEK
- 293T Cell
- 293T Cells
- 4 1BB Receptor
- 4 1BB Receptors
- 4-1BB Receptor
- 4-1BB Receptors
- 40-C-EBP Protein
- 4F2 Antigen
- 4F2 Antigen, Human
- 4F2-antigen
- 60B8 A Antigen
- 60B8 B Antigen
- 60B8-A Antigen

Extraction complete! Terms saved to: ../data/MeSh_data/mesh_category_terms.json
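
The preview lists entries such as `293T Cell` under the immune category, which is consistent with plain substring matching: the lowercased seed `t cell` occurs inside `293t cell`. A minimal sketch of a stricter alternative, assuming whole-word matching is acceptable for the seed list, is shown below. The helper name `compile_keyword_pattern` is hypothetical, and the optional trailing `s` is a simplification for plurals, not a full treatment of inflected forms.

```python
import re


def compile_keyword_pattern(keywords):
    """Build one case-insensitive regex that matches any keyword as a whole token."""
    # Longest alternatives first so multi-word seeds are tried before shorter ones.
    alternatives = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    # (?<!\w) and (?!\w) reject matches flanked by letters, digits, or underscores;
    # the optional "s" tolerates simple plurals such as "T Cells".
    return re.compile(
        r"(?<!\w)(?:" + "|".join(alternatives) + r")s?(?!\w)",
        re.IGNORECASE,
    )


pattern = compile_keyword_pattern(["T cell", "tau", "IL-6"])

for text in ["293T Cell", "T Cells", "Taurine", "tau Proteins", "IL-6 Receptor"]:
    print(f"{text!r}: {bool(pattern.search(text))}")
```

Swapping this in for the substring test would shrink the category lists and change the exported JSON, so the seed keywords would likely need to be revisited at the same time.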
