# Introduction to the Node Normalization Service (NodeNorm)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NCATSTranslator/NodeNormalization/blob/master/documentation/NodeNormalization.ipynb)

## Introduction

The [Node Normalization Service](https://nodenormalization-sri.renci.org/docs) (NodeNorm) takes an input [CURIE](https://en.wikipedia.org/wiki/CURIE), and returns:

* The preferred CURIE for this entity
* All other known equivalent identifiers for the entity
* Semantic types for the entity as defined by the [Biolink Model](https://biolink.github.io/biolink-model/)

The data currently served by the Node Normalization Service is created by [Babel](https://github.com/NCATSTranslator/Babel), a pipeline that combines identifiers from hundreds of data sources, including ontologies and publicly accessible databases, into a series of _cliques_: sets of equivalent identifiers ordered by the preferred prefix list for their semantic type.

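To make the idea of a clique concrete, here is a minimal sketch of ordering a set of equivalent identifiers by a prefix preference list. The function names `curie_prefix` and `order_clique` and the prefix list itself are assumptions made for illustration, not Babel's actual code or configuration; the identifiers are taken from the water example later in this document.

```python
def curie_prefix(curie: str) -> str:
    """Return the prefix of a CURIE, i.e. the part before the first colon."""
    return curie.split(":", 1)[0]


def order_clique(identifiers: list[str], preferred_prefixes: list[str]) -> list[str]:
    """Order identifiers by the position of their prefix in preferred_prefixes.

    Identifiers whose prefix is not listed are placed last. Python's sort is
    stable, so identifiers sharing a prefix keep their input order.
    """
    rank = {prefix: i for i, prefix in enumerate(preferred_prefixes)}
    return sorted(identifiers, key=lambda c: rank.get(curie_prefix(c), len(rank)))


# Hypothetical preference order for a chemical type (an assumption for illustration).
prefixes = ["CHEBI", "UNII", "PUBCHEM.COMPOUND", "CHEMBL.COMPOUND"]
clique = ["PUBCHEM.COMPOUND:962", "MESH:D014867", "UNII:059QF0KO0R", "CHEBI:15377"]

ordered = order_clique(clique, prefixes)
print(ordered[0])  # the first identifier acts as the preferred CURIE
```

Under this sketch, the preferred CURIE is simply the first element of the ordered clique.
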
### Instances of NodeNorm

For the examples in this document, we will use the _development_ version of the Node Normalization Service, hosted at https://nodenormalization-sri.renci.org/ by the [Renaissance Computing Institute](https://renci.org/) (RENCI) at the University of North Carolina. This version is updated more frequently than the production instance listed below.

The production instance of NodeNorm is hosted by the NCATS Translator project at https://nodenorm.transltr.io/docs, and may be older than the development version described above. As with other NCATS Translator tools, a [CI instance](https://nodenorm.ci.transltr.io/docs) and a [Test instance](https://nodenorm.test.transltr.io/docs) have also been deployed, but are not likely to be useful for non-Translator users.

## Status

A NodeNorm instance can be queried for information about the status and size of its databases. Most importantly, this tells you which version of the Babel outputs has been loaded into that instance and how much memory each of its seven databases is using.

In [13]:
import json
import requests

status = requests.get('https://nodenormalization-sri.renci.org/status')
print(json.dumps(status.json(), indent=2))

{
  "status": "running",
  "babel_version": "2025sep1",
  "babel_version_url": "https://github.com/ncatstranslator/Babel/blob/master/releases/2025sep1.md",
  "databases": {
    "eq_id_to_id_db": {
      "dbname": "id-id",
      "count": 688874821,
      "used_memory_rss_human": "58.86G",
      "is_cluster": false
    },
    "id_to_eqids_db": {
      "dbname": "id-eq-id",
      "count": 490286665,
      "used_memory_rss_human": "110.77G",
      "is_cluster": false
    },
    "id_to_type_db": {
      "dbname": "id-categories",
      "count": 490286665,
      "used_memory_rss_human": "36.53G",
      "is_cluster": false
    },
    "curie_to_bl_type_db": {
      "dbname": "semantic-count",
      "count": 137,
      "used_memory_rss_human": "27.71M",
      "is_cluster": false
    },
    "info_content_db": {
      "dbname": "info-content",
      "count": 3315882,
      "used_memory_rss_human": "232.67M",
      "is_cluster": false
    },
    "gene_protein_db": {
      "dbname": "conflation-db"

For Babel 2025sep1, these are the seven Redis NodeNorm databases:

| Redis database name | Database purpose                                                                                                                                                 | Number of keys | Memory used |
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------- | -------------------------------- |
| id-id               | Maps CURIEs to their preferred CURIE                                                                                                                             | 688,874,821    | 58.86G                           |
| id-eq-id            | Maps preferred CURIEs to a list of equivalent identifiers, labels, descriptions and other information.                                                           | 490,286,665    | 110.77G                          |
| id-categories       | Maps preferred CURIEs to their Biolink type.                                                                                                                     | 490,286,665    | 36.53G                           |
| semantic-count      | Records prefixes for each Biolink type.                                                                                                                          | 137            | 27.71M                           |
| info-content        | Maps preferred CURIEs to their information content value.                                                                                                        | 3,315,882      | 232.67M                          |
| conflation-db       | Maps CURIEs conflated under GeneProtein conflation to the entire conflation (e.g. "UniProtKB:P00734" will be mapped to ["NCBIGene:2147", "UniProtKB:P00734", …]) | 33,592,119     | 4.70G                            |
| chemical-drug-db    | Maps CURIEs conflated under DrugChemical conflation to the entire conflation (e.g. "CHEBI:9919" is mapped to ["CHEBI:35854", "CHEBI:9919", "UMLS:C5679319", …])  | 106,507        | 210.62M                          |

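A table like the one above can be derived from the `/status` response. The following is a minimal sketch that works on an already-parsed response; the helper name `summarize_databases` is hypothetical, and the sample data is a trimmed copy of the output shown earlier in this section.

```python
def summarize_databases(status: dict) -> list[tuple[str, int, str]]:
    """Return (dbname, key count, memory used) rows, largest key count first."""
    rows = [
        (db["dbname"], db["count"], db["used_memory_rss_human"])
        for db in status["databases"].values()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Trimmed copy of the /status output shown above (two of the seven databases).
sample_status = {
    "babel_version": "2025sep1",
    "databases": {
        "curie_to_bl_type_db": {
            "dbname": "semantic-count",
            "count": 137,
            "used_memory_rss_human": "27.71M",
        },
        "eq_id_to_id_db": {
            "dbname": "id-id",
            "count": 688874821,
            "used_memory_rss_human": "58.86G",
        },
    },
}

print("Babel version:", sample_status["babel_version"])
for name, count, memory in summarize_databases(sample_status):
    # The ',' format specifier adds thousands separators to the key count.
    print(f"{name:<16}{count:>15,}{memory:>10}")
```
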
## Normalization

Given one or more Compact URIs (CURIEs), `get_normalized_nodes` will return a list of equivalent identifiers for the entity, along with the Translator-preferred identifier and the semantic type(s) for the entity. This service only returns pre-computed values, and does no equivalence inference on its own. If a CURIE is unknown to it, then null is returned.

In this example, `get_normalized_nodes` is called with a MeSH identifier for water. This is normalized to [CHEBI:15377](https://www.ebi.ac.uk/chebi/CHEBI:15377), the ChEBI identifier for water. Other equivalent identifiers are returned as well, including UNII, PubChem Compound, ChEMBL Compound and DrugBank identifiers. Note that several MeSH identifiers are combined, including MESH:D060766 "Drinking Water". This class is typed as a [biolink:SmallMolecule](https://biolink.github.io/biolink-model/SmallMolecule/), and has an [information content](https://github.com/NCATSTranslator/Babel/blob/master/docs/README.md#what-are-information-content-values) of 47.5.

**Note**: by default, GeneProtein conflation is turned on but DrugChemical conflation is turned off. You can read more about conflation later on in this document.

In [14]:
result = requests.get('https://nodenormalization-sri.renci.org/get_normalized_nodes',
                     params={'curie':"MESH:D014867"})
print( json.dumps( result.json(), indent = 2))

{
  "MESH:D014867": {
    "id": {
      "identifier": "CHEBI:15377",
      "label": "Water"
    },
    "equivalent_identifiers": [
      {
        "identifier": "CHEBI:15377",
        "label": "water"
      },
      {
        "identifier": "CHEBI:44701"
      },
      {
        "identifier": "CHEBI:27313"
      },
      {
        "identifier": "CHEBI:10743"
      },
      {
        "identifier": "CHEBI:44819"
      },
      {
        "identifier": "CHEBI:44292"
      },
      {
        "identifier": "CHEBI:43228"
      },
      {
        "identifier": "CHEBI:42043"
      },
      {
        "identifier": "CHEBI:42857"
      },
      {
        "identifier": "CHEBI:13352"
      },
      {
        "identifier": "CHEBI:5585"
      },
      {
        "identifier": "UNII:059QF0KO0R",
        "label": "WATER"
      },
      {
        "identifier": "PUBCHEM.COMPOUND:962",
        "label": "Water"
      },
      {
        "identifier": "CHEMBL.COMPOUND:CHEMBL1098659",
        "label": "WATER"
  

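Once a response has been parsed, the preferred identifier and the vocabularies present in the clique can be pulled out with a few lines of code. This is a minimal sketch; the helper names `preferred_curie` and `equivalent_prefixes` are hypothetical, and the sample is a trimmed copy of the response shown above.

```python
from typing import Optional


def preferred_curie(response: dict, curie: str) -> Optional[str]:
    """Return the preferred CURIE for `curie`, or None if NodeNorm does not know it."""
    entry = response.get(curie)
    if entry is None:  # unknown CURIEs map to null in the JSON response
        return None
    return entry["id"]["identifier"]


def equivalent_prefixes(response: dict, curie: str) -> list[str]:
    """List the distinct prefixes in the clique, in the order they first appear."""
    entry = response.get(curie) or {}
    seen = []
    for item in entry.get("equivalent_identifiers", []):
        prefix = item["identifier"].split(":", 1)[0]
        if prefix not in seen:
            seen.append(prefix)
    return seen


# Trimmed copy of the response shown above.
sample = {
    "MESH:D014867": {
        "id": {"identifier": "CHEBI:15377", "label": "Water"},
        "equivalent_identifiers": [
            {"identifier": "CHEBI:15377", "label": "water"},
            {"identifier": "CHEBI:44701"},
            {"identifier": "UNII:059QF0KO0R", "label": "WATER"},
            {"identifier": "PUBCHEM.COMPOUND:962", "label": "Water"},
        ],
    }
}

print(preferred_curie(sample, "MESH:D014867"))
print(equivalent_prefixes(sample, "MESH:D014867"))
```
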
#### Batching

To improve performance, multiple CURIEs may be batched into a single POST request:

In [15]:
result = requests.post('https://nodenormalization-sri.renci.org/get_normalized_nodes',
                     json={"curies":["HP:0007354", "HGNC:613", "CURIE:NOTHING"]})
print( json.dumps( result.json(), indent = 2))

{
  "HP:0007354": {
    "id": {
      "identifier": "MONDO:0004976",
      "label": "amyotrophic lateral sclerosis"
    },
    "equivalent_identifiers": [
      {
        "identifier": "MONDO:0004976",
        "label": "amyotrophic lateral sclerosis"
      },
      {
        "identifier": "DOID:332",
        "label": "amyotrophic lateral sclerosis"
      },
      {
        "identifier": "orphanet:803"
      },
      {
        "identifier": "UMLS:C0002736",
        "label": "Amyotrophic Lateral Sclerosis"
      },
      {
        "identifier": "MESH:D000690",
        "label": "Amyotrophic Lateral Sclerosis"
      },
      {
        "identifier": "MEDDRA:10002026"
      },
      {
        "identifier": "MEDDRA:10052889"
      },
      {
        "identifier": "MEDDRA:10090869"
      },
      {
        "identifier": "NCIT:C34373",
        "label": "Amyotrophic Lateral Sclerosis"
      },
      {
        "identifier": "SNOMEDCT:86044005"
      },
      {
        "identifier": "medgen:274"
 

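For long lists of CURIEs, it may be convenient to split the list into several POST requests and merge the results. The sketch below assumes a hypothetical `post` callable that sends one JSON payload and returns the parsed response, so that it can run offline against a stand-in; the batch size is an arbitrary choice, not a documented NodeNorm limit.

```python
from typing import Callable, Iterator


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def normalize_in_batches(
    curies: list[str], post: Callable[[dict], dict], batch_size: int = 1000
) -> dict:
    """Send CURIEs in batches and merge the per-batch responses into one dict."""
    merged = {}
    for batch in chunked(curies, batch_size):
        merged.update(post({"curies": batch}))
    return merged


def unknown_curies(response: dict) -> list[str]:
    """Return the queried CURIEs that could not be normalized (null entries)."""
    return [curie for curie, entry in response.items() if entry is None]


def fake_post(payload: dict) -> dict:
    """Stand-in for the real service so that this sketch runs offline."""
    known = {"HP:0007354": {"id": {"identifier": "MONDO:0004976"}}}
    return {curie: known.get(curie) for curie in payload["curies"]}


result = normalize_in_batches(["HP:0007354", "CURIE:NOTHING"], fake_post, batch_size=1)
print(unknown_curies(result))
```

Against the real service, `post` could be something like `lambda payload: requests.post(url, json=payload).json()`, where `url` is the `get_normalized_nodes` endpoint used in the examples above.
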
#### Descriptions

Descriptions are available for concepts that have a description in [UberGraph](https://github.com/INCATools/ubergraph/). Set the `description` flag to include them where available. Note that descriptions are included for each identifier, and the first description is included with the preferred identifier.

In [16]:
result = requests.post('https://nodenormalization-sri.renci.org/get_normalized_nodes',
                     json={"curies":["HP:0007354", "HGNC:613", "CURIE:NOTHING"], "description": True})
print( json.dumps( result.json(), indent = 2))

{
  "HP:0007354": {
    "id": {
      "identifier": "MONDO:0004976",
      "label": "amyotrophic lateral sclerosis"
    },
    "equivalent_identifiers": [
      {
        "identifier": "MONDO:0004976",
        "label": "amyotrophic lateral sclerosis",
        "description": "Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease characterized by progressive muscular paralysis reflecting degeneration of motor neurons in the primary motor cortex, corticospinal tracts, brainstem and spinal cord."
      },
      {
        "identifier": "DOID:332",
        "label": "amyotrophic lateral sclerosis"
      },
      {
        "identifier": "orphanet:803"
      },
      {
        "identifier": "UMLS:C0002736",
        "label": "Amyotrophic Lateral Sclerosis"
      },
      {
        "identifier": "MESH:D000690",
        "label": "Amyotrophic Lateral Sclerosis"
      },
      {
        "identifier": "MEDDRA:10002026"
      },
      {
        "identifier": "MEDDRA:10052889"
      },
   

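Because only some identifiers in a clique carry a description, it can be handy to collect the ones that do. This is a minimal sketch over one parsed entry; the helper name `descriptions_by_identifier` is hypothetical, and the sample is a trimmed copy of the response shown above.

```python
def descriptions_by_identifier(entry: dict) -> dict[str, str]:
    """Map each equivalent identifier that has a description to that description."""
    return {
        item["identifier"]: item["description"]
        for item in entry.get("equivalent_identifiers", [])
        if "description" in item
    }


# Trimmed copy of the "HP:0007354" entry shown above.
sample_entry = {
    "id": {"identifier": "MONDO:0004976", "label": "amyotrophic lateral sclerosis"},
    "equivalent_identifiers": [
        {
            "identifier": "MONDO:0004976",
            "label": "amyotrophic lateral sclerosis",
            # Description shortened here for readability.
            "description": "Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease",
        },
        {"identifier": "DOID:332", "label": "amyotrophic lateral sclerosis"},
        {"identifier": "orphanet:803"},
    ],
}

found = descriptions_by_identifier(sample_entry)
for identifier, description in found.items():
    print(f"{identifier}: {description}")
```
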
#### Taxa

You can use the `include_taxa` flag to return taxa for concepts that are specific to a taxon, such as genes and proteins. Taxa are returned as a list of NCBI Taxon identifiers (such as `NCBITaxon:9606` ["_Homo sapiens_"](https://www.ncbi.nlm.nih.gov/datasets/taxonomy/9606/)) for each identifier, with a combined list for the overall result.

In [17]:
result = requests.post('https://nodenormalization-sri.renci.org/get_normalized_nodes',
                     json={"curies":["HGNC:613", "UniProtKB:P05503"], "include_taxa": True})
print( json.dumps( result.json(), indent = 2))

{
  "HGNC:613": {
    "id": {
      "identifier": "NCBIGene:348",
      "label": "APOE"
    },
    "equivalent_identifiers": [
      {
        "identifier": "NCBIGene:348",
        "label": "APOE",
        "taxa": [
          "NCBITaxon:9606"
        ]
      },
      {
        "identifier": "ENSEMBL:ENSG00000130203",
        "label": "APOE (Hsap)"
      },
      {
        "identifier": "HGNC:613",
        "label": "APOE"
      },
      {
        "identifier": "OMIM:107741"
      },
      {
        "identifier": "UMLS:C1412481",
        "label": "APOE gene"
      },
      {
        "identifier": "UniProtKB:A0A0S2Z3D5",
        "label": "A0A0S2Z3D5_HUMAN Apolipoprotein E (Fragment) (trembl)",
        "taxa": [
          "NCBITaxon:9606"
        ]
      },
      {
        "identifier": "UniProtKB:P02649",
        "label": "APOE_HUMAN Apolipoprotein E (sprot)",
        "taxa": [
          "NCBITaxon:9606"
        ]
      },
      {
        "identifier": "PR:P02649",
        "label": "apoli

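The per-identifier taxa can be combined on the client side as well. The sketch below takes the union of the `taxa` lists in one parsed entry; the helper name `taxa_for_clique` is hypothetical, and whether NodeNorm computes its own combined list in exactly this way is an assumption. The sample is a trimmed copy of the response shown above.

```python
def taxa_for_clique(entry: dict) -> list[str]:
    """Return the sorted union of taxa across all equivalent identifiers."""
    taxa = set()
    for item in entry.get("equivalent_identifiers", []):
        # Identifiers without taxon information simply have no "taxa" key.
        taxa.update(item.get("taxa", []))
    return sorted(taxa)


# Trimmed copy of the "HGNC:613" entry shown above.
sample_entry = {
    "id": {"identifier": "NCBIGene:348", "label": "APOE"},
    "equivalent_identifiers": [
        {"identifier": "NCBIGene:348", "label": "APOE", "taxa": ["NCBITaxon:9606"]},
        {"identifier": "ENSEMBL:ENSG00000130203", "label": "APOE (Hsap)"},
        {"identifier": "HGNC:613", "label": "APOE"},
        {
            "identifier": "UniProtKB:P02649",
            "label": "APOE_HUMAN Apolipoprotein E (sprot)",
            "taxa": ["NCBITaxon:9606"],
        },
    ],
}

print(taxa_for_clique(sample_entry))
```
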
## Conflation

NodeNorm allows identifiers to be optionally combined into broader concepts at query time. There are two conflations currently available:
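* GeneProtein conflation combines protein-encoding genes with their gene products, allowing you to normalize a protein (such as [UniProtKB:P05503 "Cytochrome c oxidase subunit 1" in rats](https://www.uniprot.org/uniprotkb/P05503/entry)) to the gene that encodes it (in this case, NCBIGene:26195 "mt-Co1").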
* DrugChemical conflation combines drug formulations with their active ingredients, allowing you to normalize a specific formulation of a drug (such as [UMLS:C0704942 "acetaminophen 16 MG/ML Oral Solution"](https://uts.nlm.nih.gov/uts/umls/concept/C0704942)) to the active ingredient (in this case, [CHEBI:46195 "Acetaminophen"](https://www.ebi.ac.uk/chebi/CHEBI:46195)).

More details about conflation are available [in the Babel repository](https://github.com/NCATSTranslator/Babel/blob/master/docs/Conflation.md).

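Since both conflations are controlled per request, it may help to build the request body in one place. This is a minimal sketch: the helper name `build_payload` and its parameter names are hypothetical, while `conflate` and `drug_chemical_conflate` are the request flags used in the examples in this document. The defaults mirror the service defaults noted earlier (GeneProtein on, DrugChemical off).

```python
def build_payload(
    curies: list[str], gene_protein: bool = True, drug_chemical: bool = False
) -> dict:
    """Build a JSON body for a POST to get_normalized_nodes with explicit conflation flags."""
    return {
        "curies": list(curies),
        "conflate": gene_protein,  # GeneProtein conflation
        "drug_chemical_conflate": drug_chemical,  # DrugChemical conflation
    }


# Request a drug formulation with DrugChemical conflation switched on.
payload = build_payload(["UMLS:C0704942"], drug_chemical=True)
print(payload)
```

Setting the flags explicitly, rather than relying on the defaults, keeps results reproducible if the defaults ever differ between instances.
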
### Conflation examples

First, let's try to normalize a protein ([UniProtKB:P05503 "Cytochrome c oxidase subunit 1" in rats](https://www.uniprot.org/uniprotkb/P05503/entry)) and a drug formulation ([UMLS:C0704942 "acetaminophen 16 MG/ML Oral Solution"](https://uts.nlm.nih.gov/uts/umls/concept/C0704942)) with both GeneProtein conflation (`conflate`) and DrugChemical conflation (`drug_chemical_conflate`) turned off.

In [18]:
result = requests.post('https://nodenormalization-sri.renci.org/get_normalized_nodes',
                       json={"curies": ["UniProtKB:P05503", "UMLS:C0704942"],
                             "conflate": False,
                             "drug_chemical_conflate": False})
print(json.dumps(result.json(), indent=2))

{
  "UniProtKB:P05503": {
    "id": {
      "identifier": "UniProtKB:P05503",
      "label": "COX1_RAT Cytochrome c oxidase subunit 1 (sprot)"
    },
    "equivalent_identifiers": [
      {
        "identifier": "UniProtKB:P05503",
        "label": "COX1_RAT Cytochrome c oxidase subunit 1 (sprot)",
        "taxa": [
          "NCBITaxon:10116"
        ]
      },
      {
        "identifier": "PR:P05503",
        "label": "cytochrome c oxidase subunit 1 (rat)"
      },
      {
        "identifier": "ENSEMBL:ENSRNOP00000039048"
      },
      {
        "identifier": "ENSEMBL:ENSRNOP00000039048.3"
      }
    ],
    "taxa": [
      "NCBITaxon:10116"
    ],
    "type": [
      "biolink:Protein",
      "biolink:GeneProductMixin",
      "biolink:Polypeptide",
      "biolink:ChemicalEntityOrGeneOrGeneProduct",
      "biolink:ChemicalEntityOrProteinOrPolypeptide",
      "biolink:BiologicalEntity",
      "biolink:ThingWithTaxon",
      "biolink:NamedThing",
      "biolink:GeneOrGeneProduct",
  

Now let's retry that query with both GeneProtein conflation (`conflate`) and DrugChemical conflation (`drug_chemical_conflate`) turned on. You will see the larger conflated cliques returned for both identifiers.

In [19]:
result = requests.post('https://nodenormalization-sri.renci.org/get_normalized_nodes',
                       json={"curies": ["UniProtKB:P05503", "UMLS:C0704942"],
                             "conflate": True,
                             "drug_chemical_conflate": True})
print(json.dumps(result.json(), indent=2))

{
  "UniProtKB:P05503": {
    "id": {
      "identifier": "NCBIGene:26195",
      "label": "mt-Co1"
    },
    "equivalent_identifiers": [
      {
        "identifier": "NCBIGene:26195",
        "label": "mt-Co1",
        "taxa": [
          "NCBITaxon:10116"
        ]
      },
      {
        "identifier": "RGD:621871",
        "label": "Mt-co1"
      },
      {
        "identifier": "UniProtKB:P05503",
        "label": "COX1_RAT Cytochrome c oxidase subunit 1 (sprot)",
        "taxa": [
          "NCBITaxon:10116"
        ]
      },
      {
        "identifier": "PR:P05503",
        "label": "cytochrome c oxidase subunit 1 (rat)"
      },
      {
        "identifier": "ENSEMBL:ENSRNOP00000039048"
      },
      {
        "identifier": "ENSEMBL:ENSRNOP00000039048.3"
      },
      {
        "identifier": "UniProtKB:Q8HIC9",
        "label": "Q8HIC9_RAT Cytochrome c oxidase subunit 1 (trembl)",
        "taxa": [
          "NCBITaxon:10116"
        ]
      }
    ],
    "taxa": [
      "

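To see exactly what conflation contributed, the two responses can be compared. The sketch below lists the identifiers that appear only in the conflated clique; the helper names `identifiers` and `added_by_conflation` are hypothetical, and the samples are trimmed copies of the two `UniProtKB:P05503` responses shown above.

```python
def identifiers(entry: dict) -> list[str]:
    """Return the CURIEs in an entry's clique, in response order."""
    return [item["identifier"] for item in entry["equivalent_identifiers"]]


def added_by_conflation(unconflated: dict, conflated: dict) -> list[str]:
    """Return the identifiers present only in the conflated clique."""
    before = set(identifiers(unconflated))
    return [curie for curie in identifiers(conflated) if curie not in before]


# Trimmed copies of the two responses for UniProtKB:P05503 shown above.
without_conflation = {
    "equivalent_identifiers": [
        {"identifier": "UniProtKB:P05503"},
        {"identifier": "PR:P05503"},
        {"identifier": "ENSEMBL:ENSRNOP00000039048"},
        {"identifier": "ENSEMBL:ENSRNOP00000039048.3"},
    ]
}
with_conflation = {
    "equivalent_identifiers": [
        {"identifier": "NCBIGene:26195"},
        {"identifier": "RGD:621871"},
        {"identifier": "UniProtKB:P05503"},
        {"identifier": "PR:P05503"},
        {"identifier": "ENSEMBL:ENSRNOP00000039048"},
        {"identifier": "ENSEMBL:ENSRNOP00000039048.3"},
        {"identifier": "UniProtKB:Q8HIC9"},
    ]
}

print(added_by_conflation(without_conflation, with_conflation))
```
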
## Metadata

There are two metadata services that can be used to find out what sorts of results have been incorporated into the Node Normalization service.  These return the semantic types that are included, and the prefixes included for each type.

#### Which types have been normalized?

In [20]:
import json
import requests

result = requests.get('https://nodenormalization-sri.renci.org/get_semantic_types')
print( json.dumps( result.json(), indent = 2))

{
  "semantic_types": {
    "types": [
      "biolink:Procedure",
      "biolink:Agent",
      "biolink:Phenomenon",
      "biolink:CellLine",
      "biolink:Cell",
      "biolink:OntologyClass",
      "biolink:GrossAnatomicalStructure",
      "biolink:Cohort",
      "biolink:MolecularEntity",
      "biolink:NucleicAcidEntity",
      "biolink:AnatomicalEntity",
      "biolink:PhenotypicFeature",
      "biolink:GeneProductMixin",
      "biolink:ComplexMolecularMixture",
      "biolink:ChemicalEntityOrProteinOrPolypeptide",
      "biolink:GeneGroupingMixin",
      "biolink:InformationContentEntity",
      "biolink:MolecularMixture",
      "biolink:Drug",
      "biolink:Pathway",
      "biolink:DiseaseOrPhenotypicFeature",
      "biolink:OrganismTaxon",
      "biolink:BiologicalProcessOrActivity",
      "biolink:MacromolecularMachineMixin",
      "biolink:OrganismalEntity",
      "biolink:SubjectOfInvestigation",
      "biolink:ChemicalMixture",
      "biolink:Polypeptide",
      "biolink

#### Which prefixes are supported?

Even if a semantic type has some identifier equivalence, not every vocabulary has been included. To see which vocabularies are likely to give useful results for a specific Biolink class, call `get_curie_prefixes` with the `semantic_type` parameter.

More than one type can be queried:

In [21]:
result = requests.get('https://nodenormalization-sri.renci.org/get_curie_prefixes/',
                     params={'semantic_type':["biolink:ChemicalEntity","biolink:Disease"]})
print( json.dumps( result.json(), indent = 2))

{
  "biolink:ChemicalEntity": {
    "curie_prefix": {
      "PUBCHEM.COMPOUND": "123887334",
      "INCHIKEY": "115975484",
      "CAS": "4112274",
      "HMDB": "217920",
      "CHEMBL.COMPOUND": "2479770",
      "UNII": "138975",
      "CHEBI": "218762",
      "MESH": "256235",
      "UMLS": "603550",
      "DrugCentral": "4995",
      "GTOPDB": "13265",
      "RXCUI": "124800",
      "DRUGBANK": "15274",
      "KEGG.COMPOUND": "16039"
    }
  },
  "biolink:Disease": {
    "curie_prefix": {
      "UMLS": "353693",
      "SNOMEDCT": "98214",
      "NCIT": "25975",
      "MONDO": "26294",
      "orphanet": "11048",
      "MESH": "11255",
      "medgen": "21044",
      "icd11.foundation": "4056",
      "MEDDRA": "47094",
      "DOID": "11951",
      "ICD10": "2520",
      "ICD9": "2233",
      "OMIM": "10210",
      "EFO": "3760",
      "HP": "2361",
      "OMIM.PS": "573",
      "KEGG.DISEASE": "40",
      "icd11": "5",
      "MP": "4"
    }
  }
}


Or you can just ask for all the prefix counts for every type:

In [22]:
result = requests.get('https://nodenormalization-sri.renci.org/get_curie_prefixes')
responses = result.json()

# Display only the first three responses.
print( json.dumps( dict(list(responses.items())[:3]), indent = 2))

{
  "biolink:CellLine": {
    "curie_prefix": {
      "CLO": "38810"
    }
  },
  "biolink:Pathway": {
    "curie_prefix": {
      "SMPDB": "30248",
      "REACT": "22114",
      "PANTHER.PATHWAY": "175",
      "GO": "588"
    }
  },
  "biolink:PhysiologicalProcess": {
    "curie_prefix": {
      "UMLS": "112"
    }
  }
}

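The counts in the outputs above are JSON strings, so they need to be converted before they can be compared numerically. This is a minimal sketch that ranks the prefixes of one semantic type by count; the helper name `ranked_prefixes` is hypothetical, and the sample is copied from the `biolink:Pathway` output shown above.

```python
def ranked_prefixes(response: dict, semantic_type: str) -> list[tuple[str, int]]:
    """Return (prefix, count) pairs for a semantic type, largest count first."""
    counts = response[semantic_type]["curie_prefix"]
    pairs = [(prefix, int(count)) for prefix, count in counts.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Copied from the get_curie_prefixes output shown above.
sample = {
    "biolink:Pathway": {
        "curie_prefix": {
            "SMPDB": "30248",
            "REACT": "22114",
            "PANTHER.PATHWAY": "175",
            "GO": "588",
        }
    }
}

for prefix, count in ranked_prefixes(sample, "biolink:Pathway"):
    print(f"{prefix:<18}{count:>8,}")
```
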

## TRAPI

**Warning**: the NodeNorm TRAPI API endpoints have been deprecated, and will be removed from a future NodeNorm release.

NodeNorm can normalize entire TRAPI messages. Here we have a message in terms of HGNC and DOID, and the normalizer returns a message using NCBIGene and MONDO.

In [23]:
trapi_message = {
    "message": {
        "query_graph": {
            "nodes": {
                "n1": {
                    "id": "HGNC:11603",
                    "categories": [
                        "biolink:Gene"
                    ]
                },
                "n2": {
                    "categories": [
                        "biolink:Disease"
                    ]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n1",
                    "object": "n2"
                }
            }
        },
        "knowledge_graph": {
            "nodes": {
                "HGNC:11603": {
                    "name": "TBX4",
                    "categories": [
                        "biolink:Gene"
                    ]
                },
                "DOID:3083": {
                    "name": "chronic obstructive pulmonary disease",
                    "categories": [
                        "biolink:Disease"
                    ]
                }
            },
            "edges": {
                "2d38345a-e9bf-4943-accb-dccba351dd04": {
                    "subject": "NCBIGene:9496",
                    "object": "DOID:3083",
                    "predicate": "biolink:related_to",
                    "sources": []
                }
            }
        },
        "results": [
            {
                "node_bindings": {
                    "n1": [
                        {
                            "id": "HGNC:11603"
                        }
                    ],
                    "n2": [
                        {
                            "id": "DOID:3083"
                        }
                    ]
                },
                "analyses": [{
                    "resource_id": "infores:openpredict",
                    "score": "0.682246949249408",
                    "scoring_method": "Model confidence between 0 and 1",
                    "edge_bindings": {
                        "e1": [
                            {
                                "id": "2d38345a-e9bf-4943-accb-dccba351dd04"
                            }
                        ]
                    }
                }]
            }
        ]
    }
}

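Before sending a message, it can be useful to list the CURIEs that appear in its knowledge graph, since these are the identifiers the normalizer will encounter. This is a minimal sketch; the helper name `knowledge_graph_curies` is hypothetical, it only inspects node keys and edge endpoints, and the sample is a trimmed copy of the message above.

```python
def knowledge_graph_curies(trapi: dict) -> set[str]:
    """Collect node keys and edge subject/object CURIEs from a TRAPI knowledge graph."""
    graph = trapi["message"].get("knowledge_graph", {})
    curies = set(graph.get("nodes", {}))
    for edge in graph.get("edges", {}).values():
        curies.add(edge["subject"])
        curies.add(edge["object"])
    return curies


# Trimmed copy of the knowledge graph in the message shown above.
sample_message = {
    "message": {
        "knowledge_graph": {
            "nodes": {
                "HGNC:11603": {"name": "TBX4", "categories": ["biolink:Gene"]},
                "DOID:3083": {
                    "name": "chronic obstructive pulmonary disease",
                    "categories": ["biolink:Disease"],
                },
            },
            "edges": {
                "2d38345a-e9bf-4943-accb-dccba351dd04": {
                    "subject": "NCBIGene:9496",
                    "object": "DOID:3083",
                    "predicate": "biolink:related_to",
                }
            },
        }
    }
}

print(sorted(knowledge_graph_curies(sample_message)))
```
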
In [24]:
result = requests.post('https://nodenormalization-sri.renci.org/1.4/query',json=trapi_message)
print(result.status_code)

200


In [25]:
print(json.dumps(result.json(), indent=2))

{
  "message": {
    "query_graph": {
      "nodes": {
        "n1": {
          "categories": [
            "biolink:Gene"
          ],
          "is_set": false,
          "constraints": [],
          "id": "HGNC:11603"
        },
        "n2": {
          "categories": [
            "biolink:Disease"
          ],
          "is_set": false,
          "constraints": []
        }
      },
      "edges": {
        "e1": {
          "subject": "n1",
          "object": "n2"
        }
      }
    },
    "knowledge_graph": {
      "nodes": {
        "NCBIGene:9496": {
          "categories": [
            "biolink:PhysicalEssenceOrOccurrent",
            "biolink:GeneProductMixin",
            "biolink:ThingWithTaxon",
            "biolink:MacromolecularMachineMixin",
            "biolink:GeneOrGeneProduct",
            "biolink:GenomicEntity",
            "biolink:NamedThing",
            "biolink:BiologicalEntity",
            "biolink:Gene",
            "biolink:Protein",
            "b