# NodeNormalization

## Introduction

The [Node Normalization Service](https://nodenormalization-sri.renci.org/docs) takes an input [CURIE](https://en.wikipedia.org/wiki/CURIE), and returns:

* The preferred CURIE for this entity
* All other known equivalent identifiers for the entity
* Semantic types for the entity as defined by the [Biolink Model](https://biolink.github.io/biolink-model/)

The data currently served by the Node Normalization Service is created by [Babel](https://github.com/TranslatorSRI/Babel), a pipeline for combining identifiers from hundreds of data sources including ontologies and publicly accessible databases into a series of _cliques_ -- sets of identifiers ordered as per the preferred prefix list for that semantic type.
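The ordering idea behind a clique can be sketched in a few lines. This is only an illustration: the prefix order below is hypothetical and is not Babel's actual configuration, and `order_clique` is a made-up helper name.

```python
# Hypothetical preference order, for illustration only (not Babel's real list).
PREFERRED_PREFIXES = ["MONDO", "DOID", "UMLS", "MESH"]


def order_clique(identifiers, preferred_prefixes):
    """Sort CURIEs so that identifiers with earlier-listed prefixes come first."""
    rank = {prefix: i for i, prefix in enumerate(preferred_prefixes)}

    def sort_key(curie):
        prefix = curie.split(":", 1)[0]
        # Prefixes missing from the list sort after all listed ones.
        return rank.get(prefix, len(rank))

    # sorted() is stable, so identifiers sharing a prefix keep their input order.
    return sorted(identifiers, key=sort_key)


clique = order_clique(
    ["MESH:D000690", "UMLS:C0002736", "MONDO:0004976", "DOID:332"],
    PREFERRED_PREFIXES,
)
preferred = clique[0]  # the first member plays the role of the preferred CURIE
print(preferred)
```

Under this assumed order, the MONDO identifier ends up first and would be treated as the preferred CURIE for the clique.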

## Metadata

There are two metadata services that report what has been incorporated into NodeNormalization. One returns the semantic types that are included, and the other returns the prefixes included for each type.

Which types have been normalized?

In [1]:
import json
import requests

result = requests.get('https://nodenormalization-sri.renci.org/get_semantic_types')
print(json.dumps(result.json(), indent=2))

{
  "semantic_types": {
    "types": [
      "biolink:GenomicEntity",
      "biolink:BiologicalProcessOrActivity",
      "biolink:MolecularActivity",
      "biolink:DiseaseOrPhenotypicFeature",
      "biolink:GeneFamily",
      "biolink:Cell",
      "biolink:OrganismalEntity",
      "biolink:Pathway",
      "biolink:MolecularEntity",
      "biolink:Gene",
      "biolink:BiologicalEntity",
      "biolink:BiologicalProcess",
      "biolink:AnatomicalEntity",
      "biolink:OrganismTaxon",
      "biolink:Entity",
      "biolink:Disease",
      "biolink:CellularComponent",
      "biolink:ChemicalSubstance",
      "biolink:PhenotypicFeature",
      "biolink:NamedThing"
    ]
  }
}
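A quick membership check against this response might look like the sketch below. The helper name `supported_types` is hypothetical, and `sample` is a hand-abridged copy of the output above rather than a live response.

```python
def supported_types(payload):
    """Return the set of semantic types from a get_semantic_types response body."""
    return set(payload["semantic_types"]["types"])


# Abridged, hand-copied sample of the response shown above.
sample = {
    "semantic_types": {
        "types": ["biolink:Gene", "biolink:Disease", "biolink:ChemicalSubstance"]
    }
}

types = supported_types(sample)
print("biolink:Gene" in types)
print("biolink:NotAType" in types)  # made-up type, used only to show a miss
```

With a live call, `result.json()` could be passed in place of `sample`.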


Even if a semantic type has some identifier equivalence, not every vocabulary has been included.  To see which vocabularies are likely to give useful results, call:

In [2]:
# Ask which CURIE prefixes are available for a single semantic type.
result = requests.get('https://nodenormalization-sri.renci.org/get_curie_prefixes/',
                      params={'semantic_type': "biolink:ChemicalSubstance"})
print(json.dumps(result.json(), indent=2))

{
  "biolink:ChemicalSubstance": {
    "curie_prefix": {
      "PUBCHEM.COMPOUND": "96566919",
      "INCHIKEY": "96349470",
      "CHEMBL.COMPOUND": "1889978",
      "MESH": "274298",
      "CHEBI": "135769",
      "KEGG.COMPOUND": "18744",
      "HMDB": "113979",
      "UNII": "82937",
      "DRUGBANK": "10742",
      "GTOPDB": "8895"
    }
  }
}


More than one type can be queried:

In [3]:
# Passing a list repeats the semantic_type parameter, one value per type.
result = requests.get('https://nodenormalization-sri.renci.org/get_curie_prefixes/',
                      params={'semantic_type': ["biolink:ChemicalSubstance", "biolink:Disease"]})
print(json.dumps(result.json(), indent=2))

{
  "biolink:ChemicalSubstance": {
    "curie_prefix": {
      "PUBCHEM.COMPOUND": "96566919",
      "INCHIKEY": "96349470",
      "CHEMBL.COMPOUND": "1889978",
      "MESH": "274298",
      "CHEBI": "135769",
      "KEGG.COMPOUND": "18744",
      "HMDB": "113979",
      "UNII": "82937",
      "DRUGBANK": "10742",
      "GTOPDB": "8895"
    }
  },
  "biolink:Disease": {
    "curie_prefix": {
      "UMLS": "225822",
      "SNOMEDCT": "152614",
      "MEDDRA": "23228",
      "NCIT": "39158",
      "MONDO": "44526",
      "ORPHANET": "18282",
      "MESH": "21066",
      "HP": "3478",
      "DOID": "19624",
      "OMIM": "28954",
      "EFO": "3820",
      "ICD10": "24",
      "ICD9": "12",
      "MP": "4",
      "medgen": "4"
    }
  }
}
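The values in these responses are returned as strings, so they need converting before they can be compared numerically. The sketch below assumes the values are per-prefix counts. `rank_prefixes` is a hypothetical helper, and `sample` is an abridged copy of the `biolink:Disease` output above.

```python
def rank_prefixes(payload, semantic_type):
    """Return (prefix, count) pairs for one semantic type, largest count first."""
    counts = payload[semantic_type]["curie_prefix"]
    # The service returns counts as strings, so cast them to int before sorting.
    pairs = [(prefix, int(count)) for prefix, count in counts.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Abridged, hand-copied sample of the response shown above.
sample = {
    "biolink:Disease": {
        "curie_prefix": {"MONDO": "44526", "UMLS": "225822", "ICD10": "24"}
    }
}

ranked = rank_prefixes(sample, "biolink:Disease")
for prefix, count in ranked:
    print(f"{prefix}: {count}")
```

Sorting the raw strings instead would order them lexicographically, which is why the cast matters.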


## Normalization

Given one or more Compact URIs (CURIEs), `get_normalized_nodes` returns a list of equivalent identifiers for each entity, along with the Translator-preferred identifier and the semantic type(s) for the entity. This service only returns pre-computed values and does no equivalence inference on its own. If a CURIE is unknown to it, then null is returned.
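Because unknown CURIEs come back as null, client code usually needs to separate hits from misses. The following is a minimal sketch, assuming the parsed response is a dict keyed by input CURIE in which an unknown CURIE maps to `None`. `preferred_ids` is a hypothetical helper and `sample` is hand-written, not a live response.

```python
def preferred_ids(response_body):
    """Split a parsed response into {input: preferred CURIE} and a list of unknown inputs."""
    found, unknown = {}, []
    for curie, entry in response_body.items():
        if entry is None:
            # JSON null becomes None after parsing: the service does not know this CURIE.
            unknown.append(curie)
        else:
            found[curie] = entry["id"]["identifier"]
    return found, unknown


# Hand-written sample in the assumed response shape.
sample = {
    "HP:0007354": {
        "id": {"identifier": "MONDO:0004976", "label": "amyotrophic lateral sclerosis"}
    },
    "CURIE:NOTHING": None,
}

found, unknown = preferred_ids(sample)
print(found)
print(unknown)
```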

In this example, `get_normalized_nodes` is called with a MeSH identifier. MeSH contains terms of many different semantic types, but the service correctly identifies the type of this term.

In [4]:
# Normalize a single CURIE with a GET request.
result = requests.get('https://nodenormalization-sri.renci.org/get_normalized_nodes',
                      params={'curie': "MESH:D014867"})
print(json.dumps(result.json(), indent=2))

{
  "MESH:D014867": {
    "id": {
      "identifier": "PUBCHEM.COMPOUND:22247451",
      "label": "Hydron;hydroxide"
    },
    "equivalent_identifiers": [
      {
        "identifier": "PUBCHEM.COMPOUND:22247451",
        "label": "Hydron;hydroxide"
      },
      {
        "identifier": "PUBCHEM.COMPOUND:962",
        "label": "Water"
      },
      {
        "identifier": "CHEMBL.COMPOUND:CHEMBL1098659",
        "label": "WATER"
      },
      {
        "identifier": "UNII:059QF0KO0R"
      },
      {
        "identifier": "CHEBI:15377",
        "label": "water"
      },
      {
        "identifier": "DRUGBANK:DB09145"
      },
      {
        "identifier": "MESH:D014867",
        "label": "Water"
      },
      {
        "identifier": "HMDB:HMDB0002111"
      },
      {
        "identifier": "KEGG.COMPOUND:C00001",
        "label": "H2O"
      },
      {
        "identifier": "INCHIKEY:IKBQPNVYXHKVJS-LVZFUZTISA-N"
      }
    ],
    "type": [
      "biolink:ChemicalSubstance",
    

To improve performance, multiple CURIEs may be batched into a single POST request:

In [5]:
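# Normalize several CURIEs in one POST; "CURIE:NOTHING" is a placeholder that is not expected to resolve.
result = requests.post('https://nodenormalization-sri.renci.org/get_normalized_nodes',
                       json={"curies": ["HP:0007354", "HGNC:613", "CURIE:NOTHING"]})
print(json.dumps(result.json(), indent=2))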

{
  "HP:0007354": {
    "id": {
      "identifier": "MONDO:0004976",
      "label": "amyotrophic lateral sclerosis"
    },
    "equivalent_identifiers": [
      {
        "identifier": "MONDO:0004976",
        "label": "amyotrophic lateral sclerosis"
      },
      {
        "identifier": "DOID:332"
      },
      {
        "identifier": "OMIM:MTHU030638"
      },
      {
        "identifier": "OMIM:MTHU038375"
      },
      {
        "identifier": "ORPHANET:803"
      },
      {
        "identifier": "EFO:0000253",
        "label": "amyotrophic lateral sclerosis"
      },
      {
        "identifier": "UMLS:C0002736",
        "label": "Amyotrophic Lateral Sclerosis"
      },
      {
        "identifier": "UMLS:C0393554",
        "label": "Amyotrophic Lateral Sclerosis With Dementia"
      },
      {
        "identifier": "MESH:D000690",
        "label": "Amyotrophic Lateral Sclerosis"
      },
      {
        "identifier": "MEDDRA:10002026"
      },
      {
        "identifier": "NCI

## TRAPI

Node normalization can also operate on TRAPI messages (version 1.0).

Here we have a message in terms of HGNC and DOID, and the normalizer returns a message using NCBIGene and MONDO.

In [11]:
trapi_message = {
    "message": {
        "query_graph": {
            "nodes": {
                "n1": {
                    "id": "HGNC:11603",
                    "category": [
                        "biolink:Gene"
                    ]
                },
                "n2": {
                    "category": [
                        "biolink:Disease"
                    ]
                }
            },
            "edges": {
                "e1": {
                    "subject": "n1",
                    "object": "n2"
                }
            }
        },
        "knowledge_graph": {
            "nodes": {
                "HGNC:11603": {
                    "name": "TBX4",
                    "category": [
                        "biolink:Gene"
                    ]
                },
                "DOID:3083": {
                    "name": "chronic obstructive pulmonary disease",
                    "category": [
                        "biolink:Disease"
                    ]
                }
            },
            "edges": {
                "2d38345a-e9bf-4943-accb-dccba351dd04": {
                    "subject": "NCBIGene:9496",
                    "object": "DOID:3083",
                    "predicate": "biolink:related_to",
                    "relation": "RO:0003304"
                }
            }
        },
        "results": [
            {
                "node_bindings": {
                    "n1": [
                        {
                            "id": "HGNC:11603"
                        }
                    ],
                    "n2": [
                        {
                            "id": "DOID:3083"
                        }
                    ]
                },
                "edge_bindings": {
                    "e1": [
                        {
                            "id": "2d38345a-e9bf-4943-accb-dccba351dd04"
                        }
                    ]
                }
            }
        ]
    }
}
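To see which vocabularies a message uses before and after normalization, one option is to list the CURIE prefixes of its knowledge-graph nodes. This is a sketch under assumptions: `curie_prefixes` is a hypothetical helper, and `small_message` is a trimmed-down stand-in for the message above.

```python
def curie_prefixes(trapi):
    """Collect the CURIE prefixes used by knowledge-graph nodes in a TRAPI-style message."""
    nodes = trapi["message"]["knowledge_graph"]["nodes"]
    # Node keys are CURIEs; everything before the first colon is the prefix.
    return sorted({curie.split(":", 1)[0] for curie in nodes})


# Trimmed-down stand-in for the message above (nodes only).
small_message = {
    "message": {
        "knowledge_graph": {
            "nodes": {
                "HGNC:11603": {"name": "TBX4"},
                "DOID:3083": {"name": "chronic obstructive pulmonary disease"},
            },
            "edges": {},
        }
    }
}

prefixes = curie_prefixes(small_message)
print(prefixes)
```

Running the same helper on the normalized response would show whether the node identifiers have moved to the preferred vocabularies.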

In [12]:
result = requests.post('https://nodenormalization-sri.renci.org/response', json=trapi_message)
print(result.status_code)

200


In [13]:
print(json.dumps(result.json(), indent=2))

{
  "message": {
    "query_graph": {
      "nodes": {
        "n1": {
          "id": "NCBIGene:9496",
          "category": [
            "biolink:Gene"
          ],
          "is_set": false
        },
        "n2": {
          "id": null,
          "category": [
            "biolink:Disease"
          ],
          "is_set": false
        }
      },
      "edges": {
        "e1": {
          "subject": "n1",
          "object": "n2",
          "predicate": null,
          "relation": null
        }
      }
    },
    "knowledge_graph": {
      "nodes": {
        "NCBIGene:9496": {
          "category": [
            "biolink:Gene",
            "biolink:Gene",
            "biolink:GenomicEntity",
            "biolink:MolecularEntity",
            "biolink:BiologicalEntity",
            "biolink:NamedThing",
            "biolink:Entity"
          ],
          "name": "TBX4",
          "attributes": [
            {
              "type": "biolink:same_as",
              "value": [
     