[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/49.Human_Phenotype_Extraction_And_HPO_Code_Mapping.ipynb)

# Human Phenotype Extraction and HPO Code Mapping

In [None]:
import json
import os

from google.colab import files

if 'spark_jsl.json' not in os.listdir():
    license_keys = files.upload()
    os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
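
The cell above assumes the license file contains every key the notebook uses later (`SECRET`, `PUBLIC_VERSION`, `JSL_VERSION`). A minimal sketch of an early check, useful when running outside Colab, might look like the following. The helper name `load_license_keys` and the `REQUIRED_KEYS` tuple are hypothetical and not part of Spark NLP; a real license file may contain additional keys.

```python
import json

# Key names taken from how this notebook uses the license file (assumption:
# these three are sufficient for the install and start-up cells).
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")


def load_license_keys(path="spark_jsl.json", required=REQUIRED_KEYS):
    """Load the license JSON and fail early if an expected key is missing."""
    with open(path) as f:
        keys = json.load(f)
    missing = [name for name in required if name not in keys]
    if missing:
        raise KeyError(f"{path} is missing expected keys: {missing}")
    return keys
```

With this helper, `license_keys = load_license_keys()` could stand in for the plain `json.load` call, followed by the same `os.environ.update(license_keys)`.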

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

In [6]:
import json
import os

import sparknlp
import sparknlp_jsl
import pandas as pd

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sparknlp.pretrained import PretrainedPipeline

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 6.0.0
Spark NLP_JSL Version : 6.0.0


## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, see the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP), which cover each component.

# 🔬 Mapping Phenotypes to HPO Codes using a Pretrained Pipeline

This notebook demonstrates how to use a **pretrained Healthcare NLP pipeline** to extract phenotype entities from clinical or biomedical text and map them to their corresponding **Human Phenotype Ontology (HPO)** codes.

The **Human Phenotype Ontology (HPO)** provides a standardized vocabulary of phenotypic abnormalities encountered in human disease. Mapping observed symptoms and clinical signs to HPO codes enables better data interoperability, facilitates downstream analyses (e.g., phenotype-driven gene prioritization), and supports integration with biomedical knowledge graphs and clinical decision support systems.

In this notebook, we will:
- Load a pretrained Healthcare NLP pipeline.
- Input raw clinical text containing phenotypic descriptions.
- Automatically extract phenotype entities.
- Map these entities to their standardized **HPO codes**.

This approach ensures consistent terminology and paves the way for scalable, ontology-aware clinical text mining in biomedical research and applications.
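
As a rough mental model of the extract-and-map steps listed above, the sketch below uses a plain dictionary and a regular expression. It is not how `hpo_mapper_pipeline` is implemented; `TOY_HPO_LEXICON` and `toy_extract_and_map` are hypothetical names, and the three term/code pairs are copied from the results shown later in this notebook.

```python
import re

# Toy lexicon: the three term/code pairs reported in this notebook's results.
TOY_HPO_LEXICON = {
    "apnea": "HP:0002104",
    "hyperbilirubinemia": "HP:0002904",
    "sepsis": "HP:0100806",
}


def toy_extract_and_map(text, lexicon=TOY_HPO_LEXICON):
    """Find lexicon terms in text (case-insensitive) and attach their codes."""
    # Longest terms first so longer phrases win over shorter overlapping ones.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE
    )
    rows = []
    for match in pattern.finditer(text):
        rows.append({
            "chunk": match.group(0),
            "begin": match.start(),
            "end": match.end() - 1,  # inclusive end offset
            "hpo_code": lexicon[match.group(0).lower()],
        })
    return rows


for row in toy_extract_and_map("APNEA: Presumed apnea of prematurity."):
    print(row)
```

The pretrained pipeline replaces the toy lexicon with a full HPO vocabulary and handles tokenization and sentence splitting, but the output has the same shape: a matched chunk, its offsets, and a code.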

---


In [7]:
pipeline = PretrainedPipeline("hpo_mapper_pipeline", "en", "clinical/models")

hpo_mapper_pipeline download started this may take some time.
Approx size to download 3.8 MB
[OK!]


In [8]:
pipeline.model.stages

[DocumentAssembler_b3b9016ab6cf,
 REGEX_TOKENIZER_9603fae13961,
 StopWordsCleaner_26952c0abfc8,
 TokenAssembler_43965952f049,
 SentenceDetectorDLModel_6bafc4746ea5,
 REGEX_TOKENIZER_180904360030,
 ENTITY_EXTRACTOR_b3e9ddf00ea8,
 CHUNKER-MAPPER_2c1f125ecd86]

### Sample Text

In [9]:
text = '''APNEA: Presumed apnea of prematurity since < 34 wks gestation at birth.
HYPERBILIRUBINEMIA: At risk for hyperbilirubinemia d/t prematurity.
1/25-1/30: Received Amp/Gent while undergoing sepsis evaluation.'''

### **Results**

In [13]:
# Annotate the sample text; fullAnnotate returns one result per input document
clinical_result = pipeline.fullAnnotate(text)[0]

hpoterm_result = []
begin = []
end = []
entity = []
hpo_code = []

# Pair each extracted chunk with its mapped code
# (zip assumes the two output columns are aligned one-to-one)
for n, m in zip(clinical_result['hpo_term'], clinical_result['hpo_code']):
    hpoterm_result.append(n.result)
    begin.append(n.begin)
    end.append(n.end)
    entity.append(n.metadata['entity'])
    hpo_code.append(m.result)

df_clinical = pd.DataFrame({'chunk': hpoterm_result, 'begin': begin, 'end': end, 'label': entity, 'hpo_code': hpo_code})

df_clinical

|   | chunk | begin | end | label | hpo_code |
|---|---|---|---|---|---|
| 0 | APNEA | 0 | 4 | HPO | HP:0002104 |
| 1 | apnea | 16 | 20 | HPO | HP:0002104 |
| 2 | HYPERBILIRUBINEMIA | 66 | 83 | HPO | HP:0002904 |
| 3 | hyperbilirubinemia | 91 | 108 | HPO | HP:0002904 |
| 4 | sepsis | 167 | 172 | HPO | HP:0100806 |
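
One detail worth checking before using the `begin`/`end` values for highlighting: they do not all index the raw input string. A likely explanation, inferred from the stage list rather than from documentation, is that the pipeline runs `StopWordsCleaner` and `TokenAssembler` before entity extraction, so offsets refer to the cleaned text. The sketch below compares the reported offsets against the raw text and, as one possible workaround, re-locates each chunk with a left-to-right string search. `relocate_in_raw` is a hypothetical helper, not a Spark NLP function.

```python
text = '''APNEA: Presumed apnea of prematurity since < 34 wks gestation at birth.
HYPERBILIRUBINEMIA: At risk for hyperbilirubinemia d/t prematurity.
1/25-1/30: Received Amp/Gent while undergoing sepsis evaluation.'''

# (chunk, begin, end) copied from the results above
reported = [
    ("APNEA", 0, 4),
    ("apnea", 16, 20),
    ("HYPERBILIRUBINEMIA", 66, 83),
    ("hyperbilirubinemia", 91, 108),
    ("sepsis", 167, 172),
]

# `end` is inclusive, so the slice needs end + 1.
matches_raw = [text[b:e + 1] == chunk for chunk, b, e in reported]


def relocate_in_raw(raw, chunks):
    """Find each chunk in the raw text, scanning left to right (case-sensitive)."""
    cursor, spans = 0, []
    for chunk in chunks:
        start = raw.find(chunk, cursor)
        if start == -1:
            # Chunk does not occur verbatim in the raw text after the cursor.
            spans.append((chunk, None, None))
            continue
        spans.append((chunk, start, start + len(chunk) - 1))
        cursor = start + len(chunk)
    return spans


raw_spans = relocate_in_raw(text, [chunk for chunk, _, _ in reported])
print(matches_raw)
print(raw_spans)
```

On this sample, the first two rows line up with the raw string and the last three do not. The string search assumes each chunk appears verbatim and in order in the raw text, which holds for single-word chunks like these but may not hold for multi-word chunks that spanned a removed stop word.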
