![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/healthcare-nlp/37.0.Human_Phenotype_Extraction_And_HPO_Code_Mapping.ipynb)

# Human Phenotype Extraction and HPO Code Mapping

Extracting human phenotype information from unstructured clinical text is critical for advancing diagnosis, research, and precision medicine. However, narrative descriptions of patient traits and symptoms are highly variable, making automated analysis challenging.

The **Human Phenotype Ontology (HPO)** offers a standardized vocabulary to address this gap. By applying Natural Language Processing (NLP) techniques, we can detect phenotype mentions in text and map them to their corresponding HPO codes.

This process transforms free-text data into structured, computable formats, enabling better patient stratification, genetic diagnosis, and large-scale clinical studies.

This notebook demonstrates how to use:

- a [Pretrained Pipeline](https://nlp.johnsnowlabs.com/2025/05/02/hpo_mapper_pipeline_en.html) that extracts phenotype entities and maps them to their corresponding **HPO** codes,

- the [Human Phenotypes Text Matcher](https://nlp.johnsnowlabs.com/2025/05/01/hpo_matcher_en.html), and

- the [HPO Code Mapper](https://nlp.johnsnowlabs.com/2025/05/01/hpo_mapper_en.html).

## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, you can check the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

## Colab Setup


In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.settings.enforce_versions=True
nlp.install(refresh_install=True)

In [None]:
from johnsnowlabs import nlp, medical
# Automatically load license data and start a session with all jars user has access to

spark = nlp.start()

In [5]:
spark

In [6]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only
from sparknlp_jsl.pipeline_tracer import PipelineTracer

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)

import json
import string
import numpy as np
import warnings
warnings.filterwarnings('ignore')

## Mapping Phenotypes to HPO Codes using a Pretrained Pipeline

A [pretrained pipeline](https://nlp.johnsnowlabs.com/2025/05/02/hpo_mapper_pipeline_en.html) can be used to extract phenotype entities from clinical or biomedical text and map them to their corresponding **Human Phenotype Ontology (HPO)** codes.

Working with the Pretrained Pipeline (PP) involves the following steps:
- Load the pretrained Healthcare NLP pipeline.
- Pass in raw clinical text containing phenotypic descriptions.
- The pipeline automatically extracts phenotype entities.
- It then maps these entities to their standardized **HPO codes**.

This approach ensures consistent terminology and paves the way for scalable, ontology-aware clinical text mining in biomedical research and applications.



In [7]:
text = """APNEA: Presumed apnea of prematurity since < 34 wks gestation at birth.
HYPERBILIRUBINEMIA: At risk for hyperbilirubinemia d/t prematurity.
1/25-1/30: Received Amp/Gent while undergoing sepsis evaluation."""

In [8]:
pipeline = nlp.PretrainedPipeline("hpo_mapper_pipeline", "en", "clinical/models")

hpo_mapper_pipeline download started this may take some time.
Approx size to download 1.6 GB
[OK!]


In [9]:
pipeline.model.stages

[DocumentAssembler_d0bc7e7b5701,
 REGEX_TOKENIZER_412cf3aec0e0,
 StopWordsCleaner_07ae15b4ab3f,
 TokenAssembler_3c5fe8db3f38,
 SentenceDetectorDLModel_6bafc4746ea5,
 REGEX_TOKENIZER_ff483303fc87,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 ENTITY_EXTRACTOR_b3e9ddf00ea8,
 REGEX_MATCHER_a9e4d3edf33b,
 MERGE_9d4376ade17e,
 ChunkFilterer_b773ee6df7c5,
 CHUNKER-MAPPER_2c1f125ecd86,
 ASSERTION_DL_25881ab6309e]

Using **.fullAnnotate** provides more comprehensive output: it returns full annotation objects, including the character offsets and metadata of each annotation, rather than only the result strings.

In [10]:
clinical_result = pipeline.fullAnnotate(text)[0]

In [11]:
clinical_result.keys()

dict_keys(['hpo_code', 'raw_ner_chunk', 'document', 'ner_chunk', 'hpo_term', 'assertion', 'cleanTokens', 'regex_matches', 'token', 'clean_tokens', 'embeddings', 'cleanTokens_newDoc', 'sentence'])

In [12]:
clinical_result["ner_chunk"]

[Annotation(chunk, 16, 20, apnea, {'entity': 'HPO', 'ner_source': 'hpo_term', 'chunk': '1', 'original_or_matched': '', 'sentence': '0'}, []),
 Annotation(chunk, 91, 108, hyperbilirubinemia, {'entity': 'HPO', 'ner_source': 'hpo_term', 'chunk': '3', 'original_or_matched': '', 'sentence': '1'}, []),
 Annotation(chunk, 167, 172, sepsis, {'entity': 'HPO', 'ner_source': 'hpo_term', 'chunk': '4', 'original_or_matched': '', 'sentence': '2'}, [])]

In [13]:
clinical_result["hpo_code"]

[Annotation(labeled_dependency, 16, 20, HP:0002104, {'resolved_text': 'HP:0002104', 'distance': '0.0', 'entity': 'HPO', '__distance_function__': 'levenshtein', 'relation': 'hpo_code', '__trained__': 'apnea', 'all_k_distances': '0.0:::0.0', 'ner_source': 'hpo_term', 'chunk': '1', 'original_or_matched': '', 'sentence': '0', 'all_k_resolutions': 'HP:0002104:::', 'ops': '0.0', 'all_relations': '', 'target_text': 'apnea', '__relation_name__': 'hpo_code'}, []),
 Annotation(labeled_dependency, 91, 108, HP:0002904, {'resolved_text': 'HP:0002904', 'distance': '0.0', 'entity': 'HPO', '__distance_function__': 'levenshtein', 'relation': 'hpo_code', '__trained__': 'hyperbilirubinemia', 'all_k_distances': '0.0:::0.0', 'ner_source': 'hpo_term', 'chunk': '3', 'original_or_matched': '', 'sentence': '1', 'all_k_resolutions': 'HP:0002904:::', 'ops': '0.0', 'all_relations': '', 'target_text': 'hyperbilirubinemia', '__relation_name__': 'hpo_code'}, []),
 Annotation(labeled_dependency, 167, 172, HP:0100806,

In [14]:
hpoterm_result = []
begin = []
end = []
entity = []
hpo_code = []
assertions = []

for term, code, assertion in zip(clinical_result['ner_chunk'], clinical_result['hpo_code'], clinical_result['assertion']):

    hpoterm_result.append(term.result)
    begin.append(term.begin)
    end.append(term.end)
    entity.append(term.metadata['entity'])
    hpo_code.append(code.result)
    assertions.append(assertion.result)

df_clinical = pd.DataFrame({'chunk':hpoterm_result, 'begin': begin, 'end' : end , 'label' : entity, "hpo_code" : hpo_code, "assertion": assertions})

df_clinical

| | chunk | begin | end | label | hpo_code | assertion |
|---|---|---|---|---|---|---|
| 0 | apnea | 16 | 20 | HPO | HP:0002104 | possible |
| 1 | hyperbilirubinemia | 91 | 108 | HPO | HP:0002904 | present |
| 2 | sepsis | 167 | 172 | HPO | HP:0100806 | present |


In [15]:
visualiser = nlp.viz.NerVisualizer()

visualiser.display(clinical_result, label_col='ner_chunk', document_col='cleanTokens_newDoc')

The pipeline includes a section header detector, which distinguishes HPO terms that are used as section headers from genuine phenotype mentions in clinical text. By default, the chunk filterer stage drops these headers. To return section headers alongside the HPO terms, set the blacklist parameter of the chunk filterer stage to an empty list (`[]`).
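The effect of the blacklist can be pictured with plain Python. This is only a conceptual sketch, not the library's implementation: `filter_chunks` and the toy chunk dictionaries are hypothetical names introduced for illustration.

```python
def filter_chunks(chunks, blacklist):
    """Drop every chunk whose entity label appears in the blacklist."""
    blocked = set(blacklist)
    return [c for c in chunks if c["entity"] not in blocked]

# Toy chunks (illustrative only): one section header and one phenotype mention.
toy_chunks = [
    {"chunk": "APNEA:", "entity": "SECTION_HEADER"},
    {"chunk": "apnea", "entity": "HPO"},
]

# Blacklisting SECTION_HEADER keeps only the phenotype mention.
without_headers = filter_chunks(toy_chunks, ["SECTION_HEADER"])

# An empty blacklist lets both kinds of chunk through.
with_headers = filter_chunks(toy_chunks, [])
```

With an empty blacklist nothing is filtered out, which is why section headers show up next to the HPO terms once the parameter is cleared.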

In [16]:
pipeline.model.stages

[DocumentAssembler_d0bc7e7b5701,
 REGEX_TOKENIZER_412cf3aec0e0,
 StopWordsCleaner_07ae15b4ab3f,
 TokenAssembler_3c5fe8db3f38,
 SentenceDetectorDLModel_6bafc4746ea5,
 REGEX_TOKENIZER_ff483303fc87,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 ENTITY_EXTRACTOR_b3e9ddf00ea8,
 REGEX_MATCHER_a9e4d3edf33b,
 MERGE_9d4376ade17e,
 ChunkFilterer_b773ee6df7c5,
 CHUNKER-MAPPER_2c1f125ecd86,
 ASSERTION_DL_25881ab6309e]

In [17]:
pipeline.model.stages[-3].getBlackList()

['SECTION_HEADER']

In [18]:
pipeline.model.stages[-3] = pipeline.model.stages[-3].setBlackList([])
print("Filterer Black List:", pipeline.model.stages[-3].getBlackList())

empty_data = spark.createDataFrame([[""]]).toDF("text")
pipeline.transform(empty_data)

clinical_result = pipeline.fullAnnotate(text)[0]

hpoterm_result = []
begin = []
end = []
entity = []
hpo_code = []
assertions = []

for term, code, assertion in zip(clinical_result['ner_chunk'], clinical_result['hpo_code'], clinical_result['assertion']):

    hpoterm_result.append(term.result)
    begin.append(term.begin)
    end.append(term.end)
    entity.append(term.metadata['entity'])
    hpo_code.append(code.result)
    assertions.append(assertion.result)

df_clinical = pd.DataFrame({'chunk':hpoterm_result, 'begin': begin, 'end' : end , 'label' : entity, "hpo_code" : hpo_code, "assertion": assertions})

df_clinical

Filterer Black List: []


| | chunk | begin | end | label | hpo_code | assertion |
|---|---|---|---|---|---|---|
| 0 | APNEA: | 0 | 5 | SECTION_HEADER | NONE | present |
| 1 | apnea | 16 | 20 | HPO | HP:0002104 | possible |
| 2 | HYPERBILIRUBINEMIA: | 66 | 84 | SECTION_HEADER | NONE | present |
| 3 | hyperbilirubinemia | 91 | 108 | HPO | HP:0002904 | present |
| 4 | sepsis | 167 | 172 | HPO | HP:0100806 | present |


In [19]:
visualiser = nlp.viz.NerVisualizer()

visualiser.display(clinical_result, label_col='ner_chunk', document_col='cleanTokens_newDoc')

## Human Phenotypes Text Matcher

[Human Phenotypes Text Matcher](https://nlp.johnsnowlabs.com/2025/05/01/hpo_matcher_en.html) identifies mentions of phenotype terms within unstructured text by matching them against a curated list or ontology like HPO.

Text Matcher enables **quick and accurate detection of phenotype-related information**, supporting downstream tasks like HPO code mapping, diagnosis support, and clinical data analysis.
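To make the idea of dictionary-based matching concrete, here is a minimal plain-Python sketch. It is an assumption-laden illustration rather than how `hpo_matcher` is implemented: `match_terms`, the three-term vocabulary, and the sample sentence are all hypothetical. The `end` offset is inclusive, mirroring the `begin`/`end` values reported elsewhere in this notebook.

```python
import re

def match_terms(text, terms):
    """Return (chunk, begin, end) for each case-insensitive whole-word hit.

    `end` is inclusive, so text[begin:end + 1] reproduces the chunk.
    """
    hits = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.group(0), m.start(), m.end() - 1))
    # Report hits in reading order.
    return sorted(hits, key=lambda hit: hit[1])

# Hypothetical mini-vocabulary; the pretrained matcher ships its own term list.
toy_terms = ["spasticity", "ataxic gait", "seizures"]
toy_text = "Exam showed Spasticity and ataxic gait with seizures."

matches = match_terms(toy_text, toy_terms)
for chunk, begin, end in matches:
    print(chunk, begin, end)
```

Multi-word terms such as "ataxic gait" are matched as a single chunk, and the case-insensitive flag lets "Spasticity" match the lowercase vocabulary entry.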

In [20]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

stopwords_cleaner = nlp.StopWordsCleaner.pretrained("stopwords_removal_hpo", "en", "clinical/models") \
    .setInputCols("token")\
    .setOutputCol("cleanTokens")\
    .setCaseSensitive(False)

token_assembler = nlp.TokenAssembler()\
    .setInputCols(['document',"cleanTokens"])\
    .setOutputCol("cleanTokens_newDoc")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
    .setInputCols(["cleanTokens_newDoc"]) \
    .setOutputCol("sentence")

tokenizer_2 = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("clean_tokens")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "clean_tokens"])\
    .setOutputCol("embeddings")

entityExtractor = medical.TextMatcherModel().pretrained("hpo_matcher","en","clinical/models")\
    .setInputCols(["sentence", "clean_tokens"])\
    .setOutputCol("hpo_term")\
    .setCaseSensitive(False)\
    .setMergeOverlapping(False)

clinical_assertion = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "hpo_term", "embeddings"]) \
    .setOutputCol("assertion")

assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence","hpo_term","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["Present"])

matcher_pipeline = nlp.Pipeline(
                    stages = [
                        documentAssembler,
                        tokenizer,
                        stopwords_cleaner,
                        token_assembler,
                        sentenceDetector,
                        tokenizer_2,
                        word_embeddings,
                        entityExtractor,
                        clinical_assertion,
                        assertion_filterer
                  ])

stopwords_removal_hpo download started this may take some time.
Approximate size to download 1.3 KB
[OK!]
sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
hpo_matcher download started this may take some time.
Approximate size to download 2 MB
[OK!]
assertion_dl download started this may take some time.
Approximate size to download 1.3 MB
[OK!]
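The last stage of the pipeline above keeps only chunks whose assertion label is on the whitelist. The sketch below shows that idea in plain Python, under the assumption that the case-sensitivity setting governs how whitelist entries are compared with the labels; `filter_by_assertion` and the toy chunks are hypothetical and do not reflect the library's internals.

```python
def filter_by_assertion(chunks, whitelist, case_sensitive=False):
    """Keep chunks whose assertion label appears in the whitelist."""
    def norm(label):
        return label if case_sensitive else label.lower()

    allowed = {norm(label) for label in whitelist}
    return [c for c in chunks if norm(c["assertion"]) in allowed]

# Toy chunks (illustrative only) with lowercase assertion labels.
toy_chunks = [
    {"chunk": "apnea", "assertion": "possible"},
    {"chunk": "sepsis", "assertion": "present"},
]

# Case-insensitive comparison: "Present" on the whitelist matches "present".
kept = filter_by_assertion(toy_chunks, ["Present"])

# A case-sensitive comparison would reject the same label.
kept_strict = filter_by_assertion(toy_chunks, ["Present"], case_sensitive=True)
```

In this sketch, the capitalised whitelist entry only matches the lowercase label because the comparison ignores case.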


In [21]:
text = """
The patient exhibited poor coordination, and spasticity in the lower limbs.
Neurological examination revealed dysarthria and ataxic gait.
Brain MRI showed cerebellar atrophy.
There was also a history of seizures, intellectual disability, and anxiety.
"""

data = spark.createDataFrame([[text]]).toDF("text")
matcher_model = matcher_pipeline.fit(data)

In [22]:
result = matcher_model.transform(data)

result.select(F.explode(F.arrays_zip(
              result.assertion_filtered.result,
              result.assertion_filtered.begin,
              result.assertion_filtered.end,
              result.assertion_filtered.metadata,
              )).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('label')).show(50, truncate=70)

+-----------------------+-----+---+-----+
|                  chunk|begin|end|label|
+-----------------------+-----+---+-----+
|      poor coordination|   18| 34|  HPO|
|             spasticity|   37| 46|  HPO|
|             dysarthria|   95|104|  HPO|
|            ataxic gait|  106|116|  HPO|
|     cerebellar atrophy|  136|153|  HPO|
|               seizures|  169|176|  HPO|
|intellectual disability|  179|201|  HPO|
|                anxiety|  204|210|  HPO|
+-----------------------+-----+---+-----+



## HPO Code Mapper

[HPO Code Mapper](https://nlp.johnsnowlabs.com/2025/05/01/hpo_mapper_en.html) links extracted phenotype mentions from text to their corresponding **Human Phenotype Ontology (HPO)** codes.

This mapping standardizes clinical descriptions, enabling structured analysis, improving data interoperability, and supporting precision medicine applications.
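At its core, code mapping is a lookup from a normalized chunk to a code. The sketch below illustrates this with a three-entry table taken from codes shown earlier in this notebook; `TOY_HPO_CODES` and `map_to_hpo` are hypothetical names, and the pretrained `hpo_mapper` model carries the full mapping. Returning `"NONE"` for unknown chunks is an illustrative choice here.

```python
# Toy lookup table; the codes are the ones shown in this notebook's earlier output.
TOY_HPO_CODES = {
    "apnea": "HP:0002104",
    "hyperbilirubinemia": "HP:0002904",
    "sepsis": "HP:0100806",
}

def map_to_hpo(chunk, table=TOY_HPO_CODES, default="NONE"):
    """Lowercase and trim the chunk, then look up its HPO code."""
    return table.get(chunk.strip().lower(), default)

print(map_to_hpo("Apnea"))    # known term, matched after lowercasing
print(map_to_hpo("APNEA:"))   # not in the table (trailing colon), falls back to the default
```

Lowercasing before the lookup plays the same role as a lowercase option on a mapper: chunks that differ only in case resolve to the same code.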

In [23]:
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

stopwords_cleaner = nlp.StopWordsCleaner.pretrained("stopwords_removal_hpo", "en", "clinical/models") \
    .setInputCols("token")\
    .setOutputCol("cleanTokens")\
    .setCaseSensitive(False)

token_assembler = nlp.TokenAssembler()\
    .setInputCols(['document',"cleanTokens"])\
    .setOutputCol("cleanTokens_newDoc")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
    .setInputCols(["cleanTokens_newDoc"]) \
    .setOutputCol("sentence")

tokenizer_2 = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("clean_tokens")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "clean_tokens"])\
    .setOutputCol("embeddings")

entityExtractor = medical.TextMatcherModel().pretrained("hpo_matcher","en","clinical/models")\
    .setInputCols(["sentence", "clean_tokens"])\
    .setOutputCol("hpo_term")\
    .setCaseSensitive(False)\
    .setMergeOverlapping(False)

clinical_assertion = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "hpo_term", "embeddings"]) \
    .setOutputCol("assertion")

assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence","hpo_term","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)
# Optional: uncomment to keep only chunks asserted as present.
# assertion_filterer.setWhiteList(["Present"])

mapper = medical.ChunkMapperModel().pretrained("hpo_mapper","en", "clinical/models")\
    .setInputCols(["assertion_filtered"])\
    .setOutputCol("hpo_code")\
    .setLowerCase(True)

mapper_pipeline = nlp.Pipeline(stages=[
                      documentAssembler,
                      tokenizer,
                      stopwords_cleaner,
                      token_assembler,
                      sentenceDetector,
                      tokenizer_2,
                      word_embeddings,
                      entityExtractor,
                      clinical_assertion,
                      assertion_filterer,
                      mapper
                  ])

data = spark.createDataFrame([[text]]).toDF("text")

mapper_model = mapper_pipeline.fit(data)

stopwords_removal_hpo download started this may take some time.
Approximate size to download 1.3 KB
[OK!]
sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
hpo_matcher download started this may take some time.
Approximate size to download 2 MB
[OK!]
assertion_dl download started this may take some time.
Approximate size to download 1.3 MB
[OK!]
hpo_mapper download started this may take some time.
Approximate size to download 1.4 MB
[OK!]


In [24]:
result = mapper_model.transform(data)

result.select(F.explode(F.arrays_zip(
              result.assertion_filtered.result,
              result.assertion_filtered.begin,
              result.assertion_filtered.end,
              result.assertion_filtered.metadata,
              result.assertion.result,
              result.hpo_code.result)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias('ner_label'),
              F.expr("cols['4']").alias('assertion_label'),
              F.expr("cols['5']").alias('hpo_code')).show(50, truncate=70)

+-----------------------+-----+---+---------+---------------+----------+
|                  chunk|begin|end|ner_label|assertion_label|  hpo_code|
+-----------------------+-----+---+---------+---------------+----------+
|      poor coordination|   18| 34|      HPO|        present|HP:0002370|
|             spasticity|   37| 46|      HPO|        present|HP:0001257|
|             dysarthria|   95|104|      HPO|        present|HP:0001260|
|            ataxic gait|  106|116|      HPO|        present|HP:0002066|
|     cerebellar atrophy|  136|153|      HPO|        present|HP:0001272|
|               seizures|  169|176|      HPO|        present|HP:0001250|
|intellectual disability|  179|201|      HPO|        present|HP:0001249|
|                anxiety|  204|210|      HPO|        present|HP:0000739|
+-----------------------+-----+---+---------+---------------+----------+





This time, we use a LightPipeline instead of `.transform()`.

Light Pipelines:

- Are optimized for single-machine, in-memory processing.

- Are faster for small to medium datasets.

- Return a list of dictionaries or a pandas DataFrame.

- Are useful for real-time predictions and prototyping.

**For small inputs like this one, Light Pipelines can be nearly 10x faster than Spark ML Pipelines.**

In [25]:
light_model = nlp.LightPipeline(mapper_model)

light_result = light_model.fullAnnotate(text)
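As with the pretrained pipeline earlier, `fullAnnotate` returns a list with one entry per input text, and each entry maps output column names to lists of annotation objects. A small helper can flatten such a list into rows. The sketch below is self-contained: `annotations_to_rows` is a hypothetical helper, and `SimpleNamespace` objects stand in for real annotations, assuming only the `result`, `begin`, `end`, and `metadata` attributes used elsewhere in this notebook.

```python
from types import SimpleNamespace

def annotations_to_rows(annotations):
    """Flatten annotation-like objects into plain dictionaries."""
    return [
        {
            "chunk": a.result,
            "begin": a.begin,
            "end": a.end,
            "label": a.metadata.get("entity"),
        }
        for a in annotations
    ]

# Stand-in for one annotation; values copied from the earlier ner_chunk output.
fake_annotations = [
    SimpleNamespace(result="apnea", begin=16, end=20, metadata={"entity": "HPO"}),
]

rows = annotations_to_rows(fake_annotations)
print(rows)
```

With a real result, the same helper could be called as `annotations_to_rows(light_result[0]["assertion_filtered"])`, and the rows passed to `pd.DataFrame` for display.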

The **NerVisualizer** highlights the named entities that are identified by Spark NLP and also displays their labels as decorations on top of the analyzed text.

In [26]:
visualiser = nlp.viz.NerVisualizer()

visualiser.display(light_result[0], label_col='assertion_filtered', document_col='cleanTokens_newDoc')