![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb)

📌 SNOMED CT is one of the most comprehensive and precise multilingual health terminologies in the world, widely used for the electronic exchange of clinical information. It serves as a key standard in many countries and enhances interoperability by enabling structured encoding of medical terms.

In this notebook, we introduce the "biolordresolve_snomed_augmented" model, specifically designed for entity resolution and normalization of medical terms in Spanish within the SNOMED CT hierarchy. This model:

- Provides extensive and accurate coverage of medical terms in Spanish.
- Enhances interoperability by mapping concepts to other coding systems like ICD-9 and ICD-10.
- Facilitates the extraction and structuring of clinical information from medical texts.

This notebook will demonstrate how to use "biolordresolve_snomed_augmented" to normalize medical terms and improve the quality of clinical data processing in Spanish. Let's get started! 🚀



# Colab Setup

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.4.1 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [None]:
import json
import os

import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp_jsl.pipeline_tracer import PipelineTracer
from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline,PipelineModel


import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"25G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.5.3
Spark NLP_JSL Version : 5.5.3


# Sentence Entity Resolver Models


A common NLP problem in biomedical applications is identifying the presence of clinical entities in a given text. These clinical entities could be diseases, symptoms, drugs, results of clinical investigations, or others.

To convert a sentence or document into a vector for semantic search or a recommendation system, a commonly advised approach is to pass the text through a transformer model such as BERT and either take the embedding of the CLS token or average the last-layer token embeddings into a single vector.

In practice, finding similar documents with the raw CLS embedding or the last-layer average performs much worse than averaging word2vec/GloVe embeddings into a sentence or document vector.
On top of that, word2vec/GloVe averaging is much faster than extracting a vector from a transformer model.

A better approach for transformer-based embeddings is to use fine-tuned Siamese network variants (SBERT, etc.) that are trained to place similar sentences or documents close together in the embedding space and to push dissimilar ones apart. That is what Sentence Resolvers do, and it is why they outperform Chunk Resolvers.

Without that fine-tuning, the raw embedding vectors (CLS, etc.) from the last layers of these transformer models do not yield better similarity-search results than averaged word2vec/GloVe embeddings.
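
To make the two pooling strategies concrete, here is a minimal sketch on hand-made toy vectors (not real BERT outputs). The function names and numbers are illustrative assumptions; the sketch only shows how a CLS vector and a mean-pooled vector are formed and then compared with cosine similarity.

```python
import math

def cls_pooling(token_vectors):
    """Use the first ([CLS]) token vector as the sentence vector."""
    return list(token_vectors[0])

def mean_pooling(token_vectors):
    """Average all token vectors dimension by dimension."""
    n = len(token_vectors)
    return [sum(dim) / n for dim in zip(*token_vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "last layer" outputs; the first row plays the role of [CLS].
sentence_a = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
sentence_b = [[1.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 6.0]]

cls_sim = cosine_similarity(cls_pooling(sentence_a), cls_pooling(sentence_b))
mean_sim = cosine_similarity(mean_pooling(sentence_a), mean_pooling(sentence_b))
print("CLS pooling similarity :", round(cls_sim, 4))
print("Mean pooling similarity:", round(mean_sim, 4))
```

The two strategies can disagree on how similar two texts are, which is one reason embeddings trained specifically for similarity are preferred for resolution.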

Besides the code in the "result" field, the resolver output provides more metadata about the matching process:

- sentence -> Sentence ID
- target_text -> Text to resolve
- resolved_text -> Best match text
- confidence -> Relative confidence for the top match (distance to probability)
- confidence_ratio -> Relative confidence of the top match against the runner-up: TopMatchConfidence / SecondMatchConfidence
- alternative_codes -> List of other plausible codes (in the KNN neighborhood)
- all_k_resolutions -> All codes descriptions
- all_k_results -> All resolved codes for metrics calculation purposes
- all_k_aux_labels -> All auxiliary information, if the model contains any
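
As a rough illustration of how these fields can be consumed, the sketch below splits the `:::`-joined metadata strings into one record per candidate and computes a confidence ratio. The helper names are hypothetical, the sample codes and distances are copied from the "TB" example later in this notebook, and the two confidence values are made up purely to show the division.

```python
def parse_candidates(metadata, sep=":::"):
    """Split the ':::'-joined metadata strings into one dict per candidate."""
    codes = metadata["all_k_results"].split(sep)
    texts = metadata["all_k_resolutions"].split(sep)
    distances = [float(d) for d in metadata["all_k_cosine_distances"].split(sep)]
    return [
        {"code": code, "resolution": text, "cosine_distance": dist}
        for code, text, dist in zip(codes, texts, distances)
    ]

def confidence_ratio(top_confidence, second_confidence):
    """TopMatchConfidence / SecondMatchConfidence, as described in the list above."""
    return top_confidence / second_confidence

# Sample metadata shaped like a resolver annotation's metadata map.
sample_metadata = {
    "all_k_results": "56717001:::186269001:::373576009",
    "all_k_resolutions": "tuberculosis:::tuberculosis ótica:::infección causada por Mycobacterium tuberculosis",
    "all_k_cosine_distances": "0.0112:::0.0316:::0.1233",
}

candidates = parse_candidates(sample_metadata)
ratio = confidence_ratio(0.5, 0.25)  # hypothetical confidences, for illustration only
print(candidates[0], ratio)
```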

We create a new pipeline that tries to assign a resolution to each of these entities, using the content, the sentence embeddings, and pretrained models for resolver annotation.

The architecture of this new pipeline will be as follows:

- DocumentAssembler (text -> document)

- SentenceDetector (document -> sentence)

- Tokenizer (sentence -> token)

- WordEmbeddingsModel ([sentence, token] -> embeddings)

- MedicalNerModel ([sentence, token, embeddings] -> ner)

- NerConverter ([sentence, token, ner] -> ner_chunk)

- Chunk2Doc (ner_chunk -> ner_chunk_doc)

- BertSentenceEmbeddings (ner_chunk_doc -> sbert_embeddings)

- SentenceEntityResolverModel ([ner_chunk, sbert_embeddings] -> resolution)

So, from a text, we end up with a list of named entities (ner_chunk) and their resolutions.
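
The column flow listed above can be checked mechanically before building a real pipeline. The following is a plain-Python sketch (no Spark involved) that encodes each stage as a hypothetical `(name, inputs, output)` tuple and reports any stage whose input column has not been produced yet; `find_missing_inputs` is an illustrative helper, not part of Spark NLP.

```python
# (stage name, input columns, output column), following the architecture list above
stages = [
    ("DocumentAssembler", ["text"], "document"),
    ("SentenceDetector", ["document"], "sentence"),
    ("Tokenizer", ["sentence"], "token"),
    ("WordEmbeddingsModel", ["sentence", "token"], "embeddings"),
    ("MedicalNerModel", ["sentence", "token", "embeddings"], "ner"),
    ("NerConverter", ["sentence", "token", "ner"], "ner_chunk"),
    ("Chunk2Doc", ["ner_chunk"], "ner_chunk_doc"),
    ("BertSentenceEmbeddings", ["ner_chunk_doc"], "sbert_embeddings"),
    ("SentenceEntityResolverModel", ["ner_chunk", "sbert_embeddings"], "resolution"),
]

def find_missing_inputs(stages, initial_columns=("text",)):
    """Return (stage, missing columns) pairs for stages whose inputs are not yet available."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        missing = [col for col in inputs if col not in available]
        if missing:
            problems.append((name, missing))
        available.add(output)
    return problems

problems = find_missing_inputs(stages)
print(problems)  # an empty list means every stage finds its input columns
```

Dropping or reordering a stage makes the gap visible: for example, `find_missing_inputs(stages[1:])` reports that `SentenceDetector` is missing `document`.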

`setPreservePosition(True)` takes exactly the original indices (under some tokenization conditions it might include some undesired characters such as `)` or `]`).

`setPreservePosition(False)` takes adjusted indices based on substring indexing of the first (for begin) and last (for end) tokens


## Healthcare NLP for Data Scientists Course

If you are not familiar with the components in this notebook, you can check the [Healthcare NLP for Data Scientists Udemy Course](https://www.udemy.com/course/healthcare-nlp-for-data-scientists/) and the [MOOC Notebooks](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP) for each component.

### Helper Function
A generic function for getting the codes and their resolutions.

In [None]:
# returns LP resolution results

import pandas as pd
pd.set_option('display.max_colwidth', 0)

def get_codes (lp, text, vocab='icd10cm_code', aux=False, hcc=False):
    """
    Extracts codes from text using a given LightPipeline model and returns the results in pandas DataFrame format.

    Parameters:
    - lp (LightPipeline): The LightPipeline model to be used for annotation.
    - text (str): The text from which codes are to be extracted.
    - vocab (str, optional): The vocabulary to use for code extraction. Default is 'icd10cm_code'.
    - aux (bool, optional): Whether to include auxiliary information. Default is False.
    - hcc (bool, optional): Whether to extract hierarchical condition category (HCC) related information in the ICD-10-CM model results. Default is False.

    Returns:
    - pandas.DataFrame: A DataFrame containing extracted codes along with their metadata.
                         Columns include 'chunks', 'begin', 'end', 'code', 'resolution', 'all_codes', 'all_resolutions', 'all_cos_distances'.
                         If 'aux' is True (and 'hcc' is False), 'all_k_aux_labels' is also included.
                         If 'hcc' is True, additional columns 'billable', 'hcc_status', and 'hcc_code' will be included.
    """

    full_light_result = lp.fullAnnotate(text)

    chunks = [chunk.result for chunk in full_light_result[0]['ner_chunk']]
    begin = [chunk.begin for chunk in full_light_result[0]['ner_chunk']]
    end = [chunk.end for chunk in full_light_result[0]['ner_chunk']]
    codes = [code.result for code in full_light_result[0][vocab]]
    resolutions = [code.metadata['resolved_text'].split(':::') for code in full_light_result[0][vocab]]
    all_codes = [code.metadata['all_k_results'].split(':::') for code in full_light_result[0][vocab]]
    all_resolutions = [code.metadata['all_k_resolutions'].split(':::') for code in full_light_result[0][vocab]]
    all_distances = [code.metadata['all_k_distances'].split(':::') for code in full_light_result[0][vocab]]
    all_cosines = [code.metadata['all_k_cosine_distances'].split(':::') for code in full_light_result[0][vocab]]
    # HCC details are parsed from the aux labels, so collect them when either flag is set
    all_k_aux_labels = [
        code.metadata.get('all_k_aux_labels', '').split(':::') if (aux or hcc) else []
        for code in full_light_result[0][vocab]
    ]

    df = pd.DataFrame({
        'chunks': chunks,
        'begin': begin,
        'end': end,
        'code': codes,
        'resolution': resolutions,
        'all_codes': all_codes,
        'all_resolutions': all_resolutions,
        'all_k_aux_labels': all_k_aux_labels,
        'all_cos_distances': all_cosines
    })

    if hcc:
        # each aux label is expected to have the form "billable||hcc_status||hcc_code"
        df['billable'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[0] for i in x])
        df['hcc_status'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[1] for i in x])
        df['hcc_code'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[2] for i in x])

    # keep the raw aux labels only when they were requested and not unpacked into HCC columns
    if hcc or not aux:
        df = df.drop(['all_k_aux_labels'], axis=1)

    return df
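
The `hcc` branch of the helper assumes every auxiliary label carries three `||`-separated fields. The sketch below shows that split in isolation, with a guard for labels that have fewer fields. The label values are invented placeholders and `split_hcc_label` is a hypothetical helper; the actual label layout depends on the resolver model being used.

```python
def split_hcc_label(aux_label, sep="||"):
    """Split one aux label into billable / hcc_status / hcc_code, tolerating short labels."""
    parts = aux_label.split(sep)
    if len(parts) < 3:
        # Not enough fields: return explicit gaps instead of raising IndexError
        return {"billable": None, "hcc_status": None, "hcc_code": None}
    return {"billable": parts[0], "hcc_status": parts[1], "hcc_code": parts[2]}

# Placeholder string shaped like an 'all_k_aux_labels' metadata value
raw_aux_labels = "1||1||19:::1||0||0"
parsed = [split_hcc_label(label) for label in raw_aux_labels.split(":::")]
print(parsed)
```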

## Sentence Entity Resolver (SNOMED)

🔍  Please see the [Clinical_Entity_Resolvers](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb) notebook for more examples

In [None]:
document_assembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","es")\
  .setInputCols(["sentence","token"])\
  .setOutputCol("embeddings")

ner_eu = MedicalNerModel.pretrained("ner_eu_clinical_condition", "es", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner_eu")

ner_eu_converter = NerConverterInternal() \
  .setInputCols(["sentence", "token", "ner_eu"]) \
  .setOutputCol("ner_eu_chunk")

chunk2doc = Chunk2Doc()\
  .setInputCols("ner_eu_chunk")\
  .setOutputCol("ner_chunk_doc")

biolord_embeddings = XlmRoBertaSentenceEmbeddings.pretrained("sent_xlm_roberta_biolord_2023_m","xx")\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("biolord_embeddings")

snomed_resolver = SentenceEntityResolverModel.pretrained("biolordresolve_snomed_augmented","es", "clinical/models") \
      .setInputCols(["biolord_embeddings"]) \
      .setOutputCol("snomed_code")\
      .setDistanceFunction("EUCLIDEAN")

snomed_pipeline = Pipeline(stages = [
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner_eu,
    ner_eu_converter,
    chunk2doc,
    biolord_embeddings,
    snomed_resolver
])


clinical_note = ("La paciente, con antecedente de diabetes mellitus gestacional evolucionada a tipo 2 y obesidad, presenta vómitos de una semana de evolución junto con dolorosa inflamación de sínfisis de pubis que dificulta la deambulación.")

data = spark.createDataFrame([[clinical_note]]).toDF("text")

snomed_result = snomed_pipeline.fit(data).transform(data)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.2 GB
[OK!]
ner_eu_clinical_condition download started this may take some time.
[OK!]
sent_xlm_roberta_biolord_2023_m download started this may take some time.
Approximate size to download 973.6 MB
[OK!]
biolordresolve_snomed_augmented download started this may take some time.
[OK!]
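
The resolver above is configured with `setDistanceFunction("EUCLIDEAN")`, while the helper function reports cosine distances. As a small aside on how the two relate, the toy sketch below (plain Python, made-up 2-dimensional vectors) shows that for unit-length vectors both distances rank candidates the same way, since the squared Euclidean distance equals twice the cosine distance. Whether the model's embeddings are normalized is not stated in this notebook, so treat this only as an illustration.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

query = [1.0, 0.0]
# Unit-length toy vectors standing in for candidate concept embeddings
candidates = {"near": [0.8, 0.6], "far": [0.0, 1.0]}

ranking_euclidean = sorted(candidates, key=lambda k: euclidean_distance(query, candidates[k]))
ranking_cosine = sorted(candidates, key=lambda k: cosine_distance(query, candidates[k]))
print(ranking_euclidean, ranking_cosine)
```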


In [None]:
snomed_result.select(F.explode(F.arrays_zip(snomed_result.ner_eu_chunk.result,
                                                          snomed_result.ner_eu_chunk.begin,
                                                          snomed_result.ner_eu_chunk.end,
                                                          snomed_result.ner_eu_chunk.metadata,
                                                          snomed_result.snomed_code.result,
                                                          snomed_result.snomed_code.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']").alias("begin"),
                          F.expr("cols['2']").alias("end"),
                          F.expr("cols['3']['entity']").alias("ner_label"),
                          F.expr("cols['4']").alias("snomed"),
                          F.expr("cols['5']['resolved_text']").alias("resolution"),
                          F.expr("cols['5']['all_k_resolutions']").alias("all_resolution"),
                          F.expr("cols['5']['all_k_results']").alias("all_result"))\
                       .filter("ner_label!='O'")\
                       .show(1000,truncate=False)

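
The `arrays_zip` plus `explode` combination in the cell above turns parallel annotation arrays into one row per chunk. Here is a minimal pure-Python sketch of the same reshaping. The chunk texts come from the clinical note, but the labels and codes are placeholders rather than real model output, and `zip_and_explode` is a hypothetical name.

```python
from itertools import zip_longest

def zip_and_explode(row, columns):
    """Mimic arrays_zip followed by explode for one row of parallel arrays.

    Like Spark's arrays_zip, shorter arrays are padded with None.
    """
    zipped = zip_longest(*(row[col] for col in columns), fillvalue=None)
    return [dict(zip(columns, values)) for values in zipped]

# One DataFrame row: each column holds an array with one element per detected chunk
row = {
    "chunk": ["obesidad", "vómitos"],
    "ner_label": ["<label-1>", "<label-2>"],  # placeholder labels
    "snomed": ["<code-1>", "<code-2>"],       # placeholder codes
}

exploded = zip_and_explode(row, ["chunk", "ner_label", "snomed"])
for record in exploded:
    print(record)
```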

### Visualization

📺 `EntityResolverVisualizer` is a tool for visualizing SNOMED CT resolution results.

In [None]:
result = snomed_result.collect()

In [None]:
from sparknlp_display import EntityResolverVisualizer

visualiser = EntityResolverVisualizer()

for i in range(len(result)):
  visualiser.display(result[i], 'ner_eu_chunk', 'snomed_code', save_path="example_resolver.html")

**If you have already extracted the relevant chunks, you can directly retrieve SNOMED CT codes using the following pipeline:**

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

biolord_embeddings = XlmRoBertaSentenceEmbeddings\
    .pretrained("sent_xlm_roberta_biolord_2023_m","xx")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("biolord_embeddings")

snomed_ct_resolver = SentenceEntityResolverModel.pretrained("biolordresolve_snomed_augmented","es", "clinical/models") \
    .setInputCols(["biolord_embeddings"]) \
    .setOutputCol("snomed_code")\
    .setDistanceFunction("EUCLIDEAN")

snomed_pipelineModel = PipelineModel(
    stages = [
        documentAssembler,
        biolord_embeddings,
        snomed_ct_resolver])

snomed_lp = LightPipeline(snomed_pipelineModel)

sent_xlm_roberta_biolord_2023_m download started this may take some time.
Approximate size to download 973.6 MB
[OK!]
biolordresolve_snomed_augmented download started this may take some time.
[OK!]


In [None]:
text = 'EPOC'

%time get_codes (snomed_lp, text, vocab='snomed_code')

CPU times: user 26.5 ms, sys: 15.3 ms, total: 41.9 ms
Wall time: 1.41 s


Unnamed: 0,chunks,begin,end,code,resolution,all_codes,all_resolutions,all_cos_distances
0,EPOC,0,3,13645005,[EPOC - enfermedad pulmonar obstructiva crónica [EPOC - enfermedad pulmonar obstructiva crónica]],"[13645005, 135836000, 413839001, 297241004, 90616004, 17097001, 125295001, 233762004, 266899005, 185086009, 86555001, 708030004]","[EPOC - enfermedad pulmonar obstructiva crónica [EPOC - enfermedad pulmonar obstructiva crónica], EPOC terminal [EPOC terminal], enfermedad pulmonar crónica [enfermedad pulmonar crónica], antecedente familiar de EPOC [antecedente familiar de EPOC], síndrome de hiperventilación crónica [síndrome de hiperventilación crónica], enfermedad respiratoria crónica [enfermedad respiratoria crónica], enfisema crónico [enfisema crónico], neumoconiosis crónica [neumoconiosis crónica], antecedente familiar de bronquitis/enfermedad pulmonar obstructiva crónica [antecedente familiar de bronquitis/enfermedad pulmonar obstructiva crónica], bronquitis crónica con enfisema [bronquitis crónica con enfisema], mucoviscidosis con compromiso pulmonar [mucoviscidosis con compromiso pulmonar], fibrosis pulmonar combinada con enfisema [fibrosis pulmonar combinada con enfisema]]","[0.1265, 0.1274, 0.1574, 0.1646, 0.1654, 0.1723, 0.1730, 0.1818, 0.1829, 0.1843, 0.1886, 0.1893]"


In [None]:
text = 'Insuficiencia cardíaca congestiva'

%time get_codes (snomed_lp, text, vocab='snomed_code')

CPU times: user 22.7 ms, sys: 9.81 ms, total: 32.5 ms
Wall time: 1.34 s


Unnamed: 0,chunks,begin,end,code,resolution,all_codes,all_resolutions,all_cos_distances
0,Insuficiencia cardíaca congestiva,0,32,42343007,[insuficiencia cardíaca congestiva [insuficiencia cardíaca congestiva]],"[42343007, 84114007, 88805009, 698594003, 48447003, 56265001, 128238001, 96311000119109, 105981003]","[insuficiencia cardíaca congestiva [insuficiencia cardíaca congestiva], insuficiencia cardíaca [insuficiencia cardíaca], insuficiencia cardíaca congestiva crónica [insuficiencia cardíaca congestiva crónica], insuficiencia cardíaca congestiva sintomática [insuficiencia cardíaca congestiva sintomática], insuficiencia cardíaca crónica [insuficiencia cardíaca crónica], cardiopatía (trastorno) [cardiopatía], cardiopatía crónica [cardiopatía crónica], exacerbación de insuficiencia cardíaca congestiva [exacerbación de insuficiencia cardíaca congestiva], disfunción cardíaca [disfunción cardíaca]]","[0.0007, 0.0409, 0.0657, 0.0689, 0.0898, 0.0944, 0.1124, 0.1259, 0.1327]"


In [None]:
text = 'TB'

%time get_codes (snomed_lp, text, vocab='snomed_code')

CPU times: user 27.5 ms, sys: 11.3 ms, total: 38.7 ms
Wall time: 1.29 s


Unnamed: 0,chunks,begin,end,code,resolution,all_codes,all_resolutions,all_cos_distances
0,TB,0,1,56717001,[tuberculosis [tuberculosis]],"[56717001, 186269001, 373576009, 63309002, 427099000, 15202009, 415760001, 25629007, 88356006, 51014003, 36354002, 154283005, 22990009, 423997002, 123583002]","[tuberculosis [tuberculosis], tuberculosis ótica [tuberculosis ótica], infección causada por Mycobacterium tuberculosis (trastorno) [infección causada por Mycobacterium tuberculosis], tuberculosis primaria [tuberculosis primaria], tuberculosis activa [tuberculosis activa], tuberculoma (trastorno) [tuberculoma], tuberculosis-estado [tuberculosis-estado], tuberculosis aguda (trastorno) [tuberculosis aguda], complejo primario tuberculoso (trastorno) [complejo primario tuberculoso], tuberculosis secundaria [tuberculosis secundaria], bacilo de la tuberculosis humana [bacilo de la tuberculosis humana], tuberculosis pulmonar (trastorno) [tuberculosis pulmonar], tuberculosis crónica [tuberculosis crónica], TB extrapulmonar [TB extrapulmonar], tubercúlide [tubercúlide]]","[0.0112, 0.0316, 0.1233, 0.1237, 0.1259, 0.1522, 0.1658, 0.1665, 0.1696, 0.1861, 0.2058, 0.2130, 0.2142, 0.2182, 0.2184]"


In [None]:
text = 'Dolor de cabeza'

%time get_codes (snomed_lp, text, vocab='snomed_code')

CPU times: user 27.6 ms, sys: 12.2 ms, total: 39.8 ms
Wall time: 1.34 s


Unnamed: 0,chunks,begin,end,code,resolution,all_codes,all_resolutions,all_cos_distances
0,Dolor de cabeza,0,14,25064002,[dolor de cabeza [dolor de cabeza]],"[25064002, 230461009, 735938006, 193031009, 398057008, 364760007, 162309007, 122711000119109, 1263575002, 712831003, 425839000, 162307009]","[dolor de cabeza [dolor de cabeza], trastorno relativo a la cefalea [trastorno relativo a la cefalea], cefalea aguda [cefalea aguda], cefalea en acúmulos [cefalea en acúmulos], cefalea por tensión [cefalea por tensión], características de la cefalea - hallazgo [características de la cefalea - hallazgo], cefalea fulgurante [cefalea fulgurante], cefalea despertador [cefalea despertador], cefalea en forma de moneda [cefalea en forma de moneda], cefaleas frecuentes [cefaleas frecuentes], dolor irradiado a la cabeza [dolor irradiado a la cabeza], cefalea continua [cefalea continua]]","[0.0011, 0.0508, 0.0655, 0.0664, 0.0688, 0.0820, 0.0823, 0.1068, 0.1088, 0.1126, 0.1134, 0.1135]"


In [None]:
text = 'Embarazo'

%time get_codes (snomed_lp, text, vocab='snomed_code')

CPU times: user 26.3 ms, sys: 9.67 ms, total: 36 ms
Wall time: 1.3 s


Unnamed: 0,chunks,begin,end,code,resolution,all_codes,all_resolutions,all_cos_distances
0,Embarazo,0,7,289908002,[embarazo [embarazo]],"[289908002, 161732006, 77386006, 255409004, 169563005, 127364007, 366321006, 118185001, 169565003, 87527008, 9899009, 276986009]","[embarazo [embarazo], embarazada [embarazada], embarazada (hallazgo) [embarazo confirmado], mujer embarazada (persona) [mujer embarazada], embarazada - en historia clínica (hallazgo) [embarazada - en historia clínica], primigesta (hallazgo) [primigesta], embarazo - hallazgo [embarazo - hallazgo], hallazgo relacionado con el embarazo [hallazgo relacionado con el embarazo], embarazada - gestación planeada (hallazgo) [embarazada - gestación planeada], embarazo a término (hallazgo) [embarazo a término], embarazo ovárico [embarazo ovárico], preparto [preparto]]","[0.0231, 0.0420, 0.0707, 0.1157, 0.1558, 0.1595, 0.1705, 0.1837, 0.1886, 0.1981, 0.2086, 0.2177]"


In [None]:
text = 'ERGE'

%time get_codes (snomed_lp, text, vocab='snomed_code')

CPU times: user 26 ms, sys: 12.9 ms, total: 38.9 ms
Wall time: 1.34 s


Unnamed: 0,chunks,begin,end,code,resolution,all_codes,all_resolutions,all_cos_distances
0,ERGE,0,3,235595009,[RGE [RGE]],"[235595009, 717847008, 717846004, 9733003, 413216002, 57433008, 10999201000119106, 300290000, 413211007, 245754007, 40719004, 266433003, 266435005, 79693006]","[RGE [RGE], enfermedad por reflujo gastroesofágico erosiva [enfermedad por reflujo gastroesofágico erosiva], enfermedad por reflujo gastroesofágico no erosiva [enfermedad por reflujo gastroesofágico no erosiva], reflujo gastroduodenal [reflujo gastroduodenal], gastropatía erosiva [gastropatía erosiva], gastritis por reflujo [gastritis por reflujo], reflujo gastroesofágico en un niño (trastorno) [reflujo gastroesofágico en un niño], reflujo gástrico excesivo [reflujo gástrico excesivo], duodenopatía erosiva [duodenopatía erosiva], enfermedad del reflujo gastroesofágico con ulceración [enfermedad del reflujo gastroesofágico con ulceración], esofagitis erosiva [esofagitis erosiva], enfermedad por reflujo gastroesofágico con esofagitis [enfermedad por reflujo gastroesofágico con esofagitis], enfermedad por reflujo gastroesofágico sin esofagitis [enfermedad por reflujo gastroesofágico sin esofagitis], alteración en la digestión intestinal de células epiteliales [alteración en la digestión intestinal de células epiteliales]]","[0.1240, 0.1427, 0.2325, 0.2373, 0.2456, 0.2506, 0.2531, 0.2542, 0.2555, 0.2561, 0.2591, 0.2703, 0.2705, 0.2779]"
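
The `all_cos_distances` column can also be used to flag resolutions that deserve a manual check. The sketch below computes the gap between the best and second-best candidate and marks a result as ambiguous when that gap is small. The distances are copied from the outputs above; the helper names and the `0.01` threshold are assumptions chosen only for illustration.

```python
def top_match_margin(distances):
    """Gap between the two smallest distances; None when fewer than two candidates."""
    ordered = sorted(float(d) for d in distances)
    if len(ordered) < 2:
        return None
    return ordered[1] - ordered[0]

def is_ambiguous(distances, min_margin=0.01):
    """True when the runner-up is almost as close as the top match."""
    margin = top_match_margin(distances)
    return margin is not None and margin < min_margin

# First two cosine distances from the result tables above (kept as strings, as get_codes returns them)
examples = {
    "EPOC": ["0.1265", "0.1274"],
    "Insuficiencia cardíaca congestiva": ["0.0007", "0.0409"],
    "TB": ["0.0112", "0.0316"],
}

flags = {text: is_ambiguous(dists) for text, dists in examples.items()}
print(flags)
```

With this threshold only "EPOC" is flagged, because its second candidate (EPOC terminal) is nearly as close as the top match.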
