

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_SNOMED_TERM.ipynb)


# Detect SNOMED Terms

This NER model is designed to identify `SNOMED` terms within clinical documents, utilizing the `embeddings_clinical` embeddings model.


> 📌 To run this notebook yourself, you need to provide your license keys. Run the cell below and upload your license file when prompted, or open the file explorer on the left side of the screen and upload the file there (the cell below looks for `spark_jsl.json`). Otherwise, you can look at the example outputs at the bottom of the notebook.

## 🔧1. Colab Setup

- Import License Keys

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
    license_keys = files.upload()
    os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Expose the license key-value pairs as notebook-level variables
# (the pip commands below read them as $PUBLIC_VERSION, $JSL_VERSION and $SECRET)
globals().update(license_keys)
os.environ.update(license_keys)
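
The later cells read `SECRET`, `PUBLIC_VERSION` and `JSL_VERSION` from the license file, so it can help to check for them before installing anything. The following is a minimal sketch, not part of the original notebook: `missing_license_keys` is a hypothetical helper, and the dictionary holds placeholder values only.

```python
# Keys that the setup cells in this notebook read from the license file.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")


def missing_license_keys(license_keys, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty (hypothetical helper)."""
    return [key for key in required if not license_keys.get(key)]


# Placeholder values for illustration only; a real file holds your own credentials.
example_keys = {"SECRET": "placeholder-secret", "PUBLIC_VERSION": "5.2.2"}

missing = missing_license_keys(example_keys)
if missing:
    print("License file is missing:", ", ".join(missing))
```

In the notebook itself, the same check could be run on the `license_keys` dictionary loaded above.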

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 5.2.2
Spark NLP_JSL Version : 5.2.1


## 🔍2. Select the model `ner_snomed_term` and construct the pipeline

**🔎 You can find all these models and more in the [NLP Models Hub](https://nlp.johnsnowlabs.com/models?edition=Spark+NLP+for+Healthcare).**

In [4]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_snomed_term", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[document_assembler,
                            sentence_detector,
                            tokenizer,
                            embeddings,
                            ner,
                            ner_converter])

empty_df = spark.createDataFrame([['']]).toDF("text")
pipelineModel = nlpPipeline.fit(empty_df)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_snomed_term download started this may take some time.
[OK!]
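
Each stage in the pipeline above consumes columns produced by earlier stages, which is why the order in `Pipeline(stages=[...])` matters. The sketch below illustrates that wiring in plain Python, without Spark. The column names are copied from the cell above; `unresolved_inputs` is a hypothetical helper and not part of Spark NLP.

```python
# Column wiring copied from the pipeline above:
# (stage name, input columns, output column)
STAGES = [
    ("document_assembler", ["text"], "document"),
    ("sentence_detector", ["document"], "sentence"),
    ("tokenizer", ["sentence"], "token"),
    ("embeddings", ["sentence", "token"], "embeddings"),
    ("ner", ["sentence", "token", "embeddings"], "ner"),
    ("ner_converter", ["sentence", "token", "ner"], "ner_chunk"),
]


def unresolved_inputs(stages, initial_columns=("text",)):
    """Return (stage, column) pairs whose input is not produced by an earlier stage."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        for column in inputs:
            if column not in available:
                problems.append((name, column))
        available.add(output)
    return problems


# An empty list means every stage finds its inputs.
print(unresolved_inputs(STAGES))
```

Dropping the first stage, for example with `unresolved_inputs(STAGES[1:])`, reports that `sentence_detector` no longer has a `document` column to read.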


## 📝3. Create example inputs

In [5]:
text_list = [
"""The individual experienced myocardial infarction leading to the implementation of coronary artery bypass grafting and was advised to initiate aspirin therapy""",
"""After experiencing relapsing appendicitis the patient underwent an appendectomy followed by treatment for postoperative infection with targeted antibiotics.""",
"""The patient presented with symptoms of bronchitis for which the physician recommended a chest X-ray and prescribed amoxicillin.""",
"""There is mild regional left ventricular systolic dysfunction to mid inferior akinesia/ hypokinesis.""",
"""He described the sensation as a heavy load, mentioning that he experienced lightheadedness and felt cold yet without any sweating, nausea or vomiting.""",
"""She denies associated chest pain, fevers or soreness"""
]

## 🚀4. Run the pipeline to find Entities

In [6]:
from pyspark.sql.functions import col, explode
import pandas as pd

df = spark.createDataFrame(pd.DataFrame({"text": text_list}))
result = pipelineModel.transform(df)

# Explode the 'ner_chunk' to flatten the structure and access its fields directly
result_exploded = result.withColumn("ner_chunk", explode("ner_chunk"))

# Select and display the required fields with the correct order
result_exploded.select(
    col("ner_chunk.result").alias("ner_chunk"),
    col("ner_chunk.begin").alias("begin"),
    col("ner_chunk.end").alias("end"),
    col("ner_chunk.metadata.entity").alias("ner_label"),
    col("ner_chunk.metadata.confidence").alias("confidence")
).show(50, truncate=False)

+-------------------------------------+-----+---+-----------+----------+
|ner_chunk                            |begin|end|ner_label  |confidence|
+-------------------------------------+-----+---+-----------+----------+
|myocardial infarction                |27   |47 |snomed_term|0.68795   |
|coronary artery bypass grafting      |82   |112|snomed_term|0.688025  |
|aspirin therapy                      |142  |156|snomed_term|0.4837    |
|relapsing appendicitis               |19   |40 |snomed_term|0.66155   |
|appendectomy                         |67   |78 |snomed_term|0.775     |
|postoperative                        |106  |118|snomed_term|0.6485    |
|infection                            |120  |128|snomed_term|0.9901    |
|bronchitis                           |39   |48 |snomed_term|0.5015    |
|chest X-ray                          |88   |98 |snomed_term|0.66865003|
|left ventricular systolic dysfunction|23   |59 |snomed_term|0.74782497|
|akinesia/                            |77   |85 |sn
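
Judging by the rows above, `begin` and `end` are character offsets into the corresponding input text, with `end` pointing at the last character of the chunk (inclusive). The sketch below checks this for the first example sentence using values copied from the table; `chunk_span` is a hypothetical helper written for illustration.

```python
# First example sentence from the inputs above.
text = "The individual experienced myocardial infarction leading to the implementation of coronary artery bypass grafting and was advised to initiate aspirin therapy"

# (chunk, begin, end) rows copied from the output table above.
chunks = [
    ("myocardial infarction", 27, 47),
    ("coronary artery bypass grafting", 82, 112),
    ("aspirin therapy", 142, 156),
]


def chunk_span(text, begin, end):
    """Slice a chunk out of the text, treating `end` as inclusive."""
    return text[begin:end + 1]


for chunk, begin, end in chunks:
    # Python slices exclude the upper bound, hence the +1 inside chunk_span.
    print(chunk_span(text, begin, end) == chunk, chunk)
```

The `+ 1` is the detail to remember when slicing the original text with these offsets.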

## 👀5. Visualization of Detected Entities

In [7]:
from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

# Collect once so the Spark job is not re-run on every iteration
rows = result.collect()

for row in rows:
    visualiser.display(result=row, label_col='ner_chunk', document_col='document')
    print("\n\n")
