

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICD10_GM_DE.ipynb)




# **SNOMED coding for German**

To run this notebook yourself, you need to upload your license keys. Run the cell below to do that. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens. Otherwise, you can review the example outputs shown in the notebook.

## 1. Colab Setup

Import license keys

In [None]:
import os
import json

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

sparknlp_version = license_keys["PUBLIC_VERSION"]
jsl_version = license_keys["JSL_VERSION"]

print('SparkNLP Version:', sparknlp_version)
print('SparkNLP-JSL Version:', jsl_version)
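
The cells in this notebook read three entries from `license_keys.json`: `PUBLIC_VERSION`, `JSL_VERSION`, and `SECRET`. As a minimal sketch, a check like the following could fail early when one of them is missing. The helper name `missing_license_keys` is hypothetical, the values are placeholders, and a real license file may contain further entries.

```python
# Entries that this notebook reads from license_keys.json.
REQUIRED_KEYS = ("PUBLIC_VERSION", "JSL_VERSION", "SECRET")


def missing_license_keys(keys):
    """Return the required entries that are absent or empty in `keys`."""
    return [name for name in REQUIRED_KEYS if not keys.get(name)]


# Illustrative placeholder values, not real credentials.
example_keys = {"PUBLIC_VERSION": "x.y.z", "JSL_VERSION": "x.y.z"}
print(missing_license_keys(example_keys))
```

In the notebook itself, the same function could be called on the `license_keys` dictionary right after it is loaded.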

Install dependencies

In [2]:
%%capture
# Export each license entry as an environment variable for the shell commands below
for k, v in license_keys.items():
    %set_env $k=$v

!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jsl_colab_setup.sh
!bash jsl_colab_setup.sh

# Install Spark NLP Display for visualization
!pip install --ignore-installed spark-nlp-display

Import dependencies into Python

In [3]:
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

import sparknlp
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl

Start the Spark session

In [4]:
spark = sparknlp_jsl.start(license_keys['SECRET'])

# manually start session
# params = {"spark.driver.memory" : "16G",
#           "spark.kryoserializer.buffer.max" : "2000M",
#           "spark.driver.maxResultSize" : "2000M"}

# spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

## 2. Construct the pipeline

For more details: https://github.com/JohnSnowLabs/spark-nlp-models#pretrained-models---spark-nlp-for-healthcare

In [25]:
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

embeddings_clinical = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_healthcare_slim", "de", "clinical/models")\
      .setInputCols(["sentence", "token", "embeddings"])\
      .setOutputCol("ner")

ner_converter = NerConverter()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk")\
      .setWhiteList(["TREATMENT", "MEDICAL_CONDITION"])

c2doc = Chunk2Doc()\
      .setInputCols("ner_chunk")\
      .setOutputCol("ner_chunk_doc")

sbert_embedder = BertSentenceEmbeddings.pretrained("sent_bert_base_cased", "de")\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sbert_embeddings")

snomed_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_snomed", "de", "clinical/models") \
      .setInputCols(["ner_chunk_doc", "sbert_embeddings"]) \
      .setOutputCol("resolution")\
      .setDistanceFunction("EUCLIDEAN")


snomed_pipeline = Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings_clinical,
        clinical_ner,
        ner_converter,
        c2doc,
        sbert_embedder,
        snomed_resolver])

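# Every stage is pretrained, so fitting on an empty DataFrame only assembles the PipelineModel
empty_df = spark.createDataFrame([[""]]).toDF("text")

model = snomed_pipeline.fit(empty_df)

# LightPipeline annotates plain Python strings without building a Spark DataFrame
light_pipeline = sparknlp.base.LightPipeline(model)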

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
w2v_cc_300d download started this may take some time.
Approximate size to download 1.2 GB
[OK!]
ner_healthcare_slim download started this may take some time.
Approximate size to download 14.2 MB
[OK!]
sent_bert_base_cased download started this may take some time.
Approximate size to download 390.2 MB
[OK!]
sbertresolve_snomed download started this may take some time.
Approximate size to download 231.8 MB
[OK!]
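
The resolver above is configured with `setDistanceFunction("EUCLIDEAN")`, which suggests that candidate SNOMED entries are compared to each chunk by the Euclidean distance between their embeddings. The following is only a conceptual sketch of such a nearest-neighbour lookup in plain Python. The three-dimensional vectors and the helper `nearest_code` are made up for illustration and are not the library's implementation; real sentence embeddings have far more dimensions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_code(query, candidates):
    """Return (code, distance) for the candidate vector closest to `query`."""
    scored = ((code, euclidean(query, vec)) for code, vec in candidates.items())
    return min(scored, key=lambda pair: pair[1])


# Toy embeddings keyed by code; the vectors are invented for this example.
toy_candidates = {
    "98121": [1.0, 0.0, 0.0],
    "22350": [0.0, 1.0, 0.0],
    "65389": [0.0, 0.0, 1.0],
}

code, distance = nearest_code([0.9, 0.1, 0.0], toy_candidates)
print(code, round(distance, 3))
```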


## 3. Create example inputs

In [26]:
# Enter examples as strings in this list
input_list = [
    """ Die Armschmerzen begannen vor zwei Tagen. Dies ist ein 67-jähriger Patient ohne Anamnese übertragbarer Krankheit (wie Syphilis, AIDS oder andere). Sie hat eine Vorgeschichte von Adenokarzinom und Lungentumoren  . Sie wurde 2003 mit Hygrom der linken Hemisphäre operiert . Sie unterzog sich auch 6 Monate lang einer aggressiven Chemotherapie . """,

]

## 4. Run the pipeline

In [27]:
# Full pipeline: annotate a Spark DataFrame
df = spark.createDataFrame(pd.DataFrame({"text": input_list}))
result = model.transform(df)

# Light pipeline: annotate a single string directly
light_result = light_pipeline.fullAnnotate(input_list[0])
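
`fullAnnotate` returns a list with one dictionary per input string, mapping each output column name to a list of annotation objects that expose `result`, `begin`, `end`, and `metadata`. As a rough sketch, the chunks and their codes could be flattened into rows as shown below. The helper `chunk_rows` is hypothetical, it assumes the `ner_chunk` and `resolution` lists align one-to-one, and the `SimpleNamespace` objects merely stand in for real annotations so the example runs without Spark.

```python
from types import SimpleNamespace


def chunk_rows(annotated, chunk_col="ner_chunk", code_col="resolution"):
    """Pair each chunk with its resolution, assuming both lists align one-to-one."""
    rows = []
    for chunk, res in zip(annotated[chunk_col], annotated[code_col]):
        rows.append({
            "chunk": chunk.result,
            "begin": chunk.begin,
            "end": chunk.end,
            "entity": chunk.metadata.get("entity"),
            "snomed_description": res.metadata.get("resolved_text"),
            "snomed_code": res.result,
        })
    return rows


# Stand-in for one element of a fullAnnotate result (not real library objects).
fake_annotated = {
    "ner_chunk": [SimpleNamespace(result="Armschmerzen", begin=5, end=16,
                                  metadata={"entity": "MEDICAL_CONDITION"})],
    "resolution": [SimpleNamespace(result="98121", begin=5, end=16,
                                   metadata={"resolved_text": "Armschmerzen"})],
}

rows = chunk_rows(fake_annotated)
print(rows[0])
```

In the notebook, `pd.DataFrame(chunk_rows(light_result[0]))` should then give a table comparable to the one produced by the full pipeline, provided the alignment assumption holds.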

## 5. Visualize

Full Pipeline

In [28]:
result.select(
    F.explode(
        F.arrays_zip('ner_chunk.result', 
                     'ner_chunk.begin',
                     'ner_chunk.end',
                     'ner_chunk.metadata',
                     'resolution.metadata', 'resolution.result')
    ).alias('cols')
).select(
    F.expr("cols['0']").alias('chunk'),
    F.expr("cols['1']").alias('begin'),
    F.expr("cols['2']").alias('end'),
    F.expr("cols['3']['entity']").alias('entity'),
    F.expr("cols['4']['resolved_text']").alias('snomed_description'),
    F.expr("cols['5']").alias('snomed_code'),
).show(truncate=False)

+-------------+-----+---+-----------------+------------------+-----------+
|chunk        |begin|end|entity           |snomed_description|snomed_code|
+-------------+-----+---+-----------------+------------------+-----------+
|Armschmerzen |5    |16 |MEDICAL_CONDITION|Armschmerzen      |98121      |
|Krankheit    |104  |112|MEDICAL_CONDITION|Krankheit         |32339      |
|Syphilis     |119  |126|MEDICAL_CONDITION|Syphilis          |22350      |
|AIDS         |129  |132|MEDICAL_CONDITION|HIV-Krankheit     |29605      |
|Adenokarzinom|179  |191|MEDICAL_CONDITION|Adenokarzinom     |23451      |
|Lungentumoren|197  |209|MEDICAL_CONDITION|Lungentumor       |17816      |
|Hygrom       |233  |238|MEDICAL_CONDITION|Hygrom            |86744      |
|Chemotherapie|328  |340|TREATMENT        |Chemotherapie     |65389      |
+-------------+-----+---+-----------------+------------------+-----------+
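
In the table above, `begin` and `end` are character offsets into the input string, and `end` is inclusive: "Armschmerzen" has 12 characters and spans offsets 5 to 16. A small sketch of how a chunk can be recovered from the text with these offsets follows; the helper `span` is hypothetical, and only the first two sentences and the first four rows are used to keep it short.

```python
# First two sentences of the example input, including its leading space.
text = (" Die Armschmerzen begannen vor zwei Tagen. Dies ist ein 67-jähriger "
        "Patient ohne Anamnese übertragbarer Krankheit (wie Syphilis, AIDS oder andere).")

# (chunk, begin, end) copied from the first four rows of the table.
table_rows = [
    ("Armschmerzen", 5, 16),
    ("Krankheit", 104, 112),
    ("Syphilis", 119, 126),
    ("AIDS", 129, 132),
]


def span(source, begin, end):
    """Slice `source` using an inclusive end offset."""
    return source[begin:end + 1]


recovered = [span(text, begin, end) for _, begin, end in table_rows]
print(recovered)
```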



Light Pipeline

In [29]:
from sparknlp_display import EntityResolverVisualizer

vis = EntityResolverVisualizer()

# Set custom label colors
vis.set_label_colors({'TREATMENT':'#800080', 'MEDICAL_CONDITION':'#77b5fe'})

vis.display(light_result[0], 'ner_chunk', 'resolution', 'document')