

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICD10_CM.ipynb)




# **ICD10-CM coding**

To run this notebook yourself, you need to upload your license keys. Run the cell below to do so, or open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens.
Otherwise, you can look at the example outputs at the bottom of the notebook.



## 1. Colab Setup

Import license keys

In [1]:
import os
import json

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

sparknlp_version = license_keys["PUBLIC_VERSION"]
jsl_version = license_keys["JSL_VERSION"]

print('SparkNLP Version:', sparknlp_version)
print('SparkNLP-JSL Version:', jsl_version)

Saving v3_spark_nlp_for_healthcare.json to v3_spark_nlp_for_healthcare.json
SparkNLP Version: 3.0.1
SparkNLP-JSL Version: 3.0.0
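
Before going further, it can help to confirm that the uploaded file contains every key the later cells read (`PUBLIC_VERSION`, `JSL_VERSION`, and `SECRET`). The following is a minimal sketch of such a check; the helper name `missing_license_keys` is hypothetical and not part of Spark NLP.

```python
# Keys that the cells in this notebook read from the license file.
REQUIRED_KEYS = ("PUBLIC_VERSION", "JSL_VERSION", "SECRET")


def missing_license_keys(keys):
    """Return the required keys that are absent from the parsed license dict."""
    return [name for name in REQUIRED_KEYS if name not in keys]


# Example with placeholder values (hypothetical helper, not a Spark NLP API).
example_keys = {"PUBLIC_VERSION": "3.0.1", "JSL_VERSION": "3.0.0"}
print(missing_license_keys(example_keys))  # ['SECRET']
```

An empty list means the file has everything this notebook needs.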


Install dependencies

In [2]:
%%capture
for k,v in license_keys.items(): 
    %set_env $k=$v

!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jsl_colab_setup.sh
!bash jsl_colab_setup.sh

# Install Spark NLP Display for visualization
!pip install --ignore-installed spark-nlp-display
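
`%set_env` is an IPython magic, so the loop above only works inside a notebook. If the same setup were run as a plain Python script, a rough equivalent would be to write the entries into `os.environ` directly. This is a sketch under that assumption; `export_license_env` is a hypothetical helper, not something the setup script provides.

```python
import os


def export_license_env(license_keys, environ=None):
    """Copy each license entry into the environment mapping as a string."""
    target = os.environ if environ is None else environ
    for key, value in license_keys.items():
        target[key] = str(value)
    return target


# Demonstrated on a plain dict so the real environment is left untouched.
demo_env = export_license_env({"JSL_VERSION": "3.0.0"}, environ={})
print(demo_env)  # {'JSL_VERSION': '3.0.0'}
```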

Import dependencies into Python

In [3]:
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

import sparknlp
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl


Start the Spark session

In [4]:
spark = sparknlp_jsl.start(license_keys['SECRET'])

# manually start session
# params = {"spark.driver.memory" : "16G",
#           "spark.kryoserializer.buffer.max" : "2000M",
#           "spark.driver.maxResultSize" : "2000M"}

# spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

## 2. Select the Entity Resolver model and construct the pipeline

**NOTE: The mapping below is an example of how ICD10 resolvers work with different NER models. You can choose different combinations according to your input data and requirements.**



Select the models:

**ICD10 Entity Resolver models:**

1.   **chunkresolve_icd10cm_clinical**
2.   **chunkresolve_icd10cm_diseases_clinical**
3.   **chunkresolve_icd10cm_injuries_clinical**
4.   **chunkresolve_icd10cm_musculoskeletal_clinical**
5.   **chunkresolve_icd10cm_neoplasms_clinical**
6.   **chunkresolve_icd10cm_puerile_clinical**



For more details: https://github.com/JohnSnowLabs/spark-nlp-models#pretrained-models---spark-nlp-for-healthcare

In [5]:
# NER and entity resolver mapping
ner_er_dict = {'chunkresolve_icd10cm_clinical': 'ner_clinical',
              'chunkresolve_icd10cm_diseases_clinical': 'ner_diseases',
              'chunkresolve_icd10cm_injuries_clinical': 'ner_clinical',
              'chunkresolve_icd10cm_musculoskeletal_clinical': 'ner_clinical',
              'chunkresolve_icd10cm_neoplasms_clinical': 'ner_bionlp',
              'chunkresolve_icd10cm_puerile_clinical': 'ner_jsl'}
# ER models are specific to the codes they are trained on, so we need to filter out entities that would cause noise.
wl_er_dict = {'chunkresolve_icd10cm_clinical': ['PROBLEM'],
              'chunkresolve_icd10cm_diseases_clinical': ['Disease'],
              'chunkresolve_icd10cm_injuries_clinical': ['PROBLEM'],
              'chunkresolve_icd10cm_musculoskeletal_clinical': ['PROBLEM'],
              'chunkresolve_icd10cm_neoplasms_clinical': ['CANCER','PATHOLOGICAL_FORMATION'],
              'chunkresolve_icd10cm_puerile_clinical': ['PROBLEM']}

# Change this to the model you want to use and re-run the cells below.
model = 'chunkresolve_icd10cm_clinical'
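
Both dictionaries are indexed by the same resolver name, so a typo in `model` would only surface later as a `KeyError` while the pipeline is being built. A small lookup helper can fail earlier with a clearer message. This is a sketch only; `resolver_config` is a hypothetical name, and the two maps below repeat just two entries from the cell above to stay self-contained.

```python
def resolver_config(model, ner_map, whitelist_map):
    """Return (ner_model_name, entity_whitelist) for a resolver, or raise."""
    if model not in ner_map or model not in whitelist_map:
        known = sorted(set(ner_map) & set(whitelist_map))
        raise ValueError(f"Unknown resolver {model!r}; choose one of {known}")
    return ner_map[model], whitelist_map[model]


# Two entries copied from the mapping above, to keep the sketch self-contained.
ner_map = {
    "chunkresolve_icd10cm_clinical": "ner_clinical",
    "chunkresolve_icd10cm_diseases_clinical": "ner_diseases",
}
whitelist_map = {
    "chunkresolve_icd10cm_clinical": ["PROBLEM"],
    "chunkresolve_icd10cm_diseases_clinical": ["Disease"],
}

print(resolver_config("chunkresolve_icd10cm_diseases_clinical", ner_map, whitelist_map))
# ('ner_diseases', ['Disease'])
```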

Create the pipeline

In [6]:
document_assembler = DocumentAssembler() \
    .setInputCol('text')\
    .setOutputCol('document')

sentence_detector = SentenceDetector() \
    .setInputCols(['document'])\
    .setOutputCol('sentences')

tokenizer = Tokenizer()\
    .setInputCols(['sentences']) \
    .setOutputCol('tokens')

embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models')\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained(ner_er_dict[model], "en", "clinical/models") \
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner_tags") 

# Use the whitelist defined above. You can define your own as well.
ner_chunker = NerConverter()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(wl_er_dict[model])

chunk_embeddings = ChunkEmbeddings()\
    .setInputCols("ner_chunk", "embeddings")\
    .setOutputCol("chunk_embeddings")

entity_resolver = ChunkEntityResolverModel.pretrained(model, "en", "clinical/models")\
    .setInputCols("tokens", "chunk_embeddings")\
    .setOutputCol("resolution")
    
pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_chunker,
    chunk_embeddings,
    entity_resolver])

empty_df = spark.createDataFrame([['']]).toDF("text")
pipeline_model = pipeline.fit(empty_df)

light_pipeline = sparknlp.base.LightPipeline(pipeline_model)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
chunkresolve_icd10cm_clinical download started this may take some time.
Approximate size to download 166.2 MB
[OK!]


## 3. Create example inputs

In [7]:
# Enter examples as strings in this array
input_list = [
"""The patient is a 5-month-old infant who presented initially on Monday with a cold, cough, and runny nose for 2 days. Mom states she had no fever. Her appetite was good but she was spitting up a lot. She had no difficulty breathing and her cough was described as dry and hacky. At that time, physical exam showed a right TM, which was red. Left TM was okay. She was fairly congested but looked happy and playful. She was started on Amoxil and Aldex and we told to recheck in 2 weeks to recheck her ear. Mom returned to clinic again today because she got much worse overnight. She was having difficulty breathing. She was much more congested and her appetite had decreased significantly today. She also spiked a temperature yesterday of 102.6 and always having trouble sleeping secondary to congestion.""",
             ]

## 4. Run the pipeline

In [8]:
df = spark.createDataFrame(pd.DataFrame({"text": input_list}))
result = pipeline_model.transform(df)
light_result = light_pipeline.fullAnnotate(input_list[0])

## 5. Visualize

Full Pipeline

In [17]:
result.select(
    F.explode(
        F.arrays_zip('ner_chunk.result', 
                     'ner_chunk.begin',
                     'ner_chunk.end',
                     'ner_chunk.metadata',
                     'resolution.metadata', 'resolution.result')
    ).alias('cols')
).select(
    F.expr("cols['0']").alias('chunk'),
    F.expr("cols['1']").alias('begin'),
    F.expr("cols['2']").alias('end'),
    F.expr("cols['3']['entity']").alias('entity'),
    F.expr("cols['4']['resolved_text']").alias('icd10_description'),
    F.expr("cols['5']").alias('icd10_code'),
).show(truncate=False)

+--------------------+-----+---+-------+------------------------------------------------------+----------+
|chunk               |begin|end|entity |icd10_description                                     |icd10_code|
+--------------------+-----+---+-------+------------------------------------------------------+----------+
|a cold, cough       |75   |87 |PROBLEM|Chronic obstructive pulmonary disease, unspecified    |J449      |
|runny nose          |94   |103|PROBLEM|Nasal congestion                                      |R0981     |
|fever               |139  |143|PROBLEM|O'nyong-nyong fever                                   |A921      |
|difficulty breathing|210  |229|PROBLEM|Shortness of breath                                   |R0602     |
|her cough           |235  |243|PROBLEM|Cough                                                 |R05       |
|dry                 |262  |264|PROBLEM|Dry beriberi                                          |E5111     |
|hacky               |270  |274|PROBL
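
The `begin` and `end` columns are character offsets into the input text, and `end` is inclusive, so a Python slice needs `end + 1`. A small check against the first two rows of the table, using only the first sentence of the example note (`chunk_text` is a hypothetical helper):

```python
# First sentence of the example note from section 3.
text = (
    "The patient is a 5-month-old infant who presented initially on Monday "
    "with a cold, cough, and runny nose for 2 days."
)


def chunk_text(source, begin, end):
    """Slice a chunk out of the source using inclusive begin/end offsets."""
    return source[begin:end + 1]


print(chunk_text(text, 75, 87))   # a cold, cough
print(chunk_text(text, 94, 103))  # runny nose
```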

Light Pipeline

In [20]:
light_result[0]['resolution']

[Annotation(entity, 75, 87, J449, {'chunk': '0', 'all_k_results': 'J449:::R05:::J00:::P800:::L502:::J440:::J45991:::G4483:::T483X5S:::T483X5A:::A3791:::A90:::T483X2A:::T50A16S:::A011:::T483X4A:::T483X3A:::A3701:::T483X4D:::T483X3D:::T483X4S:::A3700:::T50A15D:::T50A13A:::F5221', 'all_k_distances': '1.0191:::1.0633:::1.1106:::1.2013:::1.2018:::1.2401:::1.2861:::1.2883:::1.3233:::1.3330:::1.3475:::1.3516:::1.3541:::1.3666:::1.3689:::1.3723:::1.3785:::1.3813:::1.3822:::1.3826:::1.3846:::1.4072:::1.4106:::1.4257:::1.4269', 'confidence': '0.0533', 'all_k_cosine_distances': '0.2588:::0.2379:::0.2812:::0.3276:::0.1878:::0.2292:::0.3362:::0.2893:::0.3013:::0.2671:::0.2467:::0.4483:::0.2444:::0.2319:::0.3157:::0.2505:::0.2532:::0.2383:::0.2554:::0.2633:::0.2869:::0.2387:::0.2220:::0.2137:::0.3210', 'all_k_resolutions': 'Chronic obstructive pulmonary disease, unspecified:::Cough:::Acute nasopharyngitis [common cold]:::Cold injury syndrome:::Urticaria due to cold and heat:::Chronic obstructive pul
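
The resolver metadata packs every candidate into `:::`-separated strings (`all_k_results`, `all_k_distances`, `all_k_resolutions`), which are hard to read in raw form. The sketch below splits them into one tuple per candidate. It works on a plain dictionary holding the first three candidates copied from the output above; `parse_candidates` is a hypothetical helper, not a Spark NLP function.

```python
def parse_candidates(metadata):
    """Pair up codes, distances and descriptions from ':::'-joined strings."""
    codes = metadata["all_k_results"].split(":::")
    distances = [float(d) for d in metadata["all_k_distances"].split(":::")]
    descriptions = metadata["all_k_resolutions"].split(":::")
    return list(zip(codes, distances, descriptions))


# First three candidates copied from the output above (shortened for brevity).
metadata = {
    "all_k_results": "J449:::R05:::J00",
    "all_k_distances": "1.0191:::1.0633:::1.1106",
    "all_k_resolutions": (
        "Chronic obstructive pulmonary disease, unspecified"
        ":::Cough:::Acute nasopharyngitis [common cold]"
    ),
}

for code, distance, description in parse_candidates(metadata):
    print(code, distance, description)
```

In the notebook, the same dictionary would presumably come from `light_result[0]['resolution'][0].metadata`, assuming the `Annotation` object exposes its metadata under that attribute. Listing the candidates is useful here because the top-ranked code for "a cold, cough" is J449 with a reported confidence of only 0.0533, while R05 (Cough) and J00 (Acute nasopharyngitis [common cold]) follow closely behind.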