

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICD10_CM.ipynb)




# **ICD10-CM coding**

To run this notebook yourself, you need to upload your license keys. Otherwise, you can look at the example outputs at the bottom of the notebook. To upload the keys, open the file explorer on the left side of the screen and upload your license file to the folder that opens. The setup cell below reads the file from `/content/spark_nlp_for_healthcare.json`, so either name the file `spark_nlp_for_healthcare.json` or edit the path in that cell.

## 1. Colab Setup

Import license keys

In [2]:
import os
import json

with open('/content/spark_nlp_for_healthcare.json', 'r') as f:
    license_keys = json.load(f)

license_keys.keys()

secret = license_keys['SECRET']
os.environ['SPARK_NLP_LICENSE'] = license_keys['SPARK_NLP_LICENSE']
os.environ['AWS_ACCESS_KEY_ID'] = license_keys['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY'] = license_keys['AWS_SECRET_ACCESS_KEY']
sparknlp_version = license_keys["PUBLIC_VERSION"]
jsl_version = license_keys["JSL_VERSION"]

print('SparkNLP Version:', sparknlp_version)
print('SparkNLP-JSL Version:', jsl_version)

SparkNLP Version: 2.6.0
SparkNLP-JSL Version: 2.6.0
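If the license file is missing one of the keys read above, the setup cell fails with a bare `KeyError` partway through. The following is a minimal sketch of an early check, assuming the file contains exactly the six keys used in this notebook; `missing_license_keys` and `load_license_keys` are hypothetical helper names, not part of Spark NLP.

```python
import json

# Keys that the setup cell reads from the license file.
REQUIRED_KEYS = (
    "SECRET",
    "SPARK_NLP_LICENSE",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "PUBLIC_VERSION",
    "JSL_VERSION",
)


def missing_license_keys(license_keys):
    """Return the required keys that are absent or empty, in a stable order."""
    return [key for key in REQUIRED_KEYS if not license_keys.get(key)]


def load_license_keys(path):
    """Hypothetical helper: load the JSON file and fail early on missing keys."""
    with open(path, "r") as f:
        license_keys = json.load(f)
    missing = missing_license_keys(license_keys)
    if missing:
        raise KeyError(f"License file {path} is missing: {', '.join(missing)}")
    return license_keys
```

In the setup cell, `load_license_keys('/content/spark_nlp_for_healthcare.json')` could stand in for the plain `json.load` call.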


Install dependencies

In [3]:
# Install Java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
! java -version

# Install pyspark
! pip install --ignore-installed -q pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed spark-nlp==$sparknlp_version
! python -m pip install --upgrade spark-nlp-jsl==$jsl_version --extra-index-url https://pypi.johnsnowlabs.com/$secret

openjdk version "11.0.8" 2020-07-14
OpenJDK Runtime Environment (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1)
OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1, mixed mode, sharing)
  Building wheel for pyspark (setup.py) ... done
Collecting spark-nlp==2.6.0
  Downloading https://files.pythonhosted.org/packages/e4/30/1bd0abcc97caed518efe527b9146897255dffcf71c4708586a82ea9eb29a/spark_nlp-2.6.0-py2.py3-none-any.whl (125kB)
Installing collected packages: spark-nlp
Successfully installed spark-nlp-2.6.0
Looking in indexes: https://pypi.org/simple, https://pypi.johnsnowlabs.com/2.6.0-8388813d58b67fa25bf9cf603393363af96dba16
Collecting spark-nlp-jsl==2.6.0
  Downloading https://pypi.johnsnowlabs.com/2.6.0-8388813d58b67fa25bf9cf603393363af96dba16/spark-nlp-jsl/spark_nlp_

Import dependencies into Python

In [4]:
os.environ['JAVA_HOME'] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ['PATH'] = os.environ['JAVA_HOME'] + "/bin:" + os.environ['PATH']

import pandas as pd
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

import sparknlp
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl


Start the Spark session

In [5]:
spark = sparknlp_jsl.start(secret)

## 2. Select the Entity Resolver model and construct the pipeline

Select the models:

**ICD10 Entity Resolver models:**

1.   **chunkresolve_icd10cm_clinical**
2.   **chunkresolve_icd10cm_diseases_clinical**
3.   **chunkresolve_icd10cm_injuries_clinical**
4.   **chunkresolve_icd10cm_musculoskeletal_clinical**
5.   **chunkresolve_icd10cm_neoplasms_clinical**
6.   **chunkresolve_icd10cm_puerile_clinical**



For more details: https://github.com/JohnSnowLabs/spark-nlp-models#pretrained-models---spark-nlp-for-healthcare

In [12]:
#ner and entity resolver mapping
ner_er_dict = {'chunkresolve_icd10cm_clinical': 'ner_clinical',
              'chunkresolve_icd10cm_diseases_clinical': 'ner_diseases',
              'chunkresolve_icd10cm_injuries_clinical': 'ner_jsl',
              'chunkresolve_icd10cm_musculoskeletal_clinical': 'ner_jsl',
              'chunkresolve_icd10cm_neoplasms_clinical': 'ner_jsl',
              'chunkresolve_icd10cm_puerile_clinical': 'ner_clinical'}
# ER models are specific to the codes they are trained on, so we filter out entity labels that would add noise.
wl_er_dict = {'chunkresolve_icd10cm_clinical': ['PROBLEM'],
              'chunkresolve_icd10cm_diseases_clinical': ['Disease'],
              'chunkresolve_icd10cm_injuries_clinical': ['Diagnosis'],
              'chunkresolve_icd10cm_musculoskeletal_clinical': ['Diagnosis'],
              'chunkresolve_icd10cm_neoplasms_clinical': ['Diagnosis'],
              'chunkresolve_icd10cm_puerile_clinical': ['PROBLEM']}

# Change this to the model you want to use and re-run the cells below.
model = 'chunkresolve_icd10cm_clinical'
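Because `model` is edited by hand, a typo in the name only surfaces later as a `KeyError` inside the pipeline cell. As a sketch, a small lookup helper can report the valid choices instead. `resolve_model_config` is a hypothetical name, and the two dictionaries are repeated here only so the example runs on its own.

```python
# Same mappings as in the cell above, repeated so the sketch is self-contained.
ner_er_dict = {'chunkresolve_icd10cm_clinical': 'ner_clinical',
               'chunkresolve_icd10cm_diseases_clinical': 'ner_diseases',
               'chunkresolve_icd10cm_injuries_clinical': 'ner_jsl',
               'chunkresolve_icd10cm_musculoskeletal_clinical': 'ner_jsl',
               'chunkresolve_icd10cm_neoplasms_clinical': 'ner_jsl',
               'chunkresolve_icd10cm_puerile_clinical': 'ner_clinical'}
wl_er_dict = {'chunkresolve_icd10cm_clinical': ['PROBLEM'],
              'chunkresolve_icd10cm_diseases_clinical': ['Disease'],
              'chunkresolve_icd10cm_injuries_clinical': ['Diagnosis'],
              'chunkresolve_icd10cm_musculoskeletal_clinical': ['Diagnosis'],
              'chunkresolve_icd10cm_neoplasms_clinical': ['Diagnosis'],
              'chunkresolve_icd10cm_puerile_clinical': ['PROBLEM']}


def resolve_model_config(model):
    """Return (ner_model_name, whitelist) for a resolver, or raise ValueError."""
    if model not in ner_er_dict or model not in wl_er_dict:
        known = ", ".join(sorted(ner_er_dict))
        raise ValueError(f"Unknown resolver {model!r}; expected one of: {known}")
    return ner_er_dict[model], wl_er_dict[model]


ner_name, whitelist = resolve_model_config('chunkresolve_icd10cm_clinical')
print(ner_name, whitelist)  # ner_clinical ['PROBLEM']
```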

Create the pipeline

In [13]:
document_assembler = DocumentAssembler() \
    .setInputCol('text')\
    .setOutputCol('document')

sentence_detector = SentenceDetector() \
    .setInputCols(['document'])\
    .setOutputCol('sentences')

tokenizer = Tokenizer()\
    .setInputCols(['sentences']) \
    .setOutputCol('tokens')

embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models')\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

ner_model = NerDLModel.pretrained(ner_er_dict[model], 'en', 'clinical/models')\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_tags")

# Keep only the entity labels whitelisted for the selected resolver. You can define your own whitelist as well.
ner_chunker = NerConverter()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunk").setWhiteList(wl_er_dict[model])

chunk_embeddings = ChunkEmbeddings()\
    .setInputCols("ner_chunk", "embeddings")\
    .setOutputCol("chunk_embeddings")

entity_resolver = \
    ChunkEntityResolverModel.pretrained(model,"en","clinical/models")\
    .setInputCols("tokens","chunk_embeddings").setOutputCol("resolution")
    
pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_chunker,
    chunk_embeddings,
    entity_resolver])

empty_df = spark.createDataFrame([['']]).toDF("text")
pipeline_model = pipeline.fit(empty_df)

light_pipeline = sparknlp.base.LightPipeline(pipeline_model)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
Approximate size to download 13.8 MB
[OK!]
chunkresolve_icd10cm_clinical download started this may take some time.
Approximate size to download 166.3 MB
[OK!]
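Each stage in the pipeline reads columns produced by earlier stages, so the order in `Pipeline(stages=[...])` matters. The sketch below does not use Spark at all: it copies the input and output column names from the cell above into plain tuples and checks that every input is available by the time its stage runs. `find_unwired_inputs` is a hypothetical helper for illustration.

```python
# (stage name, input columns, output column), copied from the pipeline cell.
STAGES = [
    ("document_assembler", ["text"], "document"),
    ("sentence_detector", ["document"], "sentences"),
    ("tokenizer", ["sentences"], "tokens"),
    ("embeddings", ["sentences", "tokens"], "embeddings"),
    ("ner_model", ["sentences", "tokens", "embeddings"], "ner_tags"),
    ("ner_chunker", ["sentences", "tokens", "ner_tags"], "ner_chunk"),
    ("chunk_embeddings", ["ner_chunk", "embeddings"], "chunk_embeddings"),
    ("entity_resolver", ["tokens", "chunk_embeddings"], "resolution"),
]


def find_unwired_inputs(stages, source_columns=("text",)):
    """Return (stage, column) pairs whose input is not produced upstream."""
    available = set(source_columns)
    problems = []
    for name, inputs, output in stages:
        for column in inputs:
            if column not in available:
                problems.append((name, column))
        # The stage's output becomes available to every later stage.
        available.add(output)
    return problems


print(find_unwired_inputs(STAGES))  # []
```

An empty list means the column names chain correctly from `text` through to `resolution`. Renaming an output column in one stage without updating its consumers would show up here as a `(stage, column)` pair.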


## 3. Create example inputs

In [21]:
# Enter examples as strings in this list
input_list = [
"""Nature and course of the diagnosis has been discussed with the patient. Based on her presentation without any history of obvious fall or trauma and past history of malignant melanoma, this appears to be a pathological fracture of the left proximal hip. At the present time, I would recommend obtaining a bone scan and repeat x-rays, which will include AP pelvis, femur, hip including knee. She denies any pain elsewhere. She does have a past history of back pain and sciatica, but at the present time, this appears to be a metastatic bone lesion with pathological fracture. I have discussed the case with Dr. X and recommended oncology consultation.

With the above fracture and presentation, she needs a left hip hemiarthroplasty versus calcar hemiarthroplasty, cemented type. Indication, risk, and benefits of left hip hemiarthroplasty has been discussed with the patient, which includes, but not limited to bleeding, infection, nerve injury, blood vessel injury, dislocation early and late, persistent pain, leg length discrepancy, myositis ossificans, intraoperative fracture, prosthetic fracture, need for conversion to total hip replacement surgery, revision surgery, DVT, pulmonary embolism, risk of anesthesia, need for blood transfusion, and cardiac arrest. She understands above and is willing to undergo further procedure. The goal and the functional outcome have been explained. Further plan will be discussed with her once we obtain the bone scan and the radiographic studies. We will also await for the oncology feedback and clearance.""",
]

## 4. Run the pipeline

In [22]:
df = spark.createDataFrame(pd.DataFrame({"text": input_list}))
result = pipeline_model.transform(df)
light_result = light_pipeline.fullAnnotate(input_list[0])

## 5. Visualize

Full Pipeline

In [23]:
result.select(
    F.explode(
        F.arrays_zip('resolution.metadata', 'resolution.begin' , 'resolution.end', 'resolution.result')
    ).alias('cols')
).select(
    F.expr("cols['0']['token']").alias('token/chunk'),
    F.expr("cols['1']").alias('begin'),
    F.expr("cols['2']").alias('end'),
    F.expr("cols['0']['resolved_text']").alias('resolved_text'),
    F.expr("cols['3']").alias('icd10_code'),
).toPandas()

| | token/chunk | begin | end | resolved_text | icd10_code |
|---|---|---|---|---|---|
| 0 | obvious fall | 121 | 132 | Unspecified fall, subsequent encounter | W19XXXD |
| 1 | trauma | 137 | 142 | Obstetric trauma, unspecified | O719 |
| 2 | malignant melanoma | 164 | 181 | Malignant melanoma of lip | C430 |
| 3 | a pathological fracture of the left proximal hip | 203 | 250 | Pseudocoxalgia, left hip | M9132 |
| 4 | pain | 405 | 408 | Precordial pain | R072 |
| 5 | back pain | 453 | 461 | Low back pain | M545 |
| 6 | sciatica | 467 | 474 | Sciatica, unspecified side | M5430 |
| 7 | a metastatic bone lesion | 521 | 544 | Pyoderma | L080 |
| 8 | pathological fracture | 551 | 571 | Pathological fracture, pelvis, sequela | M84454S |
| 9 | the above fracture | 656 | 673 | Fracture of manubrium, initial encounter for o... | S2221XB |
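The `begin` and `end` columns are character offsets into the input text, and in the rows above `end` points at the last character of the chunk rather than one past it. A minimal sketch, using only the opening sentences of the example input from section 3 (`chunk_at` is a hypothetical helper):

```python
# Opening sentences of the example input, copied verbatim from section 3.
text = (
    "Nature and course of the diagnosis has been discussed with the patient. "
    "Based on her presentation without any history of obvious fall or trauma "
    "and past history of malignant melanoma, this appears to be a pathological "
    "fracture of the left proximal hip."
)


def chunk_at(text, begin, end):
    """Slice a chunk out of the text; `end` is treated as inclusive."""
    return text[begin:end + 1]


print(chunk_at(text, 121, 132))  # obvious fall
print(chunk_at(text, 137, 142))  # trauma
print(chunk_at(text, 164, 181))  # malignant melanoma
```

The `+ 1` matters when mapping resolved codes back onto the source text, for example to highlight chunks: a plain `text[begin:end]` slice would drop the final character of every chunk.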


Light Pipeline

In [24]:
light_result[0]['resolution']

[Annotation(entity, 121, 132, W19XXXD, {'chunk': '0', 'all_k_results': 'W19XXXD:::W19XXXA:::V00121S:::V00211S:::W19XXXS:::Z9181:::V00131S:::W07XXXS:::V00311A:::W051XXS:::W1839XS:::W07XXXA:::W15XXXD:::W1802XS:::W091XXD:::W102XXA:::W1839XD:::W1839XA:::W11XXXS:::W1800XS:::W1809XS:::W009XXS:::V9338XS:::W1802XD:::W1802XA', 'all_k_distances': '0.8096:::0.9425:::0.9545:::0.9545:::1.0125:::1.0359:::1.1416:::1.1928:::1.2479:::1.3207:::1.3289:::1.3686:::1.3762:::1.3879:::1.3972:::1.4045:::1.4337:::1.4352:::1.4709:::1.4717:::1.4731:::1.4738:::1.4778:::1.4877:::1.4931', 'confidence': '0.0628', 'all_k_resolutions': 'Unspecified fall, subsequent encounter:::Unspecified fall, initial encounter:::Fall from non-in-line roller-skates, sequela:::Fall from ice-skates, sequela:::Unspecified fall, sequela:::History of falling:::Fall from skateboard, sequela:::Fall from chair, sequela:::Fall from snowboard, initial encounter:::Fall from non-moving nonmotorized scooter, sequela:::Other fall on same level, seq