

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_SNOMED.ipynb)




# **SNOMED coding**

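To run this notebook yourself, you first need to upload your license keys. Otherwise, you can look at the example outputs at the bottom of the notebook. To upload the keys, open the file explorer on the left side of the screen and upload your license file to the folder that opens. The setup cell below reads the file from `/content/spark_nlp_for_healthcare.json`, so either save `workshop_license_keys.json` under that name or adjust the path in the cell.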

## 1. Colab Setup

Import license keys

In [1]:
import os
import json

with open('/content/spark_nlp_for_healthcare.json', 'r') as f:
    license_keys = json.load(f)

license_keys.keys()

secret = license_keys['SECRET']
os.environ['SPARK_NLP_LICENSE'] = license_keys['SPARK_NLP_LICENSE']
os.environ['AWS_ACCESS_KEY_ID'] = license_keys['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY'] = license_keys['AWS_SECRET_ACCESS_KEY']
sparknlp_version = license_keys["PUBLIC_VERSION"]
jsl_version = license_keys["JSL_VERSION"]

print('SparkNLP Version:', sparknlp_version)
print('SparkNLP-JSL Version:', jsl_version)

SparkNLP Version: 2.6.0
SparkNLP-JSL Version: 2.6.0
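
If the license file is missing a field, the setup cell fails with a bare `KeyError`. The sketch below checks for every key that cell reads before any of them is used. The helper name `missing_license_keys` and the placeholder values are assumptions made for illustration; they are not part of Spark NLP.

```python
import json

# Keys the setup cell reads from the license file.
REQUIRED_KEYS = [
    "SECRET",
    "SPARK_NLP_LICENSE",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "PUBLIC_VERSION",
    "JSL_VERSION",
]


def missing_license_keys(license_keys):
    """Return the required keys absent from the parsed JSON (hypothetical helper)."""
    return [key for key in REQUIRED_KEYS if key not in license_keys]


# Illustrative stand-in for the parsed license file; the values are placeholders.
example = json.loads(
    '{"SECRET": "x", "PUBLIC_VERSION": "2.6.0", "JSL_VERSION": "2.6.0"}'
)
print(missing_license_keys(example))
```

In the notebook, the same check could be applied to `license_keys` right after `json.load(f)`.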


Install dependencies

In [2]:
# Install Java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
! java -version

# Install pyspark
! pip install --ignore-installed -q pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed spark-nlp==$sparknlp_version
! python -m pip install --upgrade spark-nlp-jsl==$jsl_version --extra-index-url https://pypi.johnsnowlabs.com/$secret

openjdk version "11.0.8" 2020-07-14
OpenJDK Runtime Environment (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1)
OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1, mixed mode, sharing)
  Building wheel for pyspark (setup.py) ... done
Collecting spark-nlp==2.6.0
  Downloading https://files.pythonhosted.org/packages/e4/30/1bd0abcc97caed518efe527b9146897255dffcf71c4708586a82ea9eb29a/spark_nlp-2.6.0-py2.py3-none-any.whl (125kB)
Installing collected packages: spark-nlp
Successfully installed spark-nlp-2.6.0
Looking in indexes: https://pypi.org/simple, https://pypi.johnsnowlabs.com/2.6.0-8388813d58b67fa25bf9cf603393363af96dba16
Collecting spark-nlp-jsl==2.6.0
  Downloading https://pypi.johnsnowlabs.com/2.6.0-8388813d58b67fa25bf9cf603393363af96dba16/spark-nlp-jsl/spark_nlp_

Import dependencies into Python

In [2]:
os.environ['JAVA_HOME'] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ['PATH'] = os.environ['JAVA_HOME'] + "/bin:" + os.environ['PATH']

import pandas as pd
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

import sparknlp
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl


Start the Spark session

In [3]:
spark = sparknlp_jsl.start(secret)

## 2. Select the Entity Resolver model and construct the pipeline

Select the models:


* SNOMED Entity Resolver model: **chunkresolve_snomed_findings_clinical**

For more details, see https://github.com/JohnSnowLabs/spark-nlp-models#pretrained-models---spark-nlp-for-healthcare

In [4]:
# Change this to the model you want to use and re-run the cells below.
ER_MODEL_NAME = "chunkresolve_snomed_findings_clinical"
NER_MODEL_NAME = "ner_clinical"

Create the pipeline

In [5]:
document_assembler = DocumentAssembler() \
    .setInputCol('text')\
    .setOutputCol('document')

sentence_detector = SentenceDetector() \
    .setInputCols(['document'])\
    .setOutputCol('sentences')

tokenizer = Tokenizer()\
    .setInputCols(['sentences']) \
    .setOutputCol('tokens')

embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models')\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

ner_model = NerDLModel.pretrained(NER_MODEL_NAME, 'en', 'clinical/models')\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner_tags")

# Keep only PROBLEM and TEST entities by whitelisting them
ner_chunker = NerConverter()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(['PROBLEM', 'TEST'])

chunk_embeddings = ChunkEmbeddings()\
    .setInputCols("ner_chunk", "embeddings")\
    .setOutputCol("chunk_embeddings")

entity_resolver = \
    ChunkEntityResolverModel.pretrained(ER_MODEL_NAME,"en","clinical/models")\
    .setInputCols("tokens","chunk_embeddings").setOutputCol("resolution")
    
pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_chunker,
    chunk_embeddings,
    entity_resolver])

empty_df = spark.createDataFrame([['']]).toDF("text")
pipeline_model = pipeline.fit(empty_df)
light_pipeline = LightPipeline(pipeline_model)

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical download started this may take some time.
Approximate size to download 13.8 MB
[OK!]
chunkresolve_snomed_findings_clinical download started this may take some time.
Approximate size to download 162.6 MB
[OK!]
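
Each stage in the pipeline reads columns that an earlier stage must already have produced, so a misspelled column name or a misordered stage only surfaces when the pipeline is fitted. The sketch below checks that wiring with plain tuples instead of Spark NLP objects. The column names are copied from the pipeline cell; the helper `unresolved_inputs` is hypothetical.

```python
# (stage name, input columns, output column), copied from the pipeline cell.
STAGES = [
    ("document_assembler", ["text"], "document"),
    ("sentence_detector", ["document"], "sentences"),
    ("tokenizer", ["sentences"], "tokens"),
    ("embeddings", ["sentences", "tokens"], "embeddings"),
    ("ner_model", ["sentences", "tokens", "embeddings"], "ner_tags"),
    ("ner_chunker", ["sentences", "tokens", "ner_tags"], "ner_chunk"),
    ("chunk_embeddings", ["ner_chunk", "embeddings"], "chunk_embeddings"),
    ("entity_resolver", ["tokens", "chunk_embeddings"], "resolution"),
]


def unresolved_inputs(stages, initial_columns=("text",)):
    """List (stage, column) pairs whose input column no earlier stage produces."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        for column in inputs:
            if column not in available:
                problems.append((name, column))
        available.add(output)
    return problems


# An empty list means every input column is produced upstream.
print(unresolved_inputs(STAGES))
```

Dropping the first stage, for example, makes the check report that `sentence_detector` has no `document` column to read.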


## 3. Create example inputs

In [6]:
# Enter examples as strings in this array
input_list = [
"""She is followed by Dr. X in our office and has a history of severe tricuspid regurgitation with mild elevation and PA pressure. On 05/12/08, preserved left and right ventricular systolic function, aortic sclerosis with apparent mild aortic stenosis, and bi-atrial enlargement. She has previously had a Persantine Myoview nuclear rest-stress test scan completed at ABCD Medical Center in 07/06 that was negative. She has had significant mitral valve regurgitation in the past being moderate, but on the most recent echocardiogram on 05/12/08, that was not felt to be significant. She has a history of hypertension and EKGs in our office show normal sinus rhythm with frequent APCs versus wandering atrial pacemaker. She does have a history of significant hypertension in the past. She has had dizzy spells and denies clearly any true syncope. She has had bradycardia in the past from beta-blocker therapy."""
]

## 4. Run the pipeline

In [7]:
df = spark.createDataFrame(pd.DataFrame({"text": input_list}))
result = pipeline_model.transform(df)
light_result = light_pipeline.fullAnnotate(input_list[0])

## 5. Visualize

Full Pipeline

In [8]:
result.select(
    F.explode(
        F.arrays_zip('resolution.metadata', 'resolution.begin' , 'resolution.end', 'resolution.result')
    ).alias('cols')
).select(
    F.expr("cols['0']['token']").alias('token/chunk'),
    F.expr("cols['1']").alias('begin'),
    F.expr("cols['2']").alias('end'),
    F.expr("cols['0']['resolved_text']").alias('resolved_text'),
    F.expr("cols['3']").alias('snomed_code'),
).toPandas()

| | token/chunk | begin | end | resolved_text | snomed_code |
|---|---|---|---|---|---|
| 0 | severe tricuspid regurgitation | 60 | 89 | Tricuspid regurgitation | 111287006 |
| 1 | mild elevation | 96 | 109 | Mild present pain | 301380003 |
| 2 | PA pressure | 115 | 125 | Increased pressure | 51590001 |
| 3 | aortic sclerosis | 197 | 212 | Non-rheumatic aortic sclerosis | 315615007 |
| 4 | apparent mild aortic stenosis | 219 | 247 | Isolated aortic stenosis | 276790000 |
| 5 | bi-atrial enlargement | 254 | 274 | Right atrial enlargement | 67751000119106 |
| 6 | a Persantine Myoview nuclear rest-stress test ... | 300 | 349 | Hepatitis A test negative (finding) | 165996008 |
| 7 | significant mitral valve regurgitation | 424 | 461 | Mitral valve regurgitation | 48724000 |
| 8 | recent echocardiogram | 507 | 527 | Echocardiogram normal | 169240004 |
| 9 | hypertension | 600 | 611 | Hypertension | 38341003 |
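
The `begin` and `end` values in this output are character offsets into the input text, and `end` is inclusive. A short sketch that recovers the first three chunks from the opening of the example note by slicing; the helper `chunk_text` is a hypothetical name used only for illustration.

```python
# Opening sentence of the example note from section 3.
text = (
    "She is followed by Dr. X in our office and has a history of severe "
    "tricuspid regurgitation with mild elevation and PA pressure."
)


def chunk_text(text, begin, end):
    """Slice a chunk using inclusive begin/end offsets (hypothetical helper)."""
    return text[begin:end + 1]


# Offsets taken from the first three rows of the output.
for begin, end in [(60, 89), (96, 109), (115, 125)]:
    print(begin, end, chunk_text(text, begin, end))
```

Because `end` is inclusive, the slice needs `end + 1`; slicing with `text[begin:end]` would drop the last character of each chunk.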


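Besides the top-ranked code, each resolution annotation carries its ranked alternatives in the metadata as `:::`-delimited strings (`all_k_results`, `all_k_distances`, `all_k_resolutions`), as the Light Pipeline output below shows. A minimal sketch of splitting those strings into candidate tuples, using the first three candidates from that output. The helper `ranked_candidates` is hypothetical, and in the notebook the dictionary would presumably come from the annotation's `metadata` attribute.

```python
def ranked_candidates(metadata, sep=":::"):
    """Split the ':::'-delimited metadata fields into (code, name, distance) tuples."""
    codes = metadata["all_k_results"].split(sep)
    names = metadata["all_k_resolutions"].split(sep)
    distances = [float(d) for d in metadata["all_k_distances"].split(sep)]
    return list(zip(codes, names, distances))


# First three candidates for "severe tricuspid regurgitation", copied from the output.
metadata = {
    "all_k_results": "111287006:::297286001:::83119008",
    "all_k_distances": "0.6667:::1.3167:::1.4276",
    "all_k_resolutions": (
        "Tricuspid regurgitation:::Paraprosthetic tricuspid regurgitation"
        ":::Congenital tricuspid regurgitation"
    ),
}

for code, name, distance in ranked_candidates(metadata):
    print(code, name, distance)
```

The candidates appear in order of increasing distance in that output, so the first tuple corresponds to the code reported as the resolution.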
Light Pipeline

In [9]:
light_result[0]['resolution']

[Annotation(entity, 60, 89, 111287006, {'chunk': '0', 'all_k_results': '111287006:::297286001:::83119008:::233865008:::67696008:::703319006:::48724000:::194741006:::194990000:::472840007:::703194006:::78104003:::10337008:::703292000:::250988002:::29928006:::703185000:::84642008:::233861004:::703195007:::194736003:::194726006:::1131009:::91434003:::253582009', 'all_k_distances': '0.6667:::1.3167:::1.4276:::1.4818:::1.5533:::1.6170:::1.6680:::1.8817:::1.9035:::1.9400:::1.9400:::2.0000:::2.0167:::2.0198:::2.0953:::2.2063:::2.2077:::2.2598:::2.2604:::2.3081:::2.3270:::2.3957:::2.4388:::2.5232:::2.5611', 'confidence': '0.1309', 'all_k_resolutions': 'Tricuspid regurgitation:::Paraprosthetic tricuspid regurgitation:::Congenital tricuspid regurgitation:::Functional tricuspid regurgitation:::Rheumatic tricuspid regurgitation:::Tricuspid stenosis with regurgitation:::Mitral regurgitation:::Rheumatic tricuspid stenosis and regurgitation:::Tricuspid valve regurgitation, nonrheumatic:::Regurgitatio