

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICDO.ipynb)




# **ICDO coding**

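To run this notebook yourself, you need to upload your license keys. Otherwise, you can look at the example outputs at the bottom of the notebook. To upload license keys, open the file explorer on the left side of the screen and upload your license keys file to the folder that opens. Note that the first cell reads `/content/spark_nlp_for_healthcare.json`, so a file uploaded as `workshop_license_keys.json` must be renamed, or the path in the cell changed to match.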

## 1. Colab Setup

Import license keys

In [1]:
import os
import json

with open('/content/spark_nlp_for_healthcare.json', 'r') as f:
    license_keys = json.load(f)

license_keys.keys()

secret = license_keys['SECRET']
os.environ['SPARK_NLP_LICENSE'] = license_keys['SPARK_NLP_LICENSE']
os.environ['AWS_ACCESS_KEY_ID'] = license_keys['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY'] = license_keys['AWS_SECRET_ACCESS_KEY']
sparknlp_version = license_keys["PUBLIC_VERSION"]
jsl_version = license_keys["JSL_VERSION"]

print('SparkNLP Version:', sparknlp_version)
print('SparkNLP-JSL Version:', jsl_version)

SparkNLP Version: 2.6.0
SparkNLP-JSL Version: 2.6.0
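
The cell above indexes the license dictionary directly, so a missing entry only surfaces as a bare `KeyError`. A minimal sketch of an up-front check follows, assuming the same key names used in that cell; the helper `missing_license_keys` is hypothetical and not part of Spark NLP.

```python
# Key names are taken from the cell above; the helper itself is hypothetical.
REQUIRED_KEYS = [
    "SECRET",
    "SPARK_NLP_LICENSE",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "PUBLIC_VERSION",
    "JSL_VERSION",
]


def missing_license_keys(license_keys):
    """Return the required keys that are absent from the loaded license dict."""
    return [key for key in REQUIRED_KEYS if key not in license_keys]


# Deliberately incomplete dict with placeholder values (not real credentials).
example = {"SECRET": "placeholder", "PUBLIC_VERSION": "2.6.0", "JSL_VERSION": "2.6.0"}
print(missing_license_keys(example))
```

Calling the helper on `license_keys` right after `json.load` reports every missing entry at once, instead of failing on the first one.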


Install dependencies

In [2]:
# Install Java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
! java -version

# Install pyspark
! pip install --ignore-installed -q pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed spark-nlp==$sparknlp_version
! python -m pip install --upgrade spark-nlp-jsl==$jsl_version --extra-index-url https://pypi.johnsnowlabs.com/$secret

openjdk version "11.0.8" 2020-07-14
OpenJDK Runtime Environment (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1)
OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Ubuntu-0ubuntu118.04.1, mixed mode, sharing)
  Building wheel for pyspark (setup.py) ... done
Collecting spark-nlp==2.6.0
  Downloading https://files.pythonhosted.org/packages/e4/30/1bd0abcc97caed518efe527b9146897255dffcf71c4708586a82ea9eb29a/spark_nlp-2.6.0-py2.py3-none-any.whl (125kB)
Installing collected packages: spark-nlp
Successfully installed spark-nlp-2.6.0
Looking in indexes: https://pypi.org/simple, https://pypi.johnsnowlabs.com/2.6.0-8388813d58b67fa25bf9cf603393363af96dba16
Collecting spark-nlp-jsl==2.6.0
  Downloading https://pypi.johnsnowlabs.com/2.6.0-8388813d58b67fa25bf9cf603393363af96dba16/spark-nlp-jsl/spark_nlp_

Import dependencies into Python

In [2]:
os.environ['JAVA_HOME'] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ['PATH'] = os.environ['JAVA_HOME'] + "/bin:" + os.environ['PATH']

import pandas as pd
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

import sparknlp
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl


Start the Spark session

In [3]:
spark = sparknlp_jsl.start(secret)

## 2. Select the Entity Resolver model and construct the pipeline

Select the models:

**ICDO Entity Resolver models:**

1.   **chunkresolve_icdo_clinical**

**NER models that support neoplasms:**
1.   **ner_bionlp**

For more details: https://github.com/JohnSnowLabs/spark-nlp-models#pretrained-models---spark-nlp-for-healthcare

In [4]:
# Change this to the model you want to use and re-run the cells below.
ER_MODEL_NAME = "chunkresolve_icdo_clinical"
NER_MODEL_NAME = "ner_bionlp"

Create the pipeline

In [5]:
document_assembler = DocumentAssembler() \
    .setInputCol('text')\
    .setOutputCol('document')

sentence_detector = SentenceDetector() \
    .setInputCols(['document'])\
    .setOutputCol('sentences')

tokenizer = Tokenizer()\
    .setInputCols(['sentences']) \
    .setOutputCol('tokens')

embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models')\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

clinical_ner_model = NerDLModel.pretrained(NER_MODEL_NAME, 'en', 'clinical/models')\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("clinical_ner_tags")

# Use a whitelist to keep only the neoplasm-related entity types; all other chunks are dropped
clinical_ner_chunker = NerConverter()\
    .setInputCols(["sentences", "tokens", "clinical_ner_tags"])\
    .setOutputCol("clinical_ner_chunks")\
    .setWhiteList(["Pathological_formation", "Cancer"])

chunk_embeddings = ChunkEmbeddings()\
    .setInputCols("clinical_ner_chunks", "embeddings")\
    .setOutputCol("chunk_embeddings")

entity_resolver = \
    ChunkEntityResolverModel.pretrained(ER_MODEL_NAME,"en","clinical/models")\
    .setInputCols("tokens","chunk_embeddings").setOutputCol("resolution")

pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    clinical_ner_model,
    clinical_ner_chunker,
    chunk_embeddings,
    entity_resolver])

empty_df = spark.createDataFrame([['']]).toDF("text")
pipeline_model = pipeline.fit(empty_df)
light_pipeline = LightPipeline(pipeline_model)

pos_clinical download started this may take some time.
Approximate size to download 1.7 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.6 MB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_bionlp download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
chunkresolve_icdo_clinical download started this may take some time.
Approximate size to download 8.2 MB
[OK!]
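
Each stage in the pipeline reads columns that an earlier stage wrote, so a misspelled column name breaks the chain. As an illustration only, the sketch below re-states the input and output columns from the cell above as plain Python data and checks that wiring without Spark. The `STAGES` list and the `unresolved_inputs` helper are hypothetical and not part of Spark NLP.

```python
# (stage name, input columns, output column), copied from the pipeline cell above.
STAGES = [
    ("document_assembler", ["text"], "document"),
    ("sentence_detector", ["document"], "sentences"),
    ("tokenizer", ["sentences"], "tokens"),
    ("embeddings", ["sentences", "tokens"], "embeddings"),
    ("clinical_ner_model", ["sentences", "tokens", "embeddings"], "clinical_ner_tags"),
    ("clinical_ner_chunker", ["sentences", "tokens", "clinical_ner_tags"], "clinical_ner_chunks"),
    ("chunk_embeddings", ["clinical_ner_chunks", "embeddings"], "chunk_embeddings"),
    ("entity_resolver", ["tokens", "chunk_embeddings"], "resolution"),
]


def unresolved_inputs(stages, initial_columns=("text",)):
    """Return (stage, column) pairs whose input is not produced by an earlier stage."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        problems.extend((name, col) for col in inputs if col not in available)
        available.add(output)
    return problems


# An empty list means every input column is produced before it is consumed.
print(unresolved_inputs(STAGES))
```

The same check makes it easy to see what breaks if a stage is removed or reordered, for example dropping `document_assembler` leaves `sentence_detector` without its `document` input.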


## 3. Create example inputs

In [10]:
# Enter examples as strings in this array
input_list = [
"""A very pleasant 63-year-old hypertensive, nondiabetic, African-American female with a history of peritoneal mesothelioma. The patient has received prior intravenous chemotherapy. Due to some increasing renal insufficiency and difficulties with hydration, it was elected to change her to intraperitoneal therapy. She had her first course with intraperitoneal cisplatin, which was very difficultly tolerated by her. Therefore, on the last hospitalization for IP chemo, she was switched to Taxol. The patient since her last visit has done relatively well. She had no acute problems and has basically only chronic difficulties. She has had some decrease in her appetite, although her weight has been stable. She has had no fever, chills, or sweats. Activity remains good and she has continued difficulty with depression associated with type 1 bipolar disease. She had a recent CT scan of the chest and abdomen. The report showed the following findings. In the chest, there was a small hiatal hernia and a calcification in the region of the mitral valve. There was one mildly enlarged mediastinal lymph node. Several areas of ground-glass opacity were noted in the lower lungs, which were subtle and nonspecific. No pulmonary masses were noted. In the abdomen, there were no abnormalities of the liver, pancreas, spleen, and left adrenal gland. On the right adrenal gland, a 17 x 13 mm right adrenal adenoma was noted. There were some bilateral renal masses present, which were not optimally evaluated due to noncontrast study. A hyperdense focus in the lower pole of the left kidney was felt to most probably represent a hemorrhagic renal cyst. It was unchanged from February and measured 9 mm. There was again minimal left pelvic/iliac _______ with right and left peritoneal catheters noted and were unremarkable. Mesenteric nodes were seen, which were similar in appearance to the previous study that was felt somewhat more conspicuous due to opacified bowel adjacent to them. 
There was a conglomerate omental mass, which had decreased in volume when compared to previous study, now measuring 8.4 x 1.6 cm. In the pelvis, there was a small amount of ascites in the right pelvis extending from the inferior right paracolic gutter. No suspicious osseous lesions were noted.""",
]

## 4. Run the pipeline

In [11]:
df = spark.createDataFrame(pd.DataFrame({"text": input_list}))
result = pipeline_model.transform(df)
light_result = light_pipeline.fullAnnotate(input_list[0])

## 5. Visualize

Full Pipeline

In [12]:
result.select(
    F.explode(
        F.arrays_zip('resolution.metadata', 'resolution.begin' , 'resolution.end', 'resolution.result')
    ).alias('cols')
).select(
    F.expr("cols['0']['token']").alias('token/chunk'),
    F.expr("cols['1']").alias('begin'),
    F.expr("cols['2']").alias('end'),
    F.expr("cols['0']['resolved_text']").alias('resolved_text'),
    F.expr("cols['3']").alias('icdo_code'),
).toPandas()

| | token/chunk | begin | end | resolved_text | icdo_code |
|---|---|---|---|---|---|
| 0 | peritoneal mesothelioma | 97 | 119 | Mesothelioma, malignant | 9050/3 |
| 1 | pulmonary masses | 1211 | 1226 | Pulmonary blastoma | 8972/3 |
| 2 | adrenal adenoma | 1387 | 1401 | Adrenal cortical carcinoma | 8370/3 |
| 3 | renal cyst | 1629 | 1638 | Renal cell carcinoma | 8312/3 |
| 4 | osseous lesions | 2242 | 2256 | Paget disease, extramammary | 8542/3 |
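
Judging by the first row of the output above, `begin` and `end` are character offsets into the input text, with `end` inclusive: `end - begin + 1` equals the chunk length. A minimal sketch, using only the first sentence of the example note; the helper `chunk_at` is hypothetical.

```python
# First sentence of the example note, copied from the input in section 3.
text = (
    "A very pleasant 63-year-old hypertensive, nondiabetic, "
    "African-American female with a history of peritoneal mesothelioma."
)


def chunk_at(text, begin, end):
    """Slice a chunk out of `text`, treating `end` as an inclusive offset."""
    return text[begin:end + 1]


# Offsets from row 0 of the output above.
print(chunk_at(text, 97, 119))
```

This is useful when highlighting resolved chunks in the original note, since a plain `text[begin:end]` slice would drop the last character.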


Light Pipeline

In [13]:
light_result[0]['resolution']

[Annotation(entity, 97, 119, 9050/3, {'chunk': '0', 'all_k_results': '9050/3:::9052/3:::9051/3:::9053/3:::8815/0:::9150/1:::8815/3:::9150/3:::9150/0:::8680/3:::9651/3:::8630/3:::8714/3:::8200/3:::9110/3:::8000/3:::9560/3:::8830/3:::8632/3:::8692/3:::8245/3:::9130/3:::8561/3:::9580/3:::8243/3', 'all_k_distances': '0.7993:::0.9531:::1.0560:::1.0570:::1.0850:::1.2076:::1.2089:::1.2089:::1.2216:::1.6302:::1.6310:::1.6321:::1.6347:::1.6351:::1.6382:::1.6383:::1.6385:::1.6459:::1.6461:::1.6492:::1.6786:::1.7340:::1.7371:::1.7567:::1.7786', 'confidence': '0.0740', 'all_k_resolutions': 'Mesothelioma, malignant:::Epithel. mesothelioma, mal.:::Fibrous mesothelioma, malignant:::Mesothelioma, biphasic, malignant:::Solitary fibrous tumor:::Hemangiopericytoma, NOS:::Solitary fibrous tumor, malignant:::Hemangiopericytoma, malignant:::Hemangiopericytoma, benign:::Paraganglioma, malignant:::Hodgkin lymphoma, lymphocyte-rich:::Androblastoma, malignant:::PEComa, malignant:::Adenoid cystic carcinoma:::Mes