![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICDO.ipynb)

## **Resolve Oncology terminology using the ICD-O taxonomy**

To run this notebook yourself, you need to upload your license keys. Run the cell below to do so. Alternatively, open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens. Otherwise, you can look at the example outputs at the bottom of the notebook.

## **Colab Setup**

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)
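
The later cells read `PUBLIC_VERSION`, `JSL_VERSION`, and `SECRET` from the uploaded file, so it can help to check for them before installing anything. The following is a minimal sketch; the helper name `missing_license_keys` and the example dictionary are hypothetical, and a real license file may contain additional keys.

```python
import json

# Assumption: these are the keys the later cells rely on
# (PUBLIC_VERSION and JSL_VERSION for pip, SECRET for the package index and the session).
REQUIRED_KEYS = ("PUBLIC_VERSION", "JSL_VERSION", "SECRET")


def missing_license_keys(license_dict):
    """Return the required keys that are absent or empty in the parsed license file."""
    return [key for key in REQUIRED_KEYS if not license_dict.get(key)]


# Hypothetical placeholder content, not a real license
example = json.loads('{"PUBLIC_VERSION": "3.4.4", "JSL_VERSION": "3.5.2"}')
print(missing_license_keys(example))  # ['SECRET']
```

An empty list means all three keys are present and the install cell can resolve its `$` variables.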

## **Install dependencies**

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

## **Import dependencies into Python and start the Spark session**

In [3]:
import json
import os

from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession

import sparknlp
import sparknlp_jsl

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
from sparknlp.util import *
from sparknlp.pretrained import ResourceDownloader
from pyspark.sql import functions as F

from sparknlp_display import EntityResolverVisualizer

import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', None)

import string
import numpy as np

params = {"spark.driver.memory":"16G",
          "spark.kryoserializer.buffer.max":"2000M",
          "spark.driver.maxResultSize":"2000M"}

spark = sparknlp_jsl.start(secret = SECRET, params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 3.4.4
Spark NLP_JSL Version : 3.5.2


### **🔎 About the model**

📌 **sbiobertresolve_icdo_augmented**: *This model maps extracted clinical entities to ICD-O codes using `sbiobert_base_cased_mli` Sentence BERT embeddings. Given an oncological entity found in the text (via NER models such as `ner_jsl`), it returns the top matching terms and resolutions along with the corresponding ICD-O codes, giving more granularity with respect to the body parts mentioned. It also returns the original Topography and Histology codes and their descriptions.*
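
The codes this resolver returns in this notebook take forms such as `8010/3-C67.9` (histology, behavior digit, and topography), `8070/3` (no site), or `C75.0` (site only). The sketch below splits such a string into its parts. The `IcdoCode` type and `parse_icdo` function are hypothetical helpers, not part of Spark NLP, and the regular expression only covers the shapes seen in the output here; group entries such as `805-808` are returned as all `None`.

```python
import re
from typing import NamedTuple, Optional


class IcdoCode(NamedTuple):
    histology: Optional[str]   # four-digit histology (cell type), e.g. "8010"
    behavior: Optional[str]    # behavior digit after the slash, e.g. "3"
    topography: Optional[str]  # site code, e.g. "C67.9"


# Optional "NNNN/N" part, then an optional "-Cnn.n" (or bare "Cnn.n") part
_PATTERN = re.compile(r"^(?:(\d{4})/(\d))?(?:-?(C\d{2}\.\d))?$")


def parse_icdo(code: str) -> IcdoCode:
    """Split a resolver code into histology, behavior, and topography."""
    match = _PATTERN.match(code.strip())
    if match is None or not any(match.groups()):
        return IcdoCode(None, None, None)
    return IcdoCode(*match.groups())


print(parse_icdo("8010/3-C67.9"))
# IcdoCode(histology='8010', behavior='3', topography='C67.9')
```

Separating the parts makes it easy to filter candidates by site or by histology after resolution.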






### **🔎 Helper Function**


In [6]:
# Returns the resolution results of a transformed Spark DataFrame as a pandas DataFrame

def get_codes_from_df(result_df, chunk, output_col, hcc=False):
    # Zip chunk and resolver annotations so that each exploded row describes one chunk
    zipped = F.explode(F.arrays_zip(chunk + '.result',
                                    chunk + '.metadata',
                                    output_col + '.result',
                                    output_col + '.metadata')).alias("cols")

    # The HCC variant keeps its original "icd10_code" column name
    code_col = "icd10_code" if hcc else output_col

    columns = [F.expr("cols['1']['sentence']").alias("sent_id"),
               F.expr("cols['0']").alias("ner_chunk"),
               F.expr("cols['1']['entity']").alias("entity"),
               F.expr("cols['2']").alias(code_col),
               F.expr("cols['3']['all_k_results']").alias("all_codes"),
               F.expr("cols['3']['all_k_resolutions']").alias("resolutions")]
    list_cols = ['all_codes', 'resolutions']

    if hcc:
        columns.append(F.expr("cols['3']['all_k_aux_labels']").alias("hcc_list"))
        list_cols.append('hcc_list')

    df = result_df.select(zipped).select(*columns).toPandas()

    # Candidates are stored as ':::'-delimited strings; split them into lists
    for col in list_cols:
        df[col] = [value.split(':::') for value in df[col]]

    return df
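
The helper relies on the resolver metadata storing its candidates as `:::`-delimited strings in `all_k_results` and `all_k_resolutions`. The plain-Python sketch below shows that splitting step in isolation, with an alignment check added. The function `split_parallel` is a hypothetical illustration, and the metadata dictionary is a shortened example built from the first three candidates for "squamous cell carcinoma" in this notebook's output.

```python
def split_parallel(metadata, keys=("all_k_results", "all_k_resolutions"), sep=":::"):
    """Split delimited metadata fields into lists and check that they stay aligned."""
    parts = {key: metadata[key].split(sep) for key in keys}
    lengths = {key: len(values) for key, values in parts.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"misaligned candidate lists: {lengths}")
    return parts


# Shortened example metadata (first three candidates only)
metadata = {
    "all_k_results": "8070/3:::8051/3:::8070/2",
    "all_k_resolutions": (
        "squamous cell carcinoma:::"
        "verrucous squamous cell carcinoma:::"
        "squamous cell carcinoma in situ"
    ),
}

parts = split_parallel(metadata)
print(parts["all_k_results"])  # ['8070/3', '8051/3', '8070/2']
```

Because the two lists are positional, the code at index `i` belongs to the resolution at index `i`; the length check guards against pairing them incorrectly.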

# **📌 "sbiobertresolve_icdo_augmented" model**

### **🔎Define Spark NLP pipeline**

In [12]:
document_assembler = DocumentAssembler() \
        .setInputCol('text') \
        .setOutputCol('document')

sentenceDetector = SentenceDetectorDLModel.pretrained()\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("embeddings")

ner_clinical = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner")

ner_converter = NerConverterInternal()\
        .setInputCols(["sentence", "token", "ner"])\
        .setOutputCol("ner_chunk")\
        .setWhiteList(["Oncological"])

c2doc = Chunk2Doc()\
        .setInputCols("ner_chunk")\
        .setOutputCol("ner_chunk_doc") 

sbert_embedder = BertSentenceEmbeddings\
        .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
        .setInputCols(["ner_chunk_doc"])\
        .setOutputCol("sbert_embeddings")
 
icdo_resolver = SentenceEntityResolverModel\
        .pretrained("sbiobertresolve_icdo_augmented","en", "clinical/models") \
        .setInputCols(["ner_chunk", "sbert_embeddings"]) \
        .setOutputCol("icdo_code")\
        .setDistanceFunction("EUCLIDEAN")

    
pipeline = Pipeline(stages=[
    document_assembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_clinical,
    ner_converter,
    c2doc,
    sbert_embedder,
    icdo_resolver
])

empty_df = spark.createDataFrame([[""]]).toDF("text")
pipeline_model = pipeline.fit(empty_df)


sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_jsl download started this may take some time.
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_icdo_augmented download started this may take some time.
[OK!]
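
The resolver above is configured with `setDistanceFunction("EUCLIDEAN")`. As rough intuition only, and not the library's actual implementation, ranking candidates by Euclidean distance between embeddings can be sketched as follows. The two-dimensional vectors are toy values made up for illustration; real sentence embeddings have many more dimensions.

```python
import math


def rank_by_euclidean(query, candidates, k=3):
    """Return the k candidate labels whose vectors lie closest to the query vector."""
    scored = [(math.dist(query, vector), label) for label, vector in candidates.items()]
    return [label for _, label in sorted(scored)[:k]]


# Toy vectors, invented for this example (not real embeddings)
toy_candidates = {
    "8010/3-C67.9": (0.9, 0.1),
    "8010/3-C44.9": (0.1, 0.9),
    "8070/3": (0.5, 0.5),
}

print(rank_by_euclidean((0.8, 0.2), toy_candidates, k=2))
# ['8010/3-C67.9', '8070/3']
```

The closest candidate becomes the top code, and the remaining ones fill the ranked candidate list.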


In [13]:
sample_text = """TRAF6 is a putative oncogene in a variety of cancers including  bladder cancer , and skin cancer. WWP2 appears to regulate the expression of the well characterized tumor suppressor phosphatase and tensin homolog (PTEN)   in endometrial cancer   and squamous cell carcinoma."""

clinical_note_df = spark.createDataFrame([[sample_text]]).toDF("text")

icdo_result = pipeline_model.transform(clinical_note_df)

In [14]:
res_pd = get_codes_from_df(icdo_result, 'ner_chunk', 'icdo_code')

In [15]:
res_pd.head(10)

index,sent_id,ner_chunk,entity,icdo_code,all_codes,resolutions
0,0,cancers,Oncological,8000/3,"[8000/3, 8010/3, 8010/9, 800, 8420/3, 8140/3, 8010/3-C76.0, 8010/6, 8010/3-C44.5, 8010/3-C26.0, 8010/3-C76.1, 8000/1, 8240/3, 8010/3-C06.9, 8021/3, 8010/9-C44.9, 8530/3, 8550/3, 8001/1, 8010/3-C77.8, 8230/3, 8010/3-C21.0, 8070/3, 8010/3-C44.9]","[cancer, carcinoma, carcinomatosis, neoplasms, ceruminous carcinoma, adenocarcinoma, carcinoma, of head, face or neck, secondary carcinoma, carcinoma, of skin of trunk, carcinoma, of intestinal tract, carcinoma, of thorax, neoplasm, carcinoid, carcinoma, of mouth, carcinoma, anaplastic, carcinomatosis of skin, inflammatory carcinoma, acinar carcinoma, tumor cells, carcinoma, of lymph nodes of multiple regions, solid carcinoma, carcinoma, of anus, squamous carcinoma, carcinoma, of skin]"
1,0,bladder cancer,Oncological,8010/3-C67.9,"[8010/3-C67.9, 8010/3-C67.5, 8230/3-C67.9, 8140/3-C67.9, 8441/3-C67.9, 8120/3-C67.9, 8070/3-C67.9, 8980/3-C67.9, 8140/3-C67.5, 8230/3-C67.5, 8051/3-C67.9, 8510/3-C67.9, 8050/3-C67.9, 8051/3-C67.5, 8560/3-C67.9, 8010/2-C67.9, 8070/3-C67.5, 8120/3-C67.5, 8010/3-C67.1, 8130/3-C67.9, 8120/2-C67.9, 8510/3-C67.5, 8050/3-C67.5, 8120/3, 8980/3-C67.5]","[carcinoma, of bladder, carcinoma, of bladder neck, solid carcinoma, of bladder, adenocarcinoma, of bladder, serous carcinoma, of bladder, transitional cell carcinoma, of bladder, squamous cell carcinoma, of bladder, carcinosarcoma, of bladder, adenocarcinoma, of bladder neck, solid carcinoma, of bladder neck, verrucous carcinoma, of bladder, medullary carcinoma, of bladder, papillary carcinoma, of bladder, verrucous carcinoma, of bladder neck, adenosquamous carcinoma of bladder, carcinoma in situ, of bladder, squamous cell carcinoma, of bladder neck, transitional cell carcinoma, of bladder neck, carcinoma, of dome of bladder, papillary urothelial carcinoma of bladder, urothelial carcinoma in situ of bladder, medullary carcinoma, of bladder neck, papillary carcinoma, of bladder neck, urothelial carcinoma, carcinosarcoma, of bladder neck]"
2,0,skin cancer,Oncological,8010/3-C44.9,"[8010/3-C44.9, 8010/9-C44.9, 8070/3-C44.9, 8140/3-C44.9, 8980/3-C44.9, 8010/3-C44.5, 8409/3-C44.9, 8560/3-C44.9, 8051/3-C44.9, 8010/2-C44.9, 8201/3-C44.9, 8575/3-C44.9, 8390/3, 8230/3-C44.9, 8070/3, 8094/3-C44.9, 8410/3-C44.9, 8110/3-C44.9, 8010/3, 8070/3-C44.5, 8010/3-C44.4, 8051/3-C44.5, 8247/3-C44.9, 8440/3-C44.9]","[carcinoma, of skin, carcinomatosis of skin, squamous cell carcinoma, of skin, adenocarcinoma, of skin, carcinosarcoma, of skin, carcinoma, of skin of trunk, porocarcinoma, of skin, adenosquamous carcinoma of skin, verrucous carcinoma, of skin, carcinoma in situ, of skin, cribriform carcinoma, of skin, metaplastic carcinoma, of skin, skin appendage carcinoma, solid carcinoma, of skin, squamous carcinoma, basosquamous carcinoma of skin, sebaceous carcinoma of skin, pilomatrical carcinoma of skin, carcinoma, squamous cell carcinoma, of skin of trunk, carcinoma, of skin of scalp and neck, verrucous carcinoma, of skin of trunk, merkel cell carcinoma of skin, cystadenocarcinoma, of skin]"
3,1,tumor suppressor phosphatase,Oncological,9020/1,"[9020/1, 8409/3, 8409/0, 8405/0, 8800/3-C16.4, 8010/3-C75.0, 9507/0, 8711/0-C16.4, 8022/3, 9064/3-C75.0, 8010/3-C76.3, 8001/3-C75.0, 8010/3-C16.4, 8980/3-C16.4, 8103/0, 8140/0-C75.0, 8140/3-C75.0, 9701/3-C71.3, 8010/3-C48.2, 8022/3-C75.0, 9719/3, C75.0, 8120/0, 8110/0]","[phyllodes tumor, porocarcinoma, poroma, papillary hidradenoma, sarcoma, of pylorus, carcinoma, of parathyroid gland, pacinian tumor, glomus tumor, of pylorus, pleomorphic carcinoma, germinoma of parathyroid gland, carcinoma, of pelvis, tumor cells, malignant of parathyroid gland, carcinoma, of pylorus, carcinosarcoma, of pylorus, pilar tumor, adenoma, of parathyroid gland, adenocarcinoma, of parathyroid gland, sezary syndrome of parietal lobe, carcinoma, of peritoneum, pleomorphic carcinoma of parathyroid gland, polymorphic reticulosis, parathyroid gland, transitional papilloma, pilomatrixoma]"
4,1,endometrial cancer,Oncological,8380/3,"[8380/3, 8010/3-C54.1, 8380/3-C57.9, 8575/3-C54.1, 8560/3-C54.1, 8441/3-C54.1, 8140/3-C54.1, 8051/3-C54.1, 8384/3-C54.1, 8230/3-C54.1, 8440/3-C54.1, 8021/3-C54.1, 8010/2-C54.1, 8070/3-C54.1, 8380/3-C53.0, 8262/3-C54.1, 8575/3-C53.0, 8201/3-C54.1, 8120/3-C54.1, 8980/3-C54.1, 8050/3-C54.1, 8380/3-C54.1, 8510/3-C54.1, 8140/2-C54.1]","[endometrioid carcinoma, carcinoma, of endometrium, endometrioid adenocarcinoma, of female genital tract, metaplastic carcinoma, of endometrium, adenosquamous carcinoma of endometrium, serous carcinoma, of endometrium, adenocarcinoma, of endometrium, verrucous carcinoma, of endometrium, adenocarcinoma, endocervical type, of endometrium, solid carcinoma, of endometrium, cystadenocarcinoma, of endometrium, carcinoma, anaplastic, of endometrium, carcinoma in situ, of endometrium, squamous cell carcinoma, of endometrium, endometrioid adenocarcinoma, of endocervix, villous adenocarcinoma of endometrium, metaplastic carcinoma, of endocervix, cribriform carcinoma, of endometrium, transitional cell carcinoma, of endometrium, carcinosarcoma, of endometrium, papillary carcinoma, of endometrium, endometrioid adenocarcinoma, of endometrium, medullary carcinoma, of endometrium, adenocarcinoma in situ, of endometrium]"
5,1,squamous cell carcinoma,Oncological,8070/3,"[8070/3, 8051/3, 8070/2, 8052/3, 8070/3-C44.5, 8075/3, 8560/3, 8070/3-C44.9, 8070/3-C76.1, 8075/3-C44.5, 8075/3-C44.9, 8070/3-C76.0, 805-808, 8094/3, 8070/3-C32.9, 8441/3, 8070/3-C77.9, 8074/3, 8074/3-C76.0, 8085/3, 8560/3-C44.9]","[squamous cell carcinoma, verrucous squamous cell carcinoma, squamous cell carcinoma in situ, papillary squamous cell carcinoma, squamous cell carcinoma, of skin of trunk, squamous cell carcinoma, adenoid, adenosquamous carcinoma, squamous cell carcinoma, of skin, squamous cell carcinoma, of thorax, squamous cell carcinoma, adenoid of skin of trunk, squamous cell carcinoma, adenoid of skin, squamous cell carcinoma, of head, face or neck, squamous cell neoplasms, basosquamous carcinoma, squamous cell carcinoma, of larynx, serous carcinoma, squamous cell carcinoma, of lymph node, squamous cell carcinoma, spindle cell, squamous cell carcinoma, spindle cell of head, face or neck, squamous cell carcinoma, hpv positive, adenosquamous carcinoma of skin]"
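
The wide list cells above are hard to read, partly because many resolution names contain commas themselves. One option is to flatten the top few candidates into one record per candidate. The sketch below does this in plain Python; `to_long_format` is a hypothetical helper, and the rows are the first three candidates copied from the results above.

```python
def to_long_format(rows, top_k=3):
    """Flatten (chunk, codes, resolutions) rows into one record per candidate."""
    records = []
    for chunk, codes, resolutions in rows:
        pairs = list(zip(codes, resolutions))[:top_k]
        for rank, (code, resolution) in enumerate(pairs, start=1):
            records.append({"ner_chunk": chunk, "rank": rank,
                            "code": code, "resolution": resolution})
    return records


# First three candidates per chunk, copied from the results above
rows = [
    ("bladder cancer",
     ["8010/3-C67.9", "8010/3-C67.5", "8230/3-C67.9"],
     ["carcinoma, of bladder", "carcinoma, of bladder neck", "solid carcinoma, of bladder"]),
    ("squamous cell carcinoma",
     ["8070/3", "8051/3", "8070/2"],
     ["squamous cell carcinoma", "verrucous squamous cell carcinoma",
      "squamous cell carcinoma in situ"]),
]

for record in to_long_format(rows, top_k=2):
    print(record)
```

With the pandas DataFrame returned by the helper function, the same rows could be built from the `ner_chunk`, `all_codes`, and `resolutions` columns.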


In [16]:
light_model = LightPipeline(pipeline_model)
light_result = light_model.fullAnnotate(sample_text)

er_vis = EntityResolverVisualizer()

er_vis.display(light_result[0],
               label_col='ner_chunk',
               resolution_col = 'icdo_code',
               document_col='document'
               )