![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/ER_ICDO.ipynb)

## **Resolve Oncology terminology using the ICD-O taxonomy**

To run this notebook yourself, you need to upload your license keys. Run the first cell under Colab Setup to do that. Alternatively, open the file explorer on the left side of the screen and upload your license file, named `spark_jsl.json` (the name the setup cell looks for), to the folder that opens. Otherwise, you can look at the example outputs at the bottom of the notebook.

## **Colab Setup**

In [None]:
import json, os
from google.colab import files

if 'spark_jsl.json' not in os.listdir():
  license_keys = files.upload()
  os.rename(list(license_keys.keys())[0], 'spark_jsl.json')

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
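
The setup cells rely on a few specific entries in the license file. The sketch below is a minimal pre-check, assuming the file is a flat JSON object. The helper name `missing_license_keys` is hypothetical, the key list is simply the set of keys this notebook reads (`SECRET`, `PUBLIC_VERSION`, `JSL_VERSION`), and the example values are placeholders.

```python
import json

# Keys that the setup cells in this notebook read from the license file.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")

def missing_license_keys(license_json_text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the license JSON text."""
    keys = json.loads(license_json_text)
    return [name for name in required if not keys.get(name)]

# Placeholder content, not a real license.
example = '{"SECRET": "xxx", "PUBLIC_VERSION": "0.0.0"}'
print(missing_license_keys(example))
```

An empty list means every key the later `pip install` and `sparknlp_jsl.start` cells expect is present.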

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 4.2.8
Spark NLP_JSL Version : 4.2.8


### **🔎 About the model**

📌 **sbiobertresolve_icdo_augmented**--> *This model maps extracted clinical entities to ICD-O codes using `sbiobert_base_cased_mli` Sentence BERT embeddings. Given an oncological entity found in the text (via NER models such as `ner_jsl`), it returns the top-matching terms and resolutions along with the corresponding ICD-O codes, which gives more granularity with respect to the body parts mentioned. It also returns the original Topography and Histology codes and their descriptions.*
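
The codes this model returns in the example output further down look like `8010/3-C67.9` or `8070/3`: a four-digit histology code, a behavior digit after the slash, and an optional topography code after the hyphen. The sketch below splits a code into those parts. The pattern is inferred from the example outputs in this notebook rather than from a model specification, and `parse_icdo_code` is a hypothetical helper name.

```python
import re

# Assumed shape: 4-digit histology, "/" + behavior digit,
# then optionally "-" + topography (C, two digits, optional ".digit").
ICDO_PATTERN = re.compile(
    r"^(?P<histology>\d{4})/(?P<behavior>\d)(?:-(?P<topography>C\d{2}(?:\.\d)?))?$"
)

def parse_icdo_code(code):
    """Split a resolver code into histology, behavior and topography parts.

    Returns None for codes that do not fit the assumed shape
    (for example group codes such as "800").
    """
    match = ICDO_PATTERN.match(code)
    return match.groupdict() if match else None

print(parse_icdo_code("8010/3-C67.9"))
print(parse_icdo_code("8070/3"))
```

When a code carries no topography part, the `topography` entry is `None`.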






### **🔎 Helper Function**


In [4]:
# Returns the resolution results as a pandas DataFrame

def get_codes_from_df(result_df, chunk, output_col, hcc=False):
    """Flatten resolver output into one row per NER chunk.

    result_df  : Spark DataFrame produced by the fitted pipeline
    chunk      : name of the NER chunk column
    output_col : name of the resolver output column
    hcc        : if True, also return the auxiliary labels as "hcc_list"
    """
    # With hcc=True the code column is named "icd10_code"; otherwise it
    # takes the name of the resolver output column.
    code_col = "icd10_code" if hcc else output_col

    selected = [
        F.expr("cols['1']['sentence']").alias("sent_id"),
        F.expr("cols['0']").alias("ner_chunk"),
        F.expr("cols['1']['entity']").alias("entity"),
        F.expr("cols['2']").alias(code_col),
        F.expr("cols['3']['all_k_results']").alias("all_codes"),
        F.expr("cols['3']['all_k_resolutions']").alias("resolutions"),
    ]
    list_cols = ["all_codes", "resolutions"]

    if hcc:
        selected.append(F.expr("cols['3']['all_k_aux_labels']").alias("hcc_list"))
        list_cols.append("hcc_list")

    # Zip chunk and resolver annotations together, then emit one row per chunk.
    df = result_df.select(
        F.explode(F.arrays_zip(result_df[chunk].result,
                               result_df[chunk].metadata,
                               result_df[output_col].result,
                               result_df[output_col].metadata)).alias("cols")
    ).select(*selected).toPandas()

    # Candidate lists arrive as ':::'-delimited strings; split them into lists.
    for col in list_cols:
        df[col] = [value.split(":::") for value in df[col]]

    return df
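
The helper splits the resolver metadata on `:::` because each metadata field packs all candidates into a single delimited string. The sketch below shows that step in isolation on plain strings. It assumes that codes and resolution texts are aligned by position, which is how the example output in this notebook reads. `split_candidates` is a hypothetical name, and the two sample candidates are taken from the `squamous cell carcinoma` row of that output.

```python
def split_candidates(all_k_results, all_k_resolutions, sep=":::"):
    """Pair each candidate code with its resolution text (assumed position-aligned)."""
    codes = all_k_results.split(sep)
    texts = all_k_resolutions.split(sep)
    return list(zip(codes, texts))

pairs = split_candidates(
    "8070/3:::8051/3",
    "squamous cell carcinoma:::verrucous squamous cell carcinoma",
)
for code, text in pairs:
    print(code, "->", text)
```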

# **📌 "sbiobertresolve_icdo_augmented" model**

### **🔎 Define Spark NLP pipeline**

In [5]:
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

sentenceDetector = SentenceDetectorDLModel.pretrained()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

embeddings_clinical = BertEmbeddings.pretrained('biobert_pubmed_base_cased') \
    .setInputCols(['sentence', 'token']) \
    .setOutputCol('embeddings')
        
clinical_ner = MedicalNerModel.pretrained("ner_bionlp_biobert", "en", "clinical/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner_tags")
        
ner_chunker = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner_tags"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["Cancer"])

c2doc = Chunk2Doc()\
      .setInputCols("ner_chunk")\
      .setOutputCol("ner_chunk_doc") 

sbert_embedder = BertSentenceEmbeddings\
      .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sbert_embeddings")
 
icdo_resolver = SentenceEntityResolverModel\
     .pretrained("sbiobertresolve_icdo_augmented","en", "clinical/models") \
     .setInputCols(["sbert_embeddings"]) \
     .setOutputCol("icdo_code")\
     .setDistanceFunction("EUCLIDEAN")

    
pipeline = Pipeline(
    stages=[
        document_assembler, 
        sentenceDetector,
        tokenizer,
        embeddings_clinical,
        clinical_ner,
        ner_chunker,
        c2doc,
        sbert_embedder,
        icdo_resolver
    ])

empty_df = spark.createDataFrame([[""]]).toDF("text")
pipeline_model = pipeline.fit(empty_df)


sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
ner_bionlp_biobert download started this may take some time.
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_icdo_augmented download started this may take some time.
[OK!]
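
The resolver stage is configured with `setDistanceFunction("EUCLIDEAN")`, so candidates are ranked by how close their embeddings are to the chunk embedding. The toy sketch below illustrates that idea only. It is not how `SentenceEntityResolverModel` is implemented internally, the three-dimensional vectors are made up, and `rank_by_euclidean` is a hypothetical name.

```python
import math

# Made-up 3-dimensional "embeddings"; real sentence embeddings are far larger.
toy_index = {
    "8010/3-C67.9": [0.9, 0.1, 0.0],
    "8010/3-C44.9": [0.1, 0.9, 0.0],
    "8380/3": [0.0, 0.2, 0.9],
}

def rank_by_euclidean(query, index, k=2):
    """Return the k codes whose vectors lie closest to the query vector."""
    scored = [(math.dist(query, vector), code) for code, vector in index.items()]
    return [code for _, code in sorted(scored)[:k]]

print(rank_by_euclidean([0.8, 0.2, 0.1], toy_index))
```

The first code returned plays the role of the resolver's main result, and the rest of the ranking corresponds to the candidate list it reports alongside.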


In [6]:
sample_text = """TRAF6 is a putative oncogene in a variety of cancers including  bladder cancer , and skin cancer. WWP2 appears to regulate the expression of the well characterized tumor suppressor phosphatase and tensin homolog (PTEN)   in endometrial cancer   and squamous cell carcinoma."""

clinical_note_df = spark.createDataFrame([[sample_text]]).toDF("text")

icdo_result = pipeline_model.transform(clinical_note_df)

In [7]:
res_pd = get_codes_from_df(icdo_result, 'ner_chunk', 'icdo_code')

In [8]:
res_pd.head(10)

| | sent_id | ner_chunk | entity | icdo_code | all_codes | resolutions |
|---|---|---|---|---|---|---|
| 0 | 0 | cancers | Cancer | 8000/3 | [8000/3, 8010/3, 8010/9, 800, 8420/3, 8140/3, 8010/3-C76.0, 8010/6, 8010/3-C44.5, 8010/3-C26.0, 8010/3-C76.1, 8000/1, 8240/3, 8010/3-C06.9, 8021/3, 8010/9-C44.9, 8530/3, 8550/3, 8001/1, 8010/3-C77... | [cancer, carcinoma, carcinomatosis, neoplasms, ceruminous carcinoma, adenocarcinoma, carcinoma, of head, face or neck, secondary carcinoma, carcinoma, of skin of trunk, carcinoma, of intestinal tr... |
| 1 | 0 | bladder cancer | Cancer | 8010/3-C67.9 | [8010/3-C67.9, 8010/3-C67.5, 8230/3-C67.9, 8140/3-C67.9, 8441/3-C67.9, 8120/3-C67.9, 8070/3-C67.9, 8980/3-C67.9, 8140/3-C67.5, 8230/3-C67.5, 8051/3-C67.9, 8510/3-C67.9, 8050/3-C67.9, 8051/3-C67.5,... | [carcinoma, of bladder, carcinoma, of bladder neck, solid carcinoma, of bladder, adenocarcinoma, of bladder, serous carcinoma, of bladder, transitional cell carcinoma, of bladder, squamous cell ca... |
| 2 | 0 | skin cancer | Cancer | 8010/3-C44.9 | [8010/3-C44.9, 8010/9-C44.9, 8070/3-C44.9, 8140/3-C44.9, 8980/3-C44.9, 8010/3-C44.5, 8409/3-C44.9, 8560/3-C44.9, 8051/3-C44.9, 8010/2-C44.9, 8201/3-C44.9, 8575/3-C44.9, 8390/3, 8230/3-C44.9, 8070/... | [carcinoma, of skin, carcinomatosis of skin, squamous cell carcinoma, of skin, adenocarcinoma, of skin, carcinosarcoma, of skin, carcinoma, of skin of trunk, porocarcinoma, of skin, adenosquamous ... |
| 3 | 1 | tumor | Cancer | 8000/1 | [8000/1, 8040/1, 8001/1, 9365/3, 8000/6, 8103/0, 9364/3, 8940/0, 8561/0, 9230/1, 8000/3, 9365/3-C76.1, 8100/0, 8158/3, 800, 8711/0, 9135/1, 8935/1, 8010/3, 8815/1, 8960/3, 8312/3, 8153/3] | [tumor, tumorlet, tumor cells, askin tumor, tumor, secondary, pilar tumor, ewing tumor, mixed tumor, warthin tumor, codman tumor, cancer, askin tumor of thorax, brooke tumor, acth-producing tumor,... |
| 4 | 1 | endometrial cancer | Cancer | 8380/3 | [8380/3, 8010/3-C54.1, 8380/3-C57.9, 8575/3-C54.1, 8560/3-C54.1, 8441/3-C54.1, 8140/3-C54.1, 8051/3-C54.1, 8384/3-C54.1, 8230/3-C54.1, 8440/3-C54.1, 8021/3-C54.1, 8010/2-C54.1, 8070/3-C54.1, 8380/... | [endometrioid carcinoma, carcinoma, of endometrium, endometrioid adenocarcinoma, of female genital tract, metaplastic carcinoma, of endometrium, adenosquamous carcinoma of endometrium, serous carc... |
| 5 | 1 | squamous cell carcinoma | Cancer | 8070/3 | [8070/3, 8051/3, 8070/2, 8052/3, 8070/3-C44.5, 8075/3, 8560/3, 8070/3-C44.9, 8070/3-C76.1, 8075/3-C44.5, 8075/3-C44.9, 8070/3-C76.0, 805-808, 8094/3, 8070/3-C32.9, 8441/3, 8070/3-C77.9, 8074/3, 80... | [squamous cell carcinoma, verrucous squamous cell carcinoma, squamous cell carcinoma in situ, papillary squamous cell carcinoma, squamous cell carcinoma, of skin of trunk, squamous cell carcinoma,... |
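
The `all_codes` and `resolutions` columns hold long ranked lists, which are hard to read in a wide row. One option is to flatten them into one row per candidate and keep only the top few. The sketch below does this on plain dictionaries. `top_k_rows` is a hypothetical helper, and the sample record is my reading of the `bladder cancer` row above (the resolution texts themselves contain commas, so the printed list is ambiguous).

```python
def top_k_rows(records, k=3):
    """Flatten each record's candidate lists into one row per (chunk, rank)."""
    rows = []
    for rec in records:
        candidates = list(zip(rec["all_codes"], rec["resolutions"]))[:k]
        for rank, (code, text) in enumerate(candidates, start=1):
            rows.append({
                "ner_chunk": rec["ner_chunk"],
                "rank": rank,
                "code": code,
                "resolution": text,
            })
    return rows

sample = [{
    "ner_chunk": "bladder cancer",
    "all_codes": ["8010/3-C67.9", "8010/3-C67.5", "8230/3-C67.9"],
    "resolutions": ["carcinoma, of bladder", "carcinoma, of bladder neck",
                    "solid carcinoma, of bladder"],
}]

for row in top_k_rows(sample, k=2):
    print(row)
```

With the real results, `res_pd.to_dict("records")` should give records in this shape, since the helper function already converts both columns to Python lists.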


In [9]:
from sparknlp_display import EntityResolverVisualizer

light_model = LightPipeline(pipeline_model)
light_result = light_model.fullAnnotate(sample_text)

er_vis = EntityResolverVisualizer()

er_vis.display(light_result[0],
               label_col='ner_chunk',
               resolution_col = 'icdo_code',
               document_col='document'
               )