![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare_jsl/ER_ICDO.ipynb)

## **Resolve Oncology terminology using the ICD-O taxonomy**

To run this notebook yourself, you need to upload your license keys. Either run the upload cell below, or open the file explorer on the left side of the screen and upload `license_keys.json` to the folder that opens. Otherwise, you can look at the example outputs at the bottom of the notebook.

## **Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs

In [None]:
from google.colab import files
print("Please upload your John Snow Labs license using the button below")
license_keys = files.upload()

In [None]:
from johnsnowlabs import *

# After uploading your license, run this to install all licensed Python wheels
# and pre-download the jars for the Spark session JVM.
# Restart your notebook afterwards for the changes to take effect.

jsl.install()

## **Start Session**

In [None]:
from johnsnowlabs import *
# Automatically load the license data and start a session with all jars the user has access to
spark = jsl.start()

### **🔎 About the model**

📌 **sbiobertresolve_icdo_augmented**--> *This model maps extracted clinical entities to ICD-O codes using sbiobert_base_cased_mli Sentence BERT Embeddings. Given an oncological entity found in the text (via NER models like ner_jsl), it returns top terms and resolutions along with the corresponding ICD-O codes to present more granularity with respect to body parts mentioned. It also returns the original Topography and Histology codes, and their descriptions.*
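
The codes this resolver returns combine a histology (morphology) code, a behavior digit, and optionally a topography (body site) code, as in `8010/3-C67.9`. As a minimal sketch, the hypothetical helper `parse_icdo` below splits such a string into its parts. The pattern is an assumption inferred from the codes shown in this notebook's output, not a format guaranteed by the model.

```python
import re
from typing import NamedTuple, Optional


class IcdoCode(NamedTuple):
    histology: str             # four-digit morphology (cell type) code, e.g. "8010"
    behavior: str              # single behavior digit after the slash, e.g. "3"
    topography: Optional[str]  # site code such as "C67.9", or None if absent


# Assumed shape, inferred from this notebook's output: "8000/3" or "8010/3-C67.9".
_ICDO_PATTERN = re.compile(r"^(\d{4})/(\d)(?:-(C\d{2}(?:\.\d)?))?$")


def parse_icdo(code: str) -> Optional[IcdoCode]:
    """Split a resolver code into its parts; return None if it does not match."""
    match = _ICDO_PATTERN.match(code.strip())
    if match is None:
        return None
    return IcdoCode(*match.groups())


print(parse_icdo("8010/3-C67.9"))
print(parse_icdo("8000/3"))
```

Strings that do not follow the assumed shape (for example a bare `800`) yield `None`, so callers can decide how to handle them instead of failing.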






### **🔎 Helper Function**


In [None]:
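# Returns the resolution results as a pandas DataFrame

import pyspark.sql.functions as F  # explicit import for F.explode, F.arrays_zip, F.expr

def get_codes_from_df(result_df, chunk, output_col, hcc=False):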
    
    
    if hcc:
        
        df = result_df.select(F.explode(F.arrays_zip(result_df[chunk].result, 
                                                     result_df[chunk].metadata, 
                                                     result_df[output_col].result, 
                                                     result_df[output_col].metadata)).alias("cols")) \
                      .select(F.expr("cols['1']['sentence']").alias("sent_id"),
                              F.expr("cols['0']").alias("ner_chunk"),
                              F.expr("cols['1']['entity']").alias("entity"), 
                              F.expr("cols['2']").alias("icd10_code"),
                              F.expr("cols['3']['all_k_results']").alias("all_codes"),
                              F.expr("cols['3']['all_k_resolutions']").alias("resolutions"),
                              F.expr("cols['3']['all_k_aux_labels']").alias("hcc_list")).toPandas()



        codes = []
        resolutions = []
        hcc_all = []

        # Loop variable renamed so it no longer shadows the `hcc` flag
        for code, resolution, hcc_labels in zip(df['all_codes'], df['resolutions'], df['hcc_list']):

            codes.append(code.split(':::'))
            resolutions.append(resolution.split(':::'))
            hcc_all.append(hcc_labels.split(":::"))

        df['all_codes'] = codes  
        df['resolutions'] = resolutions
        df['hcc_list'] = hcc_all
        
    else:
                       
        df = result_df.select(F.explode(F.arrays_zip(result_df[chunk].result, 
                                                           result_df[chunk].metadata, 
                                                           result_df[output_col].result, 
                                                           result_df[output_col].metadata)).alias("cols")) \
                      .select(F.expr("cols['1']['sentence']").alias("sent_id"),
                              F.expr("cols['0']").alias("ner_chunk"),
                              F.expr("cols['1']['entity']").alias("entity"), 
                              F.expr("cols['2']").alias(f"{output_col}"),
                              F.expr("cols['3']['all_k_results']").alias("all_codes"),
                              F.expr("cols['3']['all_k_resolutions']").alias("resolutions")).toPandas()



        codes = []
        resolutions = []

        for code, resolution in zip(df['all_codes'], df['resolutions']):

            codes.append(code.split(':::'))
            resolutions.append(resolution.split(':::'))

        df['all_codes'] = codes  
        df['resolutions'] = resolutions
        
    
    return df
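
The helper relies on the resolver metadata storing its top-k candidates as single strings joined by `:::`. The sketch below shows that splitting step in isolation, in plain Python. The function name `split_candidates` and the two input strings are illustrative assumptions assembled from values that appear in this notebook's output; they are not a captured metadata payload.

```python
def split_candidates(all_codes: str, all_resolutions: str, sep: str = ":::"):
    """Pair each candidate code with the resolution text at the same position."""
    codes = all_codes.split(sep)
    resolutions = all_resolutions.split(sep)
    # zip stops at the shorter list, so a length mismatch never raises here
    return list(zip(codes, resolutions))


# Illustrative strings in the ':::'-delimited layout the helper expects
raw_codes = "8000/3:::8010/3:::8010/9"
raw_resolutions = "cancer:::carcinoma:::carcinomatosis"

for code, resolution in split_candidates(raw_codes, raw_resolutions):
    print(code, "->", resolution)
```

Keeping codes and resolutions paired by position makes it easier to inspect lower-ranked candidates when the top match looks wrong.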

# **📌 "sbiobertresolve_icdo_augmented" model**

### **🔎Define Spark NLP pipeline**

In [None]:
document_assembler = nlp.DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

embeddings_clinical = nlp.BertEmbeddings.pretrained('biobert_pubmed_base_cased') \
    .setInputCols(['sentence', 'token']) \
    .setOutputCol('embeddings')
        
clinical_ner = medical.NerModel.pretrained("ner_bionlp_biobert", "en", "clinical/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner_tags")
        
ner_chunker = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_tags"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["Cancer"])

c2doc = nlp.Chunk2Doc()\
      .setInputCols("ner_chunk")\
      .setOutputCol("ner_chunk_doc") 

sbert_embedder = nlp.BertSentenceEmbeddings\
      .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sbert_embeddings")
 
icdo_resolver = medical.SentenceEntityResolverModel\
     .pretrained("sbiobertresolve_icdo_augmented","en", "clinical/models") \
     .setInputCols(["ner_chunk", "sbert_embeddings"]) \
     .setOutputCol("icdo_code")\
     .setDistanceFunction("EUCLIDEAN")

    
pipeline = nlp.Pipeline(
    stages=[
        document_assembler, 
        sentenceDetector,
        tokenizer,
        embeddings_clinical,
        clinical_ner,
        ner_chunker,
        c2doc,
        sbert_embedder,
        icdo_resolver
    ])

empty_df = spark.createDataFrame([[""]]).toDF("text")
pipeline_model = pipeline.fit(empty_df)


sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[OK!]
ner_bionlp_biobert download started this may take some time.
[OK!]
sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]
sbiobertresolve_icdo_augmented download started this may take some time.
[OK!]
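
The resolver stage above is configured with `setDistanceFunction("EUCLIDEAN")`. Conceptually, a chunk's sentence embedding is compared against the embeddings of known terms, and the closest terms are returned with their codes. The toy sketch below illustrates only that idea: the three-dimensional vectors and the `rank_candidates` function are invented for the example, real embeddings have far more dimensions, and the library's internal search is not shown in this notebook.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_candidates(query, candidates, k=2):
    """Return the k (code, distance) pairs closest to the query vector."""
    scored = [(code, euclidean(query, vec)) for code, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical toy vectors; they are NOT real sbiobert embeddings
toy_index = {
    "8010/3-C67.9": [1.0, 0.0, 0.0],
    "8010/3-C44.9": [0.0, 1.0, 0.0],
    "8070/3": [0.0, 0.0, 1.0],
}
query = [0.9, 0.1, 0.0]

top = rank_candidates(query, toy_index, k=2)
for code, distance in top:
    print(code, round(distance, 4))
```

Smaller distances mean closer matches, which is why the candidates are sorted in ascending order.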


In [None]:
sample_text = """TRAF6 is a putative oncogene in a variety of cancers including  bladder cancer , and skin cancer. WWP2 appears to regulate the expression of the well characterized tumor suppressor phosphatase and tensin homolog (PTEN)   in endometrial cancer   and squamous cell carcinoma."""

clinical_note_df = spark.createDataFrame([[sample_text]]).toDF("text")

icdo_result = pipeline_model.transform(clinical_note_df)

In [None]:
res_pd = get_codes_from_df(icdo_result, 'ner_chunk', 'icdo_code')

In [None]:
res_pd.head(10)

| | sent_id | ner_chunk | entity | icdo_code | all_codes | resolutions |
|---|---|---|---|---|---|---|
| 0 | 0 | cancers | Cancer | 8000/3 | [8000/3, 8010/3, 8010/9, 800, 8420/3, 8140/3, ... | [cancer, carcinoma, carcinomatosis, neoplasms,... |
| 1 | 0 | bladder cancer | Cancer | 8010/3-C67.9 | [8010/3-C67.9, 8010/3-C67.5, 8230/3-C67.9, 814... | [carcinoma, of bladder, carcinoma, of bladder ... |
| 2 | 0 | skin cancer | Cancer | 8010/3-C44.9 | [8010/3-C44.9, 8010/9-C44.9, 8070/3-C44.9, 814... | [carcinoma, of skin, carcinomatosis of skin, s... |
| 3 | 1 | tumor | Cancer | 8000/1 | [8000/1, 8040/1, 8001/1, 9365/3, 8000/6, 8103/... | [tumor, tumorlet, tumor cells, askin tumor, tu... |
| 4 | 1 | endometrial cancer | Cancer | 8380/3 | [8380/3, 8010/3-C54.1, 8380/3-C57.9, 8575/3-C5... | [endometrioid carcinoma, carcinoma, of endomet... |
| 5 | 1 | squamous cell carcinoma | Cancer | 8070/3 | [8070/3, 8051/3, 8070/2, 8052/3, 8070/3-C44.5,... | [squamous cell carcinoma, verrucous squamous c... |
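
The digit after the slash in each resolved code is the ICD-O behavior code, which explains why `tumor` resolves to `8000/1` while `cancers` resolves to `8000/3`. The sketch below maps that digit to its standard ICD-O-3 meaning. The helper name `describe_behavior` is an assumption for illustration, and the function simply reports `unknown` for strings it cannot interpret.

```python
# Standard ICD-O-3 behavior codes (the digit following the slash)
BEHAVIOR_LABELS = {
    "0": "benign",
    "1": "uncertain whether benign or malignant",
    "2": "carcinoma in situ",
    "3": "malignant, primary site",
    "6": "malignant, metastatic site",
    "9": "malignant, uncertain whether primary or metastatic site",
}


def describe_behavior(code: str) -> str:
    """Return the behavior label for a code such as '8010/3-C67.9'."""
    _, slash, rest = code.partition("/")
    if not slash or not rest:
        return "unknown"  # no slash, or nothing after it
    return BEHAVIOR_LABELS.get(rest[0], "unknown")


for code in ["8000/3", "8010/3-C67.9", "8000/1", "8070/2"]:
    print(code, "->", describe_behavior(code))
```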


In [None]:
from sparknlp_display import EntityResolverVisualizer

light_model = nlp.LightPipeline(pipeline_model)
light_result = light_model.fullAnnotate(sample_text)

er_vis = EntityResolverVisualizer()

er_vis.display(light_result[0],
               label_col='ner_chunk',
               resolution_col = 'icdo_code',
               document_col='document'
               )