![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare/NER_BIOMARKER.ipynb)

## **Detect Biomarker Entities**

To run this notebook yourself, you need to provide your license keys. Run the cell below to upload them, or open the file explorer on the left side of the screen and upload your license file as `spark_jsl.json` to the folder that opens. Otherwise, you can look at the example outputs included in the notebook.

## **Colab Setup**

In [None]:
import json, os
from google.colab import files

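# Prompt for an upload only if the license file is not already present
if 'spark_jsl.json' not in os.listdir():
    license_keys = files.upload()
    # Rename whatever was uploaded so the rest of the notebook finds it
    os.rename(list(license_keys.keys())[0], 'spark_jsl.json')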

with open('spark_jsl.json') as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)
os.environ.update(license_keys)
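
Before installing anything, it can help to confirm that the loaded license dictionary has the entries the later cells read. The sketch below is only an illustration: `missing_license_keys` and `REQUIRED_KEYS` are hypothetical names, and the key list is an assumption taken from the names this notebook uses (`SECRET`, `PUBLIC_VERSION`, `JSL_VERSION`); a real license file may contain more.

```python
# Assumption: these are the key names the later cells of this notebook read.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")

def missing_license_keys(keys, required=REQUIRED_KEYS):
    """Return the required key names that are absent or empty in `keys`."""
    return [name for name in required if not keys.get(name)]

# Placeholder values for illustration only, not real credentials.
example_keys = {"SECRET": "placeholder", "PUBLIC_VERSION": "0.0.0"}
print(missing_license_keys(example_keys))  # ['JSL_VERSION']
```

In the notebook itself, the same check could be run as `missing_license_keys(license_keys)` right after the JSON file is loaded.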

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [None]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql.types import StringType, IntegerType

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

## **🔎 Define Spark NLP pipeline**

In [None]:
documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")
        
sentenceDetector = SentenceDetectorDLModel.pretrained() \
      .setInputCols(["document"]) \
      .setOutputCol("sentence") 
 
tokenizer = Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
        .setInputCols(["sentence","token"])\
        .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_biomarker", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

ner_converter = NerConverter()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[documentAssembler,
                               sentenceDetector,
                               tokenizer,
                               word_embeddings,
                               clinical_ner,
                               ner_converter])

sentence_detector_dl download started this may take some time.
Approximate size to download 354.6 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_biomarker download started this may take some time.
[OK!]


## **Create example inputs**

In [None]:
sample_text = ["""All the tumor tissues contained small cell carcinoma components. 4 cases coexisted with other histologic types of bladder cancers, and 2 out of the 9 cases had three different cell components. All the patients had muscle invasion, and 4 cases showed lymph nodes metastasis, 3 cases showed invasion of neighboring structures (seminal vesicle or uterus), and 1 case was highly suspected of liver metastasis. Immunohistochemistry results showed that PCK, Syn, NSE, and CD56 were all positive, but LCA was negative."""]

## **Use the pipeline to create outputs**

In [None]:
# Wrap the sample text in a single-column DataFrame, then fit and run the pipeline
df = spark.createDataFrame(sample_text,StringType()).toDF('text')
result = nlpPipeline.fit(df).transform(df)

# Pair each detected chunk with its metadata, then emit one row per chunk
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, 
                                     result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(30, truncate=False)

+--------------------+---------------------+
|chunk               |ner_label            |
+--------------------+---------------------+
|tumor               |Tumor_Finding        |
|small cell          |CancerModifier       |
|carcinoma           |CancerDx             |
|bladder cancers     |CancerDx             |
|metastasis          |Metastasis           |
|liver metastasis    |Metastasis           |
|Immunohistochemistry|Test                 |
|PCK                 |Biomarker            |
|Syn                 |Biomarker            |
|NSE                 |Biomarker            |
|CD56                |Biomarker            |
|positive            |Biomarker_Measurement|
|LCA                 |Biomarker            |
|negative            |Biomarker_Measurement|
+--------------------+---------------------+
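
For a quick summary, the chunk and label pairs can be grouped by label once they are on the driver. The following is a minimal plain-Python sketch: `group_chunks_by_label` is a hypothetical helper, and the `pairs` list is copied by hand from the table above rather than collected from the DataFrame.

```python
from collections import defaultdict

def group_chunks_by_label(pairs):
    """Group (chunk, label) pairs into {label: [chunks...]}, keeping input order."""
    grouped = defaultdict(list)
    for chunk, label in pairs:
        grouped[label].append(chunk)
    return dict(grouped)

# Rows copied from the output table above.
pairs = [
    ("tumor", "Tumor_Finding"),
    ("small cell", "CancerModifier"),
    ("carcinoma", "CancerDx"),
    ("bladder cancers", "CancerDx"),
    ("metastasis", "Metastasis"),
    ("liver metastasis", "Metastasis"),
    ("Immunohistochemistry", "Test"),
    ("PCK", "Biomarker"),
    ("Syn", "Biomarker"),
    ("NSE", "Biomarker"),
    ("CD56", "Biomarker"),
    ("positive", "Biomarker_Measurement"),
    ("LCA", "Biomarker"),
    ("negative", "Biomarker_Measurement"),
]

grouped = group_chunks_by_label(pairs)
print(grouped["Biomarker"])  # ['PCK', 'Syn', 'NSE', 'CD56', 'LCA']
```

Note that this grouping does not link a biomarker to its measurement; it only collects chunks that share a label.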



## **Visualize results**

In [None]:
from sparknlp_display import NerVisualizer

NerVisualizer().display(
    result = result.collect()[0],
    label_col = 'ner_chunk',
    document_col = 'document'
)