![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/27.Oncology_Model.ipynb)

# **ONCOLOGY MODELS**

This notebook describes the pretrained models available for extracting oncology-related information from clinical texts, with an example of each type of model.

## Setup

In [None]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

locals().update(license_keys)

os.environ.update(license_keys)
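
The following cells read `PUBLIC_VERSION`, `JSL_VERSION` and `SECRET` from the uploaded license file. A minimal sketch of an early check for those entries, assuming the file is a flat JSON object; `missing_license_keys` is a hypothetical helper (not part of Spark NLP) and the values shown are placeholders:

```python
import json

# Keys referenced later in this notebook (pip install lines and sparknlp_jsl.start).
REQUIRED_KEYS = ("PUBLIC_VERSION", "JSL_VERSION", "SECRET")

def missing_license_keys(license_dict, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the license dict."""
    return [key for key in required if not license_dict.get(key)]

# Placeholder content for illustration only, not real credentials.
example = json.loads('{"PUBLIC_VERSION": "4.3.2", "JSL_VERSION": "4.3.2"}')
print(missing_license_keys(example))
```

Running a check like this right after loading the file surfaces a missing entry before the install commands fail with a less obvious error.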

In [None]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

# Installing Spark NLP Display Library for visualization
! pip install -q spark-nlp-display

In [3]:
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp_jsl.pretrained import InternalResourceDownloader

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline,PipelineModel

import pandas as pd
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

spark

Spark NLP Version : 4.3.2
Spark NLP_JSL Version : 4.3.2


## **List of Pretrained Models**

In [4]:
df = pd.DataFrame()
for model_type in ['MedicalNerModel', 'BertForTokenClassification', 'RelationExtractionModel', 'RelationExtractionDLModel', 'AssertionDLModel']:
    model_list = sorted(list(set([model[0] for model in InternalResourceDownloader.returnPrivateModels(model_type) if 'oncology' in model[0]])))
    if len(model_list) > 0:
      if model_type == "MedicalNerModel":
        model_list = list(filter(lambda x: "wip" not in x, model_list))
      df = pd.concat([df, pd.DataFrame(model_list, columns = [model_type])], axis = 1)
    
df.fillna('')

Unnamed: 0,MedicalNerModel,RelationExtractionModel,RelationExtractionDLModel,AssertionDLModel
0,ner_oncology,re_oncology_biomarker_result_wip,redl_oncology_biobert_wip,assertion_oncology_demographic_binary_wip
1,ner_oncology_anatomy_general,re_oncology_granular_wip,redl_oncology_biomarker_result_biobert_wip,assertion_oncology_family_history_wip
2,ner_oncology_anatomy_general_healthcare,re_oncology_location_wip,redl_oncology_granular_biobert_wip,assertion_oncology_problem_wip
3,ner_oncology_anatomy_granular,re_oncology_size_wip,redl_oncology_location_biobert_wip,assertion_oncology_response_to_treatment_wip
4,ner_oncology_biomarker,re_oncology_temporal_wip,redl_oncology_size_biobert_wip,assertion_oncology_smoking_status_wip
5,ner_oncology_biomarker_healthcare,re_oncology_test_result_wip,redl_oncology_temporal_biobert_wip,assertion_oncology_test_binary_wip
6,ner_oncology_demographics,re_oncology_wip,redl_oncology_test_result_biobert_wip,assertion_oncology_treatment_binary_wip
7,ner_oncology_diagnosis,,,assertion_oncology_wip
8,ner_oncology_posology,,,
9,ner_oncology_response_to_treatment,,,
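
The selection logic in the cell above (keep names containing `oncology`, remove duplicates, sort, and drop `wip` names for `MedicalNerModel`) can be sketched without a Spark session. `select_models` is a hypothetical helper, and the name list is a small hand-written sample rather than the downloader's output:

```python
def select_models(names, keyword="oncology", exclude=None):
    """Return sorted unique names containing `keyword`, optionally dropping those containing `exclude`."""
    selected = {name for name in names if keyword in name}
    if exclude is not None:
        selected = {name for name in selected if exclude not in name}
    return sorted(selected)

# Hand-written sample for illustration only.
sample_names = [
    "ner_oncology_biomarker",
    "ner_oncology",
    "ner_oncology",        # duplicates are collapsed, as with set() in the cell above
    "ner_oncology_wip",    # illustrative work-in-progress name
    "ner_clinical",        # illustrative non-oncology name
]

print(select_models(sample_names, exclude="wip"))
```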


## NER Models

The NER models in the list cover different entity groups and levels of granularity. If you want to extract as much information as possible from oncology texts, ner_oncology is the best option, as it is the most general and granular model. You may prefer other models depending on your needs (for instance, if you need to extract information related to staging, ner_oncology_tnm would be the most suitable model).

In [5]:
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
        
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
 
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")\
    .setSplitChars(["-", r"\/"])

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["sentence","token"])\
    .setOutputCol("embeddings")

# ner_oncology

ner_oncology = MedicalNerModel.pretrained("ner_oncology","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_oncology")

ner_oncology_converter = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_oncology"])\
    .setOutputCol("ner_oncology_chunk")

# ner_oncology_tnm

ner_oncology_tnm = MedicalNerModel.pretrained("ner_oncology_tnm","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_oncology_tnm")

ner_oncology_tnm_converter = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_oncology_tnm"])\
    .setOutputCol("ner_oncology_tnm_chunk")

# ner_oncology_biomarker

ner_oncology_biomarker = MedicalNerModel.pretrained("ner_oncology_biomarker","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner_oncology_biomarker")

ner_oncology_biomarker_converter = NerConverterInternal()\
    .setInputCols(["sentence","token","ner_oncology_biomarker"])\
    .setOutputCol("ner_oncology_biomarker_chunk")

ner_stages = [document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner_oncology,
    ner_oncology_converter,
    ner_oncology_tnm,
    ner_oncology_tnm_converter,
    ner_oncology_biomarker,
    ner_oncology_biomarker_converter]

ner_pipeline = Pipeline(stages=ner_stages)

empty_data = spark.createDataFrame([[""]]).toDF("text")

ner_model = ner_pipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_oncology download started this may take some time.
[OK!]
ner_oncology_tnm download started this may take some time.
[OK!]
ner_oncology_biomarker download started this may take some time.
[OK!]


In [6]:
ner_oncology_labels = sorted(list(set([label.split('-')[-1] for label in ner_oncology.getClasses() if label != 'O'])))

len(ner_oncology_labels)

49
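
The expression in the cell above assumes that the model classes are IOB-style tags (`B-Label`, `I-Label`, plus `O`) and keeps the text after the last hyphen. A small sketch of that step on a hand-written tag list (illustrative tags, not the output of `getClasses()`); `entity_labels` is a hypothetical helper:

```python
def entity_labels(tags):
    """Collapse IOB tags such as 'B-Cancer_Dx' / 'I-Cancer_Dx' into sorted unique entity names."""
    return sorted({tag.split("-")[-1] for tag in tags if tag != "O"})

# Hand-written tags for illustration only.
tags = ["O", "B-Cancer_Dx", "I-Cancer_Dx", "B-Tumor_Size", "I-Tumor_Size", "B-Age"]
print(entity_labels(tags))
```

If an entity name could itself contain a hyphen, `tag.split("-", 1)[1]` would be the safer choice, since it only removes the leading `B`/`I` prefix.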

In [7]:
label_df = pd.DataFrame()
for column in range((len(ner_oncology_labels)//10)+1):
  label_df = pd.concat([label_df, pd.DataFrame(ner_oncology_labels, columns = [''])[column*10:(column+1)*10].reset_index(drop= True)], axis = 1)

label_df.fillna('')

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5
0,Adenopathy,Cycle_Number,Hormonal_Therapy,Race_Ethnicity,Site_Lung
1,Age,Date,Imaging_Test,Radiation_Dose,Site_Lymph_Node
2,Biomarker,Death_Entity,Immunotherapy,Radiotherapy,Site_Other_Body_Part
3,Biomarker_Result,Direction,Invasion,Relative_Date,Smoking_Status
4,Cancer_Dx,Dosage,Line_Of_Therapy,Response_To_Treatment,Staging
5,Cancer_Score,Duration,Metastasis,Route,Targeted_Therapy
6,Cancer_Surgery,Frequency,Oncogene,Site_Bone,Tumor_Finding
7,Chemotherapy,Gender,Pathology_Result,Site_Brain,Tumor_Size
8,Cycle_Count,Grade,Pathology_Test,Site_Breast,Unspecific_Therapy
9,Cycle_Day,Histological_Type,Performance_Status,Site_Liver,
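
The loop that builds this table slices the sorted label list into columns of ten. The same chunking is sketched below with the standard library only; `split_into_columns` is a hypothetical helper. It uses a ceiling division, whereas `len(...)//10 + 1` would run one extra, empty iteration whenever the number of labels is an exact multiple of ten:

```python
import math

def split_into_columns(items, rows_per_column=10):
    """Split a list into consecutive chunks of at most `rows_per_column` items."""
    n_columns = math.ceil(len(items) / rows_per_column)
    return [items[i * rows_per_column:(i + 1) * rows_per_column] for i in range(n_columns)]

# 49 placeholder items, matching the number of ner_oncology labels reported above.
columns = split_into_columns(list(range(49)))
print([len(column) for column in columns])
```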


In [8]:
ner_oncology_tnm_labels = sorted(list(set([label.split('-')[-1] for label in ner_oncology_tnm.getClasses() if label != 'O'])))

print(ner_oncology_tnm_labels)

['Cancer_Dx', 'Lymph_Node', 'Lymph_Node_Modifier', 'Metastasis', 'Staging', 'Tumor', 'Tumor_Description']


In [9]:
ner_oncology_biomarker_labels = sorted(list(set([label.split('-')[-1] for label in ner_oncology_biomarker.getClasses() if label != 'O'])))

print(ner_oncology_biomarker_labels)

['Biomarker', 'Biomarker_Result']


In [10]:
sample_text_1 = '''A 65-year-old woman had a history of debulking surgery, bilateral oophorectomy with omentectomy, total anterior hysterectomy with radical pelvic lymph nodes dissection due to ovarian carcinoma (mucinous-type carcinoma, stage Ic) 1 year ago. Patient's medical compliance was poor and failed to complete her chemotherapy (cyclophosphamide 750 mg/m2, carboplatin 300 mg/m2). Recently, she noted a palpable right breast mass, 15 cm in size which nearly occupied the whole right breast in 2 months. Core needle biopsy revealed metaplastic carcinoma. Neoadjuvant chemotherapy with the regimens of Taxotere (75 mg/m2), Epirubicin (75 mg/m2), and Cyclophosphamide (500 mg/m2) was given for 6 cycles with poor response, followed by a modified radical mastectomy (MRM) with dissection of axillary lymph nodes and skin grafting. Postoperatively, radiotherapy was done with 5000 cGy in 25 fractions. The histopathologic examination revealed a metaplastic carcinoma with squamous differentiation associated with adenomyoepithelioma. Immunohistochemistry study showed that the tumor cells are positive for epithelial markers-cytokeratin (AE1/AE3) stain, and myoepithelial markers, including cytokeratin 5/6 (CK 5/6), p63, and S100 stains. Expressions of hormone receptors, including ER, PR, and Her-2/Neu, were all negative. The dissected axillary lymph nodes showed metastastic carcinoma with negative hormone receptors in 3 nodes. The patient was staged as pT3N1aM0, with histologic tumor grade III.'''

sample_text_2 = '''She underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.'''

sample_text_3 = '''In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, which were positive for CD9, CD10, CD13, CD19, CD20, CD34, CD38, CD58, CD66c, CD123, HLA-DR, cCD79a, and TdT on flow cytometry.

Measurements of serum tumor markers showed elevated level of cytokeratin 19 fragment (Cyfra21-1: 4.77 ng/mL), neuron-specific enolase (NSE: 19.60 ng/mL), and squamous cell carcinoma antigen (SCCA: 2.58 ng/mL). The results were negative for serum carbohydrate antigen 125 (CA125), carcinoembryonic antigen (CEA) and vascular endothelial growth factor (VEGF). Immunohistochemical staining showed positive staining for CK5/6, P40 and PD-L1 (+ 80% tumor cells), and negative staining for TTF-1, PD-1 and weakly positive staining for ALK. Molecular analysis indicated no EGFR mutation or ROS1 fusion.'''

In [11]:
data = spark.createDataFrame(pd.DataFrame([sample_text_1, sample_text_2, sample_text_3], columns = ['text']))

In [12]:
results = ner_model.transform(data).collect()
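
Besides the visual display, the collected rows can be flattened into plain tuples. The sketch below assumes that each chunk annotation exposes `result`, `begin`, `end` and a `metadata` mapping with an `entity` key. The `Annotation` namedtuple is a stand-in so the example runs without Spark, and the two chunks are hand-written for illustration, not model output:

```python
from collections import namedtuple

# Stand-in for a Spark NLP annotation; only the fields used below are modelled.
Annotation = namedtuple("Annotation", ["begin", "end", "result", "metadata"])

def chunks_to_rows(annotations):
    """Return (chunk_text, begin, end, entity_label) tuples for a list of chunk annotations."""
    return [(a.result, a.begin, a.end, a.metadata.get("entity", "")) for a in annotations]

text = "Core needle biopsy revealed metaplastic carcinoma."

def make_chunk(chunk, entity):
    # Hand-built annotation; the end offset is inclusive.
    begin = text.index(chunk)
    return Annotation(begin, begin + len(chunk) - 1, chunk, {"entity": entity})

hand_written = [
    make_chunk("Core needle biopsy", "Pathology_Test"),
    make_chunk("metaplastic carcinoma", "Cancer_Dx"),
]

for row in chunks_to_rows(hand_written):
    print(row)
```

Under the same assumptions, the call on the real output would look like `chunks_to_rows(results[0]["ner_oncology_chunk"])`.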

In [13]:
from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

In [14]:
from google.colab import widgets

t = widgets.TabBar(["ner_oncology_biomarker", "ner_oncology_tnm", "ner_oncology"])

with t.output_to(0):
    visualiser.display(results[2], label_col='ner_oncology_biomarker_chunk')

with t.output_to(1):
    visualiser.display(results[1], label_col='ner_oncology_tnm_chunk')

with t.output_to(2):
    visualiser.display(results[0], label_col='ner_oncology_chunk')


## Relation Extraction Models

Relation extraction (RE) models link entities that are related. For oncology entities, you can use general models (such as re_oncology_granular_wip) or select a specific model depending on your needs (e.g. re_oncology_size_wip to link tumors and their sizes, or re_oncology_biomarker_result_wip to link biomarkers and their results).

In [15]:
pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos_tags")

dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en") \
    .setInputCols(["sentence", "pos_tags", "token"]) \
    .setOutputCol("dependencies")

re_oncology_granular_wip = RelationExtractionModel.pretrained("re_oncology_granular_wip", "en", "clinical/models") \
    .setInputCols(["embeddings", "pos_tags", "ner_oncology_chunk", "dependencies"]) \
    .setOutputCol("re_oncology_granular_wip") \
    .setRelationPairs(['Date-Cancer_Dx', 'Cancer_Dx-Date', 'Tumor_Finding-Site_Breast', 'Site_Breast-Tumor_Finding',
                       'Relative_Date-Tumor_Finding', 'Tumor_Finding-Relative_Date', 'Tumor_Finding-Tumor_Size', 'Tumor_Size-Tumor_Finding',
                       'Pathology_Test-Cancer_Dx', 'Cancer_Dx-Pathology_Test']) \
    .setMaxSyntacticDistance(10)    

re_oncology_size_wip = RelationExtractionModel.pretrained("re_oncology_size_wip", "en", "clinical/models") \
    .setInputCols(["embeddings", "pos_tags", "ner_oncology_chunk", "dependencies"]) \
    .setOutputCol("re_oncology_size_wip") \
    .setRelationPairs(['Tumor_Finding-Tumor_Size', 'Tumor_Size-Tumor_Finding']) \
    .setMaxSyntacticDistance(10)    

re_oncology_biomarker_result_wip = RelationExtractionModel.pretrained("re_oncology_biomarker_result_wip", "en", "clinical/models") \
    .setInputCols(["embeddings", "pos_tags", "ner_oncology_biomarker_chunk", "dependencies"]) \
    .setOutputCol("re_oncology_biomarker_result_wip") \
    .setRelationPairs(['Biomarker-Biomarker_Result', 'Biomarker_Result-Biomarker']) \
    .setMaxSyntacticDistance(10)      

re_stages = ner_stages + [pos_tagger, dependency_parser, re_oncology_granular_wip, re_oncology_size_wip, re_oncology_biomarker_result_wip]

re_pipeline = Pipeline(stages=re_stages)

re_model = re_pipeline.fit(empty_data)

pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
re_oncology_granular_wip download started this may take some time.
Approximate size to download 261 KB
[OK!]
re_oncology_size_wip download started this may take some time.
Approximate size to download 261.3 KB
[OK!]
re_oncology_biomarker_result_wip download started this may take some time.
Approximate size to download 259.6 KB
[OK!]
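
Each `setRelationPairs` call above lists every pair in both orders, and the names are plain strings, so a misspelled label is easy to miss. A hedged sketch of generating both orders from a single list and checking each name against the known NER labels (for example `ner_oncology_labels` from the earlier cell); `build_relation_pairs` is a hypothetical helper, not a Spark NLP function:

```python
def build_relation_pairs(pairs, known_labels):
    """Expand (label_a, label_b) tuples into 'A-B' and 'B-A' strings, rejecting unknown labels."""
    known = set(known_labels)
    expanded = []
    for first, second in pairs:
        for label in (first, second):
            if label not in known:
                raise ValueError(f"Unknown entity label: {label!r}")
        expanded.extend([f"{first}-{second}", f"{second}-{first}"])
    return expanded

# A few labels taken from the label lists shown earlier in this notebook.
labels = ["Tumor_Finding", "Tumor_Size", "Biomarker", "Biomarker_Result"]
print(build_relation_pairs([("Tumor_Finding", "Tumor_Size")], labels))
```

The returned list could then be passed to `setRelationPairs`.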


In [16]:
sample_text_4 = '''Two years ago, she noted a palpable right breast mass, 15 cm in size. Core needle biopsy revealed metaplastic carcinoma.'''

sample_text_5 = '''The patient presented a 2 cm mass in her left breast, and the tumor in her other breast was 3 cm long.'''

sample_text_6 = '''Immunohistochemical staining showed positive staining for CK5/6, P40 and PD-L1, and negative staining for TTF-1, PD-1 and weakly positive staining for ALK. Immunohistochemistry study showed that the tumor cells are positive for epithelial markers-cytokeratin and myoepithelial markers, including cytokeratin 5/6, p63, and S100 stains.'''

In [17]:
re_data = spark.createDataFrame(pd.DataFrame([sample_text_4, sample_text_5, sample_text_6], columns = ['text']))

In [18]:
re_results = re_model.transform(re_data).collect()
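
The relation columns can also be read as plain records. The sketch below assumes that each relation annotation carries the relation label in `result`, that `metadata` holds `entity1`, `chunk1`, `entity2` and `chunk2`, and that unrelated pairs are labelled `O`. The `Relation` namedtuple is a stand-in so the example runs without Spark, and the two annotations (including the `is_size_of` label) are hand-written for illustration, not model output:

```python
from collections import namedtuple

# Stand-in for a relation annotation; only the fields used below are modelled.
Relation = namedtuple("Relation", ["result", "metadata"])

def relations_to_rows(annotations, skip_label="O"):
    """Return (relation, entity1, chunk1, entity2, chunk2) tuples, skipping `skip_label`."""
    rows = []
    for annotation in annotations:
        if annotation.result == skip_label:
            continue
        meta = annotation.metadata
        rows.append((annotation.result, meta.get("entity1"), meta.get("chunk1"),
                     meta.get("entity2"), meta.get("chunk2")))
    return rows

# Hand-written annotations for illustration only.
hand_written = [
    Relation("is_size_of", {"entity1": "Tumor_Size", "chunk1": "2 cm",
                            "entity2": "Tumor_Finding", "chunk2": "mass"}),
    Relation("O", {"entity1": "Tumor_Size", "chunk1": "2 cm",
                   "entity2": "Tumor_Finding", "chunk2": "tumor"}),
]

print(relations_to_rows(hand_written))
```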

In [19]:
from sparknlp_display import RelationExtractionVisualizer

re_visualiser = RelationExtractionVisualizer()

In [20]:
re_t = widgets.TabBar(["re_oncology_biomarker_result_wip", "re_oncology_size_wip", "re_oncology_granular_wip"])

with re_t.output_to(0):
    re_visualiser.display(re_results[2], relation_col='re_oncology_biomarker_result_wip')

with re_t.output_to(1):
    re_visualiser.display(re_results[1], relation_col='re_oncology_size_wip')

with re_t.output_to(2):
    re_visualiser.display(re_results[0], relation_col='re_oncology_granular_wip')


## Assertion Status Models

With assertion status models, you can identify whether entities in a text are mentioned as present, absent, hypothetical, possible, and so on. You can use the general assertion_oncology_wip model or a model recommended for a specific entity group (such as assertion_oncology_problem_wip, which should be used for problem entities like Cancer_Dx or Metastasis).

In [21]:
assertion_oncology_wip = AssertionDLModel.pretrained("assertion_oncology_wip", "en", "clinical/models") \
    .setInputCols(["sentence", 'ner_oncology_chunk', "embeddings"]) \
    .setOutputCol("assertion_oncology_wip")

assertion_oncology_problem_wip = AssertionDLModel.pretrained("assertion_oncology_problem_wip", "en", "clinical/models") \
    .setInputCols(["sentence", 'ner_oncology_tnm_chunk', "embeddings"]) \
    .setOutputCol("assertion_oncology_problem_wip")

assertion_oncology_treatment_binary_wip = AssertionDLModel.pretrained("assertion_oncology_treatment_binary_wip", "en", "clinical/models") \
    .setInputCols(["sentence", 'ner_oncology_chunk', "embeddings"]) \
    .setOutputCol("assertion_oncology_treatment_binary_wip")

assertion_stages = ner_stages + [assertion_oncology_wip, assertion_oncology_problem_wip, assertion_oncology_treatment_binary_wip]

assertion_pipeline = Pipeline(stages=assertion_stages)

assertion_model = assertion_pipeline.fit(empty_data)

assertion_oncology_wip download started this may take some time.
[OK!]
assertion_oncology_problem_wip download started this may take some time.
[OK!]
assertion_oncology_treatment_binary_wip download started this may take some time.
[OK!]


In [22]:
sample_text_7 = 'The patient is suspected to have colorectal cancer. Family history is positive for other cancers. The result of the biopsy was positive. A CT scan was ordered to rule out metastases.'

sample_text_8 = 'The patient was diagnosed with breast cancer. She was suspected to have metastases in her lungs. Her family history is positive for ovarian cancer.'

sample_text_9 = 'The patient underwent a mastectomy. We recommend to start radiotherapy. The patient refused to chemotherapy.'

In [23]:
assertion_data = spark.createDataFrame(pd.DataFrame([sample_text_7, sample_text_8, sample_text_9], columns = ['text']))

In [24]:
assertion_results = assertion_model.transform(assertion_data).collect()
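
To read the statuses as data rather than through the visualizer, each chunk can be paired with its assertion. The sketch below assumes that an assertion annotation has the same `begin` and `end` offsets as the chunk it describes and carries the status in `result`. The `Annotation` namedtuple is a stand-in so the example runs without Spark, and the chunks and statuses are hand-written for illustration, not model output:

```python
from collections import namedtuple

# Stand-in for a Spark NLP annotation; only the fields used below are modelled.
Annotation = namedtuple("Annotation", ["begin", "end", "result"])

def pair_chunks_with_status(chunks, assertions):
    """Match each chunk to the assertion sharing its (begin, end) span; None if there is no match."""
    status_by_span = {(a.begin, a.end): a.result for a in assertions}
    return [(c.result, status_by_span.get((c.begin, c.end))) for c in chunks]

text = "The patient underwent a mastectomy. We recommend to start radiotherapy."

def span(chunk):
    # Inclusive (begin, end) offsets of `chunk` in `text`.
    begin = text.index(chunk)
    return begin, begin + len(chunk) - 1

# Hand-written values for illustration only.
chunks = [Annotation(*span("mastectomy"), "mastectomy"),
          Annotation(*span("radiotherapy"), "radiotherapy")]
assertions = [Annotation(*span("mastectomy"), "Present"),
              Annotation(*span("radiotherapy"), "Hypothetical")]

print(pair_chunks_with_status(chunks, assertions))
```

Matching on offsets rather than on list position keeps the pairing correct even if the two columns were to differ in length or order.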

In [25]:
from sparknlp_display import AssertionVisualizer

assertion_visualiser = AssertionVisualizer()

In [26]:
assertion_t = widgets.TabBar(["assertion_oncology_treatment_binary_wip", "assertion_oncology_problem_wip", "assertion_oncology_wip"])

with assertion_t.output_to(0):
    assertion_visualiser.display(assertion_results[2], label_col ='ner_oncology_chunk', assertion_col='assertion_oncology_treatment_binary_wip')

with assertion_t.output_to(1):
    assertion_visualiser.display(assertion_results[1], label_col ='ner_oncology_tnm_chunk', assertion_col='assertion_oncology_problem_wip')

with assertion_t.output_to(2):
    assertion_visualiser.display(assertion_results[0], label_col ='ner_oncology_chunk', assertion_col='assertion_oncology_wip')
