<img src="https://nlp.johnsnowlabs.com/assets/images/logo.png" width="180" height="50" style="float: left;">

# COLAB ENVIRONMENT SETUP

In [None]:
# Fundamental Import and installation of Java
import os, shutil
from google.colab import drive
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Licensed Environment Setup
def setup_license_from_gdrive(mount_path, colab_path, aws_credentials_filename, license_filename, secret_filename):
    # Mount Google Drive so the credential files can be read from Colab
    drive.mount(mount_path, force_remount=True)
    # Copy the AWS credentials to the default location
    aws_dir = "/root/.aws"
    os.makedirs(aws_dir, exist_ok=True)
    shutil.copyfile(os.path.join(mount_path, colab_path, aws_credentials_filename), os.path.join(aws_dir, "credentials"))
    # The license key is the first line of the license file
    with open(os.path.join(mount_path, colab_path, license_filename), "r") as f:
        license_key = f.readline().replace("\n", "")
        os.environ["JSL_NLP_LICENSE"] = license_key
    # The secret is the first line of the secret file
    with open(os.path.join(mount_path, colab_path, secret_filename), "r") as f:
        secret = f.readline().replace("\n", "")
    return secret

mount_path = '/content/gdrive'
colab_path = 'My Drive/Colab Notebooks'
aws_credentials_filename = 'credentials'
license_filename = 'license'
secret_filename = 'secret'
version = "2.4.6"
secret = setup_license_from_gdrive(mount_path, colab_path, aws_credentials_filename, license_filename, secret_filename)

# Installation of Spark NLP Enterprise and its Python dependencies
! pip install spark-nlp-jsl==$version --extra-index-url https://pypi.johnsnowlabs.com/$secret

# Spark NLP Imports
from pyspark.sql import SparkSession
import sparknlp, sparknlp_jsl
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

# Creation of suitable SparkSession with proper JARS
spark = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "8G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "1G") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.4.5") \
        .config("spark.jars", f"https://pypi.johnsnowlabs.com/{secret}/spark-nlp-jsl-{version}.jar")\
        .getOrCreate()

print("spark version:", spark.version)
print("spark-nlp version:", sparknlp.version())
print("spark-nlp-jsl version:", sparknlp_jsl.version())
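The setup cell above reads the first line of the secret file and then reuses that secret, together with the version string, in both the pip index URL and the JAR URL. The sketch below isolates those two steps in plain Python so they can be tried without Colab or Google Drive. The helper names `read_first_line` and `build_jsl_urls`, the temporary directory, and the dummy secret are all hypothetical and only serve the illustration; the URL patterns are the ones used in the cell above.

```python
import os
import tempfile


def read_first_line(path):
    """Return the first line of a text file without its trailing newline."""
    with open(path, "r") as f:
        return f.readline().rstrip("\n")


def build_jsl_urls(secret, version):
    """Build the pip index URL and the JAR URL from the same secret and version."""
    index_url = f"https://pypi.johnsnowlabs.com/{secret}"
    jar_url = f"{index_url}/spark-nlp-jsl-{version}.jar"
    return index_url, jar_url


# Hypothetical stand-in for the Drive folder: a temporary directory
# holding a fake secret file (the value is a placeholder, not a real secret).
tmp_dir = tempfile.mkdtemp()
secret_path = os.path.join(tmp_dir, "secret")
with open(secret_path, "w") as f:
    f.write("dummy-secret\nthis second line is ignored\n")

demo_secret = read_first_line(secret_path)
index_url, jar_url = build_jsl_urls(demo_secret, "2.4.6")
print(index_url)
print(jar_url)
```

Deriving both URLs from one `version` value keeps the Python package and the JAR on the same release.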

# ICD-O - SNOMED Entity Resolution - version 2.4.6

## Example for ICD-O Entity Resolution Pipeline
A common NLP problem in medical applications is to identify histology behavior in documented cancer studies.

In this example we will use Spark NLP to identify histology behavior expressions and resolve them to ICD-O codes.

Some cancer-related clinical notes (taken from https://www.cancernetwork.com/case-studies and https://oncology.medicinematters.com):  
https://www.cancernetwork.com/case-studies/large-scrotal-mass-multifocal-intra-abdominal-retroperitoneal-and-pelvic-metastases  
https://oncology.medicinematters.com/lymphoma/chronic-lymphocytic-leukemia/case-study-small-b-cell-lymphocytic-lymphoma-and-chronic-lymphoc/12133054  
https://oncology.medicinematters.com/lymphoma/epidemiology/central-nervous-system-lymphoma/12124056  
https://oncology.medicinematters.com/lymphoma/case-study-cutaneous-t-cell-lymphoma/12129416

Note 1: Desmoplastic small round cell tumor
<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
A 35-year-old African-American man was referred to our urology clinic by his primary care physician for consultation about a large left scrotal mass. The patient reported a 3-month history of left scrotal swelling that had progressively increased in size and was associated with mild left scrotal pain. He also had complaints of mild constipation, with hard stools every other day. He denied any urinary complaints. On physical examination, a hard paratesticular mass could be palpated in the left hemiscrotum extending into the left groin, separate from the left testicle, and measuring approximately 10 × 7 cm in size. A hard, lower abdominal mass in the suprapubic region could also be palpated in the midline. The patient was admitted urgently to the hospital for further evaluation with cross-sectional imaging and blood work.

Laboratory results, including results of a complete blood cell count with differential, liver function tests, coagulation panel, and basic chemistry panel, were unremarkable except for a serum creatinine level of 2.6 mg/dL. Typical markers for a testicular germ cell tumor were within normal limits: the beta–human chorionic gonadotropin level was less than 1 mIU/mL and the alpha fetoprotein level was less than 2.8 ng/mL. A CT scan of the chest, abdomen, and pelvis with intravenous contrast was obtained, and it showed large multifocal intra-abdominal, retroperitoneal, and pelvic masses (Figure 1). On cross-sectional imaging, a 7.8-cm para-aortic mass was visualized compressing the proximal portion of the left ureter, creating moderate left hydroureteronephrosis. Additionally, three separate pelvic masses were present in the retrovesical space, each measuring approximately 5 to 10 cm at their largest diameter; these displaced the bladder anteriorly and the rectum posteriorly.

The patient underwent ultrasound-guided needle biopsy of one of the pelvic masses on hospital day 3 for definitive diagnosis. Microscopic examination of the tissue by our pathologist revealed cellular islands with oval to elongated, irregular, and hyperchromatic nuclei; scant cytoplasm; and invading fibrous tissue—as well as three mitoses per high-powered field (Figure 2). Immunohistochemical staining demonstrated strong positivity for cytokeratin AE1/AE3, vimentin, and desmin. Further mutational analysis of the cells detected the presence of an EWS-WT1 fusion transcript consistent with a diagnosis of desmoplastic small round cell tumor.
</div>

Note 2: SLL and CLL
<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
A 72-year-old man with a history of diabetes mellitus, hypertension, and hypercholesterolemia self-palpated a left submandibular lump in 2012. Complete blood count (CBC) in his internist’s office showed solitary leukocytosis (white count 22) with predominant lymphocytes for which he was referred to a hematologist. Peripheral blood flow cytometry on 04/11/12 confirmed chronic lymphocytic leukemia (CLL)/small lymphocytic lymphoma (SLL): abnormal cell population comprising 63% of CD45 positive leukocytes, co-expressing CD5 and CD23 in CD19-positive B cells. CD38 was negative but other prognostic markers were not assessed at that time. The patient was observed regularly for the next 3 years and his white count trend was as follows: 22.8 (4/2012) --> 28.5 (07/2012) --> 32.2 (12/2012) --> 36.5 (02/2013) --> 42 (09/2013) --> 44.9 (01/2014) --> 75.8 (2/2015). His other counts stayed normal until early 2015 when he also developed anemia (hemoglobin [HGB] 10.9) although platelets remained normal at 215. He had been noticing enlargement of his cervical, submandibular, supraclavicular, and axillary lymphadenopathy for several months since 2014 and a positron emission tomography (PET)/computed tomography (CT) scan done in 12/2014 had shown extensive diffuse lymphadenopathy within the neck, chest, abdomen, and pelvis. Maximum standardized uptake value (SUV max) was similar to low baseline activity within the vasculature of the neck and chest. In the abdomen and pelvis, however, there was mild to moderately hypermetabolic adenopathy measuring up to SUV of 4. The largest right neck nodes measured up to 2.3 x 3 cm and left neck nodes measured up to 2.3 x 1.5 cm. His right axillary lymphadenopathy measured up to 5.5 x 2.6 cm and on the left measured up to 4.8 x 3.4 cm. Lymph nodes on the right abdomen and pelvis measured up to 6.7 cm and seemed to have some mass effect with compression on the urinary bladder without symptoms. 
He underwent a bone marrow biopsy on 02/03/15, which revealed hypercellular marrow (60%) with involvement by CLL (30%); flow cytometry showed CD38 and ZAP-70 positivity; fluorescence in situ hybridization (FISH) analysis showed 13q deletion/monosomy 13; IgVH was unmutated; karyotype was 46XY.
</div>

Note 3: CNS lymphoma
<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
A 56-year-old woman began to experience vertigo, headaches, and frequent falls. A computed tomography (CT) scan of the brain revealed the presence of a 1.6 x 1.6 x 2.1 cm mass involving the fourth ventricle (Figure 14.1). A gadolinium-enhanced magnetic resonance imaging (MRI) scan confirmed the presence of the mass, and a stereotactic biopsy was performed that demonstrated a primary central nervous system lymphoma (PCNSL) with a diffuse large B-cell histology. Complete blood count (CBC), lactate dehydrogenase (LDH), and beta-2-microglobulin were normal. Systemic staging with a positron emission tomography (PET)/CT scan and bone marrow biopsy showed no evidence of lymphomatous involvement outside the CNS. An eye exam and lumbar puncture showed no evidence of either ocular or leptomeningeal involvement.
</div>

Note 4: Cutaneous T-cell lymphoma
<div style="border:2px solid #747474; background-color: #e3e3e3; margin: 5px; padding: 10px"> 
An 83-year-old female presented with a progressing pruritic cutaneous rash that started 8 years ago. On clinical exam there were numerous coalescing, infiltrated, scaly, and partially crusted erythematous plaques distributed over her trunk and extremities and a large fungating ulcerated nodule on her right thigh covering 75% of her total body surface area (Figure 10.1). Lymphoma associated alopecia and a left axillary lymphadenopathy were also noted. For the past 3–4 months she reported fatigue, severe pruritus, night sweats, 20 pounds of weight loss, and loss of appetite. 
</div>

In [1]:
import sys, os, time
import sparknlp_jsl
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from sparknlp.pretrained import ResourceDownloader

import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType

In [2]:
spark = sparknlp_jsl.start("####")

Let's create a dataset with all four case studies

In [3]:
notes = []
notes.append("""A 35-year-old African-American man was referred to our urology clinic by his primary care physician for consultation about a large left scrotal mass. The patient reported a 3-month history of left scrotal swelling that had progressively increased in size and was associated with mild left scrotal pain. He also had complaints of mild constipation, with hard stools every other day. He denied any urinary complaints. On physical examination, a hard paratesticular mass could be palpated in the left hemiscrotum extending into the left groin, separate from the left testicle, and measuring approximately 10 × 7 cm in size. A hard, lower abdominal mass in the suprapubic region could also be palpated in the midline. The patient was admitted urgently to the hospital for further evaluation with cross-sectional imaging and blood work.
Laboratory results, including results of a complete blood cell count with differential, liver function tests, coagulation panel, and basic chemistry panel, were unremarkable except for a serum creatinine level of 2.6 mg/dL. Typical markers for a testicular germ cell tumor were within normal limits: the beta–human chorionic gonadotropin level was less than 1 mIU/mL and the alpha fetoprotein level was less than 2.8 ng/mL. A CT scan of the chest, abdomen, and pelvis with intravenous contrast was obtained, and it showed large multifocal intra-abdominal, retroperitoneal, and pelvic masses (Figure 1). On cross-sectional imaging, a 7.8-cm para-aortic mass was visualized compressing the proximal portion of the left ureter, creating moderate left hydroureteronephrosis. Additionally, three separate pelvic masses were present in the retrovesical space, each measuring approximately 5 to 10 cm at their largest diameter; these displaced the bladder anteriorly and the rectum posteriorly.
The patient underwent ultrasound-guided needle biopsy of one of the pelvic masses on hospital day 3 for definitive diagnosis. Microscopic examination of the tissue by our pathologist revealed cellular islands with oval to elongated, irregular, and hyperchromatic nuclei; scant cytoplasm; and invading fibrous tissue—as well as three mitoses per high-powered field (Figure 2). Immunohistochemical staining demonstrated strong positivity for cytokeratin AE1/AE3, vimentin, and desmin. Further mutational analysis of the cells detected the presence of an EWS-WT1 fusion transcript consistent with a diagnosis of desmoplastic small round cell tumor.""")
notes.append("""A 72-year-old man with a history of diabetes mellitus, hypertension, and hypercholesterolemia self-palpated a left submandibular lump in 2012. Complete blood count (CBC) in his internist’s office showed solitary leukocytosis (white count 22) with predominant lymphocytes for which he was referred to a hematologist. Peripheral blood flow cytometry on 04/11/12 confirmed chronic lymphocytic leukemia (CLL)/small lymphocytic lymphoma (SLL): abnormal cell population comprising 63% of CD45 positive leukocytes, co-expressing CD5 and CD23 in CD19-positive B cells. CD38 was negative but other prognostic markers were not assessed at that time. The patient was observed regularly for the next 3 years and his white count trend was as follows: 22.8 (4/2012) --> 28.5 (07/2012) --> 32.2 (12/2012) --> 36.5 (02/2013) --> 42 (09/2013) --> 44.9 (01/2014) --> 75.8 (2/2015). His other counts stayed normal until early 2015 when he also developed anemia (hemoglobin [HGB] 10.9) although platelets remained normal at 215. He had been noticing enlargement of his cervical, submandibular, supraclavicular, and axillary lymphadenopathy for several months since 2014 and a positron emission tomography (PET)/computed tomography (CT) scan done in 12/2014 had shown extensive diffuse lymphadenopathy within the neck, chest, abdomen, and pelvis. Maximum standardized uptake value (SUV max) was similar to low baseline activity within the vasculature of the neck and chest. In the abdomen and pelvis, however, there was mild to moderately hypermetabolic adenopathy measuring up to SUV of 4. The largest right neck nodes measured up to 2.3 x 3 cm and left neck nodes measured up to 2.3 x 1.5 cm. His right axillary lymphadenopathy measured up to 5.5 x 2.6 cm and on the left measured up to 4.8 x 3.4 cm. Lymph nodes on the right abdomen and pelvis measured up to 6.7 cm and seemed to have some mass effect with compression on the urinary bladder without symptoms. 
He underwent a bone marrow biopsy on 02/03/15, which revealed hypercellular marrow (60%) with involvement by CLL (30%); flow cytometry showed CD38 and ZAP-70 positivity; fluorescence in situ hybridization (FISH) analysis showed 13q deletion/monosomy 13; IgVH was unmutated; karyotype was 46XY.""")
notes.append("A 56-year-old woman began to experience vertigo, headaches, and frequent falls. A computed tomography (CT) scan of the brain revealed the presence of a 1.6 x 1.6 x 2.1 cm mass involving the fourth ventricle (Figure 14.1). A gadolinium-enhanced magnetic resonance imaging (MRI) scan confirmed the presence of the mass, and a stereotactic biopsy was performed that demonstrated a primary central nervous system lymphoma (PCNSL) with a diffuse large B-cell histology. Complete blood count (CBC), lactate dehydrogenase (LDH), and beta-2-microglobulin were normal. Systemic staging with a positron emission tomography (PET)/CT scan and bone marrow biopsy showed no evidence of lymphomatous involvement outside the CNS. An eye exam and lumbar puncture showed no evidence of either ocular or leptomeningeal involvement.") 
notes.append("An 83-year-old female presented with a progressing pruritic cutaneous rash that started 8 years ago. On clinical exam there were numerous coalescing, infiltrated, scaly, and partially crusted erythematous plaques distributed over her trunk and extremities and a large fungating ulcerated nodule on her right thigh covering 75% of her total body surface area (Figure 10.1). Lymphoma associated alopecia and a left axillary lymphadenopathy were also noted. For the past 3–4 months she reported fatigue, severe pruritus, night sweats, 20 pounds of weight loss, and loss of appetite.")

data = spark.createDataFrame([(n,) for n in notes], StructType([StructField("description", StringType())]))

Next, let's build a Spark NLP pipeline with the following stages:
- DocumentAssembler: Entry annotator for our pipelines; it creates the data structure for the Annotation Framework.
- SentenceDetector: Annotator that pragmatically separates complete sentences inside each document.
- Tokenizer: Annotator that splits sentences into tokens (generally words).
- WordEmbeddings: Vectorization of word tokens, in this case using word embeddings trained on PubMed, ICD10 and other clinical resources.
- EntityResolver: Annotator that performs a k-nearest-neighbors (KNN) search, in this case with a model trained on ICDO Histology Behavior.

To find cancer-related chunks, we are going to use a pretrained search trie wrapped in our TextMatcher annotator; to identify treatments and procedures, we are going to use our good old NER. Note that in the cells of this notebook, the ICD-O chunks are produced by the pretrained `ner_bionlp` NerDLModel followed by a NerConverter rather than by a TextMatcher.

- TextMatcher: Trained with a Cancer Glossary and an augmented dataset from JSL Data Market, this annotator returns only the phrases found in a search trie, in this case ICDO phrases.

- NerDLModel: TensorFlow-based Named Entity Recognizer, trained to extract PROBLEMS, TREATMENTS and TESTS.
- NerConverter: Chunk builder that groups the tokens tagged by the NER model.
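To make the embedding and KNN steps in the list above more concrete, here is a minimal plain-Python sketch of the idea: token vectors are pooled into one chunk vector, and the chunk vector is compared against labeled reference vectors to find its nearest neighbors. This is not the library's implementation. The averaging step, the Euclidean distance, the two-dimensional toy vectors, and the `CODE_A`/`CODE_B`/`CODE_C` labels are all assumptions chosen for illustration.

```python
import math


def average_pool(token_vectors):
    """Average a list of equal-length token vectors into one chunk vector."""
    n = len(token_vectors)
    return [sum(dim) / n for dim in zip(*token_vectors)]


def euclidean(a, b):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(query, labeled_vectors, k=2):
    """Return the k (label, distance) pairs closest to the query vector."""
    scored = [(label, euclidean(query, vec)) for label, vec in labeled_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy 2-dimensional embeddings; the numbers are made up for the example.
token_embeddings = {"small": [1.0, 0.0], "lymphoma": [0.0, 1.0]}
# Placeholder labels standing in for codes; they are not real ICD-O codes.
reference = {"CODE_A": [0.5, 0.5], "CODE_B": [1.0, 1.0], "CODE_C": [-1.0, -1.0]}

chunk_vector = average_pool([token_embeddings["small"], token_embeddings["lymphoma"]])
neighbors = k_nearest(chunk_vector, reference, k=2)
print(chunk_vector)
print(neighbors)
```

In this toy setup the closest label plays the role of the resolved code, and the remaining neighbors play the role of alternative candidates.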

In [4]:
docAssembler = DocumentAssembler().setInputCol("description").setOutputCol("document")

sentenceDetector = SentenceDetector().setInputCols("document").setOutputCol("sentence")

tokenizer = Tokenizer().setInputCols("sentence").setOutputCol("token")

#Working on adjusting WordEmbeddingsModel to work with the subset of matched tokens
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("word_embeddings")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


TextMatcher Strategy

In [5]:
icdo_ner = NerDLModel.pretrained("ner_bionlp", "en", "clinical/models")\
    .setInputCols("sentence", "token", "word_embeddings")\
    .setOutputCol("icdo_ner")

icdo_chunk = NerConverter().setInputCols("sentence","token","icdo_ner").setOutputCol("icdo_chunk")

icdo_chunk_embeddings = ChunkEmbeddings()\
    .setInputCols("icdo_chunk", "word_embeddings")\
    .setOutputCol("icdo_chunk_embeddings")

icdo_chunk_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icdo_clinical", "en", "clinical/models")\
    .setInputCols("token","icdo_chunk_embeddings")\
    .setOutputCol("tm_icdo_code")

ner_bionlp download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
chunkresolve_icdo_clinical download started this may take some time.
Approximate size to download 8.2 MB
[OK!]


Ner Model Strategy

In [6]:
clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "word_embeddings"]) \
  .setOutputCol("ner")

ner_converter = NerConverter() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")

ner_chunk_tokenizer = ChunkTokenizer()\
    .setInputCols("ner_chunk")\
    .setOutputCol("ner_token")

ner_chunk_embeddings = ChunkEmbeddings()\
    .setInputCols("ner_chunk", "word_embeddings")\
    .setOutputCol("ner_chunk_embeddings")

ner_clinical download started this may take some time.
Approximate size to download 13.8 MB
[OK!]


In [8]:
#SNOMED Resolution
ner_snomed_resolver = \
    EnsembleEntityResolverModel.pretrained("ensembleresolve_snomed_clinical","en","clinical/models")\
    .setInputCols("ner_token","ner_chunk_embeddings").setOutputCol("snomed_result")

ensembleresolve_snomed_clinical download started this may take some time.
Approximate size to download 592.9 MB
[OK!]


In [9]:
pipelineFull = Pipeline().setStages([
    docAssembler, 
    sentenceDetector, 
    tokenizer, 
    word_embeddings, 
    
    clinical_ner, 
    ner_converter, 
    ner_chunk_embeddings,
    ner_chunk_tokenizer,
    ner_snomed_resolver,
    
    icdo_ner,
    icdo_chunk,
    icdo_chunk_embeddings, 
    icdo_chunk_resolver
])
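The order of the stages matters because each annotator reads columns that an earlier stage must already have written. The following plain-Python sketch checks that property for the stage list above, using the input and output column names set in the earlier cells. The helper `missing_inputs` is hypothetical and is not part of Spark or Spark NLP.

```python
# (stage name, input columns, output column), copied from the cells above.
stages = [
    ("docAssembler", ["description"], "document"),
    ("sentenceDetector", ["document"], "sentence"),
    ("tokenizer", ["sentence"], "token"),
    ("word_embeddings", ["sentence", "token"], "word_embeddings"),
    ("clinical_ner", ["sentence", "token", "word_embeddings"], "ner"),
    ("ner_converter", ["sentence", "token", "ner"], "ner_chunk"),
    ("ner_chunk_embeddings", ["ner_chunk", "word_embeddings"], "ner_chunk_embeddings"),
    ("ner_chunk_tokenizer", ["ner_chunk"], "ner_token"),
    ("ner_snomed_resolver", ["ner_token", "ner_chunk_embeddings"], "snomed_result"),
    ("icdo_ner", ["sentence", "token", "word_embeddings"], "icdo_ner"),
    ("icdo_chunk", ["sentence", "token", "icdo_ner"], "icdo_chunk"),
    ("icdo_chunk_embeddings", ["icdo_chunk", "word_embeddings"], "icdo_chunk_embeddings"),
    ("icdo_chunk_resolver", ["token", "icdo_chunk_embeddings"], "tm_icdo_code"),
]


def missing_inputs(stages, initial_columns):
    """Return (stage, column) pairs where a stage reads a column nobody has produced yet."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        for col in inputs:
            if col not in available:
                problems.append((name, col))
        available.add(output)
    return problems


problems = missing_inputs(stages, ["description"])
print(problems)  # [] because every input is produced by an earlier stage
```

Running the same check on a reordered list, for example with a resolver placed first, reports the columns that stage would be missing.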

Let's train our Pipeline and make it ready to start transforming

In [10]:
pipelineModelFull = pipelineFull.fit(data)

In [11]:
output = pipelineModelFull.transform(data).cache()

### EntityResolver:  
Trained on an augmented ICDO dataset from JSL Data Market, it resolves the matched expressions to histology codes. Besides the code in the "result" field, it provides more metadata about the matching process:  

- target_text -> Text to resolve
- resolved_text -> Best match text
- confidence -> Relative confidence for the top match (distance to probability)
- confidence_ratio -> Relative confidence for the top match. TopMatchConfidence / SecondMatchConfidence
- alternative_codes -> List of other plausible codes (in the KNN neighborhood)
- alternative_confidence_ratios -> Rest of confidence ratios
- all_k_results -> All resolved codes for metrics calculation purposes
- sentence -> SentenceId
- chunk -> ChunkId
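As a rough illustration of the `confidence` and `confidence_ratio` fields, the sketch below turns a few neighbor distances into relative confidences and then divides the top confidence by the second one, following the TopMatchConfidence / SecondMatchConfidence definition above. The distance-to-probability mapping used here (a softmax over negative distances) is an assumption for the example; the text above does not specify the mapping the resolver actually uses, and the distances are made up.

```python
import math


def distances_to_confidences(distances):
    """Map distances to relative confidences that sum to 1.

    Assumption for illustration: softmax over negative distances,
    so a smaller distance yields a larger confidence.
    """
    weights = [math.exp(-d) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


def confidence_ratio(confidences):
    """Top confidence divided by the second-best confidence."""
    ranked = sorted(confidences, reverse=True)
    return ranked[0] / ranked[1]


# Made-up distances from one chunk to its three nearest candidates.
distances = [0.2, 0.9, 1.5]
confidences = distances_to_confidences(distances)
ratio = confidence_ratio(confidences)
print([round(c, 3) for c in confidences])
print(round(ratio, 2))
```

Under this definition, a ratio close to 1 means the two best candidates are almost tied, while a larger ratio means the top match stands out from the runner-up.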

In [12]:
output.withColumn("note",F.monotonically_increasing_id()).select(F.col("note"),F.explode(F.arrays_zip("icdo_chunk.result","tm_icdo_code.result","tm_icdo_code.metadata")).alias("icdo_result")) \
.select("note",
        F.expr("icdo_result['0']").alias("chunk"),
        F.expr("substring(icdo_result['2'].resolved_text,0,25)").alias("resolved_text"),
        F.expr("icdo_result['1']").alias("code"),
        #F.expr("icdo_result['2'].alternative_codes").alias("alternative_codes"),
        F.expr("round(icdo_result['2'].confidence_ratio,2)").alias("confidence")) \
.distinct() \
.orderBy(["note","confidence"], ascending=[True,False]) \
.toPandas()

| | note | chunk | resolved_text | code | confidence |
|---|---|---|---|---|---|
| 0 | 936302870528 | blood | Leukemia, NOS | 9800/3 | 1.61 |
| 1 | 936302870528 | Lymph nodes | Renal cell carcinoma | 8312/3 | 1.53 |
| 2 | 936302870528 | right abdomen | Renal cell carcinoma | 8312/3 | 1.37 |
| 3 | 936302870528 | left neck nodes | Renal cell carcinoma | 8312/3 | 1.28 |
| 4 | 936302870528 | right neck nodes | Renal cell carcinoma | 8312/3 | 1.28 |
| ... | ... | ... | ... | ... | ... |
| 96 | 1537598291968 | axillary lymphadenopathy | Angioimmunoblastic T-cell | 9705/3 | 1.04 |
| 97 | 1537598291968 | thigh | Fibromyxosarcoma | 8811/3 | 1.01 |
| 98 | 1537598291968 | Lymphoma | Mantle cell lymphoma | 9673/3 | 1.00 |
| 99 | 1537598291968 | extremities | Capillary hemangioma | 9131/0 | 1.00 |
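The query in cell 12 relies on `arrays_zip` followed by `explode` to turn the parallel annotation arrays of each note into one row per chunk. The plain-Python analogue below shows the same reshaping on a single toy row. The row contents, the `CODE_1`/`CODE_2` placeholders, and the helper name `zip_and_explode` are made up for the illustration.

```python
# One toy row with parallel annotation arrays; all values are made up.
row = {
    "note": 0,
    "chunks": ["chunk a", "chunk b"],
    "codes": ["CODE_1", "CODE_2"],
    "metadata": [{"confidence_ratio": "1.20"}, {"confidence_ratio": "1.05"}],
}


def zip_and_explode(row):
    """Yield one flat record per annotation, like arrays_zip followed by explode."""
    for chunk, code, meta in zip(row["chunks"], row["codes"], row["metadata"]):
        yield {
            "note": row["note"],
            "chunk": chunk,
            "code": code,
            "confidence": round(float(meta["confidence_ratio"]), 2),
        }


# Sort by note ascending and confidence descending, as the query does.
records = sorted(zip_and_explode(row), key=lambda r: (r["note"], -r["confidence"]))
for record in records:
    print(record)
```

One difference to keep in mind: Python's `zip` stops at the shortest list, whereas Spark's `arrays_zip` pads shorter arrays with nulls.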


In [14]:
output.select(F.explode(F.arrays_zip("ner_chunk.result","ner_chunk.metadata","snomed_result.result","snomed_result.metadata")).alias("icdo_result")) \
.select(F.expr("substring(icdo_result['0'],0,35)").alias("chunk"),
        F.expr("icdo_result['1'].entity").alias("entity"),
        #F.expr("icdo_result['3'].target_text").alias("target_text"),
        F.expr("substring(icdo_result['3'].resolved_text,0,35)").alias("resolved_text"),
        #F.expr("icdo_result['2']").alias("code"),
        #F.expr("icdo_result['2'].alternative_codes").alias("alternative_codes"),
        F.expr("round(icdo_result['3'].confidence_ratio,2)").alias("conf")
       ) \
.distinct() \
.orderBy("conf",ascending=False)\
.toPandas()

| | chunk | entity | resolved_text | conf |
|---|---|---|---|---|
| 0 | a primary central nervous system ly | PROBLEM | Primary central nervous system lymp | 4.10 |
| 1 | numerous coalescing | PROBLEM | Numerous | 3.80 |
| 2 | a bone marrow biopsy | TEST | Bone marrow biopsy | 2.57 |
| 3 | hemoglobin [HGB] | TEST | Hemoglobin | 2.49 |
| 4 | CLL)/small lymphocytic lymphoma | PROBLEM | Lymphocytic lymphoma | 2.20 |
| ... | ... | ... | ... | ... |
| 121 | karyotype | TEST | Karyotype | 0.50 |
| 122 | platelets | TEST | Platelets | 0.50 |
| 123 | night sweats | PROBLEM | Night sweats | 0.50 |
| 124 | bone marrow biopsy | TEST | Bone marrow biopsy | 0.50 |
