## Get Started with Spark NLP for Healthcare

## Getting the keys and installation

1. To get trial keys for Spark NLP for Healthcare, fill in the form at https://www.johnsnowlabs.com/spark-nlp-try-free/ and the keys will be sent to your email within a few minutes.

2. On a new or existing cluster:

  - add the following to the `Advanced Options -> Spark` tab, in `Spark.Config` box:

    ```bash
    spark.local.dir /var
    spark.kryoserializer.buffer.max 1000M
    spark.serializer org.apache.spark.serializer.KryoSerializer
    ```
  - add the following to the `Advanced Options -> Spark` tab, in `Environment Variables` box:

    ```bash
    AWS_ACCESS_KEY_ID=xxx
    AWS_SECRET_ACCESS_KEY=yyy
    SPARK_NLP_LICENSE=zzz
    ```

3. Download the following files to your local computer with the AWS CLI:

    `$ aws s3 cp --region us-east-2 s3://pypi.johnsnowlabs.com/$jsl_secret/spark-nlp-jsl-$jsl_version.jar spark-nlp-jsl-$jsl_version.jar`

    `$ aws s3 cp --region us-east-2 s3://pypi.johnsnowlabs.com/$jsl_secret/spark-nlp-jsl/spark_nlp_jsl-$jsl_version-py3-none-any.whl spark_nlp_jsl-$jsl_version-py3-none-any.whl` 

4. In the `Libraries` tab inside your cluster:

   - Install New -> PyPI -> `spark-nlp==$public_version` -> Install
   - Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:$public_version` -> Install
   - Add the Healthcare library files that you downloaded above:
     - Install New -> Python Whl -> upload `spark_nlp_jsl-$jsl_version-py3-none-any.whl`
     - Install New -> Jar -> upload `spark-nlp-jsl-$jsl_version.jar`

5. Now you can attach your notebook to the cluster and use Spark NLP!
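
Before running any licensed annotator, it can help to confirm that the environment variables from step 2 are visible to the notebook. The following is a minimal sketch and not part of the official setup: the helper name `missing_license_vars` is hypothetical, and only the three variable names come from the steps above.

```python
import os

# Variable names taken from step 2 above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "SPARK_NLP_LICENSE")


def missing_license_vars(env=None):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_license_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All license-related environment variables are set.")
```

If anything is reported as missing, revisit the `Environment Variables` box and restart the cluster.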

For more information, see 

  https://nlp.johnsnowlabs.com/docs/en/install#databricks-support

  https://nlp.johnsnowlabs.com/docs/en/licensed_install#install-spark-nlp-for-healthcare-on-databricks
  
The following notebook was prepared and tested on **r2.2xlarge at 8.0 (includes Apache Spark 3.1.1, Scala 2.12)** on Databricks.

For more detailed examples, see this repository: https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Healthcare/databricks_notebooks

Let's import the libraries which we will use in the following cells.

In [0]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM
#nlp.install()

🚨 Your Spark-Healthcare is outdated, installed==4.3.0 but latest version==4.2.4
You can run jsl.install() to update Spark-Healthcare
🚨 Your Spark-OCR is outdated, installed==4.3.1 but latest version==4.2.1
You can run jsl.install() to update Spark-OCR


In [0]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

import os
import json
import string
import numpy as np
import pandas as pd

from pyspark.ml import Pipeline, PipelineModel

pd.set_option('max_colwidth', 100)
pd.set_option('display.max_columns', 100)  
pd.set_option('display.expand_frame_repr', False)

spark



**Read Dataset**

We will download a sample file and create a Spark DataFrame from it.

In [0]:
! wget -q https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed_sample_text_small.csv

In [0]:
pubMedDF = spark.read.option("header", "true").csv("dbfs:/pubmed_sample_text_small.csv")
                
pubMedDF.show(truncate=100)

## 1. Clinical NER Pipeline
We will extract clinical entities from text by using the `ner_clinical_large` model.

In [0]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line

#sentenceDetector = SentenceDetector()\
        #.setInputCols(["document"])\
        #.setOutputCol("sentence")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")
 
# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
        .setInputCols(["sentence","token"])\
        .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel.pretrained("pos_clinical", "en", "clinical/models") \
        .setInputCols(["sentence", "token"])\
        .setOutputCol("pos_tags")

# NER model trained on i2b2 (sampled from MIMIC) dataset
clinical_ner = medical.NerModel.pretrained("ner_clinical_large","en","clinical/models")\
        .setInputCols(["sentence","token","embeddings"])\
        .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")

nerPipeline = nlp.Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        pos_tagger,
        clinical_ner,
        ner_converter])


empty_data = spark.createDataFrame([[""]]).toDF("text")

ner_model = nerPipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[ | ][OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[ | ][OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[ | ][OK!]
ner_clinical_large download started this may take some time.
[ | ][ / ][ — ][OK!]


In [0]:
ner_model.stages

Out[8]: [DocumentAssembler_52151993d357,
 SentenceDetectorDLModel_6bafc4746ea5,
 REGEX_TOKENIZER_ae8cd8ae39a3,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 POS_6f55785005bf,
 MedicalNerModel_1a8637089929,
 NER_CONVERTER_54880fada05d]

In [0]:
clinical_ner.getClasses()

Out[9]: ['O',
 'B-TREATMENT',
 'I-TREATMENT',
 'B-PROBLEM',
 'I-PROBLEM',
 'B-TEST',
 'I-TEST']

In [0]:
result = ner_model.transform(pubMedDF.limit(100))

In [0]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|            pos_tags|                 ner|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|The human KCNJ9 (...|[{document, 0, 95...|[{document, 0, 12...|[{token, 0, 2, Th...|[{word_embeddings...|[{pos, 0, 2, DD, ...|[{named_entity, 0...|[{chunk, 48, 106,...|
|BACKGROUND: At pr...|[{document, 0, 14...|[{document, 0, 19...|[{token, 0, 9, BA...|[{word_embeddings...|[{pos, 0, 9, NN, ...|[{named_entity, 0...|[{chunk, 67, 79, ...|
|OBJECTIVE: To inv...|[{document, 0, 15...|[{document, 0, 14...|[{token, 0, 8, OB...|[{word_embeddings...|[{pos, 0, 8, NN, ...|[{named_entity, 0...|[{

In [0]:
result.select("sentence.result").show(truncate=100)

+----------------------------------------------------------------------------------------------------+
|                                                                                              result|
+----------------------------------------------------------------------------------------------------+
|[The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying pota...|
|[BACKGROUND: At present, it is one of the most important issues for the treatment of breast cance...|
|[OBJECTIVE: To investigate the relationship between preoperative atrialfibrillation and early and...|
|[Combined EEG/fMRI recording has been used to localize the generators of EEGevents and to identif...|
|[Kohlschutter syndrome is a rare neurodegenerative disorder presenting withintractable seizures, ...|
|[Statistical analysis of neuroimages is commonly approached with intergroupcomparisons made by re...|
|[The synthetic DOX-LNA conjugate was characterized by proton nuclear mag

In [0]:
from pyspark.sql import functions as F

result_df = result.select(F.explode(F.arrays_zip(result.token.result,result.ner.result)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"))

result_df.show(50, truncate=100)

+-------------------+-----------+
|              token|  ner_label|
+-------------------+-----------+
|                The|          O|
|              human|          O|
|              KCNJ9|          O|
|                  (|          O|
|                Kir|          O|
|                3.3|          O|
|                  ,|          O|
|              GIRK3|          O|
|                  )|          O|
|                 is|          O|
|                  a|          O|
|             member|          O|
|                 of|          O|
|                the|B-TREATMENT|
|G-protein-activated|I-TREATMENT|
|           inwardly|I-TREATMENT|
|         rectifying|I-TREATMENT|
|          potassium|I-TREATMENT|
|                  (|I-TREATMENT|
|               GIRK|I-TREATMENT|
|                  )|          O|
|            channel|          O|
|             family|          O|
|                  .|          O|
|               Here|          O|
|                 we|          O|
|           de

Let's count the NER labels.

In [0]:
result_df.select("token", "ner_label").groupBy('ner_label').count().orderBy('count', ascending=False).show(truncate=False)

+-----------+-----+
|ner_label  |count|
+-----------+-----+
|O          |13860|
|I-PROBLEM  |1591 |
|B-PROBLEM  |914  |
|I-TREATMENT|739  |
|I-TEST     |634  |
|B-TREATMENT|624  |
|B-TEST     |534  |
+-----------+-----+
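
The `B-` and `I-` prefixes follow the BIO scheme: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. The sketch below illustrates how such tags can be merged into chunks. It is a simplified stand-in, not the actual implementation of `NerConverterInternal`; the function name `group_bio` is hypothetical, and tokens are joined with spaces rather than by character offsets.

```python
def group_bio(tokens, labels):
    """Merge B-/I- tagged tokens into (chunk_text, entity) pairs (illustrative only)."""
    chunks, current, entity = [], [], None
    for token, label in zip(tokens, labels):
        prefix, _, name = label.partition("-")
        if prefix == "B" or (prefix == "I" and name != entity):
            # A B- tag, or an I- tag without a matching open chunk, starts a new chunk.
            if current:
                chunks.append((" ".join(current), entity))
            current, entity = [token], name
        elif prefix == "I":
            # Continuation of the currently open chunk.
            current.append(token)
        else:
            # "O" closes any open chunk.
            if current:
                chunks.append((" ".join(current), entity))
            current, entity = [], None
    if current:
        chunks.append((" ".join(current), entity))
    return chunks


# Tokens and labels copied from the token-level output above.
tokens = ["of", "the", "G-protein-activated", "inwardly", "rectifying",
          "potassium", "(", "GIRK", ")", "channel"]
labels = ["O", "B-TREATMENT", "I-TREATMENT", "I-TREATMENT", "I-TREATMENT",
          "I-TREATMENT", "I-TREATMENT", "I-TREATMENT", "O", "O"]

print(group_bio(tokens, labels))
```

Because the sketch joins tokens with spaces, it yields `( GIRK` where the real converter, which works from character offsets, returns `(GIRK`.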



In [0]:
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.begin, result.ner_chunk.end, result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['3']['sentence']").alias("sentence_id"),
              F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label")).show(truncate=False)

+-----------+-----------------------------------------------------------+-----+---+---------+
|sentence_id|chunk                                                      |begin|end|ner_label|
+-----------+-----------------------------------------------------------+-----+---+---------+
|0          |the G-protein-activated inwardly rectifying potassium (GIRK|48   |106|TREATMENT|
|1          |the genomicorganization                                    |142  |164|TREATMENT|
|1          |a candidate gene forType II diabetes mellitus              |210  |254|PROBLEM  |
|2          |byapproximately                                            |380  |394|TREATMENT|
|3          |single nucleotide polymorphisms                            |464  |494|TREATMENT|
|3          |aVal366Ala substitution                                    |532  |554|PROBLEM  |
|3          |an 8 base-pair                                             |561  |574|PROBLEM  |
|3          |insertion/deletion                             

**We can also filter the NER results to keep only specific entities by using the `setWhiteList()` method. In this example we will keep only `PROBLEM` entities.**

In [0]:
ner_converter_filter = medical.NerConverterInternal()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")\
        .setWhiteList(["PROBLEM"])

nerFilteredPipeline = nlp.Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter_filter])


empty_data = spark.createDataFrame([[""]]).toDF("text")

ner_filtered_model = nerFilteredPipeline.fit(empty_data)

In [0]:
filtered_result = ner_filtered_model.transform(pubMedDF.limit(100))

In [0]:
filtered_result.select(F.explode(F.arrays_zip(filtered_result.ner_chunk.result, 
                                              filtered_result.ner_chunk.begin, 
                                              filtered_result.ner_chunk.end, 
                                              filtered_result.ner_chunk.metadata)).alias("cols")) \
               .select(F.expr("cols['3']['sentence']").alias("sentence_id"),
                       F.expr("cols['0']").alias("chunk"),
                       F.expr("cols['1']").alias("begin"),
                       F.expr("cols['2']").alias("end"),
                       F.expr("cols['3']['entity']").alias("ner_label")).show(truncate=False)

+-----------+---------------------------------------------+-----+----+---------+
|sentence_id|chunk                                        |begin|end |ner_label|
+-----------+---------------------------------------------+-----+----+---------+
|1          |a candidate gene forType II diabetes mellitus|210  |254 |PROBLEM  |
|3          |aVal366Ala substitution                      |532  |554 |PROBLEM  |
|3          |an 8 base-pair                               |561  |574 |PROBLEM  |
|3          |insertion/deletion                           |581  |598 |PROBLEM  |
|4          |the transcript in various humantissues       |648  |685 |PROBLEM  |
|4          |fat andskeletal muscle                       |749  |770 |PROBLEM  |
|5          |furtherstudies                               |830  |843 |PROBLEM  |
|5          |Type II diabetes                             |940  |955 |PROBLEM  |
|0          |breast cancer                                |84   |96  |PROBLEM  |
|6          |change         

**As you can see, we got only `PROBLEM` entities from the text.**
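
Conceptually, the whitelist acts as a filter on the entity label of each chunk. The plain-Python sketch below mimics that behaviour on a few rows copied from the unfiltered output; it is an illustration under that assumption, not the converter's actual code, and `filter_chunks` is a hypothetical helper.

```python
def filter_chunks(chunks, white_list):
    """Keep only the chunks whose entity label is in white_list (illustrative only)."""
    allowed = set(white_list)
    return [chunk for chunk in chunks if chunk["entity"] in allowed]


# Rows copied from the unfiltered ner_chunk output above.
chunks = [
    {"chunk": "the genomicorganization", "entity": "TREATMENT"},
    {"chunk": "a candidate gene forType II diabetes mellitus", "entity": "PROBLEM"},
    {"chunk": "byapproximately", "entity": "TREATMENT"},
    {"chunk": "aVal366Ala substitution", "entity": "PROBLEM"},
]

problems = filter_chunks(chunks, ["PROBLEM"])
for row in problems:
    print(row["chunk"], "->", row["entity"])
```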

### NER Visualization

The `sparknlp_display` library provides visualization. It works with LightPipeline results.

In [0]:
sample_text = [pubMedDF.limit(3).collect()[i][0] for i in range(3)]

In [0]:
sample_text[1]

Out[20]: 'BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes. METHODS: Vinorelbinewas administered at a dose level of 25 mg/m(2) intravenously on days 1 and 8 of a3 week cycle. Patients were given three or more cycles in the absence of tumorprogression. A maximum of nine cycles were administered. RESULTS: The responserate in 50 evaluable patients was 20.0% (10 out of 50; 95% confidence interval,10.0-33.7%). Responders plus those who had minor response (MR) or no change (NC) accounted for 58.0% [10 partial responses (PRs) + one MR + 18 NCs out of 50]. TheKaplan-Meier estimate (50% poin

In [0]:
ner_lp = nlp.LightPipeline(ner_model)
light_result = ner_lp.fullAnnotate(sample_text[1])

In [0]:
visualiser = nlp.viz.NerVisualizer()

vis = visualiser.display(light_result[0], label_col='ner_chunk', document_col='document', return_html=True)

# Change color of an entity label
#visualiser.set_label_colors({'PROBLEM':'#008080', 'TEST':'#800080', 'TREATMENT':'#808080'})
#visualiser.display(light_result[0], label_col='ner_chunk')

# Set label filter
# vis = visualiser.display(light_result[0], label_col='ner_chunk', document_col='document',
#                          labels=['PROBLEM', 'TEST', 'TREATMENT'], return_html=True)
  
displayHTML(vis)

**Spark NLP offers many NER models for different purposes. Let's see what happens when we use the `jsl_ner_wip_clinical` model, which has about 80 different NER labels.**

In [0]:
jsl_ner = medical.NerModel.pretrained("jsl_ner_wip_clinical","en","clinical/models")\
        .setInputCols(["sentence","token","embeddings"])\
        .setOutputCol("ner")

jslPipeline = nlp.Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        pos_tagger,
        jsl_ner,
        ner_converter])


empty_data = spark.createDataFrame([[""]]).toDF("text")

jsl_model = jslPipeline.fit(empty_data)

jsl_ner_wip_clinical download started this may take some time.
[ | ][ / ][ — ][OK!]


In [0]:
jsl_lp = nlp.LightPipeline(jsl_model)
jsl_light_result = jsl_lp.fullAnnotate(sample_text[1])

In [0]:
visualiser = nlp.viz.NerVisualizer()

vis = visualiser.display(jsl_light_result[0], label_col='ner_chunk', document_col='document', return_html=True)

displayHTML(vis)

**If you want to learn more about NER, you can check this comprehensive notebook:**

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/databricks_notebooks/1.Clinical_Named_Entity_Recognition_Model_v3.0.ipynb

## 2. Clinical Assertion

Now we will check the assertion status of the clinical entities. We will use the `ner_clinical_large` model for NER and the `assertion_dl` model to determine the assertion status of the detected entities. To do so, we will reuse the NER pipeline that we created above.

In [0]:
# Assertion model trained on i2b2 (sampled from MIMIC) dataset
clinical_assertion = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
    
assertionPipeline = nlp.Pipeline(stages=[
    nerPipeline,
    clinical_assertion
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

assertion_model = assertionPipeline.fit(empty_data)

assertion_dl download started this may take some time.
[ | ][OK!]


**This time we will run the model through a LightPipeline.**

In [0]:
sample_text[0]

Out[27]: 'The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family. Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population. The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons separated byapproximately 2.2 and approximately 2.6 kb introns, respectively. We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair (bp) insertion/deletion. Ourexpression studies revealed the presence of the transcript in various humantissues including pancreas, and two major insulin-responsive tissues: fat andskeletal muscle. The characterization of the KCNJ9 gene should facilitate furtherstudies on the function of the KCNJ9 protein and allow evaluation of thepotential role of the locus in Type II diabetes.'

In [0]:
assertion_light = nlp.LightPipeline(assertion_model)

**We can use the `annotate` method to get faster results for short sentences.**

In [0]:
assertion_anno_res = assertion_light.annotate(sample_text[0])

In [0]:
assertion_anno_res.keys()

Out[30]: dict_keys(['document', 'ner_chunk', 'assertion', 'token', 'ner', 'embeddings', 'pos_tags', 'sentence'])

**Let's create a pandas DataFrame to see our results clearly.**

In [0]:
pd.DataFrame(list(zip(assertion_anno_res["ner_chunk"], assertion_anno_res["assertion"])), columns=["ner_chunk", "assertion"])

| | ner_chunk | assertion |
|---|---|---|
| 0 | the G-protein-activated inwardly rectifying potassium (GIRK | conditional |
| 1 | the genomicorganization | present |
| 2 | a candidate gene forType II diabetes mellitus | present |
| 3 | byapproximately | present |
| 4 | single nucleotide polymorphisms | present |
| 5 | aVal366Ala substitution | present |
| 6 | an 8 base-pair | present |
| 7 | insertion/deletion | absent |
| 8 | Ourexpression studies | present |
| 9 | the transcript in various humantissues | present |


**This time we will use the `fullAnnotate` method on our text to get the results together with their metadata.**

In [0]:
assertion_result = assertion_light.fullAnnotate(sample_text[0])[0]

In [0]:
assertion_result.keys()

Out[33]: dict_keys(['document', 'ner_chunk', 'assertion', 'token', 'ner', 'embeddings', 'pos_tags', 'sentence'])

In [0]:
chunks=[]
entities=[]
status=[]

for n,m in zip(assertion_result['ner_chunk'],assertion_result['assertion']):

    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    status.append(m.result)

df = pd.DataFrame({'chunks':chunks, 'entities':entities, 'assertion':status})

In [0]:
df

| | chunks | entities | assertion |
|---|---|---|---|
| 0 | the G-protein-activated inwardly rectifying potassium (GIRK | TREATMENT | conditional |
| 1 | the genomicorganization | TREATMENT | present |
| 2 | a candidate gene forType II diabetes mellitus | PROBLEM | present |
| 3 | byapproximately | TREATMENT | present |
| 4 | single nucleotide polymorphisms | TREATMENT | present |
| 5 | aVal366Ala substitution | PROBLEM | present |
| 6 | an 8 base-pair | PROBLEM | present |
| 7 | insertion/deletion | PROBLEM | absent |
| 8 | Ourexpression studies | TEST | present |
| 9 | the transcript in various humantissues | PROBLEM | present |


**We can also filter assertion results by using the `AssertionFilterer` annotator. We will reuse the assertion pipeline that we created before and keep only the `present` assertions.**

In [0]:
assertion_filterer = medical.AssertionFilterer()\
      .setInputCols("sentence","ner_chunk","assertion")\
      .setOutputCol("assertion_filtered")\
      .setWhiteList(["present"])

assertionFilteredPipeline = nlp.Pipeline(stages=[
    assertionPipeline,
    assertion_filterer
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

assertion_filtered_model = assertionFilteredPipeline.fit(empty_data)

In [0]:
assertion_filtered_light = nlp.LightPipeline(assertion_filtered_model)

In [0]:
assertion_filtered_result = assertion_filtered_light.fullAnnotate(sample_text[0])[0]

In [0]:
assertion_filtered_result.keys()

Out[39]: dict_keys(['assertion_filtered', 'document', 'ner_chunk', 'assertion', 'token', 'ner', 'embeddings', 'pos_tags', 'sentence'])

In [0]:
assertion_filtered_result["assertion_filtered"]

Out[40]: [Annotation(chunk, 142, 164, the genomicorganization, {'chunk': '1', 'confidence': '0.80715', 'ner_source': 'ner_chunk', 'assertion': 'present', 'entity': 'TREATMENT', 'sentence': '1'}),
 Annotation(chunk, 210, 254, a candidate gene forType II diabetes mellitus, {'chunk': '2', 'confidence': '0.7543429', 'ner_source': 'ner_chunk', 'assertion': 'present', 'entity': 'PROBLEM', 'sentence': '1'}),
 Annotation(chunk, 380, 394, byapproximately, {'chunk': '3', 'confidence': '0.7924', 'ner_source': 'ner_chunk', 'assertion': 'present', 'entity': 'TREATMENT', 'sentence': '2'}),
 Annotation(chunk, 464, 494, single nucleotide polymorphisms, {'chunk': '4', 'confidence': '0.6369667', 'ner_source': 'ner_chunk', 'assertion': 'present', 'entity': 'TREATMENT', 'sentence': '3'}),
 Annotation(chunk, 532, 554, aVal366Ala substitution, {'chunk': '5', 'confidence': '0.53615', 'ner_source': 'ner_chunk', 'assertion': 'present', 'entity': 'PROBLEM', 'sentence': '3'}),
 Annotation(chunk, 561, 574, an 8 b

Here are the `present` entities.

In [0]:
chunks=[]
entities=[]


for n in assertion_filtered_result['assertion_filtered']:

    chunks.append(n.result)
    entities.append(n.metadata['entity']) 


filtered_df = pd.DataFrame({'chunks':chunks, 'entities':entities})

filtered_df

| | chunks | entities |
|---|---|---|
| 0 | the genomicorganization | TREATMENT |
| 1 | a candidate gene forType II diabetes mellitus | PROBLEM |
| 2 | byapproximately | TREATMENT |
| 3 | single nucleotide polymorphisms | TREATMENT |
| 4 | aVal366Ala substitution | PROBLEM |
| 5 | an 8 base-pair | PROBLEM |
| 6 | Ourexpression studies | TEST |
| 7 | the transcript in various humantissues | PROBLEM |
| 8 | furtherstudies | PROBLEM |
| 9 | the KCNJ9 protein | TREATMENT |


### Assertion Visualization

We can visualize the assertion status of the detected entities by using the `AssertionVisualizer` module of the `sparknlp_display` library.

In [0]:
assertion_vis = nlp.viz.AssertionVisualizer()

## To set custom label colors:
assertion_vis.set_label_colors({'TREATMENT':'#008080', 'PROBLEM':'#800080'}) #set label colors by specifying hex codes

vis = assertion_vis.display(assertion_result, 
                            label_col = 'ner_chunk', 
                            assertion_col = 'assertion',
                            document_col = 'document' ,
                            return_html=True
                      )

displayHTML(vis)

**If you want to see more assertion model examples, you can check this comprehensive notebook:**

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.Clinical_Assertion_Model.ipynb

## 3. Relation Extraction

In this section, we will show an example of relation extraction. We will reuse the NER pipeline that we created before to extract clinical entities, and the `re_clinical` model to extract relations between these entities. The relations are those defined in the 2010 i2b2 relation challenge:

**TrIP:** A certain treatment has improved or cured a medical problem (eg, ‘infection resolved with antibiotic course’)

**TrWP:** A patient's medical problem has deteriorated or worsened because of or in spite of a treatment being administered (eg, ‘the tumor was growing despite the drain’)

**TrCP:** A treatment caused a medical problem (eg, ‘penicillin causes a rash’)

**TrAP:** A treatment administered for a medical problem (eg, ‘Dexamphetamine for narcolepsy’)

**TrNAP:** The administration of a treatment was avoided because of a medical problem (eg, ‘Ralafen which is contra-indicated because of ulcers’)

**TeRP:** A test has revealed some medical problem (eg, ‘an echocardiogram revealed a pericardial effusion’)

**TeCP:** A test was performed to investigate a medical problem (eg, ‘chest x-ray done to rule out pneumonia’)

**PIP:** Two problems are related to each other (eg, ‘Azotemia presumed secondary to sepsis’)
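
When reading model output, it can be convenient to keep these definitions in a lookup table. The sketch below paraphrases the list above; the names `I2B2_RELATIONS` and `describe_relation` are hypothetical, and treating the label `O` as "no relation" is an assumption based on how the results are filtered later in this notebook.

```python
# Short paraphrases of the relation definitions listed above.
I2B2_RELATIONS = {
    "TrIP": "treatment improved or cured a medical problem",
    "TrWP": "medical problem worsened because of or in spite of a treatment",
    "TrCP": "treatment caused a medical problem",
    "TrAP": "treatment administered for a medical problem",
    "TrNAP": "treatment avoided because of a medical problem",
    "TeRP": "test revealed a medical problem",
    "TeCP": "test performed to investigate a medical problem",
    "PIP": "two problems are related to each other",
}


def describe_relation(code):
    """Return a readable description for a relation label (hypothetical helper)."""
    if code == "O":
        # Assumption: "O" marks a candidate pair with no relation.
        return "no relation"
    return I2B2_RELATIONS.get(code, "unknown relation code")


for code in ("TrAP", "TeRP", "O"):
    print(code, "->", describe_relation(code))
```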

In [0]:
dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tags", "token"])\
    .setOutputCol("dependencies")

clinical_re_Model = medical.RelationExtractionModel()\
    .pretrained("re_clinical", "en", 'clinical/models')\
    .setInputCols(["embeddings", "pos_tags", "ner_chunk", "dependencies"])\
    .setOutputCol("relations")\
    .setMaxSyntacticDistance(4)
# Optionally restrict the candidate pairs with .setRelationPairs(["problem-test", "problem-treatment"]);
# if not set, all the relations will be calculated.

relPipeline = nlp.Pipeline(stages=[
    nerPipeline,
    dependency_parser,
    clinical_re_Model
])


empty_data = spark.createDataFrame([[""]]).toDF("text")

rel_model = relPipeline.fit(empty_data)

dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[ | ][OK!]
re_clinical download started this may take some time.
Approximate size to download 6 MB
[ | ][ / ][OK!]


In [0]:
rel_model.stages

Out[44]: [PipelineModel_946c7c11abc6,
 dependency_e7755462ba78,
 RelationExtractionModel_9c255241fec3]

In [0]:
sample_text[1]

Out[45]: 'BACKGROUND: At present, it is one of the most important issues for the treatment of breast cancer to develop the standard therapy for patients previously treated with anthracyclines and taxanes. With the objective of determining the usefulnessof vinorelbine monotherapy in patients with advanced or recurrent breast cancerafter standard therapy, we evaluated the efficacy and safety of vinorelbine inpatients previously treated with anthracyclines and taxanes. METHODS: Vinorelbinewas administered at a dose level of 25 mg/m(2) intravenously on days 1 and 8 of a3 week cycle. Patients were given three or more cycles in the absence of tumorprogression. A maximum of nine cycles were administered. RESULTS: The responserate in 50 evaluable patients was 20.0% (10 out of 50; 95% confidence interval,10.0-33.7%). Responders plus those who had minor response (MR) or no change (NC) accounted for 58.0% [10 partial responses (PRs) + one MR + 18 NCs out of 50]. TheKaplan-Meier estimate (50% poin

In [0]:
rel_light = nlp.LightPipeline(rel_model)
relation_res = rel_light.fullAnnotate(sample_text[1])[0]

In [0]:
relation_res.keys()

Out[47]: dict_keys(['document', 'ner_chunk', 'token', 'relations', 'ner', 'embeddings', 'pos_tags', 'dependencies', 'sentence'])

In [0]:
rel_pairs=[]
  
for rel in relation_res["relations"]:
    rel_pairs.append((
          rel.result, 
          rel.metadata['entity1'], 
          rel.metadata['entity1_begin'],
          rel.metadata['entity1_end'],
          rel.metadata['chunk1'], 
          rel.metadata['entity2'],
          rel.metadata['entity2_begin'],
          rel.metadata['entity2_end'],
          rel.metadata['chunk2'], 
          rel.metadata['confidence']
      ))

rel_df = pd.DataFrame(rel_pairs, columns=['relation','entity1','entity1_begin','entity1_end','chunk1','entity2','entity2_begin','entity2_end','chunk2', 'confidence'])
rel_df[rel_df.relation!="O"]

| | relation | entity1 | entity1_begin | entity1_end | chunk1 | entity2 | entity2_begin | entity2_end | chunk2 | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TrAP | TREATMENT | 67 | 79 | the treatment | PROBLEM | 84 | 96 | breast cancer | 0.99830663 |
| 3 | TeRP | TREATMENT | 109 | 128 | the standard therapy | TREATMENT | 186 | 192 | taxanes | 0.9952891 |
| 5 | TeRP | TREATMENT | 229 | 268 | the usefulnessof vinorelbine monotherapy | TREATMENT | 287 | 343 | advanced or recurrent breast cancerafter standard therapy | 0.9998728 |
| 9 | TeRP | TEST | 856 | 858 | MR) | PROBLEM | 866 | 871 | change | 0.99999976 |
| 12 | TeRP | TEST | 902 | 918 | partial responses | TEST | 921 | 923 | PRs | 0.99784946 |
| 13 | TeRP | TEST | 932 | 933 | MR | TEST | 940 | 946 | NCs out | 0.9762975 |
| 14 | PIP | PROBLEM | 1110 | 1126 | Themajor toxicity | PROBLEM | 1132 | 1147 | myelosuppression | 0.99352974 |
| 15 | TeRP | PROBLEM | 1110 | 1126 | Themajor toxicity | TREATMENT | 1183 | 1204 | requirediscontinuation | 0.8337787 |
| 16 | TeRP | PROBLEM | 1110 | 1126 | Themajor toxicity | TREATMENT | 1209 | 1217 | treatment | 0.7689903 |
| 17 | TeRP | PROBLEM | 1132 | 1147 | myelosuppression | TREATMENT | 1183 | 1204 | requirediscontinuation | 0.8337787 |
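
Two details are worth keeping in mind when post-processing these rows: the `begin` and `end` offsets are inclusive character positions, and annotation metadata values appear to be strings (as the quoted `confidence` values in the earlier `Annotation` output suggest), so they need casting before numeric comparison. The sketch below checks both points on the first relation row. The dictionary is hand-written to mirror that row, and `span` is a hypothetical helper.

```python
# Opening of sample_text[1], copied from the output above.
text = ("BACKGROUND: At present, it is one of the most important issues "
        "for the treatment of breast cancer")

# First relation row from the table above, written with string values
# on the assumption that annotation metadata is a string-to-string map.
metadata = {
    "chunk1": "the treatment", "entity1_begin": "67", "entity1_end": "79",
    "chunk2": "breast cancer", "entity2_begin": "84", "entity2_end": "96",
    "confidence": "0.99830663",
}


def span(source, begin, end):
    """Slice source using inclusive begin/end offsets given as strings or ints."""
    return source[int(begin): int(end) + 1]


chunk1 = span(text, metadata["entity1_begin"], metadata["entity1_end"])
chunk2 = span(text, metadata["entity2_begin"], metadata["entity2_end"])
confidence = float(metadata["confidence"])

print(chunk1, "|", chunk2, "| confident:", confidence >= 0.9)
```

With the confidence cast to `float`, a threshold such as `rel_df[rel_df.confidence.astype(float) > 0.9]` can be applied to the pandas DataFrame in the same way.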


### Relation Visualization

We can visualize relations between entities by using the `RelationExtractionVisualizer` module of the `sparknlp_display` library.

In [0]:
re_vis = nlp.viz.RelationExtractionVisualizer()

vis = re_vis.display(relation_res,
                     relation_col = 'relations',
                     document_col = 'document',
                     show_relations=True,
                     return_html=True)

displayHTML(vis)

**If you want to see more relation extraction model examples, you can check this comprehensive notebook:**

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/10.Clinical_Relation_Extraction.ipynb

## 4. Entity Resolution

Spark NLP offers many entity resolution models for different purposes. They fall into two main categories:

* Chunk Entity Resolver Models
* Sentence Entity Resolver Models

Here are some of the resolver models in Spark NLP:

- sbiobertresolve_icd10cm 
- sbiobertresolve_icd10cm_augmented
- sbiobertresolve_icd10cm_slim_normalized
- sbiobertresolve_icd10cm_slim_billable_hcc
- sbertresolve_icd10cm_slim_billable_hcc_med
- sbiobertresolve_icd10pcs
- sbiobertresolve_snomed_findings (with clinical_findings concepts from CT version)
- sbiobertresolve_snomed_findings_int  (with clinical_findings concepts from INT version)
- sbiobertresolve_snomed_auxConcepts (with Morph Abnormality, Procedure, Substance, Physical Object, Body Structure concepts from CT version)
- sbiobertresolve_snomed_auxConcepts_int  (with Morph Abnormality, Procedure, Substance, Physical Object, Body Structure concepts from INT version)
- sbiobertresolve_rxnorm
- sbiobertresolve_rxcui
- sbiobertresolve_icdo
- sbiobertresolve_cpt
- sbiobertresolve_loinc
- sbiobertresolve_HPO
- sbiobertresolve_umls_major_concepts
- sbiobertresolve_umls_findings
- ...

We will use the same NER pipeline together with the `sbertresolve_icd10cm_slim_billable_hcc_med` ICD-10-CM entity resolver model, which takes `sbert_jsl_medium_uncased` sentence embeddings as input.

In [0]:
c2doc = nlp.Chunk2Doc()\
      .setInputCols("ner_chunk")\
      .setOutputCol("ner_chunk_doc") 

sbert_embedder = nlp.BertSentenceEmbeddings\
      .pretrained("sbert_jsl_medium_uncased",'en','clinical/models')\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sbert_embeddings")

icd_resolver = medical.SentenceEntityResolverModel.pretrained("sbertresolve_icd10cm_slim_billable_hcc_med","en", "clinical/models") \
      .setInputCols(["sbert_embeddings"]) \
      .setOutputCol("icd10_code")\
      .setDistanceFunction("EUCLIDEAN")


resolverPipeline = nlp.Pipeline(stages=[
        nerPipeline,
        c2doc,
        sbert_embedder,
        icd_resolver
    
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
resolver_model = resolverPipeline.fit(empty_data)

sbert_jsl_medium_uncased download started this may take some time.
Approximate size to download 146.8 MB
[ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][OK!]
sbertresolve_icd10cm_slim_billable_hcc_med download started this may take some time.
[ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][OK!]


In [0]:
res_light = nlp.LightPipeline(resolver_model)

In [0]:
res_anno = res_light.annotate("bladder cancer")

In [0]:
res_anno

Out[53]: {'document': ['bladder cancer'],
 'ner_chunk': ['bladder cancer'],
 'token': ['bladder', 'cancer'],
 'sbert_embeddings': ['bladder cancer'],
 'ner': ['B-PROBLEM', 'I-PROBLEM'],
 'embeddings': ['bladder', 'cancer'],
 'pos_tags': ['JJR', 'NN'],
 'icd10_code': ['C671'],
 'ner_chunk_doc': ['bladder cancer'],
 'sentence': ['bladder cancer']}

In [0]:
list(zip(res_anno["ner_chunk"], res_anno["icd10_code"]))

Out[54]: [('bladder cancer', 'C671')]

In [0]:
resolver_res = res_light.fullAnnotate(sample_text[1])[0]

In [0]:
resolver_res.keys()

Out[56]: dict_keys(['document', 'ner_chunk', 'token', 'sbert_embeddings', 'ner', 'embeddings', 'pos_tags', 'icd10_code', 'ner_chunk_doc', 'sentence'])

In [0]:
chunks = []
codes = []
begin = []
end = []
resolutions= []
all_distances = []
all_codes= []
all_cosines = []
all_k_aux_labels= []
confidence = []
entity = []

for chunk, code in zip(resolver_res['ner_chunk'], resolver_res["icd10_code"]):

    begin.append(chunk.begin)
    entity.append(chunk.metadata['entity'])
    end.append(chunk.end)
    chunks.append(chunk.result)
    codes.append(code.result) 
    confidence.append(code.metadata['confidence'])
    all_codes.append(code.metadata['all_k_results'].split(':::'))
    resolutions.append(code.metadata['all_k_resolutions'].split(':::'))
    all_distances.append(code.metadata['all_k_distances'].split(':::'))
    all_cosines.append(code.metadata['all_k_cosine_distances'].split(':::'))
    all_k_aux_labels.append(code.metadata['all_k_aux_labels'].split(':::'))
    
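# Note: the 'all_distances' column is filled from all_cosines (all_k_cosine_distances);
# the values collected in the all_distances list (all_k_distances) are not used here.
df = pd.DataFrame({'chunks':chunks, 'entity':entity, 'begin': begin, 'end':end, 'code':codes, 'all_codes':all_codes, 
                   'resolutions':resolutions, 'all_k_aux_labels':all_k_aux_labels,'all_distances':all_cosines})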



df['billable'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[0] for i in x])
df['hcc_status'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[1] for i in x])
df['hcc_score'] = df['all_k_aux_labels'].apply(lambda x: [i.split('||')[2] for i in x])
df['confidence'] = confidence

df = df.drop(['all_k_aux_labels'], axis=1)
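
The cell above relies on two separators in the resolver metadata: `:::` between candidates and `||` between the fields of each auxiliary label (billable flag, HCC status, HCC score). Here is a minimal sketch of that parsing on a hand-written string. The sample value is not real model output, and `parse_aux_labels` is a hypothetical helper introduced only for this illustration.

```python
def parse_aux_labels(raw):
    """Split an all_k_aux_labels string into billable, HCC status and HCC score lists."""
    billable, hcc_status, hcc_score = [], [], []
    for candidate in raw.split(":::"):   # one entry per candidate code
        fields = candidate.split("||")    # billable || hcc_status || hcc_score
        billable.append(fields[0])
        hcc_status.append(fields[1])
        hcc_score.append(fields[2])
    return {"billable": billable, "hcc_status": hcc_status, "hcc_score": hcc_score}

# Hand-written sample (not real model output) in the same format as the metadata.
sample = "1||0||0:::1||1||12:::0||0||0"
parsed = parse_aux_labels(sample)
print(parsed)
```

The parsed fields remain strings, so cast them with `int()` or `float()` before doing arithmetic on them.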

In [0]:
pd.set_option("display.max_colwidth", 100)

In [0]:
df

Unnamed: 0,chunks,entity,begin,end,code,all_codes,resolutions,all_distances,billable,hcc_status,hcc_score,confidence
0,the treatment,TREATMENT,67,79,Z7689,"[Z7689, Z789, F4329, N5313, Z3141, Z37, Q438, E45, Z9189, T66XXXA, Z7189]","[response to treatment [Persons encountering health services in other specified circumstances], ...","[0.2620, 0.2690, 0.2773, 0.2850, 0.2940, 0.2921, 0.3030, 0.2976, 0.3103, 0.3050, 0.2962]","[1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]","[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 21, 0, 0, 0]",0.1487
1,breast cancer,PROBLEM,84,96,C50919,"[C50919, Z1239, C50911, C4452, D0590, D493, C61, C44501, Z853, C50819, C50111, C50929, C50921, C...","[breast cancer [Malignant neoplasm of unspecified site of unspecified female breast], screening ...","[0.0000, 0.1055, 0.1108, 0.1151, 0.1247, 0.1303, 0.1323, 0.1391, 0.1372, 0.1420, 0.1439, 0.1440,...","[1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]","[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1]","[0, 0, 0, 0, 0, 0, 12, 0, 0, 0, 0, 0, 0, 12, 0, 0, 0, 9, 12]",0.9984
2,the standard therapy,TREATMENT,109,128,Z789,"[Z789, Z7689, Z9289, Z5181, Z7189]","[complete therapeutic response [Other specified health status], ideal weight discussed (regime/t...","[0.2608, 0.2806, 0.2919, 0.2893, 0.3101]","[1, 1, 1, 1, 1]","[0, 0, 0, 0, 0]","[0, 0, 0, 0, 0]",0.2919
3,anthracyclines,TREATMENT,167,180,A220,"[A220, A229, A22, A222, L940, Z1629, B999, D239, A599, A227, Q821, A5272, H5054, L998, N897, A22...","[skin anthrax [Cutaneous anthrax], anthrax infection [Anthrax, unspecified], anthrax [Anthrax], ...","[0.2478, 0.2633, 0.2688, 0.2688, 0.2868, 0.2857, 0.2857, 0.2944, 0.2948, 0.2823, 0.2905, 0.2970,...","[1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 115, 6, 0, 0, 0, 0, 0]",0.0817
4,taxanes,TREATMENT,186,192,H1511,"[H1511, H15119, D4989, C7220, Q103, C781, L293, L291, I722, P2810, I714, R198, Q702, Q7649, L812...","[episcleritis periodica fugax [Episcleritis periodica fugax], episcleritis periodica fugax [Epis...","[0.3039, 0.3039, 0.2873, 0.2735, 0.2948, 0.3006, 0.2971, 0.2971, 0.2937, 0.3039, 0.3005, 0.2809,...","[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1]","[0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1]","[0, 0, 0, 10, 0, 8, 0, 0, 108, 0, 108, 0, 0, 0, 0, 108, nan, 111, 11, 11, 0, 0, 11, 108, 107]",0.0579
5,the usefulnessof vinorelbine monotherapy,TREATMENT,229,268,Z7189,"[Z7189, Z7689, Z4931, L049, T464X4S, Z713, K5900, Z3183, Z4932, F1621, F1921, T43214, F18921, Z4...","[accutane treatment counseling [Other specified counseling], incretin mimetic therapy started (s...","[0.1702, 0.1904, 0.2058, 0.2281, 0.2153, 0.2091, 0.2091, 0.2194, 0.2257, 0.2462, 0.2462, 0.2356,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1]","[0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0]","[0, 0, 134, 0, 0, 0, 0, 0, 134, 55, 55, 0, 0, 134, 0, 0]",0.1363
6,advanced or recurrent breast cancerafter standard therapy,TREATMENT,287,343,Z1239,"[Z1239, Z8541, Z923, Z125, C802, T8649, R9721, T86818, Z1503, O3412, O3413, C5091, O3410, O3411,...",[screening exam for breast cancer [Encounter for other screening for malignant neoplasm of breas...,"[0.1534, 0.1573, 0.1573, 0.1514, 0.1550, 0.1550, 0.1631, 0.1582, 0.1690, 0.1640, 0.1650, 0.1732,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1]","[0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 12, 186, 0, 0, 0, 0, 0, 12, 0, 0, 186, 8, 0, 11, 0, 0, 0, 0]",0.0606
7,vinorelbine,TREATMENT,386,396,E7201,"[E7201, H5353, N8181, O418X, L708, F488, R7989, S0082, S0082XA, I7389, S0082XD, J3489, I898, Y95...","[cystinuria, type 1 [Cystinuria], deuteranomaly [Deuteranomaly], perineocele [Perineocele], find...","[0.2897, 0.2942, 0.2888, 0.3018, 0.2972, 0.3220, 0.3120, 0.3416, 0.3416, 0.3567, 0.3461, 0.3205,...","[1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[23, 0, 0, 0, 0, 0, 0, 0, 0, 108, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]",0.0937
8,anthracyclines,TREATMENT,433,446,A220,"[A220, A229, A22, A222, L940, Z1629, B999, D239, A599, A227, Q821, A5272, H5054, L998, N897, A22...","[skin anthrax [Cutaneous anthrax], anthrax infection [Anthrax, unspecified], anthrax [Anthrax], ...","[0.2478, 0.2633, 0.2688, 0.2688, 0.2868, 0.2857, 0.2857, 0.2944, 0.2948, 0.2823, 0.2905, 0.2970,...","[1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 115, 6, 0, 0, 0, 0, 0]",0.0817
9,taxanes,TREATMENT,452,458,H1511,"[H1511, H15119, D4989, C7220, Q103, C781, L293, L291, I722, P2810, I714, R198, Q702, Q7649, L812...","[episcleritis periodica fugax [Episcleritis periodica fugax], episcleritis periodica fugax [Epis...","[0.3039, 0.3039, 0.2873, 0.2735, 0.2948, 0.3006, 0.2971, 0.2971, 0.2937, 0.3039, 0.3005, 0.2809,...","[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1]","[0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1]","[0, 0, 0, 10, 0, 8, 0, 0, 108, 0, 108, 0, 0, 0, 0, 108, nan, 111, 11, 11, 0, 0, 11, 108, 107]",0.0579


**Let's check the results with a confidence level above 0.5**

In [0]:
df[df.confidence.astype(float) > 0.5]

Unnamed: 0,chunks,entity,begin,end,code,all_codes,resolutions,all_distances,billable,hcc_status,hcc_score,confidence
1,breast cancer,PROBLEM,84,96,C50919,"[C50919, Z1239, C50911, C4452, D0590, D493, C61, C44501, Z853, C50819, C50111, C50929, C50921, C...","[breast cancer [Malignant neoplasm of unspecified site of unspecified female breast], screening ...","[0.0000, 0.1055, 0.1108, 0.1151, 0.1247, 0.1303, 0.1323, 0.1391, 0.1372, 0.1420, 0.1439, 0.1440,...","[1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]","[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1]","[0, 0, 0, 0, 0, 0, 12, 0, 0, 0, 0, 0, 0, 12, 0, 0, 0, 9, 12]",0.9984
19,TheKaplan-Meier estimate,TEST,956,979,Z7689,"[Z7689, D720, R700]",[seen by endocrinology service (finding) [Persons encountering health services in other specifie...,"[0.2599, 0.3172, 0.3300]","[1, 1, 1]","[0, 1, 0]","[0, 47, 0]",0.5512
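
The `astype(float)` cast in the filter matters because annotation metadata values, including `confidence`, are stored as strings. Below is a small sketch of the same filter in plain Python, using the chunk and confidence values of the first four rows of the table above. The `confident` helper is a hypothetical name used only for this illustration.

```python
# (chunk, confidence) pairs copied from the first four rows of the table above;
# confidence is kept as a string, as it is in the annotation metadata.
rows = [
    ("the treatment", "0.1487"),
    ("breast cancer", "0.9984"),
    ("the standard therapy", "0.2919"),
    ("anthracyclines", "0.0817"),
]

def confident(rows, threshold=0.5):
    """Keep rows whose confidence, converted to float, exceeds the threshold."""
    return [(chunk, float(conf)) for chunk, conf in rows if float(conf) > threshold]

high_conf = confident(rows)
print(high_conf)
```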


### Entity Resolution Visualization

In [0]:
er_vis = nlp.viz.EntityResolverVisualizer()


## To set custom label colors:
er_vis.set_label_colors({'TREATMENT':'#800080', 'PROBLEM':'#77b5fe'}) #set label colors by specifying hex codes

vis = er_vis.display(resolver_res, 
                     label_col='ner_chunk', 
                     resolution_col = 'icd10_code',
                     document_col='document',
                     return_html=True)

displayHTML(vis)

**For more entity resolution model examples, see these comprehensive notebooks:**

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/24.Improved_Entity_Resolvers_in_SparkNLP_with_sBert.ipynb
https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/3.Clinical_Entity_Resolvers.ipynb

### End of Notebook