<img src="https://nlp.johnsnowlabs.com/assets/images/logo.png" width="180" height="50" style="float: left;">

# Clinical Entity Recognition with Assertion - version 2.3.4

## Example for Named Entity Recognition with Assertion Pipeline

A common NLP problem in biomedical applications is identifying the presence of clinical entities in a given text. These clinical entities could be diseases, symptoms, drugs, results of clinical investigations, or others.

But just identifying the presence of a clinical entity in unstructured content is not enough for most real-world applications. As clinical care is full of uncertainty, in practice many of the entities referred to in a medical record are not actually present in the patient. They may be mentioned as a working hypothesis, as a condition to be ruled out by a complementary test, or as a condition being prevented by an intervention (for instance, "patient was vaccinated against hepatitis B" does not imply that the patient is suffering from hepatitis B). In other cases a disease is mentioned in association with a relative of the patient (as in "Father with Alzheimer disease"), since family history is a risk factor in diseases with a genetic component.

To extract this information from the content, the Spark-NLP enterprise version includes an Assertion annotator that, based on a pretrained Machine Learning model, assigns to every identified entity a tag describing the nature of that entity in terms of certainty: "present", "absent", "hypothetical", "conditional", "associated_with_other_person", etc.

In this example we will use Spark-NLP to identify some entities present in a list of sentences, adding an assertion about their certainty.

### Step 1. Prepare the environment

#### Install OpenSource spark-nlp and pyspark pip packages
As a first step we import the required Python dependencies, including some sparknlp components.

Be sure that you have the required Python libraries (pyspark 2.4.4, spark-nlp 2.3.4) by running <code>pip list</code>. Check that the versions are correct.

If any of them is missing you can run:

<code>pip install --ignore-installed pyspark==2.4.4</code><br>
<code>pip install --ignore-installed spark-nlp==2.3.4</code><br>

The --ignore-installed parameter overwrites any previously installed version of the package.
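As an alternative to reading the <code>pip list</code> output by eye, the version check can be sketched in Python with the standard library's <code>importlib.metadata</code> (Python 3.8 or later). The helper name <code>find_version_problems</code> is an assumption made for this illustration and is not part of Spark-NLP.

```python
from importlib import metadata

# Versions this notebook expects (taken from the text above).
REQUIRED = {"pyspark": "2.4.4", "spark-nlp": "2.3.4"}


def find_version_problems(required):
    """Return {package: message} for every package that is missing or mismatched."""
    problems = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = f"not installed (expected {wanted})"
            continue
        if installed != wanted:
            problems[name] = f"found {installed} (expected {wanted})"
    return problems


# Nothing is printed when every package matches the expected version.
for name, message in find_version_problems(REQUIRED).items():
    print(f"{name}: {message}")
```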

<i>*If this cell fails, the required environment has not been set up properly. Please check the pre-requisites guideline at http://www.johnsnowlabs.com</i>

In [1]:
import sys, time, os
sys.path.append("/home/fernandrez/JSL/repos/spark-nlp/python")
sys.path.append("/home/fernandrez/JSL/repos/spark-nlp-internal/python")

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.util import *

from sparknlp.pretrained import ResourceDownloader

from pyspark.ml import Pipeline, PipelineModel

#### Install Licensed Spark-NLP package

We will also use some Spark-NLP enterprise functionality contained in the spark-nlp-jsl package.

You can check that spark-nlp-jsl is installed by running <code>pip list</code>. Check that the installed version is 2.3.4.

If it is not, you need to install it with:

<code>pip install spark-nlp-jsl==2.3.4 --extra-index-url #### --ignore-installed</code>

The #### is a secret code. If you have not received it, please contact us at info@johnsnowlabs.com.

<i>*If the next cell fails, your licensed enterprise version is not properly installed, so please check the pre-requisites guideline at http://www.johnsnowlabs.com/</i>

In [2]:
# If this fails, the pip module for the enterprise version has not been set up properly

from sparknlp_jsl.annotator import *

#### Setup credentials to private JohnSnowLabs models repository with AWS-CLI

Now it is time to configure Spark-NLP to access the private JohnSnowLabs models repository. This access is done via the Amazon AWS command line interface (AWS CLI).

Instructions on how to install the AWS CLI are available at:

https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html

Make sure you configure your credentials with <code>aws configure</code> following the instructions at:

https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

Please substitute the ACCESS_KEY and SECRET_KEY with the credentials you have received. If you need your credentials, contact us at info@johnsnowlabs.com.
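A quick way to confirm that <code>aws configure</code> stored both keys is to read the credentials file it writes. The following is a minimal sketch, assuming the default file location <code>~/.aws/credentials</code> and the <code>default</code> profile. The helper name <code>has_aws_credentials</code> is hypothetical, and credentials supplied through other means (such as environment variables) are not detected by it.

```python
import configparser
from pathlib import Path


def has_aws_credentials(credentials_file, profile="default"):
    """Return True if the profile defines both an access key and a secret key."""
    parser = configparser.ConfigParser()
    # read() silently skips files that do not exist, so a missing file yields False.
    parser.read(credentials_file)
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return bool(section.get("aws_access_key_id")) and bool(section.get("aws_secret_access_key"))


# Assumed default location of the file written by `aws configure`.
default_file = Path.home() / ".aws" / "credentials"
print("AWS credentials configured:", has_aws_credentials(default_file))
```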

#### Start Spark session

The following cell initializes the Spark session in case you have launched the Jupyter notebook directly. If you started the notebook using pyspark, this cell is simply ignored.

Initializing the Spark session takes some seconds (usually less than 1 minute), as the jar needs to be loaded from the server.

We will be using version 2.3.4 of Spark NLP Open Source and 2.3.4 of Spark NLP Enterprise Edition.

The #### in <code>.config("spark.jars", "####")</code> is a secret code. If you have not received it, please contact us at info@johnsnowlabs.com.

In [3]:
# This cell will be ignored if jupyter started using pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Global DEMO - Spark NLP Enterprise 2.3.4") \
    .master("local[*]") \
    .config("spark.driver.memory","4G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.3.4") \
    .config("spark.jars", "#####/spark-nlp-jsl-2.3.4.jar") \
    .getOrCreate()

### Step 2. Clinical NER Pipeline creation

In Spark-NLP, annotation happens through pipelines. Pipelines are made of various Annotator steps. In our case the architecture of the Clinical Named Entity Recognition pipeline with Assertion will be:

* DocumentAssembler (text -> document)
* SentenceDetector (document -> sentence)
* Tokenizer (sentence -> token)
* WordEmbeddingsModel ([sentence, token] -> embeddings)
* NerDLModel ([sentence, token, embeddings] -> ner)
* NerConverter ([sentence, token, ner] -> ner_chunk)
* AssertionDLModel ([sentence, ner_chunk, embeddings] -> assertion)

So from a text we end up with a list of Named Entities (patient problems, treatments and tests) along with their certainty assertion tags.
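The column wiring in the list above can be checked with a few lines of plain Python: every stage may only read columns that the raw data or an earlier stage has already produced. This is only an illustrative sketch of that rule. It does not use Spark-NLP, and the helper name <code>find_wiring_errors</code> is an assumption.

```python
# (stage name, input columns, output column), copied from the list above.
STAGES = [
    ("DocumentAssembler", ["text"], "document"),
    ("SentenceDetector", ["document"], "sentence"),
    ("Tokenizer", ["sentence"], "token"),
    ("WordEmbeddingsModel", ["sentence", "token"], "embeddings"),
    ("NerDLModel", ["sentence", "token", "embeddings"], "ner"),
    ("NerConverter", ["sentence", "token", "ner"], "ner_chunk"),
    ("Assertion model", ["sentence", "ner_chunk", "embeddings"], "assertion"),
]


def find_wiring_errors(stages, initial_columns):
    """Return a message for every stage that reads a column nobody has produced yet."""
    available = set(initial_columns)
    errors = []
    for name, inputs, output in stages:
        missing = [column for column in inputs if column not in available]
        if missing:
            errors.append(f"{name} is missing input column(s): {', '.join(missing)}")
        available.add(output)
    return errors


# An empty list means every stage finds its inputs.
print(find_wiring_errors(STAGES, ["text"]))
```

Reordering the stages, for example placing the Tokenizer before the SentenceDetector, makes the function report the missing "sentence" column, which is why the order of the stages in the pipeline matters.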

#### Step 2.1 Initialize all the annotators required by the pipeline

The first 3 annotators of the pipeline are "DocumentAssembler", "SentenceDetector" and "Tokenizer":

In [4]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP

documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line

sentenceDetector = SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP

tokenizer = Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

The fourth annotator in the pipeline is "WordEmbeddingsModel". We will download a pretrained model available from "clinical/models" named "embeddings_clinical".

Please be patient when running this cell.

The first time you call this pretrained model, it needs to be downloaded to your local machine, which takes a while.

The size is about 1.7 GB and the model is typically saved in your home folder as

    ~HOMEFOLDER/cached_models/embeddings_clinical_en_2.0.2_2.4_1558454742956.zip

On subsequent calls the model is loaded from your cached copy, but even then it needs to be indexed each time, so expect to wait up to 5 minutes (depending on your machine).
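To see which pretrained models are already cached, the cache folder can be listed with the standard library. This is a minimal sketch, assuming the cache lives in a <code>cached_models</code> folder directly under your home folder as described above. The helper name <code>list_cached_models</code> is hypothetical.

```python
from pathlib import Path


def list_cached_models(cache_dir):
    """Return (file name, size in MB) for every file in the cache folder, sorted by name."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    entries = []
    for path in sorted(cache_dir.iterdir()):
        if path.is_file():
            entries.append((path.name, round(path.stat().st_size / 1e6, 1)))
    return entries


# Assumed cache location; nothing is printed if the folder does not exist yet.
for name, size_mb in list_cached_models(Path.home() / "cached_models"):
    print(f"{name}: {size_mb} MB")
```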

In [5]:
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


The next annotator in our pipeline is the pretrained "ner_clinical" NerDLModel available from "clinical/models". It requires as input the "sentence", "token" and "embeddings" (clinical embeddings pretrained model) columns and will classify each token into four categories:
<ol>
    <li>PROBLEM: for patient problems</li>
    <li>TEST: for tests, labs, etc.</li>
    <li>TREATMENT: for treatments, medicines, etc.</li>
    <li>OTHER: for the rest of tokens.</li>
</ol>

To separate consecutive entities, the B prefix (as in B-PROBLEM) marks the first token of each entity, and the I prefix (as in I-PROBLEM) marks the remaining tokens inside the entity.
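The effect of the B and I prefixes can be illustrated with a small pure-Python function that groups tagged tokens into entities. This is a sketch of the general BIO convention, not the implementation used by Spark-NLP. The function name <code>bio_to_chunks</code>, the "O" tag for tokens outside any entity, and the hand-written tags in the usage example are all assumptions.

```python
def bio_to_chunks(tokens, tags):
    """Group BIO-tagged tokens into (entity text, label) chunks."""
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        # A B tag always opens a new entity; an I tag with a different label does too.
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        if prefix not in ("B", "I") or starts_new:
            # Close the entity in progress, if any.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if prefix in ("B", "I"):
            current_tokens.append(token)
            current_label = label
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Hand-written tags for illustration, not actual model output.
tokens = ["Patient", "shows", "no", "stomach", "pain"]
tags = ["O", "O", "O", "B-PROBLEM", "I-PROBLEM"]
print(bio_to_chunks(tokens, tags))
```

Because every B tag closes the entity in progress, two adjacent entities with the same label are still returned as two separate chunks.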

In [6]:
# Named Entity Recognition for clinical concepts. Includes #Problems #Diagnostics

#switch to ner_clinical instead of _noncontrib for better performance, if you are in Linux or MAC
clinical_ner = NerDLModel.pretrained("ner_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

ner_clinical download started this may take some time.
Approximate size to download 13.8 MB
[OK!]


The Assertion annotator requires as an input the NER entities in a chunked format so we need the NerConverter annotator to generate that "ner_chunk" column in the Spark dataframe.

In [7]:
# Named Entity Recognition concepts parser, transforms entities into CHUNKS (required for next step: assertion status)

ner_converter = NerConverter() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")

Finally the pretrained AssertionDLModel named "assertion_dl" is included. It will classify each named entity into its assertion type: "present", "absent", "hypothetical", "conditional", "associated_with_other_person", etc.

In [8]:
# Assertion Status: determines whether a particular subject has a condition or not, and labels the condition by status

assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
  .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
  .setOutputCol("assertion")

assertion_dl download started this may take some time.
Approximate size to download 1.3 MB
[OK!]



#### Step 2.2 Define the NER pipeline

Now we will define the actual pipeline that puts together the annotators we have created.

In [17]:
# Build up the pipeline

pipeline = Pipeline(
    stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    assertion
  ])

### Step 3 Create a SparkDataFrame with the content

Now we will create a sample Spark dataframe with some sentences. In production environments a table with many such sentences could be distributed across a cluster and processed at large scale.

In [18]:
# A small sample dataframe with one sentence per row

data = spark.createDataFrame([
  ["Patient with severe feber and sore throat"],
  ["Patient shows no stomach pain"],
  ["She was maintained on an epidural and PCA for pain control."],
  ["He also became short of breath with climbing a flight of stairs."],
  ["Lung tumour located at the right lower lobe"],
  ["Father with Alzheimer."]
]).toDF("text")

data.show(truncate=False)

+----------------------------------------------------------------+
|text                                                            |
+----------------------------------------------------------------+
|Patient with severe feber and sore throat                       |
|Patient shows no stomach pain                                   |
|She was maintained on an epidural and PCA for pain control.     |
|He also became short of breath with climbing a flight of stairs.|
|Lung tumour located at the right lower lobe                     |
|Father with Alzheimer.                                          |
+----------------------------------------------------------------+



### Step 4 Create a model fitting the NER pipeline with the clinical note

Now we can use the pipeline and the sentences to generate a model.

In [19]:
# Fit the pipeline to obtain a model; this would train any annotator that requires it (none does here)

model = pipeline.fit(data)

### Step 5 Transform/annotate the sentences using the model.

In order to process the data with the newly created model we apply a transformation.

This will save in a Spark DataFrame (output) the results of running the model over the clinical note.


In [20]:
output = model.transform(data)

Let's print a column with the Named Entities chunked and a column with the assertion classification assigned by the model.

We see, for example, that in the sentence "Patient shows no stomach pain", the symptom "stomach pain" has been identified and correctly asserted as "absent".

In the case of "She was maintained on an epidural and PCA for pain control." the entity "pain control" has been identified and asserted as "hypothetical". In this case it is not completely certain that the PCA effectively controlled the pain, so the entity is marked as a hypothesis. However, the presence of an epidural procedure and a PCA is considered certain, and both are asserted as "present".

In the case of "Father with Alzheimer" the Assertion annotator is able to identify that this condition is associated not with the patient, but with a relative.
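Downstream code usually wants each entity next to its assertion label, for instance to keep only the entities that are not confirmed as present. The sketch below shows that pairing on plain Python lists. It assumes the assertion column holds one label per entity chunk in the same order, which should be verified against the actual output. The function names are hypothetical, and the sample values are copied by hand from the discussion above (with approximate chunk text) rather than produced by running the model.

```python
def pair_entities(chunks, assertions):
    """Pair each entity chunk with the assertion label at the same position."""
    if len(chunks) != len(assertions):
        raise ValueError("expected one assertion per chunk")
    return list(zip(chunks, assertions))


def not_confirmed(pairs, confirmed_label="present"):
    """Keep only entities whose assertion differs from `confirmed_label`."""
    return [(chunk, label) for chunk, label in pairs if label != confirmed_label]


# Hand-copied from the results discussed above, not produced by running the model.
pairs = pair_entities(
    ["stomach pain", "epidural", "PCA", "pain control"],
    ["absent", "present", "present", "hypothetical"],
)
print(not_confirmed(pairs))
```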

In [23]:
output.select("ner_chunk", "assertion").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk                                                                                                                                                                                                                                          |assertion                                                                                                                                                                                    |
+-------------------------------------------------------------------------------------------------------------------------------