![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Healthcare_NLP/MedicalNerModel.ipynb)

# MedicalNerModel

In this notebook, we will examine the `MedicalNerModel` annotator.

This Named Entity Recognition (NER) annotator is a general-purpose NER model based on neural networks. Its architecture is Char CNNs - BiLSTM - CRF, which achieves state-of-the-art results on most datasets. <br/>

In the original framework, a CNN extracts a fixed-length feature vector from character-level features. For each word, these feature vectors are concatenated and fed to the BLSTM network and then to the output layers. The authors employed a stacked bidirectional recurrent neural network with long short-term memory units to transform word features into named entity tag scores. The extracted features of each word are fed into a forward LSTM network and a backward LSTM network. The output of each network at each time step is decoded by a linear layer and a log-softmax layer into log-probabilities for each tag category. These two vectors are then simply added together to produce the final output. In the architecture proposed in the original paper, 50-dimensional pretrained word embeddings are used for word features, 25-dimensional character embeddings are used for character features, and capitalization features (allCaps, upperInitial, lowercase, mixedCaps, noinfo) are used for case features. <br/>
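
The two mechanical pieces of this description, the capitalization feature and the addition of forward and backward log-probabilities, can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `case_feature`, `log_softmax` and `combine_directions` are hypothetical, the exact rules for assigning a case category are a guess based on the category names, and the raw scores are made-up numbers rather than outputs of a trained network.

```python
import math

def case_feature(token: str) -> str:
    """Map a token to one of the five capitalization categories (assumed rules)."""
    letters = [ch for ch in token if ch.isalpha()]
    if not letters:
        return "noinfo"        # no alphabetic characters, e.g. "325" or "."
    if all(ch.isupper() for ch in letters):
        return "allCaps"
    if all(ch.islower() for ch in letters):
        return "lowercase"
    if token[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "upperInitial"
    return "mixedCaps"

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw tag scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def combine_directions(forward_scores, backward_scores):
    """Add the per-tag log-probabilities of the forward and backward networks."""
    fwd = log_softmax(forward_scores)
    bwd = log_softmax(backward_scores)
    return [f + b for f, b in zip(fwd, bwd)]

# Tokens taken from the sample clinical note used in this notebook.
for tok in ["LAD", "Aspirin", "pain", "mid-LAD", "325"]:
    print(tok, case_feature(tok))

# Made-up linear-layer outputs for three tag categories at one time step.
print(combine_directions([2.0, 0.5, 0.1], [1.5, 0.2, 0.3]))
```

Because the two vectors are added in log space, a tag scores highly only when both directions agree on it.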


**📖 Learning Objectives:**

1. Understand how to detect Named Entities by using pre-trained models.

2. Become comfortable using the different parameters of the annotator.

**🔗 Helpful Links:**

- For extended examples of usage, see the [Spark NLP Workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/1.Clinical_Named_Entity_Recognition_Model.ipynb)

- Python Documentation: [MedicalNerModel](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/ner/medical_ner/index.html#sparknlp_jsl.annotator.ner.medical_ner.MedicalNerModel)

- Scala Documentation: [MedicalNerModel](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/ner/MedicalNerModel.html)


**Blogposts and videos:**

- [Named Entity Recognition (NER) with BERT in Spark NLP](https://towardsdatascience.com/named-entity-recognition-ner-with-bert-in-spark-nlp-874df20d1d77)

- [State of the art Clinical Named Entity Recognition in Spark NLP - Youtube](https://www.youtube.com/watch?v=YM-e4eOiQ34)

- [Named Entity Recognition for Healthcare with SparkNLP NerDL and NerCRF](https://medium.com/spark-nlp/named-entity-recognition-for-healthcare-with-sparknlp-nerdl-and-nercrf-a7751b6ad571)

- [Named Entity Recognition for Clinical Text](https://medium.com/atlas-research/ner-for-clinical-text-7c73caddd180)

## **📜 Background**


This annotator extracts entities with a Char CNNs - BiLSTM - CRF neural network architecture. The hybrid bidirectional LSTM and CNN design automatically detects word-level and character-level features, eliminating the need for most feature engineering steps.

## **🎬 Colab Setup**

In [None]:
# Install the johnsnowlabs library to access Spark-NLP for Healthcare
! pip install -q johnsnowlabs


In [None]:
from google.colab import files
print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_8734_532.json to spark_nlp_for_healthcare_spark_ocr_8734_532.json


In [None]:
from johnsnowlabs import nlp, medical

# After uploading your license, run this to install all licensed Python wheels and pre-download the JARs for the Spark session JVM
nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.3.2-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.3.2-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.3.2.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.3.2.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
Installing /root/.johnsnowlabs/py_installs/spark_nlp_jsl-5.3.2-py3-none-any.whl to /usr/bin/python3
Installed 1 products:
💊 Spark-Healthcare==5.3.2 installed! ✅ Heal the planet with NLP! 


In [None]:
import pyspark.sql.functions as F

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8734_532.json
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, running on ⚡ PySpark==3.4.0


In [None]:
spark

## **🖨️ Input/Output Annotation Types**
- Input: `DOCUMENT, TOKEN, WORD_EMBEDDINGS`
- Output: `NAMED_ENTITY`

## **🔎 Parameters**


- `IncludeConfidence` *(Boolean)*: Whether to include confidence scores in annotation metadata (Default: False).

- `IncludeAllConfidenceScores` *(Boolean)*: Whether to include all confidence scores in annotation metadata or just the score of the predicted tag (Default: False).

- `LabelCasing` *(String)*: Converts all labels produced by the NER model to upper or lower case. Values: `upper` | `lower`.

- `sentenceTokenIndex` *(Boolean)*: Whether to include the token index for each sentence in annotation metadata, by default False. If the value is True, the process might be slowed down.

- `doExceptionHandling` *(Boolean)*: If True, the effective batch size is 1 and exceptions are handled.

### `setIncludeConfidence()`

This parameter is used to decide whether to include confidence scores in annotation metadata.

First, we will define a NER pipeline with the `MedicalNerModel` annotator and the other required stages. Then, we will look at the results on a sample text.

Creating a dataframe with example text:

In [None]:
text = """Mr. ABC is a 60-year-old gentleman who had a markedly abnormal stress test earlier today in my office with severe chest pain after 5 minutes of exercise on the standard Bruce with horizontal ST depressions and moderate apical ischemia on stress imaging only. He required 3 sublingual nitroglycerin in total (please see also admission history and physical for full details).
The patient underwent cardiac catheterization with myself today which showed mild-to-moderate left main distal disease of 30%, moderate proximal LAD with a severe mid-LAD lesion of 99%, and a mid-left circumflex lesion of 80% with normal LV function and some mild luminal irregularities in the right coronary artery with some moderate stenosis seen in the mid to distal right PDA.
I discussed these results with the patient, and he had been relating to me that he was having rest anginal symptoms, as well as nocturnal anginal symptoms, and especially given the severity of the mid left anterior descending lesion, with a markedly abnormal stress test, I felt he was best suited for transfer for PCI. I discussed the case with Dr. X at Medical Center who has kindly accepted the patient in transfer.
CONDITION ON TRANSFER: Stable but guarded. The patient is pain-free at this time.

MEDICATIONS ON TRANSFER:
1. Aspirin 325 mg once a day.
2. Metoprolol 50 mg once a day, but we have had to hold it because of relative bradycardia which he apparently has a history of.
3. Nexium 40 mg once a day.
4. Zocor 40 mg once a day, and there is a fasting lipid profile pending at the time of this dictation. I see that his LDL was 136 on May 3, 2002.
5. Plavix 600 mg p.o. x1 which I am giving him tonight."""

df = spark.createDataFrame([[text]]).toDF("text")

NER pipeline with `MedicalNerModel()`:

In [None]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
        .setInputCols(["sentence","token"])\
        .setOutputCol("embeddings")

# NER model trained on i2b2 (sampled from MIMIC) dataset
clinical_ner = medical.NerModel.pretrained("ner_clinical_large","en","clinical/models")\
        .setInputCols(["sentence","token","embeddings"])\
        .setOutputCol("ner")\
        .setIncludeConfidence(False)

ner_converter = medical.NerConverterInternal()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter])


empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical_large download started this may take some time.
[OK!]


We've created a pipeline and fit it on an empty dataframe. Now, we will transform the sample data with the fitted pipeline model and check the results.

In [None]:
result= model.transform(df)
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Mr. ABC is a 60-y...|[{document, 0, 16...|[{document, 0, 25...|[{token, 0, 1, Mr...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 43, 73, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [None]:
result.printSchema()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true

Next, we will check the NER results.

In [None]:
result.select(F.explode(F.arrays_zip(result.token.result,
                                     result.ner.result,
                                     result.ner.begin,
                                     result.ner.end,
                                     result.ner.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("token"),
              F.expr("cols['2']").alias("begin"),
              F.expr("cols['3']").alias("end"),
              F.expr("cols['1']").alias("ner_label")).show(50, truncate=False)

+-------------+-----+---+-----------+
|token        |begin|end|ner_label  |
+-------------+-----+---+-----------+
|Mr           |0    |1  |O          |
|.            |2    |2  |O          |
|ABC          |4    |6  |O          |
|is           |8    |9  |O          |
|a            |11   |11 |O          |
|60-year-old  |13   |23 |O          |
|gentleman    |25   |33 |O          |
|who          |35   |37 |O          |
|had          |39   |41 |O          |
|a            |43   |43 |B-PROBLEM  |
|markedly     |45   |52 |I-PROBLEM  |
|abnormal     |54   |61 |I-PROBLEM  |
|stress       |63   |68 |I-PROBLEM  |
|test         |70   |73 |I-PROBLEM  |
|earlier      |75   |81 |O          |
|today        |83   |87 |O          |
|in           |89   |90 |O          |
|my           |92   |93 |O          |
|office       |95   |100|O          |
|with         |102  |105|O          |
|severe       |107  |112|B-PROBLEM  |
|chest        |114  |118|I-PROBLEM  |
|pain         |120  |123|I-PROBLEM  |
|after      
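
The `B-`/`I-`/`O` tags above follow the IOB scheme: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. The sketch below shows, in plain Python, how such token-level tags can be grouped into chunks similar to those in the `ner_chunk` column. It is not the implementation of `NerConverterInternal`; the helper name `merge_iob` is hypothetical, and joining tokens with a single space is a simplifying assumption.

```python
def merge_iob(rows):
    """Group (token, begin, end, tag) rows into (text, begin, end, entity) chunks."""
    chunks = []
    current = None  # chunk being built: [tokens, begin, end, entity]
    for token, begin, end, tag in rows:
        prefix, _, entity = tag.partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current[3] != entity)
        )
        if starts_new:
            if current:
                chunks.append(current)
            current = [[token], begin, end, entity]
        elif prefix == "I":
            current[0].append(token)   # extend the open chunk
            current[2] = end
        else:                          # an "O" tag closes any open chunk
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(" ".join(tokens), b, e, ent) for tokens, b, e, ent in chunks]

# Rows copied from the table above.
rows = [
    ("had", 39, 41, "O"),
    ("a", 43, 43, "B-PROBLEM"),
    ("markedly", 45, 52, "I-PROBLEM"),
    ("abnormal", 54, 61, "I-PROBLEM"),
    ("stress", 63, 68, "I-PROBLEM"),
    ("test", 70, 73, "I-PROBLEM"),
    ("earlier", 75, 81, "O"),
    ("with", 102, 105, "O"),
    ("severe", 107, 112, "B-PROBLEM"),
    ("chest", 114, 118, "I-PROBLEM"),
    ("pain", 120, 123, "I-PROBLEM"),
]

for chunk in merge_iob(rows):
    print(chunk)
```

The first chunk spans characters 43 to 73, which agrees with the first entry of the `ner_chunk` column shown in the `result.show()` output.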

Checking the confidence scores under the metadata.

In [None]:
result.select(F.explode(F.arrays_zip(result.token.result,
                                     result.ner.result,
                                     result.ner.begin,
                                     result.ner.end,
                                     result.ner.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("token"),
              F.expr("cols['2']").alias("begin"),
              F.expr("cols['3']").alias("end"),
              F.expr("cols['1']").alias("ner_label"),
              F.expr("cols['4']['confidence']").alias("confidence")).show(50, truncate=False)

+-------------+-----+---+-----------+----------+
|token        |begin|end|ner_label  |confidence|
+-------------+-----+---+-----------+----------+
|Mr           |0    |1  |O          |null      |
|.            |2    |2  |O          |null      |
|ABC          |4    |6  |O          |null      |
|is           |8    |9  |O          |null      |
|a            |11   |11 |O          |null      |
|60-year-old  |13   |23 |O          |null      |
|gentleman    |25   |33 |O          |null      |
|who          |35   |37 |O          |null      |
|had          |39   |41 |O          |null      |
|a            |43   |43 |B-PROBLEM  |null      |
|markedly     |45   |52 |I-PROBLEM  |null      |
|abnormal     |54   |61 |I-PROBLEM  |null      |
|stress       |63   |68 |I-PROBLEM  |null      |
|test         |70   |73 |I-PROBLEM  |null      |
|earlier      |75   |81 |O          |null      |
|today        |83   |87 |O          |null      |
|in           |89   |90 |O          |null      |
|my           |92   

As seen above, there are no confidence scores in the metadata. <br/>

Now, let's set `.setIncludeConfidence(True)` and fit/transform the pipeline, then see the difference.

In [None]:
clinical_ner = medical.NerModel.pretrained("ner_clinical_large","en","clinical/models")\
        .setInputCols(["sentence","token","embeddings"])\
        .setOutputCol("ner")\
        .setIncludeConfidence(True)


nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter])


empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

result= model.transform(df)

ner_clinical_large download started this may take some time.
[OK!]


Checking the result

In [None]:
result.select(F.explode(F.arrays_zip(result.token.result,
                                     result.ner.result,
                                     result.ner.begin,
                                     result.ner.end,
                                     result.ner.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("token"),
              F.expr("cols['2']").alias("begin"),
              F.expr("cols['3']").alias("end"),
              F.expr("cols['1']").alias("ner_label"),
              F.expr("cols['4']['confidence']").alias("confidence")).show(50, truncate=False)

+-------------+-----+---+-----------+----------+
|token        |begin|end|ner_label  |confidence|
+-------------+-----+---+-----------+----------+
|Mr           |0    |1  |O          |0.9988    |
|.            |2    |2  |O          |0.9995    |
|ABC          |4    |6  |O          |0.7884    |
|is           |8    |9  |O          |0.9999    |
|a            |11   |11 |O          |0.999     |
|60-year-old  |13   |23 |O          |0.9947    |
|gentleman    |25   |33 |O          |0.987     |
|who          |35   |37 |O          |1.0       |
|had          |39   |41 |O          |0.9997    |
|a            |43   |43 |B-PROBLEM  |0.7255    |
|markedly     |45   |52 |I-PROBLEM  |0.6551    |
|abnormal     |54   |61 |I-PROBLEM  |0.8344    |
|stress       |63   |68 |I-PROBLEM  |0.6409    |
|test         |70   |73 |I-PROBLEM  |0.508     |
|earlier      |75   |81 |O          |0.9998    |
|today        |83   |87 |O          |0.9825    |
|in           |89   |90 |O          |0.9995    |
|my           |92   

After setting `.setIncludeConfidence(True)`, we are able to see the confidence scores.
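
One common use of these scores is to flag low-confidence predictions for review. The following is a minimal sketch in plain Python, using rows copied from the table above. The helper names `low_confidence` and `mean_confidence` and the threshold of 0.7 are hypothetical. The scores are written as strings because the schema printed earlier declares metadata values as strings, so they need a cast to `float` before comparison.

```python
# (token, label, metadata) rows copied from the output above.
rows = [
    ("had", "O", {"confidence": "0.9997"}),
    ("a", "B-PROBLEM", {"confidence": "0.7255"}),
    ("markedly", "I-PROBLEM", {"confidence": "0.6551"}),
    ("abnormal", "I-PROBLEM", {"confidence": "0.8344"}),
    ("stress", "I-PROBLEM", {"confidence": "0.6409"}),
    ("test", "I-PROBLEM", {"confidence": "0.508"}),
    ("earlier", "O", {"confidence": "0.9998"}),
]

def low_confidence(rows, threshold):
    """Return entity tokens whose predicted-label confidence is below threshold."""
    flagged = []
    for token, label, metadata in rows:
        score = float(metadata["confidence"])  # metadata values are strings
        if label != "O" and score < threshold:
            flagged.append((token, label, score))
    return flagged

def mean_confidence(rows):
    """Average confidence over the entity tokens (label other than 'O')."""
    scores = [float(m["confidence"]) for _, label, m in rows if label != "O"]
    return sum(scores) / len(scores) if scores else None

print(low_confidence(rows, threshold=0.7))
print(mean_confidence(rows))
```

A token-level average like this is only a rough proxy for how reliable the whole chunk is.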

### `setIncludeAllConfidenceScores()`

This parameter controls whether the annotation metadata contains the confidence scores of all labels or only the score of the predicted label.

Let's set `setIncludeAllConfidenceScores(False)` and see the results.

In [None]:
clinical_ner = medical.NerModel.pretrained("ner_clinical_large","en","clinical/models")\
        .setInputCols(["sentence","token","embeddings"])\
        .setOutputCol("ner")\
        .setIncludeAllConfidenceScores(False)


nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter])


empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

result= model.transform(df)

ner_clinical_large download started this may take some time.
[OK!]


In [None]:
result.select("ner").show(5, truncate=False)


In [None]:
result.select(F.explode(F.arrays_zip(result.token.result,
                                     result.ner.result,
                                     result.ner.begin,
                                     result.ner.end,
                                     result.ner.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("token"),
              F.expr("cols['2']").alias("begin"),
              F.expr("cols['3']").alias("end"),
              F.expr("cols['1']").alias("ner_label"),
              F.expr("cols['4']['confidence']").alias("confidence")).show(50, truncate=False)

+-------------+-----+---+-----------+----------+
|token        |begin|end|ner_label  |confidence|
+-------------+-----+---+-----------+----------+
|Mr           |0    |1  |O          |0.9988    |
|.            |2    |2  |O          |0.9995    |
|ABC          |4    |6  |O          |0.7884    |
|is           |8    |9  |O          |0.9999    |
|a            |11   |11 |O          |0.999     |
|60-year-old  |13   |23 |O          |0.9947    |
|gentleman    |25   |33 |O          |0.987     |
|who          |35   |37 |O          |1.0       |
|had          |39   |41 |O          |0.9997    |
|a            |43   |43 |B-PROBLEM  |0.7255    |
|markedly     |45   |52 |I-PROBLEM  |0.6551    |
|abnormal     |54   |61 |I-PROBLEM  |0.8344    |
|stress       |63   |68 |I-PROBLEM  |0.6409    |
|test         |70   |73 |I-PROBLEM  |0.508     |
|earlier      |75   |81 |O          |0.9998    |
|today        |83   |87 |O          |0.9825    |
|in           |89   |90 |O          |0.9995    |
|my           |92   

Now, we will set `setIncludeAllConfidenceScores(True)` and see the difference.

In [None]:
clinical_ner = medical.NerModel.pretrained("ner_clinical_large","en","clinical/models")\
        .setInputCols(["sentence","token","embeddings"])\
        .setOutputCol("ner")\
        .setIncludeAllConfidenceScores(True)


nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter])


empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

result= model.transform(df)

ner_clinical_large download started this may take some time.
[OK!]


Checking the results

In [None]:
result.select("ner").show(5, truncate=False)


In [None]:
result.select(F.explode(F.arrays_zip(result.token.result,
                                     result.ner.result,
                                     result.ner.begin,
                                     result.ner.end,
                                     result.ner.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("TOKEN"),
              F.expr("cols['1']").alias("LABEL"),
              F.expr("cols['2']").alias("BEGIN"),
              F.expr("cols['3']").alias("END"),
              F.expr("cols['4']['B-TREATMENT']").alias("B-TREATMENT"),
              F.expr("cols['4']['I-TREATMENT']").alias("I-TREATMENT"),
              F.expr("cols['4']['B-PROBLEM']").alias("B-PROBLEM"),
              F.expr("cols['4']['I-PROBLEM']").alias("I-PROBLEM"),
              F.expr("cols['4']['B-TEST']").alias("B-TEST"),
              F.expr("cols['4']['I-TEST']").alias("I-TEST"),
              F.expr("cols['4']['O']").alias("O")).show(50, truncate=False)

+-------------+-----------+-----+---+-----------+-----------+---------+---------+------+------+------+
|TOKEN        |LABEL      |BEGIN|END|B-TREATMENT|I-TREATMENT|B-PROBLEM|I-PROBLEM|B-TEST|I-TEST|O     |
+-------------+-----------+-----+---+-----------+-----------+---------+---------+------+------+------+
|Mr           |O          |0    |1  |1.0E-4     |1.0E-4     |7.0E-4   |3.0E-4   |1.0E-4|0.0   |0.9988|
|.            |O          |2    |2  |0.0        |1.0E-4     |0.0      |1.0E-4   |0.0   |3.0E-4|0.9995|
|ABC          |O          |4    |6  |0.0012     |0.0407     |0.0035   |0.0657   |0.001 |0.0996|0.7884|
|is           |O          |8    |9  |0.0        |0.0        |0.0      |0.0      |0.0   |0.0   |0.9999|
|a            |O          |11   |11 |0.0        |2.0E-4     |1.0E-4   |4.0E-4   |0.0   |2.0E-4|0.999 |
|60-year-old  |O          |13   |23 |0.0        |0.0012     |0.0      |0.0032   |0.0   |8.0E-4|0.9947|
|gentleman    |O          |25   |33 |0.0        |0.0032     |1.0E-4   |0.

As seen above, we are able to see the confidence scores for each label after setting `setIncludeAllConfidenceScores(True)`.
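
With every label score available, it becomes possible to look at the runner-up label and at the margin between the top two candidates. The sketch below does this in plain Python for the `ABC` row of the table above. The helper name `rank_labels` is hypothetical, and the metadata is modelled as a dict of strings on the assumption that the map may also hold keys that are not labels, which is why the label list is passed explicitly.

```python
# Label scores for the token "ABC", copied from the table above.
abc_metadata = {
    "B-TREATMENT": "0.0012",
    "I-TREATMENT": "0.0407",
    "B-PROBLEM": "0.0035",
    "I-PROBLEM": "0.0657",
    "B-TEST": "0.001",
    "I-TEST": "0.0996",
    "O": "0.7884",
}

LABELS = ["B-TREATMENT", "I-TREATMENT", "B-PROBLEM", "I-PROBLEM",
          "B-TEST", "I-TEST", "O"]

def rank_labels(metadata, labels):
    """Return (label, score) pairs sorted from most to least likely."""
    scored = [(label, float(metadata[label])) for label in labels if label in metadata]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = rank_labels(abc_metadata, LABELS)
best, runner_up = ranked[0], ranked[1]
print("predicted:", best)
print("runner-up:", runner_up)
print("margin:", best[1] - runner_up[1])
```

A small margin between the top two labels can point to tokens where the model is undecided, even when the predicted label itself looks plausible.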

### `setLabelCasing()`

This parameter converts all labels produced by the NER model to upper or lower case. It accepts two values: `upper` or `lower`.

Firstly, we will create `MedicalNerModel()` with `setLabelCasing("upper")`.

In [None]:
clinical_ner = medical.NerModel.pretrained("ner_clinical_large","en","clinical/models")\
        .setInputCols(["sentence","token","embeddings"])\
        .setOutputCol("ner")\
        .setLabelCasing("upper")


nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter])


empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

result= model.transform(df)

ner_clinical_large download started this may take some time.
[OK!]


In [None]:
result.select(F.explode(F.arrays_zip(result.token.result,
                                     result.ner.result,
                                     result.ner.begin,
                                     result.ner.end,
                                     result.ner.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("token"),
              F.expr("cols['2']").alias("begin"),
              F.expr("cols['3']").alias("end"),
              F.expr("cols['1']").alias("ner_label")).show(50, truncate=False)

+-------------+-----+---+-----------+
|token        |begin|end|ner_label  |
+-------------+-----+---+-----------+
|Mr           |0    |1  |O          |
|.            |2    |2  |O          |
|ABC          |4    |6  |O          |
|is           |8    |9  |O          |
|a            |11   |11 |O          |
|60-year-old  |13   |23 |O          |
|gentleman    |25   |33 |O          |
|who          |35   |37 |O          |
|had          |39   |41 |O          |
|a            |43   |43 |B-PROBLEM  |
|markedly     |45   |52 |I-PROBLEM  |
|abnormal     |54   |61 |I-PROBLEM  |
|stress       |63   |68 |I-PROBLEM  |
|test         |70   |73 |I-PROBLEM  |
|earlier      |75   |81 |O          |
|today        |83   |87 |O          |
|in           |89   |90 |O          |
|my           |92   |93 |O          |
|office       |95   |100|O          |
|with         |102  |105|O          |
|severe       |107  |112|B-PROBLEM  |
|chest        |114  |118|I-PROBLEM  |
|pain         |120  |123|I-PROBLEM  |
|after      

Now, let's create `MedicalNerModel()` with `setLabelCasing("lower")` and see the difference.

In [None]:
clinical_ner = medical.NerModel.pretrained("ner_clinical_large","en","clinical/models")\
        .setInputCols(["sentence","token","embeddings"])\
        .setOutputCol("ner")\
        .setLabelCasing("lower")


nlpPipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter])


empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

result= model.transform(df)

ner_clinical_large download started this may take some time.
[OK!]


In [None]:
result.select(F.explode(F.arrays_zip(result.token.result,
                                     result.ner.result,
                                     result.ner.begin,
                                     result.ner.end,
                                     result.ner.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("token"),
              F.expr("cols['2']").alias("begin"),
              F.expr("cols['3']").alias("end"),
              F.expr("cols['1']").alias("ner_label")).show(50, truncate=False)

+-------------+-----+---+-----------+
|token        |begin|end|ner_label  |
+-------------+-----+---+-----------+
|Mr           |0    |1  |O          |
|.            |2    |2  |O          |
|ABC          |4    |6  |O          |
|is           |8    |9  |O          |
|a            |11   |11 |O          |
|60-year-old  |13   |23 |O          |
|gentleman    |25   |33 |O          |
|who          |35   |37 |O          |
|had          |39   |41 |O          |
|a            |43   |43 |B-problem  |
|markedly     |45   |52 |I-problem  |
|abnormal     |54   |61 |I-problem  |
|stress       |63   |68 |I-problem  |
|test         |70   |73 |I-problem  |
|earlier      |75   |81 |O          |
|today        |83   |87 |O          |
|in           |89   |90 |O          |
|my           |92   |93 |O          |
|office       |95   |100|O          |
|with         |102  |105|O          |
|severe       |107  |112|B-problem  |
|chest        |114  |118|I-problem  |
|pain         |120  |123|I-problem  |
|after      

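As seen above, the entity part of each tag is now lowercase (`B-problem`, `I-problem`), while the `B-`/`I-` prefixes and the `O` tag are unchanged.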
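
The casing behavior visible in the two outputs can be mimicked with a few lines of plain Python. This is a sketch that reproduces what the tables show, not the annotator's actual implementation; the function name `apply_label_casing` is hypothetical.

```python
def apply_label_casing(tag: str, casing: str) -> str:
    """Change the case of the entity part of an IOB tag, keeping the prefix."""
    if casing not in ("upper", "lower"):
        raise ValueError("casing must be 'upper' or 'lower'")
    prefix, sep, entity = tag.partition("-")
    if not sep:
        return tag  # tags without a prefix, such as "O", are left untouched
    entity = entity.upper() if casing == "upper" else entity.lower()
    return prefix + sep + entity

for tag in ["B-PROBLEM", "I-PROBLEM", "O"]:
    print(tag, "->", apply_label_casing(tag, "lower"))
```

If downstream code filters entities by label, it helps to pick one casing and use it consistently, since `PROBLEM` and `problem` would otherwise be compared as different strings.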