<a href="https://colab.research.google.com/github/AlfredIsair/Natural-Language-Processing-Projects/blob/main/Clinical-NER-Named-Entity-Recognition/Clinical_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Named entity recognition (NER) is one of the most important building blocks of NLP in the medical domain. It extracts meaningful chunks from clinical notes and reports, which are then fed to downstream tasks such as assertion status detection, entity resolution, relation extraction, and de-identification.

Healthcare providers can use this version of NER to analyze clinical notes, extract keywords, and assign them to specific entities, such as PROBLEM, TEST, or TREATMENT.

We use the `ZeroShotNerModel` (`zero_shot_ner_roberta`), which extracts entities by querying a RoBERTa question-answering model with appropriately crafted prompts. It can detect arbitrary named entities without any annotated dataset for training a model.
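
To make the idea concrete, here is a minimal, library-free sketch of prompt-based zero-shot NER. It is not the `ZeroShotNerModel` implementation: `toy_qa` is a hypothetical stand-in for an extractive question-answering model (here just a keyword lookup with made-up scores), and `zero_shot_entities` and its `threshold` argument are illustrative names. It only shows the control flow: ask every question for every label, then keep the best-scoring answer per span above a confidence threshold.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (answer_text, begin, end_inclusive, confidence), or None when there is no answer
Answer = Optional[Tuple[str, int, int, float]]


def toy_qa(question: str, context: str) -> Answer:
    """Hypothetical stand-in for an extractive QA model (keyword lookup only)."""
    lexicon = {"drug": ("Majezik", 0.65), "old": ("27 years old", 0.69)}
    for keyword, (answer, score) in lexicon.items():
        if keyword in question.lower() and answer in context:
            begin = context.index(answer)
            return answer, begin, begin + len(answer) - 1, score
    return None


def zero_shot_entities(
    text: str,
    definitions: Dict[str, List[str]],
    qa: Callable[[str, str], Answer],
    threshold: float = 0.1,
) -> List[dict]:
    """Ask every question for every label; keep the best-scoring answer per span."""
    best: Dict[Tuple[int, int], dict] = {}
    for label, questions in definitions.items():
        for question in questions:
            answer = qa(question, text)
            if answer is None:
                continue
            chunk, begin, end, score = answer
            if score < threshold:
                continue  # below the confidence cutoff
            current = best.get((begin, end))
            if current is None or score > current["confidence"]:
                best[(begin, end)] = {
                    "chunk": chunk, "begin": begin, "end": end,
                    "entity": label, "confidence": score, "question": question,
                }
    return sorted(best.values(), key=lambda e: e["begin"])


definitions = {
    "DRUG": ["Which drug?", "What is the drug?"],
    "PATIENT_AGE": ["How old is the patient?"],
}
# Sample sentence copied verbatim from this notebook (including its spelling).
entities = zero_shot_entities(
    "The doctor pescribed Majezik for my severe headache.", definitions, toy_qa
)
for entity in entities:
    print(entity["entity"], entity["chunk"], entity["begin"], entity["end"])
```

In a real model the scores come from the QA network rather than a lookup table, so the questions you write directly influence which spans are found and how confident the model is.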

## Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install -q johnsnowlabs==5.1.0


In [None]:
from google.colab import files

print('Please Upload your John Snow Labs License using the button below')
license_keys = files.upload()

Please Upload your John Snow Labs License using the button below


Saving spark_nlp_for_healthcare_spark_ocr_8539.json to spark_nlp_for_healthcare_spark_ocr_8539.json


In [None]:
from johnsnowlabs import nlp, medical
# After uploading your license, run this to install all licensed Python wheels and pre-download the jars for the Spark session JVM

nlp.install()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8539.json
🚨 Outdated Medical Secrets in license file. Version=5.1.3 but should be Version=5.1.0
🚨 Outdated OCR Secrets in license file. Version=5.0.2 but should be Version=5.0.1
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
📋 Stored John Snow Labs License in /root/.johnsnowlabs/licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json
👷 Setting up  John Snow Labs home in /root/.johnsnowlabs, this might take a few minutes.
Downloading 🐍+🚀 Python Library spark_nlp-5.1.0-py2.py3-none-any.whl
Downloading 🐍+💊 Python Library spark_nlp_jsl-5.1.0-py3-none-any.whl
Downloading 🫘+🚀 Java Library spark-nlp-assembly-5.1.0.jar
Downloading 🫘+💊 Java Library spark-nlp-jsl-5.1.0.jar
🙆 JSL Home setup in /root/.johnsnowlabs
👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8539.json
👷 Trying to install compatible secrets. Use nlp.settings.enfor

In [None]:
from johnsnowlabs import nlp, medical

# Automatically load license data and start a session with all jars user has access to
spark = nlp.start()

👌 Detected license file /content/spark_nlp_for_healthcare_spark_ocr_8539.json
👷 Trying to install compatible secrets. Use nlp.settings.enforce_versions=False if you want to install outdated secrets.
👌 Launched cpu optimized session with with: 🚀Spark-NLP==5.1.0, 💊Spark-Healthcare==5.1.0, running on ⚡ PySpark==3.1.2


In [None]:
spark

In [None]:
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pyspark.sql as SQL
from pyspark import keyword_only

## **NER Pipeline**
We create a pipeline for the Zero-Shot NER model with the `documentAssembler`, `sentenceDetector`, `tokenizer`, `zero_shot_ner`, and `ner_converter` stages. Note that we don't use a separate embeddings model, because it is already included in `ZeroShotNerModel`.

We then create a dictionary that maps each label we want in the result to the questions used to detect that entity, and pass it to the model with `setEntityDefinitions`. In this example we want to detect `PROBLEM`, `DRUG`, `PATIENT_AGE`, `ADMISSION_DATE`, `DIAGNOSIS`, and `SYMPTOMS` entities.


In [None]:
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

zero_shot_ner = medical.ZeroShotNerModel.pretrained("zero_shot_ner_roberta", "en", "clinical/models")\
    .setEntityDefinitions(
        {
            "PROBLEM": ["What is the disease?", "What is his symptom?", "What is her disease?", "What is his disease?",
                        "What is the problem?" ,"What does a patient suffer", 'What was the reason that the patient is admitted to the clinic?'],
            "DRUG": ["Which drug?", "Which is the drug?", "What is the drug?", "Which drug does he use?", "Which drug does she use?", "Which drug do I use?", "Which drug is prescribed for a symptom?"],
            "ADMISSION_DATE": ["When did patient admitted to a clinic?"],
            "PATIENT_AGE": ["How old is the patient?", "What is the age of the patient?"],
            "DIAGNOSIS": ["What was the final diagnosis?" ,"What were the primary and secondary diagnoses?", "What is the suspected diagnosis?",
                           "What other diagnoses were considered?"],
            "SYMPTOMS" : ["What are the patient's symptoms?", "What are the presenting symptoms?", "What other symptoms are present?"]

        })\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("zero_shot_ner")\
    .setPredictionThreshold(0.1) # default 0.01

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "zero_shot_ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    zero_shot_ner,
    ner_converter])

ner_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

zero_shot_ner_roberta download started this may take some time.
Approximate size to download 438.9 MB
[OK!]


In [None]:
zero_shot_ner.getClasses()

['PATIENT_AGE', 'PROBLEM', 'ADMISSION_DATE', 'DRUG', 'DIAGNOSIS', 'SYMPTOMS']

In [None]:
zero_shot_ner.extractParamMap()

{Param(parent='ZeroShotNerModel_6958705ca1c8', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='ZeroShotNerModel_6958705ca1c8', name='batchSize', doc='Size of every batch'): 8,
 Param(parent='ZeroShotNerModel_6958705ca1c8', name='maxSentenceLength', doc='Max sentence length to process'): 512,
 Param(parent='ZeroShotNerModel_6958705ca1c8', name='caseSensitive', doc='whether to ignore case in tokens for embeddings matching'): True,
 Param(parent='ZeroShotNerModel_6958705ca1c8', name='predictionThreshold', doc='Minimal confidence score to encode an entity (default is 0.1)'): 0.1,
 Param(parent='ZeroShotNerModel_6958705ca1c8', name='ignoreEntities', doc='List of entities to ignore'): [],
 Param(parent='ZeroShotNerModel_6958705ca1c8', name='engine', doc='Deep Learning engine used for this model'): 'tensorflow',
 Param(parent='ZeroShotNerModel_6958705ca1c8', name='inputCols', doc='previous annotations columns, if renamed'): ['

In [None]:
zero_shot_ner.getPredictionThreshold()

0.1

In [None]:
text_list = ["The doctor pescribed Majezik for my severe headache.",
             "The patient was admitted to the hospital for his colon cancer.",
             "27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis."
            ]

data = spark.createDataFrame(text_list, nlp.StringType()).toDF("text")

results = ner_model.transform(data)

In [None]:
results.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|       zero_shot_ner|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|The doctor pescri...|[{document, 0, 51...|[{document, 0, 51...|[{token, 0, 2, Th...|[{named_entity, 0...|[{chunk, 21, 27, ...|
|The patient was a...|[{document, 0, 61...|[{document, 0, 61...|[{token, 0, 2, Th...|[{named_entity, 0...|[{chunk, 49, 60, ...|
|27 years old pati...|[{document, 0, 11...|[{document, 0, 11...|[{token, 0, 1, 27...|[{named_entity, 0...|[{chunk, 0, 11, 2...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



Checking the NER model results.

In [None]:
results\
    .selectExpr("explode(zero_shot_ner) AS entity")\
    .select(
        "entity.metadata.word",
        "entity.result",
        "entity.metadata.sentence",
        "entity.begin",
        "entity.end",
        "entity.metadata.confidence",
        "entity.metadata.question")\
    .show(100, truncate=False)

+-------------+----------------+--------+-----+---+----------+---------------------------------------------------------------+
|word         |result          |sentence|begin|end|confidence|question                                                       |
+-------------+----------------+--------+-----+---+----------+---------------------------------------------------------------+
|The          |O               |0       |0    |2  |null      |null                                                           |
|doctor       |O               |0       |4    |9  |null      |null                                                           |
|pescribed    |O               |0       |11   |19 |null      |null                                                           |
|Majezik      |B-DRUG          |0       |21   |27 |0.64671576|Which drug is prescribed for a symptom?                        |
|for          |O               |0       |29   |31 |null      |null                                             

In [None]:
results.select(F.explode(F.arrays_zip(results.token.result,
                                      results.zero_shot_ner.result,
                                      results.zero_shot_ner.metadata,
                                      results.zero_shot_ner.begin,
                                      results.zero_shot_ner.end)).alias("cols"))\
       .select(F.expr("cols['0']").alias("token"),
               F.expr("cols['1']").alias("ner_label"),
               F.expr("cols['2']['sentence']").alias("sentence"),
               F.expr("cols['3']").alias("begin"),
               F.expr("cols['4']").alias("end"),
               F.expr("cols['2']['confidence']").alias("confidence")).show(50, truncate=100)

+-------------+----------------+--------+-----+---+----------+
|        token|       ner_label|sentence|begin|end|confidence|
+-------------+----------------+--------+-----+---+----------+
|          The|               O|       0|    0|  2|      null|
|       doctor|               O|       0|    4|  9|      null|
|    pescribed|               O|       0|   11| 19|      null|
|      Majezik|          B-DRUG|       0|   21| 27|0.64671576|
|          for|               O|       0|   29| 31|      null|
|           my|               O|       0|   33| 34|      null|
|       severe|     B-DIAGNOSIS|       0|   36| 41| 0.5963669|
|     headache|     I-DIAGNOSIS|       0|   43| 50| 0.5963669|
|            .|               O|       0|   51| 51|      null|
|          The|               O|       0|    0|  2|      null|
|      patient|               O|       0|    4| 10|      null|
|          was|               O|       0|   12| 14|      null|
|     admitted|               O|       0|   16| 23|    

Checking the NER chunks.

In [None]:
results.selectExpr("explode(ner_chunk)").show(100, truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                                                      |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 21, 27, Majezik, {chunk -> 0, confidence -> 0.64671576, ner_source -> ner_chunk, entity -> DRUG, sentence -> 0}, []}                                             |
|{chunk, 36, 50, severe headache, {chunk -> 1, confidence -> 0.5963669, ner_source -> ner_chunk, entity -> DIAGNOSIS, sentence -> 0}, []}                                 |
|{chunk, 49, 60, colon cancer, {chunk -> 0, confidence -> 0.8898498, ner_source -> ner_chunk, entity -> PROBLEM, sentence -> 0}, []}        

In [None]:
results.select(F.explode(F.arrays_zip(results.ner_chunk.result,
                                      results.ner_chunk.metadata)).alias("cols"))\
       .select(F.expr("cols['0']").alias("chunk"),
               F.expr("cols['1']['entity']").alias("ner_label"),
               F.expr("cols['1']['confidence']").alias("confidence")).show(50, truncate=100)

+------------------------------------------------+--------------+----------+
|                                           chunk|     ner_label|confidence|
+------------------------------------------------+--------------+----------+
|                                         Majezik|          DRUG|0.64671576|
|                                 severe headache|     DIAGNOSIS| 0.5963669|
|                                    colon cancer|       PROBLEM| 0.8898498|
|                                    27 years old|   PATIENT_AGE| 0.6943085|
|                                         Sep 1st|ADMISSION_DATE|0.95646095|
|a right-sided pleural effusion for thoracentesis|       PROBLEM|0.50026613|
+------------------------------------------------+--------------+----------+
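
The step from per-token `B-`/`I-` tags to chunks can be sketched in plain Python. This is an illustration of the general BIO-merging idea, not the `NerConverterInternal` implementation; `merge_bio` is a hypothetical helper. The token offsets and tags are copied from the first sentence's output above, and the sketch assumes `begin`/`end` are inclusive character offsets, which is consistent with those values.

```python
from typing import List, Tuple

# One entry per token: (begin, end_inclusive, BIO tag)
Token = Tuple[int, int, str]


def merge_bio(text: str, tokens: List[Token]) -> List[dict]:
    """Group B-/I- tagged tokens into chunks with inclusive character offsets."""
    chunks: List[dict] = []
    current = None
    for begin, end, tag in tokens:
        prefix, _, label = tag.partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current["entity"] != label)
        )
        if starts_new:
            # Open a new chunk (a stray I- tag is treated like B-)
            current = {"begin": begin, "end": end, "entity": label}
            chunks.append(current)
        elif prefix == "I":
            current["end"] = end  # extend the open chunk
        else:
            current = None  # an "O" tag closes any open chunk
    for chunk in chunks:
        # Inclusive end offset, hence the +1 when slicing
        chunk["chunk"] = text[chunk["begin"]: chunk["end"] + 1]
    return chunks


text = "The doctor pescribed Majezik for my severe headache."
tokens = [
    (0, 2, "O"), (4, 9, "O"), (11, 19, "O"), (21, 27, "B-DRUG"),
    (29, 31, "O"), (33, 34, "O"), (36, 41, "B-DIAGNOSIS"),
    (43, 50, "I-DIAGNOSIS"), (51, 51, "O"),
]
chunks = merge_bio(text, tokens)
for chunk in chunks:
    print(chunk["chunk"], chunk["entity"], chunk["begin"], chunk["end"])
```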



## LightPipelines

In [None]:
import pandas as pd

# Run fullAnnotate through a LightPipeline on the last sample sentence
print(text_list[-1], "\n")

light_model = nlp.LightPipeline(ner_model)
light_result = light_model.fullAnnotate(text_list[-1])

# Collect one row per detected chunk
rows = [
    {
        "chunks": n.result,
        "begin": n.begin,
        "end": n.end,
        "sentence_id": n.metadata["sentence"],
        "entities": n.metadata["entity"],
    }
    for n in light_result[0]["ner_chunk"]
]

df_clinical = pd.DataFrame(rows, columns=["chunks", "begin", "end", "sentence_id", "entities"])

df_clinical.head(20)

27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis. 



|   | chunks | begin | end | sentence_id | entities |
|---|--------|-------|-----|-------------|----------|
| 0 | 27 years old | 0 | 11 | 0 | PATIENT_AGE |
| 1 | Sep 1st | 47 | 53 | 0 | ADMISSION_DATE |
| 2 | a right-sided pleural effusion for thoracentesis | 68 | 115 | 0 | PROBLEM |


In [None]:
light_result[0]

{'zero_shot_ner': [Annotation(named_entity, 0, 1, B-PATIENT_AGE, {'sentence': '0', 'word': '27', 'confidence': '0.6943085', 'question': 'How old is the patient?'}, []),
  Annotation(named_entity, 3, 7, I-PATIENT_AGE, {'sentence': '0', 'word': 'years', 'confidence': '0.6943085', 'question': 'How old is the patient?'}, []),
  Annotation(named_entity, 9, 11, I-PATIENT_AGE, {'sentence': '0', 'word': 'old', 'confidence': '0.6943085', 'question': 'How old is the patient?'}, []),
  Annotation(named_entity, 13, 19, O, {'sentence': '0', 'word': 'patient'}, []),
  Annotation(named_entity, 21, 23, O, {'sentence': '0', 'word': 'was'}, []),
  Annotation(named_entity, 25, 32, O, {'sentence': '0', 'word': 'admitted'}, []),
  Annotation(named_entity, 34, 35, O, {'sentence': '0', 'word': 'to'}, []),
  Annotation(named_entity, 37, 42, O, {'sentence': '0', 'word': 'clinic'}, []),
  Annotation(named_entity, 44, 45, O, {'sentence': '0', 'word': 'on'}, []),
  Annotation(named_entity, 47, 49, B-ADMISSION_DAT
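
Note that the metadata values in the output above, including `confidence`, are strings. If you want to post-filter results by confidence, they need to be converted first. The sketch below shows one way to do that, assuming the `ner_chunk` annotations carry `confidence` in their metadata in the same form. `Ann` is a hypothetical stand-in that only mirrors the annotation fields used here, and `confident_chunks` is an illustrative helper, not part of the library. The sample values are copied from the tables earlier in this notebook.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Ann:
    """Hypothetical stand-in mirroring the annotation fields used in this sketch."""
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def confident_chunks(chunks: List[Ann], min_confidence: float) -> List[Ann]:
    """Keep chunks whose metadata confidence (stored as a string) meets the cutoff."""
    kept = []
    for ann in chunks:
        raw = ann.metadata.get("confidence")
        if raw is not None and float(raw) >= min_confidence:
            kept.append(ann)
    return kept


sample = [
    Ann(0, 11, "27 years old", {"entity": "PATIENT_AGE", "confidence": "0.6943085"}),
    Ann(47, 53, "Sep 1st", {"entity": "ADMISSION_DATE", "confidence": "0.95646095"}),
    Ann(68, 115, "a right-sided pleural effusion for thoracentesis",
        {"entity": "PROBLEM", "confidence": "0.50026613"}),
]
for ann in confident_chunks(sample, min_confidence=0.6):
    print(ann.result, ann.metadata["entity"], ann.metadata["confidence"])
```

With a real pipeline you would pass `light_result[0]['ner_chunk']` instead of `sample`. Filtering after the fact is not the same as raising `setPredictionThreshold`, since the model may pick different spans under a different threshold.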

### NER Visualizer

In [None]:
visualiser = nlp.viz.NerVisualizer()

for i in text_list:

    light_result = light_model.fullAnnotate(i)
    visualiser.display(light_result[0], label_col='ner_chunk', document_col='document')

    # Change color of an entity label
    # visualiser.set_label_colors({'PROBLEM':'#008080', 'DRUG':'#800080', 'PATIENT_AGE':'#808080'})
    # visualiser.display(light_result[0], label_col='ner_chunk')


    # Set label filter
    # visualiser.display(light_result[0], label_col='ner_chunk', document_col='document',labels=['PROBLEM'])

## Save the Model

Finally, we save the fitted Zero-Shot NER pipeline. The entity definitions are stored with the model, so it can be reused later without redefining them, and it keeps the same labels we defined above.

In [None]:
# save model

ner_model.write().overwrite().save("ner_model")